Posted by meander_water 1 day ago
Given the size of frontier models, I would assume that they can incorporate many specializations, and that the most lasting thing here is the training environment.
But there is probably already some tradeoff, as GPT 3.5 was awesome at chess and current models don't seem trained extensively on chess anymore.
Right now, I believe we're seeing that the big general-purpose models outperform approximately everything else. Special-purpose models (essentially: fine-tunes) of smaller models make sense when you want to solve a specific task at lower cost/lower latency, and you transfer some/most of the abilities in that domain from a bigger model to a smaller one. Usually, people don't do that, because it's quite a costly process, and the frontier models develop so rapidly that you're perpetually behind them (so in fact, you're not providing the best possible abilities).
If/when frontier model development speed slows down, training smaller models will make more sense.
You do not believe that this has already started? It seems to me that we’re well into a massive slowdown
In practice, I upgraded everything to GPT-5 and the performance was so terrible I had to rollback the update.
Depends on what you compare it to. For those of us who were using o3/o1 Pro Mode before GPT-5, the new model isn't that huge of a leap, at least not compared to coming from whatever existed before Pro Mode.
So even though you have high taxes and a restrictive alcohol policy, the end result is shops that have high customer satisfaction because they have very competent staff, excellent selection and a surprisingly good price for quality products.
The downsides are the limited opening hours and the absence of cheap low-quality wine - the tax disproportionately impacts the low-quality stuff; almost nobody will buy shitty wine at $7 per bottle when the decent stuff costs $10, so the shitty wine just doesn't get imported. But for most of the population these are minor drawbacks.
Wow, I am so curious, can you provide the source?
I am very interested in a chess benchmark for LLMs, as someone who occasionally plays chess. I have thought about creating something like this; it would be very interesting to find the best model at chess that isn't Stockfish/Leela but a general-purpose large language model.
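A minimal harness for this wouldn't be much code. Here's a rough sketch using the python-chess library, where llm_move() is a hypothetical stand-in for however you ask a model for its next move in SAN (as a placeholder it just plays randomly so the script runs as-is):

    import random
    import chess  # pip install python-chess

    def llm_move(board: chess.Board) -> str:
        # Hypothetical: replace with an API call asking your LLM for the next move in SAN.
        # Placeholder: pick a random legal move so the harness is runnable.
        return board.san(random.choice(list(board.legal_moves)))

    def play_game(max_plies: int = 200):
        board = chess.Board()
        illegal = 0
        while not board.is_game_over() and board.ply() < max_plies:
            try:
                board.push_san(llm_move(board))
            except ValueError:
                # Illegal or unparseable move: count it as a metric, then play a random legal move.
                illegal += 1
                board.push(random.choice(list(board.legal_moves)))
        return board.result(), illegal

    print(play_game())

From there a benchmark is just running many games per model and tracking results plus the illegal-move rate.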
I also agree that there might be an explosion of purpose-trained LLMs. I had this idea about a year ago, back in the Llama era / before DeepSeek: what if I want to write SvelteKit? There are models like DeepSeek that know about SvelteKit, but they are so big and bloated when all I want is a SvelteKit/Svelte model. Yes, there are arguments for why you might need the whole network to get better quality, but I genuinely feel that right now the quality advantage is debatable, thanks to all this benchmarkmaxxing. I would happily take a model trained on SvelteKit at preferably 4B-8B parameters, but even if an extremely good SOTA-ish SvelteKit model were around 30-40B, I would be happy, since I could buy a GPU for my PC to run it or run it on my Mac.
I think my brother, who actually knows what he's talking about in the AI space (unlike me), said the same thing to me a few months back as well.
In fact, it's funny: a few months ago, around that same talk about small LLMs, I had asked him to build a website comparing benchmarks of AIs playing chess, with an option to make two LLMs play against each other while we watch, or to play against an LLM ourselves on an actual chess board on the web, and more. He said it was a good idea but that he was busy at the time. I think he later forgot about it, and I had forgotten about it too until now.
Key memory unlocked. I had an Aha moment with this article, thanks a lot for sharing it, appreciate it.
As far as I remember, it's post-training that kills chess ability for some reason (GPT-3 wasn't post-trained).
That you can individually train and improve smaller segments as necessary
only uses 1/18th of the total parameters per token. It may still use a large fraction of them over the course of a single query.
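(Toy illustration of why both statements can be true, not the actual router: per token only a few experts fire, but over a whole query most of them end up getting touched. Expert count and top-k here are made up for the example.)

    import random

    NUM_EXPERTS = 128    # illustrative only, not the real expert count
    TOP_K = 8            # experts activated per token
    QUERY_TOKENS = 500   # tokens processed for one query

    used = set()
    for _ in range(QUERY_TOKENS):
        # toy stand-in for the learned router: each token activates TOP_K experts
        used.update(random.sample(range(NUM_EXPERTS), TOP_K))

    print(f"per token: {TOP_K}/{NUM_EXPERTS} experts, "
          f"over the query: {len(used)}/{NUM_EXPERTS}")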
To meet this challenge, we introduce Game-TARS: a next-generation generalist game agent designed to master complex video games and interactive digital environments using human-like perception, reasoning, and action. Unlike traditional game bots or modular AI frameworks, Game-TARS integrates all core faculties—visual perception, strategic reasoning, action grounding, and long-term memory—within a single, powerful vision-language model (VLM). This unified approach enables true end-to-end autonomous gameplay, allowing the agent to learn and succeed in any game without game-specific code, scripted behaviors, or manual rules.
With Game-TARS, this work is not about achieving the highest possible score in a single game. Instead, our focus is on building a robust foundation model for both generalist game-playing and broader computer use. We aim to create an agent that can learn to operate in any interactive digital environment it encounters, following instructions just like a human.
Domain-specific models have been on the roadmap for most companies for years now, from both a competitive perspective (why give up your moat to OpenAI or Anthropic) and a financial one (why finance OpenAI's margins).
So yeah, I think there are different levels of thinking; maybe future models will have some sort of internal models once they recognize patterns at some level of thinking. I'm not that knowledgeable about the internal workings of LLMs, so maybe this is all nonsense.
For sure it's probably missing stuff that a well-paid lawyer would catch, but for a project with zero budget it's a massive step up over spending hours reading through search results and trying to cobble something together myself.
Whereas with real legal advice, your lawyer will carry Professional Indemnity Insurance which will cover any costs incurred if they make a mistake when advising you.
As you say, it's a reasonable trade-off for you to have made when the alternative was sifting through the legislation in your own spare time. But it's not actually worth very much, and you might just as well have used a general model to carry out the same task and the outcome would likely have been much the same.
So it's not particularly clear that the benefits of these niche-specific models or specialised fine-tunes are worth the additional costs.
(Caveat: things might change in the future, especially if advancements in the general models really are beginning to plateau.)
Not too different from a lot of consulting reports, in fact, and pretty much of no value if you're actually trying to learn something.
Edit to add: even the name "deep research" to me feels like something designed to appeal to people who have never actually done or consumed research, sort of like the whole "PhD level" thing.
I ask a loaded "filter question" I more or less know the answer to, and mostly skip the prose and go straight to the links to its sources.
I wrote it back when AI web search was a paid feature and I wanted access to it.
At the time, Auto-GPT was popular, and it used the LLM itself to slowly and unreliably do the research.
So I realized a Python program would be way faster and it would actually be deterministic in terms of doing what you expect.
This experience sort of shaped my attitude about agentic stuff, where it looks like we are still relying too heavily on the LLM and neglecting to mechanize things that could just work perfectly every time.
My point was it's silly to rely on a slow, expensive, unreliable system to do things you can do quickly and reliably with ten lines of Python.
I saw this in the Auto-GPT days. They tried to make GPT-4 (the non-agentic one with the 8k context window) use tool calls to do a bunch of tasks. And it kept getting confused and forgetting to do stuff.
Whereas if you just had
for page in pages: summarize(page)
it works 100% of the time, can be parallelized etc.
And of course the best part is that the LLM itself can write that code, i.e. it already has the power to make up for its own weaknesses, and make (parts of itself) run deterministically.
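A minimal sketch of that split, assuming a hypothetical llm() helper wrapping whatever completion API you use: the fetching and looping are plain deterministic Python, and the model only does the one step it's actually needed for.

    from concurrent.futures import ThreadPoolExecutor
    import urllib.request

    def fetch(url: str) -> str:
        # deterministic: plain HTTP fetch, no model involved
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.read().decode("utf-8", errors="replace")

    def llm(prompt: str) -> str:
        # hypothetical helper wrapping whatever completion API you use
        raise NotImplementedError

    def summarize(url: str) -> str:
        return llm("Summarize this page in 3 bullet points:\n\n" + fetch(url)[:20000])

    def research(urls: list[str]) -> list[str]:
        # the loop itself is ordinary code: parallel, predictable, retryable
        with ThreadPoolExecutor(max_workers=8) as pool:
            return list(pool.map(summarize, urls))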
---
On that note, do you know more about the environment they ran this thing in? I got API access (it's free on OpenRouter), but I'm not sure what to plug this into. OpenRouter provides a search tool, but the paper mentions intelligent context compression and all sorts of things.
I use it dozens of times per day, and typically follow up or ask refining questions within the thread if it’s not giving me what I need.
It typically takes between 10sec and 5 minutes, and mostly replicates my manual process - search, review results, another 1..N search passes, review, etc. Initially it rephrases/refines my query, then builds a plan, and this looks a lot like what I might do manually.
Then I can further interrogate the information returned with a vanilla LLM.
Besides, I might give other large deep-research models a try when needed.
I once had the idea of using something like a Qwen 4B or some other small pre-trained model just to make the "to censor or not to censor" call, after the mecha-Hitler incidents. I thought that if there were some extremely cheap model that could detect harmful output that Grok's own models couldn't recognize, it would have been able to prevent the complete advertising disaster that happened.
What are your thoughts on it? I would love to see a Qwen 4B or something similar if you or anyone is up to the challenge, or any small LLMs in general. I just want to know whether this idea fundamentally makes sense or not.
Another idea was to use a small model for routing purposes, similar to what ChatGPT does. I'm not so sure about that one now, though I still think it may be worth it. I had the routing idea before ChatGPT implemented it, so now that it has, we'll get more data/insights about whether it's good or worth it, which is nice.
You don't really need an entire LLM to do this - lightweight encoder models like BERT are great at sentiment analysis. You feed it an arbitrary string of text, and it just returns a confidence value from 0.0 to 1.0 that it matches the characteristics you're looking for.
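For example, with the Hugging Face transformers library the whole thing is a few lines. The model shown is the stock sentiment classifier; for harmful-content filtering you'd swap in a classifier fine-tuned for that, this is just the shape of the API:

    from transformers import pipeline

    # small encoder model; swap the model id for a moderation/toxicity classifier as needed
    clf = pipeline("text-classification",
                   model="distilbert-base-uncased-finetuned-sst-2-english")

    result = clf("This update is an absolute disaster.")[0]
    print(result)  # e.g. {'label': 'NEGATIVE', 'score': 0.99}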
function replaceInTextNodes(node) {
  if (node.nodeType === Node.TEXT_NODE) {
    // replace non-breaking spaces (U+00A0) with regular spaces
    node.nodeValue = node.nodeValue.replace(/\u00A0/g, ' ');
  } else {
    node.childNodes.forEach(replaceInTextNodes);
  }
}

replaceInTextNodes(document.body);
The script is great!
Constraints are the fun part here. I know this isn't the 8x Blackwell Lamborghini, that's the point. :)
If you really do have a 2080ti with 128gb of VRAM, we'd love to hear more about how you did it!
It's slower than a rented Nvidia GPU, but usable for all the models I've tried (even gpt-oss-120b), and works well in a coffee shop on battery and with no internet connection.
I use Ollama to run the models, so can't run the latest until they are ported to the Ollama library. But I don't have much time for tinkering anyway, so I don't mind the publishing delay.
This comfortably fits FP8 quantized 30B models that seem to be "top of the line for hobbyists" grade across the board.
- Ryzen 9 9950X
- MSI MPG X670E Carbon
- 96GB RAM
- 2x RTX 3090 (24GB VRAM each)
- 1600W PSU
Of course this is in a single-user environment, with vLLM keeping the model warm.
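Back-of-envelope for why an FP8 30B model fits comfortably on a rig like that (numbers are rough; KV cache depends on context length and architecture):

    params = 30e9
    bytes_per_param = 1                           # FP8 ~ 1 byte per weight
    weights_gb = params * bytes_per_param / 1e9   # ~30 GB of weights

    total_vram_gb = 2 * 24                        # 2x RTX 3090
    headroom_gb = total_vram_gb - weights_gb      # ~18 GB left for KV cache + activations
    print(weights_gb, headroom_gb)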
I think you mean RAM, not VRAM. AFAIK this is a 30B MoE model with 3B active parameters, comparable to the Qwen3 MoE model. If you don't expect 60 tps, such models should run sufficiently fast.
I run the Qwen3 MoE model (https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF/blob/main/...) in 4-bit quantization on an 11-year-old i5-6600 (32GB) and a Radeon 6600 with 8GB. According to a quick search, your card is faster than that, and I get ~12 tps with 16k context on llama.cpp, which is OK for playing around.
My Radeon (ROCm) specific batch file to start this:
llama-server --ctx-size 16384 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --device ROCm0 -ngl -1 --model /usr/local/share/gguf/Qwen3-30B-A3B-Q4_0.gguf --cache-ram 16384 --cpu-moe --numa distribute --override-tensor "\.ffn_.*_exps\.weight=CPU" --jinja --temp 0.7 --port 8080
This can end up getting you 128gb of VRAM for under $1000.
get the biggest one that will fit in your vram.
(If nothing else Tongyi are currently winning AI with cutest logo)
The Chinese version of the link says "通义 DeepResearch" (Tongyi) in the title, so the "agree" (同意) reading doesn't look like it's the case. Completely agreed that it would be hilarious.
1: https://www.alibabacloud.com/en/solutions/generative-ai/qwen...
The pattern is effectively long-running research tasks that drive a search tool. You give them a prompt, they churn away for 5-10 minutes running searches and they output a report (with "citations") at the end.
This Tongyi model has been fine-tuned to be really good at using its search tool in a loop to produce a report.
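Roughly the shape of that loop, sketched with hypothetical llm() and web_search() helpers; the fine-tuned model decides what to search next and when to stop, and the context compression and citation handling are where the real work is:

    def llm(prompt: str) -> str:
        raise NotImplementedError    # hypothetical: one call to the fine-tuned model

    def web_search(query: str) -> str:
        raise NotImplementedError    # hypothetical: the search tool the model drives

    def deep_research(question: str, max_rounds: int = 10) -> str:
        notes = []
        for _ in range(max_rounds):
            query = llm(f"Question: {question}\nNotes so far:\n{notes}\n"
                        "Reply with the next search query, or NONE if you have enough.")
            if query.strip() == "NONE":
                break
            notes.append(web_search(query))
        return llm(f"Write a report with citations answering: {question}\nSources:\n{notes}")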
So without specifying which model is being used, it's really hard to know whether one system is better than another: we don't know what the underlying model is, or whether it's better because of the model itself or because of the tooling, which feels like an important distinction.
https://openrouter.ai/alibaba/tongyi-deepresearch-30b-a3b
https://openrouter.ai/alibaba/tongyi-deepresearch-30b-a3b:fr...