You might want millions of geniuses in a data center, but perhaps you can only afford one because you haven't built out enough compute. That might sound ridiculous to critics of the current data center build-out, but it doesn't seem impossible to me.
I also asked Perplexity to give me a report of the most notable arXiv papers. This one was at the top of the list:
"The most consequential intellectual development on arXiv is Sara Hooker's "On the Slow Death of Scaling," which systematically dismantles the decade-long consensus that computational scale drives progress. Hooker demonstrates that smaller models—Llama-3 8B and Aya 23 8B—now routinely outperform models with orders of magnitude more parameters, such as Falcon 180B and BLOOM 176B. This inversion suggests that the future of AI development will be determined not by raw compute, but by algorithmic innovations: instruction finetuning, model distillation, chain-of-thought reasoning, preference training, and retrieval-augmented generation. The implications are profound—progress is no longer the exclusive domain of well-capitalized labs, and academia can meaningfully compete again."
I do broadly agree that smaller, better-tuned models are likely to be the future, if only because the economics of the large models seem somewhat suspect right now, and because the ability to run models on cheaper hardware is likely to expand their usability and the use cases they can profitably address.
- Running a 1500 W US microwave for 10 seconds: 15,000 joules
- Llama 3.1 405B generating a text response: on average 6,706 joules per response
- Stable Diffusion 3 Medium generating a 1024 x 1024 pixel image w/ 50 diffusion steps: about 4,402 joules
[1] - MIT Technology Review, 2025-05-20 https://www.technologyreview.com/2025/05/20/1116327/ai-energ...
Couldn't find any more up-to-date number; everyone just keeps repeating that 0.0003 kWh figure from 2009:
https://googleblog.blogspot.com/2009/01/powering-google-sear...
Though once the LLM has to engage a hypothetical "Google search" or "web search" tool to supplement its own internal knowledge, I think the efficiency obviously goes out the window. I suspect that Google is doing this every time you engage with Gemini in Search's AI Mode.
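To put rough numbers on that, a back-of-the-envelope sketch in Python; the per-search and per-response figures are the ones cited above, and the searches-per-answer count is purely an assumption:

```python
# Back-of-the-envelope energy comparison. The per-search and per-response
# figures come from the links above; SEARCHES_PER_ANSWER is a pure guess.

KWH_TO_JOULES = 3.6e6  # 1 kWh = 3.6 million joules

google_search_2009_j = 0.0003 * KWH_TO_JOULES  # ~1,080 J per search (2009 figure)
llm_response_j = 6_706                         # Llama 3.1 405B, per response
microwave_j = 1_500 * 10                       # 1500 W microwave for 10 s

SEARCHES_PER_ANSWER = 5  # assumption: search-tool calls per LLM answer

tool_augmented_j = llm_response_j + SEARCHES_PER_ANSWER * google_search_2009_j

print(f"plain search:      {google_search_2009_j:8,.0f} J")
print(f"LLM response:      {llm_response_j:8,.0f} J")
print(f"LLM + {SEARCHES_PER_ANSWER} searches: {tool_augmented_j:8,.0f} J")
print(f"microwave, 10 s:   {microwave_j:8,.0f} J")
```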
If we compare apples to apples, e.g. across Claude models, the larger Opus still happily outperforms its smaller counterparts.
I've also been increasingly curious about better metrics to objectively assess relative model progress. In addition to the decreasing ability of standardized benchmarks to identify meaningful differences in the real-world utility of output, it's getting harder to hold input variables constant for apples-to-apples comparison. Knowing which model scores higher on a composite of diverse benchmarks isn't useful without adjusting for GPU usage, energy, speed, cost, etc.
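To sketch what I mean, a resource-adjusted score could look something like this; every model name and figure below is a hypothetical placeholder, not a real measurement:

```python
# Sketch: normalize a composite benchmark score by resource usage.
# Every name and figure here is a hypothetical placeholder, not a measurement.

models = {
    # name: (composite_score, usd_per_million_output_tokens, joules_per_response)
    "big-model":   (88.0, 15.00, 6700),
    "small-model": (82.0,  0.60,  900),
}

for name, (score, usd_per_mtok, joules) in models.items():
    # Two naive efficiency views: points per dollar and points per kilojoule.
    print(f"{name:11s}  {score / usd_per_mtok:7.1f} pts/$ (per Mtok)"
          f"  {score / (joules / 1000):6.1f} pts/kJ")
```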
My problem with deep research tends to be that all it really does is search the internet, and most of the stuff it turns up is the half-baked garbage that gets repeated on every topic.
That’s a huge leap of logic.
The simpler explanation is that it has better searching functionality and performance.
The models are multi-lingual and can parse results from global websites just fine.
I think the existence of Wikipedia is a red herring; there's no historical inevitability that people will band together to curate a high-quality encyclopedia on every imaginable topic.
There might be similar, even broader/better efforts on the Chinese internet we (I) know nothing about.
It also might be that Chinese search engines are better than Google at finding high quality data.
But I reiterate: these search-based LLMs kinda suck in the West, because Google kinda sucks. Every use of deep research ended up with the model citing the same crap articles and data you could find on Google manually, but whereas I could tell the data was no good, the AI took it at face value.
Maybe that's a requirement from whoever funds them, probably public money.
The cost of LLMs is the infrastructure. Unless someone can buy/power/run compute more cheaply (Google w/ TPUs, locales with cheap electricity, etc.), there won't be a meaningful difference in costs.
Here's a short video on the subject:
Now they have to be lucky to be 6 months ahead of an open model with at most half the parameter count, trained on 1%-2% of the hardware US models are trained on.
I thought that OpenAI was doomed the moment Zuckerberg showed he was serious about commoditizing LLMs. Even if Llama wasn't the GPT killer, it showed that there was no secret formula and that OpenAI had no moat.
Eh. It's at least debatable. There is a moat in compute (this was openly stated at a meeting of AI tech CEOs in China recently). And a bit of a moat in architecture and know-how (OpenAI's gpt-oss is still best in class, and if rumours are to be believed, it was mostly trained on synthetic data, à la Phi-4 but with better data). And there are still moats around data (see the Gemini family, especially Gemini 3).
But if you can conjure up compute, data, and a basic arch, you get xAI, which is up there with the other three labs in SotA-like performance. So I'd say there are some moats, but they aren't as safe as they were thought to be in 2023, for sure.
Maybe there's a limit in training, and throwing more hardware at it yields very little improvement?
The HN obsession with Claude Code might be a bit biased by people trying to justify their expensive subscriptions to themselves.
However, Opus 4.5 is much faster and very high quality too, and that ends up mattering more in practice. I end up using it much more and paying a dear but worthwhile price for it.
PS: Despite what the benchmarks say, I find Gemini 3 Pro and Flash to be a step below Claude and GPT, although still great compared to the state of the art last year, and very fast and cheap. Gemini also seems to have a less AI-sounding writing style.
I am aware this is all quite vague and anecdotal, just my two cents.
I do think these kinds of opinions are valuable. Benchmarks are a useful reference, but they do give the illusion of certainty to something that is fundamentally much harder to measure and quite subjective.
I was especially impressed by 5.1-codex-max for a webapp, but that is of course where these models in general shine. Still, it was freakish: I'd never before had 15-20 iterations (with hundreds of lines added each time) where I did not have to correct anything.
https://boutell.dev/misc/qwen3-max-pelican.svg
I used Simon Willison's usual prompt.
It thought for over 2 minutes (free account). The commentary was even more glowing than the image.
It has a certain charm.
Whether that means anything, I dunno.
I gave one of the GPUs to my kid to play games on.
If you had more like 200 GB of RAM you might be able to run something like MiniMax M2.1 to get last-gen performance at something resembling usable speed - but it's still a far cry from Codex on high.
I guess you could technically run the huge leading open weight models using large disks as RAM and have close to the "same quality" but with "heat death of the universe" speeds.
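The speed penalty falls out of simple bandwidth math. A sketch, assuming a dense model (MoE would read less per token) and ballpark bandwidth figures:

```python
# Why disk-as-RAM is unusably slow: generating each token needs roughly one
# full pass over the weights (dense-model assumption; MoE reads less per token).

def tokens_per_second(model_size_gb: float, read_bandwidth_gb_s: float) -> float:
    return read_bandwidth_gb_s / model_size_gb

# A 405B-parameter model at 8 bits per weight is ~405 GB of weights.
# Ballpark bandwidths: fast NVMe SSD ~3.5 GB/s, DDR5 RAM ~60 GB/s.
print(f"SSD: {tokens_per_second(405, 3.5):.4f} tok/s")  # ~0.009 tok/s
print(f"RAM: {tokens_per_second(405, 60):.2f} tok/s")   # ~0.15 tok/s
```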
With 32 GB RAM:
qwen3-coder and GLM 4.7 Flash are both impressive 30B-parameter models.
Not on the level of GPT 5.2 Codex, but small enough to run locally (4-bit quantized, w/ 32 GB RAM) and quite capable.
It's just a matter of time, I think, until we get quite capable coding models that can run with even less RAM.
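The rough memory math for why a 30B model fits in 32 GB at 4-bit (a sketch; the 1.3x overhead factor is a loose assumption):

```python
# Rough sketch of weights-plus-overhead memory for a quantized model.
# The 1.3x overhead (KV cache, activations, runtime) is a loose assumption.

def model_memory_gb(params_billion: float, bits_per_weight: float,
                    overhead_factor: float = 1.3) -> float:
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead_factor / 1e9

print(f"{model_memory_gb(30, 4):.1f} GB")   # ~19.5 GB -> fits in 32 GB
print(f"{model_memory_gb(30, 16):.1f} GB")  # ~78 GB at 16-bit -> does not fit
```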
Current test version runs in 8 GB @ 60 tok/s. Lmk if you want to join our early tester group!
The best could be GLM 4.7 Flash, and I doubt it's close to what you want.
If remote models are OK you could have a look at MiniMax M2.1 (minimax.io), GLM from z.ai, or Qwen3 Coder. You should be able to use all of these with your local OpenAI-compatible app.
I was very impressed, but I racked up a big bill (for me, in the hundreds of dollars per month) because I insisted on using the Alibaba provider to get the highest context window size and token cache.
* https://lmarena.ai/leaderboard — crowd-sourced head-to-head battles between models using Elo ratings (sketched after this list)
* https://dashboard.safe.ai/ — CAIS' incredible dashboard (cited in OP)
* https://clocks.brianmoore.com/ — a visual comparison of how well models can draw a clock. A new clock is drawn every minute
* https://eqbench.com/ — emotional intelligence benchmarks for LLMs
* https://www.ocrarena.ai/battle — OCR battles, Elo-ranked
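For the Elo-based boards, the ranking mechanics are just the classic Elo update. A minimal sketch; the K-factor and the ratings used here are assumptions, not necessarily what those sites use:

```python
# Minimal Elo update, as used (in spirit) by head-to-head leaderboards.
# K=32 and the 1000/1100 ratings are assumptions, not the sites' real values.

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# An upset: the 1000-rated model beats the 1100-rated one and gains ~20 points.
print(elo_update(1000, 1100, a_won=True))  # -> (~1020.5, ~1079.5)
```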
So, how large is that new model?
In addition, there seem to be many different versions of Qwen3. E.g., here's the list from the ollama library: https://ollama.com/library/qwen3/tags
But these open weight models are tremendously valuable contributions regardless.
If you were pulling someone much weaker than you behind you in a race, they would be right on your heels, but also not really a threat. Unless they can figure out a more efficient way to run before you do.