Access to frontier AI will soon be limited by economic and security constraints

Posted by thoughtpeddler 10 hours ago

Access to frontier AI will soon be limited by economic and security constraints (writing.antonleicht.me)

174 points | 163 comments

sho 6 hours ago|

I am nowhere near as concerned by this as I was a year ago, when I was expecting the axe to fall at any moment before the Chinese labs achieved some sort of escape velocity. I now think it's too late, all the cats are out of all the bags, there's no moat except maybe a temporal one of a few months, the genie is out of the bottle.

There is no secret sauce the US labs have that the Chinese ones don't, or won't have soon enough. Deepseek 4 and Kimi 2.5 are not quite Claude 4.5/GPT5.5 but there's no fundamental principle missing - they are strong evidence that there's no real advantage the "frontier" labs possess that isn't related to scale, which they will gain in time (if they even need to). The RL post-training techniques that work are widely known and easily copied. All Deepseek is really lacking is data, which they're getting - and the harder Anthropic/the USG makes it to access Claude in China, the more of that precious data they'll get!

I used to sort of entertain the "fast take-off breakaway" scenario as being plausible but not really anymore. The only genuine moat the frontier labs have is their product take-up, which isn't nothing, far from it, but it's not some unbreakable technological wall. Too late guys - it might have been too late for quite some time.

gpt5 6 hours ago||

I wish it was true. I would gladly use a GPT 5.2 high model equivalent for coding (6 months old) if it was offered cheaper by Deepseek or Kimi. And I'm sure that's an extremely prevalent opinion among the millions of Claude and Codex users who are bothered by the costs.

However, they just don't perform that well in practice. That's the real issue. You can actually see it when you move away from open benchmarks. Deep seek 3.2 is 4% on Arc-AGI 2 [1], while GPT 5.2 high is 52% and GPT 5.5 pro high is 84.6%. That's the real reason why nobody is using these models for serious work. It's incredibly frustrating.

In addition, I already feel the pain of the model restrictions myself. I'll ask my Codex 5.5 agent to crawl a website - BOOM, cybersecurity warning on my account. I'll ask it to fix SSH on my local network - another warning. I'm worried about the day my account gets randomly banned and I cannot create a new one. OpenAI already asks you to perform full identification in order to eliminate these warnings - probably exactly for that - so that if they ban you, it's permanent.

[1] https://arcprize.org/leaderboard

usernametaken29 3 hours ago|||

I worked extensively on ARC AGI before and one thing is SURE as hell. OpenAI and Gemini in particular use this as marketing material. You can correlate the benchmark release with stock price increase. They feed synthetic datasets of ARC into their models to boost the numbers. There is no doubt in my mind Gemini is no better than DeepSeek other than being specifically fine tuned for ARC AGI. Heck, they even say so and they say they have paid annotations for ARC. Again, economic incentives. In terms of whether these models are actually better at the benchmarks, likely not. See ARC 3, where the gap is diminishingly small.

versteegen 1 hour ago|||

I've also worked extensively on ARC AGI 1/2, and I mainly agree. Marketing and training. Performance of LLMs on ARC is most importantly a function of training on grid/table-like data. It doesn't have to be specifically synthetic ARC data though. Training an LLM to be better at perceiving grid-like arrangements of data in a spatial way like an image, rather than just tabular, is hugely useful for things outside of ARC benchmarks, though it's a narrow skill. Hence, I'm sure they do it. I want them to do that. I believe the labs when they say they didn't train specifically for ARC-AGI 1/2 (where did Google say otherwise? I don't see it). But it does not mean the models are getting better at general purpose reasoning. They were already plenty good enough at that. You can describe ARC images in words and reason about it using a level of intelligence LLMs have had for years: they're designed to be easy! LLMs just couldn't reason about image-like grids very well.

gpt5 3 hours ago||||

ARC-AGI isn't perfect, but it helps demonstrate the gap. I'm sure all companies optimize their models for this benchmark given its dominance.

energy123 3 hours ago|||

Why do you think DeepSeek isn't also fine tuned on ARC AGI? Maybe they're more fine tuned on ARC AGI but still get worse scores. There's no way to know.

usernametaken29 2 hours ago||

My gut feeling is that ARC doesn’t play as big of a role in the Chinese model manufacturer landscape. It’s one byproduct but China is focusing on resource efficiency (for political reasons and low compute). So unlike OpenAI, poor performance on ARC doesn’t hurt as much if the model works well. OpenAI literally hinges on hype so the insane economic bets they make somehow pay off. If you have billions and the future of the company on the line, you ace the exam any way you can. We noticed early on that whenever some dataset of ARC was released, GPT would suddenly do well on the classes of problems in that dataset. But it just doesn’t generalise. They fine tune like crazy. I bet they fine tune for raspberry counting at this point. Again, for OpenAI the perception of moat is everything! Keep that in mind.

zozbot234 2 hours ago||

True, ARC is mostly an artificial "human-like AGI" benchmark that doesn't really reflect any plausible workload. Very different from things like Humanity's Last Exam that reflect real-world knowledge and are now getting closer and closer to saturation even with open models.

applfanboysbgon 4 hours ago||||

> Deep seek 3.2 is 4% on Arc-AGI 2

Why are you bringing up an outdated Chinese model from 6 months ago to compare to a US model from 6 months ago? The outdated Chinese model will have performance from ~12 months ago, obviously. But today's Chinese model DeepSeek 4 has performance not far from the US model 6 months ago; 46% compared to 52% from 5.2.

gpt5 3 hours ago||

Because Deepseek 4.0 is not yet there, but the jump isn't expected to be large. Kimi 2.5 is there and is also scoring low.

DCKing 3 hours ago|||

Deepseek V4 came out three weeks ago: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro

Kimi K2.5 has also been superseded by a finer tuned Kimi K2.6 three weeks ago. Moonshot's Kimi models appear to be the favored Chinese model, at least for coding, and not Deepseek V4. z.AI's GLM 5.1 is also worth mentioning as rather competent for coding, also released in April.

Those models too will not be beating US AI labs by your metrics (although for coding, Kimi K2.6 might beat the very uneven Gemini depending on the situation), but in your criticism at least consider the state of the art in your comparisons.

pjerem 3 hours ago|||

Hum, I've been using it [0] with my Ollama Cloud subscription for the last two weeks and I love it. Never reached the 5 hours usage limits of the $20 plan (on side projects) where I would reach it sometimes in ONE prompt with Opus.

[0]: https://ollama.com/library/deepseek-v4-pro

sho 5 hours ago||||

I 100% agree with you, but I've been convinced over the last year that it's a time and scale issue, not anything fundamental.

The Chinese models right now are in a weird spot. Compared to the frontiers, both their pre and post training is woeful - tiny, resource constrained in every dimension including human, slow. I'd compare it to OpenAI 5 years ago except I think even then OpenAI had way more!

But they "cheat" quite a lot in distillation and very benchmark-focussed RL and that's where you get this superficial quality in the leaderboards that doesn't match up when you go off-script. Arc is a great example in that it really belies an "inferior soul" at the heart of it all.

What gives me great hope though is that those same scaling laws that Altman and others have been hyping forever will absolutely kick in for the Chinese labs just as they did for the US ones, and I don't think anything can stop that process now. So they will catch up. It won't be tomorrow, but it's not going to be 10 years either. 3-5 would be my reasonably educated guess.

And the final risk, that China itself might try to restrict availability of the tsunami of GPU or other AI hardware it will inevitably produce - well, I just can't really imagine a country that has been configuring itself for the last 40 years as a single purpose export machine deciding that actually, no, it doesn't want to export something.

About the model restrictions - absolutely. I've been trying to do security research on my own software and the frontier models immediately get suspicious. I've been playing with the local ones much more this year basically because of this. They have deficiencies, for sure - they feel very "hollow" compared to the major labs. But I've talked to a lot of people, and the consensus is pretty clear - just a matter of time.

flir 3 hours ago||

Just an observation: constraints often result in creative solutions. I wouldn't be surprised if a smaller lab makes a big breakthrough because they have to.

ageitgey 5 hours ago||||

Have you tried the latest DeepSeek v4 Pro inside of the Claude Code harness? It's not listed in that site.

It definitely 'feels like' it is as good as Claude for many regular web app coding tasks (though I don't have real benchmarks). And it is comically cheap.

I'm not suggesting it is better than the latest Claude or codex models, but it seems 'good enough' for a lot of use cases in my limited real world testing.

PAndreew 3 hours ago|||

I'm starting to feel like a parrot, but people seem to forget that software engineering is actually a very narrow slice of the white collar pie. You don't need a mega-model which can reason about 100 000 lines of code when you want to create a nice PPT (which consumed literally hours of your life before) to impress your boss. SOTA models will probably be used for frontier research, complex coding tasks, large scale data analysis, etc. And the average Joe shall be able to buy a pre-configured box with a plug-and-play harness and run medium models air-gapped. Or use such models through cloud APIs dirt cheap if privacy is not a concern.

ageitgey 2 hours ago|||

On the same topic but from a slightly different angle - as SOTA models get more capable, the 'quality' and 'feel' of the experience they provide in each domain is heavily dependent on the reinforcement learning the vendor does for that specific domain. After all, many fields have 100 flavors of "good answers," but the model has to pick one answer.

Benchmarks are not very good at capturing this yet. But it could be the case that DeepSeek v4 Pro is 100% as good as Claude Opus 4.7 at scaffolding a basic Rails app, but absolutely terrible at creating a credible business plan that another businessperson would think is real. That's a made-up example, but you get the point.

The end result will be a lot of people arguing about which model is "better," but "better" depends heavily on the task and how that model was trained to interact with the user for that task. Two users may have very different qualitative experiences using the exact same model, despite the benchmarks.

zozbot234 3 hours ago|||

Creating a nice PPT is actually hard because it requires visual capabilities and so-called "computer use" (really, GUI use) of fiddly proprietary software. The nice thing about the coding case compared to a lot of disparate white-collar work is that it's all plain ASCII text. You can already ask a coding model to create a nice TeX/beamer slideshow (or whatever the Typst-based equivalent is) but whether your boss will be duly impressed by that is anyone's guess.

m_mueller 2 hours ago|||

Tangential, but in our opinion corporate PPTX automation is an unsolved problem, even with Claude for PowerPoint (and it's worse with everything else common out there). Its harness (a) is not tuned very well for corporate use and (b) even if it were, fails to manage the specific business knowledge within each org needed to create effective (i.e. audience tailored) presentations.

I've just written a blog post about this topic this week: https://octigen.com/blog/posts/2026-05-11-ai-presentation-ga...

nimonian 3 hours ago|||

This is a tangent but I'd also mention sli.dev -- slideshow-as-website is really great and fun to make with llms

omnimus 4 hours ago|||

Also so many developers I know use LLMs for one-shotting isolated problems, explainers, discussions and planning. For these even Kimi is pretty great.

I don't think every dev will be comfortable just releasing claude on their project.

energy123 3 hours ago||||

They're not even that much cheaper (1/2 price per task according to Artificial Analysis) once you account for lower token usage of GPT-5.5. I can't justify it when factoring in the extra time wasted, and the cheap codex usage I get through the monthly plan. Frontier intelligence is not a commodity product ... yet.

irthomasthomas 2 hours ago||||

Arc has no predictive power whatsoever. I always use the best models available. So far I haven't found a task that Chinese models cannot solve very quickly and reasonably. Do you have any examples where they failed for you?

otabdeveloper4 5 hours ago|||

And yet Claude six months ago was amazing and good enough for you.

This shows that AI cloud consumption is just a conspicuous consumption status symbol, nobody knows why they need cloud AI or what problem they are even solving.

yorwba 2 hours ago|||

All of the reasons in the article also apply to Chinese companies. If a Chinese model becomes good enough to make it significantly easier to hack Chinese government servers, do you think they'll allow random people unfettered access to it?

The economic pressures are the same, too. Currently, Chinese models are offered for cheap or in some cases provide weights for free because that's the only way to gain traction. (That closed-weight releases by Baidu, Bytedance, iFlyTek etc. hardly generate any buzz bears that out, as does the fact that when Alibaba does a closed-weight release, someone always gets confused because they associate the Qwen brand with open models.) At some point, their investors are going to want profits, not just user counts. That means higher prices, or no more new models.

If there's no secret sauce and all you need is scale, that would actually be kind of the worst-case scenario for catching up to the frontier, since scaling is expensive and the frontier model companies have easier access to capital as well as higher revenues.

zozbot234 2 hours ago||

> If a Chinese model becomes good enough to make it significantly easier to hack Chinese government servers, do you think they'll allow random people unfettered access to it?

They aren't trying to become that good, nor do they need to in order to have real positive impact. Models like Mythos are estimated to be humongous even on a datacenter-wide scale, which is actually a big factor in its limited availability at present. It's mostly helpful as a one-of-a-kind proof of concept, to answer the question of whether AI can still plausibly scale by growing capabilities and what happens to alignment concerns when you do that.

yorwba 1 hour ago||

I expect every company to try to make a model as good as they possibly can, especially now that Mythos has served as a proof of concept to demonstrate that there's lots of interest in AI for cybersecurity. But if they don't try, that hardly assuages concerns about not being able to access the very best models, does it?

hbarka 5 hours ago|||

Harness engineering is a moat. There’s user loyalty and reliance on the chassis that Claude is on, for example, just like there’s more market share by MacOS+WindowsOS over Linux Open Source.

kasey_junk 3 hours ago|||

I regularly switch between codex and Claude in the same sessions. I’d throw in other models if I could.

Data governance and enterprise sales is a moat. The harnesses aren’t.

ElFitz 4 hours ago||||

I thought so too.

But 1) people use other models with that same harness. 2) I moved on from Claude Code and all the features I cared for up and running in less than a couple days. Without even looking for available plugins or extensions.

thepasch 3 hours ago||||

> Harness engineering is a moat.

I mean, if that’s the case, then Anthropic themselves are currently actively filling in that moat with nice, solid, walkable dirt. Claude Code may have been a moat 6 months ago but these days you’ll want to replace the “m” with a “bl”.

PunchyHamster 2 hours ago|||

The tooling industry has been very much moving in the direction of "plug the AI of your choosing" for a while now, and given how much Anthropic fights 3rd-party tools, they are definitely afraid of being left in the dust.

> just like there’s more market share by MacOS+WindowsOS over Linux Open Source.

It's hard to change OS. It's not hard to jump from one AI tool to another

BrtByte 6 hours ago|||

I agree the genie is out of the bottle technologically. I'm less convinced that means access stops being politically and economically important. The bottle may be gone but the best lamps are still expensive

trollbridge 6 hours ago|||

But a “good enough” lamp just got a lot cheaper. The cost of tokens on DeepSeek V4 Pro is so low I don’t even think about it, and I am currently trying to figure out useful things to do with as many simultaneously running agents as I can. What would have cost $150 less than a year ago now costs 35¢.

Likewise Qwen 3.6 absolutely blows me away and that’s on a 35b 6-bit model on a local 5090. Same thing, busy trying to find stuff to do to keep it busy 24/7.

I can still find some niches for Opus 4.7 but being able to attack problems and not worry about consumption is a game changer.

jorvi 6 hours ago|||

Virtually no one is going to pay for the best performing lamp if the next best lamp does 90% as good for an order of magnitude cheaper.

I will say, as pointed out by others, DeepSeek and other Chinese providers still lack a bit in the tooling that Claude has, but they'll get there.

Paradigma11 2 hours ago|||

That presumes that there is a linear scale that measures performance. This can be tested: https://en.wikipedia.org/wiki/Rasch_model

Even assuming this holds, what utility you gain from the best models depends completely on your workload. If you have tasks that require performance 10 and DeepSeek has 9, you will gladly pay for SotA models.
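
For readers who don't want to click through: a minimal sketch of the Rasch model from the link above. The model places a model's ability and a task's difficulty on one shared scale and maps their difference to a success probability through a logistic function. The function name `rasch_p` and the numbers are made up for illustration, and treating the "9 vs 10" from the comment as logit-scale values is an assumption, not something the comment states.

```python
import math

def rasch_p(ability: float, difficulty: float) -> float:
    """Rasch model: P(success) = 1 / (1 + exp(-(ability - difficulty)))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical values: a task of difficulty 10, attempted by models of ability 9 and 10.
task_difficulty = 10.0
for label, ability in [("ability 9", 9.0), ("ability 10", 10.0)]:
    print(label, round(rasch_p(ability, task_difficulty), 3))
```

Under these assumed numbers, a one-point gap in ability roughly halves the success rate on a task at the top of the scale, which is the sense in which "90% as good" can still be a large difference for a given workload.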

baq 4 hours ago||||

And yet it seems that 90% are happily paying for the marginal 10% capability and saturate datacenters.

lmm 3 hours ago|||

Happy to pay for? Or happy to spend other people's money on?

lugu 3 hours ago|||

That is called marketing.

baq 1 hour ago||

not necessarily. it might just as well be 'time is money'.

BrtByte 6 hours ago|||

If the second-best lamp is 90% as good and 10x cheaper, most people will use the second-best lamp...

avazhi 6 hours ago|||

That’s what he said?

nojs 3 hours ago|||

What about access to GPUs and memory? This is becoming a pretty major bottleneck.

repelsteeltje 2 hours ago|||

Today's tech echoes 1960-1970 mainframe era: very centralized around a handful of companies controlling "massive cloud compute" in bespoke mainframe-like topology.

All of that will be legacy in a couple of years. Today's B200 clusters are tomorrow's e-waste. Decentralization might happen gradually or abruptly. But to me it's obvious that we'll be thinking of high-tech tensor processors and GPUs the way we thought of individual transistors and tube amplifiers in the 1980s.

If AI turns out to be the revolution it purports to be, then the underlying hardware will change much more rapidly than it did with ICs and microprocessors in the late 1970s. Today's hot is tomorrow's junk.

aurareturn 1 hour ago|||

One thing that is potentially different this time is that Moore's Law has stopped scaling. Computers aren't getting smaller exponentially. They're getting bigger with multiple chips glued together to make up for Moore's Law.

repelsteeltje 41 minutes ago||

...But there's a new world dawning for photonic chips.

No reason to expect Moore's observation to apply there (though, maybe?), but it will have big implications for power usage.

aurareturn 12 minutes ago||

Photonic chips allow computers to get bigger, not smaller.

zozbot234 2 hours ago|||

> Today's B200 clusters are tomorrow's e-waste.

Hardware depreciation timescales are actually getting longer, not shorter, because frontier hardware like B200 clusters is highly bottlenecked. It's not just a RAMpocalypse out there, we're seeing early signs of production bottlenecks with GPUs and maybe even CPUs.

wokkel 2 hours ago||||

It's basically converted sand. Most of that conversion happens in Taiwan at the moment, which China considers one of its provinces and the USA treats as a protectorate. Hence the interest in that region....

asdff 3 hours ago|||

Everyone is expecting them to invade Taiwan, but why not merely extort Taiwan?

littleparrot 2 hours ago||

You mean that by contributing to the RAMpocalypse, the mainland incentivizes the West to build its own fabs, making Taiwan expendable for us someday?

zozbot234 2 hours ago||

Mainland China is growing its own RAM manufacturing capacity. They are too tiny to make a real dent into the RAMpocalypse yet but this can potentially change.

ElFitz 4 hours ago|||

> The only genuine moat the frontier labs have is their product take-up

And even then, there is no stickiness. For most use cases there isn’t much value in one frontier model over the other.

Just have to look at the people flocking from one to the other for whatever reason.

baq 4 hours ago|||

I’m flocking from GPT to opus every week for the past 3 months and always come back.

The point isn’t that gpt is better, it’s that it is so much better for my work it isn’t even sticky, it’s reinforced concrete. I use opus 1% of the time because it writes better and it’s sticky there.

Yes I’ll switch approximately immediately if opus or Gemini (which I use more than opus!) is better for what I do, but at this point frontier model tokens are not fungible.

ElFitz 4 hours ago||

There will always be dataset and training quirks, and the provider’s own biases and focus, granting one model an edge over the others in some specific domain.

baq 3 hours ago||

Yup and that’s where the moats are.

dotancohen 4 hours ago|||

The large AI houses arguably ensure that model switching be a natural action for their clients, by switching the default model of their flagship offerings every few months. Such is the price of progress.

scotty79 2 hours ago|||

> There is no secret sauce the US labs have that the Chinese ones don't, or won't have soon enough.

Over the last year it seems that the only thing US labs are ahead in is money spent. At least half of the technical innovations, if not more, came from Chinese labs and were published openly.

shevy-java 6 hours ago||

> There is no secret sauce the US labs have that the Chinese ones don't, or won't have soon enough

This is not just about mainland China though. The current US government is extremely selfish and self-centered. Other countries really need to consider their own long-term situation here.

terrib1e 7 hours ago||

No mention of open weights anywhere in the piece, which is weird. Qwen, Llama, DeepSeek are months behind frontier, not years. If you're a European startup worried about getting cut off from Anthropic's API in 2027, the real question is what the open-weight frontier looks like then. Probably pretty capable. That undercuts most of the doom scenario.

Also, he concedes Mythos-level capabilities will be cheap next year, then handwaves it with "you need the best AI, not good-enough AI." For most use cases, frontier minus six months is fine.

BrtByte 6 hours ago||

Open weights undercut the absolute cutoff scenario. They don't fully solve the question of who gets the best model first, who gets enough tokens to use it heavily, and who gets to integrate it into sensitive workflows without waiting for permission

rTX5CMRXIfFG 6 hours ago|||

Affordability of hardware that can run local LLMs is a real factor, too. Not sure when RAM prices are going down, but with everything that’s happening and can happen in the world right now, it doesn’t look like it’ll drop in the near or medium-term

pjerem 3 hours ago|||

Open-weight models do not mean you can run them on your laptop (except for the small ones). It means that someone independent (a cloud provider, another company ...) can build big computers that are capable of running those models and provide you with metered usage.

At the end of the day, as a consumer, you still pay per token (or per something) to your provider, except you can choose from multiple providers with your own criteria. If you want to use DeepSeek v4 hosted in Europe, it's possible.

wahnfrieden 6 hours ago|||

No one is going to run models that are comparable to frontier locally without spending enormous sums for use at scale or in large orgs. Even with cheap RAM, you will still need a very large budget for frontier-level capability.

Open models that are competitive with frontier will be used on shared hosts.

jorvi 6 hours ago|||

Models capped out on training and (active) parameters a while ago; it's tooling / harness that is making the big jumps in performance happen. And then you have things like DeepSeek with a pretty small KV cache.

And with the extreme chip shortages for the next two years, there's little appetite for even bigger models anyway.

Barring a breakthrough in scaling, the only direction the models can really go is smaller. Which will inevitably mean better performing local models for same chip budget.

zozbot234 5 hours ago|||

> No one is going to run models that are comparable to frontier locally without spending enormous sums for use at scale

You can always run these models cheaper locally if you're willing to compromise on total throughput and speed of inference. For most end-user or small-scale business needs, you don't really need a lot of either.

9dev 4 hours ago||

It would be awful if running models locally became the primary way of using LLMs. On dedicated servers sharing GPUs across requests, energy usage and environmental impact is way lower overall than if everyone and their mother suddenly needs beefy GPUs. It’s the equivalent of everyone commuting alone in their own car instead of a train picking up hundreds at once.

zozbot234 4 hours ago|||

You can batch requests when running locally too, if you're using a model with low-enough requirements for KV-cache; essentially targeting the same resource efficiencies that the big providers rely on. This is useful since it gives you more compute throughput "for free" during decode, even when running on very limited hardware.

duskdozer 2 hours ago||||

Maybe people would target their use more appropriately, then.

amelius 3 hours ago|||

It's even more awful if the compute capital is owned by only a handful of players.

baq 4 hours ago|||

Open weights will remain open only if they’re significantly worse than the frontier weights.

Before you challenge with benchmarks, consider the labs which release open weight models have internal testing and unpublished results.

pu_pe 4 hours ago|||

There are two problems with that scenario:

1. Your European startup will be competing with others using a much better frontier model. In a scenario where you already have other major disadvantages (access to capital, labor), you might be outcompeted

2. Open models have been keeping pace very nicely, but they rely on distillation of frontier models. If the race gets really tight, this could be affected so that the time gap grows larger (ie, it's very unlikely anyone but Anthropic is distilling from Mythos at the moment)

pjerem 3 hours ago||

> 1. Your European startup will be competing with others using a much better frontier model.

If the small (and I'd even say, sometimes imperceptible) difference between Opus & DeepSeek v4 Pro is such a disadvantage for your startup, then it's your startup that has an issue, not the LLM.

At the end of the day, your startup is there to solve real problems, and even before LLMs, being fast at coding things has never been such a huge competitive advantage compared to marketing, sales, customer support, product vision ...

pu_pe 1 hour ago||

The direction we are going suggests AI will also be used for marketing, sales, customer support and product vision.

Besides, if the difference between Opus and DeepSeek 4 is so small and imperceptible, you are missing the opportunity to launch a startup on your own and compete with Claude Code.

cubefox 4 hours ago|||

Someone recently made a graph showing that the gap between US American frontier LLMs and Chinese open weight LLMs (including DeepSeek v4) is widening. Unfortunately I can't find it anymore.

Update: GPT-5.5 found it.

Article: https://www.nist.gov/news-events/news/2026/05/caisi-evaluati...

Graph: https://www.nist.gov/sites/default/files/images/2026/05/01/1...

tirpen 2 hours ago|||

This is propaganda, not data.

If the Chinese government published a graph that said the opposite, would you consider that a serious and objective source?

cubefox 1 hour ago||

If the methodology in the accompanying write-up did look credible, yes. Though I have significantly more trust in US agencies, like NIST in this case.

mordae 2 hours ago||||

Give it time. It's inevitably a logistic curve.

cubefox 1 hour ago||

I believe logistic curves make no sense when you have Elo scores.

lugu 3 hours ago|||

"Someone" is an official website of the United States government. I would prefer another source.

cubefox 3 hours ago||

I think no other source exists.

wahnfrieden 6 hours ago|||

Llama is not months behind GPT 5.5 Pro. I don't think Qwen or DeepSeek are either.

edit: I'm specifically referring to the "5.5 Pro" model, not regular 5.5 with Pro tier subscription. Claude has no model available that's comparable to 5.5 Pro either.

vasachi 6 hours ago|||

I’ve used DeepSeek 4 Pro through Claude. It’s fine. Plans are similar to what sonnet/opus make. Same massage-the-plan -> massage-the-code loop. Maybe the code is a bit worse, but that’s the “months behind” thing.

The thing is, vast majority of code tasks aren’t a venture into the unknown. We as an industry for the most part build CRUD interfaces and dashboards. That can be achieved, with supervision, with frontier open-weights models quite well.

fwipsy 6 hours ago||

I think maybe you are both right. Perhaps AI coding assistants just don't need to be all that smart in many cases, so open weights models are fine. At the same time, frontier models are advancing in other domains, like mathematics, where raw intelligence is a more important factor.

vasachi 6 hours ago||

I can’t compare raw intelligence of these models, and I certainly can’t say anything about their advances in mathematics (without repeating press releases). But, erm, does it really matter? It’s not like some engineer somewhere will vibe-calculate how much weight a bridge can hold.

Well, yes, someone probably will do that. But I’m pretty sure there will be consequences for the engineer's errors in these vibe-calculations.

lostmsu 3 hours ago|||

There's no evidence there's any 5.5 Pro model distinct from 5.5 xhigh or whatever.

https://developers.openai.com/api/docs/models

zozbot234 3 hours ago|||

https://developers.openai.com/api/docs/models/gpt-5.5-pro is a thing

wahnfrieden 2 hours ago|||

lol

(tap view all on yr link or ask gpt to search for you next time)

sholladay 6 hours ago||

Open models are pretty good at this point but the problem is that they are limited by the tooling and infrastructure that surrounds them. For example, the last time I tried to set up web search with an open model, the experience was pretty bad.

rsolva 1 hour ago||

In our company of 24 employees, we get by with two DGX Sparks. We don't use AI heavily, but each Spark can serve about 6-8 concurrent requests with a full context length of 256k, which is decent. We get about 35 t/s depending on the model we use (currently Qwen3.5 122B A10B and Qwen3 Coder Next), but we might set up a smaller model too for simpler tasks.

This works for us and will work for years to come. It is not SOTA, but it works darn well for our purposes, and we control the compute and data flowing through it, so totally worth it.

zozbot234 47 minutes ago|

That's pretty nice actually. How much KV cache does that model require at full context? That tends to be the main limit on running concurrent requests locally; there's KV quantization, but it has an outsized negative impact on model quality.
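
For a rough sense of scale, here is a back-of-the-envelope sketch of KV cache size under plain full attention. The layer count, KV head count, and head dimension below are hypothetical placeholders, not the real configuration of any model mentioned in this thread; architectures that use sliding-window, linear, or latent attention would need less.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   bytes_per_value=2, concurrent_requests=1):
    """Estimate KV cache size for standard full attention.

    bytes_per_value=2 corresponds to fp16/bf16 cache entries.
    """
    # Factor 2: every layer stores one key tensor and one value tensor.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_len * concurrent_requests


# Hypothetical model shape, chosen only for illustration.
cfg = dict(num_layers=48, num_kv_heads=8, head_dim=128)
context_len = 256 * 1024  # the "256k" full context mentioned above

GIB = 1024 ** 3
single = kv_cache_bytes(**cfg, context_len=context_len)
eight = kv_cache_bytes(**cfg, context_len=context_len, concurrent_requests=8)
print(f"1 request:  {single / GIB:.0f} GiB")
print(f"8 requests: {eight / GIB:.0f} GiB")
```

With these made-up numbers, one full-context request needs 48 GiB of fp16 cache and eight of them need 384 GiB, which is why cache size rather than weight size often caps local concurrency.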

seydor 33 minutes ago||

We should be aiming for less token usage, ideally none at all. Current AI uses LLMs to expand horizontally, but the goal is vertical progress: inventing truly new things and eliminating our biggest problems. Problems like cancer need only be solved once, and no more tokens are needed after that.

pu_pe 4 hours ago||

The more fundamental bottleneck is not even the frontier models, it's the datacenters. Let's say Europe breaks away from the US completely tomorrow. It does not have enough datacenters (or GPUs in general) to sustain its inference needs even if it resorted to Chinese open models. And to build new datacenters, it would need to source parts from the US and China.

In other words, if AI does have continued significant economic impact, only the US and China would be able to leverage it completely. The rest of the world is implicitly betting that AI won't be good enough, or that eventually the compute curve flattens out so using a model that is 10x larger only leads to marginal benefits.

davesque 3 hours ago||

> The more fundamental bottleneck is not even the frontier models, it's the datacenters.

Is it even though? Quantization and speculative decoding are improving the local AI story by leaps and bounds every month.

zozbot234 3 hours ago|||

Speculative decoding is not that useful at scale, it's mostly about making local single-user inference faster. When you're batching multiple inferences together, that's already as fast as the verification you have to perform w/ speculative decoding.
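
For anyone unfamiliar with the verification step mentioned here, this is a toy sketch of greedy speculative decoding. The "models" are hypothetical stand-in functions over integer tokens, and the target is called once per position in a loop, whereas a real implementation would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: draft k tokens, keep the verified prefix."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks each proposed position in order.
        for tok in proposal:
            expected = target(out)
            out.append(expected)  # the target's choice always wins
            if expected != tok or len(out) >= limit:
                break  # first mismatch: discard the rest of the draft
    return out


# Hypothetical toy models: the target counts upward modulo 10,
# and the draft agrees except right after token 4.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


print(speculative_decode(toy_target, toy_draft, [0], max_new=12))
```

The output matches plain greedy decoding with the target alone; any speedup comes only from verifying several drafted positions at once, which is the part that batching across users already provides.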

peheje 2 hours ago||

The future will have LLMs running locally on your laptop/devices. If not almost exclusively then at least for 90-95% of the tasks. Speculative decoding is just one technique out of many existing and more to come that will make this even more viable. The gap is closing on both fronts. Software gets faster/more clever. Hardware gets faster and smaller. The single user story is the story. I'm obviously speculating myself, but that's how I see it.

pu_pe 1 hour ago|||

There is "local AI" that runs on consumer-grade hardware and "local AI" that still needs a datacenter (DeepSeek 4, GLM 4.7, etc). If you woke up tomorrow and could only use the latter, you would be about 6 months behind the frontier; if you had to rely on the former, you would be 2 or 3 years behind.

All these tricks like quantization and speculative decoding can also be used by the leading AI labs, which means they will simply have more compute than you at the end of the day. So far this has translated into better performance.

zozbot234 51 minutes ago||

Nothing released so far inherently "needs" a datacenter, it's just a matter of how much performance you require. Slow, high-latency inference will be a natural way to run "datacenter" models locally.

pu_pe 25 minutes ago||

Yes it does. You will not be able to run models like DeepSeek v4 (>1.5 trillion parameters) on a regular workstation any time soon, unless by "slow" you mean "unusable". And those are the models that are ~6 months behind Opus 4.7.
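
As a rough illustration of the scale involved, here is the weight-memory arithmetic, taking the 1.5 trillion parameter figure from this comment at face value (it is the claim above, not a verified spec) and ignoring KV cache and activations.

```python
def weight_gib(num_params: float, bits_per_weight: int) -> float:
    """Memory needed just to hold the weights, in GiB."""
    return num_params * bits_per_weight / 8 / 1024 ** 3


PARAMS = 1.5e12  # parameter count as claimed in the comment above

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_gib(PARAMS, bits):,.0f} GiB")
```

Even at 4 bits per weight this comes to roughly 700 GiB for the weights alone, so on a typical workstation most of the model would have to be streamed from disk on each pass.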

zozbot234 16 minutes ago||

[dead]

yalok 3 hours ago||

But ASML is in Europe, so they hold at least some critical part of the stack.

lmm 3 hours ago||

In theory yes. They've got a bargaining chip with TSMC. But it's unclear how much use that would be without a safe shipping route between Europe and Taiwan and/or a navy capable of maintaining such.

coderenegade 7 hours ago||

The distillation risk has been brewing for a while now. In a very real sense, the model is the data, so if the data is locked down because of how valuable it is, it was only a matter of time before fully open access to the models would be revoked.

There's also an additional economic concern that rarely gets mentioned: because no one has cracked continual learning, keeping models up-to-date and filling in gaps in performance requires retraining on an ever growing dataset. Granted, you aren't starting from scratch each time, but the scaling required just to stay relevant looks daunting.

I don't know where any of this goes on a societal level, but I've believed since the release of deepseek r1 that access to frontier models would eventually be locked up behind contracts, since the only moats protecting the models themselves are purely artificial. It remains to be seen how effective China is at pushing the envelope, and whether they are interested in providing unfettered access. And on top of that, it remains to be seen how well these models actually turn out to scale in the long run.

ehnto 3 hours ago||

They are also not getting the same quantity or quality of data as was possible in the first years of "ingest". Compared to the beginning, from here on it is more like a drip feed of new training data. Still immense volumes of data, but we are talking 1 year of data production from society versus centuries of text and data ingested in a short time frame.

nayroclade 2 hours ago||

For pre-training, yes. But for post-training you need high-quality labelled datasets for reinforcement learning. So far AI has been most successful in coding, because you can translate the usage into such datasets, and thus produce a virtuous cycle: More usage produces more data, which produces better models, which drives more usage.

The question is whether this same model can successfully be applied in disciplines like medicine, law, engineering, etc.

BrtByte 6 hours ago||

This is a good point, especially the "model is the data" framing

adrithmetiqa 5 hours ago||

Considering the economic angle, one possible long-term future is that access to frontier models is only realistic for the wealthiest 1%. They will use this access to the ultra-intelligent models to increase their wealth further. Inequality will continue to worsen.

jillesvangurp 56 minutes ago||

Physics and economics will drive cost. Current token pricing is based on unsustainable investment and energy cost. However, this is more of an optimization problem than an inherent show stopper. Token cost will inevitably come down over time. But this could take a while before it catches up with demand. Manufacturing will step up to provide cheaper GPUs. Etc. There will be some consolidation but the whole thing will converge on something that should make long-term economic sense.

Ultimately it's a resource control issue. To power AI you need land/space (to build on), water, energy, and lots of hardware. Hardware needs to be manufactured and engineered. It needs metals, some exotic materials, machines, etc. More resources in other words. If you look at China vs the US here, China is really well positioned in terms of resources and supply chains. The US has fallen behind quite a bit on energy and all the critical resources needed to produce hardware. AI is bottlenecked on a lot of stuff that China has or makes in abundance.

For the frontier models, there are a growing number of companies and countries that provide them. We're used to mostly talking about the US ones. But of course the Chinese have a lot of capability here and they are not that far behind. And that's judging by the models they choose to release under OSS licenses. Those models are not their frontier models. And there are a lot of other countries developing and using models that aren't necessarily talking openly about what they are doing.

The irony with these frontier models is that they only generate revenue if people can use them. Why sink billions in AI infrastructure and models without a revenue model?

The reality with Mythos is that you have to assume that the Chinese (and others) are not that far behind and may already be running an equivalent model that they just haven't told anyone about yet. Anthropic gatekeeping Mythos and its findings is probably wise. But it's not sustainable in the long term to depend on that happening or working very well, or even on them being a leader in this space.

This is becoming an arms race between countries and economies. And it's an economic and resource-control race. Developing and researching in the open has advanced things massively. But it has also empowered the rest of the world. Both Anthropic and OpenAI are staffed with people from all over the world. You have to assume that they probably aren't very good at keeping things secret.

zozbot234 41 minutes ago|

Those billions in AI datacenter infrastructure will eventually be repurposed to run smart models like Mythos, not ChatGPT or even Opus/Sonnet. That future "revenue model" is quite robust to any foreseeable competition from on-prem FLOSS inference. It's a natural fit to the actual capabilities of large datacenter-scale compute.

digitaltrees 5 hours ago||

The thing is, the open source models are smart enough to do most work if the harness and orchestration are right. So even if the next-gen models get locked behind monopoly paywalls, build real things in the real world and fight for a humane world.

margorczynski 3 hours ago|

The availability of open models with such capabilities is based on the goodwill of the Chinese. And that might end eventually, especially since the matter comes down to a single decision by Xi and the party.

Animats 4 hours ago|

Over on the image generation side, "frontier AI" seems to be coming along rather well. Watch this video, which was released eight days ago.[1] Can you find any flaws? Two years ago, just getting hands with the right number of fingers was tough. Last year, there were jarring errors in every scene. Now, very little is wrong. How much longer will anyone need Hollywood studios?

[1] https://www.youtube.com/watch?v=4zTCLIhScCM

Morromist 3 hours ago||

It is a LOT better than 2 years ago, but there are flaws and it's unpleasant to watch. The easiest to spot is their shoes (which they weren't wearing 1 second ago) flying off their feet without being kicked off in the first 10 seconds.

But if progress keeps going I'm sure it will get to the point where my brain doesn't feel sick after watching it. I hope so, because I'm sure there are a lot of AI videos in my future, whether I want them or not.

piker 3 hours ago|||

Still in the uncanny valley for me. Like watching AI the film. That said, it’s 3 minutes long and maintains the setting across many different angles, zoom levels, etc. Pretty impressive.

turpentine 2 hours ago|||

The video doesn't even have to load to know it's AI generated. The channel profile thumbnail and the video description are dead giveaways. The first frame of the video has too many errors to be worth repeating here. The first 0.5 seconds of the video has implausible movement.

dabinat 2 hours ago|||

The thing that’s always missing from videos like this is how much prompting or manual editing it took. It’s always implied that it was a one-shot, when it almost certainly was not.

rightbyte 2 hours ago|||

It is way better than some years ago, but nearly every scene has something strange in it. Look closely. Look at them throwing shoes at 55s if you want something really obvious.

quink 3 hours ago|||

And even if there weren’t any jarring errors, and rest assured there’s about a billion of them, there’s no appeal to this. It’s all context free short unassociated clips of pretty faces dancing on a beach. And?

There’s no narrative, there’s no sense of reality, it’s just a sense of here’s a million pixels of colours that have proven to go well with each other, it’s _slop_.

It’s been years and the only place AI has conquered in visual entertainment is as a subpar Photoshop replacement to fill in the B-roll gaps for those that don’t have the patience or money to do it the proper way.

Paradigma11 1 hour ago|||

How elitist of you to belittle 95% of all creators who are no better than that.

duskdozer 1 hour ago|||

Well it is an ad, and all ads need to do is pump their "brand" into your head, so it was always slop.

_diyar 4 hours ago|||

> Can you find any flaws

Physics.

dinkumthinkum 2 hours ago||

I hear you. It is impressive technically but as far as finding flaws, I will just say this. This looks like something aliens would create in a dystopian simulation based on a very odd understanding of old movies. I found it quite unsettling. Do you really think this would replace films with real actors and real writers (assuming they left the millennial talk and "modern audience" stuff)? I think the memes and parodies of AI video are more interesting than this kind of thing.
