Now, for me, it was really about how well it worked on big existing human made code bases. I was working on some new screens in GalCiv IV and if you've ever had to make screens for games, it is incredibly tedious, low brain work. But GPT 5.5 and Opus 4.8 would just struggle with these over and over again and this is C++ work with limited hotloading so it's a slow process. Fable nailed these screens fast.
I suggest tasks cannot be guessed (find, not tell). And 2d charts, both for ROC and pricing, vide https://quesma.com/benchmarks/binaryaudit/
My hat's off to swelljoe.
This part was especially interesting:
> The cheap Chinese models kick ass. MiMo and DeepSeek are directly competitive with Opus 4.8 and GPT 5.5 at roughly an order of magnitude lower price. There have been accusations of “benchmaxxing” with the Chinese models, but I don’t think there’s any reasonable way for the models to already be tuned for these very recently disclosed bugs. I think they’re genuinely becoming competitive with the frontier from Anthropic and OpenAI. If you’re in a hurry, DeepSeek was the fastest, on average, while finding 4/9 bugs. And, if you’re cheap, MiMo found bugs as well as any model for the lowest price.
"Don't be snarky."
"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."
"Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith."
"Don't be curmudgeonly. Thoughtful criticism is fine, but please don't be rigidly or generically negative."
The problem with that math is that if they don't do any training they would be out of the market in 12 months, they're only relevant ("profitable") precisely because they trained the current reference SOTA model.
They can't just release Mythos and sit on top of it forever, competition is catching up fast and people expect a new more powerful model every 6 months.
You may notice that the performance of the old model tends to decline before each new model release.
Are they quietly compacting context to reduce kv cache usage, before the actual compaction? Like there’s a slider for how much to compress it, and that’s never revealed to us?
You have to learn to think like a drug dealer. The first hit is always free.
Companies and developers are growing more and more dependent on coding agents. Eventually, the owners of the AI will be able to charge whatever they want. What are you going to do? Go back to coding by hand? Do you even remember how?
It won't be as good.