“I want to wash my car. The carwash is 50m away. Should I take the car or go by foot?”
https://claude.ai/share/5f7f738a-5f29-48ff-9807-9a2dd37fb405
https://claude.ai/share/ecd14393-9d42-4527-ae0c-89f3d05216c8
I was surprised to see that it failed a Data extraction test (it gets it right 2/3 times, but one time it randomly returns null for a value instead).
It makes some sense that it fails more Trivia/Domain-specific knowledge tasks (I think models are increasingly trained for agentic use cases rather than general intelligence).
[0]: https://aibenchy.com/compare/anthropic-claude-opus-4-7-mediu...
Double-checking my test harness, but it's the first model that does this, so I doubt the issue is on my side...
EDIT: Harness seems correct; for straight coding tasks they perform identically: https://i.snipboard.io/5xbpzY.jpg
> Claude Opus 4.8 is available everywhere today. Pricing for regular usage is unchanged from Opus 4.7: $5 per million input tokens and $25 per million output tokens. Pricing for fast mode is $10 per million input tokens and $50 per million output tokens.
Where do you see the 2x cost?
This is good psychology for the labs. When Buffett invested in Apple he loved citing how most people would rather give up their second car than their iPhone.
But (maybe because it was hardware) that took 10ish years, while it seems like the slowdown here only took about 4.
later on someone figured if you asked it to output a reasoning before it gave a response its output would have more logical coherence, as though the reasoning output tokens functioned as a scratch space for it to work on.
at the end it's all next token prediction
later on someone figured if you shove Adderall in it and ask it to think before it speaks, its output would have more logical coherence, as though the Adderall concentration drug functioned as a scratch space for it to work on.
in the end it's a squishy lump of meat.
Have fun betting your competency on the quality and quantity of tokens you have access to. Hate to break it to you, but the billionaires aren't going to keep renting you $2mm in GPUs for 5 hours a day for $200.00 a month forever.
So even for enterprise deployments, as the dust settles, CFOs/CTOs might find that deploying on an internal cluster of GPUs is far cheaper and more reliable for their organisational needs than paying someone else for burned tokens.
And I was dead wrong. Now I mostly use DeepSeek Pro myself.
I've wasted over a hundred Euros re-doing work that was done badly because the model wasn't up to the task (Vue with TS + wrapper components around PrimeVue, needing to handle event and property passthrough and deal with the stupid Vue SFC issues; TS made this much worse than JS would be). I think it was the GLM model through Cerebras Code at the time, in addition to some GPT and Gemini models with the API pricing.
That said, DeepSeek V4 Pro is pretty good and I can totally see myself offloading some of the work, as long as a better model reviews the work and provides suggestions/tests for it.
A $20 Claude sub goes a long way when you plan with Opus and execute with Sonnet.
The most I’ve ever spent in a month extra on API tokens for my own work is $200, and I pay for the $200/mo Claude. I use these models quite a lot, though not idly (I usually just walk around and do other stuff until I know how I'm going to approach the next set of problems). So it costs me about $3000/year to get as much as I want of the best model available. Already that seems low enough to not be worth stressing out too much about optimizing it, because it feels like an indisputably good value, and trying to save money with a less powerful model would be optimizing for a $1000-$2000 saving at the expense of a large portion of my work taking longer or being more frustrating and iterative.
That’s not a flex or anything, I get that in other countries $3000/yr is a lot of money for a software developer and also a lot of people would perhaps rationally be better off doing X% worse at work or spending Y% more time on tasks to save $Z, if their productivity improvements didn’t translate to more salary. Otherwise if your performance has more upside I really do think that the smartest models are better with the current pricing scheme. Deepseek and the other Chinese models spend a LOT of time thinking, and tend to be much more jagged (benchmaxxed) in performance. How can dealing with that over an entire year be worth $2k?
The only situation I can think of where sacrificing my own time/performance to save on inference makes sense is batch compute (of course, $1k vs $100k is different from $30 vs $3k) or work where the tier 2 models have crossed the “good enough” threshold. But I think Opus is not even close to that threshold generally yet. As it gets smarter I, and I think most others probably, just try to do harder things faster and hit the next wall.
I've just recently started trying out DeepSeek 4 Flash and I was very skeptical at first because I've had some really good experiences with GPT-5.{4,5}, and couldn't possibly believe that this model they charge nothing for could give me similar results, but it absolutely shreds through things and ends up giving me very good answers in almost no time. I also like that it doesn't really seem to have much personality, it's given me mostly just facts and data so far without any additions to the prompt by me.
In my own agent I also specifically prompt to remove flowery language, snark, etc., but I haven't tried it with models like GPT-5.x, which I've found have too much personality and try too hard to make it seem like I'm talking to a human.
I ask AI a lot of questions, not only about code but about my personal life, and I would be willing to pay very large sums to have the best quality output.
At my prior job there was still what felt like a strong enough correlation between my actual performance and my pay that I don't think I would have had a hard time justifying the expense there either; now I absolutely don't. With the current state of the models, it's baffling to me to hear about professional software developers planning their work around their $20/mo subscription's quotas.
Obviously it's more complicated than more tokens = more productive, but I see them less like SaaS and more like gasoline, where if I run out or need more to do what I'm doing, as long as I'm not being wasteful, I just buy more. Why would I waste a day walking 30 miles by foot when I can just pay $5 for gasoline and drive?
1. The sheer number of tokens that a coding agent can use flipped the math upside down on this equation. If you use the most expensive model for everything those costs quickly become untenable, even for software companies.
2. We realized many of the coding problems we're solving aren't incredibly difficult.
Chinese models are really quite good at a lot of stuff.
I think you're right especially if you're someplace that already has a data center, such as a university. Solves a lot of privacy concerns as well.
I just used ollama with a shell script to tackle my directory of papers/literature. I converted the first 6 pages of each document to PNG, handed them off to Qwen, and told it to spit out BibTeX, including the abstract. Two days later it was done, and I didn't spend anything on "tokens."
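A rough sketch of what such a pipeline might look like, written in Python rather than shell. This is not the commenter's script. It assumes `pdftoppm` (from poppler-utils) is installed and an Ollama server is listening on its default local port; the model tag `qwen2.5vl`, the prompt wording, and the helper names are all hypothetical placeholders.

```python
import base64
import glob
import json
import subprocess
import urllib.request
from pathlib import Path

# Assumptions: Ollama's default local endpoint, and a vision-capable Qwen tag.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen2.5vl"  # hypothetical; use whatever tag you actually pulled
PROMPT = "Return a single BibTeX entry for this paper, including the abstract."


def render_first_pages(pdf: Path, out_dir: Path, pages: int = 6) -> list[Path]:
    """Render the first `pages` pages of a PDF to PNG files via pdftoppm."""
    out_dir.mkdir(parents=True, exist_ok=True)
    prefix = out_dir / pdf.stem
    subprocess.run(
        ["pdftoppm", "-png", "-f", "1", "-l", str(pages), str(pdf), str(prefix)],
        check=True,
    )
    # pdftoppm names its output <prefix>-<page>.png
    return sorted(out_dir.glob(glob.escape(pdf.stem) + "-*.png"))


def build_payload(prompt: str, images: list[bytes]) -> dict:
    """Build a non-streaming generate request with base64-encoded images."""
    return {
        "model": MODEL,
        "prompt": prompt,
        "images": [base64.b64encode(img).decode("ascii") for img in images],
        "stream": False,
    }


def ask_model(payload: dict) -> str:
    """POST the payload to the local Ollama server and return the text reply."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


def bibtex_for(pdf: Path, work_dir: Path) -> str:
    """Render the first pages of one PDF and ask the model for BibTeX."""
    pngs = render_first_pages(pdf, work_dir)
    return ask_model(build_payload(PROMPT, [p.read_bytes() for p in pngs]))
```

Looping `bibtex_for` over a directory and appending the results to a `.bib` file would complete the job. The output would still need a manual check, since nothing here validates the returned BibTeX.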
I don't think it's as simple as saying China's hosting is subsidized, they have generally cheaper electricity and labor costs than in the US and don't have access to the top tier models, and a large internal market where the big models are the best thing they can run with what they have. So obviously they max out on their top models (which are trained with their hardware market in mind, not ours) and get the economy of scale from that, and can run generally the same hardware for less money than in the US because
The edge models are very cheap to run and can do so on inexpensive hardware. They are like 95% cheaper to run than Haiku, so the math is in their favor for certain batch workloads. Most people just run the models for themselves when they do that without making it available on openrouter or whatever, because you can just provision a gpu node and use it as needed, and it's not that expensive to run this family of models.
Is your problem that you want to call Chinese models hosted in the US because you're worried about the data handling?
Edge models, yes, they can be convenient to run batch jobs locally. I still would argue there's no economic benefit over paying for models. Haiku has a bad price/perf but others in that class are significantly cheaper in hosted APIs.
Doesn't matter what I think, the reality is that the majority of enterprises (where the real $ comes from) will not consider sending their data to China.
1. https://epoch.ai/data-insights/ai-datacenter-cost-breakdown
It's just that some of us didn't imagine having GPUs would be advantageous and were not gamers on the side. Those who had beefy GPUs or GPU rigs for any reason rarely need to go anywhere else.
At least I am so impressed with Deepseekv4 AFTER using Claude Opus 4.7 for a significant amount of time that I am not going anywhere but Deepseekv4.
The model is just INSANE. Things I have done with it include attempting to write a 2.5D game engine in C with full animation and map rendering layer by layer.
If you want to support a team of engineers, DeepSeek V4 Flash is antirez's current favorite. And you could support a team of engineers pretty nicely for $40-50k. Which might not make sense if you're on a Claude MAX 5x plan or the old enterprise group plan with fixed price seats. But Anthropic is switching their enterprise contracts over to token-based pricing, at which point $50k is looking pretty good.
For me, things are getting better faster than my ability to review / trust the resulting code, so tok/sec isn't a bottleneck anymore. Instead, quality of the tokens is the bottleneck. That points to me wanting a 1TB DRAM iGPU once they're available at pre-bubble RAM pricing.
If you compare to a smarter US model like Grok 4.3, $1400 will pay for 560M output tokens, which at ~25 t/s locally using it nonstop for 8 hours a day would take two years to pay back. Not accounting for bubble prices or electricity.
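For anyone who wants to check the payback estimate, here is a quick back-of-envelope sketch. The inputs are the figures quoted in the comment ($1400, 560M output tokens, ~25 t/s, 8 hours a day); the function names are just illustrative, and like the comment it ignores electricity and input-token costs.

```python
# Figures taken from the comment above; everything else is derived.
HARDWARE_COST_USD = 1400
HOSTED_OUTPUT_TOKENS = 560_000_000  # what $1400 is said to buy on the hosted API
LOCAL_TOKENS_PER_SEC = 25           # approximate local generation speed
HOURS_PER_DAY = 8


def implied_price_per_million(cost_usd: float, tokens: int) -> float:
    """Hosted price per million output tokens implied by cost and token count."""
    return cost_usd / (tokens / 1_000_000)


def payback_days(tokens: int, tokens_per_sec: float, hours_per_day: float) -> float:
    """Days of local generation needed to produce `tokens` output tokens."""
    tokens_per_day = tokens_per_sec * 3600 * hours_per_day
    return tokens / tokens_per_day


if __name__ == "__main__":
    price = implied_price_per_million(HARDWARE_COST_USD, HOSTED_OUTPUT_TOKENS)
    days = payback_days(HOSTED_OUTPUT_TOKENS, LOCAL_TOKENS_PER_SEC, HOURS_PER_DAY)
    print(f"implied hosted price: ${price:.2f} per million output tokens")
    print(f"payback: {days:.0f} days, about {days / 365:.1f} years")
```

At 25 t/s for 8 hours a day the machine produces 720,000 tokens a day, so 560M tokens takes roughly 778 days, which is where the "two years" figure comes from.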
According to openrouter, Opus 4.8 is 128 t/s. So 10x faster than my antirez/ds4.
Meanwhile you could use Grok 4.3 for the same price which is smarter and 5X faster[4].
1. https://deepinfra.com/pricing
2. https://api-docs.deepseek.com/quick_start/pricing
3. https://artificialanalysis.ai/models/deepseek-v4-pro/provide...
I don't see myself returning to Claude or Codex anytime soon.
It's possible we might just be witnessing a shift in fashion, where this type of sentimentality was more acceptable when it was novel, but now it just appears out of touch.
For example, it's being pushed pretty hard where I'm at, though not quite on the tokenmaxxer level. I started skipping related meetings cause it was nauseating. I can only tolerate so many platitudes.
At the same time, I just used the ever living snot out of Opus 4.6 for hours, grinning like an idiot throughout. Automated a whole bunch of enterprise cross-system drudgery away.
Fairly constant over time as well. Expressed a similar sentiment not too long ago here: https://news.ycombinator.com/item?id=48154277
Would you rather e.g. your doctor prioritized their wealth over your health? Popular conspiracy, but I'm not sure many health professionals buy into it. Not sure why you think this field would be much different. If this job is gone, it's gone. I can enjoy recreational programming on my own time; I don't feel entitled to my interest remaining a money maker.
What worries me - and it does - is a further and accelerating shift in wealth (and thus capability) asymmetry. But for that, I look out for the performance and requirements of self-hostable models, rather than play the luddite or lie to myself and others about the state of this technology.
If you want safety for country sovereignty, get a nuke. If you want safety for knowledge work, get a local model.
I called it out.
It then gave me one of the most super heartfelt honest and sincere apologies I have ever received.
Glad the safety team was there for me and able to make such an honest model or I would have been very upset about it.
Bash(echo test123)
⎿ test123
Read 1 file, listed 1 directory (ctrl+o to expand)
Bash(echo "checking output works")
⎿ checking output works
Read 1 file (ctrl+o to expand)
⎿ API Error: 400 messages.3.content.56: `thinking`
or `redacted_thinking` blocks in the latest
assistant message cannot be modified. These
blocks must remain as they were in the original
response.
Very inspiring improvements. Disappointing result for a code review I expected to see after my 30 min walk.
ln -s $HOME/.local/share/claude/versions/2.1.153 $HOME/.local/bin/claude
> We have increased rate limits in Claude Code to accommodate the higher token usage of higher effort levels
Or maybe it is, but publish the DeepSWE numbers so we can see for ourselves.
I think that buys enough credibility to propose an alternative.
I think there's a case to answer if Anthropic models underperform on a novel benchmark. I'd like to see more novel benchmarks to get a clearer picture.