From very limited testing, it seems to be slightly worse than MiniMax M2.1 Q6 (a model about twice its size). I'm impressed.
I wish AMD would get around to adding NPU support in Linux for it though, it has more potential that could be unlocked.
I tried FP8 in vLLM and it used 110GB and then my machine started to swap when I hit it with a query. Only room for 16k context.
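For reference, the kind of invocation I mean; the model ID is a placeholder and the exact numbers will need tuning for your machine, so treat this as a sketch:

```shell
# Hypothetical vLLM launch that caps context to fit FP8 weights in memory.
# --max-model-len bounds the KV cache size; --gpu-memory-utilization leaves
# some headroom so the host doesn't start swapping under load.
vllm serve <org>/<model>-FP8 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90
```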
I suspect there will be some optimizations over the next few weeks that will pick up the performance on these types of machines.
I have it writing some Rust code and it's definitely slower than using a hosted model but it's actually seeming pretty competent. These are the first results I've had on a locally hosted model that I could see myself actually using, though only once the speed picks up a bit.
I suspect the API providers will offer this model for nice and cheap, too.
I'm asking it to do some analysis/explain some Rust code in a rather large open source project and it's working nicely. I agree this is a model I could possibly, maybe use locally...
The --no-mmap and --fa on options seemed to help, but not dramatically.
As with everything Spark, memory bandwidth is the limitation.
I'd like to be impressed with 30tok/sec but it's sort of a "leave it overnight and come back to the results" kind of experience, wouldn't replace my normal agent use.
However I suspect in a few days/weeks DeepInfra.com and others will have this model (maybe Groq, too?), and will serve it faster and for fairly cheap.
Hope they update the model page soon https://chat.qwen.ai/settings/model
Sorry, but we're talking about models as content now? There's almost always a better word than "content" if you're describing something that's in tech or online.
I didn’t know it was this serious with the vocabulary, I’ll be more cautious in the future.
To be clear, since this confuses a lot of people in every thread: Anthropic will let you use their API with any coding tools you want. You just have to go through the public API and pay the same rate as everyone else. They have not "blocked" or "banned" any coding tools from using their API, even though a lot of the clickbait headlines have tried to insinuate as much.
Anthropic never sold subscription plans as being usable with anything other than their own tools. They were specifically offered as a way to use their own apps for a flat monthly fee.
They obviously set the limits and pricing according to typical use patterns of these tools, because the typical users aren't maxing out their credits in every usage window.
Some of the open source tools reverse engineered the protocol (which wasn't hard) and people started using the plans with other tools. This situation went on for a while without enforcement until it got too big to ignore, and they began protecting the private endpoints explicitly.
The subscription plans were never sold as a way to use the API with other programs, but I think they let it slide for a while because it was only a small number of people doing it. Once the tools started getting more popular they started closing loopholes to use the private API with other tools, which shouldn't really come as a surprise.
No, in 2026, even with their API plan, key creation is disabled for most orgs; you basically have to ask your admin to give you a key to use something other than Claude Code. You can imagine how that would be a problem.
Eventually they added subscription support and that worked better than Cline or Kilo, but I'm still not clear what Anthropic tools the subscription was actually useful for.
Some people think LLMs are the final frontier. If we just give in and let Anthropic dictate the terms to us we're going to experience unprecedented enshittification. The software freedom fight is more important than ever. My machine is sovereign; Anthropic provides the API, everything I do on my machine is my concern.
They simply don't want to compete, they want to force the majority of people that can't spend a lot on tokens to use their inferior product.
Why build a better product if you control the cost?
Problem is, most people don't do this, choosing convenience at any given moment without thinking about longer-term impact. This hurts us collectively by letting governments/companies, etc tighten their grip over time. This comes from my lived experience.
I agree anticompetitive behavior is bad, but the productivity gains to be had by using Anthropic models and tools are undeniable.
Eventually the open tools and models will catch up, so I'm all for using them locally as well, especially if sensitive data or IP is involved.
I can't comment on Opus in CC because I've never bitten the bullet and paid the subscription, but I have worked my way up to the $200/month Cursor subscription and the 5.2 codex models blow Opus out of the water in my experience (obviously very subjective).
I arrived at making plans with Opus and then implementing with the OpenAI model. Opus's speed makes it much better for planning.
I'm willing to believe that CC/Opus is truly the overall best; I'm only commenting because you mentioned Cursor, where I'm fairly confident it's not. I'm basing my judgement on "how frequently does it do what I want the first time".
I've tried explaining the implementation word for word and it still prefers to create a whole new implementation, reimplementing some parts instead of just doing what I tell it to. The only time it works is if I actually give it the code, but at that point there's no reason to use it.
There's nothing wrong with this approach if it actually had guarantees, but current models are an extremely bad fit for it.
For actual work that I bill for, I go in with instructions to do minimal changes, and then I carefully review/edit everything.
That being said, the "toy" fully-AI projects I work with have evolved to the point where I regularly accomplish things I never (never ever) would have without the models.
At the moment I have a personal Claude Max subscription and ChatGPT Enterprise for Codex at work. Using both, I feel pretty definitively that gpt-5.2-codex is strictly superior to Opus 4.5. When I use Opus 4.5 I’m still constantly dealing with it cutting corners, misinterpreting my intentions and stopping when it isn’t actually done. When I switched to Codex for work a few months ago all of those problems went away.
I got the personal subscription this month to try out Gas Town and see how Opus 4.5 does on various tasks, and there are definitely features of CC that I miss with Codex CLI (I can’t believe they still don’t have hooks), but I’ve cancelled the subscription and won’t renew it at the end of this month unless they drop a model that really brings them up to where gpt-5.2-codex is at.
Edit: It's very true that the big 4 labs silently mess with their models and any action of that nature is extremely user hostile.
I agree with all posts in the chain: Opus is good, Anthropic have burned good will, I would like to use other models...but Opus is too good.
What I find most frustrating is that I am not sure if it is even actual model quality that is the blocker with other models. Gemini just goes off the rails sometimes with strange bugs like writing random text continuously and burning output tokens, Grok seems to have system prompts that result in odd behaviour (no bugs, just doing weird things), Gemini Flash models seem to output massive quantities of text for no reason... it often feels like very stupid things.
Also, there are huge issues with adopting some of these open models in terms of IP. Third parties are running these models and you are just sending them all your code...with a code of conduct promise from OpenRouter?
I also don't think there needs to be a huge improvement in models. Opus feels somewhat close to the reasonable limit: useful, still outputs nonsense, misses things sometimes...there are open models that can reach the same 95th percentile but the median is just the model outputting complete nonsense and trying to wipe your file system.
The day for open models will come but it still feels so close and so far.
If people start using the Claude Max plans with other agent harnesses that don't use the same kinds of optimizations, the economics may no longer work out.
(But I also buy that they're going for horizontal control of the stack here and banning other agent harnesses was a competitive move to support that.)
They seem to have started rejecting 3rd party usage of the sub a few weeks ago, before Claw blew up.
By the way, does anyone know about the Agents SDK? Apparently you can use it with an auth token, is anyone doing that? Or is it likely to get your account in trouble as well?
I've had a similar experience with opencode, but I find that works better with my local models anyway.
(There probably is, but I found it very hard to make sense of the UI and how everything works. Hard to change models, no chat history etc.?)
> hitting that limit is within the terms of the agreement with Anthropic
It's not, because the agreement says you can only use CC.
Selling dollars for $.50 does that. It sounds like they have a business model issue to me.
Without knowing the numbers it's hard to tell if the business model for these AI providers actually works, and I suspect it probably doesn't at the moment, but selling an oversubscribed product with baked in usage assumptions is a functional business model in a lot of spaces (for varying definitions of functional, I suppose). I'm surprised this is so surprising to people.
It'll be interesting to see what OpenAI and Anthropic will tell us about this when they go public (which seems likely late this year, along with SpaceX, possibly).
There are already many serious concerns about sharing code and information with 3rd parties, and those Chinese open models are dangerously close to destroying their entire value proposition.
Being a common business model and it being functional are two different things. I agree they are prevalent, but they are actively user hostile in nature. You are essentially saying that if people use your product at the advertised limit, then you will punish them. I get why the business does it, but it is an adversarial business model.
The problem is, there's not a clear every-man value like Uber has. The stories I see of people finding value are sparse and seem to come from the POV of either technosexuals or already strong developer whales leveraging the bootstrappy power.
If AI was seriously providing value, orgs like Microsoft wouldn't be pushing out versions of windows that can't restart.
It clearly is a niche product unlike Uber, but it's definitely being invested in like it is universal product.
It's within their capability to provision for higher usage by alternative clients. They just don't want to.
It's like Apple: you can use macOS only on our Macs, iOS only on iPhones, etc. But at least in the case of Apple, you pay (mostly) for the hardware while the software it comes with is "free" (as in free beer).
Could have just turned a blind eye.
(Edit due to rate-limiting: I see, thanks -- I wasn't aware there was more than one token type.)
That's not the product you buy when you buy a Claude Code token, though.
This confused me for a while, having two separate "products" which are sold differently, but can be used by the same tool.
If a company is going to automate our jobs, we shouldn't be giving them money and data to do so. They're using us to put ourselves out of work, and they're not giving us the keys.
I'm fine with non-local, open weights models. Not everything has to run on a local GPU, but it has to be something we can own.
I'd like a large, non-local Qwen3-Coder that I can launch in a RunPod or similar instance. I think on-demand non-local cloud compute can serve as a middle ground.
I can also imagine a dysfunctional future where developers spend half their time convincing their AI agents that the software they're writing is actually aligned with the model's set of values.
And yeah, I got three (for some reason) emails titled "Your account has been suspended" whose content said "An internal investigation of suspicious signals associated with your account indicates a violation of our Usage Policy. As a result, we have revoked your access to Claude.". There is a link to a Google Form which I filled out, but I don't expect to hear back.
I did nothing even remotely suspicious with my Anthropic subscription so I am reasonably sure this mirroring is what got me banned.
Edit: BTW I have since iterated on doing the same mirroring using OpenCode with Codex, then Codex with Codex and now Pi with GPT-5.2 (non-Codex) and OpenAI hasn't banned me yet and I don't think they will as they decided to explicitly support using your subscription with third party coding agents following Anthropic's crackdown on OpenCode.
I'm not so sure. It doesn't sound like you were circumventing any technical measures meant to enforce the ToS which I think places them in the wrong.
Unless I'm missing some obvious context (I don't use Mac and am unfamiliar with the Bun.spawn API) I don't understand how hooking a TUI up to a PTY and piping text around is remotely suspicious or even unusual. Would they ban you for using a custom terminal emulator? What about a custom fork of tmux? The entire thing sounds absurd to me. (I mean the entire OpenCode thing also seems absurd and wrong to me but at least that one is unambiguously against the ToS.)
It’d be cool if Anthropic were bound by their terms of use that you had to sign. Of course, they may well be broad enough to fire customers at will. Not that I suggest you expend any more time fighting this behemoth of a company though. Just sad that this is the state of the art.
* Subscription plans, which are (probably) subsidized and definitely oversubscribed (ie, 100% of subscribers could not use 100% of their tokens 100% of the time).
* Wholesale tokens, which are (probably) profitable.
If you try to use one product as the other product, it breaks their assumptions and business model.
I don't really see how this is weaponized malaise; capacity planning and some form of over-subscription is a widely accepted thing in every industry and product in the universe?
Also, this is more like "I sell a service called take a bike to the grocery store" with a clause in the contract saying "only ride the bike to the grocery store." I do this because I am assuming that most users will ride the bike to the grocery store 1 mile away a few times a week, so they will remain available, even though there is an off chance that some customers will ride laps to the store 24/7. However, I also sell a separate, more expensive service called Bikes By the Hour.
My customers suddenly start using the grocery store plan to ride to a pub 15 miles away, so I kick them off of the grocery store plan and make them buy Bikes By the Hour.
They could, of course, price your 10GB plan under the assumption that you would max out your connection 24 hours a day.
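To make the oversubscription point concrete, here's a toy calculation with entirely made-up numbers (the cost, cap, and utilization figures are illustrative, not anyone's real economics):

```python
# Toy oversubscription math: if serving capacity costs $2 per GB and the
# typical subscriber uses only 3% of their nominal cap, compare the
# break-even monthly price under the two pricing assumptions.
cost_per_gb = 2.0           # hypothetical provider cost per GB served
cap_gb = 1000.0             # what the plan nominally allows per month
typical_utilization = 0.03  # most subscribers use ~3% of the cap

# Price if you assume every subscriber maxes out their allowance:
worst_case_price = cap_gb * cost_per_gb

# Price if you assume typical usage (the oversubscribed model):
typical_price = cap_gb * typical_utilization * cost_per_gb

print(worst_case_price)  # 2000.0
print(typical_price)     # 60.0
```

Under these toy numbers, pricing for the worst case makes the plan 30x+ more expensive for everyone, which is why providers price for the typical user and then clamp down on outliers.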
I fail to see how this would be advantageous to the vast majority of the customers.
OpenCode et al continue to work with my Max subscription.
What Anthropic blocked is using OpenCode with the Claude "individual plans" (like the $20/month Pro or $100/month Max plan), which Anthropic intends to be used only with the Claude Code client.
OpenCode had implemented some basic client spoofing so that this was working, but Anthropic updated to a more sophisticated client fingerprinting scheme which blocked OpenCode from using these individual plans.
I recommend Ghostty for Mac users. Alacritty probably works too.
{
  "plugin": [
    "opencode-anthropic-auth@latest"
  ]
}

It's that simple.

Everyone else is trying to compete in other ways and Anthropic is pushing to dominate the market.
They'll eventually lose their performance edge and suddenly they'll be back to being cute and fluffy.
I've cancelled a Claude sub, but still have one.
I've tried all of the models available right now, and Claude Opus is by far the most capable.
I had an assertion failure triggered in a fairly complex open-source C library I was using, and Claude Opus not only found the cause, but wrote a self-contained reproduction code I could add to a GitHub issue. And it also added tests for that issue, and fixed the underlying issue.
I am sincerely impressed by the capabilities of Claude Opus. Too bad its usage is so expensive.
I wonder what they are up to.
Please list what capabilities you would like our local model to have and how you would like to have it served to you.
[1] a sovereign digital nation built on a national framework rather than a for-profit or even non-profit framework, will be available at https://stateofutopia.com (you can see some of my recent posts or comments here on HN.)
[2] https://www.youtube.com/live/0psQ2l4-USo?si=RVt2PhGy_A4nYFPi
You are doing that all the time. You just draw the line, arbitrarily.
It's like this old adage "Our brains are poor masters and great slaves". We are basically just wanting to survive and we've trained ourselves to follow the orders of our old corporate slave masters who are now failing us, and we are unfortunately out of fear paying and supporting anticompetitive behavior and our internal dissonance is stopping us from changing it (along with fear of survival and missing out and so forth).
The global marketing by the slave master class isn't helping. We can draw a line however arbitrary we'd like though and its still better and more helpful than complaining "you drew a line arbitrarily" and not actually doing any of the hard courageous work of drawing lines of any kind in the first place.
Granted, these 80B models are probably optimized for H100/H200, which I do not have. Here's hoping that OpenClaw compatibility survives quantization.
I would recommend trying llama.cpp's llama-server with models of increasing size until you hit the best quality / speed tradeoff with your hardware that you're willing to accept.
The Unsloth guides are a great place to start: https://unsloth.ai/docs/models/qwen3-coder-next#llama.cpp-tu...
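If it helps, here's a starting-point invocation for llama-server; the model path is a placeholder and flag defaults shift between llama.cpp releases, so treat this as a sketch rather than a known-good config:

```shell
# Serve a local GGUF with an OpenAI-compatible endpoint on port 8080.
# -ngl 99 offloads as many layers as fit onto the GPU; --ctx-size trades
# memory for context length, so lower it if you run out of VRAM.
llama-server -m /path/to/Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
  --ctx-size 32768 \
  -ngl 99 \
  --port 8080
```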
one more thing, that guide says:
> You can choose UD-Q4_K_XL or other quantized versions.
I see eight different 4-bit quants (I assume that is the size I want?)... how do I pick which one to use?
IQ4_XS
Q4_K_S
Q4_1
IQ4_NL
MXFP4_MOE
Q4_0
Q4_K_M
Q4_K_XL

Also, depending on how much regular system RAM you have, you can offload mixture-of-experts models like this, keeping only the most important layers on your GPU. This may let you use larger, more accurate quants. That functionality is supported by llama.cpp and other frameworks and is worth looking into.
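The MoE offload described above looks roughly like this in llama.cpp; flag availability depends on your build, so check llama-server --help (--n-cpu-moe is a relatively recent addition):

```shell
# Keep attention and shared layers on the GPU, push expert tensors to
# system RAM. --n-cpu-moe N moves the expert (FFN) weights of N layers
# to the CPU; older builds use a tensor-override regex instead, e.g.
#   -ot ".ffn_.*_exps.=CPU"
llama-server -m /path/to/model.gguf -ngl 99 --n-cpu-moe 24
```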
I'm currently using Qwen 2.5 16B, and it works really well.
The benchmark consists of a bunch of tasks. The chart shows the distribution of the number of turns taken over all those tasks.
Does anyone have any experience with these, and is this release actually workable in practice?
They are usually as good as the flagship model from 12-18 months ago. Which may sound like a massive difference, because in some ways it is, but it's also fairly reasonable; you don't need to live on the bleeding edge.
Running this thing locally on my Spark with 4-bit quant I'm getting 30-35 tokens/sec in opencode but it doesn't feel any "stupider" than Haiku, that's for sure. Haiku can be dumb as a post. This thing is smarter than that.
It feels somewhere around Sonnet 4 level, and I am finding it genuinely useful at 4-bit even. Though I have paid subscriptions elsewhere, so I doubt I'll actually use it much.
I could see configuring OpenCode somehow to use paid Kimi 2.5 or Gemini for the planning/analysis & compaction, and this for the task execution. It seems entirely competent.
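OpenCode's config does let you pin different models per agent; a rough sketch of what I mean (the model IDs here are placeholders and the exact schema may have changed, so check the current docs):

```json
{
  "$schema": "https://opencode.ai/config.json",
  "model": "openrouter/qwen/qwen3-coder-next",
  "agent": {
    "plan": {
      "model": "google/gemini-2.5-pro"
    }
  }
}
```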