Posted by alphabettsy 15 hours ago
With Claude, you sometimes want to under-specify or phrase things more indirectly to give a color to the implementation or elicit something creative. Also (you might raise an eyebrow at this) being nice to Claude will be rewarded and being mean to Claude will be punished. Claude tends to mirror your tone more aggressively and you don't want to get into negative loops with it.
With GPT, you have to be precise and reduce ambiguity. GPT will often try to resolve ambiguity in a min-max style "I'm going to do X, but make sure it is not quite Y". It will tend to be more paranoid and overengineer to catch all edge cases if you don't tell it precisely what the scope is.
With Qwen, you have to give it a shape and let it fill it in. Qwen likes XML, JSON and lists. Qwen likes to be shown a bunch of examples of previous work.
This is not scientific at all, just vibes, YMMV.
This is the problem.
I would love to have a product sheet showing what each model's strengths and weaknesses are, so that I can have a clear decision tree of "if this kind of work, use model X", or "model Y should be used in ways Z". But they all look the same from the outside and the only way to figure out which might be marginally better at what is to do extensive, time-consuming, and perhaps expensive testing.
Think of it less like a static tool, and more like a human helper, where the same holds.
That said, I can't wait for LLMs to stop being AI and start being just another tool. Anything cursed with the "AI" label seems to go through this mess. In the earlier AI cycles, rules engines were considered "human-ish" and got hyped up, but today we see them as just another tool available to us, and we're better off for it.
From a horse's perspective, the internal combustion engine is just another tool for making scary noises and powering horse trailers to take me on fun horse adventures. So ... perhaps.
ChatGPT was obedient with the grill-me technique, just wrote a plan. Yesterday it started jumping to implementation. Why?
It's a very very bizarre way to use a computer.
Personally, I just don't. I'll use and prompt the LLMs the way that feels natural to me and move on with my life. Maybe I don't always get completely optimal results from them, but I'm also not spending half my day pleading with the computer to do a task.
The most important thing to be aware of in my opinion would be that Claude is better at UI design, and leaves a lot more comments in the code.
Other than that the results seem similar, at least functionally. I do not usually review the code style.
With humans it's actually good and worthwhile to create and strengthen connections. With an LLM, that's psychosis.
I don't think LLMs are people in any sense, at least as they're constructed now -- but they very much have what we would call "culture" and "personality" in suitably alien forms.
This is not the same as, e.g., feelings, experience, or humanity, or actual opinions or ideas (versus essentially "distilled vibes") and I feel that AI will more and more force us to confront that (including if new AIs are ever developed that may have the latter, as well!)
And if humans are anything, they are tool users.
Can be both. Use of some tools like LLMs might be more psychosis-inducing than use of others like plain compilers or hammers.
>And if humans are anything, they are tool users.
To the point of self-destruction sometimes.
I really don't get it. Why is the fact that it outputs words so goddamn important to everybody? How does it suddenly make you so emotionally vulnerable? Does my brain work in a different way than the rest of humanity? Can't you disregard what's irrelevant? Is every programmer suddenly a Trump supporter that has no ability to recognize empty words? To recognize lies about emotions and facts?
Words are just input. Mostly garbage. Emotion-inducing words are garbage 10 times more often than any other. I could expect a romance reader to be affected, or somebody with an IQ of 70. But how is the caste of some of the most technical people ever afraid of catching psychosis just because they might read some words?
When we built the idea that anthropomorphising is wrong, we meant doing it for rocks or trees or thunder or deer or some such.
We communicate with other humans using voice and three dimensional hand gestures. To use computers and early phones we had to learn to operate new input devices: keyboards and mice. Later with touchscreens we moved to two dimensional hand (finger) gestures. We're barely making voice commands work with our devices just recently.
Then, a large number of humans are figuratively tethered to their desks because the devices need power and a stable internet connection. Mobile devices break this relationship a bit but you still need to charge them and be close to some sort of access point. In any case, the devices encourage sitting in one place for hours at a time.
And this is just computers and smartphones. Humans adapted their entire lifestyles and transformed the landscape to cater to cars.
Was it? Think first about what it replaced. Lots of manual computation in bookkeeping and financial sectors. Telegrams and snail mail moved to email. Typesetting in books and magazines became easier and widely available,…
If there’s one thing that you can’t say about computers, it’s that they’re limited.
The context was that technology should evolve to fit the humans [not the other way around]. And if contemporary technology didn't have limitations, it would be correct.
But it did and humans had to adapt to the computers. Humans had to develop and learn special languages so they could communicate with computers to do all those useful things you mentioned. Why? They were limited in understanding (or parsing) human languages. It took us decades before we could talk to computers in human languages. We're getting pretty close - especially in the past few years - but there's still some friction.
You may need to revisit your computation theory courses. Computers are the embodiment of a mathematical model and thus the inputs and outputs are formalized.
Do you just hold a pen and words are written automatically? Do you just hover your hands over a piano and have the moonlight sonata played? No, you have to do precise mechanical movements because that’s how the output is realized.
There’s no such things as words, sentences, keywords, statements at the computer level. What it does is symbol manipulation. You provide it a string of symbols, the rules for the manipulation, and it will provide a string of symbols as the output.
What symbols, what rules, are completely arbitrary. We just found that {1,0} are all that we needed as the set of symbols and that Context-Free Grammar is perfect for specifying the rules.
We still need to encode everything down to binary (ascii, unicode, bcd, floating points, pixel formats, PCM,…) and use a programming language (as defined by a grammar) to get the computer to do anything. Inference is made possible by those two mechanisms. It’s not a new computation model.
Realising this made me respect the "I" in "AI" a bit more seriously.
Maybe we need better reviewers then?
This presumes that the labs themselves know how well their models perform. But all they have are overtuned benchmarks and hype vibes.
Admittedly, yes, there's some overlap there.
They would have to admit 'seen it in the training data' as a factor, and that opens a can of worms.
They do not test how models perform when used interactively, like most of us do.
I recommend everybody do this because you don’t need any special data except what you are already using, and the results will be very eye opening: there is WAY more randomness or instability involved than you would otherwise assume. A lot of what you might think is a better prompt technique, or a particularly good or bad outcome, could just as well be random chance or just different behaviors across model version or sizes. And your results can be massively biased by small differences in input. We’ve been calling some of these “magic words” at work, specific technical terms or references/techniques that you need only mention to get vast improvements in outcome.
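A minimal sketch of that kind of repeat-run check, in Python. `fake_model` is a hypothetical stand-in (a weighted random choice, not a real API); the idea is only to show how you might count how often identical prompts agree once you plug in whatever client you actually use.

```python
import random
from collections import Counter


def fake_model(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a real model call; swap in your own client."""
    # Pretend the model gives answer "A" most of the time, with some drift.
    return rng.choices(["A", "B", "C"], weights=[0.7, 0.2, 0.1])[0]


def stability_report(prompt: str, runs: int, seed: int = 0) -> dict:
    """Run the same prompt `runs` times and summarise how often answers agree."""
    rng = random.Random(seed)
    counts = Counter(fake_model(prompt, rng) for _ in range(runs))
    top_answer, top_count = counts.most_common(1)[0]
    return {
        "distinct_answers": len(counts),
        "top_answer": top_answer,
        "agreement": top_count / runs,  # share of runs matching the modal answer
    }


report = stability_report("Summarise this diff", runs=200)
print(report)
```

If the agreement figure for an unchanged prompt is already well below 1, a single good or bad outcome after a prompt tweak says very little on its own.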
There’s a skill to it. With agentic loops if you get the model into a self-eval structure where it’s hard to cheat or take shortcuts, and it’s in the right structure or domain that models its training, you’re golden. But it’s hard to find the sweet spots (pro tip, have Opus 4.8 convert PyTorch models into ONNX or quants or get them running on different hardware, I swear it was like I activated some kind of savant-like skillset; meanwhile I can’t for the life of me get it to properly write/test EBNF formalizations of common languages and formats without cheating).
The worst part is that it changes so much so frequently that it’s almost useless to really go digging for this kind of knowledge unless you’re actually the one training the models. I wish this kind of “stability” in output was more emphasized in their training so they’d be predictable. I assume it’s hard to do without overfitting or breaking the explore-exploit loop but also, I would spend so much more on LLMs for batch workloads if they could do them more reliably…
It can be frustrating. The AI pretends to be a human, and so a part of my brain expects them to commit and have a "parti pris" like a human, so the exercise is a good reminder of the feedback loop. My mental model is that before the first three or four messages, the model has many finer points of its personality still underdetermined. I'd suggest that as the mechanism for "role-based prompting". And it explains the "savant sleeper agent" thing you describe. You want to get the state in the right attractor on the manifold.
These machines are pretty incredible, but for conversation-driven workflows you really have to be in the driver's seat. A human has a property that the AI does not have, at least under current architectures: we are regulated by the outside world. A bit of a tangent, but I can see how AI psychosis arises from these dynamics.
> We’ve been calling some of these “magic words” at work, specific technical terms or references/techniques that you need only mention to get vast improvements in outcome.
Any chance you could share some of these? Seems like something we could all benefit from.
First, I almost always try to seed every new project or context/domain with canonical technical specifications or examples I found elsewhere. When I set up this project recently, I linked to a bunch of the official Apple docs for sysctl, and told it to use a specific technique for calling assembly code from Go, that from experience it almost never realizes it can do or knows about (and similarly for sysctl, I knew it kinda sorta knew about it, but not in its entirety): https://github.com/accretional/sysctl/commit/da52438233e5b33...
The other thing I did was tell it to enumerate all the test cases ahead of time rather than to just directly implement them; again this is something where you have to explicitly tell it to go digging for information where it has blind spots and get it to set up properly grounded self-eval in a way that it can test against. I usually tell it to take notes as it works or commit notes to itself that will persist over sessions: https://github.com/accretional/sysctl/blob/main/FINDINGS_2.m...
Once we get back to working on this project we'll just have it implement / validate the rest of the sysctl feature support against the full inventory we had it uncover: https://github.com/accretional/sysctl/blob/main/cmd/darwin-n...
Another thing we do is have it specify an API that it can produce against; then in other projects we have them consume the API via reflection (and our special sauce we've been working on is the ability to discover and integrate against these automatically across thousands of APIs from many providers, which we've got working and can share if you're interested in using it as an early customer): https://github.com/accretional/sysctl/blob/main/proto/sysctl... This isn't the greatest example because it doesn't actually fully specify the sysctl keys yet. But I did have it create a knowledge base trying to cover the 1000+ keys as best as it could, to reference as it continued: https://github.com/accretional/sysctl/tree/main/macos-sysctl...
We have a better example in eg https://github.com/accretional/proto-sqlite/tree/main/lang where we were able to encode the entire sqlite grammar into a grpc interface so that you could eg find the exact structure (and sanitize) of a select statement: https://github.com/accretional/proto-sqlite/blob/main/lang/p... This way integration and discovery becomes a matter of telling it "use reflection against this endpoint to discover the sql interface, then implement against it" and we can model formats/input validation as formal grammars via EBNF (all magic words) vs just adhoc
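To make the "formal grammar vs ad hoc" point concrete, here is a toy sketch in Python. The grammar is an assumption made up for illustration and is far smaller than real SQL or the sqlite grammar in the linked repo; `tokenize` and `parse_select` are hypothetical names, not part of any of those projects.

```python
import re

# Toy grammar (illustrative assumption, not the sqlite grammar):
#   select  = "SELECT" columns "FROM" ident ;
#   columns = "*" | ident { "," ident } ;
#   ident   = letter { letter | digit | "_" } ;
TOKEN = re.compile(r"\s*(?:([A-Za-z][A-Za-z0-9_]*)|([*,]))")
KEYWORDS = {"SELECT", "FROM"}


def tokenize(text: str) -> list:
    """Split input into tokens, rejecting any character outside the grammar."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1) or match.group(2))
        pos = match.end()
    return tokens


def parse_select(text: str) -> dict:
    """Recursive-descent check of the toy grammar; raises ValueError on mismatch."""
    toks = tokenize(text)
    pos = 0

    def expect_keyword(word: str) -> None:
        nonlocal pos
        if pos >= len(toks) or toks[pos].upper() != word:
            raise ValueError(f"expected {word}")
        pos += 1

    def expect_ident() -> str:
        nonlocal pos
        if pos >= len(toks) or toks[pos] in ("*", ",") or toks[pos].upper() in KEYWORDS:
            raise ValueError("expected identifier")
        pos += 1
        return toks[pos - 1]

    expect_keyword("SELECT")
    if pos < len(toks) and toks[pos] == "*":
        columns = ["*"]
        pos += 1
    else:
        columns = [expect_ident()]
        while pos < len(toks) and toks[pos] == ",":
            pos += 1
            columns.append(expect_ident())
    expect_keyword("FROM")
    table = expect_ident()
    if pos != len(toks):
        raise ValueError("unexpected trailing input")
    return {"columns": columns, "table": table}


print(parse_select("SELECT id, name FROM users"))
```

Anything the grammar does not describe, such as a trailing `; DROP TABLE users`, is rejected at the tokenizer or parser rather than slipping through a string check.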
We also tell it to set up and use a browser automation toolkit/testing and always run it at the end of testing workflows (often in a way that auto-opens screenshots on our local machines + commits them to git) via tools like https://github.com/accretional/chromerpc#headlessbrowser-aut... so that whenever we produce UIs it can evaluate its own output and iterate without direct human intervention. This is another case where the knowledge-discovery problem becomes a problem so we tell the models to use reflection to discover the browser automation apis. That ends up giving us things like this where it records user journeys through sites and creates visualizations without us having to debug them or do them ourselves: https://github.com/accretional/proto-css/tree/main/chrome-te...
Do note that I only use LLMs in the ChatUI, I never use agents. I don't believe having a blackbox codebase managed by entities with a half-life of 'delete conversation' or 200k tokens is a responsible idea. In ChatUI, I lay the ground rules, kill assumptions about our working relationship, give it foundational context on the problem and codebase we're working on, explain the problem and then we have a conversation about it and I gradually and logically disclose more context as it becomes relevant. So, to directly answer your question, maybe I'm missing out on a ton of upside by not using the absolute best but I'd say familiarizing yourself with a specific model has all the benefits of having a human friend you've grown up with... except your buddy's a savant and would absolutely love to help!
Or it is more like playing a slot machine and you imagine the rest.
Maybe it works some of the time but it isn't a solution that works every time.
It reminds me of people hovering to play a slot machine when someone gets up and it hasn't paid out as if they've solved slot machines.
While I don't mind putting something in a loop until the tests pass, I'm less comfortable doing that when providers are silently rerouting to lower quality models, or in Google's case burning quota faster to ease their own server load without being transparent about what the "standard limits" are to begin with. [1]
I'm hopeful I'll be more comfortable with these "slot machines" when frontier models get to the point where they can be run locally on hardware I can actually afford so I know exactly what I'm getting and not jumping at shadows with providers playing tricks behind the scenes to ease their own load without admitting the customer is getting less for their money as they get more popular.
[1]: https://support.google.com/gemini/answer/16275805?hl=en&sjid...
Last I saw, engineers working at OpenAI denied this on HN.
I saw that someone set up a tracker that aims to record the performance of the models, and so far it has not shown any statistically significant deviation in performance for Codex, and not yet enough data for Claude: https://marginlab.ai/trackers/codex/
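For a sense of what "statistically significant deviation" could mean here, a rough sketch using a standard two-proportion z-test on pass rates from two periods. This is not how the linked tracker works (I don't know its method); the function name and the pass counts are made up for illustration.

```python
import math


def two_proportion_z(pass_a: int, n_a: int, pass_b: int, n_b: int) -> tuple:
    """Return (z, two-sided p-value) for the difference between two pass rates."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both periods passed (or failed) everything: no measurable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: 140/200 tasks passed last month, 130/200 this month.
z, p = two_proportion_z(140, 200, 130, 200)
print(f"z={z:.2f}, p={p:.3f}")
```

With those made-up counts a five-point drop in pass rate is well within noise, which is why "it feels dumber this week" is hard to confirm without a lot of runs.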
The firm [Anthropic] would deliberately degrade the model’s performance in ways that were invisible to the user.
For such a thing to be useful, it's enough that it works substantially more often than not having those instructions in.
Playing a B on a saxophone always plays a B.
But your analogy remains solid if you substitute e.g. a piano and a reasonably proficient player. A single note would be nearly indistinguishable between players... But a full piece most certainly will sound different.
The original take was "LLMs are very much like playing an instrument". I think they are very much NOT like playing an instrument.
While different musicians will produce different results, one musician won't get drastically different results on different days or when trying a different "copy" of the same instrument. If you can play the violin on your violin and I lend you my violin, you will still be able to play very consistently. You may argue that the sound will differ and you will have to adapt slightly, but that's not remotely similar to the randomness coming from LLMs.
That's only if both violins are tuned the same way, and one must continually tune them lest they get out of sync.
Similarly, an LLM can be extremely consistent if tuned properly -- indeed, if you fix the weights and settings, they can be made "essentially deterministic" for many prompts!
This is because LLMs have aspects of chaotic dynamical systems, where small changes in initial conditions can lead to vastly different outcomes. That property is independent from nondeterminism.
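The distinction between "deterministic" and "insensitive to input" can be seen in a few lines of Python. This uses the textbook logistic map purely as an analogy, not as a model of an LLM: the same starting value always gives the same orbit, yet a tiny change in the starting value gives a completely different one.

```python
def logistic_orbit(x0: float, steps: int, r: float = 4.0) -> list:
    """Iterate x -> r * x * (1 - x), a fully deterministic update rule."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs


a = logistic_orbit(0.2, 100)
b = logistic_orbit(0.2, 100)           # identical input
c = logistic_orbit(0.2 + 1e-10, 100)   # input nudged by one part in ten billion

# Same input, same output: no randomness anywhere.
print("repeatable:", a == b)
# Nearly the same input: the orbits drift apart until they are unrelated.
print("largest gap:", max(abs(x - y) for x, y in zip(a, c)))
```

So fixing the seed and settings addresses the nondeterminism, but not the "small differences in input, big differences in output" behaviour people also complain about.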
You know what we are talking about. Tuning, poor playing, all of that is mild variation from what we know it is supposed to do every time and we can target the notes they are supposed to hit consistently. You're comparing slight tonal variations to completely different outputs from the same inputs. If I hit a "C" on the piano, it is going to play "C." If it does not, then the piano is not functioning properly. LLMs for some reason get a pass on this and it makes them very distinct from musical instruments.
This feels like a very nitpicky steel man, not a productive attempt at discussion.
LLMs do not operate consistently and make their own errors while we argue about which incantation makes it less inconsistent, knowing it will never actually perform as expected.
I played woodwinds regularly for 15 years so I feel fine with my example.
Instruments present a clear interface to a user, have predictable outputs, etc.
The only comparison that might work for me is that LLMs are very bad instruments where you are constantly forced to negotiate its idiosyncrasies in order to massage the output you want from it, and even then there is enough randomness that trying to do so is almost a fool's errand.
I also think it's disingenuous to call LLMs "tools" in the stricter sense of the definition, but I've mostly given up trying to convince people of this. Main reason being that a terrible writer and a gifted writer can produce similar outputs, and for the terrible writer it will be above their average, and for the gifted writer it will be below what they could produce with full control.
> With GPT, you have to be precise and reduce ambiguity. GPT will often try to resolve ambiguity in a min-max style "I'm going to do X, but make sure it is not quite Y". It will tend to be more paranoid and overengineer to catch all edge cases if you don't tell it precisely what the scope is.
I agree with all of this except for one thing: I swear to god, being mean to Claude at the right time can be enormously effective. The F-bomb in particular seems to really help it snap out of ruts sometimes.
No, don't f***ing do that! What part of "[previous instruction]" don't you f***ing understand? I am extremely angry and disappointed by your inability to [whatever]. Do better please.
> maybe it could trigger more defensive responses with argumentation to explain its conclusions.
Quite the opposite, it makes the model extremely conciliatory, which in this situation is what I want. If you're hoping to make the model less sycophantic, this is the wrong tool.
I asked Claude Sonnet 4.6 for the same thing, and Claude's version was like if the game had been written in JS originally.
Also, for some reason it made it a single HTML file, removed all assets, dynamically generated graphics and dynamically generated music. It also gave me a new, better background.
This surprised me, since it was not what I asked for. I just asked it to port the game.
I was pretty pleased about the choices it made, but I'm not sure how to turn that behavior on and off. Sometimes you want it to be creative, sometimes you want it to actually do what you said.
You have to be a lot more explicit but it’s hard to know a priori what decisions it’ll make. A good idea is to run it in plan mode so you can read those decisions before it sets out on a path and have an opportunity to make corrections.
What I do know absolutely for sure is that LLM benchmarks are not to be trusted, they are just a minor indicator and real world usage is often very different.
It's like taking the engine out of each car, putting it on a test bed, running it, and then deciding whether the car is good or bad based on the graphs the test bed provided.
You might have the best engine in the world, but if you put it in a shit car, the result is still bad. The seats are squeaky plastic, the infotainment is touch-only and you can't put on your seatbelt without knocking down whatever is in the cupholder.
It is a fundamentally hard problem to solve.
Take AI out of the picture for a moment. What makes someone a good coder? What makes someone intelligent? How do you evaluate those skills?
Of course we have standardized tests, and they're useful, but they're also imperfect. And they become especially imperfect when people start training for the tests specifically—which is, essentially, benchmaxxing.
We have never been able to quantitatively measure most skills to a high degree of accuracy, despite centuries of trying. That's not going to change now.
(I don't mean to anthropomorphize the LLMs, but I do think they're like humans in this way.)
When I use a calculator, I know exactly what it does and what it is supposed to do. It always gives me a verifiable, predictable result. If I input “8+8” 10,000x it will give me “16” 10,000x outside of incredibly fringe edge cases/bugs. I can’t say the same for LLMs.
With a squillion dollars at stake per bench point, someone will have figured out a plausibly deniable way to game these benchmarks.
So almost by design that particular benchmark is tainted, and benchmarks recall rather than reasoning.
I haven’t seen details of LLM benchmarks’ data sets, but I would suppose that the “questions” are public and so known in advance, which means you can tune a model to them as much as possible.
One real benchmark is drawing a pelican - https://github.com/simonw/pelican-bicycle - which Simon Willison made for his LLM tests.
If you really want to find a model that works for your specific purpose, I would recommend several rounds at arena.ai - comparing models anonymously helps you find one without confirmation bias.
Some ppl: Claude is the best! Others to them: but Qwen is the best! Or… Codex is better! …
it all depends on the language (English, Dutch, French…), style of querying (caveman, specs, skills, goal etc.)
Even with the same model I get different answers to same prompt that is just tweaked a little.
So benchmarks are nice but mostly useless.
Without your usecase it is just a reference number indicating the approximate position of that model among the others. And for those who want to make money it is a marketing tool to sell more as every customer counts.
One good analogy is the Macbook vs generic windows laptop debate online.
The engineer mind just compares numbers, the Lingwoo laptop from Amazon has biggest numbers for everything and the lowest price. Ergo it is the best.
But the numbers don't measure the fact that the Lingwoo creaks and squeaks when you lift it due to the cheap plastic. It also runs at 100C when both CPU and GPU are fully utilised. The keyboard feels like a membrane keyboard from a milspec device from the 90s. Numbers also don't measure the fact that Lingwoo is an alphabet soup whitelabel manufacturer that won't exist in any legal capacity in 6 months so good luck with any warranty issues.
There will be an identical laptop called Chongwin being sold though. Completely different company, definitely.
--
The same applies to LLMs. You can do benchmarks like ask them to one-shot different kinds of gotcha questions (car wash, strawberry and other idiotic ones) or get them to write different kinds of programs.
But that doesn't measure the UX of doing so at all. How many times do you actually need any of those when you're actually working?
It's like unit testing an application. Every function can have 100% test coverage and the app can still be shit because there are things you can't unit test for.
One can always measure whatever they wonder about. It doesn't mean the measure will be trustworthy, or that anything built on it will be any better than wet-finger judgement.
Even songs that break the "rules" of music can be subjectively good, either because they broke the rules or despite it.
Or with cars, a car that's beautiful to one person is the ugliest piece of trash on the street. Some people want a super soft ride where their espresso martini doesn't even vibrate when gunning it through a gravel road and others want to feel every grain of sand on the asphalt in their buttocks. Neither is "correct" and there is no objective measurement for ride comfort.
I have my own "interview questions" for models where I give them a premade Git repo and a problem to solve. Then, I rate them like a teacher. I believe others do that as well, so we only need a reliable system to aggregate these results.
The only way to make it fair is to have the model provider give some benchmarking org the weights + inference engine, so that the model can be run in complete isolation and no information about the benchmark is leaked.
Though I guess for a 'random' person's benchmark that hides between all other requests it's probably ok.
It's hard to decide when to use the best tool for a job you are aware of to ensure throughput and when to spend time experimenting with a new tool to learn what it's good at.
This has been my experience with most models. If you say "How do I do X? I was thinking maybe Y or Z" then the model will probably try to make Y or Z work. They will very likely not say some third option that is wildly different is better, even if it may be. And actually maybe less so with Claude because sometimes it pushes back.
Actually this seems like it would be an interesting test. Maybe I will come up with some contrived question and ask several models.
Fable seemed less apt to do so but I didn't get enough time with it before it was yanked away to know for sure. It may have had mixed results on the benchmarks but it was finding bugs opus never found.
there was something on HN a few weeks ago about how most/all models perform better the more rude you are to them.
(i still say "please", i can't help it)
IME Claude is the most "creative" of the bunch, you can get surprising ideas out of it that were kinda tickling the back of your head but didn't really connect.
BUT it's also "relentlessly proactive" like simonw put it. It _will_ get the job done, it's the smartest idiot in town. Why use a library to parse $format when you can just write a custom 1000 line parser? Or if it can't access something, it'll pursue the goal of accessing it in the most creative ways - instead of stopping, asking the user "yo, can you give me access to X" and then continuing.
My solution is to use Claude as a pair programmer. I _very_ rarely just do /goal fix this shit, I watch what it does and interrupt if it gets to the "smart idiot" phase. Also I communicate with it like I would a coworker, never had it berate me or get combative. There's a Finnish proverb for that too[0]
As for Codex, Deepseek, GLM, those I use when the goal is 100% clear like "convert this Brewfile to a list of packages for Arch and Debian, use these two Docker containers to test that pacman and apt work correctly". Boom, done.
But I won't give any creative open-ended tasks to any other model than Claude.
[0] https://en.wiktionary.org/wiki/niin_mets%C3%A4_vastaa_kuin_s...
On one hand I’m glad Anthropic is only just now starting to get into infrastructure because it means there’s opportunity there, but it’d be great for their models to be more knowledgeable or able to seek out that knowledge on their own, or for the UX of Claude code to be more amenable to launching 5 in parallel and picking the best one, so I don’t have to spend time arguing with a robot. I think there’s a much better balance to strike between just charging ahead towards the goal at all costs vs being lazy and pushing everything back up to the user. Basically they write too much code that’s too contingent/brittle outside its exact current context and don’t do a good job distilling out the essence of the problem “cleanly”. Almost all of them are like this right now, it’s partially a problem with long-range planning but I think a real bias from over optimization for certain RLVR outcomes vs others.
Gemini CLI at work has the same issue: it'll prefer hacking your workstation over just asking you how to proceed.
I think the harnesses are set up to have a bias to action, otherwise the LLM would just stop all the time when doing trivial tasks, but it also means they'll keep going when the "obvious" path is to just prompt the user.
I often tell it to stop asking me and just keep going until it accomplishes X task; unfortunately it tends to assume I want something that only just barely works, in the sense that it means it's time to stop once it's there, which I don't think a harness by itself could easily address (ultimately the model itself needs to determine the stopping points unless I literally specify by hand hidden evaluation criteria).
That's why I think it's at least partially a training issue where the model gets rewarded for "solving" the problem within a certain amount of context/time without access to grounded knowledge (eg looking up the actual spec for a format) nor adversarially/rigorously evaluated against a reviewer capable of finding all the edge cases/shortcuts preventing something from being a properly generalized solution. I don't want it to ask me for guidance when it's working on a well-specified problem, I want it to either find the right parser and use it, or to completely implement one against the spec, rather than write some half-assed string inserter that eg only works on the specific select statements my examples use right now. My understanding is that the Mythos/Fable models were better trained for this but from my brief foray into using Fable for work I wasn't that impressed. For me they need to get better at agentic search and self-eval still.
Having a reliable shared memory for hundreds of agentic AI users is something that's 95% snake oil at the moment. There are a few successes on an individual level (I really like Hermes[0]) but nothing scales to a company level easily.
It should be possible to (pre)configure all agentic harnesses used in a company to use a single source for information so that it'd automatically pick up internal libraries, conventions, licensing decisions etc and remember them across sessions.
I've had limited success with this on a personal level, but it's still not ingrained in the model because it would really need a custom harness. Hooks, skills, prompts get you like 80% of the way. I still need to do a "please check that the project matches the conventions defined in ..." regularly to catch any drift - especially on more vague stuff that can't be locked down with unit testing.
If you can't show ROI there's literally no reason to ever switch anything.
this is what 'tokens are commodities' and 'there is no moat' people miss. the models are in general not easily swapped out. you always have to run evals before you can swap them around, tune prompts etc. even minor versions of models from same providers need this process.
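A bare-bones sketch of the "run evals before you swap" step, in Python. Both model functions are hypothetical stand-ins (plain string functions, no real provider API), and the cases and tolerance are made up; a real eval set would be your own prompts with your own checks.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]
Case = Tuple[str, Callable[[str], bool]]  # (prompt, check applied to the output)


def current_model(prompt: str) -> str:
    """Hypothetical stand-in for the model currently in production."""
    return prompt.strip().lower()


def candidate_model(prompt: str) -> str:
    """Hypothetical stand-in for the replacement; it forgets to strip whitespace."""
    return prompt.lower()


CASES: List[Case] = [
    ("Hello", lambda out: out == "hello"),
    ("  Padded  ", lambda out: out == "padded"),
    ("MIXED case", lambda out: out == "mixed case"),
    ("Trailing ", lambda out: out == "trailing"),
]


def pass_rate(model: Model, cases: List[Case]) -> float:
    """Fraction of cases whose output satisfies its check."""
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)


def safe_to_swap(old: Model, new: Model, cases: List[Case], tolerance: float = 0.0) -> bool:
    """Allow the swap only if the new model is no worse than the old one, within tolerance."""
    return pass_rate(new, cases) >= pass_rate(old, cases) - tolerance


print(pass_rate(current_model, CASES), pass_rate(candidate_model, CASES))
print("swap ok:", safe_to_swap(current_model, candidate_model, CASES))
```

The point is only that the gate is defined by your cases, not by a public leaderboard, so even a minor version bump from the same provider has to pass it.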
> this is phenomenal work, genuinely! I feel like you read my mind! <next instruction here>
can go a long way.
of course, I would only say that when I mean it, because Claude can get superficial and cut corners which is why I prefer GPT for raw implementation.
... That does sound like something that Anthropic would deliberately aim for, yeah.
> With GPT, you have to be precise and reduce ambiguity.
I have found that it occasionally makes a wild misinterpretation, that makes a bit of sense in retrospect given how I worded something but is still surprising.
It also sometimes tries to loop in and tie together ideas from earlier in the conversation that really shouldn't still appear relevant. But that might be a general LLM thing.
Classify under non-reproducible artifacts of LLM generation.
> These products use very low level Linux primitives like containers, Kubernetes, Firecracker microVMs, and networked protocols.
Out of anything that is a "low level linux primitive" I could maybe argue that networking? protocols fit the bill.
And it's obviously fully AI-generated! Which I wouldn't even care about if I could actually trust the content, which I can't!
The post is not AI generated, I use AI for code generation and write my own articles.
Which part of the post are you struggling with? This is a post describing our own experience and journey. Happy to back up any specific claim.
What model are you again?
I actually find this somewhat interesting, because it seems that a lot of people who weren't comfortable with expressing themselves verbally are feeling more empowered in that area. We're hearing new voices for the first time, albeit heavily-filtered ones, and I have to believe that's a good thing.
But part of me still finds it offputting for some reason. It's interesting to think about whether that's more of a "you" problem, or more of a "me" problem.
ChatGPT and Anthropic will never, ever get me to tie my Health Data to their systems, but I still believe in the capabilities of AI in identifying patterns from data I would otherwise overlook, and sorely want a local-only ecosystem where I can expose this data safely, privately, and securely to something like Qwen or Gemma for processing.
Same goes for Smart Homes, and Personal Assistants. The corporate approach of letting Company A access your data stored at Company B and processed by Companies D and E while also sold to Advertisers and Data Brokers with no way for you to extract or view it on your local hardware - just isn’t tenable for these sorts of intimate use cases. I want my data to be owned and controlled and exposed on my terms, to be used to improve my life first rather than someone else’s bottom line. I want technology to give me back more of my time and improve my outcomes again, and I’ve been burned enough by Big Tech in the past that I flatly reject any presumption of nobility or public good from their AI-as-a-Service business model.
The capability is there, and I definitely think the folks working to build local tooling that supports and unlocks the potential for local models are the ones in the right. I love seeing what they build.
The strength with AI is with open-source models. We need to keep away from vendor lock-in and use models that allow both local usage and hosting by independent providers.
IMHO, the author could have done two things better:
- vllm instead of llama.cpp. With NVIDIA HW, there is huge difference in multi-user loads and caching with vllm; when he was complaining about what happens when more than one user uses the model, and about losing caching, I was "well, duh".
- The budget he used for a single card could have instead been put to far, far better use with SPARKs. I have access to a cluster of 2 x GX10 - total cost less than half what he paid, even today - and I am running vllm and Deepseek v4 Flash. The difference compared to any Qwen is tremendous - I've NEVER seen it loop, and in all my experiments so far, it's the most Sonnet-y model I've ever tried (antirez seems to agree, hence his ds4 fork).
If you're wondering about how I set it up in the 2 GX10s: https://forums.developer.nvidia.com/t/deepseek-v4-flash-offi...
Performance: 2K t/s prefill ( very useful for feeding tons of source code into its massive context window ) and around 50-60 tg/s in my coding sessions in the pi.dev harness. With the money the author paid, he could have bought 4 GX10s, and double both numbers ( vllm basically scales almost linearly with tensor parallelism ).
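To make those numbers concrete, here is a back-of-the-envelope sketch of what they mean per turn. The 100K-token prompt and 1K-token reply are made-up sizes, 50 t/s is the low end of the quoted decode range, and the "doubled" case simply takes the near-linear scaling claim at face value:

```python
def turn_seconds(prompt_tokens: int, output_tokens: int,
                 prefill_tps: float, decode_tps: float) -> float:
    """Rough wall-clock for one turn: ingest the prompt, then generate the reply."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# Hypothetical workload: a big chunk of source code in, a short patch out.
PROMPT_TOKENS = 100_000
OUTPUT_TOKENS = 1_000

two_boxes = turn_seconds(PROMPT_TOKENS, OUTPUT_TOKENS, prefill_tps=2_000, decode_tps=50)
# Assumption: both rates double with 4 boxes, per the scaling claim above.
four_boxes = turn_seconds(PROMPT_TOKENS, OUTPUT_TOKENS, prefill_tps=4_000, decode_tps=100)

if __name__ == "__main__":
    print(two_boxes)   # 70.0 seconds: 50 s prefill + 20 s decode
    print(four_boxes)  # 35.0 seconds
```

With a prompt that large, prefill dominates the turn, which is why the prefill rate matters more than the headline generation speed for code-heavy contexts.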
It's the right call for concurrent batched serving (barrkel's point downthread is spot on), but for how we use it llama.cpp is still better for us.
The Spark/GX10 route is a genuinely different bet though and appreciate you sharing your numbers. At the time (several months ago) the consensus was that GX10s were for fine-tuning only, and the numbers were severely low.
..and the card was never about replacing a Claude Max sub. For the workloads we actually bought it for, it's giving us 140-200 tok/s (which matters).
But mostly I wanted to raise awareness to readers of your article that no, if you want to do inference, paying 15K for a single 96GB card almost certainly makes no sense. Buy 4 GX10s with the same money, and enjoy dramatically better models and user scalability.
Regardless - thanks for putting the effort to share your findings! I keep postponing doing the same... there's tons of things everyone is re-discovering on their own.
The one thing I feel it seems to underestimate is the likelihood of improvement. Even the authors acknowledge it's not even worth comparing local models from a year ago to what we have now. In fact, people widely see Opus 4.5 in November last year - 8 months ago - as the first time agentic coding became broadly viable even with frontier hosted models.
So why would we lock in hard on any concept at this point of what a local model is and isn't good for? Whatever it is right now, it probably won't be that in a year. It might be naive optimism to think we'll ever get to long horizon tasks with models that run on consumer / pro grade hardware. But so far the naive optimists are winning.
You can try it by using Opus through Github Copilot vs official Anthropic tools. You'll get very different results and experience (in my opinion).
And that's if we assume that the vscode GHCP default Agent ("Local") is the same as the "Copilot CLI" one that is also selectable in vscode. I have not tried that one.
A few weeks ago the Claude Code Agent SDK was much better than the default Copilot Agent, but nowadays I am not sure.
Some people would claim they are already far better than CC and Codex.
I like how especially the Claude Code CLI version communicates how it's progressing, something they hide a lot more on the desktop app for example.
I have found claude models, especially fable, to be impossible to work with when the work requires reading papers from days ago and reasoning on top of the findings in them. I have multiple long sessions with opus (not as many with fable as it got taken down quickly) where it keeps fighting me on problems, saying "that's not how it works" / "that is not possible", followed by me linking the paper (after i've told it to actually read up on the latest research in this field), and it hits me with the usual "You were right.". If your workflow is using the exact tools, frameworks, git layouts that claude expects, it can be magical, yes. But it is very heavily optimized to never say 'I am not sure' (as that gives 'bad vibes') and instead lean on its (nowadays with the speed of things DOE) knowledge to formulate a reasonable sounding answer, dissectible only if you already know the answer beforehand (which defeats the purpose of using it in the first place).
Qwen3.6 27B (the only <100B model worth looking at in my experience) is dumb, knows it, and will fight tooth and nail to complete the task it was given, gaining the needed context (online or file-wise) in the meantime. If you mention it should read papers, it goes and reads a pile of papers. If you tell it 'implement MCP in my app', the result will (probably) be catastrophic. If you instead describe where the feature should sit, how it should handle edge cases, what use cases it needs to attend to, and to first look online for reference implementations, it does it and does it well.
Knowing what is in context, what should and shouldn't be there, and how to manage it for the specific model you are using (as every model, even in the same family, behaves differently to differently worded prompts) is what makes or breaks them. They are just auto-complete, they complete text based on what is already there, it's not magic.
So yes, while these small open-weights models are not opus 4.5, they're good precisely because of that, because they are a good tool and a bad 'coworker replacement'. If you want the latter, kimi is already there, it has started to not believe the user and do what it was taught just like claude models (which is helpful when you don't care about implementation specifics or performance/security). GLM models (mostly 5.1, i haven't tested 5.2 extensively yet) have fixed a lot of low-level programming issues I've had where opus just walks in circles and writes reports that "it doesn't/can't work". That is to say, open-weights, in many cases, have already surpassed Opus. I can't comment on gpt 5.5, but while I used 5.4, it also performed a lot more tasks without being fussy than opus 4.6/4.7.
I genuinely do not understand why people not only just put up with this but also pay _a lot of money_ for the _privilege_ of doing so.
It's like having _the worst_ colleague but you actually go out of your way to talk with the guy. Why.
It's like buying a car: I drive that car and get attuned to its characteristics; I don't think how that car (or similar cars) may improve. That's my tool and I want to make the most of it.
It is true that switching local models is technically very cheap, but there's a considerable time investment in squeezing the most out of one, which may not carry over to a newer version of that model.
I do however now know that they're a totally cool dude building stuff physically and as software + that other people give them money for it.
Does that have anything to do with the topic suggested by the headline? Not sure.
Is it bad software? Idk. Probably not.
Should you treat it as a grassroots Foss thing maintained by fellow sane hackers? No sir.
Where they shine is in your ability to control them, their privacy, their predictability (e.g. if you are doing a repetitive task, like classifying your photo/video library), and depending on your energy bill - their costs.
It would have 99% reliable tool calling - and most importantly - the ability to go "this task is beyond my skills" and refer to a Big Boy Online Model in a gigantic datacenter somewhere.
This way all of the simple stuff would be done on-device, gathering data, figuring out the context of the problem etc. And when that's done, the "smart" model would come in to work on the issue when all of the easy stuff is already done.
It feels super stupid that my /commit skill calls an online model when that is something a local model can 100% do. Mostly this is a harness issue though and mostly solvable.
Qwen 3.6 27B can do that today, but only when set up properly and in a good quant. I run an autoround [0] with weights in int8 and attention heads in f16 on a single RTX 6000 Pro Blackwell Max-Q via vllm with mtp=2 and full context, --max-num-seqs 3, KV in f16, mamba f32.
>It would have 99% reliable tool calling
I managed to score 93/100 in tool-eval-bench [1]. For me this is very good already, at least in the pi coding harness I've never had an issue that wasn't auto-fixed in the next turn(s).
>the ability to go "this task is beyond my skills" and refer to a Big Boy Online Model in a gigantic datacenter somewhere
This is heavy on the harness engineering side I think, but also quite contrary to the nature of LLMs today. If you figure this out I'd love to know.
[0] https://huggingface.co/Minachist/Qwen3.6-27B-INT8-AutoRound/...
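For anyone wondering what "auto-fixed in the next turn(s)" can look like on the harness side, here is a rough sketch, not how pi or any particular harness actually does it. The model is a stub that emits truncated JSON once, and `call_tool_with_retries`, `flaky_model` and `TOOLS` are all hypothetical names:

```python
import json
from typing import Any, Callable, Dict, List


def call_tool_with_retries(model: Callable[[List[str]], str],
                           tools: Dict[str, Callable[..., Any]],
                           prompt: str, max_turns: int = 3) -> Any:
    """Ask for a JSON tool call; on a malformed call, feed the error back and retry."""
    messages = [prompt]
    for _ in range(max_turns):
        raw = model(messages)
        try:
            call = json.loads(raw)                 # may raise JSONDecodeError
            tool = tools[call["name"]]             # may raise KeyError / TypeError
            return tool(**call["arguments"])       # may raise TypeError on bad args
        except (json.JSONDecodeError, KeyError, TypeError) as err:
            # The error text becomes context for the next turn.
            messages.append(f"Tool call failed: {err!r}. Emit corrected JSON.")
    raise RuntimeError("no valid tool call within max_turns")


# Stub model (assumption): broken on the first turn, valid once it sees the error.
def flaky_model(messages: List[str]) -> str:
    if len(messages) == 1:
        return '{"name": "add", "arguments": {"a": 1'  # truncated JSON
    return '{"name": "add", "arguments": {"a": 1, "b": 2}}'


TOOLS: Dict[str, Callable[..., Any]] = {"add": lambda a, b: a + b}

if __name__ == "__main__":
    print(call_tool_with_retries(flaky_model, TOOLS, "add 1 and 2"))  # 3
```

A 93/100 tool-call score plus a loop like this is roughly why a single bad call rarely surfaces to the user: the failure costs a turn, not the task.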
So far the best results I’ve got have been using a much smaller local model as a simple classifier, that makes a call based on the system prompt and incoming prompt where to route it. It works okay, still a long way to go though
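Roughly the shape of that routing setup, as a sketch only. The keyword-based `toy_classifier` stands in for the small local classifier model, and `local_model` / `remote_model` are placeholder callables, so none of these names refer to a real library:

```python
from typing import Callable, Tuple

Classifier = Callable[[str, str], str]   # (system_prompt, prompt) -> "local" | "remote"
Model = Callable[[str], str]


def make_router(classify: Classifier, local: Model, remote: Model
                ) -> Callable[[str, str], Tuple[str, str]]:
    """Build a route() function that sends each prompt to the local or remote model."""
    def route(system_prompt: str, prompt: str) -> Tuple[str, str]:
        label = classify(system_prompt, prompt)
        if label == "local":
            return "local", local(prompt)
        # Anything the classifier does not confidently mark as local escalates.
        return "remote", remote(prompt)
    return route


# Stand-in for the small classifier model (assumption: a trivial keyword heuristic).
def toy_classifier(system_prompt: str, prompt: str) -> str:
    simple_tasks = ("commit message", "rename", "summarize")
    return "local" if any(task in prompt.lower() for task in simple_tasks) else "remote"


def local_model(prompt: str) -> str:
    return f"[local] {prompt}"


def remote_model(prompt: str) -> str:
    return f"[remote] {prompt}"


route = make_router(toy_classifier, local_model, remote_model)

if __name__ == "__main__":
    print(route("You are a coding agent.", "Write a commit message for this diff"))
    print(route("You are a coding agent.", "Design a lock-free queue"))
```

Defaulting to the remote model on any unknown label keeps misclassification failures on the expensive side rather than the wrong-answer side.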
They really are fantastic for a lot of use cases and I think most people do not need SOTA. When I run that qwen model in my measly 4070 12 GB for my personal email agent that I build and experiment with, I need privacy more than anything else. It does a great job. Even for coding tasks, given you know how to use them instead of dumping a grand plan, it's great.
SOTA can code but can also prove theorems and teach you about music theory or ancient Greece's substrate language or botany. Speaking in tens of different languages. I wonder how many hundreds of billions of parameters can be saved just by removing much of the general knowledge parts while keeping logical and programming abilities the exact same.
Network Bandwidth, Storage space and speed, memory capacity. While all of these were worth optimizing for at a point in history, that point is behind us. It's probably a reasonable expectation that it will eventually be true for VRAM.
IME vLLM is quite a bit faster than llama.cpp but where it really wipes the floor with it is in batching concurrent load. The downside is that it is dramatically less flexible in terms of tweaking. It gives you very few options for running quantized weights. It takes a lot longer to start up because it optimizes the compute graph. So for single user experimentation on a model that's a bit too big for your box, vLLM is just going to be frustrating.
Dismissed is a strong term, but let me give you some more details.
It took a good 4 minutes plus to load up on the 2x 3090 rig, and served a single request 3 tokens/second slower.
And the worst bit? With all that work - setting it up and tuning it - it still looped. I was hoping the "just use vLLM" advice that gets touted everywhere was the silver bullet.
The only thing I'd caution here is that we don't start bashing on llama.cpp like people did with Ollama. It's a very capable tool, and for the use-cases we actually want the card for it makes more sense.
For a large team replacing their Claude Subs perhaps vLLM is the only option, but you really need to add about 5 more RTX 6000 cards into the mix, so you can load something like GLM 5.2.
That's not _nothing_, but it's pretty close to nothing, and for the prosumer crowd it edges towards "just gets in the way".
They are similar, but for different use cases.
And ZDR is still data sharing with a third party. This is the essence of an enterprise agreement, it's not allowed, even if they pinkie promise not to store it.
If your customers allow you to share their data with third parties, then ZDR may be an option for you. I am not a lawyer.
Where I see ZDR as being more relevant is in protecting your employer's IP - not allowing a missed setting to mean AI labs can train, retain, and publish/resell your work. It's what we'll consider when the subsidies stop being available - open-router, ZDR - but for coding - not for customer data. Very important distinction.