Local Qwen isn't a worse Opus, it's a different tool

Posted by alphabettsy 15 hours ago

Local Qwen isn't a worse Opus, it's a different tool (blog.alexellis.io)

417 points | 224 comments

glerk 13 hours ago|

If you play with these models long enough, you realize there is more to them than just "model X is smarter than model Y" or "model Y is cheaper than model Z". They are different tools and the prompting technique is different. It is very much like playing an instrument.

With Claude, you sometimes want to under-specify or phrase things more indirectly to give a color to the implementation or elicit something creative. Also (you might raise an eyebrow at this) being nice to Claude will be rewarded and being mean to Claude will be punished. Claude tends to mirror your tone more aggressively and you don't want to get into negative loops with it.

With GPT, you have to be precise and reduce ambiguity. GPT will often try to resolve ambiguity in a min-max style "I'm going to do X, but make sure it is not quite Y". It will tend to be more paranoid and overengineer to catch all edge cases if you don't tell it precisely what the scope is.

With Qwen, you have to give it a shape and let it fill it in. Qwen likes XML, JSON and lists. Qwen likes to be shown a bunch of examples of previous work.
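
To make the "give it a shape" idea concrete, here is a minimal sketch of a prompt builder that wraps the task in XML-style tags, shows worked examples, and ends with an empty JSON template to fill in. The function name, tag names, and sample task are all hypothetical, chosen only for illustration; nothing here is specific to Qwen or to any vendor's API.

```python
import json

def build_shaped_prompt(task, examples, output_fields):
    """Assemble a prompt with explicit structure: task, worked examples, output shape.

    All tag names here are illustrative assumptions, not a documented format.
    """
    parts = ["<task>", task.strip(), "</task>", "<examples>"]
    for ex in examples:
        parts.append("<example>")
        parts.append("<input>" + ex["input"] + "</input>")
        # Show outputs as JSON so the model sees the exact shape to reproduce.
        parts.append("<output>" + json.dumps(ex["output"], sort_keys=True) + "</output>")
        parts.append("</example>")
    parts.append("</examples>")
    # An empty template with the expected keys is the "shape" to fill in.
    template = {field: "" for field in output_fields}
    parts.append("<output_format>" + json.dumps(template, sort_keys=True) + "</output_format>")
    return "\n".join(parts)

prompt = build_shaped_prompt(
    "Classify the bug report and suggest an owner.",
    [{"input": "Login page returns 500",
      "output": {"category": "backend", "owner": "auth-team"}}],
    ["category", "owner"],
)
print(prompt)
```

Whether this structure actually helps a given model is exactly the kind of thing worth measuring rather than assuming.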

This is not scientific at all, just vibes, YMMV.

dkersten 10 hours ago||

> This is not scientific at all, just vibes, YMMV.

This is the problem.

I would love to have a product sheet showing what each model's strengths and weaknesses are, so that I can have a clear decision tree of "if this kind of work, use model X", or "model Y should be used in ways Z". But they all look the same from the outside, and the only way to figure out which might be marginally better at what is to do extensive, time-consuming, and perhaps expensive testing.

coldtea 9 hours ago|||

>I would love to have a product sheet showing what each model's strengths and weaknesses are, so that I can have a clear decision tree of "if this kind of work, use model X", or "model Y should be used in ways Z". But they all look the same from the outside, and the only way to figure out which might be marginally better at what is to do extensive, time-consuming, and perhaps expensive testing.

Think of it less like a static tool, and more like a human helper, where the same holds.

mahidhar 5 hours ago|||

Well, unlike a human, I cannot expect any of these LLMs to take any ownership of the work they do. I cannot expect any given model and version (Sonnet 4.6) to learn, improve and adapt over time. I cannot expect its limitations to ever go away at the model level. So it is not like a human in most ways that I actually care about.

That said, I can't wait for LLMs to stop being AI and start being just another tool. Anything cursed with the "AI" label seems to go through this mess. In the earlier AI cycles, rules engines were considered "human-ish" and got hyped up, but today we see them as just another tool available to us, and we're better off for it.

squidbeak 1 hour ago|||

You're on the hook for their work in the way a manager is for their staff's output. The insistence of AI being a mere tool very often comes with this strange desire to be free of responsibility for its work. People seem to forget that the big advantage in these things is the range they have for obscure insight and creative solutions, both impossible with determinism.

themgt 2 hours ago||||

> That said, I can't wait for LLMs to stop being AI and start being just another tool.

From a horse's perspective, the internal combustion engine is just another tool for making scary noises and powering horse trailers to take me on fun horse adventures. So ... perhaps.

kolinko 3 hours ago|||

models don’t improve, but harnesses/tools/rules around them grow with the project.

ACCount37 8 hours ago||||

One issue with that is that human helpers last longer. LLMs cycle in and out in months, and what held for Your Favorite LLM 6.7 may not hold for Your Favorite LLM 6.9.

renegade-otter 6 hours ago||

Right, this is why I would slam the brakes on investing all of your time and effort into your workflow, because 2 months from now it may be out the window. Frontier models are also constantly being tweaked, so what worked yesterday may be off today.

ChatGPT was obedient with the grill-me technique, just wrote a plan. Yesterday it started jumping to implementation. Why?

HappySweeney 5 hours ago||

I find that when an LLM jumps into tasks it was not told to do (or even worse, does things it was explicitly told not to), it is a good sign the context is too full, and you should do a controlled hand-off to a new instance.

renegade-otter 4 hours ago||

I wipe my context relentlessly. I never have long-running conversations. In and out like Seal Team Six.

madeofpalk 8 hours ago||||

Except, where every different model and version is like a different person where you need to learn their idiosyncrasies of how they work every other month.

It's a very very bizarre way to use a computer.

Personally, I just don't. I'll use and prompt the LLMs the way that feels natural to me and move on with my life. Maybe I don't always get completely optimal results from them, but I'm also not spending half my day pleading with the computer to do a task.

user43928 6 hours ago||

I also don't think I need to prompt Claude differently than Codex.

The most important thing to be aware of in my opinion would be that Claude is better at UI design, and leaves a lot more comments in the code.

Other than that the results seem similar, at least functionally. I do not usually review the code style.

cassianoleal 7 hours ago||||

They are not human. Humans have names, faces, voices, personality, a personal history, family, care for whatever they call their community.

With humans it's actually good and worthwhile to create and strengthen connections. With an LLM, that's psychosis.

tekne 7 hours ago|||

To be fair: a voice, personality, and personal history sounds a lot like training data.

I don't think LLMs are people in any sense, at least as they're constructed now -- but they very much have what we would call "culture" and "personality" in suitably alien forms.

This is not the same as, e.g., feelings, experience, or humanity, or actual opinions or ideas (versus essentially "distilled vibes") and I feel that AI will more and more force us to confront that (including if new AIs are ever developed that may have the latter, as well!)

epicepicurean 4 hours ago||||

They are not human, but it helps to prompt them similarly. See: https://www.anthropic.com/research/emotion-concepts-function

anthonyrstevens 3 hours ago||

Good read. Thanks for sharing.

Wowfunhappy 6 hours ago||||

They're not human. But they are trained on human language, and thinking of them as similar to a human helps me work with them effectively.

malwrar 7 hours ago||||

These things passing the Turing Test makes anthropomorphizing their behavior awkward, but don’t forget it’s just an analogy to communicate an experience. If you convey a certain written voice to these models in your input, you get a somewhat consistent end effect. I think that’s all that is being communicated.

scotty79 7 hours ago|||

If you have a toolbox full of similar but different tools, getting to know them is a prudent thing to do, not a psychosis. There's no connection because the tool is immutable (except for adjustments you made), but you do develop a specific relation with that tool. Some people even love some of their tools at some level.

And if humans are anything, they are tool users.

coldtea 6 hours ago|||

>If you have a toolbox full of similar but different tools, getting to know them is a prudent thing to do, not a psychosis

Can be both. Use of some tools like LLMs might be more psychosis-inducing than others like plain compilers or hammers.

>And if humans are anything, they are tool users.

To the point of self-destruction sometimes.

scotty79 5 hours ago||

> Use of some tools like LLMs might be more psychosis-inducing than others like plain compilers or hammers.

I really don't get it. Why is the fact that it outputs words so goddamn important to everybody? How does it suddenly make you so emotionally vulnerable? Does my brain work in a different way than the rest of humanity? Can't you disregard what's irrelevant? Is every programmer suddenly a Trump supporter that has no ability to recognize empty words? To recognize lies about emotions and facts?

Words are just input. Mostly garbage. Emotion-inducing words are garbage 10 times more often than any other. I could expect a romance reader to be affected, or somebody with an IQ of 70. But how is the caste of some of the most technical people ever afraid of catching psychosis just because they might read some words?

chadgpt3 4 hours ago||

It's a certain percentage of people and yes it's different for them because it outputs words and triggers some kind of emotional trust response.

scotty79 3 hours ago||

As good an opportunity as any to acquire some emotional intelligence.

j-bos 7 hours ago|||

Yeah, AI tools bring software developers closer to the messy real world where 0 and 1 aren't always exactly 0 and 1.

skydhash 3 hours ago||

Computing is useful precisely for getting away from the messy real world of humans. I don’t need random errors in my financial transactions. I don’t want random errors when doctors are retrieving my medical history. And I don’t want random errors in my backups… There are plenty of non-deterministic things in my life; I don’t want my computer to follow suit.

gib444 8 hours ago||||

No, I won't anthropomorphise LLMs.

coldtea 6 hours ago|||

If there was anything that made sense to anthropomorphise it would be a machine meant to mimic talking, thinking and answering like a human, one that even passes the Turing test.

When we built the idea that anthropomorphising is wrong, we meant when doing it for rocks or trees or thunders or deer or some such.

TeMPOraL 2 hours ago|||

That's your prerogative, but be aware you'll continue to remain confused about LLMs. Anthropomorphizing them is what gives you the best high-level intuition about where and how to employ them, and where and how not to.

yeer2 6 hours ago||||

This is so dumb and goes against all the principles that enabled computers and smartphones to achieve wide adoption - the technology should evolve to fit the human. Not the other way around.

duckmysick 5 hours ago|||

I'd argue the opposite. Technology in the past few decades was (is) limited and humans had to adapt to it.

We communicate with other humans using voice and three dimensional hand gestures. To use computers and early phones we had to learn to operate new input devices: keyboards and mice. Later with touchscreens we moved to two dimensional hand (finger) gestures. We're barely making voice commands work with our devices just recently.

Then, a large number of humans are figuratively tethered to their desks because the devices need power and a stable internet connection. Mobile devices break this relationship a bit but you still need to charge them and be close to some sort of access point. In any case, the devices encourage sitting in one place for hours at a time.

And this is just computers and smartphones. Humans adapted their entire lifestyles and transformed the landscape to cater to cars.

skydhash 3 hours ago||

> Technology in the past few decades was (is) limited and humans had to adapt to it.

Was it? Think first about what it replaced. Lots of manual computation in bookkeeping and financial sectors. Telegrams and snail mail moved to email. Typesetting in books and magazines became easier and widely available,…

If there’s one thing that you can’t say about computers is that they’re limited.

duckmysick 3 hours ago||

No doubt that computers enabled a lot of automation. We can both agree with that.

The context was that technology should evolve to fit the humans [not the other way around]. And if contemporary technology didn't have limitations, it would be correct.

But it did and humans had to adapt to the computers. Humans had to develop and learn special languages so they could communicate with computers to do all those useful things you mentioned. Why? They were limited in understanding (or parsing) human languages. It took us decades before we could talk to computers in human languages. We're getting pretty close - especially in the past few years - but there's still some friction.

skydhash 2 hours ago||

> Humans had to develop and learn special languages so they could communicate with computers to do all those useful things you mentioned. Why? They were limited in understanding (or parsing) human languages

You may need to revisit your computation theory courses. Computers are the embodiment of a mathematical model and thus the inputs and outputs are formalized.

Do you just hold a pen and words are written automatically? Do you just hover your hands over a piano and have the moonlight sonata played? No, you have to do precise mechanical movements because that’s how the output is realized.

There’s no such things as words, sentences, keywords, statements at the computer level. What it does is symbol manipulation. You provide it a string of symbols, the rules for the manipulation, and it will provide a string of symbols as the output.

What symbols, what rules, are completely arbitrary. We just found that {1,0} are all that we needed as the set of symbols and that Context-Free Grammar is perfect for specifying the rules.

We still need to encode everything down to binary (ascii, unicode, bcd, floating points, pixel formats, PCM,…) and use a programming language (as defined by a grammar) to get the computer to do anything. Inference is made possible by those two mechanisms. It’s not a new computation model.

Wowfunhappy 6 hours ago|||

I mean, like, you can lament the state of the world all you want. It is what it is. Of course the AI labs would also like to make their models more consistent, but it's not how the technology works. They're black boxes to everybody.

dreambuffer 8 hours ago|||

Please do not think of LLMs like human helpers, that is a recipe for long term sociopathy.

egwor 3 hours ago||||

Maybe this is similar to web search too. We know how to get google to return the results we want, and when we use other tools like Bing we get other behaviour.

dotancohen 9 hours ago||||

Honestly, the differences between AI models always felt to me like the differences between coworkers or job candidates. They don't all share the same strengths and weaknesses - and they all have both good days and bad days.

Realising this made me respect the "I" in "AI" a bit more seriously.

amelius 9 hours ago||||

Yes, but benchmarks can be gamed.

Maybe we need better reviewers then?

yunohn 5 hours ago||||

> a product sheet showing what each model's strengths and weaknesses are

This presumes that the labs themselves know how well their models perform. But all they have are overtuned benchmarks and hype vibes.

couscouspie 9 hours ago||||

That would be ideal, but AI is less like a tool and more like a human in this regard, and you don't have character sheets for each of your colleagues either.

supergarfield 8 hours ago|||

If my coworker was part of a clone series of 100 million units, requesting a character sheet would be pretty reasonable

bluegatty 8 hours ago|||

These are $1 trillion companies that can't produce explicit details on how their products work? It's nonsense.

sixothree 2 hours ago|||

I think if they could explain how they work, their strengths and weaknesses, they would reveal to the world whose data they've been appropriating.

bluegatty 2 hours ago||

That's another thing altogether. They can characterize the behaviour without quite giving up who and where the data comes from.

Admittedly, yes, there's some overlap there.

They would have to admit 'seen it in the training data' as a factor, and that opens a can of worms.

epolanski 6 hours ago|||

The problem is that this is very hard to replicate and benchmarks focus on E2E tests, going from one prompt to the final solution.

They do not test how models perform when used interactively, like most of us do.

weitendorf 11 hours ago|||

One thing I used to test quite a lot was rerunning the exact same prompt on the same input, or semantically equivalent (in my mind) but differently framed or worded input, and seeing how much they diverged. In particular I’ve done this quite a lot between Sonnet vs Opus and across Qwen models.

I recommend everybody do this because you don’t need any special data except what you are already using, and the results will be very eye opening: there is WAY more randomness or instability involved than you would otherwise assume. A lot of what you might think is a better prompt technique, or a particularly good or bad outcome, could just as well be random chance or just different behaviors across model version or sizes. And your results can be massively biased by small differences in input. We’ve been calling some of these “magic words” at work, specific technical terms or references/techniques that you need only mention to get vast improvements in outcome.
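
One rough way to put a number on this, sketched under loud assumptions: `generate` stands for whatever function calls your model (here replaced by a canned `fake_generate` stand-in), and character-level similarity from the standard library's `difflib` is only a crude proxy for semantic divergence. The point is just to rerun the same prompt several times and look at the spread before crediting a prompt tweak.

```python
from difflib import SequenceMatcher
from itertools import combinations, cycle

def pairwise_similarity(outputs):
    """Return the mean similarity ratio (0..1) across all pairs of outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two outputs to compare")
    ratios = [SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return sum(ratios) / len(ratios)

def collect_outputs(generate, prompt, runs=5):
    """Call a user-supplied generate(prompt) function several times."""
    return [generate(prompt) for _ in range(runs)]

# Hypothetical stand-in for a real model call; swap in your own client code.
_canned = cycle(["use a mutex", "use a mutex", "use a channel"])

def fake_generate(prompt):
    return next(_canned)

outputs = collect_outputs(fake_generate, "How do I guard shared state?", runs=3)
score = pairwise_similarity(outputs)
print(f"mean pairwise similarity: {score:.2f}")
```

A score near 1.0 means the runs mostly agree; a low score on identical input suggests that any single good or bad result may be noise.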

There’s a skill to it. With agentic loops if you get the model into a self-eval structure where it’s hard to cheat or take shortcuts, and it’s in the right structure or domain that models its training, you’re golden. But it’s hard to find the sweet spots (pro tip, have Opus 4.8 convert PyTorch models into ONNX or quants or get them running on different hardware, I swear it was like I activated some kind of savant-like skillset; meanwhile I can’t for the life of me get it to properly write/test EBNF formalizations of common languages and formats without cheating).

The worst part is that it changes so much so frequently that it’s almost useless to really go digging for this kind of knowledge unless you’re actually the one training the models. I wish this kind of “stability” in output was more emphasized in their training so they’d be predictable. I assume it’s hard to do without overfitting or breaking the explore-exploit loop but also, I would spend so much more on LLMs for batch workloads if they could do them more reliably…

movpasd 9 hours ago|||

I've not done particularly rigorous testing, but I've done this a lot with Claude to get a feel. What I've noticed is for certain open-ended tasks, Claude is extremely primeable: it will pick up on minor differences in wording in your prompt and run with them hard.

It can be frustrating. The AI pretends to be a human, and so a part of my brain expects them to commit and have a "parti pris" like a human, so the exercise is a good reminder of the feedback loop. My mental model is that before the first three or four messages, the model has many finer points of its personality still underdetermined. I'd suggest that as the mechanism for "role-based prompting". And it explains the "savant sleeper agent" thing you describe. You want to get the state in the right attractor on the manifold.

These machines are pretty incredible, but for conversation-driven workflows you really have to be in the driver's seat. A human has a property that the AI does not have, at least under current architectures: we are regulated by the outside world. A bit of a tangent, but I can see how AI psychosis arises from these dynamics.

evntdrvn 7 hours ago||||

One thing that I learned when doing raw API LLM usage is how drastically the results can vary call to call with exactly the same input. I think that on average, people using agents underestimate how much the results from a given turn or command can vary, and so overindex on "X technique worked well" or "if I do Y then this will happen" or even "it did Z task well last time so it will this time too" or "{Model} is great at {thing}".
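
A small sketch of why a handful of good runs says little: the Wilson score interval gives a plausible range for the true success rate after a few trials. The 95% level (`z=1.96`) and the example counts are assumptions picked for illustration.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a success rate; z=1.96 is roughly 95% confidence."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# "It worked 4 times out of 5" leaves a very wide range for the real rate.
low, high = wilson_interval(4, 5)
print(f"plausible success rate: {low:.2f} to {high:.2f}")
```

With 4 successes in 5 runs the interval spans roughly 0.38 to 0.96, which is compatible with both "usually works" and "fails more often than not".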

dotancohen 9 hours ago||||

> We’ve been calling some of these “magic words” at work, specific technical terms or references/techniques that you need only mention to get vast improvements in outcome.

Any chance you could share some of these? Seems like something we could all benefit from.

weitendorf 9 hours ago||

Sure, my company has been working on a broad swathe of infrastructure projects and developer tools, which requires prompting models to seek out other tools/apis/docs/examples but in a way where we can't just dump all the context on the model up front. We also need the models to oftentimes look up technical documentation and specs, and sometimes build custom parsers for specific documentation websites that only make the data available embedded across 200+ pages of html.

First, I almost always try to seed every new project or context/domain with canonical technical specifications or examples I found elsewhere. When I set up this project recently, I linked to a bunch of the official Apple docs for sysctl, and told it to use a specific technique for calling assembly code from Go, that from experience it almost never realizes it can do or knows about (and similarly for sysctl, I knew it kinda sorta knew about it, but not in its entirety): https://github.com/accretional/sysctl/commit/da52438233e5b33...

The other thing I did was tell it to enumerate all the test cases ahead of time rather than to just directly implement them; again this is something where you have to explicitly tell it to go digging for information where it has blind spots and get it to set up properly grounded self-eval in a way that it can test against. I usually tell it to take notes as it works or commit notes to itself that will persist over sessions: https://github.com/accretional/sysctl/blob/main/FINDINGS_2.m...

Once we get back to working on this project we'll just have it implement / validate the rest of the sysctl feature support against the full inventory we had it uncover: https://github.com/accretional/sysctl/blob/main/cmd/darwin-n...

Another thing we do is have it specify an API that it can produce against; then in other projects we have them consume the API via reflection (and our special sauce we've been working on is the ability to discover and integrate against these automatically across thousands of APIs from many providers, which we've got working and can share if you're interested in using it as an early customer): https://github.com/accretional/sysctl/blob/main/proto/sysctl... This isn't the greatest example because it doesn't actually fully specify the sysctl keys yet. But I did have it create a knowledge base trying to cover the 1000+ keys as best as it could, to reference as it continued: https://github.com/accretional/sysctl/tree/main/macos-sysctl...

We have a better example in e.g. https://github.com/accretional/proto-sqlite/tree/main/lang where we were able to encode the entire sqlite grammar into a grpc interface so that you could e.g. find the exact structure of (and sanitize) a select statement: https://github.com/accretional/proto-sqlite/blob/main/lang/p... This way integration and discovery becomes a matter of telling it "use reflection against this endpoint to discover the sql interface, then implement against it" and we can model formats/input validation as formal grammars via EBNF (all magic words) vs just ad hoc.

We also tell it to set up and use a browser automation toolkit/testing and always run it at the end of testing workflows (often in a way that auto-opens screenshots on our local machines + commits them to git) via tools like https://github.com/accretional/chromerpc#headlessbrowser-aut... so that whenever we produce UIs it can evaluate its own output and iterate without direct human intervention. This is another case where the knowledge-discovery problem becomes a problem so we tell the models to use reflection to discover the browser automation apis. That ends up giving us things like this where it records user journeys through sites and creates visualizations without us having to debug them or do them ourselves: https://github.com/accretional/proto-css/tree/main/chrome-te...

dotancohen 4 hours ago||

Thank you very much. I'm going to re-read this evening. Have a great day!

mnicky 9 hours ago|||

If the benefits of using the model you've come to know well outweigh the disadvantages, you can continue using it even after the release of a successor model, right?

saint-evan 5 hours ago||

Yes! That's exactly true. I have a very real experience on this. I got introduced to Anthropic's family of models with Claude3.5. I fell in love with the specific personality of Sonnet, the model. I can't remember if back then Opus wasn't public yet but I remember very clearly trying out Opus several times when it became touted as best-in-class and actually recoiling from the foreign feel of the Opus model. I remember very well that my problem was that it was way too eager and pretty hard to steer. I returned to Sonnet and I've used ONLY Sonnet ever since. I have/had access to Fable and Opus4.8 but I never once tried them. In the early days with Sonnet3/4.5, I bought ChatGPT, I also remember thinking that it was a great teacher but a lazy coder. You'd get the scaffolding and then '# rest of code block' not full implementation so unless you wanted to learn the concept, weigh trade-offs, ask clarifying questions or jump into a rabbit hole... You had to go code it yourself. ChatGPT generally as a model is a very good teacher so much so that the free version is enough and I use the free in combination with the most advanced Sonnet model for actual SWE day to day. And whenever there's an Opus release I'm actually very excited because it means there's a smarter Sonnet model OTW. I'll actually be veryyy very sad if the Sonnet line gets sunset. There have been no Sonnet upgrades since even as other family lines get improved.

Do note that I only use LLMs in the ChatUI, I never use agents. I don't believe having a blackbox codebase managed by entities with a half-life of 'delete conversation' or 200k tokens is a responsible idea. In ChatUI, I lay the ground rules, kill assumptions about our working relationship, give it foundational context on the problem and codebase we're working on, explain the problem and then we have a conversation about it and I gradually disclose more context, logically, as it becomes relevant. So, to directly answer your question, maybe I'm missing out on a ton of upside by not using the absolute best but I'd say familiarizing yourself with a specific model has all the benefits of having a human friend you've grown up with... except your buddy's a savant and would absolutely love to help!

h05sz487b 12 hours ago|||

> It is very much like playing an instrument.

Or it is more like playing a slot machine and you imagine the rest.

cube00 11 hours ago|||

This is how I feel whenever I see bold all caps instructions in a system prompt or someone claims they conducted "research" and found the magic prompt template that makes the model pay out.

Maybe it works some of the time but it isn't a solution that works every time.

It reminds me of people hovering to play a slot machine when someone gets up and it hasn't paid out as if they've solved slot machines.

While I don't mind putting something in a loop until the tests pass, I'm less comfortable doing that when providers are silently rerouting to lower quality models, or in Google's case burning quota faster to ease their own server load without being transparent about what the "standard limits" are to begin with. [1]

I'm hopeful I'll be more comfortable with these "slot machines" when frontier models get to the point where they can be run locally on hardware I can actually afford so I know exactly what I'm getting and not jumping at shadows with providers playing tricks behind the scenes to ease their own load without admitting the customer is getting less for their money as they get more popular.

[1]: https://support.google.com/gemini/answer/16275805?hl=en&sjid...

user43928 6 hours ago|||

Has there been any evidence of a well known provider rerouting to lower quality models?

Last I saw, engineers working at OpenAI denied this on HN.

I saw that someone set up a tracker that aims to record the performance of the models, and so far it has not shown any statistically significant deviation in performance for Codex, and not yet enough data for Claude: https://marginlab.ai/trackers/codex/

cube00 3 hours ago||

> Has there been any evidence of a well known provider rerouting to lower quality models?

The firm [Anthropic] would deliberately degrade the model’s performance in ways that were invisible to the user.

https://news.ycombinator.com/item?id=48485958

coldtea 9 hours ago|||

>This is how I feel whenever I see bold all caps instructions in a system prompt or someone claims they conducted "research" and found the magic prompt template that makes the model pay out. Maybe it works some of the time but it isn't a solution that works every time.

For such a thing to be useful, it's enough that it works substantially more often than not having those instructions in.
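
Whether an instruction helps "substantially more often" is testable. Below is a minimal sketch of a one-sided Fisher exact test comparing pass counts with and without the instruction. The function name and the 9/10 vs 5/10 counts are made up for illustration, and each run is assumed to be an independent pass/fail trial.

```python
from math import comb

def fisher_one_sided(success_a, total_a, success_b, total_b):
    """P-value for variant A doing at least this well versus B if both were equally good."""
    total_success = success_a + success_b
    n = total_a + total_b
    denom = comb(n, total_success)
    upper = min(total_a, total_success)
    p = 0
    for k in range(success_a, upper + 1):
        # Hypergeometric term: k of the successes land in group A.
        p += comb(total_a, k) * comb(total_b, total_success - k)
    return p / denom

# Hypothetical counts: with the instruction 9/10 passes, without it 5/10.
p_value = fisher_one_sided(9, 10, 5, 10)
print(f"one-sided p-value: {p_value:.3f}")
```

Even a 9-versus-5 split over ten runs each gives a p-value of about 0.07, so a difference that looks obvious can still sit within the range of chance at small sample sizes.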

Planktonne 6 hours ago||

Every gambler thinks their system works, given enough chances.

hodgehog11 10 hours ago||||

A poor analogy depending on the setting because you can't adjust the odds with a slot machine, and the ROI is negative by design. If that's your experience, yeah, I wouldn't use an LLM either.

victorbjorklund 6 hours ago||

Pretty sure most modern slot machines are digital and you could adjust the odds (even to a positive EV) if you change the code.

hodgehog11 4 hours ago||

You're being unfaithful to the original statement. The whole point of saying something is like a slot machine is that there are significant odds that you lose. If you ever have access to a casino slot machine that has a positive EV, there are no tangible negative aspects anymore; you would use it over and over again and accumulate significant wealth from the house. That's my point.

ramon156 11 hours ago||||

Instruments are pseudo-random until you know what you're doing. Slot machines are just slot machines

Forgeties79 10 hours ago|||

Musical instruments are not random. You’re just doing random inputs. Instruments are consistent, even if the “flavor” and quality varies with different builds.

Playing a B on a saxophone always plays a B.

headcanon 4 hours ago|||

I see you haven't tried a modular synthesizer yet :) Getting back to the same "place" in a patch can sometimes be impossible, and it does feel "random" until you get the hang of it.

dotancohen 9 hours ago|||

Saxophone, being a wind instrument, was a bad choice. I can definitely tell which student was blowing when hearing a note.

But your analogy remains solid if you substitute e.g. a piano and a reasonably proficient player. A single note would be nearly indistinguishable between players... But a full piece most certainly will sound different.

palata 9 hours ago|||

While I agree with you, I think it's diverging from the initial point.

The original take was "LLMs are very much like playing an instrument". I think they are very much NOT like playing an instrument.

While different musicians will produce different results, one musician won't get drastically different results on different days or when trying a different "copy" of the same instrument. If you can play the violin on your violin and I lend you my violin, you will still be able to play very consistently. You may argue that the sound will differ and you will have to adapt slightly, but that's not remotely similar to the randomness coming from LLMs.

tekne 7 hours ago||

Will you?

That's only if both violins are tuned the same way, and one must continually tune them lest they get out of sync.

Similarly, an LLM can be extremely consistent if tuned properly -- indeed, if you fix the weights and settings, they can be made "essentially deterministic" for many prompts!
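
A toy illustration of the "fix the settings" point, not a real inference stack: with temperature 0 the sampler always takes the highest-scoring token, and with a fixed seed even temperature 1.0 repeats exactly. The logits and function name are invented for the example; as far as I know, real serving systems can still vary for other reasons, such as batching and floating-point ordering.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token index from logits; temperature 0 means greedy argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [2.0, 1.5, 0.3]  # made-up scores for three candidate tokens

greedy = [sample_token(logits, 0, random.Random()) for _ in range(10)]

rng_a = random.Random(42)
run_a = [sample_token(logits, 1.0, rng_a) for _ in range(10)]
rng_b = random.Random(42)
run_b = [sample_token(logits, 1.0, rng_b) for _ in range(10)]

print("greedy:", greedy)
print("seeded runs match:", run_a == run_b)
```

Note this only covers repeatability for identical input; it says nothing about how much the output shifts when the prompt changes slightly.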

layer8 5 hours ago|||

The difference is that a violin player can predict how the known violin will behave under all relevant circumstances, will know how to get the right tone out of it, while you’re generally unable to predict the adequacy of output of even a deterministic LLM. You can’t practically reason about how varying the input to the LLM will ensure the adequacy of its output, while the violin player is perfectly able to do so for the violin.

This is because LLMs have aspects of chaotic dynamical systems, where small changes in initial conditions can lead to vastly different outcomes. That property is independent from nondeterminism.
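
To illustrate the distinction (with the logistic map as a stand-in, not an LLM): the system below is fully deterministic, so repeating a run reproduces it exactly, yet nudging the starting value by 1e-10 eventually produces a completely different trajectory. The function name and parameters are just for this sketch.

```python
def logistic_orbit(x0, steps, r=4.0):
    """Iterate x -> r * x * (1 - x), a textbook chaotic map at r = 4."""
    orbit = [x0]
    for _ in range(steps):
        orbit.append(r * orbit[-1] * (1.0 - orbit[-1]))
    return orbit

base = logistic_orbit(0.2, 60)
repeat = logistic_orbit(0.2, 60)           # identical input
nudged = logistic_orbit(0.2 + 1e-10, 60)   # almost identical input

# Deterministic: the repeated run matches step for step.
print("repeat identical:", base == repeat)

# Sensitive: the tiny initial gap grows until the orbits are unrelated.
max_gap = max(abs(p - q) for p, q in zip(base, nudged))
print("gap after 1 step:", abs(base[1] - nudged[1]))
print("largest gap over 60 steps:", max_gap)
```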

Forgeties79 6 hours ago|||

Anyone who has even modest experience with a particular instrument can pick any one up at any time and play it. The way the notes are played is consistent and produces a consistent note. If you tune 50 guitars to standard, the chords all produce what they should. It is a predictable instrument. You do not pick up a trumpet in one place then another and find the key combinations are suddenly different.

You know what we are talking about. Tuning, poor playing, all of that is mild variation from what we know it is supposed to do every time and we can target the notes they are supposed to hit consistently. You're comparing slight tonal variations to completely different outputs from the same inputs. If I hit a "C" on the piano, it is going to play "C." If it does not, then the piano is not functioning properly. LLMs for some reason get a pass on this and it makes them very distinct from musical instruments.

This feels like a very nitpicky steel man, not a productive attempt at discussion.

Forgeties79 9 hours ago|||

A poor B is still a B fingering and the sax is supposed to play a B every time. Missing it is human error, not tool error. I can pick up an alto sax, a clarinet, etc. any time, anywhere, and expect the same fingerings to work every time. My individual skill or mistakes or peculiarities of each build are not what is relevant here.

LLM’s do not operate consistently and make their own errors while we argue about which incantation makes it less inconsistent, knowing it will never actually perform as expected.

I played woodwinds regularly for 15 years so I feel fine with my example.

glerk 11 hours ago||||

It is a bit of both. A non-deterministic instrument and a predictable slot machine.

psychoslave 11 hours ago|||

I play slot machines as instrument! ;)

dotancohen 9 hours ago||

Roger Waters and Nick Mason were playing the cash register in 1973!

devin 2 hours ago|||

It is not at all like playing an instrument.

Instruments present a clear interface to a user, have predictable outputs, etc.

The only comparison that might work for me is that LLMs are very bad instruments where you are constantly forced to negotiate its idiosyncrasies in order to massage the output you want from it, and even then there is enough randomness that trying to do so is almost a fool's errand.

djeastm 1 hour ago||

I think they mean playing different instruments not other instances of the same instrument. A tuba's interface differs from a violin's, etc.

devin 1 hour ago||

My criticism of the comparison would stand in either case. There is nothing clear and uniform about the interface to LLMs that match their musical counterparts. Even modular synthesizers with random sources are far more controlled.

I also think it's disingenuous to call LLMs "tools" in the stricter sense of the definition, but I've mostly given up trying to convince people of this. Main reason being that a terrible writer and a gifted writer can produce similar outputs, and for the terrible writer it will be above their average, and for the gifted writer it will be below what they could produce with full control.

Wowfunhappy 6 hours ago|||

> With Claude, you sometimes want to under-specify or phrase things more indirectly to give a color to the implementation or elicit something creative. Also (you might raise an eyebrow at this) being nice to Claude will be rewarded and being mean to Claude will be punished. Claude tends to mirror your tone more aggressively and you don't want to get into negative loops with it.

> With GPT, you have to be precise and reduce ambiguity. GPT will often try to resolve ambiguity in a min-max style "I'm going to do X, but make sure it is not quite Y". It will tend to be more paranoid and overengineer to catch all edge cases if you don't tell it precisely what the scope is.

I agree with all of this except for one thing: I swear to god, being mean to Claude at the right time can be enormously effective. The F-bomb in particular seems to really help it snap out of ruts sometimes.

mcbits 4 hours ago||

I haven't really experimented with being "nice" or "mean", but I would worry that a prompt like "No, dumbass, ..." would kick it into the patterns of someone who frequently got called a dumbass (perhaps for good reason) in the training set. On the other hand, maybe it could trigger more defensive responses with argumentation to explain its conclusions.

Wowfunhappy 3 hours ago||

I only use it for behaviors I really want the model to clamp down on, and I don't think I've ever told the model it was stupid. But I might say something like:

    No, don't f***ing do that! What part of "[previous instruction]" don't you f***ing understand? I am extremely angry and disappointed by your inability to [whatever]. Do better please.

> maybe it could trigger more defensive responses with argumentation to explain its conclusions.

Quite the opposite, it makes the model extremely conciliatory—which in this situation is what I want. If you're hoping to make the model less sycophantic, this is the wrong tool.

andai 6 hours ago|||

I asked GLM 5.2 for a HTML5 port of my old C#/XNA game. It ported all the code exactly (except for operator overloading, which doesn't exist in JS), and added more code to make the code work.

I asked Claude Sonnet 4.6 for the same thing, and Claude's version was like if the game had been written in JS originally.

Also, for some reason it made it a single HTML file, removed all assets, dynamically generated graphics and dynamically generated music. It also gave me a new, better background.

This surprised me, since it was not what I asked for. I just asked it to port the game.

I was pretty pleased about the choices it made, but I'm not sure how to turn that behavior on and off. Sometimes you want it to be creative, sometimes you want it to actually do what you said.

vlovich123 4 hours ago|||

You’d probably have to say “port exactly as is without changing any assets and keeping the original structure of the code” or “port with using the exact same assets but write as if native JS but use good code structure principles for organizing”.

You have to be a lot more explicit but it’s hard to know a priori what decisions it’ll make. A good idea is to run it in plan mode so you can read those decisions before it sets out on a path and have an opportunity to make corrections.

CuriouslyC 5 hours ago|||

What you've described is Claude's "secret sauce" and the reason some people love it and some people hate it. It's not really possible to turn off, you can try to prompt against it but it's not reliable, the solution is to use Claude when you want that behavior and other models when you don't.

stingraycharles 13 hours ago|||

I agree with your general gist, and in general it’s a case of “the best tool for the particular job”, keeping token spend and other things in mind as well.

What I do know absolutely for sure is that LLM benchmarks are not to be trusted, they are just a minor indicator and real world usage is often very different.

sanderjd 13 hours ago|||

I share this sense, but my immediate thought is that we need to improve the evaluations! Do you think this is impossible? That there is something indelible that it is not possible to capture empirically? I kind of have this intuitive sense that it is this way, but simultaneously I think that it's unlikely to really be true.

theshrike79 12 hours ago|||

We shouldn't just measure the power of the raw LLM, harnesses matter more and more.

It's like taking the engine out of each car, putting it on a test bed and running it, and then making a decision whether the car is good or bad based on the graphs the test bed provided.

You might have the best engine in the world, but if you put it in a shit car, the result is still bad. The seats are squeaky plastic, the infotainment is touch-only and you can't put on your seatbelt without knocking down whatever is in the cupholder.

sanderjd 4 hours ago||

Aren't there benchmarks that measure at the harness level as well?

gbalduzzi 12 hours ago||||

Following the original comment concepts, if every model requires a different prompting technique to maximize its output, how can a benchmark based on sending the same prompt to all models be accurate? We should create different prompts for each model, but then how reliable and unbiased can the benchmark be?

It is a fundamentally hard problem to solve

Wowfunhappy 5 hours ago||||

I'm not GP, but yes, I think it's impossible.

Take AI out of the picture for a moment. What makes someone a good coder? What makes someone intelligent? How do you evaluate those skills?

Of course we have standardized tests, and they're useful, but they're also imperfect. And they become especially imperfect when people start training for the tests specifically—which is, essentially, benchmaxxing.

We have never been able to quantitatively measure most skills to a high degree of accuracy, despite centuries of trying. That's not going to change now.

(I don't mean to anthropomorphize the LLMs, but I do think they're like humans in this way.)

Forgeties79 8 hours ago|||

The reason we can’t capture it empirically is that nobody truly knows exactly what we are supposed to be using these tools for or how they are going to operate. We are still fitting squares into holes with them. We are told to treat them like some bespoke tool for coding, shopping, tech-support, etc. But it is not actually purpose built for any of these things.

When I use a calculator, I know exactly what it does and what it is supposed to do. It always gives me a verifiable, predictable result. If I input “8+8” 10,000x it will give me “16” 10,000x outside of incredibly fringe edge cases/bugs. I can’t say the same for LLMs

willtemperley 12 hours ago||||

Yes, how do we know Opus 4.8 hasn't been trained on the SWE-Bench examples?

With a squillion dollars at stake per bench point, someone will have figured out a plausibly deniable way to game these benchmarks.

stingraycharles 5 hours ago||

Ehr, the SWE bench examples are particularly horrible as those are just publicly available historical PRs. So if the models are trained on GitHub data, it will be included.

So almost by design that particular benchmark is tainted, and benchmarks recall rather than reasoning.
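
A crude way to picture the leak, assuming whitespace tokenisation and invented example strings (this is not how SWE-bench or any lab actually screens data): measure what fraction of a benchmark item's word 5-grams also appear verbatim in a training corpus.

```python
def ngrams(text, n):
    """Set of word n-grams in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(benchmark_item, corpus_docs, n=5):
    """Fraction of the item's n-grams that occur somewhere in the corpus."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)

# Hypothetical training corpus that happens to contain a public PR description.
corpus = [
    "merged pull request fix off by one error in the pagination helper "
    "when page size is zero thanks",
    "unrelated changelog entry about updating the build scripts",
]

leaked = "fix off by one error in the pagination helper when page size is zero"
fresh = "add a retry budget to the websocket reconnect loop"

leaked_ratio = overlap_ratio(leaked, corpus)
fresh_ratio = overlap_ratio(fresh, corpus)
print("leaked item overlap:", leaked_ratio)
print("fresh item overlap: ", fresh_ratio)
```

A high ratio only says the text was available to memorise; it says nothing about whether a given model actually saw it.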

willtemperley 5 hours ago||

Wow that's worse than I thought, and breaks the number one rule of machine learning: you don't train the model with your test dataset.

dv35z 12 hours ago|||

What would it take to have trustworthy benchmarks? As with all "targets", they can be gamed - but I am curious about quantifiable quality metrics.

sixtyj 12 hours ago|||

The issue with LLM benchmarks is similar to car benchmarks. E.g. journalists almost always get the fully equipped model, so their review is honest but sort of rigged.

I haven’t seen details of LLM benchmarks’ data sets, but I would suppose that the “questions” are public and so known in advance, therefore you can tune a model as much as possible.

One of the real benchmarks is drawing a pelican - https://github.com/simonw/pelican-bicycle - Simon Willison made it for his LLM tests.

If you really want to find a model that works for your specific purpose, I would recommend several rounds at arena.ai - the models are anonymous, so it helps you find one without confirmation bias.

Some ppl: Claude is the best! Others to them: but Qwen is the best! Or… Codex is better! …

It all depends on the language (English, Dutch, French…), style of querying (caveman, specs, skills, goal etc.)

Even with the same model I get different answers to the same prompt when it is just tweaked a little.

So benchmarks are nice but mostly useless.

Without your usecase it is just a reference number indicating the approximate position of that model among the others. And for those who want to make money it is a marketing tool to sell more as every customer counts.

theshrike79 12 hours ago||||

You can't measure "feels".

One good analogy is the Macbook vs generic windows laptop debate online.

The engineer mind just compares numbers, the Lingwoo laptop from Amazon has biggest numbers for everything and the lowest price. Ergo it is the best.

But the numbers don't measure the fact that the Lingwoo creaks and squeaks when you lift it due to the cheap plastic. It also runs at 100C when both CPU and GPU are fully utilised. The keyboard feels like a membrane keyboard from a milspec device from the 90s. Numbers also don't measure the fact that Lingwoo is an alphabet soup whitelabel manufacturer that won't exist in any legal capacity in 6 months so good luck with any warranty issues.

There will be an identical laptop called Chongwin being sold though. Completely different company, definitely.

The same applies to LLMs. You can do benchmarks like ask them to one-shot different kinds of gotcha questions (car wash, strawberry and other idiotic ones) or get them to write different kinds of programs.

But that doesn't measure the UX of doing so at all. How many times do you actually need any of those when you're actually working?

It's like unit testing an application. Every function can have 100% test coverage and the app can still be shit because there are things you can't unit test for.

psychoslave 11 hours ago||

> You can't measure "feels".

One can always measure whatever they wonder about. That doesn't mean the measure will be trustworthy, or that anything built on it will be any better than wet-finger judgement.

theshrike79 9 hours ago||

Feels are just opinions and taste. It's like art and music, you can't quantify either to a mathematical formula or an absolute test of which is good.

Even songs that break the "rules" of music can be subjectively good, either because they broke the rules or despite it.

Or with cars, a car that's beautiful to one person is the ugliest piece of trash on the street. Some people want a super soft ride where their espresso martini doesn't even vibrate when gunning it through a gravel road and others want to feel every grain of sand on the asphalt in their buttocks. Neither is "correct" and there is no objective measurement for ride comfort.

da-x 12 hours ago|||

Maybe someone can devise a distributed bench-marking system where multiple people collaborate on tests and also vet each other's tests and rating without revealing them to the public.

I have my own "interview questions" for models where I give them a premade Git repo and a problem to solve. Then, I rate them like a teacher. I believe others do that as well, so we only need a reliable system to aggregate these results.

microtonal 11 hours ago||

The problem with proprietary models behind APIs is that they could have saved your benchmark for future training though.

The only way to make it fair is to have the model provider give some benchmarking org the weights + inference engine, so that the model can be run in complete isolation and no information about the benchmark is leaked.

Though I guess for a 'random' person's benchmark that hides between all other requests it's probably ok.

clhodapp 10 hours ago|||

While the gist of what you say is true, it is hard to get very good at treating them as instruments when they keep getting replaced with new, ostensibly-better versions every few months. But those new versions are not strictly better. They are mostly-better while actually having different strengths and weaknesses.

It's hard to decide when to use the best tool for a job you are aware of to ensure throughput and when to spend time experimenting with a new tool to learn what it's good at.

nonethewiser 4 hours ago|||

> you sometimes want to under-specify or phrase things more indirectly to give a color to the implementation or elicit something creative

This has been my experience with most models. If you say "How do I do X? I was thinking maybe Y or Z" then the model will probably try to make Y or Z work. They will very likely not say some third option that is wildly different is better, even if it may be. And actually maybe less so with Claude because sometimes it pushes back.

Actually this seems like it would be an interesting test. Maybe I will come up with some contrived question and ask several models.

LogicFailsMe 19 minutes ago|||

I find with Claude that when I call its BS I get better results. And it openly admits to lying to and gaslighting me as well as not seeing any way to stop itself from continuing to do so.

Fable seemed less apt to do so but I didn't get enough time with it before it was yanked away to know for sure. It may have had mixed results on the benchmarks but it was finding bugs opus never found.

vkazanov 13 hours ago|||

The problem is not that there are details, the problem is the constantly shifting ground. We can only rely on a harness to be sort of predictable, but the models change all the time.

rkuska 12 hours ago||

It's the system prompts that change all the time, especially in Claude Code.

furyofantares 2 hours ago|||

I strive to make this NOT the case, by fixing up my skills or agents.md whenever they don't work how I want in one provider or the other. I mean, yeah, it would be awesome if I was a virtuoso with all the agents/models I use. But I am switching all the time, either because one leapfrogs the other, or because I hit limits (I'm on $200/mo on both Claude and Codex, and also subscribe to some others when I hit limits on both of those simultaneously).

tingletech 1 hour ago|||

I do think it pays to be nice to the model. When the context window is running out I like to ask "please summarize what went well and what didn't work in this session. How could the user be more helpful?"

john_strinlai 1 hour ago||

>I do think it pays to be nice to the model.

there was something on HN a few weeks ago about how most/all models perform better the more rude you are to them.

(i still say "please", i can't help it)

visiondude 12 hours ago|||

while not scientific this has been my experience as well. i will add that language specificity in word choice is also a learned behavior. for example, the word “investigate” vs the phrase “look into”. You will find the outputs are quite different. can you guess which will use more tokens? it’s stuff like this that actually sets people apart in the top percentile of using these tools

qsera 11 hours ago||

Mmm..interesting..So now people are finding behavior patterns in LLMs which are trained on behavior patterns of people...

theshrike79 12 hours ago|||

Yyep.

IME Claude is the most "creative" of the bunch, you can get surprising ideas out of it that were kinda tickling the back of your head but didn't really connect.

BUT it's also "relentlessly proactive" like simonw put it. It _will_ get the job done, it's the smartest idiot in town. Why use a library to parse $format when you can just write a custom 1000 line parser? Or if it can't access something, it'll pursue the goal of accessing it in the most creative ways - instead of stopping, asking the user "yo, can you give me access to X" and then continuing.

My solution is to use Claude as a pair programmer. I _very_ rarely just do /goal fix this shit, I watch what it does and interrupt if it gets to the "smart idiot" phase. Also I communicate with it like I would a coworker, never had it berate me or get combative. There's a Finnish proverb for that too[0]

As for Codex, Deepseek, GLM, those I use when the goal is 100% clear like "convert this Brewfile to a list of packages for Arch and Debian, use these two Docker containers to test that pacman and apt work correctly". Boom, done.

But I won't give any creative open-ended tasks to any other model than Claude.

[0] https://en.wiktionary.org/wiki/niin_mets%C3%A4_vastaa_kuin_s...

weitendorf 11 hours ago|||

The parsing thing, or the willingness to instantly drop into janky unsanitized string manipulations, or to constantly push back against work on infra projects because some random package on GitHub has 200 stars so it’s totally the safer approach, is driving me insane.

On one hand I’m glad Anthropic is only just now starting to get into infrastructure because it means there’s opportunity there, but it’d be great for their models to be more knowledgeable or able to seek out that knowledge on their own, or for the UX of Claude code to be more amenable to launching 5 in parallel and picking the best one, so I don’t have to spend time arguing with a robot. I think there’s a much better balance to strike between just charging ahead towards the goal at all costs vs being lazy and pushing everything back up to the user. Basically they write too much code that’s too contingent/brittle outside its exact current context and don’t do a good job distilling out the essence of the problem “cleanly”. Almost all of them are like this right now, it’s partially a problem with long-range planning but I think a real bias from over optimization for certain RLVR outcomes vs others.

tym0 10 hours ago||

I feel like this is really due to the harness.

Gemini CLI at work has the same issue: it'll prefer hacking your workstation over just asking you how to proceed.

I think the harnesses are set up to have a bias to action, otherwise the LLM would just stop all the time when doing trivial tasks, but it also means they'll keep going when the "obvious" path is to just prompt the user.

weitendorf 10 hours ago||

While I agree that the harness is part of it, I think it's also a lack of epistemic understanding or awareness for what it means to actually solve a problem vs just get something kinda working; maybe if Claude Code or other harnesses made web search more likely or had a better way to make technical documentation and specs available to models, it would be better solvable there.

I often tell it to stop asking me and just keep going until it accomplishes X task; unfortunately it tends to assume I want something that only just barely works, in the sense that it means it's time to stop once it's there, which I don't think a harness by itself could easily address (ultimately the model itself needs to determine the stopping points unless I literally specify by hand hidden evaluation criteria).

That's why I think it's at least partially a training issue where the model gets rewarded for "solving" the problem within a certain amount of context/time without access to grounded knowledge (eg looking up the actual spec for a format) nor adversarially/rigorously evaluated against a reviewer capable of finding all the edge cases/shortcuts preventing something from being a properly generalized solution. I don't want it to ask me for guidance when it's working on a well-specified problem, I want it to either find the right parser and use it, or to completely implement one against the spec, rather than write some half-assed string inserter that eg only works on the specific select statements my examples use right now. My understanding is that the Mythos/Fable models were better trained for this but from my brief foray into using Fable for work I wasn't that impressed. For me they need to get better at agentic search and self-eval still.

theshrike79 9 hours ago||

There are still billion dollar opportunities in the harness/LLM space.

Having a reliable shared memory for hundreds of agentic AI users is something that's 95% snake oil at the moment. There are a few successes on an individual level (I really like Hermes[0]) but nothing scales to a company level easily.

It should be possible to (pre)configure all agentic harnesses used in a company to use a single source for information so that it'd automatically pick up internal libraries, conventions, licensing decisions etc and remember them across sessions.

I've had limited success with this on a personal level, but it's still not ingrained in the model because it would really need a custom harness. Hooks, skills, prompts get you like 80% of the way. I still need to do a "please check that the project matches the conventions defined in ..." regularly to catch any drift - especially on more vague stuff that can't be locked down with unit testing.

[0] https://hermes-agent.nousresearch.com

zahlman 2 hours ago|||

FWIW I find that GPT can be very creative when discussing a high-level design. Once it starts writing code snippets it will offer to take things in a bunch of different directions.

keeganpoppen 3 hours ago|||

this is the best distillation of what various models are like that i've ever heard... it's wild to me that people view LLMs as this monolithic entity, like "how do i get the best prompts to do <X>?", when it is such a clearly interactive medium, but the returns to engaging with the various models and understanding their "vibes" are very, very high.

nosyke 6 hours ago|||

It's interesting because this really hasn't been my experience over the last month or two. I would say prior to that it was, but it's definitely changed on my end. In my experience I've needed to be way more specific with Claude, and with Codex I can generally approach a problem in a much more open ended way.

bandrami 8 hours ago|||

I think this goes beyond "vibes" to cargo-culting. It's why nobody's ever able to actually show ROI from LLMs

CuriouslyC 4 hours ago||

It's hard to actually show ROI from any programming methodology or tool. You can show ROI from a product or feature, but the tool/methodology is a multiplier on the velocity of creating that which is not directly observable.

bandrami 4 hours ago||

It's really not. When we switched from CVS to SVN I had to show ROI and when we switched from SVN to git I had to show ROI and when we switched from Ada to Java I had to show ROI. When we switched from Xen to KVM I had to show ROI and when we switched from PAM realtime privileges to rtkit I had to show ROI. When we switched from chroots to LXC I had to show ROI, when we switched from LXC to docker I had to show ROI, and when we switched from docker to podman I had to show ROI.

If you can't show ROI there's literally no reason to ever switch anything.

baq 9 hours ago|||

+1.

this is what 'tokens are commodities' and 'there is no moat' people miss. the models are in general not easily swapped out. you always have to run evals before you can swap them around, tune prompts etc. even minor versions of models from same providers need this process.
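
a minimal sketch of that gate, assuming models are plain callables and using two stand-in stub functions (`current_model`, `candidate_model`) instead of real provider calls; the cases and the `safe_to_swap` threshold are invented for illustration:

```python
import json
from typing import Callable, List, Tuple

Model = Callable[[str], str]
Case = Tuple[str, Callable[[str], bool]]

def is_json(output: str) -> bool:
    """Check used by the cases below: does the output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except ValueError:
        return False

def pass_rate(model: Model, cases: List[Case]) -> float:
    """Fraction of cases whose check accepts the model's output."""
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)

def safe_to_swap(current: Model, candidate: Model, cases: List[Case],
                 max_regression: float = 0.0) -> bool:
    """Allow the swap only if the candidate does not regress on our own cases."""
    return pass_rate(candidate, cases) >= pass_rate(current, cases) - max_regression

# Stubs standing in for two model versions given the same, untuned prompt.
def current_model(prompt: str) -> str:
    return '{"status": "ok"}'

def candidate_model(prompt: str) -> str:
    return "Sure! The status is ok."

CASES: List[Case] = [
    ("Return the service status as JSON.", is_json),
    ("Summarise the health check as JSON.", is_json),
]

print("current pass rate:  ", pass_rate(current_model, CASES))
print("candidate pass rate:", pass_rate(candidate_model, CASES))
print("safe to swap:", safe_to_swap(current_model, candidate_model, CASES))
```

here the candidate fails only because the prompt was written for the old model's habits, which is the point: the fix is prompt tuning plus a re-run, not a blind swap.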

hashmap 12 hours ago|||

totally true. one key for claude is to not smell like an evaluator, it's good at knowing when it's being tested and will behave defensively and avoid doing work. i avoid this basin by typing unreasonably excited about the thing i want done. like way over the top. it's harder to keep that up than it sounds.

notduncansmith 11 hours ago|||

I’m able to avoid this basin with a pretty natural baseline professional positivity and frustration management that I would employ with pair-programming. For example, if I just made progress with a human I was guiding through a task, I would be like “Nice, now let’s xyz” (instead of just “now let’s xyz” as if _I_ were the robot lol) or if we had to work for a result I’ll be like “Sweet! Looks good, now let’s xyz” - this is important signal for humans, and the same is true for agents. Also staying emotionally regulated and focused on the goal when things don’t work as expected or when we haven’t made progress after a few tries at something, critical in human interactions :) and even if it’s my job paying for the tokens, the idea of racking up even a microscopic bill for the privilege of having a machine read my insults and then formulate some credible-sounding blob of apology text is belly-laugh absurd to me. I do try to express my genuine feelings during more vision-oriented planning sessions, and just like with a human, you have to maintain the vibes if you want a genuinely collaborative session to go well. If you are toxic people will become either defensive or aggressive in response. From reading the rest of the front page it seems like we are lucky that Claude is the former, and that we especially best maintain a positive atmosphere around Grok.

glerk 12 hours ago||||

at the risk of sharing my secret magic spells :)

> this is phenomenal work, genuinely! I feel like you read my mind! <next instruction here>

can go a long way.

of course, I would only say that when I mean it, because Claude can get superficial and cut corners which is why I prefer GPT for raw implementation.

zahlman 3 hours ago|||

> being nice to Claude will be rewarded and being mean to Claude will be punished

... That does sound like something that Anthropic would deliberately aim for, yeah.

> With GPT, you have to be precise and reduce ambiguity.

I have found that it occasionally makes a wild misinterpretation, that makes a bit of sense in retrospect given how I worded something but is still surprising.

It also sometimes tries to loop in and tie together ideas from earlier in the conversation that really shouldn't still appear relevant. But that might be a general LLM thing.

reverius42 13 hours ago|||

These are the vibes that power vibecoding.

vorticalbox 10 hours ago|||

I find Opus best for planning and Sonnet for coding, but Codex for code review.

photochemsyn 3 hours ago|||

We can’t tell if reported anecdotal behaviors of given LLMs are due to (1) one’s engagement history with that particular LLM provider or (2) ongoing variations in the secret system prompt all commercial LLM providers insert or (3) some other variable feature like RAG.

Classify under non-reproducible artifacts of LLM generation.

QwenGlazer9000 2 hours ago|||

As someone who actually uses musical instruments, it's not at all the same. If anything, traditional IDEs are closer to musical instruments, which seem to be going EOL if you listen to the hype bros.

skipants 4 hours ago||

I feel like it's the Emperor's new clothes reading this article and seeing the praise it's getting. This sentence doesn't even make sense:

> These products use very low level Linux primitives like containers, Kubernetes, Firecracker microVMs, and networked protocols.

Out of anything that is a "low level linux primitive" I could maybe argue that networking? protocols fit the bill.

And it's obviously fully AI-generated! Which I wouldn't even care about if I could actually trust the content, which I can't!

chadgpt3 4 hours ago||

Low level today means JavaScript instead of typescript

alexellisuk 2 hours ago||

Fair enough, that sentence was fairly compressed. I’ve reworded it - the meaning remains the same.

The post is not AI generated, I use AI for code generation and write my own articles.

Which part of the post are you struggling with? This is a post describing our own experience and journey. Happy to back up any specific claim.

alentred 1 hour ago|||

> Fair enough ... compressed ... ACTION->RESULT ... NEGATION->STATEMENT ... follow up questions.

What model are you again?

CamperBob2 2 hours ago|||

How about your reply here? Was that AI-generated? If not, are you conscious of how much you're starting to sound like AI? Is that something you see as a positive thing, or something you'd like to avoid?

I actually find this somewhat interesting, because it seems that a lot of people who weren't comfortable with expressing themselves verbally are feeling more empowered in that area. We're hearing new voices for the first time, albeit heavily-filtered ones, and I have to believe that's a good thing.

But part of me still finds it offputting for some reason. It's interesting to think about whether that's more of a "you" problem, or more of a "me" problem.

cptskippy 1 hour ago||

It's an interesting idea. Tools like Grammarly exist to help with business communication. I wonder if there's a space for a Social Media or online writing assistant to help people. I for one could probably benefit from a tone shift away from acerbic troll.

stego-tech 7 hours ago||

I still believe that the strength of AI is when it can be applied locally in a secure and private manner, rather than yet another cloud-based service you must pay for indefinitely even as it gets progressively worse to satiate the greed of corporate shareholders.

ChatGPT and Anthropic will never, ever get me to tie my Health Data to their systems, but I still believe in the capabilities of AI in identifying patterns from data I would otherwise overlook, and sorely want a local-only ecosystem where I can expose this data safely, privately, and securely to something like Qwen or Gemma for processing.

Same goes for Smart Homes, and Personal Assistants. The corporate approach of letting Company A access your data stored at Company B and processed by Companies D and E while also sold to Advertisers and Data Brokers with no way for you to extract or view it on your local hardware - just isn’t tenable for these sorts of intimate use cases. I want my data to be owned and controlled and exposed on my terms, to be used to improve my life first rather than someone else’s bottom line. I want technology to give me back more of my time and improve my outcomes again, and I’ve been burned enough by Big Tech in the past that I flatly reject any presumption of nobility or public good from their AI-as-a-Service business model.

The capability is there, and I definitely think the folks working to build local tooling that supports and unlocks the potential for local models are the ones in the right. I love seeing what they build.

hootz 5 hours ago|

The thing about "local" models for me is that they usually mean open-weight (and maybe open-source too), so they can be used locally, yes, but they can also be hosted by independent providers! With models like Qwen, DeepSeek and others, you aren't tied to a single corp, you can switch between indie providers, some of which may give you better privacy guarantees. That allows you to use the models even on devices incapable of running them, if they have an Internet connection.

The strength with AI is with open-source models. We need to keep away from vendor lock-in and use models that allow both local usage and hosting by independent providers.

ttsiodras 3 hours ago||

Interesting article.

IMHO, the author could have done two things better:

- vllm instead of llama.cpp. With NVIDIA HW, there is a huge difference in multi-user loads and caching with vllm; when he was complaining about what happens when more than one user uses the model, and about losing caching, I was "well, duh".

- The budget he used for a single card could have instead been put to far, far better use with SPARKs. I have access to a cluster of 2 x GX10 - total cost less than half what he paid, even today - and I am running vllm and Deepseek v4 Flash. The difference compared to any Qwen is tremendous - I've NEVER seen it loop, and in all my experiments so far, it's the most Sonnet-y model I've ever tried (antirez seems to agree, hence his ds4 fork).

If you're wondering about how I set it up in the 2 GX10s: https://forums.developer.nvidia.com/t/deepseek-v4-flash-offi...

Performance: 2K t/s prefill ( very useful for feeding tons of source code into its massive context window ) and around 50-60 tg/s in my coding sessions in the pi.dev harness. With the money the author paid, he could have bought 4 GX10s, and double both numbers ( vllm basically scales almost linearly with tensor parallelism ).
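
To put those throughput figures in wall-clock terms, here is a rough sketch. It assumes a simple model where latency is prefill time plus decode time; the 100K-token prompt and 1,000-token reply are made-up sizes, and the four-node rates simply take the "double both numbers" estimate at face value.

```python
def session_seconds(prompt_tokens: int, output_tokens: int,
                    prefill_tps: float, decode_tps: float) -> float:
    """Rough wall-clock estimate: time to ingest the prompt plus time to generate."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# Hypothetical request: 100K tokens of source code in, 1,000 tokens out.
# 2,000 t/s prefill and 50 t/s decode are the figures quoted above (low end of 50-60).
two_nodes = session_seconds(100_000, 1_000, prefill_tps=2_000, decode_tps=50)

# Assumption: four nodes double both rates, as estimated above.
four_nodes = session_seconds(100_000, 1_000, prefill_tps=4_000, decode_tps=100)

print(f"2 x GX10: {two_nodes:.0f} s, 4 x GX10: {four_nodes:.0f} s")
```

Under these assumptions most of the time goes to prefill, which is why the prefill rate matters so much when feeding in large codebases.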

alexellisuk 2 hours ago|

We did run vLLM on the 3090s — measured ~3 tok/s slower on generation for our single-to-few-user pattern, plus less flexibility on quant and slower startup (actual minutes vs single digit seconds). We may do more with it again in the future - there isn't unlimited time for us to tinker, I'm sharing our journey (so far) and reasoning.

It's the right call for concurrent batched serving (barrkel's point downthread is spot on), but for how we use it llama.cpp is still better for us.

The Spark/GX10 route is a genuinely different bet though and appreciate you sharing your numbers. At the time (several months ago) the consensus was that GX10s were for fine-tuning only, and the numbers were severely low.

..and the card was never about replacing a Claude Max sub. For the workloads we actually bought it for, it's giving us 140-200 tok/s (which matters).

ttsiodras 2 hours ago||

I hear you on the insane amount of time vllm takes to launch (atlas is a move in the right direction in that regard).

But mostly I wanted to raise awareness to readers of your article that no, if you want to do inference, paying 15K for a single 96GB card almost certainly makes no sense. Buy 4 GX10s with the same money, and enjoy dramatically better models and user scalability.

Regardless - thanks for putting the effort to share your findings! I keep postponing doing the same... there's tons of things everyone is re-discovering on their own.

rs38 9 minutes ago||

wanna chime in, recently tried vLLM to consume a NVFP4 Gemma4 safetensor model and see how the batching can show up in nice t/s numbers. it's slow to start, it's Linux only, it doesn't like WSL much, ended up with either old or nightly container builds, I more or less have given up. Appreciate how llama.cpp simply works and does things fast and obvious

zmmmmm 13 hours ago||

That's a great write up.

The one thing I feel it seems to underestimate is the likelihood of improvement. Even the authors acknowledge it's not even worth comparing local models from a year ago to what we have now. In fact, people widely see Opus 4.5 in November last year - 8 months ago - as the first time agentic coding became broadly viable even with frontier hosted models.

So why would we lock in hard on any concept at this point of what a local model is and isn't good for? Whatever it is right now, it probably won't be that in a year. It might be naive optimism to think we'll ever get to long horizon tasks with models that run on consumer / pro grade hardware. But so far the naive optimists are winning.

sanderjd 13 hours ago||

Right. Opus 4.5 8 months ago, good enough for agentic coding. How far behind that are open weight models? More than 8 months? But how much more? When will they reach Opus 4.5 level? A few months from now? A year from now? Never?

theshrike79 12 hours ago|||

The power of Opus isn't just the model, it's in the harness too.

You can try it by using Opus through Github Copilot vs official Anthropic tools. You'll get very different results and experience (in my opinion).

user43928 6 hours ago|||

GitHub Copilot in vscode has two ways to access Opus: the Copilot harness or the Claude Code Agent SDK within Copilot.

And that's if we assume that the vscode GHCP default Agent ("Local") is the same as the "Copilot CLI" one that is also selectable in vscode. I have not tried that one.

A few weeks ago the Claude Code Agent SDK was much better than the default Copilot Agent, but nowadays I am not sure.

lelandbatey 50 minutes ago||||

I've tried Opus 4.6 in the Opencode harness through the Github Copilot API, and I've tried Opus 4.8 in Claude Code. I found I preferred Opus 4.6 in Opencode (and in general, I like Opencode much more in that it hid less from me). I found both to be pretty similar as far as efficacy (I was surprised that Opus 4.8 felt like such a minor improvement over 4.6).

throwa356262 9 hours ago||||

open source harnesses are also improving rapidly.

Some people would claim they are already far better than CC and Codex.

larsnystrom 11 hours ago|||

I’ve only used Opus in GitHub copilot and was hugely underwhelmed. It was barely usable. Are you saying it’s better with the official Anthropic tools?

theshrike79 9 hours ago|||

Night and day in my opinion. But these are all purely Feels so YMMV etc.

I like how especially the Claude Code CLI version communicates how it's progressing, something they hide a lot more on the desktop app for example.

m-ee 11 hours ago|||

I don't know about better but it's certainly different. It's painfully slow through claude code vscode extension compared to copilot but maybe "smarter", I feel like I have to correct it less using sonnet on both. I don't use opus much because of the cost but coworkers say the difference between harnesses there is also pronounced.

theplumber 13 hours ago||||

I think in the next 6 months we will have Opus 4.5 performance in open models. We are very close

krzyk 8 hours ago||

We need first to reach level of Sonnet 4.x, we aren't at that level yet.

BoorishBears 6 hours ago||

GLM 5.2 is comfortably at Sonnet 4 at the very least. Same with Minimax M3

marak830 13 hours ago||||

GLM 5.2 came out today and the early reports have been quite good. Very difficult to run except on prosumer hardware, but small business could quite easily (or something like open router).

sleepyeldrazi 6 hours ago|||

Opus also has a deeply ingrained personality that always de-rails sneakily into what it's taught, not what the user intends. This is good if the user doesn't know the details of the work they need performed and a huge time waste when the user knows exactly how something needs to be implemented.

I have found claude models, especially fable, to be impossible to work with when the work requires reading papers from days ago and reasoning on top of the findings in them. I have multiple long sessions with opus (not as many with fable as it got taken down quickly) where it keeps fighting me on problems, saying "that's not how it works" / "that is not possible", followed by me linking the paper (after I've told it to actually read up on the latest research in this field), and it hits me with the usual "You were right.". If your workflow is using the exact tools, frameworks, git layouts that claude expects, it can be magical, yes. But it is very heavily optimized to never say 'I am not sure' (as that gives 'bad vibes') and instead lean on its (nowadays with the speed of things DOE) knowledge to formulate a reasonable sounding answer, dissectible only if you already know the answer beforehand (which defeats the purpose of using it in the first place).

Qwen3.6 27B (the only <100B model worth looking at in my experience) is dumb, knows it, and will fight tooth and nail to complete the task it was given, gaining the needed context (online or file-wise) in the meantime. If you mention it should read papers, it goes and reads a pile of papers. If you tell it 'implement MCP in my app', the result will (probably) be catastrophic. If you instead describe where the feature should sit, how it should handle edge cases, what use cases it needs to attend to, and to first look online for reference implementations, it does it and does it well.
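
As a minimal sketch of what "describe it properly" could look like in practice: the helper below renders a task as tagged sections rather than a one-line request. The function name, the tag names, and the example path are all made up for illustration; nothing here is a format Qwen requires, it is just one way to give the model a shape to fill in.

```python
from xml.sax.saxutils import escape


def build_task_prompt(feature: str, location: str,
                      edge_cases: list[str], use_cases: list[str],
                      research_first: bool = True) -> str:
    """Render a task as tagged sections instead of a one-line request."""
    def section(tag: str, items: list[str]) -> list[str]:
        body = [f"    <item>{escape(item)}</item>" for item in items]
        return [f"  <{tag}>", *body, f"  </{tag}>"]

    lines = ["<task>",
             f"  <feature>{escape(feature)}</feature>",
             f"  <location>{escape(location)}</location>"]
    lines += section("edge_cases", edge_cases)
    lines += section("use_cases", use_cases)
    if research_first:
        # Mirrors the "first look online for reference implementations" advice.
        lines.append("  <first_step>Look up reference implementations online "
                     "before writing any code.</first_step>")
    lines.append("</task>")
    return "\n".join(lines)


# All values below are hypothetical placeholders.
prompt = build_task_prompt(
    feature="MCP support",
    location="src/integrations/mcp/ (hypothetical path)",
    edge_cases=["server unreachable", "tool call returns malformed JSON"],
    use_cases=["list tools", "call a tool from the chat UI"],
)
print(prompt)
```

The point is only that the location, edge cases, and use cases are spelled out up front instead of being left for the model to guess.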

Knowing what is in context, what should and shouldn't be there, and how to manage it for the specific model you are using (as every model, even in the same family, behaves differently to differently worded prompts) is what makes or breaks them. They are just auto-complete, they complete text based on what is already there, it's not magic.

So yes, while these small open-weights models are not opus 4.5, they're good precisely because of that, because they are a good tool and a bad 'coworker replacement'. If you want the latter, kimi is already there, it has started to not believe the user and do what it was taught just like claude models (which is helpful when you don't care about implementation specifics or performance/security). GLM models (mostly 5.1, I haven't tested 5.2 extensively yet) have fixed a lot of low-level programming issues I've had that opus just walks in circles on and writes reports that "it doesn't/can't work". That is to say, open-weights, in many cases, have already surpassed Opus. I can't comment on gpt 5.5, but while I used 5.4, it also performed a lot more tasks without being fussy than opus 4.6/4.7.

hypfer 6 hours ago||

> I have multiple long sessions with opus (not as many with fable as it got taken down quickly) where it keeps fighting me on problems, saying "that's not how it works" / "that is not possible", followed by me linking the paper (after I've told it to actually read up on the latest research in this field), and it hits me with the usual "You were right.".

I genuinely do not understand why people not only just put up with this but also pay _a lot of money_ for the _privilege_ of doing so.

It's like having _the worst_ colleague but you actually go out of your way to talk with the guy. Why.

3abiton 10 hours ago|||

And a big thing that's missing is ... the harness comparison. It plays a very big role. I use forge, and I have been impressed with what it can do given all the limitations of local models.

rippeltippel 13 hours ago|||

Since the author is referring to a specific model, I think it makes sense to ignore how the model (or local models in general) may improve over time.

It's like buying a car: I drive that car and get attuned to its characteristics; I don't think how that car (or similar cars) may improve. That's my tool and I want to make the most of it.

It is true that switching local models is technically very cheap, but there's a considerable time investment in squeezing the most out of one, which may not carry over to a newer version of that model.

appplication 13 hours ago||

Agree 100%, even on claude 4.5 being the turning point for agentic coding. It completely turned me around on it.

hypfer 13 hours ago||

That was a lot of text for me still having no idea what the point of the author was (beside what I can infer from the headline that is).

I do however now know that they're a totally cool dude building stuff physically and as software + that other people give them money for it.

Does that have anything to do with the topic suggested by the headline? Not sure.

neonstatic 12 hours ago|

Everything is an ad these days. The article was not useless, but for the information it provides, it could have been two paragraphs.

hypfer 12 hours ago||

FWIW it told me stuff about openfaas. Now I know how to mentally file it and how to mentally file the author. The GitHub profile alone might not have sent the same signal, so this is useful.

Is it bad software? Idk. Probably not.

Should you treat it as a grassroots Foss thing maintained by fellow sane hackers? No sir.

gpt5 14 hours ago||

This article is a good summary of local models. Unlike the way they are hyped sometimes, as fantastic tools for coding and agentic local work. The reality is that they are rather limited, would not do well on a long or complex task, and are prone to fall into loops, forget their tasks, etc. Not mentioned in the article is that they are also rather expensive - not just for the hardware cost, but also electricity. These 3090 and 5090 machines are pretty power hungry, and these models are pretty slow on these machines, making them consume more power per token.

Where they shine is in your ability to control them, their privacy, their predictability (e.g. if you are doing a repetitive task, like classifying your photo/video library), and depending on your energy bill - their costs.

usernomdeguerre 13 hours ago||

I believe that local models are a necessary extension of the personal computer and I imagine that one could have had similar criticisms of early personal computers.

pmontra 13 hours ago||

Of course the early MSDOS PCs were loud and power hungry. I can't remember the specs but according to Wikipedia the IBM PC with an 80286 had a 192 Watt power supply. I don't remember if by then we had internal hard disks or we still had to buy a case as large as the one of the PC with a 10 or 20 MB disk inside. It was handy to raise the monitor further up.

theshrike79 12 hours ago|||

My dream would be a local model that can do, say, 80% of the day to day tasks I need; "how does X Handler connect to Y storage?", "commit that feature, but leave out the bits that relate to billing" etc.

It would have 99% reliable tool calling - and most importantly - the ability to go "this task is beyond my skills" and refer to a Big Boy Online Model in a gigantic datacenter somewhere.

This way all of the simple stuff would be done on-device, gathering data, figuring out the context of the problem etc. And when that's done, the "smart" model would come in to work on the issue when all of the easy stuff is already done.

It feels super stupid that my /commit skill calls an online model when that is something a local model can 100% do. Mostly this is a harness issue though and mostly solvable.

redrove 10 hours ago|||

> My dream would be a local model that can do, say, 80% of the day to day tasks I need; "how does X Handler connect to Y storage?", "commit that feature, but leave out the bits that relate to billing" etc.

Qwen 3.6 27B can do that today, but only when set up properly and in a good quant. I run an autoround [0] with weights in int8 and attention heads in f16 on a single RTX 6000 Pro Blackwell Max-Q via vllm with mtp=2 and full context, --max-num-seqs 3, KV in f16, mamba f32.

>It would have 99% reliable tool calling

I managed to score 93/100 in tool-eval-bench [1]. For me this is very good already, at least in the pi coding harness I've never had an issue that wasn't auto-fixed in the next turn(s).

>the ability to go "this task is beyond my skills" and refer to a Big Boy Online Model in a gigantic datacenter somewhere

This is heavy on the harness engineering side I think, but also quite contrary to the nature of LLMs today. If you figure this out I'd love to know.

[0] https://huggingface.co/Minachist/Qwen3.6-27B-INT8-AutoRound/...

[1] https://github.com/SeraphimSerapis/tool-eval-bench

walthamstow 10 hours ago|||

Claude kind of has this already in their Advisor feature. I don't think I've seen it elsewhere. Open harnesses could add this feature and call out to big boy models when required. It's a really great idea.

girvo 9 hours ago||

It’s a lot harder to get right than it sounds. I’ve been trying to as a Pi extension, but models are biased to think they’re better than they actually are.

So far the best results I’ve got have been using a much smaller local model as a simple classifier, that makes a call based on the system prompt and incoming prompt where to route it. It works okay, still a long way to go though
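
A rough sketch of that routing shape, for anyone curious. The classifier here is a crude keyword heuristic standing in for the small local model, purely so the example runs on its own; the function names, the marker words, and the 0.5 threshold are all invented for illustration and are not what my extension actually does.

```python
from typing import Callable

# (system_prompt, user_prompt) -> estimated difficulty in [0, 1]
Classifier = Callable[[str, str], float]


def route(system_prompt: str, user_prompt: str,
          classify: Classifier, threshold: float = 0.5) -> str:
    """Return "remote" when the difficulty score reaches the threshold, else "local"."""
    score = classify(system_prompt, user_prompt)
    return "remote" if score >= threshold else "local"


def keyword_classifier(system_prompt: str, user_prompt: str) -> float:
    """Stand-in for a small classifier model: counts a few 'hard task' keywords."""
    hard_markers = ("refactor", "architecture", "design", "migrate")
    text = f"{system_prompt} {user_prompt}".lower()
    hits = sum(marker in text for marker in hard_markers)
    return min(1.0, hits / 2)


system = "You are a coding agent."
easy = route(system, "commit that feature, but leave out the billing bits",
             keyword_classifier)
hard = route(system, "redesign the storage architecture and migrate the data",
             keyword_classifier)
print(easy, hard)
```

Swapping `keyword_classifier` for a call into a real small model keeps the same interface; the hard part is getting a score you can trust, not the plumbing.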

regularfry 10 hours ago|||

I've been getting 40-50t/s out of qwen3.6:27b on a 4090 limited to 350W with the MTP changes that went in. That comes out at 8.75J/t at the upper end. No idea how that compares with anything else out there. I'd expect a 5090 to be a bit cheaper because it'd be faster within the same power limit.
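
The arithmetic behind that figure, as a small sketch. It assumes the card draws the full 350 W limit the whole time while generating and ignores the rest of the system (CPU, PSU losses), so treat the results as rough estimates rather than measurements.

```python
def joules_per_token(power_watts: float, tokens_per_second: float) -> float:
    """Energy per generated token: watts are joules per second."""
    return power_watts / tokens_per_second


def kwh_per_million_tokens(power_watts: float, tokens_per_second: float) -> float:
    """Convert J/token to kWh per million tokens (1 kWh = 3.6e6 J)."""
    return joules_per_token(power_watts, tokens_per_second) * 1_000_000 / 3_600_000


for tps in (40, 50):
    print(f"{tps} t/s at 350 W: {joules_per_token(350, tps):.2f} J/t, "
          f"{kwh_per_million_tokens(350, tps):.2f} kWh per million tokens")
```

The 8.75 J/t number corresponds to the slower 40 t/s end of the range; at 50 t/s the same power limit works out to 7 J/t.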

i_idiot 13 hours ago|||

> Unlike the way they are hyped sometimes, as fantastic tools for coding and agentic local work.

They really are fantastic for a lot of use cases and I think most people do not need SOTA. When I run that qwen model in my measly 4070 12 GB for my personal email agent that I build and experiment with, I need privacy more than anything else. It does a great job. Even for coding tasks, given you know how to use them instead of dumping a grand plan, it's great.

throw310822 11 hours ago||

> I think most people do not need SOTA

SOTA can code but can also prove theorems and teach you about music theory or ancient Greece's substrate language or botany. Speaking in tens of different languages. I wonder how many hundreds of billions of parameters can be saved just by removing much of the general knowledge parts while keeping logical and programming abilities the exact same.

trey-jones 8 hours ago||

Exactly. I have sort of a fetish for trying to make things smaller by trimming out things that aren't needed. Unfortunately this skill has been largely useless since forever, because hardware improves to the point that these optimizations are trivial:

Network Bandwidth, Storage space and speed, memory capacity. While all of these were worth optimizing for at a point in history, that point is behind us. It's probably a reasonable expectation that it will eventually be true for VRAM.

sanderjd 13 hours ago||

But that's current hardware. What about future hardware? What about hardware optimized for inference? What about hardware optimized to run a particular model?

barrkel 11 hours ago||

I found it interesting that vLLM was dismissed as slower than llama.cpp.

IME vLLM is quite a bit faster than llama.cpp but where it really wipes the floor with it is in batching concurrent load. The downside is that it is dramatically less flexible in terms of tweaking. It gives you very few options for running quantized weights. It takes a lot longer to start up because it optimizes the compute graph. So for single user experimentation on a model that's a bit too big for your box, vLLM is just going to be frustrating.

alexellisuk 2 hours ago||

vLLM is great at continuous batching and model serving in production, but it's a very different beast and much less versatile for the prosumer category (where we sit for our usage)

Dismissed is a strong term, but let me give you some more details.

It took a good 4 minutes plus to load up on the 2x 3090 rig, and served a single request 3 tokens/second slower.

And the worst bit? With all that work - setting it up and tuning it - it still looped. I was hoping the "just use vLLM" advice that gets touted everywhere was the silver bullet.

The only thing I'd caution here is that we don't start bashing on llama.cpp like people did with Ollama. It's a very capable tool and for the use-cases we actually want the card for makes more sense.

For a large team replacing their Claude Subs perhaps vLLM is the only option, but you really need to add about 5 more RTX 6000 cards into the mix, so you can load something like GLM 5.2.

lelandbatey 38 minutes ago||

Bashing on ollama is totally warranted, since ollama is a UI skin around llama.cpp and that's it. If all you cared about was "I want to run a model and use it via an API" then the only thing it did was give you a GUI to download models (vs browsing HuggingFace yourself and downloading .gguf files yourself) and a GUI with a button labeled "run" (instead of a run.sh or run.bat script launching llama-server).

That's not _nothing_, but it's pretty close to nothing, and for the prosumer crowd it edges towards "just gets in the way".

chartered_stack 10 hours ago|||

One could say: vLLM isn't a worse Llama.cpp, it's a different tool

krzyk 8 hours ago|||

AFAIR the general consensus is (was?):

- llama.cpp for single user

- vLLM for multi-user (e.g. enterprises)

They are similar, but for different use cases.

navbaker 5 hours ago||

Yeah, I was a bit baffled by the author complaining about cache prefixes getting destroyed when more than one user hit the model, but then continuing to use llama.cpp instead of switching to vLLM.

mistercheese 1 hour ago||

I’m not sure if I missed it, but I’m curious how you feel about cloud hosted models with ZDR policies? GLM5.2 or even Minimax M3 on Fireworks or Together ai should be still relatively/consistently cheap and private but a lot more capable and easier to setup?

alexellisuk 55 minutes ago|

Thanks for the comment. ZDR is mentioned in the post - in particular, many of the coding plans that are not from the two major leaders have questionable IP/ownership claims on inputs/outputs :)

And ZDR is still data sharing with a third party. This is the essence of an enterprise agreement, it's not allowed, even if they pinkie promise not to store it.

If your customers allow you to share their data with third parties, then ZDR may be an option for you. I am not a lawyer.

Where I see ZDR as being more relevant is in protecting your employer's IP - not allowing a missed setting to mean AI labs can train, retain, and publish/resell your work. It's what we'll consider when the subsidies stop being available - open-router, ZDR - but for coding - not for customer data. Very important distinction.

selfawareMammal 26 minutes ago|

I am not a worse player than Messi, I'm just a different player.
