Posted by aray07 12 hours ago
I am finding that for complex tasks, Claude's quality of output varies _tremendously_ with repeated runs of the same model and prompt. For example, last week I wrote up (with my own brain and keyboard) a somewhat detailed plain English spec of a work-related productivity app that I've always wanted but never had the time to write. It was roughly the length of an average college essay. The first thing I asked Claude to do was not write any code, but come up with a more formal design and implementation plan based on the requirements that I gave. The idea was to then hand _that_ to Claude and say, okay, now build it.
I used Opus 4.6 with High reasoning for all of this and did not change any model settings between runs.
The first run was overall _amazing_. It was detailed, well-written, and contained everything that I asked for. The only drawback was that I was ambiguous on a couple of points, which meant the model went off and designed something in a way that I wasn't expecting and didn't intend. So I cleared that up in my prompt, and instead of keeping the context and building on what was already there, I started a new chat and had it start again from scratch.
What it wrote the second time was _far_ less impressive. The writing was terse, there was a lot less detail, the pretty dependency charts and various tables it made the first time were all gone. Lots of stuff was underspecified or outright missing.
New chat, start again. Similar results to the second run, maybe a bit worse. It also started _writing code_, which was something I told it NOT to do. At this point I'm starting to panic a little, because I'm sure I didn't add "oh, and make it crappy" to the prompt, and I was a little angry about not saving the first iteration, since it was fairly close to what I had wanted anyway.
I decided to try one last time, and it finally gave me back something within about 95% of the first run in terms of quality, but with all the problems fixed. So I was (finally) happy with that, and Claude used it to generate the application surprisingly well, with only a few issues that shouldn't be too hard to fix after the fact.
So I guess the fourth time was the charm, and the fare was about $7 in tokens to get there.
This is not so much about my instructions being followed more closely. It's the LLM being smarter about what's going on and, for example, saving me time on unnecessary expeditions. This is where models have most notably been getting better, in my experience. Understanding the bigger picture. Applying taste.
It's harder to measure, of course, but, at least for my coding needs, there is still a lot of room here.
If one session costs an additional 20%, that's completely fine if that session gets me 20% closer to a finished product (or: not 20% further away). Even 10% closer would probably still be entirely fine, given how cheap it is.
Except it's not that trivial to solve. I experimented with asking the model to first give a list of symbols it will modify, and then write just the modified symbols. The results were OK, but less refined than when it echoes back the entire file.
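For the curious, the splice step of that experiment looks roughly like the sketch below. This is a minimal version assuming Python source and top-level symbols only; `replace_symbol` is my own illustrative name, not any tool's API, and it ignores decorators:

```python
# Minimal sketch: splice a model-rewritten symbol back into a file.
# Assumes Python source and top-level defs/classes; ignores decorators.
import ast

def replace_symbol(source: str, name: str, new_code: str) -> str:
    """Swap one top-level function/class for the model's rewrite."""
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                             ast.ClassDef)) and node.name == name:
            lines = source.splitlines(keepends=True)
            start, end = node.lineno - 1, node.end_lineno
            return ("".join(lines[:start])
                    + new_code.rstrip("\n") + "\n"
                    + "".join(lines[end:]))
    raise KeyError(f"symbol {name!r} not found")
```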
The way I see it is that when you echo back the entire file, the process of thinking "should I do an edit here" is distributed over a longer span, so it has more room to make a good decision. Like instead of asking "which 2 of the 10 functions should you change" you're asking it "should you change method1? what about method2? what about method3?", etc., and that puts less pressure on the LLM.
Except, currently we are effectively paying for the LLM to make that decision for *every token*, which is terribly inefficient. So, there has to be some middle ground between expensively echoing back thousands of unchanged tokens and giving an error-ridden high-level summary. We just haven't found that middle ground yet.
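For what it's worth, one middle ground that some harnesses already use is search/replace edits (aider popularized the format): the model emits only the changed region plus enough surrounding context to locate it uniquely. A minimal sketch, with names of my own choosing:

```python
# Sketch of a search/replace edit applier: the model sends only the
# changed region ("old") plus context, and we splice in "new".
def apply_edit(source: str, old: str, new: str) -> str:
    hits = source.count(old)
    if hits == 0:
        raise ValueError("anchor not found; model may have echoed stale code")
    if hits > 1:
        raise ValueError("anchor is ambiguous; need more context lines")
    return source.replace(old, new, 1)
```

The two failure modes in the sketch are exactly where these formats break down in practice: the model hallucinates code that isn't in the file, or provides too little context to disambiguate.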
grit.io was working on this years ago (not sure if they're still around), but I liked their approach; it just had a very buggy transformer/language.
I thought coding harnesses provided tools to apply diffs so the LLM didn't have to echo back the entire file?
They do, but in practice many tools still work at the file level.
Claude Code on Opus running continuously = the whole bill. That's a different measurement.
Haiku 4.5 is good enough for fanout. Opus earns its cost on synthesis, where you need long context plus complex problem-solving under constraints.
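A rough sketch of that split using the Anthropic Python SDK; the model IDs are placeholders for whatever the current Haiku/Opus names are:

```python
# Fanout/synthesis sketch: cheap model per independent chunk,
# expensive model for the cross-file combination step.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask(model: str, prompt: str) -> str:
    resp = client.messages.create(
        model=model,
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

def review(files: dict[str, str]) -> str:
    # Fanout: the cheap model handles each independent chunk.
    notes = [ask("claude-haiku-4-5", f"List the risky changes in:\n\n{src}")
             for src in files.values()]
    # Synthesis: the expensive model gets the long-context reasoning.
    return ask("claude-opus-4-6",
               "Merge these notes into one prioritized review:\n\n"
               + "\n---\n".join(notes))
```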
And if it's not good enough for coding, what kind of money, if any, would make it good enough?
Do yourself a favor: Set up OpenCode and OpenRouter, and try all the models you want to try there.
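OpenRouter exposes an OpenAI-compatible endpoint, so comparing models is a few lines of Python. The model slugs below are examples only; browse openrouter.ai/models for current names:

```python
# Compare models side by side through OpenRouter's OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

prompt = "Sketch a design for a CLI TODO app. No code yet."
for model in ["anthropic/claude-opus-4", "qwen/qwen3-coder", "z-ai/glm-4.5"]:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"=== {model} ===\n{resp.choices[0].message.content[:300]}\n")
```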
Other than the top performers (e.g. GLM 5.1 and Kimi K2.5, for which the required hardware is basically unaffordable for a single person), the open models are more trouble than they're worth IMO, at least for now, in terms of actually Getting Shit Done.
Open models are not bullshit; they work fine for many cases, and newer techniques like SSD offload make even 500B+ models accessible for simple uses (NOT real-time agentic coding!) on very limited hardware. Of course, if you want the full-featured experience, it's going to cost a lot.
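Concretely, the simplest form of that trick is weight mmap: llama.cpp memory-maps GGUF weights, so pages fault in from disk as needed and a model larger than RAM can still answer, slowly. A hedged sketch with llama-cpp-python; the file path and settings are illustrative:

```python
# Run a quantized model straight off SSD via memory-mapped weights.
from llama_cpp import Llama

llm = Llama(
    model_path="models/big-moe-q4.gguf",  # hypothetical quantized file
    use_mmap=True,    # map weights from disk instead of loading them all
    n_gpu_layers=0,   # CPU-only; untouched pages never leave the SSD
    n_ctx=4096,
)
out = llm("Q: What does mmap buy us here? A:", max_tokens=64)
print(out["choices"][0]["text"])
```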
People who love open models dramatically overstate how good the benchmaxxed open models are. They are nowhere near Opus.
I love my little hobby aquarium, though... It's pretty impressive what Qwen Coder Next and Qwen 3.5 122B can accomplish (in terms of general agentic use and basic coding tasks), considering that the models are freely available. (I've also heard good things about Qwen 3.5 27B, but haven't used it much... yes, I am a Qwen fanboi.)
Just because you can't figure out how to use the open models effectively doesn't mean they're bullshit. It just takes more skill and experience to use them :)
Fun fact: AWS offers Apple silicon EC2 instances you can spin up to test.
I took the plan I'd gotten from Codex and handed it to OpenCode with Qwen 3.5 running locally.
It created a library very similar to Codex's, but took 2x as long.
I haven't tried Qwen 3.6, but I hear it's another improvement. I'm confident enough in my AI skills that if/when the cheap/subsidized models go away, I'll be fine running locally.
Many providers out there host open-weights models for cheap; try them out and see what you think before actually investing in hardware to run your own.
The best bang for the buck right now is subscribing to token plans from Z.ai (GLM 5.1), MiniMax (MiniMax M2.7), or Alibaba Cloud (Qwen 3.6 Plus).
Running quantized models won't give you results comparable to Opus or GPT.