Posted by y42 9 hours ago
Then hand over to Claude Sonnet.
With hard requirements listed, I found out that the generated code missed requirements, had duplicate code or even unnecessary code wrangling data (mapping objects into new objects of narrower types when that isn't needed) along with tests that fake things and use workarounds just to pass.
So turns out that I'm not writing code but I'm reading lots of code.
The fact that I know first hand prior to Gen AI is that writing code is way easier. It is reading the code, understanding it and making a mental model that's way more labour intensive.
Therefore I need more time and effort with Gen AI than I needed before because I need to read a lot of code, understand it and ensure it adheres to what mental model I have.
Hence Gen AI at this price point which Anthropic offers is a net negative for me because I am not vibe coding, I'm building real software that real humans depend upon and my users deserve better attention and focus from me hence I'll be cancelling my subscription shortly.
I think the AI companies all stink to high heaven and the whole thing being built on copyright infringement still makes me squirm. But the latest models are stupidly smart in some cases. It's starting to feel like I really do have a sci-fi AI assistant that I can just reach for whenever I need it, either to support hard thinking or to speed up or entirely avoid drudgery and toil.
You don't have to buy into the stupid vibecoding hype to get productivity value out of the technology.
You of course don't have to use it at all. And you don't owe your money to any particular company. Heck for non-code tasks the local-capable models are great. But you can't just look at vibecoding and dismiss the entire category of technology.
Anecdata, but I'm still finding CC to be absolutely outstanding at writing code.
It's regularly writing systems-level code that would take me months to write by hand in hours, with minimal babysitting, basically no "specs" - just giving it coherent sane direction: like to make sure it tests things in several different ways, for several different cases, including performance, comparing directly to similar implementations (and constantly triple-checking that it actually did what you asked after it said "done").
For $200/mo, I can still run 2-3 clients almost 24/7 pumping out features. I rarely clear my session. I haven't noticed quality declines.
Though, I will say, one random day - I'm not sure if it was dumb luck - or if I was in a test group, CC was literally doing 10x the amount of work / speed that it typically does. I guess strange things are bound to happen if you use it enough?
Related anecdata: IME, there has been a MASSIVE decline in the quality of claude.ai (the chatbot interface). It is so different recently. It feels like a wanna-be crappier version of ChatGPT, instead of what it used to be, which was something that tried to be factual and useful rather than conversational and addictive and sycophantic.
A small app, or a task that touches one clear smaller subsection of a larger codebase, or a refactor that applies the same pattern independently to many different spots in a large codebase - the coding agents do extremely well, better than the median engineer I think.
Basically "do something really hard on this one section of code, whose contract of how it interacts with other code is clear, documented, and respected" is an ideal case for these tools.
As soon as the codebase is large and there are gotchas, edge cases where one area of the code affects the other, or old requirements - things get treacherous. It will forget something was implemented somewhere else and write a duplicate version, it will hallucinate what the API shapes are, it will assume how a data field is used downstream based on its name and write something incorrect.
IMO you can still work around this and move net-faster, especially with good test coverage, but you certainly have to pay attention. Larger codebases also work better when you started them with CC from the beginning, because its older code is more likely to actually work how it expects/hallucinates.
Agreed, but I'm working on something >100k lines of code total (a new language and a runtime).
It helps when you can implement new things as if they're green-field-ish AND THEN implement and plumb them later.
I have my own anecdata but my comment is more about the dissonance here.
For example I’m working on a huge data migration right now. The data has to be migrated correctly. If there are any issues I want to fail fast and loud.
Claude hates that philosophy. No matter how many different ways I add my reasons and instructions to stop it to the context, it will constantly push me towards removing crashes and replacing them with “graceful error handling”.
If I didn’t have a strong idea about what I wanted, I would have let it talk me into building the wrong thing.
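To make the "fail fast and loud" idea concrete, here is a minimal Python sketch. The row shape, the `migrate_row` and `migrate_all` helpers and the `MigrationError` exception are all hypothetical, invented for illustration; nothing here is taken from the actual migration.

```python
class MigrationError(Exception):
    """Raised as soon as a row cannot be migrated exactly."""


def migrate_row(row: dict) -> dict:
    # Hypothetical target schema: integer id, non-empty email string.
    try:
        new_id = int(row["id"])
    except (KeyError, TypeError, ValueError) as exc:
        raise MigrationError(f"bad id in row {row!r}") from exc
    email = row.get("email")
    if not isinstance(email, str) or not email.strip():
        raise MigrationError(f"missing email in row {row!r}")
    return {"id": new_id, "email": email.strip().lower()}


def migrate_all(rows: list) -> list:
    # Deliberately no try/except here: the first bad row aborts the whole
    # run instead of being logged, skipped, or replaced with a default.
    return [migrate_row(row) for row in rows]
```

The "graceful" variant would wrap `migrate_row` in a try/except and carry on, which is exactly what leaves a migration quietly incomplete.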
Claude has no taste and its opinions are mostly those of the most prolific bloggers. Treating Claude like a peer is a terrible idea unless you are very inexperienced. And even then I don’t know if that’s a good idea.
I often think that LLMs are like a reddit that can talk. The more I use them, the more I find this impression to be true - they have encyclopedic knowledge at a superficial level, the approximate judgement and maturity of a teenager, and the short-term memory of a parakeet.
That’s amazing and incredible, and probably more knowledgeable than the median person, but would you outsource your thinking to reddit? If not, then why would you do it with an LLM?
This is one variable I almost always see in this discussion: the more strict the rules that you give the LLM, the more likely it is to deeply disappoint you
The earlier in the process you use it (ie: scaffolding) the more mileage you will get out of it
It's about accepting fallibility and working with it, rather than trying to polish it away with care
And sure, AI could “scaffold” further into controllers and views and maybe even some models, and they probably work ok. It’s then when they don’t, or when I need something tweaked, that the worry becomes “do I really understand what’s going on under the hood? Is the time to understand that worth it? Am I going to run across a small thread that I end up pulling until my 80% done sweater is 95% loose yarn?”
To me the trade-off hasn’t proven worth it yet. Maybe for a personal pet project, and even then I don’t like the idea of letting something else nondeterministically touch my system. “But use a VM!” they say, but that’s more overhead than I care for. Just researching the safest way to bootstrap this feels like more effort than value to me.
Lastly, I think that a big part of why I like programming is that I like the act of writing code, understanding how it works, and building something I _know_.
Doing nonsensical things with a library? Feed it the documentation. Still busted? Make it read the source.
If you do spot checks, that is woefully inadequate. I have lost count of the number of times when, poring over code a SOTA LLM has produced, I notice a lot of subtle but major issues (and many glaring ones as well), issues a cursory look is unlikely to pick up on. And if you are spending more time going over the code, how is that a massive speed improvement like you make it seem?
And, what do you even mean by 10x the amount of work? I keep saying anybody that starts to spout these sort of anecdotes absolutely does NOT understand real world production level serious software engineering.
Is the model doing 10x the amount of simplification, refactoring, and code pruning an effective senior level software engineer and architect would do? Is it doing 10x the detailed and agonizing architectural (re)work that a strong developer with honed architectural instincts would do?
And if you tell me it's all about accepting the LLM being in the driver's seat and embracing vibe coding, it absolutely does NOT work for anything exceeding a moderate level of complexity. I have tried that several times. Up to now no model has been able to write a simple markdown viewer with certain specific features I have wanted for a long time. I really doubt the stories people tell about creating whole compilers with vibe coding.
If all you see and appreciate is that it is pumping out 10x features, 10x more code, you are missing the whole point. In my experience you are actually producing a ton of sh*t, sorry.
Honestly, this is more of a question about the scope of the application and the potential threat vectors.
If the GP is creating software that will never leave their machine(s) and is for personal usage only, I'd argue the code quality likely doesn't matter. If it's some enterprise production software that hundreds to millions of users depend on, software that manages sensitive data, etc., then I would argue code quality should asymptotically approach perfection.
However, I have many moons of programming under my belt. I would honestly say that I am not sure what good code even is. Good to who? Good for what? Good how?
I truly believe that most competent developers (however one defines competent) would be utterly appalled at the quality of the human-written code on some of the services they frequently use.
I apply the Herbie Hancock philosophy when defining good code. When once asked what is Jazz music, Herbie responded with, "I can't describe it in words, but I know it when I hear it."
That’s the problem. If we had an objective measure of good code, we could just use that instead of code reviews, style guides, and all the other things we do to maintain code quality.
> I truly believe that most competent developers (however one defines competent) would be utterly appalled at the quality of the human-written code on some of the services they frequently use.
Not if you have more than a few years of experience.
But what your point is missing is the reason that software keeps working in the first place, or stays in a good enough state that development doesn’t grind to a halt.
There are people working on those code bases who are constantly at war with the crappy code. At every place I’ve worked over my career, there have been people quietly and not so quietly chipping away at the horrors. My concern is that with AI those people will be overwhelmed.
They can use AI too, but in my experience, the tactical tornadoes get more of a speed boost than the people who care about maintainability.
I am not a lawyer, but am generally familiar with two "is it fair use" tests.
1. Is it transformative?
I take a picture, I own the copyright. You can't sell it. But if you take a copy, and literally chop it to pieces, reforming it into a collage, you can sell that.
2. Does the alleged infringing work devalue the original?
If I have a conversation with AI about "The Lord of the Rings", even if it reproduces good chunks of the original, it does not devalue the original... in fact, I would argue, it enhances it.
Have I failed to take into account additional arguments and/or scenarios? Probably.
But, in my opinion, AI passes these tests. AI output is transformative, and in general, does not devalue the original.
And they are making money off of other people's work. Sure, you can use mental jiujutsu to make it fair use. But fair use for LLMs means you basically copy the whole thing. All of it. It sounds more like a total use to me.
I hope the free market and technology catches up and destroys the VC backed machinery. But only time will tell.
Seriously though, I do think that is the case. It would be self-righteous to argue otherwise. It's just the scale and the nature of this, that makes it so repulsive. For my taste, copying something without permission, is stealing. I don't care what a judge somewhere thinks of it. Using someone's good will for profit is disgusting. And I hope we all get to profit from it someday, not just a select few. But that is just my opinion.
They just stole everyone's hard work over decades to make this or it wouldn't have been useful at all.
The fact of the matter is that for profit corporations consumed the sum knowledge of mankind with the intent to make money on it by encoding it into a larger and better organized corpus of knowledge. They cited no sources and paid no fees (to any regular humans, at least).
They are making enormous sums of money (and burning even more, ironically) doing this.
If that doesn't violate copyright, it violates some basic principle of decency.
That's vibecoding with an extra documentation step.
Also, Sonnet is not the model you'd want to use if you want to minimize cleanup. Use the best available model at the time if you want to attempt this, but even those won't vibecode everything perfectly for you. This is the reality of AI, but at least try to use the right model for the job.
> Therefore I need more time and effort with Gen AI than I needed before
Stop trying to use it as all-or-nothing. You can still make the decisions, call the shots, write code where AI doesn't help and then use AI to speed up parts where it does help.
That's how most non-junior engineers settle into using AI.
Ignore all of the LinkedIn and social media hype about prompting apps into existence.
EDIT: Replaced a reference to Opus and GPT-5.5 with "best available model at the time" because it was drawing a lot of low-effort arguments
It is NOT the way to work with humans basically because most software engineers I worked with in my career were incredibly smart and were damn good at identifying edge cases and weird scenarios even when they were not told and the domain wasn't theirs to begin with. You didn't need to write lengthy several page long Jira tickets. Just a brief paragraph and that's it.
With AI, you need to spell everything out in detail. But that's NO guarantee either because these models are NOT deterministic in their output. Same prompt different output each time. That's why every chat box has that "Regenerate" button. So your output with even a correct and detailed prompt might not lead to correct output. You're just literally rolling a dice with a random number generator.
Lastly - no matter how smart and expensive the model is, the underlying working principles are the same as GPT-2. Same transformers with RL on top, same random seed, same list of probabilities of tokens and same temperature to select randomly one token to complete the output and feedback in again for the next token.
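The token-selection step described above can be sketched in a few lines of Python. This is a toy illustration of temperature sampling, not any vendor's actual decoder: the `sample_token` helper and the logit values are made up for the example.

```python
import math
import random


def sample_token(logits: dict, temperature: float, rng: random.Random) -> str:
    """Pick one token from softmax(logits / temperature)."""
    if temperature <= 0:
        # Treat zero temperature as greedy decoding: always the top token.
        return max(logits, key=logits.get)
    scaled = {tok: value / temperature for tok, value in logits.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(value - top) for tok, value in scaled.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


# Hypothetical scores for the next token; not taken from any real model.
logits = {"result": 2.0, "None": 1.5, "data": 0.5}

# Greedy decoding ignores the random source entirely.
greedy = {sample_token(logits, 0.0, random.Random(seed)) for seed in range(20)}
# At temperature 1.0, different seeds can pick different tokens.
sampled = {sample_token(logits, 1.0, random.Random(seed)) for seed in range(200)}
```

In this toy version the randomness comes entirely from the random source: the same seed reproduces the same pick, and different seeds can diverge. Whether a hosted chat product lets you pin the seed is a separate question that this sketch does not answer.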
I have no clue what AI you're using, but with both Claude and Codex, you just explain the outcome, and they are pretty smart at figuring out stuff on complex codebases. You don't even need a paragraph, just say "doing this I got an error".
> NO guarantee either because these models are NOT deterministic in their output. Same prompt different output each time.
So, exactly like humans. But a bit more predictable and way more reliable.
> That's why every chat box has that "Regenerate" button.
If you're using the chat box to write code, that's a human error, not an LLM one. Don't blame "AI" for your ignorance.
> no matter how smart and expensive the model is, the underlying working principles are the same as GPT-2.
Sure. Every machine is a smoke machine if operated wrong enough. This tells me you should not get your insight from random YT videos. As a bit of a nugget, some of the underlying working principles of the chat system also powered search engines; and their engineers also drank water, like Hitler.
I don't think anyone was claiming otherwise. Sonnet is still better at writing code than GPT-2, and worse than Opus. Workflows that work with Opus won't always work with Sonnet, just as you can't use GPT-2 in place of Sonnet to do code autocomplete.
Wait, are you doing this in the web chat interface?!
That's definitely not a good way. You need to be using a harness (like Claude Code) where the agent can plan its work, explore the codebase, execute code, run tests, etc. With this sort of set up, your prompts can be short (like 1 to 5 sentences) and still get great results.
It’s pretty funny to claim that a model released 22 hours ago is the bare minimum requirement for AI-assisted programming. Of course the newest models are best at writing code, but GPT-* and Claude have written pretty decent systems for six months or so, and they’ve been good at individual snippets/edits for years.
Not what I said.
The OP was trying to write specs and have an AI turn it into an app, then getting frustrated with the amount of cleanup.
If you want the AI to write code for you and minimize your cleanup work, you have to use the latest models available.
They won't be perfect, but they're going to produce better results than using second-tier models.
The OP comment was talking about Claude Sonnet. I was comparing to that.
I should have just said "use the best model available"
Nobody was talking about how much better it is until you wrote this though
It's like you're building your own windmills brick by brick
You're assuming that finding the places where AI needs help isn't already a larger task than just writing it yourself. AI can be helpful in development in very limited scenarios but the main thrust of the comment above yours is that it takes longer to read and understand code than to write it and AI tooling is currently focused on writing code.
We're optimizing the easy part at the expense of the difficult part - in many cases it simply isn't worth the trouble (cases where it is helpful, imo, exist when AI is helping with code comprehension but not new code production).
Not assuming anything, I'm well versed in how to do this.
Anyone who defers to having AI write massive blocks of code they don't understand is going to run into this.
You have to understand what you want and guide the AI to write it.
The AI types faster than me. I can have the idea and understand and then tell the LLM to rearrange the code or do the boring work faster than I can type it.
I think we're seeing something similar with AI: There are devs who spend a couple days trying to get AI to magically write all of their code for them and then swear it off forever, thinking they're the only people who see the reality of AI and everyone else is wrong.
It's a sort of context of life that the easy problems are solved - those where an extreme answer is always correct are things we no longer even consider problems... most of the options that remain have their advantages and disadvantages so the true answer is somewhere in the middle.
Juniors are mostly better than what you write as behavior, I certainly never had to correct as much after any junior as OP writes. If you have 'boring code' in your codebase, maybe it signals not that great architecture (and I presume we don't speak about some codegens which existed since 90s at least).
Also, any senior worth their salt wants to intimately understand their code, the only way you can anyhow guarantee correctness. Man, I could go on and on and pick your statements one by one but that would take long.
Yes, it's quicker to do it yourself this time, but if we build out the artifacts to do a good enough job this time, next time it'll have all the context it needs to take a good shot at it, and if you get overtaken by AI in the meantime you've got an insane head start.
Which side of history are you betting on?
I'm okay not being at the bleeding edge - I can see the remains of the companies that aggressively switch to the new best thing. Sometimes it'll pay off and sometimes it won't. I am comfortable being a person that waits until something hits a 2.0 and the advantages and disadvantages are clear before seriously considering a migration.
Read uncharitably, yeah. But you're making a big assumption that the writing of spec wasn't driven by the developer, checked by developer, adjusted by developer. Rewritten when incorrect, etc.
> You can still make the decisions, call the shots
One way to do this is to do the thinking yourself, tell it what you want it to do specifically and... get it to write a spec. You get to read what it thinks it needs to do, and then adjust or rewrite parts manually before handing off to an agent to implement. It depends on task size of course - if small or simple enough, no spec necessary.
It's a common pattern to hand off to a good instruction following model - and a fast one if possible. Gemini 3 Flash is very good at following a decent spec for example. But Sonnet is also fine.
> Stop trying to use it as all-or-nothing
Agree. Some things just aren't worth chasing at the moment. For example, in native mobile app development, it's still almost impossible to get accurate idiomatic UI that makes use of native components properly and adheres to HIG etc
I'm unsure if this is actually faster than me writing it myself, but it certainly expends less mental energy for me personally.
The real gains I'm getting are with debugging prod systems. Where normally I would have to touch five different interfaces to track down an issue, I've now encompassed it all within an MCP and direct my agent on the debugging steps (check these logs, check this in the db, etc).
I was trying to explain that this isn't how successful engineers use AI. There is a way to understand the code and what the AI is doing as you're working with it.
Writing a spec, submitting it to the AI (a second-tier model at that) and then being disappointed when it didn't do exactly what you wanted in a perfect way is a tired argument.
I'm saying that if you're trying to have AI write code for you and you want to do as little cleanup as possible, you have to use the best model available.
"Ignore all of the LinkedIn and social media hype about prompting apps into existence." Absolutely, it's not hype, it's pure marketing bullshitzen.
Stop doing that. Micromanage it instead. Don't give it the specs for the system, design the system yourself (can use it for help doing that), inform it of the general design, but then give it tasks, ONE BY ONE, to do for fleshing it out. Approve each one, ask for corrections if needed, go to the next.
Still faster than writing each of those parts yourself (a few minutes instead of multiple hours), but much more accurate.
"We have this thing that can speed your code writing 10x"
"If it isn't 1000x and it doesn't give me a turnkey end to end product might as well write the whole thing myself"
People have forgotten balance. Which is funny, because the inability of the AI to just do the whole thing end to end correctly is what stands between 10 developers having a job versus 1 developer having a job telling 10 or 20 agents what to do end to end and collecting the full results in a few hours.
And if you do it the way I describe you get to both use AI, AND have "a much better understanding of the codebase (and way better code)".
This is based on the premise that, given a detailed plan, the model will produce exactly the same thing because the model is deterministic in nature, which is NOT the case. These models are NOT deterministic no matter how detailed a plan you feed in. If you doubt it, give the model the same plan twice and see something different churned out each time.
> And honestly, I’m mostly within my Pro subscription, granted I also have ChatGPT Plus but I’ve mostly only used that as the chat/quick reference model. But yeah takes some time to read and understand everything, a lot of the time I make manual edits too.
I do not know how you can do it on a Pro plan with Claude Opus 4.7 which is 7.5x more in terms of limit consumption and any small to medium size codebase would easily consume your limits in just the planning phase up to 50% in a single prompt on a Pro plan (the $20/month one that they are planning to eliminate)
I also don’t understand because all I ever hear is people saying $100 Max plan is the minimum for serious work. I made 3-4 plans today, I’m familiar with the codebase and pointed the LLM in the direction where it needed to go. I described the functionality I wanted which wasn’t a huge rewrite, it touched like 4 files of which one was just a module of pydantic models. But one plan was 30% of usage and I had this over two sessions because I got a reset. I did read and understand every line of code so that part takes me some time to do.
Get it to write a context capsule of everything we've discussed.
Chuck that in another model and chat around it, flesh out the missing context from the capsule. Do that a couple of times.
Now I have an artifact I can use to one-shot a hell of a lot of things.
This is amazing for 0-1.
For brown field development, add in a step to verify against the current code base, capture the gotchas and bounds, and again I've got something an agent has a damn good chance of one-shotting.
Like there is no way in the world that Gen AI is faster than an actual cracked coder shooting the exact bash/sql commands he needs to explore and writing a proper intent-communicating abstraction.
I’m thinking the difference is in orders of magnitude.
On top of that it adds context loss, risk of distraction, the extra work of reading after the job is done + you’ll have less of a mental model no matter how well you read, because active > passive.
Man it was really the weirdest thing that Claude Code started hiding more and more changes. That's what you need, staying closely in the loop.
I'm not having the same problem as you and I follow a very similar methodology. I'm producing code faster and at much higher quality with a significant reduction in strain on my wrists. I doubt I'm typing that much less, but what I am typing is prose which is much more compatible with a standard QWERTY keyboard.
I think part of it is that I'm not running forward as fast as I can and I keep scope constrained and focused. I'm using the AI as a tool to help me where it can, and using my brain and multiple decades of experience where it can't.
Maybe you're expecting too much and pushing it too hard/fast/prematurely?
I don't find the code that hard to read, but I'm also managing scope and working diligently on the plans to ensure it conforms to my goals and taste. A stream of small well defined and incremental changes is quite easy to evaluate. A stream of 10,000 line code dumps every day isn't.
I bet if you find that balance you will see value, but it might not be as fast as you want, just as fast as is viable which is likely still going to be faster than you doing it on your own.
This is hardly a surprise, no? No matter how much training we run, we are still producing a generative model. And a generative model doesn't understand your requirements and cross them off. It predicts the next most likely token from a given prompt. If the most statistically plausible way to finish a function looks like a version that ignores your third requirement, the model will happily follow through. There are really no rules in your requirements doc. They are just the conditional events X in a glorified P(Y|X). I'd venture to guess that sometimes missing a requirement may increase the probability of the generated tokens, so the model will happily allow the miss. Actually, "allow" is too strong a word. The model does not allow shit. It just generates.
If you are seeing an agent missing tasks, work with it to write down the task list first and then hold it accountable to completing them all. A spec is not a plan.
I ask the model to rename MyClass to MyNewClass. It will generate a checklist like:
- Rename references in all source files
- Rename source/header files
- Update build files to point at new source files
Then it will do those things in that order.
Now you can re-run it but inject the start of the model's response with the order changed in that list. It will follow the new order. The list plainly provides real information that influences future predictions and isn't just a facade for the user.
Are you seriously saying that breaking a large complex problem down into its constituent steps, and then trying to solve each one of them as an individual problem is just a sensation of rigour?
Edit: I'll give you another example that I realized because someone pointed it out here: when the stupid bot tells you why it fucked up, it doesn't actually understand anything about itself - it's just generating the most likely response given the enormous amount of pontification on the internet about this very subject...
Whilst I can't usually start from the exact same point in the decisioning, I can usually bootstrap a new session. It's not all ephemeral.
To your edit: I find that the most galling thing about finding out about the thinking being discarded at cache clear. Reconstruction of the logical route it took to get to the end state is just not the same as the step by step process it took in the first place, which again I feel counters your "feelies".
There's a really simple solution to this galling sensation: simply always keep in mind it's a stupid GenAI chat bot.
Have you tried Opus 4.6 with "/effort max" in Claude Code? That's pretty much all I use these days, and it is, honestly, doing a fantastic job. The code it's writing looks quite good to me. Doesn't seem to matter if it's greenfield or existing code.
If code is harder to read than to write, you're doing yourself a disservice by having the output stage not be top shelf.
Feels crazy to me for people to use anything other than the best available.
[1]: https://www.anthropic.com/engineering/april-23-postmortem ... but also see the September 2025 one at https://www.anthropic.com/engineering/a-postmortem-of-three-...
Not everyone has unlimited budgets to burn on tokens.
Just the coding window makes mistakes, duplicates code, does not follow the patterns. The reviewer catches most of this, and the coder fixes them all after rationalizing them.
Works pretty well for me. This model is somewhat institutionalized in my company as well.
I use CC Opus 4.7 or Codex GPT 5.4 High (more and more codex of late).
Maybe it was Timothy Gowers who commented on this.
Lots of human proofs have the unfortunate “creative leap” that isn’t fully explained but with some detectable subtlety. LLMs end up making large leaps too, but too often the subtle ways mathematicians think and communicate are lost, and so the proof becomes so much more laborious to check.
Like you don’t always see how a mathematician came up with some move or object to “try”, and to an LLM it appears random large creative leaps are the way to write proofs.
This may be worth trying out.
I feel like I have easily multiplied my productivity because I do not really have to read more than a single chat response at a time, and I am still familiar with everything in my apps because I wrote everything.
I've been working on Window Manager + other nice-to-haves for macOS 26. I do not need a model to one-shot the program for me. However, I am thrilled to get near instantaneous answers to questions I would generally have to churn through various links from Google/StackOverflow for.
Just saying that I know a lot of people like to raw dog it and say plugins and skills and other things aren't necessary, but in my case I've had good success with this.
You then spend months cleaning it up.
Could just have written it by hand from scratch in the same amount of time.
But the benefit is not having to type code.
Dude! The amount of ad-hoc, interface-specific DTOs that LLM coding agents define drives me up the wall. Just use the damn domain models!
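For anyone who hasn't run into this pattern, a small Python sketch of the difference. `Invoice`, `InvoiceSummaryDTO` and the helper functions are hypothetical names made up for illustration; they don't come from any codebase mentioned in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invoice:
    """Hypothetical domain model."""
    id: int
    customer: str
    total_cents: int


# The pattern being complained about: a one-off, narrower copy of Invoice
# plus a mapping function that exists only to feed a single caller.
@dataclass(frozen=True)
class InvoiceSummaryDTO:
    id: int
    total_cents: int


def to_summary_dto(invoice: Invoice) -> InvoiceSummaryDTO:
    return InvoiceSummaryDTO(id=invoice.id, total_cents=invoice.total_cents)


def format_total_via_dto(dto: InvoiceSummaryDTO) -> str:
    return f"#{dto.id}: {dto.total_cents / 100:.2f}"


# The alternative: the function accepts the domain model directly.
def format_total(invoice: Invoice) -> str:
    return f"#{invoice.id}: {invoice.total_cents / 100:.2f}"
```

Both paths produce the same string; the DTO route just adds a type and a mapper to read and maintain. A separate type can still earn its keep at a real boundary such as a public API, so this is a sketch of the complaint rather than a blanket rule.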
Well, there's your problem. Why aren't you using the best tool for the job?
The last two paragraphs, however, show what happens when people start trying to use inductive reasoning -- and that part is really hard: ...
> Therefore I need more time and effort with Gen AI than I needed before because I need to read a lot of code, understand it and ensure it adheres to what mental model I have.
I don't disagree that the above is reasonable to say. But it isn't all -- not even enough -- about what needs to be said. The rate of change is high, the amount of adaptation required is hard. This in a nutshell is why asking humans to adapt to AI is going to feel harder and harder. I'm not criticizing people for feeling this. But I am criticizing the one-sided-logic people often reach for.
We have a range of options in front of us:
A. sharing our experience with others
B. adapting
C. voting with your feet (cancelling a subscription)
D. building alternatives to compete
E. organizing at various levels to push back
(A) might start by sounding like venting. Done well it progresses into clearer understanding and hopefully even community building towards action plans: [1]
> Hence Gen AI at this price point which Anthropic offers is a net negative for me because I am not vibe coding, I'm building real software that real humans depend upon and my users deserve better attention and focus from me hence I'll be cancelling my subscription shortly.
The above quote is only valid under some pretty strict (implausible) assumptions: (1) "GenAI" is a valid generalization for what is happening here; (2) the person cannot learn and adapt; (3) the technology won't get better.
[1]: I'm at heart more of a "let's improve the world" kind of person than "I want to build cool stuff" kind of person. This probably causes some disconnect in some interactions here. I think some people primarily have other motives.
Some people cancel their subscriptions and kind of assume "the market and public pushback will solve this". The market's reaction might be too slow or too slight to actually help much. Some people put blind faith into markets helping people on some particular time scales. This level of blind faith reminds me of the Parable of the Drowning Man [2]. In particular, markets often send pretty good signals that mean, more or less, "you need to save yourself, I'm just doing my thing." Markets are useful coordinating mechanisms in the aggregate when functioning well. One of the best ways to use them is to say "I don't have enough of a cushion or enough skills to survive what the market is coordinating, so I need a Plan B!"
Some people go further and claim markets are moral by virtue of their principles; this becomes moral philosophy, and I think that kind of moral philosophy is usually moral confusion. Broadly speaking, in practice, morality is a complex human aspiration. We probably should not abdicate our moral responsibilities and delegate them to markets any more than we would say "Don't worry, people who need significant vision correction (or other barrier to modern life)... evolution will 'take care' of you."
One subscription cancellation is a start (if you actually have a better alternative and that alternative is better for the world ... which is debatable given the current set of alternatives!)
Talking about it, e.g. here on HN, might be one place to start. But HN is also kind of a "where frustration turns into entertainment, not action" kind of place, unfortunately. Voting is cheap. Karma sometimes feels more like a measure of conformance than of quality thinking. I often feel like I am doing better when I write thoughtfully and still get downvotes -- maybe it means I got some people out of their comfort zone.
Here's what I try to do (but fail often): Do the root cause analysis, vent if you need to, and then think about what is needed to really fix it.
[2]: https://en.wikipedia.org/wiki/Parable_of_the_drowning_man
[3]: The first four are:
I write detailed specs. Multifile with example code. In markdown.
Then hand over to Claude Sonnet.
With hard requirements listed, I found out that the generated code missed requirements, had duplicate code or even unnecessary code wrangling data (mapping objects into new objects of narrower types when won't be needed) along with tests that fake and work around to pass.
So turns out that I'm not writing code but I'm reading lots of code.
The market-leading technology is pretty close to "good enough" for how I'm using it. I look forward to the day when LLM-assisted coding is commoditized. I could really go for an open source model based on properly licensed code.
(but I guess they're not really conflicting, if the "solution" involves upgrading to a higher plan)
This seems to be a good window where I can implement a pretty large feature, and then go through and address structural issues. Goofy things like the agent adding an extra database, weird fallback logic where it ends up building multiple systems in parallel, etc.
Currently, I find multiple agents in parallel on the same project to be not super functional. There's just a lot of weird stuff: agents get confused about worktrees, git conflicts abound, and I found the administrative overhead to be too heavy. I think plenty of people are working on streamlining the orchestration issue.
In the meantime, I combat the ADD by working on a few projects in parallel. This seems to work pretty well for now.
It's still cat herding, but the thing is that refactors are now pretty quick. You just have to have awareness of them.
I was thinking it'd be cool to have an IDE that did coloring of, say, the last 10 git commits to a project so you could see what has changed. I think robust static analysis and code as data tools built into an IDE would be powerful as well.
The agents basically see your codebase fresh every time you prompt. And with code changes happening much more regularly, I think devs have to build tools with the same perspective.
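The "what changed in the last 10 commits" idea can be roughed out without an IDE. This is only a sketch: `count_changes` and `recently_changed` are made-up helper names, and it assumes `git` is on the PATH and the directory is a git repository. An editor plugin could use the counts to shade files by how recently and how often they were touched.

```python
import subprocess
from collections import Counter


def count_changes(name_only_log: str) -> Counter:
    """Count how often each path appears in `git log --name-only` output."""
    return Counter(
        line.strip() for line in name_only_log.splitlines() if line.strip()
    )


def recently_changed(repo_dir: str = ".", commits: int = 10) -> Counter:
    """Paths touched by the last `commits` commits, with how many commits touched each."""
    # An empty --pretty format suppresses commit headers, leaving only file paths.
    out = subprocess.run(
        ["git", "log", "-n", str(commits), "--name-only", "--pretty=format:"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return count_changes(out)
```

Calling `recently_changed(".").most_common()` inside a repository would list the most frequently touched files first.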
To give them the benefit of the doubt, perhaps these people provide such a detailed spec that they basically write code in natural language.
That said, looking at the way things work in big companies, AI has definitely made it so one senior engineer with decent opinions can outperform a mediocre PM plus four engineers who just do what they're told.
Like yesterday? LLM-assisted coding is $100/mo. It looks very commoditized when most households in the developed world pay more for electricity than that.
My definition of LLM-assisted coding is that you fully understand every change and every single line of the code. Otherwise it's vibe coding. And I believe if one is honest to this principle, it's very hard to deplete the quota of the $100 tier.
But, it's not $100/mo. I think the best showcase of where AI is at is on the generative video side. Look at players like Higgsfield. Check out their pricing and then go look at Reddit for actual experiences. With video generation the results are very easy to see. With code generation the results are less clear for many users. Especially when things "just work".
Again, it's not $100/month for Anthropic to serve most uses. These costs are still being subsidized and as more expensive plans roll out with access to "better" models and "more" tokens and context the true cost per user is slowly starting to be exposed. I routinely hit limits with Anthropic that I hadn't been hitting for the same (and even less) utilization. I dumped the Pro Max account recently because the value wasn't there anymore. I am convinced that Opus 3 was Anthropic's pinnacle at this point and while the SotA models of today are good they're tuned to push people towards paying for overages at a significantly faster consumption rate than a right sized plan for usage.
The reality is that nobody can afford to continue to offer these models at the current price points and be profitable at any time in the near future. And it's becoming more and more clear that Google is in a great position to let Anthropic and OAI duke it out with other people's money while they have the cash, infrastructure and reach to play the waiting game of keeping up but not having to worry about all of the constraints their competitors do.
But I'd argue that nothing has been commoditized as we have no clue what LLMs cost at scale and it seems that nobody wants to talk about that publicly.
Video is a different ballgame entirely; it's less than realtime on _large_ GPUs. Moreover, because of the inter-frame consistency, it's really hard to transfer and keep context.
Running inference on text is, or can be, very profitable. It's research and dev that's expensive.
im probably just not being charitable enough to what you mean, but thats an absurd bar that almost nobody conforms to even if its fully handwritten. nothing would get done if they did. But again, my emphasis is on that im probably just not being charitable to what you mean.
    x = 0
    for i in range(1, 10):
        x += i
    print(x)
They don't mean they understand silicon substrate of the microprocessor executing microcode or the CMOS sense amplifiers reading the SRAM cells caching the loop variable. They just mean they can more or less follow along with what the code is doing. You don't need to be very charitable in order to understand what he genuinely meant, and understanding code that one writes is how many (but not all) professional software developers who didn't just copy and paste stuff from Stackoverflow used to carry out their work.
How deep do i need to understand range() or print() to utilize either, on the slightly less extreme end of the spectrum.
But ya, im pretty sure its a point that maybe i coulda kept to myself and been charitable instead.
print(X) is a great example. That's going to print X. Every time.
Agent.print(x) is pretty likely to print X every time. But hey, who knows, maybe it's having an off day.
Jeff Atwood, along with numerous others (whom Atwood cites on his blog [1]), was not exaggerating when he observed that the majority of candidates who had existing professional experience, and even MSc. degrees, were unable to code very simple solutions to trivial problems.
[1] https://blog.codinghorror.com/why-cant-programmers-program/
If it's low-stakes, then the required depth to accept the code is also low.
That's how I read it, and I would agree with that.
Obviously I don't mean "understanding it so you can draw the exact memory layout on the white board from memory."
this is a small nit, but you still have to pay your electric bill, the $100/mo is on top of that. if you're doing cost accounting you don't want to neglect any costs. Just because you can afford to lease a car, doesn't mean you can afford to lease a 2nd car.
I anticipate a Napster-style reckoning at some point when there's a successful high-profile copyright suit around obviously derivative output. It will probably happen in video or imagery first.
But I and others in my company have very heavy usage. We only rarely, with parallel agentic processes, run out of the $200 a month plan.
And what do I mean by "hard"? I mean, it requires a lot of active thinking to think about how you can actively max it out. I'm sure there's some use cases where maybe it is not hard to do this, but in general, I find most devs can't even max out the $100 a month plan, because they haven't quite figured out how to leverage it to that degree yet.
(Again, if someone is using the API instead of subscription, I wouldn't be surprised to see $2,000 bills.)
You can use a Max subscription for work, btw.
I find it incredibly difficult to saturate my usage. I'm ending the average week at 30-ish percentage, despite this thing doing an enormous amount of work for (with?) me.
Now I will say that with pro I was constantly hitting the limit -- like comically so, and single requests would push me over 100% for the session and into paying for extra usage -- and max 5x feels like far more than 5x the usage, but who knows. Anthropic is extremely squirrely about things like surge rates, and so on.
I'm super skeptical of the influx of "DAE think Opus sucks now. Let's all move to Codex!" nonsense that has flooded HN. A part of it is the ex-girlfriend thing where people are angry about something and try to force-multiply their disagreement, but some of it legitimately smells like astroturfing. Like OpenAI got done paying $100M for some unknown podcaster and started hiring people to write this stuff online.
Recently I've gotten Qwen 3.6 27b working locally and it's pretty great, but still doesn't match Opus; I've got to check out that new Deepseek model sometime.
>I'm super skeptical of the influx of "DAE think Opus sucks now. Let's all move to Codex!" nonsense that has flooded HN. A part of it is the ex-girlfriend thing where people are angry about something and try to force-multiply their disagreement, but some of it legitimately smells like astroturfing. Like OpenAI got done paying $100M for some unknown podcaster and started hiring people to write this stuff online.
A lot of people are angry about the whole openclaw situation. They are especially bitter that when they attempted to justify exfiltrating the OAuth token to use for openclaw, nobody agreed with them that they had the right to do so, and sided with Claude that different limits for first-party use is standard. So they create threads like this, and complain about some opaque reason why Anthropic is finished (while still keeping their subscription, of course).
I did a 1:1 map of all my Claude Code skills, and it feels like I never left Opus.
Super happy with the results.
For my use-case, I want the providers to get my tokens as long as they plan to keep releasing open-weight models
Kimi wants my phone number on signup so a no-go for me.
Claude's uptime is terrible. The uptime of most other providers is even worse...and you get all the quantization, don't know what model you are actually getting, etc.
I'm just getting a bit tired of using Opus 2.6 which eats my whole allowance and then some £££ going through the 4kB prompt to review ~13 kB text file twice - and that's on top of the sometimes utter bonkers, bad, lazy answers I'm not getting even from the local Gemma 4 E4B.
I also created a mini framework so it can test that the skills are actually working after implementation.
Everything runs perfectly.
It does seem like the sweet spot between WallE and the destroyed earth in WallE.
I'm a BSD-style Open Source advocate who has published a lot of Apache-licensed code. I have never accepted that AI companies can just come in and train their models on that code without preserving my license, just allowing their users to claim copyright on generated output and take it proprietary or do whatever.
I would actually not mind licensing my work in an LLM-friendly way, contributing towards a public pool from which generated output would remain in that pool. Perhaps there is opportunity for Open Source organizations to evolve licenses to facilitate such usage.
For what it's worth, I would be happy to pay for a commercial LLM trained on public domain or other properly licensed works whose output is legitimately public domain.
slow and steady is worth exponentials. keep slopppping it my boid.
For now. That doesn't really change the risk, that just means they are all hyper competitive right this moment, and so they are comparable. If one of them becomes king of the hill, nothing stops them from silently degrading or jacking prices.
The only shield is to not be dependent in the first place. That means keeping your skills sharp and being willing to pass on your knowledge to juniors, so they aren't dependent on these things.
Of course, many people are building their business on huge AI scaffolding. There's nothing they can do.
But, so far, competition remains fierce. Anthropic still has the best tools for writing code. That lead is smaller than it's ever been, though. But, honestly, Opus 4.5 is when it got Good Enough. If Anthropic suddenly increased prices beyond what I'm willing to pay, any model that gives me Opus 4.5 or better performance is good enough for the vast majority of the work I do with agents. And, there are a bunch of models at that level, now maybe including some discount Chinese models. Certainly Gemini Pro 3.1 is on par with Opus 4.5. Current Codex is better than Opus 4.5 and close to Opus 4.7 (though I won't use OpenAI because I don't trust them to be the dominant player in AI).
I often switch agents/models on the same project because I like tinkering with self-hosted and I like to keep an eye on the most efficient way to work...which model wastes less of my time on silly stuff. Switching is literally nothing; I run `gemini` or `copilot` or `hermes` instead of `claude`. There's simply no deep dependency on a specific model or agent. They're all trying to find ways to make unique features for people to build a dependence on, of course, but the top models are all so fucking smart you can just tell them to do whatever thing it is that you need done. That feature could probably be a skill, whatever it is, and the model can probably write the skill. Or, even better, it could be actual software, also written by the model, rather than a set of instructions for the model to interpret based on the current random seed.
Currently, the only consistent moat is making the best model. Anthropic makes the best model and tools for coding, but that's a pretty shallow moat...I could live with several other models for coding. I'll gladly pay a premium for the best model and tools for coding, but I also won't be devastated if I suddenly don't have Claude Code tomorrow. Even open models I can host myself are getting very close to Good Enough.
They won't ever be SOTA due to money, but "last year's SOTA" when it costs 1/4 or less, may be good enough. More quantity, more flexibility, at lower edge quality. It can make sense. A 7% dumber agent TEAM Vs. a single objectively superior super-agent.
That's the most exciting thing going on in that space. New workflows opening up not due to intelligence improvements but cost improvements for "good enough" intelligence.
Why should anyone waste time on poorer results? I'd rather pay my $200/mo because my time matters. I'm not a poor college student anymore, and I need more return on my time.
I'm not shitting on open weights here - I want open source to win. I just don't see how that's possible.
It's like Photoshop vs. Gimp. Not only is the Gimp UX awful, but it didn't even offer (maybe still doesn't?) full bit depth support. For a hacker with free time, that's fine. But if my primary job function is to transform graphics in exchange for money, I'm paying for the better tool. Gimp is entirely a no-go in a professional setting.
Or it's like Google Docs / Microsoft Office vs. LibreOffice. LibreOffice is still pretty trash compared to the big tools. It's not just that Google and Microsoft have more money, but their products are involved in larger scale feedback loops that refine the product much more quickly.
But with weights it's even worse than bad UX. These open weights models just aren't as smart. They're not getting RLHF'd on real world data. The developers of these open weights models can game benchmarks, but the actual intelligence for real world problems is lacking. And that's unfortunately the part that actually matters.
Again, to be clear: I hate this. I want open. I just don't see how it will ever be able to catch up to full-featured products.
The trick is going to be recognizing tasks which have some ceiling on what they need and which will therefore eventually be doable by open models, and those which can always be done better if you add a bit more intelligence.
This kind of rhetoric is not helpful. If you want to make a point, then make one, but this adds nothing to the conversation. Maybe open source models don't work for you. They work very well for me.
The gap has been shrinking with each release, and the SOTA has already run into diminishing returns for each extra unit of data+computation it uses.
Do you really want to bet that the gap will not eventually be a hair's breadth?
The breakeven at this price is 6 minutes of productivity per work day for an engineer making $200k.
Are you suggesting that someone making $20k should be spending $200/mo on Claude?
If you pay someone $20,000 for labor, and they save 65 minutes worth of labor per day using a $200/mo Claude subscription, you are better off buying the Claude subscription.
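The breakeven figures in this exchange can be checked with a few lines of arithmetic. This is a back-of-envelope sketch, not anyone's actual accounting: the 2,000 working hours and 250 working days per year are my assumptions, the function name is made up, and it ignores taxes, benefits and other employer overhead.

```python
WORK_HOURS_PER_YEAR = 2_000  # assumption: 50 weeks x 40 hours
WORK_DAYS_PER_YEAR = 250     # assumption: 50 weeks x 5 days


def breakeven_minutes_per_day(annual_pay: float, monthly_fee: float) -> float:
    """Minutes of labor per work day that cost the same as the subscription."""
    pay_per_minute = annual_pay / WORK_HOURS_PER_YEAR / 60
    fee_per_work_day = monthly_fee * 12 / WORK_DAYS_PER_YEAR
    return fee_per_work_day / pay_per_minute


for pay in (200_000, 20_000):
    minutes = breakeven_minutes_per_day(pay, 200)
    print(f"${pay:,}/yr at $200/mo: break even at {minutes:.1f} min/day")
```

Under these assumptions the $200k case breaks even at a bit under 6 minutes per day and the $20k case at a bit under 58 minutes, so the 65 minutes quoted above would clear the bar.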
You've got the real insight with this claim.
This is the way the world is moving. Open source isn't even going where the ball is being tossed. There is no leadership here.
You're spot on.
If the cost to deliver a unit of business automation is:
A. $1M with human labor
B. $700k human labor + open source models
C. $500k human labor + $10,000 in claude code max (duration of project)
D. $250k with humans + $200k claude code "mythos ultra"
The one that will get picked is option "D". Your poor college students and hobbyists will be on option "B". But this won't be as productive as evidenced by the human labor input costs.
Option "C" will begin to disappear as models/compute get more expensive and capable.
Option "A" will be nonviable. Humans just won't be able to keep up.
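Summing the hypothetical figures from the options above makes the comparison explicit. This is only a sketch of the arithmetic: the dollar amounts are the commenter's illustrative numbers, and counting open source models at zero cost in option B is my assumption (self-hosting is not actually free).

```python
# (human labor, model/tooling spend) in dollars, taken from the options above.
options = {
    "A": (1_000_000, 0),
    "B": (700_000, 0),        # assumption: open source models counted as $0
    "C": (500_000, 10_000),
    "D": (250_000, 200_000),
}

totals = {name: labor + tooling for name, (labor, tooling) in options.items()}
cheapest = min(totals, key=totals.get)

for name, total in sorted(totals.items(), key=lambda item: item[1]):
    print(f"Option {name}: ${total:,}")
print("Lowest total cost:", cheapest)
```

With these numbers D totals $450k against $510k for C, which is the basis for the claim that D gets picked; the ranking depends entirely on the assumed labor savings.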
Open source strictly depends on models decreasing their capability gap. But I'm not seeing it.
Targeting home hardware is the biggest smell. It's showing that this is non-serious, hobby tinkery and has no real role in business.
For open source to work and not to turn into a toy, the models need to target data center deployment.
The real money in this market, though, is going to be made in the C suite, and they don't really care about the model. They don't care if it's open source, closed source, or what it is. They don't want to buy a model. They're interested in buying a solution to their problems. They're not going to be afraid of a software price tag -- any number they spend on labor is far more.
Labor is something like 50%+ of the Fortune 500's operating expenses -- capturing any chunk of this is a ridiculous sum of money.
When was the last time you used any of them? Because a lot of people are actively using them for 9-5 work today; I count myself in that group. That opinion feels outdated, like it was formed a year ago+ and held onto. Or based on highly quantized versions and/or small non-Thinking models.
Do you really think Qwen3.6 for a specific example is "50%" as good as Opus4.7? Opus4.7 is clearly and objectively better, no debate on that, but the gap isn't anywhere near that wide. I'd call "20%" hyperbole, the true difference is difficult to exactly measure but sub-10% for their top-tier Thinking models is likely.
Sure, we use Google Drive, too, but that's just for sharing documents across offices, not for everyday use. For that, the open source model is a clear winner in my book.
So the starting point is Opus 4.7 pricing and we're contrasting alternatives near the top end (offered across multiple providers).
Also I said 20% was hyperbole, meaning far too high.
Those closed weight models aren't available like we're discussing. They're only available from the vendor that created them.
I'm not disagreeing per-se but if you think the benchmarks are flawed and "my real world usage" is more reflective of model capabilities, why not write some benchmarks of your own?
You stand to make a lot of money and gain a lot of clout in the industry if you've figured out a better way to measure model capability, maybe the frontier labs would hire you.
Because in almost no real-world project is "programming time" the limiting factor?
Who said so? GLM 5.1 is 90% Opus, at least. Some people are quite happy with Kimi 2.6 too. I have not tried Deepseek 4 yet, but I am also hearing it is as good as Opus. You might be confusing open source models with local models. It is not easy to run a 1.6T model locally, but they are not 50% of SOTA models.
Edit: the replies to my comment are great examples of what I’m talking about when I say it’s hard to determine what hardware I’d need :).
Hooking up Claude Code to it is trivial with omlx.
Starting closer to 40k if you want something that's practical. 10k can't run anything worthwhile for SDLC at useful speeds.
(If you are willing to let the machine work mostly overnight/unattended, with only incidental and sporadic human intervention, you could even decrease that memory requirement a bit.)
Also, I don't know of a general solution to streaming models from disk. Is there an inference engine that has this built-in in a way that is generally applicable for any model? I know (I mean, I've seen people say it, I haven't tried it) you can use swap memory with CPU offloading in llama.cpp, and I can imagine that would probably work...but definitely slowly. I don't know if it automatically handles putting the most important routing layers on the GPU before offloading other stuff to system RAM/swap, though. I know system RAM would, over time, come to hold the hottest selection of layers most of the time as that's how swap works. Some people seem to be manually splitting up the layers and distributing them across GPU and system RAM.
Have you actually done this? On what hardware? With what inference engine?
[†] The latest Qwen 3.6 whatever has been a noticeable improvement, and I'm not even at the point where I tweak settings like sampling, temperature, etc. No idea what that stuff does, I just use the staff picks in LM Studio and customize the system prompts.
So you can run 1 agent locally on $1k to $3k hardware
They can run a fleet of thousands
Yes, it's possible to run tiny quantized models, but you're working with extremely small context windows and tons of hallucinations. It's fun to play with them, but they're not at all practical.
Practical? Maybe not (unless you highly value privacy) because you can get better models and better performance with cheap API access or even cheaper subscriptions. As you said, this may indefinitely be the case.
Yes, a lot better, but still terribly unreliable and far less capable than the big unquantized models.
Competition (OpenAI vs Anthropic is fun to watch) and open source will get us there soon I think.
Not the best argument.
Also there is nothing without dependencies. Loose coupling means coupling.
AI tools... do what you already do, sometimes faster, sometimes worse, usually both depending on the task.
There's a massive gap of necessity between them.
Until very recently, local models have been little more than brittle toys in my experience, if you're trying to use them for coding.
But lately I've been running Pi (minimal coding agent harness) with Gemma4 and Qwen3.6 and I've been blown away by how capable and fast they are compared to other models of their size. (I'm using the biggest that can fit into 24gb, not the smaller ones.) In fact, I don't really need to reach for Claude and friends much of the time (for my use cases at least).
but then two months ago 4.6 started getting forgetful and making very dumb decisions and so on. Everyone started comparing notes and realising it wasn’t “just them”. And 4.7 isn’t much better and the last few weeks we keep having to battle the auto level of effort downgrade and so on. So much friction as you think “that was dumb” and have to go check the settings again and see there has been some silent downgrade.
We all miss the early days of 4.6, which just shows you can have a good useful model. LLMs can be really powerful but in delivering it to the mass market Anthropic throttle and downgrade it to not useful.
My thinking is that soon deepseek reaches the more-than-good-enough 4.6+ level and everyone can get off the Claude pay-more-for-less trajectory. We don’t need much more than we’ve already had a glimpse of and now know is possible. We just need it in our control and provisioned not metered so we can depend upon it.
https://www.anthropic.com/engineering/april-23-postmortem
Of course, it sucks when companies screw up ... but at the same time, they "paid everyone back" by removing limits for awhile, and (more importantly to me) they were transparent about the whole thing.
I have a hard time seeing any other major AI provider being this transparent, so while I'm annoyed at Claude ... I respect how they handled it.
https://www.anthropic.com/engineering/a-postmortem-of-three-...
I think there's a certain amount of running with scissors going on here. I appreciate the transparency, but the time to remediation here seems pretty long compared to the rate of new features.
I recall reading similar tales of woe with other providers here on HN. I think the gradual dialling back of capability as capacity becomes strained as users pile on is part of the MO of all the big AI companies.
API Error: Claude's response exceeded the 32000 output token maximum. To configure this behavior, set the CLAUDE_CODE_MAX_OUTPUT_TOKENS environment variable.
That’s a hallucination. All they did was hide thinking by default. Quick Google search should easily teach you how to turn it back on (I literally have it enabled in my harness).
Whoever is their product manager should be embarrassed at the UX they provide.
Please. This is a toy. A novel little tech-toy. If you depend on it now for doing your job then, frankly, you deserve to have your rug pulled now and then.
If you didn't try to use it to work for you, that's okay, but maybe try once more? It does work and adds value. It's a non-standard and weirdly flexible tool with limitations.
...but in retrospect, seeing how you finished your comment, maybe you really want to remain angry and misinformed.
GPT 5.4+ takes its time and considers even edgecases unprovoked that in fact are correct and saves me subsequent error hunting turns and finally delivers. Plus no "this doesn't look like malware" or "actually wait" thinking loops for minutes over a oneliner script change.
GLM always feels like it's doing things smarter, until you actually review the code. So you still need the build/prune cycle. That's my experience anyway.
But now I just use Codex. Claude is unreliable and leaves data races all over and leaves, as you say, negative conditions unhandled fairly consistently.
AI companies have the same incentive. Make it cheaper and people will use it more, making you more money (assuming your price is still above cost). And of course they have every reason to reduce their own costs.
It's like dating apps. They don't want you to find a good match, because then you cancel the subscription.
Speaking of which:
https://www.cnbc.com/2026/04/24/deepseek-v4-llm-preview-open...
Less spend means less real cost to the provider while your flat monthly subscription stays the same price. As well, reducing token use per customer means you can over-subscribe even harder, allowing for more flat monthly subscriptions.
Less tokens = more free capacity = more subscription income.
Now I'm looking for an extremely simple open-source coding agent. Nanocoder doesn't seem to install on my Mac and it brings node-modules bloat, so no. Opencode seems not quite open-source. For now, I'm doing the work of a coding agent myself and using the llama_cpp web UI. Chugging along fine.
Even the FSF recognizes that non-copyleft licenses still follow the Freedoms, and therefore are still Free Software.
On launch, it checks for updates and autoupdates.
I got annoyed enough with Anthropic's weird behavior this week to actually try this, and got something workable up & running in a few days. My case was unique: there's no Claude Code for BeOS, or my older / ancient Macs, so it was easier to bootstrap & stitch something together if I really wanted an agentic coding agent on those platforms. You'll learn a lot about how models actually work in the process too, and how much crazy ridiculous bandaid patching is happening in Claude Code. Though you might also appreciate some of the difficulties that the agent / harnesses have to solve too. (And to be clear, I'm still using CC when I'm on a platform that supports it.)
As for the llama_cpp vs Claude Code delays - I've run into that too. My theory is API is prioritized over Claude Code subscription traffic. API certainly feels way faster. But you're also paying significantly more.
However, it's hard to justify Cursor's cost. My bill was $1,500/mo at one point, which is what encouraged me to give CC a try.
I haven't seen anyone mention this publicly, but I've noticed that the same model will give wildly different results depending on the quantization. 4-bit is not the same as 8-bit and so on in compute requirements and output quality. https://newsletter.maartengrootendorst.com/p/a-visual-guide-...
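A toy illustration of why bit width matters for both memory and fidelity. This is a sketch under loud assumptions: it uses plain symmetric uniform quantization over a whole tensor, whereas real inference engines use more elaborate group-wise schemes; the 27e9 parameter count is just an example size; the weights are random numbers, not a real model; and the memory estimate covers weights only, not KV cache or runtime overhead.

```python
import random


def weight_memory_gb(n_params: float, bits: int) -> float:
    """Approximate weight storage only: n_params * bits / 8 bytes, in GB (1e9 bytes)."""
    return n_params * bits / 8 / 1e9


def quantize_roundtrip(values, bits):
    """Symmetric uniform quantization to signed `bits`-wide integers, then back to floats."""
    levels = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(v) for v in values) / levels
    return [round(v / scale) * scale for v in values]


def mean_abs_error(values, bits):
    """Average absolute difference between original and round-tripped values."""
    restored = quantize_roundtrip(values, bits)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]  # toy stand-in for weights

for bits in (4, 8):
    print(
        f"{bits}-bit: ~{weight_memory_gb(27e9, bits):.1f} GB for 27e9 params, "
        f"mean abs round-trip error {mean_abs_error(weights, bits):.4f}"
    )
```

Halving the bit width halves the weight footprint but makes each quantization step much coarser, which is one plausible reason the same model can behave differently at 4-bit and 8-bit.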
I'm aware that frontier models don't work in the same way, but I've often wondered if there's a fidelity dial somewhere that's being used to change the amount of memory / resources each model takes during peak hours v. off hours. Does anyone know if that's the case?
> “you can’t be serious — is this how you fix things? just WORKAROUNDS????”
If this is how you’re interacting with your agents I think you’re in for a world of disappointment. An important part of working with agents is providing specific feedback. And beyond that, making sure this feedback is actually available to them in their context when relevant.
I will ask them why they made a decision and review alternatives with them. These learnings will aid both you and the agent in the future.