That's the moment when you let "claude --dangerously-skip-permissions" go to work on a difficult problem and watch it crunch away by itself for a couple of minutes running a bewildering array of tools until the problem is fixed.
I had it compile, run and debug a Mandelbrot fractal generator in 486 assembly today, executing in Docker on my Mac, just to see how well it could do. It did great! https://gist.github.com/simonw/ba1e9fa26fc8af08934d7bc0805b9...
I'm bullish it'll get there sooner rather than later, but we're not there yet.
I'd say your mandelbrot debug and the LLVM patch are both "trivial" in the same sense: they're discrete, well defined, clear-success-criteria-tasks that could be assigned to any mid/senior software engineer in a relevant domain and they could chip through it in a few weeks.
Don't get me wrong, that's an insane power and capability of LLMs, I agree. But ultimately it's just doing a day job that millions of people can do sleep deprived and hungover.
Non-trivial examples are things that would take a team of different specialist skillsets months to create. One obvious potential reason there are few non-trivial AI examples is that they take a non-trivial amount of time to generate and verify.
A non-trivial example isn't one where you can look at the output and say "yup, AI's done well here". It requires someone to spend time going into what's been produced, assessing it, essentially redesigning it as a human to figure out all the complexity of a modern non-trivial system, to confirm the AI actually did all that stuff correctly.
An in-depth audit of a complex software system can take months or even years and is a thorough and tedious task for a human, and the Venn diagram of humans thinking "I want to spend more time doing thorough, tedious code tasks" and humans thinking "I want to mess around with AI coding" is two separate circles.
There’s an enormous amount of value in doing this. For the harder problems you mentioned - most IC SWEs are also incapable of or unwilling to do the work. So maybe the current state has equivalent capabilities to 95% of coders out there? But it works faster, cheaper, and doesn’t object to tedious work like documentation. It doesn’t require labor law compliance, hiring, onboarding/offboarding, or cause interpersonal conflict.
Doing for < $10 and under an hour what could be done in a few weeks by $10K+ worth of senior staff time is pretty valuable.
I'm pro AI, I'm not saying it's not valuable for trivial things. But that's a distinct discussion from the trivial nature of many LLM examples/demos in relation to genuinely complex computer systems.
Thank you for providing a spelled out definition of "non-trivial" there!
I think the void where non-trivial examples should be is the same space where contrarians and the last remaining few of the LLMs-are-useless crowd hang out.
It might be something being actually new (cutting edge) vs new to someone vs the human mind wanting to have it be novel and different enough as a comparable percentage of the experience of the first time using ChatGPT 4.
There is also the wiring of non-deterministic software frameworks and architectures compared to the deterministic only software development we're used to.
The former is a different thing than the latter.
The models clearly know the equations, but run into the same issues I had when implementing it myself (namely exploding simulations that the models try to paper over by applying more and more relaxation terms).
I used this prompt a few weeks ago:
> This code needs to be upgraded to the new recommended JavaScript library from Google. Figure out what that is and then look up enough documentation to port this code to it.
https://simonwillison.net/2025/Apr/21/ai-assisted-search/#la...
>Claude's output was thoroughly reviewed by Cloudflare engineers with careful attention paid to security and compliance with standards.
>To emphasize, this is not "vibe coded". Every line was thoroughly reviewed and cross-referenced with relevant RFCs, by security experts with previous experience with those RFCs.
Some time later...
https://github.com/advisories/GHSA-4pc9-x2fx-p7vj / CVE-2025-4143
>The OAuth implementation in workers-oauth-provider that is part of MCP framework https://github.com/cloudflare/workers-mcp, did not correctly validate that redirect_uri was on the allowed list of redirect URIs for the given client registration.
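For anyone wondering what the missing check looks like in principle, here is a minimal sketch. It is not Cloudflare's code or the actual fix; the function names and the shape of the `client` record are made up for illustration, and it assumes the simplest policy of exact string matching against the registered list.

```python
def is_allowed_redirect(requested_uri: str, registered_uris: list[str]) -> bool:
    """Return True only if requested_uri exactly matches a registered URI."""
    # Exact string comparison: no prefix matching and no wildcard handling.
    return requested_uri in registered_uris


def authorize(client: dict, requested_uri: str) -> str:
    """Reject the request before any redirect if the URI is not registered.

    `client` is a hypothetical registration record with a "redirect_uris" list.
    """
    if not is_allowed_redirect(requested_uri, client["redirect_uris"]):
        raise ValueError("redirect_uri is not registered for this client")
    return requested_uri
```

The point is that the comparison has to happen before anything is sent to the requested URI, and that a lookalike such as a registered URI with an extra path segment must not pass.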
Can't be too far off!
The implicit decisions it had to make were also inconsequential, e.g. selection of ASCII chars, color or not, bounds of the domain, ...
However, it shows that agents are powerful translators / extractors of general knowledge!
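To make those implicit decisions concrete, here is a rough Python sketch (not the 486 assembly from the gist) where each of them shows up as a default argument. The character ramp, the grid size, the domain bounds and the iteration cap are all arbitrary choices of mine, which is sort of the point.

```python
def mandelbrot_iterations(c: complex, max_iter: int = 50) -> int:
    """Count iterations of z -> z*z + c before |z| exceeds 2, capped at max_iter."""
    z = 0j
    for i in range(max_iter):
        if abs(z) > 2.0:
            return i
        z = z * z + c
    return max_iter


def render(width: int = 60, height: int = 24,
           x_range: tuple = (-2.0, 1.0), y_range: tuple = (-1.2, 1.2),
           chars: str = " .:-=+*#%@", max_iter: int = 50) -> str:
    """Render the set as ASCII art; every default here is an arbitrary choice."""
    rows = []
    for row in range(height):
        # Map the row index linearly onto the imaginary axis.
        y = y_range[0] + (y_range[1] - y_range[0]) * row / (height - 1)
        line = []
        for col in range(width):
            # Map the column index linearly onto the real axis.
            x = x_range[0] + (x_range[1] - x_range[0]) * col / (width - 1)
            n = mandelbrot_iterations(complex(x, y), max_iter)
            # Scale the iteration count onto the character ramp.
            line.append(chars[n * (len(chars) - 1) // max_iter])
        rows.append("".join(line))
    return "\n".join(rows)
```

Calling `print(render())` draws the familiar shape, with points that never escape drawn using the last character of the ramp.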
What people agree on being non-trivial is working on a real project. There are a lot of open source projects that could benefit from a useful code contribution. But they've only had slop thrown at them.
But there's nothing truly novel in the result. The key aspect is being similar enough to something that's already in the training data so that the LLM can extrapolate the rest. The hint can be quite useful and sometimes you get something that shortens the implementation time, but you have to at least have some basic understanding of the domain in order to recognize the signs.
The issue is that the result is always tainted by your prompt. The signs may be there because of your prompt and not because there's some kind of data that needs to be explored further. And sometimes it's a bad fit, similar but different (what you want and what you get). So for the few domains that are valuable to me, I prefer to construct my own mental database that can lead me to concrete artifacts (books, articles, blog posts, ...) that exist outside the influence of my query.
ADDENDUM
I can use LLMs with great results and I've done so. But it's more rewarding (and more useful to me) to actually think through the problem and learn from references. Instead of getting a perfect (or wobbly or the wrong category) circle that fits my query, I go to find a strange polygon formed (by me) from other strange polygons. Then because I know I need a circle, I only need to find its center and its radius.
It's slower, but the next time I need another circle (or a square) from the same polygon, it's going to be faster and faster.
So it's pretty stupid to just assume that critics haven't tried.
Example feature: send analytics events on app start triggered by notifications. Both Gemini and Claude completely failed to understand the component tree; rewrote hundreds of lines of code in broken ways; and even when prompted with the difficulty (this is happening outside of the component tree), failed to come up with a good solution. And even when deliberately prompted not to, they like to simultaneously make cosmetic code changes to other pieces of the files they're touching.
What do you think is so difficult about doing the same thing with coding problems?
Your comment was about how this was unreasonably hard (for coding challenges).
Anecdotally, I've seen LLMs do all sorts of amazing shit which was obviously drawn from their training set and fall flat on their faces doing simple coding tasks which are novel enough to not appear in the training set.
I don't think it has much relevance at all to a conversation about how good LLMs are at solving programming problems by running tools in a loop.
I keep seeing this idea that LLMs can't handle problems that aren't in their training data and it's frustrating because anyone who has spent significant time working with these systems knows that it obviously isn't true.
Strikes a balance between simplicity and real world usefulness
I think they’re trying to compete with Gemini cli and now I’m glad I’m paying less
But yeah, if you're babysitting a single agent, only applying after reading what it wants to do ... you'll be fine for 3-4 hours before the token limit refreshes after the 5th.
We'll most likely implement a policy that starters in our company can use Pro. Power users need Max.
Yes and no. You are right that it's a relatively small project. However, I've had really bad experiences trying to get ChatGPT (any of their models) to write small arm64 assembly programs that can compile and run on Apple silicon.
Eh, I just watched Claude spend an hour trying to incorrectly fix code. Eventually I realized what was happening, stepped in and asked it to write a bunch of unit tests first, get the code working against those unit tests, and then get back to me.
Claude Code is amazing, but I still have to step in and give it basic architectural guidance again and again.
Why are you not already a unicorn?
That was the point he was making, at least that's how I understood it
Standalone vibe coded apps for personal use? Pretty easy to believe.
Writing high quality code in a complex production system? Much harder to believe.
Writing this, I realise I should more clearly separate the functional tests from the implementation-oriented unit tests.
But I use multiple agents talking to each other, async agents, git work trees etc on complex production systems as my day to day workflow. I wouldn’t say I go so far as to never change the outputs but I certainly view it as signal when I don’t get the outputs I want that I need to work on my workflow.
I talked about a similar, but slightly simpler workflow in my post on "Vibe Specs".
https://lukebechtel.com/blog/vibe-speccing
I use these rules in all my codebases now. They essentially cause the AI to do two things differently:
(1) ask me questions first (2) Create a `spec.md` doc, before writing any code.
Seems not too dissimilar from yours, but I limit it to a single LLM
The md files are actually pretty great for shareability, versioning, and picking up where you left off.
I might be a little too hung up on the details compared to a lot of these agent cluster testimonials I've read, but unlike the author I'll be open and say that the codebase I work on is several hundred thousand lines of Go and currently does serve a high 5 to low 6 figure number of real, B2C users. Performance requirements are forgiving but correctness and reliability are very important. Finance.
Currently I use a very basic setup of scripts that clone a repo, configure an agent, and then run it against a prompt in a tmux session. I rely mainly on codex-cli since I am only given an OpenAI key to work with. The codex instances ping me in my system notifications when it's my turn, and I can easily quake-mode my terminal into view and then attach to the session (with a bit of help from fzf). I haven't gotten into MCP yet but it's on my radar.
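For anyone curious what that kind of setup can look like, here is a minimal sketch in Python. It is not the poster's actual scripts. The function names are hypothetical, the "configure an agent" step is left out, and how the agent CLI takes its prompt (here, as one trailing argument) is an assumption you would need to check against your tool.

```python
import shlex
import subprocess


def build_commands(repo_url: str, workdir: str, session: str,
                   agent_cmd: list[str], prompt: str) -> list[list[str]]:
    """Build the git and tmux invocations without running them."""
    clone = ["git", "clone", repo_url, workdir]
    # tmux takes the command to run as a single shell string, so quote it safely.
    agent = shlex.join(agent_cmd + [prompt])
    # -d: start detached, -s: session name, -c: start directory.
    tmux = ["tmux", "new-session", "-d", "-s", session, "-c", workdir, agent]
    return [clone, tmux]


def launch(commands: list[list[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Keeping the command construction separate from `launch` makes it easy to print and eyeball the commands first. Once a session is running, `tmux attach -t <session>` brings it into view.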
I can sort of see the vision. For those small but distracting tasks, they are very helpful and I (mostly) passively produce a lot more small PRs to clean up papercuts around our codebase now. The "cattle not pets" mentality remains relevant - I just fire off a quick prompt when I feel the urge to get sidetracked on something minor.
I haven't gotten as much out of them for more involved tasks. Maybe I haven't really got enough of a context flywheel going yet, but I do typically have to intervene most of the time. Even on a working change, I always read the generated code first and make any edits for taste before submitting it for code review since I still view the output as my complete responsibility.
I still mostly micromanage the change control process too (branching, making commits, and pushing). I've dabbled in tools that can automate this but haven't gotten around to it.
I 100% resonate with the "fix the inputs, not the outputs" mindset as well. It's incredibly powerful without AI and our industry has been slowly but surely adopting it in more places (static typing, devops, IAC, etc). With nondeterministic processes like LLMs though it feels a lot harder to achieve, more like practice and not science.
The multi-model AI part is just the (current) tool to help avoid bias and make fine tuned selections for certain parts of the task.
Eventually large complex systems will be built and re-built from a set of requirements and software will finally match the stated requirements. The only "legacy code" will be legacy requirements specifications. Fix your requirements, not the generated code.
https://i.pinimg.com/736x/03/af/06/03af0602a8caa51507717edd6...
I have a problem where half the times I see people talking about their AI workflow, I can't tell if they are talking about some kind of dream workflow that they have, or something they're actually using productively
In my case it’s more like developing a mindset of building a framework than pushing feature after feature. I would think it’s like that for most companies. You can get an unpolished version of most apps easily, but polishing takes 3-5x the time.
Let's not talk about development robustness, backend security, etc. AI just has way too many slippages for me in these cases.
However, I would still consider myself a heavy AI user, but I mainly use it to discuss plans (what Google used to be) or to check whether I’ve forgotten anything.
For most features in my app I’m faster typing it out exactly the way I want it. (with a bit of auto-complete) The whole brain-coordination works better.
I guess that was a long talk, but you’re not alone; trust your instinct. You don’t seem narrow-minded.
I’ve had an impossible learning curve the last year, and even though I should be biased toward vibe coding, I use less AI now to keep things more consistent.
I think the two camps are different in terms of skill honestly, but also in terms of needs. Of course you are faster vibe-coding a front-end than writing the code manually, but building a robust backend/processing system is a different tier.
So instead of picking a side it’s usually best to stay as unbiased as possible and choose the right tool for the task
It does really cool stuff now when it is given away for free, but how cool is it when they want you to pay what it actually costs? With ROI and profits on top.
For example, an agent working on the dashboard for the Documents portion of my project has a completely different idea from the agent working on the dashboard for the Design portion of my project. The design consistency is not there, not just visually, but architecturally. Database schema and API ideas are inconsistent, for example. Even on the same input things are wildly different. It seems that if it can be different, it will be different.
You start to update instruction files to get things consistent, but then these end up being thousands of lines on a large project just to get the foundations right, eating into the context window.
I think ultimately we might need smaller language models trained on certain rules & schemas only, instead of on the universe of ideas that a prompt could result in. Small language models are likely the correct path.
> The design consistency is not there, not just visually, but architecturally.
Seniors always gonna have to senior. Doesn't matter if the coders are AI or humans. You have to make sure you provide enough structures for the agents to move in roughly the same direction while allowing enough flexibility that you're not better off just writing the code.
Business owner asks for a new CRUD app and there it is in production.
Of course it's full of bugs, slow as syrup, saves to a public unauthed database but that's none of my business *gulps scalding hot tea*
You could even add a magic button for when things don't work that reruns the same prompt and possibly gets better results.
A slot machine animation while waiting would be cool.
The Model T car was notorious for blowing out tires left and right, to the point that a carriage might have been less hassle at times. Yet here we are.
It could be much bigger than the model T or much bigger than asbestos.
Why is it always this argument? Is it that hard to believe that you can get recent coding assistants to write expandable and maintainable code in 0shot? Have you tried just ... asking for that type of code?
Are we now pretending that humans aren't doing the same? Sure, it's usually on a higher level, but at the end we are also just brute forcing our way toward a solution through trial and error, and if someone is very experienced in the problem-domain, they can do it mostly in their head.
> carbon footprint
So if the AI-datacentre is running on renewables, you would be OK with this?
It's wild to see in action when it's unprompted.
For planning, I usually do a trip out to Gemini to check our work, offer ideas, research, and ratings of completeness. The iterations seem to be helpful, at least to me.
Everyone in these sorta threads asks for "proofs" and I don't really know what to offer. It's like 4 cents for a second opinion on what claude's planning has cooked up, and the detailed response has been interesting.
I loaded 10 bucks onto OpenRouter last month and I think I've pulled it down by like 50 cents. Meanwhile I'm on Claude Max @ $200/mo and GPT Plus for another $20. The OpenRouter stuff seems like less than couch change.
$0.02 :D
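The "couch change" arithmetic above can be checked quickly. This sketch uses only the figures from the comment (about 50 cents spent, about 4 cents per second opinion, $200 + $20 per month in subscriptions) and treats them as rough; the function name is just for illustration.

```python
def second_opinion_budget(spent_cents: int, per_call_cents: int,
                          subscriptions_cents: int) -> tuple[int, float]:
    """Return (whole calls covered by the spend, spend as a percent of subscriptions)."""
    calls = spent_cents // per_call_cents
    share_percent = 100 * spent_cents / subscriptions_cents
    return calls, share_percent


calls, share = second_opinion_budget(
    spent_cents=50,               # "pulled it down by like 50 cents"
    per_call_cents=4,             # "like 4 cents for a second opinion"
    subscriptions_cents=22_000,   # $200 Claude Max + $20 GPT Plus
)
print(calls, round(share, 2))
```

That works out to about a dozen second opinions in the month, for roughly a quarter of a percent of the subscription spend.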
I’ve tried building these kinds of multi agent systems a couple times, and I’ve found that there’s a razor thin edge between a nice “humming along” system I feel good about and a “car won’t start” system where the first LLM refuses to properly output JSON and then the rest of them start reading each other's <think> thoughts.
The difference seems to often come down to:
- Which LLM wrappers are you using? Are they using/exposing features like MCP, tools and chain-of-thought correctly for the particular models you’re using?
- What are your prompts? What are the 5 bullet points with capital letters that need to be in there to keep things in line? Is there a trick to getting certain LLMs to actually use the available MCP tools?
- Which particular LLM versions are you using? I’ve heard people say that Claude Sonnet 4 is actually better than Claude Opus 4 sometimes, so it’s not always an intuitive “pick the best model” kind of thing.
- Is your system capable of “humming along” for hours or is this a thing where you’re doing a ton of copy-paste between interfaces? If it’s the latter then hey, whatever works for you works for you. But a lot of people see the former as a difficult-to-attain Holy Grail, so if you’ve figured out the exact mixture of prompts/tools that makes that happen people are gonna want to know the details.
The overall wisdom in the post about inputs mattering more than outputs etc is totally spot on, and anyone who hasn’t figured that out yet should master that before getting into these weeds. But for those of us who are on that level, we’d love to know more about exactly what you’re getting out of this and how you’re doing it.
(And thanks for the details you’ve provided so far! I’ll have to check out Zen MCP)
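One cheap guard against the "car won't start" failure described above is to sanitize each agent's output before the next one sees it. This is a sketch under assumptions: that reasoning arrives inline wrapped in literal `<think>...</think>` tags, and that each hop is supposed to produce a single JSON object. The function name is hypothetical and not from any particular wrapper.

```python
import json
import re

# Non-greedy match so several separate <think> blocks are each removed.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def clean_agent_output(raw: str) -> dict:
    """Strip inline reasoning and insist on a JSON object before handing off."""
    visible = THINK_BLOCK.sub("", raw).strip()
    try:
        payload = json.loads(visible)
    except json.JSONDecodeError as exc:
        # Fail loudly here instead of letting the next agent read garbage.
        raise ValueError(f"agent did not return valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError("agent returned JSON, but not an object")
    return payload
```

Failing at the hop where the bad output was produced at least tells you which model or prompt to go fix, which fits the "fix the inputs" idea from the post.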
I have two MCPs installed (playwright and context7) but it never seems like Claude decides to reach for them on its own.
I definitely appreciate why you’re not posting code, as you said in another comment.
Not even when you add ‘memories’ that tell it to always use those tools in certain situations?
My admonitions to always run repomix at the start of coding, and always run the build command before crying victory seem to be followed pretty well anyway.
Then engineers can judge for themselves
It makes usable code for my projects. It often gets into the weeds and makes weird tesseracts of nonsense that I need to discover, tear down, and re-prompt it to not do that again.
It's cheap or free to try. It saves me time, particularly in languages I am not used to daily driving. Funnily enough, I get madder when I have it write ts/py/sql code since I'm most conversant in those, but for fringe stuff that I find tedious like AWS config and tests -- it mostly just works.
Will it rot my brain? Maybe? If this thing turns me from an engineer to a PM, well, I'll have nobody to blame but myself as I irritate other engineers and demand they fibonacci-size underdefined jira tix. :D
I think there's going to be a lot of momentum in this direction in the coming year. I'm fortunate that my clients embrace this stuff and we all look for the same hallucinations in the codebase and shut them down and laugh together, but I worry that I'm not exactly justifying my rate by being an LLM babysitter.
> I also have a local mcp which runs Goose and o3.
For example: https://block.github.io/goose/docs/category/tutorials/ I just want to see an example workflow before I set this up in CI or build a custom extension to it!
I guess vibe-coding is on its way to becoming the next 3D printing: Expensive hobby best suited for endless tinkering. What’s today’s vibe coding equivalent of a “benchy”? Todo apps?
In a pre online shopping world 3D printing would be far more useful for the average person. Going forward it looks like it's only really useful for people who can design their own files for actually custom stuff you can't buy.