Every tool is just a tool. No tool is a solution. Until and unless we hit AGI, only the human brain is that.
Instead we're stuck talking about if the lie machine can fucking code. God.
I've been allowing LLMs to do more "background" work for me. It gives me room to experiment with stuff: I can kick something off, come back in 10-15 minutes, and see what it's done.
The key thing I've come to is that the task HAS to be fairly limited. Giving it something big like refactoring a code base won't work. Giving it an example can help dramatically. And if you haven't "trained" it by giving it context or adding your CLAUDE.md file, you'll end up finding it doing things you don't want it to do.
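To make that concrete, here's a rough sketch of the kind of guardrails I mean in a CLAUDE.md. Everything in it is made up for illustration (the package names, commands, and rules aren't from a real project); the point is just pairing context with explicit boundaries:

```markdown
# CLAUDE.md (illustrative sketch; project names and commands are made up)

## Project context
- TypeScript monorepo; packages live under packages/*.
- Run `npm test` from the repo root before considering a task done.

## Conventions
- Prefer small, focused diffs; do not reformat files you did not touch.
- Follow the existing error-handling style of the package you are editing.

## Boundaries
- Do not modify anything under packages/legacy/ without asking first.
- Do not add new dependencies; mention the need in your summary instead.
```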
Another great task I've been giving it while I'm working on other things is generating docs for existing features and modules. It is surprisingly good at following events through the code to see where they go, and at generating diagrams and the like.
But it's also not crazy to think that, with LLMs getting smarter (and considerable resources put into making them better at coding), future versions would clean up and refactor code written by past versions. Correct?
And I don't really see any reason to declare we've hit the limit of what can be done with those kinds of techniques.
But, fundamentally, LLMs lack a theory of the program in the sense intended in this comment: https://news.ycombinator.com/item?id=44443109#44444904 . Hence, they can never reach the promised land being talked about - unless there are innovations beyond next-token prediction.
In other words, it would be wrong of me to assume that the only way I can think of to solve a problem is the only way to do it.
Maybe quite a few pounds, if the cure in question hasn't been invented yet and may turn out to be vaporware.
The chatbot portion of the software is useless.
Chat mode on the other hand follows my rules really well.
I mostly use o3 - it seems to be the only model that has "common sense", in my experience.
It's really powerful to see different options laid out, especially ones based on your own codebase.
> I wouldn't give them a big feature again. I'll do very small things like refactoring or a very small-scoped feature.
That really resonates with me. Anything larger often ends badly, and I can feel the "tech debt" building in my head with each minute Copilot is running. I do like the feeling, though, when you already understand a problem, write a detailed prompt to nudge the AI in the right direction, and it executes just like you wanted. After all, problem solving is why I’m here, and writing code is just the vehicle for it.
Somehow, even if I take the best models and agents, most hard coding benchmarks are still below 50%, and even SWE-bench Verified is at maybe 75-80%. Not 95. Assuming agents just solve most problems is incorrect, despite them being really good at first prototypes.
Also, in my experience, agents are great up to a point and then fall off a cliff. Not gradually. The types of errors you get past that point are so diverse, one cannot even explain them.
I take the message, provide the surrounding code, and it gives me a few approaches to solve it. More than half the time, the resolution is there and I can copy the relevant bit verbatim. (The other times it's garbage, but at least I can see that this is going to require some AI: Actual Intelligence.)