Posted by mraniki 4 days ago
The fact that it's free for now (I know they use it for training, that's OK) is a big plus, because I've had to restart a task from scratch quite a few times. If I calculate what this would have cost me using Claude, it would have been 200-300 euros.
I've noticed that as soon as it makes a mistake (messing up the diff format is a classic), the current task is basically a total loss. For some reason, most coding tools just inform the model that it made a mistake and should try again... but at that point, its broken response is part of the history, and it's basically multi-shotting itself into making more mistakes. They should really just filter these out.
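A minimal sketch of what that filtering could look like: instead of appending the malformed response plus an "error, try again" message, drop the broken turn entirely before retrying. The `call_model` callable and the message-dict schema here are assumptions for illustration, not any real tool's API.

```python
def retry_without_failures(messages, call_model, is_valid, max_tries=3):
    """Retry a model call, keeping broken responses OUT of the history."""
    for _ in range(max_tries):
        reply = call_model(messages)
        if is_valid(reply):
            # Only a valid reply becomes part of the conversation.
            messages.append({"role": "assistant", "content": reply})
            return reply
        # A malformed reply (e.g. a bad diff) is discarded rather than
        # appended, so the model isn't few-shotted on its own mistakes.
    return None
```

The point is that the history only ever contains examples of the format done right.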
All in all, I think we humans are well on our way to becoming legal flesh[].
[] The part of the system to whip or throw in jail when a human+LLM makes a mistake.
I wonder if you treat code from a junior engineer the same way? Seems impossible to scale a team that way. You shouldn't need to verify every line, but rather have test harnesses that ensure adherence to the spec.
Based on my own experience and anecdotes, it's worse than Claude 3.5 and 3.7 Sonnet for actual coding tasks on existing projects. It is very difficult to control the model's behavior.
I will probably make a blog post on real world usage.
Sometimes I have it write functions that are very boilerplate to save time, but I mostly like to use it as a tool to think through problems, alongside other tools like writing in a notebook or drawing diagrams. I enjoy programming too much to want an AI to do it all for me (it also helps that I don't do it as a job).
With a 1 million token context you'd think they'd let the LLM actually use it, but all the tricks to save token count just make it... not useful.
The vast majority of coding energy is what comes next.
Even today, Sonnet 3.5 is still the best "existing code base" model. Which is gratifying (to Anthropic) and/or alarming (to everyone else).