Posted by sbochins 3 hours ago
What does this look like after 6-12 months? Like, how much code are you trying to write total?
Maybe it just doesn’t click in my mind, but sometimes I wonder how much work people are trying to do, and how they actually have enough of it to be churning through it so quickly.
I've never worked on a complicated codebase that started out that way; it got complicated once the business concerns and office politics came into play. People may not like it, but the bureaucracy is far and away more valuable than the core functionality.
Mature codebases are years of people thinking of all the possible gotchas while solving their acute pain points. This is not fluff, but the living and breathing part of it. Without that code, it's just a machine barely doing stuff in the most obtuse ways possible that nobody wants to pay for.
I would argue that they're putting LLMs to work on that finer detail stuff, but AI is still far too dumb. No, what they're doing is playing with their Skinner box.
I wonder if part of the solution is building/finding the right libraries, with the right documentation/language/API (one that plays well with LLMs), and maybe creating some synthetic data around them to make it very easy for the LLM.
And maybe there could be a business model around creating those libraries.
If you can ask the model for a specific function with a spec design (typed languages help too), then the small models are great! I have had good progress with generating small Python modules, for example, but you need verification rounds to catch issues.
So test-driven design + a good spec sheet + a very detailed todo.md (or even better if it's todo.json, because then the LLM does not need to manage it, you do from the harness) is your best bet for small models.
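To make the todo.json idea concrete, here is a rough sketch of a harness that owns the task list so the model never has to. The file layout (a list of tasks with `id`, `spec`, `done`) and the `generate`/`verify` callables are made up for illustration; plug in your own model call and test runner.

```python
import json
from pathlib import Path
from typing import Callable


def run_todo(path: Path,
             generate: Callable[[str], str],
             verify: Callable[[str], bool],
             max_rounds: int = 3) -> list:
    """Walk a todo.json file; the harness, not the model, tracks task state.

    Assumed (hypothetical) layout: [{"id": 1, "spec": "...", "done": false}, ...]
    """
    tasks = json.loads(path.read_text())
    for task in tasks:
        if task.get("done"):
            continue  # finished in an earlier run
        for _ in range(max_rounds):        # verification rounds
            code = generate(task["spec"])  # the model only ever sees one spec
            if verify(code):               # e.g. run the tests written for this spec
                task["done"] = True
                task["code"] = code
                break
        # Persist after every task so an interrupted run can resume.
        path.write_text(json.dumps(tasks, indent=2))
    return tasks
```

Tasks that never pass `verify` just stay `done: false`, so you can see at a glance what needs a human or a bigger model.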
Like perhaps you could produce 5 versions of a piece of code, and then compare them to choose the best.
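Something like this, as a minimal sketch: take N candidate sources, run each against the same checks, keep the one that passes the most. `pick_best` and the check functions are hypothetical names, and "best" here just means "passes the most checks", which is only one way to compare.

```python
from typing import Callable, Optional, Sequence


def pick_best(candidates: Sequence[str],
              checks: Sequence[Callable[[Callable], bool]],
              func_name: str) -> Optional[str]:
    """Return the candidate source passing the most checks (first wins on ties)."""
    best, best_score = None, -1
    for src in candidates:
        namespace: dict = {}
        try:
            exec(src, namespace)  # only do this with model output inside a sandbox
        except Exception:
            continue              # does not even load, skip it
        fn = namespace.get(func_name)
        if not callable(fn):
            continue
        score = 0
        for check in checks:
            try:
                score += bool(check(fn))
            except Exception:
                pass              # a crashing check counts as a failure
        if score > best_score:
            best, best_score = src, score
    return best
```

Usage would be along the lines of `pick_best(five_outputs, [lambda f: f(1, 2) == 3], "add")`. It returns `None` if no candidate even defines the function.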
Also if the local LLMs can call tools, maybe you can use static analysis tools to catch errors and try again in a loop or process of some sort.
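A bare-bones version of that loop might look like the sketch below. The "static analysis" here is only `ast.parse` (does it parse at all?) standing in for a real linter or type checker, and `generate` is a hypothetical stand-in for whatever calls your local model.

```python
import ast
from typing import Callable, Optional


def static_errors(source: str) -> Optional[str]:
    """Cheapest possible static check: a syntax parse. Swap in a real tool here."""
    try:
        ast.parse(source)
        return None
    except SyntaxError as e:
        return f"line {e.lineno}: {e.msg}"


def generate_until_clean(prompt: str,
                         generate: Callable[[str], str],
                         max_attempts: int = 4) -> Optional[str]:
    """Retry generation, feeding the checker's error back into the prompt."""
    feedback = ""
    for _ in range(max_attempts):
        source = generate(prompt + feedback)
        err = static_errors(source)
        if err is None:
            return source
        feedback = (f"\n\nPrevious attempt failed static analysis: {err}"
                    "\nFix it and try again.")
    return None  # give up after max_attempts
```

The important part is the cap on attempts, so a model that keeps producing the same broken output does not spin forever.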
There also might be certain languages that work better because those languages have better static checks.
I'll write a detailed prompt for a function, hand it off to 5 or so models (all of which are on my local machine), wait about 5 min and then compare.
Which is to say, I might use AI to do an outline/organizational pass, but I'm prompting every chunk of code "one-by-one" (e.g. at about the "function" level), which still feels lightyears ahead of what I used to do.
Because (1) Huawei collab and (2) vLLM etc. don't implement half of the inference optimisations DeepSeek proposed in their paper.
For me MiniMax 3 has really hit the sweet spot of being very cheap, though more than flash, while also being very capable.
My baseline is Sonnet 4.6. Honestly, I think it's good enough for most tasks. So, from what I see, we are already at a point where we don't need frontier models for serious coding and debugging. Give it a couple of years and that level will fit in 120B models.
At the same time, we've seen the rise of direct-access memory systems like DGX or Strix Halo that will make it possible to run models of this size for "cheap" in the medium term.
That's what I'm betting on: that in 2 years I can buy a system for about $2500 that will run a model similar to Sonnet 4.6 locally.
I might be spectacularly wrong though. But I'm willing to wait and use subscriptions/API calls for now.
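For a sense of scale, the back-of-envelope arithmetic for a 120B model is easy to check. This only counts the weights (parameters times bits per weight, divided by 8); KV cache and activations come on top, and the quantization levels are just common examples, not a claim about any particular model.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


for bits in (16, 8, 4):
    print(f"120B @ {bits}-bit: ~{weight_gb(120, bits):.0f} GB")
# 120B @ 16-bit: ~240 GB
# 120B @ 8-bit: ~120 GB
# 120B @ 4-bit: ~60 GB
```

So whether the bet pays off depends a lot on how much memory that $2500 box has and how aggressively the model can be quantized.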
Depending on what one builds, comprehensive documentation, applicable skills, and memory tools often allow for a substantial reduction in the tokens the agent previously spent comprehending and remembering what is being built.