Scaling LLMs to Larger Codebases

Posted by kierangill 12/22/2025

Scaling LLMs to Larger Codebases (blog.kierangill.xyz)

307 points | 119 comments | page 2

blauditore 12/22/2025|

It's like people are rediscovering the most basic principles: e.g. that documentation ("prompt library") is useful, or that well-organized code leads to higher velocity in development.

hu3 12/22/2025|

if that's what it takes for more people to write tests, then so be it

victorbjorklund 12/22/2025||

Biggest change to my workflow has been to break down projects to smaller parts using libraries. So where I in the past would put everything in the same code base I now break down stuff that can be separate to its own libraries (like wrapping an external API). That way the AI only needs to read the docs for the library instead of having to read all the code when working on features that use the API.
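A minimal sketch of that kind of split, assuming a made-up weather API. WeatherClient, Forecast, the /forecast endpoint, and the JSON field names are all hypothetical and only illustrate the shape: the wrapper lives in its own library, and feature code (or an agent) only needs its docstrings.

```python
"""weather_client: hypothetical thin wrapper around an external HTTP API."""
import json
import urllib.parse
import urllib.request
from dataclasses import dataclass
from typing import Callable

# A transport takes a URL and returns the raw response body.
# Injecting it keeps the wrapper testable without network access.
Transport = Callable[[str], str]


def _http_get(url: str) -> str:
    """Default transport: a plain HTTP GET using the standard library."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


@dataclass(frozen=True)
class Forecast:
    city: str
    temp_c: float


class WeatherClient:
    """The only surface the rest of the codebase needs to know about."""

    def __init__(self, base_url: str, transport: Transport = _http_get) -> None:
        self._base_url = base_url.rstrip("/")
        self._transport = transport

    def forecast(self, city: str) -> Forecast:
        """Return the current forecast for `city` (endpoint and fields are assumed)."""
        url = f"{self._base_url}/forecast?city={urllib.parse.quote(city)}"
        data = json.loads(self._transport(url))
        return Forecast(city=data["city"], temp_c=float(data["temp_c"]))
```

Feature code then calls client.forecast("Oslo") and never sees URLs or JSON parsing, which is the part the AI no longer has to read.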

mym1990 12/22/2025||

It's kind of crazy that the knee-jerk reaction to failing to one-shot your prompt is to abandon the whole thing because you think the tool sucks. It very well might, but it could also be user error or a number of other things. There wouldn't be a good night's sleep in sight if I knew an LLM was running rampant all over production code in an effort to "scale it".

zeroonetwothree 12/22/2025||

There’s always a trade off in terms of alternative approaches. So I don’t think it’s “crazy” that if one fails you switch to a different one. Sure, sometimes persistence can pay off, but not always.

Like if I go to a restaurant for the first time and the item I order is bad, could I go back and try something else? Perhaps, but I could also go somewhere else.

t_tsonev 12/22/2025||

I'm okay with writing developer docs in the form of agent instructions, those are useful for humans too. If they start to get oddly specific or sound mental, then it's obviously the tool at fault.

tracker1 12/22/2025||

Just over the weekend, I decided to shell out for the top tier Claude Code to give it a try... definitely an improvement over the year I spent with GitHub Copilot enabled on my personal projects (mostly an annoyance more than a help that I eventually disabled altogether).

I've seen some impressive output so far, and have a couple of friends that have been using AI generation a lot... I'm trying to create a couple of legacy (BBS tech related, in Rust) applications to see how they land. So far it's mostly planning and structure beyond the time I've spent in contemplation. I'm not sure I can justify the expense long term, but I want to experience the fuss a bit more to at least have a better awareness.

patcon 12/22/2025||

Is it not the case that "production level code" coming out of these processes makes the whole system of coder-plus-machine weaker?

I find it to be a good thing that the code must be read in order to be production-grade, because that implies the coder must keep learning.

I worry about the collapse in knowledge pipeline when there is very little benefit to overseeing the process...

I say that as a bad coder who can do, and has done, SO MUCH MORE with LLM agents. So I'm not writing this as someone who has an ideal of coding that is being eroded. I'm just entering the realm of "what elite coding can do" with LLMs, but I worry for what the realm will lose, even as I'm just arriving.

vivin 12/22/2025||

You can't get away from the engineering part of software engineering even if you are using LLMs. I have been using Claude Opus 4.5, and it's the best out of the models I have tried. I find that I can get Claude to work well if I already know the steps I need to do beforehand, and I can get it to do all of the boring stuff. So it's a series of very focused and directed one-shot prompts that it largely gets correct, because I'm not giving it a huge task, or something open-ended.

Knowing how you would implement the solution beforehand is a huge help, because then you can just tell the LLM to do the boring/tedious bits.

teaearlgraycold 12/22/2025||

They’re good for getting you from A to B. But you need to know A (current state of the code) and how to get to B (desired end state). They’re fast typists, not automated engineers.

ericmcer 12/22/2025||

Seriously, I stopped using agent mode altogether. I hit it with very specific requests like: write a function that takes an array of X and returns y.

It almost never fails and usually does it in a neat way, plus it's ~50 lines of code so I can copy and paste confidently. Letting the agent just go wild on my code has always been a PITA for me.

vivin 12/22/2025||

I've used agent mode, but I tell it not to go hog wild and to not do anything other than what I have instructed it to do. Also, sometimes I will tell it not to change the code, and to go over its changes with me first, before I tell it that it can make the changes.

I feel the same way as you in general -- I don't trust it to go and just make changes all over the codebase. I've seen it do some really dumb stuff before because it doesn't really understand the context properly.

ColinEberhardt 12/23/2025||

“When an LLM can generate a working high-quality implementation in a single try, that is called one-shotting. This is the most efficient form of LLM programming.”

This is a good article, but misses one of the most important advances this year - the agentic loop.

There are always going to be limits to how much code a model can one-shot. Giving it the ability to verify its changes and iterate massively increases its ability to write sizeable chunks of verified and working code.
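A rough sketch of such a loop, under stated assumptions: generate stands in for a model call and verify for a test or build run. Both are hypothetical callables, not any real agent API, and run_checks is just one toy example of a verifier.

```python
from typing import Callable, Optional

# Hypothetical signatures, not a real library API:
#   generate(task, feedback) -> candidate source code
#   verify(candidate)        -> error message, or None if all checks pass
Generate = Callable[[str, Optional[str]], str]
Verify = Callable[[str], Optional[str]]


def agentic_loop(
    task: str, generate: Generate, verify: Verify, max_iters: int = 5
) -> Optional[str]:
    """Generate, verify, and retry with feedback until checks pass or we give up."""
    feedback: Optional[str] = None
    for _ in range(max_iters):
        candidate = generate(task, feedback)
        feedback = verify(candidate)
        if feedback is None:
            return candidate  # verified: stop iterating
    return None  # budget exhausted without a passing candidate


def run_checks(candidate: str) -> Optional[str]:
    """Toy verifier: execute the candidate and test the `add` function it defines."""
    namespace: dict = {}
    try:
        exec(candidate, namespace)  # only safe for trusted or sandboxed code
        assert namespace["add"](2, 3) == 5, "add(2, 3) should be 5"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None
```

The point of the shape is that the error text from verify is fed back into the next generate call, so each attempt sees why the previous one failed. In practice the verifier would be a test suite or build, run in a sandbox.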

EastLondonCoder 12/22/2025||

I’ve ended up with a workflow that lines up pretty closely with the guidance/oversight framing in the article, but with one extra separation that’s been critical for me.

I’m working on a fairly messy ingestion pipeline (Instagram exports → thumbnails → grouped “posts” → frontend rendering). The data is inconsistent, partially undocumented, and correctness is only visible once you actually look at the rendered output. That makes it a bad fit for naïve one-shotting.

What’s worked is splitting responsibility very explicitly:

• Human (me): judge correctness against reality. I look at the data, the UI, and say things like “these six media files must collapse into one post”, “stories should not appear in this mode”, “timestamps are wrong”. This part is non-negotiably human.

• LLM as planner/architect: translate those judgments into invariants and constraints (“group by export container, never flatten before grouping”, “IG mode must only consider media/posts/*”, “fallback must never yield empty output”). This model is reasoning about structure, not typing code.

• LLM as implementor (Codex-style): receives a very boring, very explicit prompt derived from the plan. Exact files, exact functions, no interpretation, no design freedom. Its job is mechanical execution.

Crucially, I don’t ask the same model to both decide what should change and how to change it. When I do, rework explodes, especially in pipelines where the ground truth lives outside the code (real data + rendered output).

This also mirrors something the article hints at but doesn’t fully spell out: the codebase isn’t just context, it’s a contract. Once the planner layer encodes the rules, the implementor can one-shot surprisingly large changes because it’s no longer guessing intent.
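One way to picture the planner-to-implementor handoff is as plain data rendered into a fixed prompt. This is only a sketch: Plan, Change, render_implementor_prompt, and the example file and function names are hypothetical, and the invariants are quoted from the list above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    path: str         # exact file to touch
    function: str     # exact function to modify
    instruction: str  # what to do, stated mechanically


@dataclass(frozen=True)
class Plan:
    invariants: list[str]  # rules the planner derived from human judgments
    changes: list[Change]  # small, concrete edits for the implementor


def render_implementor_prompt(plan: Plan) -> str:
    """Turn a plan into a boring, explicit prompt with no design freedom."""
    lines = [
        "Apply exactly the changes below. Do not modify anything else.",
        "",
        "Invariants that must hold:",
    ]
    lines += [f"- {inv}" for inv in plan.invariants]
    lines += ["", "Changes:"]
    for i, change in enumerate(plan.changes, start=1):
        lines.append(f"{i}. {change.path} :: {change.function}: {change.instruction}")
    return "\n".join(lines)


# Example plan; the path and function name are made up for illustration.
example_plan = Plan(
    invariants=[
        "group by export container, never flatten before grouping",
        "fallback must never yield empty output",
    ],
    changes=[
        Change("pipeline/grouping.py", "group_media", "group files by container before any flattening"),
    ],
)
```

Calling render_implementor_prompt(example_plan) yields the text handed to the implementor model; the planner never writes code and the implementor never sees the open-ended discussion.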

The challenges are mostly around discipline:

• You have to resist letting the implementor improvise.

• You have to keep plans small and concrete.

• You still need guardrails (build-time checks, sanity logs) because mistakes are silent otherwise.
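As an illustration of the guardrail point, here is a small sanity check that fails loudly instead of silently. It assumes the "export container" is simply a file's parent directory, which is a guess for the sake of the example; group_by_container and check_groups are hypothetical names.

```python
from __future__ import annotations

import logging
from collections import defaultdict
from pathlib import PurePosixPath

log = logging.getLogger("ingest.sanity")


def group_by_container(paths: list[str]) -> dict[str, list[str]]:
    """Group media files by parent directory (assumed to be the export container)."""
    groups: dict[str, list[str]] = defaultdict(list)
    for path in paths:
        groups[str(PurePosixPath(path).parent)].append(path)
    return dict(groups)


def check_groups(paths: list[str], groups: dict[str, list[str]]) -> None:
    """Raise if the grouping breaks an invariant; otherwise log a summary line."""
    if paths and not groups:
        raise ValueError("non-empty input produced empty output")
    grouped = sorted(p for members in groups.values() for p in members)
    if grouped != sorted(paths):
        raise ValueError("files were dropped or duplicated during grouping")
    log.info("grouped %d files into %d posts", len(paths), len(groups))
```

Running check_groups right after every grouping step, or in a build-time test, turns a silent mistake by the implementor into an immediate failure.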

But when it works, it scales much better than long conversational prompts. It feels less like “pair programming with an AI” and more like supervising a very fast, very literal junior engineer who never gets tired, which, in practice, is exactly what these tools are good at.

eurekin 12/23/2025||

If you're interested in the large codebase... The best I've found so far are extended context models. Using the newest Nemotron3 nano, you can put in 1M tokens (about 3-ish megabytes of text) of pure code dump (I use repomix --style markdown) and ask around. That's been one of the biggest wow moments I've had with LLMs so far. Much better experience than any RAG I've used.
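For a quick check of whether a repo would even fit, a back-of-the-envelope estimate can be sketched like this. It assumes roughly 3 bytes per token, which is just the ratio implied by "1M tokens is about 3 megabytes" above; real tokenizers vary, and the extension list is arbitrary.

```python
from __future__ import annotations

import os

# Rough ratio implied by "1M tokens ~ 3 MB of text"; an assumption, not a measurement.
BYTES_PER_TOKEN = 3.0


def estimate_tokens(
    root: str, extensions: tuple[str, ...] = (".py", ".rs", ".ts", ".md")
) -> int:
    """Estimate the token count of source files under `root` from their byte sizes."""
    total_bytes = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip hidden directories such as .git by pruning them in place.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in filenames:
            if name.endswith(extensions):
                total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return int(total_bytes / BYTES_PER_TOKEN)


def fits_in_context(root: str, context_tokens: int = 1_000_000) -> bool:
    """True if the estimated dump size is within the given context window."""
    return estimate_tokens(root) <= context_tokens
```

If fits_in_context(".") comes back False, the dump needs filtering (fewer file types or directories) before it can be pasted into a single prompt.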

spullara 12/22/2025|

Using AugmentCode's Context Engine you can get this either through their VSCode/JetBrains plugins, their Auggie command line coding agent or by registering their MCP server with your local coding agent like Claude Code. It works far better than painstakingly stuffing your own context manually or having your agent use grep/lsp/etc to try and find what it needs.
