
Posted by samwillis 1/14/2026

Scaling long-running autonomous coding(cursor.com)
290 points | 197 comments
Snuggly73 1/15/2026|
The only thing I actually got to run on WSL2 was the "Excel" (I couldn't get anything to compile on Mac or Windows).

It's a broken mess that probably implements 0.00001% of Excel. And it's 1.2M LOC.

With codebases developed in this way, either they need to figure out how agents are going to maintain them (in which case SWE as we know it is dead; it will be limited to those who can spend trillions of tokens), or they are going to remain weird demos.

timabdulla 1/15/2026|
I'd be curious to see screenshots or a video! I only have a Mac at my disposal, unfortunately.
logicallee 1/15/2026||
At the same time they were doing this, I also iterated on an AI-built web browser with around 2,000 lines of code. I was heavily in the loop for it, it didn't run autonomously. You can see the current version of the source code here:

https://taonexus.com/publicfiles/jan2026/172toy-browser.py.t... (turn the sound down, it's a bit loud if you interact with the built-in Tetris clone.)

You can run it after installing the packages, "pip install requests pillow urllib3 numpy simpleaudio"

I livestreamed the latest version here 2 weeks ago, it's a ten minute video:

https://www.youtube.com/watch?v=4xdIMmrLMLo&t=45s

I'm posting from that web browser. As an easter egg, mine has a cool Tetris clone (called Pentrix) based on pieces with 5 segments, the button for this is at the upper-right.

If you have any feature suggestions for what you want in a browser, please make them here:

https://pollunit.com/polls/ahysed74t8gaktvqno100g

physicsguy 1/15/2026||
I have been trying Claude Code a lot this week. Two projects:

* A small statically generated Hugo website, but with some clever linking/taxonomy stuff. This was a fairly self-contained project that is now 'finished' but wouldn't have taken me more than a few days to code up from scratch.

* A scientific simulation package, to try and do a clean refresh of an existing one which I can point at for implementation details but which has some technical problems I would like to reduce/remove.

Claude Code absolutely smashed the first one - no issues at all. With the second, no matter what I tried, it just made lots of mistakes, even when I just told it to copy the problematic parts and transpose them into the new structure. It basically got to a point where it wasn't correct and couldn't get out of a bit of a 'doom loop', requiring manual intervention no matter how much prompting and how many hints I gave it.

Bishonen88 1/15/2026||
Similar experience here.

I signed up for Claude Code myself this week too, given the $10/month promo. I have experience with AI from using AWS Kiro at work and from directly prompting Claude Opus. After just two days and ~5-6 vibe coding sessions in total, I have a working Life-OS app built for my needs:

- A clone of Todoist with the features that I actually use/want: projects, tags, due dates, quick adding with a Todoist-like text-aware input (e.g. !p1, Today, etc.)

- A Fantastical-like calendar. Again, the 80% of features I actually used from Fantastical

- A Habit Tracker

- A Goal Tracker (Quarterly / Yearly)

- A dashboard page showing today's summary with single-click edit/complete marking

- User authentication and sharing of various features (e.g. tasks)

- Docker deployment which will eventually run on my NAS

I'm going to add a few more things and cancel quite a few subscriptions. It one-shots all tasks within minutes. It's wild. I can code, but I didn't bother looking at the code myself, because... why?

Even though I don't earn US Tech money, I'm tempted to buy the Max subscription for a month or two, although the price is still hard to swallow.

Claude and vibe coding are wild. If I can clone Todoist within a few vibe coding sessions and then implement any additional/new feature I want within minutes, instead of proposing, praying, and then waiting for months, why would I pay $$$...

DauntingPear7 1/16/2026||
Wth are your usage limits? Are they increased? I'll hit a usage limit in about 2-3 hours of using Sonnet 4.5, and Opus is on a weekly limit.
underdeserver 1/15/2026||
On Twitter people are saying GPT-5.2 is better. That's also what Cursor used in their testing. Maybe try it?
physicsguy 1/15/2026||
I have Web access for ChatGPT through work, but not API access annoyingly.
xpil 1/20/2026||
Codex plugin (VSCode) allows consuming your "web" (ie non-api) subscription for coding/agentic tasks.
nl 1/15/2026||
Remember when 3D printers meant the death of factories? Everyone would just print what they wanted at home.

I'm very bullish on LLMs building software, but this doesn't mean the death of software products anymore than 3D printers meant the death of factories.

ben_w 1/15/2026|
Perhaps, but I don't think that's a good analogy; there are too many important differences to say (3D printing : all manufacturing) :: (vibe coding : all software).

The hype may be similar, if that's your point then I agree, but the weakness of 3D printing is the range of materials and the conditions needed to work with them (titanium is merely extremely difficult, but no sane government will let the general public buy tetrafluoroethylene as a feedstock), while the weakness of machine learning (even more broadly than LLMs) is the number of examples they require in order to learn stuff.

torginus 1/15/2026||
I'm kinda surprised how negative and skeptical everyone is here.

It kinda blows my mind that this is possible, to build a browser engine that approximates a somewhat working website renderer.

Even if we take the most pessimistic interpretation of events (heavy human steering, reliance on existing libraries, sloppy code quality in places, not all versions compile, etc.), it's still impressive.

ben_w 1/15/2026||
I'm not too surprised, the way I read a lot of (not all!*) the negative comments is ~"I'm imagining having to work with this code, I'd hate it". Even though I'm fairly impressed with the work LLMs do, this has also been my experience of them… albeit with a vibe-coding** sample size of 1, done over a few days with some spare credit.

The positive views are mostly from people who point out that what matters in the end is what the code does, not what it looks like, e.g. users don't see the code, nor do they care about the code, and that even for businesses who do care, LLMs may be the ones who have to pay down any technical debt that builds up.

* Anyone in a field where mistakes are expensive. In one project, I asked the LLM to code-review itself and it found security vulnerabilities in its own solutions. It's probably still got more I don't know about.

** In the original sense of just letting the LLM do whatever it wanted in response to the prompt, never reading or code reviewing the result myself until the end.

polyglotfacto 1/21/2026||||
> what matters in the end is what the code does, not what it looks like

That is true in a way, although even for agents readability matters.

But the code here does not actually do the right thing, and the way it is written also means it never could.

Web devs do care whether the engine runs their code according to Web standards (otherwise it's early IE all over again), and end-users do care that websites work as their devs intended.

Current state is throw-away level quality.

I've critiqued it at length in the other post, see https://news.ycombinator.com/item?id=46705625

satvikpendem 1/15/2026|||
The problem I've had with vibe coding is akin to the adage of the first 90% of the code taking 90% of the time, and the last 10% taking the other 90%. The LLM can get you to 90% initially, but then it hits a wall unless you, the user, know what it's doing and outputting. That is very difficult when you're vibe coding, which by definition means you're not looking at the code at all. And then you have to read thousands of lines of code you don't understand, so it's often easier to stop and hand-code a new version yourself, which is precisely what I've done with some of my projects.
alfalfasprout 1/16/2026||
The problem is that getting 90% of the way there, but poorly, makes that last 10% much harder.
polyglotfacto 1/21/2026||
It's obvious by now that AI can write a whole bunch of code approximating all kinds of things. So there is no reason anymore for this to impress anyone.

A well-architected POC built in a week with a clear path to scaling it to a full implementation down the line would be impressive, but that's not what this is.

The current code output is basically throw-away level quality AI hallucinated BS.

danieloj 1/15/2026||
I'm not sure "building a web browser" is such a great test for an LLM. It helps confirm that they can handle large codebases. But the actual logic in the browser engine will be based very heavily on Chromium/Firefox etc.
jphoward 1/14/2026||
With the browser it built, obviously the context of the entire project is huge. They mention loads of parallel agents in the blog post, so I guess each agent is given a module to work on, and some tests? And then a 'manager' agent plugs this in without reading the code? I can't see how else, even with ChatGPT 5.2/Gemini 3, you could do this. In retrospect it seems an obvious approach, akin to how humans work in teams, but it's still interesting.
simonw 1/14/2026||
GPT-5.2-Codex has a 400,000 token window. Claude 4.5 Opus is half of that, 200,000 tokens.

It turns out to matter a whole lot less than you would expect. Coding Agents are really good at using grep and writing out plans to files, which means they can operate successfully against way more code than fits in their context at a single time.
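The "grep instead of loading everything" pattern is easy to picture in code. The sketch below is purely illustrative (the `search_repo` name, the snippet format, and the two-line context window are all my invention, not any real agent's tool API): a search tool returns small, line-addressed snippets, so only the matching fragments ever enter the model's context.

```python
import re
from pathlib import Path

def search_repo(root: str, pattern: str, context_lines: int = 2) -> list[str]:
    """Grep-like search: return small snippets instead of whole files.

    Each hit is formatted as 'path:lineno' followed by the matching line
    plus a few lines of surrounding context, so an agent can decide what
    (if anything) to open in full.
    """
    regex = re.compile(pattern)
    snippets = []
    for path in Path(root).rglob("*.py"):
        lines = path.read_text(errors="ignore").splitlines()
        for i, line in enumerate(lines):
            if regex.search(line):
                lo, hi = max(0, i - context_lines), i + context_lines + 1
                snippet = "\n".join(lines[lo:hi])
                snippets.append(f"{path}:{i + 1}\n{snippet}")
    return snippets
```

An agent calling a tool like this over a million-line repo only ever sees a few kilobytes per query, which is why the raw window size matters less than you'd expect.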

jaggederest 1/15/2026||
The other issue with "a huge token window" is that if you fill it, it seems like relevance for any specific part of the window is diminished - which makes it hard to override default model behavior.

Interestingly, recently it seems to me like codex is actually compressing early and often so that it stays in the smarter-feeling reasoning zone of the first 1/3rd of the window, which is a neat solution for this, albeit with the caveat of post-compression behavior differences cropping up more often.
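The compaction idea can be sketched in a few lines. This is not Codex's actual mechanism, just a toy model of it: whenever the transcript exceeds a budget (here a third of the window, using character count as a crude token proxy), the two oldest entries are folded into one summary, so the live context stays in the early region. `summarize` stands in for a real model call.

```python
def compact(messages: list[str], window: int, summarize) -> list[str]:
    """Fold the oldest entries together until the transcript fits the budget."""
    budget = window // 3  # stay within the early, "smarter-feeling" region
    while sum(len(m) for m in messages) > budget and len(messages) > 1:
        # Merge the two oldest entries and replace them with a summary;
        # the newest messages are never touched.
        merged = messages[0] + "\n" + messages[1]
        messages = [summarize(merged)] + messages[2:]
    return messages
```

The caveat in the comment above falls out of this sketch naturally: everything behind the summary boundary is lossy, so post-compaction behavior can differ from what an un-compacted transcript would have produced.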

observationist 1/14/2026|||
Get a good "project manager" agents.md and it changes the whole approach to vibe coding. In a professional environment, with each agent given a little domain, arranged in the usual hierarchy of your coding team, truly amazing things can get done.

Presumably the security and validation of code still needs work, I haven't read anything that indicates those are solved yet, so people still need to read and understand the code, but we're at the "can do massive projects that work" stage.

Division of labor and planning and hierarchy are all rapidly advancing, the orchestration and coordination capabilities are going to explode in '26.
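For the curious, a "project manager" agents.md might look something like the fragment below. This is an invented illustration of the idea, not anything from the post or a known template; the section names and rules are my own.

```markdown
# AGENTS.md (project-manager role)

## Role
You are the project manager. You do not write code yourself; you split work
into small, testable tasks and delegate each one to a worker agent.

## Workflow
1. Break the request into modules with explicit interfaces.
2. For each module, write acceptance tests first and hand them to a worker.
3. Only merge a worker's output when its tests pass.
4. Keep a PLAN.md up to date so any agent can recover context after a restart.

## Constraints
- Never let a worker touch files outside its assigned module.
- Escalate security-sensitive changes to the human for review.
```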

heliumtera 1/15/2026|||
I tried this approach yesterday and I'm loving our daily standup with the agents. Looking forward to our retro and health-check rituals.
azan_ 1/15/2026|||
Could you perhaps share such an agents.md? Sounds interesting.
galaxyLogic 1/14/2026|||
> so I guess each agent is given a module to work on, and some tests?

Who creates those agents and gives them the tasks to work on? Who creates the tests? AI, or the humans?

nl 1/15/2026||
Generally they only load a bit of the project into the context at a time. Grep works really well for working out what to load.
tired_and_awake 1/14/2026||
The moment all code is interacted with through agents, I cease to care about code quality. The only things that matter are the quality of the product, the cost of maintenance, etc., exactly the things we measure software development orgs against. It could be handy to have these projects deployed to demonstrate their utility and efficacy. Looking at agents' PRs feels wrong-headed; who cares if an agent's code is hard to read if agents are managing the code base?
qingcharles 1/15/2026||
We don't read the binary output of our C compilers because we trust it to be correct almost every time. ("It's a compiler bug" is more of a joke than a real issue)

If AI could reach the point where we actually trusted the output, then we might stop checking it.

LiamPowell 1/15/2026|||
> "It's a compiler bug" is more of a joke than a real issue

It's a very real issue; people just seem to assume their code is wrong rather than the compiler. I've personally reported 12 GCC bugs over the last 2 years, and there are 1,239 open wrong-code bugs currently.

Here's an example of a simple one in the C frontend that has existed since GCC 4.7: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105180

ares623 1/15/2026|||
“If” doing a lot of work here
flyinglizard 1/15/2026|||
You could look at agents as meta-compilers. The problem is that, unlike real compilers, they aren't verified in any way (neither formally nor informally); in fact, you never know which particular agent you're running against when you ask for something. And unlike compilers, you don't just throw everything away and start afresh on each run. I don't think you could test a reasonably complex system to a degree where it really wouldn't matter what runs underneath, and as you're (probably) going to use other agents to write THOSE tests, what makes you certain they offer real coverage? It's turtles all the way down.
tired_and_awake 1/15/2026||
Completely agree, and great points. The conclusion that "agents are writing the tests" etc. is where I'm at as well. Moreover, the code quality itself is also an agentic problem, as are compile time, reliability, portability... Turtles all the way down, as you say.

All code interactions all happen through agents.

I suppose the question is whether the agents only produce Swiss-cheese solutions at scale, with no way to fill in those gaps (at scale). Then yeah, fully agentic coding is probably a pipe dream.

On the other hand, if you can stand up a code-generation machine where it's watts + GPUs + time => software products, then well... it's only a matter of time until app stores entirely disappear or get really weird. It's hard to fathom the change that's coming to our profession in that world.

AlexCoventry 1/15/2026|||
You should at least read the tests, to make sure they express your intent. Personally, I'm not going to take responsibility for a piece of code unless I've read every line of it and thought hard about whether it does what I think it does.

AI coding agents are still a huge force-multiplier if you take this approach, though.

visarga 1/14/2026|||
> Looking at PRs of agents feels a wrong headed

It would be walking the motorcycle.

icedchai 1/14/2026||
This is how we wound up with non-technical "engineering managers." Looks good to me.
tired_and_awake 1/15/2026||
I think this misses the point, see the other comments. Fully scaled agentic coding replaces managers too :) cause for celebration all around
satvikpendem 1/15/2026|||
No, it becomes only managers, because they are the ones who dictate the business needs (otherwise, what is the software the agents are making even for?), and it's now even worse with non-technical ones.
icedchai 1/15/2026|||
I don't believe that. If you go fully agentic and you don't understand the output, you become the manager. You're in no better position than the pointy-haired boss from Dilbert.
tired_and_awake 1/15/2026||
Hey just wanted to thank you for the healthy back and forth! I respect your opinion and don't hold mine strongly. That said I'm eager for this space to mature and for us all to figure out the best way to interact with fault prone code generation tooling... Especially at scale where we all have the hardest time navigating complexity.
icedchai 1/15/2026||
Thanks. It's fun chatting about this stuff! I don't hold mine strongly, either, though I am dealing with lots of AI generated slop code from others.

Interesting times ahead.

tired_and_awake 1/16/2026||
I feel for you. Hopefully your colleagues come around and realize that if they submit the code they are responsible for the slop.
navinsylvester 1/15/2026|
All this focus on long-running agents without focusing on core restructuring is baffling. The immediate need is to break down complex tasks into smaller ones and single-shot them with some amount of parallelism. IMO we need an opinionated system, but with a human in the middle, and only then think about dreamy next steps. We need to focus on groundedness first instead of worrying about an agent conjuring something from thin air. The decision to leapfrog into automated long-running agents is quite baffling.

The boys are trying to single-shot a browser when a moderately complex task can derail a repo. There's not a lot of info, which might be deliberate, but from what I can pick up, their value-add was "distributed computing and organisational design", though even that they simplified. I agree that simplicity is always the first option, but a flat filesystem structure without standards will not work. Period.

vivekv 1/15/2026|
I would agree with this. There are definite challenges in grounded specifications today, and the tendency for an LLM to go off on tangents is still a struggle that we all deal with every day.