Posted by samwillis 15 hours ago

Scaling long-running autonomous coding(cursor.com)
230 points | 138 comments
simonw 15 hours ago|
"To test this system, we pointed it at an ambitious goal: building a web browser from scratch."

I shared my LLM predictions last week, and one of them was that by 2029 "Someone will build a new browser using mainly AI-assisted coding and it won’t even be a surprise" https://simonwillison.net/2026/Jan/8/llm-predictions-for-202... and https://www.youtube.com/watch?v=lVDhQMiAbR8&t=3913s

This project from Cursor is the second attempt I've seen at this now! The other is this one: https://www.reddit.com/r/Anthropic/comments/1q4xfm0/over_chr...

mrefish 15 hours ago||
Time to raise the bar. By 2029 someone will build a new browser using mainly AI-assisted coding and the surprise is that it was designed to be used by pelicans.
embedding-shape 4 hours ago||
> Time to raise the bar

Let's make someone pass the one we have first. This experiment didn't seem to yield a functioning browser, so why would we raise the bar?

jcfrei 1 hour ago|||
Surely a smart implementation would just find the Chromium source on GitHub, do some cosmetic rewrites and strip out all non-essential features?
afishhh 1 hour ago|||
> The other is this one: https://www.reddit.com/r/Anthropic/comments/1q4xfm0/over_chr...

I took a 5-minute look at the layout crate here and... it doesn't look great:

1. Line height calculation is suspicious, the structure of the implementation also suggests inline spans aren't handled remotely correctly

2. Uhm... where is the bidi? Directionality has far reaching implications on an inline layout engine's design. This is not it.

3. It doesn't even consider itself a real engine:

        // Estimate text width (rough approximation: 0.6 * font_size * char_count)
        // In a real implementation, this would use font metrics
        let char_count = text.chars().count() as f32;
        let avg_char_width = font_size * 0.5; // Approximate average character width
        let text_width = char_count * avg_char_width;
I won't even begin talking about how this particular aspect that it "approximates" also has far reaching implications on your design...
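To make the point about the width estimate concrete, here is a minimal sketch comparing a fixed-ratio guess with a sum of per-glyph advances. The advance table in `toy_advances` is made up for illustration (not taken from any real font), and `estimate_width` / `measured_width` are hypothetical names. Real text measurement also involves shaping, kerning, ligatures and bidi, none of which is modeled here.

```rust
use std::collections::HashMap;

/// Fixed-ratio estimate in the style of the quoted snippet:
/// every character is assumed to be 0.5 * font_size wide.
fn estimate_width(text: &str, font_size: f32) -> f32 {
    text.chars().count() as f32 * font_size * 0.5
}

/// Sum per-glyph advances (in em units) and scale by the font size.
/// `advances` stands in for real font metrics; `fallback` is used for
/// characters missing from the table.
fn measured_width(
    text: &str,
    font_size: f32,
    advances: &HashMap<char, f32>,
    fallback: f32,
) -> f32 {
    text.chars()
        .map(|c| advances.get(&c).copied().unwrap_or(fallback))
        .sum::<f32>()
        * font_size
}

/// Made-up advance widths, purely illustrative.
fn toy_advances() -> HashMap<char, f32> {
    HashMap::from([('i', 0.25), ('l', 0.25), ('m', 0.875), ('w', 0.75)])
}

fn main() {
    let advances = toy_advances();
    for word in ["illi", "mmmm"] {
        println!(
            "{word}: estimate {} vs measured {}",
            estimate_width(word, 16.0),
            measured_width(word, 16.0, &advances, 0.5)
        );
    }
}
```

With these toy numbers the estimate gives both words the same width, while the per-glyph sum differs by more than a factor of three, which is the kind of error that moves line breaks.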

I could probably go on in perpetuity about the things wrong with this, even test it myself or something. But that's a waste of time I'm not undertaking.

Making a "browser" that renders a few particular web pages "correctly" is an order of magnitude easier than a browser that also actually cares about standards.

If this is how "A Browser for the modern age." looks then I want a time machine.

bob1029 14 hours ago|||
The goal I am currently using for long horizon coding experiments is implementation of a PDF rasterizer given an ISO32000 specification document.
xenni 13 hours ago||
We're almost there, I've been working on something similar using a markdown'd version of the ISO32000 spec
leptons 13 hours ago|||
Great, they can call it "artificial Internet Explorer", or aIE for short.
hahahahhaah 6 hours ago|||
Web browser should be easy as source exists. Fix all SVG bugs in my browser tho...
viraptor 5 hours ago||
There are 3.5 serious open codebases of web browsers currently. Only two are full featured. It's not nothing, but it's very far from "source exists so it's easy to copy what they do".
machiaweliczny 2 hours ago||
But detailed specs exist for both HTML and JS, tests also exist, and there is an unlimited amount of test data. You can just try running a webpage or program, and you also have reference implementations, which makes it much easier for agents to understand. They also know HTML super well from scraping the whole internet, but it's still impressive.
cheevly 15 hours ago|||
2029? I have no idea why you would think this is so far off. More like Q2 2026.
xmprt 15 hours ago|||
You're either overestimating the capabilities of current AI models or underestimating the complexity of building a web browser. There are tons of tiny edge cases and standards to comply with where implementing one standard will break 3 others if not done carefully. AI can't do that right now.
torginus 6 hours ago|||
Even if AI will not achieve the ability to perform at this level on its own, it clearly is going to be an enormous force multiplier, allowing highly skilled devs to tackle huge projects more or less on their own.
rlt 6 hours ago||||
Not only edge cases and standards, but also tons of performance optimizations.
rvz 13 hours ago|||
It's most likely both.

> There are tons of tiny edge cases and standards to comply with where implementing one standard will break 3 others if not done carefully. AI can't do that right now.

Firstly, the CI is completely broken on every commit and all tests have failed. Looking closely at the code, it is exactly what you would expect from unmaintainable slop.

Having more lines of code is not a good measure of robust software, especially if it does not work.

gordonhart 14 hours ago||||
Web browsers are insanely hard to get right, that’s why there are only ~3 decent implementations out there currently.
qingcharles 9 hours ago||
The one nice thing about web browsers is that they have a reasonably formalized specification set and a huge array of tests that can be used. So this makes them a fairly unique proposition ideally suited to AI construction.
pleurotus 6 hours ago||
As far as I read on Ladybird's blog updates, the issue is less the formalised specs, and more that other browsers break the specs, so websites adjust, so you need to take the non-compliance to specs into account with your design
johnfn 9 hours ago||||
You should make your own predictions, and then we can do a retrospective on who was right.
mkoubaa 14 hours ago||||
Yeah if you let them index chromium I'm sure it could do it next week. It just won't be original or interesting.
geeunits 15 hours ago|||
[flagged]
dang 9 hours ago||
Please don't cross into personal attack on HN.

https://news.ycombinator.com/showhn.html

keepamovin 3 hours ago||
That makes a lot of sense for massive-scale efforts like a browser, using coordinated agents to push toward a huge, well defined target with existing benchmarks and tests.

My angle has been a bit different: scaling autonomous coding for individual developers, and in a much simpler way. I love CLI agents, but I found myself wasting time babysitting terminals while waiting for turns to finish. At some point it clicked: what if I could just email them?

Email sounds backward, but that’s the feature. It’s universal, async, already collaborative. The agent sends me a focused update, I reply with guidance, and it keeps working on a server somewhere, or my laptop, while I’m not glued to my desk. There’s still a human in the loop, just without micromanagement.

It’s been surprisingly joyful and productive, and it feels closer to how real organizations already work. I’ve put together a small, usable tool around this and shared it here if anyone wants to try it or kick the tires: https://news.ycombinator.com/item?id=46629191

jamesnorden 4 minutes ago||
Is the code not even compiling a feature or...
embedding-shape 15 hours ago||
Did anyone manage to run the tests from the repository itself? The code seems filled with errors and warnings, and as far as I can tell none of them are because of the platform I'm on (Linux). I went and looked at the Actions workflow history for some pages, and it seems CI has been failing for a while; PRs have also all been failing CI but were merged anyway. How exactly was this verified to be something to be used as a successful example, or am I misunderstanding what point they are trying to make? They mention a screenshot, but they never actually mention if their goal was successfully met, do they?

I'm not sure the approach of "completely autonomous coding" is the right way to go. I feel like maybe we'll be able to use agents more effectively if we think of them as something to be used by a human to accomplish some thing, and lean into letting the human drive, because quality spirals so quickly out of control.

snek_case 10 hours ago||
I found the codebase very hard to navigate. Hundreds (over a thousand?) of tiny files with less than 200 lines of code each, in deeply nested subdirectories. I wanted to find where the JavaScript engine and the DOM implementation were located, and I couldn't easily find either, even using the GitHub search feature. I'm not exactly sure what this browser implements and how.

Even their README is kind of crappy. Ideally you want installation instructions right near the top, but it's broken into multiple files. The README link that says "running + architecture" (but the file is actually called browser_ui.md???) is hard to follow. There is no explicit list of dependencies, and again no explanation of how JavaScript execution works, or how rendering works, really.

It's impressive that they got such a big project to be built by agents and to compile, but this codebase... Feels like AI slop, and you couldn't pay me to maintain it. You could try to get AI agents to maintain it, but my prediction is that past some scale, they would have a hard time figuring out their own mess. You would just be left with permanent bugs you can't easily fix.

datsci_est_2015 1 hour ago|||
To steelman the vibecoders’ perspective, I think the point is that the code is not meant for you to read.

Anyone who has looked at AI art, read AI stories, listened to AI music, or really interacted with AI in any meaningfully critical way would recognize that this was the only predictable result given the current state of AI generated “content”. It’s extremely brittle, and collapses at the smallest bit of scrutiny.

But I guess (to continue steelmanning) the paradigm has shifted entirely. Why do we even need an entire browser for the whole internet? Why can’t we just vibe code a “browser” on demand for each web page we interact with?

I feel gross after writing this.

embedding-shape 1 hour ago||
If it's not meant to be read, and not meant to be run since it doesn't compile and doesn't seem to have been able to for quite some time, what is this meant to demonstrate?

That agents can write a bunch of code by themselves? We already knew that, and what's even the point of that if the code doesn't work?

I feel like I'm still missing what this entire project and blogpost is about. Is it supposed to be all theoretical or what's the deal?

bonesss 7 hours ago||||
So the chain of events here is: copy existing tutorials and public/available code, train the model to spit it out-ish when asked, a mature-ish specification is used, and now they jitter and jumble towards a facsimile of a junior copy paste outsourcing nightmare they can’t maintain (creating exciting liabilities for all parties involved).

I can’t shake the feeling that simply being shameless about copy-paste (i.e. copyright infringement) would let existing tools do much the same faster and more efficiently. Download Chromium, search-replace ‘Google’ with ‘ME!’, run Make… if I put that in a small app someone would explain that’s actually solvable as a bash one-liner.

There’s a lot of utility in better search and natural language interactions. The siren call of feedback loops plays with our sense of time and might be clouding our sense of progress and utility.

kungfuscious 6 hours ago||
You raise a good point, which is that autonomous coding needs to be benchmarked on designs/challenges where the exact thing being built isn't part of the model's training set.
NitpickLawyer 5 hours ago||
swe-REbench does this. They gather real issues from github repos on a ~monthly basis, and test the models. On their leaderboard you can use a slider to select issues created after a model was released, and see the stats. It works for open models, a bit uncertain on closed models. Not perfect, but best we have for this idea.
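The date-cutoff idea described here can be sketched roughly as follows. This is not the benchmark's actual code: `Issue`, `created_day` and `unseen_issues` are hypothetical names, and dates are simplified to day counts from an arbitrary epoch. The only point is that evaluation is restricted to issues created after the model's release, so they cannot have been in its training set.

```rust
/// A benchmark task derived from a real issue (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
struct Issue {
    id: u32,
    /// Creation date as days since an arbitrary epoch (simplification).
    created_day: u32,
}

/// Keep only issues created strictly after the model was released.
fn unseen_issues(issues: &[Issue], model_release_day: u32) -> Vec<&Issue> {
    issues
        .iter()
        .filter(|issue| issue.created_day > model_release_day)
        .collect()
}

fn main() {
    let issues = vec![
        Issue { id: 1, created_day: 100 },
        Issue { id: 2, created_day: 200 },
        Issue { id: 3, created_day: 300 },
    ];
    // A model released on day 200 is only scored on issue 3.
    for issue in unseen_issues(&issues, 200) {
        println!("eligible issue: {}", issue.id);
    }
}
```

As the comment notes, this only helps when the training cutoff is actually known, which is harder to confirm for closed models.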
embedding-shape 4 hours ago|||
> It's impressive that they got such a big project to be built by agents and to compile

But that's the thing, it doesn't compile, has a ton of errors, and CI seems to have been broken for a long time... What exactly is supposed to be impressive here, that it managed to generate a bunch of code that doesn't even compile?

What in the holy hackers is this even about? Am I missing something obvious here? How is this news?

underdeserver 3 hours ago|||
Looks like it doesn't compile for at least one other guy (I myself haven't tried): https://github.com/wilsonzlin/fastrender/issues/98

Yeah, answers need to be given.

askl 4 hours ago|||
> What in the holy hackers is this even about? Am I missing something obvious here?

It's about hyping up cursor and writing a blog post. You're not supposed to look at or use the code, obviously.

csomar 7 hours ago||
You can stop reading the article from here:

> Today's agents work well for focused tasks, but are slow for complex projects.

What does slow mean? Slower than humans? Need faster GPUs? What does it even imply? Too slow to produce the next token? Too slow in attempts to be usable? Need human intervention?

This piece is made and written to keep the bubble inflating further.

george_atom 5 minutes ago||
Reviewing all this code is the issue.
trjordan 15 hours ago||
This is going to sound sarcastic, but I mean this fully: why haven't they merged that PR.

The implied future here is _unreal cool_. Swarms of coding agents that can build anything, with little oversight. Long-running projects that converge on high-quality, complex projects.

But the examples feel thin. Web browsers, Excel, and Windows 7 exist, and they specifically exist in the LLM's training sets. The closest to real code is what they've done with Cursor's codebase .... but it's not merged yet.

I don't want to say, call me when it's merged. But I'm not worried about agents' ability to produce millions of lines of code. I'm worried about their ability to intersect with the humans in the real world, both as users of that code and developers who want to build on top of it.

dust42 1 hour ago||
> This is going to sound sarcastic, but I mean this fully: why haven't they merged that PR.

I would go even further, why have they not created at least one less complex project that is working and ready to be checked out? To me it sounds like having a carrot dangle in front of the face of VC investors: 'Look, we are almost there to replace legions of software developers! Imagine the market size and potential cost reductions for companies.'

LLMs are definitely an exciting new tool and they are going to change a lot. But are they worth $B for everything being stamped 'AI'? The future will tell. Looking back the dotcom boom hype felt exactly the same.

The difference with the dotcom boom is that at the time there was a lot more optimism to build a better future. The AI gold rush seems to be focused on getting giga-rich while fscking the bigger part of humanity.

orlp 6 hours ago|||
> Long-running projects that converge on high-quality, complex projects

In my experience agents don't converge on anything. They diverge into low-quality monstrosities which at some point become entirely unusable.

embedding-shape 4 hours ago||
Yeah, I don't think they're built for that either, you need a human to steer the "convergtion", otherwise they indeed end up building monstrosities.
viraptor 5 hours ago|||
> Web browsers, Excel, and Windows 7 exist, and they specifically exist in the LLM's training sets.

There are just a bit over 3 browsers, 1 serious Excel-like, and a small part of the Windows user side. That's really not enough training data for replicating those specific tasks.

energy123 7 hours ago|||
> Long-running projects that converge

This is how I think about it. I care about asymptotics. What initial conditions (model(s) x workflow/harness x input text artefacts) cause convergence to the best steady state? The number of lines of code doesn't have to grow, it could also shrink. It's about the best output.

risyachka 14 hours ago|||
>> why haven't they merged that PR.

Because it is absolutely impossible to review that code and there are a gazillion issues in it.

The only way it can get merged is YOLO and then fix issues for months in prod which kinda defeats the purpose and brings gains close to zero.

mkoubaa 14 hours ago||
On the other hand, finding and fixing issues for months is still training data
dist-epoch 15 hours ago||
Pretty much everything exists in the training sets. All non-research software is just a mishmash of various standard modules and algorithms.
galaxyLogic 14 hours ago||
Not everything, only code-bases of existing (open-source?) applications.

But what would be the point of re-creating existing applications? It would be useful if you can produce a better version of those applications. But the point in this experiment was to produce something "from scratch" I think. Impressive yes, but is it useful?

A more practically useful task would be for Mozilla Foundation and others to ask AI to fix all bugs in their application(s). And perhaps they are trying to do that, let's wait and see.

conradev 7 hours ago|||
Re-creating closed source applications as open source would have a clear benefit because people could use those applications in a bunch of new ways. (implied: same quality bar)
mkoubaa 14 hours ago|||
You have to be careful which codebase to try this on. I have a feeling if someone unleashed agents on the Linux kernel to fix bugs it'd lead to a ban on agents there
torginus 5 hours ago||
Personally, what I don't like about this, now that I think about it, is that they didn't scale up gradually. Let's say there's a ladder of complexity in software, starting at a simple React CRUD app, going on to something more complex such as a Paint clone, then to something even more complex like a file manager, and ending up at one of the most complex pieces of software ever made: a web browser.

I'd want to see some system, that 100%s the first task, saturation, does a great job on the next, then does a valiant effort on the third, then finally makes something promising but as yet unusable on the last.

This way we could see that scaling up difficulty results in a gradual decline in quality, and could have a decent measurement of where we are at and where we are going.

micimize 14 hours ago||
> While it might seem like a simple screenshot, building a browser from scratch is extremely difficult.

> Another experiment was doing an in-place migration of Solid to React in the Cursor codebase. It took over 3 weeks with +266K/-193K edits. As we've started to test the changes, we do believe it's possible to merge this change.

In my view, this post does not go into sufficient detail or nuance to warrant any serious discussion, and the sparseness of info mostly implies failure, especially in the browser case.

It _is_ impressive that the browser repo can do _anything at all_, but if there was anything more noteworthy than that, I feel they'd go into more detail than volume metrics like 30K commits, 1M LoC. For instance, the entire capability on display could be constrained to a handful of lines that delegate to other libs.

And, it "is possible" to merge any change that avoids regressions, but the majority of our craft asks the question "Is it possible to merge _the next_ change? And the next, and the 100th?"

If they merge the MR they're walking the walk.

If they present more analysis of the browser it's worth the talk (not that useful a test if they didn't scrutinize it beyond "it renders")

Until then, it's a mountain of inscrutable agent output that manages to compile, and that contains an execution pathway which can screenshot apple.com by some undiscovered mechanism.

embedding-shape 14 hours ago||
> it's a mountain of inscrutable agent output that manages to compile

But is this actually true? They don't say that as far as I can tell, and it also doesn't compile for me nor their own CI it seems.

sashank_1509 14 hours ago|||
Oh it doesn’t compile? that’s very revealing
rvz 13 hours ago||
Some people just believe anything said on X these days. No timeline from start to finish, just "trust me bro".

If you can't reproduce or compile the experiment then it really doesn't work at all and is nothing but a hype piece.

micimize 14 hours ago|||
Hah I don't know actually! I was assuming it must if they were able to get that screenshot video.
Snuggly73 13 hours ago||
error: could not compile `fastrender` (lib) due to 34 previous errors; 94 warnings emitted

I guess probably at some point, something compiled, but cba to try to find that commit. I guess they should've left it in a better state before doing that blog post.

jaggederest 13 hours ago||
I find it very interesting the degree to which coding agents completely ignore warnings. When I program I generally target warning-free code, and even with significant effort in prompting, I haven't found a model that treats warnings as errors, and they almost all love the "ignore this warning" pragmas or comments over actually fixing them.
conception 12 hours ago|||
You can use hooks to keep them from being able to do this btw
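As a rough illustration of what such a hook could check, here is a sketch of a scanner that flags lint-suppression markers in a file's contents. How it gets wired into a particular agent's hook mechanism is tool-specific and not shown. `find_suppressions` and the marker list are assumptions for illustration, not any tool's built-in behaviour.

```rust
/// Returns (1-based line number, trimmed line) for every line that
/// appears to silence a lint. The marker list is an illustrative guess.
fn find_suppressions(source: &str) -> Vec<(usize, &str)> {
    const MARKERS: [&str; 4] = ["#[allow(", "#![allow(", "eslint-disable", "@ts-ignore"];
    source
        .lines()
        .enumerate()
        .filter(|(_, line)| MARKERS.iter().any(|m| line.contains(*m)))
        .map(|(index, line)| (index + 1, line.trim()))
        .collect()
}

fn main() {
    // A real hook would read the file the agent just edited and exit
    // non-zero on any hit; a hard-coded sample keeps this sketch runnable.
    let sample = "fn a() {}\n#[allow(dead_code)]\nfn b() {}\n";
    for (line_no, text) in find_suppressions(sample) {
        println!("line {line_no}: lint suppression found: {text}");
    }
}
```

A plain substring match like this will also flag markers inside strings or comments, so it is a blunt gate rather than a precise one.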
jaggederest 9 hours ago||
I generally think of needing hooks as being a model training issue - I've had to use them less as the models have gotten smarter, hopefully we'll reach the point where they're a nice bonus instead of needed to prevent pathological model behavior.
ianbutler 13 hours ago||||
Yeah I've had problems with this recently. "Oh those are just warnings." Yes but leaving them will make this codebase shit in short time.

I do use AI heavily so I resorted to actually turning on warnings as errors in the rust codebases I work in.
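For anyone who hasn't set this up, a minimal sketch of warnings-as-errors in Rust looks like the following. It uses the outer attribute form on a module so the snippet can be pasted anywhere. Crate-wide, the usual options are `#![deny(warnings)]` at the top of `lib.rs` or `main.rs`, or building with `RUSTFLAGS="-D warnings"`. The `strict` module and `area` function are hypothetical examples.

```rust
// Every warning raised inside this module becomes a hard compile error.
#[deny(warnings)]
mod strict {
    /// Multiplies width by height.
    pub fn area(width: u32, height: u32) -> u32 {
        // Adding an unused binding here, e.g. `let scratch = 0;`,
        // would fail the build instead of producing a warning.
        width * height
    }
}

fn main() {
    println!("{}", strict::area(3, 4));
}
```

The same idea carries over to Clippy with `cargo clippy -- -D warnings`, which turns its lints into build failures as well.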

suriya-ganesh 13 hours ago|||
Unfortunately this is not the most common practice. I've worked on Rust codebases with 10K+ warnings, and Rust was supposed to help you.

It is also close to impossible to run anything in the Node ecosystem without getting a wall of warnings.

You are an extreme outlier for putting in the work to fix all warnings

jaggederest 12 hours ago||
`cargo clippy` is also very happy with my code. I agree and I think it's kind of a tragedy, I think for production work warnings are very important. Certainly, even if you have a large number of warnings and `clippy` issues, that number ideally should go down over time, rather than up.
meander_water 10 hours ago||
The lowest bar in agentic coding is the ability to create something which compiles successfully. Then something which runs successfully in the happy path. Then something which handles all the obvious edge cases.

By far the most useful metric is to have a live system running for a year with widespread usage that produces a lower number of bugs than that of a codebase created by humans.

Until that happens, my skeptic hat will remain firmly on my head.

danieloj 1 hour ago||
I'm not sure "building a web browser" is such a great test for an LLM. It helps confirm that they can handle large codebases. But the actual logic in the browser engine will be based very heavily on Chromium/Firefox etc.
ZitchDog 15 hours ago||
I used similar techniques to build tjs [1] - the world's fastest and most accurate JSON Schema validator, with magical TypeScript types. I learned a lot about autonomous programming. I found a similar "planner/delegate" pattern to work really well, with the use of git subtrees to fan out work [2].

I think any large piece of software with well established standards and test suites will be able to be quickly rewritten and optimized by coding agents.

[1] https://github.com/sberan/tjs

[2] /spawn-perf-agents claude command: https://github.com/sberan/tjs/blob/main/.claude/commands/spa...

tehsauce 10 hours ago|
I was excited to try it out so I downloaded the repo and ran the build. However there were 100+ compilation errors. So I checked the commit history on github and saw that for at least several pages back all recent commits had failed in the CI. It was not clear which commit I should pick to get the semi-working version advertised.

I started looking in the Cargo.toml to at least get an idea of how the project was constructed. I saw there that, rather than being built from scratch as the post seemed to imply, almost every core component was simply pulled in from an open source library: QuickJS engine, wgpu graphics, winit windowing & input, egui for UI, HTML parsing, the list goes on. On Twitter their CEO explicitly stated that it uses a "custom js vm", which seemed particularly misleading / untrue to me.

Integrating all of these existing components is still super impressive for these models to do autonomously, so I'm just at a loss how to feel when it does something impressive but they then feel the need to misrepresent so much. I guess I just have a lot less respect and trust for the cursor leadership, but maybe a little relief knowing that soon I may just generate my own custom cursor!

jkelleyrtp 7 hours ago||
WGPU for render, winit for window, servo css engine, taffy for layout sounds eerily similar to our existing open source Rust browser blitz.

https://github.com/dioxuslabs/blitz

Maybe we ended up in the training data!

satvikpendem 4 hours ago||
I follow Dioxus and particularly blitz / #native on your Discord and I noticed the exact same thing too. There was a comment in a readme in Cursor's browser repo they linked mentioning taffy and I thought, hang on, it's definitely not from scratch, as they advertise. People really do believe everything they read on Twitter.

Great work by the way, blitz seems to be coming along nicely, and I even see you guys created a proto browser yourselves which is pretty cool, actually functional unlike Cursor's.

whatever1 9 hours ago|||
You are doing it wrong.

Take a screenshot and take it to your manager / investor and make a presentation “Imagine what is now possible for our business”.

Get promoted / exit, move to other pastures and let them figure it out.

eeL3bo1mohn7pee 8 hours ago|||
Of 63295 workflow runs, apparently only 1426 (about 2.3%) have been successful.

It's hard to avoid the impression that this is an unverified pile of slop that may have actually never worked.

The CI process certainly hasn't succeeded for the vast majority of commits.

Baffling, really.

handfuloflight 7 hours ago||
Let us all generate our own custom cursors.