Posted by samwillis 1/14/2026
I shared my LLM predictions last week, and one of them was that by 2029 "Someone will build a new browser using mainly AI-assisted coding and it won’t even be a surprise" https://simonwillison.net/2026/Jan/8/llm-predictions-for-202... and https://www.youtube.com/watch?v=lVDhQMiAbR8&t=3913s
This project from Cursor is the second attempt I've seen at this now! The other is this one: https://www.reddit.com/r/Anthropic/comments/1q4xfm0/over_chr...
Let's make someone pass the one we have first. This experiment didn't seem to yield a functioning browser, so why would we raise the bar?
The web needs to be more p5n friendly.
I took a 5-minute look at the layout crate here and... it doesn't look great:
1. Line height calculation is suspicious, the structure of the implementation also suggests inline spans aren't handled remotely correctly
2. Uhm... where is the bidi? Directionality has far reaching implications on an inline layout engine's design. This is not it.
3. It doesn't even consider itself a real engine:
// Estimate text width (rough approximation: 0.6 * font_size * char_count)
// In a real implementation, this would use font metrics
let char_count = text.chars().count() as f32;
let avg_char_width = font_size * 0.5; // Approximate average character width
let text_width = char_count * avg_char_width;
I won't even begin talking about how this particular aspect that it "approximates" also has far-reaching implications on your design... I could probably go on in perpetuity about the things wrong with this, or even test it myself or something. But that's a waste of time I'm not undertaking.
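To make the gap concrete, here is a minimal sketch comparing the quoted "half an em per character" estimate with a per-character advance lookup. The advance table is made up for illustration (`advance_em` and `measure_width` are hypothetical names, not anything from the repo); a real engine would take advances from the font after shaping, which also has to deal with kerning, ligatures and bidi.

```rust
/// Naive estimate in the style of the quoted snippet: every character is
/// assumed to be half an em wide, regardless of which character it is.
fn estimate_width(text: &str, font_size: f32) -> f32 {
    text.chars().count() as f32 * font_size * 0.5
}

/// Hypothetical per-character advance in em units. These numbers are
/// invented for illustration; a real engine reads them from font metrics.
fn advance_em(c: char) -> f32 {
    match c {
        'i' | 'l' | '.' | ',' | ' ' => 0.25,
        'm' | 'w' | 'M' | 'W' => 0.875,
        _ => 0.5,
    }
}

/// Sum the per-character advances and scale by the font size.
fn measure_width(text: &str, font_size: f32) -> f32 {
    text.chars().map(advance_em).sum::<f32>() * font_size
}
```

With this toy table at 16px, "illicit" and "mammoth" both get the same 56px estimate, while the table-based measurement gives 36px and 74px respectively. Any line breaking built on the estimate would therefore break in the wrong place for one or both words.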
Making a "browser" that renders a few particular web pages "correctly" is an order of magnitude easier than a browser that also actually cares about standards.
If this is how "A Browser for the modern age." looks then I want a time machine.
It reminds me of having AI write me an MUI component the other day that implemented the "sx" prop [1] with some code that handles all the individual properties that were used by the component in that particular application. It might have been correct, and the component overall was successful and well coded... but MUI provides a styled() function and a <Box> component, either one of which could have been used to make this component handle all the properties that "sx" is supposed to handle with as little as one line of code. I asked the agent "how would I do this using the tools that MUI provides to support sx" and had a great conversation and got a complete and clear understanding about the right way to do it, but on the first try it wrote something crazy overcomplicated to handle the specific case as opposed to a general-purpose solution that was radically simple. That "web browser" was all like that.
[1] you can write something like sx={width: 4} and MUI multiplies 4 by the application scale and puts on, say, a width: 20px style
You're referring to State of Utopia's[1] web browser, currently available here:
https://taonexus.com/publicfiles/jan2026/172toy-browser.py.t... (turn the volume down if you play the included easter egg mini-game as it's very loud.)
10-minute livestream demonstration:
https://www.youtube.com/watch?v=4xdIMmrLMLo&t=45s
That livestream demonstration is side-by-side with Chrome, rendering very simple pages.
It compiles, renders simple web pages and is able to post.
The differences between cursor's browser and our browser:
- Cursor's long-running autonomously coded browser: over a million lines of code and a trillion tokens, which is computationally intensive and has a high cost.
- State of Utopia's browser: under 3000 lines of code.
- Cursor's browser: does not compile at present. There's no way to use it.
- State of Utopia's browser: compiles in every version. You can use it right away, and it includes a fun easter-egg game.
- Cursor's browser: can't make form submissions
- State of Utopia's browser: can make form submissions.
I'm submitting this using that browser. (I don't know if it will really post or not.)
We are taking feature requests!! Submit your requested feature here:
https://pollunit.com/polls/ahysed74t8gaktvqno100g
We are happy to put any feature you want into the web browser.
[1] will be available at https://stateofutopia.com or https://stofut.com for short (St. of Ut.)
> Given how badly my 2025 predictions aged I'm probably going to sit that one out! [1]
Seven days later you appear on the same podcast you appeared on in 2025 to share your LLM predictions for 2026.
What changed?
That said, I don't really find the critique that models have browser source code in their training data particularly interesting.
If they spat out a full, working implementation in response to a single prompt then sure, I'd be suspicious they were just regurgitating their training data.
But if you watch the transcripts for these kinds of projects you'll see them make thousands of independent changes, reacting to test failures and iterating towards an implementation that matches the overall goals of the project.
The fact that Firefox and Chrome and WebKit are likely buried in the training data somewhere might help them a bit, but it still looks to me more like an independent implementation that's influenced by those and many other sources.
They generate a statistically appropriate token based on a very small context window. And they are slightly nerfed not to reproduce everything verbatim because that would bring all sorts of lawsuits.
Of course they are not reproducing Webkit or Blink or Firefox verbatim. However, it's not an "independent implementation". That's why it's "stringing together a bunch of open-source components": https://news.ycombinator.com/item?id=46649586
Edit: also, this "independent implementation" cannot be compiled by their own CI and doesn't work, apparently.
https://www.youtube.com/watch?v=8kUQWuK1L4w
The "discoverer" of APL tried to express as many problems as he could with his notation. First he found that notation expands and after some more expansion he found that it began shrinking.
The same goes for Forth, which provides means for a Sequitur-compressed [1] representation of a program.
[1] https://en.wikipedia.org/wiki/Sequitur_algorithm
Myself, I always strive to delete some code or replace some code with a shorter version. First, to better understand it; second, to have less to read when I come back to it.
> There are tons of tiny edge cases and standards to comply with where implementing one standard will break 3 others if not done carefully. AI can't do that right now.
Firstly, the CI is completely broken on every commit, all tests have failed, and looking closely at the code, it is exactly what you'd expect from unmaintainable slop.
Having more lines of code is not a good measure of robust software, especially if it does not work.
My angle has been a bit different: scaling autonomous coding for individual developers, and in a much simpler way. I love CLI agents, but I found myself wasting time babysitting terminals while waiting for turns to finish. At some point it clicked: what if I could just email them?
Email sounds backward, but that’s the feature. It’s universal, async, already collaborative. The agent sends me a focused update, I reply with guidance, and it keeps working on a server somewhere, or my laptop, while I’m not glued to my desk. There’s still a human in the loop, just without micromanagement.
It’s been surprisingly joyful and productive, and it feels closer to how real organizations already work. I’ve put together a small, usable tool around this and shared it here if anyone wants to try it or kick the tires: https://news.ycombinator.com/item?id=46629191
I'm not sure the approach of "completely autonomous coding" is the right way to go. I feel like maybe we'll be able to use these agents more effectively if we think of them as something to be used by a human to accomplish some thing, and lean into letting the human drive, because quality spirals so quickly out of control.
Even their README is kind of crappy. Ideally you want installation instructions right near the top, but it's broken into multiple files. The README link that says "running + architecture" (but the file is actually called browser_ui.md???) is hard to follow. There is no explicit list of dependencies, and again no explanation of how JavaScript execution works, or how rendering works, really.
It's impressive that they got such a big project to be built by agents and to compile, but this codebase... Feels like AI slop, and you couldn't pay me to maintain it. You could try to get AI agents to maintain it, but my prediction is that past some scale, they would have a hard time figuring out their own mess. You would just be left with permanent bugs you can't easily fix.
I can’t shake the feeling that simply being shameless about copy-paste (i.e. copyright infringement) would let existing tools do much the same faster and more efficiently. Download Chromium, search-replace ‘Google’ with ‘ME!’, run Make… if I put that in a small app someone would explain that’s actually solvable as a bash one-liner.
There’s a lot of utility in better search and natural language interactions. The siren call of feedback loops plays with our sense of time and might be clouding our sense of progress and utility.
Anyone who has looked at AI art, read AI stories, listened to AI music, or really interacted with AI in any meaningfully critical way would recognize that this was the only predictable result given the current state of AI generated “content”. It’s extremely brittle, and collapses at the smallest bit of scrutiny.
But I guess (to continue steelmanning) the paradigm has shifted entirely. Why do we even need an entire browser for the whole internet? Why can’t we just vibe code a “browser” on demand for each web page we interact with?
I feel gross after writing this.
That agents can write a bunch of code by themselves? We already knew that, and what's even the point of that if the code doesn't work?
I feel like I'm still missing what this entire project and blogpost is about. Is it supposed to be all theoretical or what's the deal?
I guess the fundamental truth that I’m working towards for generative AI is that it appears to have asymptotic performance with respect to recreating whatever it’s trying to recreate. That is, you can throw unlimited computing power and unlimited time at trying to recreate something, but there will still be a missing essence that separates the recreation from the creation. In very small snippets, and for very large compute, there may be reasonable results, but it will never be able to completely replace what can be created in meatspace by meatpeople.
Whether the economics of the tradeoff between “nearly recreated” and “properly created” is net positive is what I think this constant ongoing debate is about. I don’t think it’s ever going to be “it always makes sense to generate content instead of hire someone for this”, but rather a more dirty, “in this case, we should generate content”.
Writing code one function at a time is the furthest thing from what is being showcased in TFA.
But that's the thing, it doesn't compile, has a ton of errors, CI seems to have been broken for a long time... What exactly is supposed to be impressive here, that it managed to generate a bunch of code that doesn't even compile?
What in the holy hackers is this even about? Am I missing something obvious here? How is this news?
Yeah, answers need to be given.
It's about hyping up cursor and writing a blog post. You're not supposed to look at or use the code, obviously.
I suspect the author of the post would agree. This feels much more like an experiment to push the limits of LLMs than anything they're looking to seriously use as a product (or even the basis of a product).
I think the more interesting question is when the approach of completely autonomous coding will be the right way to go. LLMs are definitely progressing along a spectrum of: Can't do it -> Can do it with help -> Can do it alone but code isn't great -> Can do it alone with good code. Right now I'd say they're only in that final step for very small projects (e.g. simple Python scripts), but it seems like an inevitability that they will get there for increasingly large ones.
> Today's agents work well for focused tasks, but are slow for complex projects.
What does slow mean? Slower than humans? Need faster GPUs? What does it even imply? Too slow to produce the next token? Too slow in attempts to be usable? Need human intervention?
This piece is made and written to keep the bubble inflating further.
So I guess they've achieved human parity then?
(I'll see myself out)
The implied future here is _unreal cool_. Swarms of coding agents that can build anything, with little oversight. Long-running projects that converge on high-quality, complex projects.
But the examples feel thin. Web browsers, Excel, and Windows 7 exist, and they specifically exist in the LLM's training sets. The closest to real code is what they've done with Cursor's codebase .... but it's not merged yet.
I don't want to say, call me when it's merged. But I'm not worried about agents' ability to produce millions of lines of code. I'm worried about their ability to intersect with the humans in the real world, both as users of that code and developers who want to build on top of it.
I would go even further, why have they not created at least one less complex project that is working and ready to be checked out? To me it sounds like having a carrot dangle in front of the face of VC investors: 'Look, we are almost there to replace legions of software developers! Imagine the market size and potential cost reductions for companies.'
LLMs are definitely an exciting new tool and they are going to change a lot. But are they worth $B for everything being stamped 'AI'? The future will tell. Looking back the dotcom boom hype felt exactly the same.
The difference with the dotcom boom is that at the time there was a lot more optimism to build a better future. The AI gold rush seems to be focused on getting giga-rich while fscking the bigger part of humanity.
Because it is absolutely impossible to review that code, and there are a gazillion issues in there.
The only way it can get merged is YOLO and then fix issues for months in prod which kinda defeats the purpose and brings gains close to zero.
In my experience agents don't converge on anything. They diverge into low-quality monstrosities which at some point become entirely unusable.
There's just a bit over 3 browsers, 1 serious excel-like and small part of windows user side. That's really not enough for training for replicating those specific tasks.
This is how I think about it. I care about asymptotics. What initial conditions (model(s) x workflow/harness x input text artefacts) causes convergence to the best steady state? The number of lines of code doesn't have to grow, it could also shrink. It's about the best output.
But what would be the point of re-creating existing applications? It would be useful if you can produce a better version of those applications. But the point in this experiment was to produce something "from scratch" I think. Impressive yes, but is it useful?
A more practically useful task would be for Mozilla Foundation and others to ask AI to fix all bugs in their application(s). And perhaps they are trying to do that, let's wait and see.
I think any large piece of software with well established standards and test suites will be able to be quickly rewritten and optimized by coding agents.
[1] https://github.com/sberan/tjs
[2] /spawn-perf-agents claude command: https://github.com/sberan/tjs/blob/main/.claude/commands/spa...
> Another experiment was doing an in-place migration of Solid to React in the Cursor codebase. It took over 3 weeks with +266K/-193K edits. As we've started to test the changes, we do believe it's possible to merge this change.
In my view, this post does not go into sufficient detail or nuance to warrant any serious discussion, and the sparseness of info mostly implies failure, especially in the browser case.
It _is_ impressive that the browser repo can do _anything at all_, but if there was anything more noteworthy than that, I feel they'd go into more detail than volume metrics like 30K commits, 1M LoC. For instance, the entire capability on display could be constrained to a handful of lines that delegate to other libs.
And, it "is possible" to merge any change that avoids regressions, but the majority of our craft asks the question "Is it possible to merge _the next_ change? And the next, and the 100th?"
If they merge the MR they're walking the walk.
If they present more analysis of the browser it's worth the talk (not that useful a test if they didn't scrutinize it beyond "it renders")
Until then, it's a mountain of inscrutable agent output that manages to compile, and that contains an execution pathway which can screenshot apple.com by some undiscovered mechanism.
By far the most useful metric is to have a live system running for a year with widespread usage that produces a lower number of bugs than that of a codebase created by humans.
Until that happens, my skeptic hat will remain firmly on my head.
But is this actually true? They don't say that as far as I can tell, and it also doesn't compile for me nor their own CI it seems.
If you can't reproduce or compile the experiment then it really doesn't work at all and is nothing but a hype piece.
I guess probably at some point, something compiled, but cba to try to find that commit. I guess they should've left it in a better state before doing that blog post.
I do use AI heavily so I resorted to actually turning on warnings as errors in the rust codebases I work in.
Product is still fairly beta, but in Sculptor[^1] we have an MCP that provides agent & human with suggestions along the lines of "the agent didn't actually integrate the new module" or "the agent didn't actually run the tests after writing them." It leads to some interesting observations & challenges - the agents still really like ignoring tool calls compared to human messages b/c they "know better" (and sometimes they do).
It is also close to impossible to run anything in the node ecosystem without getting a wall of warnings.
You are an extreme outlier for putting in the work to fix all warnings
Haven't found that myself, are you talking about TypeScript warnings perhaps? Because I'm mostly using just JavaScript and try to steer clear of TypeScript projects, and AFAIK, neither JavaScript the language nor its runtimes really have warnings, except for deprecations. Are those the ones you're talking about?
Looking at OAI API pricing, 5.2 Codex is $14 per 1 million output tokens, which makes a cool $14M for 1 trillion tokens (multiplied by whatever the plural is). For something that "kind of works".
It's a nice ad for OAI and Anysphere, but maybe next time - just donate the money to a browser team?
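As a sanity check on that arithmetic, here is a small sketch that takes the figures above at face value. It assumes every token is billed at the quoted output rate, which likely overstates the bill since input and cached tokens are usually priced lower; `token_cost_usd` is a hypothetical helper name.

```rust
/// Cost in whole US dollars (rounded down) for `tokens` tokens at a flat
/// price of `usd_per_million_tokens` dollars per one million tokens.
/// Assumption: a single flat rate applies to every token.
fn token_cost_usd(tokens: u64, usd_per_million_tokens: u64) -> u64 {
    // Multiply first, in u128, so partial millions are not truncated early
    // and large token counts cannot overflow.
    ((tokens as u128 * usd_per_million_tokens as u128) / 1_000_000) as u64
}
```

Calling `token_cost_usd(1_000_000_000_000, 14)` gives 14,000,000, so the $14M per trillion tokens figure checks out under that assumption.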
I started looking in the Cargo.toml to at least get an idea of how the project was constructed. I saw there that, rather than being built from scratch as the post seemed to imply, almost every core component was simply pulled in from an open source library: quickjs engine, wgpu graphics, winit windowing & input, egui for ui, html parsing, the list goes on. On twitter their CEO explicitly stated that it uses a "custom js vm" which seemed particularly misleading / untrue to me.
Integrating all of these existing components is still super impressive for these models to do autonomously, so I'm just at a loss how to feel when it does something impressive but they then feel the need to misrepresent so much. I guess I just have a lot less respect and trust for the cursor leadership, but maybe a little relief knowing that soon I may just generate my own custom cursor!
https://github.com/dioxuslabs/blitz
Maybe we ended up in the training data!
Great work by the way, blitz seems to be coming along nicely, and I even see you guys created a proto browser yourselves which is pretty cool, actually functional unlike Cursor's.
Take a screenshot and take it to your manager / investor and make a presentation “Imagine what is now possible for our business”.
Get promoted / exit, move to other pastures and let them figure it out.
It's hard to avoid the impression that this is an unverified pile of slop that may have actually never worked.
The CI process certainly hasn't succeeded for the vast majority of commits.
Baffling, really.
> On twitter their CEO explicitly stated that it uses a "custom js vm" which seemed particularly misleading / untrue to me.
The JS engine used a custom JS VM being developed in vendor/ecma-rs as part of the browser, which is a copy of my personal JS parser project vendored to make it easier to commit to.
I agree that for some core engine components, it should not be simply pulling in dependencies. I've begun the process of removing many of these and co-developing them within the repo alongside the browser. A reasonable goal for "from scratch" may be "if other major browsers use a dependency, it's fine to do so too". For example: OpenSSL, libpng, HarfBuzz, Skia. The current project can be moved more towards this direction, although I think using libraries for general infra that most software use (e.g. windowing) can be compatible with that goal.
I'd push back on the idea that all the agents did was wire up dependencies — the JS VM, DOM, paint systems, chrome, text pipeline, are all being developed as part of this project, and there are real complex systems being engineered towards the goal of a browser engine, even if not there yet.
In various comments in https://news.ycombinator.com/item?id=46624541 I have explained at length why your fleet of autonomous agents failed miserably at building something that could be seen as a valid POC.
One example: your rendering loop does not follow the web specs and makes no sense.
https://github.com/wilsonzlin/fastrender/blob/19bf1036105d4e...
The above design document is simply nonsense; typical AI hallucinated BS. Detailed critique at https://news.ycombinator.com/item?id=46705625
The actual code is worse; I can only describe it as a tangle of spaghetti. As a Browser expert I can't make much, if anything, out of it. In comparison, when I look at code in Ladybird, a project I am not involved in, I can instantly find my way around the code because I know the web specs.
So I agree this isn't just wiring up of dependencies, and neither is it copied from existing implementations: it's a uniquely bad design that could never support anything resembling a real-world web engine.
Now don't get me wrong, I do think AI could be leveraged to build a web engine, but not by unleashing autonomous agents. You need humans in the loop at all levels of abstractions; the agents should only be used to bang out features re-using patterns established or vetted by human experts.
If you want to do this the right way, get in touch: https://github.com/gterzian
This to me seems to raise more questions than it answers.
What is `FrameState::render_placeholder`?
```
    pub fn render_placeholder(&self, frame_id: FrameId) -> Result<FrameBuffer, String> {
        let (width, height) = self.viewport_css;
        let len = (width as usize)
            .checked_mul(height as usize)
            .and_then(|px| px.checked_mul(4))
            .ok_or_else(|| "viewport size overflow".to_string())?;

        if len > MAX_FRAME_BYTES {
            return Err(format!(
                "requested frame buffer too large: {width}x{height} => {len} bytes"
            ));
        }

        // Deterministic per-frame fill color to help catch cross-talk in tests/debugging.
        let id = frame_id.0;
        let url_hash = match self.navigation.as_ref() {
            Some(IframeNavigation::Url(url)) => Self::url_hash(url),
            Some(IframeNavigation::AboutBlank) => Self::url_hash("about:blank"),
            Some(IframeNavigation::Srcdoc { content_hash }) => {
                let folded = (*content_hash as u32) ^ ((*content_hash >> 32) as u32);
                Self::url_hash("about:srcdoc") ^ folded
            }
            None => 0,
        };
        let r = (id as u8) ^ (url_hash as u8);
        let g = ((id >> 8) as u8) ^ ((url_hash >> 8) as u8);
        let b = ((id >> 16) as u8) ^ ((url_hash >> 16) as u8);
        let a = 0xFF;

        let mut rgba8 = vec![0u8; len];
        for px in rgba8.chunks_exact_mut(4) {
            px[0] = r;
            px[1] = g;
            px[2] = b;
            px[3] = a;
        }

        Ok(FrameBuffer {
            width,
            height,
            rgba8,
        })
    }
}
```
What is it doing in these diffs?
https://github.com/wilsonzlin/fastrender/commit/f4a0974594e3...
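Read in isolation, the quoted `render_placeholder` appears to do no rendering at all: it fills a buffer with one solid, opaque colour derived from the frame ID and a hash of the navigation target. A minimal standalone sketch of that behaviour follows. The integer widths (`u64` ID, `u32` hash) and the names `placeholder_color` and `fill_rgba` are assumptions for illustration, since the real types are not visible in the excerpt.

```rust
/// Derive the fill colour the way the quoted code does: XOR the low three
/// bytes of the frame ID with the low three bytes of the URL hash, and use
/// a fully opaque alpha. The parameter types are assumed, not confirmed.
fn placeholder_color(id: u64, url_hash: u32) -> [u8; 4] {
    let r = (id as u8) ^ (url_hash as u8);
    let g = ((id >> 8) as u8) ^ ((url_hash >> 8) as u8);
    let b = ((id >> 16) as u8) ^ ((url_hash >> 16) as u8);
    [r, g, b, 0xFF]
}

/// Fill a width x height RGBA8 buffer with a single colour. Returns None
/// if the byte length would overflow usize, mirroring the checked_mul
/// guard in the quoted function.
fn fill_rgba(width: usize, height: usize, color: [u8; 4]) -> Option<Vec<u8>> {
    let len = width.checked_mul(height)?.checked_mul(4)?;
    let mut buf = vec![0u8; len];
    for px in buf.chunks_exact_mut(4) {
        px.copy_from_slice(&color);
    }
    Some(buf)
}
```

If that reading is right, two frames with different IDs or URLs get visibly different flat colours, which fits the "catch cross-talk in tests/debugging" comment, but it says nothing about whether real page content is ever painted.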
I'd be really curious to see the amount of work/rework over time, and the token/time cost for each additional actual completed test case.
I'd want to see some system, that 100%s the first task, saturation, does a great job on the next, then does a valiant effort on the third, then finally makes something promising but as yet unusable on the last.
This way we could see that scaling up difficulty results in a gradual decline in quality, and could have a decent measurement of where we are at and where we are going.
If one vulnerability exists in those crates, well, that's that.
I can create a web browser in under a minute in Copilot if I ask it to build a WinForms project that embeds the WebView2 "Edge" component and just adds an address bar and a back button.