Posted by leerob 10/29/2025
> their own internal benchmark that they won't release
If they'd release their internal benchmark suite, it'd make it into the training set of about every LLM, which from a strictly scientific standpoint, invalidates all conclusions drawn from that benchmark from then on. On the other hand, not releasing the benchmark means they could've hand-picked the datapoints to favor them. It's a problem that can't be resolved unfortunately.
ARC-AGI-2 keeps a private set of questions to prevent LLM contamination, but they have a public set of training and eval questions so that people can both evaluate their models before submitting to ARC-AGI and evaluate what the benchmark is measuring:
https://github.com/arcprize/ARC-AGI-2
Cursor is not alone in the field in having to deal with issues of benchmark contamination. Cursor is an outlier in sharing so little when proposing a new benchmark while also not showing performance in the industry standard benchmarks. Without a bigger effort to show what the benchmark is and how other models perform, I think the utility of this benchmark is limited at best.
We could have third-party groups with evaluation criteria who don't make models or sell A.I. Strictly evaluators. Alternatively, they could have a different kind of steady income, with evaluation being the only A.I. work they do.
Why publish the obscured benchmarks in the first place, then?
Benchmarks have become less and less useful. We have our own tests that we run whenever a new model comes out. It's a collection of trivial -> medium -> hard tasks that we've gathered, and it's much more useful to us than any published table. And it leads to more interesting finds, such as using cheaper models (5-mini, fast-code-1, etc) on some tasks vs. the big guns on other tasks.
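For illustration, here's a minimal sketch of what a tiered private harness like that could look like; the `Task` structure, the example task, and the `run_model` callable are hypothetical placeholders, not the commenter's actual setup.

```python
# A minimal sketch of a tiered private eval harness. The task list, checks,
# and model-calling function are all placeholders.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    name: str
    difficulty: str                  # "trivial", "medium", or "hard"
    prompt: str
    check: Callable[[str], bool]     # True if the model's output passes

def evaluate(run_model: Callable[[str], str], tasks: list[Task]) -> dict[str, float]:
    """Pass rate per difficulty tier for one model (run_model wraps your API client)."""
    results: dict[str, list[bool]] = {}
    for task in tasks:
        output = run_model(task.prompt)
        results.setdefault(task.difficulty, []).append(task.check(output))
    return {tier: sum(ok) / len(ok) for tier, ok in results.items()}

# Example: keep cheap models on trivial/medium tiers and reserve the
# expensive model for tasks the cheap one fails.
tasks = [
    Task("fizzbuzz", "trivial", "Write fizzbuzz in Python.", lambda out: "fizzbuzz" in out.lower()),
]
```

Keeping the tasks private matters for the same contamination reason discussed above: once the prompts leak into training data, pass rates stop meaning much.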
I'm happy to see Cursor iterate, as they were pretty vulnerable to the labs leaving them behind when all of them came out with coding agents. The multi-agent feature w/ built-in git worktree support is another big thing they launched recently. They can use their users as "teacher models" for multiple completions by competing models, and by proxying those calls, they get all the signals. And they can then use those signals to iterate on their own models. Cool stuff. We actually need competing products keeping each other in check, w/ the end result being more options for us, and sometimes even cheaper usage overall.
I wonder how much the methods/systems/data transfer; if they can pull off the same with their agentic coding model, that would be exciting.
I actually find myself using the agent mode less now; I like keeping code lean by hand and avoiding technical debt. But I do use the tab completions constantly, and they are fantastic now that they can jump around the file.
I run Claude Code in the background near constantly for a variety of projects, with --dangerously-skip-permissions, and review progress periodically. Tabbing is only relevant when it's totally failing to make progress and I have to manually intervene, and that to me is a failure scenario that is happening less and less often.
Usually I'll have several Claude Code sessions running in parallel on different projects, and when one of them stops I will review the code for that project and start it again - either moving forwards or re-doing things that have issues.
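As a rough sketch of that workflow (assuming the `claude` CLI's non-interactive `-p` mode; the project paths and prompts below are made up):

```python
# Sketch of running several Claude Code sessions in parallel, one per project.
# The paths and prompts are placeholders; review each project's changes by
# hand before starting its session again.
import subprocess

projects = {
    "/work/project-a": "Continue implementing the feature described in SPEC.md",
    "/work/project-b": "Fix the failing tests and keep the suite green",
}

procs = {
    path: subprocess.Popen(
        ["claude", "-p", prompt, "--dangerously-skip-permissions"],
        cwd=path,
    )
    for path, prompt in projects.items()
}

# When a session exits, review the diff for that project, then restart it,
# either moving forwards or re-doing things that have issues.
for path, proc in procs.items():
    proc.wait()
    print(f"{path}: session exited with code {proc.returncode}; review before restarting")
```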
I'm not against YOLO vibe coding, but being against tab completion is just insane to me. At the end of the day, LLMs help you achieve goals quicker. You still need to know what goal you want to achieve, and tab completion basically lets me complete a focused goal nearly as soon as I determine what my goal is.
And it's not remotely "YOLO vibe coding". All the code gets reviewed and tested thoroughly; it's worked to specs and gated by test suites.
What I don't do is babysit the LLM until its code passes both the test suite and automated review stages, because it's a waste of time.
Others of these projects are research tasks. While I wrote this comment, Claude unilaterally fixed a number of bugs in a compiler.
I tried to use an appropriate emoji to express the joking nature of this comment, but HN silently filtered it out, so pretend you see a grinning face.
Every time I write code myself I find myself racing the AI to get an indentation in before the AI is done... gets annoying
I am an ML researcher at Cursor and worked on this project. Would love to hear any feedback you may have on the model, and I can answer questions about the blog post.
I don't use these tools that much (I tried and rejected Cursor a while ago, and decided not to use it), but having played with GPT-5 Codex (as a paying customer) yesterday in regular VSCode, and having had Composer1 do the exact same things just now, it's night and day.
Composer did everything better, didn't stumble where Codex failed, and most importantly, the speed makes a huge difference. It's extremely comfortable to use, congrats.
Edit: I will therefore reconsider my previous rejection
Cursor Composer and Windsurf SWE 1.5 are both finetuned versions of GLM.
GPT-5-codex does more research before tackling a task, that is the biggest weakness for me not using Composer yet.
Could you provide any color on whether ACP (from Zed) will be supported?
Its generation speed is not the problem or the time sink.
It's wrestling with it to get the right output.
---
And just to clarify (maybe I misunderstood again), since people are comparing Cursor to Claude Code, Codex, etc. here: isn't this whole article all Cursor, just using different models?
literally a 30-day-old model and you've moved the "low" goalpost all the way there haha. funny how humans work
Speed of model just isn't the bottleneck for me.
Before it I used Opus 4.1, and before that Opus 4.0 and before that Sonnet 4.0 - which each have been getting slightly better. It's not like Sonnet 4.5 is some crazy step function improvement (but the speed over Opus is definitely nice)
Also, didn't realize you worked at Cursor - I'm a fan of your work - they're lucky to have you!
Totally agree that "smart model" is the table stakes for usefulness these days.
Wow, no kidding. It is quite good!
It’s the only coding agent I’m actually really motivated to use out of the box because it really does make me feel more productive while the others keep messing up the project, from way too large changes I didn’t ask for all the way to constant syntax and request errors.
It’s the only coding agent I’ve used that feels serious about being a product rather than a prototype. Their effort in improving their stack is totally paying off.
Countless times my requests in the AI chat just hang there for 30+ seconds until I can retry them.
When I decided to give Claude Code a try (I thought I didn't need it because I used Claude in Cursor), I couldn't believe how much faster it was, and literally 100% reliable.
EDIT: given today's release, decided to give it a go. The Composer1 model _is_ fast, but right at the second new agent I started I got this:
> Connection failed. If the problem persists, please check your internet connection or VPN
I would be willing to bet money your issue is on your side. I am a daily user since the beginning and cannot recall when I have had issues like you describe unless it was related to my corp network.
(Cursor dev)
Note, later I started using Codex and now Codex is my daily driver, Claude Code for problems where Codex fails (not many), and again Cursor is never used.
They were the first mover but Codex (in my opinion) blows Cursor up into 1000 tiny pieces. It's just so, so much better.
Can't help but notice you haven't tried Zed!
Also, somehow magically, I’ve found Cursor’s Auto mode to be significantly faster than the specific models I’ve tried, Claude being among them.
I would agree it is not as good at lengthy work where it's taking a design all the way through implementing a feature in a single shot, but trivial is not a good description.
I also don’t think you’re right. 3.5 was recently deprecated and even before then, Cursor has been hitting rate limits with Anthropic. Auto is as much a token cost optimization as it is a rate limit optimization.
(Cursor researcher)
($1.25 input, $1.25 cache write, $0.13 cache read, and $10 output per million tokens)
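For a rough sense of what those rates mean per request, here's a quick back-of-the-envelope calculation; only the per-million-token rates come from the line above, and the token counts are made up.

```python
# Cost of one hypothetical agent turn at the quoted per-million-token rates.
RATES = {  # dollars per million tokens (from the pricing above)
    "input": 1.25,
    "cache_write": 1.25,
    "cache_read": 0.13,
    "output": 10.00,
}

# Made-up token counts for a single turn.
usage = {"input": 20_000, "cache_write": 5_000, "cache_read": 150_000, "output": 3_000}

cost = sum(RATES[k] * usage[k] / 1_000_000 for k in usage)
print(f"${cost:.4f}")  # roughly $0.08 for this example turn
```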