CursorBench 3.1 - Hacker News

Posted by handfuloflight 14 hours ago

143 points | 77 comments | page 2

verse 12 hours ago|

backwards X axis? is there a reason for that? it looks ridiculous

gkbrk 11 hours ago||

It looks very natural, cheaper is better after all. Performance axis going up, and cheapness axis going up match each other.

0123456789ABCDE 11 hours ago||

gp's argument is that cheapness is a construct, derived from the real, and natural, cost parameter which most people are naturally accustomed to interpreting as increasing from left to right. cheapness would then replace the cost label, and feel natural. alas, this is not what we have here.

anon373839 11 hours ago||

This seems to be a common choice with AI industry graphs, to give you that “upward and outward” frontier shape.

shadeslayer_ 11 hours ago||

Do these benchmarks even add any value at this point? This one is basically Cursor saying that their model is as good as the frontier ones at a fraction of the price. The independent benchmarks are probably part of training data now and the models are pattern-matching against them all the time. The final test of a model (and the harness, probably) is how well it works FOR YOU - since most of the models can pretty much do most of our tasks on a daily basis - it boils down to which one has the least friction to its usage.

bfjvibybd6cuvu6 11 hours ago||

No shot 2.5 is beating out 4.8

tmach32 10 hours ago||

Why would anyone take this benchmark seriously? Cursor is obviously biased here. They can design it and its presentation however they want to tell the story they want to tell.

mi_lk 7 hours ago||

Cursor: Find me another benchmark where Composer 2.5 is a top 10 frontier coding model

leerob 2 hours ago|

(I work at Cursor) We score well on Terminal-Bench and SWE-bench Multilingual. DeepSWE, not so great yet, as it's more for very long-horizon tasks. We're planning to include more public benchmarks in our next model release.

xrisk 11 hours ago||

Would like to see wall times. I feel that’s the part that annoys me most: my tasks aren’t particularly challenging, I just want them done fast.

luckilydiscrete 11 hours ago||

insert obama medal meme

anilgulecha 13 hours ago||

is composer 2.5 that good at that pricepoint? Seems like the gemini flash playbook of trying to get most bang for the buck.

soyin 13 minutes ago||

I'm also using it as my daily driver. I've been trying Opus 4.8 this week to see if I was missing something but haven't noticed a meaningful difference.

I'm working on a fairly routine full stack web app that isn't doing anything incredible. Once I had the patterns I wanted in place, it's been very capable of following those with new work. I also don't ever give it long running tasks, it's always focused and small chunks.

My typical workflow is:

1. /grill-me feature description

2. Create a plan

3. Manually review plan and tweak as needed (usually very little to none)

4. Build the plan

All with Composer 2.5. Earlier on in the project I used Claude and GPT for #1 and #2.

I find it really hard to justify the other models for the performance/cost I'm getting with Composer 2.5. Maybe it's not as strong as the frontier models, but it's been plenty good enough for my use cases.

danfritz 13 hours ago|||

It's my daily driver, it's fast, affordable, and with a bit of guidance gets the job done.

I only reach for Claude when I need to plan something big or want to have a sparring partner to fire off some ideas.

I think what a lot of people don't realize is that you don't need a frontier model for 80% of coding tasks. Composer 2.5 is often more than good enough, less token hungry and way faster

shockembopper 12 hours ago||

I have been doing the same for quite a while now. Composer 2.5 is incredible when you’re working in the loop.

simondotau 7 hours ago||

When you normalise for time and money, Composer 2.5 is way, way, way, way better than anything else out there. Yes it requires more babysitting, but that's a good thing.

uf00lme 13 hours ago|||

It's surprisingly usable and cheap enough to run in 'fast' mode when vibing something quick. For simple code I find I prefer the code it writes over the GLM or Gemini families.

fumar 13 hours ago|||

It’s fast and affordable.

aabdi 13 hours ago||

yes, it's very good.

o10449366 13 hours ago||

I feel like this benchmark reiterates my disbelief that anyone uses the latest Anthropic models for any productive work. They seem to be the best at burning tokens and spawning unnecessary subagents even for well-defined and tightly scoped tasks.

Can we get a count of people that have had Claude read irrelevant documents or perform unnecessary web searches even when told not to from the beginning?

I'm starting to wonder if this increased token usage is inadvertently bleeding into how Anthropic actually trains their model, especially leading up to an IPO. As older models are deprecated and users are forced onto newer models, if the default is less efficient and more token-expensive, that directly results in higher "profit" for Anthropic in terms of the consumption their users have to tolerate - lest they jump to a competitor.

cbg0 9 hours ago||

I've had no problems like the ones you've mentioned while using Opus 4.8. It does overthink stuff with higher effort levels but that's kind of expected.

mwigdahl 6 hours ago||

Same (including the overthinking issue).

pbowyer 13 hours ago|||

> I feel like this benchmark reiterates my disbelief that anyone uses the latest Anthropic models for any productive work. They seem to be the best at burning tokens and spawning unnecessary subagents even for well-defined and tightly scoped tasks.

I keep Claude around for some specific tasks:

- Linked up to Figma MCP to implement front-end stuff

- Data analysis, in the "Connect AI to a data source and ask questions" way. I've tried both Opus 4.8 high and GPT 5.5 high for this and Opus is stronger because it gets the intent in the question better

I used to keep it around for planning too, but the 4.8 plans have had more holes than Swiss cheese.

anon373839 11 hours ago|||

> I'm starting to wonder if this increased token usage is inadvertently bleeding into how Anthropic actually trains their model

Related: Sonnet 5’s new tokenizer increases token usage by 30%. (https://simonwillison.net/2026/Jun/30/claude-sonnet-5/)

mrngld 7 hours ago||

Now that enterprise customers are pay-as-you-go with tokens I suspect we'll see renewed interest in OpenAI and their focus on token efficiency. At least I hope so if the alternative is abandoning the tools entirely.

avikaa_ 12 hours ago|

[flagged]