
Posted by kachapopopow 14 hours ago

Improving 15 LLMs at Coding in One Afternoon. Only the Harness Changed (blog.can.ac)
569 points | 225 comments
rafaelmn 14 hours ago|
I wonder if we'll get to "vi for LLMs" - if the model were trained on that kind of text navigation, and the harness showed it the context around the cursor as it navigates.

Would also be worth having special tokens for this kind of navigation.
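A minimal sketch of what that interface could look like, assuming a command set loosely borrowed from vi (`j`, `k`, `/pattern`) and a harness that returns a few lines of context around the cursor after each move; all names here are illustrative, not from the article:

```python
# Hypothetical vi-like navigation interface for an LLM harness:
# the model emits a command, the harness replies with context
# around the cursor instead of the whole file.
class ViBuffer:
    def __init__(self, text, window=1):
        self.lines = text.splitlines()
        self.row = 0          # cursor line (0-indexed)
        self.window = window  # lines of context shown on each side

    def move(self, cmd):
        if cmd == "j":        # down one line
            self.row = min(self.row + 1, len(self.lines) - 1)
        elif cmd == "k":      # up one line
            self.row = max(self.row - 1, 0)
        elif cmd.startswith("/"):  # search forward, like vi's /pattern
            pat = cmd[1:]
            for i, line in enumerate(self.lines):
                if pat in line:
                    self.row = i
                    break
        return self.context()

    def context(self):
        lo = max(self.row - self.window, 0)
        return self.lines[lo:self.row + self.window + 1]

buf = ViBuffer("one\ntwo\nthree\nfour")
print(buf.move("/three"))  # lines around the match
```

Each command (or a special token standing in for it) is cheap to emit, and the harness controls how much context comes back.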

1313ed01 14 hours ago||
I always thought ed would be a perfect match. Line-based instead of having to manage cursor movements.
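For illustration, ed's line-addressed model can be mimicked in a few lines; this is a hypothetical sketch of the idea (a change command applied by line number, no cursor state to track), not ed itself:

```python
# Sketch of ed-style editing: address lines by number, so the model
# never has to manage cursor movements between edits.
def ed_change(lines, addr, new_lines):
    """Replace line `addr` (1-indexed, ed convention) with `new_lines`."""
    return lines[:addr - 1] + new_lines + lines[addr:]

buf = ["alpha", "beta", "gamma"]
buf = ed_change(buf, 2, ["BETA"])   # like ed's `2c` followed by new text
print(buf)
```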
cousinbryce 14 hours ago||
I bet it’s good enough at VI already
XCSme 6 hours ago||
Google banning you for benchmarking is crazy. Are you sure that's the cause? How would they even know you're benchmarking?
giancarlostoro 11 hours ago||
One of the first things I add to my Claude instructions file is to stop using grep; it's awfully slow. Just use ripgrep instead: you can type the word you're looking for from the project root and find everything in one shot. Claude likes to go folder by folder with grep, and it drives me crazy.

"You're absolutely right!"

At this point I'd take a contract with Anthropic just to have Claude Code pick better tooling.
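Such an instruction might look something like this in a Claude instructions file; the heading and wording below are one person's convention, not anything official:

```markdown
<!-- CLAUDE.md (illustrative) -->
## Tooling
- Do not use `grep -r`; use ripgrep (`rg`) instead.
- Search once from the project root, e.g. `rg "some_function"`,
  rather than walking directories one by one.
```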

softwaredoug 12 hours ago||
It's underrated how much improving harnesses, not just models, has driven productive use of LLMs at tasks like coding over the last year.
jbetala7 8 hours ago||
I switched from a basic prompt wrapper to structured tool use with Claude Code and the quality of output jumped overnight. Same model, completely different results.
the_harpia_io 11 hours ago||
The harness bottleneck is real. I've been working on AI code-security tooling, and the biggest issue isn't model capability; it's that most tools treat the output as gospel. They'll take a suggested fix and apply it without checking whether it even compiles, let alone whether it introduces new vulns. I've seen fixes that patch one CVE but break auth logic entirely.

The edit-tool point hits, though. When you give the model a better interface to express changes (structured diffs vs. free-form patches), error rates drop. But nobody talks about this, because benchmarks measure "did it solve the problem", not "how many attempts" or "what's the blast radius when it fails". Idk, maybe I'm just jaded from debugging too many of these.
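The "don't treat the output as gospel" point can be sketched in a few lines: before applying a model-suggested Python fix, at least check that it parses. A real harness would also build the project and run tests; `safe_apply` is a hypothetical name, not any tool's API:

```python
# Gate a model-suggested fix behind a syntax check before applying it.
def safe_apply(original: str, suggested: str) -> str:
    try:
        compile(suggested, "<suggested-fix>", "exec")  # does it even parse?
    except SyntaxError:
        return original  # reject the fix, keep the known-good code
    return suggested

good = "def f():\n    return 1\n"
broken = "def f(:\n    return 1\n"      # a mangled "fix"
assert safe_apply(good, broken) == good
assert safe_apply(good, "def f():\n    return 2\n").endswith("return 2\n")
```

The same gate generalizes: swap the `compile()` call for a type-checker or test-suite run to shrink the blast radius further.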

notsylver 13 hours ago||
I feel like Cursor's solution is still the best answer: let the model suggest edits in whatever format it prefers, using as few "extra" tokens as possible, and have a small model figure out how to apply them. I don't use Cursor anymore, but when I did it was impressive how consistently it worked; I think it failed only once. A 70B model might be overkill for that, though...
mromanuk 13 hours ago|
Someone should try prompting the same LLM that's in use to suggest the edit, as a subagent.
nekitamo 11 hours ago||
Getting banned from Gemini while attempting to improve Gemini is the most Googley thing ever :D Imagine letting your automated "trust and safety" systems run amok so that they ban the top 0.01% of your users with no recourse. Google really knows how to score an own goal.
sgc 11 hours ago|
I really don't understand what in his usage pattern would have triggered that obviously automated ban. Can somebody tell me what might be considered adversarial enough to look like 'hacking' or similar to a bot?
visarga 11 hours ago||
Yeah, I invented a similar method for information-extraction attribution around 2022: I would place custom markers in a document so the extraction model could reference them together with the answer; each marker was unique within the document, so the cited span could be located exactly.
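A rough sketch of that marker idea, with an assumed `[M<i>]` marker format (the comment doesn't describe the original scheme in detail):

```python
import re

# Tag each sentence with a unique marker; an extraction model can then
# cite the marker alongside its answer, and the span is locatable exactly.
def add_markers(sentences):
    return " ".join(f"[M{i}] {s}" for i, s in enumerate(sentences))

def locate(marked_doc, marker):
    """Return the text that follows `marker`, up to the next marker."""
    m = re.search(re.escape(f"[{marker}]") + r" ([^\[]+)", marked_doc)
    return m.group(1).strip() if m else None

doc = add_markers(["Alice was born in 1990.", "She lives in Oslo."])
# A model answering "Where does she live?" might cite M1:
assert locate(doc, "M1") == "She lives in Oslo."
```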
0xbadcafebee 11 hours ago|
Putting it out there: if any frontier model provider starts allowing any agent to use their $20/month plan, we will all switch to you. We don't want to be forced into 1 harness, we want OAuth, and we want respectable limits without excessive budgets.
aniviacat 10 hours ago|
How would that differ from buying $20 worth of API credits each month?
0xbadcafebee 8 hours ago||
1) Security: OAuth is much more secure than a static API key; if your key gets stolen, a hacker can run up your bill.

2) AFAIK the $20/month plan allows more tokens per month than $20 of API credits would buy. My understanding is they assume most users will only use a fraction of that each month, and they rake in profit (like a gym membership).
