Posted by kachapopopow 16 hours ago

Improving 15 LLMs at Coding in One Afternoon. Only the Harness Changed (blog.can.ac)
592 points | 234 comments
nekitamo 13 hours ago|
Getting banned from Gemini while attempting to improve Gemini is the most Googley thing ever :D Imagine letting your automated "trust and safety" systems run amok so that they ban the top 0.01% of your users with no recourse. Google really knows how to score an own-goal.
sgc 13 hours ago|
I really don't understand what in his usage pattern would have triggered that obviously automated ban. Can somebody tell me what might be adversarial enough to be considered 'hacking' or similar by a bot?
visarga 13 hours ago||
Yeah, I invented a similar method for information-extraction attribution around 2022. I would place custom markers in a document so the extraction model could reference them together with the answer; each marker was unique within the document, which made it possible to locate the source of the answer.
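Roughly like this (an illustrative sketch; the [[mN]] format and paragraph-level granularity are my choices here, not necessarily what the original tool did):

```typescript
// Tag each paragraph with a unique marker before extraction, then map
// cited markers back to exact source spans afterwards.
function addMarkers(doc: string): { marked: string; spans: Map<string, string> } {
  const spans = new Map<string, string>();
  const marked = doc
    .split(/\n\n+/) // one marker per paragraph
    .map((para, i) => {
      const id = `m${i}`;
      spans.set(id, para);
      return `[[${id}]] ${para}`;
    })
    .join("\n\n");
  return { marked, spans };
}

// The extraction model is asked to answer with the marker it drew from,
// e.g. { answer: "...", marker: "m17" }; spans.get("m17") then pins the
// answer to its exact source passage.
```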
0xbadcafebee 13 hours ago||
Putting it out there: if any frontier model provider starts allowing any agent to use your $20/month plan, we will all switch to you. We don't want to be forced into one harness; we want OAuth, and we want respectable limits without excessive budgets.
aniviacat 12 hours ago|
How would that differ from buying $20 worth of API credits each month?
0xbadcafebee 10 hours ago||
1) Security (OAuth is much more secure than a static API key; if your key gets stolen, a hacker can run up your bill)

2) AFAIK the $20/month plan allows use of more tokens per month than if you bought $20 of tokens. My understanding is it assumes most users will only use a fraction of that each month, so they rake in profit (like a gym membership)
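On point 1, the OAuth flow these subscription CLIs typically use is the standard device-authorization grant (RFC 8628). A minimal sketch, with hypothetical endpoint URLs and client ID (not any real provider's API); runs on Node 18+ with global fetch:

```typescript
const AUTH_BASE = "https://auth.provider.example"; // hypothetical
const CLIENT_ID = "my-coding-agent";               // hypothetical

async function deviceLogin(): Promise<string> {
  // Step 1: request a device code and show the user a verification URL.
  const dc = await fetch(`${AUTH_BASE}/device/code`, {
    method: "POST",
    body: new URLSearchParams({ client_id: CLIENT_ID }),
  }).then((r) => r.json());
  console.log(`Visit ${dc.verification_uri} and enter code ${dc.user_code}`);

  // Step 2: poll until the user approves in the browser. The agent ends
  // up holding a short-lived, revocable access token instead of a static
  // API key, so a stolen credential expires instead of running up a bill.
  while (true) {
    await new Promise((res) => setTimeout(res, dc.interval * 1000));
    const tok = await fetch(`${AUTH_BASE}/token`, {
      method: "POST",
      body: new URLSearchParams({
        grant_type: "urn:ietf:params:oauth:grant-type:device_code",
        device_code: dc.device_code,
        client_id: CLIENT_ID,
      }),
    }).then((r) => r.json());
    if (tok.access_token) return tok.access_token;
  }
}
```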

znnajdla 15 hours ago||
Yep, this has been my experience with browser agents as well. One little change in the harness/agentic loop and the model suddenly becomes a whole lot smarter at navigating the web. I was also able to build a better browser agent than 'claude --chrome' in just a few afternoons, just by tweaking the harness.
energy123 16 hours ago||
I feel the baseline comparison should be relative to the intuitive and simple "line-numbers only" schema.

It's less token-heavy than the proposed hash approach, and I don't think frontier LLMs hallucinate line numbers if each line in the context is prefixed with one.
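To make the comparison concrete, a line-numbers-only schema might look like this (the format is a sketch, not the article's):

```typescript
// The file is shown to the model with each line prefixed by its number,
// and the model emits edits that reference those numbers.
function numberLines(src: string): string {
  return src
    .split("\n")
    .map((line, i) => `${i + 1}| ${line}`)
    .join("\n");
}

interface LineEdit {
  startLine: number;     // 1-indexed, inclusive
  endLine: number;       // inclusive
  replacement: string[]; // new lines for the range
}

function applyLineEdit(src: string, edit: LineEdit): string {
  const lines = src.split("\n");
  lines.splice(edit.startLine - 1, edit.endLine - edit.startLine + 1, ...edit.replacement);
  return lines.join("\n");
}
```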

withinboredom 16 hours ago||
The issue is when the file changes between when the LLM read it and when it writes to it. Line numbers alone will clobber the file if that happens; the hashes prevent that from being an issue.
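For concreteness, here is a minimal sketch of how per-line hashes guard against that race (my reading of the general idea, not necessarily the article's exact scheme):

```typescript
import { createHash } from "node:crypto";

// Every line gets a short content hash; an edit is rejected when the
// hash it references no longer matches the file on disk, i.e. the file
// changed since the model read it.
function lineHash(line: string): string {
  return createHash("sha256").update(line).digest("hex").slice(0, 8);
}

interface HashedEdit {
  line: number;         // 1-indexed
  expectedHash: string; // hash of the line as the model last saw it
  replacement: string;
}

function applyHashedEdit(src: string, edit: HashedEdit): string {
  const lines = src.split("\n");
  const current = lines[edit.line - 1];
  if (current === undefined || lineHash(current) !== edit.expectedHash) {
    throw new Error("stale edit: the file changed since the model read it");
  }
  lines[edit.line - 1] = edit.replacement;
  return lines.join("\n");
}
```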
energy123 16 hours ago||
Point taken.
kachapopopow 15 hours ago||
it starts writing to the wrong part of the file after multiple edits.
0xdeafbeef 13 hours ago||
Filed an issue for codex

https://github.com/openai/codex/issues/11601

jwpapi 14 hours ago||
Great article. Tbh I thought it would've already been implemented that way; hashing makes sense, mainly to save context. I don't expect them to care about token usage.

How about Kimi, though? How can I play with it?

babkayaga 12 hours ago||
Still weird to me that most people don't just give the LLM access to an editor, and instead force it to write shell scripts to edit files. Shrug.
HarHarVeryFunny 11 hours ago||
That's not quite how it works, and anyway, if the model can't generate an accurate find/replace string, why would you expect it to do any better at generating accurate commands to drive your editor (assuming it knew how to do that in the first place)?!

The way edits happen is that the agent (local) first tells the model (typically remote) that it has an edit tool, e.g. one taking a file name, a find string, and a replace string as parameters. If the model decides it wants to edit a file, it invokes this tool, which just means a blob of JSON specifying the edit (filename, etc.) gets put in the model's response. The agent then receives the response, intercepts the JSON blob, sees that it is an edit request, and does what is asked (there's a sketch of this round trip at the end of this comment).

The problem the article is describing is that the edit request (tool invocation) generated by the model isn't always 100% accurate. Even if the agent told the model it had a tool that invoked an actual editor, say sed, and assuming the model knew how to use sed, the edit would still fail whenever the request can't be interpreted literally by the editor (due to being inaccurate).
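A minimal sketch of that round trip; the tool name, JSON shape, and file contents are illustrative, not any particular agent's real schema:

```typescript
import { readFileSync, writeFileSync } from "node:fs";

// 1) The agent advertises a tool schema like this to the model up front.
const editTool = {
  name: "edit_file",
  description: "Replace an exact string in a file",
  parameters: {
    type: "object",
    properties: {
      path: { type: "string" },
      find: { type: "string" },    // must match the file byte-for-byte
      replace: { type: "string" },
    },
    required: ["path", "find", "replace"],
  },
};

// 2) The model "calls" the tool by emitting JSON like this in its response.
const call = { path: "src/app.ts", find: "let x = 1;", replace: "const x = 1;" };

// 3) The agent intercepts that JSON and performs the edit locally.
function applyEdit(c: { path: string; find: string; replace: string }): void {
  const src = readFileSync(c.path, "utf8");
  if (!src.includes(c.find)) {
    // The failure mode under discussion: the model's find string
    // doesn't literally match the file contents.
    throw new Error("edit failed: find string not present in file");
  }
  writeFileSync(c.path, src.replace(c.find, c.replace));
}
```

The throw branch is exactly what the article is measuring: how often the model's find string fails to match the file byte-for-byte.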

cyanydeez 7 hours ago||
Seems like it's veering toward a per-model protocol, similar to the expectation that these models will develop their own languages to speak among themselves as agents.

The trouble is, because it's all indeterminate slop, every model will break in small ways, so you're back to indeterminacy and building a harness on top of the harness.

Still, <nerd snipe>, there's probably a way to get the local model and an arbitrary remote model to agree on how to make a method call. But that will only be fruitful if you find a highly reproducible set of tuples within the models' shared space.

znnajdla 12 hours ago|||
How do you give it access to an editor? It doesn't have a keyboard and mouse.
HarHarVeryFunny 10 hours ago|||
Well, it could be a batch editor, such as Linux's sed, invoked from the command line; or, with "computer use", the model could indeed potentially drive a real interactive editor.

Part of the problem, though, is that tools like Claude Code don't want to assume too much about the environment: that a specific editor is available, or even that it is running on a particular OS. The way it stays platform-agnostic and avoids reliance on specific tools is by depending only on Node.js, which provides file read/write support. So to implement an edit request, the agent uses Node.js to read the file, applies the edit itself, then uses Node.js again to write out the updated file.
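The two routes side by side, roughly (a sketch, not Claude Code's actual implementation; the sed call assumes GNU sed on a Unix-like system, and BSD/macOS sed wants `-i ''`):

```typescript
import { execFileSync } from "node:child_process";
import { readFileSync, writeFileSync } from "node:fs";

// Batch-editor route: shell out to sed. Also assumes find/replace
// contain no unescaped regex or delimiter characters.
function editWithSed(path: string, find: string, replace: string): void {
  execFileSync("sed", ["-i", `s/${find}/${replace}/`, path]);
}

// Portable route: read, edit in-process, write back. Needs nothing
// beyond Node itself, which is the point made above.
function editWithNode(path: string, find: string, replace: string): void {
  writeFileSync(path, readFileSync(path, "utf8").replace(find, replace));
}
```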

visarga 12 hours ago|||
I built a structural zoom tool: it fits flat or tree-like content into a 10K-char budget. It can compress HTML, JSON, folders, zip files, logs, chat sessions, basically large files or collections of files. Moving around is done by range selection. The idea is to have the agent find its way iteratively to the target while keeping the structure exposed. RAG would cut everything to pieces and put them in a hat; my approach follows the structure of large content through a series of glimpses. Unfortunately I'm not sure myself whether it's better to use this tool vs one-off bash and Python scripts.
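The core idea looks roughly like this (a simplified sketch; the node shape, names, and even-split budget heuristic are illustrative, not the actual tool):

```typescript
// Render a tree into a fixed character budget by splitting the remaining
// budget across children and eliding whatever doesn't fit, so the
// overall shape stays visible at every zoom level.
interface TreeNode {
  label: string;
  children?: TreeNode[];
}

function zoom(node: TreeNode, budget: number, depth = 0): string {
  const line = "  ".repeat(depth) + node.label + "\n";
  const kids = node.children ?? [];
  const remaining = budget - line.length;
  if (kids.length === 0 || remaining <= 0) return line;

  const perChild = Math.floor(remaining / kids.length); // even split
  let out = line;
  for (const kid of kids) {
    const rendered = zoom(kid, perChild, depth + 1);
    if (rendered.length > perChild) {
      out += "  ".repeat(depth + 1) + "...\n"; // elide what doesn't fit
      break;
    }
    out += rendered;
  }
  return out;
}

// The agent calls zoom(root, 10_000), picks an interesting subtree from
// the output, and zooms into it with the full budget: iterative glimpses.
```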
the_harpia_io 10 hours ago||
Honestly, the harness is way more important than people realize. I've been working on code-security tools, and the gap between what a model generates raw vs. with better structure is massive, way bigger than the differences between model versions. Like, half the security bugs I see in AI code exist just because the prompt didn't include enough context or the edit format was wonky.

The benchmark overselling isn't the point, though; it's that we're barely using these things right. Most people still chat with them like it's 2023. What happens when you combine this with actual review flows, not just "beat SWE-bench"?

Idk, I think everyone's too focused on the model when tooling matters more, since tooling is something you can actually control.

MetaWhirledPeas 14 hours ago|
> Treating harnesses as solved, or even inconsequential, is very short-sighted

Is it possible that burning extra tokens is the point, since they get paid more?

vlovich123 14 hours ago||
Given the fierce competition, I would imagine a better-performing model generates more revenue than burning extra tokens does.
dack 14 hours ago|||
They have pretty fierce competition though, so I doubt this is intentional. My guess is they just have a million things to do and this isn't at the top of the list.
naasking 13 hours ago||
That doesn't make sense with subscriptions.