Posted by kachapopopow 18 hours ago

Improving 15 LLMs at Coding in One Afternoon. Only the Harness Changed(blog.can.ac)
619 points | 237 comments
the_harpia_io 12 hours ago|
honestly the harness thing is way more important than people realize - I've been working on code security tools and the gap between what a model generates raw vs with better structure is massive, way bigger than model versions mattering. like the security bugs I see in AI code, half of them are just because the prompt didn't include enough context or the edit format was wonky

the benchmark overselling isn't the point though - it's that we're barely using these things right. most people still chat with them like it's 2023. what happens when you combine this with actual review flows not just 'beat swe-bench'

idk I think everyone's too focused on the model when tooling matters more, since that's something you can actually control

MetaWhirledPeas 16 hours ago||
> Treating harnesses as solved, or even inconsequential, is very short-sighted

Is it possible that burning extra tokens is the point, since they get paid more?

vlovich123 16 hours ago||
Given the fierce competition, I would imagine a better performing model generates more revenue than burning extra tokens
dack 16 hours ago|||
they have pretty fierce competition though, so i doubt this is intentional. my guess is they just have a million things to do and that isn't at the top of the list
naasking 15 hours ago||
That doesn't make sense with subscriptions.
jwpapi 16 hours ago||
Arguably, the last year was mainly inner-harness improvement rather than model improvement, but I could be wrong; it just feels that way to me.
SatvikBeri 14 hours ago|
We can measure this by looking at the same harness applied to different models, e.g. the very plain Terminus: https://www.tbench.ai/leaderboard/terminal-bench/2.0?agents=...

Models have improved dramatically even with the same harness

a11r 17 hours ago||
This is very nicely done. We have seen the same issue at a higher level of getting separators right when generating multiple files in a single inference call.
aghilmort 13 hours ago|
curious: wdym by "getting separators right when generating multiple files in a single inference call"

context: we created hypertokens, an even more robust hashing mechanism for context-addressable memory (CAM); one cheat code is making them prefix-free, and there are lots of others that get deep into why models work the way they do, etc.

evolly 16 hours ago||
My experience exactly! I’ve recently become so tired of the Claude harness that I switched to OpenCode (which is extremely good compared to Claude). However, OpenCode is also tedious to change, and it inherits all the “good stuff,” like treating agents as Markdown files and all the dancing around with hooks/plugins/skills scattered all over the place. Getting stuck again and again, I’ve ultimately come to the conclusion that this must be solved by writing my own damn coding agent, with extensibility that’s acceptable for real-world engineering.
HumanOstrich 16 hours ago||
Give Pi[1] a try. Comes pretty barebones out of the box, yet still provides a decent default experience. Extension points are all TypeScript if you want. There are a lot of examples[2] and some 3rd party extensions[3].

I'll point out that if you want permission prompts for certain behavior, you have to add that yourself. There's at least one example.

Edit: Just noticed the article's author is using a fork of Pi.

[1]: https://shittycodingagent.ai/

[2]: https://github.com/badlogic/pi-mono/tree/main/packages/codin...

[3]: https://github.com/nicobailon

wyre 16 hours ago||
Before you build your own, try pi. It is what you are looking for.

[0] https://shittycodingagent.ai/

scotty79 16 hours ago||
Harnesses are where open source should shine. They don't require millions of dollars of compute, but the search space is vast and explorable on limited budgets.
andai 12 hours ago||
> Why bother, you ask? Opus may be a great model, but Claude Code to this day leaks raw JSONL from sub-agent outputs, wasting hundreds of thousands of tokens. I get to say, “fuck it, subagents output structured data now”.

The VC economics are creating a reality distortion field where Anthropic is incentivized to burn more tokens so they can rent more GPUs so they can get more investment, and where I am incentivized to pipe the LLM inputs into `claude -p` and blast 50KB of useless proompt onto it so they don't ban me from their 95% discounted API endpoint.

avereveard 17 hours ago||
I use small models, and I like to give them a TOC rather than lines. I wonder how this would stack up against the hashline approach.

read_toc tool:

  ...
  {
    "name": "mcp",
    "qualified_name": "mcp",
    "type": "constant",
    "docstring": null,
    "content_point": "src\\mcps\\code_help\\server.py::17::18::python::mcp",
    "is_nested": false
  },
  {
    "name": "handler",
    "qualified_name": "handler",
    "type": "constant",
    "docstring": null,
    "content_point": "src\\mcps\\code_help\\server.py::18::19::python::handler",
    "is_nested": false
  },
  ...

update_content tool:

  {
    "content": "...",
    "content_point": "src\\mcps\\code_help\\server.py::18::19::python::handler",
    "project_root": ....
  }
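A minimal sketch (my own illustration, not the actual tool code) of how the "content_point" strings above could be parsed, assuming the format is path::start_line::end_line::language::symbol_name:

```python
def parse_content_point(point: str) -> dict:
    """Split a content_point string into its assumed fields."""
    path, start, end, language, name = point.split("::")
    return {
        "path": path,
        "start_line": int(start),
        "end_line": int(end),
        "language": language,
        "name": name,
    }

cp = parse_content_point(
    r"src\mcps\code_help\server.py::17::18::python::mcp"
)
print(cp["name"], cp["start_line"], cp["end_line"])  # mcp 17 18
```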
falkenstein 15 hours ago||
really enjoyed reading this, although I'm a dumb farmer and it took me a while to understand lol
azinman2 15 hours ago|
Why not just use line numbers?
giancarlostoro 15 hours ago||
I was wondering the same thing.
renewiltord 15 hours ago||
Forces you to re-read after every write. E.g., you edit line 15 into two lines; now you need arithmetic for every later line reference, or you have to re-read the full file to reindex by line number.
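A toy illustration of that reindexing problem (the tag names here are hypothetical, not the article's actual scheme): after an edit turns one line into two, every later line number shifts, while a stable per-line tag does not.

```python
lines = ["a = 1", "b = 2", "c = 3"]
# Hypothetical stable tags assigned once per line.
tagged = {f"h{i:02x}": line for i, line in enumerate(lines)}

# Edit by line number: line index 1 ("b = 2") becomes two lines...
lines[1:2] = ["b = 2", "b += 1"]
# ...and "c = 3" silently moves from index 2 to index 3.
print(lines.index("c = 3"))  # 3, not 2

# Edit by tag: "h02" still points at "c = 3" with no arithmetic.
print(tagged["h02"])  # c = 3
```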
azinman2 15 hours ago||
Good point!

I just wonder how unique these hashes will be if only 2 characters. It seems like the collision rate would be really high.

aghilmort 13 hours ago|||
we dug into those sorts of questions with hypertokens, a robust hash for lines, code, table rows, or any in-context token tagging, to give models photographic memory

one mechanism we establish is that each model has a fidelity window, i.e., r tokens of content per s tag tokens; each tag token adds extra GUID-like marker capacity via its embedding vector; since 1-, 2-, and 3-digit numbers are each a single token in top models, a single hash token lacks enough capacity and separation in latent space

we also show the hash should be properly prefix-free, with unique symbols per digit: e.g., if hashing with A-K for the first digit and L-Z for the second, then A,R is a legal hash whereas M,C is not a permitted hash

we can do all this and more rather precisely, as we show in our arXiv paper on same; the next update goes deeper into group theory, info theory, etc. on boosting model recall, reasoning, tool calls, etc. by way of robust hashing
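A hedged sketch of the positional-alphabet idea in the comment above (my reading of it, not the paper's actual construction): each tag digit draws from its own disjoint symbol set, so every position is unambiguous on its own and no tag can be a prefix of another.

```python
import itertools
import string

POS1 = string.ascii_uppercase[:11]   # A-K: legal symbols for the first digit
POS2 = string.ascii_uppercase[11:]   # L-Z: legal symbols for the second digit

def make_tags(n: int) -> list[str]:
    """Generate the first n two-symbol tags in order: 'AL', 'AM', ..."""
    pairs = itertools.product(POS1, POS2)
    return ["".join(p) for p in itertools.islice(pairs, n)]

def is_valid(tag: str) -> bool:
    """A tag is legal only if each digit comes from its own alphabet."""
    return len(tag) == 2 and tag[0] in POS1 and tag[1] in POS2

print(is_valid("AR"))  # True: A is from A-K, R is from L-Z
print(is_valid("MC"))  # False: M is a second-position symbol
```

With 11 first-digit and 15 second-digit symbols this gives 165 distinct two-symbol tags, and a model can tell from any single symbol which position it occupies.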

pbowyer 10 hours ago||
For others, here's the paper: https://arxiv.org/abs/2507.00002
MrGreenTea 12 hours ago|||
The author writes that these hashes are 2 or 3 characters long, depending on the line count, I assume. That's good for almost 48k lines; past that you have other issues.
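A back-of-envelope check of the "almost 48k lines" figure, assuming the tags are sequentially assigned identifiers over a 36-symbol alphabet (a-z plus 0-9) rather than content hashes; assigned identifiers cannot collide at all.

```python
ALPHABET = 36             # a-z plus 0-9

two_char = ALPHABET ** 2  # 1,296 distinct two-character tags
three_char = ALPHABET ** 3  # 46,656 distinct three-character tags

print(two_char + three_char)  # 47952, i.e. "almost 48k"
```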
azinman2 12 hours ago||
But if it’s a hash rather than a line number, then collisions are much easier.

There may be many lines that are duplicates, e.g. “{“

More comments...