Posted by kachapopopow 20 hours ago

Improving 15 LLMs at Coding in One Afternoon. Only the Harness Changed (blog.can.ac)
650 points | 248 comments
scotty79 18 hours ago|
The harness is where open source should shine. It doesn't require millions of dollars of compute, yet the search space is vast and explorable on a limited budget.
andai 14 hours ago||
> Why bother, you ask? Opus may be a great model, but Claude Code to this day leaks raw JSONL from sub-agent outputs, wasting hundreds of thousands of tokens. I get to say, “fuck it, subagents output structured data now”.

The VC economics are creating a reality distortion field where Anthropic is incentivized to burn more tokens so they can rent more GPUs so they can get more investment, and where I am incentivized to pipe the LLM inputs into `claude -p` and blast 50KB of useless proompt onto it so they don't ban me from their 95% discounted API endpoint.

avereveard 19 hours ago||
I use small models, and I like to give them a TOC rather than raw lines. I wonder how it'd stack up against the hashline approach.

read_toc tool:

  ...
  {
    "name": "mcp",
    "qualified_name": "mcp",
    "type": "constant",
    "docstring": null,
    "content_point": "src\\mcps\\code_help\\server.py::17::18::python::mcp",
    "is_nested": false
  },
  {
    "name": "handler",
    "qualified_name": "handler",
    "type": "constant",
    "docstring": null,
    "content_point": "src\\mcps\\code_help\\server.py::18::19::python::handler",
    "is_nested": false
  },
  ...

update_content tool:

  {
    "content": "...",
    "content_point": "src\\mcps\\code_help\\server.py::18::19::python::handler",
    "project_root": ...
  }
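
For reference, a rough sketch of unpacking a content_point, assuming the fields are path::start::end::language::name (that order is inferred from the examples above; the parser itself is illustrative, not the actual tool code):

  # Illustrative only: assumes content_point fields are
  # path::start_line::end_line::language::symbol_name.
  from dataclasses import dataclass

  @dataclass
  class ContentPoint:
      path: str
      start: int      # first line of the symbol (assumed)
      end: int        # line just past the symbol (assumed)
      language: str
      name: str

  def parse_content_point(raw: str) -> ContentPoint:
      # rsplit from the right so the Windows path keeps its backslashes;
      # "::" only separates fields in the examples shown.
      path, start, end, language, name = raw.rsplit("::", 4)
      return ContentPoint(path, int(start), int(end), language, name)

  cp = parse_content_point("src\\mcps\\code_help\\server.py::18::19::python::handler")
  assert (cp.name, cp.start) == ("handler", 18)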
falkenstein 17 hours ago||
really enjoyed reading this, although I'm a dumb farmer and it took me a while to understand lol
azinman2 17 hours ago||
Why not just use line numbers?
giancarlostoro 17 hours ago||
I was wondering the same thing.
renewiltord 17 hours ago||
Line numbers force you to re-read after every write. E.g. you edit line 15 into two lines; now you need arithmetic for everything later in the file, or you have to re-read the full file to reindex by line number.
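
To make that concrete, a toy sketch (not the article's actual scheme; sequential two-character IDs stand in for whatever the real hash is):

  # Toy illustration: edits keyed by stable per-line IDs survive a
  # one-line-to-two-lines edit without renumbering anything else.
  import itertools, string

  def tag_lines(text: str) -> dict[str, str]:
      # Hand out short stable IDs (sequential here; a real harness
      # might hash content or position instead).
      ids = ("".join(p) for p in itertools.product(string.ascii_lowercase, repeat=2))
      return {next(ids): line for line in text.splitlines()}

  doc = tag_lines("a = 1\nb = 2\nc = 3")   # {'aa': 'a = 1', 'ab': 'b = 2', 'ac': 'c = 3'}
  doc["ab"] = "b = 2\nb2 = 2"              # line 'ab' becomes two lines
  assert doc["ac"] == "c = 3"              # 'ac' still points where it did

  # With plain line numbers, the same edit would shift every later
  # line, forcing a re-read or index arithmetic before the next edit.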
azinman2 17 hours ago||
Good point!

I just wonder how unique these hashes can be at only 2 characters. It seems like the collision rate would be really high.

aghilmort 15 hours ago|||
We dug into those sorts of questions with hypertokens: a robust hash for lines, code, tables/rows, or any in-context token tagging, meant to give models photographic memory.

One mechanism we establish is that each model has a fidelity window, i.e., r tokens of content per s tag tokens; each tag token adds extra GUID-like marker capacity via its embedding vector. Since 1-, 2-, and 3-digit numbers are each only one token in top models, a single hash token lacks enough capacity and separation in latent space.

We also show the hash should be properly prefix-free, with unique symbols per digit: e.g., if hashing with A-K for the first digit and L-Z for the second, then "A,R" is a legal hash whereas "M,C" is not.

We can do all this and more rather precisely, as we show in our arXiv paper on the same; the next update goes deeper into group theory, info theory, etc., on boosting model recall, reasoning, and tool calls by way of robust hashing.
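
If I'm reading the prefix-free part right, the per-digit alphabet idea is easy to check mechanically (the A-K / L-Z split comes from the comment above; the rest is a toy):

  # Per-position alphabets: digit 1 drawn from A-K, digit 2 from L-Z,
  # so every symbol reveals its own position and concatenated hashes
  # parse unambiguously.
  ALPHABETS = ("ABCDEFGHIJK", "LMNOPQRSTUVWXYZ")

  def is_legal(h: str) -> bool:
      return len(h) == len(ALPHABETS) and all(
          ch in alpha for ch, alpha in zip(h, ALPHABETS)
      )

  assert is_legal("AR")        # A opens, R closes: legal
  assert not is_legal("MC")    # M can't open and C can't close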

pbowyer 12 hours ago||
For others, here's the paper: https://arxiv.org/abs/2507.00002
MrGreenTea 14 hours ago|||
The author writes that these hashes are 2 or 3 characters long, which I assume depends on the line count. That's good for almost 48k lines; past that you have other issues anyway.
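
The arithmetic works out if the alphabet is alphanumeric, i.e. base 36, which is my assumption, not something the post states:

  # Assuming a 36-symbol alphabet (a-z plus 0-9); the post doesn't say.
  print(36 ** 2)  # 1,296 lines addressable with 2 characters
  print(36 ** 3)  # 46,656 with 3 characters -- the "almost 48k"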
azinman2 14 hours ago||
But if it's a hash rather than a line number, then it can collide much more easily.

There may be many duplicate lines, e.g. "{".
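
One common way around duplicates, though not necessarily what the article does, is to salt the content hash with the line's position:

  # Pure content hashes would give every "{" the same ID; salting with
  # the line index keeps duplicates distinct. (Just one possible fix.)
  import hashlib

  def line_id(text: str, index: int, length: int = 2) -> str:
      return hashlib.sha1(f"{index}:{text}".encode()).hexdigest()[:length]

  for i, text in enumerate(["if x {", "{", "{"]):
      print(line_id(text, i), repr(text))

  # The duplicate lines now (almost certainly) hash differently, though
  # at 2 hex chars any two lines still collide with probability 1/256.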

deaux 19 hours ago||
Great article, recommend reading all of it.

> Why bother, you ask? Opus may be a great model, but Claude Code to this day leaks raw JSONL from sub-agent outputs, wasting hundreds of thousands of tokens. I get to say, “fuck it, subagents output structured data now”.

This is why I find the ban on using Claude subscriptions in other harnesses so heinous. The harness they're forcing onto everyone has tons of big issues, including wasting massive numbers of tokens. Very much in line with intentionally refusing to adhere to standards in the most IE6 way possible.

techpression 19 hours ago|
I mean, they want to make money, right? CC is a cool tool, but obviously they want you on the API eventually if you're even remotely a power user; $200/month for all-you-can-eat tokens (well, until some arbitrary daily limit kicks in) just doesn't make sense compared to API prices. In other words, CC should be seen as a software subscription.
deaux 19 hours ago||
The token limit is the same whether used in CC or in other harnesses.
techpression 17 hours ago||
Sure, but then Anthropic loses the ability to upsell, show ads, collect telemetry, brag about user numbers and session lengths, etc. Not necessarily what's in there today, but what could be in there tomorrow. On the purely technical side, they also get much finer control over backoff tuning.
kacper-vstorm 11 hours ago||
Great post!
__mharrison__ 19 hours ago||
Is there a skill file I can use for these edits?
badhorseman 17 hours ago||
I feel a lot of confusion about which coding harness is best and which options to use. Tbh I have mostly used standard aider, and I don't know what the consensus on that tool is.

I feel like I want to write my own, and that in the future a lot of developers may run highly customized harnesses, since each user of these models wants to use them in a way that's unique to their brain. Much like how emacs is great for its customization, but one person's emacs config is often not what another wants; they take a subset, if anything, and then write their own features.

As an aside, what's the feeling on all the various AI coding tools? Does aider suck? Are aider-ce/cecli better? Or are the bespoke per-model tools like Claude Code better?
