
Posted by kachapopopow 12 hours ago

Improving 15 LLMs at Coding in One Afternoon. Only the Harness Changed(blog.can.ac)
539 points | 218 comments
mehdibl 4 hours ago|
You can improve the success rate a lot by providing HELM and clear instructions in the tool description.

Over a year ago I had a lot of issues, and the description and example were the difference between a 30-50% failure rate and 1%!

So I'm surprised a bit by the point. Maybe I'm missing it.
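A hedged sketch of what the comment describes: a tool description that spells out instructions and includes a worked example, in the style of a function-calling schema. The tool name, wording, and parameters here are illustrative, not from the article.

```python
# Hypothetical tool schema: explicit instructions plus an inline example,
# which the comment claims cuts failure rates dramatically.
edit_tool = {
    "name": "replace_lines",
    "description": (
        "Replace a contiguous range of lines in a file.\n"
        "Instructions: line numbers are 1-indexed and inclusive. "
        "Always re-read the file after an edit; earlier numbers go stale.\n"
        "Example: to change line 3 of foo.py to 'x = 2', call with "
        '{"path": "foo.py", "start": 3, "end": 3, "new_text": "x = 2"}'
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "path": {"type": "string"},
            "start": {"type": "integer"},
            "end": {"type": "integer"},
            "new_text": {"type": "string"},
        },
        "required": ["path", "start", "end", "new_text"],
    },
}
```

The instructions and the example live in the description string itself, since that is the text the model actually conditions on.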

fcanesin 10 hours ago||
The harness is the model's "body"; the weights are the cognition. Like in nature, they develop together, and the iteration of natural selection works on both.

If smaller labs (Zai, Moonshot, DeepSeek, Mistral...) got together and embraced a harness, like opencode for example, as a consortium, then just by the power of "evolution across different environments" they might hit the jackpot earlier than the bigger labs.

TZubiri 10 hours ago|
But they rely on distilling the output of American leader models, which will probably train against their own harnesses.

Someone has to do the baseline training, development, and innovation. It can't be clones all the way down.

robotresearcher 9 hours ago|||
Why not? Humans are (very nearly) clones all the way down.
lillecarl 9 hours ago|||
Citation needed. SOTA labs surely have technical protections and legalese against using their outputs for training. It's been done in the past, but what indicates this is still the case?
cyanydeez 4 hours ago||
That didn't stop the millions of copyrighted works used to train the models.
parhamn 10 hours ago||
On first principles it would seem that the "harness" is a myth. Surely a model like Opus 4.6/Codex 5.3, which can reason about complex functions and data flows across many files, wouldn't trip up over the top-level function signatures it needs to call?

I see a lot of evidence to the contrary though. Anyone know what the underlying issue here is?

znnajdla 8 hours ago||
How hard is it for you to assemble a piece of IKEA furniture without an Allen wrench, a screwdriver, and clear instructions, versus with all three?
0x457 7 hours ago|||
Well, I assembled an Alex once last year without instructions, using an impact driver and a hammer. The hardest part was making the tools fit.
parhamn 8 hours ago|||
It seems you didn't read the article (or the analogy is a bad one). The differences are much more subtle than having a screwdriver or not.
znnajdla 8 hours ago||
I did read the article, quite enthusiastically, and my practical experience confirms the same. Sure, the difference is more subtle. But my point was that an "agent", whether human or AI, can be a lot more productive with better tools. This guy found a better screwdriver than the most commonly used one. That's amazing, and nothing from "first principles" denies that a better tool harness means better/faster/more correct AI agents.
3371 9 hours ago|||
If you agree that current LLMs (Transformers) are naturally very susceptible to context/prompt, then you can ask an agent for a "raw harness dump" "because I need to understand how to better present my skills and tools in the harness", and you may see how the harness impacts model behavior.
robotresearcher 9 hours ago|||
Humans have a demonstrated ability to program computers by flipping switches on the front panel.

Like a good programming language, a good harness offers a better affordance for getting stuff done.

Even if we put correctness aside, tooling that saves time and tokens is going to be very valuable.

manbash 10 hours ago|||
The models' generalized "understanding" and "reasoning" are the real myth, and that's what makes us take a step back and offload the process to deterministic computing and harnesses.
madeofpalk 9 hours ago||
Isn't 'the harness' essentially just prompting?

It's completely understandable that prompting in better/more efficient means would produce different results.

furyofantares 9 hours ago||
No, it's also a suite of tools beyond what's available in bash, tailored to context management.
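As an illustration of "tools tailored to context management" (a hypothetical sketch, not the article's implementation): instead of dumping a whole file into the context window, a harness tool can return a bounded, line-numbered excerpt plus enough metadata for the model to request more.

```python
# Hypothetical context-managing read tool: returns a bounded excerpt with
# line numbers and a header, rather than the entire file.
def read_excerpt(text: str, start: int = 1, max_lines: int = 50) -> str:
    lines = text.splitlines()
    chunk = lines[start - 1 : start - 1 + max_lines]
    body = "\n".join(f"{start + i}: {ln}" for i, ln in enumerate(chunk))
    header = f"[{len(lines)} lines total, showing {start}-{start + len(chunk) - 1}]"
    return header + "\n" + body
```

This is the kind of affordance plain `cat` in bash doesn't give you: the model sees how much it hasn't seen, and can page through a large file without flooding its context.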
tgtweak 6 hours ago||
When you're in the business of selling tokens, you look at technology that reduces them as a threat. If they were selling services that USE tokens, then reducing them would be welcome... so they'll likely steal this and incorporate it into their proprietary CLIs like Claude Code...
MarsIronPI 5 hours ago|
Huh? Anthropic doesn't sell Claude Code, they sell tokens. Why would they make Claude Code more token-efficient?
christophilus 7 hours ago||
Has any harness matched the effectiveness of Claude Code yet? I haven't experimented much recently, but every time I have in the past, I couldn't get any other tool to approach how effective I am in CC.

I'd love to use a different harness, ideally an OSS one, and hook it up to whichever LLM provides the best bang for the buck rather than being tied to Claude.

XCSme 4 hours ago||
Google banning you for benchmarking is crazy; are you sure that's the cause? How would they even know you are benchmarking?
aszen 10 hours ago||
So the new implementation always operates at the line level, replacing one or more lines. That's not ideal for some refactorings, like a rename, where search and replace is faster.

Edit:

Checking: the model has access to str_replace too, so this is just an additional edit tool.
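To make the rename point concrete, here is an illustrative comparison (not the article's implementation): a rename touches many lines, so a single search/replace payload is far smaller than re-sending every affected line in full for a line-level replace tool.

```python
# Hypothetical snippet being edited; the identifiers are made up.
src = "def old_name():\n    return old_name\n\nresult = old_name()\n"

# str_replace-style edit: one short instruction applies everywhere.
renamed = src.replace("old_name", "new_name")

# Line-level edit: the model would have to resend each changed line whole.
changed_lines = [ln for ln in src.splitlines() if "old_name" in ln]
```

Here three full lines would need to be resent for the line-level tool, versus a single old/new string pair for search-replace, and the gap grows with the number of call sites.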

benreesman 9 hours ago||
The logical end state of this line of reasoning is a collective action problem that dooms the frontier lab establishment. You can't devote model capacity to having an attention transformer match nested delimiters or cope with bash and be maximally capable; you can't mix authentication, authorization, control plane, and data plane into an ill-specified soup and be secure enough for anything that isn't a pilot or a toy.

If you run this out, you realize that the Worse is Better paradox has inverted, it's an arbitrage, and the race is on.

uriegas 7 hours ago||
I do agree with his identification of the problem: sometimes agents fail because of the tools around them and not because of the model's reasoning. However, for the failing tests I think he is not distinguishing between failures due to the harness and failures due to reasoning. It would be nice if someone analyzed that from the data set.
pcwelder 12 hours ago|
Great work, but concurrency is lost.

With search-replace you could work on separate parts of a file independently with the LLM. Not to mention, with each edit all the lines below are shifted, so you now need to provide the LLM with the whole content again.

Have you tested followup edits on the same files?

kachapopopow 11 hours ago||
(not the author) It works fine most of the time; I've been using it alongside an active agent and haven't run into too many noticeable problems. The token savings alone are worth it.
wrsh07 11 hours ago||
Serializing writes is probably fine, and the hashes should only change if you're updating the same line, right?

You probably don't want to use the line number though unless you need to disambiguate

But your write tool implementation can take care of that
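A hedged sketch of what this subthread describes, with hypothetical names: each edit targets a line number plus a short hash of that line's last-seen content, so a stale hash (the line changed or shifted since the model read the file) makes the write tool reject the edit instead of clobbering the wrong line.

```python
import hashlib

def line_hash(line: str) -> str:
    # Short content hash of a single line (8 hex chars is illustrative).
    return hashlib.sha256(line.encode()).hexdigest()[:8]

def apply_edit(lines: list[str], lineno: int, expected: str, new: str) -> list[str]:
    # Reject the edit if the targeted line no longer matches what the
    # model last saw; this catches shifted or concurrently modified lines.
    if line_hash(lines[lineno - 1]) != expected:
        raise ValueError("stale edit: line changed since last read")
    return lines[: lineno - 1] + [new] + lines[lineno:]

lines = ["a = 1", "b = 2"]
h = line_hash("b = 2")
print(apply_edit(lines, 2, h, "b = 3"))  # ['a = 1', 'b = 3']
```

With serialized writes, the line number disambiguates duplicate lines and the hash guards against staleness, which is roughly the division of labor the comment proposes.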
