Posted by kachapopopow 11 hours ago

Improving 15 LLMs at Coding in One Afternoon. Only the Harness Changed (blog.can.ac)
495 points | 209 comments | page 2
joshuamoyers 3 hours ago|
I think this is the right take. I'm usually aligned with most of what Anthropic is doing, but cutting off OAuth login from open harnesses was a bad move. My guess is there is some serious worry/overlap with the Cursors of the world - e.g. folks who will be competitors in the future, who are taking advantage of cheaper Opus rates/loss-leader pricing from them while simultaneously building a competitive model (Composer).

Also, nice clever optimization here. Lots of low hanging fruit in harness land.

socketcluster 2 hours ago||
Seeing all these 'coding' benchmarks reminds me that people still don't understand what coding means in practice. People still think one-phase puzzle-solving is coding. Real coding almost always has multiple phases which build on top of one another. There is an architectural component which is missed here - and the sheer number of phases/layers is actually where most of the complexity comes from.
cyanydeez 2 hours ago|
Usually what I need an LLM to do is find me an elegant algorithm for a problem I've encountered, where I know there's an elegant algorithm but I've got no idea what it's called or how to google search for it.
kachapopopow 9 hours ago||
My personal notes (I'm not the author): it has been way faster performance-wise, which is honestly a bigger improvement than the correctness gains. I've posted https://github.com/can1357/oh-my-pi before, but it didn't seem to gain traction. It's a great little agent.
mijoharas 8 hours ago||
I've just started messing around with pi, but haven't fully dug in yet. How would you compare oh-my-pi? I see it has a lot of other bells and whistles built in.

Are they portable bit by bit back to pi, or are there enough differences that they can't be? How about normal pi extensions, can they be used in omp?

Some of the stuff definitely looks interesting.

kachapopopow 8 hours ago||
The differences are documented but it is mostly 1:1. I've never used normal pi, but it's a night and day difference compared to opencode. Don't forget `omp setup python`.
scotth 8 hours ago||
I'm into it! This looks like an experimentation platform. OpenCode is beginning to feel like handcuffs. Let me hack!
rao-v 6 hours ago||
I’d really like to see this optimized for the 50-120B parameter open source models that are local viable (gpt-oss-120b, qwen3-80b-3a etc.).

For them I think it would be optimal to provide a tag per function and trust the LLM to rewrite the whole function. As the article notes, full reproduction is generally more reliable than edits for short code.

The token and attention overhead from a per-line hash, I suspect, limits this approach for smaller models.
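The tag-per-function idea could look something like this. A hypothetical sketch in Python, not any existing tool's format: each top-level function gets a stable tag, and the model rewrites a whole tagged function instead of issuing line-level edits.

```python
import re

def tag_functions(source: str):
    """Return (tagged_source, spans): each top-level `def` gets a tag
    like `# <fn:1>`, and spans maps tags to (start, end) line ranges in
    the original source, so "rewrite fn:1" replaces the whole function."""
    lines = source.split("\n")
    spans, out, fn_id, i = {}, [], 0, 0
    while i < len(lines):
        if re.match(r"def \w+", lines[i]):
            fn_id += 1
            start = i
            i += 1
            # the body is the run of indented or blank lines that follows
            while i < len(lines) and (lines[i].startswith((" ", "\t")) or lines[i] == ""):
                i += 1
            end = i
            # trim trailing blank lines off the span
            while end > start + 1 and lines[end - 1] == "":
                end -= 1
            tag = f"fn:{fn_id}"
            spans[tag] = (start, end)
            out.append(f"# <{tag}>")
            out.extend(lines[start:end])
            i = end
        else:
            out.append(lines[i])
            i += 1
    return "\n".join(out), spans

def apply_rewrite(source: str, spans, tag: str, new_fn: str) -> str:
    """Splice the model's full rewrite over the tagged function's span."""
    lines = source.split("\n")
    start, end = spans[tag]
    return "\n".join(lines[:start] + new_fn.split("\n") + lines[end:])
```

This is only one function tag per edit, so a small model never has to reproduce hashes or line numbers, just the tag and a complete replacement function.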

ianbutler 7 hours ago||
It’s funny to see where we are on model improvements.

Back when I was maintaining a coding harness around the time of Claude 3.5, we tried hash prefixes, we tried line-number prefixes, we tried a lot of different approaches to making the model better at selecting edit blocks, and ultimately, at least then, fuzzy string matching won out.
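The fuzzy-matching approach can be sketched roughly like this (an illustrative Python example using stdlib `difflib`, not the actual harness code): slide a window over the file and take the span most similar to the model's search block, tolerating small echo errors.

```python
import difflib

def locate_block(source: str, search: str, threshold: float = 0.8):
    """Find the line span in `source` best matching `search`, tolerating
    whitespace drift and small transcription errors from the model."""
    src_lines = source.split("\n")
    n = len(search.split("\n"))
    best_ratio, best_span = 0.0, None
    for start in range(len(src_lines) - n + 1):
        window = "\n".join(src_lines[start:start + n])
        ratio = difflib.SequenceMatcher(None, window, search).ratio()
        if ratio > best_ratio:
            best_ratio, best_span = ratio, (start, start + n)
    return best_span if best_ratio >= threshold else None

def fuzzy_replace(source: str, search: str, replace: str) -> str:
    """Apply a search/replace edit block using the fuzzy-located span."""
    span = locate_block(source, search)
    if span is None:
        raise ValueError("edit block did not match closely enough")
    lines = source.split("\n")
    return "\n".join(lines[:span[0]] + replace.split("\n") + lines[span[1]:])
```

The threshold is the knob: too strict and slightly mis-echoed blocks fail; too loose and the edit lands on the wrong span.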

jbellis 7 hours ago|
Yes, very similar results here (http://brokk.ai)

We got lines-with-anchors working fine as a replacement strategy, the problem was that when you don't make the model echo what it's replacing, it's literally dumber at writing the replacement; we lost more in test failures + retries than we gained in faster outputs.

Makes sense when you think about how powerful the "think before answering" principle is for LLMs, but it's still frustrating
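For context, the lines-with-anchors strategy being discussed can be sketched like this (a minimal hypothetical illustration, not Brokk's actual implementation): each line gets a short content hash, and the model addresses an edit by start/end anchor instead of echoing the text it replaces.

```python
import hashlib

def with_anchors(source: str) -> str:
    """Prefix every line with a short content hash so the model can
    address edits by anchor instead of echoing the original lines."""
    out = []
    for line in source.split("\n"):
        h = hashlib.sha1(line.encode()).hexdigest()[:4]
        out.append(f"{h}| {line}")
    return "\n".join(out)

def apply_anchored_edit(source: str, start_anchor: str,
                        end_anchor: str, replacement: str) -> str:
    """Replace the inclusive span between two line anchors."""
    lines = source.split("\n")
    anchors = [hashlib.sha1(l.encode()).hexdigest()[:4] for l in lines]
    i, j = anchors.index(start_anchor), anchors.index(end_anchor)
    return "\n".join(lines[:i] + replacement.split("\n") + lines[j + 1:])
```

The output savings are obvious; the comment above is the catch: skipping the echo of the replaced text removes a "think before answering" step, which can cost more in retries than the shorter output saves.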

mehdibl 2 hours ago||
You can improve the success rate a lot by providing HELM and clear instructions in the tool description.

Over a year ago I had a lot of issues, and the description and example were the difference between a 30-50% failure rate and 1%!

So I'm a bit surprised by that point. Maybe I'm missing it.

Bolwin 8 hours ago||
You forgot to mention your tool does worse for 8/16 LLMs compared to replace?

Problem is, replace has been around for so long, most LLMs are tuned for it now

indubioprorubik 3 hours ago||
My guess has always been that if you took the sources of the training data, meaning the authors of the "best" answers and solutions on Stack Overflow or GitHub, and reformatted the question to sound like it was written by those experts, the generated code would hug those sources of truth as it was created.

So the challenge is actually to find a map from "problem" to "author", then from "author" to "related code", and from there to a solution.

animan 10 hours ago||
What was the point of Claude code or Gemini banning the OP? Why would they care about how IDEs use the underlying API?
bri3d 9 hours ago||
When you buy a subscription plan, you’re buying use of the harness, not the underlying compute / tokens. Buying those on their own is way more expensive. This is probably because:

* Subscriptions are oversubscribed. They know how much an “average” Claude Code user actually consumes to perform common tasks and price accordingly. This is how almost all subscription products work.

* There is some speculation that there is cooperative optimization between the harness and backend (cache related etc).

* Subscriptions are subsidized to build market share; to some extent the harnesses are “loss leader” halo products which drive the sales of tokens, which are much more profitable.

sigmar 9 hours ago|||
He wasn't using the regular paid API (i.e. per-token pricing). He was using the endpoints for their subscription customers (i.e. paid per month and heavily subsidized).
infecto 10 hours ago|||
I assume he was using Gemini the same way as he was Claude when I make the following statement.

I don’t believe it’s exceptionally unique or new that companies will revoke access if you are using an unpublished API that the apps use. I don’t see anything wrong with it myself. If you want, pay for normal token use on the published APIs. There is no expectation that you can use an application's APIs, even as a paid user, when they are not explicitly published for use.

deaux 9 hours ago||
Indeed, that's why Anthropic, OpenAI and other LLM providers are known to adhere to published APIs to gather the world's data, obeying licensing and robots.txt.

It's truly disgusting.

skybrian 9 hours ago||
I was under the impression that they do obey robots.txt now? There are clearly a lot of dumb agents that don’t, but didn’t think it was the major AI labs.
deaux 9 hours ago||
After 3 years of pirating and scraping the entire world by doing the above, I guess they have everything that they now need or want.

So then it's better to start obeying robots.txt as a ladder pull, gaining a "nicely behaved" image advantage.

skybrian 9 hours ago||
Obeying robots.txt (now) is still better than not obeying it, regardless of what they did before.

The alternative is to say that bugs shouldn’t be fixed because it’s a ladder pull or something. But that’s crazy. What’s the point of complaining if not to get people to fix things?

DANmode 9 hours ago|||
Why does Google/Facebook et al arbitrarily enforce one human per account?

It’s because they want to study you.

They want the data!

logicallee 9 hours ago||
>What was the point of Claude code or Gemini banning the OP? Why would they care about how IDEs use the underlying API?

Underscores the importance of sovereign models you can run on the edge, finetune yourself, and run offline. At State of Utopia, we're working on it!

znnajdla 9 hours ago|
My experience as well. People worry our profession is being reduced to "prompt engineer", but actually I get the feeling that programming will soon be mainly about designing and building harnesses for specific tasks.
ambicapter 9 hours ago||
Personal opinion is that LLMs are definitely not as magical as people think they are, they fill a specific niche of problem-solving, and harnesses are necessary to corral your problem into the niche that they are extremely good at solving.
cruffle_duffle 8 hours ago||
The more I dive into this space, the more I think that developers will still be in heavy demand, just operating at a different level of abstraction most of the time. We will need to know our CS fundamentals, experience will still matter, juniors will still be needed. It’s just that a lot of the time the actual code being generated will come from our little helper buddies. But those things still need a human in the seat to drive them.

I keep asking myself “could my friends and family be handed this and be expected to build what I’m building on them” and the answer is an immediate “absolutely not”. Could a non-technical manager use these tools to build what I’m building? Absolutely not. And when I think about it, it’s for the exact same reason it’s always been… they just aren’t a developer. They just don’t “think” in the way required to effectively control a computer.

LLMs are just another way to talk to a machine. They aren’t magic. All the same fundamental principles that apply to properly telling a machine what to do still apply. It’s just a wildly different mechanism.

That all being said, I think these things will dramatically speed up the pace at which software eats the world. Put LLMs into a good harness and holy shit it’s like a superpower… but to get those superpowers unlocked you still have to know the basics, same as before. I think this applies to all other trades too. If you are a designer you still have to know what good design is and how to articulate it. Data scientists still need to understand the basics of their trade… these tools just give them superpowers.

Whether or not this assertion remains true in two or three years remains to be seen but look at the most popular tool. Claude code is a command line tool! Their gui version is pretty terrible in comparison. Cursor is an ide fork of vscode.

These are highly technical tools requiring somebody who knows file systems, command lines, basic development tooling like compilers, etc. They require you to know a lot of stuff most people simply don’t. The direction I think these tools will head is far closer to highly sophisticated dev tooling than general-purpose “magic box” stuff that your parents can use to… I dunno… vibe code the next hit todo app.

neversupervised 8 hours ago|||
I believe you’re arriving at the wrong conclusion because you’re comparing to an opposite instead of to someone slightly worse than you. Will this enable people at the edge to perform like you? That’s the question. Will there be more developers? Will they compete with you?
keybored 5 hours ago||||
> The more I dive into this space, the more I think that developers will still be in heavy demand, just operating at a different level of abstraction most of the time. We will need to know our CS fundamentals, experience will still matter, juniors will still be needed. It’s just that a lot of the time the actual code being generated will come from our little helper buddies. But those things still need a human in the seat to drive them.

It’s disheartening that programmers are using this advanced, cutting-edge technology with such a backwards, old-fashioned approach.[1]

Code generation isn’t a higher level abstraction. It’s the same level but with automation.

See [1]. I’m open to LLMs or humans+LLMs creating new abstractions. Real abstractions that hide implementation details and don’t “leak”. Why isn’t this happening?

Truly “vibe coding” might also get the same job done. In the sense of: you only have to look at the generated code for reasons like how a C++ programmer looks at the assembly. Not to check if it is even correct. But because there are concerns beyond just the correctness like code gen size. (Do you care about compiler output size? Sometimes. So sometimes you have to look.)

[1]: https://news.ycombinator.com/item?id=44163821

skydhash 7 hours ago|||
> LLMs are just another way to talk to a machine. They aren’t magic.

I will still opt for a scriptable shell. A few scripts, and I have a custom interface that can be easily composed. And could be run on a $100 used laptop from ebay.
