Improving 15 LLMs at Coding in One Afternoon. Only the Harness Changed

Posted by kachapopopow 9 hours ago

Improving 15 LLMs at Coding in One Afternoon. Only the Harness Changed(blog.can.ac)

495 points | 209 comments

perrygeo 4 minutes ago|

Witness the giant leap forward in the capabilities of coding agents over the last year. There has been no such leap in LLM model performance. I think the causality is crystal clear. It's nothing about "AGI" and all about existing LLMs learning to use existing tools.

Even a sub-par LLM, put into a context where it has access to unix tools and network and files etc, is vastly more capable than the best LLM chatbot.

logicprog 7 hours ago||

I really enjoyed this article. I think the author is precisely right and I've been saying this for a long time. There's a ton of extremely interesting low hanging fruit that can vastly improve the effectiveness of even currently existing models hiding in how we design our agent harnesses; enough to — at least until we hit diminishing returns — make as much or more of a difference than training new models!

I think one of the things that this confirms, for me at least, is that it's better to think of "the AI" as not just the LLM itself, but the whole cybernetic system of feedback loops joining the LLM and its harness. Because, if the harness can make as much if not more of a difference, when improved, as improvements to the model itself, then they have to be really considered equally important. Not to mention the fact that models are specifically reinforcement learned to use harnesses and harnesses are adapted to the needs of models in general or specific models. So they necessarily sort of develop together in a feedback loop. And then in practice, as they operate, it is a deeply intertwined feedback loop where the entity that actually performs the useful work, and which you interact with, is really the complete system of the two together.

I think thinking like this could not only unlock quantitative performance improvements like the ones discussed in this blog post, but also help us conceive of the generative AI project as actually a project of neurosymbolic AI, even if the most capital-intensive and novel aspect is a neural network; and once we begin to think like that, it unlocks a lot of new options and more holistic thinking, and might increase research in the harness area.

andai 3 hours ago||

My Weird Hill is that we should be building things with GPT-4.

I can say unironically that we haven't even tapped the full potential of GPT-4. The original one, from 2023. With no reasoning, no RL, no tool calling, no structured outputs, etc. (No MCP, ye gods!) Yes, it's possible to build coding agents with it!

I say this because I did!

Forcing yourself to make things work with older models forces you to keep things simple. You don't need 50KB of prompts. You can make a coding agent with GPT-4 and half a page of prompt.

Now, why would we do this? Well, these constraints force you to think differently about the problem. Context management becomes non-optional. Semantic compression (for Python it's as simple as `grep -r def .`) becomes non-optional. Bloating the prompt with infinite detail and noise... you couldn't if you wanted to!

Well, surely none of this is relevant today? Well, it turns out all of it still is! e.g. small fix, the "grep def" (or your language's equivalent) can be trivially added as a startup hook to Claude Code, and suddenly it doesn't have to spend half your token budget poking around the codebase, because -- get this -- it can just see where everything is... (What a concept, right?)
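A minimal sketch of that "grep def" idea, assuming a Python codebase and using the standard `ast` module instead of grep so that signatures come out clean. The function names `outline_source` and `outline_tree` are made up for illustration, and how the output gets wired into a startup hook is left out because that depends on the harness.

```python
import ast
from pathlib import Path


def outline_source(source: str) -> list[str]:
    """Return one 'line: signature' entry per function or class definition."""
    entries = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Only plain positional parameters are listed, to keep the map short.
            args = ", ".join(a.arg for a in node.args.args)
            entries.append((node.lineno, f"def {node.name}({args})"))
        elif isinstance(node, ast.ClassDef):
            entries.append((node.lineno, f"class {node.name}"))
    return [f"{lineno}: {sig}" for lineno, sig in sorted(entries)]


def outline_tree(root: str) -> str:
    """Build a compact map of every .py file under root, suitable for a prompt."""
    chunks = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            entries = outline_source(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that do not parse or decode
        if entries:
            body = "\n".join(f"  {entry}" for entry in entries)
            chunks.append(f"{path}\n{body}")
    return "\n".join(chunks)
```

Calling `print(outline_tree("."))` at session start would give the agent the "where everything is" view in one shot; for other languages the same shape should work with that language's parser or a plain grep.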

-- We can also get into "If you let the LLM design the API then you don't need a prompt because it already knows how it should work", but... we can talk about that later ;)

jstummbillig 1 hour ago|||

The problem with these exercises is always: I have limited time and capacity to do things, and a fairly unlimited number of problems that I can think of to solve. Coding is not a problem I want to solve. Prompt engineering is not a problem I want to solve.

If I do things for the love of it, the rules are different of course. But otherwise I will simply always accept that there are many things that improve around me, that I have no intimate knowledge of and probably never will, and I let other people work them out and happily lean on their work to do the next thing I care about, that is not already solved.

logicprog 2 hours ago|||

> Well, surely none of this is relevant today? Well, it turns out all of it still is! e.g. small fix, the "grep def" (or your language's equivalent) can be trivially added as a startup hook to Claude Code, and suddenly it doesn't have to spend half your token budget poking around the codebase, because -- get this -- it can just see where everything is... (What a concept, right?)

Hahaha yeah. This is very true. I find myself making ad hoc versions of this in static markdown files to get around it. Just another example of the kind of low hanging fruit harnesses are leaving on the table. A version of this that uses tree sitter grammars to map a codebase, and does it on every startup of an agent, would be awesome.

> My Weird Hill is that we should be building things with GPT-4.

I disagree, IMO using the best models we have is a good way to avoid wasting time, but that doesn't mean we shouldn't also be frugal and clever with our harnesses!

andai 2 hours ago||

To clarify, I didn't mean we should be using ancient models in production, I meant in R&D.

Anthropic says "do the simplest thing that works." If it works with the LLMs we had 3 years ago, doesn't that make it simpler?

The newer LLMs mostly seem to work around the poor system design. (Like spawning 50 subagents on a grep-spree because you forgot to tell it where anything is...) But then you get poor design in prod!

mycall 6 hours ago|||

If I remember correctly, both the Claude Code and OpenAI Codex "harnesses" have now been used to improve themselves.

OpenAI used early versions of GPT-5.3-Codex to debug its own training process, manage its deployment and scaling, and diagnose test results and evaluation data.

Claude Code has shipped 22 PRs in a single day and 27 the day before, with 100% of the code in each PR generated entirely by Claude Code.

logicprog 7 hours ago|||

Also, yes, I'm aware that I use a lot of "it's not just X, it's Y." I promise you this comment is entirely human written. I'm just really tired and tend to rely on more rote rhetorical tropes when I am. Believe me, I wrote like this long before LLMs were a thing.

rubenflamshep 7 hours ago|||

It didn’t read as AI to me :)

drob518 3 hours ago||||

That's what all the AIs have been trained to say.

co_king_3 4 hours ago||||

No one here will accuse you of being an AI unless they're trying to dehumanize you for expressing anti-AI sentiment.

logicprog 2 hours ago||

I'm sorry, but that's empirically false. E.g., a substantial proportion of the highly upvoted comments on https://news.ycombinator.com/item?id=46953491, which was one of the best articles on software engineering I've read in a long time, are accusing it of being AI for no reason.

kachapopopow 6 hours ago|||

why the long -'s

logicprog 6 hours ago||

Because I like them?

kachapopopow 6 hours ago|||

reminds me of that one guy complaining that everyone is calling them an AI when AI was trained on their grammar style.

ahofmann 6 hours ago||

This happened to the female speaker with her voice, which I find terrifying: https://www.youtube.com/watch?v=qO0WvudbO04

soperj 5 hours ago|||

how do you make them?

RussianCow 5 hours ago||

On macOS, Option+Shift+- and Option+- insert an em dash (—) and en dash (–), respectively. On Linux, you can hit the Compose Key and type --- (three hyphens) to get an em dash, or --. (hyphen hyphen period) for an en dash. Windows has some dumb incantation that you'll never remember.

oblio 1 hour ago|||

For Windows it's just easier to make a custom keyboard layout and go to town with that: https://www.microsoft.com/en-us/download/details.aspx?id=102...

BizarroLand 3 hours ago|||

Alt+0151 or WIN+SHIFT+-, but I can't seem to make the WIN+SHIFT+- combo work in browser, only in a text editor.

noupdates 4 hours ago|||

I was just looking at the SWE-bench docs and it seems like they use almost an arbitrary form of context engineering (loading in some arbitrary amount of files to saturate context). So in a way, the bench suites test how good a model is with little to no context engineering (I know ... it doesn't need to be said). We may not actually know which models are sensitive to good context-engineering, we're simply assuming all models are. I absolutely agree with you on one thing, there is definitely a ton of low hanging fruit.

barrenko 6 hours ago|||

2026 is the year of the harness.

visarga 6 hours ago|||

Already made a harness for Claude to make R/W plans, not write once like they are usually implemented. They can modify themselves as they work through the task at hand. Also relying on a collection of patterns for writing coding task plans which evolves by reflection. Everything is designed so I could run Claude in yolo-mode in a sandbox for long stretches of time.

porker 1 hour ago||

Link?

ex-aws-dude 2 hours ago||||

As a VC in 2026 I'm going to be asking every company "but what's your harness strategy?"

cyanydeez 53 minutes ago||||

2027 is the year of the "maybe indeterminism isn't as valuable as we thought"

miohtama 6 hours ago|||

But will harness build desktop Linux for us?

vidarh 1 hour ago|||

My harness is improving my Linux desktop...

riskable 5 hours ago|||

Only if you put bells on it and sing Jingle Bells while it em dashes through the snow.

aeon_ai 7 hours ago|||

Once you begin to see the “model” as only part of the stack, you begin to realize that you can draw the line of the system to include the user as well.

That’s when the future really starts hitting you.

renato_shira 3 hours ago|||

yeah this clicked for me when i stopped obsessing over which model to use and focused on how i structure the context and feedback loops around it. for my project the same model went from "barely usable" to "legitimately helpful" just by changing how i fed it context and how i validated its output.

the user inclusion part is real too. the best results i get aren't from fully autonomous agents, they're from tight human-in-the-loop cycles where i'm steering in real time. the model does the heavy lifting, i do the architectural decisions and error correction. feels more like pair programming than automation.

logicprog 2 hours ago||

> the user inclusion part is real too. the best results i get aren't from fully autonomous agents, they're from tight human-in-the-loop cycles where i'm steering in real time. the model does the heavy lifting, i do the architectural decisions and error correction. feels more like pair programming than automation.

Precisely. This is why I use Zed and the Zed Agent. It's near-unparalleled for live, mind-meld pair programming with an agent, thanks to CRDTs, DeltaDB, etc. I can elaborate if anyone is interested.

ambicapter 2 hours ago|||

I am interested.

rahabash 2 hours ago|||

plz do

logicprog 1 hour ago||

The special (or at least new to me) things about Zed (when you use it with the built-in agent, instead of one of the ones available through ACP) basically boil down to the fact that it's a hyper advanced CRDT-based collaborative editor, that's meant for live pair programming in the same file, so it can just treat agents like another collaborator.

1. the diffs from the agent just show up in the regular file you were editing, you're not forced to use a special completion model, or view the changes in a special temporary staging mode or different window.

2. you can continue to edit the exact same source code without accepting or rejecting the changes, even in the same places, and nothing breaks — the diffs still look right, and doing an accept or reject Just Works afterwards.

3. you can accept or reject changes piecemeal, and the model doesn't get confused by this at all and have to go "oh wait, the file was/wasn't changed, let me re-read..." or whatever.

4. Even though you haven't accepted the changes, the model can continue to make new ones, since they're stored as branches in the CRDT, so you can have it iterate on its suggestions before you accept them, without forcing it to start completely over either (it sees the file as if its changes were accepted)

5. Moreover, the actual files on disk are in the state it suggests, meaning you can compile, fuzz, test, run, etc to see what its proposed changes do before accepting them

6. you can click a follow button and see which files it has open, where it's looking in them, and watch as it edits the text, like you're following a dude in Dwarf Fortress. This means you can very quickly know what it's working on and when, correct it, or hop in to work on the same file it is.

7. It can actually go back and edit the same place multiple times as part of a thinking chain, or even as part of the same edit, which has some pretty cool implications for final code-quality, because of the fact that it can iterate on its suggestion before you accept it, as well as point (9) below

8. It streams its code diffs, instead of hanging and then producing them as a single gigantic tool call. Seeing it edit the text live, instead of having to wait for a final complete diff to come through that you either accept or reject, is a huge boon for iteration time compared to e.g. ClaudeCode, because you can stop and correct it mid way, and also read as it goes so you're more in lockstep with what's happening.

9. Crucially, because the text it's suggesting is actually in the buffer at all times, you can see LSP, tree-sitter, and linter feedback, all inline and live as it writes code; and as soon as it's done an edit, it can see those diagnostics too — so it can actually iterate on what it's doing with feedback before you accept anything, while it is in the process of doing a series of changes, instead of you having to accept the whole diff to see what the LSP says

logicprog 6 hours ago||||

Aha! A true cybernetics enthusiast. I didn't say that because I didn't want to scare people off ;)

drob518 3 hours ago|||

That's next-year's problem.

fazgha 7 hours ago||

So deep your comment. Asking for a friend, how did you manage to have the em dash — in your keyboard ?

throwup238 6 hours ago|||

Does your friend have an iPhone? The default iOS keyboard has automatically converted double dashes into an emdash for at least seven years now.

QuercusMax 2 hours ago||

I think Google docs does this too, which drives me up the wall when I'm trying to write `command --foo=bar` and it turns it into an M-dash which obviously doesn't work.

velcrovan 6 hours ago||||

https://joeldueck.com/manually-type-punctuation.html

https://joeldueck.com/ai-is-right-about-em-dashes.html

ahofmann 6 hours ago||||

Em dashes are used often by LLMs because humans use them often. On Mac keyboards it's easily typed. I know this is oversimplifying the situation, but I don't see the usefulness of the constant witch-hunting for allegedly LLM-generated text. For text, we are long past the point where we can differentiate between human-generated and machine-generated. We're even at the point where it gets somewhat hard to identify machine-generated audio and visuals.

StilesCrisis 4 hours ago|||

I might not be able to spot ALL AI generated text, but I can definitely spot some. It's still kind of quirky.

vardalab 3 hours ago|||

Yeah, I agree with you. I'm so tired of people complaining about AI-generated text without focusing on the content. Just don't read it if you don't like it. It's another level of when people complain how a website is not readable for them or some CSS rendering is wrong or whatever. How does it add to the discussion?

ink 7 hours ago||||

On a Mac, it's alt-dash in case you weren't being facetious

snazz 6 hours ago|||

Extra pedantic: that’s the en dash, the em dash is option-shift-hyphen

macintux 6 hours ago|||

Technically option-shift-dash. option-dash is an en-dash.

vient 2 hours ago||||

On Windows it is Alt+0151. Harder to use than on Mac but definitely possible, I frequently use it.

On recent versions Shift+Win+- also works, and Win+- produces an en dash.

wiredfool 2 hours ago||||

I just type -- and jira fixes it.

dolebirchwood 3 hours ago||||

I really despise that people like you ruined em dashes for the rest of us who have enjoyed using them.

bitwize 6 hours ago|||

I use Compose - - - on Linux and my cellphone (Unexpected Keyboard). Mac is Alt-_.

woah 3 hours ago||

Seems like a very cool technique, but also very oversold. He's seeing a 5% improvement on a find and replace benchmark of his own devising and saying stuff like this in the blog post:

> Here is why that is backwards. I just showed that a different edit format improves their own models by 5 to 14 points while cutting output tokens by ~20%. That’s not a threat. It’s free R&D.

He makes it sound like he got a 5-14% boost on a top level benchmark, not a 5% improvement on a narrow find and replace metric. Anecdotally, I don't usually have a lot of issues with editing in Claude Code or Cursor, and if there is an issue the model corrects it.

Assuming that it costs double the tokens when it has to correct itself, and find and replace errors are as prominent in actual day to day use as his benchmark, we're talking a 5% efficiency gain in editing token use (not reasoning or tool use). Given that editing must be less than 1/3 of the token use (I assume much less?), we're talking an overall efficiency gain of less than 1%.
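Taking those numbers at face value, the arithmetic can be checked in a few lines. This is only a sketch of the estimate above: the 5% figure and the one-third editing share are the comment's assumptions, not measurements, and the helper names are made up.

```python
def overall_saving(edit_saving: float, edit_share: float) -> float:
    """Fraction of total tokens saved when only editing tokens get cheaper."""
    return edit_saving * edit_share


def break_even_share(target: float, edit_saving: float) -> float:
    """Editing share at which the overall saving equals `target`."""
    return target / edit_saving


# 5% saving on edits, with edits at the assumed ceiling of 1/3 of all tokens.
print(f"{overall_saving(0.05, 1 / 3):.2%}")    # 1.67%

# Editing share below which the overall saving drops under 1%.
print(f"{break_even_share(0.01, 0.05):.0%}")   # 20%
```

So at the one-third ceiling the gain is about 1.7%, and the "less than 1%" figure holds once editing is under roughly a fifth of total token use, which fits the "much less" guess.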

This seems like a promising technique but maybe not a high priority in efficiency gains for these tools. The messianic tone, like assuming that Google cut off his access to suppress his genius editing technique rather than just because he was hammering their API, also leaves a bad taste, along with the rampant and blatant ChatGPTisms in the blog post.

andai 3 hours ago||

The benchmarks seem to indicate 25-50% reduction in tokens. I'm not sure how that works in real world usage though.

athrowaway3z 2 hours ago||

> “replace line 2:f1, replace range 1:a3 through 3:0e, insert after 3:0e.”

Not sure what they're calculating, but this seems to me like it could be many times more efficient than 20%.
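For anyone trying to picture why anchors like `2:f1` are cheap, here is a rough sketch of a hash-anchored line edit. It is not the article's implementation: the two-hex-digit SHA-1 prefix, the `|` separator, and all function names are assumptions picked to match the shape of the quoted example.

```python
import hashlib


def line_tag(lineno: int, text: str) -> str:
    """Anchor such as '2:f1': 1-based line number plus a short content hash."""
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()[:2]
    return f"{lineno}:{digest}"


def render(lines: list[str]) -> str:
    """The tagged view of the file that the model would read."""
    return "\n".join(f"{line_tag(i, t)}|{t}" for i, t in enumerate(lines, 1))


def resolve(lines: list[str], tag: str) -> int:
    """Map a tag to a 0-based index, rejecting anchors that no longer match."""
    lineno = int(tag.split(":")[0])
    in_range = 1 <= lineno <= len(lines)
    if not in_range or line_tag(lineno, lines[lineno - 1]) != tag:
        raise ValueError(f"stale or unknown anchor: {tag}")
    return lineno - 1


def replace_range(lines: list[str], start_tag: str, end_tag: str,
                  new_lines: list[str]) -> list[str]:
    """Replace the inclusive range between two anchors."""
    start, end = resolve(lines, start_tag), resolve(lines, end_tag)
    if start > end:
        raise ValueError("range anchors are out of order")
    return lines[:start] + new_lines + lines[end + 1:]


def insert_after(lines: list[str], tag: str, new_lines: list[str]) -> list[str]:
    """Insert new lines directly after the anchored line."""
    idx = resolve(lines, tag)
    return lines[:idx + 1] + new_lines + lines[idx + 1:]
```

The saving comes from the model emitting a short anchor instead of reproducing the old text verbatim, and the hash check turns a stale read into a clean error rather than a wrong edit.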

chrisweekly 8 hours ago||

Great post. A few choice quotes:

> Often the model isn’t flaky at understanding the task. It’s flaky at expressing itself. You’re blaming the pilot for the landing gear.

> The model is the moat. The harness is the bridge. Burning bridges just means fewer people bother to cross. Treating harnesses as solved, or even inconsequential, is very short-sighted.

> The gap between “cool demo” and “reliable tool” isn’t model magic. It’s careful, rather boring, empirical engineering at the tool boundary.

brendanmc6 7 hours ago||

You’re absolutely right! This isn’t your average engineering advice— it’s like painting the reader a vivid tapestry of the author’s mind.

esafak 7 hours ago||

Please stop; I just can't any more! Yes, I'm absolutely right.

cevn 4 hours ago||

You're absolutely right about being absolutely right!

dimgl 6 hours ago||

My personal favorite: That’s not a threat. It’s free R&D.

matheist 5 hours ago||

> Codex uses apply_patch: It takes a string as input, which is essentially an OpenAI-flavored diff, and instead of relying on a structured schema, the harness just expects this blob to follow a strict set of rules. Since OpenAI folks are without a doubt smart, I’m sure the token selection process is biased to fit this structure at the LLM gateway for the Codex variants of GPT, similar to how other constraints like JSON schemas or required tool calls work.

Codex does in fact use a schema for constrained sampling, it's here: https://github.com/openai/codex/blob/main/codex-rs/core/src/...

It still has to produce an exact match, as far as I know; I didn't read the code to see if any fuzzy matching is used.

Note the two codex models were the only ones doing worse with the author's proposed format. The author found them doing better with replace than with apply_patch, but since the author appears to be unaware that they use a schema for constrained sampling, I think a more realistic benchmark should enable constrained sampling for the apply_patch test.

keeda 1 hour ago||

This makes sense to me because I've been having very accurate results with models from even 2+ years ago... but I had to "hold them right." Even when reasoning models and coding agents were just a gleam in Altman's and Amodei's eyes, I could tell a lot of the unrealized gains lay in building the right tools, harnesses and guardrails to manage the context and guide the model. (Relevant subthread as example: https://news.ycombinator.com/item?id=44171519)

But this article hints at deeper wins to be had. Consider that these models are operating on source code, which is a verbose, noisy, textual serialization of the intended syntax / semantic trees. TFA improves accuracy by retro-fitting some structure onto the text. But what if models could operate directly on these underlying structures themselves?

As a data point, there are projects like OpenRewrite, which encode a ton of information, from formatting to types with globally resolved dependencies for each symbol in what they call a "Lossless Semantic Tree", so that there is ~0 ambiguity about the code. When I worked with OpenRewrite (in the era before LLMs, how quaint!) compared to other tools, it produced the best results for code transformations with the highest fidelity to the surrounding code.

Now imagine if the agent has access to such detailed information. It would not have to waste tokens figuring incidental things out like formatting. Although I haven't tested it out myself, I believe Moderne (the maintainers of OpenRewrite) when they say that agents armed with LST-based tools make extremely accurate changes.

This is essentially the same reason why the answer to "Which is better, Vim or Emacs?" is "IntelliJ."

Now consider that these models are STILL operating on text as an input and output mode! What if they were multi-modally trained on source code and docs and their syntax / semantic trees? I don't even know what this would look like, but I'd bet this would produce the most accurate coding models ever -- probably neurosymbolic in the truest sense.

clx75 6 hours ago||

During my first LLM experiments in Emacs using gptel, I also found that the LLM has considerable difficulties changing source code files with the Unix patch tool.

As Emacs has a built-in tree-sitter package, I implemented this same idea. I created gptel tools like tree_sitter_list_nodes, tree_sitter_get_nodes, tree_sitter_update_nodes, tree_sitter_insert_before_node and tree_sitter_insert_after_node. The "list" tool returns a list of AST nodes with first line number, first line content and node hash. The LLM can then use "get" to collect interesting nodes in their entirety and "update" to update a list of nodes identified by hash with new content (var/function bodies).

Worked like a charm.
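The list/get/update flow described above can be sketched roughly like this. To stay self-contained it uses Python's built-in `ast` module on Python source as a stand-in for tree-sitter, only looks at top-level nodes, and the names `list_nodes`, `get_node` and `update_node` are illustrative, not the actual gptel tool definitions.

```python
import ast
import hashlib


def _hash(text: str) -> str:
    """Short content hash used to address a node."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:8]


def _spans(source: str):
    """Yield (start, end, text) per top-level node; line numbers are 1-based."""
    lines = source.splitlines()
    for node in ast.parse(source).body:
        decorators = getattr(node, "decorator_list", [])
        start = min([node.lineno] + [d.lineno for d in decorators])
        end = node.end_lineno
        yield start, end, "\n".join(lines[start - 1:end])


def list_nodes(source: str) -> list[dict]:
    """Cheap overview: first line number, first line content, node hash."""
    return [
        {"line": start, "first_line": text.splitlines()[0], "hash": _hash(text)}
        for start, _end, text in _spans(source)
    ]


def get_node(source: str, node_hash: str) -> str:
    """Return the full text of the node with this hash."""
    for _start, _end, text in _spans(source):
        if _hash(text) == node_hash:
            return text
    raise KeyError(node_hash)


def update_node(source: str, node_hash: str, new_text: str) -> str:
    """Replace the node with this hash and return the new source."""
    lines = source.splitlines()
    for start, end, text in _spans(source):
        if _hash(text) == node_hash:
            lines[start - 1:end] = new_text.splitlines()
            return "\n".join(lines) + "\n"
    raise KeyError(node_hash)
```

Because nodes are addressed by content hash, an update aimed at a node that has changed since it was listed fails with a `KeyError` instead of landing on the wrong lines.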

badhorseman 6 hours ago|

Sounds interesting, do you have the code to share?

clx75 5 hours ago||

Tool definitions: https://github.com/cellux/dotfiles/blob/master/.emacs.d/rb-g...

Implementation: https://github.com/cellux/dotfiles/blob/master/.emacs.d/rb-t...

jahala 5 hours ago||

I implemented this hash (read and edit) approach in tilth if you want to test it out.

https://github.com/jahala/tilth

It's on npm and cargo:

- cargo install tilth

- npx tilth

then tilth install claude-code/windsurf/cursor --edit

(--edit flag is needed)

I made "tilth" a few days ago, since I'm consistently trying to get the LLMs to use tools more efficiently and spend less tokens doing it -- original tilth post from Monday: https://news.ycombinator.com/item?id=46952321

hedgehog 5 hours ago||

You might find it useful for markdown as well, especially if you add support for section-based addressing (e.g. cat or replace a section at a time). Section-based addresses are nice because they tend to be stable across versions.
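A rough sketch of what section-based addressing could look like for markdown, assuming ATX headings (`#`, `##`, ...) and addressing by exact heading title. This is not tilth's implementation; the function names are made up, and headings that appear inside fenced code blocks are not handled.

```python
import re

# ATX heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def find_section(lines: list[str], title: str) -> tuple[int, int]:
    """Return [start, end) line indexes of the section headed by `title`."""
    start = level = None
    for i, line in enumerate(lines):
        match = HEADING.match(line)
        if not match:
            continue
        if start is None:
            if match.group(2) == title:
                start, level = i, len(match.group(1))
        elif len(match.group(1)) <= level:
            return start, i  # next heading at the same or a higher level
    if start is None:
        raise KeyError(title)
    return start, len(lines)


def cat_section(text: str, title: str) -> str:
    """Return the heading line plus everything nested under it."""
    lines = text.splitlines()
    start, end = find_section(lines, title)
    return "\n".join(lines[start:end])


def replace_section(text: str, title: str, new_body: str) -> str:
    """Keep the heading line and swap out everything under it."""
    lines = text.splitlines()
    start, end = find_section(lines, title)
    return "\n".join(lines[:start + 1] + new_body.splitlines() + lines[end:])
```

The address is the heading text rather than a line number, so it keeps working when unrelated sections above it grow or shrink.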

jahala 3 hours ago||

Great idea - Just implemented this.

(Already published on cargo, on npm in a few mins).

kachapopopow 5 hours ago||

benchmarks vs grep?

jahala 4 hours ago||

tilth isn’t trying to replace grep for raw text search — for that, it wraps ripgrep internally so perf is comparable. It’s about reducing round-trips and giving the agent a verified edit workflow, not faster search.

Instead of cat + grep + manual line counting, one tool call returns a structural outline of a large file, lets you drill into sections, and since this last update also returns hashline-anchored output that an edit tool can target.

kachapopopow 4 hours ago||

well yah, that's what I mean: how much better is it versus cat + grep + manual line counting? Agents tend to perform worse with niche tools

jahala 2 hours ago||

Thank you for this question - I'm building out a benchmark now. Initial results are very promising, will update you once it's done!

woeirua 8 hours ago||

The harness matters far more than most people think. See this post about the CORE benchmark, where Opus' score almost doubled when they switched from their own harness to Claude Code: https://x.com/sayashk/status/1996334941832089732

theturtletalks 8 hours ago||

Mario, the creator of the Pi terminal agent, has this great blog post[0]. He talks about how TerminalBench's highest scores come from using the Terminus 2 harness, which uses tmux under the hood.

When I was reading the Opus 4.6 launch post, they mentioned the same thing and their TerminalBench score was based on using Terminus 2 and not CC.

0. https://mariozechner.at/posts/2025-11-30-pi-coding-agent/

withinboredom 8 hours ago||

Which, IMHO, is why we should be able to change them freely or make our own. Being locked into a specific harness because you pay 20 bucks per month vs. pay-per-use ... is kinda dumb.

CuriouslyC 8 hours ago|||

The reason Anthropic is pushing on the closed harness is that they're not confident with their ability to win on model quality long term, so they're trying to build lock-in. They can capture some additional telemetry owning the harness as well, but given the amount of data the agent loop already transmits, that borders on unethical spyware (which might be part of the reason they're afraid to open source).

Ultimately the market is going to force them to open up and let people flex their subs.

Aurornis 7 hours ago||||

> Being locked into a specific harness because you pay 20 bucks per month vs. pay-per-use ... is kinda dumb.

I’ll probably get downvoted for this, but am I the only one who thinks it’s kind of wild how much anger is generated by these companies offering discounted plans for use with their tools?

At this point, there would be less anger and outrage on HN if they all just charged us the same high per-token rate and offered no discounts or flat rate plans.

senordevnyc 6 hours ago||

No, you're not the only one. The outraged entitlement is pretty funny tbh. How dare they dictate that they'll only subsidize your usage if you use their software!!

chickensong 2 hours ago||

I'm not outraged, but the dynamic creates a tension that prevents me from building brand loyalty.

horsawlarway 8 hours ago|||

Also another place where having it change out from underneath you can drastically alter the quality of your work in unexpected ways.

Like most things - assume the "20/100/200" dollar deals that are great now are going to go down the enshittification route very rapidly.

Even if the "limits" on them stay generous, the product will start shifting to prioritize things the user doesn't want.

Tool recommendations are my immediate and near term fear - paid placement for dev tools both at the model level and the harness level seem inevitable.

---

The right route is open models and open harnesses, ideally on local hardware.

Aurornis 7 hours ago|||

> Like most things - assume the "20/100/200" dollar deals that are great now are going to go down the enshittification route very rapidly.

I don’t assume this at all. In fact, the opposite has been happening in my experience: I try multiple providers at the same time and the $20/month plans have only been getting better with the model improvements and changes. The current ChatGPT $20/month plan goes a very long way even when I set it to “Extra High” whereas just 6 months ago I felt like the $20/month plans from major providers were an exercise in bouncing off rate limits for anything non-trivial.

Inference costs are only going to go down from here and models will only improve. I’ve been reading these warnings about the coming demise of AI plans for 1-2 years now, but the opposite keeps happening.

disgruntledphd2 7 hours ago|||

> Inference costs are only going to go down from here and models will only improve. I’ve been reading these warnings about the coming demise of AI plans for 1-2 years now, but the opposite keeps happening.

This time also crosses over with the frontier labs raising ever larger and larger rounds. If Anthropic IPOs (which I honestly doubt), then we may get a better sense of actual prices in the market, as it's unlikely the markets will continue letting them spend more and more money each year without a return.

TuxSH 4 hours ago|||

> The current ChatGPT $20/month plan goes a very long way

It sure does and Codex is great, but do you think they'll maintain the current prices after/if it eventually dominates Claude Code in terms of marketshare and mindshare?

deaux 8 hours ago||||

At this point subsidizing Chinese open-weights vendors by paying for them is just the right thing to do. Maybe they too might go closed-weights when they become SotA, but they're now pretty close and haven't done it.

DeathArrow 8 hours ago||

I am wondering what kinds of harness are best for GLM, Deepseek, Qwen, Kimi.

deaux 8 hours ago|||

OpenCode is great in general. At least one of them is specifically trained on CC - I think it was Qwen - so for those that should give best results.

azuanrb 6 hours ago|||

Claude Code works better than OpenCode for GLM models for me.

eshaham78 8 hours ago|||

The harness is effectively the agent's 'body'. Swapping the brain (model) is good, but if the body (tools/environment) is locked down or inefficient, the brain can't compensate. Local execution environments that standardize the tool interface are going to be critical for avoiding that lock-in.

tosh 8 hours ago|

Shows how much room for improvement there is on the harness level.

Agents waste a lot of tokens on editing, sandboxes, passing info back and forth from tool calls and subagents.

Love the pragmatic mix of content based addressing + line numbers. Beautiful.

robbomacrae 7 hours ago||

Indeed. The biggest waste might be the overuse of MCP for everything. Sure, it makes the initial development easier, but then for every connection you're using a hundred-billion-parameter model to decide how to make the call when it's usually completely unnecessary and then prone to random errors. MCP is the hammer that can make literally everything look like a nail...

senordevnyc 6 hours ago||

I see this ranting against MCP all the time, and I don't get it, maybe I'm missing something. I'm currently using an MCP in Cursor to give agents read-only access to my staging and prod databases, as well as BugSnag's MCP so it can look up errors that happen in those environments. It works great. What should I be using for this if not MCP?

visarga 5 hours ago|||

Make a CLI tool for it, of course
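As a rough idea of what that could look like for the read-only database case, here is a small sketch. It assumes SQLite as a stand-in (the databases in question may well be something else), and the names `run_query` and `main`, the argument layout, and the JSON output format are all made up for illustration.

```python
import argparse
import json
import sqlite3


def run_query(db_path: str, sql: str, limit: int = 50) -> list[dict]:
    """Run one statement on a read-only connection, returning up to `limit` rows."""
    # mode=ro makes SQLite itself reject writes, whatever SQL the agent sends.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    conn.row_factory = sqlite3.Row
    try:
        rows = conn.execute(sql).fetchmany(limit)
        return [dict(row) for row in rows]
    finally:
        conn.close()


def main(argv=None) -> None:
    """CLI entry point: prints the rows as JSON so the agent can parse them."""
    parser = argparse.ArgumentParser(description="Read-only SQL for agents")
    parser.add_argument("db", help="path to the SQLite database file")
    parser.add_argument("sql", help="a single SQL statement to run")
    parser.add_argument("--limit", type=int, default=50,
                        help="maximum number of rows to return")
    args = parser.parse_args(argv)
    rows = run_query(args.db, args.sql, args.limit)
    print(json.dumps(rows, indent=2, default=str))
```

Wired up as a script that calls `main()`, the agent only needs one line in its instructions saying the command exists; the row limit keeps a careless `SELECT *` from flooding the context.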

canadiantim 4 hours ago|||

agent skills, or use claude code to iteratively condense an MCP you want to use into only its most essential tools for your workflow

chasd00 7 hours ago|||

i haven't dug into the article but your comment reminded me about the ClaudeCode Superpowers plugin. I find the plugin great but it's quite "expensive", I use the pay-as-you-go account with CC because i've just been trying it out personally and the superpowers plugin spends a lot of money, relative to regular CC, with all the back and forth.

With CC you can do a /cost to see how much your session cost in dollar terms, that's a good benchmark IMO for plugins, .md files for agents, and so on. Minimize the LLM cost in the way you'd minimize typical resource usage on a computer like cpu, ram, storage etc.

kachapopopow 7 hours ago||

you can actually go the other way and spend more tokens to solve more complex problems (multi-agent) by letting agents work with smaller problems
