Posted by kachapopopow 16 hours ago
2) AFAIK the $20/month plan lets you use more tokens per month than if you bought $20 of tokens outright. My understanding is it assumes most users will only use a fraction of that each month, and they rake in profit (like a gym membership).
It's less token-heavy than the proposed hash approach, and I don't think frontier LLMs hallucinate line numbers if each line in the context is prefixed with them.
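Roughly what I have in mind, as a sketch (the helper names and the edit shape are made up, not any particular tool's API):

    // Prefix each line with its 1-based number before adding the file to the context.
    function numberLines(source: string): string {
      return source
        .split("\n")
        .map((line, i) => `L${i + 1}> ${line}`)
        .join("\n");
    }

    // The model can then express an edit against those numbers instead of a hash.
    interface LineEdit {
      startLine: number;   // first line to replace (1-based)
      endLine: number;     // last line to replace, inclusive
      replacement: string; // new text for that range
    }

    function applyLineEdit(source: string, edit: LineEdit): string {
      const lines = source.split("\n");
      lines.splice(edit.startLine - 1, edit.endLine - edit.startLine + 1, edit.replacement);
      return lines.join("\n");
    }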
How about Kimi tho? How can I play with it?
The way edits happen is that the agent (local) first tells the model (typically remote) that it has an edit tool (e.g. one taking a file name, a find string, and a replace string as parameters). If the model decides it wants to edit a file, it invokes this edit tool, which just results in a blob of JSON being put in the model's response specifying the edit (filename, etc). The agent then receives the response, intercepts this JSON blob, sees that it is an edit request, and does what is asked.
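To sketch it concretely (the shapes below are illustrative, loosely in the style of common tool-calling APIs, not any vendor's exact schema):

    // Tool declaration the agent advertises to the model.
    const editTool = {
      name: "edit_file",
      description: "Replace an exact string in a file with a new string.",
      parameters: {
        type: "object",
        properties: {
          path:       { type: "string" },
          old_string: { type: "string" },
          new_string: { type: "string" },
        },
        required: ["path", "old_string", "new_string"],
      },
    };

    // And roughly what the tool invocation in the model's response looks like:
    // { "tool": "edit_file",
    //   "arguments": { "path": "src/app.ts",
    //                  "old_string": "const retries = 3;",
    //                  "new_string": "const retries = 5;" } }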
The problem the article is describing is that the edit request (tool invocation) generated by the model isn't always 100% accurate. Even if the agent told the model it had a tool that invoked an actual editor, say sed, and the model knew how to use sed, the edit would still fail whenever the request can't be interpreted literally by the editor (because it's inaccurate).
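For example (contrived, just to show the failure mode):

    // A strict application fails if the model's old_string is even slightly off
    // (whitespace, quoting, a token it misremembered):
    const fileText = "if (user !== null) {\n  login(user);\n}";
    const requested = "if (user != null) {"; // model recalled != instead of !==

    if (!fileText.includes(requested)) {
      // A literal editor (or sed) has nothing to match, so the edit is rejected.
      throw new Error("old_string not found in file");
    }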
The trouble is, though, that because it's all indeterminate slop, every model will break in its own small ways, and you're back to indeterminacy and building a harness on top of the harness.
Still, <nerd snipe>, there's probably a way to get the local model and an arbitrary remote model to agree on how to make a method call. But that will only be fruitful if you can find a highly reproducible set of tuples within the models' shared space.
Part of the problem, though, is that tools like Claude Code don't want to assume too much of the environment - that a specific editor is available, or even that it is running on a particular OS. The way it stays platform-agnostic and not reliant on specific tools is by depending only on Node.js, which provides file read/write support. So to implement an edit request, the agent uses Node.js to read the file, applies the edit itself, then uses Node.js again to write out the updated file.
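Something roughly like this, as a sketch of the idea rather than Claude Code's actual implementation:

    // Node.js-only edit: read the file, replace in memory, write it back.
    import { readFileSync, writeFileSync } from "node:fs";

    function applyEdit(path: string, oldString: string, newString: string): void {
      const text = readFileSync(path, "utf8");
      if (!text.includes(oldString)) {
        throw new Error(`old_string not found in ${path}`);
      }
      // Replace only the first occurrence so an ambiguous match doesn't edit
      // more of the file than the model intended.
      writeFileSync(path, text.replace(oldString, newString), "utf8");
    }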
The benchmark overselling isn't the point though - it's that we're barely using these things right. Most people still chat with them like it's 2023. What happens when you combine this with actual review flows, not just "beat SWE-bench"?
idk, I think everyone's too focused on the model when tooling matters more, since that's something you can actually control
Is it possible that burning extra tokens is the point, since they get paid more?