Posted by mpweiher 12/21/2025
The biggest saving comes from keeping the context small and, where many turns are required, going for smaller models. For example, a single 30-minute troubleshooting session with Gemini 3 can cost $15 if you run it "normally", or $2 if you use the agents and wipe context after most turns (which is possible thanks to tracking progress in a plan file).
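To make the arithmetic behind that kind of gap concrete, here is a rough sketch of why wiping context helps: every turn re-sends the whole context as input, so the billed input tokens grow roughly quadratically with the number of turns unless the context is reset. All numbers below (turn count, tokens per turn, price) are made up for illustration, and the model ignores output tokens and prompt caching, so it is not a reconstruction of the $15 vs $2 figures.

```python
def session_input_tokens(turns, tokens_per_turn, base_context, reset_every=None):
    """Total input tokens billed over a session.

    Each turn re-sends the whole context, so without resets the context
    grows by tokens_per_turn every turn. With reset_every=k the context is
    wiped back to base_context (e.g. just the plan file) after every k turns.
    """
    total = 0
    context = base_context
    for turn in range(1, turns + 1):
        total += context            # the whole context is sent as input this turn
        context += tokens_per_turn  # the turn's output and tool results get appended
        if reset_every and turn % reset_every == 0:
            context = base_context  # wipe: only the plan file survives
    return total


# Hypothetical numbers, chosen only for illustration.
TURNS = 60
TOKENS_PER_TURN = 4_000
BASE_CONTEXT = 5_000
PRICE_PER_MILLION_INPUT = 2.00  # assumed USD price, not a real quote

normal = session_input_tokens(TURNS, TOKENS_PER_TURN, BASE_CONTEXT)
wiped = session_input_tokens(TURNS, TOKENS_PER_TURN, BASE_CONTEXT, reset_every=3)

for label, tokens in (("normal", normal), ("wiped every 3 turns", wiped)):
    cost = tokens / 1e6 * PRICE_PER_MILLION_INPUT
    print(f"{label}: {tokens:,} input tokens, ${cost:.2f}")
```

The exact ratio depends entirely on the assumed numbers, but the shape holds: the longer the session, the more a periodic wipe saves.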
So I fired up Cline with gpt-oss-120b, asked it to tell me what a specific function does, and proceeded to watch it run `cat README.md` over and over again.
I'm sure it's better with other models like the Qwen Coder ones, but it was a pretty funny first look.
[0] https://huggingface.co/blog/RakshitAralimatti/learn-ai-with-...
I use Claude for all my planning, create task documents and hand over to GLM 4.6. It has been my workhorse as a bootstrapped founder (building nocodo, think Lovable for AI agents).
We need to clean up code lint and format errors across multiple files. Check which files are affected using cargo commands. Please use opencode, a coding agent that is installed. Use `opencode run <prompt>` to pass in a per-file prompt to opencode, wait for it to finish, check and ask again if needed, then move to next file. Do not work on files yourself.
- claude code router - basically allows you to swap in other models using the real claude code cli and set up some triggers for when to use which one (e.g. in plan mode use real claude, in non-plan mode or with keywords use glm)
- opencode - this is what I'm mostly using now. similar to ccr but I find it a lot more reliable against alt models. thinking tasks go to claude, gemini, codex and lesser execution tasks go to glm 4.6 (on Cerebras).
- sub-agent mcp - Another cool way is to use an mcp (or a skill or custom /command) that runs another agent cli for certain tasks. The mcp approach is neat because then your thinker agent like claude can decide when to call the execution agents, when to call in another smart model for a review of its own thinking, etc. instead of it being an explicit choice from you. So you end up with the mcp + an AGENTS.md that instructs it to aggressively use the sub-agent mcp when it's a basic execution task, review, ...
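For anyone who wants to see the delegation loop spelled out: in the setups above the planner model drives it itself, but the same control flow (one prompt per file, wait, check, ask again if needed, move on) could be sketched as a plain script. Only `opencode run <prompt>` comes from the thread; `delegate_per_file`, `make_prompt`, `is_clean` and `max_attempts` are hypothetical names for illustration, and `is_clean` stands in for whatever cargo check you would actually run.

```python
import subprocess
from typing import Callable, Iterable


def run_opencode(prompt: str) -> int:
    """Run `opencode run <prompt>`, wait for it to finish, return its exit code."""
    return subprocess.run(["opencode", "run", prompt]).returncode


def delegate_per_file(
    files: Iterable[str],
    make_prompt: Callable[[str], str],
    is_clean: Callable[[str], bool],
    run_agent: Callable[[str], int] = run_opencode,
    max_attempts: int = 2,
) -> dict[str, bool]:
    """Hand each file to the execution agent, re-asking until the check passes.

    make_prompt and is_clean are hypothetical hooks supplied by the caller:
    one builds the per-file prompt, the other reports whether the file is
    now free of lint/format errors.
    """
    results = {}
    for path in files:
        ok = False
        for _ in range(max_attempts):
            run_agent(make_prompt(path))  # blocks until the agent is done
            if is_clean(path):            # check the result before moving on
                ok = True
                break
        results[path] = ok                # False means "still dirty, needs a human"
    return results
```

Passing `run_agent` as a parameter keeps the loop testable without the agent installed, and makes it easy to swap in a different cli for the execution step.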
I also find that with this setup, just being able to tap in an alt model when one is stuck, or get a review from an alt model, can help keep things unstuck and moving.
A simpler approach without subtasks would be to just use the smart model for Ask/Plan/whatever mode and the dumb but cheap one for the Code one, so the smart model can review the results as well and suggest improvements or fixes.
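As a minimal sketch of that mode-based split, assuming the tool exposes the current mode as a string: the model names and the set of "thinking" modes below are placeholders, not real identifiers from any tool.

```python
# Hypothetical model identifiers, not real API names.
SMART_MODEL = "smart-model"
CHEAP_MODEL = "cheap-model"

# Modes where reasoning quality matters more than cost (an assumption).
THINKING_MODES = {"ask", "plan", "review"}


def pick_model(mode: str) -> str:
    """Return the smart model for thinking modes and the cheap one otherwise."""
    return SMART_MODEL if mode.lower() in THINKING_MODES else CHEAP_MODEL


print(pick_model("Plan"))  # smart-model
print(pick_model("code"))  # cheap-model
```

Putting "review" in the thinking set is what lets the smart model check the cheap model's output afterwards.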
Best choice will depend on use cases.
It will become like cloud computing: some people will have a cloud bill of $10k/month to host their apps, while others will run their app on a $15/month VPS.
Yes, the cost discrepancy will be as big as the current one we see in cloud services.
Imagine having the hardware capacity to run things locally, but not the necessary compliance infrastructure to ensure that you aren't committing a felony under the Copyright Technofeudalism Act of 2030.
There may be other reasons to go local, but I would say that the proposed way is not cost effective.
There's also a fairly large risk that this hardware may be sufficient now but will be too small before long. So there is a large financial risk built into this approach.
The article proposes using smaller/less capable models locally. But this argument also applies to online tools! If we use less capable tools even the $20/mo subscriptions won't hit their limit.
The best way to get the correct answer on something is posting the wrong thing. Not sure where I got this from, but I remember it was in the context of stackoverflow questions getting the correct answer in the comments of a reply :)
Props to the author for their honesty and having the impetus to blog about this in the first place.