Posted by mpweiher 1 day ago
Are people really doing that?
If that's you, know that you can get a LONG way on the $20/month plans from OpenAI and Anthropic. The OpenAI one in particular is a great deal, because Codex is charged a whole lot lower than Claude.
The time to cough up $100 or $200/month is when you've exhausted your $20/month quota and you are frustrated at getting cut off. At that point you should be able to make a responsible decision by yourself.
My monthly spend on ai models is < $1
I'm not cheap, just ahead of the curve. With the collapse in inference cost, everything will be this eventually
I'll basically do
$ man tool | <how do I do this with the tool>
or even $ cat source | <find the flags and give me some documentation on how to use this>
Things I used to do intensively I now do lazily.I've even made a IEITYuan/Yuan-embedding-2.0-en database of my manpages with chroma and then I can just ask my local documentation how I do something conceptually, get the man pages, inject them into local qwen context window using my mansnip llm preprocessor, forward the prompt and then get usable real results.
In practice it's this:
$ what-man "some obscure question about nfs"
...chug chug chug (about 5 seconds)...
<answer with citations back to the doc pages>
Essentially I'm not asking the models to think, just do NLP and process text. They can do that really reliably.It helps combat a frequent tendency for documentation authors to bury the most common and useful flags deep in the documentation and lead with those that were most challenging or interesting to program instead.
I understand the inclination it's just not all that helpful for me
If you aren't using coding models you aren't ahead of the curve.
There are free coding models. I use them heavily. They are ok but only partial substitutes for frontier models.
Some people, with some tasks, get great results
But me, with my tasks, I need to maintain provenance and accountability over the code. I can't just have AI fly by the seat of its pants.
I can get into lots of detail on this. If you have seen tools and setups I have done you'd realize why it doesn't work for me.
I've spent money, the results for me, with my tasks, have not been the right decision.
$ man tool | <how do I do this with the tool>
or even
$ cat source | <find the flags and give me some documentation on how to use this>Could you please elaborate on this? Do I get this right that you can set up your your command line so that you can pipe something to a command that sends this something together with a question to an LLM? Or did you just mean that metaphorically? Sorry if this is a stupid question.
Example:
$ man tar | llm "how do I extract test.txt from a tar.gz"Actually for many cases the LLM already knows enough. For more obscure cases, piping in a --help output is also sometimes enough.
where ai could be a simple shell script combining the argument with stdin
$ man --html="$(which markitdown)" <man page>
That goes man -> html -> markdown which is not only token efficient but also llms are pretty good at creating hierarchies from markdownMy tool can read stdin, send it to an LLM, and do a couple nice things with the reply. Not exactly RAG, but most man pages fit into the context window so it's okay.
llm 'output a .gitignore file for typical python project that I can pipe into the actual file ' > .gitignore
> I'm not cheap
You're cheap. It's okay. We're all developers here. It's a safe space.
I'm not convinced.
I'm convinced you don't value your time. As Simon said, throw $20-$100/mo and get the best state of the art models with "near 0" setup and move on.
For most of my work I only need the LLM to perform a structured search of the codebase or to refactor something faster than I can type, so the $20/month plan is fine for me.
But for someone trying to get the LLM to write code for them, I could see the $20/month plans being exhausted very quickly. My experience with trying “vibecoding” style app development, even with highly detailed design documents and even providing test case expected output, has felt like lighting tokens on fire at a phenomenal rate. If I don’t interrupt every couple of commands and point out some mistake or wrong direction it can spin seemingly for hours trying to deal with one little problem after another. This is less obvious when doing something basic like a simple React app, but becomes extremely obvious once you deviate from material that’s represented a lot in training materials.
With Gemini/Antigravity, there’s the added benefit of switching to Claude Code Opus 4.5 once you hit your Gemini quota, and Google is waaaay more generous than Claude. I can use Opus alone for the entire coding session. It is bonkers.
So having subscribed to all three at their lowest subscriptions (for $60/mo) I get the best of each one and never run out of quota. I’ve also got a couple of open-source model subscriptions but I’ve barely had the chance to use them since Codex and Gemini got so good (and generous).
The fact that OpenAI is only spending 30% of their revenue on servers and inference despite being so generous is just mind boggling to me. I think the good times are likely going to last.
My advise - get Gemini + Codex lowest tier subscriptions. Add some credits to your codex subscription in case you hit the quota and can’t wait. You’ll never be spending over $100 even if you’re building complex apps like me.
This entire comment is confusing. Why are you buying the $200/month plan if you’re only using 10% of it?
I rotate providers. My comment above applies to all of them. It really depends on the work you’re doing and the codebase. There are tasks where I can get decent results and barely make the usage bar move. There are other tasks where I’ve seen the usage bar jump over 20% for the session before I get any usable responses back. It really depends.
For context, this was a few months ago when GPT 5 was new and I was used to constantly hitting o3 limits. It was an experiment to see if the higher plan could pay for itself. It most certainly can but I realized that I just don’t need it. My workflow has evolved into switching between different agents on the same project. So now I have much less of a need for any one.
You should also queue up many "continue ur work" type messages.
Note: I’m using the $20 plan for this! With codex-5.2-medium most of the time (previously codex-5.1-max-medium). For my work projects, Gemini 3 and Antigravity Claude Opus 4.5 are doing the heavy lifting at the moment, which frees up codex :) I usually have it running constantly in a second tab.
The only way I can now justify Pro is if I am developing multiple parallel projects with codex alone. But that isn’t the case for me. I am happier having a mix of agents to work with.
I've been doing something like this with the basic Gemini subscription using Antigravity. I end up hitting the Gemini 3 Pro High quota many times but then I can still use Claude Opus 4.5 on it!
Ah, I missed this part. Yes, this is basically what I would recommend today as well. Buy a couple of different frontier model provider basic subscriptions. See which works better on what problems. For me, I use them all. For someone else it might be codex alone. Ymmv but totally worth exploring!
This is why it’s confusing, though. Why start with the highest plan as the starting point when it’s so easy to upgrade?
I’m just a simple dude trying to optimize his life.
It's worth noting that the Claude subscription seems notably less than the others.
Also there are good free options for code review.
It could take longer, but save your subscription tokens.
https://geminicli.com/docs/faq/
> What is the privacy policy for using Gemini Code Assist or Gemini CLI if I’ve subscribed to Google AI Pro or Ultra?
> To learn more about your privacy policy and terms of service governed by your subscription, visit Gemini Code Assist: Terms of Service and Privacy Policies.
> https://developers.google.com/gemini-code-assist/resources/p...
The last page only links to generic Google policies. If they didn't train on it, they could've easily said so, which they've done in other cases - e.g. for Google Studio and CLI they clearly say "If you use a billed API key we don't train, else we train". Yet for the Pro and Ultra subscriptions they don't say anything.
This also tracks with the fact that they enormously cripple the Gemini app if you turn off "apps activity" even for paying users.
If any Googlers read this, and you don't train on paying Pro/Ultra, you need to state this clearly somewhere as you've done with other products. Until then the assumption should be that you do train on it.
Service Terms
17. Training Restriction. Google will not use Customer Data to train or fine-tune any AI/ML models without Customer's prior permission or instruction.
[1] https://cloud.google.com/terms/service-terms[2] https://docs.github.com/en/copilot/reference/ai-models/model...
I originally thought they only supported the previous generation models i.e. Claude Opus 4.1 and Gemini 2.5 Pro based on the copy on their pricing page [1] but clicking through [2] shows that they support far more models.
Lately Copilot have been getting access to new frontier models the same day they release elsewhere. That wasn't the case months ago (GPT 5.1). But annoyingly you have to explicitly enable each new model.
Anthropic has an option to opt out of training and delete the chats from their cloud in 30 days.
> The time to cough up $100 or $200/month is when you've exhausted your $20/month quota and you are frustrated at getting cut off. At that point you should be able to make a responsible decision by yourself.
These are the same people, by and large. What I have seen is users who purely vibe code everything and run into the limits of the $20/m models and pay up for the more expensive ones. Essentially they're trading learning coding (and time, in some cases, it's not always faster to vibe code than do it yourself) for money.
I don't pay $100 to "vibe code" and "learn to program" or "avoid learning to program."
I pay $100 so I can get my personal (open source) projects done faster and more completely without having to hire people with money I don't have.
I review all of it, but hand write little of it. It's bizarre how I've ended up here, but yep.
That said, I wouldn't / don't trust it with something from scratch, I only trust it to do that because I built -- by hand -- a decent foundation for it to start from.
But I’ve not found that to be true at all. My actually engineered processes where I care the most is where I push tokens the hardest. Mostly because I’m using llms in many places in the sdlc.
When I’m vibing it’s just a single agent sort of puttering along. It uses much fewer tokens.
I said "by and large" ie generally speaking. As I mentioned before, the exception does not invalidate the trend. I assume HN is more heavily weighted towards non-vibe-coders using up tokens like me and you but again, that's the exception to what I see online elsewhere.
Restoring a bit of balance to things.
A "vibecoder" is to a programmer what script kiddie is to a hacker.
And when pressed on “this doesn't make sense, are you sure this works?” they ask the model to answer, it gets it wrong, and they leave it at that.
That hasn't been true with Opus 4.5. I usually hit my limit after an hour of intense sessions.
1. Do you start off using the Claude Code CLI, then when you hit limits, you switch to the GitHub Copilot CLI to finish whatever it is you are working on?
2. Or, you spend most of your time inside VSCode so the model switching happens inside an IDE?
3. Or, you are more of a strict browser-only user, like antirez :)?
Do you mean that users should start a new chat for every new task, to save tokens? Thanks.
On the other hand, Claude has been nothing but productive for me.
I’m also confused why you don’t assume people have the intelligence to only upgrade when needed. Isn’t that what we’re all doing? Why would you assume people would immediately sign up for the most expensive plan that they don’t need? I already assumed everyone starts on the lowest plan and quickly runs into session limits and then upgrades.
Also coaching people on which paid plan to sign up for kinda has nothing to do with running a local model, which is what this article is about
Incidentally, wondering if anyone has seen this approach of asking Claude to manage Codex:
https://www.reddit.com/r/codex/comments/1pbqt0v/using_codex_...
(I also have the same MBP the author has and have used Aider with Qwen locally.)
I just can't accept how slow codex is, and that you can't really use it interactively because of that. I prefer to just watch Claude code work and stop it once I don't like the direction it's taking.
Codex models tend to be extremely good at following instructions, to the point that it won't do any additional work unless you ask it to. GPT-5.1 and GPT-5.2 on the other hand is a little bit more creative.
Models from Anthropics on the other hand is a lot more loosy goosy on the instructions, and you need to keep an eye on it much more often.
I'm using models interchangeably from both providers all the time depending on the task at hand. No real preference if one is better then the other, they're just specialized on different things
Sonnet 4.5 is great for vibe coding. You can give it a relatively vague prompt and it will take the initiative to interpret it in a reasonable way. This is good for non-programmers who just want to give the model a vague idea and end up with a working, sensible product.
But I usually do not want that, I do not want the model to take liberties and be creative. I want the model to do precisely what I tell it and nothing more. In my experience, te GPT-5.x models are a better fit for that way of working.
YMMV based on the kinds of side projects you do, but it's definitely been cheaper for me in the long run to pay by token, and the flexibility it offers is great.
If I wasn't only using it for side projects I'd have to cough up the $200 out of necessity.
Claude Code is a whole lot less generous though.
I havent tried agentic coding as I havent set it up in a container yet, and not going to yolo my system (doing stuff via chat and a utility to copy and paste directories and files got me pretty far over the last year and a half).
If you're doing mostly smaller changes, you can go all day with the 20$ Claude plan without hitting the limits. Especially if you need to thoroughly review the AI changes for correctness, instead of relying on automated tests.
From what my team tells me, it's not a great deal since it's so far behind Claude in capabilities and IDE integration.
Sure am. Capacity to finish personal projects has tripled for a mere $200/month. Would purchase again.
leo dicaprio snapping gif
These kinds of articles should focus on use case because mileage may vary depending on maturity of idea, testing and host of other factors.
If the app, service, or whatever is unproven, that's a sunk cost on macbook vs. 4 weeks to validate an idea which is a pretty long time.
If the idea is sound then run it on macbook :)
When I consider it against my other hobbies, $100 is pretty reasonable for a month of supply. That being said, I wouldn’t do it every month. Just the months I need it.
in my experience cursor is nicer to work with the openai/anthropic cli tools
Not a serious question but I thought it's an interesting way of looking at value.
I used to sell cars in SF. Some people wouldn't negotiate over $50 on a $500 a month lease because their apartment was $4k anyway.
Other people WOULD negotiate over $50 because their apartment was $4k.
The $20 Anthropic plan is only enough to wet my appetite, I can't finish anything.
I pay for $100 Anthropic plan, and keep a $20 Codex plan in my back pocket for getting it to do additional review and analysis overtop of what Opus cooks up.
And I have a few small $ of misc credits in DeepSeek and Kimi K2 AI services mainly to try them out, and for tasks that aren't as complicated, and for writing my own agent tools.
$20 Claude doesn't go very far.
That said, the privacy argument is compelling for commercial projects. Running inference locally means no training data concerns, no rate limits during critical debugging sessions, and no dependency on external API uptime. We're building Prysm (analytics SaaS) and considered local models for our AI features, but the accuracy gap on complex multi-step reasoning was too large. We ended up with a hybrid: GPT-4o-mini for simple queries, GPT-4 for analysis, and potentially local models for PII-sensitive data processing.
The TCO calculation should also factor in GPU depreciation and electricity costs. A 4090 pulling 450W at $0.15/kWh for 8 hours/day is ~$200/year just in power, plus ~$1600 amortized over 3 years. That's $733/year before you even start inferencing. You need to be spending $61+/month on Claude to break even, and that's assuming local performance is equivalent.
Those aren't useful numbers.
With LLMs, I feel like price isn't the main factor: my time is valuable, and a tool that doesn't improve the way I work is just a toy.
That said, I do have hope, as the small models are getting better.
It works, but it's slow. Much more like set it up and come back in an hour and it's done. I am incredibly impressed by it. There are quantized GGUFs and MLXs of the 123B, which can fit on my M3 36GB Macbook that I haven't tried yet.
But overall, it feels about about 50% too slow, which blows my mind because we are probably 9 months away from a local model that is fast and good enough for my script kiddie work.
So my guess would be - we need open conversation or something along the line of "useful linguistic-AI approaches for combing and grooming code"
But the bottom line is that I still can't find a way to use either local LLMs and/or opencode and crush for coding.
Somewhat comically, the author seems to have made it about 2 days. Out of 1,825. I think the real story is the folly of fixating your eyes on shiny new hardware and searching for justifications. I'm too ashamed to admit how many times I've done that dance...
Local models are purely for fun, hobby, and extreme privacy paranoia. If you really want privacy beyond a ToS guarantee, just lease a server (I know they can still be spying on that, but it's a threshold.)
Same thing will happen with these tools, just a matter of time.
I'm not going to pay monthly for X service when similar Y thing can be purchased once (or ideally open source downloaded), self-hosted, and it's your setup forever.
Ideally Free software downloaded. Even more ideally copyleft Free software downloaded.
I haven't tried the local models as much but I'd find it difficult to believe that they would outperform the 2024 models from OpenAI or Anthropic.
The only major algorithmic shift was done towards the RLVR and I believe it was already being applied during the 2023-2024.
It's impressive to see what I can run locally, but they're just not at the level of anything from the GPT-4 era in my experience.
But for SOTA performance you need specialized hardware. Even for Open Weight models.
40k in consumer hardware is never going to compete with 40k of AI specialized GPUs/servers.
Your link starts with:
> "Using a single top-of-the-line gaming GPU like NVIDIA’s RTX 5090 (under $2500), anyone can locally run models matching the absolute frontier of LLM performance from just 6 to 12 months ago."
I highly doubt a RTX 5090 can run anything that competes with Sonnet 3.5 which was released June, 2024.
I don't know about the capabilities of a 5090 but you probably can run a Devstral-2 [1] model locally on a Mac with good performance. Even the small Devstral-2 model (24b) seems to easily beat Sonnet 3.5 [2]. My impression is that local models have made huge progress.
Coding aside I'm also impressed by the Ministral models (3b, 8b and 14b) Mistral AI released a a couple of weeks ago. The Granite 4.0 models by IBM also seem capable in this context.
I've played with Devstral 2 a lot since it came out. I've seen the benchmarks. I just don't believe it's actually better for coding.
It's amazing that it can do some light coding locally. I think it's great that we have that. But if I had to choose between a 2024-era model and Devstral 2 I'd pick the older Sonnet or GPTs any day.
It's neat to play with, but not practical.
The only story that I can see that makes sense for running at home is if you're going to fine tune a model by taking an open weight model and <hand waving> doing things to it and running that. Even then I believe there's places (hugging face?) that will host and run your updated model for cheaper than you could run it yourself.
For general purpose LLM probably yes. For something very domain-specialized not necessarily.
That's not the same as discounting the open weight models though. I use DeepSeek 3.2 heavily, and was impressed by the Devstral launch recently. (I tried Kimi K2 and was less impressed). I don't use them for coding so much as for other purposes... but the key thing about them is that they're cheap on API providers. I put $15 into my deepseek platform account two months ago, use it all the time, and still have $8 left.
I think the open weight models are 8 months behind the frontier models, and that's awesome. Especially when you consider you can fine tune them for a given problem domain...
Well, the hardware remains the same but local models get better and more efficient, so I don't think there is much difference between paying 5k for online models over 5 years vs getting a laptop (and well, you'll need a laptop anyway, so why not just get a good enough one to run local models in the first place?).
Even still, right now is when the first gen of pure LLM focused design chipsets are getting into data centers.
Unless you're YOLOing it, you can review only at a certain speed, and for a certain number of hours a day.
The only tokens/s you need is one that can keep you busy, and I expect that even a slow 5token/sec model utilised 60s in every minute, 60m of every hour and 24 hours of every day is way more than you can review in a single working day.
The goal we should be moving towards is longer-running tasks, not quicker responses, because if I can schedule 30 tasks to my local LLm before bed, then wake up in the morning and schedule a different 30, and only then start reviewing, then I will spend the whole day just reviewing while the LLM is generating code for tomorrow's review. And for this workflow a local model running 5 tokens/s is sufficient.
If you're working serially, i.e. ask the LLM to do something, then review what it gave you, then ask it to do the next thing, then sure, you need as many tokens per second as possible.
Personally, I want to move to long-running tasks and not have to babysit the thing all day, checking in at 5m intervals.
I always find it funny when the same people who were adamant that GPT-4 was game-changer level of intelligence are now dismissing local models that are both way more competent and much faster than GPT-4 was.
For simple compute, its usefulness curve is a log scale. 10x faster may only be 2x more useful. For LLMs (and human intelligence) its more quadratic, if not inverse log (140IQ human can do maths that you cannot do with 2x 70IQ humans. And I know, IQ is not a good/real metric, but you get the point)
If Claude 3 Sonnet was good enough to be your daily driver last year, surely something that is as powerful is good enough to be your daily driver today. It's not like the amount of work you must do to get paid doubled over the past year or anything.
Some people just feel the need to live always on the edge for no particular reason.
The above paragraph is meant to be a compliment.
But justifying it based on keeping his Mac for five years is crazy. At the rate things are moving, coding models are going to get so much better in a year, the gap is going to widen.
Also in the case of his father where he is working for a company that must use a self hosted model or any other company that needed it, would a $10K Mac Studio with 512GB RAM be worth it? What about two Mac Studios connected over Thunderbolt using the newly released support in macOS 26?
LM Studio can run both MLX and GGUF models but does so from an Ollama style (but more full-featured) macOS GUI. They also have a very actively maintained model catalog at https://lmstudio.ai/models
but people should use llama.cpp instead
I had no problems with ROCm 6.x but couldn't get it to run with ROCm 7.x. I switched to Vulkan and the performance seems ok for my use cases
MLX is a lot more performant than Ollama and llama.cpp on Apple Silicon, comparing both peak memory usage + tok/s output.
edit: LM Studio benefits from MLX optimizations when running MLX compatible models.
and why should that affect usage? it's not like ollama users fork the repo before installing it.
But vLLM and Sglang tend to be faster than both of those.
It's cross-platform (Win/Mac/Linux), detects the most appropriate GPU in your system and tells you whether the model you want to download will run within it's RAM footprint.
It lets you set up a local server that you can access through API calls as if you were remotely connected to an online service.
- Cross-platform
- Sets up a local API server
The tradeoff is a somewhat higher learning curve, since you need to manually browse the model library and choose the model/quantization that best fit your workflow and hardware. OTOH, it's also open-source unlike LMStudio which is proprietary.
[edit] Oh and apparently you can also directly run some models directly from HuggingFace: https://huggingface.co/docs/hub/ollama
If you've ever used a terminal, use llama.cpp. You can also directly run models from llama.cpp afaik.
I mean, what's the point of using local models if you can't trust the app itself?
and you think ollama doesn't do telemetry/etc. just because it's open source?
Just as we had the golden era of the internet in the late 90s, when the WWW was an eden of certificate-less homepages with spinning skulls on geocities without ad tracking, we are now in the golden era of agentic coding where massive companies make eye watering losses so we can use models without any concerns.
But this won't last and Local Llamas will become a compelling idea to use, particularly when there will be a big second hand market of GPUs from liquidated companies.
We have already seen cost cutting for some models. A model starts strong, but over time the parent company switches to heavily quantized versions to save on compute costs.
Companies are bleeding money, and eventually this will need to adjust, even for a behemoth like Google.
That is why running local models is important.