Just paying for tokens would have been much cheaper and much, much faster. I've been running Gemma4:31b and Qwen3.5 and 3.6 locally to solve AMC 8/10 math questions, and it's about 10-100x slower than just doing it online. When I tried it with ChatGPT late last year, it took about one night and $25 to solve about 1000 questions. Using my RTX 6000 and M3 Ultra with Gemma4:31b on both, drawing about 800 watts (600 for the RTX and 200 for the M3 Ultra), I got answers to about 40 questions in 7 hours, and I haven't checked how good the answers are yet.
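For anyone who wants to redo this comparison with their own numbers, here is a rough back-of-the-envelope sketch. The wattage, hours, question counts, and $25 figure are the ones quoted above; the electricity price (`USD_PER_KWH`) is purely an assumption, and hardware cost is ignored entirely.

```python
# Figures quoted in the comment above.
LOCAL_WATTS = 800        # 600 W (RTX 6000) + 200 W (M3 Ultra)
LOCAL_HOURS = 7
LOCAL_QUESTIONS = 40     # answered, correctness not yet checked
API_COST_USD = 25
API_QUESTIONS = 1000

# ASSUMPTION: electricity price, not taken from the comment.
USD_PER_KWH = 0.15


def local_stats(watts, hours, questions, usd_per_kwh):
    """Energy, time, and electricity cost per question for a local run."""
    kwh_total = watts * hours / 1000
    return {
        "kwh_total": kwh_total,
        "kwh_per_question": kwh_total / questions,
        "minutes_per_question": hours * 60 / questions,
        "usd_per_question": kwh_total * usd_per_kwh / questions,
    }


stats = local_stats(LOCAL_WATTS, LOCAL_HOURS, LOCAL_QUESTIONS, USD_PER_KWH)
api_usd_per_question = API_COST_USD / API_QUESTIONS

print(f"local: {stats['kwh_total']:.1f} kWh total, "
      f"{stats['minutes_per_question']:.1f} min/question, "
      f"${stats['usd_per_question']:.3f}/question in electricity")
print(f"API:   ${api_usd_per_question:.3f}/question")
```

With the quoted figures that is 5.6 kWh for the run, 0.14 kWh and 10.5 minutes per question. At the assumed $0.15/kWh, electricity alone comes to roughly the same per-question price as the API, before counting the hardware.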
At the very least I'm going to try to sell my M3 Ultra if I can find a reliable place to sell it without getting ripped off by scammers.
Yes, this is exactly what I'm doing. I isolated the actual math question and then sent it to my two servers to process, and that's what's taking 10m+ to return. I'm asking them to solve the question and return the full answer along with their steps. I care about correctness, so taking some time is okay, but I can't use 10m per solution.
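In case it helps anyone with a similar two-box setup, here is a minimal sketch of sharing one question queue between servers, so whichever machine finishes first picks up the next question. Everything here is hypothetical: `dispatch` and the solver callables are stand-ins, and how each callable actually talks to a model server is left out. It does not make any single question faster; it only keeps both machines busy.

```python
import queue
import threading


def dispatch(questions, solvers):
    """Run questions across several solvers from one shared queue.

    `solvers` maps a server name to a callable(question) -> answer text.
    Returns a list of (server_name, answer) in the original question order.
    """
    pending = queue.Queue()
    for index, question in enumerate(questions):
        pending.put((index, question))
    results = [None] * len(questions)

    def worker(name, solve):
        while True:
            try:
                index, question = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            try:
                results[index] = (name, solve(question))
            except Exception as exc:  # record the failure, keep going
                results[index] = (name, f"error: {exc}")

    threads = [
        threading.Thread(target=worker, args=(name, solve))
        for name, solve in solvers.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Stand-in solvers; real ones would call the RTX and M3 Ultra servers.
demo = dispatch(
    ["2 + 2", "3 * 5"],
    {"rtx": lambda q: str(eval(q)), "m3": lambda q: str(eval(q))},
)
print(demo)
```

Each slot in `results` is written by exactly one worker, so no extra locking is needed beyond what `queue.Queue` already provides.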
I saw your heat comments about the RTX 6000 Pro as well. I bought a few of them recently and I'm running 2 of them in a 2U case in a colo. You need a lot of active airflow to keep them cool. Mine range from 23 C to 80 C.
After my last run, I'm going to wait for the new case I ordered to come in and cannibalize my kid's PC, which we built at the beginning of this year, to form an entirely separate computer. Then I'll figure out better ways to deal with the heat, especially with summer coming up. I'll have to play around with undervolting and running vents directly outside my house to see if that helps.
But the trend here is interesting. I think by 2030 you'll be able to buy, fairly cheaply, hardware that currently costs $10k+. I don't know what this does to the trillions invested in AI data centers, because the next NVIDIA architecture after Blackwell will essentially halve the value of purchased cards overnight.
I'm not convinced Apple has pivoted the Mac Studio line towards this market yet, and the expected M5 Ultras in Q3 2026 will likely be an incremental improvement rather than a big leap forward, but I'd like to be proven wrong.
I feel that the open weight models pale in comparison to the frontier models, and I believe that if the gap closes quickly, the open weight vendors will stop releasing them for free.
- https://www.williamangel.net/blog/2026/05/17/offline-llm-ene... - Discussion: https://news.ycombinator.com/item?id=48168198
Because that wasn't what they claimed to research?
>> for inference it's definitely not worth it.
It's entirely fine if you enjoy local LLMs on your computer, there are people doing horribly inefficient inference on smartphones now. But for pure inference tasks, it's pretty obvious why M5s and Mac Studios aren't replacing TPUs and GPUs.
The idea is obviously to be running the LLM on your work laptop. As a developer I'd need a laptop with 24GB of RAM for work anyway, and 48GB, which is enough for a very good quant of Gemini, is just $400 extra.
But yes, for pure inference, the M5 Max MacBook Pros probably aren't there yet. They have other utility though, of course. And you can get 64GB and 128GB MBPs at a discount. Micro Center will currently let you buy a 64GB M5 Max MBP for under $4k, for example.
It just scares me to own a $48K box in my house, especially if it breaks or gets stolen.
No wonder gamers hate AI bros.
For a lot of research questions, even 6 GPUs is overkill.
It’s one of the reasons I’m skeptical of the “trillion dollar supercluster” idea [0]. I think what we need is more reasonably smart people investigating medium-sized problems. A “GPU middle class” you might say.
[0] https://situational-awareness.ai/racing-to-the-trillion-doll...
"If I were to do this again, I wouldn’t do a custom build like this. I would buy a standard datacenter server and rent space in a colocation center"
I'm sure there are use cases when renting makes sense, but it can get crazy expensive really fast if you're not careful.
(I would assume they haven't made a lot of $ off of this, if nothing else because they've only just put out that post and demo. They do seem to have produced a model that doesn't sound very LLM-y to my ear, though it also seems rather weak for its size.)
Cynical take: They made an LLM that can bypass existing AI slop detectors.
Realistic take: They found a research problem they thought was interesting, dumped a bunch of capital and sweat equity into it, and (claim to have, at least) found a solution. Neat!
https://rosmine.ai/2026/05/18/fixing-llm-writing-with-distri...
"I spent a long time trying high risk/high reward experiments and failing. But now I have something good. I’ve solved a major problem with LLMs. And I’m launching next Monday so we will soon see if it’s actually a breakthrough or just LLM psychosis "
Maybe AI companies today have some bounty program?
Or a person who did have a great way to monetize the same workload would probably find a lot of value in reading this post.
While I'm skeptical that there is much of a moat, at least against the large players, it should hopefully at least set rosmine up for the next job :)
It does seem to fix the current biggest issues with using LLMs for writing at various publishers. If you're The Economist, you have a very specific house style and a decent corpus of articles written in that style. At least on my reading of it, rosmine can use DFT to get a model's outputs to closely match, in terms of the language quirks generated, the corpus it is fine-tuned on, i.e. it will very much match the house style as it is actually used in writing. Compare that with giving a system prompt to an LLM that has some Economist articles in its vast training set and telling it to write in that style: it will do an OK job, but still exhibit LLM language quirks despite itself. Even if you feed it the specific "style guide" that they give their authors, I dare say the reality of their writing is the best place to learn from, and it sounds like DFT can ground the writing of a model in a specific corpus like that.
[1]: https://rosmine.ai/2026/05/18/fixing-llm-writing-with-distri...
They do it well enough that it'd take really good output to beat.
If your goal is to, say, write science fiction, their reversion to classic LLM-isms is really distracting, and it is what makes people say at a glance that it was written by an LLM. You basically can't use them at the moment in any real "natural" long-form writing. Everyone will call "slop" pretty quickly on the current frontier models.
Rosmine's DFT paper is worth a read.
Cloud is optimized for development velocity, but because it is a high-margin business, on-prem eventually becomes the more attractive option.
It could be too late, but it might be worth looking into tax savings if you have a business. Depreciation of an asset counts as a loss and may be deductible from your income. (I'm NOT a tax expert)
As the author notes, there are also electrical/wiring issues that cap how much compute gear you can run in a space not designed for it. I suspect a standard 20A 110V circuit can probably handle 2x RTX 6000 Pros. A 15A circuit probably can too, but that requires more research. Anything more than that and you're using multiple circuits, which has its own issues, or you need an upgraded circuit (e.g. 40A 240V) with all that entails (e.g. heavier-duty cables, custom plug, etc).
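A rough way to sanity-check those circuit guesses is sketched below. It assumes the common US rule of thumb that continuous loads should stay at or under 80% of the breaker rating, the 600 W per-card figure quoted earlier in the thread, and a made-up 250 W for the rest of the system (`REST_OF_SYSTEM_WATTS` is only a placeholder). None of this is electrical advice.

```python
# ASSUMPTION: continuous loads limited to 80% of breaker rating.
CONTINUOUS_FACTOR = 0.8
# Per-card draw quoted earlier in the thread for the RTX 6000.
GPU_WATTS = 600
# ASSUMPTION: CPU, drives, fans, PSU losses; placeholder value.
REST_OF_SYSTEM_WATTS = 250


def circuit_budget_watts(amps, volts, factor=CONTINUOUS_FACTOR):
    """Usable continuous power on a circuit, in watts."""
    return amps * volts * factor


def headroom_watts(amps, volts, n_gpus, rest_of_system_watts):
    """Budget minus estimated load; negative means over budget."""
    load = n_gpus * GPU_WATTS + rest_of_system_watts
    return circuit_budget_watts(amps, volts) - load


for amps in (20, 15):
    budget = circuit_budget_watts(amps, 110)
    spare = headroom_watts(amps, 110, 2, REST_OF_SYSTEM_WATTS)
    print(f"{amps}A 110V: budget {budget:.0f} W, headroom {spare:.0f} W")
```

Under these assumptions the 20A circuit has about 1760 W of continuous budget and fits two cards plus the rest of the box, while the 15A circuit has about 1320 W, which covers the two cards alone (1200 W) but leaves very little for everything else.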