Posted by birdculture 4 hours ago

I put a datacenter GPU in my gaming PC (blog.tymscar.com)
173 points | 118 comments
sonzohan 55 minutes ago
I also recently decided to buy a datacenter GPU and slap it into a system. Some notes from my experience that the author doesn't mention in their article:

Decommissioned NVIDIA V100s and AMD MI50s are fairly cheap for local experimentation: about $200 for 16GB and $400-500 for 32GB. They are also very old. There's an enthusiast community keeping these two cards alive and working with current platforms and models.

Nitpick, but the V100 doesn't support bfloat16. The performance hit is not a big deal if you're fiddling with local models, but the card is on its way out in terms of hardware features.

The MI50 does support bf16, but the current release of AMD ROCm no longer supports the card. Vulkan support is good and the MI50 works with most major platforms (llama.cpp, vLLM, etc.), but it's not without some pain points like manual recompilation. Fortunately the open source community has already paid most of your way.

The cooling requirements for these cards cannot be overstated. A consumer-grade GPU may throttle in a small case without additional fans, but a datacenter GPU given the same treatment will overheat even while idling. You will need to buy, at least, a bunch of decent 120mm fans to prevent this, or invest in some water cooling.

I ultimately went with an AMD MI100 32GB ($950). I'm an AMD fan, current ROCm editions support it, and it was low-fuss to get things working. I'm debating getting a second so I can try out bigger models like qwen3-coder-next.

Silagi 11 minutes ago
Did you consider the R9700 or B70 when you went for the MI100? If so, what made you choose the MI100?

I've been playing with picking up a card in this class but haven't been able to justify it when running the Qwen3.6 MOE model on a 6800xt is tolerable for the type of projects I've been willing to point local AI at.

Teknomadix 3 hours ago
Tesla V100 SXM2 16GB is NOT DGX class as the author writes. It's HGX class. The V100 comes in two classes, SXM2 and SXM4, the latter coming with a max of 80GB of onboard memory. Typically these are installed as 8×A100 80GB SXM4 on an HGX riser, which gives you NVSwitch fabric and 640GB of pooled HBM2e (on-package stacked memory with ~2 TB/s of memory bandwidth). Standard 2U rack footprint, too.
legitronics 1 hour ago
I have no idea what you are trying to say.

The V100 came as SXM2 and SXM3, in 16GB and 32GB variants.

HGX is DGX with extra toppings.

mickeyp 3 hours ago
Impressive work. But the problem is not the 30 tok/s, which is fine for agentic coding and chat.

It's prefill; slow prefill kills agentic workloads dead.

If you have 100,000 tokens to prefill at ~150 tok/s, per the OP, you're looking at:

    You have: 100000 / (150/s)
    You want: hms
        11 min + 6.6666667 sec
Which is quite a wait indeed.
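
For anyone without `units` handy, here is the same arithmetic as a small Python sketch. The function name is made up for illustration; the 100,000-token prompt and ~150 tok/s prefill rate are the figures quoted above, not measurements of my own.

```python
def prefill_wait(prompt_tokens: int, prefill_tok_per_s: float) -> tuple[int, float]:
    """Return (minutes, seconds) spent on prefill before the first output token."""
    total_seconds = prompt_tokens / prefill_tok_per_s
    minutes, seconds = divmod(total_seconds, 60)
    return int(minutes), seconds


# Figures from the comment above: 100,000 tokens at ~150 tok/s.
minutes, seconds = prefill_wait(100_000, 150)
print(f"{minutes} min + {seconds:.1f} sec")  # 11 min + 6.7 sec
```

Swapping in your own context size and measured prefill rate gives a quick feel for how long each cold start will take.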
HarHarVeryFunny 1 hour ago
I wonder if this could be usefully mitigated with a combination of prompt (prefix) caching and an agent that let you control what the prompt prefix consisted of. The goal would be to incur that slow prefill once to build the prompt cache, then have subsequent prompts consist of mostly this fixed prefix plus specific instructions.

For a language like C++ where modules are split into definition (.h) and implementation (.cpp) parts, one choice of prefix would be all the header files for the project (which aren't likely to change much).

More generally, the idea would be to have an agent that had cached-prefix reuse as its primary context management goal.

Another possibility, to support caching of files that have since changed, would be for the agent to build the context as a fixed prefix reflecting some or all of the codebase in its start-of-session state, then append any changes to that, with appropriate prompting to only use the latest definition of a function.

e.g.

Say file A initially contains functions X, Y and Z, then the prompt prefix is built to include X Y Z. If the user then modifies Y -> Y', then just add that to the context, so that the cached prefix is unchanged, giving X Y Z Y'.
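
A minimal sketch of that append-only idea, assuming nothing about any particular agent or inference server. The class and function names are hypothetical, and shared leading characters stand in for the shared leading tokens that a real prefix cache would match on.

```python
from dataclasses import dataclass, field


@dataclass
class AppendOnlyContext:
    """Keeps a frozen prefix so a prefix cache can be reused across turns."""
    prefix: str                                       # built once at session start
    updates: list[str] = field(default_factory=list)  # newer definitions, in order

    def record_change(self, name: str, new_source: str) -> None:
        # Never edit the prefix; append the newer definition instead.
        self.updates.append(
            f"# UPDATED {name} (supersedes any earlier version)\n{new_source}"
        )

    def render(self, instruction: str) -> str:
        return "\n".join([self.prefix, *self.updates, instruction])


def shared_prefix_len(a: str, b: str) -> int:
    """Count leading characters two prompts share (a stand-in for cached tokens)."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


# File A starts out with X, Y and Z; then the user modifies Y.
ctx = AppendOnlyContext(prefix="def X(): ...\ndef Y(): ...\ndef Z(): ...")
before = ctx.render("Explain Y.")
ctx.record_change("Y", "def Y(): return 1")
after = ctx.render("Explain Y.")

# The original prefix is untouched, so the cached part is still reusable.
assert after.startswith(ctx.prefix)
print(shared_prefix_len(before, after) >= len(ctx.prefix))  # True
```

The trade-off is that the context only ever grows, so at some point the agent would have to rebuild the prefix and pay the slow prefill again.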

Aurornis 3 hours ago
Most people won’t be dumping 100K tokens into it at once, but I agree that all of the prefill time that adds up during a session becomes a lot to account for.

This is also a problem for all of the Mac local LLMs. Macs are a great way to get a lot of high bandwidth memory, but their compute is very far behind current gen dedicated GPUs. Some of the expensive Mac Studio setups allow you to run very large models with usable tokens/s, but you can be waiting a long time for it to get to the point of generating those tokens.

Tepix 36 minutes ago
When you're using OpenCode it's easy to reach 100,000 tokens after a while.
pastage 57 minutes ago
A quick search says this is a standard feature: you cache the prefill and load it back at PCIe bandwidth, so it should take about 0.2s.
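
A rough way to sanity-check that kind of figure, as a sketch only. Every number below is an assumption picked for illustration: the model shape is hypothetical, the cache is assumed to be fp16 (2 bytes per value), and the link is assumed to move about 16 GB/s, roughly the theoretical rate of PCIe 3.0 x16.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    # Factor of 2: one key and one value vector per layer per KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def transfer_seconds(num_bytes, link_bytes_per_s):
    return num_bytes / link_bytes_per_s


# Hypothetical model shape, chosen only for illustration.
size = kv_cache_bytes(tokens=100_000, layers=32, kv_heads=8, head_dim=128)
# Assumed ~16 GB/s link; real-world throughput will be lower.
seconds = transfer_seconds(size, 16e9)
print(f"{size / 1e9:.1f} GB, {seconds:.2f} s")  # 13.1 GB, 0.82 s
```

With these made-up numbers the reload comes in under a second, the same ballpark as the estimate above and far below the 11 minutes of a cold prefill. The cache has to be kept somewhere in the meantime, though.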
mettamage 31 minutes ago
> The way it works is that a vision encoder (similar to what ChatGPT and Claude use) takes image pixels and translates them into the LLM’s token embedding space. The model does not “see” the image the way a human does. Instead, the vision encoder compresses the image into a sequence of vectors that live in the same mathematical space as text tokens. The LLM then processes those vectors as if they were just another sequence of tokens.

Could you also do this for music and specifically sound synthesis? It would be awesome to vibe synthesize sounds and then see the VSTi parameters surrounding it.

jonhohle 1 hour ago
I was just looking into this and was worried about the fan setup. Interesting that he was able to solve it with good results.

In case anyone is interested, I’m using PCIe passthrough on a FreeBSD host to a Linux guest with an older Pascal card. It’s worked great and I’ve been thinking about putting a nicer card in there. The SXM route seems great, but I’ve been burned (almost literally because of the heat) by DC components before.

bob1029 3 hours ago
> And yes, if you want the absolute best, Opus 4.8 exists. It also costs more per 20 minutes of heavy use than I paid for this entire GPU and adapter setup combined. But the gap is shockingly small.

I don't think this is a fair characterization of the situation. I use frontier models via API pre-paid tokens every single day, and I can barely rack up $100 per month. The fact that we figured out how to burn double this in 20 minutes is impressive, but I don't think it reflects the reality that many are experiencing right now. There are some exceptionally gluttonous approaches to harnessing LLMs that I think are serving as convenient straw men in these discussions.

Paying for the API will almost always be more economical than self-hosting equivalent infrastructure. I am not against self-hosting, but the article suggests a primarily economic motivation for this effort. If you are consuming fewer than 10^9 tokens per month, I really don't think it's worth your time to try and compete with the hyperscalers. Most of the money is to be found in the integration of this technology with existing businesses.

vidarh 2 hours ago
I use hosted providers myself, but I can easily churn through $100 worth of tokens in half a day, even with cheap models like Deepseek. If someone's use is as light as yours, then sure, grab a subscription and you'll save far more. For higher use, whether it's worth offloading at least some of it comes down to how cheap your electricity is (for me it's not, FWIW).
iJohnDoe 1 hour ago
Could you share a bit about what you’re working on or what type of projects require that much usage? Is it hobby, production, revenue generating?
vidarh 1 hour ago
A mix. I have hobby projects that churn through that much when I don't need the tokens for other things. I also have projects for clients that easily consume those levels, as well as a stealth-ish potential startup. Currently I'm at 4 different subscriptions + more than I'd like in spend via OpenRouter...

What multiplies it very quickly is when you start feeding them with test suites and "Ralph loops" that run until the test suites pass, or complex chains with lots of sub-agents being triggered.

If you're sitting there watching everything, it will be hard to burn all that much even if you're running multiple things in parallel.

oceanplexian 2 hours ago
Claude is something like $35 per million tokens. If I were using API pricing I could trivially spend $100 in a single hour-long coding session, or in about 10 minutes with /fast turned on. Not sure how you guys are using it.
MattRix 2 hours ago
Opus is normally $5 per mtok, no idea why anyone would use /fast if they were at all concerned about price. ($5 is still pricy though tbh)
krzyk 1 hour ago
Opus is $5 per mtok of input tokens, but $25 for output.
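
To see how the split between input and output pricing plays out, here is a small sketch. The two rates are the ones quoted in this thread; the token counts are hypothetical, and the function ignores any caching or batch discounts a provider may offer.

```python
INPUT_USD_PER_MTOK = 5.0    # figure quoted in this thread
OUTPUT_USD_PER_MTOK = 25.0  # figure quoted in this thread


def session_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one session at the per-million-token rates above."""
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000


# Hypothetical session: 10M input tokens (re-read context adds up) and 2M output.
print(session_cost(10_000_000, 2_000_000))  # 100.0
```

In this made-up session the output tokens are a sixth of the volume but half of the bill.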
foolfoolz 2 hours ago
coding is the easy part of using claude
peibye 5 minutes ago
All that work just to write an ai blog post. This is a cool topic but I just can’t deal with the aiisms.
matja 4 hours ago
The AMD MI250X GPUs are also interesting - 128GB of HBM2E at 3TB/s, sometimes you see them second-hand for under $1k, the catch obviously is that it needs an OAM socket. Never seen an easy way to hook them up to a regular mainboard.
Gracana 3 hours ago
An additional complication is that MI250Xes are two GPUs in one package, so you need to connect the first and last x16 SERDES groups to the host, otherwise you'll only see one GPU (or it won't work at all, idk).

Also, the cheap HPE pulls on eBay need some proprietary HPE magic to work, and I have yet to see anyone figure that out.

Teknomadix 3 hours ago
These are interesting, and offer beefy throughput. No point in adapting them to a PCIe lane though; they'd be stuck behind the slot-bus bottleneck.
plagiarist 3 hours ago
Ahh luckily this OAM socket will prevent me from spending money.
mondainx 4 hours ago
Great write-up, I've often considered these DC cards for a project and now you've convinced me to pick one up; you describe the price of the unit against what one spends on tokens and that does it for me.
tymscar 47 minutes ago
That’s why I did it. I think it’s important to put things like that into perspective.
segmondy 2 hours ago
The most interesting part, and perhaps the most useful for most readers, would be how they control the fan. If you are thinking of doing this, you really want to get those fans under control; they are loud. For anyone thinking of these, V100s idle super high: 25-35 W with nothing loaded and easily 50 W when a model is loaded.
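
To put those idle numbers in perspective, a quick sketch of what they add up to over a year. It assumes the machine stays powered on around the clock, which is my assumption and not something stated above; multiply by your own electricity rate for a cost.

```python
HOURS_PER_YEAR = 24 * 365


def annual_kwh(watts: float) -> float:
    """Energy used in a year at a constant draw, assuming 24/7 uptime."""
    return watts * HOURS_PER_YEAR / 1000


# Wattages are the figures from the comment above.
for label, watts in [("idle, low", 25), ("idle, high", 35), ("model loaded", 50)]:
    print(f"{label}: {annual_kwh(watts):.0f} kWh/year")
# idle, low: 219 kWh/year
# idle, high: 307 kWh/year
# model loaded: 438 kWh/year
```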