
Posted by jrandolf 8 hours ago

Show HN: sllm – Split a GPU node with other developers, unlimited tokens (sllm.cloud)
Running DeepSeek V3 (685B) requires 8×H100 GPUs, which is about $14k/month. Most developers only need 15-25 tok/s. sllm lets you join a cohort of developers sharing a dedicated node. You reserve a spot with your card, and nobody is charged until the cohort fills. Prices start at $5/mo for smaller models.

The LLMs are completely private (we don't log any traffic).

The API is OpenAI-compatible (we run vLLM), so you just swap the base URL. Currently offering a few models.
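
A minimal stdlib-only sketch of what "swap the base URL" means in practice: an OpenAI-compatible chat request is just JSON POSTed to `<base_url>/chat/completions`, so migrating is a one-line change (with the official `openai` client you'd pass the same URL as `base_url=`). The host and model id below are placeholders, not sllm's actual values.

```python
import json

# Hypothetical endpoint -- substitute the URL your cohort is given.
BASE_URL = "https://example-sllm-host/v1"

def chat_request(model: str, prompt: str):
    """Build the URL and JSON body for an OpenAI-style chat completion."""
    url = f"{BASE_URL}/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    return url, body

url, body = chat_request("deepseek-v3", "Hello")
print(url)  # https://example-sllm-host/v1/chat/completions
```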

110 points | 62 comments
spencer9714 54 minutes ago|
Interesting concept. One thing I’m curious about: if I’m in a cohort for something like DeepSeek V3 and another user spins up a heavy 24/7 job, how do you keep TTFT from degrading? vLLM’s continuous batching helps, but there’s still a physical limit with shared VRAM/compute. I’ve been grappling with this exact 'noisy neighbor' issue while building Runfra. We actually ended up moving toward a credit-per-task model on idle GPUs specifically to avoid that resource contention entirely.

Curious how you’re thinking about isolation here. Is there any hard guarantee on a 'slice' of the GPU, or is it mostly just handled by the vLLM scheduler?

QuantumNomad_ 5 hours ago||
> How does billing work?

> When you join a cohort, your card is saved but not charged until the cohort fills. Stripe holds your card information — we never store it. Once the cohort fills, you are charged and receive an API key for the duration of the cohort.

Have any cohorts filled yet?

I’m interested in joining one, but only if it’s reasonable to assume that the cohort will be full within the next 7 days or so. (Especially because in a little over a week I’m attending an LLM-centered hackathon where we can either use AWS LLM credits provided by the organizer, or we can use providers of our own choosing, and I’d rather use either yours or my own hardware running vLLM than the LLM offerings and APIs from AWS.)

I’d be pretty annoyed if I join a cohort and then it takes like 3 months before the cohort has filled and I can begin to use it. By then I will probably have forgotten all about it and not have time to make use of the API key I am paying you for.

jrandolf 3 hours ago|
No cohorts have been filled yet. We're still early. We are seeing reservations pick up quickly, but I'd be able to give you a more concrete estimate of fill velocity after about a week.

That said, we're planning to add a 7-day window: if a cohort doesn't fill within 7 days of your reservation, it cancels automatically and your card is released. We don't want anyone's payment method sitting in limbo indefinitely.

freedomben 6 hours ago||
This is an excellent idea, but I worry about fairness during resource contention. I don't run queries often, but when I do they're often big and long-running. I wouldn't want to eat up the whole system when other users need it, but I'd also want the cluster available when I need it. How do you address a case like this?
pokstad 4 hours ago||
This problem sounds like an excellent opportunity. We need a race to the bottom for hosting LLMs to democratize the tech and lower costs. I cheer on anyone who figures this out.
zozbot234 2 hours ago|||
Ultimately the most sensible way of handling this is "surge pricing" for the highest-priority tokens whenever the inference platform is congested, on top of the base subscription (perhaps making the subscription itself a bit cheaper in exchange).
jrandolf 6 hours ago|||
We implement rate-limiting and queuing to ensure fairness, but if a massive number of people submit huge, long-running queries, then there will be waits. The question is whether people will actually do this; more often than not, users will be idle.
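
The kind of per-user rate limiting mentioned here is commonly implemented as a token bucket: each user accrues "credit" at a steady rate up to a cap, and a request is admitted only if enough credit is available. This is an illustrative sketch, not sllm's actual implementation.

```python
import time

class TokenBucket:
    """Toy per-user rate limiter: `capacity` allows short bursts,
    `rate_per_s` bounds sustained throughput."""

    def __init__(self, rate_per_s: float, capacity: float):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = capacity          # start full
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(rate_per_s=5, capacity=10)
results = [bucket.allow() for _ in range(12)]
print(results)  # burst of 10 passes, then requests are throttled
```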
mogili1 5 hours ago|||
A rate limit is essentially a token limit.
ibejoeb 4 hours ago|||
It depends on how it's implemented. If it's a fixed window, then your absolute ceiling is tokens per window times the number of windows in a month. If it's a function of other usage, like a timeshare, you're still paying a fixed price for the month and you get what you get, without paying more per token. There's an intrinsic limit based on how many tokens the model can process on that GPU in a month anyway, even if it's only you.
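
As a worked example of the fixed-window ceiling described above (the window length and per-window limit are invented numbers, not sllm's actual limits):

```python
# Hypothetical fixed-window rate limit.
WINDOW_SECONDS = 60          # 1-minute windows
TOKENS_PER_WINDOW = 10_000   # per-window token limit

# A 30-day month contains this many windows...
windows_per_month = 30 * 24 * 3600 // WINDOW_SECONDS

# ...so the absolute monthly ceiling is tokens/window * windows/month.
monthly_ceiling = TOKENS_PER_WINDOW * windows_per_month
print(monthly_ceiling)  # 432,000,000 tokens/month at these hypothetical limits
```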
delusional 55 minutes ago|||
Time x capacity is also a limit. There's always a limit.
freedomben 5 hours ago||||
Is there any way to buy into a pool of people with similar usage patterns? Maybe I'm overthinking it, but just wondering
ssl-3 4 hours ago||
I think it'd be best to pool with people with different patterns, not the same patterns. Perhaps it would be best to pool with people in different timezones, and/or with different work/sleep schedules.

If everyone in a pool uses it during the ~same periods and sleeps during the ~same periods, then the node would oscillate between contention and idle -- every day. This seems largely avoidable.

(Or, darker: Maybe the contention/idle dichotomy is a feature, not a bug. After all, when one has control of $14k/month of hardware that is sitting idle reliably-enough for significant periods every day, then one becomes incentivized to devise a way to sell that idle time for other purposes.)

petterroea 5 hours ago|||
To be fair, this is the price you pay for sharing a GPU. It's probably good for work that doesn't need to be done "now" but that you can launch and run in the background. I bet some graphs showing when the GPU is busiest could be useful as well.
cyanydeez 2 hours ago||
Also, cache eviction during contention will degrade everyone's service.

I question whether they actually understand LLMs at scale.

zozbot234 2 hours ago||
I suppose it's meant to be a "minimum viable" third-party inference platform, where you're literally selling subscription-based access (i.e. fixed price, not PAYGO by token) to a single GPU cluster, and then only once enough users subscribe to make it viable (which is very nice from them, it works like a Kickstarter/group coupon model and creates a guaranteed win-win for the users). But they could easily expand to more than just the minimum cluster size, which would somewhat improve efficiency. (Deepseek themselves scale out their model over huge amounts of GPUs, which is how they manage to price their tokens quite cheap.)
kaoD 6 hours ago||
How is the time sharing handled? I assume if I submit a unit of work it will load to VRAM and then run (sharing time? how many work units can run in parallel?)

How large is a full context window in MiB and how long does it take to load the buffer? I.e. how many seconds should I expect my worst case wait time to take until I get my first token?

jrandolf 6 hours ago||
vLLM handles GPU scheduling, not sllm. The model weights stay resident in VRAM permanently so there's no loading/unloading per request. vLLM uses continuous batching, so incoming requests are dynamically added to the running batch every decode step and the GPU is always working on multiple requests simultaneously. There is no "load to VRAM and run" per request; it's more like joining an already-running batch.

TTFT is under 2 seconds average. Worst case is 10-30s.
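
The "joining an already-running batch" behavior described above can be shown with a toy simulation: instead of running one request to completion before starting the next, the scheduler admits waiting requests into the active batch at every decode step, so new arrivals start generating immediately. This is illustrative only; vLLM's real scheduler also handles KV-cache paging, preemption, and memory pressure.

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy continuous-batching loop.
    requests: list of (name, tokens_to_generate); returns finish order."""
    waiting = deque(requests)
    active = {}            # name -> tokens still to generate
    finished = []
    while waiting or active:
        # Admit waiting requests into the running batch each step.
        while waiting and len(active) < max_batch:
            name, n = waiting.popleft()
            active[name] = n
        # One decode step: every active request emits one token.
        for name in list(active):
            active[name] -= 1
            if active[name] == 0:
                del active[name]
                finished.append(name)
    return finished

# Short requests finish early instead of queuing behind long ones.
print(continuous_batching([("a", 3), ("b", 1), ("c", 2)]))  # ['b', 'c', 'a']
```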

kaoD 4 hours ago||
> The model weights stay resident in VRAM permanently so there's no loading/unloading per request.

Yes, I was thinking about context buffers, which I assume are not small in large models. That has to be loaded into VRAM, right?

If I keep sending large context buffers, will that hog the batches?

jrandolf 3 hours ago|||
Not if you are the only one. We have rate limits to prevent this in case, idk, you share your key with 1000 people lol.
ninjha 6 hours ago||
> how many work units can run in parallel

not original author but batching is one very important trick to make inference efficient, you can reasonably do tens to low hundreds in parallel (depending on model size and gpu size) with very little performance overhead

mmargenot 6 hours ago||
This is a great idea! I saw a similar (inverse) idea the other day for pooling compute (https://github.com/michaelneale/mesh-llm). What are you doing for compute in the backend? Are you locked into a cohort from month to month?
avereveard 1 hour ago||
Interesting. There's always a trickle of low-intensity jobs one can keep running, but GLM's own plan is $30/mo for something like 300 tok/s. I know that one is subsidized, but still.
varunr89 6 hours ago||
$40/mo for DeepSeek R1 seems steep compared to a Pro sub on OpenAI/Claude unless you run 24x7. I'm not sure how sharing makes this affordable.
lelanthran 6 hours ago|
> $40/mo for DeepSeek R1 seems steep compared to a Pro sub on OpenAI/Claude unless you run 24x7.

"Running 24x7" is what people want to do with openclaw.

bluerooibos 35 minutes ago||
So shared hosting for LLMs?
p_m_c 4 hours ago||
Do you own the GPUs or are you multiplexing on a 3rd party GPU cloud?
jrandolf 3 hours ago|
Multiplexing on a GPU cloud.
vova_hn2 6 hours ago|
1. Is the given tok/s estimate for the total node throughput, or is it what you can realistically expect to get? Or is it the worst case scenario throughput if everyone starts to use it simultaneously?

2. What if I try to hog all resources of a node by running some large data processing and making multiple queries in parallel? What if I try to resell the access by charging per token?

Edit: sorry if this comment sounds overly critical. I think that pooling money with other developers to collectively rent a server for LLM inference is a really cool idea. I also thought about it, but haven't found a satisfactory answer to my question number 2, so I decided that it is infeasible in practice.

jrandolf 6 hours ago|
1. It's an average. 2. We have a sophisticated rate limiter.
poly2it 5 hours ago||
Does it take user time zones into account?
jrandolf 5 hours ago||
Yes