Posted by jrandolf 8 hours ago
The LLMs are completely private (we don't log any traffic).
The API is OpenAI-compatible (we run vLLM), so you just swap the base URL. Currently offering a few models.
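To make concrete what "just swap the base URL" looks like, here is a minimal offline sketch of the request shape. The host, key, and model name are placeholders, not the service's real values:

```python
import json
import urllib.request

# "OpenAI-compatible" means the only client-side change is the base URL:
# the /v1/chat/completions path and the JSON body are the same as OpenAI's.
BASE_URL = "https://example-pool.invalid/v1"  # hypothetical endpoint
API_KEY = "YOUR_COHORT_API_KEY"               # issued once the cohort fills

payload = {
    "model": "meta-llama/Llama-3.1-70B-Instruct",  # whichever model the pool serves
    "messages": [{"role": "user", "content": "Hello"}],
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
)
# urllib.request.urlopen(req) would actually send it; omitted to keep this offline.
print(req.full_url)
```

With the official OpenAI Python SDK the same swap is just `OpenAI(base_url=..., api_key=...)` and everything else stays unchanged.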
Curious how you’re thinking about isolation here. Is there any hard guarantee on a 'slice' of the GPU, or is it mostly just handled by the vLLM scheduler?
> When you join a cohort, your card is saved but not charged until the cohort fills. Stripe holds your card information — we never store it. Once the cohort fills, you are charged and receive an API key for the duration of the cohort.
Have any cohorts filled yet?
I’m interested in joining one, but only if it’s reasonable to assume the cohort will fill within the next 7 days or so. (Especially because in a little over a week I’m attending an LLM-centered hackathon where we can either use AWS LLM credits provided by the organizer or bring a provider of our own choosing, and I’d rather use either your service or my own hardware running vLLM than AWS’s LLM offerings and APIs.)
I’d be pretty annoyed if I joined a cohort and it then took something like 3 months to fill before I could start using it. By then I’ll probably have forgotten all about it and won’t have time to make use of the API key I’m paying you for.
That said, we're planning to add a 7-day window: if a cohort doesn't fill within 7 days of your reservation, it cancels automatically and your card is released. We don't want anyone's payment method sitting in limbo indefinitely.
If everyone in a pool uses it during roughly the same hours and sleeps during roughly the same hours, the node will oscillate between contention and idleness -- every day. This seems largely avoidable, e.g. by spreading each cohort across time zones.
(Or, darker: Maybe the contention/idle dichotomy is a feature, not a bug. After all, when one has control of $14k/month of hardware that is sitting idle reliably-enough for significant periods every day, then one becomes incentivized to devise a way to sell that idle time for other purposes.)
I question whether they actually understand LLMs at scale.
How large is a full context window in MiB and how long does it take to load the buffer? I.e. how many seconds should I expect my worst case wait time to take until I get my first token?
TTFT is under 2 seconds on average. Worst case is 10-30s.
Yes, I was thinking about context buffers, which I assume are not small in large models. That has to be loaded into VRAM, right?
If I keep sending large context buffers, will that hog the batches?
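For a rough sense of scale, here is a back-of-envelope KV-cache estimate assuming a Llama-3-70B-like layout (80 layers, 8 KV heads, head dim 128, fp16). These numbers are assumptions, not the service's actual model config:

```python
# KV cache per request: one key and one value vector per KV head, per layer,
# per token of context. This is the VRAM cost of "loading the context buffer".
layers, kv_heads, head_dim = 80, 8, 128   # assumed Llama-3-70B-like shape
bytes_per_elem = 2                        # fp16
kv_factor = 2                             # keys + values

bytes_per_token = layers * kv_heads * head_dim * bytes_per_elem * kv_factor
context_tokens = 8192

cache_mib = bytes_per_token * context_tokens / 2**20
print(f"{bytes_per_token} bytes/token, {cache_mib:.0f} MiB for {context_tokens} tokens")
```

So with these assumptions a full 8K context is a few GiB of VRAM per request. Note the cache isn't "loaded" so much as produced during prefill, which is compute-bound; that prefill pass is what dominates TTFT on long prompts.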
Not the original author, but batching is one of the most important tricks for making inference efficient: you can reasonably run tens to low hundreds of requests in parallel (depending on model size and GPU size) with very little performance overhead.
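A rough sketch of why that holds during decode: each step is dominated by streaming the model weights from VRAM once, and that single pass serves every sequence in the batch. The hardware numbers here are illustrative assumptions (70B params in fp16, ~3.35 TB/s memory bandwidth, roughly H100-class), not measurements of this service:

```python
# Memory-bandwidth-bound decode: one step reads all weights once,
# regardless of batch size, so throughput scales ~linearly with batch.
params = 70e9
bytes_per_param = 2        # fp16
bandwidth = 3.35e12        # bytes/s, assumed

step_time_s = params * bytes_per_param / bandwidth  # one decode step, any batch size
for batch in (1, 32, 128):
    tok_per_s = batch / step_time_s
    print(f"batch {batch:>3}: ~{tok_per_s:,.0f} tokens/s aggregate")
```

In practice, per-sequence attention over growing KV caches adds overhead and eats VRAM, which is why the realistic range is "tens to low hundreds" rather than unbounded.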
"Running 24x7" is what people want to do with openclaw.
2. What if I try to hog all resources of a node by running some large data processing and making multiple queries in parallel? What if I try to resell the access by charging per token?
Edit: sorry if this comment sounds overly critical. I think that pooling money with other developers to collectively rent a server for LLM inference is a really cool idea. I also thought about it, but haven't found a satisfactory answer to my question number 2, so I decided that it is infeasible in practice.
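For what it's worth, the usual partial answer to question 2 is per-key rate limiting at the gateway: a token bucket lets a key burst but caps its sustained share of the node. A generic sketch, not something this service has said it implements:

```python
import time

class TokenBucket:
    """Per-API-key limiter: refills at `rate` tokens/s up to `burst` capacity."""

    def __init__(self, rate_tokens_per_s: float, burst: float):
        self.rate = rate_tokens_per_s
        self.capacity = burst
        self.level = burst
        self.last = time.monotonic()

    def try_consume(self, tokens: float) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.level = min(self.capacity, self.level + (now - self.last) * self.rate)
        self.last = now
        if tokens <= self.level:
            self.level -= tokens
            return True
        return False  # caller would get an HTTP 429

bucket = TokenBucket(rate_tokens_per_s=1000, burst=4000)
print(bucket.try_consume(3000))  # True: within the burst allowance
print(bucket.try_consume(3000))  # False: bucket drained, must wait
```

That bounds hogging; reselling per token is harder to prevent technically and usually ends up as a terms-of-service matter rather than an engineering one.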