DSpark: Speculative decoding accelerates LLM inference [pdf]

Posted by aurenvale 9 hours ago

DSpark: Speculative decoding accelerates LLM inference [pdf](github.com)

632 points | 237 comments | page 2

articlepan 4 hours ago|

Title is bad, it's the first line of the abstract instead of the paper title. Speculative decoding for LLM inference was published in 2022: https://arxiv.org/abs/2211.17192

This paper seems to be an improvement to speculative decoding but I haven't read it yet.

lelanthran 7 hours ago||

These companies providing tokens, whether SOTA or not, that want to IPO are so fucked as time goes on.

They can't sell their SOTA models, the models they can sell are only slightly better than the open source ones, the good models cost 20x to 50x, the TAM consists almost solely of developers, and no customer of theirs is actually boasting increased profits as a result of AI...

I fear their time to IPO may have passed.

utopiah 6 hours ago||

The question is even, was there EVER a time for an IPO?

If the business model requires hundreds of billions to get the required quality (R&D but also infrastructure to collect data and train, either purchased or rented from a 3rd party) while "only" dozens of billions can be earned back (as costs still exist to earn, it's not free once models are trained), then maybe there NEVER was nor will be a good time for an IPO in a rational market.

notnullorvoid 2 hours ago|||

> in a rational market.

Unfortunately the market is often not rational in this way.

Hype within the retail market means there are suckers willing to buy. The institutional market knows there are suckers when the hype is high. Both drive the price up, and retail investors are the ones left when it falls.

2838383838 5 hours ago|||

IPOs with massive bags can be WeWork or SpaceX, it all depends on vibes. If they buy a couple more articles doomposting and glazing AI in the Financial Times right before exit they will def find a bunch of boomers to buy their bags. If the narrative changes before they IPO, it's over.

danielabinav160 7 hours ago||

Would love to see these numbers reproduced on consumer GPUs, not just A100s.

wolttam 4 hours ago||

This is an efficiency improvement that significantly lowers the amount of RAM you have to look at, on average, during decode.

It should improve performance on most hardware because most LLMs are memory bandwidth bound during decode.
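For readers unfamiliar with the base technique, here is a minimal sketch of generic greedy draft-and-verify speculative decoding, in the spirit of the 2022 work linked upthread. It is not DSpark's method (the paper's details are not described in this thread), and `toy_target`, `toy_draft`, `speculative_step`, and `generate` are hypothetical names made up for illustration. The point it shows: the target model verifies several drafted tokens per pass, so in a real system its weights are read from memory once per batch of tokens instead of once per token.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[Token], k: int) -> List[Token]:
    """One draft-and-verify step; returns the tokens accepted in this step."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target model checks each proposed position. A real system does
    #    these checks in one batched forward pass; this toy calls the target
    #    once per position for clarity.
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(context + accepted)
        if expected != tok:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(tok)

    # 3. All k drafts matched, so the same pass also yields one bonus token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[Token], n_new: int, k: int = 4) -> List[Token]:
    """Generate n_new tokens; output matches plain greedy decoding of target."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + n_new]


# Hypothetical toy "models": the target counts upward modulo 10, and the
# draft agrees with it except right after a 4.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step(toy_target, toy_draft, [2], 4))
    print(generate(toy_target, toy_draft, [2], 12))
```

In this greedy variant the output is identical to decoding with the target alone; the draft only changes how many target passes are needed. Sampling-based variants accept or reject drafted tokens probabilistically instead of by exact match.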

tommica 7 hours ago||

Maybe someday an 8GB video card can be used for coding...

romanusrome 6 hours ago||

[dead]

rvz 8 hours ago||

This is just one of many papers DeepSeek have released to be able to serve models at extremely cheap prices, unlike the others taking on more than $100B of debt in building data centers for the same thing.

> As with V4-Flash, we treat this point as an indication that DSpark sustains useful throughput under an interactivity target that the baseline cannot efficiently support. At matched system capacities, DSpark delivers 57% to 78% faster per-user generation.

Reminds me of the flawed 2017-era approach to scaling servers that run memory-intensive technologies: adding even more servers to solve the problem. (It just increases costs.)

Rather than doing that, think about which critical parts of your app can be written in a more performant technology.

Fast forward to 2026, now you can see who is just throwing more money at the problem to create even more problems, whereas DeepSeek is giving us optimized solutions.

I know exactly who I would pay attention to, and it is absolutely not Anthropic.

denverllc 5 hours ago||

For so long American companies have operated under the assumption that servers are cheaper than developers, and that was used to justify all sorts of inefficient practices.

The last year has shown that’s not true anymore (even for web servers).

simianwords 4 hours ago||

...... are you really suggesting OpenAI and Anthropic don't have access to these techniques?

wg0 3 hours ago||

That's why I pay them. Regularly. Without fail. Even though my token usage isn't that much.

But I vote for these heroes with my wallet. Just did it again yesterday.

noIdeaTheSecond 29 minutes ago|

Kudos to you! If people realized how much power we had, we'd have a better world.

bflesch 7 hours ago||

At this point why can't someone produce a fridge or container-sized AI appliance based on legacy chips (12nm)? I imagine this would cover 80% of corporate use cases where you need "google-in-a-box" functionality.

State-of-the-art nanometer nodes are impossible to achieve, but if you have infinite solar energy during business hours does it really matter? Every company has a parking spot, so this ASIC-like appliance could be as big as a shipping container.

If it could just run recent open models for a handful of users it would be such a no-brainer to buy.

scrlk 6 hours ago||

See "exabox" from George Hotz: https://tinycorp.myshopify.com/products/exabox-preorder

flipped 6 hours ago||

No one's buying that shitbox.

mrklol 4 hours ago||

Why?

sixhobbits 6 hours ago|||

Nvidia is already selling exactly this I think, not sure when it's expected to ship

benjiro29 6 hours ago||

The issue is that there are only so many fabs in the world that make memory. And if you want the good stuff, you're easily going into 400 ~ 750B parameter models. That means roughly 200 to 375GB of memory at FP4 for the weights alone, or 400 to 750GB at FP8.
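As a back-of-the-envelope check on those sizes: weight memory is roughly parameter count times bits per parameter, divided by 8. The sketch below assumes decimal gigabytes and counts weights only (KV cache, activations, and runtime overhead come on top); `weight_memory_gb` is a hypothetical helper, not part of any library.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    # 1e9 parameters at 8 bits each is 1 GB, so billions * bits / 8 gives GB.
    return params_billion * bits_per_param / 8


if __name__ == "__main__":
    for params in (400, 750):
        for bits in (4, 8, 16):
            gb = weight_memory_gb(params, bits)
            print(f"{params}B params at {bits}-bit: ~{gb:.0f} GB")
```

Under these assumptions, 400B to 750B parameters come to about 200 to 375 GB at 4 bits per parameter and 400 to 750 GB at 8 bits per parameter.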

Did I mention there are only so many memory makers and they are all busy printing money with HBM memory?

Intel is trying with Crescent Island, to make a 160GB GPU that uses LPDDR5X memory.

HBM takes multiple times the resources to make vs basic DDR5 memory. So by going this route, you have more memory, with the disadvantage that it's only 700GB/s, vs HBM pumping out terabyte-per-second numbers like it's nothing.

These cards, if reasonably priced, may be a good alternative to $10k 96GB Nvidia Blackwells... You give up token generation speed (heavily memory dependent) for more memory to run larger models on home/office/company servers.

The problem is, again, there are only so many memory makers and it's not like the market is flooded with DDR5 memory anymore, as the big 3 moved a lot of production to HBM.

Another approach is Sandisk making HBF ... Flash memory, like your typical NVMe but designed around maximum speed. So instead of loading the models into expensive HBM memory, you use the density benefits of Flash memory and offload models into that. Cheaper, but slower... But it leaves your expensive HBM memory free for things like KV cache, active parameters, etc... So your model will be slower, but you're using it as a hybrid. As in, faster than running a model from system memory with normal DDR memory, but not as fast as HBM.

So yea, there is a lot in development to reduce the dependence on that resource-eating HBM memory. For the wafer cost of 1GB HBM, you normally get 4GB of normal memory. That is why the world supply of memory dropped. Not just the insane buying, but because HBM is just very inefficient in wafer usage.

Can we not use DDR4 production and create some kind of hybrid solution? Sure, but the big 3 moved away from DDR4 in favor of DDR5 a long time ago. We have competition from China with a mix of DDR4/DDR5, but they also need to scale up. Nobody expected to see a large part of the world production vanish into HBM...

Even if it's about DDR4 and older nodes, ironically, most companies had been moving away from DDR4. There is only so much wafer capacity in the world, to the point that companies are moving to using DDR2 ... Yea, not a typo, like 2007 DDR2! For IoT devices etc, stuff that does not need fast memory. Because even DDR3 got too expensive for them.

It's not like the old nodes are not used anymore ... Like that capacity was sitting idle. It was still in production making other stuff. The only real solution is that we need more fabs, and those take years to build. And the big 3 delayed investing in new fabs for a long time, unsure about the whole AI bubble stuff. Aka, they did not want to build a ton of fabs and end up with overcapacity if the AI growth collapsed.

bradfa 5 hours ago||

With MoE models like Deepseek’s and with multiple Crescent Island accelerators, the aggregate memory throughput actually doesn’t look that bad. Two Crescent Islands get roughly 1400GB/s, and Deepseek-v4-flash with 13B parameters active nets roughly 100t/s, which is decent for a small team or great for a single user.
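A rough way to sanity-check that kind of estimate, assuming decode is purely memory bandwidth bound: every generated token has to read the active weights once, so tokens per second is at most bandwidth divided by active weight bytes. The helper below is a hypothetical sketch; it ignores KV cache reads, inter-card communication, and compute, and the bits-per-parameter values are assumptions rather than numbers from the comment.

```python
def decode_tokens_per_s(bandwidth_gb_s: float,
                        active_params_billion: float,
                        bits_per_param: int) -> float:
    """Upper bound on single-stream decode speed when bandwidth bound."""
    # GB of active weights that must be read to produce one token.
    gb_per_token = active_params_billion * bits_per_param / 8
    return bandwidth_gb_s / gb_per_token


if __name__ == "__main__":
    for bits in (4, 8):
        tps = decode_tokens_per_s(1400, 13, bits)
        print(f"13B active at {bits}-bit, 1400 GB/s: up to ~{tps:.0f} tokens/s")
```

With 1400 GB/s and 13B active parameters, the bound is about 108 tokens/s at 8 bits per parameter and about 215 tokens/s at 4 bits, so roughly 100 t/s is plausible once real-world overheads are included.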

More Crescent Island cards scale up further, although likely not entirely linearly.

But all GPU inference works like this, it’s not specific to Intel. It's just that Intel promises more affordable cards with big memory, so they’re attractive.

2838383838 8 hours ago||

Must be wonderful to be on the board of OpenAI et al & their PE investors whilst China keeps blowing up these mines under their feet lmao. Luckily Korean pension funds will buy all the trash as usual, but goddamn you gotta start moving quick or you are gonna need some serious AGI to show you how to offload those bonds.

ForHackernews 7 hours ago||

"We will build the machine-god and pray for it to pay for itself."

FridgeSeal 7 hours ago||

Every day, the rate of “could post a picture of 40k tech priests and have it taken unironically” goes up, and it’s starting to get concerning.

ozgrakkurt 7 hours ago||

Don’t worry, they will sell all the hardware and data they acquired with their grift.

lightedman 3 hours ago||

Anyone want to bet that much like speculative execution, speculative decoding is going to introduce a whole slew of vulnerabilities in the ways LLMs work?

preetham_rangu 8 hours ago||

Do they use their own OCR, or someone else's?

eddysir 2 hours ago|

[flagged]
