Posted by campers 2 days ago
You need an API key - I got one from https://cloud.cerebras.ai/ but I'm not sure if there's a waiting list at the moment - then you can do this:
    pipx install llm   # or: brew install llm, or: uv tool install llm
    llm install llm-cerebras
    llm keys set cerebras
    # paste key here
Then you can run lightning-fast prompts like this:

    llm -m cerebras-llama3.1-70b 'an epic tail of a walrus pirate'
Here's a video of that running, it's very speedy: https://static.simonwillison.net/static/2024/cerebras-is-fas...

I just tried that out with the same prompt and it's fast, but not as fast as Cerebras: https://static.simonwillison.net/static/2024/gemini-flash-8b...
For our use case, we may get 1 audio file at a time, we may get 10. Of course queuing them is possible but we decided to prioritize speed & reliability over self hosting.
Same reason you can get a pretty good reconstruction when you add random noise to an image and then apply a binary threshold function to it. The more pixels there are, the more recognizable the B&W reconstruction will be.
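For the curious, here's a minimal numpy sketch of that effect; the gradient image and the 16x16 block size are just illustrative choices:

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic grayscale "image": a smooth horizontal gradient in [0, 1].
    img = np.tile(np.linspace(0.0, 1.0, 256), (256, 1))

    # Add uniform noise, then binary-threshold: each output pixel becomes a
    # Bernoulli sample whose probability equals the original gray level.
    bw = (img + rng.uniform(-0.5, 0.5, img.shape) > 0.5).astype(float)

    # Averaging 16x16 blocks of the B&W pixels recovers the gray levels;
    # more pixels per block means a more faithful reconstruction.
    recon = bw.reshape(16, 16, 16, 16).mean(axis=(1, 3))
    truth = img.reshape(16, 16, 16, 16).mean(axis=(1, 3))
    print(np.abs(recon - truth).max())  # small, and shrinks as blocks grow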
Specifying the model and setting up basic chat is simple (and there are numerous other examples in the examples folder in the repo[1]):

    import langroid.language_models as lm
    import langroid as lr

    # Use the Cerebras-hosted Llama 3.1 70B via an OpenAI-compatible config.
    llm_config = lm.OpenAIGPTConfig(chat_model="cerebras/llama3.1-70b")
    agent = lr.ChatAgent(
        lr.ChatAgentConfig(llm=llm_config, system_message="Be helpful but concise")
    )
    task = lr.Task(agent)
    task.run()

Or run the example chat script[2] directly:

    python3 examples/basic/chat.py -m cerebras/llama3.1-70b
[1] https://github.com/langroid/langroid
[2] https://github.com/langroid/langroid/blob/main/examples/basi...
[3] Guide to using Langroid with non-OpenAI LLM APIs: https://langroid.github.io/langroid/tutorials/local-llm-setu...

At that rate it doesn't matter if the first try produced an unwanted answer; you can run it once or twice more in quick succession.
I hope their hardware stays relevant as this field continues to evolve.
Fast iteration is a killer feature, for sure, but at this point I'd rather focus on quality to make the effort worthwhile.
There are LLMs today that are amazing at coding, and when you let one iterate (e.g. respond to compiler errors), the quality is pretty impressive. If you can run an LLM 3x faster, you can drive a much bigger feedback loop in the same period of time, as in the sketch below.
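As a rough sketch of that loop (llm() is a hypothetical stand-in for whatever fast completion API you're using, and Python's built-in compile() stands in for a real compiler):

    def llm(prompt: str) -> str:
        raise NotImplementedError  # hypothetical: call your fast inference API

    def generate_until_it_compiles(spec: str, attempts: int = 5) -> str:
        code = llm(spec)
        for _ in range(attempts):
            try:
                compile(code, "<generated>", "exec")  # syntax-level check
                return code
            except SyntaxError as err:
                # Feed the error back; 3x faster inference means 3x more
                # of these round trips in the same wall-clock time.
                code = llm(f"{spec}\n\nFix this error:\n{err}")
        return code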
There are efforts to enable LLMs to "think" using chain-of-thought, where the LLM writes out its reasoning as a "proof"-style list of steps. Sometimes, like a person, it reaches a logical dead end. If you can run 3x faster, you can run the "thought chain" as more of a "tree", where the logic is critiqued and adapted and many different solutions are tried. This can all happen in parallel (well, each sub-branch can).
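A toy version of that "tree" idea, where generate() and critique() are hypothetical stand-ins for calls into a fast LLM backend:

    import random
    from concurrent.futures import ThreadPoolExecutor

    def generate(question: str) -> str:
        # Hypothetical: one independent chain-of-thought completion.
        return f"chain {random.random():.3f} for {question!r}"

    def critique(chain: str) -> float:
        # Hypothetical: a critic model scoring a reasoning chain.
        return random.random()

    def solve(question: str, branches: int = 8) -> str:
        # Fan out many chains in parallel and keep the best-scored one;
        # with 3x faster inference, the whole tree takes roughly what a
        # single chain used to take in wall-clock time.
        with ThreadPoolExecutor(max_workers=branches) as pool:
            chains = list(pool.map(generate, [question] * branches))
        return max(chains, key=critique)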
Then there are "agent" use cases, where an LLM has to take actions on its own in response to real-world situations. Speed really impacts user perception of quality.
Well, now the compiler is the bottleneck, isn't it? And you'd still need a human to check for bugs that aren't caught by the compiler.
Still nice to have inference speed improvements tho.
Some compilers (Go) are faster than others (javac), and some languages are interpreted and can only be checked through tests. Moving the bottleneck from the AI code-gen step to the same bottleneck a human has seems like a win.
I think most of the capability problems with coding agents aren't the AI itself, it's that we haven't cracked how to let them interact with the codebase effectively yet. When I refactor something, I'm not doing it all at once, it's a step by step process. None of the individual steps are that complicated. Translating that over to an agent feels like we just haven't got the right harness yet.
As the world gets more internet-connected and more online, we'll have an ever-expanding list of "small stuff": glue code that mixes an ever-growing list of data sources/sinks and visualizations together. Much of it is "write once" and leave running.

Big companies (e.g. Google) have built complex build systems (e.g. Bazel) to isolate small reusable libraries within a larger repo, a necessity for unbelievably large development teams managing a shared repository. An LLM acting in its small corner of the world seems well suited to this sort of tooling, even if it can't refactor large projects spanning sweeping changes.

I suspect we'll develop even more abstractions and layers to isolate LLMs and their knowledge of the world. We already have containers and orchestration enabling "serverless" applications, and embedded webviews for GUIs.

Think about ChatGPT and its Python interpreter, or Claude and its web view. They all come with nice harnesses that support a boilerplate-free playground for short bits of code. That may continue to accelerate and grow in power.
But you're assuming that it'll always be validated by humans. I'd imagine that most validation (and subsequent processing, especially going forward) will be done by machines.

Otherwise, I feel that power consumption is a bigger issue than speed, though in this case they're interlinked.
Unless you want to take the argument of Morpheus in The Matrix and ask "what is real?"
We aren't exchanging freedom for security anymore, which could be reasonable under certain conditions; we just get convenience. Bad deal.
Total surveillance may be necessary for other reasons, like making sure organised crime can't blackmail anyone because the state already knows it all, but it's overkill for AI.
This isn't turtles all the way down; it's grounded in real-world data, and increasingly large varieties of it.
To wit: humans can't either, so it's an unreasonable question.
More formally, the tripartite definition of knowledge* is flawed, and everything you think you know runs into the Munchausen trilemma.
* Genuinely part of my A-level in philosophy
And because it’s fast and easy we now get more fakes, scams and disinformation.
That makes AI a lose-lose, not to mention the further negative consequences.
Are you treating "the internet" as "reality" with this line of questions?
The internet is the map; don't mistake the map for the territory. It's fine as a bootstrap but not as the final result, just as it's OK for a human to research a topic by reading Wikipedia but not to use it as the only source.
1. AI can do what we can do, in much the same way we can do it, because it's biologically inspired. Not a perfect copy, but close enough for the general case of this argument.
2. AI can't ever be perfect because of the same reasons we can't ever be perfect: it's impossible to become certain of anything in finite time and with finite examples.
3. AI can still reach higher performance than us in specific things (not everything, not yet), because the information-processing speedup going from synapses to transistors is of the same order of magnitude as the speedup from continental drift to walking, so when there exists sufficient training data to overcome the inefficiency of the model, we can make models absorb approximately all of that information.
It's starting to look like you can boost utility linearly by scaling token usage per query exponentially; in other words, utility grows roughly like log(tokens), so each extra unit of quality costs a constant multiple more tokens. If so, we might see companies slow down on scaling parameters and instead focus on scaling token usage.
And then there are use cases like OpenAI's o1, where most tokens aren't even generated for the benefit of a human, but as input for the model itself.
Our first implementation of inference on the Wafer Scale Engine utilized only a fraction of its peak bandwidth, compute, and IO capacity. Today's release is the culmination of numerous software, hardware, and ML improvements we made to our stack to greatly improve the utilization and real-world performance of Cerebras Inference.
We've rewritten or optimized the most critical kernels such as MatMul, reduce/broadcast, element-wise ops, and activations. Wafer IO has been streamlined to run asynchronously from compute. This release also implements speculative decoding, a widely used technique that uses a small model and a large model in tandem to generate answers faster.
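For anyone unfamiliar, here's a generic sketch of the technique (this is not Cerebras's implementation; draft_next and target_next are hypothetical greedy single-token samplers for the small and large models):

    def speculative_step(prefix, k, draft_next, target_next):
        # 1. The small model drafts k tokens autoregressively (cheap).
        ctx = list(prefix)
        draft = []
        for _ in range(k):
            tok = draft_next(ctx)
            draft.append(tok)
            ctx.append(tok)

        # 2. The large model verifies the drafted positions (one batched
        #    forward pass in practice; shown token-by-token here) and keeps
        #    the longest agreeing prefix.
        ctx = list(prefix)
        accepted = []
        for tok in draft:
            expected = target_next(ctx)
            if expected != tok:
                accepted.append(expected)  # large model wins on a mismatch
                break
            accepted.append(tok)
            ctx.append(tok)
        return accepted

With greedy decoding this is lossless, i.e. the output matches what the large model alone would produce; exact sampling additionally requires the rejection-sampling correction from the speculative decoding papers.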
A big question is what they're using as their draft model; there are ways to do it losslessly, but they could also trade off accuracy for a bigger increase in speed.
It seems they also only support a very short sequence length (1k tokens).
> At 16 RU, and peak sustained system power of 23kW, the CS-3 packs the performance of a room full of servers into a single unit the size of a dorm room mini-fridge.
It's pretty impressive looking hardware.