Posted by campers 2 days ago
You need an API key - I got one from https://cloud.cerebras.ai/ but I'm not sure if there's a waiting list at the moment - then you can do this:
    pipx install llm   # or: brew install llm, or: uv tool install llm
    llm install llm-cerebras
    llm keys set cerebras
    # paste key here
Then you can run lightning-fast prompts like this:

    llm -m cerebras-llama3.1-70b 'an epic tail of a walrus pirate'
Here's a video of that running, it's very speedy: https://static.simonwillison.net/static/2024/cerebras-is-fas...

I just tried that out with the same prompt and it's fast, but not as fast as Cerebras: https://static.simonwillison.net/static/2024/gemini-flash-8b...
For our use case, we may get 1 audio file at a time, we may get 10. Of course queuing them is possible but we decided to prioritize speed & reliability over self hosting.
Same reason you can get a pretty good reconstruction when you add random noise to an image and then apply a binary threshold function to it. The more pixels there are, the more recognizable the B&W reconstruction will be.
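For the curious, here's a minimal numpy sketch of that effect; the gradient image and the 16x16 block size are just illustrative choices:

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic grayscale "image": a smooth horizontal gradient in [0, 1].
    img = np.tile(np.linspace(0.0, 1.0, 256), (256, 1))

    # Add uniform noise, then binary-threshold: each output pixel becomes a
    # Bernoulli sample whose probability equals the original gray level.
    bw = (img + rng.uniform(-0.5, 0.5, img.shape) > 0.5).astype(float)

    # Averaging 16x16 blocks of the B&W pixels recovers the gray levels;
    # more pixels per block means a more faithful reconstruction.
    recon = bw.reshape(16, 16, 16, 16).mean(axis=(1, 3))
    truth = img.reshape(16, 16, 16, 16).mean(axis=(1, 3))
    print(np.abs(recon - truth).max())  # small, and shrinks as blocks grow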
Specifying the model and setting up basic chat is simple (and there are numerous other examples in the examples folder in the repo[1]):

    import langroid.language_models as lm
    import langroid as lr

    # Use the Cerebras-hosted Llama 3.1 70B via an OpenAI-compatible config.
    llm_config = lm.OpenAIGPTConfig(chat_model="cerebras/llama3.1-70b")
    agent = lr.ChatAgent(
        lr.ChatAgentConfig(llm=llm_config, system_message="Be helpful but concise")
    )
    task = lr.Task(agent)
    task.run()

Or run the example chat script[2] directly:

    python3 examples/basic/chat.py -m cerebras/llama3.1-70b
[1] https://github.com/langroid/langroid
[2] https://github.com/langroid/langroid/blob/main/examples/basi...
[3] Guide to using Langroid with non-OpenAI LLM APIs: https://langroid.github.io/langroid/tutorials/local-llm-setu...

At that rate it doesn't matter if the first try produced an unwanted answer; you can run it once or twice more in quick succession.
I hope their hardware stays relevant as this field continues to evolve.
Fast iteration is a killer feature, for sure, but at this point I'd rather focus on quality to make the effort worthwhile.
There are LLMs today that are amazing at coding, and when you let one iterate (e.g. respond to compiler errors), the quality is pretty impressive. If you can run an LLM 3x faster, you can drive a much bigger feedback loop in the same period of time, as in the sketch below.
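As a rough sketch of that loop (llm() is a hypothetical stand-in for whatever fast completion API you're using, and Python's built-in compile() stands in for a real compiler):

    def llm(prompt: str) -> str:
        raise NotImplementedError  # hypothetical: call your fast inference API

    def generate_until_it_compiles(spec: str, attempts: int = 5) -> str:
        code = llm(spec)
        for _ in range(attempts):
            try:
                compile(code, "<generated>", "exec")  # syntax-level check
                return code
            except SyntaxError as err:
                # Feed the error back; 3x faster inference means 3x more
                # of these round trips in the same wall-clock time.
                code = llm(f"{spec}\n\nFix this error:\n{err}")
        return code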
There are efforts to enable LLMs to "think" using chain-of-thought, where the LLM writes out its reasoning as a "proof"-style list of steps. Sometimes, like a person, it reaches a logical dead end. If you can run 3x faster, you can run the "thought chain" as more of a "tree", where the logic is critiqued and adapted and many different solutions are tried. This can all happen in parallel (well, each sub-branch can).
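A toy version of that "tree" idea, where generate() and critique() are hypothetical stand-ins for calls into a fast LLM backend:

    import random
    from concurrent.futures import ThreadPoolExecutor

    def generate(question: str) -> str:
        # Hypothetical: one independent chain-of-thought completion.
        return f"chain {random.random():.3f} for {question!r}"

    def critique(chain: str) -> float:
        # Hypothetical: a critic model scoring a reasoning chain.
        return random.random()

    def solve(question: str, branches: int = 8) -> str:
        # Fan out many chains in parallel and keep the best-scored one;
        # with 3x faster inference, the whole tree takes roughly what a
        # single chain used to take in wall-clock time.
        with ThreadPoolExecutor(max_workers=branches) as pool:
            chains = list(pool.map(generate, [question] * branches))
        return max(chains, key=critique)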
Then there are "agent" use cases, where an LLM has to take actions on its own in response to real-world situations. Speed really impacts user perception of quality.
Well, now the compiler is the bottleneck, isn't it? And you'd still need a human to check for bugs that aren't caught by the compiler.
Still nice to have inference speed improvements tho.
Some compilers (Go) are faster than others (javac), and some languages are interpreted and can only be checked through tests. Moving the bottleneck from the AI code-gen step to the same bottleneck a human has seems like a win.
I think most of the capability problems with coding agents aren't the AI itself, it's that we haven't cracked how to let them interact with the codebase effectively yet. When I refactor something, I'm not doing it all at once, it's a step by step process. None of the individual steps are that complicated. Translating that over to an agent feels like we just haven't got the right harness yet.
As the world gets more internet-connected and more online, we'll have an ever-expanding list of "small stuff": glue code that mixes an ever-growing list of data sources/sinks and visualizations together. Much of it is "write once" and leave running.

Big companies (e.g. Google) have built complex build systems (e.g. Bazel) to isolate small reusable libraries within a larger repo, a necessity for unbelievably large development teams managing a shared repository. An LLM acting in its small corner of the world seems well suited to this sort of tooling, even if it can't refactor large projects spanning sweeping changes.

I suspect we'll develop even more abstractions and layers to isolate LLMs and their knowledge of the world. We already have containers and orchestration enabling "serverless" applications, and embedded webviews for GUIs.

Think about ChatGPT and its Python interpreter, or Claude and its web view. They all come with nice harnesses that support a boilerplate-free playground for short bits of code. That may continue to accelerate and grow in power.
But you're assuming that it'll always be validated by humans. I'd imagine that most validation (and subsequent processing, especially going forward) will be done by machines.

Otherwise, I feel that power consumption is a bigger issue than speed, though in this case they're interlinked.
Unless you want to take the argument of Morpheus in The Matrix and ask "what is real?"
We aren't exchanging freedom for security anymore, which could be reasonable under certain conditions; we just get convenience. Bad deal.
Total surveillance may be necessary for other reasons, like making sure organised crime can't blackmail anyone because the state already knows it all, but it's overkill for AI.
This isn't turtles all the way down; it's grounded in real-world data, and increasingly large varieties of it.
To wit: humans can't either, so it's an unreasonable question.
More formally, the tripartite definition of knowledge* is flawed, and everything you think you know runs into the Munchausen trilemma.
* Genuinely part of my A-level in philosophy
And because it’s fast and easy we now get more fakes, scams and disinformation.
That makes AI a lose-lose, not to mention the further negative consequences.
Are you treating "the internet" as "reality" with this line of questions?
The internet is the map; don't mistake the map for the territory. It's fine as a bootstrap but not as the final result, just as it's OK for a human to research a topic by reading Wikipedia but not to use it as the only source.
1. AI can do what we can do, in much the same way we can do it, because it's biologically inspired. Not a perfect copy, but close enough for the general case of this argument.
2. AI can't ever be perfect because of the same reasons we can't ever be perfect: it's impossible to become certain of anything in finite time and with finite examples.
3. AI can still reach higher performance than us in specific things (not everything, not yet), because the information-processing speedup going from synapses to transistors is of the same order of magnitude as the speedup from continental drift to walking, so when there exists sufficient training data to overcome the inefficiency of the model, we can make models absorb approximately all of that information.
It's starting to look like you can boost utility linearly by scaling token usage per query exponentially; in other words, utility grows roughly like log(tokens), so each extra unit of quality costs a constant multiple more tokens. If so, we might see companies slow down on scaling parameters and instead focus on scaling token usage.
And then there are use cases like OpenAI's o1, where most tokens aren't even generated for the benefit of a human, but as input for the model itself.
Our first implementation of inference on the Wafer Scale Engine utilized only a fraction of its peak bandwidth, compute, and IO capacity. Today's release is the culmination of numerous software, hardware, and ML improvements we made to our stack to greatly improve the utilization and real-world performance of Cerebras Inference.
We've rewritten or optimized the most critical kernels such as MatMul, reduce/broadcast, element-wise ops, and activations. Wafer IO has been streamlined to run asynchronously from compute. This release also implements speculative decoding, a widely used technique that uses a small model and a large model in tandem to generate answers faster.
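For anyone unfamiliar, here's a generic sketch of the technique (this is not Cerebras's implementation; draft_next and target_next are hypothetical greedy single-token samplers for the small and large models):

    def speculative_step(prefix, k, draft_next, target_next):
        # 1. The small model drafts k tokens autoregressively (cheap).
        ctx = list(prefix)
        draft = []
        for _ in range(k):
            tok = draft_next(ctx)
            draft.append(tok)
            ctx.append(tok)

        # 2. The large model verifies the drafted positions (one batched
        #    forward pass in practice; shown token-by-token here) and keeps
        #    the longest agreeing prefix.
        ctx = list(prefix)
        accepted = []
        for tok in draft:
            expected = target_next(ctx)
            if expected != tok:
                accepted.append(expected)  # large model wins on a mismatch
                break
            accepted.append(tok)
            ctx.append(tok)
        return accepted

With greedy decoding this is lossless, i.e. the output matches what the large model alone would produce; exact sampling additionally requires the rejection-sampling correction from the speculative decoding papers.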
A big question is what they're using as their draft model; there are ways to do it losslessly, but they could also trade off accuracy for a bigger increase in speed.
It seems they also only support a very short sequence length (1k tokens).
> At 16 RU, and peak sustained system power of 23kW, the CS-3 packs the performance of a room full of servers into a single unit the size of a dorm room mini-fridge.
It's pretty impressive looking hardware.