Posted by lcolucci 1/27/2026
Show HN: LemonSlice – Upgrade your voice agents to real-time video
Chatbots are everywhere and voice AI has taken off, but we believe video avatars will be the most common form factor for conversational AI. Most people would rather watch something than read it. The problem is that generating video in real-time is hard, and overcoming the uncanny valley is even harder.
We haven’t broken the uncanny valley yet. Nobody has. But we’re getting close and our photorealistic avatars are currently best-in-class (judge for yourself: https://lemonslice.com/try/taylor). Plus, we're the only avatar model that can do animals and heavily stylized cartoons. Try it: https://lemonslice.com/try/alien. Warning! Talking to this little guy may improve your mood.
Today we're releasing our new model* - Lemon Slice 2, a 20B-parameter diffusion transformer that generates infinite-length video at 20fps on a single GPU - and opening up our API.
How did we get a video diffusion model to run in real-time? There was no single trick, just a lot of them stacked together. The first big change was making our model causal. Standard video diffusion models are bidirectional (they look at frames both before and after the current one), which means you can't stream.
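To make that concrete, here is a toy sketch (not our production code; the frame and token counts are made up). A bidirectional model lets every frame attend to every other frame, so nothing can be emitted until the whole clip is denoised; a block-causal mask restricts each frame to itself and earlier frames, which is what makes streaming possible.

    import torch

    frames, tokens_per_frame = 4, 3               # made-up sizes for illustration
    n = frames * tokens_per_frame
    frame_id = torch.arange(n) // tokens_per_frame

    # Bidirectional (standard video diffusion): every token sees every frame,
    # so the whole clip has to exist before any frame is final.
    bidirectional_mask = torch.ones(n, n, dtype=torch.bool)

    # Block-causal: a token only sees its own frame and earlier ones, so frames
    # can be denoised and streamed in order.
    causal_mask = frame_id.unsqueeze(0) <= frame_id.unsqueeze(1)

    # Applied as attn_scores.masked_fill(~causal_mask, float("-inf")) before softmax.
    print(causal_mask.int())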
From there it was about fitting everything on one GPU. We switched from full to sliding window attention, which killed our memory bottleneck. We distilled from 40 denoising steps down to just a few - quality degraded less than we feared, especially after using GAN-based distillation (though tuning that adversarial loss to avoid mode collapse was its own adventure).
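For intuition on the sliding-window part (again a simplified sketch, not our kernels; the window size is arbitrary): each frame attends only to the most recent few frames instead of the entire history, so attention memory per step stays flat no matter how long the video runs.

    import torch

    def sliding_window_mask(n_frames: int, window: int) -> torch.Tensor:
        """mask[q, k] is True if query frame q may attend to key frame k."""
        q = torch.arange(n_frames).unsqueeze(1)   # query frame index
        k = torch.arange(n_frames).unsqueeze(0)   # key frame index
        # Causal and windowed: only the `window` most recent frames are visible.
        return (k <= q) & (k > q - window)

    print(sliding_window_mask(n_frames=8, window=3).int())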
And the rest was inference work: modifying RoPE from complex to real (this one was cool!), precision tuning, fusing kernels, a special rolling KV cache, lots of other caching, and more. We kept shaving off milliseconds wherever we could and eventually got to real-time.
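To illustrate the complex-to-real RoPE idea (a toy numerical check, not our inference code): rotating feature pairs as complex numbers by e^(i*theta) is exactly the same operation as applying cos/sin to each pair with ordinary real arithmetic, so the rotation can be done without complex dtypes at all.

    import torch

    def rope_complex(x: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
        # Treat consecutive feature pairs as complex numbers and rotate by e^(i*theta).
        xc = torch.view_as_complex(x.reshape(*x.shape[:-1], -1, 2))
        rotated = xc * torch.polar(torch.ones_like(theta), theta)
        return torch.view_as_real(rotated).reshape(x.shape)

    def rope_real(x: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
        # The same rotation written with real cos/sin only.
        x1, x2 = x[..., 0::2], x[..., 1::2]
        cos, sin = theta.cos(), theta.sin()
        out = torch.empty_like(x)
        out[..., 0::2] = x1 * cos - x2 * sin
        out[..., 1::2] = x1 * sin + x2 * cos
        return out

    x = torch.randn(2, 8)        # toy activations, head_dim = 8
    theta = torch.randn(2, 4)    # one rotation angle per feature pair
    print(torch.allclose(rope_complex(x, theta), rope_real(x, theta), atol=1e-6))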
We set up a guest playground for HN so you can create and talk to characters without logging in: https://lemonslice.com/hn. For those who want to build with our API (we have a new LiveKit integration that we’re pumped about!), grab a coupon code in the HN playground for your first Pro month free ($100 value). See the docs: https://lemonslice.com/docs. Pricing is usage-based at $0.12-0.20/min for video generation.
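For a feel of the LiveKit side, here's a minimal Python sketch of joining a session's room and picking up the avatar's video track. It is illustrative only: the LiveKit SDK calls are standard, but the LemonSlice-specific endpoint, request body, and response fields below are placeholders, not the real API shape (that's in the docs).

    import asyncio
    import os

    import httpx
    from livekit import rtc

    async def main() -> None:
        # Placeholder: assume a session endpoint returns a LiveKit room URL and
        # token for an avatar call. Path and field names here are illustrative;
        # see https://lemonslice.com/docs for the real API.
        async with httpx.AsyncClient() as client:
            resp = await client.post(
                "https://api.lemonslice.com/v1/sessions",    # placeholder endpoint
                headers={"Authorization": f"Bearer {os.environ['LEMONSLICE_API_KEY']}"},
                json={"character": "taylor"},                # placeholder body
            )
            session = resp.json()

        # Standard LiveKit client usage: join the room and watch for video tracks.
        room = rtc.Room()

        @room.on("track_subscribed")
        def on_track(track: rtc.Track, pub: rtc.RemoteTrackPublication,
                     participant: rtc.RemoteParticipant) -> None:
            if track.kind == rtc.TrackKind.KIND_VIDEO:
                print(f"avatar video track from {participant.identity}")

        await room.connect(session["livekit_url"], session["token"])
        await asyncio.sleep(60)    # keep the call open for a minute
        await room.disconnect()

    asyncio.run(main())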
Looking forward to your feedback!
EDIT: Tell us what characters you want to see in the comments and we can make them for you to talk to (e.g. Max Headroom)
*We did a Show HN last year for our V1 model: https://news.ycombinator.com/item?id=43785044. It was technically impressive but so bad compared to what we have today.
It's a normal mp4 video that's looping initially (the "welcome message") and then as soon as you send the bot a message, we connect you to a GPU and the call becomes interactive. Connecting to the GPU takes about 10s.
(Update of https://news.ycombinator.com/item?id=43785494)
Questions: what are the main differences between you and anam.ai? They also do real-time lip sync, plus it looks like they are cheaper. Do you optimise for price or quality? And do you focus on lip sync or full movement, etc.?
Anyway, big thumbs up for the LemonSlice team, I'm excited to see it progress. I can definitely see products starting to come alive with tools like this.
My mind is blown! It feels like the first time I used my microphone to chat with AI.
> Today we're releasing our new model* - Lemon Slice 2, a 20B-parameter diffusion transformer that generates infinite-length video at 20fps on a single GPU - and opening up our API.
But after digging around for a while, searching for a Hugging Face link, I'm guessing this was just an unfortunate turn of phrase, and you are not, in fact, releasing an open-weights model that people can run themselves?
Oh well, this looks very cool regardless and congratulations on the release.
The answer is no, because you would eventually release only a subpar model, not your SOTA model.
Also, people don't have the infrastructure to run this at scale (100-500 concurrent users); at best they can run it for 1-2 concurrent users.
This could be a good way for people to test it and then use your infra.
Ah, but you do have an online demo, so you might think that's enough. WRONG.
I did /imagine cheeseburger and /imagine a fire extinguisher, and both were correctly generated, but the agent has no context. When I ask what they are holding, in both cases they ramble about not holding anything and reference lemons and lemon trees.
I expected it to retain the context as the chat continues. If I ask it what it imagined, it just tells me I can use /imagine.
You can also control background motions (like ocean waves, a waterfall, or a car driving).
We are actively training a model that has better text control over hand motions.