Show HN: LemonSlice – Upgrade your voice agents to real-time video

Posted by lcolucci 1/27/2026

Hey HN, we're the co-founders of LemonSlice (try our HN playground here: https://lemonslice.com/hn). We train interactive avatar video models. Our API lets you upload a photo and immediately jump into a FaceTime-style call with that character. Here's a demo: https://www.loom.com/share/941577113141418e80d2834c83a5a0a9

Chatbots are everywhere and voice AI has taken off, but we believe video avatars will be the most common form factor for conversational AI. Most people would rather watch something than read it. The problem is that generating video in real-time is hard, and overcoming the uncanny valley is even harder.

We haven’t broken the uncanny valley yet. Nobody has. But we’re getting close and our photorealistic avatars are currently best-in-class (judge for yourself: https://lemonslice.com/try/taylor). Plus, we're the only avatar model that can do animals and heavily stylized cartoons. Try it: https://lemonslice.com/try/alien. Warning! Talking to this little guy may improve your mood.

Today we're releasing our new model* - Lemon Slice 2, a 20B-parameter diffusion transformer that generates infinite-length video at 20fps on a single GPU - and opening up our API.

How did we get a video diffusion model to run in real-time? There was no single trick, just a lot of them stacked together. The first big change was making our model causal. Standard video diffusion models are bidirectional (they look at frames both before and after the current one), which means you can't stream.

From there it was about fitting everything on one GPU. We switched from full to sliding window attention, which killed our memory bottleneck. We distilled from 40 denoising steps down to just a few - quality degraded less than we feared, especially after using GAN-based distillation (though tuning that adversarial loss to avoid mode collapse was its own adventure).

And the rest was inference work: modifying RoPE from complex to real (this one was cool!), precision tuning, fusing kernels, a special rolling KV cache, lots of other caching, and more. We kept shaving off milliseconds wherever we could and eventually got to real-time.

We set up a guest playground for HN so you can create and talk to characters without logging in: https://lemonslice.com/hn. For those who want to build with our API (we have a new LiveKit integration that we’re pumped about!), grab a coupon code in the HN playground for your first Pro month free ($100 value). See the docs: https://lemonslice.com/docs. Pricing is usage-based at $0.12-0.20/min for video generation.

Looking forward to your feedback!

EDIT: Tell us what characters you want to see in the comments and we can make them for you to talk to (e.g. Max Headroom)

*We did a Show HN last year for our V1 model: https://news.ycombinator.com/item?id=43785044. It was technically impressive but so bad compared to what we have today.

133 points | 132 comments

anigbrowl 1/28/2026|

Absolutely Do Not Want.

EDIT: Tell us what characters you want to see in the comments and we can make them for you to talk to (e.g. Max Headroom)

Sure, that kind of thing is great fun. But photorealistic avatars are gonna be abused to hell and back and everyone knows it. I would rather talk to a robot that looks like a robot, ie C-3PO. I would even chat with scary skeleton terminator. I do not want to talk with convincingly-human-appearing terminator. Constantly checking whether any given human appearing on a screen is real or not is a huge energy drain on my primate brain. I already find it tedious with textual data, doing it on realtime video imagery consumers considerably more energy.

Very impressive tech, well done on your engineering achievement and all, but this is a Bad Thing.

echelon 1/28/2026||

The dichotomy of AI haters and AI dreamers is wild.

OP, I think this is the coolest thing ever. Keep going.

Naysayers have some points, but nearly every major disruptive technology has had downsides that have been abused. (Cars can be used for armed robbery. Steak knives can be used to murder people. Computers can be used for hacking.)

The upsides of tech typically far outweigh the downsides. If a tech is all downsides, then the government just bans it. If computers were so bad, only government labs and facilities would have them.

I get the value in calling out potential dangers, but if we do this we'll wind up with the 70 years where we didn't build nuclear reactors because we were too afraid. As it turns out, the dangers are actually negligible. We spent too much time imagining what would go wrong, and the world is now worse for it.

The benefits of this are far more immense.

While the world needs people who look at the bad in things, we need far more people who dream of the good. Listen to the critiques, allow it to aid in your safety measures, but don't listen to anyone who says the tech is 100% bad and should be stopped. That's anti nuclear rhetoric, and it's just not true.

Keep going!

zestyping 1/29/2026|||

The primary purpose of generating real-time video of realistic-looking talking people is deception. The explicit goal is to make people believe that they're talking to a real person when they aren't.

It's on you to identify the "immense" benefits that outweigh that explicit goal. What are they?

lcolucci 1/28/2026||||

Well put - and thanks, we'll keep building. Still chasing this level of magic: https://youtu.be/gL5PgvFvi8A?si=I__VSDqkXBdBTVvB&t=173 Not to mention language tutors, training experiences, and more.

anigbrowl 1/28/2026|||

I am not an AI hater, I use it every day. I made specific criticisms of why I think photorealistic realtime AI avatars are a problem; you've posted truisms. Please tell me what benefits you expect to reap from this.

nashashmi 1/28/2026||

> this is a Bad Thing.

"Your hackers were so preoccupied with whether or not they could, they didn't stop to think if they should."

armcat 1/28/2026||

This is so awesome, well done LemonSlice team! Super interesting on the ASR->LLM->TTS pipeline, and I agree, you can make it super fast (I did something myself as a 2-hour hobby project: https://github.com/acatovic/ova). I've been following full-duplex models as well and so far couldn't get even PersonaPlex to run properly (without choppiness/latency), but have you peeps tried Sesame, e.g. https://app.sesame.com/?

I played around with your avatars and one thing that it lacks is that it's "not patient", it's rushing the user, so maybe something to try and finetune there? Great work overall!

lcolucci 1/28/2026||

This is good feedback thanks! The "not patient" feeling probably comes from our VAD being set to "eager mode" so that the latency is better. VAD (i.e. deciding when the human has actually stopped talking) is a tough problem in all of voice AI. It basically adds latency to whatever your pipeline's base latency is. Speech2Speech models are better at this.

andrew-w 1/28/2026||

Thank you! Impressive demo with OVA. Still feels very snappy, even fully local. It will be interesting to see how video plays out in that regard. I think we're still at least a year away from the models being good enough and small enough that they can run on consumer hardware. We compared 6 of the major voice providers on TTFB, but didn't try Sesame -- we'll need to give that one a look. https://docs.google.com/presentation/d/18kq2JKAsSahJ6yn5IJ9g...

pickleballcourt 1/27/2026||

One thing I've learnt from movie production is actually what separates professional from amateur quality is in the audio itself. Have you thought about implementing personaplex from NVDIA or other voice models that can both talk and listen at the same time?

Currently the conversation still feels too STT-LLM-TTS that I think a lot of the voice agents suffer from (Seems like only Sesame and NVDIA so far have nailed the natural conversation flow). Still, crazy good work train your own diffusion models, I remember taking a look at the latest literature on diffusion and was mind blown by the advances in last years or so since u-net architecture days.

EDIT: I see that the primary focus is on video generation not audio.

lcolucci 1/27/2026|

This is a good point on audio. Our main priority so far has been reducing latency. In service of that, we were deep in the process of integrating Hume's two-way S2S voice model instead of ElevenLabs. But then we realized that ElevenLabs had made their STT-LLM-TTS pipeline way faster in the past month and left it at that. See our measurements here (they're super interesting): https://docs.google.com/presentation/d/18kq2JKAsSahJ6yn5IJ9g...

But, to your point, there are many benefits of two-way S2S voice beyond just speed.

Using our LiveKit integration you can use LemonSlice with any voice provider you like. The current S2S providers LiveKit offers include OpenAI, Gemini, and Grok and I'm sure they'll add Personaplex soon.

echelon 1/28/2026|||

I'm a filmmaker. While what OP said is 100% true, your instincts are right.

Not only is perfect is the enemy of good enough, you're only looking for PMF signal at this point. If you chase quality right now, you'll miss validation and growth.

The early "Will Smith eating spaghetti" companies didn't need perfect visuals. They needed excited early adopter customers. Now look where we're at.

In the fullness of time, all of these are just engineering problems and they'll all be sorted out. Focus on your customer.

pickleballcourt 1/27/2026|||

Thanks for sharing! Makes sense to go with latency first.

convivialdingo 1/27/2026||

That's super impressive! Definitely one of the best quality conversational agents I've tried syncing A/V and response times.

The text processing is running Qwen / Alibaba?

lcolucci 1/27/2026||

Qwen is the default but you can pick any LLM in the web app (though not the HN playground)

sid-the-kid 1/27/2026||

Thank you! Yes, right now we are using Qwen for the LLM. They also released a super fast TTS model that we have not tried yet, which is supposed to be very fast.

snowmaker 1/28/2026||

I made a golden retriever you can talk to using Lemon Slice: https://lemonslice.com/hn/agent_5af522f5042ff0a8

Having a real-time video conversation with an AI is a trippy feeling. Talk about a "feel the AGI moment", it really does feel like the computer has come alive.

lcolucci 1/28/2026||

So cool! I love how he sometimes looks down and over his glasses

knowitnone3 1/28/2026||

great. you'll never have to talk to another human being ever again

r0fl 1/27/2026||

Pricing is confusing

Video Agents Unlimited agents Up to 3 concurrent calls Creative Studio 1min long videos Up to 3 concurrent generations

Does that mean I can have a total of 1 minute of video calls? Or video calls can only be 1 minute long? Or does it mean I can have unlimited calls, 3 calls at a time all month long?

Can I have different avatars or only the same avatar x 3?

Can I record the avatar and make videos and post on social media?

lcolucci 1/27/2026|

Sorry about the confusion. Video Agents and Creative Studio are two entirely different products. Video Agents = interactive video. Creative Studio = make a video and download it. If you're interested in real-time video calls, then Video Agents is the only pricing and feature set you should look at.

thedangler 1/27/2026||

What happens if I want to make the video on the fly and save that to reuse it when the same question or topic comes up. No need to render a video. Just play the existing one.

andrew-w 1/28/2026||

This isn't natively supported -- we are continuously streaming frames throughout the conversation session that are generated in real-time. If you were building your own conversational AI pipeline (e.g. using our LiveKit integration), I suppose it would be possible to route things like this with your own logic. But it would probably include jump cuts and not look as good.

skandan 1/27/2026||

Wow this team is non-stop!!! Wild that this small crew is dropping hit after hit. Is there an open polymarket on who acquires them?

lcolucci 1/27/2026|

haha thank you so much! The team is incredible - small but mighty

jonsoft 1/28/2026||

I asked the Spanish tutor if he/it was familiar with the terms seseo[0] and ceceo[1] and he said it wasn't, which surprised me. Ideally it would be possible to choose which Spanish dialect to practise as mainland Spain pronunciation is very different to Latin America. In general it didn't convince me it was really hearing how I was pronouncing words, an important part of learning a language. I would say the tutor is useful for intermediate and advanced speakers but not beginners due to this and the speed at which he speaks.

At one point subtitles written in pseudo Chinese characters were shown; I can send a screenshot if this is useful.

The latency was slightly distracting, and as others have commented the NVIDIA Personaplex demos [2] are very impressive in this regard.

In general, a very positive experience, thank you.

[0] https://en.wikipedia.org/wiki/Phonological_history_of_Spanis... [1] https://en.wikipedia.org/wiki/Phonological_history_of_Spanis... [2] https://research.nvidia.com/labs/adlr/personaplex/

andrew-w 1/28/2026||

Thanks for the feedback. The current avatars use a STT-LLM-TTS pipeline (rather than true speech-to-speech), which limits nuanced understanding of pronunciations. Speech-to-speech models should solve this problem. (The ones we've tried so far have counterintuitively not been fast enough.)

sid-the-kid 1/28/2026||

ooof. You saw the Chinese text. Yup, that's super annoying. We are trying to squash that hallucination.

Thanks for the feedback! That's helpful!

Terretta 1/28/2026||

the chinese text happened last night in your main chat agent widget, the cartoon woman professing to be in a town in brazil with a lemon tree on her cupboard. she claimed it was a test of subtitling then admitted it wasn't.

btw, she gives helpful instructions like "/imagine" whatever but the instructions only seem to work about 50% of the time. meaning, try the same command or variants a few times, and it works about half of them. she never did shift out of aussie accent though.

she came up with a remarkably fanciful explanation why as a brazilian she sounded aussie and why imagining native accent like she said would work didn't...

i was shocked when /imagine face left turn to the side did actually work, the agent was in side profile and precisely as natural as the original front facing avatar

all in all, by far the best agent experience i've played with!

andrew-w 1/28/2026||

So glad you enjoyed it! We've been able to significantly reduce those text hallucinations with a few tricks, but it seems they haven't been fully squashed. The /imagine command only works with the image at the moment, but we'll think about ways to tie that into the personality and voice. Thanks for the feedback!

zestyping 1/29/2026||

When you generate real-time video of realistic-looking talking characters, the definition of success is fooling people into believing they are talking to a real person when they aren't.

If you pursue this, your explicit goal is deception, and it's a massively harmful kind of deception. I don't see how you can claim to be operating ethically here if that's your goal.

lcolucci 1/30/2026|

Do you think the same about text that is indistinguishable from human-written text (LLM chatbots)? Or voice that is indistinguishable from a human talking?

Illegal things, like fraud and impersonation, are illegal. There's a difference between the tool and the actions people do with the tool.

There are tons of useful applications of interactive avatars - from corporate training to kids education to language learning and more. Plus, why would you want to stop this little guy from existing in the world? :) https://lemonslice.com/try/alien

zestyping 1 day ago||

I don't think the same of them because they are not the same thing. Can you not see that the potential for harm is far greater? You can't simply ignore the potential uses of the technology you create. You have the choice to design your technology so it retains its usefulness while limiting the harm; have you given any time to thinking about how you could do that?

The alien is a diversion from the concern; I'm talking about realistic human avatars. Let's stay focused on that.

Let me suggest a worthwhile exercise. Just take ten minutes. What are some of the ways that realistic human avatars would make deception more effective or more scalable than previously possible?

Come up with three scenarios, and let's talk about them, honestly and thoughtfully.

Escapado 1/28/2026|

This was interesting. Had a 5 minute chat with the outsider from the dishonored series. Just a one sentence prompt and its phrasing was at least 60% there, but less cold and nicer in a sense than the video game counterpart. Still an interesting experiment. But I also know that maybe 12-24 months down the line, once this is available in real time on device there will be an ungodly amount of smut coming from this.

More comments...