Pocket TTS: A high quality TTS that gives your CPU a voice

Posted by pain_perdu 1/15/2026

Pocket TTS: A high quality TTS that gives your CPU a voice (kyutai.org)

635 points | 158 comments | page 3

syntaxing 1/15/2026|

Is there something similar for STT? I'm using Whisper distill models and they work OK, but sometimes they get what I say completely wrong.

daemonologist 1/16/2026||

Parakeet is not really more accurate than Whisper, but it's much faster, faster than realtime even on CPU: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3 . You have to use NeMo though, or mess around with third-party conversions. (It also has a big brother, Canary: https://huggingface.co/nvidia/canary-1b-v2. There's also the confusingly named/positioned Nemotron speech: https://huggingface.co/nvidia/nemotron-speech-streaming-en-0...)

jokethrowaway 1/16/2026|||

Parakeet feels much more accurate in practice than Whisper; it was a real "a-ha" moment for me.

Of course, English only

satvikpendem 1/16/2026|||

Keep in mind Parakeet is pretty limited in the number of languages it supports compared to Whisper.

phoronixrly 1/15/2026||

From the other day: https://github.com/cjpais/Handy

smallerfish 1/16/2026||

Hopefully the browsers will improve their built-in TTS soon. It's still pretty unusable unless you really need it.

sysworld 1/16/2026|

And OSes. Mac has some decent models, but Kokoro is much better. Even this one is better.

tschellenbach 1/15/2026||

It's cool how lightweight it is. Recently added support to Vision Agents for Pocket. https://github.com/GetStream/Vision-Agents/tree/main/plugins...

britannio 1/16/2026||

This is impressive, but in a sample I tried, it switched language on the second paragraph. I'm on an M4 Pro MacBook.

https://gist.github.com/britannio/481aca8cb81a70e8fd5b7dfa2f...

Zardoz84 1/16/2026||

I miss the old days, when connecting an SP0256 to the Spectrum and making it speak looked like magic.

_ache_ 1/16/2026||

It's very impressive! I mean, it's better than the other <200M TTS models I've encountered.

In English, it's perfect, and it's so funny in other languages. It sounds exactly like someone who doesn't actually speak the language but goes for it anyway.

I don't know why Fantine is just better than the others in other languages. Javert seems to be the worst.

Try Jean in Spanish: « ¡Es lo suficientemente pequeño como para caber en tu bolsillo! » sounds a lot like they don't understand the language.

Or Azelma in French: « C'est suffisament petit pour tenir dans ta poche. » is very good. I mean, half of the words are in a Québécois accent and half in a French one, but hey, it's correct French.

Però non capisce l'italiano. ("But it doesn't understand Italian.")

lykahb 1/16/2026||

It'd be great if it supported stdin and stdout for text and wav. Then it could get piped right into afplay.

gabrieldemarm 1/16/2026|

Gabriel from Kyutai here, we do support outputting wav to stdout. We don't support reading text from stdin but that should be easy enough. Feel free to drop a pull request!
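For anyone picturing the plumbing discussed above, here is a minimal sketch of a text-in, WAV-out filter. It does not use Pocket TTS at all: `synthesize_placeholder`, `text_to_wav_bytes`, `run`, and the 24 kHz sample rate are hypothetical names and values chosen for illustration, and the "voice" is just a test tone standing in for a real model.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 24_000  # assumed sample rate, for illustration only


def synthesize_placeholder(text: str) -> list[int]:
    """Hypothetical stand-in for a TTS model: 50 ms of 440 Hz tone per character."""
    n_samples = int(SAMPLE_RATE * 0.05) * len(text)
    return [
        int(12_000 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE))
        for i in range(n_samples)
    ]


def text_to_wav_bytes(text: str) -> bytes:
    """Encode the placeholder audio as a mono 16-bit PCM WAV held in memory."""
    samples = synthesize_placeholder(text)
    buf = io.BytesIO()
    # Writing to an in-memory buffer first avoids seeking on a pipe,
    # which a WAV writer may need in order to fill in the header sizes.
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buf.getvalue()


def run(text_in, wav_out) -> None:
    """Read all text from a text stream and write one WAV file to a binary stream."""
    wav_out.write(text_to_wav_bytes(text_in.read().strip()))
```

In a script, ending with `run(sys.stdin, sys.stdout.buffer)` (after `import sys`) would wire this to a real pipe; swapping the placeholder for an actual model call is the part left open here.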

agentifysh 1/16/2026||

Just added it to my Codex plugin that reads a summary of what it finished after each turn, and I am spooked! It runs well on my MacBook, much better than Samantha!

https://github.com/agentify-sh/speak/

gabrieldemarm 1/16/2026|

[dead]

exceptione 1/16/2026||

Question: does anyone recommend a TTS that automatically recognizes emotion from the text itself?

sofixa 1/16/2026||

Gradium (https://gradium.ai/), a commercial offshoot of Kyutai (the open source lab), is focusing on emotion (both recognising emotion and understanding what emotion to use depending on context). I don't think any of their existing public models does that yet, but they demoed it pretty impressively at the ai-Pulse conference.

fluoridation 1/16/2026||

Chatterbox does something like that. For example, if the input is

"so and so," he <verb>

and the verb is not just "said", but "chuckled", or "whispered", or "said shakily", the output is modified accordingly, or if there's an indication that it's a woman speaking it may pitch up during the quotation. It also tries to guess emotive content from textual content, so if a passage reads as angry it may try to make it sound angry. That's more hit-and-miss, but when it hits, it hits really well. A very common failure case: imagine someone is trying to psych themselves up and they say internally "come on, Steve, stand up and keep going". It'll read it in a deeper voice, as if it were being spoken by a WW2 sergeant to a soldier.

exceptione 1/16/2026||

Thank you!

butz 1/16/2026|

How large is the model, and is it possible to train it to read other languages, not only English?

butz 1/16/2026|

After pip install pocket-tts, the dependencies add up to 7.4 GB. And it generates at 2x speed on CPU. Neat!
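A figure like "2x speed" is usually read as seconds of audio produced per second of compute. Assuming that reading, the arithmetic can be sketched as below; `speed_factor` and `measure` are hypothetical helper names, and `generate` stands for any callable that returns a sequence of audio samples.

```python
import time
from typing import Callable, Sequence


def speed_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio per second of compute; values above 1.0 beat realtime."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def measure(generate: Callable[[], Sequence], sample_rate: int) -> float:
    """Time one call to `generate` and relate it to the audio length it returned."""
    start = time.perf_counter()
    samples = generate()
    elapsed = time.perf_counter() - start
    return speed_factor(len(samples) / sample_rate, elapsed)
```

Under this definition, 10 seconds of speech generated in 5 seconds gives `speed_factor(10.0, 5.0) == 2.0`.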
