Posted by pain_perdu 1/15/2026

Pocket TTS: A high quality TTS that gives your CPU a voice(kyutai.org)
635 points | 158 comments | page 3
syntaxing 1/15/2026|
Is there something similar for STT? I’m using Whisper distill models and they work OK, but sometimes they get what I say completely wrong.
daemonologist 1/16/2026||
Parakeet is not really more accurate than Whisper, but it's much faster - faster than realtime even on CPU: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3 . You have to use Nemo though, or mess around with third-party conversions. (Also has a big brother Canary: https://huggingface.co/nvidia/canary-1b-v2. There's also the confusingly named/positioned Nemotron speech: https://huggingface.co/nvidia/nemotron-speech-streaming-en-0...)
jokethrowaway 1/16/2026|||
Parakeet feels much more accurate in practice than Whisper; it was a real "a-ha" moment for me.

Of course, English only

satvikpendem 1/16/2026|||
Keep in mind Parakeet is pretty limited in the number of languages it supports compared to Whisper.
phoronixrly 1/15/2026||
from the other day https://github.com/cjpais/Handy
smallerfish 1/16/2026||
Hopefully the browsers will improve their built-in TTS soon. It's still pretty unusable unless you really need it.
sysworld 1/16/2026|
And OSes. Mac has some decent models, but Kokoro is much better. Even this one is better.
tschellenbach 1/15/2026||
It's cool how lightweight it is. Recently added support to Vision Agents for Pocket. https://github.com/GetStream/Vision-Agents/tree/main/plugins...
britannio 1/16/2026||
This is impressive, but in a sample I tried, it switched language in the second paragraph. I'm on an M4 Pro MacBook.

https://gist.github.com/britannio/481aca8cb81a70e8fd5b7dfa2f...

Zardoz84 1/16/2026||
I miss the old days when connecting an SP0256 to the Spectrum and making it speak looked like magic.
_ache_ 1/16/2026||
It's very impressive! I mean, it's better than the other <200M TTS models I've encountered.

In English, it's perfect, and it's so funny in other languages. It sounds exactly like someone who actually doesn't speak the language, but got it anyway.

I don't know why Fantine is just better than the others in other languages. Javert seems to be the worst.

Try Jean in Spanish: « ¡Es lo suficientemente pequeño como para caber en tu bolsillo! » sounds a lot like they don't understand the language.

Or Azelma in French: « C'est suffisament petit pour tenir dans ta poche. » is very good. I mean, half of the words have a Québécois accent and half a French one, but hey, it's correct French.

Però non capisce l'italiano. (But it doesn't understand Italian.)

lykahb 1/16/2026||
It'd be great if it supported stdin and stdout for text and WAV. Then it could get piped right into afplay.
gabrieldemarm 1/16/2026|
Gabriel from Kyutai here, we do support outputting wav to stdout. We don't support reading text from stdin but that should be easy enough. Feel free to drop a pull request!
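
For anyone wanting to prototype the stdin side, here is a minimal sketch of the "text in, WAV out" shape in Python. It does not use Pocket TTS at all: `fake_synthesize` is a hypothetical stand-in that emits a sine tone, `SAMPLE_RATE` is an assumed value, and `pipe_text_to_wav` is a name made up for illustration. The WAV is built in memory first because the stdlib `wave` writer may try to seek back and patch its header, which a pipe does not allow.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 24_000  # assumed sample rate, for illustration only


def fake_synthesize(text: str) -> list[int]:
    """Hypothetical stand-in for a TTS engine: 50 ms of 440 Hz tone per character."""
    n_samples = int(SAMPLE_RATE * 0.05) * len(text)
    return [
        int(0.3 * 32767 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE))
        for i in range(n_samples)
    ]


def to_wav_bytes(samples: list[int], sample_rate: int = SAMPLE_RATE) -> bytes:
    """Pack 16-bit mono PCM samples into an in-memory WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buf.getvalue()


def pipe_text_to_wav(text_in, wav_out) -> None:
    """Read all text from a text stream and write one WAV file to a binary stream."""
    text = text_in.read().strip()
    wav_out.write(to_wav_bytes(fake_synthesize(text)))
```

In a script, the entry point would call `pipe_text_to_wav(sys.stdin, sys.stdout.buffer)` after `import sys`. Whatever sits on the other end of the pipe has to accept WAV data on stdin, which is worth checking for the player you have in mind.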
agentifysh 1/16/2026||
Just added it to my Codex plugin that reads a summary of what it finished after each turn, and I am spooked! It runs well on my MacBook, much better than Samantha!

https://github.com/agentify-sh/speak/

exceptione 1/16/2026||
Question: does anyone recommend a TTS that automatically recognizes emotion from the text itself?
sofixa 1/16/2026||
Gradium (https://gradium.ai/), a commercial offshoot of Kyutai (an open source lab), is focusing on emotion (both recognising emotion and working out which emotion to use depending on context). I don't think any of their existing public models does that yet, but they demoed it pretty impressively at the ai-Pulse conference.
fluoridation 1/16/2026||
Chatterbox does something like that. For example, if the input is

"so and so," he <verb>

and the verb is not just "said", but "chuckled", or "whispered", or "said shakily", the output is modified accordingly, or if there's an indication that it's a woman speaking, it may pitch up during the quotation. It also tries to guess emotive content from textual content, so if a passage reads as angry, it may try to make it sound angry. That's more hit-and-miss, but when it hits, it hits really well. A very common failure case: imagine someone is trying to psych themselves up and they say internally "come on, Steve, stand up and keep going". It'll read it in a deeper voice, as if it were being spoken by a WW2 sergeant to a soldier.

exceptione 1/16/2026||
Thank you!
butz 1/16/2026|
How large is the model, and is it possible to train it to read other languages, not only English?
butz 1/16/2026|
After pip install pocket-tts, all dependencies are 7.4 GB. And it generates at 2x speed on CPU. Neat!