Posted by pain_perdu 1 day ago

Pocket TTS: A high quality TTS that gives your CPU a voice (kyutai.org)
553 points | 126 comments | page 3
lykahb 12 hours ago|
It'd be great if it supported stdin and stdout for text and WAV. Then it could be piped right into afplay.
gabrieldemarm 6 hours ago|
Gabriel from Kyutai here, we do support outputting wav to stdout. We don't support reading text from stdin but that should be easy enough. Feel free to drop a pull request!
anonymous344 5 hours ago||
It doesn't seem to know the Thai language. Can anybody suggest a Thai TTS?
OfflineSergio 14 hours ago||
This is amazing. The audio feels very natural and it's fairly good at handling complex text-to-speech tasks. I've been working on WithAudio (https://with.audio). Currently it only uses Kokoros. I need to test this a bit more but I might actually add it to the app. It's too good to be ignored.
syntaxing 18 hours ago||
Is there something similar for STT? I'm using Whisper distill models and they work OK, but sometimes they get what I say completely wrong.
daemonologist 18 hours ago||
Parakeet is not really more accurate than Whisper, but it's much faster - faster than realtime even on CPU: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3 . You have to use NeMo though, or mess around with third-party conversions. (Also has a big brother Canary: https://huggingface.co/nvidia/canary-1b-v2. There's also the confusingly named/positioned Nemotron speech: https://huggingface.co/nvidia/nemotron-speech-streaming-en-0...)
satvikpendem 17 hours ago|||
Keep in mind Parakeet is pretty limited in the number of languages it supports compared to Whisper.
jokethrowaway 8 hours ago|||
Parakeet feels much more accurate in practice than Whisper; it was a real "a-ha" moment for me.

Of course, English only

phoronixrly 18 hours ago||
From the other day: https://github.com/cjpais/Handy
tschellenbach 18 hours ago||
It's cool how lightweight it is. Recently added Pocket support to Vision Agents. https://github.com/GetStream/Vision-Agents/tree/main/plugins...
GaggiX 19 hours ago||
I love that everyone is making their own TTS model, as they are not as expensive to train as many other models. Also, there are plenty of different architectures.

Another recent example: https://github.com/supertone-inc/supertonic

andai 18 hours ago||
In-browser demo of Supertonic with WASM:

https://huggingface.co/spaces/Supertone/supertonic-2

coder543 18 hours ago|||
Another one is Soprano-1.1.

It seems like it is being trained by one person, and it is surprisingly natural for such a small model.

I remember when TTS always meant the most robotic, barely comprehensible voices.

https://www.reddit.com/r/LocalLLaMA/comments/1qcusnt/soprano...

https://huggingface.co/ekwek/Soprano-1.1-80M

nowittyusername 12 hours ago|||
Thanks for the heads-up, this looks really interesting and the claimed speed is nuts.
nunobrito 19 hours ago||
Thank you. Very good suggestion with code available and bindings for so many languages.
aidenn0 13 hours ago||
I'm sure I'm being stupid, but every voice except "alba" I recognize from Les Miserables; is there a character I'm forgetting?
vvolhejn 9 hours ago|
Václav from Kyutai here. Yes, the original naming scheme was from Les Miserables, glad you noticed! We just stuck to Alba because that's the real name of the voice actor who provided the voice sample to us (see https://huggingface.co/kyutai/tts-voices); the other ones are either from pre-existing datasets or given anonymously.
_ache_ 14 hours ago||
It's very impressive! I mean, it's better than the other <200M TTS models I've encountered.

In English, it's perfect, and it's so funny in other languages. It sounds exactly like someone who doesn't actually speak the language, but got it anyway.

I don't know why Fantine is just better than the others in other languages. Javert seems to be the worst.

Try Jean in Spanish: « ¡Es lo suficientemente pequeño como para caber en tu bolsillo! » sounds a lot like they don't understand the language.

Or Azelma in French: « C'est suffisamment petit pour tenir dans ta poche. » is very good. I mean, half of the words are in a Québécois accent and half in a French one, but hey, it's correct French.

Però non capisce l'italiano. (But it doesn't understand Italian.)

indigodaddy 16 hours ago||
Perfect timing, that is exactly what I am looking for, for a fun little thing I'm working on. The voices sound good!
maxglute 10 hours ago|
It would be nice if the preview supported variable speed.