Posted by pain_perdu 1/15/2026
Speech-dispatcher commonly uses espeak-ng, which sounds robotic but is reportedly better for visually impaired users because it stays intelligible at very high speaking rates. That lets them hear UI labels more quickly. Non-visually-impaired users generally want natural-sounding voices, using TTS the way we'd listen to a podcast or a bedtime story.
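If you want to hear what that sounds like, here's a minimal sketch using speech-dispatcher's Python bindings (assumes `python3-speechd` is installed and the daemon is running; the client name and rate value are just illustrative):

```python
# Hedged sketch: crank up speech-dispatcher's rate, the way screen-reader
# users often run it. Assumes the speechd Python bindings are installed
# (python3-speechd on Debian/Ubuntu) and speech-dispatcher is running.
import speechd

client = speechd.SSIPClient("rate-demo")
client.set_rate(80)  # range -100..100; screen-reader users often run near the top
client.speak("File menu. Edit menu. View menu.")
client.close()
```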
With this system, users are in full control and can swap TTS models easily. If a specific model were shipped and, two weeks later, a smaller, newer, or better one appeared, that work would become obsolete very quickly.
For voice cloning, Pocket TTS is gated behind an access wall, so I can't tell.
It seems like Kokoro is the smaller model; it also runs on CPU in real time, and it's more open and fine-tunable, with more scripts, extensions, etc., whereas this is new and doesn't have any fine-tuning code yet.
I couldn't tell an audio quality difference.
There's a bunch of inference stuff, though, which is cool I guess. And it really is quite a nice little model in its niche. But let's not pretend there aren't huge tradeoffs in the design: synthetic data, phonemization, lack of training code, sharp boundary effects, etc.
If it were a big model trained on a diverse set of speakers, able to remember how to replicate them all, then zero-shot would potentially be a bigger deal. But this is a tiny model.
I'll try out the zero shot functionality of Pocket TTS and report back.
Btw, I would love to hear from someone (who knows what they're talking about) to clear this up for me. Dealing with potential GPL contamination is a nightmare.
If you could find another compatible converter, you could probably swap eSpeak out for it. Its phoneme output could be a bit OOD for the model, so you may need to fiddle with it, but it should work.
Because the GPL is outdated and doesn't really consider modern generative AI, what you could also do is generate a bunch of text-to-phoneme pairs with eSpeak and train your own transformer on them (see the sketch below). This would free you from the GPL license completely, and the task is easy enough that even a very small model should be able to do it.
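Something like this for the pair-generation half (a sketch only: `corpus.txt`, the JSONL output format, and the voice choice are mine, and training the small model itself is left out):

```python
# Hedged sketch of the data-generation step. Assumes espeak-ng is on PATH;
# "corpus.txt" and the JSONL output format are my own choices. Note that a
# real TTS frontend may also want stress and phrase-break markup, which
# --ipa alone doesn't give you.
import json
import subprocess

def phonemize(text: str, voice: str = "en-us") -> str:
    """Return espeak-ng's IPA transcription of `text` (-q suppresses audio)."""
    out = subprocess.run(
        ["espeak-ng", "-q", "--ipa", "-v", voice, text],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

with open("corpus.txt", encoding="utf-8") as src, \
     open("g2p_pairs.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        line = line.strip()
        if line:
            dst.write(json.dumps({"text": line, "ipa": phonemize(line)},
                                 ensure_ascii=False) + "\n")
```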
Just made it an MCP server so Claude can tell me when it's done with something :)
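Roughly like this, if you're curious — a minimal sketch, not the actual setup: the server name and tool are illustrative, and macOS `say` stands in for whatever local TTS backend (pocket-tts, Kokoro, ...) you actually run:

```python
# Minimal sketch using the official MCP Python SDK (`pip install mcp`).
# macOS `say` is a placeholder for your real TTS backend.
import subprocess
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("notify-tts")

@mcp.tool()
def speak(message: str) -> str:
    """Read a short status update out loud on the host machine."""
    subprocess.run(["say", message], check=True)
    return "spoken"

if __name__ == "__main__":
    mcp.run()  # stdio transport; point your MCP client config at this script
```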
How am I supposed to enable this?
It says MIT license, but then the README has a separate section on prohibited uses that maybe adds restrictions making it nonfree? Not sure about the legal implications here.
If a license says "you may use this, you are prohibited from using this", and I use it, did I break the license?
In this case, I'd interpret it as them making up a new license based on MIT, but the addendum makes it not MIT, just something else. I agree with what others said; this "new" license has internal conflicts.
Simply put, it's MIT licensed. If they want to change that, they have to remove that license file OR clearly update it to be a modified version of MIT.
So, on my M1 Mac, I ran `uvx pocket-tts serve`. Plugged in:
> It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way—in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only
(Beginning of Tale of Two Cities)
But the problem is Javert skips over parts of sentences! E.g., it starts:
> "It was the best of times, it was the worst of times, it was the age of wisdom, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the spring of hope, it was the winter of despair, we had everything before us, ..."
Notice how it skips over "it was the age of foolishness," and "it was the season of Darkness,".
Which... Doesn't exactly inspire faith in a TTS system.
(Marius seems better; posted https://github.com/kyutai-labs/pocket-tts/issues/38)
- "its noisiest superlative insisted on its being received"
(Win10, RTX 5070 Ti)
I also find Javert in particular seems to put in huge gaps and spaces... side effect of the voice?
Basically, yes, sort of expected: we don't have detailed enough control to prevent it fully. We can measure how often it happens and train better models, but there's no 100% guarantee. The bigger the model, the less this happens, but this one is tiny, so it's not the sharpest tool in the shed. Hallucinated insertions can theoretically happen too, but I haven't observed them with this model yet.
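One way to measure it, as a rough sketch: align the input text against an ASR transcript of the synthesized audio and collect the deletions. Producing `transcript` (e.g. with Whisper) is left out here; the normalization and function name are my own.

```python
# Hedged sketch: find word spans present in the reference text but
# missing from an ASR transcript of the synthesized audio.
import difflib
import re

def skipped_spans(reference: str, transcript: str) -> list[str]:
    """Word spans in `reference` that are absent from `transcript`."""
    words = lambda s: re.findall(r"[a-z']+", s.lower())
    ref, hyp = words(reference), words(transcript)
    sm = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)
    return [" ".join(ref[i1:i2])
            for tag, i1, i2, _, _ in sm.get_opcodes() if tag == "delete"]

print(skipped_spans("the quick brown fox jumps over the lazy dog",
                    "the quick fox jumps over the dog"))  # ['brown', 'lazy']
```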
I wonder what's going wrong in there
Another recent example: https://github.com/supertone-inc/supertonic
It seems to be trained by one person, and it sounds surprisingly natural for such a small model.
I remember when TTS always meant the most robotic, barely comprehensible voices.
https://www.reddit.com/r/LocalLLaMA/comments/1qcusnt/soprano...
Ok, who knows where I can get those high-quality recordings of Majel Barrett's voice that she made before she died?
Like, what if I want to graft TTS onto an existing text chat system and give each person a unique, randomly generated voice? Or want to get something that's not quite human, like some sort of alien or monster?