Posted by pain_perdu 1 day ago
Ok, who knows where I can get those high-quality recordings of Majel Barrett's voice that she made before she died?
[1] https://data.norge.no/en/datasets/220ef03e-70e1-3465-a4af-ed...
Speech-dispatcher commonly uses espeak-ng, which sounds robotic but is reportedly better for visually impaired users because it remains intelligible at higher speeds. This lets visually impaired users hear UI labels more quickly. For users who are not visually impaired, we generally want natural-sounding voices and to use TTS the same way we would listen to podcasts or a bedtime story.
With this system, users are in full control and can swap TTS models easily. If a model is shipped and, two weeks later, a smaller, newer, or better one appears, their work would become obsolete very quickly.
Just made it an MCP server so Claude can tell me when it's done with something :)
How am I supposed to enable this?
For voice cloning, Pocket TTS is walled, so I can't tell.
It seems like Kokoro is the smaller model, also runs on CPU in real time, and is more open and fine-tunable. It has more scripts, extensions, etc., whereas this is new and doesn't have any fine-tuning code yet.
I couldn't tell an audio quality difference.
There's a bunch of inference stuff though, which is cool I guess. And it really is quite a nice little model in its niche. But let's not pretend there aren't huge tradeoffs in the design: synthetic data, phonemization, lack of training code, sharp boundary effects, etc.
If it were a big model that was trained on a diverse set of speakers and could remember how to replicate them all, then zero-shot would potentially be a bigger deal. But this is a tiny model.
I'll try out the zero-shot functionality of Pocket TTS and report back.
Btw, I would love to hear from someone (who knows what they're talking about) to clear this up for me. Dealing with potential GPL contamination is a nightmare.
If you could find another compatible converter, you could probably replace eSpeak with it. The data could be a bit OOD, so you may need to fiddle with it, but it should work.
Because the GPL is outdated and doesn't really consider modern gen AI, what you could also do is generate a bunch of text-to-phoneme pairs with eSpeak and train your own transformer on them. This would free you from the GPL license completely, and the task is easy enough that even a very small model should be able to do it.
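A minimal sketch of the pair-generation step, assuming the `espeak-ng` CLI is installed and on PATH. The helper names `espeak_ipa` and `build_pairs` are made up for illustration; the phonemizer is passed in as a parameter so it can be swapped for another converter.

```python
import subprocess


def espeak_ipa(text, voice="en-us"):
    """Return an IPA transcription from the espeak-ng CLI.

    Assumption: espeak-ng is installed and on PATH. -q suppresses audio,
    --ipa prints IPA phonemes, -v selects the voice.
    """
    result = subprocess.run(
        ["espeak-ng", "-q", "--ipa", "-v", voice, text],
        capture_output=True,
        text=True,
        check=True,
    )
    # Collapse newlines and repeated spaces into single spaces.
    return " ".join(result.stdout.split())


def build_pairs(texts, phonemize=espeak_ipa):
    """Build (text, phonemes) training pairs, skipping blank inputs."""
    pairs = []
    for raw in texts:
        text = raw.strip()
        if text:
            pairs.append((text, phonemize(text)))
    return pairs
```

Calling `build_pairs(["hello world", "text to speech"])` would give a list of tuples that could be written out as a TSV for training a small grapheme-to-phoneme model.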
So, on my M1 Mac, I ran `uvx pocket-tts serve` and plugged in:
> It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way—in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only
(Beginning of A Tale of Two Cities)
but the problem is Javert skips over parts of sentences! E.g., it starts:
> "It was the best of times, it was the worst of times, it was the age of wisdom, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the spring of hope, it was the winter of despair, we had everything before us, ..."
Notice how it skips over "it was the age of foolishness," and "it was the season of Darkness,"
Which... Doesn't exactly inspire faith in a TTS system.
(Marius seems better; posted https://github.com/kyutai-labs/pocket-tts/issues/38)
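One way to check for dropped clauses like this, as a rough sketch: split both the input text and a transcript of what was actually spoken on commas, then diff the two sequences. The transcript is assumed to come from somewhere else (typed by hand or from a speech recognizer); `split_clauses` and `skipped_clauses` are hypothetical helper names, not part of pocket-tts.

```python
import difflib
import re


def split_clauses(text):
    """Split text on commas/semicolons and normalize each clause."""
    clauses = []
    for part in re.split(r"[,;]", text):
        # Lowercase, drop punctuation, collapse whitespace.
        cleaned = re.sub(r"[^a-z' ]", "", part.lower())
        cleaned = " ".join(cleaned.split())
        if cleaned:
            clauses.append(cleaned)
    return clauses


def skipped_clauses(source, heard):
    """Return clauses of `source` with no aligned match in `heard`."""
    src = split_clauses(source)
    out = split_clauses(heard)
    matcher = difflib.SequenceMatcher(a=src, b=out, autojunk=False)
    missing = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete" = clause absent from output; "replace" = clause altered.
        if tag in ("delete", "replace"):
            missing.extend(src[i1:i2])
    return missing
```

Feeding it the input passage and the transcribed output returns the clauses the voice dropped or garbled, in source order, which makes a skipped-text report easy to attach to an issue.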
- "its noisiest superlative insisted on its being received"
Win10 RTX 5070 Ti
I wonder what's going wrong in there
It says MIT license, but then the README has a separate section on prohibited use that maybe adds restrictions to make it nonfree? Not sure of the legal implications here.
If a license says "you may use this, you are prohibited from using this", and I use it, did I break the license?
In this case, I'd interpret it as them having made up a new license based on MIT, where the addendum turns it into something other than MIT. I agree with what others said: this "new" license has internal conflicts.
Simply, it's MIT licensed. If they want to change that, they have to remove that license file OR clearly update it to be a modified version of MIT.
All too often, new models' codebases are just a dump of code that installs half the universe in dependencies for no reason, etc.