Pocket TTS: A high quality TTS that gives your CPU a voice

Posted by pain_perdu 1/15/2026

Pocket TTS: A high quality TTS that gives your CPU a voice (kyutai.org)

635 points | 158 comments

derHackerman 1/16/2026|

I read this, then realized I needed a browser extension to read my long case study to me, so I built a browser interface for this:

https://github.com/lukasmwerner/pocket-reader

laszbalo 1/16/2026|

You can do the same thing with Firefox's Reader Mode. On Linux you have to set up speech-dispatcher to use your favorite TTS as a backend. Once it is set up, there will be an option to listen to the page.

mentalgear 1/16/2026||

Firefox should integrate that in their Reader Mode (the default system voices are often very unlistenable). It would seem like an easy win, and it's a non-AI feature so not polarising.

laszbalo 1/16/2026||

Not sure about macOS or Windows, but on Linux Firefox uses speech-dispatcher, which is a server, and Firefox is the client. Speech-dispatcher then delegates the text to the correct TTS backend. It basically runs a shell command, either sending the text to a TTS HTTP server using curl, or piping it to the standard input of a TTS binary.

Speech-dispatcher commonly uses espeak-ng, which sounds robotic but is reportedly better for visually impaired users, because at higher speeds it is still intelligible. This allows visually impaired users to hear UI labels more quickly. For non visually impaired users, we generally want natural sounding voices and to use TTS in the same way we would listen to podcasts or a bedtime story.

With this system, users are in full control and can swap TTS models easily. If Firefox shipped a model itself and, two weeks later, a smaller, newer, or better one appeared, that work would become obsolete very quickly.
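To make the "pipe it to the standard input of a TTS binary" idea concrete, here is a minimal Python sketch. It is not speech-dispatcher's actual code or configuration; `speak_via_stdin` and `FAKE_TTS` are hypothetical names, and the stand-in command only upper-cases its input so the example runs without any speech software installed.

```python
import subprocess
import sys

# Stand-in for a real TTS binary (an assumption for illustration only):
# a tiny Python process that reads stdin and writes it back upper-cased.
FAKE_TTS = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]


def speak_via_stdin(text: str, command: list[str]) -> bytes:
    """Pipe `text` to the standard input of `command` and return its stdout.

    A real backend would usually play audio or write a file instead of
    returning bytes on stdout.
    """
    result = subprocess.run(
        command,
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,  # raise if the backend exits with an error
    )
    return result.stdout
```

Swapping TTS engines then amounts to passing a different `command` list, which is the flexibility the comment describes.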

Barbing 1/16/2026||

Fascinating. Might be part of why I’ve seen some folks have such love for old voices like Fred.

armcat 1/15/2026||

Oh this is sweet, thanks for sharing! I've been a huge fan of Kokoro and even set up my own fully-local voice assistant [1]. Will definitely give Pocket TTS a go!

[1] https://github.com/acatovic/ova

gropo 1/15/2026||

Kokoro is better for tts by far

For voice cloning, pocket tts is walled so I can't tell

echelon 1/16/2026|||

What are the advantages of PocketTTS over Kokoro?

It seems like Kokoro is the smaller model, also runs on CPU in real time, and is more open and fine tunable. More scripts and extensions, etc., whereas this is new and doesn't have any fine tuning code yet.

I couldn't tell an audio quality difference.

hexaga 1/16/2026|||

Kokoro is fine tunable? Speaking as someone who went down the rabbit hole... it's really not. There's no (as of last time I checked) training code available so you need to reverse engineer everything. Beyond that the model is not good at doing voices outside the existing voicepacks: simply put, it isn't a foundation model trained on internet scale data. It is made from a relatively small set of focused, synthetic voice data. So, a very narrow distribution to work with. Going OOD immediately tanks perceptual quality.

There's a bunch of inference stuff though, which is cool I guess. And it really is a quite nice little model in its niche. But let's not pretend there aren't huge tradeoffs in the design: synthetic data, phonemization, lack of train code, sharp boundary effects, etc.

jamilton 1/16/2026||||

Being able to voice clone with PocketTTS seems major, it doesn't look like there's any support for that with Kokoro.

echelon 1/16/2026||

Zero shot voice clones have never been very good. Fine tuned models hit natural speaker similarity and prosody in a way zero shot models can't emulate.

If it were a big model and was trained on a diverse set of speakers and could remember how to replicate them all, then zero shot is a potentially bigger deal. But this is a tiny model.

I'll try out the zero shot functionality of Pocket TTS and report back.

Barbing 1/16/2026||

Would be curious to hear!

jhatemyjob 1/16/2026|||

Less licensing headache, it seems. Kokoro says it's Apache licensed. But it has eSpeak-NG as a dependency, which is GPL, which brings into question whether or not Kokoro is actually GPL. PocketTTS doesn't have eSpeak-NG as a dependency so you don't need to worry about all that BS.

Btw, I would love to hear from someone (who knows what they're talking about) to clear this up for me. Dealing with potential GPL contamination is a nightmare.

miki123211 1/16/2026|||

Kokoro only uses Espeak for text-to-phoneme (AKA G2P) conversion.

If you could find another compatible converter, you could probably replace eSpeak with it. The data could be a bit OOD, so you may need to fiddle with it, but it should work.

Because the GPL is outdated and doesn't really consider modern gen AI, what you could also do is to generate a bunch of text-to-phoneme pairs with eSpeak and train your own transformer on them. This would free you from the GPL license completely, and the task is easy enough that even a very small model should be able to do it.
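Leaving the licensing question aside, the data-generation step described above could be sketched roughly as follows. This assumes the `espeak-ng` binary is on PATH and uses its `-q` (no audio) and `--ipa` (print IPA phonemes) options; `espeak_phonemize` and `build_g2p_pairs` are hypothetical helper names, and the phonemizer is injectable so a different converter could be dropped in.

```python
import subprocess
from typing import Callable, Iterable


def espeak_phonemize(word: str) -> str:
    """Return espeak-ng's IPA transcription of `word` (requires espeak-ng)."""
    out = subprocess.run(
        ["espeak-ng", "-q", "--ipa", word],
        capture_output=True,
        text=True,
        check=True,
    )
    return out.stdout.strip()


def build_g2p_pairs(
    words: Iterable[str],
    phonemize: Callable[[str], str] = espeak_phonemize,
) -> list[tuple[str, str]]:
    """Build de-duplicated (grapheme, phoneme) training pairs."""
    pairs: list[tuple[str, str]] = []
    seen: set[str] = set()
    for word in words:
        key = word.strip().lower()
        if not key or key in seen:
            continue  # skip blanks and repeats
        seen.add(key)
        pairs.append((key, phonemize(key)))
    return pairs
```

The resulting pairs would be the training set for the small text-to-phoneme model the comment has in mind; the model itself is out of scope here.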

jcelerier 1/16/2026|||

If it depends on espeak NG code, the complete product is 100% GPL. That said, if you are able to change the code to take off the espeak dependency then the rest would revert to non-GPL (or even if it's a build time option that you can disable like FFMPEG with --enable-gpl)

seunosewa 1/16/2026|||

Chatterbox-turbo is really good too. Has a version that uses Apple's GPU.

amrrs 1/15/2026||

Thanks for sharing your repo, looks super cool. I'm planning to try it out. Is it based on MLX or just HF transformers?

armcat 1/15/2026||

Thank you, just transformers.

lukebechtel 1/15/2026||

Nice!

Just made it an MCP server so claude can tell me when it's done with something :)

https://github.com/Marviel/speak_when_done

tarcon 1/16/2026||

macOS already has some great intrinsic TTS capability as the OS seems to include a natural-sounding voice. I recently built a similar tool to just run the "say" command as a background process. Had to wrap it in a Deno server. It works, but with Tahoe it's difficult to configure it to consistently use that one natural voice, and not the subpar voices downloadable in the settings. The good voice seems to be hidden somehow.

supriyo-biswas 1/16/2026||

> The good voice seems to be hidden somehow.

How am I supposed to enable this?

tarcon 1/16/2026||

My mistake, seems like I was referring to the Siri voice, which seems to be the default. It sounds good. It is selectable and to my surprise - even configurable in speed, pitch and volume - in the OS Accessibility settings -> System Voice -> Click on the (i) symbol. (macOS Tahoe)

Fnoord 1/16/2026||

Or via $ say --voice "?"
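For anyone wanting to script the "run `say` as a background process" approach mentioned upthread, a minimal Python sketch might look like this. It assumes macOS's `say` with its `-v` (voice) and `-r` (rate in words per minute) options; the helper names are hypothetical, and whether the Siri voice is reachable this way is not something this sketch establishes.

```python
import shutil
import subprocess
from typing import Optional


def build_say_command(
    text: str,
    voice: Optional[str] = None,
    rate_wpm: Optional[int] = None,
) -> list[str]:
    """Assemble the argument list for macOS `say`."""
    cmd = ["say"]
    if voice:
        cmd += ["-v", voice]  # voice name as listed by `say -v "?"`
    if rate_wpm:
        cmd += ["-r", str(rate_wpm)]  # speaking rate, words per minute
    cmd.append(text)
    return cmd


def speak_in_background(text: str, **options) -> Optional[subprocess.Popen]:
    """Start `say` without blocking; return None if `say` is not installed."""
    if shutil.which("say") is None:
        return None
    return subprocess.Popen(build_say_command(text, **options))
```

Calling `speak_in_background("Build finished", voice="Samantha")` returns immediately while the speech plays; "Samantha" is only an example voice name and may not be installed on a given machine.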

tylerdavis 1/16/2026|||

Funny! I made one recently too using piper-tts! https://github.com/tylerdavis/speak-mcp

codepoet80 1/16/2026||

I just set up Pushover to send a message to my phone for this exact reason! Trying out your server next!

singpolyma3 1/15/2026||

Love this.

It says MIT license but then readme has a separate section on prohibited use that maybe adds restrictions to make it nonfree? Not sure the legal implications here.

CGamesPlay 1/16/2026||

For reference, the MIT license contains this text: "Permission is hereby granted... to deal in the Software without restriction, including without limitation the rights to use". So the README containing a "Prohibited Use" section definitely creates a conflicting statement.

jandrese 1/16/2026|||

The "prohibited uses" section seems to be basically "not to be used for crime", which probably doesn't have much legal weight one way or another.

WhyNotHugo 1/16/2026|||

You might use it for something illegal in one country, and then leave for another country with no extradition… but you’ve lost the license to use the software and can be sued for copyright infringement.

mips_avatar 1/16/2026|||

I think the only restriction that seems problematic is not being able to clone someone’s voice without permission. I think there’s probably a valid case for using it for satire.

Buttons840 1/16/2026|||

Good question.

If a license says "you may use this, you are prohibited from using this", and I use it, did I break the license?

ethin 1/16/2026|||

If memory serves, the license is the ultimate source of truth on what is allowed or not. You cannot add some section that isn't in the text of the license (at least in the US and other countries that use similar legal systems) on some website and expect it to hold up in court because the license doesn't include that text. I know of a few other bigger-name projects that try to pull these kinds of stunts because they don't believe anyone is going to actually read the text of the license.

HenrikB 1/16/2026||

The copyright holder can set whatever license they want, including writing their own.

In this case, I'd interpret it as them having made up a new license based on MIT, where the addendum makes it something other than MIT. I agree with what others said; this "new" license has internal conflicts.

kaliqt 1/16/2026||

The license is clearly defined. It would be misleading, possibly fraudulent for them to then override the license elsewhere.

Simply, it's MIT licensed. If they want to change that, they have to remove that license file OR clearly update it to be a modified version of MIT.

IshKebab 1/16/2026|||

I think if they took you to court for cloning someone's voice without permission they would probably lose because this conflict makes the terms unclear.

Buttons840 1/17/2026||

An unclear license would default back to full copyright protection I would think.

yencabulator 1/19/2026||

Not necessarily. I believe many courts have a principle that an unclear agreement is read in favor of the party that did not write the agreement.

MatthiasPortzel 1/16/2026|||

Tried to use voice cloning but in order to download the model weights I have to create a HuggingFace account, connect it on the command line, give them my contact information, and agree to their conditions. The open source part is just the client and chunking logic which is pretty minimal.

syockit 1/16/2026|||

From my understanding, the code is MIT, but the model isn't? What constitutes "Software" anyway? Aren't resources like images, sounds and the like exempt from it (hence, covered by usual copyright unless separately licensed)? If so, in the same vein, an ML model is not part of "Software". By the way, the same prohibition is repeated on the Hugging Face model card.

iamrobertismo 1/16/2026||

Yeah, I don't understand the point of the prohibited use section at all, seems like unnecessary fluff.

pain_perdu 1/16/2026||

I'm psyched to see so much interest in my post about Kyutai's latest model! I'm working on part of a related team in Paris that's building off Kyutai's research to provide enterprise-grade voice solutions. If anyone is building in this space I'd love to chat and share some of our upcoming models and capabilities that I am told are SOTA. Please don't hesitate to ping me via the address in my profile.

rsolva 1/16/2026||

Woah, I'm impressed! The voice cloning also worked much better than expected! Will there be separate models for other languages? I know the National Library in Norway has done a good job curating speech datasets with many different dialects [1][2].

[1] https://data.norge.no/en/datasets/220ef03e-70e1-3465-a4af-ed...

[2] https://ai.nb.no/datasets/

armcat 1/16/2026||

Just want to say amazing work. It's really pushing the envelope of what is possible to run locally on everyday devices.

mgaudet 1/16/2026||

Eep.

So, on my M1 mac, did `uvx pocket-tts serve`. Plugged in

> It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way—in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only

(Beginning of Tale of Two Cities)

but the problem is Javert skips over parts of sentences! Eg, it starts:

> "It was the best of times, it was the worst of times, it was the age of wisdom, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the spring of hope, it was the winter of despair, we had everything before us, ..."

Notice how it skips over "it was the age of foolishness," and "it was the season of Darkness,"

Which... Doesn't exactly inspire faith in a TTS system.

(Marius seems better; posted https://github.com/kyutai-labs/pocket-tts/issues/38)

Paul_S 1/16/2026||

All the models I tried have similar problems. When trying to batch a whole audiobook, the only way is to run it, then run a model to transcribe and check you get the same text.
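The synthesize-then-transcribe check described above can be sketched as a word-level diff. This assumes the transcript has already been produced by some speech-to-text model (not shown); `normalize` and `find_skipped` are hypothetical helper names built on Python's standard `difflib`.

```python
import difflib
import re


def normalize(text: str) -> list[str]:
    """Lower-case the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def find_skipped(source: str, transcript: str) -> list[str]:
    """Return runs of source words that do not appear in the transcript."""
    src, hyp = normalize(source), normalize(transcript)
    matcher = difflib.SequenceMatcher(a=src, b=hyp, autojunk=False)
    skipped = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete": words dropped entirely; "replace": words changed.
        if tag in ("delete", "replace"):
            skipped.append(" ".join(src[i1:i2]))
    return skipped
```

An empty result suggests the audio matches the text, subject to the transcription model's own errors. On heavily repetitive passages (like the Dickens opening) the alignment may attribute a skip to a neighbouring clause, so the output is best read as "something is missing around here" rather than an exact location.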

sbarre 1/16/2026|||

Yeah Javert mangled up those sentences for me as well, it skipped whole parts and then also moved words around

- "its noisiest superlative insisted on its being received"

Win10 RTX 5070 Ti

vvolhejn 1/16/2026|||

Václav from Kyutai here. Thanks for the bug report! A workaround for now is to chunk the text into smaller parts where the model is more reliable. We already do some chunking in the Python package. There is also a more fancy way to do this chunking in a way that ensures that the stitched-together parts continue well (teacher-forcing), but we haven't implemented that yet.
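As an illustration of the chunking workaround, here is a minimal sketch that splits text after punctuation and packs the pieces into chunks under a length limit. This is not the chunking used in the Pocket TTS Python package; `chunk_text` and the `max_chars` default are assumptions chosen for the example.

```python
import re


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text into chunks of at most max_chars, breaking after punctuation.

    A single piece longer than max_chars is kept whole rather than cut
    mid-clause.
    """
    # Split on whitespace that follows sentence or clause punctuation.
    pieces = [p for p in re.split(r"(?<=[.!?;,])\s+", text.strip()) if p]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current} {piece}" if current else piece
        if current and len(candidate) > max_chars:
            chunks.append(current)  # current chunk is full; start a new one
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be synthesized separately and the audio concatenated. Breaking at commas keeps chunks short on long sentences, at the cost of possible prosody discontinuities at the seams, which is the problem the teacher-forcing approach mentioned above is meant to address.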

mgaudet 1/16/2026||

Is this just sort of expected for these models? Should users of this expect only truncation or can hallucinated bits happen too?

I also find Javert in particular seems to put in huge gaps and spaces... side effect of the voice?

vvolhejn 1/22/2026||

> Is this just sort of expected for these models? Should users of this expect only truncation or can hallucinated bits happen too?

Basically, yes, sort of expected: we don't have detailed enough control to prevent it fully. We can measure how much it happens and train better models, but no 100% guarantee. The bigger the model, the less this happens, but this one is tiny, so it's not the sharpest tool in the shed. Hallucinated bits can theoretically happen but I haven't observed it with this model yet.

small_scombrus 1/16/2026|||

Using your first text block 'Eponine' skips "we had nothing before us" and doesn't speak the final "that some of its noisiest"

I wonder what's going wrong in there

memming 1/16/2026||

interesting; it skipped "we had everything before us," in my test. Yeah, not a good sign.

GaggiX 1/15/2026||

I love that everyone is making their own TTS model as they are not as expensive as many other models to train. Also there are plenty of different architectures.

Another recent example: https://github.com/supertone-inc/supertonic

andai 1/15/2026||

In-browser demo of Supertonic with WASM:

https://huggingface.co/spaces/Supertone/supertonic-2

coder543 1/15/2026|||

Another one is Soprano-1.1.

It seems like it is being trained by one person, and it is surprisingly natural for such a small model.

I remember when TTS always meant the most robotic, barely comprehensible voices.

https://www.reddit.com/r/LocalLLaMA/comments/1qcusnt/soprano...

https://huggingface.co/ekwek/Soprano-1.1-80M

nunobrito 1/15/2026|||

Thank you. Very good suggestion with code available and bindings for so many languages.

nowittyusername 1/16/2026||

Thanks for the heads up, this looks really interesting and the claimed speed is nuts.

NoSalt 1/16/2026||

> "You can also clone the voice from any audio sample by using our repo."

Ok, who knows where I can get those high-quality recordings of Majel Barrett's voice that she made before she died?

freedomben 1/16/2026|

TOS computer voice must be my computer's voice. And after every command I run, I need a "Working."

dale_glass 1/16/2026||

Is there any TTS engine that doesn't need cloning and has some sort of parameters one can specify?

Like what if I want to graft on TTS to an existing text chat system and give each person a unique, randomly generated voice? Or want to try to get something that's not quite human, like some sort of alien or monster?

unleaded 1/16/2026||

You could use an old-school formant synthesizer that lets you tune the parameters, like espeak or dectalk. espeak apparently has a klatt mode which might sound better than the default but i haven't tried it.
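One way to get a distinct voice per chat user from a parametric synthesizer is to derive the parameters from a hash of the username, so the voice is pseudo-random but stable across sessions. The sketch below targets espeak-ng's `-v` (voice plus variant), `-p` (pitch, 0 to 99) and `-s` (speed in words per minute) options; the variant list and parameter ranges are assumptions picked for illustration, and `voice_for` and `espeak_command` are hypothetical names.

```python
import hashlib

# Assumed to be available as espeak-ng voice variants on the target system.
VARIANTS = ["m1", "m2", "m3", "m4", "f1", "f2", "f3", "f4"]


def voice_for(username: str) -> dict:
    """Map a username to stable, pseudo-random synthesizer parameters."""
    digest = hashlib.sha256(username.encode("utf-8")).digest()
    return {
        "variant": VARIANTS[digest[0] % len(VARIANTS)],
        "pitch": 20 + digest[1] % 60,   # 20..79, inside espeak's 0..99 range
        "speed": 140 + digest[2] % 60,  # 140..199 words per minute
    }


def espeak_command(username: str, text: str, lang: str = "en") -> list[str]:
    """Build the espeak-ng argument list for this user's voice."""
    v = voice_for(username)
    return [
        "espeak-ng",
        "-v", f"{lang}+{v['variant']}",
        "-p", str(v["pitch"]),
        "-s", str(v["speed"]),
        text,
    ]
```

Passing the result to `subprocess.run` would speak the message. Widening the pitch and speed ranges pushes the output toward the less human "alien or monster" end, at the cost of intelligibility.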

bkitano19 1/16/2026||

You can use voice prompting; it's supported on ElevenLabs and Hume.

Evidlo 1/16/2026|

How feasible would it be to build this project into a small static binary that could be distributed? The dependencies are pretty big.

homarp 1/16/2026|

you can track this issue https://github.com/mmwillet/TTS.cpp/issues/127
