Pocket TTS: A high quality TTS that gives your CPU a voice

Posted by pain_perdu 1/15/2026

Pocket TTS: A high quality TTS that gives your CPU a voice(kyutai.org)

635 points | 158 commentspage 2

Imustaskforhelp 1/16/2026|

Perhaps I have been not talking to voice models that much or the chatgpt voice always felt weird and off because I was thinking it goes to a cloud server and everything but from Pocket TTS I discovered unmute.sh which is open source and I think is from the same company as Pocket TTS/can I think use Pocket TTS as well

I saw some agentic models at 4B or similar which can punch above its weights or even some basic models. I can definitely see them in the context of home lab without costing too much money.

I think atleast unmute.sh is similar/competed with chatgpt's voice model. It's crazy how good and (effective) open source models are from top to bottom. There's basically just about anything for almost everyone.

I feel like the only true moat might exist in coding models. Some are pretty good but its the only industry where people might pay 10x-20x more for the best (minimax/z.ai subscription fees vs claude code)

It will be interesting to see if we will see another deepseek moment in AI which might beat claude sonnet or similar. I think Deepseek has deepseek 4 so it will be interesting to see how/if it can beat sonnet

(Sorry for going offtopic)

StevenNunez 1/16/2026|

Great find! unmute was a trip to play with

Imustaskforhelp 1/17/2026||

Your welcome! Glad you appreciated it man. I think Unmute was really cool and is open source but its deployment is a little on the more complex side of things.

dust42 1/15/2026||

Good quality but unfortunately it is single language English only.

riedel 1/17/2026||

I am also quite irritated by the fact that many TTS fail to state what language (and probably even dialect) they support. Actually to support a really good workflow for many Europeans (and probably also the rest of the world) one would actually need a multi language models that also support the use foreign words within one's own language. I am using a local notification reader on my smartphone (with SherpaTTS) and the mix of notification language as well as languages embedded in each other makes the experience rather funny at times.

jiehong 1/16/2026|||

Agreed.

I think they should have added the fact that it's English only in the title at the very least.

dust42 1/16/2026||

Yes, apart from voice cloning nothing really new. Kokoro is out since a long time and it supports at least a few languages other than english. Also there is supertonic TTS and there is Soprano TTS. The latter is developed by a single guy while Kyutai is funded with 150M€.

  https://github.com/supertone-inc/supertonic 
  https://github.com/ekwek1/soprano

No affiliation with either.

phoronixrly 1/15/2026||

I echo this. For a TTS system to be in any way useful outside the tiny population of the world that speaks exclusively English, it must be multilingual and dynamically switch between languages pretty much per word.

Cool tech demo though!

kamranjon 1/16/2026|||

That's a pretty crazy requirement for something to be "useful" especially something that runs so efficiently on cpu. Many content creators from non-english speaking countries can benefit from this type of release by translating transcripts of their content to english and then running it through a model like this to dub their videos in a language that can reach many more people.

phoronixrly 1/16/2026|||

You mean youtubers? And have to (manually) synchronise the text to their video, and especially when youtube apparently offers voice-voice translation out of the box to my and many others' annoyance?

littlestymaar 1/16/2026||

YouTube's voice to voice is absolutely horrible though. Having the ability for the youtubers to clone their own voice would make it much, much more appealing.

ethin 1/16/2026|||

Uh, no? This is not at all an absurd requirement? Screen readers literally do this all the time, with voices that are the classic way of making a speech synthesizer, no AI required. ESpeak is an example, or MS OneCore. The NVDA screen reader has an option for automatic language switching as does pretty much every other modern screen reader in existence. And absolutely none of these use AI models to do that switching, either.

kube-system 1/16/2026||

They didn’t say it was a crazy requirement. They said it was crazy to consider it useless without meeting that requirement.

ethin 1/16/2026||

That doesn't really change what I said though. It isn't crazy to call it useless without some form of ALS either. Given that old school synthesis has been able to do it for like 20 years or so.

echoangle 1/16/2026||

How does state of the art matter when talking about usefulness? Is old school synthesis useless?

ethin 1/16/2026||

No? But is it not unreasonable to expect "state of the art" TTS to be able to do at least what old school synthesis is capable of doing? Being "state of the art" means being the highest level of development or achievement in a particular field, device, procedure, or technique at a specific point in time. I don't think it's therefore unreasonable to expect supposed "state of the art" text-to-speech synthesis to do far better at everything old-school TTS could do and then some.

kube-system 1/16/2026||

> Being "state of the art" means being the highest level of development or achievement in a particular field, device, procedure, or technique at a specific point in time. I don't think it's therefore unreasonable to expect supposed "state of the art" text-to-speech synthesis to do far better at everything old-school TTS could do and then some.

Non sequitur. Unless the 'art' in question is the 'art of adding features', usually this phrase is to describe the quality of a very specific development, these are often not even feature complete products.

bingaweek 1/16/2026||||

This is a great illustration that nothing you ever do will be good enough without people whining.

phoronixrly 1/16/2026||

Excuse me for pointing out that yet another LLM tech demo is presented to our attention.

Levitz 1/16/2026||||

But it wouldn't be for those who "speak exclusively English", rather, for those who speak English. Not only that but it's also common to have system language set to English, even if one's language is different.

There's about 1.5B English speakers in the planet.

phoronixrly 1/16/2026||

Let's indeed limit the use case to the system language, let's say of a mobile phone.

You pull up a map and start navigation. All the street names are in the local language, and no, transliterating the local names to the English alphabet does not make them understandable when spoken by TTS. And not to mention localised foreign names which then are completely mangled by transliterating them to English.

You pull up a browser, open up an news article in your local language to read during your commute. You now have to reach for a translation model first before passing the data to the English-only TTS software.

You're driving, one of your friends Signals you. Your phone UI is in English, you get a notification (interrupting your Spotify) saying 'Signal message', followed by 5 minutes of gibberish.

But let's say you have a TTS model that supports your local language natively. Well due to the fact that '1.5B English speakers' apparently exist in the planet, many texts in other languages include English or Latin names and words. Now you have the opposite issue -- your TTS software needs to switch to English to pronounce these correctly...

And mind you, these are just very simple use cases for TTS. If you delve into use cases for people with limited sight that experience the entire Internet, and all mobile and desktop applications (often having poor localisation) via TTS you see how mono-lingual TTS is mostly useless and would be switched for a robotic old-school TTS in a flash...

> only that but it's also common to have system language set to English

Ask a German whether their system language is English. Ask a French person. I can go on.

VMG 1/16/2026|||

> Ask a German whether their system language is English. Ask a French person. I can go on.

I'm German but my system language is English

Because translations often suck, are incomplete or inconsistent

numpad0 1/16/2026|||

If you don't speak the local language anyway, you can't decode pronounced spoken local language names anyway. Your speech sub-systems can't lock and sync to the audio track containing languages you don't speak. Let alone transliterate or pronounce.

Multilingual doesn't mean language agnostic. We humans are always monolingual, just multi-language hot-swappable if trained. It's more like you can make;make install docker, after which you can attach/detach into/out of alternate environments while on terminal to do things or take in/out notes.

People sometimes picture multilingualism as owning a single joined-together super-language in the brain. That usually doesn't happen. Attempting this especially at young age could lead a person into a "semi-lingual" or "double-limited" state where they are not so fluent or intelligent in any particular languages.

And so, trying to make an omnilingual TTS for criticizing someone not devoting significant resources at it, don't make much sense.

phoronixrly 1/16/2026||

> If you don't speak the local language anyway, you can't decode pronounced spoken local language names anyway

This is plainly not true.

> Multilingual doesn't mean language agnostic. We humans are always monolingual, just multi-language hot-swappable if trained

This and the analogy make no sense to me. Mind you I am trilingual.

I also did not imply that the model itself needs to be multilingual. I implied that the software that uses the model to generate speech must be multilingual and support language change detection and switching mid-sentence.

knowitnone3 1/16/2026||||

I'm Martian so everything you create better support my language on day 1

numpad0 1/16/2026||||

> it must be multilingual and dynamically switch between languages pretty much per word

Not abundantly obviously a satire and so interjecting: humans, including professional "simultaneous" interpreters, can't do this. This is not how languages work.

koakuma-chan 1/16/2026||

You can speak one language, switch to another language for one word, and continue speaking in the previous language.

numpad0 1/16/2026||

But that's my point. You'll stop, switch, speak, stop, switch, resume. You're not going to be "I was in 東京 yesterday" as a single continuous sentence. It'll have to be broken up to three separate sentences spoken back to back, even for humans.

jiehong 1/16/2026|||

>"I was in 東京 yesterday"

I think it's the wrong example, because this is actually very common if you're a Chinese speaker.

Actually, people tend to say the name of the cities in their own countries in their native language.

> I went to Nantes [0], to eat some kouign-amann [1].

As a French, both [0] and [1] will be spoken the French way on the fly in the sentence, while the other words are in English. Switching happens without any pause whatsoever (because there is really only one single way to pronounce those names in my mind, no thinking required).

Note that with Speech Recognition, it is fairly common to have models understanding language switches within a sentence like with Parakeet.

numpad0 1/16/2026||

Okay, it's getting clear that I'm in the wrong here with my insistence that languages don't mix and foreign words can't be inserted mid-sentence, yet that is my experience as well as behaviors of people sharing the language, incidentally including GP who suggested that I can always do the switching dance - people can if wanted, but normally don't. It's considered a show-off if the inserted word could be understood at all.

Perhaps I have to admit that my particular primary language is officially a human equivalent of an esoteric language; the myth that it's a complex language is increasingly becoming obsolete(for good!), but maybe it still qualify as being esoteric one that are not insignificantly more incompatible with others.

polshaw 1/16/2026||||

I think this is totally wrong. When you have both parties speaking multiple languages this happens all the time. You see this more with English being the loaner more often than it is the borrower, due to the reach that the language has. Listen to an Indian or Filipino speak for a while, it's interspersed with English words ALL the time. It happens less in English as there is not the universal knowledge base of one specific other language, but it does happen sometimes when searching for a certain, je ne sais pas.

akshitgaur2005 1/16/2026|||

Not really, most multilinguals switch between languages so seamlessly that you wouldn't even notice it! It even has given birth to new "languages", take for example Hinglish!!

echelon 1/16/2026|||

English has more users than all but a few products.

nmstoker 1/16/2026||

It's impressive but it's a shame that it's 2026 and despite remarkably lifelike speech, so many models fall on common issues like heteronyms ("the couple had a row because they couldn't agree where to row their boat"), realistic number handling and so on.

woadwarrior01 1/16/2026|

Yeah most models are quite bad at it. The industry term for it is: homograph disambiguation.

anon84873628 1/16/2026||

Let's undo the great vowel shift and modernize English spellings :-D

akx 1/16/2026||

It's pretty good. And for once, a software-engineering-ly high-quality codebase, too!

All too often, new models' codebases are just a dump of code that installs half the universe in dependencies for no reason, etc.

snvzz 1/15/2026||

Relative to AmigaOS translator.device + narrator.device, this sure seems bloated.

Paul_S 1/16/2026||

The speed of improvement of tts models reminds me of early days of Stable Diffusion. Can't wait until I can generate audiobooks without infinite pain. If I was an investor I'd short Audible.

asystole 1/16/2026||

An all-TTS audiobook offering is just about as appealing as an all-stable-diffusion picture gallery (that is, not at all).

echoangle 1/16/2026|||

Isn’t it more like an art gallery of prints of paintings? The primary art is the text of the book (like the painting in the gallery), TTS (and printing a copy) are just methods of making the art available.

306bobby 1/16/2026||

I think it can be argued that audiobook's add to the art by adding tone and inflection by the reader.

To me, what you're saying is the same as saying the art of a movie is in the script, the video is just the method of making it available. And I don't think that's a valid take

fluoridation 1/16/2026||

No, that's an incorrect analogy. The script of a movie is an intermediate step in the production process of a movie. It's generally not meant to be seen by any audiences. The script for example doesn't contain any cinematography or any soundtrack or any performances by actors. Meanwhile, a written work is a complete expressive work ready for consumption. It doesn't contain a voice, but that's because the intention is for the reader to interpret the voice into it. A voice actor can do that, but that's just an interpretation of the work. It's not one-to-one, but it's not unlike someone sitting next to you in the theater and telling you what they think a scene means.

So yes, I mostly agree with GP. An audiobook is a different rendering of the same subject. The content is in the text, regardless of whether it's delivered in written or oral form.

sysworld 1/16/2026|||

There already are audiobooks on audible that are 100% TTS, while it's playable, it's no substitute (yet) for a real human.

It's just too flat/dead compared to a human reader.

everyday7732 1/16/2026|||

It's not perfect, but I already have a setup for doing this on my phone. Add SherpaTTS and Librera Reader to your phone. (both available free on fdroid).

Set up SherpaTTS as the voice model for your phone (I like the en_GB-jenny_dioco-medium voice option, but there are several to choose from). Add a ebook to librera reader and open it. There's an icon with a little person wearing headphones, which lets you send the text continuously to your phone's tts, using just local processing on the phone. I don't have the latest phone but mine is able to process it faster than the audio is read, so the audio doesn't stop and start.

The voice isn't totally human sounding, but it's a lot better than the microsoft sam days, and once you get used to it the roboticness fades into the background and I can just listen to the story. You may get better results with kokoro (I couldn't get it running on my phone) or similar tts engines and a more powerful phone.

One thing I like about this setup is that if you want to swap back and forth between audio and text, you can. The reader scrolls automatically as it makes the audio, and you can pause it, read in silence for a while yourself and later set it going from a new point.

gempir 1/16/2026|||

I feel like TTS is one of the areas that as evolved the least. Small TTS models have been around for like 5+ years and they've only gotten incrementally better. Giants like ElevenLabs make good sounding TTS but it's not quite human yet and the improvements get less and less each iteration.

StevenNunez 1/16/2026|||

I've moved to https://github.com/readest/readest over audio books in most cases. I just need the dang thing in my ears and their TTS is good enough.

rowanG077 1/16/2026||

Wouldn't audible be perfectly positioned to take advantage of this. They have the perfect setup to integrate this into their offering.

Manfred 1/16/2026||

It seems more likely that people will buy a digital copy of the book for a few bucks and then run the TTS themselves on devices they already own.

howdareme9 1/16/2026|||

Not likely at all, people pay for convenience. They don't want to do that

johanyc 1/17/2026||

Yeah hackernews users kept thinking the average consumers like to tinker like we do lol

pantalaimon 1/16/2026|||

eBooks are much more expensive then an Audible subscription though.

potatoman22 1/16/2026||

I wouldn't say so. Audible gives you 1 book a month for $15. Most e-books I see are around $10.

donpdonp 1/16/2026||

it'd be nice to get some idea of what kind of hardware a laptop needs to be able to run this voice model.

donpdonp 1/17/2026|

for example, How much disk is needed? I started the uvx command and it started to download hundreds of megabytes. How much cpu ram is necessary and how much gpu ram is necessary? will an integrated intel gpu work? some ARM boards have a dedicated AI processor, are any of those supported?

d4rkp4ttern 1/16/2026||

Super nice and convenient to use as a CLI. I made it into a plugin for Claude Code to give a 1-sentence spoken status update whenever it stops:

claude plugin marketplace add pchalasani/claude-code-tools

claude plugin install voice@cctools-plugins

More here: https://github.com/pchalasani/claude-code-tools?tab=readme-o...

febin 1/17/2026||

I've vibecoded a Rust port of Pocket TTS using candle.

https://github.com/jamesfebin/pocket-tts-candle

The port supports:

- Native compilation with zero Python runtime dependency

- Streaming inference

- Metal acceleration for macOS

- Voice cloning (with the mimi feature)

Note: This was vibecoded (AI-assisted), but features were manually tested.

OfflineSergio 1/16/2026|

This is amazing. The audio feels very natural and it's fairly good at handling complext text to speech tasks. I've been working on WithAudio (https://with.audio). Currently it only uses Kokoros. I need to test this a bit more but I might actually add it to the app. It's too good to be ignored.

More comments...