
Posted by pain_perdu 1 day ago

Pocket TTS: A high quality TTS that gives your CPU a voice (kyutai.org)
519 points | 121 comments
Imustaskforhelp 14 hours ago|
Perhaps I haven't talked to voice models much, or maybe ChatGPT's voice always felt weird and off because I knew everything was going to a cloud server. But through Pocket TTS I discovered unmute.sh, which is open source, is I think from the same company as Pocket TTS, and can I think use Pocket TTS as well.

I saw some agentic models at 4B parameters or so which can punch above their weight, and even some basic models. I can definitely see them being used in a home lab without costing too much money.

I think unmute.sh at least is comparable to and competes with ChatGPT's voice model. It's crazy how good and effective open-source models are from top to bottom. There's basically something for almost everyone.

I feel like the only true moat might exist in coding models. Some are pretty good, but it's the only segment where people might pay 10x-20x more for the best (MiniMax/z.ai subscription fees vs Claude Code).

It will be interesting to see whether we get another DeepSeek moment in AI, with a model that beats Claude Sonnet or similar. I think DeepSeek has DeepSeek 4 coming, so it will be interesting to see how/if it can beat Sonnet.

(Sorry for going off topic)

smallerfish 3 hours ago||
Hopefully the browsers will improve their built in TTS soon. It's still pretty unusable unless you really need it.
dust42 16 hours ago||
Good quality, but unfortunately it is English only.
jiehong 6 hours ago||
Agreed.

I think they should at least have mentioned in the title that it's English only.

dust42 6 hours ago||
Yes, apart from voice cloning, nothing really new. Kokoro has been out for a long time and supports at least a few languages other than English. There are also Supertonic TTS and Soprano TTS. The latter is developed by a single person, while Kyutai is funded with €150M.

  https://github.com/supertone-inc/supertonic 
  https://github.com/ekwek1/soprano
No affiliation with either.
phoronixrly 16 hours ago||
I echo this. For a TTS system to be in any way useful outside the tiny population of the world that speaks exclusively English, it must be multilingual and dynamically switch between languages pretty much per word.

Cool tech demo though!

bingaweek 13 hours ago|||
This is a great illustration that nothing you ever do will be good enough without people whining.
phoronixrly 4 hours ago||
Excuse me for pointing out that yet another LLM tech demo is being presented for our attention.
kamranjon 16 hours ago||||
That's a pretty extreme requirement for something to be "useful", especially something that runs so efficiently on CPU. Many content creators from non-English-speaking countries can benefit from this type of release: translate the transcripts of their content into English, then run them through a model like this to dub their videos in a language that can reach many more people.
phoronixrly 15 hours ago|||
You mean YouTubers? And they have to (manually) synchronise the text to their video, especially when YouTube apparently offers voice-to-voice translation out of the box, to my and many others' annoyance?
littlestymaar 8 hours ago||
YouTube's voice to voice is absolutely horrible though. Having the ability for the youtubers to clone their own voice would make it much, much more appealing.
ethin 14 hours ago|||
Uh, no? This is not at all an absurd requirement. Screen readers literally do this all the time, with voices built the classic way, no AI required; eSpeak is one example, or MS OneCore. The NVDA screen reader has an option for automatic language switching, as does pretty much every other modern screen reader in existence. And absolutely none of these use AI models to do that switching, either.
kube-system 11 hours ago||
They didn’t say it was a crazy requirement. They said it was crazy to consider it useless without meeting that requirement.
ethin 10 hours ago||
That doesn't really change what I said, though. It isn't crazy to call it useless without some form of automatic language switching either, given that old-school synthesis has been able to do it for 20 years or so.
echoangle 6 hours ago||
How does state of the art matter when talking about usefulness? Is old school synthesis useless?
Levitz 16 hours ago||||
But it wouldn't only be for those who "speak exclusively English"; rather, for those who speak English. Not only that, it's also common to have the system language set to English even if one's own language is different.

There are about 1.5B English speakers on the planet.

phoronixrly 15 hours ago||
Let's indeed limit the use case to the system language, let's say of a mobile phone.

You pull up a map and start navigation. All the street names are in the local language, and no, transliterating the local names into the English alphabet does not make them understandable when spoken by TTS. Not to mention localised foreign names, which are then completely mangled by transliterating them to English.

You pull up a browser and open a news article in your local language to read during your commute. You now have to reach for a translation model first, before passing the text to the English-only TTS software.

You're driving, one of your friends Signals you. Your phone UI is in English, you get a notification (interrupting your Spotify) saying 'Signal message', followed by 5 minutes of gibberish.

But let's say you have a TTS model that supports your local language natively. Well, because '1.5B English speakers' apparently exist on the planet, many texts in other languages include English or Latin names and words. Now you have the opposite issue -- your TTS software needs to switch to English to pronounce these correctly...

And mind you, these are just very simple use cases for TTS. If you delve into the use cases of people with limited sight, who experience the entire Internet and all mobile and desktop applications (often poorly localised) via TTS, you see how monolingual TTS is mostly useless and would be swapped for a robotic old-school TTS in a flash...

> only that but it's also common to have system language set to English

Ask a German whether their system language is English. Ask a French person. I can go on.

VMG 20 minutes ago|||
> Ask a German whether their system language is English. Ask a French person. I can go on.

I'm German but my system language is English

Because translations often suck, are incomplete or inconsistent

numpad0 9 hours ago|||
If you don't speak the local language, you can't decode spoken local-language names anyway. Your speech subsystems can't lock on and sync to an audio track in a language you don't speak, let alone transliterate or pronounce it.

Multilingual doesn't mean language agnostic. We humans are always monolingual, just multi-language hot-swappable if trained. It's more like being able to make; make install Docker, after which you can attach to and detach from alternate environments in a terminal to do things or carry notes in and out.

People sometimes picture multilingualism as owning a single joined-together super-language in the brain. That usually doesn't happen. Attempting it, especially at a young age, can leave a person in a "semi-lingual" or "double-limited" state where they are not fully fluent in any particular language.

And so, holding up an omnilingual TTS as the standard in order to criticize someone for not devoting significant resources to it doesn't make much sense.

phoronixrly 7 hours ago||
> If you don't speak the local language anyway, you can't decode pronounced spoken local language names anyway

This is plainly not true.

> Multilingual doesn't mean language agnostic. We humans are always monolingual, just multi-language hot-swappable if trained

This and the analogy make no sense to me. Mind you I am trilingual.

I also did not imply that the model itself needs to be multilingual. I implied that the software that uses the model to generate speech must be multilingual and support detecting language changes and switching mid-sentence.
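A minimal sketch of what I mean, assuming you have one TTS voice per language: detect the script of each run of characters and let a router hand each run to the matching voice. This is my own illustration (based on Unicode character names from Python's standard library), not any particular screen reader's implementation.

```python
import unicodedata

def script_of(ch: str) -> str:
    """Very rough script bucket based on the Unicode character name."""
    try:
        name = unicodedata.name(ch)
    except ValueError:
        return "other"
    if "CJK" in name or "HIRAGANA" in name or "KATAKANA" in name:
        return "cjk"
    if "CYRILLIC" in name:
        return "cyrillic"
    return "latin"

def segment_by_script(text: str):
    """Split text into (script, run) pairs that could be routed to
    per-language TTS voices; whitespace sticks to the preceding run."""
    runs = []
    for ch in text:
        if ch.isspace() and runs:
            runs[-1] = (runs[-1][0], runs[-1][1] + ch)
            continue
        s = script_of(ch)
        if runs and runs[-1][0] == s:
            runs[-1] = (s, runs[-1][1] + ch)
        else:
            runs.append((s, ch))
    return runs

print(segment_by_script("I was in 東京 yesterday"))
# → [('latin', 'I was in '), ('cjk', '東京 '), ('latin', 'yesterday')]
```

Real screen readers use smarter heuristics (language tags, dictionaries), but the point is that the switching logic lives in the consuming software, not in the voice model.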

numpad0 10 hours ago||||
> it must be multilingual and dynamically switch between languages pretty much per word

Since this is not obviously satire, I'll interject: humans, including professional "simultaneous" interpreters, can't do this. This is not how languages work.

koakuma-chan 9 hours ago||
You can speak one language, switch to another language for one word, and continue speaking in the previous language.
numpad0 7 hours ago||
But that's my point. You stop, switch, speak, stop, switch, resume. You're not going to say "I was in 東京 yesterday" as a single continuous sentence. It has to be broken up into three separate utterances spoken back to back, even by humans.
jiehong 7 hours ago|||
>"I was in 東京 yesterday"

I think it's the wrong example, because this is actually very common if you're a Chinese speaker.

Actually, people tend to say the names of cities in their own countries in their native language.

> I went to Nantes [0], to eat some kouign-amann [1].

As a French person, I pronounce both [0] and [1] the French way, on the fly, mid-sentence, while the other words are in English. The switch happens without any pause whatsoever (because there is really only one way to pronounce those names in my mind; no thinking required).

Note that in speech recognition it is fairly common to have models that understand language switches within a sentence, like Parakeet.

polshaw 6 hours ago||||
I think this is totally wrong. When both parties speak multiple languages, this happens all the time. You see it more with English as the lender than the borrower, due to the reach the language has. Listen to an Indian or Filipino speaker for a while; their speech is interspersed with English words ALL the time. It happens less in English, since there is no universally known other language, but it does happen sometimes when searching for a certain... je ne sais quoi.
akshitgaur2005 6 hours ago|||
Not really. Most multilinguals switch between languages so seamlessly that you wouldn't even notice it. It has even given birth to new "languages"; take Hinglish, for example!
knowitnone3 14 hours ago||||
I'm Martian so everything you create better support my language on day 1
echelon 15 hours ago|||
English has more users than all but a few products.
Paul_S 7 hours ago||
The speed of improvement of TTS models reminds me of the early days of Stable Diffusion. Can't wait until I can generate audiobooks without infinite pain. If I were an investor, I'd short Audible.
asystole 6 hours ago||
An all-TTS audiobook offering is just about as appealing as an all-stable-diffusion picture gallery (that is, not at all).
echoangle 6 hours ago||
Isn’t it more like an art gallery of prints of paintings? The primary art is the text of the book (like the painting in the gallery), TTS (and printing a copy) are just methods of making the art available.
306bobby 4 hours ago||
I think it can be argued that audiobooks add to the art, with the reader contributing tone and inflection.

To me, what you're saying is like saying the art of a movie is in the script, and the video is just the method of making it available. I don't think that's a valid take.

fluoridation 2 hours ago||
No, that's an incorrect analogy. The script of a movie is an intermediate step in the production process of a movie. It's generally not meant to be seen by any audiences. The script for example doesn't contain any cinematography or any soundtrack or any performances by actors. Meanwhile, a written work is a complete expressive work ready for consumption. It doesn't contain a voice, but that's because the intention is for the reader to interpret the voice into it. A voice actor can do that, but that's just an interpretation of the work. It's not one-to-one, but it's not unlike someone sitting next to you in the theater and telling you what they think a scene means.

So yes, I mostly agree with GP. An audiobook is a different rendering of the same subject. The content is in the text, regardless of whether it's delivered in written or oral form.

everyday7732 5 hours ago|||
It's not perfect, but I already have a setup for doing this on my phone: add SherpaTTS and Librera Reader (both available free on F-Droid).

Set up SherpaTTS as the voice model for your phone (I like the en_GB-jenny_dioco-medium voice, but there are several to choose from). Add an ebook to Librera Reader and open it. There's an icon of a little person wearing headphones, which sends the text continuously to your phone's TTS using only local processing. I don't have the latest phone, but mine processes text faster than the audio is read, so the audio doesn't stop and start.

The voice isn't totally human-sounding, but it's a lot better than the Microsoft Sam days. Once you get used to it, the roboticness fades into the background and I can just listen to the story. You may get better results with Kokoro (I couldn't get it running on my phone) or similar TTS engines and a more powerful phone.

One thing I like about this setup is that if you want to swap back and forth between audio and text, you can. The reader scrolls automatically as it makes the audio, and you can pause it, read in silence for a while yourself and later set it going from a new point.

gempir 6 hours ago|||
I feel like TTS is one of the areas that has evolved the least. Small TTS models have been around for 5+ years, and they've only gotten incrementally better. Giants like ElevenLabs make good-sounding TTS, but it's not quite human yet, and the improvements get smaller with each iteration.
rowanG077 7 hours ago||
Wouldn't Audible be perfectly positioned to take advantage of this? They have the perfect setup to integrate it into their offering.
Manfred 6 hours ago||
It seems more likely that people will buy a digital copy of the book for a few bucks and then run the TTS themselves on devices they already own.
howdareme9 6 hours ago|||
Not likely at all, people pay for convenience. They don't want to do that
pantalaimon 4 hours ago|||
eBooks are much more expensive than an Audible subscription though.
potatoman22 3 hours ago||
I wouldn't say so. Audible gives you 1 book a month for $15. Most e-books I see are around $10.
exceptione 3 hours ago||
Question: can anyone recommend a TTS that automatically recognizes emotion from the text itself?
fluoridation 3 hours ago|
Chatterbox does something like that. For example, if the input is

"so and so," he <verb>

and the verb is not just "said" but "chuckled", "whispered", or "said shakily", the output is modified accordingly; or, if there's an indication that a woman is speaking, it may pitch up during the quotation. It also tries to infer emotion from the textual content: if a passage reads as angry, it may try to make it sound angry. That's more hit-and-miss, but when it hits, it hits really well. A very common failure case: imagine someone trying to psych themselves up who says internally "come on, Steve, stand up and keep going"; it'll read that in a deeper voice, as if it were being spoken by a WW2 sergeant to a soldier.
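To make the dialogue-tag idea concrete, here's a tiny rule-based sketch of my own (Chatterbox's actual conditioning is learned, not hand-written rules; the verb list and style names here are made up for illustration):

```python
import re

# Hypothetical mapping from speech verbs to delivery styles.
STYLE_BY_VERB = {
    "said": "neutral",
    "whispered": "soft",
    "chuckled": "amused",
    "shouted": "loud",
}

# Match a quoted span followed by a pronoun and a speech verb,
# e.g.  "so and so," he whispered
TAG_RE = re.compile(r'"([^"]+)"\s+(?:he|she|they)\s+(\w+)')

def delivery_style(sentence: str) -> str:
    """Guess a delivery style from the dialogue tag; default to neutral."""
    m = TAG_RE.search(sentence)
    if not m:
        return "neutral"
    verb = m.group(2).lower()
    return STYLE_BY_VERB.get(verb, "neutral")

print(delivery_style('"Keep going," he whispered.'))  # → soft
```

A neural TTS does this implicitly from context, which is why it generalizes (and occasionally misfires, as with the WW2 sergeant voice) instead of depending on an explicit verb list like this one.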

aki237 5 hours ago||
This is impressive.

I just tried some sample verses, sounds natural.

But there seems to be a bug, maybe? Just for fun, I had it read the Real Slim Shady lyrics. It always seems to add one extra "please stand up" in the chorus. Anyone else see that?

gabrieldemarm 1 hour ago|
Hello, Gabriel from Kyutai here. Maybe it's related to the way we chunk the text? Can you post an issue on GitHub with the exact text and voice? I'll take a look.
britannio 6 hours ago||
This is impressive, but in a sample I tried, it switched language on the second paragraph. I'm on an M4 Pro MacBook.

https://gist.github.com/britannio/481aca8cb81a70e8fd5b7dfa2f...

anonymous344 3 hours ago||
Doesn't seem to know Thai. Can anybody suggest a Thai TTS?
agentifysh 9 hours ago||
Just added it to my Codex plugin, which reads out a summary of what it finished after each turn, and I am spooked! It runs well on my MacBook, much better than Samantha!

https://github.com/agentify-sh/speak/

gabrieldemarm 1 hour ago|
[dead]
donpdonp 9 hours ago|
It'd be nice to get some idea of what kind of hardware a laptop needs to run this voice model.