Posted by pain_perdu 1 day ago
I've seen some agentic models at 4B or similar that punch above their weight, and even some basic models that do. I can definitely see them in a home-lab context without costing too much money.
I think unmute.sh at least is similar to / competes with ChatGPT's voice model. It's crazy how good and effective open-source models are from top to bottom. There's basically something for almost everyone.
I feel like the only true moat might exist in coding models. Some open ones are pretty good, but it's the only segment where people might pay 10x-20x more for the best (compare MiniMax/z.ai subscription fees vs Claude Code).
It will be interesting to see whether we get another DeepSeek moment in AI, with something that beats Claude Sonnet or similar. I think DeepSeek has DeepSeek 4 coming, so it will be interesting to see how/if it can beat Sonnet.
(Sorry for going off-topic)
I think they should have at least mentioned in the title that it's English-only.
https://github.com/supertone-inc/supertonic
https://github.com/ekwek1/soprano
No affiliation with either. Cool tech demo, though!
There are about 1.5B English speakers on the planet.
You pull up a map and start navigation. All the street names are in the local language, and no, transliterating the local names into the English alphabet does not make them understandable when spoken by TTS. Not to mention localised foreign names, which are then completely mangled by transliterating them to English.
You pull up a browser and open a news article in your local language to read during your commute. You now have to reach for a translation model first before passing the text to the English-only TTS software.
You're driving, and one of your friends Signals you. Your phone UI is in English, so you get a notification (interrupting your Spotify) saying 'Signal message', followed by 5 minutes of gibberish.
But let's say you have a TTS model that supports your local language natively. Well, since those '1.5B English speakers' apparently exist on the planet, many texts in other languages include English or Latin names and words. Now you have the opposite issue -- your TTS software needs to switch to English to pronounce these correctly...
And mind you, these are just very simple use cases for TTS. If you delve into the use cases of people with limited sight, who experience the entire Internet and all mobile and desktop applications (often with poor localisation) via TTS, you see how monolingual TTS is mostly useless and would be swapped for a robotic old-school TTS in a flash...
> Not only that, but it's also common to have the system language set to English
Ask a German whether their system language is English. Ask a French person. I can go on.
I'm German but my system language is English
Because translations often suck, are incomplete or inconsistent
Multilingual doesn't mean language agnostic. We humans are always monolingual, just multi-language hot-swappable if trained. It's more like you `make; make install` Docker, after which you can attach to and detach from alternate environments in the terminal to do things or take notes in and out.
People sometimes picture multilingualism as owning a single joined-together super-language in the brain. That usually doesn't happen. Attempting it, especially at a young age, can leave a person in a "semi-lingual" or "double-limited" state where they are not fully fluent in any particular language.
And so, criticizing someone for not devoting significant resources to building an omnilingual TTS doesn't make much sense.
This is plainly not true.
> Multilingual doesn't mean language agnostic. We humans are always monolingual, just multi-language hot-swappable if trained
This and the analogy make no sense to me. Mind you, I am trilingual.
I also did not imply that the model itself needs to be multilingual. I implied that the software that uses the model to generate speech must be multilingual and support language change detection and switching mid-sentence.
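To make that concrete, here's a rough Python sketch of what I mean, built on the real langdetect package; the VOICES table and route_segments are invented placeholder names, not any actual TTS library's API:

    # Rough sketch of mid-sentence language switching. Assumes one TTS
    # voice per language; VOICES and the routing are illustrative only.
    from langdetect import detect  # real package: pip install langdetect

    VOICES = {"en": "en_voice", "de": "de_voice", "fr": "fr_voice"}

    def route_segments(text, fallback="en"):
        """Split text into naive comma-separated chunks, detect each
        chunk's language, and pair it with the matching voice."""
        plan = []
        for chunk in text.split(","):
            chunk = chunk.strip()
            if not chunk:
                continue
            try:
                lang = detect(chunk)
            except Exception:  # langdetect raises on empty/featureless input
                lang = fallback
            plan.append((VOICES.get(lang, VOICES[fallback]), chunk))
        return plan

    # Short proper nouns like "Nantes" routinely fool chunk-level
    # detection, which is exactly why mid-sentence switching is hard.
    print(route_segments("Ich fahre zur Arbeit, see you at the office"))

Even this toy version shows the problem: detection needs enough text to work, but the spans that matter (names, loanwords) are exactly the short ones.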
Since it's not abundantly obvious that this is satire, I'll interject: humans, including professional "simultaneous" interpreters, can't do this. This is not how languages work.
I think it's the wrong example, because this is actually very common if you're a Chinese speaker.
Actually, people tend to say the names of cities in their own country in their native language.
> I went to Nantes [0], to eat some kouign-amann [1].
As a French speaker, both [0] and [1] will be spoken the French way on the fly within the sentence, while the other words are in English. Switching happens without any pause whatsoever (because there is really only one way to pronounce those names in my mind; no thinking required).
Note that in speech recognition it is fairly common to have models that understand language switches within a sentence, as with Parakeet.
To me, what you're saying is the same as saying the art of a movie is in the script, and the video is just the method of making it available. I don't think that's a valid take.
So yes, I mostly agree with GP. An audiobook is a different rendering of the same subject. The content is in the text, regardless of whether it's delivered in written or oral form.
Set up SherpaTTS as the voice model for your phone (I like the en_GB-jenny_dioco-medium voice option, but there are several to choose from). Add an ebook to Librera Reader and open it. There's an icon with a little person wearing headphones, which lets you send the text continuously to your phone's TTS, using just local processing on the phone. I don't have the latest phone, but mine is able to process it faster than the audio is read, so the audio doesn't stop and start.
The voice isn't totally human-sounding, but it's a lot better than the Microsoft Sam days, and once you get used to it the roboticness fades into the background and I can just listen to the story. You may get better results with Kokoro (I couldn't get it running on my phone) or similar TTS engines and a more powerful phone.
One thing I like about this setup is that if you want to swap back and forth between audio and text, you can. The reader scrolls automatically as it makes the audio, and you can pause it, read in silence for a while yourself and later set it going from a new point.
"so and so," he <verb>
and the verb is not just "said", but "chuckled", or "whispered", or "said shakily", the output is modified accordingly, or if there's an indication that it's a woman speaking it may pitch up during the quotation. It also tries to guess emotive content from textual content, such if a passage reads angry it may try to make it sound angry. That's more hit-and-miss, but when it hits, it hits really well. A very common failure case is, imagine someone is trying to psych themselves up and they say internally "come on, Steve, stand up and keep going", it'll read it in a deeper voice like it was being spoken by a WW2 sergeant to a soldier.
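The dialogue-tag part of that is roughly mechanisable. Here's a toy Python sketch of the idea; the STYLE_HINTS table and its fields are my own invented placeholders, not parameters any actual audiobook TTS exposes:

    # Toy sketch of dialogue-tag detection: find `"...," he <verb>`
    # patterns and map the verb to a delivery hint.
    import re

    STYLE_HINTS = {
        "whispered": {"volume": 0.3, "breathy": True},
        "chuckled": {"pitch_variation": 1.3},
        "shouted": {"volume": 1.0, "energy": 1.5},
        "said": {},  # neutral delivery
    }

    TAG = re.compile(r'"(?P<quote>[^"]+?),?"\s+(?:he|she|they)\s+(?P<verb>\w+)')

    def delivery_plan(passage):
        """Yield (quote, style) pairs; unknown verbs fall back to neutral.
        Adverb forms like "said shakily" would need extra parsing."""
        for m in TAG.finditer(passage):
            yield m.group("quote"), STYLE_HINTS.get(m.group("verb").lower(), {})

    for quote, style in delivery_plan('"Keep going," he whispered. "Run!" she shouted.'):
        print(quote, style)

The hard part, per the failure case above, is that the verb table only gets you so far; deciding who is speaking and in what register needs real context understanding, which is where the hit-and-miss comes from.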
I just tried some sample verses; it sounds natural.
But there seems to be a bug, maybe? Just for fun, I asked it to play the Real Slim Shady lyrics. It always seems to add one extra "please stand up" in the chorus. Anyone else see that?
https://gist.github.com/britannio/481aca8cb81a70e8fd5b7dfa2f...