Posted by rohan_joshi 3 hours ago

Show HN: Three new Kitten TTS models – smallest less than 25MB (github.com)
Kitten TTS (https://github.com/KittenML/KittenTTS) is an open-source series of tiny and expressive text-to-speech models for on-device applications. We had a thread last year here: https://news.ycombinator.com/item?id=44807868.

Today we're releasing three new models with 80M, 40M and 14M parameters.

The largest model (80M) has the highest quality. The 14M variant reaches a new SOTA in expressivity among similar-sized models, despite being <25MB in size. This release is a major upgrade from the previous one and supports English text-to-speech applications in eight voices: four male and four female.

Here's a short demo: https://www.youtube.com/watch?v=ge3u5qblqZA.

Most models are quantized to int8 + fp16 and run via ONNX Runtime. Our models are designed to run anywhere, e.g. Raspberry Pi, low-end smartphones, wearables, browsers, etc. No GPU required! This release aims to bridge the gap between on-device and cloud models for TTS applications. A multilingual model release is coming soon.
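A back-of-the-envelope check on the size claim. Int8 weights take 1 byte per parameter and fp16 weights take 2; the post doesn't say what fraction of each model stays in fp16, so the 20% below is a made-up assumption for illustration only:

```python
# Rough model-size estimate for mixed int8/fp16 quantization.
# fp16_fraction is a guessed split, NOT a figure from the release.
def approx_size_mb(params_millions, fp16_fraction=0.2):
    int8_bytes = params_millions * 1e6 * (1 - fp16_fraction)      # 1 byte/param
    fp16_bytes = params_millions * 1e6 * fp16_fraction * 2        # 2 bytes/param
    return (int8_bytes + fp16_bytes) / 1e6

for p in (14, 40, 80):
    print(f"{p}M params -> ~{approx_size_mb(p):.0f} MB")
```

Under that assumption the 14M model lands around 17 MB, comfortably under the advertised 25 MB.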

On-device AI is bottlenecked by one thing: a lack of tiny models that actually perform. Our goal is to open-source more models to run production-ready voice agents and apps entirely on-device.

We would love your feedback!

160 points | 56 comments
kevin42 2 hours ago|
What I love about OpenClaw is that I was able to send it a message on Discord with just this github URL and it started sending me voice messages using it within a few minutes. It also gave me a bunch of different benchmarks and sample audio.

I'm impressed with the quality given the size. I don't love the voices, but it's not bad. Running on an intel 9700 CPU, it's about 1.5x realtime using the 80M model. It wasn't any faster running on a 3080 GPU though.

rohan_joshi 2 hours ago|
yeah we'll add some more professional-sounding voices and also support for diy custom voices. we tried to add more anime/cartoon-ish voices to showcase the expressivity.

Regarding running on the 3080 gpu, can you share more details on github issues, discord or email? it should be blazing fast on that. i'll add an example to run the model on gpu too.

vezycash 39 minutes ago||
Would an Android app of this be able to replace the built in tts?
rohan_joshi 37 minutes ago|
yes, our mobile sdk is coming soon (eta 2 weeks) so we should be able to replace the built-in version of it. can you share what tts use-case you're thinking of?
satvikpendem 32 minutes ago||
I use an epub reader like Moon+ with the built-in TTS to turn epubs into audiobooks, and I tried Kokoro TTS but the issue was too much lag between sentences, plus it doesn't preprocess the next sentence while it reads out the current one.
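The lag problem described here is usually solved with lookahead: synthesize the next sentence on a background thread while the current clip plays. A minimal sketch, with a dummy `synthesize()` standing in for whatever TTS model you actually call:

```python
import queue
import threading
import time

def synthesize(sentence):
    """Stand-in for a real TTS call (hypothetical; substitute your model here)."""
    time.sleep(0.01)  # pretend synthesis takes time
    return f"<audio for: {sentence}>"

def read_aloud(sentences, prefetch=2):
    """Yield audio clips in order, synthesizing ahead of playback."""
    q = queue.Queue(maxsize=prefetch)  # bounded: don't synthesize the whole book

    def producer():
        for s in sentences:
            q.put(synthesize(s))       # runs while the consumer is "playing"
        q.put(None)                    # sentinel: no more audio

    threading.Thread(target=producer, daemon=True).start()
    while (clip := q.get()) is not None:
        yield clip                     # in a real app: play the clip here

clips = list(read_aloud(["First sentence.", "Second sentence."]))
```

The bounded queue keeps memory flat on long books while still hiding per-sentence synthesis latency behind playback.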
gabrielcsapo 22 minutes ago|||
Working on a reader and server that use pockettts to turn epubs into audiobooks: https://github.com/gabrielcsapo/compendus shows a virtual scroller for the text and audio.
rohan_joshi 26 minutes ago|||
okay this seems pretty doable, i think i know someone who is working on an epub reader using kittentts. if they don't post about it, i'll do it once it's done.
armcat 50 minutes ago||
This is awesome, well done. Been doing a lot of work with voice assistants; if you can replicate Qwen3-TTS-style voice cloning in this small form factor, you will be absolute legends!
rohan_joshi 40 minutes ago|
thanks a lot, our voice cloning model will be out by May. we're experimenting w some very cool ways of doing voice cloning at 15M but will have a range of models going up to 500M
pumanoir 52 minutes ago||
The example.py file says "it will run blazing fast on any GPU. But this example will run on CPU."

I couldn't locate how to run it on a GPU anywhere in the repo.

rohan_joshi 40 minutes ago|
thanks for the feedback. i'll add an example of running it on gpu.
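Until an official example lands, running an ONNX model on GPU generally comes down to passing a provider list to ONNX Runtime. The provider names below are real ORT identifiers, but `pick_providers` and the session call are a hypothetical sketch, not the project's API:

```python
# Hypothetical helper: prefer a GPU provider when ONNX Runtime reports one.
def pick_providers(available):
    preferred = [
        "CUDAExecutionProvider",    # NVIDIA GPUs
        "CoreMLExecutionProvider",  # Apple devices
        "CPUExecutionProvider",     # always-available fallback
    ]
    return [p for p in preferred if p in available]

# With onnxruntime installed, usage would look something like:
#   import onnxruntime as ort
#   sess = ort.InferenceSession(
#       "model.onnx",
#       providers=pick_providers(ort.get_available_providers()),
#   )
print(pick_providers(["CPUExecutionProvider", "CUDAExecutionProvider"]))
```

ORT falls through the list in order, so the CPU entry keeps the code working on machines without a GPU.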
ks2048 2 hours ago||
You should put examples comparing the 4 models you released - same text spoken by each.
rohan_joshi 2 hours ago|
great idea, let me add this. meanwhile, you can try the models on our huggingface spaces demo here: https://huggingface.co/spaces/KittenML/KittenTTS-Demo
magicalhippo 2 hours ago||
A lot of good small TTS models in recent times. Most seem to struggle hard on prosody though.

Kokoro TTS for example has a very good Norwegian voice but the rhythm and emphasizing is often so out of whack the generated speech is almost incomprehensible.

Haven't had time to check this model out yet, how does it fare here? What's needed to improve the models in this area now that the voice part is more or less solved?

rohan_joshi 1 hour ago||
small models struggle with prosody due to limited capacity. this version does much better than the previous one and is the best among other <25MB models. Kokoro is a really good model for its size, it's competitive on artificial analysis too. i think by the next release we should have something kokoro-quality but a fifth of the size. Adding control for rhythm seems to be quite important too, and we should start looking at that for other languages.
soco 1 hour ago||
That, and also using English words in the middle of another language phrase confuses them a lot.
rohan_joshi 1 hour ago||
yes. the current release of our model is english-only. so other languages are not expected to perform well. we'll try to look out for this in our multilingual release.
gabrielcsapo 25 minutes ago||
are there plans to output text alignment?
rohan_joshi 24 minutes ago|
yes, we just started working on this yesterday haha, great that you mentioned it. once we have it working it'll be out soon.
gabrielcsapo 16 minutes ago||
that would be awesome, I was using pockettts then I had to run it through whisper to get the accurate alignment. Not super productive for realtime work.
whitepaper27 32 minutes ago||
This is great. Demo looks awesome.
rohan_joshi 25 minutes ago|
thanks, glad you liked it
schopra909 58 minutes ago||
Really cool to see innovation in terms of quality of tiny models. Great work!
rohan_joshi 38 minutes ago|
thanks a lot. small model quality is improving exponentially. This 14M is way better than the 80M model from our previous launch (V0.1).
janice1999 51 minutes ago|
What's the actual install size for a working example? Like similar "tiny" projects, do these models actually require installing 1GB+ of dependencies?
deathanatos 19 minutes ago||
Running the example is 3 MiB for the repo, +667 MiB of Python dependencies, +86 MiB of models that will get downloaded from HuggingFace. =756 MiB.

(That's using the example as-is. If you switch it to the smaller model, modify the above with +57 MiB of models from HuggingFace, or =727 MiB.)
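The totals in this comment are easy to sanity-check (all figures in MiB, copied from above):

```python
repo_mib, deps_mib = 3, 667                  # repo clone + Python dependencies
large_models_mib, small_models_mib = 86, 57  # HuggingFace model downloads

print(repo_mib + deps_mib + large_models_mib)  # 756
print(repo_mib + deps_mib + small_models_mib)  # 727
```

Either way, the dependency stack dwarfs the models themselves by roughly 10x.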

wedowhatwedo 11 minutes ago||
My quick test showed ~670 MB of Python libraries required on top of the model.