Posted by rohan_joshi 5 hours ago

Show HN: Three new Kitten TTS models – smallest less than 25MB (github.com)
Kitten TTS (https://github.com/KittenML/KittenTTS) is an open-source series of tiny and expressive text-to-speech models for on-device applications. We had a thread last year here: https://news.ycombinator.com/item?id=44807868.

Today we're releasing three new models with 80M, 40M and 14M parameters.

The largest model (80M) has the highest quality. The 14M variant reaches new SOTA in expressivity among similarly sized models, despite being <25MB in size. This release is a major upgrade from the previous one and supports English text-to-speech applications in eight voices: four male and four female.

Here's a short demo: https://www.youtube.com/watch?v=ge3u5qblqZA.

Most models are quantized to int8 + fp16, and they use ONNX for runtime. Our models are designed to run anywhere, e.g. Raspberry Pi, low-end smartphones, wearables, browsers, etc. No GPU required! This release aims to bridge the gap between on-device and cloud models for TTS applications. A multilingual model release is coming soon.
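
As a rough back-of-the-envelope check on these sizes, here is a minimal sketch. It assumes the weights dominate the file size and ignores ONNX graph overhead and any other metadata; the function and table names are hypothetical and not part of the KittenTTS codebase.

```python
# Rough weight-storage estimate: parameters x bytes per parameter.
# Assumption: weights dominate the file; graph and metadata overhead are ignored.

PARAM_COUNTS = {"80M": 80_000_000, "40M": 40_000_000, "14M": 14_000_000}
BYTES_PER_PARAM = {"int8": 1, "fp16": 2, "fp32": 4}


def weight_size_mb(n_params: int, bytes_per_param: int) -> float:
    """Approximate weight storage in MB (1 MB = 10**6 bytes)."""
    return n_params * bytes_per_param / 1e6


def size_table() -> dict:
    """Estimated size in MB for every model and every precision."""
    return {
        name: {
            dtype: weight_size_mb(n_params, nbytes)
            for dtype, nbytes in BYTES_PER_PARAM.items()
        }
        for name, n_params in PARAM_COUNTS.items()
    }


if __name__ == "__main__":
    for name, sizes in size_table().items():
        row = ", ".join(f"{dtype}: {mb:.0f} MB" for dtype, mb in sizes.items())
        print(f"{name}: {row}")
```

Under these assumptions the 14M model lands between 14 MB (all int8) and 28 MB (all fp16), which is consistent with a mixed int8 + fp16 file coming in under 25MB.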

On-device AI is bottlenecked by one thing: a lack of tiny models that actually perform. Our goal is to open-source more models to run production-ready voice agents and apps entirely on-device.

We would love your feedback!

160 points | 56 comments
Remi_Etien 3 hours ago|
25MB is impressive. What's the tradeoff vs the 80M model — is it mainly voice quality or does it also affect pronunciation accuracy on less common words?
rohan_joshi 2 hours ago|
the 80M model is the highest quality while also being quite efficient. it is superior in terms of pronunciation accuracy for less common words, and is also more stable in terms of speed. it's my fav model. i think the 40M is quite similar to the 80M for most use cases. 15M is for resource-constrained CPUs, loading into a browser, etc.

The new 15M is way better than the previous 80M model (v0.1). So we're able to predictably improve the quality, which is very encouraging.

altruios 4 hours ago||
One of the core features I look for is expressive control.

Either through the API, via pitch/speed/volume parameters for more deterministic control.

Or through expressive tags such as [coughs], [urgently], or [laughs in melodic ascending and descending arpeggiated gibberish babbles].

The 25MB model is amazingly good for being 25MB. How does it handle expressive tags?

rohan_joshi 3 hours ago|
thank you so much. Right now, it cannot handle expressive tags. what kinds of tags would be most helpful to you?
daneel_w 42 minutes ago|||
Intonation (frequency rise/fall) would offer a lot of versatility.
altruios 3 hours ago|||
Narrowing it down, emotion-based tagging control would be the most helpful. Tags like [sarcastically] [happily] [joyfully] [fearfully]: so a subset of adverbs.

A stretch goal is 'arbitrary tags' from [singing] [sung to the tune of {x}] [pausing for emphasis] [slowly decreasing speed for emphasis] [emphasizing the object of this sentence] [clapping] [car crash in the distance] [laser's pew pew].

But yeah: instruction/control via [tags] is the deciding feature for me, provided prompt adherence is strong enough.

Also: a thought...

Everyone is using [] for different kinds of tags in this space, which is very simple. Maybe it makes sense to differentiate kinds of tags? I.e. [tags for modifying how text is spoken] vs {tags for creating sounds that are not specifically speech: not modifying anything... but instead its own 'sound/word'}
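
A minimal sketch of how that [style] vs {sound} split could be tokenized before the text reaches a TTS model. This is purely illustrative: Kitten TTS does not currently handle tags, and the `tokenize` function and the "speech"/"style"/"sound" labels are hypothetical names, not an existing API.

```python
import re
from typing import List, Tuple

# [ ... ] marks a style tag (modifies how the following text is spoken).
# { ... } marks a sound tag (a standalone non-speech sound).
# Nested brackets are not supported in this sketch.
TOKEN_RE = re.compile(r"\[([^\[\]]*)\]|\{([^{}]*)\}")


def tokenize(text: str) -> List[Tuple[str, str]]:
    """Split text into ("speech" | "style" | "sound", value) pairs, in order."""
    tokens: List[Tuple[str, str]] = []
    pos = 0
    for match in TOKEN_RE.finditer(text):
        # Plain text between the previous tag and this one is ordinary speech.
        plain = text[pos:match.start()].strip()
        if plain:
            tokens.append(("speech", plain))
        if match.group(1) is not None:
            tokens.append(("style", match.group(1).strip()))
        else:
            tokens.append(("sound", match.group(2).strip()))
        pos = match.end()
    # Anything left after the last tag is also speech.
    tail = text[pos:].strip()
    if tail:
        tokens.append(("speech", tail))
    return tokens


if __name__ == "__main__":
    for kind, value in tokenize("[happily] Nice to see you {clapping} again."):
        print(f"{kind:>6}: {value}")
```

Keeping the two tag kinds distinct at the tokenizer level means a model or front end could ignore unsupported sound tags without dropping the style tags around them.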

rohan_joshi 2 hours ago||
yeah i think to start with, narrowing it down to a few tags would be most helpful and we'll probably start w that first. Thanks a lot!
DavidTompkins 3 hours ago||
This would be great as a JS package - 25MB is small enough that I think it'd be worth it (in-browser TTS is still pretty bad and varies by browser)
rohan_joshi 2 hours ago|
great idea, we're on it. we're also working on a mobile sdk. a browser sdk would be really cool too.
sschueller 2 hours ago||
I'm still looking for the "perfect" setup in order to clone my voice and use it locally to send voice replies in telegram via openclaw. Does anyone have such a setup?

I want to be my own personal assistant...

EDIT: I can provide it an RTX 3080 Ti.

nicpottier 1 hour ago||
Try training a model on piper. You will need to record a lot of utterances, but the results are pretty great and the output is a fast TTS model.
ilaksh 2 hours ago|||
You need to provide info on your hardware. Pocket-TTS does cloning on CPU, but for me randomly outputs something pretty weird sounding mixed in with like 90% good outputs. So it hasn't been quite stable enough to run without checking output. But maybe it depends on your voice sample.

Qwen 3 TTS is good for voice cloning but requires GPU of some sort.

justanotherunit 2 hours ago||
Isn't it just a matter of training a model on your voice recordings and using that to generate audio clips from text?
ks2048 4 hours ago||
There are a number of recent, good quality, small TTS models.

If the author doesn't describe some detail about the data, training, or a novel architecture, etc, I can only assume they just took another one, did a little finetuning, and repackaged it as a new product.

the_duke 4 hours ago||
Any recommendations?
Joel_Mckay 1 hour ago||
Depends how small or complex you want a TTS, as flite + flitevox voice packages worked on pi or zynq ARM cpu just fine. =3

Also:

https://github.com/sparkaudio/spark-tts

okokwhatever 4 hours ago||
[flagged]
devinprater 3 hours ago||
A lot of these models struggle with small text strings, like "next button" that screen readers are going to speak a lot.
soco 3 hours ago|
I think I've tried everything I could on my Android and: 1. outside of webpage reading, there are not many options; 2. as browser extensions, also not many (I don't like to copy URLs into your app); 3. they all insist on reading every little shit, not only buttons but also "wave arrow pointing directly right", which some people use in their texts. So basically reading text aloud is a bunch of shitty options. Anyone jumping in this market opening?
rohan_joshi 2 hours ago||
we'd love to serve this use-case. i'll make a demo for this next week and comment here with it.
fwsgonzo 4 hours ago||
How much work would it be to use the C++ ONNX runtime with this instead of Python? Is it a Claudeable amount of work?

The iOS version is Swift-based.

rohan_joshi 4 hours ago|
shouldn't be hard. what backend/hardware are you interested in running this with? i'll add an example for running the onnx model from C++. btw check out the roadmap: our inference engine will be out in 1-2 weeks and it is expected to be faster than onnx.
fwsgonzo 1 hour ago||
desktop CPUs running inference on a single background thread would be the ideal case for what I'm considering.
great_psy 4 hours ago||
Thanks for working on this!

Is there any way to get these running on iPhone? I would love to have the ability for it to read articles to me like a podcast.

rohan_joshi 4 hours ago|
yes, we're releasing an official mobile sdk and inference engine very soon. if you want to use something until then, some folks from the oss community have built ways to run kitten on ios. if you search kittentts ios on github you should find a few. if you can't find one, feel free to ping me and i can help you set it up. thanks a lot for your support and feedback!
ilaksh 4 hours ago||
Thanks for open sourcing this.

Is there any way to do a custom voice as a DIY? Or do we need to go through you? If so, would you consider making a pricing page for purchasing a license/alternative voice? All but one of the voices are unusable in a business context.

rohan_joshi 4 hours ago|
thanks a lot for the feedback. yes, we're working on a diy way to add custom voices and will also be releasing a model with more professional voices in the next 2-3 weeks. as of now, we're providing commercial support for custom voices, languages and deployment through the support form on our github. can you share more about your business use-case? if possible, i'd like to ensure the next release can serve that.
ilaksh 2 hours ago||
Right now it's outgoing calls for a small business client that checks information. Although if they call back they don't mind an automated system, on outgoing calls the person answering will often hang up if they detect AI right away, so we use a realistic custom voice with an accent.

This is a mind numbing task that requires workers to make hundreds of calls each day with only minor variations, sometimes navigating phone trees, half the time leaving almost the exact same message.

Anyway, I believe almost all such businesses will be automated within months. Human labour just cannot compete on cost.

Tacite 4 hours ago|
Is it English only?
rohan_joshi 4 hours ago|
as of now it's english only. the training for the multilingual model is underway and it should be out in April! what languages are you most interested in? Right now, we are providing deployments for custom languages + voices through the support form on our github.
ivm 1 hour ago|||
Spanish would be great, there's a serious lack of Spanish TTS on Android compared to iOS and the quality is not the best.
Zopieux 2 hours ago|||
French, Spanish, German would go a long way.