Posted by rohan_joshi 6 hours ago
Today we're releasing three new models with 80M, 40M and 14M parameters.
The largest model (80M) has the highest quality. The 14M variant reaches a new SOTA in expressivity among similar-sized models, despite being under 25 MB. This release is a major upgrade over the previous one and supports English text-to-speech in eight voices: four male and four female.
Here's a short demo: https://www.youtube.com/watch?v=ge3u5qblqZA.
Most of the models are quantized to int8 + fp16 and use ONNX Runtime for inference. They are designed to run anywhere, e.g. a Raspberry Pi, low-end smartphones, wearables, or browsers. No GPU required! This release aims to bridge the gap between on-device and cloud models for TTS applications. A multilingual model release is coming soon.
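For readers unfamiliar with how int8 quantization shrinks a model this much, here is a minimal sketch of symmetric per-tensor int8 quantization. This is a generic illustration of the technique, not the authors' actual pipeline; the function names and numbers are invented for the example.

```python
# Illustrative symmetric int8 quantization: store each float weight as a
# signed byte plus one shared fp scale. This is 1 byte/weight vs 4 for
# fp32, which is roughly how an 80M-parameter model fits in tens of MB.

def quantize_int8(weights):
    """Map float weights to int8 using a single per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights at inference time."""
    return [qi * scale for qi in q]

# Toy weight tensor (values invented for the example).
weights = [0.42, -1.27, 0.05, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
```

The quantization error per weight is at most half the scale, which is why small speech models can tolerate int8 weights with little audible quality loss; the fp16 part typically covers the activations or the layers most sensitive to precision.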
On-device AI is bottlenecked by one thing: a lack of tiny models that actually perform. Our goal is to open-source more models to run production-ready voice agents and apps entirely on-device.
We would love your feedback!
Is there any way to get these running on an iPhone? I would love for it to read articles to me like a podcast.
Is there any way to create a custom voice as a DIY project, or do we need to go through you? If the latter, would you consider adding a pricing page for purchasing a license or an alternative voice? All but one of the voices are unusable in a business context.
This is a mind-numbing task that requires workers to make hundreds of calls each day with only minor variations, sometimes navigating phone trees, and half the time leaving almost the exact same message.
Anyway, I believe almost all such businesses will be automated within months. Human labour just cannot compete on cost.
TL;DR: generate a human-like voice from animal sounds. Anyway, maybe it doesn't make sense.