VibeVoice: A Frontier Open-Source Text-to-Speech Model

Posted by lastdong 9/3/2025

VibeVoice: A Frontier Open-Source Text-to-Speech Model (microsoft.github.io)

448 points | 170 comments | page 3

mpaepper 9/4/2025|

Unfortunate naming, given that 7 months ago I named my own repo vibevoice; it does open-source, locally running speech-to-text:

faxmeyourcode 9/3/2025||

I tried the colab notebook that they link to and couldn't replicate the quality for whatever reason. I just swapped out the text and let it run on the introduction paragraph of Metamorphosis by Franz Kafka and it seemingly could not handle the intricacies.

wewewedxfgdf 9/3/2025||

I'm really hoping one day there will be a TTS that does really nice British accents - I've surveyed them all deeply, none do.

Most that claim to do a British accent end up sounding like Kelsey Grammer - sort of an American accent pretending to be British.

specproc 9/3/2025||

I'd like one that really nails Brummie.

xp84 9/5/2025||

I’m just a yank, but a lot of the AI-voiced videos on YouTube that I’ve been listening to while I’m falling asleep lately have British voices that sound quite nice to me.

ndkap 9/4/2025||

Here is AI being as close as possible to the most animated person I know, and here I am sounding robotic in every conversation I have, despite my best efforts to sound otherwise. Sometimes, I just wish I could have an AI speak for me.

lyu07282 9/4/2025||

Did they delete the repo? It's 404 for me now: https://github.com/microsoft/VibeVoice

RealtyDAO 9/4/2025|

they must have removed it.. been down for hrs.

lyu07282 9/5/2025||

Repo is back but code is gone, with this statement:

> 2025-09-05: VibeVoice is an open-source research framework intended to advance collaboration in the speech synthesis community. After release, we discovered instances where the tool was used in ways inconsistent with the stated intent. Since responsible use of AI is one of Microsoft’s guiding principles, we have disabled the repo until we are confident that out-of-scope use is no longer possible.

What was that about?

bazlan 9/3/2025||

Sad to not see vui on the comparisons!

A 100M podcast model

https://huggingface.co/spaces/fluxions/vui-space

ementally 9/3/2025||

they vibecoded their demo website? the text is invisible on Firefox.

double_one 9/3/2025||

Same problem here. A quick refresh solved it for me — maybe try that?

recursive 9/3/2025||

Works for me

anarticle 9/3/2025||

The first example sounds like a cry for help.

Some of them have tone wobbles which iirc was more common in early TTS models. Looks like the huge context window is really helping out here.

baal80spam 9/3/2025||

Wow. I admit that I am not a native speaker, but this looks (or rather, sounds) VERY impressive and I could mistake it for hearing two people talking.

x187463 9/3/2025||

The giveaway is they will never talk over each other. Only one speaker at a time, consistently.

tracker1 9/3/2025|||

Fair enough... though it would be possible to generate that, introduce stuttering/pauses at the beginning and end of statements, and then edit the output so the speech overlaps.

You would probably want to do something similar for balance/crossfade anyway... having each speaker's input offset from center instead of straight mono.
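
For anyone curious what that post-processing might look like, here is a minimal sketch in plain Python. It assumes each speaker turn has already been generated as a separate mono clip (a list of float samples); the names `pan_gains` and `mix_turns`, the `overlap` parameter, and the pan values are hypothetical and have nothing to do with VibeVoice's own code. Each turn starts a fixed number of samples before the previous one ends, and each speaker is panned off-center with a constant-power pan law.

```python
import math


def pan_gains(pan):
    """Constant-power pan: -1.0 is hard left, 0.0 is center, +1.0 is hard right."""
    angle = (pan + 1.0) * math.pi / 4.0
    return math.cos(angle), math.sin(angle)


def mix_turns(turns, overlap, pans):
    """Overlay mono speaker turns into one stereo track.

    turns:   list of (speaker, samples) in speaking order; samples are floats
    overlap: number of samples each turn starts before the previous one ends
    pans:    dict mapping speaker -> pan position in [-1.0, 1.0]
    """
    left, right = [], []
    cursor = 0  # sample index where the previous turn ended
    for speaker, samples in turns:
        start = max(0, cursor - overlap)
        end = start + len(samples)
        # Grow the output buffers with silence if this turn runs past the end.
        if end > len(left):
            pad = end - len(left)
            left.extend([0.0] * pad)
            right.extend([0.0] * pad)
        gain_l, gain_r = pan_gains(pans[speaker])
        # Sum into the buffers so overlapping speech is mixed, not replaced.
        for i, sample in enumerate(samples):
            left[start + i] += sample * gain_l
            right[start + i] += sample * gain_r
        cursor = end
    return left, right


if __name__ == "__main__":
    # Two short dummy "turns"; real input would be decoded audio samples.
    turns = [("alice", [0.5] * 8), ("bob", [0.5] * 8)]
    left, right = mix_turns(turns, overlap=3, pans={"alice": -0.4, "bob": 0.4})
    print(len(left), len(right))
```

A fixed overlap is the crudest possible choice; randomizing it per turn, or adding short fades at the turn boundaries, would be natural next steps, but that is beyond this sketch.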

kaptainscarlet 9/3/2025||||

Also, the lack of stutter and the perfect flow of speech are a dead giveaway.

kridsdale1 9/3/2025|||

And longer pause between turns than humans would do.

tracker1 9/3/2025||

Yeah, a lot of the TTS has gotten really impressive in general. Definitely a clear leap from the TTS stuff I worked with for training simulations a bit over a decade ago. Aside: Installing a sound card (unused) on a Windows server just to be able to generate TTS was interesting. It was required by the platform, even if it wasn't used for it.

I generally don't like a lot of the AI-generated slop that's starting to pop up on YouTube these days... I did enjoy some of the Reddit story channels, but have completely stopped with it all now. With the AI stuff, it really becomes apparent with dates/ages and when numbers are spoken. Dates/ages/timelines are just off as far as story generation goes, and really should be human-tweaked. As for the voice gen, the way it says a year or measurement is just not how English speakers (US or otherwise) speak.

qwertytyyuu 9/3/2025|

Woah, they even imitate the western Chinese accent well.
