VibeVoice: A Frontier Open-Source Text-to-Speech Model

Posted by lastdong 9/3/2025

VibeVoice: A Frontier Open-Source Text-to-Speech Model (microsoft.github.io)

448 points | 170 comments | page 2

Meneth 9/3/2025

Open-source, eh? Where's the training data, then?

Joel_Mckay 9/3/2025

Scraped data is often full of copyright, usage-agreement, and privacy-law violations.

Making it "open" would be unwise for a commercial entity. =3

zoobab 9/3/2025

Open source is being abused to not provide the actual source. Stop this.

Joel_Mckay 9/3/2025

A lot of code has multiple FOSS licenses that are not contaminating the way the GPL is. GPL violations do occur on code, but they have nothing to do with the training data.

For example, many academic data sets are not public domain, and can't be used in a commercial context. A GPL claim on that data is often an argument of which thief showed up first.

Rule #24: A lawyer's Strategic Truth is to never lie, but also to avoid voluntarily disclosing information that may help opponents.

Thus, a business will never disclose they paid a fool to break laws for them... =3

nullc 9/3/2025

Perhaps, but it is not Open Source in the traditional sense if they do not provide the preferred form for modifications.

Joel_Mckay 9/3/2025

There are also some weird OSS license rules that only trip the disclosure obligation when distributing the build to end users.

Indeed, these adversarial behaviors do not follow the spirit of FOSS community standards. If a project started as FOSS, then FOSS it should remain. =3

crvdgc 9/3/2025

Very impressive that it can reproduce a Mandarin accent when speaking English and an English accent when speaking Mandarin.

stuffoverflow 9/3/2025

VibeVoice-Large is the first local TTS that can produce convincing Finnish speech with little to no accent. I tinkered with it yesterday and was pleasantly surprised at how good the voice cloning is and how it "clones" the emotion in the speech as well.

lxe 9/3/2025

There are 2 "best" TTS models out right now: HiggsAudio and VibeVoice. I found that Higgs is both faster and much higher fidelity than Vibe. Can't speak to expressiveness, but don't sleep on it.

data-ottawa 9/4/2025

Looks like the repo went private

https://github.com/microsoft/VibeVoice

I was trying to get this working on Strix Halo.

glenstein 9/3/2025

Very good and I could see how I might believe they are real people if I let my guard down. The male voice sounded a little sedated though and there was a smoothness to it that could be samey over long stretches.

Still not at the astonishing level of Google Notebook text to speech which has been out for a while now. I still can't believe how good that one is.

regularfry 9/3/2025

Ok, this is nit-picking, but it's very obvious that the sample voices these were trained with were captured in different audio environments. There's noticeable reverb on the male voice that's not there on the other.

So that's a useful next step: for multi-voice TTS models, make them sound like they're in the same room.

viggity 9/3/2025

I feel like this is a step in the right direction, but a lot of emotive text-to-speech models are only changing the duration and loudness of each word. The timing/pauses are better too.

I would love to have a model that can make sense of things like stressing particular syllables or phonemes to make a point.

watsonmusic 9/3/2025

this model is superb

cush 9/3/2025

To me this is like early generative AI art, where the images came out very "smooth" and visually buttery, except here it's the voices that have no timbre. Intonation issues aside, these models could use a touch of vocal fry and some body to be more believable.

bityard 9/3/2025

I thought the name sounded familiar. I'm guessing it's no relation to this project, which has been around for 7 months? https://github.com/mpaepper/vibevoice
