Voxtral Transcribe 2 - Hacker News

Posted by meetpateltech 7 hours ago

606 points | 151 comments

simonw 6 hours ago|

This demo is really impressive: https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtim...

Don't be confused if it says "no microphone", the moment you click the record button it will request browser permission and then start working.

I spoke fast and dropped in some jargon and it got it all right - I said this and it transcribed it exactly right, WebAssembly spelling included:

> Can you tell me about RSS and Atom and the role of CSP headers in browser security, especially if you're using WebAssembly?

tekacs 6 hours ago||

Having built with and tried every voice model over the last three years, real time and non-real time... this is off the charts compared to anything I've seen before.

And open weight too! So grateful for this.

Oras 6 hours ago|||

Thank you for the link! Their playground in Mistral does not have a microphone. it just uploads files, which does not demonstrate the speed and accuracy, but the link you shared does.

I tried speaking in 2 languages at once, and it picked it up correctly. Truly impressive for real-time.

druskacik 4 hours ago||

According to the announcement blog Le Chat is powered by the new model as well: https://chat.mistral.ai/chat

skykooler 3 hours ago|||

Doesn't seem to work for me - tried in both Firefox and Chromium and I can see the waveform when I talk but the transcription just shows "Awaiting audio input".

starkgoose 2 hours ago|||

Try disabling CSP for the page

codethief 3 hours ago|||

Same here. In Chromium I don't even see the waveform.

fragmede 2 hours ago||

I had to turn off ad-block to get it to work.

daemonologist 5 hours ago|||

404 on https://mistralai-voxtral-mini-realtime.hf.space/gradio_api/... for me (which shows up in the UI as a little red error in the top right).

jaggederest 5 hours ago|||

It can transcribe Eminem's Rap God fast sequence, really, really impressive.

rafram 5 hours ago|||

That's almost certainly in the training data, to be fair.

keeganpoppen 3 hours ago|||

what a great test hahah

carbocation 4 hours ago|||

This model was able to transcribe Bad Bunny lyrics over the sound of the background music, played casually from my speakers. Impressive, to me.

pyprism 5 hours ago|||

Wow, that’s weird. I tried Bengali, but the text transcribed into Hindi!I know there are some similar words in these languages, but I used pure Bengali that is not similar to Hindi.

derefr 4 hours ago||

Well, on the linked page, it mentions "strong transcription performance in 13 languages, including [...] Hindi" but with no mention of Bengali. It probably doesn't know a lick of Bengali, and is just trying to snap your words into the closest language it does know.

keeganpoppen 3 hours ago||

it must have some exposure to bengali— just not enough for them to advertise it. otherwise it would have a damn hard time.

darkwater 1 hour ago|||

It's really nice although I've got a sentence in French when I was speaking Italian but I corrected myself in the middle of a word.

But I'm definitely going to keep an eye on this for local-only TTS for Home Assistant.

sheepscreek 4 hours ago|||

I’ve been using AquaVoice for real-time transcription for a while now, and it has become a core part of my workflow. It gets everything, jargon, capitalization, everything. Now I’m looking forward to doing that with 100% local inference!

mentalgear 1 hour ago|||

Here European Multilingual-Intelligence truly shines!

Barbing 2 hours ago|||

Doesn’t seem to work in Safari on iOS 26.2, iPhone 17 Pro, just about anything extra disabled.

rafram 5 hours ago|||

Not terrible. It missed or mixed up a lot of words when I was speaking quickly (and not enunciating very well), but it does well with normal-paced speech.

timhh 39 minutes ago||

Yeah it messed up a bit for me too when I didn't enunciate well. If I speak clearly it seems to work very well even with background noise. Remember Dragon Naturally Speaking? Imagine having this back then!

colordrops 45 minutes ago|||

is this demo running fully in the browser?

simonw 28 minutes ago||

No, it's server-side.

Model is around 7.5 GB - once they get above 4 GB running them in a browser gets quite difficult I believe.

th0ma5 5 hours ago|||

[dead]

adarsh2321 5 hours ago|||

[flagged]

adarsh2321 5 hours ago||

[flagged]

iagooar 4 hours ago||

In English it is pretty good. But talk to it in Polish, and suddenly it thinks you speak Russian? Ukranian? Belarus? I would understand if an American company launched this, but for a company being so proud about their European roots, I think it should have better support for major European languages.

I tried English + Polish:

> All right, I'm not really sure if transcribing this makes a lot of sense. Maybe not. A цьому nie mówisz po polsku. A цьому nie mówisz po polsku, nie po ukrańsku.

loire280 1 hour ago||

They don't claim to support Polish, but they do support Russian.

> The model is natively multilingual, achieving strong transcription performance in 13 languages, including English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch. With a 4B parameter footprint, it runs efficiently on edge devices, ensuring privacy and security for sensitive deployments.

I wonder how much having languages with the same roots (e.g. the romance languages in the list above or multiple Slavic languages) affects the parameter count and the training set. Do you need more training data to differentiate between multiple similar languages? How would swapping, for example, Hindi (fairly distinct from the other 12 supported languages) for Ukrainian and Polish (both share some roots with Russian) affect the parameter count?

MarcelOlsz 1 hour ago||

Nobody ever supports Polish. It's the worst. They'll support like, Swahili, but not Polish.

londons_explore 51 minutes ago|||

200 million people speak Swahili.

39 million people speak Polish, and most of those also speak English or another more common language.

timhh 37 minutes ago||

You could say the same about Dutch to be fair. 90-95% speak English - I bet that's way higher than in Poland.

chickenimprint 49 minutes ago|||

Swahili is subcontinental lingua franca spoken by 200M people and growing quickly. Polish is spoken by a shrinking population in one country where English is understood anyways.

lm28469 3 hours ago|||

Try sticking to the supported languages

tdb7893 4 hours ago|||

Yeah, it's too bad. Apparently it only performs well in certain languages: "The model is natively multilingual, achieving strong transcription performance in 13 languages, including English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch"

ricardonunez 3 hours ago||

It did great English and Spanish, it didn't switch to Portuguese, french nor German, maybe struggle with my accent.

scotty79 2 hours ago||

Try to warn it you are going to switch language to Portugese. Worked for me.

yko 3 hours ago|||

That's a mix of Polish and Ukrainian in the transcript. Now, if I try speaking Ukrainian, I'm getting transcript in Russian every time. That's upsetting.

overfeed 2 hours ago|||

Oh no! The model won't translate to an unsupported language, and incorrectly reverts to one that it was explicitly trained on.

The base likely was pretrained on days that included Polish and Ukrainian. You shouldn't be surprised to learn it doesn't perform great on languages it wasn't trained on, or perhaps had the highest share of training data.

scotty79 2 hours ago|||

Tell it you are going to speak Polish now. It helps.

moffkalast 18 minutes ago|||

I'm not sure why but their multilingual performance in general has usually been below average. For a French company, their models are not even close to being best in French, even outdone by the likes of Qwen. I don't think they're focusing on anything but English, the rest is just marketing.

mystifyingpoi 4 hours ago||

TBH ChatGPT does the same, when I mix Polish and English. Generally getting some cyrillic characters and it gets super confused.

mnbbrown 43 minutes ago||

Incroyable! Competitive (if not better) than deepgram nova-3, and much better than assembly and elevenlabs in basically all cases on our internal streaming benchmarking.

The dataset is ~100 8kHz call recordings with gnarly UK accents (which I consider to be the final boss of english language ASR). It seems like it's SOTA.

Where it does fall down seems to be the latency distribution but I'm testing against the API. Running it locally will no doubt improve that?

dmix 6 hours ago||

> At approximately 4% word error rate on FLEURS and $0.003/min

Amazons transcription service is $0.024 per minute, pretty big difference https://aws.amazon.com/transcribe/pricing/

mdrzn 6 hours ago|

Is it 0.003 per minute of audio uploaded, or "compute minute"?

For example fal.ai has a Whisper API endpoint priced at "$0.00125 per compute second" which (at 10-25x realtime) is EXTREMELY cheaper than all the competitors.

Oras 6 hours ago||

I think the point is having it for real-time; this is for conversations rather than transcribing audio files.

jamilton 4 hours ago||

That quote was for the non-realtime model.

sbinnee 7 minutes ago||

3 hours for a single request sounds nice to me. Although the graph suggests that it’s not going to perform as good as openai model I have been using, it is open source and surely I will give it a try.

pietz 6 hours ago||

Do we know if this is better than Nvidia Parakeet V3? That has been my go-to model locally and it's hard to imagine there's something even better.

m1el 4 hours ago||

I've been using nemotron ASR with my own ported inference, and happy about it:

https://huggingface.co/nvidia/nemotron-speech-streaming-en-0...

https://github.com/m1el/nemotron-asr.cpp https://huggingface.co/m1el/nemotron-speech-streaming-0.6B-g...

Multicomp 3 hours ago||

I'm so amazed to find out just how close we are to the start trek voice computer.

I used to use Dragon Dictation to draft my first novel, had to learn a 'language' to tell the rudimentary engine how to recognize my speech.

And then I discovered [1] and have been using it for some basic speech recognition, amazed at what a local model can do.

But it can't transcribe any text until I finish recording a file, and then it starts work, so very slow batches in terms of feedback latency cycles.

And now you've posted this cool solution which streams audio chunks to a model in infinite small pieces, amazing, just amazing.

Now if only I can figure out how to contribute to Handy or similar to do that Speech To Text in a streaming mode, STT locally will be a solved problem for me.

[1] https://github.com/cjpais/Handy

m1el 1 hour ago||

you should check out

https://github.com/pipecat-ai/nemotron-january-2026/

discovered through this twitter post:

https://x.com/kwindla/status/2008601717987045382

kwindla 47 minutes ago||

Happy to answer questions about this (or work with people on further optimizing the open source inference code here). NVIDIA has more inference tooling coming, but it's also fun to hack on the PyTorch/etc stuff they've released so far.

moffkalast 13 minutes ago|||

Parakeet is really good imo too, and it's just 0.6B so it can actually run on edge devices. 4B is massive, I don't see Voxtral running realtime on an Orin or fitting on a Hailo. An Orin Nano probably can't even load it at BF16.

tylergetsay 5 hours ago|||

I've been using Parakeet V3 locally and totally ancedotaly this feels more accurate but slightly slower

czottmann 4 hours ago|||

I liked Parakeet v3 a lot until it started to drop whole sentences, willy-nilly.

cypherpunks01 18 minutes ago||

Yeah, I think the multilingual improvements in V3 caused some kind of regression for English - I've noticed large blocks occasionally dropped as well, so reverted to v2 for my usage. Specifically nvidia/parakeet-tdt-0.6b-v2 vs nvidia/parakeet-tdt-0.6b-v3

whinvik 4 hours ago||

Came here to ask the same question!

janalsncm 5 hours ago||

I noticed that this model is multilingual and understands 14 languages. For many use cases, we probably only need a single language, and the extra 13 are simply adding extra latency. I believe there will be a trend in the coming years of trimming the fat off of these jack of all trades models.

https://aclanthology.org/2025.findings-acl.87/

m463 1 hour ago||

I don't know. What about words inherited from other languages? I think a cross-language model could improve lots of things.

For example, "here it is, voila!" "turn left on el camino real"

decide1000 4 hours ago|||

I think this model proves it's very efficient and accurate.

ethmarks 1 hour ago||

But it could potentially be even more efficient if it was single-language.

depr 2 hours ago|||

STT services that have been around for longer, like Azure, Google and Amazon, generally require you to request a specific language, and their quality is a lot higher than models that advertise themselves as LLMs (even though I believe the clouds are also using the same types of models now).

popalchemist 4 hours ago|||

It doesn't make sense to have a language-restricted transcription model because of code switching. People aren't machines, we don't stick to our native languages without failure. Even monolingual people move in and out of their native language when using "borrowed" words/phrases. A single-language model will often fail to deal with that.

javier123454321 3 hours ago||

yeah, one example I run into is getting my perplexity phone assistant to play a song in spanish. I cannot for the life of me get a model to translate: "Play señorita a mi me gusta su style on spotify" correctly

idiotsecant 1 hour ago|||

The hilarious part of this comment is all the comments around it complaining about not supporting enough languages

keeganpoppen 3 hours ago|||

uhhh i cast doubt on multi-language support as affecting latency. model size, maybe, but what is the mechanism for making latency worse? i think of model latency as O(log(model size))… but i am open to being wrong / that being a not-good mental model / educated guess.

kergonath 2 hours ago|||

Even model size, it’s modest. There is a lot of machinery that is going to be common for all languages. You don’t multiply model size by 2 when you double the number of supported languages.

ethmarks 52 minutes ago||||

If encoding more learned languages and grammars and dictionaries makes the model size bigger, it will also increase latency. Try running a 1B model locally and then try to run a 500B model on the same hardware. You'll notice that latency has rather a lot to do with model size.

make3 3 hours ago|||

model size directly affects latency

raincole 2 hours ago||

Imagine if ChatGPT started like this and thought they should trim coding abilities from their language model because most people don't code.

ethmarks 55 minutes ago||

They've already done the inverse and trimmed non-coding abilities from their language model: https://openai.com/index/introducing-gpt-5-2-codex/. There's already precedent for creating domain-specific models.

I think it's nice to have specialized models for specific tasks that don't try to be generalists. Voxtral Transcript 2 is already extremely impressive, so imagine how much better it could be if it specialized in specific languages rather than cramming 14 languages into one model.

That said, generalist models definitely have their uses. I do want multilingual transcribing models to exist, I just also think that monolingual models could potentially achieve even better results for that specific language.

observationist 6 hours ago||

Native diarization, this looks exciting. edit: or not, no diarization in real-time.

https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-26...

~9GB model.

coder543 6 hours ago|

The diarization is on Voxtral Mini Transcribe V2, not Voxtral Mini 4B.

sbrother 5 hours ago|||

Do you have experience with that model for diarization? Does it feel accurate, and what's its realtime factor on a typical GPU? Diarization has been the biggest thorn in my side for a long time..

ashenke 3 hours ago|||

You can test it yourself for free on https://console.mistral.ai/build/audio/speech-to-text I tried it on an english-speaking podcast episode, and apart from identying one host as two different speakers (but only once for a few sentences at the start), the rest was flawless from what I could see

sbrother 31 minutes ago||

Amazing. Thank you.

coder543 5 hours ago|||

> Do you have experience with that model

No, I just heard about it this morning.

observationist 6 hours ago|||

Ahh, yeah, and it's explicitly not working for realtime streams. Good catch!

yko 3 hours ago||

Played with the demo a bit. It's really good at English, and detects language change on the fly. Impressive.

But whatever I tried, it could not recognise my Ukrainian and would default to Russian in absolutely ridiculous transcription. Other STT models recognise Ukrainian consistently, so I assume there is a lot of Russian in training material, and zero Ukrainian. Made me really sad.

breisa 3 hours ago|

Thats just the result of the model only supporting russian (and 12 other languages) and not urkainian. It maps to the closest words from training data.

jiehong 4 hours ago|

It’s nice, but the previous version wasn’t actually that great compared to Parakeet for example.

We need better independent comparison to see how it performs against the latest Qwen3-ASR, and so on.

I can no longer take at face value the cherry picked comparisons of the companies showing off their new models.

For now, NVIDIA Parakeet v3 is the best for my use case, and runs very fast on my laptop or my phone.

nodja 3 hours ago||

There is https://huggingface.co/spaces/hf-audio/open_asr_leaderboard but it hasn't been updated for half a year.

archb 3 hours ago||

I like Parakeet as well and use it via Handy on Mac. What app are you using on your phone?

jiehong 3 hours ago||

Spokenly has it on Mac and iOS, in both cases for free when using parakeet

More comments...