Is it better? Worse? Why do they only compare to GPT-4o mini Transcribe?
What makes it particularly misleading is that models that transcribe to lowercase and then use inverse text normalization (ITN) to restore structure and grammar end up making a very different class of mistakes than Whisper, which goes directly to final-form text, including punctuation, quotes, and tone.
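To make that concrete, here's a toy sketch (Python, purely illustrative; real ITN systems are far more elaborate, and this is not any vendor's actual pipeline):

```python
# A lowercase-first model emits RAW, and a separate inverse text
# normalization (ITN) pass restores numbers, casing, and punctuation.
# A Whisper-style model emits the final form directly. The two designs
# therefore fail differently: ITN botches formatting rules, while
# direct decoding botches the words themselves.
RAW = "the invoice was twenty five dollars due on march third"

REWRITES = {"twenty five": "25", "third": "3rd"}  # hypothetical rule table

def toy_itn(text: str) -> str:
    for words, symbol in REWRITES.items():
        text = text.replace(words, symbol)
    text = text.replace("25 dollars", "$25")             # currency formatting
    text = text.capitalize().replace("march", "March")   # casing
    return text + "."

print(toy_itn(RAW))  # The invoice was $25 due on March 3rd.
```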
But nonetheless, they're claiming an error rate so much lower than Whisper's that it's almost not in the same bucket.
There's a reason that quite a lot of good transcribers still use V2, not V3.
For the hosted Whisper API (with large-v3) I've found "$0.00125 per compute second", which is the absolute cheapest I've ever found.
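Note that compute-second pricing isn't directly comparable to per-audio-minute pricing without knowing how fast the model runs. A hedged conversion, with the realtime factor purely assumed:

```python
# Convert compute-second pricing to per-audio-minute pricing.
price_per_compute_second = 0.00125  # USD, from the quote above
speedup = 20                        # assumed realtime factor, not from the thread

compute_seconds_per_audio_minute = 60 / speedup  # 3 s of compute per audio minute
cost_per_audio_minute = compute_seconds_per_audio_minute * price_per_compute_second
print(f"${cost_per_audio_minute:.5f} per audio minute")  # $0.00375
```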
Why should it be Whisper v3? They even released an open model: https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-26...
If you transcribe a minute of conversation, you'll have something like 5 words transcribed wrongly. In an hour-long podcast, that's 300 wrongly transcribed words.
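Spelling out that arithmetic (the speaking rate is an assumption, not from the thread):

```python
# 5 errors per minute of conversation, scaled to an hour-long podcast.
errors_per_minute = 5                       # from the comment above
words_per_minute = 150                      # assumed conversational speaking rate
wer = errors_per_minute / words_per_minute  # ~3.3% word error rate
errors_per_hour = errors_per_minute * 60    # 300 wrongly transcribed words

print(f"WER ~{wer:.1%}, {errors_per_hour} errors in an hour")
```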
The hardest one I did was for a sports network: a motocross event where most of what you could hear was the roar of the bikes. There were two commentators I had to transcribe over the top of that mess, and they were using the slang insider nicknames for all the riders, not their published names, so I had to sit and Google forums to find the riders' names while I was listening. I'm not even sure how these local models could handle that insanity at all, because they almost certainly lack enough domain knowledge.
How hard a recording is to transcribe depends on a lot of factors:
- familiarity with the accent and/or speaker;
- speed and style/cadence of the speech;
- any other audio in the recording that can muffle or distort the speech;
- etc.
It can also take multiple passes to get a decent transcription.
"Click me to try now!" banners that lead to a warning screen that says "Oh, only paying members, whoops!"
So, you don't mean 'try this out', you mean 'buy this product'.
Let's not act like it's a free sampler.
I can't comment on the model: I'm not giving them money.
> We've worked hand-in-hand with the vLLM team to have production-grade support for Voxtral Mini 4B Realtime 2602 with vLLM. Special thanks goes out to Joshua Deng, Yu Luo, Chen Zhang, Nick Hill, Nicolò Lucchesi, Roger Wang, and Cyrus Leung for the amazing work and help on building a production-ready audio streaming and realtime system in vLLM.
https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-26...
https://docs.vllm.ai/en/latest/serving/openai_compatible_ser...
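A minimal sketch of what querying a locally served instance could look like, assuming the model ID from the card above and that vLLM exposes its OpenAI-compatible transcription route for this model (check the docs linked above for the real command and flags):

```python
# First serve the model, e.g.:
#   vllm serve mistralai/Voxtral-Mini-4B-Realtime-2602
# Then hit the OpenAI-compatible endpoint it exposes.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("meeting.m4a", "rb") as audio:
    result = client.audio.transcriptions.create(
        model="mistralai/Voxtral-Mini-4B-Realtime-2602",
        file=audio,
    )
print(result.text)
```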
On the information density of languages: it is true that some languages have a more information-dense textual representation. But all spoken languages convey roughly the same amount of information in the same amount of time. That is not all that surprising: it just means that human brains have an optimal range at which they process information.
Further reading: Coupé, Christophe, et al. "Different Languages, Similar Encoding Efficiency: Comparable Information Rates across the Human Communicative Niche." Science Advances 5.9 (2019). https://doi.org/10.1126/sciadv.aaw2594
Evidence?
It seems like the best tradeoff between information density and understandability actually comes from the deep Latin roots of the language.
I agree with your belief: other languages have either lower density (e.g. German) or lower understandability (e.g. English).
Italian has one official standard (two, if you count it_CH, but the difference is minor), doesn't pay much attention to stress and vowel length, and only has a few "confusable" sounds (gl/l, gn/n, double consonants: the stuff you get wrong in primary school). Italian dialects would be a disaster though :)
That's interesting. As a linguist, I have to say that Haskell is the most computationally advanced programming language, having the best balance of clear syntax and expressiveness. I am qualified to say this because I once used Haskell to make a web site, and I also tried C++ but I kept on getting errors.
/s obviously.
TL;DR: computer scientists feel unjustifiably entitled to make scientific-sounding but meaningless pronouncements on topics outside their field of expertise.
I don't know how widely accepted that conclusion is, what exceptions there may be, etc.
You could use their API (they have this snippet):
```
curl -X POST "https://api.mistral.ai/v1/audio/transcriptions" \
  -H "Authorization: Bearer $MISTRAL_API_KEY" \
  -F model="voxtral-mini-latest" \
  -F file=@"your-file.m4a" \
  -F diarize=true \
  -F timestamp_granularities="segment"
```
Via the API it took 18s to transcribe a 20-minute audio file I had lying around in which someone is reviewing a product.
I'm sure ways of running this locally will be up and available soon (if they aren't on Hugging Face right now), but the API is $0.003/min. If it's something like 120 meetings (10 years of monthly ones) at 1 hour each, that's roughly $20. Depending on whether they're 1 or 10 hours (or weekly rather than monthly, or 10 parallel sessions or something), this might be a price you're willing to pay if you get the results back in an afternoon.
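For what it's worth, the arithmetic behind that estimate (assumptions as stated above):

```python
# Rough cost check for the $0.003/min figure above.
price_per_min = 0.003      # USD per audio minute, from the API pricing
meetings = 10 * 12         # 10 years of monthly meetings
minutes_each = 60          # assume 1-hour meetings

total = meetings * minutes_each * price_per_min
print(f"${total:.2f}")     # $21.60, i.e. "roughly $20"
```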
edit: their realtime model can be run with vLLM; the batch model is not open.
- make sure you have a list of all these YouTube meeting URLs somewhere
- ask your preferred coding assistant to write you up a script that downloads the audio for these videos with yt-dlp & calls Mistral's API (see the sketch after this list)
- ????
- profit
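A sketch of what the "????" step could look like, reusing the fields from the curl snippet above. Assumes yt-dlp is on PATH, MISTRAL_API_KEY is set, and the meeting_urls.txt filename is hypothetical:

```python
import os
import subprocess
import requests

API = "https://api.mistral.ai/v1/audio/transcriptions"
KEY = os.environ["MISTRAL_API_KEY"]

with open("meeting_urls.txt") as f:  # one YouTube URL per line
    urls = [line.strip() for line in f if line.strip()]

for i, url in enumerate(urls):
    # -x extracts audio; m4a matches the format used in the curl example.
    subprocess.run(
        ["yt-dlp", "-x", "--audio-format", "m4a",
         "-o", f"meeting_{i}.%(ext)s", url],
        check=True,
    )
    with open(f"meeting_{i}.m4a", "rb") as audio:
        resp = requests.post(
            API,
            headers={"Authorization": f"Bearer {KEY}"},
            files={"file": audio},
            data={
                "model": "voxtral-mini-latest",
                "diarize": "true",
                "timestamp_granularities": "segment",
            },
        )
    resp.raise_for_status()
    with open(f"meeting_{i}.json", "w") as out:
        out.write(resp.text)  # transcript JSON, one file per meeting
```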