Posted by petewarden 7 hours ago
[1]: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
I'm actually a little surprised they haven't added model size to that chart.
I can already tell this is going to be my go-to model and app on all my clients.
The built-in one is much faster, and you only have to toggle it on.
Are these really that much more accurate? I definitely have to correct things, but it's a pretty good experience.
I also use speech-to-text on my iPhone, which seems to have about the same accuracy.
edit: holy shit, Parakeet is good... Moonshine is impressive too, and at half the parameter count. Can it run on CPU, even an Apple M1? That would be a big advantage over Parakeet.
Now if only there were something just as quick as Parakeet v3 for TTS! Then I could talk to Codex all day long!
I was using AssemblyAI, but this is fast, accurate, and offline. wtf!
For voice agents, the painful failure mode is partials getting rewritten every few hundred ms. If you can share it, metrics like median first-token latency, real-time factor, and "% partial tokens revised after 1s / 3s" on noisy far-field audio would make comparisons much more actionable.
If those numbers look good, this seems very promising for local assistant pipelines.
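The "% partial tokens revised" metric suggested above can be measured with something like this sketch. The partial format (a list of timestamped transcripts) and the function name are my assumptions, not any engine's actual API:

```python
# Sketch: estimate how often a streaming ASR engine rewrites its partials.
# `partials` is a hypothetical list of (seconds_since_start, transcript)
# pairs, as a streaming engine might emit them; this is illustrative,
# not a real engine's output format.

def revision_rate(partials, horizon):
    """% of tokens emitted by `horizon` seconds that differ in the final transcript."""
    final_tokens = partials[-1][1].split()
    # Take the last partial emitted at or before the horizon.
    snapshot = None
    for t, text in partials:
        if t <= horizon:
            snapshot = text.split()
    if not snapshot:
        return 0.0  # nothing emitted yet, so nothing to revise
    revised = sum(
        1 for i, tok in enumerate(snapshot)
        if i >= len(final_tokens) or final_tokens[i] != tok
    )
    return 100.0 * revised / len(snapshot)

# Toy trace: early partials mishear "the" and "sat".
partials = [
    (0.5, "a cat"),
    (1.2, "a cat sad"),
    (2.5, "the cat sat down"),
]
# revision_rate(partials, 1.0) -> 50.0 (1 of 2 early tokens revised)
```

Reporting this at a couple of horizons (1s, 3s) alongside first-token latency would make the "partials keep getting rewritten" failure mode directly comparable across engines.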
I'd love a faster and more accurate option than Whisper, but streamers need something off-the-shelf they can install in their pipeline, like an OBS plugin that can just grab the audio from their OBS audio sources.
I see a couple of obvious problems: it doesn't seem to support translation, which is unfortunate, since that's pretty key for this use case. It also only supports one language at a time, which is a problem given how frequently streamers code-switch while talking to their chat in different languages or to their gameplay partners on Discord. Maybe such a plugin could detect which language is being spoken and route to the appropriate model as needed?
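The detect-and-route idea in that last sentence could be sketched like this. Everything here (the detector callback, the model names) is a placeholder of mine, not a real OBS or Moonshine API:

```python
# Sketch: run a cheap language-ID pass on each audio chunk, then hand
# the chunk to a per-language ASR model. `detect_language` is any
# callable that maps raw audio bytes to a language code; the model
# values here are just strings standing in for loaded models.

def route_chunk(chunk, detect_language, models, default="en"):
    """Pick the ASR model for one audio chunk based on detected language."""
    lang = detect_language(chunk)              # e.g. "en", "es", "fr", ...
    model = models.get(lang, models[default])  # fall back if no model for lang
    return lang, model

# Toy usage with a stub detector standing in for a real lang-ID model:
models = {"en": "moonshine-en", "es": "moonshine-es"}  # hypothetical names
lang, model = route_chunk(b"\x00" * 320, lambda c: "es", models)
# lang == "es", model == "moonshine-es"
```

In practice you'd want some hysteresis so a single misdetected chunk mid-sentence doesn't flip models back and forth, but the routing itself is this simple.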
There was an issue with a demo, but it's missing now. I can't recall for sure, but I think I got it working locally myself as well, and then it broke unexpectedly and I never figured out why.
The minimum useful data for this stuff is a small table of language | WER per dataset.
The authors do acknowledge this, though, and give a slightly over-complex way to do it with uv in an example project. (FYI, you don't need to source anything if you use uv run.)
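For anyone unfamiliar with uv: instead of creating and activating a venv by hand, you can let uv resolve an ephemeral environment per invocation. The package and script names below are illustrative, not the project's actual ones:

```shell
# The manual way the example project presumably wraps:
#   python -m venv .venv
#   source .venv/bin/activate
#   pip install some-asr-package
#   python transcribe.py audio.wav

# The uv way: no sourcing, environment resolved on the fly.
uv run --with some-asr-package python transcribe.py audio.wav
```

Inside a project with a pyproject.toml, plain `uv run python transcribe.py audio.wav` is enough; uv syncs the project environment automatically.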