Posted by MattHart88 14 hours ago
My 2021 Google Pixel 6, when offline, can transcribe speech to text and correct things contextually: it can make a mistake and, as I continue to speak, go back and fix something earlier in the sentence. What tech does Google have shoved in there that predates Whisper and Qwen by five years? And why do we now need 1 GB of transformers to do it on a more powerful platform?
Google mostly funded the training of this model around 10 years ago, and it's quite good.
There are many websites that are simple frontends for this model, which is built into WebKit- and Blink-based browsers. However, to my knowledge the model is a closed-source blob packed into the browsers, hence no Firefox support.
https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_...
I was actually on the OneNote team when they were transitioning to an online only transcription model because there was no one left to maintain the on device legacy system.
It wasn't any sort of planned technical direction, just a lack of anyone wanting to maintain the old system.
You had to go through some training exercises to tune it to your voice, but then it worked fairly well for transcription or even interacting with applications.
I've switched away from Gboard to Futo on Android and exclusively use MacWhisper on MacOS instead of the default Apple transcription model.
In English and Hebrew it stops after half a dozen words, and those words must be spoken slowly and mechanically for it to work at all. Russian and Arabic are right out - I can't coax any coherent sentence out of it.
I've gone through all permutations of the relevant settings, such as "Faster Voice Dictation" (translated from Hebrew; I don't know what the original English option is called). I think there used to be an option for online or offline transcription, but that option is gone now.
This is ridiculous - I tried to copy the version information and there is no way to copy it in-app. Let's try the S24 OCR feature...
17.0.10.880768217 release-arm64-v8a 175712590 Primary (en_GB) 2025090100 = latest version. Primary on-device: No packs. Fallback on-device: Packs: ru-RU: 200
I'll try to install the English, Hebrew, and Arabic packs, though I'm certain that I've installed them already.
As far as I understand, Apple's voice model runs locally for most languages.
Siri commands can be used for training, but they are also executed locally; that data is sent to Apple separately (and this can be disabled).
The latest open source local STT models people are running on devices are significantly more robust (e.g. whisper models, parakeet models, etc.). So background noise, mumbling, and/or just not having a perfect audio environment doesn't trip up the SoTA models as much (all of them still do get tripped up).
I work in voice AI and am using these models (both proprietary and local open source) every day. Night and day different for me.
It often gives the illusion of being very good, but I could record half an hour of myself speaking and discover some very random stuff in the middle that I did not say.
https://opensource.builders/alternatives/superwhisper
Just added Ghost Pepper. You can actually create a skill.md with the features you need to build your own.
It has all the usual features, plus you can add project specific vocabulary in your repo. It detects the working folder based on the active window, reads a WORDBIRD.md file in that folder and corrects terms accordingly.
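A minimal sketch of how such a per-folder vocabulary file could work, assuming a simple `misheard -> correct` line format (the real WORDBIRD.md format may well differ):

```python
import re

def parse_wordbird(text):
    """Parse a WORDBIRD.md-style vocabulary file.
    Assumed format: one `misheard -> correct` mapping per line;
    the app would read this from the active window's working folder."""
    vocab = {}
    for line in text.splitlines():
        if "->" in line:
            wrong, right = (p.strip() for p in line.split("->", 1))
            if wrong and right:
                vocab[wrong.lower()] = right
    return vocab

def correct(transcript, vocab):
    """Case-insensitive, whole-word replacement of misheard terms."""
    if not vocab:
        return transcript
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in vocab) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: vocab[m.group(1).lower()], transcript)

vocab = parse_wordbird("cue bernetes -> Kubernetes\npost gress -> PostgreSQL")
print(correct("Deploy post gress on cue bernetes", vocab))
# Deploy PostgreSQL on Kubernetes
```

The nice part of the design is that the vocabulary lives in the repo, so it travels with the project and gets code-reviewed like anything else.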
(My friend Till built it)
Wish they would do an ios version, but the creator already kind of dismissed it.
I didn't just dismiss it for no reason; I am a human! I have needs, and I can't stay sleeplessly in front of the computer putting out code. If I had more time I would, but alas.
Someone could easily vibe code an iOS version in a few hours. I could do the same but I do not have time to support it.
1: livestream transcript directly into the cursor in real time (just like native macOS dictation)
2: show realtime transcript live in an overlay (still has to paste when done, unlike #1, but can still read live while dictating)
1- localvoxtral, 2- FluidVoice (bumping it to 7 features on your list)
Like telling it to edit the text or remove a word.
The cherry on top: it’s completely broken! Enable the Context Awareness filter, the list shrinks. Now enable the Auto-pasting filter, the list grows back.
>“Compare” - This is the most important part. Apps in the most saturated categories (whisper dictation, clipboard managers, wallpaper apps, etc.) must clearly explain their differentiation from existing solutions.
https://www.reddit.com/r/macapps/comments/1r6d06r/new_post_r...
That whole list of requirements is actually a good set of questions that anyone who wants to make a new application should ask themselves.
It's remarkable how similar its performance is to Wispr Flow... and it runs locally...
I built one for cross platform — using parakeet mlx or faster whisper. :)
When I most recently abandoned it, the trigger word would fire one time in five.
windows (kotlin multi platform) => https://github.com/maceip/daydream
parakeet-tdt-0.6b-v2
hotword dict so no more "clawd" "dash" "dot com"
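A hotword dict like that is essentially a post-processing substitution table; a rough sketch (the entries and the spacing cleanup rule are illustrative, not the project's actual code):

```python
import re

# Illustrative entries: map mis-hearings and spoken symbols to the intended text.
HOTWORDS = {
    "clawd": "Claude",
    "dash": "-",
    "dot com": ".com",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in HOTWORDS) + r")\b",
    re.IGNORECASE,
)

def apply_hotwords(text):
    """Whole-word, case-insensitive substitution, plus a crude spacing
    fix so "clawd dot com" comes out as "Claude.com"."""
    out = _PATTERN.sub(lambda m: HOTWORDS[m.group(1).lower()], text)
    return re.sub(r"\s+\.", ".", out)  # join the stray space before a dot

print(apply_hotwords("email clawd at clawd dot com"))
# email Claude at Claude.com
```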
But I did it because I wanted it to work exactly the way I wanted it.
Also, for kicks, I (codex) ported it to Linux. But because my Linux laptop isn't as fast, I've had to use a few tricks to make it fast. https://github.com/obra/pepper-x
This is the unfortunate real face of open source. So many devs each making little sandcastles on their own, when if efforts were combined we could have had something truly solid and sustainable, instead of a litany of 90%-there apps each missing something or other, leaving people ending up using WisprFlow etc.
On Linux, there's access to the latest Cohere Transcribe model and it works very, very well. Requires a GPU though. Larger local models generally shouldn't require a subordinate model for clean up.
Have you compared WhisperKit to faster-whisper or similar? You might be able to run turbo v3 successfully and remove the need for cleanup.
Incidentally, waiting for Apple to blow this all up with native STT any day now. :)
But in this case I built hyprwhspr for Linux (Arch at first).
The goal was (is) the absolute best performance, in both accuracy & speed.
Python, via CUDA, on an NVIDIA GPU, is where that exists.
For example:
The #1 model on the Hugging Face ASR (automatic speech recognition) leaderboard is Cohere Transcribe, and it is not yet two weeks old.
The ecosystem choices allowed me to hook it up in a night.
Other hardware types also work great on Linux due to its adaptability.
In short, the local STT peak is Linux/Wayland.
If this needs NVIDIA GPU acceleration for good performance, it's not useful to me; I have Intel graphics, and Handy works fine.
That said: if Handy works, there's no need whatsoever to change.
Not sure how you're running it, via whichever "app thing", but...
On resource-limited machines, "Continuous recording" mode outputs whenever silence is detected, via a configurable threshold.
This outputs as you speak, in more reasonable chunks; in aggregate it's "the same output", just chunked efficiently.
Maybe you can try hackin' that up?
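That chunk-on-silence behavior can be approximated with a plain energy threshold (no real VAD); a sketch assuming 16-bit mono little-endian PCM frames, with made-up threshold numbers that would need tuning per mic and gain:

```python
import math
import struct

def rms(frame):
    """Root-mean-square energy of one frame of 16-bit little-endian PCM."""
    samples = struct.unpack("<%dh" % (len(frame) // 2), frame)
    return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))

def chunk_on_silence(frames, threshold=500.0, min_silent_frames=3):
    """Accumulate frames and emit a chunk once `min_silent_frames`
    consecutive quiet frames (RMS below `threshold`) are seen."""
    current, silent_run = [], 0
    for frame in frames:
        current.append(frame)
        if rms(frame) < threshold:
            silent_run += 1
            if silent_run >= min_silent_frames and len(current) > silent_run:
                yield b"".join(current[:-silent_run])  # drop trailing silence
                current, silent_run = [], 0
        else:
            silent_run = 0
    if any(rms(f) >= threshold for f in current):
        yield b"".join(current)  # flush whatever speech remains
```

Each yielded chunk goes to the model independently, so the transcript streams out in speech-sized pieces instead of one giant blob at the end.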
Have you ever considered using a foot-pedal for PTT?
Apple incidentally already has native STT, but for some reason they just don't use a decent model yet.
Apparently they do have a better model, they just haven't exposed it in their own OS yet!
https://developer.apple.com/documentation/speech/bringing-ad...
Wonder what's the hold up...
For footpedal:
Yes, conceptually it’s just another evdev-trigger source, assuming the pedal exposes usable key/button events.
Otherwise we’d bridge it into the existing external control interface. Either way, hooks are there. :)
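For what it's worth, the evdev side could look roughly like this (uses the python-evdev package; the device path and the press/release state machine are my guesses, not the project's actual code):

```python
# evdev key event values: 0 = release, 1 = press, 2 = autorepeat (ignored here).
KEY_UP, KEY_DOWN = 0, 1

def ptt_state_machine(values, start, stop):
    """Turn a stream of evdev key values from the pedal into
    start/stop-recording calls; returns whether we're still recording."""
    recording = False
    for value in values:
        if value == KEY_DOWN and not recording:
            recording = True
            start()
        elif value == KEY_UP and recording:
            recording = False
            stop()
    return recording

def main():
    # Needs real hardware plus `pip install evdev`; the path is a placeholder —
    # find yours with `evtest` or under /dev/input/by-id/.
    from evdev import InputDevice, ecodes
    dev = InputDevice("/dev/input/event0")
    values = (e.value for e in dev.read_loop() if e.type == ecodes.EV_KEY)
    ptt_state_machine(
        values,
        start=lambda: print("start recording"),
        stop=lambda: print("stop recording"),
    )
```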
Parakeet does both just fine.
Also, wish it were on nixpkgs, where at least it would be almost guaranteed to build forever =)
I have collected the best open-source voice typing tools categorized by platform in this awesome-style GitHub repo. Hope you all find this useful!
I regularly sit down and describe whatever I'm trying to do in detail: I speak my entire thought process out loud, the trade-offs I'm weighing, all the concerns, and any edge cases and patterns I have in mind. I often talk for 5 to 10 minutes, sometimes taking breaks in between to think things through.
I'm not doing it just for vibe coding; I'm using it for everything. Obviously for driving coding agents, but also in general for describing my thoughts while brainstorming, or for critique sessions with LLMs about my ideas. For everything, I just use dictation.
One other benefit for me personally: since I'm interacting with coding agents and LLMs again and again every day, I end up giving much more context and detail when speaking than when typing. Sometimes I feel a little too lazy to type one or two extra sentences, but while speaking I don't have that friction.
Other than that issue I like it.
1. For a Linux user, you can already build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem. From Windows or Mac, this FTP account could be accessed through built-in software.
2. It doesn't actually replace a USB drive. Most people I know e-mail files to themselves or host them somewhere online to be able to perform presentations, but they still carry a USB drive in case there are connectivity problems. This does not solve the connectivity issue.
3. It does not seem very "viral" or income-generating. I know this is premature at this point, but without charging users for the service, is it reasonable to expect to make money off of this?
I've been using Parakeet v3, which is fantastic (and tiny). Confused why we're still seeing Whisper out there; there's been a lot of development since.
It also comes in many flavours, from tiny to turbo, and so can fit many system profiles.
That's what makes it unique and hard to replace.
Also vibe coded a way to use parakeet from the same parakeet piper server on my grapheneos phone https://zach.codes/p/vibe-coding-a-wispr-clone-in-20-minutes
E.g., if your name is `Donold` (pronounced like Donald), there is not a transcription model in existence that will transcribe it correctly. That means forget ever inputting your name or email by voice; it will never come out right.
Combine that with any subtleties of speech you have, or industry jargon you frequently use, and you would have a much more useful tool.
We have a ton of options for "predict the most common word that matches this audio data" but I haven't found any "predict MY most common word" setups.
https://developers.openai.com/cookbook/examples/whisper_prom...
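The cookbook trick above boils down to passing your vocabulary as the decoder's initial prompt. A sketch using faster-whisper (model size, file name, and glossary terms are placeholders; both openai-whisper and faster-whisper accept `initial_prompt`):

```python
def vocab_prompt(terms):
    """Pack personal/jargon terms into a short conditioning prompt.
    Whisper only conditions on roughly the last 224 prompt tokens,
    so keep the list tight."""
    return "Glossary: " + ", ".join(terms) + "."

prompt = vocab_prompt(["Donold", "Parakeet", "hyprwhspr"])
print(prompt)
# Glossary: Donold, Parakeet, hyprwhspr.

# Actual transcription (commented out: needs model weights and an audio file):
# from faster_whisper import WhisperModel
# model = WhisperModel("small")
# segments, _ = model.transcribe("notes.wav", initial_prompt=prompt)
# print(" ".join(seg.text for seg in segments))
```

Prompting biases the decoder toward those spellings rather than guaranteeing them, so it pairs well with a post-processing substitution table for the stubborn cases.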
I'll give a shoutout as well to Glimpse: https://github.com/LegendarySpy/Glimpse
Extra bonus is that Handy lets you add an automatic LLM post-processor. This is very handy for the Parakeet V3 model, which can sometimes have issues where it repeats words or makes recognition errors, for example duplicating the recognition of a single word a dozen dozen dozen dozen dozen dozen dozen dozen times.
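Short of a full LLM pass, even a regex can catch the worst of that repetition failure mode (a deliberately blunt sketch; an LLM post-processor would also handle subtler errors):

```python
import re

# Matches a word followed by one or more repeats of itself
# (case-insensitive, whitespace-separated) and keeps the first occurrence.
_REPEATS = re.compile(r"\b(\w+)(?:\s+\1\b)+", re.IGNORECASE)

def collapse_repeats(text):
    """Collapse "dozen dozen dozen dozen" down to a single "dozen".
    Note: also flattens legitimate doubles like "that that"."""
    return _REPEATS.sub(lambda m: m.group(1), text)

print(collapse_repeats("a dozen dozen dozen dozen dozen times"))
# a dozen times
```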
Once in a while it will only output a literal space instead of the actual translation, but if I go into the 'history' page the translation is there for me to copy and paste manually. Maybe some pasting bug.
"You know what would be useful?" followed by asking your LLM of choice to implement it.
Then again for a lot of scenarios it's your slop or someone else's slop.
I think the only difference is that I keep my own slop tools private.