Posted by MattHart88 14 hours ago
My 2021 Google Pixel 6, when offline, can transcribe speech to text and correct things contextually: it can make a mistake and, as I continue to speak, go back and fix something earlier in the sentence. What tech does Google have shoved in there that predates Whisper and Qwen by five years? And why do we now need 1 GB of transformers to do it on a more powerful platform?
Google mostly funded the training of this model around 10 years ago, and it's quite good.
There are many websites that are simple frontends for this model, which is built into WebKit- and Blink-based browsers. However, to my knowledge the model is a closed-source blob packed into the browsers, hence no Firefox support.
https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_...
I was actually on the OneNote team when they were transitioning to an online only transcription model because there was no one left to maintain the on device legacy system.
It wasn't any sort of planned technical direction, just a lack of anyone wanting to maintain the old system.
You had to go through some training exercises to tune it to your voice, but then it worked fairly well for transcription or even interacting with applications.
I've switched away from Gboard to Futo on Android and exclusively use MacWhisper on MacOS instead of the default Apple transcription model.
In English and Hebrew it stops after half a dozen words, and those words must be spoken slowly and mechanically for it to work at all. Russian and Arabic are right out - I can't coax any coherent sentence out of it.
I've gone through all permutations of the relevant settings, such as "Faster Voice Dictation" (translated from Hebrew; I don't know what the original English option is called). I think there used to be an option for online or offline transcription, but that option is gone now.
This is ridiculous - I tried to copy the version information and there is no way to copy it in-app. Let's try the S24 OCR feature...
17.0.10.880768217 release-arm64-v8a 175712590 Primary (en_GB) 2025090100 = latest version. Primary on-device: No packs. Fallback on-device: Packs: ru-RU: 200
I'll try to install the English, Hebrew, and Arabic packs, though I'm certain that I've installed them already.
As far as I understand, Apple's voice model runs locally for most languages.
Siri commands can be used for training, but they are also executed locally; that data is sent to Apple separately (and this can be disabled).
The latest open source local STT models people are running on devices are significantly more robust (e.g. whisper models, parakeet models, etc.). So background noise, mumbling, and/or just not having a perfect audio environment doesn't trip up the SoTA models as much (all of them still do get tripped up).
I work in voice AI and am using these models (both proprietary and local open source) every day. Night and day different for me.
It often gives the illusion of being very good, but I could record half an hour of myself speaking and discover some very random stuff in the middle that I did not say.
https://opensource.builders/alternatives/superwhisper
Just added Ghost Pepper. You can actually create a skill.md with the features you need to build your own.
It has all the usual features, plus you can add project specific vocabulary in your repo. It detects the working folder based on the active window, reads a WORDBIRD.md file in that folder and corrects terms accordingly.
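A minimal sketch of how such a per-folder vocabulary file could work, assuming a simple `misheard -> correct` line format (the real WORDBIRD.md format may well differ):

```python
import re

def parse_wordbird(text):
    """Parse a WORDBIRD.md-style vocabulary file.
    Assumed format: one `misheard -> correct` mapping per line;
    the app would read this from the active window's working folder."""
    vocab = {}
    for line in text.splitlines():
        if "->" in line:
            wrong, right = (p.strip() for p in line.split("->", 1))
            if wrong and right:
                vocab[wrong.lower()] = right
    return vocab

def correct(transcript, vocab):
    """Case-insensitive, whole-word replacement of misheard terms."""
    if not vocab:
        return transcript
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in vocab) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: vocab[m.group(1).lower()], transcript)

vocab = parse_wordbird("cue bernetes -> Kubernetes\npost gress -> PostgreSQL")
print(correct("Deploy post gress on cue bernetes", vocab))
# Deploy PostgreSQL on Kubernetes
```

The nice part of the design is that the vocabulary lives in the repo, so it travels with the project and gets code-reviewed like anything else.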
(My friend Till built it)
Wish they would do an ios version, but the creator already kind of dismissed it.
I didn't just dismiss it for no reason; I am a human! I have needs, and I can't stay sleeplessly in front of the computer putting out code. If I had more time I would, but alas.
Someone could easily vibe code an iOS version in a few hours. I could do the same but I do not have time to support it.
1: livestream transcript directly into the cursor in real time (just like native macOS dictation)
2: show realtime transcript live in an overlay (still has to paste when done, unlike #1, but can still read live while dictating)
1- localvoxtral, 2- FluidVoice (bumping it to 7 features on your list)
Like telling it to edit the text or remove a word.
The cherry on top: it’s completely broken! Enable the Context Awareness filter, the list shrinks. Now enable the Auto-pasting filter, the list grows back.
>“Compare” - This is the most important part. Apps in the most saturated categories (whisper dictation, clipboard managers, wallpaper apps, etc.) must clearly explain their differentiation from existing solutions.
https://www.reddit.com/r/macapps/comments/1r6d06r/new_post_r...
That whole list of requirements is actually a good set of questions that anyone who wants to make a new application should ask themselves.
It's remarkable how similar its performance is to Wispr Flow... and it runs locally...
I built one for cross platform — using parakeet mlx or faster whisper. :)
When I most recently abandoned it, the trigger word would fire one time in five.
windows (kotlin multi platform) => https://github.com/maceip/daydream
parakeet-tdt-0.6b-v2
hotword dict so no more "clawd" "dash" "dot com"
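A hotword dict like that is essentially a post-processing substitution table; a rough sketch (the entries and the spacing cleanup rule are illustrative, not the project's actual code):

```python
import re

# Illustrative entries: map mis-hearings and spoken symbols to the intended text.
HOTWORDS = {
    "clawd": "Claude",
    "dash": "-",
    "dot com": ".com",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in HOTWORDS) + r")\b",
    re.IGNORECASE,
)

def apply_hotwords(text):
    """Whole-word, case-insensitive substitution, plus a crude spacing
    fix so "clawd dot com" comes out as "Claude.com"."""
    out = _PATTERN.sub(lambda m: HOTWORDS[m.group(1).lower()], text)
    return re.sub(r"\s+\.", ".", out)  # join the stray space before a dot

print(apply_hotwords("email clawd at clawd dot com"))
# email Claude at Claude.com
```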
But I did it because I wanted it to work exactly the way I wanted it.
Also, for kicks, I (codex) ported it to Linux. But because my Linux laptop isn't as fast, I've had to use a few tricks to make it fast. https://github.com/obra/pepper-x
This is the unfortunate real face of open source. So many devs each making little sandcastles on their own, when if efforts were combined we could have had something truly solid and sustainable, instead of a litany of 90%-there apps each missing something or other, leaving people ending up using WisprFlow etc.
On Linux, there's access to the latest Cohere Transcribe model and it works very, very well. Requires a GPU though. Larger local models generally shouldn't require a subordinate model for clean up.
Have you compared WhisperKit to faster-whisper or similar? You might be able to run turbo v3 successfully and remove the need for cleanup.
Incidentally, waiting for Apple to blow this all up with native STT any day now. :)
But in this case I built hyprwhspr for Linux (Arch at first).
The goal was (is) the absolute best performance, in both accuracy & speed.
Python, via CUDA, on an NVIDIA GPU, is where that exists.
For example:
The #1 model on the Hugging Face ASR (automatic speech recognition) leaderboard is Cohere Transcribe, and it is not yet two weeks old.
The ecosystem choices allowed me to hook it up in a night.
Other hardware types also work great on Linux due to its adaptability.
In short, the local STT peak is Linux/Wayland.
If this needs NVIDIA GPU acceleration for good performance, it's not useful to me; I have Intel graphics, and Handy works fine.
That said: if Handy works, there's no need whatsoever to change.
Not sure how you're running it, via whichever "app thing", but...
On resource-limited machines, "Continuous recording" mode outputs whenever silence is detected, via a configurable threshold.
This outputs as you speak, in more reasonable chunks; in aggregate it's "the same output", just chunked efficiently.
Maybe you can try hackin' that up?
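That chunk-on-silence behavior can be approximated with a plain energy threshold (no real VAD); a sketch assuming 16-bit mono little-endian PCM frames, with made-up threshold numbers that would need tuning per mic and gain:

```python
import math
import struct

def rms(frame):
    """Root-mean-square energy of one frame of 16-bit little-endian PCM."""
    samples = struct.unpack("<%dh" % (len(frame) // 2), frame)
    return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))

def chunk_on_silence(frames, threshold=500.0, min_silent_frames=3):
    """Accumulate frames and emit a chunk once `min_silent_frames`
    consecutive quiet frames (RMS below `threshold`) are seen."""
    current, silent_run = [], 0
    for frame in frames:
        current.append(frame)
        if rms(frame) < threshold:
            silent_run += 1
            if silent_run >= min_silent_frames and len(current) > silent_run:
                yield b"".join(current[:-silent_run])  # drop trailing silence
                current, silent_run = [], 0
        else:
            silent_run = 0
    if any(rms(f) >= threshold for f in current):
        yield b"".join(current)  # flush whatever speech remains
```

Each yielded chunk goes to the model independently, so the transcript streams out in speech-sized pieces instead of one giant blob at the end.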
Have you ever considered using a foot-pedal for PTT?
Apple incidentally already has native STT, but for some reason they just don't use a decent model yet.
Apparently they do have a better model, they just haven't exposed it in their own OS yet!
https://developer.apple.com/documentation/speech/bringing-ad...
Wonder what's the hold up...
For footpedal:
Yes, conceptually it’s just another evdev-trigger source, assuming the pedal exposes usable key/button events.
Otherwise we’d bridge it into the existing external control interface. Either way, hooks are there. :)
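For what it's worth, the evdev side could look roughly like this (uses the python-evdev package; the device path and the press/release state machine are my guesses, not the project's actual code):

```python
# evdev key event values: 0 = release, 1 = press, 2 = autorepeat (ignored here).
KEY_UP, KEY_DOWN = 0, 1

def ptt_state_machine(values, start, stop):
    """Turn a stream of evdev key values from the pedal into
    start/stop-recording calls; returns whether we're still recording."""
    recording = False
    for value in values:
        if value == KEY_DOWN and not recording:
            recording = True
            start()
        elif value == KEY_UP and recording:
            recording = False
            stop()
    return recording

def main():
    # Needs real hardware plus `pip install evdev`; the path is a placeholder —
    # find yours with `evtest` or under /dev/input/by-id/.
    from evdev import InputDevice, ecodes
    dev = InputDevice("/dev/input/event0")
    values = (e.value for e in dev.read_loop() if e.type == ecodes.EV_KEY)
    ptt_state_machine(
        values,
        start=lambda: print("start recording"),
        stop=lambda: print("stop recording"),
    )
```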
Parakeet does both just fine.
Also, wish it were on nixpkgs, where at least it would be almost guaranteed to build forever =)
I have collected the best open-source voice typing tools categorized by platform in this awesome-style GitHub repo. Hope you all find this useful!
I regularly sit down and describe whatever I'm trying to do in detail: I speak my entire thought process out loud, the trade-offs I'm weighing, all the concerns, and any edge cases and patterns I have in mind. I often talk for 5 to 10 minutes, sometimes taking breaks in between to think things through.
I'm not doing it just for vibe coding; I'm using it for everything. Obviously for driving coding agents, but also in general for describing my thoughts while brainstorming, or for critique sessions with LLMs about my ideas. For everything, I just use dictation.
One other benefit for me personally: since I'm interacting with coding agents and LLMs again and again every day, I end up giving much more context and detail when speaking than when typing. Sometimes I feel a little too lazy to type one or two extra sentences, but while speaking I don't have that friction.
Other than that issue I like it.
1. For a Linux user, you can already build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem. From Windows or Mac, this FTP account could be accessed through built-in software.
2. It doesn't actually replace a USB drive. Most people I know e-mail files to themselves or host them somewhere online to be able to perform presentations, but they still carry a USB drive in case there are connectivity problems. This does not solve the connectivity issue.
3. It does not seem very "viral" or income-generating. I know this is premature at this point, but without charging users for the service, is it reasonable to expect to make money off of this?
I've been using Parakeet v3, which is fantastic (and tiny). Confused why we're still seeing Whisper out there; there's been a lot of development since.
It also comes in many flavours, from tiny to turbo, and so can fit many system profiles.
That's what makes it unique and hard to replace.
Also vibe coded a way to use parakeet from the same parakeet piper server on my grapheneos phone https://zach.codes/p/vibe-coding-a-wispr-clone-in-20-minutes
E.g., if your name is `Donold` (pronounced like Donald), there is not a transcription model in existence that will transcribe it correctly. That means forget ever inputting your name or email by voice; it will never come out right.
Combine that with any subtleties of speech you have, or industry jargon you frequently use, and you would have a much more useful tool.
We have a ton of options for "predict the most common word that matches this audio data" but I haven't found any "predict MY most common word" setups.
https://developers.openai.com/cookbook/examples/whisper_prom...
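The cookbook trick above boils down to passing your vocabulary as the decoder's initial prompt. A sketch using faster-whisper (model size, file name, and glossary terms are placeholders; both openai-whisper and faster-whisper accept `initial_prompt`):

```python
def vocab_prompt(terms):
    """Pack personal/jargon terms into a short conditioning prompt.
    Whisper only conditions on roughly the last 224 prompt tokens,
    so keep the list tight."""
    return "Glossary: " + ", ".join(terms) + "."

prompt = vocab_prompt(["Donold", "Parakeet", "hyprwhspr"])
print(prompt)
# Glossary: Donold, Parakeet, hyprwhspr.

# Actual transcription (commented out: needs model weights and an audio file):
# from faster_whisper import WhisperModel
# model = WhisperModel("small")
# segments, _ = model.transcribe("notes.wav", initial_prompt=prompt)
# print(" ".join(seg.text for seg in segments))
```

Prompting biases the decoder toward those spellings rather than guaranteeing them, so it pairs well with a post-processing substitution table for the stubborn cases.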
I'll give a shoutout as well to Glimpse: https://github.com/LegendarySpy/Glimpse
Extra bonus is that Handy lets you add an automatic LLM post-processor. This is very handy for the Parakeet V3 model, which can sometimes have issues where it repeats words or makes recognition errors, for example duplicating the recognition of a single word a dozen dozen dozen dozen dozen dozen dozen dozen times.
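Short of a full LLM pass, even a regex can catch the worst of that repetition failure mode (a deliberately blunt sketch; an LLM post-processor would also handle subtler errors):

```python
import re

# Matches a word followed by one or more repeats of itself
# (case-insensitive, whitespace-separated) and keeps the first occurrence.
_REPEATS = re.compile(r"\b(\w+)(?:\s+\1\b)+", re.IGNORECASE)

def collapse_repeats(text):
    """Collapse "dozen dozen dozen dozen" down to a single "dozen".
    Note: also flattens legitimate doubles like "that that"."""
    return _REPEATS.sub(lambda m: m.group(1), text)

print(collapse_repeats("a dozen dozen dozen dozen dozen times"))
# a dozen times
```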
Once in a while it will only output a literal space instead of the actual translation, but if I go into the 'history' page the translation is there for me to copy and paste manually. Maybe some pasting bug.
"You know what would be useful?" followed by asking your LLM of choice to implement it.
Then again for a lot of scenarios it's your slop or someone else's slop.
I think the only difference is that I keep my own slop tools private.