
Posted by MattHart88 19 hours ago

Show HN: Ghost Pepper – Local hold-to-talk speech-to-text for macOS (github.com)
I built this because I wanted to see how far I could get with a voice-to-text app that used 100% local models, so no data left my computer. I've been using it a ton for coding and emails. Experimenting with using it as a voice interface for my other agents too. 100% open-source (MIT license) — would love feedback, PRs, and ideas on where to take it.
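The hold-to-talk flow described above (record while a key is held, transcribe locally on release, then type the result into the focused app) can be sketched roughly like this. This is an illustrative sketch, not Ghost Pepper's actual code; the transcriber and injector here are stand-ins for a local speech model and a keystroke-injection layer:

```python
class HoldToTalk:
    """Minimal hold-to-talk loop: buffer audio while the hotkey is
    held, transcribe on release, hand the text to an injector."""

    def __init__(self, transcriber, injector):
        self.transcriber = transcriber  # audio bytes -> text (local model)
        self.injector = injector        # text -> None (types into focused app)
        self._chunks = []
        self._recording = False

    def on_key_down(self):
        # Hotkey pressed: start a fresh recording.
        self._chunks = []
        self._recording = True

    def on_audio_chunk(self, chunk: bytes):
        # Audio callback from the mic; only buffer while the key is held.
        if self._recording:
            self._chunks.append(chunk)

    def on_key_up(self):
        # Hotkey released: transcribe everything captured, locally.
        self._recording = False
        audio = b"".join(self._chunks)
        text = self.transcriber(audio)
        if text:
            self.injector(text)

# Usage with stand-in components:
typed = []
app = HoldToTalk(transcriber=lambda audio: "hello world" if audio else "",
                 injector=typed.append)
app.on_key_down()
app.on_audio_chunk(b"\x00\x01")
app.on_key_up()
print(typed)  # ['hello world']
```

In a real app the key events would come from a global hotkey listener and the injector would post synthetic keystrokes; the point of the structure is that nothing in the path needs a network call.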
415 points | 185 comments
janalsncm 16 hours ago|
I think the jab at the bottom of the README is referring to Wispr Flow?

https://wisprflow.ai/new-funding

jannniii 10 hours ago||
Oh dear, why does it not use Apple's on-device models for cleanup? No model download necessary…
tito 16 hours ago||
This is great. I'm typing this message now using Ghost Pepper. What benefits have you seen from the OCR screen sharing step?
maxmorrish 13 hours ago||
love seeing more local-first tools like this. feels like there's been a real shift since the codebeautify breach last year; people are actually thinking about where their data goes now. nice work on keeping it all on device
Supercompressor 17 hours ago||
I've been looking for the opposite - wanting to dump text and it be read to me, coherently. Anyone have good recommendations?
realityfactchex 17 hours ago|
Sure, Chatterbox TTS Server is rather high-quality: https://github.com/devnen/Chatterbox-TTS-Server

You could hook it up to some workflow over the local API depending on how you want to dump the text, but the web UI is good too.

The Show HN by the author was at: https://news.ycombinator.com/item?id=44145564

Supercompressor 17 hours ago||
Appreciated - thank you.
guzik 18 hours ago||
Sadly the app doesn't work. There is no popup asking for microphone permission.

EDIT: I see there is an open issue for that on GitHub.

ttul 17 hours ago|
And many people are mailing in Codex- and Claude Code-generated PRs, myself included. Fingers crossed, I suppose.
MattHart88 17 hours ago||
Thanks to everyone who submitted PRs! The fix is merged and a new version is up.
pmarreck 14 hours ago||
How does this compare with Superwhisper, which is otherwise excellent but not cheap?
imazio 12 hours ago||
is this the support group for people building speech-to-text apps?

I built https://yakki.ai

No regrets so far! XP

aristech 18 hours ago||
Great job. What about supported languages? Do system languages get recognised?
MattHart88 17 hours ago|
Thanks! We currently have two multilingual options available:

- Whisper small (~466 MB, supports many languages)

- Parakeet v3 (~1.4 GB, supports 25 languages via FluidAudio)
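Choosing between options like these usually comes down to footprint versus language coverage. A sketch of that trade-off as a tiny picker; the language sets below are illustrative subsets, not the models' real lists, and the registry itself is hypothetical:

```python
# Hypothetical registry of the two options above.
# Sizes are from the comment; language sets are illustrative subsets only.
MODELS = {
    "whisper-small": {"size_mb": 466,  "languages": {"en", "de", "fr", "es", "ja"}},
    "parakeet-v3":   {"size_mb": 1400, "languages": {"en", "de", "fr", "es"}},
}

def pick_model(language: str) -> str:
    """Return the smallest registered model that supports `language`."""
    candidates = [name for name, m in MODELS.items()
                  if language in m["languages"]]
    if not candidates:
        raise ValueError(f"no local model supports {language!r}")
    return min(candidates, key=lambda name: MODELS[name]["size_mb"])

print(pick_model("de"))  # whisper-small (smaller of the two that support it)
```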
gegtik 17 hours ago|
how does this compare to macOS's built-in Siri TTS, in quality and in privacy?
realityfactchex 17 hours ago|
Exactly my question. I double-tap the control button and macOS does native, local TTS dictation pretty well. (Similar to Keyboard > Enable Dictation setting on iOS.)

The macOS built-in TTS (dictation) seems better than all the 3rd party, local apps I tried in the past that people raved about. I have tried several.

Is this better somehow?

If the 3rd-party apps did streaming, typing in place and then making corrections within a reasonable window once more context lets them understand things better, that would be cool. Theoretically, a custom model or UX could be "better" than what comes free built into macOS (more accurate or more customizable).

But when I contacted the developer of my favorite one they said that would be pretty hard to implement due to having to go back and make corrections in the active field, etc.
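The diff step of that correction problem is small on its own: find the common prefix between what was already typed and the revised hypothesis, delete back to it, then type the new tail. (The hard part the developer likely meant is injecting those edits into an arbitrary focused field, which this sketch doesn't attempt.) A rough, hypothetical sketch:

```python
def correction_ops(emitted: str, revised: str):
    """Compute the suffix rewrite turning already-typed text into a
    revised hypothesis: (number of backspaces, text to type)."""
    # Length of the shared prefix of the two strings.
    i = 0
    while i < min(len(emitted), len(revised)) and emitted[i] == revised[i]:
        i += 1
    backspaces = len(emitted) - i   # delete everything past the prefix
    to_type = revised[i:]           # then type the new tail
    return backspaces, to_type

# A later, better-contextualized hypothesis fixes an earlier misrecognition:
print(correction_ops("there going", "they're going"))  # (8, "y're going")
```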

I assume streaming STT in these utilities for Mac will get better at some point, but I haven't seen it yet (been waiting). It seems these tools generally are not streaming, e.g. they want you to finish speaking first before showing you anything. Which doesn't work for me when I'm dictating. I want to see what I've been saying lately, to jog my memory about what I've just said and help guide the next thing I'm about to say. I certainly don't want to split my attention by manually toggling the control (whether PTT or not) periodically to indicate "ok, you can render what I just said now".

I guess "hold-to-talk" tools are for delivering discrete, fully formed messages, not for longer, running dictation.

AFAICT, TFA is focused on hold-to-talk as the differentiator, over double-tap to begin speaking and double-tap to end speaking?

realityfactchex 11 hours ago||
s/TTS/STT/