Posted by georgemandis 6/25/2025

OpenAI charges by the minute, so speed up your audio (george.mand.is)
740 points | 228 comments | page 2
conjecTech 6/26/2025|
If you are hosting Whisper yourself, you can do something slightly more elegant with the same effect. You can downsample/pool the context 2:1 (or potentially more) a few layers into the encoder. That lets you do the equivalent of speeding up the audio without worrying about potential spectral losses. For Whisper large v3, that gets you nearly double the throughput in exchange for a relative ~4% WER increase.
nomercy400 6/26/2025|
Do you have more details or examples on how to downsample the context in the encoder? I treat the encoder as an opaque block, so I have no idea where to start.
conjecTech 6/27/2025||
It's a very simple change in a vanilla Python implementation. The encoder is a stack of attention blocks, and the length of the context they attend over can be changed without changing the calculation at all.

Here (https://github.com/openai/whisper/blob/main/whisper/model.py...) is the relevant code in the whisper repo. You'd just need to change the for loop to an enumerate and subsample the context along its length at the point you want. I believe it would be:

    for i, block in enumerate(self.blocks):
        x = block(x)
        if i == 4:
            # halve the context length a few layers into the encoder
            x = x[:, ::2]
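To sketch the "pool" variant mentioned above (this is not something from the whisper repo, just an alternative to strided slicing): average adjacent positions instead of dropping every other one. It assumes the context length is even at that layer, which holds for Whisper's 1500-frame encoder.

    for i, block in enumerate(self.blocks):
        x = block(x)
        if i == 4:
            # mean-pool pairs of neighboring positions for 2:1 pooling
            x = 0.5 * (x[:, 0::2] + x[:, 1::2])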

mt_ 6/25/2025||
You can just dump the YouTube video link into Google AI Studio and ask it to transcribe the video with speaker labels, and even ask it to add useful visual clues, because the model is multimodal for video too.
MaxDPS 6/26/2025|
Can I ask what you mean by “useful visual clues”?
mt_ 6/26/2025||
What the speaker is showcasing in their slides, what their body language is, and so on.
brendanfinan 6/25/2025||
would this also work for my video consisting of 10,000 PDFs?

https://news.ycombinator.com/item?id=44125598

jasonjmcghee 6/25/2025||
I can't tell if this is a meme or not.

And if someone had this idea and pitched it to Claude (the model this project was vibe coded with) it would be like "what a great idea!"

raincole 6/26/2025||
Geez, that repo[0] has 8k stars on GitHub?

Are people just starring it for meme value or something? Is this a scam?

[0]: https://github.com/Olow304/memvid

stogot 6/25/2025||
Love this idea, but the accuracy section is lacking. Couldn't you do a simple diff of the outputs and see how many differences there are? 0.5% or 5%?
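A rough version of that check, for what it's worth (a sketch assuming the two transcripts are saved as plain text; the file names are made up):

    import difflib

    # Hypothetical transcript files from the 1x and sped-up runs
    words_a = open("transcript_1x.txt").read().split()
    words_b = open("transcript_3x.txt").read().split()

    # ratio() is the fraction of matching words between the two sequences,
    # so 1 - ratio() is a crude word-level difference rate
    ratio = difflib.SequenceMatcher(None, words_a, words_b).ratio()
    print(f"~{(1 - ratio) * 100:.1f}% of words differ")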
georgemandis 6/25/2025|
Yeah, I'd like to do a more formal analysis of the outputs if I can carve out the time.

I don't think a simple diff is the way to go, at least for what I'm interested in. What I care about more is the overall accuracy of the summary—not the word-for-word transcription.

The test I want to set up is using LLMs to evaluate the summarized output and see if the primary themes/topics persist. That's more interesting and useful to me for this exercise.

pbbakkum 6/25/2025||
This is great, thank you for sharing. I work on these APIs at OpenAI; it's a surprise to me that it still works reasonably well at 2x/3x speed, but on the other hand, for phone channels we get 8kHz audio that is upsampled to 24kHz for the model and it still works well. Note there's probably a measurable decrease in transcription accuracy that worsens as you deviate from 1x speed. Also, we really need to support bigger/longer file uploads :)
georgemandis 6/26/2025||
I kind of want to take a more proper poke at this but focus more on summarization accuracy over word-for-word accuracy, though I see the value in both.

I'm actually curious, if I run transcriptions back-to-back-to-back on the exact same audio, how much variance should I expect?

Maybe I'll try three approaches:

- A straight diff comparison (I know a lot of people are calling for this, but I really think this is less useful than it sounds)

- A "variance within the modal" test running it multiple times against the same audio, tracking how much it varies between runs

- An LLM analysis assessing whether the primary points from a talk were captured and summarized across 1x, 2x, 3x, and 4x runs (I think this is far more useful and interesting; a rough sketch follows below)
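A minimal sketch of what that third test could look like (assuming the OpenAI Python client; the judge model, prompt, and file names are all placeholders):

    from openai import OpenAI

    client = OpenAI()

    summary_1x = open("summary_1x.txt").read()  # hypothetical summary files
    summary_3x = open("summary_3x.txt").read()

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model would do
        messages=[
            {"role": "system", "content": "You compare two summaries of the same talk."},
            {
                "role": "user",
                "content": "List the primary themes present in summary A but missing from "
                "summary B, then score 0-10 how well B preserves A's main points.\n\n"
                f"Summary A (1x audio):\n{summary_1x}\n\nSummary B (3x audio):\n{summary_3x}",
            },
        ],
    )
    print(response.choices[0].message.content)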

nerder92 6/25/2025||
Quick feedback: it would be cool to research this internally and maybe find a sweet spot in the speed multiplier where the loss is minimal. This pre-processing is quite cheap and could eventually bring down the API price.
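One way to run that kind of sweep is ffmpeg's atempo audio filter; a sketch (the atempo approach and file names are assumptions, and older ffmpeg builds cap atempo at 2.0 per instance, hence the chaining):

    import subprocess

    def speed_up(src: str, dst: str, factor: float) -> None:
        """Re-encode src at `factor` times normal speed with ffmpeg's atempo filter."""
        stages, remaining = [], factor
        while remaining > 2.0:  # chain 2.0x stages until the leftover factor fits
            stages.append("atempo=2.0")
            remaining /= 2.0
        stages.append(f"atempo={remaining}")
        subprocess.run(
            ["ffmpeg", "-y", "-i", src, "-filter:a", ",".join(stages), dst],
            check=True,
        )

    # Produce candidate speeds to transcribe and compare for the sweet spot
    for factor in (1.5, 2.0, 2.5, 3.0, 4.0):
        speed_up("talk.mp3", f"talk_{factor}x.mp3", factor)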
pimlottc 6/25/2025||
Appreciated the concise summary + code snippet upfront, followed by more detail and background for those interested. More articles should be written this way!
meerab 6/26/2025||
Interesting approach to transcript generation!

I'm implementing a similar workflow for VideoToBe.com

My Current Pipeline:

Media Extraction - yt-dlp for reliable video/audio downloads
Local Transcription - OpenAI Whisper running on my own hardware (no API costs)
Storage & UI - Transcripts stored in S3 with a custom web interface for viewing
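A condensed sketch of that pipeline (assuming yt-dlp on the PATH plus the openai-whisper and boto3 packages; the model size, bucket, and file names are placeholders):

    import subprocess
    import boto3
    import whisper

    url = "https://www.youtube.com/watch?v=LCEmiRjPEtQ"  # Andrej's talk, as an example

    # 1. Media extraction: pull just the audio track with yt-dlp
    subprocess.run(
        ["yt-dlp", "-x", "--audio-format", "mp3", "-o", "talk.%(ext)s", url],
        check=True,
    )

    # 2. Local transcription: run Whisper on local hardware, no API calls
    model = whisper.load_model("small")
    result = model.transcribe("talk.mp3")

    # 3. Storage: write the transcript and push it to S3 for the web UI
    with open("talk.txt", "w") as f:
        f.write(result["text"])
    boto3.client("s3").upload_file("talk.txt", "my-transcripts-bucket", "talk.txt")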

Y Combinator playlist https://videotobe.com/play/playlist/ycombinator

and Andrej's talk is https://videotobe.com/play/youtube/LCEmiRjPEtQ

After reading your blog post, I will be testing the effect of speeding up audio for locally hosted Whisper models. Running Whisper locally eliminates the ongoing cost concerns, since my infrastructure is already a sunk cost. Speeding up the audio could be an interesting performance enhancement to explore!

karpathy 6/25/2025||
Omg long post. TLDR from an LLM for anyone interested

Speed your audio up 2–3× with ffmpeg before sending it to OpenAI’s gpt-4o-transcribe: the shorter file uses fewer input-tokens, cuts costs by roughly a third, and processes faster with little quality loss (4× is too fast). A sample yt-dlp → ffmpeg → curl script shows the workflow.

;)

georgemandis 6/25/2025||
Hahaha. Okay, okay... I will watch it now ;)

(Thanks for your good sense of humor)

karpathy 6/25/2025||
I like that your post deliberately gets to the point first and then (optionally) expands later, I think it's a good and generally underutilized format. I often advise people to structure their emails in the same way, e.g. first just cutting to the chase with the specific ask, then giving more context optionally below.

It's not my intention to bloat information or delivery, but I also don't super know how to follow this format, especially in this kind of talk, because it's not so much about relaying specific information (like your final script here), but more a collection of prompts back to the audience, things to think about.

My companion tweet to this video on X had a brief TLDR/summary included where I tried, but I didn't super think it was very reflective of the talk; it was more about the topics covered.

Anyway, I am overall a big fan of doing more compute at the "creation time" to compress other people's time during "consumption time" and I think it's the respectful and kind thing to do.

georgemandis 6/25/2025|||
I watched your talk. There are so many more interesting ideas in there that resonated with me that the summary (unsurprisingly) skipped over. I'm glad I watched it!

LLMs as the operating system, the way you interface with vibe-coding (smaller chunks) and the idea that maybe we haven't found the "GUI for AI" yet are all things I've pondered and discussed with people. You articulated them well.

I think some formats, like a talk, don't lend themselves easily to meaningful summaries. It's about giving the audience things to think about, to your point. It's storytelling whose whole is more than the sum of its parts, and that's why we still do it.

My post is, at the end of the day, really more about a neat trick to optimize transcriptions. This particular video might be a great example of why you may not always want to do that :)

Anyway, thanks for the time and thanks for the talk!

mh- 6/25/2025|||
> I often advise people to structure their emails [..]

I frequently do the same, and eventually someone sent me this HBR article summarizing the concept nicely as "bottom line up front". It's a good primer for those interested.

https://hbr.org/2016/11/how-to-write-email-with-military-pre...

bravesoul2 6/25/2025|||
This is the sort of content I want to see in Tweets and LinkedIn posts.

I have been thinking for a while about how to make good use of the short space in those places.

LLM did well here.

lordspace 6/26/2025||
that's a really good summary :)
godot 6/26/2025||
If you're already doing local ffmpeg stuff (i.e. pretty involved with code and scripting already), you're only a couple more steps away from just downloading the openai-whisper models (or even the faster-whisper models, which run about two times faster). Since this looks like personal usage and not production-quality code, you can use AI (e.g. Cursor) to write a script to run the whisper model inference in seconds.

Then there is no cost at all to run any length of audio (since cost seems to be the primary factor in this article).

On my m1 mac laptop it takes me about 30 seconds to run it on a 3-minute audio file. I'm guessing for a 40 minute talk it takes about 5-10 minutes to run.

Tepix 6/28/2025|
Have you tried faster-whisper and whisper.cpp?
godot 6/29/2025||
Yeah, the times I mentioned are with faster-whisper, but I have not tried whisper.cpp. I just use a Python script to run the model.
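For reference, a minimal sketch of such a script (assuming the faster-whisper package; the model size and file name are placeholders):

    from faster_whisper import WhisperModel

    # "small" with int8 quantization is an arbitrary speed/accuracy trade-off
    model = WhisperModel("small", device="auto", compute_type="int8")

    segments, info = model.transcribe("talk.mp3")
    print(f"Detected language: {info.language}")
    for segment in segments:
        print(f"[{segment.start:.1f}s -> {segment.end:.1f}s] {segment.text}")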
55555 6/25/2025|
This seems like a good place for me to complain about the fact that the automatically generated subtitle files Youtube creates are horribly malformed. Every sentence is repeated twice. In many subtitle files, the subtitle timestamp ranges overlap one another while also repeating every sentence twice in two different ranges. It's absolutely bizarre and has been like this for years or possibly forever. Here's an example - I apologize that it's not in English. I don't know if this issue affects English. https://pastebin.com/raw/LTBps80F
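If you just need readable text out of one of those files, a crude cleanup sketch that drops the doubled cue texts (the file name is a placeholder and the heuristic is rough; it ignores timing entirely):

    # Collapse consecutive duplicate cue texts in an auto-generated .vtt
    lines = open("subtitles.auto.vtt", encoding="utf-8").read().splitlines()

    cleaned, last_text = [], None
    for line in lines:
        text = line.strip()
        # Skip blanks, timestamp ranges, and header lines; keep only spoken text
        if not text or "-->" in text or text.startswith(("WEBVTT", "Kind:", "Language:")):
            continue
        if text != last_text:  # drop the repeated copy of each sentence
            cleaned.append(text)
            last_text = text

    print("\n".join(cleaned))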
xenator 6/26/2025|
Seems like Thai. Thai translation and recognition is like 10 years behind compared to the other languages I'm dealing with in my everyday life. The good news, though, is that Russian was at the same level years ago, and now it is near perfect.
55555 6/26/2025||
Well the weird thing is honestly their speech to text recognizes 97% of words correctly. The subtitle content is pretty perfect. It’s just the formatting that’s awful.