OpenAI charges by the minute, so speed up your audio

Posted by georgemandis 5 days ago

OpenAI charges by the minute, so speed up your audio(george.mand.is)

735 points | 226 commentspage 2

mt_ 5 days ago|

You can just dump the youtube link video in Google AI studio and ask it to transcribe the video with speaker labels and even ask it it to add useful visual clues, because the model is multimodal for video too.

MaxDPS 5 days ago|

Can I ask what you mean by “useful visual clues”?

mt_ 5 days ago||

What is the speaker showcasing in its slides, what is it's body language and so on.

brendanfinan 5 days ago||

would this also work for my video consisting of 10,000 PDFs?

https://news.ycombinator.com/item?id=44125598

jasonjmcghee 5 days ago||

I can't tell if this is a meme or not.

And if someone had this idea and pitched it to Claude (the model this project was vibe coded with) it would be like "what a great idea!"

raincole 5 days ago||

Geez, that repo[0] has 8k stars on Github?

Are people just staring it for meme value or something? Is this a scam?

[0]: https://github.com/Olow304/memvid

stogot 5 days ago||

Love this idea but the accuracy section is lacking. Couldnt you do a simple diff of the outputs and see how many differences there are? .5% or 5%?

georgemandis 5 days ago|

Yeah, I'd like to do a more formal analysis of the outputs if I can carve out the time.

I don't think a simple diff is the way to go, at least for what I'm interested in. What I care about more is the overall accuracy of the summary—not the word-for-word transcription.

The test I want to setup is using LLMs to evaluate the summarized output and see if the primary themes/topics persist. That's more interesting and useful to me for this exercise.

pbbakkum 5 days ago||

This is great, thank you for sharing. I work on these APIs at OpenAI, it's a surprise to me that it still works reasonably well at 2/3x speed, but on the other hand for phone channels we get 8khz audio that is upsampled to 24khz for the model and it still works well. Note there's probably a measurable decrease in transcription accuracy that worsens as you deviate from 1x speed. Also we really need to support bigger/longer file uploads :)

georgemandis 5 days ago||

I kind of want to take a more proper poke at this but focus more one summarization accuracy over word-for-word accuracy, though I see the value in both.

I'm actually curious, if I run transcriptions back-to-back-to-back on the exact same audio, how much variance should I expect?

Maybe I'll try three approaches:

- A straight diff comparison (I know a lot of people are calling for this, but I really think this is less useful than it sounds)

- A "variance within the modal" test running it multiple times against the same audio, tracking how much it varies between runs

- An LLM analysis assessing if the primary points from a talk were captured and summarized at 1x, 2x, 3x, 4x runs (I think this is far more useful and interesting)

nerder92 5 days ago||

Quick Feedback: Would it be cool to research this internally and maybe find a sweet spot in speed multiplier where the loss is minimal. This pre-processing is quite cheap and could bring down the API price eventually.

conjecTech 5 days ago||

If you are hosting whisper yourself, you can do something slightly more elegant, but with the same effect. You can downsample/pool the context 2:1 (or potentially more) a few layers into the encoder. That allows you to do the equivalent of speeding up audio without worry about potential spectral losses. For whisper large v3, that gets you nearly double throughput in exchange for a relative ~4% WER increase.

nomercy400 5 days ago|

Do you have more details or examples on how to downsample the context in the encoder? I treat the encoder as an opaque block, so I have no idea where to start.

conjecTech 4 days ago||

It's a very simple change in a vanilla python implementation. The encoder is a set of attention blocks, and the length of the attention can be changed without changing the calculation at all.

Here(https://github.com/openai/whisper/blob/main/whisper/model.py...) is the relevant code in the whisper repo. You'd just need to change the for loop to an enumerate and subsample the context along its length at the point you want. I believe it would be:

for i, block in enumerate(self.blocks): x = block(x) if i==4: x = x[,,::2]

addaidirectory 2 days ago||

That's a clever idea. Their are alternatives to OpenAI for audio transcription. Check them out https://www.addaidirectory.com/categories/audio or scroll the home page https://www.addaidirectory.com for updates

pimlottc 5 days ago||

Appreciated the concise summary + code snippet upfront, followed by more detail and background for those interested. More articles should be written this way!

meerab 5 days ago||

Interesting approach to transcript generation!

I'm implementing a similar workflow for VideoToBe.com

My Current Pipeline:

Media Extraction - yt-dlp for reliable video/audio downloads Local Transcription - OpenAI Whisper running on my own hardware (no API costs) Storage & UI - Transcripts stored in S3 with a custom web interface for viewing

Y Combinator playlist https://videotobe.com/play/playlist/ycombinator

and Andrej's talk is https://videotobe.com/play/youtube/LCEmiRjPEtQ

After reading your blog post, I will be testing effect on speeding audio for locally-hosted Whisper models. Running Whisper locally eliminates the ongoing cost concerns since my infrastructure is already a sunk cost. Speeding audio could be an interesting performance enhancement to explore!

godot 4 days ago||

If you're already doing local ffmpeg stuff (i.e. pretty involved with code and scripting already) you're only a couple of steps more away from just downloading the openai-whisper models (or even the faster-whisper models which runs about two times faster). Since this looks like personal usage and not building production quality code, you can use AI (e.g. Cursor) to write a script to run the whisper model inference in seconds.

Then there is no cost at all to run any length of audio. (since cost seems to be the primary factor of this article)

On my m1 mac laptop it takes me about 30 seconds to run it on a 3-minute audio file. I'm guessing for a 40 minute talk it takes about 5-10 minutes to run.

Tepix 3 days ago|

Have you tried faster-whisper and whisper.cpp?

godot 1 day ago||

Yeah, my mentioned times are with faster-whisper, but I have not tried whisper.cpp. I just use a python script to run the model.

7speter 4 days ago|

So wait… is whisper transcription really all that slow locally on a M3 Macbook? It’s been a while since I used whispercpp, but I seem to remember it taking maybe 20 minutes on a comparatively slowpoke (and powerhungry) i5 12600k for maybe 40 minutes of audio; it might take less time on a faster m chip (maybe I’m imagining mobile apple silicon to be more performant than even desktop intel cpus), even less if there support built in for the built in gpu cores and other ai optimized silicon?

Did I miss that the task was time sensitive?

More comments...