Posted by georgemandis 6/25/2025

OpenAI charges by the minute, so speed up your audio(george.mand.is)
740 points | 228 comments | page 3
dajonker 6/26/2025|
Gemini 2.5 Pro is, in my usage, clearly superior for high-quality transcriptions of phone calls (in Dutch, in my case). As long as you upload the audio to GCS, you can easily process conversations of over an hour. It correctly identified and labeled speakers.

The cheaper 2.5 Flash made noticeably more mistakes. For example, it didn't output numbers correctly, while the Pro model did.

As for OpenAI, their gpt-4o-transcribe model did worse than 2.5 Flash, completely messing up names of places and/or people. Plus, it doesn't split the conversation into speaker turns; it just outputs a single continuous piece of text.

7speter 6/26/2025||
So wait… is Whisper transcription really all that slow locally on an M3 MacBook? It’s been a while since I used whisper.cpp, but I seem to remember it taking maybe 20 minutes on a comparatively slowpoke (and power-hungry) i5-12600K for maybe 40 minutes of audio. It might take less time on a faster M-series chip (maybe I’m imagining mobile Apple silicon to be more performant than even desktop Intel CPUs), and even less if there’s built-in support for the integrated GPU cores and other AI-optimized silicon?

Did I miss that the task was time sensitive?

addaidirectory 6/28/2025||
That's a clever idea. There are alternatives to OpenAI for audio transcription. Check them out at https://www.addaidirectory.com/categories/audio or scroll the home page https://www.addaidirectory.com for updates.
mushishi 6/26/2025||
Do the APIs support simultaneous voice transcription in a way that different voices are tagged? (either in text or as metadata)

If so: could you split the audio file in two, pitch-shift the latter half by, say, an octave, and mix the two halves together to get a shorter audio file? Then transcribe it and join the two transcripts back into linear form, with the speaker tags removed. (You could insert some prerecorded voice to know at which point the second voice starts.) If a pitch change is not enough, maybe manipulate the formants as well.

KTibow 6/25/2025||
This is really interesting, although the cheapest route is still to use an alternative audio-compatible LLM (Gemini 2.0 Flash Lite, Phi 4 Multimodal) or an alternative host for Whisper (Deepinfra, Fal).
fallinditch 6/25/2025||
When extracting transcripts from YouTube videos, can anyone give advice on the best (cost effective, quick, accurate) way to do this?

I'm confused because I read in various places that the YouTube API doesn't provide access to transcripts ... so how do all these YouTube transcript extractor services do it?

I want to build my own YouTube summarizer app. Any advice and info on this topic greatly appreciated!

rob 6/25/2025||
There's a tool that uses YouTube's unofficial APIs to get them if they're available:

https://github.com/jdepoix/youtube-transcript-api

For our internal tool that transcribes local city council meetings on YouTube (often 1-3 hours long), we found that these automatic ones were never available though.

(Our tool usually 'processes' the videos within ~5-30 mins of being uploaded, so that's also why none are probably available 'officially' yet.)

So we use yt-dlp to download the highest-quality audio and then process it with Whisper via Groq, which is way cheaper (~$0.02-0.04/hr with Groq compared to $0.36/hr via OpenAI's API). Sometimes Groq errors out, so there's built-in support for Replicate and Deepgram as fallbacks.
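
To make the price gap concrete, here is a minimal Python sketch of the arithmetic. It assumes billing is strictly proportional to audio duration and uses only the hourly rates quoted above. The function name `transcription_cost` and the `speedup` parameter are made up for illustration and are not part of any provider's API.

```python
# Hourly rates quoted in this thread (USD per hour of audio).
OPENAI_RATE = 0.36
GROQ_RATE_LOW, GROQ_RATE_HIGH = 0.02, 0.04


def transcription_cost(audio_minutes, rate_per_hour, speedup=1.0):
    """Estimate the cost in USD of transcribing audio billed by duration.

    speedup is a hypothetical playback-speed factor applied before upload:
    2.0 means the uploaded file is half as long as the original.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    billed_minutes = audio_minutes / speedup
    return billed_minutes / 60 * rate_per_hour


if __name__ == "__main__":
    meeting = 120  # a two-hour council meeting, in minutes
    print(f"OpenAI, 1x:  ${transcription_cost(meeting, OPENAI_RATE):.2f}")
    print(f"OpenAI, 2x:  ${transcription_cost(meeting, OPENAI_RATE, 2.0):.2f}")
    print(f"Groq (low):  ${transcription_cost(meeting, GROQ_RATE_LOW):.2f}")
    print(f"Groq (high): ${transcription_cost(meeting, GROQ_RATE_HIGH):.2f}")
```

Even with the audio sped up 2x, the OpenAI figure stays well above the Groq range under these assumptions.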

We run yt-dlp on our remote Linode server, and I have a Python script I wrote that automatically logs in to YouTube with a "clean" account and extracts the proper cookies.txt file. We also generate a 'po token' using another tool:

https://github.com/iv-org/youtube-trusted-session-generator

Both cookies.txt and the "po token" get passed to yt-dlp when running on the Linode server and I haven't had to re-generate anything in over a month. Runs smoothly every day.

(Note that I don't use cookies/po_token when running locally at home, it usually works fine there.)

fallinditch 6/25/2025||
Very useful, thanks. So does this mean that every month or so you have to create a new 'clean' YouTube account and use that to create new po_token/cookies?

It's frustrating to have to jump through all these hoops just to extract transcripts when the YouTube Data API already gives reasonable limits to free API calls ... would be nice if they allowed transcripts too.

Do you think the various YouTube transcript extractor services all follow a similar method as yours?

banana_giraffe 6/25/2025|||
You can use yt-dlp to get the transcripts. For instance, to grab just the transcript of a video:

    ./yt-dlp --skip-download --write-sub --write-auto-sub --sub-lang en --sub-format json3 <youtube video URL>
You can also feed the same command a playlist or channel URL and it'll run through and grab all the transcripts for each video in the playlist or channel.
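
A rough Python sketch of flattening the resulting json3 file into plain text follows. It assumes the file has a top-level `events` list whose entries may carry a `segs` list of `utf8` text fragments. That matches commonly seen json3 output, but the format is not officially documented, so treat the field names as assumptions. `json3_to_text` is a hypothetical helper name.

```python
import json


def json3_to_text(json3_str):
    """Flatten a json3 caption document into one plain-text string.

    Assumed layout (not officially documented):
    {"events": [{"tStartMs": 0, "segs": [{"utf8": "hello"}, ...]}, ...]}
    """
    data = json.loads(json3_str)
    lines = []
    for event in data.get("events", []):
        # Some events carry no text at all (for example window/layout events).
        text = "".join(seg.get("utf8", "") for seg in event.get("segs", []))
        text = text.strip()
        if text:  # skip events that only hold a newline or whitespace
            lines.append(text)
    return " ".join(lines)


if __name__ == "__main__":
    sample = json.dumps({
        "events": [
            {"tStartMs": 0, "segs": [{"utf8": "hello"}, {"utf8": " world"}]},
            {"tStartMs": 1000, "segs": [{"utf8": "\n"}]},
            {"tStartMs": 2000, "segs": [{"utf8": "again"}]},
        ]
    })
    print(json3_to_text(sample))
```

Timestamps are dropped here; keep `tStartMs` alongside each line if you need to link back to the video.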
fallinditch 6/25/2025||
That's cool, thanks for the info. But do you also have to use a rotating proxy to prevent YouTube from blocking your IP address?
banana_giraffe 6/25/2025||
Last time I ran this at scale was a couple of months ago, so my information is no doubt out of date, but in my experience, YouTube seems less concerned about this than they are when you're grabbing lots of videos.

But that was a few months ago, so for all I know they've battened down more hatches since then.

vjerancrnjak 6/25/2025||
If YouTube has generated automatic captions for the video, you can download them free of charge with yt-dlp.
isubkhankulov 6/25/2025||
Transcripts get much more valuable when one diarizes the audio beforehand to determine which speaker said what.

I use this free tool to extract those and dump the transcripts into an LLM with basic prompts: https://contentflow.megalabs.co

jasonjmcghee 6/25/2025||
Heads up, the token cost breakdown tables look white on white to me. I'm in dark mode on iOS using Brave.
georgemandis 6/25/2025|
Should be fixed now. Thank you!
BrunoJo 6/26/2025||
If you're looking for a cheaper transcription API, you could also use https://Lemonfox.ai. We've optimized the API for long audio files and are much faster and cheaper than OpenAI.
ta8903 6/26/2025|
This "hack" also works in real life. YouTubers love to talk slowly to increase the video runtime, so I watch everything other than songs at 2x speed (and that's only because their player doesn't let you go faster).