Posted by georgemandis 5 days ago
And if someone had this idea and pitched it to Claude (the model this project was vibe coded with) it would be like "what a great idea!"
Are people just starring it for meme value or something? Is this a scam?
I don't think a simple diff is the way to go, at least for what I'm interested in. What I care about more is the overall accuracy of the summary—not the word-for-word transcription.
The test I want to set up is using LLMs to evaluate the summarized output and see if the primary themes/topics persist. That's more interesting and useful to me for this exercise.
I'm actually curious: if I run transcriptions back-to-back-to-back on the exact same audio, how much variance should I expect?
Maybe I'll try three approaches:
- A straight diff comparison (I know a lot of people are calling for this, but I really think this is less useful than it sounds)
- A "variance within the modal" test running it multiple times against the same audio, tracking how much it varies between runs
- An LLM analysis assessing if the primary points from a talk were captured and summarized at 1x, 2x, 3x, 4x runs (I think this is far more useful and interesting)
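To make that third test concrete, here's a rough sketch of how I might wire up the LLM-as-judge step, assuming the official OpenAI Python SDK. The model name, prompt wording, and scoring rubric are all placeholders, not anything from the original post:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def judge_summary(reference_transcript: str, candidate_summary: str) -> str:
        """Ask a model whether a summary built from sped-up audio still
        captures the primary themes of the original (1x) transcript."""
        prompt = (
            "Below is a talk transcript and a summary generated from a "
            "sped-up version of the same audio.\n\n"
            "List the primary themes of the transcript, say whether the "
            "summary captures each one (yes/partially/no), and give an "
            "overall score from 0 to 10.\n\n"
            f"TRANSCRIPT:\n{reference_transcript}\n\nSUMMARY:\n{candidate_summary}"
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder: any capable judge model
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

Running that once per speed (1x, 2x, 3x, 4x) and watching which themes drop out first would answer the question I actually care about better than a raw diff.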
Here (https://github.com/openai/whisper/blob/main/whisper/model.py...) is the relevant code in the whisper repo. You'd just need to change the for loop to an enumerate and subsample the context along its length at the point you want. I believe it would be:
    for i, block in enumerate(self.blocks):
        x = block(x)
        if i == 4:
            x = x[:, ::2]
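For context, that loop lives in AudioEncoder.forward. A rough sketch of the modified method (the surrounding lines are paraphrased from memory of the repo, so double-check against model.py):

    def forward(self, x: Tensor):
        x = F.gelu(self.conv1(x))
        x = F.gelu(self.conv2(x))
        x = x.permute(0, 2, 1)  # -> (batch, n_ctx, n_state)
        x = (x + self.positional_embedding).to(x.dtype)

        for i, block in enumerate(self.blocks):
            x = block(x)
            if i == 4:           # after the fifth block...
                x = x[:, ::2]    # ...drop every other frame along the context length

        x = self.ln_post(x)
        return x

The decoder only cross-attends to whatever the encoder emits, so shortening the context mid-encoder shouldn't break anything structurally; the open question is what it does to quality.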
I'm implementing a similar workflow for VideoToBe.com
My Current Pipeline:
- Media Extraction: yt-dlp for reliable video/audio downloads
- Local Transcription: OpenAI Whisper running on my own hardware (no API costs)
- Storage & UI: Transcripts stored in S3 with a custom web interface for viewing
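If it's useful to anyone, a stripped-down sketch of the first two stages looks roughly like this (yt-dlp and openai-whisper used as Python libraries; the option values and file names are placeholders rather than VideoToBe's actual config, and the S3/UI part is omitted):

    import whisper
    import yt_dlp

    def download_audio(url: str, out: str = "audio.m4a") -> str:
        # grab the best available audio stream and write it to a fixed filename
        opts = {"format": "bestaudio/best", "outtmpl": out}
        with yt_dlp.YoutubeDL(opts) as ydl:
            ydl.download([url])
        return out

    def transcribe(path: str) -> str:
        model = whisper.load_model("base")  # runs locally, no API calls
        return model.transcribe(path)["text"]

    if __name__ == "__main__":
        audio = download_audio("https://www.youtube.com/watch?v=LCEmiRjPEtQ")
        print(transcribe(audio))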
Y Combinator playlist https://videotobe.com/play/playlist/ycombinator
and Andrej's talk is https://videotobe.com/play/youtube/LCEmiRjPEtQ
After reading your blog post, I will be testing the effect of speeding up audio on locally hosted Whisper models. Running Whisper locally eliminates the ongoing cost concern, since my infrastructure is already a sunk cost. Speeding up the audio could be an interesting performance enhancement to explore!
Then there is no cost at all to run audio of any length (since cost seems to be the primary concern of the article).
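A quick way I'm planning to test it: speed the audio up with ffmpeg's atempo filter, then time a local Whisper run on each version. File names, model size, and the speed factors below are just assumptions for the experiment:

    import subprocess
    import time
    import whisper

    def speed_up(src: str, dst: str, factor: float) -> str:
        # older ffmpeg builds cap atempo at 2.0 per instance, so chain filters above that
        flt = f"atempo={factor}" if factor <= 2.0 else f"atempo=2.0,atempo={factor / 2.0}"
        subprocess.run(["ffmpeg", "-y", "-i", src, "-filter:a", flt, dst], check=True)
        return dst

    model = whisper.load_model("base")

    for factor in (1.0, 2.0, 3.0):
        path = "talk.mp3" if factor == 1.0 else speed_up("talk.mp3", f"talk_{factor}x.mp3", factor)
        start = time.time()
        text = model.transcribe(path)["text"]
        print(f"{factor}x: {time.time() - start:.1f}s, {len(text.split())} words")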
On my M1 Mac laptop it takes about 30 seconds to run on a 3-minute audio file. I'm guessing a 40-minute talk would take about 5-10 minutes.
Did I miss that the task was time sensitive?