Posted by georgemandis 5 days ago
In the idea of making more of an OpenAI minute, don't send it any silence.
E.g.
ffmpeg -i video-audio.m4a \
-af "silenceremove=start_periods=1:start_duration=0:start_threshold=-50dB:\
stop_periods=-1:stop_duration=0.02:stop_threshold=-50dB,\
apad=pad_dur=0.02" \
-c:a aac -b:a 128k output_minpause.m4a -y
will cut the talk down from 39m31s to 31m34s by replacing any silence (at a -50dB threshold) longer than 20ms with a 20ms pause. And to keep with the spirit of your post, I only measured that the input file got shorter; I didn't look at all at the quality of the transcription produced from the shorter version.

One half-interesting / half-depressing observation I made is that at my workplace, any meeting recording I tried to transcribe this way had its length reduced to almost 2/3 once the silence was cut out. Makes you think about the efficiency (or lack thereof) of holding long(ish) meetings.
People say thank you to AI because these systems are presented as human-like chat bots, but in reality it has almost no effect on the quality of their responses to our queries.
Saying thank you to ChatGPT is no less wasteful than saying thank you to Windows for opening the calculator.
I don't think anyone is trying to draw any parallels between that inefficiency and real humans saying thank you?
I think this is the reverse of confrontation with the LLM. Typically if you get a really dumb response, it is better to hang up the conversation and completely start over than it is to tell the LLM why it is wrong. Once you start arguing, they start getting stupider and respond with even faultier logic as they try to appease you.
I suppose it makes sense if the training involves alternate models of discourse resembling two educated people in a forum with shared intellectual curiosity and a common goal, or two people having a ridiculous internet argument.
Well, I don’t think silence is the real problem with a 3 hour meeting!
There MUST be time to think
guys how hard is it to toss both versions into like diffchecker or something haha, you're just comparing text
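A minimal sketch of that comparison in Python, assuming the two transcripts were saved to local files (filenames are hypothetical placeholders):

    import difflib

    # Transcript of the original audio vs. the silence-trimmed audio.
    with open("transcript_full.txt") as f:
        full = f.read().split()
    with open("transcript_trimmed.txt") as f:
        trimmed = f.read().split()

    # Word-level similarity ratio: 1.0 means the transcripts are identical.
    ratio = difflib.SequenceMatcher(None, full, trimmed).ratio()
    print(f"word-level similarity: {ratio:.3f}")

    # Print the first few differing words for a quick eyeball check.
    for line in list(difflib.unified_diff(full, trimmed, lineterm=""))[:20]:
        print(line)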
really it becomes a question of whether or not the friction of invoking the command or the cost of tokens is greater.
as I get older and more rsi'd the tokens seem cheaper.
Unfortunately, a byproduct of listening to everything at 2x is that I've had a number of folks say they have to watch my videos at 0.75x. But even when I play back my own videos, they feel painfully slow unless it's at 2x.
For reference I've always found John Carmack's pacing perfect / natural and watchable at 2x too.
A recent video of mine is https://www.youtube.com/watch?v=pL-qft1ykek. It was posted on HN by someone else the other day so I'm not trying to do any self promotion here, it's just an example of a recent video I put up and am generally curious if anyone finds that too fast or it's normal. It's a regular unscripted video where I have a rough idea of what I want to cover and then turn on the mic, start recording and let it pan out organically. If I had to guess I'd say the last ~250-300 videos were recorded this way.
Funnily enough, if you actually have ADHD, then stimulants like Adderall or even nicotine will calm you down.
> Naturally people may choose to slow down tutorials, [...]
For me it also depends on what mood I'm in and whether I'm doing anything else at the same time. If I'm fully concentrating on a video, 2x is often fine. If I'm doing some physical task at the same time, I need it slower than that.
If I'm doing a mental task at the same time, I can forget about getting anything out of the video. At least, if the mental task involves any words. So e.g. I could probably still follow along with a technical discussion at roughly 1x speed while playing Tetris, but not while coding.
But it feels (very subjectively) faster to me than usual because you don't really seem to take any pauses. It's like the whole video is a single run-on sentence that I keep buffering, but I never get a chance to process it and flush the buffer.
Now I think the right speed adjustment comes less from the person's natural speaking pace than from the subject matter.
I'm thinking of a channel like Accented Cinema (https://youtu.be/hfruMPONaYg), with a slowish talking pace, but as there's all the visual part going on at all times, it actually doesn't feel slow to my ear.
I felt the same about videos explaining concepts I have no familiarity with, so I see it as a question of how fast the brain can process the info, more than the talking speed per se.
https://en.m.wikipedia.org/wiki/James_Goodnight
I have watched one or two videos of his, and he spoke slowly, compared to the average person. I liked that. It sounded good.
Watching your video at 1x still feels too slow, and it's just right for me at 2x speed (that's approximately how fast I normally talk if others don't tell me to slow down), although my usual YouTube watching speed is closer to 2.5-3x. That is to say, you're still faster than a lot of others.
I think it just takes practice --- I started at around 1.25x for videos, and slowly moved up from there. As you have noticed, once you've consumed enough sped-up content, your own speaking speed will also naturally increase.
We get used to higher speeds when we consume a lot of content that way. Have you heard the systems used by experienced blind people? I cannot even understand the words in them, but months of training would probably fix that.
I wonder if there's a way to automatically detect how "fast" a person talks in an audio file. I know it's subjective and different people talk at different paces, but it'd be cool to kinda know when OP's trick fails (they mention x4 ruined the output; maybe for Karpathy that would happen at x2).
Transcribe it locally using whisper and output tokens/sec?
Yeah, totally easier than `len(transcribe(a))/len(a)`
The tokens/second can be used as ground truth labels for a fft->small neural net model.
Stupid heuristic: take a segment of video, transcribe text, count number of words per utterance duration. If you need speaker diarization, handle speaker utterance durations independently. You can further slice, such as syllable count, etc.
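A rough sketch of that heuristic, assuming the openai-whisper Python package and a hypothetical local file clip.m4a; it counts words per second of detected speech, without the diarization step:

    import whisper

    # "clip.m4a" is a hypothetical input; anything ffmpeg can decode works.
    model = whisper.load_model("base")
    result = model.transcribe("clip.m4a")

    # Words spoken divided by the time Whisper marked as speech.
    words = sum(len(seg["text"].split()) for seg in result["segments"])
    speech_time = sum(seg["end"] - seg["start"] for seg in result["segments"])

    # Conversational English is very roughly 2-3 words per second at a natural pace.
    print(f"{words / speech_time:.2f} words per second of detected speech")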
Apparently human language conveys information at around 39 bits/s. You could use a similar technique as that paper to determine the information rate of a speaker and then correct it to 39 bits/s by changing the speed of the video.
javascript:void%20function(){document.querySelector(%22video,audio%22).playbackRate=parseFloat(prompt(%22Set%20the%20playback%20rate%22))}();
I wonder if there are negative side effects of this, though. Do you notice that interacting with people who speak slower requires a greater deal of patience?
You are basically training your brain to work faster, and I suspect that causes some changes in the structure of your memory; if someone speaks too slowly, I'll be more likely to forget what they said earlier, compared to if they quickly gave me the entire sentence.
Could use an “auctioneer” voice to playback text at 10x speed.
I understand 4-6x speakers fairly well but don't enjoy listening at that pace. If I lose focus for a couple of seconds I effectively miss a paragraph of context and my brain can't fill in the missing details.
Hilbert transform and FFT to get phoneme rate would work.
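A rough sketch of that idea, assuming a mono WAV file and numpy/scipy; strictly it picks out the dominant envelope modulation frequency (closer to syllable rate than phoneme rate):

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import butter, filtfilt, hilbert

    sr, audio = wavfile.read("speech.wav")  # hypothetical mono WAV input
    audio = audio.astype(np.float64)

    # Amplitude envelope via the Hilbert transform (magnitude of the analytic signal).
    envelope = np.abs(hilbert(audio))

    # Low-pass the envelope so only slow, syllable-scale modulation remains.
    b, a = butter(4, 20 / (sr / 2), btype="low")
    envelope = filtfilt(b, a, envelope)

    # FFT of the envelope; look for the strongest modulation in the 2-10 Hz band.
    spectrum = np.abs(np.fft.rfft(envelope - envelope.mean()))
    freqs = np.fft.rfftfreq(len(envelope), d=1 / sr)
    band = (freqs >= 2) & (freqs <= 10)
    print("dominant speech rate:", freqs[band][spectrum[band].argmax()], "Hz")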
Good god. You couldn't make that any more convoluted and hard-to-grasp if you wanted to. You gotta love ffmpeg!
I now think this might be a good solution:
ffmpeg -i video-audio.m4a \
-af "silenceremove=start_periods=1:stop_periods=-1:stop_duration=0.15:stop_threshold=-40dB:detection=rms" \
-c:a aac -b:a 128k output.m4a -y
Good documentation should do this work for you. It should explain somewhat atomic concepts that you can immediately adapt and compose. Where it already works is for the "detection" and "window" parameters, which are straightforward. But the actions of trimming at the start/middle/end, how to configure how long the silence lasts before trimming, whether to ignore short bursts of noise, whether to skip every nth silence period, these are all ideas and concepts that get mushed together in 10 parameters called start/stop-duration/threshold/silence/mode/periods.
If you want to apply this filter, it takes a long time to build mental models for these 10 parameters. You do have some example calls, which is great, but which doesn't help if you need to adjust any of these - then you probably need to understand them all.
Some stuff I stumbled over when reading it:
"To remove silence from the middle of a file, specify a stop_periods that is negative. This value is then treated as a positive value [...]" - what? Why is this parameter so heavily overloaded?
"start_duration: Specify the amount of time that non-silence must be detected before it stops trimming audio" - parameter is named start_something, but it's about stopping? Why?
"start_periods: [...] Normally, [...] start_periods will be 1 [...]. Default value is 0."
"start_mode: Specify mode of detection of silence end at start": start_mode end at start?
It's very clunky. Every parameter has multiple modes of operation. Why is it start and stop for beginning and end, and why is "do stuff in the middle" part of the end? Why is there no global mode?
You could nitpick this stuff to death. In the end, naming things is famously one of the two hard problems in computer science (the others being cache invalidation and off-by-one errors). And writing good documentation is also very, very hard work. Just exposing the internals of the algorithm is often not great UX, because then every user has to learn how the thing works internally before they can start using it (hey, looking at you, git).
So while it's easy to point out where these docs fail, it would be a lot of work to rewrite this documentation from the top down, explaining the concepts first. Or even rewriting the interface to make this more approachable, and the parameters less overloaded. But since it's hard work, and not sexy to programmers, it won't get done, and many people will come after, having to spend time on reading and re-reading this current mess.
In "start_mode", "start" means "initial", and "mode" means "method". But specifically, it's a method of figuring out where the silence ends.
> In the end, naming things is famously one of the two hard problems in computer science
It's also one of the hard problems in English.
Isn't ffmpeg made by a French person? As a francophone myself, I can tell you one of the biggest weaknesses of francophone programmers is naming things, and it's even worse in English. Maybe that's what's at play here.
I had Claude rewrite the documentation for silenceremove based on your feedback:
https://claude.ai/public/artifacts/96ea8227-48c3-484d-b30b-6...
https://www.theverge.com/news/603581/youtube-premium-experim...
I hope they make up their mind on it soon instead of this endless A/B testing.
I listen to a lot of videos on 3 or even 4x.
In either case, I bet OpenAI is doing the same optimization under the hood and keeping the savings for themselves.
Is it common for people to watch Youtube sped up?
I've heard of people doing this for podcasts and audiobooks and never understood it all that much there. Just feels like 'skimming' a real book instead of actually reading it.
Additionally, the brain tends to adjust to a faster talking speed very quickly. If I'm watching an average-paced person talk and speed them up by 2x, the first couple minutes of listening might be difficult and will require more intent-listening. However, the brain starts processing it as the new normal and it does not feel sped-up anymore. To the extent that if I go back to 1x, it feels like the speaker is way too slow.
Same with a video. A lot of people speak considerably slower than you could process the information they are conveying, so you speed it up. You still get the same content and are not skipping parts as you would when skimming a book.
That's the goal for me lately. I primarily use Youtube for technical assistance (where are the screws to adjust this carburetor? how do I remove this brake hub? etc.). There used to be short 1-2 minute videos on this kind of stuff, but nowadays I have to suffer through a 10-15 minute video with multiple ad breaks.
So now I always watch youtube at 2x speed while rapidly jumping the slider forward to find relevant portions.
I read a transcript + summary of that exact talk. I thought it was fine, but uninteresting, I moved on.
Later I saw it had been put on youtube and I was on the train, so I watched the whole thing at normal speed. I had a huge number of different ideas, thoughts and decisions, sparked by watching the whole thing.
This happens to me in other areas too. Watching a conference talk in person is far more useful to me than watching it online with other distractions. Watching it online is more useful again than reading a summary.
Going for a walk to think about something deeply beats a 10 minute session to "solve" the problem and forget it.
Slower is usually better for thinking.
Reading is a pleasure. Watching a lecture or a talk and feeling the pieces fall into place is great. Having your brain work out the meaning of things is surely something that defines us as a species. We're willingly heading for such stupidity, I don't get it. I don't get how we can all be so blind at what this is going to create.
"This specific knowledge format doesnt work for me, so I'm asking OpenAI to convert this knowledge into a format that is easier for me to digest" is exactly what this is about.
I'm not quite sure what you're upset about? Unless you're referring to "one size fits all knowledge" as simplified topics, so you can tackle things at a surface level? I love having surface level knowledge about a LOT of things. I certainly don't have time to have go deep on every topic out there. But if this is a topic I find I am interested in, the full talk is still available.
Breadth and depth are both important, and well summarized talks are important for breadth, but not helpful at all for depth, and that's ok.
This all discounts how human variation in thinking is critical to the advancement and survival of the species, keeping it as adaptable as possible to the climate and conditions of the given day. We didn't get to the moon on the back of one person or race. The AI can only emulate what it sees; it can't have ideas of its own. The dawn of AI will never be seen again, and all AI will suffer from the collective delusion, to the point that your freedom will be defined by something other than truth.
Audiobooks before speed tools were the worst (are they trying to speak extra slowly?). But when I can speed things up, comprehension is just fine.
On the gripping hand, there are probably already excellent 10/30/60 minute book summaries on YouTube or wherever which are not going to hallucinate plot points.
But now we get to browse the knowledge rather than having it thrown at us. That's more important than the quality or formatting of the content.
There is too much information. People are trying to optimize for breadth over depth, but obviously there are costs to this.
Your doomerism and superiority don't follow from your initial "I like many hackers don't like one size fits all".
This is literally offering you MANY sizes and you have the freedom to choose. Somehow you're pretending pushed down uniformity.
Consume it however you want and come up with actual criticisms next time?
There is just so much content out there. And context is everything. If the person sharing it had led with some specific ideas or thoughts I might have taken the time to watch and looked for those ideas. But in the context it was received—a quick link with no additional context—I really just wanted the "gist" to know what I was even potentially responding to.
In this case, for me, it was worth it. I can go back and decide if I want to watch it. Your comment has intrigued me so I very well might!
++ to "Slower is usually better for thinking"
Yeah, I see people talking about listening to podcasts or audiobooks on 2x or 3x.
Sometimes I set mine to 0.8x. I find you get time to absorb and think. Am I an outlier?
I'm trying to imagine listening to War and Peace faster. On the one hand, there are a lot of threads and people to keep track of (I had a notepad of who is who). On the other hand, having the stories compressed in time might help remember what was going on with a character when finally returning to them.
Listening to something like Dune quickly, someone might come out only thinking of the main political thrusts, and the action, without building that same world in their mind they would if read slower.
By understanding the outline and themes of a book (or lecture, I suppose), it makes it easier to piece together thoughts as you delve deeper into the full content.
Felt like a fun trick worth sharing. There’s a full script and cost breakdown.
Just wondering if I can build a retirement out of APIs :)
> Just wondering if I can build a retirement out of APIs :)
I think it's possible, but you need to find a way to add value beyond the commodity itself (e.g., audio classification and speaker diarization in my case).
> I don’t know—I didn’t watch it, lol. That was the whole point. And if that answer makes you uncomfortable, buckle-up for this future we're hurtling toward. Boy, howdy.
This is a great bit of work, and the author accurately summarizes my discomfort
A newspaper is essentially just an inaccurate summary of what really happened, so I don't find this realization that uncomfortable.
This kind of transformation has always come with flaws, and I think that will continue to be expected implicitly. Far more worrying is the public's trust in _interpretations_ and claims of _fact_ produced by gen AI services, or at least the popular idea that "AI" is more trustworthy/unbiased than humans, journalists, experts, etc.
If you are the one feeding content to a model then you are that responsible entity.
The last thing in the world I want to do is listen to or watch presidential social media posts, but, on the other hand, sometimes enormously stupid things are said which move the SP500 up or down $60 in a session. So this feature queries for new posts every minute, does OCR image-to-text and transcribes video audio to text locally, and sends the post with text for analysis, all in the background inside a Chrome extension, before notifying me of anything economically significant.
[0] https://github.com/huggingface/transformers.js/tree/main/exa...
Groq is ~$0.02/hr with distil-large-v3, or ~$0.04/hr with whisper-large-v3-turbo. I believe OpenAI comes out to like ~$0.36/hr.
We do this internally with our tool that automatically transcribes local government council meetings right when they get uploaded to YouTube. It uses Groq by default, but I also added support for Replicate and Deepgram as backups because sometimes Groq errors out.
> We do this internally with our tool that automatically transcribes local government council meetings right when they get uploaded to YouTube
Doesn't YouTube do this for you automatically these days within a day or so?
Oh yeah, we do a check first and use youtube-transcript-api if there's an automatic one available:
https://github.com/jdepoix/youtube-transcript-api
The tool usually detects them within like ~5 mins of being uploaded though, so usually none are available yet. Then it'll send the summaries to our internal Slack channel for our editors, in case there's anything interesting to 'follow up on' from the meeting.
Probably would be a good idea to add a delay to it and wait for the automatic ones though :)
At this point you'll need to at least check how much running ffmpeg costs. Probably less than $0.01 per hour of audio (approximate savings) but still.
Last time I checked, I think the Google auto-captions were noticeably worse quality than whisper, but maybe that has changed.
https://developers.cloudflare.com/workers-ai/models/whisper-...
With faster-whisper (int8, batch=8) you can transcribe 13 minutes of audio in 51 seconds on CPU.
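A minimal sketch of that setup, assuming a recent faster-whisper release that ships BatchedInferencePipeline (the input filename is a hypothetical placeholder):

    from faster_whisper import WhisperModel, BatchedInferencePipeline

    # int8 quantization on CPU, batched inference as described above.
    model = WhisperModel("large-v3", device="cpu", compute_type="int8")
    pipeline = BatchedInferencePipeline(model=model)

    segments, info = pipeline.transcribe("meeting.m4a", batch_size=8)
    for seg in segments:
        print(f"[{seg.start:.1f}s -> {seg.end:.1f}s] {seg.text}")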
Whisper works quite well on Apple Silicon with simple drag/drop install (i.e. no terminal commands). Program is free; you can get an M4 mini for ~$550; don't see how an online platform can even compete with this, except for one-off customers (i.e. not great repeat customers).
We used it to transcribe ddaayyss of audio microcassettes which my mother had made during her lifetime. Whisper.app even transcribed a few hours that are difficult to comprehend as a human listener. It is VERY fast.
I've used the text to search for timestamps worth listening to, skipping most dead-space (e.g. she made most while driving, in a stream of not-always-focused consciousness).
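A small sketch of that kind of keyword-to-timestamp search, assuming the transcript was saved as Whisper's JSON output (e.g. `whisper tape.wav --output_format json`; filename and keyword are hypothetical):

    import json

    with open("tape.json") as f:
        transcript = json.load(f)

    # Print the timestamp of every segment mentioning the keyword.
    keyword = "grandmother"
    for seg in transcript["segments"]:
        if keyword.lower() in seg["text"].lower():
            m, s = divmod(int(seg["start"]), 60)
            print(f"{m:02d}:{s:02d}  {seg['text'].strip()}")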
Is there a definition for this expression? I don't follow.
> ... using corporate technology for the solved problem is a symptom of self-directed skepticism by the user against the corporate institutions ...
Eh?
>> ... using corporate technology for the solved problem is a symptom of self-directed skepticism by the user against the corporate institutions ...
> Eh?
I don't know who wrote that or why you pasted in response to me.
Love this! I wish more authors followed this approach. So many articles keep going all over the place before 'the point' appears.
If they tried, perhaps some 50% of the authors would realize that they don't _have_ a point.
ffmpeg \
-f lavfi \
-i color=c=black:s=1920x1080:r=5 \
-i file_you_want_transcripted.wav \
-c:v libx264 \
-preset medium \
-tune stillimage \
-crf 28 \
-c:a aac \
-b:a 192k \
-pix_fmt yuv420p \
-shortest \
file_you_upload_to_youtube_for_free_transcripts.mp4
This works VERY well for my needs.