Posted by qbow883 12/27/2025
My biggest misconception, bar none, was around what a codec is exactly, and how well specified they are. I'd keep hearing downright mythical sounding claims, such as how different hardware and software encoders, and even decoders, produce different quality outputs.
This sounded absolutely mental to me. I thought that when someone said AVC / H.264, then there was some specification somewhere, that was then implemented, and that's it. I could not for the life of me even begin to fathom where differences in quality might seep in. Chief among these was when somebody claimed that using single-threaded encoding instead of multi-threaded encoding was superior. I legitimately considered that I was being messed with, or that the person I was talking to simply didn't know what they were talking about.
My initial thoughts on this were that okay, maybe there's a specification, and the various codec implementations just "creatively interpret" these. This made intuitive sense to me because "de jure" and "de facto" distinctions are immensely common in the real world, be it for laws, standards, what have you. So I'd start differentiating and going "okay so this is H.264 but <implementation name>". I was pretty happy with this, but eventually, something felt off enough to make me start digging again.
And then, not even a very long time ago, the mystery unraveled. What the various codec specifications actually describe, and what these codecs actually "are", is the on-disk bitstream format, and how to decode it. Just the decode. Never the encode. This applies to video, image, and sound formats; all lossy media formats. Except for telephony, all these codecs only ever specify the end result and how to decode that, but not the way to get there.
And so suddenly, the differences between implementations made sense. It isn't that they're flouting the standard: for the encoding step, there simply isn't one. The various codec implementations are left to compete on finding the "best" way to compress information to the same cross-compatibly decodable bitstream. It is the individual encoders' responsibility to craft a so-called psychovisual or psychoacoustic model, and then build a compute-efficient encoder that can get you the most bang for the buck. This is how you get differences between different hardware and software encoders, and how you can get differences even between single- and multi-threaded codepaths of the same encoder. Some of the approaches they chose might simply not work, or not work well, with multi-threading.
One question that escaped me then was how e.g. "HEVC / H.265" can be "more optimal" than "AVC / H.264" if all these standards define is the end result and how to decode that end result. The answer is actually kinda trivial: more features. Literally just more knobs to tweak. These of course introduce some overhead, so the question becomes whether you can reliably beat this overhead to achieve parity, or even gain efficiency. The OP claims this is not a foregone conclusion, but doesn't substantiate it. In my anecdotal experience, it is: parity or even an efficiency gain is pretty much guaranteed.
Finally, I mentioned differences between decoder output quality. That is a bit more boring. It is usually a matter of fault tolerance, and indeed, standards violations, such as supporting a 10-bit format in H.264 when the standard (supposedly, never checked) only specifies 8-bit. And of course, just basic incorrectness / bugs.
Regarding subbing then, unless you're burning in subs (called hard-subs), all this malarkey about encoding doesn't actually matter. The only thing you really need to know about is subtitle formats and media containers. OP's writing is not really for you.
As a specific example, DVD-Video had a random feature that discs could use. There was one brand of player that had a preset list of random numbers, so every time you played a disc that used randomness, the "random" results would be exactly the same. This made designing DVD-Video games "interesting", as not all players behaved the same.
This was when I first became aware that just because there's a spec doesn't mean you can count on the spec being followed in the same way everywhere. As you mentioned, video decoders also play fast and loose with specs. That's why some players cannot decode 10-bit encodes, as that's an "advanced" feature. Some players could not decode all of the profiles/levels a codec could use according to the spec. Apple's QuickTime Player could not decode the more advanced profiles/levels, just to show that it's not only "small" devs making limited decoders.
Let's just say we were encoding a list of numbers. We get a keyframe (an exact number), and then all frames after that until the next keyframe are just deltas: how much to add to the running number.
keyframe = 123
nextFrame += 2 // result = 125
nextFrame += 3 // result = 128
nextFrame -= 1 // result = 127
etc... A different encoder might have different deltas. When it comes to video, those differences are likely relatively subtle, though some definitely look better than others. The "spec" or "codec" only defines that each frame is encoded as a delta. It doesn't say what those deltas are or how they are computed, only how they are applied.
This is also why most video encoding software has quality settings, and those settings often reflect the fact that higher quality is slower. Some of those settings are about bitrate or bit depth or other things, but others are about how much time is spent looking for the perfect or better delta values to get closer to the original image, because searching for better matches takes time. Especially because it's lossy, there is no "correct" answer. There is just opinion.
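To make that concrete, here's a toy sketch in plain Python (nothing to do with any real codec; the function names and numbers are made up for illustration). The decode function is the whole "spec"; the two encoders stand in for a lazy and a careful implementation, and both produce streams the same decoder accepts:

def decode(keyframe, deltas):
    # the only part a real spec would pin down: start at the keyframe
    # and apply each delta in order
    out, current = [keyframe], keyframe
    for d in deltas:
        current += d
        out.append(current)
    return out

def encode_lazy(samples, step=4):
    # quantizes each delta against the *original* previous sample, ignoring
    # the error that piles up in what the decoder reconstructs
    return samples[0], [round((b - a) / step) * step for a, b in zip(samples, samples[1:])]

def encode_careful(samples, step=4):
    # quantizes against what the decoder will actually have reconstructed,
    # so the error doesn't accumulate; more bookkeeping, better result
    deltas, recon = [], samples[0]
    for s in samples[1:]:
        d = round((s - recon) / step) * step
        deltas.append(d)
        recon += d
    return samples[0], deltas

samples = [100, 102, 104, 106, 108, 110]
print(decode(*encode_lazy(samples)))     # [100, 100, 100, 100, 100, 100]
print(decode(*encode_careful(samples)))  # [100, 100, 104, 104, 108, 108]

Same decoder, same kind of stream, different quality: the careful encoder just spends more effort deciding what to write.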
Soooo with everyone getting used to creative names instead of descriptive names over the past decade or two, I guess "codec" just became a blob, and it just never crosses people's minds that this is right there in the name: COding/DECoding. No ENCoding.
So that's a swing and a miss, I'm afraid. But I'm very interested to hear what you think a "coder" library does in this context if not encode, and why it is juxtaposed with "decoder" if not for doing the exact opposite.
the compressor (encoder) decides exactly how to pack the data; the spec doesn't dictate it, so you can do a better job at it or a worse one
which is why we have "better" zlib implementations which compress more tightly
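you can watch the same thing happen with Python's bundled zlib module: every level produces a valid stream that the one decoder turns back into identical bytes, just packed more or less tightly (exact sizes will vary by zlib build):

import zlib

data = b"the quick brown fox jumps over the lazy dog " * 1000

for level in (1, 6, 9):
    packed = zlib.compress(data, level)
    assert zlib.decompress(packed) == data   # one decoder, identical output
    print(level, len(packed))                # higher levels pack tighter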
Makes a lot of sense in retrospect, to the extent that it bothers me I didn't figure it out myself earlier.
hardware encoders (like the ones in GPUs) typically work realtime-ish, so they do minimal exploration of encoding space
you also have the one-pass/two-pass thing which is key for unlocking high quality compression
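if you haven't run into it, the classic two-pass pattern with ffmpeg looks roughly like this (filenames and the bitrate are placeholders; check the docs for your build) — the first pass only gathers statistics, the second uses them to spend the bitrate where it matters:

ffmpeg -y -i input.mp4 -c:v libx264 -b:v 2M -pass 1 -an -f null /dev/null
ffmpeg -i input.mp4 -c:v libx264 -b:v 2M -pass 2 -c:a copy output.mp4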
[1] Technically the term codec refers to a specific program that can encode and decode a certain format.
With video there are 3 formats in play: the video stream itself, the audio stream itself, and the container (only the container is knowable from the extension). They can technically be mixed in any combination.
The video stream especially is costly in CPU to encode, and transcoding it can degrade quality significantly, so it's just a shame to re-encode if the original codec is usable.
The mkv container format is notorious for not being supported out of the box on lots of consumer devices, even if they have codecs for the audio and video streams it typically contains. (It has cool features geeks like, but for some reason it gets less support.)
Also there's one user-level aspect of MKV that makes it not too surprising to me: It can contain any number of video/audio/subtitle streams and the interface needs some way of allowing the user to pick between them. Easier to just skip that complexity, I guess.
I can't say I've experienced either of the ones mentioned, but I have had trouble in the past with output resolution selection (ending up with a larger file than expected with the encoding resolution much larger than the intended display resolution). User error, of course, but that tab is a bit non-obvious so it might be fair to call it a footgun.
The author's POV is that HandBrake does a lossy conversion, and people often use it in cases where they could have used a different tool that is lossless.
My uses of HandBrake are cases where I always want a lossy conversion, so no issue. A good example is any time I make a screen capture and want to post it on GitHub. I want it under the 10 MB limit (or whatever it is), so I want it re-encoded to be smaller. I don't mind the loss in quality.
I remember all the weird repackaged video codec installers that put mystery goo all over the machine.
The article bashes VLC but I tell you what… VLC plays just about everything you feed it without complaint. Even horribly corrupt files it will attempt to handle. It might not be perfect by any means but it does almost always work.
In most circumstances, an MPEG-TS file can be remuxed (without re-encoding) to a more reasonable container format like MP4, and it'll play better that way. In some cases, it'll even be a smaller file.
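With ffmpeg that's usually just a stream copy, something along these lines (paths are placeholders, and depending on the source the odd extra flag may be needed):

ffmpeg -i input.ts -c copy output.mp4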
(nb they did often use their own demuxers instead of libavformat)
No need to know anything about the video file anymore.
(Of course if you're hosting billions of videos on a website like YouTube it is a different story, but at that point you need to learn a _lot_ more e.g. about hardware accelerators, etc.)
If I want the best possible quality image at a precisely specified time, what would I do?
Can I increase quality if I have some leeway regarding the time (to use the closest keyframe)?
Is there a way to "undo" motion blur and get a sharp picture?
ffmpeg -ss 00:00:12.435 -i '/Users/weinzieri/videofile.mp4' -vframes 1 '/Users/weinzieri/image.png'
That means “go to 00:00:12.435 in the file /Users/weinzieri/videofile.mp4 and extract one frame to the file /Users/weinzieri/image.png”.

Not really, no, any more than there is a way to unblur something that was shot out of focus.
You can play clever tricks with motion estimation and neural networks but really all you're getting is a prediction of what it might have been like if the data had really been present.
Once the information is gone, it's gone.
video has certain temporal statistics which can allow you to fit the missing information
only true blurred white noise is impossible to recover
but across many consecutive frames, the information is spread out temporally and can be recovered (partially)
the same principle as how you can get a high-resolution image from a short video, by extracting the same patch from multiple frames
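a toy numpy sketch of that principle, under very idealized assumptions (the sub-pixel shifts are known exactly, no noise or blur), just to show how several low-resolution views of the same scene can carry the full-resolution signal between them:

import numpy as np

# one "scene", sampled at high resolution
scene = np.sin(np.linspace(0, 8 * np.pi, 240))

# four low-res "frames", each seeing the scene at a different sub-pixel offset
frames = [scene[offset::4] for offset in range(4)]

# knowing the offsets, interleaving the frames rebuilds the full-res signal
rebuilt = np.empty_like(scene)
for offset, frame in enumerate(frames):
    rebuilt[offset::4] = frame

print(np.allclose(rebuilt, scene))  # True

real multi-frame super-resolution has to estimate those shifts and fight noise and blur, which is where the "(partially)" comes from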
It is predicting what the information might maybe have been like.
I get that what you're describing can statistically "unblur" stuff you've blurred with overly-simplistic algorithms.
I can provide you with real-world footage that has "natural" motion blur in it, if you can demonstrate this technique working? I'd really like to see how it's done.
This is actually possible:
https://en.wikipedia.org/wiki/Deconvolution
If you have a high-quality image (before any compression) with a consistent blur, you can actually remove blur surprisingly well. Not completely perfectly, but often to a surprising degree that defies intuition.
And it's not a prediction -- it's recovering the actual data. Just because it's blurred doesn't mean it's gone -- it's just smeared across pixels, and clever math can be used to recover it. It's used widely in certain types of scientific imaging.
For photographers, it's most useful in removing motion blur from accidentally moving the camera while snapping a photo.
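Here's a minimal numpy sketch of the idea, under the most forgiving assumptions possible: the blur kernel is known exactly and there's no noise or compression. Real deblurring has to cope with both, which is where Wiener filtering, Richardson-Lucy and friends come in:

import numpy as np

rng = np.random.default_rng(0)
signal = rng.random(256)        # stand-in for one row of a sharp image

kernel = np.zeros(256)
kernel[:9] = 1 / 9              # a 9-tap box blur, a crude motion-blur stand-in

# blurring is convolution, i.e. multiplication in the frequency domain
blurred = np.fft.ifft(np.fft.fft(signal) * np.fft.fft(kernel)).real

# deconvolution divides that multiplication back out
recovered = np.fft.ifft(np.fft.fft(blurred) / np.fft.fft(kernel)).real

print(np.max(np.abs(recovered - signal)))  # tiny: the data was smeared, not destroyed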
In mpc-hc, you can framestep using CTRL+LeftArrow (steps a frame backward) or CTRL+RightArrow (steps a frame forward). This lets you select the frame you want to capture. You do not need to be on a keyframe. These keybinds are configurable and may be different on the latest version.
Then in the File menu, there's an export image option. It directly exports the frame you're currently on, to disk. Make sure to use a lossless format for comparisons (e.g. PNG).
I'm aware this can be done in other players - like mpv - as well, although there I believe no keybinds are set up for this by default, and the default export format is JPEG.
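For the format at least, mpv reads a screenshot-format option from mpv.conf, so a line like this should give you PNG output (worth double-checking against the manual for your version):

screenshot-format=png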