Posted by birdculture 4 days ago
The reason is that the opcode encoding is very dense, has essentially no redundancy that would let a decoder detect bad encodings, and usually bears no relationship to neighboring words.
By that I mean that a given four-byte chunk (say) is decoded as an opcode word regardless of what came before or what comes after. If it looks like an opcode with a four-byte immediate operand, then the disassembly will pull in that operand (which can be any bit combination) and skip another four bytes. Nothing in the operand will indicate "this is a bad instruction overall".
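A minimal sketch of that skip-ahead behavior, assuming an x86-64 target and the Capstone Python bindings (the thread doesn't say which architecture or disassembler was actually used):

```python
# Rough sketch (not the article's harness): linear-sweep disassembly of random
# 128-byte blocks, assuming x86-64 and Capstone. A block "disassembles" here
# only if every byte is consumed by some decodable instruction.
import os
from capstone import Cs, CS_ARCH_X86, CS_MODE_64

md = Cs(CS_ARCH_X86, CS_MODE_64)

def disassembles_fully(block: bytes) -> bool:
    off = 0
    while off < len(block):
        insn = next(md.disasm(block[off:], off, count=1), None)
        if insn is None:          # undecodable byte sequence -> bad block
            return False
        off += insn.size          # pull in the operands, skip past them, keep going
    return True

trials = 100_000
hits = sum(disassembles_fully(os.urandom(128)) for _ in range(trials))
print(f"{100 * hits / trials:.1f}% of random blocks disassemble")
```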
* only 4.4% of the random data disassembles.
* only 4.0% of the random data decodes as Static Huffman.
BUT:
* 1.2% of the data decompresses and disassembles.
Relative to the 4.0% decompression, 1.2% is 30%.
In other words, 30% of successfully decompressed material also disassembles.
That's something that could benefit from an explanation.
Why is it that the conditional probability of a good disassembly, given a successful Static Huffman expansion, is evidently so much higher than the probability of a good disassembly of plain random data?
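One way to probe that conditional probability directly would be something like the sketch below. It reuses `disassembles_fully` from the earlier sketch, and it substitutes zlib's raw-DEFLATE inflater for the article's static-Huffman-only decoder, so the absolute percentages won't match the article's; the ratio is the interesting part:

```python
# Rough harness for the conditional probability (not the article's code):
# random 128-byte blocks, raw-DEFLATE inflation standing in for the static
# Huffman decoder, and disassembles_fully() from the sketch further up.
import os, zlib

def inflates(block: bytes) -> bytes | None:
    try:
        # wbits=-15 means a raw DEFLATE stream with no zlib header/checksum
        return zlib.decompressobj(wbits=-15).decompress(block)
    except zlib.error:
        return None

trials, decoded, both = 200_000, 0, 0
for _ in range(trials):
    out = inflates(os.urandom(128))
    if out:
        decoded += 1
        both += disassembles_fully(out)   # defined in the earlier sketch
print(f"P(decodes)                ~ {decoded / trials:.3f}")
print(f"P(disassembles | decodes) ~ {both / max(decoded, 1):.3f}")
```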
With 40 million "success" and 570 "end of stream", I think that implies that out of a billion tests it read all 128 bytes less than a thousand times.
As a rough estimate from the static Huffman tables, each symbol gives you about an 80% chance of outputting a byte, an 18% chance of crashing, a 1% chance of repeating some bytes, and a 1% chance of ending decompression. As the output gets longer, the odds tilt a few percent more toward repeating instead of crashing. But on average it's going to use only a few of the 128 bytes of input, outputting them in a slightly shuffled way plus some repetitions.
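Those per-symbol odds can be sanity-checked against the fixed ("static") Huffman code lengths from RFC 1951, section 3.2.6: with uniformly random input bits, a code of length L is decoded with probability 2^-L. A quick sketch:

```python
# Per-symbol probabilities for the fixed Huffman literal/length alphabet
# (RFC 1951, 3.2.6) when the input bits are uniformly random.
lengths = {}
lengths.update({s: 8 for s in range(0, 144)})     # literal bytes 0-143
lengths.update({s: 9 for s in range(144, 256)})   # literal bytes 144-255
lengths[256] = 7                                  # end-of-block
lengths.update({s: 7 for s in range(257, 280)})   # match-length codes
lengths.update({s: 8 for s in range(280, 288)})   # match-length codes

def prob(symbols):
    return sum(2.0 ** -lengths[s] for s in symbols)

print(f"literal byte   : {prob(range(0, 256)):.3f}")   # ~0.78
print(f"match (repeat) : {prob(range(257, 288)):.3f}") # ~0.21; early in the
                                                       #   stream most matches
                                                       #   fail -> "crash"
print(f"end of block   : {prob([256]):.4f}")           # ~0.008
```

That comes out to roughly 78% literal, 21% match (almost all of which fail early on for lack of history), and 0.8% end-of-block, which is the same ballpark as the 80/18/1/1 estimate above.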
Even that won't find the maximal amount of decoding that is possible; for that you have to slide through the input bit by bit and try decoding at every bit position.
However, it seems fair because you wouldn't disassemble that way. If you disassemble some bytes successfully, you skip past those and keep going.
For example, there’s a very high chance a single random instruction would page fault.
If you want to generate random instructions and have them execute, you have to write a tiny debugger, intercept the page faults, fix up the program’s virtual memory map, then re-run the instruction to make it work.
This means that even though high entropy data has a good chance of producing valid instructions, it doesn’t have a high chance of producing valid instruction sequences.
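Here's a quick, illustrative check of the first point: run random bytes in a forked child with no debugger and no fault fix-up, and look at how the child exits. Linux/x86-64 is assumed, `run_random_bytes` is a name invented for this sketch, and a run can occasionally hang if the random bytes happen to loop:

```python
# Execute random bytes in a child process and count how often it dies on a
# signal instead of returning cleanly. Purely illustrative; assumes Linux.
import ctypes, mmap, os

def run_random_bytes(n: int = 15) -> int:
    """Return the signal that killed the child, or 0 if it returned cleanly."""
    pid = os.fork()
    if pid == 0:  # child: copy random bytes into an executable page and jump in
        buf = mmap.mmap(-1, mmap.PAGESIZE,
                        prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
        buf.write(os.urandom(n) + b"\xc3")        # random bytes, then RET
        addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
        ctypes.CFUNCTYPE(None)(addr)()            # call the page as a function
        os._exit(0)                               # only reached if nothing faulted
    _, status = os.waitpid(pid, 0)
    return os.WTERMSIG(status) if os.WIFSIGNALED(status) else 0

faulted = sum(run_random_bytes() != 0 for _ in range(200))
print(f"{faulted}/200 runs died on a signal (SIGSEGV, SIGILL, ...)")
```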
Code that actually does something will have much much lower entropy.
That is interesting… even though random data often decodes as syntactically valid instructions, it's almost certainly invalid semantically.
(By help I mean just help, not write an entire sloppy article.)
The main one is to set reader expectations that any errors are entirely my own, and that I spent time reviewing the details of the work. The disclosure seemed to me a concise way to do that -- my intention was not any form of anti-AI virtue signaling.
The other reason is that I may use AI for some of my future work, and as a reader, I would prefer a disclosure about that. So I figured if I'm going to disclose using it, I might as well disclose not using it.
I linked to other thoughts on AI just in case others are interested in what I have to say. I don't stand to gain anything from what I write, and I don't even have analytics to tell me more people are viewing it.
All in all, I was just trying to be transparent, and share my work.
:)
I like it!
But here it does serve a purpose beyond hinting at the author's ideological stance.
Nowadays, a lot of readers will wonder how much of your work is AI assisted. Their eyes will be drawn to the AI Use Disclosure, which will answer their question.