So, it won't fill up during our lifetime I guess.
Imagine the code points we'll need to represent an alien culture :).
If we ever needed that many characters, then yes, the most obvious solution would be a fifth byte. The standard would need to be explicitly extended, though.
But that would probably require having encountered literate extraterrestrial species to collect enough new alphabets to fill up all the available code points first. So it seems like it would be a pretty cool problem to have.
So what would need to happen first is that Unicode decides to include larger code points. Then UTF-8 would need to be extended to handle encoding them. (But I don't think that will happen.)
It seems like Unicode code points are less than 30% allocated, roughly, so there's over 70% free space.
---
Think of these three separate concepts to keep it clear. We are effectively dealing with two translations: one from the abstract symbol to a defined Unicode code point, and then from that code point to bytes via the UTF-8 encoding (see the sketch after the list).
1. The glyph or symbol ("A")
2. The Unicode code point for the symbol (U+0041 Latin Capital Letter A)
3. The UTF-8 encoding of that code point, as bytes (0x41)
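A quick way to see all three layers at once, as a minimal Python sketch (the characters are just examples; assumes Python 3.8+ for `bytes.hex` with a separator):

```python
# Glyph -> Unicode code point -> UTF-8 bytes, for a few sample characters.
for ch in ["A", "é", "€", "😀"]:
    codepoint = ord(ch)          # the abstract symbol's Unicode code point
    utf8 = ch.encode("utf-8")    # that code point encoded as UTF-8 bytes
    print(f"{ch}  U+{codepoint:04X}  {utf8.hex(' ')}")

# A   U+0041   41
# é   U+00E9   c3 a9
# €   U+20AC   e2 82 ac
# 😀  U+1F600  f0 9f 98 80
```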
UTF-8 basically learned from the mistakes of previous encodings which allowed that kind of thing.
I realize that hindsight is 20/20 and times were different, but let's face it: "how to use an unused top bit to best encode a larger number representing a Unicode code point" is not that much of a challenge, and the space of practical solutions isn't even all that large.
UTF-8 is the best kind of brilliant. After you've seen it, you (and I) think of it as obvious, and clearly the solution any reasonable engineer would come up with. Except that it took a long time for it to be created.
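To make the "top bit" trick concrete, here's a hand-rolled sketch of the standard UTF-8 byte layout (Python, illustrative only; it skips validation such as rejecting surrogates or out-of-range values):

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a code point using the UTF-8 bit layout (no validation)."""
    if cp < 0x80:        # 1 byte:  0xxxxxxx (plain ASCII, top bit clear)
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (up to U+10FFFF in practice)
    return bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])

assert utf8_encode(ord("€")) == "€".encode("utf-8")
assert utf8_encode(0x1F600) == "😀".encode("utf-8")
```

The lead byte's high bits say how many bytes follow, and every continuation byte starts with 10, which is why you can resynchronize mid-stream and why ASCII passes through untouched.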
UTF-16 is also the reason why Unicode has a limit of about 1.1 million code points: without UTF-16, we could have over 2 billion (the limit of the original UTF-8 design).
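Rough arithmetic behind those two numbers (assuming the original 6-byte UTF-8 design from RFC 2279 for the second one):

```python
# UTF-16: the Basic Multilingual Plane plus everything reachable by surrogate pairs.
utf16_limit = 0x10000 + 1024 * 1024   # 1,114,112 code points (~1.1 million)

# Original UTF-8 design (up to 6 bytes): 31 payload bits.
original_utf8_limit = 2 ** 31         # 2,147,483,648 (~2.1 billion)

print(utf16_limit, original_utf8_limit)
```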
UTF-8 made it all relatively neat back in the day. There are still ways to throw a wrench into the gears. For example, how do you handle UTF-8 encoded surrogate pairs? But at least one can filter that out as suspicious/malicious behavior.
Surrogate pairs aren't applicable to UTF-8. That block of code points is simply invalid in UTF-8 and should be treated as such (as a parsing error, as invalid characters, etc.).
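For example, a small Python check (the first byte string is the CESU-8-style surrogate encoding of U+1F600, the second is its valid 4-byte UTF-8 form):

```python
# Strict UTF-8 decoders reject encoded surrogates (U+D800..U+DFFF).
cesu8_style = b"\xed\xa0\xbd\xed\xb8\x80"   # surrogate pair D83D/DE00 as two 3-byte sequences
try:
    cesu8_style.decode("utf-8")
except UnicodeDecodeError as err:
    print("rejected:", err)

print(b"\xf0\x9f\x98\x80".decode("utf-8"))  # the valid 4-byte encoding of U+1F600 decodes fine
```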
Maybe when it comes to emoji, but otherwise, no, Unicode is not a dumpster fire. Unicode is elegant, and all the things people complain about in Unicode are actually problems in human scripts.
Most other standards just do the xkcd thing: "now there's 15 competing standards"