Posted by vishnuharidas 1 day ago
† With an occasional UNICODE flourish.
UTF-8 made it all relatively neat back in the day. There are still ways to throw a wrench into the gears. For example, how do you handle UTF-8 encoded surrogate pairs? But at least one can filter that out as suspicious/malicious behavior.
Surrogate pairs aren’t applicable to UTF-8. That part of Unicode block is just invalid for UTF-8 and should be treated as such (parsing error or as invalid characters etc).
Maybe as to emojis, but otherwise, no, Unicode is not a dumpster fire. Unicode is elegant, and all the things that people complain about in Unicode are actually problems in human scripts.
Because if so: I don't really like that. It would mean that "equal sequence of code points" does not imply "equal sequence of encoded bytes" (the converse continues to hold, of course), while offering no advantage that I can see.
I could ask Gemini but HN seems more knowledgeable.
Almost the entire world could have ignored it if not for Microsoft making the wrong choice with Windows NT and then stubbornly insisting that their wrong choice was indeed correct for a couple of decades.
There was a long phase where some parts of Windows understood (and maybe generated) UTF-16 and others only UCS-2.
Java's mistake seems to have been independent and it seems mainly to have been motivated by the mistaken idea that it was necessary to index directly into strings. That would have been deprecated fast if Windows had been UTF-8 friendly and very fast if it had been UTF-16 hostile.
We can always dream.
ISO 2022 allowed you to use control codes to switch between ISO 8859 character sets though, allowing for mixed script text streams.