Posted by vishnuharidas 9/12/2025
https://commandcenter.blogspot.com/2020/01/utf-8-turned-20-y...
Even for identifiers you probably want to do all kinds of normalization, even beyond the level of UTF-8, so things like overlong sequences and other errors are really not an inherent security issue.
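A quick Rust sketch of that point, assuming the standard library's strict decoder: overlong forms are rejected outright, so they never reach a layer where they could masquerade as a different character.

    fn main() {
        // 0xC0 0xAF is an overlong (two-byte) encoding of '/'; a strict decoder rejects it.
        let overlong = [0xC0u8, 0xAF];
        assert!(std::str::from_utf8(&overlong).is_err());

        // The shortest-form encoding of '/' is the single byte 0x2F, which is accepted.
        assert_eq!(std::str::from_utf8(&[0x2F]).unwrap(), "/");
    }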
Unicode does have a completely defined way to interpret invalid UTF-8 byte sequences: replace them with U+FFFD (the "replacement character"). You'll see it used (for example) in browsers all the time.
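For example, a minimal Rust sketch using the standard library's lossy conversion (any decoder with a "replace" error handler behaves the same way):

    fn main() {
        // 0xFF can never appear in valid UTF-8; lossy decoding swaps it for U+FFFD.
        let bytes = [b'a', 0xFF, b'b'];
        let decoded = String::from_utf8_lossy(&bytes);
        assert_eq!(decoded, "a\u{FFFD}b");
    }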
Mandating acceptance for every invalid input works well for HTML because it's meant to be consumed (primarily) by humans. It's not done for UTF-8 because in some situations it's much more useful to detect and report errors instead of making an automatic correction that can't be automatically detected after the fact.
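By way of contrast, a strict decode hands you the error and the offset of the first bad byte instead of silently patching it (a Rust sketch; most languages expose something similar):

    fn main() {
        let bytes = [b'o', b'k', 0xC0, 0xAF];
        match std::str::from_utf8(&bytes) {
            Ok(s) => println!("valid: {s}"),
            // valid_up_to() reports how many leading bytes were well-formed,
            // so the caller can reject or log the input precisely.
            Err(e) => println!("invalid UTF-8 after {} valid bytes", e.valid_up_to()),
        }
    }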
This is not a wart. And how to interpret them is not undefined -- you're just not allowed to interpret them as _characters_.
There is currently a discussion[0] about adding a garbage-in/garbage-out mode to jq/jaq/etc. that lets them read and output JSON whose strings contain invalid UTF-8 representing binary data, in a way that round-trips. I'm not in favor of making that the default for jq, and you have to be very careful to make sure that every tool you use to handle such "JSON" round-trips the data. But the clever thing is that the proposed changes indeed do not interpret invalid byte sequences as character data, so they stay within the bounds of Unicode as long as your terminal (if these binary strings end up there) and other tools do the same.
UTF-8 made processing Japanese text much easier! No more needing to manually change encoding options in my browser! No more mojibake!
A couple of days later, I got an email from someone explaining that it was gibberish — apparently our content partner who claimed to be sending GB2312 simplified Chinese was in fact sending us Big5 traditional Chinese, so while many of the byte values mapped to valid characters, the result was nonsensical.
UTF-8 basically learned from the mistakes of previous encodings that allowed that kind of thing.
I still use some tools that assume ASCII input. For many years now, Linux tools have been removing the ability to specify default ASCII, leaving UTF-8 as the only relevant choice. This has caused me extra work, because if the data processing chain goes through these tools, I have to manually inspect the data for non-ASCII noise that has been introduced. I mostly use those older tools on Windows now, because most Windows tools still allow you to set default ASCII.
In other words, yes it's backward compatible, but UTF-8 is also compact and elegant even without that.
https://github.com/ParkMyCar/compact_str
How cool is that
(Discussed here https://news.ycombinator.com/item?id=41339224)
> how can we store a 24 byte long string, inline? Don't we also need to store the length somewhere?
> To do this, we utilize the fact that the last byte of our string could only ever have a value in the range [0, 192). We know this because all strings in Rust are valid UTF-8, and the only valid byte pattern for the last byte of a UTF-8 character (and thus the possible last byte of a string) is 0b0XXXXXXX aka [0, 128) or 0b10XXXXXX aka [128, 192)
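A rough sketch of that trick (not the actual compact_str layout; the InlineString type and 24-byte buffer here are just for illustration): a last byte of 192 or above can't be string data, so it can carry the length instead.

    // Values 192..=215 can never end a valid UTF-8 string, so they are free to
    // encode inline lengths 0..=23; a last byte below 192 means the buffer is
    // completely full (length 24).
    struct InlineString {
        buf: [u8; 24],
    }

    impl InlineString {
        fn new(s: &str) -> Option<Self> {
            let bytes = s.as_bytes();
            if bytes.len() > 24 {
                return None; // a real implementation would spill to the heap here
            }
            let mut buf = [0u8; 24];
            buf[..bytes.len()].copy_from_slice(bytes);
            if bytes.len() < 24 {
                buf[23] = 192 + bytes.len() as u8; // length tag, never valid string data
            }
            Some(InlineString { buf })
        }

        fn len(&self) -> usize {
            let last = self.buf[23];
            if last >= 192 { (last - 192) as usize } else { 24 }
        }

        fn as_str(&self) -> &str {
            std::str::from_utf8(&self.buf[..self.len()]).unwrap()
        }
    }

    fn main() {
        let s = InlineString::new("hello, UTF-8").unwrap();
        assert_eq!(s.len(), 12);
        assert_eq!(s.as_str(), "hello, UTF-8");
    }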
UTF-32 has an entire spare byte to put flags into. 24- or 21-bit encodings have spare bits that could act as flags. UTF-16 has plenty of invalid code units, or you could use a high surrogate in the last 2 bytes as your flag.
Edit: see https://raw.githubusercontent.com/tsutsui/emacs-18.59-netbsd...