Posted by vishnuharidas 9/12/2025

UTF-8 is a brilliant design (iamvishnu.com)
849 points | 348 comments | page 2
twbarr 9/12/2025|
It should be noted that the final design for UTF-8 was sketched out on a placemat by Rob Pike and Ken Thompson.
hu3 9/12/2025|
I wonder if that placemat still exists today. It would be such an important piece of computer history.
ot 9/12/2025||
> It was so easy once we saw it that there was no reason to keep the placemat for notes, and we left it behind. Or maybe we did bring it back to the lab; I'm not sure. But it's gone now.

https://commandcenter.blogspot.com/2020/01/utf-8-turned-20-y...

modeless 9/12/2025||
UTF-8 is great and I wish everything used it (looking at you JavaScript). But it does have a wart in that there are byte sequences which are invalid UTF-8 and how to interpret them is undefined. I think a perfect design would define exactly how to interpret every possible byte sequence even if nominally "invalid". This is how the HTML5 spec works and it's been phenomenally successful.
ekidd 9/12/2025||
For security reasons, the correct answer for how to process invalid UTF-8 is (and needs to be) "throw away the data like it's radioactive, and return an error." Otherwise you leave yourself wide open to validation-bypass attacks at many layers of your stack.
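
As a concrete illustration of that strict policy, here is a minimal Rust sketch (the helper name require_utf8 is just for the example): std::str::from_utf8 returns an error on the first invalid sequence and never silently repairs anything.

  // Strict validation: invalid UTF-8 is rejected outright, never repaired.
  fn require_utf8(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
      std::str::from_utf8(bytes)
  }

  fn main() {
      // 0xC0 0xAF is an overlong encoding of '/', a classic validation-bypass vector.
      assert!(require_utf8(&[0xC0, 0xAF]).is_err());
      assert!(require_utf8("plain ASCII".as_bytes()).is_ok());
  }
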
account42 9/15/2025|||
This is rarely the correct thing to do. Users don't particularly like it if you refuse to process a document because it has an error somewhere in there.

Even for identifiers you probably want to do all kinds of normalization even beyond the level of UTF-8 so things like overlong sequences and other errors are really not an inherent security issue.

modeless 9/12/2025|||
This is only true because the interpretation is not defined, so different implementations do different things.
cryptonector 9/13/2025||
That's not true. You're just not allowed to interpret them as characters.
moefh 9/13/2025|||
> This is how the HTML5 spec works and it's been phenomenally successful.

Unicode does have a completely defined way to interpret invalid UTF-8 byte sequences: replace them with U+FFFD (the "replacement character"). You'll see it used (for example) in browsers all the time.

Mandating acceptance for every invalid input works well for HTML because it's meant to be consumed (primarily) by humans. It's not done for UTF-8 because in some situations it's much more useful to detect and report errors instead of making an automatic correction that can't be automatically detected after the fact.
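
For illustration, a minimal Rust sketch showing both options side by side: browser-style replacement with U+FFFD versus strict rejection that reports the error.

  fn main() {
      let bytes = [0x68, 0x69, 0xFF, 0x21]; // "hi", one invalid byte, "!"

      // Replacement-character handling, as browsers do: invalid bytes become U+FFFD.
      let lossy = String::from_utf8_lossy(&bytes);
      assert_eq!(lossy, "hi\u{FFFD}!");

      // Strict handling: surface the error instead of silently patching the data.
      assert!(std::str::from_utf8(&bytes).is_err());
  }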

cryptonector 9/13/2025||
> But it does have a wart in that there are byte sequences which are invalid UTF-8 and how to interpret them is undefined.

This is not a wart. And how to interpret them is not undefined -- you're just not allowed to interpret them as _characters_.

There is right now a discussion[0] about adding a garbage-in/garbage-out mode to jq/jaq/etc that allows them to read and output JSON with invalid UTF-8 strings representing binary data in a way that round-trips. I'm not for making that the default for jq, and you have to be very careful about this to make sure that all the tools you use to handle such "JSON" round-trip the data. But the clever thing is that the proposed changes indeed do not interpret invalid byte sequences as character data, so they stay within the bounds of Unicode as long as your terminal (if these binary strings end up there) and other tools also do the same.

[0] https://github.com/01mf02/jaq/issues/309

3pt14159 9/12/2025||
I remember a time before UTF-8's ubiquity. It was such a headache moving to i18n. I love UTF-8.
linguae 9/12/2025||
I remember learning Japanese in the early 2000s and the fun of dealing with multiple encodings for the same language: JIS, Shift-JIS, and EUC. As late as 2011 I had to deal with processing a dataset encoded under EUC in Python 2 for a graduate-level machine learning course where I worked on a project for segmenting Japanese sentences (typically there are no spaces in Japanese sentences).

UTF-8 made processing Japanese text much easier! No more needing to manually change encoding options in my browser! No more mojibake!

layer8 9/12/2025|||
On the other hand, you now have to deal with the issues of Han unification: https://en.wikipedia.org/wiki/Han_unification#Examples_of_la...
pezezin 9/13/2025|||
I live in Japan and I still receive the random email or work document encoded in Shit-JIS. Mojibake is not as common as it once was, but still a problem.
rmunn 9/13/2025||
I'm assuming you misspelled Shift-JIS on purpose because you're sick and tired of dealing with it. If that was an accidental misspelling, it was inspired. :-)
acdha 9/13/2025|||
I worked on a site in the late 90s which had news in several Asian languages, including both simplified and traditional Chinese. We had a partner in Hong Kong sending articles and being a stereotypical monolingual American I took them at their word that they were sending us simplified Chinese and had it loaded into our PHP app which dutifully served it with that encoding. It was clearly Chinese so I figured we had that feed working.

A couple of days later, I got an email from someone explaining that it was gibberish — apparently our content partner who claimed to be sending GB2312 simplified Chinese was in fact sending us Big5 traditional Chinese so while many of the byte values mapped to valid characters it was nonsensical.

glxxyz 9/12/2025||
I worked on an email client. Many many character set headaches.
fleebee 9/12/2025||
If you want to delve deeper into this topic and like the Advent of Code format, you're in luck: i18n-puzzles[1] has a bunch of puzzles related to text encoding that drill how UTF-8 (and other encodings such as UTF-16) work into your brain.

[1]: https://i18n-puzzles.com/

Dwedit 9/12/2025||
Meanwhile, Shift-JIS has a bad design: the second byte of a two-byte character can fall in the ASCII range 0x40-0x7E. That range includes brackets, backslash, caret, backquote, curly braces, pipe, and tilde, so a path separator or math operator can appear in text that is encoded as Shift-JIS but interpreted as plain ASCII.

UTF-8 basically learned from the mistakes of previous encodings which allowed that kind of thing.
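
To make the contrast concrete, a small Rust sketch using the well-known example of katakana "so": its Shift-JIS second byte is the ASCII backslash, whereas every byte of a UTF-8 multi-byte sequence is 0x80 or above and can never be mistaken for ASCII.

  fn main() {
      // Katakana "so" in Shift-JIS is the byte pair 0x83 0x5C; the second byte
      // is '\' in ASCII, so naive ASCII-oriented code may treat it as an escape
      // character or path separator.
      let shift_jis_so: [u8; 2] = [0x83, 0x5C];
      assert_eq!(shift_jis_so[1], b'\\');

      // The same character in UTF-8 ("ソ", U+30BD) uses only bytes >= 0x80,
      // so no byte of a multi-byte sequence collides with ASCII.
      let utf8_so = "ソ".as_bytes();
      assert_eq!(utf8_so, &[0xE3, 0x82, 0xBD]);
      assert!(utf8_so.iter().all(|&b| b >= 0x80));
  }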

dotslashmain 9/12/2025||
Rob Pike and Ken Thompson are brilliant computer scientists & engineers.
wrp 9/13/2025||
I need to call out a myth about UTF-8. Tools built to assume UTF-8 are not backwards compatible with ASCII. An encoding INCLUDES but also EXCLUDES. When a tool is set to use UTF-8, it will process an ASCII stream, but it will not filter out non-ASCII.

I still use some tools that assume ASCII input. For many years now, Linux tools have been removing the ability to specify default ASCII, leaving UTF-8 as the only relevant choice. This has caused me extra work, because if the data processing chain goes through these tools, I have to manually inspect the data for non-ASCII noise that has been introduced. I mostly use those older tools on Windows now, because most Windows tools still allow you to set default ASCII.
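
A small Rust sketch of the point (illustrative only): a UTF-8-aware stage accepts non-ASCII without complaint, so if a downstream tool assumes ASCII, the filtering has to be done explicitly.

  fn main() {
      let input = "naïve café"; // valid UTF-8, but not pure ASCII

      // A UTF-8 tool passes this along happily; a pipeline stage that
      // requires ASCII has to check and filter for itself.
      if !input.is_ascii() {
          let offenders: Vec<char> = input.chars().filter(|c| !c.is_ascii()).collect();
          eprintln!("non-ASCII characters present: {:?}", offenders);
      }
  }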

int_19h 9/13/2025||
The usual statement isn't that UTF-8 is backwards compatible with ASCII (it's obvious that any 8-bit encoding wouldn't be; that's why we have UTF-7!). It's that UTF-8 is backwards compatible with tools that are 8-bit clean.
wrp 9/14/2025||
Yes, the myth I was pointing out is based on loose terminology. It needs to be made clear that "backwards compatible" means that UTF-8 based tools can receive but are not constrained to emit valid ASCII. I see a lot of comments implying that UTF-8 can interact with an ASCII ecosystem without causing problems. Even worse, it seems most Linux developers believe there is no longer a need to provide a default ASCII setting if they have UTF-8.
account42 9/15/2025|||
Do you have an actual example where this causes an issue? "ASCII" tools mostly just passed along non-ASCII bytes unchanged even before UTF-8.
kccqzy 9/13/2025||
That's not a myth about UTF-8. That's a decision by tools not to support pure ASCII.
bruce511 9/12/2025||
While the backward compatibility of utf-8 is nice, and makes adoption much easier, the backward compatibility does not come at any cost to the elegance of the encoding.

In other words, yes it's backward compatible, but UTF-8 is also compact and elegant even without that.
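
To show what that elegance looks like at the bit level, here is a hand-rolled encoder sketch (for illustration only; real code should use char::encode_utf8 or the standard string types): the lead byte's high bits announce the sequence length, and every continuation byte starts with 10, which is what makes the encoding self-synchronizing.

  // Illustrative only: encode a Unicode scalar value to UTF-8 by hand to show
  // the bit layout. A char in Rust can never be a surrogate, so that case is
  // not handled here.
  fn encode_utf8(cp: u32, out: &mut Vec<u8>) {
      match cp {
          0x0000..=0x007F => out.push(cp as u8), // ASCII passes through unchanged
          0x0080..=0x07FF => {
              out.push(0xC0 | (cp >> 6) as u8);   // 110xxxxx
              out.push(0x80 | (cp & 0x3F) as u8); // 10xxxxxx
          }
          0x0800..=0xFFFF => {
              out.push(0xE0 | (cp >> 12) as u8);  // 1110xxxx
              out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
              out.push(0x80 | (cp & 0x3F) as u8);
          }
          0x1_0000..=0x10_FFFF => {
              out.push(0xF0 | (cp >> 18) as u8);  // 11110xxx
              out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
              out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
              out.push(0x80 | (cp & 0x3F) as u8);
          }
          _ => panic!("not a Unicode code point"),
      }
  }

  fn main() {
      let mut buf = Vec::new();
      encode_utf8('€' as u32, &mut buf); // U+20AC
      assert_eq!(buf, "€".as_bytes());   // 0xE2 0x82 0xAC
  }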

nextaccountic 9/12/2025|
UTF-8 also enables this mindblowing design for small string optimization - if the string has 24 bytes or less it is stored inline, otherwise it is stored on the heap (with a pointer, a length, and a capacity - also 24 bytes)

https://github.com/ParkMyCar/compact_str

How cool is that

(Discussed here https://news.ycombinator.com/item?id=41339224)

adgjlsfhk1 9/12/2025||
How is that UTF8 specific?
ubitaco 9/12/2025|||
It's slightly buried in the readme on Github:

> how can we store a 24 byte long string, inline? Don't we also need to store the length somewhere?

> To do this, we utilize the fact that the last byte of our string could only ever have a value in the range [0, 192). We know this because all strings in Rust are valid UTF-8, and the only valid byte pattern for the last byte of a UTF-8 character (and thus the possible last byte of a string) is 0b0XXXXXXX aka [0, 128) or 0b10XXXXXX aka [128, 192)
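
A minimal sketch of that trick (illustrative only, not compact_str's actual layout; store_inline and inline_len are made-up names): because the last byte of any valid UTF-8 string is below 192, the values 192..=255 in the final buffer byte are free to encode metadata such as the inline length.

  const INLINE_CAP: usize = 24;

  // Store a short string inline, tagging its length in the otherwise unused
  // value range of the last byte. Sketch only; compact_str's real layout differs.
  fn store_inline(s: &str) -> Option<[u8; INLINE_CAP]> {
      let bytes = s.as_bytes();
      if bytes.len() > INLINE_CAP {
          return None; // a real implementation would spill to the heap here
      }
      let mut buf = [0u8; INLINE_CAP];
      buf[..bytes.len()].copy_from_slice(bytes);
      if bytes.len() < INLINE_CAP {
          buf[INLINE_CAP - 1] = 192 + bytes.len() as u8; // lengths 0..=23 tagged as 192 + len
      }
      // A full 24-byte string needs no tag: its own last byte is < 192.
      Some(buf)
  }

  fn inline_len(buf: &[u8; INLINE_CAP]) -> usize {
      let last = buf[INLINE_CAP - 1];
      if last >= 192 { (last - 192) as usize } else { INLINE_CAP }
  }

  fn main() {
      let buf = store_inline("héllo").unwrap();
      assert_eq!(inline_len(&buf), "héllo".len()); // 6 bytes
  }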

Dylan16807 9/14/2025||
Any Unicode encoding would allow that.

UTF-32 has an entire spare byte to put flags into. 24 or 21 bit encodings have spare bits that could act as flags. UTF-16 has plenty of invalid code units, or you could use a high surrogate in the last 2 bytes as your flag.

vismit2000 9/13/2025||
Karpathy's "Let's build the GPT Tokenizer" also contains a good introduction to Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32 in the first 20 minutes: https://www.youtube.com/watch?v=zduSFxRajkE
gnufx 9/13/2025|
It's worth noting that Stallman had earlier proposed a design for Emacs "to handle all the world's alphabets and word signs" with similar requirements to UTF-8. That was the etc/CHARACTERS file in Emacs 18.59 (1990). The eventual international support implemented in Emacs 20's MULE was based on ISO-2022, which was a reasonable choice at the time, based on earlier Japanese work. (There was actually enough space in the MULE encoding to add UTF-8, but the implementation was always going to be inefficient with the number of bytes at the top of the code space.)

Edit: see https://raw.githubusercontent.com/tsutsui/emacs-18.59-netbsd...
