Posted by vishnuharidas 9/12/2025

UTF-8 is a brilliant design (iamvishnu.com)
849 points | 348 comments | page 2
twbarr 9/12/2025|
It should be noted that the final design for UTF-8 was sketched out on a placemat by Rob Pike and Ken Thompson.
hu3 9/12/2025|
I wonder if that placemat still exists today. It would be such an important piece of computer history.
ot 9/12/2025||
> It was so easy once we saw it that there was no reason to keep the placemat for notes, and we left it behind. Or maybe we did bring it back to the lab; I'm not sure. But it's gone now.

https://commandcenter.blogspot.com/2020/01/utf-8-turned-20-y...

modeless 9/12/2025||
UTF-8 is great and I wish everything used it (looking at you JavaScript). But it does have a wart in that there are byte sequences which are invalid UTF-8 and how to interpret them is undefined. I think a perfect design would define exactly how to interpret every possible byte sequence even if nominally "invalid". This is how the HTML5 spec works and it's been phenomenally successful.
ekidd 9/12/2025||
For security reasons, the correct answer for how to process invalid UTF-8 is (and needs to be) "throw away the data like it's radioactive, and return an error." Otherwise you leave yourself wide open to validation-bypass attacks at many layers of your stack.
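
As a concrete illustration of that strict policy, here is a minimal Rust sketch (the helper name require_utf8 is just for the example): std::str::from_utf8 returns an error on the first invalid sequence and never silently repairs anything.

  // Strict validation: invalid UTF-8 is rejected outright, never repaired.
  fn require_utf8(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
      std::str::from_utf8(bytes)
  }

  fn main() {
      // 0xC0 0xAF is an overlong encoding of '/', a classic validation-bypass vector.
      assert!(require_utf8(&[0xC0, 0xAF]).is_err());
      assert!(require_utf8("plain ASCII".as_bytes()).is_ok());
  }
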
account42 9/15/2025|||
This is rarely the correct thing to do. Users don't particularly like it if you refuse to process a document because it has an error somewhere in there.

Even for identifiers you probably want to do all kinds of normalization even beyond the level of UTF-8 so things like overlong sequences and other errors are really not an inherent security issue.

modeless 9/12/2025|||
This is only true because the interpretation is not defined, so different implementations do different things.
cryptonector 9/13/2025||
That's not true. You're just not allowed to interpret them as characters.
moefh 9/13/2025|||
> This is how the HTML5 spec works and it's been phenomenally successful.

Unicode does have a completely defined way to interpret invalid UTF-8 byte sequences: replace them with U+FFFD (the "replacement character"). You'll see it used (for example) in browsers all the time.

Mandating acceptance for every invalid input works well for HTML because it's meant to be consumed (primarily) by humans. It's not done for UTF-8 because in some situations it's much more useful to detect and report errors instead of making an automatic correction that can't be automatically detected after the fact.
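
For illustration, a minimal Rust sketch showing both options side by side: browser-style replacement with U+FFFD versus strict rejection that reports the error.

  fn main() {
      let bytes = [0x68, 0x69, 0xFF, 0x21]; // "hi", one invalid byte, "!"

      // Replacement-character handling, as browsers do: invalid bytes become U+FFFD.
      let lossy = String::from_utf8_lossy(&bytes);
      assert_eq!(lossy, "hi\u{FFFD}!");

      // Strict handling: surface the error instead of silently patching the data.
      assert!(std::str::from_utf8(&bytes).is_err());
  }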

cryptonector 9/13/2025||
> But it does have a wart in that there are byte sequences which are invalid UTF-8 and how to interpret them is undefined.

This is not a wart. And how to interpret them is not undefined -- you're just not allowed to interpret them as _characters_.

There is right now a discussion[0] about adding a garbage-in/garbage-out mode to jq/jaq/etc that allows them to read and output JSON with invalid UTF-8 strings representing binary data in a way that round-trips. I'm not for making that the default for jq, and you have to be very careful about this to make sure that all the tools you use to handle such "JSON" round-trip the data. But the clever thing is that the proposed changes indeed do not interpret invalid byte sequences as character data, so they stay within the bounds of Unicode as long as your terminal (if these binary strings end up there) and other tools also do the same.

[0] https://github.com/01mf02/jaq/issues/309

3pt14159 9/12/2025||
I remember a time before UTF-8's ubiquity. It was such a headache moving to i18n. I love UTF-8.
linguae 9/12/2025||
I remember learning Japanese in the early 2000s and the fun of dealing with multiple encodings for the same language: JIS, Shift-JIS, and EUC. As late as 2011 I had to deal with processing a dataset encoded under EUC in Python 2 for a graduate-level machine learning course where I worked on a project for segmenting Japanese sentences (typically there are no spaces in Japanese sentences).

UTF-8 made processing Japanese text much easier! No more needing to manually change encoding options in my browser! No more mojibake!

layer8 9/12/2025|||
On the other hand, you now have to deal with the issues of Han unification: https://en.wikipedia.org/wiki/Han_unification#Examples_of_la...
pezezin 9/13/2025|||
I live in Japan and I still receive the random email or work document encoded in Shit-JIS. Mojibake is not as common as it once was, but still a problem.
rmunn 9/13/2025||
I'm assuming you misspelled Shift-JIS on purpose because you're sick and tired of dealing with it. If that was an accidental misspelling, it was inspired. :-)
acdha 9/13/2025|||
I worked on a site in the late 90s which had news in several Asian languages, including both simplified and traditional Chinese. We had a partner in Hong Kong sending articles and being a stereotypical monolingual American I took them at their word that they were sending us simplified Chinese and had it loaded into our PHP app which dutifully served it with that encoding. It was clearly Chinese so I figured we had that feed working.

A couple of days later, I got an email from someone explaining that it was gibberish — apparently our content partner who claimed to be sending GB2312 simplified Chinese was in fact sending us Big5 traditional Chinese so while many of the byte values mapped to valid characters it was nonsensical.

glxxyz 9/12/2025||
I worked on an email client. Many many character set headaches.
fleebee 9/12/2025||
If you want to delve deeper into this topic and like the Advent of Code format, you're in luck: i18n-puzzles[1] has a bunch of puzzles related to text encoding that drill how UTF-8 (and other encodings such as UTF-16) work into your brain.

[1]: https://i18n-puzzles.com/

Dwedit 9/12/2025||
Meanwhile, Shift-JIS has a bad design: the second byte of a two-byte character can fall in the ASCII range 0x40-0x7E. That range includes brackets, backslash, caret, backquote, curly braces, pipe, and tilde, so a path separator or math operator can appear in text that is encoded as Shift-JIS but interpreted as plain ASCII.

UTF-8 basically learned from the mistakes of previous encodings which allowed that kind of thing.
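
To make the contrast concrete, a small Rust sketch using the well-known example of katakana "so": its Shift-JIS second byte is the ASCII backslash, whereas every byte of a UTF-8 multi-byte sequence is 0x80 or above and can never be mistaken for ASCII.

  fn main() {
      // Katakana "so" in Shift-JIS is the byte pair 0x83 0x5C; the second byte
      // is '\' in ASCII, so naive ASCII-oriented code may treat it as an escape
      // character or path separator.
      let shift_jis_so: [u8; 2] = [0x83, 0x5C];
      assert_eq!(shift_jis_so[1], b'\\');

      // The same character in UTF-8 ("ソ", U+30BD) uses only bytes >= 0x80,
      // so no byte of a multi-byte sequence collides with ASCII.
      let utf8_so = "ソ".as_bytes();
      assert_eq!(utf8_so, &[0xE3, 0x82, 0xBD]);
      assert!(utf8_so.iter().all(|&b| b >= 0x80));
  }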

dotslashmain 9/12/2025||
Rob Pike and Ken Thompson are brilliant computer scientists & engineers.
wrp 9/13/2025||
I need to call out a myth about UTF-8. Tools built to assume UTF-8 are not backwards compatible with ASCII. An encoding INCLUDES but also EXCLUDES. When a tool is set to use UTF-8, it will process an ASCII stream, but it will not filter out non-ASCII.

I still use some tools that assume ASCII input. For many years now, Linux tools have been removing the ability to specify default ASCII, leaving UTF-8 as the only relevant choice. This has caused me extra work, because if the data processing chain goes through these tools, I have to manually inspect the data for non-ASCII noise that has been introduced. I mostly use those older tools on Windows now, because most Windows tools still allow you to set default ASCII.
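
A small Rust sketch of the point (illustrative only): a UTF-8-aware stage accepts non-ASCII without complaint, so if a downstream tool assumes ASCII, the filtering has to be done explicitly.

  fn main() {
      let input = "naïve café"; // valid UTF-8, but not pure ASCII

      // A UTF-8 tool passes this along happily; a pipeline stage that
      // requires ASCII has to check and filter for itself.
      if !input.is_ascii() {
          let offenders: Vec<char> = input.chars().filter(|c| !c.is_ascii()).collect();
          eprintln!("non-ASCII characters present: {:?}", offenders);
      }
  }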

int_19h 9/13/2025||
The usual statement isn't that UTF-8 is backwards compatible with ASCII (it's obvious that any 8-bit encoding wouldn't be; that's why we have UTF-7!). It's that UTF-8 is backwards compatible with tools that are 8-bit clean.
wrp 9/14/2025||
Yes, the myth I was pointing out is based on loose terminology. It needs to be made clear that "backwards compatible" means that UTF-8 based tools can receive but are not constrained to emit valid ASCII. I see a lot of comments implying that UTF-8 can interact with an ASCII ecosystem without causing problems. Even worse, it seems most Linux developers believe there is no longer a need to provide a default ASCII setting if they have UTF-8.
account42 9/15/2025|||
Do you have an actual example where this causes an issue? "ASCII" tools mostly just passed along non-ASCII bytes unchanged even before UTF-8.
kccqzy 9/13/2025||
That's not a myth about UTF-8. That's a decision by tools not to support pure ASCII.
bruce511 9/12/2025||
While the backward compatibility of utf-8 is nice, and makes adoption much easier, the backward compatibility does not come at any cost to the elegance of the encoding.

In other words, yes it's backward compatible, but UTF-8 is also compact and elegant even without that.
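
To show what that elegance looks like at the bit level, here is a hand-rolled encoder sketch (for illustration only; real code should use char::encode_utf8 or the standard string types): the lead byte's high bits announce the sequence length, and every continuation byte starts with 10, which is what makes the encoding self-synchronizing.

  // Illustrative only: encode a Unicode scalar value to UTF-8 by hand to show
  // the bit layout. A char in Rust can never be a surrogate, so that case is
  // not handled here.
  fn encode_utf8(cp: u32, out: &mut Vec<u8>) {
      match cp {
          0x0000..=0x007F => out.push(cp as u8), // ASCII passes through unchanged
          0x0080..=0x07FF => {
              out.push(0xC0 | (cp >> 6) as u8);   // 110xxxxx
              out.push(0x80 | (cp & 0x3F) as u8); // 10xxxxxx
          }
          0x0800..=0xFFFF => {
              out.push(0xE0 | (cp >> 12) as u8);  // 1110xxxx
              out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
              out.push(0x80 | (cp & 0x3F) as u8);
          }
          0x1_0000..=0x10_FFFF => {
              out.push(0xF0 | (cp >> 18) as u8);  // 11110xxx
              out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
              out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
              out.push(0x80 | (cp & 0x3F) as u8);
          }
          _ => panic!("not a Unicode code point"),
      }
  }

  fn main() {
      let mut buf = Vec::new();
      encode_utf8('€' as u32, &mut buf); // U+20AC
      assert_eq!(buf, "€".as_bytes());   // 0xE2 0x82 0xAC
  }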

nextaccountic 9/12/2025|
UTF-8 also enables this mindblowing design for small string optimization - if the string has 24 bytes or less it is stored inline, otherwise it is stored on the heap (with a pointer, a length, and a capacity - also 24 bytes)

https://github.com/ParkMyCar/compact_str

How cool is that

(Discussed here https://news.ycombinator.com/item?id=41339224)

adgjlsfhk1 9/12/2025||
How is that UTF8 specific?
ubitaco 9/12/2025|||
It's slightly buried in the readme on Github:

> how can we store a 24 byte long string, inline? Don't we also need to store the length somewhere?

> To do this, we utilize the fact that the last byte of our string could only ever have a value in the range [0, 192). We know this because all strings in Rust are valid UTF-8, and the only valid byte pattern for the last byte of a UTF-8 character (and thus the possible last byte of a string) is 0b0XXXXXXX aka [0, 128) or 0b10XXXXXX aka [128, 192)
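
A minimal sketch of that trick (illustrative only, not compact_str's actual layout; store_inline and inline_len are made-up names): because the last byte of any valid UTF-8 string is below 192, the values 192..=255 in the final buffer byte are free to encode metadata such as the inline length.

  const INLINE_CAP: usize = 24;

  // Store a short string inline, tagging its length in the otherwise unused
  // value range of the last byte. Sketch only; compact_str's real layout differs.
  fn store_inline(s: &str) -> Option<[u8; INLINE_CAP]> {
      let bytes = s.as_bytes();
      if bytes.len() > INLINE_CAP {
          return None; // a real implementation would spill to the heap here
      }
      let mut buf = [0u8; INLINE_CAP];
      buf[..bytes.len()].copy_from_slice(bytes);
      if bytes.len() < INLINE_CAP {
          buf[INLINE_CAP - 1] = 192 + bytes.len() as u8; // lengths 0..=23 tagged as 192 + len
      }
      // A full 24-byte string needs no tag: its own last byte is < 192.
      Some(buf)
  }

  fn inline_len(buf: &[u8; INLINE_CAP]) -> usize {
      let last = buf[INLINE_CAP - 1];
      if last >= 192 { (last - 192) as usize } else { INLINE_CAP }
  }

  fn main() {
      let buf = store_inline("héllo").unwrap();
      assert_eq!(inline_len(&buf), "héllo".len()); // 6 bytes
  }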

Dylan16807 9/14/2025||
Any Unicode encoding would allow that.

UTF-32 has an entire spare byte to put flags into. 24 or 21 bit encodings have spare bits that could act as flags. UTF-16 has plenty of invalid code units, or you could use a high surrogate in the last 2 bytes as your flag.

vismit2000 9/13/2025||
Karpathy's "Let's build the GPT Tokenizer" also contains a good introduction to Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32 in the first 20 minutes: https://www.youtube.com/watch?v=zduSFxRajkE
gnufx 9/13/2025|
It's worth noting that Stallman had earlier proposed a design for Emacs "to handle all the world's alphabets and word signs" with similar requirements to UTF-8. That was the etc/CHARACTERS file in Emacs 18.59 (1990). The eventual international support implemented in Emacs 20's MULE was based on ISO-2022, which was a reasonable choice at the time, based on earlier Japanese work. (There was actually enough space in the MULE encoding to add UTF-8, but the implementation was always going to be inefficient with the number of bytes at the top of the code space.)

Edit: see https://raw.githubusercontent.com/tsutsui/emacs-18.59-netbsd...
