Posted by vishnuharidas 18 hours ago

UTF-8 is a brilliant design (iamvishnu.com)
581 points | 242 comments | page 3
Mikhail_Edoshin 6 hours ago|
One aspect of Unicode that is probably not obvious is that with Unicode it is possible to keep using old encodings just fine. You can always get their Unicode equivalents; that is what Unicode was about. Otherwise just keep the data as is, tagged with the encoding. This nicely extends to filesystem "encodings" too.
drpixie 5 hours ago||
UTF-8 is a neat way of encoding 1M+ code points in 8 bit bytes, and including 7 bit ASCII. If only unicode were as neat - sigh. I guess it's way too late to flip unicode versions and start again avoiding the weirdness.
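For anyone who hasn't looked at the bit layout recently, here is a minimal sketch of the scheme being described (Rust's own char::encode_utf8 does this for real; this version also skips the check that rejects the surrogate range, so treat it as an illustration only):

    // Sketch of the UTF-8 byte layout: ASCII passes through unchanged, and
    // larger code points spill into 2-4 bytes whose lead byte encodes the length.
    // (For brevity this skips rejecting the surrogate range U+D800..U+DFFF.)
    fn encode_utf8(cp: u32) -> Vec<u8> {
        match cp {
            0x0000..=0x007F => vec![cp as u8],               // 0xxxxxxx
            0x0080..=0x07FF => vec![
                0b1100_0000 | (cp >> 6) as u8,               // 110xxxxx
                0b1000_0000 | (cp & 0x3F) as u8,             // 10xxxxxx
            ],
            0x0800..=0xFFFF => vec![
                0b1110_0000 | (cp >> 12) as u8,              // 1110xxxx
                0b1000_0000 | ((cp >> 6) & 0x3F) as u8,
                0b1000_0000 | (cp & 0x3F) as u8,
            ],
            0x1_0000..=0x10_FFFF => vec![
                0b1111_0000 | (cp >> 18) as u8,              // 11110xxx
                0b1000_0000 | ((cp >> 12) & 0x3F) as u8,
                0b1000_0000 | ((cp >> 6) & 0x3F) as u8,
                0b1000_0000 | (cp & 0x3F) as u8,
            ],
            _ => panic!("not a Unicode code point"),
        }
    }

    fn main() {
        assert_eq!(encode_utf8('A' as u32), vec![0x41u8]);            // plain ASCII, unchanged
        assert_eq!(encode_utf8('€' as u32), vec![0xE2u8, 0x82, 0xAC]); // U+20AC, 3 bytes
    }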
bruce511 17 hours ago||
While the backward compatibility of UTF-8 is nice, and makes adoption much easier, it does not come at any cost to the elegance of the encoding.

In other words, yes it's backward compatible, but UTF-8 is also compact and elegant even without that.

nextaccountic 17 hours ago|
UTF-8 also enables this mind-blowing design for small string optimization: if the string is 24 bytes or less, it is stored inline; otherwise it is stored on the heap (with a pointer, a length, and a capacity, which together are also 24 bytes).

https://github.com/ParkMyCar/compact_str

How cool is that

(Discussed here https://news.ycombinator.com/item?id=41339224)

adgjlsfhk1 16 hours ago||
How is that UTF8 specific?
ubitaco 14 hours ago|||
It's slightly buried in the readme on GitHub:

> how can we store a 24 byte long string, inline? Don't we also need to store the length somewhere?

> To do this, we utilize the fact that the last byte of our string could only ever have a value in the range [0, 192). We know this because all strings in Rust are valid UTF-8, and the only valid byte pattern for the last byte of a UTF-8 character (and thus the possible last byte of a string) is 0b0XXXXXXX aka [0, 128) or 0b10XXXXXX aka [128, 192)
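In other words, a byte value of 192 or above can never be the last byte of a valid UTF-8 string, so it is free to act as a tag. A rough sketch of the idea (not compact_str's actual layout, just the invariant it leans on; the marker value is made up):

    // The invariant: the last byte of any valid UTF-8 string is < 192,
    // so a 24-byte buffer can use values >= 192 in its final byte as a
    // discriminant for "this is really a heap pointer + length + capacity".
    fn main() {
        for s in ["", "hello", "héllo", "日本語", "🦀"] {
            if let Some(&last) = s.as_bytes().last() {
                assert!(last < 192);
            }
        }

        const HEAP_MARKER: u8 = 0b1100_0000; // 192: impossible as a final UTF-8 byte
        let mut repr = [0u8; 24];            // short strings live directly in here...
        repr[23] = HEAP_MARKER;              // ...unless the last byte says otherwise
        assert!(repr[23] >= 192);
    }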

betimsl 5 hours ago||
The story is that Ken and Rob were at a diner when Ken gave the encoding its structure and wrote the initial encode/decode functions on napkins. UTF-8 is so simple, yet it took a remarkable mind to come up with it.
dolmen 4 hours ago||
Does anyone remember what UTF7.5 or UTF7,5 was? I can't find references to its description(s)...
dolmen 4 hours ago|
Finally found a description here: http://www.czyborra.com/utf/
mikelabatt 14 hours ago||
Nice article, thank you. I love UTF-8, but I only advocate it when used with a BOM. Otherwise, an application may have no way of knowing that it is UTF-8, and that it needs to be saved as UTF-8.

Imagine selecting New/Text Document in an environment like File Explorer on Windows: if the initial (empty) file has a BOM, any app will know that it is supposed to be saved again as UTF-8 once you start working on it. But with no BOM, there is no such luck, and corruption may be just around the corner, even when the editor tries to auto-detect the encoding (auto-detection is never easy or 100% reliable, even for basic Latin text with "special" characters).

The same can happen to a plain ASCII file (without a BOM): once you edit it and add, say, an accented vowel, the chaos begins. You thought it was Italian, but your favorite text editor might conclude it's Vietnamese! I've even seen Notepad switch to a different default encoding after some Windows updates.

So, UTF-8 yes, but with a BOM. It should be the default in any app and operating system.

rmunn 4 hours ago||
The fact that you advocate using a BOM with UTF-8 tells me that you run Windows. Any long-term Unix user has probably seen this error message before (copied and pasted from an issue report I filed just 3 days ago):

    bash: line 1:  #!/bin/bash: No such file or directory
If you've got any experience with Linux, you probably suspect the problem already. If your only experience is with Windows, you might not realize the issue. There's an invisible U+FEFF lurking before the `#!`. So instead of that shell script starting with the `#!` character pair that tells the Linux kernel "The application after the `#!` is the application that should parse and run this file", it actually starts with `<FEFF>#!`, which has no meaning to the kernel. The way this script was invoked meant that Bash did end up running the script, with only one error message (because the line did not start with `#` and therefore it was not interpreted as a Bash comment) that didn't matter to the actual script logic.
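To make the failure concrete, here's a minimal sketch (in Rust; the exact bytes are the point, not the language):

    // The kernel's script handler looks for the literal bytes 0x23 0x21 ("#!")
    // at the very start of the file. A UTF-8 BOM (EF BB BF) pushes them to
    // offset 3, so the file is no longer recognized as a script.
    fn main() {
        let without_bom: &[u8] = b"#!/bin/bash\n";
        let with_bom: &[u8] = b"\xEF\xBB\xBF#!/bin/bash\n";
        assert!(without_bom.starts_with(b"#!"));
        assert!(!with_bom.starts_with(b"#!")); // first bytes are EF BB BF, not 23 21
    }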

This is one of the more common problems caused by putting a BOM in UTF-8 files, but there are others. The issue is that adding a BOM, as can be seen here, *breaks the promise of UTF-8*: that a UTF-8 file containing only codepoints at or below U+007F can be processed as-is, and legacy logic that assumes ASCII will parse it correctly. The Linux kernel is perfectly aware of UTF-8, of course, as is Bash. But the kernel logic that looks for `#!`, and the Bash logic that looks for a leading `#` as a comment indicator to ignore the line, do *not* assume a leading U+FEFF can be ignored, nor should they (for many reasons).

What should happen is that these days, every application should assume UTF-8 if it isn't informed of the format of the file, unless and until something happens to make it believe it's a different format (such as reading a UTF-16 BOM in the first two bytes of the file). If a file fails to parse as UTF-8 but there are clues that point to another encoding, reparsing it as something else (like Windows-1252) might be sensible.

But putting a BOM in UTF-8 causes more problems than it solves, because it *breaks* the fundamental promise of UTF-8: ASCII compatibility with Unicode-unaware logic.

3036e4 2 hours ago||
Also some XML parsers I used choked on UTF-8 BOMs. Not sure if valid XML is allowed to have anything other than clean ASCII in the first few characters before declaring what the encoding is?
rmunn 1 hour ago|||
My search also turned this up:

https://x.com/jbogard/status/1111328911609217025

If that link doesn't work, then try:

https://xcancel.com/jbogard/status/1111328911609217025

Source (which will explain the joke for anyone who didn't get it immediately):

https://www.jimmybogard.com/the-curious-case-of-the-json-bom...

rmunn 1 hour ago|||
Oh wow. Doing a quick search I found https://www.xml.com/axml/notes/BOM.html ... which is marked © 1998.

Not ALL of the 20th-century Internet has bit-rotted and fallen apart yet. (Just most of it).

taffer 1 hour ago|||
I respectfully disagree. The BOM is a Windows-specific idiosyncrasy resulting from its early adoption of UTF-16. In the Unix world, a BOM is unexpected and causes problems with many programs, such as GCC, PHP and XML parsers. Don't use it!

The correct approach is to use and assume UTF-8 everywhere. 99% of websites use UTF-8. There is no reason to break software by adding a BOM.

Cloudef 11 hours ago||
BOM is awful as it breaks concatenation. In the modern world, everything should just be assumed to be UTF-8 by default.
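A quick illustration of the concatenation point (a sketch using Rust string literals in place of actual files):

    // `cat a.txt b.txt` with two BOM-prefixed UTF-8 files leaves the second
    // file's U+FEFF stranded in the middle of the output, where it is just a
    // stray zero-width character that most tools won't expect.
    fn main() {
        let a = "\u{FEFF}Hello, ".to_string(); // contents of a.txt
        let b = "\u{FEFF}world\n".to_string(); // contents of b.txt
        let joined = a + &b;                   // what naive concatenation produces
        assert_eq!(joined.matches('\u{FEFF}').count(), 2); // one BOM is now mid-text
    }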
vismit2000 5 hours ago||
UTF-8 Everywhere Manifesto: https://utf8everywhere.org/
billforsternz 15 hours ago||
A little off topic, but amidst a lot of discussion of UTF-8 and its ASCII compatibility property I'm going to mention my one gripe with ASCII, something I never see anyone talking about, something I've never talked about before: the damn 0x7f character. Such an annoying anomaly in every conceivable way. It would be much better if it were some other proper printable punctuation or punctuation-adjacent character. A copyright character. Or a pi character, or just about anything other than what it already is. I have been programming and studying packet dumps long enough that I can basically convert hex to ASCII and vice versa in my head, but I still recoil at this anomalous character (DELETE? is that what I should call it?) every time.
kragen 15 hours ago||
Much better in every way except the one that mattered most: being able to correct punching errors in a paper tape without starting over.

I don't know if you have ever had to use White-Out to correct typing errors on a typewriter that lacked the ability natively, but before White-Out, the only option was to start typing the letter again, from the beginning.

0x7f was White-Out for punched paper tape: it allowed you to strike out an incorrectly punched character so that the message, when it was sent, would print correctly. ASCII inherited it from the Baudot–Murray code.
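The reason it works as white-out is worth spelling out: DEL is 0x7F, i.e. all seven bits set, and a hole once punched can't be unpunched, so any mispunched character can always be over-punched into DEL, which the receiving equipment was expected to ignore. A trivial sketch:

    // Over-punching can only add holes (set bits), never remove them, and
    // 0x7F already has every hole punched, so any character can become DEL.
    fn main() {
        let mispunched: u8 = b'4';           // 0b0110100, already on the tape
        let overpunched = mispunched | 0x7F; // punch every remaining hole
        assert_eq!(overpunched, 0x7F);       // the tape now reads as DEL here
    }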

It's been obsolete since people started punching their tapes on computers instead of Teletypes and Flexowriters, so around 01975, and maybe before; I don't know if there was a paper-tape equivalent of a duplicating keypunch, but that would seem to eliminate the need for the delete character. Certainly TECO and cheap microcomputers did.

billforsternz 9 hours ago||
Nice, thanks.
Agraillo 3 hours ago||
Related: Why is there a "small house" in IBM's Code page 437? (glyphdrawing.club) [1]. There are other interesting articles mentioned in the discussion. m_walden would probably comment here himself.

[1] https://news.ycombinator.com/item?id=43667010

blindriver 14 hours ago||
It took time for UTF-8 to make sense. The sheer size of everything was a real problem just after the turn of the century. Today it makes more sense because capacity and compute power are much greater, but back then it was a huge pain in the ass.
Mikhail_Edoshin 9 hours ago|
I once saw a good byte encoding for Unicode: 7 bits for data, 1 for continuation/stop. This gives 21 bits for data, which is enough for the whole range. ASCII compatible, at most 3 bytes per character. Very simple: the description is sufficient to implement it.
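For what it's worth, the described scheme is straightforward to sketch (the byte order of the 7-bit groups is my assumption; the description above doesn't specify it):

    // Each byte carries 7 payload bits; the high bit is 1 on every byte except
    // the last. ASCII stays one byte, and 3 bytes cover all 21 bits of Unicode.
    fn encode(cp: u32) -> Vec<u8> {
        assert!(cp <= 0x10FFFF);
        let mut groups = vec![(cp & 0x7F) as u8]; // least-significant 7 bits
        let mut rest = cp >> 7;
        while rest != 0 {
            groups.push((rest & 0x7F) as u8);
            rest >>= 7;
        }
        let n = groups.len();
        groups
            .into_iter()
            .rev()                                // most-significant group first
            .enumerate()
            .map(|(i, g)| if i + 1 < n { g | 0x80 } else { g })
            .collect()
    }

    fn main() {
        assert_eq!(encode(0x41), vec![0x41u8]);                 // 'A': plain ASCII
        assert_eq!(encode(0x10FFFF), vec![0xC3u8, 0xFF, 0x7F]); // highest code point
    }

One trade-off worth noting: as described, nothing stops an encoder from emitting overlong forms (e.g. 'A' as the two bytes 0x80 0x41), which UTF-8's lead-byte design deliberately rules out.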
rmunn 3 hours ago|
Probably a good idea, but when UTF-8 was designed the Unicode committee had not yet made the mistake of limiting the character range to 21 bits. (Going into why it's a mistake would make this comment longer than it's worth, so I'll only expound on it if anyone asks me to). And at this point it would be a bad idea to switch away from the format that is now, finally, used in over 99% of all documents online. The gain would be small (not zero, but small) and the cost would be immense.