Posted by vishnuharidas 9/12/2025

UTF-8 is a brilliant design (iamvishnu.com)
849 points | 348 comments
quotemstr 9/12/2025|
Great example of a technology you get from a brilliant guy with a vision and that you'll never get out of a committee.
billforsternz 9/12/2025||
A little off topic, but amidst a lot of discussion of UTF-8 and its ASCII compatibility property I'm going to mention my one gripe with ASCII, something I never see anyone talking about and have never talked about before: the damn 0x7f character. Such an annoying anomaly in every conceivable way. It would be much better if it were some other proper printable punctuation or punctuation-adjacent character. A copyright character. Or a pi character, or just about anything other than what it already is. I have been programming and studying packet dumps long enough that I can basically convert hex to ASCII and vice versa in my head, but I still recoil at this anomalous character (DELETE? is that what I should call it?) every time.
kragen 9/12/2025||
Much better in every way except the one that mattered most: being able to correct punching errors in a paper tape without starting over.

I don't know if you have ever had to use White-Out to correct typing errors on a typewriter that lacked the ability natively, but before White-Out, the only option was to start typing the letter again, from the beginning.

0x7f was White-Out for punched paper tape. Because 0x7f is all ones, punching out every hole in a mistyped column turns any character into DEL, which printing equipment ignored, so the message, when it was sent, would print correctly. ASCII inherited it from the Baudot–Murray code.

It's been obsolete since people started punching their tapes on computers instead of Teletypes and Flexowriters, so around 01975, and maybe before; I don't know if there was a paper-tape equivalent of a duplicating keypunch, but that would seem to eliminate the need for the delete character. Certainly TECO and cheap microcomputers did.

billforsternz 9/13/2025||
Nice, thanks.
Agraillo 9/13/2025||
Related: Why is there a “small house” in IBM's Code page 437? (glyphdrawing.club) [1]. There are other interesting articles mentioned in the discussion. m_walden would probably comment here himself.

[1] https://news.ycombinator.com/item?id=43667010

billforsternz 9/13/2025||
Thanks, interesting.
Mikhail_Edoshin 9/13/2025||
I once saw a good byte encoding for Unicode: 7 bits for data, 1 for continuation/stop. Three bytes give 21 bits of data, which is enough for the whole range. ASCII compatible, at most 3 bytes per character. Very simple: the description alone is sufficient to implement it.
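
For the curious, here is a minimal sketch of the scheme described above (the function names are mine; this is an illustration, not any standard's API). The high bit of each byte marks continuation, the final byte has it clear, and data is laid out big-endian, so plain ASCII bytes encode themselves:

    # 7 data bits per byte; high bit set = more bytes follow (a classic VLQ).
    def vlq_encode(cp: int) -> bytes:
        out = [cp & 0x7F]                   # final byte: stop bit clear
        cp >>= 7
        while cp:
            out.append((cp & 0x7F) | 0x80)  # earlier bytes: continuation bit set
            cp >>= 7
        return bytes(reversed(out))

    def vlq_decode(data: bytes) -> int:
        cp = 0
        for b in data:
            cp = (cp << 7) | (b & 0x7F)
            if not (b & 0x80):              # stop bit reached
                break
        return cp

    assert vlq_encode(ord("A")) == b"A"                   # ASCII-compatible
    assert vlq_decode(vlq_encode(0x10FFFF)) == 0x10FFFF   # 21 bits in 3 bytes

One trade-off versus UTF-8: a final byte here looks exactly like a one-byte character, so byte-wise substring searches can false-match inside longer sequences, and you can't tell a sequence's length from its first byte — properties UTF-8's distinct lead/continuation byte patterns were designed to provide.
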
rmunn 9/13/2025||
Probably a good idea, but when UTF-8 was designed the Unicode committee had not yet made the mistake of limiting the character range to 21 bits. (Going into why it's a mistake would make this comment longer than it's worth, so I'll only expound on it if anyone asks me to). And at this point it would be a bad idea to switch away from the format that is now, finally, used in over 99% of all documents online. The gain would be small (not zero, but small) and the cost would be immense.
int_19h 9/13/2025||
Didn't they limit the range to 21 bits because UTF-16 has that limitation?
rmunn 9/15/2025||
That is indeed why they limited it, but that was a mistake. I want to call UTF-16 a mistake all on its own, but since its 16-bit lineage (UCS-2) predated UTF-8, I can't entirely do so. But limiting the Unicode range to only what's allowed in UTF-16 was shortsighted. They should, instead, have allowed UTF-8 to continue to address 31 bits, and if the standard grew past 21 bits, then UTF-16 would be deprecated. (Going into depth would take an essay, and at this point nobody cares about hearing it, so I'll refrain.)
gpvos 9/15/2025||
I suppose it's still possible to extend to 31 bits in the future, once UTF-16 has become obsolete enough. How big is the need for it right now?
rmunn 9/16/2025|||
Interestingly, in theory UTF-8 could be extended to 36 bits: the FLAC format uses an encoding similar to UTF-8 but extended to allow up to 36 bits (which takes seven bytes) to encode frame numbers: https://www.ietf.org/rfc/rfc9639.html#section-9.1.5

This means that frame numbers in a FLAC file can go up to 2^36-1, so a FLAC file can have up to 68,719,476,735 frames. If it were recorded at a 48kHz sample rate, with 48,000 frames per second, a FLAC file could (in theory) be about 1.43 million seconds long, or roughly 16.6 days.

So if Unicode ever needs to encode 68.7 billion characters, well, extended seven-byte UTF-8 will be ready and waiting. :-D
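
For illustration, here is a sketch of that extended encoding following the byte patterns in RFC 9639 (my own code, not FLAC's reference implementation): each extra byte adds one leading 1 bit to the first byte, and the seven-byte form is 0xFE followed by six continuation bytes, giving 6 × 6 = 36 payload bits:

    def encode_extended_utf8(n: int) -> bytes:
        if n < 0x80:                        # 1 byte: 0xxxxxxx
            return bytes([n])
        for nbytes in range(2, 8):          # nbytes bytes carry 5*nbytes+1 bits
            if n < (1 << (5 * nbytes + 1)):
                break
        else:
            raise ValueError("needs more than 36 bits")
        out = bytearray(nbytes)
        for i in range(nbytes - 1, 0, -1):  # continuation bytes: 10xxxxxx
            out[i] = 0x80 | (n & 0x3F)
            n >>= 6
        out[0] = ((0xFF << (8 - nbytes)) & 0xFF) | n  # lead: nbytes 1s, a 0, data
        return bytes(out)

    assert encode_extended_utf8(0x41) == b"\x41"                  # plain ASCII
    assert encode_extended_utf8(0x10FFFF) == b"\xf4\x8f\xbf\xbf"  # matches real UTF-8
    assert encode_extended_utf8(2**36 - 1) == b"\xfe\xbf\xbf\xbf\xbf\xbf\xbf"
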

gpvos 9/16/2025||
See my comment on how Perl stores up to 2^63-1 in a UTF-8-like format: https://news.ycombinator.com/item?id=45227396
account42 9/15/2025|||
The problem is that now there are a bunch of UTF-8 tools that won't handle code points beyond 21 bits.
gpvos 9/15/2025||
Fair enough, it will take some time to weed those out.
restalis 9/13/2025||
This fits your description: https://en.wikipedia.org/wiki/Variable-length_quantity
blindriver 9/12/2025||
It took time for UTF-8 to make sense. The sheer size of everything was a real problem just after the turn of the century. Today it makes more sense because capacity and compute power are much greater, but back then it was a huge pain in the ass.
gpvos 9/15/2025|
It made much more sense than UTF-16 or any of the existing multi-byte character sets, and the need for more than 256 characters had been apparent for decades. Given its simplicity, it made perfect sense almost immediately.
blindriver 9/15/2025||
No, it didn't. Not at the time. Like I said, processing and storage were a pain back around 2000. Windows supported UCS-2 (the predecessor to UTF-16), which was fixed-width 16-bit and faster to encode and decode, and since most of the world was on Windows at the time, it made more sense to use UCS-2. Also, the world was only beginning to be more connected, so UTF-8 seemed like overkill.

NOW, in hindsight, it makes more sense to use UTF-8, but it wasn't clear 20 years ago that it was worth it.

acdha 9/16/2025|||
The need was clear even 30 years ago when UTF-16 was standardized in 1996. UCS-2 was known at the time to be inadequate, but there was a period from the mid-80s to early 90s when western developers tried to pretend that they needed to support only a tiny fraction of the characters in Asian languages like Chinese (>50k characters, even if Han unification were uncontroversial), scholarly and technical usage, etc. The language used in 1988 [1] was “Unicode aims in the first instance at the characters published in modern text (e.g. in the union of all newspapers and magazines printed in the world in 1988)”, with the idea that other characters could be punted into a private registry.

Once enough people accepted that this approach was impractical, UCS-2 was replaced with UTF-16 and surrogate codes. At that point it was clear that UTF-8 was better in almost every scenario because neither had an advantage for random access and UTF-8 was usually substantially smaller.

1. https://unicode.org/history/unicode88.pdf

gpvos 9/15/2025|||
Maybe if you were entrenched in the Windows world.

Storage-wise, UTF-8 is usually better since so much data is ASCII with maybe the occasional accented character. The speed issue only really mattered on Windows NT, since it was UCS-2 inside, but that wasn't a problem for many.

zamalek 9/12/2025||
Even for varints (you could probably drop the intermediate prefixes for that). There are many examples of using SIMD to decode UTF-8, whereas the more common protobuf scheme is known to be hostile to SIMD and the branch predictor.
camel-cdr 9/13/2025|
Yeah, protobuf's varints are quite hard to decode with current SIMD instructions, but it would be quite easy if we get element-wise pext/pdep instructions in the future. (SVE2 already has those, but who has SVE2?)
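
A scalar sketch of the difference (the table and helper names are mine, for illustration): a UTF-8 decoder learns a sequence's length from its first byte alone, while a protobuf varint decoder must test a continuation bit on every byte, which is what frustrates SIMD and the branch predictor:

    # UTF-8: sequence length is a pure function of the lead byte (0 = invalid lead).
    UTF8_LEN = [1]*128 + [0]*64 + [2]*32 + [3]*16 + [4]*8 + [0]*8

    def utf8_seq_len(first_byte: int) -> int:
        return UTF8_LEN[first_byte]     # one table lookup, no data-dependent loop

    # Protobuf varint: the end only emerges by inspecting byte after byte.
    def varint_len(buf: bytes, pos: int) -> int:
        n = 0
        while buf[pos + n] & 0x80:      # data-dependent branch per byte
            n += 1
        return n + 1

Vectorized UTF-8 decoders exploit the first property to classify a whole register of lead bytes at once; element-wise pext/pdep would give varints a comparable trick.
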
sawyna 9/12/2025||
I have always wondered - what if the utf-8 space is filled up? Does it automatically promote to having a 5th byte? Is that part of the spec? Or are we then talking about utf-16?
vishnuharidas 9/12/2025||
UTF-8 can represent Unicode's full range of 1,114,112 code points (U+0000 through U+10FFFF). And in Unicode 15.1 (2023, https://www.unicode.org/versions/Unicode15.1.0/) a total of 149,813 characters are included, which covers most of the world's languages, scripts, and emojis. That leaves roughly 964K code points for future expansion.

So, it won't fill up during our lifetime I guess.

jaza 9/12/2025|||
I wouldn't be too quick to jump to that conclusion, we could easily shove another 960k emojis into the spec!
BeFlatXIII 9/13/2025||
Black Santa with 1 freckle, Black Santa with 2 freckles…
unnouinceput 9/13/2025|||
Wait until we get to know another species; then we will not just fill that Unicode space, we will ditch any UTF-16 compatibility so fast it will make your head spin on a swivel.

Imagine the code points we'll need to represent an alien culture :).

crazygringo 9/13/2025|||
Nothing is automatic.

If we ever needed that many characters, yes, the most obvious solution would be a fifth byte. The standard would need to be explicitly extended, though.

But that would probably require having encountered literate extraterrestrial species to collect enough new alphabets to fill up all the available code points first. So seems like it would be a pretty cool problem to have.

kzrdude 9/12/2025|||
UTF-8 is just an encoding of Unicode. It is specified so that it can encode all Unicode code points up to 0x10FFFF; it doesn't extend further. And UTF-16 encodes Unicode the same way: it doesn't encode anything more.

So what would need to happen first is that Unicode decides to include larger code points. Then UTF-8 would need to be extended to encode them. (But I don't think that will happen.)

It seems like Unicode code points are less than 30% allocated, roughly. So there's over 70% free space.

---

Think of these three separate concepts to make it clear. We are effectively dealing with two translations: one from the abstract symbol to a defined Unicode code point, and one from that code point to bytes via UTF-8.

1. The glyph or symbol ("A")

2. The Unicode code point for the symbol (U+0041 Latin Capital Letter A)

3. The UTF-8 encoding of the code point, as bytes (0x41)
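
The same three layers in a quick demo (Python here is just a convenient way to poke at them):

    s = "A"
    print(f"U+{ord(s):04X}")          # -> U+0041, the code point behind the glyph
    print(s.encode("utf-8").hex())    # -> 41, the UTF-8 bytes for that code point

    # One code point can take several bytes:
    print(f"U+{ord('€'):04X}")        # -> U+20AC
    print("€".encode("utf-8").hex())  # -> e282ac (three bytes)
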

duskwuff 9/12/2025||
As an aside: UTF-8, as originally specified in RFC 2279, could encode codepoints up to U+7FFFFFFF (using sequences of up to six bytes). It was later restricted to U+10FFFF to ensure compatibility with UTF-16.
akoboldfrying 9/13/2025||
I take it you could choose to encode a code point using a larger number of bytes than are actually needed? E.g., you could encode "A" using 1, 2, 3 or 4 bytes?

Because if so: I don't really like that. It would mean that "equal sequence of code points" does not imply "equal sequence of encoded bytes" (the converse continues to hold, of course), while offering no advantage that I can see.
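
Your instinct matches the spec here: the multi-byte patterns could carry small code points arithmetically, but RFC 3629 mandates the shortest form, and conforming decoders must reject "overlong" sequences. The classic motivation was the two-byte 0xC0 0xAF form of "/", once used to sneak path separators past security filters. A quick check against Python's built-in decoder:

    print(b"/".decode("utf-8"))      # fine: '/' is U+002F, one byte, shortest form

    try:
        b"\xc0\xaf".decode("utf-8")  # overlong encoding of the same code point
    except UnicodeDecodeError as e:
        print("rejected:", e)        # conforming decoders must refuse it
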

fmajid 9/12/2025||
Well, yes, Ken Thompson, the father of Unix, is behind it.
dpc_01234 9/12/2025||
UTF-8 is undeniably a good answer, but to a relatively simple bit-twiddling / variable-length integer encoding problem in a somewhat specific context.

I realize that hindsight is 20/20, and times were different, but let's face it: "how to use an unused top bit to best encode a larger number representing Unicode" is not that much of a challenge, and the space of practical solutions isn't all that large.

Tuna-Fish 9/12/2025||
Except that there were many different solutions before UTF-8, all of which sucked really badly.

UTF-8 is the best kind of brilliant. After you've seen it, you (and I) think of it as obvious, and clearly the solution any reasonable engineer would come up with. Except that it took a long time for it to be created.

ivanjermakov 9/12/2025||
I just realised that all Latin text is wasting 12.5% of storage/memory/bandwidth on an always-zero MSB. At least it compresses well. Is there any technology that utilizes the 8th bit for something useful, e.g. error checking?
tmiku 9/12/2025||
See mort96's comments about 7-bit ASCII and parity bits (https://news.ycombinator.com/item?id=45225911). Kind of archaic now, though - 8-bit bytes with the error checking living elsewhere in the stack seems to be preferred.
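
For the record, this is what classic "7E1" serial framing did: seven data bits plus one even-parity bit in the MSB. A small sketch (helper names are mine):

    def add_even_parity(byte7: int) -> int:
        parity = bin(byte7 & 0x7F).count("1") & 1
        return (byte7 & 0x7F) | (parity << 7)  # MSB makes the total 1-bit count even

    def strip_and_check(byte8: int) -> int:
        if bin(byte8).count("1") & 1:          # odd 1-bit count: something flipped
            raise ValueError("parity error")
        return byte8 & 0x7F                    # recover the 7-bit ASCII

    assert strip_and_check(add_even_parity(ord("A"))) == ord("A")

It catches any single-bit flip but not double flips, which is part of why error checking moved to stronger checksums elsewhere in the stack.
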
Mikhail_Edoshin 9/13/2025|
One aspect of Unicode that is probably not obvious is that with Unicode it is possible to keep using old encodings just fine. You can always get their Unicode equivalents; this is what Unicode was about. Otherwise, just keep the data as is, tagged with its encoding. This nicely extends to filesystem "encodings" too.
Mikhail_Edoshin 9/14/2025|
For example, modern Python internally uses three forms (Latin-1, UCS-2, and UCS-4, per PEP 393) depending on the contents of the string. But this can be done for all encodings, and also for things like file names that do not follow Unicode. The Unicode standard does not dictate that everything must take the same form; it can be used to keep existing forms while making them compatible.
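
You can see CPython's flexible representation from string sizes; each string is stored at the width of its widest character:

    import sys

    # CPython (PEP 393) picks 1-, 2-, or 4-byte storage per character,
    # per string, based on the widest code point the string contains.
    for s in ["abcd", "abc\u00e9", "abc\u20ac", "abc\U0001F600"]:
        print(repr(s), sys.getsizeof(s), "bytes")

The exact byte counts vary by Python version, but each step up in character width bumps per-character storage from 1 to 2 to 4 bytes.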