Posted by vishnuharidas 20 hours ago

UTF-8 is a brilliant design (iamvishnu.com)
633 points | 250 comments
sawyna 17 hours ago|
I have always wondered - what if the utf-8 space is filled up? Does it automatically promote to having a 5th byte? Is that part of the spec? Or are we then talking about utf-16?
vishnuharidas 17 hours ago||
UTF-8 can represent all 1,114,112 Unicode code points. And in Unicode 15.1 (2023, https://www.unicode.org/versions/Unicode15.1.0/) a total of 149,813 characters are assigned, which covers most of the world's languages, scripts, and emojis. That leaves roughly 964,000 code points (about 960K) for future expansion.

So, it won't fill up during our lifetime I guess.

jaza 15 hours ago|||
I wouldn't be too quick to jump to that conclusion, we could easily shove another 960k emojis into the spec!
BeFlatXIII 37 minutes ago||
Black Santa with 1 freckle, Black Santa with 2 freckles…
unnouinceput 7 hours ago|||
Wait until we get to know another species; then we will not just fill that Unicode space, but we will ditch any UTF-16 compatibility so fast it will make your head spin on a swivel.

Imagine the code points we'll need to represent an alien culture :).

crazygringo 13 hours ago|||
Nothing is automatic.

If we ever needed that many characters, yes the most obvious solution would be a fifth byte. The standard would need to be explicitly extended though.

But that would probably require having encountered literate extraterrestrial species to collect enough new alphabets to fill up all the available code points first. So seems like it would be a pretty cool problem to have.

kzrdude 17 hours ago|||
UTF-8 is just an encoding of Unicode. UTF-8 is specified in a way so that it can encode all Unicode codepoints up to 0x10FFFF. It doesn't extend further. And UTF-16 also encodes Unicode in a similar way; it doesn't encode anything more.

So what would need to happen first would be that unicode decides they are going to include larger codepoints. Then UTF-8 would need to be extended to handle encoding them. (But I don't think that will happen.)

It seems like Unicode codepoints are less than 30% allocated, roughly. So there's over 70% of the space still free.

---

Think of these three separate concepts to make it clear. We are effectively dealing with two translations: one from the abstract symbol to a defined Unicode code point, then from that code point to bytes via the UTF-8 encoding.

1. The glyph or symbol ("A")

2. The unicode code point for the symbol (U+0041 Latin Capital Letter A)

3. The utf-8 encoding of the code point, as bytes (0x41)
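
For illustration, a quick Python sketch of those two translations (the euro-sign example is mine):

    symbol = "A"                      # 1. the glyph/symbol
    code_point = ord(symbol)          # 2. the Unicode code point, 0x41 (U+0041)
    encoded = symbol.encode("utf-8")  # 3. the UTF-8 bytes, a single byte 0x41

    print(hex(code_point))            # 0x41
    print(encoded.hex())              # 41

    # A non-ASCII example: "€" is code point U+20AC but three bytes in UTF-8.
    print(hex(ord("€")))              # 0x20ac
    print("€".encode("utf-8").hex())  # e282ac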

duskwuff 15 hours ago||
As an aside: UTF-8, as originally specified in RFC 2279, could encode codepoints up to U+7FFFFFFF (using sequences of up to six bytes). It was later restricted to U+10FFFF to ensure compatibility with UTF-16.
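
A back-of-the-envelope check of those limits (my own sketch, not text from either RFC):

    # Payload bits per sequence length in the original RFC 2279 scheme
    # (0xxxxxxx up through 1111110x followed by five 10xxxxxx bytes):
    for length, bits in [(1, 7), (2, 11), (3, 16), (4, 21), (5, 26), (6, 31)]:
        print(f"{length} byte(s) -> max {hex(2**bits - 1)}")
    # 6 bytes -> 0x7fffffff, the old U+7FFFFFFF ceiling; RFC 3629 later
    # capped sequences at 4 bytes and code points at U+10FFFF.
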
KingLancelot 17 hours ago||
[dead]
Dwedit 18 hours ago||
Meanwhile Shift-JIS has a bad design, since the second byte of a character can be any byte in the ASCII range 0x40-0x7E (among others). This includes brackets, backslash, caret, backquote, curly braces, pipe, and tilde. This can cause a path separator or math operator to appear in text that is encoded as Shift-JIS but interpreted as plain ASCII.

UTF-8 basically learned from the mistakes of previous encodings which allowed that kind of thing.
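
A concrete illustration (my own sketch, relying on Python's built-in codecs):

    # "表" (U+8868) encodes in Shift-JIS with 0x5C -- the ASCII backslash --
    # as its second byte, the classic source of breakage when the bytes are
    # read as plain ASCII. UTF-8 never does this: every continuation byte
    # has the high bit set (0x80-0xBF).
    print("表".encode("shift_jis").hex())  # 955c
    print("表".encode("utf-8").hex())      # e8a1a8
    assert all(b >= 0x80 for b in "表".encode("utf-8")[1:])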

dpc_01234 19 hours ago||
UTF-8 is undeniably a good answer, but to a relatively simple bit-twiddling / variable-length integer encoding problem in a somewhat specific context.

I realize that hindsight is 20/20, and times were different, but let's face it: "how to use an unused top bit to best encode a larger number representing Unicode" is not that much of a challenge, and the space of practical solutions isn't even all that large.
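
For reference, the whole bit-twiddling scheme fits in a few lines. A sketch of my own (modern 4-byte UTF-8, ignoring the surrogate exclusion for brevity):

    def utf8_encode(cp: int) -> bytes:
        # Lead byte encodes the sequence length; continuation bytes are 10xxxxxx.
        if cp < 0x80:
            return bytes([cp])
        if cp < 0x800:
            return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
        if cp < 0x10000:
            return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
        return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                      0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

    assert utf8_encode(0x41) == "A".encode("utf-8")
    assert utf8_encode(0x20AC) == "€".encode("utf-8")
    assert utf8_encode(0x1F600) == "😀".encode("utf-8")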

Tuna-Fish 18 hours ago||
Except that there were many different solutions before UTF-8, all of which sucked really badly.

UTF-8 is the best kind of brilliant. After you've seen it, you (and I) think of it as obvious, and clearly the solution any reasonable engineer would come up with. Except that it took a long time for it to be created.

ivanjermakov 18 hours ago||
I just realised that all Latin text is wasting about 12% of storage/memory/bandwidth on an MSB that is always zero. At least it compresses well. Is there any technology that utilizes the 8th bit for something useful, e.g. error checking?
tmiku 17 hours ago||
See mort96's comments about 7-bit ASCII and parity bits (https://news.ycombinator.com/item?id=45225911). Kind of archaic now, though - 8-bit bytes with the error checking living elsewhere in the stack seems to be preferred.
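
A toy illustration of that old trick (my own sketch; the helper name is made up):

    def with_even_parity(byte7: int) -> int:
        # Pack a 7-bit ASCII value plus an even-parity bit into the spare MSB,
        # the way old serial links used the eighth bit.
        parity = bin(byte7).count("1") & 1
        return byte7 | (parity << 7)

    print(hex(with_even_parity(ord("A"))))  # 0x41 -- two 1-bits, parity bit stays 0
    print(hex(with_even_parity(ord("C"))))  # 0xc3 -- three 1-bits, parity bit set
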
dmz73 11 hours ago||
UTF-8 is a horrible design. The only reason it was widely adopted was backwards compatibility with ASCII. There are a large number of invalid byte combinations that have to be discarded. Parsing forward is complex even before taking invalid byte combinations into account, and parsing backwards is even worse. Compare that to UTF-16, where parsing forward and backwards is simpler than UTF-8, and if there is an invalid surrogate combination, one can assume it is a valid UCS2 char.
moefh 10 hours ago|
UTF-16 is an abomination. It's only easy to parse because it's artificially limited to 1 or 2 code units. It's an ugly hack that requires reserving 2048 code points ("surrogates") from the Unicode table just for the encoding itself.

It's also the reason why Unicode has a limit of about 1.1 million code points: without UTF-16, we could have over 2 billion (which is the UTF-8 limit).
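
The arithmetic behind that limit, as a sketch (my own code; the split is the standard UTF-16 surrogate formula):

    def to_surrogate_pair(cp: int) -> tuple[int, int]:
        # UTF-16 encodes code points above U+FFFF as two 16-bit code units.
        assert 0x10000 <= cp <= 0x10FFFF
        v = cp - 0x10000                          # 20 payload bits
        return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)

    print([hex(u) for u in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
    # 20 payload bits give 0x100000 supplementary code points; add the BMP's
    # 0x10000 and you get the 0x110000 (1,114,112) ceiling mentioned above.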

digianarchist 12 hours ago||
I read online that codepoints are formatted with 4 hex chars for historical reasons. U+41 (Latin A) is formatted as U+0041.
sjapkee 5 hours ago||
Until you interact with it as a programmer
gritzko 8 hours ago||
I specialize in protocol design, unfortunately. A while ago I had to code some Unicode conversion routines from scratch and I must say I absolutely admire UTF-8. Unicode per se is a dumpster fire, likely because of objective reasons. Dealing with multiple Unicode encodings is a minefield. I even made an angry write-up back then https://web.archive.org/web/20231001011301/http://replicated...

UTF-8 made it all relatively neat back in the day. There are still ways to throw a wrench into the gears. For example, how do you handle UTF-8 encoded surrogate pairs? But at least one can filter that out as suspicious/malicious behavior.

sedatk 6 hours ago||
> For example, how do you handle UTF-8 encoded surrogate pairs?

Surrogate pairs aren't applicable to UTF-8. That part of the Unicode table is simply invalid in UTF-8 and should be treated as such (as a parsing error or as invalid characters, etc.).
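
A quick check (my own sketch) that a strict decoder does exactly that; U+D83D, a high surrogate, would appear on the wire as ED A0 BD:

    try:
        bytes([0xED, 0xA0, 0xBD]).decode("utf-8")
    except UnicodeDecodeError as e:
        print(e)   # 'utf-8' codec can't decode byte 0xed ...
    # Lenient CESU-8 / WTF-8 style decoders accept such sequences instead,
    # which is where implementations start to diverge.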

gritzko 6 hours ago||
In theory, yes. In practice, there are throngs of parsers and converters that might handle such cases differently. https://seriot.ch/projects/parsing_json.html
cryptonector 57 minutes ago||
> Unicode per se is a dumpster fire

Maybe as to emojis, but otherwise, no, Unicode is not a dumpster fire. Unicode is elegant, and all the things that people complain about in Unicode are actually problems in human scripts.

zamalek 19 hours ago||
Even for varints (you could probably drop the intermediate prefixes for that). There are many examples of using SIMD to decode UTF-8, whereas the more common protobuf scheme is known to be hostile to SIMD and the branch predictor.
camel-cdr 4 hours ago|
Yeah, protobuf's varints are quite hard to decode with current SIMD instructions, but it would be quite easy if we get element-wise pext/pdep instructions in the future. (SVE2 already has those, but who has SVE2?)
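
For contrast, a scalar LEB128-style varint decoder (a sketch, not protobuf's actual code) shows the data-dependent loop that SIMD and branch predictors dislike; with UTF-8 the first byte alone tells you the sequence length:

    def decode_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
        # Base-128 varint: 7 payload bits per byte, MSB flags "more bytes
        # follow", least-significant group first.
        result, shift = 0, 0
        while True:
            b = buf[pos]
            result |= (b & 0x7F) << shift
            pos += 1
            if not b & 0x80:   # length isn't known until this bit is seen
                return result, pos
            shift += 7

    print(decode_varint(bytes([0xAC, 0x02])))  # (300, 2)
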
anthonyiscoding 17 hours ago||
UTF-8's contributors are some of our modern-day unsung heroes. The design is brilliant, but the dedication it took to encode every single way humans communicate via text into a single standard, and to succeed at it, is truly on another level.

Most other standards just do the xkcd thing: "now there's 15 competing standards"

smoyer 13 hours ago|
Uvarint also has the property of a file containing only ascii characters still being a valid ascii file.