UTF-8 is a brilliant design

Posted by vishnuharidas 1 day ago

UTF-8 is a brilliant design (iamvishnu.com)

708 points | 283 comments | page 6

carlos256 23 hours ago|

No, it's not. It's just a form of Elias-Gamma coding.

carlos256 23 hours ago|

* unary coding.
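
For context on the "unary" remark: the lead byte of a UTF-8 sequence announces the sequence length with a run of 1 bits followed by a 0. A minimal Python sketch of reading that prefix, where `utf8_sequence_length` is a hypothetical helper name. It only inspects the prefix bits, so it does not reject bytes such as 0xC0, 0xC1 or 0xF5 to 0xF7, which have a plausible prefix but never occur in well-formed UTF-8:

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the sequence length announced by a UTF-8 lead byte.

    Counts the run of 1 bits at the top of the byte (the unary-style prefix).
    Assumption: only the prefix is checked, not full well-formedness.
    """
    if not 0 <= lead <= 0xFF:
        raise ValueError("lead must be a single byte value")

    ones = 0
    mask = 0x80
    while lead & mask:  # stops at the first 0 bit, or when mask reaches 0
        ones += 1
        mask >>= 1

    if ones == 0:
        return 1  # 0xxxxxxx: plain ASCII, one byte
    if ones == 1 or ones > 4:
        # 10xxxxxx is a continuation byte; 11111xxx is never a lead byte
        raise ValueError("not a valid lead byte")
    return ones  # 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4


for ch in ("A", "é", "€", "😀"):
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), utf8_sequence_length(encoded[0]))
```

The loop prints each character with its actual encoded length next to the length read from the first byte, so the two columns can be compared directly.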

ofou 19 hours ago||

UTF-8 should be a universal tokenizer

transfire 9 hours ago||

So brilliant that we’re all still using ASCII!†

† With an occasional UNICODE flourish.

gritzko 12 hours ago||

I specialize in protocol design, unfortunately. A while ago I had to code some Unicode conversion routines from scratch, and I must say I absolutely admire UTF-8. Unicode per se is a dumpster fire, likely for objective reasons. Dealing with multiple Unicode encodings is a minefield. I even wrote an angry write-up back then: https://web.archive.org/web/20231001011301/http://replicated...

UTF-8 made it all relatively neat back in the day. There are still ways to throw a wrench into the gears. For example, how do you handle UTF-8 encoded surrogate pairs? But at least one can filter that out as suspicious/malicious behavior.

sedatk 10 hours ago||

> For example, how do you handle UTF-8 encoded surrogate pairs?

Surrogate pairs aren’t applicable to UTF-8. The surrogate code point range (U+D800 to U+DFFF) is simply invalid in UTF-8 and should be treated as such (as a parsing error, as invalid characters, etc.).
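
As an illustration of that behavior in one concrete implementation, here is a small Python sketch. It assumes CPython's built-in `utf-8` codec, whose strict mode rejects surrogates in both directions; the variable names are made up for the example, and other languages and libraries may behave differently:

```python
# A lone high surrogate as a Python str (Python allows this internally).
lone_surrogate = "\ud800"

# Strict encoding to UTF-8 refuses surrogate code points.
try:
    lone_surrogate.encode("utf-8")
    encode_rejected = False
except UnicodeEncodeError:
    encode_rejected = True

# These three bytes are what U+D800 would look like in the 3-byte pattern
# (1110xxxx 10xxxxxx 10xxxxxx). Strict decoding refuses them too.
surrogate_bytes = b"\xed\xa0\x80"
try:
    surrogate_bytes.decode("utf-8")
    decode_rejected = False
except UnicodeDecodeError:
    decode_rejected = True

# The "surrogatepass" error handler is an explicit opt-in that lets them through.
passed_through = surrogate_bytes.decode("utf-8", errors="surrogatepass")

print(encode_rejected, decode_rejected, passed_through == lone_surrogate)
```

The opt-in handler is the kind of escape hatch that makes real-world parsers disagree: the same bytes are an error under one setting and a code point under another.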

gritzko 10 hours ago||

In theory, yes. In practice, there are throngs of parsers and converters that might handle such cases differently. https://seriot.ch/projects/parsing_json.html

cryptonector 5 hours ago||

> Unicode per se is a dumpster fire

Maybe when it comes to emoji, but otherwise, no, Unicode is not a dumpster fire. Unicode is elegant, and all the things that people complain about in Unicode are actually problems in human scripts.

akoboldfrying 12 hours ago||

I take it you could choose to encode a code point using a larger number of bytes than are actually needed? E.g., you could encode "A" using 1, 2, 3 or 4 bytes?

Because if so: I don't really like that. It would mean that "equal sequence of code points" does not imply "equal sequence of encoded bytes" (the converse continues to hold, of course), while offering no advantage that I can see.
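
For reference, the bit patterns do leave room for such "overlong" forms, but the UTF-8 definition (RFC 3629) only permits the shortest form, and conforming decoders are expected to reject the rest. A quick Python sketch, assuming CPython's strict `utf-8` codec; `decodes_strictly` and the `candidates` table are hypothetical names for the example:

```python
def decodes_strictly(data: bytes) -> bool:
    """Return True if the bytes are accepted by Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "A" is U+0041. Padding its bits with leading zeros into the 2-, 3- and
# 4-byte patterns gives the overlong candidates below.
candidates = {
    "1 byte (shortest form)": b"\x41",
    "2 bytes (overlong)": b"\xc1\x81",
    "3 bytes (overlong)": b"\xe0\x81\x81",
    "4 bytes (overlong)": b"\xf0\x80\x81\x81",
}

for label, data in candidates.items():
    print(label, data.hex(), decodes_strictly(data))
```

With overlong forms rejected, equal code point sequences do map to equal byte sequences. Lenient decoders that accept them have historically been a source of security bugs, which is why strict rejection matters.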

burtekd 1 day ago||

I'm just gonna leave this here too: https://www.youtube.com/watch?v=MijmeoH9LT4

postalrat 23 hours ago||

Looks similar to MIDI

Andrex 14 hours ago||

What are the perceived benefits of UTF-16 and UTF-32, and why did they come about?

I could ask Gemini but HN seems more knowledgeable.

peterfirefly 5 hours ago|

UTF-16 is a hack that was invented when it became clear that UCS-2 wasn't gonna work (65536 codepoints was not enough for everybody).
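
To make the "hack" concrete: code points above U+FFFF do not fit in one 16-bit unit, so UTF-16 splits them across two units taken from the reserved surrogate ranges. A minimal Python sketch of that arithmetic, with `to_surrogate_pair` as a hypothetical helper name, cross-checked against Python's own `utf-16-be` codec:

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F600)   # U+1F600, an emoji outside the BMP
print(f"{high:04X} {low:04X}")

# Cross-check against the standard library's big-endian UTF-16 encoder.
expected = "\U0001F600".encode("utf-16-be")
print(expected == high.to_bytes(2, "big") + low.to_bytes(2, "big"))
```

Software that only understood UCS-2 would see these two units as two separate, meaningless characters, which is the mixed-mode problem described for Windows.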

Almost the entire world could have ignored it if not for Microsoft making the wrong choice with Windows NT and then stubbornly insisting that their wrong choice was indeed correct for a couple of decades.

There was a long phase where some parts of Windows understood (and maybe generated) UTF-16 and others only UCS-2.

kccqzy 4 hours ago||

Besides Microsoft, plenty of others thought UTF-16 to be a good idea. The Haskell Text type used to be based on UTF-16; it only switched to UTF-8 a few years ago. Java still uses UTF-16, but with an ad hoc optimization called CompactStrings to use ISO-8859-1 where possible.

peterfirefly 3 hours ago||

A lot of them did it because they had to have a Windows version and had to interface with Windows APIs and Windows programs that only spoke UTF-16 (or UCS-2 or some unspecified hybrid).

Java's mistake seems to have been independent and it seems mainly to have been motivated by the mistaken idea that it was necessary to index directly into strings. That would have been deprecated fast if Windows had been UTF-8 friendly and very fast if it had been UTF-16 hostile.

We can always dream.

ummonk 23 hours ago||

> Another one is the ISO/IEC 8859 encodings are single-byte encodings that extend ASCII to include additional characters, but they are limited to 256 characters.

ISO 2022 allowed you to use control codes to switch between ISO 8859 character sets though, allowing for mixed script text streams.

ceh56 19 hours ago|

Another collaboration by Pike and Thompson can be seen here: https://go.dev/.
