Posted by vishnuharidas 9/12/2025

UTF-8 is a brilliant design (iamvishnu.com)
849 points | 348 comments | page 7
ummonk 9/12/2025|
> Another one is the ISO/IEC 8859 encodings are single-byte encodings that extend ASCII to include additional characters, but they are limited to 256 characters.

ISO 2022 allowed you to use control codes to switch between ISO 8859 character sets though, allowing for mixed script text streams.

tiahura 9/12/2025||
How many LLM tokens are wasted every day resolving UTF issues?
gritzko 9/13/2025||
I specialize in protocol design, unfortunately. A while ago I had to code some Unicode conversion routines from scratch and I must say I absolutely admire UTF-8. Unicode per se is a dumpster fire, likely because of objective reasons. Dealing with multiple Unicode encodings is a minefield. I even made an angry write-up back then https://web.archive.org/web/20231001011301/http://replicated...

UTF-8 made it all relatively neat back in the day. There are still ways to throw a wrench into the gears. For example, how do you handle UTF-8 encoded surrogate pairs? But at least one can filter that out as suspicious/malicious behavior.

sedatk 9/13/2025||
> For example, how do you handle UTF-8 encoded surrogate pairs?

Surrogate pairs aren’t applicable to UTF-8. That range of Unicode code points (U+D800 to U+DFFF) is simply invalid in UTF-8 and should be treated as such (as a parsing error, as invalid characters, etc.).
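
To illustrate the point about surrogates being invalid in UTF-8, here is a minimal Python sketch. It assumes CPython's built-in `utf-8` codec as the "strict" parser; the helper `decodes_strictly` and the variable names are made up for this example. The bytes are the UTF-16 surrogate pair for U+1F600, with each half encoded as if it were an ordinary 3-byte UTF-8 sequence (the CESU-8 style of output), which is the kind of input under discussion.

```python
# U+D83D and U+DE00 (the surrogate pair for U+1F600), each written out
# as a 3-byte UTF-8-style sequence. This is NOT well-formed UTF-8.
surrogate_bytes = b"\xed\xa0\xbd\xed\xb8\x80"


def decodes_strictly(data: bytes) -> bool:
    """Return True if the strict UTF-8 codec accepts the bytes (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A strict decoder rejects the input outright.
strict_ok = decodes_strictly(surrogate_bytes)

# A lenient decoder substitutes U+FFFD replacement characters instead.
lenient = surrogate_bytes.decode("utf-8", errors="replace")

# An opt-in handler lets the surrogates through as two separate code points.
passed_through = surrogate_bytes.decode("utf-8", errors="surrogatepass")

# The well-formed UTF-8 encoding of the same character is a single 4-byte sequence.
proper = "\U0001F600".encode("utf-8")

print(strict_ok, len(passed_through), proper)
```

The three decode modes give three different results for the same bytes, which is one way different parsers can end up disagreeing on such input.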

gritzko 9/13/2025||
In theory, yes. In practice, there are throngs of parsers and converters that might handle such cases differently. https://seriot.ch/projects/parsing_json.html
sedatk 9/14/2025||
I mean hopefully not, but the linked example is about JSON parsing, not UTF-8.
gritzko 9/15/2025||
A big chunk of the bugs there are Unicode-related; that is my point. When people parse JSON, they don't think about the fact that they are also parsing Unicode.
cryptonector 9/13/2025||
> Unicode per se is a dumpster fire

Maybe as to emojis, but otherwise, no, Unicode is not a dumpster fire. Unicode is elegant, and all the things that people complain about in Unicode are actually problems in human scripts.

ceh56 9/13/2025||
Another collaboration by Pike and Thompson can be seen here: https://go.dev/.
Andrex 9/13/2025||
What are the perceived benefits of UTF-16 and 32 and why did they come about?

I could ask Gemini but HN seems more knowledgeable.

peterfirefly 9/13/2025|
UTF-16 is a hack that was invented when it became clear that UCS-2 wasn't gonna work (65536 codepoints was not enough for everybody).

Almost the entire world could have ignored it if not for Microsoft making the wrong choice with Windows NT and then stubbornly insisting that their wrong choice was indeed correct for a couple of decades.

There was a long phase where some parts of Windows understood (and maybe generated) UTF-16 and others only UCS-2.
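
To make the UCS-2 versus UTF-16 distinction concrete, here is a minimal Python sketch of the surrogate-pair arithmetic UTF-16 uses for code points above U+FFFF. The function name `to_surrogate_pair` is invented for illustration; the result is checked against Python's own `utf-16-be` codec. Code that only understands UCS-2 would see the two 16-bit units as two separate (and meaningless) characters.

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = cp - 0x10000               # 20 bits remain after subtracting the BMP
    high = 0xD800 + (offset >> 10)      # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F600)

# Cross-check against the standard library's UTF-16 encoder (big-endian, no BOM).
encoded = "\U0001F600".encode("utf-16-be")
from_codec = (int.from_bytes(encoded[0:2], "big"), int.from_bytes(encoded[2:4], "big"))

print(hex(high), hex(low), from_codec == (high, low))
```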

kccqzy 9/13/2025|||
Besides Microsoft, plenty of others thought UTF-16 to be a good idea. The Haskell Text type used to be based on UTF-16; it only switched to UTF-8 a few years ago. Java still uses UTF-16, but with an ad hoc optimization called CompactStrings to use ISO-8859-1 where possible.
peterfirefly 9/13/2025||
A lot of them did it because they had to have a Windows version and had to interface with Windows APIs and Windows programs that only spoke UTF-16 (or UCS-2 or some unspecified hybrid).

Java's mistake seems to have been independent and it seems mainly to have been motivated by the mistaken idea that it was necessary to index directly into strings. That would have been deprecated fast if Windows had been UTF-8 friendly and very fast if it had been UTF-16 hostile.

We can always dream.
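
On the idea that direct indexing into strings requires UTF-16: a rough Python sketch of how many code units the same text takes in each encoding. The sample string and the helper `code_units` are assumptions made for this example. Only UTF-32 gives one unit per code point; UTF-16 stops doing so as soon as a character outside the Basic Multilingual Plane appears.

```python
# Four code points: 'a', U+00E9, U+4E2D, and U+1F600 (outside the BMP).
text = "a\u00e9\u4e2d\U0001F600"


def code_units(s: str, encoding: str, unit_size: int) -> int:
    """Count fixed-size code units in the encoded form (hypothetical helper)."""
    return len(s.encode(encoding)) // unit_size


utf8_units = code_units(text, "utf-8", 1)        # 1 + 2 + 3 + 4 bytes
utf16_units = code_units(text, "utf-16-le", 2)   # 1 + 1 + 1 + 2 units
utf32_units = code_units(text, "utf-32-le", 4)   # one unit per code point

# Indexing by UTF-16 code unit lands on half of a character:
# unit 3 is the high surrogate of the last code point, not a character itself.
utf16_bytes = text.encode("utf-16-le")
unit_3 = int.from_bytes(utf16_bytes[6:8], "little")

print(utf8_units, utf16_units, utf32_units, hex(unit_3))
```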

int_19h 9/13/2025||
There are many other examples, and while some of them are derived from the ones you give, others are independent. JavaScript is an obvious one, but there's also e.g. Qt and NSString in Objective-C, ICU etc.

There really was a time when UTF-16 (or rather UCS-2) made sense.

Andrex 9/17/2025|||
Thank you! That's interesting.

What about UTF-7? That seemed like a bad idea even at the time.

Androth 9/13/2025|
meh. it's a brilliant design to put a bandage over a bad design. if a language can't fit into 255 glyphs, it should be reinvented.
rmunn 9/13/2025|
Sun Tzu would like a word or two with you.