Posted by vishnuharidas 9/12/2025

UTF-8 is a brilliant design (iamvishnu.com)
849 points | 348 comments | page 6
sheerun 9/12/2025|
I'd point to IPv6 as a bad design that could potentially have been a UTF-8-like success story.
tialaramex 9/12/2025|
No. UTF-8 is for encoding text, so we don't need to care about it being variable length because text was already variable length.

The network addresses aren't variable length, so if you decide "Oh IPv6 is variable length" then you're just making it worse with no meaningful benefit.

The IPv4 address is 32 bits, the IPv6 address is 128 bits. You could go 64 but it's much less clear how to efficiently partition this and not regret whatever choices you do make in the foreseeable future. The extra space meant IPv6 didn't ever have those regrets.

It suits a certain kind of person to always pay $10M to avoid the one-time $50M upgrade cost. They can do this over a dozen jobs in twenty years, spending $200M to avoid a $50M cost, and be proud of saving money.

sheerun 9/26/2025||
You reserve 32 of those 128 bits for IPv4, just as UTF-8 reserved part of its space for ASCII for backward compatibility, and require user interfaces to fall back in a backward-compatible way. I hope that clears it up.
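
For readers who want to see what "reserving 32 bits" could look like, here is a minimal Python sketch using the standard `ipaddress` module. It relies on the IPv4-mapped range (`::ffff:0:0/96`) that IPv6 already defines, and it only illustrates the bit layout; it does not show that such a scheme would have made IPv4 and IPv6 hosts interoperable on the wire. The helper names `embed_ipv4` and `extract_ipv4` are made up for this example.

```python
import ipaddress


def embed_ipv4(v4: str) -> ipaddress.IPv6Address:
    """Put a 32-bit IPv4 address in the low 32 bits of an
    IPv4-mapped IPv6 address (the ::ffff:0:0/96 range)."""
    v4_int = int(ipaddress.IPv4Address(v4))
    # 0xFFFF occupies bits 32..47; everything above is zero.
    return ipaddress.IPv6Address((0xFFFF << 32) | v4_int)


def extract_ipv4(v6: ipaddress.IPv6Address):
    """Return the embedded IPv4 address, or None if the address
    is not in the IPv4-mapped range."""
    return v6.ipv4_mapped


mapped = embed_ipv4("192.0.2.1")
print(mapped)                # textual form varies by Python version
print(extract_ipv4(mapped))  # the original IPv4 address
print(extract_ipv4(ipaddress.IPv6Address("2001:db8::1")))  # None
```

The analogy with UTF-8 is loose: an ASCII file is already valid UTF-8 byte for byte, whereas an IPv4 packet is not a valid IPv6 packet, so the mapping still needs translation somewhere.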
max23_ 9/13/2025||
Good read, thank you!

> Show the character represented by the remaiing 7 bits on the screen.

I notice there is a typo.

vishnuharidas 9/13/2025|
Fixed that, thank you!
transfire 9/13/2025||
So brilliant that we’re all still using ASCII!†

† With an occasional UNICODE flourish.

librasteve 9/12/2025||
Some insightful Unicode regex examples:

https://dev.to/bbkr/utf-8-internal-design-5c8b

hamburglar 9/12/2025|
Regex? Did you link to the wrong page? I see no regexes on that page.
librasteve 9/14/2025||
Well, you have to click around a bit and be prepared to look at the other pages in Pabels series of posts. I linked to this one because I felt it chimes well with the OP.
jrochkind1 9/13/2025||
It really is, in so many ways.

It is amazing how successful it's been.

ofou 9/13/2025||
UTF-8 should be a universal tokenizer
burtekd 9/12/2025||
I'm just gonna leave this here too: https://www.youtube.com/watch?v=MijmeoH9LT4
xkcd1963 9/13/2025||
What I find inconvenient about emoji characters is that programming languages count their length differently.
kccqzy 9/13/2025|
That's a problem with programming languages having inconsistent definitions of length. They could be like Swift where the programmer has control over what counts as length one. Or they could decide that the problem shouldn't be solved by the language but by libraries like ICU.
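
A small Python sketch of how one visible emoji yields different "lengths" depending on what is counted. The string is the family emoji written as escapes (four emoji code points joined by three zero-width joiners); `length_report` is a name invented for this example. Counting user-perceived characters (grapheme clusters) is not in Python's standard library, so it is left out here.

```python
# Man, woman, girl, boy joined by U+200D (zero-width joiner).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"


def length_report(s: str) -> dict:
    """Count the same string three ways."""
    return {
        "code_points": len(s),                          # what Python's len() counts
        "utf8_bytes": len(s.encode("utf-8")),           # bytes on disk or on the wire
        "utf16_units": len(s.encode("utf-16-le")) // 2,  # 16-bit code units
    }


print(length_report(family))
# {'code_points': 7, 'utf8_bytes': 25, 'utf16_units': 11}
```

Each of the four emoji takes 4 bytes in UTF-8 and each joiner takes 3, which gives 16 + 9 = 25 bytes for something most renderers show as a single glyph.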
z_open 9/13/2025||
Kill Unicode. I'm done with it after these 25-byte single characters.
postalrat 9/12/2025|
Looks similar to MIDI.