UTF-8 is a brilliant design

Posted by vishnuharidas 1 day ago

UTF-8 is a brilliant design (iamvishnu.com)

737 points | 291 comments | page 7

xkcd1963 12 hours ago|

What I find inconvenient about emoji characters is that programming languages count their length differently.

kccqzy 7 hours ago|

That's a problem with programming languages having inconsistent definitions of length. They could be like Swift where the programmer has control over what counts as length one. Or they could decide that the problem shouldn't be solved by the language but by libraries like ICU.
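To make the "inconsistent definitions of length" concrete, here is a minimal Python sketch that counts the same string in three units. The helper name `lengths` and the sample string are illustrative assumptions, not something from the thread.

```python
def lengths(s: str) -> dict:
    """Count one string in three common units."""
    return {
        # Unicode code points: what Python's len() reports.
        "code_points": len(s),
        # Bytes in the UTF-8 encoding: what a byte-oriented API sees.
        "utf8_bytes": len(s.encode("utf-8")),
        # UTF-16 code units: what JavaScript and Java string lengths report.
        "utf16_units": len(s.encode("utf-16-le")) // 2,
    }

# Thumbs-up emoji followed by a skin-tone modifier: one visible glyph.
sample = "\U0001F44D\U0001F3FD"
result = lengths(sample)
print(result)  # {'code_points': 2, 'utf8_bytes': 8, 'utf16_units': 4}
```

None of the three numbers is 1, which is what a user would call the length. Counting user-perceived characters (grapheme clusters) needs Unicode segmentation rules, which Python's standard library does not provide; that is the part Swift builds in and ICU offers as a library.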

lyu07282 19 hours ago||

UTF-8 was a huge improvement for sure, but 20-25 years ago I was working with LATIN-1 (so 8-bit characters), which was a struggle in the years it took for everything to switch to UTF-8. The compatibility with ASCII meant you only really noticed something was wrong when the data had special characters not representable in ASCII but valid in LATIN-1. So perhaps breaking backwards compatibility would've resulted in less data corruption overall.
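A small Python sketch of the failure mode described above, assuming the sample word "café" as test data: pure ASCII text is byte-identical in both encodings, so a mix-up only surfaces once a non-ASCII character appears.

```python
text = "caf\u00e9"  # "café"

utf8_bytes = text.encode("utf-8")      # b'caf\xc3\xa9'
latin1_bytes = text.encode("latin-1")  # b'caf\xe9'

# UTF-8 bytes misread as LATIN-1: decoding never fails, it silently yields mojibake.
misread = utf8_bytes.decode("latin-1")

# LATIN-1 bytes misread as UTF-8: the lone 0xE9 byte is not a valid sequence.
try:
    latin1_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Pure ASCII is identical in both encodings, so the mix-up stays hidden.
ascii_same = "cafe".encode("utf-8") == "cafe".encode("latin-1")

print(repr(misread), strict_ok, ascii_same)
```

The silent direction (UTF-8 read as LATIN-1) is the one that tends to corrupt stored data, because nothing raises an error at the point of the mistake.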

tiahura 1 day ago||

How many LLM tokens are wasted every day resolving UTF issues?

Androth 22 hours ago||

meh. it's a brilliant design to put a bandage over a bad design. if a language can't fit into 255 glyphs, it should be reinvented.

rmunn 13 hours ago|

Sun Tzu would like a word or two with you.

LorenPechtel 1 day ago||

Now fix fonts! It should be possible to render any valid string in a font.

dmz73 18 hours ago|

UTF-8 is a horrible design. The only reason it was widely adopted was backwards compatibility with ASCII. There are a large number of invalid byte combinations that have to be discarded. Parsing forward is complex even before taking invalid byte combinations into account, and parsing backwards is even worse. Compare that to UTF-16, where parsing forward and backwards are both simpler than in UTF-8, and if there is an invalid surrogate combination, one can assume it is a valid UCS-2 char.
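For reference, this is roughly what stepping backwards through UTF-8 involves. It is a minimal sketch that assumes the input is already valid UTF-8; the function name `prev_code_point_start` is hypothetical, and handling invalid input would take additional checks that are not shown here.

```python
def prev_code_point_start(buf: bytes, i: int) -> int:
    """Return the index where the code point ending just before i begins.

    Assumes buf is valid UTF-8 and i is a code point boundary with 0 < i <= len(buf).
    """
    i -= 1
    # Continuation bytes have the bit pattern 10xxxxxx; skip back over them.
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i

# 'a' (1 byte), euro sign (3 bytes), grinning face emoji (4 bytes).
data = "a\u20ac\U0001F600".encode("utf-8")

starts = []
pos = len(data)
while pos > 0:
    pos = prev_code_point_start(data, pos)
    starts.append(pos)

print(starts)  # [4, 1, 0]
```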

moefh 17 hours ago|

UTF-16 is an abomination. It's only easy to parse because it's artificially limited to 1 or 2 code units. It's an ugly hack that requires reserving 2048 code points ("surrogates") from the Unicode table just for the encoding itself.

It's also the reason why Unicode has a limit of about 1.1 million code points: without UTF-16, we could have over 2 billion (which is the limit of the original UTF-8 design, with sequences of up to 6 bytes).
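The figures above can be checked with a little arithmetic. This is a sketch in Python; the constant names are made up for illustration, and the 31-bit figure refers to the original UTF-8 design with sequences of up to 6 bytes, not to today's UTF-8, which is capped at U+10FFFF.

```python
# Surrogate range U+D800..U+DFFF, reserved so UTF-16 can encode pairs.
SURROGATES = 0xDFFF - 0xD800 + 1           # 2048

# One 16-bit code unit covers the Basic Multilingual Plane.
BMP = 0x10000                              # 65,536

# A pair is 1024 high surrogates x 1024 low surrogates.
SUPPLEMENTARY = 1024 * 1024                # 1,048,576

utf16_code_points = BMP + SUPPLEMENTARY            # U+0000..U+10FFFF
scalar_values = utf16_code_points - SURROGATES     # usable for characters

# Original UTF-8 design: up to 6 bytes carrying 31 payload bits.
original_utf8_limit = 2 ** 31

print(utf16_code_points, scalar_values, original_utf8_limit)
# 1114112 1112064 2147483648
```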