
Posted by vishnuharidas 9/12/2025

UTF-8 is a brilliant design (iamvishnu.com)
849 points | 348 comments | page 5
carlos256 9/12/2025|
No, it's not. It's just a form of Elias-Gamma coding.
carlos256 9/12/2025|
* unary coding.
vismit2000 9/13/2025||
UTF-8 Everywhere Manifesto: https://utf8everywhere.org/
digianarchist 9/13/2025||
I read online that codepoints are formatted with 4 hex chars for historical reasons. U+41 (Latin A) is formatted as U+0041.
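A quick sketch of that convention in Python, assuming the usual rule of "at least four uppercase hex digits, more only when needed" (the helper name `format_codepoint` is made up for illustration):

```python
def format_codepoint(ch: str) -> str:
    """Format one character as U+XXXX, zero-padded to at least four hex digits."""
    return f"U+{ord(ch):04X}"


print(format_codepoint("A"))           # U+0041
print(format_codepoint("\U0001F600"))  # U+1F600 (five digits, no extra padding)
```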
lyu07282 9/13/2025||
UTF-8 was a huge improvement for sure, but 20-25 years ago I was working with LATIN-1 (so 8-bit characters), which was a struggle in the years it took for everything to switch to UTF-8. The compatibility with ASCII meant you only really noticed something was wrong when the data had special characters that were valid LATIN-1 but not representable in ASCII. So perhaps breaking backwards compatibility would've resulted in less data corruption overall.
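To make that failure mode concrete, here is a small Python sketch. The sample strings are made up; the point is only that ASCII-only data hides an encoding mismatch while a single accented character exposes it:

```python
ascii_text = "plain ASCII"
accented = "caf\u00e9"  # "café"

# ASCII-only text yields identical bytes in both encodings,
# so a Latin-1/UTF-8 mix-up goes unnoticed.
same_bytes = ascii_text.encode("latin-1") == ascii_text.encode("utf-8")

# The Latin-1 byte for "é" (0xE9) is not valid UTF-8 on its own.
try:
    accented.encode("latin-1").decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# The reverse mistake decodes without error but silently corrupts the text.
mojibake = accented.encode("utf-8").decode("latin-1")

print(same_bytes, decode_failed, mojibake)
```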
sjapkee 9/13/2025||
Until you interact with it as a programmer
nottorp 9/13/2025||
Hmm, I count at most 21 bits. Just 2 million code points.

Is that all Unicode can do? How are they going to fit all the emojis in?

danhau 9/13/2025|
The max code point in Unicode is 0x10FFFF. ceil(log2(0x10FFFF+1)) = 21. So yes, a Unicode codepoint requires only 21 bits.

297334 codepoints have been assigned so far, that's about 1/4 of the available range, if my napkin math is right. Plenty of room for more emoji.
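The napkin math can be checked with a few lines of Python. The assigned count is simply the figure quoted above, taken as given rather than verified:

```python
MAX_CODE_POINT = 0x10FFFF
ASSIGNED = 297_334  # figure quoted in the comment, not independently verified

total = MAX_CODE_POINT + 1          # code points 0 through 0x10FFFF
bits = MAX_CODE_POINT.bit_length()  # bits needed to hold the largest code point
raw_21_bit_values = 2 ** 21         # what 21 bits could hold without the 0x10FFFF cap
fraction = ASSIGNED / total

print(total)                # 1114112
print(bits)                 # 21
print(raw_21_bit_values)    # 2097152
print(round(fraction, 3))   # 0.267
```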

frollogaston 9/13/2025||
Seems obvious, ASCII had an unused bit, so you use it. Why did they even bother with UTF-16 and -32 then?
int_19h 9/13/2025|
Because the original design assumed that 16 bits would be enough to encode everything worth encoding, hence UCS-2 (not yet UTF-16) being the easiest and most straightforward way to represent things.
frollogaston 9/16/2025||
Ah ok. Well even then, you end up spending 16 bits for every ASCII character.
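For a rough sense of the overhead, a short Python comparison on a made-up ASCII-only string (the `-le` codec variants are used so no byte order mark is counted):

```python
text = "hello"

utf8_len = len(text.encode("utf-8"))       # 1 byte per ASCII character
utf16_len = len(text.encode("utf-16-le"))  # 2 bytes per ASCII character
utf32_len = len(text.encode("utf-32-le"))  # 4 bytes per character

print(utf8_len, utf16_len, utf32_len)  # 5 10 20
```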
smoyer 9/13/2025||
Uvarint also has the property that a file containing only ASCII characters is still a valid ASCII file.
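A minimal sketch of why, assuming "uvarint" means the LEB128-style scheme used by Protocol Buffers and Go's `encoding/binary` (7 payload bits per byte, high bit set on every byte except the last). The function name `encode_uvarint` is just for illustration:

```python
def encode_uvarint(n: int) -> bytes:
    """Encode a non-negative integer as a LEB128-style unsigned varint."""
    if n < 0:
        raise ValueError("uvarint cannot encode negative numbers")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        n >>= 7
    out.append(n)  # final byte, high bit clear
    return bytes(out)


print(encode_uvarint(ord("A")))  # b'A'
print(encode_uvarint(300))       # b'\xac\x02'
```

Any value below 128 comes out as a single byte with the high bit clear, which is exactly its ASCII byte.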
dolmen 9/13/2025|
Anyone remember what UTF7.5 or UTF7,5 was? I can't find references to its description(s)...
dolmen 9/13/2025|
Finally found a description here: http://www.czyborra.com/utf/