Posted by lerno 7 days ago

Unsigned sizes: A five year mistake (c3-lang.org)
119 points | 149 comments
Validark 7 days ago|
I am personally moving in the opposite direction. I haven't meaningfully used a signed integer in years, and I see signed integers as being for more niche use-cases. I mainly only use signed types when I want to do a "signed shift right". If there were a >>> operator in Zig I wouldn't even think of signed integers.

Given your examples, I think you'd have fewer issues if you were working with unsigned integers exclusively. Although I'm curious about what other code you were referencing with this: "But seeing how each change both made the code easier to reason about and more correct, I couldn’t deny the evidence."

With regards to modulo, in Zig if you try to use it with a signed integer it will tell you to specify whether you want `@mod` or `@rem` semantics. In my case, I'd almost never write `x % 2`, I'd write `x & 1`. I do use unsigned division but I'd pretty much never write code that would emit the `div` instruction.
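
A minimal C sketch of the difference (hypothetical code, since my examples above are Zig):

    #include <stdio.h>
    int main(void) {
        unsigned u = 7;
        int s = -7;
        printf("%u %u\n", u % 2, u & 1);  /* 1 1: identical for unsigned */
        printf("%d %d\n", s % 2, s & 1);  /* -1 1: truncated remainder vs. bit mask */
        return 0;
    }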

I'm not saying you're wrong though! Everyone has a different mind. If you attain higher correctness and understandability through using signed integers, that's great. I'm just saying I'm in the opposite camp.

bsder 7 days ago|
Zig also differentiates between the wrapping and non-wrapping operators. The for loop example would trigger a runtime panic when the index underflowed in most build modes.
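
For comparison, a sketch of that loop in C, where nothing catches the underflow (hypothetical function name):

    #include <stddef.h>
    void zero_backwards(int *a, size_t n) {
        /* BUG: i is unsigned, so i >= 0 is always true; after i == 0
           the decrement wraps to SIZE_MAX and the loop never ends. */
        for (size_t i = n - 1; i >= 0; i--)
            a[i] = 0;
    }

In Zig's Debug and ReleaseSafe modes the wrapping decrement would panic instead.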

The if statement won't compile, since Zig would force an explicit cast.

The tricky wrap sucks unless you use a power of 2. Then the Zig type can match (u4, u5, u7, etc.) and you can use the wrapping arithmetic operators. And on smaller CPUs you NEED to use a power of 2, because division and mod are expensive.
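
In C the same trick is spelled with a mask (a sketch; the capacity name is made up):

    /* With a power-of-two capacity, the wrap is a cheap AND
       instead of an expensive div/mod: */
    #define CAPACITY 16u
    unsigned next_index(unsigned i) {
        return (i + 1) & (CAPACITY - 1);  /* same result as (i + 1) % CAPACITY */
    }

In Zig the type itself can carry the modulus, e.g. a u4 index with the +% wrapping add.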

EdSchouten 7 days ago||
> If sizes are unsigned, like in C, C++, Rust and Zig – then it follows that anything involving indexing into data will need to either be all unsigned or require casts.

I don’t really get this claim. Indexing should just look up the element corresponding to the value provided. It’s easy to come up with semantics that are intuitive and sound, even if signed integers or ones smaller than size_t are used.

adrian_b 7 days ago|
Indexing does that, but the indices must vary within a certain range, whose limits are frequently determined by something like "sizeof(array)/sizeof(element)", which is an unsigned number.

This is especially inconvenient in C, where there are extremely dangerous legacy implicit conversions between signed and unsigned integers, which have a great probability of generating incorrect values.

Because the index is typically a signed integer, comparing it with an unsigned limit without explicit casts is likely to cause bugs. Explicitly casting smaller unsigned integers to bigger signed integers results in correct code, but it is cumbersome.
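
A minimal example of such a bug (hypothetical code):

    #include <stdio.h>
    int main(void) {
        int arr[4] = {0};
        int i = -1;
        /* i is implicitly converted to the unsigned type of sizeof,
           so -1 becomes a huge value and the bounds check silently fails: */
        if (i < sizeof(arr) / sizeof(arr[0]))
            printf("in range\n");
        else
            printf("out of range\n");  /* this branch is taken */
        return 0;
    }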

These problems are avoided, as TFA says, by making "sizeof" and the like yield 64-bit signed integer values instead of unsigned values.

Well-chosen implicit conversions are good for a programming language, reducing unnecessary verbosity, but the implicit integer conversions of C are just wrong, and they are by far the worst mistake of C, much worse than any other C feature.

Other C features are criticized because they may be misused by inexperienced or careless programmers, but most of the implicit integer conversions are just incorrect. There is no way of using them correctly. Only the conversions from a smaller signed integer to a bigger signed integer are correct.

Mixed-signedness conversions have always been wrong, and the conversions between unsigned integers were made wrong by the change in the C standard that defined unsigned integers as integer residues modulo 2^N rather than as non-negative integers.

For modular integers, the only correct conversions are from bigger numbers to smaller numbers, i.e. the opposite of the implicit conversions of C. The implicit conversions of C unsigned numbers would have been correct for non-negative integers, but in the current C standard there are no such numbers.

The current C standard is inconsistent: the meaning of sizeof is that of a non-negative integer, and the same is true of the conversions between unsigned numbers, but all the arithmetic operations with unsigned numbers are defined as operations with integer residues, not as operations with non-negative numbers.

The hardware of most processors implements at least 3 kinds of arithmetic operations: operations with signed integers, operations with non-negative integers and operations with integer residues.

Any decent programming language should define distinct types for these kinds of numbers; otherwise the only way to use the processor hardware completely is assembly language. Because C does not do this, you have to use at least inline assembly, if not separate assembly source files, to implement operations with big numbers.
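
For example, portable C has no access to the carry flag, so multi-word addition must reconstruct it (a sketch; compiler builtins such as GCC/Clang __builtin_add_overflow are the other option):

    #include <stdint.h>
    /* 256-bit addition as four 64-bit limbs, propagating the carry manually. */
    void add256(uint64_t r[4], const uint64_t a[4], const uint64_t b[4]) {
        uint64_t carry = 0;
        for (int i = 0; i < 4; i++) {
            uint64_t s = a[i] + b[i];   /* wraps modulo 2^64 */
            uint64_t c1 = s < a[i];     /* carry out of a + b */
            r[i] = s + carry;
            carry = c1 | (r[i] < s);    /* carry out of + carry */
        }
    }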

uecker 7 days ago||
Not sure what change in the C standard you mean. Unsigned was always modulo. Otherwise, use -Wsign-conversion.
adrian_b 7 days ago||
Nope.

It was undefined what happens at unsigned overflows and underflows. Therefore a compiler could choose to implement "unsigned" as either non-negative numbers or as integer residues.

Both the fact that "sizeof" is unsigned and the implicit conversions between "unsigned" numbers are consistent only with non-negative numbers. Therefore the undefined behavior should have been defined correspondingly.

Instead of this, in some version of the standard (I am too lazy to search for it now, but it might have been C99), they changed the behavior from undefined to defined, as the behavior of integer residues.

I do not know the reason for this choice; it may have been just laziness, because it is easier to implement in compilers and it leads to maximum performance in the absence of bugs. In any case, this decision has broken the standard, because the arithmetic operations have become incompatible with the implicit conversions between "unsigned" types and with the semantics of "sizeof", which must be non-negative.

For non-negative numbers, the correct conversions are from smaller sizes to bigger sizes, while for integer residues the correct conversions are only in the opposite direction, from bigger sizes to smaller sizes (e.g. a number that is 257 modulo 65536 is also 1 modulo 256, so truncating it yields a correct value, while a number that is 1 modulo 256 could be 1, 257, 513, etc. modulo 65536, so you cannot extend it without additional information).
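
In C terms (a sketch assuming the usual 16-bit short and 8-bit char):

    #include <stdio.h>
    int main(void) {
        unsigned short wide = 257;    /* 257 modulo 65536 */
        unsigned char narrow = wide;  /* truncation: 1 modulo 256, still correct */
        unsigned short back = narrow; /* widening picks 1, but 257, 513, ... were
                                         equally valid; information was lost */
        printf("%d %d %d\n", wide, narrow, back);  /* 257 1 1 */
        return 0;
    }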

Judging from the implicit conversions, it is clear that the intention of the designers of C during the seventies was that "unsigned" numbers must be non-negative integers, not integer residues. The modern C standard is responsible for the current inconsistencies, which greatly increase the chances of bugs.

uecker 7 days ago|||
My copy of K&R already has unsigned modulo arithmetic: "unsigned numbers are always positive or zero, and obey the laws of arithmetic modulo 2^n, where n is the number of bits in the type." So if it changed, it was before that, but I don't think so.

I get your argument about the conversion order, but I do not buy it in terms of language design. You also do not want to go to a quotient ring implicitly, so I do not agree that this conversion direction would be more "correct" for implicit conversion either, and from a practical point of view the C design is defensible.

I think the motivation originally was merely to expose the common capabilities of the hardware, nothing more. What we miss from this perspective are polynomials over F_2, but nobody has pushed for this too hard so far.
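
Sketched in C, that missing operation is a carry-less multiply (x86 exposes it as PCLMULQDQ, but portable C has to spell it out):

    #include <stdint.h>
    /* Multiply two polynomials over F_2: like long multiplication,
       but partial products combine with XOR, so there are no carries. */
    uint64_t clmul32(uint32_t a, uint32_t b) {
        uint64_t r = 0;
        for (int i = 0; i < 32; i++)
            if ((b >> i) & 1)
                r ^= (uint64_t)a << i;
        return r;
    }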

nycticorax 5 days ago||
This is only somewhat related, but: Has there ever been a language that adopted IEEE-754-like semantics for integer types? (Yes, I know this would be slow without hardware support.) By this I mean adding valid values to (signed) ints representing positive infinity, negative infinity, and not-a-number, and then using these values as the results of overflow, underflow, and division by zero in the natural way. It just seems like if these sorts of values are useful in floating-point arithmetic, they might well be useful in integer arithmetic as well, for many of the same reasons.
AlotOfReading 5 days ago|
If you rename +/-Inf to the maximum and minimum values, that's what saturating arithmetic is. Very likely your language and hardware already support it.
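
A sketch of what that looks like for signed addition in C:

    #include <limits.h>
    /* Saturating add: overflow clamps to INT_MAX / INT_MIN, which
       play the role of +Inf / -Inf. */
    int sat_add(int a, int b) {
        if (a > 0 && b > INT_MAX - a) return INT_MAX;
        if (a < 0 && b < INT_MIN - a) return INT_MIN;
        return a + b;
    }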

NaN is almost always a mistake, and adding it breaks the law of identity. You don't want it.

nycticorax 4 days ago||
OK, I did not know about [saturation arithmetic](https://en.wikipedia.org/wiki/Saturation_arithmetic). Cool!

But I can't agree with the claim that "nan is almost always a mistake". Certainly if you're doing floating-point computation on large arrays, the last thing you want is e.g. for an error to be thrown in an elementwise division just because two corresponding elements both happen to be zero.

It's true that nan!=nan is one of the more 'controversial' parts of the standard, that possibly would have been decided the other way in a perfect world. But it was also a reasonable pragmatic decision at the time the standard was developed. See here: https://stackoverflow.com/a/1573715/1013442

shirro 6 days ago||
With all respect to Christoffer and Bjarne and many others much smarter and more experienced than me who have said similar things, I am far from convinced. Their languages are not memory safe, and they are neither doing bounds checking nor proving it unnecessary. If iteration is causing underflow or overflow, then perhaps the problem isn't signed or unsigned indexes.

I don't recall similar arguments being made for Pascal or Ada.

Looking around at the state of our C++ and C software and all the CVEs, I think we probably shouldn't care about unsigned or signed loop indexes and should move on before regulatory pressure forces us to. Please, language designers, give us some interesting alternatives to Rust.

IshKebab 7 days ago||
It seems like they've identified common bug patterns in C that would have been ameliorated by using signed, but come to the wrong conclusion: that signed is the correct answer, rather than that C is poorly designed for making the broken code the easy option.

Fix the language. Don't hack around it by using the wrong type.

ozgrakkurt 7 days ago|
This is already fixed in C via _BitInt types and by disabling implicit integer sign conversions.
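
Roughly like this (a sketch assuming a C23 compiler; the flag is GCC/Clang's):

    /* _BitInt gives explicit-width integers, and building with
       -Werror=sign-conversion rejects the implicit mixed-sign
       conversions, so they have to be spelled out as casts: */
    #include <stddef.h>
    _BitInt(64) as_signed_size(size_t n) {
        return (_BitInt(64))n;
    }
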
ximm 7 days ago||
Is the text on this page really #bbbdc3 on #ffffff? How is anyone supposed to be able to read that?
sureglymop 7 days ago||
Weirdly, you have to turn on javascript for the text color to change...
idbehold 7 days ago||
For me it's #353841 on #ffffff which meets WCAG AAA standards for accessible text.
Panzerschrek 6 days ago||
In my programming language I use unsigned sizes. Signed sizes make no sense - sizes can't be negative. Unsigned sizes provide a larger range and don't waste an extra bit. Range checking is simpler, requiring no separate comparison against zero. Also some operations, like division and modulo, are faster for unsigned integers.
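
For example (a small C sketch of the range-check point):

    #include <stdbool.h>
    #include <stddef.h>
    bool in_bounds_unsigned(size_t i, size_t n) {
        return i < n;            /* one comparison covers both ends */
    }
    bool in_bounds_signed(long i, long n) {
        return i >= 0 && i < n;  /* the lower bound needs its own check */
    }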

Using signed sizes adds a lot of footguns and performance degradations and in exchange gives only small code simplifications in rare and niche cases.

rurban 7 days ago||
So his compiler cannot detect the unsigned overflows and instead chooses to call it a user mistake!

Sizes and indices of course need to be unsigned, and any self-respecting compiler should warn about dangerous usage.

larsnystrom 7 days ago||
I don’t understand how dealing with numbers correctly is not a solved problem in computer engineering by now.
akkartik 7 days ago|
Maybe it's telling you it's a hard problem?
larsnystrom 7 days ago||
My comment was a bit tongue in cheek. Obviously it is a hard problem. But in a profession where we work with machines that literally were made to crunch numbers, and where abstraction is something we deal with daily, why can’t we have a performant abstraction for doing arbitrary calculations? The answer is that to be performant it must be solved in hardware, which would cost more than the hardware we have.

So in fact it is not just telling me it’s a hard problem, it’s telling me that the cost-benefit is still not there. It’s like it’s just not a very important problem (in an economic sense). And that is what surprises me, given that computers were made to do arbitrary calculations.

akkartik 7 days ago||
This article has been deeply influential for me: http://johnsalvatier.org/blog/2017/reality-has-a-surprising-...

I used to imagine that, for someone in construction, a wall must be some really simple thing. But it's only simple after millennia of building walls. So I now have lots of grace and patience for humanity to figure out numbers in computers, whether integers or reals.

Your explanation is possibly the same just in different words. It's a hard problem and probably needs a whole lifetime. But it's in no single person's economic interest to devote to it the time it needs (not to mention the diverse skills required; once one has a solution one has to pitch it to the world). And so it will happen over a hundred lifetimes.

cperciva 7 days ago|
I don't get it. Is this a parody of poor design decisions?

Sure, it's possible to write bugs in C. And if you really want to, you can disable the compiler warnings which flag tautologous comparisons and mixed-sign comparisons (a common reason for doing this is to avoid spurious warnings in generic-type code).
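
For the record, a sketch of what those warnings catch (flag names are GCC's; Clang has close equivalents):

    /* gcc -Wextra -Wsign-compare warns on both comparisons below: */
    #include <stddef.h>
    int f(int i, size_t n) {
        if (n >= 0)       /* tautologous: an unsigned value is always >= 0 */
            i++;
        return i < n;     /* mixed-sign: i is converted to size_t */
    }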

But, uhh, "people can deliberately write bugs" has got to be the weakest justification I've ever seen for changing a language feature -- especially one as fundamental as "sizes of objects can't be negative".

adrian_b 7 days ago|
The C language does not have any data type that has the property "can't be negative".

Signed integers can be negative. The so-called "unsigned" integers of C are integer residues modulo 2^N, which are neither positive nor negative, i.e. these concepts are not applicable to "unsigned" integers.

An alternative view is that any C "unsigned" is both positive and negative. For example the unsigned short "1" is the same number as "65537" and as "-65535".

So any sizeof value in C is negative (while also being positive).

Contrary to what you say, the change described in TFA, making sizes 64-bit signed integers, is the only method to guarantee that sizes are non-negative in a language that does not have dedicated non-negative integers.

Other programming languages have non-negative integers, but C and C++ and many languages derived from them do not have such integers.

The arithmetic operations with non-negative integers differ from the arithmetic operations of C. On overflows and underflows, they either generate exceptions or have saturating behavior.

cperciva 7 days ago|||
Leaving aside the fact that, yes, unsigned integer types are definitely not negative -- my point wasn't about types at all. Objects cannot take up a negative number of bytes of memory!
alberto-m 7 days ago||||
> An alternative view is that any C "unsigned" is both positive and negative. For example the unsigned short "1" is the same number as "65537" and as "-65535".

This can be disproven by the fact that dividing by `unsigned e = 1U` is well defined and always yields the starting number. If unsigned numbers were really modular numbers, as you suggest, division could not be defined.

adrian_b 7 days ago||
This does not demonstrate anything. It is just additional evidence that the C standard contains contradictory rules about "unsigned" integers.

The oldest parts of the C language are all consistent with "unsigned" numbers being non-negative integers. The implicit conversions between different sizes of "unsigned", the sizeof operator, the relational operators and division are consistent with non-negative integers.

However the first C standard, instead of defining the correct behavior, left undefined many corner cases of the arithmetic operations, allowing the implementation of "unsigned" as either non-negative integers or integer residues.

Eventually, the undefined behaviors for addition, subtraction and multiplication have been defined to be those of integer residues, not those of non-negative integers.

These contradictory properties are the cause of many confusions and bugs.

In extensible languages, like C++, it is possible to define proper non-negative integers and integer residues and bit strings and to always use those types instead of the built-in "unsigned".

In C, it is better to always use signed numbers and avoid unsigned, by casting unsigned to bigger sizes of signed before using such a value.

marshray 7 days ago|||
Are you claiming that the following program could possibly print "-1"?

    #include <stdio.h>
    int main() {
        unsigned short a = 1;
        long b = a;          /* implicit widening of unsigned short to long */
        printf("%ld\n", b);
    }
If not, why?