Linux eliminates the strncpy API after six years of work, 360 patches

Posted by simonpure 4 days ago

Linux eliminates the strncpy API after six years of work, 360 patches (www.phoronix.com)

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

298 points | 324 comments | page 2

senfiaj 4 days ago

I wonder: why not use a string buffer paired with its length? For example, a struct that holds a char pointer and 2 ints (occupied length + total buffer length), almost like C++'s std::string. This null terminator thing really sucks; it's potentially insecure and often slow.
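A minimal sketch of the layout described above, for illustration only. The type `struct strbuf` and the helpers `strbuf_append` and `strbuf_free` are hypothetical names, and `size_t` is assumed in place of the two ints. The buffer still keeps a trailing NUL so it can be handed to existing C APIs, but nothing in the sketch relies on it to find the length.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type: pointer + used length + capacity. */
struct strbuf {
    char  *data;  /* heap storage, kept NUL-terminated for C interop */
    size_t len;   /* bytes in use, excluding the terminator */
    size_t cap;   /* bytes allocated, including room for the terminator */
};

/* Append n bytes from src. Returns 0 on success, -1 on overflow or
 * allocation failure (the buffer is left unchanged in that case). */
static int strbuf_append(struct strbuf *sb, const char *src, size_t n)
{
    if (n > SIZE_MAX - sb->len - 1)      /* len + n + 1 would overflow */
        return -1;

    size_t need = sb->len + n + 1;
    if (need > sb->cap) {
        size_t cap = sb->cap ? sb->cap : 16;
        while (cap < need) {
            if (cap > SIZE_MAX / 2) {    /* doubling would overflow */
                cap = need;
                break;
            }
            cap *= 2;
        }
        char *p = realloc(sb->data, cap);
        if (!p)
            return -1;
        sb->data = p;
        sb->cap = cap;
    }

    memcpy(sb->data + sb->len, src, n);  /* no rescan: the end is known */
    sb->len += n;
    sb->data[sb->len] = '\0';
    return 0;
}

/* Release the storage and reset the buffer to the empty state. */
static void strbuf_free(struct strbuf *sb)
{
    free(sb->data);
    sb->data = NULL;
    sb->len = 0;
    sb->cap = 0;
}
```

A zero-initialized `struct strbuf` is a valid empty buffer in this sketch, so callers can start with `struct strbuf sb = {0};`.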

WalterBright 4 days ago

Wonder no longer!

https://dlang.org/spec/arrays.html#dynamic-arrays

and

https://dlang.org/spec/arrays.html#strings

and for C:

https://digitalmars.com/articles/C-biggest-mistake.html

maxlybbert 3 days ago

It's definitely possible. And common, at least in some projects. The only real drawback is that sloppiness will lead to multiple slightly different nonstandard string types in the same project.

GalaxyNova 4 days ago

Yes, I have seen it happen a few times, with `strlen` being called in a loop and silently turning O(N) into O(N^2).
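A small sketch of the pattern being described, with hypothetical function names. Because the loop body writes to the string, a compiler generally cannot assume the length is unchanged, so the first version may rescan the whole string on every iteration.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Potentially quadratic: strlen(s) is re-evaluated in the loop condition,
 * and each call walks the string up to its terminator. */
static void upcase_slow(char *s)
{
    for (size_t i = 0; i < strlen(s); i++)
        s[i] = (char)toupper((unsigned char)s[i]);
}

/* Linear: the length is computed once and reused. */
static void upcase_fast(char *s)
{
    size_t n = strlen(s);
    for (size_t i = 0; i < n; i++)
        s[i] = (char)toupper((unsigned char)s[i]);
}
```

Both functions produce the same result; only the number of passes over the string differs.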

jkrejcha 4 days ago

Reminds me of an article[1] whose author describes how he cut GTA Online loading times by 70%; strlen was effectively being called for every character in a string.

[1]: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times...

sweetjuly 4 days ago

I remember reading this blog post when it was first published, but the subsequent updates are better than I would've ever expected this to turn out. Worth checking it out again if you've seen it before :)

senfiaj 4 days ago

Exactly, you can't write clean, concise code when working with C strings. Almost every C string manipulation adds cognitive load: "Is the buffer big enough (including the null terminator), or should I reallocate it?", "I need to keep the offset from the last concat to make the next concats performant", "Umm, should I put the null terminator at i or i + 1?"... It really sucks; it's akin to death by a thousand cuts.

sgerenser 4 days ago

Joel Spolsky coined the term “Shlemiel the Painter’s Algorithm” for this type of thing back in 2001: https://www.joelonsoftware.com/2001/12/11/back-to-basics/
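One way to avoid the rescanning that repeated `strcat` calls incur is to carry the end position forward between appends, as mentioned a few comments up. The helper below is a hypothetical sketch (the name `append_at` and its return convention are assumptions, not an existing API): `*end` always points at the current terminator, and `limit` points one past the last byte of the buffer.

```c
#include <stddef.h>
#include <string.h>

/* Append src at *end without rescanning the destination.
 * Returns 0 on success, or -1 if src plus its terminator does not fit,
 * in which case the buffer is left untouched. */
static int append_at(char **end, const char *limit, const char *src)
{
    size_t n = strlen(src);
    size_t avail = (size_t)(limit - *end);  /* bytes left, terminator included */

    if (n >= avail)
        return -1;

    memcpy(*end, src, n + 1);               /* copies the terminator too */
    *end += n;
    return 0;
}
```

A caller would start with `buf[0] = '\0'` and `char *end = buf;`, then pass `buf + sizeof buf` as `limit` on each call.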

bnolsen 4 days ago

That's called a fat pointer. Null-terminated C strings account for the majority of memory errors out there.

none_to_remain 4 days ago

The size overhead of that is 2*sizeof(int) while the overhead of null termination is sizeof(char). If I remember the standard right, the former is worse by at least sizeof(char), and usually more in practice. This used to matter, sometimes still does.

kgeist 4 days ago

I would assume the difference is mostly negligible in practice due to the allocator rounding up the allocated memory size at least by the word size anyway (for alignment and simpler bookkeeping). You can also use variable-length encoding in the header to use 1 byte for most cases, similar to how UTF-8 does it: if the most significant bit is not set, we assume a 7-bit encoding, which can represent string lengths up to 127 using 1 byte, which is probably 99% of strings.

senfiaj 4 days ago

Well, not saying to always use it, but if the string size is big enough, the overhead of 2 ints becomes relatively vanishing. For generic dynamically sized strings it probably has more advantages than disadvantages. But in any case, sure, if every single byte matters or some structure requires specific memory layout, then fine. I just don't think these things are the majority of use cases. Keep in mind that the cached lengths can increase performance, since you don't have to recalculate string lengths.

lelanthran 3 days ago

> Well, not saying to always use it, but if the string size is big enough, the overhead of 2 ints becomes relatively vanishing.

In that case, the fix is not to change C strings (breaking a lot of existing code), but to introduce a stringbuilder type.

senfiaj 3 days ago

You can still use a null terminator for compatibility (std::string does this), but just not rely on it in your own code.

ekaryotic 3 days ago

I am a terrible hobby C programmer who doesn't understand pointers, but surely a symmetric approach doesn't have the overhead or the bug. That is to say, if the language were designed to work in single-bit pairs of a string character in conjunction with a string length character, assuming a fail-safe design of one dummy string character, then if a bug happens in the code there's no overflow, because the length can never be shorter than the character.

chiph 4 days ago

Pascal did/does this, but eventually someone wants a string longer than the size portion can handle. Or wants the number of characters not the number of bytes.

jerf 4 days ago

I wasn't a programmer in those days, so I don't know if there's some other major concern that would kill this, but I sometimes wonder whether we could have / should have used variable-length integers. That is, something like: strings of 0-127 bytes get one byte of length prefix, 128 - 16383 get two bytes of prefix, and the probably-rare 16384 - 2097151 strings would end up with three, though proportionally by that point it's hardly anything. Or you could use the UTF-8 mechanism for packing the bytes, though that costs more and probably doesn't get anything we'd care about in the 1980s or 1990s.

It's a bit of extra code, yes. Not necessarily all that much, but some. On average it is only slightly more expensive than null termination, and considered as a proportion of the size of the strings themselves it's hardly anything. It's probably better than the strings getting hard-limited to 0-255, though, which was quite frequently a user-visible quirk.
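The variable-length prefix described here can be sketched with a 7-bits-per-byte encoding in the style of LEB128, where the high bit of each byte says whether another byte follows. The function names are hypothetical and the byte order (low group first) is an assumption; the thread does not specify one. With this scheme, lengths 0-127 take one byte, 128-16383 take two, and 16384-2097151 take three, matching the ranges above.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode len into out, 7 bits per byte, low group first.
 * Returns the number of bytes written (at most 10 for a 64-bit value). */
static size_t varlen_encode(uint64_t len, unsigned char *out)
{
    size_t i = 0;
    while (len >= 0x80) {
        out[i++] = (unsigned char)((len & 0x7F) | 0x80);  /* more follows */
        len >>= 7;
    }
    out[i++] = (unsigned char)len;                        /* final byte */
    return i;
}

/* Decode a length from at most avail bytes of in.
 * Returns the number of bytes consumed, or 0 if the input ends before
 * the final byte or runs past 10 bytes. */
static size_t varlen_decode(const unsigned char *in, size_t avail, uint64_t *len)
{
    uint64_t v = 0;
    for (size_t i = 0; i < avail && i < 10; i++) {
        v |= (uint64_t)(in[i] & 0x7F) << (7 * i);
        if (!(in[i] & 0x80)) {
            *len = v;
            return i + 1;
        }
    }
    return 0;
}
```

The decoder is where the "bit of extra code" shows up: every string access that needs the length pays for a short loop instead of a fixed-width load.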

Parodper 3 days ago

You could start the encoding with two bytes, so that if the most significant bit of the first byte is 0, the length is that byte plus another. That gives you 32KiB strings with just a byte more. Short strings might suffer, but I think the overhead is reasonable.

The next level (110x xxxx) would give you 8MiB strings, which are going to be fine for most things.

senfiaj 3 days ago

A 32-bit int isn't too much overhead: just 3 additional bytes. I bet it's almost always better than C-style strings. In the vast majority of situations the price isn't that bad, considering you make strings much more secure and potentially faster in string manipulations.

jerf 2 days ago

32-bit is so little overhead that we don't blink at adding 64 to our strings nowadays, because of the benefits we get from alignment.

But remember the first Macintosh shipped with 128KB of RAM, 131,072 bytes. Three more bytes per string hurts a lot more there...

... although, that said, even in that era, given the number of errors that null-terminated strings caused, even completely ignoring security, I do still wonder if defaulting to 2 bytes of length and doing something special for strings over 64K wouldn't have been the right tradeoff, even in the case of short strings. Today we mostly focus on security, but null-terminated strings also caused a lot of just plain-old bugs. But so did 1-byte length strings... it's way too easy to run out of 255 characters even on those old systems.

Johanx64 4 days ago

Dude, every sane language out there does this, generally with a 4-byte prefix. Null-terminated stuff has always been backwards-compat stuff.

Pascal strings (historically, and this is why people even remember it being an issue) were up to 255 chars in size; beyond that you had to use a different string type.

You might still want raw pointers for all sorts of low-level stuff, but you almost never want null-terminated strings for anything but back-compat. They're one of the worst things ever, even on memory-constrained systems.

pjmlp 3 days ago

And then anyone that isn't stuck in 1976 will use open arrays.

MBCook 4 days ago

A lot of them are strings coming from or going to user space right? So wouldn’t you have to do constant conversions?

D-Coder 4 days ago

Note that "360 Patches" is 360 uses of strncpy that have been removed, not necessarily bugs.

dpark 3 days ago

I would imagine 360 patches removed way more than 360 uses of strncpy. But yeah, it’s not a given that each of these patches addressed a bug. (Also not a given that there were only 360 bugs fixed.)

rswail 3 days ago

In all the comments in this thread it's interesting how people confuse:

* NUL: An ASCII non-printing character with the byte value of 0

* NULL: A pointer that does not point to usable memory with the value that compiles in C to be equal to ((void *) 0).

layer8 3 days ago

NUL was always just an abbreviation for null: https://www.rfc-editor.org/rfc/rfc20.html#section-4

I don’t think anyone in this thread is confusing the null character with the null pointer.

rswail 3 days ago

I've seen a lot of confusion, where people are talking about checking for a NULL at the end of a list of pointers which is very different to a NUL at the end of a string.

Yes it was an abbreviation in ASCII, as are all the non-printable first 32 codes.
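The two sentinels under discussion can be put side by side in a short sketch. The function names are hypothetical; the point is only that one loop compares a `char` against `'\0'` while the other compares a pointer against `NULL`.

```c
#include <stddef.h>

/* Counts characters up to the NUL character that ends a C string. */
static size_t count_chars(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}

/* Counts entries up to the NULL pointer that ends a pointer list,
 * in the style of argv or envp. */
static size_t count_entries(char *const *list)
{
    size_t n = 0;
    while (list[n] != NULL)
        n++;
    return n;
}
```

An argv-style array such as `{"ls", "-l", NULL}` uses both conventions at once: a NULL pointer ends the list, and a NUL character ends each string in it.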

stcg 3 days ago

I wonder what the difficulty is in rewriting strncpy uses that makes it take six years. Was it widespread? Or was it more of a long-running effort, where a use was only changed if there were other changes in the same file? Or is there something else that makes it difficult?

kstenerud 3 days ago

strncpy is 99.999% of the time NOT the correct function to call, so this is a huge win.

It's just a shame that such a confusing name was chosen for such a niche use case (fixed width records that require null padding).
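The niche use case mentioned here can be sketched as follows. The record layout, the field width, and the helper names are all hypothetical. The field is deliberately not a C string: `strncpy` fills any unused bytes with NULs, and a name that exactly fills the field has no terminator at all, so readers must bound every access by the field width.

```c
#include <stddef.h>
#include <string.h>

#define NAME_WIDTH 8  /* hypothetical fixed field width */

struct record {
    char name[NAME_WIDTH];  /* NUL-padded, possibly unterminated */
};

/* Store name into the fixed-width field: short names are padded with
 * NULs, long names are cut at NAME_WIDTH with no terminator. */
static void record_set_name(struct record *r, const char *name)
{
    strncpy(r->name, name, sizeof r->name);
}

/* Length of the stored name, never reading past the field. */
static size_t record_name_len(const struct record *r)
{
    const char *nul = memchr(r->name, '\0', sizeof r->name);
    return nul ? (size_t)(nul - r->name) : sizeof r->name;
}
```

Passing `r->name` to `strlen` or `printf("%s")` would be a bug whenever the name fills the field, which is the trap most callers of `strncpy` fall into.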

DerSaidin 3 days ago

strtomem_pad seems redundant with memcpy_and_pad, and also it requires the preprocessor: https://github.com/torvalds/linux/blob/1a3746ccbb0a97bed3c06...

I was curious: Why have it, instead of just using memcpy_and_pad?

AI's answer (paraphrased) was:

* Avoids possible bugs from manually writing sizeof(dest)

* Enforces the __nonstring attribute

* Signals "I am converting an actual C string into a fixed-width legacy memory field" rather than "copy binary data and pad it"

Interesting to learn about the __nonstring attribute:

https://github.com/torvalds/linux/blob/1a3746ccbb0a97bed3c06...

https://github.com/search?q=repo%3Atorvalds%2Flinux+__nonstr...
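A userspace sketch of why a macro helps here, under stated assumptions: the names below are hypothetical and this is not the kernel's implementation. A function receives only a pointer, so the caller has to pass the destination size by hand. A macro can apply `sizeof` to the destination array itself, which removes one opportunity to pass the wrong size.

```c
#include <stddef.h>
#include <string.h>

/* Length of s, looking at no more than max bytes. */
static size_t bounded_len(const char *s, size_t max)
{
    size_t n = 0;
    while (n < max && s[n] != '\0')
        n++;
    return n;
}

/* Copy up to count bytes into dest and fill the rest of dest with pad.
 * The caller must supply dest_len correctly. */
static void sketch_copy_and_pad(void *dest, size_t dest_len,
                                const void *src, size_t count, int pad)
{
    if (count > dest_len)
        count = dest_len;
    memcpy(dest, src, count);
    memset((char *)dest + count, pad, dest_len - count);
}

/* Macro form: dest must be an array (not a pointer), so sizeof(dest)
 * is the real field width. The result is padded, not NUL-terminated. */
#define SKETCH_STR_TO_FIELD(dest, src, pad)                              \
    do {                                                                 \
        const char *src_ = (src);                                        \
        size_t n_ = bounded_len(src_, sizeof(dest));                     \
        sketch_copy_and_pad((dest), sizeof(dest), src_, n_, (pad));      \
    } while (0)
```

If `dest` were a `char *`, `sizeof(dest)` would silently become the pointer size; the sketch does not guard against that, whereas a production macro would want a compile-time check.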

GTP 3 days ago

I always thought that strncpy was the safe alternative to strcpy. Now that I think of it, I'm unsure whether the NUL terminator is counted in strncpy's size or not, which would be a likely source of errors. But could someone explain better what the problems were? And also, is having to pick the right function from the list of given alternatives really much better?

GabrielTFS 3 days ago

The issue with strncpy is that it doesn't actually necessarily terminate - in fact in any case where the source is larger than the destination it will just leave it unterminated (like, it will copy the last character it can from the source instead of terminating the destination string with a NUL)
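To make the behavior concrete: the size passed to `strncpy` is the total number of bytes it writes to the destination, and a terminator is written only if the source is shorter than that size. The helpers below are a hypothetical sketch; `copy_truncating` shows the common manual workaround, which still truncates silently.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if the dst_size bytes at dst contain a NUL, i.e. dst can
 * safely be treated as a C string. */
static int is_terminated(const char *dst, size_t dst_size)
{
    return memchr(dst, '\0', dst_size) != NULL;
}

/* Workaround pattern: copy at most dst_size - 1 bytes with strncpy,
 * then terminate by hand. Truncation is not reported to the caller. */
static void copy_truncating(char *dst, size_t dst_size, const char *src)
{
    if (dst_size == 0)
        return;
    strncpy(dst, src, dst_size - 1);
    dst[dst_size - 1] = '\0';
}
```

With a 4-byte destination, `strncpy(dst, "abcdef", 4)` leaves the bytes `a`, `b`, `c`, `d` and no terminator, so `is_terminated(dst, 4)` returns 0, while `copy_truncating(dst, 4, "abcdef")` yields the string `"abc"`.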

rurban 3 days ago

No, the safe alternatives end with _s. They do check matching buffer sizes, and enforce zero-termination. Unfortunately WG14 hates them also, because Microsoft. Microsoft did indeed break some of them, but you can use better alternatives, like my safeclib.

GTP 3 days ago

> No, the safe alternatives end with _s.

Could you please elaborate on this? Both `man strncpy_s` and `man strcpy_s` didn't return any manual page on my Linux system.

rurban 3 days ago

Search more. It's in safeclib and in the C standard

devsda 4 days ago

Did anybody else misunderstand the title as removing the strncpy function for Linux users?

For a moment, I misunderstood it as (g)libc removing strncpy and was worried about the trouble it's going to cause.

henrypoydar 3 days ago

No code is faster than no code.

naturalmovement 4 days ago

A reminder that we've had strlcpy[1] for ~ 30 years but it was never accepted into the Linux world because of typical petty open source bullshit. This is why we can't have nice things.

[1] https://man.openbsd.org/strlcpy

ericbarrett 4 days ago

The Linux kernel had strlcpy over 20 years ago. It was removed in favor of strscpy because the latter was judged a better interface. Here's a 2022 article: https://lwn.net/Articles/905777/
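The interface difference can be sketched in userspace. Both functions below are hypothetical stand-ins, not the BSD or kernel implementations. The strlcpy-style function returns the full length of the source, so it has to walk the entire source even when only a prefix fits. The strscpy-style function reads at most `size` bytes of the source and reports truncation as an error; the kernel's strscpy returns -E2BIG in that case, while this sketch returns -1 and uses `long` instead of `ssize_t` to stay within standard C.

```c
#include <stddef.h>
#include <string.h>

/* strlcpy-style: always terminates when size > 0 and returns
 * strlen(src); the caller detects truncation by checking ret >= size. */
static size_t sketch_strlcpy(char *dst, const char *src, size_t size)
{
    size_t srclen = strlen(src);  /* reads all of src, however long */
    if (size > 0) {
        size_t n = srclen < size - 1 ? srclen : size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return srclen;
}

/* strscpy-style: always terminates when size > 0, never reads more than
 * size bytes of src, and returns the number of bytes copied (excluding
 * the NUL), or -1 if the copy was truncated or size is 0. */
static long sketch_strscpy(char *dst, const char *src, size_t size)
{
    size_t i;

    if (size == 0)
        return -1;
    for (i = 0; i < size - 1 && src[i] != '\0'; i++)
        dst[i] = src[i];
    dst[i] = '\0';
    return src[i] == '\0' ? (long)i : -1;
}
```

Both leave the same bytes in the destination; they differ in how much of the source they touch and in how truncation is signaled.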

avadodin 3 days ago

Returning an error is better but you're using ssize_t which is a tradeoff.

The race conditions appear to be a result of the Linux kernel implementation but UNIX style syscalls introduce these races by default. It is not an inherent flaw of the API or even the implementation Linux was using.

The only useable C string API has always been memcpy anyways.

BoingBoomTschak 4 days ago

Actually, glibc 2.38 has it.

naturalmovement 4 days ago

Wow it only took them 26 years to import a 30 line C function, a third of which is comments?

I should have sent them a nice fruit basket to commemorate the occasion.
