Posted by HexDecOctBin 1 day ago
https://clang.llvm.org/docs/TypeSanitizer.html
https://www.phoronix.com/news/LLVM-Merge-TySan-Type-Sanitize...
> the functions `recip` and `recip⁺` are not equivalent
Several paragraphs after this got swallowed by the code block.
Edit: Oh, I didn't realize the article is by the author of the book, Modern C. I've seen it recommended in many places.
> The C23 edition of Modern C is now available for free download from https://hal.inria.fr/hal-02383654
This seems to have been fixed now.
To convert it to C syntax, it's a function with roughly this signature:
    void* with_addr(void* ptr, uintptr_t addr)

Where the returned pointer has the address of `addr` and the provenance of `ptr`.

https://github.com/protocolbuffers/protobuf/blob/ae0129fcd01...
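A minimal sketch (my own, not from the article) of how such a function can be written so the result is derived from `ptr`; under a provenance-aware model, computing the new pointer as an offset from the original is what preserves its provenance:

    #include <stdint.h>
    #include <stddef.h>

    /* Return a pointer with the address `addr` but the provenance of `ptr`.
       Deriving the result from `ptr` instead of casting the bare integer is
       what lets the compiler keep tracking which object it may point into.
       Illustrative only; assumes addr stays within the same object. */
    void* with_addr(void* ptr, uintptr_t addr)
    {
        return (char*)ptr + (ptrdiff_t)(addr - (uintptr_t)ptr);
    }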
From the PVI section onward it seems to recover, but if the author sees this please fix and re-convert your post.
[Edited: nope, there are more errors further in the text. This needed proper proofreading before it was posted. I can somewhat struggle through because I already know this topic, but if this was intended to introduce newcomers it's probably very confusing.]
https://godbolt.org/z/qKejzc1Kb
Whereas clang loudly complains,
An identifier is an arbitrarily long sequence of digits, underscores, lowercase and uppercase Latin letters, and Unicode characters specified using \u and \U escape notation (since C99), of class XID_Continue (since C23). A valid identifier must begin with a non-digit character (Latin letter, underscore, or Unicode non-digit character (since C99) (until C23), or Unicode character of class XID_Start (since C23)). Identifiers are case-sensitive (lowercase and uppercase letters are distinct). Every identifier must conform to Normalization Form C (since C23).
In practice it depends on the compiler.
Definitely a questionable choice to throw off readers with Unicode weirdness in the very first code example.
And yeah, the Chinese tone in practice does not align with the idea of "down a little up a lot" either. It depends on context...
I still don't want anything as unpredictable as Unicode in my code. How many different encodings will display as the same variable name and how is the compiler supposed to decide?
If you're thinking of comments and user facing strings, the OP already excluded those.
The answer to your question seems to (still) be "no".
But not for char-typed accesses. And even for larger types, I think you would have to worry about the combo of first memcpying from pointer-typed memory to integer-typed memory, then loading the integer. If you eliminate dead integer loads, then you would have to not eliminate the memcpy.
Or from a pragmatic viewpoint, perhaps if the optimiser eliminates a dead load, then don't mark the pointer as exposed? After all, the whole point is to keep track of whether a synthesised pointer might potentially refer to the exposed pointer's storage. There's zero danger of that happening if the integer load never actually occurs.
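To make that pattern concrete, a small sketch (mine, not from the article): pointer bytes are memcpy'd into integer-typed memory and then loaded, and the question is what happens to "exposure" if the compiler deletes that load as dead.

    #include <stdint.h>
    #include <string.h>

    static int obj;

    void maybe_expose(void)
    {
        int *p = &obj;
        uintptr_t bits;

        /* Copy the pointer's representation into integer-typed storage
           (assumes sizeof(void*) == sizeof(uintptr_t), as on common platforms). */
        memcpy(&bits, &p, sizeof bits);

        /* An integer load whose result is never used: a candidate for dead-load
           elimination. Under an exposure-based model this load is what would mark
           &obj as exposed, hence the tension described above. */
        uintptr_t dead = bits;
        (void)dead;
    }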
> The functions recip and recip⁺ are not equivalent.
This is one of those examples of how optimizing code can improve legibility, robustness, or both.
The first implementation allows for side effects to change the outcome of the function. But the problem is that the code is not written expecting someone to modify the values in the middle of the loop. It's incorrect behavior, and you're paying a performance penalty for it to boot.
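As an illustration of that point (my own sketch, not the article's recip code): a function that re-reads shared data on every iteration lets an aliasing or concurrent write change the answer mid-loop, whereas taking a snapshot up front does not.

    /* Re-reads *scale on every iteration: a write that aliases *scale
       (e.g. through out[i]) can change results partway through the loop. */
    void scale_all(double *out, const double *scale, int n)
    {
        for (int i = 0; i < n; ++i)
            out[i] = out[i] / *scale;
    }

    /* Reads *scale once: later writes can no longer affect this call's outcome,
       and the compiler may keep the value in a register. */
    void scale_all_snapshot(double *out, const double *scale, int n)
    {
        double s = *scale;
        for (int i = 0; i < n; ++i)
            out[i] = out[i] / s;
    }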
Functional Core code tends not to have this problem, in that we pass in a snapshot of data and it either gets an answer or an error.
I've seen too much code that checks 3 times if a user is either still logged in or has permission to do a task, and not one of them was set up to deal with one answer for the first call and a different one for any of the subsequent ones. They just go into undefined behavior.
Or better yet, don't let 'social pressure' influence your choice of programming language ;)
If your workplace has a clear rule to not use memory-unsafe languages for production code that's a different matter of course. But nothing can stop you from writing C code as a hobby - C99 and later is a very enjoyable and fun language.
It’s hard. Programming is a social discipline, and the more people who work in a language, the more love it gets.
Of course it's still really nice to just have C itself being updated into something that's nicer to work with and easier to write safely, but Zig seems to be a decent other option.
Does Zig document the precise mechanics of noalias? Does it provide a mechanism for controllably exposing or not exposing provenance of a pointer? Does it specify the provenance ABA problem in atomics on compare-exchange somehow or is that undefined? Are there any plans to make allocation optimizations sound? (This is still a problem even in Rust land; you can write a program that is guaranteed to exhibit OOM according to the language spec, but LLVM outputs code that doesn't OOM.) Does it at least have a sanitizer like Miri to make sure UB (e.g. data races, type confusion, or aliasing problems) is absent?
If the answer to most of the above is "Zig doesn't care", why do people even consider it better than C?
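For reference, here is roughly what the provenance ABA situation on compare-exchange looks like, sketched in C with <stdatomic.h> (the names and the single-threaded setup are mine, purely illustrative):

    #include <stdatomic.h>
    #include <stdlib.h>

    static _Atomic(int *) slot;

    void aba_sketch(void)
    {
        int *a = malloc(sizeof *a);
        atomic_store(&slot, a);

        int *expected = atomic_load(&slot);
        free(a);
        int *b = malloc(sizeof *b);   /* may be handed the same address as a */
        atomic_store(&slot, b);

        /* If b reuses a's address, this compare-exchange can succeed even though
           `expected` carries the provenance of a freed object; which provenance the
           comparison and the stored value are deemed to have is the open question. */
        int *desired = malloc(sizeof *desired);
        atomic_compare_exchange_strong(&slot, &expected, desired);
    }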
In a sibling comment, I mentioned a proof of concept I did that, if I had the time to complete it properly, should give you near-Rust-level checking on memory safety, plus automatically flag sites where you need to inspect the code [0]. At the point where you are using Miri, you're already bringing extra stuff into Rust, so in practice zig + zig-clr could be the equivalent of "what if you moved borrow checking from rustc into Miri".
[0] type erasure, or using "known dangerous types, like C pointers, or non-slice multipointers".
As for null pointer problems, while they may result in CVEs, they’re a pretty minor security concern since they generally only result in denial of service.
Edit 2: Here's some data: In an analysis by Google, the "most frequently exploited" vulnerability types for zero-day exploitation were use-after-free, command injection, and XSS [3]. Since command injection and XSS are not memory-unsafety vulnerabilities, that implies that use-after-frees are significantly more frequently exploited than other types of memory unsafety.
Edit: Zig previously had a GeneralPurposeAllocator that prevented use-after-frees of heap allocations by never reusing addresses. But apparently, four months ago [1], GeneralPurposeAllocator was renamed to DebugAllocator and a comment was added saying that the safety features "require the allocator to be quite slow and wasteful". No explicit reasoning was given for this change, but it seems to me like a concession that applications that need high performance generally shouldn't be using this type of allocator. In addition, it appears that use-after-free is not caught for stack allocations [2], or allocations from some other types of allocators.
Note that almost the entire purpose of Rust's borrow checker is to prevent use-after-free. And the rest of its purpose is to prevent other issues that Zig also doesn't protect against: tagged-union type confusion and data races.
[1] https://github.com/ziglang/zig/commit/cd99ab32294a3c22f09615...
[2] https://github.com/ziglang/zig/issues/3180.
[3] https://cloud.google.com/blog/topics/threat-intelligence/202...
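For concreteness, a minimal C sketch (hypothetical) of the heap use-after-free that an address-non-reusing allocator, or Rust's borrow checker, is meant to rule out:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int *p = malloc(sizeof *p);
        *p = 42;
        free(p);

        int *q = malloc(sizeof *q);   /* a reusing allocator may hand back p's address */
        *q = 7;

        /* Use-after-free: may print 7, 42, crash, or anything else. An allocator
           that never reuses addresses makes this far more likely to be caught. */
        printf("%d\n", *p);
        return 0;
    }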
Anyways, I am optimistic that UAF can be prevented by static analysis:
https://www.youtube.com/watch?v=ZY_Z-aGbYm8
Note that since this sort of technique interfaces with the compiler, unless the dependency is in a .so file, it will detect UAF in dependencies too, whether or not the dependency chooses to run the static analysis as part of its software quality control.
On one side are the many C++ “static analyzers” like Coverity or clang-analyzer, which work with unannotated C++ code. On the other side is the “Safe C++” proposal (safecpp.org), which is supposed to achieve full safety, but at the cost of basically transplanting Rust’s type system into C++, requiring all functions to have lifetime annotations, disallowing mutable aliasing, and replacing the entire standard library with a new one that follows those rules. Between those two extremes there have been tools like the C++ Core Guidelines Checker and Clang’s lifetimebound attribute, which require some level of annotations, and in turn provide some level of checking.
So far, none of these have been particularly successful in preventing memory safety vulnerabilities. Static analyzers are widely used in industry but only find a fraction of bugs. Safe C++ will probably be too unpopular to make it into the spec. The intermediate solutions have some fundamental issues (see [1], though it’s written by the author of Safe C++ and may be biased), and in practice haven’t really taken off.
But I admit that only the “static analyzer” side of the solution space has been extensively explored. The other projects are just experiments whose lack of adoption may be due to inertia as much as inherent lack of merit.
And Zig may be different… I’m not a Zig programmer, but I have the impression that compared to C++ it encourages fewer allocations and smaller codebases, both of which may make lifetime analysis more tractable. It’s also a much younger language whose audience is necessarily much more open to change.
So we’ll see. Good luck - I’d sure like to see more low-level languages offering memory safety.
"I think this could be possible" isn't an enabling technology. If you write hard SF it's maybe useful to distinguish things which could happen from those which can't, but for practical purposes it only matters if you actually did it. Sean's proposed "Safe C++" did it, Zig, today, did not.
There are other obstacles - like adoption, as we saw for "Safe C++" - but they're predicated on having the technology at all, you cannot adopt technologies which don't exist, that's just make believe. Which I think is already the path WG21 has set out on.
Not just that, but the committee accepted a paper that basically says its design is against C++'s design principles, so it's effectively dead forever.
Here's somebody who was in the room explaining how this was agreed as standing policy for the C++ programming language.
"It was literally the last paper. Seen at the last hour. Of a really long week. Most everyone was elsewhere in other working group meetings assuming no meaningful work was going to happen."
CaSe Sensitivity
Weird pointer syntax
Lack of a separate assignment token
Null terminated strings
Macros - the evil scourge of the universe
On the plus side, it's installed everywhere, and it's not indent sensitive.
> CaSe Sensitivity
Wait, what, you.. you want a case-insensitive language? Like SQL?
I love SQL, but please no more case-insensitive programming languages!
What does that mean?
True is true, and false is false. If you're wondering whether this Doodad is Wibbly, you should ask that question, not rely on a convention that Wibbly Doodads are somehow "truthy" while the non-Wibbly ones are not.
I will be very surprised if there's widespread adoption of Fil-C for many new projects though.
That's half tongue in cheek. I am fluent in three languages, but I program "in English" and I greatly appreciate that my colleagues who are fluent in languages other than the ones I'm fluent in (except English) also do. Basically English is the world's lingua franca today. Nonetheless if a company in France wants to use French for their symbol names, or a company in Mexico wants to use Spanish for their symbol names, or a company in China wants to use Chinese for their symbol names, who am I to stop them?! Surely it's not my place.
However, everything else, from spreadsheet software to CAD tools to OS kernels to JavaScript frameworks is universal across cultures and languages. And for better or for worse (I'm not a native English speaker either), the world has gone with English for a lot of code commons.
And the thing with the examples in that post isn't about supporting language diversity, it's math symbols, which are no one's native language. And you pretty much can't type them on any keyboard. Which really makes it a rather poor flex IMHO. Did the author reconfigure their keyboard layout for that specific math use case? It can't generically cover "all of math" either. Or did they copy & paste it around? That's just silly.
[…could some of the downvoters explain why they're downvoting?]
In heavily mathematical contexts, most of those assumptions get turned on their head. Anybody qualified to be modifying a model of electromagnetism is going to be intimately familiar with the language of the formulas: mu for permeability, epsilon for permittivity, etc. With that shared context,
1/(4*π*ε)*(q_electron * q_proton)/r^2 is going to be a lot easier to see, at a glance, as Coulomb's law
compared to
1 / (4 * Math.Pi * permittivity_of_free_space) * (charge_electron * charge_proton) / distance_of_separation
Source code, like any other language built for humans, is meant to be read by humans. If those humans have a shared context, utilizing that shared context improves the quality and ease of that communication.
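As a concrete (hypothetical) illustration of that trade-off, here is the same formula as C code with Unicode identifiers; whether it compiles depends on the compiler's identifier support, and the names are mine:

    /* Requires a compiler that accepts Unicode identifiers (e.g. recent GCC or Clang). */
    static const double π   = 3.14159265358979323846;
    static const double ε_0 = 8.8541878128e-12;   /* vacuum permittivity, F/m */

    double coulomb_force(double q1, double q2, double r)
    {
        return 1.0 / (4.0 * π * ε_0) * (q1 * q2) / (r * r);
    }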
I guess maybe this is an argument for better UI/UX for symbolic input…
Because you made false assertions ("And you pretty much can't type them on any keyboard").
(Unless you're being pedantic because I wrote "keyboard" rather than "keyboard layout", or ignored the qualifying "pretty much". In either of those cases you're unwilling to communicate cooperatively and I can't help you.)
Literally no one, anywhere, wants to be forced to read source written in a language they can't read (or more specifically in this case: written in glyphs they can't even produce on their keyboard). That idea, for almost everyone, seems "horrific", yeah.
So a lingua franca is a firm requirement for modern software development outside of extremely specific environments (FSB malware authors probably don't care about anyone else reading their cyrillic variable names, etc...). Must it be ASCII-encoded English? No. But that's what the market has picked and most people seem happy enough with it.
This is blatantly false. I'd posit that a solid 90% of all source code is written by single, co-located teams (a substantial portion of which are teams of 1). That certainly fits the bill for most companies I've worked at.
Source code is for humans, and thus should be written in whatever way makes it easiest to read, write, and understand for humans. If your language doesn't map onto ASCII, then Unicode support improves that goal. If your code is meant to directly implement some physics formula, then using the appropriate unicode characters might make it easier to read (and thus spot transcription errors, something I find far too often in physics simulations).
I'm a practitioner of neither though, so I can't condemn the practice wholeheartedly as an outsider, but it does make me groan.
Consider a simple programming example: in C, blocks are delimited by `{}`. Why not use `block_begin` and `block_end`? Because it's noisy, and it doesn't take much to internalize the meaning of braces.
This can be especially difficult if the author is trying to map 1:1 to a complex algorithm in a white paper that uses domain-standard mathematical notation.
The alternative is to break the "full formula" into simpler expression chunks, but then naming those partial expression results descriptively can be even more challenging.
It's probably also a great way to introduce almost undetectable security vulnerabilities by using Unicode characters that look similar to each other but in fact are different.
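For illustration (hypothetical identifiers), the kind of confusable pair being described:

    /* These two identifiers render identically in many fonts, but the second spells
       the name with a Cyrillic "а" (U+0430), so it is a different variable. */
    int account_balance = 100;
    int аccount_balance = 0;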
Yes, that would probably be one way to do it.
> Which would violate the whole "Code is meant to be easily read by humans" thing.
I'd think someone who's deliberately and sneakily introducing a security vulnerability would want it to be undetectable, rather than easily readable.
Such a silly issue too, you'd think we'd have come up with some automated wrangling for this, so that those experienced with a codebase can switch over and see super short versions of identifiers, while people new to it all will see the long stuff.
Maybe for code that was written in the early 90's, but the only 'tradition' that has survived is calling the vanilla loop variable 'i'.
My first thought before I saw this was “I wonder: is this going to be an article from people who build things, or something from ‘academics’ that don’t?”
At least it was answered quickly.