Posted by todsacerdoti 2 days ago
The mechanism itself seems reasonable, but I am surprised that something like this doesn't already exist in the kernel.
There were flags proposed that allowed the seal to be ignored.
>So you say "we can't munmap in this *one* place, but all others ignore the sealing".
Later was the spice.
>And dammit, once something is sealed, it is SEALED. None of this crazy "one place honors the sealing, random other places do not".
And later, even spicier, Linus says that seals cannot be ignored and that is non-negotiable. Any further suggestions to ignore a seal via a flag would result in the person being added to Linus' ignore list. (He, of course, said this with some profanities and capitals sprinkled in.)
> I don't think you understand the problem space well enough to come up with your own solution for it. I spent a year on this, and ship a complete system using it. You are asking such simplistic questions above it shocks me.
https://lwn.net/ml/linux-kernel/95482.1697587015@cvs.openbsd...
That was beautiful.
Demonstrated the difference in design/engineering philosophies from two of the greats.
It's particularly impressive how misguided the patch is given that they took inspiration from the OpenBSD API implementation, changed both API & implementation, & then argued with both Linus and Theo who started Linux & OpenBSD respectively and were trying to give direct feedback about how OpenBSD is different and why it took the approach it did.
Hopefully the author has taken the more forceful feedback as a learning opportunity about listening when the people giving feedback have a lot more experience. Or perhaps their team is coaching them about what went wrong, now that this has become so visible.
From Matthew Wilcox who is another senior Linux maintainer:
> I concur with Theo & Linus. You don't know what you're doing. I think the underlying idea of mimmutable() is good, but how you've split it up and how you've implemented it is terrible.
It's delivered directly and bluntly but it's not mean or personal. The author proposed a bad patch & argued from a position of ignorance.
Isn't that the Linux kernel in a nutshell?
Off the top of my head I can name:
- The zram maintainer who has for a few years been blocking the patches that add the zpool API to zram. This would go a long way to unifying zram and zswap in the future.
- TuxOnIce being permablocked by some heels-in-the-sand maintainers until the kernel diverged too much and the patch writer gave up. This one would have fixed hibernation on Linux. In other words, hibernation still sucks because of these maintainers.
- Of course the trench warfare all the C-proficient maintainers are waging to chase all the Rust-in-Linux people away, lest they have to learn Rust. ~80% of CVEs are memory-related, so you could say that in, say, 10 years' time, ~80% of the CVEs happening in the Linux kernel will be the legacy of Ted & co.
1. Wrong and ignorant
2. Wrong and knowledgeable
3. Right and ignorant
4. Right and knowledgeable
There's also when there's just a disagreement of opinion because you weight the tradeoffs differently, in which case there's less of a right or wrong.
The main difference is that in those cases the people involved were in camps 2-4 or simply weighted tradeoffs differently. In this case the author seemed much more clearly in camp 1.
Regarding the claim that "this one would have fixed hibernation on Linux" - maybe, but it's hard to evaluate a road not taken. It could have made other tradeoffs or caused other issues down the road that aren't visible to you right now because it didn't get mainlined, or maybe hibernation would still have been broken. Or maybe it would have been hunky-dory. But it kind of doesn't matter because the author in this case was clearly wrong (i.e. the patch wouldn't have achieved the goal it set out to do and would have predictably caused vulnerabilities to become possible in the future as the kernel evolved).
As for Rust-in-Linux, it sounds like you're not actually up to date [1]. I'll forgive Ted's emotional volatility at the infamous filesystem talk as he's been much more calm about it now:
> There is a need for documentation and tutorials on how to write filesystem code in idiomatic Rust. He said that he has a lot to learn; he is willing to do that, but needs help on what to learn.
I'd say "trench resistance" not warfare. The Rust movement is trying to shift the daily work of ~15k developers in a mostly bazaar development model. There's going to be resistance and push back and some of it will feel unfair to the Rust folks and some will feel unfair to the C folk and both will be right.
I was speaking in general terms, of how recognizable the "heels-in-the-sand" attitude is. In this particular case, yes, it does sound right that the current "mseal" proposal / patches are not up to snuff.
Edit: also reading through it again, props to Theo's response. Stern but with a lot of edification.
> Regarding the claim that "this one would have fixed hibernation on Linux" - maybe, but it's hard to evaluate a road not taken. It could have made other tradeoffs or caused other issues down the road that aren't visible to you right now because it didn't get mainlined or maybe hibernation would still have been broken.
The nice thing about TuxOnIce is that it would have pushed large parts of hibernation into userspace, which would have made it much more versatile and much easier to iterate on.
> As for Rust-in-Linux, it sounds like you're not actually up to date [1].
One of the main Rust spearhead developers felt the need to quit, and almost immediately thereafter Asahi Lina posted an article about how "a subset of C kernel developers just seem determined to make the lives of the Rust maintainers as difficult as possible." By my count that is 0-2. How is that "some of it will feel unfair to the Rust folks and some will feel unfair to the C folk"?
> I'd say "trench resistance" not warfare. The Rust movement is trying to shift the daily work of ~15k developers in a mostly bazaar development model.
They're not trying to shift the daily work of ~15k developers. They're asking for some shims, stubs, adaptations, and some general flexibility & willingness. They're not expecting everyone to write Rust within the next 18 months.
> One of the main Rust spearhead developers felt the need to quit, and almost immediately thereafter Asahi Lina posted an article about how "a subset of C kernel developers just seem determined to make the lives of the Rust maintainers as difficult as possible." By my count that is 0-2.
I again refer you to the article I linked. The drama has greatly subsided and it seems like people are talking through the differences. While there's some unfortunate intransigence (which may have since been clarified btw), there's also a lot of premature eagerness for pushing Rust into Linux. It's a shame for sure that Wedson burned out and left. If nothing else at least 6 months after that departure the conversation tonally is very very different.
> Linus Torvalds admonished the group that he did not want to talk about every subsystem supporting Rust at this time; getting support into some of them is sufficient for now. When Airlie asked what would happen when some subsystem blocks progress, Torvalds answered "that's my job"
> Torvalds pointed out that there are kernel features that are currently incompatible with Rust; that is impeding Rust support overall. He mentioned modversions in particular; that problem is being worked on. The list of blocking features is getting shorter, he said, but it still includes kernel features that people need.
> Brauner said that nobody has ever declared that the filesystem abstractions would not be merged; the discussion is all about the details of how that will happen.
Important to note that the filesystem abstraction conflict is what led to the resignation in the first place. The maintainers are taking a very different position, possibly because of non-public conversations that have happened in the intervening time.
> Gleixner said that, 20 years ago, there had been a lot of fear and concern surrounding the realtime kernel work; he is seeing the same thing now with regard to Rust. We cannot let that fear drive things, he said. Torvalds said that Rust has been partially integrated for two years. That is nothing, he said; the project to build the kernel with Clang took a decade, and that was the same old language.
> As the session closed, though, the primary outcome may well have been expressed by Torvalds, who suggested telling people that getting kernel Rust up to production levels will happen, but it will take years.
It can be frustrating to work on something when the time scale for any kind of observable success can be a decade or more, especially when you see the obstacles as other people on the project. I know I would not have the patience and I'm a huge fan of Rust. Keep in mind that Rust itself isn't actually able to support all the kernel use-cases. One example is platform coverage: the kernel needs to support platforms that Rust does not support as "tier 1". The most likely success case will be having the Rust frontend be able to use GCC as the backend, and that work is nowhere near done (neither is getting tier-1 automation support for those platforms). Those themselves are years out and so it's understandable that maintainers are trying to figure out how the proposed abstractions will work and how roles and responsibilities will be managed.
> They're not trying to shift the daily work of ~15k developers. They're asking for some shims, stubs, adaptations, and some general flexibility & willingness. They're not expecting everyone to write Rust within the next 18 months.
I mean it's pretty clear that the goal of the "Rust in Linux" project is to stop having any new C code in the Linux kernel. No one is claiming in 18 months but that's clearly the direction and it shouldn't be surprising for there to be an emotional reaction to that, especially from the maintainers who might perceive that as a threat if they only know C, even if that seems illogical to you. And more importantly, maintainers need to do a lot of work and they hadn't signed up to maintain Rust abstractions so there's another emotional aspect of it where they feel someone else is imposing a new environment on them.
C'mon, let's not lie.
You'd be surprised. My understanding from folks on Chrome OS is they've already shed most, if not all, of the most experienced old hands. (n.b. Chrome OS was absorbed by Android and development on it has, by and large, ceased, according to the same sources directly, and indirectly via Blind.)
Mentorship from one of those people would be valuable, but arguing with them about the implementation of something you don't understand isn't how you get that mentorship.
Actually that's exactly how you can provoke them into explaining themselves. Too many experienced people sit in their ivory towers and basically appeal to their authority, "I wrote a kernel therefore I know everything, <insert profanity here>". There's a reason why Cunningham's Law exists: "The best way to get the right answer on the Internet is not to ask a question, it's to post the wrong answer."
From: Theo de Raadt <deraadt-AT-openbsd.org>
To: Jeff Xu <jeffxu-AT-google.com>
> On Wed, Oct 18, 2023 at 8:17 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Let's start with the purpose. The point of mimmutable/mseal/whatever is
> > to fix the mapping of an address range to its underlying object, be it
> > a particular file mapping or anonymous memory. After the call succeeds,
> > it must not be possible to make any address in that virtual range point
> > into any other object.
> >
> > The secondary purpose is to lock down permissions on that range.
> > Possibly to fix them where they are, possibly to allow RW->RO transitions.
> >
> > With those purposes in mind, you should be able to deduce for any syscall
> > or any madvise(), ... whether it should be allowed.
> >
> I got it.
>
> IMO: The approaches mimmutable() and mseal() took are different, but
> we all want to seal the memory from attackers and make the linux
> application safer.
I think you are building mseal for chrome, and chrome alone.
I do not think this will work out for the rest of the application space
because
1) it is too complicated
2) experience with mimmutable() says that applications don't do any of it
themselves, it is all in execve(), libc initialization, and ld.so.
You don't strike me as an execve, libc, or ld.so developer.
From: Matthew Wilcox <willy-AT-infradead.org>
To: Jeff Xu <jeffxu-AT-google.com>
...
Yes, thank you for demonstrating that you have no idea what you need to
block.
> It is practical to keep syscall extentable, when the business logic is the same.
I concur with Theo & Linus. You don't know what you're doing. I think
the underlying idea of mimmutable() is good, but how you've split it up
and how you've implemented it is terrible.
...
Edit: I always find it funny that these articles on the mailing list tend to read like a sports announcer describing a boxing match!
Chrome is the one who wants it. But you can’t unmap sealed pages because an attacker could then re-map them with different flags.
So that basically means this can never be used on pages allocated at runtime unless you intend to hold them for the entire process lifetime, right?
Doesn’t that mean it can’t be used for all the memory used by, say, the JS sandbox which would be a very very tempting target?
Or is the idea that you deal with this by always running that kind of stuff in a different process where you can seal the memory and then you can just kill the process when you’re done?
I’m not familiar with how Chrome manages memory/processes, so I’m not exactly sure why this wouldn’t be an issue.
Is this also the reason why the articles about this often mention it’s not useful to most programs (outside of how memory is set up at process startup)?
Multiprocessing is an option here. I think chrome uses it extensively, so that might be the play here. You need separate processes for other stuff anyway, like isolation via namespaces.
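To make the "separate process" idea concrete, here is a minimal sketch in C. It assumes Linux 6.10 or later on a 64-bit architecture and assumes that mseal is syscall number 462 when the headers do not define SYS_mseal. The function name sealed_child_demo is made up for illustration. The child maps a page, seals it, and checks that munmap() is refused; the memory only goes away when the child exits.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumption: mseal is syscall 462 on x86-64 and on architectures that use
   the generic syscall table. Older headers do not define SYS_mseal. */
#ifndef SYS_mseal
#define SYS_mseal 462
#endif

/* Hypothetical demo. Returns 0 if the child saw munmap() refused with EPERM,
   1 if munmap() succeeded despite the seal, 2 if mseal() was unavailable,
   and -1 if the child could not be started or reaped. */
static int sealed_child_demo(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        long page = sysconf(_SC_PAGESIZE);
        assert(page > 0);

        void *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED || syscall(SYS_mseal, p, (size_t)page, 0UL) != 0)
            _exit(2);

        /* From here on the mapping lives as long as this process does. */
        int refused = (munmap(p, (size_t)page) == -1 && errno == EPERM);
        _exit(refused ? 0 : 1);
    }

    int status;
    pid_t got;
    do {
        got = waitpid(pid, &status, 0);   /* retry if a signal interrupts us */
    } while (got == -1 && errno == EINTR);

    if (got != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);   /* the sealed page was freed with the child */
}
```

On a kernel with mseal() this should return 0, and on an older kernel 2. The parent never has to unmap anything; ending the child is the only way the sealed range is released.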
> experience with mimmutable() says that applications don't do any of it themselves, it is all in execve(), libc initialization, and ld.so.
So this is almost never something a process does to itself, it is part of the sandboxing of child processes.
mseal() gets closer, January 19, 2024: https://lwn.net/Articles/958438/
Memory sealing for the GNU C Library, June 12, 2024: https://lwn.net/Articles/978010/
Not to say these things aren't steps in the right direction, but if you let go of current ideals on how operating systems work, and take into account current systems, knowledge about them, and knowledge about what people want from systems, you can envision systems free from the burden and risks put on developers and users today.
Yes, architecture bugs exist, but software hardly takes advantage of current features, so arguing about architectural bugs is a moot point. There are cheaper ways to compromise a system, and there always will be if things are built on shaky foundations.
That's what I think the parent post is saying. (I personally gently disagree)
If a remote attacker can change the local environment then they must have already broken into your system.
The point is that "The local environment" could refer to what's inside the sandbox. Your browser isn't going to treat my comment as x86 machine code and execute it, for example. Javascript is heavily sandboxed, and mseal() and friends are ways to add another layer of sandboxing.
Discussion about intercepting linux syscalls: https://stackoverflow.com/questions/69859/how-could-i-interc...
But building your own patched kernel that pretends that mseal works would be the simplest way to "disable" that feature. Programs that use mseal could still do sanity checks to see if mseal actually works or not. Then a compromised kernel would need secret ways to disable mseal after it has been applied, to stop the apps from checking for a non-functional mseal.
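A sanity check of that kind could look roughly like the sketch below. It assumes Linux 6.10 or later on a 64-bit architecture and assumes that mseal is syscall number 462 when the headers do not define SYS_mseal. The function name mseal_is_enforced is made up. It seals a scratch page and then tries to change its permissions, which a real seal must refuse with EPERM.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: mseal is syscall 462 on x86-64 and on architectures that use
   the generic syscall table. Older headers do not define SYS_mseal. */
#ifndef SYS_mseal
#define SYS_mseal 462
#endif

/* Hypothetical probe. Returns 1 if a sealed page refuses mprotect(),
   0 if the seal was accepted but is not enforced (a "pretend" mseal),
   and -1 if mseal() could not be tested at all. */
static int mseal_is_enforced(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    void *p = mmap(NULL, (size_t)page, PROT_READ,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    if (syscall(SYS_mseal, p, (size_t)page, 0UL) != 0) {
        munmap(p, (size_t)page);   /* never sealed, so unmapping still works */
        return -1;
    }

    /* A genuine seal rejects any permission change with EPERM. */
    if (mprotect(p, (size_t)page, PROT_READ | PROT_WRITE) == -1 &&
        errno == EPERM)
        return 1;   /* the scratch page stays mapped until the process exits */

    munmap(p, (size_t)page);       /* the seal was fake, so this is allowed */
    return 0;
}
```

The cost of the probe is one page that can never be unmapped. As the comment above says, a kernel that wants to fool this check would have to honor the seal at first and drop it later.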
Protecting against outside attackers
Digital Rights Management
Faking "mseal" is something you might intentionally do if you are trying to break DRM, and something you would not want to do if you are trying to defend against outside attackers.
Some CPUs have special hardware means for protecting some memory region against the kernel ("secure enclaves", e.g. Intel SGX), and that is the feature that the supporters of DRM want.
"Mseal" only protects against attackers who do not control the kernel.
This system call is meant for a different threat model than "attacker has early access to your process before it started initializing".
This is an interesting thought so I looked it up and this is how (all?) preload syscall overrides work. You override the wrapper but not the syscall itself, so if you’re doing direct syscalls I don’t think that can be overridden. Technically you could override the syscall function itself maybe?
But then you can just write assembly code to issue the system call.
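For illustration, a direct call might look like the sketch below. It assumes that mseal is syscall number 462 when the headers do not define SYS_mseal, and it only uses inline assembly on x86-64; on other architectures it falls back to libc's syscall(), which a preload library could in principle replace. The names raw_mseal and raw_mseal_selfcheck are made up.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: mseal is syscall 462 on x86-64 and on architectures that use
   the generic syscall table. Older headers do not define SYS_mseal. */
#ifndef SYS_mseal
#define SYS_mseal 462
#endif

/* Hypothetical helper, not part of any library.
   Returns 0 on success or a negative errno value. */
static long raw_mseal(void *addr, size_t len)
{
#if defined(__x86_64__)
    long ret;
    /* x86-64 convention: syscall number in rax, arguments in rdi, rsi, rdx.
       The syscall instruction clobbers rcx and r11. No libc symbol is used,
       so there is nothing here for LD_PRELOAD to interpose. */
    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "a"((long)SYS_mseal), "D"(addr), "S"(len), "d"(0UL)
                     : "rcx", "r11", "memory");
    return ret;
#else
    /* Fallback: libc's generic syscall() entry point. */
    return syscall(SYS_mseal, addr, len, 0UL) == 0 ? 0 : -(long)errno;
#endif
}

/* Usage sketch: a misaligned address is rejected, so nothing gets sealed.
   ENOSYS means the kernel lacks mseal(); EPERM can come from a seccomp
   filter that blocks unknown syscalls. */
static void raw_mseal_selfcheck(void)
{
    long r = raw_mseal((void *)1, 4096);
    assert(r == -EINVAL || r == -ENOSYS || r == -EPERM);
}
```

The flags argument is passed as 0 here, matching the three-argument prototype quoted in this thread.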
unsigned start addr
But should probably be unsigned long start_addr
int mseal(unsigned long start, size_t len, unsigned long flags)
> How should CPython support the mseal() syscall?
OpenBSD introduced mimmutable in OpenBSD 7.3, which was released on April 10, 2023 (10/4/2023, or 4/10/2023 for US people), so it isn't "forever".
Meanwhile Linux and FreeBSD have had "memfd_create" forever, while OpenBSD doesn't have anonymous files and relies on "shm_open".
Correct, but they did have a very similar syscall for a long time that they deprecated after the release of mimmutable, iirc.