A deep dive into Linux's new mseal syscall

Posted by todsacerdoti 10/25/2024

A deep dive into Linux's new mseal syscall (blog.trailofbits.com)

252 points | 54 comments

ykonstant 10/25/2024|

Interesting. The article mentions "spicy discussions" in the kernel mailing list. Is there any insider who can summarize objections and concerns? I tend to avoid reading the mailing list itself since it can get too spicy, and my headaches are already strong enough!

The mechanism itself seems reasonable, but I am surprised that something like this doesn't already exist in the kernel.

ziddoap 10/25/2024||

Not sure if there was much more to it than the thread linked to, but it was basically Linus being Linus. He said stuff that made sense in a pretty blunt fashion.

There were flags proposed that allowed the seal to be ignored.

>So you say "we can't munmap in this *one* place, but all others ignore the sealing".

Later was the spice.

>And dammit, once something is sealed, it is SEALED. None of this crazy "one place honors the sealing, random other places do not".

And later, even spicier, Linus says that seals cannot be ignored and that is non-negotiable. Any further suggestions to ignore a seal via a flag would result in the person being added to Linus' ignore list. (He, of course, said this with some profanities and capitals sprinkled in.)

js2 10/25/2024|||

Wasn't just Linus. Earlier, from Theo de Raadt:

> I don't think you understand the problem space well enough to come up with your own solution for it. I spent a year on this, and ship a complete system using it. You are asking such simplistic questions above it shocks me.

https://lwn.net/ml/linux-kernel/95482.1697587015@cvs.openbsd...

Via https://lwn.net/Articles/948129/

Affric 10/26/2024|||

Thank you.

That was beautiful.

Demonstrated the difference in design/engineering philosophies from two of the greats.

benreesman 10/26/2024||||

You’re asking me how a watch works, let’s just try to keep an eye on the time.

https://youtu.be/vkYqs9iuJqY?feature=shared&t=109

0xbadcafebee 10/25/2024|||

Not a great perspective... "It took me a year [or more] to understand this. The fact that you don't understand it shocks me." Dude, not everybody's as smart or experienced as you. Here's an opportunity to be a mentor.

vlovich123 10/25/2024|||

My reading of this is a lot more generous to the maintainers and a lot less sympathetic to the author than yours is. The maintainers highlighted the problems and the author came back basically with "I don't believe you so let's go with my approach to stay more general" - it's one thing to disagree, it's another to straight up not acknowledge the feedback. The author ate a lot of very senior people's time arguing instead of listening to them and learning from their experience and that was justifiably frustrating forcing much more direct feedback. The kind of mistake the author made - having to enforce at each individual syscall level instead of it being a protection on the memory itself enforced on all accesses - indicates a poor understanding of how to think about security and build security APIs which is a problem when you're proposing a security API.

It's particularly impressive how misguided the patch is given that they took inspiration from the OpenBSD API implementation, changed both API & implementation, & then argued with both Linus and Theo who started Linux & OpenBSD respectively and were trying to give direct feedback about how OpenBSD is different and why it took the approach it did.

Hopefully the author has taken the more forceful feedback as a learning opportunity about listening to feedback when the people giving it to you have a lot more experience. Or their team is coaching them about what went wrong now that this became so visible to learn what they got wrong.

From Matthew Wilcox who is another senior Linux maintainer:

> I concur with Theo & Linus. You don't know what you're doing. I think the underlying idea of mimmutable() is good, but how you've split it up and how you've implemented it is terrible.

It's delivered directly and bluntly but it's not mean or personal. The author proposed a bad patch & argued from a position of ignorance.

jorvi 10/25/2024||

> The maintainers highlighted the problems and the author came back basically with "I don't believe you so let's go with my approach to stay more general" - it's one thing to disagree, it's another to straight up not acknowledge the feedback.

Isn't that the Linux kernel in a nutshell?

Off the top of my head I can name:

- The zram maintainer that's for a few years been blocking the patches that add the zpool api to zram. This would go a long way to unifying zram and zswap in the future.

- TuxOnIce being permablocked by some heels-in-sand maintainers until the kernel diverged too much and the patch writer gave up. This one would have fixed hibernation on Linux. In other words, hibernation still sucks because of these maintainers.

- Of course the trench warfare all the C-proficient maintainers are waging to chase all the Rust-in-Linux people away, lest they have to learn Rust. ~80% of CVEs are memory-related, so you could say that in say.. 10 years time, ~80% of the CVEs happening in the Linux kernel are the legacy of Ted & co.

vlovich123 10/26/2024|||

You can roughly be:

1. Wrong and ignorant

2. Wrong and knowledgeable

3. Right and ignorant

4. Right and knowledgeable

There's also when there's just a disagreement of opinion because you weight the tradeoffs differently, in which case there's less of a right or wrong.

The main difference is that in those cases the people involved were in camps 2-4 or simply weighted tradeoffs differently. In this case the author seemed much more clearly in camp 1.

Regarding the claim that "this one would have fixed hibernation on Linux" - maybe, but it's hard to evaluate a road not taken. It could have made other tradeoffs or caused other issues down the road that aren't visible to you right now because it didn't get mainlined or maybe hibernation would still have been broken. Or maybe it would have been hunky-dory. But it kind of doesn't matter because the author in this case was clearly wrong (i.e. the patch wouldn't have achieved the goal it set out to do and would have predictably caused vulnerabilities to become possible in the future as the kernel evolved).

As for Rust-in-Linux, it sounds like you're not actually up to date [1]. I'll forgive Ted's emotional volatility at the infamous filesystem talk as he's been much more calm about it now:

> There is a need for documentation and tutorials on how to write filesystem code in idiomatic Rust. He said that he has a lot to learn; he is willing to do that, but needs help on what to learn.

I'd say "trench resistance" not warfare. The Rust movement is trying to shift the daily work of ~15k developers in a mostly bazaar development model. There's going to be resistance and push back and some of it will feel unfair to the Rust folks and some will feel unfair to the C folk and both will be right.

[1] https://lwn.net/Articles/991062/

jorvi 10/26/2024||

> But it kind of doesn't matter because the author in this case was clearly wrong (i.e. the patch wouldn't have achieved the goal it set out to do and would have predictably caused vulnerabilities to become possible in the future as the kernel evolved).

I was speaking in general terms, of how recognizable the "heels-in-the-sand" attitude is. In this particular case yes, it does sound right that the current "mseal" proposal / patches is not up to snuff.

Edit: also reading through it again, props to Theo's response. Stern but with a lot of edification.

> Regarding the claim that "this one would have fixed hibernation on Linux" - maybe, but it's hard to evaluate a road not taken. It could have made other tradeoffs or caused other issues down the road that aren't visible to you right now because it didn't get mainlined or maybe hibernation would still have been broken.

The nice thing about TuxOnIce is that it would have pushed large parts of hibernation into userspace, which would have made it much more versatile and much easier to iterate on.

> As for Rust-in-Linux, it sounds like you're not actually up to date [1].

One of the main Rust spearhead developers felt the need to quit, and almost immediately thereafter Asahi Lina posted an article about how "a subset of C kernel developers just seem determined to make the lives of the Rust maintainers as difficult as possible." By my count that is 0-2. How is that "some of it will feel unfair to the Rust folks and some will feel unfair to the C folk"?

> I'd say "trench resistance" not warfare. The Rust movement is trying to shift the daily work of ~15k developers in a mostly bazaar development model.

They're not trying to shift the daily work of ~15k developers. They're asking for some shims, stubs, adaptations, and some general flexibility & willingness. They're not expecting everyone to write Rust within the next 18 months.

[0]https://vt.social/@lina/113045455229442533

vlovich123 10/26/2024||

I think you really missed what I was trying to say about the hibernate stuff. May be helpful to reread.

> One of the main Rust spearhead developers felt the need to quit, and almost immediately thereafter Asahi Lina posted an article about how "a subset of C kernel developers just seem determined to make the lives of the Rust maintainers as difficult as possible." By my count that is 0-2.

I again refer you to the article I linked. The drama has greatly subsided and it seems like people are talking through the differences. While there's some unfortunate intransigence (which may have since been clarified btw), there's also a lot of premature eagerness for pushing Rust into Linux. It's a shame for sure that Wedson burned out and left. If nothing else at least 6 months after that departure the conversation tonally is very very different.

> Linus Torvalds admonished the group that he did not want to talk about every subsystem supporting Rust at this time; getting support into some of them is sufficient for now. When Airlie asked what would happen when some subsystem blocks progress, Torvalds answered "that's my job"

> Torvalds pointed out that there are kernel features that are currently incompatible with Rust; that is impeding Rust support overall. He mentioned modversions in particular; that problem is being worked on. The list of blocking features is getting shorter, he said, but it still includes kernel features that people need.

> Brauner said that nobody has ever declared that the filesystem abstractions would not be merged; the discussion is all about the details of how that will happen.

Important to note that the filesystem abstraction conflict is what led to the resignation in the first place. The maintainers are taking a very different position, possibly because of non-public conversations that have happened in the intervening time.

> Gleixner said that, 20 years ago, there had been a lot of fear and concern surrounding the realtime kernel work; he is seeing the same thing now with regard to Rust. We cannot let that fear drive things, he said. Torvalds said that Rust has been partially integrated for two years. That is nothing, he said; the project to build the kernel with Clang took a decade, and that was the same old language.

> As the session closed, though, the primary outcome may well have been expressed by Torvalds, who suggested telling people that getting kernel Rust up to production levels will happen, but it will take years.

It can be frustrating to work on something when the time scale for any kind of observable success can be a decade or more, especially when you see the obstacles as other people on the project. I know I would not have the patience and I'm a huge fan of Rust. Keep in mind that Rust itself isn't actually able to support all the kernel use-cases. There's stuff like which platforms that Rust supports as "tier 1" because the kernel needs to support those platforms - the most likely success case will be having the Rust frontend be able to use GCC as the backend and that work is nowhere near (neither is getting tier-1 automation support for those platforms). Those themselves are years out and so it's understandable that maintainers are trying to figure out how the proposed abstractions will work and how roles and responsibilities will be managed.

> They're not trying to shift the daily work of ~15k developers. They're asking for some shims, stubs, adaptations, and some general flexibility & willingness. They're not expecting everyone to write Rust within the next 18 months.

I mean it's pretty clear that the goal of the "Rust in Linux" project is to stop having any new C code in the Linux kernel. No one is claiming in 18 months but that's clearly the direction and it shouldn't be surprising for there to be an emotional reaction to that, especially from the maintainers who might perceive that as a threat if they only know C, even if that seems illogical to you. And more importantly, maintainers need to do a lot of work and they hadn't signed up to maintain Rust abstractions so there's another emotional aspect of it where they feel someone else is imposing a new environment on them.

f1shy 10/26/2024|||

Every time I see somebody insinuate that all the CVEs would not be there if the kernel was written in Rust, I yawn... The idea that a programming language is a silver bullet is just stupid.

ArtixFox 10/26/2024||

who is saying all? pls show me the people who are bruh. its always 70% of the cve's are cuz of things that rust fixes blah blah

cmon, lets not lie.

f1shy 10/26/2024||

Would you believe me if I said 65%?

ArtixFox 10/27/2024||

how does that matter? calling 70% ALL is a very funny thing that i can only see on hn

nativeit 10/25/2024||||

> Google has no shortage of experienced developers who could have reviewed this submission before it was posted publicly, but that does not appear to have happened, with the result that a relatively inexperienced developer was put into a difficult position. Feedback on the proposal was resisted rather than listened to. The result was an interaction that pleased nobody.

refulgentis 10/25/2024||

> Google has no shortage of experienced developers who could have reviewed this submission before it was posted publicly,

You'd be surprised. My understanding from folks on Chrome OS is they've already shed most, if not all, of the most experienced old hands. (n.b. Chrome OS was absorbed by Android and development on it has, by and large, ceased according to the same sources directly, and indirectly via Blind.)

terribleperson 10/25/2024|||

The time of the people who maintain the free and open source software we rely on is not free. From the people I've talked to, maintainers of successful projects are overworked and underappreciated.

Mentorship from one of those people would be valuable, but arguing with them about the implementation of something you don't understand isn't how you get that mentorship.

Onavo 10/25/2024||

> Mentorship from one of those people would be valuable, but arguing with them about the implementation of something you don't understand isn't how you get that mentorship.

Actually that's exactly how you can provoke them into explaining themselves. Too many experienced people sit in their ivory towers and basically appeal to their authority, "I wrote a kernel therefore I know everything, <insert profanity here>". There's a reason why Cunningham's Law exists: "The best way to get the right answer on the Internet is not to ask a question, it's to post the wrong answer."

santiagobasulto 10/26/2024||||

Is that considered "spicy"? Is the sensitivity threshold maybe too low?

f1shy 10/26/2024||

Extremely too low. My personal opinion: people who cannot take that kind of criticism have no place in such a project. Period.

f1shy 10/26/2024|||

And the reason why SW development sucks in enterprise, is the lack of people that can speak clearly like Linus.

greenavocado 10/25/2024|||

https://lwn.net/ml/linux-kernel/7071.1697661373@cvs.openbsd....

    From:   Theo de Raadt <deraadt-AT-openbsd.org>
    To:   Jeff Xu <jeffxu-AT-google.com>

    > On Wed, Oct 18, 2023 at 8:17 AM Matthew Wilcox <willy@infradead.org> wrote:
    > >
    > > Let's start with the purpose.  The point of mimmutable/mseal/whatever is
    > > to fix the mapping of an address range to its underlying object, be it
    > > a particular file mapping or anonymous memory.  After the call succeeds,
    > > it must not be possible to make any address in that virtual range point
    > > into any other object.
    > >
    > > The secondary purpose is to lock down permissions on that range.
    > > Possibly to fix them where they are, possibly to allow RW->RO transitions.
    > >
    > > With those purposes in mind, you should be able to deduce for any syscall
    > > or any madvise(), ... whether it should be allowed.
    > >
    > I got it.
    > 
    > IMO: The approaches mimmutable() and mseal() took are different, but
    > we all want to seal the memory from attackers and make the linux
    > application safer.

    I think you are building mseal for chrome, and chrome alone.

    I do not think this will work out for the rest of the application space
    because

    1) it is too complicated
    2) experience with mimmutable() says that applications don't do any of it
    themselves, it is all in execve(), libc initialization, and ld.so.
    You don't strike me as an execve, libc, or ld.so developer.
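
As a rough illustration of the purpose Wilcox describes above (pin a range to its mapping and lock its permissions), here is a minimal sketch in C. It assumes Linux 6.10 or later on a 64-bit system; `try_mseal` and `seal_demo` are hypothetical helper names, the call goes through `syscall()` because the libc in use may not ship an `mseal()` wrapper, and the fallback syscall number 462 is an assumption used only when the headers do not define `__NR_mseal`.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_mseal
#define __NR_mseal 462 /* assumed syscall number for mseal on Linux 6.10+ */
#endif

/* Hypothetical helper: invoke mseal directly; flags must currently be 0. */
int try_mseal(void *addr, size_t len) {
    return (int)syscall(__NR_mseal, addr, len, 0UL);
}

/*
 * Returns 1 if the seal was applied and later changes were refused,
 * 0 if mseal is unavailable here (old kernel, 32-bit, sandbox),
 * -1 if something unexpected happened.
 */
int seal_demo(void) {
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    if (try_mseal(p, len) != 0) {
        munmap(p, len); /* not sealed, so unmapping is still allowed */
        return 0;
    }

    /* Changing permissions on a sealed range should fail with EPERM. */
    errno = 0;
    if (mprotect(p, len, PROT_READ) != -1 || errno != EPERM)
        return -1;

    /* Unmapping it should fail the same way. */
    errno = 0;
    if (munmap(p, len) != -1 || errno != EPERM)
        return -1;

    /* The seal covers the mapping, not the contents: writes still work. */
    strcpy(p, "still writable");
    assert(strcmp(p, "still writable") == 0);
    return 1;
}
```

Calling `seal_demo()` from a small `main` is enough to try it. Note that a sealed page cannot be given back, so in this sketch it simply stays mapped until the process exits.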

greenavocado 10/25/2024|||

    From:   Matthew Wilcox <willy-AT-infradead.org>
    To:   Jeff Xu <jeffxu-AT-google.com>

    ...

    Yes, thank you for demonstrating that you have no idea what you need to
    block.

    > It is practical to keep syscall extentable, when the business logic is the same.

    I concur with Theo & Linus.  You don't know what you're doing.  I think
    the underlying idea of mimmutable() is good, but how you've split it up
    and how you've implemented it is terrible.

    ...

lathiat 10/25/2024||

This may help a bit: https://lwn.net/Articles/948129/

ykonstant 10/25/2024||

Very nice, thanks!

Edit: I always find it funny that these articles on the mailing list tend to read like a sports announcer describing a boxing match!

MBCook 10/25/2024||

A question about using this call:

Chrome is the one who wants it. But you can’t unmap sealed pages because an attacker could then re-map them with different flags.

So that basically means this can never be used on pages allocated at runtime unless you intend to hold them for the entire process lifetime, right?

Doesn’t that mean it can’t be used for all the memory used by, say, the JS sandbox which would be a very very tempting target?

Or is the idea that you deal with this by always running that kind of stuff in a different process where you can seal the memory and then you can just kill the process when you’re done?

I’m not familiar with how Chrome manages memory/processes, so I’m not exactly sure why this wouldn’t be an issue.

Is this also the reason why the articles about this often mention it’s not useful to most programs (outside of how memory is set up at processes start up)?

PhilipRoman 10/25/2024|

>Doesn’t that mean it can’t be used for all the memory used by, say, the JS sandbox which would be a very very tempting target?

Multiprocessing is an option here. I think chrome uses it extensively, so that might be the play here. You need separate processes for other stuff anyway, like isolation via namespaces.
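
A minimal sketch of that "seal it in a separate process, then throw the process away" idea, in C. This is only an illustration under assumptions, not how Chrome does it: `run_sealed_worker` is a hypothetical name, the syscall number 462 is an assumed fallback for headers that lack `__NR_mseal`, and the child does no real work.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef __NR_mseal
#define __NR_mseal 462 /* assumed syscall number for mseal on Linux 6.10+ */
#endif

/*
 * Forks a short-lived worker that maps a page, fills it, and seals it.
 * Returns the worker's exit code: 0 = sealed, 1 = mseal unavailable,
 * 2 = mmap failed; returns -1 if fork/waitpid failed or the child died.
 */
int run_sealed_worker(void) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        size_t len = (size_t)sysconf(_SC_PAGESIZE);
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            _exit(2);
        memcpy(p, "worker state", 13);

        /* From here on, the mapping cannot be unmapped or re-protected. */
        long rc = syscall(__NR_mseal, p, len, 0UL);

        /* ... the sandboxed work would happen here ... */
        _exit(rc == 0 ? 0 : 1);
    }

    /* The sealed mapping goes away with the child's address space. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    assert(WIFEXITED(status) || WIFSIGNALED(status));
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The parent never has to undo the seal; it only reaps the child, which is the one way a sealed range is ever released.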

masklinn 10/26/2024||

Yes, in fact in his comments Theo de Raadt specifically says (amongst other things):

> experience with mimmutable() says that applications don't do any of it themselves, it is all in execve(), libc initialization, and ld.so.

So this is almost never something a process does to itself, it is part of the sandboxing of child processes.

throw0101a 10/25/2024||

mseal() and what comes after, October 20, 2023: https://lwn.net/Articles/948129/

mseal() gets closer, January 19, 2024: https://lwn.net/Articles/958438/

Memory sealing for the GNU C Library, June 12, 2024: https://lwn.net/Articles/978010/

sim7c00 10/25/2024||

i am sad operating systems need to have such calls implemented while most modern (x86_64) architectures have so many features to facilitate safe and sound programming and computing. legacy crap and mentality, and trying to patch old systems built on paradigms not matching the current world and knowledge rather than rebuilding, really put a brake on progress in computing, and put literally billions at risk.

not to say these things arent steps in the right direction, but if you let go of current ideals on how operating systems work, and take into account current systems, knowledge about them, and knowledge about what people want from systems, you can envision systems free from the burden and risks put on developers and users today.

yes architecture bugs exist, but software hardly takes advantage of current features truly, so arguing about architectural bugs is a moot point. theres cheaper ways to compromise, and always will be if things are built on shaky foundations

GolDDranks 10/26/2024|

Enlighten me: what unused/underused safety features does x86_64 have that wouldn't require the OS to have some method of using or enabling them? Why do you think mseal isn't warranted and what would be better instead?

gcr 10/26/2024||

i read the above poster's critique about operating system and API design more generally. x86-64 can do wonderful things with memory access paradigms, why must we keep using Linux and its baked-in assumptions about how memory should work? Let's instead rewrite everything to be memory-safe, with safety enforced by everything we've learned in the last 60 years of architecture design.

That's what I think the parent post is saying. (I personally gently disagree)

pjc50 10/26/2024||

Such as what, though?

metadat 10/25/2024||

Will it be possible to override / disable the `mseal' syscall with the LD_PRELOAD trick?

eska 10/25/2024||

> mseal digresses from prior memory protection schemes on Linux because it is a syscall tailored specifically for exploit mitigation against remote attackers seeking code execution rather than potentially local ones looking to exfiltrate sensitive secrets in-memory.

If a remote attacker can change the local environment then they must have already broken into your system.

gcr 10/26/2024|||

Not necessarily. By posting this comment, I have caused "THIS STRING IS HARMFUL" to enter your computer's memory! If you see my comment on your screen, it's too late -- as a remote attacker, I have already changed the local environment! I've even slightly changed the rendering of the webpage you're looking at! Muahahah!

The point is that "The local environment" could refer to what's inside the sandbox. Your browser isn't going to treat my comment as x86 machine code and execute it, for example. Javascript is heavily sandboxed, and mseal() and friends are ways to add another layer of sandboxing.

rowanG077 10/27/2024||

The poster obviously meant environment variables as in the LD_PRELOAD variable mentioned previously...

Dwedit 10/25/2024|||

Probably not LD_PRELOAD. It would need to be an imported function in order for LD_PRELOAD to have any effect. A raw syscall would not be interceptable that way.

Discussion about intercepting linux syscalls: https://stackoverflow.com/questions/69859/how-could-i-interc...

But building your own patched kernel that pretends that mseal works would be the simplest way to "disable" that feature. Programs that use mseal could still do sanity checks to see if mseal actually works or not. Then a compromised kernel would need secret ways to disable mseal after it has been applied, to stop the apps from checking for a non-functional mseal.

jandrese 10/25/2024||

I'm not sure what protection you could expect on any system where the kernel has been replaced by the attacker. Sure they can bypass mseal, but they are also bypassing all other security on the box.

Dwedit 10/25/2024||

Two different considerations for when you'd want to deny memory to other processes:

Protecting against outside attackers

Digital Rights Management

Faking "mseal" is something you might intentionally do if you are trying to break DRM, and something you would not want to do if you are trying to defend against outside attackers.

adrian_b 10/26/2024||

The kernel can bypass any memory protection, it does not need to fake mseal. Controlling the memory protection is one of the most important functions of any OS kernel and one of the few that could not be implemented in any other place.

Some CPUs have special hardware means for protecting some memory region against the kernel ("secure enclaves", e.g. Intel SGX), and that is the feature that the supporters of DRM want.

"Mseal" is only against attackers who do not control the kernel.

monocasa 10/25/2024|||

There's a bunch of ways to override it if you have early control over the process. Another example: ptrace the executable, watch the system calls, and skip over any mseal(2)s.

This system call is meant for a different threat model than "attacker has early access to your process before it started initializing".

chucky_z 10/25/2024|||

You can override the mseal call wrapper but not the syscall itself.

This is an interesting thought so I looked it up and this is how (all?) preload syscall overrides work. You override the wrapper but not the syscall itself so if you’re doing direct syscalls I don’t think that can be overridden. Technically you could override the syscall function itself maybe?

jmmv 10/25/2024||

> Technically you could override the syscall function itself maybe?

But then you can just write assembly code to issue the system call.
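
For illustration, a minimal sketch in C of what "write assembly to issue the system call" can look like on x86-64. `raw_syscall3` and `check_raw_getpid` are hypothetical names; on other architectures this sketch just falls back to libc's `syscall()`, which of course could be interposed again.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Issues a system call with up to three arguments.
 * Returns the kernel's raw result: a value >= 0 on success, -errno on error.
 */
long raw_syscall3(long nr, long a1, long a2, long a3) {
#if defined(__x86_64__)
    long ret;
    /* x86-64 Linux ABI: number in rax, args in rdi, rsi, rdx;
       the syscall instruction clobbers rcx and r11. */
    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "a"(nr), "D"(a1), "S"(a2), "d"(a3)
                     : "rcx", "r11", "memory");
    return ret;
#else
    /* Fallback for other architectures: goes through libc after all. */
    long ret = syscall(nr, a1, a2, a3);
    return ret == -1 ? -errno : ret;
#endif
}

/* Sanity check: the raw path and the libc wrapper agree on our PID. */
void check_raw_getpid(void) {
    assert(raw_syscall3(SYS_getpid, 0, 0, 0) == (long)getpid());
}
```

Since no libc symbol is involved on the x86-64 path, there is nothing for an LD_PRELOAD library to replace.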

the8472 10/25/2024|||

https://lwn.net/Articles/978010/ says there'll be a glibc tunable

cataphract 10/25/2024||

Depends whether the program calls into libc or inlines the syscalls, I imagine. Though you could use other mechanisms like seccomp.
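
A rough sketch of the seccomp route in C, assuming it is installed before the target code runs (for example by a launcher that then calls exec). `deny_mseal` is a hypothetical name, 462 is an assumed fallback when the headers lack `__NR_mseal`, and a real filter should also check `seccomp_data.arch`, which this sketch skips.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_mseal
#define __NR_mseal 462 /* assumed syscall number for mseal on Linux 6.10+ */
#endif

/*
 * Installs a seccomp filter that makes mseal fail with ENOSYS and lets
 * every other system call through. Returns 0 on success, -1 on failure.
 * The filter is inherited across fork and exec and cannot be removed.
 */
int deny_mseal(void) {
    struct sock_filter filter[] = {
        /* Load the system call number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* If it is mseal, fall through to the errno return; else skip it. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mseal, 0, 1),
        BPF_STMT(BPF_RET | BPF_K,
                 SECCOMP_RET_ERRNO | (ENOSYS & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };
    assert(prog.len == 4);

    /* Required so an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
        return -1;
    return 0;
}
```

Unlike LD_PRELOAD, this sits below libc, so it also catches inlined or hand-written syscalls. A program that checks the return value of mseal would still notice the failure.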

unwind 10/25/2024||

Meta: the mseal() prototype in the article needs some editing, it is not syntactically correct as shown now. The first argument is shown as

    unsigned start addr

But should probably be

    unsigned long start_addr

hifromwork 10/25/2024|

Seems to be OK now:

    int mseal(unsigned long start, size_t len, unsigned long flags)

Iwan-Zotow 10/27/2024||

should be size_t

westurner 10/25/2024||

- "Memory Sealing "Mseal" System Call Merged for Linux 6.10" (2024) https://news.ycombinator.com/item?id=40474510#40474551 :

> How should CPython support the mseal() syscall?

xterminator 10/25/2024|

OpenBSD has had it since forever [1]. Why is such an obvious feature only reaching Linux now?

[1]https://man.openbsd.org/mimmutable.2

gilgamesh3 10/25/2024||

>OpenBSD has had it since forever.

OpenBSD introduced mimmutable in OpenBSD 7.3, which was released 10/4/2023 (for US people, it would be 4/10/2023), so it isn't "forever".

Meanwhile Linux and FreeBSD have had "memfd_create" forever while OpenBSD doesn't have anonymous files and relies on "shm_open".

pushupentry1219 10/26/2024||

> OpenBSD introduced mimmutable in OpenBSD 7.3

Correct but they did have a very similar syscall for a long time that they deprecated after the release of mimmutable iirc