Message ID | 20220413134946.2732468-1-catalin.marinas@arm.com (mailing list archive) |
---|---|
Headers | show |
Series | mm, arm64: In-kernel support for memory-deny-write-execute (MDWE) | expand |
On 13.4.2022 16.49, Catalin Marinas wrote: > Hi, > > The background to this is that systemd has a configuration option called > MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim > is to prevent a user task from inadvertently creating an executable > mapping that is (or was) writeable. Since such BPF filter is stateless, > it cannot detect mappings that were previously writeable but > subsequently changed to read-only. Therefore the filter simply rejects > any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI > support (Branch Target Identification), the dynamic loader cannot change > an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect(). > For libraries, it can resort to unmapping and re-mapping but for the > main executable it does not have a file descriptor. The original bug > report in the Red Hat bugzilla - [2] - and subsequent glibc workaround > for libraries - [3]. > > Add in-kernel support for such feature as a DENY_WRITE_EXEC personality > flag, inherited on fork() and execve(). The kernel tracks a previously > writeable mapping via a new VM_WAS_WRITE flag (64-bit only > architectures). I went for a personality flag by analogy with the > READ_IMPLIES_EXEC one. However, I'm happy to change it to a prctl() if > we don't want more personality flags. A minor downside with the > personality flag is that there is no way for the user to query which > flags are supported, so in patch 3 I added an AT_FLAGS bit to advertise > this. With systemd there's a BPF construct to block personality changes (LockPersonality=yes) but I think prctl() would be easier to lock down irrevocably. Requiring or implying NoNewPrivileges could prevent nasty surprises from set-uid Python programs which happen to use FFI. > Posting this as an RFC to start a discussion and cc'ing some of the > systemd guys and those involved in the earlier thread around the glibc > workaround for dynamic libraries [4]. Before thinking of upstreaming > this we'd need the systemd folk to buy into replacing the MDWE SECCOMP > BPF filter with the in-kernel one. As the author of this feature in systemd (also similar feature in Firejail), I'd highly prefer in-kernel version to BPF protection. I'd definitely also want to use this in place of BPF on x86_64 and other arches too. In-kernel version would probably allow covering pretty easily this case (maybe it already does): fd = memfd_create(...); write(fd, malicious_code, sizeof(malicious_code)); mmap(..., PROT_EXEC, ..., fd); Other memory W^X implementations include S.A.R.A [1] and SELinux EXECMEM/EXECSTACK/EXECHEAP protections [2], [3]. SELinux checks IS_PRIVATE(file_inode(file)) and vma->anon_vma != NULL, which might be useful additions here too (or future extensions if you prefer). -Topi [1] https://smeso.it/sara/ [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/security/selinux/hooks.c#n3708 [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/security/selinux/hooks.c#n3787
Hi Topi, On Wed, Apr 13, 2022 at 09:39:37PM +0300, Topi Miettinen wrote: > On 13.4.2022 16.49, Catalin Marinas wrote: > > The background to this is that systemd has a configuration option called > > MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim > > is to prevent a user task from inadvertently creating an executable > > mapping that is (or was) writeable. Since such BPF filter is stateless, > > it cannot detect mappings that were previously writeable but > > subsequently changed to read-only. Therefore the filter simply rejects > > any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI > > support (Branch Target Identification), the dynamic loader cannot change > > an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect(). > > For libraries, it can resort to unmapping and re-mapping but for the > > main executable it does not have a file descriptor. The original bug > > report in the Red Hat bugzilla - [2] - and subsequent glibc workaround > > for libraries - [3]. > > > > Add in-kernel support for such feature as a DENY_WRITE_EXEC personality > > flag, inherited on fork() and execve(). The kernel tracks a previously > > writeable mapping via a new VM_WAS_WRITE flag (64-bit only > > architectures). BTW, we can avoid the VM_WAS_WRITE tracking if we limit the check to the current permissions. It would allow mprotect(PROT_EXEC) only if the mapping is already executable. The mmap(PROT_WRITE|PROT_EXEC) would be rejected, as expected. The difference from the current BPF filter is that mprotect(PROT_EXEC|PROT_BTI) is allowed if the mapping was previously mmap(PROT_READ|PROT_EXEC). I'm open to go in this direction if it fits the systemd requirements better. It's also less state to track in the kernel. > > I went for a personality flag by analogy with the > > READ_IMPLIES_EXEC one. However, I'm happy to change it to a prctl() if > > we don't want more personality flags. A minor downside with the > > personality flag is that there is no way for the user to query which > > flags are supported, so in patch 3 I added an AT_FLAGS bit to advertise > > this. > > With systemd there's a BPF construct to block personality changes > (LockPersonality=yes) but I think prctl() would be easier to lock down > irrevocably. The personality flag is not sticky, so one could inadvertently clear it without LockPersonality=yes. We could add a mask of sticky bits to sys_personality() (only for new flags), though we might as well go with a prctl(), we have more flexibility and finer-grained control if we want to expand this to memfd files. Would there be any reason to disable such behaviour for an application, once enabled? I don't think it's currently possible with the BPF filter but we can add it to a prctl(). > > Posting this as an RFC to start a discussion and cc'ing some of the > > systemd guys and those involved in the earlier thread around the glibc > > workaround for dynamic libraries [4]. Before thinking of upstreaming > > this we'd need the systemd folk to buy into replacing the MDWE SECCOMP > > BPF filter with the in-kernel one. > > As the author of this feature in systemd (also similar feature in Firejail), > I'd highly prefer in-kernel version to BPF protection. I'd definitely also > want to use this in place of BPF on x86_64 and other arches too. It is generic, so yes, it can be enabled for other architectures. A minor issue with the VM_WAS_WRITE flag is that we ran out of 32-bit vma flags. It could be expanded but not sure how much you care about MDWE on 32-bit architectures. An alternative is to drop the VM_WAS_WRITE approach entirely, just use the current protection for the decision. > In-kernel version would probably allow covering pretty easily this case > (maybe it already does): > > fd = memfd_create(...); > write(fd, malicious_code, sizeof(malicious_code)); > mmap(..., PROT_EXEC, ..., fd); This series doesn't cover it. > Other memory W^X implementations include S.A.R.A [1] and SELinux > EXECMEM/EXECSTACK/EXECHEAP protections [2], [3]. SELinux checks > IS_PRIVATE(file_inode(file)) and vma->anon_vma != NULL, which might be > useful additions here too (or future extensions if you prefer). I had a quick look at SELinux and, IIUC, without the execmem permission it prevents any anonymous mmap(PROT_EXEC). We could probably do something similar here and check the file permission. I had an attempt but S_PRIVATE doesn't seem to be set on the memfd_create() allocated inode. I have to dig a bit more to see how to detect this. If we keep the check for all files, it won't be able to map any ELF text sections unless the binary is read-only. > [1] https://smeso.it/sara/ > [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/security/selinux/hooks.c#n3708 > [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/security/selinux/hooks.c#n3787
On Wed, Apr 13, 2022 at 02:49:42PM +0100, Catalin Marinas wrote: > The background to this is that systemd has a configuration option called > MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim > is to prevent a user task from inadvertently creating an executable > mapping that is (or was) writeable. Since such BPF filter is stateless, > it cannot detect mappings that were previously writeable but > subsequently changed to read-only. Therefore the filter simply rejects > any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI > support (Branch Target Identification), the dynamic loader cannot change > an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect(). > For libraries, it can resort to unmapping and re-mapping but for the > main executable it does not have a file descriptor. The original bug > report in the Red Hat bugzilla - [2] - and subsequent glibc workaround > for libraries - [3]. Right, so, the systemd filter is a big hammer solution for the kernel not having a very easy way to provide W^X mapping protections to userspace. There's stuff in SELinux, and there have been several attempts[1] at other LSMs to do it too, but nothing stuck. Given the filter, and the implementation of how to enable BTI, I see two solutions: - provide a way to do W^X so systemd can implement the feature differently - provide a way to turn on BTI separate from mprotect to bypass the filter I would agree, the latter seems like the greater hack, so I welcome this RFC, though I think it might need to explore a bit of the feature space exposed by other solutions[1] (i.e. see SARA and NAX), otherwise it risks being too narrowly implemented. For example, playing well with JITs should be part of the design, and will likely need some kind of ELF flags and/or "sealing" mode, and to handle the vma alias case as Jann Horn pointed out[2]. > Add in-kernel support for such feature as a DENY_WRITE_EXEC personality > flag, inherited on fork() and execve(). The kernel tracks a previously > writeable mapping via a new VM_WAS_WRITE flag (64-bit only > architectures). I went for a personality flag by analogy with the > READ_IMPLIES_EXEC one. However, I'm happy to change it to a prctl() if > we don't want more personality flags. A minor downside with the > personality flag is that there is no way for the user to query which > flags are supported, so in patch 3 I added an AT_FLAGS bit to advertise > this. My instinct here is to use a prctl(), which maps to other kinds of modern inherited state (like no_new_privs). > Posting this as an RFC to start a discussion and cc'ing some of the > systemd guys and those involved in the earlier thread around the glibc > workaround for dynamic libraries [4]. Before thinking of upstreaming > this we'd need the systemd folk to buy into replacing the MDWE SECCOMP > BPF filter with the in-kernel one. > > Thanks, > > Catalin > > [1] https://www.freedesktop.org/software/systemd/man/systemd.exec.html#MemoryDenyWriteExecute= > [2] https://bugzilla.redhat.com/show_bug.cgi?id=1888842 > [3] https://sourceware.org/bugzilla/show_bug.cgi?id=26831 > [3] https://lore.kernel.org/r/cover.1604393169.git.szabolcs.nagy@arm.com So, yes, let's do it. It's long long overdue in the kernel. :) -Kees [1] https://github.com/KSPP/linux/issues/32 [2] https://github.com/KSPP/linux/issues/32#issuecomment-1084859611 > > Catalin Marinas (4): > mm: Track previously writeable vma permission > mm, personality: Implement memory-deny-write-execute as a personality > flag > fs/binfmt_elf: Tell user-space about the DENY_WRITE_EXEC personality > flag > arm64: Select ARCH_ENABLE_DENY_WRITE_EXEC > > arch/arm64/Kconfig | 1 + > fs/binfmt_elf.c | 2 ++ > include/linux/mm.h | 6 ++++++ > include/linux/mman.h | 18 +++++++++++++++++- > include/uapi/linux/binfmts.h | 4 ++++ > include/uapi/linux/personality.h | 1 + > mm/Kconfig | 4 ++++ > mm/mmap.c | 3 +++ > mm/mprotect.c | 5 +++++ > 9 files changed, 43 insertions(+), 1 deletion(-) >
On 14.4.2022 21.52, Kees Cook wrote: > On Wed, Apr 13, 2022 at 02:49:42PM +0100, Catalin Marinas wrote: >> The background to this is that systemd has a configuration option called >> MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim >> is to prevent a user task from inadvertently creating an executable >> mapping that is (or was) writeable. Since such BPF filter is stateless, >> it cannot detect mappings that were previously writeable but >> subsequently changed to read-only. Therefore the filter simply rejects >> any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI >> support (Branch Target Identification), the dynamic loader cannot change >> an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect(). >> For libraries, it can resort to unmapping and re-mapping but for the >> main executable it does not have a file descriptor. The original bug >> report in the Red Hat bugzilla - [2] - and subsequent glibc workaround >> for libraries - [3]. > > Right, so, the systemd filter is a big hammer solution for the kernel > not having a very easy way to provide W^X mapping protections to > userspace. There's stuff in SELinux, and there have been several > attempts[1] at other LSMs to do it too, but nothing stuck. > > Given the filter, and the implementation of how to enable BTI, I see two > solutions: > > - provide a way to do W^X so systemd can implement the feature differently > - provide a way to turn on BTI separate from mprotect to bypass the filter > > I would agree, the latter seems like the greater hack, so I welcome > this RFC, though I think it might need to explore a bit of the feature > space exposed by other solutions[1] (i.e. see SARA and NAX), otherwise > it risks being too narrowly implemented. For example, playing well with > JITs should be part of the design, and will likely need some kind of > ELF flags and/or "sealing" mode, and to handle the vma alias case as > Jann Horn pointed out[2]. Another interesting case from 2006 by Ulrich Drepper is to use a temporary file and map it twice, once with PROT_WRITE and once with PROT_EXEC [1]. This isn't possible if the mount flags of the file systems are also in line with W^X principle. System services (unlike user apps) typically don't use /tmp nor /dev/shm (mounted with "rw,exec"). With systemd a simple file system W^X policy can be implemented for a service for example with NoExecPaths=/ ExecPaths=/usr ReadOnlyPaths=/usr. In-kernel MDWE probably could look beyond file descriptors and check if the mount flags of the file system containing the file being mmap()ed agree with W^X. The use cases for system services and user apps may be different: system services are often compatible with maximum hardening, while user apps may need various compatibility solutions if they use JIT, trampolines or FFI and access to W+X file systems may be also needed. -Topi [1] https://akkadia.org/drepper/selinux-mem.html >> Add in-kernel support for such feature as a DENY_WRITE_EXEC personality >> flag, inherited on fork() and execve(). The kernel tracks a previously >> writeable mapping via a new VM_WAS_WRITE flag (64-bit only >> architectures). I went for a personality flag by analogy with the >> READ_IMPLIES_EXEC one. However, I'm happy to change it to a prctl() if >> we don't want more personality flags. A minor downside with the >> personality flag is that there is no way for the user to query which >> flags are supported, so in patch 3 I added an AT_FLAGS bit to advertise >> this. > > My instinct here is to use a prctl(), which maps to other kinds of modern > inherited state (like no_new_privs). > >> Posting this as an RFC to start a discussion and cc'ing some of the >> systemd guys and those involved in the earlier thread around the glibc >> workaround for dynamic libraries [4]. Before thinking of upstreaming >> this we'd need the systemd folk to buy into replacing the MDWE SECCOMP >> BPF filter with the in-kernel one. >> >> Thanks, >> >> Catalin >> >> [1] https://www.freedesktop.org/software/systemd/man/systemd.exec.html#MemoryDenyWriteExecute= >> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1888842 >> [3] https://sourceware.org/bugzilla/show_bug.cgi?id=26831 >> [3] https://lore.kernel.org/r/cover.1604393169.git.szabolcs.nagy@arm.com > > So, yes, let's do it. It's long long overdue in the kernel. :) > > -Kees > > [1] https://github.com/KSPP/linux/issues/32 > [2] https://github.com/KSPP/linux/issues/32#issuecomment-1084859611 > >> >> Catalin Marinas (4): >> mm: Track previously writeable vma permission >> mm, personality: Implement memory-deny-write-execute as a personality >> flag >> fs/binfmt_elf: Tell user-space about the DENY_WRITE_EXEC personality >> flag >> arm64: Select ARCH_ENABLE_DENY_WRITE_EXEC >> >> arch/arm64/Kconfig | 1 + >> fs/binfmt_elf.c | 2 ++ >> include/linux/mm.h | 6 ++++++ >> include/linux/mman.h | 18 +++++++++++++++++- >> include/uapi/linux/binfmts.h | 4 ++++ >> include/uapi/linux/personality.h | 1 + >> mm/Kconfig | 4 ++++ >> mm/mmap.c | 3 +++ >> mm/mprotect.c | 5 +++++ >> 9 files changed, 43 insertions(+), 1 deletion(-) >> >
On Thu, Apr 14, 2022 at 11:52:17AM -0700, Kees Cook wrote: > On Wed, Apr 13, 2022 at 02:49:42PM +0100, Catalin Marinas wrote: > > The background to this is that systemd has a configuration option called > > MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim > > is to prevent a user task from inadvertently creating an executable > > mapping that is (or was) writeable. Since such BPF filter is stateless, > > it cannot detect mappings that were previously writeable but > > subsequently changed to read-only. Therefore the filter simply rejects > > any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI > > support (Branch Target Identification), the dynamic loader cannot change > > an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect(). > > For libraries, it can resort to unmapping and re-mapping but for the > > main executable it does not have a file descriptor. The original bug > > report in the Red Hat bugzilla - [2] - and subsequent glibc workaround > > for libraries - [3]. > > Right, so, the systemd filter is a big hammer solution for the kernel > not having a very easy way to provide W^X mapping protections to > userspace. There's stuff in SELinux, and there have been several > attempts[1] at other LSMs to do it too, but nothing stuck. > > Given the filter, and the implementation of how to enable BTI, I see two > solutions: > > - provide a way to do W^X so systemd can implement the feature differently > - provide a way to turn on BTI separate from mprotect to bypass the filter > > I would agree, the latter seems like the greater hack, We discussed such hacks in the past but they are just working around the fundamental issue - systemd wants W^X but with BPF it can only achieve it by preventing mprotect(PROT_EXEC) irrespective of whether the mapping was already executable. If we find a better solution for W^X, we wouldn't have to hack anything for mprotect(PROT_EXEC|PROT_BTI). > so I welcome > this RFC, though I think it might need to explore a bit of the feature > space exposed by other solutions[1] (i.e. see SARA and NAX), otherwise > it risks being too narrowly implemented. For example, playing well with > JITs should be part of the design, and will likely need some kind of > ELF flags and/or "sealing" mode, and to handle the vma alias case as > Jann Horn pointed out[2]. I agree we should look at what we want to cover, though trying to avoid re-inventing SELinux. With this patchset I went for the minimum that systemd MDWE does with BPF. I think JITs get around it using something like memfd with two separate mappings to the same page. We could try to prevent such aliases but allow it if an ELF note is detected (or get the JIT to issue a prctl()). Anyway, with a prctl() we can allow finer-grained control starting with anonymous and file mappings and later extending to vma aliases, writeable files etc. On top we can add a seal mask so that a process cannot disable a control was set. Something like (I'm not good at names): prctl(PR_MDWX_SET, flags, seal_mask); prctl(PR_MDWX_GET); with flags like: PR_MDWX_MMAP - basics, should cover mmap() and mprotect() PR_MDWX_ALIAS - vma aliases, allowed with an ELF note PR_MDWX_WRITEABLE_FILE (needs some more thinking)
On Wed, Apr 20, 2022 at 02:01:02PM +0100, Catalin Marinas wrote: > I agree we should look at what we want to cover, though trying to avoid > re-inventing SELinux. With this patchset I went for the minimum that > systemd MDWE does with BPF. Right -- and I don't think we're at any risk of slippery-sloping into a full MAC system. :) I'm fine with doing the implementation in stages, if we've attempted to design the steps (which I think you've got a good start on below). > I think JITs get around it using something like memfd with two separate > mappings to the same page. We could try to prevent such aliases but > allow it if an ELF note is detected (or get the JIT to issue a prctl()). Right -- I'd rather JITs carry some hard-coded property (i.e. ELF note) to indicate the fact that they're expecting to do these kinds of things rather than leaving it open for all processes. > Anyway, with a prctl() we can allow finer-grained control starting with > anonymous and file mappings and later extending to vma aliases, > writeable files etc. On top we can add a seal mask so that a process > cannot disable a control was set. Something like (I'm not good at > names): > > prctl(PR_MDWX_SET, flags, seal_mask); > prctl(PR_MDWX_GET); > > with flags like: > > PR_MDWX_MMAP - basics, should cover mmap() and mprotect() > PR_MDWX_ALIAS - vma aliases, allowed with an ELF note > PR_MDWX_WRITEABLE_FILE The SARA proposal lists a lot of behavioral details to consider. Quoting it[1] here: >> - W^X enforcement will cause problems to any programs that needs >> memory pages mapped both as writable and executable at the same time e.g. >> programs with executable stack markings in the PT_GNU_STACK segment. IMO, executable stack markings should be considered completely deprecated. In fact, we've been warning about it since 2020: 47a2ebb7f505 ("execve: warn if process starts with executable stack") So with execstack, under W^X, I think we should either: - refuse to exec the process (default) - disable W^X for the process (but not its children) >> - W!->X restriction will cause problems to any program that >> needs to generate executable code at run time or to modify executable >> pages e.g. programs with a JIT compiler built-in or linked against a >> non-PIC library. This seems solvable with an ELF flag. >> - Executable MMAP prevention can work only with programs that have at least >> partial RELRO support. It's disabled automatically for programs that >> lack this feature. It will cause problems to any program that uses dlopen >> or tries to do an executable mmap. Unfortunately this feature is the one >> that could create most problems and should be enabled only after careful >> evaluation. This seems like a variation on the execstack case, and we should be able to detect the state and choose a behavior based on system settings, and a smarter version (as SARA has) would track RELRO pages waiting for the loader to make them read-only. SARA was proposed with a set of feature flags[2]; quoting here: >> | W^X | 0x0008 | This is the basic property, refusing PROT_WRITE | PROT_EXEC. I note that SARA also rejects opening /proc/$pid/mem with FMODE_WRITE when this is enabled for a process. (It likely should extend to process_vm_write() too.) >> | W!->X Heap | 0x0001 | >> | W!->X Stack | 0x0002 | >> | W!->X Other memory | 0x0004 | This is for the vma history tracking, and I don't think we need to separate this by memory type? It's nice to have the granularity, but for a first-pass it seems like overkill? Maybe I'm missing some detail. >> | Don't enforce, just complain | 0x0010 | >> | Be Verbose | 0x0020 | Unclear if these would work well with a non-LSM approach. >> | Executable MMAP prevention | 0x0040 | This is the relro detection piece. >> | Trampoline emulation | 0x0100 | This is a more advanced case for emulating execstack, but if we can just ignore execstack entirely, this can go away? >> | Children will inherit flags | 0x0200 | Should a process have that control? >> | Force W^X on setprocattr | 0x0080 | This is a "seal" trigger, which could be done through prctl(). It looks like a bunch of the features are designed around having as much as possible enabled at exec time, and then tightening it further as various things are finished (e.g. execstack, relro, sealing, etc), which is, I think, what would still be needed for a process launcher to be able to enable this kind of protection. (i.e. hoping the process calls a prctl() to enable the protection isn't going to work well with systemd.) So, I *think* we could have a minimal form with these considerations: - execstack: declare it distinctly incompatible. - relro: I think this is solved with BIND_NOW. It's been a while since I looked deeply at this, but I think under BIND_NOW, the (executable) PLT doesn't ever need to be writable (since it points into the GOT), and the (initially writable) GOT is already never executable. This needs to be verified... - JITs can be allowed with a ELF flag and can choose to opt-in with a prctl(). -Kees [1] https://lore.kernel.org/lkml/1562410493-8661-1-git-send-email-s.mesoraca16@gmail.com/ [2] https://lore.kernel.org/lkml/1562410493-8661-2-git-send-email-s.mesoraca16@gmail.com/
On 20.4.2022 16.01, Catalin Marinas wrote: > On Thu, Apr 14, 2022 at 11:52:17AM -0700, Kees Cook wrote: >> On Wed, Apr 13, 2022 at 02:49:42PM +0100, Catalin Marinas wrote: >>> The background to this is that systemd has a configuration option called >>> MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim >>> is to prevent a user task from inadvertently creating an executable >>> mapping that is (or was) writeable. Since such BPF filter is stateless, >>> it cannot detect mappings that were previously writeable but >>> subsequently changed to read-only. Therefore the filter simply rejects >>> any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI >>> support (Branch Target Identification), the dynamic loader cannot change >>> an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect(). >>> For libraries, it can resort to unmapping and re-mapping but for the >>> main executable it does not have a file descriptor. The original bug >>> report in the Red Hat bugzilla - [2] - and subsequent glibc workaround >>> for libraries - [3]. >> >> Right, so, the systemd filter is a big hammer solution for the kernel >> not having a very easy way to provide W^X mapping protections to >> userspace. There's stuff in SELinux, and there have been several >> attempts[1] at other LSMs to do it too, but nothing stuck. >> >> Given the filter, and the implementation of how to enable BTI, I see two >> solutions: >> >> - provide a way to do W^X so systemd can implement the feature differently >> - provide a way to turn on BTI separate from mprotect to bypass the filter >> >> I would agree, the latter seems like the greater hack, > > We discussed such hacks in the past but they are just working around the > fundamental issue - systemd wants W^X but with BPF it can only achieve > it by preventing mprotect(PROT_EXEC) irrespective of whether the mapping > was already executable. If we find a better solution for W^X, we > wouldn't have to hack anything for mprotect(PROT_EXEC|PROT_BTI). > >> so I welcome >> this RFC, though I think it might need to explore a bit of the feature >> space exposed by other solutions[1] (i.e. see SARA and NAX), otherwise >> it risks being too narrowly implemented. For example, playing well with >> JITs should be part of the design, and will likely need some kind of >> ELF flags and/or "sealing" mode, and to handle the vma alias case as >> Jann Horn pointed out[2]. > > I agree we should look at what we want to cover, though trying to avoid > re-inventing SELinux. With this patchset I went for the minimum that > systemd MDWE does with BPF. > > I think JITs get around it using something like memfd with two separate > mappings to the same page. We could try to prevent such aliases but > allow it if an ELF note is detected (or get the JIT to issue a prctl()). > > Anyway, with a prctl() we can allow finer-grained control starting with > anonymous and file mappings and later extending to vma aliases, > writeable files etc. On top we can add a seal mask so that a process > cannot disable a control was set. Something like (I'm not good at > names): > > prctl(PR_MDWX_SET, flags, seal_mask); > prctl(PR_MDWX_GET); > > with flags like: > > PR_MDWX_MMAP - basics, should cover mmap() and mprotect() > PR_MDWX_ALIAS - vma aliases, allowed with an ELF note > PR_MDWX_WRITEABLE_FILE > > (needs some more thinking) > For systemd, feature compatibility with the BPF version is important so that we could automatically switch to the kernel version once available without regressions. So I think PR_MDWX_MMAP (or maybe PR_MDWX_COMPAT) should match exactly what MemoryDenyWriteExecute=yes as implemented with BPF has: only forbid mmap(PROT_EXEC|PROT_WRITE) and mprotect(PROT_EXEC). Like BPF, once installed there should be no way to escape and ELF flags should be also ignored. ARM BTI should be allowed though (allow PROT_EXEC|PROT_BTI if the old flags had PROT_EXEC). Then we could have improved versions (other PR_MDWX_ prctls) with lots more checks. This could be enabled with MemoryDenyWriteExecute=strict or so. Perhaps also more relaxed versions (like SARA) could be interesting (system service running Python with FFI, or perhaps JVM etc), enabled with for example MemoryDenyWriteExecute=trampolines. That way even those programs would get some protection (though there would be a gap in the defences). -Topi
On Wed, Apr 20, 2022 at 10:34:33PM +0300, Topi Miettinen wrote: > On 20.4.2022 16.01, Catalin Marinas wrote: > > On Thu, Apr 14, 2022 at 11:52:17AM -0700, Kees Cook wrote: > > > On Wed, Apr 13, 2022 at 02:49:42PM +0100, Catalin Marinas wrote: > > > > The background to this is that systemd has a configuration option called > > > > MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim > > > > is to prevent a user task from inadvertently creating an executable > > > > mapping that is (or was) writeable. Since such BPF filter is stateless, > > > > it cannot detect mappings that were previously writeable but > > > > subsequently changed to read-only. Therefore the filter simply rejects > > > > any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI > > > > support (Branch Target Identification), the dynamic loader cannot change > > > > an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect(). > > > > For libraries, it can resort to unmapping and re-mapping but for the > > > > main executable it does not have a file descriptor. The original bug > > > > report in the Red Hat bugzilla - [2] - and subsequent glibc workaround > > > > for libraries - [3]. > > > > > > Right, so, the systemd filter is a big hammer solution for the kernel > > > not having a very easy way to provide W^X mapping protections to > > > userspace. There's stuff in SELinux, and there have been several > > > attempts[1] at other LSMs to do it too, but nothing stuck. > > > > > > Given the filter, and the implementation of how to enable BTI, I see two > > > solutions: > > > > > > - provide a way to do W^X so systemd can implement the feature differently > > > - provide a way to turn on BTI separate from mprotect to bypass the filter > > > > > > I would agree, the latter seems like the greater hack, > > > > We discussed such hacks in the past but they are just working around the > > fundamental issue - systemd wants W^X but with BPF it can only achieve > > it by preventing mprotect(PROT_EXEC) irrespective of whether the mapping > > was already executable. If we find a better solution for W^X, we > > wouldn't have to hack anything for mprotect(PROT_EXEC|PROT_BTI). > > > > > so I welcome > > > this RFC, though I think it might need to explore a bit of the feature > > > space exposed by other solutions[1] (i.e. see SARA and NAX), otherwise > > > it risks being too narrowly implemented. For example, playing well with > > > JITs should be part of the design, and will likely need some kind of > > > ELF flags and/or "sealing" mode, and to handle the vma alias case as > > > Jann Horn pointed out[2]. > > > > I agree we should look at what we want to cover, though trying to avoid > > re-inventing SELinux. With this patchset I went for the minimum that > > systemd MDWE does with BPF. > > > > I think JITs get around it using something like memfd with two separate > > mappings to the same page. We could try to prevent such aliases but > > allow it if an ELF note is detected (or get the JIT to issue a prctl()). > > > > Anyway, with a prctl() we can allow finer-grained control starting with > > anonymous and file mappings and later extending to vma aliases, > > writeable files etc. On top we can add a seal mask so that a process > > cannot disable a control was set. Something like (I'm not good at > > names): > > > > prctl(PR_MDWX_SET, flags, seal_mask); > > prctl(PR_MDWX_GET); > > > > with flags like: > > > > PR_MDWX_MMAP - basics, should cover mmap() and mprotect() > > PR_MDWX_ALIAS - vma aliases, allowed with an ELF note > > PR_MDWX_WRITEABLE_FILE > > > > (needs some more thinking) > > > > For systemd, feature compatibility with the BPF version is important so that > we could automatically switch to the kernel version once available without > regressions. So I think PR_MDWX_MMAP (or maybe PR_MDWX_COMPAT) should match > exactly what MemoryDenyWriteExecute=yes as implemented with BPF has: only > forbid mmap(PROT_EXEC|PROT_WRITE) and mprotect(PROT_EXEC). Like BPF, once > installed there should be no way to escape and ELF flags should be also > ignored. ARM BTI should be allowed though (allow PROT_EXEC|PROT_BTI if the > old flags had PROT_EXEC). > > Then we could have improved versions (other PR_MDWX_ prctls) with lots more > checks. This could be enabled with MemoryDenyWriteExecute=strict or so. > > Perhaps also more relaxed versions (like SARA) could be interesting (system > service running Python with FFI, or perhaps JVM etc), enabled with for > example MemoryDenyWriteExecute=trampolines. That way even those programs > would get some protection (though there would be a gap in the defences). Yup, I think we're all on the same page. Catalin, can you respin with a prctl for enabling MDWE? I propose just: prctl(PR_MDWX_SET, flags); prctl(PR_MDWX_GET); PR_MDWX_FLAG_MMAP disallows PROT_EXEC on any VMA that is or was PROT_WRITE, covering at least: mmap, mprotect, pkey_mprotect, and shmat. I don't think anything should be allowed to be disabled once set.
On Wed, Apr 20, 2022 at 04:21:45PM -0700, Kees Cook wrote: > On Wed, Apr 20, 2022 at 10:34:33PM +0300, Topi Miettinen wrote: > > For systemd, feature compatibility with the BPF version is important so that > > we could automatically switch to the kernel version once available without > > regressions. So I think PR_MDWX_MMAP (or maybe PR_MDWX_COMPAT) should match > > exactly what MemoryDenyWriteExecute=yes as implemented with BPF has: only > > forbid mmap(PROT_EXEC|PROT_WRITE) and mprotect(PROT_EXEC). Like BPF, once > > installed there should be no way to escape and ELF flags should be also > > ignored. ARM BTI should be allowed though (allow PROT_EXEC|PROT_BTI if the > > old flags had PROT_EXEC). I agree. > > Then we could have improved versions (other PR_MDWX_ prctls) with lots more > > checks. This could be enabled with MemoryDenyWriteExecute=strict or so. > > > > Perhaps also more relaxed versions (like SARA) could be interesting (system > > service running Python with FFI, or perhaps JVM etc), enabled with for > > example MemoryDenyWriteExecute=trampolines. That way even those programs > > would get some protection (though there would be a gap in the defences). > > Yup, I think we're all on the same page. Catalin, can you respin with a > prctl for enabling MDWE? I propose just: > > prctl(PR_MDWX_SET, flags); > prctl(PR_MDWX_GET); > > PR_MDWX_FLAG_MMAP > disallows PROT_EXEC on any VMA that is or was PROT_WRITE, > covering at least: mmap, mprotect, pkey_mprotect, and shmat. Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if the vma is not already PROT_EXEC? The latter is closer to the current systemd approach. The former allows an mprotect(PROT_EXEC) if the mapping was PROT_READ only for example. I'd drop the "was PROT_WRITE" for now if the aim is a drop-in replacement for BPF MDWE.
On Thu, Apr 21, 2022 at 04:35:15PM +0100, Catalin Marinas wrote: > On Wed, Apr 20, 2022 at 04:21:45PM -0700, Kees Cook wrote: > > On Wed, Apr 20, 2022 at 10:34:33PM +0300, Topi Miettinen wrote: > > > For systemd, feature compatibility with the BPF version is important so that > > > we could automatically switch to the kernel version once available without > > > regressions. So I think PR_MDWX_MMAP (or maybe PR_MDWX_COMPAT) should match > > > exactly what MemoryDenyWriteExecute=yes as implemented with BPF has: only > > > forbid mmap(PROT_EXEC|PROT_WRITE) and mprotect(PROT_EXEC). Like BPF, once > > > installed there should be no way to escape and ELF flags should be also > > > ignored. ARM BTI should be allowed though (allow PROT_EXEC|PROT_BTI if the > > > old flags had PROT_EXEC). > > I agree. > > > > Then we could have improved versions (other PR_MDWX_ prctls) with lots more > > > checks. This could be enabled with MemoryDenyWriteExecute=strict or so. > > > > > > Perhaps also more relaxed versions (like SARA) could be interesting (system > > > service running Python with FFI, or perhaps JVM etc), enabled with for > > > example MemoryDenyWriteExecute=trampolines. That way even those programs > > > would get some protection (though there would be a gap in the defences). > > > > Yup, I think we're all on the same page. Catalin, can you respin with a > > prctl for enabling MDWE? I propose just: > > > > prctl(PR_MDWX_SET, flags); > > prctl(PR_MDWX_GET); > > > > PR_MDWX_FLAG_MMAP > > disallows PROT_EXEC on any VMA that is or was PROT_WRITE, > > covering at least: mmap, mprotect, pkey_mprotect, and shmat. > > Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if > the vma is not already PROT_EXEC? The latter is closer to the current > systemd approach. The former allows an mprotect(PROT_EXEC) if the > mapping was PROT_READ only for example. > > I'd drop the "was PROT_WRITE" for now if the aim is a drop-in > replacement for BPF MDWE. I think "was PROT_WRITE" is an important part of the defense that couldn't be done with a simple seccomp filter (which is why the filter ended up being a problem in the first place).
On 21.4.2022 18.35, Catalin Marinas wrote: > On Wed, Apr 20, 2022 at 04:21:45PM -0700, Kees Cook wrote: >> On Wed, Apr 20, 2022 at 10:34:33PM +0300, Topi Miettinen wrote: >>> For systemd, feature compatibility with the BPF version is important so that >>> we could automatically switch to the kernel version once available without >>> regressions. So I think PR_MDWX_MMAP (or maybe PR_MDWX_COMPAT) should match >>> exactly what MemoryDenyWriteExecute=yes as implemented with BPF has: only >>> forbid mmap(PROT_EXEC|PROT_WRITE) and mprotect(PROT_EXEC). Like BPF, once >>> installed there should be no way to escape and ELF flags should be also >>> ignored. ARM BTI should be allowed though (allow PROT_EXEC|PROT_BTI if the >>> old flags had PROT_EXEC). > > I agree. > >>> Then we could have improved versions (other PR_MDWX_ prctls) with lots more >>> checks. This could be enabled with MemoryDenyWriteExecute=strict or so. >>> >>> Perhaps also more relaxed versions (like SARA) could be interesting (system >>> service running Python with FFI, or perhaps JVM etc), enabled with for >>> example MemoryDenyWriteExecute=trampolines. That way even those programs >>> would get some protection (though there would be a gap in the defences). >> >> Yup, I think we're all on the same page. Catalin, can you respin with a >> prctl for enabling MDWE? I propose just: >> >> prctl(PR_MDWX_SET, flags); >> prctl(PR_MDWX_GET); >> >> PR_MDWX_FLAG_MMAP >> disallows PROT_EXEC on any VMA that is or was PROT_WRITE, >> covering at least: mmap, mprotect, pkey_mprotect, and shmat. > > Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if > the vma is not already PROT_EXEC? The latter is closer to the current > systemd approach. The former allows an mprotect(PROT_EXEC) if the > mapping was PROT_READ only for example. > > I'd drop the "was PROT_WRITE" for now if the aim is a drop-in > replacement for BPF MDWE. > I think we'd want existing installations with MemoryDenyWriteExecute=yes not start failing when the implementation is changed to in-kernel version. The implementation could be very simple and not even check existing PROT_ flags (except for BTI case) to be maximally compatible to BPF version. So I'd leave "was PROT_WRITE" and other checks to more advanced versions, enabled with a different PR_MDWX_FLAG_. -Topi
On Thu, Apr 21, 2022 at 09:42:23AM -0700, Kees Cook wrote: > On Thu, Apr 21, 2022 at 04:35:15PM +0100, Catalin Marinas wrote: > > Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if > > the vma is not already PROT_EXEC? The latter is closer to the current > > systemd approach. The former allows an mprotect(PROT_EXEC) if the > > mapping was PROT_READ only for example. > > > > I'd drop the "was PROT_WRITE" for now if the aim is a drop-in > > replacement for BPF MDWE. > > I think "was PROT_WRITE" is an important part of the defense that > couldn't be done with a simple seccomp filter (which is why the filter > ended up being a problem in the first place). I would say "was PROT_WRITE" is slightly more relaxed than "is not already PROT_EXEC". The seccomp filter can't do "is not already PROT_EXEC" either since it only checks the mprotect() arguments, not the current vma flags. So we have (with sub-cases): 1. Current BPF filter: a) mmap(PROT_READ|PROT_WRITE|PROT_EXEC); // fails b) mmap(PROT_READ|PROT_EXEC); mprotect(PROT_READ|PROT_EXEC|PROT_BTI); // fails 2. "is not already PROT_EXEC": a) mmap(PROT_READ|PROT_WRITE|PROT_EXEC); // fails b) mmap(PROT_READ|PROT_EXEC); mmap(PROT_READ|PROT_EXEC|PROT_BTI); // passes c) mmap(PROT_READ); mprotect(PROT_READ|PROT_EXEC); // fails 3. "is or was not PROT_WRITE": a) mmap(PROT_READ|PROT_WRITE|PROT_EXEC); // fails b) mmap(PROT_READ|PROT_EXEC); mmap(PROT_READ|PROT_EXEC|PROT_BTI); // passes c) mmap(PROT_READ); mprotect(PROT_READ|PROT_EXEC); // passes d) mmap(PROT_READ|PROT_WRITE); mprotect(PROT_READ); mprotect(PROT_READ|PROT_EXEC); // fails (was write) The current seccomp filter is the strictest. If we go for (2), it allows PROT_BTI as in 2.b but prevents 2.c (as would the current seccomp filter). (3) relaxes 2.c as in 3.c while still preventing 3.d. Both (1) and (2) prevent 3.d by simply rejecting mprotect(PROT_EXEC). If we don't care about 3.c, we might as well go for (2). I don't mind, already went for (3) in this series. I think either of them would not be a regression on MDWE, unless there is some test that attempts 3.c and expects it to fail.
On Thu, Apr 21, 2022 at 07:48:27PM +0300, Topi Miettinen wrote: > On 21.4.2022 18.35, Catalin Marinas wrote: > > Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if > > the vma is not already PROT_EXEC? The latter is closer to the current > > systemd approach. The former allows an mprotect(PROT_EXEC) if the > > mapping was PROT_READ only for example. > > > > I'd drop the "was PROT_WRITE" for now if the aim is a drop-in > > replacement for BPF MDWE. > > I think we'd want existing installations with MemoryDenyWriteExecute=yes not > start failing when the implementation is changed to in-kernel version. The > implementation could be very simple and not even check existing PROT_ flags > (except for BTI case) to be maximally compatible to BPF version. It would have to check the existing flags otherwise this could have been implemented in the BPF filter. The dynamic loader (or kernel loader) first mmap(PROT_READ|PROT_EXEC) and, if the BTI note is found, it switches it to mprotect(PROT_READ|PROT_EXEC|PROT_BTI). If we allowed this to pass simply because of PROT_BTI, one could create such executable mapping even if it is (or was) writeable. So we can either allow mprotect(PROT_EXEC) if the mapping was never writeable or we allow mprotect(PROT_EXEC) if the mapping is already PROT_EXEC (and the check for W^X was previously done by mmap()). > So I'd leave "was PROT_WRITE" and other checks to more advanced > versions, enabled with a different PR_MDWX_FLAG_. This works for me as well. See my reply to Kees for the use-cases. Thanks.
On Thu, Apr 21, 2022 at 06:24:21PM +0100, Catalin Marinas wrote: > On Thu, Apr 21, 2022 at 09:42:23AM -0700, Kees Cook wrote: > > On Thu, Apr 21, 2022 at 04:35:15PM +0100, Catalin Marinas wrote: > > > Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if > > > the vma is not already PROT_EXEC? The latter is closer to the current > > > systemd approach. The former allows an mprotect(PROT_EXEC) if the > > > mapping was PROT_READ only for example. > > > > > > I'd drop the "was PROT_WRITE" for now if the aim is a drop-in > > > replacement for BPF MDWE. > > > > I think "was PROT_WRITE" is an important part of the defense that > > couldn't be done with a simple seccomp filter (which is why the filter > > ended up being a problem in the first place). > > I would say "was PROT_WRITE" is slightly more relaxed than "is not > already PROT_EXEC". The seccomp filter can't do "is not already > PROT_EXEC" either since it only checks the mprotect() arguments, not the > current vma flags. > > So we have (with sub-cases): > > 1. Current BPF filter: > > a) mmap(PROT_READ|PROT_WRITE|PROT_EXEC); // fails > > b) mmap(PROT_READ|PROT_EXEC); > mprotect(PROT_READ|PROT_EXEC|PROT_BTI); // fails > > c) mmap(PROT_READ); > mprotect(PROT_READ|PROT_EXEC); // fails > > d) mmap(PROT_READ|PROT_WRITE); > mprotect(PROT_READ); > mprotect(PROT_READ|PROT_EXEC); // fails > > 2. "is not already PROT_EXEC": > > a) mmap(PROT_READ|PROT_WRITE|PROT_EXEC); // fails > > b) mmap(PROT_READ|PROT_EXEC); > mprotect(PROT_READ|PROT_EXEC|PROT_BTI); // passes > > c) mmap(PROT_READ); > mprotect(PROT_READ|PROT_EXEC); // fails > > d) mmap(PROT_READ|PROT_WRITE); > mprotect(PROT_READ); > mprotect(PROT_READ|PROT_EXEC); // fails > > 3. "is or was not PROT_WRITE": > > a) mmap(PROT_READ|PROT_WRITE|PROT_EXEC); // fails > > b) mmap(PROT_READ|PROT_EXEC); > mprotect(PROT_READ|PROT_EXEC|PROT_BTI); // passes > > c) mmap(PROT_READ); > mprotect(PROT_READ|PROT_EXEC); // passes > > d) mmap(PROT_READ|PROT_WRITE); > mprotect(PROT_READ); > mprotect(PROT_READ|PROT_EXEC); // fails [edited above to show each case] restated what was already summarized: Problem is 1.b. 2 and 3 solve it. 3 is more relaxed (c passes). > If we don't care about 3.c, we might as well go for (2). I don't mind, > already went for (3) in this series. I think either of them would not be > a regression on MDWE, unless there is some test that attempts 3.c and > expects it to fail. I should stop arguing for a less restrictive mode. ;) It just feels weird that the combinations are API-mediated, rather than logically defined: I can do PROT_READ|PROT_EXEC with mmap but not mprotect under 2. As opposed to saying "the vma cannot be executable if it is or ever was writable". I find the latter much easier to reason about as far as the expectations of system state. So, I'd still prefer 3, as that was the _goal_ of the systemd MDWE seccomp filter, but yes, 2 does provide the same protection while allowing BTI.
On Thu, Apr 21, 2022 at 10:41:43AM -0700, Kees Cook wrote: > On Thu, Apr 21, 2022 at 06:24:21PM +0100, Catalin Marinas wrote: > > On Thu, Apr 21, 2022 at 09:42:23AM -0700, Kees Cook wrote: > > > On Thu, Apr 21, 2022 at 04:35:15PM +0100, Catalin Marinas wrote: > > > > Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if > > > > the vma is not already PROT_EXEC? The latter is closer to the current > > > > systemd approach. The former allows an mprotect(PROT_EXEC) if the > > > > mapping was PROT_READ only for example. > > > > > > > > I'd drop the "was PROT_WRITE" for now if the aim is a drop-in > > > > replacement for BPF MDWE. > > > > > > I think "was PROT_WRITE" is an important part of the defense that > > > couldn't be done with a simple seccomp filter (which is why the filter > > > ended up being a problem in the first place). > > > > I would say "was PROT_WRITE" is slightly more relaxed than "is not > > already PROT_EXEC". The seccomp filter can't do "is not already > > PROT_EXEC" either since it only checks the mprotect() arguments, not the > > current vma flags. > > > > So we have (with sub-cases): > > > > 1. Current BPF filter: > > > > a) mmap(PROT_READ|PROT_WRITE|PROT_EXEC); // fails > > > > b) mmap(PROT_READ|PROT_EXEC); > > mprotect(PROT_READ|PROT_EXEC|PROT_BTI); // fails > > > > c) mmap(PROT_READ); > > mprotect(PROT_READ|PROT_EXEC); // fails > > > > d) mmap(PROT_READ|PROT_WRITE); > > mprotect(PROT_READ); > > mprotect(PROT_READ|PROT_EXEC); // fails > > > > 2. "is not already PROT_EXEC": > > > > a) mmap(PROT_READ|PROT_WRITE|PROT_EXEC); // fails > > > > b) mmap(PROT_READ|PROT_EXEC); > > mprotect(PROT_READ|PROT_EXEC|PROT_BTI); // passes > > > > c) mmap(PROT_READ); > > mprotect(PROT_READ|PROT_EXEC); // fails > > > > d) mmap(PROT_READ|PROT_WRITE); > > mprotect(PROT_READ); > > mprotect(PROT_READ|PROT_EXEC); // fails > > > > 3. "is or was not PROT_WRITE": > > > > a) mmap(PROT_READ|PROT_WRITE|PROT_EXEC); // fails > > > > b) mmap(PROT_READ|PROT_EXEC); > > mprotect(PROT_READ|PROT_EXEC|PROT_BTI); // passes > > > > c) mmap(PROT_READ); > > mprotect(PROT_READ|PROT_EXEC); // passes > > > > d) mmap(PROT_READ|PROT_WRITE); > > mprotect(PROT_READ); > > mprotect(PROT_READ|PROT_EXEC); // fails > > [edited above to show each case] Thanks, I was in a rush to get home ;). > restated what was already summarized: > Problem is 1.b. 2 and 3 solve it. 3 is more relaxed (c passes). > > > If we don't care about 3.c, we might as well go for (2). I don't mind, > > already went for (3) in this series. I think either of them would not be > > a regression on MDWE, unless there is some test that attempts 3.c and > > expects it to fail. > > I should stop arguing for a less restrictive mode. ;) It just feels weird > that the combinations are API-mediated, rather than logically defined: > I can do PROT_READ|PROT_EXEC with mmap but not mprotect under 2. As > opposed to saying "the vma cannot be executable if it is or ever was > writable". I find the latter much easier to reason about as far as the > expectations of system state. I had the same reasoning, hence option 3 in this series. I prefer to treat mmap(PROT_READ|PROT_EXEC) and mprotect(PROT_READ|PROT_EXEC) in a similar way.