mbox series

[RFC,0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)

Message ID 20220413134946.2732468-1-catalin.marinas@arm.com (mailing list archive)
Headers show
Series mm, arm64: In-kernel support for memory-deny-write-execute (MDWE) | expand

Message

Catalin Marinas April 13, 2022, 1:49 p.m. UTC
Hi,

The background to this is that systemd has a configuration option called
MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
is to prevent a user task from inadvertently creating an executable
mapping that is (or was) writeable. Since such BPF filter is stateless,
it cannot detect mappings that were previously writeable but
subsequently changed to read-only. Therefore the filter simply rejects
any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
support (Branch Target Identification), the dynamic loader cannot change
an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
For libraries, it can resort to unmapping and re-mapping but for the
main executable it does not have a file descriptor. The original bug
report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
for libraries - [3].

Add in-kernel support for such feature as a DENY_WRITE_EXEC personality
flag, inherited on fork() and execve(). The kernel tracks a previously
writeable mapping via a new VM_WAS_WRITE flag (64-bit only
architectures). I went for a personality flag by analogy with the
READ_IMPLIES_EXEC one. However, I'm happy to change it to a prctl() if
we don't want more personality flags. A minor downside with the
personality flag is that there is no way for the user to query which
flags are supported, so in patch 3 I added an AT_FLAGS bit to advertise
this.

Posting this as an RFC to start a discussion and cc'ing some of the
systemd guys and those involved in the earlier thread around the glibc
workaround for dynamic libraries [4]. Before thinking of upstreaming
this we'd need the systemd folk to buy into replacing the MDWE SECCOMP
BPF filter with the in-kernel one.

Thanks,

Catalin

[1] https://www.freedesktop.org/software/systemd/man/systemd.exec.html#MemoryDenyWriteExecute=
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1888842
[3] https://sourceware.org/bugzilla/show_bug.cgi?id=26831
[3] https://lore.kernel.org/r/cover.1604393169.git.szabolcs.nagy@arm.com

Catalin Marinas (4):
  mm: Track previously writeable vma permission
  mm, personality: Implement memory-deny-write-execute as a personality
    flag
  fs/binfmt_elf: Tell user-space about the DENY_WRITE_EXEC personality
    flag
  arm64: Select ARCH_ENABLE_DENY_WRITE_EXEC

 arch/arm64/Kconfig               |  1 +
 fs/binfmt_elf.c                  |  2 ++
 include/linux/mm.h               |  6 ++++++
 include/linux/mman.h             | 18 +++++++++++++++++-
 include/uapi/linux/binfmts.h     |  4 ++++
 include/uapi/linux/personality.h |  1 +
 mm/Kconfig                       |  4 ++++
 mm/mmap.c                        |  3 +++
 mm/mprotect.c                    |  5 +++++
 9 files changed, 43 insertions(+), 1 deletion(-)

Comments

Topi Miettinen April 13, 2022, 6:39 p.m. UTC | #1
On 13.4.2022 16.49, Catalin Marinas wrote:
> Hi,
> 
> The background to this is that systemd has a configuration option called
> MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
> is to prevent a user task from inadvertently creating an executable
> mapping that is (or was) writeable. Since such BPF filter is stateless,
> it cannot detect mappings that were previously writeable but
> subsequently changed to read-only. Therefore the filter simply rejects
> any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
> support (Branch Target Identification), the dynamic loader cannot change
> an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
> For libraries, it can resort to unmapping and re-mapping but for the
> main executable it does not have a file descriptor. The original bug
> report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
> for libraries - [3].
> 
> Add in-kernel support for such feature as a DENY_WRITE_EXEC personality
> flag, inherited on fork() and execve(). The kernel tracks a previously
> writeable mapping via a new VM_WAS_WRITE flag (64-bit only
> architectures). I went for a personality flag by analogy with the
> READ_IMPLIES_EXEC one. However, I'm happy to change it to a prctl() if
> we don't want more personality flags. A minor downside with the
> personality flag is that there is no way for the user to query which
> flags are supported, so in patch 3 I added an AT_FLAGS bit to advertise
> this.

With systemd there's a BPF construct to block personality changes 
(LockPersonality=yes) but I think prctl() would be easier to lock down 
irrevocably.

Requiring or implying NoNewPrivileges could prevent nasty surprises from 
set-uid Python programs which happen to use FFI.

> Posting this as an RFC to start a discussion and cc'ing some of the
> systemd guys and those involved in the earlier thread around the glibc
> workaround for dynamic libraries [4]. Before thinking of upstreaming
> this we'd need the systemd folk to buy into replacing the MDWE SECCOMP
> BPF filter with the in-kernel one.

As the author of this feature in systemd (also similar feature in 
Firejail), I'd highly prefer in-kernel version to BPF protection. I'd 
definitely also want to use this in place of BPF on x86_64 and other 
arches too.

In-kernel version would probably allow covering pretty easily this case 
(maybe it already does):

	fd = memfd_create(...);
	write(fd, malicious_code, sizeof(malicious_code));
	mmap(..., PROT_EXEC, ..., fd);

Other memory W^X implementations include S.A.R.A [1] and SELinux 
EXECMEM/EXECSTACK/EXECHEAP protections [2], [3]. SELinux checks 
IS_PRIVATE(file_inode(file)) and vma->anon_vma != NULL, which might be 
useful additions here too (or future extensions if you prefer).

-Topi

[1] https://smeso.it/sara/
[2] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/security/selinux/hooks.c#n3708
[3] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/security/selinux/hooks.c#n3787
Catalin Marinas April 14, 2022, 1:49 p.m. UTC | #2
Hi Topi,

On Wed, Apr 13, 2022 at 09:39:37PM +0300, Topi Miettinen wrote:
> On 13.4.2022 16.49, Catalin Marinas wrote:
> > The background to this is that systemd has a configuration option called
> > MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
> > is to prevent a user task from inadvertently creating an executable
> > mapping that is (or was) writeable. Since such BPF filter is stateless,
> > it cannot detect mappings that were previously writeable but
> > subsequently changed to read-only. Therefore the filter simply rejects
> > any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
> > support (Branch Target Identification), the dynamic loader cannot change
> > an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
> > For libraries, it can resort to unmapping and re-mapping but for the
> > main executable it does not have a file descriptor. The original bug
> > report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
> > for libraries - [3].
> > 
> > Add in-kernel support for such feature as a DENY_WRITE_EXEC personality
> > flag, inherited on fork() and execve(). The kernel tracks a previously
> > writeable mapping via a new VM_WAS_WRITE flag (64-bit only
> > architectures).

BTW, we can avoid the VM_WAS_WRITE tracking if we limit the check to the
current permissions. It would allow mprotect(PROT_EXEC) only if the
mapping is already executable. The mmap(PROT_WRITE|PROT_EXEC) would be
rejected, as expected. The difference from the current BPF filter is
that mprotect(PROT_EXEC|PROT_BTI) is allowed if the mapping was
previously mmap(PROT_READ|PROT_EXEC).

I'm open to go in this direction if it fits the systemd requirements
better. It's also less state to track in the kernel.

> > I went for a personality flag by analogy with the
> > READ_IMPLIES_EXEC one. However, I'm happy to change it to a prctl() if
> > we don't want more personality flags. A minor downside with the
> > personality flag is that there is no way for the user to query which
> > flags are supported, so in patch 3 I added an AT_FLAGS bit to advertise
> > this.
> 
> With systemd there's a BPF construct to block personality changes
> (LockPersonality=yes) but I think prctl() would be easier to lock down
> irrevocably.

The personality flag is not sticky, so one could inadvertently clear it
without LockPersonality=yes. We could add a mask of sticky bits to
sys_personality() (only for new flags), though we might as well go with
a prctl(), we have more flexibility and finer-grained control if we want
to expand this to memfd files.

Would there be any reason to disable such behaviour for an application,
once enabled? I don't think it's currently possible with the BPF filter
but we can add it to a prctl().

> > Posting this as an RFC to start a discussion and cc'ing some of the
> > systemd guys and those involved in the earlier thread around the glibc
> > workaround for dynamic libraries [4]. Before thinking of upstreaming
> > this we'd need the systemd folk to buy into replacing the MDWE SECCOMP
> > BPF filter with the in-kernel one.
> 
> As the author of this feature in systemd (also similar feature in Firejail),
> I'd highly prefer in-kernel version to BPF protection. I'd definitely also
> want to use this in place of BPF on x86_64 and other arches too.

It is generic, so yes, it can be enabled for other architectures. A
minor issue with the VM_WAS_WRITE flag is that we ran out of 32-bit vma
flags. It could be expanded but not sure how much you care about MDWE on
32-bit architectures. An alternative is to drop the VM_WAS_WRITE
approach entirely, just use the current protection for the decision.

> In-kernel version would probably allow covering pretty easily this case
> (maybe it already does):
> 
> 	fd = memfd_create(...);
> 	write(fd, malicious_code, sizeof(malicious_code));
> 	mmap(..., PROT_EXEC, ..., fd);

This series doesn't cover it.

> Other memory W^X implementations include S.A.R.A [1] and SELinux
> EXECMEM/EXECSTACK/EXECHEAP protections [2], [3]. SELinux checks
> IS_PRIVATE(file_inode(file)) and vma->anon_vma != NULL, which might be
> useful additions here too (or future extensions if you prefer).

I had a quick look at SELinux and, IIUC, without the execmem permission
it prevents any anonymous mmap(PROT_EXEC). We could probably do
something similar here and check the file permission. I had an attempt
but S_PRIVATE doesn't seem to be set on the memfd_create() allocated
inode. I have to dig a bit more to see how to detect this. If we keep
the check for all files, it won't be able to map any ELF text sections
unless the binary is read-only.

> [1] https://smeso.it/sara/
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/security/selinux/hooks.c#n3708
> [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/security/selinux/hooks.c#n3787
Kees Cook April 14, 2022, 6:52 p.m. UTC | #3
On Wed, Apr 13, 2022 at 02:49:42PM +0100, Catalin Marinas wrote:
> The background to this is that systemd has a configuration option called
> MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
> is to prevent a user task from inadvertently creating an executable
> mapping that is (or was) writeable. Since such BPF filter is stateless,
> it cannot detect mappings that were previously writeable but
> subsequently changed to read-only. Therefore the filter simply rejects
> any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
> support (Branch Target Identification), the dynamic loader cannot change
> an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
> For libraries, it can resort to unmapping and re-mapping but for the
> main executable it does not have a file descriptor. The original bug
> report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
> for libraries - [3].

Right, so, the systemd filter is a big hammer solution for the kernel
not having a very easy way to provide W^X mapping protections to
userspace. There's stuff in SELinux, and there have been several
attempts[1] at other LSMs to do it too, but nothing stuck.

Given the filter, and the implementation of how to enable BTI, I see two
solutions:

- provide a way to do W^X so systemd can implement the feature differently
- provide a way to turn on BTI separate from mprotect to bypass the filter

I would agree, the latter seems like the greater hack, so I welcome
this RFC, though I think it might need to explore a bit of the feature
space exposed by other solutions[1] (i.e. see SARA and NAX), otherwise
it risks being too narrowly implemented. For example, playing well with
JITs should be part of the design, and will likely need some kind of
ELF flags and/or "sealing" mode, and to handle the vma alias case as
Jann Horn pointed out[2].

> Add in-kernel support for such feature as a DENY_WRITE_EXEC personality
> flag, inherited on fork() and execve(). The kernel tracks a previously
> writeable mapping via a new VM_WAS_WRITE flag (64-bit only
> architectures). I went for a personality flag by analogy with the
> READ_IMPLIES_EXEC one. However, I'm happy to change it to a prctl() if
> we don't want more personality flags. A minor downside with the
> personality flag is that there is no way for the user to query which
> flags are supported, so in patch 3 I added an AT_FLAGS bit to advertise
> this.

My instinct here is to use a prctl(), which maps to other kinds of modern
inherited state (like no_new_privs).

> Posting this as an RFC to start a discussion and cc'ing some of the
> systemd guys and those involved in the earlier thread around the glibc
> workaround for dynamic libraries [4]. Before thinking of upstreaming
> this we'd need the systemd folk to buy into replacing the MDWE SECCOMP
> BPF filter with the in-kernel one.
> 
> Thanks,
> 
> Catalin
> 
> [1] https://www.freedesktop.org/software/systemd/man/systemd.exec.html#MemoryDenyWriteExecute=
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1888842
> [3] https://sourceware.org/bugzilla/show_bug.cgi?id=26831
> [3] https://lore.kernel.org/r/cover.1604393169.git.szabolcs.nagy@arm.com

So, yes, let's do it. It's long long overdue in the kernel. :)

-Kees

[1] https://github.com/KSPP/linux/issues/32
[2] https://github.com/KSPP/linux/issues/32#issuecomment-1084859611

> 
> Catalin Marinas (4):
>   mm: Track previously writeable vma permission
>   mm, personality: Implement memory-deny-write-execute as a personality
>     flag
>   fs/binfmt_elf: Tell user-space about the DENY_WRITE_EXEC personality
>     flag
>   arm64: Select ARCH_ENABLE_DENY_WRITE_EXEC
> 
>  arch/arm64/Kconfig               |  1 +
>  fs/binfmt_elf.c                  |  2 ++
>  include/linux/mm.h               |  6 ++++++
>  include/linux/mman.h             | 18 +++++++++++++++++-
>  include/uapi/linux/binfmts.h     |  4 ++++
>  include/uapi/linux/personality.h |  1 +
>  mm/Kconfig                       |  4 ++++
>  mm/mmap.c                        |  3 +++
>  mm/mprotect.c                    |  5 +++++
>  9 files changed, 43 insertions(+), 1 deletion(-)
>
Topi Miettinen April 15, 2022, 8:01 p.m. UTC | #4
On 14.4.2022 21.52, Kees Cook wrote:
> On Wed, Apr 13, 2022 at 02:49:42PM +0100, Catalin Marinas wrote:
>> The background to this is that systemd has a configuration option called
>> MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
>> is to prevent a user task from inadvertently creating an executable
>> mapping that is (or was) writeable. Since such BPF filter is stateless,
>> it cannot detect mappings that were previously writeable but
>> subsequently changed to read-only. Therefore the filter simply rejects
>> any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
>> support (Branch Target Identification), the dynamic loader cannot change
>> an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
>> For libraries, it can resort to unmapping and re-mapping but for the
>> main executable it does not have a file descriptor. The original bug
>> report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
>> for libraries - [3].
> 
> Right, so, the systemd filter is a big hammer solution for the kernel
> not having a very easy way to provide W^X mapping protections to
> userspace. There's stuff in SELinux, and there have been several
> attempts[1] at other LSMs to do it too, but nothing stuck.
> 
> Given the filter, and the implementation of how to enable BTI, I see two
> solutions:
> 
> - provide a way to do W^X so systemd can implement the feature differently
> - provide a way to turn on BTI separate from mprotect to bypass the filter
> 
> I would agree, the latter seems like the greater hack, so I welcome
> this RFC, though I think it might need to explore a bit of the feature
> space exposed by other solutions[1] (i.e. see SARA and NAX), otherwise
> it risks being too narrowly implemented. For example, playing well with
> JITs should be part of the design, and will likely need some kind of
> ELF flags and/or "sealing" mode, and to handle the vma alias case as
> Jann Horn pointed out[2].

Another interesting case from 2006 by Ulrich Drepper is to use a 
temporary file and map it twice, once with PROT_WRITE and once with 
PROT_EXEC [1]. This isn't possible if the mount flags of the file 
systems are also in line with W^X principle. System services (unlike 
user apps) typically don't use /tmp nor /dev/shm (mounted with 
"rw,exec"). With systemd a simple file system W^X policy can be 
implemented for a service for example with NoExecPaths=/ ExecPaths=/usr 
ReadOnlyPaths=/usr. In-kernel MDWE probably could look beyond file 
descriptors and check if the mount flags of the file system containing 
the file being mmap()ed agree with W^X. The use cases for system 
services and user apps may be different: system services are often 
compatible with maximum hardening, while user apps may need various 
compatibility solutions if they use JIT, trampolines or FFI and access 
to W+X file systems may be also needed.

-Topi

[1] https://akkadia.org/drepper/selinux-mem.html

>> Add in-kernel support for such feature as a DENY_WRITE_EXEC personality
>> flag, inherited on fork() and execve(). The kernel tracks a previously
>> writeable mapping via a new VM_WAS_WRITE flag (64-bit only
>> architectures). I went for a personality flag by analogy with the
>> READ_IMPLIES_EXEC one. However, I'm happy to change it to a prctl() if
>> we don't want more personality flags. A minor downside with the
>> personality flag is that there is no way for the user to query which
>> flags are supported, so in patch 3 I added an AT_FLAGS bit to advertise
>> this.
> 
> My instinct here is to use a prctl(), which maps to other kinds of modern
> inherited state (like no_new_privs).
> 
>> Posting this as an RFC to start a discussion and cc'ing some of the
>> systemd guys and those involved in the earlier thread around the glibc
>> workaround for dynamic libraries [4]. Before thinking of upstreaming
>> this we'd need the systemd folk to buy into replacing the MDWE SECCOMP
>> BPF filter with the in-kernel one.
>>
>> Thanks,
>>
>> Catalin
>>
>> [1] https://www.freedesktop.org/software/systemd/man/systemd.exec.html#MemoryDenyWriteExecute=
>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1888842
>> [3] https://sourceware.org/bugzilla/show_bug.cgi?id=26831
>> [3] https://lore.kernel.org/r/cover.1604393169.git.szabolcs.nagy@arm.com
> 
> So, yes, let's do it. It's long long overdue in the kernel. :)
> 
> -Kees
> 
> [1] https://github.com/KSPP/linux/issues/32
> [2] https://github.com/KSPP/linux/issues/32#issuecomment-1084859611
> 
>>
>> Catalin Marinas (4):
>>    mm: Track previously writeable vma permission
>>    mm, personality: Implement memory-deny-write-execute as a personality
>>      flag
>>    fs/binfmt_elf: Tell user-space about the DENY_WRITE_EXEC personality
>>      flag
>>    arm64: Select ARCH_ENABLE_DENY_WRITE_EXEC
>>
>>   arch/arm64/Kconfig               |  1 +
>>   fs/binfmt_elf.c                  |  2 ++
>>   include/linux/mm.h               |  6 ++++++
>>   include/linux/mman.h             | 18 +++++++++++++++++-
>>   include/uapi/linux/binfmts.h     |  4 ++++
>>   include/uapi/linux/personality.h |  1 +
>>   mm/Kconfig                       |  4 ++++
>>   mm/mmap.c                        |  3 +++
>>   mm/mprotect.c                    |  5 +++++
>>   9 files changed, 43 insertions(+), 1 deletion(-)
>>
>
Catalin Marinas April 20, 2022, 1:01 p.m. UTC | #5
On Thu, Apr 14, 2022 at 11:52:17AM -0700, Kees Cook wrote:
> On Wed, Apr 13, 2022 at 02:49:42PM +0100, Catalin Marinas wrote:
> > The background to this is that systemd has a configuration option called
> > MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
> > is to prevent a user task from inadvertently creating an executable
> > mapping that is (or was) writeable. Since such BPF filter is stateless,
> > it cannot detect mappings that were previously writeable but
> > subsequently changed to read-only. Therefore the filter simply rejects
> > any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
> > support (Branch Target Identification), the dynamic loader cannot change
> > an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
> > For libraries, it can resort to unmapping and re-mapping but for the
> > main executable it does not have a file descriptor. The original bug
> > report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
> > for libraries - [3].
> 
> Right, so, the systemd filter is a big hammer solution for the kernel
> not having a very easy way to provide W^X mapping protections to
> userspace. There's stuff in SELinux, and there have been several
> attempts[1] at other LSMs to do it too, but nothing stuck.
> 
> Given the filter, and the implementation of how to enable BTI, I see two
> solutions:
> 
> - provide a way to do W^X so systemd can implement the feature differently
> - provide a way to turn on BTI separate from mprotect to bypass the filter
> 
> I would agree, the latter seems like the greater hack,

We discussed such hacks in the past but they are just working around the
fundamental issue - systemd wants W^X but with BPF it can only achieve
it by preventing mprotect(PROT_EXEC) irrespective of whether the mapping
was already executable. If we find a better solution for W^X, we
wouldn't have to hack anything for mprotect(PROT_EXEC|PROT_BTI).

> so I welcome
> this RFC, though I think it might need to explore a bit of the feature
> space exposed by other solutions[1] (i.e. see SARA and NAX), otherwise
> it risks being too narrowly implemented. For example, playing well with
> JITs should be part of the design, and will likely need some kind of
> ELF flags and/or "sealing" mode, and to handle the vma alias case as
> Jann Horn pointed out[2].

I agree we should look at what we want to cover, though trying to avoid
re-inventing SELinux. With this patchset I went for the minimum that
systemd MDWE does with BPF.

I think JITs get around it using something like memfd with two separate
mappings to the same page. We could try to prevent such aliases but
allow it if an ELF note is detected (or get the JIT to issue a prctl()).

Anyway, with a prctl() we can allow finer-grained control starting with
anonymous and file mappings and later extending to vma aliases,
writeable files etc. On top we can add a seal mask so that a process
cannot disable a control was set. Something like (I'm not good at
names):

	prctl(PR_MDWX_SET, flags, seal_mask);
	prctl(PR_MDWX_GET);

with flags like:

	PR_MDWX_MMAP - basics, should cover mmap() and mprotect()
	PR_MDWX_ALIAS - vma aliases, allowed with an ELF note
	PR_MDWX_WRITEABLE_FILE

(needs some more thinking)
Kees Cook April 20, 2022, 5:44 p.m. UTC | #6
On Wed, Apr 20, 2022 at 02:01:02PM +0100, Catalin Marinas wrote:
> I agree we should look at what we want to cover, though trying to avoid
> re-inventing SELinux. With this patchset I went for the minimum that
> systemd MDWE does with BPF.

Right -- and I don't think we're at any risk of slippery-sloping into a
full MAC system. :)

I'm fine with doing the implementation in stages, if we've attempted to
design the steps (which I think you've got a good start on below).

> I think JITs get around it using something like memfd with two separate
> mappings to the same page. We could try to prevent such aliases but
> allow it if an ELF note is detected (or get the JIT to issue a prctl()).

Right -- I'd rather JITs carry some hard-coded property (i.e. ELF note)
to indicate the fact that they're expecting to do these kinds of things
rather than leaving it open for all processes.

> Anyway, with a prctl() we can allow finer-grained control starting with
> anonymous and file mappings and later extending to vma aliases,
> writeable files etc. On top we can add a seal mask so that a process
> cannot disable a control was set. Something like (I'm not good at
> names):
> 
> 	prctl(PR_MDWX_SET, flags, seal_mask);
> 	prctl(PR_MDWX_GET);
> 
> with flags like:
> 
> 	PR_MDWX_MMAP - basics, should cover mmap() and mprotect()
> 	PR_MDWX_ALIAS - vma aliases, allowed with an ELF note
> 	PR_MDWX_WRITEABLE_FILE

The SARA proposal lists a lot of behavioral details to consider.
Quoting it[1] here:
>> - W^X enforcement will cause problems to any programs that needs
>>   memory pages mapped both as writable and executable at the same time e.g.
>>   programs with executable stack markings in the PT_GNU_STACK segment.

IMO, executable stack markings should be considered completely
deprecated. In fact, we've been warning about it since 2020:
47a2ebb7f505 ("execve: warn if process starts with executable stack")

So with execstack, under W^X, I think we should either:
- refuse to exec the process (default)
- disable W^X for the process (but not its children)

>> - W!->X restriction will cause problems to any program that
>>   needs to generate executable code at run time or to modify executable
>>   pages e.g. programs with a JIT compiler built-in or linked against a
>>   non-PIC library.

This seems solvable with an ELF flag.

>> - Executable MMAP prevention can work only with programs that have at least
>>   partial RELRO support. It's disabled automatically for programs that
>>   lack this feature. It will cause problems to any program that uses dlopen
>>   or tries to do an executable mmap. Unfortunately this feature is the one
>>   that could create most problems and should be enabled only after careful
>>   evaluation.

This seems like a variation on the execstack case, and we should be
able to detect the state and choose a behavior based on system settings,
and a smarter version (as SARA has) would track RELRO pages waiting for
the loader to make them read-only.

SARA was proposed with a set of feature flags[2]; quoting here:
>> | W^X                          |  0x0008  |

This is the basic property, refusing PROT_WRITE | PROT_EXEC. I note that
SARA also rejects opening /proc/$pid/mem with FMODE_WRITE when this is
enabled for a process. (It likely should extend to process_vm_write()
too.)

>> | W!->X Heap                   |  0x0001  |
>> | W!->X Stack                  |  0x0002  |
>> | W!->X Other memory           |  0x0004  |

This is for the vma history tracking, and I don't think we need to
separate this by memory type? It's nice to have the granularity, but for
a first-pass it seems like overkill? Maybe I'm missing some detail.

>> | Don't enforce, just complain |  0x0010  |
>> | Be Verbose                   |  0x0020  |

Unclear if these would work well with a non-LSM approach.

>> | Executable MMAP prevention   |  0x0040  |

This is the relro detection piece.

>> | Trampoline emulation         |  0x0100  |

This is a more advanced case for emulating execstack, but if we can just
ignore execstack entirely, this can go away?

>> | Children will inherit flags  |  0x0200  |

Should a process have that control?

>> | Force W^X on setprocattr     |  0x0080  |

This is a "seal" trigger, which could be done through prctl().


It looks like a bunch of the features are designed around having as much
as possible enabled at exec time, and then tightening it further as
various things are finished (e.g. execstack, relro, sealing, etc), which
is, I think, what would still be needed for a process launcher to be
able to enable this kind of protection. (i.e. hoping the process calls a
prctl() to enable the protection isn't going to work well with systemd.)

So, I *think* we could have a minimal form with these considerations:
 - execstack: declare it distinctly incompatible.
 - relro: I think this is solved with BIND_NOW. It's been a while since
   I looked deeply at this, but I think under BIND_NOW, the (executable)
   PLT doesn't ever need to be writable (since it points into the GOT),
   and the (initially writable) GOT is already never executable. This
   needs to be verified...
 - JITs can be allowed with a ELF flag and can choose to opt-in with
   a prctl().

-Kees

[1] https://lore.kernel.org/lkml/1562410493-8661-1-git-send-email-s.mesoraca16@gmail.com/
[2] https://lore.kernel.org/lkml/1562410493-8661-2-git-send-email-s.mesoraca16@gmail.com/
Topi Miettinen April 20, 2022, 7:34 p.m. UTC | #7
On 20.4.2022 16.01, Catalin Marinas wrote:
> On Thu, Apr 14, 2022 at 11:52:17AM -0700, Kees Cook wrote:
>> On Wed, Apr 13, 2022 at 02:49:42PM +0100, Catalin Marinas wrote:
>>> The background to this is that systemd has a configuration option called
>>> MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
>>> is to prevent a user task from inadvertently creating an executable
>>> mapping that is (or was) writeable. Since such BPF filter is stateless,
>>> it cannot detect mappings that were previously writeable but
>>> subsequently changed to read-only. Therefore the filter simply rejects
>>> any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
>>> support (Branch Target Identification), the dynamic loader cannot change
>>> an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
>>> For libraries, it can resort to unmapping and re-mapping but for the
>>> main executable it does not have a file descriptor. The original bug
>>> report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
>>> for libraries - [3].
>>
>> Right, so, the systemd filter is a big hammer solution for the kernel
>> not having a very easy way to provide W^X mapping protections to
>> userspace. There's stuff in SELinux, and there have been several
>> attempts[1] at other LSMs to do it too, but nothing stuck.
>>
>> Given the filter, and the implementation of how to enable BTI, I see two
>> solutions:
>>
>> - provide a way to do W^X so systemd can implement the feature differently
>> - provide a way to turn on BTI separate from mprotect to bypass the filter
>>
>> I would agree, the latter seems like the greater hack,
> 
> We discussed such hacks in the past but they are just working around the
> fundamental issue - systemd wants W^X but with BPF it can only achieve
> it by preventing mprotect(PROT_EXEC) irrespective of whether the mapping
> was already executable. If we find a better solution for W^X, we
> wouldn't have to hack anything for mprotect(PROT_EXEC|PROT_BTI).
> 
>> so I welcome
>> this RFC, though I think it might need to explore a bit of the feature
>> space exposed by other solutions[1] (i.e. see SARA and NAX), otherwise
>> it risks being too narrowly implemented. For example, playing well with
>> JITs should be part of the design, and will likely need some kind of
>> ELF flags and/or "sealing" mode, and to handle the vma alias case as
>> Jann Horn pointed out[2].
> 
> I agree we should look at what we want to cover, though trying to avoid
> re-inventing SELinux. With this patchset I went for the minimum that
> systemd MDWE does with BPF.
> 
> I think JITs get around it using something like memfd with two separate
> mappings to the same page. We could try to prevent such aliases but
> allow it if an ELF note is detected (or get the JIT to issue a prctl()).
> 
> Anyway, with a prctl() we can allow finer-grained control starting with
> anonymous and file mappings and later extending to vma aliases,
> writeable files etc. On top we can add a seal mask so that a process
> cannot disable a control was set. Something like (I'm not good at
> names):
> 
> 	prctl(PR_MDWX_SET, flags, seal_mask);
> 	prctl(PR_MDWX_GET);
> 
> with flags like:
> 
> 	PR_MDWX_MMAP - basics, should cover mmap() and mprotect()
> 	PR_MDWX_ALIAS - vma aliases, allowed with an ELF note
> 	PR_MDWX_WRITEABLE_FILE
> 
> (needs some more thinking)
> 

For systemd, feature compatibility with the BPF version is important so 
that we could automatically switch to the kernel version once available 
without regressions. So I think PR_MDWX_MMAP (or maybe PR_MDWX_COMPAT) 
should match exactly what MemoryDenyWriteExecute=yes as implemented with 
BPF has: only forbid mmap(PROT_EXEC|PROT_WRITE) and mprotect(PROT_EXEC). 
Like BPF, once installed there should be no way to escape and ELF flags 
should be also ignored. ARM BTI should be allowed though (allow 
PROT_EXEC|PROT_BTI if the old flags had PROT_EXEC).

Then we could have improved versions (other PR_MDWX_ prctls) with lots 
more checks. This could be enabled with MemoryDenyWriteExecute=strict or so.

Perhaps also more relaxed versions (like SARA) could be interesting 
(system service running Python with FFI, or perhaps JVM etc), enabled 
with for example MemoryDenyWriteExecute=trampolines. That way even those 
programs would get some protection (though there would be a gap in the 
defences).

-Topi
Kees Cook April 20, 2022, 11:21 p.m. UTC | #8
On Wed, Apr 20, 2022 at 10:34:33PM +0300, Topi Miettinen wrote:
> On 20.4.2022 16.01, Catalin Marinas wrote:
> > On Thu, Apr 14, 2022 at 11:52:17AM -0700, Kees Cook wrote:
> > > On Wed, Apr 13, 2022 at 02:49:42PM +0100, Catalin Marinas wrote:
> > > > The background to this is that systemd has a configuration option called
> > > > MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
> > > > is to prevent a user task from inadvertently creating an executable
> > > > mapping that is (or was) writeable. Since such BPF filter is stateless,
> > > > it cannot detect mappings that were previously writeable but
> > > > subsequently changed to read-only. Therefore the filter simply rejects
> > > > any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
> > > > support (Branch Target Identification), the dynamic loader cannot change
> > > > an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
> > > > For libraries, it can resort to unmapping and re-mapping but for the
> > > > main executable it does not have a file descriptor. The original bug
> > > > report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
> > > > for libraries - [3].
> > > 
> > > Right, so, the systemd filter is a big hammer solution for the kernel
> > > not having a very easy way to provide W^X mapping protections to
> > > userspace. There's stuff in SELinux, and there have been several
> > > attempts[1] at other LSMs to do it too, but nothing stuck.
> > > 
> > > Given the filter, and the implementation of how to enable BTI, I see two
> > > solutions:
> > > 
> > > - provide a way to do W^X so systemd can implement the feature differently
> > > - provide a way to turn on BTI separate from mprotect to bypass the filter
> > > 
> > > I would agree, the latter seems like the greater hack,
> > 
> > We discussed such hacks in the past but they are just working around the
> > fundamental issue - systemd wants W^X but with BPF it can only achieve
> > it by preventing mprotect(PROT_EXEC) irrespective of whether the mapping
> > was already executable. If we find a better solution for W^X, we
> > wouldn't have to hack anything for mprotect(PROT_EXEC|PROT_BTI).
> > 
> > > so I welcome
> > > this RFC, though I think it might need to explore a bit of the feature
> > > space exposed by other solutions[1] (i.e. see SARA and NAX), otherwise
> > > it risks being too narrowly implemented. For example, playing well with
> > > JITs should be part of the design, and will likely need some kind of
> > > ELF flags and/or "sealing" mode, and to handle the vma alias case as
> > > Jann Horn pointed out[2].
> > 
> > I agree we should look at what we want to cover, though trying to avoid
> > re-inventing SELinux. With this patchset I went for the minimum that
> > systemd MDWE does with BPF.
> > 
> > I think JITs get around it using something like memfd with two separate
> > mappings to the same page. We could try to prevent such aliases but
> > allow it if an ELF note is detected (or get the JIT to issue a prctl()).
> > 
> > Anyway, with a prctl() we can allow finer-grained control starting with
> > anonymous and file mappings and later extending to vma aliases,
> > writeable files etc. On top we can add a seal mask so that a process
> > cannot disable a control was set. Something like (I'm not good at
> > names):
> > 
> > 	prctl(PR_MDWX_SET, flags, seal_mask);
> > 	prctl(PR_MDWX_GET);
> > 
> > with flags like:
> > 
> > 	PR_MDWX_MMAP - basics, should cover mmap() and mprotect()
> > 	PR_MDWX_ALIAS - vma aliases, allowed with an ELF note
> > 	PR_MDWX_WRITEABLE_FILE
> > 
> > (needs some more thinking)
> > 
> 
> For systemd, feature compatibility with the BPF version is important so that
> we could automatically switch to the kernel version once available without
> regressions. So I think PR_MDWX_MMAP (or maybe PR_MDWX_COMPAT) should match
> exactly what MemoryDenyWriteExecute=yes as implemented with BPF has: only
> forbid mmap(PROT_EXEC|PROT_WRITE) and mprotect(PROT_EXEC). Like BPF, once
> installed there should be no way to escape and ELF flags should be also
> ignored. ARM BTI should be allowed though (allow PROT_EXEC|PROT_BTI if the
> old flags had PROT_EXEC).
> 
> Then we could have improved versions (other PR_MDWX_ prctls) with lots more
> checks. This could be enabled with MemoryDenyWriteExecute=strict or so.
> 
> Perhaps also more relaxed versions (like SARA) could be interesting (system
> service running Python with FFI, or perhaps JVM etc), enabled with for
> example MemoryDenyWriteExecute=trampolines. That way even those programs
> would get some protection (though there would be a gap in the defences).

Yup, I think we're all on the same page. Catalin, can you respin with a
prctl for enabling MDWE? I propose just:

	prctl(PR_MDWX_SET, flags);
	prctl(PR_MDWX_GET);

	PR_MDWX_FLAG_MMAP
		disallows PROT_EXEC on any VMA that is or was PROT_WRITE,
		covering at least: mmap, mprotect, pkey_mprotect, and shmat.

I don't think anything should be allowed to be disabled once set.
Catalin Marinas April 21, 2022, 3:35 p.m. UTC | #9
On Wed, Apr 20, 2022 at 04:21:45PM -0700, Kees Cook wrote:
> On Wed, Apr 20, 2022 at 10:34:33PM +0300, Topi Miettinen wrote:
> > For systemd, feature compatibility with the BPF version is important so that
> > we could automatically switch to the kernel version once available without
> > regressions. So I think PR_MDWX_MMAP (or maybe PR_MDWX_COMPAT) should match
> > exactly what MemoryDenyWriteExecute=yes as implemented with BPF has: only
> > forbid mmap(PROT_EXEC|PROT_WRITE) and mprotect(PROT_EXEC). Like BPF, once
> > installed there should be no way to escape and ELF flags should be also
> > ignored. ARM BTI should be allowed though (allow PROT_EXEC|PROT_BTI if the
> > old flags had PROT_EXEC).

I agree.

> > Then we could have improved versions (other PR_MDWX_ prctls) with lots more
> > checks. This could be enabled with MemoryDenyWriteExecute=strict or so.
> > 
> > Perhaps also more relaxed versions (like SARA) could be interesting (system
> > service running Python with FFI, or perhaps JVM etc), enabled with for
> > example MemoryDenyWriteExecute=trampolines. That way even those programs
> > would get some protection (though there would be a gap in the defences).
> 
> Yup, I think we're all on the same page. Catalin, can you respin with a
> prctl for enabling MDWE? I propose just:
> 
> 	prctl(PR_MDWX_SET, flags);
> 	prctl(PR_MDWX_GET);
> 
> 	PR_MDWX_FLAG_MMAP
> 		disallows PROT_EXEC on any VMA that is or was PROT_WRITE,
> 		covering at least: mmap, mprotect, pkey_mprotect, and shmat.

Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if
the vma is not already PROT_EXEC? The latter is closer to the current
systemd approach. The former allows an mprotect(PROT_EXEC) if the
mapping was PROT_READ only for example.

I'd drop the "was PROT_WRITE" for now if the aim is a drop-in
replacement for BPF MDWE.
Kees Cook April 21, 2022, 4:42 p.m. UTC | #10
On Thu, Apr 21, 2022 at 04:35:15PM +0100, Catalin Marinas wrote:
> On Wed, Apr 20, 2022 at 04:21:45PM -0700, Kees Cook wrote:
> > On Wed, Apr 20, 2022 at 10:34:33PM +0300, Topi Miettinen wrote:
> > > For systemd, feature compatibility with the BPF version is important so that
> > > we could automatically switch to the kernel version once available without
> > > regressions. So I think PR_MDWX_MMAP (or maybe PR_MDWX_COMPAT) should match
> > > exactly what MemoryDenyWriteExecute=yes as implemented with BPF has: only
> > > forbid mmap(PROT_EXEC|PROT_WRITE) and mprotect(PROT_EXEC). Like BPF, once
> > > installed there should be no way to escape and ELF flags should be also
> > > ignored. ARM BTI should be allowed though (allow PROT_EXEC|PROT_BTI if the
> > > old flags had PROT_EXEC).
> 
> I agree.
> 
> > > Then we could have improved versions (other PR_MDWX_ prctls) with lots more
> > > checks. This could be enabled with MemoryDenyWriteExecute=strict or so.
> > > 
> > > Perhaps also more relaxed versions (like SARA) could be interesting (system
> > > service running Python with FFI, or perhaps JVM etc), enabled with for
> > > example MemoryDenyWriteExecute=trampolines. That way even those programs
> > > would get some protection (though there would be a gap in the defences).
> > 
> > Yup, I think we're all on the same page. Catalin, can you respin with a
> > prctl for enabling MDWE? I propose just:
> > 
> > 	prctl(PR_MDWX_SET, flags);
> > 	prctl(PR_MDWX_GET);
> > 
> > 	PR_MDWX_FLAG_MMAP
> > 		disallows PROT_EXEC on any VMA that is or was PROT_WRITE,
> > 		covering at least: mmap, mprotect, pkey_mprotect, and shmat.
> 
> Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if
> the vma is not already PROT_EXEC? The latter is closer to the current
> systemd approach. The former allows an mprotect(PROT_EXEC) if the
> mapping was PROT_READ only for example.
> 
> I'd drop the "was PROT_WRITE" for now if the aim is a drop-in
> replacement for BPF MDWE.

I think "was PROT_WRITE" is an important part of the defense that
couldn't be done with a simple seccomp filter (which is why the filter
ended up being a problem in the first place).
Topi Miettinen April 21, 2022, 4:48 p.m. UTC | #11
On 21.4.2022 18.35, Catalin Marinas wrote:
> On Wed, Apr 20, 2022 at 04:21:45PM -0700, Kees Cook wrote:
>> On Wed, Apr 20, 2022 at 10:34:33PM +0300, Topi Miettinen wrote:
>>> For systemd, feature compatibility with the BPF version is important so that
>>> we could automatically switch to the kernel version once available without
>>> regressions. So I think PR_MDWX_MMAP (or maybe PR_MDWX_COMPAT) should match
>>> exactly what MemoryDenyWriteExecute=yes as implemented with BPF has: only
>>> forbid mmap(PROT_EXEC|PROT_WRITE) and mprotect(PROT_EXEC). Like BPF, once
>>> installed there should be no way to escape and ELF flags should be also
>>> ignored. ARM BTI should be allowed though (allow PROT_EXEC|PROT_BTI if the
>>> old flags had PROT_EXEC).
> 
> I agree.
> 
>>> Then we could have improved versions (other PR_MDWX_ prctls) with lots more
>>> checks. This could be enabled with MemoryDenyWriteExecute=strict or so.
>>>
>>> Perhaps also more relaxed versions (like SARA) could be interesting (system
>>> service running Python with FFI, or perhaps JVM etc), enabled with for
>>> example MemoryDenyWriteExecute=trampolines. That way even those programs
>>> would get some protection (though there would be a gap in the defences).
>>
>> Yup, I think we're all on the same page. Catalin, can you respin with a
>> prctl for enabling MDWE? I propose just:
>>
>> 	prctl(PR_MDWX_SET, flags);
>> 	prctl(PR_MDWX_GET);
>>
>> 	PR_MDWX_FLAG_MMAP
>> 		disallows PROT_EXEC on any VMA that is or was PROT_WRITE,
>> 		covering at least: mmap, mprotect, pkey_mprotect, and shmat.
> 
> Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if
> the vma is not already PROT_EXEC? The latter is closer to the current
> systemd approach. The former allows an mprotect(PROT_EXEC) if the
> mapping was PROT_READ only for example.
> 
> I'd drop the "was PROT_WRITE" for now if the aim is a drop-in
> replacement for BPF MDWE.
> 

I think we'd want existing installations with MemoryDenyWriteExecute=yes 
not start failing when the implementation is changed to in-kernel 
version. The implementation could be very simple and not even check 
existing PROT_ flags (except for BTI case) to be maximally compatible to 
BPF version. So I'd leave "was PROT_WRITE" and other checks to more 
advanced versions, enabled with a different PR_MDWX_FLAG_.

-Topi
Catalin Marinas April 21, 2022, 5:24 p.m. UTC | #12
On Thu, Apr 21, 2022 at 09:42:23AM -0700, Kees Cook wrote:
> On Thu, Apr 21, 2022 at 04:35:15PM +0100, Catalin Marinas wrote:
> > Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if
> > the vma is not already PROT_EXEC? The latter is closer to the current
> > systemd approach. The former allows an mprotect(PROT_EXEC) if the
> > mapping was PROT_READ only for example.
> > 
> > I'd drop the "was PROT_WRITE" for now if the aim is a drop-in
> > replacement for BPF MDWE.
> 
> I think "was PROT_WRITE" is an important part of the defense that
> couldn't be done with a simple seccomp filter (which is why the filter
> ended up being a problem in the first place).

I would say "was PROT_WRITE" is slightly more relaxed than "is not
already PROT_EXEC". The seccomp filter can't do "is not already
PROT_EXEC" either since it only checks the mprotect() arguments, not the
current vma flags.

So we have (with sub-cases):

1. Current BPF filter:

   a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails

   b)	mmap(PROT_READ|PROT_EXEC);
	mprotect(PROT_READ|PROT_EXEC|PROT_BTI);	// fails

2. "is not already PROT_EXEC":

   a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails

   b)	mmap(PROT_READ|PROT_EXEC);
	mmap(PROT_READ|PROT_EXEC|PROT_BTI);	// passes

   c)	mmap(PROT_READ);
	mprotect(PROT_READ|PROT_EXEC);		// fails

3. "is or was not PROT_WRITE":

   a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails

   b)	mmap(PROT_READ|PROT_EXEC);
	mmap(PROT_READ|PROT_EXEC|PROT_BTI);	// passes

   c)	mmap(PROT_READ);
	mprotect(PROT_READ|PROT_EXEC);		// passes

   d)	mmap(PROT_READ|PROT_WRITE);
	mprotect(PROT_READ);
	mprotect(PROT_READ|PROT_EXEC);		// fails (was write)

The current seccomp filter is the strictest. If we go for (2), it allows
PROT_BTI as in 2.b but prevents 2.c (as would the current seccomp
filter). (3) relaxes 2.c as in 3.c while still preventing 3.d. Both (1)
and (2) prevent 3.d by simply rejecting mprotect(PROT_EXEC).

If we don't care about 3.c, we might as well go for (2). I don't mind,
already went for (3) in this series. I think either of them would not be
a regression on MDWE, unless there is some test that attempts 3.c and
expects it to fail.
Catalin Marinas April 21, 2022, 5:28 p.m. UTC | #13
On Thu, Apr 21, 2022 at 07:48:27PM +0300, Topi Miettinen wrote:
> On 21.4.2022 18.35, Catalin Marinas wrote:
> > Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if
> > the vma is not already PROT_EXEC? The latter is closer to the current
> > systemd approach. The former allows an mprotect(PROT_EXEC) if the
> > mapping was PROT_READ only for example.
> > 
> > I'd drop the "was PROT_WRITE" for now if the aim is a drop-in
> > replacement for BPF MDWE.
> 
> I think we'd want existing installations with MemoryDenyWriteExecute=yes not
> start failing when the implementation is changed to in-kernel version. The
> implementation could be very simple and not even check existing PROT_ flags
> (except for BTI case) to be maximally compatible to BPF version.

It would have to check the existing flags otherwise this could have been
implemented in the BPF filter. The dynamic loader (or kernel loader)
first mmap(PROT_READ|PROT_EXEC) and, if the BTI note is found, it
switches it to mprotect(PROT_READ|PROT_EXEC|PROT_BTI). If we allowed
this to pass simply because of PROT_BTI, one could create such
executable mapping even if it is (or was) writeable.

So we can either allow mprotect(PROT_EXEC) if the mapping was never
writeable or we allow mprotect(PROT_EXEC) if the mapping is already
PROT_EXEC (and the check for W^X was previously done by mmap()).

> So I'd leave "was PROT_WRITE" and other checks to more advanced
> versions, enabled with a different PR_MDWX_FLAG_.

This works for me as well. See my reply to Kees for the use-cases.

Thanks.
Kees Cook April 21, 2022, 5:41 p.m. UTC | #14
On Thu, Apr 21, 2022 at 06:24:21PM +0100, Catalin Marinas wrote:
> On Thu, Apr 21, 2022 at 09:42:23AM -0700, Kees Cook wrote:
> > On Thu, Apr 21, 2022 at 04:35:15PM +0100, Catalin Marinas wrote:
> > > Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if
> > > the vma is not already PROT_EXEC? The latter is closer to the current
> > > systemd approach. The former allows an mprotect(PROT_EXEC) if the
> > > mapping was PROT_READ only for example.
> > > 
> > > I'd drop the "was PROT_WRITE" for now if the aim is a drop-in
> > > replacement for BPF MDWE.
> > 
> > I think "was PROT_WRITE" is an important part of the defense that
> > couldn't be done with a simple seccomp filter (which is why the filter
> > ended up being a problem in the first place).
> 
> I would say "was PROT_WRITE" is slightly more relaxed than "is not
> already PROT_EXEC". The seccomp filter can't do "is not already
> PROT_EXEC" either since it only checks the mprotect() arguments, not the
> current vma flags.
> 
> So we have (with sub-cases):
> 
> 1. Current BPF filter:
> 
>    a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails
> 
>    b)	mmap(PROT_READ|PROT_EXEC);
> 	mprotect(PROT_READ|PROT_EXEC|PROT_BTI);	// fails
>
>    c)	mmap(PROT_READ);
> 	mprotect(PROT_READ|PROT_EXEC);		// fails
>
>    d)	mmap(PROT_READ|PROT_WRITE);
> 	mprotect(PROT_READ);
> 	mprotect(PROT_READ|PROT_EXEC);		// fails
> 
> 2. "is not already PROT_EXEC":
> 
>    a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails
> 
>    b)	mmap(PROT_READ|PROT_EXEC);
> 	mprotect(PROT_READ|PROT_EXEC|PROT_BTI);	// passes
> 
>    c)	mmap(PROT_READ);
> 	mprotect(PROT_READ|PROT_EXEC);		// fails
>
>    d)	mmap(PROT_READ|PROT_WRITE);
> 	mprotect(PROT_READ);
> 	mprotect(PROT_READ|PROT_EXEC);		// fails
> 
> 3. "is or was not PROT_WRITE":
> 
>    a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails
> 
>    b)	mmap(PROT_READ|PROT_EXEC);
> 	mprotect(PROT_READ|PROT_EXEC|PROT_BTI);	// passes
> 
>    c)	mmap(PROT_READ);
> 	mprotect(PROT_READ|PROT_EXEC);		// passes
> 
>    d)	mmap(PROT_READ|PROT_WRITE);
> 	mprotect(PROT_READ);
> 	mprotect(PROT_READ|PROT_EXEC);		// fails

[edited above to show each case]

restated what was already summarized:
Problem is 1.b. 2 and 3 solve it. 3 is more relaxed (c passes).

> If we don't care about 3.c, we might as well go for (2). I don't mind,
> already went for (3) in this series. I think either of them would not be
> a regression on MDWE, unless there is some test that attempts 3.c and
> expects it to fail.

I should stop arguing for a less restrictive mode. ;) It just feels weird
that the combinations are API-mediated, rather than logically defined:
I can do PROT_READ|PROT_EXEC with mmap but not mprotect under 2. As
opposed to saying "the vma cannot be executable if it is or ever was
writable". I find the latter much easier to reason about as far as the
expectations of system state.

So, I'd still prefer 3, as that was the _goal_ of the systemd MDWE
seccomp filter, but yes, 2 does provide the same protection while
allowing BTI.
Catalin Marinas April 21, 2022, 6:33 p.m. UTC | #15
On Thu, Apr 21, 2022 at 10:41:43AM -0700, Kees Cook wrote:
> On Thu, Apr 21, 2022 at 06:24:21PM +0100, Catalin Marinas wrote:
> > On Thu, Apr 21, 2022 at 09:42:23AM -0700, Kees Cook wrote:
> > > On Thu, Apr 21, 2022 at 04:35:15PM +0100, Catalin Marinas wrote:
> > > > Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if
> > > > the vma is not already PROT_EXEC? The latter is closer to the current
> > > > systemd approach. The former allows an mprotect(PROT_EXEC) if the
> > > > mapping was PROT_READ only for example.
> > > > 
> > > > I'd drop the "was PROT_WRITE" for now if the aim is a drop-in
> > > > replacement for BPF MDWE.
> > > 
> > > I think "was PROT_WRITE" is an important part of the defense that
> > > couldn't be done with a simple seccomp filter (which is why the filter
> > > ended up being a problem in the first place).
> > 
> > I would say "was PROT_WRITE" is slightly more relaxed than "is not
> > already PROT_EXEC". The seccomp filter can't do "is not already
> > PROT_EXEC" either since it only checks the mprotect() arguments, not the
> > current vma flags.
> > 
> > So we have (with sub-cases):
> > 
> > 1. Current BPF filter:
> > 
> >    a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails
> > 
> >    b)	mmap(PROT_READ|PROT_EXEC);
> >		mprotect(PROT_READ|PROT_EXEC|PROT_BTI);	// fails
> >
> >    c)	mmap(PROT_READ);
> >		mprotect(PROT_READ|PROT_EXEC);		// fails
> >
> >    d)	mmap(PROT_READ|PROT_WRITE);
> >		mprotect(PROT_READ);
> >		mprotect(PROT_READ|PROT_EXEC);		// fails
> > 
> > 2. "is not already PROT_EXEC":
> > 
> >    a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails
> > 
> >    b)	mmap(PROT_READ|PROT_EXEC);
> >		mprotect(PROT_READ|PROT_EXEC|PROT_BTI);	// passes
> > 
> >    c)	mmap(PROT_READ);
> >		mprotect(PROT_READ|PROT_EXEC);		// fails
> >
> >    d)	mmap(PROT_READ|PROT_WRITE);
> >		mprotect(PROT_READ);
> >		mprotect(PROT_READ|PROT_EXEC);		// fails
> > 
> > 3. "is or was not PROT_WRITE":
> > 
> >    a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails
> > 
> >    b)	mmap(PROT_READ|PROT_EXEC);
> >		mprotect(PROT_READ|PROT_EXEC|PROT_BTI);	// passes
> > 
> >    c)	mmap(PROT_READ);
> >		mprotect(PROT_READ|PROT_EXEC);		// passes
> > 
> >    d)	mmap(PROT_READ|PROT_WRITE);
> >		mprotect(PROT_READ);
> >	 	mprotect(PROT_READ|PROT_EXEC);		// fails
> 
> [edited above to show each case]

Thanks, I was in a rush to get home ;).

> restated what was already summarized:
> Problem is 1.b. 2 and 3 solve it. 3 is more relaxed (c passes).
> 
> > If we don't care about 3.c, we might as well go for (2). I don't mind,
> > already went for (3) in this series. I think either of them would not be
> > a regression on MDWE, unless there is some test that attempts 3.c and
> > expects it to fail.
> 
> I should stop arguing for a less restrictive mode. ;) It just feels weird
> that the combinations are API-mediated, rather than logically defined:
> I can do PROT_READ|PROT_EXEC with mmap but not mprotect under 2. As
> opposed to saying "the vma cannot be executable if it is or ever was
> writable". I find the latter much easier to reason about as far as the
> expectations of system state.

I had the same reasoning, hence option 3 in this series. I prefer to
treat mmap(PROT_READ|PROT_EXEC) and mprotect(PROT_READ|PROT_EXEC) in a
similar way.