Message ID | 20220413134946.2732468-3-catalin.marinas@arm.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm, arm64: In-kernel support for memory-deny-write-execute (MDWE) | expand |
On 13.04.22 15:49, Catalin Marinas wrote: > The aim of such policy is to prevent a user task from inadvertently > creating an executable mapping that is or was writeable (and > subsequently made read-only). > > An example of mmap() returning -EACCESS if the policy is enabled: > > mmap(0, size, PROT_READ | PROT_WRITE | PROT_EXEC, flags, 0, 0); > > Similarly, mprotect() would return -EACCESS below: > > addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0); > mprotect(addr, size, PROT_READ | PROT_WRITE | PROT_EXEC); > > With the past vma writeable permission tracking, mprotect() below would > also fail with -EACCESS: > > addr = mmap(0, size, PROT_READ | PROT_WRITE, flags, 0, 0); > mprotect(addr, size, PROT_READ | PROT_EXEC); > > While the above could be achieved by checking PROT_WRITE & PROT_EXEC on > mmap/mprotect and denying mprotect(PROT_EXEC) altogether (current > systemd MDWE approach via SECCOMP BPF filters), we want the following > scenario to succeed: > > addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0); > mprotect(addr, size, PROT_READ | PROT_EXEC | PROT_BTI); > > where PROT_BTI enables branch tracking identification on arm64. > > The choice for a DENY_WRITE_EXEC personality flag, inherited on fork() > and execve(), was made by analogy to READ_IMPLIES_EXEC. > > Note that it is sufficient to check for VM_WAS_WRITE in > map_deny_write_exec() as this flag is always set on VM_WRITE mappings. > > Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> > Cc: Christoph Hellwig <hch@infradead.org> > Cc: Andrew Morton <akpm@linux-foundation.org> How does this interact with get_user_pages(FOLL_WRITE|FOLL_FORCE) on a VMA that is VM_MAYWRITE but not VM_WRITE? Is it handled accordingly? Note that in the (FOLL_WRITE|FOLL_FORCE) we only require VM_MAYWRITE on the vma and trigger a write fault. As the VMA is not VM_WRITE, we won't actually map the PTE writable, but set it dirty. GUP will retry, find a R/O pte that is dirty and where it knows that it broke COW and will allow the read access, although the PTE is R/O. That mechanism is required to e.g., set breakpoints in R/O MAP_PRIVATE kernel sections, but it's used elsewhere for page pinning as well. My gut feeling is that GUP(FOLL_WRITE|FOLL_FORCE) could be used right now to bypass that mechanism, I might be wrong.
On Thu, Apr 21, 2022 at 06:37:49PM +0100, David Hildenbrand wrote: > On 13.04.22 15:49, Catalin Marinas wrote: > > The aim of such policy is to prevent a user task from inadvertently > > creating an executable mapping that is or was writeable (and > > subsequently made read-only). > > > > An example of mmap() returning -EACCESS if the policy is enabled: > > > > mmap(0, size, PROT_READ | PROT_WRITE | PROT_EXEC, flags, 0, 0); > > > > Similarly, mprotect() would return -EACCESS below: > > > > addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0); > > mprotect(addr, size, PROT_READ | PROT_WRITE | PROT_EXEC); > > > > With the past vma writeable permission tracking, mprotect() below would > > also fail with -EACCESS: > > > > addr = mmap(0, size, PROT_READ | PROT_WRITE, flags, 0, 0); > > mprotect(addr, size, PROT_READ | PROT_EXEC); > > > > While the above could be achieved by checking PROT_WRITE & PROT_EXEC on > > mmap/mprotect and denying mprotect(PROT_EXEC) altogether (current > > systemd MDWE approach via SECCOMP BPF filters), we want the following > > scenario to succeed: > > > > addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0); > > mprotect(addr, size, PROT_READ | PROT_EXEC | PROT_BTI); > > > > where PROT_BTI enables branch tracking identification on arm64. > > > > The choice for a DENY_WRITE_EXEC personality flag, inherited on fork() > > and execve(), was made by analogy to READ_IMPLIES_EXEC. > > > > Note that it is sufficient to check for VM_WAS_WRITE in > > map_deny_write_exec() as this flag is always set on VM_WRITE mappings. > > > > Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> > > Cc: Christoph Hellwig <hch@infradead.org> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > How does this interact with get_user_pages(FOLL_WRITE|FOLL_FORCE) on a > VMA that is VM_MAYWRITE but not VM_WRITE? Is it handled accordingly? For now, that's just about VM_WRITE. Most vmas are VM_MAYWRITE, so we can't really have MAYWRITE^EXEC. The basic feature aims to avoid user vulnerabilities where a buffer is mapped both writeable and executable. Of course, it can be expanded with additional prctl() flags to cover other cases. > Note that in the (FOLL_WRITE|FOLL_FORCE) we only require VM_MAYWRITE on > the vma and trigger a write fault. As the VMA is not VM_WRITE, we won't > actually map the PTE writable, but set it dirty. GUP will retry, find a > R/O pte that is dirty and where it knows that it broke COW and will > allow the read access, although the PTE is R/O. > > That mechanism is required to e.g., set breakpoints in R/O MAP_PRIVATE > kernel sections, but it's used elsewhere for page pinning as well. > > My gut feeling is that GUP(FOLL_WRITE|FOLL_FORCE) could be used right > now to bypass that mechanism, I might be wrong. GUP can be used to bypass this. But if an attacker can trigger such GUP paths via a syscall (e.g. ptrace(PTRACE_POKEDATA)), I think we need the checks on those paths (and reject the syscall) rather than on mmap/mprotect(). This would be covered by something like CAP_SYS_PTRACE. Not sure what would break if we prevent GUP(FOLL_WRITE|FOLL_FORCE) when the vma is !VM_WRITE, basically removing FOLL_FORCE. I guess for ptrace() and uprobes that's fine. We could also make this only about VM_EXEC rather than VM_WRITE, though we'd probably need to set VM_WAS_WRITE if we ever had a GUP(FOLL_WRITE|FOLL_FORCE) in order to prevent a subsequent mprotect(PROT_EXEC). Anyway, this can be a new flag. My first aim is to get the basics working similarly to systemd's MDWE.
On 22.04.22 12:28, Catalin Marinas wrote: > On Thu, Apr 21, 2022 at 06:37:49PM +0100, David Hildenbrand wrote: >> On 13.04.22 15:49, Catalin Marinas wrote: >>> The aim of such policy is to prevent a user task from inadvertently >>> creating an executable mapping that is or was writeable (and >>> subsequently made read-only). >>> >>> An example of mmap() returning -EACCESS if the policy is enabled: >>> >>> mmap(0, size, PROT_READ | PROT_WRITE | PROT_EXEC, flags, 0, 0); >>> >>> Similarly, mprotect() would return -EACCESS below: >>> >>> addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0); >>> mprotect(addr, size, PROT_READ | PROT_WRITE | PROT_EXEC); >>> >>> With the past vma writeable permission tracking, mprotect() below would >>> also fail with -EACCESS: >>> >>> addr = mmap(0, size, PROT_READ | PROT_WRITE, flags, 0, 0); >>> mprotect(addr, size, PROT_READ | PROT_EXEC); >>> >>> While the above could be achieved by checking PROT_WRITE & PROT_EXEC on >>> mmap/mprotect and denying mprotect(PROT_EXEC) altogether (current >>> systemd MDWE approach via SECCOMP BPF filters), we want the following >>> scenario to succeed: >>> >>> addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0); >>> mprotect(addr, size, PROT_READ | PROT_EXEC | PROT_BTI); >>> >>> where PROT_BTI enables branch tracking identification on arm64. >>> >>> The choice for a DENY_WRITE_EXEC personality flag, inherited on fork() >>> and execve(), was made by analogy to READ_IMPLIES_EXEC. >>> >>> Note that it is sufficient to check for VM_WAS_WRITE in >>> map_deny_write_exec() as this flag is always set on VM_WRITE mappings. >>> >>> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> >>> Cc: Christoph Hellwig <hch@infradead.org> >>> Cc: Andrew Morton <akpm@linux-foundation.org> >> >> How does this interact with get_user_pages(FOLL_WRITE|FOLL_FORCE) on a >> VMA that is VM_MAYWRITE but not VM_WRITE? Is it handled accordingly? > > For now, that's just about VM_WRITE. Most vmas are VM_MAYWRITE, so we > can't really have MAYWRITE^EXEC. The basic feature aims to avoid user > vulnerabilities where a buffer is mapped both writeable and executable. > Of course, it can be expanded with additional prctl() flags to cover > other cases. > >> Note that in the (FOLL_WRITE|FOLL_FORCE) we only require VM_MAYWRITE on >> the vma and trigger a write fault. As the VMA is not VM_WRITE, we won't >> actually map the PTE writable, but set it dirty. GUP will retry, find a >> R/O pte that is dirty and where it knows that it broke COW and will >> allow the read access, although the PTE is R/O. >> >> That mechanism is required to e.g., set breakpoints in R/O MAP_PRIVATE >> kernel sections, but it's used elsewhere for page pinning as well. >> >> My gut feeling is that GUP(FOLL_WRITE|FOLL_FORCE) could be used right >> now to bypass that mechanism, I might be wrong. > > GUP can be used to bypass this. But if an attacker can trigger such GUP > paths via a syscall (e.g. ptrace(PTRACE_POKEDATA)), I think we need the > checks on those paths (and reject the syscall) rather than on > mmap/mprotect(). This would be covered by something like CAP_SYS_PTRACE. > > I was told that RDMA uses FOLL_FORCE|FOLL_WRITE and is available to unprivileged users.
On Fri, Apr 22, 2022 at 01:04:31PM +0200, David Hildenbrand wrote: > On 22.04.22 12:28, Catalin Marinas wrote: > > On Thu, Apr 21, 2022 at 06:37:49PM +0100, David Hildenbrand wrote: > >> Note that in the (FOLL_WRITE|FOLL_FORCE) we only require VM_MAYWRITE on > >> the vma and trigger a write fault. As the VMA is not VM_WRITE, we won't > >> actually map the PTE writable, but set it dirty. GUP will retry, find a > >> R/O pte that is dirty and where it knows that it broke COW and will > >> allow the read access, although the PTE is R/O. > >> > >> That mechanism is required to e.g., set breakpoints in R/O MAP_PRIVATE > >> kernel sections, but it's used elsewhere for page pinning as well. > >> > >> My gut feeling is that GUP(FOLL_WRITE|FOLL_FORCE) could be used right > >> now to bypass that mechanism, I might be wrong. > > > > GUP can be used to bypass this. But if an attacker can trigger such GUP > > paths via a syscall (e.g. ptrace(PTRACE_POKEDATA)), I think we need the > > checks on those paths (and reject the syscall) rather than on > > mmap/mprotect(). This would be covered by something like CAP_SYS_PTRACE. > > I was told that RDMA uses FOLL_FORCE|FOLL_WRITE and is available to > unprivileged users. Ah, do they really need this? At a quick search, ib_umem_get() for example: unsigned int gup_flags = FOLL_WRITE; ... if (!umem->writable) gup_flags |= FOLL_FORCE; I guess with a new MDWE flag we can make the GUP code ignore FOLL_FORCE if VM_EXEC.
On 22.04.22 15:12, Catalin Marinas wrote: > On Fri, Apr 22, 2022 at 01:04:31PM +0200, David Hildenbrand wrote: >> On 22.04.22 12:28, Catalin Marinas wrote: >>> On Thu, Apr 21, 2022 at 06:37:49PM +0100, David Hildenbrand wrote: >>>> Note that in the (FOLL_WRITE|FOLL_FORCE) we only require VM_MAYWRITE on >>>> the vma and trigger a write fault. As the VMA is not VM_WRITE, we won't >>>> actually map the PTE writable, but set it dirty. GUP will retry, find a >>>> R/O pte that is dirty and where it knows that it broke COW and will >>>> allow the read access, although the PTE is R/O. >>>> >>>> That mechanism is required to e.g., set breakpoints in R/O MAP_PRIVATE >>>> kernel sections, but it's used elsewhere for page pinning as well. >>>> >>>> My gut feeling is that GUP(FOLL_WRITE|FOLL_FORCE) could be used right >>>> now to bypass that mechanism, I might be wrong. >>> >>> GUP can be used to bypass this. But if an attacker can trigger such GUP >>> paths via a syscall (e.g. ptrace(PTRACE_POKEDATA)), I think we need the >>> checks on those paths (and reject the syscall) rather than on >>> mmap/mprotect(). This would be covered by something like CAP_SYS_PTRACE. >> >> I was told that RDMA uses FOLL_FORCE|FOLL_WRITE and is available to >> unprivileged users. > > Ah, do they really need this? At a quick search, ib_umem_get() for > example: > > unsigned int gup_flags = FOLL_WRITE; > ... > if (!umem->writable) > gup_flags |= FOLL_FORCE; > > I guess with a new MDWE flag we can make the GUP code ignore FOLL_FORCE > if VM_EXEC. > It's, for example, required when you have a MAP_PRIVATE but PROT_READ mapping and want to take a reliable R/O (!) pin. Without FOLL_FORCE|FOLL_WRITE you'd be pinning a (shared zeropage, pagecache) page that will get replaced by an anonymous page in the COW handler, after mprotect(PROT_READ|PROT_WRITE) followed by a write access. That was an issue for RDMA in the past, that's why we have that handling in place IIRC. Yes, it's ugly.
diff --git a/include/linux/mman.h b/include/linux/mman.h index 2d841ddae2aa..17e91a1bdfb3 100644 --- a/include/linux/mman.h +++ b/include/linux/mman.h @@ -166,4 +166,14 @@ calc_vm_flag_bits(unsigned long flags) } unsigned long vm_commit_limit(void); + +static inline bool map_deny_write_exec(unsigned long vm_flags) +{ + if (IS_ENABLED(CONFIG_ARCH_ENABLE_DENY_WRITE_EXEC) && + (current->personality & DENY_WRITE_EXEC) && + (vm_flags & VM_EXEC) && (vm_flags & VM_WAS_WRITE)) + return true; + return false; +} + #endif /* _LINUX_MMAN_H */ diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h index 49796b7756af..c8d924be3dcd 100644 --- a/include/uapi/linux/personality.h +++ b/include/uapi/linux/personality.h @@ -22,6 +22,7 @@ enum { WHOLE_SECONDS = 0x2000000, STICKY_TIMEOUTS = 0x4000000, ADDR_LIMIT_3GB = 0x8000000, + DENY_WRITE_EXEC = 0x10000000, }; /* diff --git a/mm/mmap.c b/mm/mmap.c index 3aa839f81e63..8e894270a80e 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1579,6 +1579,9 @@ unsigned long do_mmap(struct file *file, unsigned long addr, vm_flags |= VM_NORESERVE; } + if (map_deny_write_exec(vm_flags)) + return -EACCES; + addr = mmap_region(file, addr, len, vm_flags, pgoff, uf); if (!IS_ERR_VALUE(addr) && ((vm_flags & VM_LOCKED) || diff --git a/mm/mprotect.c b/mm/mprotect.c index b69ce7a7b2b7..ff0d13a4c1ed 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -627,6 +627,11 @@ static int do_mprotect_pkey(unsigned long start, size_t len, goto out; } + if (map_deny_write_exec(newflags)) { + error = -EACCES; + goto out; + } + /* Allow architectures to sanity-check the new flags */ if (!arch_validate_flags(newflags)) { error = -EINVAL;
The aim of such policy is to prevent a user task from inadvertently creating an executable mapping that is or was writeable (and subsequently made read-only). An example of mmap() returning -EACCESS if the policy is enabled: mmap(0, size, PROT_READ | PROT_WRITE | PROT_EXEC, flags, 0, 0); Similarly, mprotect() would return -EACCESS below: addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0); mprotect(addr, size, PROT_READ | PROT_WRITE | PROT_EXEC); With the past vma writeable permission tracking, mprotect() below would also fail with -EACCESS: addr = mmap(0, size, PROT_READ | PROT_WRITE, flags, 0, 0); mprotect(addr, size, PROT_READ | PROT_EXEC); While the above could be achieved by checking PROT_WRITE & PROT_EXEC on mmap/mprotect and denying mprotect(PROT_EXEC) altogether (current systemd MDWE approach via SECCOMP BPF filters), we want the following scenario to succeed: addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0); mprotect(addr, size, PROT_READ | PROT_EXEC | PROT_BTI); where PROT_BTI enables branch tracking identification on arm64. The choice for a DENY_WRITE_EXEC personality flag, inherited on fork() and execve(), was made by analogy to READ_IMPLIES_EXEC. Note that it is sufficient to check for VM_WAS_WRITE in map_deny_write_exec() as this flag is always set on VM_WRITE mappings. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> --- include/linux/mman.h | 10 ++++++++++ include/uapi/linux/personality.h | 1 + mm/mmap.c | 3 +++ mm/mprotect.c | 5 +++++ 4 files changed, 19 insertions(+)