
[RFC,2/4] mm, personality: Implement memory-deny-write-execute as a personality flag

Message ID 20220413134946.2732468-3-catalin.marinas@arm.com (mailing list archive)
State New, archived
Series mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)

Commit Message

Catalin Marinas April 13, 2022, 1:49 p.m. UTC
The aim of such a policy is to prevent a user task from inadvertently
creating an executable mapping that is or was writeable (and
subsequently made read-only).

An example of mmap() returning -EACCES if the policy is enabled:

	mmap(0, size, PROT_READ | PROT_WRITE | PROT_EXEC, flags, 0, 0);

Similarly, mprotect() would return -EACCES below:

	addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
	mprotect(addr, size, PROT_READ | PROT_WRITE | PROT_EXEC);

With the past vma writeable permission tracking, mprotect() below would
also fail with -EACCES:

	addr = mmap(0, size, PROT_READ | PROT_WRITE, flags, 0, 0);
	mprotect(addr, size, PROT_READ | PROT_EXEC);

While the above could be achieved by checking PROT_WRITE & PROT_EXEC on
mmap/mprotect and denying mprotect(PROT_EXEC) altogether (current
systemd MDWE approach via SECCOMP BPF filters), we want the following
scenario to succeed:

	addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
	mprotect(addr, size, PROT_READ | PROT_EXEC | PROT_BTI);

where PROT_BTI enables Branch Target Identification on arm64.

The choice for a DENY_WRITE_EXEC personality flag, inherited on fork()
and execve(), was made by analogy to READ_IMPLIES_EXEC.

Note that it is sufficient to check for VM_WAS_WRITE in
map_deny_write_exec() as this flag is always set on VM_WRITE mappings.
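
For illustration, a minimal userspace test of the policy (a sketch only; it
assumes this series is applied and defines DENY_WRITE_EXEC locally with the
same value this patch adds to the uapi header):

	#include <errno.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/personality.h>

	#ifndef DENY_WRITE_EXEC
	#define DENY_WRITE_EXEC 0x10000000	/* value added by this patch */
	#endif

	int main(void)
	{
		size_t size = 4096;
		void *p;

		/* Inherited across fork() and execve(), like READ_IMPLIES_EXEC. */
		personality(personality(0xffffffff) | DENY_WRITE_EXEC);

		/* Denied: writeable and executable at the same time. */
		p = mmap(NULL, size, PROT_READ | PROT_WRITE | PROT_EXEC,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		printf("mmap(rwx): %s\n", p == MAP_FAILED ? strerror(errno) : "ok");

		/* Denied: the mapping was writeable, so it cannot become executable. */
		p = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p != MAP_FAILED && mprotect(p, size, PROT_READ | PROT_EXEC) < 0)
			printf("mprotect(r-x): %s\n", strerror(errno));

		return 0;
	}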

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/mman.h             | 10 ++++++++++
 include/uapi/linux/personality.h |  1 +
 mm/mmap.c                        |  3 +++
 mm/mprotect.c                    |  5 +++++
 4 files changed, 19 insertions(+)

Comments

David Hildenbrand April 21, 2022, 5:37 p.m. UTC | #1
On 13.04.22 15:49, Catalin Marinas wrote:
> The aim of such a policy is to prevent a user task from inadvertently
> creating an executable mapping that is or was writeable (and
> subsequently made read-only).
> 
> An example of mmap() returning -EACCES if the policy is enabled:
> 
> 	mmap(0, size, PROT_READ | PROT_WRITE | PROT_EXEC, flags, 0, 0);
> 
> Similarly, mprotect() would return -EACCES below:
> 
> 	addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
> 	mprotect(addr, size, PROT_READ | PROT_WRITE | PROT_EXEC);
> 
> With the past vma writeable permission tracking, mprotect() below would
> also fail with -EACCES:
> 
> 	addr = mmap(0, size, PROT_READ | PROT_WRITE, flags, 0, 0);
> 	mprotect(addr, size, PROT_READ | PROT_EXEC);
> 
> While the above could be achieved by checking PROT_WRITE & PROT_EXEC on
> mmap/mprotect and denying mprotect(PROT_EXEC) altogether (current
> systemd MDWE approach via SECCOMP BPF filters), we want the following
> scenario to succeed:
> 
> 	addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
> 	mprotect(addr, size, PROT_READ | PROT_EXEC | PROT_BTI);
> 
> where PROT_BTI enables Branch Target Identification on arm64.
> 
> The choice for a DENY_WRITE_EXEC personality flag, inherited on fork()
> and execve(), was made by analogy to READ_IMPLIES_EXEC.
> 
> Note that it is sufficient to check for VM_WAS_WRITE in
> map_deny_write_exec() as this flag is always set on VM_WRITE mappings.
> 
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>

How does this interact with get_user_pages(FOLL_WRITE|FOLL_FORCE) on a
VMA that is VM_MAYWRITE but not VM_WRITE? Is it handled accordingly?

Note that in the (FOLL_WRITE|FOLL_FORCE) case we only require VM_MAYWRITE on
the vma and trigger a write fault. As the VMA is not VM_WRITE, we won't
actually map the PTE writable, but set it dirty. GUP will retry, find a
R/O pte that is dirty and where it knows that it broke COW and will
allow the read access, although the PTE is R/O.

That mechanism is required to e.g., set breakpoints in R/O MAP_PRIVATE
kernel sections, but it's used elsewhere for page pinning as well.

My gut feeling is that GUP(FOLL_WRITE|FOLL_FORCE) could be used right
now to bypass that mechanism, but I might be wrong.
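
As an illustration of such a forced write (a sketch only, not part of this
series): writes to /proc/self/mem are serviced by GUP with
FOLL_FORCE|FOLL_WRITE, so an unprivileged task can modify a private mapping
that was never PROT_WRITE and, because the vma never becomes VM_WRITE, later
mprotect() it executable without tripping the new check:

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/mman.h>

	int main(void)
	{
		/* Never PROT_WRITE, but MAP_PRIVATE keeps VM_MAYWRITE set. */
		unsigned char *p = mmap(NULL, 4096, PROT_READ,
					MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		unsigned char insn = 0xc3;	/* 'ret' on x86-64 */
		int fd = open("/proc/self/mem", O_RDWR);

		/* Serviced via GUP FOLL_FORCE|FOLL_WRITE: COW is broken and the
		 * store lands even though the vma is not VM_WRITE. */
		pwrite(fd, &insn, 1, (off_t)(unsigned long)p);
		printf("p[0] = %#x\n", p[0]);

		/* The vma was never VM_WRITE, hence never VM_WAS_WRITE, so this
		 * still succeeds with DENY_WRITE_EXEC set. */
		mprotect(p, 4096, PROT_READ | PROT_EXEC);
		return 0;
	}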
Catalin Marinas April 22, 2022, 10:28 a.m. UTC | #2
On Thu, Apr 21, 2022 at 06:37:49PM +0100, David Hildenbrand wrote:
> How does this interact with get_user_pages(FOLL_WRITE|FOLL_FORCE) on a
> VMA that is VM_MAYWRITE but not VM_WRITE? Is it handled accordingly?

For now, that's just about VM_WRITE. Most vmas are VM_MAYWRITE, so we
can't really have MAYWRITE^EXEC. The basic feature aims to avoid user
vulnerabilities where a buffer is mapped both writeable and executable.
Of course, it can be expanded with additional prctl() flags to cover
other cases.

> Note that in the (FOLL_WRITE|FOLL_FORCE) we only require VM_MAYWRITE on
> the vma and trigger a write fault. As the VMA is not VM_WRITE, we won't
> actually map the PTE writable, but set it dirty. GUP will retry, find a
> R/O pte that is dirty and where it knows that it broke COW and will
> allow the read access, although the PTE is R/O.
> 
> That mechanism is required to e.g., set breakpoints in R/O MAP_PRIVATE
> kernel sections, but it's used elsewhere for page pinning as well.
> 
> My gut feeling is that GUP(FOLL_WRITE|FOLL_FORCE) could be used right
> now to bypass that mechanism, I might be wrong.

GUP can be used to bypass this. But if an attacker can trigger such GUP
paths via a syscall (e.g. ptrace(PTRACE_POKEDATA)), I think we need the
checks on those paths (and reject the syscall) rather than on
mmap/mprotect(). This would be covered by something like CAP_SYS_PTRACE.

Not sure what would break if we prevent GUP(FOLL_WRITE|FOLL_FORCE) when
the vma is !VM_WRITE, basically removing FOLL_FORCE. I guess for
ptrace() and uprobes that's fine. We could also make this only about
VM_EXEC rather than VM_WRITE, though we'd probably need to set
VM_WAS_WRITE if we ever had a GUP(FOLL_WRITE|FOLL_FORCE) in order to
prevent a subsequent mprotect(PROT_EXEC).
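
Purely as a hypothetical sketch of that last idea (not in this series, and
glossing over the fact that updating vm_flags needs the mmap lock held for
writing), the GUP path that grants such a forced write could record it so
that map_deny_write_exec() later refuses mprotect(PROT_EXEC):

	/* hypothetical: wherever GUP decides to honour a forced write */
	if ((gup_flags & (FOLL_WRITE | FOLL_FORCE)) == (FOLL_WRITE | FOLL_FORCE) &&
	    !(vma->vm_flags & VM_WRITE))
		vma->vm_flags |= VM_WAS_WRITE;	/* contents were effectively writeable */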

Anyway, this can be a new flag. My first aim is to get the basics
working similarly to systemd's MDWE.
David Hildenbrand April 22, 2022, 11:04 a.m. UTC | #3
On 22.04.22 12:28, Catalin Marinas wrote:
> On Thu, Apr 21, 2022 at 06:37:49PM +0100, David Hildenbrand wrote:
>> How does this interact with get_user_pages(FOLL_WRITE|FOLL_FORCE) on a
>> VMA that is VM_MAYWRITE but not VM_WRITE? Is it handled accordingly?
> 
> For now, that's just about VM_WRITE. Most vmas are VM_MAYWRITE, so we
> can't really have MAYWRITE^EXEC. The basic feature aims to avoid user
> vulnerabilities where a buffer is mapped both writeable and executable.
> Of course, it can be expanded with additional prctl() flags to cover
> other cases.
> 
>> Note that in the (FOLL_WRITE|FOLL_FORCE) we only require VM_MAYWRITE on
>> the vma and trigger a write fault. As the VMA is not VM_WRITE, we won't
>> actually map the PTE writable, but set it dirty. GUP will retry, find a
>> R/O pte that is dirty and where it knows that it broke COW and will
>> allow the read access, although the PTE is R/O.
>>
>> That mechanism is required to e.g., set breakpoints in R/O MAP_PRIVATE
>> kernel sections, but it's used elsewhere for page pinning as well.
>>
>> My gut feeling is that GUP(FOLL_WRITE|FOLL_FORCE) could be used right
>> now to bypass that mechanism, I might be wrong.
> 
> GUP can be used to bypass this. But if an attacker can trigger such GUP
> paths via a syscall (e.g. ptrace(PTRACE_POKEDATA)), I think we need the
> checks on those paths (and reject the syscall) rather than on
> mmap/mprotect(). This would be covered by something like CAP_SYS_PTRACE.
> 
> 

I was told that RDMA uses FOLL_FORCE|FOLL_WRITE and is available to
unprivileged users.
Catalin Marinas April 22, 2022, 1:12 p.m. UTC | #4
On Fri, Apr 22, 2022 at 01:04:31PM +0200, David Hildenbrand wrote:
> On 22.04.22 12:28, Catalin Marinas wrote:
> > On Thu, Apr 21, 2022 at 06:37:49PM +0100, David Hildenbrand wrote:
> >> Note that in the (FOLL_WRITE|FOLL_FORCE) we only require VM_MAYWRITE on
> >> the vma and trigger a write fault. As the VMA is not VM_WRITE, we won't
> >> actually map the PTE writable, but set it dirty. GUP will retry, find a
> >> R/O pte that is dirty and where it knows that it broke COW and will
> >> allow the read access, although the PTE is R/O.
> >>
> >> That mechanism is required to e.g., set breakpoints in R/O MAP_PRIVATE
> >> kernel sections, but it's used elsewhere for page pinning as well.
> >>
> >> My gut feeling is that GUP(FOLL_WRITE|FOLL_FORCE) could be used right
> >> now to bypass that mechanism, I might be wrong.
> > 
> > GUP can be used to bypass this. But if an attacker can trigger such GUP
> > paths via a syscall (e.g. ptrace(PTRACE_POKEDATA)), I think we need the
> > checks on those paths (and reject the syscall) rather than on
> > mmap/mprotect(). This would be covered by something like CAP_SYS_PTRACE.
> 
> I was told that RDMA uses FOLL_FORCE|FOLL_WRITE and is available to
> unprivileged users.

Ah, do they really need this? At a quick search, ib_umem_get() for
example:

	unsigned int gup_flags = FOLL_WRITE;
	...
	if (!umem->writable)
		gup_flags |= FOLL_FORCE;

I guess with a new MDWE flag we can make the GUP code ignore FOLL_FORCE
if VM_EXEC.
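
Something along these lines, perhaps (a hypothetical sketch only, not part of
this series; the placement in check_vma_flags() and the exact condition are
assumptions):

	/* mm/gup.c:check_vma_flags(), before the existing write checks */
	if ((gup_flags & FOLL_FORCE) && (gup_flags & FOLL_WRITE) &&
	    !(vm_flags & VM_WRITE) &&
	    map_deny_write_exec(vm_flags | VM_WAS_WRITE))
		return -EFAULT;	/* no forced writes into executable vmas under MDWE */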
David Hildenbrand April 22, 2022, 5:41 p.m. UTC | #5
On 22.04.22 15:12, Catalin Marinas wrote:
> On Fri, Apr 22, 2022 at 01:04:31PM +0200, David Hildenbrand wrote:
>> On 22.04.22 12:28, Catalin Marinas wrote:
>>> On Thu, Apr 21, 2022 at 06:37:49PM +0100, David Hildenbrand wrote:
>>>> Note that in the (FOLL_WRITE|FOLL_FORCE) we only require VM_MAYWRITE on
>>>> the vma and trigger a write fault. As the VMA is not VM_WRITE, we won't
>>>> actually map the PTE writable, but set it dirty. GUP will retry, find a
>>>> R/O pte that is dirty and where it knows that it broke COW and will
>>>> allow the read access, although the PTE is R/O.
>>>>
>>>> That mechanism is required to e.g., set breakpoints in R/O MAP_PRIVATE
>>>> kernel sections, but it's used elsewhere for page pinning as well.
>>>>
>>>> My gut feeling is that GUP(FOLL_WRITE|FOLL_FORCE) could be used right
>>>> now to bypass that mechanism, I might be wrong.
>>>
>>> GUP can be used to bypass this. But if an attacker can trigger such GUP
>>> paths via a syscall (e.g. ptrace(PTRACE_POKEDATA)), I think we need the
>>> checks on those paths (and reject the syscall) rather than on
>>> mmap/mprotect(). This would be covered by something like CAP_SYS_PTRACE.
>>
>> I was told that RDMA uses FOLL_FORCE|FOLL_WRITE and is available to
>> unprivileged users.
> 
> Ah, do they really need this? At a quick search, ib_umem_get() for
> example:
> 
> 	unsigned int gup_flags = FOLL_WRITE;
> 	...
> 	if (!umem->writable)
> 		gup_flags |= FOLL_FORCE;
> 
> I guess with a new MDWE flag we can make the GUP code ignore FOLL_FORCE
> if VM_EXEC.
> 

It's, for example, required when you have a MAP_PRIVATE but PROT_READ
mapping and want to take a reliable R/O (!) pin. Without
FOLL_FORCE|FOLL_WRITE you'd be pinning a (shared zeropage or pagecache)
page that will get replaced by an anonymous page in the COW handler,
after mprotect(PROT_READ|PROT_WRITE) followed by a write access. That
was an issue for RDMA in the past; that's why we have that handling in
place IIRC.

Yes, it's ugly.
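
The sequence being described, roughly (an illustrative sketch; the long-term
pin stands in for whatever the driver does with GUP, e.g. an RDMA memory
registration, and data.bin is a made-up file):

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("data.bin", O_RDONLY);	/* hypothetical file */
		char *p = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);

		/*
		 * A driver takes a long-term R/O pin on p via GUP here (e.g. an
		 * RDMA memory registration). Without FOLL_FORCE|FOLL_WRITE it
		 * would be pinning the pagecache page backing the mapping...
		 */
		mprotect(p, 4096, PROT_READ | PROT_WRITE);
		p[0] = 1;	/* ...which is replaced by an anonymous page in the
				 * COW handler here, so the pin and the task end up
				 * looking at different memory. Breaking COW up
				 * front (FOLL_FORCE|FOLL_WRITE) avoids that. */
		return 0;
	}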

Patch

diff --git a/include/linux/mman.h b/include/linux/mman.h
index 2d841ddae2aa..17e91a1bdfb3 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -166,4 +166,14 @@  calc_vm_flag_bits(unsigned long flags)
 }
 
 unsigned long vm_commit_limit(void);
+
+static inline bool map_deny_write_exec(unsigned long vm_flags)
+{
+	if (IS_ENABLED(CONFIG_ARCH_ENABLE_DENY_WRITE_EXEC) &&
+	    (current->personality & DENY_WRITE_EXEC) &&
+	    (vm_flags & VM_EXEC) && (vm_flags & VM_WAS_WRITE))
+		return true;
+	return false;
+}
+
 #endif /* _LINUX_MMAN_H */
diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h
index 49796b7756af..c8d924be3dcd 100644
--- a/include/uapi/linux/personality.h
+++ b/include/uapi/linux/personality.h
@@ -22,6 +22,7 @@  enum {
 	WHOLE_SECONDS =		0x2000000,
 	STICKY_TIMEOUTS	=	0x4000000,
 	ADDR_LIMIT_3GB = 	0x8000000,
+	DENY_WRITE_EXEC =	0x10000000,
 };
 
 /*
diff --git a/mm/mmap.c b/mm/mmap.c
index 3aa839f81e63..8e894270a80e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1579,6 +1579,9 @@  unsigned long do_mmap(struct file *file, unsigned long addr,
 			vm_flags |= VM_NORESERVE;
 	}
 
+	if (map_deny_write_exec(vm_flags))
+		return -EACCES;
+
 	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
 	if (!IS_ERR_VALUE(addr) &&
 	    ((vm_flags & VM_LOCKED) ||
diff --git a/mm/mprotect.c b/mm/mprotect.c
index b69ce7a7b2b7..ff0d13a4c1ed 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -627,6 +627,11 @@  static int do_mprotect_pkey(unsigned long start, size_t len,
 			goto out;
 		}
 
+		if (map_deny_write_exec(newflags)) {
+			error = -EACCES;
+			goto out;
+		}
+
 		/* Allow architectures to sanity-check the new flags */
 		if (!arch_validate_flags(newflags)) {
 			error = -EINVAL;