[RFC,v2,5/8] mseal: Check seal flag for munmap(2)

Message ID 20231017090815.1067790-6-jeffxu@chromium.org (mailing list archive)
State New
Series Introduce mseal() syscall

Commit Message

Jeff Xu Oct. 17, 2023, 9:08 a.m. UTC
From: Jeff Xu <jeffxu@google.com>

munmap(2) unmaps VMAs in the given address range.
Sealing prevents unintended munmap(2) calls.

What this patch does:
When munmap(2) is invoked, if any VMA in the given range has
MM_SEAL_MUNMAP set by a previous mseal(2) call, the munmap(2)
fails without modifying any VMA.

This patch is based on the following:
1. At the syscall entry point, SYSCALL_DEFINE2(munmap, ...),
pass checkSeals = MM_SEAL_MUNMAP into __vm_munmap(), and in
turn into do_vmi_munmap().

Of all the call paths that call into do_vmi_munmap(),
this is the only place where checkSeals = MM_SEAL_MUNMAP.
The rest have checkSeals = 0.

2. do_vmi_munmap() calls can_modify_mm() before any
update is made to the VMAs.
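
can_modify_mm() itself is added earlier in this series. As a rough
sketch of the intended semantics (sketch only, not the series' actual
code; the vm_seals field name is illustrative):

        /*
         * Sketch: walk every VMA that overlaps [start, end) and refuse
         * the operation if any of them carries one of the seal types
         * the caller asked to check.
         */
        static bool can_modify_mm(struct mm_struct *mm, unsigned long start,
                                  unsigned long end, unsigned long checkSeals)
        {
                struct vm_area_struct *vma;
                VMA_ITERATOR(vmi, mm, start);

                if (!checkSeals)
                        return true;    /* caller did not request a seal check */

                for_each_vma_range(vmi, vma, end) {
                        if (vma->vm_seals & checkSeals) /* illustrative field */
                                return false;
                }
                return true;
        }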

Signed-off-by: Jeff Xu <jeffxu@google.com>
---
 include/linux/mm.h |  2 +-
 mm/mmap.c          | 21 +++++++++++++--------
 mm/mremap.c        |  5 +++--
 3 files changed, 17 insertions(+), 11 deletions(-)

Comments

Linus Torvalds Oct. 17, 2023, 4:54 p.m. UTC | #1
On Tue, 17 Oct 2023 at 02:08, <jeffxu@chromium.org> wrote:
>
> Of all the call paths that call into do_vmi_munmap(),
> this is the only place where checkSeals = MM_SEAL_MUNMAP.
> The rest have checkSeals = 0.

Why?

None of this makes sense.

So you say "we can't munmap in this *one* place, but all others ignore
the sealing".

Crazy.

             Linus
Jeff Xu Oct. 18, 2023, 3:08 p.m. UTC | #2
On Tue, Oct 17, 2023 at 9:54 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 17 Oct 2023 at 02:08, <jeffxu@chromium.org> wrote:
> >
> > Of all the call paths that call into do_vmi_munmap(),
> > this is the only place where checkSeals = MM_SEAL_MUNMAP.
> > The rest have checkSeals = 0.
>
> Why?
>
> None of this makes sense.
>
> So you say "we can't munmap in this *one* place, but all others ignore
> the sealing".
>
I apologize that previously I described what this code does, but not the reasoning.

In our threat model, as Stephen Röttger points out in [1], and I quote:

V8 exploits typically follow a similar pattern: an initial bug leads
to memory corruption but often the initial corruption is limited and
the attacker has to find a way to arbitrarily read/write in the whole
address space.

The memory corruption is in the user space process, e.g. Chrome.
Attackers will try to modify the permissions of the memory, by
calling mprotect, or by calling munmap then mmap at the same address
but with different permissions, etc.

Sealing blocks mprotect/munmap/mremap/mmap calls from the user space
process, e.g. Chrome.
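
For illustration, a minimal userspace sketch of the pattern sealing
is meant to block; __NR_mseal and MM_SEAL_MUNMAP are assumed to come
from the patched kernel's headers, since there is no libc wrapper:

        #include <errno.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        int main(void)
        {
                size_t len = 4096;
                void *p = mmap(NULL, len, PROT_READ,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                /* Seal against unmapping (assumes __NR_mseal and
                 * MM_SEAL_MUNMAP come from the patched kernel's headers). */
                syscall(__NR_mseal, p, len, MM_SEAL_MUNMAP, 0);

                /* The attacker-style munmap-then-mmap-over now fails. */
                if (munmap(p, len) == -1)
                        printf("munmap blocked: %s\n", strerror(errno));

                return 0;
        }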

At the time of handling those 4 syscalls, we need to check the seal
(can_modify_mm); this requires taking the mmap write lock
(mmap_write_lock_killable), and it should ideally happen after
validating the syscall input. The reasonable place for
can_modify_mm() is therefore in the utility functions, such as
do_mmap(), do_vmi_munmap(), etc.

However, there is no guarantee that do_mmap() and do_vmi_munmap() are
only reachable from the mprotect/munmap/mremap/mmap syscall entry
points (SYSCALL_DEFINE_XX). In theory, the kernel can call them in
other scenarios, and some of those can be perfectly legit. Those
other scenarios are not covered by our threat model at this time.
Therefore, we need a flag, passed from the SYSCALL_DEFINE_XX entry
point down to can_modify_mm(), to differentiate those other scenarios.
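
Concretely, condensed from the diff at the bottom of this page, the
same helper serves both a seal-checked syscall path and unchecked
in-kernel paths:

        /* Syscall entry: the only caller that requests a seal check. */
        SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
        {
                addr = untagged_addr(addr);
                return __vm_munmap(addr, len, true, MM_SEAL_MUNMAP);
        }

        /* In-kernel callers keep the old behavior by passing checkSeals = 0. */
        int vm_munmap(unsigned long start, size_t len)
        {
                return __vm_munmap(start, len, false, 0);
        }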

Now, back to the code: it does some optimization, i.e. it doesn't
pass the flag from SYSCALL_DEFINE_XX in all cases. If
SYSCALL_DEFINE_XX calls do_a(), and do_a() has only one caller, I set
the flag in do_a() instead of in SYSCALL_DEFINE_XX. Doing this
reduces the size of the patchset, but it does make the code less
readable. I could remove this optimization in V3. I welcome
suggestions to improve readability on this.

When handling mprotect/munmap/mremap/mmap, once the code has passed
can_modify_mm(), the memory area is known not to be sealed; if the
code then continues into the other utility functions, we don't need
to check the seal again. This is the case for mremap(): the seals of
the src address and the dest address (when applicable) are checked
first, so when the code later calls do_vmi_munmap(), it no longer
needs to check the seal again.
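
For example, a self-contained sketch of that ordering for the
mremap(2) shrink case (sketch only; the actual checks live in a later
patch of this series, and the MM_SEAL_MREMAP placement here is
illustrative):

        /*
         * The seal is validated once up front, so the subsequent
         * do_vmi_munmap() call passes checkSeals = 0.
         */
        static int mremap_shrink_sketch(struct mm_struct *mm, unsigned long addr,
                                        unsigned long old_len, unsigned long new_len,
                                        struct list_head *uf_unmap)
        {
                VMA_ITERATOR(vmi, mm, addr + new_len);

                /* src (and, when applicable, dest) range checked here once */
                if (!can_modify_mm(mm, addr, addr + old_len, MM_SEAL_MREMAP))
                        return -EACCES;

                /* already validated above, so no second seal check */
                return do_vmi_munmap(&vmi, mm, addr + new_len, old_len - new_len,
                                     uf_unmap, true, 0);
        }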

[1] https://v8.dev/blog/control-flow-integrity

-Jeff
Jeff Xu Oct. 18, 2023, 5:14 p.m. UTC | #3
On Wed, Oct 18, 2023 at 8:08 AM Jeff Xu <jeffxu@google.com> wrote:
>
> [...]
>
> -Jeff

There is also an alternative approach:

For all the places that call do_vmi_munmap(), find out which
cases should legitimately ignore the sealing flag, set an
ignore_seal flag there, and pass it down into do_vmi_munmap().
All other cases get the default behavior.

All future APIs will automatically be covered by sealing, by using
the default.

The risky side: if I miss a case that requires setting ignore_seal,
there will be a bug.

Also, if a driver calls the utility functions to unmap memory, the
seal will be checked as well. (Drivers are not in our threat model,
but Chrome probably doesn't mind it.)
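
A sketch of what that inversion could look like (the ignore_seal
parameter and the wrapper below are hypothetical; this alternative
was never implemented):

        /*
         * Hypothetical: sealing is checked by default, and only callers
         * that explicitly pass ignore_seal = true skip the check.
         */
        static int do_vmi_munmap_default_sealed(struct vma_iterator *vmi,
                                                struct mm_struct *mm,
                                                unsigned long start, size_t len,
                                                struct list_head *uf, bool unlock,
                                                bool ignore_seal)
        {
                if (!ignore_seal &&
                    !can_modify_mm(mm, start, start + len, MM_SEAL_MUNMAP))
                        return -EACCES;

                return do_vmi_munmap(vmi, mm, start, len, uf, unlock, 0);
        }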

Which of those two approaches is better? I appreciate direction on this.

Thanks!
-Jeff
Linus Torvalds Oct. 18, 2023, 6:27 p.m. UTC | #4
On Wed, 18 Oct 2023 at 10:14, Jeff Xu <jeffxu@google.com> wrote:
>
> There is also alternative approach:
>
> For all the places that call do_vmi_munmap(), find out which
> case should ignore the sealing flag legitimately,

NO.

Christ.

THERE ARE NO LEGITIMATE CASES OF IGNORING SEALING FLAGS.

If you ignore a sealing flag, it's not a sealing flag. It's random
crap, and claiming that it has *anything* to do with security is just
a cruel joke.

Really.

Stop this. I do not want to hear your excuses for garbage any more.
We're done. If I hear any more arguments for this sh*t, I will
literally put you in my ignore file, and will auto-NAK any future
patches.

This is simply not up for discussion. Any flag for "ignore sealing" is wrong.

We do have one special "unmap" case, namely unmap_vmas() called at
the last mmput() -> __mmput() -> exit_mmap().

And yes, that is called at munmap() time too, but that's after the
point of no return after we've already removed the vma's from the VM
lists. So it's long after any error cases have been checked.

             Linus
Jeff Xu Oct. 18, 2023, 7:07 p.m. UTC | #5
On Wed, Oct 18, 2023 at 11:27 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, 18 Oct 2023 at 10:14, Jeff Xu <jeffxu@google.com> wrote:
> This is simply not up for discussion. Any flag for "ignore sealing" is wrong.
>
> We do have one special "unmap" case, namely unmap_vmas() called at
> the last mmput() -> __mmput() -> exit_mmap().
>
> And yes, that is called at munmap() time too, but that's after the
> point of no return after we've already removed the vma's from the VM
> lists. So it's long after any error cases have been checked.
>
Ah, I see.
I didn't know there was no legit case; that is what I worried about before.
This flag can be removed.

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b09df8501987..f2f316522f2a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3279,7 +3279,7 @@  extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long pgoff, unsigned long *populate, struct list_head *uf);
 extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 			 unsigned long start, size_t len, struct list_head *uf,
-			 bool unlock);
+			 bool unlock, unsigned long checkSeals);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t,
 		     struct list_head *uf);
 extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
diff --git a/mm/mmap.c b/mm/mmap.c
index 414ac31aa9fa..62d592f16f45 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2601,6 +2601,7 @@  do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
  * @len: The length of the range to munmap
  * @uf: The userfaultfd list_head
  * @unlock: set to true if the user wants to drop the mmap_lock on success
+ * @checkSeals: seal type to check.
  *
  * This function takes a @mas that is either pointing to the previous VMA or set
  * to MA_START and sets it up to remove the mapping(s).  The @len will be
@@ -2611,7 +2612,7 @@  do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
  */
 int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 		  unsigned long start, size_t len, struct list_head *uf,
-		  bool unlock)
+		  bool unlock, unsigned long checkSeals)
 {
 	unsigned long end;
 	struct vm_area_struct *vma;
@@ -2623,6 +2624,9 @@  int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 	if (end == start)
 		return -EINVAL;
 
+	if (!can_modify_mm(mm, start, end, checkSeals))
+		return -EACCES;
+
 	 /* arch_unmap() might do unmaps itself.  */
 	arch_unmap(mm, start, end);
 
@@ -2650,7 +2654,7 @@  int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 {
 	VMA_ITERATOR(vmi, mm, start);
 
-	return do_vmi_munmap(&vmi, mm, start, len, uf, false);
+	return do_vmi_munmap(&vmi, mm, start, len, uf, false, 0);
 }
 
 unsigned long mmap_region(struct file *file, unsigned long addr,
@@ -2684,7 +2688,7 @@  unsigned long mmap_region(struct file *file, unsigned long addr,
 	}
 
 	/* Unmap any existing mapping in the area */
-	if (do_vmi_munmap(&vmi, mm, addr, len, uf, false))
+	if (do_vmi_munmap(&vmi, mm, addr, len, uf, false, 0))
 		return -ENOMEM;
 
 	/*
@@ -2909,7 +2913,8 @@  unsigned long mmap_region(struct file *file, unsigned long addr,
 	return error;
 }
 
-static int __vm_munmap(unsigned long start, size_t len, bool unlock)
+static int __vm_munmap(unsigned long start, size_t len, bool unlock,
+			unsigned long checkSeals)
 {
 	int ret;
 	struct mm_struct *mm = current->mm;
@@ -2919,7 +2924,7 @@  static int __vm_munmap(unsigned long start, size_t len, bool unlock)
 	if (mmap_write_lock_killable(mm))
 		return -EINTR;
 
-	ret = do_vmi_munmap(&vmi, mm, start, len, &uf, unlock);
+	ret = do_vmi_munmap(&vmi, mm, start, len, &uf, unlock, checkSeals);
 	if (ret || !unlock)
 		mmap_write_unlock(mm);
 
@@ -2929,14 +2934,14 @@  static int __vm_munmap(unsigned long start, size_t len, bool unlock)
 
 int vm_munmap(unsigned long start, size_t len)
 {
-	return __vm_munmap(start, len, false);
+	return __vm_munmap(start, len, false, 0);
 }
 EXPORT_SYMBOL(vm_munmap);
 
 SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
 {
 	addr = untagged_addr(addr);
-	return __vm_munmap(addr, len, true);
+	return __vm_munmap(addr, len, true, MM_SEAL_MUNMAP);
 }
 
 
@@ -3168,7 +3173,7 @@  int vm_brk_flags(unsigned long addr, unsigned long request, unsigned long flags)
 	if (ret)
 		goto limits_failed;
 
-	ret = do_vmi_munmap(&vmi, mm, addr, len, &uf, 0);
+	ret = do_vmi_munmap(&vmi, mm, addr, len, &uf, 0, 0);
 	if (ret)
 		goto munmap_failed;
 
diff --git a/mm/mremap.c b/mm/mremap.c
index 056478c106ee..ac363937f8c4 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -715,7 +715,8 @@  static unsigned long move_vma(struct vm_area_struct *vma,
 	}
 
 	vma_iter_init(&vmi, mm, old_addr);
-	if (!do_vmi_munmap(&vmi, mm, old_addr, old_len, uf_unmap, false)) {
+	if (!do_vmi_munmap(&vmi, mm, old_addr, old_len, uf_unmap, false,
+				0)) {
 		/* OOM: unable to split vma, just get accounts right */
 		if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP))
 			vm_acct_memory(old_len >> PAGE_SHIFT);
@@ -1009,7 +1010,7 @@  SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 		}
 
 		ret = do_vmi_munmap(&vmi, mm, addr + new_len, old_len - new_len,
-				    &uf_unmap, true);
+				    &uf_unmap, true, 0);
 		if (ret)
 			goto out;