diff mbox series

mm: improve mprotect(R|W) efficiency on pages referenced once

Message ID 20201212053152.3783250-1-pcc@google.com (mailing list archive)
State New, archived
Headers show
Series mm: improve mprotect(R|W) efficiency on pages referenced once | expand

Commit Message

Peter Collingbourne Dec. 12, 2020, 5:31 a.m. UTC
In the Scudo memory allocator [1] we would like to be able to
detect use-after-free vulnerabilities involving large allocations
by issuing mprotect(PROT_NONE) on the memory region used for the
allocation when it is deallocated. Later on, after the memory
region has been "quarantined" for a sufficient period of time we
would like to be able to use it for another allocation by issuing
mprotect(PROT_READ|PROT_WRITE).

Before this patch, after removing the write protection, any writes
to the memory region would result in page faults and entering
the copy-on-write code path, even in the usual case where the
pages are only referenced by a single PTE, harming performance
unnecessarily. Make it so that any pages in anonymous mappings that
are only referenced by a single PTE are immediately made writable
during the mprotect so that we can avoid the page faults.

This program shows the critical syscall sequence that we intend to
use in the allocator:

  #include <string.h>
  #include <sys/mman.h>

  enum { kSize = 131072 };

  int main(int argc, char **argv) {
    char *addr = (char *)mmap(0, kSize, PROT_READ | PROT_WRITE,
                              MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    for (int i = 0; i != 100000; ++i) {
      memset(addr, i, kSize);
      mprotect((void *)addr, kSize, PROT_NONE);
      mprotect((void *)addr, kSize, PROT_READ | PROT_WRITE);
    }
  }

The effect of this patch on the above program was measured on a
DragonBoard 845c by taking the median real time execution time of
10 runs.

Before: 3.19s
After:  0.79s

The effect was also measured using one of the microbenchmarks that
we normally use to benchmark the allocator [2], after modifying it
to make the appropriate mprotect calls [3]. With an allocation size
of 131072 bytes to trigger the allocator's "large allocation" code
path the per-iteration time was measured as follows:

Before: 33364ns
After:   6886ns

Signed-off-by: Peter Collingbourne <pcc@google.com>
Link: https://linux-review.googlesource.com/id/I98d75ef90e20330c578871c87494d64b1df3f1b8
Link: [1] https://source.android.com/devices/tech/debug/scudo
Link: [2] https://cs.android.com/android/platform/superproject/+/master:bionic/benchmarks/stdlib_benchmark.cpp;l=53;drc=e8693e78711e8f45ccd2b610e4dbe0b94d551cc9
Link: [3] https://github.com/pcc/llvm-project/commit/scudo-mprotect-secondary
---
 mm/mprotect.c | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Andrew Morton Dec. 23, 2020, 10:34 p.m. UTC | #1
On Fri, 11 Dec 2020 21:31:52 -0800 Peter Collingbourne <pcc@google.com> wrote:

> In the Scudo memory allocator [1] we would like to be able to
> detect use-after-free vulnerabilities involving large allocations
> by issuing mprotect(PROT_NONE) on the memory region used for the
> allocation when it is deallocated. Later on, after the memory
> region has been "quarantined" for a sufficient period of time we
> would like to be able to use it for another allocation by issuing
> mprotect(PROT_READ|PROT_WRITE).
> 
> Before this patch, after removing the write protection, any writes
> to the memory region would result in page faults and entering
> the copy-on-write code path, even in the usual case where the
> pages are only referenced by a single PTE, harming performance
> unnecessarily. Make it so that any pages in anonymous mappings that
> are only referenced by a single PTE are immediately made writable
> during the mprotect so that we can avoid the page faults.
> 

I worry that some other application out there does a similar thing, but
only expects to very sparsely write to the area.  It will see a big increase
in mprotect() latency.

Would it be better to implement this as a separate operation, rather
than unconditionally tying it into mprotect()?  Say, a new madvise()
operation?
Peter Collingbourne Dec. 29, 2020, 2:09 a.m. UTC | #2
On Wed, Dec 23, 2020 at 2:34 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 11 Dec 2020 21:31:52 -0800 Peter Collingbourne <pcc@google.com> wrote:
>
> > In the Scudo memory allocator [1] we would like to be able to
> > detect use-after-free vulnerabilities involving large allocations
> > by issuing mprotect(PROT_NONE) on the memory region used for the
> > allocation when it is deallocated. Later on, after the memory
> > region has been "quarantined" for a sufficient period of time we
> > would like to be able to use it for another allocation by issuing
> > mprotect(PROT_READ|PROT_WRITE).
> >
> > Before this patch, after removing the write protection, any writes
> > to the memory region would result in page faults and entering
> > the copy-on-write code path, even in the usual case where the
> > pages are only referenced by a single PTE, harming performance
> > unnecessarily. Make it so that any pages in anonymous mappings that
> > are only referenced by a single PTE are immediately made writable
> > during the mprotect so that we can avoid the page faults.
> >
>
> I worry that some other application out there does a similar thing, but
> only expects to very sparsely write to the area.  It will see a big increase
> in mprotect() latency.
>
> Would it be better to implement this as a separate operation, rather
> than unconditionally tying it into mprotect()?  Say, a new madvise()
> operation?

So the case that you're concerned about would be highlighted by this
program, correct?

#include <string.h>
#include <sys/mman.h>

enum { kSize = 131072 };

int main(int argc, char **argv) {
  char *addr = (char *)mmap(0, kSize, PROT_READ | PROT_WRITE,
                            MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
  memset(addr, 1, kSize);
  for (int i = 0; i != 100000; ++i) {
    mprotect((void *)addr, kSize, PROT_NONE);
    mprotect((void *)addr, kSize, PROT_READ | PROT_WRITE);
  }
}

I measured the before/after real time execution time of this program
(median of 100 runs) on the DragonBoard and the results were:

Before: 0.325928s
After: 0.365493s

So there's an increase of around 12%. It doesn't seem very large to me
when compared to the 400-500% improvement that we get in the case
where every page is touched.

I also tried this program to measure the impact in the case where a
single page is touched after the mprotect:

#include <string.h>
#include <sys/mman.h>

enum { kSize = 131072 };

int main(int argc, char **argv) {
  char *addr = (char *)mmap(0, kSize, PROT_READ | PROT_WRITE,
                            MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
  memset(addr, 1, kSize);
  for (int i = 0; i != 100000; ++i) {
    memset(addr + (i * 4096) % kSize, i, 4096);
    mprotect((void *)addr, kSize, PROT_NONE);
    mprotect((void *)addr, kSize, PROT_READ | PROT_WRITE);
  }
}

With this program the numbers were:

Before: 0.441516s
After: 0.380251s

So it seems that even with a single page fault the new approach is faster.

I saw similar results if I adjusted the programs to use a larger
mapping size. With kSize = 1048576 I get these numbers for the first
program:

Before: 1.563078s
After: 1.607476s

i.e. around 3%.

And these for the second:

Before: 1.684663s
After: 1.683272s

i.e. about the same.

What I think we may conclude from these results is that for smaller
mappings the advantage of the previous approach, although measurable,
is wiped out by a single page fault. I think we may expect that there
should be at least one access resulting in a page fault (under the
previous approach) after making the pages writable, since the program
presumably made the pages writable for a reason.

For larger mappings we may guesstimate that the new approach wins if
the density of future page faults is > 0.4%. But for the mappings that
are large enough for density to matter (not just the absolute number
of page faults) it doesn't seem like the increase in mprotect latency
would be very large relative to the total mprotect execution time.

So my inclination is that I don't really see a strong need for it to
be toggleable, but if we do, the behavior implemented by this patch
should be the default. Perhaps something like madvise(MADV_COLD) could
be used to get the previous behavior but I wouldn't implement that
straight away.

Peter
Andrew Morton Dec. 29, 2020, 7:25 p.m. UTC | #3
On Mon, 28 Dec 2020 18:09:28 -0800 Peter Collingbourne <pcc@google.com> wrote:

> On Wed, Dec 23, 2020 at 2:34 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Fri, 11 Dec 2020 21:31:52 -0800 Peter Collingbourne <pcc@google.com> wrote:
> >
> > > In the Scudo memory allocator [1] we would like to be able to
> > > detect use-after-free vulnerabilities involving large allocations
> > > by issuing mprotect(PROT_NONE) on the memory region used for the
> > > allocation when it is deallocated. Later on, after the memory
> > > region has been "quarantined" for a sufficient period of time we
> > > would like to be able to use it for another allocation by issuing
> > > mprotect(PROT_READ|PROT_WRITE).
> > >
> > > Before this patch, after removing the write protection, any writes
> > > to the memory region would result in page faults and entering
> > > the copy-on-write code path, even in the usual case where the
> > > pages are only referenced by a single PTE, harming performance
> > > unnecessarily. Make it so that any pages in anonymous mappings that
> > > are only referenced by a single PTE are immediately made writable
> > > during the mprotect so that we can avoid the page faults.
> > >
> >
> > I worry that some other application out there does a similar thing, but
> > only expects to very sparsely write to the area.  It will see a big increase
> > in mprotect() latency.
> >
> > Would it be better to implement this as a separate operation, rather
> > than unconditionally tying it into mprotect()?  Say, a new madvise()
> > operation?
> 
> So the case that you're concerned about would be highlighted by this
> program, correct?
> 
> ...
>
> So it seems that even with a single page fault the new approach is faster.
> 

Cool, thanks.  Can you please roll this new info into the changelog and
resend?
Peter Collingbourne Dec. 30, 2020, 12:43 a.m. UTC | #4
On Tue, Dec 29, 2020 at 11:25 AM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Mon, 28 Dec 2020 18:09:28 -0800 Peter Collingbourne <pcc@google.com> wrote:
>
> > On Wed, Dec 23, 2020 at 2:34 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > On Fri, 11 Dec 2020 21:31:52 -0800 Peter Collingbourne <pcc@google.com> wrote:
> > >
> > > > In the Scudo memory allocator [1] we would like to be able to
> > > > detect use-after-free vulnerabilities involving large allocations
> > > > by issuing mprotect(PROT_NONE) on the memory region used for the
> > > > allocation when it is deallocated. Later on, after the memory
> > > > region has been "quarantined" for a sufficient period of time we
> > > > would like to be able to use it for another allocation by issuing
> > > > mprotect(PROT_READ|PROT_WRITE).
> > > >
> > > > Before this patch, after removing the write protection, any writes
> > > > to the memory region would result in page faults and entering
> > > > the copy-on-write code path, even in the usual case where the
> > > > pages are only referenced by a single PTE, harming performance
> > > > unnecessarily. Make it so that any pages in anonymous mappings that
> > > > are only referenced by a single PTE are immediately made writable
> > > > during the mprotect so that we can avoid the page faults.
> > > >
> > >
> > > I worry that some other application out there does a similar thing, but
> > > only expects to very sparsely write to the area.  It will see a big increase
> > > in mprotect() latency.
> > >
> > > Would it be better to implement this as a separate operation, rather
> > > than unconditionally tying it into mprotect()?  Say, a new madvise()
> > > operation?
> >
> > So the case that you're concerned about would be highlighted by this
> > program, correct?
> >
> > ...
> >
> > So it seems that even with a single page fault the new approach is faster.
> >
>
> Cool, thanks.  Can you please roll this new info into the changelog and
> resend?

Sent a v2 with the new info added to the commit message.

Peter
diff mbox series

Patch

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 56c02beb6041..6f5313d66d00 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -47,6 +47,8 @@  static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+	bool anon_writable =
+		vma_is_anonymous(vma) && (vma->vm_flags & VM_WRITE);
 
 	/*
 	 * Can be called with only the mmap_lock for reading by
@@ -136,7 +138,11 @@  static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 					(pte_soft_dirty(ptent) ||
 					 !(vma->vm_flags & VM_SOFTDIRTY))) {
 				ptent = pte_mkwrite(ptent);
+			} else if (anon_writable &&
+				   page_mapcount(pte_page(ptent)) == 1) {
+				ptent = pte_mkwrite(ptent);
 			}
+
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
 			pages++;
 		} else if (is_swap_pte(oldpte)) {