
[128/192] mm: improve mprotect(R|W) efficiency on pages referenced once

Message ID 20210629023959.4ZAFiI8oZ%akpm@linux-foundation.org (mailing list archive)
State New
Series [001/192] mm/gup: fix try_grab_compound_head() race with split_huge_page()

Commit Message

Andrew Morton June 29, 2021, 2:39 a.m. UTC
From: Peter Collingbourne <pcc@google.com>
Subject: mm: improve mprotect(R|W) efficiency on pages referenced once

In the Scudo memory allocator [1] we would like to be able to detect
use-after-free vulnerabilities involving large allocations by issuing
mprotect(PROT_NONE) on the memory region used for the allocation when it
is deallocated.  Later on, after the memory region has been "quarantined"
for a sufficient period of time we would like to be able to use it for
another allocation by issuing mprotect(PROT_READ|PROT_WRITE).

Before this patch, after removing the write protection, any writes to the
memory region would result in page faults and entering the copy-on-write
code path, even in the usual case where the pages are only referenced by a
single PTE, harming performance unnecessarily.  Make it so that any pages
in anonymous mappings that are only referenced by a single PTE are
immediately made writable during the mprotect so that we can avoid the
page faults.

This program shows the critical syscall sequence that we intend to use in
the allocator:

  #include <string.h>
  #include <sys/mman.h>

  enum { kSize = 131072 };

  int main(int argc, char **argv) {
    char *addr = (char *)mmap(0, kSize, PROT_READ | PROT_WRITE,
                              MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    for (int i = 0; i != 100000; ++i) {
      memset(addr, i, kSize);
      mprotect((void *)addr, kSize, PROT_NONE);
      mprotect((void *)addr, kSize, PROT_READ | PROT_WRITE);
    }
  }

The effect of this patch on the above program was measured on a
DragonBoard 845c by taking the median real time execution time of 10 runs.

Before: 2.94s
After:  0.66s

The effect was also measured using one of the microbenchmarks that we
normally use to benchmark the allocator [2], after modifying it to make
the appropriate mprotect calls [3].  With an allocation size of 131072
bytes to trigger the allocator's "large allocation" code path the
per-iteration time was measured as follows:

Before: 27450ns
After:   6010ns

This patch means that we do more work during the mprotect call itself in
exchange for less work when the pages are accessed.  In the worst case,
the pages are not accessed at all.  The effect of this patch in such cases
was measured using the following program:

  #include <string.h>
  #include <sys/mman.h>

  enum { kSize = 131072 };

  int main(int argc, char **argv) {
    char *addr = (char *)mmap(0, kSize, PROT_READ | PROT_WRITE,
                              MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    memset(addr, 1, kSize);
    for (int i = 0; i != 100000; ++i) {
  #ifdef PAGE_FAULT
      memset(addr + (i * 4096) % kSize, i, 4096);
  #endif
      mprotect((void *)addr, kSize, PROT_NONE);
      mprotect((void *)addr, kSize, PROT_READ | PROT_WRITE);
    }
  }

With PAGE_FAULT undefined (0 pages touched after removing write
protection) the median real time execution time of 100 runs was measured
as follows:

Before: 0.330260s
After:  0.338836s

With PAGE_FAULT defined (1 page touched) the measurements were
as follows:

Before: 0.438048s
After:  0.355661s

So it seems that even with a single page fault the new approach is faster.

I saw similar results if I adjusted the programs to use a larger mapping
size.  With kSize = 1048576 I get these numbers with PAGE_FAULT undefined:

Before: 1.428988s
After:  1.512016s

i.e. the new approach is around 5.5% slower in this worst case.

And these with PAGE_FAULT defined:

Before: 1.518559s
After:  1.524417s

i.e. about the same.

What I think we may conclude from these results is that for smaller
mappings the advantage of the previous approach, although measurable, is
wiped out by a single page fault.  I think we may expect that there should
be at least one access resulting in a page fault (under the previous
approach) after making the pages writable, since the program presumably
made the pages writable for a reason.

For larger mappings we may guesstimate that the new approach wins if the
density of future page faults is > 0.4%.  But for the mappings that are
large enough for density to matter (not just the absolute number of page
faults) it doesn't seem like the increase in mprotect latency would be
very large relative to the total mprotect execution time.

[pcc@google.com: add comments, prohibit optimization for NUMA pages]
  Link: https://lkml.kernel.org/r/20210601185926.2623183-1-pcc@google.com
Link: https://lkml.kernel.org/r/20210527190453.1259020-1-pcc@google.com
Link: https://linux-review.googlesource.com/id/I98d75ef90e20330c578871c87494d64b1df3f1b8
Link: [1] https://source.android.com/devices/tech/debug/scudo
Link: [2] https://cs.android.com/android/platform/superproject/+/master:bionic/benchmarks/stdlib_benchmark.cpp;l=53;drc=e8693e78711e8f45ccd2b610e4dbe0b94d551cc9
Link: [3] https://github.com/pcc/llvm-project/commit/scudo-mprotect-secondary2
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Kostya Kortchinsky <kostyak@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mprotect.c |   52 ++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 46 insertions(+), 6 deletions(-)

Comments

Linus Torvalds June 29, 2021, 5:50 p.m. UTC | #1
On Mon, Jun 28, 2021 at 7:40 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
>
> -                       /* Avoid taking write faults for known dirty pages */
> -                       if (dirty_accountable && pte_dirty(ptent) &&
> -                                       (pte_soft_dirty(ptent) ||
> -                                        !(vma->vm_flags & VM_SOFTDIRTY))) {
> +                       if (may_avoid_write_fault(ptent, vma, cp_flags))
>                                 ptent = pte_mkwrite(ptent);
> -                       }

Hmm. I don't think this is correct.

As fat as I can tell, may_avoid_write_fault() doesn't even check if
the vma is writable!

Am I misreading it? Because I think you just made even a shared mmap
with "mprotect(PROT_READ)" turn the PTEs writable.

Which is a "slight" security issue.

Maybe the new code is fine, and I'm missing something. The old code
looks strange too, which makes me think that the MM_CP_DIRTY_ACCT test
ends up saving us and depends on VM_WRITE. But it's very much not
obvious.

And even if I _am_ missing something, I really would like a very
obvious and direct test for "this vma is writable", ie maybe a

        if (!(vma->vm_flags & VM_WRITE))
                return false;

at the very top of the function.

And no, "pte_dirty()" is not a reason to make something writable, it
might have started out as a writable mapping, and we dirtied the page,
and we made it read-only. The page stays dirty, but it shouldn't
become writable just because of that.

So please make me get the warm and fuzzies about this code. Because
as-is, it just looks scary.

              Linus
Peter Xu June 30, 2021, 12:12 a.m. UTC | #2
On Tue, Jun 29, 2021 at 10:50:12AM -0700, Linus Torvalds wrote:
> On Mon, Jun 28, 2021 at 7:40 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> >
> > -                       /* Avoid taking write faults for known dirty pages */
> > -                       if (dirty_accountable && pte_dirty(ptent) &&
> > -                                       (pte_soft_dirty(ptent) ||
> > -                                        !(vma->vm_flags & VM_SOFTDIRTY))) {
> > +                       if (may_avoid_write_fault(ptent, vma, cp_flags))
> >                                 ptent = pte_mkwrite(ptent);
> > -                       }
> 
> Hmm. I don't think this is correct.
> 
> As far as I can tell, may_avoid_write_fault() doesn't even check if
> the vma is writable!
> 
> Am I misreading it? Because I think you just made even a shared mmap
> with "mprotect(PROT_READ)" turn the pte's writable.
> 
> Which is a "slight" security issue.
> 
> Maybe the new code is fine, and I'm missing something. The old code
> looks strange too, which makes me think that the MM_CP_DIRTY_ACCT test
> ends up saving us and depend on VM_WRITE. But it's very much not
> obvious.

vma_wants_writenotify() first checks VM_WRITE|VM_SHARED; otherwise
MM_CP_DIRTY_ACCT will not be set.  Meanwhile, for anonymous vmas the newly
introduced may_avoid_write_fault() checks VM_WRITE explicitly.

Agreed that even though it's checked, it's not straightforward.  Maybe it would
be a bonus to have a comment above may_avoid_write_fault() about it in a
follow-up.

> 
> And even if I _am_ missing something, I really would like a very
> obvious and direct test for "this vma is writable", ie maybe a
> 
>         if (!(vma->vm_flags & VM_WRITE))
>                 return false;
> 
> at the very top of the function.

Yes, that looks okay too; I think using the MM_CP_DIRTY_ACCT flag has a slight
advantage in that it checks VM_WRITE only once before calling
change_protection(), rather than doing the check for every pte even when we
know it'll have the same result.  However, it indeed hides the facts deeper.

> 
> And no, "pte_dirty()" is not a reason to make something writable, it
> might have started out as a writable mapping, and we dirtied the page,
> and we made it read-only. The page stays dirty, but it shouldn't
> become writable just because of that.

I think the dirty bit checks are only to make sure we don't need those extra
write faults.  It should definitely be based on the fact that VM_WRITE is
already set.

Thanks,
Peter Xu June 30, 2021, 1:39 a.m. UTC | #3
On Tue, Jun 29, 2021 at 08:12:12PM -0400, Peter Xu wrote:
> On Tue, Jun 29, 2021 at 10:50:12AM -0700, Linus Torvalds wrote:
> > On Mon, Jun 28, 2021 at 7:40 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > >
> > > -                       /* Avoid taking write faults for known dirty pages */
> > > -                       if (dirty_accountable && pte_dirty(ptent) &&
> > > -                                       (pte_soft_dirty(ptent) ||
> > > -                                        !(vma->vm_flags & VM_SOFTDIRTY))) {
> > > +                       if (may_avoid_write_fault(ptent, vma, cp_flags))
> > >                                 ptent = pte_mkwrite(ptent);
> > > -                       }
> > 
> > Hmm. I don't think this is correct.
> > 
> > As far as I can tell, may_avoid_write_fault() doesn't even check if
> > the vma is writable!
> > 
> > Am I misreading it? Because I think you just made even a shared mmap
> > with "mprotect(PROT_READ)" turn the pte's writable.
> > 
> > Which is a "slight" security issue.
> > 
> > Maybe the new code is fine, and I'm missing something. The old code
> > looks strange too, which makes me think that the MM_CP_DIRTY_ACCT test
> > ends up saving us and depend on VM_WRITE. But it's very much not
> > obvious.
> 
> vma_wants_writenotify() checks first VM_WRITE|VM_SHARED, otherwise
> MM_CP_DIRTY_ACCT will not be set.  While for anonymous vmas the newly
> introduced may_avoid_write_fault() checks VM_WRITE explicitly.

Sorry, this statement is unclear.  It's not about anonymous or not; it's just
that a hidden check against VM_WRITE is there already.

Say, below chunk of the patch:

 	if (!(cp_flags & MM_CP_DIRTY_ACCT)) {
 		/* Otherwise, we must have exclusive access to the page. */
 		if (!(vma_is_anonymous(vma) && (vma->vm_flags & VM_WRITE)))
 			return false;
 
 		if (page_count(pte_page(pte)) != 1)
 			return false;
 	}

Should be the same as:

 	if (!(cp_flags & MM_CP_DIRTY_ACCT)) {
 		if (!vma_is_anonymous(vma))
 			return false;

                if (!(vma->vm_flags & VM_WRITE))
                        return false;
 
 		if (page_count(pte_page(pte)) != 1)
 			return false;
 	}

And since MM_CP_DIRTY_ACCT implies "VM_WRITE|VM_SHARED" all set, above should
be a slightly faster version of below:

        /* This just never trigger if MM_CP_DIRTY_ACCT set */
        if (!(vma->vm_flags & VM_WRITE))
                return false;
 
 	if (!(cp_flags & MM_CP_DIRTY_ACCT)) {
 		if (!vma_is_anonymous(vma))
 			return false;

 		if (page_count(pte_page(pte)) != 1)
 			return false;
 	}

It's just that we avoid checking "vma->vm_flags & VM_WRITE" when
MM_CP_DIRTY_ACCT set.

Again, I think in all cases some more comment should be good indeed..

> 
> Agreed even if it's checked it's not straightforward.  Maybe it'll be a bonus
> to have a comment above may_avoid_write_fault() about it in a follow up.
> 
> > 
> > And even if I _am_ missing something, I really would like a very
> > obvious and direct test for "this vma is writable", ie maybe a
> > 
> >         if (!(vma->vm_flags & VM_WRITE))
> >                 return false;
> > 
> > at the very top of the function.
> 
> Yes looks okay too; I think using MM_CP_DIRTY_ACCT flag has a slight advantage
> in that it checks VM_WRITE only once before calling change_protection(), rather
> than doing the check for every pte even if we know it'll have the same result.
> However it indeed hides the facts deeper..
> 
> > 
> > And no, "pte_dirty()" is not a reason to make something writable, it
> > might have started out as a writable mapping, and we dirtied the page,
> > and we made it read-only. The page stays dirty, but it shouldn't
> > become writable just because of that.
> 
> I think the dirty bit checks are only to make sure we don't need those extra
> write faults.  It should definitely be based on the fact that VM_WRITE being
> set already.
> 
> Thanks,
> 
> -- 
> Peter Xu
Linus Torvalds June 30, 2021, 2:25 a.m. UTC | #4
On Tue, Jun 29, 2021 at 6:39 PM Peter Xu <peterx@redhat.com> wrote:
>
> And since MM_CP_DIRTY_ACCT implies "VM_WRITE|VM_SHARED" all set, above should
> be a slightly faster version of below:

That's way too subtle, particularly since the MM_CP_DIRTY_ACCT logic
comes from another file entirely.

I don't think it's even faster, considering that presumably the
anonymous mapping case is the common one, and that's the one that
needs all the extra tests, it's likely better to _not_ test that very
subtle flag at all, and just doing the straightforward and obvious
tests that are understandable _locally_.

So I claim that it's

 (a) not an optimization at all

 (b) completely locally unintuitive and unreadable

> Again, I think in all cases some more comment should be good indeed..

I really want more than a comment. I want that MM_CP_DIRTY_ACCT bit
testing gone.

The only point where it makes sense to check MM_CP_DIRTY_ACCT is
within the context of "is the page already dirty".

So I think the logic should be something along the lines of

 - first:

         if (!(vma->vm_flags & VM_WRITE))
                return false;

   because that logic is set in stone, and true regardless of anything
else. If the vma isn't writable, we're not going to set the write bit.
End of story.

 - then, check the vma_is_anonymous() case:

        if (vma_is_anonymous(vma))
                return page_count(pte_page(pte)) == 1;

     because if it's a writable mapping, and anonymous, then we can
mark it writable if we're the exclusive owners of that page.

 - and THEN we can handle the "ok, shared mapping, now let's start
thinking about dirty accounting" cases.

Make it obvious and correct. This is not a sequence where you should
try to (incorrectly) optimize away individual instructions.

               Linus
Peter Xu June 30, 2021, 4:42 p.m. UTC | #5
On Tue, Jun 29, 2021 at 07:25:42PM -0700, Linus Torvalds wrote:
> On Tue, Jun 29, 2021 at 6:39 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > And since MM_CP_DIRTY_ACCT implies "VM_WRITE|VM_SHARED" all set, above should
> > be a slightly faster version of below:
> 
> That's way too subtle, particularly since the MM_CP_DIRTY_ACCT logic
> comes from another file entirely.
> 
> I don't think it's even faster, considering that presumably the
> anonymous mapping case is the common one, and that's the one that
> needs all the extra tests, it's likely better to _not_ test that very
> subtle flag at all, and just doing the straightforward and obvious
> tests that are understandable _locally_.
> 
> So I claim that it's
> 
>  (a) not an optimization at all
> 
>  (b) completely locally unintuitive and unreadable
> 
> > Again, I think in all cases some more comment should be good indeed..
> 
> I really want more than a comment. I want that MM_CP_DIRTY_ACCT bit
> testing gone.

My understanding is that MM_CP_DIRTY_ACCT contains all the check results from
vma_wants_writenotify(), so if we drop it we'd need something like that to be
checked within change_pte_range(), which is again slower (I have no idea how
slow checking vma->vm_flags & VM_WRITE is, but moving the whole
vma_wants_writenotify() here is definitely even slower).

> 
> The only point where it makes sense to check MM_CP_DIRTY_ACCT is
> within the context of "is the page already dirty".
> 
> So I think the logic should be something along the lines of
> 
>  - first:
> 
>          if (!(vma->vm_flags & VM_WRITE))
>                 return false;
> 
>    because that logic is set in stone, and true regardless of anything
> else. If the vma isn't writable, we're not going to set the write bit.
> End of story.
> 
>  - then, check the vma_is_anonumous() case:
> 
>         if (vma_is_anonymous(vma))
>                 return page_count(pte_page(pte)) == 1;
> 
>      because if it's a writable mapping, and anonymous, then we can
> mark it writable if we're the exclusive owners of that page.

Shouldn't we still at least checks [soft-]dirty bits and uffd-wp bits to make
sure it's either not dirty tracked or uffd wr-protected?  Say, IMHO it's
possible that soft-dirty tracking enabled on this anonymous vma range, then we
still depend on the write bit removed to set the soft-dirty later in the fault
handler.

> 
>  - and THEN we can handle the "ok, shared mapping, now let's start
> thinking about dirty accounting" cases.
> 
> Make it obvious and correct. This is not a sequence where you should
> try to (incorrectly) optimize away individual instructions.

Yes I still fully agree it's very un-obvious.  So far the best thing I can come
up with is something like below (patch attached too but not yet tested). I
moved VM_WRITE out so hopefully it'll be very clear; then I also rearranged the
checks so the final outcome looks like below:

static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
				  unsigned long cp_flags)
{
	/*
	 * It is unclear whether this optimization can be done safely for NUMA
	 * pages.
	 */
	if (cp_flags & MM_CP_PROT_NUMA)
		return false;

	/*
	 * Never apply write bit if VM_WRITE not set.  Note that this is
	 * actually checked for VM_SHARED when MM_CP_DIRTY_ACCT is set, so
	 * logically we only need to check it for !MM_CP_DIRTY_ACCT, but just
	 * make it even more obvious.
	 */
	if (!(vma->vm_flags & VM_WRITE))
		return false;

	/*
	 * Don't do this optimization for clean pages as we need to be notified
	 * of the transition from clean to dirty.
	 */
	if (!pte_dirty(pte))
		return false;

	/* Same for softdirty. */
	if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY))
		return false;

	/*
	 * For userfaultfd the user program needs to monitor write faults so we
	 * can't do this optimization.
	 */
	if (pte_uffd_wp(pte))
		return false;

	/*
	 * MM_CP_DIRTY_ACCT indicates that we can always make the page writable
	 * regardless of the number of references.  Time to set the write bit.
	 */
	if (cp_flags & MM_CP_DIRTY_ACCT)
		return true;

	/*
	 * Otherwise it means !MM_CP_DIRTY_ACCT.  We can only apply write bit
	 * early if it's anonymous page and we exclusively own it.
	 */
	if (vma_is_anonymous(vma) && (page_count(pte_page(pte)) == 1))
		return true;

	/* Don't play any trick */
	return false;
}

The logic should be the same as before; it's just that we'll do an extra check
on VM_WRITE in the MM_CP_DIRTY_ACCT case, which I assume is ok.

Another side note is that I still think the VM_SOFTDIRTY check is wrong in
may_avoid_write_fault() and even in the old code (I mentioned it previously
when reviewing the patch), as !VM_SOFTDIRTY should mean soft dirty tracking
enabled while VM_SOFTDIRTY means disabled.  So I wonder whether it should be:

-       if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY))
+       if (!pte_soft_dirty(pte) && !(vma->vm_flags & VM_SOFTDIRTY))

However I didn't touch it up there as it may need more justification (I feel
it's okay in the old code, as vma_wants_writenotify() actually checks it too,
and in the right way; however after the anonymous fast path it seems prone to
error if it's anonymous; I'll check later).

Thanks,
Linus Torvalds June 30, 2021, 6:03 p.m. UTC | #6
On Wed, Jun 30, 2021 at 9:42 AM Peter Xu <peterx@redhat.com> wrote:
>
> Yes I still fully agree it's very un-obvious.  So far the best thing I can come
> up with is something like below (patch attached too but not yet tested). I
> moved VM_WRITE out so hopefully it'll be very clear; then I also rearranged the
> checks so the final outcome looks like below:
>
> static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
>                                   unsigned long cp_flags)
> {
>         /*
>          * It is unclear whether this optimization can be done safely for NUMA
>          * pages.
>          */
>         if (cp_flags & MM_CP_PROT_NUMA)
>                 return false;

Please just put that VM_WRITE test first. It's the one that really
*matters*. There's no "it's unclear if" about that part. Just handle
the obvious and important check first.

Yeah, yeah, they both return false, so order doesn't matter from a
semantic standpoint, but from a clarity standpoint just do the clear
and unambiguous and security-relevant test first.

The rest of the tests are implementation details, the VM_WRITE test is
fundamental behavior. It's the one that made me worry about this patch
in the first place.

>         /*
>          * Don't do this optimization for clean pages as we need to be notified
>          * of the transition from clean to dirty.
>          */
>         if (!pte_dirty(pte))
>                 return false;
>
>         /* Same for softdirty. */
>         if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY))
>                 return false;
>
>         /*
>          * For userfaultfd the user program needs to monitor write faults so we
>          * can't do this optimization.
>          */
>         if (pte_uffd_wp(pte))
>                 return false;

So all of these are a bit special.

Why? Because if I look at the actual page fault path, these are not
the tests there.

I'd really like to have some obvious situation where we keep this
"make it writable" in sync with what would actually happen on a write
fault when it's not writable.

And it's not at all obvious to me for these cases.

The do_wp_page() code doesn't even use pte_uffd_wp(). It uses
userfaultfd_pte_wp(vma, pte), and I don't even know why. Yes, I can
see the code (it additionally tests the VM_UFFD_WP flag in the vma),
but a number of other paths then only do that pte_uffd_wp() test.

I get the feeling that we really should try to match what the
do_wp_page() path does, though.

Which brings up another issue: the do_wp_page() path treats PageKsm()
pages differently. And it locks the page before double-checking the
page count.

Why does mprotect() not need to do the same thing? I think this has
come up before, and "change_protection()" can get called with the
mmap_sem held just for reading - see userfaultfd - so it has all the
same issues as a page fault does, afaik.

>         /*
>          * MM_CP_DIRTY_ACCT indicates that we can always make the page writable
>          * regardless of the number of references.  Time to set the write bit.
>          */
>         if (cp_flags & MM_CP_DIRTY_ACCT)
>                 return true;
>
>         /*
>          * Otherwise it means !MM_CP_DIRTY_ACCT.  We can only apply write bit
>          * early if it's anonymous page and we exclusively own it.
>          */
>         if (vma_is_anonymous(vma) && (page_count(pte_page(pte)) == 1))
>                 return true;
>
>         /* Don't play any trick */
>         return false;
> }
>
> The logic should be the same as before, it's just that we'll do an extra check
> on VM_WRITE for MM_CP_DIRTY_ACCT but assuming it's ok.

See above. I don't think the logic before was all that clear either.

The one case that is clear is that if it's a shared mapping, and
MM_CP_DIRTY_ACCT is set, and it was already dirty (and softdirty),
then it's ok.

That's the old code.  I don't like how the old code was written
(because I think that MM_CP_DIRTY_ACCT bit was too subtle), but I
think the old code was at least correct.

The new code, it just worries me. It adds all those new cases for when
we can make the page writable early - that's the whole point of the
patch, after all - but my point here is that it's not at all obvious
that those new cases are actually correct.

MAYBE it's all correct. I'm not saying it's wrong. I'm just saying
it's not _obvious_ that it's correct.

What about that page_count() test, for example: it has a comment, it
looks obvious, but it's very different from what do_wp_page() does. So
what happens if we have a page-out at the same time that turns that
page into a swap cache page, and increments the page count? What about
that race? Do we end up with a writable page that is shared with a
swap cache entry? Is that ok? Why isn't it ok in the page fault case?

See why this patch worries me so much?

                       Linus
Peter Xu July 1, 2021, 1:27 a.m. UTC | #7
On Wed, Jun 30, 2021 at 11:03:25AM -0700, Linus Torvalds wrote:
> On Wed, Jun 30, 2021 at 9:42 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > Yes I still fully agree it's very un-obvious.  So far the best thing I can come
> > up with is something like below (patch attached too but not yet tested). I
> > moved VM_WRITE out so hopefully it'll be very clear; then I also rearranged the
> > checks so the final outcome looks like below:
> >
> > static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
> >                                   unsigned long cp_flags)
> > {
> >         /*
> >          * It is unclear whether this optimization can be done safely for NUMA
> >          * pages.
> >          */
> >         if (cp_flags & MM_CP_PROT_NUMA)
> >                 return false;
> 
> Please just put that VM_WRITE test first. It's the one that really
> *matters*. There's no "it's unclear if" about that part. Just handle
> the obvious and important check first.
> 
> Yeah, yeah, they both return false, so order doesn't matter from a
> semantic standpoint, but from a clarity standpoint just do the clear
> and unambiguous and security-relevant test first.
> 
> The rest of the tests are implementation details, the VM_WRITE test is
> fundamental behavior. It's the one that made me worry about this patch
> in the first place.

Sure.

> 
> >         /*
> >          * Don't do this optimization for clean pages as we need to be notified
> >          * of the transition from clean to dirty.
> >          */
> >         if (!pte_dirty(pte))
> >                 return false;
> >
> >         /* Same for softdirty. */
> >         if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY))
> >                 return false;
> >
> >         /*
> >          * For userfaultfd the user program needs to monitor write faults so we
> >          * can't do this optimization.
> >          */
> >         if (pte_uffd_wp(pte))
> >                 return false;
> 
> So all of these are a bit special.
> 
> Why? Because if I look at the actual page fault path, these are not
> the tests there.
> 
> I'd really like to have some obvious situation where we keep this
> "make it writable" in sync with what would actually happen on a write
> fault when it's not writable.
> 
> And it's not at all obvious to me for these cases.
> 
> The do_wp_page() code doesn't even use pte_uffd_wp(). It uses
> userfaultfd_pte_wp(vma, pte), and I don't even know why. Yes, I can
> see the code (it additionally tests the VM_UFFD_WP flag in the vma),
> but a number of other paths then only do that pte_uffd_wp() test.

The vma check is a safety net for cases where, e.g., the vma has already been
unregistered from uffd-wp while uffd-wp bits are left over.  Currently
UFFDIO_UNREGISTER is lazy about removing uffd-wp bits applied to ptes, so even
after a vma is unregistered there could still be ptes with pte_uffd_wp() true.
The vma check makes sure that when this happens the uffd-wp bit will be
auto-removed.

> 
> I get the feeling that we really should try to match what the
> do_wp_page() path does, though.

Makes sense.

> 
> Which brings up another issue: the do_wp_page() path treats PageKsm()
> pages differently. And it locks the page before double-checking the
> page count.
> 
> Why does mprotect() not need to do the same thing? I think this has
> come up before, and "change_protection()" can get called with the
> mmap_sem held just for reading - see userfaultfd - so it has all the
> same issues as a page fault does, afaik.

Good point.  I overlooked KSM when reviewing the patch, while I should really
have looked at do_wp_page() as you suggested (hmm.. the truth is I wasn't even
aware of this patch and never planned to review it, until it broke the
uffd-wp anonymous tests in its initial versions when I was testing mmots...).

Maybe something like this (to be squashed into the previously attached patch):

---8<---
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 3977bfd55f62..7aab30ac9c9f 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -39,12 +39,8 @@
 static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
                                  unsigned long cp_flags)
 {
-       /*
-        * It is unclear whether this optimization can be done safely for NUMA
-        * pages.
-        */
-       if (cp_flags & MM_CP_PROT_NUMA)
-               return false;
+       struct page *page;
+       bool ret = false;

        /*
         * Never apply write bit if VM_WRITE not set.  Note that this is
@@ -55,6 +51,13 @@ static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
        if (!(vma->vm_flags & VM_WRITE))
                return false;

+       /*
+        * It is unclear whether this optimization can be done safely for NUMA
+        * pages.
+        */
+       if (cp_flags & MM_CP_PROT_NUMA)
+               return false;
+
        /*
         * Don't do this optimization for clean pages as we need to be notified
         * of the transition from clean to dirty.
@@ -80,15 +83,22 @@ static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
        if (cp_flags & MM_CP_DIRTY_ACCT)
                return true;

+       page = pte_page(pte);
+       /* Best effort to take page lock, don't play trick if failed */
+       if (!trylock_page(page))
+               return false;
+       /* KSM pages needs COW; leave them be */
+       if (PageKsm(page))
+               goto unlock_fail;
        /*
-        * Othewise it means !MM_CP_DIRTY_ACCT.  We can only apply write bit
-        * early if it's anonymous page and we exclusively own it.
+        * Othewise it means !MM_CP_DIRTY_ACCT and !KSM.  We can only apply
+        * write bit early if it's anonymous page and we exclusively own it.
         */
-       if (vma_is_anonymous(vma) && (page_count(pte_page(pte)) == 1))
-               return true;
-
-       /* Don't play any trick */
-       return false;
+       if (vma_is_anonymous(vma) && (page_count(page) == 1))
+               ret = true;
+unlock_fail:
+       unlock_page(page);
+       return ret;
 }

 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
---8<---

I hope I didn't overlook something else..

Today when I was looking at the ksm code, I got lost on why we say
"PageKsm() doesn't necessarily raise the page refcount", as in do_wp_page().  I
was looking at replace_page() where, afaict, we still do proper refcounting for
the stable nodes with "get_page(kpage)".

I know I must have missed something, but I can't quickly tell what.  In any
case, with the above PageKsm check I think it'll be safe, and it's definitely
clear that the page lock will stabilize PageKsm()'s return value, like in
do_wp_page().

> 
> >         /*
> >          * MM_CP_DIRTY_ACCT indicates that we can always make the page writable
> >          * regardless of the number of references.  Time to set the write bit.
> >          */
> >         if (cp_flags & MM_CP_DIRTY_ACCT)
> >                 return true;
> >
> >         /*
> >          * Otherwise it means !MM_CP_DIRTY_ACCT.  We can only apply the
> >          * write bit early if it's an anonymous page and we exclusively own it.
> >          */
> >         if (vma_is_anonymous(vma) && (page_count(pte_page(pte)) == 1))
> >                 return true;
> >
> >         /* Don't play any trick */
> >         return false;
> > }
> >
> > The logic should be the same as before, it's just that we'll do an extra check
> > on VM_WRITE for MM_CP_DIRTY_ACCT but assuming it's ok.
> 
> See above. I don't think the logic before was all that clear either.
> 
> The one case that is clear is that if it's a shared mapping, and
> MM_CP_DIRTY_ACCT is set, and it was already dirty (and softdirty),
> then it's ok.
> 
> That's the old code.  I don't like how the old code was written
> (because I think that the MM_CP_DIRTY_ACCT bit was too subtle), but I
> think the old code was at least correct.
> 
> The new code, it just worries me. It adds all those new cases for when
> we can make the page writable early - that's the whole point of the
> patch, after all - but my point here is that it's not at all obvious
> that those new cases are actually correct.

Yes, agreed, the MM_CP_DIRTY_ACCT bit is very subtle and not easy to grasp.
It's just that I don't have a good idea for making it better yet..

> 
> MAYBE it's all correct. I'm not saying it's wrong. I'm just saying
> it's not _obvious_ that it's correct.
> 
> What about that page_count() test, for example: it has a comment, it
> looks obvious, but it's very different from what do_wp_page() does. So
> what happens if we have a page-out at the same time that turns that
> page into a swap cache page, and increments the page count? What about
> that race? Do we end up with a writable page that is shared with a
> swap cache entry? Is that ok? Why isn't it ok in the page fault case?

That looks fine to me: when the race happens we must have checked page_count==1
first and granted the write bit, and add_to_swap_cache() happened after the
page_count==1 check (it takes another refcount, so the count would be >= 2
otherwise).  It also means the unmapping of the page can only happen after that
point.  If my above understanding is correct, our newly installed pte will be
zapped safely soon, but of course only after we release the pgtable lock in
change_pte_range().

Thanks,
Linus Torvalds July 1, 2021, 6:29 p.m. UTC | #8
On Wed, Jun 30, 2021 at 6:27 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Jun 30, 2021 at 11:03:25AM -0700, Linus Torvalds wrote:
> >
> > What about that page_count() test, for example: it has a comment, it
> > looks obvious, but it's very different from what do_wp_page() does. So
> > what happens if we have a page-out at the same time that turns that
> > page into a swap cache page, and increments the page count? What about
> > that race? Do we end up with a writable page that is shared with a
> > swap cache entry? Is that ok? Why isn't it ok in the page fault case?
>
> That looks fine to me: when the race happens we must have checked page_count==1
> first and granted the write bit, and add_to_swap_cache() happened after the
> page_count==1 check (it takes another refcount, so the count would be >= 2
> otherwise).  It also means the unmapping of the page can only happen after
> that point.  If my above understanding is correct, our newly installed pte
> will be zapped safely soon, but of course after we release the pgtable lock
> in change_pte_range().

So if this is fine, then maybe we should just remove the page lock in
the do_wp_page() path (and remove the PageKsm() check while at it)?

If it's not required by mprotect() to say "I can make the page
writable directly", then it really shouldn't be required by the page
fault path either.

Which I'd love to do, and was really itching to do (it's a nasty
lock), but I worried about it..

I'd hate to have mprotect do one thing, and page faulting do another
thing, and not have some logic to why they have to be different.

                  Linus
Peter Xu July 6, 2021, 1:24 a.m. UTC | #9
(sorry for a very late reply)

On Thu, Jul 01, 2021 at 11:29:50AM -0700, Linus Torvalds wrote:
> On Wed, Jun 30, 2021 at 6:27 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Jun 30, 2021 at 11:03:25AM -0700, Linus Torvalds wrote:
> > >
> > > What about that page_count() test, for example: it has a comment, it
> > > looks obvious, but it's very different from what do_wp_page() does. So
> > > what happens if we have a page-out at the same time that turns that
> > > page into a swap cache page, and increments the page count? What about
> > > that race? Do we end up with a writable page that is shared with a
> > > swap cache entry? Is that ok? Why isn't it ok in the page fault case?
> >
> > That looks fine to me: when the race happens we must have checked page_count==1
> > first and granted the write bit, and add_to_swap_cache() happened after the
> > page_count==1 check (it takes another refcount, so the count would be >= 2
> > otherwise).  It also means the unmapping of the page can only happen after
> > that point.  If my above understanding is correct, our newly installed pte
> > will be zapped safely soon, but of course after we release the pgtable lock
> > in change_pte_range().
> 
> So if this is fine, then maybe we should just remove the page lock in
> the do_wp_page() path (and remove the PageKSM check while at it)?

I could be wrong, but I thought the page lock in do_wp_page() is more for the
PageKsm() race (e.g., to make sure we don't grant write to a page that is
becoming a ksm page in parallel).

> 
> If it's not required by mprotect() to say "I can make the page
> writable directly", then it really shouldn't be required by the page
> fault path either.
> 
> Which I'd love to do, and was really itching to do (it's a nasty
> lock), but I worried about it..
> 
> I'd hate to have mprotect do one thing, and page faulting do another
> thing, and not have some logic to why they have to be different.

Agreed; perhaps they need not be identical - I think the mprotect path can be
even stricter than the fault path, as it's a fast path only: it should never
apply the write bit when the page fault path wouldn't.  So I think the original
patch does need a justification for why it didn't handle KSM pages while
do_wp_page() does.

Thanks,

Patch

--- a/mm/mprotect.c~mm-improve-mprotectrw-efficiency-on-pages-referenced-once
+++ a/mm/mprotect.c
@@ -35,6 +35,51 @@ 
 
 #include "internal.h"
 
+/* Determine whether we can avoid taking write faults for known dirty pages. */
+static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
+				  unsigned long cp_flags)
+{
+	/*
+	 * The dirty accountable bit indicates that we can always make the page
+	 * writable regardless of the number of references.
+	 */
+	if (!(cp_flags & MM_CP_DIRTY_ACCT)) {
+		/* Otherwise, we must have exclusive access to the page. */
+		if (!(vma_is_anonymous(vma) && (vma->vm_flags & VM_WRITE)))
+			return false;
+
+		if (page_count(pte_page(pte)) != 1)
+			return false;
+	}
+
+	/*
+	 * Don't do this optimization for clean pages as we need to be notified
+	 * of the transition from clean to dirty.
+	 */
+	if (!pte_dirty(pte))
+		return false;
+
+	/* Same for softdirty. */
+	if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY))
+		return false;
+
+	/*
+	 * For userfaultfd the user program needs to monitor write faults so we
+	 * can't do this optimization.
+	 */
+	if (pte_uffd_wp(pte))
+		return false;
+
+	/*
+	 * It is unclear whether this optimization can be done safely for NUMA
+	 * pages.
+	 */
+	if (cp_flags & MM_CP_PROT_NUMA)
+		return false;
+
+	return true;
+}
+
 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		unsigned long cp_flags)
@@ -43,7 +88,6 @@  static unsigned long change_pte_range(st
 	spinlock_t *ptl;
 	unsigned long pages = 0;
 	int target_node = NUMA_NO_NODE;
-	bool dirty_accountable = cp_flags & MM_CP_DIRTY_ACCT;
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
@@ -131,12 +175,8 @@  static unsigned long change_pte_range(st
 				ptent = pte_clear_uffd_wp(ptent);
 			}
 
-			/* Avoid taking write faults for known dirty pages */
-			if (dirty_accountable && pte_dirty(ptent) &&
-					(pte_soft_dirty(ptent) ||
-					 !(vma->vm_flags & VM_SOFTDIRTY))) {
+			if (may_avoid_write_fault(ptent, vma, cp_flags))
 				ptent = pte_mkwrite(ptent);
-			}
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
 			pages++;
 		} else if (is_swap_pte(oldpte)) {