
[v2] mm/mprotect: try avoiding write faults for exclusive anonymous pages when changing protection

Message ID 20220610181436.84713-1-david@redhat.com (mailing list archive)
State New
Series [v2] mm/mprotect: try avoiding write faults for exclusive anonymous pages when changing protection

Commit Message

David Hildenbrand June 10, 2022, 6:14 p.m. UTC
Similar to our MM_CP_DIRTY_ACCT handling for shared, writable mappings, we
can try mapping anonymous pages writable if they are exclusive,
the PTE is already dirty, and no special handling applies. Mapping the
PTE writable is essentially the same thing the write fault handler would do
in this case.

Special handling is required for uffd-wp and softdirty tracking, so take
care of that properly. Also, leave PROT_NONE handling alone for now;
in the future, we could similarly extend the logic in do_numa_page() or
use pte_mk_savedwrite() here. Note that we'll now also check for uffd-wp in
case of VM_SHARED -- which is harmless and prepares for uffd-wp support for
shmem.

While this improves mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE)
performance, it should also be a valuable optimization for uffd-wp, when
un-protecting.

Applying the same logic to PMDs (anonymous THP, anonymous hugetlb) is
probably not worth the trouble, but could similarly be added if there is
demand.

Results of a simple microbenchmark on my Ryzen 9 3900X, comparing the new
optimization (avoiding write faults) during mprotect() with softdirty
tracking, where we require a write fault.
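
The measured sequence corresponds roughly to the following minimal sketch
(for illustration only -- this is not the benchmark used for the numbers
below, and the buffer size and timing are simplified):

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/mman.h>

static long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
	const size_t size = 4096;	/* one of the measured sizes */
	long long start, end;
	char *buf;

	buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* First write access: populates the anonymous pages and dirties them. */
	memset(buf, 1, size);

	/* Drop write permission and restore it again. */
	mprotect(buf, size, PROT_READ);
	mprotect(buf, size, PROT_READ | PROT_WRITE);

	/*
	 * Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
	 * with the optimization, the PTEs were mapped writable again during
	 * mprotect(), so ideally no write fault is taken here.
	 */
	start = now_ns();
	memset(buf, 2, size);
	end = now_ns();
	printf("%lld ns (buf[0]=%d)\n", end - start, buf[0]);

	munmap(buf, size);
	return 0;
}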

  Running 1000 iterations each

  ==========================================================
  Measuring memset() of 4096 bytes
   First write access:
    Min: 169 ns, Max: 8997 ns, Avg: 830 ns
   Second write access:
    Min: 80 ns, Max: 251 ns, Avg: 168 ns
   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
    Min: 180 ns, Max: 290 ns, Avg: 190 ns
   Write access after clearing softdirty:
    Min: 451 ns, Max: 1774 ns, Avg: 470 ns
  -> mprotect = 1.131 * second [avg]
  -> mprotect = 0.404 * softdirty [avg]
  ----------------------------------------------------------
  Measuring single byte access per page of 4096 bytes
   First write access:
    Min: 761 ns, Max: 1152 ns, Avg: 784 ns
   Second write access:
    Min: 130 ns, Max: 181 ns, Avg: 137 ns
   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
    Min: 150 ns, Max: 1553 ns, Avg: 155 ns
   Write access after clearing softdirty:
    Min: 169 ns, Max: 1783 ns, Avg: 432 ns
  -> mprotect = 1.131 * second [avg]
  -> mprotect = 0.359 * softdirty [avg]
  ==========================================================
  Measuring memset() of 16384 bytes
   First write access:
    Min: 1594 ns, Max: 3497 ns, Avg: 2143 ns
   Second write access:
    Min: 250 ns, Max: 381 ns, Avg: 260 ns
   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
    Min: 290 ns, Max: 1643 ns, Avg: 300 ns
   Write access after clearing softdirty:
    Min: 1242 ns, Max: 8987 ns, Avg: 1297 ns
  -> mprotect = 1.154 * second [avg]
  -> mprotect = 0.231 * softdirty [avg]
  ----------------------------------------------------------
  Measuring single byte access per page of 16384 bytes
   First write access:
    Min: 1953 ns, Max: 2945 ns, Avg: 2008 ns
   Second write access:
    Min: 130 ns, Max: 912 ns, Avg: 142 ns
   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
    Min: 160 ns, Max: 240 ns, Avg: 166 ns
   Write access after clearing softdirty:
    Min: 1112 ns, Max: 1513 ns, Avg: 1126 ns
  -> mprotect = 1.169 * second [avg]
  -> mprotect = 0.147 * softdirty [avg]
  ==========================================================
  Measuring memset() of 65536 bytes
   First write access:
    Min: 7524 ns, Max: 15650 ns, Avg: 7680 ns
   Second write access:
    Min: 251 ns, Max: 1323 ns, Avg: 648 ns
   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
    Min: 270 ns, Max: 1282 ns, Avg: 736 ns
   Write access after clearing softdirty:
    Min: 4558 ns, Max: 12524 ns, Avg: 4623 ns
  -> mprotect = 1.136 * second [avg]
  -> mprotect = 0.159 * softdirty [avg]
  ----------------------------------------------------------
  Measuring single byte access per page of 65536 bytes
   First write access:
    Min: 7083 ns, Max: 9027 ns, Avg: 7241 ns
   Second write access:
    Min: 140 ns, Max: 201 ns, Avg: 156 ns
   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
    Min: 190 ns, Max: 451 ns, Avg: 197 ns
   Write access after clearing softdirty:
    Min: 3707 ns, Max: 5119 ns, Avg: 3958 ns
  -> mprotect = 1.263 * second [avg]
  -> mprotect = 0.050 * softdirty [avg]
  ==========================================================
  Measuring memset() of 524288 bytes
   First write access:
    Min: 58470 ns, Max: 87754 ns, Avg: 59353 ns
   Second write access:
    Min: 5180 ns, Max: 6863 ns, Avg: 5318 ns
   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
    Min: 5871 ns, Max: 9358 ns, Avg: 6028 ns
   Write access after clearing softdirty:
    Min: 35797 ns, Max: 41338 ns, Avg: 36710 ns
  -> mprotect = 1.134 * second [avg]
  -> mprotect = 0.164 * softdirty [avg]
  ----------------------------------------------------------
  Measuring single byte access per page of 524288 bytes
   First write access:
    Min: 53751 ns, Max: 59431 ns, Avg: 54506 ns
   Second write access:
    Min: 781 ns, Max: 2194 ns, Avg: 1123 ns
   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
    Min: 161 ns, Max: 1282 ns, Avg: 622 ns
   Write access after clearing softdirty:
    Min: 30888 ns, Max: 34565 ns, Avg: 31229 ns
  -> mprotect = 0.554 * second [avg]
  -> mprotect = 0.020 * softdirty [avg]

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: David Hildenbrand <david@redhat.com>
---

v1 -> v2:
* Rebased on v5.19-rc1
* Rerun benchmark
* Fix minor spelling issues in subject+description
* Drop IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) check
* Move pte_write() check into caller

---
 mm/mprotect.c | 67 ++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 55 insertions(+), 12 deletions(-)


base-commit: f2906aa863381afb0015a9eb7fefad885d4e5a56

Comments

Nadav Amit June 10, 2022, 6:42 p.m. UTC | #1
On Jun 10, 2022, at 11:14 AM, David Hildenbrand <david@redhat.com> wrote:

> Similar to our MM_CP_DIRTY_ACCT handling for shared, writable mappings, we
> can try mapping anonymous pages writable if they are exclusive,
> the PTE is already dirty, and no special handling applies. Mapping the
> PTE writable is essentially the same thing the write fault handler would do
> in this case.
> 
> Special handling is required for uffd-wp and softdirty tracking, so take
> care of that properly. Also, leave PROT_NONE handling alone for now;
> in the future, we could similarly extend the logic in do_numa_page() or
> use pte_mk_savedwrite() here. Note that we'll now also check for uffd-wp in
> case of VM_SHARED -- which is harmless and prepares for uffd-wp support for
> shmem.
> 
> While this improves mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE)
> performance, it should also be a valuable optimization for uffd-wp, when
> un-protecting.
> 
> Applying the same logic to PMDs (anonymous THP, anonymous hugetlb) is
> probably not worth the trouble, but could similarly be added if there is
> demand.
> 
> Results of a simple microbenchmark on my Ryzen 9 3900X, comparing the new
> optimization (avoiding write faults) during mprotect() with softdirty
> tracking, where we require a write fault.
> 
>  Running 1000 iterations each
> 
>  ==========================================================
>  Measuring memset() of 4096 bytes
>   First write access:
>    Min: 169 ns, Max: 8997 ns, Avg: 830 ns
>   Second write access:
>    Min: 80 ns, Max: 251 ns, Avg: 168 ns
>   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
>    Min: 180 ns, Max: 290 ns, Avg: 190 ns
>   Write access after clearing softdirty:
>    Min: 451 ns, Max: 1774 ns, Avg: 470 ns
>  -> mprotect = 1.131 * second [avg]
>  -> mprotect = 0.404 * softdirty [avg]
>  ----------------------------------------------------------
>  Measuring single byte access per page of 4096 bytes
>   First write access:
>    Min: 761 ns, Max: 1152 ns, Avg: 784 ns
>   Second write access:
>    Min: 130 ns, Max: 181 ns, Avg: 137 ns
>   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
>    Min: 150 ns, Max: 1553 ns, Avg: 155 ns
>   Write access after clearing softdirty:
>    Min: 169 ns, Max: 1783 ns, Avg: 432 ns
>  -> mprotect = 1.131 * second [avg]
>  -> mprotect = 0.359 * softdirty [avg]
>  ==========================================================
>  Measuring memset() of 16384 bytes
>   First write access:
>    Min: 1594 ns, Max: 3497 ns, Avg: 2143 ns
>   Second write access:
>    Min: 250 ns, Max: 381 ns, Avg: 260 ns
>   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
>    Min: 290 ns, Max: 1643 ns, Avg: 300 ns
>   Write access after clearing softdirty:
>    Min: 1242 ns, Max: 8987 ns, Avg: 1297 ns
>  -> mprotect = 1.154 * second [avg]
>  -> mprotect = 0.231 * softdirty [avg]
>  ----------------------------------------------------------
>  Measuring single byte access per page of 16384 bytes
>   First write access:
>    Min: 1953 ns, Max: 2945 ns, Avg: 2008 ns
>   Second write access:
>    Min: 130 ns, Max: 912 ns, Avg: 142 ns
>   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
>    Min: 160 ns, Max: 240 ns, Avg: 166 ns
>   Write access after clearing softdirty:
>    Min: 1112 ns, Max: 1513 ns, Avg: 1126 ns
>  -> mprotect = 1.169 * second [avg]
>  -> mprotect = 0.147 * softdirty [avg]
>  ==========================================================
>  Measuring memset() of 65536 bytes
>   First write access:
>    Min: 7524 ns, Max: 15650 ns, Avg: 7680 ns
>   Second write access:
>    Min: 251 ns, Max: 1323 ns, Avg: 648 ns
>   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
>    Min: 270 ns, Max: 1282 ns, Avg: 736 ns
>   Write access after clearing softdirty:
>    Min: 4558 ns, Max: 12524 ns, Avg: 4623 ns
>  -> mprotect = 1.136 * second [avg]
>  -> mprotect = 0.159 * softdirty [avg]
>  ----------------------------------------------------------
>  Measuring single byte access per page of 65536 bytes
>   First write access:
>    Min: 7083 ns, Max: 9027 ns, Avg: 7241 ns
>   Second write access:
>    Min: 140 ns, Max: 201 ns, Avg: 156 ns
>   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
>    Min: 190 ns, Max: 451 ns, Avg: 197 ns
>   Write access after clearing softdirty:
>    Min: 3707 ns, Max: 5119 ns, Avg: 3958 ns
>  -> mprotect = 1.263 * second [avg]
>  -> mprotect = 0.050 * softdirty [avg]
>  ==========================================================
>  Measuring memset() of 524288 bytes
>   First write access:
>    Min: 58470 ns, Max: 87754 ns, Avg: 59353 ns
>   Second write access:
>    Min: 5180 ns, Max: 6863 ns, Avg: 5318 ns
>   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
>    Min: 5871 ns, Max: 9358 ns, Avg: 6028 ns
>   Write access after clearing softdirty:
>    Min: 35797 ns, Max: 41338 ns, Avg: 36710 ns
>  -> mprotect = 1.134 * second [avg]
>  -> mprotect = 0.164 * softdirty [avg]
>  ----------------------------------------------------------
>  Measuring single byte access per page of 524288 bytes
>   First write access:
>    Min: 53751 ns, Max: 59431 ns, Avg: 54506 ns
>   Second write access:
>    Min: 781 ns, Max: 2194 ns, Avg: 1123 ns
>   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
>    Min: 161 ns, Max: 1282 ns, Avg: 622 ns
>   Write access after clearing softdirty:
>    Min: 30888 ns, Max: 34565 ns, Avg: 31229 ns
>  -> mprotect = 0.554 * second [avg]
>  -> mprotect = 0.020 * softdirty [avg]
> 
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Nadav Amit <nadav.amit@gmail.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> 
> v1 -> v2:
> * Rebased on v5.19-rc1
> * Rerun benchmark
> * Fix minor spelling issues in subject+description
> * Drop IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) check
> * Move pte_write() check into caller
> 
> ---
> mm/mprotect.c | 67 ++++++++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 55 insertions(+), 12 deletions(-)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index ba5592655ee3..728772bf41c7 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -38,6 +38,45 @@
> 
> #include "internal.h"
> 
> +static inline bool can_change_pte_writable(struct vm_area_struct *vma,
> +					   unsigned long addr, pte_t pte,
> +					   unsigned long cp_flags)
> +{
> +	struct page *page;
> +
> +	if ((vma->vm_flags & VM_SHARED) && !(cp_flags & MM_CP_DIRTY_ACCT))
> +		/*
> +		 * MM_CP_DIRTY_ACCT is only expressive for shared mappings;
> +		 * without MM_CP_DIRTY_ACCT, there is nothing to do.
> +		 */
> +		return false;
> +
> +	if (pte_protnone(pte) || !pte_dirty(pte))
> +		return false;
> +
> +	/* Do we need write faults for softdirty tracking? */
> +	if ((vma->vm_flags & VM_SOFTDIRTY) && !pte_soft_dirty(pte))
> +		return false;
> +
> +	/* Do we need write faults for uffd-wp tracking? */
> +	if (userfaultfd_pte_wp(vma, pte))
> +		return false;
> +
> +	if (!(vma->vm_flags & VM_SHARED)) {
> +		/*
> +		 * We can only special-case on exclusive anonymous pages,
> +		 * because we know that our write-fault handler similarly would
> +		 * map them writable without any additional checks while holding
> +		 * the PT lock.
> +		 */
> +		page = vm_normal_page(vma, addr, pte);
> +		if (!page || !PageAnon(page) || !PageAnonExclusive(page))
> +			return false;
> +	}
> +
> +	return true;
> +}
> +

Looks good in general. Just wondering (out loud) whether it makes more sense
to do all the vm_flags and cp_flags related checks in one of the callers
(mprotect_fixup()?) and propagate whether to try to write-unprotect in
cp_flags (e.g., by introducing new MM_CP_TRY_WRITE_UNPROTECT).
Peter Xu June 13, 2022, 7:43 p.m. UTC | #2
Hi, David,

On Fri, Jun 10, 2022 at 11:42:06AM -0700, Nadav Amit wrote:
> On Jun 10, 2022, at 11:14 AM, David Hildenbrand <david@redhat.com> wrote:
> 
> > Similar to our MM_CP_DIRTY_ACCT handling for shared, writable mappings, we
> > can try mapping anonymous pages writable if they are exclusive,
> > the PTE is already dirty, and no special handling applies. Mapping the
> > PTE writable is essentially the same thing the write fault handler would do
> > in this case.
> > 
> > Special handling is required for uffd-wp and softdirty tracking, so take
> > care of that properly. Also, leave PROT_NONE handling alone for now;
> > in the future, we could similarly extend the logic in do_numa_page() or
> > use pte_mk_savedwrite() here. Note that we'll now also check for uffd-wp in
> > case of VM_SHARED -- which is harmless and prepares for uffd-wp support for
> > shmem.
> > 
> > While this improves mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE)
> > performance, it should also be a valuable optimization for uffd-wp, when
> > un-protecting.
> > 
> > Applying the same logic to PMDs (anonymous THP, anonymous hugetlb) is
> > probably not worth the trouble, but could similarly be added if there is
> > demand.

My memory was that Andrea had a version that used to have thp optimized
too.  It'd be a slight pity to lose it if it's still possible to keep; that
way we'd treat thp and small pages with the same logic and be fair to both.
Or is there any other
challenge that we're facing?

> > 
> > Results of a simple microbenchmark on my Ryzen 9 3900X, comparing the new
> > optimization (avoiding write faults) during mprotect() with softdirty
> > tracking, where we require a write fault.

Are we comparing the mprotect() sequence against the softdirty clearing
operation?  Would it make more sense to compare the same mprotect()
sequence on kernels before and after this patch is applied?

> > 
> >  Running 1000 iterations each
> > 
> >  ==========================================================
> >  Measuring memset() of 4096 bytes
> >   First write access:
> >    Min: 169 ns, Max: 8997 ns, Avg: 830 ns
> >   Second write access:
> >    Min: 80 ns, Max: 251 ns, Avg: 168 ns
> >   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
> >    Min: 180 ns, Max: 290 ns, Avg: 190 ns
> >   Write access after clearing softdirty:
> >    Min: 451 ns, Max: 1774 ns, Avg: 470 ns
> >  -> mprotect = 1.131 * second [avg]
> >  -> mprotect = 0.404 * softdirty [avg]

(I don't understand these two lines.. but maybe I'm the only one?)

> >  ----------------------------------------------------------
> >  Measuring single byte access per page of 4096 bytes
> >   First write access:
> >    Min: 761 ns, Max: 1152 ns, Avg: 784 ns
> >   Second write access:
> >    Min: 130 ns, Max: 181 ns, Avg: 137 ns
> >   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
> >    Min: 150 ns, Max: 1553 ns, Avg: 155 ns
> >   Write access after clearing softdirty:
> >    Min: 169 ns, Max: 1783 ns, Avg: 432 ns
> >  -> mprotect = 1.131 * second [avg]
> >  -> mprotect = 0.359 * softdirty [avg]
> >  ==========================================================
> >  Measuring memset() of 16384 bytes
> >   First write access:
> >    Min: 1594 ns, Max: 3497 ns, Avg: 2143 ns
> >   Second write access:
> >    Min: 250 ns, Max: 381 ns, Avg: 260 ns
> >   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
> >    Min: 290 ns, Max: 1643 ns, Avg: 300 ns
> >   Write access after clearing softdirty:
> >    Min: 1242 ns, Max: 8987 ns, Avg: 1297 ns
> >  -> mprotect = 1.154 * second [avg]
> >  -> mprotect = 0.231 * softdirty [avg]
> >  ----------------------------------------------------------
> >  Measuring single byte access per page of 16384 bytes
> >   First write access:
> >    Min: 1953 ns, Max: 2945 ns, Avg: 2008 ns
> >   Second write access:
> >    Min: 130 ns, Max: 912 ns, Avg: 142 ns
> >   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
> >    Min: 160 ns, Max: 240 ns, Avg: 166 ns
> >   Write access after clearing softdirty:
> >    Min: 1112 ns, Max: 1513 ns, Avg: 1126 ns
> >  -> mprotect = 1.169 * second [avg]
> >  -> mprotect = 0.147 * softdirty [avg]
> >  ==========================================================
> >  Measuring memset() of 65536 bytes
> >   First write access:
> >    Min: 7524 ns, Max: 15650 ns, Avg: 7680 ns
> >   Second write access:
> >    Min: 251 ns, Max: 1323 ns, Avg: 648 ns
> >   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
> >    Min: 270 ns, Max: 1282 ns, Avg: 736 ns
> >   Write access after clearing softdirty:
> >    Min: 4558 ns, Max: 12524 ns, Avg: 4623 ns
> >  -> mprotect = 1.136 * second [avg]
> >  -> mprotect = 0.159 * softdirty [avg]
> >  ----------------------------------------------------------
> >  Measuring single byte access per page of 65536 bytes
> >   First write access:
> >    Min: 7083 ns, Max: 9027 ns, Avg: 7241 ns
> >   Second write access:
> >    Min: 140 ns, Max: 201 ns, Avg: 156 ns
> >   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
> >    Min: 190 ns, Max: 451 ns, Avg: 197 ns
> >   Write access after clearing softdirty:
> >    Min: 3707 ns, Max: 5119 ns, Avg: 3958 ns
> >  -> mprotect = 1.263 * second [avg]
> >  -> mprotect = 0.050 * softdirty [avg]
> >  ==========================================================
> >  Measuring memset() of 524288 bytes
> >   First write access:
> >    Min: 58470 ns, Max: 87754 ns, Avg: 59353 ns
> >   Second write access:
> >    Min: 5180 ns, Max: 6863 ns, Avg: 5318 ns
> >   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
> >    Min: 5871 ns, Max: 9358 ns, Avg: 6028 ns
> >   Write access after clearing softdirty:
> >    Min: 35797 ns, Max: 41338 ns, Avg: 36710 ns
> >  -> mprotect = 1.134 * second [avg]
> >  -> mprotect = 0.164 * softdirty [avg]
> >  ----------------------------------------------------------
> >  Measuring single byte access per page of 524288 bytes
> >   First write access:
> >    Min: 53751 ns, Max: 59431 ns, Avg: 54506 ns
> >   Second write access:
> >    Min: 781 ns, Max: 2194 ns, Avg: 1123 ns
> >   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
> >    Min: 161 ns, Max: 1282 ns, Avg: 622 ns
> >   Write access after clearing softdirty:
> >    Min: 30888 ns, Max: 34565 ns, Avg: 31229 ns
> >  -> mprotect = 0.554 * second [avg]
> >  -> mprotect = 0.020 * softdirty [avg]
> > 
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Nadav Amit <nadav.amit@gmail.com>
> > Cc: Dave Hansen <dave.hansen@intel.com>
> > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Yang Shi <shy828301@gmail.com>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Mel Gorman <mgorman@techsingularity.net>
> > Signed-off-by: David Hildenbrand <david@redhat.com>
> > ---
> > 
> > v1 -> v2:
> > * Rebased on v5.19-rc1
> > * Rerun benchmark
> > * Fix minor spelling issues in subject+description
> > * Drop IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) check
> > * Move pte_write() check into caller
> > 
> > ---
> > mm/mprotect.c | 67 ++++++++++++++++++++++++++++++++++++++++++---------
> > 1 file changed, 55 insertions(+), 12 deletions(-)
> > 
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index ba5592655ee3..728772bf41c7 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -38,6 +38,45 @@
> > 
> > #include "internal.h"
> > 
> > +static inline bool can_change_pte_writable(struct vm_area_struct *vma,
> > +					   unsigned long addr, pte_t pte,
> > +					   unsigned long cp_flags)
> > +{
> > +	struct page *page;
> > +
> > +	if ((vma->vm_flags & VM_SHARED) && !(cp_flags & MM_CP_DIRTY_ACCT))
> > +		/*
> > +		 * MM_CP_DIRTY_ACCT is only expressive for shared mappings;
> > +		 * without MM_CP_DIRTY_ACCT, there is nothing to do.
> > +		 */
> > +		return false;
> > +
> > +	if (pte_protnone(pte) || !pte_dirty(pte))
> > +		return false;
> > +
> > +	/* Do we need write faults for softdirty tracking? */
> > +	if ((vma->vm_flags & VM_SOFTDIRTY) && !pte_soft_dirty(pte))
> > +		return false;
> > +
> > +	/* Do we need write faults for uffd-wp tracking? */
> > +	if (userfaultfd_pte_wp(vma, pte))
> > +		return false;
> > +
> > +	if (!(vma->vm_flags & VM_SHARED)) {
> > +		/*
> > +		 * We can only special-case on exclusive anonymous pages,
> > +		 * because we know that our write-fault handler similarly would
> > +		 * map them writable without any additional checks while holding
> > +		 * the PT lock.
> > +		 */
> > +		page = vm_normal_page(vma, addr, pte);
> > +		if (!page || !PageAnon(page) || !PageAnonExclusive(page))
> > +			return false;
> > +	}
> > +
> > +	return true;
> > +}
> > +
> 
> Looks good in general. Just wondering (out loud) whether it makes more sense
> to do all the vm_flags and cp_flags related checks in one of the callers
> (mprotect_fixup()?) and propagate whether to try to write-unprotect in
> cp_flags (e.g., by introducing new MM_CP_TRY_WRITE_UNPROTECT).

I can see why David put it like that, because most of the checks are on
ptes not vm_flags.

But I also agree on this point, especially if we put it another way:
IMHO it'll be confusing if we keep MM_CP_DIRTY_ACCT==false for all private
pages even if we're going to take them into account and do smart unprotect
operations.

So I'm wondering whether we could still at least move vm_flags check into
the mprotect_fixup() as suggested by Nadav, perhaps something like:

diff --git a/mm/mprotect.c b/mm/mprotect.c
index ba5592655ee3..aefd5fe982af 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -583,7 +583,11 @@ mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
         * held in write mode.
         */
        vma->vm_flags = newflags;
-       dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
+       if (vma->vm_flags & VM_SHARED)
+               dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
+       else
+               /* For private mappings, only if it's writable */
+               dirty_accountable = vma->vm_flags & VM_WRITE;
        vma_set_page_prot(vma);
 
        change_protection(tlb, vma, start, end, vma->vm_page_prot,

Then IIUC we could drop both the VM_WRITE check in change_pte_range(), and
also the VM_SHARED check above in can_change_pte_writable().  Not sure
whether that'll look slightly cleaner.

I'm also copying Peter Collingbourne <pcc@google.com> because afaict he
proposed the initial idea (maybe worth some credit in the commit message?),
and IIRC he has a good use case of it too.

Thanks,
David Hildenbrand June 13, 2022, 8:10 p.m. UTC | #3
On 13.06.22 21:43, Peter Xu wrote:
> Hi, David,
> 
> On Fri, Jun 10, 2022 at 11:42:06AM -0700, Nadav Amit wrote:
>> On Jun 10, 2022, at 11:14 AM, David Hildenbrand <david@redhat.com> wrote:
>>
>>> Similar to our MM_CP_DIRTY_ACCT handling for shared, writable mappings, we
>>> can try mapping anonymous pages writable if they are exclusive,
>>> the PTE is already dirty, and no special handling applies. Mapping the
>>> PTE writable is essentially the same thing the write fault handler would do
>>> in this case.
>>>
>>> Special handling is required for uffd-wp and softdirty tracking, so take
>>> care of that properly. Also, leave PROT_NONE handling alone for now;
>>> in the future, we could similarly extend the logic in do_numa_page() or
>>> use pte_mk_savedwrite() here. Note that we'll now also check for uffd-wp in
>>> case of VM_SHARED -- which is harmless and prepares for uffd-wp support for
>>> shmem.
>>>
>>> While this improves mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE)
>>> performance, it should also be a valuable optimization for uffd-wp, when
>>> un-protecting.
>>>
>>> Applying the same logic to PMDs (anonymous THP, anonymous hugetlb) is
>>> probably not worth the trouble, but could similarly be added if there is
>>> demand.
> 
> My memory was that Andrea had a version that used to have thp optimized
> too.  It'd be a slight pity to lose it if it's still possible to keep; that
> way we'd treat thp and small pages with the same logic and be fair to both.
> Or is there any other
> challenge that we're facing?

Not really, but I assume performance gain will be minimal and might not
be worth the trouble.

I'm fairly busy (and not aware of Andrea's version), so I can look at
this, but it will be part of a separate patch because it will go on my
TODO list. Not mad if someone beats me to it ;)

> 
>>>
>>> Results of a simple microbenchmark on my Ryzen 9 3900X, comparing the new
>>> optimization (avoiding write faults) during mprotect() with softdirty
>>> tracking, where we require a write fault.
> 
> Are we comparing the mprotect() sequence against the softdirty clearing
> operation?  Would it make more sense to compare the same mprotect()
> sequence on kernels before and after this patch is applied?

For simplicity I compared on the same kernel, one time exploiting the
optimization and one time disabling the optimization via softdirty.

I can also simply measure without+with. Extra work for me to combine
outputs :P

> 
>>>
>>>  Running 1000 iterations each
>>>
>>>  ==========================================================
>>>  Measuring memset() of 4096 bytes
>>>   First write access:
>>>    Min: 169 ns, Max: 8997 ns, Avg: 830 ns
>>>   Second write access:
>>>    Min: 80 ns, Max: 251 ns, Avg: 168 ns
>>>   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
>>>    Min: 180 ns, Max: 290 ns, Avg: 190 ns
>>>   Write access after clearing softdirty:
>>>    Min: 451 ns, Max: 1774 ns, Avg: 470 ns
>>>  -> mprotect = 1.131 * second [avg]
>>>  -> mprotect = 0.404 * softdirty [avg]
> 
> (I don't understand these two lines.. but maybe I'm the only one?)

Most probably not.

"mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE)" needs 113.1% of the
runtime compared with the "second write" access.

"mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE)" needs 40% of the
runtime compared with disabling the optimization via softdirty tracking.
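
Concretely, using the 4096-byte memset averages above: 190 ns / 168 ns ~= 1.13
and 190 ns / 470 ns ~= 0.40, which is where the two ratio lines come from.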

I may find time to clean that up a bit more to make it easier to consume
for humans.

> 
>>>  ----------------------------------------------------------
>>>  Measuring single byte access per page of 4096 bytes
>>>   First write access:
>>>    Min: 761 ns, Max: 1152 ns, Avg: 784 ns
>>>   Second write access:
>>>    Min: 130 ns, Max: 181 ns, Avg: 137 ns
>>>   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
>>>    Min: 150 ns, Max: 1553 ns, Avg: 155 ns
>>>   Write access after clearing softdirty:
>>>    Min: 169 ns, Max: 1783 ns, Avg: 432 ns
>>>  -> mprotect = 1.131 * second [avg]
>>>  -> mprotect = 0.359 * softdirty [avg]
>>>  ==========================================================
>>>  Measuring memset() of 16384 bytes
>>>   First write access:
>>>    Min: 1594 ns, Max: 3497 ns, Avg: 2143 ns
>>>   Second write access:
>>>    Min: 250 ns, Max: 381 ns, Avg: 260 ns
>>>   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
>>>    Min: 290 ns, Max: 1643 ns, Avg: 300 ns
>>>   Write access after clearing softdirty:
>>>    Min: 1242 ns, Max: 8987 ns, Avg: 1297 ns
>>>  -> mprotect = 1.154 * second [avg]
>>>  -> mprotect = 0.231 * softdirty [avg]
>>>  ----------------------------------------------------------
>>>  Measuring single byte access per page of 16384 bytes
>>>   First write access:
>>>    Min: 1953 ns, Max: 2945 ns, Avg: 2008 ns
>>>   Second write access:
>>>    Min: 130 ns, Max: 912 ns, Avg: 142 ns
>>>   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
>>>    Min: 160 ns, Max: 240 ns, Avg: 166 ns
>>>   Write access after clearing softdirty:
>>>    Min: 1112 ns, Max: 1513 ns, Avg: 1126 ns
>>>  -> mprotect = 1.169 * second [avg]
>>>  -> mprotect = 0.147 * softdirty [avg]
>>>  ==========================================================
>>>  Measuring memset() of 65536 bytes
>>>   First write access:
>>>    Min: 7524 ns, Max: 15650 ns, Avg: 7680 ns
>>>   Second write access:
>>>    Min: 251 ns, Max: 1323 ns, Avg: 648 ns
>>>   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
>>>    Min: 270 ns, Max: 1282 ns, Avg: 736 ns
>>>   Write access after clearing softdirty:
>>>    Min: 4558 ns, Max: 12524 ns, Avg: 4623 ns
>>>  -> mprotect = 1.136 * second [avg]
>>>  -> mprotect = 0.159 * softdirty [avg]
>>>  ----------------------------------------------------------
>>>  Measuring single byte access per page of 65536 bytes
>>>   First write access:
>>>    Min: 7083 ns, Max: 9027 ns, Avg: 7241 ns
>>>   Second write access:
>>>    Min: 140 ns, Max: 201 ns, Avg: 156 ns
>>>   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
>>>    Min: 190 ns, Max: 451 ns, Avg: 197 ns
>>>   Write access after clearing softdirty:
>>>    Min: 3707 ns, Max: 5119 ns, Avg: 3958 ns
>>>  -> mprotect = 1.263 * second [avg]
>>>  -> mprotect = 0.050 * softdirty [avg]
>>>  ==========================================================
>>>  Measuring memset() of 524288 bytes
>>>   First write access:
>>>    Min: 58470 ns, Max: 87754 ns, Avg: 59353 ns
>>>   Second write access:
>>>    Min: 5180 ns, Max: 6863 ns, Avg: 5318 ns
>>>   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
>>>    Min: 5871 ns, Max: 9358 ns, Avg: 6028 ns
>>>   Write access after clearing softdirty:
>>>    Min: 35797 ns, Max: 41338 ns, Avg: 36710 ns
>>>  -> mprotect = 1.134 * second [avg]
>>>  -> mprotect = 0.164 * softdirty [avg]
>>>  ----------------------------------------------------------
>>>  Measuring single byte access per page of 524288 bytes
>>>   First write access:
>>>    Min: 53751 ns, Max: 59431 ns, Avg: 54506 ns
>>>   Second write access:
>>>    Min: 781 ns, Max: 2194 ns, Avg: 1123 ns
>>>   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
>>>    Min: 161 ns, Max: 1282 ns, Avg: 622 ns
>>>   Write access after clearing softdirty:
>>>    Min: 30888 ns, Max: 34565 ns, Avg: 31229 ns
>>>  -> mprotect = 0.554 * second [avg]
>>>  -> mprotect = 0.020 * softdirty [avg]
>>>
>>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Nadav Amit <nadav.amit@gmail.com>
>>> Cc: Dave Hansen <dave.hansen@intel.com>
>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>> Cc: Peter Xu <peterx@redhat.com>
>>> Cc: Yang Shi <shy828301@gmail.com>
>>> Cc: Hugh Dickins <hughd@google.com>
>>> Cc: Mel Gorman <mgorman@techsingularity.net>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>
>>> v1 -> v2:
>>> * Rebased on v5.19-rc1
>>> * Rerun benchmark
>>> * Fix minor spelling issues in subject+description
>>> * Drop IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) check
>>> * Move pte_write() check into caller
>>>
>>> ---
>>> mm/mprotect.c | 67 ++++++++++++++++++++++++++++++++++++++++++---------
>>> 1 file changed, 55 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index ba5592655ee3..728772bf41c7 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -38,6 +38,45 @@
>>>
>>> #include "internal.h"
>>>
>>> +static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>>> +					   unsigned long addr, pte_t pte,
>>> +					   unsigned long cp_flags)
>>> +{
>>> +	struct page *page;
>>> +
>>> +	if ((vma->vm_flags & VM_SHARED) && !(cp_flags & MM_CP_DIRTY_ACCT))
>>> +		/*
>>> +		 * MM_CP_DIRTY_ACCT is only expressive for shared mappings;
>>> +		 * without MM_CP_DIRTY_ACCT, there is nothing to do.
>>> +		 */
>>> +		return false;
>>> +
>>> +	if (pte_protnone(pte) || !pte_dirty(pte))
>>> +		return false;
>>> +
>>> +	/* Do we need write faults for softdirty tracking? */
>>> +	if ((vma->vm_flags & VM_SOFTDIRTY) && !pte_soft_dirty(pte))
>>> +		return false;
>>> +
>>> +	/* Do we need write faults for uffd-wp tracking? */
>>> +	if (userfaultfd_pte_wp(vma, pte))
>>> +		return false;
>>> +
>>> +	if (!(vma->vm_flags & VM_SHARED)) {
>>> +		/*
>>> +		 * We can only special-case on exclusive anonymous pages,
>>> +		 * because we know that our write-fault handler similarly would
>>> +		 * map them writable without any additional checks while holding
>>> +		 * the PT lock.
>>> +		 */
>>> +		page = vm_normal_page(vma, addr, pte);
>>> +		if (!page || !PageAnon(page) || !PageAnonExclusive(page))
>>> +			return false;
>>> +	}
>>> +
>>> +	return true;
>>> +}
>>> +
>>
>> Looks good in general. Just wondering (out loud) whether it makes more sense
>> to do all the vm_flags and cp_flags related checks in one of the callers
>> (mprotect_fixup()?) and propagate whether to try to write-unprotect in
>> cp_flags (e.g., by introducing new MM_CP_TRY_WRITE_UNPROTECT).
> 
> I can see why David put it like that, because most of the checks are on
> ptes not vm_flags.
> 
> But I also agree on this point, especially if we put it another way:
> IMHO it'll be confusing if we keep MM_CP_DIRTY_ACCT==false for all private
> pages even if we're going to take them into account and do smart unprotect
> operations.
> 
> So I'm wondering whether we could still at least move vm_flags check into
> the mprotect_fixup() as suggested by Nadav, perhaps something like:
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index ba5592655ee3..aefd5fe982af 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -583,7 +583,11 @@ mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
>          * held in write mode.
>          */
>         vma->vm_flags = newflags;
> -       dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
> +       if (vma->vm_flags & VM_SHARED)
> +               dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
> +       else
> +               /* For private mappings, only if it's writable */
> +               dirty_accountable = vma->vm_flags & VM_WRITE;
>         vma_set_page_prot(vma);
>  
>         change_protection(tlb, vma, start, end, vma->vm_page_prot,
> 
> Then IIUC we could drop both the VM_WRITE check in change_pte_range(), and
> also the VM_SHARED check above in can_change_pte_writable().  Not sure
> whether that'll look slightly cleaner.

I'll give it a shot and most probably rename dirty_accountable to
something more expressive -- like Nadav proposed, for example.

> 
> I'm also copying Peter Collingbourne <pcc@google.com> because afaict he
> proposed the initial idea (maybe worth some credit in the commit message?),

Do you have a link to that conversation? Either my memory is messing
with me or I did this without reading that mail (which I think, because
it simply made sense with PageAnonExclusive at hand). Still, I can add a
reference to that mail and mention that this was suggested earlier by
Peter C..


Thanks!
Peter Xu June 13, 2022, 9:22 p.m. UTC | #4
On Mon, Jun 13, 2022 at 10:10:16PM +0200, David Hildenbrand wrote:
> On 13.06.22 21:43, Peter Xu wrote:
> > Hi, David,
> > 
> > On Fri, Jun 10, 2022 at 11:42:06AM -0700, Nadav Amit wrote:
> >> On Jun 10, 2022, at 11:14 AM, David Hildenbrand <david@redhat.com> wrote:
> >>
> >>> Similar to our MM_CP_DIRTY_ACCT handling for shared, writable mappings, we
> >>> can try mapping anonymous pages writable if they are exclusive,
> >>> the PTE is already dirty, and no special handling applies. Mapping the
> >>> PTE writable is essentially the same thing the write fault handler would do
> >>> in this case.
> >>>
> >>> Special handling is required for uffd-wp and softdirty tracking, so take
> >>> care of that properly. Also, leave PROT_NONE handling alone for now;
> >>> in the future, we could similarly extend the logic in do_numa_page() or
> >>> use pte_mk_savedwrite() here. Note that we'll now also check for uffd-wp in
> >>> case of VM_SHARED -- which is harmless and prepares for uffd-wp support for
> >>> shmem.
> >>>
> >>> While this improves mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE)
> >>> performance, it should also be a valuable optimization for uffd-wp, when
> >>> un-protecting.
> >>>
> >>> Applying the same logic to PMDs (anonymous THP, anonymous hugetlb) is
> >>> probably not worth the trouble, but could similarly be added if there is
> >>> demand.
> > 
> > My memory was that Andrea had a version that used to have thp optimized
> > too.  It'd be a slight pity to lose it if it's still possible to keep; that
> > way we'd treat thp and small pages with the same logic and be fair to both.
> > Or is there any other
> > challenge that we're facing?
> 
> Not really, but I assume performance gain will be minimal and might not
> be worth the trouble.
> 
> I'm fairly busy (and not aware of Andrea's version), so I can look at
> this, but it will be part of a separate patch because it will go on my
> TODO list. Not mad if someone beats me to it ;)

Just for the reference:

https://github.com/aagit/aa/commit/34cd0d78db407af06d35a06b24be8e92593964be

> 
> > 
> >>>
> >>> Results of a simple microbenchmark on my Ryzen 9 3900X, comparing the new
> >>> optimization (avoiding write faults) during mprotect() with softdirty
> >>> tracking, where we require a write fault.
> > 
> > Are we comparing the mprotect() sequence against the softdirty clearing
> > operation?  Would it make more sense to compare the same mprotect()
> > sequence on kernels before and after this patch is applied?
> 
> For simplicity I compared on the same kernel, one time exploiting the
> optimization and one time disabling the optimization via softdirty.
> 
> I can also simply measure without+with. Extra work for me to combine
> outputs :P

Well, still that's normally how we work on these, don't we? :)

Still note that the SOFTDIRTY check (I think) is still inverted..  I meant
I kept thinking the check below, "vma->vm_flags & VM_SOFTDIRTY", should be
"!(vma->vm_flags & VM_SOFTDIRTY)", but again that's a separate change, so feel
free to ignore as we've discussed; just please make sure that's taken into
account even if you want to compare with softdirty.

> 
> > 
> >>>
> >>>  Running 1000 iterations each
> >>>
> >>>  ==========================================================
> >>>  Measuring memset() of 4096 bytes
> >>>   First write access:
> >>>    Min: 169 ns, Max: 8997 ns, Avg: 830 ns
> >>>   Second write access:
> >>>    Min: 80 ns, Max: 251 ns, Avg: 168 ns
> >>>   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
> >>>    Min: 180 ns, Max: 290 ns, Avg: 190 ns
> >>>   Write access after clearing softdirty:
> >>>    Min: 451 ns, Max: 1774 ns, Avg: 470 ns
> >>>  -> mprotect = 1.131 * second [avg]
> >>>  -> mprotect = 0.404 * softdirty [avg]
> > 
> > (I don't understand these two lines.. but maybe I'm the only one?)
> 
> Most probably not.
> 
> "mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE)" needs 113.1% of the
> runtime compared with the "second write" access.
> 
> "mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE)" needs 40% of the
> runtime compared with disabling the optimization via softdirty tracking.
> 
> I may find time to clean that up a bit more to make it easier to consume
> for humans.

I see, thanks.  Appending the explanation after the test result will also
work for me.

I'm curious whether that 113.1% came from TLB misses?  If that's the case, I'd
suggest dropping those comparisons if there's a new version, since they're
probably not helping to explain what this patch is changing (avoiding page
faults), and IMHO it can slightly confuse reviewers, if you agree.

> 
> > 
> >>>  ----------------------------------------------------------
> >>>  Measuring single byte access per page of 4096 bytes
> >>>   First write access:
> >>>    Min: 761 ns, Max: 1152 ns, Avg: 784 ns
> >>>   Second write access:
> >>>    Min: 130 ns, Max: 181 ns, Avg: 137 ns
> >>>   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
> >>>    Min: 150 ns, Max: 1553 ns, Avg: 155 ns
> >>>   Write access after clearing softdirty:
> >>>    Min: 169 ns, Max: 1783 ns, Avg: 432 ns
> >>>  -> mprotect = 1.131 * second [avg]
> >>>  -> mprotect = 0.359 * softdirty [avg]
> >>>  ==========================================================
> >>>  Measuring memset() of 16384 bytes
> >>>   First write access:
> >>>    Min: 1594 ns, Max: 3497 ns, Avg: 2143 ns
> >>>   Second write access:
> >>>    Min: 250 ns, Max: 381 ns, Avg: 260 ns
> >>>   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
> >>>    Min: 290 ns, Max: 1643 ns, Avg: 300 ns
> >>>   Write access after clearing softdirty:
> >>>    Min: 1242 ns, Max: 8987 ns, Avg: 1297 ns
> >>>  -> mprotect = 1.154 * second [avg]
> >>>  -> mprotect = 0.231 * softdirty [avg]
> >>>  ----------------------------------------------------------
> >>>  Measuring single byte access per page of 16384 bytes
> >>>   First write access:
> >>>    Min: 1953 ns, Max: 2945 ns, Avg: 2008 ns
> >>>   Second write access:
> >>>    Min: 130 ns, Max: 912 ns, Avg: 142 ns
> >>>   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
> >>>    Min: 160 ns, Max: 240 ns, Avg: 166 ns
> >>>   Write access after clearing softdirty:
> >>>    Min: 1112 ns, Max: 1513 ns, Avg: 1126 ns
> >>>  -> mprotect = 1.169 * second [avg]
> >>>  -> mprotect = 0.147 * softdirty [avg]
> >>>  ==========================================================
> >>>  Measuring memset() of 65536 bytes
> >>>   First write access:
> >>>    Min: 7524 ns, Max: 15650 ns, Avg: 7680 ns
> >>>   Second write access:
> >>>    Min: 251 ns, Max: 1323 ns, Avg: 648 ns
> >>>   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
> >>>    Min: 270 ns, Max: 1282 ns, Avg: 736 ns
> >>>   Write access after clearing softdirty:
> >>>    Min: 4558 ns, Max: 12524 ns, Avg: 4623 ns
> >>>  -> mprotect = 1.136 * second [avg]
> >>>  -> mprotect = 0.159 * softdirty [avg]
> >>>  ----------------------------------------------------------
> >>>  Measuring single byte access per page of 65536 bytes
> >>>   First write access:
> >>>    Min: 7083 ns, Max: 9027 ns, Avg: 7241 ns
> >>>   Second write access:
> >>>    Min: 140 ns, Max: 201 ns, Avg: 156 ns
> >>>   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
> >>>    Min: 190 ns, Max: 451 ns, Avg: 197 ns
> >>>   Write access after clearing softdirty:
> >>>    Min: 3707 ns, Max: 5119 ns, Avg: 3958 ns
> >>>  -> mprotect = 1.263 * second [avg]
> >>>  -> mprotect = 0.050 * softdirty [avg]
> >>>  ==========================================================
> >>>  Measuring memset() of 524288 bytes
> >>>   First write access:
> >>>    Min: 58470 ns, Max: 87754 ns, Avg: 59353 ns
> >>>   Second write access:
> >>>    Min: 5180 ns, Max: 6863 ns, Avg: 5318 ns
> >>>   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
> >>>    Min: 5871 ns, Max: 9358 ns, Avg: 6028 ns
> >>>   Write access after clearing softdirty:
> >>>    Min: 35797 ns, Max: 41338 ns, Avg: 36710 ns
> >>>  -> mprotect = 1.134 * second [avg]
> >>>  -> mprotect = 0.164 * softdirty [avg]
> >>>  ----------------------------------------------------------
> >>>  Measuring single byte access per page of 524288 bytes
> >>>   First write access:
> >>>    Min: 53751 ns, Max: 59431 ns, Avg: 54506 ns
> >>>   Second write access:
> >>>    Min: 781 ns, Max: 2194 ns, Avg: 1123 ns
> >>>   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
> >>>    Min: 161 ns, Max: 1282 ns, Avg: 622 ns
> >>>   Write access after clearing softdirty:
> >>>    Min: 30888 ns, Max: 34565 ns, Avg: 31229 ns
> >>>  -> mprotect = 0.554 * second [avg]
> >>>  -> mprotect = 0.020 * softdirty [avg]
> >>>
> >>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> >>> Cc: Andrew Morton <akpm@linux-foundation.org>
> >>> Cc: Nadav Amit <nadav.amit@gmail.com>
> >>> Cc: Dave Hansen <dave.hansen@intel.com>
> >>> Cc: Andrea Arcangeli <aarcange@redhat.com>
> >>> Cc: Peter Xu <peterx@redhat.com>
> >>> Cc: Yang Shi <shy828301@gmail.com>
> >>> Cc: Hugh Dickins <hughd@google.com>
> >>> Cc: Mel Gorman <mgorman@techsingularity.net>
> >>> Signed-off-by: David Hildenbrand <david@redhat.com>
> >>> ---
> >>>
> >>> v1 -> v2:
> >>> * Rebased on v5.19-rc1
> >>> * Rerun benchmark
> >>> * Fix minor spelling issues in subject+description
> >>> * Drop IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) check
> >>> * Move pte_write() check into caller
> >>>
> >>> ---
> >>> mm/mprotect.c | 67 ++++++++++++++++++++++++++++++++++++++++++---------
> >>> 1 file changed, 55 insertions(+), 12 deletions(-)
> >>>
> >>> diff --git a/mm/mprotect.c b/mm/mprotect.c
> >>> index ba5592655ee3..728772bf41c7 100644
> >>> --- a/mm/mprotect.c
> >>> +++ b/mm/mprotect.c
> >>> @@ -38,6 +38,45 @@
> >>>
> >>> #include "internal.h"
> >>>
> >>> +static inline bool can_change_pte_writable(struct vm_area_struct *vma,
> >>> +					   unsigned long addr, pte_t pte,
> >>> +					   unsigned long cp_flags)
> >>> +{
> >>> +	struct page *page;
> >>> +
> >>> +	if ((vma->vm_flags & VM_SHARED) && !(cp_flags & MM_CP_DIRTY_ACCT))
> >>> +		/*
> >>> +		 * MM_CP_DIRTY_ACCT is only expressive for shared mappings;
> >>> +		 * without MM_CP_DIRTY_ACCT, there is nothing to do.
> >>> +		 */
> >>> +		return false;
> >>> +
> >>> +	if (pte_protnone(pte) || !pte_dirty(pte))
> >>> +		return false;
> >>> +
> >>> +	/* Do we need write faults for softdirty tracking? */
> >>> +	if ((vma->vm_flags & VM_SOFTDIRTY) && !pte_soft_dirty(pte))
> >>> +		return false;
> >>> +
> >>> +	/* Do we need write faults for uffd-wp tracking? */
> >>> +	if (userfaultfd_pte_wp(vma, pte))
> >>> +		return false;
> >>> +
> >>> +	if (!(vma->vm_flags & VM_SHARED)) {
> >>> +		/*
> >>> +		 * We can only special-case on exclusive anonymous pages,
> >>> +		 * because we know that our write-fault handler similarly would
> >>> +		 * map them writable without any additional checks while holding
> >>> +		 * the PT lock.
> >>> +		 */
> >>> +		page = vm_normal_page(vma, addr, pte);
> >>> +		if (!page || !PageAnon(page) || !PageAnonExclusive(page))
> >>> +			return false;
> >>> +	}
> >>> +
> >>> +	return true;
> >>> +}
> >>> +
> >>
> >> Looks good in general. Just wondering (out loud) whether it makes more sense
> >> to do all the vm_flags and cp_flags related checks in one of the callers
> >> (mprotect_fixup()?) and propagate whether to try to write-unprotect in
> >> cp_flags (e.g., by introducing new MM_CP_TRY_WRITE_UNPROTECT).
> > 
> > I can see why David put it like that, because most of the checks are on
> > ptes not vm_flags.
> > 
> > But I also agree on this point, especially if we put it another way:
> > IMHO it'll be confusing if we keep MM_CP_DIRTY_ACCT==false for all private
> > pages even if we're going to take them into account and do smart unprotect
> > operations.
> > 
> > So I'm wondering whether we could still at least move vm_flags check into
> > the mprotect_fixup() as suggested by Nadav, perhaps something like:
> > 
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index ba5592655ee3..aefd5fe982af 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -583,7 +583,11 @@ mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >          * held in write mode.
> >          */
> >         vma->vm_flags = newflags;
> > -       dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
> > +       if (vma->vm_flags & VM_SHARED)
> > +               dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
> > +       else
> > +               /* For private mappings, only if it's writable */
> > +               dirty_accountable = vma->vm_flags & VM_WRITE;
> >         vma_set_page_prot(vma);
> >  
> >         change_protection(tlb, vma, start, end, vma->vm_page_prot,
> > 
> > Then IIUC we could drop both the VM_WRITE check in change_pte_range(), and
> > also the VM_SHARED check above in can_change_pte_writable().  Not sure
> > whether that'll look slightly cleaner.
> 
> I'll give it a shot and most probably rename dirty_accountable to
> something more expressive -- like Nadav proposed, for example.

Sure.

> 
> > 
> > I'm also copying Peter Collingbourne <pcc@google.com> because afaict he
> > proposed the initial idea (maybe worth some credit in the commit message?),
> 
> Do you have a link to that conversation? Either my memory is messing
> with me or I did this without reading that mail (which I think, because
> it simply made sense with PageAnonExclusive at hand). Still, I can add a
> reference to that mail and mention that this was suggested earlier by
> Peter C..

I see, no worries then; I thought it was coming from that.  In this case I'm
not sure whether it's still needed.

PeterC's v1 was here:

https://lore.kernel.org/linux-mm/20201212053152.3783250-1-pcc@google.com/

But there're a bunch of versions:

https://lore.kernel.org/linux-mm/?q=mm%3A+improve+mprotect%28R%7CW%29+efficiency+on+pages+referenced+once

Thanks,
David Hildenbrand June 14, 2022, 7:50 a.m. UTC | #5
>> Not really, but I assume performance gain will be minimal and might not
>> be worth the trouble.
>>
>> I'm fairly busy (and not aware of Andrea's version), so I can look at
>> this, but it will be part of a separate patch because it will go on my
>> TODO list. Not mad if someone beats me to it ;)
> 
> Just for the reference:
> 
> https://github.com/aagit/aa/commit/34cd0d78db407af06d35a06b24be8e92593964be

Thanks for that reference!

> 
>>
>>>
>>>>>
>>>>> Results of a simple microbenchmark on my Ryzen 9 3900X, comparing the new
>>>>> optimization (avoiding write faults) during mprotect() with softdirty
>>>>> tracking, where we require a write fault.
>>>
>>> Are we comparing the mprotect() sequence against the softdirty clearing
>>> operation?  Would it make more sense to compare the same mprotect()
>>> sequence on kernels before and after this patch is applied?
>>
>> For simplicity I compared on the same kernel, one time exploiting the
>> optimization and one time disabling the optimization via softdirty.
>>
>> I can also simply measure without+with. Extra work for me to combine
>> outputs :P
> 
> Well, still that's normally how we work on these, don't we? :)
> 
> Still note that the SOFTDIRTY check (I think) is still inverted..  I meant
> I kept thinking the check below, "vma->vm_flags & VM_SOFTDIRTY", should be
> "!(vma->vm_flags & VM_SOFTDIRTY)", but again that's a separate change, so feel
> free to ignore as we've discussed; just please make sure that's taken into
> account even if you want to compare with softdirty.

I wrapped my head around that softdirty check *a lot* and ended up
figuring out that it unintuitively is correct. But maybe I ended up
confusing myself. Anyhow, this patch merely moves that check.

> 
>>
>>>
>>>>>
>>>>>  Running 1000 iterations each
>>>>>
>>>>>  ==========================================================
>>>>>  Measuring memset() of 4096 bytes
>>>>>   First write access:
>>>>>    Min: 169 ns, Max: 8997 ns, Avg: 830 ns
>>>>>   Second write access:
>>>>>    Min: 80 ns, Max: 251 ns, Avg: 168 ns
>>>>>   Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
>>>>>    Min: 180 ns, Max: 290 ns, Avg: 190 ns
>>>>>   Write access after clearing softdirty:
>>>>>    Min: 451 ns, Max: 1774 ns, Avg: 470 ns
>>>>>  -> mprotect = 1.131 * second [avg]
>>>>>  -> mprotect = 0.404 * softdirty [avg]
>>>
>>> (I don't understand these two lines.. but maybe I'm the only one?)
>>
>> Most probably not.
>>
>> "mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE)" needs 113.1% of the
>> runtime compared with the "second write" access.
>>
>> "mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE)" needs 40% of the
>> runtime compared with disabling the optimization via softdirty tracking.
>>
>> I may find time to clean that up a bit more to make it easier to consume
>> for humans.
> 
> I see, thanks.  Appending the explanation after the test result will also
> work for me.
> 
> I'm curious whether that 113.1% came from TLB misses?  If that's the case, I'd
> suggest dropping those comparisons if there's a new version, since they're
> probably not helping to explain what this patch is changing (avoiding page
> faults), and IMHO it can slightly confuse reviewers, if you agree.

As we seem to have easier benchmarks from Andrea and Peter, I can just
reuse these and make my life easier :)

[...]

>>>> Looks good in general. Just wondering (out loud) whether it makes more sense
>>>> to do all the vm_flags and cp_flags related checks in one of the callers
>>>> (mprotect_fixup()?) and propagate whether to try to write-unprotect in
>>>> cp_flags (e.g., by introducing new MM_CP_TRY_WRITE_UNPROTECT).
>>>
>>> I can see why David put it like that, because most of the checks are on
>>> ptes not vm_flags.
>>>
>>> But I also agree on this point, especially if we put it another way:
>>> IMHO it'll be confusing if we keep MM_CP_DIRTY_ACCT==false for all private
>>> pages even if we're going to take them into account and do smart unprotect
>>> operations.
>>>
>>> So I'm wondering whether we could still at least move vm_flags check into
>>> the mprotect_fixup() as suggested by Nadav, perhaps something like:
>>>
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index ba5592655ee3..aefd5fe982af 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -583,7 +583,11 @@ mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>>          * held in write mode.
>>>          */
>>>         vma->vm_flags = newflags;
>>> -       dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
>>> +       if (vma->vm_flags & VM_SHARED)
>>> +               dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
>>> +       else
>>> +               /* For private mappings, only if it's writable */
>>> +               dirty_accountable = vma->vm_flags & VM_WRITE;
>>>         vma_set_page_prot(vma);
>>>  
>>>         change_protection(tlb, vma, start, end, vma->vm_page_prot,
>>>
>>> Then IIUC we could drop both the VM_WRITE check in change_pte_range(), and
>>> also the VM_SHARED check above in can_change_pte_writable().  Not sure
>>> whether that'll look slightly cleaner.
>>
>> I'll give it a shot and most probably rename dirty_accountable to
>> something more expressive -- like Nadav proposed, for example.
> 
> Sure.
> 
>>
>>>
>>> I'm also copying Peter Collingbourne <pcc@google.com> because afaict he
>>> proposed the initial idea (maybe worth some credit in the commit message?),
>>
>> Do you have a link to that conversation? Either my memory is messing
>> with me or I did this without reading that mail (which I think, because
>> it simply made sense with PageAnonExclusive at hand). Still, I can add a
>> reference to that mail and mention that this was suggested earlier by
>> Peter C..
> 
> I see, no worries then; I thought it was coming from that.  In this case I'm
> not sure whether it's still needed.
> 
> PeterC's v1 was here:
> 
> https://lore.kernel.org/linux-mm/20201212053152.3783250-1-pcc@google.com/
> 
> But there're a bunch of versions:
> 
> https://lore.kernel.org/linux-mm/?q=mm%3A+improve+mprotect%28R%7CW%29+efficiency+on+pages+referenced+once

No idea why I missed that completely, as I tend to read linux-mm a lot.
Thanks for the pointers!

Patch

diff --git a/mm/mprotect.c b/mm/mprotect.c
index ba5592655ee3..728772bf41c7 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -38,6 +38,45 @@ 
 
 #include "internal.h"
 
+static inline bool can_change_pte_writable(struct vm_area_struct *vma,
+					   unsigned long addr, pte_t pte,
+					   unsigned long cp_flags)
+{
+	struct page *page;
+
+	if ((vma->vm_flags & VM_SHARED) && !(cp_flags & MM_CP_DIRTY_ACCT))
+		/*
+		 * MM_CP_DIRTY_ACCT is only expressive for shared mappings;
+		 * without MM_CP_DIRTY_ACCT, there is nothing to do.
+		 */
+		return false;
+
+	if (pte_protnone(pte) || !pte_dirty(pte))
+		return false;
+
+	/* Do we need write faults for softdirty tracking? */
+	if ((vma->vm_flags & VM_SOFTDIRTY) && !pte_soft_dirty(pte))
+		return false;
+
+	/* Do we need write faults for uffd-wp tracking? */
+	if (userfaultfd_pte_wp(vma, pte))
+		return false;
+
+	if (!(vma->vm_flags & VM_SHARED)) {
+		/*
+		 * We can only special-case on exclusive anonymous pages,
+		 * because we know that our write-fault handler similarly would
+		 * map them writable without any additional checks while holding
+		 * the PT lock.
+		 */
+		page = vm_normal_page(vma, addr, pte);
+		if (!page || !PageAnon(page) || !PageAnonExclusive(page))
+			return false;
+	}
+
+	return true;
+}
+
 static unsigned long change_pte_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
@@ -46,7 +85,6 @@  static unsigned long change_pte_range(struct mmu_gather *tlb,
 	spinlock_t *ptl;
 	unsigned long pages = 0;
 	int target_node = NUMA_NO_NODE;
-	bool dirty_accountable = cp_flags & MM_CP_DIRTY_ACCT;
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
@@ -137,21 +175,26 @@  static unsigned long change_pte_range(struct mmu_gather *tlb,
 				ptent = pte_wrprotect(ptent);
 				ptent = pte_mkuffd_wp(ptent);
 			} else if (uffd_wp_resolve) {
-				/*
-				 * Leave the write bit to be handled
-				 * by PF interrupt handler, then
-				 * things like COW could be properly
-				 * handled.
-				 */
 				ptent = pte_clear_uffd_wp(ptent);
 			}
 
-			/* Avoid taking write faults for known dirty pages */
-			if (dirty_accountable && pte_dirty(ptent) &&
-					(pte_soft_dirty(ptent) ||
-					 !(vma->vm_flags & VM_SOFTDIRTY))) {
+			/*
+			 * In some writable, shared mappings, we might want
+			 * to catch actual write access -- see
+			 * vma_wants_writenotify().
+			 *
+			 * In all writable, private mappings, we have to
+			 * properly handle COW.
+			 *
+			 * In both cases, we can sometimes still map PTEs
+			 * writable and avoid the write-fault handler, for
+			 * example, if the PTE is already dirty and no other
+			 * COW or special handling is required.
+			 */
+			if ((vma->vm_flags & VM_WRITE) && !pte_write(ptent) &&
+			    can_change_pte_writable(vma, addr, ptent, cp_flags))
 				ptent = pte_mkwrite(ptent);
-			}
+
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
 			if (pte_needs_flush(oldpte, ptent))
 				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
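
One simple way to observe the avoided write faults from userspace is to count
minor faults around the write that follows the mprotect() pair. The sketch
below is only illustrative (it is not part of the patch or of the benchmark
above); with the patch applied, the fault delta for dirty, exclusive anonymous
pages should stay at (or close to) 0, while without it roughly one minor fault
per touched page is expected:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long minflt(void)
{
	struct rusage ru;

	getrusage(RUSAGE_SELF, &ru);
	return ru.ru_minflt;
}

int main(void)
{
	const size_t size = 16 * 4096;
	long before, after;
	char *buf;

	buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* Populate and dirty the anonymous pages. */
	memset(buf, 1, size);

	/* Drop write permission and restore it again. */
	mprotect(buf, size, PROT_READ);
	mprotect(buf, size, PROT_READ | PROT_WRITE);

	before = minflt();
	memset(buf, 2, size);		/* write after the mprotect() pair */
	after = minflt();

	printf("minor faults for the write: %ld\n", after - before);

	munmap(buf, size);
	return 0;
}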