
[v2,2/5] mm: avoid unnecessary flush on change_huge_pmd()

Message ID 20211021122112.592634-3-namit@vmware.com (mailing list archive)
State: New
Series: mm/mprotect: avoid unnecessary TLB flushes

Commit Message

Nadav Amit Oct. 21, 2021, 12:21 p.m. UTC
From: Nadav Amit <namit@vmware.com>

Calls to change_protection_range() on THP can trigger, at least on x86,
two TLB flushes for one page: one immediately, when pmdp_invalidate() is
called by change_huge_pmd(), and then another one later (that can be
batched) when change_protection_range() finishes.
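
Roughly, the pre-patch flow looks as follows (a simplified sketch of the
call chain, with intermediate page-table levels elided; not verbatim
kernel code):

	change_protection_range()
	  change_huge_pmd()
	    pmdp_invalidate()
	      flush_pmd_tlb_range()	/* flush #1, immediate */
	  ...
	flush_tlb_range()		/* flush #2, may be batched */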

The first TLB flush is only necessary to prevent the dirty bit (and, to a
lesser extent, the access bit) from changing while the PTE is modified.
However, this flush is unnecessary, as x86 CPUs set the dirty bit atomically
with an additional check that the PTE is (still) present. One caveat is
Intel's Knights Landing, which has a bug and does not do so.

Leverage this behavior to eliminate the unnecessary TLB flush in
change_huge_pmd(). Introduce a new arch-specific pmdp_invalidate_ad()
that only prevents the access and dirty bits from being further updated
by hardware.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: x86@kernel.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/pgtable.h | 8 ++++++++
 include/linux/pgtable.h        | 5 +++++
 mm/huge_memory.c               | 7 ++++---
 mm/pgtable-generic.c           | 8 ++++++++
 4 files changed, 25 insertions(+), 3 deletions(-)

Comments

Peter Zijlstra Oct. 25, 2021, 10:52 a.m. UTC | #1
On Thu, Oct 21, 2021 at 05:21:09AM -0700, Nadav Amit wrote:
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 448cd01eb3ec..18c3366f8f4d 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1146,6 +1146,14 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>  	}
>  }
>  #endif
> +
> +#define __HAVE_ARCH_PMDP_INVALIDATE_AD
> +static inline pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
> +				       unsigned long address, pmd_t *pmdp)
> +{
> +	return pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));

Did this want to be something like:

	pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
	if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
	return old;

instead?

> +}
> +
>  /*
>   * Page table pages are page-aligned.  The lower half of the top
>   * level is used for userspace and the top half for the kernel.

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e5ea5f775d5c..435da011b1a2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1795,10 +1795,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  	 * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
>  	 * which may break userspace.
>  	 *
> -	 * pmdp_invalidate() is required to make sure we don't miss
> -	 * dirty/young flags set by hardware.
> +	 * pmdp_invalidate_ad() is required to make sure we don't miss
> +	 * dirty/young flags (which are also known as access/dirty) cannot be
> +	 * further modifeid by the hardware.

"modified", I think is the more common spelling.

>  	 */
> -	entry = pmdp_invalidate(vma, addr, pmd);
> +	entry = pmdp_invalidate_ad(vma, addr, pmd);
>  
>  	entry = pmd_modify(entry, newprot);
>  	if (preserve_write)
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index 4e640baf9794..b0ce6c7391bf 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -200,6 +200,14 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>  }
>  #endif
>  
> +#ifndef __HAVE_ARCH_PMDP_INVALIDATE_AD

/*
 * Does this deserve a comment to explain the intended difference vs
 * pmdp_invalidate() ?
 */

> +pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
> +			 pmd_t *pmdp)
> +{
> +	return pmdp_invalidate(vma, address, pmdp);
> +}
> +#endif
> +
>  #ifndef pmdp_collapse_flush
>  pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
>  			  pmd_t *pmdp)
> -- 
> 2.25.1
>
Nadav Amit Oct. 25, 2021, 4:29 p.m. UTC | #2
> On Oct 25, 2021, at 3:52 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, Oct 21, 2021 at 05:21:09AM -0700, Nadav Amit wrote:
>> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
>> index 448cd01eb3ec..18c3366f8f4d 100644
>> --- a/arch/x86/include/asm/pgtable.h
>> +++ b/arch/x86/include/asm/pgtable.h
>> @@ -1146,6 +1146,14 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>> 	}
>> }
>> #endif
>> +
>> +#define __HAVE_ARCH_PMDP_INVALIDATE_AD
>> +static inline pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
>> +				       unsigned long address, pmd_t *pmdp)
>> +{
>> +	return pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
> 
> Did this want to be something like:
> 
> 	pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
> 	if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
> 		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> 	return old;
> 
> instead?

Yes. Of course. Where did my code go to? :(

> 
>> +}
>> +
>> /*
>>  * Page table pages are page-aligned.  The lower half of the top
>>  * level is used for userspace and the top half for the kernel.
> 
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index e5ea5f775d5c..435da011b1a2 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1795,10 +1795,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>> 	 * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
>> 	 * which may break userspace.
>> 	 *
>> -	 * pmdp_invalidate() is required to make sure we don't miss
>> -	 * dirty/young flags set by hardware.
>> +	 * pmdp_invalidate_ad() is required to make sure we don't miss
>> +	 * dirty/young flags (which are also known as access/dirty) cannot be
>> +	 * further modifeid by the hardware.
> 
> "modified", I think is the more common spelling.

I tried to start a new trend. I will fix it.

> 
>> 	 */
>> -	entry = pmdp_invalidate(vma, addr, pmd);
>> +	entry = pmdp_invalidate_ad(vma, addr, pmd);
>> 
>> 	entry = pmd_modify(entry, newprot);
>> 	if (preserve_write)
>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>> index 4e640baf9794..b0ce6c7391bf 100644
>> --- a/mm/pgtable-generic.c
>> +++ b/mm/pgtable-generic.c
>> @@ -200,6 +200,14 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>> }
>> #endif
>> 
>> +#ifndef __HAVE_ARCH_PMDP_INVALIDATE_AD
> 
> /*
> * Does this deserve a comment to explain the intended difference vs
> * pmdp_invalidate() ?
> */

I will add a comment.
Dave Hansen Oct. 26, 2021, 4:06 p.m. UTC | #3
On 10/21/21 5:21 AM, Nadav Amit wrote:
> The first TLB flush is only necessary to prevent the dirty bit (and, to
> a lesser extent, the access bit) from changing while the PTE is
> modified. However, this flush is unnecessary, as x86 CPUs set the
> dirty bit atomically with an additional check that the PTE is (still)
> present. One caveat is Intel's Knights Landing, which has a bug and does
> not do so.

First, did I miss the check in this patch for X86_BUG_PTE_LEAK?  I don't
see it anywhere.

> -	 * pmdp_invalidate() is required to make sure we don't miss
> -	 * dirty/young flags set by hardware.

This got me thinking...  In here:

> https://lore.kernel.org/lkml/20160708001909.FB2443E2@viggo.jf.intel.com/

I wrote:

> These bits are truly "stray".  In the case of the Dirty bit, the
> thread associated with the stray set was *not* allowed to write to
> the page.  This means that we do not have to launder the bit(s); we
> can simply ignore them.

Is the goal of your proposed patch here to ensure that the dirty bit is
not set at *all*?  Or, is it to ensure that a dirty bit which we need to
*launder* is never set?
Nadav Amit Oct. 26, 2021, 4:47 p.m. UTC | #4
> On Oct 26, 2021, at 9:06 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 10/21/21 5:21 AM, Nadav Amit wrote:
>> The first TLB flush is only necessary to prevent the dirty bit (and, to
>> a lesser extent, the access bit) from changing while the PTE is
>> modified. However, this flush is unnecessary, as x86 CPUs set the
>> dirty bit atomically with an additional check that the PTE is (still)
>> present. One caveat is Intel's Knights Landing, which has a bug and does
>> not do so.
> 
> First, did I miss the check in this patch for X86_BUG_PTE_LEAK?  I don't
> see it anywhere.

No, it is me who missed it. It should have been in pmdp_invalidate_ad():

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 3481b35cb4ec..f14f64cc17b5 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -780,6 +780,30 @@ int pmd_clear_huge(pmd_t *pmd)
 	return 0;
 }
 
+/*
+ * pmdp_invalidate_ad() - prevents the access and dirty bits from being further
+ * updated by the CPU.
+ *
+ * Returns the original PTE.
+ *
+ * During an access to a page, x86 CPUs set the dirty and access bit atomically
+ * with an additional check of the present-bit. Therefore, it is possible to
+ * avoid the TLB flush if we change the PTE atomically, as pmdp_establish does.
+ *
> + * We do not make this optimization on certain CPUs that have a bug that violates
+ * this behavior (specifically Knights Landing).
+ */
+pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
+			 pmd_t *pmdp)
+{
+	pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
+
+	if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
+		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+	return old;
+}

> 
>> -	 * pmdp_invalidate() is required to make sure we don't miss
>> -	 * dirty/young flags set by hardware.
> 
> This got me thinking...  In here:
> 
>> https://lore.kernel.org/lkml/20160708001909.FB2443E2@viggo.jf.intel.com/
> 
> I wrote:
> 
>> These bits are truly "stray".  In the case of the Dirty bit, the
>> thread associated with the stray set was *not* allowed to write to
>> the page.  This means that we do not have to launder the bit(s); we
>> can simply ignore them.
> 
> Is the goal of your proposed patch here to ensure that the dirty bit is
> not set at *all*?  Or, is it to ensure that a dirty bit which we need to
> *launder* is never set?

At *all*.

Err… I remembered from our previous discussions that the dirty bit cannot
be set once the R/W bit is cleared atomically. But going back to the SDM,
I see the (relatively new?) note:

"If software on one logical processor writes to a page while software on
 another logical processor concurrently clears the R/W flag in the
 paging-structure entry that maps the page, execution on some processors may
 result in the entry’s dirty flag being set (due to the write on the first
 logical processor) and the entry’s R/W flag being clear (due to the update
 to the entry on the second logical processor). This will never occur on a
 processor that supports control-flow enforcement technology (CET)”

So I guess that this optimization can only be enabled when CET is enabled.

:(
Nadav Amit Oct. 26, 2021, 5:44 p.m. UTC | #5
> On Oct 26, 2021, at 9:47 AM, Nadav Amit <namit@vmware.com> wrote:
> 
> 
> 
>> On Oct 26, 2021, at 9:06 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>> 
>> On 10/21/21 5:21 AM, Nadav Amit wrote:
>>> The first TLB flush is only necessary to prevent the dirty bit (and, to
>>> a lesser extent, the access bit) from changing while the PTE is
>>> modified. However, this flush is unnecessary, as x86 CPUs set the
>>> dirty bit atomically with an additional check that the PTE is (still)
>>> present. One caveat is Intel's Knights Landing, which has a bug and does
>>> not do so.
>> 
>> First, did I miss the check in this patch for X86_BUG_PTE_LEAK?  I don't
>> see it anywhere.
> 
> No, it is me who missed it. It should have been in pmdp_invalidate_ad():
> 
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index 3481b35cb4ec..f14f64cc17b5 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -780,6 +780,30 @@ int pmd_clear_huge(pmd_t *pmd)
> 	return 0;
> }
> 
> +/*
> + * pmdp_invalidate_ad() - prevents the access and dirty bits from being further
> + * updated by the CPU.
> + *
> + * Returns the original PTE.
> + *
> + * During an access to a page, x86 CPUs set the dirty and access bit atomically
> + * with an additional check of the present-bit. Therefore, it is possible to
> + * avoid the TLB flush if we change the PTE atomically, as pmdp_establish does.
> + *
> > + * We do not make this optimization on certain CPUs that have a bug that violates
> + * this behavior (specifically Knights Landing).
> + */
> +pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
> +			 pmd_t *pmdp)
> +{
> +	pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
> +
> +	if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
> +		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> +	return old;
> +}
> 
>> 
>>> -	 * pmdp_invalidate() is required to make sure we don't miss
>>> -	 * dirty/young flags set by hardware.
>> 
>> This got me thinking...  In here:
>> 
>>> https://lore.kernel.org/lkml/20160708001909.FB2443E2@viggo.jf.intel.com/
>> 
>> I wrote:
>> 
>>> These bits are truly "stray".  In the case of the Dirty bit, the
>>> thread associated with the stray set was *not* allowed to write to
>>> the page.  This means that we do not have to launder the bit(s); we
>>> can simply ignore them.
>> 
>> Is the goal of your proposed patch here to ensure that the dirty bit is
>> not set at *all*?  Or, is it to ensure that a dirty bit which we need to
>> *launder* is never set?
> 
> At *all*.
> 
> Err… I remembered from our previous discussions that the dirty bit cannot
> be set once the R/W bit is cleared atomically. But going back to the SDM,
> I see the (relatively new?) note:
> 
> "If software on one logical processor writes to a page while software on
> another logical processor concurrently clears the R/W flag in the
> paging-structure entry that maps the page, execution on some processors may
> result in the entry’s dirty flag being set (due to the write on the first
> logical processor) and the entry’s R/W flag being clear (due to the update
> to the entry on the second logical processor). This will never occur on a
> processor that supports control-flow enforcement technology (CET)”
> 
> So I guess that this optimization can only be enabled when CET is enabled.
> 
> :(

I still wonder whether the SDM comment applies to present bit vs dirty
bit atomicity as well.

On AMD’s APM I find:

"The processor never sets the Accessed bit or the Dirty bit for a not
present page (P = 0). The ordering of Accessed and Dirty bit updates
with respect to surrounding loads and stores is discussed below.”

(The later comment concerns ordering with respect to WC memory.)

I don’t know if I read it too creatively...
Dave Hansen Oct. 26, 2021, 6:44 p.m. UTC | #6
On 10/26/21 10:44 AM, Nadav Amit wrote:
>> "If software on one logical processor writes to a page while software on
>> another logical processor concurrently clears the R/W flag in the
>> paging-structure entry that maps the page, execution on some processors may
>> result in the entry’s dirty flag being set (due to the write on the first
>> logical processor) and the entry’s R/W flag being clear (due to the update
>> to the entry on the second logical processor). This will never occur on a
>> processor that supports control-flow enforcement technology (CET)”
>>
>> So I guess that this optimization can only be enabled when CET is enabled.
>>
>> :(
> I still wonder whether the SDM comment applies to present bit vs dirty
> bit atomicity as well.

I think it's implicit.  From "4.8 ACCESSED AND DIRTY FLAGS":

	"Whenever there is a write to a linear address, the processor
	 sets the dirty flag (if it is not already set) in the paging-
	 structure entry"

There can't be a "write to a linear address" without a Present=1 PTE.
If it were a Dirty=1,Present=1 PTE, there's no race because there might
not be a write to the PTE at all.

There's also this from the "4.10.4.3 Optional Invalidation" section:

	"no TLB entry or paging-structure cache entry is created with
	 information from a paging-structure entry in which the P flag
 	 is 0."

That means that we don't have to worry about the TLB doing something
bonkers like caching a Dirty=1 bit from a Present=0 PTE.

Is that what you were worried about?
Nadav Amit Oct. 26, 2021, 7:06 p.m. UTC | #7
> On Oct 26, 2021, at 11:44 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 10/26/21 10:44 AM, Nadav Amit wrote:
>>> "If software on one logical processor writes to a page while software on
>>> another logical processor concurrently clears the R/W flag in the
>>> paging-structure entry that maps the page, execution on some processors may
>>> result in the entry’s dirty flag being set (due to the write on the first
>>> logical processor) and the entry’s R/W flag being clear (due to the update
>>> to the entry on the second logical processor). This will never occur on a
>>> processor that supports control-flow enforcement technology (CET)”
>>> 
>>> So I guess that this optimization can only be enabled when CET is enabled.
>>> 
>>> :(
>> I still wonder whether the SDM comment applies to present bit vs dirty
>> bit atomicity as well.
> 
> I think it's implicit.  From "4.8 ACCESSED AND DIRTY FLAGS":
> 
> 	"Whenever there is a write to a linear address, the processor
> 	 sets the dirty flag (if it is not already set) in the paging-
> 	 structure entry"
> 
> There can't be a "write to a linear address" without a Present=1 PTE.
> If it were a Dirty=1,Present=1 PTE, there's no race because there might
> not be a write to the PTE at all.
> 
> There's also this from the "4.10.4.3 Optional Invalidation" section:
> 
> 	"no TLB entry or paging-structure cache entry is created with
> 	 information from a paging-structure entry in which the P flag
> 	 is 0."
> 
> That means that we don't have to worry about the TLB doing something
> bonkers like caching a Dirty=1 bit from a Present=0 PTE.
> 
> Is that what you were worried about?

Thanks Dave, but no - that is not my concern.

To make it very clear - consider the following scenario, in which
a volatile pointer p is mapped using a certain PTE, which is RW
(i.e., *p is writable):

  CPU0				CPU1
  ----				----
  x = *p
  [ PTE cached in TLB; 
    PTE is not dirty ]
				clear_pte(PTE)
  *p = x
  [ needs to set dirty ]

Note that there is no TLB flush in this scenario. The question
is whether the write access to *p would succeed, setting the
dirty bit on the clear, non-present entry.

I was under the impression that the hardware AD-assist would
recheck the PTE atomically as it sets the dirty bit. But, as I
said, I am not sure anymore whether this is defined architecturally
(or at least would work in practice on all CPUs modulo the 
Knights Landing thingy).
Dave Hansen Oct. 26, 2021, 7:40 p.m. UTC | #8
On 10/26/21 12:06 PM, Nadav Amit wrote:
> 
> To make it very clear - consider the following scenario, in which
> a volatile pointer p is mapped using a certain PTE, which is RW
> (i.e., *p is writable):
> 
>   CPU0				CPU1
>   ----				----
>   x = *p
>   [ PTE cached in TLB; 
>     PTE is not dirty ]
> 				clear_pte(PTE)
>   *p = x
>   [ needs to set dirty ]
> 
> Note that there is no TLB flush in this scenario. The question
> is whether the write access to *p would succeed, setting the
> dirty bit on the clear, non-present entry.
> 
> I was under the impression that the hardware AD-assist would
> recheck the PTE atomically as it sets the dirty bit. But, as I
> said, I am not sure anymore whether this is defined architecturally
> (or at least would work in practice on all CPUs modulo the 
> Knights Landing thingy).

Practically, at "x=*p", the thing that gets cached in the TLB will be
Dirty=0.  At the "*p=x", the CPU will decide it needs to do a write,
find the Dirty=0 entry and will entirely discard it.  In other words, it
*acts* roughly like this:

	x = *p				
	INVLPG(p)
	*p = x;

Where the INVLPG() and the "*p=x" are atomic.  So, there's no
_practical_ problem with your scenario.  This specific behavior isn't
architectural as far as I know, though.

Although it's pretty much just academic, as for the architecture, are
you getting hung up on the difference between the description of "Accessed":

	Whenever the processor uses a paging-structure entry as part of
	linear-address translation, it sets the accessed flag in that
	entry

and "Dirty:"

	Whenever there is a write to a linear address, the processor
	sets the dirty flag (if it is not already set) in the paging-
	structure entry...

Accessed says "as part of linear-address translation", which means that
the address must have a translation.  But, the "Dirty" section doesn't
say that.  It talks about "a write to a linear address" but not whether
there is a linear address *translation* involved.

If that's it, we could probably add a bit like:

	In addition to setting the accessed flag, whenever there is a
	write...

before the dirty rules in the SDM.

Or am I being dense and continuing to miss your point? :)
Nadav Amit Oct. 26, 2021, 8:07 p.m. UTC | #9
> On Oct 26, 2021, at 12:40 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 10/26/21 12:06 PM, Nadav Amit wrote:
>> 
>> To make it very clear - consider the following scenario, in which
>> a volatile pointer p is mapped using a certain PTE, which is RW
>> (i.e., *p is writable):
>> 
>>  CPU0				CPU1
>>  ----				----
>>  x = *p
>>  [ PTE cached in TLB; 
>>    PTE is not dirty ]
>> 				clear_pte(PTE)
>>  *p = x
>>  [ needs to set dirty ]
>> 
>> Note that there is no TLB flush in this scenario. The question
>> is whether the write access to *p would succeed, setting the
>> dirty bit on the clear, non-present entry.
>> 
>> I was under the impression that the hardware AD-assist would
>> recheck the PTE atomically as it sets the dirty bit. But, as I
>> said, I am not sure anymore whether this is defined architecturally
>> (or at least would work in practice on all CPUs modulo the 
>> Knights Landing thingy).
> 
> Practically, at "x=*p", the thing that gets cached in the TLB will be
> Dirty=0.  At the "*p=x", the CPU will decide it needs to do a write,
> find the Dirty=0 entry and will entirely discard it.  In other words, it
> *acts* roughly like this:
> 
> 	x = *p				
> 	INVLPG(p)
> 	*p = x;
> 
> Where the INVLPG() and the "*p=x" are atomic.  So, there's no
> _practical_ problem with your scenario.  This specific behavior isn't
> architectural as far as I know, though.
> 
> Although it's pretty much just academic, as for the architecture, are
> you getting hung up on the difference between the description of "Accessed":
> 
> 	Whenever the processor uses a paging-structure entry as part of
> 	linear-address translation, it sets the accessed flag in that
> 	entry
> 
> and "Dirty:"
> 
> 	Whenever there is a write to a linear address, the processor
> 	sets the dirty flag (if it is not already set) in the paging-
> 	structure entry...
> 
> Accessed says "as part of linear-address translation", which means that
> the address must have a translation.  But, the "Dirty" section doesn't
> say that.  It talks about "a write to a linear address" but not whether
> there is a linear address *translation* involved.
> 
> If that's it, we could probably add a bit like:
> 
> 	In addition to setting the accessed flag, whenever there is a
> 	write...
> 
> before the dirty rules in the SDM.
> 
> Or am I being dense and continuing to miss your point? :)

I think this time you got my question right.

I was thrown off by the SDM comment on RW permissions vs dirty that I
mentioned before:

"If software on one logical processor writes to a page while software on
 another logical processor concurrently clears the R/W flag in the
 paging-structure entry that maps the page, execution on some processors may
 result in the entry’s dirty flag being set (due to the write on the first
 logical processor) and the entry’s R/W flag being clear (due to the update
 to the entry on the second logical processor).”

I did not pay enough attention to these small differences that you mentioned
between access and dirty this time (although I did notice them before).

I do not think that the change that you offered to the SDM really clarifies
the situation. Setting the access flag is done as part of caching the PTE in
the TLB. The SDM change you propose does not clarify the atomicity of the
permission/PTE-validity check and dirty-bit setting, or the fact that the
PTE is invalidated if the dirty bit needs to be set and is cached as clear
[I do not presume you would want the latter in the SDM, since it is an
implementation detail].

I just wonder how come the R/W-clearing and the P-clearing cause concurrent
dirty bit setting to behave differently. I am not a hardware guy, but I would
imagine they would be the same...
Dave Hansen Oct. 26, 2021, 8:47 p.m. UTC | #10
On 10/26/21 1:07 PM, Nadav Amit wrote:
> I just wonder how come the R/W-clearing and the P-clearing cause concurrent
> dirty bit setting to behave differently. I am not a hardware guy, but I would
> imagine they would be the same...

First of all, I think the non-atomic properties where a PTE can go:

	W=1,D=0 // original
	W=0,D=0 // software clears W
	W=0,D=1 // hardware sets D

were a total implementation accident.  It wasn't someone being clever;
since the behavior was architecturally allowed and well-tolerated by
software, it was around for a while.  I think I was the one that asked
that it get fixed for shadow stacks, and nobody pushed back on it too
hard as far as I remember.  I don't think it was super hard to fix.

Why do the Present/Accessed and Write/Dirty pairs act differently?  I
think it's a total implementation accident and wasn't by design.

The KNL behavior was treated as an erratum and wasn't codified in the
architecture because it actually broke things.  The pre-CET Write/Dirty
behavior didn't break software to a level where it was considered an
erratum.  It gets
to live on as allowed in the architecture.

Patch

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 448cd01eb3ec..18c3366f8f4d 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1146,6 +1146,14 @@  static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
 	}
 }
 #endif
+
+#define __HAVE_ARCH_PMDP_INVALIDATE_AD
+static inline pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
+				       unsigned long address, pmd_t *pmdp)
+{
+	return pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
+}
+
 /*
  * Page table pages are page-aligned.  The lower half of the top
  * level is used for userspace and the top half for the kernel.
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e24d2c992b11..622efe0a9ef0 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -561,6 +561,11 @@  extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 			    pmd_t *pmdp);
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_INVALIDATE_AD
+extern pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
+				unsigned long address, pmd_t *pmdp);
+#endif
+
 #ifndef __HAVE_ARCH_PTE_SAME
 static inline int pte_same(pte_t pte_a, pte_t pte_b)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e5ea5f775d5c..435da011b1a2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1795,10 +1795,11 @@  int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	 * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
 	 * which may break userspace.
 	 *
-	 * pmdp_invalidate() is required to make sure we don't miss
-	 * dirty/young flags set by hardware.
+	 * pmdp_invalidate_ad() is required to make sure we don't miss
+	 * dirty/young flags (which are also known as access/dirty) cannot be
+	 * further modifeid by the hardware.
 	 */
-	entry = pmdp_invalidate(vma, addr, pmd);
+	entry = pmdp_invalidate_ad(vma, addr, pmd);
 
 	entry = pmd_modify(entry, newprot);
 	if (preserve_write)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 4e640baf9794..b0ce6c7391bf 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -200,6 +200,14 @@  pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 }
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_INVALIDATE_AD
+pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
+			 pmd_t *pmdp)
+{
+	return pmdp_invalidate(vma, address, pmdp);
+}
+#endif
+
 #ifndef pmdp_collapse_flush
 pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 			  pmd_t *pmdp)
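
Editorial aside: the safety argument in this thread hinges on
pmdp_establish() updating the entry with a single atomic exchange. A
minimal user-space model of that idea (illustrative only; the helper
below is an assumption for demonstration, not kernel code):

	#include <stdatomic.h>
	#include <stdint.h>

	/*
	 * Model of an xchg-based PTE update: the exchange returns the
	 * old entry, so any Dirty/Accessed bit the CPU set before this
	 * point is captured in the return value.  Once the new
	 * (non-present) value is installed, a CPU that atomically
	 * rechecks the Present bit when setting Dirty/Accessed (i.e.,
	 * anything without the Knights Landing erratum) can no longer
	 * modify the entry, so no extra TLB flush is needed just to
	 * stabilize those bits.
	 */
	static uint64_t pte_establish(_Atomic uint64_t *ptep, uint64_t new_pte)
	{
		return atomic_exchange(ptep, new_pte);
	}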