
[RESEND,v3,5/5] mm: avoid unnecessary flush on change_huge_pmd()

Message ID: 20220311190749.338281-6-namit@vmware.com
State: New
Series: mm/mprotect: avoid unnecessary TLB flushes

Commit Message

Nadav Amit March 11, 2022, 7:07 p.m. UTC
From: Nadav Amit <namit@vmware.com>

Calls to change_protection_range() on THP can trigger, at least on x86,
two TLB flushes for one page: one immediately, when pmdp_invalidate() is
called by change_huge_pmd(), and then another one later (that can be
batched) when change_protection_range() finishes.
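
Roughly, the flow is (a simplified call graph, not the exact code):

	change_protection_range()
	  change_huge_pmd()
	    pmdp_invalidate()	/* first flush, issued immediately */
	  ...
	  flush_tlb_range()	/* second flush, can be batched */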

The first TLB flush is only needed to prevent the dirty bit (and, with
lesser importance, the access bit) from changing while the PTE is
modified. However, this flush can be skipped, since x86 CPUs set the
dirty bit atomically, with an additional check that the PTE is (still)
present. One caveat is Intel's Knights Landing, which has a bug and
does not do so.
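
Conceptually, the CPU's A/D-bit assist behaves as if it did the
following (a sketch, not actual microcode):

	do {
		old = READ_ONCE(*pte);
		if (!(old & _PAGE_PRESENT))
			break;		/* deliver #PF; PTE left untouched */
		new = old | _PAGE_DIRTY;
	} while (cmpxchg(pte, old, new) != old);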

Leverage this behavior to eliminate the unnecessary TLB flush in
change_huge_pmd(). Introduce a new arch-specific pmdp_invalidate_ad()
that only guards the access and dirty bits against further changes.
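
With it, the relevant snippet of change_huge_pmd() becomes (see the
patch below):

	oldpmd = pmdp_invalidate_ad(vma, addr, pmd);
	entry = pmd_modify(oldpmd, newprot);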

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: x86@kernel.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/pgtable.h |  5 +++++
 arch/x86/mm/pgtable.c          | 10 ++++++++++
 include/linux/pgtable.h        | 20 ++++++++++++++++++++
 mm/huge_memory.c               |  4 ++--
 mm/pgtable-generic.c           |  8 ++++++++
 5 files changed, 45 insertions(+), 2 deletions(-)

Comments

Dave Hansen March 11, 2022, 8:41 p.m. UTC | #1
On 3/11/22 11:07, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> Calls to change_protection_range() on THP can trigger, at least on x86,
> two TLB flushes for one page: one immediately, when pmdp_invalidate() is
> called by change_huge_pmd(), and then another one later (that can be
> batched) when change_protection_range() finishes.
> 
> The first TLB flush is only needed to prevent the dirty bit (and, with
> lesser importance, the access bit) from changing while the PTE is
> modified. However, this flush can be skipped, since x86 CPUs set the
> dirty bit atomically, with an additional check that the PTE is (still)
> present. One caveat is Intel's Knights Landing, which has a bug and
> does not do so.

First of all, thank you for your diligence here.  This is a super
obscure issue.  I think I put handling for it in the kernel and I'm not
sure I would have even thought about this angle.

That said, I'm not sure this is all necessary.

Yes, the Dirty bit can get set unexpectedly in some PTEs.  But, the
question is whether it is *VALUABLE* and needs to be preserved.  The
current kernel code pretty much just lets the hardware set the Dirty bit
and then ignores it.  If it were valuable, ignoring it would have been a
bad thing.  We'd be losing data on today's kernels because the hardware
told us about a write that happened but that the kernel ignored.

My mental model of what the microcode responsible for the erratum does
is something along these lines:

	if (write)
		pte |= _PAGE_DIRTY;
	if (!pte_present(pte))
		#PF

The PTE is marked dirty, but the write never actually executes.  The
thread that triggered the A/D setting *also* gets a fault.

I'll double-check with some Intel folks to make sure I'm not missing
something.  But, either way, I don't think we should be going to this
much trouble for the good ol' Xeon Phi.  I doubt there are many still
around and I *REALLY* doubt they're running new kernels.

*If* we need this (and I'm not convinced we do), my first instinct would
be to just do this instead:

	clear_cpu_cap(c, X86_FEATURE_PSE);

on KNL systems.  If anyone cares, they know where to find us.
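
An untested sketch of where such a quirk could live, assuming the usual
early_init_intel() family/model check:

	/* Knights Landing may set A/D bits in non-present PTEs */
	if (c->x86 == 6 && c->x86_model == INTEL_FAM6_XEON_PHI_KNL)
		clear_cpu_cap(c, X86_FEATURE_PSE);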
Nadav Amit March 11, 2022, 8:53 p.m. UTC | #2
> On Mar 11, 2022, at 12:41 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 3/11/22 11:07, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> Calls to change_protection_range() on THP can trigger, at least on x86,
>> two TLB flushes for one page: one immediately, when pmdp_invalidate() is
>> called by change_huge_pmd(), and then another one later (that can be
>> batched) when change_protection_range() finishes.
>> 
>> The first TLB flush is only needed to prevent the dirty bit (and, with
>> lesser importance, the access bit) from changing while the PTE is
>> modified. However, this flush can be skipped, since x86 CPUs set the
>> dirty bit atomically, with an additional check that the PTE is (still)
>> present. One caveat is Intel's Knights Landing, which has a bug and
>> does not do so.
> 
> First of all, thank you for your diligence here.  This is a super
> obscure issue.  I think I put handling for it in the kernel and I'm not
> sure I would have even thought about this angle.
> 
> That said, I'm not sure this is all necessary.
> 
> Yes, the Dirty bit can get set unexpectedly in some PTEs.  But, the
> question is whether it is *VALUABLE* and needs to be preserved.  The
> current kernel code pretty much just lets the hardware set the Dirty bit
> and then ignores it.  If it were valuable, ignoring it would have been a
> bad thing.  We'd be losing data on today's kernels because the hardware
> told us about a write that happened but that the kernel ignored.
> 
> My mental model of what the microcode responsible for the erratum does
> is something along these lines:
> 
> 	if (write)
> 		pte |= _PAGE_DIRTY;
> 	if (!pte_present(pte))
> 		#PF
> 
> The PTE is marked dirty, but the write never actually executes.  The
> thread that triggered the A/D setting *also* gets a fault.
> 

This makes perfect sense. I guess I misunderstood or forgot the erratum.
But feel free to recheck. It would allow removing the KNL check, and
probably the first patch in this series. But I don't think it would
allow getting rid of pmdp_invalidate_ad(), since I do not feel
comfortable using pmdp_establish() directly: I do not know other
architectures well enough to say that they have the same atomicity
guarantees when it comes to the A/D bits.
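
For reference, on x86 pmdp_establish() boils down to an atomic xchg on
SMP (roughly):

	pmd_t pmdp_establish(struct vm_area_struct *vma,
			     unsigned long address, pmd_t *pmdp, pmd_t pmd)
	{
		/* the old value comes back with any A/D bits set by HW */
		return xchg(pmdp, pmd);
	}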

> I'll double-check with some Intel folks to make sure I'm not missing
> something.  But, either way, I don't think we should be going to this
> much trouble for the good ol' Xeon Phi.  I doubt there are many still
> around and I *REALLY* doubt they're running new kernels.
> 
> *If* we need this (and I'm not convinced we do), my first instinct would
> be to just do this instead:
> 
> 	clear_cpu_cap(c, X86_FEATURE_PSE);
> 
> on KNL systems.  If anyone cares, they know where to find us.

I think that it is not necessary and your understanding of the erratum
is the right one. Let me know if you find it is not the case.

BTW: Thanks for the quick response, and sorry for the time it took me
to send v3.

Patch

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 62ab07e24aef..23ad34edcc4b 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1173,6 +1173,11 @@  static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
 	}
 }
 #endif
+
+#define __HAVE_ARCH_PMDP_INVALIDATE_AD
+extern pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
+				unsigned long address, pmd_t *pmdp);
+
 /*
  * Page table pages are page-aligned.  The lower half of the top
  * level is used for userspace and the top half for the kernel.
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 3481b35cb4ec..b2fcb2c749ce 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -608,6 +608,16 @@  int pmdp_clear_flush_young(struct vm_area_struct *vma,
 
 	return young;
 }
+
+pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
+			 pmd_t *pmdp)
+{
+	pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
+
+	if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
+		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+	return old;
+}
 #endif
 
 /**
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f4f4077b97aa..5826e8e52619 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -570,6 +570,26 @@  extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 			    pmd_t *pmdp);
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_INVALIDATE_AD
+
+/*
+ * pmdp_invalidate_ad() invalidates the PMD while changing a transparent
+ * hugepage mapping in the page tables. This function is similar to
+ * pmdp_invalidate(), but should only be used if the access and dirty bits
+ * would not be cleared by the software in the new PMD value. The function
+ * ensures that hardware updates of the access and dirty bits are not lost.
+ *
+ * Doing so allows certain architectures to avoid a TLB flush in most
+ * cases. Yet, another TLB flush might be necessary later if the PMD update
+ * itself requires one (e.g., if protection was made stricter). Even when a
+ * TLB flush is needed because of the update, the caller may be able to
+ * batch these TLB flushing operations, so fewer TLB flush operations are
+ * needed.
+ */
+extern pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
+				unsigned long address, pmd_t *pmdp);
+#endif
+
 #ifndef __HAVE_ARCH_PTE_SAME
 static inline int pte_same(pte_t pte_a, pte_t pte_b)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 51b0f3cb1ba0..691d80edcfd7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1781,10 +1781,10 @@  int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	 * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
 	 * which may break userspace.
 	 *
-	 * pmdp_invalidate() is required to make sure we don't miss
+	 * pmdp_invalidate_ad() is required to make sure we don't miss
 	 * dirty/young flags set by hardware.
 	 */
-	oldpmd = pmdp_invalidate(vma, addr, pmd);
+	oldpmd = pmdp_invalidate_ad(vma, addr, pmd);
 
 	entry = pmd_modify(oldpmd, newprot);
 	if (preserve_write)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 6523fda274e5..90ab721a12a8 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -201,6 +201,14 @@  pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 }
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_INVALIDATE_AD
+pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
+			 pmd_t *pmdp)
+{
+	return pmdp_invalidate(vma, address, pmdp);
+}
+#endif
+
 #ifndef pmdp_collapse_flush
 pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 			  pmd_t *pmdp)