[v5,02/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks

Message ID 20191003013325.2614-3-leonardo@linux.ibm.com
State New
Series
  • Introduces new count-based method for tracking lockless pagetable walks

Commit Message

Leonardo Bras Oct. 3, 2019, 1:33 a.m. UTC
It's necessary to monitor lockless pagetable walks, in order to avoid doing
THP splitting/collapsing during them.

On powerpc, we need to do some lockless pagetable walks from functions
that already have interrupts disabled, especially from real mode with
MSR[EE=0].

In these contexts, disabling/enabling interrupts can be very troublesome.

So, this arch-specific implementation features functions with an extra
argument that allows interrupt enable/disable to be skipped:
__begin_lockless_pgtbl_walk() and __end_lockless_pgtbl_walk().

Functions similar to the generic ones are also exported, by calling
the above functions with parameter *able_irq = true.

While there is no config option, the method is disabled and these functions
only do what was already needed for lockless pagetable walks
(disabling interrupts). A memory barrier was also added to make sure
there is no speculative read outside the interrupt-disabled area.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/pgtable.h |   9 ++
 arch/powerpc/mm/book3s64/pgtable.c           | 117 +++++++++++++++++++
 2 files changed, 126 insertions(+)
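
For readers following the thread, here is a minimal userspace sketch of the
counting idea described in the commit message. It is not the kernel code:
struct mm_model and the model_* names are hypothetical, C11 atomics stand in
for atomic_inc()/atomic_dec()/smp_mb(), and interrupt handling is not
modelled at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct mm_struct; only the counter is modelled. */
struct mm_model {
	atomic_int lockless_pgtbl_walkers;
};

/* Reader side: announce the walk, then order it before the table reads. */
static void model_begin_walk(struct mm_model *mm)
{
	atomic_fetch_add_explicit(&mm->lockless_pgtbl_walkers, 1,
				  memory_order_relaxed);
	/* Full fence, playing the role of smp_mb() in the patch. */
	atomic_thread_fence(memory_order_seq_cst);
}

/* Reader side: order the last table access before dropping the count. */
static void model_end_walk(struct mm_model *mm)
{
	atomic_thread_fence(memory_order_seq_cst);
	atomic_fetch_sub_explicit(&mm->lockless_pgtbl_walkers, 1,
				  memory_order_relaxed);
}

/* Number of walks currently announced on this mm. */
static int model_running_walks(struct mm_model *mm)
{
	return atomic_load(&mm->lockless_pgtbl_walkers);
}

/*
 * Writer side (THP split/collapse): true means a walk may be in flight,
 * so the expensive serialization step is still required.
 */
static bool model_needs_serialize(struct mm_model *mm)
{
	return model_running_walks(mm) != 0;
}
```

A writer would call model_needs_serialize() before a split or collapse and
skip the expensive path only when it returns false.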

Comments

Christopher Lameter Oct. 8, 2019, 3:11 p.m. UTC | #1
On Wed, 2 Oct 2019, Leonardo Bras wrote:

> +
> +inline unsigned long __begin_lockless_pgtbl_walk(struct mm_struct *mm,
> +						 bool disable_irq)
> +{
> +	unsigned long irq_mask = 0;
> +
> +	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
> +		atomic_inc(&mm->lockless_pgtbl_walkers);
> +

You are creating contention on a single exclusive cacheline. Doesn't this
defeat the whole purpose of the lockless page table walk? Using mmap_sem or
similar should cause the same performance regression?
Leonardo Bras Oct. 8, 2019, 5:13 p.m. UTC | #2
On Tue, 2019-10-08 at 15:11 +0000, Christopher Lameter wrote:
> 
> On Wed, 2 Oct 2019, Leonardo Bras wrote:
> 
> > +
> > +inline unsigned long __begin_lockless_pgtbl_walk(struct mm_struct *mm,
> > +						 bool disable_irq)
> > +{
> > +	unsigned long irq_mask = 0;
> > +
> > +	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
> > +		atomic_inc(&mm->lockless_pgtbl_walkers);
> > +
> 
> You are creating contention on a single exclusive cacheline. Doesnt this
> defeat the whole purpose of the lockless page table walk? Use mmap_sem or
> so should cause the same performance regression?

Sorry, I did not understand that question.
I mean, this is just a refcount and never causes a lock.  


FYI: This function was updated as follows, and will be in v6:

#ifdef CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING
	atomic_inc(&mm->lockless_pgtbl_walkers);
#endif
	smp_mb();

IS_ENABLED doesn't work if CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING
is not defined: it causes a build error, because the mm member
lockless_pgtbl_walkers doesn't exist.
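
The difference between the two forms can be sketched in userspace as
follows. This is only an illustration: MODEL_TRACKING is a hypothetical
macro standing in for the config option, and struct mm_model is a
hypothetical stand-in for struct mm_struct. The point is that the body of a
plain "if (0)" is still compiled, so it cannot name a member that was
declared under #ifdef.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Hypothetical switch standing in for the Kconfig option.
 * Leave it undefined to model "tracking off".
 */
/* #define MODEL_TRACKING 1 */

struct mm_model {
	int unrelated_field;
#ifdef MODEL_TRACKING
	atomic_int lockless_pgtbl_walkers;	/* exists only when tracking is on */
#endif
};

/* Returns 1 if the walk was counted, 0 if tracking is compiled out. */
static int model_begin(struct mm_model *mm)
{
	/*
	 * Writing this instead would fail to build with tracking off,
	 * because the dead branch is still type-checked:
	 *
	 *	if (0)
	 *		atomic_fetch_add(&mm->lockless_pgtbl_walkers, 1);
	 */
#ifdef MODEL_TRACKING
	atomic_fetch_add(&mm->lockless_pgtbl_walkers, 1);
	return 1;
#else
	(void)mm;	/* nothing to count */
	return 0;
#endif
}
```

With the preprocessor form, the reference to the member disappears before
the compiler ever sees it, which is why the v6 version builds either way.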
Christopher Lameter Oct. 8, 2019, 5:43 p.m. UTC | #3
On Tue, 8 Oct 2019, Leonardo Bras wrote:

> > You are creating contention on a single exclusive cacheline. Doesnt this
> > defeat the whole purpose of the lockless page table walk? Use mmap_sem or
> > so should cause the same performance regression?
>
> Sorry, I did not understand that question.
> I mean, this is just a refcount and never causes a lock.

Locks also use atomic operations like a refcount increment. Both require
the cacheline to be in exclusive state. So the impact is very similar.
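
As a rough userspace illustration of this point (hypothetical names, C11
atomics rather than kernel primitives): taking a simple lock and bumping a
refcount are both atomic read-modify-write operations on shared memory. The
sketch shows only that similarity and does not measure cache behaviour.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical shared object; both fields are written by every user. */
struct shared_line {
	atomic_flag lock;	/* taken with an atomic test-and-set */
	atomic_int refcount;	/* bumped with an atomic fetch-add */
	int protected_value;
};

/* Minimal test-and-set spinlock: a read-modify-write on s->lock. */
static void model_lock(struct shared_line *s)
{
	while (atomic_flag_test_and_set_explicit(&s->lock,
						 memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void model_unlock(struct shared_line *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/* Refcount get: also a read-modify-write, here on s->refcount. */
static int model_get(struct shared_line *s)
{
	return atomic_fetch_add(&s->refcount, 1) + 1;
}

/* Refcount put: returns the count after the decrement. */
static int model_put(struct shared_line *s)
{
	return atomic_fetch_sub(&s->refcount, 1) - 1;
}
```

In both cases the CPU doing the update has to own the line for writing,
even when the lock is never contended and nobody ever waits.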
Leonardo Bras Oct. 8, 2019, 6:02 p.m. UTC | #4
On Tue, 2019-10-08 at 17:43 +0000, Christopher Lameter wrote:
> On Tue, 8 Oct 2019, Leonardo Bras wrote:
> 
> > > You are creating contention on a single exclusive cacheline. Doesnt this
> > > defeat the whole purpose of the lockless page table walk? Use mmap_sem or
> > > so should cause the same performance regression?
> > 
> > Sorry, I did not understand that question.
> > I mean, this is just a refcount and never causes a lock.
> 
> Locks also use atomic operations like a refcount increment. Both require
> the cacheline to be in exclusive state. So the impact is very similar.

Thanks for explaining. :)

So you say that the performance impact of using my approach is the same
as using locks? (supposing that lock never waits)

So, there are 'lockless pagetable walks' only for the sake of better
performance? 

I thought they existed to enable doing pagetable walks in states where
locking was not safe.

Is that right?

Thanks!
Leonardo Brás
Christopher Lameter Oct. 8, 2019, 6:27 p.m. UTC | #5
On Tue, 8 Oct 2019, Leonardo Bras wrote:

> So you say that the performance impact of using my approach is the same
> as using locks? (supposing that lock never waits)
>
> So, there are 'lockless pagetable walks' only for the sake of better
> performance?

I thought that was the major motivation here.

> I thought they existed to enable doing pagetable walks in states where
> locking was not safe.

That sounds profoundly dangerous. Usually locking makes things safe to
access. Accesses without locking require a lot of care.

Patch

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index b01624e5c467..8330b35cd28d 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1372,5 +1372,14 @@  static inline bool pgd_is_leaf(pgd_t pgd)
 	return !!(pgd_raw(pgd) & cpu_to_be64(_PAGE_PTE));
 }
 
+#define __HAVE_ARCH_LOCKLESS_PGTBL_WALK_CONTROL
+unsigned long begin_lockless_pgtbl_walk(struct mm_struct *mm);
+unsigned long __begin_lockless_pgtbl_walk(struct mm_struct *mm,
+					  bool disable_irq);
+void end_lockless_pgtbl_walk(struct mm_struct *mm, unsigned long irq_mask);
+void __end_lockless_pgtbl_walk(struct mm_struct *mm, unsigned long irq_mask,
+			       bool enable_irq);
+int running_lockless_pgtbl_walk(struct mm_struct *mm);
+
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 75483b40fcb1..ae557fdce9a3 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -98,6 +98,123 @@  void serialize_against_pte_lookup(struct mm_struct *mm)
 	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
 }
 
+/*
+ * Counting method to monitor lockless pagetable walks:
+ * Uses begin_lockless_pgtbl_walk and end_lockless_pgtbl_walk to track the
+ * number of lockless pgtable walks happening, and
+ * running_lockless_pgtbl_walk to return this value.
+ */
+
+/* __begin_lockless_pgtbl_walk: Must be inserted before a function call that
+ *   does lockless pagetable walks, such as __find_linux_pte().
+ * This version allows setting disable_irq=false, so irqs are not touched, which
+ *   is quite useful for running when ints are already disabled (like real-mode)
+ */
+
+inline unsigned long __begin_lockless_pgtbl_walk(struct mm_struct *mm,
+						 bool disable_irq)
+{
+	unsigned long irq_mask = 0;
+
+	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
+		atomic_inc(&mm->lockless_pgtbl_walkers);
+
+	/*
+	 * Interrupts must be disabled during the lockless page table walk.
+	 * That's because the deleting or splitting involves flushing TLBs,
+	 * which in turn issues interrupts, that will block when disabled.
+	 *
+	 * When this function is called from realmode with MSR[EE=0],
+	 * it's not needed to touch irq, since it's already disabled.
+	 */
+	if (disable_irq)
+		local_irq_save(irq_mask);
+
+	/*
+	 * This memory barrier pairs with any code that is either trying to
+	 * delete page tables, or split huge pages. Without this barrier,
+	 * the page tables could be read speculatively outside of interrupt
+	 * disabling or reference counting.
+	 */
+	smp_mb();
+
+	return irq_mask;
+}
+EXPORT_SYMBOL(__begin_lockless_pgtbl_walk);
+
+/* begin_lockless_pgtbl_walk: Must be inserted before a function call that does
+ *   lockless pagetable walks, such as __find_linux_pte().
+ * This version is used by generic code, and always disables irqs
+ */
+unsigned long begin_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+	return __begin_lockless_pgtbl_walk(mm, true);
+}
+EXPORT_SYMBOL(begin_lockless_pgtbl_walk);
+
+/*
+ * __end_lockless_pgtbl_walk: Must be inserted after the last use of a pointer
+ *   returned by a lockless pagetable walk, such as __find_linux_pte()
+ * This version allows setting enable_irq=false, so irqs are not touched, which
+ *   is quite useful for running when ints are already disabled (like real-mode)
+ */
+inline void __end_lockless_pgtbl_walk(struct mm_struct *mm,
+				      unsigned long irq_mask, bool enable_irq)
+{
+	/*
+	 * This memory barrier pairs with any code that is either trying to
+	 * delete page tables, or split huge pages. Without this barrier,
+	 * the page tables could be read speculatively outside of interrupt
+	 * disabling or reference counting.
+	 */
+	smp_mb();
+
+	/*
+	 * Interrupts must be disabled during the lockless page table walk.
+	 * That's because the deleting or splitting involves flushing TLBs,
+	 * which in turn issues interrupts, that will block when disabled.
+	 *
+	 * When this function is called from realmode with MSR[EE=0],
+	 * it's not needed to touch irq, since it's already disabled.
+	 */
+	if (enable_irq)
+		local_irq_restore(irq_mask);
+
+	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
+		atomic_dec(&mm->lockless_pgtbl_walkers);
+}
+EXPORT_SYMBOL(__end_lockless_pgtbl_walk);
+
+/*
+ * end_lockless_pgtbl_walk: Must be inserted after the last use of a pointer
+ *   returned by a lockless pagetable walk, such as __find_linux_pte()
+ * This version is used by generic code, and always restores irqs
+ */
+
+void end_lockless_pgtbl_walk(struct mm_struct *mm, unsigned long irq_mask)
+{
+	__end_lockless_pgtbl_walk(mm, irq_mask, true);
+}
+EXPORT_SYMBOL(end_lockless_pgtbl_walk);
+
+/*
+ * running_lockless_pgtbl_walk: Returns the number of lockless pagetable walks
+ *   currently running. If it returns 0, there is no running pagetable walk, and
+ *   THP split/collapse can be safely done. This can be used to avoid more
+ *   expensive approaches like serialize_against_pte_lookup()
+ */
+int running_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
+		return atomic_read(&mm->lockless_pgtbl_walkers);
+
+	/* If disabled, must return > 0, so callers fall back to the sync method
+	 * (serialize_against_pte_lookup)
+	 */
+	return 1;
+}
+EXPORT_SYMBOL(running_lockless_pgtbl_walk);
+
 /*
  * We use this to invalidate a pmdp entry before switching from a
  * hugepte to regular pmd entry.