[v5,01/11] asm-generic/pgtable: Adds generic functions to monitor lockless pgtable walks

Message ID 20191003013325.2614-2-leonardo@linux.ibm.com
State New
Series
  • Introduces new count-based method for tracking lockless pagetable walks

Commit Message

Leonardo Bras Oct. 3, 2019, 1:33 a.m. UTC
It's necessary to monitor lockless pagetable walks, in order to avoid doing
THP splitting/collapsing during them.

Some methods rely on irq enable/disable, but that can be slow in cases
where a lot of cpus are used by the process, given that all these cpus
have to run an IPI.

In order to speed up some cases, I propose a refcount-based approach,
that counts the number of lockless pagetable walks happening on the
process. If this count is zero, it skips the irq-oriented method.

Given that there are lockless pagetable walks in generic code, it's
necessary to create documented generic functions that should be enough for
most archs, while staying open to arch-specific implementations.

This method does not exclude the current irq-oriented method. It works as a
complement to skip unnecessary waiting.

begin_lockless_pgtbl_walk(mm)
        Insert before starting any lockless pgtable walk
end_lockless_pgtbl_walk(mm)
        Insert after the end of any lockless pgtable walk
        (Mostly after the ptep is last used)
running_lockless_pgtbl_walk(mm)
        Returns the number of lockless pgtable walks running

While the config option is not set, the method is disabled and these
functions only do what was already needed for lockless pagetable walks
(disabling interrupts). A memory barrier was also added just to make sure
there is no speculative read outside the interrupt-disabled area.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 include/asm-generic/pgtable.h | 58 +++++++++++++++++++++++++++++++++++
 include/linux/mm_types.h      | 11 +++++++
 kernel/fork.c                 |  3 ++
 3 files changed, 72 insertions(+)

Comments

Peter Zijlstra Oct. 3, 2019, 7:11 a.m. UTC | #1
On Wed, Oct 02, 2019 at 10:33:15PM -0300, Leonardo Bras wrote:
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 818691846c90..3043ea9812d5 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -1171,6 +1171,64 @@ static inline bool arch_has_pfn_modify_check(void)
>  #endif
>  #endif
>  
> +#ifndef __HAVE_ARCH_LOCKLESS_PGTBL_WALK_CONTROL
> +static inline unsigned long begin_lockless_pgtbl_walk(struct mm_struct *mm)
> +{
> +	unsigned long irq_mask;
> +
> +	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
> +		atomic_inc(&mm->lockless_pgtbl_walkers);

This will not work for file backed THP. Also, this is a fairly serious
contention point all on its own.

> +	/*
> +	 * Interrupts must be disabled during the lockless page table walk.
> +	 * That's because the deleting or splitting involves flushing TLBs,
> +	 * which in turn issues interrupts, that will block when disabled.
> +	 */
> +	local_irq_save(irq_mask);
> +
> +	/*
> +	 * This memory barrier pairs with any code that is either trying to
> +	 * delete page tables, or split huge pages. Without this barrier,
> +	 * the page tables could be read speculatively outside of interrupt
> +	 * disabling.
> +	 */
> +	smp_mb();

I don't think this is something smp_mb() can guarantee. smp_mb() is
defined to order memory accesses, in this case the store of the old
flags vs whatever comes after this.

It cannot (in generic) order against completion of prior instructions,
like clearing the interrupt enabled flags.

Possibly you want barrier_nospec().

> +
> +	return irq_mask;
> +}
Peter Zijlstra Oct. 3, 2019, 11:51 a.m. UTC | #2
On Thu, Oct 03, 2019 at 09:11:45AM +0200, Peter Zijlstra wrote:
> On Wed, Oct 02, 2019 at 10:33:15PM -0300, Leonardo Bras wrote:
> > diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> > index 818691846c90..3043ea9812d5 100644
> > --- a/include/asm-generic/pgtable.h
> > +++ b/include/asm-generic/pgtable.h
> > @@ -1171,6 +1171,64 @@ static inline bool arch_has_pfn_modify_check(void)
> >  #endif
> >  #endif
> >  
> > +#ifndef __HAVE_ARCH_LOCKLESS_PGTBL_WALK_CONTROL
> > +static inline unsigned long begin_lockless_pgtbl_walk(struct mm_struct *mm)
> > +{
> > +	unsigned long irq_mask;
> > +
> > +	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
> > +		atomic_inc(&mm->lockless_pgtbl_walkers);
> 
> This will not work for file backed THP. Also, this is a fairly serious
> contention point all on its own.

Kiryl says we have tmpfs-thp, this would be broken vs that, as would
your (PowerPC) use of mm_cpumask() for that IPI.

> > +	/*
> > +	 * Interrupts must be disabled during the lockless page table walk.
> > +	 * That's because the deleting or splitting involves flushing TLBs,
> > +	 * which in turn issues interrupts, that will block when disabled.
> > +	 */
> > +	local_irq_save(irq_mask);
> > +
> > +	/*
> > +	 * This memory barrier pairs with any code that is either trying to
> > +	 * delete page tables, or split huge pages. Without this barrier,
> > +	 * the page tables could be read speculatively outside of interrupt
> > +	 * disabling.
> > +	 */
> > +	smp_mb();
> 
> I don't think this is something smp_mb() can guarantee. smp_mb() is
> defined to order memory accesses, in this case the store of the old
> flags vs whatever comes after this.
> 
> It cannot (in generic) order against completion of prior instructions,
> like clearing the interrupt enabled flags.
> 
> Possibly you want barrier_nospec().

I'm still really confused about this barrier. It just doesn't make
sense.

If an interrupt happens before the local_irq_disable()/save(), then it
will discard any and all speculation that would be in progress to handle
the exception.

If there isn't an interrupt (or it happens after disable) it is
irrelevant.

Specifically, that serialize-IPI thing wants to ensure in-progress
lookups are complete, and I can't find a scenario where
local_irq_disable/enable() needs additional help vs IPIs. The moment an
interrupt lands it kills speculation and forces things into
program-order.

Did you perhaps want something like:

	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING)) {
		atomic_inc(&foo);
		smp_mb__after_atomic();
	}

	...

	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING)) {
		smp_mb__before_atomic();
		atomic_dec(&foo);
	}

To ensure everything happens inside of the increment?

And I still think all that wrong, you really shouldn't need to wait on
munmap().
John Hubbard Oct. 3, 2019, 8:40 p.m. UTC | #3
On 10/3/19 4:51 AM, Peter Zijlstra wrote:
> On Thu, Oct 03, 2019 at 09:11:45AM +0200, Peter Zijlstra wrote:
>> On Wed, Oct 02, 2019 at 10:33:15PM -0300, Leonardo Bras wrote:
...
> 
> I'm still really confused about this barrier. It just doesn't make
> sense.
> 
> If an interrupt happens before the local_irq_disable()/save(), then it
> will discard any and all speculation that would be in progress to handle
> the exception.
> 

Hi Peter,

So, would that imply that it's correct to apply approximately the following
patch:

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 1adbb8a371c7..cf41eff37e24 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -2099,9 +2099,9 @@ INTERRUPT DISABLING FUNCTIONS
 -----------------------------
 
 Functions that disable interrupts (ACQUIRE equivalent) and enable interrupts
-(RELEASE equivalent) will act as compiler barriers only.  So if memory or I/O
-barriers are required in such a situation, they must be provided from some
-other means.
+(RELEASE equivalent) will act as full memory barriers. This is because, for
+all supported CPU architectures, interrupt arrival causes all speculative
+memory accesses to be discarded.
 
?

We're also discussing this over in [1] ("mm: don't expose non-hugetlb page to
fast gup prematurely"), so I'm adding Paul to this thread here as well.

[1] https://lore.kernel.org/r/20191002092447.GC9320@quack2.suse.cz

thanks,
Leonardo Bras Oct. 3, 2019, 9:24 p.m. UTC | #4
Hello Peter, thanks for the feedback!

On Thu, 2019-10-03 at 13:51 +0200, Peter Zijlstra wrote:
> On Thu, Oct 03, 2019 at 09:11:45AM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 02, 2019 at 10:33:15PM -0300, Leonardo Bras wrote:
> > > diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> > > index 818691846c90..3043ea9812d5 100644
> > > --- a/include/asm-generic/pgtable.h
> > > +++ b/include/asm-generic/pgtable.h
> > > @@ -1171,6 +1171,64 @@ static inline bool arch_has_pfn_modify_check(void)
> > >  #endif
> > >  #endif
> > >  
> > > +#ifndef __HAVE_ARCH_LOCKLESS_PGTBL_WALK_CONTROL
> > > +static inline unsigned long begin_lockless_pgtbl_walk(struct mm_struct *mm)
> > > +{
> > > +	unsigned long irq_mask;
> > > +
> > > +	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
> > > +		atomic_inc(&mm->lockless_pgtbl_walkers);
> > 
> > This will not work for file backed THP. Also, this is a fairly serious
> > contention point all on its own.
> 
> Kiryl says we have tmpfs-thp, this would be broken vs that, as would
> your (PowerPC) use of mm_cpumask() for that IPI.

Could you please explain it?
I mean, why this breaks tmpfs-thp?

Also, why mm_cpumask() is also broken?

> 
> > > +	/*
> > > +	 * Interrupts must be disabled during the lockless page table walk.
> > > +	 * That's because the deleting or splitting involves flushing TLBs,
> > > +	 * which in turn issues interrupts, that will block when disabled.
> > > +	 */
> > > +	local_irq_save(irq_mask);
> > > +
> > > +	/*
> > > +	 * This memory barrier pairs with any code that is either trying to
> > > +	 * delete page tables, or split huge pages. Without this barrier,
> > > +	 * the page tables could be read speculatively outside of interrupt
> > > +	 * disabling.
> > > +	 */
> > > +	smp_mb();
> > 
> > I don't think this is something smp_mb() can guarantee. smp_mb() is
> > defined to order memory accesses, in this case the store of the old
> > flags vs whatever comes after this.
> > 
> > It cannot (in generic) order against completion of prior instructions,
> > like clearing the interrupt enabled flags.
> > 
> > Possibly you want barrier_nospec().
> 
> I'm still really confused about this barrier. It just doesn't make
> sense.
> 
> If an interrupt happens before the local_irq_disable()/save(), then it
> will discard any and all speculation that would be in progress to handle
> the exception.
> 
> If there isn't an interrupt (or it happens after disable) it is
> irrelevant.
> 
> Specifically, that serialize-IPI thing wants to ensure in-progress
> lookups are complete, and I can't find a scenario where
> local_irq_disable/enable() needs additional help vs IPIs. The moment an
> interrupt lands it kills speculation and forces things into
> program-order.
> 
> Did you perhaps want something like:
> 
> 	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING)) {
> 		atomic_inc(&foo);
> 		smp_mb__after_atomic();
> 	}
> 
> 	...
> 
> 	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING)) {
> 		smp_mb__before_atomic();
> 		atomic_dec(&foo);
> 	}
> 
> To ensure everything happens inside of the increment?
> 

I need to rethink this barrier, but yes, I think that's it.
It's how it was on v4.

I changed it because I thought it would be better this way. Well,
it was probably a mistake on my part.

> And I still think all that wrong, you really shouldn't need to wait on
> munmap().

That is something I need to understand better. I mean, before coming up
with this patch, I thought exactly this: don't serialize on munmap.

But along the way I was convinced it would not work for munmap. I need to
recall why, and if it was wrong to assume this, re-think the whole
solution.

Best regards,

Leonardo Brás
Peter Zijlstra Oct. 4, 2019, 11:24 a.m. UTC | #5
On Thu, Oct 03, 2019 at 01:40:38PM -0700, John Hubbard wrote:
> On 10/3/19 4:51 AM, Peter Zijlstra wrote:
> > On Thu, Oct 03, 2019 at 09:11:45AM +0200, Peter Zijlstra wrote:
> >> On Wed, Oct 02, 2019 at 10:33:15PM -0300, Leonardo Bras wrote:
> ...
> > 
> > I'm still really confused about this barrier. It just doesn't make
> > sense.
> > 
> > If an interrupt happens before the local_irq_disable()/save(), then it
> > will discard any and all speculation that would be in progress to handle
> > the exception.
> > 
> 
> Hi Peter,
> 
> So, would that imply that it's correct to apply approximately the following
> patch:
> 
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index 1adbb8a371c7..cf41eff37e24 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -2099,9 +2099,9 @@ INTERRUPT DISABLING FUNCTIONS
>  -----------------------------
>  
>  Functions that disable interrupts (ACQUIRE equivalent) and enable interrupts
> -(RELEASE equivalent) will act as compiler barriers only.  So if memory or I/O
> -barriers are required in such a situation, they must be provided from some
> -other means.
> +(RELEASE equivalent) will act as full memory barriers. This is because, for
> +all supported CPU architectures, interrupt arrival causes all speculative
> +memory accesses to be discarded.
>  
> ?

No, you're misunderstanding. They imply nothing of the sort.
Peter Zijlstra Oct. 4, 2019, 11:28 a.m. UTC | #6
On Thu, Oct 03, 2019 at 06:24:07PM -0300, Leonardo Bras wrote:
> Hello Peter, thanks for the feedback!
> 
> On Thu, 2019-10-03 at 13:51 +0200, Peter Zijlstra wrote:
> > On Thu, Oct 03, 2019 at 09:11:45AM +0200, Peter Zijlstra wrote:
> > > On Wed, Oct 02, 2019 at 10:33:15PM -0300, Leonardo Bras wrote:
> > > > diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> > > > index 818691846c90..3043ea9812d5 100644
> > > > --- a/include/asm-generic/pgtable.h
> > > > +++ b/include/asm-generic/pgtable.h
> > > > @@ -1171,6 +1171,64 @@ static inline bool arch_has_pfn_modify_check(void)
> > > >  #endif
> > > >  #endif
> > > >  
> > > > +#ifndef __HAVE_ARCH_LOCKLESS_PGTBL_WALK_CONTROL
> > > > +static inline unsigned long begin_lockless_pgtbl_walk(struct mm_struct *mm)
> > > > +{
> > > > +	unsigned long irq_mask;
> > > > +
> > > > +	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
> > > > +		atomic_inc(&mm->lockless_pgtbl_walkers);
> > > 
> > > This will not work for file backed THP. Also, this is a fairly serious
> > > contention point all on its own.
> > 
> > Kiryl says we have tmpfs-thp, this would be broken vs that, as would
> > your (PowerPC) use of mm_cpumask() for that IPI.
> 
> Could you please explain it?
> I mean, why this breaks tmpfs-thp?
> Also, why mm_cpumask() is also broken?

Because shared pages are not bound by a mm; or does it not share the thp
state between mappings?

> > And I still think all that wrong, you really shouldn't need to wait on
> > munmap().
> 
> That is something I need to better understand. I mean, before coming
> with this patch, I thought exactly this: not serialize when on munmap. 
> 
> But on the way I was convinced it would not work on munmap. I need to
> recall why, and if it was false to assume this, re-think the whole
> solution.

And once you (re)figure it out, please write it down. It is a crucial
bit of the puzzle and needs to be part of the Changelogs.
Aneesh Kumar K.V Oct. 5, 2019, 8:35 a.m. UTC | #7
On 10/3/19 5:21 PM, Peter Zijlstra wrote:
> On Thu, Oct 03, 2019 at 09:11:45AM +0200, Peter Zijlstra wrote:
>> On Wed, Oct 02, 2019 at 10:33:15PM -0300, Leonardo Bras wrote:


....


> 
> And I still think all that wrong, you really shouldn't need to wait on
> munmap().
> 

I do have a patch that does something like that.


+#define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR_FULL
+static inline pmd_t pmdp_huge_get_and_clear_full(struct mm_struct *mm,
+						 unsigned long address, pmd_t *pmdp,
+						 int full)
+{
+	bool serialize = true;
+	/*
+	 * We don't need to serialize against a lockless page table walk if
+	 * we are clearing the pmd due to task exit. For regular munmap, we
+	 * still need to serialize due to the possibility of MADV_DONTNEED running
+	 * parallel to a page fault which can convert a THP pte entry to a
+	 * pointer to level 4 table.
+	 * Here MADV_DONTNEED is removing the THP entry and the fault is filling
+	 * a level 4 pte.
+	 */
+	if (full == 1)
+		serialize = false;
+	return __pmdp_huge_get_and_clear(mm, address, pmdp, serialize);
  }


If it is a fullmm flush we can skip that serialize, but for everything
else we need to serialize. MADV_DONTNEED is another case. I haven't sent
this yet, because I was trying to look at what it takes to switch that
MADV variant to take mmap_sem in write mode.

MADV_DONTNEED has caused us multiple issues due to the fact that it can
run in parallel to page faults. I am not sure whether we have a
known/noticeable performance gain in allowing that with mmap_sem held in
read mode.




-aneesh
Kirill A. Shutemov Oct. 8, 2019, 2:47 p.m. UTC | #8
On Sat, Oct 05, 2019 at 02:05:29PM +0530, Aneesh Kumar K.V wrote:
> MADV_DONTNEED has caused us multiple issues due to the fact that it can run
> in parallel to page fault. I am not sure whether we have a known/noticeable
> performance gain in allowing that with mmap_sem held in read mode.

Yes we do. MADV_DONTNEED is used a lot by userspace memory allocators and
it would be a very noticeable performance regression if we switched it to
down_write(mmap_sem).
Leonardo Bras Oct. 9, 2019, 6:09 p.m. UTC | #9
On Fri, 2019-10-04 at 13:28 +0200, Peter Zijlstra wrote:
> > Could you please explain it?
> > I mean, why this breaks tmpfs-thp?
> > Also, why mm_cpumask() is also broken?
> 
> Because shared pages are not bound by a mm; or does it not share the thp
> state between mappings?

From what I could understand, even though the memory is shared, the
mapping may differ between processes (i.e. the same physical
memory that is mapped as a hugepage in process A can be mapped as a lot
of small pages in process B).

Did I miss something here?

> And once you (re)figure it out, please write it down. It is a crucial
> bit of the puzzle and needs to be part of the Changelogs.

I am still investing time studying this. More on this later :)

Thanks!

Patch

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 818691846c90..3043ea9812d5 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1171,6 +1171,64 @@  static inline bool arch_has_pfn_modify_check(void)
 #endif
 #endif
 
+#ifndef __HAVE_ARCH_LOCKLESS_PGTBL_WALK_CONTROL
+static inline unsigned long begin_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+	unsigned long irq_mask;
+
+	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
+		atomic_inc(&mm->lockless_pgtbl_walkers);
+
+	/*
+	 * Interrupts must be disabled during the lockless page table walk.
+	 * That's because the deleting or splitting involves flushing TLBs,
+	 * which in turn issues interrupts, that will block when disabled.
+	 */
+	local_irq_save(irq_mask);
+
+	/*
+	 * This memory barrier pairs with any code that is either trying to
+	 * delete page tables, or split huge pages. Without this barrier,
+	 * the page tables could be read speculatively outside of interrupt
+	 * disabling.
+	 */
+	smp_mb();
+
+	return irq_mask;
+}
+
+static inline void end_lockless_pgtbl_walk(struct mm_struct *mm,
+					   unsigned long irq_mask)
+{
+	/*
+	 * This memory barrier pairs with any code that is either trying to
+	 * delete page tables, or split huge pages. Without this barrier,
+	 * the page tables could be read speculatively outside of interrupt
+	 * disabling.
+	 */
+	smp_mb();
+
+	/*
+	 * Interrupts must be disabled during the lockless page table walk.
+	 * That's because the deleting or splitting involves flushing TLBs,
+	 * which in turn issues interrupts, that will block when disabled.
+	 */
+	local_irq_restore(irq_mask);
+
+	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
+		atomic_dec(&mm->lockless_pgtbl_walkers);
+}
+
+static inline int running_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
+		return atomic_read(&mm->lockless_pgtbl_walkers);
+
+	/* If disabled, must return > 0, so it falls back to sync method */
+	return 1;
+}
+#endif
+
 /*
  * On some architectures it depends on the mm if the p4d/pud or pmd
  * layer of the page table hierarchy is folded or not.
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2222fa795284..277462f0b4fd 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -521,6 +521,17 @@  struct mm_struct {
 		struct work_struct async_put_work;
 	} __randomize_layout;
 
+#ifdef CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING
+	/*
+	 * Number of callers who are doing a lockless walk of the
+	 * page tables. Typically arches might enable this in order to
+	 * help optimize performance, by possibly avoiding expensive
+	 * IPIs at the wrong times.
+	 */
+	atomic_t lockless_pgtbl_walkers;
+
+#endif
+
 	/*
 	 * The mm_cpumask needs to be at the end of mm_struct, because it
 	 * is dynamically sized based on nr_cpu_ids.
diff --git a/kernel/fork.c b/kernel/fork.c
index f9572f416126..2cbca867f5a5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1029,6 +1029,9 @@  static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 #endif
 	mm_init_uprobes_state(mm);
 
+#ifdef CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING
+	atomic_set(&mm->lockless_pgtbl_walkers, 0);
+#endif
 	if (current->mm) {
 		mm->flags = current->mm->flags & MMF_INIT_MASK;
 		mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;