
[v4,01/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks

Message ID 20190927234008.11513-2-leonardo@linux.ibm.com (mailing list archive)
State New, archived
Series Introduces new count-based method for monitoring lockless pagetable walks

Commit Message

Leonardo Bras Sept. 27, 2019, 11:39 p.m. UTC
It's necessary to monitor lockless pagetable walks in order to avoid doing
THP splitting/collapsing during them.

Some methods rely on local_irq_{save,restore}, but that can be slow in
cases where a lot of CPUs are used by the process.

In order to speed up some cases, I propose a refcount-based approach that
counts the number of lockless pagetable walks happening on the process.

This method does not replace the current irq-oriented method. It works as a
complement to skip unnecessary waiting.

start_lockless_pgtbl_walk(mm)
	Insert before starting any lockless pgtable walk
end_lockless_pgtbl_walk(mm)
	Insert after the end of any lockless pgtable walk
	(typically right after the last use of the returned ptep)
running_lockless_pgtbl_walk(mm)
	Returns the number of lockless pgtable walks running
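
As a minimal sketch of the intended usage (hypothetical caller code, given
a struct mm_struct *mm and an unsigned long addr; __find_linux_pte() is one
such lockless walker):

	pte_t *ptep, pte;
	bool is_thp;
	unsigned int shift;

	start_lockless_pgtbl_walk(mm);
	ptep = __find_linux_pte(mm->pgd, addr, &is_thp, &shift);
	if (ptep)
		pte = READ_ONCE(*ptep);
	/* last use of ptep: pgtables may be freed/split after this */
	end_lockless_pgtbl_walk(mm);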

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
 arch/powerpc/mm/book3s64/mmu_context.c   |  1 +
 arch/powerpc/mm/book3s64/pgtable.c       | 45 ++++++++++++++++++++++++
 3 files changed, 49 insertions(+)

Comments

John Hubbard Sept. 29, 2019, 10:40 p.m. UTC | #1
On 9/27/19 4:39 PM, Leonardo Bras wrote:
> It's necessary to monitor lockless pagetable walks in order to avoid doing
> THP splitting/collapsing during them.
> 
> Some methods rely on local_irq_{save,restore}, but that can be slow in
> cases where a lot of CPUs are used by the process.
> 
> In order to speed up some cases, I propose a refcount-based approach that
> counts the number of lockless pagetable walks happening on the process.
> 
> This method does not replace the current irq-oriented method. It works as a
> complement to skip unnecessary waiting.
> 
> start_lockless_pgtbl_walk(mm)
> 	Insert before starting any lockless pgtable walk
> end_lockless_pgtbl_walk(mm)
> 	Insert after the end of any lockless pgtable walk
> 	(typically right after the last use of the returned ptep)
> running_lockless_pgtbl_walk(mm)
> 	Returns the number of lockless pgtable walks running
> 
> Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
> ---
>   arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
>   arch/powerpc/mm/book3s64/mmu_context.c   |  1 +
>   arch/powerpc/mm/book3s64/pgtable.c       | 45 ++++++++++++++++++++++++
>   3 files changed, 49 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
> index 23b83d3593e2..13b006e7dde4 100644
> --- a/arch/powerpc/include/asm/book3s/64/mmu.h
> +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
> @@ -116,6 +116,9 @@ typedef struct {
>   	/* Number of users of the external (Nest) MMU */
>   	atomic_t copros;
>   
> +	/* Number of running instances of lockless pagetable walk */
> +	atomic_t lockless_pgtbl_walk_count;
> +
>   	struct hash_mm_context *hash_context;
>   
>   	unsigned long vdso_base;
> diff --git a/arch/powerpc/mm/book3s64/mmu_context.c b/arch/powerpc/mm/book3s64/mmu_context.c
> index 2d0cb5ba9a47..3dd01c0ca5be 100644
> --- a/arch/powerpc/mm/book3s64/mmu_context.c
> +++ b/arch/powerpc/mm/book3s64/mmu_context.c
> @@ -200,6 +200,7 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
>   #endif
>   	atomic_set(&mm->context.active_cpus, 0);
>   	atomic_set(&mm->context.copros, 0);
> +	atomic_set(&mm->context.lockless_pgtbl_walk_count, 0);
>   
>   	return 0;
>   }
> diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
> index 7d0e0d0d22c4..6ba6195bff1b 100644
> --- a/arch/powerpc/mm/book3s64/pgtable.c
> +++ b/arch/powerpc/mm/book3s64/pgtable.c
> @@ -98,6 +98,51 @@ void serialize_against_pte_lookup(struct mm_struct *mm)
>   	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
>   }
>   
> +/*
> + * Counting method to monitor lockless pagetable walks:
> + * Uses start_lockless_pgtbl_walk and end_lockless_pgtbl_walk to track the
> + * number of lockless pgtable walks happening, and
> + * running_lockless_pgtbl_walk to return this value.
> + */
> +
> +/* start_lockless_pgtbl_walk: Must be inserted before a function call that does
> + *   lockless pagetable walks, such as __find_linux_pte()
> + */
> +void start_lockless_pgtbl_walk(struct mm_struct *mm)
> +{
> +	atomic_inc(&mm->context.lockless_pgtbl_walk_count);
> +	/* Avoid reordering, to guarantee that the increment happens before
> +	 * any part of the lockless pagetable walk that follows it.
> +	 */
> +	smp_mb();
> +}
> +EXPORT_SYMBOL(start_lockless_pgtbl_walk);
> +
> +/*
> + * end_lockless_pgtbl_walk: Must be inserted after the last use of a pointer
> + *   returned by a lockless pagetable walk, such as __find_linux_pte()
> + */
> +void end_lockless_pgtbl_walk(struct mm_struct *mm)
> +{
> +	/* Avoid reordering, to guarantee that the decrement only happens after
> +	 * the last use of the ptep returned by the lockless pagetable walk.
> +	 */
> +	smp_mb();
> +	atomic_dec(&mm->context.lockless_pgtbl_walk_count);
> +}
> +EXPORT_SYMBOL(end_lockless_pgtbl_walk);
> +
> +/*
> + * running_lockless_pgtbl_walk: Returns the number of lockless pagetable walks
> + *   currently running. If it returns 0, there is no running pagetable walk, and
> + *   THP split/collapse can be safely done. This can be used to avoid more
> + *   expensive approaches like serialize_against_pte_lookup()
> + */
> +int running_lockless_pgtbl_walk(struct mm_struct *mm)
> +{
> +	return atomic_read(&mm->context.lockless_pgtbl_walk_count);
> +}
> +
>   /*
>    * We use this to invalidate a pmdp entry before switching from a
>    * hugepte to regular pmd entry.
> 

Hi, Leonardo,

Can we please do it as shown below, instead (compile-tested only)?

This addresses all of the comments that I was going to make about structure
of this patch, which are:

* The lockless synch is tricky, so it should be encapsulated in function
   calls if possible.

* This is really a core mm function, so don't hide it away in arch layers.
   (If you're changing mm/ files, that's a big hint.)

* Other things need parts of this: gup.c needs the memory barriers; IMHO you'll
   be fixing a pre-existing, theoretical (we've never seen bug reports) problem.

* The documentation needs to accurately explain what's going on here.

(Not shown: one or more of the PPC Kconfig files should select
LOCKLESS_PAGE_TABLE_WALK_TRACKING.)
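
(I.e., something along these lines — the exact option to hang the select
off of is just a guess on my part:)

config PPC_BOOK3S_64
	# ...existing selects omitted...
	select LOCKLESS_PAGE_TABLE_WALK_TRACKING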

So:


diff --git a/include/linux/mm.h b/include/linux/mm.h
index 294a67b94147..c9e5defb4d7e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1541,6 +1541,9 @@ int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
  int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
  			struct task_struct *task, bool bypass_rlim);
  
+void register_lockless_page_table_walker(unsigned long *flags);
+void deregister_lockless_page_table_walker(unsigned long *flags);
+
  /* Container for pinned pfns / pages */
  struct frame_vector {
  	unsigned int nr_allocated;	/* Number of frames we have space for */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5183e0d77dfa..83b7930a995f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -403,6 +403,16 @@ struct mm_struct {
  		 */
  		atomic_t mm_count;
  
+#ifdef CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING
+		/*
+		 * Number of callers who are doing a lockless walk of the
+		 * page tables. Typically arches might enable this in order to
+		 * help optimize performance, by possibly avoiding expensive
+		 * IPIs at the wrong times.
+		 */
+		atomic_t lockless_pgtbl_nr_walkers;
+#endif
+
  #ifdef CONFIG_MMU
  		atomic_long_t pgtables_bytes;	/* PTE page table pages */
  #endif
diff --git a/mm/Kconfig b/mm/Kconfig
index a5dae9a7eb51..1cf58f668fe1 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -736,4 +736,15 @@ config ARCH_HAS_PTE_SPECIAL
  config ARCH_HAS_HUGEPD
  	bool
  
+config LOCKLESS_PAGE_TABLE_WALK_TRACKING
+	bool "Tracking (and optimization) of lockless page table walkers"
+	default n
+
+	help
+	  Maintain a reference count of active lockless page table
+	  walkers. This adds 4 bytes to struct mm size, and two atomic
+	  operations to calls such as get_user_pages_fast(). Some
+	  architectures can optimize page table operations if this
+	  is enabled.
+
  endmenu
diff --git a/mm/gup.c b/mm/gup.c
index 60c3915c8ee6..7b1be8ed1e8f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2302,6 +2302,62 @@ static bool gup_fast_permitted(unsigned long start, unsigned long end)
  }
  #endif
  
+/*
+ * register_lockless_page_table_walker() - start a lockless page table walk
+ *
+ * @flags: for saving and restoring irq state
+ *
+ * Lockless page table walking still requires synchronization against freeing
+ * of the page tables, and against splitting of huge pages. This is done by
+ * interacting with interrupts, as first described in the struct mmu_table_batch
+ * comments in include/asm-generic/tlb.h.
+ *
+ * In order to do the right thing, code that walks page tables in the style of
+ * get_user_pages_fast() should call register_lockless_page_table_walker()
+ * before starting the walk, and deregister_lockless_page_table_walker() upon
+ * finishing.
+ */
+void register_lockless_page_table_walker(unsigned long *flags)
+{
+#ifdef CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING
+	atomic_inc(&current->mm->lockless_pgtbl_nr_walkers);
+#endif
+	/*
+	 * This memory barrier pairs with any code that is either trying to
+	 * delete page tables, or split huge pages. In order for that to work,
+	 * interrupts must also be disabled during the lockless page table
+	 * walk. That's because the deleting or splitting involves flushing
+	 * TLBs, which in turn issues interrupts, which will block here.
+	 * However, without memory barriers, the page tables could be
+	 * read speculatively outside of interrupt disabling.
+	 */
+	smp_mb();
+	local_irq_save(*flags);
+}
+EXPORT_SYMBOL_GPL(register_lockless_page_table_walker);
+
+/*
+ * deregister_lockless_page_table_walker() - finish a lockless page table walk
+ *
+ * This is the complement to register_lockless_page_table_walker().
+ *
+ * @flags: for saving and restoring irq state
+ */
+void deregister_lockless_page_table_walker(unsigned long *flags)
+{
+	local_irq_restore(*flags);
+	/*
+	 * This memory barrier pairs with any code that is either trying to
+	 * delete page tables, or split huge pages. See the comments in
+	 * register_lockless_page_table_walker() for details.
+	 */
+	smp_mb();
+#ifdef CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING
+	atomic_dec(&current->mm->lockless_pgtbl_nr_walkers);
+#endif
+}
+EXPORT_SYMBOL_GPL(deregister_lockless_page_table_walker);
+
  /*
   * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall back to
   * the regular GUP.
@@ -2341,9 +2397,9 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
  
  	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
  	    gup_fast_permitted(start, end)) {
-		local_irq_save(flags);
+		register_lockless_page_table_walker(&flags);
  		gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr);
-		local_irq_restore(flags);
+		deregister_lockless_page_table_walker(&flags);
  	}
  
  	return nr;
@@ -2392,7 +2448,7 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
  int get_user_pages_fast(unsigned long start, int nr_pages,
  			unsigned int gup_flags, struct page **pages)
  {
-	unsigned long addr, len, end;
+	unsigned long addr, len, end, flags;
  	int nr = 0, ret = 0;
  
  	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
@@ -2410,9 +2466,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
  
  	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
  	    gup_fast_permitted(start, end)) {
-		local_irq_disable();
+		register_lockless_page_table_walker(&flags);
  		gup_pgd_range(addr, end, gup_flags, pages, &nr);
-		local_irq_enable();
+		deregister_lockless_page_table_walker(&flags);
  		ret = nr;
  	}
  


thanks,
John Hubbard Sept. 29, 2019, 11:17 p.m. UTC | #2
On 9/29/19 3:40 PM, John Hubbard wrote:
> On 9/27/19 4:39 PM, Leonardo Bras wrote:
...
> +config LOCKLESS_PAGE_TABLE_WALK_TRACKING
> +    bool "Tracking (and optimization) of lockless page table walkers"
> +    default n
> +
> +    help
> +      Maintain a reference count of active lockless page table
> +      walkers. This adds 4 bytes to struct mm size, and two atomic
> +      operations to calls such as get_user_pages_fast(). Some
> +      architectures can optimize page table operations if this
> +      is enabled.
> +
>   endmenu

Actually, the above should be an internal-only config option (PPC arch can
auto-select it), so just:

+config LOCKLESS_PAGE_TABLE_WALK_TRACKING
+	bool

...because it's entirely up to other code (as opposed to other people)
as to whether this should be selected.

I got carried away. :)

thanks,
Leonardo Bras Sept. 30, 2019, 3:14 p.m. UTC | #3
On Sun, 2019-09-29 at 15:40 -0700, John Hubbard wrote:
> Hi, Leonardo,

Hello John, thanks for the feedback.

> Can we please do it as shown below, instead (compile-tested only)?
> 
> This addresses all of the comments that I was going to make about structure
> of this patch, which are:
> 
> * The lockless synch is tricky, so it should be encapsulated in function
>    calls if possible.

As I said before, there are cases where this function is called from
'real mode' in powerpc, which doesn't disable irqs and may behave in a
tricky way if we do. So, encapsulating the irq disable in this
function can be a bad choice.

Of course, if we really need that, we can add a bool parameter to the
function to choose whether to disable/enable irqs.
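
Something along these lines, as a rough sketch (untested, just to
illustrate the parameter):

void start_lockless_pgtbl_walk(struct mm_struct *mm, unsigned long *flags,
			       bool disable_irq)
{
	atomic_inc(&mm->context.lockless_pgtbl_walk_count);
	smp_mb();
	/* in realmode, MSR_EE=0 already keeps irqs off */
	if (disable_irq)
		local_irq_save(*flags);
}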
> 
> * This is really a core mm function, so don't hide it away in arch layers.
>    (If you're changing mm/ files, that's a big hint.)

My idea here is to let the arch decide on how this 'register' is going
to work, as archs may have different needs (in powerpc for example, we
can't always disable irqs, since we may be in realmode).

Maybe we can create a generic function instead of a dummy, and let it
be replaced in case the arch needs to do so.

> * Other things need parts of this: gup.c needs the memory barriers; IMHO you'll
>    be fixing a pre-existing, theoretical (we've never seen bug reports) problem.

Humm, you are right. Here I would suggest adding the barrier to the
generic function.

> * The documentation needs to accurately explain what's going on here.

Yes, my documentation was probably not good enough due to my lack of
experience with memory barriers (I learnt about using them last week,
and tried to come up with the best solution).

> (Not shown: one or more of the PPC Kconfig files should select
> LOCKLESS_PAGE_TABLE_WALK_TRACKING.)

The way it works today is to define it in the platform pgtable.h. I agree
that using Kconfig may be a better solution, making this config more
visible to enable/disable.

Thanks for the feedback,

Leonardo Bras
John Hubbard Sept. 30, 2019, 5:57 p.m. UTC | #4
On 9/30/19 8:14 AM, Leonardo Bras wrote:
> On Sun, 2019-09-29 at 15:40 -0700, John Hubbard wrote:
>> Hi, Leonardo,
> 
> Hello John, thanks for the feedback.
> 
>> Can we please do it as shown below, instead (compile-tested only)?
>>
>> This addresses all of the comments that I was going to make about structure
>> of this patch, which are:
>>
>> * The lockless synch is tricky, so it should be encapsulated in function
>>     calls if possible.
> 
> As I said before, there are cases where this function is called from
> 'real mode' in powerpc, which doesn't disable irqs and may behave in a
> tricky way if we do. So, encapsulating the irq disable in this
> function can be a bad choice.

You still haven't explained how this works in that case. So far, the
synchronization we've discussed has depended upon interrupt disabling
as part of the solution, in order to hold off page splitting and page
table freeing.

Simply skipping that means that an additional mechanism is required...which
btw might involve a new, ppc-specific routine, so maybe this is going to end
up pretty close to what I pasted in after all...

> 
> Of course, if we really need that, we can add a bool parameter to the
> function to choose whether to disable/enable irqs.
>>
>> * This is really a core mm function, so don't hide it away in arch layers.
>>     (If you're changing mm/ files, that's a big hint.)
> 
> My idea here is to let the arch decide on how this 'register' is going
> to work, as archs may have different needs (in powerpc for example, we
> can't always disable irqs, since we may be in realmode).
> 
> Maybe we can create a generic function instead of a dummy, and let it
> be replaced in case the arch needs to do so.

Yes, that might be what we need, if it turns out that ppc can't use this
approach (although let's see about that).


thanks,
Leonardo Bras Sept. 30, 2019, 6:42 p.m. UTC | #5
On Mon, 2019-09-30 at 10:57 -0700, John Hubbard wrote:
> > As I said before, there are cases where this function is called from
> > 'real mode' in powerpc, which doesn't disable irqs and may behave in a
> > tricky way if we do. So, encapsulating the irq disable in this
> > function can be a bad choice.
> 
> You still haven't explained how this works in that case. So far, the
> synchronization we've discussed has depended upon interrupt disabling
> as part of the solution, in order to hold off page splitting and page
> table freeing.

The irqs are already disabled by another mechanism (hw): MSR_EE=0.
So, serialize will work as expected.

> Simply skipping that means that an additional mechanism is required...which
> btw might involve a new, ppc-specific routine, so maybe this is going to end
> up pretty close to what I pasted in after all...
> > Of course, if we really need that, we can add a bool parameter to the
> > function to choose whether to disable/enable irqs.
> > > * This is really a core mm function, so don't hide it away in arch layers.
> > >     (If you're changing mm/ files, that's a big hint.)
> > 
> > My idea here is to let the arch decide on how this 'register' is going
> > to work, as archs may have different needs (in powerpc for example, we
> > can't always disable irqs, since we may be in realmode).
> > 
> > Maybe we can create a generic function instead of a dummy, and let it
> > be replaced in case the arch needs to do so.
> 
> Yes, that might be what we need, if it turns out that ppc can't use this
> approach (although let's see about that).
> 

I initially used the dummy approach because I did not see anything like
serialize in other archs. 

I mean, even if I put some generic function here, if there is no
function to use the 'lockless_pgtbl_walk_count', it becomes pure
overhead.

> 
> thanks,

Thank you!
John Hubbard Sept. 30, 2019, 9:47 p.m. UTC | #6
On 9/30/19 11:42 AM, Leonardo Bras wrote:
> On Mon, 2019-09-30 at 10:57 -0700, John Hubbard wrote:
>>> As I said before, there are cases where this function is called from
>>> 'real mode' in powerpc, which doesn't disable irqs and may behave in a
>>> tricky way if we do. So, encapsulating the irq disable in this
>>> function can be a bad choice.
>>
>> You still haven't explained how this works in that case. So far, the
>> synchronization we've discussed has depended upon interrupt disabling
>> as part of the solution, in order to hold off page splitting and page
>> table freeing.
> 
> The irqs are already disabled by another mechanism (hw): MSR_EE=0.
> So, serialize will work as expected.

I get that they're disabled. But will this interlock with the code that
issues IPIs?? Because it's not just disabling interrupts that matters, but
rather, synchronizing with the code (TLB flushing) that *happens* to 
require issuing IPIs, which in turn interact with disabling interrupts.

So I'm still not seeing how that could work here, unless there is something
interesting about the smp_call_function_many() on ppc with MSR_EE=0 mode...?

> 
>> Simply skipping that means that an additional mechanism is required...which
>> btw might involve a new, ppc-specific routine, so maybe this is going to end
>> up pretty close to what I pasted in after all...
>>> Of course, if we really need that, we can add a bool parameter to the
>>> function to choose whether to disable/enable irqs.
>>>> * This is really a core mm function, so don't hide it away in arch layers.
>>>>     (If you're changing mm/ files, that's a big hint.)
>>>
>>> My idea here is to let the arch decide on how this 'register' is going
>>> to work, as archs may have different needs (in powerpc for example, we
>>> can't always disable irqs, since we may be in realmode).

Yes, the tension there is that a) some things are per-arch, and b) it's easy 
to get it wrong. The commit below (d9101bfa6adc) is IMHO a perfect example of
that.

So, I would like core mm/ functions that guide the way, but the interrupt
behavior complicates it. I think your original passing of just the mm_struct
is probably the right balance, assuming that I'm wrong about interrupts.


>>>
>>> Maybe we can create a generic function instead of a dummy, and let it
>>> be replaced in case the arch needs to do so.
>>
>> Yes, that might be what we need, if it turns out that ppc can't use this
>> approach (although let's see about that).
>>
> 
> I initially used the dummy approach because I did not see anything like
> serialize in other archs. 
> 
> I mean, even if I put some generic function here, if there is no
> function to use the 'lockless_pgtbl_walk_count', it becomes pure
> overhead.
> 

Not really: the memory barrier is required in all cases, and this code
would be good I think:

+void register_lockless_pgtable_walker(struct mm_struct *mm)
+{
+#ifdef CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING
+       atomic_inc(&mm->lockless_pgtbl_nr_walkers);
+#endif
+       /*
+        * This memory barrier pairs with any code that is either trying to
+        * delete page tables, or split huge pages.
+        */
+       smp_mb();
+}
+EXPORT_SYMBOL_GPL(register_lockless_pgtable_walker);

And this is the same as your original patch, with just a minor name change:

@@ -2341,9 +2395,11 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 
        if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
            gup_fast_permitted(start, end)) {
+               register_lockless_pgtable_walker(current->mm);
                local_irq_save(flags);
                gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr);
                local_irq_restore(flags);
+               deregister_lockless_pgtable_walker(current->mm);


Btw, hopefully minor note: it also looks like there's a number of changes in the same 
area that conflict, for example:

    commit d9101bfa6adc ("powerpc/mm/mce: Keep irqs disabled during lockless 
         page table walk") <Aneesh Kumar K.V> (Thu, 19 Sep 2019)

...so it would be good to rebase this onto 5.4-rc1, now that that's here.


thanks,
Leonardo Bras Oct. 1, 2019, 6:39 p.m. UTC | #7
On Mon, 2019-09-30 at 14:47 -0700, John Hubbard wrote:
> On 9/30/19 11:42 AM, Leonardo Bras wrote:
> > On Mon, 2019-09-30 at 10:57 -0700, John Hubbard wrote:
> > > > As I said before, there are cases where this function is called from
> > > > 'real mode' in powerpc, which doesn't disable irqs and may behave in a
> > > > tricky way if we do. So, encapsulating the irq disable in this
> > > > function can be a bad choice.
> > > 
> > > You still haven't explained how this works in that case. So far, the
> > > synchronization we've discussed has depended upon interrupt disabling
> > > as part of the solution, in order to hold off page splitting and page
> > > table freeing.
> > 
> > The irqs are already disabled by another mechanism (hw): MSR_EE=0.
> > So, serialize will work as expected.
> 
> I get that they're disabled. But will this interlock with the code that
> issues IPIs?? Because it's not just disabling interrupts that matters, but
> rather, synchronizing with the code (TLB flushing) that *happens* to 
> require issuing IPIs, which in turn interact with disabling interrupts.
> 
> So I'm still not seeing how that could work here, unless there is something
> interesting about the smp_call_function_many() on ppc with MSR_EE=0 mode...?
> 

I am failing to understand the issue.
I mean, smp_call_function_many() will issue an IPI to each CPU in the
cpumask and wait for it to run before returning.
If interrupts are disabled (either by MSR_EE=0 or local_irq_disable),
the IPI will not run on that CPU, and the wait part will make sure to
block the calling thread until interrupts are enabled again.
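
That is also why the counter helps on the THP side: the expensive part
only has to run when a walker is really active. Roughly (sketch of a
split/collapse-side caller):

	if (running_lockless_pgtbl_walk(mm))
		serialize_against_pte_lookup(mm);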

Could you please point out the issue there?

> > > Simply skipping that means that an additional mechanism is required...which
> > > btw might involve a new, ppc-specific routine, so maybe this is going to end
> > > up pretty close to what I pasted in after all...
> > > > Of course, if we really need that, we can add a bool parameter to the
> > > > function to choose whether to disable/enable irqs.
> > > > > * This is really a core mm function, so don't hide it away in arch layers.
> > > > >     (If you're changing mm/ files, that's a big hint.)
> > > > 
> > > > My idea here is to let the arch decide on how this 'register' is going
> > > > to work, as archs may have different needs (in powerpc for example, we
> > > > can't always disable irqs, since we may be in realmode).
> 
> Yes, the tension there is that a) some things are per-arch, and b) it's easy 
> to get it wrong. The commit below (d9101bfa6adc) is IMHO a perfect example of
> that.
> 
> So, I would like core mm/ functions that guide the way, but the interrupt
> behavior complicates it. I think your original passing of just the mm_struct
> is probably the right balance, assuming that I'm wrong about interrupts.
> 

I think, for the generic function, that including {en,dis}abling the
interrupt is fine. I mean, if disabling the interrupt is the generic
behavior, it's ok.
I will just make sure to explain that the interrupt {en,dis}abling is
part of the sync process. If an arch doesn't like it, it can write a
specific function that does the sync in a better way (defining
__HAVE_ARCH_LOCKLESS_PGTBL_WALK_COUNTER to override the generic function).

In this case, the generic function would also include the ifdef'ed
atomic inc and the memory barrier.
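
Roughly like this (sketch only; I'm reusing the names from John's snippet,
and the flags parameter is the new part):

#ifndef __HAVE_ARCH_LOCKLESS_PGTBL_WALK_COUNTER
void register_lockless_pgtable_walker(struct mm_struct *mm,
				      unsigned long *flags)
{
#ifdef CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING
	atomic_inc(&mm->lockless_pgtbl_nr_walkers);
#endif
	/* pairs with the barrier on the pgtable-free / THP-split side */
	smp_mb();
	local_irq_save(*flags);
}
#endif /* __HAVE_ARCH_LOCKLESS_PGTBL_WALK_COUNTER */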

> 
> > > > Maybe we can create a generic function instead of a dummy, and let it
> > > > be replaced in case the arch needs to do so.
> > > 
> > > Yes, that might be what we need, if it turns out that ppc can't use this
> > > approach (although let's see about that).
> > > 
> > 
> > I initially used the dummy approach because I did not see anything like
> > serialize in other archs. 
> > 
> > I mean, even if I put some generic function here, if there is no
> > function to use the 'lockless_pgtbl_walk_count', it becomes pure
> > overhead.
> > 
> 
> Not really: the memory barrier is required in all cases, and this code
> would be good I think:
> 
> +void register_lockless_pgtable_walker(struct mm_struct *mm)
> +{
> +#ifdef CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING
> +       atomic_inc(&mm->lockless_pgtbl_nr_walkers);
> +#endif
> +       /*
> +        * This memory barrier pairs with any code that is either trying to
> +        * delete page tables, or split huge pages.
> +        */
> +       smp_mb();
> +}
> +EXPORT_SYMBOL_GPL(register_lockless_pgtable_walker);
> 
> And this is the same as your original patch, with just a minor name change:
> 
> @@ -2341,9 +2395,11 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  
>         if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
>             gup_fast_permitted(start, end)) {
> +               register_lockless_pgtable_walker(current->mm);
>                 local_irq_save(flags);
>                 gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr);
>                 local_irq_restore(flags);
> +               deregister_lockless_pgtable_walker(current->mm);
> 
> 
> Btw, hopefully minor note: it also looks like there's a number of changes in the same 
> area that conflict, for example:
> 
>     commit d9101bfa6adc ("powerpc/mm/mce: Keep irqs disabled during lockless 
>          page table walk") <Aneesh Kumar K.V> (Thu, 19 Sep 2019)
> 
> ...so it would be good to rebase this onto 5.4-rc1, now that that's here.
> 

Yeap, agree. Already rebased on top of v5.4-rc1.

> 
> thanks,

Thank you!
John Hubbard Oct. 1, 2019, 6:52 p.m. UTC | #8
On 10/1/19 11:39 AM, Leonardo Bras wrote:
> On Mon, 2019-09-30 at 14:47 -0700, John Hubbard wrote:
>> On 9/30/19 11:42 AM, Leonardo Bras wrote:
>>> On Mon, 2019-09-30 at 10:57 -0700, John Hubbard wrote:
...
> 
> I am failing to understand the issue.
> I mean, smp_call_function_many() will issue an IPI to each CPU in the
> cpumask and wait for it to run before returning.
> If interrupts are disabled (either by MSR_EE=0 or local_irq_disable),
> the IPI will not run on that CPU, and the wait part will make sure to
> block the calling thread until interrupts are enabled again.
> 
> Could you please point out the issue there?

The biggest problem here is evidently my not knowing much about ppc. :) 
So if that's how it behaves, then all is well, sorry it took me a while
to understand the MSR_EE=0 behavior.

> 
>>>> Simply skipping that means that an additional mechanism is required...which
>>>> btw might involve a new, ppc-specific routine, so maybe this is going to end
>>>> up pretty close to what I pasted in after all...
>>>>> Of course, if we really need that, we can add a bool parameter to the
>>>>> function to choose whether to disable/enable irqs.
>>>>>> * This is really a core mm function, so don't hide it away in arch layers.
>>>>>>     (If you're changing mm/ files, that's a big hint.)
>>>>>
>>>>> My idea here is to let the arch decide on how this 'register' is going
>>>>> to work, as archs may have different needs (in powerpc for example, we
>>>>> can't always disable irqs, since we may be in realmode).
>>
>> Yes, the tension there is that a) some things are per-arch, and b) it's easy 
>> to get it wrong. The commit below (d9101bfa6adc) is IMHO a perfect example of
>> that.
>>
>> So, I would like core mm/ functions that guide the way, but the interrupt
>> behavior complicates it. I think your original passing of just the mm_struct
>> is probably the right balance, assuming that I'm wrong about interrupts.
>>
> 
> I think, for the generic function, that including {en,dis}abling the
> interrupt is fine. I mean, if disabling the interrupt is the generic
> behavior, it's ok. 
> I will just make sure to explain that the interrupt {en,dis}abling is
> part of the sync process. If an arch don't like it, it can write a
> specific function that does the sync in a better way. (and defining
> __HAVE_ARCH_LOCKLESS_PGTBL_WALK_COUNTER to ignore the generic function)
> 

Tentatively, that sounds good. We still end up with the counter variable
directly in struct mm_struct, and the generic function in mm/gup.c 
(or mm/somewhere), where it's easy to find and see what's going on, sure.

thanks,

Patch

diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
index 23b83d3593e2..13b006e7dde4 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -116,6 +116,9 @@  typedef struct {
 	/* Number of users of the external (Nest) MMU */
 	atomic_t copros;
 
+	/* Number of running instances of lockless pagetable walk */
+	atomic_t lockless_pgtbl_walk_count;
+
 	struct hash_mm_context *hash_context;
 
 	unsigned long vdso_base;
diff --git a/arch/powerpc/mm/book3s64/mmu_context.c b/arch/powerpc/mm/book3s64/mmu_context.c
index 2d0cb5ba9a47..3dd01c0ca5be 100644
--- a/arch/powerpc/mm/book3s64/mmu_context.c
+++ b/arch/powerpc/mm/book3s64/mmu_context.c
@@ -200,6 +200,7 @@  int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
 #endif
 	atomic_set(&mm->context.active_cpus, 0);
 	atomic_set(&mm->context.copros, 0);
+	atomic_set(&mm->context.lockless_pgtbl_walk_count, 0);
 
 	return 0;
 }
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 7d0e0d0d22c4..6ba6195bff1b 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -98,6 +98,51 @@  void serialize_against_pte_lookup(struct mm_struct *mm)
 	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
 }
 
+/*
+ * Counting method to monitor lockless pagetable walks:
+ * Uses start_lockless_pgtbl_walk and end_lockless_pgtbl_walk to track the
+ * number of lockless pgtable walks happening, and
+ * running_lockless_pgtbl_walk to return this value.
+ */
+
+/* start_lockless_pgtbl_walk: Must be inserted before a function call that does
+ *   lockless pagetable walks, such as __find_linux_pte()
+ */
+void start_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+	atomic_inc(&mm->context.lockless_pgtbl_walk_count);
+	/* Avoid reordering, to guarantee that the increment happens before
+	 * any part of the lockless pagetable walk that follows it.
+	 */
+	smp_mb();
+}
+EXPORT_SYMBOL(start_lockless_pgtbl_walk);
+
+/*
+ * end_lockless_pgtbl_walk: Must be inserted after the last use of a pointer
+ *   returned by a lockless pagetable walk, such as __find_linux_pte()
+ */
+void end_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+	/* Avoid reordering, to guarantee that the decrement only happens after
+	 * the last use of the ptep returned by the lockless pagetable walk.
+	 */
+	smp_mb();
+	atomic_dec(&mm->context.lockless_pgtbl_walk_count);
+}
+EXPORT_SYMBOL(end_lockless_pgtbl_walk);
+
+/*
+ * running_lockless_pgtbl_walk: Returns the number of lockless pagetable walks
+ *   currently running. If it returns 0, there is no running pagetable walk, and
+ *   THP split/collapse can be safely done. This can be used to avoid more
+ *   expensive approaches like serialize_against_pte_lookup()
+ */
+int running_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+	return atomic_read(&mm->context.lockless_pgtbl_walk_count);
+}
+
 /*
  * We use this to invalidate a pmdp entry before switching from a
  * hugepte to regular pmd entry.