
[4/8] membarrier: Make the post-switch-mm barrier explicit

Message ID f184d013a255a523116b692db4996c5db2569e86.1623813516.git.luto@kernel.org (mailing list archive)
State New
Series membarrier cleanups

Commit Message

Andy Lutomirski June 16, 2021, 3:21 a.m. UTC
membarrier() needs a barrier after any CPU changes mm.  There is currently
a comment explaining why this barrier probably exists in all cases.  This
is very fragile -- any change to the relevant parts of the scheduler
might get rid of these barriers, and it's not really clear to me that
the barrier actually exists in all necessary cases.

Simplify the logic by adding an explicit barrier, and allow architectures
to override it as an optimization if they want to.

One of the deleted comments in this patch said "It is therefore
possible to schedule between user->kernel->user threads without
passing through switch_mm()".  It is possible to do this without, say,
writing to CR3 on x86, but the core scheduler indeed calls
switch_mm_irqs_off() to tell the arch code to go back from lazy mode
to no-lazy mode.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 include/linux/sched/mm.h | 21 +++++++++++++++++++++
 kernel/kthread.c         | 12 +-----------
 kernel/sched/core.c      | 35 +++++++++--------------------------
 3 files changed, 31 insertions(+), 37 deletions(-)
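[Editor's note] A rough userspace sketch of the helper this patch introduces may help. Hedged: `smp_mb()` is modeled with a C11 seq_cst fence plus an instrumentation counter, and the `MEMBARRIER_STATE_*` values below are illustrative stand-ins rather than the kernel's definitions; only the conditional-barrier shape mirrors the patch.

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative stand-ins for the kernel's membarrier state bits. */
#define MEMBARRIER_STATE_GLOBAL_EXPEDITED   (1U << 0)
#define MEMBARRIER_STATE_PRIVATE_EXPEDITED  (1U << 1)

static int fences_issued; /* instrumentation for this example only */

/* Userspace stand-in for the kernel's smp_mb(). */
static void smp_mb_stub(void)
{
    atomic_thread_fence(memory_order_seq_cst);
    fences_issued++;
}

/*
 * Sketch of the patch's membarrier_finish_switch_mm(): issue a full
 * barrier after switch_mm_irqs_off(), but only when the mm has a
 * membarrier-expedited mode registered, so non-membarrier tasks
 * pay nothing.
 */
static void membarrier_finish_switch_mm(int membarrier_state)
{
    if (membarrier_state & (MEMBARRIER_STATE_GLOBAL_EXPEDITED |
                            MEMBARRIER_STATE_PRIVATE_EXPEDITED))
        smp_mb_stub();
}
```

The point of the explicit helper is visible here: the barrier requirement lives in one named function rather than being implied by whichever `mmdrop()` or `switch_mm()` happens to sit on the scheduler path.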

Comments

Nicholas Piggin June 16, 2021, 4:19 a.m. UTC | #1
Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
> membarrier() needs a barrier after any CPU changes mm.  There is currently
> a comment explaining why this barrier probably exists in all cases.  This
> is very fragile -- any change to the relevant parts of the scheduler
> might get rid of these barriers, and it's not really clear to me that
> the barrier actually exists in all necessary cases.

The comments and barriers in the mmdrop() hunks? I don't see what is 
fragile or maybe-buggy about this. The barrier definitely exists.

And any change can change anything; that doesn't make it fragile. My
lazy tlb refcounting change avoids the mmdrop in some cases, but it
replaces it with an smp_mb(), for example.

If you have some later changes that require this, can you post them
or move this patch to them?

> 
> Simplify the logic by adding an explicit barrier, and allow architectures
> to override it as an optimization if they want to.
> 
> One of the deleted comments in this patch said "It is therefore
> possible to schedule between user->kernel->user threads without
> passing through switch_mm()".  It is possible to do this without, say,
> writing to CR3 on x86, but the core scheduler indeed calls
> switch_mm_irqs_off() to tell the arch code to go back from lazy mode
> to no-lazy mode.

Context switching between threads provides a barrier as well, so that
comment at least could stand to be improved.

Thanks,
Nick

> 
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  include/linux/sched/mm.h | 21 +++++++++++++++++++++
>  kernel/kthread.c         | 12 +-----------
>  kernel/sched/core.c      | 35 +++++++++--------------------------
>  3 files changed, 31 insertions(+), 37 deletions(-)
> 
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 10aace21d25e..c6eebbafadb0 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -341,6 +341,27 @@ enum {
>  	MEMBARRIER_FLAG_RSEQ		= (1U << 1),
>  };
>  
> +#ifdef CONFIG_MEMBARRIER
> +
> +/*
> + * Called by the core scheduler after calling switch_mm_irqs_off().
> + * Architectures that have implicit barriers when switching mms can
> + * override this as an optimization.
> + */
> +#ifndef membarrier_finish_switch_mm
> +static inline void membarrier_finish_switch_mm(int membarrier_state)
> +{
> +	if (membarrier_state & (MEMBARRIER_STATE_GLOBAL_EXPEDITED | MEMBARRIER_STATE_PRIVATE_EXPEDITED))
> +		smp_mb();
> +}
> +#endif
> +
> +#else
> +
> +static inline void membarrier_finish_switch_mm(int membarrier_state) {}
> +
> +#endif
> +
>  #ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS
>  #include <asm/membarrier.h>
>  #endif
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index fe3f2a40d61e..8275b415acec 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -1325,25 +1325,15 @@ void kthread_use_mm(struct mm_struct *mm)
>  	tsk->mm = mm;
>  	membarrier_update_current_mm(mm);
>  	switch_mm_irqs_off(active_mm, mm, tsk);
> +	membarrier_finish_switch_mm(atomic_read(&mm->membarrier_state));
>  	local_irq_enable();
>  	task_unlock(tsk);
>  #ifdef finish_arch_post_lock_switch
>  	finish_arch_post_lock_switch();
>  #endif
>  
> -	/*
> -	 * When a kthread starts operating on an address space, the loop
> -	 * in membarrier_{private,global}_expedited() may not observe
> -	 * that tsk->mm, and not issue an IPI. Membarrier requires a
> -	 * memory barrier after storing to tsk->mm, before accessing
> -	 * user-space memory. A full memory barrier for membarrier
> -	 * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
> -	 * mmdrop(), or explicitly with smp_mb().
> -	 */
>  	if (active_mm != mm)
>  		mmdrop(active_mm);
> -	else
> -		smp_mb();
>  
>  	to_kthread(tsk)->oldfs = force_uaccess_begin();
>  }
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e4c122f8bf21..329a6d2a4e13 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4221,15 +4221,6 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>  
>  	fire_sched_in_preempt_notifiers(current);
>  
> -	/*
> -	 * When switching through a kernel thread, the loop in
> -	 * membarrier_{private,global}_expedited() may have observed that
> -	 * kernel thread and not issued an IPI. It is therefore possible to
> -	 * schedule between user->kernel->user threads without passing though
> -	 * switch_mm(). Membarrier requires a barrier after storing to
> -	 * rq->curr, before returning to userspace, and mmdrop() provides
> -	 * this barrier.
> -	 */
>  	if (mm)
>  		mmdrop(mm);
>  
> @@ -4311,15 +4302,14 @@ context_switch(struct rq *rq, struct task_struct *prev,
>  			prev->active_mm = NULL;
>  	} else {                                        // to user
>  		membarrier_switch_mm(rq, prev->active_mm, next->mm);
> +		switch_mm_irqs_off(prev->active_mm, next->mm, next);
> +
>  		/*
>  		 * sys_membarrier() requires an smp_mb() between setting
> -		 * rq->curr / membarrier_switch_mm() and returning to userspace.
> -		 *
> -		 * The below provides this either through switch_mm(), or in
> -		 * case 'prev->active_mm == next->mm' through
> -		 * finish_task_switch()'s mmdrop().
> +		 * rq->curr->mm to a membarrier-enabled mm and returning
> +		 * to userspace.
>  		 */
> -		switch_mm_irqs_off(prev->active_mm, next->mm, next);
> +		membarrier_finish_switch_mm(rq->membarrier_state);
>  
>  		if (!prev->mm) {                        // from kernel
>  			/* will mmdrop() in finish_task_switch(). */
> @@ -5121,17 +5111,10 @@ static void __sched notrace __schedule(bool preempt)
>  		RCU_INIT_POINTER(rq->curr, next);
>  		/*
>  		 * The membarrier system call requires each architecture
> -		 * to have a full memory barrier after updating
> -		 * rq->curr, before returning to user-space.
> -		 *
> -		 * Here are the schemes providing that barrier on the
> -		 * various architectures:
> -		 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
> -		 *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
> -		 * - finish_lock_switch() for weakly-ordered
> -		 *   architectures where spin_unlock is a full barrier,
> -		 * - switch_to() for arm64 (weakly-ordered, spin_unlock
> -		 *   is a RELEASE barrier),
> +		 * to have a full memory barrier before and after updating
> +		 * rq->curr->mm, before returning to userspace.  This
> +		 * is provided by membarrier_finish_switch_mm().  Architectures
> +		 * that want to optimize this can override that function.
>  		 */
>  		++*switch_count;
>  
> -- 
> 2.31.1
> 
>
Peter Zijlstra June 16, 2021, 7:35 a.m. UTC | #2
On Wed, Jun 16, 2021 at 02:19:49PM +1000, Nicholas Piggin wrote:
> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
> > membarrier() needs a barrier after any CPU changes mm.  There is currently
> > a comment explaining why this barrier probably exists in all cases.  This
> > is very fragile -- any change to the relevant parts of the scheduler
> > might get rid of these barriers, and it's not really clear to me that
> > the barrier actually exists in all necessary cases.
> 
> The comments and barriers in the mmdrop() hunks? I don't see what is 
> fragile or maybe-buggy about this. The barrier definitely exists.
> 
> And any change can change anything, that doesn't make it fragile. My
> lazy tlb refcounting change avoids the mmdrop in some cases, but it
> replaces it with smp_mb for example.

I'm with Nick again on this. You're adding extra barriers for no
discernible reason; that's not generally encouraged, seeing how extra
barriers are extra slow.

Both mmdrop() itself, as well as the callsite have comments saying how
membarrier relies on the implied barrier, what's fragile about that?
Andy Lutomirski June 16, 2021, 6:41 p.m. UTC | #3
On 6/16/21 12:35 AM, Peter Zijlstra wrote:
> On Wed, Jun 16, 2021 at 02:19:49PM +1000, Nicholas Piggin wrote:
>> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
>>> membarrier() needs a barrier after any CPU changes mm.  There is currently
>>> a comment explaining why this barrier probably exists in all cases.  This
>>> is very fragile -- any change to the relevant parts of the scheduler
>>> might get rid of these barriers, and it's not really clear to me that
>>> the barrier actually exists in all necessary cases.
>>
>> The comments and barriers in the mmdrop() hunks? I don't see what is 
>> fragile or maybe-buggy about this. The barrier definitely exists.
>>
>> And any change can change anything, that doesn't make it fragile. My
>> lazy tlb refcounting change avoids the mmdrop in some cases, but it
>> replaces it with smp_mb for example.
> 
> I'm with Nick again, on this. You're adding extra barriers for no
> discernible reason, that's not generally encouraged, seeing how extra
> barriers is extra slow.
> 
> Both mmdrop() itself, as well as the callsite have comments saying how
> membarrier relies on the implied barrier, what's fragile about that?
> 

My real motivation is that mmgrab() and mmdrop() don't actually need to
be full barriers.  The current implementation has them being full
barriers, and the current implementation is quite slow.  So let's try
that commit message again:

membarrier() needs a barrier after any CPU changes mm.  There is currently
a comment explaining why this barrier probably exists in all cases. The
logic is based on ensuring that the barrier exists on every control flow
path through the scheduler.  It also relies on mmgrab() and mmdrop() being
full barriers.

mmgrab() and mmdrop() would be better if they were not full barriers.  As a
trivial optimization, mmgrab() could use a relaxed atomic and mmdrop()
could use a release on architectures that have these operations.  Larger
optimizations are also in the works.  Doing any of these optimizations
while preserving an unnecessary barrier will complicate the code and
penalize non-membarrier-using tasks.

Simplify the logic by adding an explicit barrier, and allow architectures
to override it as an optimization if they want to.

One of the deleted comments in this patch said "It is therefore
possible to schedule between user->kernel->user threads without
passing through switch_mm()".  It is possible to do this without, say,
writing to CR3 on x86, but the core scheduler indeed calls
switch_mm_irqs_off() to tell the arch code to go back from lazy mode
to no-lazy mode.
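[Editor's note] The "trivial optimization" for the grab side mentioned above can be sketched in userspace C11. Hedged: `mm_count_stub` and `mmgrab_stub` are invented names for illustration; the claim being demonstrated is only that taking a reference needs no ordering of surrounding accesses, so a relaxed increment suffices.

```c
#include <assert.h>
#include <stdatomic.h>

/* Userspace stand-in for mm->mm_count. */
static atomic_uint mm_count_stub;

/*
 * Sketch of a relaxed mmgrab(): taking a reference only needs the
 * counter to stay accurate; it does not need to order any other
 * memory accesses, so memory_order_relaxed is sufficient.
 */
static void mmgrab_stub(void)
{
    atomic_fetch_add_explicit(&mm_count_stub, 1, memory_order_relaxed);
}
```

The release-on-drop half of the optimization is the interesting part for membarrier, since today's full-barrier `mmdrop()` is what the deleted comments relied on.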
Nicholas Piggin June 17, 2021, 1:37 a.m. UTC | #4
Excerpts from Andy Lutomirski's message of June 17, 2021 4:41 am:
> On 6/16/21 12:35 AM, Peter Zijlstra wrote:
>> On Wed, Jun 16, 2021 at 02:19:49PM +1000, Nicholas Piggin wrote:
>>> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
>>>> membarrier() needs a barrier after any CPU changes mm.  There is currently
>>>> a comment explaining why this barrier probably exists in all cases.  This
>>>> is very fragile -- any change to the relevant parts of the scheduler
>>>> might get rid of these barriers, and it's not really clear to me that
>>>> the barrier actually exists in all necessary cases.
>>>
>>> The comments and barriers in the mmdrop() hunks? I don't see what is 
>>> fragile or maybe-buggy about this. The barrier definitely exists.
>>>
>>> And any change can change anything, that doesn't make it fragile. My
>>> lazy tlb refcounting change avoids the mmdrop in some cases, but it
>>> replaces it with smp_mb for example.
>> 
>> I'm with Nick again, on this. You're adding extra barriers for no
>> discernible reason, that's not generally encouraged, seeing how extra
>> barriers is extra slow.
>> 
>> Both mmdrop() itself, as well as the callsite have comments saying how
>> membarrier relies on the implied barrier, what's fragile about that?
>> 
> 
> My real motivation is that mmgrab() and mmdrop() don't actually need to
> be full barriers.  The current implementation has them being full
> barriers, and the current implementation is quite slow.  So let's try
> that commit message again:
> 
> membarrier() needs a barrier after any CPU changes mm.  There is currently
> a comment explaining why this barrier probably exists in all cases. The
> logic is based on ensuring that the barrier exists on every control flow
> path through the scheduler.  It also relies on mmgrab() and mmdrop() being
> full barriers.
> 
> mmgrab() and mmdrop() would be better if they were not full barriers.  As a
> trivial optimization, mmgrab() could use a relaxed atomic and mmdrop()
> could use a release on architectures that have these operations.

I'm not against the idea, I've looked at something similar before (not
for mmdrop but a different primitive). Also my lazy tlb shootdown series 
could possibly take advantage of this, I might cherry pick it and test 
performance :)

I don't think it belongs in this series though. Should go together with
something that takes advantage of it.

Thanks,
Nick
Andy Lutomirski June 17, 2021, 2:57 a.m. UTC | #5
On Wed, Jun 16, 2021, at 6:37 PM, Nicholas Piggin wrote:
> Excerpts from Andy Lutomirski's message of June 17, 2021 4:41 am:
> > On 6/16/21 12:35 AM, Peter Zijlstra wrote:
> >> On Wed, Jun 16, 2021 at 02:19:49PM +1000, Nicholas Piggin wrote:
> >>> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
> >>>> membarrier() needs a barrier after any CPU changes mm.  There is currently
> >>>> a comment explaining why this barrier probably exists in all cases.  This
> >>>> is very fragile -- any change to the relevant parts of the scheduler
> >>>> might get rid of these barriers, and it's not really clear to me that
> >>>> the barrier actually exists in all necessary cases.
> >>>
> >>> The comments and barriers in the mmdrop() hunks? I don't see what is 
> >>> fragile or maybe-buggy about this. The barrier definitely exists.
> >>>
> >>> And any change can change anything, that doesn't make it fragile. My
> >>> lazy tlb refcounting change avoids the mmdrop in some cases, but it
> >>> replaces it with smp_mb for example.
> >> 
> >> I'm with Nick again, on this. You're adding extra barriers for no
> >> discernible reason, that's not generally encouraged, seeing how extra
> >> barriers is extra slow.
> >> 
> >> Both mmdrop() itself, as well as the callsite have comments saying how
> >> membarrier relies on the implied barrier, what's fragile about that?
> >> 
> > 
> > My real motivation is that mmgrab() and mmdrop() don't actually need to
> > be full barriers.  The current implementation has them being full
> > barriers, and the current implementation is quite slow.  So let's try
> > that commit message again:
> > 
> > membarrier() needs a barrier after any CPU changes mm.  There is currently
> > a comment explaining why this barrier probably exists in all cases. The
> > logic is based on ensuring that the barrier exists on every control flow
> > path through the scheduler.  It also relies on mmgrab() and mmdrop() being
> > full barriers.
> > 
> > mmgrab() and mmdrop() would be better if they were not full barriers.  As a
> > trivial optimization, mmgrab() could use a relaxed atomic and mmdrop()
> > could use a release on architectures that have these operations.
> 
> I'm not against the idea, I've looked at something similar before (not
> for mmdrop but a different primitive). Also my lazy tlb shootdown series 
> could possibly take advantage of this, I might cherry pick it and test 
> performance :)
> 
> I don't think it belongs in this series though. Should go together with
> something that takes advantage of it.

I’m going to see if I can get hazard pointers into shape quickly.

> 
> Thanks,
> Nick
>
Andy Lutomirski June 17, 2021, 5:32 a.m. UTC | #6
On Wed, Jun 16, 2021, at 7:57 PM, Andy Lutomirski wrote:
> 
> 
> On Wed, Jun 16, 2021, at 6:37 PM, Nicholas Piggin wrote:
> > Excerpts from Andy Lutomirski's message of June 17, 2021 4:41 am:
> > > On 6/16/21 12:35 AM, Peter Zijlstra wrote:
> > >> On Wed, Jun 16, 2021 at 02:19:49PM +1000, Nicholas Piggin wrote:
> > >>> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
> > >>>> membarrier() needs a barrier after any CPU changes mm.  There is currently
> > >>>> a comment explaining why this barrier probably exists in all cases.  This
> > >>>> is very fragile -- any change to the relevant parts of the scheduler
> > >>>> might get rid of these barriers, and it's not really clear to me that
> > >>>> the barrier actually exists in all necessary cases.
> > >>>
> > >>> The comments and barriers in the mmdrop() hunks? I don't see what is 
> > >>> fragile or maybe-buggy about this. The barrier definitely exists.
> > >>>
> > >>> And any change can change anything, that doesn't make it fragile. My
> > >>> lazy tlb refcounting change avoids the mmdrop in some cases, but it
> > >>> replaces it with smp_mb for example.
> > >> 
> > >> I'm with Nick again, on this. You're adding extra barriers for no
> > >> discernible reason, that's not generally encouraged, seeing how extra
> > >> barriers is extra slow.
> > >> 
> > >> Both mmdrop() itself, as well as the callsite have comments saying how
> > >> membarrier relies on the implied barrier, what's fragile about that?
> > >> 
> > > 
> > > My real motivation is that mmgrab() and mmdrop() don't actually need to
> > > be full barriers.  The current implementation has them being full
> > > barriers, and the current implementation is quite slow.  So let's try
> > > that commit message again:
> > > 
> > > membarrier() needs a barrier after any CPU changes mm.  There is currently
> > > a comment explaining why this barrier probably exists in all cases. The
> > > logic is based on ensuring that the barrier exists on every control flow
> > > path through the scheduler.  It also relies on mmgrab() and mmdrop() being
> > > full barriers.
> > > 
> > > mmgrab() and mmdrop() would be better if they were not full barriers.  As a
> > > trivial optimization, mmgrab() could use a relaxed atomic and mmdrop()
> > > could use a release on architectures that have these operations.
> > 
> > I'm not against the idea, I've looked at something similar before (not
> > for mmdrop but a different primitive). Also my lazy tlb shootdown series 
> > could possibly take advantage of this, I might cherry pick it and test 
> > performance :)
> > 
> > I don't think it belongs in this series though. Should go together with
> > something that takes advantage of it.
> 
> I’m going to see if I can get hazard pointers into shape quickly.

Here it is.  Not even boot tested!

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=sched/lazymm&id=ecc3992c36cb88087df9c537e2326efb51c95e31

Nick, I think you can accomplish much the same thing as your patch by:

#define for_each_possible_lazymm_cpu while (false)

although a more clever definition might be even more performant.

I would appreciate everyone's thoughts as to whether this scheme is sane.

Paul, I'm adding you for two reasons.  First, you seem to enjoy bizarre locking schemes.  Secondly, because maybe RCU could actually work here.  The basic idea is that we want to keep an mm_struct from being freed at an inopportune time.  The problem with naively using RCU is that each CPU can use one single mm_struct while in an idle extended quiescent state (but not a user extended quiescent state).  So rcu_read_lock() is right out.  If RCU could understand this concept, then maybe it could help us, but this seems a bit out of scope for RCU.

--Andy
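[Editor's note] For readers without the linked tree handy, the hazard-pointer scheme being floated can be sketched minimally in userspace C. Hedged: every name here (`lazymm_hazard`, `publish_lazymm`, `try_free_mm`, `NR_CPUS_STUB`) is invented for illustration, the publish/re-check race handling is elided, and this is a single-threaded demonstration of the protocol shape, not the actual patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_CPUS_STUB 4

/* Each "CPU" publishes the mm it is using lazily in a hazard slot. */
static _Atomic(void *) lazymm_hazard[NR_CPUS_STUB];

static void publish_lazymm(int cpu, void *mm)
{
    atomic_store_explicit(&lazymm_hazard[cpu], mm, memory_order_seq_cst);
    /*
     * A real implementation must re-check that mm is still the
     * current lazy mm after publishing, to close the race with a
     * concurrent reclaimer; that step is elided here.
     */
}

static void clear_lazymm(int cpu)
{
    atomic_store_explicit(&lazymm_hazard[cpu], NULL, memory_order_release);
}

/*
 * Reclaim side: the mm may only be freed once no CPU holds it as a
 * hazard. Returns 1 if it is safe to free now, 0 if freeing must wait.
 */
static int try_free_mm(void *mm)
{
    for (int cpu = 0; cpu < NR_CPUS_STUB; cpu++)
        if (atomic_load_explicit(&lazymm_hazard[cpu],
                                 memory_order_acquire) == mm)
            return 0; /* still in lazy use; defer the free */
    return 1;         /* no hazards: safe to free */
}
```

This is the property the thread is debating: the mm_struct stays alive across an idle sojourn without per-switch `mmgrab()`/`mmdrop()` traffic, at the cost of a scan (and, as Paul notes later, ordering on the publish side).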
Nicholas Piggin June 17, 2021, 6:51 a.m. UTC | #7
Excerpts from Andy Lutomirski's message of June 17, 2021 3:32 pm:
> On Wed, Jun 16, 2021, at 7:57 PM, Andy Lutomirski wrote:
>> 
>> 
>> On Wed, Jun 16, 2021, at 6:37 PM, Nicholas Piggin wrote:
>> > Excerpts from Andy Lutomirski's message of June 17, 2021 4:41 am:
>> > > On 6/16/21 12:35 AM, Peter Zijlstra wrote:
>> > >> On Wed, Jun 16, 2021 at 02:19:49PM +1000, Nicholas Piggin wrote:
>> > >>> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
>> > >>>> membarrier() needs a barrier after any CPU changes mm.  There is currently
>> > >>>> a comment explaining why this barrier probably exists in all cases.  This
>> > >>>> is very fragile -- any change to the relevant parts of the scheduler
>> > >>>> might get rid of these barriers, and it's not really clear to me that
>> > >>>> the barrier actually exists in all necessary cases.
>> > >>>
>> > >>> The comments and barriers in the mmdrop() hunks? I don't see what is 
>> > >>> fragile or maybe-buggy about this. The barrier definitely exists.
>> > >>>
>> > >>> And any change can change anything, that doesn't make it fragile. My
>> > >>> lazy tlb refcounting change avoids the mmdrop in some cases, but it
>> > >>> replaces it with smp_mb for example.
>> > >> 
>> > >> I'm with Nick again, on this. You're adding extra barriers for no
>> > >> discernible reason, that's not generally encouraged, seeing how extra
>> > >> barriers is extra slow.
>> > >> 
>> > >> Both mmdrop() itself, as well as the callsite have comments saying how
>> > >> membarrier relies on the implied barrier, what's fragile about that?
>> > >> 
>> > > 
>> > > My real motivation is that mmgrab() and mmdrop() don't actually need to
>> > > be full barriers.  The current implementation has them being full
>> > > barriers, and the current implementation is quite slow.  So let's try
>> > > that commit message again:
>> > > 
>> > > membarrier() needs a barrier after any CPU changes mm.  There is currently
>> > > a comment explaining why this barrier probably exists in all cases. The
>> > > logic is based on ensuring that the barrier exists on every control flow
>> > > path through the scheduler.  It also relies on mmgrab() and mmdrop() being
>> > > full barriers.
>> > > 
>> > > mmgrab() and mmdrop() would be better if they were not full barriers.  As a
>> > > trivial optimization, mmgrab() could use a relaxed atomic and mmdrop()
>> > > could use a release on architectures that have these operations.
>> > 
>> > I'm not against the idea, I've looked at something similar before (not
>> > for mmdrop but a different primitive). Also my lazy tlb shootdown series 
>> > could possibly take advantage of this, I might cherry pick it and test 
>> > performance :)
>> > 
>> > I don't think it belongs in this series though. Should go together with
>> > something that takes advantage of it.
>> 
>> I’m going to see if I can get hazard pointers into shape quickly.
> 
> Here it is.  Not even boot tested!
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=sched/lazymm&id=ecc3992c36cb88087df9c537e2326efb51c95e31
> 
> Nick, I think you can accomplish much the same thing as your patch by:
> 
> #define for_each_possible_lazymm_cpu while (false)

I'm not sure what you mean? For powerpc, other CPUs can be using the mm 
as lazy at this point. I must be missing something.

> 
> although a more clever definition might be even more performant.
> 
> I would appreciate everyone's thoughts as to whether this scheme is sane.

powerpc has no use for it. After the series in akpm's tree, there's just
a small change required for radix TLB flushing to make the final flush 
IPI also purge lazies, and then the shootdown scheme runs with zero
additional IPIs, so there's essentially no benefit to the hazard pointer
games. I have found the additional IPIs aren't bad anyway, so it's not
something we'd bother trying to optimise away on hash, which is slowly
being de-prioritized.

I must say, I still see active_mm featuring prominently in your patch,
which comes as a surprise. I would have thought some preparation and 
cleanup work to fix the x86 deficiencies you were talking about 
should go in first; I'm eager to see those. But either way I don't see
a fundamental reason this couldn't be done to support archs for which 
the standard or shootdown refcounting options aren't sufficient.

IIRC I didn't see a fundamental hole in it last time you posted the
idea but I admittedly didn't go through it super carefully.

Thanks,
Nick

> 
> Paul, I'm adding you for two reasons.  First, you seem to enjoy bizarre locking schemes.  Secondly, because maybe RCU could actually work here.  The basic idea is that we want to keep an mm_struct from being freed at an inopportune time.  The problem with naively using RCU is that each CPU can use one single mm_struct while in an idle extended quiescent state (but not a user extended quiescent state).  So rcu_read_lock() is right out.  If RCU could understand this concept, then maybe it could help us, but this seems a bit out of scope for RCU.
> 
> --Andy
>
Peter Zijlstra June 17, 2021, 8:45 a.m. UTC | #8
On Wed, Jun 16, 2021 at 11:41:19AM -0700, Andy Lutomirski wrote:
> mmgrab() and mmdrop() would be better if they were not full barriers.  As a
> trivial optimization,

> mmgrab() could use a relaxed atomic and mmdrop()
> could use a release on architectures that have these operations.

mmgrab() *is* relaxed; mmdrop() is a full barrier but could trivially be
made weaker once membarrier stops caring about it.

static inline void mmdrop(struct mm_struct *mm)
{
	unsigned int val = atomic_dec_return_release(&mm->mm_count);
	if (unlikely(!val)) {
		/* Provide REL+ACQ ordering for free() */
		smp_acquire__after_ctrl_dep();
		__mmdrop(mm);
	}
}

It's slightly less optimal for not being able to use the flags from the
decrement. Or convert the whole thing to refcount_t (if appropriate)
which already does something like the above.
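[Editor's note] Peter's sketch above can be exercised as a userspace C11 analogue. Hedged: `atomic_fetch_sub_explicit` with `memory_order_release` plus an acquire fence on the zero path stands in for the kernel's `atomic_dec_return_release()` and `smp_acquire__after_ctrl_dep()`; the `*_stub` names and the `freed` instrumentation field are inventions for this example.

```c
#include <assert.h>
#include <stdatomic.h>

/* Minimal userspace stand-in for struct mm_struct's refcount. */
struct mm_struct_stub {
    atomic_uint mm_count;
    int freed; /* instrumentation: how many times "freed" */
};

static void __mmdrop_stub(struct mm_struct_stub *mm)
{
    mm->freed++;
}

/*
 * Analogue of the weaker mmdrop(): a release decrement, with an
 * acquire fence on the zero path providing the REL+ACQ ordering the
 * free path needs, so __mmdrop sees all stores made while references
 * were held.
 */
static void mmdrop_stub(struct mm_struct_stub *mm)
{
    /* fetch_sub returns the old value; old == 1 means we hit zero. */
    if (atomic_fetch_sub_explicit(&mm->mm_count, 1,
                                  memory_order_release) == 1) {
        atomic_thread_fence(memory_order_acquire);
        __mmdrop_stub(mm);
    }
}
```

Note the design point from the email: without a fused dec-and-test primitive this can't use the flags from the decrement, which is the "slightly less optimal" part, and `refcount_t` already packages this REL+ACQ pattern.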
Paul E. McKenney June 17, 2021, 3:02 p.m. UTC | #9
On Wed, Jun 16, 2021 at 10:32:15PM -0700, Andy Lutomirski wrote:
> On Wed, Jun 16, 2021, at 7:57 PM, Andy Lutomirski wrote:
> > On Wed, Jun 16, 2021, at 6:37 PM, Nicholas Piggin wrote:
> > > Excerpts from Andy Lutomirski's message of June 17, 2021 4:41 am:
> > > > On 6/16/21 12:35 AM, Peter Zijlstra wrote:
> > > >> On Wed, Jun 16, 2021 at 02:19:49PM +1000, Nicholas Piggin wrote:
> > > >>> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
> > > >>>> membarrier() needs a barrier after any CPU changes mm.  There is currently
> > > >>>> a comment explaining why this barrier probably exists in all cases.  This
> > > >>>> is very fragile -- any change to the relevant parts of the scheduler
> > > >>>> might get rid of these barriers, and it's not really clear to me that
> > > >>>> the barrier actually exists in all necessary cases.
> > > >>>
> > > >>> The comments and barriers in the mmdrop() hunks? I don't see what is 
> > > >>> fragile or maybe-buggy about this. The barrier definitely exists.
> > > >>>
> > > >>> And any change can change anything, that doesn't make it fragile. My
> > > >>> lazy tlb refcounting change avoids the mmdrop in some cases, but it
> > > >>> replaces it with smp_mb for example.
> > > >> 
> > > >> I'm with Nick again, on this. You're adding extra barriers for no
> > > >> discernible reason, that's not generally encouraged, seeing how extra
> > > >> barriers is extra slow.
> > > >> 
> > > >> Both mmdrop() itself, as well as the callsite have comments saying how
> > > >> membarrier relies on the implied barrier, what's fragile about that?
> > > >> 
> > > > 
> > > > My real motivation is that mmgrab() and mmdrop() don't actually need to
> > > > be full barriers.  The current implementation has them being full
> > > > barriers, and the current implementation is quite slow.  So let's try
> > > > that commit message again:
> > > > 
> > > > membarrier() needs a barrier after any CPU changes mm.  There is currently
> > > > a comment explaining why this barrier probably exists in all cases. The
> > > > logic is based on ensuring that the barrier exists on every control flow
> > > > path through the scheduler.  It also relies on mmgrab() and mmdrop() being
> > > > full barriers.
> > > > 
> > > > mmgrab() and mmdrop() would be better if they were not full barriers.  As a
> > > > trivial optimization, mmgrab() could use a relaxed atomic and mmdrop()
> > > > could use a release on architectures that have these operations.
> > > 
> > > I'm not against the idea, I've looked at something similar before (not
> > > for mmdrop but a different primitive). Also my lazy tlb shootdown series 
> > > could possibly take advantage of this, I might cherry pick it and test 
> > > performance :)
> > > 
> > > I don't think it belongs in this series though. Should go together with
> > > something that takes advantage of it.
> > 
> > I’m going to see if I can get hazard pointers into shape quickly.

One textbook C implementation is in perfbook CodeSamples/defer/hazptr.[hc]
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git

A production-tested C++ implementation is in the folly library:

https://github.com/facebook/folly/blob/master/folly/synchronization/Hazptr.h

However, the hazard-pointers get-a-reference operation requires a full
barrier.  There are ways to optimize this away in some special cases,
one of which is used in the folly-library hash-map code.

> Here it is.  Not even boot tested!
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=sched/lazymm&id=ecc3992c36cb88087df9c537e2326efb51c95e31
> 
> Nick, I think you can accomplish much the same thing as your patch by:
> 
> #define for_each_possible_lazymm_cpu while (false)
> 
> although a more clever definition might be even more performant.
> 
> I would appreciate everyone's thoughts as to whether this scheme is sane.
> 
> Paul, I'm adding you for two reasons.  First, you seem to enjoy bizarre locking schemes.  Secondly, because maybe RCU could actually work here.  The basic idea is that we want to keep an mm_struct from being freed at an inopportune time.  The problem with naively using RCU is that each CPU can use one single mm_struct while in an idle extended quiescent state (but not a user extended quiescent state).  So rcu_read_lock() is right out.  If RCU could understand this concept, then maybe it could help us, but this seems a bit out of scope for RCU.

OK, I should look at your patch, but that will be after morning meetings.

On RCU and idle, much of the idle code now allows rcu_read_lock() to be
used directly, thanks to Peter's recent work.  Any sort of interrupt or
NMI from idle can also use rcu_read_lock(), including the IPIs that are
now done directly from idle.  RCU_NONIDLE() makes RCU pay attention to
the code supplied as its sole argument.

Or is your patch really having the CPU expect a mm_struct to stick around
across the full idle sojourn, and without the assistance of mmgrab()
and mmdrop()?

Anyway, off to meetings...  Hope this helps in the meantime.

							Thanx, Paul
Andy Lutomirski June 17, 2021, 11:49 p.m. UTC | #10
On 6/16/21 11:51 PM, Nicholas Piggin wrote:
> Excerpts from Andy Lutomirski's message of June 17, 2021 3:32 pm:
>> On Wed, Jun 16, 2021, at 7:57 PM, Andy Lutomirski wrote:
>>>
>>>
>>> On Wed, Jun 16, 2021, at 6:37 PM, Nicholas Piggin wrote:
>>>> Excerpts from Andy Lutomirski's message of June 17, 2021 4:41 am:
>>>>> On 6/16/21 12:35 AM, Peter Zijlstra wrote:
>>>>>> On Wed, Jun 16, 2021 at 02:19:49PM +1000, Nicholas Piggin wrote:
>>>>>>> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
>>>>>>>> membarrier() needs a barrier after any CPU changes mm.  There is currently
>>>>>>>> a comment explaining why this barrier probably exists in all cases.  This
>>>>>>>> is very fragile -- any change to the relevant parts of the scheduler
>>>>>>>> might get rid of these barriers, and it's not really clear to me that
>>>>>>>> the barrier actually exists in all necessary cases.
>>>>>>>
>>>>>>> The comments and barriers in the mmdrop() hunks? I don't see what is 
>>>>>>> fragile or maybe-buggy about this. The barrier definitely exists.
>>>>>>>
>>>>>>> And any change can change anything, that doesn't make it fragile. My
>>>>>>> lazy tlb refcounting change avoids the mmdrop in some cases, but it
>>>>>>> replaces it with smp_mb for example.
>>>>>>
>>>>>> I'm with Nick again, on this. You're adding extra barriers for no
>>>>>> discernible reason, that's not generally encouraged, seeing how extra
>>>>>> barriers is extra slow.
>>>>>>
>>>>>> Both mmdrop() itself, as well as the callsite have comments saying how
>>>>>> membarrier relies on the implied barrier, what's fragile about that?
>>>>>>
>>>>>
>>>>> My real motivation is that mmgrab() and mmdrop() don't actually need to
>>>>> be full barriers.  The current implementation has them being full
>>>>> barriers, and the current implementation is quite slow.  So let's try
>>>>> that commit message again:
>>>>>
>>>>> membarrier() needs a barrier after any CPU changes mm.  There is currently
>>>>> a comment explaining why this barrier probably exists in all cases. The
>>>>> logic is based on ensuring that the barrier exists on every control flow
>>>>> path through the scheduler.  It also relies on mmgrab() and mmdrop() being
>>>>> full barriers.
>>>>>
>>>>> mmgrab() and mmdrop() would be better if they were not full barriers.  As a
>>>>> trivial optimization, mmgrab() could use a relaxed atomic and mmdrop()
>>>>> could use a release on architectures that have these operations.
>>>>
>>>> I'm not against the idea, I've looked at something similar before (not
>>>> for mmdrop but a different primitive). Also my lazy tlb shootdown series 
>>>> could possibly take advantage of this, I might cherry pick it and test 
>>>> performance :)
>>>>
>>>> I don't think it belongs in this series though. Should go together with
>>>> something that takes advantage of it.
>>>
>>> I’m going to see if I can get hazard pointers into shape quickly.
>>
>> Here it is.  Not even boot tested!
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=sched/lazymm&id=ecc3992c36cb88087df9c537e2326efb51c95e31
>>
>> Nick, I think you can accomplish much the same thing as your patch by:
>>
>> #define for_each_possible_lazymm_cpu while (false)
> 
> I'm not sure what you mean? For powerpc, other CPUs can be using the mm 
> as lazy at this point. I must be missing something.

What I mean is: if you want to shoot down lazies instead of doing the
hazard pointer trick to track them, you could do:

#define for_each_possible_lazymm_cpu while (false)

which would promise to the core code that you don't have any lazies left
by the time exit_mmap() is done.  You might need a new hook in
exit_mmap() depending on exactly how you implement the lazy shootdown.
Andy Lutomirski June 18, 2021, 12:06 a.m. UTC | #11
On 6/17/21 8:02 AM, Paul E. McKenney wrote:
> OK, I should look at your patch, but that will be after morning meetings.
> 
> On RCU and idle, much of the idle code now allows rcu_read_lock() to be
> used directly, thanks to Peter's recent work.  Any sort of interrupt or NMI
> from idle can also use rcu_read_lock(), including the IPIs that are now
> done directly from idle.  RCU_NONIDLE() makes RCU pay attention to the
> code supplied as its sole argument.
> 
> Or is your patch really having the CPU expect a mm_struct to stick around
> across the full idle sojourn, and without the assistance of mmgrab()
> and mmdrop()?

I really do expect it to stick around across the full idle sojourn.
Unless RCU is more magical than I think it is, this means I can't use RCU.

--Andy
Paul E. McKenney June 18, 2021, 3:35 a.m. UTC | #12
On Thu, Jun 17, 2021 at 05:06:02PM -0700, Andy Lutomirski wrote:
> I really do expect it to stick around across the full idle sojourn.
> Unless RCU is more magical than I think it is, this means I can't use RCU.

You are quite correct.  And unfortunately, making RCU pay attention
across the full idle sojourn would make the battery-powered embedded
guys quite annoyed.  And would result in OOM.  You could use something
like percpu_ref, but at a large memory expense.  You could use something
like SRCU or Tasks Trace RCU, but this would increase the overhead of
freeing mm_struct structures.

Your use of per-CPU pointers seems sound in principle, but I am uncertain
of some of the corner cases.  And either current mainline gained an
mmdrop-balance bug or rcutorture is also uncertain of those corner cases.
But again, the overall concept looks quite good.  Just some bugs to
be found and fixed, whether in this patch or in current mainline.
As always...  ;-)

						Thanx, Paul
Nicholas Piggin June 19, 2021, 2:53 a.m. UTC | #13
Excerpts from Andy Lutomirski's message of June 18, 2021 9:49 am:
>>> Nick, I think you can accomplish much the same thing as your patch by:
>>>
>>> #define for_each_possible_lazymm_cpu while (false)
>> 
>> I'm not sure what you mean? For powerpc, other CPUs can be using the mm 
>> as lazy at this point. I must be missing something.
> 
> What I mean is: if you want to shoot down lazies instead of doing the
> hazard pointer trick to track them, you could do:
> 
> #define for_each_possible_lazymm_cpu while (false)
> 
> which would promise to the core code that you don't have any lazies left
> by the time exit_mmap() is done.  You might need a new hook in
> exit_mmap() depending on exactly how you implement the lazy shootdown.

Oh for configuring it away entirely. I'll have to see how it falls out, 
I suspect we'd want to just no-op that entire function and avoid the 2 
atomics if we are taking care of our lazy mms with shootdowns.

The more important thing would be the context switch fast path, but even 
there, there's really no reason why the two approaches couldn't be made 
to both work with some careful helper functions or structuring of the 
code.

Thanks,
Nick
Andy Lutomirski June 19, 2021, 3:20 a.m. UTC | #14
On Fri, Jun 18, 2021, at 7:53 PM, Nicholas Piggin wrote:
> Excerpts from Andy Lutomirski's message of June 18, 2021 9:49 am:
> > What I mean is: if you want to shoot down lazies instead of doing the
> > hazard pointer trick to track them, you could do:
> > 
> > #define for_each_possible_lazymm_cpu while (false)
> > 
> > which would promise to the core code that you don't have any lazies left
> > by the time exit_mmap() is done.  You might need a new hook in
> > exit_mmap() depending on exactly how you implement the lazy shootdown.
> 
> Oh for configuring it away entirely. I'll have to see how it falls out, 
> I suspect we'd want to just no-op that entire function and avoid the 2 
> atomics if we are taking care of our lazy mms with shootdowns.

Do you mean the smp_store_release()?  On x86 and similar architectures, that’s almost free.  I’m also not convinced it needs to be a real release.

> 
> The more important thing would be the context switch fast path, but even 
> there, there's really no reason why the two approaches couldn't be made 
> to both work with some careful helper functions or structuring of the 
> code.
> 
> Thanks,
> Nick
>
Nicholas Piggin June 19, 2021, 4:27 a.m. UTC | #15
Excerpts from Andy Lutomirski's message of June 19, 2021 1:20 pm:
> 
> 
> On Fri, Jun 18, 2021, at 7:53 PM, Nicholas Piggin wrote:
>> Oh for configuring it away entirely. I'll have to see how it falls out, 
>> I suspect we'd want to just no-op that entire function and avoid the 2 
>> atomics if we are taking care of our lazy mms with shootdowns.
> 
> Do you mean the smp_store_release()?  On x86 and similar architectures, that’s almost free.  I’m also not convinced it needs to be a real release.

Probably the shoot-lazies code would compile that stuff out entirely, so
not that as such, but the entire thing, including the change to the
membarrier barrier (which, as I said, shoot lazies could possibly take
advantage of anyway).

My point is I haven't seen how everything goes together or looked at 
generated code so I can't exactly say yes to your question, but that
there's no reason it couldn't be made to nicely fold away based on
config option so I'm not too concerned about that issue.

Thanks,
Nick
Patch

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 10aace21d25e..c6eebbafadb0 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -341,6 +341,27 @@  enum {
 	MEMBARRIER_FLAG_RSEQ		= (1U << 1),
 };
 
+#ifdef CONFIG_MEMBARRIER
+
+/*
+ * Called by the core scheduler after calling switch_mm_irqs_off().
+ * Architectures that have implicit barriers when switching mms can
+ * override this as an optimization.
+ */
+#ifndef membarrier_finish_switch_mm
+static inline void membarrier_finish_switch_mm(int membarrier_state)
+{
+	if (membarrier_state & (MEMBARRIER_STATE_GLOBAL_EXPEDITED | MEMBARRIER_STATE_PRIVATE_EXPEDITED))
+		smp_mb();
+}
+#endif
+
+#else
+
+static inline void membarrier_finish_switch_mm(int membarrier_state) {}
+
+#endif
+
 #ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS
 #include <asm/membarrier.h>
 #endif
diff --git a/kernel/kthread.c b/kernel/kthread.c
index fe3f2a40d61e..8275b415acec 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1325,25 +1325,15 @@  void kthread_use_mm(struct mm_struct *mm)
 	tsk->mm = mm;
 	membarrier_update_current_mm(mm);
 	switch_mm_irqs_off(active_mm, mm, tsk);
+	membarrier_finish_switch_mm(atomic_read(&mm->membarrier_state));
 	local_irq_enable();
 	task_unlock(tsk);
 #ifdef finish_arch_post_lock_switch
 	finish_arch_post_lock_switch();
 #endif
 
-	/*
-	 * When a kthread starts operating on an address space, the loop
-	 * in membarrier_{private,global}_expedited() may not observe
-	 * that tsk->mm, and not issue an IPI. Membarrier requires a
-	 * memory barrier after storing to tsk->mm, before accessing
-	 * user-space memory. A full memory barrier for membarrier
-	 * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
-	 * mmdrop(), or explicitly with smp_mb().
-	 */
 	if (active_mm != mm)
 		mmdrop(active_mm);
-	else
-		smp_mb();
 
 	to_kthread(tsk)->oldfs = force_uaccess_begin();
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e4c122f8bf21..329a6d2a4e13 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4221,15 +4221,6 @@  static struct rq *finish_task_switch(struct task_struct *prev)
 
 	fire_sched_in_preempt_notifiers(current);
 
-	/*
-	 * When switching through a kernel thread, the loop in
-	 * membarrier_{private,global}_expedited() may have observed that
-	 * kernel thread and not issued an IPI. It is therefore possible to
-	 * schedule between user->kernel->user threads without passing though
-	 * switch_mm(). Membarrier requires a barrier after storing to
-	 * rq->curr, before returning to userspace, and mmdrop() provides
-	 * this barrier.
-	 */
 	if (mm)
 		mmdrop(mm);
 
@@ -4311,15 +4302,14 @@  context_switch(struct rq *rq, struct task_struct *prev,
 			prev->active_mm = NULL;
 	} else {                                        // to user
 		membarrier_switch_mm(rq, prev->active_mm, next->mm);
+		switch_mm_irqs_off(prev->active_mm, next->mm, next);
+
 		/*
 		 * sys_membarrier() requires an smp_mb() between setting
-		 * rq->curr / membarrier_switch_mm() and returning to userspace.
-		 *
-		 * The below provides this either through switch_mm(), or in
-		 * case 'prev->active_mm == next->mm' through
-		 * finish_task_switch()'s mmdrop().
+		 * rq->curr->mm to a membarrier-enabled mm and returning
+		 * to userspace.
 		 */
-		switch_mm_irqs_off(prev->active_mm, next->mm, next);
+		membarrier_finish_switch_mm(rq->membarrier_state);
 
 		if (!prev->mm) {                        // from kernel
 			/* will mmdrop() in finish_task_switch(). */
@@ -5121,17 +5111,10 @@  static void __sched notrace __schedule(bool preempt)
 		RCU_INIT_POINTER(rq->curr, next);
 		/*
 		 * The membarrier system call requires each architecture
-		 * to have a full memory barrier after updating
-		 * rq->curr, before returning to user-space.
-		 *
-		 * Here are the schemes providing that barrier on the
-		 * various architectures:
-		 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
-		 *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
-		 * - finish_lock_switch() for weakly-ordered
-		 *   architectures where spin_unlock is a full barrier,
-		 * - switch_to() for arm64 (weakly-ordered, spin_unlock
-		 *   is a RELEASE barrier),
+		 * to have a full memory barrier before and after updating
+		 * rq->curr->mm, before returning to userspace.  This
+		 * is provided by membarrier_finish_switch_mm().  Architectures
+		 * that want to optimize this can override that function.
 		 */
 		++*switch_count;