arm64/mm: Add memory barrier for mm_cid

Message ID 20240305145335.2696125-1-yeoreum.yun@arm.com (mailing list archive)
State New, archived

Commit Message

levi.yun March 5, 2024, 2:53 p.m. UTC
Currently arm64's switch_mm() doesn't always have an smp_mb()
which the core scheduler code has depended upon since commit:

    commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")

If switch_mm() doesn't call smp_mb(), sched_mm_cid_remote_clear()
can unset an actively used cid when it fails to observe the active task
after it sets lazy_put.

Add an smp_mb() to arm64's check_and_switch_context() to guarantee that
the active task is observed after sched_mm_cid_remote_clear() succeeds
in setting lazy_put.

Signed-off-by: levi.yun <yeoreum.yun@arm.com>
Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid")
Cc: <stable@vger.kernel.org> # 6.4.x
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Aaron Lu <aaron.lu@intel.com>
---
 I'm really sorry if you got this multiple times.
 I had some problems with the SMTP server...

 arch/arm64/mm/context.c | 5 +++++
 1 file changed, 5 insertions(+)

--
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}

Comments

Will Deacon March 5, 2024, 5:13 p.m. UTC | #1
On Tue, Mar 05, 2024 at 02:53:35PM +0000, levi.yun wrote:
> Currently arm64's switch_mm() doesn't always have an smp_mb()
> which the core scheduler code has depended upon since commit:
> 
>     commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")
> 
> If switch_mm() doesn't call smp_mb(), sched_mm_cid_remote_clear()
> can unset an actively used cid when it fails to observe the active task
> after it sets lazy_put.
> 
> Add an smp_mb() to arm64's check_and_switch_context() to guarantee that
> the active task is observed after sched_mm_cid_remote_clear() succeeds
> in setting lazy_put.
> 
> Signed-off-by: levi.yun <yeoreum.yun@arm.com>
> Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid")
> Cc: <stable@vger.kernel.org> # 6.4.x
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Aaron Lu <aaron.lu@intel.com>
> ---
>  I'm really sorry if you got this multiple times.
>  I had some problems with the SMTP server...
> 
>  arch/arm64/mm/context.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
> index 188197590fc9..7a9e8e6647a0 100644
> --- a/arch/arm64/mm/context.c
> +++ b/arch/arm64/mm/context.c
> @@ -268,6 +268,11 @@ void check_and_switch_context(struct mm_struct *mm)
>  	 */
>  	if (!system_uses_ttbr0_pan())
>  		cpu_switch_mm(mm->pgd, mm);
> +
> +	/*
> +	 * See the comments on switch_mm_cid describing user -> user transition.
> +	 */
> +	smp_mb();
>  }

We already have a stronger barrier than smp_mb() (dsb ish) in __switch_to().
Is that not sufficient?

Will
levi.yun March 5, 2024, 6:52 p.m. UTC | #2
Hi Will,

> We already have a stronger barrier than smp_mb() (dsb ish) in __switch_to().
> Is that not sufficient?

IIUC, the barrier in __switch_to() is not sufficient, because the race
can still happen in sched_mm_cid_remote_clear():

CPU0 in __schedule()                     CPU1 in sched_mm_cid_remote_clear()
rq->curr = new_task;
<no barrier>
mm_get_cid                               remote_clear
    - check valid cid and use it.        Invalidate CID.
<barrier>
rq->curr (not observed).
                                         unset the cid (<<BUG).


If the change to rq->curr cannot be observed in sched_mm_cid_remote_clear(),
it could unset an actively used cid.
Note that the barrier in __switch_to() is issued AFTER switch_mm_cid().
That means that, before __switch_to(), there is a window in which
sched_mm_cid_remote_clear() may fail to observe the new active task
after it sets lazy_put on the cid that the new active task is using.
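
The race above has the shape of a store-buffering pattern: each side stores
to one location and then loads the other. A minimal user-space sketch with
C11 atomics follows. The variable and function names are hypothetical
stand-ins for rq->curr and the per-mm cid state, and the seq_cst fence only
models smp_mb(); this is not the kernel code. With a full fence on both
sides, at least one side must observe the other's store.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for rq->curr and the lazy_put flag on a cid. */
static atomic_int rq_curr_is_new_task; /* 1 after "rq->curr = new_task" */
static atomic_int cid_lazy_put;        /* 1 after the remote side flags the cid */

/* CPU0, context-switch side: returns true if it sees the lazy_put flag. */
bool cpu0_switch_side(void)
{
	atomic_store_explicit(&rq_curr_is_new_task, 1, memory_order_relaxed);
	/* Models the smp_mb() needed between the rq->curr store and the cid load. */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&cid_lazy_put, memory_order_relaxed) != 0;
}

/* CPU1, remote-clear side: returns true if it sees the new task. */
bool cpu1_remote_clear_side(void)
{
	atomic_store_explicit(&cid_lazy_put, 1, memory_order_relaxed);
	/* Models the barrier between setting lazy_put and loading rq->curr. */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&rq_curr_is_new_task, memory_order_relaxed) != 0;
}

/*
 * Sequential replay of one interleaving (CPU0 first, then CPU1).
 * Returns whether CPU1 observed the new task, in which case it must
 * not unset the cid.
 */
bool replay_cpu0_then_cpu1(void)
{
	bool cpu0_saw_flag = cpu0_switch_side();
	bool cpu1_saw_task = cpu1_remote_clear_side();

	/* With both fences present, both sides missing each other is forbidden. */
	assert(cpu0_saw_flag || cpu1_saw_task);
	return cpu1_saw_task;
}
```

If the fence in cpu0_switch_side() is dropped, the C11 model no longer
forbids both loads returning 0, which corresponds to the "unset the cid"
bug in the diagram.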
Mathieu Desnoyers March 5, 2024, 8:01 p.m. UTC | #3
On 2024-03-05 09:53, levi.yun wrote:
> Currently arm64's switch_mm() doesn't always have an smp_mb()
> which the core scheduler code has depended upon since commit:
> 
>      commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")
> 
> If switch_mm() doesn't call smp_mb(), sched_mm_cid_remote_clear()
> can unset an actively used cid when it fails to observe the active task
> after it sets lazy_put.
> 
> Add an smp_mb() to arm64's check_and_switch_context() to guarantee that
> the active task is observed after sched_mm_cid_remote_clear() succeeds
> in setting lazy_put.

This comment from the original implementation of membarrier
MEMBARRIER_CMD_PRIVATE_EXPEDITED states that the original need from
membarrier was to have a full barrier between storing to rq->curr and
return to userspace:

commit 22e4ebb9758 ("membarrier: Provide expedited private command")

commit message:

     * Our TSO archs can do RELEASE without being a full barrier. Look at
       x86 spin_unlock() being a regular STORE for example.  But for those
       archs, all atomics imply smp_mb and all of them have atomic ops in
       switch_mm() for mm_cpumask(), and on x86 the CR3 load acts as a full
       barrier.
     
     * From all weakly ordered machines, only ARM64 and PPC can do RELEASE,
       the rest does indeed do smp_mb(), so there the spin_unlock() is a full
       barrier and we're good.
     
     * ARM64 has a very heavy barrier in switch_to(), which suffices.
     
     * PPC just removed its barrier from switch_to(), but appears to be
       talking about adding something to switch_mm(). So add a
       smp_mb__after_unlock_lock() for now, until this is settled on the PPC
       side.

associated code:

+               /*
+                * The membarrier system call requires each architecture
+                * to have a full memory barrier after updating
+                * rq->curr, before returning to user-space. For TSO
+                * (e.g. x86), the architecture must provide its own
+                * barrier in switch_mm(). For weakly ordered machines
+                * for which spin_unlock() acts as a full memory
+                * barrier, finish_lock_switch() in common code takes
+                * care of this barrier. For weakly ordered machines for
+                * which spin_unlock() acts as a RELEASE barrier (only
+                * arm64 and PowerPC), arm64 has a full barrier in
+                * switch_to(), and PowerPC has
+                * smp_mb__after_unlock_lock() before
+                * finish_lock_switch().
+                */

Which got updated to this by

commit 306e060435d ("membarrier: Document scheduler barrier requirements")

                 /*
                  * The membarrier system call requires each architecture
                  * to have a full memory barrier after updating
+                * rq->curr, before returning to user-space.
+                *
+                * Here are the schemes providing that barrier on the
+                * various architectures:
+                * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
+                *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
+                * - finish_lock_switch() for weakly-ordered
+                *   architectures where spin_unlock is a full barrier,
+                * - switch_to() for arm64 (weakly-ordered, spin_unlock
+                *   is a RELEASE barrier),
                  */

However, rseq mm_cid has stricter requirements: the barrier needs to be
issued between store to rq->curr and switch_mm_cid(), which happens
earlier than:

- spin_unlock(),
- switch_to().

So it's fine when the architecture switch_mm happens to have that barrier
already, but less so when the architecture only provides the full barrier
in switch_to() or spin_unlock().

The issue is therefore not specific to arm64, it's actually a bug in the
rseq switch_mm_cid() implementation. All architectures that don't have
memory barriers in switch_mm(), but rather have the full barrier either in
finish_lock_switch() or switch_to() have them too late for the needs of
switch_mm_cid().
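
The difference between the two requirements can be written down as a toy
model. This is only an illustration: the enum and both predicates are
hypothetical names, and the step order is the one described above (store
to rq->curr, switch_mm(), switch_mm_cid(), switch_to(), then the
spin_unlock() in finish_lock_switch()).

```c
#include <assert.h>
#include <stdbool.h>

/* Steps of a user -> user context switch, in execution order. */
enum ctx_switch_step {
	STEP_STORE_RQ_CURR,      /* rq->curr = next */
	STEP_SWITCH_MM,          /* switch_mm() */
	STEP_SWITCH_MM_CID,      /* switch_mm_cid() reads the cid state */
	STEP_SWITCH_TO,          /* switch_to(), where arm64 has its barrier */
	STEP_FINISH_LOCK_SWITCH, /* spin_unlock() of the rq lock */
};

static_assert(STEP_SWITCH_MM < STEP_SWITCH_MM_CID,
	      "switch_mm() must be listed before switch_mm_cid()");

/* membarrier: any full barrier after the rq->curr store, before user space. */
bool barrier_ok_for_membarrier(enum ctx_switch_step barrier_at)
{
	return barrier_at > STEP_STORE_RQ_CURR;
}

/* mm_cid: the barrier must also come before switch_mm_cid(). */
bool barrier_ok_for_mm_cid(enum ctx_switch_step barrier_at)
{
	return barrier_at > STEP_STORE_RQ_CURR &&
	       barrier_at < STEP_SWITCH_MM_CID;
}
```

In this model a barrier in switch_to() or finish_lock_switch() satisfies
barrier_ok_for_membarrier() but not barrier_ok_for_mm_cid(), while a
barrier in switch_mm() satisfies both.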

I would recommend one of three approaches here:

A) Add smp_mb() in switch_mm_cid() for all architectures that lack that
    barrier in switch_mm().

B) Figure out if we can move switch_mm_cid() further down in the scheduler
    without breaking anything (within switch_to(), at the very end of
    finish_lock_switch() for instance). I'm not sure we can do that though
    because switch_mm_cid() touches the "prev" which is tricky after
    switch_to().

C) Add barriers in switch_mm() within all architectures that are missing it.
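
To make A) concrete, here is a rough user-space sketch, not a patch. It
assumes a hypothetical per-architecture macro,
ARCH_SWITCH_MM_HAS_FULL_BARRIER, that an architecture would define when
its switch_mm() already implies a full barrier. The seq_cst fence stands
in for smp_mb(), and switch_mm_cid_sketch() is an illustrative name only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Hypothetical: defined by architectures whose switch_mm() already
 * provides a full barrier, so no extra barrier is paid for there.
 */
#ifdef ARCH_SWITCH_MM_HAS_FULL_BARRIER
#define switch_mm_cid_barrier() do { } while (0)
#else
/* Stand-in for smp_mb() in this user-space sketch. */
#define switch_mm_cid_barrier() atomic_thread_fence(memory_order_seq_cst)
#endif

/*
 * Illustrative stand-in for the start of switch_mm_cid(): order the
 * earlier rq->curr store before the load of the cid state.
 * Returns the observed lazy_put flag.
 */
int switch_mm_cid_sketch(atomic_int *cid_lazy_put)
{
	assert(cid_lazy_put != NULL);

	switch_mm_cid_barrier();
	return atomic_load_explicit(cid_lazy_put, memory_order_relaxed);
}
```

The point of the conditional macro is that architectures which already
have the barrier in switch_mm() would not pay for a second one.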

Thoughts ?

Thanks,

Mathieu


> 
> Signed-off-by: levi.yun <yeoreum.yun@arm.com>
> Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid")
> Cc: <stable@vger.kernel.org> # 6.4.x
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Aaron Lu <aaron.lu@intel.com>
> ---
>   I'm really sorry if you got this multiple times.
>   I had some problems with the SMTP server...
> 
>   arch/arm64/mm/context.c | 5 +++++
>   1 file changed, 5 insertions(+)
> 
> diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
> index 188197590fc9..7a9e8e6647a0 100644
> --- a/arch/arm64/mm/context.c
> +++ b/arch/arm64/mm/context.c
> @@ -268,6 +268,11 @@ void check_and_switch_context(struct mm_struct *mm)
>   	 */
>   	if (!system_uses_ttbr0_pan())
>   		cpu_switch_mm(mm->pgd, mm);
> +
> +	/*
> +	 * See the comments on switch_mm_cid describing user -> user transition.
> +	 */
> +	smp_mb();
>   }
> 
>   unsigned long arm64_mm_context_get(struct mm_struct *mm)
> --
> LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}
>
levi.yun March 5, 2024, 9:07 p.m. UTC | #4
Hi Mathieu!

On 05/03/2024 20:01, Mathieu Desnoyers wrote:
> On 2024-03-05 09:53, levi.yun wrote:
>> Currently arm64's switch_mm() doesn't always have an smp_mb()
>> which the core scheduler code has depended upon since commit:
>>
>>      commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")
>>
>> If switch_mm() doesn't call smp_mb(), sched_mm_cid_remote_clear()
>> can unset an actively used cid when it fails to observe the active task
>> after it sets lazy_put.
>>
>> Add an smp_mb() to arm64's check_and_switch_context() to guarantee that
>> the active task is observed after sched_mm_cid_remote_clear() succeeds
>> in setting lazy_put.
>
> This comment from the original implementation of membarrier
> MEMBARRIER_CMD_PRIVATE_EXPEDITED states that the original need from
> membarrier was to have a full barrier between storing to rq->curr and
> return to userspace:
>
> commit 22e4ebb9758 ("membarrier: Provide expedited private command")
>
> commit message:
>
>     * Our TSO archs can do RELEASE without being a full barrier. Look at
>       x86 spin_unlock() being a regular STORE for example.  But for those
>       archs, all atomics imply smp_mb and all of them have atomic ops in
>       switch_mm() for mm_cpumask(), and on x86 the CR3 load acts as a full
>       barrier.
>
>     * From all weakly ordered machines, only ARM64 and PPC can do RELEASE,
>       the rest does indeed do smp_mb(), so there the spin_unlock() is a full
>       barrier and we're good.
>
>     * ARM64 has a very heavy barrier in switch_to(), which suffices.
>
>     * PPC just removed its barrier from switch_to(), but appears to be
>       talking about adding something to switch_mm(). So add a
>       smp_mb__after_unlock_lock() for now, until this is settled on the PPC
>       side.
>
> associated code:
>
> +               /*
> +                * The membarrier system call requires each architecture
> +                * to have a full memory barrier after updating
> +                * rq->curr, before returning to user-space. For TSO
> +                * (e.g. x86), the architecture must provide its own
> +                * barrier in switch_mm(). For weakly ordered machines
> +                * for which spin_unlock() acts as a full memory
> +                * barrier, finish_lock_switch() in common code takes
> +                * care of this barrier. For weakly ordered machines for
> +                * which spin_unlock() acts as a RELEASE barrier (only
> +                * arm64 and PowerPC), arm64 has a full barrier in
> +                * switch_to(), and PowerPC has
> +                * smp_mb__after_unlock_lock() before
> +                * finish_lock_switch().
> +                */
>
> Which got updated to this by
>
> commit 306e060435d ("membarrier: Document scheduler barrier requirements")
>
>                 /*
>                  * The membarrier system call requires each architecture
>                  * to have a full memory barrier after updating
> +                * rq->curr, before returning to user-space.
> +                *
> +                * Here are the schemes providing that barrier on the
> +                * various architectures:
> +                * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
> +                *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
> +                * - finish_lock_switch() for weakly-ordered
> +                *   architectures where spin_unlock is a full barrier,
> +                * - switch_to() for arm64 (weakly-ordered, spin_unlock
> +                *   is a RELEASE barrier),
>                  */
>
> However, rseq mm_cid has stricter requirements: the barrier needs to be
> issued between store to rq->curr and switch_mm_cid(), which happens
> earlier than:
>
> - spin_unlock(),
> - switch_to().
>
> So it's fine when the architecture switch_mm happens to have that barrier
> already, but less so when the architecture only provides the full barrier
> in switch_to() or spin_unlock().
>
> The issue is therefore not specific to arm64, it's actually a bug in the
> rseq switch_mm_cid() implementation. All architectures that don't have
> memory barriers in switch_mm(), but rather have the full barrier either in
> finish_lock_switch() or switch_to() have them too late for the needs of
> switch_mm_cid().

Thanks for the detailed explanation!


>
> I would recommend one of three approaches here:
>
> A) Add smp_mb() in switch_mm_cid() for all architectures that lack that
>    barrier in switch_mm().
>
> B) Figure out if we can move switch_mm_cid() further down in the scheduler
>    without breaking anything (within switch_to(), at the very end of
>    finish_lock_switch() for instance). I'm not sure we can do that though
>    because switch_mm_cid() touches the "prev" which is tricky after
>    switch_to().
>
> C) Add barriers in switch_mm() within all architectures that are missing it.
>
> Thoughts ?
IMHO, A) looks good to me.

As for B), if it assumes that spin_unlock() of rq->lock is a full
memory barrier, I'm not sure that holds for architectures that use
queued_spin_unlock().

Looking at the implementation of queued_spin_unlock(), it uses
smp_store_release(). And going by the MULTICOPY ATOMICITY section of
memory-barriers.txt, even if smp_mb__after_atomic() is implemented with
smp_mb(), the other side might still fail to observe the store.

Am I wrong?

Many thanks!

Patch

diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
index 188197590fc9..7a9e8e6647a0 100644
--- a/arch/arm64/mm/context.c
+++ b/arch/arm64/mm/context.c
@@ -268,6 +268,11 @@  void check_and_switch_context(struct mm_struct *mm)
 	 */
 	if (!system_uses_ttbr0_pan())
 		cpu_switch_mm(mm->pgd, mm);
+
+	/*
+	 * See the comments on switch_mm_cid describing user -> user transition.
+	 */
+	smp_mb();
 }

 unsigned long arm64_mm_context_get(struct mm_struct *mm)