
[01/23] membarrier: Document why membarrier() works

Message ID d64b6651fe8799481c6204e43b17f81010018345.1641659630.git.luto@kernel.org (mailing list archive)
State New
Series mm, sched: Rework lazy mm handling

Commit Message

Andy Lutomirski Jan. 8, 2022, 4:43 p.m. UTC
We had a nice comment at the top of membarrier.c explaining why membarrier
worked in a handful of scenarios, but that consisted more of a list of
things not to forget than an actual description of the algorithm and why it
should be expected to work.

Add a comment explaining my understanding of the algorithm.  This exposes a
couple of implementation issues that I will hopefully fix up in subsequent
patches.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 kernel/sched/membarrier.c | 60 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 58 insertions(+), 2 deletions(-)

Comments

Mathieu Desnoyers Jan. 12, 2022, 3:30 p.m. UTC | #1
----- On Jan 8, 2022, at 11:43 AM, Andy Lutomirski luto@kernel.org wrote:

> We had a nice comment at the top of membarrier.c explaining why membarrier
> worked in a handful of scenarios, but that consisted more of a list of
> things not to forget than an actual description of the algorithm and why it
> should be expected to work.
> 
> Add a comment explaining my understanding of the algorithm.  This exposes a
> couple of implementation issues that I will hopefully fix up in subsequent
> patches.

Given that no explanation about the specific implementation issues is provided
here, I would be tempted to remove the last sentence above, and keep that for
the commit messages of the subsequent patches.

The explanation you add here is clear and very much fits my mental model, thanks!

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> 
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
> kernel/sched/membarrier.c | 60 +++++++++++++++++++++++++++++++++++++--
> 1 file changed, 58 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> index b5add64d9698..30e964b9689d 100644
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -7,8 +7,64 @@
> #include "sched.h"
> 
> /*
> - * For documentation purposes, here are some membarrier ordering
> - * scenarios to keep in mind:
> + * The basic principle behind the regular memory barrier mode of
> + * membarrier() is as follows.  membarrier() is called in one thread.  It
> + * iterates over all CPUs, and, for each CPU, it either sends an IPI to
> + * that CPU or it does not. If it sends an IPI, then we have the
> + * following sequence of events:
> + *
> + * 1. membarrier() does smp_mb().
> + * 2. membarrier() does a store (the IPI request payload) that is observed by
> + *    the target CPU.
> + * 3. The target CPU does smp_mb().
> + * 4. The target CPU does a store (the completion indication) that is observed
> + *    by membarrier()'s wait-for-IPIs-to-finish request.
> + * 5. membarrier() does smp_mb().
> + *
> + * So all pre-membarrier() local accesses are visible after the IPI on the
> + * target CPU and all pre-IPI remote accesses are visible after
> + * membarrier(). IOW membarrier() has synchronized both ways with the target
> + * CPU.
> + *
> + * (This has the caveat that membarrier() does not interrupt the CPU that it's
> + * running on at the time it sends the IPIs. However, if that is the CPU on
> + * which membarrier() starts and/or finishes, membarrier() does smp_mb() and,
> + * if not, then the scheduler's migration of membarrier() is a full barrier.)
> + *
> + * membarrier() skips sending an IPI only if membarrier() sees
> + * cpu_rq(cpu)->curr->mm != target mm.  The sequence of events is:
> + *
> + *           membarrier()            |          target CPU
> + * ---------------------------------------------------------------------
> + *                                   | 1. smp_mb()
> + *                                   | 2. set rq->curr->mm = other_mm
> + *                                   |    (by writing to ->curr or to ->mm)
> + * 3. smp_mb()                       |
> + * 4. read rq->curr->mm == other_mm  |
> + * 5. smp_mb()                       |
> + *                                   | 6. rq->curr->mm = target_mm
> + *                                   |    (by writing to ->curr or to ->mm)
> + *                                   | 7. smp_mb()
> + *                                   |
> + *
> + * All memory accesses on the target CPU prior to scheduling are visible
> + * to membarrier()'s caller after membarrier() returns due to steps 1, 2, 4
> + * and 5.
> + *
> + * All memory accesses by membarrier()'s caller prior to membarrier() are
> + * visible to the target CPU after scheduling due to steps 3, 4, 6, and 7.
> + *
> + * Note that tasks can change their ->mm, e.g. via kthread_use_mm().  So
> + * tasks that switch their ->mm must follow the same rules as the scheduler
> + * changing rq->curr, and the membarrier() code needs to do both dereferences
> + * carefully.
> + *
> + * GLOBAL_EXPEDITED support works the same way except that all references
> + * to rq->curr->mm are replaced with references to rq->membarrier_state.
> + *
> + *
> + * Specific examples of how this produces the documented properties of
> + * membarrier():
>  *
>  * A) Userspace thread execution after IPI vs membarrier's memory
>  *    barrier before sending the IPI
> --
> 2.33.1
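
To make the ordering argument in the comment above concrete, here is a minimal userspace analogue of steps 1-5 and of the two-way visibility claim. It is only an illustrative sketch, not kernel code: C11 seq_cst fences stand in for smp_mb(), relaxed atomic flags stand in for the IPI request and completion stores, and all names below are hypothetical.

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int caller_data;              /* written before "membarrier()" */
static int target_data;              /* written before the "IPI" runs */
static atomic_int ipi_requested;     /* step 2: the IPI request payload */
static atomic_int ipi_completed;     /* step 4: the completion indication */

/* Plays the role of the IPI handler on the target CPU. */
static void *target_cpu(void *arg)
{
	(void)arg;
	target_data = 1;                              /* pre-IPI remote access */
	while (!atomic_load_explicit(&ipi_requested, memory_order_relaxed))
		;                                     /* observe step 2's store */
	atomic_thread_fence(memory_order_seq_cst);    /* step 3: smp_mb() */
	assert(caller_data == 1);                     /* caller's write is visible */
	atomic_store_explicit(&ipi_completed, 1, memory_order_relaxed);  /* step 4 */
	return NULL;
}

/* Plays the role of the membarrier() caller. */
static void caller(void)
{
	caller_data = 1;                              /* pre-membarrier() access */
	atomic_thread_fence(memory_order_seq_cst);    /* step 1: smp_mb() */
	atomic_store_explicit(&ipi_requested, 1, memory_order_relaxed);  /* step 2 */
	while (!atomic_load_explicit(&ipi_completed, memory_order_relaxed))
		;                                     /* wait for IPIs to finish */
	atomic_thread_fence(memory_order_seq_cst);    /* step 5: smp_mb() */
	assert(target_data == 1);                     /* target's write is visible */
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, target_cpu, NULL);
	caller();
	pthread_join(&t, NULL);
	return 0;
}

Both asserts hold because each pair of fences synchronizes through the flag store and load between them, which is exactly the pairing the documented steps rely on.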

Patch

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index b5add64d9698..30e964b9689d 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -7,8 +7,64 @@ 
 #include "sched.h"
 
 /*
- * For documentation purposes, here are some membarrier ordering
- * scenarios to keep in mind:
+ * The basic principle behind the regular memory barrier mode of
+ * membarrier() is as follows.  membarrier() is called in one thread.  It
+ * iterates over all CPUs, and, for each CPU, it either sends an IPI to
+ * that CPU or it does not. If it sends an IPI, then we have the
+ * following sequence of events:
+ *
+ * 1. membarrier() does smp_mb().
+ * 2. membarrier() does a store (the IPI request payload) that is observed by
+ *    the target CPU.
+ * 3. The target CPU does smp_mb().
+ * 4. The target CPU does a store (the completion indication) that is observed
+ *    by membarrier()'s wait-for-IPIs-to-finish request.
+ * 5. membarrier() does smp_mb().
+ *
+ * So all pre-membarrier() local accesses are visible after the IPI on the
+ * target CPU and all pre-IPI remote accesses are visible after
+ * membarrier(). IOW membarrier() has synchronized both ways with the target
+ * CPU.
+ *
+ * (This has the caveat that membarrier() does not interrupt the CPU that it's
+ * running on at the time it sends the IPIs. However, if that is the CPU on
+ * which membarrier() starts and/or finishes, membarrier() does smp_mb() and,
+ * if not, then the scheduler's migration of membarrier() is a full barrier.)
+ *
+ * membarrier() skips sending an IPI only if membarrier() sees
+ * cpu_rq(cpu)->curr->mm != target mm.  The sequence of events is:
+ *
+ *           membarrier()            |          target CPU
+ * ---------------------------------------------------------------------
+ *                                   | 1. smp_mb()
+ *                                   | 2. set rq->curr->mm = other_mm
+ *                                   |    (by writing to ->curr or to ->mm)
+ * 3. smp_mb()                       |
+ * 4. read rq->curr->mm == other_mm  |
+ * 5. smp_mb()                       |
+ *                                   | 6. rq->curr->mm = target_mm
+ *                                   |    (by writing to ->curr or to ->mm)
+ *                                   | 7. smp_mb()
+ *                                   |
+ *
+ * All memory accesses on the target CPU prior to scheduling are visible
+ * to membarrier()'s caller after membarrier() returns due to steps 1, 2, 4
+ * and 5.
+ *
+ * All memory accesses by membarrier()'s caller prior to membarrier() are
+ * visible to the target CPU after scheduling due to steps 3, 4, 6, and 7.
+ *
+ * Note that tasks can change their ->mm, e.g. via kthread_use_mm().  So
+ * tasks that switch their ->mm must follow the same rules as the scheduler
+ * changing rq->curr, and the membarrier() code needs to do both dereferences
+ * carefully.
+ *
+ * GLOBAL_EXPEDITED support works the same way except that all references
+ * to rq->curr->mm are replaced with references to rq->membarrier_state.
+ *
+ *
+ * Specific examples of how this produces the documented properties of
+ * membarrier():
  *
  * A) Userspace thread execution after IPI vs membarrier's memory
  *    barrier before sending the IPI
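
As a usage-side illustration of the property the comment documents (every thread of the calling process that is running user code has executed a full memory barrier by the time the call returns), the sketch below issues the private expedited command via a raw syscall(). The command names come from <linux/membarrier.h>; per the membarrier(2) interface, MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED must succeed before MEMBARRIER_CMD_PRIVATE_EXPEDITED can be used. The small membarrier() wrapper is a local helper for the example, not a libc function, and none of this is part of the patch.

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

/* Local helper: there is no glibc wrapper for the membarrier syscall. */
static int membarrier(int cmd, unsigned int flags, int cpu_id)
{
	return syscall(__NR_membarrier, cmd, flags, cpu_id);
}

int main(void)
{
	/* Registration is required before PRIVATE_EXPEDITED can be used. */
	if (membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0, 0)) {
		perror("membarrier register");
		return 1;
	}

	/*
	 * Slow path: once this returns, every running thread of this
	 * process has gone through a full memory barrier, either via the
	 * IPI or via the scheduler-side barriers described above.
	 */
	if (membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0, 0)) {
		perror("membarrier");
		return 1;
	}

	printf("all running threads have executed a full barrier\n");
	return 0;
}

The payoff is asymmetry: the thread issuing the syscall pays for the IPIs (or the scheduler-side barriers) so that the other threads' fast paths can rely on compiler barriers alone.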