Message ID: 20200710015646.2020871-5-npiggin@gmail.com (mailing list archive)
State: New, archived
Series: mmu context cleanup, lazy tlb cleanup
On Fri, Jul 10, 2020 at 11:56:43AM +1000, Nicholas Piggin wrote: > And get rid of the generic sync_core_before_usermode facility. > > This helper is the wrong way around I think. The idea that membarrier > state requires a core sync before returning to user is the easy one > that does not need hiding behind membarrier calls. The gap in core > synchronization due to x86's sysret/sysexit and lazy tlb mode, is the > tricky detail that is better put in x86 lazy tlb code. > > Consider if an arch did not synchronize core in switch_mm either, then > membarrier_mm_sync_core_before_usermode would be in the wrong place > but arch specific mmu context functions would still be the right place. > There is also a exit_lazy_tlb case that is not covered by this call, which > could be a bugs (kthread use mm the membarrier process's mm then context > switch back to the process without switching mm or lazy mm switch). > > This makes lazy tlb code a bit more modular. Hurmph, I know I've been staring at this at some point. I think I meant to have a TIF to force the IRET path in the case of MEMBAR_SYNC_CORE. But I was discouraged by amluto.
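A note on the alternative Peter alludes to here (and Andy suggests again later in the thread): instead of hooking the lazy-TLB transitions, membarrier could set a thread flag on the targeted tasks and let the exit-to-usermode path do the serialization. The sketch below is purely illustrative; TIF_SYNC_CORE and the helpers around it are hypothetical names and exist neither in this series nor in mainline.

#include <linux/sched.h>
#include <linux/thread_info.h>

/*
 * Hypothetical sketch only: a TIF-flag variant of SYNC_CORE handling.
 * TIF_SYNC_CORE is an invented name, not real kernel code.
 */

/* membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE) side: */
static void membarrier_mark_task(struct task_struct *t)
{
	/* Ask the task to core-serialize before it next runs user code. */
	set_tsk_thread_flag(t, TIF_SYNC_CORE);
}

/* x86 exit-to-usermode path: */
static void exit_to_usermode_sync_core(void)
{
	/* Either sync_core() here, or force the slower IRET return path. */
	if (test_and_clear_thread_flag(TIF_SYNC_CORE))
		sync_core();
}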
----- On Jul 9, 2020, at 9:56 PM, Nicholas Piggin npiggin@gmail.com wrote: > And get rid of the generic sync_core_before_usermode facility. > > This helper is the wrong way around I think. The idea that membarrier > state requires a core sync before returning to user is the easy one > that does not need hiding behind membarrier calls. The gap in core > synchronization due to x86's sysret/sysexit and lazy tlb mode, is the > tricky detail that is better put in x86 lazy tlb code. > > Consider if an arch did not synchronize core in switch_mm either, then > membarrier_mm_sync_core_before_usermode would be in the wrong place > but arch specific mmu context functions would still be the right place. > There is also a exit_lazy_tlb case that is not covered by this call, which > could be a bugs (kthread use mm the membarrier process's mm then context > switch back to the process without switching mm or lazy mm switch). > > This makes lazy tlb code a bit more modular. I agree that moving this logic to exit_lazy_tlb is much more modular and cleaner. Thanks, Mathieu > > Signed-off-by: Nicholas Piggin <npiggin@gmail.com> > --- > .../membarrier-sync-core/arch-support.txt | 6 +++- > arch/x86/include/asm/mmu_context.h | 35 +++++++++++++++++++ > arch/x86/include/asm/sync_core.h | 28 --------------- > include/linux/sched/mm.h | 14 -------- > include/linux/sync_core.h | 21 ----------- > kernel/cpu.c | 4 ++- > kernel/kthread.c | 2 +- > kernel/sched/core.c | 16 ++++----- > 8 files changed, 51 insertions(+), 75 deletions(-) > delete mode 100644 arch/x86/include/asm/sync_core.h > delete mode 100644 include/linux/sync_core.h > > diff --git a/Documentation/features/sched/membarrier-sync-core/arch-support.txt > b/Documentation/features/sched/membarrier-sync-core/arch-support.txt > index 52ad74a25f54..bd43fb1f5986 100644 > --- a/Documentation/features/sched/membarrier-sync-core/arch-support.txt > +++ b/Documentation/features/sched/membarrier-sync-core/arch-support.txt > @@ -5,6 +5,10 @@ > # > # Architecture requirements > # > +# If your architecture returns to user-space through non-core-serializing > +# instructions, you need to ensure these are done in switch_mm and > exit_lazy_tlb > +# (if lazy tlb switching is implemented). > +# > # * arm/arm64/powerpc > # > # Rely on implicit context synchronization as a result of exception return > @@ -24,7 +28,7 @@ > # instead on write_cr3() performed by switch_mm() to provide core serialization > # after changing the current mm, and deal with the special case of kthread -> > # uthread (temporarily keeping current mm into active_mm) by issuing a > -# sync_core_before_usermode() in that specific case. > +# serializing instruction in exit_lazy_mm() in that specific case. > # > ----------------------- > | arch |status| > diff --git a/arch/x86/include/asm/mmu_context.h > b/arch/x86/include/asm/mmu_context.h > index 255750548433..5263863a9be8 100644 > --- a/arch/x86/include/asm/mmu_context.h > +++ b/arch/x86/include/asm/mmu_context.h > @@ -6,6 +6,7 @@ > #include <linux/atomic.h> > #include <linux/mm_types.h> > #include <linux/pkeys.h> > +#include <linux/sched/mm.h> > > #include <trace/events/tlb.h> > > @@ -95,6 +96,40 @@ static inline void switch_ldt(struct mm_struct *prev, struct > mm_struct *next) > #define enter_lazy_tlb enter_lazy_tlb > extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk); > > +#ifdef CONFIG_MEMBARRIER > +/* > + * Ensure that a core serializing instruction is issued before returning > + * to user-mode, if a SYNC_CORE was requested. 
x86 implements return to > + * user-space through sysexit, sysrel, and sysretq, which are not core > + * serializing. > + * > + * See the membarrier comment in finish_task_switch as to why this is done > + * in exit_lazy_tlb. > + */ > +#define exit_lazy_tlb exit_lazy_tlb > +static inline void exit_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) > +{ > + /* Switching mm is serializing with write_cr3 */ > + if (tsk->mm != mm) > + return; > + > + if (likely(!(atomic_read(&mm->membarrier_state) & > + MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE))) > + return; > + > + /* With PTI, we unconditionally serialize before running user code. */ > + if (static_cpu_has(X86_FEATURE_PTI)) > + return; > + /* > + * Return from interrupt and NMI is done through iret, which is core > + * serializing. > + */ > + if (in_irq() || in_nmi()) > + return; > + sync_core(); > +} > +#endif > + > /* > * Init a new mm. Used on mm copies, like at fork() > * and on mm's that are brand-new, like at execve(). > diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h > deleted file mode 100644 > index c67caafd3381..000000000000 > --- a/arch/x86/include/asm/sync_core.h > +++ /dev/null > @@ -1,28 +0,0 @@ > -/* SPDX-License-Identifier: GPL-2.0 */ > -#ifndef _ASM_X86_SYNC_CORE_H > -#define _ASM_X86_SYNC_CORE_H > - > -#include <linux/preempt.h> > -#include <asm/processor.h> > -#include <asm/cpufeature.h> > - > -/* > - * Ensure that a core serializing instruction is issued before returning > - * to user-mode. x86 implements return to user-space through sysexit, > - * sysrel, and sysretq, which are not core serializing. > - */ > -static inline void sync_core_before_usermode(void) > -{ > - /* With PTI, we unconditionally serialize before running user code. */ > - if (static_cpu_has(X86_FEATURE_PTI)) > - return; > - /* > - * Return from interrupt and NMI is done through iret, which is core > - * serializing. 
> - */ > - if (in_irq() || in_nmi()) > - return; > - sync_core(); > -} > - > -#endif /* _ASM_X86_SYNC_CORE_H */ > diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h > index 480a4d1b7dd8..9b026264b445 100644 > --- a/include/linux/sched/mm.h > +++ b/include/linux/sched/mm.h > @@ -7,7 +7,6 @@ > #include <linux/sched.h> > #include <linux/mm_types.h> > #include <linux/gfp.h> > -#include <linux/sync_core.h> > > /* > * Routines for handling mm_structs > @@ -364,16 +363,6 @@ enum { > #include <asm/membarrier.h> > #endif > > -static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct > *mm) > -{ > - if (current->mm != mm) > - return; > - if (likely(!(atomic_read(&mm->membarrier_state) & > - MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE))) > - return; > - sync_core_before_usermode(); > -} > - > extern void membarrier_exec_mmap(struct mm_struct *mm); > > #else > @@ -387,9 +376,6 @@ static inline void membarrier_arch_switch_mm(struct > mm_struct *prev, > static inline void membarrier_exec_mmap(struct mm_struct *mm) > { > } > -static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct > *mm) > -{ > -} > #endif > > #endif /* _LINUX_SCHED_MM_H */ > diff --git a/include/linux/sync_core.h b/include/linux/sync_core.h > deleted file mode 100644 > index 013da4b8b327..000000000000 > --- a/include/linux/sync_core.h > +++ /dev/null > @@ -1,21 +0,0 @@ > -/* SPDX-License-Identifier: GPL-2.0 */ > -#ifndef _LINUX_SYNC_CORE_H > -#define _LINUX_SYNC_CORE_H > - > -#ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE > -#include <asm/sync_core.h> > -#else > -/* > - * This is a dummy sync_core_before_usermode() implementation that can be used > - * on all architectures which return to user-space through core serializing > - * instructions. > - * If your architecture returns to user-space through non-core-serializing > - * instructions, you need to write your own functions. > - */ > -static inline void sync_core_before_usermode(void) > -{ > -} > -#endif > - > -#endif /* _LINUX_SYNC_CORE_H */ > - > diff --git a/kernel/cpu.c b/kernel/cpu.c > index 6ff2578ecf17..134688d79589 100644 > --- a/kernel/cpu.c > +++ b/kernel/cpu.c > @@ -572,7 +572,9 @@ static int finish_cpu(unsigned int cpu) > > /* > * idle_task_exit() will have switched to &init_mm, now > - * clean up any remaining active_mm state. > + * clean up any remaining active_mm state. exit_lazy_tlb > + * is not done, if an arch did any accounting in these > + * functions it would have to be added. > */ > if (mm != &init_mm) > idle->active_mm = &init_mm; > diff --git a/kernel/kthread.c b/kernel/kthread.c > index e813d92f2eab..6f93c649aa97 100644 > --- a/kernel/kthread.c > +++ b/kernel/kthread.c > @@ -1251,9 +1251,9 @@ void kthread_use_mm(struct mm_struct *mm) > finish_arch_post_lock_switch(); > #endif > > + exit_lazy_tlb(active_mm, tsk); > if (active_mm != mm) > mmdrop(active_mm); > - exit_lazy_tlb(active_mm, tsk); > > to_kthread(tsk)->oldfs = get_fs(); > set_fs(USER_DS); > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index debc917bc69b..31e22c79826c 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -3294,22 +3294,19 @@ static struct rq *finish_task_switch(struct task_struct > *prev) > kcov_finish_switch(current); > > fire_sched_in_preempt_notifiers(current); > + > /* > * When switching through a kernel thread, the loop in > * membarrier_{private,global}_expedited() may have observed that > * kernel thread and not issued an IPI. 
It is therefore possible to > * schedule between user->kernel->user threads without passing though > - * switch_mm(). Membarrier requires a barrier after storing to > - * rq->curr, before returning to userspace, so provide them here: > - * > - * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly > - * provided by mmdrop(), > - * - a sync_core for SYNC_CORE. > + * switch_mm(). Membarrier requires a full barrier after storing to > + * rq->curr, before returning to userspace, for > + * {PRIVATE,GLOBAL}_EXPEDITED. This is implicitly provided by mmdrop(). > */ > - if (mm) { > - membarrier_mm_sync_core_before_usermode(mm); > + if (mm) > mmdrop(mm); > - } > + > if (unlikely(prev_state == TASK_DEAD)) { > if (prev->sched_class->task_dead) > prev->sched_class->task_dead(prev); > @@ -6292,6 +6289,7 @@ void idle_task_exit(void) > BUG_ON(current != this_rq()->idle); > > if (mm != &init_mm) { > + /* enter_lazy_tlb is not done because we're about to go down */ > switch_mm(mm, &init_mm, current); > finish_arch_post_lock_switch(); > } > -- > 2.23.0
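For context when reading the x86 hunk above: the `#define exit_lazy_tlb exit_lazy_tlb` overrides a generic no-op hook added earlier in the series. A minimal sketch of what the asm-generic fallback looks like, assuming the kernel's usual #ifndef override convention (the exact generic header is in an earlier patch and is not shown in this hunk):

/*
 * exit_lazy_tlb - called when an mm that was only being used lazily as
 * active_mm stops being used lazily, either because the CPU switched to
 * another mm or because a kernel thread adopted it with kthread_use_mm.
 * Architectures that need work here (like x86 above) provide their own
 * version and #define exit_lazy_tlb exit_lazy_tlb.
 */
#ifndef exit_lazy_tlb
static inline void exit_lazy_tlb(struct mm_struct *mm,
				 struct task_struct *tsk)
{
}
#endif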
On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin <npiggin@gmail.com> wrote: > > And get rid of the generic sync_core_before_usermode facility. > > This helper is the wrong way around I think. The idea that membarrier > state requires a core sync before returning to user is the easy one > that does not need hiding behind membarrier calls. The gap in core > synchronization due to x86's sysret/sysexit and lazy tlb mode, is the > tricky detail that is better put in x86 lazy tlb code. > > Consider if an arch did not synchronize core in switch_mm either, then > membarrier_mm_sync_core_before_usermode would be in the wrong place > but arch specific mmu context functions would still be the right place. > There is also a exit_lazy_tlb case that is not covered by this call, which > could be a bugs (kthread use mm the membarrier process's mm then context > switch back to the process without switching mm or lazy mm switch). > > This makes lazy tlb code a bit more modular. The mm-switching and TLB-management has often had the regrettable property that it gets wired up in a way that seems to work at the time but doesn't have clear semantics, and I'm a bit concerned that this patch is in that category. If I'm understanding right, you're trying to enforce the property that exiting lazy TLB mode will promise to sync the core eventually. But this has all kinds of odd properties: - Why is exit_lazy_tlb() getting called at all in the relevant cases? When is it permissible to call it? I look at your new code and see: > +/* > + * Ensure that a core serializing instruction is issued before returning > + * to user-mode, if a SYNC_CORE was requested. x86 implements return to > + * user-space through sysexit, sysrel, and sysretq, which are not core > + * serializing. > + * > + * See the membarrier comment in finish_task_switch as to why this is done > + * in exit_lazy_tlb. > + */ > +#define exit_lazy_tlb exit_lazy_tlb > +static inline void exit_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) > +{ > + /* Switching mm is serializing with write_cr3 */ > + if (tsk->mm != mm) > + return; And my brain says WTF? Surely you meant something like if (WARN_ON_ONCE(tsk->mm != mm)) { /* egads, what even happened? how do we try to recover well enough to get a crashed logged at least? */ } So this needs actual documentation, preferably in comments near the function, of what the preconditions are and what this mm parameter is. Once that's done, then we could consider whether it's appropriate to have this function promise to sync the core under some conditions. - This whole structure seems to rely on the idea that switching mm syncs something. I periodically ask chip vendor for non-serializing mm switches. Specifically, in my dream world, we have totally separate user and kernel page tables. Changing out the user tables doesn't serialize or even create a fence. Instead it creates the minimum required pipeline hazard such that user memory access and switches to usermode will make sure they access memory through the correct page tables. I haven't convinced a chip vendor yet, but there are quite a few hundreds of cycles to be saved here. With this in mind, I see the fencing aspects of the TLB handling code as somewhat of an accident. I'm fine with documenting them and using them to optimize other paths, but I think it should be explicit. For example: /* Also does a full barrier? (Or a sync_core()-style barrier.) However, if you rely on this, you must document it in a comment where you call this function. *? 
void switch_mm_irqs_off() { } This is kind of like how we strongly encourage anyone using smp_?mb() to document what they are fencing against. Also, as it stands, I can easily see in_irq() ceasing to promise to serialize. There are older kernels for which it does not promise to serialize. And I have plans to make it stop serializing in the nearish future. --Andy
Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am: > On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin <npiggin@gmail.com> wrote: >> >> And get rid of the generic sync_core_before_usermode facility. >> >> This helper is the wrong way around I think. The idea that membarrier >> state requires a core sync before returning to user is the easy one >> that does not need hiding behind membarrier calls. The gap in core >> synchronization due to x86's sysret/sysexit and lazy tlb mode, is the >> tricky detail that is better put in x86 lazy tlb code. >> >> Consider if an arch did not synchronize core in switch_mm either, then >> membarrier_mm_sync_core_before_usermode would be in the wrong place >> but arch specific mmu context functions would still be the right place. >> There is also a exit_lazy_tlb case that is not covered by this call, which >> could be a bugs (kthread use mm the membarrier process's mm then context >> switch back to the process without switching mm or lazy mm switch). >> >> This makes lazy tlb code a bit more modular. > > The mm-switching and TLB-management has often had the regrettable > property that it gets wired up in a way that seems to work at the time > but doesn't have clear semantics, and I'm a bit concerned that this > patch is in that category. It's much more explicit in the core code about where hooks are called after this patch. And then the x86 membarrier implementation details are contained to the x86 code where they belong, and we don't have the previous hook with unclear semantics missing from core code. > If I'm understanding right, you're trying > to enforce the property that exiting lazy TLB mode will promise to > sync the core eventually. But this has all kinds of odd properties: > > - Why is exit_lazy_tlb() getting called at all in the relevant cases? It's a property of how MEMBARRIER_SYNC_CORE is implemented by arch/x86, see the membarrier comment in finish_task_switch (for analogous reason). > When is it permissible to call it? Comment for the asm-generic code says it's to be called when the lazy active mm becomes non-lazy. > I look at your new code and see: > >> +/* >> + * Ensure that a core serializing instruction is issued before returning >> + * to user-mode, if a SYNC_CORE was requested. x86 implements return to >> + * user-space through sysexit, sysrel, and sysretq, which are not core >> + * serializing. >> + * >> + * See the membarrier comment in finish_task_switch as to why this is done >> + * in exit_lazy_tlb. >> + */ >> +#define exit_lazy_tlb exit_lazy_tlb >> +static inline void exit_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) >> +{ >> + /* Switching mm is serializing with write_cr3 */ >> + if (tsk->mm != mm) >> + return; > > And my brain says WTF? Surely you meant something like if > (WARN_ON_ONCE(tsk->mm != mm)) { /* egads, what even happened? how do > we try to recover well enough to get a crashed logged at least? */ } No, the active mm can be unlazied by switching to a different mm. > So this needs actual documentation, preferably in comments near the > function, of what the preconditions are and what this mm parameter is. > Once that's done, then we could consider whether it's appropriate to > have this function promise to sync the core under some conditions. It's documented in generic code. I prefer not to duplicate comments too much but I can add a "refer to asm-generic version for usage" or something if you'd like? > - This whole structure seems to rely on the idea that switching mm > syncs something. 
Which whole structure? The x86 implementation of sync core explicitly does rely on that, yes. But I've pulled that out of core code with this patch. > I periodically ask chip vendor for non-serializing > mm switches. Specifically, in my dream world, we have totally > separate user and kernel page tables. Changing out the user tables > doesn't serialize or even create a fence. Instead it creates the > minimum required pipeline hazard such that user memory access and > switches to usermode will make sure they access memory through the > correct page tables. I haven't convinced a chip vendor yet, but there > are quite a few hundreds of cycles to be saved here. The fundmaental difficulty is that the kernel can still access user mappings any time after the switch. We can probably handwave ways around it by serializing lazily when encountering the next user access and hoping that most of your mm switches result in a kernel exit that serializes or some other unavoidable serialize so you can avoid the mm switch one. In practice it sounds like a lot of trouble. But anyway the sync core could presumably be adjusted or reasoned to still be correct, depending on how it works. > With this in > mind, I see the fencing aspects of the TLB handling code as somewhat > of an accident. I'm fine with documenting them and using them to > optimize other paths, but I think it should be explicit. For example: > > /* Also does a full barrier? (Or a sync_core()-style barrier.) > However, if you rely on this, you must document it in a comment where > you call this function. *? > void switch_mm_irqs_off() > { > } > > This is kind of like how we strongly encourage anyone using smp_?mb() > to document what they are fencing against. Hmm. I don't think anything outside core scheduler/arch code is allowed to assume that, because they don't really know if schedule() will cause a switch. Hopefully nobody does, I would agree it shouldn't be encouraged. It is pretty fundamental to how we do task CPU migration so I don't see it ever going away. A push model where the source CPU has to release tasks that it last ran before they can be run elsewhere is unworkable. (Or maybe it's not, but no getting around that would require careful audits of said low level code). > Also, as it stands, I can easily see in_irq() ceasing to promise to > serialize. There are older kernels for which it does not promise to > serialize. And I have plans to make it stop serializing in the > nearish future. You mean x86's return from interrupt? Sounds fun... you'll konw where to update the membarrier sync code, at least :) Thanks, Nick
Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm: > Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am: >> Also, as it stands, I can easily see in_irq() ceasing to promise to >> serialize. There are older kernels for which it does not promise to >> serialize. And I have plans to make it stop serializing in the >> nearish future. > > You mean x86's return from interrupt? Sounds fun... you'll konw where to > update the membarrier sync code, at least :) Oh, I should actually say Mathieu recently clarified a return from interrupt doesn't fundamentally need to serialize in order to support membarrier sync core. https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-July/214171.html So you may not need to do anything more if you relaxed it. Thanks, Nick
----- On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npiggin@gmail.com wrote: > Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm: >> Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am: >>> Also, as it stands, I can easily see in_irq() ceasing to promise to >>> serialize. There are older kernels for which it does not promise to >>> serialize. And I have plans to make it stop serializing in the >>> nearish future. >> >> You mean x86's return from interrupt? Sounds fun... you'll konw where to >> update the membarrier sync code, at least :) > > Oh, I should actually say Mathieu recently clarified a return from > interrupt doesn't fundamentally need to serialize in order to support > membarrier sync core. Clarification to your statement: Return from interrupt to kernel code does not need to be context serializing as long as kernel serializes before returning to user-space. However, return from interrupt to user-space needs to be context serializing. Thanks, Mathieu > > https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-July/214171.html > > So you may not need to do anything more if you relaxed it. > > Thanks, > Nick
On Mon, Jul 13, 2020 at 7:13 AM Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > ----- On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npiggin@gmail.com wrote: > > > Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm: > >> Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am: > >>> Also, as it stands, I can easily see in_irq() ceasing to promise to > >>> serialize. There are older kernels for which it does not promise to > >>> serialize. And I have plans to make it stop serializing in the > >>> nearish future. > >> > >> You mean x86's return from interrupt? Sounds fun... you'll konw where to > >> update the membarrier sync code, at least :) > > > > Oh, I should actually say Mathieu recently clarified a return from > > interrupt doesn't fundamentally need to serialize in order to support > > membarrier sync core. > > Clarification to your statement: > > Return from interrupt to kernel code does not need to be context serializing > as long as kernel serializes before returning to user-space. > > However, return from interrupt to user-space needs to be context serializing. > Indeed, and I figured this out on the first read through because I'm quite familiar with the x86 entry code. But Nick somehow missed this, and Nick is the one who wrote the patch. Nick, I think this helps prove my point. The code you're submitting may well be correct, but it's unmaintainable. At the very least, this needs a comment explaining, from the perspective of x86, *exactly* what exit_lazy_tlb() is promising, why it's promising it, how it achieves that promise, and what code cares about it. Or we could do something with TIF flags and make this all less magical, although that will probably end up very slightly slower. --Andy
Excerpts from Andy Lutomirski's message of July 14, 2020 1:48 am: > On Mon, Jul 13, 2020 at 7:13 AM Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: >> >> ----- On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npiggin@gmail.com wrote: >> >> > Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm: >> >> Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am: >> >>> Also, as it stands, I can easily see in_irq() ceasing to promise to >> >>> serialize. There are older kernels for which it does not promise to >> >>> serialize. And I have plans to make it stop serializing in the >> >>> nearish future. >> >> >> >> You mean x86's return from interrupt? Sounds fun... you'll konw where to >> >> update the membarrier sync code, at least :) >> > >> > Oh, I should actually say Mathieu recently clarified a return from >> > interrupt doesn't fundamentally need to serialize in order to support >> > membarrier sync core. >> >> Clarification to your statement: >> >> Return from interrupt to kernel code does not need to be context serializing >> as long as kernel serializes before returning to user-space. >> >> However, return from interrupt to user-space needs to be context serializing. >> > > Indeed, and I figured this out on the first read through because I'm > quite familiar with the x86 entry code. But Nick somehow missed this, > and Nick is the one who wrote the patch. > > Nick, I think this helps prove my point. The code you're submitting > may well be correct, but it's unmaintainable. It's not. The patch I wrote for x86 is a no-op, it just moves existing x86 hook and code that's already there to a different name. Actually it's not quite a no-op, it't changes it to use hooks that are actually called in the right places. Because previously it was unmaintainable from point of view of generic mm -- it was not clear at all that the old one should have been called in other places where the mm goes non-lazy. Now with the exit_lazy_tlb hook, it can quite easily be spotted where it is missing. And x86 keeps their membarrier code in x86, and uses nice well defined lazy tlb mm hooks. > At the very least, this > needs a comment explaining, from the perspective of x86, *exactly* > what exit_lazy_tlb() is promising, why it's promising it, how it > achieves that promise, and what code cares about it. Or we could do > something with TIF flags and make this all less magical, although that > will probably end up very slightly slower. It's all documented there in existing comments plus the asm-generic exit_lazy_tlb specification added AFAIKS. Is the membarrier comment in finish_task_switch plus these ones not enough? Thanks, Nick
Excerpts from Mathieu Desnoyers's message of July 14, 2020 12:13 am: > ----- On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npiggin@gmail.com wrote: > >> Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm: >>> Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am: >>>> Also, as it stands, I can easily see in_irq() ceasing to promise to >>>> serialize. There are older kernels for which it does not promise to >>>> serialize. And I have plans to make it stop serializing in the >>>> nearish future. >>> >>> You mean x86's return from interrupt? Sounds fun... you'll konw where to >>> update the membarrier sync code, at least :) >> >> Oh, I should actually say Mathieu recently clarified a return from >> interrupt doesn't fundamentally need to serialize in order to support >> membarrier sync core. > > Clarification to your statement: > > Return from interrupt to kernel code does not need to be context serializing > as long as kernel serializes before returning to user-space. > > However, return from interrupt to user-space needs to be context serializing. Hmm, I'm not sure it's enough even with the sync in the exit_lazy_tlb in the right places. A kernel thread does a use_mm, then it blocks and the user process with the same mm runs on that CPU, and then it calls into the kernel, blocks, the kernel thread runs again, another CPU issues a membarrier which does not IPI this one because it's running a kthread, and then the kthread switches back to the user process (still without having unused the mm), and then the user process returns from syscall without having done a core synchronising instruction. The cause of the problem is you want to avoid IPI'ing kthreads. Why? I'm guessing it really only matters as an optimisation in case of idle threads. Idle thread is easy (well, easier) because it won't use_mm, so you could check for rq->curr == rq->idle in your loop (in a suitable sched accessor function). But... I'm not really liking this subtlety in the scheduler for all this (the scheduler still needs the barriers when switching out of idle). Can it be improved somehow? Let me forget x86 core sync problem for now (that _may_ be a bit harder), and step back and look at what we're doing. The memory barrier case would actually suffer from the same problem as core sync, because in the same situation it has no implicit mmdrop in the scheduler switch code either. So what are we doing with membarrier? We want any activity caused by the set of CPUs/threads specified that can be observed by this thread before calling membarrier is appropriately fenced from activity that can be observed to happen after the call returns. CPU0 CPU1 1. user stuff a. membarrier() 2. enter kernel b. read rq->curr 3. rq->curr switched to kthread c. is kthread, skip IPI 4. switch_to kthread d. return to user 5. rq->curr switched to user thread 6. switch_to user thread 7. exit kernel 8. more user stuff As far as I can see, the problem is CPU1 might reorder step 5 and step 8, so you have mmdrop of lazy mm be a mb after step 6. But why? The membarrier call only cares that there is a full barrier between 1 and 8, right? Which it will get from the previous context switch to the kthread. I must say the memory barrier comments in membarrier could be improved a bit (unless I'm missing where the main comment is). It's fine to know what barriers pair with one another, but we need to know which exact memory accesses it is ordering /* * Matches memory barriers around rq->curr modification in * scheduler. 
*/ Sure, but it doesn't say what else is being ordered. I think it's just the user memory accesses, but would be nice to make that a bit more explicit. If we had such comments then we might know this case is safe. I think the funny powerpc barrier is a similar case of this. If we ever see remote_rq->curr->flags & PF_KTHREAD, then we _know_ that CPU has or will have issued a memory barrier between running user code. So AFAIKS all this membarrier stuff in kernel/sched/core.c could just go away. Except x86 because thread switch doesn't imply core sync, so CPU1 between 1 and 8 may never issue a core sync instruction the same way a context switch must be a full mb. Before getting to x86 -- Am I right, or way off track here? Thanks, Nick
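For reference, the "c. is kthread, skip IPI" step in the diagram corresponds to the CPU-selection loop in kernel/sched/membarrier.c. Below is a simplified sketch of its shape, not the verbatim kernel code; the PRIVATE and GLOBAL variants differ in the exact check and in how the IPIs are batched.

#include "sched.h"	/* as in kernel/sched/membarrier.c */

/* Simplified sketch of the expedited membarrier CPU-selection loop. */
static void membarrier_select_cpus(struct mm_struct *mm, struct cpumask *tmpmask)
{
	int cpu;

	rcu_read_lock();
	for_each_online_cpu(cpu) {
		struct task_struct *p;

		/* The calling CPU orders itself with smp_mb() around the loop. */
		if (cpu == raw_smp_processor_id())
			continue;

		p = rcu_dereference(cpu_rq(cpu)->curr);
		/*
		 * The skip: a CPU whose rq->curr is a kernel thread
		 * (p->mm == NULL here; PF_KTHREAD for the GLOBAL variant)
		 * gets no IPI, relying instead on the context switch
		 * into/out of the kthread to provide the barrier -- the
		 * step being questioned in this thread.
		 */
		if (p && p->mm == mm)
			__cpumask_set_cpu(cpu, tmpmask);
	}
	rcu_read_unlock();

	/* The caller then runs ipi_mb() -- just smp_mb() -- on tmpmask. */
}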
Excerpts from Nicholas Piggin's message of July 16, 2020 2:15 pm: > Excerpts from Mathieu Desnoyers's message of July 14, 2020 12:13 am: >> ----- On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npiggin@gmail.com wrote: >> >>> Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm: >>>> Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am: >>>>> Also, as it stands, I can easily see in_irq() ceasing to promise to >>>>> serialize. There are older kernels for which it does not promise to >>>>> serialize. And I have plans to make it stop serializing in the >>>>> nearish future. >>>> >>>> You mean x86's return from interrupt? Sounds fun... you'll konw where to >>>> update the membarrier sync code, at least :) >>> >>> Oh, I should actually say Mathieu recently clarified a return from >>> interrupt doesn't fundamentally need to serialize in order to support >>> membarrier sync core. >> >> Clarification to your statement: >> >> Return from interrupt to kernel code does not need to be context serializing >> as long as kernel serializes before returning to user-space. >> >> However, return from interrupt to user-space needs to be context serializing. > > Hmm, I'm not sure it's enough even with the sync in the exit_lazy_tlb > in the right places. > > A kernel thread does a use_mm, then it blocks and the user process with > the same mm runs on that CPU, and then it calls into the kernel, blocks, > the kernel thread runs again, another CPU issues a membarrier which does > not IPI this one because it's running a kthread, and then the kthread > switches back to the user process (still without having unused the mm), > and then the user process returns from syscall without having done a > core synchronising instruction. > > The cause of the problem is you want to avoid IPI'ing kthreads. Why? > I'm guessing it really only matters as an optimisation in case of idle > threads. Idle thread is easy (well, easier) because it won't use_mm, so > you could check for rq->curr == rq->idle in your loop (in a suitable > sched accessor function). > > But... I'm not really liking this subtlety in the scheduler for all this > (the scheduler still needs the barriers when switching out of idle). > > Can it be improved somehow? Let me forget x86 core sync problem for now > (that _may_ be a bit harder), and step back and look at what we're doing. > The memory barrier case would actually suffer from the same problem as > core sync, because in the same situation it has no implicit mmdrop in > the scheduler switch code either. > > So what are we doing with membarrier? We want any activity caused by the > set of CPUs/threads specified that can be observed by this thread before > calling membarrier is appropriately fenced from activity that can be > observed to happen after the call returns.
>
>         CPU0                      CPU1
>                                   1. user stuff
> a. membarrier()                   2. enter kernel
> b. read rq->curr                  3. rq->curr switched to kthread
> c. is kthread, skip IPI           4. switch_to kthread
> d. return to user                 5. rq->curr switched to user thread
>                                   6. switch_to user thread
>                                   7. exit kernel
>                                   8. more user stuff
>
> As far as I can see, the problem is CPU1 might reorder step 5 and step > 8, so you have mmdrop of lazy mm be a mb after step 6. > > But why? The membarrier call only cares that there is a full barrier > between 1 and 8, right? Which it will get from the previous context > switch to the kthread.

I should be more complete here, especially since I was complaining about unclear barrier comment :)

        CPU0                      CPU1
a. user stuff                     1. user stuff
b. membarrier()                   2. enter kernel
c. smp_mb()                       3. smp_mb__after_spinlock(); // in __schedule
d. read rq->curr                  4. rq->curr switched to kthread
e. is kthread, skip IPI           5. switch_to kthread
f. return to user                 6. rq->curr switched to user thread
g. user stuff                     7. switch_to user thread
                                  8. exit kernel
                                  9. more user stuff

What you're really ordering is a, g vs 1, 9 right? In other words, 9 must see a if it sees g, g must see 1 if it saw 9, etc. Userspace does not care where the barriers are exactly or what kernel memory accesses might be being ordered by them, so long as there is a mb somewhere between a and g, and 1 and 9. Right?
> On Jul 15, 2020, at 9:15 PM, Nicholas Piggin <npiggin@gmail.com> wrote: > > Excerpts from Mathieu Desnoyers's message of July 14, 2020 12:13 am: >> ----- On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npiggin@gmail.com wrote: >> >>> Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm: >>>> Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am: >>>>> Also, as it stands, I can easily see in_irq() ceasing to promise to >>>>> serialize. There are older kernels for which it does not promise to >>>>> serialize. And I have plans to make it stop serializing in the >>>>> nearish future. >>>> >>>> You mean x86's return from interrupt? Sounds fun... you'll konw where to >>>> update the membarrier sync code, at least :) >>> >>> Oh, I should actually say Mathieu recently clarified a return from >>> interrupt doesn't fundamentally need to serialize in order to support >>> membarrier sync core. >> >> Clarification to your statement: >> >> Return from interrupt to kernel code does not need to be context serializing >> as long as kernel serializes before returning to user-space. >> >> However, return from interrupt to user-space needs to be context serializing. > > Hmm, I'm not sure it's enough even with the sync in the exit_lazy_tlb > in the right places. > > A kernel thread does a use_mm, then it blocks and the user process with > the same mm runs on that CPU, and then it calls into the kernel, blocks, > the kernel thread runs again, another CPU issues a membarrier which does > not IPI this one because it's running a kthread, and then the kthread > switches back to the user process (still without having unused the mm), > and then the user process returns from syscall without having done a > core synchronising instruction. > > The cause of the problem is you want to avoid IPI'ing kthreads. Why? > I'm guessing it really only matters as an optimisation in case of idle > threads. Idle thread is easy (well, easier) because it won't use_mm, so > you could check for rq->curr == rq->idle in your loop (in a suitable > sched accessor function). > > But... I'm not really liking this subtlety in the scheduler for all this > (the scheduler still needs the barriers when switching out of idle). > > Can it be improved somehow? Let me forget x86 core sync problem for now > (that _may_ be a bit harder), and step back and look at what we're doing. > The memory barrier case would actually suffer from the same problem as > core sync, because in the same situation it has no implicit mmdrop in > the scheduler switch code either. > > So what are we doing with membarrier? We want any activity caused by the > set of CPUs/threads specified that can be observed by this thread before > calling membarrier is appropriately fenced from activity that can be > observed to happen after the call returns. > > CPU0 CPU1 > 1. user stuff > a. membarrier() 2. enter kernel > b. read rq->curr 3. rq->curr switched to kthread > c. is kthread, skip IPI 4. switch_to kthread > d. return to user 5. rq->curr switched to user thread > 6. switch_to user thread > 7. exit kernel > 8. more user stuff > > As far as I can see, the problem is CPU1 might reorder step 5 and step > 8, so you have mmdrop of lazy mm be a mb after step 6. > > But why? The membarrier call only cares that there is a full barrier > between 1 and 8, right? Which it will get from the previous context > switch to the kthread. > > I must say the memory barrier comments in membarrier could be improved > a bit (unless I'm missing where the main comment is). 
It's fine to know > what barriers pair with one another, but we need to know which exact > memory accesses it is ordering > > /* > * Matches memory barriers around rq->curr modification in > * scheduler. > */ > > Sure, but it doesn't say what else is being ordered. I think it's just > the user memory accesses, but would be nice to make that a bit more > explicit. If we had such comments then we might know this case is safe. > > I think the funny powerpc barrier is a similar case of this. If we > ever see remote_rq->curr->flags & PF_KTHREAD, then we _know_ that > CPU has or will have issued a memory barrier between running user > code. > > So AFAIKS all this membarrier stuff in kernel/sched/core.c could > just go away. Except x86 because thread switch doesn't imply core > sync, so CPU1 between 1 and 8 may never issue a core sync instruction > the same way a context switch must be a full mb. > > Before getting to x86 -- Am I right, or way off track here? I find it hard to believe that this is x86 only. Why would thread switch imply core sync on any architecture? Is x86 unique in having a stupid expensive core sync that is heavier than smp_mb()? But I’m wondering if all this deferred sync stuff is wrong. In the brave new world of io_uring and such, perhaps kernel access matter too. Heck, even: int a[2]; Thread A: a[0] = 1; a[1] = 2: Thread B: write(fd, a, sizeof(a)); Doesn’t do what thread A is expecting. Admittedly this particular example is nonsense, but maybe there are sensible cases that matter to someone. —Andy > > Thanks, > Nick
Excerpts from Andy Lutomirski's message of July 16, 2020 3:18 pm: > > >> On Jul 15, 2020, at 9:15 PM, Nicholas Piggin <npiggin@gmail.com> wrote: >> >> Excerpts from Mathieu Desnoyers's message of July 14, 2020 12:13 am: >>> ----- On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npiggin@gmail.com wrote: >>> >>>> Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm: >>>>> Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am: >>>>>> Also, as it stands, I can easily see in_irq() ceasing to promise to >>>>>> serialize. There are older kernels for which it does not promise to >>>>>> serialize. And I have plans to make it stop serializing in the >>>>>> nearish future. >>>>> >>>>> You mean x86's return from interrupt? Sounds fun... you'll konw where to >>>>> update the membarrier sync code, at least :) >>>> >>>> Oh, I should actually say Mathieu recently clarified a return from >>>> interrupt doesn't fundamentally need to serialize in order to support >>>> membarrier sync core. >>> >>> Clarification to your statement: >>> >>> Return from interrupt to kernel code does not need to be context serializing >>> as long as kernel serializes before returning to user-space. >>> >>> However, return from interrupt to user-space needs to be context serializing. >> >> Hmm, I'm not sure it's enough even with the sync in the exit_lazy_tlb >> in the right places. >> >> A kernel thread does a use_mm, then it blocks and the user process with >> the same mm runs on that CPU, and then it calls into the kernel, blocks, >> the kernel thread runs again, another CPU issues a membarrier which does >> not IPI this one because it's running a kthread, and then the kthread >> switches back to the user process (still without having unused the mm), >> and then the user process returns from syscall without having done a >> core synchronising instruction. >> >> The cause of the problem is you want to avoid IPI'ing kthreads. Why? >> I'm guessing it really only matters as an optimisation in case of idle >> threads. Idle thread is easy (well, easier) because it won't use_mm, so >> you could check for rq->curr == rq->idle in your loop (in a suitable >> sched accessor function). >> >> But... I'm not really liking this subtlety in the scheduler for all this >> (the scheduler still needs the barriers when switching out of idle). >> >> Can it be improved somehow? Let me forget x86 core sync problem for now >> (that _may_ be a bit harder), and step back and look at what we're doing. >> The memory barrier case would actually suffer from the same problem as >> core sync, because in the same situation it has no implicit mmdrop in >> the scheduler switch code either. >> >> So what are we doing with membarrier? We want any activity caused by the >> set of CPUs/threads specified that can be observed by this thread before >> calling membarrier is appropriately fenced from activity that can be >> observed to happen after the call returns. >> >> CPU0 CPU1 >> 1. user stuff >> a. membarrier() 2. enter kernel >> b. read rq->curr 3. rq->curr switched to kthread >> c. is kthread, skip IPI 4. switch_to kthread >> d. return to user 5. rq->curr switched to user thread >> 6. switch_to user thread >> 7. exit kernel >> 8. more user stuff >> >> As far as I can see, the problem is CPU1 might reorder step 5 and step >> 8, so you have mmdrop of lazy mm be a mb after step 6. >> >> But why? The membarrier call only cares that there is a full barrier >> between 1 and 8, right? Which it will get from the previous context >> switch to the kthread. 
>> >> I must say the memory barrier comments in membarrier could be improved >> a bit (unless I'm missing where the main comment is). It's fine to know >> what barriers pair with one another, but we need to know which exact >> memory accesses it is ordering >> >> /* >> * Matches memory barriers around rq->curr modification in >> * scheduler. >> */ >> >> Sure, but it doesn't say what else is being ordered. I think it's just >> the user memory accesses, but would be nice to make that a bit more >> explicit. If we had such comments then we might know this case is safe. >> >> I think the funny powerpc barrier is a similar case of this. If we >> ever see remote_rq->curr->flags & PF_KTHREAD, then we _know_ that >> CPU has or will have issued a memory barrier between running user >> code. >> >> So AFAIKS all this membarrier stuff in kernel/sched/core.c could >> just go away. Except x86 because thread switch doesn't imply core >> sync, so CPU1 between 1 and 8 may never issue a core sync instruction >> the same way a context switch must be a full mb. >> >> Before getting to x86 -- Am I right, or way off track here? > > I find it hard to believe that this is x86 only. Why would thread switch imply core sync on any architecture? Is x86 unique in having a stupid expensive core sync that is heavier than smp_mb()? It's not the thread switch but the return from kernel to user -- at least of architectures that implement membarrier SYNC_CORE, x86 can do that without serializing. The thread switch is muddying the waters a bit, it's not the actual thread switch we care about, that just happens to be used as a point where we try to catch the membarrier IPIs that were skipped due to the PF_KTHREAD optimisation. I think that doing said check in the lazy tlb exit code is both unnecessary for the memory ordering and insufficient for pipeline serialization. > But I’m wondering if all this deferred sync stuff is wrong. In the brave new world of io_uring and such, perhaps kernel access matter too. Heck, even: > > int a[2]; > > Thread A: > a[0] = 1; > a[1] = 2: > > Thread B: > > write(fd, a, sizeof(a)); > > Doesn’t do what thread A is expecting. Admittedly this particular example is nonsense, but maybe there are sensible cases that matter to someone. I think kernel accesses probably do matter (or at least they should by principle of least surprise). And so I was doubly misleading by labeling it as "user stuff". I should have distinguished between previous user or kernel accesses, as opposed to the kernel accesses specifically for the implementation of the membarrier call. So I think the membarrier code gets *that* part right (modulo what we have seen already) if the kernel access is being done from process context. But yes if the access is coming from io_uring that has done kthread_use_mm or some other random code running in a kernel thread working on get_user_pages memory or any similar shared vm_insert_pfn memory, then it goes completely to hell. So good catch, PF_KTHREAD check is problematic there even if no actual users exist today. rq->curr == rq->idle test might be better, but can we have interrupts writing completions into user memory? For performance I would hope so, so that makes even that test problematic. Maybe membarrier should close that gap entirely, and work around performance issue by adding _USER_ONLY flags which explicitly only order user mode accesess vs other user accesses. Thanks, Nick
On Wed, Jul 15, 2020 at 10:18:20PM -0700, Andy Lutomirski wrote: > > On Jul 15, 2020, at 9:15 PM, Nicholas Piggin <npiggin@gmail.com> wrote: > > CPU0 CPU1 > > 1. user stuff > > a. membarrier() 2. enter kernel > > b. read rq->curr 3. rq->curr switched to kthread > > c. is kthread, skip IPI 4. switch_to kthread > > d. return to user 5. rq->curr switched to user thread > > 6. switch_to user thread > > 7. exit kernel > > 8. more user stuff > I find it hard to believe that this is x86 only. Why would thread > switch imply core sync on any architecture? Is x86 unique in having a > stupid expensive core sync that is heavier than smp_mb()? smp_mb() is nowhere near the most expensive barrier we have in Linux, mb() might qualify, since that has some completion requirements since it needs to serialize against external actors. On x86_64 things are rather murky, we have: LOCK prefix -- which implies smp_mb() before and after RmW LFENCE -- which used to be rmb like, until Spectre, and now it is ISYNC like. Since ISYNC ensures an empty pipeline, it also implies all loads are retired (and therefore complete) it implies rmb. MFENCE -- which is a memop completion barrier like, it makes sure all previously issued memops are complete. if you read that carefully, you'll note you'll have to use LFENCE + MFENCE to order against non-memops instructions. But none of them imply dumping the instruction decoder caches, that only happens on core serializing instructions like CR3 writes, IRET, CPUID and a few others, I think we recently got a SERIALIZE instruction to add to this list. On ARM64 there's something a whole different set of barriers, and again smp_mb() isn't nowhere near the top of the list. They have roughly 3 classes: ISB -- instruction sync barrier DMB(x) -- memory ordering in domain x DSB(x) -- memory completion in domain x And they have at least 3 domains (IIRC), system, outer, inner. The ARM64 __switch_to() includes a dsb(sy), just like PowerPC used to have a SYNC, but since PowerPC is rare for only having one rediculously heavy serializing instruction, we got to re-use the smp_mb() early in __schedule() instead, but ARM64 can't do that. So rather than say that x86 is special here, I'd say that PowerPC is special here. > But I’m wondering if all this deferred sync stuff is wrong. In the > brave new world of io_uring and such, perhaps kernel access matter > too. Heck, even: IIRC the membarrier SYNC_CORE use-case is about user-space self-modifying code. Userspace re-uses a text address and needs to SYNC_CORE before it can be sure the old text is forgotten. Nothing the kernel does matters there. I suppose the manpage could be more clear there.
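To make the self-modifying-code use-case concrete, here is a minimal user-space sketch of how a JIT would use the SYNC_CORE commands. The membarrier(2) command names are real; the JIT helpers and buffer management are placeholders.

#include <linux/membarrier.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int membarrier(int cmd, int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

/* Once at startup. */
static int jit_init(void)
{
	return membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0);
}

/*
 * Reuse a previously executed text address: write the new instructions,
 * then make sure every thread of this process core-serializes before it
 * next runs user code, so no thread can still execute stale bytes
 * fetched from that address.
 */
static void jit_publish(void *text, const void *insns, size_t len)
{
	memcpy(text, insns, len);	/* plus icache flush on archs that need it */
	membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0);
	/* now any thread may safely jump to 'text' */
}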
Excerpts from Peter Zijlstra's message of July 16, 2020 6:50 pm: > On Wed, Jul 15, 2020 at 10:18:20PM -0700, Andy Lutomirski wrote: >> > On Jul 15, 2020, at 9:15 PM, Nicholas Piggin <npiggin@gmail.com> wrote: > >> > CPU0 CPU1 >> > 1. user stuff >> > a. membarrier() 2. enter kernel >> > b. read rq->curr 3. rq->curr switched to kthread >> > c. is kthread, skip IPI 4. switch_to kthread >> > d. return to user 5. rq->curr switched to user thread >> > 6. switch_to user thread >> > 7. exit kernel >> > 8. more user stuff > >> I find it hard to believe that this is x86 only. Why would thread >> switch imply core sync on any architecture? Is x86 unique in having a >> stupid expensive core sync that is heavier than smp_mb()? > > smp_mb() is nowhere near the most expensive barrier we have in Linux, > mb() might qualify, since that has some completion requirements since it > needs to serialize against external actors. > > On x86_64 things are rather murky, we have: > > LOCK prefix -- which implies smp_mb() before and after RmW > LFENCE -- which used to be rmb like, until Spectre, and now it > is ISYNC like. Since ISYNC ensures an empty pipeline, > it also implies all loads are retired (and therefore > complete) it implies rmb. > MFENCE -- which is a memop completion barrier like, it makes > sure all previously issued memops are complete. > > if you read that carefully, you'll note you'll have to use LFENCE + > MFENCE to order against non-memops instructions. > > But none of them imply dumping the instruction decoder caches, that only > happens on core serializing instructions like CR3 writes, IRET, CPUID > and a few others, I think we recently got a SERIALIZE instruction to add > to this list. > > > On ARM64 there's something a whole different set of barriers, and again > smp_mb() isn't nowhere near the top of the list. They have roughly 3 > classes: > > ISB -- instruction sync barrier > DMB(x) -- memory ordering in domain x > DSB(x) -- memory completion in domain x > > And they have at least 3 domains (IIRC), system, outer, inner. > > The ARM64 __switch_to() includes a dsb(sy), just like PowerPC used to > have a SYNC, but since PowerPC is rare for only having one rediculously > heavy serializing instruction, we got to re-use the smp_mb() early in > __schedule() instead, but ARM64 can't do that. > > > So rather than say that x86 is special here, I'd say that PowerPC is > special here. PowerPC is "special", I'll agree with you there :) It does have a SYNC (HWSYNC) instruction that is mb(). It does not serialize the core. ISYNC is a nop. ICBI ; ISYNC does serialize the core. Difference between them is probably much the same as difference between MFENCE and CPUID on x86 CPUs. Serializing the core is almost always pretty expensive. HWSYNC/MFENCE can be expensive if you have a lot of or difficult (not exclusive in cache) outstanding with critical reads after the barrier, but it can also be somewhat cheap if there are few writes, and executed past, it only needs to hold up subsequent reads. That said... implementation details. powerpc CPUs have traditionally had fairly costly HWSYNC. >> But I’m wondering if all this deferred sync stuff is wrong. In the >> brave new world of io_uring and such, perhaps kernel access matter >> too. Heck, even: > > IIRC the membarrier SYNC_CORE use-case is about user-space > self-modifying code. > > Userspace re-uses a text address and needs to SYNC_CORE before it can be > sure the old text is forgotten. Nothing the kernel does matters there. 
> > I suppose the manpage could be more clear there. True, but memory ordering of kernel stores from kernel threads for regular membarrier is the concern here. Does io_uring update the completion queue from a kernel thread or an interrupt, for example? If it does, then membarrier will not order such stores with user memory accesses. Thanks, Nick
On Thu, Jul 16, 2020 at 08:03:36PM +1000, Nicholas Piggin wrote: > Excerpts from Peter Zijlstra's message of July 16, 2020 6:50 pm: > > On Wed, Jul 15, 2020 at 10:18:20PM -0700, Andy Lutomirski wrote: > >> > On Jul 15, 2020, at 9:15 PM, Nicholas Piggin <npiggin@gmail.com> wrote: > >> But I’m wondering if all this deferred sync stuff is wrong. In the > >> brave new world of io_uring and such, perhaps kernel access matter > >> too. Heck, even: > > > > IIRC the membarrier SYNC_CORE use-case is about user-space > > self-modifying code. > > > > Userspace re-uses a text address and needs to SYNC_CORE before it can be > > sure the old text is forgotten. Nothing the kernel does matters there. > > > > I suppose the manpage could be more clear there. > > True, but memory ordering of kernel stores from kernel threads for > regular mem barrier is the concern here. > > Does io_uring update completion queue from kernel thread or interrupt, > for example? If it does, then membarrier will not order such stores > with user memory accesses. So we're talking about regular membarrier() then? Not the SYNC_CORE variant per-se. Even there, I'll argue we don't care, but perhaps Mathieu has a different opinion. All we care about is that all other threads (or CPUs for GLOBAL) observe an smp_mb() before it returns. Any serialization against whatever those other threads/CPUs are running at the instant of the syscall is external to the syscall, we make no gauarantees about that. That is, we can fundamentally not say what another CPU is executing concurrently. Nor should we want to. So if you feel that your membarrier() ought to serialize against remote execution, you need to arrange a quiecent state on the remote side yourself. Now, normally membarrier() is used to implement userspace RCU like things, and there all that matters is that the remote CPUs observe the beginngin of the new grace-period, ie counter flip, and we observe their read-side critical sections, or smething like that, it's been a while since I looked at all that. It's always been the case that concurrent syscalls could change user memory, io_uring doesn't change that, it just makes it even less well defined when that would happen. If you want to serialize against that, you need to arrange that externally.
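Schematically, the userspace-RCU-style pattern Peter describes looks like the following. Only the membarrier(2) commands are real; the grace-period bookkeeping is reduced to the bare "counter flip" to keep the sketch short.

#include <linux/membarrier.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static _Atomic unsigned long gp_counter;

/* Requires a one-time membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0). */

/* Reader side: cheap, relies on the writer's membarrier for the smp_mb(). */
static unsigned long reader_snapshot(void)
{
	return atomic_load_explicit(&gp_counter, memory_order_relaxed);
}

/* Writer side: flip the counter, then make the flip globally observed. */
static void writer_flip(void)
{
	atomic_fetch_add_explicit(&gp_counter, 1, memory_order_relaxed);
	/*
	 * Full memory barrier on every CPU running a thread of this
	 * process: each reader either sees the new counter value or has
	 * its prior read-side accesses ordered before this point.
	 */
	syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
	/* ... then wait for readers still using the old counter value ... */
}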
----- On Jul 16, 2020, at 7:00 AM, Peter Zijlstra peterz@infradead.org wrote: > On Thu, Jul 16, 2020 at 08:03:36PM +1000, Nicholas Piggin wrote: >> Excerpts from Peter Zijlstra's message of July 16, 2020 6:50 pm: >> > On Wed, Jul 15, 2020 at 10:18:20PM -0700, Andy Lutomirski wrote: >> >> > On Jul 15, 2020, at 9:15 PM, Nicholas Piggin <npiggin@gmail.com> wrote: > >> >> But I’m wondering if all this deferred sync stuff is wrong. In the >> >> brave new world of io_uring and such, perhaps kernel access matter >> >> too. Heck, even: >> > >> > IIRC the membarrier SYNC_CORE use-case is about user-space >> > self-modifying code. >> > >> > Userspace re-uses a text address and needs to SYNC_CORE before it can be >> > sure the old text is forgotten. Nothing the kernel does matters there. >> > >> > I suppose the manpage could be more clear there. >> >> True, but memory ordering of kernel stores from kernel threads for >> regular mem barrier is the concern here. >> >> Does io_uring update completion queue from kernel thread or interrupt, >> for example? If it does, then membarrier will not order such stores >> with user memory accesses. > > So we're talking about regular membarrier() then? Not the SYNC_CORE > variant per-se. > > Even there, I'll argue we don't care, but perhaps Mathieu has a > different opinion. I agree with Peter that we don't care about accesses to user-space memory performed concurrently with membarrier. What we'd care about in terms of accesses to user-space memory from the kernel is something that would be clearly ordered as happening before or after the membarrier call, for instance a read(2) followed by membarrier(2) after the read returns, or a read(2) issued after return from membarrier(2). The other scenario we'd care about is with the compiler barrier paired with membarrier: e.g. read(2) returns, compiler barrier, followed by a store. Or load, compiler barrier, followed by write(2). All those scenarios imply before/after ordering wrt either membarrier or the compiler barrier. I notice that io_uring has a "completion" queue. Let's try to come up with realistic usage scenarios. So the dependency chain would be provided by e.g.: * Infrequent read / Frequent write, communicating read completion through variable X wait for io_uring read request completion -> membarrier -> store X=1 with matching load from X (waiting for X==1) -> asm volatile (::: "memory") -> submit io_uring write request or this other scenario: * Frequent read / Infrequent write, communicating read completion through variable X load from X (waiting for X==1) -> membarrier -> submit io_uring write request with matching wait for io_uring read request completion -> asm volatile (::: "memory") -> store X=1 Thanks, Mathieu
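A rough user-space sketch of the first scenario above (infrequent read, frequent write). The wait_for_read_completion() and submit_write_request() helpers are hypothetical stand-ins for the io_uring calls; the point is only the membarrier/compiler-barrier pairing around the store to and load from X. The process is assumed to have registered MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED beforehand.

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

extern void wait_for_read_completion(void);	/* hypothetical io_uring wait   */
extern void submit_write_request(void);		/* hypothetical io_uring submit */

static volatile int X;		/* read-completion flag shared between threads */

static void infrequent_reader(void)
{
	wait_for_read_completion();
	/* order the completed read (and everything before it) against X = 1 */
	syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
	X = 1;
}

static void frequent_writer(void)
{
	while (X != 1)
		;				/* wait for X == 1 */
	asm volatile ("" ::: "memory");		/* compiler barrier pairing with membarrier */
	submit_write_request();
}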
----- On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npiggin@gmail.com wrote: > I should be more complete here, especially since I was complaining > about unclear barrier comment :) > > > CPU0 CPU1 > a. user stuff 1. user stuff > b. membarrier() 2. enter kernel > c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule > d. read rq->curr 4. rq->curr switched to kthread > e. is kthread, skip IPI 5. switch_to kthread > f. return to user 6. rq->curr switched to user thread > g. user stuff 7. switch_to user thread > 8. exit kernel > 9. more user stuff > > What you're really ordering is a, g vs 1, 9 right? > > In other words, 9 must see a if it sees g, g must see 1 if it saw 9, > etc. > > Userspace does not care where the barriers are exactly or what kernel > memory accesses might be being ordered by them, so long as there is a > mb somewhere between a and g, and 1 and 9. Right? This is correct. Note that the accesses to user-space memory can be done either by user-space code or kernel code, it doesn't matter. However, in order to be considered as happening before/after either membarrier or the matching compiler barrier, kernel code needs to have causality relationship with user-space execution, e.g. user-space does a system call, or returns from a system call. In the case of io_uring, submitting a request or returning from waiting on request completion appear to provide this causality relationship. Thanks, Mathieu
----- On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote: > ----- On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npiggin@gmail.com wrote: >> I should be more complete here, especially since I was complaining >> about unclear barrier comment :) >> >> >> CPU0 CPU1 >> a. user stuff 1. user stuff >> b. membarrier() 2. enter kernel >> c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule >> d. read rq->curr 4. rq->curr switched to kthread >> e. is kthread, skip IPI 5. switch_to kthread >> f. return to user 6. rq->curr switched to user thread >> g. user stuff 7. switch_to user thread >> 8. exit kernel >> 9. more user stuff >> >> What you're really ordering is a, g vs 1, 9 right? >> >> In other words, 9 must see a if it sees g, g must see 1 if it saw 9, >> etc. >> >> Userspace does not care where the barriers are exactly or what kernel >> memory accesses might be being ordered by them, so long as there is a >> mb somewhere between a and g, and 1 and 9. Right? > > This is correct. Actually, sorry, the above is not quite right. It's been a while since I looked into the details of membarrier. The smp_mb() at the beginning of membarrier() needs to be paired with a smp_mb() _after_ rq->curr is switched back to the user thread, so the memory barrier is between store to rq->curr and following user-space accesses. The smp_mb() at the end of membarrier() needs to be paired with the smp_mb__after_spinlock() at the beginning of schedule, which is between accesses to userspace memory and switching rq->curr to kthread. As to *why* this ordering is needed, I'd have to dig through additional scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ? Thanks, Mathieu > Note that the accesses to user-space memory can be > done either by user-space code or kernel code, it doesn't matter. > However, in order to be considered as happening before/after > either membarrier or the matching compiler barrier, kernel code > needs to have causality relationship with user-space execution, > e.g. user-space does a system call, or returns from a system call. > > In the case of io_uring, submitting a request or returning from waiting > on request completion appear to provide this causality relationship. > > Thanks, > > Mathieu > > > -- > Mathieu Desnoyers > EfficiOS Inc. > http://www.efficios.com
----- On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote: > ----- On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers > mathieu.desnoyers@efficios.com wrote: > >> ----- On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npiggin@gmail.com wrote: >>> I should be more complete here, especially since I was complaining >>> about unclear barrier comment :) >>> >>> >>> CPU0 CPU1 >>> a. user stuff 1. user stuff >>> b. membarrier() 2. enter kernel >>> c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule >>> d. read rq->curr 4. rq->curr switched to kthread >>> e. is kthread, skip IPI 5. switch_to kthread >>> f. return to user 6. rq->curr switched to user thread >>> g. user stuff 7. switch_to user thread >>> 8. exit kernel >>> 9. more user stuff >>> >>> What you're really ordering is a, g vs 1, 9 right? >>> >>> In other words, 9 must see a if it sees g, g must see 1 if it saw 9, >>> etc. >>> >>> Userspace does not care where the barriers are exactly or what kernel >>> memory accesses might be being ordered by them, so long as there is a >>> mb somewhere between a and g, and 1 and 9. Right? >> >> This is correct. > > Actually, sorry, the above is not quite right. It's been a while > since I looked into the details of membarrier. > > The smp_mb() at the beginning of membarrier() needs to be paired with a > smp_mb() _after_ rq->curr is switched back to the user thread, so the > memory barrier is between store to rq->curr and following user-space > accesses. > > The smp_mb() at the end of membarrier() needs to be paired with the > smp_mb__after_spinlock() at the beginning of schedule, which is > between accesses to userspace memory and switching rq->curr to kthread. > > As to *why* this ordering is needed, I'd have to dig through additional > scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ? Thinking further about this, I'm beginning to consider that maybe we have been overly cautious by requiring memory barriers before and after store to rq->curr. If CPU0 observes a CPU1's rq->curr->mm which differs from its own process (current) while running the membarrier system call, it necessarily means that CPU1 had to issue smp_mb__after_spinlock when entering the scheduler, between any user-space loads/stores and update of rq->curr. Requiring a memory barrier between update of rq->curr (back to current process's thread) and following user-space memory accesses does not seem to guarantee anything more than what the initial barrier at the beginning of __schedule already provides, because the guarantees are only about accesses to user-space memory. Therefore, with the memory barrier at the beginning of __schedule, just observing that CPU1's rq->curr differs from current should guarantee that a memory barrier was issued between any sequentially consistent instructions belonging to the current process on CPU1. Or am I missing/misremembering an important point here ? Thanks, Mathieu > > Thanks, > > Mathieu > > >> Note that the accesses to user-space memory can be >> done either by user-space code or kernel code, it doesn't matter. >> However, in order to be considered as happening before/after >> either membarrier or the matching compiler barrier, kernel code >> needs to have causality relationship with user-space execution, >> e.g. user-space does a system call, or returns from a system call. >> >> In the case of io_uring, submitting a request or returning from waiting >> on request completion appear to provide this causality relationship. 
On Thu, Jul 16, 2020 at 02:58:41PM -0400, Mathieu Desnoyers wrote: > ----- On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote: > > > ----- On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers > > mathieu.desnoyers@efficios.com wrote: > > > >> ----- On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npiggin@gmail.com wrote: > >>> I should be more complete here, especially since I was complaining > >>> about unclear barrier comment :) > >>> > >>> > >>> CPU0 CPU1 > >>> a. user stuff 1. user stuff > >>> b. membarrier() 2. enter kernel > >>> c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule > >>> d. read rq->curr 4. rq->curr switched to kthread > >>> e. is kthread, skip IPI 5. switch_to kthread > >>> f. return to user 6. rq->curr switched to user thread > >>> g. user stuff 7. switch_to user thread > >>> 8. exit kernel > >>> 9. more user stuff > >>> > >>> What you're really ordering is a, g vs 1, 9 right? > >>> > >>> In other words, 9 must see a if it sees g, g must see 1 if it saw 9, > >>> etc. > >>> > >>> Userspace does not care where the barriers are exactly or what kernel > >>> memory accesses might be being ordered by them, so long as there is a > >>> mb somewhere between a and g, and 1 and 9. Right? > >> > >> This is correct. > > > > Actually, sorry, the above is not quite right. It's been a while > > since I looked into the details of membarrier. > > > > The smp_mb() at the beginning of membarrier() needs to be paired with a > > smp_mb() _after_ rq->curr is switched back to the user thread, so the > > memory barrier is between store to rq->curr and following user-space > > accesses. > > > > The smp_mb() at the end of membarrier() needs to be paired with the > > smp_mb__after_spinlock() at the beginning of schedule, which is > > between accesses to userspace memory and switching rq->curr to kthread. > > > > As to *why* this ordering is needed, I'd have to dig through additional > > scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ? > > Thinking further about this, I'm beginning to consider that maybe we have been > overly cautious by requiring memory barriers before and after store to rq->curr. > > If CPU0 observes a CPU1's rq->curr->mm which differs from its own process (current) > while running the membarrier system call, it necessarily means that CPU1 had > to issue smp_mb__after_spinlock when entering the scheduler, between any user-space > loads/stores and update of rq->curr. > > Requiring a memory barrier between update of rq->curr (back to current process's > thread) and following user-space memory accesses does not seem to guarantee > anything more than what the initial barrier at the beginning of __schedule already > provides, because the guarantees are only about accesses to user-space memory. > > Therefore, with the memory barrier at the beginning of __schedule, just observing that > CPU1's rq->curr differs from current should guarantee that a memory barrier was issued > between any sequentially consistent instructions belonging to the current process on > CPU1. > > Or am I missing/misremembering an important point here ? Is it correct to say that the switch_to operations in 5 and 7 include memory barriers? If they do, then skipping the IPI should be okay. The reason is as follows: The guarantee you need to enforce is that anything written by CPU0 before the membarrier() will be visible to CPU1 after it returns to user mode. Let's say that a writes to X and 9 reads from X. 
Then we have an instance of the Store Buffer pattern: CPU0 CPU1 a. Write X 6. Write rq->curr for user thread c. smp_mb() 7. switch_to memory barrier d. Read rq->curr 9. Read X In this pattern, the memory barriers make it impossible for both reads to miss their corresponding writes. Since d does fail to read 6 (it sees the earlier value stored by 4), 9 must read a. The other guarantee you need is that g on CPU0 will observe anything written by CPU1 in 1. This is easier to see, using the fact that 3 is a memory barrier and d reads from 4. Alan Stern
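Restating Alan's Store Buffer instance as a stand-alone C11 sketch (my paraphrase, not code from the thread), with atomic_thread_fence(memory_order_seq_cst) standing in for smp_mb() and for the barrier in switch_to:

#include <stdatomic.h>

static _Atomic int X;		/* the user data written at step a               */
static _Atomic int rq_curr;	/* 0 = kthread (left over from step 4), 1 = user */
static int r_curr, r_x;		/* what each side observes                       */

static void cpu0(void)		/* the membarrier() caller */
{
	atomic_store_explicit(&X, 1, memory_order_relaxed);		/* a. write X       */
	atomic_thread_fence(memory_order_seq_cst);			/* c. smp_mb()      */
	r_curr = atomic_load_explicit(&rq_curr, memory_order_relaxed);	/* d. read rq->curr */
	/* r_curr == 0 means "still a kthread", so the IPI is skipped */
}

static void cpu1(void)		/* the CPU switching back to the user thread */
{
	atomic_store_explicit(&rq_curr, 1, memory_order_relaxed);	/* 6. rq->curr = user thread */
	atomic_thread_fence(memory_order_seq_cst);			/* 7. switch_to barrier      */
	r_x = atomic_load_explicit(&X, memory_order_relaxed);		/* 9. user-space reads X     */
}

/*
 * With both fences present, the outcome r_curr == 0 && r_x == 0 is forbidden:
 * if CPU0 missed the rq->curr update (and therefore skipped the IPI), CPU1's
 * later read of X must see X == 1.  That is the argument for why skipping the
 * IPI is safe, provided step 7 really does contain a barrier.
 */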
Excerpts from peterz@infradead.org's message of July 16, 2020 9:00 pm: > On Thu, Jul 16, 2020 at 08:03:36PM +1000, Nicholas Piggin wrote: >> Excerpts from Peter Zijlstra's message of July 16, 2020 6:50 pm: >> > On Wed, Jul 15, 2020 at 10:18:20PM -0700, Andy Lutomirski wrote: >> >> > On Jul 15, 2020, at 9:15 PM, Nicholas Piggin <npiggin@gmail.com> wrote: > >> >> But I’m wondering if all this deferred sync stuff is wrong. In the >> >> brave new world of io_uring and such, perhaps kernel access matter >> >> too. Heck, even: >> > >> > IIRC the membarrier SYNC_CORE use-case is about user-space >> > self-modifying code. >> > >> > Userspace re-uses a text address and needs to SYNC_CORE before it can be >> > sure the old text is forgotten. Nothing the kernel does matters there. >> > >> > I suppose the manpage could be more clear there. >> >> True, but memory ordering of kernel stores from kernel threads for >> regular mem barrier is the concern here. >> >> Does io_uring update completion queue from kernel thread or interrupt, >> for example? If it does, then membarrier will not order such stores >> with user memory accesses. > > So we're talking about regular membarrier() then? Not the SYNC_CORE > variant per-se. Well, both but Andy in this case was wondering about kernel writes vs user. > > Even there, I'll argue we don't care, but perhaps Mathieu has a > different opinion. All we care about is that all other threads (or CPUs > for GLOBAL) observe an smp_mb() before it returns. > > Any serialization against whatever those other threads/CPUs are running > at the instant of the syscall is external to the syscall, we make no > gauarantees about that. That is, we can fundamentally not say what > another CPU is executing concurrently. Nor should we want to. > > So if you feel that your membarrier() ought to serialize against remote > execution, you need to arrange a quiecent state on the remote side > yourself. > > Now, normally membarrier() is used to implement userspace RCU like > things, and there all that matters is that the remote CPUs observe the > beginngin of the new grace-period, ie counter flip, and we observe their > read-side critical sections, or smething like that, it's been a while > since I looked at all that. > > It's always been the case that concurrent syscalls could change user > memory, io_uring doesn't change that, it just makes it even less well > defined when that would happen. If you want to serialize against that, > you need to arrange that externally. membarrier does replace barrier instructions on remote CPUs, which do order accesses performed by the kernel on the user address space. So membarrier should too I guess. Normal process context accesses like read(2) will do so because they don't get filtered out from IPIs, but kernel threads using the mm may not. Thanks, Nick
Excerpts from Mathieu Desnoyers's message of July 17, 2020 4:58 am: > ----- On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote: > >> ----- On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers >> mathieu.desnoyers@efficios.com wrote: >> >>> ----- On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npiggin@gmail.com wrote: >>>> I should be more complete here, especially since I was complaining >>>> about unclear barrier comment :) >>>> >>>> >>>> CPU0 CPU1 >>>> a. user stuff 1. user stuff >>>> b. membarrier() 2. enter kernel >>>> c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule >>>> d. read rq->curr 4. rq->curr switched to kthread >>>> e. is kthread, skip IPI 5. switch_to kthread >>>> f. return to user 6. rq->curr switched to user thread >>>> g. user stuff 7. switch_to user thread >>>> 8. exit kernel >>>> 9. more user stuff >>>> >>>> What you're really ordering is a, g vs 1, 9 right? >>>> >>>> In other words, 9 must see a if it sees g, g must see 1 if it saw 9, >>>> etc. >>>> >>>> Userspace does not care where the barriers are exactly or what kernel >>>> memory accesses might be being ordered by them, so long as there is a >>>> mb somewhere between a and g, and 1 and 9. Right? >>> >>> This is correct. >> >> Actually, sorry, the above is not quite right. It's been a while >> since I looked into the details of membarrier. >> >> The smp_mb() at the beginning of membarrier() needs to be paired with a >> smp_mb() _after_ rq->curr is switched back to the user thread, so the >> memory barrier is between store to rq->curr and following user-space >> accesses. >> >> The smp_mb() at the end of membarrier() needs to be paired with the >> smp_mb__after_spinlock() at the beginning of schedule, which is >> between accesses to userspace memory and switching rq->curr to kthread. >> >> As to *why* this ordering is needed, I'd have to dig through additional >> scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ? > > Thinking further about this, I'm beginning to consider that maybe we have been > overly cautious by requiring memory barriers before and after store to rq->curr. > > If CPU0 observes a CPU1's rq->curr->mm which differs from its own process (current) > while running the membarrier system call, it necessarily means that CPU1 had > to issue smp_mb__after_spinlock when entering the scheduler, between any user-space > loads/stores and update of rq->curr. > > Requiring a memory barrier between update of rq->curr (back to current process's > thread) and following user-space memory accesses does not seem to guarantee > anything more than what the initial barrier at the beginning of __schedule already > provides, because the guarantees are only about accesses to user-space memory. > > Therefore, with the memory barrier at the beginning of __schedule, just observing that > CPU1's rq->curr differs from current should guarantee that a memory barrier was issued > between any sequentially consistent instructions belonging to the current process on > CPU1. > > Or am I missing/misremembering an important point here ? I might have mislead you. CPU0 CPU1 r1=y x=1 membarrier() y=1 r2=x membarrier provides if r1==1 then r2==1 (right?) CPU0 r1=y membarrier() smp_mb(); t = cpu_rq(1)->curr; if (t->mm == mm) IPI(CPU1); smp_mb() r2=x vs CPU1 ... __schedule() smp_mb__after_spinlock() rq->curr = kthread ... 
__schedule() smp_mb__after_spinlock() rq->curr = user thread exit kernel x=1 y=1 Now these last 3 stores are not ordered, so CPU0 might see y==1 but rq->curr == kthread, right? Then it will skip the IPI and stores to x and y will not be ordered. So we do need a mb after rq->curr store when mm is switching. I believe for the global membarrier PF_KTHREAD optimisation, we also need a barrier when switching from a kernel thread to user, for the same reason. So I think I was wrong to say the barrier is not necessary. I haven't quite worked out why two mb()s are required in membarrier(), but at least that's less of a performance concern. Thanks, Nick
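The same sequence, written out as a C11 sketch of the broken case Nick describes (my restatement, with atomic_thread_fence(memory_order_seq_cst) standing in for smp_mb()); the missing fence on CPU1 between the rq->curr store and the user-space stores is what allows the bad outcome:

#include <stdatomic.h>

static _Atomic int x, y;	/* user-space data                               */
static _Atomic int rq_curr;	/* 0 = kthread, 1 = user thread of the target mm */
static int r1, r2, curr_seen;

static void cpu0(void)		/* the membarrier() caller */
{
	r1 = atomic_load_explicit(&y, memory_order_relaxed);
	/* membarrier(): */
	atomic_thread_fence(memory_order_seq_cst);			/* smp_mb() */
	curr_seen = atomic_load_explicit(&rq_curr, memory_order_relaxed);
	/* curr_seen == 0 looks like a kthread, so the IPI is skipped */
	atomic_thread_fence(memory_order_seq_cst);			/* smp_mb() */
	r2 = atomic_load_explicit(&x, memory_order_relaxed);
}

static void cpu1(void)		/* the target CPU switching kthread -> user thread */
{
	atomic_store_explicit(&rq_curr, 1, memory_order_relaxed);	/* rq->curr = user thread */
	/* no fence here: this is the gap being discussed */
	atomic_store_explicit(&x, 1, memory_order_relaxed);		/* user-space stores */
	atomic_store_explicit(&y, 1, memory_order_relaxed);
}

/*
 * Without a fence between the rq->curr store and the user-space stores, the
 * outcome r1 == 1 && curr_seen == 0 && r2 == 0 is allowed: the IPI is skipped
 * yet x and y are not ordered for the caller, breaking "if r1 == 1 then
 * r2 == 1".  Hence the barrier needed after rq->curr is switched back.
 */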
----- On Jul 16, 2020, at 5:24 PM, Alan Stern stern@rowland.harvard.edu wrote: > On Thu, Jul 16, 2020 at 02:58:41PM -0400, Mathieu Desnoyers wrote: >> ----- On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers >> mathieu.desnoyers@efficios.com wrote: >> >> > ----- On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers >> > mathieu.desnoyers@efficios.com wrote: >> > >> >> ----- On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npiggin@gmail.com wrote: >> >>> I should be more complete here, especially since I was complaining >> >>> about unclear barrier comment :) >> >>> >> >>> >> >>> CPU0 CPU1 >> >>> a. user stuff 1. user stuff >> >>> b. membarrier() 2. enter kernel >> >>> c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule >> >>> d. read rq->curr 4. rq->curr switched to kthread >> >>> e. is kthread, skip IPI 5. switch_to kthread >> >>> f. return to user 6. rq->curr switched to user thread >> >>> g. user stuff 7. switch_to user thread >> >>> 8. exit kernel >> >>> 9. more user stuff >> >>> >> >>> What you're really ordering is a, g vs 1, 9 right? >> >>> >> >>> In other words, 9 must see a if it sees g, g must see 1 if it saw 9, >> >>> etc. >> >>> >> >>> Userspace does not care where the barriers are exactly or what kernel >> >>> memory accesses might be being ordered by them, so long as there is a >> >>> mb somewhere between a and g, and 1 and 9. Right? >> >> >> >> This is correct. >> > >> > Actually, sorry, the above is not quite right. It's been a while >> > since I looked into the details of membarrier. >> > >> > The smp_mb() at the beginning of membarrier() needs to be paired with a >> > smp_mb() _after_ rq->curr is switched back to the user thread, so the >> > memory barrier is between store to rq->curr and following user-space >> > accesses. >> > >> > The smp_mb() at the end of membarrier() needs to be paired with the >> > smp_mb__after_spinlock() at the beginning of schedule, which is >> > between accesses to userspace memory and switching rq->curr to kthread. >> > >> > As to *why* this ordering is needed, I'd have to dig through additional >> > scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ? >> >> Thinking further about this, I'm beginning to consider that maybe we have been >> overly cautious by requiring memory barriers before and after store to rq->curr. >> >> If CPU0 observes a CPU1's rq->curr->mm which differs from its own process >> (current) >> while running the membarrier system call, it necessarily means that CPU1 had >> to issue smp_mb__after_spinlock when entering the scheduler, between any >> user-space >> loads/stores and update of rq->curr. >> >> Requiring a memory barrier between update of rq->curr (back to current process's >> thread) and following user-space memory accesses does not seem to guarantee >> anything more than what the initial barrier at the beginning of __schedule >> already >> provides, because the guarantees are only about accesses to user-space memory. >> >> Therefore, with the memory barrier at the beginning of __schedule, just >> observing that >> CPU1's rq->curr differs from current should guarantee that a memory barrier was >> issued >> between any sequentially consistent instructions belonging to the current >> process on >> CPU1. >> >> Or am I missing/misremembering an important point here ? > > Is it correct to say that the switch_to operations in 5 and 7 include > memory barriers? If they do, then skipping the IPI should be okay. 
> > The reason is as follows: The guarantee you need to enforce is that > anything written by CPU0 before the membarrier() will be visible to CPU1 > after it returns to user mode. Let's say that a writes to X and 9 > reads from X. > > Then we have an instance of the Store Buffer pattern: > > CPU0 CPU1 > a. Write X 6. Write rq->curr for user thread > c. smp_mb() 7. switch_to memory barrier > d. Read rq->curr 9. Read X > > In this pattern, the memory barriers make it impossible for both reads > to miss their corresponding writes. Since d does fail to read 6 (it > sees the earlier value stored by 4), 9 must read a. > > The other guarantee you need is that g on CPU0 will observe anything > written by CPU1 in 1. This is easier to see, using the fact that 3 is a > memory barrier and d reads from 4. Right, and Nick's reply involving pairs of loads/stores on each side clarifies the situation even further. Thanks, Mathieu
----- On Jul 16, 2020, at 7:26 PM, Nicholas Piggin npiggin@gmail.com wrote: [...] > > membarrier does replace barrier instructions on remote CPUs, which do > order accesses performed by the kernel on the user address space. So > membarrier should too I guess. > > Normal process context accesses like read(2) will do so because they > don't get filtered out from IPIs, but kernel threads using the mm may > not. But it should not be an issue, because membarrier's ordering is only with respect to submission and completion of io_uring requests, which are performed through system calls issued from the context of user-space threads running with the right mm. Thanks, Mathieu
On Fri, Jul 17, 2020 at 09:39:25AM -0400, Mathieu Desnoyers wrote: > ----- On Jul 16, 2020, at 5:24 PM, Alan Stern stern@rowland.harvard.edu wrote: > > > On Thu, Jul 16, 2020 at 02:58:41PM -0400, Mathieu Desnoyers wrote: > >> ----- On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers > >> mathieu.desnoyers@efficios.com wrote: > >> > >> > ----- On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers > >> > mathieu.desnoyers@efficios.com wrote: > >> > > >> >> ----- On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npiggin@gmail.com wrote: > >> >>> I should be more complete here, especially since I was complaining > >> >>> about unclear barrier comment :) > >> >>> > >> >>> > >> >>> CPU0 CPU1 > >> >>> a. user stuff 1. user stuff > >> >>> b. membarrier() 2. enter kernel > >> >>> c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule > >> >>> d. read rq->curr 4. rq->curr switched to kthread > >> >>> e. is kthread, skip IPI 5. switch_to kthread > >> >>> f. return to user 6. rq->curr switched to user thread > >> >>> g. user stuff 7. switch_to user thread > >> >>> 8. exit kernel > >> >>> 9. more user stuff ... > >> Requiring a memory barrier between update of rq->curr (back to current process's > >> thread) and following user-space memory accesses does not seem to guarantee > >> anything more than what the initial barrier at the beginning of __schedule > >> already > >> provides, because the guarantees are only about accesses to user-space memory. ... > > Is it correct to say that the switch_to operations in 5 and 7 include > > memory barriers? If they do, then skipping the IPI should be okay. > > > > The reason is as follows: The guarantee you need to enforce is that > > anything written by CPU0 before the membarrier() will be visible to CPU1 > > after it returns to user mode. Let's say that a writes to X and 9 > > reads from X. > > > > Then we have an instance of the Store Buffer pattern: > > > > CPU0 CPU1 > > a. Write X 6. Write rq->curr for user thread > > c. smp_mb() 7. switch_to memory barrier > > d. Read rq->curr 9. Read X > > > > In this pattern, the memory barriers make it impossible for both reads > > to miss their corresponding writes. Since d does fail to read 6 (it > > sees the earlier value stored by 4), 9 must read a. > > > > The other guarantee you need is that g on CPU0 will observe anything > > written by CPU1 in 1. This is easier to see, using the fact that 3 is a > > memory barrier and d reads from 4. > > Right, and Nick's reply involving pairs of loads/stores on each side > clarifies the situation even further. The key part of my reply was the question: "Is it correct to say that the switch_to operations in 5 and 7 include memory barriers?" From the text quoted above and from Nick's reply, it seems clear that they do not. I agree with Nick: A memory barrier is needed somewhere between the assignment at 6 and the return to user mode at 8. Otherwise you end up with the Store Buffer pattern having a memory barrier on only one side, and it is well known that this arrangement does not guarantee any ordering. One thing I don't understand about all this: Any context switch has to include a memory barrier somewhere, but both you and Nick seem to be saying that steps 6 and 7 don't include (or don't need) any memory barriers. What am I missing? Alan Stern
----- On Jul 17, 2020, at 10:51 AM, Alan Stern stern@rowland.harvard.edu wrote: > On Fri, Jul 17, 2020 at 09:39:25AM -0400, Mathieu Desnoyers wrote: >> ----- On Jul 16, 2020, at 5:24 PM, Alan Stern stern@rowland.harvard.edu wrote: >> >> > On Thu, Jul 16, 2020 at 02:58:41PM -0400, Mathieu Desnoyers wrote: >> >> ----- On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers >> >> mathieu.desnoyers@efficios.com wrote: >> >> >> >> > ----- On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers >> >> > mathieu.desnoyers@efficios.com wrote: >> >> > >> >> >> ----- On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npiggin@gmail.com wrote: >> >> >>> I should be more complete here, especially since I was complaining >> >> >>> about unclear barrier comment :) >> >> >>> >> >> >>> >> >> >>> CPU0 CPU1 >> >> >>> a. user stuff 1. user stuff >> >> >>> b. membarrier() 2. enter kernel >> >> >>> c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule >> >> >>> d. read rq->curr 4. rq->curr switched to kthread >> >> >>> e. is kthread, skip IPI 5. switch_to kthread >> >> >>> f. return to user 6. rq->curr switched to user thread >> >> >>> g. user stuff 7. switch_to user thread >> >> >>> 8. exit kernel >> >> >>> 9. more user stuff > > ... > >> >> Requiring a memory barrier between update of rq->curr (back to current process's >> >> thread) and following user-space memory accesses does not seem to guarantee >> >> anything more than what the initial barrier at the beginning of __schedule >> >> already >> >> provides, because the guarantees are only about accesses to user-space memory. > > ... > >> > Is it correct to say that the switch_to operations in 5 and 7 include >> > memory barriers? If they do, then skipping the IPI should be okay. >> > >> > The reason is as follows: The guarantee you need to enforce is that >> > anything written by CPU0 before the membarrier() will be visible to CPU1 >> > after it returns to user mode. Let's say that a writes to X and 9 >> > reads from X. >> > >> > Then we have an instance of the Store Buffer pattern: >> > >> > CPU0 CPU1 >> > a. Write X 6. Write rq->curr for user thread >> > c. smp_mb() 7. switch_to memory barrier >> > d. Read rq->curr 9. Read X >> > >> > In this pattern, the memory barriers make it impossible for both reads >> > to miss their corresponding writes. Since d does fail to read 6 (it >> > sees the earlier value stored by 4), 9 must read a. >> > >> > The other guarantee you need is that g on CPU0 will observe anything >> > written by CPU1 in 1. This is easier to see, using the fact that 3 is a >> > memory barrier and d reads from 4. >> >> Right, and Nick's reply involving pairs of loads/stores on each side >> clarifies the situation even further. > > The key part of my reply was the question: "Is it correct to say that > the switch_to operations in 5 and 7 include memory barriers?" From the > text quoted above and from Nick's reply, it seems clear that they do > not. I remember that switch_mm implies it, but not switch_to. The scenario that triggered this discussion is when the scheduler does a lazy tlb entry/exit, which is basically switch from a user task to a kernel thread without changing the mm, and usually switching back afterwards. This optimization means the rq->curr mm temporarily differs, which prevent IPIs from being sent by membarrier, but without involving a switch_mm. This requires explicit memory barriers either on entry/exit of lazy tlb mode, or explicit barriers in the scheduler for those special-cases. 
> I agree with Nick: A memory barrier is needed somewhere between the > assignment at 6 and the return to user mode at 8. Otherwise you end up > with the Store Buffer pattern having a memory barrier on only one side, > and it is well known that this arrangement does not guarantee any > ordering. Yes, I see this now. I'm still trying to wrap my head around why the memory barrier at the end of membarrier() needs to be paired with a scheduler barrier though. > One thing I don't understand about all this: Any context switch has to > include a memory barrier somewhere, but both you and Nick seem to be > saying that steps 6 and 7 don't include (or don't need) any memory > barriers. What am I missing? All context switches have the smp_mb__after_spinlock() at the beginning of __schedule(), which I suspect is what you are referring to. However this barrier is before the store to rq->curr, not after. Thanks, Mathieu
> > I agree with Nick: A memory barrier is needed somewhere between the > > assignment at 6 and the return to user mode at 8. Otherwise you end up > > with the Store Buffer pattern having a memory barrier on only one side, > > and it is well known that this arrangement does not guarantee any > > ordering. > > Yes, I see this now. I'm still trying to wrap my head around why the memory > barrier at the end of membarrier() needs to be paired with a scheduler > barrier though. The memory barrier at the end of membarrier() on CPU0 is necessary in order to enforce the guarantee that any writes occurring on CPU1 before the membarrier() is executed will be visible to any code executing on CPU0 after the membarrier(). Ignoring the kthread issue, we can have: CPU0 CPU1 x = 1 barrier() y = 1 r2 = y membarrier(): a: smp_mb() b: send IPI IPI-induced mb c: smp_mb() r1 = x The writes to x and y are unordered by the hardware, so it's possible to have r2 = 1 even though the write to x doesn't execute until b. If the memory barrier at c is omitted then "r1 = x" can be reordered before b (although not before a), so we get r1 = 0. This violates the guarantee that membarrier() is supposed to provide. The timing of the memory barrier at c has to ensure that it executes after the IPI-induced memory barrier on CPU1. If it happened before then we could still end up with r1 = 0. That's why the pairing matters. I hope this helps your head get properly wrapped. :-) Alan Stern
----- On Jul 17, 2020, at 12:11 PM, Alan Stern stern@rowland.harvard.edu wrote: >> > I agree with Nick: A memory barrier is needed somewhere between the >> > assignment at 6 and the return to user mode at 8. Otherwise you end up >> > with the Store Buffer pattern having a memory barrier on only one side, >> > and it is well known that this arrangement does not guarantee any >> > ordering. >> >> Yes, I see this now. I'm still trying to wrap my head around why the memory >> barrier at the end of membarrier() needs to be paired with a scheduler >> barrier though. > > The memory barrier at the end of membarrier() on CPU0 is necessary in > order to enforce the guarantee that any writes occurring on CPU1 before > the membarrier() is executed will be visible to any code executing on > CPU0 after the membarrier(). Ignoring the kthread issue, we can have: > > CPU0 CPU1 > x = 1 > barrier() > y = 1 > r2 = y > membarrier(): > a: smp_mb() > b: send IPI IPI-induced mb > c: smp_mb() > r1 = x > > The writes to x and y are unordered by the hardware, so it's possible to > have r2 = 1 even though the write to x doesn't execute until b. If the > memory barrier at c is omitted then "r1 = x" can be reordered before b > (although not before a), so we get r1 = 0. This violates the guarantee > that membarrier() is supposed to provide. > > The timing of the memory barrier at c has to ensure that it executes > after the IPI-induced memory barrier on CPU1. If it happened before > then we could still end up with r1 = 0. That's why the pairing matters. > > I hope this helps your head get properly wrapped. :-) It does help a bit! ;-) This explains this part of the comment near the smp_mb at the end of membarrier: * Memory barrier on the caller thread _after_ we finished * waiting for the last IPI. [...] However, it does not explain why it needs to be paired with a barrier in the scheduler, clearly for the case where the IPI is skipped. I wonder whether this part of the comment is factually correct: * [...] Matches memory barriers around rq->curr modification in scheduler. Thanks, Mathieu
On Fri, Jul 17, 2020 at 12:22:49PM -0400, Mathieu Desnoyers wrote: > ----- On Jul 17, 2020, at 12:11 PM, Alan Stern stern@rowland.harvard.edu wrote: > > >> > I agree with Nick: A memory barrier is needed somewhere between the > >> > assignment at 6 and the return to user mode at 8. Otherwise you end up > >> > with the Store Buffer pattern having a memory barrier on only one side, > >> > and it is well known that this arrangement does not guarantee any > >> > ordering. > >> > >> Yes, I see this now. I'm still trying to wrap my head around why the memory > >> barrier at the end of membarrier() needs to be paired with a scheduler > >> barrier though. > > > > The memory barrier at the end of membarrier() on CPU0 is necessary in > > order to enforce the guarantee that any writes occurring on CPU1 before > > the membarrier() is executed will be visible to any code executing on > > CPU0 after the membarrier(). Ignoring the kthread issue, we can have: > > > > CPU0 CPU1 > > x = 1 > > barrier() > > y = 1 > > r2 = y > > membarrier(): > > a: smp_mb() > > b: send IPI IPI-induced mb > > c: smp_mb() > > r1 = x > > > > The writes to x and y are unordered by the hardware, so it's possible to > > have r2 = 1 even though the write to x doesn't execute until b. If the > > memory barrier at c is omitted then "r1 = x" can be reordered before b > > (although not before a), so we get r1 = 0. This violates the guarantee > > that membarrier() is supposed to provide. > > > > The timing of the memory barrier at c has to ensure that it executes > > after the IPI-induced memory barrier on CPU1. If it happened before > > then we could still end up with r1 = 0. That's why the pairing matters. > > > > I hope this helps your head get properly wrapped. :-) > > It does help a bit! ;-) > > This explains this part of the comment near the smp_mb at the end of membarrier: > > * Memory barrier on the caller thread _after_ we finished > * waiting for the last IPI. [...] > > However, it does not explain why it needs to be paired with a barrier in the > scheduler, clearly for the case where the IPI is skipped. I wonder whether this part > of the comment is factually correct: > > * [...] Matches memory barriers around rq->curr modification in scheduler. The reasoning is pretty much the same as above: CPU0 CPU1 x = 1 barrier() y = 1 r2 = y membarrier(): a: smp_mb() switch to kthread (includes mb) b: read rq->curr == kthread switch to user (includes mb) c: smp_mb() r1 = x Once again, it is possible that x = 1 doesn't become visible to CPU0 until shortly before b. But if c is omitted then "r1 = x" can be reordered before b (to any time after a), so we can have r1 = 0. Here the timing requirement is that c executes after the first memory barrier on CPU1 -- which is one of the ones around the rq->curr modification. (In fact, in this scenario CPU1's switch back to the user process is irrelevant.) Alan Stern
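Alan's skipped-IPI case, restated as a stand-alone C11 sketch (my paraphrase): this time the shape is message passing, with CPU1's scheduler-side barrier before the rq->curr store pairing with fence c at the end of membarrier(). atomic_thread_fence(memory_order_seq_cst) stands in for smp_mb(), atomic_signal_fence for barrier():

#include <stdatomic.h>

static _Atomic int x, y;	/* user-space data stored by CPU1 */
static _Atomic int rq_curr;	/* 1 = user thread, 0 = kthread   */
static int r1, r2;

static void cpu1(void)		/* the target CPU */
{
	atomic_store_explicit(&x, 1, memory_order_relaxed);
	atomic_signal_fence(memory_order_seq_cst);			/* barrier() */
	atomic_store_explicit(&y, 1, memory_order_relaxed);
	/* switch to kthread: a barrier precedes the rq->curr update */
	atomic_thread_fence(memory_order_seq_cst);
	atomic_store_explicit(&rq_curr, 0, memory_order_relaxed);	/* rq->curr = kthread */
}

static void cpu0(void)		/* the membarrier() caller */
{
	r2 = atomic_load_explicit(&y, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);			/* a: smp_mb() */
	if (atomic_load_explicit(&rq_curr, memory_order_relaxed) == 0) {
		/* b: rq->curr is a kthread, IPI skipped */
	}
	atomic_thread_fence(memory_order_seq_cst);			/* c: smp_mb() after the IPIs */
	r1 = atomic_load_explicit(&x, memory_order_relaxed);
}

/*
 * Because CPU1 has a barrier between x = 1 and the rq->curr store, and CPU0
 * has fence c between reading rq->curr and reading x, observing rq_curr == 0
 * at b forces r1 == 1.  Drop fence c and the load of x may be reordered up
 * before b, allowing r2 == 1 && r1 == 0; that is why the final smp_mb() in
 * membarrier() has to pair with a barrier on the scheduler side.
 */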
----- On Jul 17, 2020, at 1:44 PM, Alan Stern stern@rowland.harvard.edu wrote: > On Fri, Jul 17, 2020 at 12:22:49PM -0400, Mathieu Desnoyers wrote: >> ----- On Jul 17, 2020, at 12:11 PM, Alan Stern stern@rowland.harvard.edu wrote: >> >> >> > I agree with Nick: A memory barrier is needed somewhere between the >> >> > assignment at 6 and the return to user mode at 8. Otherwise you end up >> >> > with the Store Buffer pattern having a memory barrier on only one side, >> >> > and it is well known that this arrangement does not guarantee any >> >> > ordering. >> >> >> >> Yes, I see this now. I'm still trying to wrap my head around why the memory >> >> barrier at the end of membarrier() needs to be paired with a scheduler >> >> barrier though. >> > >> > The memory barrier at the end of membarrier() on CPU0 is necessary in >> > order to enforce the guarantee that any writes occurring on CPU1 before >> > the membarrier() is executed will be visible to any code executing on >> > CPU0 after the membarrier(). Ignoring the kthread issue, we can have: >> > >> > CPU0 CPU1 >> > x = 1 >> > barrier() >> > y = 1 >> > r2 = y >> > membarrier(): >> > a: smp_mb() >> > b: send IPI IPI-induced mb >> > c: smp_mb() >> > r1 = x >> > >> > The writes to x and y are unordered by the hardware, so it's possible to >> > have r2 = 1 even though the write to x doesn't execute until b. If the >> > memory barrier at c is omitted then "r1 = x" can be reordered before b >> > (although not before a), so we get r1 = 0. This violates the guarantee >> > that membarrier() is supposed to provide. >> > >> > The timing of the memory barrier at c has to ensure that it executes >> > after the IPI-induced memory barrier on CPU1. If it happened before >> > then we could still end up with r1 = 0. That's why the pairing matters. >> > >> > I hope this helps your head get properly wrapped. :-) >> >> It does help a bit! ;-) >> >> This explains this part of the comment near the smp_mb at the end of membarrier: >> >> * Memory barrier on the caller thread _after_ we finished >> * waiting for the last IPI. [...] >> >> However, it does not explain why it needs to be paired with a barrier in the >> scheduler, clearly for the case where the IPI is skipped. I wonder whether this >> part >> of the comment is factually correct: >> >> * [...] Matches memory barriers around rq->curr modification in scheduler. > > The reasoning is pretty much the same as above: > > CPU0 CPU1 > x = 1 > barrier() > y = 1 > r2 = y > membarrier(): > a: smp_mb() > switch to kthread (includes mb) > b: read rq->curr == kthread > switch to user (includes mb) > c: smp_mb() > r1 = x > > Once again, it is possible that x = 1 doesn't become visible to CPU0 > until shortly before b. But if c is omitted then "r1 = x" can be > reordered before b (to any time after a), so we can have r1 = 0. > > Here the timing requirement is that c executes after the first memory > barrier on CPU1 -- which is one of the ones around the rq->curr > modification. (In fact, in this scenario CPU1's switch back to the user > process is irrelevant.) That indeed covers the last scenario I was wondering about. Thanks Alan! Mathieu > > Alan Stern
Excerpts from Mathieu Desnoyers's message of July 17, 2020 11:42 pm: > ----- On Jul 16, 2020, at 7:26 PM, Nicholas Piggin npiggin@gmail.com wrote: > [...] >> >> membarrier does replace barrier instructions on remote CPUs, which do >> order accesses performed by the kernel on the user address space. So >> membarrier should too I guess. >> >> Normal process context accesses like read(2) will do so because they >> don't get filtered out from IPIs, but kernel threads using the mm may >> not. > > But it should not be an issue, because membarrier's ordering is only with respect > to submit and completion of io_uring requests, which are performed through > system calls from the context of user-space threads, which are called from the > right mm. Is that true? Can io completions be written into an address space via a kernel thread? I don't know the io_uring code well, but it looks like that's asynchronously using the user mm context. How about other memory accesses via kthread_use_mm? Presumably there is still an ordering requirement there for membarrier, so I really think it's a fragile interface with no real way for the user to know how kernel threads may use its mm for any particular reason; membarrier should synchronize all possible kernel users as well. Thanks, Nick
----- On Jul 19, 2020, at 11:03 PM, Nicholas Piggin npiggin@gmail.com wrote: > Excerpts from Mathieu Desnoyers's message of July 17, 2020 11:42 pm: >> ----- On Jul 16, 2020, at 7:26 PM, Nicholas Piggin npiggin@gmail.com wrote: >> [...] >>> >>> membarrier does replace barrier instructions on remote CPUs, which do >>> order accesses performed by the kernel on the user address space. So >>> membarrier should too I guess. >>> >>> Normal process context accesses like read(2) will do so because they >>> don't get filtered out from IPIs, but kernel threads using the mm may >>> not. >> >> But it should not be an issue, because membarrier's ordering is only with >> respect >> to submit and completion of io_uring requests, which are performed through >> system calls from the context of user-space threads, which are called from the >> right mm. > > Is that true? Can io completions be written into an address space via a > kernel thread? I don't know the io_uring code well but it looks like > that's asynchonously using the user mm context. Indeed, the io completion appears to be signaled asynchronously between kernel and user-space. Therefore, both kernel and userspace code need to have proper memory barriers in place to signal completion, otherwise user-space could read garbage after it notices completion of a read. I did not review the entire io_uring implementation, but the publish side for completion appears to be: static void __io_commit_cqring(struct io_ring_ctx *ctx) { struct io_rings *rings = ctx->rings; /* order cqe stores with ring update */ smp_store_release(&rings->cq.tail, ctx->cached_cq_tail); if (wq_has_sleeper(&ctx->cq_wait)) { wake_up_interruptible(&ctx->cq_wait); kill_fasync(&ctx->cq_fasync, SIGIO, POLL_IN); } } The store-release on tail should be paired with a load_acquire on the reader-side (it's called "read_barrier()" in the code): tools/io_uring/queue.c: static int __io_uring_get_cqe(struct io_uring *ring, struct io_uring_cqe **cqe_ptr, int wait) { struct io_uring_cq *cq = &ring->cq; const unsigned mask = *cq->kring_mask; unsigned head; int ret; *cqe_ptr = NULL; head = *cq->khead; do { /* * It's necessary to use a read_barrier() before reading * the CQ tail, since the kernel updates it locklessly. The * kernel has the matching store barrier for the update. The * kernel also ensures that previous stores to CQEs are ordered * with the tail update. */ read_barrier(); if (head != *cq->ktail) { *cqe_ptr = &cq->cqes[head & mask]; break; } if (!wait) break; ret = io_uring_enter(ring->ring_fd, 0, 1, IORING_ENTER_GETEVENTS, NULL); if (ret < 0) return -errno; } while (1); return 0; } So as far as membarrier memory ordering dependencies are concerned, it relies on the store-release/load-acquire dependency chain in the completion queue to order against anything that was done prior to the completed requests. What is in-flight while the requests are being serviced provides no memory ordering guarantee whatsoever. > How about other memory accesses via kthread_use_mm? Presumably there is > still ordering requirement there for membarrier, Please provide an example case with memory accesses via kthread_use_mm where ordering matters to support your concern. > so I really think > it's a fragile interface with no real way for the user to know how > kernel threads may use its mm for any particular reason, so membarrier > should synchronize all possible kernel users as well. I strongly doubt so, but perhaps something should be clarified in the documentation if you have that feeling. 
Thanks, Mathieu > > Thanks, > Nick
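For reference, a stand-alone C11 analogue (not the io_uring code itself) of the publication protocol quoted above: the producer's store-release on the tail pairs with the consumer's load-acquire, and that pairing is what carries the ordering for everything written before a request was completed.

#include <stdatomic.h>

#define RING_ENTRIES 16		/* illustrative size, single producer / single consumer */

struct ring {
	int entries[RING_ENTRIES];	/* the "CQEs"                */
	_Atomic unsigned int tail;	/* published by the producer */
	unsigned int head;		/* private to the consumer   */
};

/* producer: plays the role of the kernel completing a request */
static void ring_publish(struct ring *r, int value)
{
	unsigned int tail = atomic_load_explicit(&r->tail, memory_order_relaxed);

	r->entries[tail % RING_ENTRIES] = value;
	/* store-release: the entry write above cannot pass this tail update */
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
}

/* consumer: plays the role of __io_uring_get_cqe()'s tail check */
static int ring_consume(struct ring *r, int *value)
{
	/* load-acquire pairing with the release above, so the entry read
	 * below is guaranteed to see the data written before the tail bump */
	unsigned int tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (r->head == tail)
		return 0;		/* nothing new */
	*value = r->entries[r->head % RING_ENTRIES];
	r->head++;
	return 1;
}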
Excerpts from Mathieu Desnoyers's message of July 21, 2020 2:46 am: > ----- On Jul 19, 2020, at 11:03 PM, Nicholas Piggin npiggin@gmail.com wrote: > >> Excerpts from Mathieu Desnoyers's message of July 17, 2020 11:42 pm: >>> ----- On Jul 16, 2020, at 7:26 PM, Nicholas Piggin npiggin@gmail.com wrote: >>> [...] >>>> >>>> membarrier does replace barrier instructions on remote CPUs, which do >>>> order accesses performed by the kernel on the user address space. So >>>> membarrier should too I guess. >>>> >>>> Normal process context accesses like read(2) will do so because they >>>> don't get filtered out from IPIs, but kernel threads using the mm may >>>> not. >>> >>> But it should not be an issue, because membarrier's ordering is only with >>> respect >>> to submit and completion of io_uring requests, which are performed through >>> system calls from the context of user-space threads, which are called from the >>> right mm. >> >> Is that true? Can io completions be written into an address space via a >> kernel thread? I don't know the io_uring code well but it looks like >> that's asynchonously using the user mm context. > > Indeed, the io completion appears to be signaled asynchronously between kernel > and user-space. Yep, many other places do similar with use_mm. [snip] > So as far as membarrier memory ordering dependencies are concerned, it relies > on the store-release/load-acquire dependency chain in the completion queue to > order against anything that was done prior to the completed requests. > > What is in-flight while the requests are being serviced provides no memory > ordering guarantee whatsoever. Yeah you're probably right in this case I think. Quite likely most kernel tasks that asynchronously write to user memory would at least have some kind of producer-consumer barriers. But is that restriction of all async modifications documented and enforced anywhere? >> How about other memory accesses via kthread_use_mm? Presumably there is >> still ordering requirement there for membarrier, > > Please provide an example case with memory accesses via kthread_use_mm where > ordering matters to support your concern. I think the concern Andy raised with io_uring was less a specific problem he saw and more a general concern that we have these memory accesses which are not synchronized with membarrier. >> so I really think >> it's a fragile interface with no real way for the user to know how >> kernel threads may use its mm for any particular reason, so membarrier >> should synchronize all possible kernel users as well. > > I strongly doubt so, but perhaps something should be clarified in the documentation > if you have that feeling. I'd rather go the other way and say if you have reasoning or numbers for why PF_KTHREAD is an important optimisation above rq->curr == rq->idle then we could think about keeping this subtlety with appropriate documentation added, otherwise we can just kill it and remove all doubt. That being said, the x86 sync core gap that I imagined could be fixed by changing to rq->curr == rq->idle test does not actually exist because the global membarrier does not have a sync core option. So fixing the exit_lazy_tlb points that this series does *should* fix that. So PF_KTHREAD may be less problematic than I thought from implementation point of view, only semantics. Thanks, Nick
----- On Jul 21, 2020, at 6:04 AM, Nicholas Piggin npiggin@gmail.com wrote: > Excerpts from Mathieu Desnoyers's message of July 21, 2020 2:46 am: [...] > > Yeah you're probably right in this case I think. Quite likely most kernel > tasks that asynchronously write to user memory would at least have some > kind of producer-consumer barriers. > > But is that restriction of all async modifications documented and enforced > anywhere? > >>> How about other memory accesses via kthread_use_mm? Presumably there is >>> still ordering requirement there for membarrier, >> >> Please provide an example case with memory accesses via kthread_use_mm where >> ordering matters to support your concern. > > I think the concern Andy raised with io_uring was less a specific > problem he saw and more a general concern that we have these memory > accesses which are not synchronized with membarrier. > >>> so I really think >>> it's a fragile interface with no real way for the user to know how >>> kernel threads may use its mm for any particular reason, so membarrier >>> should synchronize all possible kernel users as well. >> >> I strongly doubt so, but perhaps something should be clarified in the >> documentation >> if you have that feeling. > > I'd rather go the other way and say if you have reasoning or numbers for > why PF_KTHREAD is an important optimisation above rq->curr == rq->idle > then we could think about keeping this subtlety with appropriate > documentation added, otherwise we can just kill it and remove all doubt. > > That being said, the x86 sync core gap that I imagined could be fixed > by changing to rq->curr == rq->idle test does not actually exist because > the global membarrier does not have a sync core option. So fixing the > exit_lazy_tlb points that this series does *should* fix that. So > PF_KTHREAD may be less problematic than I thought from implementation > point of view, only semantics. Today, the membarrier global expedited command explicitly skips kernel threads, but it happens that membarrier private expedited considers those with the same mm as target for the IPI. So we already implement a semantic which differs between private and global expedited membarriers. This can be explained in part by the fact that kthread_use_mm was introduced after 4.16, where the most recent membarrier commands where introduced. It seems that the effect on membarrier was not considered when kthread_use_mm was introduced. Looking at membarrier(2) documentation, it states that IPIs are only sent to threads belonging to the same process as the calling thread. If my understanding of the notion of process is correct, this should rule out sending the IPI to kernel threads, given they are not "part" of the same process, only borrowing the mm. But I agree that the distinction is moot, and should be clarified. Without a clear use-case to justify adding a constraint on membarrier, I am tempted to simply clarify documentation of current membarrier commands, stating clearly that they are not guaranteed to affect kernel threads. Then, if we have a compelling use-case to implement a different behavior which covers kthreads, this could be added consistently across membarrier commands with a flag (or by adding new commands). Does this approach make sense ? Thanks, Mathieu
Excerpts from Mathieu Desnoyers's message of July 21, 2020 11:11 pm:
> ----- On Jul 21, 2020, at 6:04 AM, Nicholas Piggin npiggin@gmail.com wrote:
>
>> Excerpts from Mathieu Desnoyers's message of July 21, 2020 2:46 am:
> [...]
>>
>> Yeah you're probably right in this case I think. Quite likely most kernel
>> tasks that asynchronously write to user memory would at least have some
>> kind of producer-consumer barriers.
>>
>> But is that restriction of all async modifications documented and enforced
>> anywhere?
>>
>>>> How about other memory accesses via kthread_use_mm? Presumably there is
>>>> still ordering requirement there for membarrier,
>>>
>>> Please provide an example case with memory accesses via kthread_use_mm where
>>> ordering matters to support your concern.
>>
>> I think the concern Andy raised with io_uring was less a specific
>> problem he saw and more a general concern that we have these memory
>> accesses which are not synchronized with membarrier.
>>
>>>> so I really think
>>>> it's a fragile interface with no real way for the user to know how
>>>> kernel threads may use its mm for any particular reason, so membarrier
>>>> should synchronize all possible kernel users as well.
>>>
>>> I strongly doubt so, but perhaps something should be clarified in the
>>> documentation if you have that feeling.
>>
>> I'd rather go the other way and say if you have reasoning or numbers for
>> why PF_KTHREAD is an important optimisation above rq->curr == rq->idle
>> then we could think about keeping this subtlety with appropriate
>> documentation added, otherwise we can just kill it and remove all doubt.
>>
>> That being said, the x86 sync core gap that I imagined could be fixed
>> by changing to rq->curr == rq->idle test does not actually exist because
>> the global membarrier does not have a sync core option. So fixing the
>> exit_lazy_tlb points that this series does *should* fix that. So
>> PF_KTHREAD may be less problematic than I thought from implementation
>> point of view, only semantics.
>
> Today, the membarrier global expedited command explicitly skips kernel threads,
> but it happens that membarrier private expedited considers those with the
> same mm as target for the IPI.
>
> So we already implement a semantic which differs between private and global
> expedited membarriers.

Which is not a good thing.

> This can be explained in part by the fact that
> kthread_use_mm was introduced after 4.16, where the most recent membarrier
> commands were introduced. It seems that the effect on membarrier was not
> considered when kthread_use_mm was introduced.

No, it was just renamed: it used to be called use_mm and has been in the
kernel for ~ever. That you hadn't considered this actually lends weight to
my point, which is that there's so much subtle behaviour that's easy to miss
that we're better off with simpler and fewer special cases until it's proven
they're needed. Not the other way around.

>
> Looking at membarrier(2) documentation, it states that IPIs are only sent to
> threads belonging to the same process as the calling thread. If my understanding
> of the notion of process is correct, this should rule out sending the IPI to
> kernel threads, given they are not "part" of the same process, only borrowing
> the mm. But I agree that the distinction is moot, and should be clarified.

It does if you read it in a user-hostile legalistic way. The reality is that
userspace shouldn't and can't know how the kernel might implement
functionality.

> Without a clear use-case to justify adding a constraint on membarrier, I am
> tempted to simply clarify the documentation of the current membarrier commands,
> stating clearly that they are not guaranteed to affect kernel threads. Then,
> if we have a compelling use-case to implement a different behavior which covers
> kthreads, this could be added consistently across membarrier commands with a
> flag (or by adding new commands).
>
> Does this approach make sense?

The other position is that, without a clear use case for PF_KTHREAD, and
seeing as async kernel accesses had not been considered before now, we limit
the optimisation to only skipping the idle thread. I think that makes more
sense (unless you have a reason for PF_KTHREAD, but it doesn't seem like
there is much of one).

Thanks,
Nick
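To make the two positions concrete, here is a simplified sketch of the IPI
target selection that the global expedited command performs, in the shape of
the loop in kernel/sched/membarrier.c but not the upstream code: the function
name, the reduced locking and the omission of the per-rq membarrier_state
check are all simplifications for illustration. Only the commented-out test
differs between the two positions. (Private expedited, by contrast, keys off
p->mm == mm, which is why it already IPIs a kthread that did kthread_use_mm().)

/* Illustrative sketch only -- not the in-tree membarrier_global_expedited(). */
static void membarrier_global_targets_sketch(struct cpumask *tmpmask)
{
	int cpu;

	rcu_read_lock();
	for_each_online_cpu(cpu) {
		struct task_struct *p = rcu_dereference(cpu_rq(cpu)->curr);

		/*
		 * Position A (what the code does today): skip every kernel
		 * thread, including one that attached a user mm via
		 * kthread_use_mm().
		 */
		if (p->flags & PF_KTHREAD)
			continue;

		/*
		 * Position B (proposed): only skip the idle task, so a
		 * kthread borrowing an mm is IPIed like a user thread:
		 *
		 *	if (is_idle_task(p))
		 *		continue;
		 */

		__cpumask_set_cpu(cpu, tmpmask);
	}
	rcu_read_unlock();
}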
On Tue, Jul 21, 2020 at 08:04:27PM +1000, Nicholas Piggin wrote:

> That being said, the x86 sync core gap that I imagined could be fixed
> by changing to rq->curr == rq->idle test does not actually exist because
> the global membarrier does not have a sync core option. So fixing the
> exit_lazy_tlb points that this series does *should* fix that. So
> PF_KTHREAD may be less problematic than I thought from implementation
> point of view, only semantics.

So I've been trying to figure out where that PF_KTHREAD comes from,
commit 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy
load") changed 'p->mm' to '!(p->flags & PF_KTHREAD)'.

So the first version:

https://lkml.kernel.org/r/20190906031300.1647-5-mathieu.desnoyers@efficios.com

appears to unconditionally send the IPI and checks p->mm in the IPI
context, but then v2:

https://lkml.kernel.org/r/20190908134909.12389-1-mathieu.desnoyers@efficios.com

has the current code. But I've been unable to find the reason the
'p->mm' test changed into '!(p->flags & PF_KTHREAD)'.

The comment doesn't really help either; sure we have the whole lazy mm
thing, but that's ->active_mm, not ->mm.

Possibly it is because {,un}use_mm() do not have sufficient barriers to
make the remote p->mm test work? Or were we over-eager with the !p->mm
doesn't imply kthread 'cleanups' at the time?

Also, I just realized, I still have a fix for use_mm() now
kthread_use_mm() that seems to have been lost.
----- On Jul 21, 2020, at 11:06 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Tue, Jul 21, 2020 at 08:04:27PM +1000, Nicholas Piggin wrote:
>
>> That being said, the x86 sync core gap that I imagined could be fixed
>> by changing to rq->curr == rq->idle test does not actually exist because
>> the global membarrier does not have a sync core option. So fixing the
>> exit_lazy_tlb points that this series does *should* fix that. So
>> PF_KTHREAD may be less problematic than I thought from implementation
>> point of view, only semantics.
>
> So I've been trying to figure out where that PF_KTHREAD comes from,
> commit 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy
> load") changed 'p->mm' to '!(p->flags & PF_KTHREAD)'.
>
> So the first version:
>
> https://lkml.kernel.org/r/20190906031300.1647-5-mathieu.desnoyers@efficios.com
>
> appears to unconditionally send the IPI and checks p->mm in the IPI
> context, but then v2:
>
> https://lkml.kernel.org/r/20190908134909.12389-1-mathieu.desnoyers@efficios.com
>
> has the current code. But I've been unable to find the reason the
> 'p->mm' test changed into '!(p->flags & PF_KTHREAD)'.

Looking back at my inbox, it seems like you are the one who proposed to
skip all kthreads:

https://lkml.kernel.org/r/20190904124333.GQ2332@hirez.programming.kicks-ass.net

> The comment doesn't really help either; sure we have the whole lazy mm
> thing, but that's ->active_mm, not ->mm.
>
> Possibly it is because {,un}use_mm() do not have sufficient barriers to
> make the remote p->mm test work? Or were we over-eager with the !p->mm
> doesn't imply kthread 'cleanups' at the time?

The nice thing about adding back kthreads to the threads considered for
membarrier IPI is that it has no observable effect on the user-space ABI.
No pre-existing kthread relies on this, and we just provide an additional
guarantee for future kthread implementations.

> Also, I just realized, I still have a fix for use_mm() now
> kthread_use_mm() that seems to have been lost.

I suspect we need to at least document the memory barriers in kthread_use_mm
and kthread_unuse_mm to state that they are required by membarrier if we want
to IPI kthreads as well.

Thanks,

Mathieu
On Tue, Jul 21, 2020 at 11:15:13AM -0400, Mathieu Desnoyers wrote:
> ----- On Jul 21, 2020, at 11:06 AM, Peter Zijlstra peterz@infradead.org wrote:
>
> > On Tue, Jul 21, 2020 at 08:04:27PM +1000, Nicholas Piggin wrote:
> >
> >> That being said, the x86 sync core gap that I imagined could be fixed
> >> by changing to rq->curr == rq->idle test does not actually exist because
> >> the global membarrier does not have a sync core option. So fixing the
> >> exit_lazy_tlb points that this series does *should* fix that. So
> >> PF_KTHREAD may be less problematic than I thought from implementation
> >> point of view, only semantics.
> >
> > So I've been trying to figure out where that PF_KTHREAD comes from,
> > commit 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy
> > load") changed 'p->mm' to '!(p->flags & PF_KTHREAD)'.
> >
> > So the first version:
> >
> > https://lkml.kernel.org/r/20190906031300.1647-5-mathieu.desnoyers@efficios.com
> >
> > appears to unconditionally send the IPI and checks p->mm in the IPI
> > context, but then v2:
> >
> > https://lkml.kernel.org/r/20190908134909.12389-1-mathieu.desnoyers@efficios.com
> >
> > has the current code. But I've been unable to find the reason the
> > 'p->mm' test changed into '!(p->flags & PF_KTHREAD)'.
>
> Looking back at my inbox, it seems like you are the one who proposed to
> skip all kthreads:
>
> https://lkml.kernel.org/r/20190904124333.GQ2332@hirez.programming.kicks-ass.net

I had a feeling it might've been me ;-) I just couldn't find the email.

> > The comment doesn't really help either; sure we have the whole lazy mm
> > thing, but that's ->active_mm, not ->mm.
> >
> > Possibly it is because {,un}use_mm() do not have sufficient barriers to
> > make the remote p->mm test work? Or were we over-eager with the !p->mm
> > doesn't imply kthread 'cleanups' at the time?
>
> The nice thing about adding back kthreads to the threads considered for
> membarrier IPI is that it has no observable effect on the user-space ABI.
> No pre-existing kthread relies on this, and we just provide an additional
> guarantee for future kthread implementations.
>
> > Also, I just realized, I still have a fix for use_mm() now
> > kthread_use_mm() that seems to have been lost.
>
> I suspect we need to at least document the memory barriers in kthread_use_mm
> and kthread_unuse_mm to state that they are required by membarrier if we want
> to IPI kthreads as well.

Right, so going by that email you found it was mostly a case of being
lazy, but yes, if we audit the kthread_{,un}use_mm() barriers and add
any other bits that might be needed, covering kthreads should be
possible.

No objections from me for making it so.
----- On Jul 21, 2020, at 11:19 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Tue, Jul 21, 2020 at 11:15:13AM -0400, Mathieu Desnoyers wrote:
>> ----- On Jul 21, 2020, at 11:06 AM, Peter Zijlstra peterz@infradead.org wrote:
>>
>> > On Tue, Jul 21, 2020 at 08:04:27PM +1000, Nicholas Piggin wrote:
>> >
>> >> That being said, the x86 sync core gap that I imagined could be fixed
>> >> by changing to rq->curr == rq->idle test does not actually exist because
>> >> the global membarrier does not have a sync core option. So fixing the
>> >> exit_lazy_tlb points that this series does *should* fix that. So
>> >> PF_KTHREAD may be less problematic than I thought from implementation
>> >> point of view, only semantics.
>> >
>> > So I've been trying to figure out where that PF_KTHREAD comes from,
>> > commit 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy
>> > load") changed 'p->mm' to '!(p->flags & PF_KTHREAD)'.
>> >
>> > So the first version:
>> >
>> > https://lkml.kernel.org/r/20190906031300.1647-5-mathieu.desnoyers@efficios.com
>> >
>> > appears to unconditionally send the IPI and checks p->mm in the IPI
>> > context, but then v2:
>> >
>> > https://lkml.kernel.org/r/20190908134909.12389-1-mathieu.desnoyers@efficios.com
>> >
>> > has the current code. But I've been unable to find the reason the
>> > 'p->mm' test changed into '!(p->flags & PF_KTHREAD)'.
>>
>> Looking back at my inbox, it seems like you are the one who proposed to
>> skip all kthreads:
>>
>> https://lkml.kernel.org/r/20190904124333.GQ2332@hirez.programming.kicks-ass.net
>
> I had a feeling it might've been me ;-) I just couldn't find the email.
>
>> > The comment doesn't really help either; sure we have the whole lazy mm
>> > thing, but that's ->active_mm, not ->mm.
>> >
>> > Possibly it is because {,un}use_mm() do not have sufficient barriers to
>> > make the remote p->mm test work? Or were we over-eager with the !p->mm
>> > doesn't imply kthread 'cleanups' at the time?
>>
>> The nice thing about adding back kthreads to the threads considered for
>> membarrier IPI is that it has no observable effect on the user-space ABI.
>> No pre-existing kthread relies on this, and we just provide an additional
>> guarantee for future kthread implementations.
>>
>> > Also, I just realized, I still have a fix for use_mm() now
>> > kthread_use_mm() that seems to have been lost.
>>
>> I suspect we need to at least document the memory barriers in kthread_use_mm
>> and kthread_unuse_mm to state that they are required by membarrier if we want
>> to IPI kthreads as well.
>
> Right, so going by that email you found it was mostly a case of being
> lazy, but yes, if we audit the kthread_{,un}use_mm() barriers and add
> any other bits that might be needed, covering kthreads should be
> possible.
>
> No objections from me for making it so.

I'm OK with making membarrier cover kthreads using an mm as well, provided we
audit kthread_{,un}use_mm() to make sure the proper barriers are in place
after setting task->mm and before clearing it.

Thanks,

Mathieu
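The audit being agreed on concerns ordering around the tsk->mm store. The
following is an illustrative sketch only, not the in-tree kthread_use_mm()
and kthread_unuse_mm(): the _sketch names are invented, and mm refcounting,
the real mm switch and TLB handling are elided. It merely marks where the
barriers under discussion ("after setting task->mm and before clearing it")
would sit, assuming a full barrier is what ends up being documented.

#include <linux/sched.h>
#include <linux/sched/task.h>
#include <linux/mm_types.h>

void kthread_use_mm_sketch(struct mm_struct *mm)
{
	struct task_struct *tsk = current;

	task_lock(tsk);
	tsk->mm = mm;		/* remote p->mm == mm check now matches */
	switch_mm(tsk->active_mm, mm, tsk);
	task_unlock(tsk);

	/*
	 * Barrier to audit/document: order the tsk->mm store before this
	 * kthread's first access to the user address space, so a concurrent
	 * membarrier that samples rq->curr cannot both skip this CPU and
	 * miss that access.
	 */
	smp_mb();
}

void kthread_unuse_mm_sketch(struct mm_struct *mm)
{
	struct task_struct *tsk = current;

	/*
	 * Barrier to audit/document: order the last user-space access before
	 * clearing tsk->mm, after which membarrier no longer treats this CPU
	 * as a target for mm's process.
	 */
	smp_mb();

	task_lock(tsk);
	tsk->mm = NULL;
	/* active_mm bookkeeping and lazy tlb handling elided */
	task_unlock(tsk);
}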
diff --git a/Documentation/features/sched/membarrier-sync-core/arch-support.txt b/Documentation/features/sched/membarrier-sync-core/arch-support.txt
index 52ad74a25f54..bd43fb1f5986 100644
--- a/Documentation/features/sched/membarrier-sync-core/arch-support.txt
+++ b/Documentation/features/sched/membarrier-sync-core/arch-support.txt
@@ -5,6 +5,10 @@
 #
 # Architecture requirements
 #
+# If your architecture returns to user-space through non-core-serializing
+# instructions, you need to ensure these are done in switch_mm and exit_lazy_tlb
+# (if lazy tlb switching is implemented).
+#
 # * arm/arm64/powerpc
 #
 # Rely on implicit context synchronization as a result of exception return
@@ -24,7 +28,7 @@
 # instead on write_cr3() performed by switch_mm() to provide core serialization
 # after changing the current mm, and deal with the special case of kthread ->
 # uthread (temporarily keeping current mm into active_mm) by issuing a
-# sync_core_before_usermode() in that specific case.
+# serializing instruction in exit_lazy_mm() in that specific case.
 #
     -----------------------
     | arch |status|
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 255750548433..5263863a9be8 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -6,6 +6,7 @@
 #include <linux/atomic.h>
 #include <linux/mm_types.h>
 #include <linux/pkeys.h>
+#include <linux/sched/mm.h>
 
 #include <trace/events/tlb.h>
 
@@ -95,6 +96,40 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
 #define enter_lazy_tlb enter_lazy_tlb
 extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
 
+#ifdef CONFIG_MEMBARRIER
+/*
+ * Ensure that a core serializing instruction is issued before returning
+ * to user-mode, if a SYNC_CORE was requested. x86 implements return to
+ * user-space through sysexit, sysrel, and sysretq, which are not core
+ * serializing.
+ *
+ * See the membarrier comment in finish_task_switch as to why this is done
+ * in exit_lazy_tlb.
+ */
+#define exit_lazy_tlb exit_lazy_tlb
+static inline void exit_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
+{
+	/* Switching mm is serializing with write_cr3 */
+	if (tsk->mm != mm)
+		return;
+
+	if (likely(!(atomic_read(&mm->membarrier_state) &
+			MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
+		return;
+
+	/* With PTI, we unconditionally serialize before running user code. */
+	if (static_cpu_has(X86_FEATURE_PTI))
+		return;
+	/*
+	 * Return from interrupt and NMI is done through iret, which is core
+	 * serializing.
+	 */
+	if (in_irq() || in_nmi())
+		return;
+	sync_core();
+}
+#endif
+
 /*
  * Init a new mm. Used on mm copies, like at fork()
  * and on mm's that are brand-new, like at execve().
diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
deleted file mode 100644
index c67caafd3381..000000000000
--- a/arch/x86/include/asm/sync_core.h
+++ /dev/null
@@ -1,28 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ASM_X86_SYNC_CORE_H
-#define _ASM_X86_SYNC_CORE_H
-
-#include <linux/preempt.h>
-#include <asm/processor.h>
-#include <asm/cpufeature.h>
-
-/*
- * Ensure that a core serializing instruction is issued before returning
- * to user-mode. x86 implements return to user-space through sysexit,
- * sysrel, and sysretq, which are not core serializing.
- */
-static inline void sync_core_before_usermode(void)
-{
-	/* With PTI, we unconditionally serialize before running user code. */
-	if (static_cpu_has(X86_FEATURE_PTI))
-		return;
-	/*
-	 * Return from interrupt and NMI is done through iret, which is core
-	 * serializing.
-	 */
-	if (in_irq() || in_nmi())
-		return;
-	sync_core();
-}
-
-#endif /* _ASM_X86_SYNC_CORE_H */
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 480a4d1b7dd8..9b026264b445 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -7,7 +7,6 @@
 #include <linux/sched.h>
 #include <linux/mm_types.h>
 #include <linux/gfp.h>
-#include <linux/sync_core.h>
 
 /*
  * Routines for handling mm_structs
@@ -364,16 +363,6 @@ enum {
 #include <asm/membarrier.h>
 #endif
 
-static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
-{
-	if (current->mm != mm)
-		return;
-	if (likely(!(atomic_read(&mm->membarrier_state) &
-		     MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
-		return;
-	sync_core_before_usermode();
-}
-
 extern void membarrier_exec_mmap(struct mm_struct *mm);
 
 #else
@@ -387,9 +376,6 @@ static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
 static inline void membarrier_exec_mmap(struct mm_struct *mm)
 {
 }
-static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
-{
-}
 #endif
 
 #endif /* _LINUX_SCHED_MM_H */
diff --git a/include/linux/sync_core.h b/include/linux/sync_core.h
deleted file mode 100644
index 013da4b8b327..000000000000
--- a/include/linux/sync_core.h
+++ /dev/null
@@ -1,21 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _LINUX_SYNC_CORE_H
-#define _LINUX_SYNC_CORE_H
-
-#ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
-#include <asm/sync_core.h>
-#else
-/*
- * This is a dummy sync_core_before_usermode() implementation that can be used
- * on all architectures which return to user-space through core serializing
- * instructions.
- * If your architecture returns to user-space through non-core-serializing
- * instructions, you need to write your own functions.
- */
-static inline void sync_core_before_usermode(void)
-{
-}
-#endif
-
-#endif /* _LINUX_SYNC_CORE_H */
-
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 6ff2578ecf17..134688d79589 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -572,7 +572,9 @@ static int finish_cpu(unsigned int cpu)
 
 	/*
 	 * idle_task_exit() will have switched to &init_mm, now
-	 * clean up any remaining active_mm state.
+	 * clean up any remaining active_mm state. exit_lazy_tlb
+	 * is not done, if an arch did any accounting in these
+	 * functions it would have to be added.
 	 */
 	if (mm != &init_mm)
 		idle->active_mm = &init_mm;
diff --git a/kernel/kthread.c b/kernel/kthread.c
index e813d92f2eab..6f93c649aa97 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1251,9 +1251,9 @@ void kthread_use_mm(struct mm_struct *mm)
 	finish_arch_post_lock_switch();
 #endif
 
+	exit_lazy_tlb(active_mm, tsk);
 	if (active_mm != mm)
 		mmdrop(active_mm);
-	exit_lazy_tlb(active_mm, tsk);
 
 	to_kthread(tsk)->oldfs = get_fs();
 	set_fs(USER_DS);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index debc917bc69b..31e22c79826c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3294,22 +3294,19 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	kcov_finish_switch(current);
 
 	fire_sched_in_preempt_notifiers(current);
+
 	/*
 	 * When switching through a kernel thread, the loop in
 	 * membarrier_{private,global}_expedited() may have observed that
 	 * kernel thread and not issued an IPI. It is therefore possible to
 	 * schedule between user->kernel->user threads without passing though
-	 * switch_mm(). Membarrier requires a barrier after storing to
-	 * rq->curr, before returning to userspace, so provide them here:
-	 *
-	 * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
-	 *   provided by mmdrop(),
-	 * - a sync_core for SYNC_CORE.
+	 * switch_mm(). Membarrier requires a full barrier after storing to
+	 * rq->curr, before returning to userspace, for
+	 * {PRIVATE,GLOBAL}_EXPEDITED. This is implicitly provided by mmdrop().
 	 */
-	if (mm) {
-		membarrier_mm_sync_core_before_usermode(mm);
+	if (mm)
 		mmdrop(mm);
-	}
+
 	if (unlikely(prev_state == TASK_DEAD)) {
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);
@@ -6292,6 +6289,7 @@ void idle_task_exit(void)
 	BUG_ON(current != this_rq()->idle);
 
 	if (mm != &init_mm) {
+		/* enter_lazy_tlb is not done because we're about to go down */
 		switch_mm(mm, &init_mm, current);
 		finish_arch_post_lock_switch();
 	}
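The x86 exit_lazy_tlb() hook added in the mmu_context.h hunk above closes the
window described by the finish_task_switch() comment it replaces. The timeline
below is illustrative only, assuming PTI is off, the process has registered
PRIVATE_EXPEDITED_SYNC_CORE, and the kthread never calls kthread_use_mm():

/*
 *   CPU0 (user thread A, mm M)            another thread of M
 *   ---------------------------           ----------------------------------
 *   switches to kthread K
 *     (K runs lazily on M:
 *      K->mm == NULL, active_mm == M)
 *                                         membarrier(PRIVATE_EXPEDITED_SYNC_CORE)
 *                                           -> skips CPU0, since rq->curr is a
 *                                              kthread with ->mm != M
 *   switches back to thread A
 *     (same mm, so no write_cr3 and no
 *      implicit core serialization)
 *   returns to user via sysret, which
 *   is not core-serializing
 *     -> exit_lazy_tlb() issues
 *        sync_core() in exactly this
 *        kthread -> user-thread case
 */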
And get rid of the generic sync_core_before_usermode facility.

This helper is the wrong way around I think. The idea that membarrier
state requires a core sync before returning to user is the easy one
that does not need hiding behind membarrier calls. The gap in core
synchronization due to x86's sysret/sysexit and lazy tlb mode is the
tricky detail that is better put in x86 lazy tlb code.

Consider if an arch did not synchronize core in switch_mm either; then
membarrier_mm_sync_core_before_usermode would be in the wrong place,
but arch specific mmu context functions would still be the right place.
There is also an exit_lazy_tlb case that is not covered by this call, which
could be a bug (a kthread uses the membarrier process's mm, then context
switches back to the process without switching mm or a lazy mm switch).

This makes lazy tlb code a bit more modular.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 .../membarrier-sync-core/arch-support.txt |  6 +++-
 arch/x86/include/asm/mmu_context.h        | 35 +++++++++++++++++++
 arch/x86/include/asm/sync_core.h          | 28 ---------------
 include/linux/sched/mm.h                  | 14 --------
 include/linux/sync_core.h                 | 21 -----------
 kernel/cpu.c                              |  4 ++-
 kernel/kthread.c                          |  2 +-
 kernel/sched/core.c                       | 16 ++++-----
 8 files changed, 51 insertions(+), 75 deletions(-)
 delete mode 100644 arch/x86/include/asm/sync_core.h
 delete mode 100644 include/linux/sync_core.h