Message ID | 20241004182734.1761555-5-mathieu.desnoyers@efficios.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | sched+mm: Track lazy active mm existence with hazard pointers | expand |
Hello, kernel test robot noticed "BUG:using_smp_processor_id()in_preemptible" on: commit: efef4da3b19cadf4beb45079a05643a77821de79 ("[RFC PATCH v2 4/4] sched+mm: Use hazard pointers to track lazy active mm existence") url: https://github.com/intel-lab-lkp/linux/commits/Mathieu-Desnoyers/compiler-h-Introduce-ptr_eq-to-preserve-address-dependency/20241005-023027 base: https://git.kernel.org/cgit/linux/kernel/git/powerpc/linux.git next patch link: https://lore.kernel.org/all/20241004182734.1761555-5-mathieu.desnoyers@efficios.com/ patch subject: [RFC PATCH v2 4/4] sched+mm: Use hazard pointers to track lazy active mm existence in testcase: boot compiler: gcc-12 test machine: qemu-system-i386 -enable-kvm -cpu SandyBridge -smp 2 -m 4G (please refer to attached dmesg/kmsg for entire log/backtrace) +--------------------------------------------+------------+------------+ | | 75b478bf10 | efef4da3b1 | +--------------------------------------------+------------+------------+ | BUG:using_smp_processor_id()in_preemptible | 0 | 12 | +--------------------------------------------+------------+------------+ If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <oliver.sang@intel.com> | Closes: https://lore.kernel.org/oe-lkp/202410072229.18756716-oliver.sang@intel.com [ 6.336856][ T48] BUG: using smp_processor_id() in preemptible [00000000] code: kworker/u10:1/48 [ 6.338102][ T48] caller is debug_smp_processor_id (lib/smp_processor_id.c:61) [ 6.338809][ T48] CPU: 0 UID: 0 PID: 48 Comm: kworker/u10:1 Not tainted 6.12.0-rc1-00004-gefef4da3b19c #5 [ 6.339929][ T48] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 6.341119][ T48] Call Trace: [ 6.341504][ T48] dump_stack_lvl (lib/dump_stack.c:123) [ 6.342057][ T48] dump_stack (lib/dump_stack.c:130) [ 6.342065][ T48] check_preemption_disabled (arch/x86/include/asm/preempt.h:84 lib/smp_processor_id.c:53) [ 6.342065][ T48] debug_smp_processor_id (lib/smp_processor_id.c:61) [ 6.342065][ T48] exec_mmap (include/linux/sched/mm.h:91 fs/exec.c:1017) [ 6.342065][ T48] ? would_dump (fs/exec.c:1409) [ 6.342065][ T48] begin_new_exec (fs/exec.c:1280) [ 6.342065][ T48] ? load_elf_phdrs (fs/binfmt_elf.c:534) [ 6.342065][ T48] load_elf_binary (fs/binfmt_elf.c:996) [ 6.342065][ T48] ? get_lock_stats (kernel/locking/lockdep.c:339) [ 6.342065][ T48] ? search_binary_handler (fs/exec.c:1752) [ 6.342065][ T48] search_binary_handler (fs/exec.c:1752) [ 6.342065][ T48] exec_binprm (fs/exec.c:1795) [ 6.342065][ T48] bprm_execve (fs/exec.c:1846 fs/exec.c:1821) [ 6.342065][ T48] kernel_execve (fs/exec.c:2012) [ 6.342065][ T48] call_usermodehelper_exec_async (kernel/umh.c:110) [ 6.342065][ T48] ? umh_complete (kernel/umh.c:65) [ 6.342065][ T48] ret_from_fork (arch/x86/kernel/process.c:153) [ 6.342065][ T48] ? umh_complete (kernel/umh.c:65) [ 6.342065][ T48] ret_from_fork_asm (arch/x86/entry/entry_32.S:737) [ 6.342065][ T48] entry_INT80_32 (arch/x86/entry/entry_32.S:944) [ 6.352140][ T1] ppdev: user-space parallel port driver [ 6.353841][ T1] HSI/SSI char device loaded [ 6.354238][ T1] e1000: Intel(R) PRO/1000 Network Driver [ 6.354673][ T1] e1000: Copyright (c) 1999-2006 Intel Corporation. [ 6.650009][ T1] ACPI: _SB_.LNKC: Enabled at IRQ 11 [ 6.968868][ T1] e1000 0000:00:03.0 eth0: (PCI:33MHz:32-bit) 52:54:00:12:34:56 [ 6.969500][ T1] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection [ 6.970506][ T49] BUG: using smp_processor_id() in preemptible [00000000] code: kworker/u10:1/49 [ 6.971191][ T49] caller is debug_smp_processor_id (lib/smp_processor_id.c:61) [ 6.971650][ T49] CPU: 0 UID: 0 PID: 49 Comm: kworker/u10:1 Not tainted 6.12.0-rc1-00004-gefef4da3b19c #5 [ 6.972365][ T49] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 6.973115][ T49] Call Trace: [ 6.973356][ T49] dump_stack_lvl (lib/dump_stack.c:123) [ 6.973686][ T49] dump_stack (lib/dump_stack.c:130) [ 6.973983][ T49] check_preemption_disabled (arch/x86/include/asm/preempt.h:84 lib/smp_processor_id.c:53) [ 6.974328][ T49] debug_smp_processor_id (lib/smp_processor_id.c:61) [ 6.974328][ T49] exec_mmap (include/linux/sched/mm.h:91 fs/exec.c:1017) [ 6.974328][ T49] ? would_dump (fs/exec.c:1409) [ 6.974328][ T49] begin_new_exec (fs/exec.c:1280) [ 6.974328][ T49] ? load_elf_phdrs (fs/binfmt_elf.c:534) [ 6.974328][ T49] load_elf_binary (fs/binfmt_elf.c:996) [ 6.974328][ T49] ? get_lock_stats (kernel/locking/lockdep.c:339) [ 6.974328][ T49] ? search_binary_handler (fs/exec.c:1752) [ 6.974328][ T49] search_binary_handler (fs/exec.c:1752) [ 6.974328][ T49] exec_binprm (fs/exec.c:1795) [ 6.974328][ T49] bprm_execve (fs/exec.c:1846 fs/exec.c:1821) [ 6.974328][ T49] kernel_execve (fs/exec.c:2012) [ 6.974328][ T49] call_usermodehelper_exec_async (kernel/umh.c:110) [ 6.974328][ T49] ? umh_complete (kernel/umh.c:65) [ 6.974328][ T49] ret_from_fork (arch/x86/kernel/process.c:153) [ 6.974328][ T49] ? umh_complete (kernel/umh.c:65) [ 6.974328][ T49] ret_from_fork_asm (arch/x86/entry/entry_32.S:737) [ 6.974328][ T49] entry_INT80_32 (arch/x86/entry/entry_32.S:944) The kernel config and materials to reproduce are available at: https://download.01.org/0day-ci/archive/20241007/202410072229.18756716-oliver.sang@intel.com
On 2024-10-07 16:50, kernel test robot wrote: > > > Hello, > > kernel test robot noticed "BUG:using_smp_processor_id()in_preemptible" on: > > commit: efef4da3b19cadf4beb45079a05643a77821de79 ("[RFC PATCH v2 4/4] sched+mm: Use hazard pointers to track lazy active mm existence") > url: https://github.com/intel-lab-lkp/linux/commits/Mathieu-Desnoyers/compiler-h-Introduce-ptr_eq-to-preserve-address-dependency/20241005-023027 > base: https://git.kernel.org/cgit/linux/kernel/git/powerpc/linux.git next > patch link: https://lore.kernel.org/all/20241004182734.1761555-5-mathieu.desnoyers@efficios.com/ > patch subject: [RFC PATCH v2 4/4] sched+mm: Use hazard pointers to track lazy active mm existence > This should do the trick: - WRITE_ONCE(this_cpu_ptr(hp_domain_sched_lazy_mm.percpu_slots)->addr, NULL); + this_cpu_write(hp_domain_sched_lazy_mm.percpu_slots->addr, NULL); I'll update the patch for the next round. Thanks, Mathieu > in testcase: boot > > compiler: gcc-12 > test machine: qemu-system-i386 -enable-kvm -cpu SandyBridge -smp 2 -m 4G > > (please refer to attached dmesg/kmsg for entire log/backtrace) > > > +--------------------------------------------+------------+------------+ > | | 75b478bf10 | efef4da3b1 | > +--------------------------------------------+------------+------------+ > | BUG:using_smp_processor_id()in_preemptible | 0 | 12 | > +--------------------------------------------+------------+------------+ > > > If you fix the issue in a separate patch/commit (i.e. not just a new version of > the same patch/commit), kindly add following tags > | Reported-by: kernel test robot <oliver.sang@intel.com> > | Closes: https://lore.kernel.org/oe-lkp/202410072229.18756716-oliver.sang@intel.com > > > [ 6.336856][ T48] BUG: using smp_processor_id() in preemptible [00000000] code: kworker/u10:1/48 > [ 6.338102][ T48] caller is debug_smp_processor_id (lib/smp_processor_id.c:61) > [ 6.338809][ T48] CPU: 0 UID: 0 PID: 48 Comm: kworker/u10:1 Not tainted 6.12.0-rc1-00004-gefef4da3b19c #5 > [ 6.339929][ T48] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 > [ 6.341119][ T48] Call Trace: > [ 6.341504][ T48] dump_stack_lvl (lib/dump_stack.c:123) > [ 6.342057][ T48] dump_stack (lib/dump_stack.c:130) > [ 6.342065][ T48] check_preemption_disabled (arch/x86/include/asm/preempt.h:84 lib/smp_processor_id.c:53) > [ 6.342065][ T48] debug_smp_processor_id (lib/smp_processor_id.c:61) > [ 6.342065][ T48] exec_mmap (include/linux/sched/mm.h:91 fs/exec.c:1017) > [ 6.342065][ T48] ? would_dump (fs/exec.c:1409) > [ 6.342065][ T48] begin_new_exec (fs/exec.c:1280) > [ 6.342065][ T48] ? load_elf_phdrs (fs/binfmt_elf.c:534) > [ 6.342065][ T48] load_elf_binary (fs/binfmt_elf.c:996) > [ 6.342065][ T48] ? get_lock_stats (kernel/locking/lockdep.c:339) > [ 6.342065][ T48] ? search_binary_handler (fs/exec.c:1752) > [ 6.342065][ T48] search_binary_handler (fs/exec.c:1752) > [ 6.342065][ T48] exec_binprm (fs/exec.c:1795) > [ 6.342065][ T48] bprm_execve (fs/exec.c:1846 fs/exec.c:1821) > [ 6.342065][ T48] kernel_execve (fs/exec.c:2012) > [ 6.342065][ T48] call_usermodehelper_exec_async (kernel/umh.c:110) > [ 6.342065][ T48] ? umh_complete (kernel/umh.c:65) > [ 6.342065][ T48] ret_from_fork (arch/x86/kernel/process.c:153) > [ 6.342065][ T48] ? umh_complete (kernel/umh.c:65) > [ 6.342065][ T48] ret_from_fork_asm (arch/x86/entry/entry_32.S:737) > [ 6.342065][ T48] entry_INT80_32 (arch/x86/entry/entry_32.S:944) > [ 6.352140][ T1] ppdev: user-space parallel port driver > [ 6.353841][ T1] HSI/SSI char device loaded > [ 6.354238][ T1] e1000: Intel(R) PRO/1000 Network Driver > [ 6.354673][ T1] e1000: Copyright (c) 1999-2006 Intel Corporation. > [ 6.650009][ T1] ACPI: _SB_.LNKC: Enabled at IRQ 11 > [ 6.968868][ T1] e1000 0000:00:03.0 eth0: (PCI:33MHz:32-bit) 52:54:00:12:34:56 > [ 6.969500][ T1] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection > [ 6.970506][ T49] BUG: using smp_processor_id() in preemptible [00000000] code: kworker/u10:1/49 > [ 6.971191][ T49] caller is debug_smp_processor_id (lib/smp_processor_id.c:61) > [ 6.971650][ T49] CPU: 0 UID: 0 PID: 49 Comm: kworker/u10:1 Not tainted 6.12.0-rc1-00004-gefef4da3b19c #5 > [ 6.972365][ T49] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 > [ 6.973115][ T49] Call Trace: > [ 6.973356][ T49] dump_stack_lvl (lib/dump_stack.c:123) > [ 6.973686][ T49] dump_stack (lib/dump_stack.c:130) > [ 6.973983][ T49] check_preemption_disabled (arch/x86/include/asm/preempt.h:84 lib/smp_processor_id.c:53) > [ 6.974328][ T49] debug_smp_processor_id (lib/smp_processor_id.c:61) > [ 6.974328][ T49] exec_mmap (include/linux/sched/mm.h:91 fs/exec.c:1017) > [ 6.974328][ T49] ? would_dump (fs/exec.c:1409) > [ 6.974328][ T49] begin_new_exec (fs/exec.c:1280) > [ 6.974328][ T49] ? load_elf_phdrs (fs/binfmt_elf.c:534) > [ 6.974328][ T49] load_elf_binary (fs/binfmt_elf.c:996) > [ 6.974328][ T49] ? get_lock_stats (kernel/locking/lockdep.c:339) > [ 6.974328][ T49] ? search_binary_handler (fs/exec.c:1752) > [ 6.974328][ T49] search_binary_handler (fs/exec.c:1752) > [ 6.974328][ T49] exec_binprm (fs/exec.c:1795) > [ 6.974328][ T49] bprm_execve (fs/exec.c:1846 fs/exec.c:1821) > [ 6.974328][ T49] kernel_execve (fs/exec.c:2012) > [ 6.974328][ T49] call_usermodehelper_exec_async (kernel/umh.c:110) > [ 6.974328][ T49] ? umh_complete (kernel/umh.c:65) > [ 6.974328][ T49] ret_from_fork (arch/x86/kernel/process.c:153) > [ 6.974328][ T49] ? umh_complete (kernel/umh.c:65) > [ 6.974328][ T49] ret_from_fork_asm (arch/x86/entry/entry_32.S:737) > [ 6.974328][ T49] entry_INT80_32 (arch/x86/entry/entry_32.S:944) > > > > The kernel config and materials to reproduce are available at: > https://download.01.org/0day-ci/archive/20241007/202410072229.18756716-oliver.sang@intel.com > > >
diff --git a/Documentation/mm/active_mm.rst b/Documentation/mm/active_mm.rst index d096fc091e23..c225cac49c30 100644 --- a/Documentation/mm/active_mm.rst +++ b/Documentation/mm/active_mm.rst @@ -2,11 +2,10 @@ Active MM ========= -Note, the mm_count refcount may no longer include the "lazy" users -(running tasks with ->active_mm == mm && ->mm == NULL) on kernels -with CONFIG_MMU_LAZY_TLB_REFCOUNT=n. Taking and releasing these lazy -references must be done with mmgrab_lazy_tlb() and mmdrop_lazy_tlb() -helpers, which abstract this config option. +Note, the mm_count refcount no longer include the "lazy" users (running +tasks with ->active_mm == mm && ->mm == NULL) Taking and releasing these +lazy references must be done with mmgrab_lazy_tlb() and mmdrop_lazy_tlb() +helpers, which are implemented with hazard pointers. :: diff --git a/arch/Kconfig b/arch/Kconfig index 975dd22a2dbd..d4261935f8dc 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -475,38 +475,6 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM irqs disabled over activate_mm. Architectures that do IPI based TLB shootdowns should enable this. -# Use normal mm refcounting for MMU_LAZY_TLB kernel thread references. -# MMU_LAZY_TLB_REFCOUNT=n can improve the scalability of context switching -# to/from kernel threads when the same mm is running on a lot of CPUs (a large -# multi-threaded application), by reducing contention on the mm refcount. -# -# This can be disabled if the architecture ensures no CPUs are using an mm as a -# "lazy tlb" beyond its final refcount (i.e., by the time __mmdrop frees the mm -# or its kernel page tables). This could be arranged by arch_exit_mmap(), or -# final exit(2) TLB flush, for example. -# -# To implement this, an arch *must*: -# Ensure the _lazy_tlb variants of mmgrab/mmdrop are used when manipulating -# the lazy tlb reference of a kthread's ->active_mm (non-arch code has been -# converted already). -config MMU_LAZY_TLB_REFCOUNT - def_bool y - depends on !MMU_LAZY_TLB_SHOOTDOWN - -# This option allows MMU_LAZY_TLB_REFCOUNT=n. It ensures no CPUs are using an -# mm as a lazy tlb beyond its last reference count, by shooting down these -# users before the mm is deallocated. __mmdrop() first IPIs all CPUs that may -# be using the mm as a lazy tlb, so that they may switch themselves to using -# init_mm for their active mm. mm_cpumask(mm) is used to determine which CPUs -# may be using mm as a lazy tlb mm. -# -# To implement this, an arch *must*: -# - At the time of the final mmdrop of the mm, ensure mm_cpumask(mm) contains -# at least all possible CPUs in which the mm is lazy. -# - It must meet the requirements for MMU_LAZY_TLB_REFCOUNT=n (see above). -config MMU_LAZY_TLB_SHOOTDOWN - bool - config ARCH_HAVE_NMI_SAFE_CMPXCHG bool diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index d7b09b064a8a..b1e25e75baab 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -291,7 +291,6 @@ config PPC select MMU_GATHER_PAGE_SIZE select MMU_GATHER_RCU_TABLE_FREE select MMU_GATHER_MERGE_VMAS - select MMU_LAZY_TLB_SHOOTDOWN if PPC_BOOK3S_64 select MODULES_USE_ELF_RELA select NEED_DMA_MAP_STATE if PPC64 || NOT_COHERENT_CACHE select NEED_PER_CPU_EMBED_FIRST_CHUNK if PPC64 diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c index 9e1f6558d026..ff0d4f28cf52 100644 --- a/arch/powerpc/mm/book3s64/radix_tlb.c +++ b/arch/powerpc/mm/book3s64/radix_tlb.c @@ -1197,28 +1197,7 @@ void radix__tlb_flush(struct mmu_gather *tlb) * See the comment for radix in arch_exit_mmap(). */ if (tlb->fullmm) { - if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) { - /* - * Shootdown based lazy tlb mm refcounting means we - * have to IPI everyone in the mm_cpumask anyway soon - * when the mm goes away, so might as well do it as - * part of the final flush now. - * - * If lazy shootdown was improved to reduce IPIs (e.g., - * by batching), then it may end up being better to use - * tlbies here instead. - */ - preempt_disable(); - - smp_mb(); /* see radix__flush_tlb_mm */ - exit_flush_lazy_tlbs(mm); - __flush_all_mm(mm, true); - - preempt_enable(); - } else { - __flush_all_mm(mm, true); - } - + __flush_all_mm(mm, true); } else if ( (psize = radix_get_mmu_psize(page_size)) == -1) { if (!tlb->freed_tables) radix__flush_tlb_mm(mm); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 485424979254..db5f13554485 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -975,9 +975,6 @@ struct mm_struct { atomic_t tlb_flush_batched; #endif struct uprobes_state uprobes_state; -#ifdef CONFIG_PREEMPT_RT - struct rcu_head delayed_drop; -#endif #ifdef CONFIG_HUGETLB_PAGE atomic_long_t hugetlb_usage; #endif diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index 91546493c43d..0fecd1a3311d 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -9,6 +9,10 @@ #include <linux/gfp.h> #include <linux/sync_core.h> #include <linux/sched/coredump.h> +#include <linux/hp.h> + +/* Sched lazy mm hazard pointer domain. */ +DECLARE_PER_CPU(struct hp_slot, hp_domain_sched_lazy_mm); /* * Routines for handling mm_structs @@ -55,61 +59,42 @@ static inline void mmdrop(struct mm_struct *mm) __mmdrop(mm); } -#ifdef CONFIG_PREEMPT_RT -/* - * RCU callback for delayed mm drop. Not strictly RCU, but call_rcu() is - * by far the least expensive way to do that. - */ -static inline void __mmdrop_delayed(struct rcu_head *rhp) +/* Helpers for lazy TLB mm refcounting */ +static inline void mmgrab_lazy_tlb(struct mm_struct *mm) { - struct mm_struct *mm = container_of(rhp, struct mm_struct, delayed_drop); + struct hp_ctx ctx; - __mmdrop(mm); -} + /* + * mmgrab_lazy_tlb must provide a full memory barrier, see the + * membarrier comment finish_task_switch which relies on this. + */ + smp_mb(); -/* - * Invoked from finish_task_switch(). Delegates the heavy lifting on RT - * kernels via RCU. - */ -static inline void mmdrop_sched(struct mm_struct *mm) -{ - /* Provides a full memory barrier. See mmdrop() */ - if (atomic_dec_and_test(&mm->mm_count)) - call_rcu(&mm->delayed_drop, __mmdrop_delayed); -} -#else -static inline void mmdrop_sched(struct mm_struct *mm) -{ - mmdrop(mm); -} -#endif + /* + * The caller guarantees existence of mm. Allocate a hazard + * pointer to chain this existence guarantee to a hazard + * pointer. + */ + ctx = hp_allocate(&hp_domain_sched_lazy_mm, mm); -/* Helpers for lazy TLB mm refcounting */ -static inline void mmgrab_lazy_tlb(struct mm_struct *mm) -{ - if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) - mmgrab(mm); + /* There is only a single lazy mm per CPU at any time. */ + WARN_ON_ONCE(!hp_ctx_addr(ctx)); } static inline void mmdrop_lazy_tlb(struct mm_struct *mm) { - if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) { - mmdrop(mm); - } else { - /* - * mmdrop_lazy_tlb must provide a full memory barrier, see the - * membarrier comment finish_task_switch which relies on this. - */ - smp_mb(); - } + /* + * mmdrop_lazy_tlb must provide a full memory barrier, see the + * membarrier comment finish_task_switch which relies on this. + */ + smp_mb(); + WRITE_ONCE(this_cpu_ptr(&hp_domain_sched_lazy_mm)->addr, NULL); } static inline void mmdrop_lazy_tlb_sched(struct mm_struct *mm) { - if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) - mmdrop_sched(mm); - else - smp_mb(); /* see mmdrop_lazy_tlb() above */ + smp_mb(); /* see mmdrop_lazy_tlb() above */ + WRITE_ONCE(this_cpu_ptr(&hp_domain_sched_lazy_mm)->addr, NULL); } /** diff --git a/kernel/exit.c b/kernel/exit.c index 7430852a8571..cb4ace06c0f0 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -545,8 +545,6 @@ static void exit_mm(void) if (!mm) return; mmap_read_lock(mm); - mmgrab_lazy_tlb(mm); - BUG_ON(mm != current->active_mm); /* more a memory barrier than a real lock */ task_lock(current); /* @@ -561,6 +559,8 @@ static void exit_mm(void) */ smp_mb__after_spinlock(); local_irq_disable(); + mmgrab_lazy_tlb(mm); + BUG_ON(mm != current->active_mm); current->mm = NULL; membarrier_update_current_mm(NULL); enter_lazy_tlb(mm, current); diff --git a/kernel/fork.c b/kernel/fork.c index cc760491f201..42c652ec39b5 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -149,6 +149,9 @@ DEFINE_PER_CPU(unsigned long, process_counts) = 0; __cacheline_aligned DEFINE_RWLOCK(tasklist_lock); /* outer */ +/* Sched lazy mm hazard pointer domain. */ +DEFINE_PER_CPU(struct hp_slot, hp_domain_sched_lazy_mm); + #ifdef CONFIG_PROVE_RCU int lockdep_tasklist_lock_is_held(void) { @@ -855,50 +858,24 @@ static void do_shoot_lazy_tlb(void *arg) WARN_ON_ONCE(current->mm); current->active_mm = &init_mm; switch_mm(mm, &init_mm, current); + WRITE_ONCE(this_cpu_ptr(&hp_domain_sched_lazy_mm)->addr, NULL); } } -static void cleanup_lazy_tlbs(struct mm_struct *mm) +static void retire_lazy_mm_hp(int cpu, struct hp_slot *slot, void *addr) { - if (!IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) { - /* - * In this case, lazy tlb mms are refounted and would not reach - * __mmdrop until all CPUs have switched away and mmdrop()ed. - */ - return; - } + smp_call_function_single(cpu, do_shoot_lazy_tlb, addr, 1); + smp_call_function_single(cpu, do_check_lazy_tlb, addr, 1); +} +static void cleanup_lazy_tlbs(struct mm_struct *mm) +{ /* - * Lazy mm shootdown does not refcount "lazy tlb mm" usage, rather it - * requires lazy mm users to switch to another mm when the refcount + * Require lazy mm users to switch to another mm when the refcount * drops to zero, before the mm is freed. This requires IPIs here to * switch kernel threads to init_mm. - * - * archs that use IPIs to flush TLBs can piggy-back that lazy tlb mm - * switch with the final userspace teardown TLB flush which leaves the - * mm lazy on this CPU but no others, reducing the need for additional - * IPIs here. There are cases where a final IPI is still required here, - * such as the final mmdrop being performed on a different CPU than the - * one exiting, or kernel threads using the mm when userspace exits. - * - * IPI overheads have not found to be expensive, but they could be - * reduced in a number of possible ways, for example (roughly - * increasing order of complexity): - * - The last lazy reference created by exit_mm() could instead switch - * to init_mm, however it's probable this will run on the same CPU - * immediately afterwards, so this may not reduce IPIs much. - * - A batch of mms requiring IPIs could be gathered and freed at once. - * - CPUs store active_mm where it can be remotely checked without a - * lock, to filter out false-positives in the cpumask. - * - After mm_users or mm_count reaches zero, switching away from the - * mm could clear mm_cpumask to reduce some IPIs, perhaps together - * with some batching or delaying of the final IPIs. - * - A delayed freeing and RCU-like quiescing sequence based on mm - * switching to avoid IPIs completely. */ - on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1); - if (IS_ENABLED(CONFIG_DEBUG_VM_SHOOT_LAZIES)) - on_each_cpu(do_check_lazy_tlb, (void *)mm, 1); + hp_scan(&hp_domain_sched_lazy_mm, mm, retire_lazy_mm_hp); } /* diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 4c36cc680361..d883c2aa3518 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -3527,12 +3527,8 @@ static inline void switch_mm_cid(struct rq *rq, if (!next->mm) { // to kernel /* * user -> kernel transition does not guarantee a barrier, but - * we can use the fact that it performs an atomic operation in - * mmgrab(). - */ - if (prev->mm) // from user - smp_mb__after_mmgrab(); - /* + * we can use the fact that mmgrab() has a full barrier. + * * kernel -> kernel transition does not change rq->curr->mm * state. It stays NULL. */ diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index a30c03a66172..1cb9dab361c9 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -803,16 +803,6 @@ config DEBUG_VM If unsure, say N. -config DEBUG_VM_SHOOT_LAZIES - bool "Debug MMU_LAZY_TLB_SHOOTDOWN implementation" - depends on DEBUG_VM - depends on MMU_LAZY_TLB_SHOOTDOWN - help - Enable additional IPIs that ensure lazy tlb mm references are removed - before the mm is freed. - - If unsure, say N. - config DEBUG_VM_MAPLE_TREE bool "Debug VM maple trees" depends on DEBUG_VM
Replace lazy active mm existence tracking with hazard pointers. This removes the following implementations and their associated config options: - MMU_LAZY_TLB_REFCOUNT - MMU_LAZY_TLB_SHOOTDOWN - This removes the call_rcu delayed mm drop for RT. It leverages the fact that each CPU only ever have at most one single lazy active mm. This makes it a very good fit for a hazard pointer domain implemented with one hazard pointer slot per CPU. * Benchmarks: will-it-scale context_switch1_threads nr threads (-t) speedup 1 -0.2% 2 +0.4% 3 +0.2% 6 +0.6% 12 +0.8% 24 +3% 48 +12% 96 +21% 192 +28% 384 +4% 768 -0.6% Methodology: Each test is the average of 20 iterations. Use median result of 3 test runs. Test hardware: CPU(s): 384 On-line CPU(s) list: 0-383 Vendor ID: AuthenticAMD Model name: AMD EPYC 9654 96-Core Processor CPU family: 25 Model: 17 Thread(s) per core: 2 Core(s) per socket: 96 Socket(s): 2 Stepping: 1 Frequency boost: enabled CPU(s) scaling MHz: 100% CPU max MHz: 3709.0000 CPU min MHz: 400.0000 BogoMIPS: 4799.75 Memory: 768 GB ram. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: John Stultz <jstultz@google.com> Cc: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Zqiang <qiang.zhang1211@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Waiman Long <longman@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: maged.michael@gmail.com Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com> Cc: rcu@vger.kernel.org Cc: linux-mm@kvack.org Cc: lkmm@lists.linux.dev --- Documentation/mm/active_mm.rst | 9 ++-- arch/Kconfig | 32 ------------- arch/powerpc/Kconfig | 1 - arch/powerpc/mm/book3s64/radix_tlb.c | 23 +-------- include/linux/mm_types.h | 3 -- include/linux/sched/mm.h | 71 +++++++++++----------------- kernel/exit.c | 4 +- kernel/fork.c | 47 +++++------------- kernel/sched/sched.h | 8 +--- lib/Kconfig.debug | 10 ---- 10 files changed, 49 insertions(+), 159 deletions(-)