diff mbox series

[RFC,v3,4/4] sched+mm: Use hazard pointers to track lazy active mm existence

Message ID 20241008135034.1982519-5-mathieu.desnoyers@efficios.com (mailing list archive)
State New
Headers show
Series sched+mm: Track lazy active mm existence with hazard pointers | expand

Commit Message

Mathieu Desnoyers Oct. 8, 2024, 1:50 p.m. UTC
Replace lazy active mm existence tracking with hazard pointers. This
removes the following implementations and their associated config
options:

- MMU_LAZY_TLB_REFCOUNT
- MMU_LAZY_TLB_SHOOTDOWN
- This also removes the call_rcu() delayed mm drop used on PREEMPT_RT.

It leverages the fact that each CPU only ever has at most one lazy
active mm at any given time. This makes it a very good fit for a hazard
pointer domain implemented with one hazard pointer slot per CPU.
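
For illustration, here is a small stand-alone user-space model of the
per-CPU-slot scheme (a sketch of the idea only, not the kernel hazptr
API introduced earlier in this series, which also provides the memory
barriers that membarrier relies on):

/*
 * User-space model of "one hazard pointer slot per CPU".
 * Build with: cc -O2 -pthread hazptr_model.c
 */
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>

#define NSLOTS 2                        /* stands in for "one slot per CPU" */

static _Atomic(void *) slots[NSLOTS];   /* per-"CPU" hazard pointer slots */
static atomic_int published;
static int the_mm = 42;                 /* stands in for a struct mm_struct */

/* Reader ("CPU 0"): use the object lazily for a while, then let it go. */
static void *lazy_user(void *arg)
{
        (void)arg;
        atomic_store(&slots[0], (void *)&the_mm);   /* mmgrab_lazy_tlb()   */
        atomic_store(&published, 1);
        sched_yield();                              /* run with the lazy mm */
        atomic_store(&slots[0], NULL);              /* mmdrop_lazy_tlb()   */
        return NULL;
}

/* Updater: before freeing the object, scan every slot for it. */
static void scan(void *addr)
{
        for (int i = 0; i < NSLOTS; i++) {
                /* Busy-wait variant; the patch instead IPIs the matching CPU. */
                while (atomic_load(&slots[i]) == addr)
                        sched_yield();
        }
}

int main(void)
{
        pthread_t reader;

        pthread_create(&reader, NULL, lazy_user, NULL);
        while (!atomic_load(&published))    /* wait for the lazy user to post */
                sched_yield();
        scan(&the_mm);              /* like __mmdrop() -> cleanup_lazy_tlbs() */
        pthread_join(reader, NULL);
        printf("no slot references the mm anymore; safe to free it\n");
        return 0;
}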

* Benchmarks:

will-it-scale context_switch1_threads

nr threads (-t)     speedup
     1                -0.2%
     2                +0.4%
     3                +0.2%
     6                +0.6%
    12                +0.8%
    24                +3%
    48               +12%
    96               +21%
   192               +28%
   384                +4%
   768                -0.6%

Methodology: Each test is the average of 20 iterations; the median
result of 3 test runs is reported.

Test hardware:

CPU(s):                   384
  On-line CPU(s) list:    0-383
Vendor ID:                AuthenticAMD
  Model name:             AMD EPYC 9654 96-Core Processor
    CPU family:           25
    Model:                17
    Thread(s) per core:   2
    Core(s) per socket:   96
    Socket(s):            2
    Stepping:             1
    Frequency boost:      enabled
    CPU(s) scaling MHz:   100%
    CPU max MHz:          3709.0000
    CPU min MHz:          400.0000
    BogoMIPS:             4799.75

Memory: 768 GB ram.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: John Stultz <jstultz@google.com>
Cc: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Zqiang <qiang.zhang1211@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: maged.michael@gmail.com
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
Cc: rcu@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: lkmm@lists.linux.dev
---
 Documentation/mm/active_mm.rst       |  9 ++--
 arch/Kconfig                         | 32 -------------
 arch/powerpc/Kconfig                 |  1 -
 arch/powerpc/mm/book3s64/radix_tlb.c | 23 +---------
 include/linux/mm_types.h             |  3 --
 include/linux/sched/mm.h             | 68 ++++++++++------------------
 kernel/exit.c                        |  4 +-
 kernel/fork.c                        | 47 +++++--------------
 kernel/sched/sched.h                 |  8 +---
 lib/Kconfig.debug                    | 10 ----
 10 files changed, 45 insertions(+), 160 deletions(-)

Comments

Joel Fernandes Oct. 8, 2024, 7:51 p.m. UTC | #1
On Tue, Oct 8, 2024 at 9:52 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> Replace lazy active mm existence tracking with hazard pointers. This
> removes the following implementations and their associated config
> options:
>
> - MMU_LAZY_TLB_REFCOUNT
> - MMU_LAZY_TLB_SHOOTDOWN
> - This removes the call_rcu delayed mm drop for RT.
>
> It leverages the fact that each CPU only ever have at most one single
> lazy active mm. This makes it a very good fit for a hazard pointer
> domain implemented with one hazard pointer slot per CPU.
>
> -static void cleanup_lazy_tlbs(struct mm_struct *mm)
> +static void remove_lazy_mm_hp(int cpu, struct hazptr_slot *slot, void *addr)
>  {
> -       if (!IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
> -               /*
> -                * In this case, lazy tlb mms are refounted and would not reach
> -                * __mmdrop until all CPUs have switched away and mmdrop()ed.
> -                */
> -               return;
> -       }
> +       smp_call_function_single(cpu, do_shoot_lazy_tlb, addr, 1);
> +       smp_call_function_single(cpu, do_check_lazy_tlb, addr, 1);
> +}
>
> +static void cleanup_lazy_tlbs(struct mm_struct *mm)
> +{
[...]
> -       on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1);
> -       if (IS_ENABLED(CONFIG_DEBUG_VM_SHOOT_LAZIES))
> -               on_each_cpu(do_check_lazy_tlb, (void *)mm, 1);
> +       hazptr_scan(&hazptr_domain_sched_lazy_mm, mm, remove_lazy_mm_hp);

Hey Mathieu, take these comments with a grain of salt because I am
digging into active_mm again after a while.
IMO this seems like a strange hazard pointer callback use case,
because "scan" here retires the object immediately even though a
reader still holds a "reference". It is more like the callback forces
all readers holding a "reference" to switch away immediately, whether
they like it or not, rather than waiting until _they_ release the
reference. There is no such precedent in RCU, for instance, where a
callback never runs before a reader finishes.
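
To make the contrast concrete, here is my rough mental model of the two
scan behaviours, based only on the hazptr_scan() usage visible in this
patch (the actual implementation earlier in the series may differ, and
memory barriers are omitted):

static void hazptr_scan_sketch(struct hazptr_domain *domain, void *addr,
                               void (*on_match_cb)(int cpu,
                                                   struct hazptr_slot *slot,
                                                   void *addr))
{
        int cpu;

        for_each_possible_cpu(cpu) {
                struct hazptr_slot *slot = per_cpu_ptr(domain->percpu_slots, cpu);

                if (READ_ONCE(slot->addr) != addr)
                        continue;
                if (on_match_cb) {
                        /* Act on the matching reader right away (IPI here). */
                        on_match_cb(cpu, slot, addr);
                } else {
                        /* RCU-like: wait until the reader clears its slot. */
                        while (READ_ONCE(slot->addr) == addr)
                                cpu_relax();
                }
        }
}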

That might work for active_mm, but it sounds like a fringe use case
to me; it might be better to just force CONFIG_MMU_LAZY_TLB_SHOOTDOWN=y
for everyone and use on_each_cpu() instead. That would give the same
code simplification as this patch without requiring hazard pointers,
AFAICS. Or maybe start with that, and _then_ convert to HP if it makes
sense to? These are just some thoughts and I am OK with whatever the
reviewers' consensus is.

And if I understand correctly, for this use case we are not even
grabbing a "true reference" to the mm_struct object, because direct
access to mm_struct should require a proper mmgrab(), not a lazy_tlb
flavored one? Correct me if I'm wrong though.

Also, isn't it the case that on x86 this patch will now cause more
IPIs, whereas previously the global refcount did not require them,
since the last kthread switching out would no longer access the old
mm? Might it be worth checking fork/exit performance and whether it
scales?

thanks,

- Joel
kernel test robot Oct. 14, 2024, 8:27 a.m. UTC | #2
Hello,

kernel test robot noticed "WARNING:at_kernel/hazptr.c:#hazptr_scan" on:

commit: c1508707268498a6fd3ca5853ad65f9482c12374 ("[RFC PATCH v3 4/4] sched+mm: Use hazard pointers to track lazy active mm existence")
url: https://github.com/intel-lab-lkp/linux/commits/Mathieu-Desnoyers/compiler-h-Introduce-ptr_eq-to-preserve-address-dependency/20241008-215353
base: https://git.kernel.org/cgit/linux/kernel/git/powerpc/linux.git next
patch link: https://lore.kernel.org/all/20241008135034.1982519-5-mathieu.desnoyers@efficios.com/
patch subject: [RFC PATCH v3 4/4] sched+mm: Use hazard pointers to track lazy active mm existence

in testcase: boot

config: i386-randconfig-013-20241011
compiler: gcc-12
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)


+-----------------------------------------+------------+------------+
|                                         | b62696cacd | c150870726 |
+-----------------------------------------+------------+------------+
| WARNING:at_kernel/hazptr.c:#hazptr_scan | 0          | 5          |
| EIP:hazptr_scan                         | 0          | 5          |
+-----------------------------------------+------------+------------+


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202410141617.612a0f5b-lkp@intel.com


[    6.951355][   T22] ------------[ cut here ]------------
[ 6.951920][ T22] WARNING: CPU: 0 PID: 22 at kernel/hazptr.c:28 hazptr_scan (kernel/hazptr.c:28) 
[    6.952580][   T22] Modules linked in:
[    6.952880][   T22] CPU: 0 UID: 0 PID: 22 Comm: khugepaged Not tainted 6.12.0-rc1-00004-gc15087072684 #10
[ 6.953685][ T22] EIP: hazptr_scan (kernel/hazptr.c:28) 
[ 6.954087][ T22] Code: c0 74 0a 85 db 8b 0a 74 45 39 c8 74 21 5b 5e 5d 31 c0 31 d2 31 c9 c3 8d b4 26 00 00 00 00 f7 05 a4 18 34 c3 ff ff ff 7f 74 14 <0f> 0b eb d1 89 c1 31 c0 ff d3 5b 5e 5d 31 c0 31 d2 31 c9 c3 8b 0d
All code
========
   0:	c0 74 0a 85 db       	shlb   $0xdb,-0x7b(%rdx,%rcx,1)
   5:	8b 0a                	mov    (%rdx),%ecx
   7:	74 45                	je     0x4e
   9:	39 c8                	cmp    %ecx,%eax
   b:	74 21                	je     0x2e
   d:	5b                   	pop    %rbx
   e:	5e                   	pop    %rsi
   f:	5d                   	pop    %rbp
  10:	31 c0                	xor    %eax,%eax
  12:	31 d2                	xor    %edx,%edx
  14:	31 c9                	xor    %ecx,%ecx
  16:	c3                   	ret
  17:	8d b4 26 00 00 00 00 	lea    0x0(%rsi,%riz,1),%esi
  1e:	f7 05 a4 18 34 c3 ff 	testl  $0x7fffffff,-0x3ccbe75c(%rip)        # 0xffffffffc33418cc
  25:	ff ff 7f 
  28:	74 14                	je     0x3e
  2a:*	0f 0b                	ud2		<-- trapping instruction
  2c:	eb d1                	jmp    0xffffffffffffffff
  2e:	89 c1                	mov    %eax,%ecx
  30:	31 c0                	xor    %eax,%eax
  32:	ff d3                	call   *%rbx
  34:	5b                   	pop    %rbx
  35:	5e                   	pop    %rsi
  36:	5d                   	pop    %rbp
  37:	31 c0                	xor    %eax,%eax
  39:	31 d2                	xor    %edx,%edx
  3b:	31 c9                	xor    %ecx,%ecx
  3d:	c3                   	ret
  3e:	8b                   	.byte 0x8b
  3f:	0d                   	.byte 0xd

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2
   2:	eb d1                	jmp    0xffffffffffffffd5
   4:	89 c1                	mov    %eax,%ecx
   6:	31 c0                	xor    %eax,%eax
   8:	ff d3                	call   *%rbx
   a:	5b                   	pop    %rbx
   b:	5e                   	pop    %rsi
   c:	5d                   	pop    %rbp
   d:	31 c0                	xor    %eax,%eax
   f:	31 d2                	xor    %edx,%edx
  11:	31 c9                	xor    %ecx,%ecx
  13:	c3                   	ret
  14:	8b                   	.byte 0x8b
  15:	0d                   	.byte 0xd
[    6.955564][   T22] EAX: c6087680 EBX: c1061470 ECX: 00000000 EDX: c2e104e8
[    6.956135][   T22] ESI: c2e104e4 EDI: 00000001 EBP: c42ade88 ESP: c42ade80
[    6.956665][   T22] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 EFLAGS: 00010202
[    6.957266][   T22] CR0: 80050033 CR2: 0819cd10 CR3: 04033d80 CR4: 000406b0
[    6.957807][   T22] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[    6.958380][   T22] DR6: fffe0ff0 DR7: 00000400
[    6.958747][   T22] Call Trace:
[ 6.959005][ T22] ? show_regs (arch/x86/kernel/dumpstack.c:479) 
[ 6.959362][ T22] ? hazptr_scan (kernel/hazptr.c:28) 
[ 6.959694][ T22] ? __warn (kernel/panic.c:748) 
[ 6.959974][ T22] ? hazptr_scan (kernel/hazptr.c:28) 
[ 6.960361][ T22] ? hazptr_scan (kernel/hazptr.c:28) 
[ 6.960695][ T22] ? report_bug (lib/bug.c:180 lib/bug.c:219) 
[ 6.961083][ T22] ? hazptr_scan (kernel/hazptr.c:28) 
[ 6.961427][ T22] ? exc_overflow (arch/x86/kernel/traps.c:301) 
[ 6.961778][ T22] ? handle_bug (arch/x86/kernel/traps.c:260) 
[ 6.962157][ T22] ? exc_invalid_op (arch/x86/kernel/traps.c:309 (discriminator 1)) 
[ 6.962549][ T22] ? thread_stack_free_rcu (kernel/fork.c:867) 
[ 6.962955][ T22] ? handle_exception (arch/x86/entry/entry_32.S:1047) 
[ 6.963399][ T22] ? thread_stack_free_rcu (kernel/fork.c:867) 
[ 6.963801][ T22] ? exc_overflow (arch/x86/kernel/traps.c:301) 
[ 6.964203][ T22] ? hazptr_scan (kernel/hazptr.c:28) 
[ 6.964544][ T22] ? exc_overflow (arch/x86/kernel/traps.c:301) 
[ 6.964895][ T22] ? hazptr_scan (kernel/hazptr.c:28) 
[ 6.965279][ T22] __mmdrop (kernel/fork.c:895 (discriminator 3)) 
[ 6.965599][ T22] collect_mm_slot (mm/khugepaged.c:1455) 
[ 6.965952][ T22] khugepaged_scan_mm_slot+0x210/0x60c 
[ 6.966493][ T22] ? khugepaged (mm/khugepaged.c:2511 mm/khugepaged.c:2571) 
[ 6.966865][ T22] khugepaged (mm/khugepaged.c:2515 mm/khugepaged.c:2571) 
[ 6.967239][ T22] ? _raw_spin_unlock_irqrestore (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:97 arch/x86/include/asm/irqflags.h:155 include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194) 
[ 6.967684][ T22] ? __kthread_parkme (arch/x86/include/asm/bitops.h:206 arch/x86/include/asm/bitops.h:238 include/asm-generic/bitops/instrumented-non-atomic.h:142 kernel/kthread.c:280) 
[ 6.968102][ T22] kthread (kernel/kthread.c:389) 
[ 6.968400][ T22] ? khugepaged_scan_mm_slot+0x60c/0x60c 
[ 6.968896][ T22] ? kthread_park (kernel/kthread.c:342) 
[ 6.969286][ T22] ret_from_fork (arch/x86/kernel/process.c:153) 
[ 6.969628][ T22] ? kthread_park (kernel/kthread.c:342) 
[ 6.969961][ T22] ret_from_fork_asm (arch/x86/entry/entry_32.S:737) 
[ 6.970383][ T22] entry_INT80_32 (arch/x86/entry/entry_32.S:944) 
[    6.970758][   T22] irq event stamp: 4719
[ 6.971117][ T22] hardirqs last enabled at (4729): __up_console_sem (arch/x86/include/asm/irqflags.h:42 (discriminator 1) arch/x86/include/asm/irqflags.h:97 (discriminator 1) arch/x86/include/asm/irqflags.h:155 (discriminator 1) kernel/printk/printk.c:344 (discriminator 1)) 
[ 6.971790][ T22] hardirqs last disabled at (4736): __up_console_sem (kernel/printk/printk.c:342 (discriminator 1)) 
[ 6.972475][ T22] softirqs last enabled at (4708): handle_softirqs (kernel/softirq.c:401 kernel/softirq.c:582) 
[ 6.973162][ T22] softirqs last disabled at (4695): __do_softirq (kernel/softirq.c:589) 
[    6.973771][   T22] ---[ end trace 0000000000000000 ]---



The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20241014/202410141617.612a0f5b-lkp@intel.com
Mathieu Desnoyers Oct. 15, 2024, 1:33 p.m. UTC | #3
On 2024-10-14 04:27, kernel test robot wrote:
> 
> 
> Hello,
> 
> kernel test robot noticed "WARNING:at_kernel/hazptr.c:#hazptr_scan" on:
> 
> commit: c1508707268498a6fd3ca5853ad65f9482c12374 ("[RFC PATCH v3 4/4] sched+mm: Use hazard pointers to track lazy active mm existence")
> url: https://github.com/intel-lab-lkp/linux/commits/Mathieu-Desnoyers/compiler-h-Introduce-ptr_eq-to-preserve-address-dependency/20241008-215353
> base: https://git.kernel.org/cgit/linux/kernel/git/powerpc/linux.git next
> patch link: https://lore.kernel.org/all/20241008135034.1982519-5-mathieu.desnoyers@efficios.com/
> patch subject: [RFC PATCH v3 4/4] sched+mm: Use hazard pointers to track lazy active mm existence
> 

This appears to be called from:

static int khugepaged(void *none)
{
         struct khugepaged_mm_slot *mm_slot;

         set_freezable();
         set_user_nice(current, MAX_NICE);
                         
         while (!kthread_should_stop()) {
                 khugepaged_do_scan(&khugepaged_collapse_control);
                 khugepaged_wait_work();
         }
                 
         spin_lock(&khugepaged_mm_lock);
         mm_slot = khugepaged_scan.mm_slot;
         khugepaged_scan.mm_slot = NULL;
         if (mm_slot)
                 collect_mm_slot(mm_slot);   <-------- here
         spin_unlock(&khugepaged_mm_lock);
         return 0;
}

[...]

static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
{
         struct mm_slot *slot = &mm_slot->slot;
         struct mm_struct *mm = slot->mm;

         lockdep_assert_held(&khugepaged_mm_lock);

         if (hpage_collapse_test_exit(mm)) {
                 /* free mm_slot */
                 hash_del(&slot->hash);
                 list_del(&slot->mm_node);

                 /*
                  * Not strictly needed because the mm exited already.
                  *
                  * clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
                  */

                 /* khugepaged_mm_lock actually not necessary for the below */
                 mm_slot_free(mm_slot_cache, mm_slot);
                 mmdrop(mm);             <---------- here
         }
}

So technically, before my change, there were two things that differed here:

1) mmdrop possibly did not decrement mm_count to 0 if a reference was still
    held as a lazy mm, which skipped __mmdrop entirely.

2) If it happened that mm_count did decrement to 0, __mmdrop would call
    cleanup_lazy_tlbs().

    With CONFIG_MMU_LAZY_TLB_SHOOTDOWN=n (which is the case here), cleanup_lazy_tlbs()
    returned immediately.

    With CONFIG_MMU_LAZY_TLB_SHOOTDOWN=y (only for powerpc AFAIU), it would do
    on_each_cpu_mask() to send IPIs to other CPUs which are in the mm_cpumask().
    It appears to be OK to call on_each_cpu_mask() from preempt-disabled context.

So changing all this to send IPIs for CPUs which have the mm in their hazard pointer
slot is technically not so different from what CONFIG_MMU_LAZY_TLB_SHOOTDOWN=y
does.

So AFAIU we have the choice between three ways forward here:

1) Modify hazptr_scan() so it only checks lockdep_assert_preemption_enabled()
    when the on_match_cb() is NULL (busy-wait behavior). This would allow
    hazptr_scan() to be called from preempt-disabled context when a specific
    on-match callback is provided.

2) Modify mm/khugepaged.c:collect_mm_slot() so it releases the khugepaged_mm_lock
    spinlock around:

                 /* khugepaged_mm_lock actually not necessary for the below */
                 mm_slot_free(mm_slot_cache, mm_slot);
                 mmdrop(mm);

   So it does not call mmdrop() from preempt-off context (see the sketch
   after this list).

3) Approaches (1)+(2).
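
For (2), a rough untested sketch based on the collect_mm_slot() code
quoted above; the point is only to call mm_slot_free()/mmdrop() with
the spinlock (and its preemption-off section) released:

static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
{
        struct mm_slot *slot = &mm_slot->slot;
        struct mm_struct *mm = slot->mm;

        lockdep_assert_held(&khugepaged_mm_lock);

        if (hpage_collapse_test_exit(mm)) {
                /* free mm_slot */
                hash_del(&slot->hash);
                list_del(&slot->mm_node);

                /*
                 * khugepaged_mm_lock is not necessary for the below, and
                 * dropping it lets __mmdrop() -> hazptr_scan() run with
                 * preemption enabled.
                 */
                spin_unlock(&khugepaged_mm_lock);
                mm_slot_free(mm_slot_cache, mm_slot);
                mmdrop(mm);
                spin_lock(&khugepaged_mm_lock);
        }
}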

Thoughts ?

Thanks,

Mathieu
diff mbox series

Patch

diff --git a/Documentation/mm/active_mm.rst b/Documentation/mm/active_mm.rst
index d096fc091e23..c225cac49c30 100644
--- a/Documentation/mm/active_mm.rst
+++ b/Documentation/mm/active_mm.rst
@@ -2,11 +2,10 @@ 
 Active MM
 =========
 
-Note, the mm_count refcount may no longer include the "lazy" users
-(running tasks with ->active_mm == mm && ->mm == NULL) on kernels
-with CONFIG_MMU_LAZY_TLB_REFCOUNT=n. Taking and releasing these lazy
-references must be done with mmgrab_lazy_tlb() and mmdrop_lazy_tlb()
-helpers, which abstract this config option.
+Note, the mm_count refcount no longer includes the "lazy" users (running
+tasks with ->active_mm == mm && ->mm == NULL). Taking and releasing these
+lazy references must be done with mmgrab_lazy_tlb() and mmdrop_lazy_tlb()
+helpers, which are implemented with hazard pointers.
 
 ::
 
diff --git a/arch/Kconfig b/arch/Kconfig
index 975dd22a2dbd..d4261935f8dc 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -475,38 +475,6 @@  config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 	  irqs disabled over activate_mm. Architectures that do IPI based TLB
 	  shootdowns should enable this.
 
-# Use normal mm refcounting for MMU_LAZY_TLB kernel thread references.
-# MMU_LAZY_TLB_REFCOUNT=n can improve the scalability of context switching
-# to/from kernel threads when the same mm is running on a lot of CPUs (a large
-# multi-threaded application), by reducing contention on the mm refcount.
-#
-# This can be disabled if the architecture ensures no CPUs are using an mm as a
-# "lazy tlb" beyond its final refcount (i.e., by the time __mmdrop frees the mm
-# or its kernel page tables). This could be arranged by arch_exit_mmap(), or
-# final exit(2) TLB flush, for example.
-#
-# To implement this, an arch *must*:
-# Ensure the _lazy_tlb variants of mmgrab/mmdrop are used when manipulating
-# the lazy tlb reference of a kthread's ->active_mm (non-arch code has been
-# converted already).
-config MMU_LAZY_TLB_REFCOUNT
-	def_bool y
-	depends on !MMU_LAZY_TLB_SHOOTDOWN
-
-# This option allows MMU_LAZY_TLB_REFCOUNT=n. It ensures no CPUs are using an
-# mm as a lazy tlb beyond its last reference count, by shooting down these
-# users before the mm is deallocated. __mmdrop() first IPIs all CPUs that may
-# be using the mm as a lazy tlb, so that they may switch themselves to using
-# init_mm for their active mm. mm_cpumask(mm) is used to determine which CPUs
-# may be using mm as a lazy tlb mm.
-#
-# To implement this, an arch *must*:
-# - At the time of the final mmdrop of the mm, ensure mm_cpumask(mm) contains
-#   at least all possible CPUs in which the mm is lazy.
-# - It must meet the requirements for MMU_LAZY_TLB_REFCOUNT=n (see above).
-config MMU_LAZY_TLB_SHOOTDOWN
-	bool
-
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool
 
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index d7b09b064a8a..b1e25e75baab 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -291,7 +291,6 @@  config PPC
 	select MMU_GATHER_PAGE_SIZE
 	select MMU_GATHER_RCU_TABLE_FREE
 	select MMU_GATHER_MERGE_VMAS
-	select MMU_LAZY_TLB_SHOOTDOWN		if PPC_BOOK3S_64
 	select MODULES_USE_ELF_RELA
 	select NEED_DMA_MAP_STATE		if PPC64 || NOT_COHERENT_CACHE
 	select NEED_PER_CPU_EMBED_FIRST_CHUNK	if PPC64
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 9e1f6558d026..ff0d4f28cf52 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -1197,28 +1197,7 @@  void radix__tlb_flush(struct mmu_gather *tlb)
 	 * See the comment for radix in arch_exit_mmap().
 	 */
 	if (tlb->fullmm) {
-		if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
-			/*
-			 * Shootdown based lazy tlb mm refcounting means we
-			 * have to IPI everyone in the mm_cpumask anyway soon
-			 * when the mm goes away, so might as well do it as
-			 * part of the final flush now.
-			 *
-			 * If lazy shootdown was improved to reduce IPIs (e.g.,
-			 * by batching), then it may end up being better to use
-			 * tlbies here instead.
-			 */
-			preempt_disable();
-
-			smp_mb(); /* see radix__flush_tlb_mm */
-			exit_flush_lazy_tlbs(mm);
-			__flush_all_mm(mm, true);
-
-			preempt_enable();
-		} else {
-			__flush_all_mm(mm, true);
-		}
-
+		__flush_all_mm(mm, true);
 	} else if ( (psize = radix_get_mmu_psize(page_size)) == -1) {
 		if (!tlb->freed_tables)
 			radix__flush_tlb_mm(mm);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 485424979254..db5f13554485 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -975,9 +975,6 @@  struct mm_struct {
 		atomic_t tlb_flush_batched;
 #endif
 		struct uprobes_state uprobes_state;
-#ifdef CONFIG_PREEMPT_RT
-		struct rcu_head delayed_drop;
-#endif
 #ifdef CONFIG_HUGETLB_PAGE
 		atomic_long_t hugetlb_usage;
 #endif
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 91546493c43d..7b2f0a432f6e 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -9,6 +9,10 @@ 
 #include <linux/gfp.h>
 #include <linux/sync_core.h>
 #include <linux/sched/coredump.h>
+#include <linux/hazptr.h>
+
+/* Sched lazy mm hazard pointer domain. */
+DECLARE_HAZPTR_DOMAIN(hazptr_domain_sched_lazy_mm);
 
 /*
  * Routines for handling mm_structs
@@ -55,61 +59,37 @@  static inline void mmdrop(struct mm_struct *mm)
 		__mmdrop(mm);
 }
 
-#ifdef CONFIG_PREEMPT_RT
-/*
- * RCU callback for delayed mm drop. Not strictly RCU, but call_rcu() is
- * by far the least expensive way to do that.
- */
-static inline void __mmdrop_delayed(struct rcu_head *rhp)
-{
-	struct mm_struct *mm = container_of(rhp, struct mm_struct, delayed_drop);
-
-	__mmdrop(mm);
-}
-
-/*
- * Invoked from finish_task_switch(). Delegates the heavy lifting on RT
- * kernels via RCU.
- */
-static inline void mmdrop_sched(struct mm_struct *mm)
-{
-	/* Provides a full memory barrier. See mmdrop() */
-	if (atomic_dec_and_test(&mm->mm_count))
-		call_rcu(&mm->delayed_drop, __mmdrop_delayed);
-}
-#else
-static inline void mmdrop_sched(struct mm_struct *mm)
-{
-	mmdrop(mm);
-}
-#endif
-
 /* Helpers for lazy TLB mm refcounting */
 static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
 {
-	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
-		mmgrab(mm);
+	/*
+	 * mmgrab_lazy_tlb must provide a full memory barrier, see the
+	 * membarrier comment finish_task_switch which relies on this.
+	 */
+	smp_mb();
+
+	/*
+	 * The caller guarantees existence of mm. Post a hazard pointer
+	 * to chain this existence guarantee to a hazard pointer.
+	 * There is only a single lazy mm per CPU at any time.
+	 */
+	WARN_ON_ONCE(!hazptr_try_protect(&hazptr_domain_sched_lazy_mm, mm, NULL));
 }
 
 static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
 {
-	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) {
-		mmdrop(mm);
-	} else {
-		/*
-		 * mmdrop_lazy_tlb must provide a full memory barrier, see the
-		 * membarrier comment finish_task_switch which relies on this.
-		 */
-		smp_mb();
-	}
+	/*
+	 * mmdrop_lazy_tlb must provide a full memory barrier, see the
+	 * membarrier comment finish_task_switch which relies on this.
+	 */
+	smp_mb();
+	this_cpu_write(hazptr_domain_sched_lazy_mm.percpu_slots->addr, NULL);
 }
 
 static inline void mmdrop_lazy_tlb_sched(struct mm_struct *mm)
 {
-	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
-		mmdrop_sched(mm);
-	else
-		smp_mb(); /* see mmdrop_lazy_tlb() above */
+	smp_mb(); /* see mmdrop_lazy_tlb() above */
+	this_cpu_write(hazptr_domain_sched_lazy_mm.percpu_slots->addr, NULL);
 }
 
 /**
diff --git a/kernel/exit.c b/kernel/exit.c
index 7430852a8571..cb4ace06c0f0 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -545,8 +545,6 @@  static void exit_mm(void)
 	if (!mm)
 		return;
 	mmap_read_lock(mm);
-	mmgrab_lazy_tlb(mm);
-	BUG_ON(mm != current->active_mm);
 	/* more a memory barrier than a real lock */
 	task_lock(current);
 	/*
@@ -561,6 +559,8 @@  static void exit_mm(void)
 	 */
 	smp_mb__after_spinlock();
 	local_irq_disable();
+	mmgrab_lazy_tlb(mm);
+	BUG_ON(mm != current->active_mm);
 	current->mm = NULL;
 	membarrier_update_current_mm(NULL);
 	enter_lazy_tlb(mm, current);
diff --git a/kernel/fork.c b/kernel/fork.c
index cc760491f201..0a2e2ab1680a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -149,6 +149,9 @@  DEFINE_PER_CPU(unsigned long, process_counts) = 0;
 
 __cacheline_aligned DEFINE_RWLOCK(tasklist_lock);  /* outer */
 
+/* Sched lazy mm hazard pointer domain. */
+DEFINE_HAZPTR_DOMAIN(hazptr_domain_sched_lazy_mm);
+
 #ifdef CONFIG_PROVE_RCU
 int lockdep_tasklist_lock_is_held(void)
 {
@@ -855,50 +858,24 @@  static void do_shoot_lazy_tlb(void *arg)
 		WARN_ON_ONCE(current->mm);
 		current->active_mm = &init_mm;
 		switch_mm(mm, &init_mm, current);
+		this_cpu_write(hazptr_domain_sched_lazy_mm.percpu_slots->addr, NULL);
 	}
 }
 
-static void cleanup_lazy_tlbs(struct mm_struct *mm)
+static void remove_lazy_mm_hp(int cpu, struct hazptr_slot *slot, void *addr)
 {
-	if (!IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
-		/*
-		 * In this case, lazy tlb mms are refounted and would not reach
-		 * __mmdrop until all CPUs have switched away and mmdrop()ed.
-		 */
-		return;
-	}
+	smp_call_function_single(cpu, do_shoot_lazy_tlb, addr, 1);
+	smp_call_function_single(cpu, do_check_lazy_tlb, addr, 1);
+}
 
+static void cleanup_lazy_tlbs(struct mm_struct *mm)
+{
 	/*
-	 * Lazy mm shootdown does not refcount "lazy tlb mm" usage, rather it
-	 * requires lazy mm users to switch to another mm when the refcount
+	 * Require lazy mm users to switch to another mm when the refcount
 	 * drops to zero, before the mm is freed. This requires IPIs here to
 	 * switch kernel threads to init_mm.
-	 *
-	 * archs that use IPIs to flush TLBs can piggy-back that lazy tlb mm
-	 * switch with the final userspace teardown TLB flush which leaves the
-	 * mm lazy on this CPU but no others, reducing the need for additional
-	 * IPIs here. There are cases where a final IPI is still required here,
-	 * such as the final mmdrop being performed on a different CPU than the
-	 * one exiting, or kernel threads using the mm when userspace exits.
-	 *
-	 * IPI overheads have not found to be expensive, but they could be
-	 * reduced in a number of possible ways, for example (roughly
-	 * increasing order of complexity):
-	 * - The last lazy reference created by exit_mm() could instead switch
-	 *   to init_mm, however it's probable this will run on the same CPU
-	 *   immediately afterwards, so this may not reduce IPIs much.
-	 * - A batch of mms requiring IPIs could be gathered and freed at once.
-	 * - CPUs store active_mm where it can be remotely checked without a
-	 *   lock, to filter out false-positives in the cpumask.
-	 * - After mm_users or mm_count reaches zero, switching away from the
-	 *   mm could clear mm_cpumask to reduce some IPIs, perhaps together
-	 *   with some batching or delaying of the final IPIs.
-	 * - A delayed freeing and RCU-like quiescing sequence based on mm
-	 *   switching to avoid IPIs completely.
 	 */
-	on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1);
-	if (IS_ENABLED(CONFIG_DEBUG_VM_SHOOT_LAZIES))
-		on_each_cpu(do_check_lazy_tlb, (void *)mm, 1);
+	hazptr_scan(&hazptr_domain_sched_lazy_mm, mm, remove_lazy_mm_hp);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4c36cc680361..d883c2aa3518 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3527,12 +3527,8 @@  static inline void switch_mm_cid(struct rq *rq,
 	if (!next->mm) {                                // to kernel
 		/*
 		 * user -> kernel transition does not guarantee a barrier, but
-		 * we can use the fact that it performs an atomic operation in
-		 * mmgrab().
-		 */
-		if (prev->mm)                           // from user
-			smp_mb__after_mmgrab();
-		/*
+		 * we can use the fact that mmgrab() has a full barrier.
+		 *
 		 * kernel -> kernel transition does not change rq->curr->mm
 		 * state. It stays NULL.
 		 */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index a30c03a66172..1cb9dab361c9 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -803,16 +803,6 @@  config DEBUG_VM
 
 	  If unsure, say N.
 
-config DEBUG_VM_SHOOT_LAZIES
-	bool "Debug MMU_LAZY_TLB_SHOOTDOWN implementation"
-	depends on DEBUG_VM
-	depends on MMU_LAZY_TLB_SHOOTDOWN
-	help
-	  Enable additional IPIs that ensure lazy tlb mm references are removed
-	  before the mm is freed.
-
-	  If unsure, say N.
-
 config DEBUG_VM_MAPLE_TREE
 	bool "Debug VM maple trees"
 	depends on DEBUG_VM