[RFC,0/4] sched+mm: Track lazy active mm existence with hazard pointers

Message ID: 20241002010205.1341915-1-mathieu.desnoyers@efficios.com

Message

Mathieu Desnoyers Oct. 2, 2024, 1:02 a.m. UTC
Hazard pointers appear to be a good fit for replacing refcount based lazy
active mm tracking.
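
For readers unfamiliar with the technique, here is a minimal userspace sketch of the hazard pointer idea using C11 atomics. It is not the API added by this series: the names `hp_slots`, `hp_protect()`, `hp_release()` and `hp_is_protected()`, and the fixed slot count, are assumptions made for illustration. It also uses sequentially consistent atomics and a plain `==` comparison, so it does not model the address-dependency concern that `ptr_eq()` addresses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define HP_NR_SLOTS 4	/* assumption: one slot per reader (thread or CPU) */

/* One hazard pointer slot per reader; NULL means "protecting nothing". */
static _Atomic(void *) hp_slots[HP_NR_SLOTS];

/*
 * Reader side: publish the pointer loaded from *src in our slot, then
 * re-load *src to check that it did not change before the slot became
 * visible. Retries until both loads agree. May return NULL.
 */
void *hp_protect(unsigned int slot, _Atomic(void *) *src)
{
	void *p;

	assert(slot < HP_NR_SLOTS);
	for (;;) {
		p = atomic_load(src);
		/* Default seq_cst ordering: the store is ordered before the re-load. */
		atomic_store(&hp_slots[slot], p);
		if (atomic_load(src) == p)
			return p;
	}
}

/* Reader side: stop protecting whatever the slot currently holds. */
void hp_release(unsigned int slot)
{
	assert(slot < HP_NR_SLOTS);
	atomic_store(&hp_slots[slot], (void *)NULL);
}

/*
 * Updater side: after unpublishing a non-NULL object p, scan all slots.
 * The object can only be reclaimed once no slot refers to it.
 */
bool hp_is_protected(void *p)
{
	for (unsigned int i = 0; i < HP_NR_SLOTS; i++)
		if (atomic_load(&hp_slots[i]) == p)
			return true;
	return false;
}
```

The point of the scheme is that readers only write to their own slot, rather than all of them incrementing a shared reference counter; the cost moves to the updater, which has to scan the slots before reclaiming an object.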

Highlight:

will-it-scale context_switch1_threads

nr threads (-t)     speedup
    24                +3%
    48               +12%
    96               +21%
   192               +28%

I'm curious to see what the build bots have to say about this.

This series applies on top of v6.11.1.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: John Stultz <jstultz@google.com>
Cc: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Zqiang <qiang.zhang1211@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: maged.michael@gmail.com
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
Cc: rcu@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: lkmm@lists.linux.dev

Mathieu Desnoyers (4):
  compiler.h: Introduce ptr_eq() to preserve address dependency
  Documentation: RCU: Refer to ptr_eq()
  hp: Implement Hazard Pointers
  sched+mm: Use hazard pointers to track lazy active mm existence

 Documentation/RCU/rcu_dereference.rst |  38 ++++++-
 Documentation/mm/active_mm.rst        |   9 +-
 arch/Kconfig                          |  32 ------
 arch/powerpc/Kconfig                  |   1 -
 arch/powerpc/mm/book3s64/radix_tlb.c  |  23 +---
 include/linux/compiler.h              |  63 +++++++++++
 include/linux/hp.h                    | 154 ++++++++++++++++++++++++++
 include/linux/mm_types.h              |   3 -
 include/linux/sched/mm.h              |  71 +++++-------
 kernel/Makefile                       |   2 +-
 kernel/exit.c                         |   4 +-
 kernel/fork.c                         |  47 ++------
 kernel/hp.c                           |  46 ++++++++
 kernel/sched/sched.h                  |   8 +-
 lib/Kconfig.debug                     |  10 --
 15 files changed, 346 insertions(+), 165 deletions(-)
 create mode 100644 include/linux/hp.h
 create mode 100644 kernel/hp.c

Comments

Paul E. McKenney Oct. 2, 2024, 2:09 p.m. UTC | #1
On Tue, Oct 01, 2024 at 09:02:01PM -0400, Mathieu Desnoyers wrote:
> Hazard pointers appear to be a good fit for replacing refcount based lazy
> active mm tracking.
> 
> Highlight:
> 
> will-it-scale context_switch1_threads
> 
> nr threads (-t)     speedup
>     24                +3%
>     48               +12%
>     96               +21%
>    192               +28%

Impressive!!!

I have to ask...  Any data for smaller numbers of CPUs?

							Thanx, Paul

Mathieu Desnoyers Oct. 2, 2024, 3:26 p.m. UTC | #2
On 2024-10-02 16:09, Paul E. McKenney wrote:
> On Tue, Oct 01, 2024 at 09:02:01PM -0400, Mathieu Desnoyers wrote:
>> Hazard pointers appear to be a good fit for replacing refcount based lazy
>> active mm tracking.
>>
>> Highlight:
>>
>> will-it-scale context_switch1_threads
>>
>> nr threads (-t)     speedup
>>      24                +3%
>>      48               +12%
>>      96               +21%
>>     192               +28%
> 
> Impressive!!!
> 
> I have to ask...  Any data for smaller numbers of CPUs?

Sure, but they are far less exciting ;-)

nr threads (-t)     speedup
      1                -0.2%
      2                +0.4%
      3                +0.2%
      6                +0.6%
     12                +0.8%
     24                +3%
     48               +12%
     96               +21%
    192               +28%
    384                +4%
    768                -0.6%

Thanks,

Mathieu

Matthew Wilcox Oct. 2, 2024, 3:33 p.m. UTC | #3
On Wed, Oct 02, 2024 at 11:26:27AM -0400, Mathieu Desnoyers wrote:
> On 2024-10-02 16:09, Paul E. McKenney wrote:
> > On Tue, Oct 01, 2024 at 09:02:01PM -0400, Mathieu Desnoyers wrote:
> > > Hazard pointers appear to be a good fit for replacing refcount based lazy
> > > active mm tracking.
> > > 
> > > Highlight:
> > > 
> > > will-it-scale context_switch1_threads
> > > 
> > > nr threads (-t)     speedup
> > >      24                +3%
> > >      48               +12%
> > >      96               +21%
> > >     192               +28%
> > 
> > Impressive!!!
> > 
> > I have to ask...  Any data for smaller numbers of CPUs?
> 
> Sure, but they are far less exciting ;-)

How many CPUs in the system under test?

Mathieu Desnoyers Oct. 2, 2024, 3:36 p.m. UTC | #4
On 2024-10-02 17:33, Matthew Wilcox wrote:
> How many CPUs in the system under test?

2 sockets, 96 cores per socket:

CPU(s):                   384
   On-line CPU(s) list:    0-383
Vendor ID:                AuthenticAMD
   Model name:             AMD EPYC 9654 96-Core Processor
     CPU family:           25
     Model:                17
     Thread(s) per core:   2
     Core(s) per socket:   96
     Socket(s):            2
     Stepping:             1
     Frequency boost:      enabled
     CPU(s) scaling MHz:   68%
     CPU max MHz:          3709.0000
     CPU min MHz:          400.0000
     BogoMIPS:             4800.00
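
As a sanity check on the topology above, the logical CPU count is sockets times cores per socket times threads per core. A trivial sketch of that arithmetic (the helper name `logical_cpus()` is made up for illustration):

```c
#include <assert.h>

/* Hypothetical helper: logical CPUs = sockets * cores/socket * threads/core. */
unsigned int logical_cpus(unsigned int sockets,
			  unsigned int cores_per_socket,
			  unsigned int threads_per_core)
{
	assert(sockets && cores_per_socket && threads_per_core);
	return sockets * cores_per_socket * threads_per_core;
}
```

Calling `logical_cpus(2, 96, 2)` with the values from the listing gives 384, matching the `CPU(s)` line, so the 384-thread run is the one with a single benchmark thread per hardware thread.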

Note that Jens Axboe got even more impressive speedups testing this
on his 512-hw-thread EPYC [1] (390% speedup for 192 threads). I've
noticed I had schedstats and sched debug enabled in my config, so I'll 
have to re-run my tests.

Thanks,

Mathieu

[1] https://discuss.systems/@axboe@fosstodon.org/113238297041686326

Mathieu Desnoyers Oct. 2, 2024, 3:53 p.m. UTC | #5
On 2024-10-02 17:36, Mathieu Desnoyers wrote:
> Note that Jens Axboe got even more impressive speedups testing this
> on his 512-hw-thread EPYC [1] (390% speedup for 192 threads). I've
> noticed I had schedstats and sched debug enabled in my config, so I'll 
> have to re-run my tests.

A quick re-run of the 128-thread case with schedstats and sched debug
disabled still shows a speedup of around 26%, similar to my prior numbers.

I'm not sure why Jens has much better speedups on a similar system.

I'm attaching my config in case someone spots anything obvious. Note
that my BIOS is configured to show 24 NUMA nodes to the kernel (one
NUMA node per core complex).

Thanks,

Mathieu

Jens Axboe Oct. 2, 2024, 3:58 p.m. UTC | #6
On 10/2/24 9:53 AM, Mathieu Desnoyers wrote:
> A quick re-run of the 128-thread case with schedstats and sched debug
> disabled still shows a speedup of around 26%, similar to my prior numbers.
> 
> I'm not sure why Jens has much better speedups on a similar system.
> 
> I'm attaching my config in case someone spots anything obvious. Note
> that my BIOS is configured to show 24 NUMA nodes to the kernel (one
> NUMA node per core complex).

Here's my .config - note it's from the stock kernel run, which is why it
still has:

CONFIG_MMU_LAZY_TLB_REFCOUNT=y

set. I have the same NUMA configuration as you, but end up with 32 nodes
on this box.
Mathieu Desnoyers Oct. 2, 2024, 4:02 p.m. UTC | #7
On 2024-10-02 17:58, Jens Axboe wrote:
> Here's my .config - note it's from the stock kernel run, which is why it
> still has:
> 
> CONFIG_MMU_LAZY_TLB_REFCOUNT=y
> 
> set. I have the same NUMA configuration as you, but end up with 32 nodes
> on this box.

Just to make sure: did you use any other command line options when starting
the test program (other than -t N)?

Thanks,

Mathieu
Jens Axboe Oct. 2, 2024, 4:14 p.m. UTC | #8
On 10/2/24 10:02 AM, Mathieu Desnoyers wrote:
> On 2024-10-02 17:58, Jens Axboe wrote:
>> On 10/2/24 9:53 AM, Mathieu Desnoyers wrote:
>>> On 2024-10-02 17:36, Mathieu Desnoyers wrote:
>>>> On 2024-10-02 17:33, Matthew Wilcox wrote:
>>>>> On Wed, Oct 02, 2024 at 11:26:27AM -0400, Mathieu Desnoyers wrote:
>>>>>> On 2024-10-02 16:09, Paul E. McKenney wrote:
>>>>>>> On Tue, Oct 01, 2024 at 09:02:01PM -0400, Mathieu Desnoyers wrote:
>>>>>>>> Hazard pointers appear to be a good fit for replacing refcount based lazy
>>>>>>>> active mm tracking.
>>>>>>>>
>>>>>>>> Highlight:
>>>>>>>>
>>>>>>>> will-it-scale context_switch1_threads
>>>>>>>>
>>>>>>>> nr threads (-t)     speedup
>>>>>>>>        24                +3%
>>>>>>>>        48               +12%
>>>>>>>>        96               +21%
>>>>>>>>       192               +28%
>>>>>>>
>>>>>>> Impressive!!!
>>>>>>>
>>>>>>> I have to ask...  Any data for smaller numbers of CPUs?
>>>>>>
>>>>>> Sure, but they are far less exciting ;-)
>>>>>
>>>>> How many CPUs in the system under test?
>>>>
>>>> 2 sockets, 96-core per socket:
>>>>
>>>> CPU(s):                   384
>>>>     On-line CPU(s) list:    0-383
>>>> Vendor ID:                AuthenticAMD
>>>>     Model name:             AMD EPYC 9654 96-Core Processor
>>>>       CPU family:           25
>>>>       Model:                17
>>>>       Thread(s) per core:   2
>>>>       Core(s) per socket:   96
>>>>       Socket(s):            2
>>>>       Stepping:             1
>>>>       Frequency boost:      enabled
>>>>       CPU(s) scaling MHz:   68%
>>>>       CPU max MHz:          3709.0000
>>>>       CPU min MHz:          400.0000
>>>>       BogoMIPS:             4800.00
>>>>
>>>> Note that Jens Axboe got even more impressive speedups testing this
>>>> on his 512-hw-thread EPYC [1] (390% speedup for 192 threads). I've
>>>> noticed I had schedstats and sched debug enabled in my config, so I'll have to re-run my tests.
>>>
>>> A quick re-run of the 128-thread case with schedstats and sched debug
>>> disabled still show around 26% speedup, similar to my prior numbers.
>>>
>>> I'm not sure why Jens has much better speedups on a similar system.
>>>
>>> I'm attaching my config in case someone spots anything obvious. Note
>>> that my BIOS is configured to show 24 NUMA nodes to the kernel (one
>>> NUMA node per core complex).
>>
>> Here's my .config - note it's from the stock kernel run, which is why it
>> still has:
>>
>> CONFIG_MMU_LAZY_TLB_REFCOUNT=y
>>
>> set. I have the same NUMA configuration as you, but end up with 32
>> nodes on this box.
> 
> Just to make sure: did you use other command line options when starting
> the test program (other than -t N)?

I did not, this is literally what I ran:

for i in 24 48 96 192 256 512 1024 2048; do echo $i threads; timeout -s INT -k 30 30 ./context_switch1_threads -t $i; done

and the numbers I got were very stable between runs and reboots.
Linus Torvalds Oct. 2, 2024, 5:39 p.m. UTC | #9
On Tue, 1 Oct 2024 at 18:04, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> Hazard pointers appear to be a good fit for replacing refcount based lazy
> active mm tracking.

If the mm refcount is this expensive, I suspect we really shouldn't
use it at all.

The thing is, we don't _need_ to use the mm refcount - the reason the
lazy-tlb handling uses it is because we already had that refcount and
it was easy to extend on existing logic, not because it's really
required any more.

The lazy-tlb activation is basically "I'm switching to a kernel
thread, so I'll re-use the TLB state of the previous thread".

(And yes, it also has a secondary case of "I'm exiting, so I will turn
the mm I already have into a lazy one").

But in the actual task switch case, the previous thread hasn't _lost_
that mm, so we don't actually need to take the refcount at all.

We really just need to make sure to invalidate it before it's torn
down, but we do that *anyway* as part of TLB flushing.

(The exit case is actually different: we are setting it up to be lost,
although delayed - and the lazy count is the delay).
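
To make the cost concrete, here is a minimal userspace sketch of the
refcount-based scheme described above. It is a toy model, not kernel
code: 'struct mm', 'struct task', toy_context_switch() and the write
counter are hypothetical stand-ins, and the real scheduler spreads this
work over more than one function.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; not the real definitions. */
struct mm {
	atomic_int count;		/* models the mm reference count */
};

struct task {
	struct mm *mm;			/* NULL for a kernel thread */
	struct mm *active_mm;		/* mm whose page tables are loaded */
};

/* Hypothetical counter: atomic read-modify-writes on the shared count. */
static int toy_refcount_writes;

/* Switch from prev to next on one CPU with refcounted lazy mm tracking. */
static void toy_context_switch(struct task *prev, struct task *next)
{
	if (next->mm == NULL) {
		/* Kernel thread: re-use the previous task's address space. */
		next->active_mm = prev->active_mm;
		if (prev->mm != NULL) {
			/* prev keeps its mm, yet we still take a reference. */
			atomic_fetch_add(&next->active_mm->count, 1);
			toy_refcount_writes++;
		} else {
			/* kthread -> kthread: hand the lazy reference over. */
			prev->active_mm = NULL;
		}
	} else {
		next->active_mm = next->mm;
		if (prev->mm == NULL) {
			/* Leaving a kernel thread: drop its lazy reference. */
			atomic_fetch_sub(&prev->active_mm->count, 1);
			toy_refcount_writes++;
			prev->active_mm = NULL;
		}
	}
}

/* One user -> kthread -> user round trip; returns the atomic op count. */
static int toy_refcount_demo(void)
{
	struct mm m;
	struct task user = { &m, &m };
	struct task kthread = { NULL, NULL };

	atomic_init(&m.count, 1);
	toy_refcount_writes = 0;

	toy_context_switch(&user, &kthread);
	toy_context_switch(&kthread, &user);

	/* The lazy reference was taken and dropped again. */
	assert(atomic_load(&m.count) == 1);
	return toy_refcount_writes;
}
```

In this model every round trip through a kernel thread costs two atomic
updates on a counter shared by all threads of the process, even though
the previous thread never gave up the mm in between.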

The only thing the refcount means is that we don't actually have to be
as careful when we actually *really* get rid of the MM. We can be a
bit laissez-faire about things because even if we weren't to
invalidate the lazy mm, it does have its own refcount, so we don't
much care.

But in reality, we're actually very careful about the active_mm
_anyway_, because of a fairly fundamental issue: the TLB shootdown and
PCID handling that we need to do even when mm's aren't lazy.

So we actually keep track of things like "which CPUs have seen this
MM state" in all the TLB code.

And even the exit case doesn't actually need the special thing - it
*does* need the "this CPU is still using this MM", but we have that
too as part of the TLB code - entirely independently of 'active_mm'.

So in many ways, I'm pretty sure not just the refcount, but all of
'active_mm', is largely pointless to begin with.

And if the refcount really is this big of a deal:

> nr threads (-t)     speedup
>    192               +28%

then we should probably just strive to get rid of 'active_mm' altogether.

Look, at least on x86 we ALREADY have a better replacement: it's the
percpu 'cpu_tlbstate'.

It basically duplicates all we do with active_mm and the whole "keep
track of old mm state" (the 'loaded_mm' member is basically the true
'active' mm), except it has some additional fixes:

 - it has some extra housekeeping data that the architecture wants
(for PCID updates etc)

 - it's actually atomic wrt the low-level code in ways that
'current->active_mm' isn't

So I think the real issue is that "active_mm" is an old hack from a
bygone era when we didn't have the (much more involved) full TLB
tracking.
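
As a rough illustration of that direction, here is a single-threaded toy
model that tracks the loaded mm per CPU and takes no reference at all
when switching to a kernel thread. All names (toy_loaded_mm,
toy_mm_teardown() and so on) are made up for the sketch. It is only
loosely modelled on the 'loaded_mm' member of 'cpu_tlbstate', and it
ignores the IPIs and memory ordering a real implementation would need.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_CPUS 4

/* Toy stand-in for an address space; not the kernel's mm_struct. */
struct mm {
	int id;
};

/* Models the always-present kernel address space. */
static struct mm toy_init_mm = { 0 };

/* One slot per CPU: which mm that CPU currently has loaded. */
static struct mm *toy_loaded_mm[TOY_NR_CPUS];

/* Switching to a user task loads its mm on this CPU. */
static void toy_switch_to_user(int cpu, struct mm *next)
{
	toy_loaded_mm[cpu] = next;
}

/* Switching to a kernel thread keeps the old mm loaded: no shared write. */
static void toy_switch_to_kthread(int cpu)
{
	(void)cpu;
}

/*
 * Before freeing an mm, move every CPU that still has it loaded over to
 * toy_init_mm. Returns how many CPUs had to be moved.
 */
static int toy_mm_teardown(struct mm *mm)
{
	int cpu, moved = 0;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		if (toy_loaded_mm[cpu] == mm) {
			toy_loaded_mm[cpu] = &toy_init_mm;
			moved++;
		}
	}
	return moved;
}

/* CPU 0 stays lazily on 'a'; CPU 1 has moved on to 'b' before teardown. */
static int toy_tlbstate_demo(void)
{
	struct mm a = { 1 }, b = { 2 };
	int cpu, moved;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		toy_loaded_mm[cpu] = &toy_init_mm;

	toy_switch_to_user(0, &a);
	toy_switch_to_kthread(0);	/* lazy: 'a' stays loaded on CPU 0 */
	toy_switch_to_user(1, &a);
	toy_switch_to_user(1, &b);

	moved = toy_mm_teardown(&a);
	assert(toy_loaded_mm[0] == &toy_init_mm);

	/* Reset so no slot points at this function's locals afterwards. */
	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		toy_loaded_mm[cpu] = &toy_init_mm;
	return moved;
}
```

The trade-off in this model is that the context switch path touches only
per-CPU state, while the (much rarer) teardown path pays for a scan over
all CPUs.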

               Linus
Peter Zijlstra Oct. 5, 2024, 4:15 p.m. UTC | #10
On Wed, Oct 02, 2024 at 10:39:15AM -0700, Linus Torvalds wrote:
> So I think the real issue is that "active_mm" is an old hack from a
> bygone era when we didn't have the (much more involved) full TLB
> tracking.

I still seem to have these patches that neither Andy nor I ever managed
to find time to finish:

  https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=x86/lazy
Linus Torvalds Oct. 5, 2024, 4:56 p.m. UTC | #11
On Sat, 5 Oct 2024 at 09:16, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Oct 02, 2024 at 10:39:15AM -0700, Linus Torvalds wrote:
> > So I think the real issue is that "active_mm" is an old hack from a
> > bygone era when we didn't have the (much more involved) full TLB
> > tracking.
>
> I still seem to have these patches that neither Andy nor I ever managed
> to find time to finish:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=x86/lazy

Yes, that looks very much like what I had in mind.

In fact, it looks a lot smaller and simpler than what my mental model was.

I was thinking I'd do it by removing "active_mm" entirely from 'struct
task_struct', and turn it into a per-cpu variable instead, and then
try to massage that into some global new world order. That patch
series you point to seems to be much simpler and clearer.

Of course, you also say "never managed to finish", so presumably
there's something completely broken in that series, and it doesn't
actually work?

                   Linus
Peter Zijlstra Oct. 7, 2024, 7:06 a.m. UTC | #12
On Sat, Oct 05, 2024 at 09:56:24AM -0700, Linus Torvalds wrote:
> On Sat, 5 Oct 2024 at 09:16, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Oct 02, 2024 at 10:39:15AM -0700, Linus Torvalds wrote:
> > > So I think the real issue is that "active_mm" is an old hack from a
> > > bygone era when we didn't have the (much more involved) full TLB
> > > tracking.
> >
> > I still seem to have these patches that neither Andy nor I ever managed
> > to find time to finish:
> >
> >   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=x86/lazy
> 
> Yes, that looks very much like what I had in mind.
> 
> In fact, it looks a lot smaller and simpler than what my mental model was.
> 
> I was thinking I'd do it by removing "active_mm" entirely from 'struct
> task_struct', and turn it into a per-cpu variable instead, and then
> try to massage that into some global new world order. That patch
> series you point to seems to be much simpler and clearer.
> 
> Of course, you also say "never managed to finish", so presumably
> there's something completely broken in that series, and it doesn't
> actually work?

Last time I tried it, it worked fine. I just didn't get around to
actually fully thinking it through and making sure nothing subtle was
broken etc. Pesky details and such..