Message ID | 20241002010205.1341915-1-mathieu.desnoyers@efficios.com (mailing list archive) |
---|---|
Series | sched+mm: Track lazy active mm existence with hazard pointers |
On Tue, Oct 01, 2024 at 09:02:01PM -0400, Mathieu Desnoyers wrote:
> Hazard pointers appear to be a good fit for replacing refcount based lazy
> active mm tracking.
>
> Highlight:
>
> will-it-scale context_switch1_threads
>
> nr threads (-t)     speedup
>      24                +3%
>      48               +12%
>      96               +21%
>     192               +28%

Impressive!!!

I have to ask...  Any data for smaller numbers of CPUs?

							Thanx, Paul
On 2024-10-02 16:09, Paul E. McKenney wrote:
> On Tue, Oct 01, 2024 at 09:02:01PM -0400, Mathieu Desnoyers wrote:
>> Hazard pointers appear to be a good fit for replacing refcount based lazy
>> active mm tracking.
>>
>> Highlight:
>>
>> will-it-scale context_switch1_threads
>>
>> nr threads (-t)     speedup
>>      24                +3%
>>      48               +12%
>>      96               +21%
>>     192               +28%
>
> Impressive!!!
>
> I have to ask...  Any data for smaller numbers of CPUs?

Sure, but they are far less exciting ;-)

nr threads (-t)     speedup
      1               -0.2%
      2               +0.4%
      3               +0.2%
      6               +0.6%
     12               +0.8%
     24                 +3%
     48                +12%
     96                +21%
    192                +28%
    384                 +4%
    768               -0.6%

Thanks,

Mathieu
On Wed, Oct 02, 2024 at 11:26:27AM -0400, Mathieu Desnoyers wrote:
> On 2024-10-02 16:09, Paul E. McKenney wrote:
> > I have to ask...  Any data for smaller numbers of CPUs?
>
> Sure, but they are far less exciting ;-)

How many CPUs in the system under test?

> nr threads (-t)     speedup
>       1               -0.2%
>       2               +0.4%
>       3               +0.2%
>       6               +0.6%
>      12               +0.8%
>      24                 +3%
>      48                +12%
>      96                +21%
>     192                +28%
>     384                 +4%
>     768               -0.6%
On 2024-10-02 17:33, Matthew Wilcox wrote:
> On Wed, Oct 02, 2024 at 11:26:27AM -0400, Mathieu Desnoyers wrote:
>> Sure, but they are far less exciting ;-)
>
> How many CPUs in the system under test?

2 sockets, 96-core per socket:

CPU(s):                384
On-line CPU(s) list:   0-383
Vendor ID:             AuthenticAMD
Model name:            AMD EPYC 9654 96-Core Processor
CPU family:            25
Model:                 17
Thread(s) per core:    2
Core(s) per socket:    96
Socket(s):             2
Stepping:              1
Frequency boost:       enabled
CPU(s) scaling MHz:    68%
CPU max MHz:           3709.0000
CPU min MHz:           400.0000
BogoMIPS:              4800.00

Note that Jens Axboe got even more impressive speedups testing this
on his 512-hw-thread EPYC [1] (390% speedup for 192 threads). I've
noticed I had schedstats and sched debug enabled in my config, so I'll
have to re-run my tests.

Thanks,

Mathieu

[1] https://discuss.systems/@axboe@fosstodon.org/113238297041686326
On 2024-10-02 17:36, Mathieu Desnoyers wrote:
> Note that Jens Axboe got even more impressive speedups testing this
> on his 512-hw-thread EPYC [1] (390% speedup for 192 threads). I've
> noticed I had schedstats and sched debug enabled in my config, so I'll
> have to re-run my tests.
>
> [1] https://discuss.systems/@axboe@fosstodon.org/113238297041686326

A quick re-run of the 128-thread case with schedstats and sched debug
disabled still shows around 26% speedup, similar to my prior numbers.

I'm not sure why Jens has much better speedups on a similar system.

I'm attaching my config in case someone spots anything obvious. Note
that my BIOS is configured to show 24 NUMA nodes to the kernel (one
NUMA node per core complex).

Thanks,

Mathieu
On 10/2/24 9:53 AM, Mathieu Desnoyers wrote:
> A quick re-run of the 128-thread case with schedstats and sched debug
> disabled still shows around 26% speedup, similar to my prior numbers.
>
> I'm not sure why Jens has much better speedups on a similar system.
>
> I'm attaching my config in case someone spots anything obvious. Note
> that my BIOS is configured to show 24 NUMA nodes to the kernel (one
> NUMA node per core complex).

Here's my .config - note it's from the stock kernel run, which is why it
still has:

CONFIG_MMU_LAZY_TLB_REFCOUNT=y

set. Have the same numa configuration as you, just end up with 32 nodes
on this box.
On 2024-10-02 17:58, Jens Axboe wrote:
> Here's my .config - note it's from the stock kernel run, which is why it
> still has:
>
> CONFIG_MMU_LAZY_TLB_REFCOUNT=y
>
> set. Have the same numa configuration as you, just end up with 32 nodes
> on this box.

Just to make sure: did you use other command line options when starting
the test program (other than -t N)?

Thanks,

Mathieu
On 10/2/24 10:02 AM, Mathieu Desnoyers wrote:
> Just to make sure: did you use other command line options when starting
> the test program (other than -t N)?

I did not, this is literally what I ran:

for i in 24 48 96 192 256 512 1024 2048; do echo $i threads; timeout -s INT -k 30 30 ./context_switch1_threads -t $i; done

and the numbers I got were very stable between runs and reboots.
On Tue, 1 Oct 2024 at 18:04, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> Hazard pointers appear to be a good fit for replacing refcount based lazy
> active mm tracking.

If the mm refcount is this expensive, I suspect we really shouldn't
use it at all.

The thing is, we don't _need_ to use the mm refcount - the reason the
lazy-tlb handling uses it is because we already had that refcount and
it was easy to extend on existing logic, not because it's really
required any more.

The lazy-tlb activation is basically "I'm switching to a kernel
thread, so I'll re-use the TLB state of the previous thread".

(And yes, it also has a secondary case of "I'm exiting, so I will turn
the mm I already have into a lazy one").

But in the actual task switch case, the previous thread hasn't _lost_
that mm, so we don't actually need to take the refcount at all. We
really just need to make sure to invalidate it before it's torn down,
but we do that *anyway* as part of TLB flushing.

(The exit case is actually different: we are setting it up to be lost,
although delayed - and the lazy count is the delay).

The only thing the refcount means is that we don't actually have to be
as careful when we actually *really* get rid of the MM. We can be a
bit laissez-faire about things because even if we weren't to
invalidate the lazy mm, it does have its own refcount, so we don't
much care.

But in reality, we're actually very careful about the active_mm
_anyway_, because of a fairly fundamental issue: the TLB shootdown and
PCID handling that we need to do even when mm's aren't lazy. So we
actually keep track of things like "which CPU's have seen this MM
state" in all the TLB code.

And even the exit case doesn't actually need the special thing - it
*does* need the "this CPU is still using this MM", but we have that
too as part of the TLB code - entirely independently of 'active_mm'.

So in many ways, I'm pretty sure not just the refcount, but all of
'active_mm', is largely pointless to begin with.

And if the refcount really is this big of a deal:

> nr threads (-t)     speedup
>     192               +28%

then we should probably just strive to get rid of 'active_mm' altogether.

Look, at least on x86 we ALREADY have a better replacement: it's the
percpu 'cpu_tlbstate'. It basically duplicates all we do with
active_mm and the whole "keep track of old mm state" (the 'loaded_mm'
member is basically the true 'active' mm), except it has some
additional fixes:

 - it has some extra housekeeping data that the architecture wants
   (for PCID updates etc)

 - it's actually atomic wrt the low-level code in ways that
   'current->active_mm' isn't

So I think the real issue is that "active_mm" is an old hack from a
bygone era when we didn't have the (much more involved) full TLB
tracking.

              Linus
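
To make the refcounted handoff described above concrete, here is a rough user-space model in C. It is only a sketch, not the kernel's context_switch(): toy_mm, toy_task and toy_context_switch() are made-up names, and everything about TLB state, exit and locking is left out. What it shows is that each switch from a user task into a kernel thread, and each switch back out, does an atomic read-modify-write on a counter shared by every CPU running threads of that mm.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the address space and its lazy refcount. */
struct toy_mm {
	atomic_int mm_count;	/* shared by every CPU that uses this mm lazily */
};

/* Hypothetical stand-in for a task. */
struct toy_task {
	struct toy_mm *mm;		/* NULL for a kernel thread */
	struct toy_mm *active_mm;	/* mm whose page tables are loaded */
};

/* Switch from prev to next on one CPU, tracking the lazy mm by refcount. */
static void toy_context_switch(struct toy_task *prev, struct toy_task *next)
{
	if (!next->mm) {
		/* To a kernel thread: borrow whatever address space is loaded. */
		next->active_mm = prev->active_mm;
		if (prev->mm)
			atomic_fetch_add(&next->active_mm->mm_count, 1);
		else
			prev->active_mm = NULL;	/* pass prev's lazy reference on */
	} else {
		/* To a user task: it runs on its own mm. */
		next->active_mm = next->mm;
		if (!prev->mm) {
			/* From a kernel thread: give back the borrowed mm. */
			atomic_fetch_sub(&prev->active_mm->mm_count, 1);
			prev->active_mm = NULL;
		}
	}
}
```

In this model a kernel-thread-to-kernel-thread switch just passes the borrowed reference along and leaves the counter alone; only the user/kernel boundary touches the shared counter, and that is the traffic a refcount-free scheme would avoid.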
On Wed, Oct 02, 2024 at 10:39:15AM -0700, Linus Torvalds wrote:

> So I think the real issue is that "active_mm" is an old hack from a
> bygone era when we didn't have the (much more involved) full TLB
> tracking.

I still seem to have these patches that neither Andy nor I ever managed
to find time to finish:

  https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=x86/lazy
On Sat, 5 Oct 2024 at 09:16, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Oct 02, 2024 at 10:39:15AM -0700, Linus Torvalds wrote:
> > So I think the real issue is that "active_mm" is an old hack from a
> > bygone era when we didn't have the (much more involved) full TLB
> > tracking.
>
> I still seem to have these patches that neither Andy nor I ever managed
> to find time to finish:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=x86/lazy

Yes, that looks very much like what I had in mind.

In fact, it looks a lot smaller and simpler than what my mental model was.

I was thinking I'd do it by removing "active_mm" entirely from 'struct
task_struct', and turn it into a per-cpu variable instead, and then
try to massage that into some global new world order. That patch
series you point to seems to be much simpler and clearer.

Of course, you also say "never managed to finish", so presumably
there's something completely broken in that series, and it doesn't
actually work?

              Linus
On Sat, Oct 05, 2024 at 09:56:24AM -0700, Linus Torvalds wrote:
> Yes, that looks very much like what I had in mind.
>
> In fact, it looks a lot smaller and simpler than what my mental model was.
>
> I was thinking I'd do it by removing "active_mm" entirely from 'struct
> task_struct', and turn it into a per-cpu variable instead, and then
> try to massage that into some global new world order. That patch
> series you point to seems to be much simpler and clearer.
>
> Of course, you also say "never managed to finish", so presumably
> there's something completely broken in that series, and it doesn't
> actually work?

Last time I tried it, it worked fine. I just didn't get around to
actually fully thinking it through and making sure nothing subtle was
broken etc. Pesky details and such..
Hazard pointers appear to be a good fit for replacing refcount based lazy
active mm tracking.

Highlight:

will-it-scale context_switch1_threads

nr threads (-t)     speedup
     24                +3%
     48               +12%
     96               +21%
    192               +28%

I'm curious to see what the build bots have to say about this.

This series applies on top of v6.11.1.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: John Stultz <jstultz@google.com>
Cc: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Zqiang <qiang.zhang1211@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: maged.michael@gmail.com
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
Cc: rcu@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: lkmm@lists.linux.dev

Mathieu Desnoyers (4):
  compiler.h: Introduce ptr_eq() to preserve address dependency
  Documentation: RCU: Refer to ptr_eq()
  hp: Implement Hazard Pointers
  sched+mm: Use hazard pointers to track lazy active mm existence

 Documentation/RCU/rcu_dereference.rst |  38 ++++++-
 Documentation/mm/active_mm.rst        |   9 +-
 arch/Kconfig                          |  32 ------
 arch/powerpc/Kconfig                  |   1 -
 arch/powerpc/mm/book3s64/radix_tlb.c  |  23 +---
 include/linux/compiler.h              |  63 +++++++++++
 include/linux/hp.h                    | 154 ++++++++++++++++++++++++++
 include/linux/mm_types.h              |   3 -
 include/linux/sched/mm.h              |  71 +++++-------
 kernel/Makefile                       |   2 +-
 kernel/exit.c                         |   4 +-
 kernel/fork.c                         |  47 ++------
 kernel/hp.c                           |  46 ++++++++
 kernel/sched/sched.h                  |   8 +-
 lib/Kconfig.debug                     |  10 --
 15 files changed, 346 insertions(+), 165 deletions(-)
 create mode 100644 include/linux/hp.h
 create mode 100644 kernel/hp.c
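
For readers unfamiliar with the technique named in the cover letter, here is a minimal user-space sketch of hazard pointers using C11 atomics. It is not the hp.h API from this series, which is not shown in this thread; NR_SLOTS, hp_acquire(), hp_release(), hp_is_protected() and hp_unpublish_and_wait() are names assumed purely for illustration, and the sketch uses sequentially consistent atomics throughout rather than anything tuned.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_SLOTS 4	/* assumed: one slot per reader (per CPU in a kernel) */

/* One hazard pointer slot per reader; NULL means "protecting nothing". */
static _Atomic(void *) hp_slots[NR_SLOTS];

/*
 * Reader side: publish the pointer currently stored in *src into our slot,
 * then re-check *src. If it still holds the same pointer, any writer that
 * unpublishes it afterwards is guaranteed to see our slot.
 */
static void *hp_acquire(int slot, _Atomic(void *) *src)
{
	for (;;) {
		void *p = atomic_load(src);

		if (!p)
			return NULL;
		atomic_store(&hp_slots[slot], p);	/* seq_cst: ordered before the re-load */
		if (atomic_load(src) == p)
			return p;
		atomic_store(&hp_slots[slot], NULL);	/* lost a race, retry */
	}
}

/* Reader side: stop protecting the object. */
static void hp_release(int slot)
{
	atomic_store_explicit(&hp_slots[slot], NULL, memory_order_release);
}

/* Return true if any reader slot currently refers to p. */
static bool hp_is_protected(void *p)
{
	for (int i = 0; i < NR_SLOTS; i++)
		if (atomic_load(&hp_slots[i]) == p)
			return true;
	return false;
}

/* Writer side: unpublish, then wait until no slot refers to the old object. */
static void *hp_unpublish_and_wait(_Atomic(void *) *src)
{
	void *old = atomic_exchange(src, NULL);

	while (old && hp_is_protected(old))
		;	/* a real implementation would yield or relax the CPU here */
	return old;	/* the caller may now free it */
}
```

The point of the design is that a reader only ever writes to its own slot, never to a counter inside the shared object, so readers on different CPUs do not contend with each other; the cost moves to the writer, which has to scan every slot before reclaiming.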