[0/5] srcu fixes

Message ID: 20231003232903.7109-1-frederic@kernel.org

Message

Frederic Weisbecker Oct. 3, 2023, 11:28 p.m. UTC
Hi,

This contains a fix for "SRCU: kworker hung in synchronize_srcu":

	http://lore.kernel.org/CANZk6aR+CqZaqmMWrC2eRRPY12qAZnDZLwLnHZbNi=xXMB401g@mail.gmail.com

And a few cleanups.

Passed 50 hours of SRCU-P and SRCU-N.

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
	srcu/fixes

HEAD: 7ea5adc5673b42ef06e811dca75e43d558cc87e0
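
For local testing, the branch can be fetched with plain git (standard
usage; the local branch name below is arbitrary):

	git fetch git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git srcu/fixes
	git checkout -b srcu-fixes FETCH_HEAD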

Thanks,
	Frederic
---

Frederic Weisbecker (5):
      srcu: Fix callbacks acceleration mishandling
      srcu: Only accelerate on enqueue time
      srcu: Remove superfluous callbacks advancing from srcu_start_gp()
      srcu: No need to advance/accelerate if no callback enqueued
      srcu: Explain why callbacks invocations can't run concurrently


 kernel/rcu/srcutree.c | 55 ++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 39 insertions(+), 16 deletions(-)

Comments

Paul E. McKenney Oct. 4, 2023, 12:35 a.m. UTC | #1
On Wed, Oct 04, 2023 at 01:28:58AM +0200, Frederic Weisbecker wrote:
> Hi,
> 
> This contains a fix for "SRCU: kworker hung in synchronize_srcu":
> 
> 	http://lore.kernel.org/CANZk6aR+CqZaqmMWrC2eRRPY12qAZnDZLwLnHZbNi=xXMB401g@mail.gmail.com
> 
> And a few cleanups.
> 
> Passed 50 hours of SRCU-P and SRCU-N.
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> 	srcu/fixes
> 
> HEAD: 7ea5adc5673b42ef06e811dca75e43d558cc87e0
> 
> Thanks,
> 	Frederic

Very good, and a big "Thank You!!!" to all of you!

I queued this series for testing purposes, and have started a bunch of
SRCU-P and SRCU-N tests on one set of systems, and a single SRCU-P and
SRCU-N on another system, but with both scenarios resized to 40 CPUs each.

While that is in flight, a few questions:

o	Please check the Co-developed-by rules.  Last I knew, it was
	necessary to have a Signed-off-by after each Co-developed-by.

o	Is it possible to get a Tested-by from the original reporter?
	Or is this not reproducible?

o	Is it possible to convince rcutorture to find this sort of
	bug?  Seems like it should be, but easy to say...

o	Frederic, would you like to include this in your upcoming
	pull request?  Or does it need more time?

						Thanx, Paul

Paul E. McKenney Oct. 4, 2023, 3:21 a.m. UTC | #2
On Tue, Oct 03, 2023 at 05:35:31PM -0700, Paul E. McKenney wrote:
> Very good, and a big "Thank You!!!" to all of you!
> 
> I queued this series for testing purposes, and have started a bunch of
> SRCU-P and SRCU-N tests on one set of systems, and a single SRCU-P and
> SRCU-N on another system, but with both scenarios resized to 40 CPUs each.
> 
> While that is in flight, a few questions:
> 
> o	Please check the Co-developed-by rules.  Last I knew, it was
> 	necessary to have a Signed-off-by after each Co-developed-by.
> 
> o	Is it possible to get a Tested-by from the original reporter?
> 	Or is this not reproducible?
> 
> o	Is it possible to convince rcutorture to find this sort of
> 	bug?  Seems like it should be, but easy to say...

And one other thing...

o	What other bugs like this one are hiding elsewhere
	in RCU?

> o	Frederic, would you like to include this in your upcoming
> 	pull request?  Or does it need more time?

						Thanx, Paul

Paul E. McKenney Oct. 4, 2023, 3:30 a.m. UTC | #3
On Tue, Oct 03, 2023 at 08:21:42PM -0700, Paul E. McKenney wrote:
> On Tue, Oct 03, 2023 at 05:35:31PM -0700, Paul E. McKenney wrote:
> > I queued this series for testing purposes, and have started a bunch of
> > SRCU-P and SRCU-N tests on one set of systems, and a single SRCU-P and
> > SRCU-N on another system, but with both scenarios resized to 40 CPUs each.

The 200*1h of SRCU-N and the 100*1h of SRCU-P passed, other than the usual
tick-stop errors.  (Is there a patch for that one?)  The 40-CPU SRCU-N
run was fine, but the 40-CPU SRCU-P run failed due to the fanouts setting
a maximum of 16 CPUs.  So I started a 10-hour 40-CPU SRCU-P and a pair
of 10-hour 16-CPU SRCU-N runs on one system, and 200*10h of SRCU-N and
100*10h of SRCU-P.

I will let you know how it goes.

							Thanx, Paul

Frederic Weisbecker Oct. 4, 2023, 9:25 a.m. UTC | #4
On Tue, Oct 03, 2023 at 05:35:31PM -0700, Paul E. McKenney wrote:
> While that is in flight, a few questions:
> 
> o	Please check the Co-developed-by rules.  Last I knew, it was
> 	necessary to have a Signed-off-by after each Co-developed-by.

Indeed! I'll try to collect the three of them within a few days. If some
are missing, I'll put a Reported-by instead.
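
For reference, the rule in Documentation/process/submitting-patches.rst
is that each Co-developed-by: must be immediately followed by the
co-author's Signed-off-by:, with the submitter's own Signed-off-by:
coming last. With placeholder names, the trailers would look like:

	Co-developed-by: Co D. Veloper <co.developer@example.com>
	Signed-off-by: Co D. Veloper <co.developer@example.com>
	Signed-off-by: Frederic Weisbecker <frederic@kernel.org>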

> 
> o	Is it possible to get a Tested-by from the original reporter?
> 	Or is this not reproducible?

It seems that the issue triggers only rarely. But I hope we can get one.

> 
> o	Is it possible to convince rcutorture to find this sort of
> 	bug?  Seems like it should be, but easy to say...

So at least the part where advance/accelerate fails is observed from time
to time. But then we must also hit two more rare events:

1) The CPU failing to ACC/ADV must also fail to start the grace period because
  another CPU was faster.

2) The callback invocation must not run until that grace period has ended (even
  though a previous one completed with callbacks ready).

  Or it can run after all, but at least its acceleration part has to
  happen after the end of the new grace period.

Perhaps all these conditions can be met more often if we overcommit the number
of vCPUs. For example, run 10 SRCU-P instances on 3 real CPUs. This could
introduce random breaks within the torture writers...

Just an idea...
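
A rough sketch of that idea with rcutorture's kvm.sh (the script lives
in tools/testing/selftests/rcutorture/bin/; claiming more CPUs than the
host really has is one way to get all the guests into a single batch,
though how well it copes with this much overcommit is untested):

	# On a 3-CPU host, claim 80 CPUs so that all ten 8-vCPU SRCU-P
	# guests run concurrently, overcommitting the host and injecting
	# random delays into the torture writers.
	tools/testing/selftests/rcutorture/bin/kvm.sh \
		--cpus 80 --configs "10*SRCU-P" --duration 10h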

> 
> o	Frederic, would you like to include this in your upcoming
> 	pull request?  Or does it need more time?

At least the first patch, yes. It should be easy to backport, and it
should be enough to solve the race. I'll just wait a bit to collect
more tags.

Thanks!

Frederic Weisbecker Oct. 4, 2023, 9:35 a.m. UTC | #5
On Tue, Oct 03, 2023 at 08:21:42PM -0700, Paul E. McKenney wrote:
> And one other thing...
> 
> o	What other bugs like this one are hiding elsewhere
> 	in RCU?

Hmm, yesterday I thought RCU would be fine because it has the tick
polling on callbacks anyway. But I'm not so sure; I'll check for real...

Thanks.

Frederic Weisbecker Oct. 4, 2023, 9:36 a.m. UTC | #6
On Tue, Oct 03, 2023 at 08:30:45PM -0700, Paul E. McKenney wrote:
> The 200*1h of SRCU-N and the 100*1h of SRCU-P passed, other than the usual
> tick-stop errors.  (Is there a patch for that one?)  The 40-CPU SRCU-N
> run was fine, but the 40-CPU SRCU-P run failed due to the fanouts setting
> a maximum of 16 CPUs.  So I started a 10-hour 40-CPU SRCU-P and a pair
> of 10-hour 16-CPU SRCU-N runs on one system, and 200*10h of SRCU-N and
> 100*10h of SRCU-P.
> 
> I will let you know how it goes.

Very nice! It might be worth testing the first patch alone as
well if we backport only this one.

Thanks!

Paul E. McKenney Oct. 4, 2023, 2:06 p.m. UTC | #7
On Wed, Oct 04, 2023 at 11:36:49AM +0200, Frederic Weisbecker wrote:
> Very nice! It might be worth testing the first patch alone as
> well if we backport only this one.

The 10-hour 40-CPU SRCU-P run and the pair of 10-hour 16-CPU SRCU-N runs
completed without failure.  The others had some failures, but I need
to look and see if any were unexpected.  In the meantime, I started a
two-hour 40-CPU SRCU-P run and a pair of one-hour 16-CPU SRCU-N runs on
just that first commit.  Also servicing SIGSHOWER and SIGFOOD.  ;-)

							Thanx, Paul
Paul E. McKenney Oct. 4, 2023, 4:47 p.m. UTC | #8
On Wed, Oct 04, 2023 at 07:06:58AM -0700, Paul E. McKenney wrote:
> The 10-hour 40-CPU SRCU-P run and the pair of 10-hour 16-CPU SRCU-N runs
> completed without failure.  The others had some failures, but I need
> to look and see if any were unexpected.  In the meantime, I started a
> two-hour 40-CPU SRCU-P run and a pair of one-hour 16-CPU SRCU-N runs on
> just that first commit.  Also servicing SIGSHOWER and SIGFOOD.  ;-)

And the two-hour 40-CPU SRCU-P run and a pair of two-hour 16-CPU SRCU-N
runs (on only the first commit) completed without incident.

The other set of overnight full-stack runs had only tick-stop errors,
so I started a two-hour set on the first commit.

So far so good!

							Thanx, Paul

Frederic Weisbecker Oct. 4, 2023, 9:27 p.m. UTC | #9
On Wed, Oct 04, 2023 at 09:47:04AM -0700, Paul E. McKenney wrote:
> And the two-hour 40-CPU SRCU-P run and a pair of two-hour 16-CPU SRCU-N
> runs (on only the first commit) completed without incident.
> 
> The other set of overnight full-stack runs had only tick-stop errors,
> so I started a two-hour set on the first commit.
> 
> So far so good!

Very nice!

As for the tick-stop error, see the upstream fix:

   1a6a46477494 ("timers: Tag (hr)timer softirq as hotplug safe")

Thanks!
Paul E. McKenney Oct. 4, 2023, 9:54 p.m. UTC | #10
On Wed, Oct 04, 2023 at 11:27:29PM +0200, Frederic Weisbecker wrote:
> As for the tick-stop error, see the upstream fix:
> 
>    1a6a46477494 ("timers: Tag (hr)timer softirq as hotplug safe")

Got it, thank you!

And the two-hour set of 200*SRCU-N and 100*SRCU-P had only tick-stop
errors.  I am refreshing the test grid, and will run overnight.

Here is hoping!

							Thanx, Paul
Paul E. McKenney Oct. 5, 2023, 4:54 p.m. UTC | #11
On Wed, Oct 04, 2023 at 02:54:57PM -0700, Paul E. McKenney wrote:
> And the two-hour set of 200*SRCU-N and 100*SRCU-P had only tick-stop
> errors.  I am refreshing the test grid, and will run overnight.

And the ten-hour test passed with only the tick-stop errors, representing
2000 hours of SRCU-N and 1000 hours of SRCU-P.  (I ran the exact same
stack, without the rebased fix you call out above.)

Looking good!

							Thanx, Paul
zhuangel570 Oct. 7, 2023, 10:24 a.m. UTC | #12
On Wed, Oct 4, 2023 at 5:25 PM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> On Tue, Oct 03, 2023 at 05:35:31PM -0700, Paul E. McKenney wrote:
> >
> > o     Is it possible to get a Tested-by from the original reporter?
> >       Or is this not reproducible?
>
> It seems that the issue triggers only rarely. But I hope we can get one.

There is currently no way to reproduce this problem in our environment.
The problem has appeared on 2 machines, and each time it occurred, the
test had been running for more than a month.

BTW, I will run tests with these patches in our environment.

Frederic Weisbecker Oct. 10, 2023, 11:23 a.m. UTC | #13
On Thu, Oct 05, 2023 at 09:54:12AM -0700, Paul E. McKenney wrote:
> And the ten-hour test passed with only the tick-stop errors, representing
> 2000 hours of SRCU-N and 1000 hours of SRCU-P.  (I ran the exact same
> stack, without the rebased fix you call out above.)

Thanks a lot!
Frederic Weisbecker Oct. 10, 2023, 11:27 a.m. UTC | #14
On Sat, Oct 07, 2023 at 06:24:53PM +0800, zhuangel570 wrote:
> There is currently no way to reproduce this problem in our environment.
> The problem has appeared on 2 machines, and each time it occurred, the
> test had been running for more than a month.
> 
> BTW, I will run tests with these patches in our environment.

Ok, let us know if it ever triggers after this series.

Since I added you in the Co-developed-by: tags, I will need to
also add your Signed-off-by: tag.

Is that ok for you?

Thanks.
zhuangel570 Oct. 10, 2023, 1:20 p.m. UTC | #15
On Tue, Oct 10, 2023 at 7:27 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> Ok, let us know if it ever triggers after this series.

Sure, I have ported the patch set to our test environment; two machines
have now been running the test for 3 days, and everything looks fine.

The patch set is very elegant and, per our analysis, completely eliminates
the possibility of unexpected accelerations. I am very confident it fixes
our problem.

>
> Since I added you in the Co-developed-by: tags, I will need to
> also add your Signed-off-by: tag.
>
> Is that ok for you?

Sure. Big thanks!
If possible, would you please change my Signed-off-by to:

Signed-off-by: Yong He <alexyonghe@tencent.com>
