[RFC] Stop rcu_tasks_invoke_cbs() from using never-online CPUs

Message ID: 83d037d1-ef12-4b31-a7b9-7b1ed6c3ae42@paulmck-laptop (mailing list archive)
State: Accepted
Commit: 26a29ad68271df988a813e227da98861ae3bef9e
Series: [RFC] Stop rcu_tasks_invoke_cbs() from using never-online CPUs

Commit Message

Paul E. McKenney April 26, 2023, 5:26 p.m. UTC
The rcu_tasks_invoke_cbs() function relies on queue_work_on() to silently fall
back to WORK_CPU_UNBOUND when the specified CPU is offline.  However,
the queue_work_on() function's silent fallback mechanism relies on that
CPU having been online at some time in the past.  When queue_work_on()
is passed a CPU that has never been online, workqueue lockups ensue,
which can be bad for your kernel's general health and well-being.

This commit therefore checks whether a given CPU is currently online,
and, if not, substitutes WORK_CPU_UNBOUND in the subsequent call to
queue_work_on().  Why not simply omit the queue_work_on() call entirely?
Because this function is flooding callback-invocation notifications
to all CPUs, and must deal with possibilities that include a sparse
cpu_possible_mask.
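
In essence, the change replaces each direct queue_work_on(cpunext, ...)
with the guarded form below (a condensed sketch of the pattern; the full
diff is at the end of this page):

	cpuwq = cpu_online(cpunext) ? cpunext : WORK_CPU_UNBOUND;
	queue_work_on(cpuwq, system_wq, &rtpcp_next->rtp_work);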

Fixes: d363f833c6d88 ("rcu-tasks: Use workqueues for multiple rcu_tasks_invoke_cbs() invocations")
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Comments

Tejun Heo April 26, 2023, 7:55 p.m. UTC | #1
Hello, Paul.

On Wed, Apr 26, 2023 at 10:26:38AM -0700, Paul E. McKenney wrote:
> The rcu_tasks_invoke_cbs() function relies on queue_work_on() to silently fall
> back to WORK_CPU_UNBOUND when the specified CPU is offline.  However,
> the queue_work_on() function's silent fallback mechanism relies on that
> CPU having been online at some time in the past.  When queue_work_on()
> is passed a CPU that has never been online, workqueue lockups ensue,
> which can be bad for your kernel's general health and well-being.
> 
> This commit therefore checks whether a given CPU is currently online,
> and, if not, substitutes WORK_CPU_UNBOUND in the subsequent call to
> queue_work_on().  Why not simply omit the queue_work_on() call entirely?
> Because this function is flooding callback-invocation notifications
> to all CPUs, and must deal with possibilities that include a sparse
> cpu_possible_mask.
> 
> Fixes: d363f833c6d88 ("rcu-tasks: Use workqueues for multiple rcu_tasks_invoke_cbs() invocations")
> Reported-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

I don't understand the code at all but wonder whether it can do something
similar to cpumask_any_distribute(), which round-robins through the specified
cpumask. Would
it make sense to change rcu_tasks_invoke_cbs() to do something similar
against cpu_online_mask?
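
Something along these lines, say (a hypothetical sketch, not code from
the patch; rtpcp_next and rtp_work as in the diff below):

	/* Round-robin the target over the online CPUs instead of
	 * using the computed cpunext directly. */
	int cpu = cpumask_any_distribute(cpu_online_mask);

	queue_work_on(cpu, system_wq, &rtpcp_next->rtp_work);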

Thanks.
Paul E. McKenney April 26, 2023, 9:17 p.m. UTC | #2
On Wed, Apr 26, 2023 at 09:55:55AM -1000, Tejun Heo wrote:
> Hello, Paul.
> 
> On Wed, Apr 26, 2023 at 10:26:38AM -0700, Paul E. McKenney wrote:
> > The rcu_tasks_invoke_cbs() function relies on queue_work_on() to silently fall
> > back to WORK_CPU_UNBOUND when the specified CPU is offline.  However,
> > the queue_work_on() function's silent fallback mechanism relies on that
> > CPU having been online at some time in the past.  When queue_work_on()
> > is passed a CPU that has never been online, workqueue lockups ensue,
> > which can be bad for your kernel's general health and well-being.
> > 
> > This commit therefore checks whether a given CPU is currently online,
> > and, if not, substitutes WORK_CPU_UNBOUND in the subsequent call to
> > queue_work_on().  Why not simply omit the queue_work_on() call entirely?
> > Because this function is flooding callback-invocation notifications
> > to all CPUs, and must deal with possibilities that include a sparse
> > cpu_possible_mask.
> > 
> > Fixes: d363f833c6d88 ("rcu-tasks: Use workqueues for multiple rcu_tasks_invoke_cbs() invocations")
> > Reported-by: Tejun Heo <tj@kernel.org>
> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> 
> I don't understand the code at all but wonder whether it can do something
> similar to cpumask_any_distribute(), which round-robins through the specified
> cpumask. Would
> it make sense to change rcu_tasks_invoke_cbs() to do something similar
> against cpu_online_mask?

It might.

But the idea here is to spread the load of queueing the work as well as
spreading the load of invoking the callbacks.

I suppose that I could allocate an array of ints, gather the online CPUs
into that array, and do a power-of-two distribution across that array.
But RCU Tasks allows CPUs to go offline with queued callbacks, so this
array would also need to include those CPUs as well as the ones that
are online.
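
Something like this hypothetical sketch, that is (never implemented,
and the names are made up):

	/* Gather the eligible CPUs, then index this array with the
	 * existing power-of-two fanout.  "Eligible" must also cover
	 * offline CPUs that still hold queued Tasks-RCU callbacks,
	 * which is what makes it more than a cpu_online_mask walk. */
	static int eligible[NR_CPUS];
	int cpu, n = 0;

	for_each_online_cpu(cpu)
		eligible[n++] = cpu;
	/* ...and append offline CPUs with callbacks still queued... */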

Given that the common-case system has a dense cpu_online_mask, I opted
to keep it simple, which is optimal in the common case.

Or am I missing a trick here?

							Thanx, Paul
Tejun Heo April 26, 2023, 9:31 p.m. UTC | #3
Hello,

On Wed, Apr 26, 2023 at 02:17:03PM -0700, Paul E. McKenney wrote:
> But the idea here is to spread the load of queueing the work as well as
> spreading the load of invoking the callbacks.
> 
> I suppose that I could allocate an array of ints, gather the online CPUs
> into that array, and do a power-of-two distribution across that array.
> But RCU Tasks allows CPUs to go offline with queued callbacks, so this
> array would also need to include those CPUs as well as the ones that
> are online.

Ah, I see, so it needs to distinguish between CPUs that have never been
online and CPUs that are currently offline but used to be online.

> Given that the common-case system has a dense cpu_online_mask, I opted
> to keep it simple, which is optimal in the common case.
> 
> Or am I missing a trick here?

The worry is that on systems with actual CPU hotplugging, cpu_online_mask
can be pretty sparse - e.g. 1/4 filled wouldn't be too out there. In such
cases, the current code would end up scheduling the work items on the issuing
CPU (which is what WORK_CPU_UNBOUND does) 3/4 of the time, which probably
isn't the desired behavior.

So, I can initialize all per-cpu workqueues for all possible CPUs on boot so
that RCU doesn't have to worry about it, but that would still have a similar
problem: the callbacks would not really be spread as intended.

I think it depends on how important it is to spread the callback workload
evenly. If that matters quite a bit, it probably would make sense to
maintain a cpumask for has-ever-been-online CPUs. Otherwise, do you think it
can just use an unbound workqueue and forget about manually distributing the
workload?
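
The has-ever-been-online mask could be as simple as this hypothetical
sketch (names made up), registered once at boot and never cleared:

	static struct cpumask ever_online_mask;

	static int mark_ever_online(unsigned int cpu)
	{
		cpumask_set_cpu(cpu, &ever_online_mask);
		return 0;
	}

	/* cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x:ever-online",
	 *                   mark_ever_online, NULL); */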

Thanks.
Paul E. McKenney April 26, 2023, 9:55 p.m. UTC | #4
On Wed, Apr 26, 2023 at 11:31:00AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Wed, Apr 26, 2023 at 02:17:03PM -0700, Paul E. McKenney wrote:
> > But the idea here is to spread the load of queueing the work as well as
> > spreading the load of invoking the callbacks.
> > 
> > I suppose that I could allocate an array of ints, gather the online CPUs
> > into that array, and do a power-of-two distribution across that array.
> > But RCU Tasks allows CPUs to go offline with queued callbacks, so this
> > array would also need to include those CPUs as well as the ones that
> > are online.
> 
> Ah, I see, so it needs to distinguish between CPUs that have never been
> online and CPUs that are currently offline but used to be online.

But only for as long as the used-to-be-online CPUs have callbacks for
the corresponding flavor of Tasks RCU.  :-/

> > Given that the common-case system has a dense cpus_online_mask, I opted
> > to keep it simple, which is optimal in the common case.
> > 
> > Or am I missing a trick here?
> 
> The worry is that on systems with actual CPU hotplugging, cpu_online_mask
> can be pretty sparse - e.g. 1/4 filled wouldn't be too out there. In such
> cases, the current code would end up scheduling the work items on the issuing
> CPU (which is what WORK_CPU_UNBOUND does) 3/4 of the time, which probably
> isn't the desired behavior.
> 
> So, I can initialize all per-cpu workqueues for all possible CPUs on boot so
> that RCU doesn't have to worry about it, but that would still have a similar
> problem: the callbacks would not really be spread as intended.

Unless you get a few more users that care about this, it is probably
best to just let RCU deal with it.

For whatever it is worth, I am working on a smaller patch that doesn't need
to do cpus_read_lock(), but anyone with short-term needs should stick
with the existing patch.

> I think it depends on how important it is to spread the callback workload
> evenly. If that matters quite a bit, it probably would make sense to
> maintain a cpumask for has-ever-been-online CPUs. Otherwise, do you think it
> can just use an unbound workqueue and forget about manually distributing the
> workload?

If there are not very many callbacks, then you are right that spreading
the load makes no sense.  And the 18-months-ago version of this code in
fact didn't bother spreading.  But new workloads came up that cared about
update-side performance and scalability, which led to the current code.

This code initially just invokes all the callbacks directly, just like
it did unconditionally 18 months ago, due to ->percpu_dequeue_lim being
initialized to 1.  This causes all the RCU Tasks callbacks to be queued
on CPU 0 and to be invoked directly by the grace-period kthread.

But if the call_rcu_tasks_*() code detects too much lock contention on
CPU 0's queue, which indicates that very large numbers of callbacks are
being queued, it switches to per-CPU mode.  In which case, we are likely
to have lots of callbacks on lots of queues, and in that case we really
want to invoke them concurrently.
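
Schematically, the spreading is a binary kick tree (this is the shape of
the existing code, shown in the patch at the end of this page): the
worker for CPU c queues work for CPUs 2c+1 and 2c+2, so callback
invocation fans out in logarithmic waves.

	/* cpunext = cpu * 2 + 1, then cpunext + 1:
	 *
	 *               0
	 *             /   \
	 *            1     2
	 *           / \   / \
	 *          3   4 5   6   ...
	 */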

Then if a later grace period finds that there are no more callbacks, it
switches back to CPU-0 mode.  So this extra workqueue overhead should
happen only on systems with sparse cpu_online_masks that are under heavy
call_rcu_tasks_*() load.

That is the theory, anyway!  ;-)

							Thanx, Paul
Tejun Heo April 26, 2023, 10:09 p.m. UTC | #5
Hello, Paul.

On Wed, Apr 26, 2023 at 02:55:04PM -0700, Paul E. McKenney wrote:
> But if the call_rcu_tasks_*() code detects too much lock contention on
> CPU 0's queue, which indicates that very large numbers of callbacks are
> being queued, it switches to per-CPU mode.  In which case, we are likely
> to have lots of callbacks on lots of queues, and in that case we really
> want to invoke them concurrently.
> 
> Then if a later grace period finds that there are no more callbacks, it
> switches back to CPU-0 mode.  So this extra workqueue overhead should
> happen only on systems with sparse cpu_online_masks that are under heavy
> call_rcu_tasks_*() load.

I still wonder whether it can be solved by simply switching to unbound
workqueues instead of implementing custom load-spreading mechanism. We'd be
basically asking the scheduler to do what it thinks is best instead of trying
to make manual CPU placement decisions. That said, as a fix, the original
patch looks fine to me. Gonna go ack that.

Thanks.
Tejun Heo April 26, 2023, 10:12 p.m. UTC | #6
On Wed, Apr 26, 2023 at 10:26:38AM -0700, Paul E. McKenney wrote:
> The rcu_tasks_invoke_cbs() function relies on queue_work_on() to silently fall
> back to WORK_CPU_UNBOUND when the specified CPU is offline.  However,
> the queue_work_on() function's silent fallback mechanism relies on that
> CPU having been online at some time in the past.  When queue_work_on()
> is passed a CPU that has never been online, workqueue lockups ensue,
> which can be bad for your kernel's general health and well-being.
> 
> This commit therefore checks whether a given CPU is currently online,
> and, if not, substitutes WORK_CPU_UNBOUND in the subsequent call to
> queue_work_on().  Why not simply omit the queue_work_on() call entirely?
> Because this function is flooding callback-invocation notifications
> to all CPUs, and must deal with possibilities that include a sparse
> cpu_possible_mask.
> 
> Fixes: d363f833c6d88 ("rcu-tasks: Use workqueues for multiple rcu_tasks_invoke_cbs() invocations")
> Reported-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
...
> +		// If a CPU has never been online, queue_work_on()
> +		// objects to queueing work on that CPU.  Approximate a
> +		// check for this by checking if the CPU is currently online.
> +
> +		cpus_read_lock();
> +		cpuwq1 = cpu_online(cpunext) ? cpunext : WORK_CPU_UNBOUND;
> +		cpuwq2 = cpu_online(cpunext + 1) ? cpunext + 1 : WORK_CPU_UNBOUND;
> +		cpus_read_unlock();
> +
> +		// Yes, either CPU could go offline here.  But that is
> +		// OK because queue_work_on() will (in effect) silently
> +		// fall back to WORK_CPU_UNBOUND for any CPU that has ever
> +		// been online.

Looks like cpus_read_lock() isn't protecting anything really.

> +		queue_work_on(cpuwq1, system_wq, &rtpcp_next->rtp_work);
>  		cpunext++;
>  		if (cpunext < smp_load_acquire(&rtp->percpu_dequeue_lim)) {
>  			rtpcp_next = per_cpu_ptr(rtp->rtpcpu, cpunext);
> -			queue_work_on(cpunext, system_wq, &rtpcp_next->rtp_work);
> +			queue_work_on(cpuwq2, system_wq, &rtpcp_next->rtp_work);

As discussed in the thread, I kinda wonder whether just using an unbound
workqueue would be sufficient, but as a fix this looks good to me.

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.
Paul E. McKenney April 26, 2023, 10:29 p.m. UTC | #7
On Wed, Apr 26, 2023 at 12:12:05PM -1000, Tejun Heo wrote:
> On Wed, Apr 26, 2023 at 10:26:38AM -0700, Paul E. McKenney wrote:
> > The rcu_tasks_invoke_cbs() function relies on queue_work_on() to silently fall
> > back to WORK_CPU_UNBOUND when the specified CPU is offline.  However,
> > the queue_work_on() function's silent fallback mechanism relies on that
> > CPU having been online at some time in the past.  When queue_work_on()
> > is passed a CPU that has never been online, workqueue lockups ensue,
> > which can be bad for your kernel's general health and well-being.
> > 
> > This commit therefore checks whether a given CPU is currently online,
> > and, if not, substitutes WORK_CPU_UNBOUND in the subsequent call to
> > queue_work_on().  Why not simply omit the queue_work_on() call entirely?
> > Because this function is flooding callback-invocation notifications
> > to all CPUs, and must deal with possibilities that include a sparse
> > cpu_possible_mask.
> > 
> > Fixes: d363f833c6d88 ("rcu-tasks: Use workqueues for multiple rcu_tasks_invoke_cbs() invocations")
> > Reported-by: Tejun Heo <tj@kernel.org>
> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> ...
> > +		// If a CPU has never been online, queue_work_on()
> > +		// objects to queueing work on that CPU.  Approximate a
> > +		// check for this by checking if the CPU is currently online.
> > +
> > +		cpus_read_lock();
> > +		cpuwq1 = cpu_online(cpunext) ? cpunext : WORK_CPU_UNBOUND;
> > +		cpuwq2 = cpu_online(cpunext + 1) ? cpunext + 1 : WORK_CPU_UNBOUND;
> > +		cpus_read_unlock();
> > +
> > +		// Yes, either CPU could go offline here.  But that is
> > +		// OK because queue_work_on() will (in effect) silently
> > +		// fall back to WORK_CPU_UNBOUND for any CPU that has ever
> > +		// been online.
> 
> Looks like cpus_read_lock() isn't protecting anything really.

It certainly isn't protecting much.  ;-)

The purpose is to avoid a situation where this CPU thinks that some other
CPU is online, but the corresponding workqueue has not yet been created.
And I freely admit that, given the huge amount of synchronization and
delay on the CPU-hotplug path, it is really hard to imagine this getting
messed up.  Except that this is the quick fix, where it would be bad to
rely on anything requiring much thought.
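
The window being guarded against, schematically (a hypothetical
interleaving, hard to hit in practice for the reasons above):

	/* CPU A: cpu_online(cpunext) returns true
	 * CPU B: is coming online; its workqueue is not yet created
	 * CPU A: queue_work_on(cpunext, ...) targets a workqueue
	 *        that does not exist yet */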

I am working on a more dainty fix for mainline that doesn't need
cpus_read_lock(), even in nightmarish theorizing.

> > +		queue_work_on(cpuwq1, system_wq, &rtpcp_next->rtp_work);
> >  		cpunext++;
> >  		if (cpunext < smp_load_acquire(&rtp->percpu_dequeue_lim)) {
> >  			rtpcp_next = per_cpu_ptr(rtp->rtpcpu, cpunext);
> > -			queue_work_on(cpunext, system_wq, &rtpcp_next->rtp_work);
> > +			queue_work_on(cpuwq2, system_wq, &rtpcp_next->rtp_work);
> 
> As discussed in the thread, I kinda wonder whether just using an unbound
> workqueue would be sufficient, but as a fix this looks good to me.

I would agree except for the callback-flood corner case, in which
callback invocation needs all the help it can get if it is to keep up
with a tightish loop doing call_rcu_tasks_*().

> Acked-by: Tejun Heo <tj@kernel.org>

I will apply this on my next rebase, thank you!

							Thanx, Paul

Patch

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index cf7b00af9474..055a5f152127 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -464,6 +464,8 @@ static void rcu_tasks_invoke_cbs(struct rcu_tasks *rtp, struct rcu_tasks_percpu
 {
 	int cpu;
 	int cpunext;
+	int cpuwq1;
+	int cpuwq2;
 	unsigned long flags;
 	int len;
 	struct rcu_head *rhp;
@@ -474,11 +476,26 @@ static void rcu_tasks_invoke_cbs(struct rcu_tasks *rtp, struct rcu_tasks_percpu
 	cpunext = cpu * 2 + 1;
 	if (cpunext < smp_load_acquire(&rtp->percpu_dequeue_lim)) {
 		rtpcp_next = per_cpu_ptr(rtp->rtpcpu, cpunext);
-		queue_work_on(cpunext, system_wq, &rtpcp_next->rtp_work);
+
+		// If a CPU has never been online, queue_work_on()
+		// objects to queueing work on that CPU.  Approximate a
+		// check for this by checking if the CPU is currently online.
+
+		cpus_read_lock();
+		cpuwq1 = cpu_online(cpunext) ? cpunext : WORK_CPU_UNBOUND;
+		cpuwq2 = cpu_online(cpunext + 1) ? cpunext + 1 : WORK_CPU_UNBOUND;
+		cpus_read_unlock();
+
+		// Yes, either CPU could go offline here.  But that is
+		// OK because queue_work_on() will (in effect) silently
+		// fall back to WORK_CPU_UNBOUND for any CPU that has ever
+		// been online.
+
+		queue_work_on(cpuwq1, system_wq, &rtpcp_next->rtp_work);
 		cpunext++;
 		if (cpunext < smp_load_acquire(&rtp->percpu_dequeue_lim)) {
 			rtpcp_next = per_cpu_ptr(rtp->rtpcpu, cpunext);
-			queue_work_on(cpunext, system_wq, &rtpcp_next->rtp_work);
+			queue_work_on(cpuwq2, system_wq, &rtpcp_next->rtp_work);
 		}
 	}