
kthread_worker: re-set CPU affinities if a CPU comes online

Message ID 20201026065213.30477-1-qiang.zhang@windriver.com (mailing list archive)
State New, archived
Series kthread_worker: re-set CPU affinities if a CPU comes online

Commit Message

Zhang, Qiang Oct. 26, 2020, 6:52 a.m. UTC
From: Zqiang <qiang.zhang@windriver.com>

When a CPU goes offline, a 'kthread_worker' bound to that CPU can end up
running anywhere. When the CPU comes back online, restore the
'kthread_worker' affinity via a cpuhp notifier.

Signed-off-by: Zqiang <qiang.zhang@windriver.com>
---
 include/linux/kthread.h |  2 ++
 kernel/kthread.c        | 35 ++++++++++++++++++++++++++++++++++-
 2 files changed, 36 insertions(+), 1 deletion(-)

Comments

Tejun Heo Oct. 26, 2020, 1:50 p.m. UTC | #1
On Mon, Oct 26, 2020 at 02:52:13PM +0800, qiang.zhang@windriver.com wrote:
> @@ -737,8 +741,11 @@ __kthread_create_worker(int cpu, unsigned int flags,
>  	if (IS_ERR(task))
>  		goto fail_task;
>  
> -	if (cpu >= 0)
> +	if (cpu >= 0) {
>  		kthread_bind(task, cpu);
> +		worker->bind_cpu = cpu;
> +		cpuhp_state_add_instance_nocalls(kworker_online, &worker->cpuhp_node);
> +	}
>  
>  	worker->flags = flags;
>  	worker->task = task;
...
> +static int kworker_cpu_online(unsigned int cpu, struct hlist_node *node)
> +{
> +	struct kthread_worker *worker = hlist_entry(node, struct kthread_worker, cpuhp_node);
> +	struct task_struct *task = worker->task;
> +
> +	if (cpu == worker->bind_cpu)
> +		WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpumask_of(cpu)) < 0);
> +	return 0;
> +}

I don't think this works. The kthread may have changed its binding while
running using set_cpus_allowed_ptr() as you're doing above. Besides, when a
cpu goes offline, the bound kthread can fall back to other cpus but its cpu
mask isn't cleared, is it?

Thanks.
Petr Mladek Oct. 26, 2020, 4:45 p.m. UTC | #2
On Mon 2020-10-26 09:50:11, Tejun Heo wrote:
> On Mon, Oct 26, 2020 at 02:52:13PM +0800, qiang.zhang@windriver.com wrote:
> > @@ -737,8 +741,11 @@ __kthread_create_worker(int cpu, unsigned int flags,
> >  	if (IS_ERR(task))
> >  		goto fail_task;
> >  
> > -	if (cpu >= 0)
> > +	if (cpu >= 0) {
> >  		kthread_bind(task, cpu);
> > +		worker->bind_cpu = cpu;
> > +		cpuhp_state_add_instance_nocalls(kworker_online, &worker->cpuhp_node);
> > +	}
> >  
> >  	worker->flags = flags;
> >  	worker->task = task;
> ...
> > +static int kworker_cpu_online(unsigned int cpu, struct hlist_node *node)
> > +{
> > +	struct kthread_worker *worker = hlist_entry(node, struct kthread_worker, cpuhp_node);
> > +	struct task_struct *task = worker->task;
> > +
> > +	if (cpu == worker->bind_cpu)
> > +		WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpumask_of(cpu)) < 0);
> > +	return 0;
> > +}
> 
> I don't think this works. The kthread may have changed its binding while
> running using set_cpus_allowed_ptr() as you're doing above. Besides, when a
> cpu goes offline, the bound kthread can fall back to other cpus but its cpu
> mask isn't cleared, is it?

If I get it correctly, select_fallback_rq() calls
do_set_cpus_allowed() explicitly or in cpuset_cpus_allowed_fallback().
It seems that the original mask gets lost.

It would make sense to assume that the kthread_worker API will take care of
the affinity when it was set by kthread_create_worker_on_cpu().

But is it safe to assume that the work can safely proceed on another
CPU as well? We should probably add a warning to kthread_worker_fn()
when it detects the wrong CPU.
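
Something like this in the kthread_worker_fn() main loop would do, just as
a sketch on top of this patch (it reuses the bind_cpu field added here;
raw_smp_processor_id() because the interesting case is exactly the one where
the binding has been lost and the thread is preemptible):

  /* Sketch: warn once if a CPU-bound worker runs on the wrong CPU. */
  if (worker->bind_cpu >= 0)
          WARN_ON_ONCE(worker->bind_cpu != raw_smp_processor_id());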

BTW: kthread_create_worker_on_cpu() is currently used only by
     start_power_clamp_worker(). And it has its own CPU hotplug
     handling. The kthreads are stopped and started again
     in powerclamp_cpu_predown() and  powerclamp_cpu_online().


I haven't checked all the details yet. But in principle, the patch looks
sane to me.

Best Regards,
Petr
Tejun Heo Oct. 26, 2020, 4:53 p.m. UTC | #3
Hello, Petr.

On Mon, Oct 26, 2020 at 05:45:55PM +0100, Petr Mladek wrote:
> > I don't think this works. The kthread may have changed its binding while
> > running using set_cpus_allowed_ptr() as you're doing above. Besides, when a
> > cpu goes offline, the bound kthread can fall back to other cpus but its cpu
> > mask isn't cleared, is it?
> 
> If I get it correctly, select_fallback_rq() calls
> do_set_cpus_allowed() explicitly or in cpuset_cpus_allowed_fallback().
> It seems that the original mask gets lost.

Oh, I see.

> It would make sense to assume that the kthread_worker API will take care of
> the affinity when it was set by kthread_create_worker_on_cpu().

I was for some reason thinking this was for all kthreads. Yeah, for
kthread_workers it does make sense.

> But is it safe to assume that the work can safely proceed on another
> CPU as well? We should probably add a warning to
> kthread_worker_fn() when it detects the wrong CPU.

Per-cpu workqueues behave like that too. When the CPU goes down, per-cpu
workers on that CPU are unbound and may run anywhere. They get rebound when
the CPU comes back up.

> BTW: kthread_create_worker_on_cpu() is currently used only by
>      start_power_clamp_worker(). And it has its own CPU hotplug
>      handling. The kthreads are stopped and started again
>      in powerclamp_cpu_predown() and  powerclamp_cpu_online().

And users which have a hard dependency on the CPU binding are expected to
implement hotplug callbacks so that e.g. per-cpu work items are flushed when
the CPU goes down and queued back when it comes back online.
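
Roughly this pattern (just a sketch; all names below are made up, and worker
creation / kthread_init_work() are omitted):

  /* Hypothetical per-CPU kthread_work user with its own hotplug handling. */
  static DEFINE_PER_CPU(struct kthread_worker *, my_worker);
  static DEFINE_PER_CPU(struct kthread_work, my_work);

  static int my_cpu_online(unsigned int cpu)
  {
          /* The CPU is back: queue the per-CPU work on its worker again. */
          kthread_queue_work(per_cpu(my_worker, cpu), per_cpu_ptr(&my_work, cpu));
          return 0;
  }

  static int my_cpu_offline(unsigned int cpu)
  {
          /* Make sure the work has finished before the CPU goes away. */
          kthread_flush_work(per_cpu_ptr(&my_work, cpu));
          return 0;
  }

  /* Registered once during init: */
  cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "my_driver:online",
                    my_cpu_online, my_cpu_offline);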

There are pros and cons to the current workqueue behavior but it'd be a good
idea to keep kthread_worker's behavior in sync.

> I haven't checked all the details yet. But in principle, the patch looks
> sane to me.

Yeah, agreed.

Thanks.
Petr Mladek Oct. 27, 2020, 4:39 p.m. UTC | #4
On Mon 2020-10-26 14:52:13, qiang.zhang@windriver.com wrote:
> From: Zqiang <qiang.zhang@windriver.com>
> 
> When a CPU goes offline, a 'kthread_worker' bound to that CPU can end up
> running anywhere. When the CPU comes back online, restore the
> 'kthread_worker' affinity via a cpuhp notifier.

I am not familiar with CPU hotplug notifiers. I'd rather add Thomas and
Peter to Cc.

> 
> Signed-off-by: Zqiang <qiang.zhang@windriver.com>
> ---
>  include/linux/kthread.h |  2 ++
>  kernel/kthread.c        | 35 ++++++++++++++++++++++++++++++++++-
>  2 files changed, 36 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/kthread.h b/include/linux/kthread.h
> index 65b81e0c494d..5acbf2e731cb 100644
> --- a/include/linux/kthread.h
> +++ b/include/linux/kthread.h
> @@ -93,6 +93,8 @@ struct kthread_worker {
>  	struct list_head	delayed_work_list;
>  	struct task_struct	*task;
>  	struct kthread_work	*current_work;
> +	struct hlist_node       cpuhp_node;
> +	int                     bind_cpu;
>  };
>  
>  struct kthread_work {
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index e29773c82b70..68968832777f 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -28,8 +28,10 @@
>  #include <linux/uaccess.h>
>  #include <linux/numa.h>
>  #include <linux/sched/isolation.h>
> +#include <linux/cpu.h>
>  #include <trace/events/sched.h>
>  
> +static enum cpuhp_state kworker_online;

Please, use kthread_worker_online.

I know that it is too long, but the kthread_worker prefix is used everywhere.
Consistency is useful when searching and reading the code.

>  static DEFINE_SPINLOCK(kthread_create_lock);
>  static LIST_HEAD(kthread_create_list);
> @@ -649,6 +651,8 @@ void __kthread_init_worker(struct kthread_worker *worker,
>  	lockdep_set_class_and_name(&worker->lock, key, name);
>  	INIT_LIST_HEAD(&worker->work_list);
>  	INIT_LIST_HEAD(&worker->delayed_work_list);
> +	worker->bind_cpu = -1;
> +	INIT_HLIST_NODE(&worker->cpuhp_node);

The same has to be done in the KTHREAD_WORKER_INIT macro defined
in include/linux/kthread.h.
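
I.e. something along these lines (a sketch based on the current
KTHREAD_WORKER_INIT layout; a statically initialized hlist_node is already
all NULLs, so only bind_cpu really needs an explicit value):

  #define KTHREAD_WORKER_INIT(worker) {						\
          .lock = __RAW_SPIN_LOCK_UNLOCKED((worker).lock),			\
          .work_list = LIST_HEAD_INIT((worker).work_list),			\
          .delayed_work_list = LIST_HEAD_INIT((worker).delayed_work_list),	\
          .bind_cpu = -1,							\
          }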

>  }
>  EXPORT_SYMBOL_GPL(__kthread_init_worker);
>  
> @@ -737,8 +741,11 @@ __kthread_create_worker(int cpu, unsigned int flags,
>  	if (IS_ERR(task))
>  		goto fail_task;
>  
> -	if (cpu >= 0)
> +	if (cpu >= 0) {
>  		kthread_bind(task, cpu);
> +		worker->bind_cpu = cpu;
> +		cpuhp_state_add_instance_nocalls(kworker_online, &worker->cpuhp_node);

There is a rather theoretical race that the CPU might go down and come back up
between kthread_bind() and adding the callback.

It actually won't be a problem because the kthread_worker is still not
running at this stage and will not get migrated.

But I would switch the order just to be on the safe side and avoid
doubts about this race.
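
I.e. a sketch of the reordered hunk (untested):

  if (cpu >= 0) {
          worker->bind_cpu = cpu;
          cpuhp_state_add_instance_nocalls(kworker_online, &worker->cpuhp_node);
          kthread_bind(task, cpu);
  }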


> +	}
>  
>  	worker->flags = flags;
>  	worker->task = task;
> @@ -1220,6 +1227,9 @@ void kthread_destroy_worker(struct kthread_worker *worker)
>  	if (WARN_ON(!task))
>  		return;
>  
> +	if (worker->bind_cpu >= 0)
> +		cpuhp_state_remove_instance_nocalls(kworker_online, &worker->cpuhp_node);
> +
>  	kthread_flush_worker(worker);
>  	kthread_stop(task);
>  	WARN_ON(!list_empty(&worker->work_list));
> @@ -1227,6 +1237,29 @@ void kthread_destroy_worker(struct kthread_worker *worker)
>  }
>  EXPORT_SYMBOL(kthread_destroy_worker);
>  
> +static int kworker_cpu_online(unsigned int cpu, struct hlist_node *node)
> +{
> +	struct kthread_worker *worker = hlist_entry(node, struct kthread_worker, cpuhp_node);

The code here looks correct.

JFYI, I was curious why many cpuhp callbacks used hlist_entry_safe().
But they did not check for NULL. Hence the _safe() variant did
not really prevent any crash.

It seems that it was cargo-cult programming. cpuhp_invoke_callback() calls
a simple hlist_for_each(). If I get it correctly, the operations are
synchronized by cpus_read_lock()/cpus_write_lock() and the _safe variant
really is not needed.
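
For reference, hlist_entry_safe() only differs in returning NULL for a NULL
pointer, so without a subsequent NULL check it does not buy anything
(definition as in include/linux/list.h, from memory):

  #define hlist_entry_safe(ptr, type, member) \
          ({ typeof(ptr) ____ptr = (ptr); \
             ____ptr ? hlist_entry(____ptr, type, member) : NULL; \
          })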


> +	struct task_struct *task = worker->task;
> +

The WARN_ON_ONCE() below would trigger only when there is a bug in
the CPU hotplug code. Please, add a comment explaining that it is
a rather theoretical situation. Something like in the workqueue code:

	/* as we're called from CPU_ONLINE, the following shouldn't fail */

> +	if (cpu == worker->bind_cpu)
> +		WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpumask_of(cpu)) < 0);
>
> +	return 0;
> +}
> +
> +static __init int kthread_worker_hotplug_init(void)
> +{
> +	int ret;
> +
> +	ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "kthread-worker/online",
> +					kworker_cpu_online, NULL);
> +	if (ret < 0)
> +		return ret;
> +	kworker_online = ret;
> +	return 0;
> +}
> +subsys_initcall(kthread_worker_hotplug_init);

I would make it core_initcall(). It is built-in and should be usable
as early as possible.
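
I.e. just:

  -subsys_initcall(kthread_worker_hotplug_init);
  +core_initcall(kthread_worker_hotplug_init);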

Otherwise, the patch looks fine to me. Great catch!

Best Regards,
Petr

> +
>  /**
>   * kthread_use_mm - make the calling kthread operate on an address space
>   * @mm: address space to operate on
> -- 
> 2.17.1
Thomas Gleixner Oct. 27, 2020, 7:19 p.m. UTC | #5
Petr,

On Tue, Oct 27 2020 at 17:39, Petr Mladek wrote:
> On Mon 2020-10-26 14:52:13, qiang.zhang@windriver.com wrote:
>> From: Zqiang <qiang.zhang@windriver.com>
>> 
>> When a CPU goes offline, a 'kthread_worker' bound to that CPU can end up
>> running anywhere. When the CPU comes back online, restore the
>> 'kthread_worker' affinity via a cpuhp notifier.
>
> I am not familiar with CPU hotplug notifiers. I'd rather add Thomas and
> Peter to Cc.

Thanks!

>> +static int kworker_cpu_online(unsigned int cpu, struct hlist_node *node)
>> +{
>> +	struct kthread_worker *worker = hlist_entry(node, struct kthread_worker, cpuhp_node);
>
> The code here looks correct.
>
> JFYI, I was curious why many cpuhp callbacks used hlist_entry_safe().
> But they did not check for NULL. Hence the _safe() variant did
> not really prevent any crash.
>
> It seems that it was cargo-cult programming. cpuhp_invoke_callback() calls
> a simple hlist_for_each(). If I get it correctly, the operations are
> synchronized by cpus_read_lock()/cpus_write_lock() and the _safe variant
> really is not needed.

Correct.

>> +static __init int kthread_worker_hotplug_init(void)
>> +{
>> +	int ret;
>> +
>> +	ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "kthread-worker/online",
>> +					kworker_cpu_online, NULL);

The dynamic hotplug states run late. What's preventing work from being queued
on such a worker before it is bound to the CPU again?

Nothing at all.

Moving the hotplug state earlier does not help either because the rebind
cannot happen _before_ the CPUHP_AP_ONLINE state, and after that it's
already too late because interrupts have been reenabled on the upcoming
CPU. Depending on the interrupt routing, an interrupt hitting the
upcoming CPU might queue work before the state is reached. Work might
also be queued via a timer before the rebind happens.

The only current user (powerclamp) has its own hotplug handling and
stops the thread and creates a new one when the CPU comes online. So
that's not a problem.

But in general this _is_ a problem. There is also no mechanism to ensure
that work on a CPU-bound worker has been drained before the CPU goes
offline and that work on the outgoing CPU cannot be queued after a
certain point in the hotplug state machine.

CPU-bound kernel threads have special properties: you can access per-CPU
variables without further protection. This blows up in your face once
the worker thread is unbound after a hotplug operation.
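
For illustration, a hypothetical work item relying on that guarantee
(my_stats and my_work_fn are made-up names):

  /* Hypothetical example: only safe while the worker stays on its bound CPU. */
  static DEFINE_PER_CPU(u64, my_stats);

  static void my_work_fn(struct kthread_work *work)
  {
          /*
           * No locking here: correctness relies on this work always
           * executing on the CPU that owns the counter. Once the worker
           * has been unbound by a hotplug operation, this can update
           * another CPU's counter concurrently with its owner.
           */
          u64 *stat = this_cpu_ptr(&my_stats);

          *stat += 1;
  }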

So the proposed patch is duct tape and papers over the underlying design
problem.

Either this is fixed in a way which ensures operation on the bound CPU
under all circumstances or it needs to be documented that users have to
have their own hotplug handling similar to what powerclamp does.

Thanks,

        tglx

Patch

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 65b81e0c494d..5acbf2e731cb 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -93,6 +93,8 @@  struct kthread_worker {
 	struct list_head	delayed_work_list;
 	struct task_struct	*task;
 	struct kthread_work	*current_work;
+	struct hlist_node       cpuhp_node;
+	int                     bind_cpu;
 };
 
 struct kthread_work {
diff --git a/kernel/kthread.c b/kernel/kthread.c
index e29773c82b70..68968832777f 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -28,8 +28,10 @@ 
 #include <linux/uaccess.h>
 #include <linux/numa.h>
 #include <linux/sched/isolation.h>
+#include <linux/cpu.h>
 #include <trace/events/sched.h>
 
+static enum cpuhp_state kworker_online;
 
 static DEFINE_SPINLOCK(kthread_create_lock);
 static LIST_HEAD(kthread_create_list);
@@ -649,6 +651,8 @@  void __kthread_init_worker(struct kthread_worker *worker,
 	lockdep_set_class_and_name(&worker->lock, key, name);
 	INIT_LIST_HEAD(&worker->work_list);
 	INIT_LIST_HEAD(&worker->delayed_work_list);
+	worker->bind_cpu = -1;
+	INIT_HLIST_NODE(&worker->cpuhp_node);
 }
 EXPORT_SYMBOL_GPL(__kthread_init_worker);
 
@@ -737,8 +741,11 @@  __kthread_create_worker(int cpu, unsigned int flags,
 	if (IS_ERR(task))
 		goto fail_task;
 
-	if (cpu >= 0)
+	if (cpu >= 0) {
 		kthread_bind(task, cpu);
+		worker->bind_cpu = cpu;
+		cpuhp_state_add_instance_nocalls(kworker_online, &worker->cpuhp_node);
+	}
 
 	worker->flags = flags;
 	worker->task = task;
@@ -1220,6 +1227,9 @@  void kthread_destroy_worker(struct kthread_worker *worker)
 	if (WARN_ON(!task))
 		return;
 
+	if (worker->bind_cpu >= 0)
+		cpuhp_state_remove_instance_nocalls(kworker_online, &worker->cpuhp_node);
+
 	kthread_flush_worker(worker);
 	kthread_stop(task);
 	WARN_ON(!list_empty(&worker->work_list));
@@ -1227,6 +1237,29 @@  void kthread_destroy_worker(struct kthread_worker *worker)
 }
 EXPORT_SYMBOL(kthread_destroy_worker);
 
+static int kworker_cpu_online(unsigned int cpu, struct hlist_node *node)
+{
+	struct kthread_worker *worker = hlist_entry(node, struct kthread_worker, cpuhp_node);
+	struct task_struct *task = worker->task;
+
+	if (cpu == worker->bind_cpu)
+		WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpumask_of(cpu)) < 0);
+	return 0;
+}
+
+static __init int kthread_worker_hotplug_init(void)
+{
+	int ret;
+
+	ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "kthread-worker/online",
+					kworker_cpu_online, NULL);
+	if (ret < 0)
+		return ret;
+	kworker_online = ret;
+	return 0;
+}
+subsys_initcall(kthread_worker_hotplug_init);
+
 /**
  * kthread_use_mm - make the calling kthread operate on an address space
  * @mm: address space to operate on