[12/19] kthread: Default affine kthread to its preferred NUMA node

Message ID 20240916224925.20540-13-frederic@kernel.org (mailing list archive)
State New

Commit Message

Frederic Weisbecker Sept. 16, 2024, 10:49 p.m. UTC
Kthreads attached to a preferred NUMA node for their task structure
allocation can also be assumed to run preferably within that same node.

A more precise affinity is usually specified by calling
kthread_create_on_cpu() or kthread_bind[_mask]() before the first wakeup.

For the others, a default affinity to the node is desired and sometimes
implemented with more or less success when it comes to dealing with
hotplug events and nohz_full / CPU Isolation interactions:

- kcompactd is affine to its node and handles hotplug but not CPU Isolation
- kswapd is affine to its node and ignores hotplug and CPU Isolation
- A bunch of drivers create their kthreads on a specific node and
  don't take care of affining them any further.

Handle that default node affinity preference at the generic level
instead, provided a kthread is created on an actual node and doesn't
apply any specific affinity such as a given CPU or a custom cpumask to
bind to before its first wake-up.
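
For illustration, the kind of driver pattern this now covers looks like
the sketch below (the worker function and all names here are made up, not
taken from any actual driver): a kthread created on a node and woken up
without any further binding now defaults to that node's housekeeping CPUs.

#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/sched.h>

static int my_worker_fn(void *data)
{
	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (kthread_should_stop())
			break;
		/* Wait to be woken up for work (placeholder loop) */
		schedule();
	}
	__set_current_state(TASK_RUNNING);
	return 0;
}

/*
 * Hypothetical driver code: the task structure is allocated on @nid and,
 * since neither kthread_bind[_mask]() nor kthread_create_on_cpu() is used
 * before the first wakeup, the kthread now gets affined to the
 * housekeeping CPUs of @nid rather than to all housekeeping CPUs.
 */
static struct task_struct *my_start_worker(int nid)
{
	struct task_struct *task;

	task = kthread_create_on_node(my_worker_fn, NULL, nid, "my_worker/%d", nid);
	if (!IS_ERR(task))
		wake_up_process(task);
	return task;
}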

This generic handling is aware of CPU hotplug events and CPU isolation
such that:

* When a housekeeping CPU goes up and is part of the node of a given
  kthread, it is added to the kthread's applied affinity set (and the
  default last-resort online housekeeping set is possibly removed from
  that set).

* When a housekeeping CPU goes down while part of the node of a kthread,
  it is removed from the kthread's applied affinity. As a last resort,
  the kthread is affined to all online housekeeping CPUs.
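
Concretely, the applied affinity always resolves to the following rule, a
condensed restatement of kthread_fetch_affinity() from the patch below
(the helper name here is illustrative):

static void preferred_affinity(int node, struct cpumask *mask)
{
	/* Preferred: online housekeeping CPUs belonging to the kthread's node */
	cpumask_and(mask, cpumask_of_node(node), &kthread_online_mask);
	cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_KTHREAD));

	/* Last resort: all housekeeping CPUs, whatever their node */
	if (cpumask_empty(mask))
		cpumask_copy(mask, housekeeping_cpumask(HK_TYPE_KTHREAD));
}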

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 include/linux/cpuhotplug.h |   1 +
 kernel/kthread.c           | 120 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 120 insertions(+), 1 deletion(-)

Comments

Michal Hocko Sept. 17, 2024, 6:26 a.m. UTC | #1
On Tue 17-09-24 00:49:16, Frederic Weisbecker wrote:
> Kthreads attached to a preferred NUMA node for their task structure
> allocation can also be assumed to run preferably within that same node.
> 
> A more precise affinity is usually specified by calling
> kthread_create_on_cpu() or kthread_bind[_mask]() before the first wakeup.
> 
> For the others, a default affinity to the node is desired and sometimes
> implemented with more or less success when it comes to dealing with
> hotplug events and nohz_full / CPU Isolation interactions:
> 
> - kcompactd is affine to its node and handles hotplug but not CPU Isolation
> - kswapd is affine to its node and ignores hotplug and CPU Isolation
> - A bunch of drivers create their kthreads on a specific node and
>   don't take care of affining them any further.
> 
> Handle that default node affinity preference at the generic level
> instead, provided a kthread is created on an actual node and doesn't
> apply any specific affinity such as a given CPU or a custom cpumask to
> bind to before its first wake-up.

Makes sense.

> This generic handling is aware of CPU hotplug events and CPU isolation
> such that:
> 
> * When a housekeeping CPU goes up and is part of the node of a given
>   kthread, it is added to the kthread's applied affinity set (and the
>   default last-resort online housekeeping set is possibly removed from
>   that set).
> 
> * When a housekeeping CPU goes down while part of the node of a kthread,
>   it is removed from the kthread's applied affinity. As a last resort,
>   the kthread is affined to all online housekeeping CPUs.

But I am not really sure about this part. Sure it makes sense to set the
affinity to exclude isolated CPUs but why do we care about hotplug
events at all. Let's say we offline all cpus from a given node (or
that all but isolated cpus are offline - is this even a
realistic/reasonable usecase?). Wouldn't the scheduler ignore the kthread's
affinity in such a case? In other words how is that different from
tasksetting a userspace task to a cpu that goes offline? We still do
allow such a task to run, right? We just do not care about affinity
anymore.
Vlastimil Babka Sept. 17, 2024, 7:01 a.m. UTC | #2
On 9/17/24 8:26 AM, Michal Hocko wrote:
> On Tue 17-09-24 00:49:16, Frederic Weisbecker wrote:
>> Kthreads attached to a preferred NUMA node for their task structure
>> allocation can also be assumed to run preferably within that same node.
>>
>> A more precise affinity is usually specified by calling
>> kthread_create_on_cpu() or kthread_bind[_mask]() before the first wakeup.
>>
>> For the others, a default affinity to the node is desired and sometimes
>> implemented with more or less success when it comes to dealing with
>> hotplug events and nohz_full / CPU Isolation interactions:
>>
>> - kcompactd is affine to its node and handles hotplug but not CPU Isolation
>> - kswapd is affine to its node and ignores hotplug and CPU Isolation
>> - A bunch of drivers create their kthreads on a specific node and
>>   don't take care of affining them any further.
>>
>> Handle that default node affinity preference at the generic level
>> instead, provided a kthread is created on an actual node and doesn't
>> apply any specific affinity such as a given CPU or a custom cpumask to
>> bind to before its first wake-up.
> 
> Makes sense.
> 
>> This generic handling is aware of CPU hotplug events and CPU isolation
>> such that:
>>
>> * When a housekeeping CPU goes up and is part of the node of a given
>>   kthread, it is added to the kthread's applied affinity set (and the
>>   default last-resort online housekeeping set is possibly removed from
>>   that set).
>>
>> * When a housekeeping CPU goes down while part of the node of a kthread,
>>   it is removed from the kthread's applied affinity. As a last resort,
>>   the kthread is affined to all online housekeeping CPUs.
> 
> But I am not really sure about this part. Sure it makes sense to set the
> affinity to exclude isolated CPUs but why do we care about hotplug
> events at all. Let's say we offline all cpus from a given node (or
> that all but isolated cpus are offline - is this even a
> realistic/reasonable usecase?). Wouldn't the scheduler ignore the kthread's
> affinity in such a case? In other words how is that different from
> tasksetting a userspace task to a cpu that goes offline? We still do
> allow such a task to run, right? We just do not care about affinity
> anymore.

AFAIU it better handles the situation where all housekeeping cpus from
the preferred node go down: it then affines to housekeeping cpus from any
node vs any cpu including isolated ones.
Yes it's probably a scenario that's not recommended, but someone might
do it anyway...
Michal Hocko Sept. 17, 2024, 7:05 a.m. UTC | #3
On Tue 17-09-24 09:01:08, Vlastimil Babka wrote:
> On 9/17/24 8:26 AM, Michal Hocko wrote:
> > On Tue 17-09-24 00:49:16, Frederic Weisbecker wrote:
> >> Kthreads attached to a preferred NUMA node for their task structure
> >> allocation can also be assumed to run preferably within that same node.
> >>
> >> A more precise affinity is usually specified by calling
> >> kthread_create_on_cpu() or kthread_bind[_mask]() before the first wakeup.
> >>
> >> For the others, a default affinity to the node is desired and sometimes
> >> implemented with more or less success when it comes to dealing with
> >> hotplug events and nohz_full / CPU Isolation interactions:
> >>
> >> - kcompactd is affine to its node and handles hotplug but not CPU Isolation
> >> - kswapd is affine to its node and ignores hotplug and CPU Isolation
> >> - A bunch of drivers create their kthreads on a specific node and
> >>   don't take care of affining them any further.
> >>
> >> Handle that default node affinity preference at the generic level
> >> instead, provided a kthread is created on an actual node and doesn't
> >> apply any specific affinity such as a given CPU or a custom cpumask to
> >> bind to before its first wake-up.
> > 
> > Makes sense.
> > 
> >> This generic handling is aware of CPU hotplug events and CPU isolation
> >> such that:
> >>
> >> * When a housekeeping CPU goes up and is part of the node of a given
> >>   kthread, it is added to the kthread's applied affinity set (and the
> >>   default last-resort online housekeeping set is possibly removed from
> >>   that set).
> >>
> >> * When a housekeeping CPU goes down while part of the node of a kthread,
> >>   it is removed from the kthread's applied affinity. As a last resort,
> >>   the kthread is affined to all online housekeeping CPUs.
> > 
> > But I am not really sure about this part. Sure it makes sense to set the
> > affinity to exclude isolated CPUs but why do we care about hotplug
> > events at all. Let's say we offline all cpus from a given node (or
> > that all but isolated cpus are offline - is this even a
> > realistic/reasonable usecase?). Wouldn't the scheduler ignore the kthread's
> > affinity in such a case? In other words how is that different from
> > tasksetting a userspace task to a cpu that goes offline? We still do
> > allow such a task to run, right? We just do not care about affinity
> > anymore.
> 
> AFAIU it better handles the situation where all housekeeping cpus from
> the preferred node go down: it then affines to housekeeping cpus from any
> node vs any cpu including isolated ones.

Doesn't that happen automagically? Or can it end up on a random
isolated cpu?
Vlastimil Babka Sept. 17, 2024, 7:14 a.m. UTC | #4
On 9/17/24 9:05 AM, Michal Hocko wrote:
> On Tue 17-09-24 09:01:08, Vlastimil Babka wrote:
>> On 9/17/24 8:26 AM, Michal Hocko wrote:
>>> On Tue 17-09-24 00:49:16, Frederic Weisbecker wrote:
>>>> Kthreads attached to a preferred NUMA node for their task structure
>>>> allocation can also be assumed to run preferably within that same node.
>>>>
>>>> A more precise affinity is usually specified by calling
>>>> kthread_create_on_cpu() or kthread_bind[_mask]() before the first wakeup.
>>>>
>>>> For the others, a default affinity to the node is desired and sometimes
>>>> implemented with more or less success when it comes to dealing with
>>>> hotplug events and nohz_full / CPU Isolation interactions:
>>>>
>>>> - kcompactd is affine to its node and handles hotplug but not CPU Isolation
>>>> - kswapd is affine to its node and ignores hotplug and CPU Isolation
>>>> - A bunch of drivers create their kthreads on a specific node and
>>>>   don't take care of affining them any further.
>>>>
>>>> Handle that default node affinity preference at the generic level
>>>> instead, provided a kthread is created on an actual node and doesn't
>>>> apply any specific affinity such as a given CPU or a custom cpumask to
>>>> bind to before its first wake-up.
>>>
>>> Makes sense.
>>>
>>>> This generic handling is aware of CPU hotplug events and CPU isolation
>>>> such that:
>>>>
>>>> * When a housekeeping CPU goes up and is part of the node of a given
>>>>   kthread, it is added to the kthread's applied affinity set (and the
>>>>   default last-resort online housekeeping set is possibly removed from
>>>>   that set).
>>>>
>>>> * When a housekeeping CPU goes down while part of the node of a kthread,
>>>>   it is removed from the kthread's applied affinity. As a last resort,
>>>>   the kthread is affined to all online housekeeping CPUs.
>>>
>>> But I am not really sure about this part. Sure it makes sense to set the
>>> affinity to exclude isolated CPUs but why do we care about hotplug
>>> events at all. Let's say we offline all cpus from a given node (or
>>> that all but isolated cpus are offline - is this even a
>>> realistic/reasonable usecase?). Wouldn't the scheduler ignore the kthread's
>>> affinity in such a case? In other words how is that different from
>>> tasksetting a userspace task to a cpu that goes offline? We still do
>>> allow such a task to run, right? We just do not care about affinity
>>> anymore.
>>
>> AFAIU it better handles the situation where all housekeeping cpus from
>> the preferred node go down: it then affines to housekeeping cpus from any
>> node vs any cpu including isolated ones.
> 
> Doesn't that happen automagically? Or can it end up on a random
> isolated cpu?

Good question, perhaps it can and there's no automagic, as I see code like:

+		/* Make sure the kthread never gets re-affined globally */
+		set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_TYPE_KTHREAD));
Frederic Weisbecker Sept. 17, 2024, 10:34 a.m. UTC | #5
On Tue, Sep 17, 2024 at 08:26:49AM +0200, Michal Hocko wrote:
> On Tue 17-09-24 00:49:16, Frederic Weisbecker wrote:
> > Kthreads attached to a preferred NUMA node for their task structure
> > allocation can also be assumed to run preferably within that same node.
> > 
> > A more precise affinity is usually specified by calling
> > kthread_create_on_cpu() or kthread_bind[_mask]() before the first wakeup.
> > 
> > For the others, a default affinity to the node is desired and sometimes
> > implemented with more or less success when it comes to dealing with
> > hotplug events and nohz_full / CPU Isolation interactions:
> > 
> > - kcompactd is affine to its node and handles hotplug but not CPU Isolation
> > - kswapd is affine to its node and ignores hotplug and CPU Isolation
> > - A bunch of drivers create their kthreads on a specific node and
> >   don't take care of affining them any further.
> > 
> > Handle that default node affinity preference at the generic level
> > instead, provided a kthread is created on an actual node and doesn't
> > apply any specific affinity such as a given CPU or a custom cpumask to
> > bind to before its first wake-up.
> 
> Makes sense.
> 
> > This generic handling is aware of CPU hotplug events and CPU isolation
> > such that:
> > 
> > * When a housekeeping CPU goes up and is part of the node of a given
> >   kthread, it is added to the kthread's applied affinity set (and the
> >   default last-resort online housekeeping set is possibly removed from
> >   that set).
> > 
> > * When a housekeeping CPU goes down while part of the node of a kthread,
> >   it is removed from the kthread's applied affinity. As a last resort,
> >   the kthread is affined to all online housekeeping CPUs.
> 
> But I am not really sure about this part. Sure it makes sense to set the
> affinity to exclude isolated CPUs but why do we care about hotplug
> events at all. Let's say we offline all cpus from a given node (or
> that all but isolated cpus are offline - is this even a
> realistic/reasonable usecase?). Wouldn't the scheduler ignore the kthread's
> affinity in such a case? In other words how is that different from
> tasksetting a userspace task to a cpu that goes offline? We still do
> allow such a task to run, right? We just do not care about affinity
> anymore.

Suppose we have this artificial online set:

NODE 0 -> CPU 0
NODE 1 -> CPU 1
NODE 2 -> CPU 2

And we have nohz_full=1,2

So there is kswapd/2 that is affine to NODE 2 and thus CPU 2 for now.

Now CPU 2 goes offline. The scheduler migrates off all
tasks. select_fallback_rq() for kswapd/2 doesn't find a suitable CPU
to run on, so it affines kswapd/2 to all remaining online CPUs (CPU 0, CPU 1)
(see the "No more Mr. Nice Guy" comment).

But CPU 1 is nohz_full, so kswapd/2 could run on that isolated CPU. Unless we
handle things beforehand, like this patchset does.

And note that adding isolcpus=domain,1,2 or setting 1,2 as an isolated
cpuset partition (like most isolated workloads should do) doesn't help
here. And I'm not sure this last resort scheduler code is the right place
to handle isolated cpumasks.

So it looks necessary, unless I am missing something else?

And that is just the re-affining on CPU down. CPU up needs the mirror
treatment, and it must also handle new CPUs freshly added to a node.

Thanks.

> -- 
> Michal Hocko
> SUSE Labs
>
Michal Hocko Sept. 17, 2024, 11:07 a.m. UTC | #6
On Tue 17-09-24 12:34:51, Frederic Weisbecker wrote:
> On Tue, Sep 17, 2024 at 08:26:49AM +0200, Michal Hocko wrote:
> > On Tue 17-09-24 00:49:16, Frederic Weisbecker wrote:
> > > Kthreads attached to a preferred NUMA node for their task structure
> > > allocation can also be assumed to run preferably within that same node.
> > > 
> > > A more precise affinity is usually specified by calling
> > > kthread_create_on_cpu() or kthread_bind[_mask]() before the first wakeup.
> > > 
> > > For the others, a default affinity to the node is desired and sometimes
> > > implemented with more or less success when it comes to dealing with
> > > hotplug events and nohz_full / CPU Isolation interactions:
> > > 
> > > - kcompactd is affine to its node and handles hotplug but not CPU Isolation
> > > - kswapd is affine to its node and ignores hotplug and CPU Isolation
> > > - A bunch of drivers create their kthreads on a specific node and
> > >   don't take care of affining them any further.
> > > 
> > > Handle that default node affinity preference at the generic level
> > > instead, provided a kthread is created on an actual node and doesn't
> > > apply any specific affinity such as a given CPU or a custom cpumask to
> > > bind to before its first wake-up.
> > 
> > Makes sense.
> > 
> > > This generic handling is aware of CPU hotplug events and CPU isolation
> > > such that:
> > > 
> > > * When a housekeeping CPU goes up and is part of the node of a given
> > >   kthread, it is added to the kthread's applied affinity set (and the
> > >   default last-resort online housekeeping set is possibly removed from
> > >   that set).
> > > 
> > > * When a housekeeping CPU goes down while part of the node of a kthread,
> > >   it is removed from the kthread's applied affinity. As a last resort,
> > >   the kthread is affined to all online housekeeping CPUs.
> > 
> > But I am not really sure about this part. Sure it makes sense to set the
> > affinity to exclude isolated CPUs but why do we care about hotplug
> > events at all. Let's say we offline all cpus from a given node (or
> > that all but isolated cpus are offline - is this even a
> > realistic/reasonable usecase?). Wouldn't the scheduler ignore the kthread's
> > affinity in such a case? In other words how is that different from
> > tasksetting a userspace task to a cpu that goes offline? We still do
> > allow such a task to run, right? We just do not care about affinity
> > anymore.
> 
> Suppose we have this artificial online set:
> 
> NODE 0 -> CPU 0
> NODE 1 -> CPU 1
> NODE 2 -> CPU 2
> 
> And we have nohz_full=1,2
> 
> So there is kswapd/2 that is affine to NODE 2 and thus CPU 2 for now.
> 
> Now CPU 2 goes offline. The scheduler migrates off all
> tasks. select_fallback_rq() for kswapd/2 doesn't find a suitable CPU
> to run on, so it affines kswapd/2 to all remaining online CPUs (CPU 0, CPU 1)
> (see the "No more Mr. Nice Guy" comment).
> 
> But CPU 1 is nohz_full, so kswapd/2 could run on that isolated CPU. Unless we
> handle things beforehand, like this patchset does.

But that is just as broken as before, no? CPU2 is isolated as well, so it
doesn't really make much of a difference.

> And note that adding isolcpus=domain,1,2 or setting 1,2 as an isolated
> cpuset partition (like most isolated workloads should do) doesn't help
> here. And I'm not sure this last resort scheduler code is the right place
> to handle isolated cpumasks.

Well, we would have the same situation with userspace tasks, no? Say I
have taskset -p 2 (because I want binding to node2) and that CPU2 goes
offline. The task needs to be moved somewhere. And it would be the last
resort logic to do that unless I am missing anything. Why should kernel
threads be any different?

> So it looks necessary, unless I am missing something else?

I am not objecting to the patch per se. I am just not sure this is really
needed. It is great to have kernel threads bound to non-isolated cpus by
default if they have node preferences. But as soon as somebody starts
offlining cpus excessively and makes the initial cpumask empty, then
select_fallback_rq sounds like the right thing to do.

Not my call though. I was just curious why this is needed and it seems
to me you are looking for some sort of correctness for broken setups.
Frederic Weisbecker Sept. 18, 2024, 9:37 a.m. UTC | #7
On Tue, Sep 17, 2024 at 01:07:25PM +0200, Michal Hocko wrote:
> On Tue 17-09-24 12:34:51, Frederic Weisbecker wrote:
> > On Tue, Sep 17, 2024 at 08:26:49AM +0200, Michal Hocko wrote:
> > > On Tue 17-09-24 00:49:16, Frederic Weisbecker wrote:
> > > > Kthreads attached to a preferred NUMA node for their task structure
> > > > allocation can also be assumed to run preferably within that same node.
> > > > 
> > > > A more precise affinity is usually specified by calling
> > > > kthread_create_on_cpu() or kthread_bind[_mask]() before the first wakeup.
> > > > 
> > > > For the others, a default affinity to the node is desired and sometimes
> > > > implemented with more or less success when it comes to dealing with
> > > > hotplug events and nohz_full / CPU Isolation interactions:
> > > > 
> > > > - kcompactd is affine to its node and handles hotplug but not CPU Isolation
> > > > - kswapd is affine to its node and ignores hotplug and CPU Isolation
> > > > - A bunch of drivers create their kthreads on a specific node and
> > > >   don't take care of affining them any further.
> > > > 
> > > > Handle that default node affinity preference at the generic level
> > > > instead, provided a kthread is created on an actual node and doesn't
> > > > apply any specific affinity such as a given CPU or a custom cpumask to
> > > > bind to before its first wake-up.
> > > 
> > > Makes sense.
> > > 
> > > > This generic handling is aware of CPU hotplug events and CPU isolation
> > > > such that:
> > > > 
> > > > * When a housekeeping CPU goes up and is part of the node of a given
> > > >   kthread, it is added to the kthread's applied affinity set (and the
> > > >   default last-resort online housekeeping set is possibly removed from
> > > >   that set).
> > > > 
> > > > * When a housekeeping CPU goes down while part of the node of a kthread,
> > > >   it is removed from the kthread's applied affinity. As a last resort,
> > > >   the kthread is affined to all online housekeeping CPUs.
> > > 
> > > But I am not really sure about this part. Sure it makes sense to set the
> > > affinity to exclude isolated CPUs but why do we care about hotplug
> > > events at all. Let's say we offline all cpus from a given node (or
> > > that all but isolated cpus are offline - is this even a
> > > realistic/reasonable usecase?). Wouldn't the scheduler ignore the kthread's
> > > affinity in such a case? In other words how is that different from
> > > tasksetting a userspace task to a cpu that goes offline? We still do
> > > allow such a task to run, right? We just do not care about affinity
> > > anymore.
> > 
> > Suppose we have this artificial online set:
> > 
> > NODE 0 -> CPU 0
> > NODE 1 -> CPU 1
> > NODE 2 -> CPU 2
> > 
> > And we have nohz_full=1,2
> > 
> > So there is kswapd/2 that is affine to NODE 2 and thus CPU 2 for now.
> > 
> > Now CPU 2 goes offline. The scheduler migrates off all
> > tasks. select_fallback_rq() for kswapd/2 doesn't find a suitable CPU
> > to run on, so it affines kswapd/2 to all remaining online CPUs (CPU 0, CPU 1)
> > (see the "No more Mr. Nice Guy" comment).
> > 
> > But CPU 1 is nohz_full, so kswapd/2 could run on that isolated CPU. Unless we
> > handle things beforehand, like this patchset does.
> 
> But that is just as broken as before, no? CPU2 is isolated as well, so it
> doesn't really make much of a difference.

Right. I should correct my example to use nohz_full=1 only.

> 
> > And note that adding isolcpus=domain,1,2 or setting 1,2 as an isolated
> > cpuset partition (like most isolated workloads should do) doesn't help
> > here. And I'm not sure this last resort scheduler code is the right place
> > to handle isolated cpumasks.
> 
> Well, we would have the same situation with userspace tasks, no? Say I
> have taskset -p 2 (because I want binding to node2) and that CPU2 goes
> offline. The task needs to be moved somewhere. And it would be the last
> resort logic to do that unless I am missing anything. Why should kernel
> threads be any different?

Good point.

> 
> > So it looks necessary, unless I am missing something else?
> 
> I am not objecting to the patch per se. I am just not sure this is really
> needed. It is great to have kernel threads bound to non-isolated cpus by
> default if they have node preferences. But as soon as somebody starts
> offlining cpus excessively and makes the initial cpumask empty, then
> select_fallback_rq sounds like the right thing to do.
> 
> Not my call though. I was just curious why this is needed and it seems
> to me you are looking for some sort of correctness for broken setups.

It looks like it makes sense to explore that path. We still need the
cpu up probe to reaffine when a suitable target comes up. But it seems
the CPU down part can be handled by select_fallback_rq. I'll try that.

Thanks.

> -- 
> Michal Hocko
> SUSE Labs
>
Michal Hocko Sept. 18, 2024, 11:17 a.m. UTC | #8
On Wed 18-09-24 11:37:42, Frederic Weisbecker wrote:
> On Tue, Sep 17, 2024 at 01:07:25PM +0200, Michal Hocko wrote:
[...]
> > I am not objecting to the patch per se. I am just not sure this is really
> > needed. It is great to have kernel threads bound to non-isolated cpus by
> > default if they have node preferences. But as soon as somebody starts
> > offlining cpus excessively and makes the initial cpumask empty, then
> > select_fallback_rq sounds like the right thing to do.
> > 
> > Not my call though. I was just curious why this is needed and it seems
> > to me you are looking for some sort of correctness for broken setups.
> 
> It looks like it makes sense to explore that path. We still need the
> cpu up probe to reaffine when a suitable target comes up. But it seems
> the CPU down part can be handled by select_fallback_rq. I'll try that.

Thanks! Btw., when you are looking at this, would it make sense to make
select_fallback_rq more CPU-isolation aware as well? I mean using
housekeeping cpus before falling back to task_cpu_possible_mask?
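
A rough sketch of that idea, purely illustrative (the helper name is made
up, and this is not the scheduler's actual select_fallback_rq() code):

/*
 * Illustrative only: prefer an online housekeeping CPU before the
 * "No more Mr. Nice Guy" fallback to task_cpu_possible_mask().
 */
static int fallback_cpu_housekeeping_first(struct task_struct *p)
{
	int cpu;

	cpu = cpumask_first_and(housekeeping_cpumask(HK_TYPE_KTHREAD),
				cpu_online_mask);
	if (cpu < nr_cpu_ids)
		return cpu;

	/* No housekeeping CPU left online: fall back as before */
	return cpumask_any_and(task_cpu_possible_mask(p), cpu_online_mask);
}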

Patch

diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 9316c39260e0..89d852538b72 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -240,6 +240,7 @@  enum cpuhp_state {
 	CPUHP_AP_WORKQUEUE_ONLINE,
 	CPUHP_AP_RANDOM_ONLINE,
 	CPUHP_AP_RCUTREE_ONLINE,
+	CPUHP_AP_KTHREADS_ONLINE,
 	CPUHP_AP_BASE_CACHEINFO_ONLINE,
 	CPUHP_AP_ONLINE_DYN,
 	CPUHP_AP_ONLINE_DYN_END		= CPUHP_AP_ONLINE_DYN + 40,
diff --git a/kernel/kthread.c b/kernel/kthread.c
index ecb719f54f7a..eee5925e7725 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -35,6 +35,10 @@  static DEFINE_SPINLOCK(kthread_create_lock);
 static LIST_HEAD(kthread_create_list);
 struct task_struct *kthreadd_task;
 
+static struct cpumask kthread_online_mask;
+static LIST_HEAD(kthreads_hotplug);
+static DEFINE_MUTEX(kthreads_hotplug_lock);
+
 struct kthread_create_info
 {
 	/* Information passed to kthread() from kthreadd. */
@@ -53,6 +57,7 @@  struct kthread_create_info
 struct kthread {
 	unsigned long flags;
 	unsigned int cpu;
+	unsigned int node;
 	int started;
 	int result;
 	int (*threadfn)(void *);
@@ -64,6 +69,8 @@  struct kthread {
 #endif
 	/* To store the full name if task comm is truncated. */
 	char *full_name;
+	struct task_struct *task;
+	struct list_head hotplug_node;
 };
 
 enum KTHREAD_BITS {
@@ -122,8 +129,11 @@  bool set_kthread_struct(struct task_struct *p)
 
 	init_completion(&kthread->exited);
 	init_completion(&kthread->parked);
+	INIT_LIST_HEAD(&kthread->hotplug_node);
 	p->vfork_done = &kthread->exited;
 
+	kthread->task = p;
+	kthread->node = tsk_fork_get_node(current);
 	p->worker_private = kthread;
 	return true;
 }
@@ -314,6 +324,13 @@  void __noreturn kthread_exit(long result)
 {
 	struct kthread *kthread = to_kthread(current);
 	kthread->result = result;
+	if (!list_empty(&kthread->hotplug_node)) {
+		mutex_lock(&kthreads_hotplug_lock);
+		list_del(&kthread->hotplug_node);
+		/* Make sure the kthread never gets re-affined globally */
+		set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_TYPE_KTHREAD));
+		mutex_unlock(&kthreads_hotplug_lock);
+	}
 	do_exit(0);
 }
 EXPORT_SYMBOL(kthread_exit);
@@ -339,6 +356,45 @@  void __noreturn kthread_complete_and_exit(struct completion *comp, long code)
 }
 EXPORT_SYMBOL(kthread_complete_and_exit);
 
+static void kthread_fetch_affinity(struct kthread *k, struct cpumask *mask)
+{
+	if (k->node == NUMA_NO_NODE) {
+		cpumask_copy(mask, housekeeping_cpumask(HK_TYPE_KTHREAD));
+	} else {
+		/*
+		 * The node cpumask is racy when read from kthread() but:
+		 * - a racing CPU going down won't be present in kthread_online_mask
+		 * - a racing CPU going up will be handled by kthreads_online_cpu()
+		 */
+		cpumask_and(mask, cpumask_of_node(k->node), &kthread_online_mask);
+		cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_KTHREAD));
+		if (cpumask_empty(mask))
+			cpumask_copy(mask, housekeeping_cpumask(HK_TYPE_KTHREAD));
+	}
+}
+
+static int kthread_affine_node(void)
+{
+	struct kthread *kthread = to_kthread(current);
+	cpumask_var_t affinity;
+
+	WARN_ON_ONCE(kthread_is_per_cpu(current));
+
+	if (!zalloc_cpumask_var(&affinity, GFP_KERNEL))
+		return -ENOMEM;
+
+	mutex_lock(&kthreads_hotplug_lock);
+	WARN_ON_ONCE(!list_empty(&kthread->hotplug_node));
+	list_add_tail(&kthread->hotplug_node, &kthreads_hotplug);
+	kthread_fetch_affinity(kthread, affinity);
+	set_cpus_allowed_ptr(current, affinity);
+	mutex_unlock(&kthreads_hotplug_lock);
+
+	free_cpumask_var(affinity);
+
+	return 0;
+}
+
 static int kthread(void *_create)
 {
 	static const struct sched_param param = { .sched_priority = 0 };
@@ -369,7 +425,6 @@  static int kthread(void *_create)
 	 * back to default in case they have been changed.
 	 */
 	sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
-	set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_TYPE_KTHREAD));
 
 	/* OK, tell user we're spawned, wait for stop or wakeup */
 	__set_current_state(TASK_UNINTERRUPTIBLE);
@@ -385,6 +440,9 @@  static int kthread(void *_create)
 
 	self->started = 1;
 
+	if (!(current->flags & PF_NO_SETAFFINITY))
+		kthread_affine_node();
+
 	ret = -EINTR;
 	if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) {
 		cgroup_kthread_ready();
@@ -779,6 +837,66 @@  int kthreadd(void *unused)
 	return 0;
 }
 
+static int kthreads_hotplug_update(void)
+{
+	cpumask_var_t affinity;
+	struct kthread *k;
+	int err;
+
+	if (list_empty(&kthreads_hotplug))
+		return 0;
+
+	if (!zalloc_cpumask_var(&affinity, GFP_KERNEL))
+		return -ENOMEM;
+
+	err = 0;
+
+	list_for_each_entry(k, &kthreads_hotplug, hotplug_node) {
+		if (WARN_ON_ONCE((k->task->flags & PF_NO_SETAFFINITY) ||
+				 kthread_is_per_cpu(k->task))) {
+			err = -EINVAL;
+			continue;
+		}
+		kthread_fetch_affinity(k, affinity);
+		set_cpus_allowed_ptr(k->task, affinity);
+	}
+
+	free_cpumask_var(affinity);
+
+	return err;
+}
+
+static int kthreads_offline_cpu(unsigned int cpu)
+{
+	int ret = 0;
+
+	mutex_lock(&kthreads_hotplug_lock);
+	cpumask_clear_cpu(cpu, &kthread_online_mask);
+	ret = kthreads_hotplug_update();
+	mutex_unlock(&kthreads_hotplug_lock);
+
+	return ret;
+}
+
+static int kthreads_online_cpu(unsigned int cpu)
+{
+	int ret = 0;
+
+	mutex_lock(&kthreads_hotplug_lock);
+	cpumask_set_cpu(cpu, &kthread_online_mask);
+	ret = kthreads_hotplug_update();
+	mutex_unlock(&kthreads_hotplug_lock);
+
+	return ret;
+}
+
+static int kthreads_init(void)
+{
+	return cpuhp_setup_state(CPUHP_AP_KTHREADS_ONLINE, "kthreads:online",
+				kthreads_online_cpu, kthreads_offline_cpu);
+}
+early_initcall(kthreads_init);
+
 void __kthread_init_worker(struct kthread_worker *worker,
 				const char *name,
 				struct lock_class_key *key)