
[RFC] autonuma: Support to scan page table asynchronously

Message ID 20200414081951.297676-1-ying.huang@intel.com (mailing list archive)
State New, archived
Series [RFC] autonuma: Support to scan page table asynchronously

Commit Message

Huang, Ying April 14, 2020, 8:19 a.m. UTC
In the current AutoNUMA implementation, the page tables of the
processes are scanned periodically to trigger NUMA hint page faults.
The scanning runs in the context of the processes, so it delays their
execution.  In a test running the pmbench memory accessing benchmark
with 64 threads on a 2-socket server machine with 104 logical CPUs
and 256 GB memory, there are more than 20000 latency outliers above
1 ms in a 3600s run.  These outliers are almost all caused by the
AutoNUMA page table scanning: they almost all disappear after applying
this patch to scan the page tables asynchronously.

Because there are idle CPUs in the system, the asynchronous page table
scanning can run on these idle CPUs instead of the CPUs the workload
is running on.

So on a system with enough idle CPU time, it's better to scan the page
tables asynchronously to take full advantage of that idle time.
Another scenario that can benefit from this is to scan the page tables
on some service CPUs of the socket, so that the real workload can run
on the isolated CPUs without the latency outliers caused by the page
table scanning.

But scanning the page tables asynchronously isn't perfect either.  For
example, on a system without enough idle CPU time, the CPU time isn't
scheduled fairly, because the page table scanning is charged to the
workqueue thread instead of the process/thread it works for.  And
although the page tables are scanned on behalf of the target process,
the scan may run on a CPU that is not in the cpuset of the target
process.

One possible solution is to let the system administrator choose the
better behavior for the system via a sysctl knob (implemented in this
patch).  But that isn't perfect either, because every user space knob
adds maintenance overhead.

A better solution may be to back-charge the CPU time spent scanning
the page tables to the process/thread, and to find a way to run the
work in the proper cpuset.  After some googling, I found there's some
discussion about this in the following thread,

https://lkml.org/lkml/2019/6/13/1321

So this patch may not be ready to be merged upstream yet.  It
quantifies the latency outliers caused by the page table scanning in
AutoNUMA, and it provides a possible way to resolve the issue for
users who care about it.  It is also a potential customer of the work
on the cgroup-aware workqueue or other asynchronous execution
mechanisms.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Aubrey Li <aubrey.li@intel.com>
---
 include/linux/sched.h        |  1 +
 include/linux/sched/sysctl.h |  1 +
 kernel/exit.c                |  1 +
 kernel/sched/fair.c          | 55 +++++++++++++++++++++++++++---------
 kernel/sysctl.c              |  9 ++++++
 5 files changed, 53 insertions(+), 14 deletions(-)

Comments

Mel Gorman April 14, 2020, 12:06 p.m. UTC | #1
On Tue, Apr 14, 2020 at 04:19:51PM +0800, Huang Ying wrote:
> In current AutoNUMA implementation, the page tables of the processes
> are scanned periodically to trigger the NUMA hint page faults.  The
> scanning runs in the context of the processes, so will delay the
> running of the processes.  In a test with 64 threads pmbench memory
> accessing benchmark on a 2-socket server machine with 104 logical CPUs
> and 256 GB memory, there are more than 20000 latency outliers that are
> > 1 ms in 3600s run time.  These latency outliers are almost all
> caused by the AutoNUMA page table scanning.  Because they almost all
> disappear after applying this patch to scan the page tables
> asynchronously.
> 
> Because there are idle CPUs in system, the asynchronous running page
> table scanning code can run on these idle CPUs instead of the CPUs the
> workload is running on.
> 
> So on system with enough idle CPU time, it's better to scan the page
> tables asynchronously to take full advantages of these idle CPU time.
> Another scenario which can benefit from this is to scan the page
> tables on some service CPUs of the socket, so that the real workload
> can run on the isolated CPUs without the latency outliers caused by
> the page table scanning.
> 
> But it's not perfect to scan page tables asynchronously too.  For
> example, on system without enough idle CPU time, the CPU time isn't
> scheduled fairly because the page table scanning is charged to the
> workqueue thread instead of the process/thread it works for.  And
> although the page tables are scanned for the target process, it may
> run on a CPU that is not in the cpuset of the target process.
> 
> One possible solution is to let the system administrator to choose the
> better behavior for the system via a sysctl knob (implemented in the
> patch).  But it's not perfect too.  Because every user space knob adds
> maintenance overhead.
> 
> A better solution may be to back-charge the CPU time to scan the page
> tables to the process/thread, and find a way to run the work on the
> proper cpuset.  After some googling, I found there's some discussion
> about this as in the following thread,
> 
> https://lkml.org/lkml/2019/6/13/1321
> 
> So this patch may be not ready to be merged by upstream yet.  It
> quantizes the latency outliers caused by the page table scanning in
> AutoNUMA.  And it provides a possible way to resolve the issue for
> users who cares about it.  And it is a potential customer of the work
> related to the cgroup-aware workqueue or other asynchronous execution
> mechanisms.
> 

The caveats you list are the important ones and the reason why it was
not done asynchronously. In an earlier implementation all the work was
done by a dedicated thread, and that was ultimately abandoned.

There is no guarantee there is an idle CPU available, let alone one that
is local to the thread that should be doing the scanning. Even if there
is, it potentially prevents another task from scheduling on an idle CPU,
and similarly other workqueue tasks may be delayed waiting on the
scanner. Hiding the cost is also problematic because the CPU time gets
mixed in with other unrelated workqueues. It also has the potential
to mask bugs. Let's say for example there is a bug whereby a task is
scanning excessively; that can be easily missed when the work is done by
a workqueue.

While it's just an opinion, my preference would be to focus on reducing
the cost and amount of scanning done -- particularly for threads. For
example, all threads operate on the same address space, so there can be
significant overlap where threads are potentially scanning the same
areas, or regions that the thread has no interest in. One option would
be to track the highest and lowest pages accessed and only scan within
that window. The tricky part is that library pages may create very wide
windows that render the tracking useless, but it could at least be
investigated.
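
For illustration only, here is a very rough sketch of that access-window
idea; the struct and helper names are invented, and nothing below comes
from an actual implementation:

#include <linux/kernel.h>	/* min()/max() */
#include <asm/page.h>

/* Hypothetical per-mm window shrink-wrapped by NUMA hinting faults. */
struct numa_scan_window {
	unsigned long low;	/* lowest faulting address seen */
	unsigned long high;	/* highest faulting address seen */
};

/* Would be called from the NUMA hinting fault path for address @addr. */
static void numa_window_record(struct numa_scan_window *w, unsigned long addr)
{
	w->low = min(w->low, addr);
	w->high = max(w->high, addr + PAGE_SIZE);
}

/* Would clamp the range picked by task_numa_work() before the VMA walk. */
static void numa_window_clamp(struct numa_scan_window *w,
			      unsigned long *start, unsigned long *end)
{
	*start = max(*start, w->low);
	*end = min(*end, w->high);
}

As noted above, a single widely mapped library is enough to blow the
window up to most of the address space, which is the caveat Mel raises.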
Huang, Ying April 15, 2020, 8:14 a.m. UTC | #2
Mel Gorman <mgorman@techsingularity.net> writes:

> On Tue, Apr 14, 2020 at 04:19:51PM +0800, Huang Ying wrote:
>> In current AutoNUMA implementation, the page tables of the processes
>> are scanned periodically to trigger the NUMA hint page faults.  The
>> scanning runs in the context of the processes, so will delay the
>> running of the processes.  In a test with 64 threads pmbench memory
>> accessing benchmark on a 2-socket server machine with 104 logical CPUs
>> and 256 GB memory, there are more than 20000 latency outliers that are
>> > 1 ms in 3600s run time.  These latency outliers are almost all
>> caused by the AutoNUMA page table scanning.  Because they almost all
>> disappear after applying this patch to scan the page tables
>> asynchronously.
>> 
>> Because there are idle CPUs in system, the asynchronous running page
>> table scanning code can run on these idle CPUs instead of the CPUs the
>> workload is running on.
>> 
>> So on system with enough idle CPU time, it's better to scan the page
>> tables asynchronously to take full advantages of these idle CPU time.
>> Another scenario which can benefit from this is to scan the page
>> tables on some service CPUs of the socket, so that the real workload
>> can run on the isolated CPUs without the latency outliers caused by
>> the page table scanning.
>> 
>> But it's not perfect to scan page tables asynchronously too.  For
>> example, on system without enough idle CPU time, the CPU time isn't
>> scheduled fairly because the page table scanning is charged to the
>> workqueue thread instead of the process/thread it works for.  And
>> although the page tables are scanned for the target process, it may
>> run on a CPU that is not in the cpuset of the target process.
>> 
>> One possible solution is to let the system administrator to choose the
>> better behavior for the system via a sysctl knob (implemented in the
>> patch).  But it's not perfect too.  Because every user space knob adds
>> maintenance overhead.
>> 
>> A better solution may be to back-charge the CPU time to scan the page
>> tables to the process/thread, and find a way to run the work on the
>> proper cpuset.  After some googling, I found there's some discussion
>> about this as in the following thread,
>> 
>> https://lkml.org/lkml/2019/6/13/1321
>> 
>> So this patch may be not ready to be merged by upstream yet.  It
>> quantizes the latency outliers caused by the page table scanning in
>> AutoNUMA.  And it provides a possible way to resolve the issue for
>> users who cares about it.  And it is a potential customer of the work
>> related to the cgroup-aware workqueue or other asynchronous execution
>> mechanisms.
>> 
>
> The caveats you list are the important ones and the reason why it was
> not done asynchronously. In an earlier implementation all the work was
> done by a dedicated thread and ultimately abandoned.
>
> There is no guarantee there is an idle CPU available and one that is
> local to the thread that should be doing the scanning. Even if there is,
> it potentially prevents another task from scheduling on an idle CPU and
> similarly other workqueue tasks may be delayed waiting on the scanner. The
> hiding of the cost is also problematic because the CPU cost is hidden
> and mixed with other unrelated workqueues. It also has the potential
> to mask bugs. Lets say for example there is a bug whereby a task is
> scanning excessively, that can be easily missed when the work is done by
> a workqueue.

Do you think something like a cgroup-aware workqueue is a solution worth
trying when it becomes available?  It would not hide the scanning cost,
because the CPU time would be charged to the original cgroup or task.
Although other tasks may still be disturbed, cgroups can provide some
kind of management via cpusets.

> While it's just an opinion, my preference would be to focus on reducing
> the cost and amount of scanning done -- particularly for threads. For
> example, all threads operate on the same address space but there can be
> significant overlap where all threads are potentially scanning the same
> areas or regions that the thread has no interest in. One option would be
> to track the highest and lowest pages accessed and only scan within
> those regions for example. The tricky part is that library pages may
> create very wide windows that render the tracking useless but it could
> at least be investigated. 

In general, I think it's good to reduce the scanning cost.

Why do you think there will be overlap between the threads of a process?
If my understanding is correct, the threads scan one by one instead of
simultaneously.  And how do we determine whether a VMA needs to be
scanned or not?  For example, only a small portion of the pages in a VMA
may be accessed, but they may be accessed remotely and consume quite
some inter-node bandwidth, so they need to be migrated.

Best Regards,
Huang, Ying
Peter Zijlstra April 15, 2020, 11:32 a.m. UTC | #3
On Tue, Apr 14, 2020 at 01:06:46PM +0100, Mel Gorman wrote:
> While it's just an opinion, my preference would be to focus on reducing
> the cost and amount of scanning done -- particularly for threads.

This; I really don't believe in those back-charging things, esp. since
not having cgroups or having multiple applications in a single cgroup is
a valid setup.

Another way to reduce latency spikes is to decrease both
sysctl_numa_balancing_scan_delay and sysctl_numa_balancing_scan_size.
Then you do more, smaller scans. By scanning more often you reduce the
contrast; by reducing the size you lower the max latency.

And this is all assuming you actually want numa balancing for this
process.
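
To make the size knob concrete, here is a small sketch of how the
per-pass bounds scale with sysctl_numa_balancing_scan_size; it mirrors
the sizing logic at the top of task_numa_work(), and the helper name is
invented:

#include <asm/page.h>

/*
 * Illustration only: per-pass scan bounds as a function of the scan_size
 * sysctl (in MB), mirroring the computation in task_numa_work().
 */
static void numa_scan_bounds(unsigned int scan_size_mb,
			     long *pages, long *virtpages)
{
	*pages = (long)scan_size_mb << (20 - PAGE_SHIFT); /* PTEs updated */
	*virtpages = *pages * 8;  /* max virtual address space walked */
}

With the default 256 MB and 4 KB pages that is 65536 PTE updates and up
to 2 GB of virtual address space per pass; shrinking scan_size cuts both
bounds proportionally, while shrinking the delay/period keeps the total
coverage per unit time roughly the same.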
Huang, Ying April 16, 2020, 1:24 a.m. UTC | #4
Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Apr 14, 2020 at 01:06:46PM +0100, Mel Gorman wrote:
>> While it's just an opinion, my preference would be to focus on reducing
>> the cost and amount of scanning done -- particularly for threads.
>
> This; I really don't believe in those back-charging things, esp. since
> not having cgroups or having multiple applications in a single cgroup is
> a valid setup.

Technically, it appears possible to back-charge the CPU time to the
process/thread directly (not the cgroup).

> Another way to reduce latency spikes is to decrease both
> sysctl_numa_balancing_scan_delay and sysctl_numa_balancing_scan_size.
> Then you do more smaller scans. By scanning more often you reduce the
> contrast, by reducing the size you lower the max latency.

Yes.  This can reduce latency spikes.

Best Regards,
Huang, Ying

> And this is all assuming you actually want numa balancing for this
> process.
Peter Zijlstra April 17, 2020, 10:04 a.m. UTC | #5
On Fri, Apr 17, 2020 at 09:05:08AM +0200, SeongJae Park wrote:
> I think the main idea of DAMON[1] might be able to applied here.  Have you
> considered it?
> 
> [1] https://lore.kernel.org/linux-mm/20200406130938.14066-1-sjpark@amazon.com/

I've ignored that entire thing after you said the information it
provides was already available through the PMU.
Peter Zijlstra April 17, 2020, 10:06 a.m. UTC | #6
On Thu, Apr 16, 2020 at 09:24:35AM +0800, Huang, Ying wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > On Tue, Apr 14, 2020 at 01:06:46PM +0100, Mel Gorman wrote:
> >> While it's just an opinion, my preference would be to focus on reducing
> >> the cost and amount of scanning done -- particularly for threads.
> >
> > This; I really don't believe in those back-charging things, esp. since
> > not having cgroups or having multiple applications in a single cgroup is
> > a valid setup.
> 
> Technically, it appears possible to back-charge the CPU time to the
> process/thread directly (not the cgroup).

I've yet to see a sane proposal there. What we're not going to do is
make regular task accounting more expensive than it already is.
Mel Gorman April 17, 2020, 12:16 p.m. UTC | #7
On Fri, Apr 17, 2020 at 12:21:29PM +0200, SeongJae Park wrote:
> On Fri, 17 Apr 2020 12:04:17 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Fri, Apr 17, 2020 at 09:05:08AM +0200, SeongJae Park wrote:
> > > I think the main idea of DAMON[1] might be able to applied here.  Have you
> > > considered it?
> > > 
> > > [1] https://lore.kernel.org/linux-mm/20200406130938.14066-1-sjpark@amazon.com/
> > 
> > I've ignored that entire thing after you said the information it
> > provides was already available through the PMU.
> 
> Sorry if my answer made you confused.  What I wanted to say was that the
> fundamental access checking mechanism that DAMON depends on is PTE Accessed bit
> for now, but it could be modified to use PMU or other features instead. 

I would not be inclined to lean towards either approach for NUMA
balancing. Fiddling with the accessed bit can have consequences for page
aging and residency -- fine for debugging a problem, not so fine for
normal usage. I would expect the PMU approach to have high overhead as
well as taking over a PMU counter that userspace debugging may expect
to be available.
Peter Zijlstra April 17, 2020, 12:21 p.m. UTC | #8
On Fri, Apr 17, 2020 at 01:16:29PM +0100, Mel Gorman wrote:
> On Fri, Apr 17, 2020 at 12:21:29PM +0200, SeongJae Park wrote:
> > On Fri, 17 Apr 2020 12:04:17 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > On Fri, Apr 17, 2020 at 09:05:08AM +0200, SeongJae Park wrote:
> > > > I think the main idea of DAMON[1] might be able to applied here.  Have you
> > > > considered it?
> > > > 
> > > > [1] https://lore.kernel.org/linux-mm/20200406130938.14066-1-sjpark@amazon.com/
> > > 
> > > I've ignored that entire thing after you said the information it
> > > provides was already available through the PMU.
> > 
> > Sorry if my answer made you confused.  What I wanted to say was that the
> > fundamental access checking mechanism that DAMON depends on is PTE Accessed bit
> > for now, but it could be modified to use PMU or other features instead. 
> 
> I would not be inclined to lean towards either approach for NUMA
> balancing. Fiddling with the accessed bit can have consequences for page
> aging and residency -- fine for debugging a problem, not to fine for
> normal usage. I would expect the PMU approach would have high overhead
> as well as taking over a PMU counter that userspace debugging may expect
> to be available.

Oh, quite agreed; I was just saying I never saw the use of that whole
DAMON thing. AFAICT it's not actually solving a problem, just making
more.
Mel Gorman April 17, 2020, 2:46 p.m. UTC | #9
On Fri, Apr 17, 2020 at 02:44:37PM +0200, SeongJae Park wrote:
> On Fri, 17 Apr 2020 13:16:29 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > On Fri, Apr 17, 2020 at 12:21:29PM +0200, SeongJae Park wrote:
> > > On Fri, 17 Apr 2020 12:04:17 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> > > 
> > > > On Fri, Apr 17, 2020 at 09:05:08AM +0200, SeongJae Park wrote:
> > > > > I think the main idea of DAMON[1] might be able to applied here.  Have you
> > > > > considered it?
> > > > > 
> > > > > [1] https://lore.kernel.org/linux-mm/20200406130938.14066-1-sjpark@amazon.com/
> > > > 
> > > > I've ignored that entire thing after you said the information it
> > > > provides was already available through the PMU.
> > > 
> > > Sorry if my answer made you confused.  What I wanted to say was that the
> > > fundamental access checking mechanism that DAMON depends on is PTE Accessed bit
> > > for now, but it could be modified to use PMU or other features instead. 
> > 
> > I would not be inclined to lean towards either approach for NUMA
> > balancing. Fiddling with the accessed bit can have consequences for page
> > aging and residency -- fine for debugging a problem, not to fine for
> > normal usage. I would expect the PMU approach would have high overhead
> > as well as taking over a PMU counter that userspace debugging may expect
> > to be available.
> 
> Good point.  But, isn't it ok to use Accessed bit as long as PG_Idle and
> PG_Young is adjusted properly?  Current DAMON implementation does so, as
> idle_page_tracking also does.
> 

PG_idle and PG_young are page flags that are used by idle tracking. The
accessed bit I'm thinking of is the PTE bit, and it's the PTE that is
changed by page_referenced() during reclaim. So it's not clear how
adjusting the page flags helps avoid page aging problems during reclaim.

Maybe your suggestion was to move NUMA balancing to use the PG_idle and
PG_young bits from idle tracking, but I'm not sure what that gains us.
This may be because I did not look too closely at DAMON since, for
debugging and analysis, the PMU sounded more suitable.

It's not clear to me how looking at the page or pte bit handling of
DAMON helps us reduce the scanning cost of numa balancing. There may
be some benefit in splitting the address space and rescanning sections
that were referenced but it has similar pitfalls to simply tracking the
highest/lowest address that trapped a NUMA hinting fault.
Huang, Ying April 20, 2020, 3:26 a.m. UTC | #10
Peter Zijlstra <peterz@infradead.org> writes:

> On Thu, Apr 16, 2020 at 09:24:35AM +0800, Huang, Ying wrote:
>> Peter Zijlstra <peterz@infradead.org> writes:
>> 
>> > On Tue, Apr 14, 2020 at 01:06:46PM +0100, Mel Gorman wrote:
>> >> While it's just an opinion, my preference would be to focus on reducing
>> >> the cost and amount of scanning done -- particularly for threads.
>> >
>> > This; I really don't believe in those back-charging things, esp. since
>> > not having cgroups or having multiple applications in a single cgroup is
>> > a valid setup.
>> 
>> Technically, it appears possible to back-charge the CPU time to the
>> process/thread directly (not the cgroup).
>
> I've yet to see a sane proposal there. What we're not going to do is
> make regular task accounting more expensive than it already is.

Yes.  There's overhead in back-charging.  To reduce the overhead,
instead of back-charging immediately, we can

- Add one field to task_struct, say backcharge_time, to track the
  delayed back-charged CPU time.

- When the work item completes its work, add the CPU time it spent to
  task_struct->backcharge_time atomically.

- When the task accounts its CPU time regularly, e.g. in
  scheduler_tick(), take task_struct->backcharge_time into account too.

Although this cannot eliminate the overhead, it can reduce it.  Do you
think this is acceptable or not?
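
For illustration only, a minimal sketch of the deferred back-charge
idea; the field and helper names are hypothetical and this is not part
of the patch:

#include <linux/atomic.h>
#include <linux/sched.h>
#include <linux/sched/clock.h>
#include <linux/workqueue.h>

/* Hypothetical extra field in struct task_struct: */
/*	atomic64_t numa_backcharge_ns; */

/* Variant of the patch's numa_async_work() that records its own runtime. */
static void numa_async_work(struct work_struct *work)
{
	struct task_struct *p = container_of(work, struct task_struct,
					     numa_async_work);
	u64 start = local_clock();

	numa_scan(p);

	/* Defer the charge instead of touching accounting from the worker. */
	atomic64_add(local_clock() - start, &p->numa_backcharge_ns);
}

/*
 * Called from the task's own accounting path, e.g. task_tick_numa().
 * Returns the nanoseconds to charge; the caller could, for instance,
 * advance p->node_stamp the way numa_scan() already does for its own
 * runtime.
 */
static u64 numa_fold_backcharge(struct task_struct *p)
{
	return atomic64_xchg(&p->numa_backcharge_ns, 0);
}

The task's hot accounting path then pays only one atomic exchange per
tick for the deferred charge.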

Best Regards,
Huang, Ying

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04278493bf15..433f77436ed4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1087,6 +1087,7 @@  struct task_struct {
 	u64				last_task_numa_placement;
 	u64				last_sum_exec_runtime;
 	struct callback_head		numa_work;
+	struct work_struct		numa_async_work;
 
 	/*
 	 * This pointer is only modified for current in syscall and
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index d4f6215ee03f..ea595ae9e573 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -37,6 +37,7 @@  extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;
+extern unsigned int sysctl_numa_balancing_scan_async;
 
 #ifdef CONFIG_SCHED_DEBUG
 extern __read_mostly unsigned int sysctl_sched_migration_cost;
diff --git a/kernel/exit.c b/kernel/exit.c
index 0b81b26a872a..f5e84ff4b788 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -798,6 +798,7 @@  void __noreturn do_exit(long code)
 	if (group_dead)
 		disassociate_ctty(1);
 	exit_task_namespaces(tsk);
+	cancel_work_sync(&tsk->numa_async_work);
 	exit_task_work(tsk);
 	exit_thread(tsk);
 	exit_umh(tsk);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c1217bfe5e81..7d82c6826708 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1064,6 +1064,9 @@  unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+/* Scan asynchronously via work queue */
+unsigned int sysctl_numa_balancing_scan_async;
+
 struct numa_group {
 	refcount_t refcount;
 
@@ -2481,24 +2484,17 @@  static void reset_ptenuma_scan(struct task_struct *p)
 	p->mm->numa_scan_offset = 0;
 }
 
-/*
- * The expensive part of numa migration is done from task_work context.
- * Triggered from task_tick_numa().
- */
-static void task_numa_work(struct callback_head *work)
+static void numa_scan(struct task_struct *p)
 {
 	unsigned long migrate, next_scan, now = jiffies;
-	struct task_struct *p = current;
+	struct task_struct *pc = current;
 	struct mm_struct *mm = p->mm;
-	u64 runtime = p->se.sum_exec_runtime;
+	u64 runtime = pc->se.sum_exec_runtime;
 	struct vm_area_struct *vma;
 	unsigned long start, end;
 	unsigned long nr_pte_updates = 0;
 	long pages, virtpages;
 
-	SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
-
-	work->next = work;
 	/*
 	 * Who cares about NUMA placement when they're dying.
 	 *
@@ -2621,12 +2617,36 @@  static void task_numa_work(struct callback_head *work)
 	 * Usually update_task_scan_period slows down scanning enough; on an
 	 * overloaded system we need to limit overhead on a per task basis.
 	 */
-	if (unlikely(p->se.sum_exec_runtime != runtime)) {
-		u64 diff = p->se.sum_exec_runtime - runtime;
+	if (unlikely(pc->se.sum_exec_runtime != runtime)) {
+		u64 diff = pc->se.sum_exec_runtime - runtime;
 		p->node_stamp += 32 * diff;
 	}
 }
 
+/*
+ * The expensive part of numa migration is done from task_work context.
+ * Triggered from task_tick_numa().
+ */
+static void task_numa_work(struct callback_head *work)
+{
+	struct task_struct *p = current;
+
+	SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
+
+	work->next = work;
+
+	numa_scan(p);
+}
+
+static void numa_async_work(struct work_struct *work)
+{
+	struct task_struct *p;
+
+	p = container_of(work, struct task_struct, numa_async_work);
+
+	numa_scan(p);
+}
+
 void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 {
 	int mm_users = 0;
@@ -2650,6 +2670,7 @@  void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 	p->last_sum_exec_runtime	= 0;
 
 	init_task_work(&p->numa_work, task_numa_work);
+	INIT_WORK(&p->numa_async_work, numa_async_work);
 
 	/* New address space, reset the preferred nid */
 	if (!(clone_flags & CLONE_VM)) {
@@ -2699,8 +2720,14 @@  static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 			curr->numa_scan_period = task_scan_start(curr);
 		curr->node_stamp += period;
 
-		if (!time_before(jiffies, curr->mm->numa_next_scan))
-			task_work_add(curr, work, true);
+		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
+			if (sysctl_numa_balancing_scan_async)
+				queue_work_node(numa_node_id(),
+						system_unbound_wq,
+						&curr->numa_async_work);
+			else
+				task_work_add(curr, work, true);
+		}
 	}
 }
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index ad5b88a53c5a..31a987230c58 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -418,6 +418,15 @@  static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "numa_balancing_scan_async",
+		.data		= &sysctl_numa_balancing_scan_async,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 	{
 		.procname	= "numa_balancing",
 		.data		= NULL, /* filled in by handler */