
[v2,3/3] mm/page_alloc: Remotely drain per-cpu lists

Message ID 20211103170512.2745765-4-nsaenzju@redhat.com (mailing list archive)
State New
Series mm/page_alloc: Remote per-cpu page list drain support

Commit Message

Nicolas Saenz Julienne Nov. 3, 2021, 5:05 p.m. UTC
Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu
drain work queued by __drain_all_pages(). So introduce a new mechanism
to remotely drain the per-cpu lists. It is made possible by remotely
locking the new per-cpu spinlocks in 'struct per_cpu_pages'. A benefit
of this new scheme is that drain operations are now migration safe.

There was no observed performance degradation vs. the previous scheme.
Both netperf and hackbench were run in parallel with triggering the
__drain_all_pages(NULL, true) code path around 100 times per second.
The new scheme performs a bit better (~5%), although the important point
here is that there are no performance regressions vs. the previous
mechanism. Per-cpu list draining happens only in slow paths.

Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
---
 mm/page_alloc.c | 59 +++++--------------------------------------------
 1 file changed, 5 insertions(+), 54 deletions(-)

Comments

Mel Gorman Dec. 3, 2021, 2:13 p.m. UTC | #1
On Wed, Nov 03, 2021 at 06:05:12PM +0100, Nicolas Saenz Julienne wrote:
> Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu
> drain work queued by __drain_all_pages(). So introduce a new mechanism
> to remotely drain the per-cpu lists. It is made possible by remotely
> locking the new per-cpu spinlocks in 'struct per_cpu_pages'. A benefit
> of this new scheme is that drain operations are now migration safe.
> 
> There was no observed performance degradation vs. the previous scheme.
> Both netperf and hackbench were run in parallel with triggering the
> __drain_all_pages(NULL, true) code path around 100 times per second.
> The new scheme performs a bit better (~5%), although the important point
> here is that there are no performance regressions vs. the previous
> mechanism. Per-cpu list draining happens only in slow paths.
> 

netperf and hackbench are not great indicators of page allocator
performance as IIRC they are more slab-intensive than page allocator
intensive. I ran the series through a few benchmarks and can confirm
that there was negligible difference to netperf and hackbench.

However, on the Page Fault Test (pft in mmtests), it is noticeable. On a
2-socket Cascade Lake machine I get

pft timings
                                 5.16.0-rc1             5.16.0-rc1
                                    vanilla    mm-remotedrain-v2r1
Amean     system-1         27.48 (   0.00%)       27.85 *  -1.35%*
Amean     system-4         28.65 (   0.00%)       30.84 *  -7.65%*
Amean     system-7         28.70 (   0.00%)       32.43 * -13.00%*
Amean     system-12        30.33 (   0.00%)       34.21 * -12.80%*
Amean     system-21        37.14 (   0.00%)       41.51 * -11.76%*
Amean     system-30        36.79 (   0.00%)       46.15 * -25.43%*
Amean     system-48        58.95 (   0.00%)       65.28 * -10.73%*
Amean     system-79       111.61 (   0.00%)      114.78 *  -2.84%*
Amean     system-80       113.59 (   0.00%)      116.73 *  -2.77%*
Amean     elapsed-1        32.83 (   0.00%)       33.12 *  -0.88%*
Amean     elapsed-4         8.60 (   0.00%)        9.17 *  -6.66%*
Amean     elapsed-7         4.97 (   0.00%)        5.53 * -11.30%*
Amean     elapsed-12        3.08 (   0.00%)        3.43 * -11.41%*
Amean     elapsed-21        2.19 (   0.00%)        2.41 * -10.06%*
Amean     elapsed-30        1.73 (   0.00%)        2.04 * -17.87%*
Amean     elapsed-48        1.73 (   0.00%)        2.03 * -17.77%*
Amean     elapsed-79        1.61 (   0.00%)        1.64 *  -1.90%*
Amean     elapsed-80        1.60 (   0.00%)        1.64 *  -2.50%*

It's not specific to Cascade Lake; I see regressions of varying size on
different Intel and AMD chips, some better and some worse than this result.
The smallest regression was on a single-CPU Skylake machine, with a 2-6%
hit. The worst was Zen1, with a 3-107% hit.

I didn't profile it to establish why, but in all cases the system CPU
usage was much higher. It *might* be because the spinlock in
per_cpu_pages crosses into a new cache line, and that line might be cold,
although the penalty seems a bit high for that to be the only factor.

Code-wise, the patches look fine but the apparent penalty for PFT is
too severe.
Nicolas Saenz Julienne Dec. 9, 2021, 10:50 a.m. UTC | #2
Hi Mel,

On Fri, 2021-12-03 at 14:13 +0000, Mel Gorman wrote:
> [...]
> Code-wise, the patches look fine but the apparent penalty for PFT is
> too severe.

Thanks for taking the time to look at this. I agree the performance penalty is
way too big. I'll move to an alternative approach.
Marcelo Tosatti Dec. 9, 2021, 5:45 p.m. UTC | #3
On Fri, Dec 03, 2021 at 02:13:06PM +0000, Mel Gorman wrote:
> [...]
> Code-wise, the patches look fine but the apparent penalty for PFT is
> too severe.

Mel,

Have you read Nicolas' RCU patches?

Date: Fri,  8 Oct 2021 18:19:19 +0200                                                                                   
From: Nicolas Saenz Julienne <nsaenzju@redhat.com>      
Subject: [RFC 0/3] mm/page_alloc: Remote per-cpu lists drain support

RCU seems like a natural fit; we were wondering whether people see any
fundamental problem with this approach.
Mel Gorman Dec. 10, 2021, 10:55 a.m. UTC | #4
On Thu, Dec 09, 2021 at 02:45:35PM -0300, Marcelo Tosatti wrote:
> [...]
> 
> Mel,
> 
> Have you read Nicolas' RCU patches?
> 

I agree with Vlastimil's review on overhead.

I think it would be more straightforward to disable the pcp allocator for
NOHZ_FULL CPUs, similar to what zone_pcp_disable() does but for individual
CPUs, with care taken not to accidentally re-enable nohz CPUs in
zone_pcp_enable(). The downside is that there will be a performance penalty
if an application running on a NOHZ_FULL CPU is page-allocator intensive
for whatever reason. However, I guess this is unlikely, because if there
was a lot of kernel activity on a NOHZ_FULL CPU, the vmstat shepherd would
also cause interference.
Marcelo Tosatti Dec. 14, 2021, 10:58 a.m. UTC | #5
On Fri, Dec 10, 2021 at 10:55:49AM +0000, Mel Gorman wrote:
> [...]
> I agree with Vlastimil's review on overhead.

Not sure those points indicate any fundamental performance problem with RCU:
https://paulmck.livejournal.com/31058.html

> I think it would be more straightforward to disable the pcp allocator for
> NOHZ_FULL CPUs, similar to what zone_pcp_disable() does but for individual
> CPUs, with care taken not to accidentally re-enable nohz CPUs in
> zone_pcp_enable(). The downside is that there will be a performance penalty
> if an application running on a NOHZ_FULL CPU is page-allocator intensive
> for whatever reason. However, I guess this is unlikely, because if there
> was a lot of kernel activity on a NOHZ_FULL CPU, the vmstat shepherd would
> also cause interference.

Yes, it does, and it's being fixed:

https://lkml.org/lkml/2021/12/8/663

Honestly, I am not sure whether the association between a nohz_full CPU
and "should be mostly in userspace" is desired. The RCU solution
would be more generic. As Nicolas mentioned, for the use cases in
question, either solution is OK.

Thomas, Frederic, Christoph, do you have any opinion on this?
Christoph Lameter (Ampere) Dec. 14, 2021, 11:42 a.m. UTC | #6
On Tue, 14 Dec 2021, Marcelo Tosatti wrote:

> > [...]
>
> Yes, it does, and it's being fixed:
>
> https://lkml.org/lkml/2021/12/8/663
>
> Honestly, I am not sure whether the association between a nohz_full CPU
> and "should be mostly in userspace" is desired. The RCU solution
> would be more generic. As Nicolas mentioned, for the use cases in
> question, either solution is OK.
>
> Thomas, Frederic, Christoph, do you have any opinion on this?

Applications running would ideally have no performance penalty, and there
is no issue with kernel activity unless the application is in its special
low-latency loop. NOHZ is currently only activated after spinning in that
loop for 2 seconds or so. It would be best to be able to trigger that
manually somehow.

And I would prefer to be able to run the whole system as
NOHZ and have the ability to selectively enable the quiet mode if a
process requires it for its processing.
Marcelo Tosatti Dec. 14, 2021, 12:25 p.m. UTC | #7
On Tue, Dec 14, 2021 at 12:42:58PM +0100, Christoph Lameter wrote:
> On Tue, 14 Dec 2021, Marcelo Tosatti wrote:
> 
> > [...]
> 
> Applications running would ideally have no performance penalty and there
> is no  issue with kernel activity unless the application is in its special
> low latency loop. NOHZ is currently only activated after spinning in that
> loop for 2 seconds or so. Would be best to be able to trigger that
> manually somehow.

We can add a task isolation feature to do that.

> And I would prefer to be able to run the whole system as
> NOHZ and have the ability to selectively enable the quiet mode if a
> process requires it for its processing.

IIRC Frederic has been working on that.

Thanks.

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b332d5cc40f1..7dbdab100461 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -140,13 +140,7 @@  DEFINE_PER_CPU(int, _numa_mem_);		/* Kernel "local memory" node */
 EXPORT_PER_CPU_SYMBOL(_numa_mem_);
 #endif
 
-/* work_structs for global per-cpu drains */
-struct pcpu_drain {
-	struct zone *zone;
-	struct work_struct work;
-};
 static DEFINE_MUTEX(pcpu_drain_mutex);
-static DEFINE_PER_CPU(struct pcpu_drain, pcpu_drain);
 
 #ifdef CONFIG_GCC_PLUGIN_LATENT_ENTROPY
 volatile unsigned long latent_entropy __latent_entropy;
@@ -3050,9 +3044,6 @@  static int rmqueue_bulk(struct zone *zone, unsigned int order,
  * Called from the vmstat counter updater to drain pagesets of this
  * currently executing processor on remote nodes after they have
  * expired.
- *
- * Note that this function must be called with the thread pinned to
- * a single processor.
  */
 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 {
@@ -3070,10 +3061,6 @@  void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 
 /*
  * Drain pcplists of the indicated processor and zone.
- *
- * The processor must either be the current processor and the
- * thread pinned to the current processor or a processor that
- * is not online.
  */
 static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 {
@@ -3089,10 +3076,6 @@  static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 
 /*
  * Drain pcplists of all zones on the indicated processor.
- *
- * The processor must either be the current processor and the
- * thread pinned to the current processor or a processor that
- * is not online.
  */
 static void drain_pages(unsigned int cpu)
 {
@@ -3105,9 +3088,6 @@  static void drain_pages(unsigned int cpu)
 
 /*
  * Spill all of this CPU's per-cpu pages back into the buddy allocator.
- *
- * The CPU has to be pinned. When zone parameter is non-NULL, spill just
- * the single zone's pages.
  */
 void drain_local_pages(struct zone *zone)
 {
@@ -3119,24 +3099,6 @@  void drain_local_pages(struct zone *zone)
 		drain_pages(cpu);
 }
 
-static void drain_local_pages_wq(struct work_struct *work)
-{
-	struct pcpu_drain *drain;
-
-	drain = container_of(work, struct pcpu_drain, work);
-
-	/*
-	 * drain_all_pages doesn't use proper cpu hotplug protection so
-	 * we can race with cpu offline when the WQ can move this from
-	 * a cpu pinned worker to an unbound one. We can operate on a different
-	 * cpu which is alright but we also have to make sure to not move to
-	 * a different one.
-	 */
-	migrate_disable();
-	drain_local_pages(drain->zone);
-	migrate_enable();
-}
-
 /*
  * The implementation of drain_all_pages(), exposing an extra parameter to
  * drain on all cpus.
@@ -3157,13 +3119,6 @@  static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
 	 */
 	static cpumask_t cpus_with_pcps;
 
-	/*
-	 * Make sure nobody triggers this path before mm_percpu_wq is fully
-	 * initialized.
-	 */
-	if (WARN_ON_ONCE(!mm_percpu_wq))
-		return;
-
 	/*
 	 * Do not drain if one is already in progress unless it's specific to
 	 * a zone. Such callers are primarily CMA and memory hotplug and need
@@ -3213,14 +3168,12 @@  static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
 	}
 
 	for_each_cpu(cpu, &cpus_with_pcps) {
-		struct pcpu_drain *drain = per_cpu_ptr(&pcpu_drain, cpu);
-
-		drain->zone = zone;
-		INIT_WORK(&drain->work, drain_local_pages_wq);
-		queue_work_on(cpu, mm_percpu_wq, &drain->work);
+		if (zone) {
+			drain_pages_zone(cpu, zone);
+		} else {
+			drain_pages(cpu);
+		}
 	}
-	for_each_cpu(cpu, &cpus_with_pcps)
-		flush_work(&per_cpu_ptr(&pcpu_drain, cpu)->work);
 
 	mutex_unlock(&pcpu_drain_mutex);
 }
@@ -3229,8 +3182,6 @@  static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
  * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
  *
  * When zone parameter is non-NULL, spill just the single zone's pages.
- *
- * Note that this can be extremely slow as the draining happens in a workqueue.
  */
 void drain_all_pages(struct zone *zone)
 {