mbox series

[v4,0/4] percpu: partial chunk depopulation

Message ID 20210419225047.3415425-1-dennis@kernel.org (mailing list archive)
Headers show
Series percpu: partial chunk depopulation | expand

Message

Dennis Zhou April 19, 2021, 10:50 p.m. UTC
Hello,

This series is a continuation of Roman's series in [1]. It aims to solve
chunks holding onto free pages by adding a reclaim process to the percpu
balance work item.

The main difference is that the nr_empty_pop_pages is now managed at
time of isolation instead of intermixed. This helps with deciding which
chunks to free instead of having to interleave returning chunks to
active duty.

The allocation priority is as follows:
  1) appropriate chunk slot increasing until fit
  2) sidelined chunks
  3) full free chunks

The last slot for to_depopulate is never used for allocations.

A big thanks to Roman for initiating the work and being available for
iterating on these ideas.

This patchset contains the following 4 patches:
  0001-percpu-factor-out-pcpu_check_block_hint.patch
  0002-percpu-use-pcpu_free_slot-instead-of-pcpu_nr_slots-1.patch
  0003-percpu-implement-partial-chunk-depopulation.patch
  0004-percpu-use-reclaim-threshold-instead-of-running-for-.patch

0001 and 0002 are clean ups. 0003 implement partial chunk depopulation
initially from Roman. 0004 adds a reclaim threshold so we do not need to
schedule for every page freed.

This series is on top of percpu$for-5.14 67c2669d69fb.

diffstats below:

Dennis Zhou (2):
  percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
  percpu: use reclaim threshold instead of running for every page

Roman Gushchin (2):
  percpu: factor out pcpu_check_block_hint()
  percpu: implement partial chunk depopulation

 mm/percpu-internal.h |   5 +
 mm/percpu-km.c       |   5 +
 mm/percpu-stats.c    |  20 ++--
 mm/percpu-vm.c       |  30 ++++++
 mm/percpu.c          | 252 ++++++++++++++++++++++++++++++++++++++-----
 5 files changed, 278 insertions(+), 34 deletions(-)

Thanks,
Dennis

Comments

Dennis Zhou April 19, 2021, 10:54 p.m. UTC | #1
On Mon, Apr 19, 2021 at 10:50:43PM +0000, Dennis Zhou wrote:
> Hello,
> 
> This series is a continuation of Roman's series in [1]. It aims to solve
> chunks holding onto free pages by adding a reclaim process to the percpu
> balance work item.
> 

And I forgot to link [1]...
[1] https://lore.kernel.org/lkml/20210408035736.883861-1-guro@fb.com/

> The main difference is that the nr_empty_pop_pages is now managed at
> time of isolation instead of intermixed. This helps with deciding which
> chunks to free instead of having to interleave returning chunks to
> active duty.
> 
> The allocation priority is as follows:
>   1) appropriate chunk slot increasing until fit
>   2) sidelined chunks
>   3) full free chunks
> 
> The last slot for to_depopulate is never used for allocations.
> 
> A big thanks to Roman for initiating the work and being available for
> iterating on these ideas.
> 
> This patchset contains the following 4 patches:
>   0001-percpu-factor-out-pcpu_check_block_hint.patch
>   0002-percpu-use-pcpu_free_slot-instead-of-pcpu_nr_slots-1.patch
>   0003-percpu-implement-partial-chunk-depopulation.patch
>   0004-percpu-use-reclaim-threshold-instead-of-running-for-.patch
> 
> 0001 and 0002 are clean ups. 0003 implement partial chunk depopulation
> initially from Roman. 0004 adds a reclaim threshold so we do not need to
> schedule for every page freed.
> 
> This series is on top of percpu$for-5.14 67c2669d69fb.
> 
> diffstats below:
> 
> Dennis Zhou (2):
>   percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
>   percpu: use reclaim threshold instead of running for every page
> 
> Roman Gushchin (2):
>   percpu: factor out pcpu_check_block_hint()
>   percpu: implement partial chunk depopulation
> 
>  mm/percpu-internal.h |   5 +
>  mm/percpu-km.c       |   5 +
>  mm/percpu-stats.c    |  20 ++--
>  mm/percpu-vm.c       |  30 ++++++
>  mm/percpu.c          | 252 ++++++++++++++++++++++++++++++++++++++-----
>  5 files changed, 278 insertions(+), 34 deletions(-)
> 
> Thanks,
> Dennis
Dennis Zhou April 19, 2021, 10:57 p.m. UTC | #2
On Mon, Apr 19, 2021 at 10:50:43PM +0000, Dennis Zhou wrote:
> Hello,
> 
> This series is a continuation of Roman's series in [1]. It aims to solve
> chunks holding onto free pages by adding a reclaim process to the percpu
> balance work item.
> 
> The main difference is that the nr_empty_pop_pages is now managed at
> time of isolation instead of intermixed. This helps with deciding which
> chunks to free instead of having to interleave returning chunks to
> active duty.
> 
> The allocation priority is as follows:
>   1) appropriate chunk slot increasing until fit
>   2) sidelined chunks
>   3) full free chunks
> 
> The last slot for to_depopulate is never used for allocations.
> 
> A big thanks to Roman for initiating the work and being available for
> iterating on these ideas.
> 
> This patchset contains the following 4 patches:
>   0001-percpu-factor-out-pcpu_check_block_hint.patch
>   0002-percpu-use-pcpu_free_slot-instead-of-pcpu_nr_slots-1.patch
>   0003-percpu-implement-partial-chunk-depopulation.patch
>   0004-percpu-use-reclaim-threshold-instead-of-running-for-.patch
> 
> 0001 and 0002 are clean ups. 0003 implement partial chunk depopulation
> initially from Roman. 0004 adds a reclaim threshold so we do not need to
> schedule for every page freed.
> 
> This series is on top of percpu$for-5.14 67c2669d69fb.
> 
> diffstats below:
> 
> Dennis Zhou (2):
>   percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
>   percpu: use reclaim threshold instead of running for every page
> 
> Roman Gushchin (2):
>   percpu: factor out pcpu_check_block_hint()
>   percpu: implement partial chunk depopulation
> 
>  mm/percpu-internal.h |   5 +
>  mm/percpu-km.c       |   5 +
>  mm/percpu-stats.c    |  20 ++--
>  mm/percpu-vm.c       |  30 ++++++
>  mm/percpu.c          | 252 ++++++++++++++++++++++++++++++++++++++-----
>  5 files changed, 278 insertions(+), 34 deletions(-)
> 
> Thanks,
> Dennis

Hello Pratik,

Do you mind testing this series again on POWER9? The base is available
here:
https://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git/log/?h=for-5.14

Thanks,
Dennis
Pratik R. Sampat April 20, 2021, 11:07 a.m. UTC | #3
On 20/04/21 4:27 am, Dennis Zhou wrote:
> On Mon, Apr 19, 2021 at 10:50:43PM +0000, Dennis Zhou wrote:
>> Hello,
>>
>> This series is a continuation of Roman's series in [1]. It aims to solve
>> chunks holding onto free pages by adding a reclaim process to the percpu
>> balance work item.
>>
>> The main difference is that the nr_empty_pop_pages is now managed at
>> time of isolation instead of intermixed. This helps with deciding which
>> chunks to free instead of having to interleave returning chunks to
>> active duty.
>>
>> The allocation priority is as follows:
>>    1) appropriate chunk slot increasing until fit
>>    2) sidelined chunks
>>    3) full free chunks
>>
>> The last slot for to_depopulate is never used for allocations.
>>
>> A big thanks to Roman for initiating the work and being available for
>> iterating on these ideas.
>>
>> This patchset contains the following 4 patches:
>>    0001-percpu-factor-out-pcpu_check_block_hint.patch
>>    0002-percpu-use-pcpu_free_slot-instead-of-pcpu_nr_slots-1.patch
>>    0003-percpu-implement-partial-chunk-depopulation.patch
>>    0004-percpu-use-reclaim-threshold-instead-of-running-for-.patch
>>
>> 0001 and 0002 are clean ups. 0003 implement partial chunk depopulation
>> initially from Roman. 0004 adds a reclaim threshold so we do not need to
>> schedule for every page freed.
>>
>> This series is on top of percpu$for-5.14 67c2669d69fb.
>>
>> diffstats below:
>>
>> Dennis Zhou (2):
>>    percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
>>    percpu: use reclaim threshold instead of running for every page
>>
>> Roman Gushchin (2):
>>    percpu: factor out pcpu_check_block_hint()
>>    percpu: implement partial chunk depopulation
>>
>>   mm/percpu-internal.h |   5 +
>>   mm/percpu-km.c       |   5 +
>>   mm/percpu-stats.c    |  20 ++--
>>   mm/percpu-vm.c       |  30 ++++++
>>   mm/percpu.c          | 252 ++++++++++++++++++++++++++++++++++++++-----
>>   5 files changed, 278 insertions(+), 34 deletions(-)
>>
>> Thanks,
>> Dennis
> Hello Pratik,
>
> Do you mind testing this series again on POWER9? The base is available
> here:
> https://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git/log/?h=for-5.14
>
> Thanks,
> Dennis

Hello Dennis, I have tested this patchset on POWER9.

I have tried variations of the percpu_test in the top level and nested cgroups
creation as the test with 1000:10 didn't show any benefits.

The following example shows more consistent benefits with the de-allocation
strategy.
Outer: 1000
Inner: 50
# ./percpu_test.sh
Percpu:             6912 kB
Percpu:           532736 kB
Percpu:           278784 kB

I believe it could be a result of bulk freeing within "free_unref_page_commit",
where pages are only free'd if pcp->count >= pcp->high. As POWER has a larger
page size it would end up creating lesser number of pages but with the
effects of fragmentation.

Having said that, the patchset and its behavior does look good to me.

Thanks!
Pratik
Dennis Zhou April 20, 2021, 2:39 p.m. UTC | #4
On Tue, Apr 20, 2021 at 04:37:02PM +0530, Pratik Sampat wrote:
> 
> On 20/04/21 4:27 am, Dennis Zhou wrote:
> > On Mon, Apr 19, 2021 at 10:50:43PM +0000, Dennis Zhou wrote:
> > > Hello,
> > > 
> > > This series is a continuation of Roman's series in [1]. It aims to solve
> > > chunks holding onto free pages by adding a reclaim process to the percpu
> > > balance work item.
> > > 
> > > The main difference is that the nr_empty_pop_pages is now managed at
> > > time of isolation instead of intermixed. This helps with deciding which
> > > chunks to free instead of having to interleave returning chunks to
> > > active duty.
> > > 
> > > The allocation priority is as follows:
> > >    1) appropriate chunk slot increasing until fit
> > >    2) sidelined chunks
> > >    3) full free chunks
> > > 
> > > The last slot for to_depopulate is never used for allocations.
> > > 
> > > A big thanks to Roman for initiating the work and being available for
> > > iterating on these ideas.
> > > 
> > > This patchset contains the following 4 patches:
> > >    0001-percpu-factor-out-pcpu_check_block_hint.patch
> > >    0002-percpu-use-pcpu_free_slot-instead-of-pcpu_nr_slots-1.patch
> > >    0003-percpu-implement-partial-chunk-depopulation.patch
> > >    0004-percpu-use-reclaim-threshold-instead-of-running-for-.patch
> > > 
> > > 0001 and 0002 are clean ups. 0003 implement partial chunk depopulation
> > > initially from Roman. 0004 adds a reclaim threshold so we do not need to
> > > schedule for every page freed.
> > > 
> > > This series is on top of percpu$for-5.14 67c2669d69fb.
> > > 
> > > diffstats below:
> > > 
> > > Dennis Zhou (2):
> > >    percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
> > >    percpu: use reclaim threshold instead of running for every page
> > > 
> > > Roman Gushchin (2):
> > >    percpu: factor out pcpu_check_block_hint()
> > >    percpu: implement partial chunk depopulation
> > > 
> > >   mm/percpu-internal.h |   5 +
> > >   mm/percpu-km.c       |   5 +
> > >   mm/percpu-stats.c    |  20 ++--
> > >   mm/percpu-vm.c       |  30 ++++++
> > >   mm/percpu.c          | 252 ++++++++++++++++++++++++++++++++++++++-----
> > >   5 files changed, 278 insertions(+), 34 deletions(-)
> > > 
> > > Thanks,
> > > Dennis
> > Hello Pratik,
> > 
> > Do you mind testing this series again on POWER9? The base is available
> > here:
> > https://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git/log/?h=for-5.14
> > 
> > Thanks,
> > Dennis
> 
> Hello Dennis, I have tested this patchset on POWER9.
> 
> I have tried variations of the percpu_test in the top level and nested cgroups
> creation as the test with 1000:10 didn't show any benefits.

This is most likely because the 1 in every 11 still pins every page
while 1 in 50 does not. Can you try the patch below on top? I think it
may show slightly better perf as well. If it doesn't I'll just drop it.

> 
> The following example shows more consistent benefits with the de-allocation
> strategy.
> Outer: 1000
> Inner: 50
> # ./percpu_test.sh
> Percpu:             6912 kB
> Percpu:           532736 kB
> Percpu:           278784 kB
> 
> I believe it could be a result of bulk freeing within "free_unref_page_commit",
> where pages are only free'd if pcp->count >= pcp->high. As POWER has a larger
> page size it would end up creating lesser number of pages but with the
> effects of fragmentation.

This is unrelated to per cpu pages in slab/slub. Percpu is a separate
memory allocator.

> 
> Having said that, the patchset and its behavior does look good to me.

Thanks, can I throw the following on the appropriate patches? In the
future it's good to be explicit about this because some prefer to credit
different emails.

Tested-by: Pratik Sampat <psampat@linux.ibm.com>

Thanks,
Dennis

The following may do a little better on power9:
---
From a1464c4d5900cca68fd95b935178d72bb74837d5 Mon Sep 17 00:00:00 2001
From: Dennis Zhou <dennis@kernel.org>
Date: Tue, 20 Apr 2021 14:25:20 +0000
Subject: [PATCH] percpu: convert free page float to bytes

The percpu memory allocator keeps around a minimum number of free pages
to ensure we can satisfy atomic allocations. However, we've always kept
this number in terms of pages. On certain architectures like arm and
powerpc, the default page size could be 64k instead of 4k. So, start
with a target number of free bytes and then convert to pages.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
---
 mm/percpu.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index ba13e683d022..287fe3091244 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -80,6 +80,7 @@
 #include <linux/mutex.h>
 #include <linux/percpu.h>
 #include <linux/pfn.h>
+#include <linux/sizes.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
 #include <linux/vmalloc.h>
@@ -107,11 +108,12 @@
 /* chunks in slots below this are subject to being sidelined on failed alloc */
 #define PCPU_SLOT_FAIL_THRESHOLD	3
 
-#define PCPU_EMPTY_POP_PAGES_LOW	2
-#define PCPU_EMPTY_POP_PAGES_HIGH	4
+#define PCPU_EMPTY_POP_PAGES_LOW	(max_t(int, (SZ_8K) / PAGE_SIZE, 1))
+#define PCPU_EMPTY_POP_PAGES_HIGH	(max_t(int, (SZ_16K) / PAGE_SIZE, \
+					       PCPU_EMPTY_POP_PAGES_LOW + 1))
 
 /* only schedule reclaim if there are at least N empty pop pages sidelined */
-#define PCPU_EMPTY_POP_RECLAIM_THRESHOLD	4
+#define PCPU_EMPTY_POP_RECLAIM_THRESHOLD	(PCPU_EMPTY_POP_PAGES_HIGH)
 
 #ifdef CONFIG_SMP
 /* default addr <-> pcpu_ptr mapping, override in asm/percpu.h if necessary */
Pratik R. Sampat April 20, 2021, 3:25 p.m. UTC | #5
Hello Dennis,


On 20/04/21 8:09 pm, Dennis Zhou wrote:
> On Tue, Apr 20, 2021 at 04:37:02PM +0530, Pratik Sampat wrote:
>> On 20/04/21 4:27 am, Dennis Zhou wrote:
>>> On Mon, Apr 19, 2021 at 10:50:43PM +0000, Dennis Zhou wrote:
>>>> Hello,
>>>>
>>>> This series is a continuation of Roman's series in [1]. It aims to solve
>>>> chunks holding onto free pages by adding a reclaim process to the percpu
>>>> balance work item.
>>>>
>>>> The main difference is that the nr_empty_pop_pages is now managed at
>>>> time of isolation instead of intermixed. This helps with deciding which
>>>> chunks to free instead of having to interleave returning chunks to
>>>> active duty.
>>>>
>>>> The allocation priority is as follows:
>>>>     1) appropriate chunk slot increasing until fit
>>>>     2) sidelined chunks
>>>>     3) full free chunks
>>>>
>>>> The last slot for to_depopulate is never used for allocations.
>>>>
>>>> A big thanks to Roman for initiating the work and being available for
>>>> iterating on these ideas.
>>>>
>>>> This patchset contains the following 4 patches:
>>>>     0001-percpu-factor-out-pcpu_check_block_hint.patch
>>>>     0002-percpu-use-pcpu_free_slot-instead-of-pcpu_nr_slots-1.patch
>>>>     0003-percpu-implement-partial-chunk-depopulation.patch
>>>>     0004-percpu-use-reclaim-threshold-instead-of-running-for-.patch
>>>>
>>>> 0001 and 0002 are clean ups. 0003 implement partial chunk depopulation
>>>> initially from Roman. 0004 adds a reclaim threshold so we do not need to
>>>> schedule for every page freed.
>>>>
>>>> This series is on top of percpu$for-5.14 67c2669d69fb.
>>>>
>>>> diffstats below:
>>>>
>>>> Dennis Zhou (2):
>>>>     percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
>>>>     percpu: use reclaim threshold instead of running for every page
>>>>
>>>> Roman Gushchin (2):
>>>>     percpu: factor out pcpu_check_block_hint()
>>>>     percpu: implement partial chunk depopulation
>>>>
>>>>    mm/percpu-internal.h |   5 +
>>>>    mm/percpu-km.c       |   5 +
>>>>    mm/percpu-stats.c    |  20 ++--
>>>>    mm/percpu-vm.c       |  30 ++++++
>>>>    mm/percpu.c          | 252 ++++++++++++++++++++++++++++++++++++++-----
>>>>    5 files changed, 278 insertions(+), 34 deletions(-)
>>>>
>>>> Thanks,
>>>> Dennis
>>> Hello Pratik,
>>>
>>> Do you mind testing this series again on POWER9? The base is available
>>> here:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git/log/?h=for-5.14
>>>
>>> Thanks,
>>> Dennis
>> Hello Dennis, I have tested this patchset on POWER9.
>>
>> I have tried variations of the percpu_test in the top level and nested cgroups
>> creation as the test with 1000:10 didn't show any benefits.
> This is most likely because the 1 in every 11 still pins every page
> while 1 in 50 does not. Can you try the patch below on top? I think it
> may show slightly better perf as well. If it doesn't I'll just drop it.

I did try it out, although my test spanned only across varying the inner cgroup
folders; it didn't seem to show any benefit over the previous test without the
patch for being being able to spawn as little memory cgroup folders and seeing
partial memory depopulation.

>> The following example shows more consistent benefits with the de-allocation
>> strategy.
>> Outer: 1000
>> Inner: 50
>> # ./percpu_test.sh
>> Percpu:             6912 kB
>> Percpu:           532736 kB
>> Percpu:           278784 kB
>>
>> I believe it could be a result of bulk freeing within "free_unref_page_commit",
>> where pages are only free'd if pcp->count >= pcp->high. As POWER has a larger
>> page size it would end up creating lesser number of pages but with the
>> effects of fragmentation.
> This is unrelated to per cpu pages in slab/slub. Percpu is a separate
> memory allocator.
>
You're right, I was actually referencing incorrect code.

>> Having said that, the patchset and its behavior does look good to me.
> Thanks, can I throw the following on the appropriate patches? In the
> future it's good to be explicit about this because some prefer to credit
> different emails.
>
> Tested-by: Pratik Sampat <psampat@linux.ibm.com>

Sure thing please feel free to add a tested-by wherever you feel appropriate.
I'll be more explicit about them in the future. Thanks!

> Thanks,
> Dennis
>
> The following may do a little better on power9:
> ---
>  From a1464c4d5900cca68fd95b935178d72bb74837d5 Mon Sep 17 00:00:00 2001
> From: Dennis Zhou <dennis@kernel.org>
> Date: Tue, 20 Apr 2021 14:25:20 +0000
> Subject: [PATCH] percpu: convert free page float to bytes
>
> The percpu memory allocator keeps around a minimum number of free pages
> to ensure we can satisfy atomic allocations. However, we've always kept
> this number in terms of pages. On certain architectures like arm and
> powerpc, the default page size could be 64k instead of 4k. So, start
> with a target number of free bytes and then convert to pages.
>
> Signed-off-by: Dennis Zhou <dennis@kernel.org>
> ---
>   mm/percpu.c | 8 +++++---
>   1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/mm/percpu.c b/mm/percpu.c
> index ba13e683d022..287fe3091244 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -80,6 +80,7 @@
>   #include <linux/mutex.h>
>   #include <linux/percpu.h>
>   #include <linux/pfn.h>
> +#include <linux/sizes.h>
>   #include <linux/slab.h>
>   #include <linux/spinlock.h>
>   #include <linux/vmalloc.h>
> @@ -107,11 +108,12 @@
>   /* chunks in slots below this are subject to being sidelined on failed alloc */
>   #define PCPU_SLOT_FAIL_THRESHOLD	3
>   
> -#define PCPU_EMPTY_POP_PAGES_LOW	2
> -#define PCPU_EMPTY_POP_PAGES_HIGH	4
> +#define PCPU_EMPTY_POP_PAGES_LOW	(max_t(int, (SZ_8K) / PAGE_SIZE, 1))
> +#define PCPU_EMPTY_POP_PAGES_HIGH	(max_t(int, (SZ_16K) / PAGE_SIZE, \
> +					       PCPU_EMPTY_POP_PAGES_LOW + 1))
>   
>   /* only schedule reclaim if there are at least N empty pop pages sidelined */
> -#define PCPU_EMPTY_POP_RECLAIM_THRESHOLD	4
> +#define PCPU_EMPTY_POP_RECLAIM_THRESHOLD	(PCPU_EMPTY_POP_PAGES_HIGH)
>   
>   #ifdef CONFIG_SMP
>   /* default addr <-> pcpu_ptr mapping, override in asm/percpu.h if necessary */

Thanks,
Pratik
Dennis Zhou April 21, 2021, 6:24 p.m. UTC | #6
On Mon, Apr 19, 2021 at 10:50:43PM +0000, Dennis Zhou wrote:
> Hello,
> 
> This series is a continuation of Roman's series in [1]. It aims to solve
> chunks holding onto free pages by adding a reclaim process to the percpu
> balance work item.
> 
> The main difference is that the nr_empty_pop_pages is now managed at
> time of isolation instead of intermixed. This helps with deciding which
> chunks to free instead of having to interleave returning chunks to
> active duty.
> 
> The allocation priority is as follows:
>   1) appropriate chunk slot increasing until fit
>   2) sidelined chunks
>   3) full free chunks
> 
> The last slot for to_depopulate is never used for allocations.
> 
> A big thanks to Roman for initiating the work and being available for
> iterating on these ideas.
> 
> This patchset contains the following 4 patches:
>   0001-percpu-factor-out-pcpu_check_block_hint.patch
>   0002-percpu-use-pcpu_free_slot-instead-of-pcpu_nr_slots-1.patch
>   0003-percpu-implement-partial-chunk-depopulation.patch
>   0004-percpu-use-reclaim-threshold-instead-of-running-for-.patch
> 
> 0001 and 0002 are clean ups. 0003 implement partial chunk depopulation
> initially from Roman. 0004 adds a reclaim threshold so we do not need to
> schedule for every page freed.
> 
> This series is on top of percpu$for-5.14 67c2669d69fb.
> 
> diffstats below:
> 
> Dennis Zhou (2):
>   percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
>   percpu: use reclaim threshold instead of running for every page
> 
> Roman Gushchin (2):
>   percpu: factor out pcpu_check_block_hint()
>   percpu: implement partial chunk depopulation
> 
>  mm/percpu-internal.h |   5 +
>  mm/percpu-km.c       |   5 +
>  mm/percpu-stats.c    |  20 ++--
>  mm/percpu-vm.c       |  30 ++++++
>  mm/percpu.c          | 252 ++++++++++++++++++++++++++++++++++++++-----
>  5 files changed, 278 insertions(+), 34 deletions(-)
> 
> Thanks,
> Dennis

I've applied this to for-5.14 with 1 modification to the batching
threshold, 4 pages is quite little, so make it 16.

Thanks,
Dennis