
[rfc,0/3] mm: allow more high-order pages stored on PCP lists

Message ID 20240415081220.3246839-1-wangkefeng.wang@huawei.com (mailing list archive)
Series mm: allow more high-order pages stored on PCP lists

Message

Kefeng Wang April 15, 2024, 8:12 a.m. UTC
Both file pages and anonymous pages now support large folios, so
high-order pages below PMD_ORDER will also be allocated frequently,
which can increase zone lock contention. Allowing high-order pages on
the PCP lists can reduce that contention, but as commit 44042b449872
("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
pointed out, it may not be a win in every scenario. Add a new sysfs
control to enable or disable storing specified high-order pages on the
PCP lists; the orders in (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) are not
stored on the PCP lists by default.
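As a sketch of the intended usage (the per-size pcp_enabled knob is the
one this series adds; the paths below follow the examples later in this
thread and are only illustrative):

  # let 64kB mTHP allocations be cached on the PCP lists
  echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
  echo 1 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled

  # revert to allocating that order directly from the zone freelists
  echo 0 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled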

With the perf lock tool, the lock contention from will-it-scale
page_fault1 (90 tasks running for 10s, hugepage-2048kB set to never,
hugepage-64kB set to always) is shown below (only the zone spinlock and
pcp spinlock are of interest),

Without patches,
 contended   total wait     max wait     avg wait         type   caller
       713      4.64 ms     74.37 us      6.51 us     spinlock   __alloc_pages+0x23c

With patches,
 contended   total wait     max wait     avg wait         type   caller
         2     25.66 us     16.31 us     12.83 us     spinlock   rmqueue_pcplist+0x2b0

Similar results on shell8 from unixbench,

Without patches,
 contended   total wait     max wait     avg wait         type   caller
      4942    901.09 ms      1.31 ms    182.33 us     spinlock   __alloc_pages+0x23c
      1556    298.76 ms      1.23 ms    192.01 us     spinlock   rmqueue_pcplist+0x2b0
       991    182.73 ms    879.80 us    184.39 us     spinlock   rmqueue_pcplist+0x2b0

With patches,
contended   total wait     max wait     avg wait         type   caller
       988    187.63 ms    855.18 us    189.91 us     spinlock   rmqueue_pcplist+0x2b0
       505     88.99 ms    793.27 us    176.21 us     spinlock   rmqueue_pcplist+0x2b0

The benchmark score shows a small improvement (0.28%) for shell8, but
the zone lock contention from __alloc_pages() disappeared.
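For reference, the contention tables above come from perf lock; an
invocation along the following lines (an assumption, not the exact
command used for the original runs) produces this report format:

  # system-wide lock contention profile for 10 seconds; the zone and
  # pcp spinlocks show up under __alloc_pages()/rmqueue_pcplist()
  perf lock contention -a -b -- sleep 10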

Kefeng Wang (3):
  mm: prepare more high-order pages to be stored on the per-cpu lists
  mm: add control to allow specified high-order pages stored on PCP list
  mm: pcp: show each order page count

 Documentation/admin-guide/mm/transhuge.rst | 11 ++++
 include/linux/gfp.h                        |  1 +
 include/linux/huge_mm.h                    |  1 +
 include/linux/mmzone.h                     | 10 ++-
 include/linux/vmstat.h                     | 19 ++++++
 mm/Kconfig.debug                           |  8 +++
 mm/huge_memory.c                           | 74 ++++++++++++++++++++++
 mm/page_alloc.c                            | 30 +++++++--
 mm/vmstat.c                                | 16 +++++
 9 files changed, 164 insertions(+), 6 deletions(-)

Comments

Barry Song April 15, 2024, 8:18 a.m. UTC | #1
On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
> Both the file pages and anonymous pages support large folio, high-order
> pages except PMD_ORDER will also be allocated frequently which could
> increase the zone lock contention, allow high-order pages on pcp lists
> could reduce the big zone lock contention, but as commit 44042b449872
> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
> pointed, it may not win in all the scenes, add a new control sysfs to
> enable or disable specified high-order pages stored on PCP lists, the order
> (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) won't be stored on PCP list by default.

This is precisely something Baolin and I have discussed and intended
to implement[1],
but unfortunately, we haven't had the time to do so.

[1] https://lore.kernel.org/linux-mm/13c59ca8-baac-405e-8640-e693c78ef79a@suse.cz/T/#mecb0514ced830ac4df320113bedd7073bea9ab7a

>
> With perf lock tools, the lock contention from will-it-scale page_fault1
> (with 90 tasks run 10s, hugepage-2048KB never, hugepage-64K always) show
> below(only care about zone spinlock and pcp spinlock),
>
> Without patches,
>  contended   total wait     max wait     avg wait         type   caller
>        713      4.64 ms     74.37 us      6.51 us     spinlock   __alloc_pages+0x23c
>
> With patches,
>  contended   total wait     max wait     avg wait         type   caller
>          2     25.66 us     16.31 us     12.83 us     spinlock   rmqueue_pcplist+0x2b0
>
> Similar results on shell8 from unixbench,
>
> Without patches,
>       4942    901.09 ms      1.31 ms    182.33 us     spinlock   __alloc_pages+0x23c
>       1556    298.76 ms      1.23 ms    192.01 us     spinlock   rmqueue_pcplist+0x2b0
>        991    182.73 ms    879.80 us    184.39 us     spinlock   rmqueue_pcplist+0x2b0
>
> With patches,
> contended   total wait     max wait     avg wait         type   caller
>        988    187.63 ms    855.18 us    189.91 us     spinlock   rmqueue_pcplist+0x2b0
>        505     88.99 ms    793.27 us    176.21 us     spinlock   rmqueue_pcplist+0x2b0
>
> The Benchmarks Score shows a little improvoment(0.28%) from shell8, but the
> zone lock from __alloc_pages() disappeared.
>
> Kefeng Wang (3):
>   mm: prepare more high-order pages to be stored on the per-cpu lists
>   mm: add control to allow specified high-order pages stored on PCP list
>   mm: pcp: show each order page count
>
>  Documentation/admin-guide/mm/transhuge.rst | 11 ++++
>  include/linux/gfp.h                        |  1 +
>  include/linux/huge_mm.h                    |  1 +
>  include/linux/mmzone.h                     | 10 ++-
>  include/linux/vmstat.h                     | 19 ++++++
>  mm/Kconfig.debug                           |  8 +++
>  mm/huge_memory.c                           | 74 ++++++++++++++++++++++
>  mm/page_alloc.c                            | 30 +++++++--
>  mm/vmstat.c                                | 16 +++++
>  9 files changed, 164 insertions(+), 6 deletions(-)
>
> --
> 2.27.0
>
>
Kefeng Wang April 15, 2024, 8:59 a.m. UTC | #2
On 2024/4/15 16:18, Barry Song wrote:
> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>
>> Both the file pages and anonymous pages support large folio, high-order
>> pages except PMD_ORDER will also be allocated frequently which could
>> increase the zone lock contention, allow high-order pages on pcp lists
>> could reduce the big zone lock contention, but as commit 44042b449872
>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
>> pointed, it may not win in all the scenes, add a new control sysfs to
>> enable or disable specified high-order pages stored on PCP lists, the order
>> (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) won't be stored on PCP list by default.
> 
> This is precisely something Baolin and I have discussed and intended
> to implement[1],
> but unfortunately, we haven't had the time to do so.

Indeed, the same thing. Recently we have been working on
unixbench/lmbench optimization; I tested multi-size THP for anonymous
memory by hard-coding PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1]. It shows
some improvement, but not for all cases and not very stable, so I
re-implemented it according to the user requirement so that it can be
enabled dynamically.

[1] 
https://lore.kernel.org/linux-mm/b8f5a47a-af1e-44ed-a89b-460d0be56d2c@huawei.com/

> 
> [1] https://lore.kernel.org/linux-mm/13c59ca8-baac-405e-8640-e693c78ef79a@suse.cz/T/#mecb0514ced830ac4df320113bedd7073bea9ab7a
> 
>>
>> With perf lock tools, the lock contention from will-it-scale page_fault1
>> (with 90 tasks run 10s, hugepage-2048KB never, hugepage-64K always) show
>> below(only care about zone spinlock and pcp spinlock),
>>
>> Without patches,
>>   contended   total wait     max wait     avg wait         type   caller
>>         713      4.64 ms     74.37 us      6.51 us     spinlock   __alloc_pages+0x23c
>>
>> With patches,
>>   contended   total wait     max wait     avg wait         type   caller
>>           2     25.66 us     16.31 us     12.83 us     spinlock   rmqueue_pcplist+0x2b0
>>
>> Similar results on shell8 from unixbench,
>>
>> Without patches,
>>        4942    901.09 ms      1.31 ms    182.33 us     spinlock   __alloc_pages+0x23c
>>        1556    298.76 ms      1.23 ms    192.01 us     spinlock   rmqueue_pcplist+0x2b0
>>         991    182.73 ms    879.80 us    184.39 us     spinlock   rmqueue_pcplist+0x2b0
>>
>> With patches,
>> contended   total wait     max wait     avg wait         type   caller
>>         988    187.63 ms    855.18 us    189.91 us     spinlock   rmqueue_pcplist+0x2b0
>>         505     88.99 ms    793.27 us    176.21 us     spinlock   rmqueue_pcplist+0x2b0
>>
>> The Benchmarks Score shows a little improvoment(0.28%) from shell8, but the
>> zone lock from __alloc_pages() disappeared.
>>
>> Kefeng Wang (3):
>>    mm: prepare more high-order pages to be stored on the per-cpu lists
>>    mm: add control to allow specified high-order pages stored on PCP list
>>    mm: pcp: show each order page count
>>
>>   Documentation/admin-guide/mm/transhuge.rst | 11 ++++
>>   include/linux/gfp.h                        |  1 +
>>   include/linux/huge_mm.h                    |  1 +
>>   include/linux/mmzone.h                     | 10 ++-
>>   include/linux/vmstat.h                     | 19 ++++++
>>   mm/Kconfig.debug                           |  8 +++
>>   mm/huge_memory.c                           | 74 ++++++++++++++++++++++
>>   mm/page_alloc.c                            | 30 +++++++--
>>   mm/vmstat.c                                | 16 +++++
>>   9 files changed, 164 insertions(+), 6 deletions(-)
>>
>> --
>> 2.27.0
>>
>>
>
David Hildenbrand April 15, 2024, 10:52 a.m. UTC | #3
On 15.04.24 10:59, Kefeng Wang wrote:
> 
> 
> On 2024/4/15 16:18, Barry Song wrote:
>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>
>>> Both the file pages and anonymous pages support large folio, high-order
>>> pages except PMD_ORDER will also be allocated frequently which could
>>> increase the zone lock contention, allow high-order pages on pcp lists
>>> could reduce the big zone lock contention, but as commit 44042b449872
>>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
>>> pointed, it may not win in all the scenes, add a new control sysfs to
>>> enable or disable specified high-order pages stored on PCP lists, the order
>>> (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) won't be stored on PCP list by default.
>>
>> This is precisely something Baolin and I have discussed and intended
>> to implement[1],
>> but unfortunately, we haven't had the time to do so.
> 
> Indeed, same thing. Recently, we are working on unixbench/lmbench
> optimization, I tested Multi-size THP for anonymous memory by hard-cord
> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1], it shows some improvement but
> not for all cases and not very stable, so re-implemented it by according
> to the user requirement and enable it dynamically.

I'm wondering, though, if this is really a suitable candidate for a 
sysctl toggle. Can anybody really come up with an educated guess for 
these values?

Especially reading "Benchmarks Score shows a little improvoment(0.28%)" 
and "it may not win in all the scenes", to me it mostly sounds like 
"minimal impact" -- so who cares?

How much is the cost vs. benefit of just having one sane system 
configuration?
Barry Song April 15, 2024, 11:14 a.m. UTC | #4
On Mon, Apr 15, 2024 at 6:52 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 15.04.24 10:59, Kefeng Wang wrote:
> >
> >
> > On 2024/4/15 16:18, Barry Song wrote:
> >> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>>
> >>> Both the file pages and anonymous pages support large folio, high-order
> >>> pages except PMD_ORDER will also be allocated frequently which could
> >>> increase the zone lock contention, allow high-order pages on pcp lists
> >>> could reduce the big zone lock contention, but as commit 44042b449872
> >>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
> >>> pointed, it may not win in all the scenes, add a new control sysfs to
> >>> enable or disable specified high-order pages stored on PCP lists, the order
> >>> (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) won't be stored on PCP list by default.
> >>
> >> This is precisely something Baolin and I have discussed and intended
> >> to implement[1],
> >> but unfortunately, we haven't had the time to do so.
> >
> > Indeed, same thing. Recently, we are working on unixbench/lmbench
> > optimization, I tested Multi-size THP for anonymous memory by hard-cord
> > PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1], it shows some improvement but
> > not for all cases and not very stable, so re-implemented it by according
> > to the user requirement and enable it dynamically.
>
> I'm wondering, though, if this is really a suitable candidate for a
> sysctl toggle. Can anybody really come up with an educated guess for
> these values?
>
> Especially reading "Benchmarks Score shows a little improvoment(0.28%)"
> and "it may not win in all the scenes", to me it mostly sounds like
> "minimal impact" -- so who cares?

Considering the original goal of employing PCP to alleviate page allocation
lock contention, and now that we have configured mTHP, for instance, to
64KiB, it's possible that 64KiB could become the most common page allocation
size just like order-0. We should expect to see similar improvements as a result.

I'm questioning whether shell8 is a suitable benchmark for this
situation. A mere 0.28% performance enhancement might not be
substantial enough to pique interest.
Shouldn't we have numerous threads allocating and freeing in parallel to truly
gauge the benefits of PCP?

>
> How much is the cost vs. benefit of just having one sane system
> configuration?
>
> --
> Cheers,
>
> David / dhildenb
>
Kefeng Wang April 15, 2024, 12:17 p.m. UTC | #5
On 2024/4/15 18:52, David Hildenbrand wrote:
> On 15.04.24 10:59, Kefeng Wang wrote:
>>
>>
>> On 2024/4/15 16:18, Barry Song wrote:
>>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang 
>>> <wangkefeng.wang@huawei.com> wrote:
>>>>
>>>> Both the file pages and anonymous pages support large folio, high-order
>>>> pages except PMD_ORDER will also be allocated frequently which could
>>>> increase the zone lock contention, allow high-order pages on pcp lists
>>>> could reduce the big zone lock contention, but as commit 44042b449872
>>>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu 
>>>> lists")
>>>> pointed, it may not win in all the scenes, add a new control sysfs to
>>>> enable or disable specified high-order pages stored on PCP lists, 
>>>> the order
>>>> (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) won't be stored on PCP list by 
>>>> default.
>>>
>>> This is precisely something Baolin and I have discussed and intended
>>> to implement[1],
>>> but unfortunately, we haven't had the time to do so.
>>
>> Indeed, same thing. Recently, we are working on unixbench/lmbench
>> optimization, I tested Multi-size THP for anonymous memory by hard-cord
>> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1], it shows some improvement but
>> not for all cases and not very stable, so re-implemented it by according
>> to the user requirement and enable it dynamically.
> 
> I'm wondering, though, if this is really a suitable candidate for a 
> sysctl toggle. Can anybody really come up with an educated guess for 
> these values?

Not sure this is suitable as a sysctl, but mTHP anon is already enabled
via sysfs; we could trace __alloc_pages() and collect order statistics
to decide which high orders to enable on the PCP lists.
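
As a rough sketch of that idea, using the existing kmem:mm_page_alloc
tracepoint (the filter value and post-processing below are assumptions,
just to illustrate the approach):

  cd /sys/kernel/debug/tracing
  # only record allocations above PAGE_ALLOC_COSTLY_ORDER (3)
  echo 'order > 3' > events/kmem/mm_page_alloc/filter
  echo 1 > events/kmem/mm_page_alloc/enable
  sleep 10
  echo 0 > events/kmem/mm_page_alloc/enable
  # per-order allocation histogram for the sampled interval
  grep -o 'order=[0-9]*' trace | sort | uniq -c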

> 
> Especially reading "Benchmarks Score shows a little improvoment(0.28%)" 
> and "it may not win in all the scenes", to me it mostly sounds like 
> "minimal impact" -- so who cares?

Even though the lock conflicts are eliminated, there is very limited
performance improvement (maybe even just fluctuation). It is not a good
testcase to show the improvement, it only exposes the zone-lock issue;
we need to find a better testcase, maybe some test on Android (heavy
use of 64K, no PMD THP), or maybe LKP could give some help?

I will try to find another testcase to show the benefit.

> 
> How much is the cost vs. benefit of just having one sane system 
> configuration?
> 

For arm64 with 4K pages there are five more high orders (4~8), so five
more pcplists, and for the high orders we assume most of them are
movable, but maybe not, so enabling it by default may cause more
fragmentation, see 5d0a661d808f ("mm/page_alloc: use only one PCP list
for THP-sized allocations").
Barry Song April 16, 2024, 12:21 a.m. UTC | #6
On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
>
>
> On 2024/4/15 18:52, David Hildenbrand wrote:
> > On 15.04.24 10:59, Kefeng Wang wrote:
> >>
> >>
> >> On 2024/4/15 16:18, Barry Song wrote:
> >>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang
> >>> <wangkefeng.wang@huawei.com> wrote:
> >>>>
> >>>> Both the file pages and anonymous pages support large folio, high-order
> >>>> pages except PMD_ORDER will also be allocated frequently which could
> >>>> increase the zone lock contention, allow high-order pages on pcp lists
> >>>> could reduce the big zone lock contention, but as commit 44042b449872
> >>>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu
> >>>> lists")
> >>>> pointed, it may not win in all the scenes, add a new control sysfs to
> >>>> enable or disable specified high-order pages stored on PCP lists,
> >>>> the order
> >>>> (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) won't be stored on PCP list by
> >>>> default.
> >>>
> >>> This is precisely something Baolin and I have discussed and intended
> >>> to implement[1],
> >>> but unfortunately, we haven't had the time to do so.
> >>
> >> Indeed, same thing. Recently, we are working on unixbench/lmbench
> >> optimization, I tested Multi-size THP for anonymous memory by hard-cord
> >> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1], it shows some improvement but
> >> not for all cases and not very stable, so re-implemented it by according
> >> to the user requirement and enable it dynamically.
> >
> > I'm wondering, though, if this is really a suitable candidate for a
> > sysctl toggle. Can anybody really come up with an educated guess for
> > these values?
>
> Not sure this is suitable in sysctl, but mTHP anon is enabled in sysctl,
> we could trace __alloc_pages() and do order statistic to decide to
> choose the high-order to be enabled on PCP.
>
> >
> > Especially reading "Benchmarks Score shows a little improvoment(0.28%)"
> > and "it may not win in all the scenes", to me it mostly sounds like
> > "minimal impact" -- so who cares?
>
> Even though lock conflicts are eliminated, there is very limited
> performance improvement(even maybe fluctuation), it is not a good
> testcase to show improvement, just show the zone-lock issue, we need to
> find other better testcase, maybe some test on Andriod(heavy use 64K, no
> PMD THP), or LKP maybe give some help?
>
> I will try to find other testcase to show the benefit.

Hi Kefeng,

I wonder if you will see some major improvement on mTHP 64KiB using
the microbench below, which I wrote just now, for example in perf
results and in the time to finish the program.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define DATA_SIZE (2UL * 1024 * 1024)

int main(int argc, char **argv)
{
        /* make 32 concurrent alloc and free of mTHP */
        fork(); fork(); fork(); fork(); fork();

        for (int i = 0; i < 100000; i++) {
                /* each 2MB mapping is populated as a batch of mTHP folios */
                void *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
                                MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
                if (addr == MAP_FAILED) {
                        perror("fail to mmap");
                        return -1;
                }
                memset(addr, 0x11, DATA_SIZE);
                munmap(addr, DATA_SIZE);
        }

        return 0;
}
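
For completeness, one way to build and time it (the file name is just a
placeholder; note the parent never wait()s for the forked children, so
the reported user/sys times mostly reflect the parent process):

  gcc -O2 -o mthp-bench mthp-bench.c
  time ./mthp-bench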

>
> >
> > How much is the cost vs. benefit of just having one sane system
> > configuration?
> >
>
> For arm64 with 4k, five more high-orders(4~8), five more pcplists,
> and for high-orders, we assumes most of them are moveable, but maybe
> not, so enable it by default maybe more fragmentization, see
> 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized
> allocations").
>
Kefeng Wang April 16, 2024, 4:50 a.m. UTC | #7
On 2024/4/16 8:21, Barry Song wrote:
> On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>
>>
>>
>> On 2024/4/15 18:52, David Hildenbrand wrote:
>>> On 15.04.24 10:59, Kefeng Wang wrote:
>>>>
>>>>
>>>> On 2024/4/15 16:18, Barry Song wrote:
>>>>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang
>>>>> <wangkefeng.wang@huawei.com> wrote:
>>>>>>
>>>>>> Both the file pages and anonymous pages support large folio, high-order
>>>>>> pages except PMD_ORDER will also be allocated frequently which could
>>>>>> increase the zone lock contention, allow high-order pages on pcp lists
>>>>>> could reduce the big zone lock contention, but as commit 44042b449872
>>>>>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu
>>>>>> lists")
>>>>>> pointed, it may not win in all the scenes, add a new control sysfs to
>>>>>> enable or disable specified high-order pages stored on PCP lists,
>>>>>> the order
>>>>>> (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) won't be stored on PCP list by
>>>>>> default.
>>>>>
>>>>> This is precisely something Baolin and I have discussed and intended
>>>>> to implement[1],
>>>>> but unfortunately, we haven't had the time to do so.
>>>>
>>>> Indeed, same thing. Recently, we are working on unixbench/lmbench
>>>> optimization, I tested Multi-size THP for anonymous memory by hard-cord
>>>> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1], it shows some improvement but
>>>> not for all cases and not very stable, so re-implemented it by according
>>>> to the user requirement and enable it dynamically.
>>>
>>> I'm wondering, though, if this is really a suitable candidate for a
>>> sysctl toggle. Can anybody really come up with an educated guess for
>>> these values?
>>
>> Not sure this is suitable in sysctl, but mTHP anon is enabled in sysctl,
>> we could trace __alloc_pages() and do order statistic to decide to
>> choose the high-order to be enabled on PCP.
>>
>>>
>>> Especially reading "Benchmarks Score shows a little improvoment(0.28%)"
>>> and "it may not win in all the scenes", to me it mostly sounds like
>>> "minimal impact" -- so who cares?
>>
>> Even though lock conflicts are eliminated, there is very limited
>> performance improvement(even maybe fluctuation), it is not a good
>> testcase to show improvement, just show the zone-lock issue, we need to
>> find other better testcase, maybe some test on Andriod(heavy use 64K, no
>> PMD THP), or LKP maybe give some help?
>>
>> I will try to find other testcase to show the benefit.
> 
> Hi Kefeng,
> 
> I wonder if you will see some major improvements on mTHP 64KiB using
> the below microbench I wrote just now, for example perf and time to
> finish the program
> 
> #define DATA_SIZE (2UL * 1024 * 1024)
> 
> int main(int argc, char **argv)
> {
>          /* make 32 concurrent alloc and free of mTHP */
>          fork(); fork(); fork(); fork(); fork();
> 
>          for (int i = 0; i < 100000; i++) {
>                  void *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
>                                  MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>                  if (addr == MAP_FAILED) {
>                          perror("fail to malloc");
>                          return -1;
>                  }
>                  memset(addr, 0x11, DATA_SIZE);
>                  munmap(addr, DATA_SIZE);
>          }
> 
>          return 0;
> }
> 

1) PCP disabled
	1	2	3	4	5	average		
real	200.41	202.18	203.16	201.54	200.91	201.64	
user	6.49	6.21	6.25	6.31	6.35	6.322		
sys 	193.3	195.39	196.3	194.65	194.01	194.73	
	
2) PCP enabled							
real	198.25	199.26	195.51	199.28	189.12	196.284	   -2.66%
user	6.21	6.02	6.02	6.28	6.21	6.148	   -2.75%
sys 	191.46	192.64	188.96	192.47	182.39	189.584	   -2.64%

For the above test, the elapsed time is reduced by roughly 2.7%.


And re-testing page_fault1 (anon) from will-it-scale:

1) PCP enabled 					
tasks	processes	processes_idle	threads	threads_idle	linear
0	0	100	0	100	0
1	1416915	98.95	1418128	98.95	1418128
20	5327312	79.22	3821312	94.36	28362560
40	9437184	58.58	4463657	94.55	56725120
60	8120003	38.16	4736716	94.61	85087680
80	7356508	18.29	4847824	94.46	113450240
100	7256185	1.48	4870096	94.61	141812800

2) PCP disabled
tasks	processes	processes_idle	threads	threads_idle	linear
0	0	100	0	100	0
1	1365398	98.95	1354502	98.95	1365398
20	5174918	79.22	3722368	94.65	27307960
40	9094265	58.58	4427267	94.82	54615920
60	8021606	38.18	4572896	94.93	81923880
80	7497318	18.2	4637062	94.76	109231840
100	6819897	1.47	4654521	94.63	136539800

------------------------------------
1) vs 2): PCP enabled improves by 3.86%

3) PCP re-enabled					
tasks	processes	processes_idle	threads	threads_idle	linear
0	0	100	0	100	0
1	1419036	98.96	1428403	98.95	1428403
20	5356092	79.23	3851849	94.41	28568060
40	9437184	58.58	4512918	94.63	57136120
60	8252342	38.16	4659552	94.68	85704180
80	7414899	18.26	4790576	94.77	114272240
100	7062902	1.46	4759030	94.64	142840300

4) PCP re-disabled
tasks	processes	processes_idle	threads	threads_idle	linear
0	0	100	0	100	0
1	1352649	98.95	1354806	98.95	1354806
20	5172924	79.22	3719292	94.64	27096120
40	9174505	58.59	4310649	94.93	54192240
60	8021606	38.17	4552960	94.81	81288360
80	7497318	18.18	4671638	94.81	108384480
100	6823926	1.47	4725955	94.64	135480600

------------------------------------
3) vs 4): PCP enabled improves by 5.43%

Average: 4.645%





>>
>>>
>>> How much is the cost vs. benefit of just having one sane system
>>> configuration?
>>>
>>
>> For arm64 with 4k, five more high-orders(4~8), five more pcplists,
>> and for high-orders, we assumes most of them are moveable, but maybe
>> not, so enable it by default maybe more fragmentization, see
>> 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized
>> allocations").
>>
Kefeng Wang April 16, 2024, 4:58 a.m. UTC | #8
On 2024/4/16 12:50, Kefeng Wang wrote:
> 
> 
> On 2024/4/16 8:21, Barry Song wrote:
>> On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang 
>> <wangkefeng.wang@huawei.com> wrote:
>>>
>>>
>>>
>>> On 2024/4/15 18:52, David Hildenbrand wrote:
>>>> On 15.04.24 10:59, Kefeng Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/4/15 16:18, Barry Song wrote:
>>>>>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang
>>>>>> <wangkefeng.wang@huawei.com> wrote:
>>>>>>>
>>>>>>> Both the file pages and anonymous pages support large folio, 
>>>>>>> high-order
>>>>>>> pages except PMD_ORDER will also be allocated frequently which could
>>>>>>> increase the zone lock contention, allow high-order pages on pcp 
>>>>>>> lists
>>>>>>> could reduce the big zone lock contention, but as commit 
>>>>>>> 44042b449872
>>>>>>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu
>>>>>>> lists")
>>>>>>> pointed, it may not win in all the scenes, add a new control 
>>>>>>> sysfs to
>>>>>>> enable or disable specified high-order pages stored on PCP lists,
>>>>>>> the order
>>>>>>> (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) won't be stored on PCP list by
>>>>>>> default.
>>>>>>
>>>>>> This is precisely something Baolin and I have discussed and intended
>>>>>> to implement[1],
>>>>>> but unfortunately, we haven't had the time to do so.
>>>>>
>>>>> Indeed, same thing. Recently, we are working on unixbench/lmbench
>>>>> optimization, I tested Multi-size THP for anonymous memory by 
>>>>> hard-cord
>>>>> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1], it shows some improvement but
>>>>> not for all cases and not very stable, so re-implemented it by 
>>>>> according
>>>>> to the user requirement and enable it dynamically.
>>>>
>>>> I'm wondering, though, if this is really a suitable candidate for a
>>>> sysctl toggle. Can anybody really come up with an educated guess for
>>>> these values?
>>>
>>> Not sure this is suitable in sysctl, but mTHP anon is enabled in sysctl,
>>> we could trace __alloc_pages() and do order statistic to decide to
>>> choose the high-order to be enabled on PCP.
>>>
>>>>
>>>> Especially reading "Benchmarks Score shows a little improvoment(0.28%)"
>>>> and "it may not win in all the scenes", to me it mostly sounds like
>>>> "minimal impact" -- so who cares?
>>>
>>> Even though lock conflicts are eliminated, there is very limited
>>> performance improvement(even maybe fluctuation), it is not a good
>>> testcase to show improvement, just show the zone-lock issue, we need to
>>> find other better testcase, maybe some test on Andriod(heavy use 64K, no
>>> PMD THP), or LKP maybe give some help?
>>>
>>> I will try to find other testcase to show the benefit.
>>
>> Hi Kefeng,
>>
>> I wonder if you will see some major improvements on mTHP 64KiB using
>> the below microbench I wrote just now, for example perf and time to
>> finish the program
>>
>> #define DATA_SIZE (2UL * 1024 * 1024)
>>
>> int main(int argc, char **argv)
>> {
>>          /* make 32 concurrent alloc and free of mTHP */
>>          fork(); fork(); fork(); fork(); fork();
>>
>>          for (int i = 0; i < 100000; i++) {
>>                  void *addr = mmap(NULL, DATA_SIZE, PROT_READ | 
>> PROT_WRITE,
>>                                  MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>>                  if (addr == MAP_FAILED) {
>>                          perror("fail to malloc");
>>                          return -1;
>>                  }
>>                  memset(addr, 0x11, DATA_SIZE);
>>                  munmap(addr, DATA_SIZE);
>>          }
>>
>>          return 0;
>> }
>>

Rebased on next-20240415,

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled

Compare with
   echo 0 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
   echo 1 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled

> 
> 1) PCP disabled
>      1    2    3    4    5    average
> real    200.41    202.18    203.16    201.54    200.91    201.64
> user    6.49    6.21    6.25    6.31    6.35    6.322
> sys     193.3    195.39    196.3    194.65    194.01    194.73
> 
> 2) PCP enabled
> real    198.25    199.26    195.51    199.28    189.12    196.284       
> -2.66%
> user    6.21    6.02    6.02    6.28    6.21    6.148       -2.75%
> sys     191.46    192.64    188.96    192.47    182.39    189.584       
> -2.64%
> 
> for above test, time reduce 2.x%
> 
> 
> And re-test page_fault1(anon) from will-it-scale
> 
> 1) PCP enabled
> tasks    processes    processes_idle    threads    threads_idle    linear
> 0    0    100    0    100    0
> 1    1416915    98.95    1418128    98.95    1418128
> 20    5327312    79.22    3821312    94.36    28362560
> 40    9437184    58.58    4463657    94.55    56725120
> 60    8120003    38.16    4736716    94.61    85087680
> 80    7356508    18.29    4847824    94.46    113450240
> 100    7256185    1.48    4870096    94.61    141812800
> 
> 2) PCP disabled
> tasks    processes    processes_idle    threads    threads_idle    linear
> 0    0    100    0    100    0
> 1    1365398    98.95    1354502    98.95    1365398
> 20    5174918    79.22    3722368    94.65    27307960
> 40    9094265    58.58    4427267    94.82    54615920
> 60    8021606    38.18    4572896    94.93    81923880
> 80    7497318    18.2    4637062    94.76    109231840
> 100    6819897    1.47    4654521    94.63    136539800
> 
> ------------------------------------
> 1) vs 2)  pcp enabled improve 3.86%
> 
> 3) PCP re-enabled
> tasks    processes    processes_idle    threads    threads_idle    linear
> 0    0    100    0    100    0
> 1    1419036    98.96    1428403    98.95    1428403
> 20    5356092    79.23    3851849    94.41    28568060
> 40    9437184    58.58    4512918    94.63    57136120
> 60    8252342    38.16    4659552    94.68    85704180
> 80    7414899    18.26    4790576    94.77    114272240
> 100    7062902    1.46    4759030    94.64    142840300
> 
> 4) PCP re-disabled
> tasks    processes    processes_idle    threads    threads_idle    linear
> 0    0    100    0    100    0
> 1    1352649    98.95    1354806    98.95    1354806
> 20    5172924    79.22    3719292    94.64    27096120
> 40    9174505    58.59    4310649    94.93    54192240
> 60    8021606    38.17    4552960    94.81    81288360
> 80    7497318    18.18    4671638    94.81    108384480
> 100    6823926    1.47    4725955    94.64    135480600
> 
> ------------------------------------
> 3) vs 4)  pcp enabled improve 5.43%
> 
> Average: 4.645%
> 
> 
> 
> 
> 
>>>
>>>>
>>>> How much is the cost vs. benefit of just having one sane system
>>>> configuration?
>>>>
>>>
>>> For arm64 with 4k, five more high-orders(4~8), five more pcplists,
>>> and for high-orders, we assumes most of them are moveable, but maybe
>>> not, so enable it by default maybe more fragmentization, see
>>> 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized
>>> allocations").
>>>
>
Barry Song April 16, 2024, 5:26 a.m. UTC | #9
On Tue, Apr 16, 2024 at 4:58 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
>
>
> On 2024/4/16 12:50, Kefeng Wang wrote:
> >
> >
> > On 2024/4/16 8:21, Barry Song wrote:
> >> On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang
> >> <wangkefeng.wang@huawei.com> wrote:
> >>>
> >>>
> >>>
> >>> On 2024/4/15 18:52, David Hildenbrand wrote:
> >>>> On 15.04.24 10:59, Kefeng Wang wrote:
> >>>>>
> >>>>>
> >>>>> On 2024/4/15 16:18, Barry Song wrote:
> >>>>>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang
> >>>>>> <wangkefeng.wang@huawei.com> wrote:
> >>>>>>>
> >>>>>>> Both the file pages and anonymous pages support large folio,
> >>>>>>> high-order
> >>>>>>> pages except PMD_ORDER will also be allocated frequently which could
> >>>>>>> increase the zone lock contention, allow high-order pages on pcp
> >>>>>>> lists
> >>>>>>> could reduce the big zone lock contention, but as commit
> >>>>>>> 44042b449872
> >>>>>>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu
> >>>>>>> lists")
> >>>>>>> pointed, it may not win in all the scenes, add a new control
> >>>>>>> sysfs to
> >>>>>>> enable or disable specified high-order pages stored on PCP lists,
> >>>>>>> the order
> >>>>>>> (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) won't be stored on PCP list by
> >>>>>>> default.
> >>>>>>
> >>>>>> This is precisely something Baolin and I have discussed and intended
> >>>>>> to implement[1],
> >>>>>> but unfortunately, we haven't had the time to do so.
> >>>>>
> >>>>> Indeed, same thing. Recently, we are working on unixbench/lmbench
> >>>>> optimization, I tested Multi-size THP for anonymous memory by
> >>>>> hard-cord
> >>>>> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1], it shows some improvement but
> >>>>> not for all cases and not very stable, so re-implemented it by
> >>>>> according
> >>>>> to the user requirement and enable it dynamically.
> >>>>
> >>>> I'm wondering, though, if this is really a suitable candidate for a
> >>>> sysctl toggle. Can anybody really come up with an educated guess for
> >>>> these values?
> >>>
> >>> Not sure this is suitable in sysctl, but mTHP anon is enabled in sysctl,
> >>> we could trace __alloc_pages() and do order statistic to decide to
> >>> choose the high-order to be enabled on PCP.
> >>>
> >>>>
> >>>> Especially reading "Benchmarks Score shows a little improvoment(0.28%)"
> >>>> and "it may not win in all the scenes", to me it mostly sounds like
> >>>> "minimal impact" -- so who cares?
> >>>
> >>> Even though lock conflicts are eliminated, there is very limited
> >>> performance improvement(even maybe fluctuation), it is not a good
> >>> testcase to show improvement, just show the zone-lock issue, we need to
> >>> find other better testcase, maybe some test on Andriod(heavy use 64K, no
> >>> PMD THP), or LKP maybe give some help?
> >>>
> >>> I will try to find other testcase to show the benefit.
> >>
> >> Hi Kefeng,
> >>
> >> I wonder if you will see some major improvements on mTHP 64KiB using
> >> the below microbench I wrote just now, for example perf and time to
> >> finish the program
> >>
> >> #define DATA_SIZE (2UL * 1024 * 1024)
> >>
> >> int main(int argc, char **argv)
> >> {
> >>          /* make 32 concurrent alloc and free of mTHP */
> >>          fork(); fork(); fork(); fork(); fork();
> >>
> >>          for (int i = 0; i < 100000; i++) {
> >>                  void *addr = mmap(NULL, DATA_SIZE, PROT_READ |
> >> PROT_WRITE,
> >>                                  MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> >>                  if (addr == MAP_FAILED) {
> >>                          perror("fail to malloc");
> >>                          return -1;
> >>                  }
> >>                  memset(addr, 0x11, DATA_SIZE);
> >>                  munmap(addr, DATA_SIZE);
> >>          }
> >>
> >>          return 0;
> >> }
> >>
>
> Rebased on next-20240415,
>
> echo never > /sys/kernel/mm/transparent_hugepage/enabled
> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
>
> Compare with
>    echo 0 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
>    echo 1 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
>
> >
> > 1) PCP disabled
> >      1    2    3    4    5    average
> > real    200.41    202.18    203.16    201.54    200.91    201.64
> > user    6.49    6.21    6.25    6.31    6.35    6.322
> > sys     193.3    195.39    196.3    194.65    194.01    194.73
> >
> > 2) PCP enabled
> > real    198.25    199.26    195.51    199.28    189.12    196.284
> > -2.66%
> > user    6.21    6.02    6.02    6.28    6.21    6.148       -2.75%
> > sys     191.46    192.64    188.96    192.47    182.39    189.584
> > -2.64%
> >
> > for above test, time reduce 2.x%

This is an improvement from 0.28%, but it's still below my expectations.
I suspect it's due to mTHP reducing the frequency of allocations and frees.
Running the same test on order-0 might yield much better results.

I suppose that as the order increases, PCP exhibits fewer improvements
since both allocation and release activities decrease.

Conversely, we also employ PCP for THP (2MB). Do we have any data
demonstrating that such large-size allocations benefited from PCP
before?

> >
> >
> > And re-test page_fault1(anon) from will-it-scale
> >
> > 1) PCP enabled
> > tasks    processes    processes_idle    threads    threads_idle    linear
> > 0    0    100    0    100    0
> > 1    1416915    98.95    1418128    98.95    1418128
> > 20    5327312    79.22    3821312    94.36    28362560
> > 40    9437184    58.58    4463657    94.55    56725120
> > 60    8120003    38.16    4736716    94.61    85087680
> > 80    7356508    18.29    4847824    94.46    113450240
> > 100    7256185    1.48    4870096    94.61    141812800
> >
> > 2) PCP disabled
> > tasks    processes    processes_idle    threads    threads_idle    linear
> > 0    0    100    0    100    0
> > 1    1365398    98.95    1354502    98.95    1365398
> > 20    5174918    79.22    3722368    94.65    27307960
> > 40    9094265    58.58    4427267    94.82    54615920
> > 60    8021606    38.18    4572896    94.93    81923880
> > 80    7497318    18.2    4637062    94.76    109231840
> > 100    6819897    1.47    4654521    94.63    136539800
> >
> > ------------------------------------
> > 1) vs 2)  pcp enabled improve 3.86%
> >
> > 3) PCP re-enabled
> > tasks    processes    processes_idle    threads    threads_idle    linear
> > 0    0    100    0    100    0
> > 1    1419036    98.96    1428403    98.95    1428403
> > 20    5356092    79.23    3851849    94.41    28568060
> > 40    9437184    58.58    4512918    94.63    57136120
> > 60    8252342    38.16    4659552    94.68    85704180
> > 80    7414899    18.26    4790576    94.77    114272240
> > 100    7062902    1.46    4759030    94.64    142840300
> >
> > 4) PCP re-disabled
> > tasks    processes    processes_idle    threads    threads_idle    linear
> > 0    0    100    0    100    0
> > 1    1352649    98.95    1354806    98.95    1354806
> > 20    5172924    79.22    3719292    94.64    27096120
> > 40    9174505    58.59    4310649    94.93    54192240
> > 60    8021606    38.17    4552960    94.81    81288360
> > 80    7497318    18.18    4671638    94.81    108384480
> > 100    6823926    1.47    4725955    94.64    135480600
> >
> > ------------------------------------
> > 3) vs 4)  pcp enabled improve 5.43%
> >
> > Average: 4.645%
> >
> >
> >
> >
> >
> >>>
> >>>>
> >>>> How much is the cost vs. benefit of just having one sane system
> >>>> configuration?
> >>>>
> >>>
> >>> For arm64 with 4k, five more high-orders(4~8), five more pcplists,
> >>> and for high-orders, we assumes most of them are moveable, but maybe
> >>> not, so enable it by default maybe more fragmentization, see
> >>> 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized
> >>> allocations").
> >>>

Thanks
Barry
David Hildenbrand April 16, 2024, 7:03 a.m. UTC | #10
On 16.04.24 07:26, Barry Song wrote:
> On Tue, Apr 16, 2024 at 4:58 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>
>>
>>
>> On 2024/4/16 12:50, Kefeng Wang wrote:
>>>
>>>
>>> On 2024/4/16 8:21, Barry Song wrote:
>>>> On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang
>>>> <wangkefeng.wang@huawei.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2024/4/15 18:52, David Hildenbrand wrote:
>>>>>> On 15.04.24 10:59, Kefeng Wang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2024/4/15 16:18, Barry Song wrote:
>>>>>>>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang
>>>>>>>> <wangkefeng.wang@huawei.com> wrote:
>>>>>>>>>
>>>>>>>>> Both the file pages and anonymous pages support large folio,
>>>>>>>>> high-order
>>>>>>>>> pages except PMD_ORDER will also be allocated frequently which could
>>>>>>>>> increase the zone lock contention, allow high-order pages on pcp
>>>>>>>>> lists
>>>>>>>>> could reduce the big zone lock contention, but as commit
>>>>>>>>> 44042b449872
>>>>>>>>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu
>>>>>>>>> lists")
>>>>>>>>> pointed, it may not win in all the scenes, add a new control
>>>>>>>>> sysfs to
>>>>>>>>> enable or disable specified high-order pages stored on PCP lists,
>>>>>>>>> the order
>>>>>>>>> (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) won't be stored on PCP list by
>>>>>>>>> default.
>>>>>>>>
>>>>>>>> This is precisely something Baolin and I have discussed and intended
>>>>>>>> to implement[1],
>>>>>>>> but unfortunately, we haven't had the time to do so.
>>>>>>>
>>>>>>> Indeed, same thing. Recently, we are working on unixbench/lmbench
>>>>>>> optimization, I tested Multi-size THP for anonymous memory by
>>>>>>> hard-cord
>>>>>>> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1], it shows some improvement but
>>>>>>> not for all cases and not very stable, so re-implemented it by
>>>>>>> according
>>>>>>> to the user requirement and enable it dynamically.
>>>>>>
>>>>>> I'm wondering, though, if this is really a suitable candidate for a
>>>>>> sysctl toggle. Can anybody really come up with an educated guess for
>>>>>> these values?
>>>>>
>>>>> Not sure this is suitable in sysctl, but mTHP anon is enabled in sysctl,
>>>>> we could trace __alloc_pages() and do order statistic to decide to
>>>>> choose the high-order to be enabled on PCP.
>>>>>
>>>>>>
>>>>>> Especially reading "Benchmarks Score shows a little improvoment(0.28%)"
>>>>>> and "it may not win in all the scenes", to me it mostly sounds like
>>>>>> "minimal impact" -- so who cares?
>>>>>
>>>>> Even though lock conflicts are eliminated, there is very limited
>>>>> performance improvement(even maybe fluctuation), it is not a good
>>>>> testcase to show improvement, just show the zone-lock issue, we need to
>>>>> find other better testcase, maybe some test on Andriod(heavy use 64K, no
>>>>> PMD THP), or LKP maybe give some help?
>>>>>
>>>>> I will try to find other testcase to show the benefit.
>>>>
>>>> Hi Kefeng,
>>>>
>>>> I wonder if you will see some major improvements on mTHP 64KiB using
>>>> the below microbench I wrote just now, for example perf and time to
>>>> finish the program
>>>>
>>>> #define DATA_SIZE (2UL * 1024 * 1024)
>>>>
>>>> int main(int argc, char **argv)
>>>> {
>>>>           /* make 32 concurrent alloc and free of mTHP */
>>>>           fork(); fork(); fork(); fork(); fork();
>>>>
>>>>           for (int i = 0; i < 100000; i++) {
>>>>                   void *addr = mmap(NULL, DATA_SIZE, PROT_READ |
>>>> PROT_WRITE,
>>>>                                   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>>>>                   if (addr == MAP_FAILED) {
>>>>                           perror("fail to malloc");
>>>>                           return -1;
>>>>                   }
>>>>                   memset(addr, 0x11, DATA_SIZE);
>>>>                   munmap(addr, DATA_SIZE);
>>>>           }
>>>>
>>>>           return 0;
>>>> }
>>>>
>>
>> Rebased on next-20240415,
>>
>> echo never > /sys/kernel/mm/transparent_hugepage/enabled
>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
>>
>> Compare with
>>     echo 0 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
>>     echo 1 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
>>
>>>
>>> 1) PCP disabled
>>>       1    2    3    4    5    average
>>> real    200.41    202.18    203.16    201.54    200.91    201.64
>>> user    6.49    6.21    6.25    6.31    6.35    6.322
>>> sys     193.3    195.39    196.3    194.65    194.01    194.73
>>>
>>> 2) PCP enabled
>>> real    198.25    199.26    195.51    199.28    189.12    196.284
>>> -2.66%
>>> user    6.21    6.02    6.02    6.28    6.21    6.148       -2.75%
>>> sys     191.46    192.64    188.96    192.47    182.39    189.584
>>> -2.64%
>>>
>>> for above test, time reduce 2.x%
> 
> This is an improvement from 0.28%, but it's still below my expectations.

Yes, it's noise. Maybe we need a system with more Cores/Sockets? But it 
does feel a bit like we're trying to come up with the problem after we 
have a solution; I'd have thought some existing benchmark could 
highlight if that is worth it.
Kefeng Wang April 16, 2024, 8:06 a.m. UTC | #11
On 2024/4/16 15:03, David Hildenbrand wrote:
> On 16.04.24 07:26, Barry Song wrote:
>> On Tue, Apr 16, 2024 at 4:58 PM Kefeng Wang 
>> <wangkefeng.wang@huawei.com> wrote:
>>>
>>>
>>>
>>> On 2024/4/16 12:50, Kefeng Wang wrote:
>>>>
>>>>
>>>> On 2024/4/16 8:21, Barry Song wrote:
>>>>> On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang
>>>>> <wangkefeng.wang@huawei.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2024/4/15 18:52, David Hildenbrand wrote:
>>>>>>> On 15.04.24 10:59, Kefeng Wang wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2024/4/15 16:18, Barry Song wrote:
>>>>>>>>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang
>>>>>>>>> <wangkefeng.wang@huawei.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Both the file pages and anonymous pages support large folio,
>>>>>>>>>> high-order
>>>>>>>>>> pages except PMD_ORDER will also be allocated frequently which 
>>>>>>>>>> could
>>>>>>>>>> increase the zone lock contention, allow high-order pages on pcp
>>>>>>>>>> lists
>>>>>>>>>> could reduce the big zone lock contention, but as commit
>>>>>>>>>> 44042b449872
>>>>>>>>>> ("mm/page_alloc: allow high-order pages to be stored on the 
>>>>>>>>>> per-cpu
>>>>>>>>>> lists")
>>>>>>>>>> pointed, it may not win in all the scenes, add a new control
>>>>>>>>>> sysfs to
>>>>>>>>>> enable or disable specified high-order pages stored on PCP lists,
>>>>>>>>>> the order
>>>>>>>>>> (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) won't be stored on PCP 
>>>>>>>>>> list by
>>>>>>>>>> default.
>>>>>>>>>
>>>>>>>>> This is precisely something Baolin and I have discussed and 
>>>>>>>>> intended
>>>>>>>>> to implement[1],
>>>>>>>>> but unfortunately, we haven't had the time to do so.
>>>>>>>>
>>>>>>>> Indeed, same thing. Recently, we are working on unixbench/lmbench
>>>>>>>> optimization, I tested Multi-size THP for anonymous memory by
>>>>>>>> hard-cord
>>>>>>>> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1], it shows some 
>>>>>>>> improvement but
>>>>>>>> not for all cases and not very stable, so re-implemented it by
>>>>>>>> according
>>>>>>>> to the user requirement and enable it dynamically.
>>>>>>>
>>>>>>> I'm wondering, though, if this is really a suitable candidate for a
>>>>>>> sysctl toggle. Can anybody really come up with an educated guess for
>>>>>>> these values?
>>>>>>
>>>>>> Not sure this is suitable in sysctl, but mTHP anon is enabled in 
>>>>>> sysctl,
>>>>>> we could trace __alloc_pages() and do order statistic to decide to
>>>>>> choose the high-order to be enabled on PCP.
>>>>>>
>>>>>>>
>>>>>>> Especially reading "Benchmarks Score shows a little 
>>>>>>> improvoment(0.28%)"
>>>>>>> and "it may not win in all the scenes", to me it mostly sounds like
>>>>>>> "minimal impact" -- so who cares?
>>>>>>
>>>>>> Even though lock conflicts are eliminated, there is very limited
>>>>>> performance improvement(even maybe fluctuation), it is not a good
>>>>>> testcase to show improvement, just show the zone-lock issue, we 
>>>>>> need to
>>>>>> find other better testcase, maybe some test on Andriod(heavy use 
>>>>>> 64K, no
>>>>>> PMD THP), or LKP maybe give some help?
>>>>>>
>>>>>> I will try to find other testcase to show the benefit.
>>>>>
>>>>> Hi Kefeng,
>>>>>
>>>>> I wonder if you will see some major improvements on mTHP 64KiB using
>>>>> the below microbench I wrote just now, for example perf and time to
>>>>> finish the program
>>>>>
>>>>> #define DATA_SIZE (2UL * 1024 * 1024)
>>>>>
>>>>> int main(int argc, char **argv)
>>>>> {
>>>>>           /* make 32 concurrent alloc and free of mTHP */
>>>>>           fork(); fork(); fork(); fork(); fork();
>>>>>
>>>>>           for (int i = 0; i < 100000; i++) {
>>>>>                   void *addr = mmap(NULL, DATA_SIZE, PROT_READ |
>>>>> PROT_WRITE,
>>>>>                                   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>>>>>                   if (addr == MAP_FAILED) {
>>>>>                           perror("fail to malloc");
>>>>>                           return -1;
>>>>>                   }
>>>>>                   memset(addr, 0x11, DATA_SIZE);
>>>>>                   munmap(addr, DATA_SIZE);
>>>>>           }
>>>>>
>>>>>           return 0;
>>>>> }
>>>>>
>>>
>>> Rebased on next-20240415,
>>>
>>> echo never > /sys/kernel/mm/transparent_hugepage/enabled
>>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
>>>
>>> Compare with
>>>     echo 0 > 
>>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
>>>     echo 1 > 
>>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
>>>
>>>>
>>>> 1) PCP disabled
>>>>       1    2    3    4    5    average
>>>> real    200.41    202.18    203.16    201.54    200.91    201.64
>>>> user    6.49    6.21    6.25    6.31    6.35    6.322
>>>> sys     193.3    195.39    196.3    194.65    194.01    194.73
>>>>
>>>> 2) PCP enabled
>>>> real    198.25    199.26    195.51    199.28    189.12    196.284
>>>> -2.66%
>>>> user    6.21    6.02    6.02    6.28    6.21    6.148       -2.75%
>>>> sys     191.46    192.64    188.96    192.47    182.39    189.584
>>>> -2.64%
>>>>
>>>> for above test, time reduce 2.x%
>>
>> This is an improvement from 0.28%, but it's still below my expectations.
> 
> Yes, it's noise. Maybe we need a system with more Cores/Sockets? But it 
> does feel a bit like we're trying to come up with the problem after we 
> have a solution; I'd have thought some existing benchmark could 
> highlight if that is worth it.


On a 96-core machine running 129 threads, a quick test using
pcp_enabled to control hugepages-2048kB shows no big improvement for 2M:

PCP enabled
	1	2	3	average
real	221.8	225.6	221.5	222.9666667
user	14.91	14.91	17.05	15.62333333
sys 	141.91	159.25	156.23	152.4633333
				
PCP disabled				
real	230.76	231.39	228.39	230.18
user	15.47	15.88	17.5	16.28333333
sys 	159.07	162.32	159.09	160.16


From 44042b449872 ("mm/page_alloc: allow high-order pages to be stored
on the per-cpu lists"), the improvement there also seems limited,

  netperf-udp
                                   5.13.0-rc2             5.13.0-rc2
                             mm-pcpburst-v3r4   mm-pcphighorder-v1r7
  Hmean     send-64         261.46 (   0.00%)      266.30 *   1.85%*
  Hmean     send-128        516.35 (   0.00%)      536.78 *   3.96%*
  Hmean     send-256       1014.13 (   0.00%)     1034.63 *   2.02%*
  Hmean     send-1024      3907.65 (   0.00%)     4046.11 *   3.54%*
  Hmean     send-2048      7492.93 (   0.00%)     7754.85 *   3.50%*
  Hmean     send-3312     11410.04 (   0.00%)    11772.32 *   3.18%*
  Hmean     send-4096     13521.95 (   0.00%)    13912.34 *   2.89%*
  Hmean     send-8192     21660.50 (   0.00%)    22730.72 *   4.94%*
  Hmean     send-16384    31902.32 (   0.00%)    32637.50 *   2.30%*