[v3,0/4] mm/folio_zero_user: add multi-page clearing

Message ID: 20250414034607.762653-1-ankur.a.arora@oracle.com

Message

Ankur Arora April 14, 2025, 3:46 a.m. UTC
This series adds multi-page clearing for hugepages. It is a rework
of [1] which took a detour through PREEMPT_LAZY [2].

Why multi-page clearing?: it improves on the current page-at-a-time
approach by giving the processor a hint about the real size of the
region being cleared. A processor can use this hint to, for instance,
elide cacheline allocation when clearing a large region.

This particular optimization is done by REP; STOS on AMD Zen, where
regions larger than the L3 size are cleared with non-temporal stores.

This results in significantly better performance. 

We also see performance improvements in cases where this optimization is
unavailable (pg-sz=2MB on AMD, and pg-sz=2MB|1GB on Intel): REP; STOS is
typically microcoded, and that cost can now be amortized over larger
regions; in addition, the size hint lets the hardware prefetcher do a
better job.

Milan (EPYC 7J13, boost=0, preempt=full|lazy):

                 mm/folio_zero_user    x86/folio_zero_user     change
                  (GB/s  +- stddev)      (GB/s  +- stddev)

  pg-sz=1GB       16.51  +- 0.54%        42.80  +-  3.48%    + 159.2%
  pg-sz=2MB       11.89  +- 0.78%        16.12  +-  0.12%    +  35.5%

Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):

                 mm/folio_zero_user    x86/folio_zero_user     change
                  (GB/s +- stddev)      (GB/s +- stddev)

  pg-sz=1GB       8.01  +- 0.24%        11.26 +- 0.48%       + 40.57%
  pg-sz=2MB       7.95  +- 0.30%        10.90 +- 0.26%       + 37.10%

Interaction with preemption: as discussed in [3], zeroing large
regions with string instructions doesn't work well with cooperative
preemption models, which need regular invocations of cond_resched(). So,
this optimization is limited to the preemptible models (full, lazy).

This is done by overriding __folio_zero_user() -- which does the usual
page-at-a-time zeroing -- with an architecture-optimized version, but
only when running under a preemptible model.
As such, this ties an architecture-specific optimization rather closely
to preemption. That should be easy enough to change, but this seemed
like the simplest approach.

Comments appreciated!

Also at:
  github.com/terminus/linux clear-pages-preempt.v1


[1] https://lore.kernel.org/lkml/20230830184958.2333078-1-ankur.a.arora@oracle.com/
[2] https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/
[3] https://lore.kernel.org/lkml/CAHk-=wj9En-BC4t7J9xFZOws5ShwaR9yor7FxHZr8CTVyEP_+Q@mail.gmail.com/

Ankur Arora (4):
  x86/clear_page: extend clear_page*() for multi-page clearing
  x86/clear_page: add clear_pages()
  huge_page: allow arch override for folio_zero_user()
  x86/folio_zero_user: multi-page clearing

 arch/x86/include/asm/page_32.h |  6 ++++
 arch/x86/include/asm/page_64.h | 27 +++++++++------
 arch/x86/lib/clear_page_64.S   | 52 +++++++++++++++++++++--------
 arch/x86/mm/Makefile           |  1 +
 arch/x86/mm/memory.c           | 60 ++++++++++++++++++++++++++++++++++
 include/linux/mm.h             |  1 +
 mm/memory.c                    | 38 ++++++++++++++++++---
 7 files changed, 156 insertions(+), 29 deletions(-)
 create mode 100644 arch/x86/mm/memory.c

Comments

Ingo Molnar April 14, 2025, 5:34 a.m. UTC | #1
* Ankur Arora <ankur.a.arora@oracle.com> wrote:

> Ankur Arora (4):
>   x86/clear_page: extend clear_page*() for multi-page clearing
>   x86/clear_page: add clear_pages()
>   huge_page: allow arch override for folio_zero_user()
>   x86/folio_zero_user: multi-page clearing

This is not how x86 commit titles should look. Please take a 
look at the titles of previous commits to the x86 files you are 
modifying and follow that style. (Capitalization, use of verbs, etc.)

Thanks,

	Ingo
Ingo Molnar April 14, 2025, 6:36 a.m. UTC | #2
* Ankur Arora <ankur.a.arora@oracle.com> wrote:

> We also see performance improvement for cases where this optimization is
> unavailable (pg-sz=2MB on AMD, and pg-sz=2MB|1GB on Intel) because
> REP; STOS is typically microcoded which can now be amortized over
> larger regions and the hint allows the hardware prefetcher to do a
> better job.
> 
> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
> 
>                  mm/folio_zero_user    x86/folio_zero_user     change
>                   (GB/s  +- stddev)      (GB/s  +- stddev)
> 
>   pg-sz=1GB       16.51  +- 0.54%        42.80  +-  3.48%    + 159.2%
>   pg-sz=2MB       11.89  +- 0.78%        16.12  +-  0.12%    +  35.5%
> 
> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
> 
>                  mm/folio_zero_user    x86/folio_zero_user     change
>                   (GB/s +- stddev)      (GB/s +- stddev)
> 
>   pg-sz=1GB       8.01  +- 0.24%        11.26 +- 0.48%       + 40.57%
>   pg-sz=2MB       7.95  +- 0.30%        10.90 +- 0.26%       + 37.10%

How was this measured? Could you integrate this measurement as a new 
tools/perf/bench/ subcommand so that people can try it on different 
systems, etc.? There's already a 'perf bench mem' subcommand space 
where this feature could be added to.

Thanks,

	Ingo
Ankur Arora April 14, 2025, 7:19 p.m. UTC | #3
Ingo Molnar <mingo@kernel.org> writes:

> * Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> We also see performance improvement for cases where this optimization is
>> unavailable (pg-sz=2MB on AMD, and pg-sz=2MB|1GB on Intel) because
>> REP; STOS is typically microcoded which can now be amortized over
>> larger regions and the hint allows the hardware prefetcher to do a
>> better job.
>>
>> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
>>
>>                  mm/folio_zero_user    x86/folio_zero_user     change
>>                   (GB/s  +- stddev)      (GB/s  +- stddev)
>>
>>   pg-sz=1GB       16.51  +- 0.54%        42.80  +-  3.48%    + 159.2%
>>   pg-sz=2MB       11.89  +- 0.78%        16.12  +-  0.12%    +  35.5%
>>
>> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
>>
>>                  mm/folio_zero_user    x86/folio_zero_user     change
>>                   (GB/s +- stddev)      (GB/s +- stddev)
>>
>>   pg-sz=1GB       8.01  +- 0.24%        11.26 +- 0.48%       + 40.57%
>>   pg-sz=2MB       7.95  +- 0.30%        10.90 +- 0.26%       + 37.10%
>
> How was this measured? Could you integrate this measurement as a new
> tools/perf/bench/ subcommand so that people can try it on different
> systems, etc.? There's already a 'perf bench mem' subcommand space
> where this feature could be added to.

This was a trivial standalone mmap workload, similar to what qemu does
when creating a VM (really, any hugetlb mmap()).

x86-64-stosq (lib/memset_64.S::__memset) should have the same performance
characteristics, but it uses malloc() for allocation.

For this workload we want to control the allocation path as well. Let me
see if it makes sense to extend perf bench mem memset to optionally allocate
via mmap(MAP_HUGETLB) or add a new workload under perf bench mem which
does that.

Thanks for the review!

--
ankur
Ankur Arora April 14, 2025, 7:30 p.m. UTC | #4
Ingo Molnar <mingo@kernel.org> writes:

> * Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> Ankur Arora (4):
>>   x86/clear_page: extend clear_page*() for multi-page clearing
>>   x86/clear_page: add clear_pages()
>>   huge_page: allow arch override for folio_zero_user()
>>   x86/folio_zero_user: multi-page clearing
>
> These are not how x86 commit titles should look like. Please take a
> look at the titles of previous commits to the x86 files you are
> modifying and follow that style. (Capitalization, use of verbs, etc.)

Ack. Will fix.

--
ankur
Zi Yan April 15, 2025, 7:10 p.m. UTC | #5
On 13 Apr 2025, at 23:46, Ankur Arora wrote:

> This series adds multi-page clearing for hugepages. It is a rework
> of [1] which took a detour through PREEMPT_LAZY [2].
>
> Why multi-page clearing?: multi-page clearing improves upon the
> current page-at-a-time approach by providing the processor with a
> hint as to the real region size. A processor could use this hint to,
> for instance, elide cacheline allocation when clearing a large
> region.
>
> This optimization in particular is done by REP; STOS on AMD Zen
> where regions larger than L3-size use non-temporal stores.
>
> This results in significantly better performance.

Do you have init_on_alloc=1 in your kernel?
With that, pages coming from the buddy allocator are zeroed
in post_alloc_hook() by kernel_init_pages(), which is a for loop
of clear_highpage_kasan_tagged(), a wrapper around clear_page().
And folio_zero_user() is not used.

At least Debian, Fedora, and Ubuntu by default have
CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y, which means init_on_alloc=1.

Maybe kernel_init_pages() should get your optimization as well,
unless you only target hugetlb pages.

Best Regards,
Yan, Zi