Message ID | 20250414034607.762653-1-ankur.a.arora@oracle.com
---|---
Series | mm/folio_zero_user: add multi-page clearing
* Ankur Arora <ankur.a.arora@oracle.com> wrote:

> Ankur Arora (4):
>   x86/clear_page: extend clear_page*() for multi-page clearing
>   x86/clear_page: add clear_pages()
>   huge_page: allow arch override for folio_zero_user()
>   x86/folio_zero_user: multi-page clearing

These are not how x86 commit titles should look. Please take a look
at the titles of previous commits to the x86 files you are modifying
and follow that style. (Capitalization, use of verbs, etc.)

Thanks,

	Ingo
* Ankur Arora <ankur.a.arora@oracle.com> wrote:

> We also see performance improvement for cases where this optimization
> is unavailable (pg-sz=2MB on AMD, and pg-sz=2MB|1GB on Intel) because
> REP; STOS is typically microcoded, a cost which can now be amortized
> over larger regions, and the hint allows the hardware prefetcher to
> do a better job.
>
> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
>
>              mm/folio_zero_user   x86/folio_zero_user   change
>              (GB/s +- stddev)     (GB/s +- stddev)
>
>  pg-sz=1GB   16.51 +- 0.54%       42.80 +- 3.48%        + 159.2%
>  pg-sz=2MB   11.89 +- 0.78%       16.12 +- 0.12%        +  35.5%
>
> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
>
>              mm/folio_zero_user   x86/folio_zero_user   change
>              (GB/s +- stddev)     (GB/s +- stddev)
>
>  pg-sz=1GB    8.01 +- 0.24%       11.26 +- 0.48%        + 40.57%
>  pg-sz=2MB    7.95 +- 0.30%       10.90 +- 0.26%        + 37.10%

How was this measured? Could you integrate this measurement as a new
tools/perf/bench/ subcommand so that people can try it on different
systems, etc.? There's already a 'perf bench mem' subcommand space
where this feature could be added.

Thanks,

	Ingo
Ingo Molnar <mingo@kernel.org> writes:

> * Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> We also see performance improvement for cases where this optimization
>> is unavailable (pg-sz=2MB on AMD, and pg-sz=2MB|1GB on Intel) because
>> REP; STOS is typically microcoded, a cost which can now be amortized
>> over larger regions, and the hint allows the hardware prefetcher to
>> do a better job.
>>
>> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
>>
>>              mm/folio_zero_user   x86/folio_zero_user   change
>>              (GB/s +- stddev)     (GB/s +- stddev)
>>
>>  pg-sz=1GB   16.51 +- 0.54%       42.80 +- 3.48%        + 159.2%
>>  pg-sz=2MB   11.89 +- 0.78%       16.12 +- 0.12%        +  35.5%
>>
>> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
>>
>>              mm/folio_zero_user   x86/folio_zero_user   change
>>              (GB/s +- stddev)     (GB/s +- stddev)
>>
>>  pg-sz=1GB    8.01 +- 0.24%       11.26 +- 0.48%        + 40.57%
>>  pg-sz=2MB    7.95 +- 0.30%       10.90 +- 0.26%        + 37.10%
>
> How was this measured? Could you integrate this measurement as a new
> tools/perf/bench/ subcommand so that people can try it on different
> systems, etc.? There's already a 'perf bench mem' subcommand space
> where this feature could be added.

This was a trivial standalone mmap workload, similar to what qemu does
when creating a VM -- really, any hugetlb mmap().

x86-64-stosq (lib/memset_64.S::__memset) should have the same
performance characteristics, but it uses malloc() for allocation.
For this workload we want to control the allocation path as well.

Let me see if it makes sense to extend 'perf bench mem memset' to
optionally allocate via mmap(MAP_HUGETLB), or to add a new workload
under 'perf bench mem' which does that.

Thanks for the review!

--
ankur
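For illustration, a workload of this shape can be approximated with a
small program along the following lines. This is a sketch, not the
tool that produced the numbers above; the mapping size, hugepage size,
and output format are assumptions, and hugepages must be reserved
beforehand (e.g. via
/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define MAP_SIZE (8UL << 30)  /* 8 GB mapping (assumed) */
#define HPAGE_SZ (2UL << 20)  /* 2 MB hugepages; add MAP_HUGE_1GB for 1 GB */

int main(void)
{
	struct timespec t0, t1;
	double secs;
	char *buf;

	buf = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	/* First touch of each hugepage faults it in; the kernel zeroes
	 * the whole page (folio_zero_user()) on that fault path. */
	for (size_t off = 0; off < MAP_SIZE; off += HPAGE_SZ)
		buf[off] = 1;
	clock_gettime(CLOCK_MONOTONIC, &t1);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%.2f GB/s\n", (double)(MAP_SIZE >> 30) / secs);
	return 0;
}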
Ingo Molnar <mingo@kernel.org> writes:

> * Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> Ankur Arora (4):
>>   x86/clear_page: extend clear_page*() for multi-page clearing
>>   x86/clear_page: add clear_pages()
>>   huge_page: allow arch override for folio_zero_user()
>>   x86/folio_zero_user: multi-page clearing
>
> These are not how x86 commit titles should look. Please take a look
> at the titles of previous commits to the x86 files you are modifying
> and follow that style. (Capitalization, use of verbs, etc.)

Ack. Will fix.

--
ankur
On 13 Apr 2025, at 23:46, Ankur Arora wrote:

> This series adds multi-page clearing for hugepages. It is a rework
> of [1] which took a detour through PREEMPT_LAZY [2].
>
> Why multi-page clearing? Multi-page clearing improves upon the
> current page-at-a-time approach by providing the processor with a
> hint as to the real region size. A processor could use this hint to,
> for instance, elide cacheline allocation when clearing a large
> region.
>
> This optimization in particular is done by REP; STOS on AMD Zen,
> where regions larger than the L3 size use non-temporal stores.
>
> This results in significantly better performance.

Do you have init_on_alloc=1 in your kernel? With that, pages coming
from the buddy allocator are zeroed in post_alloc_hook() by
kernel_init_pages(), which is a for loop of
clear_highpage_kasan_tagged(), a wrapper around clear_page(). And
folio_zero_user() is not used.

At least Debian, Fedora, and Ubuntu have
CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y by default, which means
init_on_alloc=1.

Maybe kernel_init_pages() should get your optimization as well, unless
you only target hugetlb pages.

Best Regards,
Yan, Zi
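For reference, the zeroing path described above looks roughly like
this -- a simplified paraphrase of mm/page_alloc.c as of recent
kernels, with the KASAN disable/enable bracketing around the loop
elided:

/*
 * Called from post_alloc_hook() when init_on_alloc applies
 * (simplified paraphrase, not the verbatim kernel source).
 * Note the page-at-a-time loop: no multi-page length hint ever
 * reaches clear_page() on this path.
 */
static void kernel_init_pages(struct page *page, int numpages)
{
	int i;

	for (i = 0; i < numpages; i++)
		clear_highpage_kasan_tagged(page + i);
}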