Message ID: 20250414034607.762653-5-ankur.a.arora@oracle.com
State: New
Series: mm/folio_zero_user: add multi-page clearing
* Ankur Arora <ankur.a.arora@oracle.com> wrote: > clear_pages_rep(), clear_pages_erms() use string instructions to zero > memory. When operating on more than a single page, we can use these > more effectively by explicitly advertising the region-size to the > processor, which can use that as a hint to optimize the clearing > (ex. by eliding cacheline allocation.) > +#ifndef CONFIG_HIGHMEM > +/* > + * folio_zero_user_preemptible(): multi-page clearing variant of folio_zero_user(). > + * > + * Taking inspiration from the common code variant, we split the zeroing in > + * three parts: left of the fault, right of the fault, and up to 5 pages > + * in the immediate neighbourhood of the target page. > + * > + * Cleared in that order to keep cache lines of the target region hot. > + * > + * For gigantic pages, there is no expectation of cache locality so just do a > + * straight zero. > + */ > +void folio_zero_user_preemptible(struct folio *folio, unsigned long addr_hint) > +{ > + unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio)); > + const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE; > + const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1); > + int width = 2; /* pages cleared last on either side */ > + struct range r[3]; > + int i; > + > + if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) { > + clear_pages(page_address(folio_page(folio, 0)), folio_nr_pages(folio)); > + clear_pages(page_address(folio_page(folio, r[i].start)), len); So the _user postfix naming is super confusing here and elsewhere in this series. clear_page(), and by extension the clear_pages() interface you extended it to, fundamentally only works on kernel addresses: /* * Zero a page. * %rdi - page */ SYM_TYPED_FUNC_START(clear_page_rep) movl $4096/8,%ecx xorl %eax,%eax rep stosq RET Note the absolute lack of fault & exception handling. But folio_zero_user*() uses the kernel-space variants of page clearing AFAICT (contrary to the naming): void folio_zero_user(struct folio *folio, unsigned long addr_hint) { unsigned int nr_pages = folio_nr_pages(folio); if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) clear_gigantic_page(folio, addr_hint, nr_pages); else process_huge_page(addr_hint, nr_pages, clear_subpage, folio); } static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint, unsigned int nr_pages) { unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio)); int i; might_sleep(); for (i = 0; i < nr_pages; i++) { cond_resched(); clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE); } } Which on x86 is simply mapped into a kernel-memory interface: static inline void clear_user_page(void *page, unsigned long vaddr, struct page *pg) { clear_page(page); } So at minimum this is a misnomer and a confusing mixture of user/kernel interface names on an epic scale that TBH should be cleaned up first before extended... > +out: > + /* Explicitly invoke cond_resched() to handle any live patching necessary. */ > + cond_resched(); What again? Thanks, Ingo
* Ankur Arora <ankur.a.arora@oracle.com> wrote: > clear_pages_rep(), clear_pages_erms() use string instructions to zero > memory. When operating on more than a single page, we can use these > more effectively by explicitly advertising the region-size to the > processor, which can use that as a hint to optimize the clearing > (ex. by eliding cacheline allocation.) > > As a secondary benefit, string instructions are typically microcoded, > and working with larger regions helps amortize the cost of the decode. Not just the decoding, but also iterations around page-sized chunks are not cheap these days: there's various compiler generated mitigations and other overhead that applies on a typical kernel, and using larger sizes amortizes that per-page-iteration setup cost. > When zeroing the 2MB page, maximize spatial locality by clearing in > three sections: the faulting page and its immediate neighbourhood, the > left and the right regions, with the local neighbourhood cleared last. s/zeroing the 2MB page /zeroing a 2MB page > It's not entirely clear why the performance for pg-sz=2MB improves. > We decode fewer instructions and the hardware prefetcher can do a > better job, but the perf stats for both of those aren't convincing > enough to the extent of ~30%. s/why the performance /why performance > For both page-sizes, Icelakex, behaves similarly to Milan pg-sz=2MB: we > see a drop in cycles but there's no drop in cacheline allocation. s/Icelakex, behaves similarly /Icelakex behaves similarly > Performance for preempt=none|voluntary remains unchanged. CONFIG_PREEMPT_VOLUNTARY=y is the default on a number of major distributions, such as Ubuntu, and a lot of enterprise distro kernels - and this patch does nothing for them, for no good reason. So could you please provide a sensible size granularity cutoff of 16MB or so on non-preemptible kernels, instead of this weird build-time all-or-nothing binary cutoff based on preemption modes? On preempt=full/lazy the granularity limit would be infinite. I.e the only code dependent on the preemption mode should be the size cutoff/limit. On full/lazy preemption the code would, ideally, compile to something close to your current code. > +obj-$(CONFIG_PREEMPTION) += memory.o > +#ifndef CONFIG_HIGHMEM > +/* > + * folio_zero_user_preemptible(): multi-page clearing variant of folio_zero_user(). We don't care much about HIGHMEM these days I suppose, but this dependency still feels wrong. Is this a stealth dependency on x86-64, trying to avoid a new arch Kconfig for this new API, right? ;-) Thanks, Ingo
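A minimal sketch of the size-granularity cutoff suggested above, assuming a chunked helper along these lines; the function name, the 16MB constant (roughly 1.6ms at 10GB/s, per Ankur's reply below) and the use of preempt_model_preemptible() as the runtime check are illustrative assumptions, not code from the posted series:

#include <linux/mm.h>
#include <linux/minmax.h>
#include <linux/sizes.h>

/* Illustrative chunk bound for cooperative preemption models. */
#define CLEAR_PAGES_CHUNK	(SZ_16M / PAGE_SIZE)

static void clear_folio_chunked(struct folio *folio, unsigned int nr_pages)
{
	unsigned int done = 0;

	while (done < nr_pages) {
		unsigned int chunk = nr_pages - done;

		/* Only bound the chunk when we cannot be preempted under it. */
		if (!preempt_model_preemptible())
			chunk = min_t(unsigned int, chunk, CLEAR_PAGES_CHUNK);

		clear_pages(page_address(folio_page(folio, done)), chunk);
		done += chunk;
		cond_resched();		/* reschedule point between chunks */
	}
}

With this shape, preempt=none|voluntary kernels get bounded clearing latency, while full/lazy kernels take the whole region in one clear_pages() call, which is close to the behaviour of the posted patch.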
Ingo Molnar <mingo@kernel.org> writes: > * Ankur Arora <ankur.a.arora@oracle.com> wrote: > >> clear_pages_rep(), clear_pages_erms() use string instructions to zero >> memory. When operating on more than a single page, we can use these >> more effectively by explicitly advertising the region-size to the >> processor, which can use that as a hint to optimize the clearing >> (ex. by eliding cacheline allocation.) > >> +#ifndef CONFIG_HIGHMEM >> +/* >> + * folio_zero_user_preemptible(): multi-page clearing variant of folio_zero_user(). >> + * >> + * Taking inspiration from the common code variant, we split the zeroing in >> + * three parts: left of the fault, right of the fault, and up to 5 pages >> + * in the immediate neighbourhood of the target page. >> + * >> + * Cleared in that order to keep cache lines of the target region hot. >> + * >> + * For gigantic pages, there is no expectation of cache locality so just do a >> + * straight zero. >> + */ >> +void folio_zero_user_preemptible(struct folio *folio, unsigned long addr_hint) >> +{ >> + unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio)); >> + const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE; >> + const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1); >> + int width = 2; /* pages cleared last on either side */ >> + struct range r[3]; >> + int i; >> + >> + if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) { >> + clear_pages(page_address(folio_page(folio, 0)), folio_nr_pages(folio)); > >> + clear_pages(page_address(folio_page(folio, r[i].start)), len); > > So the _user postfix naming is super confusing here and elsewhere in > this series. The problem is that the _user naming comes from the MM interface name and is meant to address architectures where you might need to do more than just zero the kernel address range for the page. > clear_page(), and by extension the clear_pages() interface you extended > it to, fundamentally only works on kernel addresses: Agreed. > /* > * Zero a page. > * %rdi - page > */ > SYM_TYPED_FUNC_START(clear_page_rep) > movl $4096/8,%ecx > xorl %eax,%eax > rep stosq > RET > > Note the absolute lack of fault & exception handling. Yeah. And, as you are implying that is safe because the folio_zero_user() (and this path) is only called after this range has been validated. > But folio_zero_user*() uses the kernel-space variants of page clearing > AFAICT (contrary to the naming): > > void folio_zero_user(struct folio *folio, unsigned long addr_hint) > { > unsigned int nr_pages = folio_nr_pages(folio); > > if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) > clear_gigantic_page(folio, addr_hint, nr_pages); > else > process_huge_page(addr_hint, nr_pages, clear_subpage, folio); > } > > > static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint, > unsigned int nr_pages) > { > unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio)); > int i; > > might_sleep(); > for (i = 0; i < nr_pages; i++) { > cond_resched(); > clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE); > } > } > > Which on x86 is simply mapped into a kernel-memory interface: > > static inline void clear_user_page(void *page, unsigned long vaddr, > struct page *pg) > { > clear_page(page); > } > > So at minimum this is a misnomer and a confusing mixture of user/kernel > interface names on an epic scale that TBH should be cleaned up first > before extended... I think a comment to avoid this confusion is definitely warranted. About the mixture of names, I'm not sure how to avoid that. 
For instance see arch/arc/mm/cache.c::clear_user_page():

void clear_user_page(void *to, unsigned long u_vaddr, struct page *page)
{
	struct folio *folio = page_folio(page);

	clear_page(to);
	clear_bit(PG_dc_clean, &folio->flags);
}

arch/arm also does a bunch of stuff which made my head hurt but the arc
version is clearly different enough.

>> +out:
>> +	/* Explicitly invoke cond_resched() to handle any live patching necessary. */
>> +	cond_resched();
>
> What again?

Yeah, I can see how this looks out of place :). The idea was that even
though we don't need explicit invocations of cond_resched() (because
this path is only called when preemptible), we still need some because
cond_resched() is overloaded to help with live patching.

Anyway, this comment can go away based on your suggestion elsewhere
(extensions for cooperative preemption models.)

Thanks for the detailed review.

--
ankur
Ingo Molnar <mingo@kernel.org> writes: > * Ankur Arora <ankur.a.arora@oracle.com> wrote: > >> clear_pages_rep(), clear_pages_erms() use string instructions to zero >> memory. When operating on more than a single page, we can use these >> more effectively by explicitly advertising the region-size to the >> processor, which can use that as a hint to optimize the clearing >> (ex. by eliding cacheline allocation.) >> >> As a secondary benefit, string instructions are typically microcoded, >> and working with larger regions helps amortize the cost of the decode. > > Not just the decoding, but also iterations around page-sized chunks are > not cheap these days: there's various compiler generated mitigations > and other overhead that applies on a typical kernel, and using larger > sizes amortizes that per-page-iteration setup cost. Thanks. Yeah, I was completely forgetting that even the cost of returns has gone up in the mitigation era :D. Is retbleed the one you were alluding to or there might be others that would apply here as well? >> When zeroing the 2MB page, maximize spatial locality by clearing in >> three sections: the faulting page and its immediate neighbourhood, the >> left and the right regions, with the local neighbourhood cleared last. > > s/zeroing the 2MB page > /zeroing a 2MB page > > >> It's not entirely clear why the performance for pg-sz=2MB improves. >> We decode fewer instructions and the hardware prefetcher can do a >> better job, but the perf stats for both of those aren't convincing >> enough to the extent of ~30%. > > s/why the performance > /why performance > >> For both page-sizes, Icelakex, behaves similarly to Milan pg-sz=2MB: we >> see a drop in cycles but there's no drop in cacheline allocation. > > s/Icelakex, behaves similarly > /Icelakex behaves similarly Ack to all of the above. >> Performance for preempt=none|voluntary remains unchanged. > > CONFIG_PREEMPT_VOLUNTARY=y is the default on a number of major > distributions, such as Ubuntu, and a lot of enterprise distro kernels - > and this patch does nothing for them, for no good reason. > So could you please provide a sensible size granularity cutoff of 16MB > or so on non-preemptible kernels, instead of this weird build-time > all-or-nothing binary cutoff based on preemption modes? So, the reason for associating this with preemption modes was in part not the difficulty of deciding a sensible granularity cutoff. I had done a variety of chunking for an earlier version which was a bit of a mess: https://lore.kernel.org/lkml/20220606203725.1313715-11-ankur.a.arora@oracle.com/. Fixed size chunking should be straight-forward enough. However, 16MB is around 1.6ms if you zero at 10GBps. And, longer if you are on older hardware. > On preempt=full/lazy the granularity limit would be infinite. > > I.e the only code dependent on the preemption mode should be the size > cutoff/limit. > On full/lazy preemption the code would, ideally, compile to something > close to your current code. Yeah, agree. >> +obj-$(CONFIG_PREEMPTION) += memory.o > >> +#ifndef CONFIG_HIGHMEM >> +/* >> + * folio_zero_user_preemptible(): multi-page clearing variant of folio_zero_user(). > > We don't care much about HIGHMEM these days I suppose, but this > dependency still feels wrong. Is this a stealth dependency on x86-64, > trying to avoid a new arch Kconfig for this new API, right? ;-) Alas nothing so crafty :). 
HIGHMEM means that we need to map pages in a hugepage folio via
kmap_local_page() -- so we cannot treat a hugepage folio as contiguous
memory and thus cannot use REP; STOS on it.

I guess the CONFIG_HIGHMEM condition clearly warrants a comment.

--
ankur
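For contrast, a rough, illustrative-only sketch (the helper name is made up, this is not code from the patch) of what the HIGHMEM case forces: each page has to be mapped and cleared individually, so there is no contiguous kernel-address region to hand to REP STOS:

static void clear_folio_highmem(struct folio *folio, unsigned int nr_pages)
{
	unsigned int i;

	for (i = 0; i < nr_pages; i++) {
		/*
		 * Temporary per-page mapping; the folio is not contiguous
		 * in the kernel address space under CONFIG_HIGHMEM.
		 */
		void *kaddr = kmap_local_page(folio_page(folio, i));

		clear_page(kaddr);	/* one page at a time */
		kunmap_local(kaddr);
		cond_resched();
	}
}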
On Sun, Apr 13, 2025 at 08:46:07PM -0700, Ankur Arora wrote: > clear_pages_rep(), clear_pages_erms() use string instructions to zero > memory. When operating on more than a single page, we can use these > more effectively by explicitly advertising the region-size to the > processor, which can use that as a hint to optimize the clearing > (ex. by eliding cacheline allocation.) > > As a secondary benefit, string instructions are typically microcoded, > and working with larger regions helps amortize the cost of the decode. > > When zeroing the 2MB page, maximize spatial locality by clearing in > three sections: the faulting page and its immediate neighbourhood, the > left and the right regions, with the local neighbourhood cleared last. > > Performance > == > > Use mmap(MAP_HUGETLB) to demand fault a 64GB region on the local > NUMA node. > > Milan (EPYC 7J13, boost=0, preempt=full|lazy): > > mm/folio_zero_user x86/folio_zero_user change > (GB/s +- stddev) (GB/s +- stddev) > > pg-sz=2MB 11.89 +- 0.78% 16.12 +- 0.12% + 35.5% > pg-sz=1GB 16.51 +- 0.54% 42.80 +- 3.48% + 159.2% > > Milan uses a threshold of LLC-size (~32MB) for eliding cacheline > allocation, so we see a dropoff in cacheline-allocations for pg-sz=1GB. > > pg-sz=1GB: > - 9,250,034,512 cycles # 2.418 GHz ( +- 0.43% ) (46.16%) > - 544,878,976 instructions # 0.06 insn per cycle > - 2,331,332,516 L1-dcache-loads # 609.471 M/sec ( +- 0.03% ) (46.16%) > - 1,075,122,960 L1-dcache-load-misses # 46.12% of all L1-dcache accesses ( +- 0.01% ) (46.15%) > > + 3,688,681,006 cycles # 2.420 GHz ( +- 3.48% ) (46.01%) > + 10,979,121 instructions # 0.00 insn per cycle > + 31,829,258 L1-dcache-loads # 20.881 M/sec ( +- 4.92% ) (46.34%) > + 13,677,295 L1-dcache-load-misses # 42.97% of all L1-dcache accesses ( +- 6.15% ) (46.32%) > > That's not the case with pg-sz=2MB, where we also perform better but > the number of cacheline allocations remain the same. > > It's not entirely clear why the performance for pg-sz=2MB improves. We > decode fewer instructions and the hardware prefetcher can do a better > job, but the perf stats for both of those aren't convincing enough to > the extent of ~30%. > > pg-sz=2MB: > - 13,110,306,584 cycles # 2.418 GHz ( +- 0.48% ) (46.13%) > - 607,589,360 instructions # 0.05 insn per cycle > - 2,416,130,434 L1-dcache-loads # 445.682 M/sec ( +- 0.08% ) (46.19%) > - 1,080,187,594 L1-dcache-load-misses # 44.71% of all L1-dcache accesses ( +- 0.01% ) (46.18%) > > + 9,624,624,178 cycles # 2.418 GHz ( +- 0.01% ) (46.13%) > + 277,336,691 instructions # 0.03 insn per cycle > + 2,251,220,599 L1-dcache-loads # 565.624 M/sec ( +- 0.01% ) (46.20%) > + 1,092,386,130 L1-dcache-load-misses # 48.52% of all L1-dcache accesses ( +- 0.02% ) (46.19%) > > Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy): > > mm/folio_zero_user x86/folio_zero_user change > (GB/s +- stddev) (GB/s +- stddev) > > pg-sz=2MB 7.95 +- 0.30% 10.90 +- 0.26% + 37.10% > pg-sz=1GB 8.01 +- 0.24% 11.26 +- 0.48% + 40.57% > > For both page-sizes, Icelakex, behaves similarly to Milan pg-sz=2MB: we > see a drop in cycles but there's no drop in cacheline allocation. > Back when I was young and handsome and 32-bit x86 was king, people assumed 4K pages needed to be cleared with non-temporal stores to avoid evicting stuff from caches. I had never seen measurements showing this has the intended effect. Some time after this became a thing I did see measurements showing that this in fact *increases* cache misses. 
I am not saying this was necessarily the case for all x86 uarchs, merely that the sensibly sounding assumption turned bogus at some point (if it was ever legit). This brings me to the multi-stage clearing employed here for locality. While it sounds great on paper, for all I know it does not provide any advantage. It very well may be it is harmful by preventing the CPU from knowing what you are trying to do. I think doing this warrants obtaining stats from some real workloads, but given how time consuming this can be I think it would be tolerable to skip it for now. > Performance for preempt=none|voluntary remains unchanged. > So I was under the impression the benefit would be realized for all kernels. I don't know how preemption support is implemented on Linux. Do you always get an IPI? I was thinking something like this: a per-cpu var akin to preemption count, but indicating the particular code section is fully preemptible Then: preemptible_enter(); clear_pages(); preemptible_exit(); for simpler handling of the var it could prevent migration to other CPUs. then the IPI handler for preemption would check if ->preemptible is set + preemption disablement is zero, in which case it would take you off cpu. If this is a problem, then a better granularity would help (say 8 pages between cond_rescheds?) > Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> > --- > arch/x86/mm/Makefile | 1 + > arch/x86/mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++++ > include/linux/mm.h | 1 + > 3 files changed, 62 insertions(+) > create mode 100644 arch/x86/mm/memory.c > > diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile > index 32035d5be5a0..e61b4d331cdf 100644 > --- a/arch/x86/mm/Makefile > +++ b/arch/x86/mm/Makefile > @@ -55,6 +55,7 @@ obj-$(CONFIG_MMIOTRACE_TEST) += testmmiotrace.o > obj-$(CONFIG_NUMA) += numa.o numa_$(BITS).o > obj-$(CONFIG_AMD_NUMA) += amdtopology.o > obj-$(CONFIG_ACPI_NUMA) += srat.o > +obj-$(CONFIG_PREEMPTION) += memory.o > > obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o > obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o > diff --git a/arch/x86/mm/memory.c b/arch/x86/mm/memory.c > new file mode 100644 > index 000000000000..99851c246fcc > --- /dev/null > +++ b/arch/x86/mm/memory.c > @@ -0,0 +1,60 @@ > +// SPDX-License-Identifier: GPL-2.0-or-later > +#include <linux/mm.h> > +#include <linux/range.h> > +#include <linux/minmax.h> > + > +#ifndef CONFIG_HIGHMEM > +/* > + * folio_zero_user_preemptible(): multi-page clearing variant of folio_zero_user(). > + * > + * Taking inspiration from the common code variant, we split the zeroing in > + * three parts: left of the fault, right of the fault, and up to 5 pages > + * in the immediate neighbourhood of the target page. > + * > + * Cleared in that order to keep cache lines of the target region hot. > + * > + * For gigantic pages, there is no expectation of cache locality so just do a > + * straight zero. > + */ > +void folio_zero_user_preemptible(struct folio *folio, unsigned long addr_hint) > +{ > + unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio)); > + const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE; > + const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1); > + int width = 2; /* pages cleared last on either side */ > + struct range r[3]; > + int i; > + > + if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) { > + clear_pages(page_address(folio_page(folio, 0)), folio_nr_pages(folio)); > + goto out; > + } > + > + /* > + * Faulting page and its immediate neighbourhood. 
Cleared at the end to > + * ensure it sticks around in the cache. > + */ > + r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end), > + clamp_t(s64, fault_idx + width, pg.start, pg.end)); > + > + /* Region to the left of the fault */ > + r[1] = DEFINE_RANGE(pg.start, > + clamp_t(s64, r[2].start-1, pg.start-1, r[2].start)); > + > + /* Region to the right of the fault: always valid for the common fault_idx=0 case. */ > + r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1), > + pg.end); > + > + for (i = 0; i <= 2; i++) { > + int len = range_len(&r[i]); > + > + if (len > 0) > + clear_pages(page_address(folio_page(folio, r[i].start)), len); > + } > + > +out: > + /* Explicitly invoke cond_resched() to handle any live patching necessary. */ > + cond_resched(); > +} > + > +#endif /* CONFIG_HIGHMEM */ > diff --git a/include/linux/mm.h b/include/linux/mm.h > index b7f13f087954..b57512da8173 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -4114,6 +4114,7 @@ enum mf_action_page_type { > }; > > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) > +void folio_zero_user_preemptible(struct folio *fio, unsigned long addr_hint); > void folio_zero_user(struct folio *folio, unsigned long addr_hint); > int copy_user_large_folio(struct folio *dst, struct folio *src, > unsigned long addr_hint, > -- > 2.31.1 > >
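A rough sketch of the per-cpu "fully preemptible" marker Mateusz proposes above. None of these helpers, variables, or the IPI-path check exist in the kernel today; the names are invented purely to illustrate the shape of the idea:

#include <linux/percpu.h>
#include <linux/preempt.h>

/* Hypothetical per-cpu flag marking a fully preemptible code section. */
static DEFINE_PER_CPU(int, fully_preemptible);

static inline void preemptible_enter(void)
{
	migrate_disable();		/* keep the section on this CPU */
	this_cpu_write(fully_preemptible, 1);
}

static inline void preemptible_exit(void)
{
	this_cpu_write(fully_preemptible, 0);
	migrate_enable();
}

/*
 * Hypothetical check for the resched-IPI/irq-return path: if true, the
 * scheduler could switch away even on a preempt=none|voluntary kernel.
 * Only the preempt-disable bits matter here, not hardirq/softirq counts.
 */
static inline bool in_fully_preemptible_section(void)
{
	return this_cpu_read(fully_preemptible) &&
	       !(preempt_count() & PREEMPT_MASK);
}

The clearing path would then be bracketed as preemptible_enter(); clear_pages(...); preemptible_exit(); with no cond_resched() chunking needed.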
Mateusz Guzik <mjguzik@gmail.com> writes: > On Sun, Apr 13, 2025 at 08:46:07PM -0700, Ankur Arora wrote: >> clear_pages_rep(), clear_pages_erms() use string instructions to zero >> memory. When operating on more than a single page, we can use these >> more effectively by explicitly advertising the region-size to the >> processor, which can use that as a hint to optimize the clearing >> (ex. by eliding cacheline allocation.) >> >> As a secondary benefit, string instructions are typically microcoded, >> and working with larger regions helps amortize the cost of the decode. >> >> When zeroing the 2MB page, maximize spatial locality by clearing in >> three sections: the faulting page and its immediate neighbourhood, the >> left and the right regions, with the local neighbourhood cleared last. >> >> Performance >> == >> >> Use mmap(MAP_HUGETLB) to demand fault a 64GB region on the local >> NUMA node. >> >> Milan (EPYC 7J13, boost=0, preempt=full|lazy): >> >> mm/folio_zero_user x86/folio_zero_user change >> (GB/s +- stddev) (GB/s +- stddev) >> >> pg-sz=2MB 11.89 +- 0.78% 16.12 +- 0.12% + 35.5% >> pg-sz=1GB 16.51 +- 0.54% 42.80 +- 3.48% + 159.2% >> >> Milan uses a threshold of LLC-size (~32MB) for eliding cacheline >> allocation, so we see a dropoff in cacheline-allocations for pg-sz=1GB. >> >> pg-sz=1GB: >> - 9,250,034,512 cycles # 2.418 GHz ( +- 0.43% ) (46.16%) >> - 544,878,976 instructions # 0.06 insn per cycle >> - 2,331,332,516 L1-dcache-loads # 609.471 M/sec ( +- 0.03% ) (46.16%) >> - 1,075,122,960 L1-dcache-load-misses # 46.12% of all L1-dcache accesses ( +- 0.01% ) (46.15%) >> >> + 3,688,681,006 cycles # 2.420 GHz ( +- 3.48% ) (46.01%) >> + 10,979,121 instructions # 0.00 insn per cycle >> + 31,829,258 L1-dcache-loads # 20.881 M/sec ( +- 4.92% ) (46.34%) >> + 13,677,295 L1-dcache-load-misses # 42.97% of all L1-dcache accesses ( +- 6.15% ) (46.32%) >> >> That's not the case with pg-sz=2MB, where we also perform better but >> the number of cacheline allocations remain the same. >> >> It's not entirely clear why the performance for pg-sz=2MB improves. We >> decode fewer instructions and the hardware prefetcher can do a better >> job, but the perf stats for both of those aren't convincing enough to >> the extent of ~30%. >> >> pg-sz=2MB: >> - 13,110,306,584 cycles # 2.418 GHz ( +- 0.48% ) (46.13%) >> - 607,589,360 instructions # 0.05 insn per cycle >> - 2,416,130,434 L1-dcache-loads # 445.682 M/sec ( +- 0.08% ) (46.19%) >> - 1,080,187,594 L1-dcache-load-misses # 44.71% of all L1-dcache accesses ( +- 0.01% ) (46.18%) >> >> + 9,624,624,178 cycles # 2.418 GHz ( +- 0.01% ) (46.13%) >> + 277,336,691 instructions # 0.03 insn per cycle >> + 2,251,220,599 L1-dcache-loads # 565.624 M/sec ( +- 0.01% ) (46.20%) >> + 1,092,386,130 L1-dcache-load-misses # 48.52% of all L1-dcache accesses ( +- 0.02% ) (46.19%) >> >> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy): >> >> mm/folio_zero_user x86/folio_zero_user change >> (GB/s +- stddev) (GB/s +- stddev) >> >> pg-sz=2MB 7.95 +- 0.30% 10.90 +- 0.26% + 37.10% >> pg-sz=1GB 8.01 +- 0.24% 11.26 +- 0.48% + 40.57% >> >> For both page-sizes, Icelakex, behaves similarly to Milan pg-sz=2MB: we >> see a drop in cycles but there's no drop in cacheline allocation. >> > > Back when I was young and handsome and 32-bit x86 was king, people > assumed 4K pages needed to be cleared with non-temporal stores to avoid > evicting stuff from caches. I had never seen measurements showing this > has the intended effect. 
Some time after this became a thing I did see > measurements showing that this in fact *increases* cache misses. I am > not saying this was necessarily the case for all x86 uarchs, merely that > the sensibly sounding assumption turned bogus at some point (if it was > ever legit). That was a long time ago though ;-). And, your point makes sense for small sized pages. But, consider that zeroing a 1GB page can easily blow away an L3 cache for absolutely nothing gained -- probabilistically, nothing that was in the page that remains in the cache will ever be accessed. Now, you could argue that the situation is less clear for 2MB pages. > This brings me to the multi-stage clearing employed here for locality. > While it sounds great on paper, for all I know it does not provide any > advantage. It very well may be it is harmful by preventing the CPU from > knowing what you are trying to do. > > I think doing this warrants obtaining stats from some real workloads, > but given how time consuming this can be I think it would be tolerable > to skip it for now. > >> Performance for preempt=none|voluntary remains unchanged. >> > > So I was under the impression the benefit would be realized for all > kernels. > > I don't know how preemption support is implemented on Linux. Do you > always get an IPI? No. The need-resched bit is common. It's just there's no preemption via irqentry, just synchronous calls to cond_resched() (as you mention below). Zeroing via a subroutine like instruction (rep; stos) is incompatible with synchronous calls to cond_resched() so this code is explicitly not called for none/voluntary (see patch 3.) That said, I'll probably take Ingo's suggestion of chunking things up in say 8/16MB portions for cooperative preemption models. Ankur > I was thinking something like this: a per-cpu var akin to preemption > count, but indicating the particular code section is fully preemptible > > Then: > > preemptible_enter(); > clear_pages(); > preemptible_exit(); > > for simpler handling of the var it could prevent migration to other > CPUs. > > then the IPI handler for preemption would check if ->preemptible is set > + preemption disablement is zero, in which case it would take you off > cpu. > > If this is a problem, then a better granularity would help (say 8 pages > between cond_rescheds?) 
> >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> >> --- >> arch/x86/mm/Makefile | 1 + >> arch/x86/mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++++ >> include/linux/mm.h | 1 + >> 3 files changed, 62 insertions(+) >> create mode 100644 arch/x86/mm/memory.c >> >> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile >> index 32035d5be5a0..e61b4d331cdf 100644 >> --- a/arch/x86/mm/Makefile >> +++ b/arch/x86/mm/Makefile >> @@ -55,6 +55,7 @@ obj-$(CONFIG_MMIOTRACE_TEST) += testmmiotrace.o >> obj-$(CONFIG_NUMA) += numa.o numa_$(BITS).o >> obj-$(CONFIG_AMD_NUMA) += amdtopology.o >> obj-$(CONFIG_ACPI_NUMA) += srat.o >> +obj-$(CONFIG_PREEMPTION) += memory.o >> >> obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o >> obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o >> diff --git a/arch/x86/mm/memory.c b/arch/x86/mm/memory.c >> new file mode 100644 >> index 000000000000..99851c246fcc >> --- /dev/null >> +++ b/arch/x86/mm/memory.c >> @@ -0,0 +1,60 @@ >> +// SPDX-License-Identifier: GPL-2.0-or-later >> +#include <linux/mm.h> >> +#include <linux/range.h> >> +#include <linux/minmax.h> >> + >> +#ifndef CONFIG_HIGHMEM >> +/* >> + * folio_zero_user_preemptible(): multi-page clearing variant of folio_zero_user(). >> + * >> + * Taking inspiration from the common code variant, we split the zeroing in >> + * three parts: left of the fault, right of the fault, and up to 5 pages >> + * in the immediate neighbourhood of the target page. >> + * >> + * Cleared in that order to keep cache lines of the target region hot. >> + * >> + * For gigantic pages, there is no expectation of cache locality so just do a >> + * straight zero. >> + */ >> +void folio_zero_user_preemptible(struct folio *folio, unsigned long addr_hint) >> +{ >> + unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio)); >> + const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE; >> + const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1); >> + int width = 2; /* pages cleared last on either side */ >> + struct range r[3]; >> + int i; >> + >> + if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) { >> + clear_pages(page_address(folio_page(folio, 0)), folio_nr_pages(folio)); >> + goto out; >> + } >> + >> + /* >> + * Faulting page and its immediate neighbourhood. Cleared at the end to >> + * ensure it sticks around in the cache. >> + */ >> + r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end), >> + clamp_t(s64, fault_idx + width, pg.start, pg.end)); >> + >> + /* Region to the left of the fault */ >> + r[1] = DEFINE_RANGE(pg.start, >> + clamp_t(s64, r[2].start-1, pg.start-1, r[2].start)); >> + >> + /* Region to the right of the fault: always valid for the common fault_idx=0 case. */ >> + r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1), >> + pg.end); >> + >> + for (i = 0; i <= 2; i++) { >> + int len = range_len(&r[i]); >> + >> + if (len > 0) >> + clear_pages(page_address(folio_page(folio, r[i].start)), len); >> + } >> + >> +out: >> + /* Explicitly invoke cond_resched() to handle any live patching necessary. 
*/ >> + cond_resched(); >> +} >> + >> +#endif /* CONFIG_HIGHMEM */ >> diff --git a/include/linux/mm.h b/include/linux/mm.h >> index b7f13f087954..b57512da8173 100644 >> --- a/include/linux/mm.h >> +++ b/include/linux/mm.h >> @@ -4114,6 +4114,7 @@ enum mf_action_page_type { >> }; >> >> #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) >> +void folio_zero_user_preemptible(struct folio *fio, unsigned long addr_hint); >> void folio_zero_user(struct folio *folio, unsigned long addr_hint); >> int copy_user_large_folio(struct folio *dst, struct folio *src, >> unsigned long addr_hint, >> -- >> 2.31.1 >> >> -- ankur
On Tue, Apr 15, 2025 at 11:46 PM Ankur Arora <ankur.a.arora@oracle.com> wrote: > > > Mateusz Guzik <mjguzik@gmail.com> writes: > > > On Sun, Apr 13, 2025 at 08:46:07PM -0700, Ankur Arora wrote: > >> clear_pages_rep(), clear_pages_erms() use string instructions to zero > >> memory. When operating on more than a single page, we can use these > >> more effectively by explicitly advertising the region-size to the > >> processor, which can use that as a hint to optimize the clearing > >> (ex. by eliding cacheline allocation.) > >> > >> As a secondary benefit, string instructions are typically microcoded, > >> and working with larger regions helps amortize the cost of the decode. > >> > >> When zeroing the 2MB page, maximize spatial locality by clearing in > >> three sections: the faulting page and its immediate neighbourhood, the > >> left and the right regions, with the local neighbourhood cleared last. > >> > >> Performance > >> == > >> > >> Use mmap(MAP_HUGETLB) to demand fault a 64GB region on the local > >> NUMA node. > >> > >> Milan (EPYC 7J13, boost=0, preempt=full|lazy): > >> > >> mm/folio_zero_user x86/folio_zero_user change > >> (GB/s +- stddev) (GB/s +- stddev) > >> > >> pg-sz=2MB 11.89 +- 0.78% 16.12 +- 0.12% + 35.5% > >> pg-sz=1GB 16.51 +- 0.54% 42.80 +- 3.48% + 159.2% > >> > >> Milan uses a threshold of LLC-size (~32MB) for eliding cacheline > >> allocation, so we see a dropoff in cacheline-allocations for pg-sz=1GB. > >> > >> pg-sz=1GB: > >> - 9,250,034,512 cycles # 2.418 GHz ( +- 0.43% ) (46.16%) > >> - 544,878,976 instructions # 0.06 insn per cycle > >> - 2,331,332,516 L1-dcache-loads # 609.471 M/sec ( +- 0.03% ) (46.16%) > >> - 1,075,122,960 L1-dcache-load-misses # 46.12% of all L1-dcache accesses ( +- 0.01% ) (46.15%) > >> > >> + 3,688,681,006 cycles # 2.420 GHz ( +- 3.48% ) (46.01%) > >> + 10,979,121 instructions # 0.00 insn per cycle > >> + 31,829,258 L1-dcache-loads # 20.881 M/sec ( +- 4.92% ) (46.34%) > >> + 13,677,295 L1-dcache-load-misses # 42.97% of all L1-dcache accesses ( +- 6.15% ) (46.32%) > >> > >> That's not the case with pg-sz=2MB, where we also perform better but > >> the number of cacheline allocations remain the same. > >> > >> It's not entirely clear why the performance for pg-sz=2MB improves. We > >> decode fewer instructions and the hardware prefetcher can do a better > >> job, but the perf stats for both of those aren't convincing enough to > >> the extent of ~30%. > >> > >> pg-sz=2MB: > >> - 13,110,306,584 cycles # 2.418 GHz ( +- 0.48% ) (46.13%) > >> - 607,589,360 instructions # 0.05 insn per cycle > >> - 2,416,130,434 L1-dcache-loads # 445.682 M/sec ( +- 0.08% ) (46.19%) > >> - 1,080,187,594 L1-dcache-load-misses # 44.71% of all L1-dcache accesses ( +- 0.01% ) (46.18%) > >> > >> + 9,624,624,178 cycles # 2.418 GHz ( +- 0.01% ) (46.13%) > >> + 277,336,691 instructions # 0.03 insn per cycle > >> + 2,251,220,599 L1-dcache-loads # 565.624 M/sec ( +- 0.01% ) (46.20%) > >> + 1,092,386,130 L1-dcache-load-misses # 48.52% of all L1-dcache accesses ( +- 0.02% ) (46.19%) > >> > >> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy): > >> > >> mm/folio_zero_user x86/folio_zero_user change > >> (GB/s +- stddev) (GB/s +- stddev) > >> > >> pg-sz=2MB 7.95 +- 0.30% 10.90 +- 0.26% + 37.10% > >> pg-sz=1GB 8.01 +- 0.24% 11.26 +- 0.48% + 40.57% > >> > >> For both page-sizes, Icelakex, behaves similarly to Milan pg-sz=2MB: we > >> see a drop in cycles but there's no drop in cacheline allocation. 
> >> > > > > Back when I was young and handsome and 32-bit x86 was king, people > > assumed 4K pages needed to be cleared with non-temporal stores to avoid > > evicting stuff from caches. I had never seen measurements showing this > > has the intended effect. Some time after this became a thing I did see > > measurements showing that this in fact *increases* cache misses. I am > > not saying this was necessarily the case for all x86 uarchs, merely that > > the sensibly sounding assumption turned bogus at some point (if it was > > ever legit). > > That was a long time ago though ;-). And, your point makes sense for > small sized pages. But, consider that zeroing a 1GB page can easily blow > away an L3 cache for absolutely nothing gained -- probabilistically, > nothing that was in the page that remains in the cache will ever be > accessed. > > Now, you could argue that the situation is less clear for 2MB pages. > Well I was talking about 2MB. ;) I thought it is a foregone conclusion that 1GB pages will be handled with non-temporal stores, but maybe I'm crossing my wires. > > This brings me to the multi-stage clearing employed here for locality. > > While it sounds great on paper, for all I know it does not provide any > > advantage. It very well may be it is harmful by preventing the CPU from > > knowing what you are trying to do. > > > > I think doing this warrants obtaining stats from some real workloads, > > but given how time consuming this can be I think it would be tolerable > > to skip it for now. > > > >> Performance for preempt=none|voluntary remains unchanged. > >> > > > > So I was under the impression the benefit would be realized for all > > kernels. > > > > I don't know how preemption support is implemented on Linux. Do you > > always get an IPI? > > No. The need-resched bit is common. It's just there's no preemption via > irqentry, just synchronous calls to cond_resched() (as you mention below). > > Zeroing via a subroutine like instruction (rep; stos) is incompatible with > synchronous calls to cond_resched() so this code is explicitly not called > for none/voluntary (see patch 3.) > > That said, I'll probably take Ingo's suggestion of chunking things up > in say 8/16MB portions for cooperative preemption models. makes sense, thanks > > > > I was thinking something like this: a per-cpu var akin to preemption > > count, but indicating the particular code section is fully preemptible > > > > Then: > > > > preemptible_enter(); > > clear_pages(); > > preemptible_exit(); > > > > for simpler handling of the var it could prevent migration to other > > CPUs. > > > > then the IPI handler for preemption would check if ->preemptible is set > > + preemption disablement is zero, in which case it would take you off > > cpu. > > > > If this is a problem, then a better granularity would help (say 8 pages > > between cond_rescheds?) 
> > > >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> > >> --- > >> arch/x86/mm/Makefile | 1 + > >> arch/x86/mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++++ > >> include/linux/mm.h | 1 + > >> 3 files changed, 62 insertions(+) > >> create mode 100644 arch/x86/mm/memory.c > >> > >> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile > >> index 32035d5be5a0..e61b4d331cdf 100644 > >> --- a/arch/x86/mm/Makefile > >> +++ b/arch/x86/mm/Makefile > >> @@ -55,6 +55,7 @@ obj-$(CONFIG_MMIOTRACE_TEST) += testmmiotrace.o > >> obj-$(CONFIG_NUMA) += numa.o numa_$(BITS).o > >> obj-$(CONFIG_AMD_NUMA) += amdtopology.o > >> obj-$(CONFIG_ACPI_NUMA) += srat.o > >> +obj-$(CONFIG_PREEMPTION) += memory.o > >> > >> obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o > >> obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o > >> diff --git a/arch/x86/mm/memory.c b/arch/x86/mm/memory.c > >> new file mode 100644 > >> index 000000000000..99851c246fcc > >> --- /dev/null > >> +++ b/arch/x86/mm/memory.c > >> @@ -0,0 +1,60 @@ > >> +// SPDX-License-Identifier: GPL-2.0-or-later > >> +#include <linux/mm.h> > >> +#include <linux/range.h> > >> +#include <linux/minmax.h> > >> + > >> +#ifndef CONFIG_HIGHMEM > >> +/* > >> + * folio_zero_user_preemptible(): multi-page clearing variant of folio_zero_user(). > >> + * > >> + * Taking inspiration from the common code variant, we split the zeroing in > >> + * three parts: left of the fault, right of the fault, and up to 5 pages > >> + * in the immediate neighbourhood of the target page. > >> + * > >> + * Cleared in that order to keep cache lines of the target region hot. > >> + * > >> + * For gigantic pages, there is no expectation of cache locality so just do a > >> + * straight zero. > >> + */ > >> +void folio_zero_user_preemptible(struct folio *folio, unsigned long addr_hint) > >> +{ > >> + unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio)); > >> + const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE; > >> + const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1); > >> + int width = 2; /* pages cleared last on either side */ > >> + struct range r[3]; > >> + int i; > >> + > >> + if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) { > >> + clear_pages(page_address(folio_page(folio, 0)), folio_nr_pages(folio)); > >> + goto out; > >> + } > >> + > >> + /* > >> + * Faulting page and its immediate neighbourhood. Cleared at the end to > >> + * ensure it sticks around in the cache. > >> + */ > >> + r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end), > >> + clamp_t(s64, fault_idx + width, pg.start, pg.end)); > >> + > >> + /* Region to the left of the fault */ > >> + r[1] = DEFINE_RANGE(pg.start, > >> + clamp_t(s64, r[2].start-1, pg.start-1, r[2].start)); > >> + > >> + /* Region to the right of the fault: always valid for the common fault_idx=0 case. */ > >> + r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1), > >> + pg.end); > >> + > >> + for (i = 0; i <= 2; i++) { > >> + int len = range_len(&r[i]); > >> + > >> + if (len > 0) > >> + clear_pages(page_address(folio_page(folio, r[i].start)), len); > >> + } > >> + > >> +out: > >> + /* Explicitly invoke cond_resched() to handle any live patching necessary. 
*/ > >> + cond_resched(); > >> +} > >> + > >> +#endif /* CONFIG_HIGHMEM */ > >> diff --git a/include/linux/mm.h b/include/linux/mm.h > >> index b7f13f087954..b57512da8173 100644 > >> --- a/include/linux/mm.h > >> +++ b/include/linux/mm.h > >> @@ -4114,6 +4114,7 @@ enum mf_action_page_type { > >> }; > >> > >> #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) > >> +void folio_zero_user_preemptible(struct folio *fio, unsigned long addr_hint); > >> void folio_zero_user(struct folio *folio, unsigned long addr_hint); > >> int copy_user_large_folio(struct folio *dst, struct folio *src, > >> unsigned long addr_hint, > >> -- > >> 2.31.1 > >> > >> > > > -- > ankur
Mateusz Guzik <mjguzik@gmail.com> writes: > On Tue, Apr 15, 2025 at 11:46 PM Ankur Arora <ankur.a.arora@oracle.com> wrote: >> >> >> Mateusz Guzik <mjguzik@gmail.com> writes: >> >> > On Sun, Apr 13, 2025 at 08:46:07PM -0700, Ankur Arora wrote: >> >> clear_pages_rep(), clear_pages_erms() use string instructions to zero >> >> memory. When operating on more than a single page, we can use these >> >> more effectively by explicitly advertising the region-size to the >> >> processor, which can use that as a hint to optimize the clearing >> >> (ex. by eliding cacheline allocation.) >> >> >> >> As a secondary benefit, string instructions are typically microcoded, >> >> and working with larger regions helps amortize the cost of the decode. >> >> >> >> When zeroing the 2MB page, maximize spatial locality by clearing in >> >> three sections: the faulting page and its immediate neighbourhood, the >> >> left and the right regions, with the local neighbourhood cleared last. >> >> >> >> Performance >> >> == >> >> >> >> Use mmap(MAP_HUGETLB) to demand fault a 64GB region on the local >> >> NUMA node. >> >> >> >> Milan (EPYC 7J13, boost=0, preempt=full|lazy): >> >> >> >> mm/folio_zero_user x86/folio_zero_user change >> >> (GB/s +- stddev) (GB/s +- stddev) >> >> >> >> pg-sz=2MB 11.89 +- 0.78% 16.12 +- 0.12% + 35.5% >> >> pg-sz=1GB 16.51 +- 0.54% 42.80 +- 3.48% + 159.2% >> >> >> >> Milan uses a threshold of LLC-size (~32MB) for eliding cacheline >> >> allocation, so we see a dropoff in cacheline-allocations for pg-sz=1GB. >> >> >> >> pg-sz=1GB: >> >> - 9,250,034,512 cycles # 2.418 GHz ( +- 0.43% ) (46.16%) >> >> - 544,878,976 instructions # 0.06 insn per cycle >> >> - 2,331,332,516 L1-dcache-loads # 609.471 M/sec ( +- 0.03% ) (46.16%) >> >> - 1,075,122,960 L1-dcache-load-misses # 46.12% of all L1-dcache accesses ( +- 0.01% ) (46.15%) >> >> >> >> + 3,688,681,006 cycles # 2.420 GHz ( +- 3.48% ) (46.01%) >> >> + 10,979,121 instructions # 0.00 insn per cycle >> >> + 31,829,258 L1-dcache-loads # 20.881 M/sec ( +- 4.92% ) (46.34%) >> >> + 13,677,295 L1-dcache-load-misses # 42.97% of all L1-dcache accesses ( +- 6.15% ) (46.32%) >> >> >> >> That's not the case with pg-sz=2MB, where we also perform better but >> >> the number of cacheline allocations remain the same. >> >> >> >> It's not entirely clear why the performance for pg-sz=2MB improves. We >> >> decode fewer instructions and the hardware prefetcher can do a better >> >> job, but the perf stats for both of those aren't convincing enough to >> >> the extent of ~30%. 
>> >> >> >> pg-sz=2MB: >> >> - 13,110,306,584 cycles # 2.418 GHz ( +- 0.48% ) (46.13%) >> >> - 607,589,360 instructions # 0.05 insn per cycle >> >> - 2,416,130,434 L1-dcache-loads # 445.682 M/sec ( +- 0.08% ) (46.19%) >> >> - 1,080,187,594 L1-dcache-load-misses # 44.71% of all L1-dcache accesses ( +- 0.01% ) (46.18%) >> >> >> >> + 9,624,624,178 cycles # 2.418 GHz ( +- 0.01% ) (46.13%) >> >> + 277,336,691 instructions # 0.03 insn per cycle >> >> + 2,251,220,599 L1-dcache-loads # 565.624 M/sec ( +- 0.01% ) (46.20%) >> >> + 1,092,386,130 L1-dcache-load-misses # 48.52% of all L1-dcache accesses ( +- 0.02% ) (46.19%) >> >> >> >> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy): >> >> >> >> mm/folio_zero_user x86/folio_zero_user change >> >> (GB/s +- stddev) (GB/s +- stddev) >> >> >> >> pg-sz=2MB 7.95 +- 0.30% 10.90 +- 0.26% + 37.10% >> >> pg-sz=1GB 8.01 +- 0.24% 11.26 +- 0.48% + 40.57% >> >> >> >> For both page-sizes, Icelakex, behaves similarly to Milan pg-sz=2MB: we >> >> see a drop in cycles but there's no drop in cacheline allocation. >> >> >> > >> > Back when I was young and handsome and 32-bit x86 was king, people >> > assumed 4K pages needed to be cleared with non-temporal stores to avoid >> > evicting stuff from caches. I had never seen measurements showing this >> > has the intended effect. Some time after this became a thing I did see >> > measurements showing that this in fact *increases* cache misses. I am >> > not saying this was necessarily the case for all x86 uarchs, merely that >> > the sensibly sounding assumption turned bogus at some point (if it was >> > ever legit). >> >> That was a long time ago though ;-). And, your point makes sense for >> small sized pages. But, consider that zeroing a 1GB page can easily blow >> away an L3 cache for absolutely nothing gained -- probabilistically, >> nothing that was in the page that remains in the cache will ever be >> accessed. >> >> Now, you could argue that the situation is less clear for 2MB pages. >> > > Well I was talking about 2MB. ;) I thought it is a foregone conclusion > that 1GB pages will be handled with non-temporal stores, but maybe I'm > crossing my wires. Or I am. I agree the situation is less clear for 2MB. I think for the immediate neighbourhood we clearly do want temporal. For the rest, L2 sizes are typically ~2MB or less. While there might be latency benefits for caching the rest of the page, there are clear negatives to blowing away state that was definitely useful in the near past. Would be good idea to quantify. Not sure what's a good benchmark to do that though. Ankur >> > This brings me to the multi-stage clearing employed here for locality. >> > While it sounds great on paper, for all I know it does not provide any >> > advantage. It very well may be it is harmful by preventing the CPU from >> > knowing what you are trying to do. >> > >> > I think doing this warrants obtaining stats from some real workloads, >> > but given how time consuming this can be I think it would be tolerable >> > to skip it for now. >> > >> >> Performance for preempt=none|voluntary remains unchanged. >> >> >> > >> > So I was under the impression the benefit would be realized for all >> > kernels. >> > >> > I don't know how preemption support is implemented on Linux. Do you >> > always get an IPI? >> >> No. The need-resched bit is common. It's just there's no preemption via >> irqentry, just synchronous calls to cond_resched() (as you mention below). 
>> >> Zeroing via a subroutine like instruction (rep; stos) is incompatible with >> synchronous calls to cond_resched() so this code is explicitly not called >> for none/voluntary (see patch 3.) >> >> That said, I'll probably take Ingo's suggestion of chunking things up >> in say 8/16MB portions for cooperative preemption models. > > makes sense, thanks > >> >> >> > I was thinking something like this: a per-cpu var akin to preemption >> > count, but indicating the particular code section is fully preemptible >> > >> > Then: >> > >> > preemptible_enter(); >> > clear_pages(); >> > preemptible_exit(); >> > >> > for simpler handling of the var it could prevent migration to other >> > CPUs. >> > >> > then the IPI handler for preemption would check if ->preemptible is set >> > + preemption disablement is zero, in which case it would take you off >> > cpu. >> > >> > If this is a problem, then a better granularity would help (say 8 pages >> > between cond_rescheds?) >> > >> >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> >> >> --- >> >> arch/x86/mm/Makefile | 1 + >> >> arch/x86/mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++++ >> >> include/linux/mm.h | 1 + >> >> 3 files changed, 62 insertions(+) >> >> create mode 100644 arch/x86/mm/memory.c >> >> >> >> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile >> >> index 32035d5be5a0..e61b4d331cdf 100644 >> >> --- a/arch/x86/mm/Makefile >> >> +++ b/arch/x86/mm/Makefile >> >> @@ -55,6 +55,7 @@ obj-$(CONFIG_MMIOTRACE_TEST) += testmmiotrace.o >> >> obj-$(CONFIG_NUMA) += numa.o numa_$(BITS).o >> >> obj-$(CONFIG_AMD_NUMA) += amdtopology.o >> >> obj-$(CONFIG_ACPI_NUMA) += srat.o >> >> +obj-$(CONFIG_PREEMPTION) += memory.o >> >> >> >> obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o >> >> obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o >> >> diff --git a/arch/x86/mm/memory.c b/arch/x86/mm/memory.c >> >> new file mode 100644 >> >> index 000000000000..99851c246fcc >> >> --- /dev/null >> >> +++ b/arch/x86/mm/memory.c >> >> @@ -0,0 +1,60 @@ >> >> +// SPDX-License-Identifier: GPL-2.0-or-later >> >> +#include <linux/mm.h> >> >> +#include <linux/range.h> >> >> +#include <linux/minmax.h> >> >> + >> >> +#ifndef CONFIG_HIGHMEM >> >> +/* >> >> + * folio_zero_user_preemptible(): multi-page clearing variant of folio_zero_user(). >> >> + * >> >> + * Taking inspiration from the common code variant, we split the zeroing in >> >> + * three parts: left of the fault, right of the fault, and up to 5 pages >> >> + * in the immediate neighbourhood of the target page. >> >> + * >> >> + * Cleared in that order to keep cache lines of the target region hot. >> >> + * >> >> + * For gigantic pages, there is no expectation of cache locality so just do a >> >> + * straight zero. >> >> + */ >> >> +void folio_zero_user_preemptible(struct folio *folio, unsigned long addr_hint) >> >> +{ >> >> + unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio)); >> >> + const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE; >> >> + const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1); >> >> + int width = 2; /* pages cleared last on either side */ >> >> + struct range r[3]; >> >> + int i; >> >> + >> >> + if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) { >> >> + clear_pages(page_address(folio_page(folio, 0)), folio_nr_pages(folio)); >> >> + goto out; >> >> + } >> >> + >> >> + /* >> >> + * Faulting page and its immediate neighbourhood. Cleared at the end to >> >> + * ensure it sticks around in the cache. 
>> >> + */ >> >> + r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end), >> >> + clamp_t(s64, fault_idx + width, pg.start, pg.end)); >> >> + >> >> + /* Region to the left of the fault */ >> >> + r[1] = DEFINE_RANGE(pg.start, >> >> + clamp_t(s64, r[2].start-1, pg.start-1, r[2].start)); >> >> + >> >> + /* Region to the right of the fault: always valid for the common fault_idx=0 case. */ >> >> + r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1), >> >> + pg.end); >> >> + >> >> + for (i = 0; i <= 2; i++) { >> >> + int len = range_len(&r[i]); >> >> + >> >> + if (len > 0) >> >> + clear_pages(page_address(folio_page(folio, r[i].start)), len); >> >> + } >> >> + >> >> +out: >> >> + /* Explicitly invoke cond_resched() to handle any live patching necessary. */ >> >> + cond_resched(); >> >> +} >> >> + >> >> +#endif /* CONFIG_HIGHMEM */ >> >> diff --git a/include/linux/mm.h b/include/linux/mm.h >> >> index b7f13f087954..b57512da8173 100644 >> >> --- a/include/linux/mm.h >> >> +++ b/include/linux/mm.h >> >> @@ -4114,6 +4114,7 @@ enum mf_action_page_type { >> >> }; >> >> >> >> #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) >> >> +void folio_zero_user_preemptible(struct folio *fio, unsigned long addr_hint); >> >> void folio_zero_user(struct folio *folio, unsigned long addr_hint); >> >> int copy_user_large_folio(struct folio *dst, struct folio *src, >> >> unsigned long addr_hint, >> >> -- >> >> 2.31.1 >> >> >> >> >> >> >> -- >> ankur -- ankur
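One possible way to quantify it, along the lines of the mmap(MAP_HUGETLB) demand-fault test from the commit message: a userspace sketch (the region size, the assumption of a 2MB default hugepage size, and the one-byte-per-page touch pattern are illustrative), run under something like "perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses" before and after the change:

#include <stdio.h>
#include <sys/mman.h>

#define REGION_SZ	(64UL << 30)	/* 64GB, as in the commit message */
#define TOUCH_STEP	(2UL << 20)	/* one write per 2MB page */

int main(void)
{
	unsigned long off;
	char *p;

	/*
	 * Assumes a 2MB default hugepage size; add MAP_HUGE_1GB to the
	 * flags to exercise the pg-sz=1GB path instead.
	 */
	p = mmap(NULL, REGION_SZ, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	for (off = 0; off < REGION_SZ; off += TOUCH_STEP)
		p[off] = 1;	/* demand fault -> folio_zero_user() */

	munmap(p, REGION_SZ);
	return 0;
}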
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 32035d5be5a0..e61b4d331cdf 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -55,6 +55,7 @@ obj-$(CONFIG_MMIOTRACE_TEST)	+= testmmiotrace.o
 obj-$(CONFIG_NUMA)		+= numa.o numa_$(BITS).o
 obj-$(CONFIG_AMD_NUMA)		+= amdtopology.o
 obj-$(CONFIG_ACPI_NUMA)		+= srat.o
+obj-$(CONFIG_PREEMPTION)	+= memory.o
 
 obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)	+= pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY)			+= kaslr.o
diff --git a/arch/x86/mm/memory.c b/arch/x86/mm/memory.c
new file mode 100644
index 000000000000..99851c246fcc
--- /dev/null
+++ b/arch/x86/mm/memory.c
@@ -0,0 +1,60 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+#include <linux/mm.h>
+#include <linux/range.h>
+#include <linux/minmax.h>
+
+#ifndef CONFIG_HIGHMEM
+/*
+ * folio_zero_user_preemptible(): multi-page clearing variant of folio_zero_user().
+ *
+ * Taking inspiration from the common code variant, we split the zeroing in
+ * three parts: left of the fault, right of the fault, and up to 5 pages
+ * in the immediate neighbourhood of the target page.
+ *
+ * Cleared in that order to keep cache lines of the target region hot.
+ *
+ * For gigantic pages, there is no expectation of cache locality so just do a
+ * straight zero.
+ */
+void folio_zero_user_preemptible(struct folio *folio, unsigned long addr_hint)
+{
+	unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
+	const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
+	const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
+	int width = 2; /* pages cleared last on either side */
+	struct range r[3];
+	int i;
+
+	if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
+		clear_pages(page_address(folio_page(folio, 0)), folio_nr_pages(folio));
+		goto out;
+	}
+
+	/*
+	 * Faulting page and its immediate neighbourhood. Cleared at the end to
+	 * ensure it sticks around in the cache.
+	 */
+	r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
+			    clamp_t(s64, fault_idx + width, pg.start, pg.end));
+
+	/* Region to the left of the fault */
+	r[1] = DEFINE_RANGE(pg.start,
+			    clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
+
+	/* Region to the right of the fault: always valid for the common fault_idx=0 case. */
+	r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
+			    pg.end);
+
+	for (i = 0; i <= 2; i++) {
+		int len = range_len(&r[i]);
+
+		if (len > 0)
+			clear_pages(page_address(folio_page(folio, r[i].start)), len);
+	}
+
+out:
+	/* Explicitly invoke cond_resched() to handle any live patching necessary. */
+	cond_resched();
+}
+
+#endif /* CONFIG_HIGHMEM */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b7f13f087954..b57512da8173 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4114,6 +4114,7 @@ enum mf_action_page_type {
 };
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+void folio_zero_user_preemptible(struct folio *fio, unsigned long addr_hint);
 void folio_zero_user(struct folio *folio, unsigned long addr_hint);
 int copy_user_large_folio(struct folio *dst, struct folio *src,
 			  unsigned long addr_hint,
clear_pages_rep(), clear_pages_erms() use string instructions to zero
memory. When operating on more than a single page, we can use these
more effectively by explicitly advertising the region-size to the
processor, which can use that as a hint to optimize the clearing
(ex. by eliding cacheline allocation.)

As a secondary benefit, string instructions are typically microcoded,
and working with larger regions helps amortize the cost of the decode.

When zeroing the 2MB page, maximize spatial locality by clearing in
three sections: the faulting page and its immediate neighbourhood, the
left and the right regions, with the local neighbourhood cleared last.

Performance
==

Use mmap(MAP_HUGETLB) to demand fault a 64GB region on the local
NUMA node.

Milan (EPYC 7J13, boost=0, preempt=full|lazy):

              mm/folio_zero_user    x86/folio_zero_user      change
               (GB/s +- stddev)      (GB/s +- stddev)

  pg-sz=2MB    11.89 +- 0.78%        16.12 +- 0.12%          +  35.5%
  pg-sz=1GB    16.51 +- 0.54%        42.80 +- 3.48%          + 159.2%

Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
allocation, so we see a dropoff in cacheline-allocations for pg-sz=1GB.

pg-sz=1GB:

 -  9,250,034,512   cycles                 #   2.418 GHz                                ( +- 0.43% )  (46.16%)
 -    544,878,976   instructions           #   0.06  insn per cycle
 -  2,331,332,516   L1-dcache-loads        # 609.471 M/sec                              ( +- 0.03% )  (46.16%)
 -  1,075,122,960   L1-dcache-load-misses  #  46.12% of all L1-dcache accesses          ( +- 0.01% )  (46.15%)

 +  3,688,681,006   cycles                 #   2.420 GHz                                ( +- 3.48% )  (46.01%)
 +     10,979,121   instructions           #   0.00  insn per cycle
 +     31,829,258   L1-dcache-loads        #  20.881 M/sec                              ( +- 4.92% )  (46.34%)
 +     13,677,295   L1-dcache-load-misses  #  42.97% of all L1-dcache accesses          ( +- 6.15% )  (46.32%)

That's not the case with pg-sz=2MB, where we also perform better but
the number of cacheline allocations remain the same.

It's not entirely clear why the performance for pg-sz=2MB improves. We
decode fewer instructions and the hardware prefetcher can do a better
job, but the perf stats for both of those aren't convincing enough to
the extent of ~30%.

pg-sz=2MB:

 - 13,110,306,584   cycles                 #   2.418 GHz                                ( +- 0.48% )  (46.13%)
 -    607,589,360   instructions           #   0.05  insn per cycle
 -  2,416,130,434   L1-dcache-loads        # 445.682 M/sec                              ( +- 0.08% )  (46.19%)
 -  1,080,187,594   L1-dcache-load-misses  #  44.71% of all L1-dcache accesses          ( +- 0.01% )  (46.18%)

 +  9,624,624,178   cycles                 #   2.418 GHz                                ( +- 0.01% )  (46.13%)
 +    277,336,691   instructions           #   0.03  insn per cycle
 +  2,251,220,599   L1-dcache-loads        # 565.624 M/sec                              ( +- 0.01% )  (46.20%)
 +  1,092,386,130   L1-dcache-load-misses  #  48.52% of all L1-dcache accesses          ( +- 0.02% )  (46.19%)

Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):

              mm/folio_zero_user    x86/folio_zero_user      change
               (GB/s +- stddev)      (GB/s +- stddev)

  pg-sz=2MB     7.95 +- 0.30%        10.90 +- 0.26%          + 37.10%
  pg-sz=1GB     8.01 +- 0.24%        11.26 +- 0.48%          + 40.57%

For both page-sizes, Icelakex, behaves similarly to Milan pg-sz=2MB: we
see a drop in cycles but there's no drop in cacheline allocation.

Performance for preempt=none|voluntary remains unchanged.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/mm/Makefile |  1 +
 arch/x86/mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h   |  1 +
 3 files changed, 62 insertions(+)
 create mode 100644 arch/x86/mm/memory.c