Message ID | 20230830184958.2333078-7-ankur.a.arora@oracle.com (mailing list archive) |
---|---|
State | New |
Series | x86/clear_huge_page: multi-page clearing |
Hi Ankur,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/x86/core]
[also build test ERROR on akpm-mm/mm-everything tip/x86/mm tip/sched/core tip/core/entry linus/master v6.5 next-20230831]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Ankur-Arora/mm-clear_huge_page-allow-arch-override-for-clear_huge_page/20230831-030156
base:   tip/x86/core
patch link:    https://lore.kernel.org/r/20230830184958.2333078-7-ankur.a.arora%40oracle.com
patch subject: [PATCH v2 6/9] x86/clear_huge_page: multi-page clearing
config: i386-buildonly-randconfig-006-20230831 (https://download.01.org/0day-ci/archive/20230901/202309010241.u7o4IETu-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230901/202309010241.u7o4IETu-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202309010241.u7o4IETu-lkp@intel.com/

All errors (new ones prefixed by >>):

   arch/x86/mm/hugetlbpage.c: In function 'clear_contig_region':
>> arch/x86/mm/hugetlbpage.c:155:9: error: implicit declaration of function 'clear_pages'; did you mean 'clear_page'? [-Werror=implicit-function-declaration]
     155 |         clear_pages(page_address(page), npages);
         |         ^~~~~~~~~~~
         |         clear_page
   cc1: some warnings being treated as errors


vim +155 arch/x86/mm/hugetlbpage.c

   151	
   152	#ifndef CONFIG_HIGHMEM
   153	static void clear_contig_region(struct page *page, unsigned int npages)
   154	{
 > 155		clear_pages(page_address(page), npages);
   156	}
   157	

On Wed, Aug 30, 2023 at 11:49:55AM -0700, Ankur Arora wrote:

> +#ifndef CONFIG_HIGHMEM

I'm thinking this wants to be: #ifdef CONFIG_X86_64. All the previous
stuff was 64bit specific.

> +static void clear_contig_region(struct page *page, unsigned int npages)
> +{
> +	clear_pages(page_address(page), npages);
> +}

I'm not sure about the naming of this helper -- but whatever.

> +
> +/*
> + * clear_huge_page(): multi-page clearing variant of clear_huge_page().
> + *
> + * Taking inspiration from the common code variant, we split the zeroing in
> + * three parts: left of the fault, right of the fault, and up to 5 pages
> + * in the immediate neighbourhood of the target page.
> + *
> + * Cleared in that order to keep cache lines of the target region hot.
> + *
> + * For gigantic pages, there is no expectation of cache locality so we do a
> + * straight zeroing.
> + */
> +void clear_huge_page(struct page *page,
> +		     unsigned long addr_hint, unsigned int pages_per_huge_page)
> +{
> +	unsigned long addr = addr_hint &
> +		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
> +	const long pgidx = (addr_hint - addr) / PAGE_SIZE;
> +	const int first_pg = 0, last_pg = pages_per_huge_page - 1;
> +	const int width = 2; /* pages cleared last on either side */
> +	int sidx[3], eidx[3];
> +	int i, n;
> +
> +	if (pages_per_huge_page > MAX_ORDER_NR_PAGES)
> +		return clear_contig_region(page, pages_per_huge_page);
> +
> +	/*
> +	 * Neighbourhood of the fault. Cleared at the end to ensure
> +	 * it sticks around in the cache.
> +	 */
> +	n = 2;
> +	sidx[n] = (pgidx - width) < first_pg ? first_pg : (pgidx - width);
> +	eidx[n] = (pgidx + width) > last_pg ? last_pg : (pgidx + width);
> +
> +	sidx[0] = first_pg;	/* Region to the left of the fault */
> +	eidx[0] = sidx[n] - 1;
> +
> +	sidx[1] = eidx[n] + 1;	/* Region to the right of the fault */
> +	eidx[1] = last_pg;
> +
> +	for (i = 0; i <= 2; i++) {
> +		if (eidx[i] >= sidx[i])
> +			clear_contig_region(page + sidx[i],
> +					    eidx[i] - sidx[i] + 1);

Since the if has a multi-line statement it needs { } per coding style.

> +	}
> +}
> +#endif /* CONFIG_HIGHMEM */
>  #endif /* CONFIG_HUGETLB_PAGE */
>  
>  #ifdef CONFIG_X86_64
> -- 
> 2.31.1
> 

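One possible way to fold in the two comments above (guard the code with
CONFIG_X86_64 rather than !CONFIG_HIGHMEM, and add braces around the
multi-line if body). This is only a sketch of the suggested fixup, not a
respin from the author; assuming clear_pages() is declared only for 64-bit
builds, which the i386 robot error above suggests, the changed guard would
also take care of that build failure:

#ifdef CONFIG_X86_64	/* clear_pages() and the preceding patches are 64-bit only */
static void clear_contig_region(struct page *page, unsigned int npages)
{
	clear_pages(page_address(page), npages);
}

void clear_huge_page(struct page *page,
		     unsigned long addr_hint, unsigned int pages_per_huge_page)
{
	/* ... body unchanged from the patch above ... */

	for (i = 0; i <= 2; i++) {
		if (eidx[i] >= sidx[i]) {
			clear_contig_region(page + sidx[i],
					    eidx[i] - sidx[i] + 1);
		}
	}
}
#endif /* CONFIG_X86_64 */
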
On 8/31/2023 12:19 AM, Ankur Arora wrote:
> clear_pages_rep(), clear_pages_erms() clear using string instructions.
> While clearing extents of more than a single page, we can use these
> more effectively by explicitly advertising the region-size to the
> processor.
>
> This can be used as a hint by the processor-uarch to optimize the
> clearing (ex. to avoid polluting one or more levels of the data-cache.)
>
> As a secondary benefit, string instructions are typically microcoded,
> and so it's a good idea to amortize the cost of the decode across larger
> regions.
>
> Accordingly, clear_huge_page() now does huge-page clearing in three
> parts: the neighbourhood of the faulting address, the left, and the
> right region of the neighbourhood.
>
> The local neighbourhood is cleared last to keep its cachelines hot.
> [...]
>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  arch/x86/mm/hugetlbpage.c | 54 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 54 insertions(+)
>

Hello Ankur,

Just thinking out loud here (w.r.t. THP): the V3 patchset with the uarch
changes had changes in the THP path too, where one could explicitly give
hints (such as non-caching hints), and they were passed down to invoke the
incoherent clearing. IMO, those changes logically belong with the uarch
optimizations.. right?

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 5804bbae4f01..0b9f7a6dad93 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -148,6 +148,60 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 	return hugetlb_get_unmapped_area_topdown(file, addr, len,
 			pgoff, flags);
 }
+
+#ifndef CONFIG_HIGHMEM
+static void clear_contig_region(struct page *page, unsigned int npages)
+{
+	clear_pages(page_address(page), npages);
+}
+
+/*
+ * clear_huge_page(): multi-page clearing variant of clear_huge_page().
+ *
+ * Taking inspiration from the common code variant, we split the zeroing in
+ * three parts: left of the fault, right of the fault, and up to 5 pages
+ * in the immediate neighbourhood of the target page.
+ *
+ * Cleared in that order to keep cache lines of the target region hot.
+ *
+ * For gigantic pages, there is no expectation of cache locality so we do a
+ * straight zeroing.
+ */
+void clear_huge_page(struct page *page,
+		     unsigned long addr_hint, unsigned int pages_per_huge_page)
+{
+	unsigned long addr = addr_hint &
+		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
+	const long pgidx = (addr_hint - addr) / PAGE_SIZE;
+	const int first_pg = 0, last_pg = pages_per_huge_page - 1;
+	const int width = 2; /* pages cleared last on either side */
+	int sidx[3], eidx[3];
+	int i, n;
+
+	if (pages_per_huge_page > MAX_ORDER_NR_PAGES)
+		return clear_contig_region(page, pages_per_huge_page);
+
+	/*
+	 * Neighbourhood of the fault. Cleared at the end to ensure
+	 * it sticks around in the cache.
+	 */
+	n = 2;
+	sidx[n] = (pgidx - width) < first_pg ? first_pg : (pgidx - width);
+	eidx[n] = (pgidx + width) > last_pg ? last_pg : (pgidx + width);
+
+	sidx[0] = first_pg;	/* Region to the left of the fault */
+	eidx[0] = sidx[n] - 1;
+
+	sidx[1] = eidx[n] + 1;	/* Region to the right of the fault */
+	eidx[1] = last_pg;
+
+	for (i = 0; i <= 2; i++) {
+		if (eidx[i] >= sidx[i])
+			clear_contig_region(page + sidx[i],
+					    eidx[i] - sidx[i] + 1);
+	}
+}
+#endif /* CONFIG_HIGHMEM */
 #endif /* CONFIG_HUGETLB_PAGE */
 
 #ifdef CONFIG_X86_64
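To make the geometry of the hunk above concrete, here is a small userspace
sketch (not part of the patch) that mirrors the sidx[]/eidx[] arithmetic.
The page count (512, i.e. a 2MB huge page of 4KB pages) and the fault index
(100) are assumed purely for illustration; the real code operates on struct
page and clears the regions with clear_contig_region():

/*
 * Prints: left  [0, 97], right [103, 511], neighbourhood [98, 102]
 * so the five pages around the fault are cleared last.
 */
#include <stdio.h>

int main(void)
{
	const int pages_per_huge_page = 512;	/* assumed: 2MB / 4KB */
	const long pgidx = 100;			/* assumed fault index */
	const int first_pg = 0, last_pg = pages_per_huge_page - 1;
	const int width = 2;			/* pages cleared last on either side */
	int sidx[3], eidx[3], n = 2;

	/* Neighbourhood of the fault, clamped to the huge page. */
	sidx[n] = (pgidx - width) < first_pg ? first_pg : (pgidx - width);
	eidx[n] = (pgidx + width) > last_pg ? last_pg : (pgidx + width);

	sidx[0] = first_pg;		/* region to the left of the fault */
	eidx[0] = sidx[n] - 1;

	sidx[1] = eidx[n] + 1;		/* region to the right of the fault */
	eidx[1] = last_pg;

	printf("left  [%d, %d], right [%d, %d], neighbourhood [%d, %d]\n",
	       sidx[0], eidx[0], sidx[1], eidx[1], sidx[2], eidx[2]);
	return 0;
}
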
clear_pages_rep(), clear_pages_erms() clear using string instructions.
While clearing extents of more than a single page, we can use these
more effectively by explicitly advertising the region-size to the
processor.

This can be used as a hint by the processor-uarch to optimize the
clearing (ex. to avoid polluting one or more levels of the data-cache.)

As a secondary benefit, string instructions are typically microcoded,
and so it's a good idea to amortize the cost of the decode across larger
regions.

Accordingly, clear_huge_page() now does huge-page clearing in three
parts: the neighbourhood of the faulting address, the left, and the
right region of the neighbourhood.

The local neighbourhood is cleared last to keep its cachelines hot.

Performance
==

Use mmap(MAP_HUGETLB) to demand fault a 128GB region (on the local
NUMA node):

Milan (EPYC 7J13, boost=1):

              mm/clear_huge_page   x86/clear_huge_page   change
                          (GB/s)                (GB/s)
  pg-sz=2MB              14.55                 19.29     +32.5%
  pg-sz=1GB              19.34                 49.60    +156.4%

Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
allocation, so we see a dropoff in cacheline-allocations for pg-sz=1GB:

 pg-sz=1GB:

 - 23,088,001,347   cycles                 #   3.487 GHz                        ( +-  0.08% )  (35.68%)
 -  4,680,678,939   L1-dcache-loads        # 706.831 M/sec                      ( +-  0.02% )  (35.74%)
 -  2,150,395,280   L1-dcache-load-misses  #  45.93% of all L1-dcache accesses  ( +-  0.01% )  (35.74%)

 +  8,983,798,764   cycles                 #   3.489 GHz                        ( +-  0.05% )  (35.59%)
 +     18,294,725   L1-dcache-loads        #   7.104 M/sec                      ( +- 18.88% )  (35.78%)
 +      6,677,565   L1-dcache-load-misses  #  30.48% of all L1-dcache accesses  ( +- 20.72% )  (35.78%)

That's not the case with pg-sz=2MB, where we perform better but the
number of cacheline allocations remains the same:

 pg-sz=2MB:

 - 31,087,683,852   cycles                 #   3.494 GHz                        ( +-  0.17% )  (35.72%)
 -  4,898,684,886   L1-dcache-loads        # 550.627 M/sec                      ( +-  0.03% )  (35.71%)
 -  2,161,434,236   L1-dcache-load-misses  #  44.11% of all L1-dcache accesses  ( +-  0.01% )  (35.71%)

 + 23,368,914,596   cycles                 #   3.480 GHz                        ( +-  0.27% )  (35.72%)
 +  4,481,808,430   L1-dcache-loads        # 667.382 M/sec                      ( +-  0.03% )  (35.71%)
 +  2,170,453,309   L1-dcache-load-misses  #  48.41% of all L1-dcache accesses  ( +-  0.06% )  (35.71%)

Icelakex (Platinum 8358, no_turbo=0):

              mm/clear_huge_page   x86/clear_huge_page   change
                          (GB/s)                (GB/s)
  pg-sz=2MB               9.19                 12.94     +40.8%
  pg-sz=1GB               9.36                 12.97     +38.5%

For both page-sizes, Icelakex behaves similarly to Milan pg-sz=2MB:
we see a drop in cycles but there's no drop in cacheline allocation.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/mm/hugetlbpage.c | 54 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)
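The commit message's point about "advertising the region-size" comes down to
issuing one string operation whose count covers the whole extent, instead of
one REP per 4KB page. A minimal sketch of that idea follows; this is plain
x86-64 GCC/Clang inline assembly, not the kernel's clear_pages_rep() or
clear_pages_erms() implementation, and the function name and the PAGE_SHIFT
value of 12 are assumed here for illustration:

/*
 * Sketch only: clear 'npages' contiguous 4KB pages with a single
 * REP STOSB. The full byte count is in RCX up front, so the CPU can
 * see the size of the region being zeroed (e.g. and decide whether
 * allocating cachelines for it is worthwhile).
 */
static inline void clear_pages_sketch(void *addr, unsigned long npages)
{
	unsigned long nbytes = npages << 12;	/* assumed PAGE_SHIFT == 12 */

	asm volatile("rep stosb"
		     : "+c" (nbytes), "+D" (addr)	/* count in RCX, dest in RDI */
		     : "a" (0)				/* AL = 0: store zeroes */
		     : "memory");
}

Giving the hardware the whole extent in one go is also what the Milan
pg-sz=1GB numbers above reflect: once the region exceeds the ~LLC-size
threshold, cacheline allocation is elided and both cycles and
L1-dcache-load-misses drop sharply.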