
[v2,6/9] x86/clear_huge_page: multi-page clearing

Message ID 20230830184958.2333078-7-ankur.a.arora@oracle.com (mailing list archive)
State New
Series x86/clear_huge_page: multi-page clearing

Commit Message

Ankur Arora Aug. 30, 2023, 6:49 p.m. UTC
clear_pages_rep() and clear_pages_erms() clear memory using string
instructions. When clearing extents larger than a single page, we can
use them more effectively by explicitly advertising the region size to
the processor.

The processor uarch can use this as a hint to optimize the clearing
(e.g. to avoid polluting one or more levels of the data cache.)

As a secondary benefit, string instructions are typically microcoded,
and so it's a good idea to amortize the cost of the decode across larger
regions.

Accordingly, clear_huge_page() now does huge-page clearing in three
parts: the neighbourhood of the faulting address, and the regions to
its left and right.

The local neighbourhood is cleared last to keep its cachelines hot.
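The split can be illustrated with a small userspace mock-up of the patch's sidx[]/eidx[] arithmetic (`split_regions` is a name invented for this sketch; the 512-page 2MB huge page and the fault index in the example below are assumptions):

```c
#include <assert.h>

/*
 * Mirror of the patch's index computation: region 2 is the fault
 * neighbourhood (cleared last), region 0 is left of it, region 1
 * right of it. An empty region has eidx < sidx and is skipped.
 */
static void split_regions(int pages_per_huge_page, int pgidx,
			  int sidx[3], int eidx[3])
{
	const int first_pg = 0, last_pg = pages_per_huge_page - 1;
	const int width = 2;	/* pages cleared last on either side */

	sidx[2] = (pgidx - width) < first_pg ? first_pg : (pgidx - width);
	eidx[2] = (pgidx + width) > last_pg  ? last_pg  : (pgidx + width);

	sidx[0] = first_pg;	/* region to the left of the fault */
	eidx[0] = sidx[2] - 1;

	sidx[1] = eidx[2] + 1;	/* region to the right of the fault */
	eidx[1] = last_pg;
}
```

For a fault at page 200 of a 512-page huge page this yields left = [0, 197], right = [203, 511], and neighbourhood = [198, 202]; clearing in index order 0, 1, 2 leaves the neighbourhood's cachelines hottest.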

Performance
==

Use mmap(MAP_HUGETLB) to demand fault a 128GB region (on the local
NUMA node):
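A minimal harness along these lines (a sketch, not the exact benchmark used; the fallback to a non-hugetlb mapping and the fixed 4KiB stride are additions so the demand-fault loop runs even where no huge pages are reserved):

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stddef.h>
#include <assert.h>

/*
 * Demand-fault every page of an anonymous mapping by writing one byte
 * per page: each first touch takes a page fault and has the kernel
 * clear the backing page, which is the path being measured above.
 */
static unsigned char *fault_in(size_t len, int hugetlb)
{
	int flags = MAP_PRIVATE | MAP_ANONYMOUS | (hugetlb ? MAP_HUGETLB : 0);
	unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				flags, -1, 0);

	if (p == MAP_FAILED && hugetlb)	/* no huge pages reserved */
		return fault_in(len, 0);
	if (p == MAP_FAILED)
		return NULL;

	for (size_t off = 0; off < len; off += 4096)
		p[off] = 1;		/* first touch => fault + clear */
	return p;
}
```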

Milan (EPYC 7J13, boost=1):

              mm/clear_huge_page   x86/clear_huge_page   change
                          (GB/s)                (GB/s)

  pg-sz=2MB                14.55                 19.29    +32.5%
  pg-sz=1GB                19.34                 49.60   +156.4%

Milan elides cacheline allocation above an LLC-size (~32MB) threshold,
so we see a steep drop in cacheline allocations for pg-sz=1GB:

pg-sz=1GB:
    -23,088,001,347      cycles                    #    3.487 GHz                      ( +-  0.08% )  (35.68%)
    - 4,680,678,939      L1-dcache-loads           #  706.831 M/sec                    ( +-  0.02% )  (35.74%)
    - 2,150,395,280      L1-dcache-load-misses     #   45.93% of all L1-dcache accesses  ( +-  0.01% )  (35.74%)

    + 8,983,798,764      cycles                    #    3.489 GHz                      ( +-  0.05% )  (35.59%)
    +    18,294,725      L1-dcache-loads           #    7.104 M/sec                    ( +- 18.88% )  (35.78%)
    +     6,677,565      L1-dcache-load-misses     #   30.48% of all L1-dcache accesses  ( +- 20.72% )  (35.78%)

That's not the case with pg-sz=2MB, where we perform better but the
number of cacheline allocations remains the same:

pg-sz=2MB:
    -31,087,683,852      cycles                    #    3.494 GHz                      ( +-  0.17% )  (35.72%)
    - 4,898,684,886      L1-dcache-loads           #  550.627 M/sec                    ( +-  0.03% )  (35.71%)
    - 2,161,434,236      L1-dcache-load-misses     #   44.11% of all L1-dcache accesses  ( +-  0.01% )  (35.71%)

    +23,368,914,596      cycles                    #    3.480 GHz                      ( +-  0.27% )  (35.72%)
    + 4,481,808,430      L1-dcache-loads           #  667.382 M/sec                    ( +-  0.03% )  (35.71%)
    + 2,170,453,309      L1-dcache-load-misses     #   48.41% of all L1-dcache accesses  ( +-  0.06% )  (35.71%)


Icelakex (Platinum 8358, no_turbo=0):

              mm/clear_huge_page   x86/clear_huge_page   change
                          (GB/s)                (GB/s)

  pg-sz=2MB                 9.19                 12.94   +40.8%
  pg-sz=1GB                 9.36                 12.97   +38.5%

For both page sizes, Icelakex behaves similarly to Milan at pg-sz=2MB:
we see a drop in cycles but no drop in cacheline allocations.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/mm/hugetlbpage.c | 54 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

Comments

kernel test robot Aug. 31, 2023, 6:26 p.m. UTC | #1
Hi Ankur,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/x86/core]
[also build test ERROR on akpm-mm/mm-everything tip/x86/mm tip/sched/core tip/core/entry linus/master v6.5 next-20230831]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Ankur-Arora/mm-clear_huge_page-allow-arch-override-for-clear_huge_page/20230831-030156
base:   tip/x86/core
patch link:    https://lore.kernel.org/r/20230830184958.2333078-7-ankur.a.arora%40oracle.com
patch subject: [PATCH v2 6/9] x86/clear_huge_page: multi-page clearing
config: i386-buildonly-randconfig-006-20230831 (https://download.01.org/0day-ci/archive/20230901/202309010241.u7o4IETu-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230901/202309010241.u7o4IETu-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202309010241.u7o4IETu-lkp@intel.com/

All errors (new ones prefixed by >>):

   arch/x86/mm/hugetlbpage.c: In function 'clear_contig_region':
>> arch/x86/mm/hugetlbpage.c:155:9: error: implicit declaration of function 'clear_pages'; did you mean 'clear_page'? [-Werror=implicit-function-declaration]
     155 |         clear_pages(page_address(page), npages);
         |         ^~~~~~~~~~~
         |         clear_page
   cc1: some warnings being treated as errors


vim +155 arch/x86/mm/hugetlbpage.c

   151	
   152	#ifndef CONFIG_HIGHMEM
   153	static void clear_contig_region(struct page *page, unsigned int npages)
   154	{
 > 155		clear_pages(page_address(page), npages);
   156	}
   157
Peter Zijlstra Sept. 8, 2023, 12:38 p.m. UTC | #2
On Wed, Aug 30, 2023 at 11:49:55AM -0700, Ankur Arora wrote:

> +#ifndef CONFIG_HIGHMEM

I'm thinking this wants to be: #ifdef CONFIG_X86_64. All the previous
stuff was 64bit specific.

> +static void clear_contig_region(struct page *page, unsigned int npages)
> +{
> +	clear_pages(page_address(page), npages);
> +}

I'm not sure about the naming of this helper -- but whatever.

> +
> +/*
> + * clear_huge_page(): multi-page clearing variant of clear_huge_page().
> + *
> + * Taking inspiration from the common code variant, we split the zeroing in
> + * three parts: left of the fault, right of the fault, and up to 5 pages
> + * in the immediate neighbourhood of the target page.
> + *
> + * Cleared in that order to keep cache lines of the target region hot.
> + *
> + * For gigantic pages, there is no expectation of cache locality so we do a
> + * straight zeroing.
> + */
> +void clear_huge_page(struct page *page,
> +		     unsigned long addr_hint, unsigned int pages_per_huge_page)
> +{
> +	unsigned long addr = addr_hint &
> +		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
> +	const long pgidx = (addr_hint - addr) / PAGE_SIZE;
> +	const int first_pg = 0, last_pg = pages_per_huge_page - 1;
> +	const int width = 2; /* pages cleared last on either side */
> +	int sidx[3], eidx[3];
> +	int i, n;
> +
> +	if (pages_per_huge_page > MAX_ORDER_NR_PAGES)
> +		return clear_contig_region(page, pages_per_huge_page);
> +
> +	/*
> +	 * Neighbourhood of the fault. Cleared at the end to ensure
> +	 * it sticks around in the cache.
> +	 */
> +	n = 2;
> +	sidx[n] = (pgidx - width) < first_pg ? first_pg : (pgidx - width);
> +	eidx[n] = (pgidx + width) > last_pg  ? last_pg  : (pgidx + width);
> +
> +	sidx[0] = first_pg;	/* Region to the left of the fault */
> +	eidx[0] = sidx[n] - 1;
> +
> +	sidx[1] = eidx[n] + 1;	/* Region to the right of the fault */
> +	eidx[1] = last_pg;
> +
> +	for (i = 0; i <= 2; i++) {
> +		if (eidx[i] >= sidx[i])
> +			clear_contig_region(page + sidx[i],
> +					    eidx[i] - sidx[i] + 1);

Since the if has a multi-line statement it needs { } per coding style.

> +	}
> +}
> +#endif /* CONFIG_HIGHMEM */
>  #endif /* CONFIG_HUGETLB_PAGE */
>  
>  #ifdef CONFIG_X86_64
> -- 
> 2.31.1
>
Raghavendra K T Sept. 13, 2023, 6:43 a.m. UTC | #3
On 8/31/2023 12:19 AM, Ankur Arora wrote:
> clear_pages_rep(), clear_pages_erms() clear using string instructions.
> While clearing extents of more than a single page, we can use these
> more effectively by explicitly advertising the region-size to the
> processor.
> 
> This can be used as a hint by the processor-uarch to optimize the
> clearing (ex. to avoid polluting one or more levels of the data-cache.)
> 
> As a secondary benefit, string instructions are typically microcoded,
> and so it's a good idea to amortize the cost of the decode across larger
> regions.
> 
> Accordingly, clear_huge_page() now does huge-page clearing in three
> parts: the neighbourhood of the faulting address, the left, and the
> right region of the neighbourhood.
> 
> The local neighbourhood is cleared last to keep its cachelines hot.
> 
[...]
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>   arch/x86/mm/hugetlbpage.c | 54 +++++++++++++++++++++++++++++++++++++++
>   1 file changed, 54 insertions(+)
> 

Hello Ankur,

Just thinking loud here (w.r.t THP).

The V3 patchset with the uarch changes had changes in the THP path too,
where one could explicitly give hints (e.g. non-caching hints), and they
are passed down to invoke incoherent clearing.
IMO, those changes logically belong to uarch optimizations.. right?

Patch

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 5804bbae4f01..0b9f7a6dad93 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -148,6 +148,60 @@  hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 		return hugetlb_get_unmapped_area_topdown(file, addr, len,
 				pgoff, flags);
 }
+
+#ifndef CONFIG_HIGHMEM
+static void clear_contig_region(struct page *page, unsigned int npages)
+{
+	clear_pages(page_address(page), npages);
+}
+
+/*
+ * clear_huge_page(): multi-page clearing variant of clear_huge_page().
+ *
+ * Taking inspiration from the common code variant, we split the zeroing in
+ * three parts: left of the fault, right of the fault, and up to 5 pages
+ * in the immediate neighbourhood of the target page.
+ *
+ * Cleared in that order to keep cache lines of the target region hot.
+ *
+ * For gigantic pages, there is no expectation of cache locality so we do a
+ * straight zeroing.
+ */
+void clear_huge_page(struct page *page,
+		     unsigned long addr_hint, unsigned int pages_per_huge_page)
+{
+	unsigned long addr = addr_hint &
+		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
+	const long pgidx = (addr_hint - addr) / PAGE_SIZE;
+	const int first_pg = 0, last_pg = pages_per_huge_page - 1;
+	const int width = 2; /* pages cleared last on either side */
+	int sidx[3], eidx[3];
+	int i, n;
+
+	if (pages_per_huge_page > MAX_ORDER_NR_PAGES)
+		return clear_contig_region(page, pages_per_huge_page);
+
+	/*
+	 * Neighbourhood of the fault. Cleared at the end to ensure
+	 * it sticks around in the cache.
+	 */
+	n = 2;
+	sidx[n] = (pgidx - width) < first_pg ? first_pg : (pgidx - width);
+	eidx[n] = (pgidx + width) > last_pg  ? last_pg  : (pgidx + width);
+
+	sidx[0] = first_pg;	/* Region to the left of the fault */
+	eidx[0] = sidx[n] - 1;
+
+	sidx[1] = eidx[n] + 1;	/* Region to the right of the fault */
+	eidx[1] = last_pg;
+
+	for (i = 0; i <= 2; i++) {
+		if (eidx[i] >= sidx[i])
+			clear_contig_region(page + sidx[i],
+					    eidx[i] - sidx[i] + 1);
+	}
+}
+#endif /* CONFIG_HIGHMEM */
 #endif /* CONFIG_HUGETLB_PAGE */
 
 #ifdef CONFIG_X86_64