[v2,12/14] gup: use uncached path when clearing large regions

Message ID 20211020170305.376118-13-ankur.a.arora@oracle.com (mailing list archive)
State New
Series Use uncached stores while clearing huge pages

Commit Message

Ankur Arora Oct. 20, 2021, 5:03 p.m. UTC
When clearing a large region, or when the user explicitly specifies
via FOLL_HINT_BULK that a call to get_user_pages() is part of a larger
region, take the uncached path.

One notable limitation is that this is only done when the underlying
pages are huge or gigantic, even if a large region composed of PAGE_SIZE
pages is being cleared. This is because uncached stores are generally
weakly ordered and need some kind of store fence -- which would need
to be done at PTE write granularity to avoid data leakage. This would be
expensive enough that it would negate any performance advantage.
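
The ordering concern can be illustrated in user space. The sketch below is
not the kernel's clearing routine; it is a minimal analogue, assuming an
x86-64 build with SSE2 intrinsics, and clear_region_nt() is a hypothetical
helper name. It zeroes a buffer with non-temporal stores and issues a single
store fence at the end. That one fence per region is cheap when the region
is a huge page, but would have to be paid per PTE for PAGE_SIZE pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#if defined(__x86_64__)
#include <emmintrin.h>	/* _mm_stream_si64(), _mm_sfence() */
#endif

/*
 * Hypothetical helper: zero 'len' bytes at 'dst' while bypassing the
 * cache where possible. 'dst' must be 8-byte aligned and 'len' a
 * multiple of 8.
 */
static void clear_region_nt(void *dst, size_t len)
{
	assert(((uintptr_t)dst & 7) == 0 && (len & 7) == 0);

#if defined(__x86_64__)
	long long *p = dst;
	size_t i;

	for (i = 0; i < len / sizeof(*p); i++)
		_mm_stream_si64(&p[i], 0);	/* MOVNTI: weakly ordered store */

	/* One fence per region orders the zeroes before later stores. */
	_mm_sfence();
#else
	memset(dst, 0, len);	/* Fallback: ordinary cached stores. */
#endif
}
```

On other architectures the sketch falls back to memset(), so it only
demonstrates the fence placement on x86-64.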

Performance
====

System:    Oracle E4-2C (2 nodes * 64 cores * 2 threads) (Milan)
Processor: AMD EPYC 7J13 64-Core
Memory:    2048 GB evenly split between nodes
LLC-size:  32MB for each CCX (8-core * 2-threads)
boost: 0, Microcode: 0xa001137, scaling-governor: performance

System:    Oracle X9-2 (2 nodes * 32 cores * 2 threads) (Icelake)
Processor: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
Memory:    512 GB evenly split between nodes
LLC-size:  48MB for each node (32-cores * 2-threads)
no_turbo: 1, Microcode: 0xd0001e0, scaling-governor: performance

Workload: qemu-VM-create
==

Create a large VM, backed by preallocated 2MB pages.
(This test needs a minor change in qemu so it mmap's with
MAP_POPULATE instead of demand faulting each page.)

 Milan,     sz=1550 GB, runs=3   BW           stdev     diff
                                 ----------   ------    --------
 baseline   (clear_page_erms)     8.05 GBps     0.08
 CLZERO   (clear_page_clzero)    29.94 GBps     0.31    +271.92%

 (VM creation time decreases from 192.6s to 51.7s.)

 Icelake, sz=200 GB, runs=3      BW           stdev     diff
                                 ----------   ------    ---------
 baseline   (clear_page_erms)     8.25 GBps     0.05
 MOVNT     (clear_page_movnt)    21.55 GBps     0.31    +161.21%

 (VM creation time decreases from 25.2s to 9.3s.)

As the diff column shows, for both these micro-architectures there's a
significant speedup with the CLZERO and MOVNT based interfaces.
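
The diff column and the VM creation times follow from the measured
bandwidths. A small sketch of that arithmetic, where speedup_pct() and
clear_time_s() are illustrative helper names and not part of this series:

```c
#include <assert.h>

/* Relative bandwidth gain of 'bw_new' over 'bw_base', in percent. */
static double speedup_pct(double bw_base, double bw_new)
{
	assert(bw_base > 0.0);
	return (bw_new / bw_base - 1.0) * 100.0;
}

/* Seconds needed to clear 'size_gb' gigabytes at 'bw_gbps' GB/s. */
static double clear_time_s(double size_gb, double bw_gbps)
{
	assert(bw_gbps > 0.0);
	return size_gb / bw_gbps;
}
```

For the Milan rows, speedup_pct(8.05, 29.94) is roughly 271.9 and
clear_time_s(1550, 29.94) is roughly 51.8. The figures in the tables were
presumably computed from unrounded measurements, so small differences from
these recomputed values are expected.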

Workload: Kernel build with background clear_huge_page()
==

Probe the cache-pollution aspect of this commit with a kernel
build (make -j 15 bzImage) alongside a background clear_huge_page()
load which does mmap(length=64GB, flags=MAP_POPULATE|MAP_HUGE_2MB)
in a loop.
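
A rough sketch of such a background load is given below. The actual test
program is not part of this posting, so the helper names are hypothetical,
and the use of MAP_PRIVATE, MAP_ANONYMOUS and MAP_HUGETLB alongside the two
flags named above is an assumption about what the test would need on Linux.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT	26
#endif
#ifndef MAP_HUGE_2MB
#define MAP_HUGE_2MB	(21 << MAP_HUGE_SHIFT)	/* log2(2MB) == 21 */
#endif

/*
 * Hypothetical helper: map and pre-fault 'len' bytes of 2MB hugetlb
 * pages, then unmap them. Returns 0 on success, -1 on failure.
 */
static int bg_clear_once(size_t len)
{
	int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
		    MAP_HUGE_2MB | MAP_POPULATE;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);

	if (p == MAP_FAILED)
		return -1;
	return munmap(p, len);
}

/* Repeat the populate/unmap cycle 'iters' times; stop on first failure. */
static int bg_clear_loop(size_t len, unsigned int iters)
{
	while (iters--)
		if (bg_clear_once(len))
			return -1;
	return 0;
}
```

A call such as bg_clear_loop(64UL << 30, n) would correspond to the 64GB
length used here. It fails with -1 unless enough 2MB huge pages have been
reserved beforehand.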

The expectation -- assuming the kernel build performance is partly
cache limited -- is that a background load of clear_page_erms()
should slow the build down more than one of clear_page_movnt() or
clear_page_clzero().
The build itself does not use THP or similar, so any performance changes
are due to the background load.

  # Milan, compile.sh internally tasksets to a CCX
  # perf stat -r 5  -e task-clock -e cycles -e stalled-cycles-frontend \
    -e stalled-cycles-backend -e instructions -e branches              \
    -e branch-misses -e L1-dcache-loads -e L1-dcache-load-misses       \
    -e cache-references -e cache-misses -e all_data_cache_accesses     \
    -e l1_data_cache_fills_all  -e l1_data_cache_fills_from_memory     \
    ./compile.sh

      Milan               kernel-build[1]          kernel-build[2]            kernel-build[3]
                          (bg: nothing)            (bg:clear_page_erms())     (bg:clear_page_clzero())
  -----------------     ---------------------      ----------------------     ------------------------
  run time              280.12s     (+- 0.59%)     322.21s     (+- 0.26%)     307.02s     (+- 1.35%)
  IPC                     1.16                       1.05                       1.14
  backend-idle            3.78%     (+- 0.06%)       4.62%     (+- 0.11%)       3.87%     (+- 0.10%)
  cache-misses           20.08%     (+- 0.14%)      20.88%     (+- 0.13%)      20.09%     (+- 0.11%)
  (% of cache-refs)
  l1_data_cache_fills-   2.77M/sec  (+- 0.20%)       3.11M/sec (+- 0.32%)       2.73M/sec (+- 0.12%)
   _from_memory

From the backend-idle stats in [1], the kernel build is only mildly
memory subsystem bound. However, there's a small but clear effect where
the background load of clear_page_clzero() does not leave much of an
imprint on the kernel-build in [3] -- both [1] and [3] have largely
similar IPC, memory and cache behaviour. OTOH, the clear_page_erms()
workload in [2] constrains the kernel-build more.

(Fuller perf stat output, at [1], [2], [3].)

  # Icelake, compile.sh internally tasksets to a socket
  # perf stat -r 5 -e task-clock -e cycles -e stalled-cycles-frontend \
    -e stalled-cycles-backend -e instructions -e branches             \
    -e branch-misses -e L1-dcache-loads -e L1-dcache-load-misses      \
    -e cache-references -e cache-misses -e LLC-loads                  \
    -e LLC-load-misses ./compile.sh

  Icelake               kernel-build[4]         kernel-build[5]            kernel-build[6]
                        (bg: nothing)           (bg:clear_page_erms())     (bg:clear_page_movnt())
  -----------------     -----------------       ----------------------     -----------------------

  run time              135.47s  (+- 0.25%)     136.75s  (+- 0.23%)        135.65s  (+- 0.15%)
  IPC                     1.81                    1.80                       1.80
  cache-misses           21.68%  (+- 0.42%)      22.81%  (+- 0.87%)         21.19%  (+- 0.51%)
  (% of cache-refs)
  LLC-load-misses        35.56%  (+- 0.83%)      37.44%  (+- 0.99%)         33.54%  (+- 1.17%)

From the LLC-load-miss and the cache-miss numbers, clear_page_erms()
seems to cause some additional cache contention in the kernel-build in
[5], compared to [4] and [6]. However, from the IPC and the run time
numbers, it looks like the CPU pipeline compensates for the extra misses
quite well.
(Increasing the number of make jobs to 60 did not change the overall
picture appreciably.)

(Fuller perf stat output, at [4], [5], [6].)

[1] Milan, kernel-build

  Performance counter stats for './compile.sh' (5 runs):

      2,525,721.45 msec task-clock                #    9.016 CPUs utilized            ( +-  0.06% )
 4,642,144,895,632      cycles                    #    1.838 GHz                      ( +-  0.01% )  (47.38%)
    54,430,239,074      stalled-cycles-frontend   #    1.17% frontend cycles idle     ( +-  0.16% )  (47.35%)
   175,620,521,760      stalled-cycles-backend    #    3.78% backend cycles idle      ( +-  0.06% )  (47.34%)
 5,392,053,273,328      instructions              #    1.16  insn per cycle
                                                  #    0.03  stalled cycles per insn  ( +-  0.02% )  (47.34%)
 1,181,224,298,651      branches                  #  467.572 M/sec                    ( +-  0.01% )  (47.33%)
    27,668,103,863      branch-misses             #    2.34% of all branches          ( +-  0.04% )  (47.33%)
 2,141,384,087,286      L1-dcache-loads           #  847.639 M/sec                    ( +-  0.01% )  (47.32%)
    86,216,717,118      L1-dcache-load-misses     #    4.03% of all L1-dcache accesses  ( +-  0.08% )  (47.35%)
   264,844,001,975      cache-references          #  104.835 M/sec                    ( +-  0.03% )  (47.36%)
    53,225,109,745      cache-misses              #   20.086 % of all cache refs      ( +-  0.14% )  (47.37%)
 2,610,041,169,859      all_data_cache_accesses   #    1.033 G/sec                    ( +-  0.01% )  (47.37%)
    96,419,361,379      l1_data_cache_fills_all   #   38.166 M/sec                    ( +-  0.06% )  (47.37%)
     7,005,118,698      l1_data_cache_fills_from_memory #    2.773 M/sec                    ( +-  0.20% )  (47.38%)

            280.12 +- 1.65 seconds time elapsed  ( +-  0.59% )

[2] Milan, kernel-build + (bg: clear_page_erms() workload)

 Performance counter stats for './compile.sh' (5 runs):

      2,852,168.93 msec task-clock                #    8.852 CPUs utilized            ( +-  0.14% )
 5,166,249,772,084      cycles                    #    1.821 GHz                      ( +-  0.05% )  (47.27%)
    62,039,291,151      stalled-cycles-frontend   #    1.20% frontend cycles idle     ( +-  0.04% )  (47.29%)
   238,472,446,709      stalled-cycles-backend    #    4.62% backend cycles idle      ( +-  0.11% )  (47.30%)
 5,419,530,293,688      instructions              #    1.05  insn per cycle
                                                  #    0.04  stalled cycles per insn  ( +-  0.01% )  (47.31%)
 1,186,958,893,481      branches                  #  418.404 M/sec                    ( +-  0.01% )  (47.31%)
    28,106,023,654      branch-misses             #    2.37% of all branches          ( +-  0.03% )  (47.29%)
 2,160,377,315,024      L1-dcache-loads           #  761.534 M/sec                    ( +-  0.03% )  (47.26%)
    89,101,836,173      L1-dcache-load-misses     #    4.13% of all L1-dcache accesses  ( +-  0.06% )  (47.25%)
   276,859,144,248      cache-references          #   97.593 M/sec                    ( +-  0.04% )  (47.22%)
    57,774,174,239      cache-misses              #   20.889 % of all cache refs      ( +-  0.13% )  (47.24%)
 2,641,613,011,234      all_data_cache_accesses   #  931.170 M/sec                    ( +-  0.01% )  (47.22%)
    99,595,968,133      l1_data_cache_fills_all   #   35.108 M/sec                    ( +-  0.06% )  (47.24%)
     8,831,873,628      l1_data_cache_fills_from_memory #    3.113 M/sec                    ( +-  0.32% )  (47.23%)

           322.211 +- 0.837 seconds time elapsed  ( +-  0.26% )

[3] Milan, kernel-build + (bg: clear_page_clzero() workload)

 Performance counter stats for './compile.sh' (5 runs):

      2,607,387.17 msec task-clock                #    8.493 CPUs utilized            ( +-  0.14% )
 4,749,807,054,468      cycles                    #    1.824 GHz                      ( +-  0.09% )  (47.28%)
    56,579,908,946      stalled-cycles-frontend   #    1.19% frontend cycles idle     ( +-  0.19% )  (47.28%)
   183,367,955,020      stalled-cycles-backend    #    3.87% backend cycles idle      ( +-  0.10% )  (47.28%)
 5,395,577,260,957      instructions              #    1.14  insn per cycle
                                                  #    0.03  stalled cycles per insn  ( +-  0.02% )  (47.29%)
 1,181,904,525,139      branches                  #  453.753 M/sec                    ( +-  0.01% )  (47.30%)
    27,702,316,890      branch-misses             #    2.34% of all branches          ( +-  0.02% )  (47.31%)
 2,137,616,885,978      L1-dcache-loads           #  820.667 M/sec                    ( +-  0.01% )  (47.32%)
    85,841,996,509      L1-dcache-load-misses     #    4.02% of all L1-dcache accesses  ( +-  0.03% )  (47.32%)
   262,784,890,310      cache-references          #  100.888 M/sec                    ( +-  0.04% )  (47.32%)
    52,812,245,646      cache-misses              #   20.094 % of all cache refs      ( +-  0.11% )  (47.32%)
 2,605,653,350,299      all_data_cache_accesses   #    1.000 G/sec                    ( +-  0.01% )  (47.32%)
    95,770,076,665      l1_data_cache_fills_all   #   36.768 M/sec                    ( +-  0.03% )  (47.30%)
     7,134,690,513      l1_data_cache_fills_from_memory #    2.739 M/sec                    ( +-  0.12% )  (47.29%)

            307.02 +- 4.15 seconds time elapsed  ( +-  1.35% )

[4] Icelake, kernel-build

 Performance counter stats for './compile.sh' (5 runs):

           421,633      cs                        #  358.780 /sec                     ( +-  0.04% )
      1,173,522.36 msec task-clock                #    8.662 CPUs utilized            ( +-  0.14% )
 2,991,427,421,282      cycles                    #    2.545 GHz                      ( +-  0.15% )  (82.42%)
 5,410,090,251,681      instructions              #    1.81  insn per cycle           ( +-  0.02% )  (91.13%)
 1,189,406,048,438      branches                  #    1.012 G/sec                    ( +-  0.02% )  (91.05%)
    21,291,454,717      branch-misses             #    1.79% of all branches          ( +-  0.02% )  (91.06%)
 1,462,419,736,675      L1-dcache-loads           #    1.244 G/sec                    ( +-  0.02% )  (91.06%)
    47,084,269,809      L1-dcache-load-misses     #    3.22% of all L1-dcache accesses  ( +-  0.01% )  (91.05%)
    23,527,140,332      cache-references          #   20.020 M/sec                    ( +-  0.13% )  (91.04%)
     5,093,132,060      cache-misses              #   21.682 % of all cache refs      ( +-  0.42% )  (91.03%)
     4,220,672,439      LLC-loads                 #    3.591 M/sec                    ( +-  0.14% )  (91.04%)
     1,501,704,609      LLC-load-misses           #   35.56% of all LL-cache accesses  ( +-  0.83% )  (73.10%)

           135.478 +- 0.335 seconds time elapsed  ( +-  0.25% )

[5] Icelake, kernel-build + (bg: clear_page_erms() workload)

 Performance counter stats for './compile.sh' (5 runs):

           410,611      cs                        #  347.771 /sec                     ( +-  0.02% )
      1,184,382.84 msec task-clock                #    8.661 CPUs utilized            ( +-  0.08% )
 3,018,535,155,772      cycles                    #    2.557 GHz                      ( +-  0.08% )  (82.42%)
 5,408,788,104,113      instructions              #    1.80  insn per cycle           ( +-  0.00% )  (91.13%)
 1,189,173,209,515      branches                  #    1.007 G/sec                    ( +-  0.00% )  (91.05%)
    21,279,087,578      branch-misses             #    1.79% of all branches          ( +-  0.01% )  (91.06%)
 1,462,243,374,967      L1-dcache-loads           #    1.238 G/sec                    ( +-  0.00% )  (91.05%)
    47,210,704,159      L1-dcache-load-misses     #    3.23% of all L1-dcache accesses  ( +-  0.02% )  (91.04%)
    23,378,470,958      cache-references          #   19.801 M/sec                    ( +-  0.03% )  (91.05%)
     5,339,921,426      cache-misses              #   22.814 % of all cache refs      ( +-  0.87% )  (91.03%)
     4,241,388,134      LLC-loads                 #    3.592 M/sec                    ( +-  0.02% )  (91.05%)
     1,588,055,137      LLC-load-misses           #   37.44% of all LL-cache accesses  ( +-  0.99% )  (73.09%)

           136.750 +- 0.315 seconds time elapsed  ( +-  0.23% )

[6] Icelake, kernel-build + (bg: clear_page_movnt() workload)

 Performance counter stats for './compile.sh' (5 runs):

           409,978      cs                        #  347.850 /sec                     ( +-  0.06% )
      1,174,090.99 msec task-clock                #    8.655 CPUs utilized            ( +-  0.10% )
 2,992,914,428,930      cycles                    #    2.539 GHz                      ( +-  0.10% )  (82.40%)
 5,408,632,560,457      instructions              #    1.80  insn per cycle           ( +-  0.00% )  (91.12%)
 1,189,083,425,674      branches                  #    1.009 G/sec                    ( +-  0.00% )  (91.05%)
    21,273,992,588      branch-misses             #    1.79% of all branches          ( +-  0.02% )  (91.05%)
 1,462,081,591,012      L1-dcache-loads           #    1.241 G/sec                    ( +-  0.00% )  (91.05%)
    47,071,136,770      L1-dcache-load-misses     #    3.22% of all L1-dcache accesses  ( +-  0.03% )  (91.04%)
    23,331,268,072      cache-references          #   19.796 M/sec                    ( +-  0.05% )  (91.04%)
     4,953,198,057      cache-misses              #   21.190 % of all cache refs      ( +-  0.51% )  (91.04%)
     4,194,721,070      LLC-loads                 #    3.559 M/sec                    ( +-  0.10% )  (91.06%)
     1,412,414,538      LLC-load-misses           #   33.54% of all LL-cache accesses  ( +-  1.17% )  (73.09%)

           135.654 +- 0.203 seconds time elapsed  ( +-  0.15% )

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 fs/hugetlbfs/inode.c |  7 ++++++-
 mm/gup.c             | 20 ++++++++++++++++++++
 mm/huge_memory.c     |  2 +-
 mm/hugetlb.c         |  9 ++++++++-
 4 files changed, 35 insertions(+), 3 deletions(-)
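
The way the hint travels through the hunks below can be modelled in user
space. This is only a sketch: the MODEL_* constants are made-up values, and
model_prefer_uncached() is a stand-in for clear_page_prefer_uncached(),
which is defined elsewhere in the series. Its threshold here (half of a
32MB LLC, compared with >=) is an assumption based on the "~LLC/2" comment
in mm/gup.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative values only; the kernel's real flag bits differ. */
#define MODEL_FOLL_HINT_BULK		0x1u
#define MODEL_FAULT_FLAG_UNCACHED	0x2u
#define MODEL_PAGE_SIZE			4096ul

/* Assumed threshold: half of a hypothetical 32MB LLC. */
static const size_t model_uncached_threshold = (32ul << 20) / 2;

/* Stand-in for clear_page_prefer_uncached(). */
static bool model_prefer_uncached(size_t extent)
{
	return extent >= model_uncached_threshold;
}

/* Mirrors the __get_user_pages() hunk: size-based hint on the gup flags. */
static unsigned int model_gup_flags(unsigned int gup_flags, size_t nr_pages)
{
	if (model_prefer_uncached(nr_pages * MODEL_PAGE_SIZE))
		gup_flags |= MODEL_FOLL_HINT_BULK;
	return gup_flags;
}

/* Mirrors faultin_page()/follow_hugetlb_page(): hint becomes a fault flag. */
static unsigned int model_fault_flags(unsigned int gup_flags)
{
	unsigned int fault_flags = 0;

	if (gup_flags & MODEL_FOLL_HINT_BULK)
		fault_flags |= MODEL_FAULT_FLAG_UNCACHED;
	return fault_flags;
}

/* Mirrors the huge-page fault handlers: 'uncached' is read from the flags. */
static bool model_clear_uncached(unsigned int fault_flags)
{
	return fault_flags & MODEL_FAULT_FLAG_UNCACHED;
}
```

In this model a caller that passes the hint explicitly reaches the uncached
path regardless of nr_pages, while a caller without the hint reaches it only
when the requested extent crosses the assumed threshold.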
Patch

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index cdfb1ae78a3f..44cee9d30035 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -636,6 +636,7 @@  static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 	loff_t hpage_size = huge_page_size(h);
 	unsigned long hpage_shift = huge_page_shift(h);
 	pgoff_t start, index, end;
+	bool hint_uncached;
 	int error;
 	u32 hash;
 
@@ -653,6 +654,9 @@  static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 	start = offset >> hpage_shift;
 	end = (offset + len + hpage_size - 1) >> hpage_shift;
 
+	/* Don't pollute the cache if we are fallocate'ing a large region. */
+	hint_uncached = clear_page_prefer_uncached((end - start) << hpage_shift);
+
 	inode_lock(inode);
 
 	/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
@@ -731,7 +735,8 @@  static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 			error = PTR_ERR(page);
 			goto out;
 		}
-		clear_huge_page(page, addr, pages_per_huge_page(h));
+		clear_huge_page(page, addr, pages_per_huge_page(h),
+				hint_uncached);
 		__SetPageUptodate(page);
 		error = huge_add_to_page_cache(page, mapping, index);
 		if (unlikely(error)) {
diff --git a/mm/gup.c b/mm/gup.c
index 886d6148d3d0..930944e0c6eb 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -933,6 +933,13 @@  static int faultin_page(struct vm_area_struct *vma,
 		 */
 		fault_flags |= FAULT_FLAG_TRIED;
 	}
+	if (*flags & FOLL_HINT_BULK) {
+		/*
+		 * From the user hint, we might be faulting-in a large region
+		 * so minimize cache-pollution.
+		 */
+		fault_flags |= FAULT_FLAG_UNCACHED;
+	}
 
 	ret = handle_mm_fault(vma, address, fault_flags, NULL);
 	if (ret & VM_FAULT_ERROR) {
@@ -1100,6 +1107,19 @@  static long __get_user_pages(struct mm_struct *mm,
 	if (!(gup_flags & FOLL_FORCE))
 		gup_flags |= FOLL_NUMA;
 
+	/*
+	 * Uncached page clearing is generally faster when clearing regions
+	 * sized ~LLC/2 or thereabouts. So hint the uncached path based
+	 * on clear_page_prefer_uncached().
+	 *
+	 * Note, however, that this get_user_pages() call might end up
+	 * needing to clear an extent smaller than nr_pages when we have
+	 * taken the (potentially slower) uncached path based on the
+	 * expectation of a larger nr_pages value.
+	 */
+	if (clear_page_prefer_uncached(nr_pages * PAGE_SIZE))
+		gup_flags |= FOLL_HINT_BULK;
+
 	do {
 		struct page *page;
 		unsigned int foll_flags = gup_flags;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ffd4b07285ba..2d239967a8a1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -600,7 +600,7 @@  static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 	pgtable_t pgtable;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
 	vm_fault_t ret = 0;
-	bool uncached = false;
+	bool uncached = vmf->flags & FAULT_FLAG_UNCACHED;
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a920b1133cdb..35b643df5854 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4874,7 +4874,7 @@  static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	spinlock_t *ptl;
 	unsigned long haddr = address & huge_page_mask(h);
 	bool new_page, new_pagecache_page = false;
-	bool uncached = false;
+	bool uncached = flags & FAULT_FLAG_UNCACHED;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -5503,6 +5503,13 @@  long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				 */
 				fault_flags |= FAULT_FLAG_TRIED;
 			}
+			if (flags & FOLL_HINT_BULK) {
+				/*
+				 * From the user hint, we might be faulting-in a large
+				 * region so minimize cache-pollution.
+				 */
+				fault_flags |= FAULT_FLAG_UNCACHED;
+			}
 			ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
 			if (ret & VM_FAULT_ERROR) {
 				err = vm_fault_to_errno(ret, flags);