RFC: clear 1G pages with streaming stores on x86

Message ID 20180724204639.26934-1-cannonmatthews@google.com (mailing list archive)
State New, archived
Series RFC: clear 1G pages with streaming stores on x86

Commit Message

Cannon Matthews July 24, 2018, 8:46 p.m. UTC
Reimplement clear_gigantic_page() to clear gigantic pages using the
non-temporal streaming store instructions that bypass the cache
(movnti), since an entire 1GiB region will not fit in the cache anyway.

Doing an mlock() on a 512GiB 1G-hugetlb region previously would take on
average 134 seconds, about 260ms/GiB which is quite slow. Using `movnti`
and optimizing the control flow over the constituent small pages, this
can be improved roughly by a factor of 3-4x, with the 512GiB mlock()
taking only 34 seconds on average, or 67ms/GiB.

The assembly code for the __clear_page_nt routine is more or less
taken directly from the output of gcc with -O3 for this function with
some tweaks to support arbitrary sizes and moving memory barriers:

void clear_page_nt_64i (void *page)
{
  for (int i = 0; i < GiB /sizeof(long long int); ++i)
    {
      _mm_stream_si64 (((long long int*)page) + i, 0);
    }
  sfence();
}
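
For reference, a self-contained version of that snippet which actually
compiles (using the _mm_stream_si64()/_mm_sfence() intrinsics from
<x86intrin.h> and spelling out GiB) looks roughly like the below; it is
illustrative only, not the exact code the assembly was generated from:

#include <x86intrin.h>	/* _mm_stream_si64, _mm_sfence */
#include <stddef.h>

#define GiB (1ULL << 30)

/* Zero one 1GiB region with 64-bit non-temporal stores. */
static void clear_page_nt_64i(void *page)
{
	long long *p = page;

	for (size_t i = 0; i < GiB / sizeof(long long); ++i)
		_mm_stream_si64(p + i, 0);
	_mm_sfence();	/* order the NT stores before later accesses */
}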

In general I would love to hear any thoughts and feedback on this
approach and any ways it could be improved.

Some specific questions:

- What is the appropriate method for defining an arch specific
implementation like this, is the #ifndef code sufficient, and did stuff
land in appropriate files?

- Are there any obvious pitfalls or caveats that have not been
considered? In particular the iterator over mem_map_next() seemed like a
no-op on x86, but looked like it could be important in certain
configurations or architectures I am not familiar with.

- Are there any x86_64 implementations that do not support SSE2
instructions like `movnti` ? What is the appropriate way to detect and
code around that if so?

- Is there anything that could be improved about the assembly code? I
originally wrote it in C and don't have much experience hand writing x86
asm, which seems riddled with optimization pitfalls.

- Is the highmem codepath really necessary? would 1GiB pages really be
of much use on a highmem system? We recently removed some other parts of
the code that support HIGHMEM for gigantic pages (see:
http://lkml.kernel.org/r/20180711195913.1294-1-mike.kravetz@oracle.com)
so this seems like a logical continuation.

- The calls to cond_resched() have been reduced from one between every
4k page to one every 64 pages, as calling it between each of the 256K
constituent pages seemed overly frequent. Does this seem like an
appropriate frequency? On an idle system with many spare CPUs it gets
rescheduled typically once or twice out of the 4096 times it calls
cond_resched(), which seems like it is maybe the right amount, but more
insight from a scheduling/latency point of view would be helpful.

- Any other thoughts on the change overall and ways that this could
be made more generally useful, and designed to be easily extensible to
other platforms with non-temporal instructions and 1G pages, or any
additional pitfalls I have not thought to consider.

Tested:
	Time to `mlock()` a 512GiB region on broadwell CPU
				AVG time (s)	% imp.	ms/page
	clear_page_erms		133.584		-	261
	clear_page_nt		34.154		74.43%	67

Signed-off-by: Cannon Matthews <cannonmatthews@google.com>
---
 arch/x86/include/asm/page_64.h     |  3 +++
 arch/x86/lib/Makefile              |  2 +-
 arch/x86/lib/clear_gigantic_page.c | 30 ++++++++++++++++++++++++++++++
 arch/x86/lib/clear_page_64.S       | 20 ++++++++++++++++++++
 include/linux/mm.h                 |  3 +++
 mm/memory.c                        |  5 ++++-
 6 files changed, 61 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/lib/clear_gigantic_page.c

--
2.18.0.233.g985f88cf7e-goog

Comments

Andrew Morton July 24, 2018, 8:53 p.m. UTC | #1
On Tue, 24 Jul 2018 13:46:39 -0700 Cannon Matthews <cannonmatthews@google.com> wrote:

> Reimplement clear_gigantic_page() to clear gigabytes pages using the
> non-temporal streaming store instructions that bypass the cache
> (movnti), since an entire 1GiB region will not fit in the cache anyway.
> 
> ...
> 
> Tested:
> 	Time to `mlock()` a 512GiB region on broadwell CPU
> 				AVG time (s)	% imp.	ms/page
> 	clear_page_erms		133.584		-	261
> 	clear_page_nt		34.154		74.43%	67

A gigantic improvement!

> --- a/arch/x86/include/asm/page_64.h
> +++ b/arch/x86/include/asm/page_64.h
> @@ -56,6 +56,9 @@ static inline void clear_page(void *page)
> 
>  void copy_page(void *to, void *from);
> 
> +#define __HAVE_ARCH_CLEAR_GIGANTIC_PAGE
> +void __clear_page_nt(void *page, u64 page_size);

Nit: the modern way is

#ifndef __clear_page_nt
void __clear_page_nt(void *page, u64 page_size);
#define __clear_page_nt __clear_page_nt
#endif

Not sure why, really.  I guess it avoids adding two symbols and 
having to remember and maintain the relationship between them.
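
Condensed, the two sides of that pattern end up looking something like
the sketch below (illustrative only, not the exact hunks):

/* arch/x86/include/asm/page_64.h */
#ifndef __clear_page_nt
void __clear_page_nt(void *page, u64 page_size);
#define __clear_page_nt __clear_page_nt
#endif

/* mm/memory.c: generic fallback, compiled only when no arch override */
#ifndef __clear_page_nt
void clear_gigantic_page(struct page *page, unsigned long addr,
			 unsigned int pages_per_huge_page)
{
	/* ... clear_user_highpage() loop ... */
}
#endif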

> --- /dev/null
> +++ b/arch/x86/lib/clear_gigantic_page.c
> @@ -0,0 +1,30 @@
> +#include <asm/page.h>
> +
> +#include <linux/kernel.h>
> +#include <linux/mm.h>
> +#include <linux/sched.h>
> +
> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
> +#define PAGES_BETWEEN_RESCHED 64
> +void clear_gigantic_page(struct page *page,
> +				unsigned long addr,
> +				unsigned int pages_per_huge_page)
> +{
> +	int i;
> +	void *dest = page_to_virt(page);
> +	int resched_count = 0;
> +
> +	BUG_ON(pages_per_huge_page % PAGES_BETWEEN_RESCHED != 0);
> +	BUG_ON(!dest);
> +
> +	might_sleep();

cond_resched() already does might_sleep() - it doesn't seem needed here.

> +	for (i = 0; i < pages_per_huge_page; i += PAGES_BETWEEN_RESCHED) {
> +		__clear_page_nt(dest + (i * PAGE_SIZE),
> +				PAGES_BETWEEN_RESCHED * PAGE_SIZE);
> +		resched_count += cond_resched();
> +	}
> +	/* __clear_page_nt requrires and `sfence` barrier. */
> +	wmb();
> +	pr_debug("clear_gigantic_page: rescheduled %d times\n", resched_count);
> +}
> +#endif
Matthew Wilcox (Oracle) July 24, 2018, 9:09 p.m. UTC | #2
On Tue, Jul 24, 2018 at 01:46:39PM -0700, Cannon Matthews wrote:
> Reimplement clear_gigantic_page() to clear gigabytes pages using the
> non-temporal streaming store instructions that bypass the cache
> (movnti), since an entire 1GiB region will not fit in the cache anyway.
> 
> Doing an mlock() on a 512GiB 1G-hugetlb region previously would take on
> average 134 seconds, about 260ms/GiB which is quite slow. Using `movnti`
> and optimizing the control flow over the constituent small pages, this
> can be improved roughly by a factor of 3-4x, with the 512GiB mlock()
> taking only 34 seconds on average, or 67ms/GiB.

This is great data ...

> - The calls to cond_resched() have been reduced from between every 4k
> page to every 64, as between all of the 256K page seemed overly
> frequent.  Does this seem like an appropriate frequency? On an idle
> system with many spare CPUs it get's rescheduled typically once or twice
> out of the 4096 times it calls cond_resched(), which seems like it is
> maybe the right amount, but more insight from a scheduling/latency point
> of view would be helpful.

... which makes the lack of data here disappointing -- what're the
comparable timings if you do check every 4kB or every 64kB instead of
every 256kB?

> The assembly code for the __clear_page_nt routine is more or less
> taken directly from the output of gcc with -O3 for this function with
> some tweaks to support arbitrary sizes and moving memory barriers:
> 
> void clear_page_nt_64i (void *page)
> {
>   for (int i = 0; i < GiB /sizeof(long long int); ++i)
>     {
>       _mm_stream_si64 (((long long int*)page) + i, 0);
>     }
>   sfence();
> }
> 
> In general I would love to hear any thoughts and feedback on this
> approach and any ways it could be improved.
> 
> Some specific questions:
> 
> - What is the appropriate method for defining an arch specific
> implementation like this, is the #ifndef code sufficient, and did stuff
> land in appropriate files?
> 
> - Are there any obvious pitfalls or caveats that have not been
> considered? In particular the iterator over mem_map_next() seemed like a
> no-op on x86, but looked like it could be important in certain
> configurations or architectures I am not familiar with.
> 
> - Are there any x86_64 implementations that do not support SSE2
> instructions like `movnti` ? What is the appropriate way to detect and
> code around that if so?

No.  SSE2 was introduced with the Pentium 4, before x86-64.  The XMM
registers are used as part of the x86-64 calling conventions, so SSE2
is mandatory for x86-64 implementations.

> - Is there anything that could be improved about the assembly code? I
> originally wrote it in C and don't have much experience hand writing x86
> asm, which seems riddled with optimization pitfalls.

I suspect it might be slightly faster if implemented as inline asm in the
x86 clear_gigantic_page() implementation instead of a function call.
Might not affect performance a lot though.

> - Is the highmem codepath really necessary? would 1GiB pages really be
> of much use on a highmem system? We recently removed some other parts of
> the code that support HIGHMEM for gigantic pages (see:
> http://lkml.kernel.org/r/20180711195913.1294-1-mike.kravetz@oracle.com)
> so this seems like a logical continuation.

PAE paging doesn't support 1GB pages, so there's no need for it on x86.

> diff --git a/mm/memory.c b/mm/memory.c
> index 7206a634270b..2515cae4af4e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -70,6 +70,7 @@
>  #include <linux/dax.h>
>  #include <linux/oom.h>
> 
> +
>  #include <asm/io.h>
>  #include <asm/mmu_context.h>
>  #include <asm/pgalloc.h>

Spurious.
Cannon Matthews July 25, 2018, 2:37 a.m. UTC | #3
Reimplement clear_gigantic_page() to clear gigantic pages using the
non-temporal streaming store instructions that bypass the cache
(movnti), since an entire 1GiB region will not fit in the cache anyway.

Doing an mlock() on a 512GiB 1G-hugetlb region previously would take on
average 134 seconds, about 260ms/GiB which is quite slow. Using `movnti`
and optimizing the control flow over the constituent small pages, this
can be improved roughly by a factor of 3-4x, with the 512GiB mlock()
taking only 34 seconds on average, or 67ms/GiB.

The assembly code for the __clear_page_nt routine is more or less
taken directly from the output of gcc with -O3 for this function with
some tweaks to support arbitrary sizes and moving memory barriers:

void clear_page_nt_64i (void *page)
{
  for (int i = 0; i < GiB /sizeof(long long int); ++i)
    {
      _mm_stream_si64 (((long long int*)page) + i, 0);
    }
  sfence();
}

In general I would love to hear any thoughts and feedback on this
approach and any ways it could be improved.

Some specific questions:

- What is the appropriate method for defining an arch specific
implementation like this, is the #ifndef code sufficient, and did stuff
land in appropriate files?

- Are there any obvious pitfalls or caveats that have not been
considered? In particular the iterator over mem_map_next() seemed like a
no-op on x86, but looked like it could be important in certain
configurations or architectures I am not familiar with.

- Is there anything that could be improved about the assembly code? I
originally wrote it in C and don't have much experience hand writing x86
asm, which seems riddled with optimization pitfalls.

- Is the highmem codepath really necessary? would 1GiB pages really be
of much use on a highmem system? We recently removed some other parts of
the code that support HIGHMEM for gigantic pages (see:
http://lkml.kernel.org/r/20180711195913.1294-1-mike.kravetz@oracle.com)
so this seems like a logical continuation.

- The calls to cond_resched() have been reduced from one between every
4k page to one every 64 pages, as calling it between each of the 256K
constituent pages seemed overly frequent. Does this seem like an
appropriate frequency? On an idle system with many spare CPUs it gets
rescheduled typically once or twice out of the 4096 times it calls
cond_resched(), which seems like it is maybe the right amount, but more
insight from a scheduling/latency point of view would be helpful. See
the "Tested:" section below for some more data.

- Any other thoughts on the change overall and ways that this could
be made more generally useful, and designed to be easily extensible to
other platforms with non-temporal instructions and 1G pages, or any
additional pitfalls I have not thought to consider.

Tested:
	Time to `mlock()` a 512GiB region on broadwell CPU
				AVG time (s)	% imp.	ms/page
	clear_page_erms		133.584		-	261
	clear_page_nt		34.154		74.43%	67

For a more in depth look at how the frequency we call cond_resched() affects
the time this takes, I tested both on an idle system, and a system running
`stress -c N` program to overcommit CPU to ~115%, and ran 10 replications of
the 512GiB mlock test.

Unfortunately there wasn't as clear of a pattern as I had hoped. On an
otherwise idle system there is no substantive difference between
different values of PAGES_BETWEEN_RESCHED.

On a stressed system, there appears to be a pattern, that resembles something
of a bell curve: constantly offering to yield, or never yielding until the end
produces the fastest results, but yielding infrequently increases latency to a
slight degree.

That being said, it's not clear this is actually a significant
difference; the std deviation is occasionally quite high, and perhaps a
larger sample set would be more informative. From looking at the log
messages indicating the number of times cond_resched() returned 1, there
wasn't that much variance, with it usually being 1 or 2 when idle, and
only increasing to ~4-7 when stressed.


	PAGES_BETWEEN_RESCHED	state	AVG	stddev
	1	4 KiB		idle	36.086	1.920
	16	64 KiB		idle	34.797	1.702
	32	128 KiB		idle	35.104	1.752
	64	256 KiB		idle	34.468	0.661
	512	2048 KiB	idle	36.427	0.946
	2048	8192 KiB	idle	34.988	2.406
	262144	1048576 KiB	idle	36.792	0.193
	infin	512 GiB		idle	38.817	0.238  [causes softlockup]
	1	4 KiB		stress 	55.562	0.661
	16	64 KiB		stress 	57.509	0.248
	32	128 KiB		stress 	69.265	3.913
	64	256 KiB		stress 	70.217	4.534
	512	2048 KiB	stress 	68.474	1.708
	2048	8192 KiB	stress 	70.806	1.068
	262144	1048576 KiB	stress 	55.217	1.184
	infin	512 GiB		stress 	55.062	0.291  [causes softlockup]

Signed-off-by: Cannon Matthews <cannonmatthews@google.com>
---

v2:
 - Removed question about SSE2 Availability.
 - Changed #ifndef symbol to match function
 - removed spurious newlines
 - Expanded Tested: field to include additional timings for different sizes
   between cond_resched().

 arch/x86/include/asm/page_64.h     |  5 +++++
 arch/x86/lib/Makefile              |  2 +-
 arch/x86/lib/clear_gigantic_page.c | 29 +++++++++++++++++++++++++++++
 arch/x86/lib/clear_page_64.S       | 20 ++++++++++++++++++++
 include/linux/mm.h                 |  3 +++
 mm/memory.c                        |  4 +++-
 6 files changed, 61 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/lib/clear_gigantic_page.c

diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 939b1cff4a7b..177196d6abc7 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -56,6 +56,11 @@ static inline void clear_page(void *page)

 void copy_page(void *to, void *from);

+#ifndef __clear_page_nt
+void __clear_page_nt(void *page, u64 page_size);
+#define __clear_page_nt __clear_page_nt
+#endif  /* __clear_page_nt */
+
 #endif	/* !__ASSEMBLY__ */

 #ifdef CONFIG_X86_VSYSCALL_EMULATION
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index 25a972c61b0a..4ba395234088 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -44,7 +44,7 @@ endif
 else
         obj-y += iomap_copy_64.o
         lib-y += csum-partial_64.o csum-copy_64.o csum-wrappers_64.o
-        lib-y += clear_page_64.o copy_page_64.o
+        lib-y += clear_page_64.o copy_page_64.o clear_gigantic_page.o
         lib-y += memmove_64.o memset_64.o
         lib-y += copy_user_64.o
 	lib-y += cmpxchg16b_emu.o
diff --git a/arch/x86/lib/clear_gigantic_page.c b/arch/x86/lib/clear_gigantic_page.c
new file mode 100644
index 000000000000..0d51e38b5be0
--- /dev/null
+++ b/arch/x86/lib/clear_gigantic_page.c
@@ -0,0 +1,29 @@
+#include <asm/page.h>
+
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/sched.h>
+
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+#define PAGES_BETWEEN_RESCHED 64
+void clear_gigantic_page(struct page *page,
+				unsigned long addr,
+				unsigned int pages_per_huge_page)
+{
+	int i;
+	void *dest = page_to_virt(page);
+	int resched_count = 0;
+
+	BUG_ON(pages_per_huge_page % PAGES_BETWEEN_RESCHED != 0);
+	BUG_ON(!dest);
+
+	for (i = 0; i < pages_per_huge_page; i += PAGES_BETWEEN_RESCHED) {
+		__clear_page_nt(dest + (i * PAGE_SIZE),
+				PAGES_BETWEEN_RESCHED * PAGE_SIZE);
+		resched_count += cond_resched();
+	}
+	/* __clear_page_nt requrires and `sfence` barrier. */
+	wmb();
+	pr_debug("clear_gigantic_page: rescheduled %d times\n", resched_count);
+}
+#endif
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index 88acd349911b..81a39804ac72 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -49,3 +49,23 @@ ENTRY(clear_page_erms)
 	ret
 ENDPROC(clear_page_erms)
 EXPORT_SYMBOL_GPL(clear_page_erms)
+
+/*
+ * Zero memory using non temporal stores, bypassing the cache.
+ * Requires an `sfence` (wmb()) afterwards.
+ * %rdi - destination.
+ * %rsi - page size. Must be 64 bit aligned.
+*/
+ENTRY(__clear_page_nt)
+	leaq	(%rdi,%rsi), %rdx
+	xorl	%eax, %eax
+	.p2align 4,,10
+	.p2align 3
+.L2:
+	movnti	%rax, (%rdi)
+	addq	$8, %rdi
+	cmpq	%rdx, %rdi
+	jne	.L2
+	ret
+ENDPROC(__clear_page_nt)
+EXPORT_SYMBOL(__clear_page_nt)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a0fbb9ffe380..d10ac4e7ef6a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2729,6 +2729,9 @@ enum mf_action_page_type {
 };

 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+extern void clear_gigantic_page(struct page *page,
+			 unsigned long addr,
+			 unsigned int pages_per_huge_page);
 extern void clear_huge_page(struct page *page,
 			    unsigned long addr_hint,
 			    unsigned int pages_per_huge_page);
diff --git a/mm/memory.c b/mm/memory.c
index 7206a634270b..e43a3a446380 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4568,7 +4568,8 @@ EXPORT_SYMBOL(__might_fault);
 #endif

 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
-static void clear_gigantic_page(struct page *page,
+#ifndef __clear_page_nt
+void clear_gigantic_page(struct page *page,
 				unsigned long addr,
 				unsigned int pages_per_huge_page)
 {
@@ -4582,6 +4583,7 @@ static void clear_gigantic_page(struct page *page,
 		clear_user_highpage(p, addr + i * PAGE_SIZE);
 	}
 }
+#endif  /* __clear_page_nt */
 void clear_huge_page(struct page *page,
 		     unsigned long addr_hint, unsigned int pages_per_huge_page)
 {
--
2.18.0.233.g985f88cf7e-goog
Cannon Matthews July 25, 2018, 2:46 a.m. UTC | #4
On Tue, Jul 24, 2018 at 2:09 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Jul 24, 2018 at 01:46:39PM -0700, Cannon Matthews wrote:
> > Reimplement clear_gigantic_page() to clear gigabytes pages using the
> > non-temporal streaming store instructions that bypass the cache
> > (movnti), since an entire 1GiB region will not fit in the cache anyway.
> >
> > Doing an mlock() on a 512GiB 1G-hugetlb region previously would take on
> > average 134 seconds, about 260ms/GiB which is quite slow. Using `movnti`
> > and optimizing the control flow over the constituent small pages, this
> > can be improved roughly by a factor of 3-4x, with the 512GiB mlock()
> > taking only 34 seconds on average, or 67ms/GiB.
>
> This is great data ...
Thanks!
>
> > - The calls to cond_resched() have been reduced from between every 4k
> > page to every 64, as between all of the 256K page seemed overly
> > frequent.  Does this seem like an appropriate frequency? On an idle
> > system with many spare CPUs it get's rescheduled typically once or twice
> > out of the 4096 times it calls cond_resched(), which seems like it is
> > maybe the right amount, but more insight from a scheduling/latency point
> > of view would be helpful.
>
> ... which makes the lack of data here disappointing -- what're the
> comparable timings if you do check every 4kB or every 64kB instead of
> every 256kB?

Fair enough, my data was lacking in that axis. I ran a bunch of trials
with different sizes and included that in the v2 patch description.

TL;DR: It doesn't seem to make a significant difference in performance,
but might need more trials to know with more confidence.

>
> > The assembly code for the __clear_page_nt routine is more or less
> > taken directly from the output of gcc with -O3 for this function with
> > some tweaks to support arbitrary sizes and moving memory barriers:
> >
> > void clear_page_nt_64i (void *page)
> > {
> >   for (int i = 0; i < GiB /sizeof(long long int); ++i)
> >     {
> >       _mm_stream_si64 (((long long int*)page) + i, 0);
> >     }
> >   sfence();
> > }
> >
> > In general I would love to hear any thoughts and feedback on this
> > approach and any ways it could be improved.
> >
> > Some specific questions:
> >
> > - What is the appropriate method for defining an arch specific
> > implementation like this, is the #ifndef code sufficient, and did stuff
> > land in appropriate files?
> >
> > - Are there any obvious pitfalls or caveats that have not been
> > considered? In particular the iterator over mem_map_next() seemed like a
> > no-op on x86, but looked like it could be important in certain
> > configurations or architectures I am not familiar with.
> >
> > - Are there any x86_64 implementations that do not support SSE2
> > instructions like `movnti` ? What is the appropriate way to detect and
> > code around that if so?
>
> No.  SSE2 was introduced with the Pentium 4, before x86-64.  The XMM
> registers are used as part of the x86-64 calling conventions, so SSE2
> is mandatory for x86-64 implementations.

Awesome, good to know.

>
> > - Is there anything that could be improved about the assembly code? I
> > originally wrote it in C and don't have much experience hand writing x86
> > asm, which seems riddled with optimization pitfalls.
>
> I suspect it might be slightly faster if implemented as inline asm in the
> x86 clear_gigantic_page() implementation instead of a function call.
> Might not affect performance a lot though.

I can try to experiment with that tomorrow. Since the performance
doesn't vary much on an idle machine when you make one function call
for the whole GiB or 256K of them for each 4K page, I would suspect it
won't matter much.

>
> > - Is the highmem codepath really necessary? would 1GiB pages really be
> > of much use on a highmem system? We recently removed some other parts of
> > the code that support HIGHMEM for gigantic pages (see:
> > http://lkml.kernel.org/r/20180711195913.1294-1-mike.kravetz@oracle.com)
> > so this seems like a logical continuation.
>
> PAE paging doesn't support 1GB pages, so there's no need for it on x86.

Excellent. Do you happen to know if/when it is necessary on any other
architectures?

>
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 7206a634270b..2515cae4af4e 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -70,6 +70,7 @@
> >  #include <linux/dax.h>
> >  #include <linux/oom.h>
> >
> > +
> >  #include <asm/io.h>
> >  #include <asm/mmu_context.h>
> >  #include <asm/pgalloc.h>
>
> Spurious.
>
Thanks for catching that, removed.
Cannon Matthews July 25, 2018, 2:50 a.m. UTC | #5
On Tue, Jul 24, 2018 at 1:53 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 24 Jul 2018 13:46:39 -0700 Cannon Matthews <cannonmatthews@google.com> wrote:
>
> > Reimplement clear_gigantic_page() to clear gigabytes pages using the
> > non-temporal streaming store instructions that bypass the cache
> > (movnti), since an entire 1GiB region will not fit in the cache anyway.
> >
> > ...
> >
> > Tested:
> >       Time to `mlock()` a 512GiB region on broadwell CPU
> >                               AVG time (s)    % imp.  ms/page
> >       clear_page_erms         133.584         -       261
> >       clear_page_nt           34.154          74.43%  67
>
> A gigantic improvement!
>
> > --- a/arch/x86/include/asm/page_64.h
> > +++ b/arch/x86/include/asm/page_64.h
> > @@ -56,6 +56,9 @@ static inline void clear_page(void *page)
> >
> >  void copy_page(void *to, void *from);
> >
> > +#define __HAVE_ARCH_CLEAR_GIGANTIC_PAGE
> > +void __clear_page_nt(void *page, u64 page_size);
>
> Nit: the modern way is
>
> #ifndef __clear_page_nt
> void __clear_page_nt(void *page, u64 page_size);
> #define __clear_page_nt __clear_page_nt
> #endif
>
> Not sure why, really.  I guess it avoids adding two symbols and
> having to remember and maintain the relationship between them.
>

That makes sense, changed to this style. Thanks.

> > --- /dev/null
> > +++ b/arch/x86/lib/clear_gigantic_page.c
> > @@ -0,0 +1,30 @@
> > +#include <asm/page.h>
> > +
> > +#include <linux/kernel.h>
> > +#include <linux/mm.h>
> > +#include <linux/sched.h>
> > +
> > +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
> > +#define PAGES_BETWEEN_RESCHED 64
> > +void clear_gigantic_page(struct page *page,
> > +                             unsigned long addr,
> > +                             unsigned int pages_per_huge_page)
> > +{
> > +     int i;
> > +     void *dest = page_to_virt(page);
> > +     int resched_count = 0;
> > +
> > +     BUG_ON(pages_per_huge_page % PAGES_BETWEEN_RESCHED != 0);
> > +     BUG_ON(!dest);
> > +
> > +     might_sleep();
>
> cond_resched() already does might_sleep() - it doesn't seem needed here.

Ah gotcha, removed it. The original implementation called both, which
does seem redundant.

>
> > +     for (i = 0; i < pages_per_huge_page; i += PAGES_BETWEEN_RESCHED) {
> > +             __clear_page_nt(dest + (i * PAGE_SIZE),
> > +                             PAGES_BETWEEN_RESCHED * PAGE_SIZE);
> > +             resched_count += cond_resched();
> > +     }
> > +     /* __clear_page_nt requrires and `sfence` barrier. */
> > +     wmb();
> > +     pr_debug("clear_gigantic_page: rescheduled %d times\n", resched_count);
> > +}
> > +#endif
>
Elliott, Robert (Servers) July 25, 2018, 5:02 a.m. UTC | #6
> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org <linux-kernel-
> owner@vger.kernel.org> On Behalf Of Cannon Matthews
> Sent: Tuesday, July 24, 2018 9:37 PM
> Subject: Re: [PATCH v2] RFC: clear 1G pages with streaming stores on
> x86
> 
> Reimplement clear_gigantic_page() to clear gigabytes pages using the
> non-temporal streaming store instructions that bypass the cache
> (movnti), since an entire 1GiB region will not fit in the cache
> anyway.
>
> Doing an mlock() on a 512GiB 1G-hugetlb region previously would take
> on average 134 seconds, about 260ms/GiB which is quite slow. Using
> `movnti` and optimizing the control flow over the constituent small
> pages, this can be improved roughly by a factor of 3-4x, with the
> 512GiB mlock() taking only 34 seconds on average, or 67ms/GiB.

...
> - Are there any obvious pitfalls or caveats that have not been
> considered? 

Note that Kirill attempted something like this in 2012 - see
https://www.spinics.net/lists/linux-mm/msg40575.html

...
> +++ b/arch/x86/lib/clear_gigantic_page.c
> @@ -0,0 +1,29 @@
> +#include <asm/page.h>
> +
> +#include <linux/kernel.h>
> +#include <linux/mm.h>
> +#include <linux/sched.h>
> +
> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) ||
> defined(CONFIG_HUGETLBFS)
> +#define PAGES_BETWEEN_RESCHED 64
> +void clear_gigantic_page(struct page *page,
> +				unsigned long addr,

The previous attempt used cacheable stores in the page containing
addr to prevent an inevitable cache miss after the clearing completes.
This function is not using addr at all.

> +				unsigned int pages_per_huge_page)
> +{
> +	int i;
> +	void *dest = page_to_virt(page);
> +	int resched_count = 0;
> +
> +	BUG_ON(pages_per_huge_page % PAGES_BETWEEN_RESCHED != 0);
> +	BUG_ON(!dest);

Are those really possible conditions?  Is there a safer fallback
than crashing the whole kernel?

> +
> +	for (i = 0; i < pages_per_huge_page; i +=
> PAGES_BETWEEN_RESCHED) {
> +		__clear_page_nt(dest + (i * PAGE_SIZE),
> +				PAGES_BETWEEN_RESCHED * PAGE_SIZE);
> +		resched_count += cond_resched();
> +	}
> +	/* __clear_page_nt requrires and `sfence` barrier. */

requires an

...
> diff --git a/arch/x86/lib/clear_page_64.S
...
> +/*
> + * Zero memory using non temporal stores, bypassing the cache.
> + * Requires an `sfence` (wmb()) afterwards.
> + * %rdi - destination.
> + * %rsi - page size. Must be 64 bit aligned.
> +*/
> +ENTRY(__clear_page_nt)
> +	leaq	(%rdi,%rsi), %rdx
> +	xorl	%eax, %eax
> +	.p2align 4,,10
> +	.p2align 3
> +.L2:
> +	movnti	%rax, (%rdi)
> +	addq	$8, %rdi

Also consider using the AVX vmovntdq instruction (if available), the
most recent of which does 64-byte (cache line) sized transfers to
zmm registers. There's a hefty context switching overhead (e.g.,
304 clocks), but it might be worthwhile for 1 GiB (which
is 16,777,216 cache lines).

glibc memcpy() makes that choice for transfers > 75% of the L3 cache
size divided by the number of cores.  (last I tried, it was still
selecting "rep stosb" for large memset()s, although it has an
AVX-512 function available)

Even with that, one CPU core won't saturate the memory bus; multiple
CPU cores (preferably on the same NUMA node as the memory) need to
share the work.
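
For concreteness, using AVX in the kernel would also mean bracketing the
stores with kernel_fpu_begin()/kernel_fpu_end(). A rough, untested
sketch of what such a loop might look like (clear_page_nt_avx is a
hypothetical name; dst must be 32-byte aligned and size a multiple of
32):

#include <asm/fpu/api.h>

static void clear_page_nt_avx(void *dst, unsigned long size)
{
	unsigned long i;

	kernel_fpu_begin();	/* save user FPU/AVX state */
	asm volatile("vpxor %%ymm0, %%ymm0, %%ymm0" ::: "memory");
	for (i = 0; i < size; i += 32)
		asm volatile("vmovntdq %%ymm0, (%0)"
			     : : "r" (dst + i) : "memory");
	asm volatile("sfence" ::: "memory");
	kernel_fpu_end();
}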

---
Robert Elliott, HPE Persistent Memory
Michal Hocko July 25, 2018, 12:57 p.m. UTC | #7
[Cc Huang]
On Tue 24-07-18 19:37:28, Cannon Matthews wrote:
> Reimplement clear_gigantic_page() to clear gigabytes pages using the
> non-temporal streaming store instructions that bypass the cache
> (movnti), since an entire 1GiB region will not fit in the cache anyway.
> 
> Doing an mlock() on a 512GiB 1G-hugetlb region previously would take on
> average 134 seconds, about 260ms/GiB which is quite slow. Using `movnti`
> and optimizing the control flow over the constituent small pages, this
> can be improved roughly by a factor of 3-4x, with the 512GiB mlock()
> taking only 34 seconds on average, or 67ms/GiB.
> 
> The assembly code for the __clear_page_nt routine is more or less
> taken directly from the output of gcc with -O3 for this function with
> some tweaks to support arbitrary sizes and moving memory barriers:
> 
> void clear_page_nt_64i (void *page)
> {
>   for (int i = 0; i < GiB /sizeof(long long int); ++i)
>     {
>       _mm_stream_si64 (((long long int*)page) + i, 0);
>     }
>   sfence();
> }
> 
> In general I would love to hear any thoughts and feedback on this
> approach and any ways it could be improved.

Well, I like it. In fact 2MB pages are in a similar situation even
though they fit into the cache so the problem is not that pressing.
Anyway, if you run a standard DB workload which simply preallocates large
hugetlb shared files then it would help. Huang has gone a different
direction c79b57e462b5 ("mm: hugetlb: clear target sub-page last when
clearing huge page") and I was asking about using the mechanism you are
proposing back then http://lkml.kernel.org/r/20170821115235.GD25956@dhcp22.suse.cz
I've got an explanation http://lkml.kernel.org/r/87h8x0whfs.fsf@yhuang-dev.intel.com
which hasn't really satisfied me but I didn't really want to block the
obvious optimization. A similar approach has been proposed for GB
pages IIRC but I do not see it in linux-next so I am not sure what
happened with it.

Is there any reason to use a different scheme for GB and 2MB pages? Why
don't we settle with movnti for both? The first access will be a miss
but I am not really sure it matters all that much.

Keeping the rest of the email for reference

> Some specific questions:
> 
> - What is the appropriate method for defining an arch specific
> implementation like this, is the #ifndef code sufficient, and did stuff
> land in appropriate files?
> 
> - Are there any obvious pitfalls or caveats that have not been
> considered? In particular the iterator over mem_map_next() seemed like a
> no-op on x86, but looked like it could be important in certain
> configurations or architectures I am not familiar with.
> 
> - Is there anything that could be improved about the assembly code? I
> originally wrote it in C and don't have much experience hand writing x86
> asm, which seems riddled with optimization pitfalls.
> 
> - Is the highmem codepath really necessary? would 1GiB pages really be
> of much use on a highmem system? We recently removed some other parts of
> the code that support HIGHMEM for gigantic pages (see:
> http://lkml.kernel.org/r/20180711195913.1294-1-mike.kravetz@oracle.com)
> so this seems like a logical continuation.
> 
> - The calls to cond_resched() have been reduced from between every 4k
> page to every 64, as between all of the 256K page seemed overly
> frequent.  Does this seem like an appropriate frequency? On an idle
> system with many spare CPUs it get's rescheduled typically once or twice
> out of the 4096 times it calls cond_resched(), which seems like it is
> maybe the right amount, but more insight from a scheduling/latency point
> of view would be helpful. See the "Tested:" section below for some more data.
> 
> - Any other thoughts on the change overall and ways that this could
> be made more generally useful, and designed to be easily extensible to
> other platforms with non-temporal instructions and 1G pages, or any
> additional pitfalls I have not thought to consider.
> 
> Tested:
> 	Time to `mlock()` a 512GiB region on broadwell CPU
> 				AVG time (s)	% imp.	ms/page
> 	clear_page_erms		133.584		-	261
> 	clear_page_nt		34.154		74.43%	67
> 
> For a more in depth look at how the frequency we call cond_resched() affects
> the time this takes, I tested both on an idle system, and a system running
> `stress -c N` program to overcommit CPU to ~115%, and ran 10 replications of
> the 512GiB mlock test.
> 
> Unfortunately there wasn't as clear of a pattern as I had hoped. On an
> otherwise idle system there is no substantive difference different values of
> PAGES_BETWEEN_RESCHED.
> 
> On a stressed system, there appears to be a pattern, that resembles something
> of a bell curve: constantly offering to yield, or never yielding until the end
> produces the fastest results, but yielding infrequently increases latency to a
> slight degree.
> 
> That being said, it's not clear this is actually a significant difference, the
> std deviation is occasionally quite high, and perhaps a larger sample set would
> be more informative. From looking at the log messages indicating the number of
> times cond_resched() returned 1, there wasn't that much variance, with it
> usually being 1 or 2 when idle, and only increasing to ~4-7 when stressed.
> 
> 
> 	PAGES_BETWEEN_RESCHED	state	AVG	stddev
> 	1	4 KiB		idle	36.086	1.920
> 	16	64 KiB		idle	34.797	1.702
> 	32	128 KiB		idle	35.104	1.752
> 	64	256 KiB		idle	34.468	0.661
> 	512	2048 KiB	idle	36.427	0.946
> 	2048	8192 KiB	idle	34.988	2.406
> 	262144	1048576 KiB	idle	36.792	0.193
> 	infin	512 GiB		idle	38.817	0.238  [causes softlockup]
> 	1	4 KiB		stress 	55.562	0.661
> 	16	64 KiB		stress 	57.509	0.248
> 	32	128 KiB		stress 	69.265	3.913
> 	64	256 KiB		stress 	70.217	4.534
> 	512	2048 KiB	stress 	68.474	1.708
> 	2048	8192 KiB	stress 	70.806	1.068
> 	262144	1048576 KiB	stress 	55.217	1.184
> 	infin	512 GiB		stress 	55.062	0.291  [causes softlockup]
> 
> Signed-off-by: Cannon Matthews <cannonmatthews@google.com>
> ---
> 
> v2:
>  - Removed question about SSE2 Availability.
>  - Changed #ifndef symbol to match function
>  - removed spurious newlines
>  - Expanded Tested: field to include additional timings for different sizes
>    between cond_resched().
> 
>  arch/x86/include/asm/page_64.h     |  5 +++++
>  arch/x86/lib/Makefile              |  2 +-
>  arch/x86/lib/clear_gigantic_page.c | 29 +++++++++++++++++++++++++++++
>  arch/x86/lib/clear_page_64.S       | 20 ++++++++++++++++++++
>  include/linux/mm.h                 |  3 +++
>  mm/memory.c                        |  4 +++-
>  6 files changed, 61 insertions(+), 2 deletions(-)
>  create mode 100644 arch/x86/lib/clear_gigantic_page.c
> 
> diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
> index 939b1cff4a7b..177196d6abc7 100644
> --- a/arch/x86/include/asm/page_64.h
> +++ b/arch/x86/include/asm/page_64.h
> @@ -56,6 +56,11 @@ static inline void clear_page(void *page)
> 
>  void copy_page(void *to, void *from);
> 
> +#ifndef __clear_page_nt
> +void __clear_page_nt(void *page, u64 page_size);
> +#define __clear_page_nt __clear_page_nt
> +#endif  /* __clear_page_nt */
> +
>  #endif	/* !__ASSEMBLY__ */
> 
>  #ifdef CONFIG_X86_VSYSCALL_EMULATION
> diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
> index 25a972c61b0a..4ba395234088 100644
> --- a/arch/x86/lib/Makefile
> +++ b/arch/x86/lib/Makefile
> @@ -44,7 +44,7 @@ endif
>  else
>          obj-y += iomap_copy_64.o
>          lib-y += csum-partial_64.o csum-copy_64.o csum-wrappers_64.o
> -        lib-y += clear_page_64.o copy_page_64.o
> +        lib-y += clear_page_64.o copy_page_64.o clear_gigantic_page.o
>          lib-y += memmove_64.o memset_64.o
>          lib-y += copy_user_64.o
>  	lib-y += cmpxchg16b_emu.o
> diff --git a/arch/x86/lib/clear_gigantic_page.c b/arch/x86/lib/clear_gigantic_page.c
> new file mode 100644
> index 000000000000..0d51e38b5be0
> --- /dev/null
> +++ b/arch/x86/lib/clear_gigantic_page.c
> @@ -0,0 +1,29 @@
> +#include <asm/page.h>
> +
> +#include <linux/kernel.h>
> +#include <linux/mm.h>
> +#include <linux/sched.h>
> +
> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
> +#define PAGES_BETWEEN_RESCHED 64
> +void clear_gigantic_page(struct page *page,
> +				unsigned long addr,
> +				unsigned int pages_per_huge_page)
> +{
> +	int i;
> +	void *dest = page_to_virt(page);
> +	int resched_count = 0;
> +
> +	BUG_ON(pages_per_huge_page % PAGES_BETWEEN_RESCHED != 0);
> +	BUG_ON(!dest);
> +
> +	for (i = 0; i < pages_per_huge_page; i += PAGES_BETWEEN_RESCHED) {
> +		__clear_page_nt(dest + (i * PAGE_SIZE),
> +				PAGES_BETWEEN_RESCHED * PAGE_SIZE);
> +		resched_count += cond_resched();
> +	}
> +	/* __clear_page_nt requrires and `sfence` barrier. */
> +	wmb();
> +	pr_debug("clear_gigantic_page: rescheduled %d times\n", resched_count);
> +}
> +#endif
> diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
> index 88acd349911b..81a39804ac72 100644
> --- a/arch/x86/lib/clear_page_64.S
> +++ b/arch/x86/lib/clear_page_64.S
> @@ -49,3 +49,23 @@ ENTRY(clear_page_erms)
>  	ret
>  ENDPROC(clear_page_erms)
>  EXPORT_SYMBOL_GPL(clear_page_erms)
> +
> +/*
> + * Zero memory using non temporal stores, bypassing the cache.
> + * Requires an `sfence` (wmb()) afterwards.
> + * %rdi - destination.
> + * %rsi - page size. Must be 64 bit aligned.
> +*/
> +ENTRY(__clear_page_nt)
> +	leaq	(%rdi,%rsi), %rdx
> +	xorl	%eax, %eax
> +	.p2align 4,,10
> +	.p2align 3
> +.L2:
> +	movnti	%rax, (%rdi)
> +	addq	$8, %rdi
> +	cmpq	%rdx, %rdi
> +	jne	.L2
> +	ret
> +ENDPROC(__clear_page_nt)
> +EXPORT_SYMBOL(__clear_page_nt)
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a0fbb9ffe380..d10ac4e7ef6a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2729,6 +2729,9 @@ enum mf_action_page_type {
>  };
> 
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
> +extern void clear_gigantic_page(struct page *page,
> +			 unsigned long addr,
> +			 unsigned int pages_per_huge_page);
>  extern void clear_huge_page(struct page *page,
>  			    unsigned long addr_hint,
>  			    unsigned int pages_per_huge_page);
> diff --git a/mm/memory.c b/mm/memory.c
> index 7206a634270b..e43a3a446380 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4568,7 +4568,8 @@ EXPORT_SYMBOL(__might_fault);
>  #endif
> 
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
> -static void clear_gigantic_page(struct page *page,
> +#ifndef __clear_page_nt
> +void clear_gigantic_page(struct page *page,
>  				unsigned long addr,
>  				unsigned int pages_per_huge_page)
>  {
> @@ -4582,6 +4583,7 @@ static void clear_gigantic_page(struct page *page,
>  		clear_user_highpage(p, addr + i * PAGE_SIZE);
>  	}
>  }
> +#endif  /* __clear_page_nt */
>  void clear_huge_page(struct page *page,
>  		     unsigned long addr_hint, unsigned int pages_per_huge_page)
>  {
> --
> 2.18.0.233.g985f88cf7e-goog
Matthew Wilcox (Oracle) July 25, 2018, 2:38 p.m. UTC | #8
On Wed, Jul 25, 2018 at 05:02:46AM +0000, Elliott, Robert (Persistent Memory) wrote:
> Even with that, one CPU core won't saturate the memory bus; multiple
> CPU cores (preferably on the same NUMA node as the memory) need to
> share the work.

It's probably OK to not saturate the memory bus; it'd be nice if other
cores were allowed to get work done.  If your workload is single-threaded,
you're right, of course, but who has a single-threaded workload these
days?!
Cannon Matthews July 25, 2018, 5:30 p.m. UTC | #9
Thanks for the feedback!
On Tue, Jul 24, 2018 at 10:02 PM Elliott, Robert (Persistent Memory)
<elliott@hpe.com> wrote:
>
>
>
> > -----Original Message-----
> > From: linux-kernel-owner@vger.kernel.org <linux-kernel-
> > owner@vger.kernel.org> On Behalf Of Cannon Matthews
> > Sent: Tuesday, July 24, 2018 9:37 PM
> > Subject: Re: [PATCH v2] RFC: clear 1G pages with streaming stores on
> > x86
> >
> > Reimplement clear_gigantic_page() to clear gigabytes pages using the
> > non-temporal streaming store instructions that bypass the cache
> > (movnti), since an entire 1GiB region will not fit in the cache
> > anyway.
> >
> > Doing an mlock() on a 512GiB 1G-hugetlb region previously would take
> > on average 134 seconds, about 260ms/GiB which is quite slow. Using
> > `movnti` and optimizing the control flow over the constituent small
> > pages, this can be improved roughly by a factor of 3-4x, with the
> > 512GiB mlock() taking only 34 seconds on average, or 67ms/GiB.
>
> ...
> > - Are there any obvious pitfalls or caveats that have not been
> > considered?
>
> Note that Kirill attempted something like this in 2012 - see
> https://www.spinics.net/lists/linux-mm/msg40575.html
>

Oh, very interesting; I had not seen this before. So it seems like that was an
attempt to implement this more generally for THP and smaller page sizes,
but the performance numbers just weren't there to fully motivate it?

However, from the last follow up, it's suggested it might be a better fit
for hugetlbfs 1G pages:

> It would make a whole lot more sense for hugetlbfs giga pages than for
> THP (unlike for THP, cache trashing with giga pages is guaranteed),
> but even with giga pages, it's not like they're allocated frequently
> (maybe once per OS reboot) so that's also sure totally lost in the
> noise as it only saves a few accesses after the cache copy is
> finished.

I'm obviously inclined to agree with that, but moreover I think the
increase in DRAM sizes in the last 6 years makes this more attractive:
now that you can build systems with thousands of 1GiB pages, those time
savings add up more.

Of course 1G pages are still unlikely to be reallocated frequently
(though there are certainly use cases for more than once per reboot),
but taking a long time to clear them can add 10s of minutes to
application startup or cause quarter-second-long page faults later,
neither of which is particularly desirable.


> ...
> > +++ b/arch/x86/lib/clear_gigantic_page.c
> > @@ -0,0 +1,29 @@
> > +#include <asm/page.h>
> > +
> > +#include <linux/kernel.h>
> > +#include <linux/mm.h>
> > +#include <linux/sched.h>
> > +
> > +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) ||
> > defined(CONFIG_HUGETLBFS)
> > +#define PAGES_BETWEEN_RESCHED 64
> > +void clear_gigantic_page(struct page *page,
> > +                             unsigned long addr,
>
> The previous attempt used cacheable stores in the page containing
> addr to prevent an inevitable cache miss after the clearing completes.
> This function is not using addr at all.

Ah, that's a good idea. I admittedly do not have a good understanding of
what the arguments were supposed to represent, as there are no comments,
and tracing through the existing codepath via clear_user_highpage() it
seems like the addr/vaddr argument was passed around only to be
ultimately ignored by clear_user_page().

While that's generally true about the cache miss being inevitable, if
you are preallocating a lot of 1G pages to avoid long page faults later
via mlock() or MAP_POPULATE, will there still be an immediate cache
miss, or will those paths just build the page tables without touching
anything?

Regardless, I'm not sure how you would detect that here, and this seems
like an inexpensive optimization in any case.

>
> > +                             unsigned int pages_per_huge_page)
> > +{
> > +     int i;
> > +     void *dest = page_to_virt(page);
> > +     int resched_count = 0;
> > +
> > +     BUG_ON(pages_per_huge_page % PAGES_BETWEEN_RESCHED != 0);
> > +     BUG_ON(!dest);
>
> Are those really possible conditions?  Is there a safer fallback
> than crashing the whole kernel?

Perhaps not, I hope not anyhow; this was something of a first pass with
paranoid invariant checking, and initially I wrote this outside of the
x86-specific directory.

I suppose that would depend on:

Is page_to_virt() always available and guaranteed to return something valid?
Will `pages_per_huge_page` ever be anything other than 262144, and if so
anything besides 512 or 1?

It seems like on x86 these conditions will always be true, but I don't know
enough to say for 100% certain.
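
If a softer failure mode is desirable, one option (rough sketch only,
reusing the names from this patch) would be to warn once and fall back
to the generic per-4K clearing instead of BUG():

	if (WARN_ON_ONCE(!dest ||
			 pages_per_huge_page % PAGES_BETWEEN_RESCHED)) {
		/* fall back to clearing each 4K sub-page individually */
		for (i = 0; i < pages_per_huge_page; i++)
			clear_user_highpage(page + i, addr + i * PAGE_SIZE);
		return;
	}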

>
> > +
> > +     for (i = 0; i < pages_per_huge_page; i +=
> > PAGES_BETWEEN_RESCHED) {
> > +             __clear_page_nt(dest + (i * PAGE_SIZE),
> > +                             PAGES_BETWEEN_RESCHED * PAGE_SIZE);
> > +             resched_count += cond_resched();
> > +     }
> > +     /* __clear_page_nt requrires and `sfence` barrier. */
>
> requires an
>
Good catch thanks.
> ...
> > diff --git a/arch/x86/lib/clear_page_64.S
> ...
> > +/*
> > + * Zero memory using non temporal stores, bypassing the cache.
> > + * Requires an `sfence` (wmb()) afterwards.
> > + * %rdi - destination.
> > + * %rsi - page size. Must be 64 bit aligned.
> > +*/
> > +ENTRY(__clear_page_nt)
> > +     leaq    (%rdi,%rsi), %rdx
> > +     xorl    %eax, %eax
> > +     .p2align 4,,10
> > +     .p2align 3
> > +.L2:
> > +     movnti  %rax, (%rdi)
> > +     addq    $8, %rdi
>
> Also consider using the AVX vmovntdq instruction (if available), the
> most recent of which does 64-byte (cache line) sized transfers to
> zmm registers. There's a hefty context switching overhead (e.g.,
> 304 clocks), but it might be worthwhile for 1 GiB (which
> is 16,777,216 cache lines).
>
> glibc memcpy() makes that choice for transfers > 75% of the L3 cache
> size divided by the number of cores.  (last I tried, it was still
> selecting "rep stosb" for large memset()s, although it has an
> AVX-512 function available)
>
> Even with that, one CPU core won't saturate the memory bus; multiple
> CPU cores (preferably on the same NUMA node as the memory) need to
> share the work.
>

Before I started this I experimented with all of those variants, and
interestingly found that I could equally saturate the memory bandwidth
with 64-, 128-, or 256-bit wide instructions on a Broadwell CPU (I did
not have a Skylake/AVX-512 machine available to run the tests on; it
would be a curious thing to see if it holds for that as well).

From userspace I did a mmap(MAP_POPULATE), then measured the time
to zero a 100GiB region:

mmap(MAP_POPULATE):     27.740127291
memset [libc, AVX]:     19.318307069
rep stosb:              19.301119348
movntq:                 5.874515236
movnti:                 5.786089655
movtndq:                5.837171599
vmovntdq:               5.798766718

It was also interesting that the libc memset using AVX instructions
(confirmed with gdb, though maybe it's more dynamic/tricksy than I know)
was almost identical in performance to the `rep stosb` implementation.
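
For anyone wanting to reproduce that kind of comparison, a minimal
userspace harness along these lines is roughly what the numbers above
came from in spirit (a sketch, not the exact tool; swap the inner loop
for memset()/`rep stosb`/vmovntdq variants to compare):

#include <stdio.h>
#include <time.h>
#include <sys/mman.h>
#include <x86intrin.h>

#define SZ (100ULL << 30)	/* 100 GiB */

int main(void)
{
	void *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
	long long *q = p;
	struct timespec a, b;

	if (p == MAP_FAILED)
		return 1;
	clock_gettime(CLOCK_MONOTONIC, &a);
	for (unsigned long long i = 0; i < SZ / sizeof(*q); i++)
		_mm_stream_si64(q + i, 0);	/* movnti */
	_mm_sfence();
	clock_gettime(CLOCK_MONOTONIC, &b);
	printf("movnti: %.3f s\n",
	       (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9);
	return 0;
}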

I had some conversations with some platform engineers who thought this
made sense but is likely to be highly CPU dependent: some CPUs might be
able to do larger bursts of transfers in parallel and get better
performance from the wider instructions. This got way over my head into
hardware SDRAM controller design, though; more benchmarking would tell.

Another thing to consider about AVX instructions is that they affect
core frequency and power/thermals. I can't really speak to specifics,
but I understand that using 512/256-bit instructions and zmm registers
can use more power and limit the frequency of other cores, or something
along those lines. Anyone with expertise feel free to correct me on this
though; I assume this is also highly CPU dependent.

But anyway, since the wide instructions were no faster than `movnti` on a
single core it didn't seem worth the FPU context saving in the kernel.

Perhaps AVX-512 goes further however, it might be worth testing there too.
> ---
> Robert Elliott, HPE Persistent Memory
>
>
>
Cannon Matthews July 25, 2018, 5:55 p.m. UTC | #10
On Wed, Jul 25, 2018 at 5:57 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> [Cc Huang]
> On Tue 24-07-18 19:37:28, Cannon Matthews wrote:
> > Reimplement clear_gigantic_page() to clear gigabytes pages using the
> > non-temporal streaming store instructions that bypass the cache
> > (movnti), since an entire 1GiB region will not fit in the cache anyway.
> >
> > Doing an mlock() on a 512GiB 1G-hugetlb region previously would take on
> > average 134 seconds, about 260ms/GiB which is quite slow. Using `movnti`
> > and optimizing the control flow over the constituent small pages, this
> > can be improved roughly by a factor of 3-4x, with the 512GiB mlock()
> > taking only 34 seconds on average, or 67ms/GiB.
> >
> > The assembly code for the __clear_page_nt routine is more or less
> > taken directly from the output of gcc with -O3 for this function with
> > some tweaks to support arbitrary sizes and moving memory barriers:
> >
> > void clear_page_nt_64i (void *page)
> > {
> >   for (int i = 0; i < GiB /sizeof(long long int); ++i)
> >     {
> >       _mm_stream_si64 (((long long int*)page) + i, 0);
> >     }
> >   sfence();
> > }
> >
> > In general I would love to hear any thoughts and feedback on this
> > approach and any ways it could be improved.
>
> Well, I like it. In fact 2MB pages are in a similar situation even
> though they fit into the cache so the problem is not that pressing.
> Anyway if you are a standard DB wokrload which simply preallocates large
> hugetlb shared files then it would help. Huang has gone a different
> direction c79b57e462b5 ("mm: hugetlb: clear target sub-page last when
> clearing huge page") and I was asking about using the mechanism you are
> proposing back then http://lkml.kernel.org/r/20170821115235.GD25956@dhcp22.suse.cz
> I've got an explanation http://lkml.kernel.org/r/87h8x0whfs.fsf@yhuang-dev.intel.com
> which hasn't really satisfied me but I didn't really want to block the
> obvious optimization. The similar approach has been proposed for GB
> pages IIRC but I do not see it in linux-next so I am not sure what
> happened with it.
>
> Is there any reason to use a different scheme for GB an 2MB pages? Why
> don't we settle with movnti for both? The first access will be a miss
> but I am not really sure it matters all that much.
>
My only hesitation is that while the benefits of doing it faster seem
obvious at 1GiB granularity, things become more subtle at 2M, since
those pages are used much more frequently and any negative impact from
this approach could outweigh the gains.

Not that that is actually the case, but I am not familiar enough to be
confident proposing that, especially when it gets into the stuff in the
response you linked about synchronous RAM loads and such.

With the right benchmarking we could
certainly motivate it one way or the other, but I wouldn't know where
to begin to
do so in a robust enough way.

As for the first access being a miss, there is Robert Elliott's
suggestion above of doing a normal caching store on the sub-page that
contains the faulting address, as an optimization to avoid that.
Perhaps that would be enough.
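
Concretely, that might look something like the sketch below
(illustrative only; `fault_idx` is a hypothetical sub-page index derived
from the faulting address, and the cond_resched() batching is omitted
for brevity):

	/* NT-clear every 4K sub-page except the one the fault touched,
	 * and clear that one with cached stores so the first user
	 * access after the fault hits in cache. */
	for (i = 0; i < pages_per_huge_page; i++) {
		if (i == fault_idx)
			clear_page(dest + i * PAGE_SIZE);	/* cached */
		else
			__clear_page_nt(dest + i * PAGE_SIZE, PAGE_SIZE);
	}
	wmb();	/* sfence for the non-temporal stores */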

> Keeping the rest of the email for reference
>
> > Some specific questions:
> >
> > - What is the appropriate method for defining an arch specific
> > implementation like this, is the #ifndef code sufficient, and did stuff
> > land in appropriate files?
> >
> > - Are there any obvious pitfalls or caveats that have not been
> > considered? In particular the iterator over mem_map_next() seemed like a
> > no-op on x86, but looked like it could be important in certain
> > configurations or architectures I am not familiar with.
> >
> > - Is there anything that could be improved about the assembly code? I
> > originally wrote it in C and don't have much experience hand writing x86
> > asm, which seems riddled with optimization pitfalls.
> >
> > - Is the highmem codepath really necessary? would 1GiB pages really be
> > of much use on a highmem system? We recently removed some other parts of
> > the code that support HIGHMEM for gigantic pages (see:
> > http://lkml.kernel.org/r/20180711195913.1294-1-mike.kravetz@oracle.com)
> > so this seems like a logical continuation.
> >
> > - The calls to cond_resched() have been reduced from between every 4k
> > page to every 64, as between all of the 256K page seemed overly
> > frequent.  Does this seem like an appropriate frequency? On an idle
> > system with many spare CPUs it get's rescheduled typically once or twice
> > out of the 4096 times it calls cond_resched(), which seems like it is
> > maybe the right amount, but more insight from a scheduling/latency point
> > of view would be helpful. See the "Tested:" section below for some more data.
> >
> > - Any other thoughts on the change overall and ways that this could
> > be made more generally useful, and designed to be easily extensible to
> > other platforms with non-temporal instructions and 1G pages, or any
> > additional pitfalls I have not thought to consider.
> >
> > Tested:
> >       Time to `mlock()` a 512GiB region on broadwell CPU
> >                               AVG time (s)    % imp.  ms/page
> >       clear_page_erms         133.584         -       261
> >       clear_page_nt           34.154          74.43%  67
> >
> > For a more in depth look at how the frequency we call cond_resched() affects
> > the time this takes, I tested both on an idle system, and a system running
> > `stress -c N` program to overcommit CPU to ~115%, and ran 10 replications of
> > the 512GiB mlock test.
> >
> > Unfortunately there wasn't as clear of a pattern as I had hoped. On an
> > otherwise idle system there is no substantive difference different values of
> > PAGES_BETWEEN_RESCHED.
> >
> > On a stressed system, there appears to be a pattern, that resembles something
> > of a bell curve: constantly offering to yield, or never yielding until the end
> > produces the fastest results, but yielding infrequently increases latency to a
> > slight degree.
> >
> > That being said, it's not clear this is actually a significant difference, the
> > std deviation is occasionally quite high, and perhaps a larger sample set would
> > be more informative. From looking at the log messages indicating the number of
> > times cond_resched() returned 1, there wasn't that much variance, with it
> > usually being 1 or 2 when idle, and only increasing to ~4-7 when stressed.
> >
> >
> >       PAGES_BETWEEN_RESCHED   state   AVG     stddev
> >       1       4 KiB           idle    36.086  1.920
> >       16      64 KiB          idle    34.797  1.702
> >       32      128 KiB         idle    35.104  1.752
> >       64      256 KiB         idle    34.468  0.661
> >       512     2048 KiB        idle    36.427  0.946
> >       2048    8192 KiB        idle    34.988  2.406
> >       262144  1048576 KiB     idle    36.792  0.193
> >       infin   512 GiB         idle    38.817  0.238  [causes softlockup]
> >       1       4 KiB           stress  55.562  0.661
> >       16      64 KiB          stress  57.509  0.248
> >       32      128 KiB         stress  69.265  3.913
> >       64      256 KiB         stress  70.217  4.534
> >       512     2048 KiB        stress  68.474  1.708
> >       2048    8192 KiB        stress  70.806  1.068
> >       262144  1048576 KiB     stress  55.217  1.184
> >       infin   512 GiB         stress  55.062  0.291  [causes softlockup]
> >
> > Signed-off-by: Cannon Matthews <cannonmatthews@google.com>
> > ---
> >
> > v2:
> >  - Removed question about SSE2 Availability.
> >  - Changed #ifndef symbol to match function
> >  - removed spurious newlines
> >  - Expanded Tested: field to include additional timings for different sizes
> >    between cond_resched().
> >
> >  arch/x86/include/asm/page_64.h     |  5 +++++
> >  arch/x86/lib/Makefile              |  2 +-
> >  arch/x86/lib/clear_gigantic_page.c | 29 +++++++++++++++++++++++++++++
> >  arch/x86/lib/clear_page_64.S       | 20 ++++++++++++++++++++
> >  include/linux/mm.h                 |  3 +++
> >  mm/memory.c                        |  4 +++-
> >  6 files changed, 61 insertions(+), 2 deletions(-)
> >  create mode 100644 arch/x86/lib/clear_gigantic_page.c
> >
> > diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
> > index 939b1cff4a7b..177196d6abc7 100644
> > --- a/arch/x86/include/asm/page_64.h
> > +++ b/arch/x86/include/asm/page_64.h
> > @@ -56,6 +56,11 @@ static inline void clear_page(void *page)
> >
> >  void copy_page(void *to, void *from);
> >
> > +#ifndef __clear_page_nt
> > +void __clear_page_nt(void *page, u64 page_size);
> > +#define __clear_page_nt __clear_page_nt
> > +#endif  /* __clear_page_nt */
> > +
> >  #endif       /* !__ASSEMBLY__ */
> >
> >  #ifdef CONFIG_X86_VSYSCALL_EMULATION
> > diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
> > index 25a972c61b0a..4ba395234088 100644
> > --- a/arch/x86/lib/Makefile
> > +++ b/arch/x86/lib/Makefile
> > @@ -44,7 +44,7 @@ endif
> >  else
> >          obj-y += iomap_copy_64.o
> >          lib-y += csum-partial_64.o csum-copy_64.o csum-wrappers_64.o
> > -        lib-y += clear_page_64.o copy_page_64.o
> > +        lib-y += clear_page_64.o copy_page_64.o clear_gigantic_page.o
> >          lib-y += memmove_64.o memset_64.o
> >          lib-y += copy_user_64.o
> >       lib-y += cmpxchg16b_emu.o
> > diff --git a/arch/x86/lib/clear_gigantic_page.c b/arch/x86/lib/clear_gigantic_page.c
> > new file mode 100644
> > index 000000000000..0d51e38b5be0
> > --- /dev/null
> > +++ b/arch/x86/lib/clear_gigantic_page.c
> > @@ -0,0 +1,29 @@
> > +#include <asm/page.h>
> > +
> > +#include <linux/kernel.h>
> > +#include <linux/mm.h>
> > +#include <linux/sched.h>
> > +
> > +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
> > +#define PAGES_BETWEEN_RESCHED 64
> > +void clear_gigantic_page(struct page *page,
> > +                             unsigned long addr,
> > +                             unsigned int pages_per_huge_page)
> > +{
> > +     int i;
> > +     void *dest = page_to_virt(page);
> > +     int resched_count = 0;
> > +
> > +     BUG_ON(pages_per_huge_page % PAGES_BETWEEN_RESCHED != 0);
> > +     BUG_ON(!dest);
> > +
> > +     for (i = 0; i < pages_per_huge_page; i += PAGES_BETWEEN_RESCHED) {
> > +             __clear_page_nt(dest + (i * PAGE_SIZE),
> > +                             PAGES_BETWEEN_RESCHED * PAGE_SIZE);
> > +             resched_count += cond_resched();
> > +     }
> > +     /* __clear_page_nt requires an `sfence` barrier. */
> > +     wmb();
> > +     pr_debug("clear_gigantic_page: rescheduled %d times\n", resched_count);
> > +}
> > +#endif
> > diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
> > index 88acd349911b..81a39804ac72 100644
> > --- a/arch/x86/lib/clear_page_64.S
> > +++ b/arch/x86/lib/clear_page_64.S
> > @@ -49,3 +49,23 @@ ENTRY(clear_page_erms)
> >       ret
> >  ENDPROC(clear_page_erms)
> >  EXPORT_SYMBOL_GPL(clear_page_erms)
> > +
> > +/*
> > + * Zero memory using non temporal stores, bypassing the cache.
> > + * Requires an `sfence` (wmb()) afterwards.
> > + * %rdi - destination.
> > + * %rsi - page size. Must be 64 bit aligned.
> > +*/
> > +ENTRY(__clear_page_nt)
> > +     leaq    (%rdi,%rsi), %rdx
> > +     xorl    %eax, %eax
> > +     .p2align 4,,10
> > +     .p2align 3
> > +.L2:
> > +     movnti  %rax, (%rdi)
> > +     addq    $8, %rdi
> > +     cmpq    %rdx, %rdi
> > +     jne     .L2
> > +     ret
> > +ENDPROC(__clear_page_nt)
> > +EXPORT_SYMBOL(__clear_page_nt)
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index a0fbb9ffe380..d10ac4e7ef6a 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -2729,6 +2729,9 @@ enum mf_action_page_type {
> >  };
> >
> >  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
> > +extern void clear_gigantic_page(struct page *page,
> > +                      unsigned long addr,
> > +                      unsigned int pages_per_huge_page);
> >  extern void clear_huge_page(struct page *page,
> >                           unsigned long addr_hint,
> >                           unsigned int pages_per_huge_page);
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 7206a634270b..e43a3a446380 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4568,7 +4568,8 @@ EXPORT_SYMBOL(__might_fault);
> >  #endif
> >
> >  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
> > -static void clear_gigantic_page(struct page *page,
> > +#ifndef __clear_page_nt
> > +void clear_gigantic_page(struct page *page,
> >                               unsigned long addr,
> >                               unsigned int pages_per_huge_page)
> >  {
> > @@ -4582,6 +4583,7 @@ static void clear_gigantic_page(struct page *page,
> >               clear_user_highpage(p, addr + i * PAGE_SIZE);
> >       }
> >  }
> > +#endif  /* __clear_page_nt */
> >  void clear_huge_page(struct page *page,
> >                    unsigned long addr_hint, unsigned int pages_per_huge_page)
> >  {
> > --
> > 2.18.0.233.g985f88cf7e-goog
>
> --
> Michal Hocko
> SUSE Labs
Matthew Wilcox (Oracle) July 25, 2018, 6:23 p.m. UTC | #11
On Wed, Jul 25, 2018 at 10:30:40AM -0700, Cannon Matthews wrote:
> On Tue, Jul 24, 2018 at 10:02 PM Elliott, Robert (Persistent Memory)
> > > +     BUG_ON(pages_per_huge_page % PAGES_BETWEEN_RESCHED != 0);
> > > +     BUG_ON(!dest);
> >
> > Are those really possible conditions?  Is there a safer fallback
> > than crashing the whole kernel?
> 
> Perhaps not, I hope not anyhow; this was something of a first pass
> with paranoid invariant checking, and initially I wrote this outside
> of the x86 specific directory.
> 
> I suppose that would depend on:
> 
> Is page_to_virt() always available and guaranteed to return something valid?
> Will `pages_per_huge_page` ever be anything other than 262144, and if so
> anything besides 512 or 1?

page_to_virt() can only return NULL for HIGHMEM, which we already know
isn't going to be supported.  pages_per_huge_page might vary in the
future, but is always going to be a power of two.  You can turn that into
a build-time assert, or just leave it for the person who tries to change
gigantic pages from being anything other than 1GB.
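
For illustration, a build-time variant of those checks might look like the
sketch below. It assumes the checks sit at the top of clear_gigantic_page()
with PUD_SIZE/PAGE_SIZE visible in that file; BUILD_BUG_ON(), VM_BUG_ON() and
is_power_of_2() (<linux/log2.h>) are the standard kernel helpers, and nothing
here is taken from the posted patch:

	/*
	 * Sketch only: PUD_SIZE and PAGE_SIZE are compile-time constants
	 * on x86_64, so "a 1G page divides evenly into resched chunks"
	 * can be checked at build time instead of with a runtime BUG_ON().
	 */
	BUILD_BUG_ON((PUD_SIZE / PAGE_SIZE) % PAGES_BETWEEN_RESCHED != 0);
	/* Cheap (CONFIG_DEBUG_VM only) check of the power-of-two assumption. */
	VM_BUG_ON(!is_power_of_2(pages_per_huge_page));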

> It seems like on x86 these conditions will always be true, but I don't know
> enough to say for 100% certain.

They're true based on the current manuals.  If Intel want to change them,
it's fair that they should have to change this code too.

> Before I started this I experimented with all of those variants, and
> interestingly found that I could equally saturate the memory bandwidth with
> 64, 128, or 256-bit wide instructions on a broadwell CPU (I did not have a
> skylake/AVX-512 machine available to run the tests on; it would be a curious
> thing to see if it holds for that as well).
> 
> From userspace I did a mmap(MAP_POPULATE), then measured the time
> to zero a 100GiB region:
> 
> mmap(MAP_POPULATE):     27.740127291
> memset [libc, AVX]:     19.318307069
> rep stosb:              19.301119348
> movntq:                 5.874515236
> movnti:                 5.786089655
> movntdq:                5.837171599
> vmovntdq:               5.798766718
> 
> It was also interesting that the libc memset using AVX instructions
> (confirmed with gdb, though maybe it's more dynamic/tricksy than I know) was
> almost identical to the `rep stosb` implementation.
> 
> I had some conversations with some platforms engineers who thought this made
> sense, but that it is likely to be highly CPU dependent, and some CPUs might
> be able to do larger bursts of transfers in parallel and get better
> performance from the wider instructions, but this got way over my head into
> hardware SDRAM controller design. More benchmarking would tell however.
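
For anyone wanting to reproduce those userspace numbers, the movnti variant
boils down to roughly the sketch below. This is an illustration only: the
buffer is assumed to come from an already-populated mmap() and to be a
multiple of 8 bytes, and timing and error handling are omitted:

#include <emmintrin.h>	/* _mm_stream_si64, _mm_sfence (SSE2) */
#include <stddef.h>

/* Zero len bytes with non-temporal 64-bit stores; len is a multiple of 8. */
void zero_nt(void *buf, size_t len)
{
	long long *p = buf;
	size_t i;

	for (i = 0; i < len / sizeof(*p); i++)
		_mm_stream_si64(p + i, 0);
	_mm_sfence();	/* order the streaming stores before later accesses */
}

Comparing that against a plain memset() or an inline `rep stosb` over the same
mapping covers the other rows of the table above.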
> 
> Another thing to consider about AVX instructions is that they affect core
> frequency and power/thermals, though I can't really speak to specifics but I
> understand that using 512/256 bit instructions and zmm registers can use more
> power and limit the  frequency of other cores or something along those
> lines.
> Anyone with expertise feel free to correct me on this though. I assume this is
> also highly CPU dependent.

There's a difference between using AVX{256,512} load/store and arithmetic
instructions in terms of power draw; at least that's my recollection
from reading threads on realworldtech.  But I think it's not worth
going further than you have.  You've got a really nice speedup and it's
guaranteed to be faster on basically every microarch.  If somebody wants
to do something super-specialised for their microarch, they can submit
a patch on top of yours.
Cannon Matthews July 25, 2018, 6:48 p.m. UTC | #12
On Wed, Jul 25, 2018 at 11:23 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Jul 25, 2018 at 10:30:40AM -0700, Cannon Matthews wrote:
> > On Tue, Jul 24, 2018 at 10:02 PM Elliott, Robert (Persistent Memory)
> > > > +     BUG_ON(pages_per_huge_page % PAGES_BETWEEN_RESCHED != 0);
> > > > +     BUG_ON(!dest);
> > >
> > > Are those really possible conditions?  Is there a safer fallback
> > > than crashing the whole kernel?
> >
> > Perhaps not, I hope not anyhow; this was something of a first pass
> > with paranoid invariant checking, and initially I wrote this outside
> > of the x86 specific directory.
> >
> > I suppose that would depend on:
> >
> > Is page_to_virt() always available and guaranteed to return something valid?
> > Will `pages_per_huge_page` ever be anything other than 262144, and if so
> > anything besides 512 or 1?
>
> page_to_virt() can only return NULL for HIGHMEM, which we already know
> isn't going to be supported.  pages_per_huge_page might vary in the
> future, but is always going to be a power of two.  You can turn that into
> a build-time assert, or just leave it for the person who tries to change
> gigantic pages from being anything other than 1GB.
>
> > It seems like on x86 these conditions will always be true, but I don't know
> > enough to say for 100% certain.
>
> They're true based on the current manuals.  If Intel want to change them,
> it's fair that they should have to change this code too.

Thanks for the confirmations!

>
> > Before I started this I experimented with all of those variants, and
> > interestingly found that I could equally saturate the memory bandwidth with
> > 64, 128, or 256-bit wide instructions on a broadwell CPU (I did not have a
> > skylake/AVX-512 machine available to run the tests on; it would be a curious
> > thing to see if it holds for that as well).
> >
> > From userspace I did a mmap(MAP_POPULATE), then measured the time
> > to zero a 100GiB region:
> >
> > mmap(MAP_POPULATE):     27.740127291
> > memset [libc, AVX]:     19.318307069
> > rep stosb:              19.301119348
> > movntq:                 5.874515236
> > movnti:                 5.786089655
> > movntdq:                5.837171599
> > vmovntdq:               5.798766718
> >
> > It was also interesting that the libc memset using AVX instructions
> > (confirmed with gdb, though maybe it's more dynamic/tricksy than I know) was
> > almost identical to the `rep stosb` implementation.
> >
> > I had some conversations with some platforms engineers who thought this made
> > sense, but that it is likely to be highly CPU dependent, and some CPUs might
> > be able to do larger bursts of transfers in parallel and get better
> > performance from the wider instructions, but this got way over my head into
> > hardware SDRAM controller design. More benchmarking would tell however.
> >
> > Another thing to consider about AVX instructions is that they affect core
> > frequency and power/thermals, though I can't really speak to specifics but I
> > understand that using 512/256 bit instructions and zmm registers can use more
> > power and limit the  frequency of other cores or something along those
> > lines.
> > Anyone with expertise feel free to correct me on this though. I assume this is
> > also highly CPU dependent.
>
> There's a difference between using AVX{256,512} load/store and arithmetic
> instructions in terms of power draw; at least that's my recollection
> from reading threads on realworldtech.  But I think it's not worth
> going further than you have.  You've got a really nice speedup and it's
> guaranteed to be faster on basically every microarch.  If somebody wants
> to do something super-specialised for their microarch, they can submit
> a patch on top of yours.

Good point, that was a subtlety that escaped my recollection. In particular I've
also been told that using the zmm registers has power/thermal penalties as
well, though xmm/ymm is OK as long as you don't wake up the multipliers,
at least on specific microarches.

Nonetheless I agree, we can start with this general one and leave room for
more specialized alternatives should anyone ever have the interest to build on
top of this.
Michal Hocko July 26, 2018, 1:19 p.m. UTC | #13
On Wed 25-07-18 10:55:40, Cannon Matthews wrote:
> On Wed, Jul 25, 2018 at 5:57 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > [Cc Huang]
> > On Tue 24-07-18 19:37:28, Cannon Matthews wrote:
> > > Reimplement clear_gigantic_page() to clear gigabytes pages using the
> > > non-temporal streaming store instructions that bypass the cache
> > > (movnti), since an entire 1GiB region will not fit in the cache anyway.
> > >
> > > Doing an mlock() on a 512GiB 1G-hugetlb region previously would take on
> > > average 134 seconds, about 260ms/GiB which is quite slow. Using `movnti`
> > > and optimizing the control flow over the constituent small pages, this
> > > can be improved roughly by a factor of 3-4x, with the 512GiB mlock()
> > > taking only 34 seconds on average, or 67ms/GiB.
> > >
> > > The assembly code for the __clear_page_nt routine is more or less
> > > taken directly from the output of gcc with -O3 for this function with
> > > some tweaks to support arbitrary sizes and moving memory barriers:
> > >
> > > void clear_page_nt_64i (void *page)
> > > {
> > >   for (int i = 0; i < GiB /sizeof(long long int); ++i)
> > >     {
> > >       _mm_stream_si64 (((long long int*)page) + i, 0);
> > >     }
> > >   sfence();
> > > }
> > >
> > > In general I would love to hear any thoughts and feedback on this
> > > approach and any ways it could be improved.
> >
> > Well, I like it. In fact 2MB pages are in a similar situation even
> > though they fit into the cache so the problem is not that pressing.
> > Anyway if you are a standard DB workload which simply preallocates large
> > hugetlb shared files then it would help. Huang has gone a different
> > direction c79b57e462b5 ("mm: hugetlb: clear target sub-page last when
> > clearing huge page") and I was asking about using the mechanism you are
> > proposing back then http://lkml.kernel.org/r/20170821115235.GD25956@dhcp22.suse.cz
> > I've got an explanation http://lkml.kernel.org/r/87h8x0whfs.fsf@yhuang-dev.intel.com
> > which hasn't really satisfied me but I didn't really want to block the
> > obvious optimization. The similar approach has been proposed for GB
> > pages IIRC but I do not see it in linux-next so I am not sure what
> > happened with it.
> >
> > Is there any reason to use a different scheme for GB and 2MB pages? Why
> > don't we settle with movnti for both? The first access will be a miss
> > but I am not really sure it matters all that much.
> >
> My only hesitation is that while the benefits of doing it faster seem
> obvious at a 1GiB granularity, things become more subtle at 2M, and
> they are used much more frequently, where negative impacts from this
> approach could outweigh the benefits.

Well, one would expect that even 2M huge pages would be long lived. And
that is usually the case for hugetlb pages which are usually
preallocated and pre-faulted/initialized during the startup.

> Not that that is actually the case, but I am not familiar enough to be
> confident proposing that, especially when it gets into the stuff in
> that response you linked about synchronous RAM loads and such.
>
> With the right benchmarking we could certainly motivate it one way or
> the other, but I wouldn't know where to begin to do so in a robust
> enough way.
>
> For the first access being a miss, there is the suggestion that Robert
> Elliott had above of doing a normal caching store on the sub-page
> that contains the faulting address, as an optimization to avoid
> that. Perhaps that would be enough.

Well, currently we are initializing pages towards the faulting address
(from both ends). Extending that to non-temporal mov shouldn't be hard.
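
A rough sketch of that combination, non-temporal stores for the bulk of the
page and a cached clear for the sub-page containing the faulting address, is
below. The function name, the single left-to-right pass and the fixed resched
interval are illustrative assumptions, not what the posted patch or the
existing clear_huge_page() ordering does; it reuses __clear_page_nt() from
this patch and assumes the same includes as clear_gigantic_page.c:

/*
 * Sketch only: clear every sub-page with the non-temporal
 * __clear_page_nt(), except the sub-page containing the faulting
 * address, which is cleared last with a cached clear_page() so the
 * first user access after the fault hits the cache.
 */
static void clear_gigantic_page_keep_target(struct page *page,
					    unsigned long addr_hint,
					    unsigned int pages_per_huge_page)
{
	unsigned long addr = addr_hint &
		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
	unsigned int target = (addr_hint - addr) >> PAGE_SHIFT;
	void *base = page_to_virt(page);
	unsigned int i;

	for (i = 0; i < pages_per_huge_page; i++) {
		if (i == target)
			continue;	/* leave the faulting sub-page for last */
		__clear_page_nt(base + i * PAGE_SIZE, PAGE_SIZE);
		if (i % 64 == 0)
			cond_resched();
	}
	wmb();	/* the sfence required after __clear_page_nt() */
	clear_page(base + target * PAGE_SIZE);	/* cached, stays warm */
}

Whether the cached clear of the target sub-page pays for itself, particularly
at 2M, is the sort of thing the benchmarking discussed in this thread would
have to settle.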
Huang, Ying July 27, 2018, 12:05 a.m. UTC | #14
Hi, Cannon,

Cannon Matthews <cannonmatthews@google.com> writes:

> On Wed, Jul 25, 2018 at 5:57 AM Michal Hocko <mhocko@kernel.org> wrote:
>>
>> [Cc Huang]
>> On Tue 24-07-18 19:37:28, Cannon Matthews wrote:
>> > Reimplement clear_gigantic_page() to clear gigabytes pages using the
>> > non-temporal streaming store instructions that bypass the cache
>> > (movnti), since an entire 1GiB region will not fit in the cache anyway.
>> >
>> > Doing an mlock() on a 512GiB 1G-hugetlb region previously would take on
>> > average 134 seconds, about 260ms/GiB which is quite slow. Using `movnti`
>> > and optimizing the control flow over the constituent small pages, this
>> > can be improved roughly by a factor of 3-4x, with the 512GiB mlock()
>> > taking only 34 seconds on average, or 67ms/GiB.

I am impressed with the improvement number!  Thanks!

>> > The assembly code for the __clear_page_nt routine is more or less
>> > taken directly from the output of gcc with -O3 for this function with
>> > some tweaks to support arbitrary sizes and moving memory barriers:
>> >
>> > void clear_page_nt_64i (void *page)
>> > {
>> >   for (int i = 0; i < GiB /sizeof(long long int); ++i)
>> >     {
>> >       _mm_stream_si64 (((long long int*)page) + i, 0);
>> >     }
>> >   sfence();
>> > }
>> >
>> > In general I would love to hear any thoughts and feedback on this
>> > approach and any ways it could be improved.
>>
>> Well, I like it. In fact 2MB pages are in a similar situation even
>> though they fit into the cache so the problem is not that pressing.
>> Anyway if you are a standard DB workload which simply preallocates large
>> hugetlb shared files then it would help. Huang has gone a different
>> direction c79b57e462b5 ("mm: hugetlb: clear target sub-page last when
>> clearing huge page") and I was asking about using the mechanism you are
>> proposing back then http://lkml.kernel.org/r/20170821115235.GD25956@dhcp22.suse.cz
>> I've got an explanation http://lkml.kernel.org/r/87h8x0whfs.fsf@yhuang-dev.intel.com
>> which hasn't really satisfied me but I didn't really want to block the
>> obvious optimization. The similar approach has been proposed for GB
>> pages IIRC but I do not see it in linux-next so I am not sure what
>> happened with it.
>>
>> Is there any reason to use a different scheme for GB and 2MB pages? Why
>> don't we settle with movnti for both? The first access will be a miss
>> but I am not really sure it matters all that much.
>>
> My only hesitation is that while the benefits of doing it faster seem
> obvious at a 1GiB granularity, things become more subtle at 2M, and they
> are used much more frequently, where negative impacts from this approach
> could outweigh the benefits.
>
> Not that that is actually the case, but I am not familiar enough to be
> confident proposing that, especially when it gets into the stuff in that
> response you linked about synchronous RAM loads and such.
>
> With the right benchmarking we could certainly motivate it one way or the
> other, but I wouldn't know where to begin to do so in a robust enough way.
>
> For the first access being a miss, there is the suggestion that Robert
> Elliott had above of doing a normal caching store on the sub-page that
> contains the faulting address, as an optimization to avoid that. Perhaps
> that would be enough.

Yes.  We should consider caching too.  It shouldn't be an issue for 1G
huge pages because they cannot fit in cache.  But that is important for 2M
huge pages.  I think we should try Robert's idea: measure the difference
between the no-cache, cache-target-sub-page, and full-cache cases, with
per-thread cache sizes <2M and >2M.

>> Keeping the rest of the email for reference
>>
>> > Some specific questions:
>> >
>> > - What is the appropriate method for defining an arch specific
>> > implementation like this, is the #ifndef code sufficient, and did stuff
>> > land in appropriate files?
>> >
>> > - Are there any obvious pitfalls or caveats that have not been
>> > considered? In particular the iterator over mem_map_next() seemed like a
>> > no-op on x86, but looked like it could be important in certain
>> > configurations or architectures I am not familiar with.
>> >
>> > - Is there anything that could be improved about the assembly code? I
>> > originally wrote it in C and don't have much experience hand writing x86
>> > asm, which seems riddled with optimization pitfalls.
>> >
>> > - Is the highmem codepath really necessary? would 1GiB pages really be
>> > of much use on a highmem system? We recently removed some other parts of
>> > the code that support HIGHMEM for gigantic pages (see:
>> > http://lkml.kernel.org/r/20180711195913.1294-1-mike.kravetz@oracle.com)
>> > so this seems like a logical continuation.
>> >
>> > - The calls to cond_resched() have been reduced from between every 4k
>> > page to every 64, as between all of the 256K pages seemed overly
>> > frequent.  Does this seem like an appropriate frequency? On an idle
>> > system with many spare CPUs it gets rescheduled typically once or twice
>> > out of the 4096 times it calls cond_resched(), which seems like it is
>> > maybe the right amount, but more insight from a scheduling/latency point
>> > of view would be helpful. See the "Tested:" section below for some more data.
>> >
>> > - Any other thoughts on the change overall and ways that this could
>> > be made more generally useful, and designed to be easily extensible to
>> > other platforms with non-temporal instructions and 1G pages, or any
>> > additional pitfalls I have not thought to consider.
>> >
>> > Tested:
>> >       Time to `mlock()` a 512GiB region on broadwell CPU
>> >                               AVG time (s)    % imp.  ms/page
>> >       clear_page_erms         133.584         -       261
>> >       clear_page_nt           34.154          74.43%  67
>> >
>> > For a more in depth look at how the frequency we call cond_resched() affects
>> > the time this takes, I tested both on an idle system, and a system running
>> > `stress -c N` program to overcommit CPU to ~115%, and ran 10 replications of
>> > the 512GiB mlock test.
>> >
>> > Unfortunately there wasn't as clear of a pattern as I had hoped. On an
>> > otherwise idle system there is no substantive difference between different
>> > values of PAGES_BETWEEN_RESCHED.
>> >
>> > On a stressed system, there appears to be a pattern, that resembles something
>> > of a bell curve: constantly offering to yield, or never yielding until the end
>> > produces the fastest results, but yielding infrequently increases latency to a
>> > slight degree.
>> >
>> > That being said, it's not clear this is actually a significant difference, the
>> > std deviation is occasionally quite high, and perhaps a larger sample set would
>> > be more informative. From looking at the log messages indicating the number of
>> > times cond_resched() returned 1, there wasn't that much variance, with it
>> > usually being 1 or 2 when idle, and only increasing to ~4-7 when stressed.
>> >
>> >
>> >       PAGES_BETWEEN_RESCHED   state   AVG     stddev
>> >       1       4 KiB           idle    36.086  1.920
>> >       16      64 KiB          idle    34.797  1.702
>> >       32      128 KiB         idle    35.104  1.752
>> >       64      256 KiB         idle    34.468  0.661
>> >       512     2048 KiB        idle    36.427  0.946
>> >       2048    8192 KiB        idle    34.988  2.406
>> >       262144  1048576 KiB     idle    36.792  0.193
>> >       infin   512 GiB         idle    38.817  0.238  [causes softlockup]
>> >       1       4 KiB           stress  55.562  0.661
>> >       16      64 KiB          stress  57.509  0.248
>> >       32      128 KiB         stress  69.265  3.913
>> >       64      256 KiB         stress  70.217  4.534
>> >       512     2048 KiB        stress  68.474  1.708
>> >       2048    8192 KiB        stress  70.806  1.068
>> >       262144  1048576 KiB     stress  55.217  1.184
>> >       infin   512 GiB         stress  55.062  0.291  [causes softlockup]

I think it may be good to separate the two optimizations into 2 patches.
This makes it easier to evaluate the benefit of each optimization individually.

Best Regards,
Huang, Ying
Borislav Petkov July 30, 2018, 4:29 p.m. UTC | #15
On Tue, Jul 24, 2018 at 07:37:28PM -0700, Cannon Matthews wrote:
> diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
> index 88acd349911b..81a39804ac72 100644
> --- a/arch/x86/lib/clear_page_64.S
> +++ b/arch/x86/lib/clear_page_64.S
> @@ -49,3 +49,23 @@ ENTRY(clear_page_erms)
>  	ret
>  ENDPROC(clear_page_erms)
>  EXPORT_SYMBOL_GPL(clear_page_erms)
> +
> +/*
> + * Zero memory using non temporal stores, bypassing the cache.
> + * Requires an `sfence` (wmb()) afterwards.
> + * %rdi - destination.
> + * %rsi - page size. Must be 64 bit aligned.
> +*/
> +ENTRY(__clear_page_nt)
> +	leaq	(%rdi,%rsi), %rdx
> +	xorl	%eax, %eax
> +	.p2align 4,,10
> +	.p2align 3
> +.L2:
> +	movnti	%rax, (%rdi)
> +	addq	$8, %rdi
> +	cmpq	%rdx, %rdi
> +	jne	.L2
> +	ret
> +ENDPROC(__clear_page_nt)
> +EXPORT_SYMBOL(__clear_page_nt)

EXPORT_SYMBOL_GPL like the other functions in that file.
Cannon Matthews July 31, 2018, 12:28 a.m. UTC | #16
Thanks for all the feedback from everyone.

I am going to try to fix this up, and do some more robust benchmarking,
including the 2MB case, and try to have an updated/non-RFC patch(es) in
a few days.

Thanks!
Cannon
On Mon, Jul 30, 2018 at 9:29 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, Jul 24, 2018 at 07:37:28PM -0700, Cannon Matthews wrote:
> > diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
> > index 88acd349911b..81a39804ac72 100644
> > --- a/arch/x86/lib/clear_page_64.S
> > +++ b/arch/x86/lib/clear_page_64.S
> > @@ -49,3 +49,23 @@ ENTRY(clear_page_erms)
> >       ret
> >  ENDPROC(clear_page_erms)
> >  EXPORT_SYMBOL_GPL(clear_page_erms)
> > +
> > +/*
> > + * Zero memory using non temporal stores, bypassing the cache.
> > + * Requires an `sfence` (wmb()) afterwards.
> > + * %rdi - destination.
> > + * %rsi - page size. Must be 64 bit aligned.
> > +*/
> > +ENTRY(__clear_page_nt)
> > +     leaq    (%rdi,%rsi), %rdx
> > +     xorl    %eax, %eax
> > +     .p2align 4,,10
> > +     .p2align 3
> > +.L2:
> > +     movnti  %rax, (%rdi)
> > +     addq    $8, %rdi
> > +     cmpq    %rdx, %rdi
> > +     jne     .L2
> > +     ret
> > +ENDPROC(__clear_page_nt)
> > +EXPORT_SYMBOL(__clear_page_nt)
>
> EXPORT_SYMBOL_GPL like the other functions in that file.
>
> --
> Regards/Gruss,
>     Boris.
>
> ECO tip #101: Trim your mails when you reply.
> --
Matthew Wilcox (Oracle) July 31, 2018, 12:45 a.m. UTC | #17
On Mon, Jul 30, 2018 at 06:29:27PM +0200, Borislav Petkov wrote:
> > +EXPORT_SYMBOL(__clear_page_nt)
> 
> EXPORT_SYMBOL_GPL like the other functions in that file.

Actually, __clear_page_nt doesn't need to be exported at all for this
patch set; it's not invoked from a module.
diff mbox series

Patch

diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 939b1cff4a7b..6c1ae21b4d84 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -56,6 +56,9 @@  static inline void clear_page(void *page)

 void copy_page(void *to, void *from);

+#define __HAVE_ARCH_CLEAR_GIGANTIC_PAGE
+void __clear_page_nt(void *page, u64 page_size);
+
 #endif	/* !__ASSEMBLY__ */

 #ifdef CONFIG_X86_VSYSCALL_EMULATION
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index 25a972c61b0a..4ba395234088 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -44,7 +44,7 @@  endif
 else
         obj-y += iomap_copy_64.o
         lib-y += csum-partial_64.o csum-copy_64.o csum-wrappers_64.o
-        lib-y += clear_page_64.o copy_page_64.o
+        lib-y += clear_page_64.o copy_page_64.o clear_gigantic_page.o
         lib-y += memmove_64.o memset_64.o
         lib-y += copy_user_64.o
 	lib-y += cmpxchg16b_emu.o
diff --git a/arch/x86/lib/clear_gigantic_page.c b/arch/x86/lib/clear_gigantic_page.c
new file mode 100644
index 000000000000..80e70f31ddbd
--- /dev/null
+++ b/arch/x86/lib/clear_gigantic_page.c
@@ -0,0 +1,30 @@ 
+#include <asm/page.h>
+
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/sched.h>
+
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+#define PAGES_BETWEEN_RESCHED 64
+void clear_gigantic_page(struct page *page,
+				unsigned long addr,
+				unsigned int pages_per_huge_page)
+{
+	int i;
+	void *dest = page_to_virt(page);
+	int resched_count = 0;
+
+	BUG_ON(pages_per_huge_page % PAGES_BETWEEN_RESCHED != 0);
+	BUG_ON(!dest);
+
+	might_sleep();
+	for (i = 0; i < pages_per_huge_page; i += PAGES_BETWEEN_RESCHED) {
+		__clear_page_nt(dest + (i * PAGE_SIZE),
+				PAGES_BETWEEN_RESCHED * PAGE_SIZE);
+		resched_count += cond_resched();
+	}
+	/* __clear_page_nt requires an `sfence` barrier. */
+	wmb();
+	pr_debug("clear_gigantic_page: rescheduled %d times\n", resched_count);
+}
+#endif
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index 88acd349911b..81a39804ac72 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -49,3 +49,23 @@  ENTRY(clear_page_erms)
 	ret
 ENDPROC(clear_page_erms)
 EXPORT_SYMBOL_GPL(clear_page_erms)
+
+/*
+ * Zero memory using non temporal stores, bypassing the cache.
+ * Requires an `sfence` (wmb()) afterwards.
+ * %rdi - destination.
+ * %rsi - page size. Must be 64 bit aligned.
+*/
+ENTRY(__clear_page_nt)
+	leaq	(%rdi,%rsi), %rdx
+	xorl	%eax, %eax
+	.p2align 4,,10
+	.p2align 3
+.L2:
+	movnti	%rax, (%rdi)
+	addq	$8, %rdi
+	cmpq	%rdx, %rdi
+	jne	.L2
+	ret
+ENDPROC(__clear_page_nt)
+EXPORT_SYMBOL(__clear_page_nt)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a0fbb9ffe380..d10ac4e7ef6a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2729,6 +2729,9 @@  enum mf_action_page_type {
 };

 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+extern void clear_gigantic_page(struct page *page,
+			 unsigned long addr,
+			 unsigned int pages_per_huge_page);
 extern void clear_huge_page(struct page *page,
 			    unsigned long addr_hint,
 			    unsigned int pages_per_huge_page);
diff --git a/mm/memory.c b/mm/memory.c
index 7206a634270b..2515cae4af4e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -70,6 +70,7 @@ 
 #include <linux/dax.h>
 #include <linux/oom.h>

+
 #include <asm/io.h>
 #include <asm/mmu_context.h>
 #include <asm/pgalloc.h>
@@ -4568,7 +4569,8 @@  EXPORT_SYMBOL(__might_fault);
 #endif

 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
-static void clear_gigantic_page(struct page *page,
+#ifndef __HAVE_ARCH_CLEAR_GIGANTIC_PAGE
+void clear_gigantic_page(struct page *page,
 				unsigned long addr,
 				unsigned int pages_per_huge_page)
 {
@@ -4582,6 +4584,7 @@  static void clear_gigantic_page(struct page *page,
 		clear_user_highpage(p, addr + i * PAGE_SIZE);
 	}
 }
+#endif  /* __HAVE_ARCH_CLEAR_GIGANTIC_PAGE */
 void clear_huge_page(struct page *page,
 		     unsigned long addr_hint, unsigned int pages_per_huge_page)
 {