diff mbox series

mm/vmalloc: Fix regression caused by needless vmalloc_sync_all()

Message ID 20191113095530.228959-1-shile.zhang@linux.alibaba.com (mailing list archive)
State New, archived
Headers show
Series mm/vmalloc: Fix regression caused by needless vmalloc_sync_all() | expand

Commit Message

Shile Zhang Nov. 13, 2019, 9:55 a.m. UTC
vmalloc_sync_all() was put in the common path in
__purge_vmap_area_lazy(), for one sync issue only happened on X86_32
with PTI enabled. It is needless for X86_64, which caused a big regression
in UnixBench Shell8 testing on X86_64.
Similar regression also reported by 0-day kernel test robot in reaim
benchmarking:
https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/

Fix it by adding more conditions.

Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")

Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
---
 mm/vmalloc.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Qian Cai Nov. 13, 2019, 3:17 p.m. UTC | #1
On Wed, 2019-11-13 at 17:55 +0800, Shile Zhang wrote:
> vmalloc_sync_all() was put in the common path in
> __purge_vmap_area_lazy(), for one sync issue only happened on X86_32
> with PTI enabled. It is needless for X86_64, which caused a big regression
> in UnixBench Shell8 testing on X86_64.
> Similar regression also reported by 0-day kernel test robot in reaim
> benchmarking:
> https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/
> 
> Fix it by adding more conditions.
> 
> Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
> 
> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
> ---
>  mm/vmalloc.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index a3c70e275f4e..7b9fc7966da6 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1255,11 +1255,17 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
>  	if (unlikely(valist == NULL))
>  		return false;
>  
> +#if defined(CONFIG_X86_32) && defined(CONFIG_X86_PAE)

Is it cleaner to use if (IS_ENABLED()...) ?

>  	/*
>  	 * First make sure the mappings are removed from all page-tables
>  	 * before they are freed.
> +	 *
> +	 * This is only needed on x86-32 with !SHARED_KERNEL_PMD, which is
> +	 * the case on a PAE kernel with PTI enabled.
>  	 */
> -	vmalloc_sync_all();
> +	if (!SHARED_KERNEL_PMD && boot_cpu_has(X86_FEATURE_PTI))
> +		vmalloc_sync_all();
> +#endif
>  
>  	/*
>  	 * TODO: to calculate a flush range without looping.
Andrew Morton Nov. 13, 2019, 9:12 p.m. UTC | #2
On Wed, 13 Nov 2019 17:55:30 +0800 Shile Zhang <shile.zhang@linux.alibaba.com> wrote:

> vmalloc_sync_all() was put in the common path in
> __purge_vmap_area_lazy(), for one sync issue only happened on X86_32
> with PTI enabled. It is needless for X86_64, which caused a big regression
> in UnixBench Shell8 testing on X86_64.
> Similar regression also reported by 0-day kernel test robot in reaim
> benchmarking:
> https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/

That is indeed a large performance regression.

> Fix it by adding more conditions.
> 
> Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
>
> ...
>
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1255,11 +1255,17 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
>  	if (unlikely(valist == NULL))
>  		return false;
>  
> +#if defined(CONFIG_X86_32) && defined(CONFIG_X86_PAE)

Are we sure that x86_32 is the only architecture whcih is (or ever will
be) affected?

>  	/*
>  	 * First make sure the mappings are removed from all page-tables
>  	 * before they are freed.
> +	 *
> +	 * This is only needed on x86-32 with !SHARED_KERNEL_PMD, which is
> +	 * the case on a PAE kernel with PTI enabled.
>  	 */
> -	vmalloc_sync_all();
> +	if (!SHARED_KERNEL_PMD && boot_cpu_has(X86_FEATURE_PTI))
> +		vmalloc_sync_all();
> +#endif
>  
>  	/*
>  	 * TODO: to calculate a flush range without looping.

CONFIG_X86_PAE depends on CONFIG_X86_32 so no need to check
CONFIG_X86_32?

From: Andrew Morton <akpm@linux-foundation.org>
Subject: mm-vmalloc-fix-regression-caused-by-needless-vmalloc_sync_all-fix

simplify config expression, use IS_ENABLED()

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Qian Cai <cai@lca.pw>
Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmalloc.c |   21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

--- a/mm/vmalloc.c~mm-vmalloc-fix-regression-caused-by-needless-vmalloc_sync_all-fix
+++ a/mm/vmalloc.c
@@ -1255,16 +1255,17 @@ static bool __purge_vmap_area_lazy(unsig
 	if (unlikely(valist == NULL))
 		return false;
 
-#if defined(CONFIG_X86_32) && defined(CONFIG_X86_PAE)
-	/*
-	 * First make sure the mappings are removed from all page-tables
-	 * before they are freed.
-	 *
-	 * This is only needed on x86-32 with !SHARED_KERNEL_PMD, which is
-	 * the case on a PAE kernel with PTI enabled.
-	 */
-	if (!SHARED_KERNEL_PMD && boot_cpu_has(X86_FEATURE_PTI))
-		vmalloc_sync_all();
+	if (IS_ENABLED(CONFIG_X86_PAE)) {
+		/*
+		 * First make sure the mappings are removed from all page-tables
+		 * before they are freed.
+		 *
+		 * This is only needed on x86-32 with !SHARED_KERNEL_PMD, which
+		 * is the case on a PAE kernel with PTI enabled.
+		 */
+		if (!SHARED_KERNEL_PMD && boot_cpu_has(X86_FEATURE_PTI))
+			vmalloc_sync_all();
+	}
 #endif
 
 	/*
Michal Hocko Nov. 14, 2019, 1:56 p.m. UTC | #3
On Wed 13-11-19 17:55:30, Shile Zhang wrote:
> vmalloc_sync_all() was put in the common path in
> __purge_vmap_area_lazy(), for one sync issue only happened on X86_32
> with PTI enabled. It is needless for X86_64, which caused a big regression
> in UnixBench Shell8 testing on X86_64.
> Similar regression also reported by 0-day kernel test robot in reaim
> benchmarking:
> https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/
> 
> Fix it by adding more conditions.

This doesn't really explain a lot. Could you explain why the syncing
should be limited only to PAE and !SHARED_KERNEL_PMD? 

> Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
> 
> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
> ---
>  mm/vmalloc.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index a3c70e275f4e..7b9fc7966da6 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1255,11 +1255,17 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
>  	if (unlikely(valist == NULL))
>  		return false;
>  
> +#if defined(CONFIG_X86_32) && defined(CONFIG_X86_PAE)
>  	/*
>  	 * First make sure the mappings are removed from all page-tables
>  	 * before they are freed.
> +	 *
> +	 * This is only needed on x86-32 with !SHARED_KERNEL_PMD, which is
> +	 * the case on a PAE kernel with PTI enabled.
>  	 */
> -	vmalloc_sync_all();
> +	if (!SHARED_KERNEL_PMD && boot_cpu_has(X86_FEATURE_PTI))
> +		vmalloc_sync_all();
> +#endif
>  
>  	/*
>  	 * TODO: to calculate a flush range without looping.
> -- 
> 2.24.0.rc2
>
Joerg Roedel Nov. 14, 2019, 5:12 p.m. UTC | #4
On Wed, Nov 13, 2019 at 05:55:30PM +0800, Shile Zhang wrote:
> +#if defined(CONFIG_X86_32) && defined(CONFIG_X86_PAE)
>  	/*
>  	 * First make sure the mappings are removed from all page-tables
>  	 * before they are freed.
> +	 *
> +	 * This is only needed on x86-32 with !SHARED_KERNEL_PMD, which is
> +	 * the case on a PAE kernel with PTI enabled.
>  	 */
> -	vmalloc_sync_all();
> +	if (!SHARED_KERNEL_PMD && boot_cpu_has(X86_FEATURE_PTI))
> +		vmalloc_sync_all();
> +#endif

I already submitted another fix for this quite some time ago:

	https://lore.kernel.org/lkml/20191009124418.8286-1-joro@8bytes.org/

Regards,

	Joerg
Andrew Morton Nov. 15, 2019, 12:01 a.m. UTC | #5
On Thu, 14 Nov 2019 18:12:31 +0100 Joerg Roedel <jroedel@suse.de> wrote:

> On Wed, Nov 13, 2019 at 05:55:30PM +0800, Shile Zhang wrote:
> > +#if defined(CONFIG_X86_32) && defined(CONFIG_X86_PAE)
> >  	/*
> >  	 * First make sure the mappings are removed from all page-tables
> >  	 * before they are freed.
> > +	 *
> > +	 * This is only needed on x86-32 with !SHARED_KERNEL_PMD, which is
> > +	 * the case on a PAE kernel with PTI enabled.
> >  	 */
> > -	vmalloc_sync_all();
> > +	if (!SHARED_KERNEL_PMD && boot_cpu_has(X86_FEATURE_PTI))
> > +		vmalloc_sync_all();
> > +#endif
> 
> I already submitted another fix for this quite some time ago:
> 
> 	https://lore.kernel.org/lkml/20191009124418.8286-1-joro@8bytes.org/
> 

Your patch is more verbose but looks quite a bit nicer than this patch
(Link:
http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com).
The separation of sync-for-a-mapping versus sync-for-an-unmapping adds
clarity.


It's fairly urgent - I consider this to be -stable material.

Thomas & co, was that a deliberate skip?
Borislav Petkov Nov. 15, 2019, 8:38 a.m. UTC | #6
On Thu, Nov 14, 2019 at 04:01:08PM -0800, Andrew Morton wrote:
> It's fairly urgent - I consider this to be -stable material.
> 
> Thomas & co, was that a deliberate skip?

More like lost in the mail flood. I'm assuming you're taking it?

In any case, patch looks ok to me - the "unmappings" naming sounds like
coming from someone with a very wild phantasy... :-)

Thx.
Andrew Morton Nov. 15, 2019, 11:37 p.m. UTC | #7
On Fri, 15 Nov 2019 09:38:47 +0100 Borislav Petkov <bp@alien8.de> wrote:

> On Thu, Nov 14, 2019 at 04:01:08PM -0800, Andrew Morton wrote:
> > It's fairly urgent - I consider this to be -stable material.
> > 
> > Thomas & co, was that a deliberate skip?
> 
> More like lost in the mail flood. I'm assuming you're taking it?
> 
> In any case, patch looks ok to me - the "unmappings" naming sounds like
> coming from someone with a very wild phantasy... :-)
> 

OK, thanks, here's what I queued.  Reviews, acks and testing, please??

From: Joerg Roedel <jroedel@suse.de>
Subject: x86/mm: Split vmalloc_sync_all()

Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in
__purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in the
vunmap() code-path.  While this change was necessary to maintain
correctness on x86-32-pae kernels, it also adds additional cycles for
architectures that don't need it.

Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported
severe performance regressions in micro-benchmarks because it now also
calls the x86-64 implementation of vmalloc_sync_all() on vunmap().  But
the vmalloc_sync_all() implementation on x86-64 is only needed for newly
created mappings.

To avoid the unnecessary work on x86-64 and to gain the performance back,
split up vmalloc_sync_all() into two functions:

	* vmalloc_sync_mappings(), and
	* vmalloc_sync_unmappings()

Most call-sites to vmalloc_sync_all() only care about new mappings being
synchronized.  The only exception is the new call-site added in the above
mentioned commit.

Shile Zhang directed us to a report of an 80% regression in reaim
throughput.

Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org
Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/
Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com
Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	[GHES]
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/x86/mm/fault.c      |   26 ++++++++++++++++++++++++--
 drivers/acpi/apei/ghes.c |    2 +-
 include/linux/vmalloc.h  |    5 +++--
 kernel/notifier.c        |    2 +-
 mm/nommu.c               |   10 +++++++---
 mm/vmalloc.c             |   11 +++++++----
 6 files changed, 43 insertions(+), 13 deletions(-)

--- a/arch/x86/mm/fault.c~x86-mm-split-vmalloc_sync_all
+++ a/arch/x86/mm/fault.c
@@ -189,7 +189,7 @@ static inline pmd_t *vmalloc_sync_one(pg
 	return pmd_k;
 }
 
-void vmalloc_sync_all(void)
+static void vmalloc_sync(void)
 {
 	unsigned long address;
 
@@ -216,6 +216,16 @@ void vmalloc_sync_all(void)
 	}
 }
 
+void vmalloc_sync_mappings(void)
+{
+	vmalloc_sync();
+}
+
+void vmalloc_sync_unmappings(void)
+{
+	vmalloc_sync();
+}
+
 /*
  * 32-bit:
  *
@@ -318,11 +328,23 @@ out:
 
 #else /* CONFIG_X86_64: */
 
-void vmalloc_sync_all(void)
+void vmalloc_sync_mappings(void)
 {
+	/*
+	 * 64-bit mappings might allocate new p4d/pud pages
+	 * that need to be propagated to all tasks' PGDs.
+	 */
 	sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END);
 }
 
+void vmalloc_sync_unmappings(void)
+{
+	/*
+	 * Unmappings never allocate or free p4d/pud pages.
+	 * No work is required here.
+	 */
+}
+
 /*
  * 64-bit:
  *
--- a/drivers/acpi/apei/ghes.c~x86-mm-split-vmalloc_sync_all
+++ a/drivers/acpi/apei/ghes.c
@@ -171,7 +171,7 @@ int ghes_estatus_pool_init(int num_ghes)
 	 * New allocation must be visible in all pgd before it can be found by
 	 * an NMI allocating from the pool.
 	 */
-	vmalloc_sync_all();
+	vmalloc_sync_mappings();
 
 	rc = gen_pool_add(ghes_estatus_pool, addr, PAGE_ALIGN(len), -1);
 	if (rc)
--- a/include/linux/vmalloc.h~x86-mm-split-vmalloc_sync_all
+++ a/include/linux/vmalloc.h
@@ -126,8 +126,9 @@ extern int remap_vmalloc_range_partial(s
 
 extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
 							unsigned long pgoff);
-void vmalloc_sync_all(void);
- 
+void vmalloc_sync_mappings(void);
+void vmalloc_sync_unmappings(void);
+
 /*
  *	Lowlevel-APIs (not for driver use!)
  */
--- a/kernel/notifier.c~x86-mm-split-vmalloc_sync_all
+++ a/kernel/notifier.c
@@ -554,7 +554,7 @@ NOKPROBE_SYMBOL(notify_die);
 
 int register_die_notifier(struct notifier_block *nb)
 {
-	vmalloc_sync_all();
+	vmalloc_sync_mappings();
 	return atomic_notifier_chain_register(&die_chain, nb);
 }
 EXPORT_SYMBOL_GPL(register_die_notifier);
--- a/mm/nommu.c~x86-mm-split-vmalloc_sync_all
+++ a/mm/nommu.c
@@ -359,10 +359,14 @@ void vm_unmap_aliases(void)
 EXPORT_SYMBOL_GPL(vm_unmap_aliases);
 
 /*
- * Implement a stub for vmalloc_sync_all() if the architecture chose not to
- * have one.
+ * Implement a stub for vmalloc_sync_[un]mapping() if the architecture
+ * chose not to have one.
  */
-void __weak vmalloc_sync_all(void)
+void __weak vmalloc_sync_mappings(void)
+{
+}
+
+void __weak vmalloc_sync_unmappings(void)
 {
 }
 
--- a/mm/vmalloc.c~x86-mm-split-vmalloc_sync_all
+++ a/mm/vmalloc.c
@@ -1259,7 +1259,7 @@ static bool __purge_vmap_area_lazy(unsig
 	 * First make sure the mappings are removed from all page-tables
 	 * before they are freed.
 	 */
-	vmalloc_sync_all();
+	vmalloc_sync_unmappings();
 
 	/*
 	 * TODO: to calculate a flush range without looping.
@@ -3050,16 +3050,19 @@ int remap_vmalloc_range(struct vm_area_s
 EXPORT_SYMBOL(remap_vmalloc_range);
 
 /*
- * Implement a stub for vmalloc_sync_all() if the architecture chose not to
- * have one.
+ * Implement stubs for vmalloc_sync_[un]mappings () if the architecture chose
+ * not to have one.
  *
  * The purpose of this function is to make sure the vmalloc area
  * mappings are identical in all page-tables in the system.
  */
-void __weak vmalloc_sync_all(void)
+void __weak vmalloc_sync_mappings(void)
 {
 }
 
+void __weak vmalloc_sync_unmappings(void)
+{
+}
 
 static int f(pte_t *pte, unsigned long addr, void *data)
 {
Borislav Petkov Nov. 16, 2019, 5:38 p.m. UTC | #8
On Fri, Nov 15, 2019 at 03:37:21PM -0800, Andrew Morton wrote:
> OK, thanks, here's what I queued.  Reviews, acks and testing, please??
> 
> From: Joerg Roedel <jroedel@suse.de>
> Subject: x86/mm: Split vmalloc_sync_all()
> 
> Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in
> __purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in the
> vunmap() code-path.  While this change was necessary to maintain
> correctness on x86-32-pae kernels, it also adds additional cycles for
> architectures that don't need it.

...

>  arch/x86/mm/fault.c      |   26 ++++++++++++++++++++++++--
>  drivers/acpi/apei/ghes.c |    2 +-
>  include/linux/vmalloc.h  |    5 +++--
>  kernel/notifier.c        |    2 +-
>  mm/nommu.c               |   10 +++++++---
>  mm/vmalloc.c             |   11 +++++++----
>  6 files changed, 43 insertions(+), 13 deletions(-)

Booting it looks ok here:

Tested-by: Borislav Petkov <bp@suse.de>
diff mbox series

Patch

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index a3c70e275f4e..7b9fc7966da6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1255,11 +1255,17 @@  static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
 	if (unlikely(valist == NULL))
 		return false;
 
+#if defined(CONFIG_X86_32) && defined(CONFIG_X86_PAE)
 	/*
 	 * First make sure the mappings are removed from all page-tables
 	 * before they are freed.
+	 *
+	 * This is only needed on x86-32 with !SHARED_KERNEL_PMD, which is
+	 * the case on a PAE kernel with PTI enabled.
 	 */
-	vmalloc_sync_all();
+	if (!SHARED_KERNEL_PMD && boot_cpu_has(X86_FEATURE_PTI))
+		vmalloc_sync_all();
+#endif
 
 	/*
 	 * TODO: to calculate a flush range without looping.