diff mbox

[1/2] mm/vmalloc: Add interfaces to free unused page table

Message ID 20180307183227.17983-2-toshi.kani@hpe.com (mailing list archive)
State New, archived
Headers show

Commit Message

Kani, Toshi March 7, 2018, 6:32 p.m. UTC
On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap()
may create pud/pmd mappings.  Kernel panic was observed on arm64
systems with Cortex-A75 in the following steps as described by
Hanjun Guo.

1. ioremap a 4K size, valid page table will build,
2. iounmap it, pte0 will set to 0;
3. ioremap the same address with 2M size, pgd/pmd is unchanged,
   then set the a new value for pmd;
4. pte0 is leaked;
5. CPU may meet exception because the old pmd is still in TLB,
   which will lead to kernel panic.

This panic is not reproducible on x86.  INVLPG, called from iounmap,
purges all levels of entries associated with purged address on x86.
x86 still has memory leak.

Add two interfaces, pud_free_pmd_page() and pmd_free_pte_page(),
which clear a given pud/pmd entry and free up a page for the lower
level entries.

This patch implements their stub functions on x86 and arm64, which
work as workaround.

Reported-by: Lei Li <lious.lilei@hisilicon.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Wang Xuefeng <wxf.wang@hisilicon.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@suse.de>
---
 arch/arm64/mm/mmu.c           |   10 ++++++++++
 arch/x86/mm/pgtable.c         |   20 ++++++++++++++++++++
 include/asm-generic/pgtable.h |   10 ++++++++++
 lib/ioremap.c                 |    6 ++++--
 4 files changed, 44 insertions(+), 2 deletions(-)

Comments

Andrew Morton March 7, 2018, 10:54 p.m. UTC | #1
On Wed,  7 Mar 2018 11:32:26 -0700 Toshi Kani <toshi.kani@hpe.com> wrote:

> On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap()
> may create pud/pmd mappings.  Kernel panic was observed on arm64
> systems with Cortex-A75 in the following steps as described by
> Hanjun Guo.
> 
> 1. ioremap a 4K size, valid page table will build,
> 2. iounmap it, pte0 will set to 0;
> 3. ioremap the same address with 2M size, pgd/pmd is unchanged,
>    then set the a new value for pmd;
> 4. pte0 is leaked;
> 5. CPU may meet exception because the old pmd is still in TLB,
>    which will lead to kernel panic.
> 
> This panic is not reproducible on x86.  INVLPG, called from iounmap,
> purges all levels of entries associated with purged address on x86.
> x86 still has memory leak.
> 
> Add two interfaces, pud_free_pmd_page() and pmd_free_pte_page(),
> which clear a given pud/pmd entry and free up a page for the lower
> level entries.
> 
> This patch implements their stub functions on x86 and arm64, which
> work as workaround.
> 
> index 004abf9ebf12..942f4fa341f1 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -702,4 +702,24 @@ int pmd_clear_huge(pmd_t *pmd)
>  
>  	return 0;
>  }
> +
> +/**
> + * pud_free_pmd_page - clear pud entry and free pmd page
> + *
> + * Returns 1 on success and 0 on failure (pud not cleared).
> + */
> +int pud_free_pmd_page(pud_t *pud)
> +{
> +	return pud_none(*pud);
> +}
> +
> +/**
> + * pmd_free_pte_page - clear pmd entry and free pte page
> + *
> + * Returns 1 on success and 0 on failure (pmd not cleared).
> + */
> +int pmd_free_pte_page(pmd_t *pmd)
> +{
> +	return pmd_none(*pmd);
> +}

Are these functions well named?  I mean, the comment says "clear pud
entry and free pmd page" but the implementatin does neither of those
things.  The name implies that the function frees a pmd_page but the
callsites use the function as a way of querying state.
Andrew Morton March 7, 2018, 10:55 p.m. UTC | #2
On Wed,  7 Mar 2018 11:32:26 -0700 Toshi Kani <toshi.kani@hpe.com> wrote:

> On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap()
> may create pud/pmd mappings.  Kernel panic was observed on arm64
> systems with Cortex-A75 in the following steps as described by
> Hanjun Guo.
> 
> 1. ioremap a 4K size, valid page table will build,
> 2. iounmap it, pte0 will set to 0;
> 3. ioremap the same address with 2M size, pgd/pmd is unchanged,
>    then set the a new value for pmd;
> 4. pte0 is leaked;
> 5. CPU may meet exception because the old pmd is still in TLB,
>    which will lead to kernel panic.
> 
> This panic is not reproducible on x86.  INVLPG, called from iounmap,
> purges all levels of entries associated with purged address on x86.
> x86 still has memory leak.
> 
> Add two interfaces, pud_free_pmd_page() and pmd_free_pte_page(),
> which clear a given pud/pmd entry and free up a page for the lower
> level entries.
> 
> This patch implements their stub functions on x86 and arm64, which
> work as workaround.

Also, as this fixes an arm64 kernel panic, should we be using cc:stable?
Kani, Toshi March 7, 2018, 11:02 p.m. UTC | #3
On Wed, 2018-03-07 at 14:54 -0800, Andrew Morton wrote:
> On Wed,  7 Mar 2018 11:32:26 -0700 Toshi Kani <toshi.kani@hpe.com> wrote:
> 
> > On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap()
> > may create pud/pmd mappings.  Kernel panic was observed on arm64
> > systems with Cortex-A75 in the following steps as described by
> > Hanjun Guo.
> > 
> > 1. ioremap a 4K size, valid page table will build,
> > 2. iounmap it, pte0 will set to 0;
> > 3. ioremap the same address with 2M size, pgd/pmd is unchanged,
> >    then set the a new value for pmd;
> > 4. pte0 is leaked;
> > 5. CPU may meet exception because the old pmd is still in TLB,
> >    which will lead to kernel panic.
> > 
> > This panic is not reproducible on x86.  INVLPG, called from iounmap,
> > purges all levels of entries associated with purged address on x86.
> > x86 still has memory leak.
> > 
> > Add two interfaces, pud_free_pmd_page() and pmd_free_pte_page(),
> > which clear a given pud/pmd entry and free up a page for the lower
> > level entries.
> > 
> > This patch implements their stub functions on x86 and arm64, which
> > work as workaround.
> > 
> > index 004abf9ebf12..942f4fa341f1 100644
> > --- a/arch/x86/mm/pgtable.c
> > +++ b/arch/x86/mm/pgtable.c
> > @@ -702,4 +702,24 @@ int pmd_clear_huge(pmd_t *pmd)
> >  
> >  	return 0;
> >  }
> > +
> > +/**
> > + * pud_free_pmd_page - clear pud entry and free pmd page
> > + *
> > + * Returns 1 on success and 0 on failure (pud not cleared).
> > + */
> > +int pud_free_pmd_page(pud_t *pud)
> > +{
> > +	return pud_none(*pud);
> > +}
> > +
> > +/**
> > + * pmd_free_pte_page - clear pmd entry and free pte page
> > + *
> > + * Returns 1 on success and 0 on failure (pmd not cleared).
> > + */
> > +int pmd_free_pte_page(pmd_t *pmd)
> > +{
> > +	return pmd_none(*pmd);
> > +}
> 
> Are these functions well named?  I mean, the comment says "clear pud
> entry and free pmd page" but the implementatin does neither of those
> things.  The name implies that the function frees a pmd_page but the
> callsites use the function as a way of querying state.

This patch 1/2 only implements stubs.  Patch 2/2 implements what is
described here.

> Also, as this fixes an arm64 kernel panic, should we be using
> cc:stable?

Right.

Thanks,
-Toshi
Matthew Wilcox (Oracle) March 8, 2018, 4 a.m. UTC | #4
On Wed, Mar 07, 2018 at 11:32:26AM -0700, Toshi Kani wrote:
> +/**
> + * pud_free_pmd_page - clear pud entry and free pmd page
> + *
> + * Returns 1 on success and 0 on failure (pud not cleared).
> + */
> +int pud_free_pmd_page(pud_t *pud)
> +{
> +	return pud_none(*pud);
> +}

Wouldn't it be clearer if you returned 'bool' instead of 'int' here?

Also you didn't document the pud parameter, nor use the approved form
for documenting the return type, nor the calling context.  So I would
have written it out like this:

/**
 * pud_free_pmd_page - Clear pud entry and free pmd page.
 * @pud: Pointer to a PUD.
 *
 * Context: Caller should hold mmap_sem write-locked.
 * Return: %true if clearing the entry succeeded.
 */
Ingo Molnar March 8, 2018, 8:08 a.m. UTC | #5
* Toshi Kani <toshi.kani@hpe.com> wrote:

> On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap()
> may create pud/pmd mappings.  Kernel panic was observed on arm64
> systems with Cortex-A75 in the following steps as described by
> Hanjun Guo.
> 
> 1. ioremap a 4K size, valid page table will build,
> 2. iounmap it, pte0 will set to 0;
> 3. ioremap the same address with 2M size, pgd/pmd is unchanged,
>    then set the a new value for pmd;
> 4. pte0 is leaked;
> 5. CPU may meet exception because the old pmd is still in TLB,
>    which will lead to kernel panic.
> 
> This panic is not reproducible on x86.  INVLPG, called from iounmap,
> purges all levels of entries associated with purged address on x86.

Where does x86 iounmap() do that?

> x86 still has memory leak.
> Add two interfaces, pud_free_pmd_page() and pmd_free_pte_page(),
> which clear a given pud/pmd entry and free up a page for the lower
> level entries.
> 
> This patch implements their stub functions on x86 and arm64, which
> work as workaround.

At minimum the ordering of the patches is very confusing: why don't you introduce 
the new methods in patch #1, and then use them in patch #2?

Also please double check the coding style of your patches, there's a number of 
obvious problems of outright bad patterns and also cases where you clearly don't 
try to follow the (correct) style of existing code.

Thanks,

	Ingo
Kani, Toshi March 8, 2018, 3:56 p.m. UTC | #6
On Wed, 2018-03-07 at 20:00 -0800, Matthew Wilcox wrote:
> On Wed, Mar 07, 2018 at 11:32:26AM -0700, Toshi Kani wrote:
> > +/**
> > + * pud_free_pmd_page - clear pud entry and free pmd page
> > + *
> > + * Returns 1 on success and 0 on failure (pud not cleared).
> > + */
> > +int pud_free_pmd_page(pud_t *pud)
> > +{
> > +	return pud_none(*pud);
> > +}
> 
> Wouldn't it be clearer if you returned 'bool' instead of 'int' here?

I thought about it and decided to use 'int' since all other pud/pmd/pte
interfaces, such as pud_none() above, use 'int'.

> Also you didn't document the pud parameter, nor use the approved form
> for documenting the return type, nor the calling context.  So I would
> have written it out like this:
> 
> /**
>  * pud_free_pmd_page - Clear pud entry and free pmd page.
>  * @pud: Pointer to a PUD.
>  *
>  * Context: Caller should hold mmap_sem write-locked.
>  * Return: %true if clearing the entry succeeded.
>  */

Will do.

Thanks!
-Toshi
Will Deacon March 8, 2018, 6:04 p.m. UTC | #7
Hi Toshi,

Thanks for the patches!

On Wed, Mar 07, 2018 at 11:32:26AM -0700, Toshi Kani wrote:
> On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap()
> may create pud/pmd mappings.  Kernel panic was observed on arm64
> systems with Cortex-A75 in the following steps as described by
> Hanjun Guo.
> 
> 1. ioremap a 4K size, valid page table will build,
> 2. iounmap it, pte0 will set to 0;
> 3. ioremap the same address with 2M size, pgd/pmd is unchanged,
>    then set the a new value for pmd;
> 4. pte0 is leaked;
> 5. CPU may meet exception because the old pmd is still in TLB,
>    which will lead to kernel panic.
> 
> This panic is not reproducible on x86.  INVLPG, called from iounmap,
> purges all levels of entries associated with purged address on x86.
> x86 still has memory leak.
> 
> Add two interfaces, pud_free_pmd_page() and pmd_free_pte_page(),
> which clear a given pud/pmd entry and free up a page for the lower
> level entries.
> 
> This patch implements their stub functions on x86 and arm64, which
> work as workaround.
> 
> Reported-by: Lei Li <lious.lilei@hisilicon.com>
> Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Wang Xuefeng <wxf.wang@hisilicon.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Hanjun Guo <guohanjun@huawei.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Borislav Petkov <bp@suse.de>
> ---
>  arch/arm64/mm/mmu.c           |   10 ++++++++++
>  arch/x86/mm/pgtable.c         |   20 ++++++++++++++++++++
>  include/asm-generic/pgtable.h |   10 ++++++++++
>  lib/ioremap.c                 |    6 ++++--
>  4 files changed, 44 insertions(+), 2 deletions(-)

[...]

> diff --git a/lib/ioremap.c b/lib/ioremap.c
> index b808a390e4c3..54e5bbaa3200 100644
> --- a/lib/ioremap.c
> +++ b/lib/ioremap.c
> @@ -91,7 +91,8 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
>  
>  		if (ioremap_pmd_enabled() &&
>  		    ((next - addr) == PMD_SIZE) &&
> -		    IS_ALIGNED(phys_addr + addr, PMD_SIZE)) {
> +		    IS_ALIGNED(phys_addr + addr, PMD_SIZE) &&
> +		    pmd_free_pte_page(pmd)) {

I find it a bit weird that we're postponing this to the subsequent map. If
we want to address the break-before-make issue that was causing a panic on
arm64, then I think it would be better to do this on the unmap path to avoid
duplicating TLB invalidation.

Will
Kani, Toshi March 8, 2018, 7:30 p.m. UTC | #8
On Thu, 2018-03-08 at 18:04 +0000, Will Deacon wrote:
 :
> > diff --git a/lib/ioremap.c b/lib/ioremap.c
> > index b808a390e4c3..54e5bbaa3200 100644
> > --- a/lib/ioremap.c
> > +++ b/lib/ioremap.c
> > @@ -91,7 +91,8 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
> >  
> >  		if (ioremap_pmd_enabled() &&
> >  		    ((next - addr) == PMD_SIZE) &&
> > -		    IS_ALIGNED(phys_addr + addr, PMD_SIZE)) {
> > +		    IS_ALIGNED(phys_addr + addr, PMD_SIZE) &&
> > +		    pmd_free_pte_page(pmd)) {
> 
> I find it a bit weird that we're postponing this to the subsequent map. If
> we want to address the break-before-make issue that was causing a panic on
> arm64, then I think it would be better to do this on the unmap path to avoid
> duplicating TLB invalidation.

Hi Will,

Yes, I started looking into doing it the unmap path, but found the
following issues:

 - The iounmap() path is shared with vunmap().  Since vmap() only
supports pte mappings, making vunmap() to free pte pages is an overhead
for regular vmap users as they do not need pte pages freed up.
 - Checking to see if all entries in a pte page are cleared in the unmap
path is racy, and serializing this check is expensive.
 - The unmap path calls free_vmap_area_noflush() to do lazy TLB purges.
Clearing a pud/pmd entry before the lazy TLB purges needs extra TLB
purge.

Hence, I decided to postpone and do it in the ioremap path when a
pud/pmd mapping is set.  The "break" on arm64 happens when you update a
pmd entry without purging it.  So, the unmap path is not broken.  I
understand that arm64 may need extra TLB purge in pmd_free_pte_page(),
but it limits this overhead only when it sets up a pud/pmd mapping.

Thanks,
-Toshi
Matthew Wilcox (Oracle) March 8, 2018, 10:07 p.m. UTC | #9
On Thu, Mar 08, 2018 at 03:56:30PM +0000, Kani, Toshi wrote:
> On Wed, 2018-03-07 at 20:00 -0800, Matthew Wilcox wrote:
> > On Wed, Mar 07, 2018 at 11:32:26AM -0700, Toshi Kani wrote:
> > > +/**
> > > + * pud_free_pmd_page - clear pud entry and free pmd page
> > > + *
> > > + * Returns 1 on success and 0 on failure (pud not cleared).
> > > + */
> > > +int pud_free_pmd_page(pud_t *pud)
> > > +{
> > > +	return pud_none(*pud);
> > > +}
> > 
> > Wouldn't it be clearer if you returned 'bool' instead of 'int' here?
> 
> I thought about it and decided to use 'int' since all other pud/pmd/pte
> interfaces, such as pud_none() above, use 'int'.

These interfaces were introduced before we had bool ... I suspect nobody's
taken the time to go through and convert them all.
Kani, Toshi March 8, 2018, 11:27 p.m. UTC | #10
On Thu, 2018-03-08 at 14:07 -0800, Matthew Wilcox wrote:
> On Thu, Mar 08, 2018 at 03:56:30PM +0000, Kani, Toshi wrote:
> > On Wed, 2018-03-07 at 20:00 -0800, Matthew Wilcox wrote:
> > > On Wed, Mar 07, 2018 at 11:32:26AM -0700, Toshi Kani wrote:
> > > > +/**
> > > > + * pud_free_pmd_page - clear pud entry and free pmd page
> > > > + *
> > > > + * Returns 1 on success and 0 on failure (pud not cleared).
> > > > + */
> > > > +int pud_free_pmd_page(pud_t *pud)
> > > > +{
> > > > +	return pud_none(*pud);
> > > > +}
> > > 
> > > Wouldn't it be clearer if you returned 'bool' instead of 'int' here?
> > 
> > I thought about it and decided to use 'int' since all other pud/pmd/pte
> > interfaces, such as pud_none() above, use 'int'.
> 
> These interfaces were introduced before we had bool ... I suspect nobody's
> taken the time to go through and convert them all.

I see.  Since this patchset already changes core, arm and x86, and will
be back ported to stables.  So, if you do not mind, I'd like to leave it
consistent with others with 'int', and make the footstep minimum.

Thanks,
-Toshi
diff mbox

Patch

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 84a019f55022..84a37b4bc28e 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -972,3 +972,13 @@  int pmd_clear_huge(pmd_t *pmdp)
 	pmd_clear(pmdp);
 	return 1;
 }
+
+int pud_free_pmd_page(pud_t *pud)
+{
+	return pud_none(*pud);
+}
+
+int pmd_free_pte_page(pmd_t *pmd)
+{
+	return pmd_none(*pmd);
+}
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 004abf9ebf12..942f4fa341f1 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -702,4 +702,24 @@  int pmd_clear_huge(pmd_t *pmd)
 
 	return 0;
 }
+
+/**
+ * pud_free_pmd_page - clear pud entry and free pmd page
+ *
+ * Returns 1 on success and 0 on failure (pud not cleared).
+ */
+int pud_free_pmd_page(pud_t *pud)
+{
+	return pud_none(*pud);
+}
+
+/**
+ * pmd_free_pte_page - clear pmd entry and free pte page
+ *
+ * Returns 1 on success and 0 on failure (pmd not cleared).
+ */
+int pmd_free_pte_page(pmd_t *pmd)
+{
+	return pmd_none(*pmd);
+}
 #endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 2cfa3075d148..2490800f7c5a 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -983,6 +983,8 @@  int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot);
 int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot);
 int pud_clear_huge(pud_t *pud);
 int pmd_clear_huge(pmd_t *pmd);
+int pud_free_pmd_page(pud_t *pud);
+int pmd_free_pte_page(pmd_t *pmd);
 #else	/* !CONFIG_HAVE_ARCH_HUGE_VMAP */
 static inline int p4d_set_huge(p4d_t *p4d, phys_addr_t addr, pgprot_t prot)
 {
@@ -1008,6 +1010,14 @@  static inline int pmd_clear_huge(pmd_t *pmd)
 {
 	return 0;
 }
+static inline int pud_free_pmd_page(pud_t *pud)
+{
+	return 0;
+}
+static inline int pmd_free_pte_page(pud_t *pmd)
+{
+	return 0;
+}
 #endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
 
 #ifndef __HAVE_ARCH_FLUSH_PMD_TLB_RANGE
diff --git a/lib/ioremap.c b/lib/ioremap.c
index b808a390e4c3..54e5bbaa3200 100644
--- a/lib/ioremap.c
+++ b/lib/ioremap.c
@@ -91,7 +91,8 @@  static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
 
 		if (ioremap_pmd_enabled() &&
 		    ((next - addr) == PMD_SIZE) &&
-		    IS_ALIGNED(phys_addr + addr, PMD_SIZE)) {
+		    IS_ALIGNED(phys_addr + addr, PMD_SIZE) &&
+		    pmd_free_pte_page(pmd)) {
 			if (pmd_set_huge(pmd, phys_addr + addr, prot))
 				continue;
 		}
@@ -117,7 +118,8 @@  static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
 
 		if (ioremap_pud_enabled() &&
 		    ((next - addr) == PUD_SIZE) &&
-		    IS_ALIGNED(phys_addr + addr, PUD_SIZE)) {
+		    IS_ALIGNED(phys_addr + addr, PUD_SIZE) &&
+		    pud_free_pmd_page(pud)) {
 			if (pud_set_huge(pud, phys_addr + addr, prot))
 				continue;
 		}