diff mbox

[v3,3/3] x86/mm: add TLB purge to free pmd/pte page interfaces

Message ID 20180516233207.1580-4-toshi.kani@hpe.com (mailing list archive)
State New, archived
Headers show

Commit Message

Kani, Toshi May 16, 2018, 11:32 p.m. UTC
ioremap() calls pud_free_pmd_page() / pmd_free_pte_page() when it creates
a pud / pmd map.  The following preconditions are met at their entry.
 - All pte entries for a target pud/pmd address range have been cleared.
 - System-wide TLB purges have been peformed for a target pud/pmd address
   range.

The preconditions assure that there is no stale TLB entry for the range.
Speculation may not cache TLB entries since it requires all levels of page
entries, including ptes, to have P & A-bits set for an associated address.
However, speculation may cache pud/pmd entries (paging-structure caches)
when they have P-bit set.

Add a system-wide TLB purge (INVLPG) to a single page after clearing
pud/pmd entry's P-bit.

SDM 4.10.4.1, Operation that Invalidate TLBs and Paging-Structure Caches,
states that:
  INVLPG invalidates all paging-structure caches associated with the
  current PCID regardless of the liner addresses to which they correspond.

Fixes: 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: <stable@vger.kernel.org>
---
 arch/x86/mm/pgtable.c |   34 ++++++++++++++++++++++++++++------
 1 file changed, 28 insertions(+), 6 deletions(-)

Comments

Joerg Roedel May 29, 2018, 2:44 p.m. UTC | #1
On Wed, May 16, 2018 at 05:32:07PM -0600, Toshi Kani wrote:
>  	pmd = (pmd_t *)pud_page_vaddr(*pud);
> +	pmd_sv = (pmd_t *)__get_free_page(GFP_KERNEL);
> +	if (!pmd_sv)
> +		return 0;

So your code still needs to allocate a full page where a simple
list_head on the stack would do the same job.

Ingo, Thomas, can you please just revert the original broken patch for
now until there is proper fix?

Thanks,

	Joerg
Kani, Toshi May 29, 2018, 4:10 p.m. UTC | #2
On Tue, 2018-05-29 at 16:44 +0200, Joerg Roedel wrote:
> On Wed, May 16, 2018 at 05:32:07PM -0600, Toshi Kani wrote:
> >  	pmd = (pmd_t *)pud_page_vaddr(*pud);
> > +	pmd_sv = (pmd_t *)__get_free_page(GFP_KERNEL);
> > +	if (!pmd_sv)
> > +		return 0;
> 
> So your code still needs to allocate a full page where a simple
> list_head on the stack would do the same job.

Can you explain why you think allocating a page here is a major problem?
 
As I explained before, pud_free_pmd_page() covers an extremely rare case
 which I could not even hit with a huge number of ioremap() calls until
I instrumented alloc_vmap_area() to force this case to happen.  I do not
think pages should be listed for such a rare case.

> Ingo, Thomas, can you please just revert the original broken patch for
> now until there is proper fix?

If we just revert, please apply patch 1/3 first.  This patch address the
BUG_ON issue on PAE.  This is a real issue that needs a fix ASAP.

The page-directory cache issue on x64, which is addressed by patch 3/3,
is a theoretical issue that I could not hit by putting ioremap() calls
into a loop for a whole day.  Nobody hit this issue, either.

The simple revert patch Joerg posted a while ago causes
pmd_free_pte_page() to fail on x64.  This causes multiple pmd mappings
to fall into pte mappings on my test systems.  This can be seen as a
degradation, and I am afraid that it is more harmful than good.

Thanks,
-Toshi
Joerg Roedel May 30, 2018, 4:59 a.m. UTC | #3
On Tue, May 29, 2018 at 04:10:24PM +0000, Kani, Toshi wrote:
> Can you explain why you think allocating a page here is a major problem?

Because a larger allocation is more likely to fail. And if you fail the
allocation, you also fail to free more pages, which _is_ a problem. So
better avoid any allocations in code paths that are about freeing
memory.

> If we just revert, please apply patch 1/3 first.  This patch address the
> BUG_ON issue on PAE.  This is a real issue that needs a fix ASAP.

It does not address the problem of dirty page-walk caches on x86-64.

> The page-directory cache issue on x64, which is addressed by patch 3/3,
> is a theoretical issue that I could not hit by putting ioremap() calls
> into a loop for a whole day.  Nobody hit this issue, either.

How do you know you didn't hit that issue? It might cause silent data
corruption, which might not be easily detected.

> The simple revert patch Joerg posted a while ago causes
> pmd_free_pte_page() to fail on x64.  This causes multiple pmd mappings
> to fall into pte mappings on my test systems.  This can be seen as a
> degradation, and I am afraid that it is more harmful than good.

The plain revert just removes all the issues with the dirty TLB that the
original patch introduced and prevents huge mappings from being
established when there have been smaller mappings before. This is not
ideal, but at least its is consistent and does not leak pages and leaves
no dirty TLBs. So this is the easiest and most reliable fix for this
stage in the release process.


Regards,

	Joerg
Kani, Toshi May 30, 2018, 3:39 p.m. UTC | #4
On Wed, 2018-05-30 at 06:59 +0200, joro@8bytes.org wrote:
> On Tue, May 29, 2018 at 04:10:24PM +0000, Kani, Toshi wrote:
> > Can you explain why you think allocating a page here is a major problem?
> 
> Because a larger allocation is more likely to fail. And if you fail the
> allocation, you also fail to free more pages, which _is_ a problem. So
> better avoid any allocations in code paths that are about freeing
> memory.

It only allocates a single page.  If it fails to allocate a single page,
this pud mapping request simply falls to pmd mappings, which is only as
bad as your suggested plain revert always does for both pud and pmd
mappings.  Also, this func is called from ioremap() when a driver
initializes its mapping.  If the system does not have a single page to
allocate, the driver has a much bigger issue to deal with than its
request falling back to pmd mappings.  Please also note that this func
is not called from free-memory path.

> > If we just revert, please apply patch 1/3 first.  This patch address the
> > BUG_ON issue on PAE.  This is a real issue that needs a fix ASAP.
> 
> It does not address the problem of dirty page-walk caches on x86-64.

This patch 3/3 fixes it.  Hope my explanation above clarified.

> > The page-directory cache issue on x64, which is addressed by patch 3/3,
> > is a theoretical issue that I could not hit by putting ioremap() calls
> > into a loop for a whole day.  Nobody hit this issue, either.
> 
> How do you know you didn't hit that issue? It might cause silent data
> corruption, which might not be easily detected.

If the test hit that issue, it would have caused a kernel page fault
(freed & cleared pte page still unmapped, or this page reused and
attribute data invalid) or MCE (pte page reused and phys addr invalid)
when it accessed to ioremap'd address.

Causing silent data corruption requires that this freed & cleared pte
page to be reused and re-initialized with a valid pte entry data.  While
this case is possible, the chance of my test only hitting this case
without ever hitting much more likely cases of page fault or MCE is
basically zero.

> > The simple revert patch Joerg posted a while ago causes
> > pmd_free_pte_page() to fail on x64.  This causes multiple pmd mappings
> > to fall into pte mappings on my test systems.  This can be seen as a
> > degradation, and I am afraid that it is more harmful than good.
> 
> The plain revert just removes all the issues with the dirty TLB that the
> original patch introduced and prevents huge mappings from being
> established when there have been smaller mappings before. This is not
> ideal, but at least its is consistent and does not leak pages and leaves
> no dirty TLBs. So this is the easiest and most reliable fix for this
> stage in the release process.

If you think the page directory issue needs a fix ASAP, I believe taking
patch 3/3 is much better option than the plain revert, which will
introduce the fall-back issue I explained.

Thanks,
-Toshi
diff mbox

Patch

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index f60fdf411103..7e96594c7e97 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -721,24 +721,42 @@  int pmd_clear_huge(pmd_t *pmd)
  * @pud: Pointer to a PUD.
  * @addr: Virtual address associated with pud.
  *
- * Context: The pud range has been unmaped and TLB purged.
+ * Context: The pud range has been unmapped and TLB purged.
  * Return: 1 if clearing the entry succeeded. 0 otherwise.
  */
 int pud_free_pmd_page(pud_t *pud, unsigned long addr)
 {
-	pmd_t *pmd;
+	pmd_t *pmd, *pmd_sv;
+	pte_t *pte;
 	int i;
 
 	if (pud_none(*pud))
 		return 1;
 
 	pmd = (pmd_t *)pud_page_vaddr(*pud);
+	pmd_sv = (pmd_t *)__get_free_page(GFP_KERNEL);
+	if (!pmd_sv)
+		return 0;
 
-	for (i = 0; i < PTRS_PER_PMD; i++)
-		if (!pmd_free_pte_page(&pmd[i], addr + (i * PMD_SIZE)))
-			return 0;
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		pmd_sv[i] = pmd[i];
+		if (!pmd_none(pmd[i]))
+			pmd_clear(&pmd[i]);
+	}
 
 	pud_clear(pud);
+
+	/* INVLPG to clear all paging-structure caches */
+	flush_tlb_kernel_range(addr, addr + PAGE_SIZE-1);
+
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		if (!pmd_none(pmd_sv[i])) {
+			pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]);
+			free_page((unsigned long)pte);
+		}
+	}
+
+	free_page((unsigned long)pmd_sv);
 	free_page((unsigned long)pmd);
 
 	return 1;
@@ -749,7 +767,7 @@  int pud_free_pmd_page(pud_t *pud, unsigned long addr)
  * @pmd: Pointer to a PMD.
  * @addr: Virtual address associated with pmd.
  *
- * Context: The pmd range has been unmaped and TLB purged.
+ * Context: The pmd range has been unmapped and TLB purged.
  * Return: 1 if clearing the entry succeeded. 0 otherwise.
  */
 int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
@@ -761,6 +779,10 @@  int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
 
 	pte = (pte_t *)pmd_page_vaddr(*pmd);
 	pmd_clear(pmd);
+
+	/* INVLPG to clear all paging-structure caches */
+	flush_tlb_kernel_range(addr, addr + PAGE_SIZE-1);
+
 	free_page((unsigned long)pte);
 
 	return 1;