
[v2,2/2] x86/mm: implement free pmd/pte page interfaces

Message ID 20180314180155.19492-3-toshi.kani@hpe.com (mailing list archive)
State New, archived

Commit Message

Kani, Toshi March 14, 2018, 6:01 p.m. UTC
Implement pud_free_pmd_page() and pmd_free_pte_page() on x86, which
clear a given pud/pmd entry and free up lower level page table(s).
Address range associated with the pud/pmd entry must have been purged
by INVLPG.

fixes: e61ce6ade404e ("mm: change ioremap to set up huge I/O mappings")
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
---
 arch/x86/mm/pgtable.c |   28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

Comments

Chintan Pandya March 15, 2018, 7:39 a.m. UTC | #1
On 3/14/2018 11:31 PM, Toshi Kani wrote:
> Implement pud_free_pmd_page() and pmd_free_pte_page() on x86, which
> clear a given pud/pmd entry and free up lower level page table(s).
> Address range associated with the pud/pmd entry must have been purged
> by INVLPG.
> 
> fixes: e61ce6ade404e ("mm: change ioremap to set up huge I/O mappings")
> Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Borislav Petkov <bp@suse.de>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: <stable@vger.kernel.org>
> ---
>   arch/x86/mm/pgtable.c |   28 ++++++++++++++++++++++++++--
>   1 file changed, 26 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index 1eed7ed518e6..34cda7e0551b 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -712,7 +712,22 @@ int pmd_clear_huge(pmd_t *pmd)
>    */
>   int pud_free_pmd_page(pud_t *pud)
>   {
> -	return pud_none(*pud);
> +	pmd_t *pmd;
> +	int i;
> +
> +	if (pud_none(*pud))
> +		return 1;
> +
> +	pmd = (pmd_t *)pud_page_vaddr(*pud);
> +
> +	for (i = 0; i < PTRS_PER_PMD; i++)
> +		if (!pmd_free_pte_page(&pmd[i]))

This is a forced action, not an optional one. Also, pmd_free_pte_page()
doesn't return 0 in any case, so you may remove the _if_?

> +			return 0;
> +
> +	pud_clear(pud);
> +	free_page((unsigned long)pmd);
> +
> +	return 1;
>   }
>   
>   /**
> @@ -724,6 +739,15 @@ int pud_free_pmd_page(pud_t *pud)
>    */
>   int pmd_free_pte_page(pmd_t *pmd)
>   {
> -	return pmd_none(*pmd);
> +	pte_t *pte;
> +
> +	if (pmd_none(*pmd))

This should also check whether the pmd is already huge. Same for the pud?

> +		return 1;
> +
> +	pte = (pte_t *)pmd_page_vaddr(*pmd);
> +	pmd_clear(pmd);
> +	free_page((unsigned long)pte);
> +
> +	return 1;
>   }
>   #endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
> 
> 

Chintan
Kani, Toshi March 15, 2018, 2:51 p.m. UTC | #2
On Thu, 2018-03-15 at 13:09 +0530, Chintan Pandya wrote:
> 
> On 3/14/2018 11:31 PM, Toshi Kani wrote:
> > Implement pud_free_pmd_page() and pmd_free_pte_page() on x86, which
> > clear a given pud/pmd entry and free up lower level page table(s).
> > Address range associated with the pud/pmd entry must have been purged
> > by INVLPG.
> > 
> > fixes: e61ce6ade404e ("mm: change ioremap to set up huge I/O mappings")
> > Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: "H. Peter Anvin" <hpa@zytor.com>
> > Cc: Borislav Petkov <bp@suse.de>
> > Cc: Matthew Wilcox <willy@infradead.org>
> > Cc: <stable@vger.kernel.org>
> > ---
> >   arch/x86/mm/pgtable.c |   28 ++++++++++++++++++++++++++--
> >   1 file changed, 26 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> > index 1eed7ed518e6..34cda7e0551b 100644
> > --- a/arch/x86/mm/pgtable.c
> > +++ b/arch/x86/mm/pgtable.c
> > @@ -712,7 +712,22 @@ int pmd_clear_huge(pmd_t *pmd)
> >    */
> >   int pud_free_pmd_page(pud_t *pud)
> >   {
> > -	return pud_none(*pud);
> > +	pmd_t *pmd;
> > +	int i;
> > +
> > +	if (pud_none(*pud))
> > +		return 1;
> > +
> > +	pmd = (pmd_t *)pud_page_vaddr(*pud);
> > +
> > +	for (i = 0; i < PTRS_PER_PMD; i++)
> > +		if (!pmd_free_pte_page(&pmd[i]))
> 
> This is a forced action, not an optional one. Also, pmd_free_pte_page()
> doesn't return 0 in any case, so you may remove the _if_?

The code needs to be written per the interface definition, not per the
current implementation.

> > +			return 0;
> > +
> > +	pud_clear(pud);
> > +	free_page((unsigned long)pmd);
> > +
> > +	return 1;
> >   }
> >   
> >   /**
> > @@ -724,6 +739,15 @@ int pud_free_pmd_page(pud_t *pud)
> >    */
> >   int pmd_free_pte_page(pmd_t *pmd)
> >   {
> > -	return pmd_none(*pmd);
> > +	pte_t *pte;
> > +
> > +	if (pmd_none(*pmd))
> 
> This should also check whether the pmd is already huge. Same for the pud?

Not necessary. As described in the function header, one of the entry
conditions is that a given pmd range is unmapped.  See
vunmap_pmd_range().

Thanks,
-Toshi
Joerg Roedel April 26, 2018, 2:19 p.m. UTC | #3
Hi Toshi, Andrew,

this patch(-set) is broken in several ways, please see below.

On Wed, Mar 14, 2018 at 12:01:55PM -0600, Toshi Kani wrote:
> Implement pud_free_pmd_page() and pmd_free_pte_page() on x86, which
> clear a given pud/pmd entry and free up lower level page table(s).
> Address range associated with the pud/pmd entry must have been purged
> by INVLPG.

An INVLPG before actually unmapping the page is useless, as other cores
or even speculative instruction execution can bring the TLB entry back
before the code actually unmaps the page.

>  int pud_free_pmd_page(pud_t *pud)
>  {
> -	return pud_none(*pud);
> +	pmd_t *pmd;
> +	int i;
> +
> +	if (pud_none(*pud))
> +		return 1;
> +
> +	pmd = (pmd_t *)pud_page_vaddr(*pud);
> +
> +	for (i = 0; i < PTRS_PER_PMD; i++)
> +		if (!pmd_free_pte_page(&pmd[i]))
> +			return 0;
> +
> +	pud_clear(pud);

TLB flush needed here, before the page is freed.

> +	free_page((unsigned long)pmd);
> +
> +	return 1;
>  }
>  
>  /**
> @@ -724,6 +739,15 @@ int pud_free_pmd_page(pud_t *pud)
>   */
>  int pmd_free_pte_page(pmd_t *pmd)
>  {
> -	return pmd_none(*pmd);
> +	pte_t *pte;
> +
> +	if (pmd_none(*pmd))
> +		return 1;
> +
> +	pte = (pte_t *)pmd_page_vaddr(*pmd);
> +	pmd_clear(pmd);

Same here, TLB flush needed.

Further this needs synchronization with other page-tables in the system
when the kernel PMDs are not shared between processes. In x86-32 with
PAE this causes a BUG_ON() being triggered at arch/x86/mm/fault.c:268
because the page-tables are not correctly synchronized.

> +	free_page((unsigned long)pte);
> +
> +	return 1;
>  }
>  #endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
Kani, Toshi April 26, 2018, 4:21 p.m. UTC | #4
On Thu, 2018-04-26 at 16:19 +0200, Joerg Roedel wrote:
> Hi Toshi, Andrew,
> 
> this patch(-set) is broken in several ways, please see below.
> 
> On Wed, Mar 14, 2018 at 12:01:55PM -0600, Toshi Kani wrote:
> > Implement pud_free_pmd_page() and pmd_free_pte_page() on x86, which
> > clear a given pud/pmd entry and free up lower level page table(s).
> > Address range associated with the pud/pmd entry must have been purged
> > by INVLPG.
> 
> An INVLPG before actually unmapping the page is useless, as other cores
> or even speculative instruction execution can bring the TLB entry back
> before the code actually unmaps the page.

Hi Joerg,

All pages under the pmd had been unmapped, and the TLB was then lazily
purged with INVLPG, before coming to this code path.  Speculative accesses
are not allowed to pages without a mapping.

> >  int pud_free_pmd_page(pud_t *pud)
> >  {
> > -	return pud_none(*pud);
> > +	pmd_t *pmd;
> > +	int i;
> > +
> > +	if (pud_none(*pud))
> > +		return 1;
> > +
> > +	pmd = (pmd_t *)pud_page_vaddr(*pud);
> > +
> > +	for (i = 0; i < PTRS_PER_PMD; i++)
> > +		if (!pmd_free_pte_page(&pmd[i]))
> > +			return 0;
> > +
> > +	pud_clear(pud);
> 
> TLB flush needed here, before the page is freed.
> 
> > +	free_page((unsigned long)pmd);
> > +
> > +	return 1;
> >  }
> >  
> >  /**
> > @@ -724,6 +739,15 @@ int pud_free_pmd_page(pud_t *pud)
> >   */
> >  int pmd_free_pte_page(pmd_t *pmd)
> >  {
> > -	return pmd_none(*pmd);
> > +	pte_t *pte;
> > +
> > +	if (pmd_none(*pmd))
> > +		return 1;
> > +
> > +	pte = (pte_t *)pmd_page_vaddr(*pmd);
> > +	pmd_clear(pmd);
> 
> Same here, TLB flush needed.
> 
> Further this needs synchronization with other page-tables in the system
> when the kernel PMDs are not shared between processes. In x86-32 with
> PAE this causes a BUG_ON() being triggered at arch/x86/mm/fault.c:268
> because the page-tables are not correctly synchronized.

I think this is an issue with pmd mapping support on x86-32-PAE, not
with this patch.  I think the code needed to be updated to sync at the
pud level.

Thanks,
-Toshi
Joerg Roedel April 26, 2018, 5:23 p.m. UTC | #5
On Thu, Apr 26, 2018 at 04:21:19PM +0000, Kani, Toshi wrote:
> All pages under the pmd had been unmapped, and the TLB was then lazily
> purged with INVLPG, before coming to this code path.  Speculative accesses
> are not allowed to pages without a mapping.

CPUs have not only TLBs, but also page-walk caches, which cache
intermediate results of page-table walks and which are flushed together
with the TLB.

So the PMD entry you clear can still be in a page-walk cache and this
needs to be flushed too before you can free the PTE page. Otherwise
page-walks might still go to the page you just freed. That is especially
bad when the page is already reallocated and filled with other data.
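
To make the ordering concrete, here is a minimal sketch of what that means
for pmd_free_pte_page(), assuming the function grows an address argument so
it knows which kernel range to flush (hypothetical signature and flush
placement, not what the patch under review implements):

int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
{
        pte_t *pte;

        if (pmd_none(*pmd))
                return 1;

        pte = (pte_t *)pmd_page_vaddr(*pmd);
        pmd_clear(pmd);

        /*
         * Flush the TLB and page-walk caches on all CPUs *before* the
         * PTE page can be reallocated and refilled with other data.
         */
        flush_tlb_kernel_range(addr, addr + PMD_SIZE);

        free_page((unsigned long)pte);

        return 1;
}

Note that flush_tlb_kernel_range() sends IPIs to all CPUs, which is why
doing this on every call is expensive.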

> > Further this needs synchronization with other page-tables in the system
> > when the kernel PMDs are not shared between processes. In x86-32 with
> > PAE this causes a BUG_ON() being triggered at arch/x86/mm/fault.c:268
> > because the page-tables are not correctly synchronized.
> 
> I think this is an issue with pmd mapping support on x86-32-PAE, not
> with this patch.  I think the code needed to be updated to sync at the
> pud level.

It is an issue with this patch, because this patch is for x86 and on x86
every change to the kernel page-tables potentially needs to be
synchronized to the other page-tables. And this patch doesn't implement
it, which triggers a BUG_ON() under certain conditions.


Regards,

	Joerg
Kani, Toshi April 26, 2018, 5:49 p.m. UTC | #6
On Thu, 2018-04-26 at 19:23 +0200, joro@8bytes.org wrote:
> On Thu, Apr 26, 2018 at 04:21:19PM +0000, Kani, Toshi wrote:
> > All pages under the pmd had been unmapped, and the TLB was then lazily
> > purged with INVLPG, before coming to this code path.  Speculative accesses
> > are not allowed to pages without a mapping.
> 
> CPUs have not only TLBs, but also page-walk caches, which cache
> intermediate results of page-table walks and which are flushed together
> with the TLB.
> 
> So the PMD entry you clear can still be in a page-walk cache and this
> needs to be flushed too before you can free the PTE page. Otherwise
> page-walks might still go to the page you just freed. That is especially
> bad when the page is already reallocated and filled with other data.

I do not understand why we need to flush processor caches here. x86
processor caches are coherent with MESI.  So, clearing a PMD entry
modifies a cache entry on the processor associated with the address,
which in turn invalidates all stale cache entries on other processors.

> > > Further this needs synchronization with other page-tables in the system
> > > when the kernel PMDs are not shared between processes. In x86-32 with
> > > PAE this causes a BUG_ON() being triggered at arch/x86/mm/fault.c:268
> > > because the page-tables are not correctly synchronized.
> > 
> > I think this is an issue with pmd mapping support on x86-32-PAE, not
> > with this patch.  I think the code needed to be updated to sync at the
> > pud level.
> 
> It is an issue with this patch, because this patch is for x86 and on x86
> every change to the kernel page-tables potentially needs to be
> synchronized to the other page-tables. And this patch doesn't implement
> it, which triggers a BUG_ON() under certain conditions.

The issue was introduced when pmd mapping support was added on x86/32,
which predates this patch.

Thanks,
-Toshi
Joerg Roedel April 26, 2018, 8:07 p.m. UTC | #7
On Thu, Apr 26, 2018 at 05:49:58PM +0000, Kani, Toshi wrote:
> On Thu, 2018-04-26 at 19:23 +0200, joro@8bytes.org wrote:
> > So the PMD entry you clear can still be in a page-walk cache and this
> > needs to be flushed too before you can free the PTE page. Otherwise
> > page-walks might still go to the page you just freed. That is especially
> > bad when the page is already reallocated and filled with other data.
> 
> I do not understand why we need to flush processor caches here. x86
> processor caches are coherent with MESI.  So, clearing a PMD entry
> modifies a cache entry on the processor associated with the address,
> which in turn invalidates all stale cache entries on other processors.

A page-walk cache is not about the processor's data cache; it's a cache
similar to the TLB that speeds up page-walks by caching intermediate
results of previous page walks.


Thanks,

	Joerg
Kani, Toshi April 26, 2018, 10:30 p.m. UTC | #8
On Thu, 2018-04-26 at 22:07 +0200, joro@8bytes.org wrote:
> On Thu, Apr 26, 2018 at 05:49:58PM +0000, Kani, Toshi wrote:
> > On Thu, 2018-04-26 at 19:23 +0200, joro@8bytes.org wrote:
> > > So the PMD entry you clear can still be in a page-walk cache and this
> > > needs to be flushed too before you can free the PTE page. Otherwise
> > > page-walks might still go to the page you just freed. That is especially
> > > bad when the page is already reallocated and filled with other data.
> > 
> > I do not understand why we need to flush processor caches here. x86
> > processor caches are coherent with MESI.  So, clearing a PMD entry
> > modifies a cache entry on the processor associated with the address,
> > which in turn invalidates all stale cache entries on other processors.
> 
> A page-walk cache is not about the processor's data cache; it's a cache
> similar to the TLB that speeds up page-walks by caching intermediate
> results of previous page walks.

Thanks for the clarification. After reading through SDM one more time, I
agree that we need a TLB purge here. Here is my current understanding. 

 - INVLPG purges both TLB and paging-structure caches. So, PMD cache was
purged once.
 - However, processor may cache this PMD entry later in speculation
since it has p-bit set. (This is where my misunderstanding was.
Speculation is not allowed to access a target address, but it may still
cache this PMD entry.)
 - A single INVLPG on each processor purges this PMD cache. It does not
need a range purge (which was already done).

Does it sound right to you?

As for the BUG_ON issue, are you able to reproduce this issue?  If so,
would you be able to test the fix?

Regards,
-Toshi
Joerg Roedel April 27, 2018, 7:37 a.m. UTC | #9
On Thu, Apr 26, 2018 at 10:30:14PM +0000, Kani, Toshi wrote:
> Thanks for the clarification. After reading through SDM one more time, I
> agree that we need a TLB purge here. Here is my current understanding. 
> 
>  - INVLPG purges both TLB and paging-structure caches. So, PMD cache was
> purged once.
>  - However, processor may cache this PMD entry later in speculation
> since it has p-bit set. (This is where my misunderstanding was.
> Speculation is not allowed to access a target address, but it may still
> cache this PMD entry.)
>  - A single INVLPG on each processor purges this PMD cache. It does not
> need a range purge (which was already done).
> 
> Does it sound right to you?

The right fix is to first synchronize the changes when the PMD/PUD is
cleared and then flush the TLB system-wide. After that is done you can
free the page.

But doing all that in the pud/pmd_free_pmd/pte_page() functions is too
expensive, as the TLB flush requires sending IPIs to all cores in the
system, and that happens every time the function is called.

So what needs to be done is to fix this from high-level ioremap code to
first unmap all required PTE/PMD pages and collect them in a list. When
that is done you can synchronize the changes with the other page-tables
in the system and do one system-wide TLB flush. When that is complete
you can free the pages on the list that were collected while unmapping.

Then the new mappings can be established and again synchronized with the
other page-tables in the system.
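
A rough sketch of that flow, only to illustrate the ordering; the
collect/sync helpers below are hypothetical and do not exist in the kernel,
while flush_tlb_kernel_range() and ioremap_page_range() are real interfaces:

static int ioremap_replace_range(unsigned long addr, unsigned long end,
                                 phys_addr_t phys_addr, pgprot_t prot)
{
        LIST_HEAD(old_tables);
        int ret;

        /*
         * 1. Clear the old PTE/PMD entries, but collect the table pages
         *    on a list instead of freeing them (hypothetical helper).
         */
        ret = collect_unmapped_page_tables(addr, end, &old_tables);
        if (ret)
                return ret;

        /*
         * 2. Propagate the cleared entries to all other page-tables in
         *    the system, e.g. on x86-32 PAE (hypothetical helper).
         */
        sync_kernel_page_tables(addr, end);

        /* 3. One system-wide TLB flush for the whole range. */
        flush_tlb_kernel_range(addr, end);

        /* 4. Only now is it safe to free the collected table pages. */
        free_collected_page_tables(&old_tables);

        /* 5. Establish the new mappings and synchronize them as well. */
        ret = ioremap_page_range(addr, end, phys_addr, prot);
        sync_kernel_page_tables(addr, end);

        return ret;
}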

> As for the BUG_ON issue, are you able to reproduce this issue?  If so,
> would you be able to test the fix?

Yes, I can reproduce the BUG_ON with my PTI patches and a fedora-i386
VM.

I already ran into the issue before your patches were merged upstream,
but my "fix" is different because it just prevents huge-mappings when
there were smaller mappings before. See

	e3e288121408 x86/pgtable: Don't set huge PUD/PMD on non-leaf entries

for details. This patch does not fix the base-problem, but hides it
again, as the real fix needs some more work across architectures.

Your patch actually makes the problem worse, without it the PTE/PMD pages
were just leaked, so that they could not be reused. But with your patch
the pages can be used again and the page-walker might establish TLB
entries based on random content the new owner writes to it. This can
lead to all kinds of random and very hard to debug data corruption
issues.

So until we make the generic ioremap code in lib/ioremap.c smarter about
unmapping/remapping ranges the best solution is making my fix work again
by reverting your patch.


Thanks,

	Joerg
Michal Hocko April 27, 2018, 11:39 a.m. UTC | #10
On Fri 27-04-18 09:37:20, joro@8bytes.org wrote:
[...]
> So until we make the generic ioremap code in lib/ioremap.c smarter about
> unmapping/remapping ranges the best solution is making my fix work again
> by reverting your patch.

Could you reuse the same mmu_gather mechanism as we use in the
zap_*_range?
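
For reference, the zap_*_range path batches this roughly as follows
(simplified sketch; whether the same machinery can be reused for the
init_mm/ioremap case is exactly the open question):

static void teardown_range_batched(unsigned long start, unsigned long end)
{
        struct mmu_gather tlb;

        tlb_gather_mmu(&tlb, &init_mm, start, end);

        /*
         * The page-table walk would go here: clear the entries and hand
         * each old PTE page to the gather with pte_free_tlb(), which
         * queues the page instead of freeing it immediately.
         */

        /* Flush the TLB once for the range, then free the queued pages. */
        tlb_finish_mmu(&tlb, start, end);
}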
Joerg Roedel April 27, 2018, 11:46 a.m. UTC | #11
On Fri, Apr 27, 2018 at 01:39:23PM +0200, Michal Hocko wrote:
> On Fri 27-04-18 09:37:20, joro@8bytes.org wrote:
> [...]
> > So until we make the generic ioremap code in lib/ioremap.c smarter about
> > unmapping/remapping ranges the best solution is making my fix work again
> > by reverting your patch.
> 
> Could you reuse the same mmu_gather mechanism as we use in the
> zap_*_range?

Yeah, maybe, I haven't looked into the details yet. At least something
similar is needed.


	Joerg
Chintan Pandya April 27, 2018, 11:52 a.m. UTC | #12
On 4/27/2018 1:07 PM, joro@8bytes.org wrote:
> On Thu, Apr 26, 2018 at 10:30:14PM +0000, Kani, Toshi wrote:
>> Thanks for the clarification. After reading through SDM one more time, I
>> agree that we need a TLB purge here. Here is my current understanding.
>>
>>   - INVLPG purges both TLB and paging-structure caches. So, PMD cache was
>> purged once.
>>   - However, processor may cache this PMD entry later in speculation
>> since it has p-bit set. (This is where my misunderstanding was.
>> Speculation is not allowed to access a target address, but it may still
>> cache this PMD entry.)
>>   - A single INVLPG on each processor purges this PMD cache. It does not
>> need a range purge (which was already done).
>>
>> Does it sound right to you?
> 
> The right fix is to first synchronize the changes when the PMD/PUD is
> cleared and then flush the TLB system-wide. After that is done you can
> free the page.
> 

I'm a bit confused here. Are you pointing to a race within the ioremap/vmalloc
framework while updating the page table, or a race during TLB ops? Since
the latter is arch dependent, I won't comment on it. But if the race being
discussed here is about altering page tables, I'm not on the same page.

The current ioremap/vmalloc framework works with a reserved virtual area
for its own purpose. Within this virtual area, we maintain mutual exclusion
by keeping a separate rbtree, which is of course synchronized. In the
__vunmap leg, we perform the page table ops first and then release the
virtual area for someone else to re-use. This way, we are fine without
taking any additional locks for page table modifications.

Let me know if that's not the case and I'm missing something here.

Also, I'm curious to know what race you are observing at your end.


Chintan
Joerg Roedel April 27, 2018, 12:48 p.m. UTC | #13
On Fri, Apr 27, 2018 at 05:22:28PM +0530, Chintan Pandya wrote:
> I'm a bit confused here. Are you pointing to a race within the ioremap/vmalloc
> framework while updating the page table, or a race during TLB ops? Since
> the latter is arch dependent, I won't comment on it. But if the race being
> discussed here is about altering page tables, I'm not on the same page.

The race condition is between hardware and software. It is not
sufficient to just remove the software references to the page that is
about to be freed (by clearing the PMD/PUD), also the hardware
references in the page-walk cache need to be removed with a TLB flush.
Otherwise the hardware can use the freed (and possibly reused) page to
establish new TLB entries.



	Joerg
Chintan Pandya April 27, 2018, 1:42 p.m. UTC | #14
On 4/27/2018 6:18 PM, joro@8bytes.org wrote:
> On Fri, Apr 27, 2018 at 05:22:28PM +0530, Chintan Pandya wrote:
>> I'm a bit confused here. Are you pointing to a race within the ioremap/vmalloc
>> framework while updating the page table, or a race during TLB ops? Since
>> the latter is arch dependent, I won't comment on it. But if the race being
>> discussed here is about altering page tables, I'm not on the same page.
> 
> The race condition is between hardware and software. It is not
> sufficient to just remove the software references to the page that is
> about to be freed (by clearing the PMD/PUD), also the hardware
> references in the page-walk cache need to be removed with a TLB flush.
> Otherwise the hardware can use the freed (and possibly reused) page to
> establish new TLB entries.

Sure! This is my understanding too (from the arm64 context).

> 
> 
> 
> 	Joerg
> 
> 
> 

Chintan
Kani, Toshi April 27, 2018, 2:31 p.m. UTC | #15
On Fri, 2018-04-27 at 09:37 +0200, joro@8bytes.org wrote:
> On Thu, Apr 26, 2018 at 10:30:14PM +0000, Kani, Toshi wrote:
> > Thanks for the clarification. After reading through SDM one more time, I
> > agree that we need a TLB purge here. Here is my current understanding. 
> > 
> >  - INVLPG purges both TLB and paging-structure caches. So, PMD cache was
> > purged once.
> >  - However, processor may cache this PMD entry later in speculation
> > since it has p-bit set. (This is where my misunderstanding was.
> > Speculation is not allowed to access a target address, but it may still
> > cache this PMD entry.)
> >  - A single INVLPG on each processor purges this PMD cache. It does not
> > need a range purge (which was already done).
> > 
> > Does it sound right to you?
> 
> The right fix is to first synchronize the changes when the PMD/PUD is
> cleared and then flush the TLB system-wide. After that is done you can
> free the page.

Agreed. This can be done on top of this patch.

> But doing all that in the pud/pmd_free_pmd/pte_page() functions is too
> expensive, as the TLB flush requires sending IPIs to all cores in the
> system, and that happens every time the function is called.
>
> So what needs to be done is to fix this from high-level ioremap code to
> first unmap all required PTE/PMD pages and collect them in a list. When
> that is done you can synchronize the changes with the other page-tables
> in the system and do one system-wide TLB flush. When that is complete
> you can free the pages on the list that were collected while unmapping.
>
> Then the new mappings can be established and again synchronized with the
> other page-tables in the system.

Yes, and this patch was designed to work in such a way.  Please note that
this patch added pud_free_pmd_page() and pmd_free_pte_page() to the
ioremap() path when and only when it creates a pud or pmd map (see the
sketch below).  This ensures the following preconditions are met without
overhead:
 - All pte entries for the target pud/pmd address range have been cleared.
 - System-wide TLB purges have been done for the target pud/pmd address
range.
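
For reference, the ioremap-side call site from lib/ioremap.c looks roughly
like this (paraphrased from the series, not quoted exactly);
pmd_free_pte_page() is only reached when a PMD-sized, PMD-aligned huge
mapping is about to be created:

static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
                unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
{
        pmd_t *pmd;
        unsigned long next;

        phys_addr -= addr;
        pmd = pmd_alloc(&init_mm, pud, addr);
        if (!pmd)
                return -ENOMEM;
        do {
                next = pmd_addr_end(addr, end);

                if (ioremap_pmd_enabled() &&
                    ((next - addr) == PMD_SIZE) &&
                    IS_ALIGNED(phys_addr + addr, PMD_SIZE) &&
                    pmd_free_pte_page(pmd)) {   /* free the stale PTE page */
                        if (pmd_set_huge(pmd, phys_addr + addr, prot))
                                continue;
                }

                if (ioremap_pte_range(pmd, addr, next, phys_addr + addr, prot))
                        return -ENOMEM;
        } while (pmd++, addr = next, addr != end);
        return 0;
}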

So, we can add the step 2 on top of this patch.
 1. Clear pud/pmd entry.
 2. System wide TLB flush <-- TO BE ADDED BY NEW PATCH
 3. Free its underlying pmd/pte page.

> > As for the BUG_ON issue, are you able to reproduce this issue?  If so,
> > would you be able to test the fix?
> 
> Yes, I can reproduce the BUG_ON with my PTI patches and a fedora-i386
> VM.

Great!

> I already ran into the issue before your patches were merged upstream,
> but my "fix" is different because it just prevents huge-mappings when
> there were smaller mappings before. See
> 
> 	e3e288121408 x86/pgtable: Don't set huge PUD/PMD on non-leaf entries
> 
> for details. This patch does not fix the base-problem, but hides it
> again, as the real fix needs some more work across architectures.

Right.  Patch 1/2 of this series made the same fix as well.  See:

b6bdb7517c3d  mm/vmalloc: add interfaces to free unmapped page table

> Your patch actually makes the problem worse, without it the PTE/PMD pages
> were just leaked, so that they could not be reused. But with your patch
> the pages can be used again and the page-walker might establish TLB
> entries based on random content the new owner writes to it. This can
> lead to all kinds of random and very hard to debug data corruption
> issues.
> 
> So until we make the generic ioremap code in lib/ioremap.c smarter about
> unmapping/remapping ranges the best solution is making my fix work again
> by reverting your patch.

We do not need to revert this patch.  We can make the above change I
mentioned.

Thanks,
-Toshi
Joerg Roedel April 28, 2018, 9:02 a.m. UTC | #16
On Fri, Apr 27, 2018 at 02:31:51PM +0000, Kani, Toshi wrote:
> So, we can add the step 2 on top of this patch.
>  1. Clear pud/pmd entry.
>  2. System wide TLB flush <-- TO BE ADDED BY NEW PATCH
> >  3. Free its underlying pmd/pte page.

This still lacks the page-table synchronization and will thus not fix
the BUG_ON being triggered.

> We do not need to revert this patch.  We can make the above change I
> mentioned.

Please note that we are not in the merge window anymore and that any fix
needs to be simple and obviously correct.


Thanks,

	Joerg
Kani, Toshi April 28, 2018, 8:54 p.m. UTC | #17
On Sat, 2018-04-28 at 11:02 +0200, joro@8bytes.org wrote:
> On Fri, Apr 27, 2018 at 02:31:51PM +0000, Kani, Toshi wrote:
> > So, we can add the step 2 on top of this patch.
> >  1. Clear pud/pmd entry.
> >  2. System wide TLB flush <-- TO BE ADDED BY NEW PATCH
> > >  3. Free its underlying pmd/pte page.
> 
> This still lacks the page-table synchronization and will thus not fix
> the BUG_ON being triggered.

The BUG_ON issue is specific to PAE, which syncs at the pmd level.
x86/64 does not have this issue since it syncs at the pgd or p4d level.

> > We do not need to revert this patch.  We can make the above change I
> > mentioned.
> 
> Please note that we are not in the merge window anymore and that any fix
> needs to be simple and obviously correct.

Understood.  Changing the x86/32 sync point is risky.  So, I am going to
revert the free page handling for PAE.   

Thanks,
-Toshi
Chintan Pandya April 30, 2018, 7:30 a.m. UTC | #18
On 4/29/2018 2:24 AM, Kani, Toshi wrote:
> On Sat, 2018-04-28 at 11:02 +0200, joro@8bytes.org wrote:
>> On Fri, Apr 27, 2018 at 02:31:51PM +0000, Kani, Toshi wrote:
>>> So, we can add the step 2 on top of this patch.
>>>   1. Clear pud/pmd entry.
>>>   2. System wide TLB flush <-- TO BE ADDED BY NEW PATCH
>>>   3. Free its underlying pmd/pte page.
>>
>> This still lacks the page-table synchronization and will thus not fix
>> the BUG_ON being triggered.
> 
> The BUG_ON issue is specific to PAE, which syncs at the pmd level.
> x86/64 does not have this issue since it syncs at the pgd or p4d level.
> 
>>> We do not need to revert this patch.  We can make the above change I
>>> mentioned.
>>
>> Please note that we are not in the merge window anymore and that any fix
>> needs to be simple and obviously correct.
> 
> Understood.  Changing the x86/32 sync point is risky.  So, I am going to
> revert the free page handling for PAE.

Will this affect pmd_free_pte_page() & pud_free_pmd_page()'s existence
or their parameters? I'm asking because I have a similar change for arm64
and am ready to send v9 patches.

I'm thinking to share my v9 patches in any case. If you are going to do
TLB invalidation within these APIs, my first patch will help.

> 
> Thanks,
> -Toshi
> 
> 

Chintan
Kani, Toshi April 30, 2018, 1:43 p.m. UTC | #19
On Mon, 2018-04-30 at 13:00 +0530, Chintan Pandya wrote:
> 
> On 4/29/2018 2:24 AM, Kani, Toshi wrote:
> > On Sat, 2018-04-28 at 11:02 +0200, joro@8bytes.org wrote:
> > > On Fri, Apr 27, 2018 at 02:31:51PM +0000, Kani, Toshi wrote:
> > > > So, we can add the step 2 on top of this patch.
> > > >   1. Clear pud/pmd entry.
> > > >   2. System wide TLB flush <-- TO BE ADDED BY NEW PATCH
> > > >   3. Free its underlying pmd/pte page.
> > > 
> > > This still lacks the page-table synchronization and will thus not fix
> > > the BUG_ON being triggered.
> > 
> > The BUG_ON issue is specific to PAE, which syncs at the pmd level.
> > x86/64 does not have this issue since it syncs at the pgd or p4d level.
> > 
> > > > We do not need to revert this patch.  We can make the above change I
> > > > mentioned.
> > > 
> > > Please note that we are not in the merge window anymore and that any fix
> > > needs to be simple and obviously correct.
> > 
> > Understood.  Changing the x86/32 sync point is risky.  So, I am going to
> > revert the free page handling for PAE.
> 
> Will this affect pmd_free_pte_page() & pud_free_pmd_page()'s existence
> or their parameters? I'm asking because I have a similar change for arm64
> and am ready to send v9 patches.

No, it won't.  The change is only to the x86 side.

> I'm thinking to share my v9 patches in any case. If you are going to do
> TLB invalidation within these APIs, my first patch will help.

I will make my change on top of your v9 1/4 patch so that we can avoid
merge conflict.

Thanks,
-Toshi

Patch

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 1eed7ed518e6..34cda7e0551b 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -712,7 +712,22 @@  int pmd_clear_huge(pmd_t *pmd)
  */
 int pud_free_pmd_page(pud_t *pud)
 {
-	return pud_none(*pud);
+	pmd_t *pmd;
+	int i;
+
+	if (pud_none(*pud))
+		return 1;
+
+	pmd = (pmd_t *)pud_page_vaddr(*pud);
+
+	for (i = 0; i < PTRS_PER_PMD; i++)
+		if (!pmd_free_pte_page(&pmd[i]))
+			return 0;
+
+	pud_clear(pud);
+	free_page((unsigned long)pmd);
+
+	return 1;
 }
 
 /**
@@ -724,6 +739,15 @@  int pud_free_pmd_page(pud_t *pud)
  */
 int pmd_free_pte_page(pmd_t *pmd)
 {
-	return pmd_none(*pmd);
+	pte_t *pte;
+
+	if (pmd_none(*pmd))
+		return 1;
+
+	pte = (pte_t *)pmd_page_vaddr(*pmd);
+	pmd_clear(pmd);
+	free_page((unsigned long)pte);
+
+	return 1;
 }
 #endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */