Message ID: 20180314180155.19492-3-toshi.kani@hpe.com (mailing list archive)
State: New, archived
On 3/14/2018 11:31 PM, Toshi Kani wrote:
> Implement pud_free_pmd_page() and pmd_free_pte_page() on x86, which
> clear a given pud/pmd entry and free up lower level page table(s).
> Address range associated with the pud/pmd entry must have been purged
> by INVLPG.
>
> fixes: e61ce6ade404e ("mm: change ioremap to set up huge I/O mappings")
> Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Borislav Petkov <bp@suse.de>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: <stable@vger.kernel.org>
> ---
>  arch/x86/mm/pgtable.c | 28 ++++++++++++++++++++++++++--
>  1 file changed, 26 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index 1eed7ed518e6..34cda7e0551b 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -712,7 +712,22 @@ int pmd_clear_huge(pmd_t *pmd)
>   */
>  int pud_free_pmd_page(pud_t *pud)
>  {
> -        return pud_none(*pud);
> +        pmd_t *pmd;
> +        int i;
> +
> +        if (pud_none(*pud))
> +                return 1;
> +
> +        pmd = (pmd_t *)pud_page_vaddr(*pud);
> +
> +        for (i = 0; i < PTRS_PER_PMD; i++)
> +                if (!pmd_free_pte_page(&pmd[i]))

This is a forced action, not optional. Also, pmd_free_pte_page()
doesn't return 0 in any case. So, you may remove the _if_?

> +                        return 0;
> +
> +        pud_clear(pud);
> +        free_page((unsigned long)pmd);
> +
> +        return 1;
>  }
>
>  /**
> @@ -724,6 +739,15 @@ int pud_free_pmd_page(pud_t *pud)
>   */
>  int pmd_free_pte_page(pmd_t *pmd)
>  {
> -        return pmd_none(*pmd);
> +        pte_t *pte;
> +
> +        if (pmd_none(*pmd))

This should also check if the pmd is already huge. Same for the pud?

> +                return 1;
> +
> +        pte = (pte_t *)pmd_page_vaddr(*pmd);
> +        pmd_clear(pmd);
> +        free_page((unsigned long)pte);
> +
> +        return 1;
>  }
>  #endif  /* CONFIG_HAVE_ARCH_HUGE_VMAP */

Chintan
On Thu, 2018-03-15 at 13:09 +0530, Chintan Pandya wrote:
> On 3/14/2018 11:31 PM, Toshi Kani wrote:
> > Implement pud_free_pmd_page() and pmd_free_pte_page() on x86, which
> > clear a given pud/pmd entry and free up lower level page table(s).
> > Address range associated with the pud/pmd entry must have been purged
> > by INVLPG.
> >
> > fixes: e61ce6ade404e ("mm: change ioremap to set up huge I/O mappings")
> > Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: "H. Peter Anvin" <hpa@zytor.com>
> > Cc: Borislav Petkov <bp@suse.de>
> > Cc: Matthew Wilcox <willy@infradead.org>
> > Cc: <stable@vger.kernel.org>
> > ---
> >  arch/x86/mm/pgtable.c | 28 ++++++++++++++++++++++++++--
> >  1 file changed, 26 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> > index 1eed7ed518e6..34cda7e0551b 100644
> > --- a/arch/x86/mm/pgtable.c
> > +++ b/arch/x86/mm/pgtable.c
> > @@ -712,7 +712,22 @@ int pmd_clear_huge(pmd_t *pmd)
> >   */
> >  int pud_free_pmd_page(pud_t *pud)
> >  {
> > -        return pud_none(*pud);
> > +        pmd_t *pmd;
> > +        int i;
> > +
> > +        if (pud_none(*pud))
> > +                return 1;
> > +
> > +        pmd = (pmd_t *)pud_page_vaddr(*pud);
> > +
> > +        for (i = 0; i < PTRS_PER_PMD; i++)
> > +                if (!pmd_free_pte_page(&pmd[i]))
>
> This is a forced action, not optional. Also, pmd_free_pte_page()
> doesn't return 0 in any case. So, you may remove the _if_?

The code needs to be written per the interface definition, not per the
current implementation.

> > +                        return 0;
> > +
> > +        pud_clear(pud);
> > +        free_page((unsigned long)pmd);
> > +
> > +        return 1;
> >  }
> >
> >  /**
> > @@ -724,6 +739,15 @@ int pud_free_pmd_page(pud_t *pud)
> >   */
> >  int pmd_free_pte_page(pmd_t *pmd)
> >  {
> > -        return pmd_none(*pmd);
> > +        pte_t *pte;
> > +
> > +        if (pmd_none(*pmd))
>
> This should also check if the pmd is already huge. Same for the pud?

Not necessary. As described in the function header, one of the entry
conditions is that a given pmd range is unmapped. See vunmap_pmd_range().

Thanks,
-Toshi
Hi Toshi, Andrew,

this patch(-set) is broken in several ways, please see below.

On Wed, Mar 14, 2018 at 12:01:55PM -0600, Toshi Kani wrote:
> Implement pud_free_pmd_page() and pmd_free_pte_page() on x86, which
> clear a given pud/pmd entry and free up lower level page table(s).
> Address range associated with the pud/pmd entry must have been purged
> by INVLPG.

An INVLPG before actually unmapping the page is useless, as other cores
or even speculative instruction execution can bring the TLB entry back
before the code actually unmaps the page.

>  int pud_free_pmd_page(pud_t *pud)
>  {
> -        return pud_none(*pud);
> +        pmd_t *pmd;
> +        int i;
> +
> +        if (pud_none(*pud))
> +                return 1;
> +
> +        pmd = (pmd_t *)pud_page_vaddr(*pud);
> +
> +        for (i = 0; i < PTRS_PER_PMD; i++)
> +                if (!pmd_free_pte_page(&pmd[i]))
> +                        return 0;
> +
> +        pud_clear(pud);

TLB flush needed here, before the page is freed.

> +        free_page((unsigned long)pmd);
> +
> +        return 1;
>  }
>
>  /**
> @@ -724,6 +739,15 @@ int pud_free_pmd_page(pud_t *pud)
>   */
>  int pmd_free_pte_page(pmd_t *pmd)
>  {
> -        return pmd_none(*pmd);
> +        pte_t *pte;
> +
> +        if (pmd_none(*pmd))
> +                return 1;
> +
> +        pte = (pte_t *)pmd_page_vaddr(*pmd);
> +        pmd_clear(pmd);

Same here, TLB flush needed.

Further this needs synchronization with other page-tables in the system
when the kernel PMDs are not shared between processes. In x86-32 with
PAE this causes a BUG_ON() being triggered at arch/x86/mm/fault.c:268
because the page-tables are not correctly synchronized.

> +        free_page((unsigned long)pte);
> +
> +        return 1;
>  }
>  #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
On Thu, 2018-04-26 at 16:19 +0200, Joerg Roedel wrote:
> Hi Toshi, Andrew,
>
> this patch(-set) is broken in several ways, please see below.
>
> On Wed, Mar 14, 2018 at 12:01:55PM -0600, Toshi Kani wrote:
> > Implement pud_free_pmd_page() and pmd_free_pte_page() on x86, which
> > clear a given pud/pmd entry and free up lower level page table(s).
> > Address range associated with the pud/pmd entry must have been purged
> > by INVLPG.
>
> An INVLPG before actually unmapping the page is useless, as other cores
> or even speculative instruction execution can bring the TLB entry back
> before the code actually unmaps the page.

Hi Joerg,

All pages under the pmd had been unmapped and then lazy TLB purged with
INVLPG before coming to this code path. Speculation is not allowed to
pages without mapping.

> >  int pud_free_pmd_page(pud_t *pud)
> >  {
> > -        return pud_none(*pud);
> > +        pmd_t *pmd;
> > +        int i;
> > +
> > +        if (pud_none(*pud))
> > +                return 1;
> > +
> > +        pmd = (pmd_t *)pud_page_vaddr(*pud);
> > +
> > +        for (i = 0; i < PTRS_PER_PMD; i++)
> > +                if (!pmd_free_pte_page(&pmd[i]))
> > +                        return 0;
> > +
> > +        pud_clear(pud);
>
> TLB flush needed here, before the page is freed.
>
> > +        free_page((unsigned long)pmd);
> > +
> > +        return 1;
> >  }
> >
> >  /**
> > @@ -724,6 +739,15 @@ int pud_free_pmd_page(pud_t *pud)
> >   */
> >  int pmd_free_pte_page(pmd_t *pmd)
> >  {
> > -        return pmd_none(*pmd);
> > +        pte_t *pte;
> > +
> > +        if (pmd_none(*pmd))
> > +                return 1;
> > +
> > +        pte = (pte_t *)pmd_page_vaddr(*pmd);
> > +        pmd_clear(pmd);
>
> Same here, TLB flush needed.
>
> Further this needs synchronization with other page-tables in the system
> when the kernel PMDs are not shared between processes. In x86-32 with
> PAE this causes a BUG_ON() being triggered at arch/x86/mm/fault.c:268
> because the page-tables are not correctly synchronized.

I think this is an issue with pmd mapping support on x86-32-PAE, not
with this patch. I think the code needed to be updated to sync at the
pud level.

Thanks,
-Toshi
On Thu, Apr 26, 2018 at 04:21:19PM +0000, Kani, Toshi wrote:
> All pages under the pmd had been unmapped and then lazy TLB purged with
> INVLPG before coming to this code path. Speculation is not allowed to
> pages without mapping.

CPUs have not only TLBs, but also page-walk caches which cache
intermediary results of page-table walks and which are flushed together
with the TLB.

So the PMD entry you clear can still be in a page-walk cache, and this
needs to be flushed too before you can free the PTE page. Otherwise
page-walks might still go to the page you just freed. That is especially
bad when the page is already reallocated and filled with other data.

> > Further this needs synchronization with other page-tables in the system
> > when the kernel PMDs are not shared between processes. In x86-32 with
> > PAE this causes a BUG_ON() being triggered at arch/x86/mm/fault.c:268
> > because the page-tables are not correctly synchronized.
>
> I think this is an issue with pmd mapping support on x86-32-PAE, not
> with this patch. I think the code needed to be updated to sync at the
> pud level.

It is an issue with this patch, because this patch is for x86, and on
x86 every change to the kernel page-tables potentially needs to be
synchronized to the other page-tables. And this patch doesn't implement
that, which triggers a BUG_ON() under certain conditions.

Regards,

	Joerg
On Thu, 2018-04-26 at 19:23 +0200, joro@8bytes.org wrote:
> On Thu, Apr 26, 2018 at 04:21:19PM +0000, Kani, Toshi wrote:
> > All pages under the pmd had been unmapped and then lazy TLB purged with
> > INVLPG before coming to this code path. Speculation is not allowed to
> > pages without mapping.
>
> CPUs have not only TLBs, but also page-walk caches which cache
> intermediary results of page-table walks and which are flushed together
> with the TLB.
>
> So the PMD entry you clear can still be in a page-walk cache, and this
> needs to be flushed too before you can free the PTE page. Otherwise
> page-walks might still go to the page you just freed. That is especially
> bad when the page is already reallocated and filled with other data.

I do not understand why we need to flush processor caches here. x86
processor caches are coherent with MESI. So, clearing a PMD entry
modifies a cache entry on the processor associated with the address,
which in turn invalidates all stale cache entries on other processors.

> > > Further this needs synchronization with other page-tables in the system
> > > when the kernel PMDs are not shared between processes. In x86-32 with
> > > PAE this causes a BUG_ON() being triggered at arch/x86/mm/fault.c:268
> > > because the page-tables are not correctly synchronized.
> >
> > I think this is an issue with pmd mapping support on x86-32-PAE, not
> > with this patch. I think the code needed to be updated to sync at the
> > pud level.
>
> It is an issue with this patch, because this patch is for x86, and on
> x86 every change to the kernel page-tables potentially needs to be
> synchronized to the other page-tables. And this patch doesn't implement
> that, which triggers a BUG_ON() under certain conditions.

The issue was introduced when pmd mapping support was added on x86/32,
which predates this patch.

Thanks,
-Toshi
On Thu, Apr 26, 2018 at 05:49:58PM +0000, Kani, Toshi wrote:
> On Thu, 2018-04-26 at 19:23 +0200, joro@8bytes.org wrote:
> > So the PMD entry you clear can still be in a page-walk cache, and this
> > needs to be flushed too before you can free the PTE page. Otherwise
> > page-walks might still go to the page you just freed. That is especially
> > bad when the page is already reallocated and filled with other data.
>
> I do not understand why we need to flush processor caches here. x86
> processor caches are coherent with MESI. So, clearing a PMD entry
> modifies a cache entry on the processor associated with the address,
> which in turn invalidates all stale cache entries on other processors.

A page-walk cache is not about the processor's data cache; it is a cache
similar to the TLB that speeds up page-walks by caching intermediate
results of previous page walks.

Thanks,

	Joerg
On Thu, 2018-04-26 at 22:07 +0200, joro@8bytes.org wrote:
> On Thu, Apr 26, 2018 at 05:49:58PM +0000, Kani, Toshi wrote:
> > On Thu, 2018-04-26 at 19:23 +0200, joro@8bytes.org wrote:
> > > So the PMD entry you clear can still be in a page-walk cache, and this
> > > needs to be flushed too before you can free the PTE page. Otherwise
> > > page-walks might still go to the page you just freed. That is especially
> > > bad when the page is already reallocated and filled with other data.
> >
> > I do not understand why we need to flush processor caches here. x86
> > processor caches are coherent with MESI. So, clearing a PMD entry
> > modifies a cache entry on the processor associated with the address,
> > which in turn invalidates all stale cache entries on other processors.
>
> A page-walk cache is not about the processor's data cache; it is a cache
> similar to the TLB that speeds up page-walks by caching intermediate
> results of previous page walks.

Thanks for the clarification. After reading through the SDM one more
time, I agree that we need a TLB purge here. Here is my current
understanding.

 - INVLPG purges both TLB and paging-structure caches. So, the PMD cache
   was purged once.
 - However, the processor may cache this PMD entry later speculatively,
   since it has the P bit set. (This is where my misunderstanding was.
   Speculation is not allowed to access a target address, but it may
   still cache this PMD entry.)
 - A single INVLPG on each processor purges this PMD cache. It does not
   need a range purge (which was already done).

Does that sound right to you?

As for the BUG_ON issue, are you able to reproduce it? If so, would you
be able to test the fix?

Regards,
-Toshi
On Thu, Apr 26, 2018 at 10:30:14PM +0000, Kani, Toshi wrote:
> Thanks for the clarification. After reading through the SDM one more
> time, I agree that we need a TLB purge here. Here is my current
> understanding.
>
>  - INVLPG purges both TLB and paging-structure caches. So, the PMD cache
>    was purged once.
>  - However, the processor may cache this PMD entry later speculatively,
>    since it has the P bit set. (This is where my misunderstanding was.
>    Speculation is not allowed to access a target address, but it may
>    still cache this PMD entry.)
>  - A single INVLPG on each processor purges this PMD cache. It does not
>    need a range purge (which was already done).
>
> Does that sound right to you?

The right fix is to first synchronize the changes when the PMD/PUD is
cleared and then flush the TLB system-wide. After that is done you can
free the page.

But doing all that in the pud_free_pmd_page()/pmd_free_pte_page()
functions is too expensive, as the TLB flush requires sending IPIs to
all cores in the system, and that every time the function is called.

So what needs to be done is to fix this from the high-level ioremap
code: first unmap all required PTE/PMD pages and collect them in a list.
When that is done you can synchronize the changes with the other
page-tables in the system and do one system-wide TLB flush. When that is
complete you can free the pages on the list that were collected while
unmapping.

Then the new mappings can be established and again synchronized with the
other page-tables in the system.

> As for the BUG_ON issue, are you able to reproduce it? If so, would you
> be able to test the fix?

Yes, I can reproduce the BUG_ON with my PTI patches and a fedora-i386
VM.

I already ran into the issue before your patches were merged upstream,
but my "fix" is different because it just prevents huge mappings when
there were smaller mappings before. See

	e3e288121408 x86/pgtable: Don't set huge PUD/PMD on non-leaf entries

for details. This patch does not fix the base problem, but hides it
again, as the real fix needs some more work across architectures.

Your patch actually makes the problem worse; without it the PTE/PMD
pages were just leaked, so that they could not be reused. But with your
patch the pages can be used again and the page-walker might establish
TLB entries based on random content the new owner writes to them. This
can lead to all kinds of random and very hard to debug data corruption
issues.

So until we make the generic ioremap code in lib/ioremap.c smarter about
unmapping/remapping ranges, the best solution is making my fix work
again by reverting your patch.

Thanks,

	Joerg
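[Editor's note: the batched approach described above could look roughly like the sketch below. This is an illustration only, not existing kernel code; ioremap_unmap_and_collect() and free_collected_tables() are made-up names, while flush_tlb_kernel_range() and vmalloc_sync_all() are real interfaces of that era.]

/*
 * Sketch: batch the page-table frees in the generic ioremap path so a
 * single system-wide TLB flush covers all of them before anything is
 * freed or reused.
 */
static void ioremap_replace_range(unsigned long addr, unsigned long end)
{
	LIST_HEAD(freed_tables);

	/* 1. Clear pte/pmd/pud entries, collecting the page-table pages. */
	ioremap_unmap_and_collect(addr, end, &freed_tables);

	/*
	 * 2. Sync the change into all kernel page-tables (this matters on
	 *    x86-32/PAE, where kernel PMDs are not shared), then do one
	 *    system-wide TLB flush.
	 */
	vmalloc_sync_all();
	flush_tlb_kernel_range(addr, end);

	/* 3. Only now is it safe to free (and reuse) the collected pages. */
	free_collected_tables(&freed_tables);

	/* 4. New huge mappings can be established afterwards. */
}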
On Fri 27-04-18 09:37:20, joro@8bytes.org wrote:
[...]
> So until we make the generic ioremap code in lib/ioremap.c smarter about
> unmapping/remapping ranges, the best solution is making my fix work
> again by reverting your patch.

Could you reuse the same mmu_gather mechanism as we use in
zap_*_range?
On Fri, Apr 27, 2018 at 01:39:23PM +0200, Michal Hocko wrote:
> On Fri 27-04-18 09:37:20, joro@8bytes.org wrote:
> [...]
> > So until we make the generic ioremap code in lib/ioremap.c smarter about
> > unmapping/remapping ranges, the best solution is making my fix work
> > again by reverting your patch.
>
> Could you reuse the same mmu_gather mechanism as we use in
> zap_*_range?

Yeah, maybe, I haven't looked into the details yet. At least something
similar is needed.


	Joerg
On 4/27/2018 1:07 PM, joro@8bytes.org wrote:
> On Thu, Apr 26, 2018 at 10:30:14PM +0000, Kani, Toshi wrote:
> > Thanks for the clarification. After reading through the SDM one more
> > time, I agree that we need a TLB purge here. Here is my current
> > understanding.
> >
> >  - INVLPG purges both TLB and paging-structure caches. So, the PMD cache
> >    was purged once.
> >  - However, the processor may cache this PMD entry later speculatively,
> >    since it has the P bit set. (This is where my misunderstanding was.
> >    Speculation is not allowed to access a target address, but it may
> >    still cache this PMD entry.)
> >  - A single INVLPG on each processor purges this PMD cache. It does not
> >    need a range purge (which was already done).
> >
> > Does that sound right to you?
>
> The right fix is to first synchronize the changes when the PMD/PUD is
> cleared and then flush the TLB system-wide. After that is done you can
> free the page.

I'm a bit confused here. Are you pointing to a race within the
ioremap/vmalloc framework while updating the page table, or a race
during TLB ops? Since the latter is arch dependent, I won't comment on
it. But if the race being discussed is while altering page tables, I'm
not on the same page.

The current ioremap/vmalloc framework works with a reserved virtual area
for its own purpose. Within this virtual area, we maintain mutual
exclusiveness by maintaining a separate rbtree which is, of course,
synchronized. In the __vunmap leg, we perform the page table ops first
and then release the virtual area for someone else to re-use. This way,
without taking any additional locks for page table modifications, we are
good.

If that's not the case, then I'm missing something here. Also, I'm
curious to know what race you are observing at your end.

Chintan
On Fri, Apr 27, 2018 at 05:22:28PM +0530, Chintan Pandya wrote:
> I'm a bit confused here. Are you pointing to a race within the
> ioremap/vmalloc framework while updating the page table, or a race
> during TLB ops? Since the latter is arch dependent, I won't comment on
> it. But if the race being discussed is while altering page tables, I'm
> not on the same page.

The race condition is between hardware and software. It is not
sufficient to just remove the software references to the page that is
about to be freed (by clearing the PMD/PUD); the hardware references in
the page-walk cache also need to be removed with a TLB flush. Otherwise
the hardware can use the freed (and possibly reused) page to establish
new TLB entries.


	Joerg
On 4/27/2018 6:18 PM, joro@8bytes.org wrote:
> On Fri, Apr 27, 2018 at 05:22:28PM +0530, Chintan Pandya wrote:
> > I'm a bit confused here. Are you pointing to a race within the
> > ioremap/vmalloc framework while updating the page table, or a race
> > during TLB ops? Since the latter is arch dependent, I won't comment on
> > it. But if the race being discussed is while altering page tables, I'm
> > not on the same page.
>
> The race condition is between hardware and software. It is not
> sufficient to just remove the software references to the page that is
> about to be freed (by clearing the PMD/PUD); the hardware references in
> the page-walk cache also need to be removed with a TLB flush. Otherwise
> the hardware can use the freed (and possibly reused) page to establish
> new TLB entries.

Sure! This is my understanding too (from the arm64 context).

Chintan
On Fri, 2018-04-27 at 09:37 +0200, joro@8bytes.org wrote:
> On Thu, Apr 26, 2018 at 10:30:14PM +0000, Kani, Toshi wrote:
> > Thanks for the clarification. After reading through the SDM one more
> > time, I agree that we need a TLB purge here. Here is my current
> > understanding.
> >
> >  - INVLPG purges both TLB and paging-structure caches. So, the PMD cache
> >    was purged once.
> >  - However, the processor may cache this PMD entry later speculatively,
> >    since it has the P bit set. (This is where my misunderstanding was.
> >    Speculation is not allowed to access a target address, but it may
> >    still cache this PMD entry.)
> >  - A single INVLPG on each processor purges this PMD cache. It does not
> >    need a range purge (which was already done).
> >
> > Does that sound right to you?
>
> The right fix is to first synchronize the changes when the PMD/PUD is
> cleared and then flush the TLB system-wide. After that is done you can
> free the page.

Agreed. This can be done on top of this patch.

> But doing all that in the pud_free_pmd_page()/pmd_free_pte_page()
> functions is too expensive, as the TLB flush requires sending IPIs to
> all cores in the system, and that every time the function is called.
>
> So what needs to be done is to fix this from the high-level ioremap
> code: first unmap all required PTE/PMD pages and collect them in a list.
> When that is done you can synchronize the changes with the other
> page-tables in the system and do one system-wide TLB flush. When that is
> complete you can free the pages on the list that were collected while
> unmapping.
>
> Then the new mappings can be established and again synchronized with the
> other page-tables in the system.

Yes, and this patch was designed to work in such a way. Please note that
this patch added pud_free_pmd_page() and pmd_free_pte_page() to the
ioremap() path when, and only when, it creates a pud or pmd map. This
assures the following preconditions are met without overhead.

 - All pte entries for a target pud/pmd address range have been cleared.
 - System-wide TLB purges have been done for a target pud/pmd address
   range.

So, we can add step 2 on top of this patch (a sketch follows after this
message).

 1. Clear pud/pmd entry.
 2. System-wide TLB flush <-- TO BE ADDED BY NEW PATCH
 3. Free its underlying pmd/pte page.

> > As for the BUG_ON issue, are you able to reproduce it? If so, would you
> > be able to test the fix?
>
> Yes, I can reproduce the BUG_ON with my PTI patches and a fedora-i386
> VM.

Great!

> I already ran into the issue before your patches were merged upstream,
> but my "fix" is different because it just prevents huge mappings when
> there were smaller mappings before. See
>
> 	e3e288121408 x86/pgtable: Don't set huge PUD/PMD on non-leaf entries
>
> for details. This patch does not fix the base problem, but hides it
> again, as the real fix needs some more work across architectures.

Right. Patch 1/2 of this series made the same fix as well. See:

	b6bdb7517c3d mm/vmalloc: add interfaces to free unmapped page table

> Your patch actually makes the problem worse; without it the PTE/PMD
> pages were just leaked, so that they could not be reused. But with your
> patch the pages can be used again and the page-walker might establish
> TLB entries based on random content the new owner writes to them. This
> can lead to all kinds of random and very hard to debug data corruption
> issues.
>
> So until we make the generic ioremap code in lib/ioremap.c smarter about
> unmapping/remapping ranges, the best solution is making my fix work
> again by reverting your patch.

We do not need to revert this patch. We can make the above change I
mentioned.

Thanks,
-Toshi
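[Editor's note: a minimal sketch of what adding "step 2" to pmd_free_pte_page() might look like. The extra addr argument and the use of flush_tlb_kernel_range() are assumptions made for this illustration; they are not part of the patch under discussion.]

/*
 * Sketch only: clear the entry, flush system-wide, then free the page.
 */
int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
{
	pte_t *pte;

	if (pmd_none(*pmd))
		return 1;

	pte = (pte_t *)pmd_page_vaddr(*pmd);
	pmd_clear(pmd);					/* step 1: clear entry     */
	flush_tlb_kernel_range(addr, addr + PMD_SIZE);	/* step 2: TLB flush (IPIs) */
	free_page((unsigned long)pte);			/* step 3: free pte page   */

	return 1;
}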
On Fri, Apr 27, 2018 at 02:31:51PM +0000, Kani, Toshi wrote:
> So, we can add step 2 on top of this patch.
>  1. Clear pud/pmd entry.
>  2. System-wide TLB flush <-- TO BE ADDED BY NEW PATCH
>  3. Free its underlying pmd/pte page.

This still lacks the page-table synchronization and will thus not fix
the BUG_ON being triggered.

> We do not need to revert this patch. We can make the above change I
> mentioned.

Please note that we are not in the merge window anymore and that any fix
needs to be simple and obviously correct.

Thanks,

	Joerg
On Sat, 2018-04-28 at 11:02 +0200, joro@8bytes.org wrote:
> On Fri, Apr 27, 2018 at 02:31:51PM +0000, Kani, Toshi wrote:
> > So, we can add step 2 on top of this patch.
> >  1. Clear pud/pmd entry.
> >  2. System-wide TLB flush <-- TO BE ADDED BY NEW PATCH
> >  3. Free its underlying pmd/pte page.
>
> This still lacks the page-table synchronization and will thus not fix
> the BUG_ON being triggered.

The BUG_ON issue is specific to PAE, which syncs at the pmd level.
x86/64 does not have this issue since it syncs at the pgd or p4d level.

> > We do not need to revert this patch. We can make the above change I
> > mentioned.
>
> Please note that we are not in the merge window anymore and that any fix
> needs to be simple and obviously correct.

Understood. Changing the x86/32 sync point is risky. So, I am going to
revert the free page handling for PAE.

Thanks,
-Toshi
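[Editor's note: a sketch of how the free-page handling could be confined to x86-64, keeping the old conservative behaviour on x86-32/PAE. The exact #ifdef split shown here is an assumption for illustration, not the actual follow-up patch.]

#ifdef CONFIG_X86_64
/* ... pud_free_pmd_page()/pmd_free_pte_page() free the tables, as in
 * the patch above ... */
#else	/* !CONFIG_X86_64, i.e. x86-32 with PAE */

/*
 * On PAE, never free a pmd/pte page: succeed only when the entry is
 * already clear, so no page-table sync or extra TLB flush is needed.
 */
int pud_free_pmd_page(pud_t *pud)
{
	return pud_none(*pud);
}

int pmd_free_pte_page(pmd_t *pmd)
{
	return pmd_none(*pmd);
}

#endif	/* CONFIG_X86_64 */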
On 4/29/2018 2:24 AM, Kani, Toshi wrote:
> On Sat, 2018-04-28 at 11:02 +0200, joro@8bytes.org wrote:
> > On Fri, Apr 27, 2018 at 02:31:51PM +0000, Kani, Toshi wrote:
> > > So, we can add step 2 on top of this patch.
> > >  1. Clear pud/pmd entry.
> > >  2. System-wide TLB flush <-- TO BE ADDED BY NEW PATCH
> > >  3. Free its underlying pmd/pte page.
> >
> > This still lacks the page-table synchronization and will thus not fix
> > the BUG_ON being triggered.
>
> The BUG_ON issue is specific to PAE, which syncs at the pmd level.
> x86/64 does not have this issue since it syncs at the pgd or p4d level.
>
> > > We do not need to revert this patch. We can make the above change I
> > > mentioned.
> >
> > Please note that we are not in the merge window anymore and that any fix
> > needs to be simple and obviously correct.
>
> Understood. Changing the x86/32 sync point is risky. So, I am going to
> revert the free page handling for PAE.

Will this affect the existence of pmd_free_pte_page() and
pud_free_pmd_page(), or their parameters? I'm asking because I have a
similar change for arm64 and am ready to send v9 patches.

I'm thinking of sharing my v9 patches in any case. If you are going to
do TLB invalidation within these APIs, my first patch will help.

> Thanks,
> -Toshi

Chintan
On Mon, 2018-04-30 at 13:00 +0530, Chintan Pandya wrote:
> On 4/29/2018 2:24 AM, Kani, Toshi wrote:
> > On Sat, 2018-04-28 at 11:02 +0200, joro@8bytes.org wrote:
> > > On Fri, Apr 27, 2018 at 02:31:51PM +0000, Kani, Toshi wrote:
> > > > So, we can add step 2 on top of this patch.
> > > >  1. Clear pud/pmd entry.
> > > >  2. System-wide TLB flush <-- TO BE ADDED BY NEW PATCH
> > > >  3. Free its underlying pmd/pte page.
> > >
> > > This still lacks the page-table synchronization and will thus not fix
> > > the BUG_ON being triggered.
> >
> > The BUG_ON issue is specific to PAE, which syncs at the pmd level.
> > x86/64 does not have this issue since it syncs at the pgd or p4d level.
> >
> > > > We do not need to revert this patch. We can make the above change I
> > > > mentioned.
> > >
> > > Please note that we are not in the merge window anymore and that any fix
> > > needs to be simple and obviously correct.
> >
> > Understood. Changing the x86/32 sync point is risky. So, I am going to
> > revert the free page handling for PAE.
>
> Will this affect the existence of pmd_free_pte_page() and
> pud_free_pmd_page(), or their parameters? I'm asking because I have a
> similar change for arm64 and am ready to send v9 patches.

No, it won't. The change is only to the x86 side.

> I'm thinking of sharing my v9 patches in any case. If you are going to
> do TLB invalidation within these APIs, my first patch will help.

I will make my change on top of your v9 1/4 patch so that we can avoid
a merge conflict.

Thanks,
-Toshi
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 1eed7ed518e6..34cda7e0551b 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -712,7 +712,22 @@ int pmd_clear_huge(pmd_t *pmd)
  */
 int pud_free_pmd_page(pud_t *pud)
 {
-	return pud_none(*pud);
+	pmd_t *pmd;
+	int i;
+
+	if (pud_none(*pud))
+		return 1;
+
+	pmd = (pmd_t *)pud_page_vaddr(*pud);
+
+	for (i = 0; i < PTRS_PER_PMD; i++)
+		if (!pmd_free_pte_page(&pmd[i]))
+			return 0;
+
+	pud_clear(pud);
+	free_page((unsigned long)pmd);
+
+	return 1;
 }
 
 /**
@@ -724,6 +739,15 @@ int pud_free_pmd_page(pud_t *pud)
  */
 int pmd_free_pte_page(pmd_t *pmd)
 {
-	return pmd_none(*pmd);
+	pte_t *pte;
+
+	if (pmd_none(*pmd))
+		return 1;
+
+	pte = (pte_t *)pmd_page_vaddr(*pmd);
+	pmd_clear(pmd);
+	free_page((unsigned long)pte);
+
+	return 1;
 }
 #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
Implement pud_free_pmd_page() and pmd_free_pte_page() on x86, which
clear a given pud/pmd entry and free up lower level page table(s).
Address range associated with the pud/pmd entry must have been purged
by INVLPG.

fixes: e61ce6ade404e ("mm: change ioremap to set up huge I/O mappings")
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
---
 arch/x86/mm/pgtable.c | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)
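[Editor's note: for context on where these helpers are called from, below is a rough sketch of the pmd-level caller in lib/ioremap.c, as wired up by patch 1/2 of this series (b6bdb7517c3d). It is shown for illustration; the exact upstream code may differ slightly.]

static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
		unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
{
	pmd_t *pmd;
	unsigned long next;

	phys_addr -= addr;
	pmd = pmd_alloc(&init_mm, pud, addr);
	if (!pmd)
		return -ENOMEM;
	do {
		next = pmd_addr_end(addr, end);

		/*
		 * A pmd-sized huge mapping is attempted only when the old
		 * pte page under this pmd can be freed, i.e. when
		 * pmd_free_pte_page() returns 1.
		 */
		if (ioremap_pmd_enabled() &&
		    ((next - addr) == PMD_SIZE) &&
		    IS_ALIGNED(phys_addr + addr, PMD_SIZE) &&
		    pmd_free_pte_page(pmd)) {
			if (pmd_set_huge(pmd, phys_addr + addr, prot))
				continue;
		}

		if (ioremap_pte_range(pmd, addr, next, phys_addr + addr, prot))
			return -ENOMEM;
	} while (pmd++, addr = next, addr != end);
	return 0;
}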