
[resend] powerpc/64s: fix page table fragment refcount race vs speculative references

Message ID 20180727114817.27190-1-npiggin@gmail.com (mailing list archive)
State New, archived
Series [resend] powerpc/64s: fix page table fragment refcount race vs speculative references

Commit Message

Nicholas Piggin July 27, 2018, 11:48 a.m. UTC
The page table fragment allocator uses the main page refcount racily
with respect to speculative references. A customer observed a BUG due
to page table page refcount underflow in the fragment allocator. This
can be caused by the fragment allocator set_page_count stomping on a
speculative reference, and then the speculative failure handler
decrements the new reference, and the underflow eventually pops when
the page tables are freed.
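
[Illustration: a minimal userspace sketch of the interleaving described
above -- not kernel code; PTE_FRAG_NR = 16 is an assumed value.]

#include <stdatomic.h>
#include <stdio.h>

#define PTE_FRAG_NR 16

static atomic_int refcount;

int main(void)
{
	atomic_store(&refcount, 1);		/* allocator: fresh page, count == 1 */
	atomic_fetch_add(&refcount, 1);		/* walker: speculative reference taken */
	atomic_store(&refcount, PTE_FRAG_NR);	/* allocator: set_page_count() stomps
						   the speculative reference */
	atomic_fetch_sub(&refcount, 1);		/* walker: revalidation fails, drops ref */

	/* PTE_FRAG_NR - 1 references now back PTE_FRAG_NR fragment puts,
	 * so the final put underflows. */
	printf("count %d for %d fragments\n",
	       atomic_load(&refcount), PTE_FRAG_NR);
	return 0;
}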

Fix this by using a dedicated field in the struct page for the page
table fragment allocator.

Fixes: 5c1f6ee9a31c ("powerpc: Reduce PTE table memory wastage")
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---

Any objection to the struct page change to grab the arch specific
page table page word for powerpc to use? If not, then this should
go via powerpc tree because it's inconsequential for core mm.

Thanks,
Nick

 arch/powerpc/mm/mmu_context_book3s64.c |  8 ++++----
 arch/powerpc/mm/pgtable-book3s64.c     | 17 +++++++++++------
 include/linux/mm_types.h               |  5 ++++-
 3 files changed, 19 insertions(+), 11 deletions(-)

Comments

Matthew Wilcox July 27, 2018, 1:41 p.m. UTC | #1
On Fri, Jul 27, 2018 at 09:48:17PM +1000, Nicholas Piggin wrote:
> The page table fragment allocator uses the main page refcount racily
> with respect to speculative references. A customer observed a BUG due
> to page table page refcount underflow in the fragment allocator. This
> can be caused by the fragment allocator set_page_count stomping on a
> speculative reference, and then the speculative failure handler
> decrements the new reference, and the underflow eventually pops when
> the page tables are freed.

Oof.  Can't you fix this instead by using page_ref_add() instead of
set_page_count()?
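
[Illustration: Matthew's suggestion sketched against the allocator -- an
untested assumption of what it would look like, not a proposed patch.]

	/*
	 * Instead of overwriting the count (and any concurrent speculative
	 * reference) with set_page_count(page, PTE_FRAG_NR), add the
	 * outstanding fragment references on top of the allocation's ref:
	 */
	if (likely(!mm->context.pte_frag)) {
		page_ref_add(page, PTE_FRAG_NR - 1);
		mm->context.pte_frag = ret + PTE_FRAG_SIZE;
	}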

> Any objection to the struct page change to grab the arch specific
> page table page word for powerpc to use? If not, then this should
> go via powerpc tree because it's inconsequential for core mm.

I want (eventually) to get to the point where every struct page carries
a pointer to the struct mm that it belongs to.  It's good for debugging
as well as handling memory errors in page tables.
Nicholas Piggin July 27, 2018, 2:29 p.m. UTC | #2
On Fri, 27 Jul 2018 06:41:56 -0700
Matthew Wilcox <willy@infradead.org> wrote:

> On Fri, Jul 27, 2018 at 09:48:17PM +1000, Nicholas Piggin wrote:
> > The page table fragment allocator uses the main page refcount racily
> > with respect to speculative references. A customer observed a BUG due
> > to page table page refcount underflow in the fragment allocator. This
> > can be caused by the fragment allocator set_page_count stomping on a
> > speculative reference, and then the speculative failure handler
> > decrements the new reference, and the underflow eventually pops when
> > the page tables are freed.  
> 
> Oof.  Can't you fix this instead by using page_ref_add() instead of
> set_page_count()?

It's ugly doing it that way. The problem is we have a page table
destructor and that would be missed if the spec ref was the last
put. In practice with RCU page table freeing maybe you can say
there will be no spec ref there (unless something changes), but
still it just seems much simpler doing this and avoiding any
complexity or reliance on other synchronization.
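
[Illustration: the two put paths Nick is contrasting, simplified from the
pre-patch code in the diff below; speculative_failure() is a hypothetical
stand-in for the speculative walker's failure handling.]

/* Fragment allocator's put: the destructor only runs when this path
 * drops the final reference. */
void pte_fragment_free(unsigned long *table, int kernel)
{
	struct page *page = virt_to_page(table);

	if (put_page_testzero(page)) {
		if (!kernel)
			pgtable_page_dtor(page);
		free_unref_page(page);
	}
}

/* Speculative walker's failure path: a bare put_page(). With page_ref_add()
 * this could be the last reference, freeing the page through the generic
 * path with pgtable_page_dtor() never called. */
static void speculative_failure(struct page *page)
{
	put_page(page);
}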

> 
> > Any objection to the struct page change to grab the arch specific
> > page table page word for powerpc to use? If not, then this should
> > go via powerpc tree because it's inconsequential for core mm.  
> 
> I want (eventually) to get to the point where every struct page carries
> a pointer to the struct mm that it belongs to.  It's good for debugging
> as well as handling memory errors in page tables.

That doesn't seem like it should be a problem; there are some spare
words there for arch independent users.

Thanks,
Nick
Matthew Wilcox July 27, 2018, 3:38 p.m. UTC | #3
On Sat, Jul 28, 2018 at 12:29:06AM +1000, Nicholas Piggin wrote:
> On Fri, 27 Jul 2018 06:41:56 -0700
> Matthew Wilcox <willy@infradead.org> wrote:
> 
> > On Fri, Jul 27, 2018 at 09:48:17PM +1000, Nicholas Piggin wrote:
> > > The page table fragment allocator uses the main page refcount racily
> > > with respect to speculative references. A customer observed a BUG due
> > > to page table page refcount underflow in the fragment allocator. This
> > > can be caused by the fragment allocator set_page_count stomping on a
> > > speculative reference, and then the speculative failure handler
> > > decrements the new reference, and the underflow eventually pops when
> > > the page tables are freed.  
> > 
> > Oof.  Can't you fix this instead by using page_ref_add() instead of
> > set_page_count()?
> 
> It's ugly doing it that way. The problem is we have a page table
> destructor and that would be missed if the spec ref was the last
> put. In practice with RCU page table freeing maybe you can say
> there will be no spec ref there (unless something changes), but
> still it just seems much simpler doing this and avoiding any
> complexity or reliance on other synchronization.

I don't want to rely on the speculative reference not happening by the
time the page table is torn down; that's way too black-magic for me.
Another possibility would be to use, say, the top 16 bits of the
atomic for your counter and call the dtor once the atomic is below 64k.
I'm also thinking about overhauling the dtor system so it's not tied to
compound pages; anyone with a bit in page_type would be able to use it.
That way you'd always get your dtor called, even if the speculative
reference was the last one.
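
[Illustration: a rough sketch of that split-counter idea -- an assumption
about the proposal, not a worked-out patch; PT_FRAG_SHIFT and the helper
names are invented.]

#define PT_FRAG_SHIFT	16
#define PT_FRAG_UNIT	(1 << PT_FRAG_SHIFT)

/* Fragments live in the top bits of the refcount; speculative references
 * only move the low 16 bits, so the set/stomp race disappears. */
static void pt_frag_init(struct page *page, int nr_frags)
{
	page_ref_add(page, nr_frags * PT_FRAG_UNIT);
}

static void pt_frag_free(struct page *page)
{
	if (page_ref_sub_return(page, PT_FRAG_UNIT) < PT_FRAG_UNIT) {
		/* last fragment gone; only sub-64k (speculative)
		 * references remain, so the dtor can run now */
		pgtable_page_dtor(page);
		put_page(page);		/* drop the base allocation ref */
	}
}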

> > > Any objection to the struct page change to grab the arch specific
> > > page table page word for powerpc to use? If not, then this should
> > > go via powerpc tree because it's inconsequential for core mm.  
> > 
> > I want (eventually) to get to the point where every struct page carries
> > a pointer to the struct mm that it belongs to.  It's good for debugging
> > as well as handling memory errors in page tables.
> 
> That doesn't seem like it should be a problem; there are some spare
> words there for arch independent users.

Could you take one of the spare words instead then?  My intent was to
just take the 'x86 pgds only' comment off that member.  _pt_pad_2 looks
ideal because it'll be initialised to 0 and you'll return it to 0 by
the time you're done.
Nicholas Piggin July 27, 2018, 4:32 p.m. UTC | #4
On Fri, 27 Jul 2018 08:38:35 -0700
Matthew Wilcox <willy@infradead.org> wrote:

> On Sat, Jul 28, 2018 at 12:29:06AM +1000, Nicholas Piggin wrote:
> > On Fri, 27 Jul 2018 06:41:56 -0700
> > Matthew Wilcox <willy@infradead.org> wrote:
> >   
> > > On Fri, Jul 27, 2018 at 09:48:17PM +1000, Nicholas Piggin wrote:  
> > > > The page table fragment allocator uses the main page refcount racily
> > > > with respect to speculative references. A customer observed a BUG due
> > > > to page table page refcount underflow in the fragment allocator. This
> > > > can be caused by the fragment allocator set_page_count stomping on a
> > > > speculative reference, and then the speculative failure handler
> > > > decrements the new reference, and the underflow eventually pops when
> > > > the page tables are freed.    
> > > 
> > > Oof.  Can't you fix this instead by using page_ref_add() instead of
> > > set_page_count()?  
> > 
> > It's ugly doing it that way. The problem is we have a page table
> > destructor and that would be missed if the spec ref was the last
> > put. In practice with RCU page table freeing maybe you can say
> > there will be no spec ref there (unless something changes), but
> > still it just seems much simpler doing this and avoiding any
> > complexity or reliance on other synchronization.
> 
> I don't want to rely on the speculative reference not happening by the
> time the page table is torn down; that's way too black-magic for me.
> Another possibility would be to use, say, the top 16 bits of the
> atomic for your counter and call the dtor once the atomic is below 64k.
> I'm also thinking about overhauling the dtor system so it's not tied to
> compound pages; anyone with a bit in page_type would be able to use it.
> That way you'd always get your dtor called, even if the speculative
> reference was the last one.

Yeah we could look at doing either of those if necessary.

> 
> > > > Any objection to the struct page change to grab the arch specific
> > > > page table page word for powerpc to use? If not, then this should
> > > > go via powerpc tree because it's inconsequential for core mm.    
> > > 
> > > I want (eventually) to get to the point where every struct page carries
> > > a pointer to the struct mm that it belongs to.  It's good for debugging
> > > as well as handling memory errors in page tables.  
> > 
> > That doesn't seem like it should be a problem; there are some spare
> > words there for arch independent users.  
> 
> Could you take one of the spare words instead then?  My intent was to
> just take the 'x86 pgds only' comment off that member.  _pt_pad_2 looks
> ideal because it'll be initialised to 0 and you'll return it to 0 by
> the time you're done.

It doesn't matter for powerpc where the atomic_t goes, so I'm fine with
moving it. But could you juggle the fields with your patch instead? I
thought it would be nice to use this field, which has already been
tested on x86 not to overlap with any other data, for a bug fix
that'll have to be widely backported.

Thanks,
Nick
Michael Ellerman July 31, 2018, 11:42 a.m. UTC | #5
Nicholas Piggin <npiggin@gmail.com> writes:
> On Fri, 27 Jul 2018 08:38:35 -0700
> Matthew Wilcox <willy@infradead.org> wrote:
>> On Sat, Jul 28, 2018 at 12:29:06AM +1000, Nicholas Piggin wrote:
>> > On Fri, 27 Jul 2018 06:41:56 -0700
>> > Matthew Wilcox <willy@infradead.org> wrote:
>> > > On Fri, Jul 27, 2018 at 09:48:17PM +1000, Nicholas Piggin wrote:  
>> > > > The page table fragment allocator uses the main page refcount racily
>> > > > with respect to speculative references. A customer observed a BUG due
>> > > > to page table page refcount underflow in the fragment allocator. This
>> > > > can be caused by the fragment allocator set_page_count stomping on a
>> > > > speculative reference, and then the speculative failure handler
>> > > > decrements the new reference, and the underflow eventually pops when
>> > > > the page tables are freed.    
>> > > 
>> > > Oof.  Can't you fix this instead by using page_ref_add() instead of
>> > > set_page_count()?  
>> > 
>> > It's ugly doing it that way. The problem is we have a page table
>> > destructor and that would be missed if the spec ref was the last
>> > put. In practice with RCU page table freeing maybe you can say
>> > there will be no spec ref there (unless something changes), but
>> > still it just seems much simpler doing this and avoiding any
>> > complexity or reliance on other synchronization.
>> 
>> I don't want to rely on the speculative reference not happening by the
>> time the page table is torn down; that's way too black-magic for me.
>> Another possibility would be to use, say, the top 16 bits of the
>> atomic for your counter and call the dtor once the atomic is below 64k.
>> I'm also thinking about overhauling the dtor system so it's not tied to
>> compound pages; anyone with a bit in page_type would be able to use it.
>> That way you'd always get your dtor called, even if the speculative
>> reference was the last one.
>
> Yeah we could look at doing either of those if necessary.
>
>> > > > Any objection to the struct page change to grab the arch specific
>> > > > page table page word for powerpc to use? If not, then this should
>> > > > go via powerpc tree because it's inconsequential for core mm.    
>> > > 
>> > > I want (eventually) to get to the point where every struct page carries
>> > > a pointer to the struct mm that it belongs to.  It's good for debugging
>> > > as well as handling memory errors in page tables.  
>> > 
>> > That doesn't seem like it should be a problem; there are some spare
>> > words there for arch independent users.  
>> 
>> Could you take one of the spare words instead then?  My intent was to
>> just take the 'x86 pgds only' comment off that member.  _pt_pad_2 looks
>> ideal because it'll be initialised to 0 and you'll return it to 0 by
>> the time you're done.
>
> It doesn't matter for powerpc where the atomic_t goes, so I'm fine with
> moving it. But could you juggle the fields with your patch instead? I
> thought it would be nice to use this field, which has already been
> tested on x86 not to overlap with any other data, for a bug fix
> that'll have to be widely backported.

Can we come to a conclusion on this one?

As far as backporting goes pt_mm is new in 4.18-rc so the patch will
need to be manually backported anyway. But I agree with Nick we'd rather
use a slot that is known to be free for arch use.

cheers
Nicholas Piggin Aug. 1, 2018, 2:45 a.m. UTC | #6
On Tue, 31 Jul 2018 21:42:22 +1000
Michael Ellerman <mpe@ellerman.id.au> wrote:

> Nicholas Piggin <npiggin@gmail.com> writes:
> > On Fri, 27 Jul 2018 08:38:35 -0700
> > Matthew Wilcox <willy@infradead.org> wrote:  
> >> On Sat, Jul 28, 2018 at 12:29:06AM +1000, Nicholas Piggin wrote:  
> >> > On Fri, 27 Jul 2018 06:41:56 -0700
> >> > Matthew Wilcox <willy@infradead.org> wrote:  
> >> > > On Fri, Jul 27, 2018 at 09:48:17PM +1000, Nicholas Piggin wrote:    
> >> > > > The page table fragment allocator uses the main page refcount racily
> >> > > > with respect to speculative references. A customer observed a BUG due
> >> > > > to page table page refcount underflow in the fragment allocator. This
> >> > > > can be caused by the fragment allocator set_page_count stomping on a
> >> > > > speculative reference, and then the speculative failure handler
> >> > > > decrements the new reference, and the underflow eventually pops when
> >> > > > the page tables are freed.      
> >> > > 
> >> > > Oof.  Can't you fix this instead by using page_ref_add() instead of
> >> > > set_page_count()?    
> >> > 
> >> > It's ugly doing it that way. The problem is we have a page table
> >> > destructor and that would be missed if the spec ref was the last
> >> > put. In practice with RCU page table freeing maybe you can say
> >> > there will be no spec ref there (unless something changes), but
> >> > still it just seems much simpler doing this and avoiding any
> >> > complexity or reliance on other synchronization.
> >> 
> >> I don't want to rely on the speculative reference not happening by the
> >> time the page table is torn down; that's way too black-magic for me.
> >> Another possibility would be to use, say, the top 16 bits of the
> >> atomic for your counter and call the dtor once the atomic is below 64k.
> >> I'm also thinking about overhauling the dtor system so it's not tied to
> >> compound pages; anyone with a bit in page_type would be able to use it.
> >> That way you'd always get your dtor called, even if the speculative
> >> reference was the last one.  
> >
> > Yeah we could look at doing either of those if necessary.
> >  
> >> > > > Any objection to the struct page change to grab the arch specific
> >> > > > page table page word for powerpc to use? If not, then this should
> >> > > > go via powerpc tree because it's inconsequential for core mm.      
> >> > > 
> >> > > I want (eventually) to get to the point where every struct page carries
> >> > > a pointer to the struct mm that it belongs to.  It's good for debugging
> >> > > as well as handling memory errors in page tables.    
> >> > 
> >> > That doesn't seem like it should be a problem; there are some spare
> >> > words there for arch independent users.    
> >> 
> >> Could you take one of the spare words instead then?  My intent was to
> >> just take the 'x86 pgds only' comment off that member.  _pt_pad_2 looks
> >> ideal because it'll be initialised to 0 and you'll return it to 0 by
> >> the time you're done.  
> >
> > It doesn't matter for powerpc where the atomic_t goes, so I'm fine with
> > moving it. But could you juggle the fields with your patch instead? I
> > thought it would be nice to use this field, which has already been
> > tested on x86 not to overlap with any other data, for a bug fix
> > that'll have to be widely backported.
> 
> Can we come to a conclusion on this one?
> 
> As far as backporting goes pt_mm is new in 4.18-rc so the patch will
> need to be manually backported anyway. But I agree with Nick we'd rather
> use a slot that is known to be free for arch use.

Let's go with that for now. I'd really rather not fix this obscure
bug by introducing something even worse. I'll volunteer to change
the powerpc page table cache code if we can't find any more space in
the struct page.

So what does mapping get used for by page table pages? 4c21e2f2441
("[PATCH] mm: split page table lock") adds that page->mapping = NULL
in pte_lock_deinit, but I don't see why because page->mapping is
never used anywhere else by that patch. Maybe a previous version
of that patch used mapping rather than private?

Thanks,
Nick
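
[For reference: the code Nick is asking about, reconstructed from the old
split-ptlock macros -- the exact form is from memory, so treat it as
approximate.]

/* include/linux/mm.h after 4c21e2f2441 ("[PATCH] mm: split page table lock") */
#define __pte_lockptr(page)	&((page)->ptl)
#define pte_lock_init(_page)	do {					\
	spin_lock_init(__pte_lockptr(_page));				\
} while (0)
#define pte_lock_deinit(page)	((page)->mapping = NULL)
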
Michael Ellerman Aug. 8, 2018, 2:26 p.m. UTC | #7
On Fri, 2018-07-27 at 11:48:17 UTC, Nicholas Piggin wrote:
> The page table fragment allocator uses the main page refcount racily
> with respect to speculative references. A customer observed a BUG due
> to page table page refcount underflow in the fragment allocator. This
> can be caused by the fragment allocator set_page_count stomping on a
> speculative reference, and then the speculative failure handler
> decrements the new reference, and the underflow eventually pops when
> the page tables are freed.
> 
> Fix this by using a dedicated field in the struct page for the page
> table fragment allocator.
> 
> Fixes: 5c1f6ee9a31c ("powerpc: Reduce PTE table memory wastage")
> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/4231aba000f5a4583dd9f67057aadb

cheers

Patch

diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
index f3d4b4a0e561..3bb5cec03d1f 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -200,9 +200,9 @@  static void pte_frag_destroy(void *pte_frag)
 	/* drop all the pending references */
 	count = ((unsigned long)pte_frag & ~PAGE_MASK) >> PTE_FRAG_SIZE_SHIFT;
 	/* We allow PTE_FRAG_NR fragments from a PTE page */
-	if (page_ref_sub_and_test(page, PTE_FRAG_NR - count)) {
+	if (atomic_sub_and_test(PTE_FRAG_NR - count, &page->pt_frag_refcount)) {
 		pgtable_page_dtor(page);
-		free_unref_page(page);
+		__free_page(page);
 	}
 }
 
@@ -215,9 +215,9 @@  static void pmd_frag_destroy(void *pmd_frag)
 	/* drop all the pending references */
 	count = ((unsigned long)pmd_frag & ~PAGE_MASK) >> PMD_FRAG_SIZE_SHIFT;
 	/* We allow PTE_FRAG_NR fragments from a PTE page */
-	if (page_ref_sub_and_test(page, PMD_FRAG_NR - count)) {
+	if (atomic_sub_and_test(PMD_FRAG_NR - count, &page->pt_frag_refcount)) {
 		pgtable_pmd_page_dtor(page);
-		free_unref_page(page);
+		__free_page(page);
 	}
 }
 
diff --git a/arch/powerpc/mm/pgtable-book3s64.c b/arch/powerpc/mm/pgtable-book3s64.c
index 4afbfbb64bfd..78d0b3d5ebad 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/powerpc/mm/pgtable-book3s64.c
@@ -270,6 +270,8 @@  static pmd_t *__alloc_for_pmdcache(struct mm_struct *mm)
 		return NULL;
 	}
 
+	atomic_set(&page->pt_frag_refcount, 1);
+
 	ret = page_address(page);
 	/*
 	 * if we support only one fragment just return the
@@ -285,7 +287,7 @@  static pmd_t *__alloc_for_pmdcache(struct mm_struct *mm)
 	 * count.
 	 */
 	if (likely(!mm->context.pmd_frag)) {
-		set_page_count(page, PMD_FRAG_NR);
+		atomic_set(&page->pt_frag_refcount, PMD_FRAG_NR);
 		mm->context.pmd_frag = ret + PMD_FRAG_SIZE;
 	}
 	spin_unlock(&mm->page_table_lock);
@@ -308,9 +310,10 @@  void pmd_fragment_free(unsigned long *pmd)
 {
 	struct page *page = virt_to_page(pmd);
 
-	if (put_page_testzero(page)) {
+	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
+	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
 		pgtable_pmd_page_dtor(page);
-		free_unref_page(page);
+		__free_page(page);
 	}
 }
 
@@ -352,6 +355,7 @@  static pte_t *__alloc_for_ptecache(struct mm_struct *mm, int kernel)
 			return NULL;
 	}
 
+	atomic_set(&page->pt_frag_refcount, 1);
 
 	ret = page_address(page);
 	/*
@@ -367,7 +371,7 @@  static pte_t *__alloc_for_ptecache(struct mm_struct *mm, int kernel)
 	 * count.
 	 */
 	if (likely(!mm->context.pte_frag)) {
-		set_page_count(page, PTE_FRAG_NR);
+		atomic_set(&page->pt_frag_refcount, PTE_FRAG_NR);
 		mm->context.pte_frag = ret + PTE_FRAG_SIZE;
 	}
 	spin_unlock(&mm->page_table_lock);
@@ -390,10 +394,11 @@  void pte_fragment_free(unsigned long *table, int kernel)
 {
 	struct page *page = virt_to_page(table);
 
-	if (put_page_testzero(page)) {
+	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
+	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
 		if (!kernel)
 			pgtable_page_dtor(page);
-		free_unref_page(page);
+		__free_page(page);
 	}
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 99ce070e7dcb..22651e124071 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -139,7 +139,10 @@  struct page {
 			unsigned long _pt_pad_1;	/* compound_head */
 			pgtable_t pmd_huge_pte; /* protected by page->ptl */
 			unsigned long _pt_pad_2;	/* mapping */
-			struct mm_struct *pt_mm;	/* x86 pgds only */
+			union {
+				struct mm_struct *pt_mm; /* x86 pgds only */
+				atomic_t pt_frag_refcount; /* powerpc */
+			};
 #if ALLOC_SPLIT_PTLOCKS
 			spinlock_t *ptl;
 #else