diff mbox series

[02/11] mm/memory: Remove page fault assumption of compound page size

Message ID 20200908195539.25896-3-willy@infradead.org (mailing list archive)
State New, archived
Headers show
Series Remove assumptions of THP size | expand

Commit Message

Matthew Wilcox Sept. 8, 2020, 7:55 p.m. UTC
A compound page in the page cache will not necessarily be of PMD size,
so check explicitly.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/memory.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Comments

Kirill A . Shutemov Sept. 9, 2020, 2:29 p.m. UTC | #1
On Tue, Sep 08, 2020 at 08:55:29PM +0100, Matthew Wilcox (Oracle) wrote:
> A compound page in the page cache will not necessarily be of PMD size,
> so check explicitly.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  mm/memory.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 602f4283122f..4b35b4e71e64 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3562,13 +3562,14 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
>  	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>  	pmd_t entry;
>  	int i;
> -	vm_fault_t ret;
> +	vm_fault_t ret = VM_FAULT_FALLBACK;
>  
>  	if (!transhuge_vma_suitable(vma, haddr))
> -		return VM_FAULT_FALLBACK;
> +		return ret;
>  
> -	ret = VM_FAULT_FALLBACK;
>  	page = compound_head(page);
> +	if (page_order(page) != HPAGE_PMD_ORDER)
> +		return ret;

Maybe also VM_BUG_ON_PAGE(page_order(page) > HPAGE_PMD_ORDER, page)?
Just in case.

>  
>  	/*
>  	 * Archs like ppc64 need additonal space to store information
> -- 
> 2.28.0
>
Matthew Wilcox Sept. 9, 2020, 2:50 p.m. UTC | #2
On Wed, Sep 09, 2020 at 05:29:04PM +0300, Kirill A. Shutemov wrote:
> On Tue, Sep 08, 2020 at 08:55:29PM +0100, Matthew Wilcox (Oracle) wrote:
> > A compound page in the page cache will not necessarily be of PMD size,
> > so check explicitly.
> > 
> > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > ---
> >  mm/memory.c | 7 ++++---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 602f4283122f..4b35b4e71e64 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3562,13 +3562,14 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
> >  	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> >  	pmd_t entry;
> >  	int i;
> > -	vm_fault_t ret;
> > +	vm_fault_t ret = VM_FAULT_FALLBACK;
> >  
> >  	if (!transhuge_vma_suitable(vma, haddr))
> > -		return VM_FAULT_FALLBACK;
> > +		return ret;
> >  
> > -	ret = VM_FAULT_FALLBACK;
> >  	page = compound_head(page);
> > +	if (page_order(page) != HPAGE_PMD_ORDER)
> > +		return ret;
> 
> Maybe also VM_BUG_ON_PAGE(page_order(page) > HPAGE_PMD_ORDER, page)?
> Just in case.

In the patch where I actually start creating THPs, I limit the order to
HPAGE_PMD_ORDER, so we're not going to see this today.  At some point
in the future, I can imagine that we allow THPs larger than PMD size,
and what we'd want alloc_set_pte() to look like is:

	if (pud_none(*vmf->pud) && PageTransCompound(page)) {
		ret = do_set_pud(vmf, page);
		if (ret != VM_FAULT_FALLBACK)
			return ret;
	}
	if (pmd_none(*vmf->pmd) && PageTransCompound(page)) {
		ret = do_set_pmd(vmf, page);
		if (ret != VM_FAULT_FALLBACK)
			return ret;
	}

Once we're in that situation, in do_set_pmd(), we'd want to figure out
which sub-page of the >PMD-sized page to insert.  But I don't want to
write code for that now.

So, what's the right approach if somebody does call alloc_set_pte()
with a >PMD sized page?  It's not exported, so the only two ways to get
it called with a >PMD sized page is to (1) persuade filemap_map_pages()
to call it, which means putting it in the page cache or (2) return it
from vm_ops->fault.  If someone actually does that (an interesting
device driver, perhaps), I don't think hitting it with a BUG is the
right response.  I think it should actually be to map the right PMD-sized
chunk of the page, but we don't even do that today -- we map the first
PMD-sized chunk of the page.

With this patch, we'll simply map the appropriate PAGE_SIZE chunk at the
requested address.  So this would be a bugfix for such a demented driver.
At some point, it'd be nice to handle this with a PMD, but I don't want
to write that code without a test-case.  We could probably simulate
it with the page cache THP code and be super-aggressive about creating
order-10 pages ... but this is feeling more and more out of scope for
this patch set, which today hit 99 patches.
Kirill A . Shutemov Sept. 11, 2020, 2:51 p.m. UTC | #3
On Wed, Sep 09, 2020 at 03:50:35PM +0100, Matthew Wilcox wrote:
> On Wed, Sep 09, 2020 at 05:29:04PM +0300, Kirill A. Shutemov wrote:
> > On Tue, Sep 08, 2020 at 08:55:29PM +0100, Matthew Wilcox (Oracle) wrote:
> > > A compound page in the page cache will not necessarily be of PMD size,
> > > so check explicitly.
> > > 
> > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > > ---
> > >  mm/memory.c | 7 ++++---
> > >  1 file changed, 4 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 602f4283122f..4b35b4e71e64 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -3562,13 +3562,14 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
> > >  	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> > >  	pmd_t entry;
> > >  	int i;
> > > -	vm_fault_t ret;
> > > +	vm_fault_t ret = VM_FAULT_FALLBACK;
> > >  
> > >  	if (!transhuge_vma_suitable(vma, haddr))
> > > -		return VM_FAULT_FALLBACK;
> > > +		return ret;
> > >  
> > > -	ret = VM_FAULT_FALLBACK;
> > >  	page = compound_head(page);
> > > +	if (page_order(page) != HPAGE_PMD_ORDER)
> > > +		return ret;
> > 
> > Maybe also VM_BUG_ON_PAGE(page_order(page) > HPAGE_PMD_ORDER, page)?
> > Just in case.
> 
> In the patch where I actually start creating THPs, I limit the order to
> HPAGE_PMD_ORDER, so we're not going to see this today.  At some point
> in the future, I can imagine that we allow THPs larger than PMD size,
> and what we'd want alloc_set_pte() to look like is:
> 
> 	if (pud_none(*vmf->pud) && PageTransCompound(page)) {
> 		ret = do_set_pud(vmf, page);
> 		if (ret != VM_FAULT_FALLBACK)
> 			return ret;
> 	}
> 	if (pmd_none(*vmf->pmd) && PageTransCompound(page)) {
> 		ret = do_set_pmd(vmf, page);
> 		if (ret != VM_FAULT_FALLBACK)
> 			return ret;
> 	}
> 
> Once we're in that situation, in do_set_pmd(), we'd want to figure out
> which sub-page of the >PMD-sized page to insert.  But I don't want to
> write code for that now.
> 
> So, what's the right approach if somebody does call alloc_set_pte()
> with a >PMD sized page?  It's not exported, so the only two ways to get
> it called with a >PMD sized page is to (1) persuade filemap_map_pages()
> to call it, which means putting it in the page cache or (2) return it
> from vm_ops->fault.  If someone actually does that (an interesting
> device driver, perhaps), I don't think hitting it with a BUG is the
> right response.  I think it should actually be to map the right PMD-sized
> chunk of the page, but we don't even do that today -- we map the first
> PMD-sized chunk of the page.
> 
> With this patch, we'll simply map the appropriate PAGE_SIZE chunk at the
> requested address.  So this would be a bugfix for such a demented driver.
> At some point, it'd be nice to handle this with a PMD, but I don't want
> to write that code without a test-case.  We could probably simulate
> it with the page cache THP code and be super-aggressive about creating
> order-10 pages ... but this is feeling more and more out of scope for
> this patch set, which today hit 99 patches.

Okay, fair enough. VM_BUG is too strong reaction here as we can make a
reasonable fallback. Maybe WARN_ON_ONCE() would make sense? It would
indicate the place that has to be adjust once we would get abouve
PMD-order pages.
diff mbox series

Patch

diff --git a/mm/memory.c b/mm/memory.c
index 602f4283122f..4b35b4e71e64 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3562,13 +3562,14 @@  static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
 	pmd_t entry;
 	int i;
-	vm_fault_t ret;
+	vm_fault_t ret = VM_FAULT_FALLBACK;
 
 	if (!transhuge_vma_suitable(vma, haddr))
-		return VM_FAULT_FALLBACK;
+		return ret;
 
-	ret = VM_FAULT_FALLBACK;
 	page = compound_head(page);
+	if (page_order(page) != HPAGE_PMD_ORDER)
+		return ret;
 
 	/*
 	 * Archs like ppc64 need additonal space to store information