[v2] mm/memory: Don't require head page for do_set_pmd()

Message ID 20240611153216.2794513-1-abrestic@rivosinc.com (mailing list archive)
State New
Series [v2] mm/memory: Don't require head page for do_set_pmd()

Commit Message

Andrew Bresticker June 11, 2024, 3:32 p.m. UTC
The requirement that the head page be passed to do_set_pmd() was added
in commit ef37b2ea08ac ("mm/memory: page_add_file_rmap() ->
folio_add_file_rmap_[pte|pmd]()") and prevents pmd-mapping in the
finish_fault() and filemap_map_pages() paths if the page to be inserted
is anything but the head page for an otherwise suitable vma and pmd-sized
page.

Fixes: ef37b2ea08ac ("mm/memory: page_add_file_rmap() -> folio_add_file_rmap_[pte|pmd]()")
Signed-off-by: Andrew Bresticker <abrestic@rivosinc.com>
---
 mm/memory.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

David Hildenbrand June 11, 2024, 3:33 p.m. UTC | #1
On 11.06.24 17:32, Andrew Bresticker wrote:
> The requirement that the head page be passed to do_set_pmd() was added
> in commit ef37b2ea08ac ("mm/memory: page_add_file_rmap() ->
> folio_add_file_rmap_[pte|pmd]()") and prevents pmd-mapping in the
> finish_fault() and filemap_map_pages() paths if the page to be inserted
> is anything but the head page for an otherwise suitable vma and pmd-sized
> page.
> 
> Fixes: ef37b2ea08ac ("mm/memory: page_add_file_rmap() -> folio_add_file_rmap_[pte|pmd]()")
> Signed-off-by: Andrew Bresticker <abrestic@rivosinc.com>
> ---
>   mm/memory.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 0f47a533014e..a1fce5ddacb3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4614,8 +4614,9 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
>   	if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
>   		return ret;
>   
> -	if (page != &folio->page || folio_order(folio) != HPAGE_PMD_ORDER)
> +	if (folio_order(folio) != HPAGE_PMD_ORDER)
>   		return ret;
> +	page = &folio->page;
>   
>   	/*
>   	 * Just backoff if any subpage of a THP is corrupted otherwise

Acked-by: David Hildenbrand <david@redhat.com>
Matthew Wilcox June 11, 2024, 6:03 p.m. UTC | #2
On Tue, Jun 11, 2024 at 08:32:16AM -0700, Andrew Bresticker wrote:
> -	if (page != &folio->page || folio_order(folio) != HPAGE_PMD_ORDER)
> +	if (folio_order(folio) != HPAGE_PMD_ORDER)
>  		return ret;
> +	page = &folio->page;

This works today, but in about six months time it's going to be a pain.

+	page = folio_page(folio, 0);

is the one which works today and in the future.

Of course, as I'm writing this now I'm thinking we could make this work
for folios which are larger than PMD size, which we currently prohibit
(at least in part because this function doesn't work with folios larger
than PMD size).

	pgoff_t idx = ((haddr - vma->vm_start) / PAGE_SIZE) + vma->vm_pgoff;
	page = folio_file_page(folio, idx);

(there might be a more succinct way to write that, and I might have
messed up the polarity on something).
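
For what it's worth, the open-coded math above is exactly what
linear_page_index() computes, so a rough (and untested) sketch for
do_set_pmd(), assuming vma = vmf->vma as in the current code:

	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
	/* file index of the first page in the PMD range */
	pgoff_t idx = linear_page_index(vma, haddr);

	/* Would also need to check the PMD range lies inside the folio. */
	if (folio_order(folio) < HPAGE_PMD_ORDER)
		return ret;
	page = folio_file_page(folio, idx);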

Worth starting to fix these places that don't work with folios larger
than PMD size?  I think it'd be worth supporting, eg 4MB folios.

Current bandwidth of a gen5 x8 link is 32GB/s, so we can read 8,000
4MB pages/second at a latency of 125us.  My laptop only has a gen4 x2
SSD, so a 4MB folio would take 1ms to read which might be a little
excessive for normal use.  Of course allocating an order 10 folio is
hard, so it's not going to happen terribly often anyway.
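
(Checking the arithmetic: 32GB/s / 4MB = 8192 folios/second, i.e.
4MB / 32GB/s = 125us per folio; a gen4 x2 link is ~4GB/s, so
4MB / 4GB/s = 1ms.)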
Andrew Morton June 11, 2024, 6:06 p.m. UTC | #3
On Tue, 11 Jun 2024 17:33:17 +0200 David Hildenbrand <david@redhat.com> wrote:

> On 11.06.24 17:32, Andrew Bresticker wrote:
> > The requirement that the head page be passed to do_set_pmd() was added
> > in commit ef37b2ea08ac ("mm/memory: page_add_file_rmap() ->
> > folio_add_file_rmap_[pte|pmd]()") and prevents pmd-mapping in the
> > finish_fault() and filemap_map_pages() paths if the page to be inserted
> > is anything but the head page for an otherwise suitable vma and pmd-sized
> > page.
> > 
> > Fixes: ef37b2ea08ac ("mm/memory: page_add_file_rmap() -> folio_add_file_rmap_[pte|pmd]()")
> > Signed-off-by: Andrew Bresticker <abrestic@rivosinc.com>
> > ---
> >   mm/memory.c | 3 ++-
> >   1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 0f47a533014e..a1fce5ddacb3 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4614,8 +4614,9 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
> >   	if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
> >   		return ret;
> >   
> > -	if (page != &folio->page || folio_order(folio) != HPAGE_PMD_ORDER)
> > +	if (folio_order(folio) != HPAGE_PMD_ORDER)
> >   		return ret;
> > +	page = &folio->page;
> >   
> >   	/*
> >   	 * Just backoff if any subpage of a THP is corrupted otherwise
> 
> Acked-by: David Hildenbrand <david@redhat.com>

You know what I'm going to ask ;) I'm assuming that the runtime effects
are "small performance optimization" and that "should we backport the
fix" is "no".
Andrew Morton June 11, 2024, 6:21 p.m. UTC | #4
On Tue, 11 Jun 2024 19:03:29 +0100 Matthew Wilcox <willy@infradead.org> wrote:

> On Tue, Jun 11, 2024 at 08:32:16AM -0700, Andrew Bresticker wrote:
> > -	if (page != &folio->page || folio_order(folio) != HPAGE_PMD_ORDER)
> > +	if (folio_order(folio) != HPAGE_PMD_ORDER)
> >  		return ret;
> > +	page = &folio->page;
> 
> This works today, but in about six months time it's going to be a pain.
> 
> +	page = folio_page(folio, 0);
> 
> is the one which works today and in the future.

I was wondering about that.

hp2:/usr/src/25> fgrep "&folio->page" mm/*.c | wc -l
84
hp2:/usr/src/25> fgrep "folio_page(" mm/*.c | wc -l 
35

Should these all be converted?  What's the general rule here?
Matthew Wilcox June 11, 2024, 6:22 p.m. UTC | #5
On Tue, Jun 11, 2024 at 11:06:22AM -0700, Andrew Morton wrote:
> On Tue, 11 Jun 2024 17:33:17 +0200 David Hildenbrand <david@redhat.com> wrote:
> 
> > On 11.06.24 17:32, Andrew Bresticker wrote:
> > > The requirement that the head page be passed to do_set_pmd() was added
> > > in commit ef37b2ea08ac ("mm/memory: page_add_file_rmap() ->
> > > folio_add_file_rmap_[pte|pmd]()") and prevents pmd-mapping in the
> > > finish_fault() and filemap_map_pages() paths if the page to be inserted
> > > is anything but the head page for an otherwise suitable vma and pmd-sized
> > > page.
> > > 
> > > Fixes: ef37b2ea08ac ("mm/memory: page_add_file_rmap() -> folio_add_file_rmap_[pte|pmd]()")
> > > Signed-off-by: Andrew Bresticker <abrestic@rivosinc.com>
> > > ---
> > >   mm/memory.c | 3 ++-
> > >   1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 0f47a533014e..a1fce5ddacb3 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -4614,8 +4614,9 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
> > >   	if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
> > >   		return ret;
> > >   
> > > -	if (page != &folio->page || folio_order(folio) != HPAGE_PMD_ORDER)
> > > +	if (folio_order(folio) != HPAGE_PMD_ORDER)
> > >   		return ret;
> > > +	page = &folio->page;
> > >   
> > >   	/*
> > >   	 * Just backoff if any subpage of a THP is corrupted otherwise
> > 
> > Acked-by: David Hildenbrand <david@redhat.com>
> 
> You know what I'm going to ask ;) I'm assuming that the runtime effects
> are "small performance optimization" and that "should we backport the
> fix" is "no".

We're going to stop using PMDs to map large folios unless the fault is
within the first 4KiB of the PMD.  No idea how many workloads that
affects, but it only needs to be backported as far as v6.8, so we
may as well backport it.
Matthew Wilcox June 11, 2024, 6:38 p.m. UTC | #6
On Tue, Jun 11, 2024 at 11:21:31AM -0700, Andrew Morton wrote:
> On Tue, 11 Jun 2024 19:03:29 +0100 Matthew Wilcox <willy@infradead.org> wrote:
> 
> > On Tue, Jun 11, 2024 at 08:32:16AM -0700, Andrew Bresticker wrote:
> > > -	if (page != &folio->page || folio_order(folio) != HPAGE_PMD_ORDER)
> > > +	if (folio_order(folio) != HPAGE_PMD_ORDER)
> > >  		return ret;
> > > +	page = &folio->page;
> > 
> > This works today, but in about six months time it's going to be a pain.
> > 
> > +	page = folio_page(folio, 0);
> > 
> > is the one which works today and in the future.
> 
> I was wondering about that.
> 
> hp2:/usr/src/25> fgrep "&folio->page" mm/*.c | wc -l
> 84
> hp2:/usr/src/25> fgrep "folio_page(" mm/*.c | wc -l 
> 35
> 
> Should these all be converted?  What's the general rule here?

The rule is ...

 - If we haven't thought about it, use &folio->page to indicate that
   somebody needs to think about it.
 - If the code needs to be modified to split folio and page apart, use
   &folio->page.
 - If the code is part of compat code which is going to have to be
   removed, use &folio->page (eg do_read_cache_page()).

When you *do* think about it and use folio_page() or folio_file_page(),
don't just blindly pass 0 as the second argument.  Think about which page
within the folio is expected by the function you're working on.
Often that is "the first one!" and so folio_page(folio, 0) is the
right answer.  But that should be justified.

It might be that the right answer is "Oh, that function should take a folio".
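
A hypothetical conversion following that rule might look like:

	-	page = &folio->page;		/* nobody has audited this yet */
	+	page = folio_page(folio, 0);	/* audited: the head page is wanted */

or, where a particular file offset is meant:

	+	page = folio_file_page(folio, index);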
Andrew Morton June 11, 2024, 8:07 p.m. UTC | #7
On Tue, 11 Jun 2024 19:22:03 +0100 Matthew Wilcox <willy@infradead.org> wrote:

> On Tue, Jun 11, 2024 at 11:06:22AM -0700, Andrew Morton wrote:
> > On Tue, 11 Jun 2024 17:33:17 +0200 David Hildenbrand <david@redhat.com> wrote:
> > 
> > > On 11.06.24 17:32, Andrew Bresticker wrote:
> > > > The requirement that the head page be passed to do_set_pmd() was added
> > > > in commit ef37b2ea08ac ("mm/memory: page_add_file_rmap() ->
> > > > folio_add_file_rmap_[pte|pmd]()") and prevents pmd-mapping in the
> > > > finish_fault() and filemap_map_pages() paths if the page to be inserted
> > > > is anything but the head page for an otherwise suitable vma and pmd-sized
> > > > page.
> > > > 
> > > > Fixes: ef37b2ea08ac ("mm/memory: page_add_file_rmap() -> folio_add_file_rmap_[pte|pmd]()")
> > > > Signed-off-by: Andrew Bresticker <abrestic@rivosinc.com>
> > > > ---
> > > >   mm/memory.c | 3 ++-
> > > >   1 file changed, 2 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index 0f47a533014e..a1fce5ddacb3 100644
> > > > --- a/mm/memory.c
> > > > +++ b/mm/memory.c
> > > > @@ -4614,8 +4614,9 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
> > > >   	if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
> > > >   		return ret;
> > > >   
> > > > -	if (page != &folio->page || folio_order(folio) != HPAGE_PMD_ORDER)
> > > > +	if (folio_order(folio) != HPAGE_PMD_ORDER)
> > > >   		return ret;
> > > > +	page = &folio->page;
> > > >   
> > > >   	/*
> > > >   	 * Just backoff if any subpage of a THP is corrupted otherwise
> > > 
> > > Acked-by: David Hildenbrand <david@redhat.com>
> > 
> > You know what I'm going to ask ;) I'm assuming that the runtime effects
> > are "small performance optimization" and that "should we backport the
> > fix" is "no".
> 
> We're going to stop using PMDs to map large folios unless the fault is
> within the first 4KiB of the PMD.  No idea how many workloads that
> affects, but it only needs to be backported as far as v6.8, so we
> may as well backport it.

OK, thanks, I pasted the above text and added the cc:stable.

I didn't move it into the hotfixes queue - it's a non-trivial
behavioral change and extra test time seems prudent(?).
Hugh Dickins June 11, 2024, 9:18 p.m. UTC | #8
On Tue, 11 Jun 2024, Andrew Morton wrote:
> On Tue, 11 Jun 2024 19:22:03 +0100 Matthew Wilcox <willy@infradead.org> wrote:
> > On Tue, Jun 11, 2024 at 11:06:22AM -0700, Andrew Morton wrote:
> > > On Tue, 11 Jun 2024 17:33:17 +0200 David Hildenbrand <david@redhat.com> wrote:
> > > 
> > > > On 11.06.24 17:32, Andrew Bresticker wrote:
> > > > > The requirement that the head page be passed to do_set_pmd() was added
> > > > > in commit ef37b2ea08ac ("mm/memory: page_add_file_rmap() ->
> > > > > folio_add_file_rmap_[pte|pmd]()") and prevents pmd-mapping in the
> > > > > finish_fault() and filemap_map_pages() paths if the page to be inserted
> > > > > is anything but the head page for an otherwise suitable vma and pmd-sized
> > > > > page.
> > > > > 
> > > > > Fixes: ef37b2ea08ac ("mm/memory: page_add_file_rmap() -> folio_add_file_rmap_[pte|pmd]()")
> > > > > Signed-off-by: Andrew Bresticker <abrestic@rivosinc.com>
> > > > > ---
> > > > >   mm/memory.c | 3 ++-
> > > > >   1 file changed, 2 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > index 0f47a533014e..a1fce5ddacb3 100644
> > > > > --- a/mm/memory.c
> > > > > +++ b/mm/memory.c
> > > > > @@ -4614,8 +4614,9 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
> > > > >   	if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
> > > > >   		return ret;
> > > > >   
> > > > > -	if (page != &folio->page || folio_order(folio) != HPAGE_PMD_ORDER)
> > > > > +	if (folio_order(folio) != HPAGE_PMD_ORDER)
> > > > >   		return ret;
> > > > > +	page = &folio->page;
> > > > >   
> > > > >   	/*
> > > > >   	 * Just backoff if any subpage of a THP is corrupted otherwise
> > > > 
> > > > Acked-by: David Hildenbrand <david@redhat.com>

Acked-by: Hugh Dickins <hughd@google.com>

> > > 
> > > You know what I'm going to ask ;) I'm assuming that the runtime effects
> > > are "small performance optimization" and that "should we backport the
> > > fix" is "no".
> > 
> > We're going to stop using PMDs to map large folios unless the fault is
> > within the first 4KiB of the PMD.  No idea how many workloads that
> > affects, but it only needs to be backported as far as v6.8, so we
> > may as well backport it.
> 
> OK, thanks, I pasted the above text and added the cc:stable.

Yes please. My interest in this is that yesterday I discovered
the large drop in ShmemPmdMapped between v6.7 and v6.8, bisected,
and was testing overnight with a patch very much like this one of
Andrew's. I'd been hoping to send mine today, but now no need.

> 
> I didn't move it into the hotfixes queue - it's a non-trivial
> behavioral change and extra test time seems prudent(?).

It is certainly worth some test soak time, and the bug might have
been masking other issues which may now emerge; but the fix is
just reverting to the old pre-v6.8 behaviour.

Thanks,
Hugh
David Hildenbrand June 12, 2024, 6:41 p.m. UTC | #9
On 11.06.24 23:18, Hugh Dickins wrote:
> On Tue, 11 Jun 2024, Andrew Morton wrote:
>> On Tue, 11 Jun 2024 19:22:03 +0100 Matthew Wilcox <willy@infradead.org> wrote:
>>> On Tue, Jun 11, 2024 at 11:06:22AM -0700, Andrew Morton wrote:
>>>> On Tue, 11 Jun 2024 17:33:17 +0200 David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>>> On 11.06.24 17:32, Andrew Bresticker wrote:
>>>>>> The requirement that the head page be passed to do_set_pmd() was added
>>>>>> in commit ef37b2ea08ac ("mm/memory: page_add_file_rmap() ->
>>>>>> folio_add_file_rmap_[pte|pmd]()") and prevents pmd-mapping in the
>>>>>> finish_fault() and filemap_map_pages() paths if the page to be inserted
>>>>>> is anything but the head page for an otherwise suitable vma and pmd-sized
>>>>>> page.
>>>>>>
>>>>>> Fixes: ef37b2ea08ac ("mm/memory: page_add_file_rmap() -> folio_add_file_rmap_[pte|pmd]()")
>>>>>> Signed-off-by: Andrew Bresticker <abrestic@rivosinc.com>
>>>>>> ---
>>>>>>    mm/memory.c | 3 ++-
>>>>>>    1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>> index 0f47a533014e..a1fce5ddacb3 100644
>>>>>> --- a/mm/memory.c
>>>>>> +++ b/mm/memory.c
>>>>>> @@ -4614,8 +4614,9 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
>>>>>>    	if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
>>>>>>    		return ret;
>>>>>>    
>>>>>> -	if (page != &folio->page || folio_order(folio) != HPAGE_PMD_ORDER)
>>>>>> +	if (folio_order(folio) != HPAGE_PMD_ORDER)
>>>>>>    		return ret;
>>>>>> +	page = &folio->page;
>>>>>>    
>>>>>>    	/*
>>>>>>    	 * Just backoff if any subpage of a THP is corrupted otherwise
>>>>>
>>>>> Acked-by: David Hildenbrand <david@redhat.com>
> 
> Acked-by: Hugh Dickins <hughd@google.com>
> 
>>>>
>>>> You know what I'm going to ask ;) I'm assuming that the runtime effects
>>>> are "small performance optimization" and that "should we backport the
>>>> fix" is "no".
>>>
>>> We're going to stop using PMDs to map large folios unless the fault is
>>> within the first 4KiB of the PMD.  No idea how many workloads that
>>> affects, but it only needs to be backported as far as v6.8, so we
>>> may as well backport it.
>>
>> OK, thanks, I pasted the above text and added the cc:stable.
> 
> Yes please. My interest in this is that yesterday I discovered
> the large drop in ShmemPmdMapped between v6.7 and v6.8, bisected,
> and was testing overnight with a patch very much like this one of
> Andrew's. I'd been hoping to send mine today, but now no need.
> 
>>
>> I didn't move it into the hotfixes queue - it's a non-trivial
>> behavioral change and extra test time seems prudent(?).
> 
> It is certainly worth some test soak time, and the bug might have
> been masking other issues which may now emerge; but the fix is
> just reverting to the old pre-v6.8 behaviour.

Right, I don't expect surprises, really. I'm rather surprised that 
nobody noticed and that the usual 0-day benchmarks don't trigger that case.

Patch

diff --git a/mm/memory.c b/mm/memory.c
index 0f47a533014e..a1fce5ddacb3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4614,8 +4614,9 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 	if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
 		return ret;
 
-	if (page != &folio->page || folio_order(folio) != HPAGE_PMD_ORDER)
+	if (folio_order(folio) != HPAGE_PMD_ORDER)
 		return ret;
+	page = &folio->page;
 
 	/*
 	 * Just backoff if any subpage of a THP is corrupted otherwise
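
For context, a simplified sketch (not the verbatim kernel source) of the
caller side: finish_fault() hands do_set_pmd() the subpage for the
faulting address, which is the head page only when the fault lands in
the folio's first 4KiB.

	/* Sketch of the finish_fault() -> do_set_pmd() hand-off. */
	struct page *page = vmf->page;	/* subpage for the faulting address */

	if (pmd_none(*vmf->pmd) && PageTransCompound(page)) {
		/*
		 * Before this fix, do_set_pmd() returned VM_FAULT_FALLBACK
		 * whenever page was not the head page, i.e. whenever the
		 * fault was outside the folio's first 4KiB, so the range
		 * was PTE-mapped instead.
		 */
		ret = do_set_pmd(vmf, page);
		if (ret != VM_FAULT_FALLBACK)
			return ret;
	}
	/* ... otherwise fall through to mapping a single PTE ... */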