diff mbox series

mm: Always initialise folio->_deferred_list

Message ID 20240312035017.516865-1-willy@infradead.org (mailing list archive)
State New
Headers show
Series mm: Always initialise folio->_deferred_list | expand

Commit Message

Matthew Wilcox (Oracle) March 12, 2024, 3:50 a.m. UTC
For compound pages which are at least order-2 (and hence have a
deferred_list), initialise it and then we can check at free that the
page is not part of a deferred list.  We recently found this useful to
rule out a source of corruption.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/huge_memory.c | 2 --
 mm/internal.h    | 2 ++
 mm/memcontrol.c  | 2 ++
 mm/page_alloc.c  | 9 +++++----
 4 files changed, 9 insertions(+), 6 deletions(-)

Comments

David Hildenbrand March 12, 2024, 1:17 p.m. UTC | #1
On 12.03.24 04:50, Matthew Wilcox (Oracle) wrote:
> For compound pages which are at least order-2 (and hence have a
> deferred_list), initialise it and then we can check at free that the
> page is not part of a deferred list.  We recently found this useful to
> rule out a source of corruption.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>   mm/huge_memory.c | 2 --
>   mm/internal.h    | 2 ++
>   mm/memcontrol.c  | 2 ++
>   mm/page_alloc.c  | 9 +++++----
>   4 files changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9859aa4f7553..04fb994a7b0b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -792,8 +792,6 @@ void folio_prep_large_rmappable(struct folio *folio)
>   {
>   	if (!folio || !folio_test_large(folio))
>   		return;
> -	if (folio_order(folio) > 1)
> -		INIT_LIST_HEAD(&folio->_deferred_list);
>   	folio_set_large_rmappable(folio);
>   }
>   
> diff --git a/mm/internal.h b/mm/internal.h
> index 7e486f2c502c..10895ec52546 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -525,6 +525,8 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
>   	atomic_set(&folio->_entire_mapcount, -1);
>   	atomic_set(&folio->_nr_pages_mapped, 0);
>   	atomic_set(&folio->_pincount, 0);
> +	if (order > 1)
> +		INIT_LIST_HEAD(&folio->_deferred_list);
>   }
>   
>   static inline void prep_compound_tail(struct page *head, int tail_idx)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 138bcfa18234..e2334c4ee550 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7483,6 +7483,8 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
>   	struct obj_cgroup *objcg;
>   
>   	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
> +	VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
> +			!list_empty(&folio->_deferred_list), folio);
>   
>   	/*
>   	 * Nobody should be changing or seriously looking at
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 62240014a498..8374ba9b6249 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1006,10 +1006,11 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
>   		}
>   		break;
>   	case 2:
> -		/*
> -		 * the second tail page: ->mapping is
> -		 * deferred_list.next -- ignore value.
> -		 */
> +		/* the second tail page: deferred_list overlaps ->mapping */
> +		if (unlikely(!list_empty(&folio->_deferred_list))) {
> +			bad_page(page, "on deferred list");
> +			goto out;
> +		}

IIRC, hugetlb might overwrite this with _hugetlb_subpool? Isn't that a 
problem we have to handle?
Matthew Wilcox (Oracle) March 12, 2024, 2:01 p.m. UTC | #2
On Tue, Mar 12, 2024 at 02:17:01PM +0100, David Hildenbrand wrote:
> On 12.03.24 04:50, Matthew Wilcox (Oracle) wrote:
> > +++ b/mm/page_alloc.c
> > @@ -1006,10 +1006,11 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
> >   		}
> >   		break;
> >   	case 2:
> > -		/*
> > -		 * the second tail page: ->mapping is
> > -		 * deferred_list.next -- ignore value.
> > -		 */
> > +		/* the second tail page: deferred_list overlaps ->mapping */
> > +		if (unlikely(!list_empty(&folio->_deferred_list))) {
> > +			bad_page(page, "on deferred list");
> > +			goto out;
> > +		}
> 
> IIRC, hugetlb might overwrite this with _hugetlb_subpool? Isn't that a
> problem we have to handle?

Ah yes, you're right.  I think this should do the trick?

+++ b/mm/hugetlb.c
@@ -1796,6 +1796,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
                destroy_compound_gigantic_folio(folio, huge_page_order(h));
                free_gigantic_folio(folio, huge_page_order(h));
        } else {
+               INIT_LIST_HEAD(&folio->_deferred_list);
                __free_pages(&folio->page, huge_page_order(h));
        }
 }

(that should also be a folio_put(), IMO)
David Hildenbrand March 12, 2024, 2:34 p.m. UTC | #3
On 12.03.24 15:01, Matthew Wilcox wrote:
> On Tue, Mar 12, 2024 at 02:17:01PM +0100, David Hildenbrand wrote:
>> On 12.03.24 04:50, Matthew Wilcox (Oracle) wrote:
>>> +++ b/mm/page_alloc.c
>>> @@ -1006,10 +1006,11 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
>>>    		}
>>>    		break;
>>>    	case 2:
>>> -		/*
>>> -		 * the second tail page: ->mapping is
>>> -		 * deferred_list.next -- ignore value.
>>> -		 */
>>> +		/* the second tail page: deferred_list overlaps ->mapping */
>>> +		if (unlikely(!list_empty(&folio->_deferred_list))) {
>>> +			bad_page(page, "on deferred list");
>>> +			goto out;
>>> +		}
>>
>> IIRC, hugetlb might overwrite this with _hugetlb_subpool? Isn't that a
>> problem we have to handle?
> 
> Ah yes, you're right.  I think this should do the trick?
> 
> +++ b/mm/hugetlb.c
> @@ -1796,6 +1796,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
>                  destroy_compound_gigantic_folio(folio, huge_page_order(h));
>                  free_gigantic_folio(folio, huge_page_order(h));
>          } else {
> +               INIT_LIST_HEAD(&folio->_deferred_list);
>                  __free_pages(&folio->page, huge_page_order(h));

Heh, __free_pages() says:

"This function can free multi-page allocations that are not compound 
pages". I suspect that means "can also free compound pages", but it's 
confusing.

Especially, I thought we recently learned that free hugetlb folios do 
have a refcount of 0? Confusing.

Anyhow, I suspect your change does the trick and agree that 
__free_pages() might need a replacement.

... but it's confusing.
Matthew Wilcox (Oracle) March 12, 2024, 3:38 p.m. UTC | #4
On Tue, Mar 12, 2024 at 03:34:13PM +0100, David Hildenbrand wrote:
> On 12.03.24 15:01, Matthew Wilcox wrote:
> >                  __free_pages(&folio->page, huge_page_order(h));
> 
> Heh, __free_pages() says:
> 
> "This function can free multi-page allocations that are not compound pages".
> I suspect that means "can also free compound pages", but it's confusing.

Sorry.  I wrote that documentation and I was focused on one thing,
but failed to think of the other thing ... it can indeed free compound
pages.  I'll add this:

 * If the pages were allocated with __GFP_COMP, prefer using put_page()
 * to using this function.  Also prefer to call put_page() rather than
 * calling __free_page() or __free_pages(page, 0).

> Especially, I thought we recently learned that free hugetlb folios do have a
> refcount of 0? Confusing.

Yes, that is confusing.  This function wouldn't work if the refcount
were 0 (put_page_testzero() would complain).  Ah, here we go:

In __folio_put(), we know that refcount is 0.
__folio_put_large() -> destroy_large_folio -> free_huge_folio
(refcount still 0 here, indeed it asserts it)
If we add it back to the pool, the refcount stays at 0.
But if we're removing it, we call __remove_hugetlb_folio()
which calls folio_ref_unfreeze().
David Hildenbrand March 12, 2024, 3:45 p.m. UTC | #5
On 12.03.24 16:38, Matthew Wilcox wrote:
> On Tue, Mar 12, 2024 at 03:34:13PM +0100, David Hildenbrand wrote:
>> On 12.03.24 15:01, Matthew Wilcox wrote:
>>>                   __free_pages(&folio->page, huge_page_order(h));
>>
>> Heh, __free_pages() says:
>>
>> "This function can free multi-page allocations that are not compound pages".
>> I suspect that means "can also free compound pages", but it's confusing.
> 
> Sorry.  I wrote that documentation and I was focused on one thing,
> but failed to think of the other thing ... it can indeed free compound
> pages.  I'll add this:
> 
>   * If the pages were allocated with __GFP_COMP, prefer using put_page()
>   * to using this function.  Also prefer to call put_page() rather than
>   * calling __free_page() or __free_pages(page, 0).

Sounds good.

> 
>> Especially, I thought we recently learned that free hugetlb folios do have a
>> refcount of 0? Confusing.
> 
> Yes, that is confusing.  This function wouldn't work if the refcount
> were 0 (put_page_testzero() would complain).  Ah, here we go:
> 
> In __folio_put(), we know that refcount is 0.
> __folio_put_large() -> destroy_large_folio -> free_huge_folio
> (refcount still 0 here, indeed it asserts it)
> If we add it back to the pool, the refcount stays at 0.
> But if we're removing it, we call __remove_hugetlb_folio()
> which calls folio_ref_unfreeze().

Ah, that's the case I was missing. The joy of hugetlb :)

As a side note: hugetlb.c still talks about "destructor" and "dtor". I 
assume that dates back to the times when we identified hugetlb folios 
by a destructor in the compound page.
diff mbox series

Patch

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9859aa4f7553..04fb994a7b0b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -792,8 +792,6 @@  void folio_prep_large_rmappable(struct folio *folio)
 {
 	if (!folio || !folio_test_large(folio))
 		return;
-	if (folio_order(folio) > 1)
-		INIT_LIST_HEAD(&folio->_deferred_list);
 	folio_set_large_rmappable(folio);
 }
 
diff --git a/mm/internal.h b/mm/internal.h
index 7e486f2c502c..10895ec52546 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -525,6 +525,8 @@  static inline void prep_compound_head(struct page *page, unsigned int order)
 	atomic_set(&folio->_entire_mapcount, -1);
 	atomic_set(&folio->_nr_pages_mapped, 0);
 	atomic_set(&folio->_pincount, 0);
+	if (order > 1)
+		INIT_LIST_HEAD(&folio->_deferred_list);
 }
 
 static inline void prep_compound_tail(struct page *head, int tail_idx)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 138bcfa18234..e2334c4ee550 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7483,6 +7483,8 @@  static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
 	struct obj_cgroup *objcg;
 
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
+	VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
+			!list_empty(&folio->_deferred_list), folio);
 
 	/*
 	 * Nobody should be changing or seriously looking at
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 62240014a498..8374ba9b6249 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1006,10 +1006,11 @@  static int free_tail_page_prepare(struct page *head_page, struct page *page)
 		}
 		break;
 	case 2:
-		/*
-		 * the second tail page: ->mapping is
-		 * deferred_list.next -- ignore value.
-		 */
+		/* the second tail page: deferred_list overlaps ->mapping */
+		if (unlikely(!list_empty(&folio->_deferred_list))) {
+			bad_page(page, "on deferred list");
+			goto out;
+		}
 		break;
 	default:
 		if (page->mapping != TAIL_MAPPING) {