diff mbox series

[v2,5/9] mm/hugetlb: convert isolate_or_dissolve_huge_page to folios

Message ID 20221101223059.460937-6-sidhartha.kumar@oracle.com (mailing list archive)
State New
Headers show
Series convert hugetlb_cgroup helper functions to folios | expand

Commit Message

Sidhartha Kumar Nov. 1, 2022, 10:30 p.m. UTC
Removes a call to compound_head() by using a folio when operating on the
head page of a hugetlb compound page.

Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

Comments

Muchun Song Nov. 2, 2022, 6:48 a.m. UTC | #1
> On Nov 2, 2022, at 06:30, Sidhartha Kumar <sidhartha.kumar@oracle.com> wrote:
> 
> Removes a call to compound_head() by using a folio when operating on the
> head page of a hugetlb compound page.
> 
> Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

Thanks.
Matthew Wilcox June 12, 2023, 5:41 p.m. UTC | #2
On Tue, Nov 01, 2022 at 03:30:55PM -0700, Sidhartha Kumar wrote:
> +++ b/mm/hugetlb.c
> @@ -2815,7 +2815,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
>  int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
>  {
>  	struct hstate *h;
> -	struct page *head;
> +	struct folio *folio = page_folio(page);

Is this safe?  I was reviewing a different patch today, and I spotted
this.  With THP, we can relatively easily hit this case:

struct page points to a page with pfn 0x40305, in a folio of order 2.
We call page_folio() on it and the resulting pointer is for the folio
with pfn 0x40304.
If we don't have our own refcount (or some other protection ...) against
freeing, the folio can now be freed and reallocated.  Say it's now part
of an order-3 folio.
Our 'folio' pointer is now actually a pointer to a tail page, and we
have various assertions that a folio pointer doesn't point to a tail
page, so they trigger.

It seems to me that this ...

        /*
         * The page might have been dissolved from under our feet, so make sure
         * to carefully check the state under the lock.
         * Return success when racing as if we dissolved the page ourselves.
         */
        spin_lock_irq(&hugetlb_lock);
        if (folio_test_hugetlb(folio)) {
                h = folio_hstate(folio);
        } else {
                spin_unlock_irq(&hugetlb_lock);
                return 0;
        }

implies that we don't have our own reference on the folio, so we might
find a situation where the folio pointer we have is no longer a folio
pointer.

Maybe the page_folio() call should be moved inside the hugetlb_lock
protection?  Is that enough?  I don't know enough about how hugetlb
pages are split, freed & allocated to know what's going on.

But then we _drop_ the lock, and keep referring to ...

> @@ -2841,10 +2840,10 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
>  	if (hstate_is_gigantic(h))
>  		return -ENOMEM;
>  
> -	if (page_count(head) && !isolate_hugetlb(head, list))
> +	if (folio_ref_count(folio) && !isolate_hugetlb(&folio->page, list))
>  		ret = 0;
> -	else if (!page_count(head))
> -		ret = alloc_and_dissolve_huge_page(h, head, list);
> +	else if (!folio_ref_count(folio))
> +		ret = alloc_and_dissolve_huge_page(h, &folio->page, list);

And I fall back to saying "I don't know enough to know if this is safe".
Sidhartha Kumar June 12, 2023, 6:45 p.m. UTC | #3
On 6/12/23 10:41 AM, Matthew Wilcox wrote:
> On Tue, Nov 01, 2022 at 03:30:55PM -0700, Sidhartha Kumar wrote:
>> +++ b/mm/hugetlb.c
>> @@ -2815,7 +2815,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
>>   int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
>>   {
>>   	struct hstate *h;
>> -	struct page *head;
>> +	struct folio *folio = page_folio(page);
> 
> Is this safe?  I was reviewing a different patch today, and I spotted
> this.  With THP, we can relatively easily hit this case:
> 
> struct page points to a page with pfn 0x40305, in a folio of order 2.
> We call page_folio() on it and the resulting pointer is for the folio
> with pfn 0x40304.
> If we don't have our own refcount (or some other protection ...) against
> freeing, the folio can now be freed and reallocated.  Say it's now part
> of an order-3 folio.
> Our 'folio' pointer is now actually a pointer to a tail page, and we
> have various assertions that a folio pointer doesn't point to a tail
> page, so they trigger.
> 
> It seems to me that this ...
> 
>          /*
>           * The page might have been dissolved from under our feet, so make sure
>           * to carefully check the state under the lock.
>           * Return success when racing as if we dissolved the page ourselves.
>           */
>          spin_lock_irq(&hugetlb_lock);
>          if (folio_test_hugetlb(folio)) {
>                  h = folio_hstate(folio);
>          } else {
>                  spin_unlock_irq(&hugetlb_lock);
>                  return 0;
>          }
> 
> implies that we don't have our own reference on the folio, so we might
> find a situation where the folio pointer we have is no longer a folio
> pointer.
> 

If the folio became free and reallocated would this be considered a 
success? If the folio is no longer a hugetlb folio, 
isolate_or_dissolve_huge_page() returns as if it dissolved the page itself.

Later in the call stack, within alloc_and_dissolve_hugetlb_folio() there is

	if (!folio_test_hugetlb(old_folio)) {
		/*
		 * Freed from under us. Drop new_folio too.
		 */
		goto free_new;
	}

which would imply it is safe for the old_folio to have been dropped/freed.


> Maybe the page_folio() call should be moved inside the hugetlb_lock
> protection?  Is that enough?  I don't know enough about how hugetlb
> pages are split, freed & allocated to know what's going on.
> But then we _drop_ the lock, and keep referring to ...
> 
>> @@ -2841,10 +2840,10 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
>>   	if (hstate_is_gigantic(h))
>>   		return -ENOMEM;
>>   
>> -	if (page_count(head) && !isolate_hugetlb(head, list))
>> +	if (folio_ref_count(folio) && !isolate_hugetlb(&folio->page, list))
>>   		ret = 0;
>> -	else if (!page_count(head))
>> -		ret = alloc_and_dissolve_huge_page(h, head, list);
>> +	else if (!folio_ref_count(folio))
>> +		ret = alloc_and_dissolve_huge_page(h, &folio->page, list);
> 
> And I fall back to saying "I don't know enough to know if this is safe".
Mike Kravetz June 12, 2023, 11:34 p.m. UTC | #4
On 06/12/23 18:41, Matthew Wilcox wrote:
> On Tue, Nov 01, 2022 at 03:30:55PM -0700, Sidhartha Kumar wrote:
> > +++ b/mm/hugetlb.c
> > @@ -2815,7 +2815,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
> >  int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
> >  {
> >  	struct hstate *h;
> > -	struct page *head;
> > +	struct folio *folio = page_folio(page);
> 
> Is this safe?  I was reviewing a different patch today, and I spotted
> this.  With THP, we can relatively easily hit this case:
> 
> struct page points to a page with pfn 0x40305, in a folio of order 2.
> We call page_folio() on it and the resulting pointer is for the folio
> with pfn 0x40304.
> If we don't have our own refcount (or some other protection ...) against
> freeing, the folio can now be freed and reallocated.  Say it's now part
> of an order-3 folio.
> Our 'folio' pointer is now actually a pointer to a tail page, and we
> have various assertions that a folio pointer doesn't point to a tail
> page, so they trigger.
> 
> It seems to me that this ...
> 
>         /*
>          * The page might have been dissolved from under our feet, so make sure
>          * to carefully check the state under the lock.
>          * Return success when racing as if we dissolved the page ourselves.
>          */
>         spin_lock_irq(&hugetlb_lock);
>         if (folio_test_hugetlb(folio)) {
>                 h = folio_hstate(folio);
>         } else {
>                 spin_unlock_irq(&hugetlb_lock);
>                 return 0;
>         }
> 
> implies that we don't have our own reference on the folio, so we might
> find a situation where the folio pointer we have is no longer a folio
> pointer.

Your analysis is correct.

This is not safe because we hold no locks or references.  The folio
pointer obtained via page_folio(page) may not be valid when calling
folio_test_hugetlb(folio) and later.

My bad for the Reviewed-by: :(

> 
> Maybe the page_folio() call should be moved inside the hugetlb_lock
> protection?  Is that enough?  I don't know enough about how hugetlb
> pages are split, freed & allocated to know what's going on.
> 
> But then we _drop_ the lock, and keep referring to ...
> 
> > @@ -2841,10 +2840,10 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
> >  	if (hstate_is_gigantic(h))
> >  		return -ENOMEM;
> >  
> > -	if (page_count(head) && !isolate_hugetlb(head, list))
> > +	if (folio_ref_count(folio) && !isolate_hugetlb(&folio->page, list))
> >  		ret = 0;
> > -	else if (!page_count(head))
> > -		ret = alloc_and_dissolve_huge_page(h, head, list);
> > +	else if (!folio_ref_count(folio))
> > +		ret = alloc_and_dissolve_huge_page(h, &folio->page, list);

The above was OK when using struct page instead of folio.  The 'racy'
part was getting the ref count on the head page.  It was OK because this
was only a check to see if we should TRY to isolate or dissolve.  The
code to actually isolate or dissolve would take the appropriate locks.

I'm afraid the code is now making even more use of a potentially invalid
folio.  Here is how the above now looks in v6.3:

	spin_unlock_irq(&hugetlb_lock);

	/*
	 * Fence off gigantic pages as there is a cyclic dependency between
	 * alloc_contig_range and them. Return -ENOMEM as this has the effect
	 * of bailing out right away without further retrying.
	 */
	if (hstate_is_gigantic(h))
		return -ENOMEM;

	if (folio_ref_count(folio) && isolate_hugetlb(folio, list))
		ret = 0;
	else if (!folio_ref_count(folio))
		ret = alloc_and_dissolve_hugetlb_folio(h, folio, list);

Looks like that potentially invalid folio is being passed to other
routines.  Previous code would take lock and revalidate that struct page
was still a hugetlb page.  We can not do the same with a folio.
Mike Kravetz June 13, 2023, 11:29 p.m. UTC | #5
On 06/12/23 16:34, Mike Kravetz wrote:
> On 06/12/23 18:41, Matthew Wilcox wrote:
> > On Tue, Nov 01, 2022 at 03:30:55PM -0700, Sidhartha Kumar wrote:
> > > +++ b/mm/hugetlb.c
> > > @@ -2815,7 +2815,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
> > >  int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
> > >  {
> > >  	struct hstate *h;
> > > -	struct page *head;
> > > +	struct folio *folio = page_folio(page);
> > 
> > Is this safe?  I was reviewing a different patch today, and I spotted
> > this.  With THP, we can relatively easily hit this case:
> > 
> > struct page points to a page with pfn 0x40305, in a folio of order 2.
> > We call page_folio() on it and the resulting pointer is for the folio
> > with pfn 0x40304.
> > If we don't have our own refcount (or some other protection ...) against
> > freeing, the folio can now be freed and reallocated.  Say it's now part
> > of an order-3 folio.
> > Our 'folio' pointer is now actually a pointer to a tail page, and we
> > have various assertions that a folio pointer doesn't point to a tail
> > page, so they trigger.
> > 
> > It seems to me that this ...
> > 
> >         /*
> >          * The page might have been dissolved from under our feet, so make sure
> >          * to carefully check the state under the lock.
> >          * Return success when racing as if we dissolved the page ourselves.
> >          */
> >         spin_lock_irq(&hugetlb_lock);
> >         if (folio_test_hugetlb(folio)) {
> >                 h = folio_hstate(folio);
> >         } else {
> >                 spin_unlock_irq(&hugetlb_lock);
> >                 return 0;
> >         }
> > 
> > implies that we don't have our own reference on the folio, so we might
> > find a situation where the folio pointer we have is no longer a folio
> > pointer.
> 
> Your analysis is correct.
> 
> This is not safe because we hold no locks or references.  The folio
> pointer obtained via page_folio(page) may not be valid when calling
> folio_test_hugetlb(folio) and later.
> 
> My bad for the Reviewed-by: :(
> 

I was looking at this more closely and need a bit of clarification.  As
mentioned, your analysis is correct.  However, it appears that there is
other code doing:

folio = page_folio(page);
...
if (folio_test_hugetlb(folio))

without holding a folio ref or some type of lock.  split_huge_pages_all()
is one such example.

So, either this code has the same issue or there are folio routines that
can be called without holding a ref/lock.  The kerneldoc for
folio_test_hugetlb says "Caller should have a reference on the folio to
prevent it from being turned into a tail page.".  However, is that mostly
to make sure the returned value is consistent/valid?  Can it really lead
to an assert if folio pointer is changed to point to something else?

> > Maybe the page_folio() call should be moved inside the hugetlb_lock
> > protection?  Is that enough?  I don't know enough about how hugetlb
> > pages are split, freed & allocated to know what's going on.

Upon further thought, I think we should move the page_folio() inside the
lock just to be more correct.

> > 
> > But then we _drop_ the lock, and keep referring to ...
> > 
> > > @@ -2841,10 +2840,10 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
> > >  	if (hstate_is_gigantic(h))
> > >  		return -ENOMEM;
> > >  
> > > -	if (page_count(head) && !isolate_hugetlb(head, list))
> > > +	if (folio_ref_count(folio) && !isolate_hugetlb(&folio->page, list))
> > >  		ret = 0;
> > > -	else if (!page_count(head))
> > > -		ret = alloc_and_dissolve_huge_page(h, head, list);
> > > +	else if (!folio_ref_count(folio))
> > > +		ret = alloc_and_dissolve_huge_page(h, &folio->page, list);
> 
> The above was OK when using struct page instead of folio.  The 'racy'
> part was getting the ref count on the head page.  It was OK because this
> was only a check to see if we should TRY to isolate or dissolve.  The
> code to actually isolate or dissolve would take the appropriate locks.

page_count() is doing 'folio_ref_count(page_folio(page));' and there I suspect
there are many places doing page_count without taking a page ref or locking.
So, it seems like this would also be safe?

> I'm afraid the code is now making even more use of a potentially invalid
> folio.  Here is how the above now looks in v6.3:
> 
> 	spin_unlock_irq(&hugetlb_lock);
> 
> 	/*
> 	 * Fence off gigantic pages as there is a cyclic dependency between
> 	 * alloc_contig_range and them. Return -ENOMEM as this has the effect
> 	 * of bailing out right away without further retrying.
> 	 */
> 	if (hstate_is_gigantic(h))
> 		return -ENOMEM;
> 
> 	if (folio_ref_count(folio) && isolate_hugetlb(folio, list))
> 		ret = 0;
> 	else if (!folio_ref_count(folio))
> 		ret = alloc_and_dissolve_hugetlb_folio(h, folio, list);
> 
> Looks like that potentially invalid folio is being passed to other
> routines.  Previous code would take lock and revalidate that struct page
> was still a hugetlb page.  We can not do the same with a folio.

Perhaps I spoke too soon.  Yes, we pass a potentially invalid folio
pointer to isolate_hugetlb() and alloc_and_dissolve_hugetlb_folio().
However, it seems the validation they perform should be sufficient.

bool isolate_hugetlb(struct folio *folio, struct list_head *list)
{
	bool ret = true;

	spin_lock_irq(&hugetlb_lock);
	if (!folio_test_hugetlb(folio) ||
	    !folio_test_hugetlb_migratable(folio) ||
	    !folio_try_get(folio)) {
		ret = false;
		goto unlock;


static int alloc_and_dissolve_hugetlb_folio(struct hstate *h,
			struct folio *old_folio, struct list_head *list)
{
	...
retry:
	spin_lock_irq(&hugetlb_lock);
	if (!folio_test_hugetlb(old_folio)) {
		...
	} else if (folio_ref_count(old_folio)) {
		...
	} else if (!folio_test_hugetlb_freed(old_folio)) {
		...
		goto retry;
	} else {
		/*
		 * Ok, old_folio is still a genuine free hugepage.

Upon further consideration, I do not see an issue with the existing
code.  If there are issues with calling folio_test_hugetlb() or
folio_ref_count() on a potentially invalid folio pointer, then we do
have issues here.  However, such an issue would be more widespread as
there is more code doing the same.
diff mbox series

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2a48feadb41c..bcc39d2613b2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2815,7 +2815,7 @@  static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
 int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
 {
 	struct hstate *h;
-	struct page *head;
+	struct folio *folio = page_folio(page);
 	int ret = -EBUSY;
 
 	/*
@@ -2824,9 +2824,8 @@  int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
 	 * Return success when racing as if we dissolved the page ourselves.
 	 */
 	spin_lock_irq(&hugetlb_lock);
-	if (PageHuge(page)) {
-		head = compound_head(page);
-		h = page_hstate(head);
+	if (folio_test_hugetlb(folio)) {
+		h = folio_hstate(folio);
 	} else {
 		spin_unlock_irq(&hugetlb_lock);
 		return 0;
@@ -2841,10 +2840,10 @@  int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
 	if (hstate_is_gigantic(h))
 		return -ENOMEM;
 
-	if (page_count(head) && !isolate_hugetlb(head, list))
+	if (folio_ref_count(folio) && !isolate_hugetlb(&folio->page, list))
 		ret = 0;
-	else if (!page_count(head))
-		ret = alloc_and_dissolve_huge_page(h, head, list);
+	else if (!folio_ref_count(folio))
+		ret = alloc_and_dissolve_huge_page(h, &folio->page, list);
 
 	return ret;
 }