[v2,2/2] hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles

Message ID 20230718004942.113174-3-mike.kravetz@oracle.com (mailing list archive)
State New
Series Fix hugetlb free path race with memory errors

Commit Message

Mike Kravetz July 18, 2023, 12:49 a.m. UTC
update_and_free_pages_bulk is designed to free a list of hugetlb pages
back to their associated lower level allocators.  This may require
allocating vmemmap pages associated with each hugetlb page.  The
hugetlb page destructor must be changed before pages are freed to lower
level allocators.  However, the destructor must be changed under the
hugetlb lock.  This means there is potentially one lock cycle per page.

Minimize the number of lock cycles in update_and_free_pages_bulk by:
1) allocating necessary vmemmap for all hugetlb pages on the list
2) taking hugetlb lock and clearing destructor for all pages on the list
3) freeing all pages on list back to low level allocators

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 38 ++++++++++++++++++++++++++++++++++----
 1 file changed, 34 insertions(+), 4 deletions(-)
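
The lock-cycle argument in the commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `toy_lock`, `toy_page` and every `toy_*` function are hypothetical names standing in for `hugetlb_lock` and hugetlb pages, none of them are kernel APIs, and vmemmap allocation is left out entirely. It models the worst case where every page needs its destructor cleared.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPAGES 8

/* Toy stand-in for hugetlb_lock: it only counts acquire/release cycles. */
struct toy_lock {
	unsigned long cycles;
};

static void toy_lock_acquire(struct toy_lock *lock) { lock->cycles++; }
static void toy_lock_release(struct toy_lock *lock) { (void)lock; }

/* Toy stand-in for a hugetlb page; has_dtor models the hugetlb destructor. */
struct toy_page {
	bool has_dtor;
	bool freed;
};

/* Per-page scheme: every page takes the lock to clear its destructor. */
static unsigned long toy_free_per_page(struct toy_page *pages, size_t n)
{
	struct toy_lock lock = { 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		toy_lock_acquire(&lock);
		pages[i].has_dtor = false;
		toy_lock_release(&lock);
		pages[i].freed = true;
	}
	return lock.cycles;
}

/* Batched scheme: one lock cycle clears all destructors, then pages are freed. */
static unsigned long toy_free_batched(struct toy_page *pages, size_t n)
{
	struct toy_lock lock = { 0 };
	size_t i;

	toy_lock_acquire(&lock);
	for (i = 0; i < n; i++)
		pages[i].has_dtor = false;
	toy_lock_release(&lock);

	for (i = 0; i < n; i++)
		pages[i].freed = true;
	return lock.cycles;
}

/* Builds TOY_NPAGES pages and reports the lock cycles used by one scheme. */
static unsigned long toy_lock_cycles(bool batched)
{
	struct toy_page pages[TOY_NPAGES];
	size_t i;

	for (i = 0; i < TOY_NPAGES; i++) {
		pages[i].has_dtor = true;
		pages[i].freed = false;
	}
	return batched ? toy_free_batched(pages, TOY_NPAGES)
		       : toy_free_per_page(pages, TOY_NPAGES);
}
```

In this model `toy_lock_cycles(false)` uses one cycle per page while `toy_lock_cycles(true)` uses a single cycle regardless of list length, which is the saving the patch is after.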

Comments

James Houghton July 18, 2023, 4:31 p.m. UTC | #1
On Mon, Jul 17, 2023 at 5:50 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> update_and_free_pages_bulk is designed to free a list of hugetlb pages
> back to their associated lower level allocators.  This may require
> allocating vmemmap pages associated with each hugetlb page.  The
> hugetlb page destructor must be changed before pages are freed to lower
> level allocators.  However, the destructor must be changed under the
> hugetlb lock.  This means there is potentially one lock cycle per page.
>
> Minimize the number of lock cycles in update_and_free_pages_bulk by:
> 1) allocating necessary vmemmap for all hugetlb pages on the list
> 2) take hugetlb lock and clear destructor for all pages on the list
> 3) free all pages on list back to low level allocators
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  mm/hugetlb.c | 38 ++++++++++++++++++++++++++++++++++----
>  1 file changed, 34 insertions(+), 4 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 4a910121a647..e6b780291539 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1856,13 +1856,43 @@ static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,
>  static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
>  {
>         struct page *page, *t_page;
> -       struct folio *folio;
> +       bool clear_dtor = false;
>
> +       /*
> +        * First allocate required vmemmmap for all pages on list.  If vmemmap
> +        * can not be allocated, we can not free page to lower level allocator,
> +        * so add back as hugetlb surplus page.
> +        */
>         list_for_each_entry_safe(page, t_page, list, lru) {
> -               folio = page_folio(page);
> -               update_and_free_hugetlb_folio(h, folio, false);
> -               cond_resched();
> +               if (HPageVmemmapOptimized(page)) {
> +                       if (hugetlb_vmemmap_restore(h, page)) {
> +                               spin_lock_irq(&hugetlb_lock);
> +                               add_hugetlb_folio(h, page_folio(page), true);
> +                               spin_unlock_irq(&hugetlb_lock);
> +                       } else
> +                               clear_dtor = true;
> +                       cond_resched();
> +               }
> +       }
> +
> +       /*
> +        * If vmemmmap allocation performed above, then take lock to clear

s/vmemmmap/vmemmap. Also, this is a little hard to understand; something
like "If vmemmap allocation was performed above for any folios,
then..." seems clearer to me.

> +        * destructor of all pages on list.
> +        */
> +       if (clear_dtor) {
> +               spin_lock_irq(&hugetlb_lock);
> +               list_for_each_entry(page, list, lru)
> +                       __clear_hugetlb_destructor(h, page_folio(page));
> +               spin_unlock_irq(&hugetlb_lock);
>         }

I'm not too familiar with this code, but the above block seems weird
to me. If we successfully allocated the vmemmap for *any* folio, we
clear the hugetlb destructor for all the folios? I feel like we should
only be clearing the hugetlb destructor for all folios if the vmemmap
allocation succeeded for *all* folios. If the code is functionally
correct as is, I'm a little bit confused why we need `clear_dtor`; it
seems like this function doesn't really need it. (I could have some
huge misunderstanding here.)

> +
> +       /*
> +        * Free pages back to low level allocators.  vmemmap and destructors
> +        * were taken care of above, so update_and_free_hugetlb_folio will
> +        * not need to take hugetlb lock.
> +        */
> +       list_for_each_entry_safe(page, t_page, list, lru)
> +               update_and_free_hugetlb_folio(h, page_folio(page), false);
>  }
>
>  struct hstate *size_to_hstate(unsigned long size)
> --
> 2.41.0
>
Mike Kravetz July 18, 2023, 4:46 p.m. UTC | #2
On 07/18/23 09:31, James Houghton wrote:
> On Mon, Jul 17, 2023 at 5:50 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >
> > update_and_free_pages_bulk is designed to free a list of hugetlb pages
> > back to their associated lower level allocators.  This may require
> > allocating vmemmap pages associated with each hugetlb page.  The
> > hugetlb page destructor must be changed before pages are freed to lower
> > level allocators.  However, the destructor must be changed under the
> > hugetlb lock.  This means there is potentially one lock cycle per page.
> >
> > Minimize the number of lock cycles in update_and_free_pages_bulk by:
> > 1) allocating necessary vmemmap for all hugetlb pages on the list
> > 2) take hugetlb lock and clear destructor for all pages on the list
> > 3) free all pages on list back to low level allocators
> >
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > ---
> >  mm/hugetlb.c | 38 ++++++++++++++++++++++++++++++++++----
> >  1 file changed, 34 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 4a910121a647..e6b780291539 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1856,13 +1856,43 @@ static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,
> >  static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
> >  {
> >         struct page *page, *t_page;
> > -       struct folio *folio;
> > +       bool clear_dtor = false;
> >
> > +       /*
> > +        * First allocate required vmemmmap for all pages on list.  If vmemmap
> > +        * can not be allocated, we can not free page to lower level allocator,
> > +        * so add back as hugetlb surplus page.
> > +        */
> >         list_for_each_entry_safe(page, t_page, list, lru) {
> > -               folio = page_folio(page);
> > -               update_and_free_hugetlb_folio(h, folio, false);
> > -               cond_resched();
> > +               if (HPageVmemmapOptimized(page)) {
> > +                       if (hugetlb_vmemmap_restore(h, page)) {
> > +                               spin_lock_irq(&hugetlb_lock);
> > +                               add_hugetlb_folio(h, page_folio(page), true);
> > +                               spin_unlock_irq(&hugetlb_lock);
> > +                       } else
> > +                               clear_dtor = true;
> > +                       cond_resched();
> > +               }
> > +       }
> > +
> > +       /*
> > +        * If vmemmmap allocation performed above, then take lock to clear
> 
> s/vmemmmap/vmemmap. Also, this is a little hard to understand; something
> like "If vmemmap allocation was performed above for any folios,
> then..." seems clearer to me.
> 

Typo :(
Yes, that would be more clear ... see below.

> > +        * destructor of all pages on list.
> > +        */
> > +       if (clear_dtor) {
> > +               spin_lock_irq(&hugetlb_lock);
> > +               list_for_each_entry(page, list, lru)
> > +                       __clear_hugetlb_destructor(h, page_folio(page));
> > +               spin_unlock_irq(&hugetlb_lock);
> >         }
> 
> I'm not too familiar with this code, but the above block seems weird
> to me. If we successfully allocated the vmemmap for *any* folio, we
> clear the hugetlb destructor for all the folios? I feel like we should
> only be clearing the hugetlb destructor for all folios if the vmemmap
> allocation succeeded for *all* folios. If the code is functionally
> correct as is, I'm a little bit confused why we need `clear_dtor`; it
> seems like this function doesn't really need it. (I could have some
> huge misunderstanding here.)
> 

Yes, it is a bit strange.

I was thinking this has to also handle the case where hugetlb vmemmap
optimization is off system wide.  In that case, clear_dtor would never
be set and there is no sense in ever walking the list and calling
__clear_hugetlb_destructor(), which would be a NOOP in this case.  Think
of the case where there are TBs of hugetlb pages.

That is one of the reasons I made __clear_hugetlb_destructor() check
for the need to modify the destructor.  The other reason is in the
dissolve_free_huge_page() code path where we allocate vmemmap.  I
suppose, there could be an explicit call to __clear_hugetlb_destructor()
in dissolve_free_huge_page.  But, I thought it might be better if
we just handled both cases here.

My thinking is that the clear_dtor boolean would tell us if vmemmap was
restored for ANY hugetlb page.  I am aware that just because vmemmap was
allocated for one page, does not mean that it was allocated for others.
However, in the common case where hugetlb vmemmap optimization is on
system wide, we would have allocated vmemmap for all pages on the list
and would need to clear the destructor for them all.

So, clear_dtor is really just an optimization for the
hugetlb_free_vmemmap=off case.  Perhaps that is just overthinking and
not a useful micro-optimization.

Thanks for taking a look!
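
The role of `clear_dtor` described in this reply can be sketched in userspace. This is a hedged model, not the kernel code: `flag_page`, `flag_dtor_pass_visits` and `flag_visits` are hypothetical names, vmemmap restore is assumed to always succeed, and, following the statement above that the clear would be a NOOP when optimization is off, unoptimized pages are modeled as already having no destructor.

```c
#include <stdbool.h>
#include <stddef.h>

#define FLAG_NPAGES 4

/* Toy stand-in for a hugetlb page on the list being freed. */
struct flag_page {
	bool vmemmap_optimized;
	bool has_dtor;
};

/* Returns how many pages the destructor-clearing pass had to visit. */
static size_t flag_dtor_pass_visits(struct flag_page *pages, size_t n)
{
	bool clear_dtor = false;
	size_t i, visited = 0;

	/* Pass 1: restore vmemmap where needed (assumed to succeed here). */
	for (i = 0; i < n; i++) {
		if (pages[i].vmemmap_optimized) {
			pages[i].vmemmap_optimized = false;
			clear_dtor = true;
		}
	}

	/* Pass 2: walk the whole list only if pass 1 restored anything. */
	if (clear_dtor) {
		for (i = 0; i < n; i++) {
			pages[i].has_dtor = false;
			visited++;
		}
	}
	return visited;
}

/* Builds a list for one system-wide setting and runs both passes. */
static size_t flag_visits(bool optimization_on)
{
	struct flag_page pages[FLAG_NPAGES];
	size_t i;

	for (i = 0; i < FLAG_NPAGES; i++) {
		pages[i].vmemmap_optimized = optimization_on;
		/* Assumption: destructor is only still set on optimized pages. */
		pages[i].has_dtor = optimization_on;
	}
	return flag_dtor_pass_visits(pages, FLAG_NPAGES);
}
```

With optimization off the second walk is skipped completely, which is the saving that matters when the list covers TBs of hugetlb pages; with it on, every page is visited once under a single lock hold.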
Muchun Song July 19, 2023, 3:35 a.m. UTC | #3
> On Jul 18, 2023, at 08:49, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
> update_and_free_pages_bulk is designed to free a list of hugetlb pages
> back to their associated lower level allocators.  This may require
> allocating vmemmap pages associated with each hugetlb page.  The
> hugetlb page destructor must be changed before pages are freed to lower
> level allocators.  However, the destructor must be changed under the
> hugetlb lock.  This means there is potentially one lock cycle per page.
> 
> Minimize the number of lock cycles in update_and_free_pages_bulk by:
> 1) allocating necessary vmemmap for all hugetlb pages on the list
> 2) take hugetlb lock and clear destructor for all pages on the list
> 3) free all pages on list back to low level allocators
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

Thanks.
James Houghton July 20, 2023, 12:02 a.m. UTC | #4
On Tue, Jul 18, 2023 at 9:47 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 07/18/23 09:31, James Houghton wrote:
> > On Mon, Jul 17, 2023 at 5:50 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > > +        * destructor of all pages on list.
> > > +        */
> > > +       if (clear_dtor) {
> > > +               spin_lock_irq(&hugetlb_lock);
> > > +               list_for_each_entry(page, list, lru)
> > > +                       __clear_hugetlb_destructor(h, page_folio(page));
> > > +               spin_unlock_irq(&hugetlb_lock);
> > >         }
> >
> > I'm not too familiar with this code, but the above block seems weird
> > to me. If we successfully allocated the vmemmap for *any* folio, we
> > clear the hugetlb destructor for all the folios? I feel like we should
> > only be clearing the hugetlb destructor for all folios if the vmemmap
> > allocation succeeded for *all* folios. If the code is functionally
> > correct as is, I'm a little bit confused why we need `clear_dtor`; it
> > seems like this function doesn't really need it. (I could have some
> > huge misunderstanding here.)
> >
>
> Yes, it is a bit strange.
>
> I was thinking this has to also handle the case where hugetlb vmemmap
> optimization is off system wide.  In that case, clear_dtor would never
> be set and there is no sense in ever walking the list and calling
> __clear_hugetlb_destructor(), which would be a NOOP in this case.  Think
> of the case where there are TBs of hugetlb pages.
>
> That is one of the reasons I made __clear_hugetlb_destructor() check
> for the need to modify the destructor.  The other reason is in the
> dissolve_free_huge_page() code path where we allocate vmemmap.  I
> suppose, there could be an explicit call to __clear_hugetlb_destructor()
> in dissolve_free_huge_page.  But, I thought it might be better if
> we just handled both cases here.
>
> My thinking is that the clear_dtor boolean would tell us if vmemmap was
> restored for ANY hugetlb page.  I am aware that just because vmemmap was
> allocated for one page, does not mean that it was allocated for others.
> However, in the common case where hugetlb vmemmap optimization is on
> system wide, we would have allocated vmemmap for all pages on the list
> and would need to clear the destructor for them all.
>
> So, clear_dtor is really just an optimization for the
> hugetlb_free_vmemmap=off case.  Perhaps that is just overthinking and
> not a useful micro-optimization.

Ok I think I understand; I think the micro-optimization is fine to
add. But I think there's still a bug here:

If we have two vmemmap-optimized hugetlb pages and restoring the page
structs for one of them fails, that page will end up with the
incorrect dtor (add_hugetlb_folio will set it properly, but then we
clear it afterwards because clear_dtor was set).

What do you think?
Mike Kravetz July 20, 2023, 12:18 a.m. UTC | #5
On 07/19/23 17:02, James Houghton wrote:
> On Tue, Jul 18, 2023 at 9:47 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >
> > On 07/18/23 09:31, James Houghton wrote:
> > > On Mon, Jul 17, 2023 at 5:50 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > > > +        * destructor of all pages on list.
> > > > +        */
> > > > +       if (clear_dtor) {
> > > > +               spin_lock_irq(&hugetlb_lock);
> > > > +               list_for_each_entry(page, list, lru)
> > > > +                       __clear_hugetlb_destructor(h, page_folio(page));
> > > > +               spin_unlock_irq(&hugetlb_lock);
> > > >         }
> > >
> > > I'm not too familiar with this code, but the above block seems weird
> > > to me. If we successfully allocated the vmemmap for *any* folio, we
> > > clear the hugetlb destructor for all the folios? I feel like we should
> > > only be clearing the hugetlb destructor for all folios if the vmemmap
> > > allocation succeeded for *all* folios. If the code is functionally
> > > correct as is, I'm a little bit confused why we need `clear_dtor`; it
> > > seems like this function doesn't really need it. (I could have some
> > > huge misunderstanding here.)
> > >
> >
> > Yes, it is a bit strange.
> >
> > I was thinking this has to also handle the case where hugetlb vmemmap
> > optimization is off system wide.  In that case, clear_dtor would never
> > be set and there is no sense in ever walking the list and calling
> > __clear_hugetlb_destructor(), which would be a NOOP in this case.  Think
> > of the case where there are TBs of hugetlb pages.
> >
> > That is one of the reasons I made __clear_hugetlb_destructor() check
> > for the need to modify the destructor.  The other reason is in the
> > dissolve_free_huge_page() code path where we allocate vmemmap.  I
> > suppose, there could be an explicit call to __clear_hugetlb_destructor()
> > in dissolve_free_huge_page.  But, I thought it might be better if
> > we just handled both cases here.
> >
> > My thinking is that the clear_dtor boolean would tell us if vmemmap was
> > restored for ANY hugetlb page.  I am aware that just because vmemmap was
> > allocated for one page, does not mean that it was allocated for others.
> > However, in the common case where hugetlb vmemmap optimization is on
> > system wide, we would have allocated vmemmap for all pages on the list
> > and would need to clear the destructor for them all.
> >
> > So, clear_dtor is really just an optimization for the
> > hugetlb_free_vmemmap=off case.  Perhaps that is just overthinking and
> > not a useful micro-optimization.
> 
> Ok I think I understand; I think the micro-optimization is fine to
> add. But I think there's still a bug here:
> 
> If we have two vmemmap-optimized hugetlb pages and restoring the page
> structs for one of them fails, that page will end up with the
> incorrect dtor (add_hugetlb_folio will set it properly, but then we
> clear it afterwards because clear_dtor was set).
> 
> What do you think?

add_hugetlb_folio() will call enqueue_hugetlb_folio() which will move
the folio from the existing list we are processing to the hugetlb free
list.  Therefore, the page for which we could not restore vmemmap is not
on the list for that 'if (clear_dtor)' block of code.
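
The list behavior described above can be checked with a small userspace model. This is a sketch under assumptions, not kernel code: `toy_node` imitates a circular doubly linked list, `toy_folio`, `toy_bulk` and the other `toy_*` names are hypothetical, and moving a folio to `free_list` only stands in for the effect attributed above to add_hugetlb_folio().

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, in the spirit of struct list_head. */
struct toy_node {
	struct toy_node *prev, *next;
};

static void toy_list_init(struct toy_node *head)
{
	head->prev = head->next = head;
}

static void toy_list_add(struct toy_node *n, struct toy_node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

static void toy_list_move(struct toy_node *n, struct toy_node *head)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	toy_list_add(n, head);
}

struct toy_folio {
	struct toy_node lru;	/* first member, so a node pointer casts back */
	bool restore_fails;	/* models a failed vmemmap restore */
	bool has_dtor;		/* models the hugetlb destructor */
};

static void toy_bulk(struct toy_node *list, struct toy_node *free_list)
{
	struct toy_node *pos, *tmp;
	bool clear_dtor = false;

	/* Pass 1: "safe" walk, next is saved before an entry may be moved. */
	for (pos = list->next, tmp = pos->next; pos != list;
	     pos = tmp, tmp = pos->next) {
		struct toy_folio *f = (struct toy_folio *)pos;

		if (f->restore_fails)
			toy_list_move(pos, free_list);	/* leaves 'list' */
		else
			clear_dtor = true;
	}

	/* Pass 2: only folios still on 'list' lose their destructor. */
	if (clear_dtor)
		for (pos = list->next; pos != list; pos = pos->next)
			((struct toy_folio *)pos)->has_dtor = false;
}

/* One folio restores fine, one fails; the failed one must keep its dtor. */
static bool toy_failed_folio_keeps_dtor(void)
{
	struct toy_node list, free_list;
	struct toy_folio ok = { .restore_fails = false, .has_dtor = true };
	struct toy_folio bad = { .restore_fails = true, .has_dtor = true };

	toy_list_init(&list);
	toy_list_init(&free_list);
	toy_list_add(&ok.lru, &list);
	toy_list_add(&bad.lru, &list);

	toy_bulk(&list, &free_list);
	return bad.has_dtor && !ok.has_dtor && free_list.next == &bad.lru;
}
```

Because the failed folio is unlinked during the first walk, the second walk never reaches it, so setting `clear_dtor` for the other folio does not disturb its destructor.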
James Houghton July 20, 2023, 12:50 a.m. UTC | #6
On Wed, Jul 19, 2023 at 5:19 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 07/19/23 17:02, James Houghton wrote:
> > On Tue, Jul 18, 2023 at 9:47 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > >
> > > On 07/18/23 09:31, James Houghton wrote:
> > > > On Mon, Jul 17, 2023 at 5:50 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > > > > +        * destructor of all pages on list.
> > > > > +        */
> > > > > +       if (clear_dtor) {
> > > > > +               spin_lock_irq(&hugetlb_lock);
> > > > > +               list_for_each_entry(page, list, lru)
> > > > > +                       __clear_hugetlb_destructor(h, page_folio(page));
> > > > > +               spin_unlock_irq(&hugetlb_lock);
> > > > >         }
> > > >
> > > > I'm not too familiar with this code, but the above block seems weird
> > > > to me. If we successfully allocated the vmemmap for *any* folio, we
> > > > clear the hugetlb destructor for all the folios? I feel like we should
> > > > only be clearing the hugetlb destructor for all folios if the vmemmap
> > > > allocation succeeded for *all* folios. If the code is functionally
> > > > correct as is, I'm a little bit confused why we need `clear_dtor`; it
> > > > seems like this function doesn't really need it. (I could have some
> > > > huge misunderstanding here.)
> > > >
> > >
> > > Yes, it is a bit strange.
> > >
> > > I was thinking this has to also handle the case where hugetlb vmemmap
> > > optimization is off system wide.  In that case, clear_dtor would never
> > > be set and there is no sense in ever walking the list and calling
> > > __clear_hugetlb_destructor(), which would be a NOOP in this case.  Think
> > > of the case where there are TBs of hugetlb pages.
> > >
> > > That is one of the reasons I made __clear_hugetlb_destructor() check
> > > for the need to modify the destructor.  The other reason is in the
> > > dissolve_free_huge_page() code path where we allocate vmemmap.  I
> > > suppose, there could be an explicit call to __clear_hugetlb_destructor()
> > > in dissolve_free_huge_page.  But, I thought it might be better if
> > > we just handled both cases here.
> > >
> > > My thinking is that the clear_dtor boolean would tell us if vmemmap was
> > > restored for ANY hugetlb page.  I am aware that just because vmemmap was
> > > allocated for one page, does not mean that it was allocated for others.
> > > However, in the common case where hugetlb vmemmap optimization is on
> > > system wide, we would have allocated vmemmap for all pages on the list
> > > and would need to clear the destructor for them all.
> > >
> > > So, clear_dtor is really just an optimization for the
> > > hugetlb_free_vmemmap=off case.  Perhaps that is just overthinking and
> > > not a useful micro-optimization.
> >
> > Ok I think I understand; I think the micro-optimization is fine to
> > add. But I think there's still a bug here:
> >
> > If we have two vmemmap-optimized hugetlb pages and restoring the page
> > structs for one of them fails, that page will end up with the
> > incorrect dtor (add_hugetlb_folio will set it properly, but then we
> > clear it afterwards because clear_dtor was set).
> >
> > What do you think?
>
> add_hugetlb_folio() will call enqueue_hugetlb_folio() which will move
> the folio from the existing list we are processing to the hugetlb free
> list.  Therefore, the page for which we could not restore vmemmap is not
> on the list for that 'if (clear_dtor)' block of code.

Oh, I see. Thanks! Unless you think it's pretty obvious, perhaps a
comment would be good to add here, to explain that folios are removed
from 'list' if their vmemmap isn't restored.

Unrelated nit: I think you mean to use
folio_test_hugetlb_vmemmap_optimized instead of HPageVmemmapOptimized
in this patch.

Feel free to add:

Acked-by: James Houghton <jthoughton@google.com>


>
> --
> Mike Kravetz

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4a910121a647..e6b780291539 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1856,13 +1856,43 @@  static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,
 static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
 {
 	struct page *page, *t_page;
-	struct folio *folio;
+	bool clear_dtor = false;
 
+	/*
+	 * First allocate required vmemmmap for all pages on list.  If vmemmap
+	 * can not be allocated, we can not free page to lower level allocator,
+	 * so add back as hugetlb surplus page.
+	 */
 	list_for_each_entry_safe(page, t_page, list, lru) {
-		folio = page_folio(page);
-		update_and_free_hugetlb_folio(h, folio, false);
-		cond_resched();
+		if (HPageVmemmapOptimized(page)) {
+			if (hugetlb_vmemmap_restore(h, page)) {
+				spin_lock_irq(&hugetlb_lock);
+				add_hugetlb_folio(h, page_folio(page), true);
+				spin_unlock_irq(&hugetlb_lock);
+			} else
+				clear_dtor = true;
+			cond_resched();
+		}
+	}
+
+	/*
+	 * If vmemmmap allocation performed above, then take lock to clear
+	 * destructor of all pages on list.
+	 */
+	if (clear_dtor) {
+		spin_lock_irq(&hugetlb_lock);
+		list_for_each_entry(page, list, lru)
+			__clear_hugetlb_destructor(h, page_folio(page));
+		spin_unlock_irq(&hugetlb_lock);
 	}
+
+	/*
+	 * Free pages back to low level allocators.  vmemmap and destructors
+	 * were taken care of above, so update_and_free_hugetlb_folio will
+	 * not need to take hugetlb lock.
+	 */
+	list_for_each_entry_safe(page, t_page, list, lru)
+		update_and_free_hugetlb_folio(h, page_folio(page), false);
 }
 
 struct hstate *size_to_hstate(unsigned long size)