
[v3,1/2] mm: Make alloc_contig_range handle free hugetlb pages

Message ID 20210222135137.25717-2-osalvador@suse.de (mailing list archive)
State New, archived
Series Make alloc_contig_range handle Hugetlb pages

Commit Message

Oscar Salvador Feb. 22, 2021, 1:51 p.m. UTC
alloc_contig_range will fail if it ever sees a HugeTLB page within the
range we are trying to allocate, even when that page is free and can be
easily reallocated.
This has proved to be problematic for some users of alloc_contig_range,
e.g. CMA and virtio-mem, which would fail the call even when those
pages lie in ZONE_MOVABLE and are free.
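
For context, this is the sort of call that currently fails as soon as a free
HugeTLB page sits anywhere in the requested range (illustrative sketch only;
start, end and the migratetype are whatever the caller uses, e.g. CMA passes
MIGRATE_CMA):

	/*
	 * Typically comes back with -EBUSY today if a free HugeTLB page
	 * lies within [start, end), even though that page could simply
	 * be re-created elsewhere.
	 */
	ret = alloc_contig_range(start, end, MIGRATE_MOVABLE, GFP_KERNEL);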

We can do better by trying to replace such a page.

Free hugepages are tricky to handle: so that no userspace application
notices any disruption, we need to replace the current free hugepage with
a new one.

In order to do that, a new function called alloc_and_dissolve_huge_page
is introduced.
This function will first try to allocate a fresh hugepage, and if it
succeeds, it will replace the old one in the free hugepage pool.

All operations are handled under hugetlb_lock, so no races are
possible. The only exception is when the page's refcount is 0 but it
has not yet been flagged as PageHugeFreed.
In that case we retry, as the race window is quite small and we have a
high chance of succeeding next time.

With regard to the allocation, we restrict it to the node the page belongs
to with __GFP_THISNODE, meaning we do not fall back to other nodes' zones.

Note that gigantic hugetlb pages are fenced off since there is a cyclic
dependency between them and alloc_contig_range.
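
In outline, the new path looks like this (condensed from the patch at the
bottom of this page; locking details and error handling trimmed):

	isolate_migratepages_block()
	  isolate_or_dissolve_huge_page(page)                  /* new */
	    alloc_and_dissolve_huge_page(h, head)
	      new_page = alloc_fresh_huge_page(h, htlb_alloc_mask(h) | __GFP_THISNODE,
					       nid, NULL, NULL)
	      spin_lock(hugetlb_lock)
	      if (!PageHuge(old_page))           -> freed under us, drop new_page
	      else if (page_count(old_page))     -> in use, fail with -EBUSY
	      else if (!PageHugeFreed(old_page)) -> tiny race window, retry
	      else                               -> replace old_page with new_page
	                                            on the free list
	      spin_unlock(hugetlb_lock)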

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 include/linux/hugetlb.h |   6 +++
 mm/compaction.c         |  12 ++++++
 mm/hugetlb.c            | 111 +++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 127 insertions(+), 2 deletions(-)

Comments

Mike Kravetz Feb. 25, 2021, 8:03 p.m. UTC | #1
On 2/22/21 5:51 AM, Oscar Salvador wrote:
> alloc_contig_range will fail if it ever sees a HugeTLB page within the
> range we are trying to allocate, even when that page is free and can be
> easily reallocated.
> This has proved to be problematic for some users of alloc_contic_range,
> e.g: CMA and virtio-mem, where those would fail the call even when those
> pages lay in ZONE_MOVABLE and are free.
> 
> We can do better by trying to replace such page.
> 
> Free hugepages are tricky to handle so as to no userspace application
> notices disruption, we need to replace the current free hugepage with
> a new one.
> 
> In order to do that, a new function called alloc_and_dissolve_huge_page
> is introduced.
> This function will first try to get a new fresh hugepage, and if it
> succeeds, it will replace the old one in the free hugepage pool.
> 
> All operations are being handled under hugetlb_lock, so no races are
> possible. The only exception is when page's refcount is 0, but it still
> has not been flagged as PageHugeFreed.
> In this case we retry as the window race is quite small and we have high
> chances to succeed next time.
> 
> With regard to the allocation, we restrict it to the node the page belongs
> to with __GFP_THISNODE, meaning we do not fallback on other node's zones.
> 
> Note that gigantic hugetlb pages are fenced off since there is a cyclic
> dependency between them and alloc_contig_range.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>

Thanks Oscar,

I spent a bunch of time looking for possible race issues.  Thankfully,
the recent code from Muchun dealing with free lists helps. In addition,
all the hugetlb accounting looks good.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

> ---
>  include/linux/hugetlb.h |   6 +++
>  mm/compaction.c         |  12 ++++++
>  mm/hugetlb.c            | 111 +++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 127 insertions(+), 2 deletions(-)
Michal Hocko Feb. 26, 2021, 8:35 a.m. UTC | #2
On Mon 22-02-21 14:51:36, Oscar Salvador wrote:
> alloc_contig_range will fail if it ever sees a HugeTLB page within the
> range we are trying to allocate, even when that page is free and can be
> easily reallocated.
> This has proved to be problematic for some users of alloc_contic_range,
> e.g: CMA and virtio-mem, where those would fail the call even when those
> pages lay in ZONE_MOVABLE and are free.
> 
> We can do better by trying to replace such page.
> 
> Free hugepages are tricky to handle so as to no userspace application
> notices disruption, we need to replace the current free hugepage with
> a new one.
> 
> In order to do that, a new function called alloc_and_dissolve_huge_page
> is introduced.
> This function will first try to get a new fresh hugepage, and if it
> succeeds, it will replace the old one in the free hugepage pool.
> 
> All operations are being handled under hugetlb_lock, so no races are
> possible. The only exception is when page's refcount is 0, but it still
> has not been flagged as PageHugeFreed.

I think it would be helpful to call out that specific case explicitly
here. I can see only one scenario (are there more?)
__free_huge_page()		isolate_or_dissolve_huge_page
				  PageHuge() == T
				  alloc_and_dissolve_huge_page
				    alloc_fresh_huge_page()
				    spin_lock(hugetlb_lock)
				    // PageHuge() && !PageHugeFreed &&
				    // !PageCount()
				    spin_unlock(hugetlb_lock)
  spin_lock(hugetlb_lock)
  1) update_and_free_page
       PageHuge() == F
       __free_pages()
  2) enqueue_huge_page
       SetPageHugeFreed()
  spin_unlock(&hugetlb_lock)			    

> In this case we retry as the window race is quite small and we have high
> chances to succeed next time.
> 
> With regard to the allocation, we restrict it to the node the page belongs
> to with __GFP_THISNODE, meaning we do not fallback on other node's zones.
> 
> Note that gigantic hugetlb pages are fenced off since there is a cyclic
> dependency between them and alloc_contig_range.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>

Thanks this looks much better than the initial version. One nit below.
Acked-by: Michal Hocko <mhocko@suse.com>

[...]
> +bool isolate_or_dissolve_huge_page(struct page *page)
> +{
> +	struct hstate *h = NULL;
> +	struct page *head;
> +	bool ret = false;
> +
> +	spin_lock(&hugetlb_lock);
> +	if (PageHuge(page)) {
> +		head = compound_head(page);
> +		h = page_hstate(head);
> +	}
> +	spin_unlock(&hugetlb_lock);
> +
> +	/*
> +	 * The page might have been dissolved from under our feet.
> +	 * If that is the case, return success as if we dissolved it ourselves.
> +	 */
> +	if (!h)
> +		return true;

I am still fighting with this construct a bit. It is not really clear
what the lock is protecting us from here. alloc_fresh_huge_page deals
with all potential races and this looks like an optimistic check to save
some work. But in fact the lock is really necessary for correctness
because hstate might be completely bogus without the lock or us holding
a reference on the page. The following construct would be more
explicit and compact. What do you think?

	struct hstate *h;

	/*
	 * The page might have been dissolved from under our feet,
	 * so make sure to carefully check the state under the lock.
	 * Return success when racing, as if we dissolved the page
	 * ourselves.
	 */
	spin_lock(&hugetlb_lock);
	if (PageHuge(page)) {
		head = compound_head(page);
		h = page_hstate(head);
	} else {
		spin_unlock(&hugetlb_lock);
		return true;
	}
	spin_unlock(&hugetlb_lock);

> +
> +	/*
> +	 * Fence off gigantic pages as there is a cyclic dependency
> +	 * between alloc_contig_range and them.
> +	 */
> +	if (hstate_is_gigantic(h))
> +		return ret;
> +
> +	if (!page_count(head) && alloc_and_dissolve_huge_page(h, head))
> +		ret = true;
> +
> +	return ret;
> +}
> +
>  struct page *alloc_huge_page(struct vm_area_struct *vma,
>  				    unsigned long addr, int avoid_reserve)
>  {
> -- 
> 2.16.3
Michal Hocko Feb. 26, 2021, 8:38 a.m. UTC | #3
On Fri 26-02-21 09:35:10, Michal Hocko wrote:
> On Mon 22-02-21 14:51:36, Oscar Salvador wrote:
> > alloc_contig_range will fail if it ever sees a HugeTLB page within the
> > range we are trying to allocate, even when that page is free and can be
> > easily reallocated.
> > This has proved to be problematic for some users of alloc_contic_range,
> > e.g: CMA and virtio-mem, where those would fail the call even when those
> > pages lay in ZONE_MOVABLE and are free.
> > 
> > We can do better by trying to replace such page.
> > 
> > Free hugepages are tricky to handle so as to no userspace application
> > notices disruption, we need to replace the current free hugepage with
> > a new one.
> > 
> > In order to do that, a new function called alloc_and_dissolve_huge_page
> > is introduced.
> > This function will first try to get a new fresh hugepage, and if it
> > succeeds, it will replace the old one in the free hugepage pool.
> > 
> > All operations are being handled under hugetlb_lock, so no races are
> > possible. The only exception is when page's refcount is 0, but it still
> > has not been flagged as PageHugeFreed.
> 
> I think it would be helpful to call out that specific case explicitly
> here. I can see only one scenario (are there more?)
> __free_huge_page()		isolate_or_dissolve_huge_page
> 				  PageHuge() == T
> 				  alloc_and_dissolve_huge_page
> 				    alloc_fresh_huge_page()
> 				    spin_lock(hugetlb_lock)
> 				    // PageHuge() && !PageHugeFreed &&
> 				    // !PageCount()
> 				    spin_unlock(hugetlb_lock)
>   spin_lock(hugetlb_lock)
>   1) update_and_free_page
>        PageHuge() == F
>        __free_pages()
>   2) enqueue_huge_page
>        SetPageHugeFreed()
>   spin_unlock(&hugetlb_lock)			    
> 
> > In this case we retry as the window race is quite small and we have high
> > chances to succeed next time.
> > 
> > With regard to the allocation, we restrict it to the node the page belongs
> > to with __GFP_THISNODE, meaning we do not fallback on other node's zones.
> > 
> > Note that gigantic hugetlb pages are fenced off since there is a cyclic
> > dependency between them and alloc_contig_range.
> > 
> > Signed-off-by: Oscar Salvador <osalvador@suse.de>
> 
> Thanks this looks much better than the initial version. One nit below.
> Acked-by: Michal Hocko <mhocko@suse.com>

Btw. if David has some numbers it would be great to add them to the
changelog.
David Hildenbrand Feb. 26, 2021, 9:25 a.m. UTC | #4
> Am 26.02.2021 um 09:38 schrieb Michal Hocko <mhocko@suse.com>:
> 
> On Fri 26-02-21 09:35:10, Michal Hocko wrote:
>>> On Mon 22-02-21 14:51:36, Oscar Salvador wrote:
>>> alloc_contig_range will fail if it ever sees a HugeTLB page within the
>>> range we are trying to allocate, even when that page is free and can be
>>> easily reallocated.
>>> This has proved to be problematic for some users of alloc_contic_range,
>>> e.g: CMA and virtio-mem, where those would fail the call even when those
>>> pages lay in ZONE_MOVABLE and are free.
>>> 
>>> We can do better by trying to replace such page.
>>> 
>>> Free hugepages are tricky to handle so as to no userspace application
>>> notices disruption, we need to replace the current free hugepage with
>>> a new one.
>>> 
>>> In order to do that, a new function called alloc_and_dissolve_huge_page
>>> is introduced.
>>> This function will first try to get a new fresh hugepage, and if it
>>> succeeds, it will replace the old one in the free hugepage pool.
>>> 
>>> All operations are being handled under hugetlb_lock, so no races are
>>> possible. The only exception is when page's refcount is 0, but it still
>>> has not been flagged as PageHugeFreed.
>> 
>> I think it would be helpful to call out that specific case explicitly
>> here. I can see only one scenario (are there more?)
>> __free_huge_page()        isolate_or_dissolve_huge_page
>>                  PageHuge() == T
>>                  alloc_and_dissolve_huge_page
>>                    alloc_fresh_huge_page()
>>                    spin_lock(hugetlb_lock)
>>                    // PageHuge() && !PageHugeFreed &&
>>                    // !PageCount()
>>                    spin_unlock(hugetlb_lock)
>>  spin_lock(hugetlb_lock)
>>  1) update_and_free_page
>>       PageHuge() == F
>>       __free_pages()
>>  2) enqueue_huge_page
>>       SetPageHugeFreed()
>>  spin_unlock(&hugetlb_lock)                
>> 
>>> In this case we retry as the window race is quite small and we have high
>>> chances to succeed next time.
>>> 
>>> With regard to the allocation, we restrict it to the node the page belongs
>>> to with __GFP_THISNODE, meaning we do not fallback on other node's zones.
>>> 
>>> Note that gigantic hugetlb pages are fenced off since there is a cyclic
>>> dependency between them and alloc_contig_range.
>>> 
>>> Signed-off-by: Oscar Salvador <osalvador@suse.de>
>> 
>> Thanks this looks much better than the initial version. One nit below.
>> Acked-by: Michal Hocko <mhocko@suse.com>
> 
> Btw. if David has some numbers it would be great to add them to the
> changelog.

I'm planning on giving both patches a churn early next week, with

a) free huge pages
b) idle allocated huge pages
c) heavily read huge pages

(Then I'm also planning on having another brief look at the patches :) )

Thanks!
Oscar Salvador Feb. 26, 2021, 9:45 a.m. UTC | #5
On Fri, Feb 26, 2021 at 09:35:09AM +0100, Michal Hocko wrote:
> I think it would be helpful to call out that specific case explicitly
> here. I can see only one scenario (are there more?)
> __free_huge_page()		isolate_or_dissolve_huge_page
> 				  PageHuge() == T
> 				  alloc_and_dissolve_huge_page
> 				    alloc_fresh_huge_page()
> 				    spin_lock(hugetlb_lock)
> 				    // PageHuge() && !PageHugeFreed &&
> 				    // !PageCount()
> 				    spin_unlock(hugetlb_lock)
>   spin_lock(hugetlb_lock)
>   1) update_and_free_page
>        PageHuge() == F
>        __free_pages()
>   2) enqueue_huge_page
>        SetPageHugeFreed()
>   spin_unlock(&hugetlb_lock)

I do not think there are more scenarios. The thing is to find a page with
!page_count && !PageHugeFreed.
AFAICS, this can only happen after
put_page()->put_page_testzero()->..->free_huge_page() gets called, but
__free_huge_page() has not reached enqueue_huge_page() yet (as you just
described above).

By calling out this case, you meant to describe it in the changelog?

> 
> > In this case we retry as the window race is quite small and we have high
> > chances to succeed next time.
> > 
> > With regard to the allocation, we restrict it to the node the page belongs
> > to with __GFP_THISNODE, meaning we do not fallback on other node's zones.
> > 
> > Note that gigantic hugetlb pages are fenced off since there is a cyclic
> > dependency between them and alloc_contig_range.
> > 
> > Signed-off-by: Oscar Salvador <osalvador@suse.de>
> 
> Thanks this looks much better than the initial version. One nit below.
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> > +bool isolate_or_dissolve_huge_page(struct page *page)
> > +{
> > +	struct hstate *h = NULL;
> > +	struct page *head;
> > +	bool ret = false;
> > +
> > +	spin_lock(&hugetlb_lock);
> > +	if (PageHuge(page)) {
> > +		head = compound_head(page);
> > +		h = page_hstate(head);
> > +	}
> > +	spin_unlock(&hugetlb_lock);
> > +
> > +	/*
> > +	 * The page might have been dissolved from under our feet.
> > +	 * If that is the case, return success as if we dissolved it ourselves.
> > +	 */
> > +	if (!h)
> > +		return true;
> 
> I am still fighting with this construct a bit. It is not really clear
> what the lock is protecting us from here. alloc_fresh_huge_page deals
> with all potential races and this looks like an optimistic check to save
> some work. But in fact the lock is really necessary for correctness
> because hstate might be completely bogus without the lock or us holding
> a reference on the poage. The following construct would be more
> explicit and compact. What do you think?
> 
> 	struct hstate *h;
> 
> 	/*
> 	 * The page might have been dissloved from under our feet
> 	 * so make sure to carefully check the state under the lock.
> 	 * Return success on when racing as if we dissloved the page
> 	 * ourselves.
> 	 */
> 	spin_lock(&hugetlb_lock);
> 	if (PageHuge(page)) {
> 		head = compound_head(page);
> 		h = page_hstate(head);
> 	} else {
> 		spin_unlock(&hugetlb_lock);
> 		return true;
> 	}
> 	spin_unlock(&hugetlb_lock);

Yes, I find this less of an eyesore.

I will fix it up in v4.
Oscar Salvador Feb. 26, 2021, 9:47 a.m. UTC | #6
On Fri, Feb 26, 2021 at 10:25:46AM +0100, David Hildenbrand wrote:
> I‘m planning on giving both patches a churn early next week, with
> 
> a) free huge pages
> b) idle allocated huge pages
> c) heavily read huge pages
> 
> (Them I‘m also planning on having another brief look at the patches :) )

That would be great, thanks!
Oscar Salvador Feb. 26, 2021, 9:48 a.m. UTC | #7
On Thu, Feb 25, 2021 at 12:03:18PM -0800, Mike Kravetz wrote:
> Thanks Oscar,
> 
> I spent a bunch of time looking for possible race issues.  Thankfully,
> the recent code from Muchun dealing with free lists helps. In addition,
> all the hugetlb acounting looks good.
> 
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

thanks for having a look Mike ;-)

> 
> > ---
> >  include/linux/hugetlb.h |   6 +++
> >  mm/compaction.c         |  12 ++++++
> >  mm/hugetlb.c            | 111 +++++++++++++++++++++++++++++++++++++++++++++++-
> >  3 files changed, 127 insertions(+), 2 deletions(-)
> -- 
> Mike Kravetz
Michal Hocko Feb. 26, 2021, 9:51 a.m. UTC | #8
On Fri 26-02-21 10:45:14, Oscar Salvador wrote:
> On Fri, Feb 26, 2021 at 09:35:09AM +0100, Michal Hocko wrote:
> > I think it would be helpful to call out that specific case explicitly
> > here. I can see only one scenario (are there more?)
> > __free_huge_page()		isolate_or_dissolve_huge_page
> > 				  PageHuge() == T
> > 				  alloc_and_dissolve_huge_page
> > 				    alloc_fresh_huge_page()
> > 				    spin_lock(hugetlb_lock)
> > 				    // PageHuge() && !PageHugeFreed &&
> > 				    // !PageCount()
> > 				    spin_unlock(hugetlb_lock)
> >   spin_lock(hugetlb_lock)
> >   1) update_and_free_page
> >        PageHuge() == F
> >        __free_pages()
> >   2) enqueue_huge_page
> >        SetPageHugeFreed()
> >   spin_unlock(&hugetlb_lock)
> 
> I do not think there are more scenarios. The thing is to find a !page_count &&
> !PageHugeFreed.
> AFAICS, this can only happen after:
> put_page->put_page_test_zero->..->_free_huge_page gets called but __free_huge_page
> has not reached enqueue_huge_page yet (as you just described above)
> 
> By calling out this case, you meant to describe it in the changelog?

Yes.
[...]
> > 	struct hstate *h;
> > 
> > 	/*
> > 	 * The page might have been dissloved from under our feet
> > 	 * so make sure to carefully check the state under the lock.
> > 	 * Return success on when racing as if we dissloved the page
> > 	 * ourselves.
> > 	 */
> > 	spin_lock(&hugetlb_lock);
> > 	if (PageHuge(page)) {
> > 		head = compound_head(page);
> > 		h = page_hstate(head);
> > 	} else {
> > 		spin_unlock(&hugetlb_lock);
> > 		return true;
> > 	}
> > 	spin_unlock(&hugetlb_lock);
> 
> Yes, I find this less eyesore.
> 
> I will fix it up in v4.

Thanks!
David Hildenbrand March 1, 2021, 2:09 p.m. UTC | #9
On 22.02.21 14:51, Oscar Salvador wrote:
> alloc_contig_range will fail if it ever sees a HugeTLB page within the
> range we are trying to allocate, even when that page is free and can be
> easily reallocated.
> This has proved to be problematic for some users of alloc_contic_range,
> e.g: CMA and virtio-mem, where those would fail the call even when those
> pages lay in ZONE_MOVABLE and are free.
> 
> We can do better by trying to replace such page.
> 
> Free hugepages are tricky to handle so as to no userspace application
> notices disruption, we need to replace the current free hugepage with
> a new one.
> 
> In order to do that, a new function called alloc_and_dissolve_huge_page
> is introduced.
> This function will first try to get a new fresh hugepage, and if it
> succeeds, it will replace the old one in the free hugepage pool.
> 
> All operations are being handled under hugetlb_lock, so no races are
> possible. The only exception is when page's refcount is 0, but it still
> has not been flagged as PageHugeFreed.
> In this case we retry as the window race is quite small and we have high
> chances to succeed next time.
> 
> With regard to the allocation, we restrict it to the node the page belongs
> to with __GFP_THISNODE, meaning we do not fallback on other node's zones.
> 
> Note that gigantic hugetlb pages are fenced off since there is a cyclic
> dependency between them and alloc_contig_range.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>   include/linux/hugetlb.h |   6 +++
>   mm/compaction.c         |  12 ++++++
>   mm/hugetlb.c            | 111 +++++++++++++++++++++++++++++++++++++++++++++++-
>   3 files changed, 127 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index b5807f23caf8..72352d718829 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -505,6 +505,7 @@ struct huge_bootmem_page {
>   	struct hstate *hstate;
>   };
>   
> +bool isolate_or_dissolve_huge_page(struct page *page);
>   struct page *alloc_huge_page(struct vm_area_struct *vma,
>   				unsigned long addr, int avoid_reserve);
>   struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> @@ -775,6 +776,11 @@ void set_page_huge_active(struct page *page);
>   #else	/* CONFIG_HUGETLB_PAGE */
>   struct hstate {};
>   
> +static inline bool isolate_or_dissolve_huge_page(struct page *page)
> +{
> +	return false;
> +}
> +
>   static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
>   					   unsigned long addr,
>   					   int avoid_reserve)
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 190ccdaa6c19..d52506ed9db7 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -905,6 +905,18 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>   			valid_page = page;
>   		}
>   
> +		if (PageHuge(page) && cc->alloc_contig) {
> +			if (!isolate_or_dissolve_huge_page(page))
> +				goto isolate_fail;


So, the callchain is:

alloc_contig_range()->__alloc_contig_migrate_range()->isolate_migratepages_range()->isolate_migratepages_block()

The case I am thinking about is if we run out of memory and would return 
-ENOMEM from alloc_and_dissolve_huge_page(). We silently drop the real 
error (e.g., -ENOMEM vs. -EBUSY vs. e.g., -EAGAIN) we had in 
isolate_or_dissolve_huge_page().


I think we should not swallow such return values in
isolate_or_dissolve_huge_page() and instead report esp. -ENOMEM
properly up this call chain now. Otherwise we'll end up retrying /
reporting -EBUSY, which is misleading.

From isolate_migratepages_range()/isolate_migratepages_block() we'll
keep reporting "pfn > 0".

a) In isolate_migratepages_range() we'll keep iterating over pageblocks 
although we should just fail with -ENOMEM right away.

b) In __alloc_contig_migrate_range() we'll keep retrying up to 5 times 
although we should just fail with -ENOMEM. We end up returning "-EBUSY" 
after retrying.

c) In alloc_contig_range() we'll continue trying to isolate although we 
should just return -ENOMEM.


I think we should start returning proper errors from
isolate_migratepages_range()/isolate_migratepages_block() on critical 
issues (-EINTR, -ENOMEM) instead of going via "!pfn vs. pfn" and 
retrying on "pfn".

So we should then fail with -ENOMEM during isolate_migratepages_range() 
cleanly, just as we would do when we get -ENOMEM during migrate_pages().
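
For illustration, a minimal sketch of the error-propagating variant being
asked for here (hypothetical code, not part of this series; the compaction
side, i.e. isolate_migratepages_block()/isolate_migratepages_range() and
__alloc_contig_migrate_range(), would additionally need to carry the negative
value up instead of collapsing everything into "!pfn vs. pfn"):

	/*
	 * Hypothetical: return 0 on success and a negative errno
	 * (-EBUSY, -ENOMEM) otherwise, so that fatal errors are not masked.
	 */
	int isolate_or_dissolve_huge_page(struct page *page)
	{
		struct hstate *h;
		struct page *head;

		/*
		 * The page might have been dissolved from under our feet,
		 * so check the state carefully under the lock.
		 */
		spin_lock(&hugetlb_lock);
		if (!PageHuge(page)) {
			spin_unlock(&hugetlb_lock);
			return 0;	/* already dissolved, nothing to do */
		}
		head = compound_head(page);
		h = page_hstate(head);
		spin_unlock(&hugetlb_lock);

		/* Gigantic pages stay fenced off, as in the patch below. */
		if (hstate_is_gigantic(h))
			return -EBUSY;	/* or -ENOMEM; a policy choice */

		if (page_count(head))
			return -EBUSY;

		/* 0, -EBUSY or -ENOMEM from the helper introduced below. */
		return alloc_and_dissolve_huge_page(h, head);
	}
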
Oscar Salvador March 4, 2021, 10:19 a.m. UTC | #10
On Mon, Mar 01, 2021 at 03:09:06PM +0100, David Hildenbrand wrote:
> On 22.02.21 14:51, Oscar Salvador wrote:
> > @@ -905,6 +905,18 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> >   			valid_page = page;
> >   		}
> > +		if (PageHuge(page) && cc->alloc_contig) {
> > +			if (!isolate_or_dissolve_huge_page(page))
> > +				goto isolate_fail;

Bleh, sorry for the lateness, David; I was fairly busy.

> So, the callchain is:
> 
> alloc_contig_range()->__alloc_contig_migrate_range()->isolate_migratepages_range()->isolate_migratepages_block()
> 
> The case I am thinking about is if we run out of memory and would return
> -ENOMEM from alloc_and_dissolve_huge_page(). We silently drop the real error
> (e.g., -ENOMEM vs. -EBUSY vs. e.g., -EAGAIN) we had in
> isolate_or_dissolve_huge_page().

Yes, that is true.

> I think we should not swallo such return values in
> isolate_or_dissolve_huge_page() and instead properly report esp. -ENOMEM
> properly up this callchain now. Otherwise we'll end up retrying / reporting
> -EBUSY, which is misleading.

I am not sure I follow you here.
So, atm, alloc_and_dissolve_huge_page can either generate -ENOMEM or
-EBUSY wrt. error codes.
-ENOMEM when we cannot allocate a page, and -EBUSY when we raced with
someone.
You mean to only report -ENOMEM down the chain?

> From isolate_migratepages_range()/isolate_migratepages_block() we'll keep
> reporting "pfn > 0".
> 
> a) In isolate_migratepages_range() we'll keep iterating over pageblocks
> although we should just fail with -ENOMEM right away.
> 
> b) In __alloc_contig_migrate_range() we'll keep retrying up to 5 times
> although we should just fail with -ENOMEM. We end up returning "-EBUSY"
> after retrying.
> 
> c) In alloc_contig_range() we'll continue trying to isolate although we
> should just return -ENOMEM.

Yes, "fatal" errors get masked, and hence we treat everything as "things
are busy, let us try again", and this is rather unfortunate.

> I think we have should start returning proper errors from
> isolate_migratepages_range()/isolate_migratepages_block() on critical issues
> (-EINTR, -ENOMEM) instead of going via "!pfn vs. pfn" and retrying on "pfn".
> 
> So we should then fail with -ENOMEM during isolate_migratepages_range()
> cleanly, just as we would do when we get -ENOMEM during migrate_pages().

I guess we could rework the interface and make isolate_migratepages_range and
isolate_migratepages_block report the right thing.
I still have to check that this does not mess too much with the compaction
interface.

But overall I agree with your point here, and I am willing to tackle
this.

The question is whether we want to do this as part of this series, or
after this series gets picked up.
IMHO, we could do it after, as a follow-up, unless you feel strongly about
it.

What do you think?
David Hildenbrand March 4, 2021, 10:32 a.m. UTC | #11
>> I think we should not swallo such return values in
>> isolate_or_dissolve_huge_page() and instead properly report esp. -ENOMEM
>> properly up this callchain now. Otherwise we'll end up retrying / reporting
>> -EBUSY, which is misleading.
> 
> I am not sure I follow you here.
> So, atm, alloc_and_dissolve_huge_page can either generate -ENOMEM or
> -EBUSY wrt. error codes.
> -ENOMEM when we cannot allocate a page, and -EBUSY when we raced with
> someone.
> You mean to only report ENOMEM down the chain?

Yes, fatal errors.

> 
>>  From isolate_migratepages_range()/isolate_migratepages_block() we'll keep
>> reporting "pfn > 0".
>>
>> a) In isolate_migratepages_range() we'll keep iterating over pageblocks
>> although we should just fail with -ENOMEM right away.
>>
>> b) In __alloc_contig_migrate_range() we'll keep retrying up to 5 times
>> although we should just fail with -ENOMEM. We end up returning "-EBUSY"
>> after retrying.
>>
>> c) In alloc_contig_range() we'll continue trying to isolate although we
>> should just return -ENOMEM.
> 
> Yes, "fatal" errors get masked, and hence we treat everything as "things
> are busy, let us try again", and this is rather unforunate.
> 
>> I think we have should start returning proper errors from
>> isolate_migratepages_range()/isolate_migratepages_block() on critical issues
>> (-EINTR, -ENOMEM) instead of going via "!pfn vs. pfn" and retrying on "pfn".
>>
>> So we should then fail with -ENOMEM during isolate_migratepages_range()
>> cleanly, just as we would do when we get -ENOMEM during migrate_pages().
> 
> I guess we could rework the interface and make isolate_migratepages_range and
> isolate_migratepages_block to report the right thing.
> I yet have to check that this does not mess up a lot with the compaction
> interface.
> 
> But overall I agree with your point here, and I am willing to to tackle
> this.
> 
> The question is whether we want to do this as part of this series, or
> after this series gets picked up.
> IMHO, we could do it after, as a follow-up, unless you feel strong about
> it.
> 
> What do you think?

I think this is now the second fatal error we can have (-EINTR, 
-ENOMEM), thus the current interface (return "NULL" on fatal errors) no
longer works properly.

No strong opinion about fixing this up on top - could be it's cleaner to 
just do it right away.
Oscar Salvador March 4, 2021, 10:41 a.m. UTC | #12
On Thu, Mar 04, 2021 at 11:32:29AM +0100, David Hildenbrand wrote:
> I think this is now the second fatal error we can have (-EINTR, -ENOMEM),
> thus the current interface (return "NULL" on fatal errros) no longer works
> properly.
> 
> No strong opinion about fixing this up on top - could be it's cleaner to
> just do it right away.

Ok, I see.

Then I will start working on that in v4.

Any more feedback to either patch#1 or patch#2?

Thanks!

Patch

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b5807f23caf8..72352d718829 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -505,6 +505,7 @@  struct huge_bootmem_page {
 	struct hstate *hstate;
 };
 
+bool isolate_or_dissolve_huge_page(struct page *page);
 struct page *alloc_huge_page(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
@@ -775,6 +776,11 @@  void set_page_huge_active(struct page *page);
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
+static inline bool isolate_or_dissolve_huge_page(struct page *page)
+{
+	return false;
+}
+
 static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
 					   unsigned long addr,
 					   int avoid_reserve)
diff --git a/mm/compaction.c b/mm/compaction.c
index 190ccdaa6c19..d52506ed9db7 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -905,6 +905,18 @@  isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			valid_page = page;
 		}
 
+		if (PageHuge(page) && cc->alloc_contig) {
+			if (!isolate_or_dissolve_huge_page(page))
+				goto isolate_fail;
+
+			/*
+			 * Ok, the hugepage was dissolved. Now these pages are
+			 * Buddy and cannot be re-allocated because they are
+			 * isolated. Fall-through as the check below handles
+			 * Buddy pages.
+			 */
+		}
+
 		/*
 		 * Skip if free. We read page order here without zone lock
 		 * which is generally unsafe, but the race window is small and
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4bdb58ab14cb..56eba64a1d33 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1037,13 +1037,18 @@  static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
 	return false;
 }
 
+static void __enqueue_huge_page(struct list_head *list, struct page *page)
+{
+	list_move(&page->lru, list);
+	SetPageHugeFreed(page);
+}
+
 static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
 	int nid = page_to_nid(page);
-	list_move(&page->lru, &h->hugepage_freelists[nid]);
+	__enqueue_huge_page(&h->hugepage_freelists[nid], page);
 	h->free_huge_pages++;
 	h->free_huge_pages_node[nid]++;
-	SetPageHugeFreed(page);
 }
 
 static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
@@ -2294,6 +2299,108 @@  static void restore_reserve_on_error(struct hstate *h,
 	}
 }
 
+/*
+ * alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old one
+ * @h: struct hstate old page belongs to
+ * @old_page: Old page to dissolve
+ * Returns 0 on success, otherwise negated error.
+ */
+
+static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
+{
+	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
+	int nid = page_to_nid(old_page);
+	struct page *new_page;
+	int ret = 0;
+
+	/*
+	 * Before dissolving the page, we need to allocate a new one,
+	 * so the pool remains stable.
+	 */
+	new_page = alloc_fresh_huge_page(h, gfp_mask, nid, NULL, NULL);
+	if (!new_page)
+		return -ENOMEM;
+
+	/*
+	 * Pages got from Buddy are self-refcounted, but free hugepages
+	 * need to have a refcount of 0.
+	 */
+	page_ref_dec(new_page);
+retry:
+	spin_lock(&hugetlb_lock);
+	if (!PageHuge(old_page)) {
+		/*
+		 * Freed from under us. Drop new_page too.
+		 */
+		update_and_free_page(h, new_page);
+		goto unlock;
+	} else if (page_count(old_page)) {
+		/*
+		 * Someone has grabbed the page, fail for now.
+		 */
+		ret = -EBUSY;
+		update_and_free_page(h, new_page);
+		goto unlock;
+	} else if (!PageHugeFreed(old_page)) {
+		/*
+		 * Page's refcount is 0 but it has not been enqueued in the
+		 * freelist yet. Race window is small, so we can succeed here if
+		 * we retry.
+		 */
+		spin_unlock(&hugetlb_lock);
+		cond_resched();
+		goto retry;
+	} else {
+		/*
+		 * Ok, old_page is still a genuine free hugepage. Replace it
+		 * with the new one.
+		 */
+		list_del(&old_page->lru);
+		update_and_free_page(h, old_page);
+		/*
+		 * h->free_huge_pages{_node} counters do not need to be updated.
+		 */
+		__enqueue_huge_page(&h->hugepage_freelists[nid], new_page);
+	}
+unlock:
+	spin_unlock(&hugetlb_lock);
+
+	return ret;
+}
+
+bool isolate_or_dissolve_huge_page(struct page *page)
+{
+	struct hstate *h = NULL;
+	struct page *head;
+	bool ret = false;
+
+	spin_lock(&hugetlb_lock);
+	if (PageHuge(page)) {
+		head = compound_head(page);
+		h = page_hstate(head);
+	}
+	spin_unlock(&hugetlb_lock);
+
+	/*
+	 * The page might have been dissolved from under our feet.
+	 * If that is the case, return success as if we dissolved it ourselves.
+	 */
+	if (!h)
+		return true;
+
+	/*
+	 * Fence off gigantic pages as there is a cyclic dependency
+	 * between alloc_contig_range and them.
+	 */
+	if (hstate_is_gigantic(h))
+		return ret;
+
+	if (!page_count(head) && alloc_and_dissolve_huge_page(h, head))
+		ret = true;
+
+	return ret;
+}
+
 struct page *alloc_huge_page(struct vm_area_struct *vma,
 				    unsigned long addr, int avoid_reserve)
 {