[10/10] mm/hugetlb: not necessary to abuse temporary page to workaround the nasty free_huge_page

Message ID 20200807091251.12129-11-richard.weiyang@linux.alibaba.com
State New
Series
  • mm/hugetlb: code refine and simplification

Commit Message

Wei Yang Aug. 7, 2020, 9:12 a.m. UTC
Let's always increase surplus_huge_pages so that free_huge_page
can decrease it at free time.

Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
---
 mm/hugetlb.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

Comments

Baoquan He Aug. 10, 2020, 2:17 a.m. UTC | #1
On 08/07/20 at 05:12pm, Wei Yang wrote:
> Let's always increase surplus_huge_pages and so that free_huge_page
> could decrease it at free time.
> 
> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
> ---
>  mm/hugetlb.c | 14 ++++++--------
>  1 file changed, 6 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 1f2010c9dd8d..a0eb81e0e4c5 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1913,21 +1913,19 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
>  		return NULL;
>  
>  	spin_lock(&hugetlb_lock);
> +
> +	h->surplus_huge_pages++;
> +	h->surplus_huge_pages_node[page_to_nid(page)]++;
> +
>  	/*
>  	 * We could have raced with the pool size change.
>  	 * Double check that and simply deallocate the new page
> -	 * if we would end up overcommiting the surpluses. Abuse
> -	 * temporary page to workaround the nasty free_huge_page
> -	 * codeflow
> +	 * if we would end up overcommiting the surpluses.
>  	 */
> -	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
> -		SetPageHugeTemporary(page);

Hmm, the temporary page way is taken intentionally in
commit 9980d744a0428 ("mm, hugetlb: get rid of surplus page accounting tricks").
From the code, this is done while holding hugetlb_lock, and the code flow
is straightforward, so it should be safe. Adding Michal to CC.


> +	if (h->surplus_huge_pages > h->nr_overcommit_huge_pages) {
>  		spin_unlock(&hugetlb_lock);
>  		put_page(page);
>  		return NULL;
> -	} else {
> -		h->surplus_huge_pages++;
> -		h->surplus_huge_pages_node[page_to_nid(page)]++;
>  	}
>  
>  out_unlock:
> -- 
> 2.20.1 (Apple Git-117)
> 
>
Mike Kravetz Aug. 11, 2020, 12:19 a.m. UTC | #2
On 8/9/20 7:17 PM, Baoquan He wrote:
> On 08/07/20 at 05:12pm, Wei Yang wrote:
>> Let's always increase surplus_huge_pages and so that free_huge_page
>> could decrease it at free time.
>>
>> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
>> ---
>>  mm/hugetlb.c | 14 ++++++--------
>>  1 file changed, 6 insertions(+), 8 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 1f2010c9dd8d..a0eb81e0e4c5 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1913,21 +1913,19 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
>>  		return NULL;
>>  
>>  	spin_lock(&hugetlb_lock);
>> +
>> +	h->surplus_huge_pages++;
>> +	h->surplus_huge_pages_node[page_to_nid(page)]++;
>> +
>>  	/*
>>  	 * We could have raced with the pool size change.
>>  	 * Double check that and simply deallocate the new page
>> -	 * if we would end up overcommiting the surpluses. Abuse
>> -	 * temporary page to workaround the nasty free_huge_page
>> -	 * codeflow
>> +	 * if we would end up overcommiting the surpluses.
>>  	 */
>> -	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
>> -		SetPageHugeTemporary(page);
> 
> Hmm, the temporary page way is taken intentionally in
> commit 9980d744a0428 ("mm, hugetlb: get rid of surplus page accounting tricks").
> From code, this is done inside hugetlb_lock holding, and the code flow
> is straightforward, should be safe. Adding Michal to CC.
> 

I remember when the temporary page code was added for page migration.
The use of temporary page here was added at about the same time.  Temporary
page does have one advantage in that it will not CAUSE surplus count to
exceed overcommit.  This patch could cause surplus to exceed overcommit
for a very short period of time.  However, do note that for this to happen
the code needs to race with a pool resize which itself could cause surplus
to exceed overcommit.

IMO both approaches are valid.
- Advantage of temporary page is that it can not cause surplus to exceed
  overcommit.  Disadvantage is as mentioned in the comment 'abuse of temporary
  page'.
- Advantage of this patch is that it uses existing counters.  Disadvantage
  is that it can momentarily cause surplus to exceed overcommit.

Unless someone has a strong opinion, I prefer the changes in this patch.
Baoquan He Aug. 11, 2020, 1:51 a.m. UTC | #3
On 08/10/20 at 05:19pm, Mike Kravetz wrote:
> On 8/9/20 7:17 PM, Baoquan He wrote:
> > On 08/07/20 at 05:12pm, Wei Yang wrote:
> >> Let's always increase surplus_huge_pages and so that free_huge_page
> >> could decrease it at free time.
> >>
> >> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
> >> ---
> >>  mm/hugetlb.c | 14 ++++++--------
> >>  1 file changed, 6 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> index 1f2010c9dd8d..a0eb81e0e4c5 100644
> >> --- a/mm/hugetlb.c
> >> +++ b/mm/hugetlb.c
> >> @@ -1913,21 +1913,19 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
> >>  		return NULL;
> >>  
> >>  	spin_lock(&hugetlb_lock);
> >> +
> >> +	h->surplus_huge_pages++;
> >> +	h->surplus_huge_pages_node[page_to_nid(page)]++;
> >> +
> >>  	/*
> >>  	 * We could have raced with the pool size change.
> >>  	 * Double check that and simply deallocate the new page
> >> -	 * if we would end up overcommiting the surpluses. Abuse
> >> -	 * temporary page to workaround the nasty free_huge_page
> >> -	 * codeflow
> >> +	 * if we would end up overcommiting the surpluses.
> >>  	 */
> >> -	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
> >> -		SetPageHugeTemporary(page);
> > 
> > Hmm, the temporary page way is taken intentionally in
> > commit 9980d744a0428 ("mm, hugetlb: get rid of surplus page accounting tricks").
> > From code, this is done inside hugetlb_lock holding, and the code flow
> > is straightforward, should be safe. Adding Michal to CC.
> > 
> 
> I remember when the temporary page code was added for page migration.
> The use of temporary page here was added at about the same time.  Temporary
> page does have one advantage in that it will not CAUSE surplus count to
> exceed overcommit.  This patch could cause surplus to exceed overcommit
> for a very short period of time.  However, do note that for this to happen
> the code needs to race with a pool resize which itself could cause surplus
> to exceed overcommit.
> 
> IMO both approaches are valid.
> - Advantage of temporary page is that it can not cause surplus to exceed
>   overcommit.  Disadvantage is as mentioned in the comment 'abuse of temporary
>   page'.
> - Advantage of this patch is that it uses existing counters.  Disadvantage
>   is that it can momentarily cause surplus to exceed overcommit.

Yeah, since it's all done inside hugetlb_lock, it should be OK even
though it may cause surplus to exceed overcommit.
> 
> Unless someone has a strong opinion, I prefer the changes in this patch.

Agree, I also prefer the code change in this patch, to remove the
unnecessary confusion about the temporary page.
Michal Hocko Aug. 11, 2020, 6:54 a.m. UTC | #4
On Tue 11-08-20 09:51:48, Baoquan He wrote:
> On 08/10/20 at 05:19pm, Mike Kravetz wrote:
> > On 8/9/20 7:17 PM, Baoquan He wrote:
> > > On 08/07/20 at 05:12pm, Wei Yang wrote:
> > >> Let's always increase surplus_huge_pages and so that free_huge_page
> > >> could decrease it at free time.
> > >>
> > >> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
> > >> ---
> > >>  mm/hugetlb.c | 14 ++++++--------
> > >>  1 file changed, 6 insertions(+), 8 deletions(-)
> > >>
> > >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > >> index 1f2010c9dd8d..a0eb81e0e4c5 100644
> > >> --- a/mm/hugetlb.c
> > >> +++ b/mm/hugetlb.c
> > >> @@ -1913,21 +1913,19 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
> > >>  		return NULL;
> > >>  
> > >>  	spin_lock(&hugetlb_lock);
> > >> +
> > >> +	h->surplus_huge_pages++;
> > >> +	h->surplus_huge_pages_node[page_to_nid(page)]++;
> > >> +
> > >>  	/*
> > >>  	 * We could have raced with the pool size change.
> > >>  	 * Double check that and simply deallocate the new page
> > >> -	 * if we would end up overcommiting the surpluses. Abuse
> > >> -	 * temporary page to workaround the nasty free_huge_page
> > >> -	 * codeflow
> > >> +	 * if we would end up overcommiting the surpluses.
> > >>  	 */
> > >> -	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
> > >> -		SetPageHugeTemporary(page);
> > > 
> > > Hmm, the temporary page way is taken intentionally in
> > > commit 9980d744a0428 ("mm, hugetlb: get rid of surplus page accounting tricks").
> > > From code, this is done inside hugetlb_lock holding, and the code flow
> > > is straightforward, should be safe. Adding Michal to CC.

But the lock is not held during the migration, right?

> > I remember when the temporary page code was added for page migration.
> > The use of temporary page here was added at about the same time.  Temporary
> > page does have one advantage in that it will not CAUSE surplus count to
> > exceed overcommit.  This patch could cause surplus to exceed overcommit
> > for a very short period of time.  However, do note that for this to happen
> > the code needs to race with a pool resize which itself could cause surplus
> > to exceed overcommit.

Correct.

> > IMO both approaches are valid.
> > - Advantage of temporary page is that it can not cause surplus to exceed
> >   overcommit.  Disadvantage is as mentioned in the comment 'abuse of temporary
> >   page'.
> > - Advantage of this patch is that it uses existing counters.  Disadvantage
> >   is that it can momentarily cause surplus to exceed overcommit.

Do I remember correctly that this can cause an allocation failure due to
the overcommit check? In other words, would it be a user-space visible thing?

> Yeah, since it's all done inside hugetlb_lock, should be OK even
> though it may cause surplus to exceed overcommit.
> > 
> > Unless someone has a strong opinion, I prefer the changes in this patch.
> 
> Agree, I also prefer the code change in this patch, to remove the
> unnecessary confusion about the temporary page.

I have managed to forget all the juicy details since I made that
change. All that remains is that the surplus pages accounting was quite
tricky and back then I didn't figure out a simpler method that would
achieve a consistent view of those counters. As mentioned above I
suspect this could lead to premature allocation failures while the
migration is ongoing. Sure, it's quite unlikely to happen and the race
window is likely very small. Maybe this is even acceptable, but I would
strongly recommend having all this thinking documented in the changelog.
Mike Kravetz Aug. 11, 2020, 9:43 p.m. UTC | #5
On 8/10/20 11:54 PM, Michal Hocko wrote:
> 
> I have managed to forgot all the juicy details since I have made that
> change. All that remains is that the surplus pages accounting was quite
> tricky and back then I didn't figure out a simpler method that would
> achieve the consistent look at those counters. As mentioned above I
> suspect this could lead to pre-mature allocation failures while the
> migration is ongoing.

It is likely lost in the e-mail thread, but the suggested change was to
alloc_surplus_huge_page().  The code which allocates the migration target
(alloc_migrate_huge_page) will not be changed.  So, this should not be
an issue.

>                       Sure quite unlikely to happen and the race window
> is likely very small. Maybe this is even acceptable but I would strongly
> recommend to have all this thinking documented in the changelog.

I wrote down a description of what happens in the two different approaches
"temporary page" vs "surplus page".  It is at the very end of this e-mail.
When looking at the details, I came up with what may be an even better
approach.  Why not just call the low level routine to free the page instead
of going through put_page/free_huge_page?  At the very least, it saves a
lock roundtrip and there is no need to worry about the counters/accounting.

Here is a patch to do that.  However, we are optimizing a return path in
a race condition that we are unlikely to ever hit.  I 'tested' it by allocating
an 'extra' page and freeing it via this method in alloc_surplus_huge_page.

From 864c5f8ef4900c95ca3f6f2363a85f3cb25e793e Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Tue, 11 Aug 2020 12:45:41 -0700
Subject: [PATCH] hugetlb: optimize race error return in
 alloc_surplus_huge_page

The routine alloc_surplus_huge_page() could race with a pool
size change.  If this happens, the allocated page may not be needed.
To free the page, the current code will 'Abuse temporary page to
workaround the nasty free_huge_page codeflow'.  Instead, directly
call the low level routine that free_huge_page uses.  This works
out well because the page is new, we hold the only reference and
already hold the hugetlb_lock.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 590111ea6975..ac89b91fba86 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1923,14 +1923,17 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 	/*
 	 * We could have raced with the pool size change.
 	 * Double check that and simply deallocate the new page
-	 * if we would end up overcommiting the surpluses. Abuse
-	 * temporary page to workaround the nasty free_huge_page
-	 * codeflow
+	 * if we would end up overcommiting the surpluses.
 	 */
 	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
-		SetPageHugeTemporary(page);
+		/*
+		 * Since this page is new, we hold the only reference, and
+	 * we already hold the hugetlb_lock, call the low level free
+		 * page routine.  This saves at least a lock roundtrip.
+		 */
+		(void)put_page_testzero(page); /* don't call destructor */
+		update_and_free_page(h, page);
 		spin_unlock(&hugetlb_lock);
-		put_page(page);
 		return NULL;
 	} else {
 		h->surplus_huge_pages++;
Wei Yang Aug. 11, 2020, 11:19 p.m. UTC | #6
On Tue, Aug 11, 2020 at 02:43:28PM -0700, Mike Kravetz wrote:
>On 8/10/20 11:54 PM, Michal Hocko wrote:
>> 
>> I have managed to forgot all the juicy details since I have made that
>> change. All that remains is that the surplus pages accounting was quite
>> tricky and back then I didn't figure out a simpler method that would
>> achieve the consistent look at those counters. As mentioned above I
>> suspect this could lead to pre-mature allocation failures while the
>> migration is ongoing.
>
>It is likely lost in the e-mail thread, but the suggested change was to
>alloc_surplus_huge_page().  The code which allocates the migration target
>(alloc_migrate_huge_page) will not be changed.  So, this should not be
>an issue.
>
>>                       Sure quite unlikely to happen and the race window
>> is likely very small. Maybe this is even acceptable but I would strongly
>> recommend to have all this thinking documented in the changelog.
>
>I wrote down a description of what happens in the two different approaches
>"temporary page" vs "surplus page".  It is at the very end of this e-mail.
>When looking at the details, I came up with what may be an even better
>approach.  Why not just call the low level routine to free the page instead
>of going through put_page/free_huge_page?  At the very least, it saves a
>lock roundtrip and there is no need to worry about the counters/accounting.
>
>Here is a patch to do that.  However, we are optimizing a return path in
>a race condition that we are unlikely to ever hit.  I 'tested' it by allocating
>an 'extra' page and freeing it via this method in alloc_surplus_huge_page.
>
>>From 864c5f8ef4900c95ca3f6f2363a85f3cb25e793e Mon Sep 17 00:00:00 2001
>From: Mike Kravetz <mike.kravetz@oracle.com>
>Date: Tue, 11 Aug 2020 12:45:41 -0700
>Subject: [PATCH] hugetlb: optimize race error return in
> alloc_surplus_huge_page
>
>The routine alloc_surplus_huge_page() could race with with a pool
>size change.  If this happens, the allocated page may not be needed.
>To free the page, the current code will 'Abuse temporary page to
>workaround the nasty free_huge_page codeflow'.  Instead, directly
>call the low level routine that free_huge_page uses.  This works
>out well because the page is new, we hold the only reference and
>already hold the hugetlb_lock.
>
>Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>---
> mm/hugetlb.c | 13 ++++++++-----
> 1 file changed, 8 insertions(+), 5 deletions(-)
>
>diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>index 590111ea6975..ac89b91fba86 100644
>--- a/mm/hugetlb.c
>+++ b/mm/hugetlb.c
>@@ -1923,14 +1923,17 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
> 	/*
> 	 * We could have raced with the pool size change.
> 	 * Double check that and simply deallocate the new page
>-	 * if we would end up overcommiting the surpluses. Abuse
>-	 * temporary page to workaround the nasty free_huge_page
>-	 * codeflow
>+	 * if we would end up overcommiting the surpluses.
> 	 */
> 	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
>-		SetPageHugeTemporary(page);
>+		/*
>+		 * Since this page is new, we hold the only reference, and
>+		 * we already hold the hugetlb_lock call the low level free
>+		 * page routine.  This saves at least a lock roundtrip.

The change looks good to me, though I may not fully understand the "lock
roundtrip". You mean we don't need to release the hugetlb_lock?

>+		 */
>+		(void)put_page_testzero(page); /* don't call destructor */
>+		update_and_free_page(h, page);
> 		spin_unlock(&hugetlb_lock);
>-		put_page(page);
> 		return NULL;
> 	} else {
> 		h->surplus_huge_pages++;
>-- 
>2.25.4
>
>
>Here is a description of the difference in "Temporary Page" vs "Surplus
>Page" approach.
>
>Both only allocate a fresh huge page if surplus_huge_pages is less than
>nr_overcommit_huge_pages.  Of course, the lock protecting those counts
>must be dropped to perform the allocation.  After reacquiring the lock
>is where we have the proposed difference in behavior.
>
>temporary page behavior
>-----------------------
>if surplus_huge_pages >= h->nr_overcommit_huge_pages
>	SetPageHugeTemporary(page)
>	spin_unlock(&hugetlb_lock);
>	put_page(page);
>
>At this time we know surplus_huge_pages is 'at least' nr_overcommit_huge_pages.
>As a result, any new allocation will fail.
>Only user visible result is that number of huge pages will be one greater than
>that specified by user and overcommit values.  This is only visible for the
>short time until the page is actually freed as a result of put_page().
>
>free_huge_page()
>	number of huge pages will be decremented
>
>suprlus page behavior
>---------------------
>surplus_huge_pages++
>surplus_huge_pages_node[page_to_nid(page)]++
>if surplus_huge_pages > nr_overcommit_huge_pages
>	spin_unlock(&hugetlb_lock);
>	put_page(page);
>
>At this time we know surplus_huge_pages is greater than
>nr_overcommit_huge_pages.  As a result, any new allocation will fail.
>User visible result is an increase in surplus pages as well as number of
>huge pages.  In addition, surplus pages will exceed overcommit.  This is
>only visible for the short time until the page is actually freed as a
>result of put_page().
>
>free_huge_page()
>	number of huge pages will be decremented
>	h->surplus_huge_pages--;
>	h->surplus_huge_pages_node[nid]--;
Mike Kravetz Aug. 11, 2020, 11:25 p.m. UTC | #7
On 8/11/20 4:19 PM, Wei Yang wrote:
> On Tue, Aug 11, 2020 at 02:43:28PM -0700, Mike Kravetz wrote:
>> Subject: [PATCH] hugetlb: optimize race error return in
>> alloc_surplus_huge_page
>>
>> The routine alloc_surplus_huge_page() could race with with a pool
>> size change.  If this happens, the allocated page may not be needed.
>> To free the page, the current code will 'Abuse temporary page to
>> workaround the nasty free_huge_page codeflow'.  Instead, directly
>> call the low level routine that free_huge_page uses.  This works
>> out well because the page is new, we hold the only reference and
>> already hold the hugetlb_lock.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>> mm/hugetlb.c | 13 ++++++++-----
>> 1 file changed, 8 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 590111ea6975..ac89b91fba86 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1923,14 +1923,17 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
>> 	/*
>> 	 * We could have raced with the pool size change.
>> 	 * Double check that and simply deallocate the new page
>> -	 * if we would end up overcommiting the surpluses. Abuse
>> -	 * temporary page to workaround the nasty free_huge_page
>> -	 * codeflow
>> +	 * if we would end up overcommiting the surpluses.
>> 	 */
>> 	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
>> -		SetPageHugeTemporary(page);
>> +		/*
>> +		 * Since this page is new, we hold the only reference, and
>> +		 * we already hold the hugetlb_lock call the low level free
>> +		 * page routine.  This saves at least a lock roundtrip.
> 
> The change looks good to me, while I may not understand the "lock roundtrip".
> You mean we don't need to release the hugetlb_lock?

Correct.
Normally we would free the page via free_huge_page() processing.  To do that
we need to drop hugetlb_lock and call put_page/free_huge_page which will
need to acquire the hugetlb_lock again.
Baoquan He Aug. 11, 2020, 11:55 p.m. UTC | #8
On 08/11/20 at 08:54am, Michal Hocko wrote:
> On Tue 11-08-20 09:51:48, Baoquan He wrote:
> > On 08/10/20 at 05:19pm, Mike Kravetz wrote:
> > > On 8/9/20 7:17 PM, Baoquan He wrote:
> > > > On 08/07/20 at 05:12pm, Wei Yang wrote:
> > > >> Let's always increase surplus_huge_pages and so that free_huge_page
> > > >> could decrease it at free time.
> > > >>
> > > >> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
> > > >> ---
> > > >>  mm/hugetlb.c | 14 ++++++--------
> > > >>  1 file changed, 6 insertions(+), 8 deletions(-)
> > > >>
> > > >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > >> index 1f2010c9dd8d..a0eb81e0e4c5 100644
> > > >> --- a/mm/hugetlb.c
> > > >> +++ b/mm/hugetlb.c
> > > >> @@ -1913,21 +1913,19 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
> > > >>  		return NULL;
> > > >>  
> > > >>  	spin_lock(&hugetlb_lock);
> > > >> +
> > > >> +	h->surplus_huge_pages++;
> > > >> +	h->surplus_huge_pages_node[page_to_nid(page)]++;
> > > >> +
> > > >>  	/*
> > > >>  	 * We could have raced with the pool size change.
> > > >>  	 * Double check that and simply deallocate the new page
> > > >> -	 * if we would end up overcommiting the surpluses. Abuse
> > > >> -	 * temporary page to workaround the nasty free_huge_page
> > > >> -	 * codeflow
> > > >> +	 * if we would end up overcommiting the surpluses.
> > > >>  	 */
> > > >> -	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
> > > >> -		SetPageHugeTemporary(page);
> > > > 
> > > > Hmm, the temporary page way is taken intentionally in
> > > > commit 9980d744a0428 ("mm, hugetlb: get rid of surplus page accounting tricks").
> > > > From code, this is done inside hugetlb_lock holding, and the code flow
> > > > is straightforward, should be safe. Adding Michal to CC.
> 
> But the lock is not held during the migration, right?

I see what I misunderstood about the hugetlb_lock holding. The
put_page() is called after releasing hugetlb_lock in
alloc_surplus_huge_page(); I mistakenly thought put_page() was inside
hugetlb_lock. Yes, there's obviously a race window, and the temporary
page way is an effective way to not mess up the surplus_huge_pages
accounting.

> 
> > > I remember when the temporary page code was added for page migration.
> > > The use of temporary page here was added at about the same time.  Temporary
> > > page does have one advantage in that it will not CAUSE surplus count to
> > > exceed overcommit.  This patch could cause surplus to exceed overcommit
> > > for a very short period of time.  However, do note that for this to happen
> > > the code needs to race with a pool resize which itself could cause surplus
> > > to exceed overcommit.
> 
> Correct.
> 
> > > IMO both approaches are valid.
> > > - Advantage of temporary page is that it can not cause surplus to exceed
> > >   overcommit.  Disadvantage is as mentioned in the comment 'abuse of temporary
> > >   page'.
> > > - Advantage of this patch is that it uses existing counters.  Disadvantage
> > >   is that it can momentarily cause surplus to exceed overcommit.
> 
> Do I remember correctly that this can cause an allocation failure due to
> overcommit check? In other words it would be user space visible thing?
> 
> > Yeah, since it's all done inside hugetlb_lock, should be OK even
> > though it may cause surplus to exceed overcommit.
> > > 
> > > Unless someone has a strong opinion, I prefer the changes in this patch.
> > 
> > Agree, I also prefer the code change in this patch, to remove the
> > unnecessary confusion about the temporary page.
> 
> I have managed to forgot all the juicy details since I have made that
> change. All that remains is that the surplus pages accounting was quite
> tricky and back then I didn't figure out a simpler method that would
> achieve the consistent look at those counters. As mentioned above I
> suspect this could lead to pre-mature allocation failures while the
> migration is ongoing. Sure quite unlikely to happen and the race window
> is likely very small. Maybe this is even acceptable but I would strongly
> recommend to have all this thinking documented in the changelog.
> -- 
> Michal Hocko
> SUSE Labs
>
Baoquan He Aug. 12, 2020, 5:40 a.m. UTC | #9
On 08/11/20 at 02:43pm, Mike Kravetz wrote:
> Here is a patch to do that.  However, we are optimizing a return path in
> a race condition that we are unlikely to ever hit.  I 'tested' it by allocating
> an 'extra' page and freeing it via this method in alloc_surplus_huge_page.
> 
> From 864c5f8ef4900c95ca3f6f2363a85f3cb25e793e Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <mike.kravetz@oracle.com>
> Date: Tue, 11 Aug 2020 12:45:41 -0700
> Subject: [PATCH] hugetlb: optimize race error return in
>  alloc_surplus_huge_page
> 
> The routine alloc_surplus_huge_page() could race with with a pool
> size change.  If this happens, the allocated page may not be needed.
> To free the page, the current code will 'Abuse temporary page to
> workaround the nasty free_huge_page codeflow'.  Instead, directly
> call the low level routine that free_huge_page uses.  This works
> out well because the page is new, we hold the only reference and
> already hold the hugetlb_lock.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  mm/hugetlb.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 590111ea6975..ac89b91fba86 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1923,14 +1923,17 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
>  	/*
>  	 * We could have raced with the pool size change.
>  	 * Double check that and simply deallocate the new page
> -	 * if we would end up overcommiting the surpluses. Abuse
> -	 * temporary page to workaround the nasty free_huge_page
> -	 * codeflow
> +	 * if we would end up overcommiting the surpluses.
>  	 */
>  	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
> -		SetPageHugeTemporary(page);
> +		/*
> +		 * Since this page is new, we hold the only reference, and
> +		 * we already hold the hugetlb_lock call the low level free
> +		 * page routine.  This saves at least a lock roundtrip.
> +		 */
> +		(void)put_page_testzero(page); /* don't call destructor */
> +		update_and_free_page(h, page);

Yeah, taking this code change, or keeping the temporary page way as is,
both look good.

>  		spin_unlock(&hugetlb_lock);
> -		put_page(page);
>  		return NULL;
>  	} else {
>  		h->surplus_huge_pages++;
Michal Hocko Aug. 13, 2020, 11:46 a.m. UTC | #10
On Tue 11-08-20 14:43:28, Mike Kravetz wrote:
> On 8/10/20 11:54 PM, Michal Hocko wrote:
> > 
> > I have managed to forgot all the juicy details since I have made that
> > change. All that remains is that the surplus pages accounting was quite
> > tricky and back then I didn't figure out a simpler method that would
> > achieve the consistent look at those counters. As mentioned above I
> > suspect this could lead to pre-mature allocation failures while the
> > migration is ongoing.
> 
> It is likely lost in the e-mail thread, but the suggested change was to
> alloc_surplus_huge_page().  The code which allocates the migration target
> (alloc_migrate_huge_page) will not be changed.  So, this should not be
> an issue.

OK, I've missed that obviously.

> >                       Sure quite unlikely to happen and the race window
> > is likely very small. Maybe this is even acceptable but I would strongly
> > recommend to have all this thinking documented in the changelog.
> 
> I wrote down a description of what happens in the two different approaches
> "temporary page" vs "surplus page".  It is at the very end of this e-mail.
> When looking at the details, I came up with what may be an even better
> approach.  Why not just call the low level routine to free the page instead
> of going through put_page/free_huge_page?  At the very least, it saves a
> lock roundtrip and there is no need to worry about the counters/accounting.
> 
> Here is a patch to do that.  However, we are optimizing a return path in
> a race condition that we are unlikely to ever hit.  I 'tested' it by allocating
> an 'extra' page and freeing it via this method in alloc_surplus_huge_page.
> 
> >From 864c5f8ef4900c95ca3f6f2363a85f3cb25e793e Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <mike.kravetz@oracle.com>
> Date: Tue, 11 Aug 2020 12:45:41 -0700
> Subject: [PATCH] hugetlb: optimize race error return in
>  alloc_surplus_huge_page
> 
> The routine alloc_surplus_huge_page() could race with with a pool
> size change.  If this happens, the allocated page may not be needed.
> To free the page, the current code will 'Abuse temporary page to
> workaround the nasty free_huge_page codeflow'.  Instead, directly
> call the low level routine that free_huge_page uses.  This works
> out well because the page is new, we hold the only reference and
> already hold the hugetlb_lock.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  mm/hugetlb.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 590111ea6975..ac89b91fba86 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1923,14 +1923,17 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
>  	/*
>  	 * We could have raced with the pool size change.
>  	 * Double check that and simply deallocate the new page
> -	 * if we would end up overcommiting the surpluses. Abuse
> -	 * temporary page to workaround the nasty free_huge_page
> -	 * codeflow
> +	 * if we would end up overcommiting the surpluses.
>  	 */
>  	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
> -		SetPageHugeTemporary(page);
> +		/*
> +		 * Since this page is new, we hold the only reference, and
> +		 * we already hold the hugetlb_lock call the low level free
> +		 * page routine.  This saves at least a lock roundtrip.
> +		 */
> +		(void)put_page_testzero(page); /* don't call destructor */
> +		update_and_free_page(h, page);
>  		spin_unlock(&hugetlb_lock);
> -		put_page(page);
>  		return NULL;
>  	} else {
>  		h->surplus_huge_pages++;

Yes this makes sense. I would have to think about this more to be
confident and give Acked-by but this looks sensible from a quick glance.

Thanks!
Wei Yang Aug. 17, 2020, 3:04 a.m. UTC | #11
On Thu, Aug 13, 2020 at 01:46:38PM +0200, Michal Hocko wrote:
>On Tue 11-08-20 14:43:28, Mike Kravetz wrote:
>> On 8/10/20 11:54 PM, Michal Hocko wrote:
>> > 
>> > I have managed to forgot all the juicy details since I have made that
>> > change. All that remains is that the surplus pages accounting was quite
>> > tricky and back then I didn't figure out a simpler method that would
>> > achieve the consistent look at those counters. As mentioned above I
>> > suspect this could lead to pre-mature allocation failures while the
>> > migration is ongoing.
>> 
>> It is likely lost in the e-mail thread, but the suggested change was to
>> alloc_surplus_huge_page().  The code which allocates the migration target
>> (alloc_migrate_huge_page) will not be changed.  So, this should not be
>> an issue.
>
>OK, I've missed that obviously.
>
>> >                       Sure quite unlikely to happen and the race window
>> > is likely very small. Maybe this is even acceptable but I would strongly
>> > recommend to have all this thinking documented in the changelog.
>> 
>> I wrote down a description of what happens in the two different approaches
>> "temporary page" vs "surplus page".  It is at the very end of this e-mail.
>> When looking at the details, I came up with what may be an even better
>> approach.  Why not just call the low level routine to free the page instead
>> of going through put_page/free_huge_page?  At the very least, it saves a
>> lock roundtrip and there is no need to worry about the counters/accounting.
>> 
>> Here is a patch to do that.  However, we are optimizing a return path in
>> a race condition that we are unlikely to ever hit.  I 'tested' it by allocating
>> an 'extra' page and freeing it via this method in alloc_surplus_huge_page.
>> 
>> From 864c5f8ef4900c95ca3f6f2363a85f3cb25e793e Mon Sep 17 00:00:00 2001
>> From: Mike Kravetz <mike.kravetz@oracle.com>
>> Date: Tue, 11 Aug 2020 12:45:41 -0700
>> Subject: [PATCH] hugetlb: optimize race error return in
>>  alloc_surplus_huge_page
>> 
>> The routine alloc_surplus_huge_page() could race with a pool
>> size change.  If this happens, the allocated page may not be needed.
>> To free the page, the current code will 'Abuse temporary page to
>> workaround the nasty free_huge_page codeflow'.  Instead, directly
>> call the low level routine that free_huge_page uses.  This works
>> out well because the page is new, we hold the only reference and
>> already hold the hugetlb_lock.
>> 
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>  mm/hugetlb.c | 13 ++++++++-----
>>  1 file changed, 8 insertions(+), 5 deletions(-)
>> 
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 590111ea6975..ac89b91fba86 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1923,14 +1923,17 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
>>  	/*
>>  	 * We could have raced with the pool size change.
>>  	 * Double check that and simply deallocate the new page
>> -	 * if we would end up overcommiting the surpluses. Abuse
>> -	 * temporary page to workaround the nasty free_huge_page
>> -	 * codeflow
>> +	 * if we would end up overcommiting the surpluses.
>>  	 */
>>  	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
>> -		SetPageHugeTemporary(page);
>> +		/*
>> +		 * Since this page is new, we hold the only reference, and
>> +	 * we already hold the hugetlb_lock, call the low level free
>> +		 * page routine.  This saves at least a lock roundtrip.
>> +		 */
>> +		(void)put_page_testzero(page); /* don't call destructor */
>> +		update_and_free_page(h, page);
>>  		spin_unlock(&hugetlb_lock);
>> -		put_page(page);
>>  		return NULL;
>>  	} else {
>>  		h->surplus_huge_pages++;
>
>Yes this makes sense. I would have to think about this more to be
>confident and give Acked-by but this looks sensible from a quick glance.
>

If it is OK, I would like to send v2 without this one, to allow more
time for discussion.

>Thanks!
>-- 
>Michal Hocko
>SUSE Labs

Patch
diff mbox series

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1f2010c9dd8d..a0eb81e0e4c5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1913,21 +1913,19 @@  static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 		return NULL;
 
 	spin_lock(&hugetlb_lock);
+
+	h->surplus_huge_pages++;
+	h->surplus_huge_pages_node[page_to_nid(page)]++;
+
 	/*
 	 * We could have raced with the pool size change.
 	 * Double check that and simply deallocate the new page
-	 * if we would end up overcommiting the surpluses. Abuse
-	 * temporary page to workaround the nasty free_huge_page
-	 * codeflow
+	 * if we would end up overcommiting the surpluses.
 	 */
-	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
-		SetPageHugeTemporary(page);
+	if (h->surplus_huge_pages > h->nr_overcommit_huge_pages) {
 		spin_unlock(&hugetlb_lock);
 		put_page(page);
 		return NULL;
-	} else {
-		h->surplus_huge_pages++;
-		h->surplus_huge_pages_node[page_to_nid(page)]++;
 	}
 
 out_unlock: