[v2] mm/hugetlb: fix set_max_huge_pages() when there are surplus pages

Message ID 20250401082339.676723-1-tujinjiang@huawei.com (mailing list archive)
State New
Series [v2] mm/hugetlb: fix set_max_huge_pages() when there are surplus pages

Commit Message

Jinjiang Tu April 1, 2025, 8:23 a.m. UTC
In set_max_huge_pages(), min_count should represent the persistent huge
pages that must be kept, but its computation also includes surplus huge
pages. This leads to a failure to free free huge pages for a node.

Steps to reproduce:
1) create 5 hugetlb folios in Node0
2) run a program to use all the hugetlb folios
3) echo 0 > nr_hugepages for Node0 to free the hugetlb folios. Thus the 5
hugetlb folios in Node0 are accounted as surplus.
4) create 5 hugetlb folios in Node1
5) echo 0 > nr_hugepages for Node1 to free the hugetlb folios

The result:
        Node0    Node1
Total     5         5
Free      0         5
Surp      5         5

We can't simply subtract surplus_huge_pages from min_count, since free
hugetlb folios may be surplus due to HVO. In
__update_and_free_hugetlb_folio(), when hugetlb_vmemmap_restore_folio()
fails, the folio is added back to the pool and treated as surplus. If we
directly subtracted surplus_huge_pages from min_count, such free folios
would be subtracted twice.

To fix it, check whether count is less than the number of free huge
pages that could be destroyed (i.e., available_huge_pages(h)), and
remove hugetlb folios if so.

Since free surplus hugetlb folios may exist, remove surplus folios
first so that the surplus count remains correct.

The result with this patch:
        Node0    Node1
Total     5         0
Free      0         0
Surp      5         0

Fixes: 9a30523066cd ("hugetlb: add per node hstate attributes")
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
---
Changelog since v1:
  * Handle free surplus pages due to HVO

 mm/hugetlb.c | 47 ++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 40 insertions(+), 7 deletions(-)
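
For reference, available_huge_pages(h) in mm/hugetlb.c is
h->free_huge_pages - h->resv_huge_pages, and persistent_huge_pages(h) is
h->nr_huge_pages - h->surplus_huge_pages. The sketch below is a minimal,
self-contained userspace model of the failing comparison, assuming the
per-node write adjusts count the way __nr_hugepages_store_common() does;
the struct only mirrors the relevant hstate fields and is not kernel code.

#include <stdio.h>

/* Userspace model of the relevant struct hstate counters; values are
 * taken from the reproducer above. */
struct pool {
	unsigned long nr_huge_pages;	/* persistent + surplus */
	unsigned long surplus_huge_pages;
	unsigned long free_huge_pages;
	unsigned long resv_huge_pages;
};

int main(void)
{
	/* After steps 1-4: Node0 holds 5 in-use surplus pages and
	 * Node1 holds 5 free persistent pages. */
	struct pool h = { 10, 5, 5, 0 };
	unsigned long count = 0;	/* step 5, after per-node adjustment */
	unsigned long persistent = h.nr_huge_pages - h.surplus_huge_pages;

	/* Old logic: the 5 in-use surplus pages inflate min_count to 5,
	 * so the shrink loop "while (min_count < persistent)" never runs. */
	unsigned long min_count = h.resv_huge_pages + h.nr_huge_pages -
				  h.free_huge_pages;	/* 0 + 10 - 5 = 5 */
	if (min_count < count)
		min_count = count;
	printf("old: shrinks? %s (min_count=%lu persistent=%lu)\n",
	       min_count < persistent ? "yes" : "no", min_count, persistent);

	/* Patched logic: compare the target against the free pages that
	 * may actually be destroyed, i.e. available_huge_pages(). */
	unsigned long available = h.free_huge_pages - h.resv_huge_pages;
	printf("new: shrinks? %s (count=%lu available=%lu)\n",
	       count < available ? "yes" : "no", count, available);
	return 0;
}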

Comments

Andrew Morton April 3, 2025, 3:49 a.m. UTC | #1
On Tue, 1 Apr 2025 16:23:39 +0800 Jinjiang Tu <tujinjiang@huawei.com> wrote:

> In set_max_huge_pages(), min_count should represent the persistent huge
> pages that must be kept, but its computation also includes surplus huge
> pages. This leads to a failure to free free huge pages for a node.
> 
> Steps to reproduce:
> 1) create 5 hugetlb folios in Node0
> 2) run a program to use all the hugetlb folios
> 3) echo 0 > nr_hugepages for Node0 to free the hugetlb folios. Thus the 5
> hugetlb folios in Node0 are accounted as surplus.
> 4) create 5 hugetlb folios in Node1
> 5) echo 0 > nr_hugepages for Node1 to free the hugetlb folios
> 
> The result:
>         Node0    Node1
> Total     5         5
> Free      0         5
> Surp      5         5
> 
> We can't simply subtract surplus_huge_pages from min_count, since free
> hugetlb folios may be surplus due to HVO. In
> __update_and_free_hugetlb_folio(), when hugetlb_vmemmap_restore_folio()
> fails, the folio is added back to the pool and treated as surplus. If we
> directly subtracted surplus_huge_pages from min_count, such free folios
> would be subtracted twice.
> 
> To fix it, check whether count is less than the number of free huge
> pages that could be destroyed (i.e., available_huge_pages(h)), and
> remove hugetlb folios if so.
> 
> Since free surplus hugetlb folios may exist, remove surplus folios
> first so that the surplus count remains correct.
> 
> The result with this patch:
>         Node0    Node1
> Total     5         0
> Free      0         0
> Surp      5         0

Thanks.

> Fixes: 9a30523066cd ("hugetlb: add per node hstate attributes")

16 years ago.  How do people feel about a -stable backport?
Oscar Salvador April 3, 2025, 2:26 p.m. UTC | #2
On Tue, Apr 01, 2025 at 04:23:39PM +0800, Jinjiang Tu wrote:
> In set_max_huge_pages(), min_count should represent the persistent huge
> pages that must be kept, but its computation also includes surplus huge
> pages. This leads to a failure to free free huge pages for a node.
> 
> Steps to reproduce:
> 1) create 5 hugetlb folios in Node0
> 2) run a program to use all the hugetlb folios
> 3) echo 0 > nr_hugepages for Node0 to free the hugetlb folios. Thus the 5
> hugetlb folios in Node0 are accounted as surplus.
> 4) create 5 hugetlb folios in Node1
> 5) echo 0 > nr_hugepages for Node1 to free the hugetlb folios

> 
> The result:
>         Node0    Node1
> Total     5         5
> Free      0         5
> Surp      5         5
> 
> We can't simply subtract surplus_huge_pages from min_count, since free
> hugetlb folios may be surplus due to HVO. In
> __update_and_free_hugetlb_folio(), when hugetlb_vmemmap_restore_folio()
> fails, the folio is added back to the pool and treated as surplus. If we
> directly subtracted surplus_huge_pages from min_count, such free folios
> would be subtracted twice.
> 
> To fix it, check whether count is less than the number of free huge
> pages that could be destroyed (i.e., available_huge_pages(h)), and
> remove hugetlb folios if so.
> 
> Since free surplus hugetlb folios may exist, remove surplus folios
> first so that the surplus count remains correct.
> 
> The result with this patch:
>         Node0    Node1
> Total     5         0
> Free      0         0
> Surp      5         0

Unfortunately, this will not fly.


Assume this:

# echo 3 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
# numactl -m 0 ./hugetlb_use_2_pages &
# echo 2 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages

Because you are now checking available_huge_pages() in the
remove_pool_hugetlb_folio() loop, you will fail to free that page; it
will instead be accounted as surplus, thus failing to reduce the pool.

Of course, once the program terminates, all will be cleaned up, but in
the meantime that page (or more, depending on your setup) will not be
freed into the buddy allocator, which is unexpected.

And there are other scenarios where the same will happen.

So, it seems to me that we have to find a more deterministic approach.
The whole operation is almost entirely covered by the lock, other than
when we free the pages we collected, so I am sure we can find a way that
does not need to special-case things.
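
To make the counter-example concrete, here is a minimal userspace sketch
of the arithmetic, assuming the reservation has already been consumed
(resv == 0) once the program has faulted its two pages in:

#include <stdio.h>

int main(void)
{
	/* 3 persistent pages, a program holds 2 of them, then echo 2. */
	unsigned long nr = 3, surplus = 0, nr_free = 1, resv = 0, count = 2;
	unsigned long persistent = nr - surplus;	/* 3 */

	/* Old loop: min_count = resv + nr - nr_free = 2, and 2 < 3, so
	 * the idle free page is returned to the buddy allocator. */
	unsigned long min_count = resv + nr - nr_free;
	if (min_count < count)
		min_count = count;
	printf("old: frees idle page? %s\n",
	       min_count < persistent ? "yes" : "no");

	/* v2 loop: count < available_huge_pages() is 2 < 1, false, so
	 * the page stays in the pool and adjust_pool_surplus() flips it
	 * to surplus instead -- the regression described above. */
	unsigned long available = nr_free - resv;	/* 1 */
	printf("v2:  frees idle page? %s\n",
	       count < available ? "yes" : "no");
	return 0;
}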
Oscar Salvador April 4, 2025, 1:18 p.m. UTC | #3
On Thu, Apr 03, 2025 at 04:26:36PM +0200, Oscar Salvador wrote:
> Unfortunately, this will not fly.

I did not spend much time on it, but apart from solving the problem of
HVO free pages that became surplus being accounted twice, shouldn't the
following fix the entire problem?

 diff --git a/mm/hugetlb.c b/mm/hugetlb.c
 index 39f92aad7bd1..6e9534a825ed 100644
 --- a/mm/hugetlb.c
 +++ b/mm/hugetlb.c
 @@ -3825,6 +3825,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
  static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
  			      nodemask_t *nodes_allowed)
  {
 +	unsigned long substract;
  	unsigned long min_count;
  	unsigned long allocated;
  	struct folio *folio;
 @@ -3960,7 +3961,12 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
  	 * and won't grow the pool anywhere else. Not until one of the
  	 * sysctls are changed, or the surplus pages go out of use.
  	 */
 -	min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
 +	if (h->free_huge_pages > persistent_huge_pages(h))
 +		substract = h->free_huge_pages - h->surplus_huge_pages;
 +	else
 +		h->free_huge_pages;
 +
 +	min_count = h->resv_huge_pages + persistent_huge_pages(h) - substract;
  	min_count = max(count, min_count);
  	try_to_free_low(h, min_count, nodes_allowed);
Oscar Salvador April 6, 2025, 10:30 a.m. UTC | #4
On Fri, Apr 04, 2025 at 03:18:09PM +0200, Oscar Salvador wrote:
>  @@ -3960,7 +3961,12 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>   	 * and won't grow the pool anywhere else. Not until one of the
>   	 * sysctls are changed, or the surplus pages go out of use.
>   	 */
>  -	min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
>  +	if (h->free_huge_pages > persistent_huge_pages(h))
>  +		substract = h->free_huge_pages - h->surplus_huge_pages;
>  +	else
>  +		h->free_huge_pages;

Ofc, this wanted to be substract = h->free_huge_pages.
Oscar Salvador April 6, 2025, 10:37 a.m. UTC | #5
On Sun, Apr 06, 2025 at 12:30:43PM +0200, Oscar Salvador wrote:
> On Fri, Apr 04, 2025 at 03:18:09PM +0200, Oscar Salvador wrote:
> >  @@ -3960,7 +3961,12 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
> >   	 * and won't grow the pool anywhere else. Not until one of the
> >   	 * sysctls are changed, or the surplus pages go out of use.
> >   	 */
> >  -	min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
> >  +	if (h->free_huge_pages > persistent_huge_pages(h))
> >  +		substract = h->free_huge_pages - h->surplus_huge_pages;
> >  +	else
> >  +		h->free_huge_pages;
> 
> Ofc, this wanted to be substract = h->free_huge_pages.

Although I am thinking that instead of playing smart, we really need a
counter telling us how many _free_ surplus pages (due to HVO failing to
recompose those pages) we have at any given time; then the subtracting
becomes easier.
Jinjiang Tu April 7, 2025, 7:23 a.m. UTC | #6
On 2025/4/4 21:18, Oscar Salvador wrote:
> On Thu, Apr 03, 2025 at 04:26:36PM +0200, Oscar Salvador wrote:
>> Unfortunately, this will not fly.
> I did not spend much time, but solving the problem of hvo free pages
> becoming surplus not being accounted twice, should not the following
> fix the entire problem?

Sorry for the late reply.

>   diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>   index 39f92aad7bd1..6e9534a825ed 100644
>   --- a/mm/hugetlb.c
>   +++ b/mm/hugetlb.c
>   @@ -3825,6 +3825,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
>    static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>    			      nodemask_t *nodes_allowed)
>    {
>   +	unsigned long substract;
>    	unsigned long min_count;
>    	unsigned long allocated;
>    	struct folio *folio;
>   @@ -3960,7 +3961,12 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>    	 * and won't grow the pool anywhere else. Not until one of the
>    	 * sysctls are changed, or the surplus pages go out of use.
>    	 */
>   -	min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
>   +	if (h->free_huge_pages > persistent_huge_pages(h))
>   +		substract = h->free_huge_pages - h->surplus_huge_pages;

h->free_huge_pages may be less than h->surplus_huge_pages

>   +	else
>   +		h->free_huge_pages;
>   +
>   +	min_count = h->resv_huge_pages + persistent_huge_pages(h) - substract;
>    	min_count = max(count, min_count);
>    	try_to_free_low(h, min_count, nodes_allowed);
>
I tried the following diff, and it solves the problem.

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 39f92aad7bd1..4afcd5ce417c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3826,6 +3826,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
                               nodemask_t *nodes_allowed)
  {
         unsigned long min_count;
+       unsigned long persistent_free_count;
         unsigned long allocated;
         struct folio *folio;
         LIST_HEAD(page_list);
@@ -3960,7 +3961,14 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
          * and won't grow the pool anywhere else. Not until one of the
          * sysctls are changed, or the surplus pages go out of use.
          */
-       min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
+       persistent_free_count = h->free_huge_pages;
+       if (h->free_huge_pages > persistent_huge_pages(h)) {
+               if (h->free_huge_pages > h->surplus_huge_pages)
+                       persistent_free_count -= h->surplus_huge_pages;
+               else
+                       persistent_free_count = 0;
+       }
+       min_count = h->resv_huge_pages + persistent_huge_pages(h) - persistent_free_count;
         min_count = max(count, min_count);
         try_to_free_low(h, min_count, nodes_allowed);

Thanks for reviewing.
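
For what it's worth, here is a small self-contained check of the new
persistent_free_count arithmetic against the three pool states discussed
in this thread; a sketch under the assumption that resv_huge_pages is 0
in all three, with names mirroring the diff above:

#include <stdio.h>

/* Mirrors the proposed min_count computation (resv assumed 0). */
static unsigned long min_count_for(unsigned long nr, unsigned long surplus,
				   unsigned long nr_free, unsigned long count)
{
	unsigned long persistent = nr - surplus;
	unsigned long persistent_free = nr_free;
	unsigned long min_count;

	if (nr_free > persistent) {
		/* Some free pages are surplus (HVO restore failed);
		 * don't treat them as removable persistent pages. */
		persistent_free = nr_free > surplus ? nr_free - surplus : 0;
	}

	min_count = persistent - persistent_free;
	return min_count > count ? min_count : count;
}

int main(void)
{
	/* Original repro: 5 in-use surplus + 5 free persistent, echo 0:
	 * min_count is 0, so the pool can shrink completely. */
	printf("repro: min_count=%lu\n", min_count_for(10, 5, 5, 0));
	/* All free pages surplus after a failed HVO restore, echo 0:
	 * persistent_free_count is clamped, no underflow. */
	printf("hvo:   min_count=%lu\n", min_count_for(5, 5, 5, 0));
	/* Oscar's scenario: 3 pages, 2 in use, echo 2: min_count is 2,
	 * below persistent (3), so the idle page is still freed. */
	printf("oscar: min_count=%lu\n", min_count_for(3, 0, 1, 2));
	return 0;
}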
Jinjiang Tu April 7, 2025, 7:26 a.m. UTC | #7
On 2025/4/6 18:37, Oscar Salvador wrote:
> On Sun, Apr 06, 2025 at 12:30:43PM +0200, Oscar Salvador wrote:
>> On Fri, Apr 04, 2025 at 03:18:09PM +0200, Oscar Salvador wrote:
>>>   @@ -3960,7 +3961,12 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>>>    	 * and won't grow the pool anywhere else. Not until one of the
>>>    	 * sysctls are changed, or the surplus pages go out of use.
>>>    	 */
>>>   -	min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
>>>   +	if (h->free_huge_pages > persistent_huge_pages(h))
>>>   +		substract = h->free_huge_pages - h->surplus_huge_pages;
>>>   +	else
>>>   +		h->free_huge_pages;
>> Ofc, this wanted to be substract = h->free_huge_pages.
> Although I am thinking that instead of playing smart, we really need a counter
> to tell us how many _free_ surplus pages due to hvo failing to recompose back
> those pages we have at any given time, then the substracting thing becomes
> easier.

The free surplus count seems to be used only in set_max_huge_pages(),
which is called rarely. Since it can be calculated on demand, it isn't
necessary to maintain an extra counter.

Thanks.

Oscar Salvador April 7, 2025, 9:35 a.m. UTC | #8
On Mon, Apr 07, 2025 at 03:23:08PM +0800, Jinjiang Tu wrote:
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 39f92aad7bd1..4afcd5ce417c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3826,6 +3826,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>                               nodemask_t *nodes_allowed)
>  {
>         unsigned long min_count;
> +       unsigned long persistent_free_count;
>         unsigned long allocated;
>         struct folio *folio;
>         LIST_HEAD(page_list);
> @@ -3960,7 +3961,14 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>          * and won't grow the pool anywhere else. Not until one of the
>          * sysctls are changed, or the surplus pages go out of use.
>          */
> -       min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
> +       persistent_free_count = h->free_huge_pages;
> +       if (h->free_huge_pages > persistent_huge_pages(h)) {
> +               if (h->free_huge_pages > h->surplus_huge_pages)
> +                       persistent_free_count -= h->surplus_huge_pages;
> +               else
> +                       persistent_free_count = 0;
> +       }
> +       min_count = h->resv_huge_pages + persistent_huge_pages(h) - persistent_free_count;
>         min_count = max(count, min_count);
>         try_to_free_low(h, min_count, nodes_allowed);

I think this is ok, but it needs a big fat comment explaining why we
are doing what we are doing, because it is too subtle to grasp without
any context.

For instance, why it is ok not to subtract anything from
persistent_huge_pages, and the circumstances that can lead to that,
e.g. all free pages being surplus because HVO failed to restore them,
in which case we do not want to account for the same pages twice.

Actually, while you are at it, it would be great to rename the
variables and give them meaningful names (e.g., min_count and count).
That can be done in a follow-up patch of course.

The reason I am suggesting this is that it is not that easy to untangle
what is going on here, and more meaningful names might help.

Thanks
Oscar Salvador April 7, 2025, 10 a.m. UTC | #9
On Mon, Apr 07, 2025 at 11:35:27AM +0200, Oscar Salvador wrote:
> On Mon, Apr 07, 2025 at 03:23:08PM +0800, Jinjiang Tu wrote:
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 39f92aad7bd1..4afcd5ce417c 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -3826,6 +3826,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
> >                               nodemask_t *nodes_allowed)
> >  {
> >         unsigned long min_count;
> > +       unsigned long persistent_free_count;
> >         unsigned long allocated;
> >         struct folio *folio;
> >         LIST_HEAD(page_list);
> > @@ -3960,7 +3961,14 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
> >          * and won't grow the pool anywhere else. Not until one of the
> >          * sysctls are changed, or the surplus pages go out of use.
> >          */
> > -       min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
> > +       persistent_free_count = h->free_huge_pages;
> > +       if (h->free_huge_pages > persistent_huge_pages(h)) {
> > +               if (h->free_huge_pages > h->surplus_huge_pages)
> > +                       persistent_free_count -= h->surplus_huge_pages;
> > +               else
> > +                       persistent_free_count = 0;
> > +       }
> > +       min_count = h->resv_huge_pages + persistent_huge_pages(h) - persistent_free_count;

I forgot to mention that before this chunk, we might still want
something like your remove_surplus_pool_hugetlb_folio(), but one that
goes through all nodes, not only the nodes_allowed nodemask, checks for
free surplus pages due to HVO, and tries to get rid of them.

But on the other hand, we can leave it as is, along the lines of "meh,
we failed to restore those pages, so tough luck", which is the status
quo right now, and I am fine with it.

In the end, the situation can fix itself, as those pages might be
reallocated again, giving us a second chance to restore the vmemmap
when they are freed.

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 318624c96584..57319ff7856b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3577,7 +3577,7 @@  static void try_to_free_low(struct hstate *h, unsigned long count,
 		struct folio *folio, *next;
 		struct list_head *freel = &h->hugepage_freelists[i];
 		list_for_each_entry_safe(folio, next, freel, lru) {
-			if (count >= h->nr_huge_pages)
+			if (count >= available_huge_pages(h))
 				goto out;
 			if (folio_test_highmem(folio))
 				continue;
@@ -3631,11 +3631,30 @@  static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
 	return 1;
 }
 
+static struct folio *remove_surplus_pool_hugetlb_folio(struct hstate *h,
+		nodemask_t *nodes_allowed)
+{
+	int nr_nodes, node;
+	struct folio *folio = NULL;
+
+	lockdep_assert_held(&hugetlb_lock);
+	for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
+		if (h->surplus_huge_pages_node[node] &&
+		    !list_empty(&h->hugepage_freelists[node])) {
+			folio = list_entry(h->hugepage_freelists[node].next,
+					  struct folio, lru);
+			remove_hugetlb_folio(h, folio, true);
+			break;
+		}
+	}
+
+	return folio;
+}
+
 #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
 static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 			      nodemask_t *nodes_allowed)
 {
-	unsigned long min_count;
 	unsigned long allocated;
 	struct folio *folio;
 	LIST_HEAD(page_list);
@@ -3770,15 +3789,29 @@  static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	 * and won't grow the pool anywhere else. Not until one of the
 	 * sysctls are changed, or the surplus pages go out of use.
 	 */
-	min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
-	min_count = max(count, min_count);
-	try_to_free_low(h, min_count, nodes_allowed);
+	try_to_free_low(h, count, nodes_allowed);
 
 	/*
 	 * Collect pages to be removed on list without dropping lock
+	 *
+	 * There may be free surplus huge pages due to HVO, see comments
+	 * in __update_and_free_hugetlb_folio() when calling
+	 * hugetlb_vmemmap_restore_folio(). Collect surplus pages first.
 	 */
-	while (min_count < persistent_huge_pages(h)) {
-		folio = remove_pool_hugetlb_folio(h, nodes_allowed, 0);
+	while (count < available_huge_pages(h)) {
+		if (h->surplus_huge_pages && h->free_huge_pages) {
+			folio = remove_surplus_pool_hugetlb_folio(h, nodes_allowed);
+			if (!folio)
+				break;
+
+			list_add(&folio->lru, &page_list);
+		} else {
+			break;
+		}
+	}
+
+	while (count < available_huge_pages(h)) {
+		folio = remove_pool_hugetlb_folio(h, nodes_allowed, false);
 		if (!folio)
 			break;