
[v4] mm/hugetlb: Fix unsigned overflow in __nr_hugepages_store_common()

Message ID 1550885529-125561-1-git-send-email-jingxiangfeng@huawei.com (mailing list archive)
State New, archived

Commit Message

Jing Xiangfeng Feb. 23, 2019, 1:32 a.m. UTC
A user can change the node specific hugetlb count via a file such as:
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
The calculated value of count is the total number of huge pages. This
calculation can overflow when a user enters a crazy high value, in which
case the total number of huge pages ends up being a small value that the
user did not expect. We can simply fix it by setting count to ULONG_MAX
and carrying on. This may be more in line with the user's intention of
allocating as many huge pages as possible.

Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
---
 mm/hugetlb.c | 7 +++++++
 1 file changed, 7 insertions(+)
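
For illustration only (not part of the patch), here is a minimal userspace
sketch of the wrap-around described in the commit message, using made-up
counter values:

#include <limits.h>
#include <stdio.h>

int main(void)
{
	/* hypothetical counters: 100 huge pages globally, 60 on node 1 */
	unsigned long nr_huge_pages = 100;
	unsigned long nr_huge_pages_node1 = 60;
	/* a crazy high value written to the per-node nr_hugepages file */
	unsigned long count = ULONG_MAX - 10;

	/* the same adjustment __nr_hugepages_store_common() performs */
	count += nr_huge_pages - nr_huge_pages_node1;

	/* wraps around: prints 29 instead of a huge number */
	printf("adjusted count = %lu\n", count);
	return 0;
}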

Comments

Mike Kravetz Feb. 25, 2019, 12:45 a.m. UTC | #1
On 2/22/19 5:32 PM, Jing Xiangfeng wrote:
> User can change a node specific hugetlb count. i.e.
> /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
> the calculated value of count is a total number of huge pages. It could
> be overflow when a user entering a crazy high value. If so, the total
> number of huge pages could be a small value which is not user expect.
> We can simply fix it by setting count to ULONG_MAX, then it goes on. This
> may be more in line with user's intention of allocating as many huge pages
> as possible.
> 
> Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com>

Thank you.

Acked-by: Mike Kravetz <mike.kravetz@oracle.com>

> ---
>  mm/hugetlb.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index afef616..6688894 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2423,7 +2423,14 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy,
>  		 * per node hstate attribute: adjust count to global,
>  		 * but restrict alloc/free to the specified node.
>  		 */
> +		unsigned long old_count = count;
>  		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
> +		/*
> +		 * If user specified count causes overflow, set to
> +		 * largest possible value.
> +		 */
> +		if (count < old_count)
> +			count = ULONG_MAX;
>  		init_nodemask_of_node(nodes_allowed, nid);
>  	} else
>  		nodes_allowed = &node_states[N_MEMORY];
>
David Rientjes Feb. 25, 2019, 3:17 a.m. UTC | #2
On Sun, 24 Feb 2019, Mike Kravetz wrote:

> > User can change a node specific hugetlb count. i.e.
> > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
> > the calculated value of count is a total number of huge pages. It could
> > be overflow when a user entering a crazy high value. If so, the total
> > number of huge pages could be a small value which is not user expect.
> > We can simply fix it by setting count to ULONG_MAX, then it goes on. This
> > may be more in line with user's intention of allocating as many huge pages
> > as possible.
> > 
> > Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
> 
> Thank you.
> 
> Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
> 
> > ---
> >  mm/hugetlb.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> > 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index afef616..6688894 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2423,7 +2423,14 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy,
> >  		 * per node hstate attribute: adjust count to global,
> >  		 * but restrict alloc/free to the specified node.
> >  		 */
> > +		unsigned long old_count = count;
> >  		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
> > +		/*
> > +		 * If user specified count causes overflow, set to
> > +		 * largest possible value.
> > +		 */
> > +		if (count < old_count)
> > +			count = ULONG_MAX;
> >  		init_nodemask_of_node(nodes_allowed, nid);
> >  	} else
> >  		nodes_allowed = &node_states[N_MEMORY];
> > 

Looks like this fixes the overflow issue, but isn't there already a 
possible underflow since we don't hold hugetlb_lock?  Even if 
count == 0, what prevents h->nr_huge_pages_node[nid] being greater than 
h->nr_huge_pages here?  I think the per hstate values need to be read with 
READ_ONCE() and stored on the stack to do any sane bounds checking.
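
A rough fragment-style sketch of the snapshot idea being suggested here (an
illustration, not a patch from this thread; the clamp at the end is one
possible way to handle an inconsistent pair of values):

	/* Take one snapshot of each counter, then guard against the racy
	 * case where the per-node value exceeds the global value before
	 * using them in the adjustment.
	 */
	unsigned long global_count = READ_ONCE(h->nr_huge_pages);
	unsigned long node_count = READ_ONCE(h->nr_huge_pages_node[nid]);

	if (node_count > global_count)
		node_count = global_count;	/* inconsistent snapshot */

	count += global_count - node_count;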
Mike Kravetz Feb. 25, 2019, 4:49 p.m. UTC | #3
On 2/24/19 7:17 PM, David Rientjes wrote:
> On Sun, 24 Feb 2019, Mike Kravetz wrote:
> 
>>> User can change a node specific hugetlb count. i.e.
>>> /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
>>> the calculated value of count is a total number of huge pages. It could
>>> be overflow when a user entering a crazy high value. If so, the total
>>> number of huge pages could be a small value which is not user expect.
>>> We can simply fix it by setting count to ULONG_MAX, then it goes on. This
>>> may be more in line with user's intention of allocating as many huge pages
>>> as possible.
>>>
>>> Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
>>
>> Thank you.
>>
>> Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
>>
>>> ---
>>>  mm/hugetlb.c | 7 +++++++
>>>  1 file changed, 7 insertions(+)
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index afef616..6688894 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -2423,7 +2423,14 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy,
>>>  		 * per node hstate attribute: adjust count to global,
>>>  		 * but restrict alloc/free to the specified node.
>>>  		 */
>>> +		unsigned long old_count = count;
>>>  		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
>>> +		/*
>>> +		 * If user specified count causes overflow, set to
>>> +		 * largest possible value.
>>> +		 */
>>> +		if (count < old_count)
>>> +			count = ULONG_MAX;
>>>  		init_nodemask_of_node(nodes_allowed, nid);
>>>  	} else
>>>  		nodes_allowed = &node_states[N_MEMORY];
>>>
> 
> Looks like this fixes the overflow issue, but isn't there already a 
> possible underflow since we don't hold hugetlb_lock?  Even if 
> count == 0, what prevents h->nr_huge_pages_node[nid] being greater than 
> h->nr_huge_pages here?  I think the per hstate values need to be read with 
> READ_ONCE() and stored on the stack to do any sane bounds checking.

Yes, without holding the lock there is the potential for issues.  Looking
back to when the node specific code was added, there is a comment about
"re-use/share as much of the existing global hstate attribute initialization
and handling".  I suspect that is the reason for these calculations outside
the lock.

As you mention above, nr_huge_pages_node[nid] could be greater than
nr_huge_pages.  This is true even if we do READ_ONCE().  So, the code would
need to test for this condition and 'fix up' values or re-read.  It is just
racy without holding the lock.

If that is too ugly, then we could just add code for the node specific
adjustments.  set_max_huge_pages() is only called from here.  It would be
pretty easy to modify set_max_huge_pages() to take the node specific value
and do calculations/adjustments under the lock.
Mike Kravetz Feb. 25, 2019, 6:19 p.m. UTC | #4
> On 2/24/19 7:17 PM, David Rientjes wrote:
>> On Sun, 24 Feb 2019, Mike Kravetz wrote:
>>>> @@ -2423,7 +2423,14 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy,
>>>>  		 * per node hstate attribute: adjust count to global,
>>>>  		 * but restrict alloc/free to the specified node.
>>>>  		 */
>>>> +		unsigned long old_count = count;
>>>>  		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
>>>> +		/*
>>>> +		 * If user specified count causes overflow, set to
>>>> +		 * largest possible value.
>>>> +		 */
>>>> +		if (count < old_count)
>>>> +			count = ULONG_MAX;
>>>>  		init_nodemask_of_node(nodes_allowed, nid);
>>>>  	} else
>>>>  		nodes_allowed = &node_states[N_MEMORY];
>>>>
>>
>> Looks like this fixes the overflow issue, but isn't there already a 
>> possible underflow since we don't hold hugetlb_lock?  Even if 
>> count == 0, what prevents h->nr_huge_pages_node[nid] being greater than 
>> h->nr_huge_pages here?  I think the per hstate values need to be read with 
>> READ_ONCE() and stored on the stack to do any sane bounds checking.
> 
> Yes, without holding the lock there is the potential for issues.  Looking
> back to when the node specific code was added there is a comment about
> "re-use/share as much of the existing global hstate attribute initialization
> and handling".  I suspect that is the reason for these calculations outside
> the lock.
> 
> As you mention above, nr_huge_pages_node[nid] could be greater than
> nr_huge_pages.  This is true even if we do READ_ONCE().  So, the code would
> need to test for this condition and 'fix up' values or re-read.  It is just
> racy without holding the lock.
> 
> If that is too ugly, then we could just add code for the node specific
> adjustments.  set_max_huge_pages() is only called from here.  It would be
> pretty easy to modify set_max_huge_pages() to take the node specific value
> and do calculations/adjustments under the lock.

Ok, what about just moving the calculation/check inside the lock as in the
untested patch below?

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 34 ++++++++++++++++++++++++++--------
 1 file changed, 26 insertions(+), 8 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1c5219193b9e..5afa77dc7bc8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2274,7 +2274,7 @@ static int adjust_pool_surplus(struct hstate *h,
nodemask_t *nodes_allowed,
 }

 #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
-static int set_max_huge_pages(struct hstate *h, unsigned long count,
+static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 						nodemask_t *nodes_allowed)
 {
 	unsigned long min_count, ret;
@@ -2289,6 +2289,23 @@ static int set_max_huge_pages(struct hstate *h, unsigned
long count,
 		goto decrease_pool;
 	}

+	spin_lock(&hugetlb_lock);
+
+	/*
+	 * Check for a node specific request.  Adjust global count, but
+	 * restrict alloc/free to the specified node.
+	 */
+	if (nid != NUMA_NO_NODE) {
+		unsigned long old_count = count;
+		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
+		/*
+		 * If user specified count causes overflow, set to
+		 * largest possible value.
+		 */
+		if (count < old_count)
+			count = ULONG_MAX;
+	}
+
 	/*
 	 * Increase the pool size
 	 * First take pages out of surplus state.  Then make up the
@@ -2300,7 +2317,6 @@ static int set_max_huge_pages(struct hstate *h, unsigned
long count,
 	 * pool might be one hugepage larger than it needs to be, but
 	 * within all the constraints specified by the sysctls.
 	 */
-	spin_lock(&hugetlb_lock);
 	while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
 		if (!adjust_pool_surplus(h, nodes_allowed, -1))
 			break;
@@ -2421,16 +2437,18 @@ static ssize_t __nr_hugepages_store_common(bool
obey_mempolicy,
 			nodes_allowed = &node_states[N_MEMORY];
 		}
 	} else if (nodes_allowed) {
+		/* Node specific request */
+		init_nodemask_of_node(nodes_allowed, nid);
+	} else {
 		/*
-		 * per node hstate attribute: adjust count to global,
-		 * but restrict alloc/free to the specified node.
+		 * Node specific request, but we could not allocate
+		 * node mask.  Pass in ALL nodes, and clear nid.
 		 */
-		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
-		init_nodemask_of_node(nodes_allowed, nid);
-	} else
+		nid = NUMA_NO_NODE;
 		nodes_allowed = &node_states[N_MEMORY];
+	}

-	err = set_max_huge_pages(h, count, nodes_allowed);
+	err = set_max_huge_pages(h, count, nid, nodes_allowed);
 	if (err)
 		goto out;
David Rientjes Feb. 25, 2019, 7:17 p.m. UTC | #5
On Mon, 25 Feb 2019, Mike Kravetz wrote:

> Ok, what about just moving the calculation/check inside the lock as in the
> untested patch below?
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  mm/hugetlb.c | 34 ++++++++++++++++++++++++++--------
>  1 file changed, 26 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 1c5219193b9e..5afa77dc7bc8 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2274,7 +2274,7 @@ static int adjust_pool_surplus(struct hstate *h,
> nodemask_t *nodes_allowed,
>  }
> 
>  #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
> -static int set_max_huge_pages(struct hstate *h, unsigned long count,
> +static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  						nodemask_t *nodes_allowed)
>  {
>  	unsigned long min_count, ret;
> @@ -2289,6 +2289,23 @@ static int set_max_huge_pages(struct hstate *h, unsigned
> long count,
>  		goto decrease_pool;
>  	}
> 
> +	spin_lock(&hugetlb_lock);
> +
> +	/*
> +	 * Check for a node specific request.  Adjust global count, but
> +	 * restrict alloc/free to the specified node.
> +	 */
> +	if (nid != NUMA_NO_NODE) {
> +		unsigned long old_count = count;
> +		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
> +		/*
> +		 * If user specified count causes overflow, set to
> +		 * largest possible value.
> +		 */
> +		if (count < old_count)
> +			count = ULONG_MAX;
> +	}
> +
>  	/*
>  	 * Increase the pool size
>  	 * First take pages out of surplus state.  Then make up the
> @@ -2300,7 +2317,6 @@ static int set_max_huge_pages(struct hstate *h, unsigned
> long count,
>  	 * pool might be one hugepage larger than it needs to be, but
>  	 * within all the constraints specified by the sysctls.
>  	 */
> -	spin_lock(&hugetlb_lock);
>  	while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
>  		if (!adjust_pool_surplus(h, nodes_allowed, -1))
>  			break;
> @@ -2421,16 +2437,18 @@ static ssize_t __nr_hugepages_store_common(bool
> obey_mempolicy,
>  			nodes_allowed = &node_states[N_MEMORY];
>  		}
>  	} else if (nodes_allowed) {
> +		/* Node specific request */
> +		init_nodemask_of_node(nodes_allowed, nid);
> +	} else {
>  		/*
> -		 * per node hstate attribute: adjust count to global,
> -		 * but restrict alloc/free to the specified node.
> +		 * Node specific request, but we could not allocate
> +		 * node mask.  Pass in ALL nodes, and clear nid.
>  		 */
> -		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
> -		init_nodemask_of_node(nodes_allowed, nid);
> -	} else
> +		nid = NUMA_NO_NODE;
>  		nodes_allowed = &node_states[N_MEMORY];
> +	}
> 
> -	err = set_max_huge_pages(h, count, nodes_allowed);
> +	err = set_max_huge_pages(h, count, nid, nodes_allowed);
>  	if (err)
>  		goto out;
> 

Looks good; Jing, could you test that this fixes your case?
Jing Xiangfeng Feb. 26, 2019, 2:22 a.m. UTC | #6
On 2019/2/26 3:17, David Rientjes wrote:
> On Mon, 25 Feb 2019, Mike Kravetz wrote:
> 
>> Ok, what about just moving the calculation/check inside the lock as in the
>> untested patch below?
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>  mm/hugetlb.c | 34 ++++++++++++++++++++++++++--------
>>  1 file changed, 26 insertions(+), 8 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 1c5219193b9e..5afa77dc7bc8 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -2274,7 +2274,7 @@ static int adjust_pool_surplus(struct hstate *h,
>> nodemask_t *nodes_allowed,
>>  }
>>
>>  #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
>> -static int set_max_huge_pages(struct hstate *h, unsigned long count,
>> +static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>>  						nodemask_t *nodes_allowed)
>>  {
>>  	unsigned long min_count, ret;
>> @@ -2289,6 +2289,23 @@ static int set_max_huge_pages(struct hstate *h, unsigned
>> long count,
>>  		goto decrease_pool;
>>  	}
>>
>> +	spin_lock(&hugetlb_lock);
>> +
>> +	/*
>> +	 * Check for a node specific request.  Adjust global count, but
>> +	 * restrict alloc/free to the specified node.
>> +	 */
>> +	if (nid != NUMA_NO_NODE) {
>> +		unsigned long old_count = count;
>> +		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
>> +		/*
>> +		 * If user specified count causes overflow, set to
>> +		 * largest possible value.
>> +		 */
>> +		if (count < old_count)
>> +			count = ULONG_MAX;
>> +	}
>> +
>>  	/*
>>  	 * Increase the pool size
>>  	 * First take pages out of surplus state.  Then make up the
>> @@ -2300,7 +2317,6 @@ static int set_max_huge_pages(struct hstate *h, unsigned
>> long count,
>>  	 * pool might be one hugepage larger than it needs to be, but
>>  	 * within all the constraints specified by the sysctls.
>>  	 */
>> -	spin_lock(&hugetlb_lock);
>>  	while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
>>  		if (!adjust_pool_surplus(h, nodes_allowed, -1))
>>  			break;
>> @@ -2421,16 +2437,18 @@ static ssize_t __nr_hugepages_store_common(bool
>> obey_mempolicy,
>>  			nodes_allowed = &node_states[N_MEMORY];
>>  		}
>>  	} else if (nodes_allowed) {
>> +		/* Node specific request */
>> +		init_nodemask_of_node(nodes_allowed, nid);
>> +	} else {
>>  		/*
>> -		 * per node hstate attribute: adjust count to global,
>> -		 * but restrict alloc/free to the specified node.
>> +		 * Node specific request, but we could not allocate
>> +		 * node mask.  Pass in ALL nodes, and clear nid.
>>  		 */
>> -		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
>> -		init_nodemask_of_node(nodes_allowed, nid);
>> -	} else
>> +		nid = NUMA_NO_NODE;
>>  		nodes_allowed = &node_states[N_MEMORY];
>> +	}
>>
>> -	err = set_max_huge_pages(h, count, nodes_allowed);
>> +	err = set_max_huge_pages(h, count, nid, nodes_allowed);
>>  	if (err)
>>  		goto out;
>>
> 
> Looks good; Jing, could you test that this fixes your case?

Yes, I have tested this patch; it also fixes my case.
David Rientjes Feb. 26, 2019, 6:21 a.m. UTC | #7
On Tue, 26 Feb 2019, Jing Xiangfeng wrote:

> On 2019/2/26 3:17, David Rientjes wrote:
> > On Mon, 25 Feb 2019, Mike Kravetz wrote:
> > 
> >> Ok, what about just moving the calculation/check inside the lock as in the
> >> untested patch below?
> >>
> >> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> >> ---
> >>  mm/hugetlb.c | 34 ++++++++++++++++++++++++++--------
> >>  1 file changed, 26 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> index 1c5219193b9e..5afa77dc7bc8 100644
> >> --- a/mm/hugetlb.c
> >> +++ b/mm/hugetlb.c
> >> @@ -2274,7 +2274,7 @@ static int adjust_pool_surplus(struct hstate *h,
> >> nodemask_t *nodes_allowed,
> >>  }
> >>
> >>  #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
> >> -static int set_max_huge_pages(struct hstate *h, unsigned long count,
> >> +static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
> >>  						nodemask_t *nodes_allowed)
> >>  {
> >>  	unsigned long min_count, ret;
> >> @@ -2289,6 +2289,23 @@ static int set_max_huge_pages(struct hstate *h, unsigned
> >> long count,
> >>  		goto decrease_pool;
> >>  	}
> >>
> >> +	spin_lock(&hugetlb_lock);
> >> +
> >> +	/*
> >> +	 * Check for a node specific request.  Adjust global count, but
> >> +	 * restrict alloc/free to the specified node.
> >> +	 */
> >> +	if (nid != NUMA_NO_NODE) {
> >> +		unsigned long old_count = count;
> >> +		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
> >> +		/*
> >> +		 * If user specified count causes overflow, set to
> >> +		 * largest possible value.
> >> +		 */
> >> +		if (count < old_count)
> >> +			count = ULONG_MAX;
> >> +	}
> >> +
> >>  	/*
> >>  	 * Increase the pool size
> >>  	 * First take pages out of surplus state.  Then make up the
> >> @@ -2300,7 +2317,6 @@ static int set_max_huge_pages(struct hstate *h, unsigned
> >> long count,
> >>  	 * pool might be one hugepage larger than it needs to be, but
> >>  	 * within all the constraints specified by the sysctls.
> >>  	 */
> >> -	spin_lock(&hugetlb_lock);
> >>  	while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
> >>  		if (!adjust_pool_surplus(h, nodes_allowed, -1))
> >>  			break;
> >> @@ -2421,16 +2437,18 @@ static ssize_t __nr_hugepages_store_common(bool
> >> obey_mempolicy,
> >>  			nodes_allowed = &node_states[N_MEMORY];
> >>  		}
> >>  	} else if (nodes_allowed) {
> >> +		/* Node specific request */
> >> +		init_nodemask_of_node(nodes_allowed, nid);
> >> +	} else {
> >>  		/*
> >> -		 * per node hstate attribute: adjust count to global,
> >> -		 * but restrict alloc/free to the specified node.
> >> +		 * Node specific request, but we could not allocate
> >> +		 * node mask.  Pass in ALL nodes, and clear nid.
> >>  		 */
> >> -		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
> >> -		init_nodemask_of_node(nodes_allowed, nid);
> >> -	} else
> >> +		nid = NUMA_NO_NODE;
> >>  		nodes_allowed = &node_states[N_MEMORY];
> >> +	}
> >>
> >> -	err = set_max_huge_pages(h, count, nodes_allowed);
> >> +	err = set_max_huge_pages(h, count, nid, nodes_allowed);
> >>  	if (err)
> >>  		goto out;
> >>
> > 
> > Looks good; Jing, could you test that this fixes your case?
> 
> Yes, I have tested this patch, it can also fix my case.

Great!

Reported-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Tested-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Acked-by: David Rientjes <rientjes@google.com>
Mike Kravetz Feb. 26, 2019, 7:32 p.m. UTC | #8
On 2/25/19 10:21 PM, David Rientjes wrote:
> On Tue, 26 Feb 2019, Jing Xiangfeng wrote:
>> On 2019/2/26 3:17, David Rientjes wrote:
>>> On Mon, 25 Feb 2019, Mike Kravetz wrote:
>>>
>>>> Ok, what about just moving the calculation/check inside the lock as in the
>>>> untested patch below?
>>>>
>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

<snip>

>>>
>>> Looks good; Jing, could you test that this fixes your case?
>>
>> Yes, I have tested this patch, it can also fix my case.
> 
> Great!
> 
> Reported-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
> Tested-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
> Acked-by: David Rientjes <rientjes@google.com>

Thanks Jing and David!

Here is the patch with an updated commit message and above tags:

From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Tue, 26 Feb 2019 10:43:24 -0800
Subject: [PATCH] hugetlbfs: fix potential over/underflow setting node specific
nr_hugepages

The number of node specific huge pages can be set via a file such as:
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
When a node specific value is specified, the global number of huge
pages must also be adjusted.  This adjustment is calculated as the
specified node specific value + (global value - current node value).
If the node specific value provided by the user is large enough, this
calculation could overflow an unsigned long leading to a smaller
than expected number of huge pages.

To fix, check the calculation for overflow.  If overflow is detected,
use ULONG_MAX as the requested value.  This is in line with the user
request to allocate as many huge pages as possible.

It was also noticed that the above calculation was done outside the
hugetlb_lock.  Therefore, the values could be inconsistent and result
in underflow.  To fix, the calculation is moved to within the routine
set_max_huge_pages() where the lock is held.

Reported-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Tested-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Acked-by: David Rientjes <rientjes@google.com>
---
 mm/hugetlb.c | 34 ++++++++++++++++++++++++++--------
 1 file changed, 26 insertions(+), 8 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b37e3100b7cc..a7e4223d2df5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2274,7 +2274,7 @@ static int adjust_pool_surplus(struct hstate *h,
nodemask_t *nodes_allowed,
 }

 #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
-static int set_max_huge_pages(struct hstate *h, unsigned long count,
+static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 						nodemask_t *nodes_allowed)
 {
 	unsigned long min_count, ret;
@@ -2289,6 +2289,23 @@ static int set_max_huge_pages(struct hstate *h, unsigned
long count,
 		goto decrease_pool;
 	}

+	spin_lock(&hugetlb_lock);
+
+	/*
+	 * Check for a node specific request.  Adjust global count, but
+	 * restrict alloc/free to the specified node.
+	 */
+	if (nid != NUMA_NO_NODE) {
+		unsigned long old_count = count;
+		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
+		/*
+		 * If user specified count causes overflow, set to
+		 * largest possible value.
+		 */
+		if (count < old_count)
+			count = ULONG_MAX;
+	}
+
 	/*
 	 * Increase the pool size
 	 * First take pages out of surplus state.  Then make up the
@@ -2300,7 +2317,6 @@ static int set_max_huge_pages(struct hstate *h, unsigned
long count,
 	 * pool might be one hugepage larger than it needs to be, but
 	 * within all the constraints specified by the sysctls.
 	 */
-	spin_lock(&hugetlb_lock);
 	while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
 		if (!adjust_pool_surplus(h, nodes_allowed, -1))
 			break;
@@ -2421,16 +2437,18 @@ static ssize_t __nr_hugepages_store_common(bool
obey_mempolicy,
 			nodes_allowed = &node_states[N_MEMORY];
 		}
 	} else if (nodes_allowed) {
+		/* Node specific request */
+		init_nodemask_of_node(nodes_allowed, nid);
+	} else {
 		/*
-		 * per node hstate attribute: adjust count to global,
-		 * but restrict alloc/free to the specified node.
+		 * Node specific request, but we could not allocate
+		 * node mask.  Pass in ALL nodes, and clear nid.
 		 */
-		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
-		init_nodemask_of_node(nodes_allowed, nid);
-	} else
+		nid = NUMA_NO_NODE;
 		nodes_allowed = &node_states[N_MEMORY];
+	}

-	err = set_max_huge_pages(h, count, nodes_allowed);
+	err = set_max_huge_pages(h, count, nid, nodes_allowed);
 	if (err)
 		goto out;
Andrew Morton Feb. 26, 2019, 10:36 p.m. UTC | #9
> 
> The number of node specific huge pages can be set via a file such as:
> /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
> When a node specific value is specified, the global number of huge
> pages must also be adjusted.  This adjustment is calculated as the
> specified node specific value + (global value - current node value).
> If the node specific value provided by the user is large enough, this
> calculation could overflow an unsigned long leading to a smaller
> than expected number of huge pages.
> 
> To fix, check the calculation for overflow.  If overflow is detected,
> use ULONG_MAX as the requested value.  This is inline with the user
> request to allocate as many huge pages as possible.
> 
> It was also noticed that the above calculation was done outside the
> hugetlb_lock.  Therefore, the values could be inconsistent and result
> in underflow.  To fix, the calculation is moved to within the routine
> set_max_huge_pages() where the lock is held.
> 
> ...
>
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2274,7 +2274,7 @@ static int adjust_pool_surplus(struct hstate *h,
> nodemask_t *nodes_allowed,

Please tweak that email client to prevent the wordwraps.

> +	/*
> +	 * Check for a node specific request.  Adjust global count, but
> +	 * restrict alloc/free to the specified node.
> +	 */
> +	if (nid != NUMA_NO_NODE) {
> +		unsigned long old_count = count;
> +		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
> +		/*
> +		 * If user specified count causes overflow, set to
> +		 * largest possible value.
> +		 */
> +		if (count < old_count)
> +			count = ULONG_MAX;
> +	}

The above two comments explain the code, but do not reveal the
reasoning behind the policy decisions which that code implements.

> ...
>
> +	} else {
>  		/*
> -		 * per node hstate attribute: adjust count to global,
> -		 * but restrict alloc/free to the specified node.
> +		 * Node specific request, but we could not allocate
> +		 * node mask.  Pass in ALL nodes, and clear nid.
>  		 */

Ditto here, somewhat.

The old mantra: comments should explain "why", not "what".  Reading the
code tells us the "what".

Thanks.
Mike Kravetz Feb. 27, 2019, 12:03 a.m. UTC | #10
On 2/26/19 2:36 PM, Andrew Morton wrote:
>> ...
>>
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -2274,7 +2274,7 @@ static int adjust_pool_surplus(struct hstate *h,
>> nodemask_t *nodes_allowed,
> 
> Please tweak that email client to prevent the wordwraps.

Sorry about that.

>> +	/*
>> +	 * Check for a node specific request.  Adjust global count, but
>> +	 * restrict alloc/free to the specified node.
>> +	 */

Better comment might be:

	/*
	 * Check for a node specific request.
	 * Changing node specific huge page count may require a corresponding
	 * change to the global count.  In any case, the passed node mask
	 * (nodes_allowed) will restrict alloc/free to the specified node.
	 */

>> +	if (nid != NUMA_NO_NODE) {
>> +		unsigned long old_count = count;
>> +		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
>> +		/*
>> +		 * If user specified count causes overflow, set to
>> +		 * largest possible value.
>> +		 */

Updated comment:
		/*
		 * User may have specified a large count value which caused the
		 * above calculation to overflow.  In this case, they wanted
		 * to allocate as many huge pages as possible.  Set count to
		 * largest possible value to align with their intention.
		 */

>> +		if (count < old_count)
>> +			count = ULONG_MAX;
>> +	}
> 
> The above two comments explain the code, but do not reveal the
> reasoning behind the policy decisions which that code implements.
> 
>> ...
>>
>> +	} else {
>>  		/*
>> -		 * per node hstate attribute: adjust count to global,
>> -		 * but restrict alloc/free to the specified node.
>> +		 * Node specific request, but we could not allocate
>> +		 * node mask.  Pass in ALL nodes, and clear nid.
>>  		 */
> 
> Ditto here, somewhat.

I was just going to update the comments and send you a new patch, but
your comment got me thinking about this situation.  I did not really
change the way this code operates.  As a reminder, the original code is like:

NODEMASK_ALLOC(nodemask_t, nodes_allowed, GFP_KERNEL | __GFP_NORETRY);

if (nid == NUMA_NO_NODE) {
	/* do something */
} else if (nodes_allowed) {
	/* do something else */
} else {
	nodes_allowed = &node_states[N_MEMORY];
}

So, the only way we get to that final else is if we can not allocate
a node mask (kmalloc a few words).  Right?  I wonder why we should
even try to continue in this case.  Why not just return right there?

The specified count value is either a request to increase number of
huge pages or decrease.  If we can't allocate a few words, we certainly
are not going to find memory to create huge pages.  There 'might' be
surplus pages which can be converted to permanent pages.  But remember
this is a 'node specific' request and we can't allocate a mask to pass
down to the conversion routines.  So, chances are good we would operate
on the wrong node.  The same goes for a request to 'free' huge pages.
Since, we can't allocate a node mask we are likely to free them from
the wrong node.

Unless my reasoning above is incorrect, I think that final else block
in __nr_hugepages_store_common() is wrong.

Any additional thoughts?
Naoya Horiguchi March 4, 2019, 6 a.m. UTC | #11
On Tue, Feb 26, 2019 at 11:32:24AM -0800, Mike Kravetz wrote:
> On 2/25/19 10:21 PM, David Rientjes wrote:
> > On Tue, 26 Feb 2019, Jing Xiangfeng wrote:
> >> On 2019/2/26 3:17, David Rientjes wrote:
> >>> On Mon, 25 Feb 2019, Mike Kravetz wrote:
> >>>
> >>>> Ok, what about just moving the calculation/check inside the lock as in the
> >>>> untested patch below?
> >>>>
> >>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> 
> <snip>
> 
> >>>
> >>> Looks good; Jing, could you test that this fixes your case?
> >>
> >> Yes, I have tested this patch, it can also fix my case.
> > 
> > Great!
> > 
> > Reported-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
> > Tested-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
> > Acked-by: David Rientjes <rientjes@google.com>
> 
> Thanks Jing and David!
> 
> Here is the patch with an updated commit message and above tags:
> 
> From: Mike Kravetz <mike.kravetz@oracle.com>
> Date: Tue, 26 Feb 2019 10:43:24 -0800
> Subject: [PATCH] hugetlbfs: fix potential over/underflow setting node specific
> nr_hugepages
> 
> The number of node specific huge pages can be set via a file such as:
> /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
> When a node specific value is specified, the global number of huge
> pages must also be adjusted.  This adjustment is calculated as the
> specified node specific value + (global value - current node value).
> If the node specific value provided by the user is large enough, this
> calculation could overflow an unsigned long leading to a smaller
> than expected number of huge pages.
> 
> To fix, check the calculation for overflow.  If overflow is detected,
> use ULONG_MAX as the requested value.  This is inline with the user
> request to allocate as many huge pages as possible.
> 
> It was also noticed that the above calculation was done outside the
> hugetlb_lock.  Therefore, the values could be inconsistent and result
> in underflow.  To fix, the calculation is moved to within the routine
> set_max_huge_pages() where the lock is held.
> 
> Reported-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> Tested-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
> Acked-by: David Rientjes <rientjes@google.com>

Looks good to me with improved comments.
Thanks everyone.

Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

> ---
>  mm/hugetlb.c | 34 ++++++++++++++++++++++++++--------
>  1 file changed, 26 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index b37e3100b7cc..a7e4223d2df5 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2274,7 +2274,7 @@ static int adjust_pool_surplus(struct hstate *h,
> nodemask_t *nodes_allowed,
>  }
> 
>  #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
> -static int set_max_huge_pages(struct hstate *h, unsigned long count,
> +static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  						nodemask_t *nodes_allowed)
>  {
>  	unsigned long min_count, ret;
> @@ -2289,6 +2289,23 @@ static int set_max_huge_pages(struct hstate *h, unsigned
> long count,
>  		goto decrease_pool;
>  	}
> 
> +	spin_lock(&hugetlb_lock);
> +
> +	/*
> +	 * Check for a node specific request.  Adjust global count, but
> +	 * restrict alloc/free to the specified node.
> +	 */
> +	if (nid != NUMA_NO_NODE) {
> +		unsigned long old_count = count;
> +		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
> +		/*
> +		 * If user specified count causes overflow, set to
> +		 * largest possible value.
> +		 */
> +		if (count < old_count)
> +			count = ULONG_MAX;
> +	}
> +
>  	/*
>  	 * Increase the pool size
>  	 * First take pages out of surplus state.  Then make up the
> @@ -2300,7 +2317,6 @@ static int set_max_huge_pages(struct hstate *h, unsigned
> long count,
>  	 * pool might be one hugepage larger than it needs to be, but
>  	 * within all the constraints specified by the sysctls.
>  	 */
> -	spin_lock(&hugetlb_lock);
>  	while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
>  		if (!adjust_pool_surplus(h, nodes_allowed, -1))
>  			break;
> @@ -2421,16 +2437,18 @@ static ssize_t __nr_hugepages_store_common(bool
> obey_mempolicy,
>  			nodes_allowed = &node_states[N_MEMORY];
>  		}
>  	} else if (nodes_allowed) {
> +		/* Node specific request */
> +		init_nodemask_of_node(nodes_allowed, nid);
> +	} else {
>  		/*
> -		 * per node hstate attribute: adjust count to global,
> -		 * but restrict alloc/free to the specified node.
> +		 * Node specific request, but we could not allocate
> +		 * node mask.  Pass in ALL nodes, and clear nid.
>  		 */
> -		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
> -		init_nodemask_of_node(nodes_allowed, nid);
> -	} else
> +		nid = NUMA_NO_NODE;
>  		nodes_allowed = &node_states[N_MEMORY];
> +	}
> 
> -	err = set_max_huge_pages(h, count, nodes_allowed);
> +	err = set_max_huge_pages(h, count, nid, nodes_allowed);
>  	if (err)
>  		goto out;
> 
> -- 
> 2.17.2
> 
>
Oscar Salvador March 4, 2019, 1:48 p.m. UTC | #12
On Tue, Feb 26, 2019 at 04:03:23PM -0800, Mike Kravetz wrote:
> I was just going to update the comments and send you a new patch, but
> but your comment got me thinking about this situation.  I did not really
> change the way this code operates.  As a reminder, the original code is like:
> 
> NODEMASK_ALLOC(nodemask_t, nodes_allowed, GFP_KERNEL | __GFP_NORETRY);
> 
> if (nid == NUMA_NO_NODE) {
> 	/* do something */
> } else if (nodes_allowed) {
> 	/* do something else */
> } else {
> 	nodes_allowed = &node_states[N_MEMORY];
> }
> 
> So, the only way we get to that final else if if we can not allocate
> a node mask (kmalloc a few words).  Right?  I wonder why we should
> even try to continue in this case.  Why not just return right there?
> 
> The specified count value is either a request to increase number of
> huge pages or decrease.  If we can't allocate a few words, we certainly
> are not going to find memory to create huge pages.  There 'might' be
> surplus pages which can be converted to permanent pages.  But remember
> this is a 'node specific' request and we can't allocate a mask to pass
> down to the conversion routines.  So, chances are good we would operate
> on the wrong node.  The same goes for a request to 'free' huge pages.
> Since, we can't allocate a node mask we are likely to free them from
> the wrong node.
> 
> Unless my reasoning above is incorrect, I think that final else block
> in __nr_hugepages_store_common() is wrong.
> 
> Any additional thoughts?

Could we not just kill the NODEMASK_ALLOC there?
__nr_hugepages_store_common() should be called from a rather shallow stack,
and unless I am wrong, the maximum size of a nodemask_t is 128 bytes
(1024 NUMA nodes).
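
(A quick back-of-the-envelope check of that figure, as a standalone sketch;
NODES_SHIFT = 10 is an assumption for the 1024-node case, not taken from any
particular config:)

#include <stdio.h>

#define NODES_SHIFT	10			/* assumed: up to 1024 nodes */
#define MAX_NUMNODES	(1UL << NODES_SHIFT)

int main(void)
{
	/* a nodemask is one bit per possible node */
	printf("nodemask size = %lu bytes\n", MAX_NUMNODES / 8);	/* 128 */
	return 0;
}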
Naoya Horiguchi March 5, 2019, 12:03 a.m. UTC | #13
On Tue, Feb 26, 2019 at 04:03:23PM -0800, Mike Kravetz wrote:
> On 2/26/19 2:36 PM, Andrew Morton wrote:
...
> >>
> >> +	} else {
> >>  		/*
> >> -		 * per node hstate attribute: adjust count to global,
> >> -		 * but restrict alloc/free to the specified node.
> >> +		 * Node specific request, but we could not allocate
> >> +		 * node mask.  Pass in ALL nodes, and clear nid.
> >>  		 */
> > 
> > Ditto here, somewhat.

# I missed this part when reviewing yesterday for some reason, sorry.

> 
> I was just going to update the comments and send you a new patch, but
> but your comment got me thinking about this situation.  I did not really
> change the way this code operates.  As a reminder, the original code is like:
> 
> NODEMASK_ALLOC(nodemask_t, nodes_allowed, GFP_KERNEL | __GFP_NORETRY);
> 
> if (nid == NUMA_NO_NODE) {
> 	/* do something */
> } else if (nodes_allowed) {
> 	/* do something else */
> } else {
> 	nodes_allowed = &node_states[N_MEMORY];
> }
> 
> So, the only way we get to that final else if if we can not allocate
> a node mask (kmalloc a few words).  Right?  I wonder why we should
> even try to continue in this case.  Why not just return right there?

Simply returning on allocation failure looks better to me.
As you mentioned below, current behavior for this 'else' case is not
helpful for anyone.

Thanks,
Naoya Horiguchi

> 
> The specified count value is either a request to increase number of
> huge pages or decrease.  If we can't allocate a few words, we certainly
> are not going to find memory to create huge pages.  There 'might' be
> surplus pages which can be converted to permanent pages.  But remember
> this is a 'node specific' request and we can't allocate a mask to pass
> down to the conversion routines.  So, chances are good we would operate
> on the wrong node.  The same goes for a request to 'free' huge pages.
> Since, we can't allocate a node mask we are likely to free them from
> the wrong node.
> 
> Unless my reasoning above is incorrect, I think that final else block
> in __nr_hugepages_store_common() is wrong.
> 
> Any additional thoughts?
> -- 
> Mike Kravetz
>
Mike Kravetz March 5, 2019, 4:15 a.m. UTC | #14
On 3/4/19 4:03 PM, Naoya Horiguchi wrote:
> On Tue, Feb 26, 2019 at 04:03:23PM -0800, Mike Kravetz wrote:
>> On 2/26/19 2:36 PM, Andrew Morton wrote:
> ...
>>>>
>>>> +	} else {
>>>>  		/*
>>>> -		 * per node hstate attribute: adjust count to global,
>>>> -		 * but restrict alloc/free to the specified node.
>>>> +		 * Node specific request, but we could not allocate
>>>> +		 * node mask.  Pass in ALL nodes, and clear nid.
>>>>  		 */
>>>
>>> Ditto here, somewhat.
> 
> # I missed this part when reviewing yesterday for some reason, sorry.
> 
>>
>> I was just going to update the comments and send you a new patch, but
>> but your comment got me thinking about this situation.  I did not really
>> change the way this code operates.  As a reminder, the original code is like:
>>
>> NODEMASK_ALLOC(nodemask_t, nodes_allowed, GFP_KERNEL | __GFP_NORETRY);
>>
>> if (nid == NUMA_NO_NODE) {
>> 	/* do something */
>> } else if (nodes_allowed) {
>> 	/* do something else */
>> } else {
>> 	nodes_allowed = &node_states[N_MEMORY];
>> }
>>
>> So, the only way we get to that final else if if we can not allocate
>> a node mask (kmalloc a few words).  Right?  I wonder why we should
>> even try to continue in this case.  Why not just return right there?
> 
> Simply returning on allocation failure looks better to me.
> As you mentioned below, current behavior for this 'else' case is not
> helpful for anyone.

Thanks Naoya.  And thank you, Oscar, for your suggestion.

I think the simplest thing to do would be to simply return in this case.  In
practice, we will likely never hit the condition.  If we do, the system is
really low on memory and there will be other more important issues.

The revised patch below updates comments as suggested, and returns -ENOMEM
if we can not allocate a node mask.

Andrew, this is on top of Alexandre Ghiti's "hugetlb: allow to free gigantic
pages regardless of the configuration" patch.  Both patches modify
__nr_hugepages_store_common().  Alex's patch is going to change slightly
in this area.  Let me know if there is something I can do to help make
merging easier.

From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Mon, 4 Mar 2019 17:45:11 -0800
Subject: [PATCH] hugetlbfs: fix potential over/underflow setting node specific
 nr_hugepages

The number of node specific huge pages can be set via a file such as:
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
When a node specific value is specified, the global number of huge
pages must also be adjusted.  This adjustment is calculated as the
specified node specific value + (global value - current node value).
If the node specific value provided by the user is large enough, this
calculation could overflow an unsigned long leading to a smaller
than expected number of huge pages.

To fix, check the calculation for overflow.  If overflow is detected,
use ULONG_MAX as the requested value.  This is in line with the user
request to allocate as many huge pages as possible.

It was also noticed that the above calculation was done outside the
hugetlb_lock.  Therefore, the values could be inconsistent and result
in underflow.  To fix, the calculation is moved within the routine
set_max_huge_pages() where the lock is held.

In addition, the code in __nr_hugepages_store_common() which tries to
handle the case of not being able to allocate a node mask would likely
result in incorrect behavior.  Luckily, it is very unlikely we will
ever take this path.  If we do, simply return ENOMEM.

Reported-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 42 +++++++++++++++++++++++++++++++++---------
 1 file changed, 33 insertions(+), 9 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c5c4558e4a79..5a190a652cac 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2274,7 +2274,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
 }
 
 #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
-static int set_max_huge_pages(struct hstate *h, unsigned long count,
+static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 						nodemask_t *nodes_allowed)
 {
 	unsigned long min_count, ret;
@@ -2289,6 +2289,28 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count,
 		goto decrease_pool;
 	}
 
+	spin_lock(&hugetlb_lock);
+
+	/*
+	 * Check for a node specific request.
+	 * Changing node specific huge page count may require a corresponding
+	 * change to the global count.  In any case, the passed node mask
+	 * (nodes_allowed) will restrict alloc/free to the specified node.
+	 */
+	if (nid != NUMA_NO_NODE) {
+		unsigned long old_count = count;
+
+		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
+		/*
+		 * User may have specified a large count value which caused the
+		 * above calculation to overflow.  In this case, they wanted
+		 * to allocate as many huge pages as possible.  Set count to
+		 * largest possible value to align with their intention.
+		 */
+		if (count < old_count)
+			count = ULONG_MAX;
+	}
+
 	/*
 	 * Increase the pool size
 	 * First take pages out of surplus state.  Then make up the
@@ -2300,7 +2322,6 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count,
 	 * pool might be one hugepage larger than it needs to be, but
 	 * within all the constraints specified by the sysctls.
 	 */
-	spin_lock(&hugetlb_lock);
 	while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
 		if (!adjust_pool_surplus(h, nodes_allowed, -1))
 			break;
@@ -2421,16 +2442,19 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy,
 			nodes_allowed = &node_states[N_MEMORY];
 		}
 	} else if (nodes_allowed) {
+		/* Node specific request */
+		init_nodemask_of_node(nodes_allowed, nid);
+	} else {
 		/*
-		 * per node hstate attribute: adjust count to global,
-		 * but restrict alloc/free to the specified node.
+		 * Node specific request, but we could not allocate the few
+		 * words required for a node mask.  We are unlikely to hit
+		 * this condition.  Since we can not pass down the appropriate
+		 * node mask, just return ENOMEM.
 		 */
-		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
-		init_nodemask_of_node(nodes_allowed, nid);
-	} else
-		nodes_allowed = &node_states[N_MEMORY];
+		return -ENOMEM;
+	}
 
-	err = set_max_huge_pages(h, count, nodes_allowed);
+	err = set_max_huge_pages(h, count, nid, nodes_allowed);
 	if (err)
 		goto out;
Andrew Morton March 5, 2019, 9:16 p.m. UTC | #15
On Mon, 4 Mar 2019 20:15:40 -0800 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> Andrew, this is on top of Alexandre Ghiti's "hugetlb: allow to free gigantic
> pages regardless of the configuration" patch.  Both patches modify
> __nr_hugepages_store_common().  Alex's patch is going to change slightly
> in this area.

OK, thanks, I missed that.  Are the changes significant?
Mike Kravetz March 5, 2019, 9:35 p.m. UTC | #16
On 3/5/19 1:16 PM, Andrew Morton wrote:
> On Mon, 4 Mar 2019 20:15:40 -0800 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
>> Andrew, this is on top of Alexandre Ghiti's "hugetlb: allow to free gigantic
>> pages regardless of the configuration" patch.  Both patches modify
>> __nr_hugepages_store_common().  Alex's patch is going to change slightly
>> in this area.
> 
> OK, thanks, I missed that.  Are the changes significant?
> 

No, changes should be minor.  IIRC, just checking for a condition in an
error path.
Alexandre Ghiti March 5, 2019, 9:41 p.m. UTC | #17
On 3/5/19 4:35 PM, Mike Kravetz wrote:
> On 3/5/19 1:16 PM, Andrew Morton wrote:
>> On Mon, 4 Mar 2019 20:15:40 -0800 Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>>> Andrew, this is on top of Alexandre Ghiti's "hugetlb: allow to free gigantic
>>> pages regardless of the configuration" patch.  Both patches modify
>>> __nr_hugepages_store_common().  Alex's patch is going to change slightly
>>> in this area.
>> OK, thanks, I missed that.  Are the changes significant?
>>
> No, changes should be minor.  IIRC, just checking for a condition in an
> error path.
I will send v5 of this patch tomorrow; I was waiting for architecture
maintainer remarks. I am still missing sh, but I'm confident in this change.

Alex
Oscar Salvador March 6, 2019, 9:41 a.m. UTC | #18
On Mon, Mar 04, 2019 at 08:15:40PM -0800, Mike Kravetz wrote:
> In addition, the code in __nr_hugepages_store_common() which tries to
> handle the case of not being able to allocate a node mask would likely
> result in incorrect behavior.  Luckily, it is very unlikely we will
> ever take this path.  If we do, simply return ENOMEM.

Hi Mike,

I still think that we could just get rid of the NODEMASK_ALLOC machinery
here; it adds needless complexity IMHO.
Note that before "(5df66d306ec9: mm: fix comment for NODEMASK_ALLOC)",
the comment about the size was wrong, showing a much bigger size than it
actually was, and I would not be surprised if people started to add
NODEMASK_ALLOC here and there because of that.

Actually, there was a little talk about removing NODEMASK_ALLOC altogether,
but some further checks must be done first.

> Reported-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

But the overall change looks good to me:

Reviewed-by: Oscar Salvador <osalvador@suse.de>

> ---
>  mm/hugetlb.c | 42 +++++++++++++++++++++++++++++++++---------
>  1 file changed, 33 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c5c4558e4a79..5a190a652cac 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2274,7 +2274,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
>  }
>  
>  #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
> -static int set_max_huge_pages(struct hstate *h, unsigned long count,
> +static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  						nodemask_t *nodes_allowed)
>  {
>  	unsigned long min_count, ret;
> @@ -2289,6 +2289,28 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count,
>  		goto decrease_pool;
>  	}
>  
> +	spin_lock(&hugetlb_lock);
> +
> +	/*
> +	 * Check for a node specific request.
> +	 * Changing node specific huge page count may require a corresponding
> +	 * change to the global count.  In any case, the passed node mask
> +	 * (nodes_allowed) will restrict alloc/free to the specified node.
> +	 */
> +	if (nid != NUMA_NO_NODE) {
> +		unsigned long old_count = count;
> +
> +		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
> +		/*
> +		 * User may have specified a large count value which caused the
> +		 * above calculation to overflow.  In this case, they wanted
> +		 * to allocate as many huge pages as possible.  Set count to
> +		 * largest possible value to align with their intention.
> +		 */
> +		if (count < old_count)
> +			count = ULONG_MAX;
> +	}
> +
>  	/*
>  	 * Increase the pool size
>  	 * First take pages out of surplus state.  Then make up the
> @@ -2300,7 +2322,6 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count,
>  	 * pool might be one hugepage larger than it needs to be, but
>  	 * within all the constraints specified by the sysctls.
>  	 */
> -	spin_lock(&hugetlb_lock);
>  	while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
>  		if (!adjust_pool_surplus(h, nodes_allowed, -1))
>  			break;
> @@ -2421,16 +2442,19 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy,
>  			nodes_allowed = &node_states[N_MEMORY];
>  		}
>  	} else if (nodes_allowed) {
> +		/* Node specific request */
> +		init_nodemask_of_node(nodes_allowed, nid);
> +	} else {
>  		/*
> -		 * per node hstate attribute: adjust count to global,
> -		 * but restrict alloc/free to the specified node.
> +		 * Node specific request, but we could not allocate the few
> +		 * words required for a node mask.  We are unlikely to hit
> +		 * this condition.  Since we can not pass down the appropriate
> +		 * node mask, just return ENOMEM.
>  		 */
> -		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
> -		init_nodemask_of_node(nodes_allowed, nid);
> -	} else
> -		nodes_allowed = &node_states[N_MEMORY];
> +		return -ENOMEM;
> +	}
>  
> -	err = set_max_huge_pages(h, count, nodes_allowed);
> +	err = set_max_huge_pages(h, count, nid, nodes_allowed);
>  	if (err)
>  		goto out;
>  
> -- 
> 2.17.2
>
Mike Kravetz March 7, 2019, 12:17 a.m. UTC | #19
On 3/6/19 1:41 AM, Oscar Salvador wrote:
> On Mon, Mar 04, 2019 at 08:15:40PM -0800, Mike Kravetz wrote:
>> In addition, the code in __nr_hugepages_store_common() which tries to
>> handle the case of not being able to allocate a node mask would likely
>> result in incorrect behavior.  Luckily, it is very unlikely we will
>> ever take this path.  If we do, simply return ENOMEM.
> 
> Hi Mike,
> 
> I still thnk that we could just get rid of the NODEMASK_ALLOC machinery
> here, it adds a needlessly complexity IMHO.
> Note that before "(5df66d306ec9: mm: fix comment for NODEMASK_ALLOC)",
> the comment about the size was wrong, showing a much bigger size that it
> actually was, and I would not be surprised if people started to add
> NODEMASK_ALLOC here and there because of that.
> 
> Actually, there was a little talk about removing NODEMASK_ALLOC altogether,
> but some further checks must be done before.

Thanks for the information.  I too saw or remembered a large byte value. :(
A quick grep doesn't reveal any configurable way to get NODE_SHIFT larger
than 10.  Of course, that could change.  So, it does seem a bit funny that
NODEMASK_ALLOC() kicks into dynamic allocation mode with NODE_SHIFT > 8.
Although, my desktop distro has NODE_SHIFT set to 10.
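
For reference, the behaviour being discussed comes from the NODEMASK_ALLOC()
and NODEMASK_FREE() helpers; roughly paraphrased (see include/linux/nodemask.h
in the tree for the exact definitions), they look like this:

/* Above the NODES_SHIFT threshold the mask is kmalloc'ed (and the
 * allocation may fail); otherwise it lives on the stack and
 * NODEMASK_FREE() is a no-op.
 */
#if NODES_SHIFT > 8
#define NODEMASK_ALLOC(type, name, gfp_flags)	\
		type *name = kmalloc(sizeof(*name), gfp_flags)
#define NODEMASK_FREE(m)	kfree(m)
#else
#define NODEMASK_ALLOC(type, name, gfp_flags)	type _##name, *name = &_##name
#define NODEMASK_FREE(m)	do {} while (0)
#endif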

>> Reported-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> 
> But the overall change looks good to me:
> 
> Reviewed-by: Oscar Salvador <osalvador@suse.de>

Thanks.
I'm going to leave it as is for now and put off removal of the dynamic
allocation until a later time.  Unless you get around to removing
NODEMASK_ALLOC altogether. :)

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index afef616..6688894 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2423,7 +2423,14 @@  static ssize_t __nr_hugepages_store_common(bool obey_mempolicy,
 		 * per node hstate attribute: adjust count to global,
 		 * but restrict alloc/free to the specified node.
 		 */
+		unsigned long old_count = count;
 		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
+		/*
+		 * If user specified count causes overflow, set to
+		 * largest possible value.
+		 */
+		if (count < old_count)
+			count = ULONG_MAX;
 		init_nodemask_of_node(nodes_allowed, nid);
 	} else
 		nodes_allowed = &node_states[N_MEMORY];