diff mbox series

hugetlb: Support node specified when using cma for gigantic hugepages

Message ID 1633843448-966-1-git-send-email-baolin.wang@linux.alibaba.com (mailing list archive)
State New
Headers show
Series hugetlb: Support node specified when using cma for gigantic hugepages | expand

Commit Message

Baolin Wang Oct. 10, 2021, 5:24 a.m. UTC
Now the size of CMA area for gigantic hugepages runtime allocation is
balanced for all online nodes, but we also want to specify the size of
CMA per-node, or only one node in some cases, which are similar with
commit 86acc55c3d32 ("hugetlbfs: extend the definition of hugepages
parameter to support node allocation")[1].

Thus this patch adds node format for 'hugetlb_cma' parameter to support
specifying the size of CMA per-node. An example is as follows:

hugetlb_cma=0:5G,2:5G

which means allocating 5G size of CMA area on node 0 and node 2
respectively.

[1]
https://lkml.kernel.org/r/20211005054729.86457-1-yaozhenguo1@gmail.com

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  6 +-
 mm/hugetlb.c                                    | 79 +++++++++++++++++++++----
 2 files changed, 73 insertions(+), 12 deletions(-)

Comments

Andrew Morton Oct. 10, 2021, 8:55 p.m. UTC | #1
On Sun, 10 Oct 2021 13:24:08 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:

> Now the size of CMA area for gigantic hugepages runtime allocation is
> balanced for all online nodes, but we also want to specify the size of
> CMA per-node, or only one node in some cases, which are similar with

Please describe in full detail why "we want to" do this.  In other
words, what is the benefit to our users?  What are the use-cases, etc?

> commit 86acc55c3d32 ("hugetlbfs: extend the definition of hugepages
> parameter to support node allocation")[1].
> 
> Thus this patch adds node format for 'hugetlb_cma' parameter to support
> specifying the size of CMA per-node. An example is as follows:
> 
> hugetlb_cma=0:5G,2:5G
> 
> which means allocating 5G size of CMA area on node 0 and node 2
> respectively.
>
Baolin Wang Oct. 11, 2021, 2:14 a.m. UTC | #2
On 2021/10/11 4:55, Andrew Morton wrote:
> On Sun, 10 Oct 2021 13:24:08 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> 
>> Now the size of CMA area for gigantic hugepages runtime allocation is
>> balanced for all online nodes, but we also want to specify the size of
>> CMA per-node, or only one node in some cases, which are similar with
> 
> Please describe in full detail why "we want to" do this.  In other
> words, what is the benefit to our users?  What are the use-cases, etc?

Sure. On some multi-nodes systems, each node's memory can be different, 
allocating the same size of CMA for each node is not suitable for the 
low-memory nodes. Meanwhile some workloads like DPDK mentioned by 
Zhenguo only need hugepages in one node.

On the other hand, we have some machines with multiple types of memory, 
like DRAM and PMEM (persistent memory). On this system, we may want to 
specify all the hugepages on DRAM node, or specify the proportion of 
DRAM node and PMEM node, to tuning the performance of the workloads.
Mike Kravetz Oct. 13, 2021, 10:06 p.m. UTC | #3
On 10/9/21 10:24 PM, Baolin Wang wrote:
> Now the size of CMA area for gigantic hugepages runtime allocation is
> balanced for all online nodes, but we also want to specify the size of
> CMA per-node, or only one node in some cases, which are similar with
> commit 86acc55c3d32 ("hugetlbfs: extend the definition of hugepages
> parameter to support node allocation")[1].
> 
> Thus this patch adds node format for 'hugetlb_cma' parameter to support
> specifying the size of CMA per-node. An example is as follows:
> 
> hugetlb_cma=0:5G,2:5G
> 
> which means allocating 5G size of CMA area on node 0 and node 2
> respectively.
> 
> [1]
> https://lkml.kernel.org/r/20211005054729.86457-1-yaozhenguo1@gmail.com
> 
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>  Documentation/admin-guide/kernel-parameters.txt |  6 +-
>  mm/hugetlb.c                                    | 79 +++++++++++++++++++++----
>  2 files changed, 73 insertions(+), 12 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 3ad8e9d0..a147faa5 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1587,8 +1587,10 @@
>  			registers.  Default set by CONFIG_HPET_MMAP_DEFAULT.
>  
>  	hugetlb_cma=	[HW,CMA] The size of a CMA area used for allocation
> -			of gigantic hugepages.
> -			Format: nn[KMGTPE]
> +			of gigantic hugepages. Or using node format, the size
> +			of a CMA area per node can be specified.
> +			Format: nn[KMGTPE] or (node format)
> +				<node>:nn[KMGTPE][,<node>:nn[KMGTPE]]
>  
>  			Reserve a CMA area of given size and allocate gigantic
>  			hugepages using the CMA allocator. If enabled, the
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6d2f4c2..8b4e409 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -50,6 +50,7 @@
>  
>  #ifdef CONFIG_CMA
>  static struct cma *hugetlb_cma[MAX_NUMNODES];
> +static unsigned long hugetlb_cma_size_in_node[MAX_NUMNODES] __initdata;
>  static bool hugetlb_cma_page(struct page *page, unsigned int order)
>  {
>  	return cma_pages_valid(hugetlb_cma[page_to_nid(page)], page,
> @@ -62,6 +63,7 @@ static bool hugetlb_cma_page(struct page *page, unsigned int order)
>  }
>  #endif
>  static unsigned long hugetlb_cma_size __initdata;
> +static nodemask_t hugetlb_cma_nodes_allowed = NODE_MASK_NONE;
>  
>  /*
>   * Minimum page order among possible hugepage sizes, set to a proper value
> @@ -3497,9 +3499,15 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy,
>  
>  	if (nid == NUMA_NO_NODE) {
>  		/*
> +		 * If we've specified the size of CMA area per node,
> +		 * should use it firstly.
> +		 */
> +		if (hstate_is_gigantic(h) && !nodes_empty(hugetlb_cma_nodes_allowed))
> +			n_mask = &hugetlb_cma_nodes_allowed;
> +		/*

IIUC, this changes the behavior for 'balanced' gigantic huge page pool
allocations if per-node hugetlb_cma is specified.  It will now only
attempt to allocate gigantic pages on nodes where CMA was reserved.
Even if we run out of space on the node, it will not go to other nodes
as before.  Is that correct?

I do not believe we want this change in behavior.  IMO, if the user is
doing node specific CMA reservations, then the user should use the node
specific sysfs file for pool allocations on that node.

>  		 * global hstate attribute
>  		 */
> -		if (!(obey_mempolicy &&
> +		else if (!(obey_mempolicy &&
>  				init_nodemask_of_mempolicy(&nodes_allowed)))
>  			n_mask = &node_states[N_MEMORY];
>  		else
> @@ -6745,7 +6753,38 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
>  
>  static int __init cmdline_parse_hugetlb_cma(char *p)
>  {
> -	hugetlb_cma_size = memparse(p, &p);
> +	int nid, count = 0;
> +	unsigned long tmp;
> +	char *s = p;
> +
> +	while (*s) {
> +		if (sscanf(s, "%lu%n", &tmp, &count) != 1)
> +			break;
> +
> +		if (s[count] == ':') {
> +			nid = tmp;
> +			if (nid < 0 || nid >= MAX_NUMNODES)
> +				break;

nid can only be compared to MAX_NUMNODES because this an early param
before numa is setup and we do not know exactly how many nodes there
are.  Is this correct?

Suppose one specifies an invaid node.  For example, on my 2 node system
I use the option 'hugetlb_cma=2:2G'.  This is not flagged as an error
during processing and 1G CMA is reserved on node 0 and 1G is reserved
on node 1.  Is that by design, or just chance?

We should be able to catch this in hugetlb_cma_reserve.  For the example
above, I think we should flag this as an error and not reserve any CMA.

> +
> +			s += count + 1;
> +			tmp = memparse(s, &s);
> +			hugetlb_cma_size_in_node[nid] = tmp;
> +			hugetlb_cma_size += tmp;
> +
> +			/*
> +			 * Skip the separator if have one, otherwise
> +			 * break the parsing.
> +			 */
> +			if (*s == ',')
> +				s++;
> +			else
> +				break;
> +		} else {
> +			hugetlb_cma_size = memparse(p, &p);
> +			break;
> +		}
> +	}
> +
>  	return 0;
>  }
>  
> @@ -6754,6 +6793,7 @@ static int __init cmdline_parse_hugetlb_cma(char *p)
>  void __init hugetlb_cma_reserve(int order)
>  {
>  	unsigned long size, reserved, per_node;
> +	bool node_specific_cma_alloc = false;
>  	int nid;
>  
>  	cma_reserve_called = true;
> @@ -6767,20 +6807,37 @@ void __init hugetlb_cma_reserve(int order)
>  		return;
>  	}

Earlier in hugetlb_cma_reserve (not in the context here), there is this
code:

	if (hugetlb_cma_size < (PAGE_SIZE << order)) {
		pr_warn("hugetlb_cma: cma area should be at least %lu MiB\n",
			(PAGE_SIZE << order) / SZ_1M);
		hugetlb_cma_size = 0;
		return;
	}

That causes an early exit if hugetlb_cma_size is too small for a
gigantic page.

On my 2 node x86 system with 1G gigantic pages, I can specify
'hugetlb_cma=0:512M,1:512M'.  This does not trigger the above early exit
because total hugetlb_cma_size is 1G.  It does end up reserving 1G on
node 0 and nothing on node 1.  I do not believe this is by design.  We
should validate the specified per-node sizes as well.
Baolin Wang Oct. 14, 2021, 2:23 a.m. UTC | #4
On 2021/10/14 6:06, Mike Kravetz wrote:
> On 10/9/21 10:24 PM, Baolin Wang wrote:
>> Now the size of CMA area for gigantic hugepages runtime allocation is
>> balanced for all online nodes, but we also want to specify the size of
>> CMA per-node, or only one node in some cases, which are similar with
>> commit 86acc55c3d32 ("hugetlbfs: extend the definition of hugepages
>> parameter to support node allocation")[1].
>>
>> Thus this patch adds node format for 'hugetlb_cma' parameter to support
>> specifying the size of CMA per-node. An example is as follows:
>>
>> hugetlb_cma=0:5G,2:5G
>>
>> which means allocating 5G size of CMA area on node 0 and node 2
>> respectively.
>>
>> [1]
>> https://lkml.kernel.org/r/20211005054729.86457-1-yaozhenguo1@gmail.com
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>>   Documentation/admin-guide/kernel-parameters.txt |  6 +-
>>   mm/hugetlb.c                                    | 79 +++++++++++++++++++++----
>>   2 files changed, 73 insertions(+), 12 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>> index 3ad8e9d0..a147faa5 100644
>> --- a/Documentation/admin-guide/kernel-parameters.txt
>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>> @@ -1587,8 +1587,10 @@
>>   			registers.  Default set by CONFIG_HPET_MMAP_DEFAULT.
>>   
>>   	hugetlb_cma=	[HW,CMA] The size of a CMA area used for allocation
>> -			of gigantic hugepages.
>> -			Format: nn[KMGTPE]
>> +			of gigantic hugepages. Or using node format, the size
>> +			of a CMA area per node can be specified.
>> +			Format: nn[KMGTPE] or (node format)
>> +				<node>:nn[KMGTPE][,<node>:nn[KMGTPE]]
>>   
>>   			Reserve a CMA area of given size and allocate gigantic
>>   			hugepages using the CMA allocator. If enabled, the
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 6d2f4c2..8b4e409 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -50,6 +50,7 @@
>>   
>>   #ifdef CONFIG_CMA
>>   static struct cma *hugetlb_cma[MAX_NUMNODES];
>> +static unsigned long hugetlb_cma_size_in_node[MAX_NUMNODES] __initdata;
>>   static bool hugetlb_cma_page(struct page *page, unsigned int order)
>>   {
>>   	return cma_pages_valid(hugetlb_cma[page_to_nid(page)], page,
>> @@ -62,6 +63,7 @@ static bool hugetlb_cma_page(struct page *page, unsigned int order)
>>   }
>>   #endif
>>   static unsigned long hugetlb_cma_size __initdata;
>> +static nodemask_t hugetlb_cma_nodes_allowed = NODE_MASK_NONE;
>>   
>>   /*
>>    * Minimum page order among possible hugepage sizes, set to a proper value
>> @@ -3497,9 +3499,15 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy,
>>   
>>   	if (nid == NUMA_NO_NODE) {
>>   		/*
>> +		 * If we've specified the size of CMA area per node,
>> +		 * should use it firstly.
>> +		 */
>> +		if (hstate_is_gigantic(h) && !nodes_empty(hugetlb_cma_nodes_allowed))
>> +			n_mask = &hugetlb_cma_nodes_allowed;
>> +		/*
> 
> IIUC, this changes the behavior for 'balanced' gigantic huge page pool
> allocations if per-node hugetlb_cma is specified.  It will now only
> attempt to allocate gigantic pages on nodes where CMA was reserved.
> Even if we run out of space on the node, it will not go to other nodes
> as before.  Is that correct?

Right.

> 
> I do not believe we want this change in behavior.  IMO, if the user is
> doing node specific CMA reservations, then the user should use the node
> specific sysfs file for pool allocations on that node.

Sounds more reasonable, will move 'hugetlb_cma_nodes_allowed' to the 
node specific allocation.

>>   		 * global hstate attribute
>>   		 */
>> -		if (!(obey_mempolicy &&
>> +		else if (!(obey_mempolicy &&
>>   				init_nodemask_of_mempolicy(&nodes_allowed)))
>>   			n_mask = &node_states[N_MEMORY];
>>   		else
>> @@ -6745,7 +6753,38 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
>>   
>>   static int __init cmdline_parse_hugetlb_cma(char *p)
>>   {
>> -	hugetlb_cma_size = memparse(p, &p);
>> +	int nid, count = 0;
>> +	unsigned long tmp;
>> +	char *s = p;
>> +
>> +	while (*s) {
>> +		if (sscanf(s, "%lu%n", &tmp, &count) != 1)
>> +			break;
>> +
>> +		if (s[count] == ':') {
>> +			nid = tmp;
>> +			if (nid < 0 || nid >= MAX_NUMNODES)
>> +				break;
> 
> nid can only be compared to MAX_NUMNODES because this an early param
> before numa is setup and we do not know exactly how many nodes there
> are.  Is this correct?

Yes.

> 
> Suppose one specifies an invaid node.  For example, on my 2 node system
> I use the option 'hugetlb_cma=2:2G'.  This is not flagged as an error
> during processing and 1G CMA is reserved on node 0 and 1G is reserved
> on node 1.  Is that by design, or just chance?

Actually we won't allocate any CMA area in this case, since in 
hugetlb_cma_reserve(), we will only iterate the online nodes to try to 
allocate CMA area, and node 2 is not in the range of online nodes in 
this case.

> We should be able to catch this in hugetlb_cma_reserve.  For the example
> above, I think we should flag this as an error and not reserve any CMA.

Sure, as I said above, it will not allocate CMA for the non-online nodes 
though these invalid nodes can be specified in the command line. But I 
can add a warning to catch the invalid nodes setting in 
hugetlb_cma_reserve().

>> +
>> +			s += count + 1;
>> +			tmp = memparse(s, &s);
>> +			hugetlb_cma_size_in_node[nid] = tmp;
>> +			hugetlb_cma_size += tmp;
>> +
>> +			/*
>> +			 * Skip the separator if have one, otherwise
>> +			 * break the parsing.
>> +			 */
>> +			if (*s == ',')
>> +				s++;
>> +			else
>> +				break;
>> +		} else {
>> +			hugetlb_cma_size = memparse(p, &p);
>> +			break;
>> +		}
>> +	}
>> +
>>   	return 0;
>>   }
>>   
>> @@ -6754,6 +6793,7 @@ static int __init cmdline_parse_hugetlb_cma(char *p)
>>   void __init hugetlb_cma_reserve(int order)
>>   {
>>   	unsigned long size, reserved, per_node;
>> +	bool node_specific_cma_alloc = false;
>>   	int nid;
>>   
>>   	cma_reserve_called = true;
>> @@ -6767,20 +6807,37 @@ void __init hugetlb_cma_reserve(int order)
>>   		return;
>>   	}
> 
> Earlier in hugetlb_cma_reserve (not in the context here), there is this
> code:
> 
> 	if (hugetlb_cma_size < (PAGE_SIZE << order)) {
> 		pr_warn("hugetlb_cma: cma area should be at least %lu MiB\n",
> 			(PAGE_SIZE << order) / SZ_1M);
> 		hugetlb_cma_size = 0;
> 		return;
> 	}
> 
> That causes an early exit if hugetlb_cma_size is too small for a
> gigantic page.
> 
> On my 2 node x86 system with 1G gigantic pages, I can specify
> 'hugetlb_cma=0:512M,1:512M'.  This does not trigger the above early exit
> because total hugetlb_cma_size is 1G.  It does end up reserving 1G on
> node 0 and nothing on node 1.  I do not believe this is by design.  We
> should validate the specified per-node sizes as well.

Sure. Will do in next version. Thanks for your comments.
Mike Kravetz Oct. 14, 2021, 2:30 a.m. UTC | #5
On 10/13/21 7:23 PM, Baolin Wang wrote:
> 
> 
> On 2021/10/14 6:06, Mike Kravetz wrote:
>> On 10/9/21 10:24 PM, Baolin Wang wrote:
>>> Now the size of CMA area for gigantic hugepages runtime allocation is
>>> balanced for all online nodes, but we also want to specify the size of
>>> CMA per-node, or only one node in some cases, which are similar with
>>> commit 86acc55c3d32 ("hugetlbfs: extend the definition of hugepages
>>> parameter to support node allocation")[1].
>>>
>>> Thus this patch adds node format for 'hugetlb_cma' parameter to support
>>> specifying the size of CMA per-node. An example is as follows:
>>>
>>> hugetlb_cma=0:5G,2:5G
>>>
>>> which means allocating 5G size of CMA area on node 0 and node 2
>>> respectively.
>>>
>>> [1]
>>> https://lkml.kernel.org/r/20211005054729.86457-1-yaozhenguo1@gmail.com
>>>
>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> ---
>>>   Documentation/admin-guide/kernel-parameters.txt |  6 +-
>>>   mm/hugetlb.c                                    | 79 +++++++++++++++++++++----
>>>   2 files changed, 73 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>>> index 3ad8e9d0..a147faa5 100644
>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>> @@ -1587,8 +1587,10 @@
>>>               registers.  Default set by CONFIG_HPET_MMAP_DEFAULT.
>>>         hugetlb_cma=    [HW,CMA] The size of a CMA area used for allocation
>>> -            of gigantic hugepages.
>>> -            Format: nn[KMGTPE]
>>> +            of gigantic hugepages. Or using node format, the size
>>> +            of a CMA area per node can be specified.
>>> +            Format: nn[KMGTPE] or (node format)
>>> +                <node>:nn[KMGTPE][,<node>:nn[KMGTPE]]
>>>                 Reserve a CMA area of given size and allocate gigantic
>>>               hugepages using the CMA allocator. If enabled, the
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 6d2f4c2..8b4e409 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -50,6 +50,7 @@
>>>     #ifdef CONFIG_CMA
>>>   static struct cma *hugetlb_cma[MAX_NUMNODES];
>>> +static unsigned long hugetlb_cma_size_in_node[MAX_NUMNODES] __initdata;
>>>   static bool hugetlb_cma_page(struct page *page, unsigned int order)
>>>   {
>>>       return cma_pages_valid(hugetlb_cma[page_to_nid(page)], page,
>>> @@ -62,6 +63,7 @@ static bool hugetlb_cma_page(struct page *page, unsigned int order)
>>>   }
>>>   #endif
>>>   static unsigned long hugetlb_cma_size __initdata;
>>> +static nodemask_t hugetlb_cma_nodes_allowed = NODE_MASK_NONE;
>>>     /*
>>>    * Minimum page order among possible hugepage sizes, set to a proper value
>>> @@ -3497,9 +3499,15 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy,
>>>         if (nid == NUMA_NO_NODE) {
>>>           /*
>>> +         * If we've specified the size of CMA area per node,
>>> +         * should use it firstly.
>>> +         */
>>> +        if (hstate_is_gigantic(h) && !nodes_empty(hugetlb_cma_nodes_allowed))
>>> +            n_mask = &hugetlb_cma_nodes_allowed;
>>> +        /*
>>
>> IIUC, this changes the behavior for 'balanced' gigantic huge page pool
>> allocations if per-node hugetlb_cma is specified.  It will now only
>> attempt to allocate gigantic pages on nodes where CMA was reserved.
>> Even if we run out of space on the node, it will not go to other nodes
>> as before.  Is that correct?
> 
> Right.
> 
>>
>> I do not believe we want this change in behavior.  IMO, if the user is
>> doing node specific CMA reservations, then the user should use the node
>> specific sysfs file for pool allocations on that node.
> 
> Sounds more reasonable, will move 'hugetlb_cma_nodes_allowed' to the node specific allocation.
> 
>>>            * global hstate attribute
>>>            */
>>> -        if (!(obey_mempolicy &&
>>> +        else if (!(obey_mempolicy &&
>>>                   init_nodemask_of_mempolicy(&nodes_allowed)))
>>>               n_mask = &node_states[N_MEMORY];
>>>           else
>>> @@ -6745,7 +6753,38 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
>>>     static int __init cmdline_parse_hugetlb_cma(char *p)
>>>   {
>>> -    hugetlb_cma_size = memparse(p, &p);
>>> +    int nid, count = 0;
>>> +    unsigned long tmp;
>>> +    char *s = p;
>>> +
>>> +    while (*s) {
>>> +        if (sscanf(s, "%lu%n", &tmp, &count) != 1)
>>> +            break;
>>> +
>>> +        if (s[count] == ':') {
>>> +            nid = tmp;
>>> +            if (nid < 0 || nid >= MAX_NUMNODES)
>>> +                break;
>>
>> nid can only be compared to MAX_NUMNODES because this an early param
>> before numa is setup and we do not know exactly how many nodes there
>> are.  Is this correct?
> 
> Yes.
> 
>>
>> Suppose one specifies an invaid node.  For example, on my 2 node system
>> I use the option 'hugetlb_cma=2:2G'.  This is not flagged as an error
>> during processing and 1G CMA is reserved on node 0 and 1G is reserved
>> on node 1.  Is that by design, or just chance?
> 
> Actually we won't allocate any CMA area in this case, since in hugetlb_cma_reserve(), we will only iterate the online nodes to try to allocate CMA area, and node 2 is not in the range of online nodes in this case.
> 

But, since it can not do node specifric allocations it falls through to
the all nodes case?  Here is what I see:

# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1
node 0 size: 8053 MB
node 0 free: 6543 MB
node 1 cpus: 2 3
node 1 size: 8150 MB
node 1 free: 4851 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

# reboot

# dmesg | grep -i huge
[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.15.0-rc4-mm1+ root=/dev/mapper/fedora_new--host-root ro rd.lvm.lv=fedora_new-host/root rd.lvm.lv=fedora_new-host/swap console=tty0 console=ttyS0,115200 audit=0 transparent_hugepage=always hugetlb_free_vmemmap=on hugetlb_cma=2:2G
[    0.008345] hugetlb_cma: reserve 2048 MiB, up to 1024 MiB per node
[    0.008349] hugetlb_cma: reserved 1024 MiB on node 0
[    0.008352] hugetlb_cma: reserved 1024 MiB on node 1
[    0.053682] Kernel command line: BOOT_IMAGE=/vmlinuz-5.15.0-rc4-mm1+ root=/dev/mapper/fedora_new--host-root ro rd.lvm.lv=fedora_new-host/root rd.lvm.lv=fedora_new-host/swap console=tty0 console=ttyS0,115200 audit=0 transparent_hugepage=always hugetlb_free_vmemmap=on hugetlb_cma=2:2G
[    0.401648] HugeTLB: can free 4094 vmemmap pages for hugepages-1048576kB
[    0.413681] HugeTLB: can free 6 vmemmap pages for hugepages-2048kB
[    0.414590] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
[    0.415653] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages
Baolin Wang Oct. 14, 2021, 2:39 a.m. UTC | #6
在 2021/10/14 10:30, Mike Kravetz 写道:
> On 10/13/21 7:23 PM, Baolin Wang wrote:
>>
>>
>> On 2021/10/14 6:06, Mike Kravetz wrote:
>>> On 10/9/21 10:24 PM, Baolin Wang wrote:
>>>> Now the size of CMA area for gigantic hugepages runtime allocation is
>>>> balanced for all online nodes, but we also want to specify the size of
>>>> CMA per-node, or only one node in some cases, which are similar with
>>>> commit 86acc55c3d32 ("hugetlbfs: extend the definition of hugepages
>>>> parameter to support node allocation")[1].
>>>>
>>>> Thus this patch adds node format for 'hugetlb_cma' parameter to support
>>>> specifying the size of CMA per-node. An example is as follows:
>>>>
>>>> hugetlb_cma=0:5G,2:5G
>>>>
>>>> which means allocating 5G size of CMA area on node 0 and node 2
>>>> respectively.
>>>>
>>>> [1]
>>>> https://lkml.kernel.org/r/20211005054729.86457-1-yaozhenguo1@gmail.com
>>>>
>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> ---
>>>>    Documentation/admin-guide/kernel-parameters.txt |  6 +-
>>>>    mm/hugetlb.c                                    | 79 +++++++++++++++++++++----
>>>>    2 files changed, 73 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>>>> index 3ad8e9d0..a147faa5 100644
>>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>>> @@ -1587,8 +1587,10 @@
>>>>                registers.  Default set by CONFIG_HPET_MMAP_DEFAULT.
>>>>          hugetlb_cma=    [HW,CMA] The size of a CMA area used for allocation
>>>> -            of gigantic hugepages.
>>>> -            Format: nn[KMGTPE]
>>>> +            of gigantic hugepages. Or using node format, the size
>>>> +            of a CMA area per node can be specified.
>>>> +            Format: nn[KMGTPE] or (node format)
>>>> +                <node>:nn[KMGTPE][,<node>:nn[KMGTPE]]
>>>>                  Reserve a CMA area of given size and allocate gigantic
>>>>                hugepages using the CMA allocator. If enabled, the
>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>> index 6d2f4c2..8b4e409 100644
>>>> --- a/mm/hugetlb.c
>>>> +++ b/mm/hugetlb.c
>>>> @@ -50,6 +50,7 @@
>>>>      #ifdef CONFIG_CMA
>>>>    static struct cma *hugetlb_cma[MAX_NUMNODES];
>>>> +static unsigned long hugetlb_cma_size_in_node[MAX_NUMNODES] __initdata;
>>>>    static bool hugetlb_cma_page(struct page *page, unsigned int order)
>>>>    {
>>>>        return cma_pages_valid(hugetlb_cma[page_to_nid(page)], page,
>>>> @@ -62,6 +63,7 @@ static bool hugetlb_cma_page(struct page *page, unsigned int order)
>>>>    }
>>>>    #endif
>>>>    static unsigned long hugetlb_cma_size __initdata;
>>>> +static nodemask_t hugetlb_cma_nodes_allowed = NODE_MASK_NONE;
>>>>      /*
>>>>     * Minimum page order among possible hugepage sizes, set to a proper value
>>>> @@ -3497,9 +3499,15 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy,
>>>>          if (nid == NUMA_NO_NODE) {
>>>>            /*
>>>> +         * If we've specified the size of CMA area per node,
>>>> +         * should use it firstly.
>>>> +         */
>>>> +        if (hstate_is_gigantic(h) && !nodes_empty(hugetlb_cma_nodes_allowed))
>>>> +            n_mask = &hugetlb_cma_nodes_allowed;
>>>> +        /*
>>>
>>> IIUC, this changes the behavior for 'balanced' gigantic huge page pool
>>> allocations if per-node hugetlb_cma is specified.  It will now only
>>> attempt to allocate gigantic pages on nodes where CMA was reserved.
>>> Even if we run out of space on the node, it will not go to other nodes
>>> as before.  Is that correct?
>>
>> Right.
>>
>>>
>>> I do not believe we want this change in behavior.  IMO, if the user is
>>> doing node specific CMA reservations, then the user should use the node
>>> specific sysfs file for pool allocations on that node.
>>
>> Sounds more reasonable, will move 'hugetlb_cma_nodes_allowed' to the node specific allocation.
>>
>>>>             * global hstate attribute
>>>>             */
>>>> -        if (!(obey_mempolicy &&
>>>> +        else if (!(obey_mempolicy &&
>>>>                    init_nodemask_of_mempolicy(&nodes_allowed)))
>>>>                n_mask = &node_states[N_MEMORY];
>>>>            else
>>>> @@ -6745,7 +6753,38 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
>>>>      static int __init cmdline_parse_hugetlb_cma(char *p)
>>>>    {
>>>> -    hugetlb_cma_size = memparse(p, &p);
>>>> +    int nid, count = 0;
>>>> +    unsigned long tmp;
>>>> +    char *s = p;
>>>> +
>>>> +    while (*s) {
>>>> +        if (sscanf(s, "%lu%n", &tmp, &count) != 1)
>>>> +            break;
>>>> +
>>>> +        if (s[count] == ':') {
>>>> +            nid = tmp;
>>>> +            if (nid < 0 || nid >= MAX_NUMNODES)
>>>> +                break;
>>>
>>> nid can only be compared to MAX_NUMNODES because this an early param
>>> before numa is setup and we do not know exactly how many nodes there
>>> are.  Is this correct?
>>
>> Yes.
>>
>>>
>>> Suppose one specifies an invaid node.  For example, on my 2 node system
>>> I use the option 'hugetlb_cma=2:2G'.  This is not flagged as an error
>>> during processing and 1G CMA is reserved on node 0 and 1G is reserved
>>> on node 1.  Is that by design, or just chance?
>>
>> Actually we won't allocate any CMA area in this case, since in hugetlb_cma_reserve(), we will only iterate the online nodes to try to allocate CMA area, and node 2 is not in the range of online nodes in this case.
>>
> 
> But, since it can not do node specifric allocations it falls through to
> the all nodes case?  Here is what I see:
> 
> # numactl -H
> available: 2 nodes (0-1)
> node 0 cpus: 0 1
> node 0 size: 8053 MB
> node 0 free: 6543 MB
> node 1 cpus: 2 3
> node 1 size: 8150 MB
> node 1 free: 4851 MB
> node distances:
> node   0   1
>    0:  10  20
>    1:  20  10
> 
> # reboot
> 
> # dmesg | grep -i huge
> [    0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.15.0-rc4-mm1+ root=/dev/mapper/fedora_new--host-root ro rd.lvm.lv=fedora_new-host/root rd.lvm.lv=fedora_new-host/swap console=tty0 console=ttyS0,115200 audit=0 transparent_hugepage=always hugetlb_free_vmemmap=on hugetlb_cma=2:2G
> [    0.008345] hugetlb_cma: reserve 2048 MiB, up to 1024 MiB per node
> [    0.008349] hugetlb_cma: reserved 1024 MiB on node 0
> [    0.008352] hugetlb_cma: reserved 1024 MiB on node 1
> [    0.053682] Kernel command line: BOOT_IMAGE=/vmlinuz-5.15.0-rc4-mm1+ root=/dev/mapper/fedora_new--host-root ro rd.lvm.lv=fedora_new-host/root rd.lvm.lv=fedora_new-host/swap console=tty0 console=ttyS0,115200 audit=0 transparent_hugepage=always hugetlb_free_vmemmap=on hugetlb_cma=2:2G
> [    0.401648] HugeTLB: can free 4094 vmemmap pages for hugepages-1048576kB
> [    0.413681] HugeTLB: can free 6 vmemmap pages for hugepages-2048kB
> [    0.414590] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
> [    0.415653] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages

Oh, I see. I only validate the online nodes to check if we've specified 
the size of CMA, so in this case, it will fall back to the original 
balanced policy. As you said, I should catch the invalid nodes in 
hugetlb_cma_reserve() to avoid this isse. Thanks for pointing out the issue.

+	for_each_node_state(nid, N_ONLINE) {
+		if (hugetlb_cma_size_in_node[nid] > 0) {
+			node_specific_cma_alloc = true;
+			break;
+		}
+	}
diff mbox series

Patch

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 3ad8e9d0..a147faa5 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1587,8 +1587,10 @@ 
 			registers.  Default set by CONFIG_HPET_MMAP_DEFAULT.
 
 	hugetlb_cma=	[HW,CMA] The size of a CMA area used for allocation
-			of gigantic hugepages.
-			Format: nn[KMGTPE]
+			of gigantic hugepages. Or using node format, the size
+			of a CMA area per node can be specified.
+			Format: nn[KMGTPE] or (node format)
+				<node>:nn[KMGTPE][,<node>:nn[KMGTPE]]
 
 			Reserve a CMA area of given size and allocate gigantic
 			hugepages using the CMA allocator. If enabled, the
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6d2f4c2..8b4e409 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -50,6 +50,7 @@ 
 
 #ifdef CONFIG_CMA
 static struct cma *hugetlb_cma[MAX_NUMNODES];
+static unsigned long hugetlb_cma_size_in_node[MAX_NUMNODES] __initdata;
 static bool hugetlb_cma_page(struct page *page, unsigned int order)
 {
 	return cma_pages_valid(hugetlb_cma[page_to_nid(page)], page,
@@ -62,6 +63,7 @@  static bool hugetlb_cma_page(struct page *page, unsigned int order)
 }
 #endif
 static unsigned long hugetlb_cma_size __initdata;
+static nodemask_t hugetlb_cma_nodes_allowed = NODE_MASK_NONE;
 
 /*
  * Minimum page order among possible hugepage sizes, set to a proper value
@@ -3497,9 +3499,15 @@  static ssize_t __nr_hugepages_store_common(bool obey_mempolicy,
 
 	if (nid == NUMA_NO_NODE) {
 		/*
+		 * If we've specified the size of CMA area per node,
+		 * should use it firstly.
+		 */
+		if (hstate_is_gigantic(h) && !nodes_empty(hugetlb_cma_nodes_allowed))
+			n_mask = &hugetlb_cma_nodes_allowed;
+		/*
 		 * global hstate attribute
 		 */
-		if (!(obey_mempolicy &&
+		else if (!(obey_mempolicy &&
 				init_nodemask_of_mempolicy(&nodes_allowed)))
 			n_mask = &node_states[N_MEMORY];
 		else
@@ -6745,7 +6753,38 @@  void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
 
 static int __init cmdline_parse_hugetlb_cma(char *p)
 {
-	hugetlb_cma_size = memparse(p, &p);
+	int nid, count = 0;
+	unsigned long tmp;
+	char *s = p;
+
+	while (*s) {
+		if (sscanf(s, "%lu%n", &tmp, &count) != 1)
+			break;
+
+		if (s[count] == ':') {
+			nid = tmp;
+			if (nid < 0 || nid >= MAX_NUMNODES)
+				break;
+
+			s += count + 1;
+			tmp = memparse(s, &s);
+			hugetlb_cma_size_in_node[nid] = tmp;
+			hugetlb_cma_size += tmp;
+
+			/*
+			 * Skip the separator if have one, otherwise
+			 * break the parsing.
+			 */
+			if (*s == ',')
+				s++;
+			else
+				break;
+		} else {
+			hugetlb_cma_size = memparse(p, &p);
+			break;
+		}
+	}
+
 	return 0;
 }
 
@@ -6754,6 +6793,7 @@  static int __init cmdline_parse_hugetlb_cma(char *p)
 void __init hugetlb_cma_reserve(int order)
 {
 	unsigned long size, reserved, per_node;
+	bool node_specific_cma_alloc = false;
 	int nid;
 
 	cma_reserve_called = true;
@@ -6767,20 +6807,37 @@  void __init hugetlb_cma_reserve(int order)
 		return;
 	}
 
-	/*
-	 * If 3 GB area is requested on a machine with 4 numa nodes,
-	 * let's allocate 1 GB on first three nodes and ignore the last one.
-	 */
-	per_node = DIV_ROUND_UP(hugetlb_cma_size, nr_online_nodes);
-	pr_info("hugetlb_cma: reserve %lu MiB, up to %lu MiB per node\n",
-		hugetlb_cma_size / SZ_1M, per_node / SZ_1M);
+	for_each_node_state(nid, N_ONLINE) {
+		if (hugetlb_cma_size_in_node[nid] > 0) {
+			node_specific_cma_alloc = true;
+			break;
+		}
+	}
+
+	if (!node_specific_cma_alloc) {
+		/*
+		 * If 3 GB area is requested on a machine with 4 numa nodes,
+		 * let's allocate 1 GB on first three nodes and ignore the last one.
+		 */
+		per_node = DIV_ROUND_UP(hugetlb_cma_size, nr_online_nodes);
+		pr_info("hugetlb_cma: reserve %lu MiB, up to %lu MiB per node\n",
+			hugetlb_cma_size / SZ_1M, per_node / SZ_1M);
+	}
 
 	reserved = 0;
 	for_each_node_state(nid, N_ONLINE) {
 		int res;
 		char name[CMA_MAX_NAME];
 
-		size = min(per_node, hugetlb_cma_size - reserved);
+		if (node_specific_cma_alloc) {
+			if (hugetlb_cma_size_in_node[nid] <= 0)
+				continue;
+
+			size = hugetlb_cma_size_in_node[nid];
+		} else {
+			size = min(per_node, hugetlb_cma_size - reserved);
+		}
+
 		size = round_up(size, PAGE_SIZE << order);
 
 		snprintf(name, sizeof(name), "hugetlb%d", nid);
@@ -6799,6 +6856,8 @@  void __init hugetlb_cma_reserve(int order)
 			continue;
 		}
 
+		if (node_specific_cma_alloc)
+			node_set(nid, hugetlb_cma_nodes_allowed);
 		reserved += size;
 		pr_info("hugetlb_cma: reserved %lu MiB on node %d\n",
 			size / SZ_1M, nid);