mm/hugetlb: try preferred node first when alloc gigantic page from cma

Message ID 20200830140418.605627-1-lixinhai.lxh@gmail.com (mailing list archive)
State New, archived
Series mm/hugetlb: try preferred node first when alloc gigantic page from cma

Commit Message

Li Xinhai Aug. 30, 2020, 2:04 p.m. UTC
Since commit cf11e85fc08cc6a4 ("mm: hugetlb: optionally allocate gigantic
hugepages using cma"), a gigantic page may be allocated from a node other
than the preferred node, even though pages are available on the preferred
node. The reason is that the nid parameter is ignored in
alloc_gigantic_page().

After this patch, the preferred node is tried first, before the other
allowed nodes.

Fixes: cf11e85fc08cc6a4 ("mm: hugetlb: optionally allocate gigantic hugepages using cma")
Cc: Roman Gushchin <guro@fb.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
---
 mm/hugetlb.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Comments

Mike Kravetz Aug. 31, 2020, 9:44 p.m. UTC | #1
On 8/30/20 7:04 AM, Li Xinhai wrote:
> Since commit cf11e85fc08cc6a4 ("mm: hugetlb: optionally allocate gigantic
> hugepages using cma"), the gigantic page would be allocated from node
> which is not the preferred node, although there are pages available from
> that node. The reason is that the nid parameter has been ignored in
> alloc_gigantic_page().
> 
> After this patch, the preferred node is tried first before other allowed
> nodes.

Thank you!
This is an issue that needs to be fixed.

> Fixes: cf11e85fc08cc6a4 ("mm: hugetlb: optionally allocate gigantic hugepages using cma")
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
> ---
>  mm/hugetlb.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a301c2d672bf..4a28b8853d47 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1256,8 +1256,15 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
>  		struct page *page;
>  		int node;
>  
> +		if (hugetlb_cma[nid]) {
> +			page = cma_alloc(hugetlb_cma[nid], nr_pages,
> +					huge_page_order(h), true);
> +			if (page)
> +				return page;
> +		}
> +

When looking at your changes, I noticed that this code for allocation
from CMA does not take gfp_mask into account.  The 'normal' use case
is to allocate pool pages with something similar to:

echo 16 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

The routine alloc_pool_huge_page will try to interleave pages among nodes:

	...
        gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;

        for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
	...

which will eventually call alloc_gigantic_page.  If __GFP_THISNODE is
set, we really do not want to execute the for loop below in alloc_gigantic_page.

I think the convention in the mm code is that only the lowest level
allocation routines should interpret the GFP flags.  We may need to make
an exception here and check for __GFP_THISNODE.

Michal would be the best person to comment and perhaps make a recommendation.
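
For reference, the caller quoted above looks roughly like this in full (a
condensed sketch of the 5.8-era alloc_pool_huge_page(); error handling and
the node_alloc_noretry bookkeeping are elided, so details may differ from
the upstream source):

static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
{
	struct page *page = NULL;
	int nr_nodes, node;
	/* every attempt is pinned to exactly one node */
	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;

	for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
		/* for gigantic hstates this ends up in alloc_gigantic_page() */
		page = alloc_fresh_huge_page(h, gfp_mask, node,
					     nodes_allowed, NULL);
		if (page)
			break;
	}

	return page ? 1 : 0;
}

Because the caller already pins each attempt with __GFP_THISNODE, any
cross-node fallback inside alloc_gigantic_page() silently defeats the
interleaving done by for_each_node_mask_to_alloc().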
Michal Hocko Sept. 1, 2020, 1:41 p.m. UTC | #2
On Mon 31-08-20 14:44:40, Mike Kravetz wrote:
> On 8/30/20 7:04 AM, Li Xinhai wrote:
> > Since commit cf11e85fc08cc6a4 ("mm: hugetlb: optionally allocate gigantic
> > hugepages using cma"), the gigantic page would be allocated from node
> > which is not the preferred node, although there are pages available from
> > that node. The reason is that the nid parameter has been ignored in
> > alloc_gigantic_page().
> > 
> > After this patch, the preferred node is tried first before other allowed
> > nodes.
> 
> Thank you!
> This is an issue that needs to be fixed.
> 
> > Fixes: cf11e85fc08cc6a4 ("mm: hugetlb: optionally allocate gigantic hugepages using cma")
> > Cc: Roman Gushchin <guro@fb.com>
> > Cc: Mike Kravetz <mike.kravetz@oracle.com>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
> > ---
> >  mm/hugetlb.c | 9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index a301c2d672bf..4a28b8853d47 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1256,8 +1256,15 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
> >  		struct page *page;
> >  		int node;
> >  
> > +		if (hugetlb_cma[nid]) {
> > +			page = cma_alloc(hugetlb_cma[nid], nr_pages,
> > +					huge_page_order(h), true);
> > +			if (page)
> > +				return page;
> > +		}
> > +
> 
> When looking at your changes, I noticed that this code for allocation
> from CMA does not take gfp_mask into account.  The 'normal' use case
> is to allocate pool pages with something similar to:
> 
> echo 16 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> 
> The routine alloc_pool_huge_page will try to interleave pages among nodes:
> 
> 	...
>         gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
> 
>         for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
> 	...
> 
> which will eventually call alloc_gigantic_page.  If __GFP_THISNODE is
> set we really do not want to execute the below for loop in alloc_gigantic_page.

Yes, this is the case indeed.
 
> I think the convention in the mm code is that only the lowest level
> allocation routines should interpret the GFP flags.  We may need to make
> an exception here and check for __GFP_THISNODE.

Yes, this is true, but alloc_gigantic_page is in fact the low-level
allocation routine here.

I would go with the following:
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a301c2d672bf..124754240b56 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1256,6 +1256,16 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 		struct page *page;
 		int node;
 
+		if (nid != NUMA_NO_NODE && hugetlb_cma[nid]) {
+			page = cma_alloc(hugetlb_cma[nid], nr_pages,
+					 huge_page_order(h), true);
+			if (page)
+				return page;
+		}
+
+		if (gfp_mask & __GFP_THISNODE)
+			return NULL;
+
 		for_each_node_mask(node, *nodemask) {
 			if (!hugetlb_cma[node])
 				continue;

I do not think we actually have an explicit NUMA_NO_NODE user, but it
is safer not to assume that here.
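
For reference, the guard matters because NUMA_NO_NODE is not a valid array
index:

/* include/linux/numa.h */
#define NUMA_NO_NODE	(-1)

/*
 * Without the nid != NUMA_NO_NODE check above, a caller passing
 * NUMA_NO_NODE would turn hugetlb_cma[nid] into hugetlb_cma[-1].
 */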
Li Xinhai Sept. 1, 2020, 2:20 p.m. UTC | #3
On 2020-09-01 at 21:41 Michal Hocko wrote:
>On Mon 31-08-20 14:44:40, Mike Kravetz wrote:
>> On 8/30/20 7:04 AM, Li Xinhai wrote:
>> > Since commit cf11e85fc08cc6a4 ("mm: hugetlb: optionally allocate gigantic
>> > hugepages using cma"), the gigantic page would be allocated from node
>> > which is not the preferred node, although there are pages available from
>> > that node. The reason is that the nid parameter has been ignored in
>> > alloc_gigantic_page().
>> >
>> > After this patch, the preferred node is tried first before other allowed
>> > nodes.
>>
>> Thank you!
>> This is an issue that needs to be fixed.
>>
>> > Fixes: cf11e85fc08cc6a4 ("mm: hugetlb: optionally allocate gigantic hugepages using cma")
>> > Cc: Roman Gushchin <guro@fb.com>
>> > Cc: Mike Kravetz <mike.kravetz@oracle.com>
>> > Cc: Michal Hocko <mhocko@kernel.org>
>> > Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
>> > ---
>> >  mm/hugetlb.c | 9 ++++++++-
>> >  1 file changed, 8 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> > index a301c2d672bf..4a28b8853d47 100644
>> > --- a/mm/hugetlb.c
>> > +++ b/mm/hugetlb.c
>> > @@ -1256,8 +1256,15 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
>> >  		struct page *page;
>> >  		int node;
>> > 
>> > +		if (hugetlb_cma[nid]) {
>> > +			page = cma_alloc(hugetlb_cma[nid], nr_pages,
>> > +					huge_page_order(h), true);
>> > +			if (page)
>> > +				return page;
>> > +		}
>> > +
>>
>> When looking at your changes, I noticed that this code for allocation
>> from CMA does not take gfp_mask into account.  The 'normal' use case
>> is to allocate pool pages with something similar to:
>>
>> echo 16 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>
>> The routine alloc_pool_huge_page will try to interleave pages among nodes:
>>
>> ...
>>         gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
>>
>>         for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
>> ...
>>
>> which will eventually call alloc_gigantic_page.  If __GFP_THISNODE is
>> set we really do not want to execute the below for loop in alloc_gigantic_page.
>
>Yes, this is the case indeed.
>
>> I think the convention in the mm code is that only the lowest level
>> allocation routines should interpret the GFP flags.  We may need to make
>> an exception here and check for __GFP_THISNODE.
>
>Yes this is true, But alloc_gigantic_page is actually low level
>allocation routine in fact.
> 
Thanks for the review, we need to consider the __GFP_THISNODE flag.

>I would go with the following
>diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>index a301c2d672bf..124754240b56 100644
>--- a/mm/hugetlb.c
>+++ b/mm/hugetlb.c
>@@ -1256,6 +1256,16 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
> 		struct page *page;
> 		int node;
>
>+		if (nid != NUMA_NO_NODE && hugetlb_cma[nid]) {
>+			page = cma_alloc(hugetlb_cma[nid], nr_pages,
>+					 huge_page_order(h), true);
>+			if (page)
>+				return page;
>+		}
>+
>+		if (gfp_mask & __GFP_THISNODE)
>+			return NULL;
>+
I think that when the allocation on THISNODE fails, it still needs to fall
through to the alloc_contig_pages() call below, so we have one more chance
to allocate successfully on the preferred node.

> 		for_each_node_mask(node, *nodemask) {
> 			if (!hugetlb_cma[node])
> 				continue;
> 
>I do not think we actually do have an explicit NUMA_NO_NODE user but it
>is safer to not asume that here.
>--
>Michal Hocko
>SUSE Labs
Michal Hocko Sept. 1, 2020, 2:53 p.m. UTC | #4
On Tue 01-09-20 22:20:44, Li Xinhai wrote:
> On 2020-09-01 at 21:41 Michal Hocko wrote:
> >On Mon 31-08-20 14:44:40, Mike Kravetz wrote:
> >> On 8/30/20 7:04 AM, Li Xinhai wrote:
> >> > Since commit cf11e85fc08cc6a4 ("mm: hugetlb: optionally allocate gigantic
> >> > hugepages using cma"), the gigantic page would be allocated from node
> >> > which is not the preferred node, although there are pages available from
> >> > that node. The reason is that the nid parameter has been ignored in
> >> > alloc_gigantic_page().
> >> >
> >> > After this patch, the preferred node is tried first before other allowed
> >> > nodes.
> >>
> >> Thank you!
> >> This is an issue that needs to be fixed.
> >>
> >> > Fixes: cf11e85fc08cc6a4 ("mm: hugetlb: optionally allocate gigantic hugepages using cma")
> >> > Cc: Roman Gushchin <guro@fb.com>
> >> > Cc: Mike Kravetz <mike.kravetz@oracle.com>
> >> > Cc: Michal Hocko <mhocko@kernel.org>
> >> > Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
> >> > ---
> >> >  mm/hugetlb.c | 9 ++++++++-
> >> >  1 file changed, 8 insertions(+), 1 deletion(-)
> >> >
> >> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> > index a301c2d672bf..4a28b8853d47 100644
> >> > --- a/mm/hugetlb.c
> >> > +++ b/mm/hugetlb.c
> >> > @@ -1256,8 +1256,15 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
> >> >  		struct page *page;
> >> >  		int node;
> >> > 
> >> > +		if (hugetlb_cma[nid]) {
> >> > +			page = cma_alloc(hugetlb_cma[nid], nr_pages,
> >> > +					huge_page_order(h), true);
> >> > +			if (page)
> >> > +				return page;
> >> > +		}
> >> > +
> >>
> >> When looking at your changes, I noticed that this code for allocation
> >> from CMA does not take gfp_mask into account.  The 'normal' use case
> >> is to allocate pool pages with something similar to:
> >>
> >> echo 16 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> >>
> >> The routine alloc_pool_huge_page will try to interleave pages among nodes:
> >>
> >> ...
> >>         gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
> >>
> >>         for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
> >> ...
> >>
> >> which will eventually call alloc_gigantic_page.  If __GFP_THISNODE is
> >> set we really do not want to execute the below for loop in alloc_gigantic_page.
> >
> >Yes, this is the case indeed.
> >
> >> I think the convention in the mm code is that only the lowest level
> >> allocation routines should interpret the GFP flags.  We may need to make
> >> an exception here and check for __GFP_THISNODE.
> >
> >Yes this is true, But alloc_gigantic_page is actually low level
> >allocation routine in fact.
> > 
> Thanks for the review, we need to consider the __GFP_THISNODE flag.

Yeah, my bad. Quite ugly but a larger rework would be needed to make it
nicer. Not sure this is worth it.

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a301c2d672bf..55baaac848da 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1256,6 +1256,16 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 		struct page *page;
 		int node;
 
+		if (nid != NUMA_NO_NODE && hugetlb_cma[nid]) {
+			page = cma_alloc(hugetlb_cma[nid], nr_pages,
+					 huge_page_order(h), true);
+			if (page)
+				return page;
+		}
+
+		if (gfp_mask & __GFP_THISNODE)
+			goto fallback;
+
 		for_each_node_mask(node, *nodemask) {
 			if (!hugetlb_cma[node])
 				continue;
@@ -1266,6 +1276,7 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 				return page;
 		}
 	}
+fallback:
 #endif
 
 	return alloc_contig_pages(nr_pages, gfp_mask, nid, nodemask);
Li Xinhai Sept. 1, 2020, 2:59 p.m. UTC | #5
On 2020-09-01 at 22:53 Michal Hocko wrote:
>On Tue 01-09-20 22:20:44, Li Xinhai wrote:
>> On 2020-09-01 at 21:41 Michal Hocko wrote:
>> >On Mon 31-08-20 14:44:40, Mike Kravetz wrote:
>> >> On 8/30/20 7:04 AM, Li Xinhai wrote:
>> >> > Since commit cf11e85fc08cc6a4 ("mm: hugetlb: optionally allocate gigantic
>> >> > hugepages using cma"), the gigantic page would be allocated from node
>> >> > which is not the preferred node, although there are pages available from
>> >> > that node. The reason is that the nid parameter has been ignored in
>> >> > alloc_gigantic_page().
>> >> >
>> >> > After this patch, the preferred node is tried first before other allowed
>> >> > nodes.
>> >>
>> >> Thank you!
>> >> This is an issue that needs to be fixed.
>> >>
>> >> > Fixes: cf11e85fc08cc6a4 ("mm: hugetlb: optionally allocate gigantic hugepages using cma")
>> >> > Cc: Roman Gushchin <guro@fb.com>
>> >> > Cc: Mike Kravetz <mike.kravetz@oracle.com>
>> >> > Cc: Michal Hocko <mhocko@kernel.org>
>> >> > Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
>> >> > ---
>> >> >  mm/hugetlb.c | 9 ++++++++-
>> >> >  1 file changed, 8 insertions(+), 1 deletion(-)
>> >> >
>> >> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> >> > index a301c2d672bf..4a28b8853d47 100644
>> >> > --- a/mm/hugetlb.c
>> >> > +++ b/mm/hugetlb.c
>> >> > @@ -1256,8 +1256,15 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
>> >> >  		struct page *page;
>> >> >  		int node;
>> >> > 
>> >> > +		if (hugetlb_cma[nid]) {
>> >> > +			page = cma_alloc(hugetlb_cma[nid], nr_pages,
>> >> > +					huge_page_order(h), true);
>> >> > +			if (page)
>> >> > +				return page;
>> >> > +		}
>> >> > +
>> >>
>> >> When looking at your changes, I noticed that this code for allocation
>> >> from CMA does not take gfp_mask into account.  The 'normal' use case
>> >> is to allocate pool pages with something similar to:
>> >>
>> >> echo 16 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>> >>
>> >> The routine alloc_pool_huge_page will try to interleave pages among nodes:
>> >>
>> >> ...
>> >>         gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
>> >>
>> >>         for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
>> >> ...
>> >>
>> >> which will eventually call alloc_gigantic_page.  If __GFP_THISNODE is
>> >> set we really do not want to execute the below for loop in alloc_gigantic_page.
>> >
>> >Yes, this is the case indeed.
>> >
>> >> I think the convention in the mm code is that only the lowest level
>> >> allocation routines should interpret the GFP flags.  We may need to make
>> >> an exception here and check for __GFP_THISNODE.
>> >
>> >Yes this is true, But alloc_gigantic_page is actually low level
>> >allocation routine in fact.
>> >
>> Thanks for the review, we need to consider the __GFP_THISNODE flag.
>
>Yeah, my bad. Quite ugly but a larger rework would be needed to make it
>nicer. Not sure this is worth it.
> 
Just sent out the V2, and put the for-loop within the THISNODE check...

>diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>index a301c2d672bf..55baaac848da 100644
>--- a/mm/hugetlb.c
>+++ b/mm/hugetlb.c
>@@ -1256,6 +1256,16 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
> 		struct page *page;
> 		int node;
> 
>+		if (nid != NUMA_NO_NODE && hugetlb_cma[nid]) {
>+			page = cma_alloc(hugetlb_cma[nid], nr_pages,
>+					 huge_page_order(h), true);
>+			if (page)
>+				return page;
>+		}
>+
>+		if (gfp_mask & __GFP_THISNODE)
>+			goto fallback;
>+
> 		for_each_node_mask(node, *nodemask) {
> 			if (!hugetlb_cma[node])
> 				continue;
>@@ -1266,6 +1276,7 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
> 				return page;
> 		}
> 	}
>+fallback:
> #endif
> 
> 	return alloc_contig_pages(nr_pages, gfp_mask, nid, nodemask);
>--
>Michal Hocko
>SUSE Labs
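
Putting the review together (try the preferred node's CMA area first, do not
wander off to other nodes under __GFP_THISNODE, and still fall through to the
contiguous allocator), the CMA portion of alloc_gigantic_page() would look
roughly as below. This is a sketch reconstructed from the discussion above,
not the v2 patch itself, which may differ in detail:

static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
					int nid, nodemask_t *nodemask)
{
	unsigned long nr_pages = 1UL << huge_page_order(h);

#ifdef CONFIG_CMA
	{
		struct page *page;
		int node;

		/* try the preferred node first */
		if (nid != NUMA_NO_NODE && hugetlb_cma[nid]) {
			page = cma_alloc(hugetlb_cma[nid], nr_pages,
					 huge_page_order(h), true);
			if (page)
				return page;
		}

		/* spread to other nodes only when the caller allows it */
		if (!(gfp_mask & __GFP_THISNODE)) {
			for_each_node_mask(node, *nodemask) {
				if (node == nid || !hugetlb_cma[node])
					continue;

				page = cma_alloc(hugetlb_cma[node], nr_pages,
						 huge_page_order(h), true);
				if (page)
					return page;
			}
		}
	}
#endif

	/*
	 * CMA failed or is not configured: fall back to the contiguous
	 * allocator, still honouring nid and nodemask.
	 */
	return alloc_contig_pages(nr_pages, gfp_mask, nid, nodemask);
}

Keeping the __GFP_THISNODE handling inside alloc_gigantic_page() rather than
in its callers follows Michal's point that this already is the low-level
allocation routine for gigantic pages.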

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a301c2d672bf..4a28b8853d47 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1256,8 +1256,15 @@  static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 		struct page *page;
 		int node;
 
+		if (hugetlb_cma[nid]) {
+			page = cma_alloc(hugetlb_cma[nid], nr_pages,
+					huge_page_order(h), true);
+			if (page)
+				return page;
+		}
+
 		for_each_node_mask(node, *nodemask) {
-			if (!hugetlb_cma[node])
+			if (node == nid || !hugetlb_cma[node])
 				continue;
 
 			page = cma_alloc(hugetlb_cma[node], nr_pages,