
mm: hugetlb: fix hugetlb_cma_reserve() if CONFIG_NUMA isn't set

Message ID 20200318153424.3202304-1-guro@fb.com (mailing list archive)
State New, archived
Series mm: hugetlb: fix hugetlb_cma_reserve() if CONFIG_NUMA isn't set

Commit Message

Roman Gushchin March 18, 2020, 3:34 p.m. UTC
If CONFIG_NUMA isn't set, there is no need to ensure that
the hugetlb cma area belongs to a specific numa node.

min/max_low_pfn can be used for limiting the maximum size
of the hugetlb_cma area.

Also for_each_mem_pfn_range() is defined only if
CONFIG_HAVE_MEMBLOCK_NODE_MAP is set, and on arm (unlike most
other architectures) it depends on CONFIG_NUMA. This makes the
build fail if CONFIG_NUMA isn't set.

Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Andreas Schaufler <andreas.schaufler@gmx.de>
---
 mm/hugetlb.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

Comments

Michal Hocko March 18, 2020, 4:16 p.m. UTC | #1
On Wed 18-03-20 08:34:24, Roman Gushchin wrote:
> If CONFIG_NUMA isn't set, there is no need to ensure that
> the hugetlb cma area belongs to a specific numa node.
> 
> min/max_low_pfn can be used for limiting the maximum size
> of the hugetlb_cma area.
> 
> Also for_each_mem_pfn_range() is defined only if
> CONFIG_HAVE_MEMBLOCK_NODE_MAP is set, and on arm (unlike most
> other architectures) it depends on CONFIG_NUMA. This makes the
> build fail if CONFIG_NUMA isn't set.

CONFIG_HAVE_MEMBLOCK_NODE_MAP has popped out as a problem several times
already. Is there any real reason we cannot make it unconditional?
Essentially make the functionality always enabled and drop the config?
The code below is ugly as hell. Just look at it. You have
for_each_node_state() without any ifdefery but then an #ifdef
CONFIG_NUMA inside the loop. That just doesn't make any sense.

> Signed-off-by: Roman Gushchin <guro@fb.com>
> Reported-by: Andreas Schaufler <andreas.schaufler@gmx.de>
> ---
>  mm/hugetlb.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 7a20cae7c77a..a6161239abde 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5439,16 +5439,21 @@ void __init hugetlb_cma_reserve(int order)
>  
>  	reserved = 0;
>  	for_each_node_state(nid, N_ONLINE) {
> -		unsigned long start_pfn, end_pfn;
>  		unsigned long min_pfn = 0, max_pfn = 0;
> -		int res, i;
> +		int res;
> +#ifdef CONFIG_NUMA
> +		unsigned long start_pfn, end_pfn;
> +		int i;
>  
>  		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
>  			if (!min_pfn)
>  				min_pfn = start_pfn;
>  			max_pfn = end_pfn;
>  		}
> -
> +#else
> +		min_pfn = min_low_pfn;
> +		max_pfn = max_low_pfn;
> +#endif
>  		size = max(per_node, hugetlb_cma_size - reserved);
>  		size = round_up(size, PAGE_SIZE << order);
>  
> -- 
> 2.24.1
Roman Gushchin March 18, 2020, 5:55 p.m. UTC | #2
On Wed, Mar 18, 2020 at 05:16:25PM +0100, Michal Hocko wrote:
> On Wed 18-03-20 08:34:24, Roman Gushchin wrote:
> > If CONFIG_NUMA isn't set, there is no need to ensure that
> > the hugetlb cma area belongs to a specific numa node.
> > 
> > min/max_low_pfn can be used for limiting the maximum size
> > of the hugetlb_cma area.
> > 
> > Also for_each_mem_pfn_range() is defined only if
> > CONFIG_HAVE_MEMBLOCK_NODE_MAP is set, and on arm (unlike most
> > other architectures) it depends on CONFIG_NUMA. This makes the
> > build fail if CONFIG_NUMA isn't set.
> 
> CONFIG_HAVE_MEMBLOCK_NODE_MAP has popped out as a problem several times
> already. Is there any real reason we cannot make it unconditional?
> Essentially make the functionality always enabled and drop the config?

It depends on CONFIG_NUMA only on arm, and I really don't know
if there is a good justification for it. If not, that will be a much
simpler fix.

> The code below is ugly as hell. Just look at it. You have
> for_each_node_state() without any ifdefery but then an #ifdef
> CONFIG_NUMA inside the loop. That just doesn't make any sense.

I wouldn't say it makes no sense:
it tries to reserve a cma area on each node (hence the for_each_node_state()),
and it uses for_each_mem_pfn_range() to get the min and max pfn
for each node. With !CONFIG_NUMA the first part is reduced to a single
iteration and the second part is not required at all.

I agree that for_each_mem_pfn_range() here looks quite ugly, but I don't know
of a better way to get the min/max pfns for a node so early in the boot process.
If somebody has any ideas here, I'd appreciate it a lot.
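
Purely as an illustration of how the #ifdef could at least be kept out
of the loop body (an untested sketch, not part of the posted patch), the
per-node range lookup could be hidden behind a small helper:

#ifdef CONFIG_NUMA
static void __init hugetlb_cma_node_pfn_range(int nid,
					      unsigned long *min_pfn,
					      unsigned long *max_pfn)
{
	unsigned long start_pfn, end_pfn;
	int i;

	/* Walk this node's memblock ranges and remember the first and last pfn. */
	*min_pfn = 0;
	*max_pfn = 0;
	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
		if (!*min_pfn)
			*min_pfn = start_pfn;
		*max_pfn = end_pfn;
	}
}
#else
static void __init hugetlb_cma_node_pfn_range(int nid,
					      unsigned long *min_pfn,
					      unsigned long *max_pfn)
{
	/* Single node: the whole low memory range belongs to it. */
	*min_pfn = min_low_pfn;
	*max_pfn = max_low_pfn;
}
#endif

That would keep hugetlb_cma_reserve() itself free of ifdefs, but of course
it doesn't change the underlying dependency on CONFIG_HAVE_MEMBLOCK_NODE_MAP.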

I know Rik plans some further improvements here, so the goal for now
is to fix the build. If you think that enabling CONFIG_HAVE_MEMBLOCK_NODE_MAP
unconditionally is a way to go, I'm fine with it too.

Rik also posted a different fix for the build problem, but from what I've
seen it didn't fix it completely. I'm fine with either option here.

Thanks!

> 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Reported-by: Andreas Schaufler <andreas.schaufler@gmx.de>
> > ---
> >  mm/hugetlb.c | 11 ++++++++---
> >  1 file changed, 8 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 7a20cae7c77a..a6161239abde 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -5439,16 +5439,21 @@ void __init hugetlb_cma_reserve(int order)
> >  
> >  	reserved = 0;
> >  	for_each_node_state(nid, N_ONLINE) {
> > -		unsigned long start_pfn, end_pfn;
> >  		unsigned long min_pfn = 0, max_pfn = 0;
> > -		int res, i;
> > +		int res;
> > +#ifdef CONFIG_NUMA
> > +		unsigned long start_pfn, end_pfn;
> > +		int i;
> >  
> >  		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> >  			if (!min_pfn)
> >  				min_pfn = start_pfn;
> >  			max_pfn = end_pfn;
> >  		}
> > -
> > +#else
> > +		min_pfn = min_low_pfn;
> > +		max_pfn = max_low_pfn;
> > +#endif
> >  		size = max(per_node, hugetlb_cma_size - reserved);
> >  		size = round_up(size, PAGE_SIZE << order);
> >  
> > -- 
> > 2.24.1
> 
> -- 
> Michal Hocko
> SUSE Labs
>
Michal Hocko March 19, 2020, 4:16 p.m. UTC | #3
On Wed 18-03-20 10:55:29, Roman Gushchin wrote:
> On Wed, Mar 18, 2020 at 05:16:25PM +0100, Michal Hocko wrote:
> > On Wed 18-03-20 08:34:24, Roman Gushchin wrote:
> > > If CONFIG_NUMA isn't set, there is no need to ensure that
> > > the hugetlb cma area belongs to a specific numa node.
> > > 
> > > min/max_low_pfn can be used for limiting the maximum size
> > > of the hugetlb_cma area.
> > > 
> > > Also for_each_mem_pfn_range() is defined only if
> > > CONFIG_HAVE_MEMBLOCK_NODE_MAP is set, and on arm (unlike most
> > > other architectures) it depends on CONFIG_NUMA. This makes the
> > > build fail if CONFIG_NUMA isn't set.
> > 
> > CONFIG_HAVE_MEMBLOCK_NODE_MAP has popped out as a problem several times
> > already. Is there any real reason we cannot make it unconditional?
> > Essentially make the functionality always enabled and drop the config?
> 
> It depends on CONFIG_NUMA only on arm, and I really don't know
> if there is a good justification for it. If not, that will be a much
> simpler fix.

I have checked the history and the dependency is there since NUMA was
introduced in arm64. So it would be great to double check with arch
maintainers.

> > The code below is ugly as hell. Just look at it. You have
> > for_each_node_state() without any ifdefery but then an #ifdef
> > CONFIG_NUMA inside the loop. That just doesn't make any sense.
> 
> I wouldn't say it makes no sense:
> it tries to reserve a cma area on each node (hence the for_each_node_state()),
> and it uses for_each_mem_pfn_range() to get the min and max pfn
> for each node. With !CONFIG_NUMA the first part is reduced to a single
> iteration and the second part is not required at all.

Sure, the resulting code logic makes sense. I meant that it doesn't make
much sense wrt readability: there is a loop over all existing NUMA nodes,
and then an #ifdef for NUMA inside that loop. See?

> I agree that for_each_mem_pfn_range() here looks quite ugly, but I don't know
> of a better way to get the min/max pfns for a node so early in the boot process.
> If somebody has any ideas here, I'd appreciate it a lot.

The loop is ok. Maybe we have another memblock API that would be better,
but I am not aware of one off the top of my head. I would stick with
it. It just sucks that this API depends on HAVE_MEMBLOCK_NODE_MAP and
that it is not generally available. This is what I am complaining about.
Just look at the kind of dirty hack it made you create ;)

> I know Rik plans some further improvements here, so the goal for now
> is to fix the build. If you think that enabling CONFIG_HAVE_MEMBLOCK_NODE_MAP
> unconditionally is a way to go, I'm fine with it too.

This is not the first time HAVE_MEMBLOCK_NODE_MAP has been problematic.
I might be missing something but I really do not get why we still
need it these days. As for !NUMA, I suspect we can make it generate the
right thing in that case.
Rik van Riel March 19, 2020, 4:56 p.m. UTC | #4
On Thu, 2020-03-19 at 17:16 +0100, Michal Hocko wrote:
> 
> This is not the first time HAVE_MEMBLOCK_NODE_MAP has been problematic.
> I might be missing something but I really do not get why we still
> need it these days. As for !NUMA, I suspect we can make it generate the
> right thing in that case.

We're working on a different fix now.

It looks like cma_declare_contiguous calls memblock_phys_alloc_range,
which calls memblock_alloc_range_nid, which takes a NUMA node as one
of its arguments.

Aslan is looking at simply adding a cma_declare_contiguous_nid, which
also takes a NUMA node ID as an argument. At that point we can simply
leave CMA free to allocate from anywhere in each NUMA node, which by
default already happens from the top down.
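
Just to sketch the idea (the exact interface is up to Aslan's patch, so
take the signature and the hugetlb_cma[] usage below as assumptions
rather than the final code), it would mirror cma_declare_contiguous()
with an extra nid argument:

int __init cma_declare_contiguous_nid(phys_addr_t base, phys_addr_t size,
				      phys_addr_t limit, phys_addr_t alignment,
				      unsigned int order_per_bit, bool fixed,
				      const char *name, struct cma **res_cma,
				      int nid);

hugetlb_cma_reserve() could then drop the per-node pfn range lookup
entirely and just ask for "size bytes somewhere on this node", e.g.:

	res = cma_declare_contiguous_nid(0, size, 0, PAGE_SIZE << order,
					 0, false, "hugetlb",
					 &hugetlb_cma[nid], nid);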

That should be the nicer long term fix to this issue.
Michal Hocko March 19, 2020, 5:01 p.m. UTC | #5
On Thu 19-03-20 12:56:34, Rik van Riel wrote:
> On Thu, 2020-03-19 at 17:16 +0100, Michal Hocko wrote:
> > 
> > This is not the first time HAVE_MEMBLOCK_NODE_MAP has been problematic.
> > I might be missing something but I really do not get why we still
> > need it these days. As for !NUMA, I suspect we can make it generate the
> > right thing in that case.
> 
> We're working on a different fix now.
> 
> It looks like cma_declare_contiguous calls memblock_phys_alloc_range,
> which calls memblock_alloc_range_nid, which takes a NUMA node as one
> of its arguments.
> 
> Aslan is looking at simply adding a cma_declare_contiguous_nid, which
> also takes a NUMA node ID as an argument. At that point we can simply
> leave CMA free to allocate from anywhere in each NUMA node, which by
> default already happens from the top down.
> 
> That should be the nicer long term fix to this issue.

Yes, that sounds like a better solution. Not that I wouldn't be even
happier to see HAVE_MEMBLOCK_NODE_MAP off the table as well ;)
Oh well, it will have to remain on my todo list a while longer.

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7a20cae7c77a..a6161239abde 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5439,16 +5439,21 @@ void __init hugetlb_cma_reserve(int order)
 
 	reserved = 0;
 	for_each_node_state(nid, N_ONLINE) {
-		unsigned long start_pfn, end_pfn;
 		unsigned long min_pfn = 0, max_pfn = 0;
-		int res, i;
+		int res;
+#ifdef CONFIG_NUMA
+		unsigned long start_pfn, end_pfn;
+		int i;
 
 		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
 			if (!min_pfn)
 				min_pfn = start_pfn;
 			max_pfn = end_pfn;
 		}
-
+#else
+		min_pfn = min_low_pfn;
+		max_pfn = max_low_pfn;
+#endif
 		size = max(per_node, hugetlb_cma_size - reserved);
 		size = round_up(size, PAGE_SIZE << order);