[v3] mm/hugetlb: split hugetlb_cma in nodes with memory

Message ID 20200710120950.37716-1-song.bao.hua@hisilicon.com (mailing list archive)
State New, archived
Series [v3] mm/hugetlb: split hugetlb_cma in nodes with memory

Commit Message

Song Bao Hua (Barry Song) July 10, 2020, 12:09 p.m. UTC
Online nodes are not necessarily nodes that contain memory. Splitting
hugetlb_cma across online nodes can lead to a total hugetlb_cma size
that is inconsistent with the user's setting. For example, on a system
with 4 NUMA nodes where only one node has memory, if the user sets
hugetlb_cma to 4GB, it is split into four 1GB shares, so only the node
with memory gets a 1GB CMA area and the other three nodes get nothing.
The whole system therefore ends up with only 1GB of CMA while the user
asked for 4GB.

Thus, it is more sensible to split hugetlb_cma across nodes with memory.
In the above case, the only node with memory then reserves a 4GB CMA
area, matching the user's setting in the boot arguments. To split the
CMA area across nodes with memory, hugetlb_cma_reserve() should scan
nodes with the N_MEMORY state rather than the N_ONLINE state, which
means the function must be called only after the arch code has finished
setting the N_MEMORY state of the nodes.

The problem exists whenever N_ONLINE != N_MEMORY; it is common to all
platforms, with only minor differences between architectures. On ARM64,
all nodes already have the N_ONLINE state before hugetlb_cma_reserve()
is called, so hugetlb gets an inconsistent CMA size whenever some online
nodes have no memory. On x86 the problem is hidden, because x86 happens
to have set N_ONLINE only on the nodes with memory by the time
hugetlb_cma_reserve() is called.

This patch makes hugetlb_cma_reserve() scan N_MEMORY nodes and has both
x86 and ARM64 call the function after the N_MEMORY state is ready. It
also documents this requirement at the definition of
hugetlb_cma_reserve().
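
In code terms, the core of the change is switching the nodemask used to
split hugetlb_cma_size. Roughly (a simplified sketch of
hugetlb_cma_reserve() after this patch; alignment handling and error
reporting are trimmed, see the full diff below for the exact change):

void __init hugetlb_cma_reserve(int order)
{
	unsigned long size, reserved, per_node;
	int nid;

	if (!hugetlb_cma_size)
		return;

	/*
	 * Split the request over nodes that actually have memory, e.g.
	 * hugetlb_cma=4G on a 4-node machine with a single memory node
	 * now gives per_node = 4G instead of 1G.
	 */
	per_node = DIV_ROUND_UP(hugetlb_cma_size, num_node_state(N_MEMORY));

	reserved = 0;
	for_each_node_state(nid, N_MEMORY) {	/* was N_ONLINE */
		size = min(per_node, hugetlb_cma_size - reserved);
		size = round_up(size, PAGE_SIZE << order);

		if (cma_declare_contiguous_nid(0, size, 0, PAGE_SIZE << order,
					       0, false, "hugetlb",
					       &hugetlb_cma[nid], nid))
			continue;

		reserved += size;
		if (reserved >= hugetlb_cma_size)
			break;
	}
}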

Cc: Roman Gushchin <guro@fb.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 v3:
 another attempt to refine the changelog based on Anshuman Khandual's comments

 arch/arm64/mm/init.c    | 19 ++++++++++---------
 arch/x86/kernel/setup.c | 12 +++++++++---
 mm/hugetlb.c            | 11 +++++++++--
 3 files changed, 28 insertions(+), 14 deletions(-)

Comments

Mike Kravetz July 14, 2020, 11:21 p.m. UTC | #1
On 7/10/20 5:09 AM, Barry Song wrote:
> Online nodes are not necessarily memory containing nodes. Splitting
> huge_cma in online nodes can lead to inconsistent hugetlb_cma size
> with user setting. For example, for one system with 4 numa nodes and
> only one of them has memory, if users set hugetlb_cma to 4GB, it will
> split into four 1GB. So only the node with memory will get 1GB CMA.
> All other three nodes get nothing. That means the whole system gets
> only 1GB CMA while users ask for 4GB.
> 
> Thus, it is more sensible to split hugetlb_cma in nodes with memory.
> For the above case, the only node with memory will reserve 4GB cma
> which is same with user setting in bootargs. In order to split cma
> in nodes with memory, hugetlb_cma_reserve() should scan over those
> nodes with N_MEMORY state rather than N_ONLINE state. That means
> the function should be called only after arch code has finished
> setting the N_MEMORY state of nodes.
> 
> The problem is always there if N_ONLINE != N_MEMORY. It is a general
> problem to all platforms. But there is some trivial difference among
> different architectures.
> For example, for ARM64, before hugetlb_cma_reserve() is called, all
> nodes have got N_ONLINE state. So hugetlb will get inconsistent cma
> size when some online nodes have no memory. For x86 case, the problem
> is hidden because X86 happens to just set N_ONLINE on the nodes with
> memory when hugetlb_cma_reserve() is called.
> 
> Anyway, this patch moves to scan N_MEMORY in hugetlb_cma_reserve()
> and lets both x86 and ARM64 call the function after N_MEMORY state
> is ready. It also documents the requirement in the definition of
> hugetlb_cma_reserve().
> 
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>

I agree we should only be concerned with N_MEMORY nodes for the CMA
reservations.  However, this patch got me thinking:
- Do we really have to initiate the CMA reservations from arch specific code?
- Can we move the call to reserve CMA a little later into hugetlb arch
  independent code?

I know the cma_declare_contiguous_nid() routine says it should be called
from arch specific code.  However, unless I am missing something that seems
mostly about timing.

What about a change like this on top of this patch?

From 72b5b9a623f8711ad7f79f1a8f910906245f5d07 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Tue, 14 Jul 2020 15:54:46 -0700
Subject: [PATCH] hugetlb: move cma allocation call to arch independent code

Instead of calling hugetlb_cma_reserve() from arch specific code,
call from arch independent code when a gigantic page hstate is
created.  This is late enough in the init process that all numa
memory information should be initialized.  And, it is early enough
to still use early memory allocator.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 arch/arm64/mm/init.c    | 10 ----------
 arch/x86/kernel/setup.c |  9 ---------
 mm/hugetlb.c            |  8 +++++++-
 3 files changed, 7 insertions(+), 20 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 79806732f4b4..ff0ff584dde9 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -427,16 +427,6 @@ void __init bootmem_init(void)
 	sparse_init();
 	zone_sizes_init(min, max);
 
-	/*
-	 * must be done after zone_sizes_init() which calls free_area_init()
-	 * that calls node_set_state() to initialize node_states[N_MEMORY]
-	 * because hugetlb_cma_reserve() will scan over nodes with N_MEMORY
-	 * state
-	 */
-#ifdef CONFIG_ARM64_4K_PAGES
-	hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
-#endif
-
 	memblock_dump_all();
 }
 
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index a1a9712090ae..111c8467fafa 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1177,15 +1177,6 @@ void __init setup_arch(char **cmdline_p)
 
 	x86_init.paging.pagetable_init();
 
-	/*
-	 * must be done after zone_sizes_init() which calls free_area_init()
-	 * that calls node_set_state() to initialize node_states[N_MEMORY]
-	 * because hugetlb_cma_reserve() will scan over nodes with N_MEMORY
-	 * state
-	 */
-	if (boot_cpu_has(X86_FEATURE_GBPAGES))
-		hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
-
 	kasan_init();
 
 	/*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f24acb3af741..a0007d1d12d2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3273,6 +3273,9 @@ void __init hugetlb_add_hstate(unsigned int order)
 	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
 					huge_page_size(h)/1024);
 
+	if (order >= MAX_ORDER && hugetlb_cma_size)
+		hugetlb_cma_reserve(order);
+
 	parsed_hstate = h;
 }
 
@@ -5647,7 +5650,10 @@ void __init hugetlb_cma_reserve(int order)
 	unsigned long size, reserved, per_node;
 	int nid;
 
-	cma_reserve_called = true;
+	if (cma_reserve_called)
+		return;
+	else
+		cma_reserve_called = true;
 
 	if (!hugetlb_cma_size)
 		return;
Will Deacon July 15, 2020, 8:18 a.m. UTC | #2
Hi Mike,

On Tue, Jul 14, 2020 at 04:21:01PM -0700, Mike Kravetz wrote:
> I agree we should only be concerned with N_MEMORY nodes for the CMA
> reservations.  However, this patch got me thinking:
> - Do we really have to initiate the CMA reservations from arch specific code?
> - Can we move the call to reserve CMA a little later into hugetlb arch
>   independent code?
> 
> I know the cma_declare_contiguous_nid() routine says it should be called
> from arch specific code.  However, unless I am missing something that seems
> mostly about timing.
> 
> What about a change like this on top of this patch?
> 
> From 72b5b9a623f8711ad7f79f1a8f910906245f5d07 Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <mike.kravetz@oracle.com>
> Date: Tue, 14 Jul 2020 15:54:46 -0700
> Subject: [PATCH] hugetlb: move cma allocation call to arch independent code
> 
> Instead of calling hugetlb_cma_reserve() from arch specific code,
> call from arch independent code when a gigantic page hstate is
> created.  This is late enough in the init process that all numa
> memory information should be initialized.  And, it is early enough
> to still use early memory allocator.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  arch/arm64/mm/init.c    | 10 ----------
>  arch/x86/kernel/setup.c |  9 ---------
>  mm/hugetlb.c            |  8 +++++++-
>  3 files changed, 7 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 79806732f4b4..ff0ff584dde9 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -427,16 +427,6 @@ void __init bootmem_init(void)
>  	sparse_init();
>  	zone_sizes_init(min, max);
>  
> -	/*
> -	 * must be done after zone_sizes_init() which calls free_area_init()
> -	 * that calls node_set_state() to initialize node_states[N_MEMORY]
> -	 * because hugetlb_cma_reserve() will scan over nodes with N_MEMORY
> -	 * state
> -	 */
> -#ifdef CONFIG_ARM64_4K_PAGES
> -	hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
> -#endif
> -
>  	memblock_dump_all();
>  }
>  
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index a1a9712090ae..111c8467fafa 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1177,15 +1177,6 @@ void __init setup_arch(char **cmdline_p)
>  
>  	x86_init.paging.pagetable_init();
>  
> -	/*
> -	 * must be done after zone_sizes_init() which calls free_area_init()
> -	 * that calls node_set_state() to initialize node_states[N_MEMORY]
> -	 * because hugetlb_cma_reserve() will scan over nodes with N_MEMORY
> -	 * state
> -	 */
> -	if (boot_cpu_has(X86_FEATURE_GBPAGES))
> -		hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
> -
>  	kasan_init();
>  
>  	/*
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f24acb3af741..a0007d1d12d2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3273,6 +3273,9 @@ void __init hugetlb_add_hstate(unsigned int order)
>  	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
>  					huge_page_size(h)/1024);

(nit: you can also make hugetlb_cma_reserve() static and remove its function
prototype from hugetlb.h)

> +	if (order >= MAX_ORDER && hugetlb_cma_size)
> +		hugetlb_cma_reserve(order);

Although I really like the idea of moving this out of the arch code, I don't
quite follow the check against MAX_ORDER here -- it looks like a bit of a
hack to try to intercept the "PUD_SHIFT - PAGE_SHIFT" order which we
currently pass to hugetlb_cma_reserve(). Maybe we could instead have
something like:

	#ifndef HUGETLB_CMA_ORDER
	#define HUGETLB_CMA_ORDER	(PUD_SHIFT - PAGE_SHIFT)
	#endif

and then just do:

	if (order == HUGETLB_CMA_ORDER)
		hugetlb_cma_reserve(order);

? Is there something else I'm missing?

> +
>  	parsed_hstate = h;
>  }
>  
> @@ -5647,7 +5650,10 @@ void __init hugetlb_cma_reserve(int order)
>  	unsigned long size, reserved, per_node;
>  	int nid;
>  
> -	cma_reserve_called = true;
> +	if (cma_reserve_called)
> +		return;
> +	else
> +		cma_reserve_called = true;

(nit: don't need the 'else' here)

Will
Song Bao Hua (Barry Song) July 15, 2020, 11:14 a.m. UTC | #3
> -----Original Message-----
> From: Mike Kravetz [mailto:mike.kravetz@oracle.com]
> Sent: Wednesday, July 15, 2020 11:21 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>;
> akpm@linux-foundation.org
> Cc: x86@kernel.org; linux-mm@kvack.org; linux-kernel@vger.kernel.org;
> Linuxarm <linuxarm@huawei.com>; linux-arm-kernel@lists.infradead.org;
> Roman Gushchin <guro@fb.com>; Catalin Marinas
> <catalin.marinas@arm.com>; Will Deacon <will@kernel.org>; Thomas Gleixner
> <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>; Borislav Petkov
> <bp@alien8.de>; H.Peter Anvin <hpa@zytor.com>; Mike Rapoport
> <rppt@linux.ibm.com>; Anshuman Khandual
> <anshuman.khandual@arm.com>; Jonathan Cameron
> <jonathan.cameron@huawei.com>
> Subject: Re: [PATCH v3] mm/hugetlb: split hugetlb_cma in nodes with memory
> 
> On 7/10/20 5:09 AM, Barry Song wrote:
> > Online nodes are not necessarily memory containing nodes. Splitting
> > huge_cma in online nodes can lead to inconsistent hugetlb_cma size
> > with user setting. For example, for one system with 4 numa nodes and
> > only one of them has memory, if users set hugetlb_cma to 4GB, it will
> > split into four 1GB. So only the node with memory will get 1GB CMA.
> > All other three nodes get nothing. That means the whole system gets
> > only 1GB CMA while users ask for 4GB.
> >
> > Thus, it is more sensible to split hugetlb_cma in nodes with memory.
> > For the above case, the only node with memory will reserve 4GB cma
> > which is same with user setting in bootargs. In order to split cma
> > in nodes with memory, hugetlb_cma_reserve() should scan over those
> > nodes with N_MEMORY state rather than N_ONLINE state. That means
> > the function should be called only after arch code has finished
> > setting the N_MEMORY state of nodes.
> >
> > The problem is always there if N_ONLINE != N_MEMORY. It is a general
> > problem to all platforms. But there is some trivial difference among
> > different architectures.
> > For example, for ARM64, before hugetlb_cma_reserve() is called, all
> > nodes have got N_ONLINE state. So hugetlb will get inconsistent cma
> > size when some online nodes have no memory. For x86 case, the problem
> > is hidden because X86 happens to just set N_ONLINE on the nodes with
> > memory when hugetlb_cma_reserve() is called.
> >
> > Anyway, this patch moves to scan N_MEMORY in hugetlb_cma_reserve()
> > and lets both x86 and ARM64 call the function after N_MEMORY state
> > is ready. It also documents the requirement in the definition of
> > hugetlb_cma_reserve().
> >
> > Cc: Roman Gushchin <guro@fb.com>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Borislav Petkov <bp@alien8.de>
> > Cc: H. Peter Anvin <hpa@zytor.com>
> > Cc: Mike Kravetz <mike.kravetz@oracle.com>
> > Cc: Mike Rapoport <rppt@linux.ibm.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> > Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
> > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> 
> I agree we should only be concerned with N_MEMORY nodes for the CMA
> reservations.  However, this patch got me thinking:
> - Do we really have to initiate the CMA reservations from arch specific code?
> - Can we move the call to reserve CMA a little later into hugetlb arch
>   independent code?
> 
> I know the cma_declare_contiguous_nid() routine says it should be called
> from arch specific code.  However, unless I am missing something that seems
> mostly about timing.
> 
> What about a change like this on top of this patch?
> 
> From 72b5b9a623f8711ad7f79f1a8f910906245f5d07 Mon Sep 17 00:00:00
> 2001
> From: Mike Kravetz <mike.kravetz@oracle.com>
> Date: Tue, 14 Jul 2020 15:54:46 -0700
> Subject: [PATCH] hugetlb: move cma allocation call to arch independent code
> 
> Instead of calling hugetlb_cma_reserve() from arch specific code,
> call from arch independent code when a gigantic page hstate is
> created.  This is late enough in the init process that all numa
> memory information should be initialized.  And, it is early enough
> to still use early memory allocator.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  arch/arm64/mm/init.c    | 10 ----------
>  arch/x86/kernel/setup.c |  9 ---------
>  mm/hugetlb.c            |  8 +++++++-
>  3 files changed, 7 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 79806732f4b4..ff0ff584dde9 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -427,16 +427,6 @@ void __init bootmem_init(void)
>  	sparse_init();
>  	zone_sizes_init(min, max);
> 
> -	/*
> -	 * must be done after zone_sizes_init() which calls free_area_init()
> -	 * that calls node_set_state() to initialize node_states[N_MEMORY]
> -	 * because hugetlb_cma_reserve() will scan over nodes with N_MEMORY
> -	 * state
> -	 */
> -#ifdef CONFIG_ARM64_4K_PAGES
> -	hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
> -#endif
> -
>  	memblock_dump_all();
>  }
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index a1a9712090ae..111c8467fafa 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1177,15 +1177,6 @@ void __init setup_arch(char **cmdline_p)
> 
>  	x86_init.paging.pagetable_init();
> 
> -	/*
> -	 * must be done after zone_sizes_init() which calls free_area_init()
> -	 * that calls node_set_state() to initialize node_states[N_MEMORY]
> -	 * because hugetlb_cma_reserve() will scan over nodes with N_MEMORY
> -	 * state
> -	 */
> -	if (boot_cpu_has(X86_FEATURE_GBPAGES))
> -		hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
> -
>  	kasan_init();
> 
>  	/*
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f24acb3af741..a0007d1d12d2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3273,6 +3273,9 @@ void __init hugetlb_add_hstate(unsigned int order)
>  	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
>  					huge_page_size(h)/1024);
> 
> +	if (order >= MAX_ORDER && hugetlb_cma_size)
> +		hugetlb_cma_reserve(order);

Hello, Mike,
I am not sure if it is necessarily correct as the order for gigantic pages is arch-dependent:
https://lkml.org/lkml/2020/7/1/14

> +
>  	parsed_hstate = h;
>  }
> 
> @@ -5647,7 +5650,10 @@ void __init hugetlb_cma_reserve(int order)
>  	unsigned long size, reserved, per_node;
>  	int nid;
> 
> -	cma_reserve_called = true;
> +	if (cma_reserve_called)
> +		return;
> +	else
> +		cma_reserve_called = true;
> 
>  	if (!hugetlb_cma_size)
>  		return;

Thanks
Barry
Mike Kravetz July 15, 2020, 4:59 p.m. UTC | #4
On 7/15/20 1:18 AM, Will Deacon wrote:
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index f24acb3af741..a0007d1d12d2 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3273,6 +3273,9 @@ void __init hugetlb_add_hstate(unsigned int order)
>>  	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
>>  					huge_page_size(h)/1024);
> 
> (nit: you can also make hugetlb_cma_reserve() static and remote its function
> prototypes from hugetlb.h)

Yes thanks.  I threw this together pretty quickly.

> 
>> +	if (order >= MAX_ORDER && hugetlb_cma_size)
>> +		hugetlb_cma_reserve(order);
> 
> Although I really like the idea of moving this out of the arch code, I don't
> quite follow the check against MAX_ORDER here -- it looks like a bit of a
> hack to try to intercept the "PUD_SHIFT - PAGE_SHIFT" order which we
> currently pass to hugetlb_cma_reserve(). Maybe we could instead have
> something like:
> 
> 	#ifndef HUGETLB_CMA_ORDER
> 	#define HUGETLB_CMA_ORDER	(PUD_SHIFT - PAGE_SHIFT)
> 	#endif
> 
> and then just do:
> 
> 	if (order == HUGETLB_CMA_ORDER)
> 		hugetlb_cma_reserve(order);
> 
> ? Is there something else I'm missing?
> 

Well, the current hugetlb CMA code only kicks in for gigantic pages as
defined by the hugetlb code. For example, the code to allocate a page
from CMA is in the routine alloc_gigantic_page().  alloc_gigantic_page()
is called from alloc_fresh_huge_page() which starts with:

        if (hstate_is_gigantic(h))
                page = alloc_gigantic_page(h, gfp_mask, nid, nmask);
        else
                page = alloc_buddy_huge_page(h, gfp_mask,
                                nid, nmask, node_alloc_noretry);

and, hstate_is_gigantic is,

static inline bool hstate_is_gigantic(struct hstate *h)
{
        return huge_page_order(h) >= MAX_ORDER;
}

So, everything in the existing code really depends on the hugetlb definition
of gigantic page (order >= MAX_ORDER).  The code to check for
'order >= MAX_ORDER' in my proposed patch is just following the same
convention.

I think the current dependency on the hugetlb definition of gigantic page
may be too simplistic if using CMA for hugetlb pages becomes more common.
Some architectures (sparc, powerpc) have more than one gigantic page size.
Currently there is no way to specify that CMA should be used for one and
not the other.  In addition, I could imagine someone wanting to reserve/use
CMA for non-gigantic (PMD) sized pages.  There is no mechanism for that today.

I honestly have not heard about many use cases for this CMA functionality.
When support was initially added, it was driven by a specific use case and
the 'all gigantic pages use CMA if defined' implementation was deemed
sufficient.  If there are more use cases, or this seems too simple we can
revisit that decision.

>> +
>>  	parsed_hstate = h;
>>  }
>>  
>> @@ -5647,7 +5650,10 @@ void __init hugetlb_cma_reserve(int order)
>>  	unsigned long size, reserved, per_node;
>>  	int nid;
>>  
>> -	cma_reserve_called = true;
>> +	if (cma_reserve_called)
>> +		return;
>> +	else
>> +		cma_reserve_called = true;
> 
> (nit: don't need the 'else' here)

Yes, duh!
Mike Kravetz July 15, 2020, 5:05 p.m. UTC | #5
On 7/15/20 4:14 AM, Song Bao Hua (Barry Song) wrote:
>> From: Mike Kravetz [mailto:mike.kravetz@oracle.com]
>>  					huge_page_size(h)/1024);
>>
>> +	if (order >= MAX_ORDER && hugetlb_cma_size)
>> +		hugetlb_cma_reserve(order);
> 
> Hello, Mike,
> I am not sure if it is necessarily correct as the order for gigantic pages is arch-dependent:
> https://lkml.org/lkml/2020/7/1/14
> 

See my reply to Will.

The code to allocate gigantic pages from CMA depends on the arch independent
definition of gigantic page which is 'order >= MAX_ORDER'.
Will Deacon July 16, 2020, 8:12 a.m. UTC | #6
On Wed, Jul 15, 2020 at 09:59:24AM -0700, Mike Kravetz wrote:
> On 7/15/20 1:18 AM, Will Deacon wrote:
> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> index f24acb3af741..a0007d1d12d2 100644
> >> --- a/mm/hugetlb.c
> >> +++ b/mm/hugetlb.c
> >> @@ -3273,6 +3273,9 @@ void __init hugetlb_add_hstate(unsigned int order)
> >>  	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
> >>  					huge_page_size(h)/1024);
> > 
> > (nit: you can also make hugetlb_cma_reserve() static and remote its function
> > prototypes from hugetlb.h)
> 
> Yes thanks.  I threw this together pretty quickly.
> 
> > 
> >> +	if (order >= MAX_ORDER && hugetlb_cma_size)
> >> +		hugetlb_cma_reserve(order);
> > 
> > Although I really like the idea of moving this out of the arch code, I don't
> > quite follow the check against MAX_ORDER here -- it looks like a bit of a
> > hack to try to intercept the "PUD_SHIFT - PAGE_SHIFT" order which we
> > currently pass to hugetlb_cma_reserve(). Maybe we could instead have
> > something like:
> > 
> > 	#ifndef HUGETLB_CMA_ORDER
> > 	#define HUGETLB_CMA_ORDER	(PUD_SHIFT - PAGE_SHIFT)
> > 	#endif
> > 
> > and then just do:
> > 
> > 	if (order == HUGETLB_CMA_ORDER)
> > 		hugetlb_cma_reserve(order);
> > 
> > ? Is there something else I'm missing?
> > 
> 
> Well, the current hugetlb CMA code only kicks in for gigantic pages as
> defined by the hugetlb code. For example, the code to allocate a page
> from CMA is in the routine alloc_gigantic_page().  alloc_gigantic_page()
> is called from alloc_fresh_huge_page() which starts with:
> 
>         if (hstate_is_gigantic(h))
>                 page = alloc_gigantic_page(h, gfp_mask, nid, nmask);
>         else
>                 page = alloc_buddy_huge_page(h, gfp_mask,
>                                 nid, nmask, node_alloc_noretry);
> 
> and, hstate_is_gigantic is,
> 
> static inline bool hstate_is_gigantic(struct hstate *h)
> {
>         return huge_page_order(h) >= MAX_ORDER;
> }
> 
> So, everything in the existing code really depends on the hugetlb definition
> of gigantic page (order >= MAX_ORDER).  The code to check for
> 'order >= MAX_ORDER' in my proposed patch is just following the same
> convention.

Fair enough, and thanks for the explanation. Maybe just chuck a comment in,
then? Alternatively, having something like:

static inline bool page_order_is_gigantic(unsigned int order)
{
	return order >= MAX_ORDER;
}

static inline bool hstate_is_gigantic(struct hstate *h)
{
	return page_order_is_gigantic(huge_page_order(h));
}

and then using page_order_is_gigantic() to predicate the call to
hugetlb_cma_reserve? Dunno, maybe it's overkill. Up to you.

> I think the current dependency on the hugetlb definition of gigantic page
> may be too simplistic if using CMA for huegtlb pages becomes more common.
> Some architectures (sparc, powerpc) have more than one gigantic pages size.
> Currently there is no way to specify that CMA should be used for one and
> not the other.  In addition, I could imagine someone wanting to reserve/use
> CMA for non-gigantic (PMD) sized pages.  There is no mechainsm for that today.
> 
> I honestly have not heard about many use cases for this CMA functionality.
> When support was initially added, it was driven by a specific use case and
> the 'all gigantic pages use CMA if defined' implementation was deemed
> sufficient.  If there are more use cases, or this seems too simple we can
> revisit that decision.

Agreed, I think your patch is an improvement regardless of that.

Will
Anshuman Khandual July 17, 2020, 5:02 a.m. UTC | #7
On 07/16/2020 11:55 PM, Mike Kravetz wrote:
>>From 17c8f37afbf42fe7412e6eebb3619c6e0b7e1c3c Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <mike.kravetz@oracle.com>
> Date: Tue, 14 Jul 2020 15:54:46 -0700
> Subject: [PATCH] hugetlb: move cma reservation to code setting up gigantic
>  hstate
> 
> Instead of calling hugetlb_cma_reserve() directly from arch specific
> code, call from hugetlb_add_hstate when adding a gigantic hstate.
> hugetlb_add_hstate is either called from arch specific huge page setup,
> or as the result of hugetlb command line processing.  In either case,
> this is late enough in the init process that all numa memory information
> should be initialized.  And, it is early enough to still use early
> memory allocator.

This assumes that hugetlb_add_hstate() is called from the arch code at
the right point in time for the generic HugeTLB code to do the required
CMA reservation, which is not ideal. I guess there must have been a reason
why the CMA reservation should always be called by the platform code,
which knows the boot sequence timing better.
Will Deacon July 17, 2020, 8:36 a.m. UTC | #8
On Fri, Jul 17, 2020 at 10:32:53AM +0530, Anshuman Khandual wrote:
> 
> 
> On 07/16/2020 11:55 PM, Mike Kravetz wrote:
> >>From 17c8f37afbf42fe7412e6eebb3619c6e0b7e1c3c Mon Sep 17 00:00:00 2001
> > From: Mike Kravetz <mike.kravetz@oracle.com>
> > Date: Tue, 14 Jul 2020 15:54:46 -0700
> > Subject: [PATCH] hugetlb: move cma reservation to code setting up gigantic
> >  hstate
> > 
> > Instead of calling hugetlb_cma_reserve() directly from arch specific
> > code, call from hugetlb_add_hstate when adding a gigantic hstate.
> > hugetlb_add_hstate is either called from arch specific huge page setup,
> > or as the result of hugetlb command line processing.  In either case,
> > this is late enough in the init process that all numa memory information
> > should be initialized.  And, it is early enough to still use early
> > memory allocator.
> 
> This assumes that hugetlb_add_hstate() is called from the arch code at
> the right point in time for the generic HugeTLB to do the required CMA
> reservation which is not ideal. I guess it must have been a reason why
> CMA reservation should always called by the platform code which knows
> the boot sequence timing better.

Ha, except we've moved it around two or three times already in the last
month or so, so I'd say we don't have a clue when to call it in the arch
code.

Will
Anshuman Khandual July 17, 2020, 9:51 a.m. UTC | #9
On 07/17/2020 02:06 PM, Will Deacon wrote:
> On Fri, Jul 17, 2020 at 10:32:53AM +0530, Anshuman Khandual wrote:
>>
>>
>> On 07/16/2020 11:55 PM, Mike Kravetz wrote:
>>> >From 17c8f37afbf42fe7412e6eebb3619c6e0b7e1c3c Mon Sep 17 00:00:00 2001
>>> From: Mike Kravetz <mike.kravetz@oracle.com>
>>> Date: Tue, 14 Jul 2020 15:54:46 -0700
>>> Subject: [PATCH] hugetlb: move cma reservation to code setting up gigantic
>>>  hstate
>>>
>>> Instead of calling hugetlb_cma_reserve() directly from arch specific
>>> code, call from hugetlb_add_hstate when adding a gigantic hstate.
>>> hugetlb_add_hstate is either called from arch specific huge page setup,
>>> or as the result of hugetlb command line processing.  In either case,
>>> this is late enough in the init process that all numa memory information
>>> should be initialized.  And, it is early enough to still use early
>>> memory allocator.
>>
>> This assumes that hugetlb_add_hstate() is called from the arch code at
>> the right point in time for the generic HugeTLB to do the required CMA
>> reservation which is not ideal. I guess it must have been a reason why
>> CMA reservation should always called by the platform code which knows
>> the boot sequence timing better.
> 
> Ha, except we've moved it around two or three times already in the last
> month or so, so I'd say we don't have a clue when to call it in the arch
> code.

The arch dependency is not going away with this change either. It is just
getting transferred to hugetlb_add_hstate(), which gets called from
arch_initcall() in every architecture.

The perfect timing here happens to be because of arch_initcall() instead.
This is probably fine, as long as

0. hugetlb_add_hstate() is always called at arch_initcall()
1. N_MEMORY mask is guaranteed to be initialized at arch_initcall()
2. CMA reservation is available to be called at arch_initcall()
Mike Kravetz July 17, 2020, 5:02 p.m. UTC | #10
On 7/16/20 10:02 PM, Anshuman Khandual wrote:
> 
> 
> On 07/16/2020 11:55 PM, Mike Kravetz wrote:
>> >From 17c8f37afbf42fe7412e6eebb3619c6e0b7e1c3c Mon Sep 17 00:00:00 2001
>> From: Mike Kravetz <mike.kravetz@oracle.com>
>> Date: Tue, 14 Jul 2020 15:54:46 -0700
>> Subject: [PATCH] hugetlb: move cma reservation to code setting up gigantic
>>  hstate
>>
>> Instead of calling hugetlb_cma_reserve() directly from arch specific
>> code, call from hugetlb_add_hstate when adding a gigantic hstate.
>> hugetlb_add_hstate is either called from arch specific huge page setup,
>> or as the result of hugetlb command line processing.  In either case,
>> this is late enough in the init process that all numa memory information
>> should be initialized.  And, it is early enough to still use early
>> memory allocator.
> 
> This assumes that hugetlb_add_hstate() is called from the arch code at
> the right point in time for the generic HugeTLB to do the required CMA
> reservation which is not ideal. I guess it must have been a reason why
> CMA reservation should always called by the platform code which knows
> the boot sequence timing better.

Actually, the code does not make the assumption that hugetlb_add_hstate
is called from arch specific huge page setup.  It can even be called later
at the time of hugetlb command line processing.

My 'reasoning' is that gigantic pages can currently be preallocated from
bootmem/memblock_alloc at the time of command line processing.  Therefore,
we should be able to reserve bootmem for CMA at the same time.  Is there
something wrong with this reasoning?  I tested this on x86 by removing the
call to hugetlb_add_hstate from arch specific code and instead forced the
call at command line processing time.  The ability to reserve CMA was the
same.

Yes, the CMA reservation interface says it should be called from arch
specific code.  However, if we currently depend on the ability to do
memblock_alloc at hugetlb command line processing time for gigantic page
preallocation, then I think we can do the CMA reservation here as well.

Thinking about it some more, I suppose there could be some arch code that
could call hugetlb_add_hstate too early in the boot process.  But, I do
not think we have an issue with calling it too late.
Anshuman Khandual July 20, 2020, 6:14 a.m. UTC | #11
On 07/17/2020 11:07 PM, Mike Kravetz wrote:
> On 7/17/20 2:51 AM, Anshuman Khandual wrote:
>>
>>
>> On 07/17/2020 02:06 PM, Will Deacon wrote:
>>> On Fri, Jul 17, 2020 at 10:32:53AM +0530, Anshuman Khandual wrote:
>>>>
>>>>
>>>> On 07/16/2020 11:55 PM, Mike Kravetz wrote:
>>>>> >From 17c8f37afbf42fe7412e6eebb3619c6e0b7e1c3c Mon Sep 17 00:00:00 2001
>>>>> From: Mike Kravetz <mike.kravetz@oracle.com>
>>>>> Date: Tue, 14 Jul 2020 15:54:46 -0700
>>>>> Subject: [PATCH] hugetlb: move cma reservation to code setting up gigantic
>>>>>  hstate
>>>>>
>>>>> Instead of calling hugetlb_cma_reserve() directly from arch specific
>>>>> code, call from hugetlb_add_hstate when adding a gigantic hstate.
>>>>> hugetlb_add_hstate is either called from arch specific huge page setup,
>>>>> or as the result of hugetlb command line processing.  In either case,
>>>>> this is late enough in the init process that all numa memory information
>>>>> should be initialized.  And, it is early enough to still use early
>>>>> memory allocator.
>>>>
>>>> This assumes that hugetlb_add_hstate() is called from the arch code at
>>>> the right point in time for the generic HugeTLB to do the required CMA
>>>> reservation which is not ideal. I guess it must have been a reason why
>>>> CMA reservation should always called by the platform code which knows
>>>> the boot sequence timing better.
>>>
>>> Ha, except we've moved it around two or three times already in the last
>>> month or so, so I'd say we don't have a clue when to call it in the arch
>>> code.
>>
>> The arch dependency is not going way with this change either. Just that
>> its getting transferred to hugetlb_add_hstate() which gets called from
>> arch_initcall() in every architecture.
>>
>> The perfect timing here happens to be because of arch_initcall() instead.
>> This is probably fine, as long as
>>
>> 0. hugetlb_add_hstate() is always called at arch_initcall()
> 
> In another reply, I give reasoning why it would be safe to call even later
> at hugetlb command line processing time.

Understood, but there is a limited time window in which the CMA reservation
is available, irrespective of whether it is called from arch or generic code.
Finding this right time window, and ensuring that the N_MEMORY nodemask is
initialized, is easier done in the platform code.

> 
>> 1. N_MEMORY mask is guaranteed to be initialized at arch_initcall()
> 
> This is a bit more difficult to guarantee.  I find the init sequence hard to
> understand.  Looking at the arm code, arch_initcall(hugetlbpage_init)
> happens after N_MEMORY mask is setup.  I can't imagine any arch code setting
> up huge pages before N_MEMORY.  But, I suppose it is possible and we would
> need to somehow guarantee this.

Ensuring that N_MEMORY nodemask is initialized from the generic code is even
more difficult.

> 
>> 2. CMA reservation is available to be called at arch_initcall()
> 
> Since I am pretty sure we can delay the reservation until hugetlb command
> line processing time, it would be great if it was always done there.

But moving the hugetlb CMA reservation entirely into command line processing
raises another concern: mixing it with the existing memblock based pre-allocation.

> Unfortunately, I can not immediately think of an easy way to do this.
> 

It is reasonable to move the CMA reservation into generic HugeTLB, but there
are some challenges that need to be solved comprehensively. The patch here
from Barry does solve a short term problem (N_ONLINE ---> N_MEMORY) for now,
which IMHO should be considered. Moving the CMA reservation into generic
HugeTLB would require some more thought and can be attempted later.
Anshuman Khandual July 20, 2020, 6:22 a.m. UTC | #12
On 07/17/2020 10:32 PM, Mike Kravetz wrote:
> On 7/16/20 10:02 PM, Anshuman Khandual wrote:
>>
>>
>> On 07/16/2020 11:55 PM, Mike Kravetz wrote:
>>> >From 17c8f37afbf42fe7412e6eebb3619c6e0b7e1c3c Mon Sep 17 00:00:00 2001
>>> From: Mike Kravetz <mike.kravetz@oracle.com>
>>> Date: Tue, 14 Jul 2020 15:54:46 -0700
>>> Subject: [PATCH] hugetlb: move cma reservation to code setting up gigantic
>>>  hstate
>>>
>>> Instead of calling hugetlb_cma_reserve() directly from arch specific
>>> code, call from hugetlb_add_hstate when adding a gigantic hstate.
>>> hugetlb_add_hstate is either called from arch specific huge page setup,
>>> or as the result of hugetlb command line processing.  In either case,
>>> this is late enough in the init process that all numa memory information
>>> should be initialized.  And, it is early enough to still use early
>>> memory allocator.
>>
>> This assumes that hugetlb_add_hstate() is called from the arch code at
>> the right point in time for the generic HugeTLB to do the required CMA
>> reservation which is not ideal. I guess it must have been a reason why
>> CMA reservation should always called by the platform code which knows
>> the boot sequence timing better.
> 
> Actually, the code does not make the assumption that hugetlb_add_hstate
> is called from arch specific huge page setup.  It can even be called later
> at the time of hugetlb command line processing.

Yes, now that hugetlb_cma_reserve() has been moved into hugetlb_add_hstate().
But then there is an explicit warning when trying to mix both command line
options, i.e. hugepagesz= and hugetlb_cma=. The proposed code here has not
changed that behavior, and hence the following warning should have been
triggered here as well.

1) hugepagesz_setup()
	hugetlb_add_hstate()
		 hugetlb_cma_reserve()

2) hugepages_setup()
	hugetlb_hstate_alloc_pages()	when order >= MAX_ORDER

	if (hstate_is_gigantic(h)) {
        	if (IS_ENABLED(CONFIG_CMA) && hugetlb_cma[0]) {
                	pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n");
			break;
                }
		if (!alloc_bootmem_huge_page(h))
                break;
	}

Nonetheless, it does not make sense to mix memblock and CMA based huge
page pre-allocations. But looking at this again, could this warning ever
have been triggered until now, unless a given platform calls
hugetlb_cma_reserve() before __setup("hugepages=", hugepages_setup)?
Anyway, there seem to be good reasons to keep both memblock and CMA based
pre-allocations in place, but mixing them together (as done in the proposed
code here) does not seem right.

> 
> My 'reasoning' is that gigantic pages can currently be preallocated from
> bootmem/memblock_alloc at the time of command line processing.  Therefore,
> we should be able to reserve bootmem for CMA at the same time.  Is there
> something wrong with this reasoning?  I tested this on x86 by removing the
> call to hugetlb_add_hstate from arch specific code and instead forced the
> call at command line processing time.  The ability to reserve CMA was the
> same.

There is no problem with that reasoning. A __setup()-triggered function
should be able to perform the CMA reservation. But as pointed out before,
it does not make sense to mix CMA reservation and memblock based
pre-allocation.

> 
> Yes, the CMA reservation interface says it should be called from arch
> specific code.  However, if we currently depend on the ability to do
> memblock_alloc at hugetlb command line processing time for gigantic page
> preallocation, then I think we can do the CMA reservation here as well.

IIUC, CMA reservation and memblock alloc have some differences in terms of
how the memory can be used later on; I will have to dig deeper on this. But
the comment block above cma_declare_contiguous_nid() is a concern.

 * This function reserves memory from early allocator. It should be
 * called by arch specific code once the early allocator (memblock or bootmem)
 * has been activated and all other subsystems have already allocated/reserved
 * memory. This function allows to create custom reserved areas.

> 
> Thinking about it some more, I suppose there could be some arch code that
> could call hugetlb_add_hstate too early in the boot process.  But, I do
> not think we have an issue with calling it too late.
> 

Calling it too late might mean the page allocator has already been fully
initialized, after which a CMA reservation is no longer possible. Also,
calling it too early would get in the way of other subsystems that might
need memory reservations in specific physical ranges.
Mike Kravetz July 20, 2020, 6:17 p.m. UTC | #13
On 7/19/20 11:22 PM, Anshuman Khandual wrote:
> 
> 
> On 07/17/2020 10:32 PM, Mike Kravetz wrote:
>> On 7/16/20 10:02 PM, Anshuman Khandual wrote:
>>>
>>>
>>> On 07/16/2020 11:55 PM, Mike Kravetz wrote:
>>>> >From 17c8f37afbf42fe7412e6eebb3619c6e0b7e1c3c Mon Sep 17 00:00:00 2001
>>>> From: Mike Kravetz <mike.kravetz@oracle.com>
>>>> Date: Tue, 14 Jul 2020 15:54:46 -0700
>>>> Subject: [PATCH] hugetlb: move cma reservation to code setting up gigantic
>>>>  hstate
>>>>
>>>> Instead of calling hugetlb_cma_reserve() directly from arch specific
>>>> code, call from hugetlb_add_hstate when adding a gigantic hstate.
>>>> hugetlb_add_hstate is either called from arch specific huge page setup,
>>>> or as the result of hugetlb command line processing.  In either case,
>>>> this is late enough in the init process that all numa memory information
>>>> should be initialized.  And, it is early enough to still use early
>>>> memory allocator.
>>>
>>> This assumes that hugetlb_add_hstate() is called from the arch code at
>>> the right point in time for the generic HugeTLB to do the required CMA
>>> reservation which is not ideal. I guess it must have been a reason why
>>> CMA reservation should always called by the platform code which knows
>>> the boot sequence timing better.
>>
>> Actually, the code does not make the assumption that hugetlb_add_hstate
>> is called from arch specific huge page setup.  It can even be called later
>> at the time of hugetlb command line processing.
> 
> Yes, now that hugetlb_cma_reserve() has been moved into hugetlb_add_hstate().
> But then there is an explicit warning while trying to mix both the command
> line options i.e hugepagesz= and hugetlb_cma=. The proposed code here have
> not changed that behavior and hence the following warning should have been
> triggered here as well.
> 
> 1) hugepagesz_setup()
> 	hugetlb_add_hstate()
> 		 hugetlb_cma_reserve()
> 
> 2) hugepages_setup()
> 	hugetlb_hstate_alloc_pages()	when order >= MAX_ORDER
> 
> 	if (hstate_is_gigantic(h)) {
>         	if (IS_ENABLED(CONFIG_CMA) && hugetlb_cma[0]) {
>                 	pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n");
> 			break;
>                 }
> 		if (!alloc_bootmem_huge_page(h))
>                 break;
> 	}
> 
> Nonetheless, it does not make sense to mix both memblock and CMA based huge
> page pre-allocations. But looking at this again, could this warning be ever
> triggered till now ? Unless, a given platform calls hugetlb_cma_reserve()
> before _setup("hugepages=", hugepages_setup). Anyways, there seems to be
> good reasons to keep both memblock and CMA based pre-allocations in place.
> But mixing them together (as done in the proposed code here) does not seem
> to be right.

I'm not sure if I follow the question.

This proposal does not change the trigger for the warning printed when one
tries to both reserve CMA and pre-allocate gigantic pages.  If hugetlb_cma
is specified on the command line, and someone tries to pre-allocate gigantic
pages they will get the warning.  Such a command line on x86 might look like,
hugetlb_cma=4G hugepagesz=1G hugepages=4

You will then see,
[    0.065864] HugeTLB: hugetlb_cma is enabled, skip boot time allocation
[    0.065866] HugeTLB: allocating 4 of page size 1.00 GiB failed.  Only allocated 0 hugepages.

Ideally we could/should eliminate the second message.
This behavior exists in the current code.

>> My 'reasoning' is that gigantic pages can currently be preallocated from
>> bootmem/memblock_alloc at the time of command line processing.  Therefore,
>> we should be able to reserve bootmem for CMA at the same time.  Is there
>> something wrong with this reasoning?  I tested this on x86 by removing the
>> call to hugetlb_add_hstate from arch specific code and instead forced the
>> call at command line processing time.  The ability to reserve CMA was the
>> same.
> 
> There is no problem with that reasoning. __setup() triggered function should
> be able perform CMA reservation. But as pointed out before, it does not make
> sense to mix both CMA reservation and memblock based pre-allocation.

Agree.  I am not proposing we do.  Sorry if you got that impression.

>> Yes, the CMA reservation interface says it should be called from arch
>> specific code.  However, if we currently depend on the ability to do
>> memblock_alloc at hugetlb command line processing time for gigantic page
>> preallocation, then I think we can do the CMA reservation here as well.
> 
> IIUC, CMA reservation and memblock alloc have some differences in terms of
> how the memory can be used later on, will have to dig deeper on this. But
> the comment section near cma_declare_contiguous_nid() is a concern.
> 
>  * This function reserves memory from early allocator. It should be
>  * called by arch specific code once the early allocator (memblock or bootmem)
>  * has been activated and all other subsystems have already allocated/reserved
>  * memory. This function allows to create custom reserved areas.
> 

Yes, that is the comment I was looking at as well.

However, note that hugetlb pre-allocation of gigantic pages will end up
calling memblock_alloc_range_nid.  This is the same routine used for CMA
reservations/allocations from cma_declare_contiguous_nid.  This is why
there should be no issue with doing CMA reservations at this time.

This may be the confusing part.  I am not saying we would do CMA reservations
and pre-allocations together.  Rather, they both rely on the same underlying
code, so we can call them at the same point in the init process.
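
Roughly, the two paths converge like this (approximate call chains for the
current code; the intermediate memblock helper names are from memory and may
differ slightly between kernel versions):

	hugetlb_cma_reserve()                    /* "hugetlb_cma=" reservation */
	    cma_declare_contiguous_nid()
	        memblock_alloc_range_nid()

	hugetlb_hstate_alloc_pages()             /* "hugepages=" pre-allocation */
	    alloc_bootmem_huge_page()
	        memblock_alloc_try_nid_raw()
	            memblock_alloc_range_nid()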

>> Thinking about it some more, I suppose there could be some arch code that
>> could call hugetlb_add_hstate too early in the boot process.  But, I do
>> not think we have an issue with calling it too late.
>>
> 
> Calling it too late might have got the page allocator initialized completely
> and then CMA reservation would not be possible afterwards. Also calling it
> too early would prevent other subsystems which might need memory reservation
> in specific physical ranges.

I thought about it some more and came up with a way to do all this at command
line processing time.  It will take me a day or two to put together.

The patch from Barry which started this thread is indeed needed and is in
Andrew's tree.  I'll start another thread with a patch to move CMA reservations
to command line processing.
Aneesh Kumar K.V July 27, 2020, 2:37 p.m. UTC | #14
Mike Kravetz <mike.kravetz@oracle.com> writes:

> On 7/19/20 11:22 PM, Anshuman Khandual wrote:
>> 
>> 
>> On 07/17/2020 10:32 PM, Mike Kravetz wrote:
>>> On 7/16/20 10:02 PM, Anshuman Khandual wrote:
>>>>
>>>>
>>>> On 07/16/2020 11:55 PM, Mike Kravetz wrote:
>>>>> >From 17c8f37afbf42fe7412e6eebb3619c6e0b7e1c3c Mon Sep 17 00:00:00 2001
>>>>> From: Mike Kravetz <mike.kravetz@oracle.com>
>>>>> Date: Tue, 14 Jul 2020 15:54:46 -0700
>>>>> Subject: [PATCH] hugetlb: move cma reservation to code setting up gigantic
>>>>>  hstate
>>>>>
>>>>> Instead of calling hugetlb_cma_reserve() directly from arch specific
>>>>> code, call from hugetlb_add_hstate when adding a gigantic hstate.
>>>>> hugetlb_add_hstate is either called from arch specific huge page setup,
>>>>> or as the result of hugetlb command line processing.  In either case,
>>>>> this is late enough in the init process that all numa memory information
>>>>> should be initialized.  And, it is early enough to still use early
>>>>> memory allocator.
>>>>
>>>> This assumes that hugetlb_add_hstate() is called from the arch code at
>>>> the right point in time for the generic HugeTLB to do the required CMA
>>>> reservation which is not ideal. I guess it must have been a reason why
>>>> CMA reservation should always called by the platform code which knows
>>>> the boot sequence timing better.
>>>
>>> Actually, the code does not make the assumption that hugetlb_add_hstate
>>> is called from arch specific huge page setup.  It can even be called later
>>> at the time of hugetlb command line processing.
>> 
>> Yes, now that hugetlb_cma_reserve() has been moved into hugetlb_add_hstate().
>> But then there is an explicit warning while trying to mix both the command
>> line options i.e hugepagesz= and hugetlb_cma=. The proposed code here have
>> not changed that behavior and hence the following warning should have been
>> triggered here as well.
>> 
>> 1) hugepagesz_setup()
>> 	hugetlb_add_hstate()
>> 		 hugetlb_cma_reserve()
>> 
>> 2) hugepages_setup()
>> 	hugetlb_hstate_alloc_pages()	when order >= MAX_ORDER
>> 
>> 	if (hstate_is_gigantic(h)) {
>>         	if (IS_ENABLED(CONFIG_CMA) && hugetlb_cma[0]) {
>>                 	pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n");
>> 			break;
>>                 }
>> 		if (!alloc_bootmem_huge_page(h))
>>                 break;
>> 	}
>> 
>> Nonetheless, it does not make sense to mix both memblock and CMA based huge
>> page pre-allocations. But looking at this again, could this warning be ever
>> triggered till now ? Unless, a given platform calls hugetlb_cma_reserve()
>> before _setup("hugepages=", hugepages_setup). Anyways, there seems to be
>> good reasons to keep both memblock and CMA based pre-allocations in place.
>> But mixing them together (as done in the proposed code here) does not seem
>> to be right.
>
> I'm not sure if I follow the question.
>
> This proposal does not change the trigger for the warning printed when one
> tries to both reserve CMA and pre-allocate gigantic pages.  If hugetlb_cma
> is specified on the command line, and someone tries to pre-allocate gigantic
> pages they will get the warning.  Such a command line on x86 might look like,
> hugetlb_cma=4G hugepagesz=1G hugepages=4
>
> You will then see,
> [    0.065864] HugeTLB: hugetlb_cma is enabled, skip boot time allocation
> [    0.065866] HugeTLB: allocating 4 of page size 1.00 GiB failed.  Only allocated 0 hugepages.
>
> Ideally we could/should eliminate the second message.
> This behavior exists in the current code.

There is a variant of this on pseries powerpc, where there is hypervisor
assistance w.r.t. allocating gigantic pages. See the ppc64 enablement
patch:

https://lore.kernel.org/linuxppc-dev/20200713150749.25245-1-aneesh.kumar@linux.ibm.com/

-aneesh
Mike Kravetz July 27, 2020, 5:52 p.m. UTC | #15
On 7/27/20 7:37 AM, Aneesh Kumar K.V wrote:
> There is variant of this which is pseries powerpc where there is
> hypervisor assistance w.r.t allocating gigantic pages. See the ppc64
> enablement patch 
> 
> https://lore.kernel.org/linuxppc-dev/20200713150749.25245-1-aneesh.kumar@linux.ibm.com/
> 

Thanks Aneesh,  I missed the powerpc support.

I knew about the powerpc hypervisor assistance, which caused me to rethink
my original idea that all this could be arch independent.  My next idea was
that architectures would only need to provide a constant, something like:

#define HUGETLB_CMA_ORDER (PUD_SHIFT - PAGE_SHIFT)

However, it looks like this can not be a compile time constant on powerpc.

Moving more of the CMA support to arch independent code has moved down on
my priority list.  So, it is likely not to get much work in the immediate
future.

Just curious, can you have multiple gigantic page sizes supported at one
time (one system instance) on powerpc?

Patch

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 1e93cfc7c47a..420f5e55615c 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -420,15 +420,6 @@  void __init bootmem_init(void)
 
 	arm64_numa_init();
 
-	/*
-	 * must be done after arm64_numa_init() which calls numa_init() to
-	 * initialize node_online_map that gets used in hugetlb_cma_reserve()
-	 * while allocating required CMA size across online nodes.
-	 */
-#ifdef CONFIG_ARM64_4K_PAGES
-	hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
-#endif
-
 	/*
 	 * Sparsemem tries to allocate bootmem in memory_present(), so must be
 	 * done after the fixed reservations.
@@ -438,6 +429,16 @@  void __init bootmem_init(void)
 	sparse_init();
 	zone_sizes_init(min, max);
 
+	/*
+	 * must be done after zone_sizes_init() which calls free_area_init()
+	 * that calls node_set_state() to initialize node_states[N_MEMORY]
+	 * because hugetlb_cma_reserve() will scan over nodes with N_MEMORY
+	 * state
+	 */
+#ifdef CONFIG_ARM64_4K_PAGES
+	hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
+#endif
+
 	memblock_dump_all();
 }
 
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index a3767e74c758..a1a9712090ae 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1164,9 +1164,6 @@  void __init setup_arch(char **cmdline_p)
 	initmem_init();
 	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
 
-	if (boot_cpu_has(X86_FEATURE_GBPAGES))
-		hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
-
 	/*
 	 * Reserve memory for crash kernel after SRAT is parsed so that it
 	 * won't consume hotpluggable memory.
@@ -1180,6 +1177,15 @@  void __init setup_arch(char **cmdline_p)
 
 	x86_init.paging.pagetable_init();
 
+	/*
+	 * must be done after zone_sizes_init() which calls free_area_init()
+	 * that calls node_set_state() to initialize node_states[N_MEMORY]
+	 * because hugetlb_cma_reserve() will scan over nodes with N_MEMORY
+	 * state
+	 */
+	if (boot_cpu_has(X86_FEATURE_GBPAGES))
+		hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
+
 	kasan_init();
 
 	/*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bc3304af40d0..32b5035ffec1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5665,6 +5665,13 @@  static int __init cmdline_parse_hugetlb_cma(char *p)
 
 early_param("hugetlb_cma", cmdline_parse_hugetlb_cma);
 
+/*
+ * hugetlb_cma_reserve() - reserve CMA for gigantic pages on nodes with memory
+ *
+ * must be called after free_area_init() that updates N_MEMORY via node_set_state().
+ * hugetlb_cma_reserve() scans over N_MEMORY nodemask and hence expects the platforms
+ * to have initialized N_MEMORY state.
+ */
 void __init hugetlb_cma_reserve(int order)
 {
 	unsigned long size, reserved, per_node;
@@ -5685,12 +5692,12 @@  void __init hugetlb_cma_reserve(int order)
 	 * If 3 GB area is requested on a machine with 4 numa nodes,
 	 * let's allocate 1 GB on first three nodes and ignore the last one.
 	 */
-	per_node = DIV_ROUND_UP(hugetlb_cma_size, nr_online_nodes);
+	per_node = DIV_ROUND_UP(hugetlb_cma_size, num_node_state(N_MEMORY));
 	pr_info("hugetlb_cma: reserve %lu MiB, up to %lu MiB per node\n",
 		hugetlb_cma_size / SZ_1M, per_node / SZ_1M);
 
 	reserved = 0;
-	for_each_node_state(nid, N_ONLINE) {
+	for_each_node_state(nid, N_MEMORY) {
 		int res;
 
 		size = min(per_node, hugetlb_cma_size - reserved);