diff mbox series

mem-hotplug: fix node spanned pages when we have a node with only zone_movable

Message ID 1554178276-10372-1-git-send-email-fanglinxu@huawei.com (mailing list archive)
State New, archived
Headers show
Series mem-hotplug: fix node spanned pages when we have a node with only zone_movable | expand

Commit Message

Linxu Fang April 2, 2019, 4:11 a.m. UTC
commit 342332e6a925 ("mm/page_alloc.c: introduce kernelcore=mirror
option") and its follow-up patches rewrote the calculation of node
spanned pages.
commit e506b99696a2 ("mem-hotplug: fix node spanned pages when we have
a movable node") fixed part of the problem, but the current code is
still wrong: when a node contains only ZONE_MOVABLE and its node id is
not zero, the node's spanned pages are counted twice.
That is because the node has an empty normal zone, and its
zone_start_pfn or zone_end_pfn does not fall between
arch_zone_lowest_possible_pfn and arch_zone_highest_possible_pfn, so
we need clamp() to constrain the range, just as commit 96e907d13602
("bootmem: Reimplement __absent_pages_in_range() using
for_each_mem_pfn_range()") did.

e.g.
Zone ranges:
  DMA      [mem 0x0000000000001000-0x0000000000ffffff]
  DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
  Normal   [mem 0x0000000100000000-0x000000023fffffff]
Movable zone start for each node
  Node 0: 0x0000000100000000
  Node 1: 0x0000000140000000
Early memory node ranges
  node   0: [mem 0x0000000000001000-0x000000000009efff]
  node   0: [mem 0x0000000000100000-0x00000000bffdffff]
  node   0: [mem 0x0000000100000000-0x000000013fffffff]
  node   1: [mem 0x0000000140000000-0x000000023fffffff]

node 0 DMA	spanned:0xfff   present:0xf9e   absent:0x61
node 0 DMA32	spanned:0xff000 present:0xbefe0	absent:0x40020
node 0 Normal	spanned:0	present:0	absent:0
node 0 Movable	spanned:0x40000 present:0x40000 absent:0
On node 0 totalpages(node_present_pages): 1048446
node_spanned_pages:1310719
node 1 DMA	spanned:0	    present:0		absent:0
node 1 DMA32	spanned:0	    present:0		absent:0
node 1 Normal	spanned:0x100000    present:0x100000	absent:0
node 1 Movable	spanned:0x100000    present:0x100000	absent:0
On node 1 totalpages(node_present_pages): 2097152
node_spanned_pages:2097152
Memory: 6967796K/12582392K available (16388K kernel code, 3686K rwdata,
4468K rodata, 2160K init, 10444K bss, 5614596K reserved, 0K
cma-reserved)

This shows that node 1's memory is counted twice.
After this patch, the problem is fixed:

node 0 DMA	spanned:0xfff   present:0xf9e   absent:0x61
node 0 DMA32	spanned:0xff000 present:0xbefe0	absent:0x40020
node 0 Normal	spanned:0	present:0	absent:0
node 0 Movable	spanned:0x40000 present:0x40000 absent:0
On node 0 totalpages(node_present_pages): 1048446
node_spanned_pages:1310719
node 1 DMA	spanned:0	    present:0		absent:0
node 1 DMA32	spanned:0	    present:0		absent:0
node 1 Normal	spanned:0	    present:0		absent:0
node 1 Movable	spanned:0x100000    present:0x100000	absent:0
On node 1 totalpages(node_present_pages): 1048576
node_spanned_pages:1048576
Memory: 6967796K/8388088K available (16388K kernel code, 3686K rwdata,
4468K rodata, 2160K init, 10444K bss, 1420292K reserved, 0K
cma-reserved)

Signed-off-by: Linxu Fang <fanglinxu@huawei.com>
---
 mm/page_alloc.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Comments

Oscar Salvador April 2, 2019, 2:57 p.m. UTC | #1
On Tue, Apr 02, 2019 at 12:11:16PM +0800, Linxu Fang wrote:
> commit <342332e6a925> ("mm/page_alloc.c: introduce kernelcore=mirror
> option") and series patches rewrote the calculation of node spanned
> pages.
> commit <e506b99696a2> (mem-hotplug: fix node spanned pages when we have a
> movable node), but the current code still has problems,
> when we have a node with only zone_movable and the node id is not zero,
> the size of node spanned pages is double added.
> That's because we have an empty normal zone, and zone_start_pfn or
> zone_end_pfn is not between arch_zone_lowest_possible_pfn and
> arch_zone_highest_possible_pfn, so we need to use clamp to constrain the
> range just like the commit <96e907d13602> (bootmem: Reimplement
> __absent_pages_in_range() using for_each_mem_pfn_range()).

So, let me see if I understood this correctly:

When calling zone_spanned_pages_in_node() for any node which is not node 0,

> *zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
> *zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];

will actually set zone_start_pfn/zone_end_pfn to the values from node0's
ZONE_NORMAL?

So we use clamp to actually check if such values fall within what node1's
memory spans, and ignore them otherwise?

Btw, mem-hotplug does not hit this path anymore.


> 
> e.g.
> Zone ranges:
>   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
>   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
>   Normal   [mem 0x0000000100000000-0x000000023fffffff]
> Movable zone start for each node
>   Node 0: 0x0000000100000000
>   Node 1: 0x0000000140000000
> Early memory node ranges
>   node   0: [mem 0x0000000000001000-0x000000000009efff]
>   node   0: [mem 0x0000000000100000-0x00000000bffdffff]
>   node   0: [mem 0x0000000100000000-0x000000013fffffff]
>   node   1: [mem 0x0000000140000000-0x000000023fffffff]
> 
> node 0 DMA	spanned:0xfff   present:0xf9e   absent:0x61
> node 0 DMA32	spanned:0xff000 present:0xbefe0	absent:0x40020
> node 0 Normal	spanned:0	present:0	absent:0
> node 0 Movable	spanned:0x40000 present:0x40000 absent:0
> On node 0 totalpages(node_present_pages): 1048446
> node_spanned_pages:1310719
> node 1 DMA	spanned:0	    present:0		absent:0
> node 1 DMA32	spanned:0	    present:0		absent:0
> node 1 Normal	spanned:0x100000    present:0x100000	absent:0
> node 1 Movable	spanned:0x100000    present:0x100000	absent:0
> On node 1 totalpages(node_present_pages): 2097152
> node_spanned_pages:2097152
> Memory: 6967796K/12582392K available (16388K kernel code, 3686K rwdata,
> 4468K rodata, 2160K init, 10444K bss, 5614596K reserved, 0K
> cma-reserved)
> 
> It shows that the current memory of node 1 is double added.
> After this patch, the problem is fixed.
> 
> node 0 DMA	spanned:0xfff   present:0xf9e   absent:0x61
> node 0 DMA32	spanned:0xff000 present:0xbefe0	absent:0x40020
> node 0 Normal	spanned:0	present:0	absent:0
> node 0 Movable	spanned:0x40000 present:0x40000 absent:0
> On node 0 totalpages(node_present_pages): 1048446
> node_spanned_pages:1310719
> node 1 DMA	spanned:0	    present:0		absent:0
> node 1 DMA32	spanned:0	    present:0		absent:0
> node 1 Normal	spanned:0	    present:0		absent:0
> node 1 Movable	spanned:0x100000    present:0x100000	absent:0
> On node 1 totalpages(node_present_pages): 1048576
> node_spanned_pages:1048576
> memory: 6967796K/8388088K available (16388K kernel code, 3686K rwdata,
> 4468K rodata, 2160K init, 10444K bss, 1420292K reserved, 0K
> cma-reserved)
> 
> Signed-off-by: Linxu Fang <fanglinxu@huawei.com>
> ---
>  mm/page_alloc.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3eb01de..5cd0cb2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6233,13 +6233,15 @@ static unsigned long __init zone_spanned_pages_in_node(int nid,
>  					unsigned long *zone_end_pfn,
>  					unsigned long *ignored)
>  {
> +	unsigned long zone_low = arch_zone_lowest_possible_pfn[zone_type];
> +	unsigned long zone_high = arch_zone_highest_possible_pfn[zone_type];
>  	/* When hotadd a new node from cpu_up(), the node should be empty */
>  	if (!node_start_pfn && !node_end_pfn)
>  		return 0;
>  
>  	/* Get the start and end of the zone */
> -	*zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
> -	*zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
> +	*zone_start_pfn = clamp(node_start_pfn, zone_low, zone_high);
> +	*zone_end_pfn = clamp(node_end_pfn, zone_low, zone_high);
>  	adjust_zone_range_for_zone_movable(nid, zone_type,
>  				node_start_pfn, node_end_pfn,
>  				zone_start_pfn, zone_end_pfn);
> -- 
> 1.8.5.6
> 
>
Oscar Salvador April 2, 2019, 3 p.m. UTC | #2
On Tue, Apr 02, 2019 at 04:57:11PM +0200, Oscar Salvador wrote:
> On Tue, Apr 02, 2019 at 12:11:16PM +0800, Linxu Fang wrote:
> > commit <342332e6a925> ("mm/page_alloc.c: introduce kernelcore=mirror
> > option") and series patches rewrote the calculation of node spanned
> > pages.
> > commit <e506b99696a2> (mem-hotplug: fix node spanned pages when we have a
> > movable node), but the current code still has problems,
> > when we have a node with only zone_movable and the node id is not zero,
> > the size of node spanned pages is double added.
> > That's because we have an empty normal zone, and zone_start_pfn or
> > zone_end_pfn is not between arch_zone_lowest_possible_pfn and
> > arch_zone_highest_possible_pfn, so we need to use clamp to constrain the
> > range just like the commit <96e907d13602> (bootmem: Reimplement
> > __absent_pages_in_range() using for_each_mem_pfn_range()).
> 
> So, let me see if I understood this correctly:
> 
> When calling zone_spanned_pages_in_node() for any node which is not node 0,
> 
> > *zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
> > *zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
> 
> will actually set zone_start_pfn/zone_end_pfn to the values from node0's
> ZONE_NORMAL?

Of course, I meant when calling it with zone_type == ZONE_NORMAL.
Linxu Fang April 3, 2019, 4:06 a.m. UTC | #3
> will actually set zone_start_pfn/zone_end_pfn to the values from node0's
> ZONE_NORMAL?

> So we use clamp to actually check if such values fall within what node1's
> memory spans, and ignore them otherwise?

That's right.
Normally, zone_start_pfn/zone_end_pfn have the same values for all
nodes.
Let's look at another example, obtained by adding some debugging
information.

e.g.
Zone ranges:
  DMA      [mem 0x0000000000001000-0x0000000000ffffff]
  DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
  Normal   [mem 0x0000000100000000-0x0000000792ffffff]
Movable zone start for each node
  Node 0: 0x0000000100000000
  Node 1: 0x00000002b1000000
  Node 2: 0x0000000522000000
Early memory node ranges
  node   0: [mem 0x0000000000001000-0x000000000009efff]
  node   0: [mem 0x0000000000100000-0x00000000bffdefff]
  node   0: [mem 0x0000000100000000-0x00000002b0ffffff]
  node   1: [mem 0x00000002b1000000-0x0000000521ffffff]
  node   2: [mem 0x0000000522000000-0x0000000792ffffff]

Node 0:
node_start_pfn=1        node_end_pfn=2822144
DMA      zone_low=1        zone_high=4096
DMA32    zone_low=4096     zone_high=1048576
Normal   zone_low=1048576  zone_high=7942144
Movable  zone_low=0        zone_high=0

Node 1:
node_start_pfn=2822144  node_end_pfn=5382144
DMA      zone_low=1        zone_high=4096
DMA32    zone_low=4096     zone_high=1048576
Normal   zone_low=1048576  zone_high=7942144
Movable  zone_low=0        zone_high=0

Node 2:
node_start_pfn=5382144  node_end_pfn=7942144
DMA      zone_low=1        zone_high=4096
DMA32    zone_low=4096     zone_high=1048576
Normal   zone_low=1048576  zone_high=7942144
Movable  zone_low=0        zone_high=0

Before this patch, zone_start_pfn/zone_end_pfn are the same for nodes 0, 1 and 2:
  DMA      zone_start_pfn:1        zone_end_pfn:4096
  DMA32    zone_start_pfn:4096     zone_end_pfn:1048576
  Normal   zone_start_pfn:1048576  zone_end_pfn:7942144
  Movable  zone_start_pfn:0        zone_end_pfn:0
  spanned pages result:
  node 0:
    DMA      spanned:4095
    DMA32    spanned:1044480
    Normal   spanned:0
    Movable  spanned:1773568
    totalpages:2559869
  node 1:
    DMA      spanned:0
    DMA32    spanned:0
    Normal   spanned:2560000
    Movable  spanned:2560000
    totalpages:5120000
  node 2:
    DMA      spanned:0
    DMA32    spanned:0
    Normal   spanned:2560000
    Movable  spanned:2560000
    totalpages:5120000

After this patch:
  node 0:
    DMA      zone_start_pfn:1        zone_end_pfn:4096    spanned:4095
    DMA32    zone_start_pfn:4096     zone_end_pfn:1048576 spanned:1044480
    Normal   zone_start_pfn:1048576  zone_end_pfn:2822144 spanned:0
    Movable  zone_start_pfn:0        zone_end_pfn:0       spanned:1773568
    totalpages:2559869
  node 1:
    DMA      zone_start_pfn:4096     zone_end_pfn:4096    spanned:0
    DMA32    zone_start_pfn:1048576  zone_end_pfn:1048576 spanned:0
    Normal   zone_start_pfn:2822144  zone_end_pfn:5382144 spanned:0
    Movable  zone_start_pfn:0        zone_end_pfn:0       spanned:2560000
    totalpages:2560000
  node 2:
    DMA      zone_start_pfn:4096     zone_end_pfn:4096    spanned:0
    DMA32    zone_start_pfn:1048576  zone_end_pfn:1048576 spanned:0
    Normal   zone_start_pfn:5382144  zone_end_pfn:7942144 spanned:0
    Movable  zone_start_pfn:0        zone_end_pfn:0       spanned:2560000
    totalpages:2560000

It is easy to construct such a scenario by configuring kernelcore=mirror
on a multi-NUMA machine whose memory is not fully mirrored; a machine
without any mirrored memory also works.
A great difference can be observed in the boot log and in
/proc/pagetypeinfo.

On earlier kernel versions, such bugs directly double the memory of
some nodes.
Although the redundant memory only exists as reserved memory, this is
not the expected behavior.

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3eb01de..5cd0cb2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6233,13 +6233,15 @@  static unsigned long __init zone_spanned_pages_in_node(int nid,
 					unsigned long *zone_end_pfn,
 					unsigned long *ignored)
 {
+	unsigned long zone_low = arch_zone_lowest_possible_pfn[zone_type];
+	unsigned long zone_high = arch_zone_highest_possible_pfn[zone_type];
 	/* When hotadd a new node from cpu_up(), the node should be empty */
 	if (!node_start_pfn && !node_end_pfn)
 		return 0;
 
 	/* Get the start and end of the zone */
-	*zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
-	*zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
+	*zone_start_pfn = clamp(node_start_pfn, zone_low, zone_high);
+	*zone_end_pfn = clamp(node_end_pfn, zone_low, zone_high);
 	adjust_zone_range_for_zone_movable(nid, zone_type,
 				node_start_pfn, node_end_pfn,
 				zone_start_pfn, zone_end_pfn);