mm/pages_alloc.c: Don't create ZONE_MOVABLE beyond the end of a node

Message ID 20220215025831.2113067-1-apopple@nvidia.com (mailing list archive)
State New

Commit Message

Alistair Popple Feb. 15, 2022, 2:58 a.m. UTC
ZONE_MOVABLE uses the remaining memory in each node. Its starting pfn
is also aligned to MAX_ORDER_NR_PAGES. It is possible for the remaining
memory in a node to be less than MAX_ORDER_NR_PAGES, meaning there is
not enough room for ZONE_MOVABLE on that node.

Unfortunately this condition is not checked for. This leads to
zone_movable_pfn[] getting set to a pfn greater than the last pfn in a
node.
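
As a rough illustration of the overshoot described above, the following userspace sketch rounds a candidate ZONE_MOVABLE start up to a block boundary and compares it with the node's end pfn. The names sketch_roundup() and movable_start_overshoots() are hypothetical, and SKETCH_MAX_ORDER_NR_PAGES = 1024 is an assumption (4 KiB pages, 4 MiB blocks); it is not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed value: 1024 pages, i.e. a 4 MiB block with 4 KiB pages. */
#define SKETCH_MAX_ORDER_NR_PAGES 1024UL

/* Round pfn up to the next multiple of align. */
static unsigned long sketch_roundup(unsigned long pfn, unsigned long align)
{
	assert(align != 0);
	return ((pfn + align - 1) / align) * align;
}

/*
 * True if aligning the ZONE_MOVABLE start pushes it to or past the
 * node's (exclusive) end pfn, leaving no room for the zone.
 */
static bool movable_start_overshoots(unsigned long movable_pfn,
				     unsigned long node_end_pfn)
{
	return sketch_roundup(movable_pfn, SKETCH_MAX_ORDER_NR_PAGES)
		>= node_end_pfn;
}
```

With a made-up start pfn of 0x7fd00 and a node ending at pfn 0x7ffde, the aligned start becomes 0x80000, which lies beyond the node.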

calculate_node_totalpages() then sets zone->present_pages to be greater
than zone->spanned_pages which is invalid, as spanned_pages represents
the maximum number of pages in a zone assuming no holes.

Subsequently it is possible free_area_init_core() will observe a zone of
size zero with present pages. In this case it will skip setting up the
zone, including the initialisation of free_lists[].

However populated_zone() checks zone->present_pages to see if a zone has
memory available. This is used by iterators such as
walk_zones_in_node(). pagetypeinfo_showfree() uses this to walk the
free_list of each zone in each node, which are assumed to be initialised
due to the zone not being empty. As free_area_init_core() never
initialised the free_lists[] this results in the following kernel crash
when trying to read /proc/pagetypeinfo:

[   67.534914] BUG: kernel NULL pointer dereference, address: 0000000000000000
[   67.535429] #PF: supervisor read access in kernel mode
[   67.535789] #PF: error_code(0x0000) - not-present page
[   67.536128] PGD 0 P4D 0
[   67.536305] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC NOPTI
[   67.536696] CPU: 0 PID: 456 Comm: cat Not tainted 5.16.0 #461
[   67.537096] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
[   67.537638] RIP: 0010:pagetypeinfo_show+0x163/0x460
[   67.537992] Code: 9e 82 e8 80 57 0e 00 49 8b 06 b9 01 00 00 00 4c 39 f0 75 16 e9 65 02 00 00 48 83 c1 01 48 81 f9 a0 86 01 00 0f 84 48 02 00 00 <48> 8b 00 4c 39 f0 75 e7 48 c7 c2 80 a2 e2 82 48 c7 c6 79 ef e3 82
[   67.538259] RSP: 0018:ffffc90001c4bd10 EFLAGS: 00010003
[   67.538259] RAX: 0000000000000000 RBX: ffff88801105f638 RCX: 0000000000000001
[   67.538259] RDX: 0000000000000001 RSI: 000000000000068b RDI: ffff8880163dc68b
[   67.538259] RBP: ffffc90001c4bd90 R08: 0000000000000001 R09: ffff8880163dc67e
[   67.538259] R10: 656c6261766f6d6e R11: 6c6261766f6d6e55 R12: ffff88807ffb4a00
[   67.538259] R13: ffff88807ffb49f8 R14: ffff88807ffb4580 R15: ffff88807ffb3000
[   67.538259] FS:  00007f9c83eff5c0(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
[   67.538259] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   67.538259] CR2: 0000000000000000 CR3: 0000000013c8e000 CR4: 0000000000350ef0
[   67.538259] Call Trace:
[   67.538259]  <TASK>
[   67.538259]  seq_read_iter+0x128/0x460
[   67.538259]  ? aa_file_perm+0x1af/0x5f0
[   67.538259]  proc_reg_read_iter+0x51/0x80
[   67.538259]  ? lock_is_held_type+0xea/0x140
[   67.538259]  new_sync_read+0x113/0x1a0
[   67.538259]  vfs_read+0x136/0x1d0
[   67.538259]  ksys_read+0x70/0xf0
[   67.538259]  __x64_sys_read+0x1a/0x20
[   67.538259]  do_syscall_64+0x3b/0xc0
[   67.538259]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   67.538259] RIP: 0033:0x7f9c83e23cce
[   67.538259] Code: c0 e9 b6 fe ff ff 50 48 8d 3d 6e 13 0a 00 e8 c9 e3 01 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
[   67.538259] RSP: 002b:00007fff116e1a08 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[   67.538259] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f9c83e23cce
[   67.538259] RDX: 0000000000020000 RSI: 00007f9c83a2c000 RDI: 0000000000000003
[   67.538259] RBP: 00007f9c83a2c000 R08: 00007f9c83a2b010 R09: 0000000000000000
[   67.538259] R10: 00007f9c83f2d7d0 R11: 0000000000000246 R12: 0000000000000000
[   67.538259] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
[   67.538259]  </TASK>
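
The mismatch behind this trace can be modelled with a toy structure. This is a simplified sketch, not the kernel's struct zone: the toy_* names and the single free_list pointer are hypothetical stand-ins. The init path skips zones with no span, while the populated check looks only at present pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the few zone fields involved; not the kernel layout. */
struct toy_zone {
	unsigned long spanned_pages;	/* pages covered, holes included */
	unsigned long present_pages;	/* pages actually present */
	int *free_list;			/* NULL until "initialised" */
};

static int toy_list_storage;

/* Mirrors the init path described above: zero-sized zones are skipped. */
static void toy_init_zone(struct toy_zone *z)
{
	assert(z != NULL);
	if (z->spanned_pages == 0)
		return;			/* free_list stays NULL */
	z->free_list = &toy_list_storage;
}

/* Mirrors the populated check: only present_pages is consulted. */
static bool toy_populated(const struct toy_zone *z)
{
	return z->present_pages != 0;
}

/* True when a walker trusting toy_populated() would touch an unset list. */
static bool toy_walk_would_crash(const struct toy_zone *z)
{
	return toy_populated(z) && z->free_list == NULL;
}
```

A zone with spanned_pages == 0 but present_pages != 0 passes the populated check while its list was never set up, which is the inconsistent state the crash stems from.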

Fix this by checking that the aligned zone_movable_pfn[] does not exceed
the end of the node, and if it does skip creating a movable zone on this
node.
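
The check can be sketched outside the kernel as a pure function. sketch_fixup_movable_pfn() is a hypothetical name, and the alignment is passed in as a parameter rather than taken from MAX_ORDER_NR_PAGES. A return value of 0 stands for "no movable zone on this node", matching what the patch stores in zone_movable_pfn[].

```c
#include <assert.h>

/*
 * Align the candidate ZONE_MOVABLE start and drop the zone (return 0)
 * if the aligned start lands at or beyond the node's exclusive end pfn.
 */
static unsigned long sketch_fixup_movable_pfn(unsigned long movable_pfn,
					      unsigned long node_end_pfn,
					      unsigned long align)
{
	unsigned long aligned;

	assert(align != 0);
	aligned = ((movable_pfn + align - 1) / align) * align;

	if (aligned >= node_end_pfn)
		return 0;	/* no room left: skip ZONE_MOVABLE here */

	return aligned;
}
```

For inputs that already fit inside the node the function only aligns, so behaviour changes only in the case that previously produced an out-of-range pfn.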

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Fixes: 2a1e274acf0b ("Create the ZONE_MOVABLE zone")
---
 mm/page_alloc.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Comments

Anshuman Khandual Feb. 15, 2022, 4:47 a.m. UTC | #1
Hi Alistair,

On 2/15/22 8:28 AM, Alistair Popple wrote:
> ZONE_MOVABLE uses the remaining memory in each node. Its starting pfn
> is also aligned to MAX_ORDER_NR_PAGES. It is possible for the remaining
> memory in a node to be less than MAX_ORDER_NR_PAGES, meaning there is
> not enough room for ZONE_MOVABLE on that node.

How plausible is this scenario on normal systems ? Should not the node
always contain MAX_ORDER_NR_PAGES aligned pages ? Also all zones which
get created from that node should also be MAX_ORDER_NR_PAGES aligned ?
I am just curious how a node could end up being like this.

- Anshuman

Alistair Popple Feb. 15, 2022, 5:16 a.m. UTC | #2
Anshuman Khandual <anshuman.khandual@arm.com> writes:

> Hi Alistair,
>
> On 2/15/22 8:28 AM, Alistair Popple wrote:
>> ZONE_MOVABLE uses the remaining memory in each node. Its starting pfn
>> is also aligned to MAX_ORDER_NR_PAGES. It is possible for the remaining
>> memory in a node to be less than MAX_ORDER_NR_PAGES, meaning there is
>> not enough room for ZONE_MOVABLE on that node.
>
> How plausible is this scenario on normal systems ?

Probably not very. I happened to run into this on my development/test x86 VM
which has 8GB and was booted with `numa=fake=4 kernelcore=60%` but in theory I
guess any system that has a node with less than MAX_ORDER_NR_PAGES left over for
ZONE_MOVABLE may be susceptible.

This was the RAM map:

[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000007ffddfff] usable
[    0.000000] BIOS-e820: [mem 0x000000007ffde000-0x000000007fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000b0000000-0x00000000bfffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fed1c000-0x00000000fed1ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000027fffffff] usable

[...]

[    0.065897] Early memory node ranges
[    0.065898]   node   0: [mem 0x0000000000001000-0x000000000009efff]
[    0.065900]   node   0: [mem 0x0000000000100000-0x000000007ffddfff]
[    0.065902]   node   1: [mem 0x0000000100000000-0x000000017fffffff]
[    0.065904]   node   2: [mem 0x0000000180000000-0x00000001ffffffff]
[    0.065906]   node   3: [mem 0x0000000200000000-0x000000027fffffff]

Note the reserved range from 0x000000007ffde000 to 0x000000007fffffff resulting
in node-0 ending at 0x000000007ffddfff.
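
To make the arithmetic concrete, here is a small sketch that derives node 0's end pfn from the map above and the size of its final partial block. It assumes 4 KiB pages (SKETCH_PAGE_SHIFT = 12) and a 1024-page alignment. Both values and the helper names are assumptions for illustration only.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12	/* assumed 4 KiB pages */

/* Convert the last byte address of a range to its exclusive end pfn. */
static unsigned long end_pfn_of(unsigned long long last_byte)
{
	return (unsigned long)((last_byte + 1) >> SKETCH_PAGE_SHIFT);
}

/* Pages between the last aligned boundary at or below end_pfn and end_pfn. */
static unsigned long tail_pages(unsigned long end_pfn, unsigned long align)
{
	assert(align != 0);
	return end_pfn % align;
}
```

Under those assumptions node 0 ends at pfn 0x7ffde, leaving a 990-page tail after the boundary at 0x7fc00. A ZONE_MOVABLE start that falls inside that tail is rounded up to 0x80000, past the end of the node. Node 1, which ends at 0x17fffffff, has no such tail.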

> Should not the node always contain MAX_ORDER_NR_PAGES aligned pages ? Also all
> zones which get created from that node should also be MAX_ORDER_NR_PAGES
> aligned ?

I'm not sure why that would be the case given page size and MAX_ORDER_NR_PAGES can
be set via a kernel configuration parameter. Obviously it wasn't the case here
or this situation would not arise. That said I don't know this code well, and
this was where I decided to stop shaving this yak so it's possible there is an
even deeper underlying issue.

Either way I don't *think* the fix should introduce any problems as it shouldn't
do anything unless you were going to hit this issue anyway (which took some time
to track down as the cause wasn't obvious).

> I am just curious how a node could end up being like this.

- Anshuman

Oscar Salvador Feb. 15, 2022, 6:15 a.m. UTC | #3
On Tue, Feb 15, 2022 at 10:17:09AM +0530, Anshuman Khandual wrote:
> Hi Alistair,
> 
> On 2/15/22 8:28 AM, Alistair Popple wrote:
> > ZONE_MOVABLE uses the remaining memory in each node. Its starting pfn
> > is also aligned to MAX_ORDER_NR_PAGES. It is possible for the remaining
> > memory in a node to be less than MAX_ORDER_NR_PAGES, meaning there is
> > not enough room for ZONE_MOVABLE on that node.

CC Mel as he wrote that back then.

I was curious about the commit that introduced that, and I found
[1] and [2].
I guess [2] was eventually dismissed in favor of [1] as a whole, but in
there the commit message said:

"This patch rounds the start of ZONE_MOVABLE in each node to a
MAX_ORDER_NR_PAGES boundary. If the rounding pushes the start of ZONE_MOVABLE
above the end of the node then the zone will contain no memory and will not
be used at runtime"

I might be missing something, but it just rounds up the value, but does
not check if it falls beyond node's boundaries.


[1] commit 2a1e274acf0b1c192face19a4be7c12d4503eaaf "Create the
ZONE_MOVABLE zone"
[2] https://marc.info/?l=linux-mm&m=117743777129526&w=2

Anshuman Khandual Feb. 16, 2022, 5:24 a.m. UTC | #4
On 2/15/22 10:46 AM, Alistair Popple wrote:
> Anshuman Khandual <anshuman.khandual@arm.com> writes:
> 
>> Hi Alistair,
>>
>> On 2/15/22 8:28 AM, Alistair Popple wrote:
>>> ZONE_MOVABLE uses the remaining memory in each node. Its starting pfn
>>> is also aligned to MAX_ORDER_NR_PAGES. It is possible for the remaining
>>> memory in a node to be less than MAX_ORDER_NR_PAGES, meaning there is
>>> not enough room for ZONE_MOVABLE on that node.
>>
>> How plausible is this scenario on normal systems ?
> 
> Probably not very. I happened to run into this on my development/test x86 VM
> which has 8GB and was booted with `numa=fake=4 kernelcore=60%` but in theory I
> guess any system that has a node with less than MAX_ORDER_NR_PAGES left over for
> ZONE_MOVABLE may be susceptible.
> 
> This was the RAM map:
> 
> [    0.000000] BIOS-provided physical RAM map:
> [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
> [    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000007ffddfff] usable
> [    0.000000] BIOS-e820: [mem 0x000000007ffde000-0x000000007fffffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000b0000000-0x00000000bfffffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000fed1c000-0x00000000fed1ffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000027fffffff] usable
> 
> [...]
> 
> [    0.065897] Early memory node ranges
> [    0.065898]   node   0: [mem 0x0000000000001000-0x000000000009efff]
> [    0.065900]   node   0: [mem 0x0000000000100000-0x000000007ffddfff]
> [    0.065902]   node   1: [mem 0x0000000100000000-0x000000017fffffff]
> [    0.065904]   node   2: [mem 0x0000000180000000-0x00000001ffffffff]
> [    0.065906]   node   3: [mem 0x0000000200000000-0x000000027fffffff]
> 
> Note the reserved range from 0x000000007ffde000 to 0x000000007fffffff resulting
> in node-0 ending at 0x000000007ffddfff.
> 
>> Should not the node always contain MAX_ORDER_NR_PAGES aligned pages ? Also all
>> zones which get created from that node should also be MAX_ORDER_NR_PAGES
>> aligned ?
> 
> I'm not sure why that would be the case given page size and MAX_ORDER_NR_PAGES can
> be set via a kernel configuration parameter. Obviously it wasn't the case here

I assumed that in general that would be the case.

> or this situation would not arise. That said I don't know this code well, and
> this was where I decided to stop shaving this yak so it's possible there is an
> even deeper underlying issue.
> 
> Either way I don't *think* the fix should introduce any problems as it shouldn't
> do anything unless you were going to hit this issue anyway (which took some time
> to track down as the cause wasn't obvious).

Fair enough.

> 
>> I am just curious how a node could end up being like this.
> 
> - Anshuman
>

David Hildenbrand Feb. 17, 2022, 7:48 a.m. UTC | #5
On 15.02.22 03:58, Alistair Popple wrote:
> ZONE_MOVABLE uses the remaining memory in each node. Its starting pfn
> is also aligned to MAX_ORDER_NR_PAGES. It is possible for the remaining
> memory in a node to be less than MAX_ORDER_NR_PAGES, meaning there is
> not enough room for ZONE_MOVABLE on that node.
> 
> Unfortunately this condition is not checked for. This leads to
> zone_movable_pfn[] getting set to a pfn greater than the last pfn in a
> node.
> 
> calculate_node_totalpages() then sets zone->present_pages to be greater
> than zone->spanned_pages which is invalid, as spanned_pages represents
> the maximum number of pages in a zone assuming no holes.
> 
> Subsequently it is possible free_area_init_core() will observe a zone of
> size zero with present pages. In this case it will skip setting up the
> zone, including the initialisation of free_lists[].
> 
> However populated_zone() checks zone->present_pages to see if a zone has
> memory available. This is used by iterators such as
> walk_zones_in_node(). pagetypeinfo_showfree() uses this to walk the
> free_list of each zone in each node, which are assumed to be initialised
> due to the zone not being empty. As free_area_init_core() never
> initialised the free_lists[] this results in the following kernel crash
> when trying to read /proc/pagetypeinfo:
> 
> [   67.534914] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [   67.535429] #PF: supervisor read access in kernel mode
> [   67.535789] #PF: error_code(0x0000) - not-present page
> [   67.536128] PGD 0 P4D 0
> [   67.536305] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC NOPTI
> [   67.536696] CPU: 0 PID: 456 Comm: cat Not tainted 5.16.0 #461
> [   67.537096] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
> [   67.537638] RIP: 0010:pagetypeinfo_show+0x163/0x460
> [   67.537992] Code: 9e 82 e8 80 57 0e 00 49 8b 06 b9 01 00 00 00 4c 39 f0 75 16 e9 65 02 00 00 48 83 c1 01 48 81 f9 a0 86 01 00 0f 84 48 02 00 00 <48> 8b 00 4c 39 f0 75 e7 48 c7 c2 80 a2 e2 82 48 c7 c6 79 ef e3 82
> [   67.538259] RSP: 0018:ffffc90001c4bd10 EFLAGS: 00010003
> [   67.538259] RAX: 0000000000000000 RBX: ffff88801105f638 RCX: 0000000000000001
> [   67.538259] RDX: 0000000000000001 RSI: 000000000000068b RDI: ffff8880163dc68b
> [   67.538259] RBP: ffffc90001c4bd90 R08: 0000000000000001 R09: ffff8880163dc67e
> [   67.538259] R10: 656c6261766f6d6e R11: 6c6261766f6d6e55 R12: ffff88807ffb4a00
> [   67.538259] R13: ffff88807ffb49f8 R14: ffff88807ffb4580 R15: ffff88807ffb3000
> [   67.538259] FS:  00007f9c83eff5c0(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
> [   67.538259] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   67.538259] CR2: 0000000000000000 CR3: 0000000013c8e000 CR4: 0000000000350ef0
> [   67.538259] Call Trace:
> [   67.538259]  <TASK>
> [   67.538259]  seq_read_iter+0x128/0x460
> [   67.538259]  ? aa_file_perm+0x1af/0x5f0
> [   67.538259]  proc_reg_read_iter+0x51/0x80
> [   67.538259]  ? lock_is_held_type+0xea/0x140
> [   67.538259]  new_sync_read+0x113/0x1a0
> [   67.538259]  vfs_read+0x136/0x1d0
> [   67.538259]  ksys_read+0x70/0xf0
> [   67.538259]  __x64_sys_read+0x1a/0x20
> [   67.538259]  do_syscall_64+0x3b/0xc0
> [   67.538259]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [   67.538259] RIP: 0033:0x7f9c83e23cce
> [   67.538259] Code: c0 e9 b6 fe ff ff 50 48 8d 3d 6e 13 0a 00 e8 c9 e3 01 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
> [   67.538259] RSP: 002b:00007fff116e1a08 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [   67.538259] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f9c83e23cce
> [   67.538259] RDX: 0000000000020000 RSI: 00007f9c83a2c000 RDI: 0000000000000003
> [   67.538259] RBP: 00007f9c83a2c000 R08: 00007f9c83a2b010 R09: 0000000000000000
> [   67.538259] R10: 00007f9c83f2d7d0 R11: 0000000000000246 R12: 0000000000000000
> [   67.538259] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
> [   67.538259]  </TASK>
> 
> Fix this by checking that the aligned zone_movable_pfn[] does not exceed
> the end of the node, and if it does skip creating a movable zone on this
> node.
> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Fixes: 2a1e274acf0b ("Create the ZONE_MOVABLE zone")
> ---
>  mm/page_alloc.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3589febc6d31..a1fbf656e7db 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7972,10 +7972,17 @@ static void __init find_zone_movable_pfns_for_nodes(void)
>  
>  out2:
>  	/* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
> -	for (nid = 0; nid < MAX_NUMNODES; nid++)
> +	for (nid = 0; nid < MAX_NUMNODES; nid++) {
> +		unsigned long start_pfn, end_pfn;
> +
>  		zone_movable_pfn[nid] =
>  			roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
>  
> +		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> +		if (zone_movable_pfn[nid] >= end_pfn)
> +			zone_movable_pfn[nid] = 0;
> +	}
> +
>  out:
>  	/* restore the node_state */
>  	node_states[N_MEMORY] = saved_node_state;


Sounds plausible to me.

Acked-by: David Hildenbrand <david@redhat.com>

Mel Gorman Feb. 21, 2022, 11:20 a.m. UTC | #6
On Tue, Feb 15, 2022 at 01:58:31PM +1100, Alistair Popple wrote:
> ZONE_MOVABLE uses the remaining memory in each node. Its starting pfn
> is also aligned to MAX_ORDER_NR_PAGES. It is possible for the remaining
> memory in a node to be less than MAX_ORDER_NR_PAGES, meaning there is
> not enough room for ZONE_MOVABLE on that node.
> 
> Unfortunately this condition is not checked for. This leads to
> zone_movable_pfn[] getting set to a pfn greater than the last pfn in a
> node.
> 
> calculate_node_totalpages() then sets zone->present_pages to be greater
> than zone->spanned_pages which is invalid, as spanned_pages represents
> the maximum number of pages in a zone assuming no holes.
> 
> Subsequently it is possible free_area_init_core() will observe a zone of
> size zero with present pages. In this case it will skip setting up the
> zone, including the initialisation of free_lists[].
> 
> However populated_zone() checks zone->present_pages to see if a zone has
> memory available. This is used by iterators such as
> walk_zones_in_node(). pagetypeinfo_showfree() uses this to walk the
> free_list of each zone in each node, which are assumed to be initialised
> due to the zone not being empty. As free_area_init_core() never
> initialised the free_lists[] this results in the following kernel crash
> when trying to read /proc/pagetypeinfo:
> 
> [   67.534914] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [   67.535429] #PF: supervisor read access in kernel mode
> [   67.535789] #PF: error_code(0x0000) - not-present page
> [   67.536128] PGD 0 P4D 0
> [   67.536305] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC NOPTI
> [   67.536696] CPU: 0 PID: 456 Comm: cat Not tainted 5.16.0 #461
> [   67.537096] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
> [   67.537638] RIP: 0010:pagetypeinfo_show+0x163/0x460
> [   67.537992] Code: 9e 82 e8 80 57 0e 00 49 8b 06 b9 01 00 00 00 4c 39 f0 75 16 e9 65 02 00 00 48 83 c1 01 48 81 f9 a0 86 01 00 0f 84 48 02 00 00 <48> 8b 00 4c 39 f0 75 e7 48 c7 c2 80 a2 e2 82 48 c7 c6 79 ef e3 82
> [   67.538259] RSP: 0018:ffffc90001c4bd10 EFLAGS: 00010003
> [   67.538259] RAX: 0000000000000000 RBX: ffff88801105f638 RCX: 0000000000000001
> [   67.538259] RDX: 0000000000000001 RSI: 000000000000068b RDI: ffff8880163dc68b
> [   67.538259] RBP: ffffc90001c4bd90 R08: 0000000000000001 R09: ffff8880163dc67e
> [   67.538259] R10: 656c6261766f6d6e R11: 6c6261766f6d6e55 R12: ffff88807ffb4a00
> [   67.538259] R13: ffff88807ffb49f8 R14: ffff88807ffb4580 R15: ffff88807ffb3000
> [   67.538259] FS:  00007f9c83eff5c0(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
> [   67.538259] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   67.538259] CR2: 0000000000000000 CR3: 0000000013c8e000 CR4: 0000000000350ef0
> [   67.538259] Call Trace:
> [   67.538259]  <TASK>
> [   67.538259]  seq_read_iter+0x128/0x460
> [   67.538259]  ? aa_file_perm+0x1af/0x5f0
> [   67.538259]  proc_reg_read_iter+0x51/0x80
> [   67.538259]  ? lock_is_held_type+0xea/0x140
> [   67.538259]  new_sync_read+0x113/0x1a0
> [   67.538259]  vfs_read+0x136/0x1d0
> [   67.538259]  ksys_read+0x70/0xf0
> [   67.538259]  __x64_sys_read+0x1a/0x20
> [   67.538259]  do_syscall_64+0x3b/0xc0
> [   67.538259]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [   67.538259] RIP: 0033:0x7f9c83e23cce
> [   67.538259] Code: c0 e9 b6 fe ff ff 50 48 8d 3d 6e 13 0a 00 e8 c9 e3 01 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
> [   67.538259] RSP: 002b:00007fff116e1a08 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [   67.538259] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f9c83e23cce
> [   67.538259] RDX: 0000000000020000 RSI: 00007f9c83a2c000 RDI: 0000000000000003
> [   67.538259] RBP: 00007f9c83a2c000 R08: 00007f9c83a2b010 R09: 0000000000000000
> [   67.538259] R10: 00007f9c83f2d7d0 R11: 0000000000000246 R12: 0000000000000000
> [   67.538259] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
> [   67.538259]  </TASK>
> 
> Fix this by checking that the aligned zone_movable_pfn[] does not exceed
> the end of the node, and if it does skip creating a movable zone on this
> node.
> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Fixes: 2a1e274acf0b ("Create the ZONE_MOVABLE zone")

Seems reasonable;

Acked-by: Mel Gorman <mgorman@techsingularity.net>

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3589febc6d31..a1fbf656e7db 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7972,10 +7972,17 @@  static void __init find_zone_movable_pfns_for_nodes(void)
 
 out2:
 	/* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
-	for (nid = 0; nid < MAX_NUMNODES; nid++)
+	for (nid = 0; nid < MAX_NUMNODES; nid++) {
+		unsigned long start_pfn, end_pfn;
+
 		zone_movable_pfn[nid] =
 			roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
 
+		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
+		if (zone_movable_pfn[nid] >= end_pfn)
+			zone_movable_pfn[nid] = 0;
+	}
+
 out:
 	/* restore the node_state */
 	node_states[N_MEMORY] = saved_node_state;