mm: fix panic in __alloc_pages

Message ID	20211101201312.11589-1-amakhalov@vmware.com (mailing list archive)
State	New
Headers	show Return-Path: <SRS0=vc3H=PU=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 1524360D07 From: Alexey Makhalov <amakhalov@vmware.com> To: linux-mm@kvack.org Cc: Alexey Makhalov <amakhalov@vmware.com>, Andrew Morton <akpm@linux-foundation.org>, linux-kernel@vger.kernel.org, stable@vger.kernel.org Subject: [PATCH] mm: fix panic in __alloc_pages Date: Mon, 1 Nov 2021 13:13:12 -0700 Message-Id: <20211101201312.11589-1-amakhalov@vmware.com> Content-Type: text/plain MIME-Version: 1.0 Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	mm: fix panic in __alloc_pages \| expand mm: fix panic in __alloc_pages

Alexey Makhalov Nov. 1, 2021, 8:13 p.m. UTC

There is a kernel panic caused by __alloc_pages() accessing
uninitialized NODE_DATA(nid). Uninitialized node data exists
during the time when CPU with memoryless node was added but
not onlined yet. Panic can be easy reproduced by disabling
udev rule for automatic onlining hot added CPU followed by
CPU with memoryless node hot add.

This is a panic caused by percpu code doing allocations for
all possible CPUs and hitting this issue:

 CPU2 has been hot-added
 BUG: unable to handle page fault for address: 0000000000001608
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] SMP PTI
 CPU: 0 PID: 1 Comm: systemd Tainted: G            E     5.15.0-rc7+ #11
 Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW

 RIP: 0010:__alloc_pages+0x127/0x290
 Code: 4c 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 e0 48 8b 55 b8 c1 e8 0c 83 e0 01 88 45 d0 4c 89 c8 48 85 d2 0f 85 1a 01 00 00 <45> 3b 41 08 0f 82 10 01 00 00 48 89 45 c0 48 8b 00 44 89 e2 81 e2
 RSP: 0018:ffffc900006f3bc8 EFLAGS: 00010246
 RAX: 0000000000001600 RBX: 0000000000000000 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000cc2
 RBP: ffffc900006f3c18 R08: 0000000000000001 R09: 0000000000001600
 R10: ffffc900006f3a40 R11: ffff88813c9fffe8 R12: 0000000000000cc2
 R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000cc2
 FS:  00007f27ead70500(0000) GS:ffff88807ce00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000001608 CR3: 000000000582c003 CR4: 00000000001706b0
 Call Trace:
  pcpu_alloc_pages.constprop.0+0xe4/0x1c0
  pcpu_populate_chunk+0x33/0xb0
  pcpu_alloc+0x4d3/0x6f0
  __alloc_percpu_gfp+0xd/0x10
  alloc_mem_cgroup_per_node_info+0x54/0xb0
  mem_cgroup_alloc+0xed/0x2f0
  mem_cgroup_css_alloc+0x33/0x2f0
  css_create+0x3a/0x1f0
  cgroup_apply_control_enable+0x12b/0x150
  cgroup_mkdir+0xdd/0x110
  kernfs_iop_mkdir+0x4f/0x80
  vfs_mkdir+0x178/0x230
  do_mkdirat+0xfd/0x120
  __x64_sys_mkdir+0x47/0x70
  ? syscall_exit_to_user_mode+0x21/0x50
  do_syscall_64+0x43/0x90
  entry_SYSCALL_64_after_hwframe+0x44/0xae

Node can be in one of the following states:
1. not present (nid == NUMA_NO_NODE)
2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
				NODE_DATA(nid) == NULL)
3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
				NODE_DATA(nid) != NULL)

alloc_page_{bulk_array}node() functions verify for nid validity only
and do not check if nid is online. Enhanced verification check allows
to handle page allocation when node is in 2nd state.

Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
---
 include/linux/gfp.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Matthew Wilcox Nov. 1, 2021, 8:38 p.m. UTC | #1

On Mon, Nov 01, 2021 at 01:13:12PM -0700, Alexey Makhalov wrote:
> +++ b/include/linux/gfp.h
> @@ -551,7 +551,8 @@ alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages, struct page **page_arr
>  static inline unsigned long
>  alloc_pages_bulk_array_node(gfp_t gfp, int nid, unsigned long nr_pages, struct page **page_array)
>  {
> -	if (nid == NUMA_NO_NODE)
> +	if (nid == NUMA_NO_NODE || (!node_online(nid) &&
> +					!(gfp & __GFP_THISNODE)))
>  		nid = numa_mem_id();
>  
>  	return __alloc_pages_bulk(gfp, nid, NULL, nr_pages, NULL, page_array);

I don't think it's a great idea to push node_online() and the gfp check
into the caller.  Can't we put this check in __alloc_pages_bulk() instead?

Michal Hocko Nov. 2, 2021, 7:47 a.m. UTC | #2

[CC Oscar and David]

On Mon 01-11-21 13:13:12, Alexey Makhalov wrote:
> There is a kernel panic caused by __alloc_pages() accessing
> uninitialized NODE_DATA(nid). Uninitialized node data exists
> during the time when CPU with memoryless node was added but
> not onlined yet. Panic can be easy reproduced by disabling
> udev rule for automatic onlining hot added CPU followed by
> CPU with memoryless node hot add.
> 
> This is a panic caused by percpu code doing allocations for
> all possible CPUs and hitting this issue:
> 
>  CPU2 has been hot-added
>  BUG: unable to handle page fault for address: 0000000000001608
>  #PF: supervisor read access in kernel mode
>  #PF: error_code(0x0000) - not-present page
>  PGD 0 P4D 0
>  Oops: 0000 [#1] SMP PTI
>  CPU: 0 PID: 1 Comm: systemd Tainted: G            E     5.15.0-rc7+ #11
>  Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW
> 
>  RIP: 0010:__alloc_pages+0x127/0x290

Could you resolve this into a specific line of the source code please?

>  Code: 4c 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 e0 48 8b 55 b8 c1 e8 0c 83 e0 01 88 45 d0 4c 89 c8 48 85 d2 0f 85 1a 01 00 00 <45> 3b 41 08 0f 82 10 01 00 00 48 89 45 c0 48 8b 00 44 89 e2 81 e2
>  RSP: 0018:ffffc900006f3bc8 EFLAGS: 00010246
>  RAX: 0000000000001600 RBX: 0000000000000000 RCX: 0000000000000000
>  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000cc2
>  RBP: ffffc900006f3c18 R08: 0000000000000001 R09: 0000000000001600
>  R10: ffffc900006f3a40 R11: ffff88813c9fffe8 R12: 0000000000000cc2
>  R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000cc2
>  FS:  00007f27ead70500(0000) GS:ffff88807ce00000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 0000000000001608 CR3: 000000000582c003 CR4: 00000000001706b0
>  Call Trace:
>   pcpu_alloc_pages.constprop.0+0xe4/0x1c0
>   pcpu_populate_chunk+0x33/0xb0
>   pcpu_alloc+0x4d3/0x6f0
>   __alloc_percpu_gfp+0xd/0x10
>   alloc_mem_cgroup_per_node_info+0x54/0xb0
>   mem_cgroup_alloc+0xed/0x2f0
>   mem_cgroup_css_alloc+0x33/0x2f0
>   css_create+0x3a/0x1f0
>   cgroup_apply_control_enable+0x12b/0x150
>   cgroup_mkdir+0xdd/0x110
>   kernfs_iop_mkdir+0x4f/0x80
>   vfs_mkdir+0x178/0x230
>   do_mkdirat+0xfd/0x120
>   __x64_sys_mkdir+0x47/0x70
>   ? syscall_exit_to_user_mode+0x21/0x50
>   do_syscall_64+0x43/0x90
>   entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> Node can be in one of the following states:
> 1. not present (nid == NUMA_NO_NODE)
> 2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
> 				NODE_DATA(nid) == NULL)
> 3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
> 				NODE_DATA(nid) != NULL)
> 
> alloc_page_{bulk_array}node() functions verify for nid validity only
> and do not check if nid is online. Enhanced verification check allows
> to handle page allocation when node is in 2nd state.

I do not think this is a correct approach. We should make sure that the
proper fallback node is used instead. This means that the zone list is
initialized properly. IIRC this has been a problem in the past and it
has been fixed. The initialization code is quite subtle though so it is
possible that this got broken again.

> Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: stable@vger.kernel.org
> ---
>  include/linux/gfp.h | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 55b2ec1f9..34a5a7def 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -551,7 +551,8 @@ alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages, struct page **page_arr
>  static inline unsigned long
>  alloc_pages_bulk_array_node(gfp_t gfp, int nid, unsigned long nr_pages, struct page **page_array)
>  {
> -	if (nid == NUMA_NO_NODE)
> +	if (nid == NUMA_NO_NODE || (!node_online(nid) &&
> +					!(gfp & __GFP_THISNODE)))
>  		nid = numa_mem_id();
>  
>  	return __alloc_pages_bulk(gfp, nid, NULL, nr_pages, NULL, page_array);
> @@ -578,7 +579,8 @@ __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
>  static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
>  						unsigned int order)
>  {
> -	if (nid == NUMA_NO_NODE)
> +	if (nid == NUMA_NO_NODE || (!node_online(nid) &&
> +					!(gfp_mask & __GFP_THISNODE)))
>  		nid = numa_mem_id();
>  
>  	return __alloc_pages_node(nid, gfp_mask, order);
> -- 
> 2.30.0

David Hildenbrand Nov. 2, 2021, 8:12 a.m. UTC | #3

On 02.11.21 08:47, Michal Hocko wrote:
> [CC Oscar and David]
> 
> On Mon 01-11-21 13:13:12, Alexey Makhalov wrote:
>> There is a kernel panic caused by __alloc_pages() accessing
>> uninitialized NODE_DATA(nid). Uninitialized node data exists
>> during the time when CPU with memoryless node was added but
>> not onlined yet. Panic can be easy reproduced by disabling
>> udev rule for automatic onlining hot added CPU followed by
>> CPU with memoryless node hot add.
>>
>> This is a panic caused by percpu code doing allocations for
>> all possible CPUs and hitting this issue:
>>
>>  CPU2 has been hot-added
>>  BUG: unable to handle page fault for address: 0000000000001608
>>  #PF: supervisor read access in kernel mode
>>  #PF: error_code(0x0000) - not-present page
>>  PGD 0 P4D 0
>>  Oops: 0000 [#1] SMP PTI
>>  CPU: 0 PID: 1 Comm: systemd Tainted: G            E     5.15.0-rc7+ #11
>>  Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW
>>
>>  RIP: 0010:__alloc_pages+0x127/0x290
> 
> Could you resolve this into a specific line of the source code please?
> 
>>  Code: 4c 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 e0 48 8b 55 b8 c1 e8 0c 83 e0 01 88 45 d0 4c 89 c8 48 85 d2 0f 85 1a 01 00 00 <45> 3b 41 08 0f 82 10 01 00 00 48 89 45 c0 48 8b 00 44 89 e2 81 e2
>>  RSP: 0018:ffffc900006f3bc8 EFLAGS: 00010246
>>  RAX: 0000000000001600 RBX: 0000000000000000 RCX: 0000000000000000
>>  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000cc2
>>  RBP: ffffc900006f3c18 R08: 0000000000000001 R09: 0000000000001600
>>  R10: ffffc900006f3a40 R11: ffff88813c9fffe8 R12: 0000000000000cc2
>>  R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000cc2
>>  FS:  00007f27ead70500(0000) GS:ffff88807ce00000(0000) knlGS:0000000000000000
>>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>  CR2: 0000000000001608 CR3: 000000000582c003 CR4: 00000000001706b0
>>  Call Trace:
>>   pcpu_alloc_pages.constprop.0+0xe4/0x1c0
>>   pcpu_populate_chunk+0x33/0xb0
>>   pcpu_alloc+0x4d3/0x6f0
>>   __alloc_percpu_gfp+0xd/0x10
>>   alloc_mem_cgroup_per_node_info+0x54/0xb0
>>   mem_cgroup_alloc+0xed/0x2f0
>>   mem_cgroup_css_alloc+0x33/0x2f0
>>   css_create+0x3a/0x1f0
>>   cgroup_apply_control_enable+0x12b/0x150
>>   cgroup_mkdir+0xdd/0x110
>>   kernfs_iop_mkdir+0x4f/0x80
>>   vfs_mkdir+0x178/0x230
>>   do_mkdirat+0xfd/0x120
>>   __x64_sys_mkdir+0x47/0x70
>>   ? syscall_exit_to_user_mode+0x21/0x50
>>   do_syscall_64+0x43/0x90
>>   entry_SYSCALL_64_after_hwframe+0x44/0xae
>>
>> Node can be in one of the following states:
>> 1. not present (nid == NUMA_NO_NODE)
>> 2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
>> 				NODE_DATA(nid) == NULL)
>> 3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
>> 				NODE_DATA(nid) != NULL)
>>
>> alloc_page_{bulk_array}node() functions verify for nid validity only
>> and do not check if nid is online. Enhanced verification check allows
>> to handle page allocation when node is in 2nd state.
> 
> I do not think this is a correct approach. We should make sure that the
> proper fallback node is used instead. This means that the zone list is
> initialized properly. IIRC this has been a problem in the past and it
> has been fixed. The initialization code is quite subtle though so it is
> possible that this got broken again.

I'm a little confused:

In add_memory_resource() we hotplug the new node if required and set it
online. Memory might get onlined later, via online_pages().

So after add_memory_resource()->__try_online_node() succeeded, we have
an online pgdat -- essentially 3.

This patch detects if we're past 3. but says that it reproduced by
disabling *memory* onlining.

Before we online memory for a hotplugged node, all zones are !populated.
So once we online memory for a !populated zone in online_pages(), we
trigger setup_zone_pageset().


The confusing part is that this patch checks for 3. but says it can be
reproduced by not onlining *memory*. There seems to be something missing.

Do we maybe need a proper populated_zone() check before accessing zone data?

Alexey Makhalov Nov. 2, 2021, 8:48 a.m. UTC | #4

On 11/2/21, 1:12 AM, "David Hildenbrand" <david@redhat.com> wrote:

Thanks for reviews,

    On 02.11.21 08:47, Michal Hocko wrote:
    > [CC Oscar and David]
    > 
    > On Mon 01-11-21 13:13:12, Alexey Makhalov wrote:
    >> There is a kernel panic caused by __alloc_pages() accessing
    >> uninitialized NODE_DATA(nid). Uninitialized node data exists
    >> during the time when CPU with memoryless node was added but
    >> not onlined yet. Panic can be easy reproduced by disabling
    >> udev rule for automatic onlining hot added CPU followed by
    >> CPU with memoryless node hot add.
    >>
    >> This is a panic caused by percpu code doing allocations for
    >> all possible CPUs and hitting this issue:
    >>
    >>  CPU2 has been hot-added
    >>  BUG: unable to handle page fault for address: 0000000000001608
    >>  #PF: supervisor read access in kernel mode
    >>  #PF: error_code(0x0000) - not-present page
    >>  PGD 0 P4D 0
    >>  Oops: 0000 [#1] SMP PTI
    >>  CPU: 0 PID: 1 Comm: systemd Tainted: G            E     5.15.0-rc7+ #11
    >>  Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW
    >>
    >>  RIP: 0010:__alloc_pages+0x127/0x290
    > 
    > Could you resolve this into a specific line of the source code please?
    > 
    >>  Code: 4c 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 e0 48 8b 55 b8 c1 e8 0c 83 e0 01 88 45 d0 4c 89 c8 48 85 d2 0f 85 1a 01 00 00 <45> 3b 41 08 0f 82 10 01 00 00 48 89 45 c0 48 8b 00 44 89 e2 81 e2
    >>  RSP: 0018:ffffc900006f3bc8 EFLAGS: 00010246
    >>  RAX: 0000000000001600 RBX: 0000000000000000 RCX: 0000000000000000
    >>  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000cc2
    >>  RBP: ffffc900006f3c18 R08: 0000000000000001 R09: 0000000000001600
    >>  R10: ffffc900006f3a40 R11: ffff88813c9fffe8 R12: 0000000000000cc2
    >>  R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000cc2
    >>  FS:  00007f27ead70500(0000) GS:ffff88807ce00000(0000) knlGS:0000000000000000
    >>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    >>  CR2: 0000000000001608 CR3: 000000000582c003 CR4: 00000000001706b0
    >>  Call Trace:
    >>   pcpu_alloc_pages.constprop.0+0xe4/0x1c0
    >>   pcpu_populate_chunk+0x33/0xb0
    >>   pcpu_alloc+0x4d3/0x6f0
    >>   __alloc_percpu_gfp+0xd/0x10
    >>   alloc_mem_cgroup_per_node_info+0x54/0xb0
    >>   mem_cgroup_alloc+0xed/0x2f0
    >>   mem_cgroup_css_alloc+0x33/0x2f0
    >>   css_create+0x3a/0x1f0
    >>   cgroup_apply_control_enable+0x12b/0x150
    >>   cgroup_mkdir+0xdd/0x110
    >>   kernfs_iop_mkdir+0x4f/0x80
    >>   vfs_mkdir+0x178/0x230
    >>   do_mkdirat+0xfd/0x120
    >>   __x64_sys_mkdir+0x47/0x70
    >>   ? syscall_exit_to_user_mode+0x21/0x50
    >>   do_syscall_64+0x43/0x90
    >>   entry_SYSCALL_64_after_hwframe+0x44/0xae
    >>
    >> Node can be in one of the following states:
    >> 1. not present (nid == NUMA_NO_NODE)
    >> 2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
    >> 				NODE_DATA(nid) == NULL)
    >> 3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
    >> 				NODE_DATA(nid) != NULL)
    >>
    >> alloc_page_{bulk_array}node() functions verify for nid validity only
    >> and do not check if nid is online. Enhanced verification check allows
    >> to handle page allocation when node is in 2nd state.
    > 
    > I do not think this is a correct approach. We should make sure that the
    > proper fallback node is used instead. This means that the zone list is
    > initialized properly. IIRC this has been a problem in the past and it
    > has been fixed. The initialization code is quite subtle though so it is
    > possible that this got broken again.
This approach behaves in the same way as CPU was not yet added. (state #1).
So, we can think of state #2 as state #1 when CPU is not present.

    I'm a little confused:

    In add_memory_resource() we hotplug the new node if required and set it
    online. Memory might get onlined later, via online_pages().
You are correct. In case of memory hot add, it is true. But in case of adding
CPU with memoryless node, try_node_online() will be called only during CPU
onlining, see cpu_up().

Is there any reason why try_online_node() resides in cpu_up() and not in add_cpu()?
I think it would be correct to online node during the CPU hot add to align with
memory hot add.

    So after add_memory_resource()->__try_online_node() succeeded, we have
    an online pgdat -- essentially 3.

    This patch detects if we're past 3. but says that it reproduced by
    disabling *memory* onlining.
This is the hot adding of both new CPU and new _memoryless_ node (with CPU only)
And onlining CPU makes its node online. Disabling CPU onlining puts new node
into state #2, which leads to repro.    

    Before we online memory for a hotplugged node, all zones are !populated.
    So once we online memory for a !populated zone in online_pages(), we
    trigger setup_zone_pageset().


    The confusing part is that this patch checks for 3. but says it can be
    reproduced by not onlining *memory*. There seems to be something missing.

    Do we maybe need a proper populated_zone() check before accessing zone data?

Thanks,
--Alexey

Michal Hocko Nov. 2, 2021, 9:04 a.m. UTC | #5

It is hard to follow your reply as your email client is not quoting
properly. Let me try to reconstruct

On Tue 02-11-21 08:48:27, Alexey Makhalov wrote:
> On 02.11.21 08:47, Michal Hocko wrote:
[...]
>>>>  CPU2 has been hot-added
>>>>  BUG: unable to handle page fault for address: 0000000000001608
>>>>  #PF: supervisor read access in kernel mode
>>>>  #PF: error_code(0x0000) - not-present page
>>>>  PGD 0 P4D 0
>>>>  Oops: 0000 [#1] SMP PTI
>>>>  CPU: 0 PID: 1 Comm: systemd Tainted: G            E     5.15.0-rc7+ #11
>>>>  Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW
>>>>
>>>>  RIP: 0010:__alloc_pages+0x127/0x290
>>> 
>>> Could you resolve this into a specific line of the source code please?

This got probably unnoticed. I would be really curious whether this is
a broken zonelist or something else.
 
>>>> Node can be in one of the following states:
>>>> 1. not present (nid == NUMA_NO_NODE)
>>>> 2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
>>>> 				NODE_DATA(nid) == NULL)
>>>> 3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
>>>> 				NODE_DATA(nid) != NULL)
>>>>
>>>> alloc_page_{bulk_array}node() functions verify for nid validity only
>>>> and do not check if nid is online. Enhanced verification check allows
>>>> to handle page allocation when node is in 2nd state.
>>> 
>>> I do not think this is a correct approach. We should make sure that the
>>> proper fallback node is used instead. This means that the zone list is
>>> initialized properly. IIRC this has been a problem in the past and it
>>> has been fixed. The initialization code is quite subtle though so it is
>>> possible that this got broken again.

> This approach behaves in the same way as CPU was not yet added. (state #1).
> So, we can think of state #2 as state #1 when CPU is not present.

>> I'm a little confused:
>> 
>> In add_memory_resource() we hotplug the new node if required and set it
>> online. Memory might get onlined later, via online_pages().
>
> You are correct. In case of memory hot add, it is true. But in case of adding
> CPU with memoryless node, try_node_online() will be called only during CPU
> onlining, see cpu_up().
> 
> Is there any reason why try_online_node() resides in cpu_up() and not in add_cpu()?
> I think it would be correct to online node during the CPU hot add to align with
> memory hot add.

I am not familiar with cpu hotplug, but this doesn't seem to be anything
new so how come this became problem only now?

>> So after add_memory_resource()->__try_online_node() succeeded, we have
>> an online pgdat -- essentially 3.
>> 
> This patch detects if we're past 3. but says that it reproduced by
> disabling *memory* onlining.
> This is the hot adding of both new CPU and new _memoryless_ node (with CPU only)
> And onlining CPU makes its node online. Disabling CPU onlining puts new node
> into state #2, which leads to repro.    
> 
>> Before we online memory for a hotplugged node, all zones are !populated.
>> So once we online memory for a !populated zone in online_pages(), we
>> trigger setup_zone_pageset().
>> 
>> 
>> The confusing part is that this patch checks for 3. but says it can be
>> reproduced by not onlining *memory*. There seems to be something missing.
> 
> Do we maybe need a proper populated_zone() check before accessing zone data?

No, we need them initialize properly.

David Hildenbrand Nov. 2, 2021, 9:24 a.m. UTC | #6

>>> In add_memory_resource() we hotplug the new node if required and set it
>>> online. Memory might get onlined later, via online_pages().
>>
>> You are correct. In case of memory hot add, it is true. But in case of adding
>> CPU with memoryless node, try_node_online() will be called only during CPU
>> onlining, see cpu_up().
>>
>> Is there any reason why try_online_node() resides in cpu_up() and not in add_cpu()?
>> I think it would be correct to online node during the CPU hot add to align with
>> memory hot add.
> 
> I am not familiar with cpu hotplug, but this doesn't seem to be anything
> new so how come this became problem only now?

So IIUC, the issue is that we have a node

a) That has no memory
b) That is offline

This node will get onlined when onlining the CPU as Alexey says. Yet we
have some code that stumbles over the node and goes ahead trying to use
the pgdat -- that code is broken.


If we take a look at build_zonelists() we indeed skip any
!node_online(node). Any other code should do the same. If the node is
not online, it shall be ignored because we might not even have a pgdat
yet -- see hotadd_new_pgdat(). Without node_online(), the pgdat might be
stale or non-existant.


The node onlining logic when onlining a CPU sounds bogus as well: Let's
take a look at try_offline_node(). It checks that:
1) That no memory is *present*
2) That no CPU is *present*

We should online the node when adding the CPU ("present"), not when
onlining the CPU.

Alexey Makhalov Nov. 2, 2021, 9:40 a.m. UTC | #7

> 
> It is hard to follow your reply as your email client is not quoting
> properly. Let me try to reconstruct
> 
> On Tue 02-11-21 08:48:27, Alexey Makhalov wrote:
>> On 02.11.21 08:47, Michal Hocko wrote:
> [...]
>>>>> CPU2 has been hot-added
>>>>> BUG: unable to handle page fault for address: 0000000000001608
>>>>> #PF: supervisor read access in kernel mode
>>>>> #PF: error_code(0x0000) - not-present page
>>>>> PGD 0 P4D 0
>>>>> Oops: 0000 [#1] SMP PTI
>>>>> CPU: 0 PID: 1 Comm: systemd Tainted: G            E     5.15.0-rc7+ #11
>>>>> Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW
>>>>> 
>>>>> RIP: 0010:__alloc_pages+0x127/0x290
>>>> 
>>>> Could you resolve this into a specific line of the source code please?
> 
> This got probably unnoticed. I would be really curious whether this is
> a broken zonelist or something else.

backtrace (including inline functions)
pcpu_alloc_pages()
alloc_pages_node()
  __alloc_pages_node()
    __alloc_pages()
      prepare_alloc_pages()
        node_zonelist()

Panic happens in node_zonelist(), dereferencing NULL pointer of NODE_DATA(nid) in
include/linux/gfp.h:514
512 static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
513 {
514         return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
515 }


> 
>>>>> Node can be in one of the following states:
>>>>> 1. not present (nid == NUMA_NO_NODE)
>>>>> 2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
>>>>> 				NODE_DATA(nid) == NULL)
>>>>> 3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
>>>>> 				NODE_DATA(nid) != NULL)
>>>>> 
>>>>> alloc_page_{bulk_array}node() functions verify for nid validity only
>>>>> and do not check if nid is online. Enhanced verification check allows
>>>>> to handle page allocation when node is in 2nd state.
>>>> 
>>>> I do not think this is a correct approach. We should make sure that the
>>>> proper fallback node is used instead. This means that the zone list is
>>>> initialized properly. IIRC this has been a problem in the past and it
>>>> has been fixed. The initialization code is quite subtle though so it is
>>>> possible that this got broken again.
> 
>> This approach behaves in the same way as CPU was not yet added. (state #1).
>> So, we can think of state #2 as state #1 when CPU is not present.
> 
>>> I'm a little confused:
>>> 
>>> In add_memory_resource() we hotplug the new node if required and set it
>>> online. Memory might get onlined later, via online_pages().
>> 
>> You are correct. In case of memory hot add, it is true. But in case of adding
>> CPU with memoryless node, try_node_online() will be called only during CPU
>> onlining, see cpu_up().
>> 
>> Is there any reason why try_online_node() resides in cpu_up() and not in add_cpu()?
>> I think it would be correct to online node during the CPU hot add to align with
>> memory hot add.
> 
> I am not familiar with cpu hotplug, but this doesn't seem to be anything
> new so how come this became problem only now?

This is not CPU only hotplug, but CPU + NUMA node, and this new node is with no memory.
We accidentally found it by not unlining the CPU immediately.
> 
>>> So after add_memory_resource()->__try_online_node() succeeded, we have
>>> an online pgdat -- essentially 3.
>>> 
>> This patch detects if we're past 3. but says that it reproduced by
>> disabling *memory* onlining.
>> This is the hot adding of both new CPU and new _memoryless_ node (with CPU only)
>> And onlining CPU makes its node online. Disabling CPU onlining puts new node
>> into state #2, which leads to repro.
>> 
>>> Before we online memory for a hotplugged node, all zones are !populated.
>>> So once we online memory for a !populated zone in online_pages(), we
>>> trigger setup_zone_pageset().
>>> 
>>> 
>>> The confusing part is that this patch checks for 3. but says it can be
>>> reproduced by not onlining *memory*. There seems to be something missing.
>> 
>> Do we maybe need a proper populated_zone() check before accessing zone data?
> 
> No, we need them initialize properly.
> 

Thanks,
—Alexey

Michal Hocko Nov. 2, 2021, 9:40 a.m. UTC | #8

On Tue 02-11-21 10:04:23, Michal Hocko wrote:
[...]
> > Is there any reason why try_online_node() resides in cpu_up() and not in add_cpu()?
> > I think it would be correct to online node during the CPU hot add to align with
> > memory hot add.
> 
> I am not familiar with cpu hotplug, but this doesn't seem to be anything
> new so how come this became problem only now?

Just looked at the cpu hotplug part. I do not see add_cpu to add much
here. Here is what I can see in the current Linus tree
add_cpu
  device_online() # cpu device - cpu_sys_devices with cpu_subsys bus
    dev->bus->online -> cpu_subsys_online
      cpu_device_up
        cpu_up
	  try_online_node

So we should be bringing up the node during add_cpu. Unless something
fails on the way - e.g. cpu_possible check or something similar.

Alexey Makhalov Nov. 2, 2021, 10:34 a.m. UTC | #9

>>>> In add_memory_resource() we hotplug the new node if required and set it
>>>> online. Memory might get onlined later, via online_pages().
>>> 
>>> You are correct. In case of memory hot add, it is true. But in case of adding
>>> CPU with memoryless node, try_node_online() will be called only during CPU
>>> onlining, see cpu_up().
>>> 
>>> Is there any reason why try_online_node() resides in cpu_up() and not in add_cpu()?
>>> I think it would be correct to online node during the CPU hot add to align with
>>> memory hot add.
>> 
>> I am not familiar with cpu hotplug, but this doesn't seem to be anything
>> new so how come this became problem only now?
> 
> So IIUC, the issue is that we have a node
> 
> a) That has no memory
> b) That is offline
> 
> This node will get onlined when onlining the CPU as Alexey says. Yet we
> have some code that stumbles over the node and goes ahead trying to use
> the pgdat -- that code is broken.

You are correct.

> 
> 
> If we take a look at build_zonelists() we indeed skip any
> !node_online(node). Any other code should do the same. If the node is
> not online, it shall be ignored because we might not even have a pgdat
> yet -- see hotadd_new_pgdat(). Without node_online(), the pgdat might be
> stale or non-existant.

Agree, alloc_pages_node() should also do the same. Not exactly to skip the node,
but to fallback to another node if !node_online(node).
alloc_pages_node() can also be hit while onlining the node, creating chicken-egg
problem, see below.

> 
> 
> The node onlining logic when onlining a CPU sounds bogus as well: Let's
> take a look at try_offline_node(). It checks that:
> 1) That no memory is *present*
> 2) That no CPU is *present*
> 
> We should online the node when adding the CPU ("present"), not when
> onlining the CPU.

Possible.
Assuming try_online_node was moved under add_cpu(), let’s
take look on this call stack:
add_cpu()
  try_online_node()
    __try_online_node()
      hotadd_new_pgdat()
At line 1190 we'll have a problem:
1183         pgdat = NODE_DATA(nid);
1184         if (!pgdat) {
1185                 pgdat = arch_alloc_nodedata(nid);
1186                 if (!pgdat)
1187                         return NULL;
1188
1189                 pgdat->per_cpu_nodestats =
1190                         alloc_percpu(struct per_cpu_nodestat);
1191                 arch_refresh_nodedata(nid, pgdat);

alloc_percpu() will go for all possible CPUs and will eventually end up
calling alloc_pages_node() trying to use subject nid for corresponding CPU
hitting the same state #2 problem as NODE_DATA(nid) is still NULL and nid
is not yet online.

I like the idea of onlining the node when adding the CPU rather then when
CPU get online. It will require current patch or another solution to resolve
described above chicken-egg problem.

PS, earlier this year I initiated discussion about redesigning per_cpu allocator
to do not allocate/waste memory chunks for not present CPUs, but it has another
complications.

Thanks,
—Alexey

David Hildenbrand Nov. 2, 2021, 11 a.m. UTC | #10

On 02.11.21 11:34, Alexey Makhalov wrote:
> 
>>>>> In add_memory_resource() we hotplug the new node if required and set it
>>>>> online. Memory might get onlined later, via online_pages().
>>>>
>>>> You are correct. In case of memory hot add, it is true. But in case of adding
>>>> CPU with memoryless node, try_node_online() will be called only during CPU
>>>> onlining, see cpu_up().
>>>>
>>>> Is there any reason why try_online_node() resides in cpu_up() and not in add_cpu()?
>>>> I think it would be correct to online node during the CPU hot add to align with
>>>> memory hot add.
>>>
>>> I am not familiar with cpu hotplug, but this doesn't seem to be anything
>>> new so how come this became problem only now?
>>
>> So IIUC, the issue is that we have a node
>>
>> a) That has no memory
>> b) That is offline
>>
>> This node will get onlined when onlining the CPU as Alexey says. Yet we
>> have some code that stumbles over the node and goes ahead trying to use
>> the pgdat -- that code is broken.
> 
> You are correct.
> 
>>
>>
>> If we take a look at build_zonelists() we indeed skip any
>> !node_online(node). Any other code should do the same. If the node is
>> not online, it shall be ignored because we might not even have a pgdat
>> yet -- see hotadd_new_pgdat(). Without node_online(), the pgdat might be
>> stale or non-existant.
> 
> Agree, alloc_pages_node() should also do the same. Not exactly to skip the node,
> but to fallback to another node if !node_online(node).
> alloc_pages_node() can also be hit while onlining the node, creating chicken-egg
> problem, see below

Right, the issue is also a bit involved when calling alloc_pages_node()
on an offline NID. See below.

> 
>>
>>
>> The node onlining logic when onlining a CPU sounds bogus as well: Let's
>> take a look at try_offline_node(). It checks that:
>> 1) That no memory is *present*
>> 2) That no CPU is *present*
>>
>> We should online the node when adding the CPU ("present"), not when
>> onlining the CPU.
> 
> Possible.
> Assuming try_online_node was moved under add_cpu(), let’s
> take look on this call stack:
> add_cpu()
>   try_online_node()
>     __try_online_node()
>       hotadd_new_pgdat()
> At line 1190 we'll have a problem:
> 1183         pgdat = NODE_DATA(nid);
> 1184         if (!pgdat) {
> 1185                 pgdat = arch_alloc_nodedata(nid);
> 1186                 if (!pgdat)
> 1187                         return NULL;
> 1188
> 1189                 pgdat->per_cpu_nodestats =
> 1190                         alloc_percpu(struct per_cpu_nodestat);
> 1191                 arch_refresh_nodedata(nid, pgdat);
> 
> alloc_percpu() will go for all possible CPUs and will eventually end up
> calling alloc_pages_node() trying to use subject nid for corresponding CPU
> hitting the same state #2 problem as NODE_DATA(nid) is still NULL and nid
> is not yet online.

Right, we will end up calling pcpu_alloc_pages()->alloc_pages_node() for
each possible CPU. We use cpu_to_node() to come up with the NID.

I can only assume that we usually don't get an offline NID for an
offline CPU, but instead either NODE=0 or NODE=NUMA_NO_NODE, because ...


alloc_pages_node()->__alloc_pages_node() will:

VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));

BUT: prepare_alloc_pages()

ac->zonelist = node_zonelist(preferred_nid, gfp_mask);

should similarly fail. when de-referencing NULL.

Michal Hocko Nov. 2, 2021, 11:44 a.m. UTC | #11

On Tue 02-11-21 12:00:57, David Hildenbrand wrote:
> On 02.11.21 11:34, Alexey Makhalov wrote:
[...]
> >> The node onlining logic when onlining a CPU sounds bogus as well: Let's
> >> take a look at try_offline_node(). It checks that:
> >> 1) That no memory is *present*
> >> 2) That no CPU is *present*
> >>
> >> We should online the node when adding the CPU ("present"), not when
> >> onlining the CPU.
> > 
> > Possible.
> > Assuming try_online_node was moved under add_cpu(), let’s
> > take look on this call stack:
> > add_cpu()
> >   try_online_node()
> >     __try_online_node()
> >       hotadd_new_pgdat()
> > At line 1190 we'll have a problem:
> > 1183         pgdat = NODE_DATA(nid);
> > 1184         if (!pgdat) {
> > 1185                 pgdat = arch_alloc_nodedata(nid);
> > 1186                 if (!pgdat)
> > 1187                         return NULL;
> > 1188
> > 1189                 pgdat->per_cpu_nodestats =
> > 1190                         alloc_percpu(struct per_cpu_nodestat);
> > 1191                 arch_refresh_nodedata(nid, pgdat);
> > 
> > alloc_percpu() will go for all possible CPUs and will eventually end up
> > calling alloc_pages_node() trying to use subject nid for corresponding CPU
> > hitting the same state #2 problem as NODE_DATA(nid) is still NULL and nid
> > is not yet online.
> 
> Right, we will end up calling pcpu_alloc_pages()->alloc_pages_node() for
> each possible CPU. We use cpu_to_node() to come up with the NID.

Shouldn't this be numa_mem_id instead? Memory less nodes are odd little
critters crafted into the MM code without wider considerations. From
time to time we are struggling with some fallouts but the primary thing
is that zonelists should be valid for all memory less nodes. If that is
not the case then there is a problem with the initialization code. If
somebody is providing a bogus node to allocate from then this should be
fixed. It is still not clear to me which case are we hitting here.

David Hildenbrand Nov. 2, 2021, 12:06 p.m. UTC | #12

On 02.11.21 12:44, Michal Hocko wrote:
> On Tue 02-11-21 12:00:57, David Hildenbrand wrote:
>> On 02.11.21 11:34, Alexey Makhalov wrote:
> [...]
>>>> The node onlining logic when onlining a CPU sounds bogus as well: Let's
>>>> take a look at try_offline_node(). It checks that:
>>>> 1) That no memory is *present*
>>>> 2) That no CPU is *present*
>>>>
>>>> We should online the node when adding the CPU ("present"), not when
>>>> onlining the CPU.
>>>
>>> Possible.
>>> Assuming try_online_node was moved under add_cpu(), let’s
>>> take look on this call stack:
>>> add_cpu()
>>>   try_online_node()
>>>     __try_online_node()
>>>       hotadd_new_pgdat()
>>> At line 1190 we'll have a problem:
>>> 1183         pgdat = NODE_DATA(nid);
>>> 1184         if (!pgdat) {
>>> 1185                 pgdat = arch_alloc_nodedata(nid);
>>> 1186                 if (!pgdat)
>>> 1187                         return NULL;
>>> 1188
>>> 1189                 pgdat->per_cpu_nodestats =
>>> 1190                         alloc_percpu(struct per_cpu_nodestat);
>>> 1191                 arch_refresh_nodedata(nid, pgdat);
>>>
>>> alloc_percpu() will go for all possible CPUs and will eventually end up
>>> calling alloc_pages_node() trying to use subject nid for corresponding CPU
>>> hitting the same state #2 problem as NODE_DATA(nid) is still NULL and nid
>>> is not yet online.
>>
>> Right, we will end up calling pcpu_alloc_pages()->alloc_pages_node() for
>> each possible CPU. We use cpu_to_node() to come up with the NID.
> 
> Shouldn't this be numa_mem_id instead? Memory less nodes are odd little

Hm, good question. Most probably yes for offline nodes.

diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index 2054c9213c43..c21ff5bb91dc 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -84,15 +84,19 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
                            gfp_t gfp)
 {
        unsigned int cpu, tcpu;
-       int i;
+       int i, nid;
 
        gfp |= __GFP_HIGHMEM;
 
        for_each_possible_cpu(cpu) {
+               nid = cpu_to_node(cpu);
+
+               if (nid == NUMA_NO_NODE || !node_online(nid))
+                       nid = numa_mem_id();
                for (i = page_start; i < page_end; i++) {
                        struct page **pagep = &pages[pcpu_page_idx(cpu, i)];
 
-                       *pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
+                       *pagep = alloc_pages_node(nid, gfp, 0);
                        if (!*pagep)
                                goto err;
                }


> critters crafted into the MM code without wider considerations. From
> time to time we are struggling with some fallouts but the primary thing
> is that zonelists should be valid for all memory less nodes.

Yes, but a zonelist cannot be correct for an offline node, where we might
not even have an allocated pgdat yet. No pgdat, no zonelist. So as soon as
we allocate the pgdat and set the node online (->hotadd_new_pgdat()), the zone lists have to be correct. And I can spot an build_all_zonelists() in hotadd_new_pgdat().

I agree that someone passing an offline NID into an allocator function
should be fixed.

Maybe __alloc_pages_bulk() and alloc_pages_node() should bail out directly
(VM_BUG()) in case we're providing an offline node with eventually no/stale pgdat as
preferred nid.

Michal Hocko Nov. 2, 2021, 12:27 p.m. UTC | #13

On Tue 02-11-21 13:06:06, David Hildenbrand wrote:
> On 02.11.21 12:44, Michal Hocko wrote:
> > On Tue 02-11-21 12:00:57, David Hildenbrand wrote:
> >> On 02.11.21 11:34, Alexey Makhalov wrote:
> > [...]
> >>>> The node onlining logic when onlining a CPU sounds bogus as well: Let's
> >>>> take a look at try_offline_node(). It checks that:
> >>>> 1) That no memory is *present*
> >>>> 2) That no CPU is *present*
> >>>>
> >>>> We should online the node when adding the CPU ("present"), not when
> >>>> onlining the CPU.
> >>>
> >>> Possible.
> >>> Assuming try_online_node was moved under add_cpu(), let’s
> >>> take look on this call stack:
> >>> add_cpu()
> >>>   try_online_node()
> >>>     __try_online_node()
> >>>       hotadd_new_pgdat()
> >>> At line 1190 we'll have a problem:
> >>> 1183         pgdat = NODE_DATA(nid);
> >>> 1184         if (!pgdat) {
> >>> 1185                 pgdat = arch_alloc_nodedata(nid);
> >>> 1186                 if (!pgdat)
> >>> 1187                         return NULL;
> >>> 1188
> >>> 1189                 pgdat->per_cpu_nodestats =
> >>> 1190                         alloc_percpu(struct per_cpu_nodestat);
> >>> 1191                 arch_refresh_nodedata(nid, pgdat);
> >>>
> >>> alloc_percpu() will go for all possible CPUs and will eventually end up
> >>> calling alloc_pages_node() trying to use subject nid for corresponding CPU
> >>> hitting the same state #2 problem as NODE_DATA(nid) is still NULL and nid
> >>> is not yet online.
> >>
> >> Right, we will end up calling pcpu_alloc_pages()->alloc_pages_node() for
> >> each possible CPU. We use cpu_to_node() to come up with the NID.
> > 
> > Shouldn't this be numa_mem_id instead? Memory less nodes are odd little
> 
> Hm, good question. Most probably yes for offline nodes.
> 
> diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
> index 2054c9213c43..c21ff5bb91dc 100644
> --- a/mm/percpu-vm.c
> +++ b/mm/percpu-vm.c
> @@ -84,15 +84,19 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
>                             gfp_t gfp)
>  {
>         unsigned int cpu, tcpu;
> -       int i;
> +       int i, nid;
>  
>         gfp |= __GFP_HIGHMEM;
>  
>         for_each_possible_cpu(cpu) {
> +               nid = cpu_to_node(cpu);
> +
> +               if (nid == NUMA_NO_NODE || !node_online(nid))
> +                       nid = numa_mem_id();

or simply nid = cpu_to_mem(cpu)

>                 for (i = page_start; i < page_end; i++) {
>                         struct page **pagep = &pages[pcpu_page_idx(cpu, i)];
>  
> -                       *pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
> +                       *pagep = alloc_pages_node(nid, gfp, 0);
>                         if (!*pagep)
>                                 goto err;
>                 }
> 
> 
> > critters crafted into the MM code without wider considerations. From
> > time to time we are struggling with some fallouts but the primary thing
> > is that zonelists should be valid for all memory less nodes.
> 
> Yes, but a zonelist cannot be correct for an offline node, where we might
> not even have an allocated pgdat yet. No pgdat, no zonelist. So as soon as
> we allocate the pgdat and set the node online (->hotadd_new_pgdat()), the zone lists have to be correct. And I can spot an build_all_zonelists() in hotadd_new_pgdat().

Yes, that is what I had in mind. We are talking about two things here.
Memoryless nodes and offline nodes. The later sounds like a bug to me.

> I agree that someone passing an offline NID into an allocator function
> should be fixed.

Right

> Maybe __alloc_pages_bulk() and alloc_pages_node() should bail out directly
> (VM_BUG()) in case we're providing an offline node with eventually no/stale pgdat as
> preferred nid.

Historically, those allocation interfaces were not trying to be robust
against wrong inputs because that adds cpu cycles for everybody for
"what if buggy" code. This has worked (surprisingly) well. Memory less
nodes have brought in some confusion but this is still something that we
can address on a higher level. Nobody give arbitrary nodes as an input.
cpu_to_node might be tricky because it can point to a memory less node
which along with __GFP_THISNODE is very likely not something anybody
wants. Hence cpu_to_mem should be used for allocations. I hate we have
two very similar APIs...

But something seems wrong in this case. cpu_to_node shouldn't return
offline nodes. That is just a land mine. It is not clear to me how the
cpu has been brought up so that the numa node allocation was left
behind. As pointed in other email add_cpu resp. cpu_up is not it.
Is it possible that the cpu bring up was only half way?

David Hildenbrand Nov. 2, 2021, 12:39 p.m. UTC | #14

>> Yes, but a zonelist cannot be correct for an offline node, where we might
>> not even have an allocated pgdat yet. No pgdat, no zonelist. So as soon as
>> we allocate the pgdat and set the node online (->hotadd_new_pgdat()), the zone lists have to be correct. And I can spot an build_all_zonelists() in hotadd_new_pgdat().
> 
> Yes, that is what I had in mind. We are talking about two things here.
> Memoryless nodes and offline nodes. The later sounds like a bug to me.

Agreed. memoryless nodes should just have proper zonelists -- which
seems to be the case.

>> Maybe __alloc_pages_bulk() and alloc_pages_node() should bail out directly
>> (VM_BUG()) in case we're providing an offline node with eventually no/stale pgdat as
>> preferred nid.
> 
> Historically, those allocation interfaces were not trying to be robust
> against wrong inputs because that adds cpu cycles for everybody for
> "what if buggy" code. This has worked (surprisingly) well. Memory less
> nodes have brought in some confusion but this is still something that we
> can address on a higher level. Nobody give arbitrary nodes as an input.
> cpu_to_node might be tricky because it can point to a memory less node
> which along with __GFP_THISNODE is very likely not something anybody
> wants. Hence cpu_to_mem should be used for allocations. I hate we have
> two very similar APIs...

To be precise, I'm wondering if we should do:

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 55b2ec1f965a..8c49b88336ee 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -565,7 +565,7 @@ static inline struct page *
 __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
 {
        VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
-       VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
+       VM_WARN_ON(!node_online(nid));

        return __alloc_pages(gfp_mask, order, nid, NULL);
 }

(Or maybe VM_BUG_ON)

Because it cannot possibly work and we'll dereference NULL later.

> 
> But something seems wrong in this case. cpu_to_node shouldn't return
> offline nodes. That is just a land mine. It is not clear to me how the
> cpu has been brought up so that the numa node allocation was left
> behind. As pointed in other email add_cpu resp. cpu_up is not it.
> Is it possible that the cpu bring up was only half way?

I tried to follow the code (what sets a CPU present, what sets a CPU
online, when do we update cpu_to_node() mapping) and IMHO it's all a big
mess. Maybe it's clearer to people familiar with that code, but CPU
hotplug in general seems to be a confusing piece of (arch-specific) code.

Also, I have no clue if cpu_to_node() mapping will get invalidated after
unplugging that CPU, or if the mapping will simply stay around for all
eternity ...

Michal Hocko Nov. 2, 2021, 1:25 p.m. UTC | #15

On Tue 02-11-21 13:39:06, David Hildenbrand wrote:
> >> Yes, but a zonelist cannot be correct for an offline node, where we might
> >> not even have an allocated pgdat yet. No pgdat, no zonelist. So as soon as
> >> we allocate the pgdat and set the node online (->hotadd_new_pgdat()), the zone lists have to be correct. And I can spot an build_all_zonelists() in hotadd_new_pgdat().
> > 
> > Yes, that is what I had in mind. We are talking about two things here.
> > Memoryless nodes and offline nodes. The later sounds like a bug to me.
> 
> Agreed. memoryless nodes should just have proper zonelists -- which
> seems to be the case.
> 
> >> Maybe __alloc_pages_bulk() and alloc_pages_node() should bail out directly
> >> (VM_BUG()) in case we're providing an offline node with eventually no/stale pgdat as
> >> preferred nid.
> > 
> > Historically, those allocation interfaces were not trying to be robust
> > against wrong inputs because that adds cpu cycles for everybody for
> > "what if buggy" code. This has worked (surprisingly) well. Memory less
> > nodes have brought in some confusion but this is still something that we
> > can address on a higher level. Nobody give arbitrary nodes as an input.
> > cpu_to_node might be tricky because it can point to a memory less node
> > which along with __GFP_THISNODE is very likely not something anybody
> > wants. Hence cpu_to_mem should be used for allocations. I hate we have
> > two very similar APIs...
> 
> To be precise, I'm wondering if we should do:
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 55b2ec1f965a..8c49b88336ee 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -565,7 +565,7 @@ static inline struct page *
>  __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
>  {
>         VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> -       VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
> +       VM_WARN_ON(!node_online(nid));
> 
>         return __alloc_pages(gfp_mask, order, nid, NULL);
>  }
> 
> (Or maybe VM_BUG_ON)
> 
> Because it cannot possibly work and we'll dereference NULL later.

VM_BUG_ON would be silent for most configurations and crash would happen
even without it so I am not sure about the additional value. VM_WARN_ON
doesn't really add much on top - except it would crash in some
configurations. If we really care to catch this case then we would have
to do a reasonable fallback with a printk note and a dumps stack.
Something like
	if (unlikely(!node_online(nid))) {
		pr_err("%d is an offline numa node and using it is a bug in a caller. Please report...\n");
		dump_stack();
		nid = numa_mem_id();
	}

But again this is adding quite some cycles to a hotpath of the page
allocator. Is this worth it?

> > But something seems wrong in this case. cpu_to_node shouldn't return
> > offline nodes. That is just a land mine. It is not clear to me how the
> > cpu has been brought up so that the numa node allocation was left
> > behind. As pointed in other email add_cpu resp. cpu_up is not it.
> > Is it possible that the cpu bring up was only half way?
> 
> I tried to follow the code (what sets a CPU present, what sets a CPU
> online, when do we update cpu_to_node() mapping) and IMHO it's all a big
> mess. Maybe it's clearer to people familiar with that code, but CPU
> hotplug in general seems to be a confusing piece of (arch-specific) code.

Yes there are different arch specific parts that make this quite hard to
follow.

I think we want to learn how exactly Alexey brought that cpu up. Because
his initial thought on add_cpu resp cpu_up doesn't seem to be correct.
Or I am just not following the code properly. Once we know all those
details we can get in touch with cpu hotplug maintainers and see what
can we do.

Btw. do you plan to send a patch for pcp allocator to use cpu_to_mem?
One last thing, there were some mentions of __GFP_THISNODE but I fail to
see connection with the pcp allocator...

David Hildenbrand Nov. 2, 2021, 1:41 p.m. UTC | #16

On 02.11.21 14:25, Michal Hocko wrote:
> On Tue 02-11-21 13:39:06, David Hildenbrand wrote:
>>>> Yes, but a zonelist cannot be correct for an offline node, where we might
>>>> not even have an allocated pgdat yet. No pgdat, no zonelist. So as soon as
>>>> we allocate the pgdat and set the node online (->hotadd_new_pgdat()), the zone lists have to be correct. And I can spot an build_all_zonelists() in hotadd_new_pgdat().
>>>
>>> Yes, that is what I had in mind. We are talking about two things here.
>>> Memoryless nodes and offline nodes. The later sounds like a bug to me.
>>
>> Agreed. memoryless nodes should just have proper zonelists -- which
>> seems to be the case.
>>
>>>> Maybe __alloc_pages_bulk() and alloc_pages_node() should bail out directly
>>>> (VM_BUG()) in case we're providing an offline node with eventually no/stale pgdat as
>>>> preferred nid.
>>>
>>> Historically, those allocation interfaces were not trying to be robust
>>> against wrong inputs because that adds cpu cycles for everybody for
>>> "what if buggy" code. This has worked (surprisingly) well. Memory less
>>> nodes have brought in some confusion but this is still something that we
>>> can address on a higher level. Nobody give arbitrary nodes as an input.
>>> cpu_to_node might be tricky because it can point to a memory less node
>>> which along with __GFP_THISNODE is very likely not something anybody
>>> wants. Hence cpu_to_mem should be used for allocations. I hate we have
>>> two very similar APIs...
>>
>> To be precise, I'm wondering if we should do:
>>
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index 55b2ec1f965a..8c49b88336ee 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -565,7 +565,7 @@ static inline struct page *
>>  __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
>>  {
>>         VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
>> -       VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
>> +       VM_WARN_ON(!node_online(nid));
>>
>>         return __alloc_pages(gfp_mask, order, nid, NULL);
>>  }
>>
>> (Or maybe VM_BUG_ON)
>>
>> Because it cannot possibly work and we'll dereference NULL later.
> 
> VM_BUG_ON would be silent for most configurations and crash would happen
> even without it so I am not sure about the additional value. VM_WARN_ON
> doesn't really add much on top - except it would crash in some
> configurations. If we really care to catch this case then we would have
> to do a reasonable fallback with a printk note and a dumps stack.

As I learned, VM_BUG_ON and friends are active for e.g., Fedora, which
can catch quite some issues early, before they end up in enterprise
distro kernels. I think it has value.

> Something like
> 	if (unlikely(!node_online(nid))) {
> 		pr_err("%d is an offline numa node and using it is a bug in a caller. Please report...\n");
> 		dump_stack();
> 		nid = numa_mem_id();
> 	}
> 
> But again this is adding quite some cycles to a hotpath of the page
> allocator. Is this worth it?

Don't think a fallback makes sense.

> 
>>> But something seems wrong in this case. cpu_to_node shouldn't return
>>> offline nodes. That is just a land mine. It is not clear to me how the
>>> cpu has been brought up so that the numa node allocation was left
>>> behind. As pointed in other email add_cpu resp. cpu_up is not it.
>>> Is it possible that the cpu bring up was only half way?
>>
>> I tried to follow the code (what sets a CPU present, what sets a CPU
>> online, when do we update cpu_to_node() mapping) and IMHO it's all a big
>> mess. Maybe it's clearer to people familiar with that code, but CPU
>> hotplug in general seems to be a confusing piece of (arch-specific) code.
> 
> Yes there are different arch specific parts that make this quite hard to
> follow.
> 
> I think we want to learn how exactly Alexey brought that cpu up. Because
> his initial thought on add_cpu resp cpu_up doesn't seem to be correct.
> Or I am just not following the code properly. Once we know all those
> details we can get in touch with cpu hotplug maintainers and see what
> can we do.

Yes.

> 
> Btw. do you plan to send a patch for pcp allocator to use cpu_to_mem?

You mean s/cpu_to_node/cpu_to_mem/ or also handling offline nids?

cpu_to_mem() corresponds to cpu_to_node() unless on ia64+ppc IIUC, so it
won't help for this very report.

> One last thing, there were some mentions of __GFP_THISNODE but I fail to
> see connection with the pcp allocator...

Me to. If pcpu would be using __GFP_THISNODE, we'd be hitting the
VM_WARN_ON but still crash.

Oscar Salvador Nov. 2, 2021, 1:52 p.m. UTC | #17

On Tue, Nov 02, 2021 at 02:25:03PM +0100, Michal Hocko wrote:
> I think we want to learn how exactly Alexey brought that cpu up. Because
> his initial thought on add_cpu resp cpu_up doesn't seem to be correct.
> Or I am just not following the code properly. Once we know all those
> details we can get in touch with cpu hotplug maintainers and see what
> can we do.

I am not really familiar with CPU hot-onlining, but I have been taking a look.
As with memory, there are two different stages, hot-adding and onlining (and the
counterparts).

Part of the hot-adding being:

acpi_processor_get_info
 acpi_processor_hotadd_init
  arch_register_cpu
   register_cpu

One of the things that register_cpu() does is to set cpu->dev.bus pointing to
&cpu_subsys, which is:

struct bus_type cpu_subsys = {
	.name = "cpu",
	.dev_name = "cpu",
	.match = cpu_subsys_match,
#ifdef CONFIG_HOTPLUG_CPU
	.online = cpu_subsys_online,
	.offline = cpu_subsys_offline,
#endif
};

Then, the onlining part (in case of a udev rule or someone onlining the device)
would be:

online_store
 device_online
  cpu_subsys_online
   cpu_device_up
    cpu_up
     ...
     online node

Since Alexey disabled the udev rule and no one onlined the CPU, online_store()->
device_online() wasn't really called.

The following only applies to x86_64:
I think we got confused because cpu_device_up() is also called from add_cpu(),
but that is an exported function and x86 does not call add_cpu() unless for
debugging purposes (check kernel/torture.c and arch/x86/kernel/topology.c).
It does the onlining through online_store()...
So we can take add_cpu() off the equation here.

Michal Hocko Nov. 2, 2021, 2:12 p.m. UTC | #18

On Tue 02-11-21 14:41:25, David Hildenbrand wrote:
> On 02.11.21 14:25, Michal Hocko wrote:
[...]
> > Btw. do you plan to send a patch for pcp allocator to use cpu_to_mem?
> 
> You mean s/cpu_to_node/cpu_to_mem/ or also handling offline nids?

just cpu_to_mem

> cpu_to_mem() corresponds to cpu_to_node() unless on ia64+ppc IIUC, so it
> won't help for this very report.

Weird, x86 allows memory less nodes as well. But you are right
there is nothing selecting HAVE_MEMORYLESS_NODES neither do I see any
arch specific implementation. I have to say that I have forgot all those
nasty details... Sigh

Michal Hocko Nov. 2, 2021, 2:35 p.m. UTC | #19

On Tue 02-11-21 14:52:01, Oscar Salvador wrote:
> On Tue, Nov 02, 2021 at 02:25:03PM +0100, Michal Hocko wrote:
> > I think we want to learn how exactly Alexey brought that cpu up. Because
> > his initial thought on add_cpu resp cpu_up doesn't seem to be correct.
> > Or I am just not following the code properly. Once we know all those
> > details we can get in touch with cpu hotplug maintainers and see what
> > can we do.
> 
> I am not really familiar with CPU hot-onlining, but I have been taking a look.
> As with memory, there are two different stages, hot-adding and onlining (and the
> counterparts).
> 
> Part of the hot-adding being:
> 
> acpi_processor_get_info
>  acpi_processor_hotadd_init
>   arch_register_cpu
>    register_cpu
> 
> One of the things that register_cpu() does is to set cpu->dev.bus pointing to
> &cpu_subsys, which is:
> 
> struct bus_type cpu_subsys = {
> 	.name = "cpu",
> 	.dev_name = "cpu",
> 	.match = cpu_subsys_match,
> #ifdef CONFIG_HOTPLUG_CPU
> 	.online = cpu_subsys_online,
> 	.offline = cpu_subsys_offline,
> #endif
> };
> 
> Then, the onlining part (in case of a udev rule or someone onlining the device)
> would be:
> 
> online_store
>  device_online
>   cpu_subsys_online
>    cpu_device_up
>     cpu_up
>      ...
>      online node
> 
> Since Alexey disabled the udev rule and no one onlined the CPU, online_store()->
> device_online() wasn't really called.
> 
> The following only applies to x86_64:
> I think we got confused because cpu_device_up() is also called from add_cpu(),
> but that is an exported function and x86 does not call add_cpu() unless for
> debugging purposes (check kernel/torture.c and arch/x86/kernel/topology.c).
> It does the onlining through online_store()...
> So we can take add_cpu() off the equation here.

Yes, so the real problem is (thanks for pointing me to the acpi code).
The cpu->node association is done in acpi_map_cpu2node and I suspect
this expects that the node is already present as it gets the information
from SRAT/PXM tables which are parsed during boot. But I might be just
confused or maybe just VMware inject new entries here somehow.

Another interesting thing is that acpi_map_cpu2node skips over
association if there is no node found in SRAT but that should only mean
it would use the default initialization which should be hopefuly 0.

Anyway, I have found in my notes
https://www.spinics.net/lists/kernel/msg3010886.html which is a slightly
different problem but it has some notes about how the initialization
mess works (that one was boot time though and hotplug might be different
actually).

I have ran out of time for this today so hopefully somebody can re-learn
that from there...

David Hildenbrand Nov. 2, 2021, 2:44 p.m. UTC | #20

On 02.11.21 15:12, Michal Hocko wrote:
> On Tue 02-11-21 14:41:25, David Hildenbrand wrote:
>> On 02.11.21 14:25, Michal Hocko wrote:
> [...]
>>> Btw. do you plan to send a patch for pcp allocator to use cpu_to_mem?
>>
>> You mean s/cpu_to_node/cpu_to_mem/ or also handling offline nids?
> 
> just cpu_to_mem
> 
>> cpu_to_mem() corresponds to cpu_to_node() unless on ia64+ppc IIUC, so it
>> won't help for this very report.
> 
> Weird, x86 allows memory less nodes as well. But you are right
> there is nothing selecting HAVE_MEMORYLESS_NODES neither do I see any
> arch specific implementation. I have to say that I have forgot all those
> nasty details... Sigh
> 

I assume HAVE_MEMORYLESS_NODES is just an optimization to set a
preferred memory node for memoryless nodes. It doesn't imply that we
cannot have memoryless nodes otherwise.

I suspect just as so often, the config option name doesn't express what
it really does.

Alexey Makhalov Nov. 8, 2021, 6:12 a.m. UTC | #21

I’m going to send patch v2, with node_online check moved to caller (pcpu_alloc_pages() function)
as was suggested by David. It seems as it is only one place which passes present but offlined
node to alloc-pages_node(). Moving node online check to the caller keeps hot path (alloc_pages)
simple and performant.

> diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
> index 2054c9213c43..c21ff5bb91dc 100644
> --- a/mm/percpu-vm.c
> +++ b/mm/percpu-vm.c
> @@ -84,15 +84,19 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
>                           gfp_t gfp)
> {
>       unsigned int cpu, tcpu;
> -       int i;
> +       int i, nid;
> 
>       gfp |= __GFP_HIGHMEM;
> 
>       for_each_possible_cpu(cpu) {
> +               nid = cpu_to_node(cpu);
> +
> +               if (nid == NUMA_NO_NODE || !node_online(nid))
> +                       nid = numa_mem_id();
>               for (i = page_start; i < page_end; i++) {
>                       struct page **pagep = &pages[pcpu_page_idx(cpu, i)];
> 
> -                       *pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
> +                       *pagep = alloc_pages_node(nid, gfp, 0);
>                       if (!*pagep)
>                               goto err;
>               }
> 

Thanks,
—Alexey

mm: fix panic in __alloc_pages

Commit Message

Comments

Patch