[v4] mm/memory_hotplug: refrain from adding memory into an impossible node

Message ID 20200416171019.24433-1-vishal.l.verma@intel.com (mailing list archive)
State New, archived

Commit Message

Verma, Vishal L April 16, 2020, 5:10 p.m. UTC
A misbehaving qemu created a situation where the ACPI SRAT table
advertised one fewer proximity domains than intended. The NFIT table did
describe all the expected proximity domains. This caused the device dax
driver to assign an impossible target_node to the device, and when
hotplugged as system memory, this would fail with the following
signature:

  [  +0.001627] BUG: kernel NULL pointer dereference, address: 0000000000000088
  [  +0.001331] #PF: supervisor read access in kernel mode
  [  +0.000975] #PF: error_code(0x0000) - not-present page
  [  +0.000976] PGD 80000001767d4067 P4D 80000001767d4067 PUD 10e0c4067 PMD 0
  [  +0.001338] Oops: 0000 [#1] SMP PTI
  [  +0.000676] CPU: 4 PID: 22737 Comm: kswapd3 Tainted: G           O      5.6.0-rc5 #9
  [  +0.001457] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
      BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  [  +0.001990] RIP: 0010:prepare_kswapd_sleep+0x7c/0xc0
  [  +0.000780] Code: 89 df e8 87 fd ff ff 89 c2 31 c0 84 d2 74 e6 0f 1f 44
                      00 00 48 8b 05 fb af 7a 01 48 63 93 88 1d 01 00 48 8b
		      84 d0 20 0f 00 00 <48> 3b 98 88 00 00 00 75 28 f0 80 a0
		      80 00 00 00 fe f0 80 a3 38 20
  [  +0.002877] RSP: 0018:ffffc900017a3e78 EFLAGS: 00010202
  [  +0.000805] RAX: 0000000000000000 RBX: ffff8881209e0000 RCX: 0000000000000000
  [  +0.001115] RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff8881209e0e80
  [  +0.001098] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000008000
  [  +0.001092] R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000003
  [  +0.001092] R13: 0000000000000003 R14: 0000000000000000 R15: ffffc900017a3ec8
  [  +0.001091] FS:  0000000000000000(0000) GS:ffff888318c00000(0000) knlGS:0000000000000000
  [  +0.001275] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [  +0.000882] CR2: 0000000000000088 CR3: 0000000120b50002 CR4: 00000000001606e0
  [  +0.001095] Call Trace:
  [  +0.000388]  kswapd+0x103/0x520
  [  +0.000494]  ? finish_wait+0x80/0x80
  [  +0.000547]  ? balance_pgdat+0x5a0/0x5a0
  [  +0.000607]  kthread+0x120/0x140
  [  +0.000508]  ? kthread_create_on_node+0x60/0x60
  [  +0.000706]  ret_from_fork+0x3a/0x50

Add a check in the add_memory path to fail if the node to which we
are adding memory is not in the node_possible_map.

Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
 mm/memory_hotplug.c | 5 +++++
 1 file changed, 5 insertions(+)
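
As a rough userspace illustration of the check described in the commit message (not the kernel code itself), the sketch below models node_possible_map as a plain bitmask and rejects any node ID whose bit is clear. MAX_NODES, the mask value, and the mock_* helpers are assumptions made for this example; in the kernel the map is a nodemask and the test is node_possible(nid).

```c
#include <errno.h>
#include <stdio.h>

#define MAX_NODES 8 /* hypothetical limit, standing in for MAX_NUMNODES */

/* Hypothetical map: nodes 0..2 are possible; node 3 was left out,
 * as if the SRAT had advertised one proximity domain too few. */
static unsigned long node_possible_mask = 0x7UL;

/* Returns 1 if nid is within range and its bit is set in the mask. */
static int mock_node_possible(int nid)
{
	if (nid < 0 || nid >= MAX_NODES)
		return 0;
	return (node_possible_mask >> nid) & 1UL;
}

/* Mirrors the shape of the proposed check: bail out early, before
 * any hotplug work is started, when the target node is impossible. */
static int mock_add_memory_resource(int nid)
{
	if (!mock_node_possible(nid)) {
		fprintf(stderr,
			"node %d was absent from the node_possible_map\n", nid);
		return -ENXIO;
	}
	/* The real function would go on to hotplug the memory here. */
	return 0;
}
```

With this mask, mock_add_memory_resource(3) is rejected with -ENXIO while nodes 0 through 2 are accepted.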

Comments

David Hildenbrand April 16, 2020, 5:12 p.m. UTC | #1
On 16.04.20 19:10, Vishal Verma wrote:
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 0a54ffac8c68..ddd3347edd54 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1005,6 +1005,11 @@ int __ref add_memory_resource(int nid, struct resource *res)
>  	if (ret)
>  		return ret;
>  
> +	if (!node_possible(nid)) {
> +		WARN(1, "node %d was absent from the node_possible_map\n", nid);
> +		return -ENXIO;

Nit: I suggest using "-EINVAL" instead (e.g., returned via
check_hotplug_memory_range).

Not sure if we should pr_err() instead of WARN (see e.g.,
check_hotplug_memory_range)
Verma, Vishal L April 16, 2020, 5:23 p.m. UTC | #2
On Thu, 2020-04-16 at 19:12 +0200, David Hildenbrand wrote:
> On 16.04.20 19:10, Vishal Verma wrote:
> > 
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index 0a54ffac8c68..ddd3347edd54 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1005,6 +1005,11 @@ int __ref add_memory_resource(int nid, struct resource *res)
> >  	if (ret)
> >  		return ret;
> >  
> > +	if (!node_possible(nid)) {
> > +		WARN(1, "node %d was absent from the node_possible_map\n", nid);
> > +		return -ENXIO;
> 
> Nit: I suggest using "-EINVAL" instead (e.g., returned via
> check_hotplug_memory_range).
> 
> Not sure if we should pr_err() instead of WARN (see e.g.,
> check_hotplug_memory_range)
> 
Hm, I'm happy to make the changes, but EINVAL to me suggests there is a
problem in the way this was called by the user. And in this case there
really might not be much the user can change in case of buggy firmware.

Same thing with the WARN - make the potential firmware bug much more
obvious and visible.
David Hildenbrand April 16, 2020, 5:25 p.m. UTC | #3
On 16.04.20 19:23, Verma, Vishal L wrote:
> On Thu, 2020-04-16 at 19:12 +0200, David Hildenbrand wrote:
>> On 16.04.20 19:10, Vishal Verma wrote:
>>>
>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>> index 0a54ffac8c68..ddd3347edd54 100644
>>> --- a/mm/memory_hotplug.c
>>> +++ b/mm/memory_hotplug.c
>>> @@ -1005,6 +1005,11 @@ int __ref add_memory_resource(int nid, struct resource *res)
>>>  	if (ret)
>>>  		return ret;
>>>  
>>> +	if (!node_possible(nid)) {
>>> +		WARN(1, "node %d was absent from the node_possible_map\n", nid);
>>> +		return -ENXIO;
>>
>> Nit: I suggest using "-EINVAL" instead (e.g., returned via
>> check_hotplug_memory_range).
>>
>> Not sure if we should pr_err() instead of WARN (see e.g.,
>> check_hotplug_memory_range)
>>
> Hm, I'm happy to make the changes, but EINVAL to me suggests there is a
> problem in the way this was called by the user. And in this case there
> really might not be much the user can change in case of buggy firmware.

Yeah, but introducing new return codes that callers might not expect
could, IMHO, create other issues.

> 
> Same thing with the WARN - make the potential firmware bug much more
> obvious and visible.
> 

Yeah, but I doubt this is really necessary. No strong feelings.
David Hildenbrand April 16, 2020, 5:53 p.m. UTC | #4
On 16.04.20 19:25, David Hildenbrand wrote:
> On 16.04.20 19:23, Verma, Vishal L wrote:
>> On Thu, 2020-04-16 at 19:12 +0200, David Hildenbrand wrote:
>>> On 16.04.20 19:10, Vishal Verma wrote:
>>>>
>>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>>> index 0a54ffac8c68..ddd3347edd54 100644
>>>> --- a/mm/memory_hotplug.c
>>>> +++ b/mm/memory_hotplug.c
>>>> @@ -1005,6 +1005,11 @@ int __ref add_memory_resource(int nid, struct resource *res)
>>>>  	if (ret)
>>>>  		return ret;
>>>>  
>>>> +	if (!node_possible(nid)) {
>>>> +		WARN(1, "node %d was absent from the node_possible_map\n", nid);
>>>> +		return -ENXIO;
>>>
>>> Nit: I suggest using "-EINVAL" instead (e.g., returned via
>>> check_hotplug_memory_range).
>>>
>>> Not sure if we should pr_err() instead of WARN (see e.g.,
>>> check_hotplug_memory_range)
>>>
>> Hm, I'm happy to make the changes, but EINVAL to me suggests there is a
>> problem in the way this was called by the user. And in this case there
>> really might not be much the user can change in case of buggy firmware.
> 
> Yeah, but introducing new return codes that callers might not expect
> could, IMHO, create other issues.
> 
>>
>> Same thing with the WARN - make the potential firmware bug much more
>> obvious and visible.
>>
> 
> Yeah, but I doubt this is really necessary. No strong feelings.
> 

Forgot to

Acked-by: David Hildenbrand <david@redhat.com>
Verma, Vishal L April 16, 2020, 10:47 p.m. UTC | #5
On Thu, 2020-04-16 at 19:53 +0200, David Hildenbrand wrote:
> > > > 
> > > Hm, I'm happy to make the changes, but EINVAL to me suggests there is a
> > > problem in the way this was called by the user. And in this case there
> > > really might not be much the user can change in case of buggy firmware.
> > 
> > Yeah, but introducing new return codes that callers might not expect
> > could, IMHO, create other issues.
> > 
> > > Same thing with the WARN - make the potential firmware bug much more
> > > obvious and visible.
> > > 
> > 
> > Yeah, but I doubt this is really necessary. No strong feelings.
> > 
> 
> Forgot to
> 
> Acked-by: David Hildenbrand <david@redhat.com>
> 
Thanks for the review David. I'll change the return code, and keep the
WARN, and send a new version.
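
For illustration only, here is a userspace sketch of how the check might look with the return code changed to -EINVAL and the loud warning kept, as discussed above. This is not the actual follow-up patch: MOCK_WARN, possible_nodes, and check_node_for_hotplug() are hypothetical names, and the macro only imitates the fact that the kernel's WARN() evaluates to its condition (the real one also prints a backtrace).

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for WARN(): when cond is true, log the message
 * with its source location and evaluate to 1; otherwise evaluate to 0. */
#define MOCK_WARN(cond, ...) \
	((cond) ? (fprintf(stderr, "WARNING at %s:%d: ", __FILE__, __LINE__), \
		   fprintf(stderr, __VA_ARGS__), 1) : 0)

/* Hypothetical map for the example: only nodes 0 and 1 are possible. */
static const unsigned long possible_nodes = 0x3UL;

static int check_node_for_hotplug(int nid)
{
	/* Range-check first so the shift below is always well defined. */
	int possible = nid >= 0 &&
		       nid < (int)(8 * sizeof(possible_nodes)) &&
		       ((possible_nodes >> nid) & 1UL);

	/* Keep the warning visible, but return a code that existing
	 * callers of the hotplug path already expect. */
	if (MOCK_WARN(!possible,
		      "node %d was absent from the node_possible_map\n", nid))
		return -EINVAL;
	return 0;
}
```

The design point from the thread is that callers see an error code they already handle, while the warning still makes a likely firmware bug hard to miss in the log.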
Patch

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0a54ffac8c68..ddd3347edd54 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1005,6 +1005,11 @@  int __ref add_memory_resource(int nid, struct resource *res)
 	if (ret)
 		return ret;
 
+	if (!node_possible(nid)) {
+		WARN(1, "node %d was absent from the node_possible_map\n", nid);
+		return -ENXIO;
+	}
+
 	mem_hotplug_begin();
 
 	/*