migration/mm: Add WARN_ON to try_offline_node

Message ID	20181001185616.11427.35521.stgit@ltcalpine2-lp9.aus.stglabs.ibm.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <owner-linux-mm@kvack.org> Received-SPF: neutral (google.com: 148.163.158.5 is neither permitted nor denied by best guess record for domain of mwb@linux.vnet.ibm.com) client-ip=148.163.158.5; Gateway: Authorized Use Only! Violators will be prosecuted for <linux-mm@kvack.org> from <mwb@linux.vnet.ibm.com>; Mon, 1 Oct 2018 12:56:31 -0600 Gateway: Authorized Use Only! Violators will be prosecuted; (version=TLSv1/SSLv3 cipher=AES256-GCM-SHA384 bits=256/256) Mon, 1 Oct 2018 12:56:26 -0600 Subject: [PATCH] migration/mm: Add WARN_ON to try_offline_node From: Michael Bringmann <mwb@linux.vnet.ibm.com> To: linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mwb@linux.vnet.ibm.com Cc: Michael Ellerman <mpe@ellerman.id.au>, Nathan Fontenot <nfont@linux.vnet.ibm.com>, Nicholas Piggin <npiggin@gmail.com>, Kees Cook <keescook@chromium.org>, Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>, Russell Currey <ruscur@russell.cc>, Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>, Christophe Leroy <christophe.leroy@c-s.fr>, Andrew Morton <akpm@linux-foundation.org>, Michal Hocko <mhocko@suse.com>, Pavel Tatashin <pasha.tatashin@oracle.com>, Dan Williams <dan.j.williams@intel.com>, Oscar Salvador <osalvador@suse.de>, YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>, Mathieu Malaterre <malat@debian.org>, Juliet Kim <minkim@us.ibm.com>, Tyrel Datwyler <tyreld@linux.vnet.ibm.com>, Thomas Falcon <tlfalcon@linux.vnet.ibm.com> Date: Mon, 01 Oct 2018 13:56:25 -0500 User-Agent: StGit/0.18-105-g416a MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Message-Id: <20181001185616.11427.35521.stgit@ltcalpine2-lp9.aus.stglabs.ibm.com> Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	migration/mm: Add WARN_ON to try_offline_node \| expand migration/mm: Add WARN_ON to try_offline_node

Michael Bringmann Oct. 1, 2018, 6:56 p.m. UTC

In some LPAR migration scenarios, device-tree modifications are
made to the affinity of the memory in the system.  For instance,
it may occur that memory is installed to nodes 0,3 on a source
system, and to nodes 0,2 on a target system.  Node 2 may not
have been initialized/allocated on the target system.

After migration, if a RTAS PRRN memory remove is made to a
memory block that was in node 3 on the source system, then
try_offline_node tries to remove it from node 2 on the target.
The NODE_DATA(2) block would not be initialized on the target,
and there is no validation check in the current code to prevent
the use of a NULL pointer.  Call traces such as the following
may be observed:

A similar problem of moving memory to an unitialized node has
also been observed on systems where multiple PRRN events occur
prior to a complete update of the device-tree.

pseries-hotplug-mem: Attempting to update LMB, drc index 80000002
Offlined Pages 4096
...
Oops: Kernel access of bad area, sig: 11 [#1]
...
Workqueue: pseries hotplug workque pseries_hp_work_fn
...
NIP [c0000000002bc088] try_offline_node+0x48/0x1e0
LR [c0000000002e0b84] remove_memory+0xb4/0xf0
Call Trace:
[c0000002bbee7a30] [c0000002bbee7a70] 0xc0000002bbee7a70 (unreliable)
[c0000002bbee7a70] [c0000000002e0b84] remove_memory+0xb4/0xf0
[c0000002bbee7ab0] [c000000000097784] dlpar_remove_lmb+0xb4/0x160
[c0000002bbee7af0] [c000000000097f38] dlpar_memory+0x328/0xcb0
[c0000002bbee7ba0] [c0000000000906d0] handle_dlpar_errorlog+0xc0/0x130
[c0000002bbee7c10] [c0000000000907d4] pseries_hp_work_fn+0x94/0xa0
[c0000002bbee7c40] [c0000000000e1cd0] process_one_work+0x1a0/0x4e0
[c0000002bbee7cd0] [c0000000000e21b0] worker_thread+0x1a0/0x610
[c0000002bbee7d80] [c0000000000ea458] kthread+0x128/0x150
[c0000002bbee7e30] [c00000000000982c] ret_from_kernel_thread+0x5c/0xb0

This patch adds a check for an incorrectly initialized to the
beginning of try_offline_node, and exits the routine.

Another patch is being developed for powerpc to track the
node Id to which an LMB belongs, so that we can remove the
LMB from there instead of the nid as currently interpreted
from the device tree.

Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
---
 mm/memory_hotplug.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

Kees Cook Oct. 1, 2018, 8:02 p.m. UTC | #1

On Mon, Oct 1, 2018 at 11:56 AM, Michael Bringmann
<mwb@linux.vnet.ibm.com> wrote:
> In some LPAR migration scenarios, device-tree modifications are
> made to the affinity of the memory in the system.  For instance,
> it may occur that memory is installed to nodes 0,3 on a source
> system, and to nodes 0,2 on a target system.  Node 2 may not
> have been initialized/allocated on the target system.
>
> After migration, if a RTAS PRRN memory remove is made to a
> memory block that was in node 3 on the source system, then
> try_offline_node tries to remove it from node 2 on the target.
> The NODE_DATA(2) block would not be initialized on the target,
> and there is no validation check in the current code to prevent
> the use of a NULL pointer.  Call traces such as the following
> may be observed:
>
> A similar problem of moving memory to an unitialized node has
> also been observed on systems where multiple PRRN events occur
> prior to a complete update of the device-tree.
>
> pseries-hotplug-mem: Attempting to update LMB, drc index 80000002
> Offlined Pages 4096
> ...
> Oops: Kernel access of bad area, sig: 11 [#1]
> ...
> Workqueue: pseries hotplug workque pseries_hp_work_fn
> ...
> NIP [c0000000002bc088] try_offline_node+0x48/0x1e0
> LR [c0000000002e0b84] remove_memory+0xb4/0xf0
> Call Trace:
> [c0000002bbee7a30] [c0000002bbee7a70] 0xc0000002bbee7a70 (unreliable)
> [c0000002bbee7a70] [c0000000002e0b84] remove_memory+0xb4/0xf0
> [c0000002bbee7ab0] [c000000000097784] dlpar_remove_lmb+0xb4/0x160
> [c0000002bbee7af0] [c000000000097f38] dlpar_memory+0x328/0xcb0
> [c0000002bbee7ba0] [c0000000000906d0] handle_dlpar_errorlog+0xc0/0x130
> [c0000002bbee7c10] [c0000000000907d4] pseries_hp_work_fn+0x94/0xa0
> [c0000002bbee7c40] [c0000000000e1cd0] process_one_work+0x1a0/0x4e0
> [c0000002bbee7cd0] [c0000000000e21b0] worker_thread+0x1a0/0x610
> [c0000002bbee7d80] [c0000000000ea458] kthread+0x128/0x150
> [c0000002bbee7e30] [c00000000000982c] ret_from_kernel_thread+0x5c/0xb0
>
> This patch adds a check for an incorrectly initialized to the
> beginning of try_offline_node, and exits the routine.
>
> Another patch is being developed for powerpc to track the
> node Id to which an LMB belongs, so that we can remove the
> LMB from there instead of the nid as currently interpreted
> from the device tree.
>
> Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> ---
>  mm/memory_hotplug.c |   10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 38d94b7..e48a4d0 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1831,10 +1831,16 @@ static int check_and_unmap_cpu_on_node(pg_data_t *pgdat)
>  void try_offline_node(int nid)
>  {
>         pg_data_t *pgdat = NODE_DATA(nid);
> -       unsigned long start_pfn = pgdat->node_start_pfn;
> -       unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
> +       unsigned long start_pfn;
> +       unsigned long end_pfn;
>         unsigned long pfn;
>
> +       if (WARN_ON(pgdat == NULL))
> +               return;
> +
> +       start_pfn = pgdat->node_start_pfn;
> +       end_pfn = start_pfn + pgdat->node_spanned_pages;
> +
>         for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>                 unsigned long section_nr = pfn_to_section_nr(pfn);
>
>

Michal Hocko Oct. 1, 2018, 8:27 p.m. UTC | #2

On Mon 01-10-18 13:56:25, Michael Bringmann wrote:
> In some LPAR migration scenarios, device-tree modifications are
> made to the affinity of the memory in the system.  For instance,
> it may occur that memory is installed to nodes 0,3 on a source
> system, and to nodes 0,2 on a target system.  Node 2 may not
> have been initialized/allocated on the target system.
> 
> After migration, if a RTAS PRRN memory remove is made to a
> memory block that was in node 3 on the source system, then
> try_offline_node tries to remove it from node 2 on the target.
> The NODE_DATA(2) block would not be initialized on the target,
> and there is no validation check in the current code to prevent
> the use of a NULL pointer.

I am not familiar with ppc and the above doesn't really help me
much. Sorry about that. But from the above it is not clear to me whether
it is the caller which does something unexpected or the hotplug code
being not robust enough. From your changelog I would suggest the later
but why don't we see the same problem for other archs? Is this a problem
of unrolling a partial failure?

dlpar_remove_lmb does the following

	nid = memory_add_physaddr_to_nid(lmb->base_addr);

	remove_memory(nid, lmb->base_addr, block_sz);

	/* Update memory regions for memory remove */
	memblock_remove(lmb->base_addr, block_sz);

	dlpar_remove_device_tree_lmb(lmb);

Is the whole operation correct when remove_memory simply backs off
silently. Why don't we have to care about memblock resp
dlpar_remove_device_tree_lmb parts? In other words how come the physical
memory range is valid while the node association is not?

Tyrel Datwyler Oct. 1, 2018, 11:20 p.m. UTC | #3

On 10/01/2018 01:27 PM, Michal Hocko wrote:
> On Mon 01-10-18 13:56:25, Michael Bringmann wrote:
>> In some LPAR migration scenarios, device-tree modifications are
>> made to the affinity of the memory in the system.  For instance,
>> it may occur that memory is installed to nodes 0,3 on a source
>> system, and to nodes 0,2 on a target system.  Node 2 may not
>> have been initialized/allocated on the target system.
>>
>> After migration, if a RTAS PRRN memory remove is made to a
>> memory block that was in node 3 on the source system, then
>> try_offline_node tries to remove it from node 2 on the target.
>> The NODE_DATA(2) block would not be initialized on the target,
>> and there is no validation check in the current code to prevent
>> the use of a NULL pointer.
> 
> I am not familiar with ppc and the above doesn't really help me
> much. Sorry about that. But from the above it is not clear to me whether
> it is the caller which does something unexpected or the hotplug code
> being not robust enough. From your changelog I would suggest the later
> but why don't we see the same problem for other archs? Is this a problem
> of unrolling a partial failure?
> 
> dlpar_remove_lmb does the following
> 
> 	nid = memory_add_physaddr_to_nid(lmb->base_addr);
> 
> 	remove_memory(nid, lmb->base_addr, block_sz);
> 
> 	/* Update memory regions for memory remove */
> 	memblock_remove(lmb->base_addr, block_sz);
> 
> 	dlpar_remove_device_tree_lmb(lmb);
> 
> Is the whole operation correct when remove_memory simply backs off
> silently. Why don't we have to care about memblock resp
> dlpar_remove_device_tree_lmb parts? In other words how come the physical
> memory range is valid while the node association is not?
> 

I think the issue here is a race between the LPM code updating affinity and PRRN events being processed. Does your other patch[1] not fix the issue? Or is it that the LPM affinity updates don't do any of the initialization/allocation you mentioned?

-Tyrel

[1] https://lore.kernel.org/linuxppc-dev/20181001185603.11373.61650.stgit@ltcalpine2-lp9.aus.stglabs.ibm.com/T/#u

Tyrel Datwyler Oct. 1, 2018, 11:23 p.m. UTC | #4

On 10/01/2018 01:27 PM, Michal Hocko wrote:
> On Mon 01-10-18 13:56:25, Michael Bringmann wrote:
>> In some LPAR migration scenarios, device-tree modifications are
>> made to the affinity of the memory in the system.  For instance,
>> it may occur that memory is installed to nodes 0,3 on a source
>> system, and to nodes 0,2 on a target system.  Node 2 may not
>> have been initialized/allocated on the target system.
>>
>> After migration, if a RTAS PRRN memory remove is made to a
>> memory block that was in node 3 on the source system, then
>> try_offline_node tries to remove it from node 2 on the target.
>> The NODE_DATA(2) block would not be initialized on the target,
>> and there is no validation check in the current code to prevent
>> the use of a NULL pointer.
> 
> I am not familiar with ppc and the above doesn't really help me
> much. Sorry about that. But from the above it is not clear to me whether
> it is the caller which does something unexpected or the hotplug code
> being not robust enough. From your changelog I would suggest the later
> but why don't we see the same problem for other archs? Is this a problem
> of unrolling a partial failure?
> 
> dlpar_remove_lmb does the following
> 
> 	nid = memory_add_physaddr_to_nid(lmb->base_addr);
> 
> 	remove_memory(nid, lmb->base_addr, block_sz);
> 
> 	/* Update memory regions for memory remove */
> 	memblock_remove(lmb->base_addr, block_sz);
> 
> 	dlpar_remove_device_tree_lmb(lmb);
> 
> Is the whole operation correct when remove_memory simply backs off
> silently. Why don't we have to care about memblock resp
> dlpar_remove_device_tree_lmb parts? In other words how come the physical
> memory range is valid while the node association is not?
> 

I guess with respect to my previous reply that patch in conjunction with this patch set as well?

https://lore.kernel.org/linuxppc-dev/20181001125846.2676.89826.stgit@ltcalpine2-lp9.aus.stglabs.ibm.com/T/#t

-Tyrel

Michael Bringmann Oct. 2, 2018, 2:51 p.m. UTC | #5

See below.

On 10/01/2018 06:20 PM, Tyrel Datwyler wrote:
> On 10/01/2018 01:27 PM, Michal Hocko wrote:
>> On Mon 01-10-18 13:56:25, Michael Bringmann wrote:
>>> In some LPAR migration scenarios, device-tree modifications are
>>> made to the affinity of the memory in the system.  For instance,
>>> it may occur that memory is installed to nodes 0,3 on a source
>>> system, and to nodes 0,2 on a target system.  Node 2 may not
>>> have been initialized/allocated on the target system.
>>>
>>> After migration, if a RTAS PRRN memory remove is made to a
>>> memory block that was in node 3 on the source system, then
>>> try_offline_node tries to remove it from node 2 on the target.
>>> The NODE_DATA(2) block would not be initialized on the target,
>>> and there is no validation check in the current code to prevent
>>> the use of a NULL pointer.
>>
>> I am not familiar with ppc and the above doesn't really help me
>> much. Sorry about that. But from the above it is not clear to me whether
>> it is the caller which does something unexpected or the hotplug code
>> being not robust enough. From your changelog I would suggest the later
>> but why don't we see the same problem for other archs? Is this a problem
>> of unrolling a partial failure?
>>
>> dlpar_remove_lmb does the following
>>
>> 	nid = memory_add_physaddr_to_nid(lmb->base_addr);
>>
>> 	remove_memory(nid, lmb->base_addr, block_sz);
>>
>> 	/* Update memory regions for memory remove */
>> 	memblock_remove(lmb->base_addr, block_sz);
>>
>> 	dlpar_remove_device_tree_lmb(lmb);
>>
>> Is the whole operation correct when remove_memory simply backs off
>> silently. Why don't we have to care about memblock resp
>> dlpar_remove_device_tree_lmb parts? In other words how come the physical
>> memory range is valid while the node association is not?
>>
> 
> I think the issue here is a race between the LPM code updating affinity and PRRN events being processed. Does your other patch[1] not fix the issue? Or is it that the LPM affinity updates don't do any of the initialization/allocation you mentioned?

This patch addresses the specific case where PRRN changes to CPU or memory are occurring on a system that also observes affinity changes during migration.  Yes,  there is a race condition -- if the PRRN events reliably occurred before the device-tree was updated, this error would not occur.  However, they overlap or occur after the changes during most LPMs observed.

When the device-tree affinity attributes have changed for memory, the 'nid' affinity calculated points to a different node for the memory block than the one used to install it, previously on the source system.  The newly calculated 'nid' affinity may not yet be initialized on the target system.  The current memory tracking mechanisms do not record the node to which a memory block was associated when it was added.  Nathan is looking at adding this feature to the new implementation of LMBs, but it is not there yet, and won't be present in earlier kernels without backporting a significant number of changes.

My other patch[1] is more intended to address locking and CPU update dependencies between PRRN changes and RTAS requests which did not necessarly involve memory updates.  The node to memory association problem after migration would be present even with this issue resolved.

> 
> -Tyrel
> 
> [1] https://lore.kernel.org/linuxppc-dev/20181001185603.11373.61650.stgit@ltcalpine2-lp9.aus.stglabs.ibm.com/T/#u
> 

Michael

Michal Hocko Oct. 2, 2018, 2:59 p.m. UTC | #6

On Tue 02-10-18 09:51:40, Michael Bringmann wrote:
[...]
> When the device-tree affinity attributes have changed for memory,
> the 'nid' affinity calculated points to a different node for the
> memory block than the one used to install it, previously on the
> source system.  The newly calculated 'nid' affinity may not yet
> be initialized on the target system.  The current memory tracking
> mechanisms do not record the node to which a memory block was
> associated when it was added.  Nathan is looking at adding this
> feature to the new implementation of LMBs, but it is not there
> yet, and won't be present in earlier kernels without backporting a
> significant number of changes.

Then the patch you have proposed here just papers over a real issue, no?
IIUC then you simply do not remove the memory if you lose the race.

Michael Bringmann Oct. 2, 2018, 3:14 p.m. UTC | #7

On 10/02/2018 09:59 AM, Michal Hocko wrote:
> On Tue 02-10-18 09:51:40, Michael Bringmann wrote:
> [...]
>> When the device-tree affinity attributes have changed for memory,
>> the 'nid' affinity calculated points to a different node for the
>> memory block than the one used to install it, previously on the
>> source system.  The newly calculated 'nid' affinity may not yet
>> be initialized on the target system.  The current memory tracking
>> mechanisms do not record the node to which a memory block was
>> associated when it was added.  Nathan is looking at adding this
>> feature to the new implementation of LMBs, but it is not there
>> yet, and won't be present in earlier kernels without backporting a
>> significant number of changes.
> 
> Then the patch you have proposed here just papers over a real issue, no?
> IIUC then you simply do not remove the memory if you lose the race.

The problem occurs when removing memory after an affinity change references a node that was previously unreferenced.  Other code in 'kernel/mm/memory_hotplug.c' deals with initializing an empty node when adding memory to a system.  The   'removing memory' case is specific to systems that perform LPM and allow device-tree changes.  The powerpc kernel does not have the option of accepting some PRRN requests and accepting others.  It must perform them all.

The kernel/mm code that removes memory blocks does not (before this patch) recognize that the affinity of a memory block could have changed to a previously unused node.  If every path to try_offline_node made such a check, then this patch would be unnecessary.  However, putting a patch at a single location to check for a relatively rare occurrence, would seem to be a more efficient implementation.

Michael

Michal Hocko Oct. 2, 2018, 4:04 p.m. UTC | #8

On Tue 02-10-18 10:14:49, Michael Bringmann wrote:
> On 10/02/2018 09:59 AM, Michal Hocko wrote:
> > On Tue 02-10-18 09:51:40, Michael Bringmann wrote:
> > [...]
> >> When the device-tree affinity attributes have changed for memory,
> >> the 'nid' affinity calculated points to a different node for the
> >> memory block than the one used to install it, previously on the
> >> source system.  The newly calculated 'nid' affinity may not yet
> >> be initialized on the target system.  The current memory tracking
> >> mechanisms do not record the node to which a memory block was
> >> associated when it was added.  Nathan is looking at adding this
> >> feature to the new implementation of LMBs, but it is not there
> >> yet, and won't be present in earlier kernels without backporting a
> >> significant number of changes.
> > 
> > Then the patch you have proposed here just papers over a real issue, no?
> > IIUC then you simply do not remove the memory if you lose the race.
> 
> The problem occurs when removing memory after an affinity change
> references a node that was previously unreferenced.  Other code
> in 'kernel/mm/memory_hotplug.c' deals with initializing an empty
> node when adding memory to a system.  The 'removing memory' case is
> specific to systems that perform LPM and allow device-tree changes.
> The powerpc kernel does not have the option of accepting some PRRN
> requests and accepting others.  It must perform them all.

I am sorry, but you are still too cryptic for me. Either there is a
correctness issue and the the patch doesn't really fix anything or the
final race doesn't make any difference and then the ppc code should be
explicit about that. Checking the node inside the hotplug core code just
looks as a wrong layer to mitigate an arch specific problem. I am not
saying the patch is a no-go but if anything we want a big fat comment
explaining how this is possible because right now it just points to an
incorrect API usage.

That being said, this sounds pretty much ppc specific problem and I
would _prefer_ it to be handled there (along with a big fat comment of
course).

Michael Bringmann Oct. 2, 2018, 6:13 p.m. UTC | #9

On 10/02/2018 11:04 AM, Michal Hocko wrote:
> On Tue 02-10-18 10:14:49, Michael Bringmann wrote:
>> On 10/02/2018 09:59 AM, Michal Hocko wrote:
>>> On Tue 02-10-18 09:51:40, Michael Bringmann wrote:
>>> [...]
>>>> When the device-tree affinity attributes have changed for memory,
>>>> the 'nid' affinity calculated points to a different node for the
>>>> memory block than the one used to install it, previously on the
>>>> source system.  The newly calculated 'nid' affinity may not yet
>>>> be initialized on the target system.  The current memory tracking
>>>> mechanisms do not record the node to which a memory block was
>>>> associated when it was added.  Nathan is looking at adding this
>>>> feature to the new implementation of LMBs, but it is not there
>>>> yet, and won't be present in earlier kernels without backporting a
>>>> significant number of changes.
>>>
>>> Then the patch you have proposed here just papers over a real issue, no?
>>> IIUC then you simply do not remove the memory if you lose the race.
>>
>> The problem occurs when removing memory after an affinity change
>> references a node that was previously unreferenced.  Other code
>> in 'kernel/mm/memory_hotplug.c' deals with initializing an empty
>> node when adding memory to a system.  The 'removing memory' case is
>> specific to systems that perform LPM and allow device-tree changes.
>> The powerpc kernel does not have the option of accepting some PRRN
>> requests and accepting others.  It must perform them all.
> 
> I am sorry, but you are still too cryptic for me. Either there is a
> correctness issue and the the patch doesn't really fix anything or the
> final race doesn't make any difference and then the ppc code should be
> explicit about that. Checking the node inside the hotplug core code just
> looks as a wrong layer to mitigate an arch specific problem. I am not
> saying the patch is a no-go but if anything we want a big fat comment
> explaining how this is possible because right now it just points to an
> incorrect API usage.
> 
> That being said, this sounds pretty much ppc specific problem and I
> would _prefer_ it to be handled there (along with a big fat comment of
> course).

Let me try again.  Regardless of the path to which we get to this condition,
we currently crash the kernel.  This patch changes that to a WARN_ON notice
and continues executing the kernel without shutting down the system.  I saw
the problem during powerpc testing, because that is the focus of my work.
There are other paths to this function besides powerpc.  I feel that the
kernel should keep running instead of halting.

Regards,

Tyrel Datwyler Oct. 2, 2018, 7:45 p.m. UTC | #10

On 10/02/2018 11:13 AM, Michael Bringmann wrote:
> 
> 
> On 10/02/2018 11:04 AM, Michal Hocko wrote:
>> On Tue 02-10-18 10:14:49, Michael Bringmann wrote:
>>> On 10/02/2018 09:59 AM, Michal Hocko wrote:
>>>> On Tue 02-10-18 09:51:40, Michael Bringmann wrote:
>>>> [...]
>>>>> When the device-tree affinity attributes have changed for memory,
>>>>> the 'nid' affinity calculated points to a different node for the
>>>>> memory block than the one used to install it, previously on the
>>>>> source system.  The newly calculated 'nid' affinity may not yet
>>>>> be initialized on the target system.  The current memory tracking
>>>>> mechanisms do not record the node to which a memory block was
>>>>> associated when it was added.  Nathan is looking at adding this
>>>>> feature to the new implementation of LMBs, but it is not there
>>>>> yet, and won't be present in earlier kernels without backporting a
>>>>> significant number of changes.
>>>>
>>>> Then the patch you have proposed here just papers over a real issue, no?
>>>> IIUC then you simply do not remove the memory if you lose the race.
>>>
>>> The problem occurs when removing memory after an affinity change
>>> references a node that was previously unreferenced.  Other code
>>> in 'kernel/mm/memory_hotplug.c' deals with initializing an empty
>>> node when adding memory to a system.  The 'removing memory' case is
>>> specific to systems that perform LPM and allow device-tree changes.
>>> The powerpc kernel does not have the option of accepting some PRRN
>>> requests and accepting others.  It must perform them all.
>>
>> I am sorry, but you are still too cryptic for me. Either there is a
>> correctness issue and the the patch doesn't really fix anything or the
>> final race doesn't make any difference and then the ppc code should be
>> explicit about that. Checking the node inside the hotplug core code just
>> looks as a wrong layer to mitigate an arch specific problem. I am not
>> saying the patch is a no-go but if anything we want a big fat comment
>> explaining how this is possible because right now it just points to an
>> incorrect API usage.
>>
>> That being said, this sounds pretty much ppc specific problem and I
>> would _prefer_ it to be handled there (along with a big fat comment of
>> course).
> 
> Let me try again.  Regardless of the path to which we get to this condition,
> we currently crash the kernel.  This patch changes that to a WARN_ON notice
> and continues executing the kernel without shutting down the system.  I saw
> the problem during powerpc testing, because that is the focus of my work.
> There are other paths to this function besides powerpc.  I feel that the
> kernel should keep running instead of halting.

This is still basically a hack to get around a known race. In itself this patch is still worth while in that we shouldn't crash the kernel on a null pointer dereference. However, I think the actual problem still needs to be addressed. We shouldn't run any PRRN events for the source system on the target after a migration. The device tree update should have taken care of telling us about new affinities and what not. Can we just throw out any queued PRRN events when we wake up on the target?

-Tyrel
> 
> Regards,
>

Michal Hocko Oct. 3, 2018, 7:03 a.m. UTC | #11

On Tue 02-10-18 12:45:50, Tyrel Datwyler wrote:
> On 10/02/2018 11:13 AM, Michael Bringmann wrote:
> > 
> > 
> > On 10/02/2018 11:04 AM, Michal Hocko wrote:
> >> On Tue 02-10-18 10:14:49, Michael Bringmann wrote:
> >>> On 10/02/2018 09:59 AM, Michal Hocko wrote:
> >>>> On Tue 02-10-18 09:51:40, Michael Bringmann wrote:
> >>>> [...]
> >>>>> When the device-tree affinity attributes have changed for memory,
> >>>>> the 'nid' affinity calculated points to a different node for the
> >>>>> memory block than the one used to install it, previously on the
> >>>>> source system.  The newly calculated 'nid' affinity may not yet
> >>>>> be initialized on the target system.  The current memory tracking
> >>>>> mechanisms do not record the node to which a memory block was
> >>>>> associated when it was added.  Nathan is looking at adding this
> >>>>> feature to the new implementation of LMBs, but it is not there
> >>>>> yet, and won't be present in earlier kernels without backporting a
> >>>>> significant number of changes.
> >>>>
> >>>> Then the patch you have proposed here just papers over a real issue, no?
> >>>> IIUC then you simply do not remove the memory if you lose the race.
> >>>
> >>> The problem occurs when removing memory after an affinity change
> >>> references a node that was previously unreferenced.  Other code
> >>> in 'kernel/mm/memory_hotplug.c' deals with initializing an empty
> >>> node when adding memory to a system.  The 'removing memory' case is
> >>> specific to systems that perform LPM and allow device-tree changes.
> >>> The powerpc kernel does not have the option of accepting some PRRN
> >>> requests and accepting others.  It must perform them all.
> >>
> >> I am sorry, but you are still too cryptic for me. Either there is a
> >> correctness issue and the the patch doesn't really fix anything or the
> >> final race doesn't make any difference and then the ppc code should be
> >> explicit about that. Checking the node inside the hotplug core code just
> >> looks as a wrong layer to mitigate an arch specific problem. I am not
> >> saying the patch is a no-go but if anything we want a big fat comment
> >> explaining how this is possible because right now it just points to an
> >> incorrect API usage.
> >>
> >> That being said, this sounds pretty much ppc specific problem and I
> >> would _prefer_ it to be handled there (along with a big fat comment of
> >> course).
> > 
> > Let me try again.  Regardless of the path to which we get to this condition,
> > we currently crash the kernel.  This patch changes that to a WARN_ON notice
> > and continues executing the kernel without shutting down the system.  I saw
> > the problem during powerpc testing, because that is the focus of my work.
> > There are other paths to this function besides powerpc.  I feel that the
> > kernel should keep running instead of halting.
> 
> This is still basically a hack to get around a known race. In itself
> this patch is still worth while in that we shouldn't crash the kernel
> on a null pointer dereference. However, I think the actual problem
> still needs to be addressed. We shouldn't run any PRRN events for the
> source system on the target after a migration. The device tree update
> should have taken care of telling us about new affinities and what
> not. Can we just throw out any queued PRRN events when we wake up on
> the target?

And until a proper fix is developed can we have NODE_DATA test in the
affected code rather than pollute the generic code with something that
is essentially a wrong usage of the API? With a big fat warning
explaining what is going on here?

Michael Bringmann Oct. 3, 2018, 1:27 p.m. UTC | #12

On 10/02/2018 02:45 PM, Tyrel Datwyler wrote:
> On 10/02/2018 11:13 AM, Michael Bringmann wrote:
>>
>>
>> On 10/02/2018 11:04 AM, Michal Hocko wrote:
>>> On Tue 02-10-18 10:14:49, Michael Bringmann wrote:
>>>> On 10/02/2018 09:59 AM, Michal Hocko wrote:
>>>>> On Tue 02-10-18 09:51:40, Michael Bringmann wrote:
>>>>> [...]
>>>>>> When the device-tree affinity attributes have changed for memory,
>>>>>> the 'nid' affinity calculated points to a different node for the
>>>>>> memory block than the one used to install it, previously on the
>>>>>> source system.  The newly calculated 'nid' affinity may not yet
>>>>>> be initialized on the target system.  The current memory tracking
>>>>>> mechanisms do not record the node to which a memory block was
>>>>>> associated when it was added.  Nathan is looking at adding this
>>>>>> feature to the new implementation of LMBs, but it is not there
>>>>>> yet, and won't be present in earlier kernels without backporting a
>>>>>> significant number of changes.
>>>>>
>>>>> Then the patch you have proposed here just papers over a real issue, no?
>>>>> IIUC then you simply do not remove the memory if you lose the race.
>>>>
>>>> The problem occurs when removing memory after an affinity change
>>>> references a node that was previously unreferenced.  Other code
>>>> in 'kernel/mm/memory_hotplug.c' deals with initializing an empty
>>>> node when adding memory to a system.  The 'removing memory' case is
>>>> specific to systems that perform LPM and allow device-tree changes.
>>>> The powerpc kernel does not have the option of accepting some PRRN
>>>> requests and accepting others.  It must perform them all.
>>>
>>> I am sorry, but you are still too cryptic for me. Either there is a
>>> correctness issue and the the patch doesn't really fix anything or the
>>> final race doesn't make any difference and then the ppc code should be
>>> explicit about that. Checking the node inside the hotplug core code just
>>> looks as a wrong layer to mitigate an arch specific problem. I am not
>>> saying the patch is a no-go but if anything we want a big fat comment
>>> explaining how this is possible because right now it just points to an
>>> incorrect API usage.
>>>
>>> That being said, this sounds pretty much ppc specific problem and I
>>> would _prefer_ it to be handled there (along with a big fat comment of
>>> course).
>>
>> Let me try again.  Regardless of the path to which we get to this condition,
>> we currently crash the kernel.  This patch changes that to a WARN_ON notice
>> and continues executing the kernel without shutting down the system.  I saw
>> the problem during powerpc testing, because that is the focus of my work.
>> There are other paths to this function besides powerpc.  I feel that the
>> kernel should keep running instead of halting.
> 
> This is still basically a hack to get around a known race. In itself this patch is still worth while in that we shouldn't crash the kernel on a null pointer dereference. However, I think the actual problem still needs to be addressed. We shouldn't run any PRRN events for the source system on the target after a migration. The device tree update should have taken care of telling us about new affinities and what not. Can we just throw out any queued PRRN events when we wake up on the target?

We are not talking about queued events provided on the source system, but about
new PRRN events sent by phyp to the kernel on the target system to update the
kernel state after migration.  No way to predict the content.

> 
> -Tyrel
>>
>> Regards,
>>

Michael

Tyrel Datwyler Oct. 3, 2018, 11:05 p.m. UTC | #13

On 10/03/2018 06:27 AM, Michael Bringmann wrote:
> On 10/02/2018 02:45 PM, Tyrel Datwyler wrote:
>> On 10/02/2018 11:13 AM, Michael Bringmann wrote:
>>>
>>>
>>> On 10/02/2018 11:04 AM, Michal Hocko wrote:
>>>> On Tue 02-10-18 10:14:49, Michael Bringmann wrote:
>>>>> On 10/02/2018 09:59 AM, Michal Hocko wrote:
>>>>>> On Tue 02-10-18 09:51:40, Michael Bringmann wrote:
>>>>>> [...]
>>>>>>> When the device-tree affinity attributes have changed for memory,
>>>>>>> the 'nid' affinity calculated points to a different node for the
>>>>>>> memory block than the one used to install it, previously on the
>>>>>>> source system.  The newly calculated 'nid' affinity may not yet
>>>>>>> be initialized on the target system.  The current memory tracking
>>>>>>> mechanisms do not record the node to which a memory block was
>>>>>>> associated when it was added.  Nathan is looking at adding this
>>>>>>> feature to the new implementation of LMBs, but it is not there
>>>>>>> yet, and won't be present in earlier kernels without backporting a
>>>>>>> significant number of changes.
>>>>>>
>>>>>> Then the patch you have proposed here just papers over a real issue, no?
>>>>>> IIUC then you simply do not remove the memory if you lose the race.
>>>>>
>>>>> The problem occurs when removing memory after an affinity change
>>>>> references a node that was previously unreferenced.  Other code
>>>>> in 'kernel/mm/memory_hotplug.c' deals with initializing an empty
>>>>> node when adding memory to a system.  The 'removing memory' case is
>>>>> specific to systems that perform LPM and allow device-tree changes.
>>>>> The powerpc kernel does not have the option of accepting some PRRN
>>>>> requests and accepting others.  It must perform them all.
>>>>
>>>> I am sorry, but you are still too cryptic for me. Either there is a
>>>> correctness issue and the the patch doesn't really fix anything or the
>>>> final race doesn't make any difference and then the ppc code should be
>>>> explicit about that. Checking the node inside the hotplug core code just
>>>> looks as a wrong layer to mitigate an arch specific problem. I am not
>>>> saying the patch is a no-go but if anything we want a big fat comment
>>>> explaining how this is possible because right now it just points to an
>>>> incorrect API usage.
>>>>
>>>> That being said, this sounds pretty much ppc specific problem and I
>>>> would _prefer_ it to be handled there (along with a big fat comment of
>>>> course).
>>>
>>> Let me try again.  Regardless of the path to which we get to this condition,
>>> we currently crash the kernel.  This patch changes that to a WARN_ON notice
>>> and continues executing the kernel without shutting down the system.  I saw
>>> the problem during powerpc testing, because that is the focus of my work.
>>> There are other paths to this function besides powerpc.  I feel that the
>>> kernel should keep running instead of halting.
>>
>> This is still basically a hack to get around a known race. In itself this patch is still worth while in that we shouldn't crash the kernel on a null pointer dereference. However, I think the actual problem still needs to be addressed. We shouldn't run any PRRN events for the source system on the target after a migration. The device tree update should have taken care of telling us about new affinities and what not. Can we just throw out any queued PRRN events when we wake up on the target?
> 
> We are not talking about queued events provided on the source system, but about
> new PRRN events sent by phyp to the kernel on the target system to update the
> kernel state after migration.  No way to predict the content.

Okay, but either way shouldn't your other proposed patches to update memory affinity by re-adding memory and changing the time topology updates are stopped to include the post-mobility updates put things in the right nodes? Or, am I missing something? I would assume a PRRN on the target would assume the target was up-to-date with respect to where things are supposed to be located.

-Tyrel

> 
>>
>> -Tyrel
>>>
>>> Regards,
>>>
> 
> Michael
>

Michael Bringmann Oct. 4, 2018, 1:02 a.m. UTC | #14

On 10/03/2018 06:05 PM, Tyrel Datwyler wrote:
> On 10/03/2018 06:27 AM, Michael Bringmann wrote:
>> On 10/02/2018 02:45 PM, Tyrel Datwyler wrote:
>>> On 10/02/2018 11:13 AM, Michael Bringmann wrote:
>>>>
>>>>
>>>> On 10/02/2018 11:04 AM, Michal Hocko wrote:
>>>>> On Tue 02-10-18 10:14:49, Michael Bringmann wrote:
>>>>>> On 10/02/2018 09:59 AM, Michal Hocko wrote:
>>>>>>> On Tue 02-10-18 09:51:40, Michael Bringmann wrote:
>>>>>>> [...]
>>>>>>>> When the device-tree affinity attributes have changed for memory,
>>>>>>>> the 'nid' affinity calculated points to a different node for the
>>>>>>>> memory block than the one used to install it, previously on the
>>>>>>>> source system.  The newly calculated 'nid' affinity may not yet
>>>>>>>> be initialized on the target system.  The current memory tracking
>>>>>>>> mechanisms do not record the node to which a memory block was
>>>>>>>> associated when it was added.  Nathan is looking at adding this
>>>>>>>> feature to the new implementation of LMBs, but it is not there
>>>>>>>> yet, and won't be present in earlier kernels without backporting a
>>>>>>>> significant number of changes.
>>>>>>>
>>>>>>> Then the patch you have proposed here just papers over a real issue, no?
>>>>>>> IIUC then you simply do not remove the memory if you lose the race.
>>>>>>
>>>>>> The problem occurs when removing memory after an affinity change
>>>>>> references a node that was previously unreferenced.  Other code
>>>>>> in 'kernel/mm/memory_hotplug.c' deals with initializing an empty
>>>>>> node when adding memory to a system.  The 'removing memory' case is
>>>>>> specific to systems that perform LPM and allow device-tree changes.
>>>>>> The powerpc kernel does not have the option of accepting some PRRN
>>>>>> requests and accepting others.  It must perform them all.
>>>>>
>>>>> I am sorry, but you are still too cryptic for me. Either there is a
>>>>> correctness issue and the the patch doesn't really fix anything or the
>>>>> final race doesn't make any difference and then the ppc code should be
>>>>> explicit about that. Checking the node inside the hotplug core code just
>>>>> looks as a wrong layer to mitigate an arch specific problem. I am not
>>>>> saying the patch is a no-go but if anything we want a big fat comment
>>>>> explaining how this is possible because right now it just points to an
>>>>> incorrect API usage.
>>>>>
>>>>> That being said, this sounds pretty much ppc specific problem and I
>>>>> would _prefer_ it to be handled there (along with a big fat comment of
>>>>> course).
>>>>
>>>> Let me try again.  Regardless of the path to which we get to this condition,
>>>> we currently crash the kernel.  This patch changes that to a WARN_ON notice
>>>> and continues executing the kernel without shutting down the system.  I saw
>>>> the problem during powerpc testing, because that is the focus of my work.
>>>> There are other paths to this function besides powerpc.  I feel that the
>>>> kernel should keep running instead of halting.
>>>
>>> This is still basically a hack to get around a known race. In itself this patch is still worth while in that we shouldn't crash the kernel on a null pointer dereference. However, I think the actual problem still needs to be addressed. We shouldn't run any PRRN events for the source system on the target after a migration. The device tree update should have taken care of telling us about new affinities and what not. Can we just throw out any queued PRRN events when we wake up on the target?
>>
>> We are not talking about queued events provided on the source system, but about
>> new PRRN events sent by phyp to the kernel on the target system to update the
>> kernel state after migration.  No way to predict the content.
> 
> Okay, but either way shouldn't your other proposed patches to update memory affinity by re-adding memory and changing the time topology updates are stopped to include the post-mobility updates put things in the right nodes? Or, am I missing something? I would assume a PRRN on the target would assume the target was up-to-date with respect to where things are supposed to be located.

This bug only recently came to our attention in a defect on a SLES12 SP3 platform running PHYP Memory Mover concurrently with LPM.  Not a normal test case.

> 
> -Tyrel

Michael

migration/mm: Add WARN_ON to try_offline_node

Commit Message

Comments

Patch