[1/2] x86/numa: carve node online semantics out of alloc_node_data()

Message ID 1562300143-11671-1-git-send-email-kernelfans@gmail.com (mailing list archive)
State New, archived

Commit Message

Pingfan Liu July 5, 2019, 4:15 a.m. UTC
A node being online means that either its memory or one of its CPUs is
online. But there is a requirement to instantiate a pglist_data for a node
that has neither a CPU nor memory online (refer to [2/2]).

So carve the online semantics out of alloc_node_data(), and call
node_set_online() where either memory or a CPU is online.

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Qian Cai <cai@lca.pw>
Cc: Barret Rhoden <brho@google.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/mm/numa.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Comments

Pingfan Liu July 8, 2019, 8:36 a.m. UTC | #1
On Mon, Jul 8, 2019 at 3:44 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Fri, 5 Jul 2019, Pingfan Liu wrote:
>
> > I hit a bug on an AMD machine, with kexec -l nr_cpus=4 option. nr_cpus option
> > is used to speed up kdump process, so it is not a rare case.
>
> But fundamentally wrong, really.
>
> The rest of the CPUs are in a half baken state and any broadcast event,
> e.g. MCE or a stray IPI, will result in a undiagnosable crash.
I would much appreciate it if you could elaborate on this. I tried to
work out your point, but failed.

For "a half baked state", I think you are concerned about the LAPIC state,
so I expand on that point as follows:

For IPI: when the capture kernel's BSP is up, the remaining CPUs are still
looping inside crash_nmi_callback(), so there is no way for these CPUs to
send a new IPI. Also, we call disable_local_APIC(), which effectively
prevents the LAPIC from responding to IPIs, except NMI/INIT/SIPI, which
will not occur in the crash case.

For MCE, I am not sure whether it can be broadcast between CPUs,
but as far as I understand, it cannot. So is it a problem?

From another point of view, is there any difference between nr_cpus=1 and
nr_cpus>1 in the crash case? If a stray IPI causes a problem for nr_cpus>1,
it does so for nr_cpus=1 as well.

Thanks,
  Pingfan

Andy Lutomirski July 8, 2019, 5:53 p.m. UTC | #2
> On Jul 8, 2019, at 3:35 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
>> On Mon, 8 Jul 2019, Pingfan Liu wrote:
>>> On Mon, Jul 8, 2019 at 3:44 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>> 
>>>> On Fri, 5 Jul 2019, Pingfan Liu wrote:
>>>> 
>>>> I hit a bug on an AMD machine, with kexec -l nr_cpus=4 option. nr_cpus option
>>>> is used to speed up kdump process, so it is not a rare case.
>>> 
>>> But fundamentally wrong, really.
>>> 
>>> The rest of the CPUs are in a half baken state and any broadcast event,
>>> e.g. MCE or a stray IPI, will result in a undiagnosable crash.
>> Very appreciate if you can pay more word on it? I tried to figure out
>> your point, but fail.
>> 
>> For "a half baked state", I think you concern about LAPIC state, and I
>> expand this point like the following:
> 
> It's not only the APIC state. It's the state of the CPUs in general.
> 
>> For IPI: when capture kernel BSP is up, the rest cpus are still loop
>> inside crash_nmi_callback(), so there is no way to eject new IPI from
>> these cpu. Also we disable_local_APIC(), which effectively prevent the
>> LAPIC from responding to IPI, except NMI/INIT/SIPI, which will not
>> occur in crash case.
> 
> Fair enough for the IPI case.
> 
>> For MCE, I am not sure whether it can broadcast or not between cpus,
>> but as my understanding, it can not. Then is it a problem?
> 
> It can and it does.
> 
> That's the whole point why we bring up all CPUs in the 'nosmt' case and
> shut the siblings down again after setting CR4.MCE. Actually that's in fact
> a 'let's hope no MCE hits before that happened' approach, but that's all we
> can do.
> 
> If we don't do that then the MCE broadcast can hit a CPU which has some
> firmware initialized state. The result can be a full system lockup, triple
> fault etc.
> 
> So when the MCE hits a CPU which is still in the crashed kernel lala state,
> then all hell breaks lose.
> 
>> From another view point, is there any difference between nr_cpus=1 and
>> nr_cpus> 1 in crashing case? If stray IPI raises issue to nr_cpus>1,
>> it does for nr_cpus=1.
> 
> Anything less than the actual number of present CPUs is problematic except
> you use the 'let's hope nothing happens' approach. We could add an option
> to stop the bringup at the early online state similar to what we do for
> 'nosmt'.
> 
> 

How about we change nr_cpus to do that instead so we never have to have this conversation again?

Pingfan Liu July 9, 2019, 4:16 a.m. UTC | #3
On Mon, Jul 8, 2019 at 5:35 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Mon, 8 Jul 2019, Pingfan Liu wrote:
> > On Mon, Jul 8, 2019 at 3:44 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > >
> > > On Fri, 5 Jul 2019, Pingfan Liu wrote:
> > >
> > > > I hit a bug on an AMD machine, with kexec -l nr_cpus=4 option. nr_cpus option
> > > > is used to speed up kdump process, so it is not a rare case.
> > >
> > > But fundamentally wrong, really.
> > >
> > > The rest of the CPUs are in a half baken state and any broadcast event,
> > > e.g. MCE or a stray IPI, will result in a undiagnosable crash.
> > Very appreciate if you can pay more word on it? I tried to figure out
> > your point, but fail.
> >
> > For "a half baked state", I think you concern about LAPIC state, and I
> > expand this point like the following:
>
> It's not only the APIC state. It's the state of the CPUs in general.
As for the other state: "kexec -l" acts as a kind of boot loader, and the
boot CPU complies with the kernel boot protocol. As for the remaining APs,
they are pinned in a loop until they receive the #INIT IPI. After that, the
rest is the same as SMP boot-up.

>
> > For IPI: when capture kernel BSP is up, the rest cpus are still loop
> > inside crash_nmi_callback(), so there is no way to eject new IPI from
> > these cpu. Also we disable_local_APIC(), which effectively prevent the
> > LAPIC from responding to IPI, except NMI/INIT/SIPI, which will not
> > occur in crash case.
>
> Fair enough for the IPI case.
>
> > For MCE, I am not sure whether it can broadcast or not between cpus,
> > but as my understanding, it can not. Then is it a problem?
>
> It can and it does.
>
> That's the whole point why we bring up all CPUs in the 'nosmt' case and
> shut the siblings down again after setting CR4.MCE. Actually that's in fact
> a 'let's hope no MCE hits before that happened' approach, but that's all we
> can do.
>
> If we don't do that then the MCE broadcast can hit a CPU which has some
> firmware initialized state. The result can be a full system lockup, triple
> fault etc.
>
> So when the MCE hits a CPU which is still in the crashed kernel lala state,
> then all hell breaks lose.
Thank you for the comprehensive explanation. With your guidance, I now
have a full understanding of the issue.

But when I tried to add something to enable CR4.MCE in
crash_nmi_callback(), I realized that it is not doable in some cases (after
a crash, we will not ask an offline SMT CPU to come online), and it is also
unnecessary. "kexec -l/-p" takes advantage of the CPU state in the
first kernel, where all logical CPUs have CR4.MCE=1.

So kexec is exempt from this bug if the first kernel has already done it.
>
> > From another view point, is there any difference between nr_cpus=1 and
> > nr_cpus> 1 in crashing case? If stray IPI raises issue to nr_cpus>1,
> > it does for nr_cpus=1.
>
> Anything less than the actual number of present CPUs is problematic except
> you use the 'let's hope nothing happens' approach. We could add an option
> to stop the bringup at the early online state similar to what we do for
> 'nosmt'.
Yes, we should do something about the nr_cpus parameter for the first kernel.

Thanks,
  Pingfan

Pingfan Liu July 9, 2019, 4:26 a.m. UTC | #4
On Tue, Jul 9, 2019 at 1:53 AM Andy Lutomirski <luto@amacapital.net> wrote:
>
>
>
> > On Jul 8, 2019, at 3:35 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> >> On Mon, 8 Jul 2019, Pingfan Liu wrote:
> >>> On Mon, Jul 8, 2019 at 3:44 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >>>
> >>>> On Fri, 5 Jul 2019, Pingfan Liu wrote:
> >>>>
> >>>> I hit a bug on an AMD machine, with kexec -l nr_cpus=4 option. nr_cpus option
> >>>> is used to speed up kdump process, so it is not a rare case.
> >>>
> >>> But fundamentally wrong, really.
> >>>
> >>> The rest of the CPUs are in a half baken state and any broadcast event,
> >>> e.g. MCE or a stray IPI, will result in a undiagnosable crash.
> >> Very appreciate if you can pay more word on it? I tried to figure out
> >> your point, but fail.
> >>
> >> For "a half baked state", I think you concern about LAPIC state, and I
> >> expand this point like the following:
> >
> > It's not only the APIC state. It's the state of the CPUs in general.
> >
> >> For IPI: when capture kernel BSP is up, the rest cpus are still loop
> >> inside crash_nmi_callback(), so there is no way to eject new IPI from
> >> these cpu. Also we disable_local_APIC(), which effectively prevent the
> >> LAPIC from responding to IPI, except NMI/INIT/SIPI, which will not
> >> occur in crash case.
> >
> > Fair enough for the IPI case.
> >
> >> For MCE, I am not sure whether it can broadcast or not between cpus,
> >> but as my understanding, it can not. Then is it a problem?
> >
> > It can and it does.
> >
> > That's the whole point why we bring up all CPUs in the 'nosmt' case and
> > shut the siblings down again after setting CR4.MCE. Actually that's in fact
> > a 'let's hope no MCE hits before that happened' approach, but that's all we
> > can do.
> >
> > If we don't do that then the MCE broadcast can hit a CPU which has some
> > firmware initialized state. The result can be a full system lockup, triple
> > fault etc.
> >
> > So when the MCE hits a CPU which is still in the crashed kernel lala state,
> > then all hell breaks lose.
> >
> >> From another view point, is there any difference between nr_cpus=1 and
> >> nr_cpus> 1 in crashing case? If stray IPI raises issue to nr_cpus>1,
> >> it does for nr_cpus=1.
> >
> > Anything less than the actual number of present CPUs is problematic except
> > you use the 'let's hope nothing happens' approach. We could add an option
> > to stop the bringup at the early online state similar to what we do for
> > 'nosmt'.
> >
> >
>
> How about we change nr_cpus to do that instead so we never have to have this conversation again?
Are you interested in implementing this?

Thanks,
  Pingfan

Andy Lutomirski July 9, 2019, 1:34 p.m. UTC | #5
> On Jul 9, 2019, at 1:24 AM, Pingfan Liu <kernelfans@gmail.com> wrote:
> 
>> On Tue, Jul 9, 2019 at 2:12 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> 
>>> On Tue, 9 Jul 2019, Pingfan Liu wrote:
>>>> On Mon, Jul 8, 2019 at 5:35 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>>> It can and it does.
>>>> 
>>>> That's the whole point why we bring up all CPUs in the 'nosmt' case and
>>>> shut the siblings down again after setting CR4.MCE. Actually that's in fact
>>>> a 'let's hope no MCE hits before that happened' approach, but that's all we
>>>> can do.
>>>> 
>>>> If we don't do that then the MCE broadcast can hit a CPU which has some
>>>> firmware initialized state. The result can be a full system lockup, triple
>>>> fault etc.
>>>> 
>>>> So when the MCE hits a CPU which is still in the crashed kernel lala state,
>>>> then all hell breaks lose.
>>> Thank you for the comprehensive explain. With your guide, now, I have
>>> a full understanding of the issue.
>>> 
>>> But when I tried to add something to enable CR4.MCE in
>>> crash_nmi_callback(), I realized that it is undo-able in some case (if
>>> crashed, we will not ask an offline smt cpu to online), also it is
>>> needless. "kexec -l/-p" takes the advantage of the cpu state in the
>>> first kernel, where all logical cpu has CR4.MCE=1.
>>> 
>>> So kexec is exempt from this bug if the first kernel already do it.
>> 
>> No. If the MCE broadcast is handled by a CPU which is stuck in the old
>> kernel stop loop, then it will execute on the old kernel and eventually run
>> into the memory corruption which crashed the old one.
>> 
> Yes, you are right. Stuck cpu may execute the old do_machine_check()
> code. But I just found out that we have
> do_machine_check()->__mc_check_crashing_cpu() to against this case.
> 
> And I think the MCE issue with nr_cpus is not closely related with
> this series, can
> be a separated issue. I had question whether Andy will take it, if
> not, I am glad to do it.
> 
> 

Go for it. I’m not familiar enough with the SMP boot stuff that I would be able to do it any faster than you. I’ll gladly help review it.

Pingfan Liu July 10, 2019, 8:40 a.m. UTC | #6
On Tue, Jul 9, 2019 at 9:34 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
>
>
> > On Jul 9, 2019, at 1:24 AM, Pingfan Liu <kernelfans@gmail.com> wrote:
> >
> >> On Tue, Jul 9, 2019 at 2:12 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >>
> >>> On Tue, 9 Jul 2019, Pingfan Liu wrote:
> >>>> On Mon, Jul 8, 2019 at 5:35 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >>>> It can and it does.
> >>>>
> >>>> That's the whole point why we bring up all CPUs in the 'nosmt' case and
> >>>> shut the siblings down again after setting CR4.MCE. Actually that's in fact
> >>>> a 'let's hope no MCE hits before that happened' approach, but that's all we
> >>>> can do.
> >>>>
> >>>> If we don't do that then the MCE broadcast can hit a CPU which has some
> >>>> firmware initialized state. The result can be a full system lockup, triple
> >>>> fault etc.
> >>>>
> >>>> So when the MCE hits a CPU which is still in the crashed kernel lala state,
> >>>> then all hell breaks lose.
> >>> Thank you for the comprehensive explain. With your guide, now, I have
> >>> a full understanding of the issue.
> >>>
> >>> But when I tried to add something to enable CR4.MCE in
> >>> crash_nmi_callback(), I realized that it is undo-able in some case (if
> >>> crashed, we will not ask an offline smt cpu to online), also it is
> >>> needless. "kexec -l/-p" takes the advantage of the cpu state in the
> >>> first kernel, where all logical cpu has CR4.MCE=1.
> >>>
> >>> So kexec is exempt from this bug if the first kernel already do it.
> >>
> >> No. If the MCE broadcast is handled by a CPU which is stuck in the old
> >> kernel stop loop, then it will execute on the old kernel and eventually run
> >> into the memory corruption which crashed the old one.
> >>
> > Yes, you are right. Stuck cpu may execute the old do_machine_check()
> > code. But I just found out that we have
> > do_machine_check()->__mc_check_crashing_cpu() to against this case.
> >
> > And I think the MCE issue with nr_cpus is not closely related with
> > this series, can
> > be a separated issue. I had question whether Andy will take it, if
> > not, I am glad to do it.
> >
> >
>
> Go for it. I’m not familiar enough with the SMP boot stuff that I would be able to do it any faster than you. I’ll gladly help review it.
I have sent out a patch to fix maxcpus: "[PATCH] smp: force all cpu to
boot once under maxcpus option".
But for the nr_cpus case, I think things will not be so easy because of the
percpu area, and it may need a quite different approach.

Thanks,
  Pingfan
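
For readers following the MCE argument, here is a minimal user-space sketch in C of the idea raised in the thread: visit every present CPU once so that CR4.MCE gets set, then leave only the requested number online. It is a toy model under stated assumptions, not kernel code; `toy_cpu`, `toy_bringup()`, `toy_broadcast_mce_safe()`, `toy_count_online()` and `TOY_NR_PRESENT` are hypothetical names invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_PRESENT 8	/* hypothetical number of present CPUs */

struct toy_cpu {
	bool cr4_mce;	/* models CR4.MCE having been set on this CPU */
	bool online;	/* models the CPU being left available for work */
};

static struct toy_cpu toy_cpus[TOY_NR_PRESENT];

/* Visit every present CPU once, but keep only the first 'limit' online. */
static void toy_bringup(int limit)
{
	for (int cpu = 0; cpu < TOY_NR_PRESENT; cpu++) {
		toy_cpus[cpu].cr4_mce = true;		/* early bringup step */
		toy_cpus[cpu].online = cpu < limit;	/* park the rest again */
	}
}

/* In this model a broadcast MCE is safe only if every CPU has the bit set. */
static bool toy_broadcast_mce_safe(void)
{
	for (int cpu = 0; cpu < TOY_NR_PRESENT; cpu++)
		if (!toy_cpus[cpu].cr4_mce)
			return false;
	return true;
}

/* Count the CPUs that were left online after bringup. */
static int toy_count_online(void)
{
	int n = 0;

	for (int cpu = 0; cpu < TOY_NR_PRESENT; cpu++)
		if (toy_cpus[cpu].online)
			n++;
	return n;
}
```

In this model, never calling toy_bringup() corresponds to the plain nr_cpus=N case criticized in the thread, where some present CPUs never get the bit set.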

Patch

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index e6dad60..b48d507 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -213,8 +213,6 @@  static void __init alloc_node_data(int nid)
 
 	node_data[nid] = nd;
 	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
-
-	node_set_online(nid);
 }
 
 /**
@@ -589,6 +587,7 @@  static int __init numa_register_memblks(struct numa_meminfo *mi)
 			continue;
 
 		alloc_node_data(nid);
+		node_set_online(nid);
 	}
 
 	/* Dump memblock with node info and return. */
@@ -760,8 +759,10 @@  void __init init_cpu_to_node(void)
 		if (node == NUMA_NO_NODE)
 			continue;
 
-		if (!node_online(node))
+		if (!node_online(node)) {
 			init_memory_less_node(node);
+			node_set_online(node);
+		}
 
 		numa_set_node(cpu, node);
 	}
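
The effect of the patch can be illustrated with a small user-space model in C, assuming a simple bitmask in place of the kernel's node state: allocating the per-node data no longer marks the node online, and callers do so only when the node has memory or a CPU. All `toy_` names are hypothetical and exist only for this sketch; it is not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define TOY_MAX_NODES 8		/* hypothetical node limit for the model */

/* Stand-in for pg_data_t; the real structure is far larger. */
struct toy_pgdat {
	int node_id;
};

static struct toy_pgdat *toy_node_data[TOY_MAX_NODES];
static unsigned long toy_online_mask;	/* bit n set => node n is online */

static void toy_node_set_online(int nid)
{
	toy_online_mask |= 1UL << nid;
}

static bool toy_node_online(int nid)
{
	return (toy_online_mask & (1UL << nid)) != 0;
}

/* Allocation only: in the patched scheme this does not imply "online". */
static int toy_alloc_node_data(int nid)
{
	struct toy_pgdat *nd = calloc(1, sizeof(*nd));

	if (!nd)
		return -1;
	nd->node_id = nid;
	toy_node_data[nid] = nd;
	return 0;
}

/* A node with memory or a CPU: the caller marks it online explicitly. */
static int toy_bring_up_node(int nid)
{
	if (toy_alloc_node_data(nid))
		return -1;
	toy_node_set_online(nid);
	return 0;
}
```

Calling toy_alloc_node_data() alone models the case from [2/2], a node that has its data instantiated but has neither a CPU nor memory online.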