Message ID | 1562300143-11671-1-git-send-email-kernelfans@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Series | [1/2] x86/numa: carve node online semantics out of alloc_node_data() |
On Mon, Jul 8, 2019 at 3:44 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Fri, 5 Jul 2019, Pingfan Liu wrote:
>
> > I hit a bug on an AMD machine with the kexec -l nr_cpus=4 option. The
> > nr_cpus option is used to speed up the kdump process, so it is not a
> > rare case.
>
> But fundamentally wrong, really.
>
> The rest of the CPUs are in a half baked state and any broadcast event,
> e.g. MCE or a stray IPI, will result in an undiagnosable crash.

I would very much appreciate it if you could elaborate on this. I tried to
figure out your point, but failed.

For "a half baked state", I think you are concerned about the LAPIC state,
and I expand on this point as follows:

For IPI: when the capture kernel BSP is up, the rest of the cpus are still
looping inside crash_nmi_callback(), so there is no way to issue a new IPI
from these cpus. Also we call disable_local_APIC(), which effectively
prevents the LAPIC from responding to IPIs, except NMI/INIT/SIPI, which
will not occur in the crash case.

For MCE, I am not sure whether it can be broadcast between cpus, but as
far as I understand, it cannot. So is it a problem?

From another point of view, is there any difference between nr_cpus=1 and
nr_cpus>1 in the crashing case? If a stray IPI raises an issue for
nr_cpus>1, it does for nr_cpus=1 too.

Thanks,
Pingfan
> On Jul 8, 2019, at 3:35 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
>> On Mon, 8 Jul 2019, Pingfan Liu wrote:
>>> On Mon, Jul 8, 2019 at 3:44 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>>
>>>> On Fri, 5 Jul 2019, Pingfan Liu wrote:
>>>>
>>>> I hit a bug on an AMD machine with the kexec -l nr_cpus=4 option. The
>>>> nr_cpus option is used to speed up the kdump process, so it is not a
>>>> rare case.
>>>
>>> But fundamentally wrong, really.
>>>
>>> The rest of the CPUs are in a half baked state and any broadcast event,
>>> e.g. MCE or a stray IPI, will result in an undiagnosable crash.
>> I would very much appreciate it if you could elaborate on this. I tried to
>> figure out your point, but failed.
>>
>> For "a half baked state", I think you are concerned about the LAPIC state,
>> and I expand on this point as follows:
>
> It's not only the APIC state. It's the state of the CPUs in general.
>
>> For IPI: when the capture kernel BSP is up, the rest of the cpus are still
>> looping inside crash_nmi_callback(), so there is no way to issue a new IPI
>> from these cpus. Also we call disable_local_APIC(), which effectively
>> prevents the LAPIC from responding to IPIs, except NMI/INIT/SIPI, which
>> will not occur in the crash case.
>
> Fair enough for the IPI case.
>
>> For MCE, I am not sure whether it can be broadcast between cpus, but as
>> far as I understand, it cannot. So is it a problem?
>
> It can and it does.
>
> That's the whole point why we bring up all CPUs in the 'nosmt' case and
> shut the siblings down again after setting CR4.MCE. Actually that's in fact
> a 'let's hope no MCE hits before that happened' approach, but that's all we
> can do.
>
> If we don't do that then the MCE broadcast can hit a CPU which has some
> firmware initialized state. The result can be a full system lockup, triple
> fault etc.
>
> So when the MCE hits a CPU which is still in the crashed kernel lala state,
> then all hell breaks loose.
>
>> From another point of view, is there any difference between nr_cpus=1 and
>> nr_cpus>1 in the crashing case? If a stray IPI raises an issue for
>> nr_cpus>1, it does for nr_cpus=1 too.
>
> Anything less than the actual number of present CPUs is problematic except
> you use the 'let's hope nothing happens' approach. We could add an option
> to stop the bringup at the early online state similar to what we do for
> 'nosmt'.
>

How about we change nr_cpus to do that instead so we never have to have this conversation again?
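
For readers following along, the idea under discussion (bring every present CPU up far enough to set CR4.MCE, then park the ones beyond the requested limit) can be sketched as a small userspace C model. This is only an illustration under stated assumptions: `struct cpu_model`, `bringup_with_limit()` and `all_cpus_accept_mce()` are hypothetical names invented here and are not kernel interfaces. The only hardware fact used is that CR4.MCE is bit 6.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CR4_MCE_BIT (UINT64_C(1) << 6) /* CR4.MCE is bit 6 on x86 */
#define NR_PRESENT  8                  /* hypothetical number of present CPUs */

/* Hypothetical per-CPU model; the real state lives in hardware. */
struct cpu_model {
	uint64_t cr4;  /* modelled CR4 value */
	bool online;   /* still running after boot has finished */
};

/*
 * Bring every present CPU to an "early online" step that sets CR4.MCE,
 * then park the CPUs beyond 'limit' again. Returns how many stay online.
 */
int bringup_with_limit(struct cpu_model *cpus, int nr_present, int limit)
{
	int kept = 0;

	assert(nr_present > 0 && limit > 0);
	for (int cpu = 0; cpu < nr_present; cpu++) {
		cpus[cpu].cr4 |= CR4_MCE_BIT;   /* early online step */
		cpus[cpu].online = cpu < limit; /* park the rest again */
		if (cpus[cpu].online)
			kept++;
	}
	return kept;
}

/* True when a broadcast MCE would find CR4.MCE set on every present CPU. */
bool all_cpus_accept_mce(const struct cpu_model *cpus, int nr_present)
{
	for (int cpu = 0; cpu < nr_present; cpu++)
		if (!(cpus[cpu].cr4 & CR4_MCE_BIT))
			return false;
	return true;
}
```

With eight present CPUs and a limit of four, the model keeps four CPUs online while all eight end up with CR4.MCE set, which is the property the 'nosmt' style bringup is after. It does not model the window before CR4.MCE is set, which Thomas describes as the 'let's hope' part.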
On Mon, Jul 8, 2019 at 5:35 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Mon, 8 Jul 2019, Pingfan Liu wrote:
> > On Mon, Jul 8, 2019 at 3:44 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > >
> > > On Fri, 5 Jul 2019, Pingfan Liu wrote:
> > >
> > > > I hit a bug on an AMD machine with the kexec -l nr_cpus=4 option. The
> > > > nr_cpus option is used to speed up the kdump process, so it is not a
> > > > rare case.
> > >
> > > But fundamentally wrong, really.
> > >
> > > The rest of the CPUs are in a half baked state and any broadcast event,
> > > e.g. MCE or a stray IPI, will result in an undiagnosable crash.
> > I would very much appreciate it if you could elaborate on this. I tried to
> > figure out your point, but failed.
> >
> > For "a half baked state", I think you are concerned about the LAPIC state,
> > and I expand on this point as follows:
>
> It's not only the APIC state. It's the state of the CPUs in general.

As for the other state, "kexec -l" acts as a kind of boot loader, and the
boot cpu complies with the kernel boot-up requirements. The remaining APs
are pinned in a loop until they receive the #INIT IPI. From then on,
everything is the same as a normal SMP boot.

>
> > For IPI: when the capture kernel BSP is up, the rest of the cpus are still
> > looping inside crash_nmi_callback(), so there is no way to issue a new IPI
> > from these cpus. Also we call disable_local_APIC(), which effectively
> > prevents the LAPIC from responding to IPIs, except NMI/INIT/SIPI, which
> > will not occur in the crash case.
>
> Fair enough for the IPI case.
>
> > For MCE, I am not sure whether it can be broadcast between cpus, but as
> > far as I understand, it cannot. So is it a problem?
>
> It can and it does.
>
> That's the whole point why we bring up all CPUs in the 'nosmt' case and
> shut the siblings down again after setting CR4.MCE. Actually that's in fact
> a 'let's hope no MCE hits before that happened' approach, but that's all we
> can do.
>
> If we don't do that then the MCE broadcast can hit a CPU which has some
> firmware initialized state. The result can be a full system lockup, triple
> fault etc.
>
> So when the MCE hits a CPU which is still in the crashed kernel lala state,
> then all hell breaks loose.

Thank you for the comprehensive explanation. With your guidance, I now
have a full understanding of the issue.

But when I tried to add something to enable CR4.MCE in
crash_nmi_callback(), I realized that it is not doable in some cases
(after a crash, we will not ask an offline SMT cpu to come online), and
it is also needless. "kexec -l/-p" takes advantage of the cpu state in
the first kernel, where all logical cpus have CR4.MCE=1.

So kexec is exempt from this bug if the first kernel already does it.

>
> > From another point of view, is there any difference between nr_cpus=1 and
> > nr_cpus>1 in the crashing case? If a stray IPI raises an issue for
> > nr_cpus>1, it does for nr_cpus=1 too.
>
> Anything less than the actual number of present CPUs is problematic except
> you use the 'let's hope nothing happens' approach. We could add an option
> to stop the bringup at the early online state similar to what we do for
> 'nosmt'.

Yes, we should do something about the nr_cpus parameter for the first kernel.

Thanks,
Pingfan
On Tue, Jul 9, 2019 at 1:53 AM Andy Lutomirski <luto@amacapital.net> wrote:
>
>
>
> > On Jul 8, 2019, at 3:35 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> >> On Mon, 8 Jul 2019, Pingfan Liu wrote:
> >>> On Mon, Jul 8, 2019 at 3:44 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >>>
> >>>> On Fri, 5 Jul 2019, Pingfan Liu wrote:
> >>>>
> >>>> I hit a bug on an AMD machine with the kexec -l nr_cpus=4 option. The
> >>>> nr_cpus option is used to speed up the kdump process, so it is not a
> >>>> rare case.
> >>>
> >>> But fundamentally wrong, really.
> >>>
> >>> The rest of the CPUs are in a half baked state and any broadcast event,
> >>> e.g. MCE or a stray IPI, will result in an undiagnosable crash.
> >> I would very much appreciate it if you could elaborate on this. I tried to
> >> figure out your point, but failed.
> >>
> >> For "a half baked state", I think you are concerned about the LAPIC state,
> >> and I expand on this point as follows:
> >
> > It's not only the APIC state. It's the state of the CPUs in general.
> >
> >> For IPI: when the capture kernel BSP is up, the rest of the cpus are still
> >> looping inside crash_nmi_callback(), so there is no way to issue a new IPI
> >> from these cpus. Also we call disable_local_APIC(), which effectively
> >> prevents the LAPIC from responding to IPIs, except NMI/INIT/SIPI, which
> >> will not occur in the crash case.
> >
> > Fair enough for the IPI case.
> >
> >> For MCE, I am not sure whether it can be broadcast between cpus, but as
> >> far as I understand, it cannot. So is it a problem?
> >
> > It can and it does.
> >
> > That's the whole point why we bring up all CPUs in the 'nosmt' case and
> > shut the siblings down again after setting CR4.MCE. Actually that's in fact
> > a 'let's hope no MCE hits before that happened' approach, but that's all we
> > can do.
> >
> > If we don't do that then the MCE broadcast can hit a CPU which has some
> > firmware initialized state. The result can be a full system lockup, triple
> > fault etc.
> >
> > So when the MCE hits a CPU which is still in the crashed kernel lala state,
> > then all hell breaks loose.
> >
> >> From another point of view, is there any difference between nr_cpus=1 and
> >> nr_cpus>1 in the crashing case? If a stray IPI raises an issue for
> >> nr_cpus>1, it does for nr_cpus=1 too.
> >
> > Anything less than the actual number of present CPUs is problematic except
> > you use the 'let's hope nothing happens' approach. We could add an option
> > to stop the bringup at the early online state similar to what we do for
> > 'nosmt'.
> >
> >
>
> How about we change nr_cpus to do that instead so we never have to have this conversation again?

Are you interested in implementing this?

Thanks,
Pingfan
> On Jul 9, 2019, at 1:24 AM, Pingfan Liu <kernelfans@gmail.com> wrote:
>
>> On Tue, Jul 9, 2019 at 2:12 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>>> On Tue, 9 Jul 2019, Pingfan Liu wrote:
>>>> On Mon, Jul 8, 2019 at 5:35 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>>> It can and it does.
>>>>
>>>> That's the whole point why we bring up all CPUs in the 'nosmt' case and
>>>> shut the siblings down again after setting CR4.MCE. Actually that's in fact
>>>> a 'let's hope no MCE hits before that happened' approach, but that's all we
>>>> can do.
>>>>
>>>> If we don't do that then the MCE broadcast can hit a CPU which has some
>>>> firmware initialized state. The result can be a full system lockup, triple
>>>> fault etc.
>>>>
>>>> So when the MCE hits a CPU which is still in the crashed kernel lala state,
>>>> then all hell breaks loose.
>>> Thank you for the comprehensive explanation. With your guidance, I now
>>> have a full understanding of the issue.
>>>
>>> But when I tried to add something to enable CR4.MCE in
>>> crash_nmi_callback(), I realized that it is not doable in some cases
>>> (after a crash, we will not ask an offline SMT cpu to come online), and
>>> it is also needless. "kexec -l/-p" takes advantage of the cpu state in
>>> the first kernel, where all logical cpus have CR4.MCE=1.
>>>
>>> So kexec is exempt from this bug if the first kernel already does it.
>>
>> No. If the MCE broadcast is handled by a CPU which is stuck in the old
>> kernel stop loop, then it will execute on the old kernel and eventually run
>> into the memory corruption which crashed the old one.
>>
> Yes, you are right. A stuck cpu may execute the old do_machine_check()
> code. But I just found out that we have
> do_machine_check()->__mc_check_crashing_cpu() to guard against this case.
>
> And I think the MCE issue with nr_cpus is not closely related to this
> series and can be handled as a separate issue. I wondered whether Andy
> would take it; if not, I am glad to do it.
>
>

Go for it. I’m not familiar enough with the SMP boot stuff that I would be able to do it any faster than you. I’ll gladly help review it.
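
The guard mentioned above, do_machine_check()->__mc_check_crashing_cpu(), can be illustrated with a simplified userspace C model. This is a sketch under stated assumptions, not the kernel code: `struct mce_model` and `mc_check_crashing_cpu_model()` are hypothetical names, the MSR access is replaced by a plain field, and the exact conditions in the real helper may differ between kernel versions. The one architectural fact relied on is that RIPV is bit 0 of IA32_MCG_STATUS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MCG_STATUS_RIPV (UINT64_C(1) << 0) /* restart IP valid */

/* Hypothetical model of the state the guard looks at. */
struct mce_model {
	int crashing_cpu;    /* -1 while no crash is in progress */
	uint64_t mcg_status; /* modelled IA32_MCG_STATUS of the current CPU */
};

/*
 * Returns true when the machine check should be ignored on 'cpu':
 * the CPU is offline, or another CPU is already handling a crash,
 * and the interrupted context can be resumed (RIPV set). In that
 * case the status is cleared so the event is acknowledged.
 */
bool mc_check_crashing_cpu_model(struct mce_model *m, int cpu, bool cpu_offline)
{
	assert(m);
	if (cpu_offline ||
	    (m->crashing_cpu != -1 && m->crashing_cpu != cpu)) {
		if (m->mcg_status & MCG_STATUS_RIPV) {
			m->mcg_status = 0; /* acknowledge and bail out early */
			return true;
		}
	}
	return false;
}
```

In this model a CPU that is parked while another CPU runs the crash path returns early from the handler instead of walking the old kernel's MCE logic. Whether that covers every case raised in the thread is exactly what is being debated, so treat the sketch as a reading aid only.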
On Tue, Jul 9, 2019 at 9:34 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
>
>
> > On Jul 9, 2019, at 1:24 AM, Pingfan Liu <kernelfans@gmail.com> wrote:
> >
> >> On Tue, Jul 9, 2019 at 2:12 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >>
> >>> On Tue, 9 Jul 2019, Pingfan Liu wrote:
> >>>> On Mon, Jul 8, 2019 at 5:35 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >>>> It can and it does.
> >>>>
> >>>> That's the whole point why we bring up all CPUs in the 'nosmt' case and
> >>>> shut the siblings down again after setting CR4.MCE. Actually that's in fact
> >>>> a 'let's hope no MCE hits before that happened' approach, but that's all we
> >>>> can do.
> >>>>
> >>>> If we don't do that then the MCE broadcast can hit a CPU which has some
> >>>> firmware initialized state. The result can be a full system lockup, triple
> >>>> fault etc.
> >>>>
> >>>> So when the MCE hits a CPU which is still in the crashed kernel lala state,
> >>>> then all hell breaks loose.
> >>> Thank you for the comprehensive explanation. With your guidance, I now
> >>> have a full understanding of the issue.
> >>>
> >>> But when I tried to add something to enable CR4.MCE in
> >>> crash_nmi_callback(), I realized that it is not doable in some cases
> >>> (after a crash, we will not ask an offline SMT cpu to come online), and
> >>> it is also needless. "kexec -l/-p" takes advantage of the cpu state in
> >>> the first kernel, where all logical cpus have CR4.MCE=1.
> >>>
> >>> So kexec is exempt from this bug if the first kernel already does it.
> >>
> >> No. If the MCE broadcast is handled by a CPU which is stuck in the old
> >> kernel stop loop, then it will execute on the old kernel and eventually run
> >> into the memory corruption which crashed the old one.
> >>
> > Yes, you are right. A stuck cpu may execute the old do_machine_check()
> > code. But I just found out that we have
> > do_machine_check()->__mc_check_crashing_cpu() to guard against this case.
> >
> > And I think the MCE issue with nr_cpus is not closely related to this
> > series and can be handled as a separate issue. I wondered whether Andy
> > would take it; if not, I am glad to do it.
> >
> >
>
> Go for it. I’m not familiar enough with the SMP boot stuff that I would be able to do it any faster than you. I’ll gladly help review it.

I have sent out a patch to fix maxcpus: "[PATCH] smp: force all cpu to boot
once under maxcpus option".

But for the nr_cpus case, I think things will not be so easy because of the
percpu area, and it may need a quite different approach.

Thanks,
Pingfan
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index e6dad60..b48d507 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -213,8 +213,6 @@ static void __init alloc_node_data(int nid)
 
 	node_data[nid] = nd;
 	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
-
-	node_set_online(nid);
 }
 
 /**
@@ -589,6 +587,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 			continue;
 
 		alloc_node_data(nid);
+		node_set_online(nid);
 	}
 
 	/* Dump memblock with node info and return. */
@@ -760,8 +759,10 @@ void __init init_cpu_to_node(void)
 		if (node == NUMA_NO_NODE)
 			continue;
 
-		if (!node_online(node))
+		if (!node_online(node)) {
 			init_memory_less_node(node);
+			node_set_online(node);
+		}
 
 		numa_set_node(cpu, node);
 	}
Node online means that either memory or a cpu is online. But there is a
requirement to instantiate a pglist_data which has neither cpu nor memory
online (refer to [2/2]). So carve out the online semantics, and call
node_set_online() where either memory or a cpu is online.

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Qian Cai <cai@lca.pw>
Cc: Barret Rhoden <brho@google.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/mm/numa.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)
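
The split described in the commit message can be pictured with a small userspace C model: allocation of the per-node data no longer implies "online", and the caller sets the online state only when memory or a cpu backs the node. This is an illustrative sketch, not kernel code. `MAX_NODES`, `struct node_model`, `alloc_node_data_model()` and `register_node_model()` are hypothetical stand-ins for MAX_NUMNODES, pg_data_t, alloc_node_data() and its callers.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_NODES 8 /* hypothetical limit, stands in for MAX_NUMNODES */

/* Stand-in for pg_data_t; the real structure is far larger. */
struct node_model {
	bool allocated;
};

static struct node_model nodes[MAX_NODES];
static bool node_online_map[MAX_NODES];

/* After the patch: allocation only, the online state is left untouched. */
void alloc_node_data_model(int nid)
{
	assert(nid >= 0 && nid < MAX_NODES);
	memset(&nodes[nid], 0, sizeof(nodes[nid]));
	nodes[nid].allocated = true;
}

/* Callers mark the node online only when memory or a cpu backs it. */
void register_node_model(int nid, bool has_memory, bool has_cpu)
{
	alloc_node_data_model(nid);
	if (has_memory || has_cpu)
		node_online_map[nid] = true;
}
```

Registering node 0 with memory leaves it allocated and online, while registering node 1 with neither memory nor a cpu leaves it allocated but offline, which is the state that [2/2] is said to need.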