diff mbox series

[v2] x86: Fix MCE error handling when kdump is enabled

Message ID 20200929211644.31632-1-minyard@acm.org (mailing list archive)
State New, archived
Series [v2] x86: Fix MCE error handling when kdump is enabled

Commit Message

Corey Minyard Sept. 29, 2020, 9:16 p.m. UTC
From: Corey Minyard <cminyard@mvista.com>

If kdump is enabled, the handling of shooting down CPUs does not use the
RESET_VECTOR irq before trying to use NMIs to shoot down the CPUs.

For normal errors that is fine.  MCEs, however, interrupt all CPUs at
the same time, so they are already running in an exception context that
is higher priority than an NMI; sending them an NMI won't do anything.
The MCE code in wait_for_panic() is set up to receive the RESET_VECTOR
because it enables irqs, but that won't work in the NMI-only case.

There is already code in place to scan for the NMI callback being
ready; simply call that from the MCE's wait_for_panic() code so it will
pick up and handle an NMI shootdown if one is requested.  This required
propagating the registers down to wait_for_panic().

Reported-by: Wu Bo <wubo40@huawei.com>
Cc: hidehiro.kawai.ez@hitachi.com
Cc: linfeilong@huawei.com
Cc: liuzhiqiang26@huawei.com
Signed-off-by: Corey Minyard <cminyard@mvista.com>
Tested-by: Wu Bo <wubo40@huawei.com>
---
I hadn't heard anything, so I thought I would re-post with the updated
version of the patch.

Wu Bo found this doing kdumps because the IPMI driver saves panic
information to the IPMI event log during a panic.  But it was getting
interrupts at the same time because the other cores had interrupts
enabled, causing the process to take a long time.

Having interrupts enabled during a kdump shutdown, and while the new
kdump kernel is running, is obviously a bad thing and can cause other
problems, too.  I think this is the right fix, but I'm not an expert in
this code.

The only thing I can think of that might be wrong with this is that if
you are going to take a kdump, it might be best to never re-enable
interrupts at all.  But I'm not sure.

Changes since v1:
  * Moved regs to the beginning of all parameter lists that it was
    added to.
  * Fixed the header text about NMIs and MCEs.

 arch/x86/kernel/cpu/mce/core.c | 63 +++++++++++++++++++++-------------
 1 file changed, 40 insertions(+), 23 deletions(-)

Comments

Borislav Petkov Sept. 30, 2020, 5:56 p.m. UTC | #1
On Tue, Sep 29, 2020 at 04:16:44PM -0500, minyard@acm.org wrote:
> From: Corey Minyard <cminyard@mvista.com>
> 
> If kdump is enabled, the handling of shooting down CPUs does not use the
> RESET_VECTOR irq before trying to use NMIs to shoot down the CPUs.

So I've read that commit message like a bunch of times already and am
getting none the wiser about what the situation is, who's doing what and
what is this thing fixing.

It must be something about kdumping a kernel and an MCE happening at the
same time and we did something about this a while ago, see:

 5bc329503e81 ("x86/mce: Handle broadcasted MCE gracefully with kexec")

and that is simply letting CPUs which are not doing the kexec-ing
continue from the broadcasted MCE holding pattern so that kexec
finishes.

So please explain exactly what this problem is, who's doing what, when
does the MCE happen etc?

I've found this:

https://lkml.kernel.org/r/1600339070-570840-1-git-send-email-wubo40@huawei.com

and that sounds like the problem and I'm going to read that one in
detail if that is the issue we're talking about. But from skimming over
it, it sounds like the commit I mentioned above should take care of it.

Although I have no clue what this means:

"1) MCE appears on all CPUs, Currently all CPUs are in the NMI interrupt 
   context."

I think he means, all CPUs are in the #MC handler.

Also, looking at that mail, what kernel is Wu Bo using?

[ 4767.947960] BUG: unable to handle kernel paging request at ffff893e40000000
[ 4767.947962] PGD 13c001067 P4D 13c001067 PUD 0
[ 4767.947965] Oops: 0000 [#1] SMP PTI
[ 4767.947967] CPU: 0 PID: 0 Comm: swapper/0

There's no kernel version on this line above. Taint line is gone too. Why?

Judging by the "unable to handle kernel paging request" text, that must
be from before

  f28b11a2abd9 ("x86/fault: Reword initial BUG message for unhandled page faults")

which is 5.1. The commit above is in 5.1 but Wu Bo better try the latest
*upstream* kernel first. The stress being on *upstream*.

Also that kernel is in a guest - I take MCEs in guests not very
seriously.

So before we waste time, let's explain why we're doing all that exercise
first.

Thx.
Corey Minyard Sept. 30, 2020, 6:49 p.m. UTC | #2
On Wed, Sep 30, 2020 at 07:56:33PM +0200, Borislav Petkov wrote:
> On Tue, Sep 29, 2020 at 04:16:44PM -0500, minyard@acm.org wrote:
> > From: Corey Minyard <cminyard@mvista.com>
> > 
> > If kdump is enabled, the handling of shooting down CPUs does not use the
> > RESET_VECTOR irq before trying to use NMIs to shoot down the CPUs.
> 
> So I've read that commit message like a bunch of times already and am
> getting none the wiser about what the situation is, who's doing what and
> what is this thing fixing.
> 
> It must be something about kdumping a kernel and an MCE happening at the
> same time and we did something about this a while ago, see:
> 
>  5bc329503e81 ("x86/mce: Handle broadcasted MCE gracefully with kexec")
> 
> and that is simply letting CPUs which are not doing the kexec-ing
> continue from the broadcasted MCE holding pattern so that kexec
> finishes.
> 
> So please explain exactly what this problem is, who's doing what, when
> does the MCE happen etc?
> 
> I've found this:
> 
> https://lkml.kernel.org/r/1600339070-570840-1-git-send-email-wubo40@huawei.com
> 
> and that sounds like the problem and I'm going to read that one in
> detail if that is the issue we're talking about. But from skimming over
> it, it sounds like the commit I mentioned above should take care of it.

That is the original post for this, yes.

Wu, what kernel version are you using?  Can you try to reproduce on the
current mainstream kernel?  I just assumed it was current.

The description isn't that great, no.  I'll try again.

The problem is that while waiting in wait_for_panic() in the mce code,
interrupts are enabled.  In the kdump case, there is nothing that will
wake them up, so they will sit there in the loop until they time out.

In the meantime, the CPU handling the panic calls some IPMI code that
stores panic information in the IPMI event log.  Since interrupts are
enabled on the CPUs in wait_for_panic(), those CPUs are handling
interrupts from the IPMI hardware.  They will not, however, handle
the NMI IPI that gets sent from the panic() code for kdump.

The IPMI panic code disables its locks to avoid a deadlock in case the
exception happened while executing IPMI code.  So the panic-handling
part of IPMI and the IPMI interrupt handling are both running at the
same time, messing each other up.

It seems, in general, like a bad idea to have interrupts enabled on some
CPUs while running through the panic() code and after the new kdump
kernel has started.  There are other issues that might come from this.

I'm also not quite sure how kdump register information for the CPUs
in wait_for_panic() gets put into the kernel coredump if you don't do
something like my patch.

Thanks,

-corey

> 
> Although I have no clue what this means:
> 
> "1) MCE appears on all CPUs, Currently all CPUs are in the NMI interrupt 
>    context."
> 
> I think he means, all CPUs are in the #MC handler.
> 
> Also, looking at that mail, what kernel is Wu Bo using?
> 
> [ 4767.947960] BUG: unable to handle kernel paging request at ffff893e40000000
> [ 4767.947962] PGD 13c001067 P4D 13c001067 PUD 0
> [ 4767.947965] Oops: 0000 [#1] SMP PTI
> [ 4767.947967] CPU: 0 PID: 0 Comm: swapper/0
> 
> There's no kernel version on this line above. Taint line is gone too. Why?
> 
> Judging by the "unable to handle kernel paging request" text, that must
> be from before
> 
>   f28b11a2abd9 ("x86/fault: Reword initial BUG message for unhandled page faults")
> 
> which is 5.1. The commit above is in 5.1 but Wu Bo better try the latest
> *upstream* kernel first. The stress being on *upstream*.
> 
> Also that kernel is in a guest - I take MCEs in guests not very
> seriously.
> 
> So before we waste time, let's explain why we're doing all that exercise
> first.
> 
> Thx.
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
Borislav Petkov Oct. 1, 2020, 11:33 a.m. UTC | #3
On Wed, Sep 30, 2020 at 01:49:06PM -0500, Corey Minyard wrote:
> That is the original post for this, yes.
> 
> Wu, what kernel version are you using?  Can you try to reproduce on the
> current mainstream kernel?  I just assumed it was current.
> 
> The description isn't that great, no.  I'll try again.
> 
> The problem is that while waiting in wait_for_panic() in the mce code,
> interrupts are enabled.  In the kdump case, there is nothing that will
> wake them up, so they will sit there in the loop until they time out.
> 
> In the mean time, the cpu handling the panic calls some IPMI code that
> stores panic information in the IPMI event log.  Since interrupts are
> enabled on the CPUs in wait_for_panic(), those CPUs are handling
> interrupts from the IPMI hardware.  They will not, however, handle
> the NMI IPI that gets sent from the panic() code for kdump.
> 
> The IPMI code has disabled locks to avoid a deadlock if the exception
> happens while in IPMI code.  So the panic-handling part of IPMI and the
> IPMI interrupt handling are both running at the same time, messing each
> other up.
> 
> It seems, in general, like a bad idea to have interrupts enabled on some
> CPUs while running through the panic() code and after the new kdump
> kernel has started.  There are other issues that might come from this.
> 
> I'm also not quite sure how kdump register information for the CPUs
> in wait_for_panic() gets put into the kernel coredump if you don't do
> something like my patch.
> 

Ok, thanks for taking the time, this makes a lot more sense to me.

Now, from looking at the code, I'm thinking that we should simply "let
go" the other CPUs just like we do in mce_check_crashing_cpu(), if
kdump is starting. Instead of spinning with IRQs enabled.

Simply run mce_check_crashing_cpu() at wait_for_panic() entry, just like
exc_machine_check_kernel() does now. The logic being, if we're going
to wait for panic but we're starting the kdump kernel anyway, then we
better let the CPUs go so that they can do all kinds of IRQ servicing
etc and don't interfere.

*If* we don't kdump, we timeout the usual way.

Tony, how does that logic sound?
Corey Minyard Oct. 1, 2020, 1:44 p.m. UTC | #4
On Thu, Oct 01, 2020 at 01:33:18PM +0200, Borislav Petkov wrote:
> On Wed, Sep 30, 2020 at 01:49:06PM -0500, Corey Minyard wrote:
> > That is the original post for this, yes.
> > 
> > Wu, what kernel version are you using?  Can you try to reproduce on the
> > current mainstream kernel?  I just assumed it was current.
> > 
> > The description isn't that great, no.  I'll try again.
> > 
> > The problem is that while waiting in wait_for_panic() in the mce code,
> > interrupts are enabled.  In the kdump case, there is nothing that will
> > wake them up, so they will sit there in the loop until they time out.
> > 
> > In the mean time, the cpu handling the panic calls some IPMI code that
> > stores panic information in the IPMI event log.  Since interrupts are
> > enabled on the CPUs in wait_for_panic(), those CPUs are handling
> > interrupts from the IPMI hardware.  They will not, however, handle
> > the NMI IPI that gets sent from the panic() code for kdump.
> > 
> > The IPMI code has disabled locks to avoid a deadlock if the exception
> > happens while in IPMI code.  So the panic-handling part of IPMI and the
> > IPMI interrupt handling are both running at the same time, messing each
> > other up.
> > 
> > It seems, in general, like a bad idea to have interrupts enabled on some
> > CPUs while running through the panic() code and after the new kdump
> > kernel has started.  There are other issues that might come from this.
> > 
> > I'm also not quite sure how kdump register information for the CPUs
> > in wait_for_panic() gets put into the kernel coredump if you don't do
> > something like my patch.
> > 
> 
> Ok, thanks for taking the time, this makes a lot more sense to me.
> 
> Now, from looking at the code, I'm thinking that we should simply "let
> go" the other CPUs just like we do in mce_check_crashing_cpu(), if
> kdump is starting. Instead of spinning with IRQs enabled.
> 
> Simply run mce_check_crashing_cpu() at wait_for_panic() entry, just like
> exc_machine_check_kernel() does now. The logic being, if we're going
> to wait for panic but we're starting the kdump kernel anyway, then we
> better let the CPUs go so that they can do all kinds of IRQ servicing
> etc and don't interfere.

I don't understand the last sentence.  You don't want to do IRQ
servicing when you are going to kdump.  That's going to change the state
of the kernel and you may lose information, and it may interfere with
the kdump process.

That's why (well, one of many reasons why) kdump goes straight to NMI
shootdown.

Also, it's still unclear to me how kdump would get the register
information for the CPUs that enter wait_for_panic().

> 
> *If* we don't kdump, we timeout the usual way.

I was thinking about this some yesterday.  It seems to me that enabling
IRQS in an MCE handler is just a bad idea, but it's really a bad idea
for kdump.

I think you could just remove the irq enable in wait_for_panic() and
call run_crash_ipi_callback() from the loop there without messing
with irqs.  In the non-kdump case, it waits a second for the
RESET_VECTOR to happen in native_stop_other_cpus() then it uses an NMI
shootdown.  So it will delay for a second in the normal panic case.
The kdump case uses nmi_shootdown_cpus(), which doesn't do the
RESET_VECTOR stop.

-corey

> 
> Tony, how does that logic sound?
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
Borislav Petkov Oct. 1, 2020, 4:16 p.m. UTC | #5
On Thu, Oct 01, 2020 at 08:44:49AM -0500, Corey Minyard wrote:
> I don't understand the last sentence.  You don't want to do IRQ
> servicing when you are going to kdump.  That's going to change the state
> of the kernel and you may lose information, and it may interfere with
> the kdump process.

I misspoke: what I meant was, what mce_check_crashing_cpu() does - free
the CPU from the #MC handler so that it can do whatever it is supposed
to do under kdump.

> That's why (well, one of many reasons why) kdump goes straight to NMI
> shootdown.

Right.

> Also, it's still unclear to me how kdump would get the register
> information for the CPUs that enter wait_for_panic().

Yes, you said that already.

> I was thinking about this some yesterday.  It seems to me that enabling
> IRQS in an MCE handler is just a bad idea, but it's really a bad idea
> for kdump.

I don't think this code ever thought about kdump.

> I think you could just remove the irq enable in wait_for_panic() and
> call run_crash_ipi_callback() from the loop there without messing
> with irqs.  In the non-kdump case, it waits a second for the
> RESET_VECTOR to happen in native_stop_other_cpus() then it uses an NMI
> shootdown.  So it will delay for a second in the normal panic case.
> The kdump case uses nmi_shootdown_cpus(), which doesn't do the
> RESET_VECTOR stop.

Well, I don't think the MCE code should know anything about kdump. What
it should do in the kdump case - i.e., when crashing_cpu != -1, is
simply call mce_check_crashing_cpu() in wait_for_panic(). In that case,
the only thing it should do is get out of the #MC handler so that it can
get the shootdown NMI.

For all other cases, it should do what wait_for_panic() has been doing
so far.

Thx.
Luck, Tony Oct. 1, 2020, 4:29 p.m. UTC | #6
>> I was thinking about this some yesterday.  It seems to me that enabling
>> IRQS in an MCE handler is just a bad idea, but it's really a bad idea
>> for kdump.
>
> I don't think this code ever thought about kdump.

How useful is kdump after a machine check induced crash anyway?

kdump is useful for debugging software problems.  There are very
few ways that a software bug can result in a machine check. There
are many ways that a hardware problem can trigger a machine check
and crash.

So it would seem (statistically) that the analysis of almost every kdump
after a machine check just says "h/w issue".

-Tony
Borislav Petkov Oct. 1, 2020, 4:58 p.m. UTC | #7
On Thu, Oct 01, 2020 at 04:29:49PM +0000, Luck, Tony wrote:
> >> I was thinking about this some yesterday.  It seems to me that enabling
> >> IRQS in an MCE handler is just a bad idea, but it's really a bad idea
> >> for kdump.
> >
> > I don't think this code ever thought about kdump.
> 
> How useful is kdump after a machine check induced crash anyway?
> 
> kdump is useful for debugging software problems.  There are very
> few ways that a software bug can result in a machine check. There
> are many ways that a hardware problem can trigger a machine check
> and crash.
> 
> So it would seem (statistically) that the analysis of almost every kdump
> after a machine check just says "h/w issue".

You certainly have a point.

The only use case I can think of is being able to read out the MCE
signature from the first kernel's buffer - *if* it has landed there - to
at least know what it is that caused the explosion.
Corey Minyard Oct. 1, 2020, 5:12 p.m. UTC | #8
On Thu, Oct 01, 2020 at 04:29:49PM +0000, Luck, Tony wrote:
> >> I was thinking about this some yesterday.  It seems to me that enabling
> >> IRQS in an MCE handler is just a bad idea, but it's really a bad idea
> >> for kdump.
> >
> > I don't think this code ever thought about kdump.
> 
> How useful is kdump after a machine check induced crash anyway?
> 
> kdump is useful for debugging software problems.  There are very
> few ways that a software bug can result in a machine check. There
> are many ways that a hardware problem can trigger a machine check
> and crash.
> 
> So it would seem (statistically) that the analysis of almost every kdump
> after a machine check just says "h/w issue".

I don't really know.  It seems like having an idea of what the software
was doing when the hardware died might be useful for the hardware
engineers.  I really don't know much about what triggers MCEs, though,
besides memory errors the hardware couldn't correct.

You could say that the regs don't matter, I suppose, and that's
probably fine.  But if it's easy enough to do, the interfaces are
already there and work, and it speeds up the crash process a bit, why
not do it?

-corey
Zhiqiang Liu Oct. 10, 2020, 1:36 a.m. UTC | #9
Corey, you missed Cc'ing Wu Bo (wubo40@huawei.com), so he cannot
receive these messages.

On 2020/9/30 5:16, minyard@acm.org wrote:
> From: Corey Minyard <cminyard@mvista.com>
> 
> If kdump is enabled, the handling of shooting down CPUs does not use the
> RESET_VECTOR irq before trying to use NMIs to shoot down the CPUs.
> 
> For normal errors that is fine.  MCEs, however, interrupt all CPUs at
> the same time so they are already running in an exception that is
> higher priority than an NMI, so sending them an NMI won't do anything.
> The MCE code in wait_for_panic() is set up to receive the RESET_VECTOR
> because it enables irqs, but it won't work on the NMI-only case.
> 
> There is already code in place to scan for the NMI callback being ready,
> simply call that from the MCE's wait_for_panic() code so it will pick up
> and handle it if an NMI shootdown is requested.  This required
> propagating the registers down to wait_for_panic().
> 
> Reported-by: Wu Bo <wubo40@huawei.com>
> Cc: hidehiro.kawai.ez@hitachi.com
> Cc: linfeilong@huawei.com
> Cc: liuzhiqiang26@huawei.com
> Signed-off-by: Corey Minyard <cminyard@mvista.com>
> Tested-by: Wu Bo <wubo40@huawei.com>
> ---
> I hadn't heard anything, so I thought I would re-post with the updated
> version of the patch.
> 
> Wu Bo found this doing kdumps because the IPMI driver saves panic
> information to the IPMI event log during a panic.  But it was getting
> interrupts at the same time because the other cores had interrupts
> enabled, causing the process to take a long time.
> 
> Having interrupt enabled during a kdump shutdown and while the new kdump
> kernel is running is obviously a bad thing and can cause other problems,
> too.  I think this is the right fix, but I'm not an expert in this code.
> 
> The only thing I can think of that might be wrong with this is that if
> you are going to take a kdump, it might be best to never re-enable
> interrupts at all.  But I'm not sure.
> 
> Changes since v1:
>   * Moved regs to the beginning of all parameter lists that it was
>     added to.
>   * Fixed the header text about NMIs and MCEs.
> 
>  arch/x86/kernel/cpu/mce/core.c | 63 +++++++++++++++++++++-------------
>  1 file changed, 40 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index f43a78bde670..b786608f9a21 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -282,20 +282,35 @@ static int fake_panic;
>  static atomic_t mce_fake_panicked;
>  
>  /* Panic in progress. Enable interrupts and wait for final IPI */
> -static void wait_for_panic(void)
> +static void wait_for_panic(struct pt_regs *regs)
>  {
>  	long timeout = PANIC_TIMEOUT*USEC_PER_SEC;
>  
>  	preempt_disable();
>  	local_irq_enable();
> -	while (timeout-- > 0)
> +	while (timeout-- > 0) {
> +		/*
> +		 * We are in an NMI waiting to be stopped by the
> +		 * handling processor.  For kdump handling, we need to
> +		 * be monitoring crash_ipi_issued, since that is what
> +		 * is used for the NMI stop done by kdump.  But we
> +		 * also need to have interrupts enabled so that
> +		 * RESET_VECTOR will interrupt us on a normal
> +		 * shutdown.
> +		 */
> +		local_irq_disable();
> +		run_crash_ipi_callback(regs);
> +		local_irq_enable();
> +
>  		udelay(1);
> +	}
>  	if (panic_timeout == 0)
>  		panic_timeout = mca_cfg.panic_timeout;
>  	panic("Panicing machine check CPU died");
>  }
>  
> -static void mce_panic(const char *msg, struct mce *final, char *exp)
> +static void mce_panic(struct pt_regs *regs, const char *msg,
> +		      struct mce *final, char *exp)
>  {
>  	int apei_err = 0;
>  	struct llist_node *pending;
> @@ -306,7 +321,7 @@ static void mce_panic(const char *msg, struct mce *final, char *exp)
>  		 * Make sure only one CPU runs in machine check panic
>  		 */
>  		if (atomic_inc_return(&mce_panicked) > 1)
> -			wait_for_panic();
> +			wait_for_panic(regs);
>  		barrier();
>  
>  		bust_spinlocks(1);
> @@ -817,7 +832,7 @@ static atomic_t mce_callin;
>  /*
>   * Check if a timeout waiting for other CPUs happened.
>   */
> -static int mce_timed_out(u64 *t, const char *msg)
> +static int mce_timed_out(struct pt_regs *regs, u64 *t, const char *msg)
>  {
>  	/*
>  	 * The others already did panic for some reason.
> @@ -827,12 +842,12 @@ static int mce_timed_out(u64 *t, const char *msg)
>  	 */
>  	rmb();
>  	if (atomic_read(&mce_panicked))
> -		wait_for_panic();
> +		wait_for_panic(regs);
>  	if (!mca_cfg.monarch_timeout)
>  		goto out;
>  	if ((s64)*t < SPINUNIT) {
>  		if (mca_cfg.tolerant <= 1)
> -			mce_panic(msg, NULL, NULL);
> +			mce_panic(regs, msg, NULL, NULL);
>  		cpu_missing = 1;
>  		return 1;
>  	}
> @@ -866,7 +881,7 @@ static int mce_timed_out(u64 *t, const char *msg)
>   * All the spin loops have timeouts; when a timeout happens a CPU
>   * typically elects itself to be Monarch.
>   */
> -static void mce_reign(void)
> +static void mce_reign(struct pt_regs *regs)
>  {
>  	int cpu;
>  	struct mce *m = NULL;
> @@ -896,7 +911,7 @@ static void mce_reign(void)
>  	 * other CPUs.
>  	 */
>  	if (m && global_worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3)
> -		mce_panic("Fatal machine check", m, msg);
> +		mce_panic(regs, "Fatal machine check", m, msg);
>  
>  	/*
>  	 * For UC somewhere we let the CPU who detects it handle it.
> @@ -909,7 +924,8 @@ static void mce_reign(void)
>  	 * source or one CPU is hung. Panic.
>  	 */
>  	if (global_worst <= MCE_KEEP_SEVERITY && mca_cfg.tolerant < 3)
> -		mce_panic("Fatal machine check from unknown source", NULL, NULL);
> +		mce_panic(regs,
> +			  "Fatal machine check from unknown source", NULL, NULL);
>  
>  	/*
>  	 * Now clear all the mces_seen so that they don't reappear on
> @@ -928,7 +944,7 @@ static atomic_t global_nwo;
>   * in the entry order.
>   * TBD double check parallel CPU hotunplug
>   */
> -static int mce_start(int *no_way_out)
> +static int mce_start(struct pt_regs *regs, int *no_way_out)
>  {
>  	int order;
>  	int cpus = num_online_cpus();
> @@ -948,7 +964,7 @@ static int mce_start(int *no_way_out)
>  	 * Wait for everyone.
>  	 */
>  	while (atomic_read(&mce_callin) != cpus) {
> -		if (mce_timed_out(&timeout,
> +		if (mce_timed_out(regs, &timeout,
>  				  "Timeout: Not all CPUs entered broadcast exception handler")) {
>  			atomic_set(&global_nwo, 0);
>  			return -1;
> @@ -974,7 +990,7 @@ static int mce_start(int *no_way_out)
>  		 * only seen by one CPU before cleared, avoiding duplicates.
>  		 */
>  		while (atomic_read(&mce_executing) < order) {
> -			if (mce_timed_out(&timeout,
> +			if (mce_timed_out(regs, &timeout,
>  					  "Timeout: Subject CPUs unable to finish machine check processing")) {
>  				atomic_set(&global_nwo, 0);
>  				return -1;
> @@ -995,7 +1011,7 @@ static int mce_start(int *no_way_out)
>   * Synchronize between CPUs after main scanning loop.
>   * This invokes the bulk of the Monarch processing.
>   */
> -static int mce_end(int order)
> +static int mce_end(struct pt_regs *regs, int order)
>  {
>  	int ret = -1;
>  	u64 timeout = (u64)mca_cfg.monarch_timeout * NSEC_PER_USEC;
> @@ -1019,13 +1035,13 @@ static int mce_end(int order)
>  		 * loops.
>  		 */
>  		while (atomic_read(&mce_executing) <= cpus) {
> -			if (mce_timed_out(&timeout,
> +			if (mce_timed_out(regs, &timeout,
>  					  "Timeout: Monarch CPU unable to finish machine check processing"))
>  				goto reset;
>  			ndelay(SPINUNIT);
>  		}
>  
> -		mce_reign();
> +		mce_reign(regs);
>  		barrier();
>  		ret = 0;
>  	} else {
> @@ -1033,7 +1049,7 @@ static int mce_end(int order)
>  		 * Subject: Wait for Monarch to finish.
>  		 */
>  		while (atomic_read(&mce_executing) != 0) {
> -			if (mce_timed_out(&timeout,
> +			if (mce_timed_out(regs, &timeout,
>  					  "Timeout: Monarch CPU did not finish machine check processing"))
>  				goto reset;
>  			ndelay(SPINUNIT);
> @@ -1286,9 +1302,9 @@ noinstr void do_machine_check(struct pt_regs *regs)
>  	 */
>  	if (lmce) {
>  		if (no_way_out)
> -			mce_panic("Fatal local machine check", &m, msg);
> +			mce_panic(regs, "Fatal local machine check", &m, msg);
>  	} else {
> -		order = mce_start(&no_way_out);
> +		order = mce_start(regs, &no_way_out);
>  	}
>  
>  	__mc_scan_banks(&m, final, toclear, valid_banks, no_way_out, &worst);
> @@ -1301,7 +1317,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
>  	 * When there's any problem use only local no_way_out state.
>  	 */
>  	if (!lmce) {
> -		if (mce_end(order) < 0)
> +		if (mce_end(regs, order) < 0)
>  			no_way_out = worst >= MCE_PANIC_SEVERITY;
>  	} else {
>  		/*
> @@ -1314,7 +1330,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
>  		 */
>  		if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) {
>  			mce_severity(&m, cfg->tolerant, &msg, true);
> -			mce_panic("Local fatal machine check!", &m, msg);
> +			mce_panic(regs, "Local fatal machine check!", &m, msg);
>  		}
>  	}
>  
> @@ -1325,7 +1341,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
>  	if (cfg->tolerant == 3)
>  		kill_it = 0;
>  	else if (no_way_out)
> -		mce_panic("Fatal machine check on current CPU", &m, msg);
> +		mce_panic(regs, "Fatal machine check on current CPU", &m, msg);
>  
>  	if (worst > 0)
>  		irq_work_queue(&mce_irq_work);
> @@ -1361,7 +1377,8 @@ noinstr void do_machine_check(struct pt_regs *regs)
>  		 */
>  		if (m.kflags & MCE_IN_KERNEL_RECOV) {
>  			if (!fixup_exception(regs, X86_TRAP_MC, 0, 0))
> -				mce_panic("Failed kernel mode recovery", &m, msg);
> +				mce_panic(regs, "Failed kernel mode recovery",
> +					  &m, msg);
>  		}
>  	}
>  }
>
Patch

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index f43a78bde670..b786608f9a21 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -282,20 +282,35 @@  static int fake_panic;
 static atomic_t mce_fake_panicked;
 
 /* Panic in progress. Enable interrupts and wait for final IPI */
-static void wait_for_panic(void)
+static void wait_for_panic(struct pt_regs *regs)
 {
 	long timeout = PANIC_TIMEOUT*USEC_PER_SEC;
 
 	preempt_disable();
 	local_irq_enable();
-	while (timeout-- > 0)
+	while (timeout-- > 0) {
+		/*
+		 * We are in an NMI waiting to be stopped by the
+		 * handling processor.  For kdump handling, we need to
+		 * be monitoring crash_ipi_issued, since that is what
+		 * is used for the NMI stop done by kdump.  But we
+		 * also need to have interrupts enabled so that
+		 * RESET_VECTOR will interrupt us on a normal
+		 * shutdown.
+		 */
+		local_irq_disable();
+		run_crash_ipi_callback(regs);
+		local_irq_enable();
+
 		udelay(1);
+	}
 	if (panic_timeout == 0)
 		panic_timeout = mca_cfg.panic_timeout;
 	panic("Panicing machine check CPU died");
 }
 
-static void mce_panic(const char *msg, struct mce *final, char *exp)
+static void mce_panic(struct pt_regs *regs, const char *msg,
+		      struct mce *final, char *exp)
 {
 	int apei_err = 0;
 	struct llist_node *pending;
@@ -306,7 +321,7 @@  static void mce_panic(const char *msg, struct mce *final, char *exp)
 		 * Make sure only one CPU runs in machine check panic
 		 */
 		if (atomic_inc_return(&mce_panicked) > 1)
-			wait_for_panic();
+			wait_for_panic(regs);
 		barrier();
 
 		bust_spinlocks(1);
@@ -817,7 +832,7 @@  static atomic_t mce_callin;
 /*
  * Check if a timeout waiting for other CPUs happened.
  */
-static int mce_timed_out(u64 *t, const char *msg)
+static int mce_timed_out(struct pt_regs *regs, u64 *t, const char *msg)
 {
 	/*
 	 * The others already did panic for some reason.
@@ -827,12 +842,12 @@  static int mce_timed_out(u64 *t, const char *msg)
 	 */
 	rmb();
 	if (atomic_read(&mce_panicked))
-		wait_for_panic();
+		wait_for_panic(regs);
 	if (!mca_cfg.monarch_timeout)
 		goto out;
 	if ((s64)*t < SPINUNIT) {
 		if (mca_cfg.tolerant <= 1)
-			mce_panic(msg, NULL, NULL);
+			mce_panic(regs, msg, NULL, NULL);
 		cpu_missing = 1;
 		return 1;
 	}
@@ -866,7 +881,7 @@  static int mce_timed_out(u64 *t, const char *msg)
  * All the spin loops have timeouts; when a timeout happens a CPU
  * typically elects itself to be Monarch.
  */
-static void mce_reign(void)
+static void mce_reign(struct pt_regs *regs)
 {
 	int cpu;
 	struct mce *m = NULL;
@@ -896,7 +911,7 @@  static void mce_reign(void)
 	 * other CPUs.
 	 */
 	if (m && global_worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3)
-		mce_panic("Fatal machine check", m, msg);
+		mce_panic(regs, "Fatal machine check", m, msg);
 
 	/*
 	 * For UC somewhere we let the CPU who detects it handle it.
@@ -909,7 +924,8 @@  static void mce_reign(void)
 	 * source or one CPU is hung. Panic.
 	 */
 	if (global_worst <= MCE_KEEP_SEVERITY && mca_cfg.tolerant < 3)
-		mce_panic("Fatal machine check from unknown source", NULL, NULL);
+		mce_panic(regs,
+			  "Fatal machine check from unknown source", NULL, NULL);
 
 	/*
 	 * Now clear all the mces_seen so that they don't reappear on
@@ -928,7 +944,7 @@  static atomic_t global_nwo;
  * in the entry order.
  * TBD double check parallel CPU hotunplug
  */
-static int mce_start(int *no_way_out)
+static int mce_start(struct pt_regs *regs, int *no_way_out)
 {
 	int order;
 	int cpus = num_online_cpus();
@@ -948,7 +964,7 @@  static int mce_start(int *no_way_out)
 	 * Wait for everyone.
 	 */
 	while (atomic_read(&mce_callin) != cpus) {
-		if (mce_timed_out(&timeout,
+		if (mce_timed_out(regs, &timeout,
 				  "Timeout: Not all CPUs entered broadcast exception handler")) {
 			atomic_set(&global_nwo, 0);
 			return -1;
@@ -974,7 +990,7 @@  static int mce_start(int *no_way_out)
 		 * only seen by one CPU before cleared, avoiding duplicates.
 		 */
 		while (atomic_read(&mce_executing) < order) {
-			if (mce_timed_out(&timeout,
+			if (mce_timed_out(regs, &timeout,
 					  "Timeout: Subject CPUs unable to finish machine check processing")) {
 				atomic_set(&global_nwo, 0);
 				return -1;
@@ -995,7 +1011,7 @@  static int mce_start(int *no_way_out)
  * Synchronize between CPUs after main scanning loop.
  * This invokes the bulk of the Monarch processing.
  */
-static int mce_end(int order)
+static int mce_end(struct pt_regs *regs, int order)
 {
 	int ret = -1;
 	u64 timeout = (u64)mca_cfg.monarch_timeout * NSEC_PER_USEC;
@@ -1019,13 +1035,13 @@  static int mce_end(int order)
 		 * loops.
 		 */
 		while (atomic_read(&mce_executing) <= cpus) {
-			if (mce_timed_out(&timeout,
+			if (mce_timed_out(regs, &timeout,
 					  "Timeout: Monarch CPU unable to finish machine check processing"))
 				goto reset;
 			ndelay(SPINUNIT);
 		}
 
-		mce_reign();
+		mce_reign(regs);
 		barrier();
 		ret = 0;
 	} else {
@@ -1033,7 +1049,7 @@  static int mce_end(int order)
 		 * Subject: Wait for Monarch to finish.
 		 */
 		while (atomic_read(&mce_executing) != 0) {
-			if (mce_timed_out(&timeout,
+			if (mce_timed_out(regs, &timeout,
 					  "Timeout: Monarch CPU did not finish machine check processing"))
 				goto reset;
 			ndelay(SPINUNIT);
@@ -1286,9 +1302,9 @@  noinstr void do_machine_check(struct pt_regs *regs)
 	 */
 	if (lmce) {
 		if (no_way_out)
-			mce_panic("Fatal local machine check", &m, msg);
+			mce_panic(regs, "Fatal local machine check", &m, msg);
 	} else {
-		order = mce_start(&no_way_out);
+		order = mce_start(regs, &no_way_out);
 	}
 
 	__mc_scan_banks(&m, final, toclear, valid_banks, no_way_out, &worst);
@@ -1301,7 +1317,7 @@  noinstr void do_machine_check(struct pt_regs *regs)
 	 * When there's any problem use only local no_way_out state.
 	 */
 	if (!lmce) {
-		if (mce_end(order) < 0)
+		if (mce_end(regs, order) < 0)
 			no_way_out = worst >= MCE_PANIC_SEVERITY;
 	} else {
 		/*
@@ -1314,7 +1330,7 @@  noinstr void do_machine_check(struct pt_regs *regs)
 		 */
 		if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) {
 			mce_severity(&m, cfg->tolerant, &msg, true);
-			mce_panic("Local fatal machine check!", &m, msg);
+			mce_panic(regs, "Local fatal machine check!", &m, msg);
 		}
 	}
 
@@ -1325,7 +1341,7 @@  noinstr void do_machine_check(struct pt_regs *regs)
 	if (cfg->tolerant == 3)
 		kill_it = 0;
 	else if (no_way_out)
-		mce_panic("Fatal machine check on current CPU", &m, msg);
+		mce_panic(regs, "Fatal machine check on current CPU", &m, msg);
 
 	if (worst > 0)
 		irq_work_queue(&mce_irq_work);
@@ -1361,7 +1377,8 @@  noinstr void do_machine_check(struct pt_regs *regs)
 		 */
 		if (m.kflags & MCE_IN_KERNEL_RECOV) {
 			if (!fixup_exception(regs, X86_TRAP_MC, 0, 0))
-				mce_panic("Failed kernel mode recovery", &m, msg);
+				mce_panic(regs, "Failed kernel mode recovery",
+					  &m, msg);
 		}
 	}
 }