mbox series

[0/6] x86/mce: Various fixes and cleanups for MCE handling

Message ID 20191210000733.17979-1-jschoenh@amazon.de (mailing list archive)
Headers show
Series x86/mce: Various fixes and cleanups for MCE handling | expand

Message

Jan H. Schönherr Dec. 10, 2019, 12:07 a.m. UTC
Hi.

Here is a series of smallish fixes/cleanups for the handling of MCEs.

Note, that except for patches 2 and 4, the patches are just compile tested.

Regards
Jan

Jan H. Schönherr (6):
  x86/mce: Take action on UCNA/Deferred errors again
  x86/mce: Make mce=nobootlog work again
  x86/mce: Fix possibly incorrect severity calculation on AMD
  x86/mce: Fix handling of optional message string
  x86/mce: Pass MCE message to mce_panic() on failed kernel recovery
  x86/mce: Remove mce_inject_log() in favor of mce_log()

 arch/x86/kernel/cpu/mce/core.c     | 59 ++++++++++++------------------
 arch/x86/kernel/cpu/mce/inject.c   |  2 +-
 arch/x86/kernel/cpu/mce/internal.h |  2 -
 3 files changed, 25 insertions(+), 38 deletions(-)

Comments

Luck, Tony Dec. 11, 2019, 12:25 a.m. UTC | #1
On Tue, Dec 10, 2019 at 01:07:27AM +0100, Jan H. Schönherr wrote:
> Hi.
> 
> Here is a series of smallish fixes/cleanups for the handling of MCEs.
> 
> Note, that except for patches 2 and 4, the patches are just compile tested.

I tried some UC injections with these patches applied. Stuff
still works. But thanks to the UCN/Deferred patch I see:

	Memory failure: 0x5fe3284: already hardware poisoned

which is expected because on Intel we see an SRAR error in bank 1
and the machine check takes the page offline.  Then we see a UCNA
from another bank for the same address and try to take the same
page offline again.

This is a return to the old behavior, but might surprise some folks.

One nit-pick - I think we should rename the srao_decode_notifier()
function since it now handles both SRAO and UCNA. Not sure what a
good name is ... something like "uc_decode_notifier()" since it takes
action for multiple classes of uncorrected errors?

All the patches look OK to me (modulo the above rename suggestion).
So consider this a "Reviewed-by" for all six.

Thanks in particular for hunting down what happened to the UCNA
action ... I was looking for it a couple of months ago and didn't
have time to complete the search.

-Tony
Jan H. Schönherr Dec. 12, 2019, 12:25 p.m. UTC | #2
On 11/12/2019 01.25, Luck, Tony wrote:
> On Tue, Dec 10, 2019 at 01:07:27AM +0100, Jan H. Schönherr wrote:
>> Hi.
>>
>> Here is a series of smallish fixes/cleanups for the handling of MCEs.
>>
>> Note, that except for patches 2 and 4, the patches are just compile tested.
> 
> I tried some UC injections with these patches applied. Stuff
> still works. But thanks to the UCN/Deferred patch I see:
> 
> 	Memory failure: 0x5fe3284: already hardware poisoned
> 
> which is expected because on Intel we see an SRAR error in bank 1
> and the machine check takes the page offline.  Then we see a UCNA
> from another bank for the same address and try to take the same
> page offline again.

As I'm not a subject matter expert: But there are cases, where we
get an UCNA only, right? So the change makes still sense?

> This is a return to the old behavior, but might surprise some folks.

I'll add this as a note to the commit message.

> One nit-pick - I think we should rename the srao_decode_notifier()
> function since it now handles both SRAO and UCNA. Not sure what a
> good name is ... something like "uc_decode_notifier()" since it takes
> action for multiple classes of uncorrected errors?

This and names like "uncorrected_memory_error_notifier()" seem to imply
a wider scope than the function actually has. That brings me to another
question: should the scope be wider?

Instead of filtering for usable addresses and specific severities, we
could for example filter for (similar to cec_add_mce()):

  mce_is_memory_error(m) &&
  !mce_is_correctable(m) &&
  mce_usable_address(m)

Would that make sense? Or does that violate anything, that I'm not aware of?

> All the patches look OK to me (modulo the above rename suggestion).
> So consider this a "Reviewed-by" for all six.

Thank you for reviewing. :)

Regards
Jan

--



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Ralf Herbrich
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
Borislav Petkov Dec. 16, 2019, 4:52 p.m. UTC | #3
On Thu, Dec 12, 2019 at 01:25:31PM +0100, Jan H. Schönherr wrote:
> This and names like "uncorrected_memory_error_notifier()" seem to imply
> a wider scope than the function actually has. That brings me to another
> question: should the scope be wider?
> 
> Instead of filtering for usable addresses and specific severities, we
> could for example filter for (similar to cec_add_mce()):
> 
>   mce_is_memory_error(m) &&
>   !mce_is_correctable(m) &&
>   mce_usable_address(m)

There's a comment above that code which says what that function wants:

	/* We eat only correctable DRAM errors with usable addresses. */

> Would that make sense? Or does that violate anything, that I'm not aware of?

So this should be a decision of the two CPU vendors basically answering
the question: for which error severities you want the kernel to poison
pages?

Basically a question for Tony and Yazen. CCed.
Luck, Tony Dec. 16, 2019, 9:59 p.m. UTC | #4
On Mon, Dec 16, 2019 at 05:52:07PM +0100, Borislav Petkov wrote:
> On Thu, Dec 12, 2019 at 01:25:31PM +0100, Jan H. Schönherr wrote:
> > This and names like "uncorrected_memory_error_notifier()" seem to imply
> > a wider scope than the function actually has. That brings me to another
> > question: should the scope be wider?
> > 
> > Instead of filtering for usable addresses and specific severities, we
> > could for example filter for (similar to cec_add_mce()):
> > 
> >   mce_is_memory_error(m) &&
> >   !mce_is_correctable(m) &&
> >   mce_usable_address(m)
> 
> There's a comment above that code which says what that function wants:
> 
> 	/* We eat only correctable DRAM errors with usable addresses. */
> 
> > Would that make sense? Or does that violate anything, that I'm not aware of?
> 
> So this should be a decision of the two CPU vendors basically answering
> the question: for which error severities you want the kernel to poison
> pages?
> 
> Basically a question for Tony and Yazen. CCed.

Using Intel naming, I'd like the SRAO and UCNA severity uncorrected
errors to take the soft offline path to stop using pages.

-Tony
Yazen Ghannam Dec. 17, 2019, 1:19 a.m. UTC | #5
> -----Original Message-----
> From: Luck, Tony <tony.luck@intel.com>
> Sent: Monday, December 16, 2019 4:59 PM
> To: Borislav Petkov <bp@alien8.de>
> Cc: Jan H. Schönherr <jschoenh@amazon.de>; Ghannam, Yazen <Yazen.Ghannam@amd.com>; linux-edac@vger.kernel.org; Thomas
> Gleixner <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>; H. Peter Anvin <hpa@zytor.com>; x86@kernel.org
> Subject: Re: [PATCH 0/6] x86/mce: Various fixes and cleanups for MCE handling
> 
> On Mon, Dec 16, 2019 at 05:52:07PM +0100, Borislav Petkov wrote:
> > On Thu, Dec 12, 2019 at 01:25:31PM +0100, Jan H. Schönherr wrote:
> > > This and names like "uncorrected_memory_error_notifier()" seem to imply
> > > a wider scope than the function actually has. That brings me to another
> > > question: should the scope be wider?
> > >
> > > Instead of filtering for usable addresses and specific severities, we
> > > could for example filter for (similar to cec_add_mce()):
> > >
> > >   mce_is_memory_error(m) &&
> > >   !mce_is_correctable(m) &&
> > >   mce_usable_address(m)
> >
> > There's a comment above that code which says what that function wants:
> >
> > 	/* We eat only correctable DRAM errors with usable addresses. */
> >
> > > Would that make sense? Or does that violate anything, that I'm not aware of?
> >
> > So this should be a decision of the two CPU vendors basically answering
> > the question: for which error severities you want the kernel to poison
> > pages?
> >
> > Basically a question for Tony and Yazen. CCed.
> 
> Using Intel naming, I'd like the SRAO and UCNA severity uncorrected
> errors to take the soft offline path to stop using pages.
> 

For AMD, I'd like that no errors are handled in the SRAO notifier for now.

The DEFERRED severity could be used, but I'd like to do more testing beforehand.

Thanks,
Yazen
Borislav Petkov Dec. 17, 2019, 7:34 a.m. UTC | #6
On Tue, Dec 17, 2019 at 01:19:30AM +0000, Ghannam, Yazen wrote:
> > Using Intel naming, I'd like the SRAO and UCNA severity uncorrected
> > errors to take the soft offline path to stop using pages.

Ok.

> For AMD, I'd like that no errors are handled in the SRAO notifier for now.
> 
> The DEFERRED severity could be used, but I'd like to do more testing beforehand.

If that notifier should be disabled on AMD, we should make it explicit
and not rely on a severify define. But sure, do some testing first
before you know what exactly needs to happen.

Thx.