Message ID | 20191210000733.17979-1-jschoenh@amazon.de (mailing list archive) |
---|---|
Headers | show |
Series | x86/mce: Various fixes and cleanups for MCE handling | expand |
On Tue, Dec 10, 2019 at 01:07:27AM +0100, Jan H. Schönherr wrote: > Hi. > > Here is a series of smallish fixes/cleanups for the handling of MCEs. > > Note, that except for patches 2 and 4, the patches are just compile tested. I tried some UC injections with these patches applied. Stuff still works. But thanks to the UCN/Deferred patch I see: Memory failure: 0x5fe3284: already hardware poisoned which is expected because on Intel we see an SRAR error in bank 1 and the machine check takes the page offline. Then we see a UCNA from another bank for the same address and try to take the same page offline again. This is a return to the old behavior, but might surprise some folks. One nit-pick - I think we should rename the srao_decode_notifier() function since it now handles both SRAO and UCNA. Not sure what a good name is ... something like "uc_decode_notifier()" since it takes action for multiple classes of uncorrected errors? All the patches look OK to me (modulo the above rename suggestion). So consider this a "Reviewed-by" for all six. Thanks in particular for hunting down what happened to the UCNA action ... I was looking for it a couple of months ago and didn't have time to complete the search. -Tony
On 11/12/2019 01.25, Luck, Tony wrote: > On Tue, Dec 10, 2019 at 01:07:27AM +0100, Jan H. Schönherr wrote: >> Hi. >> >> Here is a series of smallish fixes/cleanups for the handling of MCEs. >> >> Note, that except for patches 2 and 4, the patches are just compile tested. > > I tried some UC injections with these patches applied. Stuff > still works. But thanks to the UCN/Deferred patch I see: > > Memory failure: 0x5fe3284: already hardware poisoned > > which is expected because on Intel we see an SRAR error in bank 1 > and the machine check takes the page offline. Then we see a UCNA > from another bank for the same address and try to take the same > page offline again. As I'm not a subject matter expert: But there are cases, where we get an UCNA only, right? So the change makes still sense? > This is a return to the old behavior, but might surprise some folks. I'll add this as a note to the commit message. > One nit-pick - I think we should rename the srao_decode_notifier() > function since it now handles both SRAO and UCNA. Not sure what a > good name is ... something like "uc_decode_notifier()" since it takes > action for multiple classes of uncorrected errors? This and names like "uncorrected_memory_error_notifier()" seem to imply a wider scope than the function actually has. That brings me to another question: should the scope be wider? Instead of filtering for usable addresses and specific severities, we could for example filter for (similar to cec_add_mce()): mce_is_memory_error(m) && !mce_is_correctable(m) && mce_usable_address(m) Would that make sense? Or does that violate anything, that I'm not aware of? > All the patches look OK to me (modulo the above rename suggestion). > So consider this a "Reviewed-by" for all six. Thank you for reviewing. :) Regards Jan -- Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Ralf Herbrich Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879
On Thu, Dec 12, 2019 at 01:25:31PM +0100, Jan H. Schönherr wrote: > This and names like "uncorrected_memory_error_notifier()" seem to imply > a wider scope than the function actually has. That brings me to another > question: should the scope be wider? > > Instead of filtering for usable addresses and specific severities, we > could for example filter for (similar to cec_add_mce()): > > mce_is_memory_error(m) && > !mce_is_correctable(m) && > mce_usable_address(m) There's a comment above that code which says what that function wants: /* We eat only correctable DRAM errors with usable addresses. */ > Would that make sense? Or does that violate anything, that I'm not aware of? So this should be a decision of the two CPU vendors basically answering the question: for which error severities you want the kernel to poison pages? Basically a question for Tony and Yazen. CCed.
On Mon, Dec 16, 2019 at 05:52:07PM +0100, Borislav Petkov wrote: > On Thu, Dec 12, 2019 at 01:25:31PM +0100, Jan H. Schönherr wrote: > > This and names like "uncorrected_memory_error_notifier()" seem to imply > > a wider scope than the function actually has. That brings me to another > > question: should the scope be wider? > > > > Instead of filtering for usable addresses and specific severities, we > > could for example filter for (similar to cec_add_mce()): > > > > mce_is_memory_error(m) && > > !mce_is_correctable(m) && > > mce_usable_address(m) > > There's a comment above that code which says what that function wants: > > /* We eat only correctable DRAM errors with usable addresses. */ > > > Would that make sense? Or does that violate anything, that I'm not aware of? > > So this should be a decision of the two CPU vendors basically answering > the question: for which error severities you want the kernel to poison > pages? > > Basically a question for Tony and Yazen. CCed. Using Intel naming, I'd like the SRAO and UCNA severity uncorrected errors to take the soft offline path to stop using pages. -Tony
> -----Original Message----- > From: Luck, Tony <tony.luck@intel.com> > Sent: Monday, December 16, 2019 4:59 PM > To: Borislav Petkov <bp@alien8.de> > Cc: Jan H. Schönherr <jschoenh@amazon.de>; Ghannam, Yazen <Yazen.Ghannam@amd.com>; linux-edac@vger.kernel.org; Thomas > Gleixner <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>; H. Peter Anvin <hpa@zytor.com>; x86@kernel.org > Subject: Re: [PATCH 0/6] x86/mce: Various fixes and cleanups for MCE handling > > On Mon, Dec 16, 2019 at 05:52:07PM +0100, Borislav Petkov wrote: > > On Thu, Dec 12, 2019 at 01:25:31PM +0100, Jan H. Schönherr wrote: > > > This and names like "uncorrected_memory_error_notifier()" seem to imply > > > a wider scope than the function actually has. That brings me to another > > > question: should the scope be wider? > > > > > > Instead of filtering for usable addresses and specific severities, we > > > could for example filter for (similar to cec_add_mce()): > > > > > > mce_is_memory_error(m) && > > > !mce_is_correctable(m) && > > > mce_usable_address(m) > > > > There's a comment above that code which says what that function wants: > > > > /* We eat only correctable DRAM errors with usable addresses. */ > > > > > Would that make sense? Or does that violate anything, that I'm not aware of? > > > > So this should be a decision of the two CPU vendors basically answering > > the question: for which error severities you want the kernel to poison > > pages? > > > > Basically a question for Tony and Yazen. CCed. > > Using Intel naming, I'd like the SRAO and UCNA severity uncorrected > errors to take the soft offline path to stop using pages. > For AMD, I'd like that no errors are handled in the SRAO notifier for now. The DEFERRED severity could be used, but I'd like to do more testing beforehand. Thanks, Yazen
On Tue, Dec 17, 2019 at 01:19:30AM +0000, Ghannam, Yazen wrote: > > Using Intel naming, I'd like the SRAO and UCNA severity uncorrected > > errors to take the soft offline path to stop using pages. Ok. > For AMD, I'd like that no errors are handled in the SRAO notifier for now. > > The DEFERRED severity could be used, but I'd like to do more testing beforehand. If that notifier should be disabled on AMD, we should make it explicit and not rely on a severify define. But sure, do some testing first before you know what exactly needs to happen. Thx.