Message ID | 20200205125831.20430-1-prarit@redhat.com |
---|---|
State | New, archived |
Series | x86/mce: Enable HSD131, HSM142, HSW131, BDM48, and HSM142 |
On Wed, Feb 05, 2020 at 07:58:31AM -0500, Prarit Bhargava wrote:
> Subject: Re: [PATCH] x86/mce: Enable HSD131, HSM142, HSW131, BDM48, and HSM142

That subject is unreadable for humans.

> Intel Errata HSD131, HSM142, HSW131, and BDM48 report that
> "spurious corrected errors may be logged in the IA32_MC0_STATUS register
> with the valid field (bit 63) set, the uncorrected error field (bit 61)
> not set, a Model Specific Error Code (bits [31:16]) of 0x000F, and
> an MCA Error Code (bits [15:0]) of 0x0005."
>
> Block these spurious errors from the console and logs.

Are they being hit in the wild or why do we need this?

> Links to Intel Specification updates:
> HSD131: https://www.intel.com/content/www/us/en/products/docs/processors/core/4th-gen-core-family-desktop-specification-update.html
> HSM142: https://www.intel.com/content/www/us/en/products/docs/processors/core/4th-gen-core-family-mobile-specification-update.html
> HSW131: https://www.intel.com/content/www/us/en/processors/xeon/xeon-e3-1200v3-spec-update.html
> BDM48: https://www.intel.com/content/www/us/en/products/docs/processors/core/5th-gen-core-family-spec-update.html

Those links tend to get stale with time. If you really want to refer to
the PDFs, add a new bugzilla entry on https://bugzilla.kernel.org/, add
them there as an attachment and add the link to the entry to the commit
message.

> Signed-off-by: Alexander Krupp <centos@akr.yagii.de>

What's that Signed-off-by: tag supposed to mean?

> Signed-off-by: Prarit Bhargava <prarit@redhat.com>
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: x86@kernel.org
> Cc: linux-edac@vger.kernel.org
> ---
> arch/x86/kernel/cpu/mce/core.c | 21 +++++++++++++++++++++
> 1 file changed, 21 insertions(+)

If at all, this should be done by adding an intel_filter_mce() function
and called from filter_mce() so that such errors don't get logged.

Thx.
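The erratum signature quoted above (valid bit 63 set, uncorrected bit 61 clear, MSCOD in bits 31:16 equal to 0x000F, MCACOD in bits 15:0 equal to 0x0005) is where the patch's mask 0xa0000000ffffffff and value 0x80000000000f0005 come from. The user-space sketch below decodes the fields one by one and then folds them into a single mask/compare; the helper names status_field(), is_spurious_ce_signature() and is_spurious_ce_masked() are hypothetical and not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions named in the erratum text (hypothetical macro names). */
#define MCI_STATUS_VAL_BIT 63 /* valid field */
#define MCI_STATUS_UC_BIT  61 /* uncorrected error field */

/* Extract bits [hi:lo] of a 64-bit status value. */
static uint64_t status_field(uint64_t status, unsigned hi, unsigned lo)
{
	assert(hi < 64 && lo <= hi);
	unsigned width = hi - lo + 1;
	uint64_t mask = (width == 64) ? ~0ULL : ((1ULL << width) - 1);

	return (status >> lo) & mask;
}

/* Field-by-field check: VAL=1, UC=0, MSCOD=0x000F, MCACOD=0x0005. */
static bool is_spurious_ce_signature(uint64_t status)
{
	return status_field(status, MCI_STATUS_VAL_BIT, MCI_STATUS_VAL_BIT) == 1 &&
	       status_field(status, MCI_STATUS_UC_BIT, MCI_STATUS_UC_BIT) == 0 &&
	       status_field(status, 31, 16) == 0x000F &&
	       status_field(status, 15, 0) == 0x0005;
}

/* Same test as one mask/compare: keep bits 63, 61 and 31:0, then
 * require VAL set, UC clear and the two error codes. */
static bool is_spurious_ce_masked(uint64_t status)
{
	const uint64_t mask  = (1ULL << 63) | (1ULL << 61) | 0xffffffffULL;
	const uint64_t value = (1ULL << 63) | (0x000FULL << 16) | 0x0005ULL;

	/* mask == 0xa0000000ffffffff, value == 0x80000000000f0005 */
	return (status & mask) == value;
}
```

Both functions accept exactly the same status values: every bit outside 63, 61 and 31:0 (for example bit 62) is ignored, and a record with bit 61 set never matches.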
On 2/6/20 6:10 AM, Borislav Petkov wrote:
> On Wed, Feb 05, 2020 at 07:58:31AM -0500, Prarit Bhargava wrote:
>
>> Subject: Re: [PATCH] x86/mce: Enable HSD131, HSM142, HSW131, BDM48, and HSM142
>
> That subject is unreadable for humans.

Yeah :/ I couldn't think of a better one. Maybe "Block spurious corrected
errors on some Intel processors"? Any other suggestion?

>
>> Intel Errata HSD131, HSM142, HSW131, and BDM48 report that
>> "spurious corrected errors may be logged in the IA32_MC0_STATUS register
>> with the valid field (bit 63) set, the uncorrected error field (bit 61)
>> not set, a Model Specific Error Code (bits [31:16]) of 0x000F, and
>> an MCA Error Code (bits [15:0]) of 0x0005."
>>
>> Block these spurious errors from the console and logs.
>
> Are they being hit in the wild or why do we need this?

Alexander, cc'd, is being hit by this in the wild.

>
>> Links to Intel Specification updates:
>> HSD131: https://www.intel.com/content/www/us/en/products/docs/processors/core/4th-gen-core-family-desktop-specification-update.html
>> HSM142: https://www.intel.com/content/www/us/en/products/docs/processors/core/4th-gen-core-family-mobile-specification-update.html
>> HSW131: https://www.intel.com/content/www/us/en/processors/xeon/xeon-e3-1200v3-spec-update.html
>> BDM48: https://www.intel.com/content/www/us/en/products/docs/processors/core/5th-gen-core-family-spec-update.html
>
> Those links tend to get stale with time. If you really want to refer to
> the PDFs, add a new bugzilla entry on https://bugzilla.kernel.org/, add
> them there as an attachment and add the link to the entry to the commit
> message.
>
>> Signed-off-by: Alexander Krupp <centos@akr.yagii.de>
>
> What's that Signed-off-by: tag supposed to mean?
>
>> Signed-off-by: Prarit Bhargava <prarit@redhat.com>
>> Cc: Tony Luck <tony.luck@intel.com>
>> Cc: Borislav Petkov <bp@alien8.de>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: x86@kernel.org
>> Cc: linux-edac@vger.kernel.org
>> ---
>> arch/x86/kernel/cpu/mce/core.c | 21 +++++++++++++++++++++
>> 1 file changed, 21 insertions(+)
>
> If at all, this should be done by adding an intel_filter_mce() function
> and called from filter_mce() so that such errors don't get logged.

I'll take a look over there.

P.

>
> Thx.
>
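The suggestion above moves the decision from print time (__print_mce()) to a filter that runs before a record is logged. The sketch below models that shape in user space under stated assumptions: cpu_id, mce_record and log_mce() are hypothetical stand-ins, the CPU identity is passed explicitly only so the sketch can be tested, and the affected parts are assumed to be family 6, models 0x3c, 0x3d, 0x45 and 0x46. It is not the kernel's filter_mce() signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel's CPU info and struct mce. */
enum cpu_vendor { VENDOR_INTEL, VENDOR_AMD, VENDOR_OTHER };

struct cpu_id {
	enum cpu_vendor vendor;
	unsigned family;
	unsigned model;
};

struct mce_record {
	unsigned bank;
	uint64_t status;
};

/* Intel-specific filter: returns true if the record should be dropped. */
static bool intel_filter_mce(const struct cpu_id *c, const struct mce_record *m)
{
	/* Assumed affected parts: family 6, models 0x3c, 0x3d, 0x45, 0x46. */
	bool affected = c->family == 6 &&
			(c->model == 0x3c || c->model == 0x3d ||
			 c->model == 0x45 || c->model == 0x46);

	/* Erratum signature in bank 0: VAL=1, UC=0, MSCOD=0x000F, MCACOD=0x0005. */
	return affected && m->bank == 0 &&
	       (m->status & 0xa0000000ffffffffULL) == 0x80000000000f0005ULL;
}

/* Vendor dispatch: one place decides whether a record is logged at all. */
static bool filter_mce(const struct cpu_id *c, const struct mce_record *m)
{
	if (c->vendor == VENDOR_INTEL)
		return intel_filter_mce(c, m);
	return false;
}

/* Illustrative logging path: returns 1 and bumps *logged if the record is kept. */
static int log_mce(const struct cpu_id *c, const struct mce_record *m,
		   unsigned *logged)
{
	assert(logged != NULL);
	if (filter_mce(c, m))
		return 0;	/* dropped before logging, so nothing is printed */
	(*logged)++;
	return 1;
}
```

Because the filter runs before logging, a dropped record is never queued or printed, whereas a print-time hook such as the patch's quirk_noprint only hides the console message.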
On Thu, Feb 06, 2020 at 07:53:34AM -0500, Prarit Bhargava wrote:
> Yeah :/ I couldn't think of a better one. Maybe "Block spurious corrected
> errors on some Intel processors"? Any other suggestion?

"Do not log ..."

> Alexander, cc'd, is being hit by this in the wild.

Do say that in the commit message.

> >> Signed-off-by: Alexander Krupp <centos@akr.yagii.de>
> >
> > What's that Signed-off-by: tag supposed to mean?
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You missed this one.
On 2/6/20 7:53 AM, Prarit Bhargava wrote:
>
>
> On 2/6/20 6:10 AM, Borislav Petkov wrote:
>> On Wed, Feb 05, 2020 at 07:58:31AM -0500, Prarit Bhargava wrote:
>>
>>> Subject: Re: [PATCH] x86/mce: Enable HSD131, HSM142, HSW131, BDM48, and HSM142
>>
>> That subject is unreadable for humans.
>
> Yeah :/ I couldn't think of a better one. Maybe "Block spurious corrected
> errors on some Intel processors"? Any other suggestion?
>
>>
>>> Intel Errata HSD131, HSM142, HSW131, and BDM48 report that
>>> "spurious corrected errors may be logged in the IA32_MC0_STATUS register
>>> with the valid field (bit 63) set, the uncorrected error field (bit 61)
>>> not set, a Model Specific Error Code (bits [31:16]) of 0x000F, and
>>> an MCA Error Code (bits [15:0]) of 0x0005."
>>>
>>> Block these spurious errors from the console and logs.
>>
>> Are they being hit in the wild or why do we need this?
>
> Alexander, cc'd, is being hit by this in the wild.
>
>>
>>> Links to Intel Specification updates:
>>> HSD131: https://www.intel.com/content/www/us/en/products/docs/processors/core/4th-gen-core-family-desktop-specification-update.html
>>> HSM142: https://www.intel.com/content/www/us/en/products/docs/processors/core/4th-gen-core-family-mobile-specification-update.html
>>> HSW131: https://www.intel.com/content/www/us/en/processors/xeon/xeon-e3-1200v3-spec-update.html
>>> BDM48: https://www.intel.com/content/www/us/en/products/docs/processors/core/5th-gen-core-family-spec-update.html
>>
>> Those links tend to get stale with time. If you really want to refer to
>> the PDFs, add a new bugzilla entry on https://bugzilla.kernel.org/, add
>> them there as an attachment and add the link to the entry to the commit
>> message.
>>
>>> Signed-off-by: Alexander Krupp <centos@akr.yagii.de>
>>
>> What's that Signed-off-by: tag supposed to mean?

Sorry. I missed this question, but I really don't understand the question.
Alexander posted a patch in a kernel bugzilla @ Red Hat and I modified the patch
with some additional changes. I don't want him to lose credit for the work so
he's got a proper Signed-off-by tag for this patch.

P.
On Thu, Feb 06, 2020 at 08:05:24AM -0500, Prarit Bhargava wrote:
> Sorry. I missed this question, but I really don't understand the question.
> Alexander posted a patch in a kernel bugzilla @ Red Hat and I modified the patch
> with some additional changes. I don't want him to lose credit for the work so
> he's got a proper Signed-off-by tag for this patch.

This is not how this is expressed. Either you write that in free text in
the commit message or you use Co-developed-by.

More details in Documentation/process/submitting-patches.rst
```diff
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 2c4f949611e4..d893cc764a06 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -121,6 +121,8 @@ static struct irq_work mce_irq_work;
 
 static void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
 
+static int (*quirk_noprint)(struct mce *m);
+
 /*
  * CPU/chipset specific EDAC code can register a notifier call here to print
  * MCE errors in a human-readable form.
@@ -232,6 +234,9 @@ struct mca_msr_regs msr_ops = {
 
 static void __print_mce(struct mce *m)
 {
+	if (quirk_noprint && quirk_noprint(m))
+		return;
+
 	pr_emerg(HW_ERR "CPU %d: Machine Check%s: %Lx Bank %d: %016Lx\n",
 		 m->extcpu,
 		 (m->mcgstatus & MCG_STATUS_MCIP ? " Exception" : ""),
@@ -1622,6 +1627,15 @@ static void quirk_sandybridge_ifu(int bank, struct mce *m, struct pt_regs *regs)
 	m->cs = regs->cs;
 }
 
+static int quirk_spurious_ce_noprint(struct mce *m)
+{
+	if (m->bank == 0 &&
+	    (m->status & 0xa0000000ffffffff) == 0x80000000000f0005)
+		return 1;
+
+	return 0;
+}
+
 /* Add per CPU specific workarounds here */
 static int __mcheck_cpu_apply_quirks(struct cpuinfo_x86 *c)
 {
@@ -1696,6 +1710,13 @@ static int __mcheck_cpu_apply_quirks(struct cpuinfo_x86 *c)
 
 		if (c->x86 == 6 && c->x86_model == 45)
 			quirk_no_way_out = quirk_sandybridge_ifu;
+
+		if ((c->x86 == 6) &&
+		    ((c->x86_model == 0x3c) || (c->x86_model == 0x3d) ||
+		     (c->x86_model == 0x45) || (c->x86_model == 0x46))) {
+			pr_info("MCE errata HSD131, HSM142, HSW131, BDM48, or HSM142 enabled.\n");
+			quirk_noprint = quirk_spurious_ce_noprint;
+		}
 	}
 
 	if (c->x86_vendor == X86_VENDOR_ZHAOXIN) {
```
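The model gate is where hex and decimal literals are easy to mix up: decimal 46 is 0x2e, a different family 6 model than 0x46. A table-driven check with every entry written in hex keeps the list in one place; the sketch below assumes the intended set is family 6, models 0x3c, 0x3d, 0x45 and 0x46, and the names spurious_ce_models and model_has_spurious_ce_erratum() are hypothetical. In kernel code, the symbolic INTEL_FAM6_* constants from <asm/intel-family.h> serve the same purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Compile-time reminder that decimal 46 and hex 0x46 are different models. */
static_assert(46 == 0x2e, "decimal 46 is model 0x2e, not 0x46");

/* Assumed affected family 6 models, all written in hex. */
static const unsigned spurious_ce_models[] = { 0x3c, 0x3d, 0x45, 0x46 };

static bool model_has_spurious_ce_erratum(unsigned family, unsigned model)
{
	if (family != 6)
		return false;

	for (size_t i = 0; i < sizeof(spurious_ce_models) / sizeof(spurious_ce_models[0]); i++) {
		if (spurious_ce_models[i] == model)
			return true;
	}
	return false;
}
```

Keeping the table separate from the status check also makes it straightforward to extend the model list without touching the mask logic.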