x86/mce: Enable HSD131, HSM142, HSW131, BDM48, and HSM142

Message ID 20200205125831.20430-1-prarit@redhat.com (mailing list archive)
State New, archived

Commit Message

Prarit Bhargava Feb. 5, 2020, 12:58 p.m. UTC
Intel Errata HSD131, HSM142, HSW131, and BDM48 report that
"spurious corrected errors may be logged in the IA32_MC0_STATUS register
with the valid field (bit 63) set, the uncorrected error field (bit 61)
not set, a Model Specific Error Code (bits [31:16]) of 0x000F, and
an MCA Error Code (bits [15:0]) of 0x0005."

Block these spurious errors from the console and logs.
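
As a reading aid, the signature quoted from the errata can be written out as a mask and a value. This is a minimal user-space sketch, not kernel code: the macro and function names are illustrative assumptions, while the bit positions and codes are taken from the erratum text above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names; bit positions follow the quoted erratum text. */
#define SKETCH_STATUS_VAL   (1ULL << 63)            /* valid field */
#define SKETCH_STATUS_UC    (1ULL << 61)            /* uncorrected error field */
#define SKETCH_MSCOD_MASK   0x00000000ffff0000ULL   /* Model Specific Error Code, bits [31:16] */
#define SKETCH_MCACOD_MASK  0x000000000000ffffULL   /* MCA Error Code, bits [15:0] */

/* Bits the check looks at: 0xa0000000ffffffff */
#define SPURIOUS_CE_MASK \
	(SKETCH_STATUS_VAL | SKETCH_STATUS_UC | SKETCH_MSCOD_MASK | SKETCH_MCACOD_MASK)

/* Expected pattern: valid set, UC clear, MSCOD 0x000F, MCACOD 0x0005 */
#define SPURIOUS_CE_VALUE \
	(SKETCH_STATUS_VAL | (0x000FULL << 16) | 0x0005ULL)

/* Returns true when a bank-0 status word matches the spurious signature. */
static bool is_spurious_ce(unsigned int bank, uint64_t status)
{
	return bank == 0 && (status & SPURIOUS_CE_MASK) == SPURIOUS_CE_VALUE;
}
```

Bits outside the mask (for example the overflow or address-valid flags) are ignored by this comparison, so a status word can differ in those bits and still match.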

Links to Intel Specification updates:
HSD131: https://www.intel.com/content/www/us/en/products/docs/processors/core/4th-gen-core-family-desktop-specification-update.html
HSM142: https://www.intel.com/content/www/us/en/products/docs/processors/core/4th-gen-core-family-mobile-specification-update.html
HSW131: https://www.intel.com/content/www/us/en/processors/xeon/xeon-e3-1200v3-spec-update.html
BDM48: https://www.intel.com/content/www/us/en/products/docs/processors/core/5th-gen-core-family-spec-update.html

Signed-off-by: Alexander Krupp <centos@akr.yagii.de>
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: linux-edac@vger.kernel.org
---
 arch/x86/kernel/cpu/mce/core.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

Comments

Borislav Petkov Feb. 6, 2020, 11:10 a.m. UTC | #1
On Wed, Feb 05, 2020 at 07:58:31AM -0500, Prarit Bhargava wrote:

> Subject: Re: [PATCH] x86/mce: Enable HSD131, HSM142, HSW131, BDM48, and HSM142

That subject is unreadable for humans.

> Intel Errata HSD131, HSM142, HSW131, and BDM48 report that
> "spurious corrected errors may be logged in the IA32_MC0_STATUS register
> with the valid field (bit 63) set, the uncorrected error field (bit 61)
> not set, a Model Specific Error Code (bits [31:16]) of 0x000F, and
> an MCA Error Code (bits [15:0]) of 0x0005."
> 
> Block these spurious errors from the console and logs.

Are they being hit in the wild or why do we need this?

> Links to Intel Specification updates:
> HSD131: https://www.intel.com/content/www/us/en/products/docs/processors/core/4th-gen-core-family-desktop-specification-update.html
> HSM142: https://www.intel.com/content/www/us/en/products/docs/processors/core/4th-gen-core-family-mobile-specification-update.html
> HSW131: https://www.intel.com/content/www/us/en/processors/xeon/xeon-e3-1200v3-spec-update.html
> BDM48: https://www.intel.com/content/www/us/en/products/docs/processors/core/5th-gen-core-family-spec-update.html

Those links tend to get stale with time. If you really want to refer to
the PDFs, add a new bugzilla entry on https://bugzilla.kernel.org/, add
them there as an attachment and add the link to the entry to the commit
message.

> Signed-off-by: Alexander Krupp <centos@akr.yagii.de>

What's that Signed-off-by: tag supposed to mean?

> Signed-off-by: Prarit Bhargava <prarit@redhat.com>
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: x86@kernel.org
> Cc: linux-edac@vger.kernel.org
> ---
>  arch/x86/kernel/cpu/mce/core.c | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)

If at all, this should be done by adding an intel_filter_mce() function
and called from filter_mce() so that such errors don't get logged.

Thx.
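
The approach suggested above, a vendor-specific filter called from filter_mce(), might look roughly like the following. This is a hedged, self-contained user-space sketch rather than the kernel implementation: struct cpu_sketch, struct mce_sketch, the enum, and the *_sketch function names are hypothetical stand-ins for the kernel's types and helpers, and the model list (0x3c, 0x3d, 0x45, 0x46) is an assumption based on the patch under review.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's cpuinfo_x86 and mce structures. */
enum cpu_vendor_sketch { VENDOR_INTEL, VENDOR_AMD, VENDOR_OTHER };

struct cpu_sketch {
	enum cpu_vendor_sketch vendor;
	unsigned int family;
	unsigned int model;
};

struct mce_sketch {
	unsigned int bank;
	uint64_t status;
};

/* Returns true if the record should be dropped before it is logged. */
static bool intel_filter_mce_sketch(const struct cpu_sketch *c,
				    const struct mce_sketch *m)
{
	if (c->family != 6)
		return false;

	/* Assumed list of affected models; anything else is never filtered. */
	switch (c->model) {
	case 0x3c:
	case 0x3d:
	case 0x45:
	case 0x46:
		break;
	default:
		return false;
	}

	/* Bank 0, valid set, UC clear, MSCOD 0x000F, MCACOD 0x0005. */
	return m->bank == 0 &&
	       (m->status & 0xa0000000ffffffffULL) == 0x80000000000f0005ULL;
}

/* Vendor dispatch: only Intel parts consult the Intel-specific filter here. */
static bool filter_mce_sketch(const struct cpu_sketch *c,
			      const struct mce_sketch *m)
{
	if (c->vendor == VENDOR_INTEL)
		return intel_filter_mce_sketch(c, m);

	return false;
}
```

Filtering at this level would keep the spurious record out of the logging path entirely, rather than only suppressing the console print as the submitted patch does.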
Prarit Bhargava Feb. 6, 2020, 12:53 p.m. UTC | #2
On 2/6/20 6:10 AM, Borislav Petkov wrote:
> On Wed, Feb 05, 2020 at 07:58:31AM -0500, Prarit Bhargava wrote:
> 
>> Subject: Re: [PATCH] x86/mce: Enable HSD131, HSM142, HSW131, BDM48, and HSM142
> 
> That subject is unreadable for humans.

Yeah :/  I couldn't think of a better one.  Maybe "Block spurious corrected
errors on some Intel processors"?  Any other suggestion?

> 
>> Intel Errata HSD131, HSM142, HSW131, and BDM48 report that
>> "spurious corrected errors may be logged in the IA32_MC0_STATUS register
>> with the valid field (bit 63) set, the uncorrected error field (bit 61)
>> not set, a Model Specific Error Code (bits [31:16]) of 0x000F, and
>> an MCA Error Code (bits [15:0]) of 0x0005."
>>
>> Block these spurious errors from the console and logs.
> 
> Are they being hit in the wild or why do we need this?

Alexander, cc'd, is being hit by this in the wild.

> 
>> Links to Intel Specification updates:
>> HSD131: https://www.intel.com/content/www/us/en/products/docs/processors/core/4th-gen-core-family-desktop-specification-update.html
>> HSM142: https://www.intel.com/content/www/us/en/products/docs/processors/core/4th-gen-core-family-mobile-specification-update.html
>> HSW131: https://www.intel.com/content/www/us/en/processors/xeon/xeon-e3-1200v3-spec-update.html
>> BDM48: https://www.intel.com/content/www/us/en/products/docs/processors/core/5th-gen-core-family-spec-update.html
> 
> Those links tend to get stale with time. If you really want to refer to
> the PDFs, add a new bugzilla entry on https://bugzilla.kernel.org/, add
> them there as an attachment and add the link to the entry to the commit
> message.
> 
>> Signed-off-by: Alexander Krupp <centos@akr.yagii.de>
> 
> What's that Signed-off-by: tag supposed to mean?
> 
>> Signed-off-by: Prarit Bhargava <prarit@redhat.com>
>> Cc: Tony Luck <tony.luck@intel.com>
>> Cc: Borislav Petkov <bp@alien8.de>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: x86@kernel.org
>> Cc: linux-edac@vger.kernel.org
>> ---
>>  arch/x86/kernel/cpu/mce/core.c | 21 +++++++++++++++++++++
>>  1 file changed, 21 insertions(+)
> 
> If at all, this should be done by adding an intel_filter_mce() function
> and called from filter_mce() so that such errors don't get logged.

I'll take a look over there.

P.

> 
> Thx.
>
Borislav Petkov Feb. 6, 2020, 1:03 p.m. UTC | #3
On Thu, Feb 06, 2020 at 07:53:34AM -0500, Prarit Bhargava wrote:
> Yeah :/  I couldn't think of a better one.  Maybe "Block spurious corrected
> errors on some Intel processors"?  Any other suggestion?

"Do not log ..."

> Alexander, cc'd, is being hit by this in the wild.

Do say that in the commit message.

> >> Signed-off-by: Alexander Krupp <centos@akr.yagii.de>
> > 
> > What's that Signed-off-by: tag supposed to mean?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You missed this one.
Prarit Bhargava Feb. 6, 2020, 1:05 p.m. UTC | #4
On 2/6/20 7:53 AM, Prarit Bhargava wrote:
> 
> 
> On 2/6/20 6:10 AM, Borislav Petkov wrote:
>> On Wed, Feb 05, 2020 at 07:58:31AM -0500, Prarit Bhargava wrote:
>>
>>> Subject: Re: [PATCH] x86/mce: Enable HSD131, HSM142, HSW131, BDM48, and HSM142
>>
>> That subject is unreadable for humans.
> 
> Yeah :/  I couldn't think of a better one.  Maybe "Block spurious corrected
> errors on some Intel processors"?  Any other suggestion?
> 
>>
>>> Intel Errata HSD131, HSM142, HSW131, and BDM48 report that
>>> "spurious corrected errors may be logged in the IA32_MC0_STATUS register
>>> with the valid field (bit 63) set, the uncorrected error field (bit 61)
>>> not set, a Model Specific Error Code (bits [31:16]) of 0x000F, and
>>> an MCA Error Code (bits [15:0]) of 0x0005."
>>>
>>> Block these spurious errors from the console and logs.
>>
>> Are they being hit in the wild or why do we need this?
> 
> Alexander, cc'd, is being hit by this in the wild.
> 
>>
>>> Links to Intel Specification updates:
>>> HSD131: https://www.intel.com/content/www/us/en/products/docs/processors/core/4th-gen-core-family-desktop-specification-update.html
>>> HSM142: https://www.intel.com/content/www/us/en/products/docs/processors/core/4th-gen-core-family-mobile-specification-update.html
>>> HSW131: https://www.intel.com/content/www/us/en/processors/xeon/xeon-e3-1200v3-spec-update.html
>>> BDM48: https://www.intel.com/content/www/us/en/products/docs/processors/core/5th-gen-core-family-spec-update.html
>>
>> Those links tend to get stale with time. If you really want to refer to
>> the PDFs, add a new bugzilla entry on https://bugzilla.kernel.org/, add
>> them there as an attachment and add the link to the entry to the commit
>> message.
>>
>>> Signed-off-by: Alexander Krupp <centos@akr.yagii.de>
>>
>> What's that Signed-off-by: tag supposed to mean?

Sorry.  I missed this question, but I really don't understand the question.
Alexander posted a patch in a kernel bugzilla @ Red Hat and I modified the patch
with some additional changes.  I don't want him to lose credit for the work so
he's got a proper Signed-off-by tag for this patch.

P.
Borislav Petkov Feb. 6, 2020, 2:04 p.m. UTC | #5
On Thu, Feb 06, 2020 at 08:05:24AM -0500, Prarit Bhargava wrote:
> Sorry.  I missed this question, but I really don't understand the question.
> Alexander posted a patch in a kernel bugzilla @ Red Hat and I modified the patch
> with some additional changes.  I don't want him to lose credit for the work so
> he's got a proper Signed-off-by tag for this patch.

This is not how this is expressed. Either you write that in free text in
the commit message or you use Co-developed-by. More details in

Documentation/process/submitting-patches.rst

Patch

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 2c4f949611e4..d893cc764a06 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -121,6 +121,8 @@  static struct irq_work mce_irq_work;
 
 static void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
 
+static int (*quirk_noprint)(struct mce *m);
+
 /*
  * CPU/chipset specific EDAC code can register a notifier call here to print
  * MCE errors in a human-readable form.
@@ -232,6 +234,9 @@  struct mca_msr_regs msr_ops = {
 
 static void __print_mce(struct mce *m)
 {
+	if (quirk_noprint && quirk_noprint(m))
+		return;
+
 	pr_emerg(HW_ERR "CPU %d: Machine Check%s: %Lx Bank %d: %016Lx\n",
 		 m->extcpu,
 		 (m->mcgstatus & MCG_STATUS_MCIP ? " Exception" : ""),
@@ -1622,6 +1627,15 @@  static void quirk_sandybridge_ifu(int bank, struct mce *m, struct pt_regs *regs)
 	m->cs = regs->cs;
 }
 
+static int quirk_spurious_ce_noprint(struct mce *m)
+{
+	if (m->bank == 0 &&
+	    (m->status & 0xa0000000ffffffff) == 0x80000000000f0005)
+		return 1;
+
+	return 0;
+}
+
 /* Add per CPU specific workarounds here */
 static int __mcheck_cpu_apply_quirks(struct cpuinfo_x86 *c)
 {
@@ -1696,6 +1710,13 @@  static int __mcheck_cpu_apply_quirks(struct cpuinfo_x86 *c)
 
 		if (c->x86 == 6 && c->x86_model == 45)
 			quirk_no_way_out = quirk_sandybridge_ifu;
+
+		if ((c->x86 == 6) &&
+		    ((c->x86_model == 0x3c) || (c->x86_model == 0x3d) ||
+		     (c->x86_model == 0x45) || (c->x86_model == 0x46))) {
+			pr_info("MCE errata HSD131, HSM142, HSW131, or BDM48 enabled.\n");
+			quirk_noprint = quirk_spurious_ce_noprint;
+		}
 	}
 
 	if (c->x86_vendor == X86_VENDOR_ZHAOXIN) {