Message ID | 20170420221818.23522-1-vishal.l.verma@intel.com (mailing list archive) |
---|---|
State | Not Applicable, archived |
Headers | show |
On Thu, 2017-04-20 at 16:18 -0600, Vishal Verma wrote: > The check for an MCE being a memory error in the NFIT mce handler was > bogus. Fix it to check for the correct MCA status compound error code. > > Reported-by: Tony Luck <tony.luck@intel.com> > Cc: <stable@vger.kernel.org> Forgot to include, Fixes: 6839a6d96f4e nfit: do an ARS scrub on hitting a latent media error > Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> > --- > drivers/acpi/nfit/mce.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c > index 3ba1c34..23e12a0 100644 > --- a/drivers/acpi/nfit/mce.c > +++ b/drivers/acpi/nfit/mce.c > @@ -26,7 +26,7 @@ static int nfit_handle_mce(struct notifier_block > *nb, unsigned long val, > struct nfit_spa *nfit_spa; > > /* We only care about memory errors */ > - if (!(mce->status & MCACOD)) > + if (!(mce->status & 0xef80) == BIT(7)) > return NOTIFY_DONE; > > /*
Hi Vishal, [auto build test WARNING on pm/linux-next] [also build test WARNING on v4.11-rc7 next-20170420] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Vishal-Verma/acpi-nfit-fix-the-memory-error-check-in-nfit_handle_mce/20170421-084359 base: https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next config: x86_64-randconfig-x005-201716 (attached as .config) compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901 reproduce: # save the attached .config to linux build tree make ARCH=x86_64 All warnings (new ones prefixed by >>): drivers/acpi/nfit/mce.c: In function 'nfit_handle_mce': >> drivers/acpi/nfit/mce.c:29:30: warning: comparison of constant '128ul' with boolean expression is always false [-Wbool-compare] if (!(mce->status & 0xef80) == BIT(7)) ^~ >> drivers/acpi/nfit/mce.c:29:30: warning: logical not is only applied to the left hand side of comparison [-Wlogical-not-parentheses] vim +/128ul +29 drivers/acpi/nfit/mce.c 13 * General Public License for more details. 14 */ 15 #include <linux/notifier.h> 16 #include <linux/acpi.h> 17 #include <linux/nd.h> 18 #include <asm/mce.h> 19 #include "nfit.h" 20 21 static int nfit_handle_mce(struct notifier_block *nb, unsigned long val, 22 void *data) 23 { 24 struct mce *mce = (struct mce *)data; 25 struct acpi_nfit_desc *acpi_desc; 26 struct nfit_spa *nfit_spa; 27 28 /* We only care about memory errors */ > 29 if (!(mce->status & 0xef80) == BIT(7)) 30 return NOTIFY_DONE; 31 32 /* 33 * mce->addr contains the physical addr accessed that caused the 34 * machine check. We need to walk through the list of NFITs, and see 35 * if any of them matches that address, and only then start a scrub. 36 */ 37 mutex_lock(&acpi_desc_lock); --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
On Thu, Apr 20, 2017 at 3:18 PM, Vishal Verma <vishal.l.verma@intel.com> wrote: > The check for an MCE being a memory error in the NFIT mce handler was > bogus. Fix it to check for the correct MCA status compound error code. > > Reported-by: Tony Luck <tony.luck@intel.com> > Cc: <stable@vger.kernel.org> > Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> > --- > drivers/acpi/nfit/mce.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c > index 3ba1c34..23e12a0 100644 > --- a/drivers/acpi/nfit/mce.c > +++ b/drivers/acpi/nfit/mce.c > @@ -26,7 +26,7 @@ static int nfit_handle_mce(struct notifier_block *nb, unsigned long val, > struct nfit_spa *nfit_spa; > > /* We only care about memory errors */ > - if (!(mce->status & MCACOD)) > + if (!(mce->status & 0xef80) == BIT(7)) Can we get a define for this, or a comment explaining all the magic that's happening on that one line? -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2017-04-21 at 12:21 -0700, Dan Williams wrote: > On Thu, Apr 20, 2017 at 3:18 PM, Vishal Verma <vishal.l.verma@intel.co > m> wrote: > > The check for an MCE being a memory error in the NFIT mce handler > > was > > bogus. Fix it to check for the correct MCA status compound error > > code. > > > > Reported-by: Tony Luck <tony.luck@intel.com> > > Cc: <stable@vger.kernel.org> > > Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> > > --- > > drivers/acpi/nfit/mce.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c > > index 3ba1c34..23e12a0 100644 > > --- a/drivers/acpi/nfit/mce.c > > +++ b/drivers/acpi/nfit/mce.c > > @@ -26,7 +26,7 @@ static int nfit_handle_mce(struct notifier_block > > *nb, unsigned long val, > > struct nfit_spa *nfit_spa; > > > > /* We only care about memory errors */ > > - if (!(mce->status & MCACOD)) > > + if (!(mce->status & 0xef80) == BIT(7)) > > Can we get a define for this, or a comment explaining all the magic > that's happening on that one line? Yes - also like lkp pointed out, the check isn't correct at all. Let me figure out what really needs to be done, and I will resend with a better comment.
>> > + if (!(mce->status & 0xef80) == BIT(7)) >> >> Can we get a define for this, or a comment explaining all the magic >> that's happening on that one line? > > Yes - also like lkp pointed out, the check isn't correct at all. Let me > figure out what really needs to be done, and I will resend with a better > comment. Needs extra parentheses to make it right. Vishal, sorry I led you astray. if (!((mce->status & 0xef80) == BIT(7))) The magic is shown in table 15-9 of the Intel Software Developers Manual (but perhaps not well explained there). mce->status in the above code is a value plucked from a machine check bank status register. See figure 15-6 in the SDM. The important bits for this are {15:0} which are the "MCA Error code". Table 15-9 shows how these are grouped into types, where the type is defined by the most significant '1' bit in the field (excluding bit 12 which is the Correction Report Filtering bit, see section 15.9.2.1). So if BIT(3) is the most significant bit, the this is a "Generic Cache Hierarchy" error, BIT(4) denotes a TLB error, BIT(7) a Memory error, and so on. Maybe we should have defines in mce.h for them? It gets a bit more complicated as all the above only applies to Intel branded X86 CPUs ... on AMD different decoding rules apply. -Tony
On Fri, Apr 21, 2017 at 1:16 PM, Luck, Tony <tony.luck@intel.com> wrote: >>> > + if (!(mce->status & 0xef80) == BIT(7)) >>> >>> Can we get a define for this, or a comment explaining all the magic >>> that's happening on that one line? >> >> Yes - also like lkp pointed out, the check isn't correct at all. Let me >> figure out what really needs to be done, and I will resend with a better >> comment. > > Needs extra parentheses to make it right. Vishal, sorry I led you astray. > > if (!((mce->status & 0xef80) == BIT(7))) > > The magic is shown in table 15-9 of the Intel Software Developers Manual > (but perhaps not well explained there). > > mce->status in the above code is a value plucked from a machine check > bank status register. See figure 15-6 in the SDM. The important bits for this > are {15:0} which are the "MCA Error code". Table 15-9 shows how these > are grouped into types, where the type is defined by the most significant '1' > bit in the field (excluding bit 12 which is the Correction Report Filtering bit, > see section 15.9.2.1). > > So if BIT(3) is the most significant bit, the this is a "Generic Cache Hierarchy" > error, BIT(4) denotes a TLB error, BIT(7) a Memory error, and so on. Ah, ok. > Maybe we should have defines in mce.h for them? It gets a bit more complicated > as all the above only applies to Intel branded X86 CPUs ... on AMD different > decoding rules apply. Yeah, this code is x86_64 generic so should call into helpers that do the right thing per cpu type. -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Apr 21, 2017 at 01:19:16PM -0700, Dan Williams wrote: > On Fri, Apr 21, 2017 at 1:16 PM, Luck, Tony <tony.luck@intel.com> wrote: > >>> > + if (!(mce->status & 0xef80) == BIT(7)) > >>> > >>> Can we get a define for this, or a comment explaining all the magic > >>> that's happening on that one line? > >> > >> Yes - also like lkp pointed out, the check isn't correct at all. Let me > >> figure out what really needs to be done, and I will resend with a better > >> comment. > > > > Needs extra parentheses to make it right. Vishal, sorry I led you astray. > > > > if (!((mce->status & 0xef80) == BIT(7))) > > > > The magic is shown in table 15-9 of the Intel Software Developers Manual > > (but perhaps not well explained there). > > > > mce->status in the above code is a value plucked from a machine check > > bank status register. See figure 15-6 in the SDM. The important bits for this > > are {15:0} which are the "MCA Error code". Table 15-9 shows how these > > are grouped into types, where the type is defined by the most significant '1' > > bit in the field (excluding bit 12 which is the Correction Report Filtering bit, > > see section 15.9.2.1). > > > > So if BIT(3) is the most significant bit, the this is a "Generic Cache Hierarchy" > > error, BIT(4) denotes a TLB error, BIT(7) a Memory error, and so on. > > Ah, ok. > > > Maybe we should have defines in mce.h for them? It gets a bit more complicated > > as all the above only applies to Intel branded X86 CPUs ... on AMD different > > decoding rules apply. > > Yeah, this code is x86_64 generic so should call into helpers that do > the right thing per cpu type. Boris: you coded up a "static bool memory_error(struct mce *m)" function inside the patches for the corrected error thingy. Perhaps when it goes upstream it should be available for other users too? -Tony -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 04/21, Luck, Tony wrote: > >> > + if (!(mce->status & 0xef80) == BIT(7)) > >> > >> Can we get a define for this, or a comment explaining all the magic > >> that's happening on that one line? > > > > Yes - also like lkp pointed out, the check isn't correct at all. Let me > > figure out what really needs to be done, and I will resend with a better > > comment. > > Needs extra parentheses to make it right. Vishal, sorry I led you astray. > > if (!((mce->status & 0xef80) == BIT(7))) Is this still right though? Anything AND'ed with 0xef80 will never equal BIT(7) which is simply 01000000 binary (the lowest byte of the left hand side is '0') > > The magic is shown in table 15-9 of the Intel Software Developers Manual > (but perhaps not well explained there). > > mce->status in the above code is a value plucked from a machine check > bank status register. See figure 15-6 in the SDM. The important bits for this > are {15:0} which are the "MCA Error code". Table 15-9 shows how these > are grouped into types, where the type is defined by the most significant '1' > bit in the field (excluding bit 12 which is the Correction Report Filtering bit, > see section 15.9.2.1). > > So if BIT(3) is the most significant bit, the this is a "Generic Cache Hierarchy" > error, BIT(4) denotes a TLB error, BIT(7) a Memory error, and so on. > > Maybe we should have defines in mce.h for them? It gets a bit more complicated > as all the above only applies to Intel branded X86 CPUs ... on AMD different > decoding rules apply. > > -Tony > > -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Apr 21, 2017 at 02:35:51PM -0600, Vishal Verma wrote: > On 04/21, Luck, Tony wrote: > > Needs extra parentheses to make it right. Vishal, sorry I led you astray. > > > > if (!((mce->status & 0xef80) == BIT(7))) > > Is this still right though? Anything AND'ed with 0xef80 will never equal > BIT(7) which is simply 01000000 binary (the lowest byte of the left hand > side is '0') I think so ... here it is in binary ef80 = 1110 1111 1000 0000 BIT7 = 0000 0000 1000 0000 so the "&" will zap bits {6:0} and bit {12} [and everything not part of the MCACOD field]. If mce->status had some bit above BIT(7) set, it won't be zapped, so we won't match the exact value BIT(7). -Tony -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 04/21, Luck, Tony wrote: > On Fri, Apr 21, 2017 at 02:35:51PM -0600, Vishal Verma wrote: > > On 04/21, Luck, Tony wrote: > > > Needs extra parentheses to make it right. Vishal, sorry I led you astray. > > > > > > if (!((mce->status & 0xef80) == BIT(7))) > > > > Is this still right though? Anything AND'ed with 0xef80 will never equal > > BIT(7) which is simply 01000000 binary (the lowest byte of the left hand > > side is '0') > > I think so ... here it is in binary > > ef80 = 1110 1111 1000 0000 > BIT7 = 0000 0000 1000 0000 > > so the "&" will zap bits {6:0} and bit {12} [and everything not part > of the MCACOD field]. > > If mce->status had some bit above BIT(7) set, it won't be zapped, so we > won't match the exact value BIT(7). Ah, you're right, I was off by one, taking BIT(7) to mean 0100 0000 > > -Tony -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Apr 21, 2017 at 01:27:41PM -0700, Luck, Tony wrote: > Boris: you coded up a "static bool memory_error(struct mce *m)" > function inside the patches for the corrected error thingy. > > Perhaps when it goes upstream it should be available for other > users too? I don't see why not. struct mce.cpuvendor even has the vendor in there so memory_error() wouldn't even have to look at boot_cpu_data when doing per-vendor decision. I guess we should rename it to something more global namespace-y like "mce_is_memory_error() or so, though, before we expose it to wider audience...
diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c index 3ba1c34..23e12a0 100644 --- a/drivers/acpi/nfit/mce.c +++ b/drivers/acpi/nfit/mce.c @@ -26,7 +26,7 @@ static int nfit_handle_mce(struct notifier_block *nb, unsigned long val, struct nfit_spa *nfit_spa; /* We only care about memory errors */ - if (!(mce->status & MCACOD)) + if (!(mce->status & 0xef80) == BIT(7)) return NOTIFY_DONE; /*
The check for an MCE being a memory error in the NFIT mce handler was bogus. Fix it to check for the correct MCA status compound error code. Reported-by: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> --- drivers/acpi/nfit/mce.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)