diff mbox

acpi, nfit: fix the memory error check in nfit_handle_mce

Message ID 20170420221818.23522-1-vishal.l.verma@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Verma, Vishal L April 20, 2017, 10:18 p.m. UTC
The check for an MCE being a memory error in the NFIT mce handler was
bogus. Fix it to check for the correct MCA status compound error code.

Reported-by: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
 drivers/acpi/nfit/mce.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Verma, Vishal L April 20, 2017, 10:21 p.m. UTC | #1
On Thu, 2017-04-20 at 16:18 -0600, Vishal Verma wrote:
> The check for an MCE being a memory error in the NFIT mce handler was

> bogus. Fix it to check for the correct MCA status compound error code.

> 

> Reported-by: Tony Luck <tony.luck@intel.com>

> Cc: <stable@vger.kernel.org>


Forgot to include,
Fixes: 6839a6d96f4e nfit: do an ARS scrub on hitting a latent media error

> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>

> ---

>  drivers/acpi/nfit/mce.c | 2 +-

>  1 file changed, 1 insertion(+), 1 deletion(-)

> 

> diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c

> index 3ba1c34..23e12a0 100644

> --- a/drivers/acpi/nfit/mce.c

> +++ b/drivers/acpi/nfit/mce.c

> @@ -26,7 +26,7 @@ static int nfit_handle_mce(struct notifier_block

> *nb, unsigned long val,

>  	struct nfit_spa *nfit_spa;

>  

>  	/* We only care about memory errors */

> -	if (!(mce->status & MCACOD))

> +	if (!(mce->status & 0xef80) == BIT(7))

>  		return NOTIFY_DONE;

>  

>  	/*
kernel test robot April 21, 2017, 2:21 a.m. UTC | #2
Hi Vishal,

[auto build test WARNING on pm/linux-next]
[also build test WARNING on v4.11-rc7 next-20170420]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Vishal-Verma/acpi-nfit-fix-the-memory-error-check-in-nfit_handle_mce/20170421-084359
base:   https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
config: x86_64-randconfig-x005-201716 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   drivers/acpi/nfit/mce.c: In function 'nfit_handle_mce':
>> drivers/acpi/nfit/mce.c:29:30: warning: comparison of constant '128ul' with boolean expression is always false [-Wbool-compare]
     if (!(mce->status & 0xef80) == BIT(7))
                                 ^~
>> drivers/acpi/nfit/mce.c:29:30: warning: logical not is only applied to the left hand side of comparison [-Wlogical-not-parentheses]

vim +/128ul +29 drivers/acpi/nfit/mce.c

    13	 * General Public License for more details.
    14	 */
    15	#include <linux/notifier.h>
    16	#include <linux/acpi.h>
    17	#include <linux/nd.h>
    18	#include <asm/mce.h>
    19	#include "nfit.h"
    20	
    21	static int nfit_handle_mce(struct notifier_block *nb, unsigned long val,
    22				void *data)
    23	{
    24		struct mce *mce = (struct mce *)data;
    25		struct acpi_nfit_desc *acpi_desc;
    26		struct nfit_spa *nfit_spa;
    27	
    28		/* We only care about memory errors */
  > 29		if (!(mce->status & 0xef80) == BIT(7))
    30			return NOTIFY_DONE;
    31	
    32		/*
    33		 * mce->addr contains the physical addr accessed that caused the
    34		 * machine check. We need to walk through the list of NFITs, and see
    35		 * if any of them matches that address, and only then start a scrub.
    36		 */
    37		mutex_lock(&acpi_desc_lock);

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Dan Williams April 21, 2017, 7:21 p.m. UTC | #3
On Thu, Apr 20, 2017 at 3:18 PM, Vishal Verma <vishal.l.verma@intel.com> wrote:
> The check for an MCE being a memory error in the NFIT mce handler was
> bogus. Fix it to check for the correct MCA status compound error code.
>
> Reported-by: Tony Luck <tony.luck@intel.com>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
> ---
>  drivers/acpi/nfit/mce.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
> index 3ba1c34..23e12a0 100644
> --- a/drivers/acpi/nfit/mce.c
> +++ b/drivers/acpi/nfit/mce.c
> @@ -26,7 +26,7 @@ static int nfit_handle_mce(struct notifier_block *nb, unsigned long val,
>         struct nfit_spa *nfit_spa;
>
>         /* We only care about memory errors */
> -       if (!(mce->status & MCACOD))
> +       if (!(mce->status & 0xef80) == BIT(7))

Can we get a define for this, or a comment explaining all the magic
that's happening on that one line?
Verma, Vishal L April 21, 2017, 7:56 p.m. UTC | #4
On Fri, 2017-04-21 at 12:21 -0700, Dan Williams wrote:
> On Thu, Apr 20, 2017 at 3:18 PM, Vishal Verma <vishal.l.verma@intel.co

> m> wrote:

> > The check for an MCE being a memory error in the NFIT mce handler

> > was

> > bogus. Fix it to check for the correct MCA status compound error

> > code.

> > 

> > Reported-by: Tony Luck <tony.luck@intel.com>

> > Cc: <stable@vger.kernel.org>

> > Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>

> > ---

> >  drivers/acpi/nfit/mce.c | 2 +-

> >  1 file changed, 1 insertion(+), 1 deletion(-)

> > 

> > diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c

> > index 3ba1c34..23e12a0 100644

> > --- a/drivers/acpi/nfit/mce.c

> > +++ b/drivers/acpi/nfit/mce.c

> > @@ -26,7 +26,7 @@ static int nfit_handle_mce(struct notifier_block

> > *nb, unsigned long val,

> >         struct nfit_spa *nfit_spa;

> > 

> >         /* We only care about memory errors */

> > -       if (!(mce->status & MCACOD))

> > +       if (!(mce->status & 0xef80) == BIT(7))

> 

> Can we get a define for this, or a comment explaining all the magic

> that's happening on that one line?


Yes - also like lkp pointed out, the check isn't correct at all. Let me
figure out what really needs to be done, and I will resend with a better
comment.
Luck, Tony April 21, 2017, 8:16 p.m. UTC | #5
>> > +       if (!(mce->status & 0xef80) == BIT(7))

>> 

>> Can we get a define for this, or a comment explaining all the magic

>> that's happening on that one line?

>

> Yes - also like lkp pointed out, the check isn't correct at all. Let me

> figure out what really needs to be done, and I will resend with a better

> comment. 


Needs extra parentheses to make it right. Vishal, sorry I led you astray.

	if (!((mce->status & 0xef80) == BIT(7)))

The magic is shown in table 15-9 of the Intel Software Developers Manual
(but perhaps not well explained there).

mce->status in the above code is a value plucked from a machine check
bank status register. See figure 15-6 in the SDM.  The important bits for this
are {15:0} which are the "MCA Error code".  Table 15-9 shows how these
are grouped into types, where the type is defined by the most significant '1'
bit in the field (excluding bit 12 which is the Correction Report Filtering bit,
see section 15.9.2.1).

So if BIT(3) is the most significant bit, the this is a "Generic Cache Hierarchy"
error, BIT(4) denotes a TLB error, BIT(7) a Memory error, and so on.

Maybe we should have defines in mce.h for them?  It gets a bit more complicated
as all the above only applies to Intel branded X86 CPUs ... on AMD different
decoding rules apply.

-Tony
Dan Williams April 21, 2017, 8:19 p.m. UTC | #6
On Fri, Apr 21, 2017 at 1:16 PM, Luck, Tony <tony.luck@intel.com> wrote:
>>> > +       if (!(mce->status & 0xef80) == BIT(7))
>>>
>>> Can we get a define for this, or a comment explaining all the magic
>>> that's happening on that one line?
>>
>> Yes - also like lkp pointed out, the check isn't correct at all. Let me
>> figure out what really needs to be done, and I will resend with a better
>> comment.
>
> Needs extra parentheses to make it right. Vishal, sorry I led you astray.
>
>         if (!((mce->status & 0xef80) == BIT(7)))
>
> The magic is shown in table 15-9 of the Intel Software Developers Manual
> (but perhaps not well explained there).
>
> mce->status in the above code is a value plucked from a machine check
> bank status register. See figure 15-6 in the SDM.  The important bits for this
> are {15:0} which are the "MCA Error code".  Table 15-9 shows how these
> are grouped into types, where the type is defined by the most significant '1'
> bit in the field (excluding bit 12 which is the Correction Report Filtering bit,
> see section 15.9.2.1).
>
> So if BIT(3) is the most significant bit, the this is a "Generic Cache Hierarchy"
> error, BIT(4) denotes a TLB error, BIT(7) a Memory error, and so on.

Ah, ok.

> Maybe we should have defines in mce.h for them?  It gets a bit more complicated
> as all the above only applies to Intel branded X86 CPUs ... on AMD different
> decoding rules apply.

Yeah, this code is x86_64 generic so should call into helpers that do
the right thing per cpu type.
Luck, Tony April 21, 2017, 8:27 p.m. UTC | #7
On Fri, Apr 21, 2017 at 01:19:16PM -0700, Dan Williams wrote:
> On Fri, Apr 21, 2017 at 1:16 PM, Luck, Tony <tony.luck@intel.com> wrote:
> >>> > +       if (!(mce->status & 0xef80) == BIT(7))
> >>>
> >>> Can we get a define for this, or a comment explaining all the magic
> >>> that's happening on that one line?
> >>
> >> Yes - also like lkp pointed out, the check isn't correct at all. Let me
> >> figure out what really needs to be done, and I will resend with a better
> >> comment.
> >
> > Needs extra parentheses to make it right. Vishal, sorry I led you astray.
> >
> >         if (!((mce->status & 0xef80) == BIT(7)))
> >
> > The magic is shown in table 15-9 of the Intel Software Developers Manual
> > (but perhaps not well explained there).
> >
> > mce->status in the above code is a value plucked from a machine check
> > bank status register. See figure 15-6 in the SDM.  The important bits for this
> > are {15:0} which are the "MCA Error code".  Table 15-9 shows how these
> > are grouped into types, where the type is defined by the most significant '1'
> > bit in the field (excluding bit 12 which is the Correction Report Filtering bit,
> > see section 15.9.2.1).
> >
> > So if BIT(3) is the most significant bit, the this is a "Generic Cache Hierarchy"
> > error, BIT(4) denotes a TLB error, BIT(7) a Memory error, and so on.
> 
> Ah, ok.
> 
> > Maybe we should have defines in mce.h for them?  It gets a bit more complicated
> > as all the above only applies to Intel branded X86 CPUs ... on AMD different
> > decoding rules apply.
> 
> Yeah, this code is x86_64 generic so should call into helpers that do
> the right thing per cpu type.

Boris: you coded up a "static bool memory_error(struct mce *m)"
function inside the patches for the corrected error thingy.

Perhaps when it goes upstream it should be available for other
users too?

-Tony
Verma, Vishal L April 21, 2017, 8:35 p.m. UTC | #8
On 04/21, Luck, Tony wrote:
> >> > +       if (!(mce->status & 0xef80) == BIT(7))
> >> 
> >> Can we get a define for this, or a comment explaining all the magic
> >> that's happening on that one line?
> >
> > Yes - also like lkp pointed out, the check isn't correct at all. Let me
> > figure out what really needs to be done, and I will resend with a better
> > comment. 
> 
> Needs extra parentheses to make it right. Vishal, sorry I led you astray.
> 
> 	if (!((mce->status & 0xef80) == BIT(7)))

Is this still right though? Anything AND'ed with 0xef80 will never equal
BIT(7) which is simply 01000000 binary (the lowest byte of the left hand
side is '0')

> 
> The magic is shown in table 15-9 of the Intel Software Developers Manual
> (but perhaps not well explained there).
> 
> mce->status in the above code is a value plucked from a machine check
> bank status register. See figure 15-6 in the SDM.  The important bits for this
> are {15:0} which are the "MCA Error code".  Table 15-9 shows how these
> are grouped into types, where the type is defined by the most significant '1'
> bit in the field (excluding bit 12 which is the Correction Report Filtering bit,
> see section 15.9.2.1).
> 
> So if BIT(3) is the most significant bit, the this is a "Generic Cache Hierarchy"
> error, BIT(4) denotes a TLB error, BIT(7) a Memory error, and so on.
> 
> Maybe we should have defines in mce.h for them?  It gets a bit more complicated
> as all the above only applies to Intel branded X86 CPUs ... on AMD different
> decoding rules apply.
> 
> -Tony
> 
>
Luck, Tony April 21, 2017, 8:50 p.m. UTC | #9
On Fri, Apr 21, 2017 at 02:35:51PM -0600, Vishal Verma wrote:
> On 04/21, Luck, Tony wrote:
> > Needs extra parentheses to make it right. Vishal, sorry I led you astray.
> > 
> > 	if (!((mce->status & 0xef80) == BIT(7)))
> 
> Is this still right though? Anything AND'ed with 0xef80 will never equal
> BIT(7) which is simply 01000000 binary (the lowest byte of the left hand
> side is '0')

I think so ... here it is in binary

ef80 = 1110 1111 1000 0000
BIT7 = 0000 0000 1000 0000

so the "&" will zap bits {6:0} and bit {12}  [and everything not part
of the MCACOD field].

If mce->status had some bit above BIT(7) set, it won't be zapped, so we
won't match the exact value BIT(7).

-Tony
Verma, Vishal L April 21, 2017, 8:54 p.m. UTC | #10
On 04/21, Luck, Tony wrote:
> On Fri, Apr 21, 2017 at 02:35:51PM -0600, Vishal Verma wrote:
> > On 04/21, Luck, Tony wrote:
> > > Needs extra parentheses to make it right. Vishal, sorry I led you astray.
> > > 
> > > 	if (!((mce->status & 0xef80) == BIT(7)))
> > 
> > Is this still right though? Anything AND'ed with 0xef80 will never equal
> > BIT(7) which is simply 01000000 binary (the lowest byte of the left hand
> > side is '0')
> 
> I think so ... here it is in binary
> 
> ef80 = 1110 1111 1000 0000
> BIT7 = 0000 0000 1000 0000
> 
> so the "&" will zap bits {6:0} and bit {12}  [and everything not part
> of the MCACOD field].
> 
> If mce->status had some bit above BIT(7) set, it won't be zapped, so we
> won't match the exact value BIT(7).

Ah, you're right, I was off by one, taking BIT(7) to mean 0100 0000

> 
> -Tony
Borislav Petkov April 21, 2017, 9:07 p.m. UTC | #11
On Fri, Apr 21, 2017 at 01:27:41PM -0700, Luck, Tony wrote:
> Boris: you coded up a "static bool memory_error(struct mce *m)"
> function inside the patches for the corrected error thingy.
> 
> Perhaps when it goes upstream it should be available for other
> users too?

I don't see why not. struct mce.cpuvendor even has the vendor in there
so memory_error() wouldn't even have to look at boot_cpu_data when doing
per-vendor decision.

I guess we should rename it to something more global namespace-y like
"mce_is_memory_error() or so, though, before we expose it to wider
audience...
diff mbox

Patch

diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
index 3ba1c34..23e12a0 100644
--- a/drivers/acpi/nfit/mce.c
+++ b/drivers/acpi/nfit/mce.c
@@ -26,7 +26,7 @@  static int nfit_handle_mce(struct notifier_block *nb, unsigned long val,
 	struct nfit_spa *nfit_spa;
 
 	/* We only care about memory errors */
-	if (!(mce->status & MCACOD))
+	if (!(mce->status & 0xef80) == BIT(7))
 		return NOTIFY_DONE;
 
 	/*