diff mbox series

[V2] EDAC/i10nm: shift exponent is negative

Message ID 20230703162509.77828-1-koba.ko@canonical.com (mailing list archive)
State New, archived
Headers show
Series [V2] EDAC/i10nm: shift exponent is negative | expand

Commit Message

Koba Ko July 3, 2023, 4:25 p.m. UTC
UBSAN complains this error
~~~
 UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16
 shift exponent -66 is negative
 Call Trace:
  <TASK>
  dump_stack_lvl+0x48/0x70
  dump_stack+0x10/0x20
  __ubsan_handle_shift_out_of_bounds+0x1ac/0x360
  skx_get_dimm_info.cold+0x91/0x175 [i10nm_edac]
  ? kvasprintf_const+0x2a/0xb0
  i10nm_get_dimm_config+0x23c/0x340 [i10nm_edac]
  skx_register_mci+0x139/0x1e0 [i10nm_edac]
  ? __pfx_i10nm_get_dimm_config+0x10/0x10 [i10nm_edac]
  i10nm_init+0x403/0xd10 [i10nm_edac]
  ? __pfx_i10nm_init+0x10/0x10 [i10nm_edac]
  do_one_initcall+0x5b/0x250
  do_init_module+0x68/0x260
  load_module+0xb45/0xcd0
  ? kernel_read_file+0x2a4/0x320
  __do_sys_finit_module+0xc4/0x140
  ? __do_sys_finit_module+0xc4/0x140
  __x64_sys_finit_module+0x18/0x30
  do_syscall_64+0x58/0x90
  ? syscall_exit_to_user_mode+0x29/0x50
  ? do_syscall_64+0x67/0x90
  ? syscall_exit_to_user_mode+0x29/0x50
  ? do_syscall_64+0x67/0x90
  ? do_syscall_64+0x67/0x90
  ? __flush_smp_call_function_queue+0x122/0x1f0
  ? exit_to_user_mode_prepare+0x30/0xb0
  ? irqentry_exit_to_user_mode+0x9/0x20
  ? irqentry_exit+0x43/0x50
  ? sysvec_call_function+0x4b/0xd0
  entry_SYSCALL_64_after_hwframe+0x72/0xdc
~~~

when get rows, cols and ranks, the returned error value doesn't be
handled.

check the return value is EINVAL, if yes, directly return 0.

Signed-off-by: Koba Ko <koba.ko@canonical.com>
---
V2: make error-print explicitly
---
 drivers/edac/skx_common.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Tony Luck July 3, 2023, 9:51 p.m. UTC | #1
> > UBSAN complains this error
> > ~~~
> >  UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16
> …
> > ~~~
> >
> > when get rows, cols and ranks, the returned error value doesn't be
> > handled.
> >
> > check the return value is EINVAL, if yes, directly return 0.
> …
>
> * Please improve this change description further.

To be specific. Initially you reported this because of the UBSAN error
report. But, after community discussion you now know that the problem
is that one or more of the rows/cols/ranks has a value that the EDAC driver
doesn't expect and probably can handle.

So, in V2, the commit message should start with the information these
values are out of range and mention this was discovered when UBSAN
put out a warning about a negative shift. No need to include the whole
of the UBSAN stack trace.

Then describe the two fixes that this patch includes. One is to change the
edac debug message into a console error message to enable further
debug of this issue. The other is to skip the unrecognized DIMM.

> * How do you think about to add the tag “Fixes”?

This is a good idea.  Use git blame, or dig into the GIT history to
find the commit where this code was introduced (hint .. git blame
says:
88a242c98740 ("EDAC, skx_common: Separate common code out from skx_edac")
but that obviously just refactored code, so you should dig back more into
the history.

> > V2: make error-print explicitly
> > ---
>
> Would you like to avoid a misplaced marker line here?
>
> See also:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.4#n686

That's an excellent resource.

-Tony
Koba Ko July 4, 2023, 1:50 a.m. UTC | #2
On Tue, Jul 4, 2023 at 5:51 AM Luck, Tony <tony.luck@intel.com> wrote:
>
> > > UBSAN complains this error
> > > ~~~
> > >  UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16
> > …
> > > ~~~
> > >
> > > when get rows, cols and ranks, the returned error value doesn't be
> > > handled.
> > >
> > > check the return value is EINVAL, if yes, directly return 0.
> > …
> >
> > * Please improve this change description further.
>
> To be specific. Initially you reported this because of the UBSAN error
> report. But, after community discussion you now know that the problem
> is that one or more of the rows/cols/ranks has a value that the EDAC driver
> doesn't expect and probably can handle.
>
> So, in V2, the commit message should start with the information these
> values are out of range and mention this was discovered when UBSAN
> put out a warning about a negative shift. No need to include the whole
> of the UBSAN stack trace.
>
> Then describe the two fixes that this patch includes. One is to change the
> edac debug message into a console error message to enable further
> debug of this issue. The other is to skip the unrecognized DIMM.
>
> > * How do you think about to add the tag “Fixes”?
>
> This is a good idea.  Use git blame, or dig into the GIT history to
> find the commit where this code was introduced (hint .. git blame
> says:
> 88a242c98740 ("EDAC, skx_common: Separate common code out from skx_edac")
> but that obviously just refactored code, so you should dig back more into
> the history.
>
> > > V2: make error-print explicitly
> > > ---
> >
> > Would you like to avoid a misplaced marker line here?
> >
> > See also:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.4#n686
>
> That's an excellent resource.

Thanks for your advices and I will modify.
here's part of dmesg enabled EDAC_DEBUG
~~~
[    4.032332] EDAC DEBUG: skx_register_mci: MC#1: mci = 00000000799db99e
[    4.032334] EDAC DEBUG: i10nm_get_dimm_config: dimmmtr 0xffffffff
mcddrtcfg 0xffffffff (mc1 ch0 dimm0)
[    4.032335] EDAC DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff)
[    4.032337] EDAC DEBUG: skx_get_dimm_attr: bad rows = 7 (raw=0xffffffff)
[    4.032338] EDAC DEBUG: skx_get_dimm_attr: bad cols = 3 (raw=0xffffffff)
[    4.032339] EDAC DEBUG: i10nm_get_dimm_config: dimmmtr 0xffffffff
mcddrtcfg 0xffffffff (mc1 ch0 dimm1)
[    4.032340] EDAC DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff)
[    4.032341] EDAC DEBUG: skx_get_dimm_attr: bad rows = 7 (raw=0xffffffff)
[    4.032341] EDAC DEBUG: skx_get_dimm_attr: bad cols = 3 (raw=0xffffffff)
[    4.032343] EDAC DEBUG: i10nm_get_dimm_config: dimmmtr 0xffffffff
mcddrtcfg 0xffffffff (mc1 ch1 dimm0)
[    4.032344] EDAC DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff)
[    4.032345] EDAC DEBUG: skx_get_dimm_attr: bad rows = 7 (raw=0xffffffff)
[    4.032346] EDAC DEBUG: skx_get_dimm_attr: bad cols = 3 (raw=0xffffffff)
[    4.032347] EDAC DEBUG: i10nm_get_dimm_config: dimmmtr 0xffffffff
mcddrtcfg 0xffffffff (mc1 ch1 dimm1)
[    4.032348] EDAC DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff)
[    4.032349] EDAC DEBUG: skx_get_dimm_attr: bad rows = 7 (raw=0xffffffff)
[    4.032349] EDAC DEBUG: skx_get_dimm_attr: bad cols = 3 (raw=0xffffffff)
~~~

I shared the whole dmesg through g-drive.
https://drive.google.com/file/d/1epnDZNezGiJsK1eT4UNOi8_igcBSXtiF/view?usp=sharing

>
> -Tony
Koba Ko July 4, 2023, 2:20 a.m. UTC | #3
On Tue, Jul 4, 2023 at 5:51 AM Luck, Tony <tony.luck@intel.com> wrote:
>
> > > UBSAN complains this error
> > > ~~~
> > >  UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16
> > …
> > > ~~~
> > >
> > > when get rows, cols and ranks, the returned error value doesn't be
> > > handled.
> > >
> > > check the return value is EINVAL, if yes, directly return 0.
> > …
> >
> > * Please improve this change description further.
>
> To be specific. Initially you reported this because of the UBSAN error
> report. But, after community discussion you now know that the problem
> is that one or more of the rows/cols/ranks has a value that the EDAC driver
> doesn't expect and probably can handle.
>
> So, in V2, the commit message should start with the information these
> values are out of range and mention this was discovered when UBSAN
> put out a warning about a negative shift. No need to include the whole
> of the UBSAN stack trace.
>
> Then describe the two fixes that this patch includes. One is to change the
> edac debug message into a console error message to enable further
> debug of this issue. The other is to skip the unrecognized DIMM.
>
> > * How do you think about to add the tag “Fixes”?
>
> This is a good idea.  Use git blame, or dig into the GIT history to
> find the commit where this code was introduced (hint .. git blame
> says:
> 88a242c98740 ("EDAC, skx_common: Separate common code out from skx_edac")
> but that obviously just refactored code, so you should dig back more into
> the history.
There are two parts,
1. @get_dimm_attr, edac_dbg was added since e235dd43d8b0f0
2. get num of ranks, rows and cols, 4ec656bdf43a13

Should I add all of them prefixes with "Fixes"?

>
> > > V2: make error-print explicitly
> > > ---
> >
> > Would you like to avoid a misplaced marker line here?
> >
> > See also:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.4#n686
>
> That's an excellent resource.
>
> -Tony
Tony Luck July 4, 2023, 3:14 a.m. UTC | #4
Interesting. This isn’t some odd shaped DIMM.
You are just getting a failed read from the register.
So trying to decode all-ones.

I’ll try to look at the full log you linked when I’m 
on a computer instead of a phone.

> [    4.032335] EDAC DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff)
> [    4.032337] EDAC DEBUG: skx_get_dimm_attr: bad rows = 7 (raw=0xffffffff)
> [    4.032338] EDAC DEBUG: skx_get_dimm_attr: bad cols = 3 (raw=0xffffffff)
Koba Ko July 4, 2023, 4:59 a.m. UTC | #5
As per suggestions, i modified V3.
could you please take a look

Subject: [PATCH][V3] EDAC/i10nm: shift exponent is negative

Because failed to read from DIMM, get the negative value for shift
operation.
`EDAC DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff)
 EDAC DEBUG: skx_get_dimm_attr: bad rows = 7 (raw=0xffffffff)
 EDAC DEBUG: skx_get_dimm_attr: bad cols = 3 (raw=0xffffffff)
 EDAC DEBUG: i10nm_get_dimm_config: dimmmtr 0xffffffff mcddrtcfg
0xffffffff (mc1 ch0 dimm1)`

UBSAN complains this error
`UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16
 shift exponent -66 is negative`

when get rows, cols and ranks, the returned error value doesn't be
handled.

check the return value is EINVAL, if yes, directly return 0 and
show error message explicitly.

Fixes: 4ec656bdf43a13) EDAC, skx_edac: Add EDAC driver for Skylake
Signed-off-by: Koba Ko <koba.ko@canonical.com>
---
V2 -> V3: simplify the summary and add 'Fixes:'
V1 -> V2: make error-print explicitly

On Tue, Jul 4, 2023 at 10:20 AM Koba Ko <koba.ko@canonical.com> wrote:
>
> On Tue, Jul 4, 2023 at 5:51 AM Luck, Tony <tony.luck@intel.com> wrote:
> >
> > > > UBSAN complains this error
> > > > ~~~
> > > >  UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16
> > > …
> > > > ~~~
> > > >
> > > > when get rows, cols and ranks, the returned error value doesn't be
> > > > handled.
> > > >
> > > > check the return value is EINVAL, if yes, directly return 0.
> > > …
> > >
> > > * Please improve this change description further.
> >
> > To be specific. Initially you reported this because of the UBSAN error
> > report. But, after community discussion you now know that the problem
> > is that one or more of the rows/cols/ranks has a value that the EDAC driver
> > doesn't expect and probably can handle.
> >
> > So, in V2, the commit message should start with the information these
> > values are out of range and mention this was discovered when UBSAN
> > put out a warning about a negative shift. No need to include the whole
> > of the UBSAN stack trace.
> >
> > Then describe the two fixes that this patch includes. One is to change the
> > edac debug message into a console error message to enable further
> > debug of this issue. The other is to skip the unrecognized DIMM.
> >
> > > * How do you think about to add the tag “Fixes”?
> >
> > This is a good idea.  Use git blame, or dig into the GIT history to
> > find the commit where this code was introduced (hint .. git blame
> > says:
> > 88a242c98740 ("EDAC, skx_common: Separate common code out from skx_edac")
> > but that obviously just refactored code, so you should dig back more into
> > the history.
> There are two parts,
> 1. @get_dimm_attr, edac_dbg was added since e235dd43d8b0f0
> 2. get num of ranks, rows and cols, 4ec656bdf43a13
>
> Should I add all of them prefixes with "Fixes"?
>
> >
> > > > V2: make error-print explicitly
> > > > ---
> > >
> > > Would you like to avoid a misplaced marker line here?
> > >
> > > See also:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.4#n686
> >
> > That's an excellent resource.
> >
> > -Tony
Koba Ko July 4, 2023, 8:05 a.m. UTC | #6
On Tue, Jul 4, 2023 at 3:16 PM Markus Elfring <Markus.Elfring@web.de> wrote:
>
> > As per suggestions, i modified V3.
> > could you please take a look
> >
> > Subject: [PATCH][V3] EDAC/i10nm: shift exponent is negative
>
> Would you like to put the text “[PATCH v4] EDAC/i10nm: Fix an inappropriate shift exponent”
> into a subsequent patch?

I didn't send V3 so the suggestions could be put in V3.

>
> I would find further wording variants nicer.
>
>
> > Because failed to read from DIMM, get the negative value for shift
> > operation.
>
> A surprising value was determined after a read failure from a DIMM.
>
>
> …
> > UBSAN complains this error
>
> Software analyses pointed a data processing issue out.
>
>
> > `UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16
> >  shift exponent -66 is negative`
> >
> > when get rows, cols and ranks, the returned error value doesn't be
> > handled.
>
> A special value combination could not be handled so far.
>
>
> > check the return value is EINVAL, if yes, directly return 0 and
> > show error message explicitly.
>
> Check if an invalid value was detected by a call of the function “skx_get_dimm_attr”.
>
> * Print a corresponding error message in this case.
>
> * Return zero then directly from the function “skx_get_dimm_info”.
>
>
> …
> @@ -351,6 +351,8 @@ int skx_get_dimm_info(u32 mtr, u32 mcmtr, u32 amap, struct dimm_info *dimm,
>         ranks = numrank(mtr);
>         rows = numrow(mtr);
>         cols = imc->hbm_mc ? 6 : numcol(mtr);
> +       if (ranks == -EINVAL || rows == -EINVAL || cols == -EINVAL)
> +               return 0;
> …
>
>
> Can it be nicer to perform a check for such an error code directly
> after each variable assignment?
> (May this condition check be split?)
>
>
> > Fixes: 4ec656bdf43a13) EDAC, skx_edac: Add EDAC driver for Skylake
>
> Please properly apply parentheses and double quotes for this tag.
>
> See also:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.4#n145
>
>
> > V2 -> V3: simplify the summary and add 'Fixes:'
> > V1 -> V2: make error-print explicitly
>
> How do you think about to use more succinct version identifiers
> for such descriptions?
>
> V4:
> …
>
> V3:
> …
>
>
> Regards,
> Markus
Koba Ko July 4, 2023, 8:17 a.m. UTC | #7
@Markus and all,
please review it, thanks

Subject: [PATCH][V3] EDAC/i10nm: Fix an inappropriate shift exponen

A surprising value was determined after a read failure from a DIMM.

Software analyses pointed a data processing issue out.
`UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16
 shift exponent -66 is negative`

when get rows, cols and ranks, A special value combination could not
be handled so far.

Check if an invalid value was detected by a call of the function
“skx_get_dimm_attr”.
* Print a corresponding error message in this case.
* Return zero then directly from the function “skx_get_dimm_info”.

Fixes: 4ec656bdf43a13 (EDAC, skx_edac: Add EDAC driver for Skylake)
Signed-off-by: Koba Ko <koba.ko@canonical.com>
---
V3: simplify and polish the summary and add `Fixes:`
V2: make error-print explicitly

Thanks
Koba Ko

On Tue, Jul 4, 2023 at 3:16 PM Markus Elfring <Markus.Elfring@web.de> wrote:
>
> > As per suggestions, i modified V3.
> > could you please take a look
> >
> > Subject: [PATCH][V3] EDAC/i10nm: shift exponent is negative
>
> Would you like to put the text “[PATCH v4] EDAC/i10nm: Fix an inappropriate shift exponent”
> into a subsequent patch?
>
> I would find further wording variants nicer.
>
>
> > Because failed to read from DIMM, get the negative value for shift
> > operation.
>
> A surprising value was determined after a read failure from a DIMM.
>
>
> …
> > UBSAN complains this error
>
> Software analyses pointed a data processing issue out.
>
>
> > `UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16
> >  shift exponent -66 is negative`
> >
> > when get rows, cols and ranks, the returned error value doesn't be
> > handled.
>
> A special value combination could not be handled so far.
>
>
> > check the return value is EINVAL, if yes, directly return 0 and
> > show error message explicitly.
>
> Check if an invalid value was detected by a call of the function “skx_get_dimm_attr”.
>
> * Print a corresponding error message in this case.
>
> * Return zero then directly from the function “skx_get_dimm_info”.
>
>
> …
> @@ -351,6 +351,8 @@ int skx_get_dimm_info(u32 mtr, u32 mcmtr, u32 amap, struct dimm_info *dimm,
>         ranks = numrank(mtr);
>         rows = numrow(mtr);
>         cols = imc->hbm_mc ? 6 : numcol(mtr);
> +       if (ranks == -EINVAL || rows == -EINVAL || cols == -EINVAL)
> +               return 0;
> …
>
>
> Can it be nicer to perform a check for such an error code directly
> after each variable assignment?
> (May this condition check be split?)
>
>
> > Fixes: 4ec656bdf43a13) EDAC, skx_edac: Add EDAC driver for Skylake
>
> Please properly apply parentheses and double quotes for this tag.
>
> See also:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.4#n145
>
>
> > V2 -> V3: simplify the summary and add 'Fixes:'
> > V1 -> V2: make error-print explicitly
>
> How do you think about to use more succinct version identifiers
> for such descriptions?
>
> V4:
> …
>
> V3:
> …
>
>
> Regards,
> Markus
Koba Ko July 4, 2023, 8:59 a.m. UTC | #8
On Tue, Jul 4, 2023 at 4:50 PM Markus Elfring <Markus.Elfring@web.de> wrote:
>
> > @Markus and all,
> > please review it, thanks
> >
> > Subject: [PATCH][V3] EDAC/i10nm: Fix an inappropriate shift exponen
>
> * Combined prefixes would be nicer.
Do you mean, [PATCH V3]?
> * Please avoid a typo here.

=> exponent

>
> > when get rows, cols and ranks, A special value combination could not
> > be handled so far.
>
> It seems that you stumble still on wording difficulties.

remove the previous sentence,
A special value combination could not be handled so far.

>
> > Fixes: 4ec656bdf43a13 (EDAC, skx_edac: Add EDAC driver for Skylake)
>
> You overlooked somehow to use double quotes here.

must be
Fixes: 4ec656bdf43a13 ("EDAC, skx_edac: Add EDAC driver for Skylake")

> Please provide the next version of a proper patch.
>
> Regards,
> Markus
Dan Carpenter July 4, 2023, 12:02 p.m. UTC | #9
Here is a better commit message.  You can just copy and paste it.
------------------------------------------
[PATCH v3] EDAC/i10nm: Prevent negative shifts in skx_get_dimm_info().

UBSAN generated the following warning during a timeout:

    UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16
    shift exponent -66 is negative

That most likely means that rows, cols, and ranks were all set to
-EINVAL.  Address this in two ways.

1) Change the debug output in skx_get_dimm_attr() to KERN_ERR so that
   users will know where exactly the error is.
2) Add a check for errors in skx_get_dimm_info().

Fixes: 88a242c98740 ("EDAC, skx_common: Separate common code out from skx_edac")
Signed-off-by:
-----------------------------------------------

> @@ -351,6 +351,8 @@ int skx_get_dimm_info(u32 mtr, u32 mcmtr, u32 amap, struct dimm_info *dimm,
>  	ranks = numrank(mtr);
>  	rows = numrow(mtr);
>  	cols = imc->hbm_mc ? 6 : numcol(mtr);
> +	if (ranks == -EINVAL || rows == -EINVAL || cols == -EINVAL)
> +		return 0;

Change this to:

	if (rangks < 0 || rows < 0 || cols < 0)
		return 0;

It's bad form to check for a specific error code unless there is a need.

regards,
dan carpenter
Koba Ko July 4, 2023, 12:32 p.m. UTC | #10
On Tue, Jul 4, 2023 at 8:02 PM Dan Carpenter <dan.carpenter@linaro.org> wrote:
>
> Here is a better commit message.  You can just copy and paste it.
> ------------------------------------------
> [PATCH v3] EDAC/i10nm: Prevent negative shifts in skx_get_dimm_info().
>
> UBSAN generated the following warning during a timeout:
>
>     UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16
>     shift exponent -66 is negative
>
> That most likely means that rows, cols, and ranks were all set to
> -EINVAL.  Address this in two ways.
>
> 1) Change the debug output in skx_get_dimm_attr() to KERN_ERR so that
>    users will know where exactly the error is.
> 2) Add a check for errors in skx_get_dimm_info().
>
> Fixes: 88a242c98740 ("EDAC, skx_common: Separate common code out from skx_edac")
> Signed-off-by:
> -----------------------------------------------

have sent V3 as per Markus' comments.
https://patchwork.kernel.org/project/linux-edac/patch/20230704095939.119620-1-koba.ko@canonical.com/
Thanks
>
> > @@ -351,6 +351,8 @@ int skx_get_dimm_info(u32 mtr, u32 mcmtr, u32 amap, struct dimm_info *dimm,
> >       ranks = numrank(mtr);
> >       rows = numrow(mtr);
> >       cols = imc->hbm_mc ? 6 : numcol(mtr);
> > +     if (ranks == -EINVAL || rows == -EINVAL || cols == -EINVAL)
> > +             return 0;
>
> Change this to:
>
>         if (rangks < 0 || rows < 0 || cols < 0)
>                 return 0;
>
> It's bad form to check for a specific error code unless there is a need.
>
> regards,
> dan carpenter
>
Tony Luck July 5, 2023, 4:08 a.m. UTC | #11
> I shared the whole dmesg through g-drive.
> https://drive.google.com/file/d/1epnDZNezGiJsK1eT4UNOi8_igcBSXtiF/view?usp=sharing

The EDAC driver was OK for MC#0, but then failed for the other memory controllers.

Can you send some more information about your system. Output from the
following commands:

# head /proc/cpuinfo

# dmidecode -t 17

# lspci

Thanks

-Tony
Koba Ko July 5, 2023, 8:45 a.m. UTC | #12
On Wed, Jul 5, 2023 at 12:08 PM Luck, Tony <tony.luck@intel.com> wrote:
>
> > I shared the whole dmesg through g-drive.
> > https://drive.google.com/file/d/1epnDZNezGiJsK1eT4UNOi8_igcBSXtiF/view?usp=sharing
>
> The EDAC driver was OK for MC#0, but then failed for the other memory controllers.
>
> Can you send some more information about your system. Output from the
> following commands:
>
> # head /proc/cpuinfo
>
> # dmidecode -t 17
>
> # lspci
please check through this url,
https://drive.google.com/drive/folders/199k3BX6IipNYCDfuMGy8W26ZtYRDIYZr?usp=sharing
>
> Thanks
>
> -Tony
Tony Luck July 5, 2023, 3:22 p.m. UTC | #13
>> # head /proc/cpuinfo

This shows your system is the workstation version of Sapphire rapids. I don't
think we did any validation of the EDAC driver against this model.

> # dmidecode -t 17

You have just one 16GB DIMM, and EDAC found that. So despite the messy warnings,
EDAC should be working for you.

> # lspci

I didn't dig into this. Qiuxu - can you compare this against a server Sapphire rapids?
Maybe it has some clues so the EDAC driver will know not to look for non-existent
memory controllers.

> please check through this url,
> https://drive.google.com/drive/folders/199k3BX6IipNYCDfuMGy8W26ZtYRDIYZr?usp=sharing

Thanks for all the details.

-Tony
Zhuo, Qiuxu July 6, 2023, 2:11 p.m. UTC | #14
> From: Luck, Tony <tony.luck@intel.com>
> Sent: Wednesday, July 5, 2023 11:22 PM
> ...
> Subject: RE: [PATCH v2] EDAC/i10nm: shift exponent is negative
> 
> >> # head /proc/cpuinfo
> 
> This shows your system is the workstation version of Sapphire rapids. I don't
> think we did any validation of the EDAC driver against this model.

No, we didn't do any validation of the EDAC on Sapphires Rapids workstations.
From the link below, we know this is a Sapphire Rapids workstation with only 2 memory controllers.
https://www.intel.com/content/www/us/en/products/sku/233480/intel-xeon-w32435-processor-22-5m-cache-3-10-ghz/specifications.html

We only did validation on the Sapphire Rapids servers which were with 4 memory controllers per socket before. 

> > # dmidecode -t 17
> 
> You have just one 16GB DIMM, and EDAC found that. So despite the messy
> warnings, EDAC should be working for you.
> 
> > # lspci
> 
> I didn't dig into this. Qiuxu - can you compare this against a server Sapphire
> rapids?
> Maybe it has some clues so the EDAC driver will know not to look for non-
> existent memory controllers.

This Sapphire Rapids workstation had 2 memory controllers but appeared 
4 memory controller PCIe devices from the log:

    0000:fe:0c.0 1101: 8086:324a
    0000:fe:0d.0 1101: 8086:324a // absent mc fooling the driver, should not appear
    0000:fe:0e.0 1101: 8086:324a
    0000:fe:0f.0 1101: 8086:324a // absent mc fooling the driver, should not appear

By observing that the MMIO registers of these absent
memory controllers consistently hold the value of ~0.
We may identify a memory controller as absent by checking
if its MMIO register "mcmtr" == ~0 in all its channels.

I made a patch below to skip all these absent memory controllers
https://lore.kernel.org/linux-edac/20230706134216.37044-1-qiuxu.zhuo@intel.com/T/#u
@Koba Ko, could you please verify the patch from the link above on your workstation? Thanks! 

BTW,
Kai-Heng Feng also found the same issue before:
https://lore.kernel.org/linux-edac/CAAd53p41Ku1m1rapeqb1xtD+kKuk+BaUW=dumuoF0ZO3GhFjFA@mail.gmail.com/T/#m5de16dce60a8c836ec235868c7c16e3fefad0cc2

- Qiuxu
Koba Ko July 6, 2023, 5:40 p.m. UTC | #15
On Thu, Jul 6, 2023 at 10:12 PM Zhuo, Qiuxu <qiuxu.zhuo@intel.com> wrote:
>
> > From: Luck, Tony <tony.luck@intel.com>
> > Sent: Wednesday, July 5, 2023 11:22 PM
> > ...
> > Subject: RE: [PATCH v2] EDAC/i10nm: shift exponent is negative
> >
> > >> # head /proc/cpuinfo
> >
> > This shows your system is the workstation version of Sapphire rapids. I don't
> > think we did any validation of the EDAC driver against this model.
>
> No, we didn't do any validation of the EDAC on Sapphires Rapids workstations.
> From the link below, we know this is a Sapphire Rapids workstation with only 2 memory controllers.
> https://www.intel.com/content/www/us/en/products/sku/233480/intel-xeon-w32435-processor-22-5m-cache-3-10-ghz/specifications.html
>
> We only did validation on the Sapphire Rapids servers which were with 4 memory controllers per socket before.
>
> > > # dmidecode -t 17
> >
> > You have just one 16GB DIMM, and EDAC found that. So despite the messy
> > warnings, EDAC should be working for you.
> >
> > > # lspci
> >
> > I didn't dig into this. Qiuxu - can you compare this against a server Sapphire
> > rapids?
> > Maybe it has some clues so the EDAC driver will know not to look for non-
> > existent memory controllers.
>
> This Sapphire Rapids workstation had 2 memory controllers but appeared
> 4 memory controller PCIe devices from the log:
>
>     0000:fe:0c.0 1101: 8086:324a
>     0000:fe:0d.0 1101: 8086:324a // absent mc fooling the driver, should not appear
>     0000:fe:0e.0 1101: 8086:324a
>     0000:fe:0f.0 1101: 8086:324a // absent mc fooling the driver, should not appear
>
> By observing that the MMIO registers of these absent
> memory controllers consistently hold the value of ~0.
> We may identify a memory controller as absent by checking
> if its MMIO register "mcmtr" == ~0 in all its channels.
>
> I made a patch below to skip all these absent memory controllers
> https://lore.kernel.org/linux-edac/20230706134216.37044-1-qiuxu.zhuo@intel.com/T/#u
> @Koba Ko, could you please verify the patch from the link above on your workstation? Thanks!

Here's dmesg patched(Ref. 1). didn't find the previous message,
`EDAC DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff)`

Ref. 1, https://drive.google.com/drive/folders/1xym9JgZZgaJ3EqtP4ccRcVeQYoJKmVlp?usp=sharing

>
> BTW,
> Kai-Heng Feng also found the same issue before:
> https://lore.kernel.org/linux-edac/CAAd53p41Ku1m1rapeqb1xtD+kKuk+BaUW=dumuoF0ZO3GhFjFA@mail.gmail.com/T/#m5de16dce60a8c836ec235868c7c16e3fefad0cc2
>
> - Qiuxu
Zhuo, Qiuxu July 7, 2023, 3:34 a.m. UTC | #16
> From: Koba Ko <koba.ko@canonical.com>
> Sent: Friday, July 7, 2023 1:41 AM
> To: Zhuo, Qiuxu <qiuxu.zhuo@intel.com>
> ...
> > I made a patch below to skip all these absent memory controllers
> > https://lore.kernel.org/linux-edac/20230706134216.37044-1-qiuxu.zhuo@i
> > ntel.com/T/#u @Koba Ko, could you please verify the patch from the
> > link above on your workstation? Thanks!
> 
> Here's dmesg patched(Ref. 1). didn't find the previous message, `EDAC
> DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff)`
> 
> Ref. 1,
> https://drive.google.com/drive/folders/1xym9JgZZgaJ3EqtP4ccRcVeQYoJKmV
> lp?usp=sharing

Thanks for the verification. 
Your log showed that the patch worked well.

-Qiuxu
Koba Ko July 9, 2023, 3:42 p.m. UTC | #17
On Fri, Jul 7, 2023 at 11:34 AM Zhuo, Qiuxu <qiuxu.zhuo@intel.com> wrote:
>
> > From: Koba Ko <koba.ko@canonical.com>
> > Sent: Friday, July 7, 2023 1:41 AM
> > To: Zhuo, Qiuxu <qiuxu.zhuo@intel.com>
> > ...
> > > I made a patch below to skip all these absent memory controllers
> > > https://lore.kernel.org/linux-edac/20230706134216.37044-1-qiuxu.zhuo@i
> > > ntel.com/T/#u @Koba Ko, could you please verify the patch from the
> > > link above on your workstation? Thanks!
> >
> > Here's dmesg patched(Ref. 1). didn't find the previous message, `EDAC
> > DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff)`
> >
> > Ref. 1,
> > https://drive.google.com/drive/folders/1xym9JgZZgaJ3EqtP4ccRcVeQYoJKmV
> > lp?usp=sharing
>
> Thanks for the verification.
> Your log showed that the patch worked well.

please,
tested-by: koba.ko@canonical.com
>
> -Qiuxu
Zhuo, Qiuxu July 10, 2023, 1:18 a.m. UTC | #18
> From: Koba Ko <koba.ko@canonical.com>
> ...
> > > Ref. 1,
> > > https://drive.google.com/drive/folders/1xym9JgZZgaJ3EqtP4ccRcVeQYoJK
> > > mV
> > > lp?usp=sharing
> >
> > Thanks for the verification.
> > Your log showed that the patch worked well.
> 
> please,
> tested-by: koba.ko@canonical.com

Sure.
Dan Carpenter July 10, 2023, 6:25 a.m. UTC | #19
On Sun, Jul 09, 2023 at 11:42:22PM +0800, Koba Ko wrote:
> tested-by: koba.ko@canonical.com

It's better if you put the tag in the correct format.  There are a bunch
of tools which automatically add tags, but they only work if the tag is
in the correct format.

regards,
dan carpenter
Zhuo, Qiuxu July 10, 2023, 8:08 a.m. UTC | #20
> From: Dan Carpenter <dan.carpenter@linaro.org>
> ...
> Subject: Re: [PATCH v2] EDAC/i10nm: shift exponent is negative
> 
> On Sun, Jul 09, 2023 at 11:42:22PM +0800, Koba Ko wrote:
> > tested-by: koba.ko@canonical.com
> 
> It's better if you put the tag in the correct format.  There are a bunch of tools
> which automatically add tags, but they only work if the tag is in the correct
> format.

I corrected Koba Ko's "Tested-by" tag and added it to my v2:
https://lore.kernel.org/linux-edac/20230710013232.59712-1-qiuxu.zhuo@intel.com/

-Qiuxu
diff mbox series

Patch

diff --git a/drivers/edac/skx_common.c b/drivers/edac/skx_common.c
index 2a00e0503f0d5..ac61db72d2e6b 100644
--- a/drivers/edac/skx_common.c
+++ b/drivers/edac/skx_common.c
@@ -330,7 +330,7 @@  static int skx_get_dimm_attr(u32 reg, int lobit, int hibit, int add,
 	u32 val = GET_BITFIELD(reg, lobit, hibit);
 
 	if (val < minval || val > maxval) {
-		edac_dbg(2, "bad %s = %d (raw=0x%x)\n", name, val, reg);
+		skx_printk(KERN_ERR, "bad %s = %d (raw=0x%x)\n", name, val, reg);
 		return -EINVAL;
 	}
 	return val + add;
@@ -351,6 +351,8 @@  int skx_get_dimm_info(u32 mtr, u32 mcmtr, u32 amap, struct dimm_info *dimm,
 	ranks = numrank(mtr);
 	rows = numrow(mtr);
 	cols = imc->hbm_mc ? 6 : numcol(mtr);
+	if (ranks == -EINVAL || rows == -EINVAL || cols == -EINVAL)
+		return 0;
 
 	if (imc->hbm_mc) {
 		banks = 32;