diff mbox series

EDAC/i10nm: shift exponent is negative

Message ID 20230628085253.1013799-1-koba.ko@canonical.com (mailing list archive)
State New, archived
Headers show
Series EDAC/i10nm: shift exponent is negative | expand

Commit Message

Koba Ko June 28, 2023, 8:52 a.m. UTC
UBSAN complains this error
~~~
 UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16
 shift exponent -66 is negative
 Call Trace:
  <TASK>
  dump_stack_lvl+0x48/0x70
  dump_stack+0x10/0x20
  __ubsan_handle_shift_out_of_bounds+0x1ac/0x360
  skx_get_dimm_info.cold+0x91/0x175 [i10nm_edac]
  ? kvasprintf_const+0x2a/0xb0
  i10nm_get_dimm_config+0x23c/0x340 [i10nm_edac]
  skx_register_mci+0x139/0x1e0 [i10nm_edac]
  ? __pfx_i10nm_get_dimm_config+0x10/0x10 [i10nm_edac]
  i10nm_init+0x403/0xd10 [i10nm_edac]
  ? __pfx_i10nm_init+0x10/0x10 [i10nm_edac]
  do_one_initcall+0x5b/0x250
  do_init_module+0x68/0x260
  load_module+0xb45/0xcd0
  ? kernel_read_file+0x2a4/0x320
  __do_sys_finit_module+0xc4/0x140
  ? __do_sys_finit_module+0xc4/0x140
  __x64_sys_finit_module+0x18/0x30
  do_syscall_64+0x58/0x90
  ? syscall_exit_to_user_mode+0x29/0x50
  ? do_syscall_64+0x67/0x90
  ? syscall_exit_to_user_mode+0x29/0x50
  ? do_syscall_64+0x67/0x90
  ? do_syscall_64+0x67/0x90
  ? __flush_smp_call_function_queue+0x122/0x1f0
  ? exit_to_user_mode_prepare+0x30/0xb0
  ? irqentry_exit_to_user_mode+0x9/0x20
  ? irqentry_exit+0x43/0x50
  ? sysvec_call_function+0x4b/0xd0
  entry_SYSCALL_64_after_hwframe+0x72/0xdc
~~~

when get rows, cols and ranks, the returned error value doesn't be
handled.

check the return value is EINVAL, if yes, directly return 0.

Signed-off-by: Koba Ko <koba.ko@canonical.com>
---
 drivers/edac/skx_common.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Tony Luck June 28, 2023, 4:39 p.m. UTC | #1
>       ranks = numrank(mtr);
>       rows = numrow(mtr);
>       cols = imc->hbm_mc ? 6 : numcol(mtr);
> +     if (ranks == -EINVAL || rows == -EINVAL || cols == -EINVAL)
> +             return 0;

This seems to be just hiding the real problem that a DIMM was found
with some number of ranks, rows, or columns that the EDAC driver
didn't expect to see. Your fix makes the driver skip over this DIMM.

Can you build your kernel with CONFIG_EDAC_DEBUG=y and see
what messages you get from this code:

static int skx_get_dimm_attr(u32 reg, int lobit, int hibit, int add,
                             int minval, int maxval, const char *name)
{
        u32 val = GET_BITFIELD(reg, lobit, hibit);

        if (val < minval || val > maxval) {
                edac_dbg(2, "bad %s = %d (raw=0x%x)\n", name, val, reg);
                return -EINVAL;
        }

-Tony
Koba Ko June 29, 2023, 3:52 a.m. UTC | #2
hi Luck,
I agree with your points
is it expected to shift with negative?

Thanks
Koba Ko

On Thu, Jun 29, 2023 at 12:41 AM Luck, Tony <tony.luck@intel.com> wrote:
>
> >       ranks = numrank(mtr);
> >       rows = numrow(mtr);
> >       cols = imc->hbm_mc ? 6 : numcol(mtr);
> > +     if (ranks == -EINVAL || rows == -EINVAL || cols == -EINVAL)
> > +             return 0;
>
> This seems to be just hiding the real problem that a DIMM was found
> with some number of ranks, rows, or columns that the EDAC driver
> didn't expect to see. Your fix makes the driver skip over this DIMM.
>
> Can you build your kernel with CONFIG_EDAC_DEBUG=y and see
> what messages you get from this code:
>
> static int skx_get_dimm_attr(u32 reg, int lobit, int hibit, int add,
>                              int minval, int maxval, const char *name)
> {
>         u32 val = GET_BITFIELD(reg, lobit, hibit);
>
>         if (val < minval || val > maxval) {
>                 edac_dbg(2, "bad %s = %d (raw=0x%x)\n", name, val, reg);
>                 return -EINVAL;
>         }
>
> -Tony
>
>
Zhuo, Qiuxu June 29, 2023, 9:58 a.m. UTC | #3
Hi Ko,

I don't agree with simply skipping over a DIMM even EDAC doesn't expect to see it. 
As the EDAC driver can still report errors for this DIMM once there are errors that occur in this DIMM.  

As per Tony's suggestion, could you test your kernel with CONFIG_EDAC_DEBUG=y and see the result?
 
@Luck, Tony, Perhaps we may turn the debug print
     
       edac_dbg(2, "bad %s = %d (raw=0x%x)\n", name, val, reg);

to an error-print explicitly

     skx_printk(KERN_ERR, "bad %s = %d (raw=0x%x)\n", name, val, reg);

Let the user have the chance to notice there is a DIMM that EDAC doesn't expect to see.

- Qiuxu

> From: Koba Ko <koba.ko@canonical.com>
> Sent: Thursday, June 29, 2023 11:53 AM
> To: Luck, Tony <tony.luck@intel.com>
> Cc: Borislav Petkov <bp@alien8.de>; James Morse <james.morse@arm.com>;
> Mauro Carvalho Chehab <mchehab@kernel.org>; Robert Richter
> <rric@kernel.org>; linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] EDAC/i10nm: shift exponent is negative
> 
> hi Luck,
> I agree with your points
> is it expected to shift with negative?
> 
> Thanks
> Koba Ko
> 
> On Thu, Jun 29, 2023 at 12:41 AM Luck, Tony <tony.luck@intel.com> wrote:
> >
> > >       ranks = numrank(mtr);
> > >       rows = numrow(mtr);
> > >       cols = imc->hbm_mc ? 6 : numcol(mtr);
> > > +     if (ranks == -EINVAL || rows == -EINVAL || cols == -EINVAL)
> > > +             return 0;
> >
> > This seems to be just hiding the real problem that a DIMM was found
> > with some number of ranks, rows, or columns that the EDAC driver
> > didn't expect to see. Your fix makes the driver skip over this DIMM.
> >
> > Can you build your kernel with CONFIG_EDAC_DEBUG=y and see what
> > messages you get from this code:
> >
> > static int skx_get_dimm_attr(u32 reg, int lobit, int hibit, int add,
> >                              int minval, int maxval, const char *name)
> > {
> >         u32 val = GET_BITFIELD(reg, lobit, hibit);
> >
> >         if (val < minval || val > maxval) {
> >                 edac_dbg(2, "bad %s = %d (raw=0x%x)\n", name, val, reg);
> >                 return -EINVAL;
> >         }
> >
> > -Tony
> >
> >
Tony Luck June 29, 2023, 4:11 p.m. UTC | #4
> I don't agree with simply skipping over a DIMM even EDAC doesn't expect to see it.
> As the EDAC driver can still report errors for this DIMM once there are errors that occur in this DIMM.
>
> As per Tony's suggestion, could you test your kernel with CONFIG_EDAC_DEBUG=y and see the result?
>
> @Luck, Tony, Perhaps we may turn the debug print
>
>        edac_dbg(2, "bad %s = %d (raw=0x%x)\n", name, val, reg);
>
> to an error-print explicitly
>
>      skx_printk(KERN_ERR, "bad %s = %d (raw=0x%x)\n", name, val, reg);
>
> Let the user have the chance to notice there is a DIMM that EDAC doesn't expect to see.

We need both. Changing that debug message to a real error message will let the user
know that EDAC doesn't recognize this DIMM (and will provide the information for you
or me to fix the driver).

But we also need Ko's fix - because it makes no sense to just use that negative shift
and pretend that EDAC knows how to handle this DIMM.

-Tony
Zhuo, Qiuxu June 30, 2023, 8:31 a.m. UTC | #5
> From: Luck, Tony <tony.luck@intel.com>
> Sent: Friday, June 30, 2023 12:12 AM
> To: Zhuo, Qiuxu <qiuxu.zhuo@intel.com>; Koba Ko <koba.ko@canonical.com>
> Cc: Borislav Petkov <bp@alien8.de>; James Morse <james.morse@arm.com>;
> Mauro Carvalho Chehab <mchehab@kernel.org>; Robert Richter
> <rric@kernel.org>; linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: RE: [PATCH] EDAC/i10nm: shift exponent is negative
> 
> > I don't agree with simply skipping over a DIMM even EDAC doesn't expect
> to see it.
> > As the EDAC driver can still report errors for this DIMM once there are
> errors that occur in this DIMM.
> >
> > As per Tony's suggestion, could you test your kernel with
> CONFIG_EDAC_DEBUG=y and see the result?
> >
> > @Luck, Tony, Perhaps we may turn the debug print
> >
> >        edac_dbg(2, "bad %s = %d (raw=0x%x)\n", name, val, reg);
> >
> > to an error-print explicitly
> >
> >      skx_printk(KERN_ERR, "bad %s = %d (raw=0x%x)\n", name, val, reg);
> >
> > Let the user have the chance to notice there is a DIMM that EDAC doesn't
> expect to see.
> 
> We need both. Changing that debug message to a real error message will let
> the user know that EDAC doesn't recognize this DIMM (and will provide the
> information for you or me to fix the driver).
> 
> But we also need Ko's fix - because it makes no sense to just use that
> negative shift and pretend that EDAC knows how to handle this DIMM.
> 

OK.
@Koba Ko, could you make a new patch with Tony's suggestion? Thanks!

-Qiuxu
diff mbox series

Patch

diff --git a/drivers/edac/skx_common.c b/drivers/edac/skx_common.c
index 2a00e0503f0d5..98baed1617c9f 100644
--- a/drivers/edac/skx_common.c
+++ b/drivers/edac/skx_common.c
@@ -351,6 +351,8 @@  int skx_get_dimm_info(u32 mtr, u32 mcmtr, u32 amap, struct dimm_info *dimm,
 	ranks = numrank(mtr);
 	rows = numrow(mtr);
 	cols = imc->hbm_mc ? 6 : numcol(mtr);
+	if (ranks == -EINVAL || rows == -EINVAL || cols == -EINVAL)
+		return 0;
 
 	if (imc->hbm_mc) {
 		banks = 32;