Message ID | 20230628085253.1013799-1-koba.ko@canonical.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | EDAC/i10nm: shift exponent is negative | expand |
> ranks = numrank(mtr); > rows = numrow(mtr); > cols = imc->hbm_mc ? 6 : numcol(mtr); > + if (ranks == -EINVAL || rows == -EINVAL || cols == -EINVAL) > + return 0; This seems to be just hiding the real problem that a DIMM was found with some number of ranks, rows, or columns that the EDAC driver didn't expect to see. Your fix makes the driver skip over this DIMM. Can you build your kernel with CONFIG_EDAC_DEBUG=y and see what messages you get from this code: static int skx_get_dimm_attr(u32 reg, int lobit, int hibit, int add, int minval, int maxval, const char *name) { u32 val = GET_BITFIELD(reg, lobit, hibit); if (val < minval || val > maxval) { edac_dbg(2, "bad %s = %d (raw=0x%x)\n", name, val, reg); return -EINVAL; } -Tony
hi Luck, I agree with your points is it expected to shift with negative? Thanks Koba Ko On Thu, Jun 29, 2023 at 12:41 AM Luck, Tony <tony.luck@intel.com> wrote: > > > ranks = numrank(mtr); > > rows = numrow(mtr); > > cols = imc->hbm_mc ? 6 : numcol(mtr); > > + if (ranks == -EINVAL || rows == -EINVAL || cols == -EINVAL) > > + return 0; > > This seems to be just hiding the real problem that a DIMM was found > with some number of ranks, rows, or columns that the EDAC driver > didn't expect to see. Your fix makes the driver skip over this DIMM. > > Can you build your kernel with CONFIG_EDAC_DEBUG=y and see > what messages you get from this code: > > static int skx_get_dimm_attr(u32 reg, int lobit, int hibit, int add, > int minval, int maxval, const char *name) > { > u32 val = GET_BITFIELD(reg, lobit, hibit); > > if (val < minval || val > maxval) { > edac_dbg(2, "bad %s = %d (raw=0x%x)\n", name, val, reg); > return -EINVAL; > } > > -Tony > >
Hi Ko, I don't agree with simply skipping over a DIMM even EDAC doesn't expect to see it. As the EDAC driver can still report errors for this DIMM once there are errors that occur in this DIMM. As per Tony's suggestion, could you test your kernel with CONFIG_EDAC_DEBUG=y and see the result? @Luck, Tony, Perhaps we may turn the debug print edac_dbg(2, "bad %s = %d (raw=0x%x)\n", name, val, reg); to an error-print explicitly skx_printk(KERN_ERR, "bad %s = %d (raw=0x%x)\n", name, val, reg); Let the user have the chance to notice there is a DIMM that EDAC doesn't expect to see. - Qiuxu > From: Koba Ko <koba.ko@canonical.com> > Sent: Thursday, June 29, 2023 11:53 AM > To: Luck, Tony <tony.luck@intel.com> > Cc: Borislav Petkov <bp@alien8.de>; James Morse <james.morse@arm.com>; > Mauro Carvalho Chehab <mchehab@kernel.org>; Robert Richter > <rric@kernel.org>; linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org > Subject: Re: [PATCH] EDAC/i10nm: shift exponent is negative > > hi Luck, > I agree with your points > is it expected to shift with negative? > > Thanks > Koba Ko > > On Thu, Jun 29, 2023 at 12:41 AM Luck, Tony <tony.luck@intel.com> wrote: > > > > > ranks = numrank(mtr); > > > rows = numrow(mtr); > > > cols = imc->hbm_mc ? 6 : numcol(mtr); > > > + if (ranks == -EINVAL || rows == -EINVAL || cols == -EINVAL) > > > + return 0; > > > > This seems to be just hiding the real problem that a DIMM was found > > with some number of ranks, rows, or columns that the EDAC driver > > didn't expect to see. Your fix makes the driver skip over this DIMM. > > > > Can you build your kernel with CONFIG_EDAC_DEBUG=y and see what > > messages you get from this code: > > > > static int skx_get_dimm_attr(u32 reg, int lobit, int hibit, int add, > > int minval, int maxval, const char *name) > > { > > u32 val = GET_BITFIELD(reg, lobit, hibit); > > > > if (val < minval || val > maxval) { > > edac_dbg(2, "bad %s = %d (raw=0x%x)\n", name, val, reg); > > return -EINVAL; > > } > > > > -Tony > > > >
> I don't agree with simply skipping over a DIMM even EDAC doesn't expect to see it. > As the EDAC driver can still report errors for this DIMM once there are errors that occur in this DIMM. > > As per Tony's suggestion, could you test your kernel with CONFIG_EDAC_DEBUG=y and see the result? > > @Luck, Tony, Perhaps we may turn the debug print > > edac_dbg(2, "bad %s = %d (raw=0x%x)\n", name, val, reg); > > to an error-print explicitly > > skx_printk(KERN_ERR, "bad %s = %d (raw=0x%x)\n", name, val, reg); > > Let the user have the chance to notice there is a DIMM that EDAC doesn't expect to see. We need both. Changing that debug message to a real error message will let the user know that EDAC doesn't recognize this DIMM (and will provide the information for you or me to fix the driver). But we also need Ko's fix - because it makes no sense to just use that negative shift and pretend that EDAC knows how to handle this DIMM. -Tony
> From: Luck, Tony <tony.luck@intel.com> > Sent: Friday, June 30, 2023 12:12 AM > To: Zhuo, Qiuxu <qiuxu.zhuo@intel.com>; Koba Ko <koba.ko@canonical.com> > Cc: Borislav Petkov <bp@alien8.de>; James Morse <james.morse@arm.com>; > Mauro Carvalho Chehab <mchehab@kernel.org>; Robert Richter > <rric@kernel.org>; linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org > Subject: RE: [PATCH] EDAC/i10nm: shift exponent is negative > > > I don't agree with simply skipping over a DIMM even EDAC doesn't expect > to see it. > > As the EDAC driver can still report errors for this DIMM once there are > errors that occur in this DIMM. > > > > As per Tony's suggestion, could you test your kernel with > CONFIG_EDAC_DEBUG=y and see the result? > > > > @Luck, Tony, Perhaps we may turn the debug print > > > > edac_dbg(2, "bad %s = %d (raw=0x%x)\n", name, val, reg); > > > > to an error-print explicitly > > > > skx_printk(KERN_ERR, "bad %s = %d (raw=0x%x)\n", name, val, reg); > > > > Let the user have the chance to notice there is a DIMM that EDAC doesn't > expect to see. > > We need both. Changing that debug message to a real error message will let > the user know that EDAC doesn't recognize this DIMM (and will provide the > information for you or me to fix the driver). > > But we also need Ko's fix - because it makes no sense to just use that > negative shift and pretend that EDAC knows how to handle this DIMM. > OK. @Koba Ko, could you make a new patch with Tony's suggestion? Thanks! -Qiuxu
diff --git a/drivers/edac/skx_common.c b/drivers/edac/skx_common.c index 2a00e0503f0d5..98baed1617c9f 100644 --- a/drivers/edac/skx_common.c +++ b/drivers/edac/skx_common.c @@ -351,6 +351,8 @@ int skx_get_dimm_info(u32 mtr, u32 mcmtr, u32 amap, struct dimm_info *dimm, ranks = numrank(mtr); rows = numrow(mtr); cols = imc->hbm_mc ? 6 : numcol(mtr); + if (ranks == -EINVAL || rows == -EINVAL || cols == -EINVAL) + return 0; if (imc->hbm_mc) { banks = 32;
UBSAN complains this error ~~~ UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16 shift exponent -66 is negative Call Trace: <TASK> dump_stack_lvl+0x48/0x70 dump_stack+0x10/0x20 __ubsan_handle_shift_out_of_bounds+0x1ac/0x360 skx_get_dimm_info.cold+0x91/0x175 [i10nm_edac] ? kvasprintf_const+0x2a/0xb0 i10nm_get_dimm_config+0x23c/0x340 [i10nm_edac] skx_register_mci+0x139/0x1e0 [i10nm_edac] ? __pfx_i10nm_get_dimm_config+0x10/0x10 [i10nm_edac] i10nm_init+0x403/0xd10 [i10nm_edac] ? __pfx_i10nm_init+0x10/0x10 [i10nm_edac] do_one_initcall+0x5b/0x250 do_init_module+0x68/0x260 load_module+0xb45/0xcd0 ? kernel_read_file+0x2a4/0x320 __do_sys_finit_module+0xc4/0x140 ? __do_sys_finit_module+0xc4/0x140 __x64_sys_finit_module+0x18/0x30 do_syscall_64+0x58/0x90 ? syscall_exit_to_user_mode+0x29/0x50 ? do_syscall_64+0x67/0x90 ? syscall_exit_to_user_mode+0x29/0x50 ? do_syscall_64+0x67/0x90 ? do_syscall_64+0x67/0x90 ? __flush_smp_call_function_queue+0x122/0x1f0 ? exit_to_user_mode_prepare+0x30/0xb0 ? irqentry_exit_to_user_mode+0x9/0x20 ? irqentry_exit+0x43/0x50 ? sysvec_call_function+0x4b/0xd0 entry_SYSCALL_64_after_hwframe+0x72/0xdc ~~~ when get rows, cols and ranks, the returned error value doesn't be handled. check the return value is EINVAL, if yes, directly return 0. Signed-off-by: Koba Ko <koba.ko@canonical.com> --- drivers/edac/skx_common.c | 2 ++ 1 file changed, 2 insertions(+)