Message ID | 20230516033133.340936-1-kai.heng.feng@canonical.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | EDAC/Intel: Fix shift-out-of-bounds when DIMM/NVDIMM is absent | expand |
> From: Kai-Heng Feng <kai.heng.feng@canonical.com> > ... > Subject: [PATCH] EDAC/Intel: Fix shift-out-of-bounds when DIMM/NVDIMM > is absent > > The following splat can be found on many systems equipped with EDAC: > [ 13.875276] UBSAN: shift-out-of-bounds in > drivers/edac/skx_common.c:369:16 > [ 13.875279] shift exponent -66 is negative > [ 13.875280] CPU: 11 PID: 519 Comm: systemd-udevd Not tainted 6.4.0-rc1+ > #1 > [ 13.875282] Hardware name: HP HP Z4 G5 Workstation Desktop PC/8962, > BIOS U61 Ver. 01.01.15 04/19/2023 > [ 13.875283] Call Trace: > [ 13.875285] <TASK> > [ 13.875287] dump_stack_lvl+0x48/0x70 > [ 13.875295] dump_stack+0x10/0x20 > [ 13.875297] __ubsan_handle_shift_out_of_bounds+0x156/0x310 > [ 13.875302] ? __kmem_cache_alloc_node+0x196/0x300 > [ 13.875307] skx_get_dimm_info.cold+0xac/0x15d [i10nm_edac] > [ 13.875312] i10nm_get_dimm_config+0x240/0x360 [i10nm_edac] > [ 13.875316] ? kasprintf+0x4e/0x80 > [ 13.875321] skx_register_mci+0x12b/0x1d0 [i10nm_edac] > [ 13.875324] ? __pfx_i10nm_get_dimm_config+0x10/0x10 [i10nm_edac] > [ 13.875329] i10nm_init+0x89f/0x1d10 [i10nm_edac] > [ 13.875333] ? __pfx_i10nm_init+0x10/0x10 [i10nm_edac] > [ 13.875337] do_one_initcall+0x46/0x240 > [ 13.875342] ? kmalloc_trace+0x2a/0xb0 > [ 13.875346] do_init_module+0x6a/0x280 > [ 13.875350] load_module+0x2419/0x2500 > [ 13.875353] ? security_kernel_post_read_file+0x5c/0x80 > [ 13.875358] __do_sys_finit_module+0xcc/0x150 > [ 13.875360] ? __do_sys_finit_module+0xcc/0x150 > [ 13.875363] __x64_sys_finit_module+0x18/0x30 > [ 13.875365] do_syscall_64+0x59/0x90 > [ 13.875368] ? syscall_exit_to_user_mode+0x2a/0x50 > [ 13.875371] ? do_syscall_64+0x69/0x90 > [ 13.875372] ? do_syscall_64+0x69/0x90 > [ 13.875373] ? do_syscall_64+0x69/0x90 > [ 13.875374] ? do_syscall_64+0x69/0x90 > [ 13.875375] ? syscall_exit_to_user_mode+0x2a/0x50 > [ 13.875376] ? do_syscall_64+0x69/0x90 > [ 13.875377] ? do_syscall_64+0x69/0x90 > [ 13.875378] ? do_syscall_64+0x69/0x90 > [ 13.875379] ? sysvec_call_function+0x4e/0xb0 > [ 13.875381] entry_SYSCALL_64_after_hwframe+0x72/0xdc > > When a DIMM slot is empty, the read value of mtr can be 0xffffffff, therefore Looked like a buggy BIOS/hw that didn't set the mtr register. 1. Did you print the mtr register whose value was 0xffffffff? 2. Can you take a dmesg log with kernel "CONFIG_EDAC_DEBUG=y" enabled? 3. What was the CPU? Please take the output of "lscpu". 4. Did you verify your patch that the issue was fixed on your systems? Thanks! -Qiuxu > the wrong "ranks" value creates shift-out-of-bounds error. The same issue > can be found on NVDIMM too. > > So only consider DIMM/NVDIMM is present when the value of > mtr/mcddrtcfg is not ~0. > ...
>> [ 13.875282] Hardware name: HP HP Z4 G5 Workstation Desktop PC/8962, > > BIOS U61 Ver. 01.01.15 04/19/2023 >> When a DIMM slot is empty, the read value of mtr can be 0xffffffff, therefore > Looked like a buggy BIOS/hw that didn't set the mtr register. > > 1. Did you print the mtr register whose value was 0xffffffff? > 2. Can you take a dmesg log with kernel "CONFIG_EDAC_DEBUG=y" enabled? > 3. What was the CPU? Please take the output of "lscpu". > 4. Did you verify your patch that the issue was fixed on your systems? I wonder if BIOS is "hiding" some devices from the OS? The 0xffffffff return is the standard PCI response for reading a non-existent register. But that doesn't quite make sense with having a "dimm present" bit in the MTR register. If the register only exists if the DIMM is present, then there is no need for a "dimm present" bit. Some "lspci" output may also be useful. -Tony
On Tue, May 16, 2023 at 8:53 PM Zhuo, Qiuxu <qiuxu.zhuo@intel.com> wrote: > > > From: Kai-Heng Feng <kai.heng.feng@canonical.com> > > ... > > Subject: [PATCH] EDAC/Intel: Fix shift-out-of-bounds when DIMM/NVDIMM > > is absent > > > > The following splat can be found on many systems equipped with EDAC: > > [ 13.875276] UBSAN: shift-out-of-bounds in > > drivers/edac/skx_common.c:369:16 > > [ 13.875279] shift exponent -66 is negative > > [ 13.875280] CPU: 11 PID: 519 Comm: systemd-udevd Not tainted 6.4.0-rc1+ > > #1 > > [ 13.875282] Hardware name: HP HP Z4 G5 Workstation Desktop PC/8962, > > BIOS U61 Ver. 01.01.15 04/19/2023 > > [ 13.875283] Call Trace: > > [ 13.875285] <TASK> > > [ 13.875287] dump_stack_lvl+0x48/0x70 > > [ 13.875295] dump_stack+0x10/0x20 > > [ 13.875297] __ubsan_handle_shift_out_of_bounds+0x156/0x310 > > [ 13.875302] ? __kmem_cache_alloc_node+0x196/0x300 > > [ 13.875307] skx_get_dimm_info.cold+0xac/0x15d [i10nm_edac] > > [ 13.875312] i10nm_get_dimm_config+0x240/0x360 [i10nm_edac] > > [ 13.875316] ? kasprintf+0x4e/0x80 > > [ 13.875321] skx_register_mci+0x12b/0x1d0 [i10nm_edac] > > [ 13.875324] ? __pfx_i10nm_get_dimm_config+0x10/0x10 [i10nm_edac] > > [ 13.875329] i10nm_init+0x89f/0x1d10 [i10nm_edac] > > [ 13.875333] ? __pfx_i10nm_init+0x10/0x10 [i10nm_edac] > > [ 13.875337] do_one_initcall+0x46/0x240 > > [ 13.875342] ? kmalloc_trace+0x2a/0xb0 > > [ 13.875346] do_init_module+0x6a/0x280 > > [ 13.875350] load_module+0x2419/0x2500 > > [ 13.875353] ? security_kernel_post_read_file+0x5c/0x80 > > [ 13.875358] __do_sys_finit_module+0xcc/0x150 > > [ 13.875360] ? __do_sys_finit_module+0xcc/0x150 > > [ 13.875363] __x64_sys_finit_module+0x18/0x30 > > [ 13.875365] do_syscall_64+0x59/0x90 > > [ 13.875368] ? syscall_exit_to_user_mode+0x2a/0x50 > > [ 13.875371] ? do_syscall_64+0x69/0x90 > > [ 13.875372] ? do_syscall_64+0x69/0x90 > > [ 13.875373] ? do_syscall_64+0x69/0x90 > > [ 13.875374] ? do_syscall_64+0x69/0x90 > > [ 13.875375] ? syscall_exit_to_user_mode+0x2a/0x50 > > [ 13.875376] ? do_syscall_64+0x69/0x90 > > [ 13.875377] ? do_syscall_64+0x69/0x90 > > [ 13.875378] ? do_syscall_64+0x69/0x90 > > [ 13.875379] ? sysvec_call_function+0x4e/0xb0 > > [ 13.875381] entry_SYSCALL_64_after_hwframe+0x72/0xdc > > > > When a DIMM slot is empty, the read value of mtr can be 0xffffffff, therefore > > Looked like a buggy BIOS/hw that didn't set the mtr register. If that's the case, I suspect the bug comes from Intel BIOS RC, because the issue happens on different vendors' hardware. > > 1. Did you print the mtr register whose value was 0xffffffff? Yes, 0xffffffff is the value. mcddrtcfg is also 0xffffffff. > 2. Can you take a dmesg log with kernel "CONFIG_EDAC_DEBUG=y" enabled? > 3. What was the CPU? Please take the output of "lscpu". Both attached in Bugzlla [1]. > 4. Did you verify your patch that the issue was fixed on your systems? I did, that's why I sent the patch to mailing list. [1] https://bugzilla.kernel.org/show_bug.cgi?id=217453 Kai-Heng > > Thanks! > -Qiuxu > > > the wrong "ranks" value creates shift-out-of-bounds error. The same issue > > can be found on NVDIMM too. > > > > So only consider DIMM/NVDIMM is present when the value of > > mtr/mcddrtcfg is not ~0. > > ...
On Wed, May 17, 2023 at 1:13 AM Luck, Tony <tony.luck@intel.com> wrote: > > >> [ 13.875282] Hardware name: HP HP Z4 G5 Workstation Desktop PC/8962, > > > BIOS U61 Ver. 01.01.15 04/19/2023 > > > >> When a DIMM slot is empty, the read value of mtr can be 0xffffffff, therefore > > > Looked like a buggy BIOS/hw that didn't set the mtr register. > > > > 1. Did you print the mtr register whose value was 0xffffffff? > > 2. Can you take a dmesg log with kernel "CONFIG_EDAC_DEBUG=y" enabled? > > 3. What was the CPU? Please take the output of "lscpu". > > 4. Did you verify your patch that the issue was fixed on your systems? > > I wonder if BIOS is "hiding" some devices from the OS? The 0xffffffff return is > the standard PCI response for reading a non-existent register. But that doesn't > quite make sense with having a "dimm present" bit in the MTR register. If > the register only exists if the DIMM is present, then there is no need for > a "dimm present" bit. I wonder if the "non-existent register" read is intended? > > Some "lspci" output may also be useful. lspci can be found in [1]: [1] https://bugzilla.kernel.org/show_bug.cgi?id=217453 Kai-Heng > > -Tony
On Wed, May 17, 2023 at 3:49 PM Kai-Heng Feng <kai.heng.feng@canonical.com> wrote: > > On Wed, May 17, 2023 at 1:13 AM Luck, Tony <tony.luck@intel.com> wrote: > > > > >> [ 13.875282] Hardware name: HP HP Z4 G5 Workstation Desktop PC/8962, > > > > BIOS U61 Ver. 01.01.15 04/19/2023 > > > > > > >> When a DIMM slot is empty, the read value of mtr can be 0xffffffff, therefore > > > > > Looked like a buggy BIOS/hw that didn't set the mtr register. > > > > > > 1. Did you print the mtr register whose value was 0xffffffff? > > > 2. Can you take a dmesg log with kernel "CONFIG_EDAC_DEBUG=y" enabled? > > > 3. What was the CPU? Please take the output of "lscpu". > > > 4. Did you verify your patch that the issue was fixed on your systems? > > > > I wonder if BIOS is "hiding" some devices from the OS? The 0xffffffff return is > > the standard PCI response for reading a non-existent register. But that doesn't > > quite make sense with having a "dimm present" bit in the MTR register. If > > the register only exists if the DIMM is present, then there is no need for > > a "dimm present" bit. > > I wonder if the "non-existent register" read is intended? > > > > > Some "lspci" output may also be useful. > > lspci can be found in [1]: > > [1] https://bugzilla.kernel.org/show_bug.cgi?id=217453 A gentle ping... > > Kai-Heng > > > > > -Tony
On Wed, Jun 14, 2023 at 3:58 PM Kai-Heng Feng <kai.heng.feng@canonical.com> wrote: > > On Wed, May 17, 2023 at 3:49 PM Kai-Heng Feng > <kai.heng.feng@canonical.com> wrote: > > > > On Wed, May 17, 2023 at 1:13 AM Luck, Tony <tony.luck@intel.com> wrote: > > > > > > >> [ 13.875282] Hardware name: HP HP Z4 G5 Workstation Desktop PC/8962, > > > > > BIOS U61 Ver. 01.01.15 04/19/2023 > > > > > > > > > >> When a DIMM slot is empty, the read value of mtr can be 0xffffffff, therefore > > > > > > > Looked like a buggy BIOS/hw that didn't set the mtr register. > > > > > > > > 1. Did you print the mtr register whose value was 0xffffffff? > > > > 2. Can you take a dmesg log with kernel "CONFIG_EDAC_DEBUG=y" enabled? > > > > 3. What was the CPU? Please take the output of "lscpu". > > > > 4. Did you verify your patch that the issue was fixed on your systems? > > > > > > I wonder if BIOS is "hiding" some devices from the OS? The 0xffffffff return is > > > the standard PCI response for reading a non-existent register. But that doesn't > > > quite make sense with having a "dimm present" bit in the MTR register. If > > > the register only exists if the DIMM is present, then there is no need for > > > a "dimm present" bit. > > > > I wonder if the "non-existent register" read is intended? > > > > > > > > Some "lspci" output may also be useful. > > > > lspci can be found in [1]: > > > > [1] https://bugzilla.kernel.org/show_bug.cgi?id=217453 > > A gentle ping... Another gentle ping... > > > > > Kai-Heng > > > > > > > > -Tony
> From: Kai-Heng Feng <kai.heng.feng@canonical.com> > ... > > > > Some "lspci" output may also be useful. > > > > > > lspci can be found in [1]: > > > > > > [1] https://bugzilla.kernel.org/show_bug.cgi?id=217453 > > > > A gentle ping... > > Another gentle ping... The same issue was also reported recently by Koba Ko: https://lore.kernel.org/linux-edac/CAJB-X+XtfBm0a4btt6NT9rvdrxETNLNMVQ3G=u513Nh8RKwjWw@mail.gmail.com/T/#t The root cause of this issue was that the absent memory controllers still appeared as PCIe devices that fooled the i10nm_edac driver. Could you please verify the new fix patch below on your machine and feedback on the testing results? Thanks! https://lore.kernel.org/linux-edac/20230706134216.37044-1-qiuxu.zhuo@intel.com/T/#u - Qiuxu
diff --git a/drivers/edac/sb_edac.c b/drivers/edac/sb_edac.c index 0c779a0326b6..bc5155e84514 100644 --- a/drivers/edac/sb_edac.c +++ b/drivers/edac/sb_edac.c @@ -211,7 +211,7 @@ static const int mtr_regs[] = { static const int knl_mtr_reg = 0xb60; #define RANK_DISABLE(mtr) GET_BITFIELD(mtr, 16, 19) -#define IS_DIMM_PRESENT(mtr) GET_BITFIELD(mtr, 14, 14) +#define IS_DIMM_PRESENT(mtr) ((mtr != ~0) && GET_BITFIELD(mtr, 14, 14)) #define RANK_CNT_BITS(mtr) GET_BITFIELD(mtr, 12, 13) #define RANK_WIDTH_BITS(mtr) GET_BITFIELD(mtr, 2, 4) #define COL_WIDTH_BITS(mtr) GET_BITFIELD(mtr, 0, 1) diff --git a/drivers/edac/skx_common.h b/drivers/edac/skx_common.h index b6d3607dffe2..2f975ffeaac9 100644 --- a/drivers/edac/skx_common.h +++ b/drivers/edac/skx_common.h @@ -50,8 +50,8 @@ #define NUM_CHANNELS MAX(SKX_NUM_CHANNELS, I10NM_NUM_CHANNELS) #define NUM_DIMMS MAX(SKX_NUM_DIMMS, I10NM_NUM_DIMMS) -#define IS_DIMM_PRESENT(r) GET_BITFIELD(r, 15, 15) -#define IS_NVDIMM_PRESENT(r, i) GET_BITFIELD(r, i, i) +#define IS_DIMM_PRESENT(r) ((r != ~0) && GET_BITFIELD(r, 15, 15)) +#define IS_NVDIMM_PRESENT(r, i) ((r != ~0) && GET_BITFIELD(r, i, i)) #define MCI_MISC_ECC_MODE(m) (((m) >> 59) & 15) #define MCI_MISC_ECC_DDRT 8 /* read from DDRT */
The following splat can be found on many systems equipped with EDAC: [ 13.875276] UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16 [ 13.875279] shift exponent -66 is negative [ 13.875280] CPU: 11 PID: 519 Comm: systemd-udevd Not tainted 6.4.0-rc1+ #1 [ 13.875282] Hardware name: HP HP Z4 G5 Workstation Desktop PC/8962, BIOS U61 Ver. 01.01.15 04/19/2023 [ 13.875283] Call Trace: [ 13.875285] <TASK> [ 13.875287] dump_stack_lvl+0x48/0x70 [ 13.875295] dump_stack+0x10/0x20 [ 13.875297] __ubsan_handle_shift_out_of_bounds+0x156/0x310 [ 13.875302] ? __kmem_cache_alloc_node+0x196/0x300 [ 13.875307] skx_get_dimm_info.cold+0xac/0x15d [i10nm_edac] [ 13.875312] i10nm_get_dimm_config+0x240/0x360 [i10nm_edac] [ 13.875316] ? kasprintf+0x4e/0x80 [ 13.875321] skx_register_mci+0x12b/0x1d0 [i10nm_edac] [ 13.875324] ? __pfx_i10nm_get_dimm_config+0x10/0x10 [i10nm_edac] [ 13.875329] i10nm_init+0x89f/0x1d10 [i10nm_edac] [ 13.875333] ? __pfx_i10nm_init+0x10/0x10 [i10nm_edac] [ 13.875337] do_one_initcall+0x46/0x240 [ 13.875342] ? kmalloc_trace+0x2a/0xb0 [ 13.875346] do_init_module+0x6a/0x280 [ 13.875350] load_module+0x2419/0x2500 [ 13.875353] ? security_kernel_post_read_file+0x5c/0x80 [ 13.875358] __do_sys_finit_module+0xcc/0x150 [ 13.875360] ? __do_sys_finit_module+0xcc/0x150 [ 13.875363] __x64_sys_finit_module+0x18/0x30 [ 13.875365] do_syscall_64+0x59/0x90 [ 13.875368] ? syscall_exit_to_user_mode+0x2a/0x50 [ 13.875371] ? do_syscall_64+0x69/0x90 [ 13.875372] ? do_syscall_64+0x69/0x90 [ 13.875373] ? do_syscall_64+0x69/0x90 [ 13.875374] ? do_syscall_64+0x69/0x90 [ 13.875375] ? syscall_exit_to_user_mode+0x2a/0x50 [ 13.875376] ? do_syscall_64+0x69/0x90 [ 13.875377] ? do_syscall_64+0x69/0x90 [ 13.875378] ? do_syscall_64+0x69/0x90 [ 13.875379] ? sysvec_call_function+0x4e/0xb0 [ 13.875381] entry_SYSCALL_64_after_hwframe+0x72/0xdc When a DIMM slot is empty, the read value of mtr can be 0xffffffff, therefore the wrong "ranks" value creates shift-out-of-bounds error. The same issue can be found on NVDIMM too. So only consider DIMM/NVDIMM is present when the value of mtr/mcddrtcfg is not ~0. Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> --- drivers/edac/sb_edac.c | 2 +- drivers/edac/skx_common.h | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-)