Message ID | 20230703162509.77828-1-koba.ko@canonical.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [V2] EDAC/i10nm: shift exponent is negative | expand |
> > UBSAN complains this error > > ~~~ > > UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16 > … > > ~~~ > > > > when get rows, cols and ranks, the returned error value doesn't be > > handled. > > > > check the return value is EINVAL, if yes, directly return 0. > … > > * Please improve this change description further. To be specific. Initially you reported this because of the UBSAN error report. But, after community discussion you now know that the problem is that one or more of the rows/cols/ranks has a value that the EDAC driver doesn't expect and probably can handle. So, in V2, the commit message should start with the information these values are out of range and mention this was discovered when UBSAN put out a warning about a negative shift. No need to include the whole of the UBSAN stack trace. Then describe the two fixes that this patch includes. One is to change the edac debug message into a console error message to enable further debug of this issue. The other is to skip the unrecognized DIMM. > * How do you think about to add the tag “Fixes”? This is a good idea. Use git blame, or dig into the GIT history to find the commit where this code was introduced (hint .. git blame says: 88a242c98740 ("EDAC, skx_common: Separate common code out from skx_edac") but that obviously just refactored code, so you should dig back more into the history. > > V2: make error-print explicitly > > --- > > Would you like to avoid a misplaced marker line here? > > See also: > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.4#n686 That's an excellent resource. -Tony
On Tue, Jul 4, 2023 at 5:51 AM Luck, Tony <tony.luck@intel.com> wrote: > > > > UBSAN complains this error > > > ~~~ > > > UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16 > > … > > > ~~~ > > > > > > when get rows, cols and ranks, the returned error value doesn't be > > > handled. > > > > > > check the return value is EINVAL, if yes, directly return 0. > > … > > > > * Please improve this change description further. > > To be specific. Initially you reported this because of the UBSAN error > report. But, after community discussion you now know that the problem > is that one or more of the rows/cols/ranks has a value that the EDAC driver > doesn't expect and probably can handle. > > So, in V2, the commit message should start with the information these > values are out of range and mention this was discovered when UBSAN > put out a warning about a negative shift. No need to include the whole > of the UBSAN stack trace. > > Then describe the two fixes that this patch includes. One is to change the > edac debug message into a console error message to enable further > debug of this issue. The other is to skip the unrecognized DIMM. > > > * How do you think about to add the tag “Fixes”? > > This is a good idea. Use git blame, or dig into the GIT history to > find the commit where this code was introduced (hint .. git blame > says: > 88a242c98740 ("EDAC, skx_common: Separate common code out from skx_edac") > but that obviously just refactored code, so you should dig back more into > the history. > > > > V2: make error-print explicitly > > > --- > > > > Would you like to avoid a misplaced marker line here? > > > > See also: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.4#n686 > > That's an excellent resource. Thanks for your advices and I will modify. here's part of dmesg enabled EDAC_DEBUG ~~~ [ 4.032332] EDAC DEBUG: skx_register_mci: MC#1: mci = 00000000799db99e [ 4.032334] EDAC DEBUG: i10nm_get_dimm_config: dimmmtr 0xffffffff mcddrtcfg 0xffffffff (mc1 ch0 dimm0) [ 4.032335] EDAC DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff) [ 4.032337] EDAC DEBUG: skx_get_dimm_attr: bad rows = 7 (raw=0xffffffff) [ 4.032338] EDAC DEBUG: skx_get_dimm_attr: bad cols = 3 (raw=0xffffffff) [ 4.032339] EDAC DEBUG: i10nm_get_dimm_config: dimmmtr 0xffffffff mcddrtcfg 0xffffffff (mc1 ch0 dimm1) [ 4.032340] EDAC DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff) [ 4.032341] EDAC DEBUG: skx_get_dimm_attr: bad rows = 7 (raw=0xffffffff) [ 4.032341] EDAC DEBUG: skx_get_dimm_attr: bad cols = 3 (raw=0xffffffff) [ 4.032343] EDAC DEBUG: i10nm_get_dimm_config: dimmmtr 0xffffffff mcddrtcfg 0xffffffff (mc1 ch1 dimm0) [ 4.032344] EDAC DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff) [ 4.032345] EDAC DEBUG: skx_get_dimm_attr: bad rows = 7 (raw=0xffffffff) [ 4.032346] EDAC DEBUG: skx_get_dimm_attr: bad cols = 3 (raw=0xffffffff) [ 4.032347] EDAC DEBUG: i10nm_get_dimm_config: dimmmtr 0xffffffff mcddrtcfg 0xffffffff (mc1 ch1 dimm1) [ 4.032348] EDAC DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff) [ 4.032349] EDAC DEBUG: skx_get_dimm_attr: bad rows = 7 (raw=0xffffffff) [ 4.032349] EDAC DEBUG: skx_get_dimm_attr: bad cols = 3 (raw=0xffffffff) ~~~ I shared the whole dmesg through g-drive. https://drive.google.com/file/d/1epnDZNezGiJsK1eT4UNOi8_igcBSXtiF/view?usp=sharing > > -Tony
On Tue, Jul 4, 2023 at 5:51 AM Luck, Tony <tony.luck@intel.com> wrote: > > > > UBSAN complains this error > > > ~~~ > > > UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16 > > … > > > ~~~ > > > > > > when get rows, cols and ranks, the returned error value doesn't be > > > handled. > > > > > > check the return value is EINVAL, if yes, directly return 0. > > … > > > > * Please improve this change description further. > > To be specific. Initially you reported this because of the UBSAN error > report. But, after community discussion you now know that the problem > is that one or more of the rows/cols/ranks has a value that the EDAC driver > doesn't expect and probably can handle. > > So, in V2, the commit message should start with the information these > values are out of range and mention this was discovered when UBSAN > put out a warning about a negative shift. No need to include the whole > of the UBSAN stack trace. > > Then describe the two fixes that this patch includes. One is to change the > edac debug message into a console error message to enable further > debug of this issue. The other is to skip the unrecognized DIMM. > > > * How do you think about to add the tag “Fixes”? > > This is a good idea. Use git blame, or dig into the GIT history to > find the commit where this code was introduced (hint .. git blame > says: > 88a242c98740 ("EDAC, skx_common: Separate common code out from skx_edac") > but that obviously just refactored code, so you should dig back more into > the history. There are two parts, 1. @get_dimm_attr, edac_dbg was added since e235dd43d8b0f0 2. get num of ranks, rows and cols, 4ec656bdf43a13 Should I add all of them prefixes with "Fixes"? > > > > V2: make error-print explicitly > > > --- > > > > Would you like to avoid a misplaced marker line here? > > > > See also: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.4#n686 > > That's an excellent resource. > > -Tony
Interesting. This isn’t some odd shaped DIMM. You are just getting a failed read from the register. So trying to decode all-ones. I’ll try to look at the full log you linked when I’m on a computer instead of a phone. > [ 4.032335] EDAC DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff) > [ 4.032337] EDAC DEBUG: skx_get_dimm_attr: bad rows = 7 (raw=0xffffffff) > [ 4.032338] EDAC DEBUG: skx_get_dimm_attr: bad cols = 3 (raw=0xffffffff)
As per suggestions, i modified V3. could you please take a look Subject: [PATCH][V3] EDAC/i10nm: shift exponent is negative Because failed to read from DIMM, get the negative value for shift operation. `EDAC DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff) EDAC DEBUG: skx_get_dimm_attr: bad rows = 7 (raw=0xffffffff) EDAC DEBUG: skx_get_dimm_attr: bad cols = 3 (raw=0xffffffff) EDAC DEBUG: i10nm_get_dimm_config: dimmmtr 0xffffffff mcddrtcfg 0xffffffff (mc1 ch0 dimm1)` UBSAN complains this error `UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16 shift exponent -66 is negative` when get rows, cols and ranks, the returned error value doesn't be handled. check the return value is EINVAL, if yes, directly return 0 and show error message explicitly. Fixes: 4ec656bdf43a13) EDAC, skx_edac: Add EDAC driver for Skylake Signed-off-by: Koba Ko <koba.ko@canonical.com> --- V2 -> V3: simplify the summary and add 'Fixes:' V1 -> V2: make error-print explicitly On Tue, Jul 4, 2023 at 10:20 AM Koba Ko <koba.ko@canonical.com> wrote: > > On Tue, Jul 4, 2023 at 5:51 AM Luck, Tony <tony.luck@intel.com> wrote: > > > > > > UBSAN complains this error > > > > ~~~ > > > > UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16 > > > … > > > > ~~~ > > > > > > > > when get rows, cols and ranks, the returned error value doesn't be > > > > handled. > > > > > > > > check the return value is EINVAL, if yes, directly return 0. > > > … > > > > > > * Please improve this change description further. > > > > To be specific. Initially you reported this because of the UBSAN error > > report. But, after community discussion you now know that the problem > > is that one or more of the rows/cols/ranks has a value that the EDAC driver > > doesn't expect and probably can handle. > > > > So, in V2, the commit message should start with the information these > > values are out of range and mention this was discovered when UBSAN > > put out a warning about a negative shift. No need to include the whole > > of the UBSAN stack trace. > > > > Then describe the two fixes that this patch includes. One is to change the > > edac debug message into a console error message to enable further > > debug of this issue. The other is to skip the unrecognized DIMM. > > > > > * How do you think about to add the tag “Fixes”? > > > > This is a good idea. Use git blame, or dig into the GIT history to > > find the commit where this code was introduced (hint .. git blame > > says: > > 88a242c98740 ("EDAC, skx_common: Separate common code out from skx_edac") > > but that obviously just refactored code, so you should dig back more into > > the history. > There are two parts, > 1. @get_dimm_attr, edac_dbg was added since e235dd43d8b0f0 > 2. get num of ranks, rows and cols, 4ec656bdf43a13 > > Should I add all of them prefixes with "Fixes"? > > > > > > > V2: make error-print explicitly > > > > --- > > > > > > Would you like to avoid a misplaced marker line here? > > > > > > See also: > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.4#n686 > > > > That's an excellent resource. > > > > -Tony
On Tue, Jul 4, 2023 at 3:16 PM Markus Elfring <Markus.Elfring@web.de> wrote: > > > As per suggestions, i modified V3. > > could you please take a look > > > > Subject: [PATCH][V3] EDAC/i10nm: shift exponent is negative > > Would you like to put the text “[PATCH v4] EDAC/i10nm: Fix an inappropriate shift exponent” > into a subsequent patch? I didn't send V3 so the suggestions could be put in V3. > > I would find further wording variants nicer. > > > > Because failed to read from DIMM, get the negative value for shift > > operation. > > A surprising value was determined after a read failure from a DIMM. > > > … > > UBSAN complains this error > > Software analyses pointed a data processing issue out. > > > > `UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16 > > shift exponent -66 is negative` > > > > when get rows, cols and ranks, the returned error value doesn't be > > handled. > > A special value combination could not be handled so far. > > > > check the return value is EINVAL, if yes, directly return 0 and > > show error message explicitly. > > Check if an invalid value was detected by a call of the function “skx_get_dimm_attr”. > > * Print a corresponding error message in this case. > > * Return zero then directly from the function “skx_get_dimm_info”. > > > … > @@ -351,6 +351,8 @@ int skx_get_dimm_info(u32 mtr, u32 mcmtr, u32 amap, struct dimm_info *dimm, > ranks = numrank(mtr); > rows = numrow(mtr); > cols = imc->hbm_mc ? 6 : numcol(mtr); > + if (ranks == -EINVAL || rows == -EINVAL || cols == -EINVAL) > + return 0; > … > > > Can it be nicer to perform a check for such an error code directly > after each variable assignment? > (May this condition check be split?) > > > > Fixes: 4ec656bdf43a13) EDAC, skx_edac: Add EDAC driver for Skylake > > Please properly apply parentheses and double quotes for this tag. > > See also: > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.4#n145 > > > > V2 -> V3: simplify the summary and add 'Fixes:' > > V1 -> V2: make error-print explicitly > > How do you think about to use more succinct version identifiers > for such descriptions? > > V4: > … > > V3: > … > > > Regards, > Markus
@Markus and all, please review it, thanks Subject: [PATCH][V3] EDAC/i10nm: Fix an inappropriate shift exponen A surprising value was determined after a read failure from a DIMM. Software analyses pointed a data processing issue out. `UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16 shift exponent -66 is negative` when get rows, cols and ranks, A special value combination could not be handled so far. Check if an invalid value was detected by a call of the function “skx_get_dimm_attr”. * Print a corresponding error message in this case. * Return zero then directly from the function “skx_get_dimm_info”. Fixes: 4ec656bdf43a13 (EDAC, skx_edac: Add EDAC driver for Skylake) Signed-off-by: Koba Ko <koba.ko@canonical.com> --- V3: simplify and polish the summary and add `Fixes:` V2: make error-print explicitly Thanks Koba Ko On Tue, Jul 4, 2023 at 3:16 PM Markus Elfring <Markus.Elfring@web.de> wrote: > > > As per suggestions, i modified V3. > > could you please take a look > > > > Subject: [PATCH][V3] EDAC/i10nm: shift exponent is negative > > Would you like to put the text “[PATCH v4] EDAC/i10nm: Fix an inappropriate shift exponent” > into a subsequent patch? > > I would find further wording variants nicer. > > > > Because failed to read from DIMM, get the negative value for shift > > operation. > > A surprising value was determined after a read failure from a DIMM. > > > … > > UBSAN complains this error > > Software analyses pointed a data processing issue out. > > > > `UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16 > > shift exponent -66 is negative` > > > > when get rows, cols and ranks, the returned error value doesn't be > > handled. > > A special value combination could not be handled so far. > > > > check the return value is EINVAL, if yes, directly return 0 and > > show error message explicitly. > > Check if an invalid value was detected by a call of the function “skx_get_dimm_attr”. > > * Print a corresponding error message in this case. > > * Return zero then directly from the function “skx_get_dimm_info”. > > > … > @@ -351,6 +351,8 @@ int skx_get_dimm_info(u32 mtr, u32 mcmtr, u32 amap, struct dimm_info *dimm, > ranks = numrank(mtr); > rows = numrow(mtr); > cols = imc->hbm_mc ? 6 : numcol(mtr); > + if (ranks == -EINVAL || rows == -EINVAL || cols == -EINVAL) > + return 0; > … > > > Can it be nicer to perform a check for such an error code directly > after each variable assignment? > (May this condition check be split?) > > > > Fixes: 4ec656bdf43a13) EDAC, skx_edac: Add EDAC driver for Skylake > > Please properly apply parentheses and double quotes for this tag. > > See also: > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.4#n145 > > > > V2 -> V3: simplify the summary and add 'Fixes:' > > V1 -> V2: make error-print explicitly > > How do you think about to use more succinct version identifiers > for such descriptions? > > V4: > … > > V3: > … > > > Regards, > Markus
On Tue, Jul 4, 2023 at 4:50 PM Markus Elfring <Markus.Elfring@web.de> wrote: > > > @Markus and all, > > please review it, thanks > > > > Subject: [PATCH][V3] EDAC/i10nm: Fix an inappropriate shift exponen > > * Combined prefixes would be nicer. Do you mean, [PATCH V3]? > * Please avoid a typo here. => exponent > > > when get rows, cols and ranks, A special value combination could not > > be handled so far. > > It seems that you stumble still on wording difficulties. remove the previous sentence, A special value combination could not be handled so far. > > > Fixes: 4ec656bdf43a13 (EDAC, skx_edac: Add EDAC driver for Skylake) > > You overlooked somehow to use double quotes here. must be Fixes: 4ec656bdf43a13 ("EDAC, skx_edac: Add EDAC driver for Skylake") > Please provide the next version of a proper patch. > > Regards, > Markus
Here is a better commit message. You can just copy and paste it. ------------------------------------------ [PATCH v3] EDAC/i10nm: Prevent negative shifts in skx_get_dimm_info(). UBSAN generated the following warning during a timeout: UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16 shift exponent -66 is negative That most likely means that rows, cols, and ranks were all set to -EINVAL. Address this in two ways. 1) Change the debug output in skx_get_dimm_attr() to KERN_ERR so that users will know where exactly the error is. 2) Add a check for errors in skx_get_dimm_info(). Fixes: 88a242c98740 ("EDAC, skx_common: Separate common code out from skx_edac") Signed-off-by: ----------------------------------------------- > @@ -351,6 +351,8 @@ int skx_get_dimm_info(u32 mtr, u32 mcmtr, u32 amap, struct dimm_info *dimm, > ranks = numrank(mtr); > rows = numrow(mtr); > cols = imc->hbm_mc ? 6 : numcol(mtr); > + if (ranks == -EINVAL || rows == -EINVAL || cols == -EINVAL) > + return 0; Change this to: if (rangks < 0 || rows < 0 || cols < 0) return 0; It's bad form to check for a specific error code unless there is a need. regards, dan carpenter
On Tue, Jul 4, 2023 at 8:02 PM Dan Carpenter <dan.carpenter@linaro.org> wrote: > > Here is a better commit message. You can just copy and paste it. > ------------------------------------------ > [PATCH v3] EDAC/i10nm: Prevent negative shifts in skx_get_dimm_info(). > > UBSAN generated the following warning during a timeout: > > UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16 > shift exponent -66 is negative > > That most likely means that rows, cols, and ranks were all set to > -EINVAL. Address this in two ways. > > 1) Change the debug output in skx_get_dimm_attr() to KERN_ERR so that > users will know where exactly the error is. > 2) Add a check for errors in skx_get_dimm_info(). > > Fixes: 88a242c98740 ("EDAC, skx_common: Separate common code out from skx_edac") > Signed-off-by: > ----------------------------------------------- have sent V3 as per Markus' comments. https://patchwork.kernel.org/project/linux-edac/patch/20230704095939.119620-1-koba.ko@canonical.com/ Thanks > > > @@ -351,6 +351,8 @@ int skx_get_dimm_info(u32 mtr, u32 mcmtr, u32 amap, struct dimm_info *dimm, > > ranks = numrank(mtr); > > rows = numrow(mtr); > > cols = imc->hbm_mc ? 6 : numcol(mtr); > > + if (ranks == -EINVAL || rows == -EINVAL || cols == -EINVAL) > > + return 0; > > Change this to: > > if (rangks < 0 || rows < 0 || cols < 0) > return 0; > > It's bad form to check for a specific error code unless there is a need. > > regards, > dan carpenter >
> I shared the whole dmesg through g-drive. > https://drive.google.com/file/d/1epnDZNezGiJsK1eT4UNOi8_igcBSXtiF/view?usp=sharing The EDAC driver was OK for MC#0, but then failed for the other memory controllers. Can you send some more information about your system. Output from the following commands: # head /proc/cpuinfo # dmidecode -t 17 # lspci Thanks -Tony
On Wed, Jul 5, 2023 at 12:08 PM Luck, Tony <tony.luck@intel.com> wrote: > > > I shared the whole dmesg through g-drive. > > https://drive.google.com/file/d/1epnDZNezGiJsK1eT4UNOi8_igcBSXtiF/view?usp=sharing > > The EDAC driver was OK for MC#0, but then failed for the other memory controllers. > > Can you send some more information about your system. Output from the > following commands: > > # head /proc/cpuinfo > > # dmidecode -t 17 > > # lspci please check through this url, https://drive.google.com/drive/folders/199k3BX6IipNYCDfuMGy8W26ZtYRDIYZr?usp=sharing > > Thanks > > -Tony
>> # head /proc/cpuinfo This shows your system is the workstation version of Sapphire rapids. I don't think we did any validation of the EDAC driver against this model. > # dmidecode -t 17 You have just one 16GB DIMM, and EDAC found that. So despite the messy warnings, EDAC should be working for you. > # lspci I didn't dig into this. Qiuxu - can you compare this against a server Sapphire rapids? Maybe it has some clues so the EDAC driver will know not to look for non-existent memory controllers. > please check through this url, > https://drive.google.com/drive/folders/199k3BX6IipNYCDfuMGy8W26ZtYRDIYZr?usp=sharing Thanks for all the details. -Tony
> From: Luck, Tony <tony.luck@intel.com> > Sent: Wednesday, July 5, 2023 11:22 PM > ... > Subject: RE: [PATCH v2] EDAC/i10nm: shift exponent is negative > > >> # head /proc/cpuinfo > > This shows your system is the workstation version of Sapphire rapids. I don't > think we did any validation of the EDAC driver against this model. No, we didn't do any validation of the EDAC on Sapphires Rapids workstations. From the link below, we know this is a Sapphire Rapids workstation with only 2 memory controllers. https://www.intel.com/content/www/us/en/products/sku/233480/intel-xeon-w32435-processor-22-5m-cache-3-10-ghz/specifications.html We only did validation on the Sapphire Rapids servers which were with 4 memory controllers per socket before. > > # dmidecode -t 17 > > You have just one 16GB DIMM, and EDAC found that. So despite the messy > warnings, EDAC should be working for you. > > > # lspci > > I didn't dig into this. Qiuxu - can you compare this against a server Sapphire > rapids? > Maybe it has some clues so the EDAC driver will know not to look for non- > existent memory controllers. This Sapphire Rapids workstation had 2 memory controllers but appeared 4 memory controller PCIe devices from the log: 0000:fe:0c.0 1101: 8086:324a 0000:fe:0d.0 1101: 8086:324a // absent mc fooling the driver, should not appear 0000:fe:0e.0 1101: 8086:324a 0000:fe:0f.0 1101: 8086:324a // absent mc fooling the driver, should not appear By observing that the MMIO registers of these absent memory controllers consistently hold the value of ~0. We may identify a memory controller as absent by checking if its MMIO register "mcmtr" == ~0 in all its channels. I made a patch below to skip all these absent memory controllers https://lore.kernel.org/linux-edac/20230706134216.37044-1-qiuxu.zhuo@intel.com/T/#u @Koba Ko, could you please verify the patch from the link above on your workstation? Thanks! BTW, Kai-Heng Feng also found the same issue before: https://lore.kernel.org/linux-edac/CAAd53p41Ku1m1rapeqb1xtD+kKuk+BaUW=dumuoF0ZO3GhFjFA@mail.gmail.com/T/#m5de16dce60a8c836ec235868c7c16e3fefad0cc2 - Qiuxu
On Thu, Jul 6, 2023 at 10:12 PM Zhuo, Qiuxu <qiuxu.zhuo@intel.com> wrote: > > > From: Luck, Tony <tony.luck@intel.com> > > Sent: Wednesday, July 5, 2023 11:22 PM > > ... > > Subject: RE: [PATCH v2] EDAC/i10nm: shift exponent is negative > > > > >> # head /proc/cpuinfo > > > > This shows your system is the workstation version of Sapphire rapids. I don't > > think we did any validation of the EDAC driver against this model. > > No, we didn't do any validation of the EDAC on Sapphires Rapids workstations. > From the link below, we know this is a Sapphire Rapids workstation with only 2 memory controllers. > https://www.intel.com/content/www/us/en/products/sku/233480/intel-xeon-w32435-processor-22-5m-cache-3-10-ghz/specifications.html > > We only did validation on the Sapphire Rapids servers which were with 4 memory controllers per socket before. > > > > # dmidecode -t 17 > > > > You have just one 16GB DIMM, and EDAC found that. So despite the messy > > warnings, EDAC should be working for you. > > > > > # lspci > > > > I didn't dig into this. Qiuxu - can you compare this against a server Sapphire > > rapids? > > Maybe it has some clues so the EDAC driver will know not to look for non- > > existent memory controllers. > > This Sapphire Rapids workstation had 2 memory controllers but appeared > 4 memory controller PCIe devices from the log: > > 0000:fe:0c.0 1101: 8086:324a > 0000:fe:0d.0 1101: 8086:324a // absent mc fooling the driver, should not appear > 0000:fe:0e.0 1101: 8086:324a > 0000:fe:0f.0 1101: 8086:324a // absent mc fooling the driver, should not appear > > By observing that the MMIO registers of these absent > memory controllers consistently hold the value of ~0. > We may identify a memory controller as absent by checking > if its MMIO register "mcmtr" == ~0 in all its channels. > > I made a patch below to skip all these absent memory controllers > https://lore.kernel.org/linux-edac/20230706134216.37044-1-qiuxu.zhuo@intel.com/T/#u > @Koba Ko, could you please verify the patch from the link above on your workstation? Thanks! Here's dmesg patched(Ref. 1). didn't find the previous message, `EDAC DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff)` Ref. 1, https://drive.google.com/drive/folders/1xym9JgZZgaJ3EqtP4ccRcVeQYoJKmVlp?usp=sharing > > BTW, > Kai-Heng Feng also found the same issue before: > https://lore.kernel.org/linux-edac/CAAd53p41Ku1m1rapeqb1xtD+kKuk+BaUW=dumuoF0ZO3GhFjFA@mail.gmail.com/T/#m5de16dce60a8c836ec235868c7c16e3fefad0cc2 > > - Qiuxu
> From: Koba Ko <koba.ko@canonical.com> > Sent: Friday, July 7, 2023 1:41 AM > To: Zhuo, Qiuxu <qiuxu.zhuo@intel.com> > ... > > I made a patch below to skip all these absent memory controllers > > https://lore.kernel.org/linux-edac/20230706134216.37044-1-qiuxu.zhuo@i > > ntel.com/T/#u @Koba Ko, could you please verify the patch from the > > link above on your workstation? Thanks! > > Here's dmesg patched(Ref. 1). didn't find the previous message, `EDAC > DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff)` > > Ref. 1, > https://drive.google.com/drive/folders/1xym9JgZZgaJ3EqtP4ccRcVeQYoJKmV > lp?usp=sharing Thanks for the verification. Your log showed that the patch worked well. -Qiuxu
On Fri, Jul 7, 2023 at 11:34 AM Zhuo, Qiuxu <qiuxu.zhuo@intel.com> wrote: > > > From: Koba Ko <koba.ko@canonical.com> > > Sent: Friday, July 7, 2023 1:41 AM > > To: Zhuo, Qiuxu <qiuxu.zhuo@intel.com> > > ... > > > I made a patch below to skip all these absent memory controllers > > > https://lore.kernel.org/linux-edac/20230706134216.37044-1-qiuxu.zhuo@i > > > ntel.com/T/#u @Koba Ko, could you please verify the patch from the > > > link above on your workstation? Thanks! > > > > Here's dmesg patched(Ref. 1). didn't find the previous message, `EDAC > > DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff)` > > > > Ref. 1, > > https://drive.google.com/drive/folders/1xym9JgZZgaJ3EqtP4ccRcVeQYoJKmV > > lp?usp=sharing > > Thanks for the verification. > Your log showed that the patch worked well. please, tested-by: koba.ko@canonical.com > > -Qiuxu
> From: Koba Ko <koba.ko@canonical.com> > ... > > > Ref. 1, > > > https://drive.google.com/drive/folders/1xym9JgZZgaJ3EqtP4ccRcVeQYoJK > > > mV > > > lp?usp=sharing > > > > Thanks for the verification. > > Your log showed that the patch worked well. > > please, > tested-by: koba.ko@canonical.com Sure.
On Sun, Jul 09, 2023 at 11:42:22PM +0800, Koba Ko wrote:
> tested-by: koba.ko@canonical.com
It's better if you put the tag in the correct format. There are a bunch
of tools which automatically add tags, but they only work if the tag is
in the correct format.
regards,
dan carpenter
> From: Dan Carpenter <dan.carpenter@linaro.org> > ... > Subject: Re: [PATCH v2] EDAC/i10nm: shift exponent is negative > > On Sun, Jul 09, 2023 at 11:42:22PM +0800, Koba Ko wrote: > > tested-by: koba.ko@canonical.com > > It's better if you put the tag in the correct format. There are a bunch of tools > which automatically add tags, but they only work if the tag is in the correct > format. I corrected Koba Ko's "Tested-by" tag and added it to my v2: https://lore.kernel.org/linux-edac/20230710013232.59712-1-qiuxu.zhuo@intel.com/ -Qiuxu
diff --git a/drivers/edac/skx_common.c b/drivers/edac/skx_common.c index 2a00e0503f0d5..ac61db72d2e6b 100644 --- a/drivers/edac/skx_common.c +++ b/drivers/edac/skx_common.c @@ -330,7 +330,7 @@ static int skx_get_dimm_attr(u32 reg, int lobit, int hibit, int add, u32 val = GET_BITFIELD(reg, lobit, hibit); if (val < minval || val > maxval) { - edac_dbg(2, "bad %s = %d (raw=0x%x)\n", name, val, reg); + skx_printk(KERN_ERR, "bad %s = %d (raw=0x%x)\n", name, val, reg); return -EINVAL; } return val + add; @@ -351,6 +351,8 @@ int skx_get_dimm_info(u32 mtr, u32 mcmtr, u32 amap, struct dimm_info *dimm, ranks = numrank(mtr); rows = numrow(mtr); cols = imc->hbm_mc ? 6 : numcol(mtr); + if (ranks == -EINVAL || rows == -EINVAL || cols == -EINVAL) + return 0; if (imc->hbm_mc) { banks = 32;
UBSAN complains this error ~~~ UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16 shift exponent -66 is negative Call Trace: <TASK> dump_stack_lvl+0x48/0x70 dump_stack+0x10/0x20 __ubsan_handle_shift_out_of_bounds+0x1ac/0x360 skx_get_dimm_info.cold+0x91/0x175 [i10nm_edac] ? kvasprintf_const+0x2a/0xb0 i10nm_get_dimm_config+0x23c/0x340 [i10nm_edac] skx_register_mci+0x139/0x1e0 [i10nm_edac] ? __pfx_i10nm_get_dimm_config+0x10/0x10 [i10nm_edac] i10nm_init+0x403/0xd10 [i10nm_edac] ? __pfx_i10nm_init+0x10/0x10 [i10nm_edac] do_one_initcall+0x5b/0x250 do_init_module+0x68/0x260 load_module+0xb45/0xcd0 ? kernel_read_file+0x2a4/0x320 __do_sys_finit_module+0xc4/0x140 ? __do_sys_finit_module+0xc4/0x140 __x64_sys_finit_module+0x18/0x30 do_syscall_64+0x58/0x90 ? syscall_exit_to_user_mode+0x29/0x50 ? do_syscall_64+0x67/0x90 ? syscall_exit_to_user_mode+0x29/0x50 ? do_syscall_64+0x67/0x90 ? do_syscall_64+0x67/0x90 ? __flush_smp_call_function_queue+0x122/0x1f0 ? exit_to_user_mode_prepare+0x30/0xb0 ? irqentry_exit_to_user_mode+0x9/0x20 ? irqentry_exit+0x43/0x50 ? sysvec_call_function+0x4b/0xd0 entry_SYSCALL_64_after_hwframe+0x72/0xdc ~~~ when get rows, cols and ranks, the returned error value doesn't be handled. check the return value is EINVAL, if yes, directly return 0. Signed-off-by: Koba Ko <koba.ko@canonical.com> --- V2: make error-print explicitly --- drivers/edac/skx_common.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)