Message ID | 20191022203448.13962-1-Yazen.Ghannam@amd.com (mailing list archive) |
---|---|
Headers | show |
Series | AMD64 EDAC: Check for nodes without memory, etc. | expand |
On Tue, Oct 22, 2019 at 08:35:08PM +0000, Ghannam, Yazen wrote: > From: Yazen Ghannam <yazen.ghannam@amd.com> > > Hi Boris, > > Most of these patches address the issue where the module checks and > complains about DRAM ECC on nodes without memory. > > Thanks, > Yazen > > Link: > https://lkml.kernel.org/r/20191018153114.39378-1-Yazen.Ghannam@amd.com > > Yazen Ghannam (6): > EDAC/amd64: Make struct amd64_family_type global > EDAC/amd64: Gather hardware information early > EDAC/amd64: Save max number of controllers to family type > EDAC/amd64: Use cached data when checking for ECC > EDAC/amd64: Check for memory before fully initializing an instance > EDAC/amd64: Set grain per DIMM > > drivers/edac/amd64_edac.c | 196 +++++++++++++++++++------------------- > drivers/edac/amd64_edac.h | 2 + > 2 files changed, 100 insertions(+), 98 deletions(-) Almost there: now it dumps the whole shebang twice. This is on an old F10h box which doesn't have ECC DIMMs: [ 2.222853] EDAC MC: Ver: 3.0.0 [ 2.226881] EDAC DEBUG: edac_mc_sysfs_init: device mc created [ 5.726912] EDAC amd64: F10h detected (node 0). [ 5.732709] EDAC DEBUG: reserve_mc_sibling_devs: F1: 0000:00:18.1 [ 5.750886] EDAC DEBUG: reserve_mc_sibling_devs: F2: 0000:00:18.2 [ 5.758427] EDAC DEBUG: reserve_mc_sibling_devs: F3: 0000:00:18.3 [ 5.765871] EDAC DEBUG: read_mc_regs: TOP_MEM: 0x00000000d0000000 [ 5.774098] EDAC DEBUG: read_mc_regs: TOP_MEM2: 0x0000000230000000 [ 5.782339] EDAC DEBUG: read_dram_ctl_register: F2x110 (DCTSelLow): 0xffffffff, High range addrs at: 0xfffff800 [ 5.793976] EDAC DEBUG: read_dram_ctl_register: DCTs operate in ganged mode [ 5.802429] EDAC DEBUG: read_dram_ctl_register: data interleave for ECC: enabled, DRAM cleared since last warm reset: yes [ 5.814702] EDAC DEBUG: read_dram_ctl_register: channel interleave: enabled, interleave bits selector: 0x3 [ 5.826142] EDAC DEBUG: read_mc_regs: DRAM range[0], base: 0x0000ff0000000000; limit: 0x0000ff022fffffff [ 5.837070] EDAC DEBUG: read_mc_regs: IntlvEn=Disabled; Range access: RW IntlvSel=0 DstNode=0 [ 5.847061] EDAC DEBUG: read_dct_base_mask: DCSB0[0]=0x00000001 reg: F2x40 [ 5.854699] EDAC DEBUG: read_dct_base_mask: DCSB1[0]=0x00000000 reg: F2x140 [ 5.862763] EDAC DEBUG: read_dct_base_mask: DCSB0[1]=0x00000101 reg: F2x44 [ 5.870614] EDAC DEBUG: read_dct_base_mask: DCSB1[1]=0x00000000 reg: F2x144 [ 5.878457] EDAC DEBUG: read_dct_base_mask: DCSB0[2]=0x00000201 reg: F2x48 [ 5.888483] EDAC DEBUG: read_dct_base_mask: DCSB1[2]=0x00000000 reg: F2x148 [ 5.897359] EDAC DEBUG: read_dct_base_mask: DCSB0[3]=0x00000301 reg: F2x4c [ 5.906307] EDAC DEBUG: read_dct_base_mask: DCSB1[3]=0x00000000 reg: F2x14c [ 5.913698] EDAC DEBUG: read_dct_base_mask: DCSB0[4]=0x00000000 reg: F2x50 [ 5.921646] EDAC DEBUG: read_dct_base_mask: DCSB1[4]=0x00000000 reg: F2x150 [ 5.930415] EDAC DEBUG: read_dct_base_mask: DCSB0[5]=0x00000000 reg: F2x54 [ 5.937772] EDAC DEBUG: read_dct_base_mask: DCSB1[5]=0x00000000 reg: F2x154 [ 5.945684] EDAC DEBUG: read_dct_base_mask: DCSB0[6]=0x00000000 reg: F2x58 [ 5.953523] EDAC DEBUG: read_dct_base_mask: DCSB1[6]=0x00000000 reg: F2x158 [ 5.961546] EDAC DEBUG: read_dct_base_mask: DCSB0[7]=0x00000000 reg: F2x5c [ 5.969385] EDAC DEBUG: read_dct_base_mask: DCSB1[7]=0x00000000 reg: F2x15c [ 5.977333] EDAC DEBUG: read_dct_base_mask: DCSM0[0]=0x00f83ce0 reg: F2x60 [ 5.986777] EDAC DEBUG: read_dct_base_mask: DCSM1[0]=0x00000000 reg: F2x160 [ 6.000195] EDAC DEBUG: read_dct_base_mask: DCSM0[1]=0x00f83ce0 reg: F2x64 [ 6.012487] EDAC DEBUG: read_dct_base_mask: DCSM1[1]=0x00000000 reg: F2x164 [ 6.019946] EDAC DEBUG: read_dct_base_mask: DCSM0[2]=0x00000000 reg: F2x68 [ 6.027283] EDAC DEBUG: read_dct_base_mask: DCSM1[2]=0x00000000 reg: F2x168 [ 6.035342] EDAC DEBUG: read_dct_base_mask: DCSM0[3]=0x00000000 reg: F2x6c [ 6.042800] EDAC DEBUG: read_dct_base_mask: DCSM1[3]=0x00000000 reg: F2x16c [ 6.050913] EDAC DEBUG: read_mc_regs: DIMM type: Unbuffered-DDR2 [ 6.057183] EDAC DEBUG: nb_mce_bank_enabled_on_node: core: 0, MCG_CTL: 0x3f, NB MSR is enabled [ 6.065925] EDAC DEBUG: nb_mce_bank_enabled_on_node: core: 1, MCG_CTL: 0x3f, NB MSR is enabled [ 6.081200] EDAC amd64: Node 0: DRAM ECC disabled. [ 6.092690] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load. [ 6.208087] EDAC amd64: F10h detected (node 0). [ 6.212966] EDAC DEBUG: reserve_mc_sibling_devs: F1: 0000:00:18.1 [ 6.235500] EDAC DEBUG: reserve_mc_sibling_devs: F2: 0000:00:18.2 [ 6.241661] EDAC DEBUG: reserve_mc_sibling_devs: F3: 0000:00:18.3 [ 6.252691] EDAC DEBUG: read_mc_regs: TOP_MEM: 0x00000000d0000000 [ 6.259134] EDAC DEBUG: read_mc_regs: TOP_MEM2: 0x0000000230000000 [ 6.265823] EDAC DEBUG: read_dram_ctl_register: F2x110 (DCTSelLow): 0xffffffff, High range addrs at: 0xfffff800 [ 6.275978] EDAC DEBUG: read_dram_ctl_register: DCTs operate in ganged mode [ 6.283271] EDAC DEBUG: read_dram_ctl_register: data interleave for ECC: enabled, DRAM cleared since last warm reset: yes [ 6.294635] EDAC DEBUG: read_dram_ctl_register: channel interleave: enabled, interleave bits selector: 0x3 [ 6.304565] EDAC DEBUG: read_mc_regs: DRAM range[0], base: 0x0000ff0000000000; limit: 0x0000ff022fffffff [ 6.314367] EDAC DEBUG: read_mc_regs: IntlvEn=Disabled; Range access: RW IntlvSel=0 DstNode=0 [ 6.323259] EDAC DEBUG: read_dct_base_mask: DCSB0[0]=0x00000001 reg: F2x40 [ 6.330434] EDAC DEBUG: read_dct_base_mask: DCSB1[0]=0x00000000 reg: F2x140 [ 6.337648] EDAC DEBUG: read_dct_base_mask: DCSB0[1]=0x00000101 reg: F2x44 [ 6.351551] EDAC DEBUG: read_dct_base_mask: DCSB1[1]=0x00000000 reg: F2x144 [ 6.364985] EDAC DEBUG: read_dct_base_mask: DCSB0[2]=0x00000201 reg: F2x48 [ 6.379708] EDAC DEBUG: read_dct_base_mask: DCSB1[2]=0x00000000 reg: F2x148 [ 6.386913] EDAC DEBUG: read_dct_base_mask: DCSB0[3]=0x00000301 reg: F2x4c [ 6.394037] EDAC DEBUG: read_dct_base_mask: DCSB1[3]=0x00000000 reg: F2x14c [ 6.401259] EDAC DEBUG: read_dct_base_mask: DCSB0[4]=0x00000000 reg: F2x50 [ 6.408377] EDAC DEBUG: read_dct_base_mask: DCSB1[4]=0x00000000 reg: F2x150 [ 6.415854] EDAC DEBUG: read_dct_base_mask: DCSB0[5]=0x00000000 reg: F2x54 [ 6.422976] EDAC DEBUG: read_dct_base_mask: DCSB1[5]=0x00000000 reg: F2x154 [ 6.430178] EDAC DEBUG: read_dct_base_mask: DCSB0[6]=0x00000000 reg: F2x58 [ 6.437300] EDAC DEBUG: read_dct_base_mask: DCSB1[6]=0x00000000 reg: F2x158 [ 6.444507] EDAC DEBUG: read_dct_base_mask: DCSB0[7]=0x00000000 reg: F2x5c [ 6.451621] EDAC DEBUG: read_dct_base_mask: DCSB1[7]=0x00000000 reg: F2x15c [ 6.458833] EDAC DEBUG: read_dct_base_mask: DCSM0[0]=0x00f83ce0 reg: F2x60 [ 6.466155] EDAC DEBUG: read_dct_base_mask: DCSM1[0]=0x00000000 reg: F2x160 [ 6.473571] EDAC DEBUG: read_dct_base_mask: DCSM0[1]=0x00f83ce0 reg: F2x64 [ 6.480901] EDAC DEBUG: read_dct_base_mask: DCSM1[1]=0x00000000 reg: F2x164 [ 6.488305] EDAC DEBUG: read_dct_base_mask: DCSM0[2]=0x00000000 reg: F2x68 [ 6.495647] EDAC DEBUG: read_dct_base_mask: DCSM1[2]=0x00000000 reg: F2x168 [ 6.511447] EDAC DEBUG: read_dct_base_mask: DCSM0[3]=0x00000000 reg: F2x6c [ 6.511448] EDAC DEBUG: read_dct_base_mask: DCSM1[3]=0x00000000 reg: F2x16c [ 6.511451] EDAC DEBUG: read_mc_regs: DIMM type: Unbuffered-DDR2 [ 6.511458] EDAC DEBUG: nb_mce_bank_enabled_on_node: core: 0, MCG_CTL: 0x3f, NB MSR is enabled [ 6.511459] EDAC DEBUG: nb_mce_bank_enabled_on_node: core: 1, MCG_CTL: 0x3f, NB MSR is enabled [ 6.511460] EDAC amd64: Node 0: DRAM ECC disabled. [ 6.511461] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
> -----Original Message----- > From: Borislav Petkov <bp@alien8.de> > Sent: Friday, October 25, 2019 9:35 AM > To: Ghannam, Yazen <Yazen.Ghannam@amd.com> > Cc: linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org > Subject: Re: [PATCH v2 0/6] AMD64 EDAC: Check for nodes without memory, etc. > > On Tue, Oct 22, 2019 at 08:35:08PM +0000, Ghannam, Yazen wrote: > > From: Yazen Ghannam <yazen.ghannam@amd.com> > > > > Hi Boris, > > > > Most of these patches address the issue where the module checks and > > complains about DRAM ECC on nodes without memory. > > > > Thanks, > > Yazen > > > > Link: > > https://lkml.kernel.org/r/20191018153114.39378-1-Yazen.Ghannam@amd.com > > > > Yazen Ghannam (6): > > EDAC/amd64: Make struct amd64_family_type global > > EDAC/amd64: Gather hardware information early > > EDAC/amd64: Save max number of controllers to family type > > EDAC/amd64: Use cached data when checking for ECC > > EDAC/amd64: Check for memory before fully initializing an instance > > EDAC/amd64: Set grain per DIMM > > > > drivers/edac/amd64_edac.c | 196 +++++++++++++++++++------------------- > > drivers/edac/amd64_edac.h | 2 + > > 2 files changed, 100 insertions(+), 98 deletions(-) > > Almost there: now it dumps the whole shebang twice. This is on an old > F10h box which doesn't have ECC DIMMs: > > [ 2.222853] EDAC MC: Ver: 3.0.0 > [ 2.226881] EDAC DEBUG: edac_mc_sysfs_init: device mc created > [ 5.726912] EDAC amd64: F10h detected (node 0). ... > [ 6.208087] EDAC amd64: F10h detected (node 0). Is the module being probed twice? We have this problem in general, e.g. the module gets loaded multiple times on failure. The clue for me is that node 0 gets detected twice. This is done in per_family_init() early in probe_one_instance(). In any case, I think we can make !ecc_enabled(pvt) in probe_one_instance() a failure now that we have an explicit check for memory on a node. In other words, if we have memory and ECC is disabled then this is a failure for the module. Thanks, Yazen
On Fri, Nov 01, 2019 at 03:19:36PM +0000, Ghannam, Yazen wrote: > Is the module being probed twice? We have this problem in general, e.g. the > module gets loaded multiple times on failure. Yap, it looks like it. > The clue for me is that node 0 gets detected twice. This is done in > per_family_init() early in probe_one_instance(). > > In any case, I think we can make !ecc_enabled(pvt) in probe_one_instance() a > failure now that we have an explicit check for memory on a node. In other > words, if we have memory and ECC is disabled then this is a failure for the > module. Yeah, for that case we should be printing ecc_msg. Makes sense. Thx.
> -----Original Message----- > From: linux-edac-owner@vger.kernel.org <linux-edac-owner@vger.kernel.org> On Behalf Of Borislav Petkov > Sent: Friday, November 1, 2019 11:54 AM > To: Ghannam, Yazen <Yazen.Ghannam@amd.com> > Cc: linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org > Subject: Re: [PATCH v2 0/6] AMD64 EDAC: Check for nodes without memory, etc. > > On Fri, Nov 01, 2019 at 03:19:36PM +0000, Ghannam, Yazen wrote: > > Is the module being probed twice? We have this problem in general, e.g. the > > module gets loaded multiple times on failure. > > Yap, it looks like it. > > > The clue for me is that node 0 gets detected twice. This is done in > > per_family_init() early in probe_one_instance(). > > > > In any case, I think we can make !ecc_enabled(pvt) in probe_one_instance() a > > failure now that we have an explicit check for memory on a node. In other > > words, if we have memory and ECC is disabled then this is a failure for the > > module. > > Yeah, for that case we should be printing ecc_msg. Makes sense. > Do you have any other comments on this set? Should I send another revision with this change? Thanks, Yazen
On Tue, Nov 05, 2019 at 01:38:15PM +0000, Ghannam, Yazen wrote: > Do you have any other comments on this set? No, it looks good otherwise. > Should I send another revision with this change? Pls do, thx.
From: Yazen Ghannam <yazen.ghannam@amd.com> Hi Boris, Most of these patches address the issue where the module checks and complains about DRAM ECC on nodes without memory. Thanks, Yazen Link: https://lkml.kernel.org/r/20191018153114.39378-1-Yazen.Ghannam@amd.com Yazen Ghannam (6): EDAC/amd64: Make struct amd64_family_type global EDAC/amd64: Gather hardware information early EDAC/amd64: Save max number of controllers to family type EDAC/amd64: Use cached data when checking for ECC EDAC/amd64: Check for memory before fully initializing an instance EDAC/amd64: Set grain per DIMM drivers/edac/amd64_edac.c | 196 +++++++++++++++++++------------------- drivers/edac/amd64_edac.h | 2 + 2 files changed, 100 insertions(+), 98 deletions(-)