Message ID | 20190821235938.118710-1-Yazen.Ghannam@amd.com (mailing list archive) |
---|---|
Headers | show |
Series | AMD64 EDAC fixes | expand |
On Wed, Aug 21, 2019 at 11:59:53PM +0000, Ghannam, Yazen wrote: > I've also added RFC patches to avoid the "ECC disabled" message for > nodes without memory. I haven't fully tested these, but I wanted to get > your thoughts. Here's an earlier discussion: > https://lkml.kernel.org/r/20180321191335.7832-1-Yazen.Ghannam@amd.com While you're editing that code, could you please also cut the spam if ECC is actually disabled? For example, a 2990WX with non-ECC RAM gets 1024 lines; 64 copies of: [ 8.186164] EDAC amd64: Node 0: DRAM ECC disabled. [ 8.188364] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load. Either enable ECC checking or force module loading by setting 'ecc_enable_override'. (Note that use of the override may cause unknown side effects.) [ 8.194762] EDAC amd64: Node 1: DRAM ECC disabled. [ 8.196307] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load. Either enable ECC checking or force module loading by setting 'ecc_enable_override'. (Note that use of the override may cause unknown side effects.) [ 8.199840] EDAC amd64: Node 2: DRAM ECC disabled. [ 8.200963] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load. Either enable ECC checking or force module loading by setting 'ecc_enable_override'. (Note that use of the override may cause unknown side effects.) [ 8.204326] EDAC amd64: Node 3: DRAM ECC disabled. [ 8.205436] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load. Either enable ECC checking or force module loading by setting 'ecc_enable_override'. (Note that use of the override may cause unknown side effects.) Meow!
On Thu, Aug 22, 2019 at 02:50:20AM +0200, Adam Borowski wrote: > While you're editing that code, could you please also cut the spam if ECC is > actually disabled? For example, a 2990WX with non-ECC RAM gets 1024 lines; Patch is in there. I'll give you extra points if you spot it. > Meow! Wuff!
On Thu, Aug 22, 2019 at 10:35:48AM +0200, Borislav Petkov wrote: > On Thu, Aug 22, 2019 at 02:50:20AM +0200, Adam Borowski wrote: > > While you're editing that code, could you please also cut the spam if ECC is > > actually disabled? For example, a 2990WX with non-ECC RAM gets 1024 lines; > > Patch is in there. I'll give you extra points if you spot it. Yeah, some of messages are no longer emitted for memory-less nodes (NUMA 1 and 3). Your patch set also overhauls the messages. But, the amount of redundant messages I'm complaining about has actually increased: dmesg|grep EDAC|cut -c 16-|sort|uniq -c 256 EDAC MC: UMC0 chip selects: 256 EDAC MC: UMC1 chip selects: 1 EDAC MC: Ver: 3.0.0 128 EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load. ^ three lines each 64 EDAC amd64: F17h detected (node 0). 64 EDAC amd64: F17h detected (node 1). 64 EDAC amd64: F17h detected (node 2). 64 EDAC amd64: F17h detected (node 3). 512 EDAC amd64: MC: 0: 0MB 1: 0MB 256 EDAC amd64: MC: 2: 0MB 3: 0MB 256 EDAC amd64: MC: 2: 8192MB 3: 0MB 64 EDAC amd64: Node 0: DRAM ECC disabled. 64 EDAC amd64: Node 2: DRAM ECC disabled. 256 EDAC amd64: using x4 syndromes. (Full dmesg at http://ix.io/1T1o) While on 5.3-rc5 without the patchset I get: 1 EDAC MC: Ver: 3.0.0 256 EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load. ^ three lines each 64 EDAC amd64: Node 0: DRAM ECC disabled. 64 EDAC amd64: Node 1: DRAM ECC disabled. 64 EDAC amd64: Node 2: DRAM ECC disabled. 64 EDAC amd64: Node 3: DRAM ECC disabled. So I wonder if we could deduplicate those. Meow!
On Thu, Aug 22, 2019 at 11:46:07AM +0200, Adam Borowski wrote: > Yeah, some of messages are no longer emitted for memory-less nodes (NUMA 1 > and 3). Your patch set also overhauls the messages. Not my patchset - Yazen's. > But, the amount of redundant messages I'm complaining about has actually > increased: He's working on that - that still needs some love. Thanks for testing.
> -----Original Message----- > From: linux-edac-owner@vger.kernel.org <linux-edac-owner@vger.kernel.org> On Behalf Of Adam Borowski > Sent: Wednesday, August 21, 2019 7:50 PM > To: Ghannam, Yazen <Yazen.Ghannam@amd.com> > Cc: linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org; bp@alien8.de > Subject: Re: [PATCH v3 0/8] AMD64 EDAC fixes > > On Wed, Aug 21, 2019 at 11:59:53PM +0000, Ghannam, Yazen wrote: > > I've also added RFC patches to avoid the "ECC disabled" message for > > nodes without memory. I haven't fully tested these, but I wanted to get > > your thoughts. Here's an earlier discussion: > > https://lkml.kernel.org/r/20180321191335.7832-1-Yazen.Ghannam@amd.com > > While you're editing that code, could you please also cut the spam if ECC is > actually disabled? For example, a 2990WX with non-ECC RAM gets 1024 lines; > 64 copies of: > > [ 8.186164] EDAC amd64: Node 0: DRAM ECC disabled. > [ 8.188364] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load. > Either enable ECC checking or force module loading by setting 'ecc_enable_override'. > (Note that use of the override may cause unknown side effects.) > [ 8.194762] EDAC amd64: Node 1: DRAM ECC disabled. > [ 8.196307] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load. > Either enable ECC checking or force module loading by setting 'ecc_enable_override'. > (Note that use of the override may cause unknown side effects.) > [ 8.199840] EDAC amd64: Node 2: DRAM ECC disabled. > [ 8.200963] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load. > Either enable ECC checking or force module loading by setting 'ecc_enable_override'. > (Note that use of the override may cause unknown side effects.) > [ 8.204326] EDAC amd64: Node 3: DRAM ECC disabled. > [ 8.205436] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load. > Either enable ECC checking or force module loading by setting 'ecc_enable_override'. > (Note that use of the override may cause unknown side effects.) > I wonder if the module is being loaded multiple times. I'll try to reproduce this and track it down. Thanks for testing and reporting this issue. -Yazen
> -----Original Message----- > From: linux-edac-owner@vger.kernel.org <linux-edac-owner@vger.kernel.org> On Behalf Of Ghannam, Yazen > Sent: Thursday, August 22, 2019 1:54 PM > To: Adam Borowski <kilobyte@angband.pl> > Cc: linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org; bp@alien8.de > Subject: RE: [PATCH v3 0/8] AMD64 EDAC fixes > ... > I wonder if the module is being loaded multiple times. I'll try to reproduce this and track it down. > I was able to reproduce a similar failure. I do see that the module is being loaded multiple times on failure. Here's a call trace from one dump_stack() output: [ +0.004964] CPU: 132 PID: 2680 Comm: systemd-udevd Not tainted 4.20.0-edac-debug+ #36 [ +0.009802] Call Trace: [ +0.002727] dump_stack+0x63/0x85 [ +0.003696] amd64_edac_init+0x2163/0x3000 [amd64_edac_mod] [ +0.006216] ? __wake_up+0x13/0x20 [ +0.003790] ? 0xffffffffc120d000 [ +0.003694] do_one_initcall+0x4a/0x1c9 [ +0.004277] ? _cond_resched+0x19/0x40 [ +0.004178] ? kmem_cache_alloc_trace+0x15c/0x1d0 [ +0.005244] do_init_module+0x5f/0x216 [ +0.004180] load_module+0x21d5/0x2ac0 [ +0.004179] ? wait_woken+0x80/0x80 [ +0.003889] __do_sys_finit_module+0xfc/0x120 [ +0.004858] ? __do_sys_finit_module+0xfc/0x120 [ +0.005052] __x64_sys_finit_module+0x1a/0x20 [ +0.004857] do_syscall_64+0x5a/0x120 [ +0.004081] entry_SYSCALL_64_after_hwframe+0x44/0xa9 So it seems that userspace (systemd-udevd) keeps trying to load the module. I'm not sure how to prevent this from within the module. Boris, Do you think it'd be appropriate to change the return values for some cases? For example, ECC disabled is a hardware configuration. This doesn't mean that the module failed any operations in this case. In other words, the module checks for a feature. If the feature is not present, then return without failure (and maybe give a message). Thanks, Yazen
On Fri, Aug 23, 2019 at 03:28:59PM +0000, Ghannam, Yazen wrote: > Boris, Do you think it'd be appropriate to change the return values > for some cases? > > For example, ECC disabled is a hardware configuration. This doesn't > mean that the module failed any operations in this case. > > In other words, the module checks for a feature. If the feature is not > present, then return without failure (and maybe give a message). That makes sense but AFAICT if probe_one_instance() sees that ECC is not enabled, it returns 0. The "if (!edac_has_mcs())" check later is to verify that at least once instance was loaded successfully and, if not, then return an error. So where does it return failure?
> -----Original Message----- > From: linux-edac-owner@vger.kernel.org <linux-edac-owner@vger.kernel.org> On Behalf Of Borislav Petkov > Sent: Friday, August 23, 2019 10:38 AM > To: Ghannam, Yazen <Yazen.Ghannam@amd.com> > Cc: Adam Borowski <kilobyte@angband.pl>; linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org > Subject: Re: [PATCH v3 0/8] AMD64 EDAC fixes > > On Fri, Aug 23, 2019 at 03:28:59PM +0000, Ghannam, Yazen wrote: > > Boris, Do you think it'd be appropriate to change the return values > > for some cases? > > > > For example, ECC disabled is a hardware configuration. This doesn't > > mean that the module failed any operations in this case. > > > > In other words, the module checks for a feature. If the feature is not > > present, then return without failure (and maybe give a message). > > That makes sense but AFAICT if probe_one_instance() sees that ECC is not > enabled, it returns 0. > > The "if (!edac_has_mcs())" check later is to verify that at least once > instance was loaded successfully and, if not, then return an error. > > So where does it return failure? > I was tracking down the failure with ECC disabled, and that seems to be it. So I think we should return 0 "if (!edac_has_mcs())", because we'd only get there if ECC is disabled on all nodes and there wasn't some other initialization error. I'll send a patch for this soon. Adam, would you mind testing this patch? Thanks, Yazen
On Mon, Aug 26, 2019 at 02:19:18PM +0000, Ghannam, Yazen wrote: > I was tracking down the failure with ECC disabled, and that seems to be it. > > So I think we should return 0 "if (!edac_has_mcs())", because we'd only get > there if ECC is disabled on all nodes and there wasn't some other initialization > error. > > I'll send a patch for this soon. > > Adam, would you mind testing this patch? You can't return 0 when ECC is disabled on all nodes because then the driver remains loaded without driving anything. That silly userspace needs to understand that ENODEV means "stop trying to load this driver".
> -----Original Message----- > From: linux-edac-owner@vger.kernel.org <linux-edac-owner@vger.kernel.org> On Behalf Of Borislav Petkov > Sent: Monday, August 26, 2019 9:59 AM > To: Ghannam, Yazen <Yazen.Ghannam@amd.com> > Cc: Adam Borowski <kilobyte@angband.pl>; linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org > Subject: Re: [PATCH v3 0/8] AMD64 EDAC fixes > > On Mon, Aug 26, 2019 at 02:19:18PM +0000, Ghannam, Yazen wrote: > > I was tracking down the failure with ECC disabled, and that seems to be it. > > > > So I think we should return 0 "if (!edac_has_mcs())", because we'd only get > > there if ECC is disabled on all nodes and there wasn't some other initialization > > error. > > > > I'll send a patch for this soon. > > > > Adam, would you mind testing this patch? > > You can't return 0 when ECC is disabled on all nodes because then the > driver remains loaded without driving anything. That silly userspace > needs to understand that ENODEV means "stop trying to load this driver". > Yes, you're right. I'll try and track down the interaction here between userspace and the module. Please let me know if you have any suggestions. Thanks, Yazen
From: Yazen Ghannam <yazen.ghannam@amd.com> Hi Boris, This set contains a few fixes for some changes merged in v5.2. There are also a couple of fixes for older issues. In addition, there are a couple of patches to add support for Asymmetric Dual-Rank DIMMs. I don't have the failing config readily available that you used, but I believe I found the issue. Please let me know how it goes. I've also added RFC patches to avoid the "ECC disabled" message for nodes without memory. I haven't fully tested these, but I wanted to get your thoughts. Here's an earlier discussion: https://lkml.kernel.org/r/20180321191335.7832-1-Yazen.Ghannam@amd.com Thanks, Yazen Link: https://lkml.kernel.org/r/20190709215643.171078-1-Yazen.Ghannam@amd.com v2->v3: * Drop Fixes: tags in patch 1. * Add detection of x8 DRAM devices in patches 2 and 3. * Fix Chip Select size printing in patch 4. * Added RFC patches to avoid "ECC disabled" message for nodes without memory. v1->v2: * Squash patches 1 and 2 together. Yazen Ghannam (10): EDAC/amd64: Support more than two controllers for chip selects handling EDAC/amd64: Recognize DRAM device type with EDAC_CTL_CAP EDAC/amd64: Initialize DIMM info for systems with more than two channels EDAC/amd64: Find Chip Select memory size using Address Mask EDAC/amd64: Decode syndrome before translating address EDAC/amd64: Cache secondary Chip Select registers EDAC/amd64: Support Asymmetric Dual-Rank DIMMs EDAC/amd64: Gather hardware information early EDAC/amd64: Use cached data when checking for ECC EDAC/amd64: Check for memory before fully initializing an instance drivers/edac/amd64_edac.c | 371 +++++++++++++++++++++++++------------- drivers/edac/amd64_edac.h | 9 +- 2 files changed, 251 insertions(+), 129 deletions(-)