mbox series

[v3,0/8] AMD64 EDAC fixes

Message ID 20190821235938.118710-1-Yazen.Ghannam@amd.com (mailing list archive)
Headers show
Series AMD64 EDAC fixes | expand

Message

Yazen Ghannam Aug. 21, 2019, 11:59 p.m. UTC
From: Yazen Ghannam <yazen.ghannam@amd.com>

Hi Boris,

This set contains a few fixes for some changes merged in v5.2. There
are also a couple of fixes for older issues. In addition, there are a
couple of patches to add support for Asymmetric Dual-Rank DIMMs.

I don't have the failing config readily available that you used, but I
believe I found the issue. Please let me know how it goes.

I've also added RFC patches to avoid the "ECC disabled" message for
nodes without memory. I haven't fully tested these, but I wanted to get
your thoughts. Here's an earlier discussion:
https://lkml.kernel.org/r/20180321191335.7832-1-Yazen.Ghannam@amd.com

Thanks,
Yazen

Link:
https://lkml.kernel.org/r/20190709215643.171078-1-Yazen.Ghannam@amd.com

v2->v3:
* Drop Fixes: tags in patch 1.
* Add detection of x8 DRAM devices in patches 2 and 3.
* Fix Chip Select size printing in patch 4.
* Added RFC patches to avoid "ECC disabled" message for nodes without memory.

v1->v2:
* Squash patches 1 and 2 together.


Yazen Ghannam (10):
  EDAC/amd64: Support more than two controllers for chip selects
    handling
  EDAC/amd64: Recognize DRAM device type with EDAC_CTL_CAP
  EDAC/amd64: Initialize DIMM info for systems with more than two
    channels
  EDAC/amd64: Find Chip Select memory size using Address Mask
  EDAC/amd64: Decode syndrome before translating address
  EDAC/amd64: Cache secondary Chip Select registers
  EDAC/amd64: Support Asymmetric Dual-Rank DIMMs
  EDAC/amd64: Gather hardware information early
  EDAC/amd64: Use cached data when checking for ECC
  EDAC/amd64: Check for memory before fully initializing an instance

 drivers/edac/amd64_edac.c | 371 +++++++++++++++++++++++++-------------
 drivers/edac/amd64_edac.h |   9 +-
 2 files changed, 251 insertions(+), 129 deletions(-)

Comments

Adam Borowski Aug. 22, 2019, 12:50 a.m. UTC | #1
On Wed, Aug 21, 2019 at 11:59:53PM +0000, Ghannam, Yazen wrote:
> I've also added RFC patches to avoid the "ECC disabled" message for
> nodes without memory. I haven't fully tested these, but I wanted to get
> your thoughts. Here's an earlier discussion:
> https://lkml.kernel.org/r/20180321191335.7832-1-Yazen.Ghannam@amd.com

While you're editing that code, could you please also cut the spam if ECC is
actually disabled?  For example, a 2990WX with non-ECC RAM gets 1024 lines;
64 copies of:

[    8.186164] EDAC amd64: Node 0: DRAM ECC disabled.
[    8.188364] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
                Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
                (Note that use of the override may cause unknown side effects.)
[    8.194762] EDAC amd64: Node 1: DRAM ECC disabled.
[    8.196307] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
                Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
                (Note that use of the override may cause unknown side effects.)
[    8.199840] EDAC amd64: Node 2: DRAM ECC disabled.
[    8.200963] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
                Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
                (Note that use of the override may cause unknown side effects.)
[    8.204326] EDAC amd64: Node 3: DRAM ECC disabled.
[    8.205436] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
                Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
                (Note that use of the override may cause unknown side effects.)


Meow!
Borislav Petkov Aug. 22, 2019, 8:35 a.m. UTC | #2
On Thu, Aug 22, 2019 at 02:50:20AM +0200, Adam Borowski wrote:
> While you're editing that code, could you please also cut the spam if ECC is
> actually disabled?  For example, a 2990WX with non-ECC RAM gets 1024 lines;

Patch is in there. I'll give you extra points if you spot it.

> Meow!

Wuff!
Adam Borowski Aug. 22, 2019, 9:46 a.m. UTC | #3
On Thu, Aug 22, 2019 at 10:35:48AM +0200, Borislav Petkov wrote:
> On Thu, Aug 22, 2019 at 02:50:20AM +0200, Adam Borowski wrote:
> > While you're editing that code, could you please also cut the spam if ECC is
> > actually disabled?  For example, a 2990WX with non-ECC RAM gets 1024 lines;
> 
> Patch is in there. I'll give you extra points if you spot it.

Yeah, some of messages are no longer emitted for memory-less nodes (NUMA 1
and 3).  Your patch set also overhauls the messages.

But, the amount of redundant messages I'm complaining about has actually
increased:

dmesg|grep EDAC|cut -c 16-|sort|uniq -c
    256 EDAC MC: UMC0 chip selects:
    256 EDAC MC: UMC1 chip selects:
      1 EDAC MC: Ver: 3.0.0
    128 EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
    ^ three lines each
     64 EDAC amd64: F17h detected (node 0).
     64 EDAC amd64: F17h detected (node 1).
     64 EDAC amd64: F17h detected (node 2).
     64 EDAC amd64: F17h detected (node 3).
    512 EDAC amd64: MC: 0:     0MB 1:     0MB
    256 EDAC amd64: MC: 2:     0MB 3:     0MB
    256 EDAC amd64: MC: 2:  8192MB 3:     0MB
     64 EDAC amd64: Node 0: DRAM ECC disabled.
     64 EDAC amd64: Node 2: DRAM ECC disabled.
    256 EDAC amd64: using x4 syndromes.

(Full dmesg at http://ix.io/1T1o)

While on 5.3-rc5 without the patchset I get:

      1 EDAC MC: Ver: 3.0.0
    256 EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
    ^ three lines each
     64 EDAC amd64: Node 0: DRAM ECC disabled.
     64 EDAC amd64: Node 1: DRAM ECC disabled.
     64 EDAC amd64: Node 2: DRAM ECC disabled.
     64 EDAC amd64: Node 3: DRAM ECC disabled.

So I wonder if we could deduplicate those.


Meow!
Borislav Petkov Aug. 22, 2019, 9:55 a.m. UTC | #4
On Thu, Aug 22, 2019 at 11:46:07AM +0200, Adam Borowski wrote:
> Yeah, some of messages are no longer emitted for memory-less nodes (NUMA 1
> and 3).  Your patch set also overhauls the messages.

Not my patchset - Yazen's.

> But, the amount of redundant messages I'm complaining about has actually
> increased:

He's working on that - that still needs some love.

Thanks for testing.
Yazen Ghannam Aug. 22, 2019, 6:54 p.m. UTC | #5
> -----Original Message-----
> From: linux-edac-owner@vger.kernel.org <linux-edac-owner@vger.kernel.org> On Behalf Of Adam Borowski
> Sent: Wednesday, August 21, 2019 7:50 PM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org; bp@alien8.de
> Subject: Re: [PATCH v3 0/8] AMD64 EDAC fixes
> 
> On Wed, Aug 21, 2019 at 11:59:53PM +0000, Ghannam, Yazen wrote:
> > I've also added RFC patches to avoid the "ECC disabled" message for
> > nodes without memory. I haven't fully tested these, but I wanted to get
> > your thoughts. Here's an earlier discussion:
> > https://lkml.kernel.org/r/20180321191335.7832-1-Yazen.Ghannam@amd.com
> 
> While you're editing that code, could you please also cut the spam if ECC is
> actually disabled?  For example, a 2990WX with non-ECC RAM gets 1024 lines;
> 64 copies of:
> 
> [    8.186164] EDAC amd64: Node 0: DRAM ECC disabled.
> [    8.188364] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
>                 Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
>                 (Note that use of the override may cause unknown side effects.)
> [    8.194762] EDAC amd64: Node 1: DRAM ECC disabled.
> [    8.196307] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
>                 Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
>                 (Note that use of the override may cause unknown side effects.)
> [    8.199840] EDAC amd64: Node 2: DRAM ECC disabled.
> [    8.200963] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
>                 Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
>                 (Note that use of the override may cause unknown side effects.)
> [    8.204326] EDAC amd64: Node 3: DRAM ECC disabled.
> [    8.205436] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
>                 Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
>                 (Note that use of the override may cause unknown side effects.)
> 

I wonder if the module is being loaded multiple times. I'll try to reproduce this and track it down.

Thanks for testing and reporting this issue.

-Yazen
Yazen Ghannam Aug. 23, 2019, 3:28 p.m. UTC | #6
> -----Original Message-----
> From: linux-edac-owner@vger.kernel.org <linux-edac-owner@vger.kernel.org> On Behalf Of Ghannam, Yazen
> Sent: Thursday, August 22, 2019 1:54 PM
> To: Adam Borowski <kilobyte@angband.pl>
> Cc: linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org; bp@alien8.de
> Subject: RE: [PATCH v3 0/8] AMD64 EDAC fixes
> 
...
> I wonder if the module is being loaded multiple times. I'll try to reproduce this and track it down.
> 

I was able to reproduce a similar failure. I do see that the module is being loaded multiple times on failure.

Here's a call trace from one dump_stack() output:
[  +0.004964] CPU: 132 PID: 2680 Comm: systemd-udevd Not tainted 4.20.0-edac-debug+ #36
[  +0.009802] Call Trace:
[  +0.002727]  dump_stack+0x63/0x85
[  +0.003696]  amd64_edac_init+0x2163/0x3000 [amd64_edac_mod]
[  +0.006216]  ? __wake_up+0x13/0x20
[  +0.003790]  ? 0xffffffffc120d000
[  +0.003694]  do_one_initcall+0x4a/0x1c9
[  +0.004277]  ? _cond_resched+0x19/0x40
[  +0.004178]  ? kmem_cache_alloc_trace+0x15c/0x1d0
[  +0.005244]  do_init_module+0x5f/0x216
[  +0.004180]  load_module+0x21d5/0x2ac0
[  +0.004179]  ? wait_woken+0x80/0x80
[  +0.003889]  __do_sys_finit_module+0xfc/0x120
[  +0.004858]  ? __do_sys_finit_module+0xfc/0x120
[  +0.005052]  __x64_sys_finit_module+0x1a/0x20
[  +0.004857]  do_syscall_64+0x5a/0x120
[  +0.004081]  entry_SYSCALL_64_after_hwframe+0x44/0xa9


So it seems that userspace (systemd-udevd) keeps trying to load the module. I'm not sure how to prevent this from within the module.

Boris,
Do you think it'd be appropriate to change the return values for some cases?

For example, ECC disabled is a hardware configuration. This doesn't mean that the module failed any operations in this case.

In other words, the module checks for a feature. If the feature is not present, then return without failure (and maybe give a message).

Thanks,
Yazen
Borislav Petkov Aug. 23, 2019, 3:37 p.m. UTC | #7
On Fri, Aug 23, 2019 at 03:28:59PM +0000, Ghannam, Yazen wrote:
> Boris, Do you think it'd be appropriate to change the return values
> for some cases?
>
> For example, ECC disabled is a hardware configuration. This doesn't
> mean that the module failed any operations in this case.
>
> In other words, the module checks for a feature. If the feature is not
> present, then return without failure (and maybe give a message).

That makes sense but AFAICT if probe_one_instance() sees that ECC is not
enabled, it returns 0.

The "if (!edac_has_mcs())" check later is to verify that at least once
instance was loaded successfully and, if not, then return an error.

So where does it return failure?
Yazen Ghannam Aug. 26, 2019, 2:19 p.m. UTC | #8
> -----Original Message-----
> From: linux-edac-owner@vger.kernel.org <linux-edac-owner@vger.kernel.org> On Behalf Of Borislav Petkov
> Sent: Friday, August 23, 2019 10:38 AM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: Adam Borowski <kilobyte@angband.pl>; linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v3 0/8] AMD64 EDAC fixes
> 
> On Fri, Aug 23, 2019 at 03:28:59PM +0000, Ghannam, Yazen wrote:
> > Boris, Do you think it'd be appropriate to change the return values
> > for some cases?
> >
> > For example, ECC disabled is a hardware configuration. This doesn't
> > mean that the module failed any operations in this case.
> >
> > In other words, the module checks for a feature. If the feature is not
> > present, then return without failure (and maybe give a message).
> 
> That makes sense but AFAICT if probe_one_instance() sees that ECC is not
> enabled, it returns 0.
> 
> The "if (!edac_has_mcs())" check later is to verify that at least once
> instance was loaded successfully and, if not, then return an error.
> 
> So where does it return failure?
> 

I was tracking down the failure with ECC disabled, and that seems to be it.

So I think we should return 0 "if (!edac_has_mcs())", because we'd only get
there if ECC is disabled on all nodes and there wasn't some other initialization
error.

I'll send a patch for this soon.

Adam, would you mind testing this patch?

Thanks,
Yazen
Borislav Petkov Aug. 26, 2019, 2:59 p.m. UTC | #9
On Mon, Aug 26, 2019 at 02:19:18PM +0000, Ghannam, Yazen wrote:
> I was tracking down the failure with ECC disabled, and that seems to be it.
>
> So I think we should return 0 "if (!edac_has_mcs())", because we'd only get
> there if ECC is disabled on all nodes and there wasn't some other initialization
> error.
>
> I'll send a patch for this soon.
>
> Adam, would you mind testing this patch?

You can't return 0 when ECC is disabled on all nodes because then the
driver remains loaded without driving anything. That silly userspace
needs to understand that ENODEV means "stop trying to load this driver".
Yazen Ghannam Aug. 26, 2019, 3:05 p.m. UTC | #10
> -----Original Message-----
> From: linux-edac-owner@vger.kernel.org <linux-edac-owner@vger.kernel.org> On Behalf Of Borislav Petkov
> Sent: Monday, August 26, 2019 9:59 AM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: Adam Borowski <kilobyte@angband.pl>; linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v3 0/8] AMD64 EDAC fixes
> 
> On Mon, Aug 26, 2019 at 02:19:18PM +0000, Ghannam, Yazen wrote:
> > I was tracking down the failure with ECC disabled, and that seems to be it.
> >
> > So I think we should return 0 "if (!edac_has_mcs())", because we'd only get
> > there if ECC is disabled on all nodes and there wasn't some other initialization
> > error.
> >
> > I'll send a patch for this soon.
> >
> > Adam, would you mind testing this patch?
> 
> You can't return 0 when ECC is disabled on all nodes because then the
> driver remains loaded without driving anything. That silly userspace
> needs to understand that ENODEV means "stop trying to load this driver".
> 

Yes, you're right.

I'll try and track down the interaction here between userspace and the module.
Please let me know if you have any suggestions.

Thanks,
Yazen