[v3,0/5] AMD64 EDAC: Check for nodes without memory, etc.

Message ID	20191106012448.243970-1-Yazen.Ghannam@amd.com (mailing list archive)

Yazen Ghannam Nov. 6, 2019, 1:24 a.m. UTC

From: Yazen Ghannam <yazen.ghannam@amd.com>

Hi Boris,

These patches address the issue where the module checks and complains
about DRAM ECC on nodes without memory.

Changes from last revision:
  1) Dropped patch 6 which was for adding a grain value.
  2) Added an error code for !ecc_enabled() in patch 5.

Thanks,
Yazen

Link:
https://lkml.kernel.org/r/20191022203448.13962-1-Yazen.Ghannam@amd.com

Yazen Ghannam (5):
  EDAC/amd64: Make struct amd64_family_type global
  EDAC/amd64: Gather hardware information early
  EDAC/amd64: Save max number of controllers to family type
  EDAC/amd64: Use cached data when checking for ECC
  EDAC/amd64: Check for memory before fully initializing an instance

 drivers/edac/amd64_edac.c | 196 +++++++++++++++++++-------------------
 drivers/edac/amd64_edac.h |   2 +
 2 files changed, 99 insertions(+), 99 deletions(-)

Borislav Petkov Nov. 6, 2019, 4:06 p.m. UTC | #1

On Wed, Nov 06, 2019 at 01:24:59AM +0000, Ghannam, Yazen wrote:
> From: Yazen Ghannam <yazen.ghannam@amd.com>
> 
> Hi Boris,
> 
> These patches address the issue where the module checks and complains
> about DRAM ECC on nodes without memory.
> 
> Changes from last revision:
>   1) Dropped patch 6 which was for adding a grain value.
>   2) Added an error code for !ecc_enabled() in patch 5.

Still doesn't help. The load gets attempted twice still. Try reproducing
it on a small, single-node box where ECC is disabled.

[    2.590123] EDAC MC: Ver: 3.0.0
[    2.594153] EDAC DEBUG: edac_mc_sysfs_init: device mc created
[    5.946482] EDAC amd64: F10h detected (node 0).
[    5.952134] EDAC DEBUG: reserve_mc_sibling_devs: F1: 0000:00:18.1
[    5.958967] EDAC DEBUG: reserve_mc_sibling_devs: F2: 0000:00:18.2
[    5.969869] EDAC DEBUG: reserve_mc_sibling_devs: F3: 0000:00:18.3
[    5.981125] EDAC DEBUG: read_mc_regs:   TOP_MEM:  0x00000000d0000000
[    5.981126] EDAC DEBUG: read_mc_regs:   TOP_MEM2: 0x0000000230000000
[    5.981130] EDAC DEBUG: read_dram_ctl_register: F2x110 (DCTSelLow): 0xffffffff, High range addrs at: 0xfffff800
[    5.981131] EDAC DEBUG: read_dram_ctl_register:   DCTs operate in ganged mode
[    5.981132] EDAC DEBUG: read_dram_ctl_register:   data interleave for ECC: enabled, DRAM cleared since last warm reset: yes
[    5.981133] EDAC DEBUG: read_dram_ctl_register:   channel interleave: enabled, interleave bits selector: 0x3
[    5.981137] EDAC DEBUG: read_mc_regs:   DRAM range[0], base: 0x0000ff0000000000; limit: 0x0000ff022fffffff
[    5.981138] EDAC DEBUG: read_mc_regs:    IntlvEn=Disabled; Range access: RW IntlvSel=0 DstNode=0
[    5.981144] EDAC DEBUG: read_dct_base_mask:   DCSB0[0]=0x00000001 reg: F2x40
[    5.981146] EDAC DEBUG: read_dct_base_mask:   DCSB1[0]=0x00000000 reg: F2x140
[    5.981147] EDAC DEBUG: read_dct_base_mask:   DCSB0[1]=0x00000101 reg: F2x44
[    5.981148] EDAC DEBUG: read_dct_base_mask:   DCSB1[1]=0x00000000 reg: F2x144
[    5.981149] EDAC DEBUG: read_dct_base_mask:   DCSB0[2]=0x00000201 reg: F2x48
[    5.981150] EDAC DEBUG: read_dct_base_mask:   DCSB1[2]=0x00000000 reg: F2x148
[    5.981151] EDAC DEBUG: read_dct_base_mask:   DCSB0[3]=0x00000301 reg: F2x4c
[    5.981152] EDAC DEBUG: read_dct_base_mask:   DCSB1[3]=0x00000000 reg: F2x14c
[    5.981153] EDAC DEBUG: read_dct_base_mask:   DCSB0[4]=0x00000000 reg: F2x50
[    5.981154] EDAC DEBUG: read_dct_base_mask:   DCSB1[4]=0x00000000 reg: F2x150
[    5.981155] EDAC DEBUG: read_dct_base_mask:   DCSB0[5]=0x00000000 reg: F2x54
[    5.981156] EDAC DEBUG: read_dct_base_mask:   DCSB1[5]=0x00000000 reg: F2x154
[    5.981157] EDAC DEBUG: read_dct_base_mask:   DCSB0[6]=0x00000000 reg: F2x58
[    5.981158] EDAC DEBUG: read_dct_base_mask:   DCSB1[6]=0x00000000 reg: F2x158
[    5.981159] EDAC DEBUG: read_dct_base_mask:   DCSB0[7]=0x00000000 reg: F2x5c
[    5.981160] EDAC DEBUG: read_dct_base_mask:   DCSB1[7]=0x00000000 reg: F2x15c
[    5.981161] EDAC DEBUG: read_dct_base_mask:     DCSM0[0]=0x00f83ce0 reg: F2x60
[    5.981162] EDAC DEBUG: read_dct_base_mask:     DCSM1[0]=0x00000000 reg: F2x160
[    5.981163] EDAC DEBUG: read_dct_base_mask:     DCSM0[1]=0x00f83ce0 reg: F2x64
[    5.981164] EDAC DEBUG: read_dct_base_mask:     DCSM1[1]=0x00000000 reg: F2x164
[    5.981165] EDAC DEBUG: read_dct_base_mask:     DCSM0[2]=0x00000000 reg: F2x68
[    5.981166] EDAC DEBUG: read_dct_base_mask:     DCSM1[2]=0x00000000 reg: F2x168
[    5.981167] EDAC DEBUG: read_dct_base_mask:     DCSM0[3]=0x00000000 reg: F2x6c
[    5.981168] EDAC DEBUG: read_dct_base_mask:     DCSM1[3]=0x00000000 reg: F2x16c
[    5.981169] EDAC DEBUG: read_mc_regs:   DIMM type: Unbuffered-DDR2
[    5.981219] EDAC DEBUG: nb_mce_bank_enabled_on_node: core: 0, MCG_CTL: 0x3f, NB MSR is enabled
[    5.981221] EDAC DEBUG: nb_mce_bank_enabled_on_node: core: 1, MCG_CTL: 0x3f, NB MSR is enabled
[    5.981221] EDAC amd64: Node 0: DRAM ECC disabled.
[    5.981223] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
[    6.302561] EDAC amd64: F10h detected (node 0).
[    6.307276] EDAC DEBUG: reserve_mc_sibling_devs: F1: 0000:00:18.1
[    6.313630] EDAC DEBUG: reserve_mc_sibling_devs: F2: 0000:00:18.2
[    6.320589] EDAC DEBUG: reserve_mc_sibling_devs: F3: 0000:00:18.3
[    6.328359] EDAC DEBUG: read_mc_regs:   TOP_MEM:  0x00000000d0000000
[    6.335150] EDAC DEBUG: read_mc_regs:   TOP_MEM2: 0x0000000230000000
[    6.342188] EDAC DEBUG: read_dram_ctl_register: F2x110 (DCTSelLow): 0xffffffff, High range addrs at: 0xfffff800
[    6.353691] EDAC DEBUG: read_dram_ctl_register:   DCTs operate in ganged mode
[    6.361527] EDAC DEBUG: read_dram_ctl_register:   data interleave for ECC: enabled, DRAM cleared since last warm reset: yes
[    6.374204] EDAC DEBUG: read_dram_ctl_register:   channel interleave: enabled, interleave bits selector: 0x3
[    6.384343] EDAC DEBUG: read_mc_regs:   DRAM range[0], base: 0x0000ff0000000000; limit: 0x0000ff022fffffff
[    6.395942] EDAC DEBUG: read_mc_regs:    IntlvEn=Disabled; Range access: RW IntlvSel=0 DstNode=0
[    6.406619] EDAC DEBUG: read_dct_base_mask:   DCSB0[0]=0x00000001 reg: F2x40
[    6.414646] EDAC DEBUG: read_dct_base_mask:   DCSB1[0]=0x00000000 reg: F2x140
[    6.422526] EDAC DEBUG: read_dct_base_mask:   DCSB0[1]=0x00000101 reg: F2x44
[    6.430823] EDAC DEBUG: read_dct_base_mask:   DCSB1[1]=0x00000000 reg: F2x144
[    6.438710] EDAC DEBUG: read_dct_base_mask:   DCSB0[2]=0x00000201 reg: F2x48
[    6.446810] EDAC DEBUG: read_dct_base_mask:   DCSB1[2]=0x00000000 reg: F2x148
[    6.454788] EDAC DEBUG: read_dct_base_mask:   DCSB0[3]=0x00000301 reg: F2x4c
[    6.462743] EDAC DEBUG: read_dct_base_mask:   DCSB1[3]=0x00000000 reg: F2x14c
[    6.470585] EDAC DEBUG: read_dct_base_mask:   DCSB0[4]=0x00000000 reg: F2x50
[    6.478698] EDAC DEBUG: read_dct_base_mask:   DCSB1[4]=0x00000000 reg: F2x150
[    6.486624] EDAC DEBUG: read_dct_base_mask:   DCSB0[5]=0x00000000 reg: F2x54
[    6.494631] EDAC DEBUG: read_dct_base_mask:   DCSB1[5]=0x00000000 reg: F2x154
[    6.502866] EDAC DEBUG: read_dct_base_mask:   DCSB0[6]=0x00000000 reg: F2x58
[    6.510817] EDAC DEBUG: read_dct_base_mask:   DCSB1[6]=0x00000000 reg: F2x158
[    6.518602] EDAC DEBUG: read_dct_base_mask:   DCSB0[7]=0x00000000 reg: F2x5c
[    6.527120] EDAC DEBUG: read_dct_base_mask:   DCSB1[7]=0x00000000 reg: F2x15c
[    6.534926] EDAC DEBUG: read_dct_base_mask:     DCSM0[0]=0x00f83ce0 reg: F2x60
[    6.548356] EDAC DEBUG: read_dct_base_mask:     DCSM1[0]=0x00000000 reg: F2x160
[    6.560715] EDAC DEBUG: read_dct_base_mask:     DCSM0[1]=0x00f83ce0 reg: F2x64
[    6.568116] EDAC DEBUG: read_dct_base_mask:     DCSM1[1]=0x00000000 reg: F2x164
[    6.575596] EDAC DEBUG: read_dct_base_mask:     DCSM0[2]=0x00000000 reg: F2x68
[    6.584317] EDAC DEBUG: read_dct_base_mask:     DCSM1[2]=0x00000000 reg: F2x168
[    6.591899] EDAC DEBUG: read_dct_base_mask:     DCSM0[3]=0x00000000 reg: F2x6c
[    6.599460] EDAC DEBUG: read_dct_base_mask:     DCSM1[3]=0x00000000 reg: F2x16c
[    6.606877] EDAC DEBUG: read_mc_regs:   DIMM type: Unbuffered-DDR2
[    6.619722] EDAC DEBUG: nb_mce_bank_enabled_on_node: core: 0, MCG_CTL: 0x3f, NB MSR is enabled
[    6.628463] EDAC DEBUG: nb_mce_bank_enabled_on_node: core: 1, MCG_CTL: 0x3f, NB MSR is enabled
[    6.648232] EDAC amd64: Node 0: DRAM ECC disabled.
[    6.657843] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.

Yazen Ghannam Nov. 6, 2019, 6:16 p.m. UTC | #2

> -----Original Message-----
> From: linux-edac-owner@vger.kernel.org <linux-edac-owner@vger.kernel.org> On Behalf Of Borislav Petkov
> Sent: Wednesday, November 6, 2019 11:06 AM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc.
> 
> On Wed, Nov 06, 2019 at 01:24:59AM +0000, Ghannam, Yazen wrote:
> > From: Yazen Ghannam <yazen.ghannam@amd.com>
> >
> > Hi Boris,
> >
> > These patches address the issue where the module checks and complains
> > about DRAM ECC on nodes without memory.
> >
> > Changes from last revision:
> >   1) Dropped patch 6 which was for adding a grain value.
> >   2) Added an error code for !ecc_enabled() in patch 5.
> 
> Still doesn't help. The load gets attempted twice still. Try reproducing
> it on a small, single-node box where ECC is disabled.
> 

We had a thread before about userspace loading the module multiple times on
failure:
https://lore.kernel.org/linux-edac/20190822005020.GA403@angband.pl/

I tried to look into it a bit, but I didn't get very far.

So is the behavior you see only happening with the new patchset applied? That
may be a clue that we can fix this in the module.

Thanks,
Yazen

Borislav Petkov Nov. 6, 2019, 7:54 p.m. UTC | #3

On Wed, Nov 06, 2019 at 06:16:12PM +0000, Ghannam, Yazen wrote:
> We had a thread before about userspace loading the module multiple times on
> failure:
> https://lore.kernel.org/linux-edac/20190822005020.GA403@angband.pl/
> 
> I tried to look into it a bit, but I didn't get very far.

Right, I'll try to have a look soon, as it reproduces here.

> So is the behavior you see only happening with the new patchset applied? That
> may be a clue that we can fix this in the module.

Actually, it did try twice before your patchset and I didn't notice it
then because it wouldn't spit so much debug output. But that happens now
because your patchset pulls up the detection early. And without it we
had:

$ dmesg | grep -i edac
[    2.590869] EDAC MC: Ver: 3.0.0
[    2.594855] EDAC DEBUG: edac_mc_sysfs_init: device mc created
[    5.939351] EDAC DEBUG: nb_mce_bank_enabled_on_node: core: 0, MCG_CTL: 0x3f, NB MSR is enabled
[    5.948488] EDAC DEBUG: nb_mce_bank_enabled_on_node: core: 1, MCG_CTL: 0x3f, NB MSR is enabled
[    5.957312] EDAC amd64: Node 0: DRAM ECC disabled.
[    5.967746] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
[    6.031424] EDAC DEBUG: nb_mce_bank_enabled_on_node: core: 0, MCG_CTL: 0x3f, NB MSR is enabled
[    6.042173] EDAC DEBUG: nb_mce_bank_enabled_on_node: core: 1, MCG_CTL: 0x3f, NB MSR is enabled
[    6.052253] EDAC amd64: Node 0: DRAM ECC disabled.
[    6.057804] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.

which are also two attempts.

Anyway, I'll queue your set and I'll try to debug that thing because it
is getting on my nerves slowly...

Thx.

Borislav Petkov Nov. 7, 2019, 10:38 a.m. UTC | #4

On Wed, Nov 06, 2019 at 08:54:17PM +0100, Borislav Petkov wrote:
> which are also two attempts.
> 
> Anyway, I'll queue your set and I'll try to debug that thing because it
> is getting on my nerves slowly...

Yah, the problem is that because we have:

MODULE_DEVICE_TABLE(x86cpu, amd64_cpuids);

it gets tried on each CPU because a uevent gets dispatched for each
device, and each CPU is a device.

That's why I see it twice on this box - it has two CPUs.

And Greg says making it attempt once per system can't be done. Unless we
start doing hacks with sending uevents per BSP only which is too much.
Or we can remember the previous return value of the module init function
into edac_core but that's nasty too.
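A minimal userspace sketch of the behavior described above, and of the "remember the previous return value" idea that was rejected as too nasty. Everything here is hypothetical and for illustration only: fake_probe() stands in for the hardware detection, -ENODEV is an assumed error code, and the cached state models something that would have to live outside the module (e.g. in edac_core), since a module whose init fails is unloaded along with its static data.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the hardware probe; counts how often it runs. */
static int probe_calls;

int fake_probe(void)
{
	probe_calls++;
	return -ENODEV;		/* assumed error code for "ECC disabled" */
}

/* Hypothetical cache kept outside the module, so it survives a failed load. */
static bool have_cached;
static int cached_rc;

/* Current behavior: every load attempt probes the hardware again. */
int init_uncached(void)
{
	return fake_probe();
}

/* Rejected idea: reuse the result of the first attempt. */
int init_cached(void)
{
	if (!have_cached) {
		cached_rc = fake_probe();
		have_cached = true;
	}
	return cached_rc;
}

/*
 * One load attempt per CPU uevent. Resets the state first and returns
 * how many times the probe actually ran.
 */
int simulate_boot(int ncpus, int (*init)(void))
{
	assert(ncpus >= 0);

	probe_calls = 0;
	have_cached = false;

	for (int cpu = 0; cpu < ncpus; cpu++)
		(void)init();

	return probe_calls;
}
```

Calling simulate_boot(2, init_uncached) models the two-CPU box above, where the probe and its messages run once per CPU; with init_cached the probe runs only once regardless of the CPU count.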

I'm thinking we should simply kill this fat ecc_msg thing which is not
very useful and be done with it:

[    5.697275] EDAC MC: Ver: 3.0.0
[    5.909530] EDAC amd64: F10h detected (node 0).
[    6.345231] EDAC amd64: Node 0: DRAM ECC disabled.
[    6.370815] EDAC amd64: F10h detected (node 0).
[    6.370929] EDAC amd64: Node 0: DRAM ECC disabled.

That's probably still a bit annoying on a large machine but better than
nothing.

---
diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 3aeb5173e200..0738237e3f09 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -3188,18 +3188,6 @@ static void restore_ecc_error_reporting(struct ecc_settings *s, u16 nid,
 		amd64_warn("Error restoring NB MCGCTL settings!\n");
 }
 
-/*
- * EDAC requires that the BIOS have ECC enabled before
- * taking over the processing of ECC errors. A command line
- * option allows to force-enable hardware ECC later in
- * enable_ecc_error_reporting().
- */
-static const char *ecc_msg =
-	"ECC disabled in the BIOS or no ECC capability, module will not load.\n"
-	" Either enable ECC checking or force module loading by setting "
-	"'ecc_enable_override'.\n"
-	" (Note that use of the override may cause unknown side effects.)\n";
-
 static bool ecc_enabled(struct amd64_pvt *pvt)
 {
 	u16 nid = pvt->mc_node_id;
@@ -3246,11 +3234,10 @@ static bool ecc_enabled(struct amd64_pvt *pvt)
 	amd64_info("Node %d: DRAM ECC %s.\n",
 		   nid, (ecc_en ? "enabled" : "disabled"));
 
-	if (!ecc_en || !nb_mce_en) {
-		amd64_info("%s", ecc_msg);
+	if (!ecc_en || !nb_mce_en)
 		return false;
-	}
-	return true;
+	else
+		return true;
 }
 
 static inline void

Yazen Ghannam Nov. 7, 2019, 1:47 p.m. UTC | #5

> -----Original Message-----
> From: linux-edac-owner@vger.kernel.org <linux-edac-owner@vger.kernel.org> On Behalf Of Borislav Petkov
> Sent: Thursday, November 7, 2019 5:39 AM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc.
> 
> On Wed, Nov 06, 2019 at 08:54:17PM +0100, Borislav Petkov wrote:
> > which are also two attempts.
> >
> > Anyway, I'll queue your set and I'll try to debug that thing because it
> > is getting on my nerves slowly...
> 
> Yah, the problem is that because we have:
> 
> MODULE_DEVICE_TABLE(x86cpu, amd64_cpuids);
> 
> it gets tried on each CPU because a uevent gets dispatched for each
> device, and each CPU is a device.
> 
> That's why I see it twice on this box - it has two CPUs.
> 

Okay, that makes sense.

BTW, what do you think about loading based on PCI devices? The module
used to do this. I ask because I'm starting to see that future systems may
re-use PCI IDs, and this indicates the same level of hardware support.

> And Greg says making it attempt once per system can't be done. Unless we
> start doing hacks with sending uevents per BSP only which is too much.
> Or we can remember the previous return value of the module init function
> into edac_core but that's nasty too.
> 
> I'm thinking we should simply kill this fat ecc_msg thing which is not
> very useful and be done with it:
> 
> [    5.697275] EDAC MC: Ver: 3.0.0
> [    5.909530] EDAC amd64: F10h detected (node 0).
> [    6.345231] EDAC amd64: Node 0: DRAM ECC disabled.
> [    6.370815] EDAC amd64: F10h detected (node 0).
> [    6.370929] EDAC amd64: Node 0: DRAM ECC disabled.
> 
> That's probably still a bit annoying on a large machine but better than
> nothing.
> 

Yeah, I agree.

> ---
> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> index 3aeb5173e200..0738237e3f09 100644
> --- a/drivers/edac/amd64_edac.c
> +++ b/drivers/edac/amd64_edac.c
> @@ -3188,18 +3188,6 @@ static void restore_ecc_error_reporting(struct ecc_settings *s, u16 nid,
>  		amd64_warn("Error restoring NB MCGCTL settings!\n");
>  }
> 
> -/*
> - * EDAC requires that the BIOS have ECC enabled before
> - * taking over the processing of ECC errors. A command line
> - * option allows to force-enable hardware ECC later in
> - * enable_ecc_error_reporting().
> - */
> -static const char *ecc_msg =
> -	"ECC disabled in the BIOS or no ECC capability, module will not load.\n"
> -	" Either enable ECC checking or force module loading by setting "
> -	"'ecc_enable_override'.\n"
> -	" (Note that use of the override may cause unknown side effects.)\n";
> -
>  static bool ecc_enabled(struct amd64_pvt *pvt)
>  {
>  	u16 nid = pvt->mc_node_id;
> @@ -3246,11 +3234,10 @@ static bool ecc_enabled(struct amd64_pvt *pvt)
>  	amd64_info("Node %d: DRAM ECC %s.\n",
>  		   nid, (ecc_en ? "enabled" : "disabled"));
> 
> -	if (!ecc_en || !nb_mce_en) {
> -		amd64_info("%s", ecc_msg);
> +	if (!ecc_en || !nb_mce_en)
>  		return false;
> -	}
> -	return true;
> +	else
> +		return true;
>  }

Just a nit, but this else seems unnecessary, right?

Thanks,
Yazen

Borislav Petkov Nov. 7, 2019, 3:40 p.m. UTC | #6

On Thu, Nov 07, 2019 at 01:47:53PM +0000, Ghannam, Yazen wrote:
> BTW, what do you think about loading based on PCI devices? The module
> used to do this. I ask because I'm starting to see that future systems may
> re-use PCI IDs, and this indicates the same level of hardware support.

The reason we switched to family-based autoloading was that almost
every new platform would add a new PCI device ID, which would require
enablement work...

> Just a nit, but this else seems unnecessary, right?

Maybe it is easier if you look at the function end in the .c file directly as
diffs can be confusing:

static bool ecc_enabled(struct amd64_pvt *pvt)
{

	...

        amd64_info("Node %d: DRAM ECC %s.\n",
                   nid, (ecc_en ? "enabled" : "disabled"));

        if (!ecc_en || !nb_mce_en)
                return false;
        else
                return true;
}
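
A small standalone sketch, outside the kernel, of the two forms being discussed for the tail of ecc_enabled(), plus a single-expression form that nobody in this thread proposed. The function names are made up for illustration; only the ecc_en and nb_mce_en flags come from the driver. The point is that all three return the same value for every input, so the choice is purely about readability.

```c
#include <assert.h>
#include <stdbool.h>

/* Form in the patch: explicit if/else. */
bool ecc_ok_if_else(bool ecc_en, bool nb_mce_en)
{
	if (!ecc_en || !nb_mce_en)
		return false;
	else
		return true;
}

/* Form suggested in review: drop the else and fall through. */
bool ecc_ok_early_return(bool ecc_en, bool nb_mce_en)
{
	if (!ecc_en || !nb_mce_en)
		return false;

	return true;
}

/* Single-expression form (De Morgan), shown only for comparison. */
bool ecc_ok_expr(bool ecc_en, bool nb_mce_en)
{
	return ecc_en && nb_mce_en;
}

/* Exhaustively check that the three forms agree on all four inputs. */
void check_forms_agree(void)
{
	for (int a = 0; a <= 1; a++) {
		for (int b = 0; b <= 1; b++) {
			bool ref = ecc_ok_if_else(a, b);

			assert(ecc_ok_early_return(a, b) == ref);
			assert(ecc_ok_expr(a, b) == ref);
		}
	}
}
```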

Yazen Ghannam Nov. 7, 2019, 7:20 p.m. UTC | #7

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org <linux-kernel-owner@vger.kernel.org> On Behalf Of Borislav Petkov
> Sent: Thursday, November 7, 2019 10:40 AM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc.
> 
> On Thu, Nov 07, 2019 at 01:47:53PM +0000, Ghannam, Yazen wrote:
> > BTW, what do you think about loading based on PCI devices? The module
> > used to do this. I ask because I'm starting to see that future systems may
> > re-use PCI IDs, and this indicates the same level of hardware support.
> 
> The reason we switched to family-based autoloading was that almost
> every new platform would add a new PCI device ID, which would require
> enablement work...
> 

Yes, that's right. But it looks like future systems will re-use PCI IDs even
across families and models. And the PCI IDs will be more closely related to
hardware capabilities than family and model.

In any case, we can address that when we get there.

> > Just a nit, but this else seems unnecessary, right?
> 
> Maybe it is easier if you look at the function end in the .c file directly as
> diffs can be confusing:
> 
> static bool ecc_enabled(struct amd64_pvt *pvt)
> {
> 
> 	...
> 
>         amd64_info("Node %d: DRAM ECC %s.\n",
>                    nid, (ecc_en ? "enabled" : "disabled"));
> 
>         if (!ecc_en || !nb_mce_en)
>                 return false;
>         else

Right, I meant you can drop this else and just return true.

>                 return true;
> }
> 

Thanks,
Yazen

Borislav Petkov Nov. 7, 2019, 7:34 p.m. UTC | #8

On Thu, Nov 07, 2019 at 07:20:25PM +0000, Ghannam, Yazen wrote:
> Yes, that's right. But it looks like future systems will re-use PCI IDs even
> across families and models. And the PCI IDs will be more closely related to
> hardware capabilities than family and model.
> 
> In any case, we can address that when we get there.

I'd be fine with it if this really is the case and we don't end up
having to keep adding PCI IDs like crazy again. That was a moderate
PITA, AFAIR, especially for distro kernels having to constantly pick up
enablement patches and people complaining about it.

So you need to make sure the PCI IDs will really get reused before
converting back...

> >         if (!ecc_en || !nb_mce_en)
> >                 return false;
> >         else
> 
> Right, I meant you can drop this else and just return true.
> 
> >                 return true;

I prefer the regular if-else way because it reads faster and it is
straightforward when one skims over the code.

But I can drop it if you insist. :-)

Yazen Ghannam Nov. 7, 2019, 7:41 p.m. UTC | #9

> -----Original Message-----
> From: linux-edac-owner@vger.kernel.org <linux-edac-owner@vger.kernel.org> On Behalf Of Borislav Petkov
> Sent: Thursday, November 7, 2019 2:34 PM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc.
> 
> On Thu, Nov 07, 2019 at 07:20:25PM +0000, Ghannam, Yazen wrote:
> > Yes, that's right. But it looks like future systems will re-use PCI IDs even
> > across families and models. And the PCI IDs will be more closely related to
> > hardware capabilities than family and model.
> >
> > In any case, we can address that when we get there.
> 
> I'd be fine with it if this really is the case and we don't end up
> having to keep adding PCI IDs like crazy again. That was a moderate
> PITA, AFAIR, especially for distro kernels having to constantly pick up
> enablement patches and people complaining about it.
> 
> So you need to make sure the PCI IDs will really get reused before
> converting back...
> 

Will do.

> > >         if (!ecc_en || !nb_mce_en)
> > >                 return false;
> > >         else
> >
> > Right, I meant you can drop this else and just return true.
> >
> > >                 return true;
> 
> I prefer the regular if-else way because it reads faster and it is
> straightforward when one skims over the code.
> 
> But I can drop it if you insist. :-)
> 

No, I don't mind.

Thanks,
Yazen
