[RFC] EDAC, ghes: Enable per-layer error reporting for ARM

Message ID	1531762009-15112-1-git-send-email-tbaicar@codeaurora.org (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org> sender: tbaicar@smtp.codeaurora.org) by smtp.codeaurora.org (Postfix) with ESMTPSA id F230060274; Mon, 16 Jul 2018 17:27:05 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 smtp.codeaurora.org F230060274 From: Tyler Baicar <tbaicar@codeaurora.org> To: mchehab@kernel.org, bp@alien8.de, james.morse@arm.com, linux-edac@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH] EDAC, ghes: Enable per-layer error reporting for ARM Date: Mon, 16 Jul 2018 13:26:49 -0400 Message-Id: <1531762009-15112-1-git-send-email-tbaicar@codeaurora.org> Precedence: list Cc: Tyler Baicar <tbaicar@codeaurora.org> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org

Tyler Baicar July 16, 2018, 5:26 p.m. UTC

Enable per-layer error reporting for ARM systems so that the error
counters are incremented per-DIMM.

On ARM systems that use firmware first error handling it is understood
that card=channel and module=DIMM on that channel. Populate that
information and enable per layer error reporting for ARM systems so that
the EDAC error counters are incremented based on DIMM number as per the
SMBIOS table rather than just incrementing the noinfo counters on the
memory controller.

Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
---
 drivers/edac/ghes_edac.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

Borislav Petkov July 19, 2018, 2:01 p.m. UTC | #1

On Mon, Jul 16, 2018 at 01:26:49PM -0400, Tyler Baicar wrote:
> Enable per-layer error reporting for ARM systems so that the error
> counters are incremented per-DIMM.
> 
> On ARM systems that use firmware first error handling it is understood
> that card=channel and module=DIMM on that channel. Populate that
> information and enable per layer error reporting for ARM systems so that
> the EDAC error counters are incremented based on DIMM number as per the
> SMBIOS table rather than just incrementing the noinfo counters on the
> memory controller.

I guess.

James?

> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
> ---
>  drivers/edac/ghes_edac.c | 15 ++++++++++++---
>  1 file changed, 12 insertions(+), 3 deletions(-)
> diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
> index 473aeec..e4c8b6e 100644
> --- a/drivers/edac/ghes_edac.c
> +++ b/drivers/edac/ghes_edac.c
> @@ -213,9 +213,18 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
>  	strcpy(e->label, "unknown label");
>  	e->msg = pvt->msg;
>  	e->other_detail = pvt->other_detail;
> -	e->top_layer = -1;
> -	e->mid_layer = -1;
> -	e->low_layer = -1;

<----- newline here.

> +	if ((IS_ENABLED(CONFIG_ARM) || IS_ENABLED(CONFIG_ARM64))
> +	    && (mem_err->validation_bits & CPER_MEM_VALID_CARD)
> +	    && (mem_err->validation_bits & CPER_MEM_VALID_MODULE)) {
> +		e->top_layer = mem_err->card;
> +		e->mid_layer = mem_err->module;
> +		e->low_layer = -1;
> +		e->enable_per_layer_report = true;
> +	} else {
> +		e->top_layer = -1;
> +		e->mid_layer = -1;
> +		e->low_layer = -1;
> +	}

ditto.

>  	*pvt->other_detail = '\0';
>  	*pvt->msg = '\0';
>  
> --

James Morse July 19, 2018, 2:46 p.m. UTC | #2

Hi guys,

On 19/07/18 15:01, Borislav Petkov wrote:
> On Mon, Jul 16, 2018 at 01:26:49PM -0400, Tyler Baicar wrote:
>> Enable per-layer error reporting for ARM systems so that the error
>> counters are incremented per-DIMM.
>>
>> On ARM systems that use firmware first error handling it is understood

understood by whom? Is this written down somewhere, or is it the convention. (in
which case, lets get it written down somewhere)

>> that card=channel and module=DIMM on that channel. Populate that
I'm guessing this is the mapping between CPER records and the DMItable data.

>> information and enable per layer error reporting for ARM systems so that
>> the EDAC error counters are incremented based on DIMM number as per the
>> SMBIOS table rather than just incrementing the noinfo counters on the
>> memory controller.

Does this work on x86, and its just the dmi/cper fields have a subtle difference?

> I guess.
> 
> James?

I don't know anything about this stuff. Looks like the SMBIOS specification is
my new bedtime reading.

Thanks,

James

Tyler Baicar July 19, 2018, 6:36 p.m. UTC | #3

On 7/19/2018 10:46 AM, James Morse wrote:
> On 19/07/18 15:01, Borislav Petkov wrote:
>> On Mon, Jul 16, 2018 at 01:26:49PM -0400, Tyler Baicar wrote:
>>> Enable per-layer error reporting for ARM systems so that the error
>>> counters are incremented per-DIMM.
>>>
>>> On ARM systems that use firmware first error handling it is understood
> understood by whom? Is this written down somewhere, or is it the convention. (in
> which case, lets get it written down somewhere)
Hey Boris, James,

It has just been convention, but Harb recently brought up the idea of adding it 
to SBBR.
>>> that card=channel and module=DIMM on that channel. Populate that
> I'm guessing this is the mapping between CPER records and the DMItable data.
Unfortunately the DMI table doesn't actually have channel and DIMM number values 
which
makes this more complicated than I originally thought...
>>> information and enable per layer error reporting for ARM systems so that
>>> the EDAC error counters are incremented based on DIMM number as per the
>>> SMBIOS table rather than just incrementing the noinfo counters on the
>>> memory controller.
> Does this work on x86, and its just the dmi/cper fields have a subtle difference?
There are CPU specific EDAC drivers for a lot of x86 folks and those drivers 
populate the layer information
in a custom way.

With more investigation and testing it turns out a simple patch like this is not 
going to work. This worked for
me on a 1DPC board since the card number turned out to always be the same as the 
index into the DMI table
to find the proper DIMM. On a 2DPC board this fails completely. The ghes_edac 
driver only sets up a single
layer so it is only using the card number with this patch. That setup can be 
seen here:

https://elixir.bootlin.com/linux/v4.18-rc5/source/drivers/edac/ghes_edac.c#L469

So it is only setting up a single layer with all the DIMMs on that layer. In 
order to properly enable the layers
to represent channel and DIMM number on that channel, we would need to have a 
way of determining the
number of channels (which would be layers[0].size) and the number of DIMMs each 
channel supported
(layers[1].size). There doesn't appear to be a way to determine that information 
at this point.

With the current ghes_edac setup, it seems the only way this could work would be 
to have the firmware
always report the module value to be the index into the DMI table that this DIMM 
information lives. When I
say index into the DMI table, I'm meaning the index into the list of "type 17" 
DMI entries. So, DIMM number
doesn't actually matter, what really matters is the ordering of the type 17 
entries in the DMI table.

This seems pretty hacky to me, so if anyone has other suggestions please share 
them. The goal is to be able to
enable the per layer error reporting in the ghes_edac driver so that the per 
dimm counters exposed in the
EDAC sysfs nodes are properly updated. The other obvious but more messy way 
would be to have notifiers
register to be called by ghes_edac and have a custom EDAC driver for each CPU to 
properly populate their layer
information.

Thanks,
Tyler

Borislav Petkov July 20, 2018, 4:10 a.m. UTC | #4

On Thu, Jul 19, 2018 at 02:36:21PM -0400, Tyler Baicar wrote:
> With the current ghes_edac setup, it seems the only way this could
> work would be to have the firmware always report the module value to

My experience with firmware so far is that it is a lost cause,
considering all the bugs, snafus and incompleteness it demonstrates...

> The other obvious but more messy way would be to have notifiers
> register to be called by ghes_edac and have a custom EDAC driver for
> each CPU to properly populate their layer information.

... which is what we do on x86 and whitelist only known-good platforms
in ghes_edac, which claim that their fw info is correct.

James Morse Aug. 23, 2018, 9:28 a.m. UTC | #5

Hi guys,

(CC: +Fan Wu)

On 19/07/18 19:36, Tyler Baicar wrote:
> On 7/19/2018 10:46 AM, James Morse wrote:
>> On 19/07/18 15:01, Borislav Petkov wrote:
>>> On Mon, Jul 16, 2018 at 01:26:49PM -0400, Tyler Baicar wrote:
>>>> Enable per-layer error reporting for ARM systems so that the error
>>>> counters are incremented per-DIMM.

This 'layer' term seems to be EDAC's artificial view of memory.

>> I'm guessing this is the mapping between CPER records and the DMItable data.

> Unfortunately the DMI table doesn't actually have channel and DIMM number values
> which makes this more complicated than I originally thought...

>>>> information and enable per layer error reporting for ARM systems so that
>>>> the EDAC error counters are incremented based on DIMM number as per the
>>>> SMBIOS table rather than just incrementing the noinfo counters on the
>>>> memory controller.

>> Does this work on x86, and its just the dmi/cper fields have a subtle difference?

> There are CPU specific EDAC drivers for a lot of x86 folks and those drivers
> populate the layer information in a custom way.

Not for GHES surely?

> With more investigation and testing it turns out a simple patch like this is not
> going to work. This worked for
> me on a 1DPC board since the card number turned out to always be the same as the
> index into the DMI table
> to find the proper DIMM. On a 2DPC board this fails completely. The ghes_edac
> driver only sets up a single
> layer so it is only using the card number with this patch.

(DPC == DIMM per Channel?)

> So it is only setting up a single layer with all the DIMMs on that layer. In
> order to properly enable the layers
> to represent channel and DIMM number on that channel, we would need to have a
> way of determining the
> number of channels (which would be layers[0].size) and the number of DIMMs each
> channel supported
> (layers[1].size). There doesn't appear to be a way to determine that information
> at this point.

So, we're after a mapping for EDAC:layers that includes 'channel'?

What would you do with a channel counter? Isn't that part of the SoC? (it can't
be replaced!)

> The goal is to be able to enable the per layer error reporting in the ghes_edac
> driver so that the per dimm counters exposed in the EDAC sysfs nodes are properly
> updated.

What do you mean by layer? I can't find anything in the ACPI/UEFI/SMBIOS specs
that uses this term...

If its just 'per dimm counters' you're after, this looks straightforward.

> The other obvious but more messy way would be to have notifiers register to be called
> by ghes_edac and have a custom EDAC driver for each CPU to properly populate their layer
> information.

Yuck. This isn't platform specific, its firmware specific. You're hooking the
driver to say "my firmware thinks 'card' means this".

Where would this information come from? We don't want
per-soc,per-firmware-version code mangling what was supposed to be a standard
interface.

... we can't be the first people to try and do this ...

[re-ordered hunk:]
> This seems pretty hacky to me, so if anyone has other suggestions please share
> them.

CPER's "Memory Error Record 2" thinks that "NODE, CARD and MODULE should provide
the information necessary to identify the failing FRU". As EDAC has three
'levels', these are what they should correspond to for ghes-edac.

I assume NODE means rack/chassis in some distributed system. Lets ignore it as
it doesn't seem to map to anything in the SMBIOS table.

The CPER record's card and module numbers are useless to us, as we need to know
how many there will be in advance. (does this version of firmware count from 0
or 1?)

... but CPER also gives us a 'Card Handle' and 'Module Handle'.
'Module Handle' maps to SMBIOS:17 Memory Device (aka, a DIMM). The Handle is a
word-value in the structure, so it doesn't depend on the layout/parse-order of
the SMBIOS tables. When we count the DIMMs in edac-ghes we can give them some
level-idx, then use the handle to find which level-idx to use for this DIMM.

ghes_edac_report_mem_error() already picks up the module-handle, but only uses
it to print the bank/device.

'Card' doesn't mean much to me, but it maps to SMBIOS:17 "Memory Array
Structure", which the Memory Device structure also points to.
Card then must mean "a collection of memory devices (DIMMs) that operate
together to form an address space".

This might be what I think of as a memory-controller, or it might be something
more complicated. Regardless, the CPER records think its relevant.

For the edac:layers, we could walk the DMI table to find these structures, and
build the layers from them. If the Memory-array-structures are missing, we can
use the existing 1:NUM_DIMMS approach.

Hope this makes sense!

Thanks,

James

Tyler Baicar Aug. 23, 2018, 3:46 p.m. UTC | #6

Hello James,

On Thu, Aug 23, 2018 at 5:29 AM James Morse <james.morse@arm.com> wrote:
> On 19/07/18 19:36, Tyler Baicar wrote:
> > On 7/19/2018 10:46 AM, James Morse wrote:
> >> On 19/07/18 15:01, Borislav Petkov wrote:
> >>> On Mon, Jul 16, 2018 at 01:26:49PM -0400, Tyler Baicar wrote:
> >>>> Enable per-layer error reporting for ARM systems so that the error
> >>>> counters are incremented per-DIMM.
>
> This 'layer' term seems to be EDAC's artificial view of memory.
>

Yes, it's just the terminology that EDAC uses for locating a DIMM.

"Layer" can mean several things here:

https://elixir.bootlin.com/linux/latest/source/include/linux/edac.h#L318

We should be able to avoid the layer definitions with the SMBIOS handles.

>
> >> Does this work on x86, and its just the dmi/cper fields have a subtle difference?
>
> > There are CPU specific EDAC drivers for a lot of x86 folks and those drivers
> > populate the layer information in a custom way.
>
> Not for GHES surely?
>

Correct, the x86 drivers that properly increment the DIMM error counters are not
tied to the ghes_edac driver.

>
> (DPC == DIMM per Channel?)
>

Yes.

> > The goal is to be able to enable the per layer error reporting in the ghes_edac
> > driver so that the per dimm counters exposed in the EDAC sysfs nodes are properly
> > updated.
>
> What do you mean by layer? I can't find anything in the ACPI/UEFI/SMBIOS specs
> that uses this term...
>
> If its just 'per dimm counters' you're after, this looks straightforward.
>

Yes, we just need a way to increment the per DIMM counters that are
exposed by the
EDAC sysfs nodes.

> [re-ordered hunk:]
> > This seems pretty hacky to me, so if anyone has other suggestions please share
> > them.
>
> CPER's "Memory Error Record 2" thinks that "NODE, CARD and MODULE should provide
> the information necessary to identify the failing FRU". As EDAC has three
> 'levels', these are what they should correspond to for ghes-edac.
>
> I assume NODE means rack/chassis in some distributed system. Lets ignore it as
> it doesn't seem to map to anything in the SMBIOS table.

I believe NODE should map to socket number for multi-socket systems.

> The CPER record's card and module numbers are useless to us, as we need to know
> how many there will be in advance. (does this version of firmware count from 0
> or 1?)
>
> ... but CPER also gives us a 'Card Handle' and 'Module Handle'.
> 'Module Handle' maps to SMBIOS:17 Memory Device (aka, a DIMM). The Handle is a
> word-value in the structure, so it doesn't depend on the layout/parse-order of
> the SMBIOS tables. When we count the DIMMs in edac-ghes we can give them some
> level-idx, then use the handle to find which level-idx to use for this DIMM.
>
> ghes_edac_report_mem_error() already picks up the module-handle, but only uses
> it to print the bank/device.
>
> 'Card' doesn't mean much to me, but it maps to SMBIOS:17 "Memory Array
> Structure", which the Memory Device structure also points to.
> Card then must mean "a collection of memory devices (DIMMs) that operate
> together to form an address space".
>
> This might be what I think of as a memory-controller, or it might be something
> more complicated. Regardless, the CPER records think its relevant.
>
> For the edac:layers, we could walk the DMI table to find these structures, and
> build the layers from them. If the Memory-array-structures are missing, we can
> use the existing 1:NUM_DIMMS approach.
>

I think the proper way to get this working would be to use these handles. We can
avoid populating this layer information and instead have a mapping of type 17
index number (how edac is numbering the DIMMs today) to the handle number.
Then we will need a new function to increment the counter based on the handle
number rather than this layer information. Is that how you are envisioning it?

Thanks,
Tyler

James Morse Aug. 24, 2018, 9:48 a.m. UTC | #7

Hi Tyler,

On 23/08/18 16:46, Tyler Baicar wrote:
> On Thu, Aug 23, 2018 at 5:29 AM James Morse <james.morse@arm.com> wrote:
>> On 19/07/18 19:36, Tyler Baicar wrote:
>>> On 7/19/2018 10:46 AM, James Morse wrote:
>>>> On 19/07/18 15:01, Borislav Petkov wrote:
>>>>> On Mon, Jul 16, 2018 at 01:26:49PM -0400, Tyler Baicar wrote:
>>>>>> Enable per-layer error reporting for ARM systems so that the error
>>>>>> counters are incremented per-DIMM.
>>
>> This 'layer' term seems to be EDAC's artificial view of memory.
>>
> 
> Yes, it's just the terminology that EDAC uses for locating a DIMM.
> 
> "Layer" can mean several things here:
> 
> https://elixir.bootlin.com/linux/latest/source/include/linux/edac.h#L318

Aha, its an enum. I thought it was an upper/middle/lower mapping at the whim of
the edac driver.

[...]

>> [re-ordered hunk:]
>>> This seems pretty hacky to me, so if anyone has other suggestions please share
>>> them.
>>
>> CPER's "Memory Error Record 2" thinks that "NODE, CARD and MODULE should provide
>> the information necessary to identify the failing FRU". As EDAC has three
>> 'levels', these are what they should correspond to for ghes-edac.
>>
>> I assume NODE means rack/chassis in some distributed system. Lets ignore it as
>> it doesn't seem to map to anything in the SMBIOS table.
> 
> I believe NODE should map to socket number for multi-socket systems.

Isn't the Memory Array Structure still unique in a multi-socket system? If so
the node isn't telling us anything new.

Do sockets show up in the SMBIOS table? We would need to know how many there are
in advance. For arm systems the cpu topology from PPTT is the best bet for this
information, but what do we do if that table is missing? (also, does firmware
count from 1 or 0?) I suspect we can't use this field unless we know what the
range of values is going to be in advance.

I assumed this node must be a level of information above Card/Memory-Array's
address-space. Somehow the Card handle isn't no long unique, we need the node
number too. If the CPER records were all being pumped at a single agent, (shared
BMC in a blade/chassis thing) then this might matter. I suspect we can ignore it
in linux.

>> The CPER record's card and module numbers are useless to us, as we need to know
>> how many there will be in advance. (does this version of firmware count from 0
>> or 1?)
>>
>> ... but CPER also gives us a 'Card Handle' and 'Module Handle'.
>> 'Module Handle' maps to SMBIOS:17 Memory Device (aka, a DIMM). The Handle is a
>> word-value in the structure, so it doesn't depend on the layout/parse-order of
>> the SMBIOS tables. When we count the DIMMs in edac-ghes we can give them some
>> level-idx, then use the handle to find which level-idx to use for this DIMM.
>>
>> ghes_edac_report_mem_error() already picks up the module-handle, but only uses
>> it to print the bank/device.
>>
>> 'Card' doesn't mean much to me, but it maps to SMBIOS:17 "Memory Array
>> Structure", which the Memory Device structure also points to.
>> Card then must mean "a collection of memory devices (DIMMs) that operate
>> together to form an address space".
>>
>> This might be what I think of as a memory-controller, or it might be something
>> more complicated. Regardless, the CPER records think its relevant.
>>
>> For the edac:layers, we could walk the DMI table to find these structures, and
>> build the layers from them. If the Memory-array-structures are missing, we can
>> use the existing 1:NUM_DIMMS approach.

> I think the proper way to get this working would be to use these handles. We can
> avoid populating this layer information and instead have a mapping of type 17
> index number (how edac is numbering the DIMMs today) to the handle number.

Why get avoid the layer stuff? Isn't counting DIMM/memory-devices what
EDAC_MC_LAYER_SLOT is for?

> Then we will need a new function to increment the counter based on the handle
> number rather than this layer information. Is that how you are envisioning it?

I'm not familiar with edac's internals, so I didn't have any particular vision!

Isn't the problem that ghes_edac_report_mem_error() does this:
|	e->top_layer = -1;
|	e->mid_layer = -1;
|	e->low_layer = -1;

so edac_raw_mc_handle_error() has no clue where the error happened. (I haven't
read what it does with this information yet).

ghes_edac_report_mem_error() does check CPER_MEM_VALID_MODULE_HANDLE, and if its
set, it uses the handle to find the bank/device strings and prints them out.

Naively I thought we could generate some index during ghes_edac_count_dimms(),
and use this as e->${whichever}_layer. I hoped there would be something we could
already use as the index, but I can't spot it, so this will be more than the
one-liner I was hoping for!

Thanks,

James

Borislav Petkov Aug. 24, 2018, 12:01 p.m. UTC | #8

On Fri, Aug 24, 2018 at 10:48:24AM +0100, James Morse wrote:
> Why get avoid the layer stuff? Isn't counting DIMM/memory-devices what
> EDAC_MC_LAYER_SLOT is for?

Yap.

> so edac_raw_mc_handle_error() has no clue where the error happened. (I haven't
> read what it does with this information yet).

See edac_inc_ce_error(), for example - it uses the layers which are not
negative (-1) to increment the error counts of the respective layer. It
all depends on what granularity of the hardware part you're reporting
the error for: is it a DIMM rank, a whole DIMM or for a channel which
can span multiple DIMM ranks. And so on...

Look at some of the drivers and how they're doing that layering. It all
depends on whether you can get the precise info from the hw.

> ghes_edac_report_mem_error() does check CPER_MEM_VALID_MODULE_HANDLE, and if its
> set, it uses the handle to find the bank/device strings and prints them out.

Yap, and the error counts are lumped together into

  /sys/devices/system/edac/mc/mc*/ce_noinfo_count

> Naively I thought we could generate some index during ghes_edac_count_dimms(),
> and use this as e->${whichever}_layer. I hoped there would be something we could
> already use as the index, but I can't spot it, so this will be more than the
> one-liner I was hoping for!

If you can get that info from the hardware and injecting an error into
a DIMM gives you the correct DIMM number so that we can increment the
proper counter, then you're golden. I don't think that works reliably on
x86, though, therefore the lumping together.

Fan Wu Aug. 24, 2018, 2:30 p.m. UTC | #9

Hi James, 
 
> Why get avoid the layer stuff? Isn't counting DIMM/memory-devices what
> EDAC_MC_LAYER_SLOT is for?

Borislav has explained it in his response. Here let me elaborate a little more. To use the layer information you need an accurate way to pinpoint each component in the layer and the parent components in the layers above. For example, to use EDAC_MC_LAYER_SLOT you also need information for the parent layer say EDAC_MC_LAYER_CHANNEL, or another layer on top say EDAC_MC_LAYER_BRANCH. There are no clear ways to get the information from SMBIOS table. In the case of "memory channel" we looked at type 37 which has the exact spelling but it was introduced to support RamBus and Synclink. Not sure we can readily use it for modern architecture concept of "channel/slot". 

 I think it is good enough if we can pin each error to the corresponding DIMM. At the end of the day DIMMs are what customer can replace in the memory system and that's all that they care about. For the manufacturers of the board/chips they have the knowledge to map the specific DIMMs to the upper layer components, so they can easily collect error counter data for upper layers. 

> CPER's "Memory Error Record 2" thinks that "NODE, CARD and MODULE
> should provide the information necessary to identify the failing FRU". As
> EDAC has three 'levels', these are what they should correspond to for ghes-
> edac.
> 
> I assume NODE means rack/chassis in some distributed system. Lets ignore it
> as it doesn't seem to map to anything in the SMBIOS table.

How about type 4 "Processor Information"?

> 'Card' doesn't mean much to me, but it maps to SMBIOS:17 "Memory Array
> Structure", which the Memory Device structure also points to.
> Card then must mean "a collection of memory devices (DIMMs) that operate
> together to form an address space".
> 
> This might be what I think of as a memory-controller, or it might be
> something more complicated. Regardless, the CPER records think its relevant.

Originally I thought "Card" were memory channel. But looking at the definition of "Card Handle" in CPER: "... this field contains the SMBIOS handle for the Type 16 Memory Array Structure that represents the memory card". So Card is memory controller or something similar to that. Right now ghes-edac assumes one mc. We probably need to map mc(s) to the type 16 instances in SMBIOS table. 

Thanks,
Fan

Tyler Baicar Aug. 24, 2018, 3:14 p.m. UTC | #10

On Fri, Aug 24, 2018 at 5:48 AM, James Morse <james.morse@arm.com> wrote:
> On 23/08/18 16:46, Tyler Baicar wrote:
>> On Thu, Aug 23, 2018 at 5:29 AM James Morse <james.morse@arm.com> wrote:
>>> On 19/07/18 19:36, Tyler Baicar wrote:
>>>> This seems pretty hacky to me, so if anyone has other suggestions please share
>>>> them.
>>>
>>> CPER's "Memory Error Record 2" thinks that "NODE, CARD and MODULE should provide
>>> the information necessary to identify the failing FRU". As EDAC has three
>>> 'levels', these are what they should correspond to for ghes-edac.
>>>
>>> I assume NODE means rack/chassis in some distributed system. Lets ignore it as
>>> it doesn't seem to map to anything in the SMBIOS table.
>>
>> I believe NODE should map to socket number for multi-socket systems.
>
> Isn't the Memory Array Structure still unique in a multi-socket system? If so
> the node isn't telling us anything new.

Yes, the Memory Array structure in SMBIOS is still unique, but the NODE value
is needed in NODE, CARD, MODULE because the CARD number here typically
maps to channel number which each socket has their own channel numbers.

(i.e. socket 0 can have channel 0 and socket 1 can have a channel 0)

> Do sockets show up in the SMBIOS table? We would need to know how many there are
> in advance. For arm systems the cpu topology from PPTT is the best bet for this
> information, but what do we do if that table is missing? (also, does firmware
> count from 1 or 0?) I suspect we can't use this field unless we know what the
> range of values is going to be in advance.

An Fan mentioned in his response, what the customers really care about
is mapping to
a particular DIMM since that is what they can replace. To do this, the
Memory Device
handle should be enough since those are all unique regardless of
Memory Array handle
and which socket the DIMM is on. The Firmware I've worked with counts
from 0, but I'm
not sure if that is required. That won't matter if we just use the
Memory Device handle.

>> I think the proper way to get this working would be to use these handles. We can
>> avoid populating this layer information and instead have a mapping of type 17
>> index number (how edac is numbering the DIMMs today) to the handle number.
>
> Why get avoid the layer stuff? Isn't counting DIMM/memory-devices what
> EDAC_MC_LAYER_SLOT is for?

The problem with the layer reporting is that you need to know all the
layer information
as Fan mentioned. SoCs can support multiple board combinations (ie
1DPC vs. 2DPC)
and there is no standardized way of knowing whether you are booted on a 1DPC or
2DPC board.

>> Then we will need a new function to increment the counter based on the handle
>> number rather than this layer information. Is that how you are envisioning it?
>
> I'm not familiar with edac's internals, so I didn't have any particular vision!
>
> Isn't the problem that ghes_edac_report_mem_error() does this:
> |       e->top_layer = -1;
> |       e->mid_layer = -1;
> |       e->low_layer = -1;

The other problem is that the sysfs nodes are all setup with a single
layer representing
all of the memory on the board.

https://elixir.bootlin.com/linux/latest/source/drivers/edac/ghes_edac.c#L469

So the DIMM counters exposed in sysfs are all under a single memory
controller and just
numbered from 0 to n-1 based on the order in which the type 17 SMBIOS
entries show up
in the DMI walk.

> so edac_raw_mc_handle_error() has no clue where the error happened. (I haven't
> read what it does with this information yet).
>
> ghes_edac_report_mem_error() does check CPER_MEM_VALID_MODULE_HANDLE, and if its
> set, it uses the handle to find the bank/device strings and prints them out.

Yes, I think this is where we need to add support to increment the
count based on that module
handle.

> Naively I thought we could generate some index during ghes_edac_count_dimms(),
> and use this as e->${whichever}_layer. I hoped there would be something we could
> already use as the index, but I can't spot it, so this will be more than the
> one-liner I was hoping for!

We could use what ghes_edac_register does by setting up a single layer
with all memory and
then keep a map of which module handle maps to which index into that
layer. From that it would
be easy to increment the proper sysfs exposed DIMM counters using the
single layer (that way
we can probably avoid the custom increment function I eluded to in my
previous response).

Thanks,
Tyler

James Morse Aug. 28, 2018, 5:09 p.m. UTC | #11

Hi Boris,

On 24/08/18 13:01, Borislav Petkov wrote:
> On Fri, Aug 24, 2018 at 10:48:24AM +0100, James Morse wrote:
>> so edac_raw_mc_handle_error() has no clue where the error happened. (I haven't
>> read what it does with this information yet).
> 
> See edac_inc_ce_error(), for example - it uses the layers which are not
> negative (-1) to increment the error counts of the respective layer. It
> all depends on what granularity of the hardware part you're reporting
> the error for: is it a DIMM rank, a whole DIMM or for a channel which
> can span multiple DIMM ranks. And so on...
> 
> Look at some of the drivers and how they're doing that layering. It all
> depends on whether you can get the precise info from the hw.

Hmmm, in this example we need the information from firmware, as that is where
ghes-edac gets its information from.

We already count the module/device/dimms in the smbios table, memory is
described as 'EDAC_MC_LAYER_ALL_MEM' with num_dimms. I think all we're missing
is which dimm in ghes_edac_report_mem_error(). We have the handle, we just need
a number between 1 and num_dimms.

If it turns out firmware doesn't populate the handles in its cper records, then
we can keep e->enable_per_layer_report false when calling
edac_raw_mc_handle_error().

(I suggest we ignore 'card', and just do this for the device/dimms).

>> Naively I thought we could generate some index during ghes_edac_count_dimms(),
>> and use this as e->${whichever}_layer. I hoped there would be something we could
>> already use as the index, but I can't spot it, so this will be more than the
>> one-liner I was hoping for!
> 
> If you can get that info from the hardware and injecting an error into
> a DIMM gives you the correct DIMM number so that we can increment the
> proper counter, then you're golden. I don't think that works reliably on
> x86, though, therefore the lumping together.

... 'correct DIMM number' ...

Does x86 have another source of memory-topology information it needs to
correlate smbios with?

For arm there is nothing else describing the memory-topology, so as long as we
can correlate the smbios table and ghes:cper records through the handles, we can
get this working for all systems.

Thanks,

James

James Morse Aug. 28, 2018, 5:09 p.m. UTC | #12

Hi Fan,

On 24/08/18 15:30, wufan wrote:
>> Why get avoid the layer stuff? Isn't counting DIMM/memory-devices what
>> EDAC_MC_LAYER_SLOT is for?
> 
> Borislav has explained it in his response. Here let me elaborate a little more.
> To use the layer information you need an accurate way to pinpoint each component
> in the layer and the parent components in the layers above. For example, to use
> EDAC_MC_LAYER_SLOT you also need information for the parent layer say
> EDAC_MC_LAYER_CHANNEL, or another layer on top say EDAC_MC_LAYER_BRANCH.

I haven't spotted anything that forces a particular meaning/topology on these
types. (there are four of them, but only three 'levels')

i3000_edac.c has EDAC_MC_LAYER_CHIP_SELECT then EDAC_MC_LAYER_CHANNEL,
but i5400_edac.c has EDAC_MC_LAYER_BRANCH then EDAC_MC_LAYER_CHANNEL.
pnd2_edac.c agrees that channel comes before slot, but thunderx_edac.c just uses
slot directly....

ghes-edac already describes memory as 'EDAC_MC_LAYER_ALL_MEM', with num_dimms, I
think we just need to get 'the' dimm number. Using the cper-module-handle means
we don't have to worry about firmware's count of dimms being different, as we
both agree its smbios-type-17 we're talking about.

> There
> are no clear ways to get the information from SMBIOS table. In the case of "memory
> channel" we looked at type 37 which has the exact spelling but it was introduced
> to support RamBus and Synclink. Not sure we can readily use it for modern
> architecture concept of "channel/slot". 

I think we should avoid this 'channel' thing as it means different things to
different people!

We can use ghes:card smbios:physical-memory-array as the UEFI spec tells us they
are the same, and we don't actually need to know what they mean.

>  I think it is good enough if we can pin each error to the corresponding DIMM.
> At the end of the day DIMMs are what customer can replace in the memory system
> and that's all that they care about. For the manufacturers of the board/chips
> they have the knowledge to map the specific DIMMs to the upper layer components,
> so they can easily collect error counter data for upper layers. 

I agree.

>> CPER's "Memory Error Record 2" thinks that "NODE, CARD and MODULE
>> should provide the information necessary to identify the failing FRU". As
>> EDAC has three 'levels', these are what they should correspond to for ghes-
>> edac.
>>
>> I assume NODE means rack/chassis in some distributed system. Lets ignore it
>> as it doesn't seem to map to anything in the SMBIOS table.
> 
> How about type 4 "Processor Information"?

As the spec doesn't tell us what the field means, we can't really do anything
other than print the value out.

>> 'Card' doesn't mean much to me, but it maps to SMBIOS:17 "Memory Array
>> Structure", which the Memory Device structure also points to.
>> Card then must mean "a collection of memory devices (DIMMs) that operate
>> together to form an address space".
>>
>> This might be what I think of as a memory-controller, or it might be
>> something more complicated. Regardless, the CPER records think its relevant.
> 
> Originally I thought "Card" were memory channel. But looking at the definition
> of "Card Handle" in CPER: "... this field contains the SMBIOS handle for the
> Type 16 Memory Array Structure that represents the memory card". So Card is
> memory controller or something similar to that.
> Right now ghes-edac assumes
> one mc. We probably need to map mc(s) to the type 16 instances in SMBIOS table. 

I think we should ignore 'mc's, and just report the dimm numbers.

ghes-edac only cares about the number of dimms today, and this would work on
systems that only describe the dimms in the smbios table.

Thanks,

James

James Morse Aug. 28, 2018, 5:11 p.m. UTC | #13

Hi Tyler,

On 24/08/18 16:14, Tyler Baicar wrote:
> On Fri, Aug 24, 2018 at 5:48 AM, James Morse <james.morse@arm.com> wrote:
>> On 23/08/18 16:46, Tyler Baicar wrote:
>>> On Thu, Aug 23, 2018 at 5:29 AM James Morse <james.morse@arm.com> wrote:
>>>> On 19/07/18 19:36, Tyler Baicar wrote:
>>>>> This seems pretty hacky to me, so if anyone has other suggestions please share
>>>>> them.
>>>>
>>>> CPER's "Memory Error Record 2" thinks that "NODE, CARD and MODULE should provide
>>>> the information necessary to identify the failing FRU". As EDAC has three
>>>> 'levels', these are what they should correspond to for ghes-edac.
>>>>
>>>> I assume NODE means rack/chassis in some distributed system. Lets ignore it as
>>>> it doesn't seem to map to anything in the SMBIOS table.
>>>
>>> I believe NODE should map to socket number for multi-socket systems.
>>
>> Isn't the Memory Array Structure still unique in a multi-socket system? If so
>> the node isn't telling us anything new.

> Yes, the Memory Array structure in SMBIOS is still unique, but the NODE value
> is needed in NODE, CARD, MODULE because the CARD number here typically
> maps to channel number which each socket has their own channel numbers.

/me flinches at 'typically'.

Okay, so the hierarchy applies to the numbers, not the handles.
How come there isn't a node handle?

I think we should ignore the extra hierarchy. The module/devices/dimms are the
only replaceable part, and we don't know if the firmware will provide the
information.


>> Do sockets show up in the SMBIOS table? We would need to know how many there are
>> in advance. For arm systems the cpu topology from PPTT is the best bet for this
>> information, but what do we do if that table is missing? (also, does firmware
>> count from 1 or 0?) I suspect we can't use this field unless we know what the
>> range of values is going to be in advance.
> 
> An Fan mentioned in his response, what the customers really care about
> is mapping to
> a particular DIMM since that is what they can replace. To do this, the
> Memory Device
> handle should be enough since those are all unique regardless of
> Memory Array handle
> and which socket the DIMM is on. The Firmware I've worked with counts
> from 0, but I'm
> not sure if that is required.

If the spec doesn't say, its difficult to know the range of values until we've
seen one of the limits.


> That won't matter if we just use the Memory Device handle.

I agree.


>>> I think the proper way to get this working would be to use these handles. We can
>>> avoid populating this layer information and instead have a mapping of type 17
>>> index number (how edac is numbering the DIMMs today) to the handle number.
>>
>> Why get avoid the layer stuff? Isn't counting DIMM/memory-devices what
>> EDAC_MC_LAYER_SLOT is for?
> 
> The problem with the layer reporting is that you need to know all the
> layer information as Fan mentioned.

I haven't spotted what requires this, there seems to be a bit of a mix of
numbers-of-layers and what order they go in. thunderx_edac.c just uses SLOT.
I think we can get away with using EDAC_MC_LAYER_ALL_MEM, sized by num_dimms
(which ghes-edac already does).

Providing extra topology may not be useful unless the firmware populates this
information. What do we do if we export card+module, but then firmware only
populates the module-handle?


> SoCs can support multiple board combinations (ie
> 1DPC vs. 2DPC)
> and there is no standardized way of knowing whether you are booted on a 1DPC or
> 2DPC board.
> 
>>> Then we will need a new function to increment the counter based on the handle
>>> number rather than this layer information. Is that how you are envisioning it?
>>
>> I'm not familiar with edac's internals, so I didn't have any particular vision!
>>
>> Isn't the problem that ghes_edac_report_mem_error() does this:
>> |       e->top_layer = -1;
>> |       e->mid_layer = -1;
>> |       e->low_layer = -1;
> 
> The other problem is that the sysfs nodes are all setup with a single
> layer representing
> all of the memory on the board.
> 
> https://elixir.bootlin.com/linux/latest/source/drivers/edac/ghes_edac.c#L469
> 
> So the DIMM counters exposed in sysfs are all under a single memory
> controller and just
> numbered from 0 to n-1 based on the order in which the type 17 SMBIOS
> entries show up
> in the DMI walk.

User-space depending on the dmi walk order makes me nervous. Doing this gives
you per-dimm counters, if user-space needs to know which physical-dimm/slot that
is, we'd need a way of matching it with one of the smbios location strings.


>> so edac_raw_mc_handle_error() has no clue where the error happened. (I haven't
>> read what it does with this information yet).
>>
>> ghes_edac_report_mem_error() does check CPER_MEM_VALID_MODULE_HANDLE, and if its
>> set, it uses the handle to find the bank/device strings and prints them out.
> 
> Yes, I think this is where we need to add support to increment the
> count based on that module handle.

If layer[0] is EDAC_MC_LAYER_ALL_MEM, sized for num_dimm, don't we just put the
dimm number in e->top_layer and flip e->enable_per_layer_report?


>> Naively I thought we could generate some index during ghes_edac_count_dimms(),
>> and use this as e->${whichever}_layer. I hoped there would be something we could
>> already use as the index, but I can't spot it, so this will be more than the
>> one-liner I was hoping for!
> 
> We could use what ghes_edac_register does by setting up a single layer
> with all memory and
> then keep a map of which module handle maps to which index into that
> layer. From that it would
> be easy to increment the proper sysfs exposed DIMM counters using the
> single layer

Yes, I think this is what we should do!


Thanks,

James

Tyler Baicar Aug. 28, 2018, 8:04 p.m. UTC | #14

On Tue, Aug 28, 2018 at 1:11 PM, James Morse <james.morse@arm.com> wrote:
> On 24/08/18 16:14, Tyler Baicar wrote:
>> On Fri, Aug 24, 2018 at 5:48 AM, James Morse <james.morse@arm.com> wrote:
>>> On 23/08/18 16:46, Tyler Baicar wrote:
>>> so edac_raw_mc_handle_error() has no clue where the error happened. (I haven't
>>> read what it does with this information yet).
>>>
>>> ghes_edac_report_mem_error() does check CPER_MEM_VALID_MODULE_HANDLE, and if its
>>> set, it uses the handle to find the bank/device strings and prints them out.
>>
>> Yes, I think this is where we need to add support to increment the
>> count based on that module handle.
>
> If layer[0] is EDAC_MC_LAYER_ALL_MEM, sized for num_dimm, don't we just put the
> dimm number in e->top_layer and flip e->enable_per_layer_report?

Yes, that is all we would need to do. Figuring out that DIMM number is
the issue, but that can be done with the map of module handles to DIMM
index.

>>> Naively I thought we could generate some index during ghes_edac_count_dimms(),
>>> and use this as e->${whichever}_layer. I hoped there would be something we could
>>> already use as the index, but I can't spot it, so this will be more than the
>>> one-liner I was hoping for!
>>
>> We could use what ghes_edac_register does by setting up a single layer
>> with all memory and
>> then keep a map of which module handle maps to which index into that
>> layer. From that it would
>> be easy to increment the proper sysfs exposed DIMM counters using the
>> single layer
>
> Yes, I think this is what we should do!

Sounds good, I'll start working on this!

Thanks,
Tyler

Borislav Petkov Aug. 29, 2018, 7:38 a.m. UTC | #15

On Tue, Aug 28, 2018 at 06:09:24PM +0100, James Morse wrote:
> Does x86 have another source of memory-topology information it needs to
> correlate smbios with?

Bah, pinpointing the DIMM on x86 is a mess. There's no reliable way to
say which DIMM it is in certain cases (interleaving, mirrorring, ...)
and it is all platform-dependent. So we do the layers to dump a memory
location (node, memory controller, ....) so that we can at least limit
the number of DIMMs the user needs to replace/try.

In an ideal world, I'd like to be able to query the SPD chips on the
DIMMs and build the topology and then when an error happens to say,
"error in DIMM <silkscreen>" where silkscreen is what is written on the
motherboard under the DIMM socket.

But I don't see that happening any time soon...

> For arm there is nothing else describing the memory-topology, so as long as we
> can correlate the smbios table and ghes:cper records through the handles, we can
> get this working for all systems.

And then make sure vendors fill in the proper info in smbios. Because that's
also a mess on x86.

James Morse Aug. 29, 2018, 10:20 a.m. UTC | #16

Hi Boris,

On 29/08/18 08:38, Borislav Petkov wrote:
> On Tue, Aug 28, 2018 at 06:09:24PM +0100, James Morse wrote:
>> Does x86 have another source of memory-topology information it needs to
>> correlate smbios with?
> 
> Bah, pinpointing the DIMM on x86 is a mess. There's no reliable way to
> say which DIMM it is in certain cases (interleaving, mirrorring, ...)
> and it is all platform-dependent. So we do the layers to dump a memory
> location (node, memory controller, ....) so that we can at least limit
> the number of DIMMs the user needs to replace/try.

Right. I'd like ghes-edac to work in the same way for both architectures.

I think this is best done by stuffing the dmi-handle in struct dimm_info during
ghes_edac_dmidecode(), then populating the struct edac_raw_error_desc layers
from the matching mci->dimms 'location'.

For EDAC_MC_LAYER_ALL_MEM this boils down to a flat index, so pointer arithmetic
on mci->dimms is an appropriate short cut.

(We should probably 'FIXME: It shouldn't be hard to also fill the DIMM labels'
at the same time so that no-one is tempted to interpret the edac:dimm-idx)

> In an ideal world, I'd like to be able to query the SPD chips on the

(oh, that can be done?)

> DIMMs and build the topology and then when an error happens to say,
> "error in DIMM <silkscreen>" where silkscreen is what is written on the
> motherboard under the DIMM socket.
> 
> But I don't see that happening any time soon...

>> For arm there is nothing else describing the memory-topology, so as long as we
>> can correlate the smbios table and ghes:cper records through the handles, we can
>> get this working for all systems.
> 
> And then make sure vendors fill in the proper info in smbios. Because that's
> also a mess on x86.

I got educated by the people who look after specifications last time I touched
this [0]. SMBIOS tables are required by Arm's 'Server Base Boot Requirements',
It lists the memory-device and physical-memory-array as required.

I will drop them a note that we will be depending on the handle, and it should
go on the list too... if its not populated on today's systems we can fall back
to !e->enable_per_layer_report as we do today.

Thanks,

James

[0] https://www.spinics.net/lists/arm-kernel/msg653133.html

Borislav Petkov Aug. 30, 2018, 10:28 a.m. UTC | #17

On Wed, Aug 29, 2018 at 11:20:48AM +0100, James Morse wrote:
> Right. I'd like ghes-edac to work in the same way for both architectures.
> 
> I think this is best done by stuffing the dmi-handle in struct dimm_info during
> ghes_edac_dmidecode(), then populating the struct edac_raw_error_desc layers
> from the matching mci->dimms 'location'.
> 
> For EDAC_MC_LAYER_ALL_MEM this boils down to a flat index, so pointer arithmetic
> on mci->dimms is an appropriate short cut.

It all sounds nice on paper but you should try it on a couple of
machines first. See whether/how it actually works there.

Also, this probably would need to not change x86 unless you wanna fix it
there too. I'd think twice before I attempt such a thing though :)

> (We should probably 'FIXME: It shouldn't be hard to also fill the DIMM labels'
> at the same time so that no-one is tempted to interpret the edac:dimm-idx)

See above.

> > In an ideal world, I'd like to be able to query the SPD chips on the
> 
> (oh, that can be done?)

There was some talk initially and I've seen BIOS read SPD chips and
showing DIMM info but I've heard the word "proprietary" a couple of
times. Haven't dug any deeper though.

> I got educated by the people who look after specifications last time I touched
> this [0]. SMBIOS tables are required by Arm's 'Server Base Boot Requirements',
> It lists the memory-device and physical-memory-array as required.

It is always better to have certification on your side. Should make ARM
vendors dance.

> I will drop them a note that we will be depending on the handle, and it should
> go on the list too... if its not populated on today's systems we can fall back
> to !e->enable_per_layer_report as we do today.

Right.

[RFC] EDAC, ghes: Enable per-layer error reporting for ARM

Commit Message

Comments

Patch