diff mbox series

[v2,1/3] ghes_edac: unify memory error report format with cper

Message ID 20211210134019.28536-2-xueshuai@linux.alibaba.com (mailing list archive)
State New, archived
Headers show
Series ghes_edac: refactor memory error reporting to avoid code duplication | expand

Commit Message

Shuai Xue Dec. 10, 2021, 1:40 p.m. UTC
The changes are to:

- add device info into ghes_edac
- change bit_pos to bit_position, col to column, requestorID to
  requestor_id, etc in ghes_edac
- move requestor_id, responder_id, target_id and chip_id into memory error
  location in ghes_edac
- add "DIMM location: not present." for DIMM location in ghes_edac
- remove the 'space' delimiter after the colon in ghes_edac and cper

The original EDAC and cper error log are as follows (all Validation Bits
are enabled):

[31940.060454] EDAC MC0: 1 CE Single-symbol ChipKill ECC on unknown memory (node:0 card:0 module:0 rank:0 bank:257 bank_group:1 bank_address:1 row:75492 col:8 bit_pos:0 DIMM DMI handle: 0x0000 chipID: 0 page:0x93724c offset:0x20 grain:1 syndrome:0x0 - APEI location: node:0 card:0 module:0 rank:0 bank:257 bank_group:1 bank_address:1 row:75492 col:8 bit_pos:0 DIMM DMI handle: 0x0000 chipID: 0 status(0x0000000000000000): reserved requestorID: 0x0000000000000000 responderID: 0x0000000000000000 targetID: 0x0000000000000000)
[31940.060459] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[31940.060460] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
[31940.060462] {3}[Hardware Error]: event severity: corrected
[31940.060463] {3}[Hardware Error]:  Error 0, type: corrected
[31940.060464] {3}[Hardware Error]:   section_type: memory error
[31940.060465] {3}[Hardware Error]:   error_status: 0x0000000000000000
[31940.060466] {3}[Hardware Error]:   physical_address: 0x000000093724c020
[31940.060466] {3}[Hardware Error]:   physical_address_mask: 0x0000000000000000
[31940.060469] {3}[Hardware Error]:   node: 0 card: 0 module: 0 rank: 0 bank: 257 bank_group: 1 bank_address: 1 device: 0 row: 75492 column: 8 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000
[31940.060470] {3}[Hardware Error]:   error_type: 4, single-symbol chipkill ECC
[31940.060471] {3}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000

Now, the EDAC and cper error log are properly reporting the error as
follows (all Validation Bits are enabled):

[  117.973657] EDAC MC0: 1 CE Single-symbol ChipKill ECC on 0x0000 (node:0 card:0 module:0 rank:0 bank:1026 bank_group:4 bank_address:2 device:0 row:6749 column:8 bit_position:0 requestor_id:0x0000000000000000 responder_id:0x0000000000000000 target_id:0x0000000000000000 chip_id:0 DIMM location:not present. DIMM DMI handle:0x0000 page:0x8d2ef4 offset:0x20 grain:1 syndrome:0x0 - APEI location: node:0 card:0 module:0 rank:0 bank:1026 bank_group:4 bank_address:2 device:0 row:6749 column:8 bit_position:0 requestor_id:0x0000000000000000 responder_id:0x0000000000000000 target_id:0x0000000000000000 chip_id:0 DIMM location:not present. DIMM DMI handle:0x0000 status(0x0000000000000000):reserved)
[  117.973663] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[  117.973664] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[  117.973665] {2}[Hardware Error]: event severity: corrected
[  117.973666] {2}[Hardware Error]:  Error 0, type: corrected
[  117.973667] {2}[Hardware Error]:   section_type: memory error
[  117.973668] {2}[Hardware Error]:   error_status: 0x0000000000000000
[  117.973669] {2}[Hardware Error]:   physical_address: 0x00000008d2ef4020
[  117.973670] {2}[Hardware Error]:   physical_address_mask: 0x0000000000000000
[  117.973672] {2}[Hardware Error]:   node:0 card:0 module:0 rank:0 bank:1026 bank_group:4 bank_address:2 device:0 row:6749 column:8 bit_position:0 requestor_id:0x0000000000000000 responder_id:0x0000000000000000 target_id:0x0000000000000000 chip_id:0
[  117.973673] {2}[Hardware Error]:   error_type: 4, single-symbol chipkill ECC
[  117.973674] {2}[Hardware Error]:   DIMM location: not present. DMI handle:0x0000

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
 drivers/edac/ghes_edac.c    | 35 +++++++++++++++++++----------------
 drivers/firmware/efi/cper.c | 34 +++++++++++++++++-----------------
 2 files changed, 36 insertions(+), 33 deletions(-)

Comments

Borislav Petkov Dec. 28, 2021, 5:12 p.m. UTC | #1
On Fri, Dec 10, 2021 at 09:40:17PM +0800, Shuai Xue wrote:
> Subject: Re: [PATCH v2 1/3] ghes_edac: unify memory error report format with

"EDAC/ghes: Unify..."

is the format we use in the EDAC tree.

> The changes are to:
> 
> - add device info into ghes_edac
> - change bit_pos to bit_position, col to column, requestorID to
>   requestor_id, etc in ghes_edac
> - move requestor_id, responder_id, target_id and chip_id into memory error
>   location in ghes_edac
> - add "DIMM location: not present." for DIMM location in ghes_edac
> - remove the 'space' delimiter after the colon in ghes_edac and cper

This commit message is useless: it should not talk about what your patch
does - that should hopefully be visible in the diff itself. Rather, it
should talk about *why* you're doing what you're doing.

Also, your patch does a bunch of things at once.

From: Documentation/process/submitting-patches.rst

"Solve only one problem per patch.  If your description starts to get
long, that's a sign that you probably need to split up your patch.
See :ref:`split_changes`."

You should have a look at that file.

Also, avoid having "This patch" or "This commit" in the commit message.
It is tautologically useless.

Also, do

$ git grep 'This patch' Documentation/process

for more details.

> The original EDAC and cper error log are as follows (all Validation Bits
> are enabled):
> 
> [31940.060454] EDAC MC0: 1 CE Single-symbol ChipKill ECC on unknown memory (node:0 card:0 module:0 rank:0 bank:257 bank_group:1 bank_address:1 row:75492 col:8 bit_pos:0 DIMM DMI handle: 0x0000 chipID: 0 page:0x93724c offset:0x20 grain:1 syndrome:0x0 - APEI location: node:0 card:0 module:0 rank:0 bank:257 bank_group:1 bank_address:1 row:75492 col:8 bit_pos:0 DIMM DMI handle: 0x0000 chipID: 0 status(0x0000000000000000): reserved requestorID: 0x0000000000000000 responderID: 0x0000000000000000 targetID: 0x0000000000000000)
> [31940.060459] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
> [31940.060460] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
> [31940.060462] {3}[Hardware Error]: event severity: corrected
> [31940.060463] {3}[Hardware Error]:  Error 0, type: corrected
> [31940.060464] {3}[Hardware Error]:   section_type: memory error
> [31940.060465] {3}[Hardware Error]:   error_status: 0x0000000000000000
> [31940.060466] {3}[Hardware Error]:   physical_address: 0x000000093724c020
> [31940.060466] {3}[Hardware Error]:   physical_address_mask: 0x0000000000000000
> [31940.060469] {3}[Hardware Error]:   node: 0 card: 0 module: 0 rank: 0 bank: 257 bank_group: 1 bank_address: 1 device: 0 row: 75492 column: 8 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000
> [31940.060470] {3}[Hardware Error]:   error_type: 4, single-symbol chipkill ECC
> [31940.060471] {3}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000
> 
> Now, the EDAC and cper error log are properly reporting the error as
> follows (all Validation Bits are enabled):
> 
> [ 117.973657] EDAC MC0: 1 CE Single-symbol ChipKill ECC on 0x0000

What does "ECC on 0x0000" mean?

> (node:0 card:0 module:0 rank:0 bank:1026 bank_group:4
> bank_address:2 device:0 row:6749 column:8 bit_position:0

> requestor_id:0x0000000000000000
> responder_id:0x0000000000000000
> target_id:0x0000000000000000

those look useless to me too. Probably invalid/unpopulated by BIOS...

> chip_id:0 DIMM location:not present. DIMM
> DMI handle:0x0000 page:0x8d2ef4 offset:0x20 grain:1 syndrome:0x0 -
> APEI location: node:0 card:0 module:0 rank:0 bank:1026 bank_group:4
> bank_address:2 device:0 row:6749 column:8 bit_position:0
> requestor_id:0x0000000000000000 responder_id:0x0000000000000000
> target_id:0x0000000000000000 chip_id:0 DIMM location:not present. DIMM
> DMI handle:0x0000 status(0x0000000000000000):reserved)

Sorry but I fail to see how this changed error record is an improvement.

> [  117.973663] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
> [  117.973664] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
> [  117.973665] {2}[Hardware Error]: event severity: corrected
> [  117.973666] {2}[Hardware Error]:  Error 0, type: corrected
> [  117.973667] {2}[Hardware Error]:   section_type: memory error
> [  117.973668] {2}[Hardware Error]:   error_status: 0x0000000000000000
> [  117.973669] {2}[Hardware Error]:   physical_address: 0x00000008d2ef4020
> [  117.973670] {2}[Hardware Error]:   physical_address_mask: 0x0000000000000000
> [  117.973672] {2}[Hardware Error]:   node:0 card:0 module:0 rank:0 bank:1026 bank_group:4 bank_address:2 device:0 row:6749 column:8 bit_position:0 requestor_id:0x0000000000000000 responder_id:0x0000000000000000 target_id:0x0000000000000000 chip_id:0
> [  117.973673] {2}[Hardware Error]:   error_type: 4, single-symbol chipkill ECC
> [  117.973674] {2}[Hardware Error]:   DIMM location: not present. DMI handle:0x0000
> 
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> ---
>  drivers/edac/ghes_edac.c    | 35 +++++++++++++++++++----------------
>  drivers/firmware/efi/cper.c | 34 +++++++++++++++++-----------------
>  2 files changed, 36 insertions(+), 33 deletions(-)
> 
> diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
> index 6d1ddecbf0da..526a28cbb19b 100644
> --- a/drivers/edac/ghes_edac.c
> +++ b/drivers/edac/ghes_edac.c
> @@ -378,6 +378,8 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
>  	if (mem_err->validation_bits & CPER_MEM_VALID_BANK_ADDRESS)
>  		p += sprintf(p, "bank_address:%d ",
>  			     mem_err->bank & CPER_MEM_BANK_ADDRESS_MASK);
> +	if (mem_err->validation_bits & CPER_MEM_VALID_DEVICE)
> +		p += sprintf(p, "device:%d ", mem_err->device);
>  	if (mem_err->validation_bits & (CPER_MEM_VALID_ROW | CPER_MEM_VALID_ROW_EXT)) {
>  		u32 row = mem_err->row;
>  
> @@ -385,9 +387,21 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
>  		p += sprintf(p, "row:%d ", row);
>  	}
>  	if (mem_err->validation_bits & CPER_MEM_VALID_COLUMN)
> -		p += sprintf(p, "col:%d ", mem_err->column);
> +		p += sprintf(p, "column:%d ", mem_err->column);
>  	if (mem_err->validation_bits & CPER_MEM_VALID_BIT_POSITION)
> -		p += sprintf(p, "bit_pos:%d ", mem_err->bit_pos);
> +		p += sprintf(p, "bit_position:%d ", mem_err->bit_pos);

What for?

> +	if (mem_err->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
> +		p += sprintf(p, "requestor_id:0x%016llx ",
> +			     (long long)mem_err->requestor_id);
> +	if (mem_err->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
> +		p += sprintf(p, "responder_id:0x%016llx ",
> +			     (long long)mem_err->responder_id);
					^^^^^^^^^^^^^^^^^^^^^^

> +	if (mem_err->validation_bits & CPER_MEM_VALID_TARGET_ID)
> +		p += sprintf(p, "target_id:0x%016llx ",
> +			     (long long)mem_err->responder_id);
					^^^^^^^^^^^^^^^^^^^^^^

mem_err->responder_id for both responder and target?!

> +	if (mem_err->validation_bits & CPER_MEM_VALID_CHIP_ID)
> +		p += sprintf(p, "chip_id:%d ",
> +			     mem_err->extended >> CPER_MEM_CHIP_ID_SHIFT);

I don't know if some dumb BIOS simply sets those valid bits regardless
of whether those fields are populated or not... It looks like it...

Right, so first this needs to explain *why* you're doing what you're
doing. And then with each reason why, you should do a patch, one by one,
explaining the rationale.

/me goes and looks at the next patches.

Aha, and you add all that crap here just to remove it in patch 2. But
this is all useless churn.

If you want to use cper_mem_err_location(), why don't you simply use it
directly?

And so on and so on...
Shuai Xue Dec. 29, 2021, 3:22 a.m. UTC | #2
Hi, Borislav,

Thank you very much for your valuable comments.


在 2021/12/29 AM1:12, Borislav Petkov 写道:
> On Fri, Dec 10, 2021 at 09:40:17PM +0800, Shuai Xue wrote:
>> Subject: Re: [PATCH v2 1/3] ghes_edac: unify memory error report format with
> 
> "EDAC/ghes: Unify..."
> 
> is the format we use in the EDAC tree.
Got it, thank you. Will fix in next version.


>> The changes are to:
>>
>> - add device info into ghes_edac
>> - change bit_pos to bit_position, col to column, requestorID to
>>   requestor_id, etc in ghes_edac
>> - move requestor_id, responder_id, target_id and chip_id into memory error
>>   location in ghes_edac
>> - add "DIMM location: not present." for DIMM location in ghes_edac
>> - remove the 'space' delimiter after the colon in ghes_edac and cper
> 
> This commit message is useless: it should not talk about what your patch
> does - that should hopefully be visible in the diff itself. Rather, it
> should talk about *why* you're doing what you're doing.

The comments from Robert in v1[1] ask me to explicitly list those differences,

	It is not really duplicate yet, changes are slightly different which
	could trigger problems in some parsers. At least those differences
	should be listed in the patch description.

Should I keep it here? Or delete it and just add why.


> Also, your patch does a bunch of things at once.
> 
> From: Documentation/process/submitting-patches.rst
> 
> "Solve only one problem per patch.  If your description starts to get
> long, that's a sign that you probably need to split up your patch.
> See :ref:`split_changes`."
> 
> You should have a look at that file.
> 
> Also, avoid having "This patch" or "This commit" in the commit message.
> It is tautologically useless.
> 
> Also, do
> 
> $ git grep 'This patch' Documentation/process
> 
> for more details.

Got it, thank you very much. But as you mentioned bellow, the purpose of this
patch is to unify memory error report format with cper so that we can use use
cper_mem_err_location() directly. Actually, this patch series is originally
suggested by Tony[2] to share the same code with cper. Should I split this
patch by function?


>> The original EDAC and cper error log are as follows (all Validation Bits
>> are enabled):
>>
>> [31940.060454] EDAC MC0: 1 CE Single-symbol ChipKill ECC on unknown memory (node:0 card:0 module:0 rank:0 bank:257 bank_group:1 bank_address:1 row:75492 col:8 bit_pos:0 DIMM DMI handle: 0x0000 chipID: 0 page:0x93724c offset:0x20 grain:1 syndrome:0x0 - APEI location: node:0 card:0 module:0 rank:0 bank:257 bank_group:1 bank_address:1 row:75492 col:8 bit_pos:0 DIMM DMI handle: 0x0000 chipID: 0 status(0x0000000000000000): reserved requestorID: 0x0000000000000000 responderID: 0x0000000000000000 targetID: 0x0000000000000000)
>> [31940.060459] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
>> [31940.060460] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
>> [31940.060462] {3}[Hardware Error]: event severity: corrected
>> [31940.060463] {3}[Hardware Error]:  Error 0, type: corrected
>> [31940.060464] {3}[Hardware Error]:   section_type: memory error
>> [31940.060465] {3}[Hardware Error]:   error_status: 0x0000000000000000
>> [31940.060466] {3}[Hardware Error]:   physical_address: 0x000000093724c020
>> [31940.060466] {3}[Hardware Error]:   physical_address_mask: 0x0000000000000000
>> [31940.060469] {3}[Hardware Error]:   node: 0 card: 0 module: 0 rank: 0 bank: 257 bank_group: 1 bank_address: 1 device: 0 row: 75492 column: 8 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000
>> [31940.060470] {3}[Hardware Error]:   error_type: 4, single-symbol chipkill ECC
>> [31940.060471] {3}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000
>>
>> Now, the EDAC and cper error log are properly reporting the error as
>> follows (all Validation Bits are enabled):
>>
>> [ 117.973657] EDAC MC0: 1 CE Single-symbol ChipKill ECC on 0x0000
> 
> What does "ECC on 0x0000" mean?
> 
>> (node:0 card:0 module:0 rank:0 bank:1026 bank_group:4
>> bank_address:2 device:0 row:6749 column:8 bit_position:0
> 
>> requestor_id:0x0000000000000000
>> responder_id:0x0000000000000000
>> target_id:0x0000000000000000
> 
> those look useless to me too. Probably invalid/unpopulated by BIOS...

Yep, these fields are unpopulated by BIOS, I manually enable all Validation
Bits for debug so that we see the difference more clearly. I will declare it
in next version.


>> chip_id:0 DIMM location:not present. DIMM
>> DMI handle:0x0000 page:0x8d2ef4 offset:0x20 grain:1 syndrome:0x0 -
>> APEI location: node:0 card:0 module:0 rank:0 bank:1026 bank_group:4
>> bank_address:2 device:0 row:6749 column:8 bit_position:0
>> requestor_id:0x0000000000000000 responder_id:0x0000000000000000
>> target_id:0x0000000000000000 chip_id:0 DIMM location:not present. DIMM
>> DMI handle:0x0000 status(0x0000000000000000):reserved)
> 
> Sorry but I fail to see how this changed error record is an improvement.

Well, the purpose is not to improve but unify.


>> [  117.973663] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
>> [  117.973664] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
>> [  117.973665] {2}[Hardware Error]: event severity: corrected
>> [  117.973666] {2}[Hardware Error]:  Error 0, type: corrected
>> [  117.973667] {2}[Hardware Error]:   section_type: memory error
>> [  117.973668] {2}[Hardware Error]:   error_status: 0x0000000000000000
>> [  117.973669] {2}[Hardware Error]:   physical_address: 0x00000008d2ef4020
>> [  117.973670] {2}[Hardware Error]:   physical_address_mask: 0x0000000000000000
>> [  117.973672] {2}[Hardware Error]:   node:0 card:0 module:0 rank:0 bank:1026 bank_group:4 bank_address:2 device:0 row:6749 column:8 bit_position:0 requestor_id:0x0000000000000000 responder_id:0x0000000000000000 target_id:0x0000000000000000 chip_id:0
>> [  117.973673] {2}[Hardware Error]:   error_type: 4, single-symbol chipkill ECC
>> [  117.973674] {2}[Hardware Error]:   DIMM location: not present. DMI handle:0x0000
>>
>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>> ---
>>  drivers/edac/ghes_edac.c    | 35 +++++++++++++++++++----------------
>>  drivers/firmware/efi/cper.c | 34 +++++++++++++++++-----------------
>>  2 files changed, 36 insertions(+), 33 deletions(-)
>>
>> diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
>> index 6d1ddecbf0da..526a28cbb19b 100644
>> --- a/drivers/edac/ghes_edac.c
>> +++ b/drivers/edac/ghes_edac.c
>> @@ -378,6 +378,8 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
>>  	if (mem_err->validation_bits & CPER_MEM_VALID_BANK_ADDRESS)
>>  		p += sprintf(p, "bank_address:%d ",
>>  			     mem_err->bank & CPER_MEM_BANK_ADDRESS_MASK);
>> +	if (mem_err->validation_bits & CPER_MEM_VALID_DEVICE)
>> +		p += sprintf(p, "device:%d ", mem_err->device);
>>  	if (mem_err->validation_bits & (CPER_MEM_VALID_ROW | CPER_MEM_VALID_ROW_EXT)) {
>>  		u32 row = mem_err->row;
>>  
>> @@ -385,9 +387,21 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
>>  		p += sprintf(p, "row:%d ", row);
>>  	}
>>  	if (mem_err->validation_bits & CPER_MEM_VALID_COLUMN)
>> -		p += sprintf(p, "col:%d ", mem_err->column);
>> +		p += sprintf(p, "column:%d ", mem_err->column);
>>  	if (mem_err->validation_bits & CPER_MEM_VALID_BIT_POSITION)
>> -		p += sprintf(p, "bit_pos:%d ", mem_err->bit_pos);
>> +		p += sprintf(p, "bit_position:%d ", mem_err->bit_pos);
> 
> What for?

To unify memory error report format with cper. Should we use cper
or ghes_edac version?


>> +	if (mem_err->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
>> +		p += sprintf(p, "requestor_id:0x%016llx ",
>> +			     (long long)mem_err->requestor_id);
>> +	if (mem_err->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
>> +		p += sprintf(p, "responder_id:0x%016llx ",
>> +			     (long long)mem_err->responder_id);
> 					^^^^^^^^^^^^^^^^^^^^^^
> 
>> +	if (mem_err->validation_bits & CPER_MEM_VALID_TARGET_ID)
>> +		p += sprintf(p, "target_id:0x%016llx ",
>> +			     (long long)mem_err->responder_id);
> 					^^^^^^^^^^^^^^^^^^^^^^
> 
> mem_err->responder_id for both responder and target?!

Sorry for this bug. Will fix in next version.

> 
>> +	if (mem_err->validation_bits & CPER_MEM_VALID_CHIP_ID)
>> +		p += sprintf(p, "chip_id:%d ",
>> +			     mem_err->extended >> CPER_MEM_CHIP_ID_SHIFT);
> 
> I don't know if some dumb BIOS simply sets those valid bits regardless
> of whether those fields are populated or not... It looks like it...
> 
> Right, so first this needs to explain *why* you're doing what you're
> doing. And then with each reason why, you should do a patch, one by one,
> explaining the rationale.
> 
> /me goes and looks at the next patches.
> 
> Aha, and you add all that crap here just to remove it in patch 2. But
> this is all useless churn.
> 
> If you want to use cper_mem_err_location(), why don't you simply use it
> directly?
> 
> And so on and so on...

Well, Robert suggested me add a unification patch[1] so that we could review
the changes more clearly. I think it makes sense.

	It is not really duplicate yet, changes are slightly different which
	could trigger problems in some parsers. At least those differences
	should be listed in the patch description. I would rather remove the
	'space' delimiter after the colon and take the ghes version of it as
	logs become harder to read. So ideally there is a unification patch
	before the "duplication" is removed with changes in both files as
	necessary for review and to document the change.

[1]https://lore.kernel.org/all/Ya9F75xWt%2FIlwcKC@rric.localdomain/
[2]https://lore.kernel.org/lkml/YaZ3yiIBRj6qIg2h@agluck-desk2.amr.corp.intel.com/T/#m8cdf77bee7d20ba094130707363359e765e8459f


Best Regards,
Shuai
Borislav Petkov Dec. 30, 2021, 5:57 p.m. UTC | #3
On Wed, Dec 29, 2021 at 11:22:11AM +0800, Shuai Xue wrote:
> Yep, these fields are unpopulated by BIOS, I manually enable all Validation
> Bits for debug so that we see the difference more clearly. I will declare it
> in next version.

Declare what? I can't parse your statement.

> Well, the purpose is not to improve but unify.

The most importang goal with kernel code is improvement and less bugs.
Unification is second. We should not jump through hoops and unify at
every price just because there's a duplicated function somewhere.
Remember that when doing your changes.

> Well, Robert suggested me add a unification patch[1] so that we could review
> the changes more clearly. I think it makes sense.

Not really. I can imagine why Robert suggested that but this strategy is
causing unnecessary churn. What one usually does in such cases is:

1. Add changes to the target functionality - the one in cper.c - by
explaining *why* those changes are needed.

2. Switch ghes_edac.c to that functionality and remove the redundant one
there.

Simple and clean diffstat and easy review.

Thx.
Shuai Xue Dec. 31, 2021, 3:04 a.m. UTC | #4
Hi, Borislav,

Thank you for your comments.

在 2021/12/31 AM1:57, Borislav Petkov 写道:
> On Wed, Dec 29, 2021 at 11:22:11AM +0800, Shuai Xue wrote:
>> Yep, these fields are unpopulated by BIOS, I manually enable all Validation
>> Bits for debug so that we see the difference more clearly. I will declare it
>> in next version.
> 
> Declare what? I can't parse your statement.

The ghes_edac log message is printed only when a validation bit is set, e.g.:

	if (mem_err->validation_bits & CPER_MEM_VALID_NODE)
		p += sprintf(p, "node:%d ", mem_err->node);

Not all bits are populated by BIOS in my platform, I manually enable all
validation bits during test so that we can see log message and differences of all
fields more clearly.

	+ 	mem_err->validation_bits = 0xfffffffffffffff;

>> Well, the purpose is not to improve but unify.
> 
> The most importang goal with kernel code is improvement and less bugs.
> Unification is second. We should not jump through hoops and unify at
> every price just because there's a duplicated function somewhere.
> Remember that when doing your changes.

I see. Thank you.

>> Well, Robert suggested me add a unification patch[1] so that we could review
>> the changes more clearly. I think it makes sense.
> 
> Not really. I can imagine why Robert suggested that but this strategy is
> causing unnecessary churn. What one usually does in such cases is:
> 
> 1. Add changes to the target functionality - the one in cper.c - by
> explaining *why* those changes are needed.
> 
> 2. Switch ghes_edac.c to that functionality and remove the redundant one
> there.
> 
> Simple and clean diffstat and easy review.
> 
> Thx.

Got it. I will send next version latter.

Merry Christmas and happy New Year.

Best Regards,
Shuai
diff mbox series

Patch

diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 6d1ddecbf0da..526a28cbb19b 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -378,6 +378,8 @@  void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
 	if (mem_err->validation_bits & CPER_MEM_VALID_BANK_ADDRESS)
 		p += sprintf(p, "bank_address:%d ",
 			     mem_err->bank & CPER_MEM_BANK_ADDRESS_MASK);
+	if (mem_err->validation_bits & CPER_MEM_VALID_DEVICE)
+		p += sprintf(p, "device:%d ", mem_err->device);
 	if (mem_err->validation_bits & (CPER_MEM_VALID_ROW | CPER_MEM_VALID_ROW_EXT)) {
 		u32 row = mem_err->row;
 
@@ -385,9 +387,21 @@  void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
 		p += sprintf(p, "row:%d ", row);
 	}
 	if (mem_err->validation_bits & CPER_MEM_VALID_COLUMN)
-		p += sprintf(p, "col:%d ", mem_err->column);
+		p += sprintf(p, "column:%d ", mem_err->column);
 	if (mem_err->validation_bits & CPER_MEM_VALID_BIT_POSITION)
-		p += sprintf(p, "bit_pos:%d ", mem_err->bit_pos);
+		p += sprintf(p, "bit_position:%d ", mem_err->bit_pos);
+	if (mem_err->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
+		p += sprintf(p, "requestor_id:0x%016llx ",
+			     (long long)mem_err->requestor_id);
+	if (mem_err->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
+		p += sprintf(p, "responder_id:0x%016llx ",
+			     (long long)mem_err->responder_id);
+	if (mem_err->validation_bits & CPER_MEM_VALID_TARGET_ID)
+		p += sprintf(p, "target_id:0x%016llx ",
+			     (long long)mem_err->responder_id);
+	if (mem_err->validation_bits & CPER_MEM_VALID_CHIP_ID)
+		p += sprintf(p, "chip_id:%d ",
+			     mem_err->extended >> CPER_MEM_CHIP_ID_SHIFT);
 	if (mem_err->validation_bits & CPER_MEM_VALID_MODULE_HANDLE) {
 		const char *bank = NULL, *device = NULL;
 		struct dimm_info *dimm;
@@ -396,7 +410,7 @@  void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
 		if (bank != NULL && device != NULL)
 			p += sprintf(p, "DIMM location:%s %s ", bank, device);
 		else
-			p += sprintf(p, "DIMM DMI handle: 0x%.4x ",
+			p += sprintf(p, "DIMM location:not present. DIMM DMI handle:0x%.4x ",
 				     mem_err->mem_dev_handle);
 
 		dimm = find_dimm_by_handle(mci, mem_err->mem_dev_handle);
@@ -405,9 +419,6 @@  void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
 			strcpy(e->label, dimm->label);
 		}
 	}
-	if (mem_err->validation_bits & CPER_MEM_VALID_CHIP_ID)
-		p += sprintf(p, "chipID: %d ",
-			     mem_err->extended >> CPER_MEM_CHIP_ID_SHIFT);
 	if (p > e->location)
 		*(p - 1) = '\0';
 
@@ -421,7 +432,7 @@  void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
 	if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_STATUS) {
 		u64 status = mem_err->error_status;
 
-		p += sprintf(p, "status(0x%016llx): ", (long long)status);
+		p += sprintf(p, "status(0x%016llx):", (long long)status);
 		switch ((status >> 8) & 0xff) {
 		case 1:
 			p += sprintf(p, "Error detected internal to the component ");
@@ -479,15 +490,7 @@  void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
 			break;
 		}
 	}
-	if (mem_err->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
-		p += sprintf(p, "requestorID: 0x%016llx ",
-			     (long long)mem_err->requestor_id);
-	if (mem_err->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
-		p += sprintf(p, "responderID: 0x%016llx ",
-			     (long long)mem_err->responder_id);
-	if (mem_err->validation_bits & CPER_MEM_VALID_TARGET_ID)
-		p += sprintf(p, "targetID: 0x%016llx ",
-			     (long long)mem_err->responder_id);
+
 	if (p > pvt->other_detail)
 		*(p - 1) = '\0';
 
diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 6ec8edec6329..77b39b058924 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -221,45 +221,45 @@  static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
 	n = 0;
 	len = CPER_REC_LEN;
 	if (mem->validation_bits & CPER_MEM_VALID_NODE)
-		n += scnprintf(msg + n, len - n, "node: %d ", mem->node);
+		n += scnprintf(msg + n, len - n, "node:%d ", mem->node);
 	if (mem->validation_bits & CPER_MEM_VALID_CARD)
-		n += scnprintf(msg + n, len - n, "card: %d ", mem->card);
+		n += scnprintf(msg + n, len - n, "card:%d ", mem->card);
 	if (mem->validation_bits & CPER_MEM_VALID_MODULE)
-		n += scnprintf(msg + n, len - n, "module: %d ", mem->module);
+		n += scnprintf(msg + n, len - n, "module:%d ", mem->module);
 	if (mem->validation_bits & CPER_MEM_VALID_RANK_NUMBER)
-		n += scnprintf(msg + n, len - n, "rank: %d ", mem->rank);
+		n += scnprintf(msg + n, len - n, "rank:%d ", mem->rank);
 	if (mem->validation_bits & CPER_MEM_VALID_BANK)
-		n += scnprintf(msg + n, len - n, "bank: %d ", mem->bank);
+		n += scnprintf(msg + n, len - n, "bank:%d ", mem->bank);
 	if (mem->validation_bits & CPER_MEM_VALID_BANK_GROUP)
-		n += scnprintf(msg + n, len - n, "bank_group: %d ",
+		n += scnprintf(msg + n, len - n, "bank_group:%d ",
 			       mem->bank >> CPER_MEM_BANK_GROUP_SHIFT);
 	if (mem->validation_bits & CPER_MEM_VALID_BANK_ADDRESS)
-		n += scnprintf(msg + n, len - n, "bank_address: %d ",
+		n += scnprintf(msg + n, len - n, "bank_address:%d ",
 			       mem->bank & CPER_MEM_BANK_ADDRESS_MASK);
 	if (mem->validation_bits & CPER_MEM_VALID_DEVICE)
-		n += scnprintf(msg + n, len - n, "device: %d ", mem->device);
+		n += scnprintf(msg + n, len - n, "device:%d ", mem->device);
 	if (mem->validation_bits & (CPER_MEM_VALID_ROW | CPER_MEM_VALID_ROW_EXT)) {
 		u32 row = mem->row;
 
 		row |= cper_get_mem_extension(mem->validation_bits, mem->extended);
-		n += scnprintf(msg + n, len - n, "row: %d ", row);
+		n += scnprintf(msg + n, len - n, "row:%d ", row);
 	}
 	if (mem->validation_bits & CPER_MEM_VALID_COLUMN)
-		n += scnprintf(msg + n, len - n, "column: %d ", mem->column);
+		n += scnprintf(msg + n, len - n, "column:%d ", mem->column);
 	if (mem->validation_bits & CPER_MEM_VALID_BIT_POSITION)
-		n += scnprintf(msg + n, len - n, "bit_position: %d ",
+		n += scnprintf(msg + n, len - n, "bit_position:%d ",
 			       mem->bit_pos);
 	if (mem->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
-		n += scnprintf(msg + n, len - n, "requestor_id: 0x%016llx ",
+		n += scnprintf(msg + n, len - n, "requestor_id:0x%016llx ",
 			       mem->requestor_id);
 	if (mem->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
-		n += scnprintf(msg + n, len - n, "responder_id: 0x%016llx ",
+		n += scnprintf(msg + n, len - n, "responder_id:0x%016llx ",
 			       mem->responder_id);
 	if (mem->validation_bits & CPER_MEM_VALID_TARGET_ID)
-		n += scnprintf(msg + n, len - n, "target_id: 0x%016llx ",
+		n += scnprintf(msg + n, len - n, "target_id:0x%016llx ",
 			       mem->target_id);
 	if (mem->validation_bits & CPER_MEM_VALID_CHIP_ID)
-		n += scnprintf(msg + n, len - n, "chip_id: %d ",
+		n += scnprintf(msg + n, len - n, "chip_id:%d ",
 			       mem->extended >> CPER_MEM_CHIP_ID_SHIFT);
 
 	return n;
@@ -276,10 +276,10 @@  static int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
 	len = CPER_REC_LEN;
 	dmi_memdev_name(mem->mem_dev_handle, &bank, &device);
 	if (bank && device)
-		n = snprintf(msg, len, "DIMM location: %s %s ", bank, device);
+		n = snprintf(msg, len, "DIMM location:%s %s ", bank, device);
 	else
 		n = snprintf(msg, len,
-			     "DIMM location: not present. DMI handle: 0x%.4x ",
+			     "DIMM location: not present. DMI handle:0x%.4x ",
 			     mem->mem_dev_handle);
 
 	return n;