Message ID | 20211207031905.61906-2-xueshuai@linux.alibaba.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/2] ghes_edac: refactor memory error location processing | expand |
On 07.12.21 11:19:04, Shuai Xue wrote: > The memory error location processing in ghes_edac_report_mem_error() have > Duplicated Code with cper_mem_err_location(), cper_dimm_err_location(), and > cper_mem_err_type_str() in drivers/firmware/efi/cper.c. > > To avoid the duplicated code, this patch introduces the above cper_*() into > ghes_edac_report_mem_error(). It is not really duplicate yet, changes are slightly different which could trigger problems in some parsers. At least those differences should be listed in the patch description. I would rather remove the 'space' delimiter after the colon and take the ghes version of it as logs become harder to read. So ideally there is a unification patch before the "duplication" is removed with changes in both files as necessary for review and to document the change. > > The EDAC error log is now properly reporting the error as follows (all > Validation Bits are enabled): > > [ 375.938411] EDAC MC0: 1 CE single-symbol chipkill ECC on unknown memory (node: 0 card: 0 module: 0 rank: 0 bank: 513 bank_group: 2 bank_address: 1 device: 0 row: 4887 column: 1032 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000 DIMM location: not present. DMI handle: 0x0000 page:0x898b86 offset:0x20 grain:1 syndrome:0x0 - APEI location: node: 0 card: 0 module: 0 rank: 0 bank: 513 bank_group: 2 bank_address: 1 device: 0 row: 4887 column: 1032 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000 DIMM location: not present. DMI handle: 0x0000 status(0x0000000000000000): reserved) > [ 375.938416] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2 > [ 375.938417] {2}[Hardware Error]: It has been corrected by h/w and requires no further action > [ 375.938418] {2}[Hardware Error]: event severity: corrected > [ 375.938419] {2}[Hardware Error]: Error 0, type: corrected > [ 375.938420] {2}[Hardware Error]: section_type: memory error > [ 375.938421] {2}[Hardware Error]: error_status: 0x0000000000000000 > [ 375.938422] {2}[Hardware Error]: physical_address: 0x0000000898b86020 > [ 375.938422] {2}[Hardware Error]: physical_address_mask: 0x0000000000000000 > [ 375.938426] {2}[Hardware Error]: node: 0 card: 0 module: 0 rank: 0 bank: 513 bank_group: 2 bank_address: 1 device: 0 row: 4887 column: 1032 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000 > [ 375.938426] {2}[Hardware Error]: error_type: 4, single-symbol chipkill ECC > [ 375.938428] {2}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000 > > Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> > diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c > index 6ec8edec6329..08eabb2e23f8 100644 > --- a/drivers/firmware/efi/cper.c > +++ b/drivers/firmware/efi/cper.c > @@ -211,7 +211,7 @@ const char *cper_mem_err_type_str(unsigned int etype) > } > EXPORT_SYMBOL_GPL(cper_mem_err_type_str); > > -static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg) > +int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg) > { > u32 len, n; > > @@ -265,7 +265,7 @@ static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg) > return n; > } > > -static int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg) > +int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg) > { > u32 len, n; > const char *bank = NULL, *device = NULL; Even though the ghes driver cannot be built as module, EXPORT_SYMBOL_GPL()s should be added for both. It would be good to add a note to the description that the UEFI_CPER/EDAC_GHES dependency is always solved through ACPI_APEI_GHES/ACPI_APEI. But we should make the UEFI_CPER dependency explicit for EDAC_GHES in Kconfig anyway. -Robert
Hi, Robert, Thank you for your quick comments! I will change these in the next version. On 2021/12/7 PM7:30, Robert Richter wrote: > On 07.12.21 11:19:04, Shuai Xue wrote: >> The memory error location processing in ghes_edac_report_mem_error() have >> Duplicated Code with cper_mem_err_location(), cper_dimm_err_location(), and >> cper_mem_err_type_str() in drivers/firmware/efi/cper.c. >> >> To avoid the duplicated code, this patch introduces the above cper_*() into >> ghes_edac_report_mem_error(). > > It is not really duplicate yet, changes are slightly different which > could trigger problems in some parsers. At least those differences > should be listed in the patch description. I see your concerns. I will document the changes in patch description. > I would rather remove the > 'space' delimiter after the colon and take the ghes version of it as > logs become harder to read. So ideally there is a unification patch > before the "duplication" is removed with changes in both files as > necessary for review and to document the change. I will add a new patch to unify ghes and cper before removing duplication. >> The EDAC error log is now properly reporting the error as follows (all >> Validation Bits are enabled): >> >> [ 375.938411] EDAC MC0: 1 CE single-symbol chipkill ECC on unknown memory (node: 0 card: 0 module: 0 rank: 0 bank: 513 bank_group: 2 bank_address: 1 device: 0 row: 4887 column: 1032 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000 DIMM location: not present. DMI handle: 0x0000 page:0x898b86 offset:0x20 grain:1 syndrome:0x0 - APEI location: node: 0 card: 0 module: 0 rank: 0 bank: 513 bank_group: 2 bank_address: 1 device: 0 row: 4887 column: 1032 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000 DIMM location: not present. DMI handle: 0x0000 status(0x0000000000000000): reserved) >> [ 375.938416] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2 >> [ 375.938417] {2}[Hardware Error]: It has been corrected by h/w and requires no further action >> [ 375.938418] {2}[Hardware Error]: event severity: corrected >> [ 375.938419] {2}[Hardware Error]: Error 0, type: corrected >> [ 375.938420] {2}[Hardware Error]: section_type: memory error >> [ 375.938421] {2}[Hardware Error]: error_status: 0x0000000000000000 >> [ 375.938422] {2}[Hardware Error]: physical_address: 0x0000000898b86020 >> [ 375.938422] {2}[Hardware Error]: physical_address_mask: 0x0000000000000000 >> [ 375.938426] {2}[Hardware Error]: node: 0 card: 0 module: 0 rank: 0 bank: 513 bank_group: 2 bank_address: 1 device: 0 row: 4887 column: 1032 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000 >> [ 375.938426] {2}[Hardware Error]: error_type: 4, single-symbol chipkill ECC >> [ 375.938428] {2}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000 >> >> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> > > >> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c >> index 6ec8edec6329..08eabb2e23f8 100644 >> --- a/drivers/firmware/efi/cper.c >> +++ b/drivers/firmware/efi/cper.c >> @@ -211,7 +211,7 @@ const char *cper_mem_err_type_str(unsigned int etype) >> } >> EXPORT_SYMBOL_GPL(cper_mem_err_type_str); >> >> -static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg) >> +int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg) >> { >> u32 len, n; >> >> @@ -265,7 +265,7 @@ static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg) >> return n; >> } >> >> -static int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg) >> +int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg) >> { >> u32 len, n; >> const char *bank = NULL, *device = NULL; > > Even though the ghes driver cannot be built as module, > EXPORT_SYMBOL_GPL()s should be added for both. I will add EXPORT_SYMBOL_GPL()s. > It would be good to add a note to the description that the > UEFI_CPER/EDAC_GHES dependency is always solved through > ACPI_APEI_GHES/ACPI_APEI. But we should make the UEFI_CPER dependency > explicit for EDAC_GHES in Kconfig anyway. Will document the dependency and add UEFI_CPER dependency explicit for EDAC_GHES in Kconfig in next version. Cheers, Shuai
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c index 6d1ddecbf0da..7b25c3207c11 100644 --- a/drivers/edac/ghes_edac.c +++ b/drivers/edac/ghes_edac.c @@ -239,6 +239,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err) { struct edac_raw_error_desc *e; struct mem_ctl_info *mci; + struct cper_mem_err_compact cmem; struct ghes_pvt *pvt; unsigned long flags; char *p; @@ -292,60 +293,10 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err) /* Error type, mapped on e->msg */ if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE) { + u8 etype = mem_err->error_type; + p = pvt->msg; - switch (mem_err->error_type) { - case 0: - p += sprintf(p, "Unknown"); - break; - case 1: - p += sprintf(p, "No error"); - break; - case 2: - p += sprintf(p, "Single-bit ECC"); - break; - case 3: - p += sprintf(p, "Multi-bit ECC"); - break; - case 4: - p += sprintf(p, "Single-symbol ChipKill ECC"); - break; - case 5: - p += sprintf(p, "Multi-symbol ChipKill ECC"); - break; - case 6: - p += sprintf(p, "Master abort"); - break; - case 7: - p += sprintf(p, "Target abort"); - break; - case 8: - p += sprintf(p, "Parity Error"); - break; - case 9: - p += sprintf(p, "Watchdog timeout"); - break; - case 10: - p += sprintf(p, "Invalid address"); - break; - case 11: - p += sprintf(p, "Mirror Broken"); - break; - case 12: - p += sprintf(p, "Memory Sparing"); - break; - case 13: - p += sprintf(p, "Scrub corrected error"); - break; - case 14: - p += sprintf(p, "Scrub uncorrected error"); - break; - case 15: - p += sprintf(p, "Physical Memory Map-out event"); - break; - default: - p += sprintf(p, "reserved error (%d)", - mem_err->error_type); - } + p += snprintf(p, sizeof(pvt->msg), "%s", cper_mem_err_type_str(etype)); } else { strcpy(pvt->msg, "unknown error"); } @@ -362,42 +313,13 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err) /* Memory error location, mapped on e->location */ p = e->location; - if (mem_err->validation_bits & CPER_MEM_VALID_NODE) - p += sprintf(p, "node:%d ", mem_err->node); - if (mem_err->validation_bits & CPER_MEM_VALID_CARD) - p += sprintf(p, "card:%d ", mem_err->card); - if (mem_err->validation_bits & CPER_MEM_VALID_MODULE) - p += sprintf(p, "module:%d ", mem_err->module); - if (mem_err->validation_bits & CPER_MEM_VALID_RANK_NUMBER) - p += sprintf(p, "rank:%d ", mem_err->rank); - if (mem_err->validation_bits & CPER_MEM_VALID_BANK) - p += sprintf(p, "bank:%d ", mem_err->bank); - if (mem_err->validation_bits & CPER_MEM_VALID_BANK_GROUP) - p += sprintf(p, "bank_group:%d ", - mem_err->bank >> CPER_MEM_BANK_GROUP_SHIFT); - if (mem_err->validation_bits & CPER_MEM_VALID_BANK_ADDRESS) - p += sprintf(p, "bank_address:%d ", - mem_err->bank & CPER_MEM_BANK_ADDRESS_MASK); - if (mem_err->validation_bits & (CPER_MEM_VALID_ROW | CPER_MEM_VALID_ROW_EXT)) { - u32 row = mem_err->row; - - row |= cper_get_mem_extension(mem_err->validation_bits, mem_err->extended); - p += sprintf(p, "row:%d ", row); - } - if (mem_err->validation_bits & CPER_MEM_VALID_COLUMN) - p += sprintf(p, "col:%d ", mem_err->column); - if (mem_err->validation_bits & CPER_MEM_VALID_BIT_POSITION) - p += sprintf(p, "bit_pos:%d ", mem_err->bit_pos); + cper_mem_err_pack(mem_err, &cmem); + p += cper_mem_err_location(&cmem, p); + if (mem_err->validation_bits & CPER_MEM_VALID_MODULE_HANDLE) { - const char *bank = NULL, *device = NULL; struct dimm_info *dimm; - dmi_memdev_name(mem_err->mem_dev_handle, &bank, &device); - if (bank != NULL && device != NULL) - p += sprintf(p, "DIMM location:%s %s ", bank, device); - else - p += sprintf(p, "DIMM DMI handle: 0x%.4x ", - mem_err->mem_dev_handle); + p += cper_dimm_err_location(&cmem, p); dimm = find_dimm_by_handle(mci, mem_err->mem_dev_handle); if (dimm) { @@ -405,9 +327,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err) strcpy(e->label, dimm->label); } } - if (mem_err->validation_bits & CPER_MEM_VALID_CHIP_ID) - p += sprintf(p, "chipID: %d ", - mem_err->extended >> CPER_MEM_CHIP_ID_SHIFT); + if (p > e->location) *(p - 1) = '\0'; @@ -479,15 +399,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err) break; } } - if (mem_err->validation_bits & CPER_MEM_VALID_REQUESTOR_ID) - p += sprintf(p, "requestorID: 0x%016llx ", - (long long)mem_err->requestor_id); - if (mem_err->validation_bits & CPER_MEM_VALID_RESPONDER_ID) - p += sprintf(p, "responderID: 0x%016llx ", - (long long)mem_err->responder_id); - if (mem_err->validation_bits & CPER_MEM_VALID_TARGET_ID) - p += sprintf(p, "targetID: 0x%016llx ", - (long long)mem_err->responder_id); + if (p > pvt->other_detail) *(p - 1) = '\0'; diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c index 6ec8edec6329..08eabb2e23f8 100644 --- a/drivers/firmware/efi/cper.c +++ b/drivers/firmware/efi/cper.c @@ -211,7 +211,7 @@ const char *cper_mem_err_type_str(unsigned int etype) } EXPORT_SYMBOL_GPL(cper_mem_err_type_str); -static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg) +int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg) { u32 len, n; @@ -265,7 +265,7 @@ static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg) return n; } -static int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg) +int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg) { u32 len, n; const char *bank = NULL, *device = NULL; diff --git a/include/linux/cper.h b/include/linux/cper.h index 6a511a1078ca..918e7efffb60 100644 --- a/include/linux/cper.h +++ b/include/linux/cper.h @@ -568,5 +568,7 @@ void cper_print_proc_arm(const char *pfx, const struct cper_sec_proc_arm *proc); void cper_print_proc_ia(const char *pfx, const struct cper_sec_proc_ia *proc); +int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg); +int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg); #endif
The memory error location processing in ghes_edac_report_mem_error() have Duplicated Code with cper_mem_err_location(), cper_dimm_err_location(), and cper_mem_err_type_str() in drivers/firmware/efi/cper.c. To avoid the duplicated code, this patch introduces the above cper_*() into ghes_edac_report_mem_error(). The EDAC error log is now properly reporting the error as follows (all Validation Bits are enabled): [ 375.938411] EDAC MC0: 1 CE single-symbol chipkill ECC on unknown memory (node: 0 card: 0 module: 0 rank: 0 bank: 513 bank_group: 2 bank_address: 1 device: 0 row: 4887 column: 1032 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000 DIMM location: not present. DMI handle: 0x0000 page:0x898b86 offset:0x20 grain:1 syndrome:0x0 - APEI location: node: 0 card: 0 module: 0 rank: 0 bank: 513 bank_group: 2 bank_address: 1 device: 0 row: 4887 column: 1032 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000 DIMM location: not present. DMI handle: 0x0000 status(0x0000000000000000): reserved) [ 375.938416] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2 [ 375.938417] {2}[Hardware Error]: It has been corrected by h/w and requires no further action [ 375.938418] {2}[Hardware Error]: event severity: corrected [ 375.938419] {2}[Hardware Error]: Error 0, type: corrected [ 375.938420] {2}[Hardware Error]: section_type: memory error [ 375.938421] {2}[Hardware Error]: error_status: 0x0000000000000000 [ 375.938422] {2}[Hardware Error]: physical_address: 0x0000000898b86020 [ 375.938422] {2}[Hardware Error]: physical_address_mask: 0x0000000000000000 [ 375.938426] {2}[Hardware Error]: node: 0 card: 0 module: 0 rank: 0 bank: 513 bank_group: 2 bank_address: 1 device: 0 row: 4887 column: 1032 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000 [ 375.938426] {2}[Hardware Error]: error_type: 4, single-symbol chipkill ECC [ 375.938428] {2}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000 Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> --- drivers/edac/ghes_edac.c | 108 ++++-------------------------------- drivers/firmware/efi/cper.c | 4 +- include/linux/cper.h | 2 + 3 files changed, 14 insertions(+), 100 deletions(-)