Message ID | cover.1718906288.git.mchehab+huawei@kernel.org (mailing list archive) |
---|---|
Series | Fix CPER issues related to UEFI 2.9A Errata |
On Thu, 20 Jun 2024 at 20:01, Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
>
> The UEFI 2.9A errata makes clear how ARM processor type encoding should
> be done: it is meant to be equal to Generic processor, using a bitmask.
>
> The current code assumes, for both generic and ARM processor types
> that this is an integer, which is an incorrect assumption.
>
> Fix it. While here, also fix a compilation issue when using W=1.
>
> After the change, Kernel will properly decode receiving two errors at the same
> message, as defined at UEFI spec:
>
> [ 75.282430] Memory failure: 0x5cdfd: recovery action for free buddy page: Recovered
> [ 94.973081] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> [ 94.973770] {2}[Hardware Error]: event severity: recoverable
> [ 94.974334] {2}[Hardware Error]: Error 0, type: recoverable
> [ 94.974962] {2}[Hardware Error]: section_type: ARM processor error
> [ 94.975586] {2}[Hardware Error]: MIDR: 0x000000000000cd24
> [ 94.976202] {2}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x000000000000ab12
> [ 94.977011] {2}[Hardware Error]: error affinity level: 2
> [ 94.977593] {2}[Hardware Error]: running state: 0x1
> [ 94.978135] {2}[Hardware Error]: Power State Coordination Interface state: 4660
> [ 94.978884] {2}[Hardware Error]: Error info structure 0:
> [ 94.979463] {2}[Hardware Error]: num errors: 3
> [ 94.979971] {2}[Hardware Error]: first error captured
> [ 94.980523] {2}[Hardware Error]: propagated error captured
> [ 94.981110] {2}[Hardware Error]: overflow occurred, error info is incomplete
> [ 94.981893] {2}[Hardware Error]: error_type: 0x0006: cache error|TLB error
> [ 94.982606] {2}[Hardware Error]: error_info: 0x000000000091000f
> [ 94.983249] {2}[Hardware Error]: transaction type: Data Access
> [ 94.983891] {2}[Hardware Error]: cache error, operation type: Data write
> [ 94.984559] {2}[Hardware Error]: TLB error, operation type: Data write
> [ 94.985215] {2}[Hardware Error]: cache level: 2
> [ 94.985749] {2}[Hardware Error]: TLB level: 2
> [ 94.986277] {2}[Hardware Error]: processor context not corrupted
>
> And the error code is properly decoded according with table N.17 from UEFI 2.10
> spec:
>
> [ 94.981893] {2}[Hardware Error]: error_type: 0x0006: cache error|TLB error
>
> Mauro Carvalho Chehab (3):
> efi/cper: Adjust infopfx size to accept an extra space
> efi/cper: Add a new helper function to print bitmasks
> efi/cper: align ARM CPER type with UEFI 2.9A/2.10 specs
>

Hello Mauro,

How is this v4 different from the preceding 3 revisions that you sent
over the past 2 days?

I would expect an experienced maintainer like yourself to be familiar
with the common practice here: please leave some time between sending
revisions so people can take a look. And if there is a pressing need
to deviate from this rule, at least put an explanation in the commit
log of how the series differs from the preceding one.

Thanks,
Ard.
Hi Ard,

On Fri, 21 Jun 2024 09:45:16 +0200, Ard Biesheuvel <ardb@kernel.org> wrote:

> On Thu, 20 Jun 2024 at 20:01, Mauro Carvalho Chehab
> <mchehab+huawei@kernel.org> wrote:
> >
> > The UEFI 2.9A errata makes clear how ARM processor type encoding should
> > be done: it is meant to be equal to Generic processor, using a bitmask.
> >
> > The current code assumes, for both generic and ARM processor types
> > that this is an integer, which is an incorrect assumption.
> >
> > Fix it. While here, also fix a compilation issue when using W=1.
> >
> > After the change, Kernel will properly decode receiving two errors at the same
> > message, as defined at UEFI spec:
> >
> > [ 75.282430] Memory failure: 0x5cdfd: recovery action for free buddy page: Recovered
> > [ 94.973081] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> > [ 94.973770] {2}[Hardware Error]: event severity: recoverable
> > [ 94.974334] {2}[Hardware Error]: Error 0, type: recoverable
> > [ 94.974962] {2}[Hardware Error]: section_type: ARM processor error
> > [ 94.975586] {2}[Hardware Error]: MIDR: 0x000000000000cd24
> > [ 94.976202] {2}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x000000000000ab12
> > [ 94.977011] {2}[Hardware Error]: error affinity level: 2
> > [ 94.977593] {2}[Hardware Error]: running state: 0x1
> > [ 94.978135] {2}[Hardware Error]: Power State Coordination Interface state: 4660
> > [ 94.978884] {2}[Hardware Error]: Error info structure 0:
> > [ 94.979463] {2}[Hardware Error]: num errors: 3
> > [ 94.979971] {2}[Hardware Error]: first error captured
> > [ 94.980523] {2}[Hardware Error]: propagated error captured
> > [ 94.981110] {2}[Hardware Error]: overflow occurred, error info is incomplete
> > [ 94.981893] {2}[Hardware Error]: error_type: 0x0006: cache error|TLB error
> > [ 94.982606] {2}[Hardware Error]: error_info: 0x000000000091000f
> > [ 94.983249] {2}[Hardware Error]: transaction type: Data Access
> > [ 94.983891] {2}[Hardware Error]: cache error, operation type: Data write
> > [ 94.984559] {2}[Hardware Error]: TLB error, operation type: Data write
> > [ 94.985215] {2}[Hardware Error]: cache level: 2
> > [ 94.985749] {2}[Hardware Error]: TLB level: 2
> > [ 94.986277] {2}[Hardware Error]: processor context not corrupted
> >
> > And the error code is properly decoded according with table N.17 from UEFI 2.10
> > spec:
> >
> > [ 94.981893] {2}[Hardware Error]: error_type: 0x0006: cache error|TLB error
> >
> > Mauro Carvalho Chehab (3):
> > efi/cper: Adjust infopfx size to accept an extra space
> > efi/cper: Add a new helper function to print bitmasks
> > efi/cper: align ARM CPER type with UEFI 2.9A/2.10 specs
> >
>
> Hello Mauro,
>
> How is this v4 different from the preceding 3 revisions that you sent
> over the past 2 days?
>
> I would expect an experienced maintainer like yourself to be familiar
> with the common practice here: please leave some time between sending
> revisions so people can take a look. And if there is a pressing need
> to deviate from this rule, at least put an explanation in the commit
> log of how the series differs from the preceding one.

Sorry, I'll add a version changelog for that.

Basically, I was missing a test environment to do error injection. Once
I got it enabled, and fixed to cope with the behavior expected by
UEFI 2.9A/2.10, I was able to discover some issues and make some code
improvements.

v1:
- (tagged as RFC) was mostly a heads-up that the current
  implementation does not follow the spec. It also touches only the
  cper code.

v2:
- It fixes the way printks are handled in both the cper_arm and ghes
  drivers;

v3:
- It adds a helper function to produce a buffer describing the error
  bits for cper's printk and ghes's pr_warn_ratelimited. It also fixes
  a W=1 error while building cper;

v4:
- The print function had some bugs, which were discovered with the
  help of an error injection tool I'm now using.

I already have another version ready to send. It does some code
cleanup and addresses the issues pointed out by Tony and Jonathan. If
you prefer, I can hold it until Monday to give you some time to look
at it.

Thanks,
Mauro
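
For readers following the thread, the bitmask decoding that the cover letter describes (error_type 0x0006 printed as "cache error|TLB error") can be illustrated with a small userspace sketch. This is not the helper from the series: `bits_to_str` and `arm_err_names` are hypothetical names chosen for illustration. Only the bit 1 (cache error) and bit 2 (TLB error) entries are taken from the log quoted above; the bit 3 and bit 4 entries are assumptions that should be checked against table N.17 of the UEFI spec.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical name table, indexed by bit position.
 * Bits 1 and 2 match the quoted log (0x0006 -> "cache error|TLB error").
 * Bits 3 and 4 are assumptions, not confirmed by this thread.
 */
static const char *const arm_err_names[] = {
	NULL,				/* bit 0: no name in this sketch */
	"cache error",			/* bit 1 */
	"TLB error",			/* bit 2 */
	"bus error",			/* bit 3 (assumption) */
	"micro-architectural error",	/* bit 4 (assumption) */
};

/*
 * Write the names of all set bits into buf, separated by '|'.
 * Returns the length of the resulting string. If buf is too small,
 * the output is truncated but always NUL-terminated.
 */
static size_t bits_to_str(char *buf, size_t size, unsigned int bits,
			  const char *const names[], size_t n_names)
{
	size_t len = 0;

	assert(buf && size > 0);
	buf[0] = '\0';

	for (size_t i = 0; i < n_names; i++) {
		int n;

		/* Skip bits that are clear or that have no name. */
		if (!(bits & (1u << i)) || !names[i])
			continue;

		n = snprintf(buf + len, size - len, "%s%s",
			     len ? "|" : "", names[i]);
		if (n < 0 || (size_t)n >= size - len)
			return strlen(buf);	/* error or truncated */
		len += (size_t)n;
	}
	return len;
}

/*
 * Usage (e.g. from main()):
 *
 *	char buf[64];
 *
 *	bits_to_str(buf, sizeof(buf), 0x0006, arm_err_names, 5);
 *	printf("error_type: 0x%04x: %s\n", 0x0006, buf);
 */
```

Treating the field as an integer would map 0x0006 to at most one error type, whereas testing each bit separately lets one record report several error types at once, which is the behavior the cover letter describes.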