[v4,0/3] Fix CPER issues related to UEFI 2.9A Errata

Message ID cover.1718906288.git.mchehab+huawei@kernel.org (mailing list archive)

Message

Mauro Carvalho Chehab June 20, 2024, 6:01 p.m. UTC
The UEFI 2.9A errata clarifies how the ARM processor error type should be
encoded: like the Generic processor type, it is a bitmask.

The current code assumes, for both generic and ARM processor types,
that this field is an integer, which is incorrect.
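
To illustrate the difference, here is a minimal userspace sketch in Python
(not the kernel code). The function names and the table are hypothetical.
The integer reading assumes the values 0 to 3 select one type each, and the
bit positions are assumed so that 0x0006 maps to cache and TLB errors, as in
the log further down.

```python
# Illustrative only: names and tables are hypothetical, not kernel identifiers.
ERROR_TYPE_NAMES = [
    "cache error",
    "TLB error",
    "bus error",
    "micro-architectural error",
]


def decode_as_integer(value):
    """Integer reading: the field is an index selecting exactly one type."""
    if 0 <= value < len(ERROR_TYPE_NAMES):
        return ERROR_TYPE_NAMES[value]
    return "unknown"


def decode_as_bitmask(value):
    """Bitmask reading (assumed layout): bit 1 is the first name, bit 2 the next, ..."""
    return [name for i, name in enumerate(ERROR_TYPE_NAMES)
            if value & (1 << (i + 1))]


print(decode_as_integer(0x0006))
print(decode_as_bitmask(0x0006))
```

With the integer reading, 0x0006 falls outside the table and no type can be
reported, while the bitmask reading yields two types for the same value.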

Fix it. While here, also fix a compilation issue when using W=1.

After this change, the kernel properly decodes two errors reported in the
same message, as defined in the UEFI spec:

[   75.282430] Memory failure: 0x5cdfd: recovery action for free buddy page: Recovered
[   94.973081] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[   94.973770] {2}[Hardware Error]: event severity: recoverable
[   94.974334] {2}[Hardware Error]:  Error 0, type: recoverable
[   94.974962] {2}[Hardware Error]:   section_type: ARM processor error
[   94.975586] {2}[Hardware Error]:   MIDR: 0x000000000000cd24
[   94.976202] {2}[Hardware Error]:   Multiprocessor Affinity Register (MPIDR): 0x000000000000ab12
[   94.977011] {2}[Hardware Error]:   error affinity level: 2
[   94.977593] {2}[Hardware Error]:   running state: 0x1
[   94.978135] {2}[Hardware Error]:   Power State Coordination Interface state: 4660
[   94.978884] {2}[Hardware Error]:   Error info structure 0:
[   94.979463] {2}[Hardware Error]:   num errors: 3
[   94.979971] {2}[Hardware Error]:    first error captured
[   94.980523] {2}[Hardware Error]:    propagated error captured
[   94.981110] {2}[Hardware Error]:    overflow occurred, error info is incomplete
[   94.981893] {2}[Hardware Error]:    error_type: 0x0006: cache error|TLB error
[   94.982606] {2}[Hardware Error]:    error_info: 0x000000000091000f
[   94.983249] {2}[Hardware Error]:     transaction type: Data Access
[   94.983891] {2}[Hardware Error]:     cache error, operation type: Data write
[   94.984559] {2}[Hardware Error]:     TLB error, operation type: Data write
[   94.985215] {2}[Hardware Error]:     cache level: 2
[   94.985749] {2}[Hardware Error]:     TLB level: 2
[   94.986277] {2}[Hardware Error]:     processor context not corrupted
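
As a cross-check of the error_info line in the log above, the following
Python sketch unpacks 0x000000000091000f. The field offsets and widths are
assumptions (validation bits in bits 0-15, transaction type at bit 16,
operation at bit 18, level at bit 22, context-corrupt flag at bit 25), and
decode_error_info is a hypothetical name, not a kernel function.

```python
def decode_error_info(info):
    """Unpack the fields of an ARM error_info value (assumed layout)."""
    valid = info & 0xFFFF  # low 16 bits say which fields are meaningful
    fields = {}
    if valid & 0x1:
        fields["transaction_type"] = (info >> 16) & 0x3
    if valid & 0x2:
        fields["operation"] = (info >> 18) & 0xF
    if valid & 0x4:
        fields["level"] = (info >> 22) & 0x7
    if valid & 0x8:
        fields["context_corrupt"] = bool((info >> 25) & 0x1)
    return fields


print(decode_error_info(0x000000000091000F))
```

Under these assumptions the value decodes to transaction type 1, operation 4,
level 2 and an uncorrupted context, which lines up with "Data Access",
"Data write", "level: 2" and "processor context not corrupted" in the log.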

The error type is also properly decoded according to table N.17 of the
UEFI 2.10 spec:

	[   94.981893] {2}[Hardware Error]:    error_type: 0x0006: cache error|TLB error
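
A rough Python sketch of how such a line can be produced from the raw value
is shown here. It is not the helper added by this series: the function name,
the bit-to-name table and the handling of unnamed bits are all assumptions
made for illustration.

```python
# Hypothetical bit-to-name table; only bits 1 and 2 are confirmed by the log line.
ERROR_TYPE_BITS = {
    1: "cache error",
    2: "TLB error",
    3: "bus error",
    4: "micro-architectural error",
}


def format_error_type(value):
    """Render a bitmask as 'error_type: 0xNNNN: name|name'."""
    parts = [name for bit, name in sorted(ERROR_TYPE_BITS.items())
             if value & (1 << bit)]
    known = sum(1 << bit for bit in ERROR_TYPE_BITS)
    leftover = value & ~known
    if leftover:
        # Assumption: report bits without a name instead of dropping them.
        parts.append(f"unknown bits 0x{leftover:x}")
    joined = "|".join(parts) if parts else "none"
    return f"error_type: 0x{value:04x}: {joined}"


print(format_error_type(0x0006))
```

Joining the names with "|" keeps every reported type on a single log line,
which is what makes the two simultaneous errors visible.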

Mauro Carvalho Chehab (3):
  efi/cper: Adjust infopfx size to accept an extra space
  efi/cper: Add a new helper function to print bitmasks
  efi/cper: align ARM CPER type with UEFI 2.9A/2.10 specs

 drivers/acpi/apei/ghes.c        | 10 +++---
 drivers/firmware/efi/cper-arm.c | 48 ++++++++++++---------------
 drivers/firmware/efi/cper.c     | 59 +++++++++++++++++++++++++++++++++
 include/linux/cper.h            | 13 +++++---
 4 files changed, 94 insertions(+), 36 deletions(-)

Comments

Ard Biesheuvel June 21, 2024, 7:45 a.m. UTC | #1
Hello Mauro,

How is this v4 different from the preceding 3 revisions that you sent
over the past 2 days?

I would expect an experienced maintainer like yourself to be familiar
with the common practice here: please leave some time between sending
revisions so people can take a look. And if there is a pressing need
to deviate from this rule, at least put an explanation in the commit
log of how the series differs from the preceding one.

Thanks,
Ard.
Mauro Carvalho Chehab June 21, 2024, 3:26 p.m. UTC | #2
Hi Ard,

On Fri, 21 Jun 2024 09:45:16 +0200
Ard Biesheuvel <ardb@kernel.org> wrote:

> Hello Mauro,
> 
> How is this v4 different from the preceding 3 revisions that you sent
> over the past 2 days?
> 
> I would expect an experienced maintainer like yourself to be familiar
> with the common practice here: please leave some time between sending
> revisions so people can take a look. And if there is a pressing need
> to deviate from this rule, at least put an explanation in the commit
> log of how the series differs from the preceding one.

Sorry, I'll add a version changelog for that. Basically, I was missing a
test environment for error injection. Once I got one enabled, and fixed it
to cope with the behavior expected by UEFI 2.9A/2.10, I was able to find
some issues and make some code improvements.

v1:
- Tagged as RFC; mostly meant as a heads-up that the current
  implementation does not follow the spec. It only touches the
  cper code.

v2:
- Fixes the way printks are handled in both the cper_arm and ghes
  drivers;

v3:
- Adds a helper function to produce a buffer describing the
  error bits, used by cper's printk and ghes's pr_warn_ratelimited.
  It also fixes a W=1 error while building cper;

v4:
- The print function had some bugs, which were discovered with
  the help of an error injection tool I'm now using.

I already have another version ready to send. It does some code
cleanup and addresses the issues pointed out by Tony and Jonathan. If
you prefer, I can hold it until Monday to give you some time to look
at it.

Thanks,
Mauro