diff mbox series

[2/2] dt-bindings: arm: cpus: Add edac-enabled property

Message ID 1744241785-20256-3-git-send-email-vijayb@linux.microsoft.com (mailing list archive)
State New
Series Add L1 and L2 error detection for A53, A57 and A72

Commit Message

Vijay Balakrishna April 9, 2025, 11:36 p.m. UTC
From: Sascha Hauer <s.hauer@pengutronix.de>

Some ARM Cortex CPUs like the A53, A57 and A72 have Error Detection And
Correction (EDAC) support on their L1 and L2 caches. This is implemented
in implementation defined registers, so usage of this functionality is
not safe in virtualized environments or when EL3 already uses these
registers. This patch adds an edac-enabled flag which can be explicitly
set when EDAC can be used.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
[vijayb: Added A72 to the commit message]
Signed-off-by: Vijay Balakrishna <vijayb@linux.microsoft.com>
---
 Documentation/devicetree/bindings/arm/cpus.yaml | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Krzysztof Kozlowski April 10, 2025, 6 a.m. UTC | #1
On 10/04/2025 01:36, Vijay Balakrishna wrote:
> From: Sascha Hauer <s.hauer@pengutronix.de>
> 
> Some ARM Cortex CPUs like the A53, A57 and A72 have Error Detection And
> Correction (EDAC) support on their L1 and L2 caches. This is implemented
> in implementation defined registers, so usage of this functionality is
> not safe in virtualized environments or when EL3 already uses these
> registers. This patch adds an edac-enabled flag which can be explicitly
> set when EDAC can be used.

Can't hypervisor tell you that?

> 
> Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
> [vijayb: Added A72 to the commit message]
> Signed-off-by: Vijay Balakrishna <vijayb@linux.microsoft.com>
> ---
>  Documentation/devicetree/bindings/arm/cpus.yaml | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/Documentation/devicetree/bindings/arm/cpus.yaml b/Documentation/devicetree/bindings/arm/cpus.yaml
> index 2e666b2a4dcd..18d649a18552 100644
> --- a/Documentation/devicetree/bindings/arm/cpus.yaml
> +++ b/Documentation/devicetree/bindings/arm/cpus.yaml
> @@ -331,6 +331,12 @@ properties:
>        corresponding to the index of an SCMI performance domain provider, must be
>        "perf".
>  
> +  edac-enabled:
> +    $ref: '/schemas/types.yaml#/definitions/flag'

Drop quotes - look at every other line.

<form letter>
Please use scripts/get_maintainer.pl to get a list of necessary people
and lists to CC. It might happen that the command, when run on an older
kernel, gives you outdated entries. Therefore please be sure you base
your patches on a recent Linux kernel.

Tools like b4 or scripts/get_maintainer.pl provide you proper list of
people, so fix your workflow. Tools might also fail if you work on some
ancient tree (don't, instead use mainline) or work on fork of kernel
(don't, instead use mainline). Just use b4 and everything should be
fine, although remember about `b4 prep --auto-to-cc` if you added new
patches to the patchset.

You missed at least devicetree list (maybe more), so this won't be
tested by automated tooling. Performing review on untested code might be
a waste of time.

Please kindly resend and include all necessary To/Cc entries.
</form letter>



Best regards,
Krzysztof
Marc Zyngier April 10, 2025, 7:10 a.m. UTC | #2
On Thu, 10 Apr 2025 07:00:55 +0100,
Krzysztof Kozlowski <krzk@kernel.org> wrote:
> 
> On 10/04/2025 01:36, Vijay Balakrishna wrote:
> > From: Sascha Hauer <s.hauer@pengutronix.de>
> > 
> > Some ARM Cortex CPUs like the A53, A57 and A72 have Error Detection And
> > Correction (EDAC) support on their L1 and L2 caches. This is implemented
> > in implementation defined registers, so usage of this functionality is
> > not safe in virtualized environments or when EL3 already uses these
> > registers. This patch adds an edac-enabled flag which can be explicitly
> > set when EDAC can be used.
> 
> Can't hypervisor tell you that?

No, it can't. This is not an architecture feature, and KVM will gladly
inject an UNDEF exception if the guest tries to use this.

Which is yet another reason why this whole exercise is futile.

	M.
Tyler Hicks April 10, 2025, 2:30 p.m. UTC | #3
On 2025-04-10 08:10:18, Marc Zyngier wrote:
> On Thu, 10 Apr 2025 07:00:55 +0100,
> Krzysztof Kozlowski <krzk@kernel.org> wrote:
> > 
> > On 10/04/2025 01:36, Vijay Balakrishna wrote:
> > > From: Sascha Hauer <s.hauer@pengutronix.de>
> > > 
> > > Some ARM Cortex CPUs like the A53, A57 and A72 have Error Detection And
> > > Correction (EDAC) support on their L1 and L2 caches. This is implemented
> > > in implementation defined registers, so usage of this functionality is
> > > not safe in virtualized environments or when EL3 already uses these
> > > registers. This patch adds an edac-enabled flag which can be explicitly
> > > set when EDAC can be used.
> > 
> > Can't hypervisor tell you that?
> 
> No, it can't. This is not an architecture feature, and KVM will gladly
> inject an UNDEF exception if the guest tries to use this.
> 
> Which is yet another reason why this whole exercise is futile.

Hi Marc - could you clarify why this is futile for baremetal or were you just
referring to virtualized environments?

Thanks!

Tyler

> 
> 	M.
> 
> -- 
> Without deviation from the norm, progress is not possible.
Marc Zyngier April 10, 2025, 4:23 p.m. UTC | #4
On Thu, 10 Apr 2025 15:30:17 +0100,
"Tyler Hicks (Microsoft)" <code@tyhicks.com> wrote:
> 
> On 2025-04-10 08:10:18, Marc Zyngier wrote:
> > On Thu, 10 Apr 2025 07:00:55 +0100,
> > Krzysztof Kozlowski <krzk@kernel.org> wrote:
> > > 
> > > On 10/04/2025 01:36, Vijay Balakrishna wrote:
> > > > From: Sascha Hauer <s.hauer@pengutronix.de>
> > > > 
> > > > Some ARM Cortex CPUs like the A53, A57 and A72 have Error Detection And
> > > > Correction (EDAC) support on their L1 and L2 caches. This is implemented
> > > > in implementation defined registers, so usage of this functionality is
> > > > not safe in virtualized environments or when EL3 already uses these
> > > > registers. This patch adds an edac-enabled flag which can be explicitly
> > > > set when EDAC can be used.
> > > 
> > > Can't hypervisor tell you that?
> > 
> > No, it can't. This is not an architecture feature, and KVM will gladly
> > inject an UNDEF exception if the guest tries to use this.
> > 
> > Which is yet another reason why this whole exercise is futile.
> 
> Hi Marc - could you clarify why this is futile for baremetal or were you just
> referring to virtualized environments?

This is futile in general. This sort of stuff only makes sense if you
can take useful action upon detecting an error, such as cache
scrubbing. Here, this is just telling you "bang, you're dead", without
any other recourse. You are not even sure you'll be able to actually
*run* this code. You cannot identify what the blast radius is.

We have some other EDAC implementation for arm64 CPUs (XGene,
ThunderX), and they are all perfectly useless (I have them in my
collection of horrors). I know you are familiar enough with the RAS
architecture to appreciate the difference with a contemporary
implementation that would actually do the right thing.

Thanks,

	M.
Tyler Hicks April 10, 2025, 4:42 p.m. UTC | #5
On 2025-04-10 17:23:26, Marc Zyngier wrote:
> On Thu, 10 Apr 2025 15:30:17 +0100,
> "Tyler Hicks (Microsoft)" <code@tyhicks.com> wrote:
> > 
> > On 2025-04-10 08:10:18, Marc Zyngier wrote:
> > > On Thu, 10 Apr 2025 07:00:55 +0100,
> > > Krzysztof Kozlowski <krzk@kernel.org> wrote:
> > > > 
> > > > On 10/04/2025 01:36, Vijay Balakrishna wrote:
> > > > > From: Sascha Hauer <s.hauer@pengutronix.de>
> > > > > 
> > > > > Some ARM Cortex CPUs like the A53, A57 and A72 have Error Detection And
> > > > > Correction (EDAC) support on their L1 and L2 caches. This is implemented
> > > > > in implementation defined registers, so usage of this functionality is
> > > > > not safe in virtualized environments or when EL3 already uses these
> > > > > registers. This patch adds an edac-enabled flag which can be explicitly
> > > > > set when EDAC can be used.
> > > > 
> > > > Can't hypervisor tell you that?
> > > 
> > > No, it can't. This is not an architecture feature, and KVM will gladly
> > > inject an UNDEF exception if the guest tries to use this.
> > > 
> > > Which is yet another reason why this whole exercise is futile.
> > 
> > Hi Marc - could you clarify why this is futile for baremetal or were you just
> > referring to virtualized environments?
> 
> This is futile in general. This sort of stuff only makes sense if you
> can take useful action upon detecting an error, such as cache
> scrubbing. Here, this is just telling you "bang, you're dead", without
> any other recourse. You are not even sure you'll be able to actually
> *run* this code. You cannot identify what the blast radius is.

We want to use it for monitoring purposes to let us know when a system needs to
be replaced. Knowing the number of Correctable Errors that a specific system is
encountering will help prioritize the replacement of that faulty system.

Also, if we can find some breadcrumbs of an Uncorrectable Error (UE) occurring
just before an important process crashes or before the kernel crashes, then we
can avoid expensive manual debugging and simply replace the system. Automation
can be implemented to dig through the kernel core dump contents to look for a
UE log message from this driver and a kernel engineer will never have to look
at the dump.

> We have some other EDAC implementation for arm64 CPUs (XGene,
> ThunderX), and they are all perfectly useless (I have them in my
> collection of horrors). I know you are familiar enough with the RAS
> architecture to appreciate the difference with a contemporary
> implementation that would actually do the right thing.

Yes, those are nice luxuries to have in the newer implementations but there are
still a lot of older systems in use and making do with what capabilities the
older hardware provides is still useful.

Tyler

> 
> Thanks,
> 
> 	M.
> 
> -- 
> Without deviation from the norm, progress is not possible.
Borislav Petkov April 11, 2025, 8:02 p.m. UTC | #6
On Thu, Apr 10, 2025 at 05:23:26PM +0100, Marc Zyngier wrote:
> We have some other EDAC implementation for arm64 CPUs (XGene,
> ThunderX), and they are all perfectly useless (I have them in my
> collection of horrors).

Oh oh, can I remove, can I remove?

My trigger finger is itching to kill some more useless code...

Thx.
Marc Zyngier April 13, 2025, 10:38 a.m. UTC | #7
On Fri, 11 Apr 2025 21:02:07 +0100,
Borislav Petkov <bp@alien8.de> wrote:
> 
> On Thu, Apr 10, 2025 at 05:23:26PM +0100, Marc Zyngier wrote:
> > We have some other EDAC implementation for arm64 CPUs (XGene,
> > ThunderX), and they are all perfectly useless (I have them in my
> > collection of horrors).
> 
> Oh oh, can I remove, can I remove?
> 
> My trigger finger is itching to kill some more useless code...

The drivers do report ECC errors being corrected, which indicates that
the HW itself is doing its job. Yes, I buy cheap memory from eBay.

Do we need actual drivers to output crap on the console? Probably not,
but I'm the wrong person to ask -- I only keep these machines alive to
remind me how things can go horribly wrong.

I don't think there is any harm in keeping this crap around. It
compiles, and if it breaks, I'll fix it. I'm not convinced we need any
more of it though, especially for CPUs that are over a decade old.

	M.

Patch

diff --git a/Documentation/devicetree/bindings/arm/cpus.yaml b/Documentation/devicetree/bindings/arm/cpus.yaml
index 2e666b2a4dcd..18d649a18552 100644
--- a/Documentation/devicetree/bindings/arm/cpus.yaml
+++ b/Documentation/devicetree/bindings/arm/cpus.yaml
@@ -331,6 +331,12 @@  properties:
       corresponding to the index of an SCMI performance domain provider, must be
       "perf".
 
+  edac-enabled:
+    $ref: '/schemas/types.yaml#/definitions/flag'
+    description:
+      Some CPUs support Error Detection And Correction (EDAC) on their L1 and
+      L2 caches. This flag marks this function as usable.
+
   qcom,saw:
     $ref: /schemas/types.yaml#/definitions/phandle
     description: |
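
As an illustration of how a board devicetree might opt in, here is a minimal
sketch of a cpu node carrying the proposed flag. The node layout, the
"arm,cortex-a53" compatible and the reg value are assumptions chosen for the
example and do not come from the patch; only the edac-enabled property name
does. Going by the commit message, the flag would only be appropriate where
EL3 firmware does not already use the implementation defined registers and
the kernel is not running as a guest.

```dts
/* Hypothetical board fragment, not taken from the series. */
cpus {
	#address-cells = <1>;
	#size-cells = <0>;

	cpu@0 {
		device_type = "cpu";
		compatible = "arm,cortex-a53";
		reg = <0x0>;

		/*
		 * Boolean flag proposed by this patch: its presence marks
		 * L1/L2 cache EDAC as usable on this CPU. Omit it when
		 * running under a hypervisor or when EL3 owns the registers.
		 */
		edac-enabled;
	};
};
```

Because the property is a flag, it carries no value: a consumer would treat
its presence as "usable" and its absence as "do not touch the registers".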