Message ID | 1744241785-20256-3-git-send-email-vijayb@linux.microsoft.com (mailing list archive) |
---|---|
State | New |
Series | Add L1 and L2 error detection for A53, A57 and A72 |
On 10/04/2025 01:36, Vijay Balakrishna wrote:
> From: Sascha Hauer <s.hauer@pengutronix.de>
>
> Some ARM Cortex CPUs like the A53, A57 and A72 have Error Detection And
> Correction (EDAC) support on their L1 and L2 caches. This is implemented
> in implementation defined registers, so usage of this functionality is
> not safe in virtualized environments or when EL3 already uses these
> registers. This patch adds an edac-enabled flag which can be explicitly
> set when EDAC can be used.

Can't hypervisor tell you that?

>
> Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
> [vijayb: Added A72 to the commit message]
> Signed-off-by: Vijay Balakrishna <vijayb@linux.microsoft.com>
> ---
>  Documentation/devicetree/bindings/arm/cpus.yaml | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/Documentation/devicetree/bindings/arm/cpus.yaml b/Documentation/devicetree/bindings/arm/cpus.yaml
> index 2e666b2a4dcd..18d649a18552 100644
> --- a/Documentation/devicetree/bindings/arm/cpus.yaml
> +++ b/Documentation/devicetree/bindings/arm/cpus.yaml
> @@ -331,6 +331,12 @@ properties:
>        corresponding to the index of an SCMI performance domain provider, must be
>        "perf".
>
> +  edac-enabled:
> +    $ref: '/schemas/types.yaml#/definitions/flag'

Drop quotes - look at every other line.

<form letter>
Please use scripts/get_maintainer.pl to get a list of necessary people
and lists to CC. It might happen that the command, when run on an older
kernel, gives you outdated entries. Therefore please be sure you base
your patches on a recent Linux kernel.

Tools like b4 or scripts/get_maintainer.pl provide you with a proper
list of people, so fix your workflow. Tools might also fail if you work
on some ancient tree (don't, instead use mainline) or work on a fork of
the kernel (don't, instead use mainline). Just use b4 and everything
should be fine, although remember about `b4 prep --auto-to-cc` if you
added new patches to the patchset.

You missed at least the devicetree list (maybe more), so this won't be
tested by automated tooling. Performing review on untested code might be
a waste of time.

Please kindly resend and include all necessary To/Cc entries.
</form letter>

Best regards,
Krzysztof

On Thu, 10 Apr 2025 07:00:55 +0100,
Krzysztof Kozlowski <krzk@kernel.org> wrote:
>
> On 10/04/2025 01:36, Vijay Balakrishna wrote:
> > From: Sascha Hauer <s.hauer@pengutronix.de>
> >
> > Some ARM Cortex CPUs like the A53, A57 and A72 have Error Detection And
> > Correction (EDAC) support on their L1 and L2 caches. This is implemented
> > in implementation defined registers, so usage of this functionality is
> > not safe in virtualized environments or when EL3 already uses these
> > registers. This patch adds an edac-enabled flag which can be explicitly
> > set when EDAC can be used.
>
> Can't hypervisor tell you that?

No, it can't. This is not an architecture feature, and KVM will gladly
inject an UNDEF exception if the guest tries to use this.

Which is yet another reason why this whole exercise is futile.

M.

On 2025-04-10 08:10:18, Marc Zyngier wrote:
> On Thu, 10 Apr 2025 07:00:55 +0100,
> Krzysztof Kozlowski <krzk@kernel.org> wrote:
> >
> > On 10/04/2025 01:36, Vijay Balakrishna wrote:
> > > From: Sascha Hauer <s.hauer@pengutronix.de>
> > >
> > > Some ARM Cortex CPUs like the A53, A57 and A72 have Error Detection And
> > > Correction (EDAC) support on their L1 and L2 caches. This is implemented
> > > in implementation defined registers, so usage of this functionality is
> > > not safe in virtualized environments or when EL3 already uses these
> > > registers. This patch adds an edac-enabled flag which can be explicitly
> > > set when EDAC can be used.
> >
> > Can't hypervisor tell you that?
>
> No, it can't. This is not an architecture feature, and KVM will gladly
> inject an UNDEF exception if the guest tries to use this.
>
> Which is yet another reason why this whole exercise is futile.

Hi Marc - could you clarify why this is futile for baremetal or were you just
referring to virtualized environments? Thanks!

Tyler

>
> M.
>
> --
> Without deviation from the norm, progress is not possible.

On Thu, 10 Apr 2025 15:30:17 +0100,
"Tyler Hicks (Microsoft)" <code@tyhicks.com> wrote:
>
> On 2025-04-10 08:10:18, Marc Zyngier wrote:
> > On Thu, 10 Apr 2025 07:00:55 +0100,
> > Krzysztof Kozlowski <krzk@kernel.org> wrote:
> > >
> > > On 10/04/2025 01:36, Vijay Balakrishna wrote:
> > > > From: Sascha Hauer <s.hauer@pengutronix.de>
> > > >
> > > > Some ARM Cortex CPUs like the A53, A57 and A72 have Error Detection And
> > > > Correction (EDAC) support on their L1 and L2 caches. This is implemented
> > > > in implementation defined registers, so usage of this functionality is
> > > > not safe in virtualized environments or when EL3 already uses these
> > > > registers. This patch adds an edac-enabled flag which can be explicitly
> > > > set when EDAC can be used.
> > >
> > > Can't hypervisor tell you that?
> >
> > No, it can't. This is not an architecture feature, and KVM will gladly
> > inject an UNDEF exception if the guest tries to use this.
> >
> > Which is yet another reason why this whole exercise is futile.
>
> Hi Marc - could you clarify why this is futile for baremetal or were you just
> referring to virtualized environments?

This is futile in general. This sort of stuff only makes sense if you
can take useful action upon detecting an error, such as cache
scrubbing. Here, this is just telling you "bang, you're dead", without
any other recourse. You are not even sure you'll be able to actually
*run* this code. You cannot identify what the blast radius is.

We have some other EDAC implementation for arm64 CPUs (XGene,
ThunderX), and they are all perfectly useless (I have them in my
collection of horrors). I know you are familiar enough with the RAS
architecture to appreciate the difference with a contemporary
implementation that would actually do the right thing.

Thanks,

M.

On 2025-04-10 17:23:26, Marc Zyngier wrote:
> On Thu, 10 Apr 2025 15:30:17 +0100,
> "Tyler Hicks (Microsoft)" <code@tyhicks.com> wrote:
> >
> > On 2025-04-10 08:10:18, Marc Zyngier wrote:
> > > On Thu, 10 Apr 2025 07:00:55 +0100,
> > > Krzysztof Kozlowski <krzk@kernel.org> wrote:
> > > >
> > > > On 10/04/2025 01:36, Vijay Balakrishna wrote:
> > > > > From: Sascha Hauer <s.hauer@pengutronix.de>
> > > > >
> > > > > Some ARM Cortex CPUs like the A53, A57 and A72 have Error Detection And
> > > > > Correction (EDAC) support on their L1 and L2 caches. This is implemented
> > > > > in implementation defined registers, so usage of this functionality is
> > > > > not safe in virtualized environments or when EL3 already uses these
> > > > > registers. This patch adds an edac-enabled flag which can be explicitly
> > > > > set when EDAC can be used.
> > > >
> > > > Can't hypervisor tell you that?
> > >
> > > No, it can't. This is not an architecture feature, and KVM will gladly
> > > inject an UNDEF exception if the guest tries to use this.
> > >
> > > Which is yet another reason why this whole exercise is futile.
> >
> > Hi Marc - could you clarify why this is futile for baremetal or were you just
> > referring to virtualized environments?
>
> This is futile in general. This sort of stuff only makes sense if you
> can take useful action upon detecting an error, such as cache
> scrubbing. Here, this is just telling you "bang, you're dead", without
> any other recourse. You are not even sure you'll be able to actually
> *run* this code. You cannot identify what the blast radius is.

We want to use it for monitoring purposes to let us know when a system needs
to be replaced. Knowing the number of Correctable Errors that a specific
system is encountering will help prioritize the replacement of that faulty
system.

Also, if we can find some breadcrumbs of an Uncorrectable Error (UE) occurring
just before an important process crashes or before the kernel crashes, then we
can avoid expensive manual debugging and simply replace the system. Automation
can be implemented to dig through the kernel core dump contents to look for a
UE log message from this driver and a kernel engineer will never have to look
at the dump.

> We have some other EDAC implementation for arm64 CPUs (XGene,
> ThunderX), and they are all perfectly useless (I have them in my
> collection of horrors). I know you are familiar enough with the RAS
> architecture to appreciate the difference with a contemporary
> implementation that would actually do the right thing.

Yes, those are nice luxuries to have in the newer implementations but there
are still a lot of older systems in use and making do with what capabilities
the older hardware provides is still useful.

Tyler

>
> Thanks,
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.

On Thu, Apr 10, 2025 at 05:23:26PM +0100, Marc Zyngier wrote:
> We have some other EDAC implementation for arm64 CPUs (XGene,
> ThunderX), and they are all perfectly useless (I have them in my
> collection of horrors).

Oh oh, can I remove, can I remove?

My trigger finger is itching to kill some more useless code...

Thx.

On Fri, 11 Apr 2025 21:02:07 +0100,
Borislav Petkov <bp@alien8.de> wrote:
>
> On Thu, Apr 10, 2025 at 05:23:26PM +0100, Marc Zyngier wrote:
> > We have some other EDAC implementation for arm64 CPUs (XGene,
> > ThunderX), and they are all perfectly useless (I have them in my
> > collection of horrors).
>
> Oh oh, can I remove, can I remove?
>
> My trigger finger is itching to kill some more useless code...

The drivers do report ECC errors being corrected, which indicates that
the HW itself is doing its job. Yes, I buy cheap memory from eBay.

Do we need actual drivers to output crap on the console? Probably not,
but I'm the wrong person to ask -- I only keep these machines alive to
remind me how things can go horribly wrong.

I don't think there is any harm in keeping this crap around. It
compiles, and if it breaks, I'll fix it. I'm not convinced we need any
more of it though, especially for CPUs that are over a decade old.

M.

diff --git a/Documentation/devicetree/bindings/arm/cpus.yaml b/Documentation/devicetree/bindings/arm/cpus.yaml
index 2e666b2a4dcd..18d649a18552 100644
--- a/Documentation/devicetree/bindings/arm/cpus.yaml
+++ b/Documentation/devicetree/bindings/arm/cpus.yaml
@@ -331,6 +331,12 @@ properties:
       corresponding to the index of an SCMI performance domain provider, must be
       "perf".
 
+  edac-enabled:
+    $ref: '/schemas/types.yaml#/definitions/flag'
+    description:
+      Some CPUs support Error Detection And Correction (EDAC) on their L1 and
+      L2 caches. This flag marks this function as usable.
+
   qcom,saw:
     $ref: /schemas/types.yaml#/definitions/phandle
     description: |
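
For reference, the review comment "Drop quotes - look at every other line" would presumably turn the new property into something like the sketch below. This is only an illustration of that one remark, not a posted revision of the patch; the indentation is assumed to match the neighbouring qcom,saw property, and the enclosing properties: key is included only so that the fragment parses on its own.

```yaml
# Sketch only: the edac-enabled property from the hunk above with the
# quotes around the $ref value dropped, as requested in review.
# Indentation is an assumption based on the sibling qcom,saw entry.
properties:
  edac-enabled:
    $ref: /schemas/types.yaml#/definitions/flag
    description:
      Some CPUs support Error Detection And Correction (EDAC) on their L1 and
      L2 caches. This flag marks this function as usable.
```

The unquoted form matches the existing qcom,saw entry in the same hunk, which writes its $ref without quotes.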