diff mbox series

[1/6] dt-bindings: firmware: Add arm,errata-management

Message ID 20230330165128.3237939-2-james.morse@arm.com (mailing list archive)
State New, archived
Headers show
Series arm64: errata: Disable FWB on parts with non-ARM interconnects | expand

Commit Message

James Morse March 30, 2023, 4:51 p.m. UTC
The Errata Management SMCCC interface allows firmware to advertise whether
the OS is affected by an erratum, or if a higher exception level has
mitigated the issue. This allows properties of the device that are not
discoverable by the OS to be described. e.g. some errata depend on the
behaviour of the interconnect, which is not visible to the OS.

Deployed devices may find it significantly harder to update EL3
firmware than the device tree. Erratum workarounds typically have to
fail safe, and assume the platform is affected putting correctness
above performance.

Instead of adding a device-tree entry for any CPU errata that is
relevant (or not) to the platform, allow the device-tree to describe
firmware's responses for the SMCCC interface. This could be used as
the data source for the firmware interface, or be parsed by the OS if
the firmware interface is missing.

Most errata can be detected from CPU id registers. These mechanisms
are only needed for the rare cases that external knowledge is needed.

Suggested-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
---
 .../devicetree/bindings/arm/cpus.yaml         |  5 ++
 .../firmware/arm,errata-management.yaml       | 77 +++++++++++++++++++
 2 files changed, 82 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/firmware/arm,errata-management.yaml

Comments

Rob Herring (Arm) March 30, 2023, 8:50 p.m. UTC | #1
On Thu, 30 Mar 2023 17:51:23 +0100, James Morse wrote:
> The Errata Management SMCCC interface allows firmware to advertise whether
> the OS is affected by an erratum, or if a higher exception level has
> mitigated the issue. This allows properties of the device that are not
> discoverable by the OS to be described. e.g. some errata depend on the
> behaviour of the interconnect, which is not visible to the OS.
> 
> Deployed devices may find it significantly harder to update EL3
> firmware than the device tree. Erratum workarounds typically have to
> fail safe, and assume the platform is affected putting correctness
> above performance.
> 
> Instead of adding a device-tree entry for any CPU errata that is
> relevant (or not) to the platform, allow the device-tree to describe
> firmware's responses for the SMCCC interface. This could be used as
> the data source for the firmware interface, or be parsed by the OS if
> the firmware interface is missing.
> 
> Most errata can be detected from CPU id registers. These mechanisms
> are only needed for the rare cases that external knowledge is needed.
> 
> Suggested-by: Andre Przywara <andre.przywara@arm.com>
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
>  .../devicetree/bindings/arm/cpus.yaml         |  5 ++
>  .../firmware/arm,errata-management.yaml       | 77 +++++++++++++++++++
>  2 files changed, 82 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/firmware/arm,errata-management.yaml
> 

My bot found errors running 'make DT_CHECKER_FLAGS=-m dt_binding_check'
on your patch (DT_CHECKER_FLAGS is new in v5.13):

yamllint warnings/errors:
./Documentation/devicetree/bindings/firmware/arm,errata-management.yaml:48:5: [warning] wrong indentation: expected 6 but found 4 (indentation)
./Documentation/devicetree/bindings/firmware/arm,errata-management.yaml:50:5: [warning] wrong indentation: expected 6 but found 4 (indentation)
./Documentation/devicetree/bindings/firmware/arm,errata-management.yaml:52:5: [warning] wrong indentation: expected 6 but found 4 (indentation)

dtschema/dtc warnings/errors:

doc reference errors (make refcheckdocs):

See https://patchwork.ozlabs.org/project/devicetree-bindings/patch/20230330165128.3237939-2-james.morse@arm.com

The base for the series is generally the latest rc1. A different dependency
should be noted in *this* patch.

If you already ran 'make dt_binding_check' and didn't see the above
error(s), then make sure 'yamllint' is installed and dt-schema is up to
date:

pip3 install dtschema --upgrade

Please check and re-submit after running the above command yourself. Note
that DT_SCHEMA_FILES can be set to your schema file to speed up checking
your schema. However, it must be unset to test all examples with your schema.
Krzysztof Kozlowski March 31, 2023, 8:29 a.m. UTC | #2
On 30/03/2023 18:51, James Morse wrote:
> The Errata Management SMCCC interface allows firmware to advertise whether
> the OS is affected by an erratum, or if a higher exception level has
> mitigated the issue. This allows properties of the device that are not
> discoverable by the OS to be described. e.g. some errata depend on the
> behaviour of the interconnect, which is not visible to the OS.
> 
> Deployed devices may find it significantly harder to update EL3
> firmware than the device tree. Erratum workarounds typically have to
> fail safe, and assume the platform is affected putting correctness
> above performance.
> 
> Instead of adding a device-tree entry for any CPU errata that is
> relevant (or not) to the platform, allow the device-tree to describe
> firmware's responses for the SMCCC interface. This could be used as
> the data source for the firmware interface, or be parsed by the OS if
> the firmware interface is missing.
> 
> Most errata can be detected from CPU id registers. These mechanisms
> are only needed for the rare cases that external knowledge is needed.
> 
> Suggested-by: Andre Przywara <andre.przywara@arm.com>
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
>  .../devicetree/bindings/arm/cpus.yaml         |  5 ++
>  .../firmware/arm,errata-management.yaml       | 77 +++++++++++++++++++
>  2 files changed, 82 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/firmware/arm,errata-management.yaml
> 
> diff --git a/Documentation/devicetree/bindings/arm/cpus.yaml b/Documentation/devicetree/bindings/arm/cpus.yaml
> index c145f6a035ee..47b12761f305 100644
> --- a/Documentation/devicetree/bindings/arm/cpus.yaml
> +++ b/Documentation/devicetree/bindings/arm/cpus.yaml
> @@ -257,6 +257,11 @@ properties:
>        List of phandles to idle state nodes supported
>        by this cpu (see ./idle-states.yaml).
>  
> +  arm,erratum-list:
> +    $ref: '/schemas/types.yaml#/definitions/phandle'
> +    description:
> +      Specifies the firmware cpu-erratum-list node associated with this CPU.
> +
>    capacity-dmips-mhz:
>      description:
>        u32 value representing CPU capacity (see ../cpu/cpu-capacity.txt) in
> diff --git a/Documentation/devicetree/bindings/firmware/arm,errata-management.yaml b/Documentation/devicetree/bindings/firmware/arm,errata-management.yaml
> new file mode 100644
> index 000000000000..9baeb3d35213
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/firmware/arm,errata-management.yaml
> @@ -0,0 +1,77 @@
> +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/firmware/arm,errata-management.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#

Except missing testing...

> +
> +title: Errata Management Firmware Interface
> +
> +maintainers:
> +  - James Morse <james.morse@arm.com>
> +
> +description: |+

Do not need '|+'.

> +  The SMC-CC has an erratum discovery interface that allows the OS to discover
> +  whether a particular CPU is affected by a specific erratum when the
> +  configurations affected is only known by firmware. See the specification of
> +  the same title on developer.arm.com, document DEN0100.
> +  Provide the values that should be used by the interface, either to supplement
> +  firmware, or override the values firmware provides.

Why? If you have the discovery interface, don't add stuff to the DT, but
use that interface.

> +  Most errata can be detected from CPU id registers. These mechanisms are only
> +  needed for the rare cases that external knowledge is needed.
> +  The CPU node should hold a phandle that points to the cpu-erratum-list node.
> +
> +properties:
> +  compatible:
> +    items:

One item, so drop "items".

> +      - const: arm,cpu-erratum-list
> +
> +  arm,erratum-affected:
> +    description: Erratum numbers that this CPU is affected by.

Isn't this explicit from CPU compatible?

> +    $ref: /schemas/types.yaml#/definitions/uint32-array
> +    minItems: 1

What do the numbers mean?

maxItems?

> +
> +  arm,erratum-not-affected:
> +    description: Erratum numbers that this CPU is not affected by.
> +    $ref: /schemas/types.yaml#/definitions/uint32-array
> +    minItems: 1
> +
> +  arm,erratum-higher-el-mitigation:
> +    description: Erratum numbers that have been mitigated by a higher level
> +      of firmware
> +    $ref: /schemas/types.yaml#/definitions/uint32-array
> +    minItems: 1
> +
> +required:
> +  - compatible

Missing blank line

> +anyOf:
> +  - required:
> +    - 'arm,erratum-affected'
> +  - required:
> +    - 'arm,erratum-not-affected'
> +  - required:
> +    - 'arm,erratum-higher-el-mitigation'

Drop quotes

> +
> +additionalProperties: false
> +
> +examples:
> +  - |
> +    firmware {
> +      CL1_ERRATA: cluster1-errata {
> +          compatible = "arm,cpu-erratum-list";
> +          arm,erratum-not-affected = <2701952>;
> +      };
> +    };
> +
> +    cpus {
> +      #size-cells = <0>;
> +      #address-cells = <1>;
> +
> +      cpu@0 {
> +        device_type = "cpu";
> +        compatible = "arm,cortex-x2";
> +        reg = <0x0>;
> +        arm,erratum-list = <&CL1_ERRATA>;
> +      };
> +    };
> +
> +...

Best regards,
Krzysztof
Rob Herring March 31, 2023, 1:46 p.m. UTC | #3
On Thu, Mar 30, 2023 at 11:52 AM James Morse <james.morse@arm.com> wrote:
>
> The Errata Management SMCCC interface allows firmware to advertise whether
> the OS is affected by an erratum, or if a higher exception level has
> mitigated the issue. This allows properties of the device that are not
> discoverable by the OS to be described. e.g. some errata depend on the
> behaviour of the interconnect, which is not visible to the OS.
>
> Deployed devices may find it significantly harder to update EL3
> firmware than the device tree. Erratum workarounds typically have to
> fail safe, and assume the platform is affected putting correctness
> above performance.

Updating the DT is still harder than updating the kernel. A well
designed binding allows for errata fixes without DT changes. That
generally means specific compatibles up front rather than adding
properties to turn things on/off.

Do we really want to encourage/endorse/support non-updatable firmware?
We've rejected plenty of things with 'fix your firmware'.

> Instead of adding a device-tree entry for any CPU errata that is
> relevant (or not) to the platform, allow the device-tree to describe
> firmware's responses for the SMCCC interface. This could be used as
> the data source for the firmware interface, or be parsed by the OS if
> the firmware interface is missing.

What's to prevent vendors from only using the DT mechanism and never
supporting the SMCCC interface? I'm sure some will love to not have to
make a firmware update when they can just fix it in DT.

The downside to this extendable binding compared to simple properties
is parsing a flat tree is slow and more complicated. So it may be
difficult to support if you need this early in boot.

> Most errata can be detected from CPU id registers. These mechanisms
> are only needed for the rare cases that external knowledge is needed.

And also have significant performance impact. In the end, how many
platforms are there that can't fix these in firmware and need a
mainline/distro kernel optimized to avoid the work-around. I suspect
the number is small enough it could be a list in the kernel.

Rob
James Morse March 31, 2023, 4:58 p.m. UTC | #4
Hi Rob,

On 31/03/2023 14:46, Rob Herring wrote:
> On Thu, Mar 30, 2023 at 11:52 AM James Morse <james.morse@arm.com> wrote:
>> The Errata Management SMCCC interface allows firmware to advertise whether
>> the OS is affected by an erratum, or if a higher exception level has
>> mitigated the issue. This allows properties of the device that are not
>> discoverable by the OS to be described. e.g. some errata depend on the
>> behaviour of the interconnect, which is not visible to the OS.
>>
>> Deployed devices may find it significantly harder to update EL3
>> firmware than the device tree. Erratum workarounds typically have to
>> fail safe, and assume the platform is affected putting correctness
>> above performance.
> 
> Updating the DT is still harder than updating the kernel. A well
> designed binding allows for errata fixes without DT changes. That
> generally means specific compatibles up front rather than adding
> properties to turn things on/off.

I started with a per-erratum identifier, but there are 8 of them, and its hard to know
where to put it. The CPU side is detectable from the MIDR,its an interconnect property
that we need to know ... but the interconnect isn't described in the DT. (but the obvious
compatible string identifies the PMU)


> Do we really want to encourage/endorse/support non-updatable firmware?
> We've rejected plenty of things with 'fix your firmware'.

A DT property was explicitly requested by Marc Z on the RFC:
https://lore.kernel.org/linux-arm-kernel/86mt5dxxbc.wl-maz@kernel.org/

The concern is that platforms where the CPU is affected, but the issue is masked by the
interconnect will never bother with a firmware interface. The kernel can't know this, so
has to enable the workaround at the cost of performance.

Adding a DT property to describe the hardware state of the erratum avoids the need for
separate kernel builds to work around the missing firmware.


>> Instead of adding a device-tree entry for any CPU errata that is
>> relevant (or not) to the platform, allow the device-tree to describe
>> firmware's responses for the SMCCC interface. This could be used as
>> the data source for the firmware interface, or be parsed by the OS if
>> the firmware interface is missing.

> What's to prevent vendors from only using the DT mechanism and never
> supporting the SMCCC interface? I'm sure some will love to not have to
> make a firmware update when they can just fix it in DT.

The firmware interface has to exist for ACPI systems where EL3 can't influence the ACPI
tables, which typically get replaced if the part is OEM'd.

Ultimately all the interface is describing is a non-discoverable hardware property
relevant to an erratum. These are often configurations the silicon manufacturer chooses.
In this case its the behaviour of the interconnect.

If we didn't have to support ACPI systems, this stuff would only have been in the DT. With


> The downside to this extendable binding compared to simple properties
> is parsing a flat tree is slow and more complicated. So it may be
> difficult to support if you need this early in boot.

I do need this early in the boot, but I'm not worried about the delay as these tables
should be small.


>> Most errata can be detected from CPU id registers. These mechanisms
>> are only needed for the rare cases that external knowledge is needed.
> 
> And also have significant performance impact. In the end, how many
> platforms are there that can't fix these in firmware and need a
> mainline/distro kernel optimized to avoid the work-around. I suspect
> the number is small enough it could be a list in the kernel.

At a guess, its all mobile phones produced in the last 2 years that want to run a version
of Android that uses virtualisation. Cortex-A78 is affected, but I don't know when the
first products were shipped.



Thanks,

James
James Morse March 31, 2023, 4:58 p.m. UTC | #5
Hi Krzysztof

On 31/03/2023 09:29, Krzysztof Kozlowski wrote:
> On 30/03/2023 18:51, James Morse wrote:
>> The Errata Management SMCCC interface allows firmware to advertise whether
>> the OS is affected by an erratum, or if a higher exception level has
>> mitigated the issue. This allows properties of the device that are not
>> discoverable by the OS to be described. e.g. some errata depend on the
>> behaviour of the interconnect, which is not visible to the OS.
>>
>> Deployed devices may find it significantly harder to update EL3
>> firmware than the device tree. Erratum workarounds typically have to
>> fail safe, and assume the platform is affected putting correctness
>> above performance.
>>
>> Instead of adding a device-tree entry for any CPU errata that is
>> relevant (or not) to the platform, allow the device-tree to describe
>> firmware's responses for the SMCCC interface. This could be used as
>> the data source for the firmware interface, or be parsed by the OS if
>> the firmware interface is missing.
>>
>> Most errata can be detected from CPU id registers. These mechanisms
>> are only needed for the rare cases that external knowledge is needed.

>> diff --git a/Documentation/devicetree/bindings/firmware/arm,errata-management.yaml b/Documentation/devicetree/bindings/firmware/arm,errata-management.yaml
>> new file mode 100644
>> index 000000000000..9baeb3d35213
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/firmware/arm,errata-management.yaml
>> @@ -0,0 +1,77 @@
>> +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
>> +%YAML 1.2
>> +---
>> +$id: http://devicetree.org/schemas/firmware/arm,errata-management.yaml#
>> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> 
> Except missing testing...

After a couple of hours of testing this, I went blind and missed that it was still
complaining about spaces.


>> +
>> +title: Errata Management Firmware Interface
>> +
>> +maintainers:
>> +  - James Morse <james.morse@arm.com>
>> +
>> +description: |+
> 
> Do not need '|+'.
> 
>> +  The SMC-CC has an erratum discovery interface that allows the OS to discover
>> +  whether a particular CPU is affected by a specific erratum when the
>> +  configurations affected is only known by firmware. See the specification of
>> +  the same title on developer.arm.com, document DEN0100.
>> +  Provide the values that should be used by the interface, either to supplement
>> +  firmware, or override the values firmware provides.
> 
> Why? If you have the discovery interface, don't add stuff to the DT, but
> use that interface.

A DT property was explicitly requested by Marc Z on the RFC:
https://lore.kernel.org/linux-arm-kernel/86mt5dxxbc.wl-maz@kernel.org/

The concern is that platforms where the CPU is affected, but the issue is masked by the
interconnect will never bother with a firmware interface. The kernel can't know this, so
has to enable the workaround at the cost of performance.

Adding a DT property to describe the hardware state of the erratum avoids the need for
separate kernel builds to work around the missing firmware.


>> +      - const: arm,cpu-erratum-list
>> +
>> +  arm,erratum-affected:
>> +    description: Erratum numbers that this CPU is affected by.
> 
> Isn't this explicit from CPU compatible?

I'll drop it from the compatible. The concern is arm add erratum in other IP to this
interface, hence shoving 'cpu' in a few places to future proof it against changes in the spec.


>> +    $ref: /schemas/types.yaml#/definitions/uint32-array
>> +    minItems: 1

> What do the numbers mean?

The numbers are unique identifiers issued by the CPU designer to identify the part
affected and the erratum. See the cover-letter for links to arms documents for all these
CPUs, or Documentation/arm64/silicon-errata.txt for a table of those the kernel works around.


> maxItems?

If there are zero entries, the table can be omitted, hence minItems.

This is the first erratum workaround that needs to know non-discoverable CPU properties,
but there will be others.


Thanks,

James
Krzysztof Kozlowski April 3, 2023, 9:15 a.m. UTC | #6
On 31/03/2023 18:58, James Morse wrote:
> Hi Krzysztof
> 
> On 31/03/2023 09:29, Krzysztof Kozlowski wrote:
>> On 30/03/2023 18:51, James Morse wrote:
>>> The Errata Management SMCCC interface allows firmware to advertise whether
>>> the OS is affected by an erratum, or if a higher exception level has
>>> mitigated the issue. This allows properties of the device that are not
>>> discoverable by the OS to be described. e.g. some errata depend on the
>>> behaviour of the interconnect, which is not visible to the OS.
>>>
>>> Deployed devices may find it significantly harder to update EL3
>>> firmware than the device tree. Erratum workarounds typically have to
>>> fail safe, and assume the platform is affected putting correctness
>>> above performance.
>>>
>>> Instead of adding a device-tree entry for any CPU errata that is
>>> relevant (or not) to the platform, allow the device-tree to describe
>>> firmware's responses for the SMCCC interface. This could be used as
>>> the data source for the firmware interface, or be parsed by the OS if
>>> the firmware interface is missing.
>>>
>>> Most errata can be detected from CPU id registers. These mechanisms
>>> are only needed for the rare cases that external knowledge is needed.
> 
>>> diff --git a/Documentation/devicetree/bindings/firmware/arm,errata-management.yaml b/Documentation/devicetree/bindings/firmware/arm,errata-management.yaml
>>> new file mode 100644
>>> index 000000000000..9baeb3d35213
>>> --- /dev/null
>>> +++ b/Documentation/devicetree/bindings/firmware/arm,errata-management.yaml
>>> @@ -0,0 +1,77 @@
>>> +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
>>> +%YAML 1.2
>>> +---
>>> +$id: http://devicetree.org/schemas/firmware/arm,errata-management.yaml#
>>> +$schema: http://devicetree.org/meta-schemas/core.yaml#
>>
>> Except missing testing...
> 
> After a couple of hours of testing this, I went blind and missed that it was still
> complaining about spaces.
> 
> 
>>> +
>>> +title: Errata Management Firmware Interface
>>> +
>>> +maintainers:
>>> +  - James Morse <james.morse@arm.com>
>>> +
>>> +description: |+
>>
>> Do not need '|+'.
>>
>>> +  The SMC-CC has an erratum discovery interface that allows the OS to discover
>>> +  whether a particular CPU is affected by a specific erratum when the
>>> +  configurations affected is only known by firmware. See the specification of
>>> +  the same title on developer.arm.com, document DEN0100.
>>> +  Provide the values that should be used by the interface, either to supplement
>>> +  firmware, or override the values firmware provides.
>>
>> Why? If you have the discovery interface, don't add stuff to the DT, but
>> use that interface.
> 
> A DT property was explicitly requested by Marc Z on the RFC:
> https://lore.kernel.org/linux-arm-kernel/86mt5dxxbc.wl-maz@kernel.org/
> 
> The concern is that platforms where the CPU is affected, but the issue is masked by the
> interconnect will never bother with a firmware interface. The kernel can't know this, so
> has to enable the workaround at the cost of performance.

It would have to bother DT, so same problem... DT is not optimization
mechanism for SW decisions.

> 
> Adding a DT property to describe the hardware state of the erratum avoids the need for
> separate kernel builds to work around the missing firmware.

The purpose of DT is not to avoid separate kernel builds thus this is
not an argument whether property fits DT or it doesn't.

> 
> 
>>> +      - const: arm,cpu-erratum-list
>>> +
>>> +  arm,erratum-affected:
>>> +    description: Erratum numbers that this CPU is affected by.
>>
>> Isn't this explicit from CPU compatible?
> 
> I'll drop it from the compatible. The concern is arm add erratum in other IP to this
> interface, hence shoving 'cpu' in a few places to future proof it against changes in the spec.
> 
> 
>>> +    $ref: /schemas/types.yaml#/definitions/uint32-array
>>> +    minItems: 1
> 
>> What do the numbers mean?
> 
> The numbers are unique identifiers issued by the CPU designer to identify the part
> affected and the erratum. See the cover-letter for links to arms documents for all these
> CPUs, or Documentation/arm64/silicon-errata.txt for a table of those the kernel works around.

The answer should be in description, not in cover letter.

> 
> 
>> maxItems?
> 
> If there are zero entries, the table can be omitted, hence minItems.
> 
> This is the first erratum workaround that needs to know non-discoverable CPU properties,
> but there will be others.


Best regards,
Krzysztof
Marc Zyngier April 3, 2023, 12:05 p.m. UTC | #7
On Mon, 03 Apr 2023 10:15:19 +0100,
Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> wrote:
> 
> On 31/03/2023 18:58, James Morse wrote:
> > Hi Krzysztof
> > 
> > On 31/03/2023 09:29, Krzysztof Kozlowski wrote:
> >> On 30/03/2023 18:51, James Morse wrote:
> >>> The Errata Management SMCCC interface allows firmware to advertise whether
> >>> the OS is affected by an erratum, or if a higher exception level has
> >>> mitigated the issue. This allows properties of the device that are not
> >>> discoverable by the OS to be described. e.g. some errata depend on the
> >>> behaviour of the interconnect, which is not visible to the OS.
> >>>
> >>> Deployed devices may find it significantly harder to update EL3
> >>> firmware than the device tree. Erratum workarounds typically have to
> >>> fail safe, and assume the platform is affected putting correctness
> >>> above performance.
> >>>
> >>> Instead of adding a device-tree entry for any CPU errata that is
> >>> relevant (or not) to the platform, allow the device-tree to describe
> >>> firmware's responses for the SMCCC interface. This could be used as
> >>> the data source for the firmware interface, or be parsed by the OS if
> >>> the firmware interface is missing.
> >>>
> >>> Most errata can be detected from CPU id registers. These mechanisms
> >>> are only needed for the rare cases that external knowledge is needed.
> > 
> >>> diff --git a/Documentation/devicetree/bindings/firmware/arm,errata-management.yaml b/Documentation/devicetree/bindings/firmware/arm,errata-management.yaml
> >>> new file mode 100644
> >>> index 000000000000..9baeb3d35213
> >>> --- /dev/null
> >>> +++ b/Documentation/devicetree/bindings/firmware/arm,errata-management.yaml
> >>> @@ -0,0 +1,77 @@
> >>> +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
> >>> +%YAML 1.2
> >>> +---
> >>> +$id: http://devicetree.org/schemas/firmware/arm,errata-management.yaml#
> >>> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> >>
> >> Except missing testing...
> > 
> > After a couple of hours of testing this, I went blind and missed that it was still
> > complaining about spaces.
> > 
> > 
> >>> +
> >>> +title: Errata Management Firmware Interface
> >>> +
> >>> +maintainers:
> >>> +  - James Morse <james.morse@arm.com>
> >>> +
> >>> +description: |+
> >>
> >> Do not need '|+'.
> >>
> >>> +  The SMC-CC has an erratum discovery interface that allows the OS to discover
> >>> +  whether a particular CPU is affected by a specific erratum when the
> >>> +  configurations affected is only known by firmware. See the specification of
> >>> +  the same title on developer.arm.com, document DEN0100.
> >>> +  Provide the values that should be used by the interface, either to supplement
> >>> +  firmware, or override the values firmware provides.
> >>
> >> Why? If you have the discovery interface, don't add stuff to the DT, but
> >> use that interface.

If you read the cover letter, you'll notice that *nobody* implements
the discovery mechanism, and yet we still need a way to identify
broken systems.

> > 
> > A DT property was explicitly requested by Marc Z on the RFC:
> > https://lore.kernel.org/linux-arm-kernel/86mt5dxxbc.wl-maz@kernel.org/
> > 
> > The concern is that platforms where the CPU is affected, but the issue is masked by the
> > interconnect will never bother with a firmware interface. The kernel can't know this, so
> > has to enable the workaround at the cost of performance.
> 
> It would have to bother DT, so same problem... DT is not optimization
> mechanism for SW decisions.

What does SW has to do with this? This describes the state of the
HW. The HW is broken, SW has no way to discover it otherwise, so DT
*is* the place to put it.

In any case, it is far easier to update a DT that your EL3 firmware.

	M.
Rob Herring April 3, 2023, 3:45 p.m. UTC | #8
On Fri, Mar 31, 2023 at 11:59 AM James Morse <james.morse@arm.com> wrote:
>
> Hi Rob,
>
> On 31/03/2023 14:46, Rob Herring wrote:
> > On Thu, Mar 30, 2023 at 11:52 AM James Morse <james.morse@arm.com> wrote:
> >> The Errata Management SMCCC interface allows firmware to advertise whether
> >> the OS is affected by an erratum, or if a higher exception level has
> >> mitigated the issue. This allows properties of the device that are not
> >> discoverable by the OS to be described. e.g. some errata depend on the
> >> behaviour of the interconnect, which is not visible to the OS.
> >>
> >> Deployed devices may find it significantly harder to update EL3
> >> firmware than the device tree. Erratum workarounds typically have to
> >> fail safe, and assume the platform is affected putting correctness
> >> above performance.
> >
> > Updating the DT is still harder than updating the kernel. A well
> > designed binding allows for errata fixes without DT changes. That
> > generally means specific compatibles up front rather than adding
> > properties to turn things on/off.
>
> I started with a per-erratum identifier, but there are 8 of them, and its hard to know
> where to put it.

That's still requiring updating the DT to fix things.

> The CPU side is detectable from the MIDR,its an interconnect property
> that we need to know ... but the interconnect isn't described in the DT. (but the obvious
> compatible string identifies the PMU)

But the interconnect could be described. In fact, there's a binding
for such things already. Surprisingly, it's called 'interconnects'...
Of course, there are lots of interconnects in SoCs and the one you
need may not be described ('cause it is invisible to s/w (until it's
not)). There's a binding going back to the CCI-400 in fact. So it
isn't really that interconnects aren't described, it's that they
aren't consistently described.

If you can add this errata table to the DT, then why not add
describing the interconnect? Then it will be there for the next thing
we need the interconnect for. I imagine some of the interconnects are
already described if not explicitly in bits and pieces (i.e. clocks or
power domains for the interconnect get tossed into some other node).

> > Do we really want to encourage/endorse/support non-updatable firmware?
> > We've rejected plenty of things with 'fix your firmware'.
>
> A DT property was explicitly requested by Marc Z on the RFC:
> https://lore.kernel.org/linux-arm-kernel/86mt5dxxbc.wl-maz@kernel.org/
>
> The concern is that platforms where the CPU is affected, but the issue is masked by the
> interconnect will never bother with a firmware interface. The kernel can't know this, so
> has to enable the workaround at the cost of performance.

Sure it can. Worst case, the kernel knows the exact SoC and board it
is running on because we have root level compatibles for those. It's
just a question of whether using those would scale or not. Whether
that scales or not depends on both how long the lists are and how
distributed the implementation is (e.g. PCI quirks). More on that
below.

> Adding a DT property to describe the hardware state of the erratum avoids the need for
> separate kernel builds to work around the missing firmware.

DT is not the kernel's runtime configuration mechanism. That assumes a
tight coupling of the DT and kernel. That may work for some cases like
Android with kernel and DT updated together, but for upstream we can't
assume that coupling and must treat it as an ABI (not an issue for
this case).

The kernel command line is a runtime config mechanism for the kernel.
So why not use only that? There's certainly some room for improvement
with handling it. For example, it's not well defined with what happens
with 'bootargs' as you go from a dtb to bootloader to kernel in terms
of overriding/prepending/appending, but that could be addressed.

> >> Instead of adding a device-tree entry for any CPU errata that is
> >> relevant (or not) to the platform, allow the device-tree to describe
> >> firmware's responses for the SMCCC interface. This could be used as
> >> the data source for the firmware interface, or be parsed by the OS if
> >> the firmware interface is missing.
>
> > What's to prevent vendors from only using the DT mechanism and never
> > supporting the SMCCC interface? I'm sure some will love to not have to
> > make a firmware update when they can just fix it in DT.
>
> The firmware interface has to exist for ACPI systems where EL3 can't influence the ACPI
> tables, which typically get replaced if the part is OEM'd.
>
> Ultimately all the interface is describing is a non-discoverable hardware property
> relevant to an erratum. These are often configurations the silicon manufacturer chooses.
> In this case its the behaviour of the interconnect.
>
> If we didn't have to support ACPI systems, this stuff would only have been in the DT. With

With...?

I fail to see what ACPI has to do with DT platforms adopting the SMCCC
interface or not.

> > The downside to this extendable binding compared to simple properties
> > is parsing a flat tree is slow and more complicated. So it may be
> > difficult to support if you need this early in boot.
>
> I do need this early in the boot, but I'm not worried about the delay as these tables
> should be small.
>
>
> >> Most errata can be detected from CPU id registers. These mechanisms
> >> are only needed for the rare cases that external knowledge is needed.
> >
> > And also have significant performance impact. In the end, how many
> > platforms are there that can't fix these in firmware and need a
> > mainline/distro kernel optimized to avoid the work-around. I suspect
> > the number is small enough it could be a list in the kernel.
>
> At a guess, its all mobile phones produced in the last 2 years that want to run a version
> of Android that uses virtualisation. Cortex-A78 is affected, but I don't know when the
> first products were shipped.

How many run mainline and run it well enough to even care about the
optimization yet?

Even if we went with the above list, that's 2 years x 4 vendors (QCom,
Mediatek, Samsung, Google) x 1-2 Soc (high and mid tier). Subtract out
any vendors capable of updating their firmware. So a worst case list
of potentially 8-16 SoCs? It shouldn't grow because anything newer is
going to implement the SMCCC interface, right? That's not unmanageable
in my book.

Rob
James Morse April 4, 2023, 3:19 p.m. UTC | #9
Hi Rob,

On 03/04/2023 16:45, Rob Herring wrote:
> On Fri, Mar 31, 2023 at 11:59 AM James Morse <james.morse@arm.com> wrote:
>> On 31/03/2023 14:46, Rob Herring wrote:
>>> On Thu, Mar 30, 2023 at 11:52 AM James Morse <james.morse@arm.com> wrote:
>>>> The Errata Management SMCCC interface allows firmware to advertise whether
>>>> the OS is affected by an erratum, or if a higher exception level has
>>>> mitigated the issue. This allows properties of the device that are not
>>>> discoverable by the OS to be described. e.g. some errata depend on the
>>>> behaviour of the interconnect, which is not visible to the OS.
>>>>
>>>> Deployed devices may find it significantly harder to update EL3
>>>> firmware than the device tree. Erratum workarounds typically have to
>>>> fail safe, and assume the platform is affected putting correctness
>>>> above performance.
>>>
>>> Updating the DT is still harder than updating the kernel. A well
>>> designed binding allows for errata fixes without DT changes. That
>>> generally means specific compatibles up front rather than adding
>>> properties to turn things on/off.
>>
>> I started with a per-erratum identifier, but there are 8 of them, and its hard to know
>> where to put it.

> That's still requiring updating the DT to fix things.

.... this discussion is about how to add this stuff to DT ...


>> The CPU side is detectable from the MIDR,its an interconnect property
>> that we need to know ... but the interconnect isn't described in the DT. (but the obvious
>> compatible string identifies the PMU)
> 
> But the interconnect could be described. In fact, there's a binding
> for such things already. Surprisingly, it's called 'interconnects'...
> Of course, there are lots of interconnects in SoCs and the one you
> need may not be described ('cause it is invisible to s/w (until it's
> not)). There's a binding going back to the CCI-400 in fact. So it
> isn't really that interconnects aren't described, it's that they
> aren't consistently described.

This is the CPU interconnect, but it already has a binding:
Documentation/devicetree/bindings/perf/arm,cmn.yaml

This describes the PMU, which is the only bit of that IP that is interesting to the OS.

I can search for the "arm,cmn-600", "arm,cmn-650", "arm,cmn-700" compatibles, but this is
a list that would have to be added to each time arm, or someone else, create a new
interconnect. It is more manageable than listing the affected platforms.

But if a SoC is not configured to allow normal-world accesses to the interconnect PMU, the
existing binding is of no use.


Should I be looking at adding a compatible "arm,interconnect" to the existing PMU binding?
Presumably I'd need an empty node ... somewhere ... with the "arm,interconnect" property
if the platform doesn't have a usable PMU.


Alternatively I could add a 'non_cacheable_fetch_hits_cacheable_data' property to these
nodes, but it doesn't feel right to search all the nodes for the property. The position in
the tree doesn't help as the CPUs live in their own special node, not in the CPU
interconnect node.

(I asked about a top-level property, and was told its not 2012 anymore!)


> If you can add this errata table to the DT, then why not add
> describing the interconnect? Then it will be there for the next thing
> we need the interconnect for. I imagine some of the interconnects are
> already described if not explicitly in bits and pieces (i.e. clocks or
> power domains for the interconnect get tossed into some other node).

Another example in this space might help:
Cortex-A510 has an erratum #2658417 "BFMMLA or VMMLA instructions might produce incorrect
result"

The configurations affected if the shared VPU datapath (whatever that is) between a pair
of cores is 2x64. Which cores share what doesn't fit with DT's model of CPUs.

It's a darn sight easier to ask "are you affected by #2658417".

Today, the workaround for that erratum nobbles the feature on all CPUs, because the OS
can't know. With either the firmware interface, or this description of it, BF16 can be
re-enabled on those parts that aren't affected.

Improving this falls in to the same trap of people not updating EL3 firmware when they
aren't affected by an erratum.


Another example in the same document is #2022138 "Core in FULL_RET might not have clock
gated" ... fortunately EL3 implements this one, as I've no idea what "if the FULL_RET
power mode has been implemented for logic retention" means.

My point is: these things are nasty. Using the erratum number to identify the conditions
and the workaround gives us one unambiguous source of truth.


>>> Do we really want to encourage/endorse/support non-updatable firmware?
>>> We've rejected plenty of things with 'fix your firmware'.
>>
>> A DT property was explicitly requested by Marc Z on the RFC:
>> https://lore.kernel.org/linux-arm-kernel/86mt5dxxbc.wl-maz@kernel.org/
>>
>> The concern is that platforms where the CPU is affected, but the issue is masked by the
>> interconnect will never bother with a firmware interface. The kernel can't know this, so
>> has to enable the workaround at the cost of performance.
> 
> Sure it can. Worst case, the kernel knows the exact SoC and board it
> is running on because we have root level compatibles for those. It's
> just a question of whether using those would scale or not. Whether
> that scales or not depends on both how long the lists are and how
> distributed the implementation is (e.g. PCI quirks). More on that
> below.

My stab in the dark at the length of the list is: every variant of every phone shipped in
the last two years, and every variant using these CPUs for a few more years. We'd never
have a complete list. Moving this to firmware (be that DT or EL3) is the only way to make
this scale.


>> Adding a DT property to describe the hardware state of the erratum avoids the need for
>> separate kernel builds to work around the missing firmware.

> DT is not the kernel's runtime configuration mechanism. That assumes a
> tight coupling of the DT and kernel. That may work for some cases like
> Android with kernel and DT updated together, but for upstream we can't
> assume that coupling and must treat it as an ABI (not an issue for
> this case).

> The kernel command line is a runtime config mechanism for the kernel.
> So why not use only that?

As implemented in patch 6.


> There's certainly some room for improvement
> with handling it. For example, it's not well defined with what happens
> with 'bootargs' as you go from a dtb to bootloader to kernel in terms
> of overriding/prepending/appending, but that could be addressed.

Command-line arguments are horrific for the distributions to work with. Some tool would
need the list of affected parts to insert the argument, we've just moved the problem.

This isn't policy, so the command-line is the wrong tool for the job. Patch 6 is only
included to give people stuck with an affected platform something they can do.


>>>> Instead of adding a device-tree entry for any CPU errata that is
>>>> relevant (or not) to the platform, allow the device-tree to describe
>>>> firmware's responses for the SMCCC interface. This could be used as
>>>> the data source for the firmware interface, or be parsed by the OS if
>>>> the firmware interface is missing.
>>
>>> What's to prevent vendors from only using the DT mechanism and never
>>> supporting the SMCCC interface? I'm sure some will love to not have to
>>> make a firmware update when they can just fix it in DT.
>>
>> The firmware interface has to exist for ACPI systems where EL3 can't influence the ACPI
>> tables, which typically get replaced if the part is OEM'd.
>>
>> Ultimately all the interface is describing is a non-discoverable hardware property
>> relevant to an erratum. These are often configurations the silicon manufacturer chooses.
>> In this case its the behaviour of the interconnect.
>>
>> If we didn't have to support ACPI systems, this stuff would only have been in the DT. With

> With...?

Looks like I got interrupted when writing this.


> I fail to see what ACPI has to do with DT platforms adopting the SMCCC
> interface or not.

My point was the SMCCC interface has to exist because of ACPI systems. Without them, the
available tools to solve the problem would be different.

I don't see a difference between using SMCCC as a source of this information, or the
firmware node of the DT. I have no problem if the manufacturers of DT systems never
implement the SMCCC. The EL3 firmware already contains these tables, including them in the
firmware node of the DT is just making the information visible.


>>> The downside to this extendable binding compared to simple properties
>>> is parsing a flat tree is slow and more complicated. So it may be
>>> difficult to support if you need this early in boot.
>>
>> I do need this early in the boot, but I'm not worried about the delay as these tables
>> should be small.
>>
>>
>>>> Most errata can be detected from CPU id registers. These mechanisms
>>>> are only needed for the rare cases that external knowledge is needed.
>>>
>>> And also have significant performance impact. In the end, how many
>>> platforms are there that can't fix these in firmware and need a
>>> mainline/distro kernel optimized to avoid the work-around. I suspect
>>> the number is small enough it could be a list in the kernel.
>>
>> At a guess, its all mobile phones produced in the last 2 years that want to run a version
>> of Android that uses virtualisation. Cortex-A78 is affected, but I don't know when the
>> first products were shipped.

> How many run mainline and run it well enough to even care about the
> optimization yet?

By the optimization I assume you mean FWB?

This happens when the VM takes a stage2 translation fault because the host's mm code moved
the memory. (due to THP, KSM, compaction, etc). The guest vCPU is blocked until the cache
maintenance completes. The feature was added because of the overhead for memory intensive
workloads.

Affected platforms will be stuck with this, but I anticipate they are in the minority.


> Even if we went with the above list, that's 2 years x 4 vendors (QCom,
> Mediatek, Samsung, Google) x 1-2 Soc (high and mid tier).

If this is the top-level DT machine compatible, wouldn't you see Xiaomi, Oppo, Huawei etc?
Wikipedia lists 175 brands.

If this is SoC-ID ... I've no idea how to start collecting those values.


> Subtract out
> any vendors capable of updating their firmware. So a worst case list
> of potentially 8-16 SoCs? It shouldn't grow because anything newer is
> going to implement the SMCCC interface, right? That's not unmanageable
> in my book.

If we wanted a short list, we could just list the affected platforms.

But I don't think the scaleability or manageability is about the length of the list, its
about knowing you've got the complete list. I've seen VMs running on network switches, I
wouldn't be surprised if there are DT systems in this space.

Pushing this back into the firmware description moves this problem to someone much closer
to the device, who is able to get the SoC manufacturer to answer questions.

Doing it in terms of the erratum number means the question is unambiguous.


Thanks,

James
diff mbox series

Patch

diff --git a/Documentation/devicetree/bindings/arm/cpus.yaml b/Documentation/devicetree/bindings/arm/cpus.yaml
index c145f6a035ee..47b12761f305 100644
--- a/Documentation/devicetree/bindings/arm/cpus.yaml
+++ b/Documentation/devicetree/bindings/arm/cpus.yaml
@@ -257,6 +257,11 @@  properties:
       List of phandles to idle state nodes supported
       by this cpu (see ./idle-states.yaml).
 
+  arm,erratum-list:
+    $ref: '/schemas/types.yaml#/definitions/phandle'
+    description:
+      Specifies the firmware cpu-erratum-list node associated with this CPU.
+
   capacity-dmips-mhz:
     description:
       u32 value representing CPU capacity (see ../cpu/cpu-capacity.txt) in
diff --git a/Documentation/devicetree/bindings/firmware/arm,errata-management.yaml b/Documentation/devicetree/bindings/firmware/arm,errata-management.yaml
new file mode 100644
index 000000000000..9baeb3d35213
--- /dev/null
+++ b/Documentation/devicetree/bindings/firmware/arm,errata-management.yaml
@@ -0,0 +1,77 @@ 
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/firmware/arm,errata-management.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Errata Management Firmware Interface
+
+maintainers:
+  - James Morse <james.morse@arm.com>
+
+description: |+
+  The SMC-CC has an erratum discovery interface that allows the OS to discover
+  whether a particular CPU is affected by a specific erratum when the
+  configurations affected is only known by firmware. See the specification of
+  the same title on developer.arm.com, document DEN0100.
+  Provide the values that should be used by the interface, either to supplement
+  firmware, or override the values firmware provides.
+  Most errata can be detected from CPU id registers. These mechanisms are only
+  needed for the rare cases that external knowledge is needed.
+  The CPU node should hold a phandle that points to the cpu-erratum-list node.
+
+properties:
+  compatible:
+    items:
+      - const: arm,cpu-erratum-list
+
+  arm,erratum-affected:
+    description: Erratum numbers that this CPU is affected by.
+    $ref: /schemas/types.yaml#/definitions/uint32-array
+    minItems: 1
+
+  arm,erratum-not-affected:
+    description: Erratum numbers that this CPU is not affected by.
+    $ref: /schemas/types.yaml#/definitions/uint32-array
+    minItems: 1
+
+  arm,erratum-higher-el-mitigation:
+    description: Erratum numbers that have been mitigated by a higher level
+      of firmware
+    $ref: /schemas/types.yaml#/definitions/uint32-array
+    minItems: 1
+
+required:
+  - compatible
+anyOf:
+  - required:
+    - 'arm,erratum-affected'
+  - required:
+    - 'arm,erratum-not-affected'
+  - required:
+    - 'arm,erratum-higher-el-mitigation'
+
+additionalProperties: false
+
+examples:
+  - |
+    firmware {
+      CL1_ERRATA: cluster1-errata {
+          compatible = "arm,cpu-erratum-list";
+          arm,erratum-not-affected = <2701952>;
+      };
+    };
+
+    cpus {
+      #size-cells = <0>;
+      #address-cells = <1>;
+
+      cpu@0 {
+        device_type = "cpu";
+        compatible = "arm,cortex-x2";
+        reg = <0x0>;
+        arm,erratum-list = <&CL1_ERRATA>;
+      };
+    };
+
+...