mbox series

[0/3] arm64: errata: remove BF16 HWCAP due to incorrect result on Cortex-A510

Message ID 20220909165938.3931307-1-james.morse@arm.com (mailing list archive)
Headers show
Series arm64: errata: remove BF16 HWCAP due to incorrect result on Cortex-A510 | expand

Message

James Morse Sept. 9, 2022, 4:59 p.m. UTC
Hello!

Cortex-A510 has an erratum where the BFMMLA or VMMLA instructions might
produce the incorrect result. This only happens if two Cortex-A510 CPUs
are configured by the SoC manafacturer in a particular way to share some
hardware, and are using it at the same time. It isn't possible for linux
to know how the CPUs were configured, so the only option is to disable
the BF16 feature for all Cortex-A510 CPUs.

This won't stop user-space from trying to use the instructions to see
if they work - but such software is already broken on big/little systems
with mismathed features.

Removing the BF16 feature involves removing both the HWCAP, as was done
by commit 44b3834b2eed "arm64: errata: Remove AES hwcap for COMPAT tasks",
and the emulated view of that register that user-space has.
(This wasn't previously needed as aarch32 ID registers aren't accessible
like this).

As there are now two of these things, this series tries to add a more
maintainable way to remove features, to avoid spilling workaround like
this into cpufeature.c.

Instead, cpu_errata.c modifies the user_mask that is used for emulation
of the id registers, and cpufeature.c builds the HWCAPs based on this.

I have patches to convert the AES workaround to do the same, but that
should wait until the aarch32 ID registers are generated by the sysreg
awk scripts (it needs some new masks defined).

[0] https://developer.arm.com/documentation/SDEN1873351/1400/?lang=en

Thanks,

James Morse (3):
  arm64: cpufeature: Force HWCAP to be based on the sysreg visible to
    user-space
  arm64: cpufeature: Expose get_arm64_ftr_reg() outside cpufeature.c
  arm64: errata: remove BF16 HWCAP due to incorrect result on
    Cortex-A510

 Documentation/arm64/silicon-errata.rst |  2 ++
 arch/arm64/Kconfig                     | 14 +++++++++
 arch/arm64/include/asm/cpufeature.h    |  2 ++
 arch/arm64/kernel/cpu_errata.c         | 27 ++++++++++++++++
 arch/arm64/kernel/cpufeature.c         | 43 ++++++++++++++++++++------
 arch/arm64/tools/cpucaps               |  1 +
 6 files changed, 79 insertions(+), 10 deletions(-)

Comments

Robin Murphy Sept. 9, 2022, 6:29 p.m. UTC | #1
On 2022-09-09 17:59, James Morse wrote:
> Hello!
> 
> Cortex-A510 has an erratum where the BFMMLA or VMMLA instructions might
> produce the incorrect result. This only happens if two Cortex-A510 CPUs
> are configured by the SoC manafacturer in a particular way to share some
> hardware, and are using it at the same time. It isn't possible for linux
> to know how the CPUs were configured, so the only option is to disable
> the BF16 feature for all Cortex-A510 CPUs.

Hmm, the TRM doesn't seem to describe any obvious restrictions on 
accessing IMP_CPUCFR_EL1 - is there some other nefarious secret at play 
there?

Robin.

> This won't stop user-space from trying to use the instructions to see
> if they work - but such software is already broken on big/little systems
> with mismathed features.
> 
> Removing the BF16 feature involves removing both the HWCAP, as was done
> by commit 44b3834b2eed "arm64: errata: Remove AES hwcap for COMPAT tasks",
> and the emulated view of that register that user-space has.
> (This wasn't previously needed as aarch32 ID registers aren't accessible
> like this).
> 
> As there are now two of these things, this series tries to add a more
> maintainable way to remove features, to avoid spilling workaround like
> this into cpufeature.c.
> 
> Instead, cpu_errata.c modifies the user_mask that is used for emulation
> of the id registers, and cpufeature.c builds the HWCAPs based on this.
> 
> I have patches to convert the AES workaround to do the same, but that
> should wait until the aarch32 ID registers are generated by the sysreg
> awk scripts (it needs some new masks defined).
> 
> [0] https://developer.arm.com/documentation/SDEN1873351/1400/?lang=en
> 
> Thanks,
> 
> James Morse (3):
>    arm64: cpufeature: Force HWCAP to be based on the sysreg visible to
>      user-space
>    arm64: cpufeature: Expose get_arm64_ftr_reg() outside cpufeature.c
>    arm64: errata: remove BF16 HWCAP due to incorrect result on
>      Cortex-A510
> 
>   Documentation/arm64/silicon-errata.rst |  2 ++
>   arch/arm64/Kconfig                     | 14 +++++++++
>   arch/arm64/include/asm/cpufeature.h    |  2 ++
>   arch/arm64/kernel/cpu_errata.c         | 27 ++++++++++++++++
>   arch/arm64/kernel/cpufeature.c         | 43 ++++++++++++++++++++------
>   arch/arm64/tools/cpucaps               |  1 +
>   6 files changed, 79 insertions(+), 10 deletions(-)
>
James Morse Sept. 15, 2022, 1:39 p.m. UTC | #2
Hi Robin,

On 09/09/2022 19:29, Robin Murphy wrote:
> On 2022-09-09 17:59, James Morse wrote:
>> Cortex-A510 has an erratum where the BFMMLA or VMMLA instructions might
>> produce the incorrect result. This only happens if two Cortex-A510 CPUs
>> are configured by the SoC manafacturer in a particular way to share some
>> hardware, and are using it at the same time. It isn't possible for linux
>> to know how the CPUs were configured, so the only option is to disable
>> the BF16 feature for all Cortex-A510 CPUs.

> Hmm, the TRM doesn't seem to describe any obvious restrictions on accessing IMP_CPUCFR_EL1
> - is there some other nefarious secret at play there?

HCR_EL2.TIDCP.

KVM traps all IMP_* registers from EL1, and has no ability to emulate them.
For an erratum workaround at EL1, touching an imp-def register is the wrong thing to do.

We could add some discovery API with firmware to determine if the part is affected, but I
think that is overkill for this. I'm pretty sure affected Cortex-A510 silicon is only in
mobile phones, which can have the Kconfig option disabled if the target platform isn't
affected. (Even with GKI, I assume things don't run one kernel image in the same way
distros do)


Thanks,

James
Robin Murphy Sept. 15, 2022, 2:21 p.m. UTC | #3
On 2022-09-15 14:39, James Morse wrote:
> Hi Robin,
> 
> On 09/09/2022 19:29, Robin Murphy wrote:
>> On 2022-09-09 17:59, James Morse wrote:
>>> Cortex-A510 has an erratum where the BFMMLA or VMMLA instructions might
>>> produce the incorrect result. This only happens if two Cortex-A510 CPUs
>>> are configured by the SoC manafacturer in a particular way to share some
>>> hardware, and are using it at the same time. It isn't possible for linux
>>> to know how the CPUs were configured, so the only option is to disable
>>> the BF16 feature for all Cortex-A510 CPUs.
> 
>> Hmm, the TRM doesn't seem to describe any obvious restrictions on accessing IMP_CPUCFR_EL1
>> - is there some other nefarious secret at play there?
> 
> HCR_EL2.TIDCP.

Aha, indeed I should have looked more closely at the pseudocode.

> KVM traps all IMP_* registers from EL1, and has no ability to emulate them.
> For an erratum workaround at EL1, touching an imp-def register is the wrong thing to do.

Well, there's no real reason we couldn't make KVM handle this case, but 
that doesn't help if we're booted at EL1 under some other hypervisor, so 
in general I concur.

> We could add some discovery API with firmware to determine if the part is affected, but I
> think that is overkill for this. I'm pretty sure affected Cortex-A510 silicon is only in
> mobile phones, which can have the Kconfig option disabled if the target platform isn't
> affected. (Even with GKI, I assume things don't run one kernel image in the same way
> distros do)

I suppose it wouldn't take *too* much to check this in init_el2 and 
stash a "no actually it's fine" flag for later, but whether even that's 
worth it depends on how big the impact of losing BF16 on unaffected 
configurations actually is (which I have no idea about). If someone does 
care then that can always be added later, so that's good enough for me!

Cheers,
Robin.
Catalin Marinas Sept. 16, 2022, 5:44 p.m. UTC | #4
On Fri, 9 Sep 2022 17:59:35 +0100, James Morse wrote:
> Cortex-A510 has an erratum where the BFMMLA or VMMLA instructions might
> produce the incorrect result. This only happens if two Cortex-A510 CPUs
> are configured by the SoC manafacturer in a particular way to share some
> hardware, and are using it at the same time. It isn't possible for linux
> to know how the CPUs were configured, so the only option is to disable
> the BF16 feature for all Cortex-A510 CPUs.
> 
> [...]

Applied to arm64 (for-next/a510-erratum-2658417), thanks!

(I updated the last patch following Suzuki's comments)

[1/3] arm64: cpufeature: Force HWCAP to be based on the sysreg visible to user-space
      https://git.kernel.org/arm64/c/237405ebef58
[2/3] arm64: cpufeature: Expose get_arm64_ftr_reg() outside cpufeature.c
      https://git.kernel.org/arm64/c/445c953e4a84
[3/3] arm64: errata: remove BF16 HWCAP due to incorrect result on Cortex-A510
      https://git.kernel.org/arm64/c/1bdb0fbb2e27