mbox series

[RFC,0/3] arm64: errata: Disable FWB on parts with non-ARM interconnects

Message ID 20230216182201.1705406-1-james.morse@arm.com (mailing list archive)
Headers show
Series arm64: errata: Disable FWB on parts with non-ARM interconnects | expand

Message

James Morse Feb. 16, 2023, 6:21 p.m. UTC
Hello!

When stage1 translation is disabled, the SCTRL_E1.I bit controls the
attributes used for instruction fetch, one of the options results in a
non-cacheable access. A whole host of CPUs missed the FWB override
in this case, meaning a KVM guest could fetch stale/junk data instead of
instructions.

The workaround is to disable FWB, and do the required cache maintenance
instead.

The good news is, this isn't a problem for systems using Arm's
interconnect IP. The bad news is: linux can't know this. Arm knows of
at least one platform that is affected by this erratum.


This series adds support for the 'Errata Management Firmware Interface', [0]
and queries that to determine if the CPU is affected or not.

Unfortunately, no-one has firmware that supports this new interface yet,
and the least surprising thing to do is to enable the workaround by default,
meaning FWB is disabled on all these cores, even for unaffected platforms.
Platforms that are not-affected can either take a firmware-update to support
the interface, or if the kernel they run will only run on hardware that is
unaffected, disable the workaround at build time.

The trusted firmware series to implement the interface has not yet been
posted. I'll include a link once it is.

This series is an RFC as I anticipate a wider discussion around how we add
workaround that depend on firmware for detection.

The SDEN documents that describe this are:
Cortex-A78:
https://developer.arm.com/documentation/SDEN1401784/1800/?lang=en
Cortex-A78C:
https://developer.arm.com/documentation/SDEN1707916/1300/?lang=en
https://developer.arm.com/documentation/SDEN2004089/0700/?lang=en
(yes, there are two!)
Cortex-A710:
https://developer.arm.com/documentation/SDEN1775101/1500/?lang=en
Cortex-X1:
https://developer.arm.com/documentation/SDEN1401782/1800/?lang=en
Cortex-X2:
https://developer.arm.com/documentation/SDEN1775100/1500/?lang=en
Cortex-X3:
https://developer.arm.com/documentation/SDEN2055130/1000/?lang=en
Cortex-V1:
https://developer.arm.com/documentation/SDEN1401781/1600/?lang=en
Cortex-V2:
https://developer.arm.com/documentation/SDEN2332927/0500/?lang=en
Cortex-N2:
https://developer.arm.com/documentation/SDEN1982442/1200/?lang=en

Thanks,

James

[0] https://developer.arm.com/documentation/den0100/1-0/?lang=en

James Morse (3):
  firmware: smccc: Add support for erratum discovery API
  arm64: cputype: Add new part numbers for Cortex-X3, and Neoverse-V2
  arm64: errata: Disable FWB on parts with non-ARM interconnects

 Documentation/arm64/silicon-errata.rst | 18 ++++++
 arch/arm64/Kconfig                     | 27 +++++++++
 arch/arm64/include/asm/cputype.h       |  4 ++
 arch/arm64/kernel/cpufeature.c         | 77 ++++++++++++++++++++++++-
 drivers/firmware/smccc/Kconfig         |  8 +++
 drivers/firmware/smccc/Makefile        |  1 +
 drivers/firmware/smccc/em.c            | 78 ++++++++++++++++++++++++++
 include/linux/arm-smccc.h              | 28 +++++++++
 8 files changed, 240 insertions(+), 1 deletion(-)
 create mode 100644 drivers/firmware/smccc/em.c

Comments

Oliver Upton Feb. 16, 2023, 6:52 p.m. UTC | #1
Hi James,

On Thu, Feb 16, 2023 at 06:21:58PM +0000, James Morse wrote:
> Hello!
> 
> When stage1 translation is disabled, the SCTRL_E1.I bit controls the
> attributes used for instruction fetch, one of the options results in a
> non-cacheable access. A whole host of CPUs missed the FWB override
> in this case, meaning a KVM guest could fetch stale/junk data instead of
> instructions.
> 
> The workaround is to disable FWB, and do the required cache maintenance
> instead.
> 
> The good news is, this isn't a problem for systems using Arm's
> interconnect IP. The bad news is: linux can't know this. Arm knows of
> at least one platform that is affected by this erratum.
> 
> 
> This series adds support for the 'Errata Management Firmware Interface', [0]
> and queries that to determine if the CPU is affected or not.
> 
> Unfortunately, no-one has firmware that supports this new interface yet,
> and the least surprising thing to do is to enable the workaround by default,
> meaning FWB is disabled on all these cores, even for unaffected platforms.
> Platforms that are not-affected can either take a firmware-update to support
> the interface, or if the kernel they run will only run on hardware that is
> unaffected, disable the workaround at build time.

Wait, what? Is there a legitimate concern that affected systems are in
the wild today, or is there enough time for affected platforms to go and
implement the necessary firmware interface? Requiring correctly
implemented systems to explicitly opt-out seems like quite a lot more
work (w/ low likelihood) than having the one known platform go about
this the right way.

I'm rather troubled by the idea of enabling this by default on systems
that use these cores unless there really is no opportunity to
course-correct.
Ard Biesheuvel Feb. 21, 2023, 2:38 p.m. UTC | #2
On Thu, 16 Feb 2023 at 19:23, James Morse <james.morse@arm.com> wrote:
>
> Hello!
>
> When stage1 translation is disabled, the SCTRL_E1.I bit controls the
> attributes used for instruction fetch, one of the options results in a
> non-cacheable access. A whole host of CPUs missed the FWB override
> in this case, meaning a KVM guest could fetch stale/junk data instead of
> instructions.
>
> The workaround is to disable FWB, and do the required cache maintenance
> instead.
>

So the system should behave as if SCTLR_EL1.I==1 when FWB is enabled,
but it doesn't, right? Couldn't we just force SCTLR_EL1.I to 1 when
FWB is enabled? I.e., trap writes and override the I bit - and if we
want to pretend it is 0 we could trap reads and lie to the guest as
well, but I doubt we'd even need that.
James Morse Feb. 21, 2023, 5:41 p.m. UTC | #3
Hi Oliver,

On 16/02/2023 18:52, Oliver Upton wrote:
> On Thu, Feb 16, 2023 at 06:21:58PM +0000, James Morse wrote:
>> When stage1 translation is disabled, the SCTRL_E1.I bit controls the
>> attributes used for instruction fetch, one of the options results in a
>> non-cacheable access. A whole host of CPUs missed the FWB override
>> in this case, meaning a KVM guest could fetch stale/junk data instead of
>> instructions.
>>
>> The workaround is to disable FWB, and do the required cache maintenance
>> instead.
>>
>> The good news is, this isn't a problem for systems using Arm's
>> interconnect IP. The bad news is: linux can't know this. Arm knows of
>> at least one platform that is affected by this erratum.
>>
>>
>> This series adds support for the 'Errata Management Firmware Interface', [0]
>> and queries that to determine if the CPU is affected or not.
>>
>> Unfortunately, no-one has firmware that supports this new interface yet,
>> and the least surprising thing to do is to enable the workaround by default,
>> meaning FWB is disabled on all these cores, even for unaffected platforms.
>> Platforms that are not-affected can either take a firmware-update to support
>> the interface, or if the kernel they run will only run on hardware that is
>> unaffected, disable the workaround at build time.

> Wait, what? Is there a legitimate concern that affected systems are in
> the wild today, or is there enough time for affected platforms to go and
> implement the necessary firmware interface?

The one platform that arm is aware of isn't shipping yet - I assume it will implement the
firmware interface.

But I don't think arm always know what it is people are building ... it certainly doesn't
reach me. This affects a whole host of CPUs, I wouldn't be surprised if there is an
existing part out there that is affected.


> Requiring correctly
> implemented systems to explicitly opt-out seems like quite a lot more
> work (w/ low likelihood) than having the one known platform go about
> this the right way.

Sure, but its safe by default.


> I'm rather troubled by the idea of enabling this by default on systems
> that use these cores unless there really is no opportunity to
> course-correct.

It's the choice between correctness and performance. Probability says unless the CPU is
Neoverse-V2 (which is that one platform), you're not affected. But how much does
correctness matter? I'd hate to have to debug "1 in a 100 times the guest doesn't boot".


Thanks,

James
James Morse Feb. 21, 2023, 5:41 p.m. UTC | #4
Hi Ard,

On 21/02/2023 14:38, Ard Biesheuvel wrote:
> On Thu, 16 Feb 2023 at 19:23, James Morse <james.morse@arm.com> wrote:
>> When stage1 translation is disabled, the SCTRL_E1.I bit controls the
>> attributes used for instruction fetch, one of the options results in a
>> non-cacheable access. A whole host of CPUs missed the FWB override
>> in this case, meaning a KVM guest could fetch stale/junk data instead of
>> instructions.
>>
>> The workaround is to disable FWB, and do the required cache maintenance
>> instead.

> So the system should behave as if SCTLR_EL1.I==1 when FWB is enabled,
> but it doesn't, right? Couldn't we just force SCTLR_EL1.I to 1 when
> FWB is enabled? I.e., trap writes and override the I bit - and if we
> want to pretend it is 0 we could trap reads and lie to the guest as
> well, but I doubt we'd even need that.

The affected parts don't have fine-grained traps, so we'd need to set HCR_EL2.TVM, which
traps loads of things. We'd only need it while the guest has the MMU disabled, and KVM
already has code that uses this trap to try and spot this ...

... but it only works until the first time you enable SCTRL_EL1.M as the trap is too
costly to leave enabled. If you put the workaround in there, it would work the first time
a guest booted, but a subsequent kexec, or any other reason to turn the MMU off is
exposed. Its an incomplete fix, I'd hate to have to debug it!


Thanks,

James
Oliver Upton Feb. 24, 2023, 7 p.m. UTC | #5
James,

I realize I didn't send my reply earlier this week and came back to this
when looking for your reply on another thread.

Sorry about that.

On Tue, Feb 21, 2023 at 05:41:35PM +0000, James Morse wrote:

[...]

> > Wait, what? Is there a legitimate concern that affected systems are in
> > the wild today, or is there enough time for affected platforms to go and
> > implement the necessary firmware interface?
> 
> The one platform that arm is aware of isn't shipping yet - I assume it will implement the
> firmware interface.
> 
> But I don't think arm always know what it is people are building ... it certainly doesn't
> reach me. This affects a whole host of CPUs, I wouldn't be surprised if there is an
> existing part out there that is affected.

I was only thinking of the V2 system that is unobtainable at this point.
Actually looking at the laundry list of affected cores it does make more
sense that this problem already exists today.

> > I'm rather troubled by the idea of enabling this by default on systems
> > that use these cores unless there really is no opportunity to
> > course-correct.
> 
> It's the choice between correctness and performance. Probability says unless the CPU is
> Neoverse-V2 (which is that one platform), you're not affected. But how much does
> correctness matter? I'd hate to have to debug "1 in a 100 times the guest doesn't boot".

Oh, not looking to make that tradeoff with my line of questioning :) I
was more curious if there was still an opportunity for affected systems
to explicitly opt-in to the mitigation. Seems that the answer is "no",
sadly.