Message ID | 20230216182201.1705406-1-james.morse@arm.com (mailing list archive) |
---|---|
Headers | show |
Series | arm64: errata: Disable FWB on parts with non-ARM interconnects | expand |
Hi James, On Thu, Feb 16, 2023 at 06:21:58PM +0000, James Morse wrote: > Hello! > > When stage1 translation is disabled, the SCTRL_E1.I bit controls the > attributes used for instruction fetch, one of the options results in a > non-cacheable access. A whole host of CPUs missed the FWB override > in this case, meaning a KVM guest could fetch stale/junk data instead of > instructions. > > The workaround is to disable FWB, and do the required cache maintenance > instead. > > The good news is, this isn't a problem for systems using Arm's > interconnect IP. The bad news is: linux can't know this. Arm knows of > at least one platform that is affected by this erratum. > > > This series adds support for the 'Errata Management Firmware Interface', [0] > and queries that to determine if the CPU is affected or not. > > Unfortunately, no-one has firmware that supports this new interface yet, > and the least surprising thing to do is to enable the workaround by default, > meaning FWB is disabled on all these cores, even for unaffected platforms. > Platforms that are not-affected can either take a firmware-update to support > the interface, or if the kernel they run will only run on hardware that is > unaffected, disable the workaround at build time. Wait, what? Is there a legitimate concern that affected systems are in the wild today, or is there enough time for affected platforms to go and implement the necessary firmware interface? Requiring correctly implemented systems to explicitly opt-out seems like quite a lot more work (w/ low likelihood) than having the one known platform go about this the right way. I'm rather troubled by the idea of enabling this by default on systems that use these cores unless there really is no opportunity to course-correct.
On Thu, 16 Feb 2023 at 19:23, James Morse <james.morse@arm.com> wrote: > > Hello! > > When stage1 translation is disabled, the SCTRL_E1.I bit controls the > attributes used for instruction fetch, one of the options results in a > non-cacheable access. A whole host of CPUs missed the FWB override > in this case, meaning a KVM guest could fetch stale/junk data instead of > instructions. > > The workaround is to disable FWB, and do the required cache maintenance > instead. > So the system should behave as if SCTLR_EL1.I==1 when FWB is enabled, but it doesn't, right? Couldn't we just force SCTLR_EL1.I to 1 when FWB is enabled? I.e., trap writes and override the I bit - and if we want to pretend it is 0 we could trap reads and lie to the guest as well, but I doubt we'd even need that.
Hi Oliver, On 16/02/2023 18:52, Oliver Upton wrote: > On Thu, Feb 16, 2023 at 06:21:58PM +0000, James Morse wrote: >> When stage1 translation is disabled, the SCTRL_E1.I bit controls the >> attributes used for instruction fetch, one of the options results in a >> non-cacheable access. A whole host of CPUs missed the FWB override >> in this case, meaning a KVM guest could fetch stale/junk data instead of >> instructions. >> >> The workaround is to disable FWB, and do the required cache maintenance >> instead. >> >> The good news is, this isn't a problem for systems using Arm's >> interconnect IP. The bad news is: linux can't know this. Arm knows of >> at least one platform that is affected by this erratum. >> >> >> This series adds support for the 'Errata Management Firmware Interface', [0] >> and queries that to determine if the CPU is affected or not. >> >> Unfortunately, no-one has firmware that supports this new interface yet, >> and the least surprising thing to do is to enable the workaround by default, >> meaning FWB is disabled on all these cores, even for unaffected platforms. >> Platforms that are not-affected can either take a firmware-update to support >> the interface, or if the kernel they run will only run on hardware that is >> unaffected, disable the workaround at build time. > Wait, what? Is there a legitimate concern that affected systems are in > the wild today, or is there enough time for affected platforms to go and > implement the necessary firmware interface? The one platform that arm is aware of isn't shipping yet - I assume it will implement the firmware interface. But I don't think arm always know what it is people are building ... it certainly doesn't reach me. This affects a whole host of CPUs, I wouldn't be surprised if there is an existing part out there that is affected. > Requiring correctly > implemented systems to explicitly opt-out seems like quite a lot more > work (w/ low likelihood) than having the one known platform go about > this the right way. Sure, but its safe by default. > I'm rather troubled by the idea of enabling this by default on systems > that use these cores unless there really is no opportunity to > course-correct. It's the choice between correctness and performance. Probability says unless the CPU is Neoverse-V2 (which is that one platform), you're not affected. But how much does correctness matter? I'd hate to have to debug "1 in a 100 times the guest doesn't boot". Thanks, James
Hi Ard, On 21/02/2023 14:38, Ard Biesheuvel wrote: > On Thu, 16 Feb 2023 at 19:23, James Morse <james.morse@arm.com> wrote: >> When stage1 translation is disabled, the SCTRL_E1.I bit controls the >> attributes used for instruction fetch, one of the options results in a >> non-cacheable access. A whole host of CPUs missed the FWB override >> in this case, meaning a KVM guest could fetch stale/junk data instead of >> instructions. >> >> The workaround is to disable FWB, and do the required cache maintenance >> instead. > So the system should behave as if SCTLR_EL1.I==1 when FWB is enabled, > but it doesn't, right? Couldn't we just force SCTLR_EL1.I to 1 when > FWB is enabled? I.e., trap writes and override the I bit - and if we > want to pretend it is 0 we could trap reads and lie to the guest as > well, but I doubt we'd even need that. The affected parts don't have fine-grained traps, so we'd need to set HCR_EL2.TVM, which traps loads of things. We'd only need it while the guest has the MMU disabled, and KVM already has code that uses this trap to try and spot this ... ... but it only works until the first time you enable SCTRL_EL1.M as the trap is too costly to leave enabled. If you put the workaround in there, it would work the first time a guest booted, but a subsequent kexec, or any other reason to turn the MMU off is exposed. Its an incomplete fix, I'd hate to have to debug it! Thanks, James
James, I realize I didn't send my reply earlier this week and came back to this when looking for your reply on another thread. Sorry about that. On Tue, Feb 21, 2023 at 05:41:35PM +0000, James Morse wrote: [...] > > Wait, what? Is there a legitimate concern that affected systems are in > > the wild today, or is there enough time for affected platforms to go and > > implement the necessary firmware interface? > > The one platform that arm is aware of isn't shipping yet - I assume it will implement the > firmware interface. > > But I don't think arm always know what it is people are building ... it certainly doesn't > reach me. This affects a whole host of CPUs, I wouldn't be surprised if there is an > existing part out there that is affected. I was only thinking of the V2 system that is unobtainable at this point. Actually looking at the laundry list of affected cores it does make more sense that this problem already exists today. > > I'm rather troubled by the idea of enabling this by default on systems > > that use these cores unless there really is no opportunity to > > course-correct. > > It's the choice between correctness and performance. Probability says unless the CPU is > Neoverse-V2 (which is that one platform), you're not affected. But how much does > correctness matter? I'd hate to have to debug "1 in a 100 times the guest doesn't boot". Oh, not looking to make that tradeoff with my line of questioning :) I was more curious if there was still an opportunity for affected systems to explicitly opt-in to the mitigation. Seems that the answer is "no", sadly.