Message ID | 20240411-tso-v1-0-754f11abfbff@marcan.st (mailing list archive) |
---|---|
Series | arm64: Support the TSO memory model |
On Wed, Apr 10, 2024 at 8:51 PM Hector Martin <marcan@marcan.st> wrote:
>
> x86 CPUs implement a stricter memory model than ARM64 (TSO). For this
> reason, x86 emulation on baseline ARM64 systems requires very expensive
> memory model emulation. Having hardware that supports this natively is
> therefore very attractive. Such hardware, in fact, exists. This series
> adds support for userspace to identify when TSO is available and
> toggle it on, if supported.
>
> Some ARM64 CPUs intrinsically implement the TSO memory model, while
> others expose it as an IMPDEF control. Apple Silicon SoCs are in the
> latter category. Using TSO for x86 emulation on chips that support it
> has been shown to provide a massive performance boost [1].
>
> Patch 1 introduces the PR_{SET,GET}_MEM_MODEL userspace control, which
> is initially not implemented for any architectures.
>
> Patch 2 implements it for CPUs which are known, to the best of my
> knowledge, to always implement the TSO memory model unconditionally.
> This uses the cpufeature mechanism to only enable this if *all* cores in
> the system meet the requirements.
>
> Patch 3 adds the scaffolding necessary to save/restore the ACTLR_EL1
> register across context switches. This register contains IMPDEF flags
> related to CPU execution, and on Apple CPUs this is where the runtime
> TSO toggle bit is implemented. Other CPUs could conceivably benefit from
> this scaffolding if they also use ACTLR_EL1 for things that could
> ostensibly be runtime controlled and context-switched. For this to work,
> ACTLR_EL1 must have a uniform layout across all cores in the system.
>
> Finally, patch 4 implements PR_{SET,GET}_MEM_MODEL for Apple CPUs by
> hooking it up to flip the appropriate ACTLR_EL1 bit when the Apple TSO
> feature is detected (on all CPUs, which also implies the uniform
> ACTLR_EL1 layout).
>
> This series has been brewing in the downstream Asahi Linux tree for a
> while now, and ships to thousands of users. A subset have been using it
> with FEX-Emu, which already supports this feature. This rebase on
> v6.9-rc1 is only build-tested (all intermediate commits with and without
> the config enabled, on ARM64) but I'll update the downstream branch soon
> with this version and get it pushed out to users/testers.
>
> The Apple support works on bare metal and *should* work exactly the same
> way on macOS VMs (as alluded to by Zayd in his independent submission [3]),
> though I haven't personally verified this. KVM support for this is left
> for a future patchset.
>
> (Apologies for the large Cc: list; I want to make sure nobody who got
> Cced on Zayd's alternate take is left out of this one.)
>
> [1] https://fex-emu.com/FEX-2306/
> [2] https://github.com/AsahiLinux/linux/tree/bits/220-tso
> [3] https://lore.kernel.org/lkml/20240410211652.16640-1-zayd_qumsieh@apple.com/
>
> To: Catalin Marinas <catalin.marinas@arm.com>
> To: Will Deacon <will@kernel.org>
> To: Marc Zyngier <maz@kernel.org>
> To: Mark Rutland <mark.rutland@arm.com>
> Cc: Zayd Qumsieh <zayd_qumsieh@apple.com>
> Cc: Justin Lu <ih_justin@apple.com>
> Cc: Ryan Houdek <Houdek.Ryan@fex-emu.org>
> Cc: Mark Brown <broonie@kernel.org>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Mateusz Guzik <mjguzik@gmail.com>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Oliver Upton <oliver.upton@linux.dev>
> Cc: Miguel Luis <miguel.luis@oracle.com>
> Cc: Joey Gouly <joey.gouly@arm.com>
> Cc: Christoph Paasch <cpaasch@apple.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Sami Tolvanen <samitolvanen@google.com>
> Cc: Baoquan He <bhe@redhat.com>
> Cc: Joel Granados <j.granados@samsung.com>
> Cc: Dawei Li <dawei.li@shingroup.cn>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Florent Revest <revest@chromium.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Stefan Roesch <shr@devkernel.io>
> Cc: Andy Chiu <andy.chiu@sifive.com>
> Cc: Josh Triplett <josh@joshtriplett.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Helge Deller <deller@gmx.de>
> Cc: Zev Weiss <zev@bewilderbeest.net>
> Cc: Ondrej Mosnacek <omosnace@redhat.com>
> Cc: Miguel Ojeda <ojeda@kernel.org>
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Asahi Linux <asahi@lists.linux.dev>
>
> Signed-off-by: Hector Martin <marcan@marcan.st>
> ---
> Hector Martin (4):
>       prctl: Introduce PR_{SET,GET}_MEM_MODEL
>       arm64: Implement PR_{GET,SET}_MEM_MODEL for always-TSO CPUs
>       arm64: Introduce scaffolding to add ACTLR_EL1 to thread state
>       arm64: Implement Apple IMPDEF TSO memory model control
>
>  arch/arm64/Kconfig                        | 14 ++++++
>  arch/arm64/include/asm/apple_cpufeature.h | 15 +++++++
>  arch/arm64/include/asm/cpufeature.h       | 10 +++++
>  arch/arm64/include/asm/processor.h        |  3 ++
>  arch/arm64/kernel/Makefile                |  3 +-
>  arch/arm64/kernel/cpufeature.c            | 11 ++---
>  arch/arm64/kernel/cpufeature_impdef.c     | 61 ++++++++++++++++++++++++++
>  arch/arm64/kernel/process.c               | 71 +++++++++++++++++++++++++++++++
>  arch/arm64/kernel/setup.c                 |  8 ++++
>  arch/arm64/tools/cpucaps                  |  2 +
>  include/linux/memory_ordering_model.h     | 11 +++++
>  include/uapi/linux/prctl.h                |  5 +++
>  kernel/sys.c                              | 21 +++++++++
>  13 files changed, 229 insertions(+), 6 deletions(-)
> ---
> base-commit: 4cece764965020c22cff7665b18a012006359095
> change-id: 20240411-tso-e86fdceb94b8
>

The series looks good to me.

Reviewed-by: Neal Gompa <neal@gompa.dev>
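For readers following along, here is a minimal userspace sketch of how the control described in the cover letter might be called. The option names come from the series; the numeric values, the convention that PR_GET_MEM_MODEL returns the current model as the prctl() return value, and the helper name try_enable_tso() are all assumptions made for illustration, since none of this exists in mainline headers.

```c
#include <sys/prctl.h>

/* Assumed values: these options are NOT in mainline <linux/prctl.h>. */
#ifndef PR_SET_MEM_MODEL
#define PR_GET_MEM_MODEL          0x6d4d444c
#define PR_SET_MEM_MODEL          0x4d4d444c
#define PR_SET_MEM_MODEL_DEFAULT  0
#define PR_SET_MEM_MODEL_TSO      1
#endif

/*
 * Hypothetical helper: ask the kernel to switch this thread to TSO.
 * Returns 1 if TSO is reported active afterwards, 0 otherwise
 * (for example on a kernel or CPU without support, where prctl()
 * is expected to fail with EINVAL).
 */
static int try_enable_tso(void)
{
    if (prctl(PR_SET_MEM_MODEL, (unsigned long)PR_SET_MEM_MODEL_TSO,
              0UL, 0UL, 0UL) != 0)
        return 0;

    /* Assumption: the getter returns the active model directly. */
    return prctl(PR_GET_MEM_MODEL, 0UL, 0UL, 0UL, 0UL) ==
           PR_SET_MEM_MODEL_TSO;
}
```

An emulator would call something like this once at startup and fall back to emitting explicit acquire/release instructions when it returns 0.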
Hi Hector,

On Thu, Apr 11, 2024 at 09:51:19AM +0900, Hector Martin wrote:
> x86 CPUs implement a stricter memory model than ARM64 (TSO). For this
> reason, x86 emulation on baseline ARM64 systems requires very expensive
> memory model emulation. Having hardware that supports this natively is
> therefore very attractive. Such hardware, in fact, exists. This series
> adds support for userspace to identify when TSO is available and
> toggle it on, if supported.

I'm probably going to make myself hugely unpopular here, but I have a
strong objection to this patch series as it stands. I firmly believe
that providing a prctl() to query and toggle the memory model to/from
TSO is going to lead to subtle fragmentation of arm64 Linux userspace.

It's not difficult to envisage this TSO switch being abused for native
arm64 applications:

  * A program no longer crashes when TSO is enabled, so the developer
    just toggles TSO to meet a deadline.

  * Some legacy x86 sources are being ported to arm64 but concurrency
    is hard so the developer just enables TSO to (mostly) avoid thinking
    about it.

  * Some binaries in a distribution exhibit instability which goes away
    in TSO mode, so a taskset-like program is used to run them with TSO
    enabled.

In all these cases, we end up with native arm64 applications that will
either fail to load or will crash in subtle ways on CPUs without the TSO
feature. Assuming that the application cannot be fixed, a better
approach would be to recompile using stronger instructions (e.g.
LDAR/STLR) so that at least the resulting binary is portable. Now, it's
true that some existing CPUs are TSO by design (this is a perfectly
valid implementation of the arm64 memory model), but I think there's a
big difference between quietly providing more ordering guarantees than
software may be relying on and providing a mechanism to discover,
request and ultimately rely upon the stronger behaviour.

An alternative option is to go down the SPARC RMO route and just enable
TSO statically (although presumably in the firmware) for Apple silicon.
I'm assuming that has a performance impact for native code?

Will

P.S. I briefly pondered the idea of the kernel toggling the bit in the
ELF loader when e.g. it sees an x86 machine type but I suspect that
doesn't really help with existing emulators and you'd still need a way
to tell the emulator whether or not it was enabled.
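To make the "stronger instructions" suggestion concrete, here is a small portable sketch using C11 atomics. On arm64, compilers typically lower a release store and an acquire load to STLR and LDAR (or a close variant), so code written this way does not depend on the CPU being TSO. The struct and function names are made up for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical one-slot mailbox: 'ready' publishes 'payload'. */
struct mailbox {
    int payload;        /* plain data */
    atomic_bool ready;  /* flag that orders access to payload */
};

/* Producer: write the data, then publish it with a release store. */
static void publish(struct mailbox *m, int value)
{
    m->payload = value;
    atomic_store_explicit(&m->ready, true, memory_order_release);
}

/* Consumer: an acquire load of the flag makes the payload visible. */
static bool try_consume(struct mailbox *m, int *out)
{
    if (!atomic_load_explicit(&m->ready, memory_order_acquire))
        return false;
    *out = m->payload;
    return true;
}
```

With relaxed ordering on both sides, this pattern is exactly the kind of code that can appear to work on a TSO machine and then fail on a weakly ordered one.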
On 2024/04/11 22:28, Will Deacon wrote:
> Hi Hector,
>
> On Thu, Apr 11, 2024 at 09:51:19AM +0900, Hector Martin wrote:
>> x86 CPUs implement a stricter memory model than ARM64 (TSO). For this
>> reason, x86 emulation on baseline ARM64 systems requires very expensive
>> memory model emulation. Having hardware that supports this natively is
>> therefore very attractive. Such hardware, in fact, exists. This series
>> adds support for userspace to identify when TSO is available and
>> toggle it on, if supported.
>
> I'm probably going to make myself hugely unpopular here, but I have a
> strong objection to this patch series as it stands. I firmly believe
> that providing a prctl() to query and toggle the memory model to/from
> TSO is going to lead to subtle fragmentation of arm64 Linux userspace.

I honestly doubt this should be a significant concern right now, given
that only a subset of implementations actually support this. Yes,
developers can do stupid stuff, but we already have gone through this
kind of story with other situations (e.g. 16K and 64K page support on
ARM64 breaking 4K assumptions) and things have been fixed over time. In
particular, I highly suspect Asahi Linux and Apple Silicon have done a
lot more good for the ARM64 ecosystem by getting developers to fix their
page size mess than they will do bad by somehow encouraging TSO abuse.
We've even found new memory model issues thanks to the architecture's
deep out-of-order character (remember that mess with Linux atomics?
:-)). So far, in the year+ we've had this patchset downstream, not a
single developer has proposed abusing it for something that isn't an x86
emulator.

There's a pragmatic argument here: since we need this, and it absolutely
will continue to ship downstream if rejected, it doesn't make much
difference for fragmentation risk, does it? The vast majority of
Linux-on-Mac users are likely to continue running downstream kernels for
the foreseeable future anyway to get newer features and hardware support
faster than they can be upstreamed. So not allowing this upstream
doesn't really change the landscape vis-a-vis being able to abuse this
or not, it just makes our life harder by forcing us to carry more
patches forever.

> It's not difficult to envisage this TSO switch being abused for native
> arm64 applications:
>
>   * A program no longer crashes when TSO is enabled, so the developer
>     just toggles TSO to meet a deadline.
>
>   * Some legacy x86 sources are being ported to arm64 but concurrency
>     is hard so the developer just enables TSO to (mostly) avoid thinking
>     about it.

Both of these rely on the developer *knowing* what TSO is and why it
fixes this. I posit that a developer who knows what that is is also
likely to know why this is a stupid hack and they shouldn't be doing
this and that it won't work on all machines.

>
>   * Some binaries in a distribution exhibit instability which goes away
>     in TSO mode, so a taskset-like program is used to run them with TSO
>     enabled.

Since the flag is cleared on execve, this third one isn't generally
possible as far as I know.

> In all these cases, we end up with native arm64 applications that will
> either fail to load or will crash in subtle ways on CPUs without the TSO
> feature. Assuming that the application cannot be fixed, a better
> approach would be to recompile using stronger instructions (e.g.
> LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> true that some existing CPUs are TSO by design (this is a perfectly
> valid implementation of the arm64 memory model), but I think there's a
> big difference between quietly providing more ordering guarantees than
> software may be relying on and providing a mechanism to discover,
> request and ultimately rely upon the stronger behaviour.

The problem is "just" using stronger instructions is much more
expensive, as emulators have demonstrated. If TSO didn't serve a
practical purpose I wouldn't be submitting this, but it does. This is
basically non-negotiable for x86 emulation; if this is rejected
upstream, it will forever live as a downstream patch used by the entire
gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
explicitly targeting, given our efforts with microVMs for 4K page size
support and the upcoming Vulkan drivers).

That said, I have a pragmatic proposal here. The "fixed TSO" part of the
implementation should be harmless, since those CPUs would correctly run
poorly-written applications anyway so the API is moot. That leaves Apple
Silicon. Our native kernels are and likely always will be 16K page size,
due to a bunch of pain around 16K-only IOMMUs (4K kernels do boot
natively but with very broken functionality including no GPU
acceleration) plus performance differences that favor 16K. How about we
gate the TSO functionality to only be supported on 4K kernel builds?
This would make them only work in 4K VMs on Asahi Linux. We are very
explicitly discouraging people from trying to use the microVMs to work
around page size problems (which they can already do, another
fragmentation problem, anyway); any application which requires the 4K VM
to run that isn't an emulator is already clearly broken and advertising
that fact openly. So, adding TSO to this should be only a marginal risk
of further fragmentation, and it wouldn't allow apps to "sneakily" "just
work" on Apple machines by abusing TSO.

>
> An alternative option is to go down the SPARC RMO route and just enable
> TSO statically (although presumably in the firmware) for Apple silicon.
> I'm assuming that has a performance impact for native code?

Correct. We already have this as a bootloader option, but it is not
desirable. Plus, userspace code still needs a way to *discover* that TSO
is enabled for correctness, so it can automatically decide whether to
use stronger or weaker instructions.

>
> Will
>
> P.S. I briefly pondered the idea of the kernel toggling the bit in the
> ELF loader when e.g. it sees an x86 machine type but I suspect that
> doesn't really help with existing emulators and you'd still need a way
> to tell the emulator whether or not it was enabled.
>

- Hector
On 2024/04/11 23:19, Hector Martin wrote:
>>
>> An alternative option is to go down the SPARC RMO route and just enable
>> TSO statically (although presumably in the firmware) for Apple silicon.
>> I'm assuming that has a performance impact for native code?
>
> Correct. We already have this as a bootloader option, but it is not
> desirable. Plus, userspace code still needs a way to *discover* that TSO
> is enabled for correctness, so it can automatically decide whether to
> use stronger or weaker instructions.

To add some numbers to this (I was just made aware of this paper):

https://www.sra.uni-hannover.de/Publications/2023/tosting-arcs23/wrenger_23_arcs.pdf

Using TSO globally has, on average, a 9% performance hit, so that is
clearly off the table as a general solution. Meanwhile, more detailed
microbenchmarks often show TSO as having better performance than
outright using acquire/release instructions without TSO. Therefore, just
giving up on TSO and using acq/rel semantics for emulators is also not
an acceptable solution.

Additionally, the general load/store instructions on ARM have more
flexible addressing modes than the synchronizing ones, and since general
x86 emulation requires *all* loads and stores to be like this in a
non-TSO model (without much more complex/expensive program analysis to
determine where this can be elided), the perf impact is definitely worse
for emulation (e.g. stack accesses are affected) than for a
microbenchmark where only the "target" test instructions are being
modified.

- Hector
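The addressing-mode point can be illustrated with a toy code generator. LDAR accepts only a bare base register, whereas a plain LDR can fold an immediate offset, so a guest access such as a stack-relative load costs an extra address computation when TSO is not available in hardware. The sketch below is a hypothetical emitter that produces assembly text; the function name, the choice of x16 as scratch register and w0 as destination are illustrative assumptions, and immediate range limits are ignored.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical emitter: write the arm64 sequence for a 32-bit guest
 * load from [base + offset] into buf, and return how many host
 * instructions were needed.
 */
static int emit_guest_load(char *buf, size_t len, bool hw_tso,
                           const char *base, unsigned offset)
{
    if (hw_tso) {
        /* Hardware provides TSO: a plain LDR with an offset is enough. */
        snprintf(buf, len, "ldr w0, [%s, #%u]", base, offset);
        return 1;
    }

    /* No TSO: use LDAR, which needs the address in a register first. */
    snprintf(buf, len, "add x16, %s, #%u\nldar w0, [x16]", base, offset);
    return 2;
}
```

Because an x86 emulator without program analysis has to treat every guest load and store this way, the extra instruction lands on hot paths such as stack accesses.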
The patch looks great! :) I have one minor suggestion, though:

>static __always_inline bool system_has_actlr_state(void)
>{
>	return IS_ENABLED(CONFIG_ARM64_ACTLR_STATE) &&
>		alternative_has_cap_unlikely(ARM64_HAS_TSO_APPLE);
>}

Not all versions of macOS expose ACTLR_EL1.TSO for writing in virtual
machines. However, AIDR_EL1 may still advertise TSO, whether or not
ACTLR_EL1.TSO is writable. Could you modify the patch such that we check
the writability of ACTLR_EL1.TSO in system_has_actlr_state (or once on
startup, and cache it, since reading from AIDR_EL1 causes a trap to
Hypervisor.fwk)?

Thanks,
Zayd
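The "probe once and cache" idea could look roughly like the sketch below. It is a host-side model, not kernel code: the register is reached through function pointers so the logic can be exercised against a fake, the TSO bit position is an assumption, and it further assumes that a write to a non-writable bit is silently dropped rather than trapping. All names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define ACTLR_TSO_BIT (UINT64_C(1) << 1)  /* assumed bit position */

/* Accessors; in a kernel these would wrap the mrs/msr of ACTLR_EL1. */
struct actlr_ops {
    uint64_t (*read)(void);
    void (*write)(uint64_t);
};

static int tso_writable_cache = -1;  /* -1 unknown, 0 no, 1 yes */

/* Flip the TSO bit, see whether the change sticks, then restore. */
static bool actlr_tso_writable(const struct actlr_ops *ops)
{
    if (tso_writable_cache < 0) {
        uint64_t saved = ops->read();

        ops->write(saved ^ ACTLR_TSO_BIT);
        tso_writable_cache =
            ((ops->read() ^ saved) & ACTLR_TSO_BIT) != 0;
        ops->write(saved);
    }
    return tso_writable_cache == 1;
}

/* Fake register for testing: only bits in the mask accept writes. */
static uint64_t fake_reg, fake_writable_mask;

static uint64_t fake_read(void) { return fake_reg; }

static void fake_write(uint64_t v)
{
    fake_reg = (fake_reg & ~fake_writable_mask) | (v & fake_writable_mask);
}
```

After the first call the cached answer is reused, so later calls never touch the register again.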
On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
> On 2024/04/11 22:28, Will Deacon wrote:
> >   * Some binaries in a distribution exhibit instability which goes away
> >     in TSO mode, so a taskset-like program is used to run them with TSO
> >     enabled.
>
> Since the flag is cleared on execve, this third one isn't generally
> possible as far as I know.

Ah ok, I'd missed that. Thanks.

> > In all these cases, we end up with native arm64 applications that will
> > either fail to load or will crash in subtle ways on CPUs without the TSO
> > feature. Assuming that the application cannot be fixed, a better
> > approach would be to recompile using stronger instructions (e.g.
> > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> > true that some existing CPUs are TSO by design (this is a perfectly
> > valid implementation of the arm64 memory model), but I think there's a
> > big difference between quietly providing more ordering guarantees than
> > software may be relying on and providing a mechanism to discover,
> > request and ultimately rely upon the stronger behaviour.
>
> The problem is "just" using stronger instructions is much more
> expensive, as emulators have demonstrated. If TSO didn't serve a
> practical purpose I wouldn't be submitting this, but it does. This is
> basically non-negotiable for x86 emulation; if this is rejected
> upstream, it will forever live as a downstream patch used by the entire
> gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
> explicitly targeting, given our efforts with microVMs for 4K page size
> support and the upcoming Vulkan drivers).

These microVMs sound quite interesting. What exactly are they? Are you
running them under KVM?

Ignoring the mechanism for the time being, would it solve your problem
if you were able to run specific microVMs in TSO mode, or do you *really*
need the VM to have finer-grained control than that? If the whole VM is
running in TSO mode, then my concerns largely disappear, as that's
indistinguishable from running on a hardware implementation that happens
to be TSO.

> That said, I have a pragmatic proposal here. The "fixed TSO" part of the
> implementation should be harmless, since those CPUs would correctly run
> poorly-written applications anyway so the API is moot. That leaves Apple
> Silicon. Our native kernels are and likely always will be 16K page size,
> due to a bunch of pain around 16K-only IOMMUs (4K kernels do boot
> natively but with very broken functionality including no GPU
> acceleration) plus performance differences that favor 16K. How about we
> gate the TSO functionality to only be supported on 4K kernel builds?
> This would make them only work in 4K VMs on Asahi Linux. We are very
> explicitly discouraging people from trying to use the microVMs to work
> around page size problems (which they can already do, another
> fragmentation problem, anyway); any application which requires the 4K VM
> to run that isn't an emulator is already clearly broken and advertising
> that fact openly. So, adding TSO to this should be only a marginal risk
> of further fragmentation, and it wouldn't allow apps to "sneakily" "just
> work" on Apple machines by abusing TSO.

I appreciate that you're trying to be constructive here, but I don't think
we should tie this to the page size. It's an artificial limitation and I
don't think it really addresses the underlying concerns that I have.

Will
On Mon, Apr 15, 2024 at 07:22:41PM -0700, Zayd Qumsieh wrote:
> >I'm probably going to make myself hugely unpopular here, but I have a
> >strong objection to this patch series as it stands. I firmly believe
> >that providing a prctl() to query and toggle the memory model to/from
> >TSO is going to lead to subtle fragmentation of arm64 Linux userspace.
>
> It's definitely not our intent to fragment the ecosystem.
> The goal of this memory ordering is to simplify emulation layers that benefit from this.
> If you have suggestions to reduce the risk of it being misused outside of emulators, we'd be happy to look into it.

Once you have exposed this toggle via prctl(), it doesn't really matter
what your intentions were. It will get used outside of emulation layers
and we'll be stuck supporting it.

Will
On Fri, Apr 19, 2024 at 05:58:26PM +0100, Will Deacon wrote:
> On Mon, Apr 15, 2024 at 07:22:41PM -0700, Zayd Qumsieh wrote:
> > >I'm probably going to make myself hugely unpopular here, but I have a
> > >strong objection to this patch series as it stands. I firmly believe
> > >that providing a prctl() to query and toggle the memory model to/from
> > >TSO is going to lead to subtle fragmentation of arm64 Linux userspace.
> >
> > It's definitely not our intent to fragment the ecosystem. The goal
> > of this memory ordering is to simplify emulation layers that benefit
> > from this. If you have suggestions to reduce the risk of it being
> > misused outside of emulators, we'd be happy to look into it.
>
> Once you have exposed this toggle via prctl(), it doesn't really matter
> what your intentions were. It will get used outside of emulation layers
> and we'll be stuck supporting it.

Just FTR, I fully agree with Will. I'm strongly against this kind of ABI
for a non-architected, implementation defined feature.

I can't even tell exactly what TSO means on the Apple hardware. Is it
close to the x86 TSO? Is there a formal memory model for it? Are future
Apple (or other Arm vendor) implementations going to follow exactly the
same model to be able to call it some form of "Apple standard" that
deserves an ABI?

So, sorry, I'm going to NAK these approaches proposing imp def features
as generic opt-in mechanisms (the microVMs thing sounds doable though,
to my limited understanding; I guess that would mean running the
emulator in a VM).
On Fri, 19 Apr 2024 17:58:09 +0100,
Will Deacon <will@kernel.org> wrote:
>
> On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
> > On 2024/04/11 22:28, Will Deacon wrote:
> > >   * Some binaries in a distribution exhibit instability which goes away
> > >     in TSO mode, so a taskset-like program is used to run them with TSO
> > >     enabled.
> >
> > Since the flag is cleared on execve, this third one isn't generally
> > possible as far as I know.
>
> Ah ok, I'd missed that. Thanks.
>
> > > In all these cases, we end up with native arm64 applications that will
> > > either fail to load or will crash in subtle ways on CPUs without the TSO
> > > feature. Assuming that the application cannot be fixed, a better
> > > approach would be to recompile using stronger instructions (e.g.
> > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> > > true that some existing CPUs are TSO by design (this is a perfectly
> > > valid implementation of the arm64 memory model), but I think there's a
> > > big difference between quietly providing more ordering guarantees than
> > > software may be relying on and providing a mechanism to discover,
> > > request and ultimately rely upon the stronger behaviour.
> >
> > The problem is "just" using stronger instructions is much more
> > expensive, as emulators have demonstrated. If TSO didn't serve a
> > practical purpose I wouldn't be submitting this, but it does. This is
> > basically non-negotiable for x86 emulation; if this is rejected
> > upstream, it will forever live as a downstream patch used by the entire
> > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
> > explicitly targeting, given our efforts with microVMs for 4K page size
> > support and the upcoming Vulkan drivers).
>
> These microVMs sound quite interesting. What exactly are they? Are you
> running them under KVM?
>
> Ignoring the mechanism for the time being, would it solve your problem
> if you were able to run specific microVMs in TSO mode, or do you *really*
> need the VM to have finer-grained control than that? If the whole VM is
> running in TSO mode, then my concerns largely disappear, as that's
> indistinguishable from running on a hardware implementation that happens
> to be TSO.

Since KVM has been mentioned a few times, I'll give my take on this.

Since day 1, it was a conscious decision for KVM/arm64 to emulate the
architecture, and only that -- this is complicated enough. Meaning
that no implementation-defined features should be explicitly exposed
to the guest. So I have no plan to expose any such feature for
userspace to configure TSO or anything else of the sort.

However, that doesn't preclude VMs from running in TSO mode if the HW
is configured as such at boot time. From what I have understood, this
is a per translation regime setting (EL1 and EL2 have separate knobs).
So it should be possible to set ACTLR_EL1.TSO=1 from firmware (using
the non-architected ACTLR_EL12 accessor), and let things work without
touching anything else (KVM doesn't context switch this register and
traps accesses to it).

This would keep KVM out of the loop, the host side would be
unaffected, and only VMs would pay the overhead of TSO.

I appreciate that this is not the ideal situation, and very much an
all-or-nothing approach. But that's what we can reasonably manage from
an upstream perspective given the variability of the arm64 ecosystem.

	M.
On Fri, 19 Apr 2024 at 18:08, Will Deacon <will@kernel.org> wrote:
>
> On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
> > On 2024/04/11 22:28, Will Deacon wrote:
> > >   * Some binaries in a distribution exhibit instability which goes away
> > >     in TSO mode, so a taskset-like program is used to run them with TSO
> > >     enabled.
> >
> > Since the flag is cleared on execve, this third one isn't generally
> > possible as far as I know.
>
> Ah ok, I'd missed that. Thanks.
>
> > > In all these cases, we end up with native arm64 applications that will
> > > either fail to load or will crash in subtle ways on CPUs without the TSO
> > > feature. Assuming that the application cannot be fixed, a better
> > > approach would be to recompile using stronger instructions (e.g.
> > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> > > true that some existing CPUs are TSO by design (this is a perfectly
> > > valid implementation of the arm64 memory model), but I think there's a
> > > big difference between quietly providing more ordering guarantees than
> > > software may be relying on and providing a mechanism to discover,
> > > request and ultimately rely upon the stronger behaviour.
> >
> > The problem is "just" using stronger instructions is much more
> > expensive, as emulators have demonstrated. If TSO didn't serve a
> > practical purpose I wouldn't be submitting this, but it does. This is
> > basically non-negotiable for x86 emulation; if this is rejected
> > upstream, it will forever live as a downstream patch used by the entire
> > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
> > explicitly targeting, given our efforts with microVMs for 4K page size
> > support and the upcoming Vulkan drivers).
>
> These microVMs sound quite interesting. What exactly are they? Are you
> running them under KVM?

It's the magic of libkrun. This is one of the git repos in the libkrun
family. It has a wide array of use cases, and I won't do them justice
by trying to explain them all; this is just one repo/tool/use case:

https://github.com/containers/krunvm

https://sinrega.org/running-microvms-on-m1/

CC'ing @Sergio Lopez Pascual the lead of krun in general.

Is mise le meas/Regards,

Eric Curtin

>
> Ignoring the mechanism for the time being, would it solve your problem
> if you were able to run specific microVMs in TSO mode, or do you *really*
> need the VM to have finer-grained control than that? If the whole VM is
> running in TSO mode, then my concerns largely disappear, as that's
> indistinguishable from running on a hardware implementation that happens
> to be TSO.
>
> > That said, I have a pragmatic proposal here. The "fixed TSO" part of the
> > implementation should be harmless, since those CPUs would correctly run
> > poorly-written applications anyway so the API is moot. That leaves Apple
> > Silicon. Our native kernels are and likely always will be 16K page size,
> > due to a bunch of pain around 16K-only IOMMUs (4K kernels do boot
> > natively but with very broken functionality including no GPU
> > acceleration) plus performance differences that favor 16K. How about we
> > gate the TSO functionality to only be supported on 4K kernel builds?
> > This would make them only work in 4K VMs on Asahi Linux. We are very
> > explicitly discouraging people from trying to use the microVMs to work
> > around page size problems (which they can already do, another
> > fragmentation problem, anyway); any application which requires the 4K VM
> > to run that isn't an emulator is already clearly broken and advertising
> > that fact openly. So, adding TSO to this should be only a marginal risk
> > of further fragmentation, and it wouldn't allow apps to "sneakily" "just
> > work" on Apple machines by abusing TSO.
>
> I appreciate that you're trying to be constructive here, but I don't think
> we should tie this to the page size. It's an artificial limitation and I
> don't think it really addresses the underlying concerns that I have.
>
> Will
>
On Sat, 20 Apr 2024 at 13:13, Eric Curtin <ecurtin@redhat.com> wrote: > > On Fri, 19 Apr 2024 at 18:08, Will Deacon <will@kernel.org> wrote: > > > > On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote: > > > On 2024/04/11 22:28, Will Deacon wrote: > > > > * Some binaries in a distribution exhibit instability which goes away > > > > in TSO mode, so a taskset-like program is used to run them with TSO > > > > enabled. > > > > > > Since the flag is cleared on execve, this third one isn't generally > > > possible as far as I know. > > > > Ah ok, I'd missed that. Thanks. > > > > > > In all these cases, we end up with native arm64 applications that will > > > > either fail to load or will crash in subtle ways on CPUs without the TSO > > > > feature. Assuming that the application cannot be fixed, a better > > > > approach would be to recompile using stronger instructions (e.g. > > > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's > > > > true that some existing CPUs are TSO by design (this is a perfectly > > > > valid implementation of the arm64 memory model), but I think there's a > > > > big difference between quietly providing more ordering guarantees than > > > > software may be relying on and providing a mechanism to discover, > > > > request and ultimately rely upon the stronger behaviour. > > > > > > The problem is "just" using stronger instructions is much more > > > expensive, as emulators have demonstrated. If TSO didn't serve a > > > practical purpose I wouldn't be submitting this, but it does. This is > > > basically non-negotiable for x86 emulation; if this is rejected > > > upstream, it will forever live as a downstream patch used by the entire > > > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very > > > explicitly targeting, given our efforts with microVMs for 4K page size > > > support and the upcoming Vulkan drivers). > > > > These microVMs sound quite interesting. What exactly are they? 
Are you > > running them under KVM? > > It's the magic of libkrun. This is one of the git repos in the family > of libkrun, it has a wide array of use cases, which I personally won't > do much justice explaining all then, this is just one > repo/tool/usecases: > > https://github.com/containers/krunvm > > https://sinrega.org/running-microvms-on-m1/ Sorry for the double post, meant to share this one for the Asahi emulator usecase. Sergio's blogs are great in general: https://sinrega.org/2023-10-06-using-microvms-for-gaming-on-fedora-asahi/ Is mise le meas/Regards, Eric Curtin > > CC'ing @Sergio Lopez Pascual the lead of krun in general. > > Is mise le meas/Regards, > > Eric Curtin > > > > > Ignoring the mechanism for the time being, would it solve your problem > > if you were able to run specific microVMs in TSO mode, or do you *really* > > need the VM to have finer-grained control than that? If the whole VM is > > running in TSO mode, then my concerns largely disappear, as that's > > indistinguishable from running on a hardware implementation that happens > > to be TSO. > > > > > That said, I have a pragmatic proposal here. The "fixed TSO" part of the > > > implementation should be harmless, since those CPUs would correctly run > > > poorly-written applications anyway so the API is moot. That leaves Apple > > > Silicon. Our native kernels are and likely always will be 16K page size, > > > due to a bunch of pain around 16K-only IOMMUs (4K kernels do boot > > > natively but with very broken functionality including no GPU > > > acceleration) plus performance differences that favor 16K. How about we > > > gate the TSO functionality to only be supported on 4K kernel builds? > > > This would make them only work in 4K VMs on Asahi Linux. 
We are very > > > explicitly discouraging people from trying to use the microVMs to work > > > around page size problems (which they can already do, another > > > fragmentation problem, anyway); any application which requires the 4K VM > > > to run that isn't an emulator is already clearly broken and advertising > > > that fact openly. So, adding TSO to this should be only a marginal risk > > > of further fragmentation, and it wouldn't allow apps to "sneakily" "just > > > work" on Apple machines by abusing TSO. > > > > I appreciate that you're trying to be constructive here, but I don't think > > we should tie this to the page size. It's an artificial limitation and I > > don't think it really addresses the underlying concerns that I have. > > > > Will > >
> On Fri, 19 Apr 2024 17:58:09 +0100, > Will Deacon <will@kernel.org> wrote: > > > > On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote: > > > On 2024/04/11 22:28, Will Deacon wrote: > > > > * Some binaries in a distribution exhibit instability which goes away > > > > in TSO mode, so a taskset-like program is used to run them with TSO > > > > enabled. > > > > > > Since the flag is cleared on execve, this third one isn't generally > > > possible as far as I know. > > > > Ah ok, I'd missed that. Thanks. > > > > > > In all these cases, we end up with native arm64 applications that will > > > > either fail to load or will crash in subtle ways on CPUs without the TSO > > > > feature. Assuming that the application cannot be fixed, a better > > > > approach would be to recompile using stronger instructions (e.g. > > > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's > > > > true that some existing CPUs are TSO by design (this is a perfectly > > > > valid implementation of the arm64 memory model), but I think there's a > > > > big difference between quietly providing more ordering guarantees than > > > > software may be relying on and providing a mechanism to discover, > > > > request and ultimately rely upon the stronger behaviour. > > > > > > The problem is "just" using stronger instructions is much more > > > expensive, as emulators have demonstrated. If TSO didn't serve a > > > practical purpose I wouldn't be submitting this, but it does. This is > > > basically non-negotiable for x86 emulation; if this is rejected > > > upstream, it will forever live as a downstream patch used by the entire > > > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very > > > explicitly targeting, given our efforts with microVMs for 4K page size > > > support and the upcoming Vulkan drivers). > > > > These microVMs sound quite interesting. What exactly are they? Are you > > running them under KVM? 
> > > > Ignoring the mechanism for the time being, would it solve your problem > > if you were able to run specific microVMs in TSO mode, or do you *really* > > need the VM to have finer-grained control than that? If the whole VM is > > running in TSO mode, then my concerns largely disappear, as that's > > indistinguishable from running on a hardware implementation that happens > > to be TSO. > > Since KVM has been mentioned a few times, I'll give my take on this. > > Since day 1, it was a conscious decision for KVM/arm64 to emulate the > architecture, and only that -- this is complicated enough. Meaning > that no implementation-defined features should be explicitly exposed > to the guest. So I have no plan to expose any such feature for > userspace to configure TSO or anything else of the sort. Agreed. We do not intend for TSO mode to be used extensively for EL1, the intention is for TSO mode to be reserved for userspace applications that request it.
On Thu, 11 Apr 2024 14:28:54 +0100, Will Deacon <will@kernel.org> wrote:
> P.S. I briefly pondered the idea of the kernel toggling the bit in the
> ELF loader when e.g. it sees an x86 machine type but I suspect that
> doesn't really help with existing emulators and you'd still need a way
> to tell the emulator whether or not it was enabled.

This seems promising to me. What do people think of adding an opt-in argument, option, or similar to binfmt that allows users to mark certain file formats as "must run under TSO"? The kernel would then set the TSO bit when invoking the interpreter for those file formats.

If an emulator decides to create a non-CPU-emulation thread, it can use a prctl to disable TSO and switch to the default ARM memory model. Note that this prctl wouldn't be allowed to enable TSO; it would only disable it. This makes it much harder to write a faulty application that relies on TSO, since TSO is only ever enabled via a binfmt handler that the user must explicitly opt into.

It is true that existing emulators wouldn't be able to benefit from this, but that's the case no matter the activation mechanism. We can, however, expose a prctl to get the memory model, so emulators can detect whether TSO was enabled for their threads.

To summarize, I propose two prctls (similar to the ones in the current revision of the patch series): one to switch from the TSO memory model to the default ARM one (this is a one-way street), and another to query the current memory model.

Thanks,
Zayd

P.S. I forgot to CC you in my most recent email to Marc Zyngier just now. Sorry, I'm quite new to using mailing lists.
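To make the semantics proposed in this message concrete, here is a small userspace model in Python. It is not kernel code, and every name in it (`Task`, `exec_via_binfmt`, `prctl_set_mem_model`, `prctl_get_mem_model`, the `MEM_MODEL_*` constants) is a hypothetical stand-in. The `EINVAL` return value is also an assumption; the message does not say which error a refused request should produce. Only the behaviour is taken from the proposal: TSO is turned on solely by an opted-in binfmt handler, a thread may drop back to the default model, and a thread can never turn TSO on by itself.

```python
import errno

# Hypothetical constants; real values would come from the prctl series.
MEM_MODEL_DEFAULT = 0
MEM_MODEL_TSO = 1


class Task:
    """Minimal stand-in for the memory-model state of one thread."""

    def __init__(self):
        self.mem_model = MEM_MODEL_DEFAULT


def exec_via_binfmt(task, fmt, tso_formats):
    """Model exec: TSO is set only if the user opted this format in."""
    if fmt in tso_formats:
        task.mem_model = MEM_MODEL_TSO
    else:
        task.mem_model = MEM_MODEL_DEFAULT


def prctl_get_mem_model(task):
    """Query prctl: lets an emulator detect whether TSO is active."""
    return task.mem_model


def prctl_set_mem_model(task, model):
    """One-way prctl: only a switch to the default model is accepted.

    Returns 0 on success or a negative errno (assumed to be EINVAL).
    """
    if model == MEM_MODEL_DEFAULT:
        task.mem_model = MEM_MODEL_DEFAULT
        return 0
    return -errno.EINVAL
```

In this model an emulator thread started through an opted-in format sees `MEM_MODEL_TSO` from the query call, a helper thread can call the set function with `MEM_MODEL_DEFAULT`, and any later attempt to request `MEM_MODEL_TSO` is refused.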
[adding Will back to the thread] On Thu, 02 May 2024 01:10:35 +0100, Zayd Qumsieh <zayd_qumsieh@apple.com> wrote: > > > On Fri, 19 Apr 2024 17:58:09 +0100, > > Will Deacon <will@kernel.org> wrote: > > > > > > On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote: > > > > On 2024/04/11 22:28, Will Deacon wrote: > > > > > * Some binaries in a distribution exhibit instability which goes away > > > > > in TSO mode, so a taskset-like program is used to run them with TSO > > > > > enabled. > > > > > > > > Since the flag is cleared on execve, this third one isn't generally > > > > possible as far as I know. > > > > > > Ah ok, I'd missed that. Thanks. > > > > > > > > In all these cases, we end up with native arm64 applications that will > > > > > either fail to load or will crash in subtle ways on CPUs without the TSO > > > > > feature. Assuming that the application cannot be fixed, a better > > > > > approach would be to recompile using stronger instructions (e.g. > > > > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's > > > > > true that some existing CPUs are TSO by design (this is a perfectly > > > > > valid implementation of the arm64 memory model), but I think there's a > > > > > big difference between quietly providing more ordering guarantees than > > > > > software may be relying on and providing a mechanism to discover, > > > > > request and ultimately rely upon the stronger behaviour. > > > > > > > > The problem is "just" using stronger instructions is much more > > > > expensive, as emulators have demonstrated. If TSO didn't serve a > > > > practical purpose I wouldn't be submitting this, but it does. 
This is > > > > basically non-negotiable for x86 emulation; if this is rejected > > > > upstream, it will forever live as a downstream patch used by the entire > > > > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very > > > > explicitly targeting, given our efforts with microVMs for 4K page size > > > > support and the upcoming Vulkan drivers). > > > > > > These microVMs sound quite interesting. What exactly are they? Are you > > > running them under KVM? > > > > > > Ignoring the mechanism for the time being, would it solve your problem > > > if you were able to run specific microVMs in TSO mode, or do you *really* > > > need the VM to have finer-grained control than that? If the whole VM is > > > running in TSO mode, then my concerns largely disappear, as that's > > > indistinguishable from running on a hardware implementation that happens > > > to be TSO. > > > > Since KVM has been mentioned a few times, I'll give my take on this. > > > > Since day 1, it was a conscious decision for KVM/arm64 to emulate the > > architecture, and only that -- this is complicated enough. Meaning > > that no implementation-defined features should be explicitly exposed > > to the guest. So I have no plan to expose any such feature for > > userspace to configure TSO or anything else of the sort. > > Agreed. We do not intend for TSO mode to be used extensively for EL1, the > intention is for TSO mode to be reserved for userspace applications that > request it. But that's the same thing for a hypervisor. For userspace in a VM to make use of any feature, it must be exposed to the VM as a whole by the host VMM (QEMU, kvmtool, whatever). Which means having a new userspace ABI, specific to KVM, exposing a feature for which there is no spec whatsoever. Even worse, you cannot discover whether the instruction you must use to context switch the ACTLR_EL1 register is implemented. Isn't that great? 
And I'm not even talking about the joys of migrating such a VM, because we have no clue what this bit means on other implementations. For all we know it causes another CPU to catch fire (or go PDP-endian, which is basically the same). Which is why my proposal is for this bit to be set statically for *all* VMs, and leave the kernel (and KVM) out of the picture altogether. At least that is something we can reason about (although someone would need to start thinking of how this particular TSO implementation composes with the relaxed memory ordering used outside of the VM and show that they actually lead to correct results for something such as virtio, for example). Thanks, M.
On 5/2/2024 at 3:25 PM, Marc Zyngier wrote:
> although
> someone would need to start thinking of how this particular TSO
> implementation composes with the relaxed memory ordering used outside
> of the VM and show that they actually lead to correct results for
> something such as virtio, for example

I used to think about this problem space. Composing some kinds of memory models (e.g., Arm and TSO) is easy; others are hard.

I don't know much about virtio, so this may show my naivety, but what complications could arise from virtio? Does the "visible behavior" of virtio change depending on the memory model of the machine it is running on?

At least internally inside virtio it should not cause any problems, since you are effectively adding some barriers inside some of the virtio threads (those that are running in the VM). But if the VM relies on virtio behaving in a "TSO manner" but its behavior is more relaxed on e.g. Arm, then that could cause issues.

have fun,
jonas
Eric Curtin <ecurtin@redhat.com> writes: > On Fri, 19 Apr 2024 at 18:08, Will Deacon <will@kernel.org> wrote: >> >> On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote: >> > On 2024/04/11 22:28, Will Deacon wrote: >> > > * Some binaries in a distribution exhibit instability which goes away >> > > in TSO mode, so a taskset-like program is used to run them with TSO >> > > enabled. >> > >> > Since the flag is cleared on execve, this third one isn't generally >> > possible as far as I know. >> >> Ah ok, I'd missed that. Thanks. >> >> > > In all these cases, we end up with native arm64 applications that will >> > > either fail to load or will crash in subtle ways on CPUs without the TSO >> > > feature. Assuming that the application cannot be fixed, a better >> > > approach would be to recompile using stronger instructions (e.g. >> > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's >> > > true that some existing CPUs are TSO by design (this is a perfectly >> > > valid implementation of the arm64 memory model), but I think there's a >> > > big difference between quietly providing more ordering guarantees than >> > > software may be relying on and providing a mechanism to discover, >> > > request and ultimately rely upon the stronger behaviour. >> > >> > The problem is "just" using stronger instructions is much more >> > expensive, as emulators have demonstrated. If TSO didn't serve a >> > practical purpose I wouldn't be submitting this, but it does. This is >> > basically non-negotiable for x86 emulation; if this is rejected >> > upstream, it will forever live as a downstream patch used by the entire >> > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very >> > explicitly targeting, given our efforts with microVMs for 4K page size >> > support and the upcoming Vulkan drivers). 
In addition to the use case Hector exposed here, there's another, potentially larger one, which is running x86_64 containers on aarch64 systems, using a combination of both virtualization and emulation.

In this scenario, both not being able to use TSO for emulation and having to enable it all the time for the whole VM have a very large impact on performance (~25% on some workloads).

I understand the concern about the risk of userspace fragmentation, but I was wondering if we could minimize it to an acceptable level by narrowing down the context. For instance, since both use cases we're bringing to the table imply the use of virtualization, we should be able to restrict PR_SET_MEM_MODEL to only be accepted when running at EL1 (and not in nVHE nor pKVM), returning EINVAL otherwise. This would heavily discourage users from relying on this feature for native applications that can run in arbitrary contexts, hence drastically reducing the fragmentation risk.

We would still need a way to ensure the trap gets to the VMM and for the VMM to operate on the impdef ACTLR_EL12, but that should be dealt with in a different series.

Thanks,
Sergio.
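The gating rule suggested in this message can be written down as a tiny Python model. This is only a sketch of one reading of "running on EL1 (and not in nVHE nor pKVM)", namely that the request is accepted only when the kernel handling it is a guest kernel. The context labels and function names are hypothetical and do not correspond to any kernel interface.

```python
import errno

# Hypothetical labels for where the kernel servicing the prctl is running.
GUEST = "guest"          # kernel at EL1 inside a VM
VHE_HOST = "vhe-host"    # host kernel running at EL2
NVHE_HOST = "nvhe-host"  # host kernel at EL1 with a hypervisor stub at EL2
PKVM_HOST = "pkvm-host"  # protected-KVM host kernel


def set_mem_model_allowed(context):
    """Assumed policy: only a guest kernel accepts the request."""
    return context == GUEST


def prctl_set_mem_model(context):
    """Return 0 if the request would be accepted, -EINVAL otherwise."""
    if set_mem_model_allowed(context):
        return 0
    return -errno.EINVAL
```

Under this reading a native application on a bare-metal host always gets `-EINVAL`, which is the property the message relies on to discourage native use.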
On Mon, 06 May 2024 12:21:40 +0100, Sergio Lopez Pascual <slp@redhat.com> wrote: > > Eric Curtin <ecurtin@redhat.com> writes: > > > On Fri, 19 Apr 2024 at 18:08, Will Deacon <will@kernel.org> wrote: > >> > >> On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote: > >> > On 2024/04/11 22:28, Will Deacon wrote: > >> > > * Some binaries in a distribution exhibit instability which goes away > >> > > in TSO mode, so a taskset-like program is used to run them with TSO > >> > > enabled. > >> > > >> > Since the flag is cleared on execve, this third one isn't generally > >> > possible as far as I know. > >> > >> Ah ok, I'd missed that. Thanks. > >> > >> > > In all these cases, we end up with native arm64 applications that will > >> > > either fail to load or will crash in subtle ways on CPUs without the TSO > >> > > feature. Assuming that the application cannot be fixed, a better > >> > > approach would be to recompile using stronger instructions (e.g. > >> > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's > >> > > true that some existing CPUs are TSO by design (this is a perfectly > >> > > valid implementation of the arm64 memory model), but I think there's a > >> > > big difference between quietly providing more ordering guarantees than > >> > > software may be relying on and providing a mechanism to discover, > >> > > request and ultimately rely upon the stronger behaviour. > >> > > >> > The problem is "just" using stronger instructions is much more > >> > expensive, as emulators have demonstrated. If TSO didn't serve a > >> > practical purpose I wouldn't be submitting this, but it does. 
This is > >> > basically non-negotiable for x86 emulation; if this is rejected > >> > upstream, it will forever live as a downstream patch used by the entire > >> > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very > >> > explicitly targeting, given our efforts with microVMs for 4K page size > >> > support and the upcoming Vulkan drivers). > > In addition to the use case Hector exposed here, there's another, > potentially larger one, which is running x86_64 containers on aarch64 > systems, using a combination of both Virtualization and emulation. > > In this scenario, both not being able to use TSO for emulation > and having to enable it all the time for the whole VM have a very large > impact on performance (~25% on some workloads). Well, there is always a price to pay somewhere, and this is the usual trade-off between performance and maintainability. > I understand the concern about the risk of userspace fragmentation, but > I was wondering if we could minimize it to an acceptable level by > narrowing down the context. For instance, since both use cases we're > bringing to the table imply the use of Virtualization, we should be able > to restrict PR_SET_MEM_MODEL to only be accepted when running on EL1 > (and not in nVHE nor pKVM), returning EINVAL otherwise. This would > heavily discourage users from relying on this feature for native > applications that can run on arbitrary contexts, hence drastically > reducing the fragmentation risk. As I explained in another sub-thread[1], I am not prepared to allow non architectural state to be exposed to a guest. I'm also not prepared to make significant ABI differences between VHE, nVHE, hVHE, with or without pKVM, because the job of the kernel is to abstract those differences. > We would still need a way to ensure the trap gets to the VMM and for > the VMM to operate on the impdef ACTLR_EL12, but that should be dealt on > a different series. 
The VMM can't use ACTLR_EL12, by the very definition of this register (the clue is in the name). You'd have to proxy the write in the kernel and context-switch it, which means adding non-architectural state to KVM, breaking VM migration and adding more kludges to the existing Apple-specific host crap. Also, let's realise that we are talking about making significant changes to the arm64 ABI for a platform that is still not fully supported in the upstream kernel. I have the feeling that changing the memory model dynamically may not be of the utmost priority until then. Thanks, M. [1] https://lore.kernel.org/all/867cgcqrb9.wl-maz@kernel.org
On Mon, 6 May 2024 at 17:13, Marc Zyngier <maz@kernel.org> wrote: > > On Mon, 06 May 2024 12:21:40 +0100, > Sergio Lopez Pascual <slp@redhat.com> wrote: > > > > Eric Curtin <ecurtin@redhat.com> writes: > > > > > On Fri, 19 Apr 2024 at 18:08, Will Deacon <will@kernel.org> wrote: > > >> > > >> On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote: > > >> > On 2024/04/11 22:28, Will Deacon wrote: > > >> > > * Some binaries in a distribution exhibit instability which goes away > > >> > > in TSO mode, so a taskset-like program is used to run them with TSO > > >> > > enabled. > > >> > > > >> > Since the flag is cleared on execve, this third one isn't generally > > >> > possible as far as I know. > > >> > > >> Ah ok, I'd missed that. Thanks. > > >> > > >> > > In all these cases, we end up with native arm64 applications that will > > >> > > either fail to load or will crash in subtle ways on CPUs without the TSO > > >> > > feature. Assuming that the application cannot be fixed, a better > > >> > > approach would be to recompile using stronger instructions (e.g. > > >> > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's > > >> > > true that some existing CPUs are TSO by design (this is a perfectly > > >> > > valid implementation of the arm64 memory model), but I think there's a > > >> > > big difference between quietly providing more ordering guarantees than > > >> > > software may be relying on and providing a mechanism to discover, > > >> > > request and ultimately rely upon the stronger behaviour. > > >> > > > >> > The problem is "just" using stronger instructions is much more > > >> > expensive, as emulators have demonstrated. If TSO didn't serve a > > >> > practical purpose I wouldn't be submitting this, but it does. 
This is > > >> > basically non-negotiable for x86 emulation; if this is rejected > > >> > upstream, it will forever live as a downstream patch used by the entire > > >> > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very > > >> > explicitly targeting, given our efforts with microVMs for 4K page size > > >> > support and the upcoming Vulkan drivers). > > > > In addition to the use case Hector exposed here, there's another, > > potentially larger one, which is running x86_64 containers on aarch64 > > systems, using a combination of both Virtualization and emulation. > > > > In this scenario, both not being able to use TSO for emulation > > and having to enable it all the time for the whole VM have a very large > > impact on performance (~25% on some workloads). > > Well, there is always a price to pay somewhere, and this is the usual > trade-off between performance and maintainability. > > > I understand the concern about the risk of userspace fragmentation, but > > I was wondering if we could minimize it to an acceptable level by > > narrowing down the context. For instance, since both use cases we're > > bringing to the table imply the use of Virtualization, we should be able > > to restrict PR_SET_MEM_MODEL to only be accepted when running on EL1 > > (and not in nVHE nor pKVM), returning EINVAL otherwise. This would > > heavily discourage users from relying on this feature for native > > applications that can run on arbitrary contexts, hence drastically > > reducing the fragmentation risk. > > As I explained in another sub-thread[1], I am not prepared to allow > non architectural state to be exposed to a guest. I'm also not > prepared to make significant ABI differences between VHE, nVHE, hVHE, > with or without pKVM, because the job of the kernel is to abstract > those differences. 
> > > We would still need a way to ensure the trap gets to the VMM and for > > the VMM to operate on the impdef ACTLR_EL12, but that should be dealt on > > a different series. > > The VMM can't use ACTLR_EL12, by the very definition of this register > (the clue is in the name). You'd have to proxy the write in the > kernel and context-switch it, which means adding non-architectural > state to KVM, breaking VM migration and adding more kludges to the > existing Apple-specific host crap. > > Also, let's realise that we are talking about making significant > changes to the arm64 ABI for a platform that is still not fully > supported in the upstream kernel. I have the feeling that changing the Note there's two use-cases for this today, bare-metal Linux on Apple Silicon devices and Linux VMs on macOS. The latter is fully supported in the upstream kernel. Apple Silicon devices have a significantly sized Linux userbase as there is a shortage of decent local ARM development machines for Linux as well as just being decent local laptop/desktop SoC's in general for AI. The general performance of the SoC makes it very useful. Is mise le meas/Regards, Eric Curtin > memory model dynamically may not be of the utmost priority until then. > > Thanks, > > M. > > [1] https://lore.kernel.org/all/867cgcqrb9.wl-maz@kernel.org > > -- > Without deviation from the norm, progress is not possible. >
Marc Zyngier <maz@kernel.org> writes: > On Mon, 06 May 2024 12:21:40 +0100, > Sergio Lopez Pascual <slp@redhat.com> wrote: >> >> Eric Curtin <ecurtin@redhat.com> writes: >> >> > On Fri, 19 Apr 2024 at 18:08, Will Deacon <will@kernel.org> wrote: >> >> >> >> On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote: >> >> > On 2024/04/11 22:28, Will Deacon wrote: >> >> > > * Some binaries in a distribution exhibit instability which goes away >> >> > > in TSO mode, so a taskset-like program is used to run them with TSO >> >> > > enabled. >> >> > >> >> > Since the flag is cleared on execve, this third one isn't generally >> >> > possible as far as I know. >> >> >> >> Ah ok, I'd missed that. Thanks. >> >> >> >> > > In all these cases, we end up with native arm64 applications that will >> >> > > either fail to load or will crash in subtle ways on CPUs without the TSO >> >> > > feature. Assuming that the application cannot be fixed, a better >> >> > > approach would be to recompile using stronger instructions (e.g. >> >> > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's >> >> > > true that some existing CPUs are TSO by design (this is a perfectly >> >> > > valid implementation of the arm64 memory model), but I think there's a >> >> > > big difference between quietly providing more ordering guarantees than >> >> > > software may be relying on and providing a mechanism to discover, >> >> > > request and ultimately rely upon the stronger behaviour. >> >> > >> >> > The problem is "just" using stronger instructions is much more >> >> > expensive, as emulators have demonstrated. If TSO didn't serve a >> >> > practical purpose I wouldn't be submitting this, but it does. 
This is >> >> > basically non-negotiable for x86 emulation; if this is rejected >> >> > upstream, it will forever live as a downstream patch used by the entire >> >> > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very >> >> > explicitly targeting, given our efforts with microVMs for 4K page size >> >> > support and the upcoming Vulkan drivers). >> >> In addition to the use case Hector exposed here, there's another, >> potentially larger one, which is running x86_64 containers on aarch64 >> systems, using a combination of both Virtualization and emulation. >> >> In this scenario, both not being able to use TSO for emulation >> and having to enable it all the time for the whole VM have a very large >> impact on performance (~25% on some workloads). > > Well, there is always a price to pay somewhere, and this is the usual > trade-off between performance and maintainability. Yes, and given that the impact on performance is so big, I honestly think it's worth exploring a bit if there's an option that could keep the maintenance cost at an acceptable level. >> I understand the concern about the risk of userspace fragmentation, but >> I was wondering if we could minimize it to an acceptable level by >> narrowing down the context. For instance, since both use cases we're >> bringing to the table imply the use of Virtualization, we should be able >> to restrict PR_SET_MEM_MODEL to only be accepted when running on EL1 >> (and not in nVHE nor pKVM), returning EINVAL otherwise. This would >> heavily discourage users from relying on this feature for native >> applications that can run on arbitrary contexts, hence drastically >> reducing the fragmentation risk. > > As I explained in another sub-thread[1], I am not prepared to allow > non architectural state to be exposed to a guest. I'm also not > prepared to make significant ABI differences between VHE, nVHE, hVHE, > with or without pKVM, because the job of the kernel is to abstract > those differences. 
I understand, makes sense. >> We would still need a way to ensure the trap gets to the VMM and for >> the VMM to operate on the impdef ACTLR_EL12, but that should be dealt with in >> a different series. > > The VMM can't use ACTLR_EL12, by the very definition of this register > (the clue is in the name). You'd have to proxy the write in the > kernel and context-switch it, which means adding non-architectural > state to KVM, breaking VM migration and adding more kludges to the > existing Apple-specific host crap. I know, I just didn't want to go into details here, because this series is not touching any of that. But since we're already there, I'd like to ask you, do you think it'd be possible and reasonable to deal with IMPDEF registers outside of KVM, from a platform-specific module, treating it like a paravirt feature? In fact, if that would be acceptable, what if we treated this whole feature as a platform-specific knob leaving both the ARM64 ABI and KVM (mostly) aside? I'm thinking of something along the lines of this: - Host side: * Having vcpu load/put calling into some platform-specific module that would be in charge of keeping track of the desired state for a particular context and adjusting ACTLR_EL12 as needed, relieving KVM from this task and avoiding polluting its structs with non-architectural state. * Either having a kernel handler for the TACR trap that would call into the platform-specific module, or allowing the VMM to request the kernel to exit to it when that trap is triggered. The latter would also require the module to expose a device node with an ioctl interface (independent from KVM's) for the VMM to request the desired TSO strategy for a particular thread. * An alternative to the previous point could be enabling the VMM to be able to request KVM to start a VM with HCR_EL2.TACR = 0. 
This one would be way cheaper in CPU time, and would simplify the platform-specific module's job to just save/restore ACTLR_EL12 for that context, but I guess it could potentially introduce some undesired variance between VM configurations. I'm honestly open to both options; please let me know if you find one to be better for KVM. - Guest side: * Wiring __switch_to() to also call the platform-specific module. Akin to what happens with KVM, this one would be in charge of keeping track of the threads that want TSO enabled, adjusting ACTLR_EL1 accordingly. * Having the platform-specific module expose a device node with an ioctl interface for userspace applications to request TSO to be enabled for the current thread. I think an approach like this would address the ARM64 userspace fragmentation concerns, relieve KVM from carrying a platform-specific burden and reduce the maintenance costs to a reasonable level. WDYT? > Also, let's realise that we are talking about making significant > changes to the arm64 ABI for a platform that is still not fully > supported in the upstream kernel. I have the feeling that changing the > memory model dynamically may not be of the utmost priority until then. Please note this feature will also be used by Linux running in a VM on macOS under Hypervisor.framework, so Asahi isn't the only platform. This significantly raises the number of users who could benefit from emulators being able to operate the TSO knob. Thanks, Sergio.
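The guest-side half of the proposal above (a module that remembers which threads asked for TSO and fixes up ACTLR_EL1 on every switch) can be illustrated with a short Python model. It is a sketch under stated assumptions, not driver code: `TsoModule`, `switch_to`, `ioctl_set_tso` and the bit position in `ACTLR_TSO_BIT` are all hypothetical, and the real register layout is implementation defined.

```python
# Hypothetical bit position; the real ACTLR_EL1 layout is IMPDEF.
ACTLR_TSO_BIT = 1 << 1


class TsoModule:
    """Model of a platform-specific module tracking per-thread TSO state."""

    def __init__(self):
        self.wants_tso = set()  # thread ids that requested TSO
        self.current = None     # thread currently running on this CPU
        self.actlr_el1 = 0      # stand-in for the CPU's ACTLR_EL1 value

    def _apply(self):
        # Make the register match what the current thread asked for.
        if self.current in self.wants_tso:
            self.actlr_el1 |= ACTLR_TSO_BIT
        else:
            self.actlr_el1 &= ~ACTLR_TSO_BIT

    def switch_to(self, tid):
        """Hook called from the context-switch path."""
        self.current = tid
        self._apply()

    def ioctl_set_tso(self, enable):
        """Device-node ioctl: set or clear TSO for the current thread."""
        if enable:
            self.wants_tso.add(self.current)
        else:
            self.wants_tso.discard(self.current)
        self._apply()
```

The point of the model is that the TSO request lives in the module's own bookkeeping rather than in generic task or KVM structures, which is the separation the proposal is after.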
Will Deacon <will@kernel.org> writes: > Hi Hector, > > On Thu, Apr 11, 2024 at 09:51:19AM +0900, Hector Martin wrote: >> x86 CPUs implement a stricter memory model than ARM64 (TSO). For this >> reason, x86 emulation on baseline ARM64 systems requires very expensive >> memory model emulation. Having hardware that supports this natively is >> therefore very attractive. Such hardware, in fact, exists. This series >> adds support for userspace to identify when TSO is available and >> toggle it on, if supported. > > I'm probably going to make myself hugely unpopular here, but I have a > strong objection to this patch series as it stands. I firmly believe > that providing a prctl() to query and toggle the memory model to/from > TSO is going to lead to subtle fragmentation of arm64 Linux userspace. > > It's not difficult to envisage this TSO switch being abused for native > arm64 applications: > > * A program no longer crashes when TSO is enabled, so the developer > just toggles TSO to meet a deadline. > > * Some legacy x86 sources are being ported to arm64 but concurrency > is hard so the developer just enables TSO to (mostly) avoid thinking > about it. > > * Some binaries in a distribution exhibit instability which goes away > in TSO mode, so a taskset-like program is used to run them with TSO > enabled. These all just seem like cases of engineers hiding from their very real problems. I don't know if it's really the kernel's place to avoid giving them the foot gun. Would it assuage your concerns at all if we set a taint flag so bug reports/core dumps indicated we were in a non-architectural memory mode? > In all these cases, we end up with native arm64 applications that will > either fail to load or will crash in subtle ways on CPUs without the TSO > feature. Assuming that the application cannot be fixed, a better > approach would be to recompile using stronger instructions (e.g. > LDAR/STLR) so that at least the resulting binary is portable. 
Now, it's > true that some existing CPUs are TSO by design (this is a perfectly > valid implementation of the arm64 memory model), but I think there's a > big difference between quietly providing more ordering guarantees than > software may be relying on and providing a mechanism to discover, > request and ultimately rely upon the stronger behaviour. I think the main use case here is for emulation. When we run x86-on-arm in QEMU we do currently insert lots of extra barrier instructions on every load and store. If we can probe and set a TSO mode I can assure you we'll do the right thing ;-)
On Tue, 7 May 2024 at 12:24, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Will Deacon <will@kernel.org> writes:
>
> > Hi Hector,
> >
> > On Thu, Apr 11, 2024 at 09:51:19AM +0900, Hector Martin wrote:
> >> x86 CPUs implement a stricter memory model than ARM64 (TSO). For this
> >> reason, x86 emulation on baseline ARM64 systems requires very expensive
> >> memory model emulation. Having hardware that supports this natively is
> >> therefore very attractive. Such hardware, in fact, exists. This series
> >> adds support for userspace to identify when TSO is available and
> >> toggle it on, if supported.
> >
> > I'm probably going to make myself hugely unpopular here, but I have a
> > strong objection to this patch series as it stands. I firmly believe
> > that providing a prctl() to query and toggle the memory model to/from
> > TSO is going to lead to subtle fragmentation of arm64 Linux userspace.
> >
> > It's not difficult to envisage this TSO switch being abused for native
> > arm64 applications:
> >
> > * A program no longer crashes when TSO is enabled, so the developer
> >   just toggles TSO to meet a deadline.
> >
> > * Some legacy x86 sources are being ported to arm64 but concurrency
> >   is hard so the developer just enables TSO to (mostly) avoid thinking
> >   about it.
> >
> > * Some binaries in a distribution exhibit instability which goes away
> >   in TSO mode, so a taskset-like program is used to run them with TSO
> >   enabled.
>
> These all just seem like cases of engineers hiding from their very real
> problems. I don't know if it's really the kernel's place to avoid giving
> them the foot gun. Would it assuage your concerns at all if we set a
> taint flag so bug reports/core dumps indicated we were in a
> non-architectural memory mode?
>
> > In all these cases, we end up with native arm64 applications that will
> > either fail to load or will crash in subtle ways on CPUs without the TSO
> > feature. Assuming that the application cannot be fixed, a better
> > approach would be to recompile using stronger instructions (e.g.
> > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> > true that some existing CPUs are TSO by design (this is a perfectly
> > valid implementation of the arm64 memory model), but I think there's a
> > big difference between quietly providing more ordering guarantees than
> > software may be relying on and providing a mechanism to discover,
> > request and ultimately rely upon the stronger behaviour.
>
> I think the main use case here is for emulation. When we run x86-on-arm
> in QEMU we do currently insert lots of extra barrier instructions on
> every load and store. If we can probe and set a TSO mode I can assure
> you we'll do the right thing ;-)
>

Without a public specification of what TSO mode actually entails,
deciding which of those barriers can be dropped is not going to be as
straightforward as you make it out to be.

Apple's TSO mode is vertically integrated with Rosetta, which means
that TSO mode provides whatever Rosetta needs to run x86 code
correctly, and that it could mean different things on different
generations of the micro-architecture. And whether Apple's TSO is the
same as Fujitsu's is anyone's guess afaik.

Running a game and seeing it perform better is great, but it is not
the kind of rigor we usually attempt to apply when adding support for
architectural features. Hopefully, there will be some architectural
support for this in the future, but without any spec that defines the
memory model it implements, I am not convinced we should merge this.
On Tue, May 07, 2024 at 04:52:30PM +0200, Ard Biesheuvel wrote:
> On Tue, 7 May 2024 at 12:24, Alex Bennée <alex.bennee@linaro.org> wrote:
> > I think the main use case here is for emulation. When we run x86-on-arm
> > in QEMU we do currently insert lots of extra barrier instructions on
> > every load and store. If we can probe and set a TSO mode I can assure
> > you we'll do the right thing ;-)
>
> Without a public specification of what TSO mode actually entails,
> deciding which of those barriers can be dropped is not going to be as
> straight-forward as you make it out to be.
>
> Apple's TSO mode is vertically integrated with Rosetta, which means
> that TSO mode provides whatever Rosetta needs to run x86 code
> correctly, and that it could mean different things on different
> generations of the micro-architecture. And whether Apple's TSO is the
> same as Fujitsu's is anyone's guess afaik.

Indeed. Apart from using impdef registers, that's what I think is the
second biggest problem with this feature (and the corresponding
patches). We don't know the precise memory model, and we can't tell
whether this TSO bit is stored in the TLB. If it is, is it per
ASID/VMID? The other problem Marc raised is what the memory model is
between two CPUs where only one has the TSO bit set. Does it only break
the TSO model or is there a chance that it also breaks the default
relaxed model? What other TSO flavours are out there, and how do they
compare with the Apple one?

> Running a game and seeing it perform better is great, but it is not
> the kind of rigor we usually attempt to apply when adding support for
> architectural features. Hopefully, there will be some architectural
> support for this in the future, but without any spec that defines the
> memory model it implements, I am not convinced we should merge this.

There is FEAT_LRCPC (available on Apple Silicon from M2 onwards). Rather
than having a big knob to turn TSO on or off, this feature introduces
instructions that permit a code generator to get the TSO semantics in a
more efficient way (e.g. using LDAPR+STLR instead of the stricter
LDAR+STLR; not sure how well these are implemented on the Apple
Silicon). There are further improvements in FEAT_LRCPC{2,3} (with the
latter adding support for SIMD but not available in hardware yet). So
the direction from Arm is pretty clear, acknowledging that there is a
need for such TSO emulation but not in the way of undocumented impdef
registers. Whether more is needed here, I guess people working on
emulators could reach out to Arm or CPU vendors with suggestions (the
path to the architects is not straightforward, usually legal has a say,
but it's doable, there are formal channels already).

I see the impdef hardware TSO options as temporary until CPU
implementations catch up to architected FEAT_LRCPC*. Given the problems
already stated in this thread, I think such hacks should be carried
downstream and (hopefully) will eventually vanish. Maybe those TSO knobs
currently make an emulation faster than FEAT_LRCPC* but that's feedback
to go to the microarchitects on the implementation (or architects on
what other instructions should be covered).
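As a small illustration of the LDAPR+STLR versus LDAR+STLR point, the portable way to ask for this ordering from C is with C11 acquire/release atomics. This is only a sketch with made-up names (payload, ready, publish(), try_consume()). On arm64, compilers typically lower the release store to STLR and the acquire load to LDAR, and some can select LDAPR instead when building for a target that has FEAT_LRCPC; the exact instruction selection depends on the compiler and flags.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical message-passing pair; the names are illustrative only. */
static int payload;
static atomic_bool ready;

/* Producer: the release store orders the payload write before the flag. */
static void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer: the acquire load orders the payload read after the flag. */
static bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

The resulting binary relies only on architected ordering, so it behaves the same whether or not the CPU happens to be TSO.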
On Thu, May 9, 2024 at 5:13 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Tue, May 07, 2024 at 04:52:30PM +0200, Ard Biesheuvel wrote:
> > On Tue, 7 May 2024 at 12:24, Alex Bennée <alex.bennee@linaro.org> wrote:
> > > I think the main use case here is for emulation. When we run x86-on-arm
> > > in QEMU we do currently insert lots of extra barrier instructions on
> > > every load and store. If we can probe and set a TSO mode I can assure
> > > you we'll do the right thing ;-)
> >
> > Without a public specification of what TSO mode actually entails,
> > deciding which of those barriers can be dropped is not going to be as
> > straight-forward as you make it out to be.
> >
> > Apple's TSO mode is vertically integrated with Rosetta, which means
> > that TSO mode provides whatever Rosetta needs to run x86 code
> > correctly, and that it could mean different things on different
> > generations of the micro-architecture. And whether Apple's TSO is the
> > same as Fujitsu's is anyone's guess afaik.
>
> Indeed. Apart from using impdef registers, that's what I think is the
> second biggest problem with this feature (and the corresponding
> patches). We don't know the precise memory model, we can't tell whether
> this TSO bit is stored in the TLB. If it is, is it per ASID/VMID? The
> other problem Marc raised is what memory model is between two CPUs where
> only one has the TSO bit set? Does it only break the TSO model or is
> there a chance that it also breaks the default relaxed model? What other
> TSO flavours are out there, how do they compare with the Apple one?
>
> > Running a game and seeing it perform better is great, but it is not
> > the kind of rigor we usually attempt to apply when adding support for
> > architectural features. Hopefully, there will be some architectural
> > support for this in the future, but without any spec that defines the
> > memory model it implements, I am not convinced we should merge this.
>
> There is FEAT_LRCPC (available on Apple Silicon from M2 onwards). Rather
> than having a big knob to turn TSO on or off, this feature introduces
> instructions that permit a code generator to get the TSO semantics in a
> more efficient way (e.g. using LDAPR+STLR instead of the stricter
> LDAR+STLR; not sure how well these are implemented on the Apple
> Silicon). There are further improvements in FEAT_LRCPC{2,3} (with the
> latter adding support for SIMD but not available in hardware yet). So
> the direction from Arm is pretty clear, acknowledging that there is a
> need for such TSO emulation but not in the way of undocumented impdef
> registers. Whether more is needed here, I guess people working on
> emulators could reach out to Arm or CPU vendors with suggestions (the
> path to the architects is not straightforward, usually legal has a say,
> but it's doable, there are formal channels already).
>
> I see the impdef hardware TSO options as temporary until CPU
> implementations catch up to architected FEAT_LRCPC*. Given the problems
> already stated in this thread, I think such hacks should be carried
> downstream and (hopefully) will eventually vanish. Maybe those TSO knobs
> currently make an emulation faster than FEAT_LRCPC* but that's feedback
> to go to the microarchitects on the implementation (or architects on
> what other instructions should be covered).
>

They cannot ever "vanish" because we are supporting every Mx platform
back to the first one. The M1 series will never have FEAT_LRCPC.

I do not think it is unreasonable to support this method when we know
what the CPU platform is and FEAT_LRCPC does not exist.

-- 
真実はいつも一つ!/ Always, there's only one truth!
On Thu, May 09, 2024 at 06:31:04AM -0600, Neal Gompa wrote:
> On Thu, May 9, 2024 at 5:13 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > I see the impdef hardware TSO options as temporary until CPU
> > implementations catch up to architected FEAT_LRCPC*. Given the problems
> > already stated in this thread, I think such hacks should be carried
> > downstream and (hopefully) will eventually vanish. Maybe those TSO knobs
> > currently make an emulation faster than FEAT_LRCPC* but that's feedback
> > to go to the microarchitects on the implementation (or architects on
> > what other instructions should be covered).
>
> They cannot ever "vanish" because we are supporting every Mx platform
> back to the first one. The M1 series will never have FEAT_LRCPC.

Well, you missed "eventually". It depends on the timeline you have in
mind but, say, 15 years from now there may not be enough M1s around to
make it worth maintaining these patches out-of-tree (and they don't make
sense in-tree either because of the lack of standardisation).

> I do not think it is unreasonable to support this method when we know
> what the CPU platform is and FEAT_LRCPC does not exist.

If you want a portable emulator, you'd better start supporting
FEAT_LRCPC* (I think FEX does this), ideally detected at run-time with a
fallback to RCsc. Whether, additionally, you want to support the
non-portable Apple TSO with out-of-tree patches, it's up to you.
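For what it's worth, run-time detection with an RCsc fallback could look roughly like the sketch below. It assumes a Linux arm64 userspace where the kernel advertises FEAT_LRCPC and FEAT_LRCPC2 through the HWCAP_LRCPC and HWCAP_ILRCPC bits of AT_HWCAP. The enum and function names are hypothetical and not taken from FEX or QEMU. On any other platform the sketch simply reports the RCsc baseline.

```c
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>
#include <asm/hwcap.h>
#endif

/* Hypothetical strategy names for an emulator's load/store lowering. */
enum ordering_strategy {
    ORDER_RCSC,   /* LDAR + STLR: baseline, always available */
    ORDER_RCPC,   /* LDAPR + STLR: needs FEAT_LRCPC */
    ORDER_RCPC2,  /* adds LDAPUR/STLUR: needs FEAT_LRCPC2 */
};

/* Picks the weakest-cost lowering the running CPU is reported to support. */
static enum ordering_strategy pick_ordering_strategy(void)
{
#if defined(__linux__) && defined(__aarch64__)
    unsigned long hwcap = getauxval(AT_HWCAP);
#ifdef HWCAP_ILRCPC
    if (hwcap & HWCAP_ILRCPC)
        return ORDER_RCPC2;
#endif
#ifdef HWCAP_LRCPC
    if (hwcap & HWCAP_LRCPC)
        return ORDER_RCPC;
#endif
#endif
    /* Fallback: sequentially consistent acquire/release (RCsc). */
    return ORDER_RCSC;
}

/* Human-readable name, e.g. for a start-up log line. */
static const char *strategy_name(enum ordering_strategy s)
{
    switch (s) {
    case ORDER_RCPC2: return "LRCPC2";
    case ORDER_RCPC:  return "LRCPC";
    default:          return "RCsc";
    }
}
```

An emulator would call pick_ordering_strategy() once at start-up and have its code generator choose instructions accordingly, so the same binary runs on M1-class and newer cores.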
x86 CPUs implement a stricter memory model than ARM64 (TSO). For this
reason, x86 emulation on baseline ARM64 systems requires very expensive
memory model emulation. Having hardware that supports this natively is
therefore very attractive. Such hardware, in fact, exists. This series
adds support for userspace to identify when TSO is available and
toggle it on, if supported.

Some ARM64 CPUs intrinsically implement the TSO memory model, while
others expose it as an IMPDEF control. Apple Silicon SoCs are in the
latter category. Using TSO for x86 emulation on chips that support it
has been shown to provide a massive performance boost [1].

Patch 1 introduces the PR_{SET,GET}_MEM_MODEL userspace control, which
is initially not implemented for any architectures.

Patch 2 implements it for CPUs which are known, to the best of my
knowledge, to always implement the TSO memory model unconditionally.
This uses the cpufeature mechanism to only enable this if *all* cores in
the system meet the requirements.

Patch 3 adds the scaffolding necessary to save/restore the ACTLR_EL1
register across context switches. This register contains IMPDEF flags
related to CPU execution, and on Apple CPUs this is where the runtime
TSO toggle bit is implemented. Other CPUs could conceivably benefit from
this scaffolding if they also use ACTLR_EL1 for things that could
ostensibly be runtime controlled and context-switched. For this to work,
ACTLR_EL1 must have a uniform layout across all cores in the system.

Finally, patch 4 implements PR_{SET,GET}_MEM_MODEL for Apple CPUs by
hooking it up to flip the appropriate ACTLR_EL1 bit when the Apple TSO
feature is detected (on all CPUs, which also implies the uniform
ACTLR_EL1 layout).

This series has been brewing in the downstream Asahi Linux tree for a
while now, and ships to thousands of users. A subset have been using it
with FEX-Emu, which already supports this feature. This rebase on
v6.9-rc1 is only build-tested (all intermediate commits with and without
the config enabled, on ARM64) but I'll update the downstream branch soon
with this version and get it pushed out to users/testers.

The Apple support works on bare metal and *should* work exactly the same
way on macOS VMs (as alluded to by Zayd in his independent submission
[3]), though I haven't personally verified this. KVM support for this is
left for a future patchset.

(Apologies for the large Cc: list; I want to make sure nobody who got
Cced on Zayd's alternate take is left out of this one.)

[1] https://fex-emu.com/FEX-2306/
[2] https://github.com/AsahiLinux/linux/tree/bits/220-tso
[3] https://lore.kernel.org/lkml/20240410211652.16640-1-zayd_qumsieh@apple.com/

To: Catalin Marinas <catalin.marinas@arm.com>
To: Will Deacon <will@kernel.org>
To: Marc Zyngier <maz@kernel.org>
To: Mark Rutland <mark.rutland@arm.com>
Cc: Zayd Qumsieh <zayd_qumsieh@apple.com>
Cc: Justin Lu <ih_justin@apple.com>
Cc: Ryan Houdek <Houdek.Ryan@fex-emu.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Miguel Luis <miguel.luis@oracle.com>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: Christoph Paasch <cpaasch@apple.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Joel Granados <j.granados@samsung.com>
Cc: Dawei Li <dawei.li@shingroup.cn>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Florent Revest <revest@chromium.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Stefan Roesch <shr@devkernel.io>
Cc: Andy Chiu <andy.chiu@sifive.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Zev Weiss <zev@bewilderbeest.net>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: Asahi Linux <asahi@lists.linux.dev>
Signed-off-by: Hector Martin <marcan@marcan.st>
---
Hector Martin (4):
      prctl: Introduce PR_{SET,GET}_MEM_MODEL
      arm64: Implement PR_{GET,SET}_MEM_MODEL for always-TSO CPUs
      arm64: Introduce scaffolding to add ACTLR_EL1 to thread state
      arm64: Implement Apple IMPDEF TSO memory model control

 arch/arm64/Kconfig                        | 14 ++++++
 arch/arm64/include/asm/apple_cpufeature.h | 15 +++++++
 arch/arm64/include/asm/cpufeature.h       | 10 +++++
 arch/arm64/include/asm/processor.h        |  3 ++
 arch/arm64/kernel/Makefile                |  3 +-
 arch/arm64/kernel/cpufeature.c            | 11 ++---
 arch/arm64/kernel/cpufeature_impdef.c     | 61 ++++++++++++++++++++++++++
 arch/arm64/kernel/process.c               | 71 +++++++++++++++++++++++++++++++
 arch/arm64/kernel/setup.c                 |  8 ++++
 arch/arm64/tools/cpucaps                  |  2 +
 include/linux/memory_ordering_model.h     | 11 +++++
 include/uapi/linux/prctl.h                |  5 +++
 kernel/sys.c                              | 21 +++++++++
 13 files changed, 229 insertions(+), 6 deletions(-)
---
base-commit: 4cece764965020c22cff7665b18a012006359095
change-id: 20240411-tso-e86fdceb94b8

Best regards,