[0/4] arm64: Support the TSO memory model

Message ID 20240411-tso-v1-0-754f11abfbff@marcan.st (mailing list archive)

Message

Hector Martin April 11, 2024, 12:51 a.m. UTC
x86 CPUs implement a stricter memory model (TSO) than ARM64. For this
reason, x86 emulation on baseline ARM64 systems requires very expensive
memory model emulation. Having hardware that supports this natively is
therefore very attractive. Such hardware, in fact, exists. This series
adds support for userspace to identify when TSO is available and
toggle it on, if supported.

Some ARM64 CPUs intrinsically implement the TSO memory model, while
others expose it as an IMPDEF control. Apple Silicon SoCs are in the
latter category. Using TSO for x86 emulation on chips that support it
has been shown to provide a massive performance boost [1].

Patch 1 introduces the PR_{SET,GET}_MEM_MODEL userspace control, which
is initially not implemented for any architectures.

Patch 2 implements it for CPUs which are known, to the best of my
knowledge, to always implement the TSO memory model unconditionally.
This uses the cpufeature mechanism to only enable this if *all* cores in
the system meet the requirements.

Patch 3 adds the scaffolding necessary to save/restore the ACTLR_EL1
register across context switches. This register contains IMPDEF flags
related to CPU execution, and on Apple CPUs this is where the runtime
TSO toggle bit is implemented. Other CPUs could conceivably benefit from
this scaffolding if they also use ACTLR_EL1 for things that could
ostensibly be runtime controlled and context-switched. For this to work,
ACTLR_EL1 must have a uniform layout across all cores in the system.

Finally, patch 4 implements PR_{SET,GET}_MEM_MODEL for Apple CPUs by
hooking it up to flip the appropriate ACTLR_EL1 bit when the Apple TSO
feature is detected (on all CPUs, which also implies the uniform
ACTLR_EL1 layout).
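
As a rough illustration of how an emulator might drive the control
described above, here is a minimal userspace sketch. The numeric values
of the constants are placeholders assumed for illustration (they are not
listed in this cover letter); the authoritative definitions are in the
series' include/uapi/linux/prctl.h. The argument convention (model in
the second argument, remaining arguments zero) and the helper name
try_enable_tso() are likewise assumptions.

```c
#include <sys/prctl.h>

/*
 * Placeholder definitions, assumed for this sketch only. They are not
 * part of mainline <linux/prctl.h>; use the values from the patched
 * uapi header when building against a kernel carrying this series.
 */
#ifndef PR_SET_MEM_MODEL
#define PR_SET_MEM_MODEL 0x4d4d444c
#endif
#ifndef PR_SET_MEM_MODEL_DEFAULT
#define PR_SET_MEM_MODEL_DEFAULT 0
#endif
#ifndef PR_SET_MEM_MODEL_TSO
#define PR_SET_MEM_MODEL_TSO 1
#endif

/*
 * Hypothetical helper: ask the kernel to switch the caller to TSO.
 * Returns 1 if the request succeeded, 0 otherwise (for example, a
 * kernel without this series rejects the unknown prctl option).
 */
static int try_enable_tso(void)
{
    long ret = prctl(PR_SET_MEM_MODEL,
                     (unsigned long)PR_SET_MEM_MODEL_TSO,
                     0UL, 0UL, 0UL);

    return ret == 0 ? 1 : 0;
}
```

An emulator would call try_enable_tso() once at startup and fall back to
emitting explicitly ordered accesses whenever it returns 0.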

This series has been brewing in the downstream Asahi Linux tree for a
while now, and ships to thousands of users. A subset have been using it
with FEX-Emu, which already supports this feature. This rebase on
v6.9-rc1 is only build-tested (all intermediate commits with and without
the config enabled, on ARM64) but I'll update the downstream branch soon
with this version and get it pushed out to users/testers.

The Apple support works on bare metal and *should* work exactly the same
way on macOS VMs (as alluded to by Zayd in his independent submission [3]),
though I haven't personally verified this. KVM support for this is left
for a future patchset.

(Apologies for the large Cc: list; I want to make sure nobody who got
Cced on Zayd's alternate take is left out of this one.) 

[1] https://fex-emu.com/FEX-2306/
[2] https://github.com/AsahiLinux/linux/tree/bits/220-tso
[3] https://lore.kernel.org/lkml/20240410211652.16640-1-zayd_qumsieh@apple.com/

To: Catalin Marinas <catalin.marinas@arm.com>
To: Will Deacon <will@kernel.org>
To: Marc Zyngier <maz@kernel.org>
To: Mark Rutland <mark.rutland@arm.com>
Cc: Zayd Qumsieh <zayd_qumsieh@apple.com>
Cc: Justin Lu <ih_justin@apple.com>
Cc: Ryan Houdek <Houdek.Ryan@fex-emu.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Miguel Luis <miguel.luis@oracle.com>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: Christoph Paasch <cpaasch@apple.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Joel Granados <j.granados@samsung.com>
Cc: Dawei Li <dawei.li@shingroup.cn>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Florent Revest <revest@chromium.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Stefan Roesch <shr@devkernel.io>
Cc: Andy Chiu <andy.chiu@sifive.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Zev Weiss <zev@bewilderbeest.net>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: Asahi Linux <asahi@lists.linux.dev>

Signed-off-by: Hector Martin <marcan@marcan.st>
---
Hector Martin (4):
      prctl: Introduce PR_{SET,GET}_MEM_MODEL
      arm64: Implement PR_{GET,SET}_MEM_MODEL for always-TSO CPUs
      arm64: Introduce scaffolding to add ACTLR_EL1 to thread state
      arm64: Implement Apple IMPDEF TSO memory model control

 arch/arm64/Kconfig                        | 14 ++++++
 arch/arm64/include/asm/apple_cpufeature.h | 15 +++++++
 arch/arm64/include/asm/cpufeature.h       | 10 +++++
 arch/arm64/include/asm/processor.h        |  3 ++
 arch/arm64/kernel/Makefile                |  3 +-
 arch/arm64/kernel/cpufeature.c            | 11 ++---
 arch/arm64/kernel/cpufeature_impdef.c     | 61 ++++++++++++++++++++++++++
 arch/arm64/kernel/process.c               | 71 +++++++++++++++++++++++++++++++
 arch/arm64/kernel/setup.c                 |  8 ++++
 arch/arm64/tools/cpucaps                  |  2 +
 include/linux/memory_ordering_model.h     | 11 +++++
 include/uapi/linux/prctl.h                |  5 +++
 kernel/sys.c                              | 21 +++++++++
 13 files changed, 229 insertions(+), 6 deletions(-)
---
base-commit: 4cece764965020c22cff7665b18a012006359095
change-id: 20240411-tso-e86fdceb94b8

Best regards,

Comments

Neal Gompa April 11, 2024, 1:37 a.m. UTC | #1
On Wed, Apr 10, 2024 at 8:51 PM Hector Martin <marcan@marcan.st> wrote:
>
> x86 CPUs implement a stricter memory model (TSO) than ARM64. For this
> reason, x86 emulation on baseline ARM64 systems requires very expensive
> memory model emulation. Having hardware that supports this natively is
> therefore very attractive. Such hardware, in fact, exists. This series
> adds support for userspace to identify when TSO is available and
> toggle it on, if supported.
>
> Some ARM64 CPUs intrinsically implement the TSO memory model, while
> others expose it as an IMPDEF control. Apple Silicon SoCs are in the
> latter category. Using TSO for x86 emulation on chips that support it
> has been shown to provide a massive performance boost [1].
>
> Patch 1 introduces the PR_{SET,GET}_MEM_MODEL userspace control, which
> is initially not implemented for any architectures.
>
> Patch 2 implements it for CPUs which are known, to the best of my
> knowledge, to always implement the TSO memory model unconditionally.
> This uses the cpufeature mechanism to only enable this if *all* cores in
> the system meet the requirements.
>
> Patch 3 adds the scaffolding necessary to save/restore the ACTLR_EL1
> register across context switches. This register contains IMPDEF flags
> related to CPU execution, and on Apple CPUs this is where the runtime
> TSO toggle bit is implemented. Other CPUs could conceivably benefit from
> this scaffolding if they also use ACTLR_EL1 for things that could
> ostensibly be runtime controlled and context-switched. For this to work,
> ACTLR_EL1 must have a uniform layout across all cores in the system.
>
> Finally, patch 4 implements PR_{SET,GET}_MEM_MODEL for Apple CPUs by
> hooking it up to flip the appropriate ACTLR_EL1 bit when the Apple TSO
> feature is detected (on all CPUs, which also implies the uniform
> ACTLR_EL1 layout).
>
> This series has been brewing in the downstream Asahi Linux tree for a
> while now, and ships to thousands of users. A subset have been using it
> with FEX-Emu, which already supports this feature. This rebase on
> v6.9-rc1 is only build-tested (all intermediate commits with and without
> the config enabled, on ARM64) but I'll update the downstream branch soon
> with this version and get it pushed out to users/testers.
>
> The Apple support works on bare metal and *should* work exactly the same
> way on macOS VMs (as alluded to by Zayd in his independent submission [3]),
> though I haven't personally verified this. KVM support for this is left
> for a future patchset.
>
> (Apologies for the large Cc: list; I want to make sure nobody who got
> Cced on Zayd's alternate take is left out of this one.)
>
> [1] https://fex-emu.com/FEX-2306/
> [2] https://github.com/AsahiLinux/linux/tree/bits/220-tso
> [3] https://lore.kernel.org/lkml/20240410211652.16640-1-zayd_qumsieh@apple.com/
>
> To: Catalin Marinas <catalin.marinas@arm.com>
> To: Will Deacon <will@kernel.org>
> To: Marc Zyngier <maz@kernel.org>
> To: Mark Rutland <mark.rutland@arm.com>
> Cc: Zayd Qumsieh <zayd_qumsieh@apple.com>
> Cc: Justin Lu <ih_justin@apple.com>
> Cc: Ryan Houdek <Houdek.Ryan@fex-emu.org>
> Cc: Mark Brown <broonie@kernel.org>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Mateusz Guzik <mjguzik@gmail.com>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Oliver Upton <oliver.upton@linux.dev>
> Cc: Miguel Luis <miguel.luis@oracle.com>
> Cc: Joey Gouly <joey.gouly@arm.com>
> Cc: Christoph Paasch <cpaasch@apple.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Sami Tolvanen <samitolvanen@google.com>
> Cc: Baoquan He <bhe@redhat.com>
> Cc: Joel Granados <j.granados@samsung.com>
> Cc: Dawei Li <dawei.li@shingroup.cn>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Florent Revest <revest@chromium.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Stefan Roesch <shr@devkernel.io>
> Cc: Andy Chiu <andy.chiu@sifive.com>
> Cc: Josh Triplett <josh@joshtriplett.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Helge Deller <deller@gmx.de>
> Cc: Zev Weiss <zev@bewilderbeest.net>
> Cc: Ondrej Mosnacek <omosnace@redhat.com>
> Cc: Miguel Ojeda <ojeda@kernel.org>
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Asahi Linux <asahi@lists.linux.dev>
>
> Signed-off-by: Hector Martin <marcan@marcan.st>
> ---
> Hector Martin (4):
>       prctl: Introduce PR_{SET,GET}_MEM_MODEL
>       arm64: Implement PR_{GET,SET}_MEM_MODEL for always-TSO CPUs
>       arm64: Introduce scaffolding to add ACTLR_EL1 to thread state
>       arm64: Implement Apple IMPDEF TSO memory model control
>
>  arch/arm64/Kconfig                        | 14 ++++++
>  arch/arm64/include/asm/apple_cpufeature.h | 15 +++++++
>  arch/arm64/include/asm/cpufeature.h       | 10 +++++
>  arch/arm64/include/asm/processor.h        |  3 ++
>  arch/arm64/kernel/Makefile                |  3 +-
>  arch/arm64/kernel/cpufeature.c            | 11 ++---
>  arch/arm64/kernel/cpufeature_impdef.c     | 61 ++++++++++++++++++++++++++
>  arch/arm64/kernel/process.c               | 71 +++++++++++++++++++++++++++++++
>  arch/arm64/kernel/setup.c                 |  8 ++++
>  arch/arm64/tools/cpucaps                  |  2 +
>  include/linux/memory_ordering_model.h     | 11 +++++
>  include/uapi/linux/prctl.h                |  5 +++
>  kernel/sys.c                              | 21 +++++++++
>  13 files changed, 229 insertions(+), 6 deletions(-)
> ---
> base-commit: 4cece764965020c22cff7665b18a012006359095
> change-id: 20240411-tso-e86fdceb94b8
>

The series looks good to me.

Reviewed-by: Neal Gompa <neal@gompa.dev>
Will Deacon April 11, 2024, 1:28 p.m. UTC | #2
Hi Hector,

On Thu, Apr 11, 2024 at 09:51:19AM +0900, Hector Martin wrote:
> x86 CPUs implement a stricter memory model (TSO) than ARM64. For this
> reason, x86 emulation on baseline ARM64 systems requires very expensive
> memory model emulation. Having hardware that supports this natively is
> therefore very attractive. Such hardware, in fact, exists. This series
> adds support for userspace to identify when TSO is available and
> toggle it on, if supported.

I'm probably going to make myself hugely unpopular here, but I have a
strong objection to this patch series as it stands. I firmly believe
that providing a prctl() to query and toggle the memory model to/from
TSO is going to lead to subtle fragmentation of arm64 Linux userspace.

It's not difficult to envisage this TSO switch being abused for native
arm64 applications:

  * A program no longer crashes when TSO is enabled, so the developer
    just toggles TSO to meet a deadline.

  * Some legacy x86 sources are being ported to arm64 but concurrency
    is hard so the developer just enables TSO to (mostly) avoid thinking
    about it.

  * Some binaries in a distribution exhibit instability which goes away
    in TSO mode, so a taskset-like program is used to run them with TSO
    enabled.

In all these cases, we end up with native arm64 applications that will
either fail to load or will crash in subtle ways on CPUs without the TSO
feature. Assuming that the application cannot be fixed, a better
approach would be to recompile using stronger instructions (e.g.
LDAR/STLR) so that at least the resulting binary is portable. Now, it's
true that some existing CPUs are TSO by design (this is a perfectly
valid implementation of the arm64 memory model), but I think there's a
big difference between quietly providing more ordering guarantees than
software may be relying on and providing a mechanism to discover,
request and ultimately rely upon the stronger behaviour.
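
To make the "stronger instructions" alternative concrete, here is a
minimal sketch of portable message passing with C11 acquire/release
atomics. On arm64, compilers typically lower the release store to STLR
and the acquire load to LDAR (or LDAPR where available), so the code
does not depend on the CPU being TSO. The names payload, ready,
publish() and try_consume() are made up for this example.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;        /* plain data handed from producer to consumer */
static atomic_bool ready;  /* flag that publishes the data */

/* Producer: write the data, then publish it with a release store. */
static void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/*
 * Consumer: the acquire load pairs with the release store above, so
 * once the flag is observed as true, the payload write is visible too.
 * Returns false if nothing has been published yet.
 */
static bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

With plain (relaxed) accesses in place of the release/acquire pair, this
pattern would happen to work on a TSO machine but could observe a stale
payload on a weakly ordered arm64 core.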

An alternative option is to go down the SPARC RMO route and just enable
TSO statically (although presumably in the firmware) for Apple silicon.
I'm assuming that has a performance impact for native code?

Will

P.S. I briefly pondered the idea of the kernel toggling the bit in the
ELF loader when e.g. it sees an x86 machine type but I suspect that
doesn't really help with existing emulators and you'd still need a way
to tell the emulator whether or not it was enabled.
Hector Martin April 11, 2024, 2:19 p.m. UTC | #3
On 2024/04/11 22:28, Will Deacon wrote:
> Hi Hector,
> 
> On Thu, Apr 11, 2024 at 09:51:19AM +0900, Hector Martin wrote:
>> x86 CPUs implement a stricter memory model (TSO) than ARM64. For this
>> reason, x86 emulation on baseline ARM64 systems requires very expensive
>> memory model emulation. Having hardware that supports this natively is
>> therefore very attractive. Such hardware, in fact, exists. This series
>> adds support for userspace to identify when TSO is available and
>> toggle it on, if supported.
> 
> I'm probably going to make myself hugely unpopular here, but I have a
> strong objection to this patch series as it stands. I firmly believe
> that providing a prctl() to query and toggle the memory model to/from
> TSO is going to lead to subtle fragmentation of arm64 Linux userspace.

I honestly doubt this should be a significant concern right now, given
that only a subset of implementations actually support this. Yes,
developers can do stupid stuff, but we already have gone through this
kind of story with other situations (e.g. 16K and 64K page support on
ARM64 breaking 4K assumptions) and things have been fixed over time.

In particular, I highly suspect Asahi Linux and Apple Silicon have done
a lot more good for the ARM64 ecosystem by getting developers to fix
their page size mess than they will do bad by somehow encouraging TSO
abuse. We've even found new memory model issues thanks to the
architecture's deep out-of-order character (remember that mess with
Linux atomics? :-)). So far, in the year+ we've had this patchset
downstream, not a single developer has proposed abusing it for something
that isn't an x86 emulator.

There's a pragmatic argument here: since we need this, and it absolutely
will continue to ship downstream if rejected, it doesn't make much
difference for fragmentation risk, does it? The vast majority of
Linux-on-Mac users are likely to continue running downstream kernels for
the foreseeable future anyway to get newer features and hardware support
faster than they can be upstreamed. So not allowing this upstream
doesn't really change the landscape vis-a-vis being able to abuse this
or not, it just makes our life harder by forcing us to carry more
patches forever.

> It's not difficult to envisage this TSO switch being abused for native
> arm64 applications:
> 
>   * A program no longer crashes when TSO is enabled, so the developer
>     just toggles TSO to meet a deadline.
> 
>   * Some legacy x86 sources are being ported to arm64 but concurrency
>     is hard so the developer just enables TSO to (mostly) avoid thinking
>     about it.

Both of these rely on the developer *knowing* what TSO is and why it
fixes this. I posit that a developer who knows that is also likely to
know why this is a stupid hack, that they shouldn't be doing it, and
that it won't work on all machines.

> 
>   * Some binaries in a distribution exhibit instability which goes away
>     in TSO mode, so a taskset-like program is used to run them with TSO
>     enabled.

Since the flag is cleared on execve, this third one isn't generally
possible as far as I know.

> In all these cases, we end up with native arm64 applications that will
> either fail to load or will crash in subtle ways on CPUs without the TSO
> feature. Assuming that the application cannot be fixed, a better
> approach would be to recompile using stronger instructions (e.g.
> LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> true that some existing CPUs are TSO by design (this is a perfectly
> valid implementation of the arm64 memory model), but I think there's a
> big difference between quietly providing more ordering guarantees than
> software may be relying on and providing a mechanism to discover,
> request and ultimately rely upon the stronger behaviour.

The problem is "just" using stronger instructions is much more
expensive, as emulators have demonstrated. If TSO didn't serve a
practical purpose I wouldn't be submitting this, but it does. This is
basically non-negotiable for x86 emulation; if this is rejected
upstream, it will forever live as a downstream patch used by the entire
gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
explicitly targeting, given our efforts with microVMs for 4K page size
support and the upcoming Vulkan drivers).

That said, I have a pragmatic proposal here. The "fixed TSO" part of the
implementation should be harmless, since those CPUs would correctly run
poorly-written applications anyway so the API is moot. That leaves Apple
Silicon. Our native kernels are and likely always will be 16K page size,
due to a bunch of pain around 16K-only IOMMUs (4K kernels do boot
natively but with very broken functionality including no GPU
acceleration) plus performance differences that favor 16K. How about we
gate the TSO functionality to only be supported on 4K kernel builds?
This would make them only work in 4K VMs on Asahi Linux. We are very
explicitly discouraging people from trying to use the microVMs to work
around page size problems (which they can already do, another
fragmentation problem, anyway); any application which requires the 4K VM
to run that isn't an emulator is already clearly broken and advertising
that fact openly. So, adding TSO to this should be only a marginal risk
of further fragmentation, and it wouldn't allow apps to "sneakily" "just
work" on Apple machines by abusing TSO.

> 
> An alternative option is to go down the SPARC RMO route and just enable
> TSO statically (although presumably in the firmware) for Apple silicon.
> I'm assuming that has a performance impact for native code?

Correct. We already have this as a bootloader option, but it is not
desirable. Plus, userspace code still needs a way to *discover* that TSO
is enabled for correctness, so it can automatically decide whether to
use stronger or weaker instructions.

> 
> Will
> 
> P.S. I briefly pondered the idea of the kernel toggling the bit in the
> ELF loader when e.g. it sees an x86 machine type but I suspect that
> doesn't really help with existing emulators and you'd still need a way
> to tell the emulator whether or not it was enabled.
> 

- Hector
Hector Martin April 11, 2024, 6:43 p.m. UTC | #4
On 2024/04/11 23:19, Hector Martin wrote:
>>
>> An alternative option is to go down the SPARC RMO route and just enable
>> TSO statically (although presumably in the firmware) for Apple silicon.
>> I'm assuming that has a performance impact for native code?
> 
> Correct. We already have this as a bootloader option, but it is not
> desirable. Plus, userspace code still needs a way to *discover* that TSO
> is enabled for correctness, so it can automatically decide whether to
> use stronger or weaker instructions.

To add some numbers to this (I was just made aware of this paper):

https://www.sra.uni-hannover.de/Publications/2023/tosting-arcs23/wrenger_23_arcs.pdf

Using TSO globally has, on average, a 9% performance hit, so that is
clearly off the table as a general solution.

Meanwhile, more detailed microbenchmarks often show TSO as having better
performance than outright using acquire/release instructions without
TSO. Therefore, just giving up on TSO and using acq/rel semantics for
emulators is also not an acceptable solution.

Additionally, the general load/store instructions on ARM have more
flexible addressing modes than the synchronizing ones, and since general
x86 emulation requires *all* loads and stores to be like this in a
non-TSO model (without much more complex/expensive program analysis to
determine where this can be elided), the perf impact is definitely worse
for emulation (e.g. stack accesses are affected) than for a
microbenchmark where only the "target" test instructions are being modified.
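
The addressing-mode point can be sketched with a toy code emitter. LDAR
accepts only a bare base register, while a plain LDR can fold an
immediate offset into the access, so a guest load from [base + disp]
needs an extra ADD when the emulator cannot rely on hardware TSO. The
function name, the register numbering and the use of x16 as a scratch
register are assumptions made for this illustration, and the sketch only
handles displacements that are multiples of 4 in the range 0..4095.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Write AArch64 assembly text for a 32-bit guest load from
 * [base + disp] into 'out' and return the number of host instructions.
 * hw_tso != 0 means plain loads are already ordered by the hardware.
 */
static int emit_guest_load32(char *out, size_t len, int dst, int base,
                             int disp, int hw_tso)
{
    if (hw_tso) {
        /* Plain load: the offset is folded into the instruction. */
        snprintf(out, len, "ldr w%d, [x%d, #%d]\n", dst, base, disp);
        return 1;
    }
    if (disp == 0) {
        /* No offset to add, so LDAR can use the base directly. */
        snprintf(out, len, "ldar w%d, [x%d]\n", dst, base);
        return 1;
    }
    /* LDAR has no offset form: compute the address first. */
    snprintf(out, len, "add x16, x%d, #%d\nldar w%d, [x16]\n",
             base, disp, dst);
    return 2;
}
```

Since every guest load and store goes through a path like this in a
non-TSO model, the extra address computation applies to stack accesses
and other offset-based accesses as well, not only to the synchronizing
ones.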

- Hector
Zayd Qumsieh April 16, 2024, 2:11 a.m. UTC | #5
The patch looks great! :) I have one minor suggestion, though:

>static __always_inline bool system_has_actlr_state(void)
>{
>	return IS_ENABLED(CONFIG_ARM64_ACTLR_STATE) &&
>		alternative_has_cap_unlikely(ARM64_HAS_TSO_APPLE);
>}

ACTLR_EL1.TSO is not exposed for writing on Virtual Machines on all
versions of macOS. However, AIDR_EL1 may still advertise TSO, whether
or not ACTLR_EL1.TSO is writable. Could you modify the patch such that
we check the writability of ACTLR_EL1.TSO in system_has_actlr_state
(or once on startup, and cache it, since reading from AIDR_EL1 causes
a trap to Hypervisor.fwk)?

Thanks,
Zayd
Will Deacon April 19, 2024, 4:58 p.m. UTC | #6
On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
> On 2024/04/11 22:28, Will Deacon wrote:
> >   * Some binaries in a distribution exhibit instability which goes away
> >     in TSO mode, so a taskset-like program is used to run them with TSO
> >     enabled.
> 
> Since the flag is cleared on execve, this third one isn't generally
> possible as far as I know.

Ah ok, I'd missed that. Thanks.

> > In all these cases, we end up with native arm64 applications that will
> > either fail to load or will crash in subtle ways on CPUs without the TSO
> > feature. Assuming that the application cannot be fixed, a better
> > approach would be to recompile using stronger instructions (e.g.
> > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> > true that some existing CPUs are TSO by design (this is a perfectly
> > valid implementation of the arm64 memory model), but I think there's a
> > big difference between quietly providing more ordering guarantees than
> > software may be relying on and providing a mechanism to discover,
> > request and ultimately rely upon the stronger behaviour.
> 
> The problem is "just" using stronger instructions is much more
> expensive, as emulators have demonstrated. If TSO didn't serve a
> practical purpose I wouldn't be submitting this, but it does. This is
> basically non-negotiable for x86 emulation; if this is rejected
> upstream, it will forever live as a downstream patch used by the entire
> gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
> explicitly targeting, given our efforts with microVMs for 4K page size
> support and the upcoming Vulkan drivers).

These microVMs sound quite interesting. What exactly are they? Are you
running them under KVM?

Ignoring the mechanism for the time being, would it solve your problem
if you were able to run specific microVMs in TSO mode, or do you *really*
need the VM to have finer-grained control than that? If the whole VM is
running in TSO mode, then my concerns largely disappear, as that's
indistinguishable from running on a hardware implementation that happens
to be TSO.

> That said, I have a pragmatic proposal here. The "fixed TSO" part of the
> implementation should be harmless, since those CPUs would correctly run
> poorly-written applications anyway so the API is moot. That leaves Apple
> Silicon. Our native kernels are and likely always will be 16K page size,
> due to a bunch of pain around 16K-only IOMMUs (4K kernels do boot
> natively but with very broken functionality including no GPU
> acceleration) plus performance differences that favor 16K. How about we
> gate the TSO functionality to only be supported on 4K kernel builds?
> This would make them only work in 4K VMs on Asahi Linux. We are very
> explicitly discouraging people from trying to use the microVMs to work
> around page size problems (which they can already do, another
> fragmentation problem, anyway); any application which requires the 4K VM
> to run that isn't an emulator is already clearly broken and advertising
> that fact openly. So, adding TSO to this should be only a marginal risk
> of further fragmentation, and it wouldn't allow apps to "sneakily" "just
> work" on Apple machines by abusing TSO.

I appreciate that you're trying to be constructive here, but I don't think
we should tie this to the page size. It's an artificial limitation and I
don't think it really addresses the underlying concerns that I have.

Will
Will Deacon April 19, 2024, 4:58 p.m. UTC | #7
On Mon, Apr 15, 2024 at 07:22:41PM -0700, Zayd Qumsieh wrote:
> >I'm probably going to make myself hugely unpopular here, but I have a
> >strong objection to this patch series as it stands. I firmly believe
> >that providing a prctl() to query and toggle the memory model to/from
> >TSO is going to lead to subtle fragmentation of arm64 Linux userspace.
> 
> It's definitely not our intent to fragment the ecosystem.
> The goal of this memory ordering is to simplify emulation layers that benefit from this.
> If you have suggestions to reduce the risk of it being misused outside of emulators, we'd be happy to look into it.

Once you have exposed this toggle via prctl(), it doesn't really matter
what your intentions were. It will get used outside of emulation layers
and we'll be stuck supporting it.

Will
Catalin Marinas April 19, 2024, 6:05 p.m. UTC | #8
On Fri, Apr 19, 2024 at 05:58:26PM +0100, Will Deacon wrote:
> On Mon, Apr 15, 2024 at 07:22:41PM -0700, Zayd Qumsieh wrote:
> > >I'm probably going to make myself hugely unpopular here, but I have a
> > >strong objection to this patch series as it stands. I firmly believe
> > >that providing a prctl() to query and toggle the memory model to/from
> > >TSO is going to lead to subtle fragmentation of arm64 Linux userspace.
> > 
> > It's definitely not our intent to fragment the ecosystem. The goal
> > of this memory ordering is to simplify emulation layers that benefit
> > from this. If you have suggestions to reduce the risk of it being
> > misused outside of emulators, we'd be happy to look into it.
> 
> Once you have exposed this toggle via prctl(), it doesn't really matter
> what your intentions were. It will get used outside of emulation layers
> and we'll be stuck supporting it.

Just FTR, I fully agree with Will. I'm strongly against this kind of ABI
for a non-architected, implementation defined feature. I can't even tell
exactly what TSO means on the Apple hardware. Is it close to the x86
TSO? Is there a formal memory model for it? Are future Apple (or other
Arm vendor) implementations going to follow exactly the same model to be
able to call it some form of "Apple standard" that deserves an ABI?

So, sorry, I'm going to NAK these approaches proposing imp def features
as generic opt-in mechanisms (the microVMs thing sounds doable though,
to my limited understanding; I guess that would mean running the
emulator in a VM).
Marc Zyngier April 20, 2024, 11:37 a.m. UTC | #9
On Fri, 19 Apr 2024 17:58:09 +0100,
Will Deacon <will@kernel.org> wrote:
> 
> On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
> > On 2024/04/11 22:28, Will Deacon wrote:
> > >   * Some binaries in a distribution exhibit instability which goes away
> > >     in TSO mode, so a taskset-like program is used to run them with TSO
> > >     enabled.
> > 
> > Since the flag is cleared on execve, this third one isn't generally
> > possible as far as I know.
> 
> Ah ok, I'd missed that. Thanks.
> 
> > > In all these cases, we end up with native arm64 applications that will
> > > either fail to load or will crash in subtle ways on CPUs without the TSO
> > > feature. Assuming that the application cannot be fixed, a better
> > > approach would be to recompile using stronger instructions (e.g.
> > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> > > true that some existing CPUs are TSO by design (this is a perfectly
> > > valid implementation of the arm64 memory model), but I think there's a
> > > big difference between quietly providing more ordering guarantees than
> > > software may be relying on and providing a mechanism to discover,
> > > request and ultimately rely upon the stronger behaviour.
> > 
> > The problem is "just" using stronger instructions is much more
> > expensive, as emulators have demonstrated. If TSO didn't serve a
> > practical purpose I wouldn't be submitting this, but it does. This is
> > basically non-negotiable for x86 emulation; if this is rejected
> > upstream, it will forever live as a downstream patch used by the entire
> > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
> > explicitly targeting, given our efforts with microVMs for 4K page size
> > support and the upcoming Vulkan drivers).
> 
> These microVMs sound quite interesting. What exactly are they? Are you
> running them under KVM?
> 
> Ignoring the mechanism for the time being, would it solve your problem
> if you were able to run specific microVMs in TSO mode, or do you *really*
> need the VM to have finer-grained control than that? If the whole VM is
> running in TSO mode, then my concerns largely disappear, as that's
> indistinguishable from running on a hardware implementation that happens
> to be TSO.

Since KVM has been mentioned a few times, I'll give my take on this.

Since day 1, it was a conscious decision for KVM/arm64 to emulate the
architecture, and only that -- this is complicated enough. Meaning
that no implementation-defined features should be explicitly exposed
to the guest. So I have no plan to expose any such feature for
userspace to configure TSO or anything else of the sort.

However, that doesn't preclude VMs from running in TSO mode if the HW
is configured as such at boot time. From what I have understood, this
is a per translation regime setting (EL1 and EL2 have separate knobs).

So it should be possible to set ACTLR_EL1.TSO=1 from firmware (using
the non-architected ACTLR_EL12 accessor), and let things work without
touching anything else (KVM doesn't context switch this register and
traps accesses to it). This would keep KVM out of the loop, the host
side would be unaffected, and only VMs would pay the overhead of TSO.

I appreciate that this is not the ideal situation, and very much an
all-or-nothing approach. But that's what we can reasonably manage from
an upstream perspective given the variability of the arm64 ecosystem.

	M.
Eric Curtin April 20, 2024, 12:13 p.m. UTC | #10
On Fri, 19 Apr 2024 at 18:08, Will Deacon <will@kernel.org> wrote:
>
> On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
> > On 2024/04/11 22:28, Will Deacon wrote:
> > >   * Some binaries in a distribution exhibit instability which goes away
> > >     in TSO mode, so a taskset-like program is used to run them with TSO
> > >     enabled.
> >
> > Since the flag is cleared on execve, this third one isn't generally
> > possible as far as I know.
>
> Ah ok, I'd missed that. Thanks.
>
> > > In all these cases, we end up with native arm64 applications that will
> > > either fail to load or will crash in subtle ways on CPUs without the TSO
> > > feature. Assuming that the application cannot be fixed, a better
> > > approach would be to recompile using stronger instructions (e.g.
> > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> > > true that some existing CPUs are TSO by design (this is a perfectly
> > > valid implementation of the arm64 memory model), but I think there's a
> > > big difference between quietly providing more ordering guarantees than
> > > software may be relying on and providing a mechanism to discover,
> > > request and ultimately rely upon the stronger behaviour.
> >
> > The problem is "just" using stronger instructions is much more
> > expensive, as emulators have demonstrated. If TSO didn't serve a
> > practical purpose I wouldn't be submitting this, but it does. This is
> > basically non-negotiable for x86 emulation; if this is rejected
> > upstream, it will forever live as a downstream patch used by the entire
> > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
> > explicitly targeting, given our efforts with microVMs for 4K page size
> > support and the upcoming Vulkan drivers).
>
> These microVMs sound quite interesting. What exactly are they? Are you
> running them under KVM?

It's the magic of libkrun. This is one of the git repos in the libkrun
family. It has a wide array of use cases, and I won't do them all justice
by trying to explain them; this is just one repo/tool/use case:

https://github.com/containers/krunvm

https://sinrega.org/running-microvms-on-m1/

CC'ing @Sergio Lopez Pascual the lead of krun in general.

Is mise le meas/Regards,

Eric Curtin

>
> Ignoring the mechanism for the time being, would it solve your problem
> if you were able to run specific microVMs in TSO mode, or do you *really*
> need the VM to have finer-grained control than that? If the whole VM is
> running in TSO mode, then my concerns largely disappear, as that's
> indistinguishable from running on a hardware implementation that happens
> to be TSO.
>
> > That said, I have a pragmatic proposal here. The "fixed TSO" part of the
> > implementation should be harmless, since those CPUs would correctly run
> > poorly-written applications anyway so the API is moot. That leaves Apple
> > Silicon. Our native kernels are and likely always will be 16K page size,
> > due to a bunch of pain around 16K-only IOMMUs (4K kernels do boot
> > natively but with very broken functionality including no GPU
> > acceleration) plus performance differences that favor 16K. How about we
> > gate the TSO functionality to only be supported on 4K kernel builds?
> > This would make them only work in 4K VMs on Asahi Linux. We are very
> > explicitly discouraging people from trying to use the microVMs to work
> > around page size problems (which they can already do, another
> > fragmentation problem, anyway); any application which requires the 4K VM
> > to run that isn't an emulator is already clearly broken and advertising
> > that fact openly. So, adding TSO to this should be only a marginal risk
> > of further fragmentation, and it wouldn't allow apps to "sneakily" "just
> > work" on Apple machines by abusing TSO.
>
> I appreciate that you're trying to be constructive here, but I don't think
> we should tie this to the page size. It's an artificial limitation and I
> don't think it really addresses the underlying concerns that I have.
>
> Will
>
Eric Curtin April 20, 2024, 12:15 p.m. UTC | #11
On Sat, 20 Apr 2024 at 13:13, Eric Curtin <ecurtin@redhat.com> wrote:
>
> On Fri, 19 Apr 2024 at 18:08, Will Deacon <will@kernel.org> wrote:
> >
> > On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
> > > On 2024/04/11 22:28, Will Deacon wrote:
> > > >   * Some binaries in a distribution exhibit instability which goes away
> > > >     in TSO mode, so a taskset-like program is used to run them with TSO
> > > >     enabled.
> > >
> > > Since the flag is cleared on execve, this third one isn't generally
> > > possible as far as I know.
> >
> > Ah ok, I'd missed that. Thanks.
> >
> > > > In all these cases, we end up with native arm64 applications that will
> > > > either fail to load or will crash in subtle ways on CPUs without the TSO
> > > > feature. Assuming that the application cannot be fixed, a better
> > > > approach would be to recompile using stronger instructions (e.g.
> > > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> > > > true that some existing CPUs are TSO by design (this is a perfectly
> > > > valid implementation of the arm64 memory model), but I think there's a
> > > > big difference between quietly providing more ordering guarantees than
> > > > software may be relying on and providing a mechanism to discover,
> > > > request and ultimately rely upon the stronger behaviour.
> > >
> > > The problem is "just" using stronger instructions is much more
> > > expensive, as emulators have demonstrated. If TSO didn't serve a
> > > practical purpose I wouldn't be submitting this, but it does. This is
> > > basically non-negotiable for x86 emulation; if this is rejected
> > > upstream, it will forever live as a downstream patch used by the entire
> > > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
> > > explicitly targeting, given our efforts with microVMs for 4K page size
> > > support and the upcoming Vulkan drivers).
> >
> > These microVMs sound quite interesting. What exactly are they? Are you
> > running them under KVM?
>
> It's the magic of libkrun. This is one of the git repos in the libkrun
> family. It has a wide array of use cases, and I won't do them all justice
> by trying to explain them; this is just one repo/tool/use case:
>
> https://github.com/containers/krunvm
>
> https://sinrega.org/running-microvms-on-m1/

Sorry for the double post, meant to share this one for the Asahi
emulator usecase. Sergio's blogs are great in general:

https://sinrega.org/2023-10-06-using-microvms-for-gaming-on-fedora-asahi/

Is mise le meas/Regards,

Eric Curtin

>
> CC'ing @Sergio Lopez Pascual the lead of krun in general.
>
> Is mise le meas/Regards,
>
> Eric Curtin
>
> >
> > Ignoring the mechanism for the time being, would it solve your problem
> > if you were able to run specific microVMs in TSO mode, or do you *really*
> > need the VM to have finer-grained control than that? If the whole VM is
> > running in TSO mode, then my concerns largely disappear, as that's
> > indistinguishable from running on a hardware implementation that happens
> > to be TSO.
> >
> > > That said, I have a pragmatic proposal here. The "fixed TSO" part of the
> > > implementation should be harmless, since those CPUs would correctly run
> > > poorly-written applications anyway so the API is moot. That leaves Apple
> > > Silicon. Our native kernels are and likely always will be 16K page size,
> > > due to a bunch of pain around 16K-only IOMMUs (4K kernels do boot
> > > natively but with very broken functionality including no GPU
> > > acceleration) plus performance differences that favor 16K. How about we
> > > gate the TSO functionality to only be supported on 4K kernel builds?
> > > This would make them only work in 4K VMs on Asahi Linux. We are very
> > > explicitly discouraging people from trying to use the microVMs to work
> > > around page size problems (which they can already do, another
> > > fragmentation problem, anyway); any application which requires the 4K VM
> > > to run that isn't an emulator is already clearly broken and advertising
> > > that fact openly. So, adding TSO to this should be only a marginal risk
> > > of further fragmentation, and it wouldn't allow apps to "sneakily" "just
> > > work" on Apple machines by abusing TSO.
> >
> > I appreciate that you're trying to be constructive here, but I don't think
> > we should tie this to the page size. It's an artificial limitation and I
> > don't think it really addresses the underlying concerns that I have.
> >
> > Will
> >
Zayd Qumsieh May 2, 2024, 12:10 a.m. UTC | #12
> On Fri, 19 Apr 2024 17:58:09 +0100,
> Will Deacon <will@kernel.org> wrote:
> > 
> > On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
> > > On 2024/04/11 22:28, Will Deacon wrote:
> > > >   * Some binaries in a distribution exhibit instability which goes away
> > > >     in TSO mode, so a taskset-like program is used to run them with TSO
> > > >     enabled.
> > > 
> > > Since the flag is cleared on execve, this third one isn't generally
> > > possible as far as I know.
> > 
> > Ah ok, I'd missed that. Thanks.
> > 
> > > > In all these cases, we end up with native arm64 applications that will
> > > > either fail to load or will crash in subtle ways on CPUs without the TSO
> > > > feature. Assuming that the application cannot be fixed, a better
> > > > approach would be to recompile using stronger instructions (e.g.
> > > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> > > > true that some existing CPUs are TSO by design (this is a perfectly
> > > > valid implementation of the arm64 memory model), but I think there's a
> > > > big difference between quietly providing more ordering guarantees than
> > > > software may be relying on and providing a mechanism to discover,
> > > > request and ultimately rely upon the stronger behaviour.
> > > 
> > > The problem is "just" using stronger instructions is much more
> > > expensive, as emulators have demonstrated. If TSO didn't serve a
> > > practical purpose I wouldn't be submitting this, but it does. This is
> > > basically non-negotiable for x86 emulation; if this is rejected
> > > upstream, it will forever live as a downstream patch used by the entire
> > > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
> > > explicitly targeting, given our efforts with microVMs for 4K page size
> > > support and the upcoming Vulkan drivers).
> > 
> > These microVMs sound quite interesting. What exactly are they? Are you
> > running them under KVM?
> > 
> > Ignoring the mechanism for the time being, would it solve your problem
> > if you were able to run specific microVMs in TSO mode, or do you *really*
> > need the VM to have finer-grained control than that? If the whole VM is
> > running in TSO mode, then my concerns largely disappear, as that's
> > indistinguishable from running on a hardware implementation that happens
> > to be TSO.
>
> Since KVM has been mentioned a few times, I'll give my take on this.
>
> Since day 1, it was a conscious decision for KVM/arm64 to emulate the
> architecture, and only that -- this is complicated enough. Meaning
> that no implementation-defined features should be explicitly exposed
> to the guest. So I have no plan to expose any such feature for
> userspace to configure TSO or anything else of the sort.

Agreed. We do not intend for TSO mode to be used extensively for EL1; the
intention is for TSO mode to be reserved for userspace applications that
request it.
Zayd Qumsieh May 2, 2024, 12:16 a.m. UTC | #13
On Thu, 11 Apr 2024 14:28:54 +0100,
Will Deacon <will@kernel.org> wrote:
> P.S. I briefly pondered the idea of the kernel toggling the bit in the
> ELF loader when e.g. it sees an x86 machine type but I suspect that
> doesn't really help with existing emulators and you'd still need a way
> to tell the emulator whether or not it was enabled.

This seems promising to me. What do people think of adding an opt-in argument,
option, or similar to binfmt that allows users to mark certain file formats as
"must run under TSO"? And then, the kernel would set the TSO bit when invoking
the interpreter for those file formats. If an emulator decides to create a
non-CPU-emulation thread, then it can use a prctl to disable TSO and switch to
the default ARM memory model. Note that this prctl wouldn't be allowed to
enable TSO - it would only disable it. This way, it is much harder to end up
with a faulty application that relies on TSO, since TSO is only ever enabled
via a binfmt handler that the user must explicitly opt into.

It is true that existing emulators wouldn't be able to benefit from this, but
that's the case no matter the activation mechanism. We can, however, expose a
prctl to get the memory model, so emulators can detect if TSO was enabled for
their threads.

To summarize, I propose two prctls (similar to the ones in the current revision
of the patch series). One to switch from the TSO memory model to the default
ARM one (this is a one-way street). And another to query the current memory
model.
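
To make the intended semantics concrete, here is a minimal userspace toy
model of the policy described above. It is only a sketch: every name in it
(task_model, model_exec, model_set, model_get) is hypothetical, the choice of
EPERM as the error code is an assumption, and none of this is kernel code or
the API from the patch series.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names: a toy model of the proposed policy, not kernel code. */
enum mem_model { MEM_MODEL_DEFAULT = 0, MEM_MODEL_TSO = 1 };

struct task_model {
    enum mem_model model;
};

/*
 * exec path: TSO is turned on only when the user has marked the binary
 * format as "must run under TSO". Any other exec starts in the default
 * Arm memory model.
 */
static void model_exec(struct task_model *t, bool binfmt_wants_tso)
{
    assert(t != NULL);
    t->model = binfmt_wants_tso ? MEM_MODEL_TSO : MEM_MODEL_DEFAULT;
}

/*
 * "set" prctl: a one-way street. Userspace may drop back to the default
 * model but may never request TSO. EPERM is an assumed error code.
 */
static int model_set(struct task_model *t, enum mem_model requested)
{
    assert(t != NULL);
    if (requested != MEM_MODEL_DEFAULT)
        return -EPERM;
    t->model = MEM_MODEL_DEFAULT;
    return 0;
}

/* "get" prctl: lets an emulator detect whether TSO is active. */
static enum mem_model model_get(const struct task_model *t)
{
    assert(t != NULL);
    return t->model;
}
```

In this model an emulator thread that does not emulate a CPU would call
model_set(&t, MEM_MODEL_DEFAULT) once, and no later call can bring TSO back
short of another exec through the opted-in binfmt handler.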

Thanks,
Zayd

P.S. I forgot to CC you in my most recent email to Marc Zyngier just now. 
Sorry, I'm quite new to using mailing lists.
Marc Zyngier May 2, 2024, 1:25 p.m. UTC | #14
[adding Will back to the thread]

On Thu, 02 May 2024 01:10:35 +0100,
Zayd Qumsieh <zayd_qumsieh@apple.com> wrote:
> 
> > On Fri, 19 Apr 2024 17:58:09 +0100,
> > Will Deacon <will@kernel.org> wrote:
> > > 
> > > On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
> > > > On 2024/04/11 22:28, Will Deacon wrote:
> > > > >   * Some binaries in a distribution exhibit instability which goes away
> > > > >     in TSO mode, so a taskset-like program is used to run them with TSO
> > > > >     enabled.
> > > > 
> > > > Since the flag is cleared on execve, this third one isn't generally
> > > > possible as far as I know.
> > > 
> > > Ah ok, I'd missed that. Thanks.
> > > 
> > > > > In all these cases, we end up with native arm64 applications that will
> > > > > either fail to load or will crash in subtle ways on CPUs without the TSO
> > > > > feature. Assuming that the application cannot be fixed, a better
> > > > > approach would be to recompile using stronger instructions (e.g.
> > > > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> > > > > true that some existing CPUs are TSO by design (this is a perfectly
> > > > > valid implementation of the arm64 memory model), but I think there's a
> > > > > big difference between quietly providing more ordering guarantees than
> > > > > software may be relying on and providing a mechanism to discover,
> > > > > request and ultimately rely upon the stronger behaviour.
> > > > 
> > > > The problem is "just" using stronger instructions is much more
> > > > expensive, as emulators have demonstrated. If TSO didn't serve a
> > > > practical purpose I wouldn't be submitting this, but it does. This is
> > > > basically non-negotiable for x86 emulation; if this is rejected
> > > > upstream, it will forever live as a downstream patch used by the entire
> > > > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
> > > > explicitly targeting, given our efforts with microVMs for 4K page size
> > > > support and the upcoming Vulkan drivers).
> > > 
> > > These microVMs sound quite interesting. What exactly are they? Are you
> > > running them under KVM?
> > > 
> > > Ignoring the mechanism for the time being, would it solve your problem
> > > if you were able to run specific microVMs in TSO mode, or do you *really*
> > > need the VM to have finer-grained control than that? If the whole VM is
> > > running in TSO mode, then my concerns largely disappear, as that's
> > > indistinguishable from running on a hardware implementation that happens
> > > to be TSO.
> >
> > Since KVM has been mentioned a few times, I'll give my take on this.
> >
> > Since day 1, it was a conscious decision for KVM/arm64 to emulate the
> > architecture, and only that -- this is complicated enough. Meaning
> > that no implementation-defined features should be explicitly exposed
> > to the guest. So I have no plan to expose any such feature for
> > userspace to configure TSO or anything else of the sort.
> 
> Agreed. We do not intend for TSO mode to be used extensively for EL1; the
> intention is for TSO mode to be reserved for userspace applications that
> request it.

But that's the same thing for a hypervisor.

For userspace in a VM to make use of any feature, it must be exposed
to the VM as a whole by the host VMM (QEMU, kvmtool, whatever). Which
means having a new userspace ABI, specific to KVM, exposing a feature
for which there is no spec whatsoever. Even worse, you cannot discover
whether the instruction you must use to context switch the ACTLR_EL1
register is implemented. Isn't that great?

And I'm not even talking about the joys of migrating such a VM,
because we have no clue what this bit means on other implementations.
For all we know it causes another CPU to catch fire (or go PDP-endian,
which is basically the same).

Which is why my proposal is for this bit to be set statically for
*all* VMs, and leave the kernel (and KVM) out of the picture
altogether. At least that is something we can reason about (although
someone would need to start thinking of how this particular TSO
implementation composes with the relaxed memory ordering used outside
of the VM and show that they actually lead to correct results for
something such as virtio, for example).

Thanks,

	M.
Jonas Oberhauser May 6, 2024, 8:20 a.m. UTC | #15
Am 5/2/2024 um 3:25 PM schrieb Marc Zyngier:
> although
> someone would need to start thinking of how this particular TSO
> implementation composes with the relaxed memory ordering used outside
> of the VM and show that they actually lead to correct results for
> something such as virtio, for example

I used to think about this problem space. Composing some kinds of memory
models (e.g., Arm and TSO) is easy; composing others is hard.

I don't know much about virtio, so this may show my naivety, but what
complications could arise from virtio?

Does the "visible behavior" of virtio change depending on the memory
model of the machine it is running on?

At least internally inside virtio it should not cause any problems, since
you are effectively adding some barriers inside some of the virtio threads
(those that are running in the VM).

But if the VM relies on virtio behaving in a "TSO manner" but its behavior
is more relaxed on e.g. Arm, then that could cause issues.
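
As a rough illustration of the "adding some barriers" intuition, here is a
minimal sketch of the classic message-passing pattern in portable C11. It is
not virtio code, and all names in it are made up for the example. The
release/acquire pair is what code has to request explicitly under the Arm
model (compilers typically emit STLR/LDAR for it on arm64), whereas a thread
running in TSO mode would get this particular store-store and load-load
ordering even with plain accesses.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int payload;        /* plain data, written before the flag */
static atomic_int ready;   /* publication flag */

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;
    /* Release store: orders the payload write before the flag write. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg)
{
    int *out = arg;

    /* Acquire load: orders the flag read before the payload read. */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;
    *out = payload;
    return NULL;
}

/* Runs one producer/consumer round and returns what the consumer saw. */
static int run_message_passing(void)
{
    pthread_t p, c;
    int seen = 0;
    int rc;

    payload = 0;
    atomic_store(&ready, 0);

    rc = pthread_create(&c, NULL, consumer, &seen);
    assert(rc == 0);
    rc = pthread_create(&p, NULL, producer, NULL);
    assert(rc == 0);

    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return seen;
}
```

If one side of such a pair runs inside a TSO VM and the other outside it,
the outside half still needs its explicit ordering, which is the kind of
mismatch the paragraph above is worried about.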

have fun, jonas
Sergio Lopez May 6, 2024, 11:21 a.m. UTC | #16
Eric Curtin <ecurtin@redhat.com> writes:

> On Fri, 19 Apr 2024 at 18:08, Will Deacon <will@kernel.org> wrote:
>>
>> On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
>> > On 2024/04/11 22:28, Will Deacon wrote:
>> > >   * Some binaries in a distribution exhibit instability which goes away
>> > >     in TSO mode, so a taskset-like program is used to run them with TSO
>> > >     enabled.
>> >
>> > Since the flag is cleared on execve, this third one isn't generally
>> > possible as far as I know.
>>
>> Ah ok, I'd missed that. Thanks.
>>
>> > > In all these cases, we end up with native arm64 applications that will
>> > > either fail to load or will crash in subtle ways on CPUs without the TSO
>> > > feature. Assuming that the application cannot be fixed, a better
>> > > approach would be to recompile using stronger instructions (e.g.
>> > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
>> > > true that some existing CPUs are TSO by design (this is a perfectly
>> > > valid implementation of the arm64 memory model), but I think there's a
>> > > big difference between quietly providing more ordering guarantees than
>> > > software may be relying on and providing a mechanism to discover,
>> > > request and ultimately rely upon the stronger behaviour.
>> >
>> > The problem is "just" using stronger instructions is much more
>> > expensive, as emulators have demonstrated. If TSO didn't serve a
>> > practical purpose I wouldn't be submitting this, but it does. This is
>> > basically non-negotiable for x86 emulation; if this is rejected
>> > upstream, it will forever live as a downstream patch used by the entire
>> > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
>> > explicitly targeting, given our efforts with microVMs for 4K page size
>> > support and the upcoming Vulkan drivers).

In addition to the use case Hector described here, there's another,
potentially larger one, which is running x86_64 containers on aarch64
systems, using a combination of virtualization and emulation.

In this scenario, both not being able to use TSO for emulation
and having to enable it all the time for the whole VM have a very large
impact on performance (~25% on some workloads).

I understand the concern about the risk of userspace fragmentation, but
I was wondering if we could minimize it to an acceptable level by
narrowing down the context. For instance, since both use cases we're
bringing to the table imply the use of Virtualization, we should be able
to restrict PR_SET_MEM_MODEL to only be accepted when running on EL1
(and not in nVHE nor pKVM), returning EINVAL otherwise. This would
heavily discourage users from relying on this feature for native
applications that can run on arbitrary contexts, hence drastically
reducing the fragmentation risk.
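
From the emulator's side, this restriction would look roughly like the
sketch below: try to enable TSO and fall back to barrier-based emulation
when the kernel refuses. This is only an illustration. The PR_SET_MEM_MODEL
constants are not in upstream headers, so the numeric values here are
assumptions, and emu_strategy, strategy_from_result and pick_strategy are
hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <sys/prctl.h>

/* Assumed values: these are not defined by upstream kernel headers. */
#ifndef PR_SET_MEM_MODEL
#define PR_SET_MEM_MODEL      0x4d4d444c
#define PR_SET_MEM_MODEL_TSO  1
#endif

enum emu_strategy {
    EMU_USE_HW_TSO,    /* hardware TSO enabled for this thread */
    EMU_USE_BARRIERS   /* emit explicit ordering in translated code */
};

/* Pure decision step, kept separate from the system call itself. */
static enum emu_strategy strategy_from_result(int ret, int err)
{
    if (ret == 0)
        return EMU_USE_HW_TSO;
    assert(err != 0);  /* a failed prctl() always sets errno */
    return EMU_USE_BARRIERS;
}

/*
 * Ask for TSO. A kernel that does not know the option, or that rejects it
 * in the current context, fails the call (EINVAL in the proposal above),
 * and the emulator falls back to barriers.
 */
static enum emu_strategy pick_strategy(void)
{
    int ret = prctl(PR_SET_MEM_MODEL, PR_SET_MEM_MODEL_TSO, 0, 0, 0);

    return strategy_from_result(ret, ret == 0 ? 0 : errno);
}
```

The point of the sketch is that an emulator written this way keeps working
everywhere, and only a native application that ignores the failure would be
tied to a specific context.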

We would still need a way to ensure the trap gets to the VMM and for
the VMM to operate on the impdef ACTLR_EL12, but that should be dealt with
in a different series.

Thanks,
Sergio.
Marc Zyngier May 6, 2024, 4:12 p.m. UTC | #17
On Mon, 06 May 2024 12:21:40 +0100,
Sergio Lopez Pascual <slp@redhat.com> wrote:
> 
> Eric Curtin <ecurtin@redhat.com> writes:
> 
> > On Fri, 19 Apr 2024 at 18:08, Will Deacon <will@kernel.org> wrote:
> >>
> >> On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
> >> > On 2024/04/11 22:28, Will Deacon wrote:
> >> > >   * Some binaries in a distribution exhibit instability which goes away
> >> > >     in TSO mode, so a taskset-like program is used to run them with TSO
> >> > >     enabled.
> >> >
> >> > Since the flag is cleared on execve, this third one isn't generally
> >> > possible as far as I know.
> >>
> >> Ah ok, I'd missed that. Thanks.
> >>
> >> > > In all these cases, we end up with native arm64 applications that will
> >> > > either fail to load or will crash in subtle ways on CPUs without the TSO
> >> > > feature. Assuming that the application cannot be fixed, a better
> >> > > approach would be to recompile using stronger instructions (e.g.
> >> > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> >> > > true that some existing CPUs are TSO by design (this is a perfectly
> >> > > valid implementation of the arm64 memory model), but I think there's a
> >> > > big difference between quietly providing more ordering guarantees than
> >> > > software may be relying on and providing a mechanism to discover,
> >> > > request and ultimately rely upon the stronger behaviour.
> >> >
> >> > The problem is "just" using stronger instructions is much more
> >> > expensive, as emulators have demonstrated. If TSO didn't serve a
> >> > practical purpose I wouldn't be submitting this, but it does. This is
> >> > basically non-negotiable for x86 emulation; if this is rejected
> >> > upstream, it will forever live as a downstream patch used by the entire
> >> > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
> >> > explicitly targeting, given our efforts with microVMs for 4K page size
> >> > support and the upcoming Vulkan drivers).
> 
> In addition to the use case Hector exposed here, there's another,
> potentially larger one, which is running x86_64 containers on aarch64
> systems, using a combination of both Virtualization and emulation.
> 
> In this scenario, both not being able to use TSO for emulation
> and having to enable it all the time for the whole VM have a very large
> impact on performance (~25% on some workloads).

Well, there is always a price to pay somewhere, and this is the usual
trade-off between performance and maintainability.

> I understand the concern about the risk of userspace fragmentation, but
> I was wondering if we could minimize it to an acceptable level by
> narrowing down the context. For instance, since both use cases we're
> bringing to the table imply the use of Virtualization, we should be able
> to restrict PR_SET_MEM_MODEL to only be accepted when running on EL1
> (and not in nVHE nor pKVM), returning EINVAL otherwise. This would
> heavily discourage users from relying on this feature for native
> applications that can run on arbitrary contexts, hence drastically
> reducing the fragmentation risk.

As I explained in another sub-thread[1], I am not prepared to allow
non architectural state to be exposed to a guest.  I'm also not
prepared to make significant ABI differences between VHE, nVHE, hVHE,
with or without pKVM, because the job of the kernel is to abstract
those differences.

> We would still need a way to ensure the trap gets to the VMM and for
> the VMM to operate on the impdef ACTLR_EL12, but that should be dealt with
> in a different series.

The VMM can't use ACTLR_EL12, by the very definition of this register
(the clue is in the name).  You'd have to proxy the write in the
kernel and context-switch it, which means adding non-architectural
state to KVM, breaking VM migration and adding more kludges to the
existing Apple-specific host crap.

Also, let's realise that we are talking about making significant
changes to the arm64 ABI for a platform that is still not fully
supported in the upstream kernel. I have the feeling that changing the
memory model dynamically may not be of the utmost priority until then.

Thanks,

	M.

[1] https://lore.kernel.org/all/867cgcqrb9.wl-maz@kernel.org
Eric Curtin May 6, 2024, 4:20 p.m. UTC | #18
On Mon, 6 May 2024 at 17:13, Marc Zyngier <maz@kernel.org> wrote:
>
> On Mon, 06 May 2024 12:21:40 +0100,
> Sergio Lopez Pascual <slp@redhat.com> wrote:
> >
> > Eric Curtin <ecurtin@redhat.com> writes:
> >
> > > On Fri, 19 Apr 2024 at 18:08, Will Deacon <will@kernel.org> wrote:
> > >>
> > >> On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
> > >> > On 2024/04/11 22:28, Will Deacon wrote:
> > >> > >   * Some binaries in a distribution exhibit instability which goes away
> > >> > >     in TSO mode, so a taskset-like program is used to run them with TSO
> > >> > >     enabled.
> > >> >
> > >> > Since the flag is cleared on execve, this third one isn't generally
> > >> > possible as far as I know.
> > >>
> > >> Ah ok, I'd missed that. Thanks.
> > >>
> > >> > > In all these cases, we end up with native arm64 applications that will
> > >> > > either fail to load or will crash in subtle ways on CPUs without the TSO
> > >> > > feature. Assuming that the application cannot be fixed, a better
> > >> > > approach would be to recompile using stronger instructions (e.g.
> > >> > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> > >> > > true that some existing CPUs are TSO by design (this is a perfectly
> > >> > > valid implementation of the arm64 memory model), but I think there's a
> > >> > > big difference between quietly providing more ordering guarantees than
> > >> > > software may be relying on and providing a mechanism to discover,
> > >> > > request and ultimately rely upon the stronger behaviour.
> > >> >
> > >> > The problem is "just" using stronger instructions is much more
> > >> > expensive, as emulators have demonstrated. If TSO didn't serve a
> > >> > practical purpose I wouldn't be submitting this, but it does. This is
> > >> > basically non-negotiable for x86 emulation; if this is rejected
> > >> > upstream, it will forever live as a downstream patch used by the entire
> > >> > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
> > >> > explicitly targeting, given our efforts with microVMs for 4K page size
> > >> > support and the upcoming Vulkan drivers).
> >
> > In addition to the use case Hector exposed here, there's another,
> > potentially larger one, which is running x86_64 containers on aarch64
> > systems, using a combination of both Virtualization and emulation.
> >
> > In this scenario, both not being able to use TSO for emulation
> > and having to enable it all the time for the whole VM have a very large
> > impact on performance (~25% on some workloads).
>
> Well, there is always a price to pay somewhere, and this is the usual
> trade-off between performance and maintainability.
>
> > I understand the concern about the risk of userspace fragmentation, but
> > I was wondering if we could minimize it to an acceptable level by
> > narrowing down the context. For instance, since both use cases we're
> > bringing to the table imply the use of Virtualization, we should be able
> > to restrict PR_SET_MEM_MODEL to only be accepted when running on EL1
> > (and not in nVHE nor pKVM), returning EINVAL otherwise. This would
> > heavily discourage users from relying on this feature for native
> > applications that can run on arbitrary contexts, hence drastically
> > reducing the fragmentation risk.
>
> As I explained in another sub-thread[1], I am not prepared to allow
> non architectural state to be exposed to a guest.  I'm also not
> prepared to make significant ABI differences between VHE, nVHE, hVHE,
> with or without pKVM, because the job of the kernel is to abstract
> those differences.
>
> > We would still need a way to ensure the trap gets to the VMM and for
> > the VMM to operate on the impdef ACTLR_EL12, but that should be dealt with
> > in a different series.
>
> The VMM can't use ACTLR_EL12, by the very definition of this register
> (the clue is in the name).  You'd have to proxy the write in the
> kernel and context-switch it, which means adding non-architectural
> state to KVM, breaking VM migration and adding more kludges to the
> existing Apple-specific host crap.
>
> Also, let's realise that we are talking about making significant
> changes to the arm64 ABI for a platform that is still not fully
> supported in the upstream kernel. I have the feeling that changing the

Note that there are two use cases for this today: bare-metal Linux on Apple
Silicon devices and Linux VMs on macOS. The latter is fully supported
in the upstream kernel.

Apple Silicon devices have a sizeable Linux userbase, both because there
is a shortage of decent local ARM development machines for Linux and
because they are decent laptop/desktop SoCs in general, including for
AI. The general performance of the SoC makes it very useful.

Is mise le meas/Regards,

Eric Curtin

> memory model dynamically may not be of the utmost priority until then.
>
> Thanks,
>
>         M.
>
> [1] https://lore.kernel.org/all/867cgcqrb9.wl-maz@kernel.org
>
> --
> Without deviation from the norm, progress is not possible.
>
Sergio Lopez May 6, 2024, 10:04 p.m. UTC | #19
Marc Zyngier <maz@kernel.org> writes:

> On Mon, 06 May 2024 12:21:40 +0100,
> Sergio Lopez Pascual <slp@redhat.com> wrote:
>>
>> Eric Curtin <ecurtin@redhat.com> writes:
>>
>> > On Fri, 19 Apr 2024 at 18:08, Will Deacon <will@kernel.org> wrote:
>> >>
>> >> On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
>> >> > On 2024/04/11 22:28, Will Deacon wrote:
>> >> > >   * Some binaries in a distribution exhibit instability which goes away
>> >> > >     in TSO mode, so a taskset-like program is used to run them with TSO
>> >> > >     enabled.
>> >> >
>> >> > Since the flag is cleared on execve, this third one isn't generally
>> >> > possible as far as I know.
>> >>
>> >> Ah ok, I'd missed that. Thanks.
>> >>
>> >> > > In all these cases, we end up with native arm64 applications that will
>> >> > > either fail to load or will crash in subtle ways on CPUs without the TSO
>> >> > > feature. Assuming that the application cannot be fixed, a better
>> >> > > approach would be to recompile using stronger instructions (e.g.
>> >> > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
>> >> > > true that some existing CPUs are TSO by design (this is a perfectly
>> >> > > valid implementation of the arm64 memory model), but I think there's a
>> >> > > big difference between quietly providing more ordering guarantees than
>> >> > > software may be relying on and providing a mechanism to discover,
>> >> > > request and ultimately rely upon the stronger behaviour.
>> >> >
>> >> > The problem is "just" using stronger instructions is much more
>> >> > expensive, as emulators have demonstrated. If TSO didn't serve a
>> >> > practical purpose I wouldn't be submitting this, but it does. This is
>> >> > basically non-negotiable for x86 emulation; if this is rejected
>> >> > upstream, it will forever live as a downstream patch used by the entire
>> >> > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
>> >> > explicitly targeting, given our efforts with microVMs for 4K page size
>> >> > support and the upcoming Vulkan drivers).
>>
>> In addition to the use case Hector exposed here, there's another,
>> potentially larger one, which is running x86_64 containers on aarch64
>> systems, using a combination of both Virtualization and emulation.
>>
>> In this scenario, both not being able to use TSO for emulation
>> and having to enable it all the time for the whole VM have a very large
>> impact on performance (~25% on some workloads).
>
> Well, there is always a price to pay somewhere, and this is the usual
> trade-off between performance and maintainability.

Yes, and given that the impact on performance is so big, I honestly
think it's worth exploring a bit if there's an option that could keep
the maintenance cost at an acceptable level.

>> I understand the concern about the risk of userspace fragmentation, but
>> I was wondering if we could minimize it to an acceptable level by
>> narrowing down the context. For instance, since both use cases we're
>> bringing to the table imply the use of Virtualization, we should be able
>> to restrict PR_SET_MEM_MODEL to only be accepted when running on EL1
>> (and not in nVHE nor pKVM), returning EINVAL otherwise. This would
>> heavily discourage users from relying on this feature for native
>> applications that can run on arbitrary contexts, hence drastically
>> reducing the fragmentation risk.
>
> As I explained in another sub-thread[1], I am not prepared to allow
> non architectural state to be exposed to a guest.  I'm also not
> prepared to make significant ABI differences between VHE, nVHE, hVHE,
> with or without pKVM, because the job of the kernel is to abstract
> those differences.

I understand, makes sense.

>> We would still need a way to ensure the trap gets to the VMM and for
>> the VMM to operate on the impdef ACTLR_EL12, but that should be dealt on
>> a different series.
>
> The VMM can't use ACTLR_EL12, by the very definition of this register
> (the clue is in the name).  You'd have to proxy the write in the
> kernel and context-switch it, which means adding non-architectural
> state to KVM, breaking VM migration and adding more kludges to the
> existing Apple-specific host crap.

I know, I just didn't want to go into details here, because this series
is not touching any of that. But since we're already there, I'd like to
ask you: do you think it would be possible and reasonable to deal with
IMPDEF registers outside of KVM, from a platform-specific module,
treating it like a paravirt feature?

In fact, if that would be acceptable, what if we treated this whole
feature as a platform-specific knob leaving both the ARM64 ABI and KVM
(mostly) aside?

I'm thinking of something along the lines of this:

- Host side:

  * Having vcpu load/put calling into some platform-specific module that
    would be in charge of keeping track of the desired state for a
    particular context and adjusting ACTLR_EL12 as needed, relieving KVM
    from this task and avoiding polluting its structs with
    non-architectural state.

  * Either having a kernel handler for the TACR trap that would call into
    the platform-specific module, or allowing the VMM to request the
    kernel to exit to it when that trap is triggered. The latter would
    also require the module to expose a device node with an ioctl
    interface (independent from KVM's) for the VMM to request the
    desired TSO strategy for a particular thread.

  * An alternative to the previous point could be letting the VMM
    request that KVM start a VM with HCR_EL2.TACR = 0. This one
    would be way cheaper in CPU time, and would reduce the
    platform-specific module's job to just saving/restoring ACTLR_EL12
    for that context, but I guess it could potentially introduce some
    undesired variance between VM configurations. I'm honestly open to
    both options, please let me know if you find one to be better for
    KVM.

- Guest side:

  * Wiring __switch_to() to also call the platform-specific module. Akin
    to what happens with KVM, this one would be in charge of keeping
    track of the threads that want TSO enabled, adjusting ACTLR_EL1
    accordingly.

  * Having the platform-specific module expose a device node with an
    ioctl interface for userspace applications to request TSO to be
    enabled for the current thread.

I think an approach like this would address the ARM64 userspace
fragmentation concerns, relieve KVM from carrying a platform-specific
burden and reduce the maintenance costs to a reasonable level. WDYT?
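
For illustration only, the guest-side bookkeeping described above could
be modelled in plain C roughly as follows. This is a userspace sketch,
not kernel code: hyp_thread, hyp_switch_to(), fake_actlr and the bit
position HYP_ACTLR_TSO_BIT are all hypothetical names and values, and a
real module would read and write ACTLR_EL1 instead of a variable.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit position, chosen for illustration only (assumption). */
#define HYP_ACTLR_TSO_BIT (UINT64_C(1) << 1)

/* Hypothetical per-thread state kept by the platform-specific module. */
struct hyp_thread {
    bool wants_tso;
};

/* Stand-ins for the real register and for counting writes to it. */
static uint64_t fake_actlr;
static unsigned int fake_actlr_writes;

/* Compute the register value a given thread wants, leaving other bits alone. */
static uint64_t hyp_actlr_for_thread(uint64_t actlr, const struct hyp_thread *t)
{
    return t->wants_tso ? (actlr | HYP_ACTLR_TSO_BIT)
                        : (actlr & ~HYP_ACTLR_TSO_BIT);
}

/* Hook that a context switch would call; only writes when the value changes. */
static void hyp_switch_to(const struct hyp_thread *next)
{
    uint64_t want = hyp_actlr_for_thread(fake_actlr, next);

    if (want != fake_actlr) {
        fake_actlr = want;
        fake_actlr_writes++;
    }
}
```

Skipping the write when nothing changes keeps the cost of the hook close
to zero for workloads where no thread ever asks for TSO.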

> Also, let's realise that we are talking about making significant
> changes to the arm64 ABI for a platform that is still not fully
> supported in the upstream kernel. I have the feeling that changing the
> memory model dynamically may not be of the utmost priority until then.

Please note this feature will also be used by Linux running in a VM on
macOS under Hypervisor.framework, so Asahi isn't the only platform. This
significantly increases the number of users who could benefit from
emulators being able to operate the TSO knob.

Thanks,
Sergio.
Alex Bennée May 7, 2024, 10:24 a.m. UTC | #20
Will Deacon <will@kernel.org> writes:

> Hi Hector,
>
> On Thu, Apr 11, 2024 at 09:51:19AM +0900, Hector Martin wrote:
>> x86 CPUs implement a stricter memory modern than ARM64 (TSO). For this
>> reason, x86 emulation on baseline ARM64 systems requires very expensive
>> memory model emulation. Having hardware that supports this natively is
>> therefore very attractive. Such hardware, in fact, exists. This series
>> adds support for userspace to identify when TSO is available and
>> toggle it on, if supported.
>
> I'm probably going to make myself hugely unpopular here, but I have a
> strong objection to this patch series as it stands. I firmly believe
> that providing a prctl() to query and toggle the memory model to/from
> TSO is going to lead to subtle fragmentation of arm64 Linux userspace.
>
> It's not difficult to envisage this TSO switch being abused for native
> arm64 applications:
>
>   * A program no longer crashes when TSO is enabled, so the developer
>     just toggles TSO to meet a deadline.
>
>   * Some legacy x86 sources are being ported to arm64 but concurrency
>     is hard so the developer just enables TSO to (mostly) avoid thinking
>     about it.
>
>   * Some binaries in a distribution exhibit instability which goes away
>     in TSO mode, so a taskset-like program is used to run them with TSO
>     enabled.

These all just seem like cases of engineers hiding from their very real
problems. I don't know if it's really the kernel's place to avoid giving
them the foot gun. Would it assuage your concerns at all if we set a
taint flag so bug reports/core dumps indicated we were in a
non-architectural memory mode?

> In all these cases, we end up with native arm64 applications that will
> either fail to load or will crash in subtle ways on CPUs without the TSO
> feature. Assuming that the application cannot be fixed, a better
> approach would be to recompile using stronger instructions (e.g.
> LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> true that some existing CPUs are TSO by design (this is a perfectly
> valid implementation of the arm64 memory model), but I think there's a
> big difference between quietly providing more ordering guarantees than
> software may be relying on and providing a mechanism to discover,
> request and ultimately rely upon the stronger behaviour.

I think the main use case here is for emulation. When we run x86-on-arm
in QEMU we do currently insert lots of extra barrier instructions on
every load and store. If we can probe and set a TSO mode I can assure
you we'll do the right thing ;-)
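
A minimal sketch of what "probe and set" could look like from an
emulator, assuming the prctl interface proposed in patch 1. The option
numbers and the convention that PR_GET_MEM_MODEL returns the current
model are assumptions taken from this series and are not in mainline
headers; try_enable_tso() and need_explicit_barriers() are hypothetical
helper names. Whether every barrier can really be dropped once TSO is
on is a separate question this sketch does not answer.

```c
#include <stdbool.h>
#include <sys/prctl.h>

/* Assumed values from patch 1 of this series; not part of mainline. */
#ifndef PR_GET_MEM_MODEL
#define PR_GET_MEM_MODEL 0x6d4d444c
#endif
#ifndef PR_SET_MEM_MODEL
#define PR_SET_MEM_MODEL 0x4d4d444c
#endif
#ifndef PR_SET_MEM_MODEL_DEFAULT
#define PR_SET_MEM_MODEL_DEFAULT 0
#endif
#ifndef PR_SET_MEM_MODEL_TSO
#define PR_SET_MEM_MODEL_TSO 1
#endif

/*
 * Ask for TSO on the calling thread and read the setting back.
 * Returns false on any failure, including kernels that do not know
 * these prctl options at all.
 */
static bool try_enable_tso(void)
{
    if (prctl(PR_SET_MEM_MODEL, (unsigned long)PR_SET_MEM_MODEL_TSO,
              0UL, 0UL, 0UL) != 0)
        return false;

    return prctl(PR_GET_MEM_MODEL, 0UL, 0UL, 0UL, 0UL) == PR_SET_MEM_MODEL_TSO;
}

/* Code generator policy: keep emitting ordering unless TSO is confirmed. */
static bool need_explicit_barriers(bool tso_enabled)
{
    return !tso_enabled;
}
```

The default stays on the safe side: any error from prctl() leaves the
emulator generating the same barriers it does today.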
Ard Biesheuvel May 7, 2024, 2:52 p.m. UTC | #21
On Tue, 7 May 2024 at 12:24, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Will Deacon <will@kernel.org> writes:
>
> > Hi Hector,
> >
> > On Thu, Apr 11, 2024 at 09:51:19AM +0900, Hector Martin wrote:
> >> x86 CPUs implement a stricter memory modern than ARM64 (TSO). For this
> >> reason, x86 emulation on baseline ARM64 systems requires very expensive
> >> memory model emulation. Having hardware that supports this natively is
> >> therefore very attractive. Such hardware, in fact, exists. This series
> >> adds support for userspace to identify when TSO is available and
> >> toggle it on, if supported.
> >
> > I'm probably going to make myself hugely unpopular here, but I have a
> > strong objection to this patch series as it stands. I firmly believe
> > that providing a prctl() to query and toggle the memory model to/from
> > TSO is going to lead to subtle fragmentation of arm64 Linux userspace.
> >
> > It's not difficult to envisage this TSO switch being abused for native
> > arm64 applications:
> >
> >   * A program no longer crashes when TSO is enabled, so the developer
> >     just toggles TSO to meet a deadline.
> >
> >   * Some legacy x86 sources are being ported to arm64 but concurrency
> >     is hard so the developer just enables TSO to (mostly) avoid thinking
> >     about it.
> >
> >   * Some binaries in a distribution exhibit instability which goes away
> >     in TSO mode, so a taskset-like program is used to run them with TSO
> >     enabled.
>
> These all just seem like cases of engineers hiding from their very real
> problems. I don't know if its really the kernels place to avoid giving
> them the foot gun. Would it assuage your concerns at all if we set a
> taint flag so bug reports/core dumps indicated we were in a
> non-architectural memory mode?
>
> > In all these cases, we end up with native arm64 applications that will
> > either fail to load or will crash in subtle ways on CPUs without the TSO
> > feature. Assuming that the application cannot be fixed, a better
> > approach would be to recompile using stronger instructions (e.g.
> > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> > true that some existing CPUs are TSO by design (this is a perfectly
> > valid implementation of the arm64 memory model), but I think there's a
> > big difference between quietly providing more ordering guarantees than
> > software may be relying on and providing a mechanism to discover,
> > request and ultimately rely upon the stronger behaviour.
>
> I think the main use case here is for emulation. When we run x86-on-arm
> in QEMU we do currently insert lots of extra barrier instructions on
> every load and store. If we can probe and set a TSO mode I can assure
> you we'll do the right thing ;-)
>

Without a public specification of what TSO mode actually entails,
deciding which of those barriers can be dropped is not going to be as
straightforward as you make it out to be.

Apple's TSO mode is vertically integrated with Rosetta, which means
that TSO mode provides whatever Rosetta needs to run x86 code
correctly, and that it could mean different things on different
generations of the micro-architecture. And whether Apple's TSO is the
same as Fujitsu's is anyone's guess afaik.

Running a game and seeing it perform better is great, but it is not
the kind of rigor we usually attempt to apply when adding support for
architectural features. Hopefully, there will be some architectural
support for this in the future, but without any spec that defines the
memory model it implements, I am not convinced we should merge this.
Catalin Marinas May 9, 2024, 11:13 a.m. UTC | #22
On Tue, May 07, 2024 at 04:52:30PM +0200, Ard Biesheuvel wrote:
> On Tue, 7 May 2024 at 12:24, Alex Bennée <alex.bennee@linaro.org> wrote:
> > I think the main use case here is for emulation. When we run x86-on-arm
> > in QEMU we do currently insert lots of extra barrier instructions on
> > every load and store. If we can probe and set a TSO mode I can assure
> > you we'll do the right thing ;-)
> 
> Without a public specification of what TSO mode actually entails,
> deciding which of those barriers can be dropped is not going to be as
> straight-forward as you make it out to be.
> 
> Apple's TSO mode is vertically integrated with Rosetta, which means
> that TSO mode provides whatever Rosetta needs to run x86 code
> correctly, and that it could mean different things on different
> generations of the micro-architecture. And whether Apple's TSO is the
> same as Fujitsu's is anyone's guess afaik.

Indeed. Apart from using impdef registers, that's what I think is the
second biggest problem with this feature (and the corresponding
patches). We don't know the precise memory model, we can't tell whether
this TSO bit is stored in the TLB. If it is, is it per ASID/VMID? The
other problem Marc raised: what is the memory model between two CPUs
where only one has the TSO bit set? Does it only break the TSO model or is
there a chance that it also breaks the default relaxed model? What other
TSO flavours are out there, how do they compare with the Apple one?

> Running a game and seeing it perform better is great, but it is not
> the kind of rigor we usually attempt to apply when adding support for
> architectural features. Hopefully, there will be some architectural
> support for this in the future, but without any spec that defines the
> memory model it implements, I am not convinced we should merge this.

There is FEAT_LRCPC (available on Apple Silicon from M2 onwards). Rather
than having a big knob to turn TSO on or off, this feature introduces
instructions that permit a code generator to get the TSO semantics in a
more efficient way (e.g. using LDAPR+STLR instead of the stricter
LDAR+STLR; not sure how well these are implemented on the Apple
Silicon). There are further improvements in FEAT_LRCPC{2,3} (with the
latter adding support for SIMD but not available in hardware yet). So
the direction from Arm is pretty clear, acknowledging that there is a
need for such TSO emulation but not in the way of undocumented impdef
registers. Whether more is needed here, I guess people working on
emulators could reach out to Arm or CPU vendors with suggestions (the
path to the architects is not straightforward, usually legal has a say,
but it's doable, there are formal channels already).

I see the impdef hardware TSO options as temporary until CPU
implementations catch up to architected FEAT_LRCPC*. Given the problems
already stated in this thread, I think such hacks should be carried
downstream and (hopefully) will eventually vanish. Maybe those TSO knobs
currently make an emulation faster than FEAT_LRCPC* but that's feedback
to go to the microarchitects on the implementation (or architects on
what other instructions should be covered).
Neal Gompa May 9, 2024, 12:31 p.m. UTC | #23
On Thu, May 9, 2024 at 5:13 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Tue, May 07, 2024 at 04:52:30PM +0200, Ard Biesheuvel wrote:
> > On Tue, 7 May 2024 at 12:24, Alex Bennée <alex.bennee@linaro.org> wrote:
> > > I think the main use case here is for emulation. When we run x86-on-arm
> > > in QEMU we do currently insert lots of extra barrier instructions on
> > > every load and store. If we can probe and set a TSO mode I can assure
> > > you we'll do the right thing ;-)
> >
> > Without a public specification of what TSO mode actually entails,
> > deciding which of those barriers can be dropped is not going to be as
> > straight-forward as you make it out to be.
> >
> > Apple's TSO mode is vertically integrated with Rosetta, which means
> > that TSO mode provides whatever Rosetta needs to run x86 code
> > correctly, and that it could mean different things on different
> > generations of the micro-architecture. And whether Apple's TSO is the
> > same as Fujitsu's is anyone's guess afaik.
>
> Indeed. Apart from using impdef registers, that's what I think is the
> second biggest problem with this feature (and the corresponding
> patches). We don't know the precise memory model, we can't tell whether
> this TSO bit is stored in the TLB. If it is, is it per ASID/VMID? The
> other problem Marc raised is what memory model is between two CPUs where
> only one has the TSO bit set? Does it only break the TSO model or is
> there a chance that it also breaks the default relaxed model? What other
> TSO flavours are out there, how do they compare with the Apple one?
>
> > Running a game and seeing it perform better is great, but it is not
> > the kind of rigor we usually attempt to apply when adding support for
> > architectural features. Hopefully, there will be some architectural
> > support for this in the future, but without any spec that defines the
> > memory model it implements, I am not convinced we should merge this.
>
> There is FEAT_LRCPC (available on Apple Silicon from M2 onwards). Rather
> than having a big knob to turn TSO on or off, this feature introduces
> instructions that permit a code generator to get the TSO semantics in a
> more efficient way (e.g. using LDAPR+STLR instead of the stricter
> LDAR+STLR; not sure how well these are implemented on the Apple
> Silicon). There are further improvements in FEAT_LRCPC{2,3} (with the
> latter adding support for SIMD but not available in hardware yet). So
> the direction from Arm is pretty clear, acknowledging that there is a
> need for such TSO emulation but not in the way of undocumented impdef
> registers. Whether more is needed here, I guess people working on
> emulators could reach out to Arm or CPU vendors with suggestions (the
> path to the architects is not straightforward, usually legal has a say,
> but it's doable, there are formal channels already).
>
> I see the impdef hardware TSO options as temporary until CPU
> implementations catch up to architected FEAT_LRCPC*. Given the problems
> already stated in this thread, I think such hacks should be carried
> downstream and (hopefully) will eventually vanish. Maybe those TSO knobs
> currently make an emulation faster than FEAT_LRCPC* but that's feedback
> to go to the microarchitects on the implementation (or architects on
> what other instructions should be covered).
>

They cannot ever "vanish" because we are supporting every Mx platform
back to the first one. The M1 series will never have FEAT_LRCPC.

I do not think it is unreasonable to support this method when we know
what the CPU platform is and FEAT_LRCPC does not exist.



--
真実はいつも一つ!/ Always, there's only one truth!
Catalin Marinas May 9, 2024, 12:56 p.m. UTC | #24
On Thu, May 09, 2024 at 06:31:04AM -0600, Neal Gompa wrote:
> On Thu, May 9, 2024 at 5:13 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > I see the impdef hardware TSO options as temporary until CPU
> > implementations catch up to architected FEAT_LRCPC*. Given the problems
> > already stated in this thread, I think such hacks should be carried
> > downstream and (hopefully) will eventually vanish. Maybe those TSO knobs
> > currently make an emulation faster than FEAT_LRCPC* but that's feedback
> > to go to the microarchitects on the implementation (or architects on
> > what other instructions should be covered).
> 
> They cannot ever "vanish" because we are supporting every Mx platform
> back to the first one. The M1 series will never have FEAT_LRCPC.

Well, you missed "eventually". It depends on the timeline you have in
mind but, say, 15 years from now there may not be many M1s around to be
worth maintaining these patches out-of-tree (and they don't make sense
in-tree either because of the lack of standardisation).

> I do not think it is unreasonable to support this method when we know
> what the CPU platform is and FEAT_LRCPC does not exist.

If you want a portable emulator, you'd better start supporting FEAT_LRCPC*
(I think FEX does this), ideally detected at run-time with a fallback to
RCsc. Whether, additionally, you want to support the non-portable Apple
TSO with out-of-tree patches is up to you.
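
As a rough sketch of the run-time detection mentioned above: on arm64
Linux the kernel reports FEAT_LRCPC and FEAT_LRCPC2 through the
HWCAP_LRCPC and HWCAP_ILRCPC bits of AT_HWCAP. The enum and function
names below are hypothetical, and the fallback #defines only exist so
the snippet also builds on hosts whose headers lack the arm64 bits.

```c
#include <sys/auxv.h>

/* arm64 hwcap bits; defined here only if the host headers lack them. */
#ifndef HWCAP_LRCPC
#define HWCAP_LRCPC  (1UL << 15)
#endif
#ifndef HWCAP_ILRCPC
#define HWCAP_ILRCPC (1UL << 26)
#endif

/* Hypothetical code generator strategies for x86 loads and stores. */
enum tso_strategy {
    TSO_VIA_RCSC,   /* LDAR/STLR: always available, strictest */
    TSO_VIA_LRCPC,  /* LDAPR/STLR: needs FEAT_LRCPC */
    TSO_VIA_LRCPC2, /* LDAPUR/STLUR as well: needs FEAT_LRCPC2 */
};

/* Pure selection logic, so it can be exercised with arbitrary hwcap values. */
static enum tso_strategy pick_tso_strategy(unsigned long hwcap)
{
    if (hwcap & HWCAP_ILRCPC)
        return TSO_VIA_LRCPC2;
    if (hwcap & HWCAP_LRCPC)
        return TSO_VIA_LRCPC;
    return TSO_VIA_RCSC;
}

/* Query the running system; anything that is not arm64 gets the fallback. */
static enum tso_strategy detect_tso_strategy(void)
{
#if defined(__aarch64__)
    return pick_tso_strategy(getauxval(AT_HWCAP));
#else
    return TSO_VIA_RCSC;
#endif
}
```

An emulator would call detect_tso_strategy() once at start-up and use
the result to choose which load/store sequences its translator emits.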