mbox series

[RFC,0/5] RISC-V: Add dynamic TSO support

Message ID 20231124072142.2786653-1-christoph.muellner@vrull.eu (mailing list archive)
Headers show
Series RISC-V: Add dynamic TSO support | expand

Message

Christoph Müllner Nov. 24, 2023, 7:21 a.m. UTC
From: Christoph Müllner <christoph.muellner@vrull.eu>

The upcoming RISC-V Ssdtso specification introduces a bit in the senvcfg
CSR to switch the memory consistency model at run-time from RVWMO to TSO
(and back). The active consistency model can therefore be switched on a
per-hart base and managed by the kernel on a per-process/thread base.

This patch implements basic Ssdtso support and adds a prctl API on top
so that user-space processes can switch to a stronger memory consistency
model (than the kernel was written for) at run-time.

I am not sure if other architectures support switching the memory
consistency model at run-time, but designing the prctl API in an
arch-independent way allows reusing it in the future.

The patchset also comes with a short documentation of the prctl API.

This series is based on the second draft of the Ssdtso specification
which was published recently on an RVI list:
  https://lists.riscv.org/g/tech-arch-review/message/183
Note, that the Ssdtso specification is in development state
(i.e., not frozen or even ratified) which is also the reason
why I marked the series as RFC.

One aspect that is not covered in this patchset is virtualization.
It is planned to add virtualization support in a later version.
Hints/suggestions on how to implement this part are very much
appreciated.

Christoph Müllner (5):
  RISC-V: Add basic Ssdtso support
  RISC-V: Expose Ssdtso via hwprobe API
  uapi: prctl: Add new prctl call to set/get the memory consistency
    model
  RISC-V: Implement prctl call to set/get the memory consistency model
  RISC-V: selftests: Add DTSO tests

 Documentation/arch/riscv/hwprobe.rst          |  3 +
 .../mm/dynamic-memory-consistency-model.rst   | 76 ++++++++++++++++++
 arch/riscv/Kconfig                            | 10 +++
 arch/riscv/include/asm/csr.h                  |  1 +
 arch/riscv/include/asm/dtso.h                 | 74 ++++++++++++++++++
 arch/riscv/include/asm/hwcap.h                |  1 +
 arch/riscv/include/asm/processor.h            |  8 ++
 arch/riscv/include/asm/switch_to.h            |  3 +
 arch/riscv/include/uapi/asm/hwprobe.h         |  1 +
 arch/riscv/kernel/Makefile                    |  1 +
 arch/riscv/kernel/cpufeature.c                |  1 +
 arch/riscv/kernel/dtso.c                      | 33 ++++++++
 arch/riscv/kernel/process.c                   |  4 +
 arch/riscv/kernel/sys_riscv.c                 |  1 +
 include/uapi/linux/prctl.h                    |  5 ++
 kernel/sys.c                                  | 12 +++
 tools/testing/selftests/riscv/Makefile        |  2 +-
 tools/testing/selftests/riscv/dtso/.gitignore |  1 +
 tools/testing/selftests/riscv/dtso/Makefile   | 11 +++
 tools/testing/selftests/riscv/dtso/dtso.c     | 77 +++++++++++++++++++
 20 files changed, 324 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/mm/dynamic-memory-consistency-model.rst
 create mode 100644 arch/riscv/include/asm/dtso.h
 create mode 100644 arch/riscv/kernel/dtso.c
 create mode 100644 tools/testing/selftests/riscv/dtso/.gitignore
 create mode 100644 tools/testing/selftests/riscv/dtso/Makefile
 create mode 100644 tools/testing/selftests/riscv/dtso/dtso.c

Comments

Peter Zijlstra Nov. 24, 2023, 10:15 a.m. UTC | #1
On Fri, Nov 24, 2023 at 08:21:37AM +0100, Christoph Muellner wrote:
> From: Christoph Müllner <christoph.muellner@vrull.eu>
> 
> The upcoming RISC-V Ssdtso specification introduces a bit in the senvcfg
> CSR to switch the memory consistency model at run-time from RVWMO to TSO
> (and back). The active consistency model can therefore be switched on a
> per-hart base and managed by the kernel on a per-process/thread base.

You guys, computers are hartless, nobody told ya?

> This patch implements basic Ssdtso support and adds a prctl API on top
> so that user-space processes can switch to a stronger memory consistency
> model (than the kernel was written for) at run-time.
> 
> I am not sure if other architectures support switching the memory
> consistency model at run-time, but designing the prctl API in an
> arch-independent way allows reusing it in the future.

IIRC some Sparc chips could do this, but I don't think anybody ever
exposed this to userspace (or used it much).

IA64 had planned to do this, except they messed it up and did it the
wrong way around (strong first and then relax it later), which lead to
the discovery that all existing software broke (d'uh).

I think ARM64 approached this problem by adding the
load-acquire/store-release instructions and for TSO based code,
translate into those (eg. x86 -> arm64 transpilers).

IIRC Risc-V actually has such instructions as well, so *why* are you
doing this?!?!
Christoph Müllner Nov. 24, 2023, 10:53 a.m. UTC | #2
On Fri, Nov 24, 2023 at 11:15 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Nov 24, 2023 at 08:21:37AM +0100, Christoph Muellner wrote:
> > From: Christoph Müllner <christoph.muellner@vrull.eu>
> >
> > The upcoming RISC-V Ssdtso specification introduces a bit in the senvcfg
> > CSR to switch the memory consistency model at run-time from RVWMO to TSO
> > (and back). The active consistency model can therefore be switched on a
> > per-hart base and managed by the kernel on a per-process/thread base.
>
> You guys, computers are hartless, nobody told ya?

That's why they came up with RISC-V, the ISA with hart!

> > This patch implements basic Ssdtso support and adds a prctl API on top
> > so that user-space processes can switch to a stronger memory consistency
> > model (than the kernel was written for) at run-time.
> >
> > I am not sure if other architectures support switching the memory
> > consistency model at run-time, but designing the prctl API in an
> > arch-independent way allows reusing it in the future.
>
> IIRC some Sparc chips could do this, but I don't think anybody ever
> exposed this to userspace (or used it much).
>
> IA64 had planned to do this, except they messed it up and did it the
> wrong way around (strong first and then relax it later), which lead to
> the discovery that all existing software broke (d'uh).
>
> I think ARM64 approached this problem by adding the
> load-acquire/store-release instructions and for TSO based code,
> translate into those (eg. x86 -> arm64 transpilers).
>
> IIRC Risc-V actually has such instructions as well, so *why* are you
> doing this?!?!

Not needing a transpiler is already a benefit.
And the DTSO approach also covers the cases where transpilers can't be used
(e.g. binary-only executables or libraries).

We are also working on extending ld.so such, that it switches to DTSO
(if available) in case the user wants to start an executable that was
compiled for Ztso or loads a library that was compiled for Ztso.
This would utilize the API that is introduced in this patchset.
Peter Zijlstra Nov. 24, 2023, 11:49 a.m. UTC | #3
On Fri, Nov 24, 2023 at 11:53:06AM +0100, Christoph Müllner wrote:

> > I think ARM64 approached this problem by adding the
> > load-acquire/store-release instructions and for TSO based code,
> > translate into those (eg. x86 -> arm64 transpilers).
> >
> > IIRC Risc-V actually has such instructions as well, so *why* are you
> > doing this?!?!
> 
> Not needing a transpiler is already a benefit.

This don't make sense, native risc-v stuff knows about the weak stuff,
its your natve model. The only reason you would ever need this dynamic
TSO stuff, is if you're going to run code that's written for some other
platform (notably x86).

> And the DTSO approach also covers the cases where transpilers can't be used
> (e.g. binary-only executables or libraries).

Uhh.. have you looked at the x86-on-arm64 things? That's all binary to
binary magic.

> We are also working on extending ld.so such, that it switches to DTSO
> (if available) in case the user wants to start an executable that was
> compiled for Ztso or loads a library that was compiled for Ztso.
> This would utilize the API that is introduced in this patchset.

I mean, sure, but *why* would you do this to your users? Who would want
to build a native risc-v tso binary?
Peter Zijlstra Nov. 24, 2023, 11:54 a.m. UTC | #4
On Fri, Nov 24, 2023 at 12:04:09PM +0100, Jonas Oberhauser wrote:

> > I think ARM64 approached this problem by adding the
> > load-acquire/store-release instructions and for TSO based code,
> > translate into those (eg. x86 -> arm64 transpilers).
> 
> 
> Although those instructions have a bit more ordering constraints.
> 
> I have heard rumors that the apple chips also have a register that can be
> set at runtime.

Oh, I thought they made do with the load-acquire/store-release thingies.
But to be fair, I haven't been paying *that* much attention to the apple
stuff.

I did read about how they fudged some of the x86 flags thing.

> And there are some IBM machines that have a setting, but not sure how it is
> controlled.

Cute, I'm assuming this is the Power series (s390 already being TSO)? I
wasn't aware they had this.

> > IIRC Risc-V actually has such instructions as well, so *why* are you
> > doing this?!?!
> 
> 
> Unfortunately, at least last time I checked RISC-V still hadn't gotten such
> instructions.
> What they have is the *semantics* of the instructions, but no actual opcodes
> to encode them.

Well, that sucks..

> I argued for them in the RISC-V memory group, but it was considered to be
> outside the scope of that group.
> 
> Transpiling with sufficient DMB ISH to get the desired ordering is really
> bad for performance.

Ha!, quite dreadful I would imagine.

> That is not to say that linux should support this. Perhaps linux should
> pressure RISC-V into supporting implicit barriers instead.

I'm not sure I count for much in this regard, but yeah, that sounds like
a plan :-)
Michael Ellerman Nov. 24, 2023, 1:05 p.m. UTC | #5
Peter Zijlstra <peterz@infradead.org> writes:
> On Fri, Nov 24, 2023 at 12:04:09PM +0100, Jonas Oberhauser wrote:
>
>> > I think ARM64 approached this problem by adding the
>> > load-acquire/store-release instructions and for TSO based code,
>> > translate into those (eg. x86 -> arm64 transpilers).
>> 
>> 
>> Although those instructions have a bit more ordering constraints.
>> 
>> I have heard rumors that the apple chips also have a register that can be
>> set at runtime.
>
> Oh, I thought they made do with the load-acquire/store-release thingies.
> But to be fair, I haven't been paying *that* much attention to the apple
> stuff.
>
> I did read about how they fudged some of the x86 flags thing.
>
>> And there are some IBM machines that have a setting, but not sure how it is
>> controlled.
>
> Cute, I'm assuming this is the Power series (s390 already being TSO)? I
> wasn't aware they had this.

Are you referring to Strong Access Ordering? That is a per-page
attribute, not a CPU mode, and was removed in ISA v3.1 anyway.

cheers
Guo Ren Nov. 25, 2023, 2:51 a.m. UTC | #6
On Fri, Nov 24, 2023 at 11:15:19AM +0100, Peter Zijlstra wrote:
> On Fri, Nov 24, 2023 at 08:21:37AM +0100, Christoph Muellner wrote:
> > From: Christoph Müllner <christoph.muellner@vrull.eu>
> > 
> > The upcoming RISC-V Ssdtso specification introduces a bit in the senvcfg
> > CSR to switch the memory consistency model at run-time from RVWMO to TSO
> > (and back). The active consistency model can therefore be switched on a
> > per-hart base and managed by the kernel on a per-process/thread base.
> 
> You guys, computers are hartless, nobody told ya?
> 
> > This patch implements basic Ssdtso support and adds a prctl API on top
> > so that user-space processes can switch to a stronger memory consistency
> > model (than the kernel was written for) at run-time.
> > 
> > I am not sure if other architectures support switching the memory
> > consistency model at run-time, but designing the prctl API in an
> > arch-independent way allows reusing it in the future.
> 
> IIRC some Sparc chips could do this, but I don't think anybody ever
> exposed this to userspace (or used it much).
> 
> IA64 had planned to do this, except they messed it up and did it the
> wrong way around (strong first and then relax it later), which lead to
> the discovery that all existing software broke (d'uh).
> 
> I think ARM64 approached this problem by adding the
> load-acquire/store-release instructions and for TSO based code,
> translate into those (eg. x86 -> arm64 transpilers).
Keeping global TSO order is easier and faster than mixing
acquire/release and regular load/store. That means when ssdtso is
enabled, the transpiler's load-acquire/store-release becomes regular
load/store. Some micro-arch hardwares could speed up the performance.

Of course, you may say powerful machines could smooth out the difference
between ssdtso & load-acquire/store-release, but that's not real life.
Adding ssdtso is a flexible way to gain more choices on the cost of chip
design.

> 
> IIRC Risc-V actually has such instructions as well, so *why* are you
> doing this?!?!
>
Guo Ren Nov. 26, 2023, 12:34 p.m. UTC | #7
On Fri, Nov 24, 2023 at 12:54:30PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 24, 2023 at 12:04:09PM +0100, Jonas Oberhauser wrote:
> 
> > > I think ARM64 approached this problem by adding the
> > > load-acquire/store-release instructions and for TSO based code,
> > > translate into those (eg. x86 -> arm64 transpilers).
> > 
> > 
> > Although those instructions have a bit more ordering constraints.
> > 
> > I have heard rumors that the apple chips also have a register that can be
> > set at runtime.
I could understand the rumor, smart design! Thx for sharing.

> 
> Oh, I thought they made do with the load-acquire/store-release thingies.
> But to be fair, I haven't been paying *that* much attention to the apple
> stuff.
> 
> I did read about how they fudged some of the x86 flags thing.
> 
> > And there are some IBM machines that have a setting, but not sure how it is
> > controlled.
> 
> Cute, I'm assuming this is the Power series (s390 already being TSO)? I
> wasn't aware they had this.
> 
> > > IIRC Risc-V actually has such instructions as well, so *why* are you
> > > doing this?!?!
> > 
> > 
> > Unfortunately, at least last time I checked RISC-V still hadn't gotten such
> > instructions.
> > What they have is the *semantics* of the instructions, but no actual opcodes
> > to encode them.
> 
> Well, that sucks..
> 
> > I argued for them in the RISC-V memory group, but it was considered to be
> > outside the scope of that group.
> > 
> > Transpiling with sufficient DMB ISH to get the desired ordering is really
> > bad for performance.
> 
> Ha!, quite dreadful I would imagine.
> 
> > That is not to say that linux should support this. Perhaps linux should
> > pressure RISC-V into supporting implicit barriers instead.
> 
> I'm not sure I count for much in this regard, but yeah, that sounds like
> a plan :-)
>
Conor Dooley Nov. 27, 2023, 10:36 a.m. UTC | #8
Hi,

On Fri, Nov 24, 2023 at 08:21:37AM +0100, Christoph Muellner wrote:
> From: Christoph Müllner <christoph.muellner@vrull.eu>
> 
> The upcoming RISC-V Ssdtso specification introduces a bit in the senvcfg
> CSR to switch the memory consistency model at run-time from RVWMO to TSO
> (and back). The active consistency model can therefore be switched on a
> per-hart base and managed by the kernel on a per-process/thread base.
> 
> This patch implements basic Ssdtso support and adds a prctl API on top
> so that user-space processes can switch to a stronger memory consistency
> model (than the kernel was written for) at run-time.
> 
> I am not sure if other architectures support switching the memory
> consistency model at run-time, but designing the prctl API in an
> arch-independent way allows reusing it in the future.
> 
> The patchset also comes with a short documentation of the prctl API.
> 
> This series is based on the second draft of the Ssdtso specification
> which was published recently on an RVI list:
>   https://lists.riscv.org/g/tech-arch-review/message/183
> Note, that the Ssdtso specification is in development state
> (i.e., not frozen or even ratified) which is also the reason
> why I marked the series as RFC.
> 
> One aspect that is not covered in this patchset is virtualization.
> It is planned to add virtualization support in a later version.
> Hints/suggestions on how to implement this part are very much
> appreciated.
> 
> Christoph Müllner (5):

I know this is an RFC, but it could probably do with a bit more compile
testing, as:

>   RISC-V: Add basic Ssdtso support

This patch doesn't build for rv64 allmodconfig

>   RISC-V: Expose Ssdtso via hwprobe API

This one seems to build fine

>   uapi: prctl: Add new prctl call to set/get the memory consistency
>     model
>   RISC-V: Implement prctl call to set/get the memory consistency model
>   RISC-V: selftests: Add DTSO tests

These don't build for:
rv32 defconfig
rv64 allmodconfig
rv64 nommu

Cheers,
Conor.
Peter Zijlstra Nov. 27, 2023, 11:16 a.m. UTC | #9
On Fri, Nov 24, 2023 at 09:51:53PM -0500, Guo Ren wrote:
> On Fri, Nov 24, 2023 at 11:15:19AM +0100, Peter Zijlstra wrote:
> > On Fri, Nov 24, 2023 at 08:21:37AM +0100, Christoph Muellner wrote:
> > > From: Christoph Müllner <christoph.muellner@vrull.eu>
> > > 
> > > The upcoming RISC-V Ssdtso specification introduces a bit in the senvcfg
> > > CSR to switch the memory consistency model at run-time from RVWMO to TSO
> > > (and back). The active consistency model can therefore be switched on a
> > > per-hart base and managed by the kernel on a per-process/thread base.
> > 
> > You guys, computers are hartless, nobody told ya?
> > 
> > > This patch implements basic Ssdtso support and adds a prctl API on top
> > > so that user-space processes can switch to a stronger memory consistency
> > > model (than the kernel was written for) at run-time.
> > > 
> > > I am not sure if other architectures support switching the memory
> > > consistency model at run-time, but designing the prctl API in an
> > > arch-independent way allows reusing it in the future.
> > 
> > IIRC some Sparc chips could do this, but I don't think anybody ever
> > exposed this to userspace (or used it much).
> > 
> > IA64 had planned to do this, except they messed it up and did it the
> > wrong way around (strong first and then relax it later), which lead to
> > the discovery that all existing software broke (d'uh).
> > 
> > I think ARM64 approached this problem by adding the
> > load-acquire/store-release instructions and for TSO based code,
> > translate into those (eg. x86 -> arm64 transpilers).

> Keeping global TSO order is easier and faster than mixing
> acquire/release and regular load/store. That means when ssdtso is
> enabled, the transpiler's load-acquire/store-release becomes regular
> load/store. Some micro-arch hardwares could speed up the performance.

Why is it faster? Because the release+acquire thing becomes RcSC instead
of RcTSO? Surely that can be fixed with a weaker store-release variant
ot something?

The problem I have with all of this is that you need to context switch
this state and that you need to deal with exceptions, which must be
written for the weak model but then end up running in the tso model --
possibly slower than desired.

If OTOH you only have a single model, everything becomes so much
simpler. You just need to be able to express exactly what you want.
Mark Rutland Nov. 27, 2023, 12:14 p.m. UTC | #10
On Fri, Nov 24, 2023 at 12:54:30PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 24, 2023 at 12:04:09PM +0100, Jonas Oberhauser wrote:
> > > I think ARM64 approached this problem by adding the
> > > load-acquire/store-release instructions and for TSO based code,
> > > translate into those (eg. x86 -> arm64 transpilers).
> > 
> > Although those instructions have a bit more ordering constraints.
> > 
> > I have heard rumors that the apple chips also have a register that can be
> > set at runtime.
> 
> Oh, I thought they made do with the load-acquire/store-release thingies.
> But to be fair, I haven't been paying *that* much attention to the apple
> stuff.
> 
> I did read about how they fudged some of the x86 flags thing.

I don't know what others may have built specifically, but architecturally on
arm64 we expect people to express ordering requirements through instructions.
ARMv8.0 has load-acquire and store-release, ARMv8.3 added RCpc forms of
load-acquire as part of FEAT_LRCPC, and ARMv8.4 added a number of instructions
as part of FEAT_LRCPC2.

For a number of reasons we avoid IMPLEMENTATION DEFINED controls for things
like this.

Thanks
Mark.
Christoph Müllner Nov. 27, 2023, 12:58 p.m. UTC | #11
On Mon, Nov 27, 2023 at 11:37 AM Conor Dooley
<conor.dooley@microchip.com> wrote:
>
> Hi,
>
> On Fri, Nov 24, 2023 at 08:21:37AM +0100, Christoph Muellner wrote:
> > From: Christoph Müllner <christoph.muellner@vrull.eu>
> >
> > The upcoming RISC-V Ssdtso specification introduces a bit in the senvcfg
> > CSR to switch the memory consistency model at run-time from RVWMO to TSO
> > (and back). The active consistency model can therefore be switched on a
> > per-hart base and managed by the kernel on a per-process/thread base.
> >
> > This patch implements basic Ssdtso support and adds a prctl API on top
> > so that user-space processes can switch to a stronger memory consistency
> > model (than the kernel was written for) at run-time.
> >
> > I am not sure if other architectures support switching the memory
> > consistency model at run-time, but designing the prctl API in an
> > arch-independent way allows reusing it in the future.
> >
> > The patchset also comes with a short documentation of the prctl API.
> >
> > This series is based on the second draft of the Ssdtso specification
> > which was published recently on an RVI list:
> >   https://lists.riscv.org/g/tech-arch-review/message/183
> > Note, that the Ssdtso specification is in development state
> > (i.e., not frozen or even ratified) which is also the reason
> > why I marked the series as RFC.
> >
> > One aspect that is not covered in this patchset is virtualization.
> > It is planned to add virtualization support in a later version.
> > Hints/suggestions on how to implement this part are very much
> > appreciated.
> >
> > Christoph Müllner (5):
>
> I know this is an RFC, but it could probably do with a bit more compile
> testing, as:
>
> >   RISC-V: Add basic Ssdtso support
>
> This patch doesn't build for rv64 allmodconfig
>
> >   RISC-V: Expose Ssdtso via hwprobe API
>
> This one seems to build fine
>
> >   uapi: prctl: Add new prctl call to set/get the memory consistency
> >     model
> >   RISC-V: Implement prctl call to set/get the memory consistency model
> >   RISC-V: selftests: Add DTSO tests
>
> These don't build for:
> rv32 defconfig
> rv64 allmodconfig
> rv64 nommu

Thanks for reporting this. You are absolutely right.
In my defense, this patchset was compile-tested and got some limited
run-time testing in QEMU.
But after that, I wrote the documentation, which triggered a renaming
of several function/macro names,
and these changes did not see adequate testing. I am sorry for that.

I've already fixed the patches (addressing the issues you have
reported, plus other small issues).
To not distract the ongoing discussion, I will not send an updated
patchset right now.
In case you are interested, you can find the latest changes (rebased
on upstream/master) here:
  https://github.com/cmuellner/linux/tree/ssdtso
I've also extended my local compile-test script to include all
mentioned configs.

In case you want to play a bit with these changes, you can also have a
look at the QEMU
patchset, which also got support for the prctl (which is not part of
the published mailpatch):
  https://github.com/cmuellner/qemu/tree/ssdtso
With these changes, you can run the kernel self-test binary in
user-mode emulation.

BR
Christoph
Guo Ren Nov. 28, 2023, 1:42 a.m. UTC | #12
On Mon, Nov 27, 2023 at 12:16:43PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 24, 2023 at 09:51:53PM -0500, Guo Ren wrote:
> > On Fri, Nov 24, 2023 at 11:15:19AM +0100, Peter Zijlstra wrote:
> > > On Fri, Nov 24, 2023 at 08:21:37AM +0100, Christoph Muellner wrote:
> > > > From: Christoph Müllner <christoph.muellner@vrull.eu>
> > > > 
> > > > The upcoming RISC-V Ssdtso specification introduces a bit in the senvcfg
> > > > CSR to switch the memory consistency model at run-time from RVWMO to TSO
> > > > (and back). The active consistency model can therefore be switched on a
> > > > per-hart base and managed by the kernel on a per-process/thread base.
> > > 
> > > You guys, computers are hartless, nobody told ya?
> > > 
> > > > This patch implements basic Ssdtso support and adds a prctl API on top
> > > > so that user-space processes can switch to a stronger memory consistency
> > > > model (than the kernel was written for) at run-time.
> > > > 
> > > > I am not sure if other architectures support switching the memory
> > > > consistency model at run-time, but designing the prctl API in an
> > > > arch-independent way allows reusing it in the future.
> > > 
> > > IIRC some Sparc chips could do this, but I don't think anybody ever
> > > exposed this to userspace (or used it much).
> > > 
> > > IA64 had planned to do this, except they messed it up and did it the
> > > wrong way around (strong first and then relax it later), which lead to
> > > the discovery that all existing software broke (d'uh).
> > > 
> > > I think ARM64 approached this problem by adding the
> > > load-acquire/store-release instructions and for TSO based code,
> > > translate into those (eg. x86 -> arm64 transpilers).
> 
> > Keeping global TSO order is easier and faster than mixing
> > acquire/release and regular load/store. That means when ssdtso is
> > enabled, the transpiler's load-acquire/store-release becomes regular
> > load/store. Some micro-arch hardwares could speed up the performance.
> 
> Why is it faster? Because the release+acquire thing becomes RcSC instead
> of RcTSO? Surely that can be fixed with a weaker store-release variant
> ot something?
The "ld.acq + st.rel" could only be close to the ideal RCtso because
maintaining "ld.acq + st.rel + ld + st" is more complex in LSU than "ld
+ st" by global TSO.  So, that is why we want a global TSO flag to
simplify the micro-arch implementation, especially for some small
processors in the big-little system.

> 
> The problem I have with all of this is that you need to context switch
> this state and that you need to deal with exceptions, which must be
> written for the weak model but then end up running in the tso model --
> possibly slower than desired.
The s-mode TSO is useless for the riscv Linux kernel and this patch only
uses u-mode TSO. So, the exception handler and the whole kernel always
run in WMO.

Two years ago, we worried about stuff like io_uring, which means
io_uring userspace is in TSO, but the kernel side is in WMO. But it
still seems like no problem because every side has a different
implementation, but they all ensure their order. So, there should be no
problem between TSO & WMO io_uring communication. The only things we
need to prevent are:
1. Do not let the WMO code run in TSO mode, which is inefficient. (you mentioned)
2. Do not let the TSO code run in WMO mode, which is incorrect.

> If OTOH you only have a single model, everything becomes so much
> simpler. You just need to be able to express exactly what you want.
The ssdtso is no harm to the current WMO; it's just a tradeoff for
micro-arch implementation. You still could use "ld + st" are "ld.acq +
st.rl", but they are the same in the global tso state.

> 
> 
>
Andrea Parri Feb. 8, 2024, 11:10 a.m. UTC | #13
On Fri, Nov 24, 2023 at 12:04:09PM +0100, Jonas Oberhauser wrote:
> Unfortunately, at least last time I checked RISC-V still hadn't gotten such
> instructions.
> What they have is the *semantics* of the instructions, but no actual opcodes
> to encode them.
> I argued for them in the RISC-V memory group, but it was considered to be
> outside the scope of that group.

(Sorry for the late, late reply; just recalled this thread...)

That's right.  AFAICT, the discussion about the native load-acquire
and store-release instructions was revived somewhere last year within
the RVI community, culminating in the so called Zalasr-proposal [1];
Brendan, Hans and Andrew (+ Cc) might be able to provide more up-to-
date information about the status/plans for that proposal.

(Remark that RISC-V did introduce LR/SCs and AMOs instructions with
acquire/release semantics separately, cf. the so called A-extension.)

  Andrea

[1] https://github.com/mehnadnerd/riscv-zalasr