[v2,cmpxchg,12/13] sh: Emulate one-byte cmpxchg

Message ID	20240501230130.1111603-12-paulmck@kernel.org (mailing list archive)
State	New
Headers	Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 31477168B10; Wed, 1 May 2024 23:01:32 +0000 (UTC)
From: "Paul E. McKenney" <paulmck@kernel.org>
To: linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: elver@google.com, akpm@linux-foundation.org, tglx@linutronix.de, peterz@infradead.org, dianders@chromium.org, pmladek@suse.com, arnd@arndb.de, torvalds@linux-foundation.org, kernel-team@meta.com, "Paul E. McKenney" <paulmck@kernel.org>, Andi Shyti <andi.shyti@linux.intel.com>, Palmer Dabbelt <palmer@rivosinc.com>, Masami Hiramatsu <mhiramat@kernel.org>, linux-sh@vger.kernel.org
Subject: [PATCH v2 cmpxchg 12/13] sh: Emulate one-byte cmpxchg
Date: Wed, 1 May 2024 16:01:29 -0700
Message-Id: <20240501230130.1111603-12-paulmck@kernel.org>
In-Reply-To: <b67e79d4-06cb-4a45-a906-b9e0fbae22c5@paulmck-laptop>
References: <b67e79d4-06cb-4a45-a906-b9e0fbae22c5@paulmck-laptop>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Paul E. McKenney May 1, 2024, 11:01 p.m. UTC

Use the new cmpxchg_emu_u8() to emulate one-byte cmpxchg() on sh.

[ paulmck: Drop two-byte support per Arnd Bergmann feedback. ]
[ paulmck: Apply feedback from Naresh Kamboju. ]
[ Apply Geert Uytterhoeven feedback. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Andi Shyti <andi.shyti@linux.intel.com>
Cc: Palmer Dabbelt <palmer@rivosinc.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: <linux-sh@vger.kernel.org>
---
 arch/sh/Kconfig               | 1 +
 arch/sh/include/asm/cmpxchg.h | 3 +++
 2 files changed, 4 insertions(+)

John Paul Adrian Glaubitz May 2, 2024, 4:52 a.m. UTC | #1

Hi Paul,

On Wed, 2024-05-01 at 16:01 -0700, Paul E. McKenney wrote:
> Use the new cmpxchg_emu_u8() to emulate one-byte cmpxchg() on sh.
> 
> [ paulmck: Drop two-byte support per Arnd Bergmann feedback. ]
> [ paulmck: Apply feedback from Naresh Kamboju. ]
> [ Apply Geert Uytterhoeven feedback. ]
> 
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> Cc: Andi Shyti <andi.shyti@linux.intel.com>
> Cc: Palmer Dabbelt <palmer@rivosinc.com>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: <linux-sh@vger.kernel.org>
> ---
>  arch/sh/Kconfig               | 1 +
>  arch/sh/include/asm/cmpxchg.h | 3 +++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
> index 2ad3e29f0ebec..f47e9ccf4efd2 100644
> --- a/arch/sh/Kconfig
> +++ b/arch/sh/Kconfig
> @@ -16,6 +16,7 @@ config SUPERH
>  	select ARCH_HIBERNATION_POSSIBLE if MMU
>  	select ARCH_MIGHT_HAVE_PC_PARPORT
>  	select ARCH_WANT_IPC_PARSE_VERSION
> +	select ARCH_NEED_CMPXCHG_1_EMU
>  	select CPU_NO_EFFICIENT_FFS
>  	select DMA_DECLARE_COHERENT
>  	select GENERIC_ATOMIC64
> diff --git a/arch/sh/include/asm/cmpxchg.h b/arch/sh/include/asm/cmpxchg.h
> index 5d617b3ef78f7..1e5dc5ccf7bf5 100644
> --- a/arch/sh/include/asm/cmpxchg.h
> +++ b/arch/sh/include/asm/cmpxchg.h
> @@ -9,6 +9,7 @@
>  
>  #include <linux/compiler.h>
>  #include <linux/types.h>
> +#include <linux/cmpxchg-emu.h>
>  
>  #if defined(CONFIG_GUSA_RB)
>  #include <asm/cmpxchg-grb.h>
> @@ -56,6 +57,8 @@ static inline unsigned long __cmpxchg(volatile void * ptr, unsigned long old,
>  		unsigned long new, int size)
>  {
>  	switch (size) {
> +	case 1:
> +		return cmpxchg_emu_u8(ptr, old, new);
>  	case 4:
>  		return __cmpxchg_u32(ptr, old, new);
>  	}

Thanks for the patch. However, I don't quite understand its purpose.

There is already a case for 8-byte cmpxchg in the switch statement below:

        case 1:                                         \
                __xchg__res = xchg_u8(__xchg_ptr, x);   \
                break;

Does cmpxchg_emu_u8() have any advantages over the native xchg_u8()?

Thanks,
Adrian

Paul E. McKenney May 2, 2024, 5:06 a.m. UTC | #2

On Thu, May 02, 2024 at 06:52:53AM +0200, John Paul Adrian Glaubitz wrote:
> Hi Paul,
> 
> On Wed, 2024-05-01 at 16:01 -0700, Paul E. McKenney wrote:
> > Use the new cmpxchg_emu_u8() to emulate one-byte cmpxchg() on sh.
> > 
> > [ paulmck: Drop two-byte support per Arnd Bergmann feedback. ]
> > [ paulmck: Apply feedback from Naresh Kamboju. ]
> > [ Apply Geert Uytterhoeven feedback. ]
> > 
> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > Cc: Andi Shyti <andi.shyti@linux.intel.com>
> > Cc: Palmer Dabbelt <palmer@rivosinc.com>
> > Cc: Masami Hiramatsu <mhiramat@kernel.org>
> > Cc: Arnd Bergmann <arnd@arndb.de>
> > Cc: <linux-sh@vger.kernel.org>
> > ---
> >  arch/sh/Kconfig               | 1 +
> >  arch/sh/include/asm/cmpxchg.h | 3 +++
> >  2 files changed, 4 insertions(+)
> > 
> > diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
> > index 2ad3e29f0ebec..f47e9ccf4efd2 100644
> > --- a/arch/sh/Kconfig
> > +++ b/arch/sh/Kconfig
> > @@ -16,6 +16,7 @@ config SUPERH
> >  	select ARCH_HIBERNATION_POSSIBLE if MMU
> >  	select ARCH_MIGHT_HAVE_PC_PARPORT
> >  	select ARCH_WANT_IPC_PARSE_VERSION
> > +	select ARCH_NEED_CMPXCHG_1_EMU
> >  	select CPU_NO_EFFICIENT_FFS
> >  	select DMA_DECLARE_COHERENT
> >  	select GENERIC_ATOMIC64
> > diff --git a/arch/sh/include/asm/cmpxchg.h b/arch/sh/include/asm/cmpxchg.h
> > index 5d617b3ef78f7..1e5dc5ccf7bf5 100644
> > --- a/arch/sh/include/asm/cmpxchg.h
> > +++ b/arch/sh/include/asm/cmpxchg.h
> > @@ -9,6 +9,7 @@
> >  
> >  #include <linux/compiler.h>
> >  #include <linux/types.h>
> > +#include <linux/cmpxchg-emu.h>
> >  
> >  #if defined(CONFIG_GUSA_RB)
> >  #include <asm/cmpxchg-grb.h>
> > @@ -56,6 +57,8 @@ static inline unsigned long __cmpxchg(volatile void * ptr, unsigned long old,
> >  		unsigned long new, int size)
> >  {
> >  	switch (size) {
> > +	case 1:
> > +		return cmpxchg_emu_u8(ptr, old, new);
> >  	case 4:
> >  		return __cmpxchg_u32(ptr, old, new);
> >  	}
> 
> Thanks for the patch. However, I don't quite understand its purpose.
> 
> There is already a case for 8-byte cmpxchg in the switch statement below:
> 
>         case 1:                                         \
>                 __xchg__res = xchg_u8(__xchg_ptr, x);   \
>                 break;
> 
> Does cmpxchg_emu_u8() have any advantages over the native xchg_u8()?

That would be 8-bit xchg() rather than 8-byte cmpxchg(), correct?

Or am I missing something subtle here that makes sh also support one-byte
(8-bit) cmpxchg()?

							Thanx, Paul

John Paul Adrian Glaubitz May 2, 2024, 5:11 a.m. UTC | #3

On Wed, 2024-05-01 at 22:06 -0700, Paul E. McKenney wrote:
> > Does cmpxchg_emu_u8() have any advantages over the native xchg_u8()?
> 
> That would be 8-bit xchg() rather than 8-byte cmpxchg(), correct?

Indeed. I realized this after sending my reply.

> Or am I missing something subtle here that makes sh also support one-byte
> (8-bit) cmpxchg()?

Is there an explanation available that explains the rationale behind the
series, so I can learn more about it?

Also, I am opposed to removing Alpha entirely as it's still being actively
maintained in Debian and Gentoo and works well.

Adrian

D. Jeff Dionne May 2, 2024, 5:42 a.m. UTC | #4

On May 2, 2024, at 14:07, Paul E. McKenney <paulmck@kernel.org> wrote:

> That would be 8-bit xchg() rather than 8-byte cmpxchg(), correct?
> 
> Or am I missing something subtle here that makes sh also support one-byte
> (8-bit) cmpxchg()?

The native SH atomic operation is test and set TAS.B.  J2 adds a compare and swap CAS.L instruction, carefully chosen for patent free prior art (s360, IIRC).

The (relatively expensive) encoding space we allocated for CAS.L does not contain size bits.

Not all SH4 patents had expired when J2 was under development, but now have (watch this space).  Not sure (me myself) if there are more atomic operations in sh4.

Cheers,
J

> 
>                            Thanx, Paul
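
For readers unfamiliar with the distinction drawn above between test-and-set and compare-and-swap, here is a minimal user-space sketch of how a byte-sized compare-and-swap could be built from nothing but a test-and-set primitive. It uses C11 atomic_flag as a stand-in for TAS.B; the names tas_cmpxchg_u8 and emu_lock are hypothetical, and this is not how the sh port or this series does it. Note that such a lock only serializes callers that go through the same helper; plain stores to the byte are not covered.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* One global flag standing in for a byte that TAS.B would operate on. */
static atomic_flag emu_lock = ATOMIC_FLAG_INIT;

/*
 * Hypothetical helper: compare-and-swap on one byte, serialized by a
 * test-and-set spinlock.  Returns the value found at *p; the swap
 * happened if and only if the return value equals old.
 */
static uint8_t tas_cmpxchg_u8(uint8_t *p, uint8_t old, uint8_t new)
{
	uint8_t prev;

	assert(p);
	/* Spin until test-and-set reports the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&emu_lock, memory_order_acquire))
		;
	prev = *p;
	if (prev == old)
		*p = new;
	atomic_flag_clear_explicit(&emu_lock, memory_order_release);
	return prev;
}
```

A native CAS.L avoids the lock entirely, which is presumably why J2 spent encoding space on it.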

Arnd Bergmann May 2, 2024, 11:30 a.m. UTC | #5

On Thu, May 2, 2024, at 07:42, D. Jeff Dionne wrote:
> On May 2, 2024, at 14:07, Paul E. McKenney <paulmck@kernel.org> wrote:
>
>> That would be 8-bit xchg() rather than 8-byte cmpxchg(), correct?
>> 
>> Or am I missing something subtle here that makes sh also support one-byte
>> (8-bit) cmpxchg()?
>
> The native SH atomic operation is test and set TAS.B.  J2 adds a 
> compare and swap CAS.L instruction, carefully chosen for patent free 
> prior art (s360, IIRC).
>
> The (relatively expensive) encoding space we allocated for CAS.L does 
> not contain size bits.
>
> Not all SH4 patents had expired when J2 was under development, but now 
> have (watch this space).  Not sure (me myself) if there are more atomic 
> operations in sh4.

SH4A supports MIPS R4000 style LL/SC instructions, but it looks like
the older SH4 does not.

      Arnd

Paul E. McKenney May 2, 2024, 1:33 p.m. UTC | #6

On Thu, May 02, 2024 at 07:11:52AM +0200, John Paul Adrian Glaubitz wrote:
> On Wed, 2024-05-01 at 22:06 -0700, Paul E. McKenney wrote:
> > > Does cmpxchg_emu_u8() have any advantages over the native xchg_u8()?
> > 
> > That would be 8-bit xchg() rather than 8-byte cmpxchg(), correct?
> 
> Indeed. I realized this after sending my reply.

No problem, as I do know that feeling!

> > Or am I missing something subtle here that makes sh also support one-byte
> > (8-bit) cmpxchg()?
> 
> Is there an explanation available that explains the rationale behind the
> series, so I can learn more about it?

We have some places in mainline that need one-byte cmpxchg(), so this
series provides emulation for architectures that do not support this
notion.

> Also, I am opposed to removing Alpha entirely as it's still being actively
> maintained in Debian and Gentoo and works well.

Understood, and this sort of compatibility consideration is why this
version of this patchset does not emulate two-byte (16-bit) cmpxchg()
operations.  The original (RFC) series did emulate these, which does
not work on a few architectures that do not provide 16-bit load/store
instructions, hence no 16-bit support in this series.

So this one-byte-only series affects only Alpha systems lacking
single-byte load/store instructions.  If I understand correctly, Alpha
21164A (EV56) and later *do* have single-byte load/store instructions,
and thus are still just fine.  In fact, it looks like EV56 also has
two-byte load/store instructions, and so would have been OK with
the original one-/two-byte RFC series.

Arnd will not be shy about correcting me if I am wrong.  ;-)

> Adrian
> 
> -- 
>  .''`.  John Paul Adrian Glaubitz
> : :' :  Debian Developer
> `. `'   Physicist
>   `-    GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913
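
The emulation idea described above (one-byte cmpxchg() on an architecture that only has a 32-bit one) can be sketched in user space as follows. This is an illustration only, not the kernel's cmpxchg_emu_u8(): the names emu_cmpxchg_u8 and union u8_32 are made up for the example, it assumes the GCC/Clang __atomic builtins, and it assumes the byte lives inside naturally aligned 32-bit storage that may legally be accessed as a word (for example a member of the union below, or a build with -fno-strict-aliasing).

```c
#include <assert.h>
#include <stdint.h>

/* One aligned 32-bit word, viewable as a word or as four bytes. */
union u8_32 {
	uint8_t  b[4];
	uint32_t w;
};

/*
 * Hypothetical helper: compare-and-swap the byte at p using only a
 * 32-bit compare-and-swap on the aligned word that contains it.
 * Returns the byte value observed; the swap happened if and only if
 * the return value equals old.
 */
static uint8_t emu_cmpxchg_u8(volatile uint8_t *p, uint8_t old, uint8_t new)
{
	uintptr_t addr = (uintptr_t)p;
	uint32_t *p32 = (uint32_t *)(addr & ~(uintptr_t)3); /* containing word */
	unsigned int i = addr & 3;                          /* byte index in it */
	union u8_32 old32, new32;

	assert(p);
	old32.w = __atomic_load_n(p32, __ATOMIC_RELAXED);
	for (;;) {
		if (old32.b[i] != old)
			return old32.b[i];      /* byte mismatch: fail, no store */
		new32.w = old32.w;
		new32.b[i] = new;               /* neighbours are left unchanged */
		/* On failure the builtin reloads old32.w, so just retry. */
		if (__atomic_compare_exchange_n(p32, &old32.w, new32.w, 0,
						__ATOMIC_SEQ_CST, __ATOMIC_RELAXED))
			return old;
	}
}
```

Indexing through the byte array of the union keeps the sketch independent of endianness. The retry loop is needed because the 32-bit compare-and-swap can fail when a neighbouring byte changes even though the target byte still matches.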

Al Viro May 2, 2024, 8:53 p.m. UTC | #7

On Thu, May 02, 2024 at 06:33:49AM -0700, Paul E. McKenney wrote:

> Understood, and this sort of compatibility consideration is why this
> version of this patchset does not emulate two-byte (16-bit) cmpxchg()
> operations.  The original (RFC) series did emulate these, which does
> not work on a few architectures that do not provide 16-bit load/store
> instructions, hence no 16-bit support in this series.
> 
> So this one-byte-only series affects only Alpha systems lacking
> single-byte load/store instructions.  If I understand correctly, Alpha
> 21164A (EV56) and later *do* have single-byte load/store instructions,
> and thus are still just fine.  In fact, it looks like EV56 also has
> two-byte load/store instructions, and so would have been OK with
> the original one-/two-byte RFC series.

Wait a sec.  On Alpha we already implement 16bit and 8bit xchg and cmpxchg.
See arch/alpha/include/asm/xchg.h:
static inline unsigned long
____cmpxchg(_u16, volatile short *m, unsigned short old, unsigned short new)
{
       unsigned long prev, tmp, cmp, addr64;

       __asm__ __volatile__(
       "       andnot  %5,7,%4\n"
       "       inswl   %1,%5,%1\n"
       "1:     ldq_l   %2,0(%4)\n"
       "       extwl   %2,%5,%0\n"
       "       cmpeq   %0,%6,%3\n"
       "       beq     %3,2f\n"
       "       mskwl   %2,%5,%2\n"
       "       or      %1,%2,%2\n"
       "       stq_c   %2,0(%4)\n"
       "       beq     %2,3f\n"
       "2:\n"
       ".subsection 2\n"
       "3:     br      1b\n"
       ".previous"
       : "=&r" (prev), "=&r" (new), "=&r" (tmp), "=&r" (cmp), "=&r" (addr64)
       : "r" ((long)m), "Ir" (old), "1" (new) : "memory");

       return prev;
}

Load-locked and store-conditional are done on 64bit value, with
16bit operations done in registers.  This is what 16bit store
(assignment to unsigned short *) turns into with
        stw $17,0($16)		// *(u16*)r16 = r17
and without -mbwx
        insql $17,$16,$17	// r17 = r17 << (8 * (r16 & 7))
        ldq_u $1,0($16)		// r1 = *(u64 *)(r16 & ~7)
	mskwl $1,$16,$1		// r1 &= ~(0xffff << (8 * (r16 & 7)))
	bis $17,$1,$17		// r17 |= r1
	stq_u $17,0($16)	// *(u64 *)(r16 & ~7) = r17

What's more, load-locked/store-conditional doesn't have 16bit and 8bit
variants on any Alphas - it's always 32bit (ldl_l) or 64bit (ldq_l).

What BWX adds is load/store byte/word, load/store byte/word unaligned
and sign-extend byte/word.  IOW, it's absolutely irrelevant for
cmpxchg (or xchg) purposes.
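
The register-level work between ldq_l and stq_c in the listing above can be modelled in portable C. This is a single-threaded sketch of the arithmetic only, under the assumption of little-endian byte numbering as on Alpha; ext16, msk16, ins16 and model_cmpxchg_u16 are illustrative names loosely mirroring extwl, mskwl and inswl, not kernel functions, and the load-locked/store-conditional retry is not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Bit position of the addressed byte within its aligned 64-bit word. */
static unsigned int shift_of(uintptr_t addr)
{
	return 8 * (addr & 7);
}

/* Like extwl: pull the 16-bit field out of the quadword. */
static uint64_t ext16(uint64_t q, uintptr_t addr)
{
	return (q >> shift_of(addr)) & 0xffff;
}

/* Like mskwl: clear the 16-bit field, keep everything else. */
static uint64_t msk16(uint64_t q, uintptr_t addr)
{
	return q & ~((uint64_t)0xffff << shift_of(addr));
}

/* Like inswl: move a 16-bit value into position, zeros elsewhere. */
static uint64_t ins16(uint16_t v, uintptr_t addr)
{
	return (uint64_t)v << shift_of(addr);
}

/* Model of the compare-and-swap body on one quadword of "memory". */
static uint16_t model_cmpxchg_u16(uint64_t *quad, uintptr_t addr,
				  uint16_t old, uint16_t new)
{
	assert((addr & 7) <= 6);                /* field must not straddle */
	uint64_t tmp = *quad;                   /* ldq_l */
	uint16_t prev = (uint16_t)ext16(tmp, addr);
	if (prev != old)
		return prev;                    /* cmpeq failed: no store */
	*quad = msk16(tmp, addr) | ins16(new, addr); /* mskwl, or, stq_c */
	return prev;
}
```

For example, with the quadword 0x1122334455667788 and a byte offset of 2, the extracted field is 0x5566, and swapping in 0xABCD yields 0x11223344ABCD7788.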

Paul E. McKenney May 2, 2024, 9:18 p.m. UTC | #8

On Thu, May 02, 2024 at 09:53:45PM +0100, Al Viro wrote:
> On Thu, May 02, 2024 at 06:33:49AM -0700, Paul E. McKenney wrote:
> 
> > Understood, and this sort of compatibility consideration is why this
> > version of this patchset does not emulate two-byte (16-bit) cmpxchg()
> > operations.  The original (RFC) series did emulate these, which does
> > not work on a few architectures that do not provide 16-bit load/store
> > instructions, hence no 16-bit support in this series.
> > 
> > So this one-byte-only series affects only Alpha systems lacking
> > single-byte load/store instructions.  If I understand correctly, Alpha
> > 21164A (EV56) and later *do* have single-byte load/store instructions,
> > and thus are still just fine.  In fact, it looks like EV56 also has
> > two-byte load/store instructions, and so would have been OK with
> > the original one-/two-byte RFC series.
> 
> Wait a sec.  On Alpha we already implement 16bit and 8bit xchg and cmpxchg.
> See arch/alpha/include/asm/xchg.h:
> static inline unsigned long
> ____cmpxchg(_u16, volatile short *m, unsigned short old, unsigned short new)
> {
>        unsigned long prev, tmp, cmp, addr64;
> 
>        __asm__ __volatile__(
>        "       andnot  %5,7,%4\n"
>        "       inswl   %1,%5,%1\n"
>        "1:     ldq_l   %2,0(%4)\n"
>        "       extwl   %2,%5,%0\n"
>        "       cmpeq   %0,%6,%3\n"
>        "       beq     %3,2f\n"
>        "       mskwl   %2,%5,%2\n"
>        "       or      %1,%2,%2\n"
>        "       stq_c   %2,0(%4)\n"
>        "       beq     %2,3f\n"
>        "2:\n"
>        ".subsection 2\n"
>        "3:     br      1b\n"
>        ".previous"
>        : "=&r" (prev), "=&r" (new), "=&r" (tmp), "=&r" (cmp), "=&r" (addr64)
>        : "r" ((long)m), "Ir" (old), "1" (new) : "memory");
> 
>        return prev;
> }
> 
> Load-locked and store-conditional are done on 64bit value, with
> 16bit operations done in registers.  This is what 16bit store
> (assignment to unsigned short *) turns into with
>         stw $17,0($16)		// *(u16*)r16 = r17
> and without -mbwx
>         insql $17,$16,$17	// r17 = r17 << (8 * (r16 & 7))
>         ldq_u $1,0($16)		// r1 = *(u64 *)(r16 & ~7)
> 	mskwl $1,$16,$1		// r1 &= ~(0xffff << (8 * (r16 & 7)))
> 	bis $17,$1,$17		// r17 |= r1
> 	stq_u $17,0($16)	// *(u64 *)(r16 & ~7) = r17
> 
> What's more, load-locked/store-conditional doesn't have 16bit and 8bit
> variants on any Alphas - it's always 32bit (ldl_l) or 64bit (ldq_l).
> 
> What BWX adds is load/store byte/word, load/store byte/word unaligned
> and sign-extend byte/word.  IOW, it's absolutely irrelevant for
> cmpxchg (or xchg) purposes.

If you are only ever doing atomic read-modify-write operations on the
byte in question, then agreed, you don't care about byte loads and stores.

But there are use cases that do mix smp_store_release() with cmpxchg(),
and those use cases won't work unless at least byte store is implemented.
Or I suppose that we could use cmpxchg() instead of smp_store_release(),
but that is wasteful for architectures that do support byte stores.

So EV56 adds the byte loads and stores needed for those use cases.

Or am I missing your point?

							Thanx, Paul

Arnd Bergmann May 2, 2024, 9:50 p.m. UTC | #9

On Thu, May 2, 2024, at 15:33, Paul E. McKenney wrote:
> On Thu, May 02, 2024 at 07:11:52AM +0200, John Paul Adrian Glaubitz wrote:
>> On Wed, 2024-05-01 at 22:06 -0700, Paul E. McKenney wrote:
>> > > Does cmpxchg_emu_u8() have any advantages over the native xchg_u8()?
>> > 
>> > That would be 8-bit xchg() rather than 8-byte cmpxchg(), correct?
>> 
>> Indeed. I realized this after sending my reply.
>
> So this one-byte-only series affects only Alpha systems lacking
> single-byte load/store instructions.  If I understand correctly, Alpha
> 21164A (EV56) and later *do* have single-byte load/store instructions,
> and thus are still just fine.  In fact, it looks like EV56 also has
> two-byte load/store instructions, and so would have been OK with
> the original one-/two-byte RFC series.

Correct, the only other architecture I'm aware of that is missing
16-bit load/store entirely is ARMv3.

> Arnd will not be shy about correcting me if I am wrong.  ;-)

I'll take this as a reminder to send out my EV4/EV5 removal
series. I've merged my patches with Al's bugfixes and rebased
all on top of 6.9-rc now. It's a bit late now, so I'll
send this tomorrow:

https://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic.git/log/?h=alpha-cleanup-6.9

     Arnd

Al Viro May 2, 2024, 10:07 p.m. UTC | #10

On Thu, May 02, 2024 at 02:18:48PM -0700, Paul E. McKenney wrote:

> If you are only ever doing atomic read-modify-write operations on the
> byte in question, then agreed, you don't care about byte loads and stores.
> 
> But there are use cases that do mix smp_store_release() with cmpxchg(),
> and those use cases won't work unless at least byte store is implemented.
> Or I suppose that we could use cmpxchg() instead of smp_store_release(),
> but that is wasteful for architectures that do support byte stores.
> 
> So EV56 adds the byte loads and stores needed for those use cases.
> 
> Or am I missing your point?

arch/alpha/include/asm/cmpxchg.h:
#define arch_cmpxchg(ptr, o, n)                                         \
({                                                                      \
        __typeof__(*(ptr)) __ret;                                       \
        __typeof__(*(ptr)) _o_ = (o);                                   \
        __typeof__(*(ptr)) _n_ = (n);                                   \
        smp_mb();                                                       \
        __ret = (__typeof__(*(ptr))) __cmpxchg((ptr),                   \
                (unsigned long)_o_, (unsigned long)_n_, sizeof(*(ptr)));\
        smp_mb();                                                       \
        __ret;                                                          \
})

Are those smp_mb() in there enough?

I'm probably missing your point, though - what mix of cmpxchg and
smp_store_release on 8bit values?

Paul E. McKenney May 2, 2024, 11:12 p.m. UTC | #11

On Thu, May 02, 2024 at 11:07:57PM +0100, Al Viro wrote:
> On Thu, May 02, 2024 at 02:18:48PM -0700, Paul E. McKenney wrote:
> 
> > If you are only ever doing atomic read-modify-write operations on the
> > byte in question, then agreed, you don't care about byte loads and stores.
> > 
> > But there are use cases that do mix smp_store_release() with cmpxchg(),
> > and those use cases won't work unless at least byte store is implemented.
> > Or I suppose that we could use cmpxchg() instead of smp_store_release(),
> > but that is wasteful for architectures that do support byte stores.
> > 
> > So EV56 adds the byte loads and stores needed for those use cases.
> > 
> > Or am I missing your point?
> 
> arch/alpha/include/asm/cmpxchg.h:
> #define arch_cmpxchg(ptr, o, n)                                         \
> ({                                                                      \
>         __typeof__(*(ptr)) __ret;                                       \
>         __typeof__(*(ptr)) _o_ = (o);                                   \
>         __typeof__(*(ptr)) _n_ = (n);                                   \
>         smp_mb();                                                       \
>         __ret = (__typeof__(*(ptr))) __cmpxchg((ptr),                   \
>                 (unsigned long)_o_, (unsigned long)_n_, sizeof(*(ptr)));\
>         smp_mb();                                                       \
>         __ret;                                                          \
> })
> 
> Are those smp_mb() in there enough?
> 
> I'm probably missing your point, though - what mix of cmpxchg and
> smp_store_release on 8bit values?

One of RCU's state machines uses smp_store_release() to start the
state machine (only one task gets to do this) and cmpxchg() to update
state beyond that point.  And the state is 8 bits so that it and other
state fits into 32 bits to allow a single check for multiple conditions
elsewhere.

							Thanx, Paul
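
The usage pattern described above might look roughly like the following user-space sketch. It is not RCU's code: the state names, the field layout and the helper names are all hypothetical, and the GCC/Clang __atomic builtins stand in for smp_store_release() and cmpxchg(). The point is only the shape: a plain release store to start, one-byte compare-and-swap afterwards, and the byte packed with other state so that one 32-bit load can test several conditions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical states of the machine. */
enum { ST_IDLE = 0, ST_STARTED = 1, ST_ADVANCED = 2 };

/* Hypothetical layout: an 8-bit phase packed with other 8-bit state. */
union packed_state {
	struct {
		uint8_t phase;
		uint8_t flag_a;
		uint8_t flag_b;
		uint8_t flag_c;
	} f;
	uint32_t all;
};

/* Only one task may call this, so a release store is sufficient. */
static void start_machine(union packed_state *s)
{
	assert(__atomic_load_n(&s->f.phase, __ATOMIC_RELAXED) == ST_IDLE);
	__atomic_store_n(&s->f.phase, ST_STARTED, __ATOMIC_RELEASE);
}

/* Later transitions can race, so they use a one-byte compare-and-swap. */
static uint8_t advance(union packed_state *s, uint8_t from, uint8_t to)
{
	uint8_t expected = from;

	/* On failure, expected is overwritten with the observed value. */
	__atomic_compare_exchange_n(&s->f.phase, &expected, to, 0,
				    __ATOMIC_SEQ_CST, __ATOMIC_RELAXED);
	return expected;
}

/* One 32-bit load checks the phase and all three flags at once. */
static int all_clear(union packed_state *s)
{
	return __atomic_load_n(&s->all, __ATOMIC_ACQUIRE) == 0;
}
```

On a machine without byte stores, the release store in start_machine() would itself have to become a read-modify-write of the containing word, which is where the trouble discussed in this thread comes from.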

Al Viro May 2, 2024, 11:24 p.m. UTC | #12

On Thu, May 02, 2024 at 04:12:44PM -0700, Paul E. McKenney wrote:

> > I'm probably missing your point, though - what mix of cmpxchg and
> > smp_store_release on 8bit values?
> 
> One of RCU's state machines uses smp_store_release() to start the
> state machine (only one task gets to do this) and cmpxchg() to update
> state beyond that point.  And the state is 8 bits so that it and other
> state fits into 32 bits to allow a single check for multiple conditions
> elsewhere.

Humm...  smp_store_release() of 8bit on old alpha is mb + fetch 64bit + replace
8 bits + store 64bit...
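
The fetch/replace/store sequence above is not atomic as a whole, so an update to a neighbouring byte that lands in the middle is silently overwritten. The sketch below replays one such interleaving by hand in a single thread; the "CPU A" and "CPU B" steps are simply statements in program order, and replace_byte and lost_update_demo are illustrative names rather than anything in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Replace byte number idx (0 = least significant) of a 64-bit value. */
static uint64_t replace_byte(uint64_t q, unsigned int idx, uint8_t v)
{
	assert(idx < 8);
	unsigned int shift = 8 * idx;

	return (q & ~((uint64_t)0xff << shift)) | ((uint64_t)v << shift);
}

/*
 * CPU A performs a "byte store" to byte 0 as fetch + replace + store.
 * CPU B atomically updates byte 1 between A's fetch and A's store.
 * Returns the final memory contents.
 */
static uint64_t lost_update_demo(void)
{
	uint64_t mem = 0;
	uint64_t a_copy;

	a_copy = mem;                           /* CPU A: fetch 64 bits   */
	mem = replace_byte(mem, 1, 0xBB);       /* CPU B: update byte 1   */
	a_copy = replace_byte(a_copy, 0, 0xAA); /* CPU A: replace 8 bits  */
	mem = a_copy;                           /* CPU A: store 64 bits   */
	return mem;                             /* byte 1 is back to zero */
}
```

The result is 0xAA rather than 0xBBAA: CPU B's byte is gone even though CPU B did everything correctly.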

Linus Torvalds May 2, 2024, 11:32 p.m. UTC | #13

On Thu, 2 May 2024 at 16:12, Paul E. McKenney <paulmck@kernel.org> wrote:
>
> One of RCU's state machines uses smp_store_release() to start the
> state machine (only one task gets to do this) and cmpxchg() to update
> state beyond that point.  And the state is 8 bits so that it and other
> state fits into 32 bits to allow a single check for multiple conditions
> elsewhere.

Note that since alpha lacks the release-acquire model, it's always
going to be a full memory barrier before the store.

And then the store turns into a load-mask-store for older alphas.

So it's going to be a complete mess from a performance standpoint regardless.

Happily, I doubt anybody really cares.

I've occasionally wondered if we have situations where the
"smp_store_release()" only cares about previous *writes* being ordered
(ie a "smp_wmb()+WRITE_ONCE" would be sufficient).

It makes no difference on x86 (all stores are releases), power64 (wmb
and store_release are both LWSYNC) or arm64 (str is documented to be
cheaper than DMB).

On alpha, smp_wmb()+WRITE_ONCE() is cheaper than smp_store_release(),
but nobody sane cares.

But *if* we have a situation where the "smp_store_release()" might be
just a "previous writes need to be visible" rather than ordering
previous reads too, we could maybe introduce that kind of op. I
_think_ the RCU writes tend to be of that kind?

                    Linus

Paul E. McKenney May 2, 2024, 11:45 p.m. UTC | #14

On Fri, May 03, 2024 at 12:24:47AM +0100, Al Viro wrote:
> On Thu, May 02, 2024 at 04:12:44PM -0700, Paul E. McKenney wrote:
> 
> > > I'm probably missing your point, though - what mix of cmpxchg and
> > > smp_store_release on 8bit values?
> > 
> > One of RCU's state machines uses smp_store_release() to start the
> > state machine (only one task gets to do this) and cmpxchg() to update
> > state beyond that point.  And the state is 8 bits so that it and other
> > state fits into 32 bits to allow a single check for multiple conditions
> > elsewhere.
> 
> Humm...  smp_store_release() of 8bit on old alpha is mb + fetch 64bit + replace
> 8 bits + store 64bit...

Agreed, which is why Arnd is moving his patches ahead.  (He and I
discussed this some weeks back, so not a surprise for him.)

For my part, I dropped 16-bit cmpxchg emulation when moving from the
RFC series to v1.

							Thanx, Paul

Paul E. McKenney May 3, 2024, 12:16 a.m. UTC | #15

On Thu, May 02, 2024 at 04:32:35PM -0700, Linus Torvalds wrote:
> On Thu, 2 May 2024 at 16:12, Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > One of RCU's state machines uses smp_store_release() to start the
> > state machine (only one task gets to do this) and cmpxchg() to update
> > state beyond that point.  And the state is 8 bits so that it and other
> > state fits into 32 bits to allow a single check for multiple conditions
> > elsewhere.
> 
> Note that since alpha lacks the release-acquire model, it's always
> going to be a full memory barrier before the store.
> 
> And then the store turns into a load-mask-store for older alphas.
> 
> So it's going to be a complete mess from a performance standpoint regardless.

And on those older machines, a mess functionally because the other
three bytes in that same 32-bit word can be concurrently updated.
Hence Arnd's patch being necessary here.

EV56 and later all have single-byte stores, so they are OK.  They were
introduced in the mid-1990s, so even they are antiques.  ;-)

> Happily, I doubt anybody really cares.

Here is hoping!

> I've occasionally wondered if we have situations where the
> "smp_store_release()" only cares about previous *writes* being ordered
> (ie a "smp_wmb()+WRITE_ONCE" would be sufficient).

Back in the day, rcu_assign_pointer() worked this way.  But later there
were a few use cases where ordering prior reads was needed.

And in this case, we just barely need that full store-release
functionality.  There is a preceding mutex lock-unlock pair that provides
a full barrier post-boot on almost all systems.

> It makes no difference on x86 (all stores are releases), power64 (wmb
> and store_release are both LWSYNC) or arm64 (str is documented to be
> cheaper than DMB).
> 
> On alpha, smp_wmb()+WRITE_ONCE() is cheaper than smp_store_release(),
> but nobody sane cares.
> 
> But *if* we have a situation where the "smp_store_release()" might be
> just a "previous writes need to be visible" rather than ordering
> previous reads too, we could maybe introduce that kind of op. I
> _think_ the RCU writes tend to be of that kind?

Most of the time, rcu_assign_pointer() only needs to order prior writes,
not both reads and writes.  In theory, we could make something like
an rcu_assign_pointer_reads_too(), though hopefully with a shorter name,
and go back to smp_wmb() for rcu_assign_pointer().

But in practice, I am having a really hard time convincing myself that
it would be worth it.

							Thanx, Paul
