
[v6,08/20] xen/riscv: introduce cmpxchg.h

Message ID bf65ce1ff5f683cc6b638fa7de0c54d5e0aead33.1710517542.git.oleksii.kurochko@gmail.com (mailing list archive)
State Superseded
Series Enable build of full Xen for RISC-V

Commit Message

Oleksii Kurochko March 15, 2024, 6:06 p.m. UTC
The header was taken from Linux kernel 6.4.0-rc1.

Additionally, the following changes were made:
* add emulation of {cmp}xchg for 1/2 byte types using 32-bit atomic
  access.
* replace tabs with spaces
* replace __* variables with *__
* introduce generic versions of xchg_* and cmpxchg_*.
* drop {cmp}xchg{release,relaxed,acquire} as Xen doesn't use them
* drop barriers and use instruction suffixes ( .aq, .rl, .aqrl ) instead

The implementation of the 4- and 8-byte cases was updated according to the spec:
```
              ....
Linux Construct         RVWMO AMO Mapping
atomic <op> relaxed     amo<op>.{w|d}
atomic <op> acquire     amo<op>.{w|d}.aq
atomic <op> release     amo<op>.{w|d}.rl
atomic <op>             amo<op>.{w|d}.aqrl
Linux Construct         RVWMO LR/SC Mapping
atomic <op> relaxed     loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop
atomic <op> acquire     loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez loop
atomic <op> release     loop: lr.{w|d}; <op>; sc.{w|d}.aqrl∗ ; bnez loop OR
                        fence.tso; loop: lr.{w|d}; <op>; sc.{w|d}∗ ; bnez loop
atomic <op>             loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez loop

Table A.5: Mappings from Linux memory primitives to RISC-V primitives

```

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V6:
 - update the commit message? (As before, I don't understand this point. Can you give an example of what sort of opcode / instruction is missing?)
 - Code style fixes
 - change sizeof(*ptr) -> sizeof(*(ptr))
 - update operands names and some local variables for macros emulate_xchg_1_2() and emulate_cmpxchg_1_2()
 - drop {cmp}xchg_{relaxed,acquire,release} versions as they aren't needed for Xen
 - update __amoswap_generic() prototype and definition: drop pre and post barriers.
 - update emulate_xchg_1_2() prototype and definition: add lr_sfx, drop pre and post barriers.
 - rename __xchg_generic to __xchg() and make __xchg() a static inline function to be able to "#ifndef CONFIG_32BIT case 8: ..."
---
Changes in V5:
 - update the commit message.
 - drop ALIGN_DOWN().
 - update the definition of emulate_xchg_1_2(): 
   - lr.d -> lr.w, sc.d -> sc.w.
   - drop ret argument.
   - code style fixes around asm volatile.
   - update prototype.
   - use asm named operands.
   - rename local variables.
   - add comment above the macros
 - update the definition of __xchg_generic:
   - rename to __xchg()
   - transform it to static inline
   - code style fixes around switch()
   - update prototype.
 - redefine cmpxchg()
 - update emulate_cmpxchg_1_2():
   - update prototype
   - update local variables names and usage of them
   - use name asm operands.
   - add comment above the macros
 - drop pre and post barriers, and use the .aq, .rl, .aqrl suffixes.
 - drop {cmp}xchg_{relaxed, acquire, release} as they are not used by Xen.
 - drop unnecessary details in the comment above emulate_cmpxchg_1_2()
---
Changes in V4:
 - Code style fixes.
 - enforce that in __xchg_*() new and *ptr have the same type; also "\n"
   was removed at the end of the asm instruction.
 - dependency from https://lore.kernel.org/xen-devel/cover.1706259490.git.federico.serafini@bugseng.com/
 - switch from ASSERT_UNREACHABLE to STATIC_ASSERT_UNREACHABLE().
 - drop xchg32(ptr, x) and xchg64(ptr, x) as they aren't used.
 - drop cmpxchg{32,64}_local as they aren't used.
 - introduce generic versions of xchg_* and cmpxchg_*.
 - update the commit message.
---
Changes in V3:
 - update the commit message
 - add emulation of {cmp}xchg_... for 1- and 2-byte types
---
Changes in V2:
 - update the comment at the top of the header.
 - change xen/lib.h to xen/bug.h.
 - sort inclusion of headers properly.
---
 xen/arch/riscv/include/asm/cmpxchg.h | 209 +++++++++++++++++++++++++++
 1 file changed, 209 insertions(+)
 create mode 100644 xen/arch/riscv/include/asm/cmpxchg.h

Comments

Jan Beulich March 21, 2024, 12:07 p.m. UTC | #1
On 15.03.2024 19:06, Oleksii Kurochko wrote:
> The header was taken from Linux kernl 6.4.0-rc1.
> 
> Addionally, were updated:
> * add emulation of {cmp}xchg for 1/2 byte types using 32-bit atomic
>   access.
> * replace tabs with spaces
> * replace __* variale with *__
> * introduce generic version of xchg_* and cmpxchg_*.
> * drop {cmp}xchg{release,relaxed,acquire} as Xen doesn't use them

With this, ...

> * drop barries and use instruction suffixices instead ( .aq, .rl, .aqrl )
> 
> Implementation of 4- and 8-byte cases were updated according to the spec:
> ```
>               ....
> Linux Construct         RVWMO AMO Mapping
> atomic <op> relaxed     amo<op>.{w|d}
> atomic <op> acquire     amo<op>.{w|d}.aq
> atomic <op> release     amo<op>.{w|d}.rl
> atomic <op>             amo<op>.{w|d}.aqrl
> Linux Construct         RVWMO LR/SC Mapping
> atomic <op> relaxed     loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop
> atomic <op> acquire     loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez loop
> atomic <op> release     loop: lr.{w|d}; <op>; sc.{w|d}.aqrl∗ ; bnez loop OR
>                         fence.tso; loop: lr.{w|d}; <op>; sc.{w|d}∗ ; bnez loop
> atomic <op>             loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez loop
> 
> Table A.5: Mappings from Linux memory primitives to RISC-V primitives
> 
> ```

... I consider quoting this table in full, without any further remarks, as
confusing: Three of the lines each are inapplicable now, aiui.

Further what are the two * telling us? Quite likely they aren't there just
accidentally.

Finally, why sc.{w|d}.aqrl when in principle one would expect just
sc.{w|d}.rl?

> --- /dev/null
> +++ b/xen/arch/riscv/include/asm/cmpxchg.h
> @@ -0,0 +1,209 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/* Copyright (C) 2014 Regents of the University of California */
> +
> +#ifndef _ASM_RISCV_CMPXCHG_H
> +#define _ASM_RISCV_CMPXCHG_H
> +
> +#include <xen/compiler.h>
> +#include <xen/lib.h>
> +
> +#include <asm/fence.h>
> +#include <asm/io.h>
> +#include <asm/system.h>
> +
> +#define __amoswap_generic(ptr, new, ret, sfx) \

As before / elsewhere: Is there a strong need for two leading underscores
here? Using just one would already be standard compliant afaict.

> +({ \
> +    asm volatile ( \
> +        " amoswap" sfx " %0, %2, %1" \
> +        : "=r" (ret), "+A" (*ptr) \
> +        : "r" (new) \
> +        : "memory" ); \
> +})

This doesn't need the ({ }) (anymore?):

#define __amoswap_generic(ptr, new, ret, sfx) \
    asm volatile ( \
        " amoswap" sfx " %0, %2, %1" \
        : "=r" (ret), "+A" (*(ptr)) \
        : "r" (new) \
        : "memory" )

(note also the added parentheses).

> +/*
> + * For LR and SC, the A extension requires that the address held in rs1 be
> + * naturally aligned to the size of the operand (i.e., eight-byte aligned
> + * for 64-bit words and four-byte aligned for 32-bit words).
> + * If the address is not naturally aligned, an address-misaligned exception
> + * or an access-fault exception will be generated.
> + *
> + * Thereby:
> + * - for 1-byte xchg access the containing word by clearing low two bits
> + * - for 2-byte xchg ccess the containing word by clearing bit 1.

Nit: "access"

> + * If resulting 4-byte access is still misalgined, it will fault just as
> + * non-emulated 4-byte access would.
> + */
> +#define emulate_xchg_1_2(ptr, new, lr_sfx, sc_sfx) \
> +({ \
> +    uint32_t *aligned_ptr = (uint32_t *)((unsigned long)ptr & ~(0x4 - sizeof(*(ptr)))); \
> +    unsigned int new_val_pos = ((unsigned long)(ptr) & (0x4 - sizeof(*(ptr)))) * BITS_PER_BYTE; \

You parenthesize ptr here correctly, but not in the line above.

Instead of "_pos" in the name, maybe better "_bit"?

Finally, here and elsewhere, please limit line length to 80 chars. (Omitting
the 0x here would help a little, but not quite enough. Question is whether
these wouldn't better be sizeof(*aligned_ptr) anyway.)
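Purely for illustration, the two declarations might then be reflowed along these lines (a sketch only, keeping the patch's variable names apart from the suggested _bit rename; whether this is the final shape is of course up to the author):

```c
    uint32_t *aligned_ptr = \
        (uint32_t *)((unsigned long)(ptr) & \
                     ~(sizeof(*aligned_ptr) - sizeof(*(ptr)))); \
    unsigned int new_val_bit = \
        ((unsigned long)(ptr) & \
         (sizeof(*aligned_ptr) - sizeof(*(ptr)))) * BITS_PER_BYTE; \
```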

> +    unsigned long mask = GENMASK(((sizeof(*(ptr))) * BITS_PER_BYTE) - 1, 0) << new_val_pos; \
> +    unsigned int new_ = new << new_val_pos; \

Similarly new wants parenthesizing here.

> +    unsigned int old; \
> +    unsigned int scratch; \
> +    \
> +    asm volatile ( \
> +        "0: lr.w" lr_sfx " %[old], %[aligned_ptr]\n" \
> +        "   and  %[scratch], %[old], %z[nmask]\n" \
> +        "   or   %[scratch], %[scratch], %z[new_]\n" \
> +        "   sc.w" sc_sfx " %[scratch], %[scratch], %[aligned_ptr]\n" \
> +        "   bnez %[scratch], 0b\n" \
> +        : [old] "=&r" (old), [scratch] "=&r" (scratch), [aligned_ptr] "+A" (*aligned_ptr) \

While for the variable name aligned_ptr is likely helpful, for the operand
name just ptr would certainly do?

> +        : [new_] "rJ" (new_), [nmask] "rJ" (~mask) \

Neither mask nor ~mask can be 0. Hence J here and the z modifier above
look pointless. (new_, otoh, can be 0, so allowing x0 to be used in that
case is certainly desirable.)

As to using ~mask here: Now that we look to have settled on requiring
Zbb, you could use andn instead of and, thus allowing the same register
to be used in the asm() and ...

> +        : "memory" ); \
> +    \
> +    (__typeof__(*(ptr)))((old & mask) >> new_val_pos); \

... for this calculation.
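For illustration, an andn-based variant of the loop might look like this (a sketch assuming Zbb is available, with the operand rename suggested above applied; not a tested drop-in):

```c
    asm volatile ( \
        "0: lr.w" lr_sfx " %[old], %[ptr]\n" \
        "   andn %[scratch], %[old], %[mask]\n" /* old & ~mask */ \
        "   or   %[scratch], %[scratch], %z[new_]\n" \
        "   sc.w" sc_sfx " %[scratch], %[scratch], %[ptr]\n" \
        "   bnez %[scratch], 0b\n" \
        : [old] "=&r" (old), [scratch] "=&r" (scratch), \
          [ptr] "+A" (*aligned_ptr) \
        : [new_] "rJ" (new_), [mask] "r" (mask) \
        : "memory" ); \
    \
    (__typeof__(*(ptr)))((old & mask) >> new_val_pos); \
```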

> +})
> +
> +static always_inline unsigned long __xchg(volatile void *ptr, unsigned long new, int size)
> +{
> +    unsigned long ret;
> +
> +    switch ( size )
> +    {
> +    case 1:
> +        ret = emulate_xchg_1_2((volatile uint8_t *)ptr, new, ".aq", ".aqrl");
> +        break;
> +    case 2:
> +        ret = emulate_xchg_1_2((volatile uint16_t *)ptr, new, ".aq", ".aqrl");
> +        break;
> +    case 4:
> +        __amoswap_generic((volatile uint32_t *)ptr, new, ret, ".w.aqrl");
> +        break;
> +#ifndef CONFIG_32BIT

There's no 32BIT Kconfig symbol; all we have is a 64BIT one.
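i.e., presumably something along these lines (a sketch assuming the existing symbol were used; a RISC-V-specific symbol may of course be preferred instead):

```c
#ifdef CONFIG_64BIT
    case 8:
        __amoswap_generic((volatile uint64_t *)ptr, new, ret, ".d.aqrl");
        break;
#endif
```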

> +    case 8:
> +        __amoswap_generic((volatile uint64_t *)ptr, new, ret, ".d.aqrl");
> +        break;
> +#endif
> +    default:
> +        STATIC_ASSERT_UNREACHABLE();
> +    }
> +
> +    return ret;
> +}
> +
> +#define xchg(ptr, x) \
> +({ \
> +    __typeof__(*(ptr)) n_ = (x); \
> +    (__typeof__(*(ptr))) \
> +        __xchg((ptr), (unsigned long)(n_), sizeof(*(ptr))); \

Nit: While excess parentheses "only" harm readability, they would
nevertheless better be omitted (here: the first argument passed).

> +})
> +
> +#define __generic_cmpxchg(ptr, old, new, ret, lr_sfx, sc_sfx)	\
> + ({ \
> +    register unsigned int rc; \

Nit: We don't normally use "register", unless accompanied by asm() tying
a variable to a specific one.

> +    __typeof__(*(ptr)) old__ = (__typeof__(*(ptr)))(old); \
> +    __typeof__(*(ptr)) new__ = (__typeof__(*(ptr)))(new); \

The casts aren't very nice to have here; I take they're needed for
cmpxchg_ptr() to compile?

> +    asm volatile( \

Nit: Missing blank once again. Would be really nice if you could go
through and sort this uniformly for the series.

> +        "0: lr" lr_sfx " %0, %2\n" \
> +        "   bne  %0, %z3, 1f\n" \
> +        "   sc" sc_sfx " %1, %z4, %2\n" \
> +        "   bnez %1, 0b\n" \
> +        "1:\n" \
> +        : "=&r" (ret), "=&r" (rc), "+A" (*ptr) \
> +        : "rJ" (old__), "rJ" (new__) \

Please could I talk you into using named operands here, too?

Also ptr here is lacking parentheses again.

> +        : "memory"); \

And yet another missing blank.

> + })

At the use site this construct having a normal return value (rather
than ret being passed in) would overall look more natural.
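For illustration, a rough sketch of a statement-expression form that yields the value directly (named operands included as asked for above; only meant to show the shape, not a tested replacement):

```c
#define __generic_cmpxchg(ptr, old, new, lr_sfx, sc_sfx) \
({ \
    unsigned int rc; \
    __typeof__(*(ptr)) ret__; \
    __typeof__(*(ptr)) old__ = (__typeof__(*(ptr)))(old); \
    __typeof__(*(ptr)) new__ = (__typeof__(*(ptr)))(new); \
    \
    asm volatile ( \
        "0: lr" lr_sfx " %[ret], %[ptr]\n" \
        "   bne  %[ret], %z[old], 1f\n" \
        "   sc" sc_sfx " %[rc], %z[new], %[ptr]\n" \
        "   bnez %[rc], 0b\n" \
        "1:\n" \
        : [ret] "=&r" (ret__), [rc] "=&r" (rc), [ptr] "+A" (*(ptr)) \
        : [old] "rJ" (old__), [new] "rJ" (new__) \
        : "memory" ); \
    \
    ret__; \
})
```

The call sites in __cmpxchg() would then simply assign the result, e.g. ret = __generic_cmpxchg((volatile uint32_t *)ptr, old, new, ".w.aq", ".w.aqrl");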

> +/*
> + * For LR and SC, the A extension requires that the address held in rs1 be
> + * naturally aligned to the size of the operand (i.e., eight-byte aligned
> + * for 64-bit words and four-byte aligned for 32-bit words).
> + * If the address is not naturally aligned, an address-misaligned exception
> + * or an access-fault exception will be generated.
> + *
> + * Thereby:
> + * - for 1-byte xchg access the containing word by clearing low two bits
> + * - for 2-byte xchg ccess the containing word by clearing first bit.
> + * 
> + * If resulting 4-byte access is still misalgined, it will fault just as
> + * non-emulated 4-byte access would.
> + *
> + * old_val was casted to unsigned long for cmpxchgptr()
> + */
> +#define emulate_cmpxchg_1_2(ptr, old, new, lr_sfx, sc_sfx) \
> +({ \
> +    uint32_t *aligned_ptr = (uint32_t *)((unsigned long)ptr & ~(0x4 - sizeof(*(ptr)))); \
> +    uint8_t new_val_pos = ((unsigned long)(ptr) & (0x4 - sizeof(*(ptr)))) * BITS_PER_BYTE; \
> +    unsigned long mask = GENMASK(((sizeof(*(ptr))) * BITS_PER_BYTE) - 1, 0) << new_val_pos; \
> +    unsigned int old_ = old << new_val_pos; \
> +    unsigned int new_ = new << new_val_pos; \
> +    unsigned int old_val; \
> +    unsigned int scratch; \
> +    \
> +    __asm__ __volatile__ ( \
> +        "0: lr.w" lr_sfx " %[scratch], %[aligned_ptr]\n" \
> +        "   and  %[old_val], %[scratch], %z[mask]\n" \
> +        "   bne  %[old_val], %z[old_], 1f\n" \
> +        "   xor  %[scratch], %[old_val], %[scratch]\n" \

To be honest I was hoping this line would come with a brief comment.

> +        "   or   %[scratch], %[scratch], %z[new_]\n" \
> +        "   sc.w" sc_sfx " %[scratch], %[scratch], %[aligned_ptr]\n" \
> +        "   bnez %[scratch], 0b\n" \
> +        "1:\n" \
> +        : [old_val] "=&r" (old_val), [scratch] "=&r" (scratch), [aligned_ptr] "+A" (*aligned_ptr) \
> +        : [old_] "rJ" (old_), [new_] "rJ" (new_), \
> +          [mask] "rJ" (mask) \
> +        : "memory" ); \
> +    \
> +    (__typeof__(*(ptr)))((unsigned long)old_val >> new_val_pos); \
> +})

A few of the comments for emulate_xchg_1_2() apply here as well.
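As to the xor line in particular, the requested brief comment might simply spell out its effect, e.g. (sketch):

```c
        "   bne  %[old_val], %z[old_], 1f\n" \
        /* old_val == scratch & mask, so this clears the field in scratch */ \
        "   xor  %[scratch], %[old_val], %[scratch]\n" \
        "   or   %[scratch], %[scratch], %z[new_]\n" \
```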

> +/*
> + * Atomic compare and exchange.  Compare OLD with MEM, if identical,
> + * store NEW in MEM.  Return the initial value in MEM.  Success is
> + * indicated by comparing RETURN with OLD.
> + */
> +static always_inline unsigned long __cmpxchg(volatile void *ptr,
> +                         unsigned long old,
> +                         unsigned long new,
> +                         int size)

Nit: Inappropriate indentation.

Jan
Oleksii Kurochko March 22, 2024, 10:32 a.m. UTC | #2
On Thu, 2024-03-21 at 13:07 +0100, Jan Beulich wrote:
> On 15.03.2024 19:06, Oleksii Kurochko wrote:
> > The header was taken from Linux kernl 6.4.0-rc1.
> > 
> > Addionally, were updated:
> > * add emulation of {cmp}xchg for 1/2 byte types using 32-bit atomic
> >   access.
> > * replace tabs with spaces
> > * replace __* variale with *__
> > * introduce generic version of xchg_* and cmpxchg_*.
> > * drop {cmp}xchg{release,relaxed,acquire} as Xen doesn't use them
> 
> With this, ...
> 
> > * drop barries and use instruction suffixices instead ( .aq, .rl,
> > .aqrl )
> > 
> > Implementation of 4- and 8-byte cases were updated according to the
> > spec:
> > ```
> >               ....
> > Linux Construct         RVWMO AMO Mapping
> > atomic <op> relaxed     amo<op>.{w|d}
> > atomic <op> acquire     amo<op>.{w|d}.aq
> > atomic <op> release     amo<op>.{w|d}.rl
> > atomic <op>             amo<op>.{w|d}.aqrl
> > Linux Construct         RVWMO LR/SC Mapping
> > atomic <op> relaxed     loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop
> > atomic <op> acquire     loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez
> > loop
> > atomic <op> release     loop: lr.{w|d}; <op>; sc.{w|d}.aqrl∗ ; bnez
> > loop OR
> >                         fence.tso; loop: lr.{w|d}; <op>; sc.{w|d}∗
> > ; bnez loop
> > atomic <op>             loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl;
> > bnez loop
> > 
> > Table A.5: Mappings from Linux memory primitives to RISC-V
> > primitives
> > 
> > ```
> 
> ... I consider quoting this table in full, without any further
> remarks, as
> confusing: Three of the lines each are inapplicable now, aiui.
I'll shorten the table then.

> 
> Further what are the two * telling us? Quite likely they aren't there
> just
> accidentally.
> 
> Finally, why sc.{w|d}.aqrl when in principle one would expect just
> sc.{w|d}.rl?
Because according to the last line of table A.5:
    atomic <op>             loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl;
the sc.{w|d}.aqrl form is used there, so I decided to stick to what is
mentioned in the table.

> 
> > +})
> > +
> > +static always_inline unsigned long __xchg(volatile void *ptr,
> > unsigned long new, int size)
> > +{
> > +    unsigned long ret;
> > +
> > +    switch ( size )
> > +    {
> > +    case 1:
> > +        ret = emulate_xchg_1_2((volatile uint8_t *)ptr, new,
> > ".aq", ".aqrl");
> > +        break;
> > +    case 2:
> > +        ret = emulate_xchg_1_2((volatile uint16_t *)ptr, new,
> > ".aq", ".aqrl");
> > +        break;
> > +    case 4:
> > +        __amoswap_generic((volatile uint32_t *)ptr, new, ret,
> > ".w.aqrl");
> > +        break;
> > +#ifndef CONFIG_32BIT
> 
> There's no 32BIT Kconfig symbol; all we have is a 64BIT one.
I meant CONFIG_RISCV_32 here.

> 
> > +    case 8:
> > +        __amoswap_generic((volatile uint64_t *)ptr, new, ret,
> > ".d.aqrl");
> > +        break;
> > +#endif
> > +    default:
> > +        STATIC_ASSERT_UNREACHABLE();
> > +    }
> > +
> > +    return ret;
> > +}
> > +
> > +#define xchg(ptr, x) \
> > +({ \
> > +    __typeof__(*(ptr)) n_ = (x); \
> > +    (__typeof__(*(ptr))) \
> > +        __xchg((ptr), (unsigned long)(n_), sizeof(*(ptr))); \
> 
> Nit: While excess parentheses "only" harm readability, they would
> nevertheless better be omitted (here: the first argument passed).
> 
> > +})
> > +
> > +#define __generic_cmpxchg(ptr, old, new, ret, lr_sfx, sc_sfx)	\
> > + ({ \
> > +    register unsigned int rc; \
> 
> Nit: We don't normally use "register", unless accompanied by asm()
> tying
> a variable to a specific one.
> 
> > +    __typeof__(*(ptr)) old__ = (__typeof__(*(ptr)))(old); \
> > +    __typeof__(*(ptr)) new__ = (__typeof__(*(ptr)))(new); \
> 
> The casts aren't very nice to have here; I take they're needed for
> cmpxchg_ptr() to compile?
Not really, I have not faced a compilation issue.
The reason it was added is that the lr instruction places the sign-
extended value in the destination register; without the casts, the
values for old and new were generated without sign extension, so, for
example:
   u32 = 0xbbbbbbbb;
   cmpxchg(&u32, 0xbbbbbbbb, 0xCCCCCCCC);
will fail because after:
       "0: lr" lr_sfx " %0, %2\n"
%0 will hold 0xFFFFFFFFBBBBBBBB while %3 will hold 0xBBBBBBBB, so in
       "   bne  %0, %z3, 1f\n"
%0 and %3 are always unequal whenever the most significant bit of the
value read from %2 is 1.
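A minimal, self-contained C illustration of the effect described above (a hypothetical standalone example, not part of the patch):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t u32 = 0xbbbbbbbb;

    /* lr.w sign-extends the loaded 32-bit value into a 64-bit register. */
    int64_t loaded = (int32_t)u32;        /* 0xFFFFFFFFBBBBBBBB */
    int64_t old = (int64_t)0xbbbbbbbbUL;  /* zero-extended comparison value */

    /* bne compares the full registers, so the two never match. */
    printf("match: %d\n", loaded == old); /* prints "match: 0" */
    return 0;
}
```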

But now I realize that it would be better just to use a mask and not
depend on compiler code generation, so it would be better to do it in
the following way (of course, the macro should be updated according to
the remarks you gave me):
   #define __generic_cmpxchg(ptr, old, new, ret, lr_sfx, sc_sfx) \
    ({ \
       register unsigned int rc; \
       unsigned long mask = GENMASK(((sizeof(*(ptr))) * BITS_PER_BYTE) - 1, 0); \
       asm volatile( \
           "0: lr" lr_sfx " %0, %2\n" \
           "   and  %0, %0, %z[mask]\n" \
           "   bne  %0, %z3, 1f\n" \
           "   sc" sc_sfx " %1, %z4, %2\n" \
           "   bnez %1, 0b\n" \
           "1:\n" \
           : "=&r" (ret), "=&r" (rc), "+A" (*ptr) \
           : "rJ" (old), "rJ" (new), [mask] "rJ" (mask)  \
           : "memory"); \
    })
> 
> > +    asm volatile( \
> 
> Nit: Missing blank once again. Would be really nice if you could go
> through and sort this uniformly for the series.
> 
> > +        "0: lr" lr_sfx " %0, %2\n" \
> > +        "   bne  %0, %z3, 1f\n" \
> > +        "   sc" sc_sfx " %1, %z4, %2\n" \
> > +        "   bnez %1, 0b\n" \
> > +        "1:\n" \
> > +        : "=&r" (ret), "=&r" (rc), "+A" (*ptr) \
> > +        : "rJ" (old__), "rJ" (new__) \
> 
> Please could I talk you into using named operands here, too?
Sure, I will add them.

~ Oleksii
Jan Beulich March 22, 2024, 10:42 a.m. UTC | #3
On 22.03.2024 11:32, Oleksii wrote:
> On Thu, 2024-03-21 at 13:07 +0100, Jan Beulich wrote:
>> On 15.03.2024 19:06, Oleksii Kurochko wrote:
>>> The header was taken from Linux kernl 6.4.0-rc1.
>>>
>>> Addionally, were updated:
>>> * add emulation of {cmp}xchg for 1/2 byte types using 32-bit atomic
>>>   access.
>>> * replace tabs with spaces
>>> * replace __* variale with *__
>>> * introduce generic version of xchg_* and cmpxchg_*.
>>> * drop {cmp}xchg{release,relaxed,acquire} as Xen doesn't use them
>>
>> With this, ...
>>
>>> * drop barries and use instruction suffixices instead ( .aq, .rl,
>>> .aqrl )
>>>
>>> Implementation of 4- and 8-byte cases were updated according to the
>>> spec:
>>> ```
>>>               ....
>>> Linux Construct         RVWMO AMO Mapping
>>> atomic <op> relaxed     amo<op>.{w|d}
>>> atomic <op> acquire     amo<op>.{w|d}.aq
>>> atomic <op> release     amo<op>.{w|d}.rl
>>> atomic <op>             amo<op>.{w|d}.aqrl
>>> Linux Construct         RVWMO LR/SC Mapping
>>> atomic <op> relaxed     loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop
>>> atomic <op> acquire     loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez
>>> loop
>>> atomic <op> release     loop: lr.{w|d}; <op>; sc.{w|d}.aqrl∗ ; bnez
>>> loop OR
>>>                         fence.tso; loop: lr.{w|d}; <op>; sc.{w|d}∗
>>> ; bnez loop
>>> atomic <op>             loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl;
>>> bnez loop
>>>
>>> Table A.5: Mappings from Linux memory primitives to RISC-V
>>> primitives
>>>
>>> ```
>>
>> ... I consider quoting this table in full, without any further
>> remarks, as
>> confusing: Three of the lines each are inapplicable now, aiui.
> I'll shorten the table then.
> 
>>
>> Further what are the two * telling us? Quite likely they aren't there
>> just
>> accidentally.
>>
>> Finally, why sc.{w|d}.aqrl when in principle one would expect just
>> sc.{w|d}.rl?
> Because according to the last line of table A.5:
>     atomic <op>             loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl;
> Here it is used sc.{w|d}.aqrl form, so I decided to stick to the what
> is mentioned in the table.

I understand it's mentioned that way in the table. But it being that way
is not explained anywhere. Hence my "Why?"

>>> +    __typeof__(*(ptr)) old__ = (__typeof__(*(ptr)))(old); \
>>> +    __typeof__(*(ptr)) new__ = (__typeof__(*(ptr)))(new); \
>>
>> The casts aren't very nice to have here; I take they're needed for
>> cmpxchg_ptr() to compile?
> Not really, I have not faced an compilation issue.
> The reason why it was added is that lr instruction places the sign-
> extended value in destination register, but if not to do cast value for
> old and new were generated without sign extension, so, for example:
>    u32= 0xbbbbbbbb;
>    cmpxchg(&u32, 0xbbbbbbbb, 0xCCCCCCCC), u32);
> Will fail because after:
>        "0: lr" lr_sfx " %0, %2\n" 
> in %0 we will have 0xFFFFFFFFBBBBBBBB, but in %3 we will have
> 0xBBBBBBBB, so
>        bne  %0, %z3, 1f\n"
> %0 and %3 are always inequal in case when the most significat bit of
> value read from %2 has 1.

I'm afraid I don't follow: It's the type conversion you're after, but
that would happen also with the casts omitted.

> But now I realized that it would be better just to use a mask and not
> be dependent from compiler code generation, so it would be better to in
> the following way ( of course, the macros should be updated accordingly
> to remarks you give me ):
>    #define __generic_cmpxchg(ptr, old, new, ret, lr_sfx, sc_sfx)	\
>     ({ \
>        register unsigned int rc; \
>        unsigned long mask = GENMASK(((sizeof(*(ptr))) * BITS_PER_BYTE)
>    - 1, 0); \
>        asm volatile( \
>            "0: lr" lr_sfx " %0, %2\n" \
>            "   and  %0, %0, %z[mask]\n" \
>            "   bne  %0, %z3, 1f\n" \
>            "   sc" sc_sfx " %1, %z4, %2\n" \
>            "   bnez %1, 0b\n" \
>            "1:\n" \
>            : "=&r" (ret), "=&r" (rc), "+A" (*ptr) \
>            : "rJ" (old), "rJ" (new), [mask] "rJ" (mask)  \
>            : "memory"); \
>     })

It'll be up to you whether to switch to such an alternative.

Jan
Oleksii Kurochko March 22, 2024, 12:52 p.m. UTC | #4
On Fri, 2024-03-22 at 11:42 +0100, Jan Beulich wrote:
> On 22.03.2024 11:32, Oleksii wrote:
> > On Thu, 2024-03-21 at 13:07 +0100, Jan Beulich wrote:
> > > On 15.03.2024 19:06, Oleksii Kurochko wrote:
> > > > The header was taken from Linux kernl 6.4.0-rc1.
> > > > 
> > > > Addionally, were updated:
> > > > * add emulation of {cmp}xchg for 1/2 byte types using 32-bit
> > > > atomic
> > > >   access.
> > > > * replace tabs with spaces
> > > > * replace __* variale with *__
> > > > * introduce generic version of xchg_* and cmpxchg_*.
> > > > * drop {cmp}xchg{release,relaxed,acquire} as Xen doesn't use
> > > > them
> > > 
> > > With this, ...
> > > 
> > > > * drop barries and use instruction suffixices instead ( .aq,
> > > > .rl,
> > > > .aqrl )
> > > > 
> > > > Implementation of 4- and 8-byte cases were updated according to
> > > > the
> > > > spec:
> > > > ```
> > > >               ....
> > > > Linux Construct         RVWMO AMO Mapping
> > > > atomic <op> relaxed     amo<op>.{w|d}
> > > > atomic <op> acquire     amo<op>.{w|d}.aq
> > > > atomic <op> release     amo<op>.{w|d}.rl
> > > > atomic <op>             amo<op>.{w|d}.aqrl
> > > > Linux Construct         RVWMO LR/SC Mapping
> > > > atomic <op> relaxed     loop: lr.{w|d}; <op>; sc.{w|d}; bnez
> > > > loop
> > > > atomic <op> acquire     loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez
> > > > loop
> > > > atomic <op> release     loop: lr.{w|d}; <op>; sc.{w|d}.aqrl∗ ;
> > > > bnez
> > > > loop OR
> > > >                         fence.tso; loop: lr.{w|d}; <op>;
> > > > sc.{w|d}∗
> > > > ; bnez loop
> > > > atomic <op>             loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl;
> > > > bnez loop
> > > > 
> > > > Table A.5: Mappings from Linux memory primitives to RISC-V
> > > > primitives
> > > > 
> > > > ```
> > > 
> > > ... I consider quoting this table in full, without any further
> > > remarks, as
> > > confusing: Three of the lines each are inapplicable now, aiui.
> > I'll shorten the table then.
> > 
> > > 
> > > Further what are the two * telling us? Quite likely they aren't
> > > there
> > > just
> > > accidentally.
> > > 
> > > Finally, why sc.{w|d}.aqrl when in principle one would expect
> > > just
> > > sc.{w|d}.rl?
> > Because according to the last line of table A.5:
> >     atomic <op>             loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl;
> > Here it is used sc.{w|d}.aqrl form, so I decided to stick to the
> > what
> > is mentioned in the table.
> 
> I understand it's mentioned that way in the table. But it being that
> way
> is not explained anywhere. Hence my "Why?"
> 
> > > > +    __typeof__(*(ptr)) old__ = (__typeof__(*(ptr)))(old); \
> > > > +    __typeof__(*(ptr)) new__ = (__typeof__(*(ptr)))(new); \
> > > 
> > > The casts aren't very nice to have here; I take they're needed
> > > for
> > > cmpxchg_ptr() to compile?
> > Not really, I have not faced an compilation issue.
> > The reason why it was added is that lr instruction places the sign-
> > extended value in destination register, but if not to do cast value
> > for
> > old and new were generated without sign extension, so, for example:
> >    u32= 0xbbbbbbbb;
> >    cmpxchg(&u32, 0xbbbbbbbb, 0xCCCCCCCC), u32);
> > Will fail because after:
> >        "0: lr" lr_sfx " %0, %2\n" 
> > in %0 we will have 0xFFFFFFFFBBBBBBBB, but in %3 we will have
> > 0xBBBBBBBB, so
> >        bne  %0, %z3, 1f\n"
> > %0 and %3 are always inequal in case when the most significat bit
> > of
> > value read from %2 has 1.
> 
> I'm afraid I don't follow: It's the type conversion you're after, but
> that would happen also with the casts omitted.
Yes, I agree it would happen also with the casts omitted; it was pure
luck that, with the casts, the compiler used an immediate with the most
significant bit = 1:
ffffffffc00397f0:       bbbbc7b7                lui     a5,0xbbbbc
ffffffffc00397f4:       bbb78793                add     a5,a5,-1093 # ffffffffbbbbbbbb <start-0x4444445>
ffffffffc00397f8:       fef42623                sw      a5,-20(s0)
ffffffffc00397fc:       ccccd737                lui     a4,0xccccd
ffffffffc0039800:       ccc7071b                addw    a4,a4,-820 # ffffffffcccccccc <__bss_end+0xcc7ff44>
ffffffffc0039804:       56fd                    li      a3,-1
ffffffffc0039806:       9281                    srl     a3,a3,0x20
ffffffffc0039808:       fec40513                add     a0,s0,-20
ffffffffc003980c:       140525af                lr.w.aq a1,(a0)
ffffffffc0039810:       00f59563                bne     a1,a5,ffffffffc003981a <start_xen+0x4e>
ffffffffc0039814:       1ee5262f                sc.w.aqrl       a2,a4,(a0)

I will update the code with the mask mentioned below to be sure that a5
always has the correct value.

~ Oleksii

> 
> > But now I realized that it would be better just to use a mask and
> > not
> > be dependent from compiler code generation, so it would be better
> > to in
> > the following way ( of course, the macros should be updated
> > accordingly
> > to remarks you give me ):
> >    #define __generic_cmpxchg(ptr, old, new, ret, lr_sfx,
> > sc_sfx)	\
> >     ({ \
> >        register unsigned int rc; \
> >        unsigned long mask = GENMASK(((sizeof(*(ptr))) *
> > BITS_PER_BYTE)
> >    - 1, 0); \
> >        asm volatile( \
> >            "0: lr" lr_sfx " %0, %2\n" \
> >            "   and  %0, %0, %z[mask]\n" \
> >            "   bne  %0, %z3, 1f\n" \
> >            "   sc" sc_sfx " %1, %z4, %2\n" \
> >            "   bnez %1, 0b\n" \
> >            "1:\n" \
> >            : "=&r" (ret), "=&r" (rc), "+A" (*ptr) \
> >            : "rJ" (old), "rJ" (new), [mask] "rJ" (mask)  \
> >            : "memory"); \
> >     })
> 
> It'll be up to you whether to switch to such an alternative.
> 
> Jan
Oleksii Kurochko April 2, 2024, 11:43 a.m. UTC | #5
On Thu, 2024-03-21 at 13:07 +0100, Jan Beulich wrote:
> > + * If resulting 4-byte access is still misalgined, it will fault
> > just as
> > + * non-emulated 4-byte access would.
> > + */
> > +#define emulate_xchg_1_2(ptr, new, lr_sfx, sc_sfx) \
> > +({ \
> > +    uint32_t *aligned_ptr = (uint32_t *)((unsigned long)ptr &
> > ~(0x4 - sizeof(*(ptr)))); \
> > +    unsigned int new_val_pos = ((unsigned long)(ptr) & (0x4 -
> > sizeof(*(ptr)))) * BITS_PER_BYTE; \
> 
> You parenthesize ptr here correctly, but not in the line above.
> 
> Instead of "_pos" in the name, maybe better "_bit"?
> 
> Finally, here and elsewhere, please limit line length to 80 chars.
> (Omitting
> the 0x here would help a little, but not quite enough. Question is
> whether
> these wouldn't better be sizeof(*aligned_ptr) anyway.)

I'm unsure if using sizeof(*aligned_ptr) is correct in this context.
The intention was to determine the position for the value we're
attempting to exchange.

Let's consider a scenario where we have 0xAABBCCDD at address 0x6. Even
though this would result in misaligned access, I believe new_val_pos
should still be calculated accurately. Let's say we want to exchange
two bytes (AABB) with FFFF.

With the current implementation, we would calculate:
0x6 & 2 = 2, and 2 * 8 = 16, which is the amount by which the new value
should be shifted.

However, if the address is ANDed with sizeof(*aligned_ptr) instead:
0x6 & 4 = 4, and 4 * 8 = 32, which is not the expected value.

Am I overlooking something?

~ Oleksii
Jan Beulich April 2, 2024, 12:54 p.m. UTC | #6
On 02.04.2024 13:43, Oleksii wrote:
> On Thu, 2024-03-21 at 13:07 +0100, Jan Beulich wrote:
>>> + * If resulting 4-byte access is still misalgined, it will fault
>>> just as
>>> + * non-emulated 4-byte access would.
>>> + */
>>> +#define emulate_xchg_1_2(ptr, new, lr_sfx, sc_sfx) \
>>> +({ \
>>> +    uint32_t *aligned_ptr = (uint32_t *)((unsigned long)ptr &
>>> ~(0x4 - sizeof(*(ptr)))); \
>>> +    unsigned int new_val_pos = ((unsigned long)(ptr) & (0x4 -
>>> sizeof(*(ptr)))) * BITS_PER_BYTE; \
>>
>> You parenthesize ptr here correctly, but not in the line above.
>>
>> Instead of "_pos" in the name, maybe better "_bit"?
>>
>> Finally, here and elsewhere, please limit line length to 80 chars.
>> (Omitting
>> the 0x here would help a little, but not quite enough. Question is
>> whether
>> these wouldn't better be sizeof(*aligned_ptr) anyway.)
> 
> I'm unsure if using sizeof(*aligned_ptr) is correct in this context.
> The intention was to determine the position for the value we're
> attempting to exchange.
> 
> Let's consider a scenario where we have 0xAABBCCDD at address 0x6. Even
> though this would result in misaligned access, I believe new_val_pos
> should still be calculated accurately. Let's say we want to exchange
> two bytes (AABB) with FFFF.
> 
> With the current implementation, we would calculate:
> 0x6 & 2 = 2 * 8 = 16, which is the value on which the new value should
> be shifted.
> 
> However, if this value is then ANDed with sizeof(*aligned_ptr):
> 0x6 & 4 = 6 * 8 = 48, which is not the expected value.
> 
> Am I overlooking something?

I'm afraid I can only reply with a counter question (feels like there is
some misunderstanding): Do you agree that 0x4 == 4 == sizeof(*aligned_ptr)?
It's the 0x4 that the latter part of my earlier reply was about. I'm pretty
sure you have, over the various reviews I've done, noticed my dislike for
the use of literal numbers, when those aren't truly "magic".

Jan
Oleksii Kurochko April 2, 2024, 1:43 p.m. UTC | #7
On Tue, 2024-04-02 at 14:54 +0200, Jan Beulich wrote:
> On 02.04.2024 13:43, Oleksii wrote:
> > On Thu, 2024-03-21 at 13:07 +0100, Jan Beulich wrote:
> > > > + * If resulting 4-byte access is still misalgined, it will
> > > > fault
> > > > just as
> > > > + * non-emulated 4-byte access would.
> > > > + */
> > > > +#define emulate_xchg_1_2(ptr, new, lr_sfx, sc_sfx) \
> > > > +({ \
> > > > +    uint32_t *aligned_ptr = (uint32_t *)((unsigned long)ptr &
> > > > ~(0x4 - sizeof(*(ptr)))); \
> > > > +    unsigned int new_val_pos = ((unsigned long)(ptr) & (0x4 -
> > > > sizeof(*(ptr)))) * BITS_PER_BYTE; \
> > > 
> > > You parenthesize ptr here correctly, but not in the line above.
> > > 
> > > Instead of "_pos" in the name, maybe better "_bit"?
> > > 
> > > Finally, here and elsewhere, please limit line length to 80
> > > chars.
> > > (Omitting
> > > the 0x here would help a little, but not quite enough. Question
> > > is
> > > whether
> > > these wouldn't better be sizeof(*aligned_ptr) anyway.)
> > 
> > I'm unsure if using sizeof(*aligned_ptr) is correct in this
> > context.
> > The intention was to determine the position for the value we're
> > attempting to exchange.
> > 
> > Let's consider a scenario where we have 0xAABBCCDD at address 0x6.
> > Even
> > though this would result in misaligned access, I believe
> > new_val_pos
> > should still be calculated accurately. Let's say we want to
> > exchange
> > two bytes (AABB) with FFFF.
> > 
> > With the current implementation, we would calculate:
> > 0x6 & 2 = 2, and 2 * 8 = 16, which is the amount by which the new
> > value should be shifted.
> > 
> > However, if the address is ANDed with sizeof(*aligned_ptr) instead:
> > 0x6 & 4 = 4, and 4 * 8 = 32, which is not the expected value.
> > 
> > Am I overlooking something?
> 
> I'm afraid I can only reply with a counter question (feels like there
> is
> some misunderstanding): Do you agree that 0x4 == 4 ==
> sizeof(*aligned_ptr)?
> It's the 0x4 that the latter part of my earlier reply was about. I'm
> pretty
> sure you have, over the various reviews I've done, noticed my dislike
> for
> the use of literal numbers, when those aren't truly "magic".
Yes, it was a misunderstanding. Thanks for clarifying.

~ Oleksii

Patch

diff --git a/xen/arch/riscv/include/asm/cmpxchg.h b/xen/arch/riscv/include/asm/cmpxchg.h
new file mode 100644
index 0000000000..aba2858933
--- /dev/null
+++ b/xen/arch/riscv/include/asm/cmpxchg.h
@@ -0,0 +1,209 @@ 
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright (C) 2014 Regents of the University of California */
+
+#ifndef _ASM_RISCV_CMPXCHG_H
+#define _ASM_RISCV_CMPXCHG_H
+
+#include <xen/compiler.h>
+#include <xen/lib.h>
+
+#include <asm/fence.h>
+#include <asm/io.h>
+#include <asm/system.h>
+
+#define __amoswap_generic(ptr, new, ret, sfx) \
+({ \
+    asm volatile ( \
+        " amoswap" sfx " %0, %2, %1" \
+        : "=r" (ret), "+A" (*ptr) \
+        : "r" (new) \
+        : "memory" ); \
+})
+
+/*
+ * For LR and SC, the A extension requires that the address held in rs1 be
+ * naturally aligned to the size of the operand (i.e., eight-byte aligned
+ * for 64-bit words and four-byte aligned for 32-bit words).
+ * If the address is not naturally aligned, an address-misaligned exception
+ * or an access-fault exception will be generated.
+ *
+ * Thereby:
+ * - for 1-byte xchg access the containing word by clearing low two bits
+ * - for 2-byte xchg ccess the containing word by clearing bit 1.
+ *
+ * If resulting 4-byte access is still misalgined, it will fault just as
+ * non-emulated 4-byte access would.
+ */
+#define emulate_xchg_1_2(ptr, new, lr_sfx, sc_sfx) \
+({ \
+    uint32_t *aligned_ptr = (uint32_t *)((unsigned long)ptr & ~(0x4 - sizeof(*(ptr)))); \
+    unsigned int new_val_pos = ((unsigned long)(ptr) & (0x4 - sizeof(*(ptr)))) * BITS_PER_BYTE; \
+    unsigned long mask = GENMASK(((sizeof(*(ptr))) * BITS_PER_BYTE) - 1, 0) << new_val_pos; \
+    unsigned int new_ = new << new_val_pos; \
+    unsigned int old; \
+    unsigned int scratch; \
+    \
+    asm volatile ( \
+        "0: lr.w" lr_sfx " %[old], %[aligned_ptr]\n" \
+        "   and  %[scratch], %[old], %z[nmask]\n" \
+        "   or   %[scratch], %[scratch], %z[new_]\n" \
+        "   sc.w" sc_sfx " %[scratch], %[scratch], %[aligned_ptr]\n" \
+        "   bnez %[scratch], 0b\n" \
+        : [old] "=&r" (old), [scratch] "=&r" (scratch), [aligned_ptr] "+A" (*aligned_ptr) \
+        : [new_] "rJ" (new_), [nmask] "rJ" (~mask) \
+        : "memory" ); \
+    \
+    (__typeof__(*(ptr)))((old & mask) >> new_val_pos); \
+})
+
+static always_inline unsigned long __xchg(volatile void *ptr, unsigned long new, int size)
+{
+    unsigned long ret;
+
+    switch ( size )
+    {
+    case 1:
+        ret = emulate_xchg_1_2((volatile uint8_t *)ptr, new, ".aq", ".aqrl");
+        break;
+    case 2:
+        ret = emulate_xchg_1_2((volatile uint16_t *)ptr, new, ".aq", ".aqrl");
+        break;
+    case 4:
+        __amoswap_generic((volatile uint32_t *)ptr, new, ret, ".w.aqrl");
+        break;
+#ifndef CONFIG_32BIT
+    case 8:
+        __amoswap_generic((volatile uint64_t *)ptr, new, ret, ".d.aqrl");
+        break;
+#endif
+    default:
+        STATIC_ASSERT_UNREACHABLE();
+    }
+
+    return ret;
+}
+
+#define xchg(ptr, x) \
+({ \
+    __typeof__(*(ptr)) n_ = (x); \
+    (__typeof__(*(ptr))) \
+        __xchg((ptr), (unsigned long)(n_), sizeof(*(ptr))); \
+})
+
+#define __generic_cmpxchg(ptr, old, new, ret, lr_sfx, sc_sfx)	\
+ ({ \
+    register unsigned int rc; \
+    __typeof__(*(ptr)) old__ = (__typeof__(*(ptr)))(old); \
+    __typeof__(*(ptr)) new__ = (__typeof__(*(ptr)))(new); \
+    asm volatile( \
+        "0: lr" lr_sfx " %0, %2\n" \
+        "   bne  %0, %z3, 1f\n" \
+        "   sc" sc_sfx " %1, %z4, %2\n" \
+        "   bnez %1, 0b\n" \
+        "1:\n" \
+        : "=&r" (ret), "=&r" (rc), "+A" (*ptr) \
+        : "rJ" (old__), "rJ" (new__) \
+        : "memory"); \
+ })
+
+/*
+ * For LR and SC, the A extension requires that the address held in rs1 be
+ * naturally aligned to the size of the operand (i.e., eight-byte aligned
+ * for 64-bit words and four-byte aligned for 32-bit words).
+ * If the address is not naturally aligned, an address-misaligned exception
+ * or an access-fault exception will be generated.
+ *
+ * Thereby:
+ * - for 1-byte xchg access the containing word by clearing low two bits
+ * - for 2-byte xchg ccess the containing word by clearing first bit.
+ * 
+ * If resulting 4-byte access is still misalgined, it will fault just as
+ * non-emulated 4-byte access would.
+ *
+ * old_val was casted to unsigned long for cmpxchgptr()
+ */
+#define emulate_cmpxchg_1_2(ptr, old, new, lr_sfx, sc_sfx) \
+({ \
+    uint32_t *aligned_ptr = (uint32_t *)((unsigned long)ptr & ~(0x4 - sizeof(*(ptr)))); \
+    uint8_t new_val_pos = ((unsigned long)(ptr) & (0x4 - sizeof(*(ptr)))) * BITS_PER_BYTE; \
+    unsigned long mask = GENMASK(((sizeof(*(ptr))) * BITS_PER_BYTE) - 1, 0) << new_val_pos; \
+    unsigned int old_ = old << new_val_pos; \
+    unsigned int new_ = new << new_val_pos; \
+    unsigned int old_val; \
+    unsigned int scratch; \
+    \
+    __asm__ __volatile__ ( \
+        "0: lr.w" lr_sfx " %[scratch], %[aligned_ptr]\n" \
+        "   and  %[old_val], %[scratch], %z[mask]\n" \
+        "   bne  %[old_val], %z[old_], 1f\n" \
+        "   xor  %[scratch], %[old_val], %[scratch]\n" \
+        "   or   %[scratch], %[scratch], %z[new_]\n" \
+        "   sc.w" sc_sfx " %[scratch], %[scratch], %[aligned_ptr]\n" \
+        "   bnez %[scratch], 0b\n" \
+        "1:\n" \
+        : [old_val] "=&r" (old_val), [scratch] "=&r" (scratch), [aligned_ptr] "+A" (*aligned_ptr) \
+        : [old_] "rJ" (old_), [new_] "rJ" (new_), \
+          [mask] "rJ" (mask) \
+        : "memory" ); \
+    \
+    (__typeof__(*(ptr)))((unsigned long)old_val >> new_val_pos); \
+})
+
+/*
+ * Atomic compare and exchange.  Compare OLD with MEM, if identical,
+ * store NEW in MEM.  Return the initial value in MEM.  Success is
+ * indicated by comparing RETURN with OLD.
+ */
+static always_inline unsigned long __cmpxchg(volatile void *ptr,
+                         unsigned long old,
+                         unsigned long new,
+                         int size)
+{
+    unsigned long ret;
+
+    switch ( size )
+    {
+    case 1:
+        ret = emulate_cmpxchg_1_2((volatile uint8_t *)ptr, old, new,
+                                  ".aq", ".aqrl");
+        break;
+    case 2:
+        ret = emulate_cmpxchg_1_2((volatile uint16_t *)ptr, old, new,
+                                   ".aq", ".aqrl");
+        break;
+    case 4:
+        __generic_cmpxchg((volatile uint32_t *)ptr, old, new, ret,
+                          ".w.aq", ".w.aqrl");
+        break;
+#ifndef CONFIG_32BIT
+    case 8:
+        __generic_cmpxchg((volatile uint64_t *)ptr, old, new,
+                           ret, ".d.aq", ".d.aqrl");
+        break;
+#endif
+    default:
+        STATIC_ASSERT_UNREACHABLE();
+    }
+
+    return ret;
+}
+
+#define cmpxchg(ptr, o, n) \
+({ \
+    __typeof__(*(ptr)) o_ = (o); \
+    __typeof__(*(ptr)) n_ = (n); \
+    (__typeof__(*(ptr))) \
+    __cmpxchg((ptr), (unsigned long)(o_), (unsigned long)(n_), \
+              sizeof(*(ptr))); \
+})
+
+#endif /* _ASM_RISCV_CMPXCHG_H */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */