Message ID: 25315ca95baffc9b222fb0ae89375a94b01a9b46.1707146506.git.oleksii.kurochko@gmail.com (mailing list archive)
State: Superseded
Series: Enable build of full Xen for RISC-V
On 05.02.2024 16:32, Oleksii Kurochko wrote: > --- /dev/null > +++ b/xen/arch/riscv/include/asm/cmpxchg.h > @@ -0,0 +1,237 @@ > +/* SPDX-License-Identifier: GPL-2.0-only */ > +/* Copyright (C) 2014 Regents of the University of California */ > + > +#ifndef _ASM_RISCV_CMPXCHG_H > +#define _ASM_RISCV_CMPXCHG_H > + > +#include <xen/compiler.h> > +#include <xen/lib.h> > + > +#include <asm/fence.h> > +#include <asm/io.h> > +#include <asm/system.h> > + > +#define ALIGN_DOWN(addr, size) ((addr) & (~((size) - 1))) This feels risky: Consider what happens when someone passes 2U as 2nd argument. The cheapest adjustment to make would be to use 1UL in the expression. > +#define __amoswap_generic(ptr, new, ret, sfx, release_barrier, acquire_barrier) \ > +({ \ > + asm volatile( \ > + release_barrier \ > + " amoswap" sfx " %0, %2, %1\n" \ While I won't insist, the revision log says \n were dropped from asm() where not needed. A separator is needed here only if ... > + acquire_barrier \ ... this isn't blank. Which imo suggests that the separator should be part of the argument passed in. But yes, one can view this differently, hence why I said I won't insist. As to the naming of the two - I'd generally suggest to make as litte implications as possible: It doesn't really matter here whether it's acquire or release; that matters at the use sites. What matters here is that one is a "pre" barrier and the other is a "post" one. > + : "=r" (ret), "+A" (*ptr) \ > + : "r" (new) \ > + : "memory" ); \ > +}) > + > +#define emulate_xchg_1_2(ptr, new, ret, release_barrier, acquire_barrier) \ > +({ \ > + uint32_t *ptr_32b_aligned = (uint32_t *)ALIGN_DOWN((unsigned long)ptr, 4); \ You now appear to assume that this macro is only used with inputs not crossing word boundaries. That's okay as long as suitably guaranteed at the use sites, but imo wants saying in a comment. > + uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - sizeof(*ptr))) * BITS_PER_BYTE; \ Why 0x8 (i.e. spanning 64 bits), not 4 (matching the uint32_t use above)? > + uint8_t mask_size = sizeof(*ptr) * BITS_PER_BYTE; \ > + uint8_t mask_h = mask_l + mask_size - 1; \ > + unsigned long mask = GENMASK(mask_h, mask_l); \ Personally I find this confusing, naming-wise: GENMASK() takes bit positions as inputs, not masks. (Initially, because of this, I thought the calculations all can't be quite right.) > + unsigned long new_ = (unsigned long)(new) << mask_l; \ > + unsigned long ret_; \ > + unsigned long rc; \ Similarly, why unsigned long here? I also wonder about the mix of underscore suffixed (or not) variable names here. > + \ > + asm volatile( \ Nit: Missing blank before opening parenthesis. > + release_barrier \ > + "0: lr.d %0, %2\n" \ Even here it's an 8-byte access. Even if - didn't check - the insn was okay to use with just a 4-byte aligned pointer, wouldn't it make sense then to 8-byte align it, and be consistent throughout this macro wrt the base unit acted upon? Alternatively, why not use lr.w here, thus reducing possible collisions between multiple CPUs accessing the same cache line? > + " and %1, %0, %z4\n" \ > + " or %1, %1, %z3\n" \ > + " sc.d %1, %1, %2\n" \ > + " bnez %1, 0b\n" \ > + acquire_barrier \ > + : "=&r" (ret_), "=&r" (rc), "+A" (*ptr_32b_aligned) \ > + : "rJ" (new_), "rJ" (~mask) \ I think that as soon as there are more than 2 or maybe 3 operands, legibility is vastly improved by using named asm() operands. > + : "memory"); \ Nit: Missing blank before closing parenthesis. 
> + \ > + ret = (__typeof__(*(ptr)))((ret_ & mask) >> mask_l); \ > +}) Why does "ret" need to be a macro argument? If you had only the expression here, not the the assigment, ... > +#define __xchg_generic(ptr, new, size, sfx, release_barrier, acquire_barrier) \ > +({ \ > + __typeof__(ptr) ptr__ = (ptr); \ Is this local variable really needed? Can't you use "ptr" directly in the three macro invocations? > + __typeof__(*(ptr)) new__ = (new); \ > + __typeof__(*(ptr)) ret__; \ > + switch (size) \ > + { \ > + case 1: \ > + case 2: \ > + emulate_xchg_1_2(ptr__, new__, ret__, release_barrier, acquire_barrier); \ ... this would become ret__ = emulate_xchg_1_2(ptr__, new__, release_barrier, acquire_barrier); \ But, unlike assumed above, there's no enforcement here that a 2-byte quantity won't cross a word, double-word, cache line, or even page boundary. That might be okay if then the code would simply crash (like the AMO insns emitted further down would), but aiui silent misbehavior would result. Also nit: The switch() higher up is (still/again) missing blanks. > + break; \ > + case 4: \ > + __amoswap_generic(ptr__, new__, ret__,\ > + ".w" sfx, release_barrier, acquire_barrier); \ > + break; \ > + case 8: \ > + __amoswap_generic(ptr__, new__, ret__,\ > + ".d" sfx, release_barrier, acquire_barrier); \ > + break; \ > + default: \ > + STATIC_ASSERT_UNREACHABLE(); \ > + } \ > + ret__; \ > +}) > + > +#define xchg_relaxed(ptr, x) \ > +({ \ > + __typeof__(*(ptr)) x_ = (x); \ > + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), "", "", ""); \ > +}) > + > +#define xchg_acquire(ptr, x) \ > +({ \ > + __typeof__(*(ptr)) x_ = (x); \ > + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), \ > + "", "", RISCV_ACQUIRE_BARRIER); \ > +}) > + > +#define xchg_release(ptr, x) \ > +({ \ > + __typeof__(*(ptr)) x_ = (x); \ > + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)),\ > + "", RISCV_RELEASE_BARRIER, ""); \ > +}) > + > +#define xchg(ptr,x) \ > +({ \ > + __typeof__(*(ptr)) ret__; \ > + ret__ = (__typeof__(*(ptr))) \ > + __xchg_generic(ptr, (unsigned long)(x), sizeof(*(ptr)), \ > + ".aqrl", "", ""); \ The .aqrl doesn't look to affect the (emulated) 1- and 2-byte cases. Further, amoswap also exists in release-only and acquire-only forms. Why do you prefer explicit barrier insns over those? (Looks to similarly apply to the emulation path as well as to the cmpxchg machinery then, as both lr and sc also come in all four possible acquire/release forms. Perhaps for the emulation path using explicit barriers is better, in case the acquire/release forms of lr/sc - being used inside the loop - might perform worse.) 
> + ret__; \ > +}) > + > +#define __generic_cmpxchg(ptr, old, new, ret, lr_sfx, sc_sfx, release_barrier, acquire_barrier) \ > + ({ \ > + register unsigned int rc; \ > + asm volatile( \ > + release_barrier \ > + "0: lr" lr_sfx " %0, %2\n" \ > + " bne %0, %z3, 1f\n" \ > + " sc" sc_sfx " %1, %z4, %2\n" \ > + " bnez %1, 0b\n" \ > + acquire_barrier \ > + "1:\n" \ > + : "=&r" (ret), "=&r" (rc), "+A" (*ptr) \ > + : "rJ" (old), "rJ" (new) \ > + : "memory"); \ > + }) > + > +#define emulate_cmpxchg_1_2(ptr, old, new, ret, sc_sfx, release_barrier, acquire_barrier) \ > +({ \ > + uint32_t *ptr_32b_aligned = (uint32_t *)ALIGN_DOWN((unsigned long)ptr, 4); \ > + uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - sizeof(*ptr))) * BITS_PER_BYTE; \ > + uint8_t mask_size = sizeof(*ptr) * BITS_PER_BYTE; \ > + uint8_t mask_h = mask_l + mask_size - 1; \ > + unsigned long mask = GENMASK(mask_h, mask_l); \ > + unsigned long old_ = (unsigned long)(old) << mask_l; \ > + unsigned long new_ = (unsigned long)(new) << mask_l; \ > + unsigned long ret_; \ > + unsigned long rc; \ > + \ > + __asm__ __volatile__ ( \ > + release_barrier \ > + "0: lr.d %0, %2\n" \ > + " and %1, %0, %z5\n" \ > + " bne %1, %z3, 1f\n" \ > + " and %1, %0, %z6\n" \ Isn't this equivalent to " xor %1, %1, %0\n" \ this eliminating one (likely register) input? Furthermore with the above and ... > + " or %1, %1, %z4\n" \ > + " sc.d" sc_sfx " %1, %1, %2\n" \ > + " bnez %1, 0b\n" \ ... this re-written to " xor %0, %1, %0\n" \ " or %0, %0, %z4\n" \ " sc.d" sc_sfx " %0, %0, %2\n" \ " bnez %0, 0b\n" \ you'd then no longer clobber the ret_ & mask you've already calculated in %1, so ... > + acquire_barrier \ > + "1:\n" \ > + : "=&r" (ret_), "=&r" (rc), "+A" (*ptr_32b_aligned) \ > + : "rJ" (old_), "rJ" (new_), \ > + "rJ" (mask), "rJ" (~mask) \ > + : "memory"); \ > + \ > + ret = (__typeof__(*(ptr)))((ret_ & mask) >> mask_l); \ ... you could use rc here. (Of course variable naming or use then may want changing, assuming I understand why "rc" is named the way it is.) > +}) > + > +/* > + * Atomic compare and exchange. Compare OLD with MEM, if identical, > + * store NEW in MEM. Return the initial value in MEM. Success is > + * indicated by comparing RETURN with OLD. 
> + */ > +#define __cmpxchg_generic(ptr, old, new, size, sc_sfx, release_barrier, acquire_barrier) \ > +({ \ > + __typeof__(ptr) ptr__ = (ptr); \ > + __typeof__(*(ptr)) old__ = (__typeof__(*(ptr)))(old); \ > + __typeof__(*(ptr)) new__ = (__typeof__(*(ptr)))(new); \ > + __typeof__(*(ptr)) ret__; \ > + switch (size) \ > + { \ > + case 1: \ > + case 2: \ > + emulate_cmpxchg_1_2(ptr, old, new, ret__,\ > + sc_sfx, release_barrier, acquire_barrier); \ > + break; \ > + case 4: \ > + __generic_cmpxchg(ptr__, old__, new__, ret__, \ > + ".w", ".w"sc_sfx, release_barrier, acquire_barrier); \ > + break; \ > + case 8: \ > + __generic_cmpxchg(ptr__, old__, new__, ret__, \ > + ".d", ".d"sc_sfx, release_barrier, acquire_barrier); \ > + break; \ > + default: \ > + STATIC_ASSERT_UNREACHABLE(); \ > + } \ > + ret__; \ > +}) > + > +#define cmpxchg_relaxed(ptr, o, n) \ > +({ \ > + __typeof__(*(ptr)) o_ = (o); \ > + __typeof__(*(ptr)) n_ = (n); \ > + (__typeof__(*(ptr)))__cmpxchg_generic(ptr, \ > + o_, n_, sizeof(*(ptr)), "", "", ""); \ > +}) > + > +#define cmpxchg_acquire(ptr, o, n) \ > +({ \ > + __typeof__(*(ptr)) o_ = (o); \ > + __typeof__(*(ptr)) n_ = (n); \ > + (__typeof__(*(ptr)))__cmpxchg_generic(ptr, o_, n_, sizeof(*(ptr)), \ > + "", "", RISCV_ACQUIRE_BARRIER); \ > +}) > + > +#define cmpxchg_release(ptr, o, n) \ > +({ \ > + __typeof__(*(ptr)) o_ = (o); \ > + __typeof__(*(ptr)) n_ = (n); \ > + (__typeof__(*(ptr)))__cmpxchg_release(ptr, o_, n_, sizeof(*(ptr)), \ > + "", RISCV_RELEASE_BARRIER, ""); \ > +}) > + > +#define cmpxchg(ptr, o, n) \ > +({ \ > + __typeof__(*(ptr)) ret__; \ > + ret__ = (__typeof__(*(ptr))) \ > + __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned long)(n), \ > + sizeof(*(ptr)), ".rl", "", " fence rw, rw\n"); \ No RISCV_..._BARRIER for use here and ... > + ret__; \ > +}) > + > +#define __cmpxchg(ptr, o, n, s) \ > +({ \ > + __typeof__(*(ptr)) ret__; \ > + ret__ = (__typeof__(*(ptr))) \ > + __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned long)(n), \ > + s, ".rl", "", " fence rw, rw\n"); \ ... here? And anyway, wouldn't it make sense to have #define cmpxchg(ptr, o, n) __cmpxchg(ptr, o, n, sizeof(*(ptr)) to limit redundancy? Plus wouldn't #define __cmpxchg(ptr, o, n, s) \ ((__typeof__(*(ptr))) \ __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned long)(n), \ s, ".rl", "", " fence rw, rw\n")) be shorter and thus easier to follow as well? As I notice only now, this would apparently apply further up as well. Jan
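As an aside, the ALIGN_DOWN concern raised at the top of the review can be seen with a small example; the values below are made up purely to illustrate the unsigned int promotion and are not part of the patch:

    #define ALIGN_DOWN_CURRENT(addr, size) ((addr) & (~((size) - 1)))
    #define ALIGN_DOWN_FIXED(addr, size)   ((addr) & (~((size) - 1UL)))

    /*
     * With size passed as 2U, (size) - 1 is the unsigned int 1U and ~1U is
     * 0xFFFFFFFE; converting that to unsigned long zero-extends it, so the
     * AND clears bits 63..32 of the address.  Computing the mask in
     * unsigned long (the 1UL variant) only clears bit 0, as intended.
     */
    unsigned long addr = 0xFFFFFFC000123456UL;
    /* ALIGN_DOWN_CURRENT(addr, 2U) == 0x0000000000123456UL - upper bits lost */
    /* ALIGN_DOWN_FIXED(addr, 2U)   == 0xFFFFFFC000123456UL - only bit 0 cleared */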
> > + : "=r" (ret), "+A" (*ptr) \ > > + : "r" (new) \ > > + : "memory" ); \ > > +}) > > + > > +#define emulate_xchg_1_2(ptr, new, ret, release_barrier, > > acquire_barrier) \ > > +({ \ > > + uint32_t *ptr_32b_aligned = (uint32_t *)ALIGN_DOWN((unsigned > > long)ptr, 4); \ > > You now appear to assume that this macro is only used with inputs not > crossing word boundaries. That's okay as long as suitably guaranteed > at the use sites, but imo wants saying in a comment. > > > + uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - sizeof(*ptr))) > > * BITS_PER_BYTE; \ > > Why 0x8 (i.e. spanning 64 bits), not 4 (matching the uint32_t use > above)? The idea to read 8 bytes was to deal with crossing word boundary. So if our address is 0x3 and we have to xchg() 2 bytes, what will cross 4 byte boundary. Instead we align add 0x3, so it will become 0x0 and then just always work with 8 bytes. > > > + unsigned long new_ = (unsigned long)(new) << mask_l; \ > > + unsigned long ret_; \ > > + unsigned long rc; \ > > Similarly, why unsigned long here? sizeof(unsigned long) is 8 bytes and it was chosen as we are working with lc/sc.d which are working with 8 bytes. > > I also wonder about the mix of underscore suffixed (or not) variable > names here. If the question about ret_, then the same as before size of ret argument of the macros will be 1 or 2, but {lc/sc}.d expected to work with 8 bytes. > > > + release_barrier \ > > + "0: lr.d %0, %2\n" \ > > Even here it's an 8-byte access. Even if - didn't check - the insn > was > okay to use with just a 4-byte aligned pointer, wouldn't it make > sense > then to 8-byte align it, and be consistent throughout this macro wrt > the base unit acted upon? Alternatively, why not use lr.w here, thus > reducing possible collisions between multiple CPUs accessing the same > cache line? According to the docs: LR and SC operate on naturally-aligned 64-bit (RV64 only) or 32-bit words in memory. Misaligned addresses will generate misaligned address exceptions. My intention was to deal with 4-byte crossing boundary. so if ptr is 4- byte aligned then by reading 8-bytes we shouldn't care about boundary crossing, if I am not missing something. But your opinion about reduction of collisions makes sense also... > > > + " and %1, %0, %z4\n" \ > > + " or %1, %1, %z3\n" \ > > + " sc.d %1, %1, %2\n" \ > > + " bnez %1, 0b\n" \ > > + acquire_barrier \ > > + : "=&r" (ret_), "=&r" (rc), "+A" (*ptr_32b_aligned) \ > > + : "rJ" (new_), "rJ" (~mask) \ > > I think that as soon as there are more than 2 or maybe 3 operands, > legibility is vastly improved by using named asm() operands. Just to clarify you mean that it would be better to use instead of %0 use names? > > > + : "memory"); \ > > Nit: Missing blank before closing parenthesis. > > > + \ > > + ret = (__typeof__(*(ptr)))((ret_ & mask) >> mask_l); \ > > +}) > > Why does "ret" need to be a macro argument? If you had only the > expression here, not the the assigment, ... > > > +#define __xchg_generic(ptr, new, size, sfx, release_barrier, > > acquire_barrier) \ > > +({ \ > > + __typeof__(ptr) ptr__ = (ptr); \ > > Is this local variable really needed? Can't you use "ptr" directly > in the three macro invocations? > > > + __typeof__(*(ptr)) new__ = (new); \ > > + __typeof__(*(ptr)) ret__; \ > > + switch (size) \ > > + { \ > > + case 1: \ > > + case 2: \ > > + emulate_xchg_1_2(ptr__, new__, ret__, release_barrier, > > acquire_barrier); \ > > ... 
this would become > > ret__ = emulate_xchg_1_2(ptr__, new__, release_barrier, > acquire_barrier); \ > > But, unlike assumed above, there's no enforcement here that a 2-byte > quantity won't cross a word, double-word, cache line, or even page > boundary. That might be okay if then the code would simply crash > (like > the AMO insns emitted further down would), but aiui silent > misbehavior > would result. As I mentioned above with 4-byte alignment and then reading and working with 8-byte then crossing a word or double-word boundary shouldn't be an issue. I am not sure that I know how to check that we are crossing cache line boundary. Regarding page boundary, if the next page is mapped then all should work fine, otherwise it will be an exception. > > Also nit: The switch() higher up is (still/again) missing blanks. > > > + break; \ > > + case 4: \ > > + __amoswap_generic(ptr__, new__, ret__,\ > > + ".w" sfx, release_barrier, > > acquire_barrier); \ > > + break; \ > > + case 8: \ > > + __amoswap_generic(ptr__, new__, ret__,\ > > + ".d" sfx, release_barrier, > > acquire_barrier); \ > > + break; \ > > + default: \ > > + STATIC_ASSERT_UNREACHABLE(); \ > > + } \ > > + ret__; \ > > +}) > > + > > +#define xchg_relaxed(ptr, x) \ > > +({ \ > > + __typeof__(*(ptr)) x_ = (x); \ > > + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), > > "", "", ""); \ > > +}) > > + > > +#define xchg_acquire(ptr, x) \ > > +({ \ > > + __typeof__(*(ptr)) x_ = (x); \ > > + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), \ > > + "", "", > > RISCV_ACQUIRE_BARRIER); \ > > +}) > > + > > +#define xchg_release(ptr, x) \ > > +({ \ > > + __typeof__(*(ptr)) x_ = (x); \ > > + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)),\ > > + "", RISCV_RELEASE_BARRIER, > > ""); \ > > +}) > > + > > +#define xchg(ptr,x) \ > > +({ \ > > + __typeof__(*(ptr)) ret__; \ > > + ret__ = (__typeof__(*(ptr))) \ > > + __xchg_generic(ptr, (unsigned long)(x), > > sizeof(*(ptr)), \ > > + ".aqrl", "", ""); \ > > The .aqrl doesn't look to affect the (emulated) 1- and 2-byte cases. > > Further, amoswap also exists in release-only and acquire-only forms. > Why do you prefer explicit barrier insns over those? (Looks to > similarly apply to the emulation path as well as to the cmpxchg > machinery then, as both lr and sc also come in all four possible > acquire/release forms. Perhaps for the emulation path using > explicit barriers is better, in case the acquire/release forms of > lr/sc - being used inside the loop - might perform worse.) As 1- and 2-byte cases are emulated I decided that is not to provide sfx argument for emulation macros as it will not have to much affect on emulated types and just consume more performance on acquire and release version of sc/ld instructions. > > > > No RISCV_..._BARRIER for use here and ... > > > + ret__; \ > > +}) > > + > > +#define __cmpxchg(ptr, o, n, s) \ > > +({ \ > > + __typeof__(*(ptr)) ret__; \ > > + ret__ = (__typeof__(*(ptr))) \ > > + __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned > > long)(n), \ > > + s, ".rl", "", " fence rw, rw\n"); \ > > ... here? And anyway, wouldn't it make sense to have > > #define cmpxchg(ptr, o, n) __cmpxchg(ptr, o, n, sizeof(*(ptr)) > > to limit redundancy? > > Plus wouldn't > > #define __cmpxchg(ptr, o, n, s) \ > ((__typeof__(*(ptr))) \ > __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned long)(n), \ > s, ".rl", "", " fence rw, rw\n")) > > be shorter and thus easier to follow as well? 
As I notice only now, > this would apparently apply further up as well. I understand your point about "#define cmpxchg(ptr, o, n) __cmpxchg(", but I can't understand how the definition of __cmpxchg should be done shorter. Could you please clarify that? ~ Oleksii
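To illustrate the earlier question about named asm() operands: the loop from the patch could be written with symbolic names instead of the positional %0..%4, without changing the instruction sequence (a sketch only, using the variables already present in the patch):

    asm volatile (
        "0: lr.d %[old], %[addr]\n"
        "   and  %[tmp], %[old], %z[nmask]\n"
        "   or   %[tmp], %[tmp], %z[newv]\n"
        "   sc.d %[tmp], %[tmp], %[addr]\n"
        "   bnez %[tmp], 0b\n"
        : [old] "=&r" (ret_), [tmp] "=&r" (rc), [addr] "+A" (*ptr_32b_aligned)
        : [newv] "rJ" (new_), [nmask] "rJ" (~mask)
        : "memory" );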
On 05/02/2024 15:32, Oleksii Kurochko wrote: > The header was taken from Linux kernl 6.4.0-rc1. > > Addionally, were updated: > * add emulation of {cmp}xchg for 1/2 byte types This explaination is a little bit light. IIUC, you are implementing them using 32-bit atomic access. Is that correct? If so, please spell it out. Also, I wonder whether it would be better to try to get rid of the 1/2 bytes access. Do you know where they are used? > * replace tabs with spaces Does this mean you are not planning to backport any Linux fixes? > * replace __* varialbed with *__ s/varialbed/variable/ > * introduce generic version of xchg_* and cmpxchg_*. > > Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> > --- > Changes in V4: > - Code style fixes. > - enforce in __xchg_*() has the same type for new and *ptr, also "\n" > was removed at the end of asm instruction. > - dependency from https://lore.kernel.org/xen-devel/cover.1706259490.git.federico.serafini@bugseng.com/ > - switch from ASSERT_UNREACHABLE to STATIC_ASSERT_UNREACHABLE(). > - drop xchg32(ptr, x) and xchg64(ptr, x) as they aren't used. > - drop cmpxcg{32,64}_{local} as they aren't used. > - introduce generic version of xchg_* and cmpxchg_*. > - update the commit message. > --- > Changes in V3: > - update the commit message > - add emulation of {cmp}xchg_... for 1 and 2 bytes types > --- > Changes in V2: > - update the comment at the top of the header. > - change xen/lib.h to xen/bug.h. > - sort inclusion of headers properly. > --- > xen/arch/riscv/include/asm/cmpxchg.h | 237 +++++++++++++++++++++++++++ > 1 file changed, 237 insertions(+) > create mode 100644 xen/arch/riscv/include/asm/cmpxchg.h > > diff --git a/xen/arch/riscv/include/asm/cmpxchg.h b/xen/arch/riscv/include/asm/cmpxchg.h > new file mode 100644 > index 0000000000..b751a50cbf > --- /dev/null > +++ b/xen/arch/riscv/include/asm/cmpxchg.h > @@ -0,0 +1,237 @@ > +/* SPDX-License-Identifier: GPL-2.0-only */ > +/* Copyright (C) 2014 Regents of the University of California */ > + > +#ifndef _ASM_RISCV_CMPXCHG_H > +#define _ASM_RISCV_CMPXCHG_H > + > +#include <xen/compiler.h> > +#include <xen/lib.h> > + > +#include <asm/fence.h> > +#include <asm/io.h> > +#include <asm/system.h> > + > +#define ALIGN_DOWN(addr, size) ((addr) & (~((size) - 1))) > + > +#define __amoswap_generic(ptr, new, ret, sfx, release_barrier, acquire_barrier) \ > +({ \ > + asm volatile( \ > + release_barrier \ > + " amoswap" sfx " %0, %2, %1\n" \ > + acquire_barrier \ > + : "=r" (ret), "+A" (*ptr) \ > + : "r" (new) \ > + : "memory" ); \ > +}) > + > +#define emulate_xchg_1_2(ptr, new, ret, release_barrier, acquire_barrier) \ > +({ \ > + uint32_t *ptr_32b_aligned = (uint32_t *)ALIGN_DOWN((unsigned long)ptr, 4); \ > + uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - sizeof(*ptr))) * BITS_PER_BYTE; \ > + uint8_t mask_size = sizeof(*ptr) * BITS_PER_BYTE; \ > + uint8_t mask_h = mask_l + mask_size - 1; \ > + unsigned long mask = GENMASK(mask_h, mask_l); \ > + unsigned long new_ = (unsigned long)(new) << mask_l; \ > + unsigned long ret_; \ > + unsigned long rc; \ > + \ > + asm volatile( \ > + release_barrier \ > + "0: lr.d %0, %2\n" \ I was going to ask why this is lr.d rather than lr.w. But I see Jan already asked. I agree with him that it should probably be a lr.w and ... > + " and %1, %0, %z4\n" \ > + " or %1, %1, %z3\n" \ > + " sc.d %1, %1, %2\n" \ ... respectively sc.w. The same applies for cmpxchg. Cheers,
On 15.02.2024 14:41, Oleksii wrote: >>> + : "=r" (ret), "+A" (*ptr) \ >>> + : "r" (new) \ >>> + : "memory" ); \ >>> +}) >>> + >>> +#define emulate_xchg_1_2(ptr, new, ret, release_barrier, >>> acquire_barrier) \ >>> +({ \ >>> + uint32_t *ptr_32b_aligned = (uint32_t *)ALIGN_DOWN((unsigned >>> long)ptr, 4); \ >> >> You now appear to assume that this macro is only used with inputs not >> crossing word boundaries. That's okay as long as suitably guaranteed >> at the use sites, but imo wants saying in a comment. >> >>> + uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - sizeof(*ptr))) >>> * BITS_PER_BYTE; \ >> >> Why 0x8 (i.e. spanning 64 bits), not 4 (matching the uint32_t use >> above)? > The idea to read 8 bytes was to deal with crossing word boundary. So if > our address is 0x3 and we have to xchg() 2 bytes, what will cross 4 > byte boundary. Instead we align add 0x3, so it will become 0x0 and then > just always work with 8 bytes. Then what if my 2-byte access crosses a dword boundary? A cache line one? A page one? >>> + unsigned long new_ = (unsigned long)(new) << mask_l; \ >>> + unsigned long ret_; \ >>> + unsigned long rc; \ >> >> Similarly, why unsigned long here? > sizeof(unsigned long) is 8 bytes and it was chosen as we are working > with lc/sc.d which are working with 8 bytes. > >> >> I also wonder about the mix of underscore suffixed (or not) variable >> names here. > If the question about ret_, then the same as before size of ret > argument of the macros will be 1 or 2, but {lc/sc}.d expected to work > with 8 bytes. Then what's the uint32_t * about? >>> + release_barrier \ >>> + "0: lr.d %0, %2\n" \ >> >> Even here it's an 8-byte access. Even if - didn't check - the insn >> was >> okay to use with just a 4-byte aligned pointer, wouldn't it make >> sense >> then to 8-byte align it, and be consistent throughout this macro wrt >> the base unit acted upon? Alternatively, why not use lr.w here, thus >> reducing possible collisions between multiple CPUs accessing the same >> cache line? > According to the docs: > LR and SC operate on naturally-aligned 64-bit (RV64 only) or 32-bit > words in memory. Misaligned > addresses will generate misaligned address exceptions. > > My intention was to deal with 4-byte crossing boundary. so if ptr is 4- > byte aligned then by reading 8-bytes we shouldn't care about boundary > crossing, if I am not missing something. If a ptr is 4-byte aligned, there's no point reading more than 4 bytes. >>> + " and %1, %0, %z4\n" \ >>> + " or %1, %1, %z3\n" \ >>> + " sc.d %1, %1, %2\n" \ >>> + " bnez %1, 0b\n" \ >>> + acquire_barrier \ >>> + : "=&r" (ret_), "=&r" (rc), "+A" (*ptr_32b_aligned) \ >>> + : "rJ" (new_), "rJ" (~mask) \ >> >> I think that as soon as there are more than 2 or maybe 3 operands, >> legibility is vastly improved by using named asm() operands. > Just to clarify you mean that it would be better to use instead of %0 > use names? Yes. Just like you have it in one of the other patches that I looked at later. >>> + : "memory"); \ >> >> Nit: Missing blank before closing parenthesis. >> >>> + \ >>> + ret = (__typeof__(*(ptr)))((ret_ & mask) >> mask_l); \ >>> +}) >> >> Why does "ret" need to be a macro argument? If you had only the >> expression here, not the the assigment, ... >> >>> +#define __xchg_generic(ptr, new, size, sfx, release_barrier, >>> acquire_barrier) \ >>> +({ \ >>> + __typeof__(ptr) ptr__ = (ptr); \ >> >> Is this local variable really needed? Can't you use "ptr" directly >> in the three macro invocations? 
>> >>> + __typeof__(*(ptr)) new__ = (new); \ >>> + __typeof__(*(ptr)) ret__; \ >>> + switch (size) \ >>> + { \ >>> + case 1: \ >>> + case 2: \ >>> + emulate_xchg_1_2(ptr__, new__, ret__, release_barrier, >>> acquire_barrier); \ >> >> ... this would become >> >> ret__ = emulate_xchg_1_2(ptr__, new__, release_barrier, >> acquire_barrier); \ >> >> But, unlike assumed above, there's no enforcement here that a 2-byte >> quantity won't cross a word, double-word, cache line, or even page >> boundary. That might be okay if then the code would simply crash >> (like >> the AMO insns emitted further down would), but aiui silent >> misbehavior >> would result. > As I mentioned above with 4-byte alignment and then reading and working > with 8-byte then crossing a word or double-word boundary shouldn't be > an issue. > > I am not sure that I know how to check that we are crossing cache line > boundary. > > Regarding page boundary, if the next page is mapped then all should > work fine, otherwise it will be an exception. Are you sure lr.d / sc.d are happy to access across such a boundary, when both pages are mapped? To me it seems pretty clear that for atomic accesses you want to demand natural alignment, i.e. 2-byte alignment for 2-byte accesses. This way you can be sure no potentially problematic boundaries will be crossed. >>> + break; \ >>> + case 4: \ >>> + __amoswap_generic(ptr__, new__, ret__,\ >>> + ".w" sfx, release_barrier, >>> acquire_barrier); \ >>> + break; \ >>> + case 8: \ >>> + __amoswap_generic(ptr__, new__, ret__,\ >>> + ".d" sfx, release_barrier, >>> acquire_barrier); \ >>> + break; \ >>> + default: \ >>> + STATIC_ASSERT_UNREACHABLE(); \ >>> + } \ >>> + ret__; \ >>> +}) >>> + >>> +#define xchg_relaxed(ptr, x) \ >>> +({ \ >>> + __typeof__(*(ptr)) x_ = (x); \ >>> + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), >>> "", "", ""); \ >>> +}) >>> + >>> +#define xchg_acquire(ptr, x) \ >>> +({ \ >>> + __typeof__(*(ptr)) x_ = (x); \ >>> + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), \ >>> + "", "", >>> RISCV_ACQUIRE_BARRIER); \ >>> +}) >>> + >>> +#define xchg_release(ptr, x) \ >>> +({ \ >>> + __typeof__(*(ptr)) x_ = (x); \ >>> + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)),\ >>> + "", RISCV_RELEASE_BARRIER, >>> ""); \ >>> +}) >>> + >>> +#define xchg(ptr,x) \ >>> +({ \ >>> + __typeof__(*(ptr)) ret__; \ >>> + ret__ = (__typeof__(*(ptr))) \ >>> + __xchg_generic(ptr, (unsigned long)(x), >>> sizeof(*(ptr)), \ >>> + ".aqrl", "", ""); \ >> >> The .aqrl doesn't look to affect the (emulated) 1- and 2-byte cases. >> >> Further, amoswap also exists in release-only and acquire-only forms. >> Why do you prefer explicit barrier insns over those? (Looks to >> similarly apply to the emulation path as well as to the cmpxchg >> machinery then, as both lr and sc also come in all four possible >> acquire/release forms. Perhaps for the emulation path using >> explicit barriers is better, in case the acquire/release forms of >> lr/sc - being used inside the loop - might perform worse.) > As 1- and 2-byte cases are emulated I decided that is not to provide > sfx argument for emulation macros as it will not have to much affect on > emulated types and just consume more performance on acquire and release > version of sc/ld instructions. Question is whether the common case (4- and 8-byte accesses) shouldn't be valued higher, with 1- and 2-byte emulation being there just to allow things to not break altogether. >> No RISCV_..._BARRIER for use here and ... 
>> >>> + ret__; \ >>> +}) >>> + >>> +#define __cmpxchg(ptr, o, n, s) \ >>> +({ \ >>> + __typeof__(*(ptr)) ret__; \ >>> + ret__ = (__typeof__(*(ptr))) \ >>> + __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned >>> long)(n), \ >>> + s, ".rl", "", " fence rw, rw\n"); \ >> >> ... here? And anyway, wouldn't it make sense to have >> >> #define cmpxchg(ptr, o, n) __cmpxchg(ptr, o, n, sizeof(*(ptr)) >> >> to limit redundancy? >> >> Plus wouldn't >> >> #define __cmpxchg(ptr, o, n, s) \ >> ((__typeof__(*(ptr))) \ >> __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned long)(n), \ >> s, ".rl", "", " fence rw, rw\n")) >> >> be shorter and thus easier to follow as well? As I notice only now, >> this would apparently apply further up as well. > I understand your point about "#define cmpxchg(ptr, o, n) __cmpxchg(", > but I can't undestand how the definition of __cmxchng should be done > shorter. Could you please clarify that? You did notice that in my form there's no local variable, and hence also no macro-wide scope ( ({ ... }) )? Jan
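Putting both suggestions from the review together, the pair of macros being described would collapse to something like the following sketch - a single cast expression, so no local variable and no ({ ... }) scope are needed:

    #define __cmpxchg(ptr, o, n, s) \
        ((__typeof__(*(ptr))) \
         __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned long)(n), \
                           s, ".rl", "", " fence rw, rw\n"))

    /* cmpxchg() then only supplies the operand size */
    #define cmpxchg(ptr, o, n) __cmpxchg(ptr, o, n, sizeof(*(ptr)))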
On Sun, 2024-02-18 at 19:00 +0000, Julien Grall wrote: > > > On 05/02/2024 15:32, Oleksii Kurochko wrote: > > The header was taken from Linux kernl 6.4.0-rc1. > > > > Addionally, were updated: > > * add emulation of {cmp}xchg for 1/2 byte types > > This explaination is a little bit light. IIUC, you are implementing > them > using 32-bit atomic access. Is that correct? If so, please spell it > out. Sure, I'll update commit message. > > Also, I wonder whether it would be better to try to get rid of the > 1/2 > bytes access. Do you know where they are used? Right now, the issue is with test_and_clear_bool() which is used in common/sched/core.c:840 [https://gitlab.com/xen-project/xen/-/blob/staging/xen/common/sched/core.c?ref_type=heads#L840 ] I don't remember details, but in xen-devel chat someone told me that grant table requires 1/2 bytes access. > > > * replace tabs with spaces > Does this mean you are not planning to backport any Linux fixes? If it will be any fixes for sure I'll back port them, but it looks like this code is stable enough and not to many fixes will be done there, so it shouldn't be hard to backport them and switch to spaces. > > > * replace __* varialbed with *__ > > s/varialbed/variable/ > > > * introduce generic version of xchg_* and cmpxchg_*. > > > > Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> > > --- > > Changes in V4: > > - Code style fixes. > > - enforce in __xchg_*() has the same type for new and *ptr, also > > "\n" > > was removed at the end of asm instruction. > > - dependency from > > https://lore.kernel.org/xen-devel/cover.1706259490.git.federico.serafini@bugseng.com/ > > - switch from ASSERT_UNREACHABLE to STATIC_ASSERT_UNREACHABLE(). > > - drop xchg32(ptr, x) and xchg64(ptr, x) as they aren't used. > > - drop cmpxcg{32,64}_{local} as they aren't used. > > - introduce generic version of xchg_* and cmpxchg_*. > > - update the commit message. > > --- > > Changes in V3: > > - update the commit message > > - add emulation of {cmp}xchg_... for 1 and 2 bytes types > > --- > > Changes in V2: > > - update the comment at the top of the header. > > - change xen/lib.h to xen/bug.h. > > - sort inclusion of headers properly. 
> > --- > > xen/arch/riscv/include/asm/cmpxchg.h | 237 > > +++++++++++++++++++++++++++ > > 1 file changed, 237 insertions(+) > > create mode 100644 xen/arch/riscv/include/asm/cmpxchg.h > > > > diff --git a/xen/arch/riscv/include/asm/cmpxchg.h > > b/xen/arch/riscv/include/asm/cmpxchg.h > > new file mode 100644 > > index 0000000000..b751a50cbf > > --- /dev/null > > +++ b/xen/arch/riscv/include/asm/cmpxchg.h > > @@ -0,0 +1,237 @@ > > +/* SPDX-License-Identifier: GPL-2.0-only */ > > +/* Copyright (C) 2014 Regents of the University of California */ > > + > > +#ifndef _ASM_RISCV_CMPXCHG_H > > +#define _ASM_RISCV_CMPXCHG_H > > + > > +#include <xen/compiler.h> > > +#include <xen/lib.h> > > + > > +#include <asm/fence.h> > > +#include <asm/io.h> > > +#include <asm/system.h> > > + > > +#define ALIGN_DOWN(addr, size) ((addr) & (~((size) - 1))) > > + > > +#define __amoswap_generic(ptr, new, ret, sfx, release_barrier, > > acquire_barrier) \ > > +({ \ > > + asm volatile( \ > > + release_barrier \ > > + " amoswap" sfx " %0, %2, %1\n" \ > > + acquire_barrier \ > > + : "=r" (ret), "+A" (*ptr) \ > > + : "r" (new) \ > > + : "memory" ); \ > > +}) > > + > > +#define emulate_xchg_1_2(ptr, new, ret, release_barrier, > > acquire_barrier) \ > > +({ \ > > + uint32_t *ptr_32b_aligned = (uint32_t *)ALIGN_DOWN((unsigned > > long)ptr, 4); \ > > + uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - sizeof(*ptr))) > > * BITS_PER_BYTE; \ > > + uint8_t mask_size = sizeof(*ptr) * BITS_PER_BYTE; \ > > + uint8_t mask_h = mask_l + mask_size - 1; \ > > + unsigned long mask = GENMASK(mask_h, mask_l); \ > > + unsigned long new_ = (unsigned long)(new) << mask_l; \ > > + unsigned long ret_; \ > > + unsigned long rc; \ > > + \ > > + asm volatile( \ > > + release_barrier \ > > + "0: lr.d %0, %2\n" \ > > I was going to ask why this is lr.d rather than lr.w. But I see Jan > already asked. I agree with him that it should probably be a lr.w and > ... > > > + " and %1, %0, %z4\n" \ > > + " or %1, %1, %z3\n" \ > > + " sc.d %1, %1, %2\n" \ > > ... respectively sc.w. The same applies for cmpxchg. I agree that it would be better, and my initial attempt was to handle 4-byte or 8-byte boundary crossing during 2-byte access: 0 1 2 3 4 5 6 7 8 X X X 1 1 X X X X In this case, if I align address 3 to address 0 and then read 4 bytes instead of 8 bytes, I will not process the byte at address 4. This was the reason why I started to read 8 bytes. I also acknowledge that there could be an issue in the following case: X 4094 4095 4096 X 1 1 X In this situation, when I read 8 bytes, there could be an issue where the new page (which starts at 4096) will not be mapped. It seems correct in this case to check that variable is within one page and read 4 bytes instead of 8. One more thing I am uncertain about is if we change everything to read 4 bytes with 4-byte alignment, what should be done with the first case? Should we panic? (I am not sure if this is an option.) Should we perform the operation twice for addresses 0x0 and 0x4? ~ Oleksii
On 19.02.2024 15:00, Oleksii wrote: > On Sun, 2024-02-18 at 19:00 +0000, Julien Grall wrote: >> On 05/02/2024 15:32, Oleksii Kurochko wrote: >>> --- /dev/null >>> +++ b/xen/arch/riscv/include/asm/cmpxchg.h >>> @@ -0,0 +1,237 @@ >>> +/* SPDX-License-Identifier: GPL-2.0-only */ >>> +/* Copyright (C) 2014 Regents of the University of California */ >>> + >>> +#ifndef _ASM_RISCV_CMPXCHG_H >>> +#define _ASM_RISCV_CMPXCHG_H >>> + >>> +#include <xen/compiler.h> >>> +#include <xen/lib.h> >>> + >>> +#include <asm/fence.h> >>> +#include <asm/io.h> >>> +#include <asm/system.h> >>> + >>> +#define ALIGN_DOWN(addr, size) ((addr) & (~((size) - 1))) >>> + >>> +#define __amoswap_generic(ptr, new, ret, sfx, release_barrier, >>> acquire_barrier) \ >>> +({ \ >>> + asm volatile( \ >>> + release_barrier \ >>> + " amoswap" sfx " %0, %2, %1\n" \ >>> + acquire_barrier \ >>> + : "=r" (ret), "+A" (*ptr) \ >>> + : "r" (new) \ >>> + : "memory" ); \ >>> +}) >>> + >>> +#define emulate_xchg_1_2(ptr, new, ret, release_barrier, >>> acquire_barrier) \ >>> +({ \ >>> + uint32_t *ptr_32b_aligned = (uint32_t *)ALIGN_DOWN((unsigned >>> long)ptr, 4); \ >>> + uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - sizeof(*ptr))) >>> * BITS_PER_BYTE; \ >>> + uint8_t mask_size = sizeof(*ptr) * BITS_PER_BYTE; \ >>> + uint8_t mask_h = mask_l + mask_size - 1; \ >>> + unsigned long mask = GENMASK(mask_h, mask_l); \ >>> + unsigned long new_ = (unsigned long)(new) << mask_l; \ >>> + unsigned long ret_; \ >>> + unsigned long rc; \ >>> + \ >>> + asm volatile( \ >>> + release_barrier \ >>> + "0: lr.d %0, %2\n" \ >> >> I was going to ask why this is lr.d rather than lr.w. But I see Jan >> already asked. I agree with him that it should probably be a lr.w and >> ... >> >>> + " and %1, %0, %z4\n" \ >>> + " or %1, %1, %z3\n" \ >>> + " sc.d %1, %1, %2\n" \ >> >> ... respectively sc.w. The same applies for cmpxchg. > > I agree that it would be better, and my initial attempt was to handle > 4-byte or 8-byte boundary crossing during 2-byte access: > > 0 1 2 3 4 5 6 7 8 > X X X 1 1 X X X X > > In this case, if I align address 3 to address 0 and then read 4 bytes > instead of 8 bytes, I will not process the byte at address 4. This was > the reason why I started to read 8 bytes. > > I also acknowledge that there could be an issue in the following case: > > X 4094 4095 4096 > X 1 1 X > In this situation, when I read 8 bytes, there could be an issue where > the new page (which starts at 4096) will not be mapped. It seems > correct in this case to check that variable is within one page and read > 4 bytes instead of 8. > > One more thing I am uncertain about is if we change everything to read > 4 bytes with 4-byte alignment, what should be done with the first case? > Should we panic? (I am not sure if this is an option.) Counter question (raised elsewhere already): What if a 4-byte access crosses a word / cache line / page boundary? Ideally exactly the same would happen for a 2-byte access crossing a respective boundary. (Which you can achieve relatively easily by masking off only address bit 1, keeping address bit 0 unaltered.) > Should we > perform the operation twice for addresses 0x0 and 0x4? That wouldn't be atomic then. Jan
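One way to express the address computation described here, mirroring the patch's existing "0x8 - sizeof(*ptr)" pattern but against the containing 32-bit word (the exact expression is illustrative):

    /* 1-byte access: clears bits 1..0; 2-byte access: clears only bit 1,
     * so a misaligned 2-byte pointer yields a misaligned lr.w/sc.w and
     * faults just like a native misaligned access would */
    uint32_t *aligned = (uint32_t *)((unsigned long)(ptr) &
                                     ~(0x4UL - sizeof(*(ptr))));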
Hi Oleksii, On 19/02/2024 14:00, Oleksii wrote: > On Sun, 2024-02-18 at 19:00 +0000, Julien Grall wrote: >> >> >> On 05/02/2024 15:32, Oleksii Kurochko wrote: >>> The header was taken from Linux kernl 6.4.0-rc1. >>> >>> Addionally, were updated: >>> * add emulation of {cmp}xchg for 1/2 byte types >> >> This explaination is a little bit light. IIUC, you are implementing >> them >> using 32-bit atomic access. Is that correct? If so, please spell it >> out. > Sure, I'll update commit message. > >> >> Also, I wonder whether it would be better to try to get rid of the >> 1/2 >> bytes access. Do you know where they are used? > Right now, the issue is with test_and_clear_bool() which is used in > common/sched/core.c:840 > [https://gitlab.com/xen-project/xen/-/blob/staging/xen/common/sched/core.c?ref_type=heads#L840 > ] > > I don't remember details, but in xen-devel chat someone told me that > grant table requires 1/2 bytes access. Ok :/. This would be part of the ABI then and therefore can't be easily changed. > >> >>> * replace tabs with spaces >> Does this mean you are not planning to backport any Linux fixes? > If it will be any fixes for sure I'll back port them, but it looks like > this code is stable enough and not to many fixes will be done there, so > it shouldn't be hard to backport them and switch to spaces. Fair enough. >>> + " and %1, %0, %z4\n" \ >>> + " or %1, %1, %z3\n" \ >>> + " sc.d %1, %1, %2\n" \ >> >> ... respectively sc.w. The same applies for cmpxchg. > > I agree that it would be better, and my initial attempt was to handle > 4-byte or 8-byte boundary crossing during 2-byte access: > > 0 1 2 3 4 5 6 7 8 > X X X 1 1 X X X X > > In this case, if I align address 3 to address 0 and then read 4 bytes > instead of 8 bytes, I will not process the byte at address 4. This was > the reason why I started to read 8 bytes. At least on Arm, the architecture doesn't support atomic operations if the access is not aligned to its size (this will send a data abort). On some architecture, this is supported but potentially very slow. So all the common code should already use properly aligned address. Therefore, I don't really see the reason to add support for unaligned access. Cheers,
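If the natural-alignment expectation is to be stated explicitly at the top of the emulation macros rather than left implicit, a hypothetical check could look like this (the helper name is invented for illustration):

    /* callers are expected to pass naturally aligned pointers */
    #define ASSERT_NATURALLY_ALIGNED(p) \
        ASSERT(!((unsigned long)(p) & (sizeof(*(p)) - 1)))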
On Mon, 2024-02-19 at 12:22 +0100, Jan Beulich wrote: > On 15.02.2024 14:41, Oleksii wrote: > > > > + : "=r" (ret), "+A" (*ptr) \ > > > > + : "r" (new) \ > > > > + : "memory" ); \ > > > > +}) > > > > + > > > > +#define emulate_xchg_1_2(ptr, new, ret, release_barrier, > > > > acquire_barrier) \ > > > > +({ \ > > > > + uint32_t *ptr_32b_aligned = (uint32_t > > > > *)ALIGN_DOWN((unsigned > > > > long)ptr, 4); \ > > > > > > You now appear to assume that this macro is only used with inputs > > > not > > > crossing word boundaries. That's okay as long as suitably > > > guaranteed > > > at the use sites, but imo wants saying in a comment. > > > > > > > + uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - > > > > sizeof(*ptr))) > > > > * BITS_PER_BYTE; \ > > > > > > Why 0x8 (i.e. spanning 64 bits), not 4 (matching the uint32_t use > > > above)? > > The idea to read 8 bytes was to deal with crossing word boundary. > > So if > > our address is 0x3 and we have to xchg() 2 bytes, what will cross 4 > > byte boundary. Instead we align add 0x3, so it will become 0x0 and > > then > > just always work with 8 bytes. > > Then what if my 2-byte access crosses a dword boundary? A cache line > one? A page one? Everything looks okay to me, except in the case of a page boundary. In the scenario of a dword boundary: 0 1 2 3 4 5 6 7 8 9 ... X X X X X X X 1 1 X Assuming a variable starts at address 7, 4-byte alignment will be enforced, and 8 bytes will be processed starting from address 4. Concerning a cache line, it should still work, with potential performance issues arising only if a part of the variable is cached while another part is not. Regarding page crossing, I acknowledge that it could be problematic if the variable is entirely located at the end of a page, as there is no guarantee that the next page exists. In this case, it would be preferable to consistently read 4 bytes with 4-byte alignment: X 4094 4095 4096? X 1 1 ? However, if the variable spans two pages, proper page mapping should be ensured. It appears sensible to reconsider the macros and implement 4-byte alignment and 4-byte access, but then this is not clear how better to deal with first case ( dword boundary ). Panic ? or use the macros twice for address 4, and address 8? > > > > > + unsigned long new_ = (unsigned long)(new) << mask_l; \ > > > > + unsigned long ret_; \ > > > > + unsigned long rc; \ > > > > > > Similarly, why unsigned long here? > > sizeof(unsigned long) is 8 bytes and it was chosen as we are > > working > > with lc/sc.d which are working with 8 bytes. > > > > > > > > I also wonder about the mix of underscore suffixed (or not) > > > variable > > > names here. > > If the question about ret_, then the same as before size of ret > > argument of the macros will be 1 or 2, but {lc/sc}.d expected to > > work > > with 8 bytes. > > Then what's the uint32_t * about? Agree, then it should be also unsigned long. > > > > > > + __typeof__(*(ptr)) new__ = (new); \ > > > > + __typeof__(*(ptr)) ret__; \ > > > > + switch (size) \ > > > > + { \ > > > > + case 1: \ > > > > + case 2: \ > > > > + emulate_xchg_1_2(ptr__, new__, ret__, release_barrier, > > > > acquire_barrier); \ > > > > > > ... this would become > > > > > > ret__ = emulate_xchg_1_2(ptr__, new__, release_barrier, > > > acquire_barrier); \ > > > > > > But, unlike assumed above, there's no enforcement here that a 2- > > > byte > > > quantity won't cross a word, double-word, cache line, or even > > > page > > > boundary. 
That might be okay if then the code would simply crash > > > (like > > > the AMO insns emitted further down would), but aiui silent > > > misbehavior > > > would result. > > As I mentioned above with 4-byte alignment and then reading and > > working > > with 8-byte then crossing a word or double-word boundary shouldn't > > be > > an issue. > > > > I am not sure that I know how to check that we are crossing cache > > line > > boundary. > > > > Regarding page boundary, if the next page is mapped then all should > > work fine, otherwise it will be an exception. > > Are you sure lr.d / sc.d are happy to access across such a boundary, > when both pages are mapped? If they are mapped, my expectation that lr.d and sc.d should be happy. > > To me it seems pretty clear that for atomic accesses you want to > demand natural alignment, i.e. 2-byte alignment for 2-byte accesses. > This way you can be sure no potentially problematic boundaries will > be crossed. It makes sense, but I am not sure that I can guarantee that a user of macros will always have 2-byte alignment (except during a panic) in the future. Even now, I am uncertain that everyone will be willing to add __alignment(...) to struct vcpu->is_urgent (xen/include/xen/sched.h:218) and other possible cases to accommodate RISC-V requirements. > > > > > + break; \ > > > > + case 4: \ > > > > + __amoswap_generic(ptr__, new__, ret__,\ > > > > + ".w" sfx, release_barrier, > > > > acquire_barrier); \ > > > > + break; \ > > > > + case 8: \ > > > > + __amoswap_generic(ptr__, new__, ret__,\ > > > > + ".d" sfx, release_barrier, > > > > acquire_barrier); \ > > > > + break; \ > > > > + default: \ > > > > + STATIC_ASSERT_UNREACHABLE(); \ > > > > + } \ > > > > + ret__; \ > > > > +}) > > > > + > > > > +#define xchg_relaxed(ptr, x) \ > > > > +({ \ > > > > + __typeof__(*(ptr)) x_ = (x); \ > > > > + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, > > > > sizeof(*(ptr)), > > > > "", "", ""); \ > > > > +}) > > > > + > > > > +#define xchg_acquire(ptr, x) \ > > > > +({ \ > > > > + __typeof__(*(ptr)) x_ = (x); \ > > > > + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, > > > > sizeof(*(ptr)), \ > > > > + "", "", > > > > RISCV_ACQUIRE_BARRIER); \ > > > > +}) > > > > + > > > > +#define xchg_release(ptr, x) \ > > > > +({ \ > > > > + __typeof__(*(ptr)) x_ = (x); \ > > > > + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, > > > > sizeof(*(ptr)),\ > > > > + "", > > > > RISCV_RELEASE_BARRIER, > > > > ""); \ > > > > +}) > > > > + > > > > +#define xchg(ptr,x) \ > > > > +({ \ > > > > + __typeof__(*(ptr)) ret__; \ > > > > + ret__ = (__typeof__(*(ptr))) \ > > > > + __xchg_generic(ptr, (unsigned long)(x), > > > > sizeof(*(ptr)), \ > > > > + ".aqrl", "", ""); \ > > > > > > The .aqrl doesn't look to affect the (emulated) 1- and 2-byte > > > cases. > > > > > > Further, amoswap also exists in release-only and acquire-only > > > forms. > > > Why do you prefer explicit barrier insns over those? (Looks to > > > similarly apply to the emulation path as well as to the cmpxchg > > > machinery then, as both lr and sc also come in all four possible > > > acquire/release forms. Perhaps for the emulation path using > > > explicit barriers is better, in case the acquire/release forms of > > > lr/sc - being used inside the loop - might perform worse.) 
> > As 1- and 2-byte cases are emulated I decided that is not to > > provide > > sfx argument for emulation macros as it will not have to much > > affect on > > emulated types and just consume more performance on acquire and > > release > > version of sc/ld instructions. > > Question is whether the common case (4- and 8-byte accesses) > shouldn't > be valued higher, with 1- and 2-byte emulation being there just to > allow things to not break altogether. If I understand you correctly, it would make sense to add the 'sfx' argument for the 1/2-byte access case, ensuring that all options are available for 1/2-byte access case as well. ~ Oleksii
On 19.02.2024 15:29, Oleksii wrote: > On Mon, 2024-02-19 at 12:22 +0100, Jan Beulich wrote: >> On 15.02.2024 14:41, Oleksii wrote: >>> As I mentioned above with 4-byte alignment and then reading and >>> working >>> with 8-byte then crossing a word or double-word boundary shouldn't >>> be >>> an issue. >>> >>> I am not sure that I know how to check that we are crossing cache >>> line >>> boundary. >>> >>> Regarding page boundary, if the next page is mapped then all should >>> work fine, otherwise it will be an exception. >> >> Are you sure lr.d / sc.d are happy to access across such a boundary, >> when both pages are mapped? > If they are mapped, my expectation that lr.d and sc.d should be happy. How does this expectation of yours fit with the A extension doc having this: "For LR and SC, the A extension requires that the address held in rs1 be naturally aligned to the size of the operand (i.e., eight-byte aligned for 64-bit words and four-byte aligned for 32-bit words). If the address is not naturally aligned, an address-misaligned exception or an access-fault exception will be generated." It doesn't even say "may"; it says "will". >> To me it seems pretty clear that for atomic accesses you want to >> demand natural alignment, i.e. 2-byte alignment for 2-byte accesses. >> This way you can be sure no potentially problematic boundaries will >> be crossed. > It makes sense, but I am not sure that I can guarantee that a user of > macros will always have 2-byte alignment (except during a panic) in the > future. > > Even now, I am uncertain that everyone will be willing to add > __alignment(...) to struct vcpu->is_urgent > (xen/include/xen/sched.h:218) and other possible cases to accommodate > RISC-V requirements. ->is_urgent is bool, i.e. 1 byte and hence okay at any address. For all normal variables and fields the compiler will guarantee suitable (natural) alignment. What you prohibit by requiring aligned items is use of fields of e.g. packed structures. >>> As 1- and 2-byte cases are emulated I decided that is not to >>> provide >>> sfx argument for emulation macros as it will not have to much >>> affect on >>> emulated types and just consume more performance on acquire and >>> release >>> version of sc/ld instructions. >> >> Question is whether the common case (4- and 8-byte accesses) >> shouldn't >> be valued higher, with 1- and 2-byte emulation being there just to >> allow things to not break altogether. > If I understand you correctly, it would make sense to add the 'sfx' > argument for the 1/2-byte access case, ensuring that all options are > available for 1/2-byte access case as well. That's one of the possibilities. As said, I'm not overly worried about the emulated cases. For the initial implementation I'd recommend going with what is easiest there, yielding the best possible result for the 4- and 8-byte cases. If later it turns out repeated acquire/release accesses are a problem in the emulation loop, things can be changed to explicit barriers, without touching the 4- and 8-byte cases. Jan
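For the 4- and 8-byte paths, the AMO ordering bits mentioned earlier in the thread could carry the acquire/release semantics instead of explicit fence instructions; a sketch of that shape (the .aq/.rl/.aqrl suffixes are standard A-extension spellings, the macro itself is illustrative):

    /* ordering is encoded in the AMO itself, no separate fences */
    #define __amoswap_generic(ptr, new, ret, sfx) \
    ({ \
        asm volatile ( \
            " amoswap" sfx " %0, %2, %1" \
            : "=r" (ret), "+A" (*(ptr)) \
            : "r" (new) \
            : "memory" ); \
    })

    /* e.g. ".w.aqrl" for xchg(), ".w.aq" for xchg_acquire(),
       ".w.rl" for xchg_release(), ".w" for xchg_relaxed() */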
On Mon, 2024-02-19 at 15:12 +0100, Jan Beulich wrote: > On 19.02.2024 15:00, Oleksii wrote: > > On Sun, 2024-02-18 at 19:00 +0000, Julien Grall wrote: > > > On 05/02/2024 15:32, Oleksii Kurochko wrote: > > > > --- /dev/null > > > > +++ b/xen/arch/riscv/include/asm/cmpxchg.h > > > > @@ -0,0 +1,237 @@ > > > > +/* SPDX-License-Identifier: GPL-2.0-only */ > > > > +/* Copyright (C) 2014 Regents of the University of California > > > > */ > > > > + > > > > +#ifndef _ASM_RISCV_CMPXCHG_H > > > > +#define _ASM_RISCV_CMPXCHG_H > > > > + > > > > +#include <xen/compiler.h> > > > > +#include <xen/lib.h> > > > > + > > > > +#include <asm/fence.h> > > > > +#include <asm/io.h> > > > > +#include <asm/system.h> > > > > + > > > > +#define ALIGN_DOWN(addr, size) ((addr) & (~((size) - 1))) > > > > + > > > > +#define __amoswap_generic(ptr, new, ret, sfx, release_barrier, > > > > acquire_barrier) \ > > > > +({ \ > > > > + asm volatile( \ > > > > + release_barrier \ > > > > + " amoswap" sfx " %0, %2, %1\n" \ > > > > + acquire_barrier \ > > > > + : "=r" (ret), "+A" (*ptr) \ > > > > + : "r" (new) \ > > > > + : "memory" ); \ > > > > +}) > > > > + > > > > +#define emulate_xchg_1_2(ptr, new, ret, release_barrier, > > > > acquire_barrier) \ > > > > +({ \ > > > > + uint32_t *ptr_32b_aligned = (uint32_t > > > > *)ALIGN_DOWN((unsigned > > > > long)ptr, 4); \ > > > > + uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - > > > > sizeof(*ptr))) > > > > * BITS_PER_BYTE; \ > > > > + uint8_t mask_size = sizeof(*ptr) * BITS_PER_BYTE; \ > > > > + uint8_t mask_h = mask_l + mask_size - 1; \ > > > > + unsigned long mask = GENMASK(mask_h, mask_l); \ > > > > + unsigned long new_ = (unsigned long)(new) << mask_l; \ > > > > + unsigned long ret_; \ > > > > + unsigned long rc; \ > > > > + \ > > > > + asm volatile( \ > > > > + release_barrier \ > > > > + "0: lr.d %0, %2\n" \ > > > > > > I was going to ask why this is lr.d rather than lr.w. But I see > > > Jan > > > already asked. I agree with him that it should probably be a lr.w > > > and > > > ... > > > > > > > + " and %1, %0, %z4\n" \ > > > > + " or %1, %1, %z3\n" \ > > > > + " sc.d %1, %1, %2\n" \ > > > > > > ... respectively sc.w. The same applies for cmpxchg. > > > > I agree that it would be better, and my initial attempt was to > > handle > > 4-byte or 8-byte boundary crossing during 2-byte access: > > > > 0 1 2 3 4 5 6 7 8 > > X X X 1 1 X X X X > > > > In this case, if I align address 3 to address 0 and then read 4 > > bytes > > instead of 8 bytes, I will not process the byte at address 4. This > > was > > the reason why I started to read 8 bytes. > > > > I also acknowledge that there could be an issue in the following > > case: > > > > X 4094 4095 4096 > > X 1 1 X > > In this situation, when I read 8 bytes, there could be an issue > > where > > the new page (which starts at 4096) will not be mapped. It seems > > correct in this case to check that variable is within one page and > > read > > 4 bytes instead of 8. > > > > One more thing I am uncertain about is if we change everything to > > read > > 4 bytes with 4-byte alignment, what should be done with the first > > case? > > Should we panic? (I am not sure if this is an option.) > > Counter question (raised elsewhere already): What if a 4-byte access > crosses a word / cache line / page boundary? Ideally exactly the > same would happen for a 2-byte access crossing a respective boundary. > (Which you can achieve relatively easily by masking off only address > bit 1, keeping address bit 0 unaltered.) 
But if we align down on a 4-byte boundary and then access 4 bytes, we can't cross a boundary. I agree that the algorithm is not correct, as it can ignore some values in certain situations. For example: 0 1 2 3 4 5 6 7 8 X X X 1 1 X X X X In this case, the value at address 4 won't be updated. I agree that introducing a new macro to check if a variable crosses a boundary is necessary or as an option we can check that addr is 2-byte aligned: #define CHECK_BOUNDARY_CROSSING(start, end, boundary_size) ASSERT((start / boundary_size) != (end / boundary_size)) Then, it is necessary to check: CHECK_BOUNDARY_CROSSING(start, end, 4) CHECK_BOUNDARY_CROSSING(start, end, PAGE_SIZE) But why do we need to check the cache line boundary? In the case of the cache, the question will only be about performance, but it should still work, shouldn't it? ~ Oleksii
On 19.02.2024 16:20, Oleksii wrote: > On Mon, 2024-02-19 at 15:12 +0100, Jan Beulich wrote: >> On 19.02.2024 15:00, Oleksii wrote: >>> On Sun, 2024-02-18 at 19:00 +0000, Julien Grall wrote: >>>> On 05/02/2024 15:32, Oleksii Kurochko wrote: >>>>> --- /dev/null >>>>> +++ b/xen/arch/riscv/include/asm/cmpxchg.h >>>>> @@ -0,0 +1,237 @@ >>>>> +/* SPDX-License-Identifier: GPL-2.0-only */ >>>>> +/* Copyright (C) 2014 Regents of the University of California >>>>> */ >>>>> + >>>>> +#ifndef _ASM_RISCV_CMPXCHG_H >>>>> +#define _ASM_RISCV_CMPXCHG_H >>>>> + >>>>> +#include <xen/compiler.h> >>>>> +#include <xen/lib.h> >>>>> + >>>>> +#include <asm/fence.h> >>>>> +#include <asm/io.h> >>>>> +#include <asm/system.h> >>>>> + >>>>> +#define ALIGN_DOWN(addr, size) ((addr) & (~((size) - 1))) >>>>> + >>>>> +#define __amoswap_generic(ptr, new, ret, sfx, release_barrier, >>>>> acquire_barrier) \ >>>>> +({ \ >>>>> + asm volatile( \ >>>>> + release_barrier \ >>>>> + " amoswap" sfx " %0, %2, %1\n" \ >>>>> + acquire_barrier \ >>>>> + : "=r" (ret), "+A" (*ptr) \ >>>>> + : "r" (new) \ >>>>> + : "memory" ); \ >>>>> +}) >>>>> + >>>>> +#define emulate_xchg_1_2(ptr, new, ret, release_barrier, >>>>> acquire_barrier) \ >>>>> +({ \ >>>>> + uint32_t *ptr_32b_aligned = (uint32_t >>>>> *)ALIGN_DOWN((unsigned >>>>> long)ptr, 4); \ >>>>> + uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - >>>>> sizeof(*ptr))) >>>>> * BITS_PER_BYTE; \ >>>>> + uint8_t mask_size = sizeof(*ptr) * BITS_PER_BYTE; \ >>>>> + uint8_t mask_h = mask_l + mask_size - 1; \ >>>>> + unsigned long mask = GENMASK(mask_h, mask_l); \ >>>>> + unsigned long new_ = (unsigned long)(new) << mask_l; \ >>>>> + unsigned long ret_; \ >>>>> + unsigned long rc; \ >>>>> + \ >>>>> + asm volatile( \ >>>>> + release_barrier \ >>>>> + "0: lr.d %0, %2\n" \ >>>> >>>> I was going to ask why this is lr.d rather than lr.w. But I see >>>> Jan >>>> already asked. I agree with him that it should probably be a lr.w >>>> and >>>> ... >>>> >>>>> + " and %1, %0, %z4\n" \ >>>>> + " or %1, %1, %z3\n" \ >>>>> + " sc.d %1, %1, %2\n" \ >>>> >>>> ... respectively sc.w. The same applies for cmpxchg. >>> >>> I agree that it would be better, and my initial attempt was to >>> handle >>> 4-byte or 8-byte boundary crossing during 2-byte access: >>> >>> 0 1 2 3 4 5 6 7 8 >>> X X X 1 1 X X X X >>> >>> In this case, if I align address 3 to address 0 and then read 4 >>> bytes >>> instead of 8 bytes, I will not process the byte at address 4. This >>> was >>> the reason why I started to read 8 bytes. >>> >>> I also acknowledge that there could be an issue in the following >>> case: >>> >>> X 4094 4095 4096 >>> X 1 1 X >>> In this situation, when I read 8 bytes, there could be an issue >>> where >>> the new page (which starts at 4096) will not be mapped. It seems >>> correct in this case to check that variable is within one page and >>> read >>> 4 bytes instead of 8. >>> >>> One more thing I am uncertain about is if we change everything to >>> read >>> 4 bytes with 4-byte alignment, what should be done with the first >>> case? >>> Should we panic? (I am not sure if this is an option.) >> >> Counter question (raised elsewhere already): What if a 4-byte access >> crosses a word / cache line / page boundary? Ideally exactly the >> same would happen for a 2-byte access crossing a respective boundary. >> (Which you can achieve relatively easily by masking off only address >> bit 1, keeping address bit 0 unaltered.) > But if we align down on a 4-byte boundary and then access 4 bytes, we > can't cross a boundary. 
> I agree that the algorithm is not correct, as
> it can ignore some values in certain situations. For example:
>  0 1 2 3 4 5 6 7 8
>  X X X 1 1 X X X X
> In this case, the value at address 4 won't be updated.
>
> I agree that introducing a new macro to check if a variable crosses a
> boundary is necessary or as an option we can check that addr is 2-byte
> aligned:
>
> #define CHECK_BOUNDARY_CROSSING(start, end, boundary_size)
>     ASSERT((start / boundary_size) != (end / boundary_size))
>
> Then, it is necessary to check:
>
> CHECK_BOUNDARY_CROSSING(start, end, 4)
> CHECK_BOUNDARY_CROSSING(start, end, PAGE_SIZE)
>
> But why do we need to check the cache line boundary? In the case of the
> cache, the question will only be about performance, but it should still
> work, shouldn't it?

You don't need to check for any of these boundaries. You can simply
leverage what the hardware does for misaligned accesses. See the various
other replies I've sent - I thought things should have become pretty much
crystal clear by now: For 1-byte accesses you access the containing word,
by clearing the low two bits. For 2-byte accesses you also access the
containing word, by clearing only bit 1 (which then naturally leaves no
bit that needs clearing for the projected [but not necessary] case of
handling a 4-byte access). If the resulting 4-byte access then is still
misaligned, it'll fault just as a non-emulated 4-byte access would. And
you don't need to care about any of the boundaries, not at words, not at
cache lines, and not at pages.

Jan
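To make the bit-clearing above concrete, a minimal sketch (the helper name
is made up) of how the containing-word pointer could be derived; for a
2-byte access only bit 1 is cleared, so an odd address keeps bit 0 set and
the resulting lr.w/sc.w faults exactly as a native misaligned 4-byte
access would:

/*
 * Sketch only: clear bits [1:0] for 1-byte accesses, just bit 1 for 2-byte
 * accesses, so the emulating word access inherits any remaining misalignment.
 */
#define containing_word(ptr)                                             \
    ((uint32_t *)((unsigned long)(ptr) &                                 \
                  ~(unsigned long)(sizeof(*(ptr)) == 1 ? 3 : 2)))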
> > > > > As 1- and 2-byte cases are emulated I decided that is not to > > > > provide > > > > sfx argument for emulation macros as it will not have to much > > > > affect on > > > > emulated types and just consume more performance on acquire and > > > > release > > > > version of sc/ld instructions. > > > > > > Question is whether the common case (4- and 8-byte accesses) > > > shouldn't > > > be valued higher, with 1- and 2-byte emulation being there just > > > to > > > allow things to not break altogether. > > If I understand you correctly, it would make sense to add the 'sfx' > > argument for the 1/2-byte access case, ensuring that all options > > are > > available for 1/2-byte access case as well. > > That's one of the possibilities. As said, I'm not overly worried > about > the emulated cases. For the initial implementation I'd recommend > going > with what is easiest there, yielding the best possible result for the > 4- and 8-byte cases. If later it turns out repeated acquire/release > accesses are a problem in the emulation loop, things can be changed > to explicit barriers, without touching the 4- and 8-byte cases. I am confused then a little bit if emulated case is not an issue. For 4- and 8-byte cases for xchg .aqrl is used, for relaxed and aqcuire version of xchg barries are used. The similar is done for cmpxchg. If something will be needed to change in emulation loop it won't require to change 4- and 8-byte cases. ~ Oleksii
On 23.02.2024 13:23, Oleksii wrote: >> >>>>> As 1- and 2-byte cases are emulated I decided that is not to >>>>> provide >>>>> sfx argument for emulation macros as it will not have to much >>>>> affect on >>>>> emulated types and just consume more performance on acquire and >>>>> release >>>>> version of sc/ld instructions. >>>> >>>> Question is whether the common case (4- and 8-byte accesses) >>>> shouldn't >>>> be valued higher, with 1- and 2-byte emulation being there just >>>> to >>>> allow things to not break altogether. >>> If I understand you correctly, it would make sense to add the 'sfx' >>> argument for the 1/2-byte access case, ensuring that all options >>> are >>> available for 1/2-byte access case as well. >> >> That's one of the possibilities. As said, I'm not overly worried >> about >> the emulated cases. For the initial implementation I'd recommend >> going >> with what is easiest there, yielding the best possible result for the >> 4- and 8-byte cases. If later it turns out repeated acquire/release >> accesses are a problem in the emulation loop, things can be changed >> to explicit barriers, without touching the 4- and 8-byte cases. > I am confused then a little bit if emulated case is not an issue. > > For 4- and 8-byte cases for xchg .aqrl is used, for relaxed and aqcuire > version of xchg barries are used. > > The similar is done for cmpxchg. > > If something will be needed to change in emulation loop it won't > require to change 4- and 8-byte cases. I'm afraid I don't understand your reply. Jan
On Mon, 2024-02-26 at 10:45 +0100, Jan Beulich wrote: > On 23.02.2024 13:23, Oleksii wrote: > > > > > > > > > As 1- and 2-byte cases are emulated I decided that is not > > > > > > to > > > > > > provide > > > > > > sfx argument for emulation macros as it will not have to > > > > > > much > > > > > > affect on > > > > > > emulated types and just consume more performance on acquire > > > > > > and > > > > > > release > > > > > > version of sc/ld instructions. > > > > > > > > > > Question is whether the common case (4- and 8-byte accesses) > > > > > shouldn't > > > > > be valued higher, with 1- and 2-byte emulation being there > > > > > just > > > > > to > > > > > allow things to not break altogether. > > > > If I understand you correctly, it would make sense to add the > > > > 'sfx' > > > > argument for the 1/2-byte access case, ensuring that all > > > > options > > > > are > > > > available for 1/2-byte access case as well. > > > > > > That's one of the possibilities. As said, I'm not overly worried > > > about > > > the emulated cases. For the initial implementation I'd recommend > > > going > > > with what is easiest there, yielding the best possible result for > > > the > > > 4- and 8-byte cases. If later it turns out repeated > > > acquire/release > > > accesses are a problem in the emulation loop, things can be > > > changed > > > to explicit barriers, without touching the 4- and 8-byte cases. > > I am confused then a little bit if emulated case is not an issue. > > > > For 4- and 8-byte cases for xchg .aqrl is used, for relaxed and > > aqcuire > > version of xchg barries are used. > > > > The similar is done for cmpxchg. > > > > If something will be needed to change in emulation loop it won't > > require to change 4- and 8-byte cases. > > I'm afraid I don't understand your reply. IIUC, emulated cases it is implemented correctly in terms of usage barriers. And it also OK not to use sfx for lr/sc instructions and use only barriers. For 4- and 8-byte cases are used sfx + barrier depending on the specific case ( relaxed, acquire, release, generic xchg/cmpxchg ). What also looks to me correct. But you suggested to provide the best possible result for 4- and 8-byte cases. So I don't understand what the best possible result is as the current one usage of __{cmp}xchg_generic for each specific case ( relaxed, acquire, release, generic xchg/cmpxchg ) looks correct to me: xchg -> (..., ".aqrl", "", "") just suffix .aqrl suffix without barriers. xchg_release -> (..., "", RISCV_RELEASE_BARRIER, "" ) use only release barrier xchg_acquire -> (..., "", "", RISCV_ACQUIRE_BARRIER ), only acquire barrier xchg_relaxed -> (..., "", "", "") - no barries, no sfx The similar for cmpxchg(). ~ Oleksii
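For reference, a sketch of what the four 4-byte variants enumerated above
come out as at the instruction level, assuming RISCV_RELEASE_BARRIER and
RISCV_ACQUIRE_BARRIER expand to "fence rw, w" and "fence r, rw" as in
Linux; registers are illustrative only:

    # xchg(): fully ordered via the suffix alone
    amoswap.w.aqrl  t0, a1, (a0)

    # xchg_release(): explicit release barrier, plain AMO
    fence rw, w
    amoswap.w       t0, a1, (a0)

    # xchg_acquire(): plain AMO, explicit acquire barrier
    amoswap.w       t0, a1, (a0)
    fence r, rw

    # xchg_relaxed(): no ordering at all
    amoswap.w       t0, a1, (a0)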
On 26.02.2024 12:18, Oleksii wrote: > On Mon, 2024-02-26 at 10:45 +0100, Jan Beulich wrote: >> On 23.02.2024 13:23, Oleksii wrote: >>>> >>>>>>> As 1- and 2-byte cases are emulated I decided that is not >>>>>>> to >>>>>>> provide >>>>>>> sfx argument for emulation macros as it will not have to >>>>>>> much >>>>>>> affect on >>>>>>> emulated types and just consume more performance on acquire >>>>>>> and >>>>>>> release >>>>>>> version of sc/ld instructions. >>>>>> >>>>>> Question is whether the common case (4- and 8-byte accesses) >>>>>> shouldn't >>>>>> be valued higher, with 1- and 2-byte emulation being there >>>>>> just >>>>>> to >>>>>> allow things to not break altogether. >>>>> If I understand you correctly, it would make sense to add the >>>>> 'sfx' >>>>> argument for the 1/2-byte access case, ensuring that all >>>>> options >>>>> are >>>>> available for 1/2-byte access case as well. >>>> >>>> That's one of the possibilities. As said, I'm not overly worried >>>> about >>>> the emulated cases. For the initial implementation I'd recommend >>>> going >>>> with what is easiest there, yielding the best possible result for >>>> the >>>> 4- and 8-byte cases. If later it turns out repeated >>>> acquire/release >>>> accesses are a problem in the emulation loop, things can be >>>> changed >>>> to explicit barriers, without touching the 4- and 8-byte cases. >>> I am confused then a little bit if emulated case is not an issue. >>> >>> For 4- and 8-byte cases for xchg .aqrl is used, for relaxed and >>> aqcuire >>> version of xchg barries are used. >>> >>> The similar is done for cmpxchg. >>> >>> If something will be needed to change in emulation loop it won't >>> require to change 4- and 8-byte cases. >> >> I'm afraid I don't understand your reply. > IIUC, emulated cases it is implemented correctly in terms of usage > barriers. And it also OK not to use sfx for lr/sc instructions and use > only barriers. > > For 4- and 8-byte cases are used sfx + barrier depending on the > specific case ( relaxed, acquire, release, generic xchg/cmpxchg ). > What also looks to me correct. But you suggested to provide the best > possible result for 4- and 8-byte cases. > > So I don't understand what the best possible result is as the current > one usage of __{cmp}xchg_generic for each specific case ( relaxed, > acquire, release, generic xchg/cmpxchg ) looks correct to me: > xchg -> (..., ".aqrl", "", "") just suffix .aqrl suffix without > barriers. > xchg_release -> (..., "", RISCV_RELEASE_BARRIER, "" ) use only release > barrier > xchg_acquire -> (..., "", "", RISCV_ACQUIRE_BARRIER ), only acquire > barrier > xchg_relaxed -> (..., "", "", "") - no barries, no sfx So first: While explicit barriers are technically okay, I don't follow why you insist on using them when you can achieve the same by suitably tagging the actual insn doing the exchange. Then second: It's somewhat hard for me to see the final effect on the emulation paths without you actually having done the switch. Maybe no special handling is necessary there anymore then. And as said, it may actually be acceptable for the emulation paths to "only" be correct, but not be ideal in terms of performance. After all, if you use the normal 4-byte primitive in there, more (non-explicit) barriers than needed would occur if the involved loop has to take more than one iteration. Which could (but imo doesn't need to be) avoided by using a more relaxed 4-byte primitive there and an explicit barrier outside of the loop. Jan
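A hedged sketch of the alternative outlined above for the emulation path:
keep the loop itself relaxed and pay for ordering once, outside it.
cmpxchg_relaxed() is the 4-byte primitive from this patch; the helper
name, the use of smp_mb(), and ACCESS_ONCE() are assumptions made for
illustration, not part of the series:

static inline uint8_t emulate_xchg_u8(volatile uint8_t *ptr, uint8_t new)
{
    uint32_t *word = (uint32_t *)((unsigned long)ptr & ~3UL);
    unsigned int shift = ((unsigned long)ptr & 3) * 8;
    uint32_t mask = 0xffU << shift;
    uint32_t old, upd;

    smp_mb();                       /* one full barrier before the loop */

    do {
        old = ACCESS_ONCE(*word);
        upd = (old & ~mask) | ((uint32_t)new << shift);
    } while ( cmpxchg_relaxed(word, old, upd) != old );

    smp_mb();                       /* one full barrier after the loop */

    return (old & mask) >> shift;
}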
On Mon, 2024-02-26 at 12:28 +0100, Jan Beulich wrote: > On 26.02.2024 12:18, Oleksii wrote: > > On Mon, 2024-02-26 at 10:45 +0100, Jan Beulich wrote: > > > On 23.02.2024 13:23, Oleksii wrote: > > > > > > > > > > > > > As 1- and 2-byte cases are emulated I decided that is > > > > > > > > not > > > > > > > > to > > > > > > > > provide > > > > > > > > sfx argument for emulation macros as it will not have > > > > > > > > to > > > > > > > > much > > > > > > > > affect on > > > > > > > > emulated types and just consume more performance on > > > > > > > > acquire > > > > > > > > and > > > > > > > > release > > > > > > > > version of sc/ld instructions. > > > > > > > > > > > > > > Question is whether the common case (4- and 8-byte > > > > > > > accesses) > > > > > > > shouldn't > > > > > > > be valued higher, with 1- and 2-byte emulation being > > > > > > > there > > > > > > > just > > > > > > > to > > > > > > > allow things to not break altogether. > > > > > > If I understand you correctly, it would make sense to add > > > > > > the > > > > > > 'sfx' > > > > > > argument for the 1/2-byte access case, ensuring that all > > > > > > options > > > > > > are > > > > > > available for 1/2-byte access case as well. > > > > > > > > > > That's one of the possibilities. As said, I'm not overly > > > > > worried > > > > > about > > > > > the emulated cases. For the initial implementation I'd > > > > > recommend > > > > > going > > > > > with what is easiest there, yielding the best possible result > > > > > for > > > > > the > > > > > 4- and 8-byte cases. If later it turns out repeated > > > > > acquire/release > > > > > accesses are a problem in the emulation loop, things can be > > > > > changed > > > > > to explicit barriers, without touching the 4- and 8-byte > > > > > cases. > > > > I am confused then a little bit if emulated case is not an > > > > issue. > > > > > > > > For 4- and 8-byte cases for xchg .aqrl is used, for relaxed and > > > > aqcuire > > > > version of xchg barries are used. > > > > > > > > The similar is done for cmpxchg. > > > > > > > > If something will be needed to change in emulation loop it > > > > won't > > > > require to change 4- and 8-byte cases. > > > > > > I'm afraid I don't understand your reply. > > IIUC, emulated cases it is implemented correctly in terms of usage > > barriers. And it also OK not to use sfx for lr/sc instructions and > > use > > only barriers. > > > > For 4- and 8-byte cases are used sfx + barrier depending on the > > specific case ( relaxed, acquire, release, generic xchg/cmpxchg ). > > What also looks to me correct. But you suggested to provide the > > best > > possible result for 4- and 8-byte cases. > > > > So I don't understand what the best possible result is as the > > current > > one usage of __{cmp}xchg_generic for each specific case ( relaxed, > > acquire, release, generic xchg/cmpxchg ) looks correct to me: > > xchg -> (..., ".aqrl", "", "") just suffix .aqrl suffix without > > barriers. > > xchg_release -> (..., "", RISCV_RELEASE_BARRIER, "" ) use only > > release > > barrier > > xchg_acquire -> (..., "", "", RISCV_ACQUIRE_BARRIER ), only acquire > > barrier > > xchg_relaxed -> (..., "", "", "") - no barries, no sfx > > So first: While explicit barriers are technically okay, I don't > follow why > you insist on using them when you can achieve the same by suitably > tagging > the actual insn doing the exchange. 
Then second: It's somewhat hard > for me > to see the final effect on the emulation paths without you actually > having > done the switch. Maybe no special handling is necessary there anymore > then. And as said, it may actually be acceptable for the emulation > paths > to "only" be correct, but not be ideal in terms of performance. After > all, > if you use the normal 4-byte primitive in there, more (non-explicit) > barriers than needed would occur if the involved loop has to take > more > than one iteration. Which could (but imo doesn't need to be) avoided > by > using a more relaxed 4-byte primitive there and an explicit barrier > outside of the loop. According to the spec: Table A.5 ( part of the table only I copied here ) Linux Construct RVWMO Mapping atomic <op> relaxed amo<op>.{w|d} atomic <op> acquire amo<op>.{w|d}.aq atomic <op> release amo<op>.{w|d}.rl atomic <op> amo<op>.{w|d}.aqrl Linux Construct RVWMO LR/SC Mapping atomic <op> relaxed loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop atomic <op> acquire loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez loop atomic <op> release loop: lr.{w|d}; <op>; sc.{w|d}.aqrl∗ ; bnez loop OR fence.tso; loop: lr.{w|d}; <op>; sc.{w|d}∗ ; bnez loop atomic <op> loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez loop The Linux mappings for release operations may seem stronger than necessary, but these mappings are needed to cover some cases in which Linux requires stronger orderings than the more intuitive mappings would provide. In particular, as of the time this text is being written, Linux is actively debating whether to require load-load, load-store, and store-store orderings between accesses in one critical section and accesses in a subsequent critical section in the same hart and protected by the same synchronization object. Not all combinations of FENCE RW,W/FENCE R,RW mappings with aq/rl mappings combine to provide such orderings. There are a few ways around this problem, including: 1. Always use FENCE RW,W/FENCE R,RW, and never use aq/rl. This suffices but is undesir- able, as it defeats the purpose of the aq/rl modifiers. 2. Always use aq/rl, and never use FENCE RW,W/FENCE R,RW. This does not currently work due to the lack of load and store opcodes with aq and rl modifiers. 3. Strengthen the mappings of release operations such that they would enforce sufficient order- ings in the presence of either type of acquire mapping. This is the currently-recommended solution, and the one shown in Table A.5. Based on this it is enough in our case use only suffixed istructions (amo<op>.{w|d}{.aq, .rl, .aqrl, .aqrl }, lr.{w|d}.{.aq, .aqrl }. But as far as I understand in Linux atomics were strengthen with fences: Atomics present the same issue with locking: release and acquire variants need to be strengthened to meet the constraints defined by the Linux-kernel memory consistency model [1]. Atomics present a further issue: implementations of atomics such as atomic_cmpxchg() and atomic_add_unless() rely on LR/SC pairs, which do not give full-ordering with .aqrl; for example, current implementations allow the "lr-sc-aqrl-pair-vs-full-barrier" test below to end up with the state indicated in the "exists" clause. In order to "synchronize" LKMM and RISC-V's implementation, this commit strengthens the implementations of the atomics operations by replacing .rl and .aq with the use of ("lightweigth") fences, and by replacing .aqrl LR/SC pairs in sequences such as: 0: lr.w.aqrl %0, %addr bne %0, %old, 1f ... 
sc.w.aqrl %1, %new, %addr bnez %1, 0b 1: with sequences of the form: 0: lr.w %0, %addr bne %0, %old, 1f ... sc.w.rl %1, %new, %addr /* SC-release */ bnez %1, 0b fence rw, rw /* "full" fence */ 1: following Daniel's suggestion. These modifications were validated with simulation of the RISC-V with sequences of the form: 0: lr.w %0, %addr bne %0, %old, 1f ... sc.w.rl %1, %new, %addr /* SC-release */ bnez %1, 0b fence rw, rw /* "full" fence */ 1: following Daniel's suggestion. These modifications were validated with simulation of the RISC-V memory consistency model. C lr-sc-aqrl-pair-vs-full-barrier {} P0(int *x, int *y, atomic_t *u) { int r0; int r1; WRITE_ONCE(*x, 1); r0 = atomic_cmpxchg(u, 0, 1); r1 = READ_ONCE(*y); } P1(int *x, int *y, atomic_t *v) { int r0; int r1; WRITE_ONCE(*y, 1); r0 = atomic_cmpxchg(v, 0, 1); r1 = READ_ONCE(*x); } exists (u=1 /\ v=1 /\ 0:r1=0 /\ 1:r1=0) [1] https://marc.info/?l=linux-kernel&m=151930201102853&w=2 https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/hKywNHBkAXM https://marc.info/?l=linux-kernel&m=151633436614259&w=2 Thereby Linux kernel implementation seems to me more safe and it is a reason why I want/wanted to be aligned with it. ~ Oleksii
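Expressed in the style of the header under review, the strengthened LR/SC
mapping quoted above (SC-release plus a trailing full fence) would look
roughly as below for a 4-byte cmpxchg; this is a sketch of the quoted
Linux pattern, not what the patch currently generates:

static inline uint32_t cmpxchg_u32_fullfence(volatile uint32_t *ptr,
                                             uint32_t old, uint32_t new)
{
    uint32_t ret, rc;

    asm volatile (
        "0: lr.w    %0, %2\n"
        "   bne     %0, %z3, 1f\n"
        "   sc.w.rl %1, %z4, %2\n"   /* SC-release */
        "   bnez    %1, 0b\n"
        "   fence   rw, rw\n"        /* trailing full fence */
        "1:\n"
        : "=&r" (ret), "=&r" (rc), "+A" (*ptr)
        : "rJ" (old), "rJ" (new)
        : "memory" );

    return ret;
}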
On 26.02.2024 13:58, Oleksii wrote: > On Mon, 2024-02-26 at 12:28 +0100, Jan Beulich wrote: >> On 26.02.2024 12:18, Oleksii wrote: >>> On Mon, 2024-02-26 at 10:45 +0100, Jan Beulich wrote: >>>> On 23.02.2024 13:23, Oleksii wrote: >>>>>> >>>>>>>>> As 1- and 2-byte cases are emulated I decided that is >>>>>>>>> not >>>>>>>>> to >>>>>>>>> provide >>>>>>>>> sfx argument for emulation macros as it will not have >>>>>>>>> to >>>>>>>>> much >>>>>>>>> affect on >>>>>>>>> emulated types and just consume more performance on >>>>>>>>> acquire >>>>>>>>> and >>>>>>>>> release >>>>>>>>> version of sc/ld instructions. >>>>>>>> >>>>>>>> Question is whether the common case (4- and 8-byte >>>>>>>> accesses) >>>>>>>> shouldn't >>>>>>>> be valued higher, with 1- and 2-byte emulation being >>>>>>>> there >>>>>>>> just >>>>>>>> to >>>>>>>> allow things to not break altogether. >>>>>>> If I understand you correctly, it would make sense to add >>>>>>> the >>>>>>> 'sfx' >>>>>>> argument for the 1/2-byte access case, ensuring that all >>>>>>> options >>>>>>> are >>>>>>> available for 1/2-byte access case as well. >>>>>> >>>>>> That's one of the possibilities. As said, I'm not overly >>>>>> worried >>>>>> about >>>>>> the emulated cases. For the initial implementation I'd >>>>>> recommend >>>>>> going >>>>>> with what is easiest there, yielding the best possible result >>>>>> for >>>>>> the >>>>>> 4- and 8-byte cases. If later it turns out repeated >>>>>> acquire/release >>>>>> accesses are a problem in the emulation loop, things can be >>>>>> changed >>>>>> to explicit barriers, without touching the 4- and 8-byte >>>>>> cases. >>>>> I am confused then a little bit if emulated case is not an >>>>> issue. >>>>> >>>>> For 4- and 8-byte cases for xchg .aqrl is used, for relaxed and >>>>> aqcuire >>>>> version of xchg barries are used. >>>>> >>>>> The similar is done for cmpxchg. >>>>> >>>>> If something will be needed to change in emulation loop it >>>>> won't >>>>> require to change 4- and 8-byte cases. >>>> >>>> I'm afraid I don't understand your reply. >>> IIUC, emulated cases it is implemented correctly in terms of usage >>> barriers. And it also OK not to use sfx for lr/sc instructions and >>> use >>> only barriers. >>> >>> For 4- and 8-byte cases are used sfx + barrier depending on the >>> specific case ( relaxed, acquire, release, generic xchg/cmpxchg ). >>> What also looks to me correct. But you suggested to provide the >>> best >>> possible result for 4- and 8-byte cases. >>> >>> So I don't understand what the best possible result is as the >>> current >>> one usage of __{cmp}xchg_generic for each specific case ( relaxed, >>> acquire, release, generic xchg/cmpxchg ) looks correct to me: >>> xchg -> (..., ".aqrl", "", "") just suffix .aqrl suffix without >>> barriers. >>> xchg_release -> (..., "", RISCV_RELEASE_BARRIER, "" ) use only >>> release >>> barrier >>> xchg_acquire -> (..., "", "", RISCV_ACQUIRE_BARRIER ), only acquire >>> barrier >>> xchg_relaxed -> (..., "", "", "") - no barries, no sfx >> >> So first: While explicit barriers are technically okay, I don't >> follow why >> you insist on using them when you can achieve the same by suitably >> tagging >> the actual insn doing the exchange. Then second: It's somewhat hard >> for me >> to see the final effect on the emulation paths without you actually >> having >> done the switch. Maybe no special handling is necessary there anymore >> then. 
And as said, it may actually be acceptable for the emulation >> paths >> to "only" be correct, but not be ideal in terms of performance. After >> all, >> if you use the normal 4-byte primitive in there, more (non-explicit) >> barriers than needed would occur if the involved loop has to take >> more >> than one iteration. Which could (but imo doesn't need to be) avoided >> by >> using a more relaxed 4-byte primitive there and an explicit barrier >> outside of the loop. > > According to the spec: > Table A.5 ( part of the table only I copied here ) > > Linux Construct RVWMO Mapping > atomic <op> relaxed amo<op>.{w|d} > atomic <op> acquire amo<op>.{w|d}.aq > atomic <op> release amo<op>.{w|d}.rl > atomic <op> amo<op>.{w|d}.aqrl > > Linux Construct RVWMO LR/SC Mapping > atomic <op> relaxed loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop > atomic <op> acquire loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez loop > atomic <op> release loop: lr.{w|d}; <op>; sc.{w|d}.aqrl∗ ; bnez > loop OR > fence.tso; loop: lr.{w|d}; <op>; sc.{w|d}∗ ; > bnez loop > atomic <op> loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez > loop In your consideration what to implement you will want to limit things to constructs we actually use. I can't find any use of the relaxed, acquire, or release forms of atomics as mentioned above. > The Linux mappings for release operations may seem stronger than > necessary, but these mappings > are needed to cover some cases in which Linux requires stronger > orderings than the more intuitive > mappings would provide. In particular, as of the time this text is > being written, Linux is actively > debating whether to require load-load, load-store, and store-store > orderings between accesses in one > critical section and accesses in a subsequent critical section in the > same hart and protected by the > same synchronization object. Not all combinations of FENCE RW,W/FENCE > R,RW mappings > with aq/rl mappings combine to provide such orderings. There are a few > ways around this problem, > including: > 1. Always use FENCE RW,W/FENCE R,RW, and never use aq/rl. This suffices > but is undesir- > able, as it defeats the purpose of the aq/rl modifiers. > 2. Always use aq/rl, and never use FENCE RW,W/FENCE R,RW. This does not > currently work > due to the lack of load and store opcodes with aq and rl modifiers. I don't understand this point: Which specific load and/or store forms are missing? According to my reading of the A extension spec all combination of aq/rl exist with both lr and sc. > 3. Strengthen the mappings of release operations such that they would > enforce sufficient order- > ings in the presence of either type of acquire mapping. This is the > currently-recommended > solution, and the one shown in Table A.5. > > > Based on this it is enough in our case use only suffixed istructions > (amo<op>.{w|d}{.aq, .rl, .aqrl, .aqrl }, lr.{w|d}.{.aq, .aqrl }. > > > But as far as I understand in Linux atomics were strengthen with > fences: > Atomics present the same issue with locking: release and acquire > variants need to be strengthened to meet the constraints defined > by the Linux-kernel memory consistency model [1]. > > Atomics present a further issue: implementations of atomics such > as atomic_cmpxchg() and atomic_add_unless() rely on LR/SC pairs, > which do not give full-ordering with .aqrl; for example, current > implementations allow the "lr-sc-aqrl-pair-vs-full-barrier" test > below to end up with the state indicated in the "exists" clause. 
> > In order to "synchronize" LKMM and RISC-V's implementation, this > commit strengthens the implementations of the atomics operations > by replacing .rl and .aq with the use of ("lightweigth") fences, > and by replacing .aqrl LR/SC pairs in sequences such as: > > 0: lr.w.aqrl %0, %addr > bne %0, %old, 1f > ... > sc.w.aqrl %1, %new, %addr > bnez %1, 0b > 1: > > with sequences of the form: > > 0: lr.w %0, %addr > bne %0, %old, 1f > ... > sc.w.rl %1, %new, %addr /* SC-release */ > bnez %1, 0b > fence rw, rw /* "full" fence */ > 1: > > following Daniel's suggestion. I'm likely missing something, yet as it looks it does help that the code fragment above appears to be ... > These modifications were validated with simulation of the RISC-V > with sequences of the form: > > 0: lr.w %0, %addr > bne %0, %old, 1f > ... > sc.w.rl %1, %new, %addr /* SC-release */ > bnez %1, 0b > fence rw, rw /* "full" fence */ > 1: > > following Daniel's suggestion. ... entirely the same as this one. Yet there's presumably a reason for quoting it twice? > These modifications were validated with simulation of the RISC-V > memory consistency model. > > C lr-sc-aqrl-pair-vs-full-barrier > > {} > > P0(int *x, int *y, atomic_t *u) > { > int r0; > int r1; > > WRITE_ONCE(*x, 1); > r0 = atomic_cmpxchg(u, 0, 1); > r1 = READ_ONCE(*y); > } > > P1(int *x, int *y, atomic_t *v) > { > int r0; > int r1; > > WRITE_ONCE(*y, 1); > r0 = atomic_cmpxchg(v, 0, 1); > r1 = READ_ONCE(*x); > } > > exists (u=1 /\ v=1 /\ 0:r1=0 /\ 1:r1=0) > > [1] https://marc.info/?l=linux-kernel&m=151930201102853&w=2 > > https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/hKywNHBkAXM > https://marc.info/?l=linux-kernel&m=151633436614259&w=2 > > > Thereby Linux kernel implementation seems to me more safe and it is a > reason why I want/wanted to be aligned with it. Which may end up being okay. I hope you realize though that there's a lot more explanation needed in the respective commits then compared to what you've had so far. As a minimum, absolutely anything remotely unexpected needs to be explained. Jan
On Mon, 2024-02-26 at 15:20 +0100, Jan Beulich wrote: > On 26.02.2024 13:58, Oleksii wrote: > > On Mon, 2024-02-26 at 12:28 +0100, Jan Beulich wrote: > > > On 26.02.2024 12:18, Oleksii wrote: > > > > On Mon, 2024-02-26 at 10:45 +0100, Jan Beulich wrote: > > > > > On 23.02.2024 13:23, Oleksii wrote: > > > > > > > > > > > > > > > > > As 1- and 2-byte cases are emulated I decided that > > > > > > > > > > is > > > > > > > > > > not > > > > > > > > > > to > > > > > > > > > > provide > > > > > > > > > > sfx argument for emulation macros as it will not > > > > > > > > > > have > > > > > > > > > > to > > > > > > > > > > much > > > > > > > > > > affect on > > > > > > > > > > emulated types and just consume more performance on > > > > > > > > > > acquire > > > > > > > > > > and > > > > > > > > > > release > > > > > > > > > > version of sc/ld instructions. > > > > > > > > > > > > > > > > > > Question is whether the common case (4- and 8-byte > > > > > > > > > accesses) > > > > > > > > > shouldn't > > > > > > > > > be valued higher, with 1- and 2-byte emulation being > > > > > > > > > there > > > > > > > > > just > > > > > > > > > to > > > > > > > > > allow things to not break altogether. > > > > > > > > If I understand you correctly, it would make sense to > > > > > > > > add > > > > > > > > the > > > > > > > > 'sfx' > > > > > > > > argument for the 1/2-byte access case, ensuring that > > > > > > > > all > > > > > > > > options > > > > > > > > are > > > > > > > > available for 1/2-byte access case as well. > > > > > > > > > > > > > > That's one of the possibilities. As said, I'm not overly > > > > > > > worried > > > > > > > about > > > > > > > the emulated cases. For the initial implementation I'd > > > > > > > recommend > > > > > > > going > > > > > > > with what is easiest there, yielding the best possible > > > > > > > result > > > > > > > for > > > > > > > the > > > > > > > 4- and 8-byte cases. If later it turns out repeated > > > > > > > acquire/release > > > > > > > accesses are a problem in the emulation loop, things can > > > > > > > be > > > > > > > changed > > > > > > > to explicit barriers, without touching the 4- and 8-byte > > > > > > > cases. > > > > > > I am confused then a little bit if emulated case is not an > > > > > > issue. > > > > > > > > > > > > For 4- and 8-byte cases for xchg .aqrl is used, for relaxed > > > > > > and > > > > > > aqcuire > > > > > > version of xchg barries are used. > > > > > > > > > > > > The similar is done for cmpxchg. > > > > > > > > > > > > If something will be needed to change in emulation loop it > > > > > > won't > > > > > > require to change 4- and 8-byte cases. > > > > > > > > > > I'm afraid I don't understand your reply. > > > > IIUC, emulated cases it is implemented correctly in terms of > > > > usage > > > > barriers. And it also OK not to use sfx for lr/sc instructions > > > > and > > > > use > > > > only barriers. > > > > > > > > For 4- and 8-byte cases are used sfx + barrier depending on the > > > > specific case ( relaxed, acquire, release, generic xchg/cmpxchg > > > > ). > > > > What also looks to me correct. But you suggested to provide the > > > > best > > > > possible result for 4- and 8-byte cases. > > > > > > > > So I don't understand what the best possible result is as the > > > > current > > > > one usage of __{cmp}xchg_generic for each specific case ( > > > > relaxed, > > > > acquire, release, generic xchg/cmpxchg ) looks correct to me: > > > > xchg -> (..., ".aqrl", "", "") just suffix .aqrl suffix without > > > > barriers. 
> > > > xchg_release -> (..., "", RISCV_RELEASE_BARRIER, "" ) use only > > > > release > > > > barrier > > > > xchg_acquire -> (..., "", "", RISCV_ACQUIRE_BARRIER ), only > > > > acquire > > > > barrier > > > > xchg_relaxed -> (..., "", "", "") - no barries, no sfx > > > > > > So first: While explicit barriers are technically okay, I don't > > > follow why > > > you insist on using them when you can achieve the same by > > > suitably > > > tagging > > > the actual insn doing the exchange. Then second: It's somewhat > > > hard > > > for me > > > to see the final effect on the emulation paths without you > > > actually > > > having > > > done the switch. Maybe no special handling is necessary there > > > anymore > > > then. And as said, it may actually be acceptable for the > > > emulation > > > paths > > > to "only" be correct, but not be ideal in terms of performance. > > > After > > > all, > > > if you use the normal 4-byte primitive in there, more (non- > > > explicit) > > > barriers than needed would occur if the involved loop has to take > > > more > > > than one iteration. Which could (but imo doesn't need to be) > > > avoided > > > by > > > using a more relaxed 4-byte primitive there and an explicit > > > barrier > > > outside of the loop. > > > > According to the spec: > > Table A.5 ( part of the table only I copied here ) > > > > Linux Construct RVWMO Mapping > > atomic <op> relaxed amo<op>.{w|d} > > atomic <op> acquire amo<op>.{w|d}.aq > > atomic <op> release amo<op>.{w|d}.rl > > atomic <op> amo<op>.{w|d}.aqrl > > > > Linux Construct RVWMO LR/SC Mapping > > atomic <op> relaxed loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop > > atomic <op> acquire loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez > > loop > > atomic <op> release loop: lr.{w|d}; <op>; sc.{w|d}.aqrl∗ ; > > bnez > > loop OR > > fence.tso; loop: lr.{w|d}; <op>; > > sc.{w|d}∗ ; > > bnez loop > > atomic <op> loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; > > bnez > > loop > > In your consideration what to implement you will want to limit > things to constructs we actually use. I can't find any use of the > relaxed, acquire, or release forms of atomics as mentioned above. > > > The Linux mappings for release operations may seem stronger than > > necessary, but these mappings > > are needed to cover some cases in which Linux requires stronger > > orderings than the more intuitive > > mappings would provide. In particular, as of the time this text is > > being written, Linux is actively > > debating whether to require load-load, load-store, and store-store > > orderings between accesses in one > > critical section and accesses in a subsequent critical section in > > the > > same hart and protected by the > > same synchronization object. Not all combinations of FENCE > > RW,W/FENCE > > R,RW mappings > > with aq/rl mappings combine to provide such orderings. There are a > > few > > ways around this problem, > > including: > > 1. Always use FENCE RW,W/FENCE R,RW, and never use aq/rl. This > > suffices > > but is undesir- > > able, as it defeats the purpose of the aq/rl modifiers. > > 2. Always use aq/rl, and never use FENCE RW,W/FENCE R,RW. This does > > not > > currently work > > due to the lack of load and store opcodes with aq and rl modifiers. > > I don't understand this point: Which specific load and/or store forms > are missing? According to my reading of the A extension spec all > combination of aq/rl exist with both lr and sc. I think this is not about lr and sc instructions. 
It is about l{b|h|w|d} and s{b|h|w|d}, which should be used with fence in case of acquire and seq_cst. > > > 3. Strengthen the mappings of release operations such that they > > would > > enforce sufficient order- > > ings in the presence of either type of acquire mapping. This is the > > currently-recommended > > solution, and the one shown in Table A.5. > > > > > > Based on this it is enough in our case use only suffixed > > istructions > > (amo<op>.{w|d}{.aq, .rl, .aqrl, .aqrl }, lr.{w|d}.{.aq, .aqrl }. > > > > > > But as far as I understand in Linux atomics were strengthen with > > fences: > > Atomics present the same issue with locking: release and > > acquire > > variants need to be strengthened to meet the constraints > > defined > > by the Linux-kernel memory consistency model [1]. > > > > Atomics present a further issue: implementations of atomics > > such > > as atomic_cmpxchg() and atomic_add_unless() rely on LR/SC > > pairs, > > which do not give full-ordering with .aqrl; for example, > > current > > implementations allow the "lr-sc-aqrl-pair-vs-full-barrier" > > test > > below to end up with the state indicated in the "exists" > > clause. > > > > In order to "synchronize" LKMM and RISC-V's implementation, > > this > > commit strengthens the implementations of the atomics > > operations > > by replacing .rl and .aq with the use of ("lightweigth") > > fences, > > and by replacing .aqrl LR/SC pairs in sequences such as: > > > > 0: lr.w.aqrl %0, %addr > > bne %0, %old, 1f > > ... > > sc.w.aqrl %1, %new, %addr > > bnez %1, 0b > > 1: > > > > with sequences of the form: > > > > 0: lr.w %0, %addr > > bne %0, %old, 1f > > ... > > sc.w.rl %1, %new, %addr /* SC-release */ > > bnez %1, 0b > > fence rw, rw /* "full" fence */ > > 1: > > > > following Daniel's suggestion. > > I'm likely missing something, yet as it looks it does help that the > code fragment above appears to be ... > > > These modifications were validated with simulation of the RISC- > > V > > with sequences of the form: > > > > 0: lr.w %0, %addr > > bne %0, %old, 1f > > ... > > sc.w.rl %1, %new, %addr /* SC-release */ > > bnez %1, 0b > > fence rw, rw /* "full" fence */ > > 1: > > > > following Daniel's suggestion. > > ... entirely the same as this one. Yet there's presumably a reason > for quoting it twice? I think it was done by accident ~ Oleksii > > > These modifications were validated with simulation of the RISC- > > V > > memory consistency model. > > > > C lr-sc-aqrl-pair-vs-full-barrier > > > > {} > > > > P0(int *x, int *y, atomic_t *u) > > { > > int r0; > > int r1; > > > > WRITE_ONCE(*x, 1); > > r0 = atomic_cmpxchg(u, 0, 1); > > r1 = READ_ONCE(*y); > > } > > > > P1(int *x, int *y, atomic_t *v) > > { > > int r0; > > int r1; > > > > WRITE_ONCE(*y, 1); > > r0 = atomic_cmpxchg(v, 0, 1); > > r1 = READ_ONCE(*x); > > } > > > > exists (u=1 /\ v=1 /\ 0:r1=0 /\ 1:r1=0) > > > > [1] https://marc.info/?l=linux-kernel&m=151930201102853&w=2 > > > > https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/hKywNHBkAXM > > https://marc.info/?l=linux-kernel&m=151633436614259&w=2 > > > > > > Thereby Linux kernel implementation seems to me more safe and it is > > a > > reason why I want/wanted to be aligned with it. > > Which may end up being okay. I hope you realize though that there's a > lot more explanation needed in the respective commits then compared > to > what you've had so far. As a minimum, absolutely anything remotely > unexpected needs to be explained. > > Jan
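To make the point about plain loads and stores concrete: since
l{b|h|w|d}/s{b|h|w|d} have no .aq/.rl forms, an acquire load has to be
built with a trailing fence, roughly as in the sketch below (the helper
name is made up):

static inline uint32_t read_u32_acquire(const volatile uint32_t *ptr)
{
    uint32_t val = *ptr;

    /* RVWMO load-acquire mapping: plain load followed by "fence r, rw". */
    asm volatile ( "fence r, rw" ::: "memory" );

    return val;
}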
diff --git a/xen/arch/riscv/include/asm/cmpxchg.h b/xen/arch/riscv/include/asm/cmpxchg.h new file mode 100644 index 0000000000..b751a50cbf --- /dev/null +++ b/xen/arch/riscv/include/asm/cmpxchg.h @@ -0,0 +1,237 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* Copyright (C) 2014 Regents of the University of California */ + +#ifndef _ASM_RISCV_CMPXCHG_H +#define _ASM_RISCV_CMPXCHG_H + +#include <xen/compiler.h> +#include <xen/lib.h> + +#include <asm/fence.h> +#include <asm/io.h> +#include <asm/system.h> + +#define ALIGN_DOWN(addr, size) ((addr) & (~((size) - 1))) + +#define __amoswap_generic(ptr, new, ret, sfx, release_barrier, acquire_barrier) \ +({ \ + asm volatile( \ + release_barrier \ + " amoswap" sfx " %0, %2, %1\n" \ + acquire_barrier \ + : "=r" (ret), "+A" (*ptr) \ + : "r" (new) \ + : "memory" ); \ +}) + +#define emulate_xchg_1_2(ptr, new, ret, release_barrier, acquire_barrier) \ +({ \ + uint32_t *ptr_32b_aligned = (uint32_t *)ALIGN_DOWN((unsigned long)ptr, 4); \ + uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - sizeof(*ptr))) * BITS_PER_BYTE; \ + uint8_t mask_size = sizeof(*ptr) * BITS_PER_BYTE; \ + uint8_t mask_h = mask_l + mask_size - 1; \ + unsigned long mask = GENMASK(mask_h, mask_l); \ + unsigned long new_ = (unsigned long)(new) << mask_l; \ + unsigned long ret_; \ + unsigned long rc; \ + \ + asm volatile( \ + release_barrier \ + "0: lr.d %0, %2\n" \ + " and %1, %0, %z4\n" \ + " or %1, %1, %z3\n" \ + " sc.d %1, %1, %2\n" \ + " bnez %1, 0b\n" \ + acquire_barrier \ + : "=&r" (ret_), "=&r" (rc), "+A" (*ptr_32b_aligned) \ + : "rJ" (new_), "rJ" (~mask) \ + : "memory"); \ + \ + ret = (__typeof__(*(ptr)))((ret_ & mask) >> mask_l); \ +}) + +#define __xchg_generic(ptr, new, size, sfx, release_barrier, acquire_barrier) \ +({ \ + __typeof__(ptr) ptr__ = (ptr); \ + __typeof__(*(ptr)) new__ = (new); \ + __typeof__(*(ptr)) ret__; \ + switch (size) \ + { \ + case 1: \ + case 2: \ + emulate_xchg_1_2(ptr__, new__, ret__, release_barrier, acquire_barrier); \ + break; \ + case 4: \ + __amoswap_generic(ptr__, new__, ret__,\ + ".w" sfx, release_barrier, acquire_barrier); \ + break; \ + case 8: \ + __amoswap_generic(ptr__, new__, ret__,\ + ".d" sfx, release_barrier, acquire_barrier); \ + break; \ + default: \ + STATIC_ASSERT_UNREACHABLE(); \ + } \ + ret__; \ +}) + +#define xchg_relaxed(ptr, x) \ +({ \ + __typeof__(*(ptr)) x_ = (x); \ + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), "", "", ""); \ +}) + +#define xchg_acquire(ptr, x) \ +({ \ + __typeof__(*(ptr)) x_ = (x); \ + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), \ + "", "", RISCV_ACQUIRE_BARRIER); \ +}) + +#define xchg_release(ptr, x) \ +({ \ + __typeof__(*(ptr)) x_ = (x); \ + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)),\ + "", RISCV_RELEASE_BARRIER, ""); \ +}) + +#define xchg(ptr,x) \ +({ \ + __typeof__(*(ptr)) ret__; \ + ret__ = (__typeof__(*(ptr))) \ + __xchg_generic(ptr, (unsigned long)(x), sizeof(*(ptr)), \ + ".aqrl", "", ""); \ + ret__; \ +}) + +#define __generic_cmpxchg(ptr, old, new, ret, lr_sfx, sc_sfx, release_barrier, acquire_barrier) \ + ({ \ + register unsigned int rc; \ + asm volatile( \ + release_barrier \ + "0: lr" lr_sfx " %0, %2\n" \ + " bne %0, %z3, 1f\n" \ + " sc" sc_sfx " %1, %z4, %2\n" \ + " bnez %1, 0b\n" \ + acquire_barrier \ + "1:\n" \ + : "=&r" (ret), "=&r" (rc), "+A" (*ptr) \ + : "rJ" (old), "rJ" (new) \ + : "memory"); \ + }) + +#define emulate_cmpxchg_1_2(ptr, old, new, ret, sc_sfx, release_barrier, acquire_barrier) \ +({ \ + uint32_t *ptr_32b_aligned = (uint32_t 
*)ALIGN_DOWN((unsigned long)ptr, 4); \ + uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - sizeof(*ptr))) * BITS_PER_BYTE; \ + uint8_t mask_size = sizeof(*ptr) * BITS_PER_BYTE; \ + uint8_t mask_h = mask_l + mask_size - 1; \ + unsigned long mask = GENMASK(mask_h, mask_l); \ + unsigned long old_ = (unsigned long)(old) << mask_l; \ + unsigned long new_ = (unsigned long)(new) << mask_l; \ + unsigned long ret_; \ + unsigned long rc; \ + \ + __asm__ __volatile__ ( \ + release_barrier \ + "0: lr.d %0, %2\n" \ + " and %1, %0, %z5\n" \ + " bne %1, %z3, 1f\n" \ + " and %1, %0, %z6\n" \ + " or %1, %1, %z4\n" \ + " sc.d" sc_sfx " %1, %1, %2\n" \ + " bnez %1, 0b\n" \ + acquire_barrier \ + "1:\n" \ + : "=&r" (ret_), "=&r" (rc), "+A" (*ptr_32b_aligned) \ + : "rJ" (old_), "rJ" (new_), \ + "rJ" (mask), "rJ" (~mask) \ + : "memory"); \ + \ + ret = (__typeof__(*(ptr)))((ret_ & mask) >> mask_l); \ +}) + +/* + * Atomic compare and exchange. Compare OLD with MEM, if identical, + * store NEW in MEM. Return the initial value in MEM. Success is + * indicated by comparing RETURN with OLD. + */ +#define __cmpxchg_generic(ptr, old, new, size, sc_sfx, release_barrier, acquire_barrier) \ +({ \ + __typeof__(ptr) ptr__ = (ptr); \ + __typeof__(*(ptr)) old__ = (__typeof__(*(ptr)))(old); \ + __typeof__(*(ptr)) new__ = (__typeof__(*(ptr)))(new); \ + __typeof__(*(ptr)) ret__; \ + switch (size) \ + { \ + case 1: \ + case 2: \ + emulate_cmpxchg_1_2(ptr, old, new, ret__,\ + sc_sfx, release_barrier, acquire_barrier); \ + break; \ + case 4: \ + __generic_cmpxchg(ptr__, old__, new__, ret__, \ + ".w", ".w"sc_sfx, release_barrier, acquire_barrier); \ + break; \ + case 8: \ + __generic_cmpxchg(ptr__, old__, new__, ret__, \ + ".d", ".d"sc_sfx, release_barrier, acquire_barrier); \ + break; \ + default: \ + STATIC_ASSERT_UNREACHABLE(); \ + } \ + ret__; \ +}) + +#define cmpxchg_relaxed(ptr, o, n) \ +({ \ + __typeof__(*(ptr)) o_ = (o); \ + __typeof__(*(ptr)) n_ = (n); \ + (__typeof__(*(ptr)))__cmpxchg_generic(ptr, \ + o_, n_, sizeof(*(ptr)), "", "", ""); \ +}) + +#define cmpxchg_acquire(ptr, o, n) \ +({ \ + __typeof__(*(ptr)) o_ = (o); \ + __typeof__(*(ptr)) n_ = (n); \ + (__typeof__(*(ptr)))__cmpxchg_generic(ptr, o_, n_, sizeof(*(ptr)), \ + "", "", RISCV_ACQUIRE_BARRIER); \ +}) + +#define cmpxchg_release(ptr, o, n) \ +({ \ + __typeof__(*(ptr)) o_ = (o); \ + __typeof__(*(ptr)) n_ = (n); \ + (__typeof__(*(ptr)))__cmpxchg_release(ptr, o_, n_, sizeof(*(ptr)), \ + "", RISCV_RELEASE_BARRIER, ""); \ +}) + +#define cmpxchg(ptr, o, n) \ +({ \ + __typeof__(*(ptr)) ret__; \ + ret__ = (__typeof__(*(ptr))) \ + __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned long)(n), \ + sizeof(*(ptr)), ".rl", "", " fence rw, rw\n"); \ + ret__; \ +}) + +#define __cmpxchg(ptr, o, n, s) \ +({ \ + __typeof__(*(ptr)) ret__; \ + ret__ = (__typeof__(*(ptr))) \ + __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned long)(n), \ + s, ".rl", "", " fence rw, rw\n"); \ + ret__; \ +}) + +#endif /* _ASM_RISCV_CMPXCHG_H */ + +/* + * Local variables: + * mode: C + * c-file-style: "BSD" + * c-basic-offset: 4 + * indent-tabs-mode: nil + * End: + */
The header was taken from Linux kernel 6.4.0-rc1. Additionally, the
following were updated:
* add emulation of {cmp}xchg for 1/2 byte types
* replace tabs with spaces
* replace __* variables with *__
* introduce generic versions of xchg_* and cmpxchg_*.

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V4:
 - Code style fixes.
 - enforce that __xchg_*() uses the same type for new and *ptr; also "\n"
   was removed at the end of asm instructions.
 - dependency from https://lore.kernel.org/xen-devel/cover.1706259490.git.federico.serafini@bugseng.com/
 - switch from ASSERT_UNREACHABLE to STATIC_ASSERT_UNREACHABLE().
 - drop xchg32(ptr, x) and xchg64(ptr, x) as they aren't used.
 - drop cmpxchg{32,64}_{local} as they aren't used.
 - introduce generic versions of xchg_* and cmpxchg_*.
 - update the commit message.
---
Changes in V3:
 - update the commit message.
 - add emulation of {cmp}xchg_... for 1- and 2-byte types.
---
Changes in V2:
 - update the comment at the top of the header.
 - change xen/lib.h to xen/bug.h.
 - sort inclusion of headers properly.
---
 xen/arch/riscv/include/asm/cmpxchg.h | 237 +++++++++++++++++++++++++++
 1 file changed, 237 insertions(+)
 create mode 100644 xen/arch/riscv/include/asm/cmpxchg.h
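For orientation, a hedged example of how callers would use the interfaces
this header introduces; the variable and the values are purely
illustrative:

#include <asm/cmpxchg.h>

static unsigned long state;

void example(void)
{
    unsigned long prev;

    /* Fully ordered swap: returns the previous value of state. */
    prev = xchg(&state, 1UL);

    /* Fully ordered compare-and-swap: stores 2 only if state is still 1. */
    prev = cmpxchg(&state, 1UL, 2UL);

    (void)prev;
}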