Message ID: 25315ca95baffc9b222fb0ae89375a94b01a9b46.1707146506.git.oleksii.kurochko@gmail.com (mailing list archive)
State: Superseded
Series: Enable build of full Xen for RISC-V
On 05.02.2024 16:32, Oleksii Kurochko wrote: > --- /dev/null > +++ b/xen/arch/riscv/include/asm/cmpxchg.h > @@ -0,0 +1,237 @@ > +/* SPDX-License-Identifier: GPL-2.0-only */ > +/* Copyright (C) 2014 Regents of the University of California */ > + > +#ifndef _ASM_RISCV_CMPXCHG_H > +#define _ASM_RISCV_CMPXCHG_H > + > +#include <xen/compiler.h> > +#include <xen/lib.h> > + > +#include <asm/fence.h> > +#include <asm/io.h> > +#include <asm/system.h> > + > +#define ALIGN_DOWN(addr, size) ((addr) & (~((size) - 1))) This feels risky: Consider what happens when someone passes 2U as 2nd argument. The cheapest adjustment to make would be to use 1UL in the expression. > +#define __amoswap_generic(ptr, new, ret, sfx, release_barrier, acquire_barrier) \ > +({ \ > + asm volatile( \ > + release_barrier \ > + " amoswap" sfx " %0, %2, %1\n" \ While I won't insist, the revision log says \n were dropped from asm() where not needed. A separator is needed here only if ... > + acquire_barrier \ ... this isn't blank. Which imo suggests that the separator should be part of the argument passed in. But yes, one can view this differently, hence why I said I won't insist. As to the naming of the two - I'd generally suggest to make as litte implications as possible: It doesn't really matter here whether it's acquire or release; that matters at the use sites. What matters here is that one is a "pre" barrier and the other is a "post" one. > + : "=r" (ret), "+A" (*ptr) \ > + : "r" (new) \ > + : "memory" ); \ > +}) > + > +#define emulate_xchg_1_2(ptr, new, ret, release_barrier, acquire_barrier) \ > +({ \ > + uint32_t *ptr_32b_aligned = (uint32_t *)ALIGN_DOWN((unsigned long)ptr, 4); \ You now appear to assume that this macro is only used with inputs not crossing word boundaries. That's okay as long as suitably guaranteed at the use sites, but imo wants saying in a comment. > + uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - sizeof(*ptr))) * BITS_PER_BYTE; \ Why 0x8 (i.e. spanning 64 bits), not 4 (matching the uint32_t use above)? > + uint8_t mask_size = sizeof(*ptr) * BITS_PER_BYTE; \ > + uint8_t mask_h = mask_l + mask_size - 1; \ > + unsigned long mask = GENMASK(mask_h, mask_l); \ Personally I find this confusing, naming-wise: GENMASK() takes bit positions as inputs, not masks. (Initially, because of this, I thought the calculations all can't be quite right.) > + unsigned long new_ = (unsigned long)(new) << mask_l; \ > + unsigned long ret_; \ > + unsigned long rc; \ Similarly, why unsigned long here? I also wonder about the mix of underscore suffixed (or not) variable names here. > + \ > + asm volatile( \ Nit: Missing blank before opening parenthesis. > + release_barrier \ > + "0: lr.d %0, %2\n" \ Even here it's an 8-byte access. Even if - didn't check - the insn was okay to use with just a 4-byte aligned pointer, wouldn't it make sense then to 8-byte align it, and be consistent throughout this macro wrt the base unit acted upon? Alternatively, why not use lr.w here, thus reducing possible collisions between multiple CPUs accessing the same cache line? > + " and %1, %0, %z4\n" \ > + " or %1, %1, %z3\n" \ > + " sc.d %1, %1, %2\n" \ > + " bnez %1, 0b\n" \ > + acquire_barrier \ > + : "=&r" (ret_), "=&r" (rc), "+A" (*ptr_32b_aligned) \ > + : "rJ" (new_), "rJ" (~mask) \ I think that as soon as there are more than 2 or maybe 3 operands, legibility is vastly improved by using named asm() operands. > + : "memory"); \ Nit: Missing blank before closing parenthesis. 
> + \ > + ret = (__typeof__(*(ptr)))((ret_ & mask) >> mask_l); \ > +}) Why does "ret" need to be a macro argument? If you had only the expression here, not the the assigment, ... > +#define __xchg_generic(ptr, new, size, sfx, release_barrier, acquire_barrier) \ > +({ \ > + __typeof__(ptr) ptr__ = (ptr); \ Is this local variable really needed? Can't you use "ptr" directly in the three macro invocations? > + __typeof__(*(ptr)) new__ = (new); \ > + __typeof__(*(ptr)) ret__; \ > + switch (size) \ > + { \ > + case 1: \ > + case 2: \ > + emulate_xchg_1_2(ptr__, new__, ret__, release_barrier, acquire_barrier); \ ... this would become ret__ = emulate_xchg_1_2(ptr__, new__, release_barrier, acquire_barrier); \ But, unlike assumed above, there's no enforcement here that a 2-byte quantity won't cross a word, double-word, cache line, or even page boundary. That might be okay if then the code would simply crash (like the AMO insns emitted further down would), but aiui silent misbehavior would result. Also nit: The switch() higher up is (still/again) missing blanks. > + break; \ > + case 4: \ > + __amoswap_generic(ptr__, new__, ret__,\ > + ".w" sfx, release_barrier, acquire_barrier); \ > + break; \ > + case 8: \ > + __amoswap_generic(ptr__, new__, ret__,\ > + ".d" sfx, release_barrier, acquire_barrier); \ > + break; \ > + default: \ > + STATIC_ASSERT_UNREACHABLE(); \ > + } \ > + ret__; \ > +}) > + > +#define xchg_relaxed(ptr, x) \ > +({ \ > + __typeof__(*(ptr)) x_ = (x); \ > + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), "", "", ""); \ > +}) > + > +#define xchg_acquire(ptr, x) \ > +({ \ > + __typeof__(*(ptr)) x_ = (x); \ > + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), \ > + "", "", RISCV_ACQUIRE_BARRIER); \ > +}) > + > +#define xchg_release(ptr, x) \ > +({ \ > + __typeof__(*(ptr)) x_ = (x); \ > + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)),\ > + "", RISCV_RELEASE_BARRIER, ""); \ > +}) > + > +#define xchg(ptr,x) \ > +({ \ > + __typeof__(*(ptr)) ret__; \ > + ret__ = (__typeof__(*(ptr))) \ > + __xchg_generic(ptr, (unsigned long)(x), sizeof(*(ptr)), \ > + ".aqrl", "", ""); \ The .aqrl doesn't look to affect the (emulated) 1- and 2-byte cases. Further, amoswap also exists in release-only and acquire-only forms. Why do you prefer explicit barrier insns over those? (Looks to similarly apply to the emulation path as well as to the cmpxchg machinery then, as both lr and sc also come in all four possible acquire/release forms. Perhaps for the emulation path using explicit barriers is better, in case the acquire/release forms of lr/sc - being used inside the loop - might perform worse.) 
> + ret__; \ > +}) > + > +#define __generic_cmpxchg(ptr, old, new, ret, lr_sfx, sc_sfx, release_barrier, acquire_barrier) \ > + ({ \ > + register unsigned int rc; \ > + asm volatile( \ > + release_barrier \ > + "0: lr" lr_sfx " %0, %2\n" \ > + " bne %0, %z3, 1f\n" \ > + " sc" sc_sfx " %1, %z4, %2\n" \ > + " bnez %1, 0b\n" \ > + acquire_barrier \ > + "1:\n" \ > + : "=&r" (ret), "=&r" (rc), "+A" (*ptr) \ > + : "rJ" (old), "rJ" (new) \ > + : "memory"); \ > + }) > + > +#define emulate_cmpxchg_1_2(ptr, old, new, ret, sc_sfx, release_barrier, acquire_barrier) \ > +({ \ > + uint32_t *ptr_32b_aligned = (uint32_t *)ALIGN_DOWN((unsigned long)ptr, 4); \ > + uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - sizeof(*ptr))) * BITS_PER_BYTE; \ > + uint8_t mask_size = sizeof(*ptr) * BITS_PER_BYTE; \ > + uint8_t mask_h = mask_l + mask_size - 1; \ > + unsigned long mask = GENMASK(mask_h, mask_l); \ > + unsigned long old_ = (unsigned long)(old) << mask_l; \ > + unsigned long new_ = (unsigned long)(new) << mask_l; \ > + unsigned long ret_; \ > + unsigned long rc; \ > + \ > + __asm__ __volatile__ ( \ > + release_barrier \ > + "0: lr.d %0, %2\n" \ > + " and %1, %0, %z5\n" \ > + " bne %1, %z3, 1f\n" \ > + " and %1, %0, %z6\n" \ Isn't this equivalent to " xor %1, %1, %0\n" \ this eliminating one (likely register) input? Furthermore with the above and ... > + " or %1, %1, %z4\n" \ > + " sc.d" sc_sfx " %1, %1, %2\n" \ > + " bnez %1, 0b\n" \ ... this re-written to " xor %0, %1, %0\n" \ " or %0, %0, %z4\n" \ " sc.d" sc_sfx " %0, %0, %2\n" \ " bnez %0, 0b\n" \ you'd then no longer clobber the ret_ & mask you've already calculated in %1, so ... > + acquire_barrier \ > + "1:\n" \ > + : "=&r" (ret_), "=&r" (rc), "+A" (*ptr_32b_aligned) \ > + : "rJ" (old_), "rJ" (new_), \ > + "rJ" (mask), "rJ" (~mask) \ > + : "memory"); \ > + \ > + ret = (__typeof__(*(ptr)))((ret_ & mask) >> mask_l); \ ... you could use rc here. (Of course variable naming or use then may want changing, assuming I understand why "rc" is named the way it is.) > +}) > + > +/* > + * Atomic compare and exchange. Compare OLD with MEM, if identical, > + * store NEW in MEM. Return the initial value in MEM. Success is > + * indicated by comparing RETURN with OLD. 
> + */ > +#define __cmpxchg_generic(ptr, old, new, size, sc_sfx, release_barrier, acquire_barrier) \ > +({ \ > + __typeof__(ptr) ptr__ = (ptr); \ > + __typeof__(*(ptr)) old__ = (__typeof__(*(ptr)))(old); \ > + __typeof__(*(ptr)) new__ = (__typeof__(*(ptr)))(new); \ > + __typeof__(*(ptr)) ret__; \ > + switch (size) \ > + { \ > + case 1: \ > + case 2: \ > + emulate_cmpxchg_1_2(ptr, old, new, ret__,\ > + sc_sfx, release_barrier, acquire_barrier); \ > + break; \ > + case 4: \ > + __generic_cmpxchg(ptr__, old__, new__, ret__, \ > + ".w", ".w"sc_sfx, release_barrier, acquire_barrier); \ > + break; \ > + case 8: \ > + __generic_cmpxchg(ptr__, old__, new__, ret__, \ > + ".d", ".d"sc_sfx, release_barrier, acquire_barrier); \ > + break; \ > + default: \ > + STATIC_ASSERT_UNREACHABLE(); \ > + } \ > + ret__; \ > +}) > + > +#define cmpxchg_relaxed(ptr, o, n) \ > +({ \ > + __typeof__(*(ptr)) o_ = (o); \ > + __typeof__(*(ptr)) n_ = (n); \ > + (__typeof__(*(ptr)))__cmpxchg_generic(ptr, \ > + o_, n_, sizeof(*(ptr)), "", "", ""); \ > +}) > + > +#define cmpxchg_acquire(ptr, o, n) \ > +({ \ > + __typeof__(*(ptr)) o_ = (o); \ > + __typeof__(*(ptr)) n_ = (n); \ > + (__typeof__(*(ptr)))__cmpxchg_generic(ptr, o_, n_, sizeof(*(ptr)), \ > + "", "", RISCV_ACQUIRE_BARRIER); \ > +}) > + > +#define cmpxchg_release(ptr, o, n) \ > +({ \ > + __typeof__(*(ptr)) o_ = (o); \ > + __typeof__(*(ptr)) n_ = (n); \ > + (__typeof__(*(ptr)))__cmpxchg_release(ptr, o_, n_, sizeof(*(ptr)), \ > + "", RISCV_RELEASE_BARRIER, ""); \ > +}) > + > +#define cmpxchg(ptr, o, n) \ > +({ \ > + __typeof__(*(ptr)) ret__; \ > + ret__ = (__typeof__(*(ptr))) \ > + __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned long)(n), \ > + sizeof(*(ptr)), ".rl", "", " fence rw, rw\n"); \ No RISCV_..._BARRIER for use here and ... > + ret__; \ > +}) > + > +#define __cmpxchg(ptr, o, n, s) \ > +({ \ > + __typeof__(*(ptr)) ret__; \ > + ret__ = (__typeof__(*(ptr))) \ > + __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned long)(n), \ > + s, ".rl", "", " fence rw, rw\n"); \ ... here? And anyway, wouldn't it make sense to have #define cmpxchg(ptr, o, n) __cmpxchg(ptr, o, n, sizeof(*(ptr)) to limit redundancy? Plus wouldn't #define __cmpxchg(ptr, o, n, s) \ ((__typeof__(*(ptr))) \ __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned long)(n), \ s, ".rl", "", " fence rw, rw\n")) be shorter and thus easier to follow as well? As I notice only now, this would apparently apply further up as well. Jan
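As an aside, the ALIGN_DOWN concern raised at the top of the review can be seen with a small example; the values below are made up purely to illustrate the unsigned int promotion and are not part of the patch:

    #define ALIGN_DOWN_CURRENT(addr, size) ((addr) & (~((size) - 1)))
    #define ALIGN_DOWN_FIXED(addr, size)   ((addr) & (~((size) - 1UL)))

    /*
     * With size passed as 2U, (size) - 1 is the unsigned int 1U and ~1U is
     * 0xFFFFFFFE; converting that to unsigned long zero-extends it, so the
     * AND clears bits 63..32 of the address.  Computing the mask in
     * unsigned long (the 1UL variant) only clears bit 0, as intended.
     */
    unsigned long addr = 0xFFFFFFC000123456UL;
    /* ALIGN_DOWN_CURRENT(addr, 2U) == 0x0000000000123456UL - upper bits lost */
    /* ALIGN_DOWN_FIXED(addr, 2U)   == 0xFFFFFFC000123456UL - only bit 0 cleared */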
> > + : "=r" (ret), "+A" (*ptr) \ > > + : "r" (new) \ > > + : "memory" ); \ > > +}) > > + > > +#define emulate_xchg_1_2(ptr, new, ret, release_barrier, > > acquire_barrier) \ > > +({ \ > > + uint32_t *ptr_32b_aligned = (uint32_t *)ALIGN_DOWN((unsigned > > long)ptr, 4); \ > > You now appear to assume that this macro is only used with inputs not > crossing word boundaries. That's okay as long as suitably guaranteed > at the use sites, but imo wants saying in a comment. > > > + uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - sizeof(*ptr))) > > * BITS_PER_BYTE; \ > > Why 0x8 (i.e. spanning 64 bits), not 4 (matching the uint32_t use > above)? The idea to read 8 bytes was to deal with crossing word boundary. So if our address is 0x3 and we have to xchg() 2 bytes, what will cross 4 byte boundary. Instead we align add 0x3, so it will become 0x0 and then just always work with 8 bytes. > > > + unsigned long new_ = (unsigned long)(new) << mask_l; \ > > + unsigned long ret_; \ > > + unsigned long rc; \ > > Similarly, why unsigned long here? sizeof(unsigned long) is 8 bytes and it was chosen as we are working with lc/sc.d which are working with 8 bytes. > > I also wonder about the mix of underscore suffixed (or not) variable > names here. If the question about ret_, then the same as before size of ret argument of the macros will be 1 or 2, but {lc/sc}.d expected to work with 8 bytes. > > > + release_barrier \ > > + "0: lr.d %0, %2\n" \ > > Even here it's an 8-byte access. Even if - didn't check - the insn > was > okay to use with just a 4-byte aligned pointer, wouldn't it make > sense > then to 8-byte align it, and be consistent throughout this macro wrt > the base unit acted upon? Alternatively, why not use lr.w here, thus > reducing possible collisions between multiple CPUs accessing the same > cache line? According to the docs: LR and SC operate on naturally-aligned 64-bit (RV64 only) or 32-bit words in memory. Misaligned addresses will generate misaligned address exceptions. My intention was to deal with 4-byte crossing boundary. so if ptr is 4- byte aligned then by reading 8-bytes we shouldn't care about boundary crossing, if I am not missing something. But your opinion about reduction of collisions makes sense also... > > > + " and %1, %0, %z4\n" \ > > + " or %1, %1, %z3\n" \ > > + " sc.d %1, %1, %2\n" \ > > + " bnez %1, 0b\n" \ > > + acquire_barrier \ > > + : "=&r" (ret_), "=&r" (rc), "+A" (*ptr_32b_aligned) \ > > + : "rJ" (new_), "rJ" (~mask) \ > > I think that as soon as there are more than 2 or maybe 3 operands, > legibility is vastly improved by using named asm() operands. Just to clarify you mean that it would be better to use instead of %0 use names? > > > + : "memory"); \ > > Nit: Missing blank before closing parenthesis. > > > + \ > > + ret = (__typeof__(*(ptr)))((ret_ & mask) >> mask_l); \ > > +}) > > Why does "ret" need to be a macro argument? If you had only the > expression here, not the the assigment, ... > > > +#define __xchg_generic(ptr, new, size, sfx, release_barrier, > > acquire_barrier) \ > > +({ \ > > + __typeof__(ptr) ptr__ = (ptr); \ > > Is this local variable really needed? Can't you use "ptr" directly > in the three macro invocations? > > > + __typeof__(*(ptr)) new__ = (new); \ > > + __typeof__(*(ptr)) ret__; \ > > + switch (size) \ > > + { \ > > + case 1: \ > > + case 2: \ > > + emulate_xchg_1_2(ptr__, new__, ret__, release_barrier, > > acquire_barrier); \ > > ... 
this would become > > ret__ = emulate_xchg_1_2(ptr__, new__, release_barrier, > acquire_barrier); \ > > But, unlike assumed above, there's no enforcement here that a 2-byte > quantity won't cross a word, double-word, cache line, or even page > boundary. That might be okay if then the code would simply crash > (like > the AMO insns emitted further down would), but aiui silent > misbehavior > would result. As I mentioned above with 4-byte alignment and then reading and working with 8-byte then crossing a word or double-word boundary shouldn't be an issue. I am not sure that I know how to check that we are crossing cache line boundary. Regarding page boundary, if the next page is mapped then all should work fine, otherwise it will be an exception. > > Also nit: The switch() higher up is (still/again) missing blanks. > > > + break; \ > > + case 4: \ > > + __amoswap_generic(ptr__, new__, ret__,\ > > + ".w" sfx, release_barrier, > > acquire_barrier); \ > > + break; \ > > + case 8: \ > > + __amoswap_generic(ptr__, new__, ret__,\ > > + ".d" sfx, release_barrier, > > acquire_barrier); \ > > + break; \ > > + default: \ > > + STATIC_ASSERT_UNREACHABLE(); \ > > + } \ > > + ret__; \ > > +}) > > + > > +#define xchg_relaxed(ptr, x) \ > > +({ \ > > + __typeof__(*(ptr)) x_ = (x); \ > > + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), > > "", "", ""); \ > > +}) > > + > > +#define xchg_acquire(ptr, x) \ > > +({ \ > > + __typeof__(*(ptr)) x_ = (x); \ > > + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), \ > > + "", "", > > RISCV_ACQUIRE_BARRIER); \ > > +}) > > + > > +#define xchg_release(ptr, x) \ > > +({ \ > > + __typeof__(*(ptr)) x_ = (x); \ > > + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)),\ > > + "", RISCV_RELEASE_BARRIER, > > ""); \ > > +}) > > + > > +#define xchg(ptr,x) \ > > +({ \ > > + __typeof__(*(ptr)) ret__; \ > > + ret__ = (__typeof__(*(ptr))) \ > > + __xchg_generic(ptr, (unsigned long)(x), > > sizeof(*(ptr)), \ > > + ".aqrl", "", ""); \ > > The .aqrl doesn't look to affect the (emulated) 1- and 2-byte cases. > > Further, amoswap also exists in release-only and acquire-only forms. > Why do you prefer explicit barrier insns over those? (Looks to > similarly apply to the emulation path as well as to the cmpxchg > machinery then, as both lr and sc also come in all four possible > acquire/release forms. Perhaps for the emulation path using > explicit barriers is better, in case the acquire/release forms of > lr/sc - being used inside the loop - might perform worse.) As 1- and 2-byte cases are emulated I decided that is not to provide sfx argument for emulation macros as it will not have to much affect on emulated types and just consume more performance on acquire and release version of sc/ld instructions. > > > > No RISCV_..._BARRIER for use here and ... > > > + ret__; \ > > +}) > > + > > +#define __cmpxchg(ptr, o, n, s) \ > > +({ \ > > + __typeof__(*(ptr)) ret__; \ > > + ret__ = (__typeof__(*(ptr))) \ > > + __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned > > long)(n), \ > > + s, ".rl", "", " fence rw, rw\n"); \ > > ... here? And anyway, wouldn't it make sense to have > > #define cmpxchg(ptr, o, n) __cmpxchg(ptr, o, n, sizeof(*(ptr)) > > to limit redundancy? > > Plus wouldn't > > #define __cmpxchg(ptr, o, n, s) \ > ((__typeof__(*(ptr))) \ > __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned long)(n), \ > s, ".rl", "", " fence rw, rw\n")) > > be shorter and thus easier to follow as well? 
As I notice only now, > this would apparently apply further up as well. I understand your point about "#define cmpxchg(ptr, o, n) __cmpxchg(", but I can't understand how the definition of __cmpxchg should be done shorter. Could you please clarify that? ~ Oleksii
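To illustrate the earlier question about named asm() operands: the loop from the patch could be written with symbolic names instead of the positional %0..%4, without changing the instruction sequence (a sketch only, using the variables already present in the patch):

    asm volatile (
        "0: lr.d %[old], %[addr]\n"
        "   and  %[tmp], %[old], %z[nmask]\n"
        "   or   %[tmp], %[tmp], %z[newv]\n"
        "   sc.d %[tmp], %[tmp], %[addr]\n"
        "   bnez %[tmp], 0b\n"
        : [old] "=&r" (ret_), [tmp] "=&r" (rc), [addr] "+A" (*ptr_32b_aligned)
        : [newv] "rJ" (new_), [nmask] "rJ" (~mask)
        : "memory" );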
On 05/02/2024 15:32, Oleksii Kurochko wrote: > The header was taken from Linux kernl 6.4.0-rc1. > > Addionally, were updated: > * add emulation of {cmp}xchg for 1/2 byte types This explaination is a little bit light. IIUC, you are implementing them using 32-bit atomic access. Is that correct? If so, please spell it out. Also, I wonder whether it would be better to try to get rid of the 1/2 bytes access. Do you know where they are used? > * replace tabs with spaces Does this mean you are not planning to backport any Linux fixes? > * replace __* varialbed with *__ s/varialbed/variable/ > * introduce generic version of xchg_* and cmpxchg_*. > > Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> > --- > Changes in V4: > - Code style fixes. > - enforce in __xchg_*() has the same type for new and *ptr, also "\n" > was removed at the end of asm instruction. > - dependency from https://lore.kernel.org/xen-devel/cover.1706259490.git.federico.serafini@bugseng.com/ > - switch from ASSERT_UNREACHABLE to STATIC_ASSERT_UNREACHABLE(). > - drop xchg32(ptr, x) and xchg64(ptr, x) as they aren't used. > - drop cmpxcg{32,64}_{local} as they aren't used. > - introduce generic version of xchg_* and cmpxchg_*. > - update the commit message. > --- > Changes in V3: > - update the commit message > - add emulation of {cmp}xchg_... for 1 and 2 bytes types > --- > Changes in V2: > - update the comment at the top of the header. > - change xen/lib.h to xen/bug.h. > - sort inclusion of headers properly. > --- > xen/arch/riscv/include/asm/cmpxchg.h | 237 +++++++++++++++++++++++++++ > 1 file changed, 237 insertions(+) > create mode 100644 xen/arch/riscv/include/asm/cmpxchg.h > > diff --git a/xen/arch/riscv/include/asm/cmpxchg.h b/xen/arch/riscv/include/asm/cmpxchg.h > new file mode 100644 > index 0000000000..b751a50cbf > --- /dev/null > +++ b/xen/arch/riscv/include/asm/cmpxchg.h > @@ -0,0 +1,237 @@ > +/* SPDX-License-Identifier: GPL-2.0-only */ > +/* Copyright (C) 2014 Regents of the University of California */ > + > +#ifndef _ASM_RISCV_CMPXCHG_H > +#define _ASM_RISCV_CMPXCHG_H > + > +#include <xen/compiler.h> > +#include <xen/lib.h> > + > +#include <asm/fence.h> > +#include <asm/io.h> > +#include <asm/system.h> > + > +#define ALIGN_DOWN(addr, size) ((addr) & (~((size) - 1))) > + > +#define __amoswap_generic(ptr, new, ret, sfx, release_barrier, acquire_barrier) \ > +({ \ > + asm volatile( \ > + release_barrier \ > + " amoswap" sfx " %0, %2, %1\n" \ > + acquire_barrier \ > + : "=r" (ret), "+A" (*ptr) \ > + : "r" (new) \ > + : "memory" ); \ > +}) > + > +#define emulate_xchg_1_2(ptr, new, ret, release_barrier, acquire_barrier) \ > +({ \ > + uint32_t *ptr_32b_aligned = (uint32_t *)ALIGN_DOWN((unsigned long)ptr, 4); \ > + uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - sizeof(*ptr))) * BITS_PER_BYTE; \ > + uint8_t mask_size = sizeof(*ptr) * BITS_PER_BYTE; \ > + uint8_t mask_h = mask_l + mask_size - 1; \ > + unsigned long mask = GENMASK(mask_h, mask_l); \ > + unsigned long new_ = (unsigned long)(new) << mask_l; \ > + unsigned long ret_; \ > + unsigned long rc; \ > + \ > + asm volatile( \ > + release_barrier \ > + "0: lr.d %0, %2\n" \ I was going to ask why this is lr.d rather than lr.w. But I see Jan already asked. I agree with him that it should probably be a lr.w and ... > + " and %1, %0, %z4\n" \ > + " or %1, %1, %z3\n" \ > + " sc.d %1, %1, %2\n" \ ... respectively sc.w. The same applies for cmpxchg. Cheers,
On 15.02.2024 14:41, Oleksii wrote: >>> + : "=r" (ret), "+A" (*ptr) \ >>> + : "r" (new) \ >>> + : "memory" ); \ >>> +}) >>> + >>> +#define emulate_xchg_1_2(ptr, new, ret, release_barrier, >>> acquire_barrier) \ >>> +({ \ >>> + uint32_t *ptr_32b_aligned = (uint32_t *)ALIGN_DOWN((unsigned >>> long)ptr, 4); \ >> >> You now appear to assume that this macro is only used with inputs not >> crossing word boundaries. That's okay as long as suitably guaranteed >> at the use sites, but imo wants saying in a comment. >> >>> + uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - sizeof(*ptr))) >>> * BITS_PER_BYTE; \ >> >> Why 0x8 (i.e. spanning 64 bits), not 4 (matching the uint32_t use >> above)? > The idea to read 8 bytes was to deal with crossing word boundary. So if > our address is 0x3 and we have to xchg() 2 bytes, what will cross 4 > byte boundary. Instead we align add 0x3, so it will become 0x0 and then > just always work with 8 bytes. Then what if my 2-byte access crosses a dword boundary? A cache line one? A page one? >>> + unsigned long new_ = (unsigned long)(new) << mask_l; \ >>> + unsigned long ret_; \ >>> + unsigned long rc; \ >> >> Similarly, why unsigned long here? > sizeof(unsigned long) is 8 bytes and it was chosen as we are working > with lc/sc.d which are working with 8 bytes. > >> >> I also wonder about the mix of underscore suffixed (or not) variable >> names here. > If the question about ret_, then the same as before size of ret > argument of the macros will be 1 or 2, but {lc/sc}.d expected to work > with 8 bytes. Then what's the uint32_t * about? >>> + release_barrier \ >>> + "0: lr.d %0, %2\n" \ >> >> Even here it's an 8-byte access. Even if - didn't check - the insn >> was >> okay to use with just a 4-byte aligned pointer, wouldn't it make >> sense >> then to 8-byte align it, and be consistent throughout this macro wrt >> the base unit acted upon? Alternatively, why not use lr.w here, thus >> reducing possible collisions between multiple CPUs accessing the same >> cache line? > According to the docs: > LR and SC operate on naturally-aligned 64-bit (RV64 only) or 32-bit > words in memory. Misaligned > addresses will generate misaligned address exceptions. > > My intention was to deal with 4-byte crossing boundary. so if ptr is 4- > byte aligned then by reading 8-bytes we shouldn't care about boundary > crossing, if I am not missing something. If a ptr is 4-byte aligned, there's no point reading more than 4 bytes. >>> + " and %1, %0, %z4\n" \ >>> + " or %1, %1, %z3\n" \ >>> + " sc.d %1, %1, %2\n" \ >>> + " bnez %1, 0b\n" \ >>> + acquire_barrier \ >>> + : "=&r" (ret_), "=&r" (rc), "+A" (*ptr_32b_aligned) \ >>> + : "rJ" (new_), "rJ" (~mask) \ >> >> I think that as soon as there are more than 2 or maybe 3 operands, >> legibility is vastly improved by using named asm() operands. > Just to clarify you mean that it would be better to use instead of %0 > use names? Yes. Just like you have it in one of the other patches that I looked at later. >>> + : "memory"); \ >> >> Nit: Missing blank before closing parenthesis. >> >>> + \ >>> + ret = (__typeof__(*(ptr)))((ret_ & mask) >> mask_l); \ >>> +}) >> >> Why does "ret" need to be a macro argument? If you had only the >> expression here, not the the assigment, ... >> >>> +#define __xchg_generic(ptr, new, size, sfx, release_barrier, >>> acquire_barrier) \ >>> +({ \ >>> + __typeof__(ptr) ptr__ = (ptr); \ >> >> Is this local variable really needed? Can't you use "ptr" directly >> in the three macro invocations? 
>> >>> + __typeof__(*(ptr)) new__ = (new); \ >>> + __typeof__(*(ptr)) ret__; \ >>> + switch (size) \ >>> + { \ >>> + case 1: \ >>> + case 2: \ >>> + emulate_xchg_1_2(ptr__, new__, ret__, release_barrier, >>> acquire_barrier); \ >> >> ... this would become >> >> ret__ = emulate_xchg_1_2(ptr__, new__, release_barrier, >> acquire_barrier); \ >> >> But, unlike assumed above, there's no enforcement here that a 2-byte >> quantity won't cross a word, double-word, cache line, or even page >> boundary. That might be okay if then the code would simply crash >> (like >> the AMO insns emitted further down would), but aiui silent >> misbehavior >> would result. > As I mentioned above with 4-byte alignment and then reading and working > with 8-byte then crossing a word or double-word boundary shouldn't be > an issue. > > I am not sure that I know how to check that we are crossing cache line > boundary. > > Regarding page boundary, if the next page is mapped then all should > work fine, otherwise it will be an exception. Are you sure lr.d / sc.d are happy to access across such a boundary, when both pages are mapped? To me it seems pretty clear that for atomic accesses you want to demand natural alignment, i.e. 2-byte alignment for 2-byte accesses. This way you can be sure no potentially problematic boundaries will be crossed. >>> + break; \ >>> + case 4: \ >>> + __amoswap_generic(ptr__, new__, ret__,\ >>> + ".w" sfx, release_barrier, >>> acquire_barrier); \ >>> + break; \ >>> + case 8: \ >>> + __amoswap_generic(ptr__, new__, ret__,\ >>> + ".d" sfx, release_barrier, >>> acquire_barrier); \ >>> + break; \ >>> + default: \ >>> + STATIC_ASSERT_UNREACHABLE(); \ >>> + } \ >>> + ret__; \ >>> +}) >>> + >>> +#define xchg_relaxed(ptr, x) \ >>> +({ \ >>> + __typeof__(*(ptr)) x_ = (x); \ >>> + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), >>> "", "", ""); \ >>> +}) >>> + >>> +#define xchg_acquire(ptr, x) \ >>> +({ \ >>> + __typeof__(*(ptr)) x_ = (x); \ >>> + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), \ >>> + "", "", >>> RISCV_ACQUIRE_BARRIER); \ >>> +}) >>> + >>> +#define xchg_release(ptr, x) \ >>> +({ \ >>> + __typeof__(*(ptr)) x_ = (x); \ >>> + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)),\ >>> + "", RISCV_RELEASE_BARRIER, >>> ""); \ >>> +}) >>> + >>> +#define xchg(ptr,x) \ >>> +({ \ >>> + __typeof__(*(ptr)) ret__; \ >>> + ret__ = (__typeof__(*(ptr))) \ >>> + __xchg_generic(ptr, (unsigned long)(x), >>> sizeof(*(ptr)), \ >>> + ".aqrl", "", ""); \ >> >> The .aqrl doesn't look to affect the (emulated) 1- and 2-byte cases. >> >> Further, amoswap also exists in release-only and acquire-only forms. >> Why do you prefer explicit barrier insns over those? (Looks to >> similarly apply to the emulation path as well as to the cmpxchg >> machinery then, as both lr and sc also come in all four possible >> acquire/release forms. Perhaps for the emulation path using >> explicit barriers is better, in case the acquire/release forms of >> lr/sc - being used inside the loop - might perform worse.) > As 1- and 2-byte cases are emulated I decided that is not to provide > sfx argument for emulation macros as it will not have to much affect on > emulated types and just consume more performance on acquire and release > version of sc/ld instructions. Question is whether the common case (4- and 8-byte accesses) shouldn't be valued higher, with 1- and 2-byte emulation being there just to allow things to not break altogether. >> No RISCV_..._BARRIER for use here and ... 
>> >>> + ret__; \ >>> +}) >>> + >>> +#define __cmpxchg(ptr, o, n, s) \ >>> +({ \ >>> + __typeof__(*(ptr)) ret__; \ >>> + ret__ = (__typeof__(*(ptr))) \ >>> + __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned >>> long)(n), \ >>> + s, ".rl", "", " fence rw, rw\n"); \ >> >> ... here? And anyway, wouldn't it make sense to have >> >> #define cmpxchg(ptr, o, n) __cmpxchg(ptr, o, n, sizeof(*(ptr)) >> >> to limit redundancy? >> >> Plus wouldn't >> >> #define __cmpxchg(ptr, o, n, s) \ >> ((__typeof__(*(ptr))) \ >> __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned long)(n), \ >> s, ".rl", "", " fence rw, rw\n")) >> >> be shorter and thus easier to follow as well? As I notice only now, >> this would apparently apply further up as well. > I understand your point about "#define cmpxchg(ptr, o, n) __cmpxchg(", > but I can't undestand how the definition of __cmxchng should be done > shorter. Could you please clarify that? You did notice that in my form there's no local variable, and hence also no macro-wide scope ( ({ ... }) )? Jan
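Putting both suggestions from the review together, the pair of macros being described would collapse to something like the following sketch - a single cast expression, so no local variable and no ({ ... }) scope are needed:

    #define __cmpxchg(ptr, o, n, s) \
        ((__typeof__(*(ptr))) \
         __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned long)(n), \
                           s, ".rl", "", " fence rw, rw\n"))

    /* cmpxchg() then only supplies the operand size */
    #define cmpxchg(ptr, o, n) __cmpxchg(ptr, o, n, sizeof(*(ptr)))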
On Sun, 2024-02-18 at 19:00 +0000, Julien Grall wrote: > > > On 05/02/2024 15:32, Oleksii Kurochko wrote: > > The header was taken from Linux kernl 6.4.0-rc1. > > > > Addionally, were updated: > > * add emulation of {cmp}xchg for 1/2 byte types > > This explaination is a little bit light. IIUC, you are implementing > them > using 32-bit atomic access. Is that correct? If so, please spell it > out. Sure, I'll update commit message. > > Also, I wonder whether it would be better to try to get rid of the > 1/2 > bytes access. Do you know where they are used? Right now, the issue is with test_and_clear_bool() which is used in common/sched/core.c:840 [https://gitlab.com/xen-project/xen/-/blob/staging/xen/common/sched/core.c?ref_type=heads#L840 ] I don't remember details, but in xen-devel chat someone told me that grant table requires 1/2 bytes access. > > > * replace tabs with spaces > Does this mean you are not planning to backport any Linux fixes? If it will be any fixes for sure I'll back port them, but it looks like this code is stable enough and not to many fixes will be done there, so it shouldn't be hard to backport them and switch to spaces. > > > * replace __* varialbed with *__ > > s/varialbed/variable/ > > > * introduce generic version of xchg_* and cmpxchg_*. > > > > Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> > > --- > > Changes in V4: > > - Code style fixes. > > - enforce in __xchg_*() has the same type for new and *ptr, also > > "\n" > > was removed at the end of asm instruction. > > - dependency from > > https://lore.kernel.org/xen-devel/cover.1706259490.git.federico.serafini@bugseng.com/ > > - switch from ASSERT_UNREACHABLE to STATIC_ASSERT_UNREACHABLE(). > > - drop xchg32(ptr, x) and xchg64(ptr, x) as they aren't used. > > - drop cmpxcg{32,64}_{local} as they aren't used. > > - introduce generic version of xchg_* and cmpxchg_*. > > - update the commit message. > > --- > > Changes in V3: > > - update the commit message > > - add emulation of {cmp}xchg_... for 1 and 2 bytes types > > --- > > Changes in V2: > > - update the comment at the top of the header. > > - change xen/lib.h to xen/bug.h. > > - sort inclusion of headers properly. 
> > --- > > xen/arch/riscv/include/asm/cmpxchg.h | 237 > > +++++++++++++++++++++++++++ > > 1 file changed, 237 insertions(+) > > create mode 100644 xen/arch/riscv/include/asm/cmpxchg.h > > > > diff --git a/xen/arch/riscv/include/asm/cmpxchg.h > > b/xen/arch/riscv/include/asm/cmpxchg.h > > new file mode 100644 > > index 0000000000..b751a50cbf > > --- /dev/null > > +++ b/xen/arch/riscv/include/asm/cmpxchg.h > > @@ -0,0 +1,237 @@ > > +/* SPDX-License-Identifier: GPL-2.0-only */ > > +/* Copyright (C) 2014 Regents of the University of California */ > > + > > +#ifndef _ASM_RISCV_CMPXCHG_H > > +#define _ASM_RISCV_CMPXCHG_H > > + > > +#include <xen/compiler.h> > > +#include <xen/lib.h> > > + > > +#include <asm/fence.h> > > +#include <asm/io.h> > > +#include <asm/system.h> > > + > > +#define ALIGN_DOWN(addr, size) ((addr) & (~((size) - 1))) > > + > > +#define __amoswap_generic(ptr, new, ret, sfx, release_barrier, > > acquire_barrier) \ > > +({ \ > > + asm volatile( \ > > + release_barrier \ > > + " amoswap" sfx " %0, %2, %1\n" \ > > + acquire_barrier \ > > + : "=r" (ret), "+A" (*ptr) \ > > + : "r" (new) \ > > + : "memory" ); \ > > +}) > > + > > +#define emulate_xchg_1_2(ptr, new, ret, release_barrier, > > acquire_barrier) \ > > +({ \ > > + uint32_t *ptr_32b_aligned = (uint32_t *)ALIGN_DOWN((unsigned > > long)ptr, 4); \ > > + uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - sizeof(*ptr))) > > * BITS_PER_BYTE; \ > > + uint8_t mask_size = sizeof(*ptr) * BITS_PER_BYTE; \ > > + uint8_t mask_h = mask_l + mask_size - 1; \ > > + unsigned long mask = GENMASK(mask_h, mask_l); \ > > + unsigned long new_ = (unsigned long)(new) << mask_l; \ > > + unsigned long ret_; \ > > + unsigned long rc; \ > > + \ > > + asm volatile( \ > > + release_barrier \ > > + "0: lr.d %0, %2\n" \ > > I was going to ask why this is lr.d rather than lr.w. But I see Jan > already asked. I agree with him that it should probably be a lr.w and > ... > > > + " and %1, %0, %z4\n" \ > > + " or %1, %1, %z3\n" \ > > + " sc.d %1, %1, %2\n" \ > > ... respectively sc.w. The same applies for cmpxchg. I agree that it would be better, and my initial attempt was to handle 4-byte or 8-byte boundary crossing during 2-byte access: 0 1 2 3 4 5 6 7 8 X X X 1 1 X X X X In this case, if I align address 3 to address 0 and then read 4 bytes instead of 8 bytes, I will not process the byte at address 4. This was the reason why I started to read 8 bytes. I also acknowledge that there could be an issue in the following case: X 4094 4095 4096 X 1 1 X In this situation, when I read 8 bytes, there could be an issue where the new page (which starts at 4096) will not be mapped. It seems correct in this case to check that variable is within one page and read 4 bytes instead of 8. One more thing I am uncertain about is if we change everything to read 4 bytes with 4-byte alignment, what should be done with the first case? Should we panic? (I am not sure if this is an option.) Should we perform the operation twice for addresses 0x0 and 0x4? ~ Oleksii
On 19.02.2024 15:00, Oleksii wrote: > On Sun, 2024-02-18 at 19:00 +0000, Julien Grall wrote: >> On 05/02/2024 15:32, Oleksii Kurochko wrote: >>> --- /dev/null >>> +++ b/xen/arch/riscv/include/asm/cmpxchg.h >>> @@ -0,0 +1,237 @@ >>> +/* SPDX-License-Identifier: GPL-2.0-only */ >>> +/* Copyright (C) 2014 Regents of the University of California */ >>> + >>> +#ifndef _ASM_RISCV_CMPXCHG_H >>> +#define _ASM_RISCV_CMPXCHG_H >>> + >>> +#include <xen/compiler.h> >>> +#include <xen/lib.h> >>> + >>> +#include <asm/fence.h> >>> +#include <asm/io.h> >>> +#include <asm/system.h> >>> + >>> +#define ALIGN_DOWN(addr, size) ((addr) & (~((size) - 1))) >>> + >>> +#define __amoswap_generic(ptr, new, ret, sfx, release_barrier, >>> acquire_barrier) \ >>> +({ \ >>> + asm volatile( \ >>> + release_barrier \ >>> + " amoswap" sfx " %0, %2, %1\n" \ >>> + acquire_barrier \ >>> + : "=r" (ret), "+A" (*ptr) \ >>> + : "r" (new) \ >>> + : "memory" ); \ >>> +}) >>> + >>> +#define emulate_xchg_1_2(ptr, new, ret, release_barrier, >>> acquire_barrier) \ >>> +({ \ >>> + uint32_t *ptr_32b_aligned = (uint32_t *)ALIGN_DOWN((unsigned >>> long)ptr, 4); \ >>> + uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - sizeof(*ptr))) >>> * BITS_PER_BYTE; \ >>> + uint8_t mask_size = sizeof(*ptr) * BITS_PER_BYTE; \ >>> + uint8_t mask_h = mask_l + mask_size - 1; \ >>> + unsigned long mask = GENMASK(mask_h, mask_l); \ >>> + unsigned long new_ = (unsigned long)(new) << mask_l; \ >>> + unsigned long ret_; \ >>> + unsigned long rc; \ >>> + \ >>> + asm volatile( \ >>> + release_barrier \ >>> + "0: lr.d %0, %2\n" \ >> >> I was going to ask why this is lr.d rather than lr.w. But I see Jan >> already asked. I agree with him that it should probably be a lr.w and >> ... >> >>> + " and %1, %0, %z4\n" \ >>> + " or %1, %1, %z3\n" \ >>> + " sc.d %1, %1, %2\n" \ >> >> ... respectively sc.w. The same applies for cmpxchg. > > I agree that it would be better, and my initial attempt was to handle > 4-byte or 8-byte boundary crossing during 2-byte access: > > 0 1 2 3 4 5 6 7 8 > X X X 1 1 X X X X > > In this case, if I align address 3 to address 0 and then read 4 bytes > instead of 8 bytes, I will not process the byte at address 4. This was > the reason why I started to read 8 bytes. > > I also acknowledge that there could be an issue in the following case: > > X 4094 4095 4096 > X 1 1 X > In this situation, when I read 8 bytes, there could be an issue where > the new page (which starts at 4096) will not be mapped. It seems > correct in this case to check that variable is within one page and read > 4 bytes instead of 8. > > One more thing I am uncertain about is if we change everything to read > 4 bytes with 4-byte alignment, what should be done with the first case? > Should we panic? (I am not sure if this is an option.) Counter question (raised elsewhere already): What if a 4-byte access crosses a word / cache line / page boundary? Ideally exactly the same would happen for a 2-byte access crossing a respective boundary. (Which you can achieve relatively easily by masking off only address bit 1, keeping address bit 0 unaltered.) > Should we > perform the operation twice for addresses 0x0 and 0x4? That wouldn't be atomic then. Jan
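One way to express the address computation described here, mirroring the patch's existing "0x8 - sizeof(*ptr)" pattern but against the containing 32-bit word (the exact expression is illustrative):

    /* 1-byte access: clears bits 1..0; 2-byte access: clears only bit 1,
     * so a misaligned 2-byte pointer yields a misaligned lr.w/sc.w and
     * faults just like a native misaligned access would */
    uint32_t *aligned = (uint32_t *)((unsigned long)(ptr) &
                                     ~(0x4UL - sizeof(*(ptr))));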
Hi Oleksii, On 19/02/2024 14:00, Oleksii wrote: > On Sun, 2024-02-18 at 19:00 +0000, Julien Grall wrote: >> >> >> On 05/02/2024 15:32, Oleksii Kurochko wrote: >>> The header was taken from Linux kernl 6.4.0-rc1. >>> >>> Addionally, were updated: >>> * add emulation of {cmp}xchg for 1/2 byte types >> >> This explaination is a little bit light. IIUC, you are implementing >> them >> using 32-bit atomic access. Is that correct? If so, please spell it >> out. > Sure, I'll update commit message. > >> >> Also, I wonder whether it would be better to try to get rid of the >> 1/2 >> bytes access. Do you know where they are used? > Right now, the issue is with test_and_clear_bool() which is used in > common/sched/core.c:840 > [https://gitlab.com/xen-project/xen/-/blob/staging/xen/common/sched/core.c?ref_type=heads#L840 > ] > > I don't remember details, but in xen-devel chat someone told me that > grant table requires 1/2 bytes access. Ok :/. This would be part of the ABI then and therefore can't be easily changed. > >> >>> * replace tabs with spaces >> Does this mean you are not planning to backport any Linux fixes? > If it will be any fixes for sure I'll back port them, but it looks like > this code is stable enough and not to many fixes will be done there, so > it shouldn't be hard to backport them and switch to spaces. Fair enough. >>> + " and %1, %0, %z4\n" \ >>> + " or %1, %1, %z3\n" \ >>> + " sc.d %1, %1, %2\n" \ >> >> ... respectively sc.w. The same applies for cmpxchg. > > I agree that it would be better, and my initial attempt was to handle > 4-byte or 8-byte boundary crossing during 2-byte access: > > 0 1 2 3 4 5 6 7 8 > X X X 1 1 X X X X > > In this case, if I align address 3 to address 0 and then read 4 bytes > instead of 8 bytes, I will not process the byte at address 4. This was > the reason why I started to read 8 bytes. At least on Arm, the architecture doesn't support atomic operations if the access is not aligned to its size (this will send a data abort). On some architecture, this is supported but potentially very slow. So all the common code should already use properly aligned address. Therefore, I don't really see the reason to add support for unaligned access. Cheers,
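If the natural-alignment expectation is to be stated explicitly at the top of the emulation macros rather than left implicit, a hypothetical check could look like this (the helper name is invented for illustration):

    /* callers are expected to pass naturally aligned pointers */
    #define ASSERT_NATURALLY_ALIGNED(p) \
        ASSERT(!((unsigned long)(p) & (sizeof(*(p)) - 1)))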
On Mon, 2024-02-19 at 12:22 +0100, Jan Beulich wrote: > On 15.02.2024 14:41, Oleksii wrote: > > > > + : "=r" (ret), "+A" (*ptr) \ > > > > + : "r" (new) \ > > > > + : "memory" ); \ > > > > +}) > > > > + > > > > +#define emulate_xchg_1_2(ptr, new, ret, release_barrier, > > > > acquire_barrier) \ > > > > +({ \ > > > > + uint32_t *ptr_32b_aligned = (uint32_t > > > > *)ALIGN_DOWN((unsigned > > > > long)ptr, 4); \ > > > > > > You now appear to assume that this macro is only used with inputs > > > not > > > crossing word boundaries. That's okay as long as suitably > > > guaranteed > > > at the use sites, but imo wants saying in a comment. > > > > > > > + uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - > > > > sizeof(*ptr))) > > > > * BITS_PER_BYTE; \ > > > > > > Why 0x8 (i.e. spanning 64 bits), not 4 (matching the uint32_t use > > > above)? > > The idea to read 8 bytes was to deal with crossing word boundary. > > So if > > our address is 0x3 and we have to xchg() 2 bytes, what will cross 4 > > byte boundary. Instead we align add 0x3, so it will become 0x0 and > > then > > just always work with 8 bytes. > > Then what if my 2-byte access crosses a dword boundary? A cache line > one? A page one? Everything looks okay to me, except in the case of a page boundary. In the scenario of a dword boundary: 0 1 2 3 4 5 6 7 8 9 ... X X X X X X X 1 1 X Assuming a variable starts at address 7, 4-byte alignment will be enforced, and 8 bytes will be processed starting from address 4. Concerning a cache line, it should still work, with potential performance issues arising only if a part of the variable is cached while another part is not. Regarding page crossing, I acknowledge that it could be problematic if the variable is entirely located at the end of a page, as there is no guarantee that the next page exists. In this case, it would be preferable to consistently read 4 bytes with 4-byte alignment: X 4094 4095 4096? X 1 1 ? However, if the variable spans two pages, proper page mapping should be ensured. It appears sensible to reconsider the macros and implement 4-byte alignment and 4-byte access, but then this is not clear how better to deal with first case ( dword boundary ). Panic ? or use the macros twice for address 4, and address 8? > > > > > + unsigned long new_ = (unsigned long)(new) << mask_l; \ > > > > + unsigned long ret_; \ > > > > + unsigned long rc; \ > > > > > > Similarly, why unsigned long here? > > sizeof(unsigned long) is 8 bytes and it was chosen as we are > > working > > with lc/sc.d which are working with 8 bytes. > > > > > > > > I also wonder about the mix of underscore suffixed (or not) > > > variable > > > names here. > > If the question about ret_, then the same as before size of ret > > argument of the macros will be 1 or 2, but {lc/sc}.d expected to > > work > > with 8 bytes. > > Then what's the uint32_t * about? Agree, then it should be also unsigned long. > > > > > > + __typeof__(*(ptr)) new__ = (new); \ > > > > + __typeof__(*(ptr)) ret__; \ > > > > + switch (size) \ > > > > + { \ > > > > + case 1: \ > > > > + case 2: \ > > > > + emulate_xchg_1_2(ptr__, new__, ret__, release_barrier, > > > > acquire_barrier); \ > > > > > > ... this would become > > > > > > ret__ = emulate_xchg_1_2(ptr__, new__, release_barrier, > > > acquire_barrier); \ > > > > > > But, unlike assumed above, there's no enforcement here that a 2- > > > byte > > > quantity won't cross a word, double-word, cache line, or even > > > page > > > boundary. 
That might be okay if then the code would simply crash > > > (like > > > the AMO insns emitted further down would), but aiui silent > > > misbehavior > > > would result. > > As I mentioned above with 4-byte alignment and then reading and > > working > > with 8-byte then crossing a word or double-word boundary shouldn't > > be > > an issue. > > > > I am not sure that I know how to check that we are crossing cache > > line > > boundary. > > > > Regarding page boundary, if the next page is mapped then all should > > work fine, otherwise it will be an exception. > > Are you sure lr.d / sc.d are happy to access across such a boundary, > when both pages are mapped? If they are mapped, my expectation that lr.d and sc.d should be happy. > > To me it seems pretty clear that for atomic accesses you want to > demand natural alignment, i.e. 2-byte alignment for 2-byte accesses. > This way you can be sure no potentially problematic boundaries will > be crossed. It makes sense, but I am not sure that I can guarantee that a user of macros will always have 2-byte alignment (except during a panic) in the future. Even now, I am uncertain that everyone will be willing to add __alignment(...) to struct vcpu->is_urgent (xen/include/xen/sched.h:218) and other possible cases to accommodate RISC-V requirements. > > > > > + break; \ > > > > + case 4: \ > > > > + __amoswap_generic(ptr__, new__, ret__,\ > > > > + ".w" sfx, release_barrier, > > > > acquire_barrier); \ > > > > + break; \ > > > > + case 8: \ > > > > + __amoswap_generic(ptr__, new__, ret__,\ > > > > + ".d" sfx, release_barrier, > > > > acquire_barrier); \ > > > > + break; \ > > > > + default: \ > > > > + STATIC_ASSERT_UNREACHABLE(); \ > > > > + } \ > > > > + ret__; \ > > > > +}) > > > > + > > > > +#define xchg_relaxed(ptr, x) \ > > > > +({ \ > > > > + __typeof__(*(ptr)) x_ = (x); \ > > > > + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, > > > > sizeof(*(ptr)), > > > > "", "", ""); \ > > > > +}) > > > > + > > > > +#define xchg_acquire(ptr, x) \ > > > > +({ \ > > > > + __typeof__(*(ptr)) x_ = (x); \ > > > > + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, > > > > sizeof(*(ptr)), \ > > > > + "", "", > > > > RISCV_ACQUIRE_BARRIER); \ > > > > +}) > > > > + > > > > +#define xchg_release(ptr, x) \ > > > > +({ \ > > > > + __typeof__(*(ptr)) x_ = (x); \ > > > > + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, > > > > sizeof(*(ptr)),\ > > > > + "", > > > > RISCV_RELEASE_BARRIER, > > > > ""); \ > > > > +}) > > > > + > > > > +#define xchg(ptr,x) \ > > > > +({ \ > > > > + __typeof__(*(ptr)) ret__; \ > > > > + ret__ = (__typeof__(*(ptr))) \ > > > > + __xchg_generic(ptr, (unsigned long)(x), > > > > sizeof(*(ptr)), \ > > > > + ".aqrl", "", ""); \ > > > > > > The .aqrl doesn't look to affect the (emulated) 1- and 2-byte > > > cases. > > > > > > Further, amoswap also exists in release-only and acquire-only > > > forms. > > > Why do you prefer explicit barrier insns over those? (Looks to > > > similarly apply to the emulation path as well as to the cmpxchg > > > machinery then, as both lr and sc also come in all four possible > > > acquire/release forms. Perhaps for the emulation path using > > > explicit barriers is better, in case the acquire/release forms of > > > lr/sc - being used inside the loop - might perform worse.) 
> > As 1- and 2-byte cases are emulated I decided that is not to > > provide > > sfx argument for emulation macros as it will not have to much > > affect on > > emulated types and just consume more performance on acquire and > > release > > version of sc/ld instructions. > > Question is whether the common case (4- and 8-byte accesses) > shouldn't > be valued higher, with 1- and 2-byte emulation being there just to > allow things to not break altogether. If I understand you correctly, it would make sense to add the 'sfx' argument for the 1/2-byte access case, ensuring that all options are available for 1/2-byte access case as well. ~ Oleksii
On 19.02.2024 15:29, Oleksii wrote: > On Mon, 2024-02-19 at 12:22 +0100, Jan Beulich wrote: >> On 15.02.2024 14:41, Oleksii wrote: >>> As I mentioned above with 4-byte alignment and then reading and >>> working >>> with 8-byte then crossing a word or double-word boundary shouldn't >>> be >>> an issue. >>> >>> I am not sure that I know how to check that we are crossing cache >>> line >>> boundary. >>> >>> Regarding page boundary, if the next page is mapped then all should >>> work fine, otherwise it will be an exception. >> >> Are you sure lr.d / sc.d are happy to access across such a boundary, >> when both pages are mapped? > If they are mapped, my expectation that lr.d and sc.d should be happy. How does this expectation of yours fit with the A extension doc having this: "For LR and SC, the A extension requires that the address held in rs1 be naturally aligned to the size of the operand (i.e., eight-byte aligned for 64-bit words and four-byte aligned for 32-bit words). If the address is not naturally aligned, an address-misaligned exception or an access-fault exception will be generated." It doesn't even say "may"; it says "will". >> To me it seems pretty clear that for atomic accesses you want to >> demand natural alignment, i.e. 2-byte alignment for 2-byte accesses. >> This way you can be sure no potentially problematic boundaries will >> be crossed. > It makes sense, but I am not sure that I can guarantee that a user of > macros will always have 2-byte alignment (except during a panic) in the > future. > > Even now, I am uncertain that everyone will be willing to add > __alignment(...) to struct vcpu->is_urgent > (xen/include/xen/sched.h:218) and other possible cases to accommodate > RISC-V requirements. ->is_urgent is bool, i.e. 1 byte and hence okay at any address. For all normal variables and fields the compiler will guarantee suitable (natural) alignment. What you prohibit by requiring aligned items is use of fields of e.g. packed structures. >>> As 1- and 2-byte cases are emulated I decided that is not to >>> provide >>> sfx argument for emulation macros as it will not have to much >>> affect on >>> emulated types and just consume more performance on acquire and >>> release >>> version of sc/ld instructions. >> >> Question is whether the common case (4- and 8-byte accesses) >> shouldn't >> be valued higher, with 1- and 2-byte emulation being there just to >> allow things to not break altogether. > If I understand you correctly, it would make sense to add the 'sfx' > argument for the 1/2-byte access case, ensuring that all options are > available for 1/2-byte access case as well. That's one of the possibilities. As said, I'm not overly worried about the emulated cases. For the initial implementation I'd recommend going with what is easiest there, yielding the best possible result for the 4- and 8-byte cases. If later it turns out repeated acquire/release accesses are a problem in the emulation loop, things can be changed to explicit barriers, without touching the 4- and 8-byte cases. Jan
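For the 4- and 8-byte paths, the AMO ordering bits mentioned earlier in the thread could carry the acquire/release semantics instead of explicit fence instructions; a sketch of that shape (the .aq/.rl/.aqrl suffixes are standard A-extension spellings, the macro itself is illustrative):

    /* ordering is encoded in the AMO itself, no separate fences */
    #define __amoswap_generic(ptr, new, ret, sfx) \
    ({ \
        asm volatile ( \
            " amoswap" sfx " %0, %2, %1" \
            : "=r" (ret), "+A" (*(ptr)) \
            : "r" (new) \
            : "memory" ); \
    })

    /* e.g. ".w.aqrl" for xchg(), ".w.aq" for xchg_acquire(),
       ".w.rl" for xchg_release(), ".w" for xchg_relaxed() */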
On Mon, 2024-02-19 at 15:12 +0100, Jan Beulich wrote: > On 19.02.2024 15:00, Oleksii wrote: > > On Sun, 2024-02-18 at 19:00 +0000, Julien Grall wrote: > > > On 05/02/2024 15:32, Oleksii Kurochko wrote: > > > > --- /dev/null > > > > +++ b/xen/arch/riscv/include/asm/cmpxchg.h > > > > @@ -0,0 +1,237 @@ > > > > +/* SPDX-License-Identifier: GPL-2.0-only */ > > > > +/* Copyright (C) 2014 Regents of the University of California > > > > */ > > > > + > > > > +#ifndef _ASM_RISCV_CMPXCHG_H > > > > +#define _ASM_RISCV_CMPXCHG_H > > > > + > > > > +#include <xen/compiler.h> > > > > +#include <xen/lib.h> > > > > + > > > > +#include <asm/fence.h> > > > > +#include <asm/io.h> > > > > +#include <asm/system.h> > > > > + > > > > +#define ALIGN_DOWN(addr, size) ((addr) & (~((size) - 1))) > > > > + > > > > +#define __amoswap_generic(ptr, new, ret, sfx, release_barrier, > > > > acquire_barrier) \ > > > > +({ \ > > > > + asm volatile( \ > > > > + release_barrier \ > > > > + " amoswap" sfx " %0, %2, %1\n" \ > > > > + acquire_barrier \ > > > > + : "=r" (ret), "+A" (*ptr) \ > > > > + : "r" (new) \ > > > > + : "memory" ); \ > > > > +}) > > > > + > > > > +#define emulate_xchg_1_2(ptr, new, ret, release_barrier, > > > > acquire_barrier) \ > > > > +({ \ > > > > + uint32_t *ptr_32b_aligned = (uint32_t > > > > *)ALIGN_DOWN((unsigned > > > > long)ptr, 4); \ > > > > + uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - > > > > sizeof(*ptr))) > > > > * BITS_PER_BYTE; \ > > > > + uint8_t mask_size = sizeof(*ptr) * BITS_PER_BYTE; \ > > > > + uint8_t mask_h = mask_l + mask_size - 1; \ > > > > + unsigned long mask = GENMASK(mask_h, mask_l); \ > > > > + unsigned long new_ = (unsigned long)(new) << mask_l; \ > > > > + unsigned long ret_; \ > > > > + unsigned long rc; \ > > > > + \ > > > > + asm volatile( \ > > > > + release_barrier \ > > > > + "0: lr.d %0, %2\n" \ > > > > > > I was going to ask why this is lr.d rather than lr.w. But I see > > > Jan > > > already asked. I agree with him that it should probably be a lr.w > > > and > > > ... > > > > > > > + " and %1, %0, %z4\n" \ > > > > + " or %1, %1, %z3\n" \ > > > > + " sc.d %1, %1, %2\n" \ > > > > > > ... respectively sc.w. The same applies for cmpxchg. > > > > I agree that it would be better, and my initial attempt was to > > handle > > 4-byte or 8-byte boundary crossing during 2-byte access: > > > > 0 1 2 3 4 5 6 7 8 > > X X X 1 1 X X X X > > > > In this case, if I align address 3 to address 0 and then read 4 > > bytes > > instead of 8 bytes, I will not process the byte at address 4. This > > was > > the reason why I started to read 8 bytes. > > > > I also acknowledge that there could be an issue in the following > > case: > > > > X 4094 4095 4096 > > X 1 1 X > > In this situation, when I read 8 bytes, there could be an issue > > where > > the new page (which starts at 4096) will not be mapped. It seems > > correct in this case to check that variable is within one page and > > read > > 4 bytes instead of 8. > > > > One more thing I am uncertain about is if we change everything to > > read > > 4 bytes with 4-byte alignment, what should be done with the first > > case? > > Should we panic? (I am not sure if this is an option.) > > Counter question (raised elsewhere already): What if a 4-byte access > crosses a word / cache line / page boundary? Ideally exactly the > same would happen for a 2-byte access crossing a respective boundary. > (Which you can achieve relatively easily by masking off only address > bit 1, keeping address bit 0 unaltered.) 
But if we align down on a 4-byte boundary and then access 4 bytes, we can't cross a boundary. I agree that the algorithm is not correct, as it can ignore some values in certain situations. For example: 0 1 2 3 4 5 6 7 8 X X X 1 1 X X X X In this case, the value at address 4 won't be updated. I agree that introducing a new macro to check if a variable crosses a boundary is necessary or as an option we can check that addr is 2-byte aligned: #define CHECK_BOUNDARY_CROSSING(start, end, boundary_size) ASSERT((start / boundary_size) != (end / boundary_size)) Then, it is necessary to check: CHECK_BOUNDARY_CROSSING(start, end, 4) CHECK_BOUNDARY_CROSSING(start, end, PAGE_SIZE) But why do we need to check the cache line boundary? In the case of the cache, the question will only be about performance, but it should still work, shouldn't it? ~ Oleksii
On 19.02.2024 16:20, Oleksii wrote: > On Mon, 2024-02-19 at 15:12 +0100, Jan Beulich wrote: >> On 19.02.2024 15:00, Oleksii wrote: >>> On Sun, 2024-02-18 at 19:00 +0000, Julien Grall wrote: >>>> On 05/02/2024 15:32, Oleksii Kurochko wrote: >>>>> --- /dev/null >>>>> +++ b/xen/arch/riscv/include/asm/cmpxchg.h >>>>> @@ -0,0 +1,237 @@ >>>>> +/* SPDX-License-Identifier: GPL-2.0-only */ >>>>> +/* Copyright (C) 2014 Regents of the University of California >>>>> */ >>>>> + >>>>> +#ifndef _ASM_RISCV_CMPXCHG_H >>>>> +#define _ASM_RISCV_CMPXCHG_H >>>>> + >>>>> +#include <xen/compiler.h> >>>>> +#include <xen/lib.h> >>>>> + >>>>> +#include <asm/fence.h> >>>>> +#include <asm/io.h> >>>>> +#include <asm/system.h> >>>>> + >>>>> +#define ALIGN_DOWN(addr, size) ((addr) & (~((size) - 1))) >>>>> + >>>>> +#define __amoswap_generic(ptr, new, ret, sfx, release_barrier, >>>>> acquire_barrier) \ >>>>> +({ \ >>>>> + asm volatile( \ >>>>> + release_barrier \ >>>>> + " amoswap" sfx " %0, %2, %1\n" \ >>>>> + acquire_barrier \ >>>>> + : "=r" (ret), "+A" (*ptr) \ >>>>> + : "r" (new) \ >>>>> + : "memory" ); \ >>>>> +}) >>>>> + >>>>> +#define emulate_xchg_1_2(ptr, new, ret, release_barrier, >>>>> acquire_barrier) \ >>>>> +({ \ >>>>> + uint32_t *ptr_32b_aligned = (uint32_t >>>>> *)ALIGN_DOWN((unsigned >>>>> long)ptr, 4); \ >>>>> + uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - >>>>> sizeof(*ptr))) >>>>> * BITS_PER_BYTE; \ >>>>> + uint8_t mask_size = sizeof(*ptr) * BITS_PER_BYTE; \ >>>>> + uint8_t mask_h = mask_l + mask_size - 1; \ >>>>> + unsigned long mask = GENMASK(mask_h, mask_l); \ >>>>> + unsigned long new_ = (unsigned long)(new) << mask_l; \ >>>>> + unsigned long ret_; \ >>>>> + unsigned long rc; \ >>>>> + \ >>>>> + asm volatile( \ >>>>> + release_barrier \ >>>>> + "0: lr.d %0, %2\n" \ >>>> >>>> I was going to ask why this is lr.d rather than lr.w. But I see >>>> Jan >>>> already asked. I agree with him that it should probably be a lr.w >>>> and >>>> ... >>>> >>>>> + " and %1, %0, %z4\n" \ >>>>> + " or %1, %1, %z3\n" \ >>>>> + " sc.d %1, %1, %2\n" \ >>>> >>>> ... respectively sc.w. The same applies for cmpxchg. >>> >>> I agree that it would be better, and my initial attempt was to >>> handle >>> 4-byte or 8-byte boundary crossing during 2-byte access: >>> >>> 0 1 2 3 4 5 6 7 8 >>> X X X 1 1 X X X X >>> >>> In this case, if I align address 3 to address 0 and then read 4 >>> bytes >>> instead of 8 bytes, I will not process the byte at address 4. This >>> was >>> the reason why I started to read 8 bytes. >>> >>> I also acknowledge that there could be an issue in the following >>> case: >>> >>> X 4094 4095 4096 >>> X 1 1 X >>> In this situation, when I read 8 bytes, there could be an issue >>> where >>> the new page (which starts at 4096) will not be mapped. It seems >>> correct in this case to check that variable is within one page and >>> read >>> 4 bytes instead of 8. >>> >>> One more thing I am uncertain about is if we change everything to >>> read >>> 4 bytes with 4-byte alignment, what should be done with the first >>> case? >>> Should we panic? (I am not sure if this is an option.) >> >> Counter question (raised elsewhere already): What if a 4-byte access >> crosses a word / cache line / page boundary? Ideally exactly the >> same would happen for a 2-byte access crossing a respective boundary. >> (Which you can achieve relatively easily by masking off only address >> bit 1, keeping address bit 0 unaltered.) > But if we align down on a 4-byte boundary and then access 4 bytes, we > can't cross a boundary. 
> I agree that the algorithm is not correct, as
> it can ignore some values in certain situations. For example:
>  0 1 2 3 4 5 6 7 8
>  X X X 1 1 X X X X
> In this case, the value at address 4 won't be updated.
>
> I agree that introducing a new macro to check if a variable crosses a
> boundary is necessary or as an option we can check that addr is 2-byte
> aligned:
>
> #define CHECK_BOUNDARY_CROSSING(start, end, boundary_size)
>     ASSERT((start / boundary_size) != (end / boundary_size))
>
> Then, it is necessary to check:
>
> CHECK_BOUNDARY_CROSSING(start, end, 4)
> CHECK_BOUNDARY_CROSSING(start, end, PAGE_SIZE)
>
> But why do we need to check the cache line boundary? In the case of the
> cache, the question will only be about performance, but it should still
> work, shouldn't it?

You don't need to check for any of these boundaries. You can simply
leverage what the hardware does for misaligned accesses. See the various
other replies I've sent - I thought things should have become pretty much
crystal clear by now: For 1-byte accesses you access the containing word,
by clearing the low two bits. For 2-byte accesses you also access the
containing word, by clearing only bit 1 (which then naturally leaves no
bit that needs clearing for the projected [but not necessary] case of
handling a 4-byte access). If the resulting 4-byte access then is still
misaligned, it'll fault just as a non-emulated 4-byte access would. And
you don't need to care about any of the boundaries, not at words, not at
cache lines, and not at pages.

Jan
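To make the bit-clearing above concrete, a minimal sketch (the helper name
is made up) of how the containing-word pointer could be derived; for a
2-byte access only bit 1 is cleared, so an odd address keeps bit 0 set and
the resulting lr.w/sc.w faults exactly as a native misaligned 4-byte
access would:

/*
 * Sketch only: clear bits [1:0] for 1-byte accesses, just bit 1 for 2-byte
 * accesses, so the emulating word access inherits any remaining misalignment.
 */
#define containing_word(ptr)                                             \
    ((uint32_t *)((unsigned long)(ptr) &                                 \
                  ~(unsigned long)(sizeof(*(ptr)) == 1 ? 3 : 2)))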
> > > > > As 1- and 2-byte cases are emulated I decided that is not to > > > > provide > > > > sfx argument for emulation macros as it will not have to much > > > > affect on > > > > emulated types and just consume more performance on acquire and > > > > release > > > > version of sc/ld instructions. > > > > > > Question is whether the common case (4- and 8-byte accesses) > > > shouldn't > > > be valued higher, with 1- and 2-byte emulation being there just > > > to > > > allow things to not break altogether. > > If I understand you correctly, it would make sense to add the 'sfx' > > argument for the 1/2-byte access case, ensuring that all options > > are > > available for 1/2-byte access case as well. > > That's one of the possibilities. As said, I'm not overly worried > about > the emulated cases. For the initial implementation I'd recommend > going > with what is easiest there, yielding the best possible result for the > 4- and 8-byte cases. If later it turns out repeated acquire/release > accesses are a problem in the emulation loop, things can be changed > to explicit barriers, without touching the 4- and 8-byte cases. I am confused then a little bit if emulated case is not an issue. For 4- and 8-byte cases for xchg .aqrl is used, for relaxed and aqcuire version of xchg barries are used. The similar is done for cmpxchg. If something will be needed to change in emulation loop it won't require to change 4- and 8-byte cases. ~ Oleksii
On 23.02.2024 13:23, Oleksii wrote: >> >>>>> As 1- and 2-byte cases are emulated I decided that is not to >>>>> provide >>>>> sfx argument for emulation macros as it will not have to much >>>>> affect on >>>>> emulated types and just consume more performance on acquire and >>>>> release >>>>> version of sc/ld instructions. >>>> >>>> Question is whether the common case (4- and 8-byte accesses) >>>> shouldn't >>>> be valued higher, with 1- and 2-byte emulation being there just >>>> to >>>> allow things to not break altogether. >>> If I understand you correctly, it would make sense to add the 'sfx' >>> argument for the 1/2-byte access case, ensuring that all options >>> are >>> available for 1/2-byte access case as well. >> >> That's one of the possibilities. As said, I'm not overly worried >> about >> the emulated cases. For the initial implementation I'd recommend >> going >> with what is easiest there, yielding the best possible result for the >> 4- and 8-byte cases. If later it turns out repeated acquire/release >> accesses are a problem in the emulation loop, things can be changed >> to explicit barriers, without touching the 4- and 8-byte cases. > I am confused then a little bit if emulated case is not an issue. > > For 4- and 8-byte cases for xchg .aqrl is used, for relaxed and aqcuire > version of xchg barries are used. > > The similar is done for cmpxchg. > > If something will be needed to change in emulation loop it won't > require to change 4- and 8-byte cases. I'm afraid I don't understand your reply. Jan
On Mon, 2024-02-26 at 10:45 +0100, Jan Beulich wrote: > On 23.02.2024 13:23, Oleksii wrote: > > > > > > > > > As 1- and 2-byte cases are emulated I decided that is not > > > > > > to > > > > > > provide > > > > > > sfx argument for emulation macros as it will not have to > > > > > > much > > > > > > affect on > > > > > > emulated types and just consume more performance on acquire > > > > > > and > > > > > > release > > > > > > version of sc/ld instructions. > > > > > > > > > > Question is whether the common case (4- and 8-byte accesses) > > > > > shouldn't > > > > > be valued higher, with 1- and 2-byte emulation being there > > > > > just > > > > > to > > > > > allow things to not break altogether. > > > > If I understand you correctly, it would make sense to add the > > > > 'sfx' > > > > argument for the 1/2-byte access case, ensuring that all > > > > options > > > > are > > > > available for 1/2-byte access case as well. > > > > > > That's one of the possibilities. As said, I'm not overly worried > > > about > > > the emulated cases. For the initial implementation I'd recommend > > > going > > > with what is easiest there, yielding the best possible result for > > > the > > > 4- and 8-byte cases. If later it turns out repeated > > > acquire/release > > > accesses are a problem in the emulation loop, things can be > > > changed > > > to explicit barriers, without touching the 4- and 8-byte cases. > > I am confused then a little bit if emulated case is not an issue. > > > > For 4- and 8-byte cases for xchg .aqrl is used, for relaxed and > > aqcuire > > version of xchg barries are used. > > > > The similar is done for cmpxchg. > > > > If something will be needed to change in emulation loop it won't > > require to change 4- and 8-byte cases. > > I'm afraid I don't understand your reply. IIUC, emulated cases it is implemented correctly in terms of usage barriers. And it also OK not to use sfx for lr/sc instructions and use only barriers. For 4- and 8-byte cases are used sfx + barrier depending on the specific case ( relaxed, acquire, release, generic xchg/cmpxchg ). What also looks to me correct. But you suggested to provide the best possible result for 4- and 8-byte cases. So I don't understand what the best possible result is as the current one usage of __{cmp}xchg_generic for each specific case ( relaxed, acquire, release, generic xchg/cmpxchg ) looks correct to me: xchg -> (..., ".aqrl", "", "") just suffix .aqrl suffix without barriers. xchg_release -> (..., "", RISCV_RELEASE_BARRIER, "" ) use only release barrier xchg_acquire -> (..., "", "", RISCV_ACQUIRE_BARRIER ), only acquire barrier xchg_relaxed -> (..., "", "", "") - no barries, no sfx The similar for cmpxchg(). ~ Oleksii
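For reference, a sketch of what the four 4-byte variants enumerated above
come out as at the instruction level, assuming RISCV_RELEASE_BARRIER and
RISCV_ACQUIRE_BARRIER expand to "fence rw, w" and "fence r, rw" as in
Linux; registers are illustrative only:

    # xchg(): fully ordered via the suffix alone
    amoswap.w.aqrl  t0, a1, (a0)

    # xchg_release(): explicit release barrier, plain AMO
    fence rw, w
    amoswap.w       t0, a1, (a0)

    # xchg_acquire(): plain AMO, explicit acquire barrier
    amoswap.w       t0, a1, (a0)
    fence r, rw

    # xchg_relaxed(): no ordering at all
    amoswap.w       t0, a1, (a0)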
On 26.02.2024 12:18, Oleksii wrote: > On Mon, 2024-02-26 at 10:45 +0100, Jan Beulich wrote: >> On 23.02.2024 13:23, Oleksii wrote: >>>> >>>>>>> As 1- and 2-byte cases are emulated I decided that is not >>>>>>> to >>>>>>> provide >>>>>>> sfx argument for emulation macros as it will not have to >>>>>>> much >>>>>>> affect on >>>>>>> emulated types and just consume more performance on acquire >>>>>>> and >>>>>>> release >>>>>>> version of sc/ld instructions. >>>>>> >>>>>> Question is whether the common case (4- and 8-byte accesses) >>>>>> shouldn't >>>>>> be valued higher, with 1- and 2-byte emulation being there >>>>>> just >>>>>> to >>>>>> allow things to not break altogether. >>>>> If I understand you correctly, it would make sense to add the >>>>> 'sfx' >>>>> argument for the 1/2-byte access case, ensuring that all >>>>> options >>>>> are >>>>> available for 1/2-byte access case as well. >>>> >>>> That's one of the possibilities. As said, I'm not overly worried >>>> about >>>> the emulated cases. For the initial implementation I'd recommend >>>> going >>>> with what is easiest there, yielding the best possible result for >>>> the >>>> 4- and 8-byte cases. If later it turns out repeated >>>> acquire/release >>>> accesses are a problem in the emulation loop, things can be >>>> changed >>>> to explicit barriers, without touching the 4- and 8-byte cases. >>> I am confused then a little bit if emulated case is not an issue. >>> >>> For 4- and 8-byte cases for xchg .aqrl is used, for relaxed and >>> aqcuire >>> version of xchg barries are used. >>> >>> The similar is done for cmpxchg. >>> >>> If something will be needed to change in emulation loop it won't >>> require to change 4- and 8-byte cases. >> >> I'm afraid I don't understand your reply. > IIUC, emulated cases it is implemented correctly in terms of usage > barriers. And it also OK not to use sfx for lr/sc instructions and use > only barriers. > > For 4- and 8-byte cases are used sfx + barrier depending on the > specific case ( relaxed, acquire, release, generic xchg/cmpxchg ). > What also looks to me correct. But you suggested to provide the best > possible result for 4- and 8-byte cases. > > So I don't understand what the best possible result is as the current > one usage of __{cmp}xchg_generic for each specific case ( relaxed, > acquire, release, generic xchg/cmpxchg ) looks correct to me: > xchg -> (..., ".aqrl", "", "") just suffix .aqrl suffix without > barriers. > xchg_release -> (..., "", RISCV_RELEASE_BARRIER, "" ) use only release > barrier > xchg_acquire -> (..., "", "", RISCV_ACQUIRE_BARRIER ), only acquire > barrier > xchg_relaxed -> (..., "", "", "") - no barries, no sfx So first: While explicit barriers are technically okay, I don't follow why you insist on using them when you can achieve the same by suitably tagging the actual insn doing the exchange. Then second: It's somewhat hard for me to see the final effect on the emulation paths without you actually having done the switch. Maybe no special handling is necessary there anymore then. And as said, it may actually be acceptable for the emulation paths to "only" be correct, but not be ideal in terms of performance. After all, if you use the normal 4-byte primitive in there, more (non-explicit) barriers than needed would occur if the involved loop has to take more than one iteration. Which could (but imo doesn't need to be) avoided by using a more relaxed 4-byte primitive there and an explicit barrier outside of the loop. Jan
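A hedged sketch of the alternative outlined above for the emulation path:
keep the loop itself relaxed and pay for ordering once, outside it.
cmpxchg_relaxed() is the 4-byte primitive from this patch; the helper
name, the use of smp_mb(), and ACCESS_ONCE() are assumptions made for
illustration, not part of the series:

static inline uint8_t emulate_xchg_u8(volatile uint8_t *ptr, uint8_t new)
{
    uint32_t *word = (uint32_t *)((unsigned long)ptr & ~3UL);
    unsigned int shift = ((unsigned long)ptr & 3) * 8;
    uint32_t mask = 0xffU << shift;
    uint32_t old, upd;

    smp_mb();                       /* one full barrier before the loop */

    do {
        old = ACCESS_ONCE(*word);
        upd = (old & ~mask) | ((uint32_t)new << shift);
    } while ( cmpxchg_relaxed(word, old, upd) != old );

    smp_mb();                       /* one full barrier after the loop */

    return (old & mask) >> shift;
}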
On Mon, 2024-02-26 at 12:28 +0100, Jan Beulich wrote: > On 26.02.2024 12:18, Oleksii wrote: > > On Mon, 2024-02-26 at 10:45 +0100, Jan Beulich wrote: > > > On 23.02.2024 13:23, Oleksii wrote: > > > > > > > > > > > > > As 1- and 2-byte cases are emulated I decided that is > > > > > > > > not > > > > > > > > to > > > > > > > > provide > > > > > > > > sfx argument for emulation macros as it will not have > > > > > > > > to > > > > > > > > much > > > > > > > > affect on > > > > > > > > emulated types and just consume more performance on > > > > > > > > acquire > > > > > > > > and > > > > > > > > release > > > > > > > > version of sc/ld instructions. > > > > > > > > > > > > > > Question is whether the common case (4- and 8-byte > > > > > > > accesses) > > > > > > > shouldn't > > > > > > > be valued higher, with 1- and 2-byte emulation being > > > > > > > there > > > > > > > just > > > > > > > to > > > > > > > allow things to not break altogether. > > > > > > If I understand you correctly, it would make sense to add > > > > > > the > > > > > > 'sfx' > > > > > > argument for the 1/2-byte access case, ensuring that all > > > > > > options > > > > > > are > > > > > > available for 1/2-byte access case as well. > > > > > > > > > > That's one of the possibilities. As said, I'm not overly > > > > > worried > > > > > about > > > > > the emulated cases. For the initial implementation I'd > > > > > recommend > > > > > going > > > > > with what is easiest there, yielding the best possible result > > > > > for > > > > > the > > > > > 4- and 8-byte cases. If later it turns out repeated > > > > > acquire/release > > > > > accesses are a problem in the emulation loop, things can be > > > > > changed > > > > > to explicit barriers, without touching the 4- and 8-byte > > > > > cases. > > > > I am confused then a little bit if emulated case is not an > > > > issue. > > > > > > > > For 4- and 8-byte cases for xchg .aqrl is used, for relaxed and > > > > aqcuire > > > > version of xchg barries are used. > > > > > > > > The similar is done for cmpxchg. > > > > > > > > If something will be needed to change in emulation loop it > > > > won't > > > > require to change 4- and 8-byte cases. > > > > > > I'm afraid I don't understand your reply. > > IIUC, emulated cases it is implemented correctly in terms of usage > > barriers. And it also OK not to use sfx for lr/sc instructions and > > use > > only barriers. > > > > For 4- and 8-byte cases are used sfx + barrier depending on the > > specific case ( relaxed, acquire, release, generic xchg/cmpxchg ). > > What also looks to me correct. But you suggested to provide the > > best > > possible result for 4- and 8-byte cases. > > > > So I don't understand what the best possible result is as the > > current > > one usage of __{cmp}xchg_generic for each specific case ( relaxed, > > acquire, release, generic xchg/cmpxchg ) looks correct to me: > > xchg -> (..., ".aqrl", "", "") just suffix .aqrl suffix without > > barriers. > > xchg_release -> (..., "", RISCV_RELEASE_BARRIER, "" ) use only > > release > > barrier > > xchg_acquire -> (..., "", "", RISCV_ACQUIRE_BARRIER ), only acquire > > barrier > > xchg_relaxed -> (..., "", "", "") - no barries, no sfx > > So first: While explicit barriers are technically okay, I don't > follow why > you insist on using them when you can achieve the same by suitably > tagging > the actual insn doing the exchange. 
Then second: It's somewhat hard > for me > to see the final effect on the emulation paths without you actually > having > done the switch. Maybe no special handling is necessary there anymore > then. And as said, it may actually be acceptable for the emulation > paths > to "only" be correct, but not be ideal in terms of performance. After > all, > if you use the normal 4-byte primitive in there, more (non-explicit) > barriers than needed would occur if the involved loop has to take > more > than one iteration. Which could (but imo doesn't need to be) avoided > by > using a more relaxed 4-byte primitive there and an explicit barrier > outside of the loop. According to the spec: Table A.5 ( part of the table only I copied here ) Linux Construct RVWMO Mapping atomic <op> relaxed amo<op>.{w|d} atomic <op> acquire amo<op>.{w|d}.aq atomic <op> release amo<op>.{w|d}.rl atomic <op> amo<op>.{w|d}.aqrl Linux Construct RVWMO LR/SC Mapping atomic <op> relaxed loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop atomic <op> acquire loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez loop atomic <op> release loop: lr.{w|d}; <op>; sc.{w|d}.aqrl∗ ; bnez loop OR fence.tso; loop: lr.{w|d}; <op>; sc.{w|d}∗ ; bnez loop atomic <op> loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez loop The Linux mappings for release operations may seem stronger than necessary, but these mappings are needed to cover some cases in which Linux requires stronger orderings than the more intuitive mappings would provide. In particular, as of the time this text is being written, Linux is actively debating whether to require load-load, load-store, and store-store orderings between accesses in one critical section and accesses in a subsequent critical section in the same hart and protected by the same synchronization object. Not all combinations of FENCE RW,W/FENCE R,RW mappings with aq/rl mappings combine to provide such orderings. There are a few ways around this problem, including: 1. Always use FENCE RW,W/FENCE R,RW, and never use aq/rl. This suffices but is undesir- able, as it defeats the purpose of the aq/rl modifiers. 2. Always use aq/rl, and never use FENCE RW,W/FENCE R,RW. This does not currently work due to the lack of load and store opcodes with aq and rl modifiers. 3. Strengthen the mappings of release operations such that they would enforce sufficient order- ings in the presence of either type of acquire mapping. This is the currently-recommended solution, and the one shown in Table A.5. Based on this it is enough in our case use only suffixed istructions (amo<op>.{w|d}{.aq, .rl, .aqrl, .aqrl }, lr.{w|d}.{.aq, .aqrl }. But as far as I understand in Linux atomics were strengthen with fences: Atomics present the same issue with locking: release and acquire variants need to be strengthened to meet the constraints defined by the Linux-kernel memory consistency model [1]. Atomics present a further issue: implementations of atomics such as atomic_cmpxchg() and atomic_add_unless() rely on LR/SC pairs, which do not give full-ordering with .aqrl; for example, current implementations allow the "lr-sc-aqrl-pair-vs-full-barrier" test below to end up with the state indicated in the "exists" clause. In order to "synchronize" LKMM and RISC-V's implementation, this commit strengthens the implementations of the atomics operations by replacing .rl and .aq with the use of ("lightweigth") fences, and by replacing .aqrl LR/SC pairs in sequences such as: 0: lr.w.aqrl %0, %addr bne %0, %old, 1f ... 
sc.w.aqrl %1, %new, %addr bnez %1, 0b 1: with sequences of the form: 0: lr.w %0, %addr bne %0, %old, 1f ... sc.w.rl %1, %new, %addr /* SC-release */ bnez %1, 0b fence rw, rw /* "full" fence */ 1: following Daniel's suggestion. These modifications were validated with simulation of the RISC-V with sequences of the form: 0: lr.w %0, %addr bne %0, %old, 1f ... sc.w.rl %1, %new, %addr /* SC-release */ bnez %1, 0b fence rw, rw /* "full" fence */ 1: following Daniel's suggestion. These modifications were validated with simulation of the RISC-V memory consistency model. C lr-sc-aqrl-pair-vs-full-barrier {} P0(int *x, int *y, atomic_t *u) { int r0; int r1; WRITE_ONCE(*x, 1); r0 = atomic_cmpxchg(u, 0, 1); r1 = READ_ONCE(*y); } P1(int *x, int *y, atomic_t *v) { int r0; int r1; WRITE_ONCE(*y, 1); r0 = atomic_cmpxchg(v, 0, 1); r1 = READ_ONCE(*x); } exists (u=1 /\ v=1 /\ 0:r1=0 /\ 1:r1=0) [1] https://marc.info/?l=linux-kernel&m=151930201102853&w=2 https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/hKywNHBkAXM https://marc.info/?l=linux-kernel&m=151633436614259&w=2 Thereby Linux kernel implementation seems to me more safe and it is a reason why I want/wanted to be aligned with it. ~ Oleksii
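Expressed in the style of the header under review, the strengthened LR/SC
mapping quoted above (SC-release plus a trailing full fence) would look
roughly as below for a 4-byte cmpxchg; this is a sketch of the quoted
Linux pattern, not what the patch currently generates:

static inline uint32_t cmpxchg_u32_fullfence(volatile uint32_t *ptr,
                                             uint32_t old, uint32_t new)
{
    uint32_t ret, rc;

    asm volatile (
        "0: lr.w    %0, %2\n"
        "   bne     %0, %z3, 1f\n"
        "   sc.w.rl %1, %z4, %2\n"   /* SC-release */
        "   bnez    %1, 0b\n"
        "   fence   rw, rw\n"        /* trailing full fence */
        "1:\n"
        : "=&r" (ret), "=&r" (rc), "+A" (*ptr)
        : "rJ" (old), "rJ" (new)
        : "memory" );

    return ret;
}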
On 26.02.2024 13:58, Oleksii wrote: > On Mon, 2024-02-26 at 12:28 +0100, Jan Beulich wrote: >> On 26.02.2024 12:18, Oleksii wrote: >>> On Mon, 2024-02-26 at 10:45 +0100, Jan Beulich wrote: >>>> On 23.02.2024 13:23, Oleksii wrote: >>>>>> >>>>>>>>> As 1- and 2-byte cases are emulated I decided that is >>>>>>>>> not >>>>>>>>> to >>>>>>>>> provide >>>>>>>>> sfx argument for emulation macros as it will not have >>>>>>>>> to >>>>>>>>> much >>>>>>>>> affect on >>>>>>>>> emulated types and just consume more performance on >>>>>>>>> acquire >>>>>>>>> and >>>>>>>>> release >>>>>>>>> version of sc/ld instructions. >>>>>>>> >>>>>>>> Question is whether the common case (4- and 8-byte >>>>>>>> accesses) >>>>>>>> shouldn't >>>>>>>> be valued higher, with 1- and 2-byte emulation being >>>>>>>> there >>>>>>>> just >>>>>>>> to >>>>>>>> allow things to not break altogether. >>>>>>> If I understand you correctly, it would make sense to add >>>>>>> the >>>>>>> 'sfx' >>>>>>> argument for the 1/2-byte access case, ensuring that all >>>>>>> options >>>>>>> are >>>>>>> available for 1/2-byte access case as well. >>>>>> >>>>>> That's one of the possibilities. As said, I'm not overly >>>>>> worried >>>>>> about >>>>>> the emulated cases. For the initial implementation I'd >>>>>> recommend >>>>>> going >>>>>> with what is easiest there, yielding the best possible result >>>>>> for >>>>>> the >>>>>> 4- and 8-byte cases. If later it turns out repeated >>>>>> acquire/release >>>>>> accesses are a problem in the emulation loop, things can be >>>>>> changed >>>>>> to explicit barriers, without touching the 4- and 8-byte >>>>>> cases. >>>>> I am confused then a little bit if emulated case is not an >>>>> issue. >>>>> >>>>> For 4- and 8-byte cases for xchg .aqrl is used, for relaxed and >>>>> aqcuire >>>>> version of xchg barries are used. >>>>> >>>>> The similar is done for cmpxchg. >>>>> >>>>> If something will be needed to change in emulation loop it >>>>> won't >>>>> require to change 4- and 8-byte cases. >>>> >>>> I'm afraid I don't understand your reply. >>> IIUC, emulated cases it is implemented correctly in terms of usage >>> barriers. And it also OK not to use sfx for lr/sc instructions and >>> use >>> only barriers. >>> >>> For 4- and 8-byte cases are used sfx + barrier depending on the >>> specific case ( relaxed, acquire, release, generic xchg/cmpxchg ). >>> What also looks to me correct. But you suggested to provide the >>> best >>> possible result for 4- and 8-byte cases. >>> >>> So I don't understand what the best possible result is as the >>> current >>> one usage of __{cmp}xchg_generic for each specific case ( relaxed, >>> acquire, release, generic xchg/cmpxchg ) looks correct to me: >>> xchg -> (..., ".aqrl", "", "") just suffix .aqrl suffix without >>> barriers. >>> xchg_release -> (..., "", RISCV_RELEASE_BARRIER, "" ) use only >>> release >>> barrier >>> xchg_acquire -> (..., "", "", RISCV_ACQUIRE_BARRIER ), only acquire >>> barrier >>> xchg_relaxed -> (..., "", "", "") - no barries, no sfx >> >> So first: While explicit barriers are technically okay, I don't >> follow why >> you insist on using them when you can achieve the same by suitably >> tagging >> the actual insn doing the exchange. Then second: It's somewhat hard >> for me >> to see the final effect on the emulation paths without you actually >> having >> done the switch. Maybe no special handling is necessary there anymore >> then. 
And as said, it may actually be acceptable for the emulation >> paths >> to "only" be correct, but not be ideal in terms of performance. After >> all, >> if you use the normal 4-byte primitive in there, more (non-explicit) >> barriers than needed would occur if the involved loop has to take >> more >> than one iteration. Which could (but imo doesn't need to be) avoided >> by >> using a more relaxed 4-byte primitive there and an explicit barrier >> outside of the loop. > > According to the spec: > Table A.5 ( part of the table only I copied here ) > > Linux Construct RVWMO Mapping > atomic <op> relaxed amo<op>.{w|d} > atomic <op> acquire amo<op>.{w|d}.aq > atomic <op> release amo<op>.{w|d}.rl > atomic <op> amo<op>.{w|d}.aqrl > > Linux Construct RVWMO LR/SC Mapping > atomic <op> relaxed loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop > atomic <op> acquire loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez loop > atomic <op> release loop: lr.{w|d}; <op>; sc.{w|d}.aqrl∗ ; bnez > loop OR > fence.tso; loop: lr.{w|d}; <op>; sc.{w|d}∗ ; > bnez loop > atomic <op> loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez > loop In your consideration what to implement you will want to limit things to constructs we actually use. I can't find any use of the relaxed, acquire, or release forms of atomics as mentioned above. > The Linux mappings for release operations may seem stronger than > necessary, but these mappings > are needed to cover some cases in which Linux requires stronger > orderings than the more intuitive > mappings would provide. In particular, as of the time this text is > being written, Linux is actively > debating whether to require load-load, load-store, and store-store > orderings between accesses in one > critical section and accesses in a subsequent critical section in the > same hart and protected by the > same synchronization object. Not all combinations of FENCE RW,W/FENCE > R,RW mappings > with aq/rl mappings combine to provide such orderings. There are a few > ways around this problem, > including: > 1. Always use FENCE RW,W/FENCE R,RW, and never use aq/rl. This suffices > but is undesir- > able, as it defeats the purpose of the aq/rl modifiers. > 2. Always use aq/rl, and never use FENCE RW,W/FENCE R,RW. This does not > currently work > due to the lack of load and store opcodes with aq and rl modifiers. I don't understand this point: Which specific load and/or store forms are missing? According to my reading of the A extension spec all combination of aq/rl exist with both lr and sc. > 3. Strengthen the mappings of release operations such that they would > enforce sufficient order- > ings in the presence of either type of acquire mapping. This is the > currently-recommended > solution, and the one shown in Table A.5. > > > Based on this it is enough in our case use only suffixed istructions > (amo<op>.{w|d}{.aq, .rl, .aqrl, .aqrl }, lr.{w|d}.{.aq, .aqrl }. > > > But as far as I understand in Linux atomics were strengthen with > fences: > Atomics present the same issue with locking: release and acquire > variants need to be strengthened to meet the constraints defined > by the Linux-kernel memory consistency model [1]. > > Atomics present a further issue: implementations of atomics such > as atomic_cmpxchg() and atomic_add_unless() rely on LR/SC pairs, > which do not give full-ordering with .aqrl; for example, current > implementations allow the "lr-sc-aqrl-pair-vs-full-barrier" test > below to end up with the state indicated in the "exists" clause. 
> > In order to "synchronize" LKMM and RISC-V's implementation, this > commit strengthens the implementations of the atomics operations > by replacing .rl and .aq with the use of ("lightweigth") fences, > and by replacing .aqrl LR/SC pairs in sequences such as: > > 0: lr.w.aqrl %0, %addr > bne %0, %old, 1f > ... > sc.w.aqrl %1, %new, %addr > bnez %1, 0b > 1: > > with sequences of the form: > > 0: lr.w %0, %addr > bne %0, %old, 1f > ... > sc.w.rl %1, %new, %addr /* SC-release */ > bnez %1, 0b > fence rw, rw /* "full" fence */ > 1: > > following Daniel's suggestion. I'm likely missing something, yet as it looks it does help that the code fragment above appears to be ... > These modifications were validated with simulation of the RISC-V > with sequences of the form: > > 0: lr.w %0, %addr > bne %0, %old, 1f > ... > sc.w.rl %1, %new, %addr /* SC-release */ > bnez %1, 0b > fence rw, rw /* "full" fence */ > 1: > > following Daniel's suggestion. ... entirely the same as this one. Yet there's presumably a reason for quoting it twice? > These modifications were validated with simulation of the RISC-V > memory consistency model. > > C lr-sc-aqrl-pair-vs-full-barrier > > {} > > P0(int *x, int *y, atomic_t *u) > { > int r0; > int r1; > > WRITE_ONCE(*x, 1); > r0 = atomic_cmpxchg(u, 0, 1); > r1 = READ_ONCE(*y); > } > > P1(int *x, int *y, atomic_t *v) > { > int r0; > int r1; > > WRITE_ONCE(*y, 1); > r0 = atomic_cmpxchg(v, 0, 1); > r1 = READ_ONCE(*x); > } > > exists (u=1 /\ v=1 /\ 0:r1=0 /\ 1:r1=0) > > [1] https://marc.info/?l=linux-kernel&m=151930201102853&w=2 > > https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/hKywNHBkAXM > https://marc.info/?l=linux-kernel&m=151633436614259&w=2 > > > Thereby Linux kernel implementation seems to me more safe and it is a > reason why I want/wanted to be aligned with it. Which may end up being okay. I hope you realize though that there's a lot more explanation needed in the respective commits then compared to what you've had so far. As a minimum, absolutely anything remotely unexpected needs to be explained. Jan
On Mon, 2024-02-26 at 15:20 +0100, Jan Beulich wrote: > On 26.02.2024 13:58, Oleksii wrote: > > On Mon, 2024-02-26 at 12:28 +0100, Jan Beulich wrote: > > > On 26.02.2024 12:18, Oleksii wrote: > > > > On Mon, 2024-02-26 at 10:45 +0100, Jan Beulich wrote: > > > > > On 23.02.2024 13:23, Oleksii wrote: > > > > > > > > > > > > > > > > > As 1- and 2-byte cases are emulated I decided that > > > > > > > > > > is > > > > > > > > > > not > > > > > > > > > > to > > > > > > > > > > provide > > > > > > > > > > sfx argument for emulation macros as it will not > > > > > > > > > > have > > > > > > > > > > to > > > > > > > > > > much > > > > > > > > > > affect on > > > > > > > > > > emulated types and just consume more performance on > > > > > > > > > > acquire > > > > > > > > > > and > > > > > > > > > > release > > > > > > > > > > version of sc/ld instructions. > > > > > > > > > > > > > > > > > > Question is whether the common case (4- and 8-byte > > > > > > > > > accesses) > > > > > > > > > shouldn't > > > > > > > > > be valued higher, with 1- and 2-byte emulation being > > > > > > > > > there > > > > > > > > > just > > > > > > > > > to > > > > > > > > > allow things to not break altogether. > > > > > > > > If I understand you correctly, it would make sense to > > > > > > > > add > > > > > > > > the > > > > > > > > 'sfx' > > > > > > > > argument for the 1/2-byte access case, ensuring that > > > > > > > > all > > > > > > > > options > > > > > > > > are > > > > > > > > available for 1/2-byte access case as well. > > > > > > > > > > > > > > That's one of the possibilities. As said, I'm not overly > > > > > > > worried > > > > > > > about > > > > > > > the emulated cases. For the initial implementation I'd > > > > > > > recommend > > > > > > > going > > > > > > > with what is easiest there, yielding the best possible > > > > > > > result > > > > > > > for > > > > > > > the > > > > > > > 4- and 8-byte cases. If later it turns out repeated > > > > > > > acquire/release > > > > > > > accesses are a problem in the emulation loop, things can > > > > > > > be > > > > > > > changed > > > > > > > to explicit barriers, without touching the 4- and 8-byte > > > > > > > cases. > > > > > > I am confused then a little bit if emulated case is not an > > > > > > issue. > > > > > > > > > > > > For 4- and 8-byte cases for xchg .aqrl is used, for relaxed > > > > > > and > > > > > > aqcuire > > > > > > version of xchg barries are used. > > > > > > > > > > > > The similar is done for cmpxchg. > > > > > > > > > > > > If something will be needed to change in emulation loop it > > > > > > won't > > > > > > require to change 4- and 8-byte cases. > > > > > > > > > > I'm afraid I don't understand your reply. > > > > IIUC, emulated cases it is implemented correctly in terms of > > > > usage > > > > barriers. And it also OK not to use sfx for lr/sc instructions > > > > and > > > > use > > > > only barriers. > > > > > > > > For 4- and 8-byte cases are used sfx + barrier depending on the > > > > specific case ( relaxed, acquire, release, generic xchg/cmpxchg > > > > ). > > > > What also looks to me correct. But you suggested to provide the > > > > best > > > > possible result for 4- and 8-byte cases. > > > > > > > > So I don't understand what the best possible result is as the > > > > current > > > > one usage of __{cmp}xchg_generic for each specific case ( > > > > relaxed, > > > > acquire, release, generic xchg/cmpxchg ) looks correct to me: > > > > xchg -> (..., ".aqrl", "", "") just suffix .aqrl suffix without > > > > barriers. 
> > > > xchg_release -> (..., "", RISCV_RELEASE_BARRIER, "" ) use only > > > > release > > > > barrier > > > > xchg_acquire -> (..., "", "", RISCV_ACQUIRE_BARRIER ), only > > > > acquire > > > > barrier > > > > xchg_relaxed -> (..., "", "", "") - no barries, no sfx > > > > > > So first: While explicit barriers are technically okay, I don't > > > follow why > > > you insist on using them when you can achieve the same by > > > suitably > > > tagging > > > the actual insn doing the exchange. Then second: It's somewhat > > > hard > > > for me > > > to see the final effect on the emulation paths without you > > > actually > > > having > > > done the switch. Maybe no special handling is necessary there > > > anymore > > > then. And as said, it may actually be acceptable for the > > > emulation > > > paths > > > to "only" be correct, but not be ideal in terms of performance. > > > After > > > all, > > > if you use the normal 4-byte primitive in there, more (non- > > > explicit) > > > barriers than needed would occur if the involved loop has to take > > > more > > > than one iteration. Which could (but imo doesn't need to be) > > > avoided > > > by > > > using a more relaxed 4-byte primitive there and an explicit > > > barrier > > > outside of the loop. > > > > According to the spec: > > Table A.5 ( part of the table only I copied here ) > > > > Linux Construct RVWMO Mapping > > atomic <op> relaxed amo<op>.{w|d} > > atomic <op> acquire amo<op>.{w|d}.aq > > atomic <op> release amo<op>.{w|d}.rl > > atomic <op> amo<op>.{w|d}.aqrl > > > > Linux Construct RVWMO LR/SC Mapping > > atomic <op> relaxed loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop > > atomic <op> acquire loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez > > loop > > atomic <op> release loop: lr.{w|d}; <op>; sc.{w|d}.aqrl∗ ; > > bnez > > loop OR > > fence.tso; loop: lr.{w|d}; <op>; > > sc.{w|d}∗ ; > > bnez loop > > atomic <op> loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; > > bnez > > loop > > In your consideration what to implement you will want to limit > things to constructs we actually use. I can't find any use of the > relaxed, acquire, or release forms of atomics as mentioned above. > > > The Linux mappings for release operations may seem stronger than > > necessary, but these mappings > > are needed to cover some cases in which Linux requires stronger > > orderings than the more intuitive > > mappings would provide. In particular, as of the time this text is > > being written, Linux is actively > > debating whether to require load-load, load-store, and store-store > > orderings between accesses in one > > critical section and accesses in a subsequent critical section in > > the > > same hart and protected by the > > same synchronization object. Not all combinations of FENCE > > RW,W/FENCE > > R,RW mappings > > with aq/rl mappings combine to provide such orderings. There are a > > few > > ways around this problem, > > including: > > 1. Always use FENCE RW,W/FENCE R,RW, and never use aq/rl. This > > suffices > > but is undesir- > > able, as it defeats the purpose of the aq/rl modifiers. > > 2. Always use aq/rl, and never use FENCE RW,W/FENCE R,RW. This does > > not > > currently work > > due to the lack of load and store opcodes with aq and rl modifiers. > > I don't understand this point: Which specific load and/or store forms > are missing? According to my reading of the A extension spec all > combination of aq/rl exist with both lr and sc. I think this is not about lr and sc instructions. 
It is about l{b|h|w|d} and s{b|h|w|d}, which should be used with fence in case of acquire and seq_cst. > > > 3. Strengthen the mappings of release operations such that they > > would > > enforce sufficient order- > > ings in the presence of either type of acquire mapping. This is the > > currently-recommended > > solution, and the one shown in Table A.5. > > > > > > Based on this it is enough in our case use only suffixed > > istructions > > (amo<op>.{w|d}{.aq, .rl, .aqrl, .aqrl }, lr.{w|d}.{.aq, .aqrl }. > > > > > > But as far as I understand in Linux atomics were strengthen with > > fences: > > Atomics present the same issue with locking: release and > > acquire > > variants need to be strengthened to meet the constraints > > defined > > by the Linux-kernel memory consistency model [1]. > > > > Atomics present a further issue: implementations of atomics > > such > > as atomic_cmpxchg() and atomic_add_unless() rely on LR/SC > > pairs, > > which do not give full-ordering with .aqrl; for example, > > current > > implementations allow the "lr-sc-aqrl-pair-vs-full-barrier" > > test > > below to end up with the state indicated in the "exists" > > clause. > > > > In order to "synchronize" LKMM and RISC-V's implementation, > > this > > commit strengthens the implementations of the atomics > > operations > > by replacing .rl and .aq with the use of ("lightweigth") > > fences, > > and by replacing .aqrl LR/SC pairs in sequences such as: > > > > 0: lr.w.aqrl %0, %addr > > bne %0, %old, 1f > > ... > > sc.w.aqrl %1, %new, %addr > > bnez %1, 0b > > 1: > > > > with sequences of the form: > > > > 0: lr.w %0, %addr > > bne %0, %old, 1f > > ... > > sc.w.rl %1, %new, %addr /* SC-release */ > > bnez %1, 0b > > fence rw, rw /* "full" fence */ > > 1: > > > > following Daniel's suggestion. > > I'm likely missing something, yet as it looks it does help that the > code fragment above appears to be ... > > > These modifications were validated with simulation of the RISC- > > V > > with sequences of the form: > > > > 0: lr.w %0, %addr > > bne %0, %old, 1f > > ... > > sc.w.rl %1, %new, %addr /* SC-release */ > > bnez %1, 0b > > fence rw, rw /* "full" fence */ > > 1: > > > > following Daniel's suggestion. > > ... entirely the same as this one. Yet there's presumably a reason > for quoting it twice? I think it was done by accident ~ Oleksii > > > These modifications were validated with simulation of the RISC- > > V > > memory consistency model. > > > > C lr-sc-aqrl-pair-vs-full-barrier > > > > {} > > > > P0(int *x, int *y, atomic_t *u) > > { > > int r0; > > int r1; > > > > WRITE_ONCE(*x, 1); > > r0 = atomic_cmpxchg(u, 0, 1); > > r1 = READ_ONCE(*y); > > } > > > > P1(int *x, int *y, atomic_t *v) > > { > > int r0; > > int r1; > > > > WRITE_ONCE(*y, 1); > > r0 = atomic_cmpxchg(v, 0, 1); > > r1 = READ_ONCE(*x); > > } > > > > exists (u=1 /\ v=1 /\ 0:r1=0 /\ 1:r1=0) > > > > [1] https://marc.info/?l=linux-kernel&m=151930201102853&w=2 > > > > https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/hKywNHBkAXM > > https://marc.info/?l=linux-kernel&m=151633436614259&w=2 > > > > > > Thereby Linux kernel implementation seems to me more safe and it is > > a > > reason why I want/wanted to be aligned with it. > > Which may end up being okay. I hope you realize though that there's a > lot more explanation needed in the respective commits then compared > to > what you've had so far. As a minimum, absolutely anything remotely > unexpected needs to be explained. > > Jan
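To make the point about plain loads and stores concrete: since
l{b|h|w|d}/s{b|h|w|d} have no .aq/.rl forms, an acquire load has to be
built with a trailing fence, roughly as in the sketch below (the helper
name is made up):

static inline uint32_t read_u32_acquire(const volatile uint32_t *ptr)
{
    uint32_t val = *ptr;

    /* RVWMO load-acquire mapping: plain load followed by "fence r, rw". */
    asm volatile ( "fence r, rw" ::: "memory" );

    return val;
}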
diff --git a/xen/arch/riscv/include/asm/cmpxchg.h b/xen/arch/riscv/include/asm/cmpxchg.h new file mode 100644 index 0000000000..b751a50cbf --- /dev/null +++ b/xen/arch/riscv/include/asm/cmpxchg.h @@ -0,0 +1,237 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* Copyright (C) 2014 Regents of the University of California */ + +#ifndef _ASM_RISCV_CMPXCHG_H +#define _ASM_RISCV_CMPXCHG_H + +#include <xen/compiler.h> +#include <xen/lib.h> + +#include <asm/fence.h> +#include <asm/io.h> +#include <asm/system.h> + +#define ALIGN_DOWN(addr, size) ((addr) & (~((size) - 1))) + +#define __amoswap_generic(ptr, new, ret, sfx, release_barrier, acquire_barrier) \ +({ \ + asm volatile( \ + release_barrier \ + " amoswap" sfx " %0, %2, %1\n" \ + acquire_barrier \ + : "=r" (ret), "+A" (*ptr) \ + : "r" (new) \ + : "memory" ); \ +}) + +#define emulate_xchg_1_2(ptr, new, ret, release_barrier, acquire_barrier) \ +({ \ + uint32_t *ptr_32b_aligned = (uint32_t *)ALIGN_DOWN((unsigned long)ptr, 4); \ + uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - sizeof(*ptr))) * BITS_PER_BYTE; \ + uint8_t mask_size = sizeof(*ptr) * BITS_PER_BYTE; \ + uint8_t mask_h = mask_l + mask_size - 1; \ + unsigned long mask = GENMASK(mask_h, mask_l); \ + unsigned long new_ = (unsigned long)(new) << mask_l; \ + unsigned long ret_; \ + unsigned long rc; \ + \ + asm volatile( \ + release_barrier \ + "0: lr.d %0, %2\n" \ + " and %1, %0, %z4\n" \ + " or %1, %1, %z3\n" \ + " sc.d %1, %1, %2\n" \ + " bnez %1, 0b\n" \ + acquire_barrier \ + : "=&r" (ret_), "=&r" (rc), "+A" (*ptr_32b_aligned) \ + : "rJ" (new_), "rJ" (~mask) \ + : "memory"); \ + \ + ret = (__typeof__(*(ptr)))((ret_ & mask) >> mask_l); \ +}) + +#define __xchg_generic(ptr, new, size, sfx, release_barrier, acquire_barrier) \ +({ \ + __typeof__(ptr) ptr__ = (ptr); \ + __typeof__(*(ptr)) new__ = (new); \ + __typeof__(*(ptr)) ret__; \ + switch (size) \ + { \ + case 1: \ + case 2: \ + emulate_xchg_1_2(ptr__, new__, ret__, release_barrier, acquire_barrier); \ + break; \ + case 4: \ + __amoswap_generic(ptr__, new__, ret__,\ + ".w" sfx, release_barrier, acquire_barrier); \ + break; \ + case 8: \ + __amoswap_generic(ptr__, new__, ret__,\ + ".d" sfx, release_barrier, acquire_barrier); \ + break; \ + default: \ + STATIC_ASSERT_UNREACHABLE(); \ + } \ + ret__; \ +}) + +#define xchg_relaxed(ptr, x) \ +({ \ + __typeof__(*(ptr)) x_ = (x); \ + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), "", "", ""); \ +}) + +#define xchg_acquire(ptr, x) \ +({ \ + __typeof__(*(ptr)) x_ = (x); \ + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), \ + "", "", RISCV_ACQUIRE_BARRIER); \ +}) + +#define xchg_release(ptr, x) \ +({ \ + __typeof__(*(ptr)) x_ = (x); \ + (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)),\ + "", RISCV_RELEASE_BARRIER, ""); \ +}) + +#define xchg(ptr,x) \ +({ \ + __typeof__(*(ptr)) ret__; \ + ret__ = (__typeof__(*(ptr))) \ + __xchg_generic(ptr, (unsigned long)(x), sizeof(*(ptr)), \ + ".aqrl", "", ""); \ + ret__; \ +}) + +#define __generic_cmpxchg(ptr, old, new, ret, lr_sfx, sc_sfx, release_barrier, acquire_barrier) \ + ({ \ + register unsigned int rc; \ + asm volatile( \ + release_barrier \ + "0: lr" lr_sfx " %0, %2\n" \ + " bne %0, %z3, 1f\n" \ + " sc" sc_sfx " %1, %z4, %2\n" \ + " bnez %1, 0b\n" \ + acquire_barrier \ + "1:\n" \ + : "=&r" (ret), "=&r" (rc), "+A" (*ptr) \ + : "rJ" (old), "rJ" (new) \ + : "memory"); \ + }) + +#define emulate_cmpxchg_1_2(ptr, old, new, ret, sc_sfx, release_barrier, acquire_barrier) \ +({ \ + uint32_t *ptr_32b_aligned = (uint32_t 
*)ALIGN_DOWN((unsigned long)ptr, 4); \ + uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - sizeof(*ptr))) * BITS_PER_BYTE; \ + uint8_t mask_size = sizeof(*ptr) * BITS_PER_BYTE; \ + uint8_t mask_h = mask_l + mask_size - 1; \ + unsigned long mask = GENMASK(mask_h, mask_l); \ + unsigned long old_ = (unsigned long)(old) << mask_l; \ + unsigned long new_ = (unsigned long)(new) << mask_l; \ + unsigned long ret_; \ + unsigned long rc; \ + \ + __asm__ __volatile__ ( \ + release_barrier \ + "0: lr.d %0, %2\n" \ + " and %1, %0, %z5\n" \ + " bne %1, %z3, 1f\n" \ + " and %1, %0, %z6\n" \ + " or %1, %1, %z4\n" \ + " sc.d" sc_sfx " %1, %1, %2\n" \ + " bnez %1, 0b\n" \ + acquire_barrier \ + "1:\n" \ + : "=&r" (ret_), "=&r" (rc), "+A" (*ptr_32b_aligned) \ + : "rJ" (old_), "rJ" (new_), \ + "rJ" (mask), "rJ" (~mask) \ + : "memory"); \ + \ + ret = (__typeof__(*(ptr)))((ret_ & mask) >> mask_l); \ +}) + +/* + * Atomic compare and exchange. Compare OLD with MEM, if identical, + * store NEW in MEM. Return the initial value in MEM. Success is + * indicated by comparing RETURN with OLD. + */ +#define __cmpxchg_generic(ptr, old, new, size, sc_sfx, release_barrier, acquire_barrier) \ +({ \ + __typeof__(ptr) ptr__ = (ptr); \ + __typeof__(*(ptr)) old__ = (__typeof__(*(ptr)))(old); \ + __typeof__(*(ptr)) new__ = (__typeof__(*(ptr)))(new); \ + __typeof__(*(ptr)) ret__; \ + switch (size) \ + { \ + case 1: \ + case 2: \ + emulate_cmpxchg_1_2(ptr, old, new, ret__,\ + sc_sfx, release_barrier, acquire_barrier); \ + break; \ + case 4: \ + __generic_cmpxchg(ptr__, old__, new__, ret__, \ + ".w", ".w"sc_sfx, release_barrier, acquire_barrier); \ + break; \ + case 8: \ + __generic_cmpxchg(ptr__, old__, new__, ret__, \ + ".d", ".d"sc_sfx, release_barrier, acquire_barrier); \ + break; \ + default: \ + STATIC_ASSERT_UNREACHABLE(); \ + } \ + ret__; \ +}) + +#define cmpxchg_relaxed(ptr, o, n) \ +({ \ + __typeof__(*(ptr)) o_ = (o); \ + __typeof__(*(ptr)) n_ = (n); \ + (__typeof__(*(ptr)))__cmpxchg_generic(ptr, \ + o_, n_, sizeof(*(ptr)), "", "", ""); \ +}) + +#define cmpxchg_acquire(ptr, o, n) \ +({ \ + __typeof__(*(ptr)) o_ = (o); \ + __typeof__(*(ptr)) n_ = (n); \ + (__typeof__(*(ptr)))__cmpxchg_generic(ptr, o_, n_, sizeof(*(ptr)), \ + "", "", RISCV_ACQUIRE_BARRIER); \ +}) + +#define cmpxchg_release(ptr, o, n) \ +({ \ + __typeof__(*(ptr)) o_ = (o); \ + __typeof__(*(ptr)) n_ = (n); \ + (__typeof__(*(ptr)))__cmpxchg_release(ptr, o_, n_, sizeof(*(ptr)), \ + "", RISCV_RELEASE_BARRIER, ""); \ +}) + +#define cmpxchg(ptr, o, n) \ +({ \ + __typeof__(*(ptr)) ret__; \ + ret__ = (__typeof__(*(ptr))) \ + __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned long)(n), \ + sizeof(*(ptr)), ".rl", "", " fence rw, rw\n"); \ + ret__; \ +}) + +#define __cmpxchg(ptr, o, n, s) \ +({ \ + __typeof__(*(ptr)) ret__; \ + ret__ = (__typeof__(*(ptr))) \ + __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned long)(n), \ + s, ".rl", "", " fence rw, rw\n"); \ + ret__; \ +}) + +#endif /* _ASM_RISCV_CMPXCHG_H */ + +/* + * Local variables: + * mode: C + * c-file-style: "BSD" + * c-basic-offset: 4 + * indent-tabs-mode: nil + * End: + */
The header was taken from Linux kernel 6.4.0-rc1. Additionally, the
following were updated:
* add emulation of {cmp}xchg for 1/2 byte types
* replace tabs with spaces
* replace __* variables with *__
* introduce generic versions of xchg_* and cmpxchg_*.

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V4:
 - Code style fixes.
 - enforce that __xchg_*() uses the same type for new and *ptr; also "\n"
   was removed at the end of asm instructions.
 - dependency from https://lore.kernel.org/xen-devel/cover.1706259490.git.federico.serafini@bugseng.com/
 - switch from ASSERT_UNREACHABLE to STATIC_ASSERT_UNREACHABLE().
 - drop xchg32(ptr, x) and xchg64(ptr, x) as they aren't used.
 - drop cmpxchg{32,64}_{local} as they aren't used.
 - introduce generic versions of xchg_* and cmpxchg_*.
 - update the commit message.
---
Changes in V3:
 - update the commit message.
 - add emulation of {cmp}xchg_... for 1- and 2-byte types.
---
Changes in V2:
 - update the comment at the top of the header.
 - change xen/lib.h to xen/bug.h.
 - sort inclusion of headers properly.
---
 xen/arch/riscv/include/asm/cmpxchg.h | 237 +++++++++++++++++++++++++++
 1 file changed, 237 insertions(+)
 create mode 100644 xen/arch/riscv/include/asm/cmpxchg.h
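For orientation, a hedged example of how callers would use the interfaces
this header introduces; the variable and the values are purely
illustrative:

#include <asm/cmpxchg.h>

static unsigned long state;

void example(void)
{
    unsigned long prev;

    /* Fully ordered swap: returns the previous value of state. */
    prev = xchg(&state, 1UL);

    /* Fully ordered compare-and-swap: stores 2 only if state is still 1. */
    prev = cmpxchg(&state, 1UL, 2UL);

    (void)prev;
}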