Message ID: 20240315134009.580167-8-ajones@ventanamicro.com
State:      Superseded
Series:     riscv: Apply Zawrs when available

Checks:
  conchuod/vmtest-fixes-PR: fail (merge-conflict)
> +config RISCV_ISA_ZAWRS
> +	bool "Zawrs extension support for more efficient busy waiting"
> +	depends on RISCV_ALTERNATIVE
> +	default y
> +	help
> +	  Enable the use of the Zawrs (wait for reservation set) extension
> +	  when available.
> +
> +	  The Zawrs extension instructions (wrs.nto and wrs.sto) are used for
> +	  more efficient busy waiting.

Maybe mention that this is about _power_-efficiency? In the discussion
of the previous iteration, I suggested [1]:

  The Zawrs extension defines a pair of instructions to be used in
  polling loops that allows a core to enter a low-power state and
  wait on a store to a memory location.

(from the Zawrs spec -- I remain open to reviewing other suggestions).

> +#define ZAWRS_WRS_NTO	".long 0x00d00073"
> +#define ZAWRS_WRS_STO	".long 0x01d00073"

In the discussion of the previous iteration, you observed [2]:

  I'd prefer we use insn-def.h to define instructions, rather than
  scatter .long's around, but since this instruction doesn't have
  any inputs, then I guess it's not so important to use insn-def.h.

So does that "preference" not apply to the instructions at stake? Or is
it just not "important"? No real objections to barrier.h; I'm only
trying to understand the rationale.

> +#define ALT_WRS_NTO()						\
> +	__asm__ __volatile__ (ALTERNATIVE(			\
> +		"nop\n", ZAWRS_WRS_NTO "\n",			\
> +		0, RISCV_ISA_EXT_ZAWRS, CONFIG_RISCV_ISA_ZAWRS))
> +#define ALT_WRS_STO()						\
> +	__asm__ __volatile__ (ALTERNATIVE(			\
> +		"nop\n", ZAWRS_WRS_STO "\n",			\
> +		0, RISCV_ISA_EXT_ZAWRS, CONFIG_RISCV_ISA_ZAWRS))
> +
>  #define RISCV_FENCE(p, s) \
>  	__asm__ __volatile__ ("fence " #p "," #s : : : "memory")

FYI, this hunk/patch conflicts with Eric's changes [3].

> +#define ___smp_load_reservedN(attr, ptr)			\
> +({								\
> +	typeof(*ptr) ___p1;					\
> +								\
> +	__asm__ __volatile__ ("lr." attr " %[p], %[c]\n"	\
> +		: [p]"=&r" (___p1), [c]"+A"(*ptr));		\
> +	___p1;							\
> +})
> +
> +#define __smp_load_reserved_relaxed(ptr)			\
> +({								\
> +	typeof(*ptr) ___p1;					\
> +								\
> +	if (sizeof(*ptr) == sizeof(int))			\
> +		___p1 = ___smp_load_reservedN("w", ptr);	\
> +	else if (sizeof(*ptr) == sizeof(long))			\
> +		___p1 = ___smp_load_reservedN("d", ptr);	\
> +	else							\
> +		compiletime_assert(0,				\
> +			"Need type compatible with LR/SC instructions for " \
> +			__stringify(ptr));			\
> +	___p1;							\
> +})

In the discussion of the previous iteration, you observed [2]:

  It's more common to use a switch for these things, [...] something
  like the macros in arch/riscv/include/asm/cmpxchg.h. Can we stick
  with that pattern?

Along the same lines (#codingstyle), notice that ___smp_load_reservedN()
would become one of the first uses of the asmSymbolicName syntax in the
arch/riscv/ directory.

So again, why not stick to the common style? Is something wrong with it?
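For reference, the switch-based shape used by arch/riscv/include/asm/cmpxchg.h,
which the comment above asks for, would look roughly like the sketch below,
with positional asm operands instead of asmSymbolicName. This illustrates the
suggested style only; it is not code from the posted patch:

#define __smp_load_reserved_relaxed(ptr)			\
({								\
	typeof(*ptr) __ret;					\
								\
	switch (sizeof(*ptr)) {					\
	case 4:							\
		__asm__ __volatile__ (				\
			"lr.w %0, %1"				\
			: "=&r" (__ret), "+A" (*(ptr)));	\
		break;						\
	case 8:							\
		__asm__ __volatile__ (				\
			"lr.d %0, %1"				\
			: "=&r" (__ret), "+A" (*(ptr)));	\
		break;						\
	default:						\
		/* reject sizes LR has no encoding for */	\
		BUILD_BUG();					\
	}							\
	__ret;							\
})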
> +#define smp_cond_load_relaxed(ptr, cond_expr)			\
> +({								\
> +	typeof(ptr) __PTR = (ptr);				\
> +	__unqual_scalar_typeof(*ptr) VAL;			\
> +								\
> +	VAL = READ_ONCE(*__PTR);				\
> +	if (!cond_expr) {					\
> +		for (;;) {					\
> +			VAL = __smp_load_reserved_relaxed(__PTR); \
> +			if (cond_expr)				\
> +				break;				\
> +			ALT_WRS_STO();				\
> +		}						\
> +	}							\
> +	(typeof(*ptr))VAL;					\
> +})
> +
> +#define smp_cond_load_acquire(ptr, cond_expr)			\
> +({								\
> +	typeof(ptr) __PTR = (ptr);				\
> +	__unqual_scalar_typeof(*ptr) VAL;			\
> +								\
> +	VAL = smp_load_acquire(__PTR);				\
> +	if (!cond_expr) {					\
> +		for (;;) {					\
> +			VAL = __smp_load_reserved_acquire(__PTR); \
> +			if (cond_expr)				\
> +				break;				\
> +			ALT_WRS_STO();				\
> +		}						\
> +	}							\
> +	(typeof(*ptr))VAL;					\
> +})

In the discussion of the previous iteration, you observed [2]:

  I guess this peeling off of the first iteration is because it's
  expected that the load generated by READ_ONCE() is more efficient
  than lr.w/d? If we're worried about unnecessary use of lr.w/d,
  then shouldn't we look for a solution that doesn't issue those
  instructions when we don't have the Zawrs extension?

To which Palmer replied (apparently agreeing with your remarks) [4]:

  I haven't looked at the patch, but I'd expect we NOP out the
  whole LR/WRS sequence? I don't remember any reason to have the
  load reservation without the WRS, [...]

Unfortunately, this submission makes no mention of those comments or,
more importantly, of the considerations/tradeoffs which led you to
submit different changes. In submitting-patches.rst's words,

  Review comments or questions that do not lead to a code change
  should almost certainly bring about a comment or changelog entry
  so that the next reviewer better understands what is going on.

  Andrea

P.S. BTW, not too far from the previous recommendation/paragraph is:

  When sending a next version, [...] Notify people that commented
  on your patch about new versions by adding them to the patches CC
  list.

[1] https://lore.kernel.org/lkml/ZTE7eUyrb8+J+ORB@andrea
[2] https://lore.kernel.org/lkml/20230524-35efcabbbfd6a1ef83865fb4@orel
[3] https://lore.kernel.org/lkml/20240217131206.3667544-1-ericchancf@google.com
[4] https://lore.kernel.org/lkml/mhng-d92f84d8-03db-4fb1-93c3-0d5bfbe7a796@palmer-ri-x1c9a
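For context on that last point: arm64 handles the same concern by keeping a
plain-load polling loop in smp_cond_load_relaxed() and pushing the
architecture-specific wait into a separate __cmpwait_relaxed() helper. A
Zawrs variant following that shape might look like the sketch below; the
helper here is hypothetical (not part of the posted patch), and the whole
LR+WRS sequence inside it could then be patched out on non-Zawrs hardware,
so the generic loop never issues lr.w/d on its own:

#define smp_cond_load_relaxed(ptr, cond_expr)			\
({								\
	typeof(ptr) __PTR = (ptr);				\
	__unqual_scalar_typeof(*ptr) VAL;			\
								\
	for (;;) {						\
		VAL = READ_ONCE(*__PTR);			\
		if (cond_expr)					\
			break;					\
		/* hypothetical: LR + WRS when the Zawrs	\
		 * alternative applies, a plain pause/nop	\
		 * otherwise					\
		 */						\
		__cmpwait_relaxed(__PTR, VAL);			\
	}							\
	(typeof(*ptr))VAL;					\
})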
Hi Andrea,

I actually just forgot to re-review the thread and make changes... I
grabbed the patch and then focused on experimenting and testing its
impact on guests. I guess I should have started with applying the
thread comments, because I ended up forgetting to return to them
before posting...

On Sat, Mar 16, 2024 at 12:36:23PM +0100, Andrea Parri wrote:
> > +config RISCV_ISA_ZAWRS
> > +	bool "Zawrs extension support for more efficient busy waiting"
> > +	depends on RISCV_ALTERNATIVE
> > +	default y
> > +	help
> > +	  Enable the use of the Zawrs (wait for reservation set) extension
> > +	  when available.
> > +
> > +	  The Zawrs extension instructions (wrs.nto and wrs.sto) are used for
> > +	  more efficient busy waiting.
>
> Maybe mention that this is about _power_-efficiency? In the discussion
> of the previous iteration, I suggested [1]:
>
>   The Zawrs extension defines a pair of instructions to be used in
>   polling loops that allows a core to enter a low-power state and
>   wait on a store to a memory location.
>
> (from the Zawrs spec -- I remain open to reviewing other suggestions).

Sounds good to me. I'll make a change like that for v2.

> > +#define ZAWRS_WRS_NTO	".long 0x00d00073"
> > +#define ZAWRS_WRS_STO	".long 0x01d00073"
>
> In the discussion of the previous iteration, you observed [2]:
>
>   I'd prefer we use insn-def.h to define instructions, rather than
>   scatter .long's around, but since this instruction doesn't have
>   any inputs, then I guess it's not so important to use insn-def.h.
>
> So does that "preference" not apply to the instructions at stake? Or is
> it just not "important"? No real objections to barrier.h; I'm only
> trying to understand the rationale.

Besides having simply forgotten to address my own comment, in this case
I can live with the .long since we're at least not hard coding an
operand register. However, since I prefer '.4byte', I'll change to that.

> > +#define ALT_WRS_NTO()						\
> > +	__asm__ __volatile__ (ALTERNATIVE(			\
> > +		"nop\n", ZAWRS_WRS_NTO "\n",			\
> > +		0, RISCV_ISA_EXT_ZAWRS, CONFIG_RISCV_ISA_ZAWRS))
> > +#define ALT_WRS_STO()						\
> > +	__asm__ __volatile__ (ALTERNATIVE(			\
> > +		"nop\n", ZAWRS_WRS_STO "\n",			\
> > +		0, RISCV_ISA_EXT_ZAWRS, CONFIG_RISCV_ISA_ZAWRS))
> > +
> >  #define RISCV_FENCE(p, s) \
> >  	__asm__ __volatile__ ("fence " #p "," #s : : : "memory")
>
> FYI, this hunk/patch conflicts with Eric's changes [3].

Thanks, I had mentally noted that series as a possible conflict, but
then forgot to try basing on it to see if it would be a problem.

> > +#define ___smp_load_reservedN(attr, ptr)			\
> > +({								\
> > +	typeof(*ptr) ___p1;					\
> > +								\
> > +	__asm__ __volatile__ ("lr." attr " %[p], %[c]\n"	\
> > +		: [p]"=&r" (___p1), [c]"+A"(*ptr));		\
> > +	___p1;							\
> > +})
> > +
> > +#define __smp_load_reserved_relaxed(ptr)			\
> > +({								\
> > +	typeof(*ptr) ___p1;					\
> > +								\
> > +	if (sizeof(*ptr) == sizeof(int))			\
> > +		___p1 = ___smp_load_reservedN("w", ptr);	\
> > +	else if (sizeof(*ptr) == sizeof(long))			\
> > +		___p1 = ___smp_load_reservedN("d", ptr);	\
> > +	else							\
> > +		compiletime_assert(0,				\
> > +			"Need type compatible with LR/SC instructions for " \
> > +			__stringify(ptr));			\
> > +	___p1;							\
> > +})
>
> In the discussion of the previous iteration, you observed [2]:
>
>   It's more common to use a switch for these things, [...] something
>   like the macros in arch/riscv/include/asm/cmpxchg.h. Can we stick
>   with that pattern?
>
> Along the same lines (#codingstyle), notice that ___smp_load_reservedN()
> would become one of the first uses of the asmSymbolicName syntax in the
> arch/riscv/ directory.
>
> So again, why not stick to the common style? Is something wrong with it?

For v2, I'll switch to the style that I recommended :-)

> > +#define smp_cond_load_relaxed(ptr, cond_expr)			\
> > +({								\
> > +	typeof(ptr) __PTR = (ptr);				\
> > +	__unqual_scalar_typeof(*ptr) VAL;			\
> > +								\
> > +	VAL = READ_ONCE(*__PTR);				\
> > +	if (!cond_expr) {					\
> > +		for (;;) {					\
> > +			VAL = __smp_load_reserved_relaxed(__PTR); \
> > +			if (cond_expr)				\
> > +				break;				\
> > +			ALT_WRS_STO();				\
> > +		}						\
> > +	}							\
> > +	(typeof(*ptr))VAL;					\
> > +})
> > +
> > +#define smp_cond_load_acquire(ptr, cond_expr)			\
> > +({								\
> > +	typeof(ptr) __PTR = (ptr);				\
> > +	__unqual_scalar_typeof(*ptr) VAL;			\
> > +								\
> > +	VAL = smp_load_acquire(__PTR);				\
> > +	if (!cond_expr) {					\
> > +		for (;;) {					\
> > +			VAL = __smp_load_reserved_acquire(__PTR); \
> > +			if (cond_expr)				\
> > +				break;				\
> > +			ALT_WRS_STO();				\
> > +		}						\
> > +	}							\
> > +	(typeof(*ptr))VAL;					\
> > +})
>
> In the discussion of the previous iteration, you observed [2]:
>
>   I guess this peeling off of the first iteration is because it's
>   expected that the load generated by READ_ONCE() is more efficient
>   than lr.w/d? If we're worried about unnecessary use of lr.w/d,
>   then shouldn't we look for a solution that doesn't issue those
>   instructions when we don't have the Zawrs extension?
>
> To which Palmer replied (apparently agreeing with your remarks) [4]:
>
>   I haven't looked at the patch, but I'd expect we NOP out the
>   whole LR/WRS sequence? I don't remember any reason to have the
>   load reservation without the WRS, [...]

For v2, I'll look closer at this to decide what to do.

> Unfortunately, this submission makes no mention of those comments or,
> more importantly, of the considerations/tradeoffs which led you to
> submit different changes. In submitting-patches.rst's words,
>
>   Review comments or questions that do not lead to a code change
>   should almost certainly bring about a comment or changelog entry
>   so that the next reviewer better understands what is going on.

Eh, there's nothing mysterious going on here. It was just me being
forgetful... And I even forgot to write a changelog entry explaining
that I forgot :-)

>   Andrea
>
> P.S. BTW, not too far from the previous recommendation/paragraph is:
>
>   When sending a next version, [...] Notify people that commented
>   on your patch about new versions by adding them to the patches CC
>   list.

Yeah, had I returned to the old thread for the re-review, I would have
seen your comments again and then remembered to CC you.

Thanks,
drew

> [1] https://lore.kernel.org/lkml/ZTE7eUyrb8+J+ORB@andrea
> [2] https://lore.kernel.org/lkml/20230524-35efcabbbfd6a1ef83865fb4@orel
> [3] https://lore.kernel.org/lkml/20240217131206.3667544-1-ericchancf@google.com
> [4] https://lore.kernel.org/lkml/mhng-d92f84d8-03db-4fb1-93c3-0d5bfbe7a796@palmer-ri-x1c9a
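A side note on the '.long'/'.4byte' point: both opcodes are I-type
instructions in the SYSTEM opcode space (major opcode 0x73) with
rd = rs1 = funct3 = 0, so each encoding is fully determined by its 12-bit
immediate. A sketch of the decomposition; the helper macro below is purely
illustrative and is not an insn-def.h API:

/* insn = funct12 << 20 | rs1 << 15 | funct3 << 12 | rd << 7 | opcode */
#define SYSTEM_INSN(funct12)	(((funct12) << 20) | 0x73)

/* SYSTEM_INSN(0x00d) == 0x00d00073, i.e. wrs.nto */
/* SYSTEM_INSN(0x01d) == 0x01d00073, i.e. wrs.sto */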
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index e3142ce531a0..2c296113aeb1 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -569,6 +569,19 @@ config RISCV_ISA_V_PREEMPTIVE
 	  preemption. Enabling this config will result in higher memory
 	  consumption due to the allocation of per-task's kernel Vector context.
 
+config RISCV_ISA_ZAWRS
+	bool "Zawrs extension support for more efficient busy waiting"
+	depends on RISCV_ALTERNATIVE
+	default y
+	help
+	  Enable the use of the Zawrs (wait for reservation set) extension
+	  when available.
+
+	  The Zawrs extension instructions (wrs.nto and wrs.sto) are used for
+	  more efficient busy waiting.
+
+	  If you don't know what to do here, say Y.
+
 config TOOLCHAIN_HAS_ZBB
 	bool
 	default y
diff --git a/arch/riscv/include/asm/barrier.h b/arch/riscv/include/asm/barrier.h
index 110752594228..93b3f572d643 100644
--- a/arch/riscv/include/asm/barrier.h
+++ b/arch/riscv/include/asm/barrier.h
@@ -11,11 +11,26 @@
 #define _ASM_RISCV_BARRIER_H
 
 #ifndef __ASSEMBLY__
+#include <linux/compiler.h>
+#include <asm/alternative-macros.h>
+#include <asm/hwcap.h>
+#include <asm/rwonce.h>
 
 #define nop()		__asm__ __volatile__ ("nop")
 #define __nops(n)	".rept	" #n "\nnop\n.endr\n"
 #define nops(n)		__asm__ __volatile__ (__nops(n))
 
+#define ZAWRS_WRS_NTO	".long 0x00d00073"
+#define ZAWRS_WRS_STO	".long 0x01d00073"
+#define ALT_WRS_NTO()						\
+	__asm__ __volatile__ (ALTERNATIVE(			\
+		"nop\n", ZAWRS_WRS_NTO "\n",			\
+		0, RISCV_ISA_EXT_ZAWRS, CONFIG_RISCV_ISA_ZAWRS))
+#define ALT_WRS_STO()						\
+	__asm__ __volatile__ (ALTERNATIVE(			\
+		"nop\n", ZAWRS_WRS_STO "\n",			\
+		0, RISCV_ISA_EXT_ZAWRS, CONFIG_RISCV_ISA_ZAWRS))
+
 #define RISCV_FENCE(p, s) \
 	__asm__ __volatile__ ("fence " #p "," #s : : : "memory")
 
@@ -44,6 +59,39 @@ do {								\
 	___p1;							\
 })
 
+#define ___smp_load_reservedN(attr, ptr)			\
+({								\
+	typeof(*ptr) ___p1;					\
+								\
+	__asm__ __volatile__ ("lr." attr " %[p], %[c]\n"	\
+		: [p]"=&r" (___p1), [c]"+A"(*ptr));		\
+	___p1;							\
+})
+
+#define __smp_load_reserved_relaxed(ptr)			\
+({								\
+	typeof(*ptr) ___p1;					\
+								\
+	if (sizeof(*ptr) == sizeof(int))			\
+		___p1 = ___smp_load_reservedN("w", ptr);	\
+	else if (sizeof(*ptr) == sizeof(long))			\
+		___p1 = ___smp_load_reservedN("d", ptr);	\
+	else							\
+		compiletime_assert(0,				\
+			"Need type compatible with LR/SC instructions for " \
+			__stringify(ptr));			\
+	___p1;							\
+})
+
+#define __smp_load_reserved_acquire(ptr)			\
+({								\
+	typeof(*ptr) ___p1;					\
+								\
+	___p1 = __smp_load_reserved_relaxed(ptr);		\
+	RISCV_FENCE(r, rw);					\
+	___p1;							\
+})
+
 /*
  * This is a very specific barrier: it's currently only used in two places in
  * the kernel, both in the scheduler.  See include/linux/spinlock.h for the two
@@ -71,6 +119,40 @@ do {								\
  */
 #define smp_mb__after_spinlock()	RISCV_FENCE(iorw,iorw)
 
+#define smp_cond_load_relaxed(ptr, cond_expr)			\
+({								\
+	typeof(ptr) __PTR = (ptr);				\
+	__unqual_scalar_typeof(*ptr) VAL;			\
+								\
+	VAL = READ_ONCE(*__PTR);				\
+	if (!cond_expr) {					\
+		for (;;) {					\
+			VAL = __smp_load_reserved_relaxed(__PTR); \
+			if (cond_expr)				\
+				break;				\
+			ALT_WRS_STO();				\
+		}						\
+	}							\
+	(typeof(*ptr))VAL;					\
+})
+
+#define smp_cond_load_acquire(ptr, cond_expr)			\
+({								\
+	typeof(ptr) __PTR = (ptr);				\
+	__unqual_scalar_typeof(*ptr) VAL;			\
+								\
+	VAL = smp_load_acquire(__PTR);				\
+	if (!cond_expr) {					\
+		for (;;) {					\
+			VAL = __smp_load_reserved_acquire(__PTR); \
+			if (cond_expr)				\
+				break;				\
+			ALT_WRS_STO();				\
+		}						\
+	}							\
+	(typeof(*ptr))VAL;					\
+})
+
 #include <asm-generic/barrier.h>
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/riscv/include/asm/hwcap.h b/arch/riscv/include/asm/hwcap.h
index 1f2d2599c655..eac7214a4bd0 100644
--- a/arch/riscv/include/asm/hwcap.h
+++ b/arch/riscv/include/asm/hwcap.h
@@ -80,6 +80,7 @@
 #define RISCV_ISA_EXT_ZFA		71
 #define RISCV_ISA_EXT_ZTSO		72
 #define RISCV_ISA_EXT_ZACAS		73
+#define RISCV_ISA_EXT_ZAWRS		74
 
 #define RISCV_ISA_EXT_XLINUXENVCFG	127
 
diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
index 79a5a35fab96..0e3c79094b07 100644
--- a/arch/riscv/kernel/cpufeature.c
+++ b/arch/riscv/kernel/cpufeature.c
@@ -271,6 +271,7 @@ const struct riscv_isa_ext_data riscv_isa_ext[] = {
 	__RISCV_ISA_EXT_DATA(zihintpause, RISCV_ISA_EXT_ZIHINTPAUSE),
 	__RISCV_ISA_EXT_DATA(zihpm, RISCV_ISA_EXT_ZIHPM),
 	__RISCV_ISA_EXT_DATA(zacas, RISCV_ISA_EXT_ZACAS),
+	__RISCV_ISA_EXT_DATA(zawrs, RISCV_ISA_EXT_ZAWRS),
 	__RISCV_ISA_EXT_DATA(zfa, RISCV_ISA_EXT_ZFA),
 	__RISCV_ISA_EXT_DATA(zfh, RISCV_ISA_EXT_ZFH),
 	__RISCV_ISA_EXT_DATA(zfhmin, RISCV_ISA_EXT_ZFHMIN),
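Caller-side, both macros keep the generic smp_cond_load_*() contract, in
which the condition expression refers to the most recently loaded value
as VAL. A minimal, hypothetical usage example (not code from this series):

/* Spin until *lock reads zero; the final load has acquire ordering. */
static inline void wait_for_unlock(int *lock)
{
	smp_cond_load_acquire(lock, VAL == 0);
}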