Message ID | 20230810040349.92279-7-leobras@redhat.com (mailing list archive) |
---|---|
State | RFC |
Series | Rework & improve riscv cmpxchg.h and atomic.h |
Context | Check | Description |
---|---|---|
conchuod/cover_letter | success | Series has a cover letter |
conchuod/tree_selection | success | Guessed tree name to be for-next at HEAD 174e8ac0272d |
conchuod/fixes_present | success | Fixes tag not required for -next series |
conchuod/maintainers_pattern | success | MAINTAINERS pattern errors before the patch: 4 and now 4 |
conchuod/verify_signedoff | success | Signed-off-by tag matches author and committer |
conchuod/kdoc | success | Errors and warnings before: 0 this patch: 0 |
conchuod/build_rv64_clang_allmodconfig | success | Errors and warnings before: 2808 this patch: 2808 |
conchuod/module_param | success | Was 0 now: 0 |
conchuod/build_rv64_gcc_allmodconfig | fail | Errors and warnings before: 15918 this patch: 15928 |
conchuod/build_rv32_defconfig | success | Build OK |
conchuod/dtb_warn_rv64 | success | Errors and warnings before: 12 this patch: 12 |
conchuod/header_inline | success | No static functions without inline keyword in header files |
conchuod/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 44 lines checked |
conchuod/build_rv64_nommu_k210_defconfig | success | Build OK |
conchuod/verify_fixes | success | No Fixes tag |
conchuod/build_rv64_nommu_virt_defconfig | success | Build OK |
On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote:
> xchg for variables of size 1-byte and 2-bytes is not yet available for
> riscv, even though its present in other architectures such as arm64 and
> x86. This could lead to not being able to implement some locking mechanisms
> or requiring some rework to make it work properly.
>
> Implement 1-byte and 2-bytes xchg in order to achieve parity with other
> architectures.
>
> Signed-off-by: Leonardo Bras <leobras@redhat.com>

Parity with other architectures by itself is not a reason to do this, in particular the other architectures you listed have the instructions in hardware while riscv does not.

Emulating the small xchg() through cmpxchg() is particularly tricky since it's easy to run into a case where this does not guarantee forward progress.

This is also something that almost no architecture specific code relies on (generic qspinlock being a notable exception).

I would recommend just dropping this patch from the series, at least until there is a need for it.

     Arnd
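For concreteness, here is a minimal sketch of what "emulating the small xchg() through cmpxchg()" means in practice. This is not the code from the series under discussion; the helper name is made up and a little-endian byte layout is assumed. The retry loop is exactly where the forward-progress concern comes from: a store by another CPU to any byte of the same 32-bit word forces another iteration.

```c
/*
 * Minimal sketch, not the patch under discussion: a 1-byte xchg()
 * emulated with a compare-and-swap loop on the aligned 32-bit word
 * that contains the byte.  A store by another CPU to *any* byte of
 * the same word makes the CAS fail and forces another trip around
 * the loop, which is why forward progress is not guaranteed here.
 * Assumes little-endian byte order; the helper name is hypothetical.
 */
#include <stdint.h>

static inline uint8_t xchg8_emulated(volatile uint8_t *ptr, uint8_t newval)
{
	uintptr_t addr = (uintptr_t)ptr;
	volatile uint32_t *word = (volatile uint32_t *)(addr & ~(uintptr_t)3);
	unsigned int shift = (addr & 3) * 8;	/* byte lane within the word */
	uint32_t mask = 0xffu << shift;
	uint32_t old, repl;

	old = __atomic_load_n(word, __ATOMIC_RELAXED);
	do {
		/* replace only our byte, keep the other three intact */
		repl = (old & ~mask) | ((uint32_t)newval << shift);
	} while (!__atomic_compare_exchange_n(word, &old, repl, /* weak */ 1,
					      __ATOMIC_SEQ_CST, __ATOMIC_RELAXED));

	return (uint8_t)((old & mask) >> shift);
}
```

A native byte-sized swap instruction (as on x86, or arm64 with LSE) has no such loop, which is the asymmetry being pointed out here.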
On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote:
> On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote:
> > xchg for variables of size 1-byte and 2-bytes is not yet available for
> > riscv, even though its present in other architectures such as arm64 and
> > x86. This could lead to not being able to implement some locking mechanisms
> > or requiring some rework to make it work properly.
> >
> > Implement 1-byte and 2-bytes xchg in order to achieve parity with other
> > architectures.
> >
> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
>

Hello Arnd Bergmann, thanks for reviewing!

> Parity with other architectures by itself is not a reason to do this,
> in particular the other architectures you listed have the instructions
> in hardware while riscv does not.

Sure, I understand RISC-V doesn't have native support for xchg on variables of size < 4B. My argument is that it's nice to have even an emulated version of this in case any future mechanism wants to use it.

Not having it may mean we won't be able to enable a given mechanism in RISC-V.

> Emulating the small xchg() through cmpxchg() is particularly tricky
> since it's easy to run into a case where this does not guarantee
> forward progress.
>

Didn't get this part:
By "emulating small xchg() through cmpxchg()", did you mean emulating an xchg (usually 1 instruction) with lr & sc (the same used in cmpxchg)?

If so, yeah, it's a fair point: in some extreme case we could have multiple threads accessing a given cacheline and have sc always failing. On the other hand, there are 2 arguments on that:

1 - Other architectures (such as powerpc, arm, and arm64 without LSE atomics) also seem to rely on this mechanism for every xchg size. Other archs like csky and loongarch use asm that looks like mine to handle size < 4B xchg.

> This is also something that almost no architecture
> specific code relies on (generic qspinlock being a notable exception).
>

2 - As you mentioned, there should be very little code that will actually make use of xchg for vars < 4B, so it should be safe to assume it's fine to not guarantee forward progress for those rare usages (like some of the above-mentioned archs).

> I would recommend just dropping this patch from the series, at least
> until there is a need for it.

While I agree this is a valid point, I believe it's more interesting to have it implemented in case any future mechanism wants to make use of this.

Thanks!
Leo
On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@redhat.com wrote: > On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote: >> On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote: >> > xchg for variables of size 1-byte and 2-bytes is not yet available for >> > riscv, even though its present in other architectures such as arm64 and >> > x86. This could lead to not being able to implement some locking mechanisms >> > or requiring some rework to make it work properly. >> > >> > Implement 1-byte and 2-bytes xchg in order to achieve parity with other >> > architectures. >> > >> > Signed-off-by: Leonardo Bras <leobras@redhat.com> >> > > Hello Arnd Bergmann, thanks for reviewing! > >> Parity with other architectures by itself is not a reason to do this, >> in particular the other architectures you listed have the instructions >> in hardware while riscv does not. > > Sure, I understand RISC-V don't have native support for xchg on variables of > size < 4B. My argument is that it's nice to have even an emulated version for > this in case any future mechanism wants to use it. > > Not having it may mean we won't be able to enable given mechanism in RISC-V. IIUC the ask is to have a user within the kernel for these functions. That's the general thing to do, and last time this came up there was no in-kernel use of it -- the qspinlock stuff would, but we haven't enabled it yet because we're worried about the performance/fairness stuff that other ports have seen and nobody's got concrete benchmarks yet (though there's another patch set out that I haven't had time to look through, so that may have changed). So if something uses these I'm happy to go look closer. >> Emulating the small xchg() through cmpxchg() is particularly tricky >> since it's easy to run into a case where this does not guarantee >> forward progress. >> > > Didn't get this part: > By "emulating small xchg() through cmpxchg()", did you mean like emulating an > xchg (usually 1 instruction) with lr & sc (same used in cmpxchg) ? > > If so, yeah, it's a fair point: in some extreme case we could have multiple > threads accessing given cacheline and have sc always failing. On the other hand, > there are 2 arguments on that: > > 1 - Other architectures, (such as powerpc, arm and arm64 without LSE atomics) > also seem to rely in this mechanism for every xchg size. Another archs like csky > and loongarch use asm that look like mine to handle size < 4B xchg. > > >> This is also something that almost no architecture >> specific code relies on (generic qspinlock being a notable exception). >> > > 2 - As you mentioned, there should be very little code that will actually make > use of xchg for vars < 4B, so it should be safe to assume its fine to not > guarantee forward progress for those rare usages (like some of above mentioned > archs). > >> I would recommend just dropping this patch from the series, at least >> until there is a need for it. > > While I agree this is a valid point, I believe its more interesting to have it > implemented if any future mechanism wants to make use of this. > > > Thanks! > Leo
On Thu, Aug 10, 2023, at 18:23, Palmer Dabbelt wrote:
> On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@redhat.com wrote:
>> On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote:
>>> On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote:
>>> > xchg for variables of size 1-byte and 2-bytes is not yet available for
>>> > riscv, even though its present in other architectures such as arm64 and
>>> > x86. This could lead to not being able to implement some locking mechanisms
>>> > or requiring some rework to make it work properly.
>>> >
>>> > Implement 1-byte and 2-bytes xchg in order to achieve parity with other
>>> > architectures.
>>
>>> Parity with other architectures by itself is not a reason to do this,
>>> in particular the other architectures you listed have the instructions
>>> in hardware while riscv does not.
>>
>> Sure, I understand RISC-V don't have native support for xchg on variables of
>> size < 4B. My argument is that it's nice to have even an emulated version for
>> this in case any future mechanism wants to use it.
>>
>> Not having it may mean we won't be able to enable given mechanism in RISC-V.
>
> IIUC the ask is to have a user within the kernel for these functions.
> That's the general thing to do, and last time this came up there was no
> in-kernel use of it -- the qspinlock stuff would, but we haven't enabled
> it yet because we're worried about the performance/fairness stuff that
> other ports have seen and nobody's got concrete benchmarks yet (though
> there's another patch set out that I haven't had time to look through,
> so that may have changed).

Right. In particular the qspinlock is a good example of something where having the emulated 16-bit xchg() may end up less efficient than a natively supported instruction.

The xchg() here is a performance optimization for CPUs that can do this without touching the other half of the 32-bit word.

>>
>> Didn't get this part:
>> By "emulating small xchg() through cmpxchg()", did you mean like emulating an
>> xchg (usually 1 instruction) with lr & sc (same used in cmpxchg) ?
>>
>> If so, yeah, it's a fair point: in some extreme case we could have multiple
>> threads accessing given cacheline and have sc always failing. On the other hand,
>> there are 2 arguments on that:
>>
>> 1 - Other architectures, (such as powerpc, arm and arm64 without LSE atomics)
>> also seem to rely in this mechanism for every xchg size. Another archs like csky
>> and loongarch use asm that look like mine to handle size < 4B xchg.

I think you misread the arm64 code, which should use native instructions for all sizes, in both the armv8.0 and LSE atomics.

PowerPC does use the masking for xchg, but I suspect there are no actual users; at least it actually has its own qspinlock implementation that avoids xchg().

>>> This is also something that almost no architecture
>>> specific code relies on (generic qspinlock being a notable exception).
>>>
>>
>> 2 - As you mentioned, there should be very little code that will actually make
>> use of xchg for vars < 4B, so it should be safe to assume its fine to not
>> guarantee forward progress for those rare usages (like some of above mentioned
>> archs).

I don't think this is a safe assumption; we've had endless discussions about using qspinlock on architectures without a native xchg(), which needs either hardware guarantees or special countermeasures in xchg() itself to avoid this.

What I'd actually like to do here is to remove the special 8-bit and 16-bit cases from the xchg() and cmpxchg() interfaces at all, leaving only fixed 32-bit and native wordsize (either 32 or 64) as the option, while dealing with the others the same way we treat the fixed 64-bit cases that hardcode the 64-bit argument types and are only usable on architectures that provide them.

     Arnd
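To make the "other half of the 32-bit word" remark concrete, here is a simplified sketch of a qspinlock-style word packing. The field names and layout are illustrative only (the real kernel encodes the tail differently), but the point stands: with a native 16-bit xchg the tail can be published without ever storing to the locked/pending bytes, while an LR/SC emulation reserves and rewrites the whole word.

```c
/*
 * Simplified sketch of a qspinlock-style lock word (illustration only,
 * not the kernel's layout or code):
 *
 *   bits  0-7 : locked byte
 *   bits  8-15: pending byte
 *   bits 16-31: tail (identifies the last queued waiter)
 *
 * With a native 16-bit exchange, publishing a new tail only ever stores
 * to the upper halfword, so a concurrent unlock writing the locked byte
 * cannot make it retry.  An LR/SC loop over the full 32-bit word can.
 */
#include <stdint.h>

struct sketch_qspinlock {
	union {
		uint32_t val;
		struct {			/* little-endian layout assumed */
			uint8_t  locked;
			uint8_t  pending;
			uint16_t tail;
		};
	};
};

/* Swap in our tail and return the previous one; a nonzero old tail means
 * there is already a waiter we must queue behind. */
static inline uint16_t sketch_xchg_tail(struct sketch_qspinlock *lock,
					uint16_t new_tail)
{
	return __atomic_exchange_n(&lock->tail, new_tail, __ATOMIC_RELAXED);
}
```

On x86 or arm64 the compiler can lower that exchange to a single halfword swap; without such an instruction it degrades into exactly the kind of retry loop discussed above.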
On Fri, Aug 11, 2023 at 3:13 AM Arnd Bergmann <arnd@arndb.de> wrote: > > On Thu, Aug 10, 2023, at 18:23, Palmer Dabbelt wrote: > > On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@redhat.com wrote: > >> On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote: > >>> On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote: > >>> > xchg for variables of size 1-byte and 2-bytes is not yet available for > >>> > riscv, even though its present in other architectures such as arm64 and > >>> > x86. This could lead to not being able to implement some locking mechanisms > >>> > or requiring some rework to make it work properly. > >>> > > >>> > Implement 1-byte and 2-bytes xchg in order to achieve parity with other > >>> > architectures. > >> > >>> Parity with other architectures by itself is not a reason to do this, > >>> in particular the other architectures you listed have the instructions > >>> in hardware while riscv does not. > >> > >> Sure, I understand RISC-V don't have native support for xchg on variables of > >> size < 4B. My argument is that it's nice to have even an emulated version for > >> this in case any future mechanism wants to use it. > >> > >> Not having it may mean we won't be able to enable given mechanism in RISC-V. > > > > IIUC the ask is to have a user within the kernel for these functions. > > That's the general thing to do, and last time this came up there was no > > in-kernel use of it -- the qspinlock stuff would, but we haven't enabled > > it yet because we're worried about the performance/fairness stuff that > > other ports have seen and nobody's got concrete benchmarks yet (though > > there's another patch set out that I haven't had time to look through, > > so that may have changed). > > Right. In particular the qspinlock is a good example for something > where having the emulated 16-bit xchg() may end up less efficient > than a natively supported instruction. The xchg() efficiency depends on micro-architecture. and the number of instructions is not the key, even one instruction would be separated into several micro-ops. I thought the Power guys won't agree with this view :) > > The xchg() here is a performance optimization for CPUs that can > do this without touching the other half of the 32-bit word. It's useless on a non-SMT system because all operations are cacheline based. (Ps: Because xchg() has a load semantic, CHI's "Dirty Partial" & "Clean Empty" can't help anymore.) > > >> > >> Didn't get this part: > >> By "emulating small xchg() through cmpxchg()", did you mean like emulating an > >> xchg (usually 1 instruction) with lr & sc (same used in cmpxchg) ? > >> > >> If so, yeah, it's a fair point: in some extreme case we could have multiple > >> threads accessing given cacheline and have sc always failing. On the other hand, > >> there are 2 arguments on that: > >> > >> 1 - Other architectures, (such as powerpc, arm and arm64 without LSE atomics) > >> also seem to rely in this mechanism for every xchg size. Another archs like csky > >> and loongarch use asm that look like mine to handle size < 4B xchg. > > I think you misread the arm64 code, which should use native instructions > for all sizes, in both the armv8.0 and LSE atomics. > > PowerPC does use the masking for xchg, but I suspect there are no > actual users, at least it actually has its own qspinlock implementation > that avoids xchg(). PowerPC still needs similar things, see publish_tail_cpu(), and more complex cmpxchg semantics. 
Paravirt qspinlock and CNA qspinlock still need more:
 - xchg8 (RCsc)
 - cmpxchg8/16_relaxed
 - cmpxchg8/16_release (RCpc)
 - cmpxchg8_acquire (RCpc)
 - cmpxchg8 (RCsc)

> > > > > This is also something that almost no architecture
> > > > > specific code relies on (generic qspinlock being a notable exception).
> > > > >
> > > >
> > > > 2 - As you mentioned, there should be very little code that will actually make
> > > > use of xchg for vars < 4B, so it should be safe to assume its fine to not
> > > > guarantee forward progress for those rare usages (like some of above mentioned
> > > > archs).
> >
> > I don't think this is a safe assumption, we've had endless discussions
> > about using qspinlock on architectures without a native xchg(), which
> > needs either hardware guarantees or special countermeasures in xchg() itself
> > to avoid this.
> >
> > What I'd actually like to do here is to remove the special 8-bit and
> > 16-bit cases from the xchg() and cmpxchg() interfaces at all, leaving

It needs to modify qspinlock, paravirt_qspinlock, and CNA_qspinlock code to prevent using 8-bit/16-bit xchg/cmpxchg, and clean up all architectures' cmpxchg.h. What you do just gets them out of the common atomic.h, but architectures still need to solve them and connect to the qspinlock series.

> > only fixed 32-bit and native wordsize (either 32 or 64) as the option,
> > while dealing with the others the same way we treat the fixed
> > 64-bit cases that hardcode the 64-bit argument types and are only
> > usable on architectures that provide them.
> >
> > Arnd
On Fri, Aug 11, 2023 at 12:23 AM Palmer Dabbelt <palmer@rivosinc.com> wrote: > > On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@redhat.com wrote: > > On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote: > >> On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote: > >> > xchg for variables of size 1-byte and 2-bytes is not yet available for > >> > riscv, even though its present in other architectures such as arm64 and > >> > x86. This could lead to not being able to implement some locking mechanisms > >> > or requiring some rework to make it work properly. > >> > > >> > Implement 1-byte and 2-bytes xchg in order to achieve parity with other > >> > architectures. > >> > > >> > Signed-off-by: Leonardo Bras <leobras@redhat.com> > >> > > > > Hello Arnd Bergmann, thanks for reviewing! > > > >> Parity with other architectures by itself is not a reason to do this, > >> in particular the other architectures you listed have the instructions > >> in hardware while riscv does not. > > > > Sure, I understand RISC-V don't have native support for xchg on variables of > > size < 4B. My argument is that it's nice to have even an emulated version for > > this in case any future mechanism wants to use it. > > > > Not having it may mean we won't be able to enable given mechanism in RISC-V. > > IIUC the ask is to have a user within the kernel for these functions. > That's the general thing to do, and last time this came up there was no > in-kernel use of it -- the qspinlock stuff would, but we haven't enabled > it yet because we're worried about the performance/fairness stuff that > other ports have seen and nobody's got concrete benchmarks yet (though > there's another patch set out that I haven't had time to look through, > so that may have changed). Conor doesn't agree with using an alternative as a detour mechanism between qspinlock & ticket lock. So I'm preparing V11 with static_key (jump_label) style. Next version, I would separate paravirt_qspinlock & CNA_qspinlock from V10. That would make it easy to review the qspinlock patch series. You can review the next version V11. Now I'm debugging a static_key init problem when load_modules, which is triggered by our combo_qspinlock. The qspinlock is being tested on the riscv platform [1] with 128 cores with 8 NUMA nodes, next, I would update the comparison results of qspinlock & ticket lock. [1]: https://www.sophon.ai/ > > So if something uses these I'm happy to go look closer. > > >> Emulating the small xchg() through cmpxchg() is particularly tricky > >> since it's easy to run into a case where this does not guarantee > >> forward progress. > >> > > > > Didn't get this part: > > By "emulating small xchg() through cmpxchg()", did you mean like emulating an > > xchg (usually 1 instruction) with lr & sc (same used in cmpxchg) ? > > > > If so, yeah, it's a fair point: in some extreme case we could have multiple > > threads accessing given cacheline and have sc always failing. On the other hand, > > there are 2 arguments on that: > > > > 1 - Other architectures, (such as powerpc, arm and arm64 without LSE atomics) > > also seem to rely in this mechanism for every xchg size. Another archs like csky > > and loongarch use asm that look like mine to handle size < 4B xchg. > > > > > >> This is also something that almost no architecture > >> specific code relies on (generic qspinlock being a notable exception). 
> >> > > > > 2 - As you mentioned, there should be very little code that will actually make > > use of xchg for vars < 4B, so it should be safe to assume its fine to not > > guarantee forward progress for those rare usages (like some of above mentioned > > archs). > > > >> I would recommend just dropping this patch from the series, at least > >> until there is a need for it. > > > > While I agree this is a valid point, I believe its more interesting to have it > > implemented if any future mechanism wants to make use of this. > > > > > > Thanks! > > Leo
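On the static_key approach mentioned above, here is a hedged sketch of how a jump-label-based selection between two lock implementations is typically wired up. Only the <linux/jump_label.h> calls are real kernel API; the key name and the two lock helpers are hypothetical stand-ins, not code from the combo-qspinlock series.

```c
/*
 * Hedged sketch (illustration only): picking between two spinlock
 * implementations with a jump label, flipped once during early boot.
 * DEFINE_STATIC_KEY_TRUE / static_branch_likely / static_branch_disable
 * are the real <linux/jump_label.h> API; everything named "hypothetical"
 * is a made-up stand-in.
 */
#include <linux/jump_label.h>
#include <linux/init.h>

DEFINE_STATIC_KEY_TRUE(combo_use_qspinlock);	/* hypothetical key */

void hypothetical_queued_lock(void *lock);	/* stand-in for the qspinlock path */
void hypothetical_ticket_lock(void *lock);	/* stand-in for the ticket path */

static __always_inline void combo_spin_lock(void *lock)
{
	/* Compiles to a patched direct branch, not a runtime load+test. */
	if (static_branch_likely(&combo_use_qspinlock))
		hypothetical_queued_lock(lock);
	else
		hypothetical_ticket_lock(lock);
}

/* Flipped once during early boot, e.g. after probing whether the hardware
 * provides the stronger LR/SC forward-progress guarantee. */
static void __init combo_lock_init(bool prefer_qspinlock)
{
	if (!prefer_qspinlock)
		static_branch_disable(&combo_use_qspinlock);
}
```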
On Fri, Aug 11, 2023 at 09:40:30AM +0800, Guo Ren wrote: > On Fri, Aug 11, 2023 at 12:23 AM Palmer Dabbelt <palmer@rivosinc.com> wrote: > > > > On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@redhat.com wrote: > > > On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote: > > >> On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote: > > >> > xchg for variables of size 1-byte and 2-bytes is not yet available for > > >> > riscv, even though its present in other architectures such as arm64 and > > >> > x86. This could lead to not being able to implement some locking mechanisms > > >> > or requiring some rework to make it work properly. > > >> > > > >> > Implement 1-byte and 2-bytes xchg in order to achieve parity with other > > >> > architectures. > > >> > > > >> > Signed-off-by: Leonardo Bras <leobras@redhat.com> > > >> > > > > > > Hello Arnd Bergmann, thanks for reviewing! > > > > > >> Parity with other architectures by itself is not a reason to do this, > > >> in particular the other architectures you listed have the instructions > > >> in hardware while riscv does not. > > > > > > Sure, I understand RISC-V don't have native support for xchg on variables of > > > size < 4B. My argument is that it's nice to have even an emulated version for > > > this in case any future mechanism wants to use it. > > > > > > Not having it may mean we won't be able to enable given mechanism in RISC-V. > > > > IIUC the ask is to have a user within the kernel for these functions. > > That's the general thing to do, and last time this came up there was no > > in-kernel use of it -- the qspinlock stuff would, but we haven't enabled > > it yet because we're worried about the performance/fairness stuff that > > other ports have seen and nobody's got concrete benchmarks yet (though > > there's another patch set out that I haven't had time to look through, > > so that may have changed). > Conor doesn't agree with using an alternative as a detour mechanism > between qspinlock & ticket lock. Hold on a sec, I don't recall having a problem with alternatives - it was calling the stronger forward progress guarantee an erratum (which it isn't) and an ISA extension w/o any "abusing" that framework that I did not like. > So I'm preparing V11 with static_key > (jump_label) style. I don't think there's much point rushing into making it based on static keys when no progress has been made on implementing support for non-standard extensions. Changing to a static key doesn't change the detection mechanism, I've not got a problem with using alternatives for this stuff. Thanks, Conor. > Next version, I would separate paravirt_qspinlock > & CNA_qspinlock from V10. That would make it easy to review the > qspinlock patch series. You can review the next version V11. Now I'm > debugging a static_key init problem when load_modules, which is > triggered by our combo_qspinlock. > > The qspinlock is being tested on the riscv platform [1] with 128 cores > with 8 NUMA nodes, next, I would update the comparison results of > qspinlock & ticket lock. > > [1]: https://www.sophon.ai/ > > > > > So if something uses these I'm happy to go look closer. > > > > >> Emulating the small xchg() through cmpxchg() is particularly tricky > > >> since it's easy to run into a case where this does not guarantee > > >> forward progress. > > >> > > > > > > Didn't get this part: > > > By "emulating small xchg() through cmpxchg()", did you mean like emulating an > > > xchg (usually 1 instruction) with lr & sc (same used in cmpxchg) ? 
> > > > > > If so, yeah, it's a fair point: in some extreme case we could have multiple > > > threads accessing given cacheline and have sc always failing. On the other hand, > > > there are 2 arguments on that: > > > > > > 1 - Other architectures, (such as powerpc, arm and arm64 without LSE atomics) > > > also seem to rely in this mechanism for every xchg size. Another archs like csky > > > and loongarch use asm that look like mine to handle size < 4B xchg. > > > > > > > > >> This is also something that almost no architecture > > >> specific code relies on (generic qspinlock being a notable exception). > > >> > > > > > > 2 - As you mentioned, there should be very little code that will actually make > > > use of xchg for vars < 4B, so it should be safe to assume its fine to not > > > guarantee forward progress for those rare usages (like some of above mentioned > > > archs). > > > > > >> I would recommend just dropping this patch from the series, at least > > >> until there is a need for it. > > > > > > While I agree this is a valid point, I believe its more interesting to have it > > > implemented if any future mechanism wants to make use of this. > > > > > > > > > Thanks! > > > Leo > > > > -- > Best Regards > Guo Ren
On Fri, Aug 11, 2023 at 07:27:28AM +0100, Conor Dooley wrote: > On Fri, Aug 11, 2023 at 09:40:30AM +0800, Guo Ren wrote: > > On Fri, Aug 11, 2023 at 12:23 AM Palmer Dabbelt <palmer@rivosinc.com> wrote: > > > > > > On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@redhat.com wrote: > > > > On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote: > > > >> On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote: > > > >> > xchg for variables of size 1-byte and 2-bytes is not yet available for > > > >> > riscv, even though its present in other architectures such as arm64 and > > > >> > x86. This could lead to not being able to implement some locking mechanisms > > > >> > or requiring some rework to make it work properly. > > > >> > > > > >> > Implement 1-byte and 2-bytes xchg in order to achieve parity with other > > > >> > architectures. > > > >> > > > > >> > Signed-off-by: Leonardo Bras <leobras@redhat.com> > > > >> > > > > > > > > Hello Arnd Bergmann, thanks for reviewing! > > > > > > > >> Parity with other architectures by itself is not a reason to do this, > > > >> in particular the other architectures you listed have the instructions > > > >> in hardware while riscv does not. > > > > > > > > Sure, I understand RISC-V don't have native support for xchg on variables of > > > > size < 4B. My argument is that it's nice to have even an emulated version for > > > > this in case any future mechanism wants to use it. > > > > > > > > Not having it may mean we won't be able to enable given mechanism in RISC-V. > > > > > > IIUC the ask is to have a user within the kernel for these functions. > > > That's the general thing to do, and last time this came up there was no > > > in-kernel use of it -- the qspinlock stuff would, but we haven't enabled > > > it yet because we're worried about the performance/fairness stuff that > > > other ports have seen and nobody's got concrete benchmarks yet (though > > > there's another patch set out that I haven't had time to look through, > > > so that may have changed). > > Conor doesn't agree with using an alternative as a detour mechanism > > between qspinlock & ticket lock. > > Hold on a sec, I don't recall having a problem with alternatives - it > was calling the stronger forward progress guarantee an erratum > (which it isn't) and an ISA extension w/o any "abusing" that framework > that I did not like. Hmm, what I wrote here makes no sense reading it back. Let me try again: Hold on a sec, I don't recall having a problem with alternatives - it was calling the stronger forward progress guarantee an erratum (which it isn't) and "abusing" that framework that I did not like. The series effectively mixed and matched whichever part of ISA extensions and errata that was convenient. > > > So I'm preparing V11 with static_key > > (jump_label) style. > > I don't think there's much point rushing into making it based on static > keys when no progress has been made on implementing support for > non-standard extensions. Changing to a static key doesn't change the > detection mechanism, I've not got a problem with using alternatives for > this stuff. > > Thanks, > Conor. > > > Next version, I would separate paravirt_qspinlock > > & CNA_qspinlock from V10. That would make it easy to review the > > qspinlock patch series. You can review the next version V11. Now I'm > > debugging a static_key init problem when load_modules, which is > > triggered by our combo_qspinlock. 
> > > > The qspinlock is being tested on the riscv platform [1] with 128 cores > > with 8 NUMA nodes, next, I would update the comparison results of > > qspinlock & ticket lock. > > > > [1]: https://www.sophon.ai/ > > > > > > > > So if something uses these I'm happy to go look closer. > > > > > > >> Emulating the small xchg() through cmpxchg() is particularly tricky > > > >> since it's easy to run into a case where this does not guarantee > > > >> forward progress. > > > >> > > > > > > > > Didn't get this part: > > > > By "emulating small xchg() through cmpxchg()", did you mean like emulating an > > > > xchg (usually 1 instruction) with lr & sc (same used in cmpxchg) ? > > > > > > > > If so, yeah, it's a fair point: in some extreme case we could have multiple > > > > threads accessing given cacheline and have sc always failing. On the other hand, > > > > there are 2 arguments on that: > > > > > > > > 1 - Other architectures, (such as powerpc, arm and arm64 without LSE atomics) > > > > also seem to rely in this mechanism for every xchg size. Another archs like csky > > > > and loongarch use asm that look like mine to handle size < 4B xchg. > > > > > > > > > > > >> This is also something that almost no architecture > > > >> specific code relies on (generic qspinlock being a notable exception). > > > >> > > > > > > > > 2 - As you mentioned, there should be very little code that will actually make > > > > use of xchg for vars < 4B, so it should be safe to assume its fine to not > > > > guarantee forward progress for those rare usages (like some of above mentioned > > > > archs). > > > > > > > >> I would recommend just dropping this patch from the series, at least > > > >> until there is a need for it. > > > > > > > > While I agree this is a valid point, I believe its more interesting to have it > > > > implemented if any future mechanism wants to make use of this. > > > > > > > > > > > > Thanks! > > > > Leo > > > > > > > > -- > > Best Regards > > Guo Ren
On Fri, Aug 11, 2023 at 09:40:30AM +0800, Guo Ren wrote: > On Fri, Aug 11, 2023 at 12:23 AM Palmer Dabbelt <palmer@rivosinc.com> wrote: > > > > On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@redhat.com wrote: > > > On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote: > > >> On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote: > > >> > xchg for variables of size 1-byte and 2-bytes is not yet available for > > >> > riscv, even though its present in other architectures such as arm64 and > > >> > x86. This could lead to not being able to implement some locking mechanisms > > >> > or requiring some rework to make it work properly. > > >> > > > >> > Implement 1-byte and 2-bytes xchg in order to achieve parity with other > > >> > architectures. > > >> > > > >> > Signed-off-by: Leonardo Bras <leobras@redhat.com> > > >> > > > > > > Hello Arnd Bergmann, thanks for reviewing! > > > > > >> Parity with other architectures by itself is not a reason to do this, > > >> in particular the other architectures you listed have the instructions > > >> in hardware while riscv does not. > > > > > > Sure, I understand RISC-V don't have native support for xchg on variables of > > > size < 4B. My argument is that it's nice to have even an emulated version for > > > this in case any future mechanism wants to use it. > > > > > > Not having it may mean we won't be able to enable given mechanism in RISC-V. > > > > IIUC the ask is to have a user within the kernel for these functions. > > That's the general thing to do, and last time this came up there was no > > in-kernel use of it -- the qspinlock stuff would, but we haven't enabled > > it yet because we're worried about the performance/fairness stuff that > > other ports have seen and nobody's got concrete benchmarks yet (though > > there's another patch set out that I haven't had time to look through, > > so that may have changed). > Conor doesn't agree with using an alternative as a detour mechanism > between qspinlock & ticket lock. So I'm preparing V11 with static_key > (jump_label) style. Next version, I would separate paravirt_qspinlock > & CNA_qspinlock from V10. That would make it easy to review the > qspinlock patch series. You can review the next version V11. Now I'm > debugging a static_key init problem when load_modules, which is > triggered by our combo_qspinlock. We've seen problems with static keys and module loading in the past. You may want to take a look at commit eb6354e11630 ("riscv: Ensure isa-ext static keys are writable") Thanks, drew > > The qspinlock is being tested on the riscv platform [1] with 128 cores > with 8 NUMA nodes, next, I would update the comparison results of > qspinlock & ticket lock. > > [1]: https://www.sophon.ai/ > > > > > So if something uses these I'm happy to go look closer. > > > > >> Emulating the small xchg() through cmpxchg() is particularly tricky > > >> since it's easy to run into a case where this does not guarantee > > >> forward progress. > > >> > > > > > > Didn't get this part: > > > By "emulating small xchg() through cmpxchg()", did you mean like emulating an > > > xchg (usually 1 instruction) with lr & sc (same used in cmpxchg) ? > > > > > > If so, yeah, it's a fair point: in some extreme case we could have multiple > > > threads accessing given cacheline and have sc always failing. On the other hand, > > > there are 2 arguments on that: > > > > > > 1 - Other architectures, (such as powerpc, arm and arm64 without LSE atomics) > > > also seem to rely in this mechanism for every xchg size. 
Another archs like csky > > > and loongarch use asm that look like mine to handle size < 4B xchg. > > > > > > > > >> This is also something that almost no architecture > > >> specific code relies on (generic qspinlock being a notable exception). > > >> > > > > > > 2 - As you mentioned, there should be very little code that will actually make > > > use of xchg for vars < 4B, so it should be safe to assume its fine to not > > > guarantee forward progress for those rare usages (like some of above mentioned > > > archs). > > > > > >> I would recommend just dropping this patch from the series, at least > > >> until there is a need for it. > > > > > > While I agree this is a valid point, I believe its more interesting to have it > > > implemented if any future mechanism wants to make use of this. > > > > > > > > > Thanks! > > > Leo > > > > -- > Best Regards > Guo Ren > > _______________________________________________ > linux-riscv mailing list > linux-riscv@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-riscv
On Fri, Aug 11, 2023 at 7:10 PM Andrew Jones <ajones@ventanamicro.com> wrote: > > On Fri, Aug 11, 2023 at 09:40:30AM +0800, Guo Ren wrote: > > On Fri, Aug 11, 2023 at 12:23 AM Palmer Dabbelt <palmer@rivosinc.com> wrote: > > > > > > On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@redhat.com wrote: > > > > On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote: > > > >> On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote: > > > >> > xchg for variables of size 1-byte and 2-bytes is not yet available for > > > >> > riscv, even though its present in other architectures such as arm64 and > > > >> > x86. This could lead to not being able to implement some locking mechanisms > > > >> > or requiring some rework to make it work properly. > > > >> > > > > >> > Implement 1-byte and 2-bytes xchg in order to achieve parity with other > > > >> > architectures. > > > >> > > > > >> > Signed-off-by: Leonardo Bras <leobras@redhat.com> > > > >> > > > > > > > > Hello Arnd Bergmann, thanks for reviewing! > > > > > > > >> Parity with other architectures by itself is not a reason to do this, > > > >> in particular the other architectures you listed have the instructions > > > >> in hardware while riscv does not. > > > > > > > > Sure, I understand RISC-V don't have native support for xchg on variables of > > > > size < 4B. My argument is that it's nice to have even an emulated version for > > > > this in case any future mechanism wants to use it. > > > > > > > > Not having it may mean we won't be able to enable given mechanism in RISC-V. > > > > > > IIUC the ask is to have a user within the kernel for these functions. > > > That's the general thing to do, and last time this came up there was no > > > in-kernel use of it -- the qspinlock stuff would, but we haven't enabled > > > it yet because we're worried about the performance/fairness stuff that > > > other ports have seen and nobody's got concrete benchmarks yet (though > > > there's another patch set out that I haven't had time to look through, > > > so that may have changed). > > Conor doesn't agree with using an alternative as a detour mechanism > > between qspinlock & ticket lock. So I'm preparing V11 with static_key > > (jump_label) style. Next version, I would separate paravirt_qspinlock > > & CNA_qspinlock from V10. That would make it easy to review the > > qspinlock patch series. You can review the next version V11. Now I'm > > debugging a static_key init problem when load_modules, which is > > triggered by our combo_qspinlock. > > We've seen problems with static keys and module loading in the past. You > may want to take a look at commit eb6354e11630 ("riscv: Ensure isa-ext > static keys are writable") Thank you for the tip. I would take a look. > > Thanks, > drew > > > > > The qspinlock is being tested on the riscv platform [1] with 128 cores > > with 8 NUMA nodes, next, I would update the comparison results of > > qspinlock & ticket lock. > > > > [1]: https://www.sophon.ai/ > > > > > > > > So if something uses these I'm happy to go look closer. > > > > > > >> Emulating the small xchg() through cmpxchg() is particularly tricky > > > >> since it's easy to run into a case where this does not guarantee > > > >> forward progress. > > > >> > > > > > > > > Didn't get this part: > > > > By "emulating small xchg() through cmpxchg()", did you mean like emulating an > > > > xchg (usually 1 instruction) with lr & sc (same used in cmpxchg) ? 
> > > > > > > > If so, yeah, it's a fair point: in some extreme case we could have multiple > > > > threads accessing given cacheline and have sc always failing. On the other hand, > > > > there are 2 arguments on that: > > > > > > > > 1 - Other architectures, (such as powerpc, arm and arm64 without LSE atomics) > > > > also seem to rely in this mechanism for every xchg size. Another archs like csky > > > > and loongarch use asm that look like mine to handle size < 4B xchg. > > > > > > > > > > > >> This is also something that almost no architecture > > > >> specific code relies on (generic qspinlock being a notable exception). > > > >> > > > > > > > > 2 - As you mentioned, there should be very little code that will actually make > > > > use of xchg for vars < 4B, so it should be safe to assume its fine to not > > > > guarantee forward progress for those rare usages (like some of above mentioned > > > > archs). > > > > > > > >> I would recommend just dropping this patch from the series, at least > > > >> until there is a need for it. > > > > > > > > While I agree this is a valid point, I believe its more interesting to have it > > > > implemented if any future mechanism wants to make use of this. > > > > > > > > > > > > Thanks! > > > > Leo > > > > > > > > -- > > Best Regards > > Guo Ren > > > > _______________________________________________ > > linux-riscv mailing list > > linux-riscv@lists.infradead.org > > http://lists.infradead.org/mailman/listinfo/linux-riscv
Hello everyone, Sorry for the delay, I was out of office for a while. On Fri, 2023-08-11 at 09:24 +0800, Guo Ren wrote: > On Fri, Aug 11, 2023 at 3:13 AM Arnd Bergmann <arnd@arndb.de> wrote: > > > > On Thu, Aug 10, 2023, at 18:23, Palmer Dabbelt wrote: > > > On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@redhat.com wrote: > > > > On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote: > > > > > On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote: > > > > > > xchg for variables of size 1-byte and 2-bytes is not yet available for > > > > > > riscv, even though its present in other architectures such as arm64 and > > > > > > x86. This could lead to not being able to implement some locking mechanisms > > > > > > or requiring some rework to make it work properly. > > > > > > > > > > > > Implement 1-byte and 2-bytes xchg in order to achieve parity with other > > > > > > architectures. > > > > > > > > > Parity with other architectures by itself is not a reason to do this, > > > > > in particular the other architectures you listed have the instructions > > > > > in hardware while riscv does not. > > > > > > > > Sure, I understand RISC-V don't have native support for xchg on variables of > > > > size < 4B. My argument is that it's nice to have even an emulated version for > > > > this in case any future mechanism wants to use it. > > > > > > > > Not having it may mean we won't be able to enable given mechanism in RISC-V. > > > > > > IIUC the ask is to have a user within the kernel for these functions. > > > That's the general thing to do, and last time this came up there was no > > > in-kernel use of it -- the qspinlock stuff would, but we haven't enabled > > > it yet because we're worried about the performance/fairness stuff that > > > other ports have seen and nobody's got concrete benchmarks yet (though > > > there's another patch set out that I haven't had time to look through, > > > so that may have changed). > > > > Right. In particular the qspinlock is a good example for something > > where having the emulated 16-bit xchg() may end up less efficient > > than a natively supported instruction. > The xchg() efficiency depends on micro-architecture. and the number of > instructions is not the key, even one instruction would be separated > into several micro-ops. I thought the Power guys won't agree with this > view :) > > > > > The xchg() here is a performance optimization for CPUs that can > > do this without touching the other half of the 32-bit word. > It's useless on a non-SMT system because all operations are cacheline > based. (Ps: Because xchg() has a load semantic, CHI's "Dirty Partial" > & "Clean Empty" can't help anymore.) > > > > > > > > > > > Didn't get this part: > > > > By "emulating small xchg() through cmpxchg()", did you mean like emulating an > > > > xchg (usually 1 instruction) with lr & sc (same used in cmpxchg) ? > > > > > > > > If so, yeah, it's a fair point: in some extreme case we could have multiple > > > > threads accessing given cacheline and have sc always failing. On the other hand, > > > > there are 2 arguments on that: > > > > > > > > 1 - Other architectures, (such as powerpc, arm and arm64 without LSE atomics) > > > > also seem to rely in this mechanism for every xchg size. Another archs like csky > > > > and loongarch use asm that look like mine to handle size < 4B xchg. > > > > I think you misread the arm64 code, which should use native instructions > > for all sizes, in both the armv8.0 and LSE atomics. 
By native I understand you mean swp instead of ll/sc, right?

Well, that's right only if the kernel is compiled with LSE, and the ll/sc option is available for other arm64 systems that don't have it. Also, according to Kconfig, LSE seems to have been introduced in ARMv8.1, meaning arm64 on (at least some) ARMv8.0 uses ll/sc, and this is why xchg with the ll/sc code is available for 1, 2, 4 and 8 bytes in arch/arm64/include/asm/cmpxchg.h:

#define __XCHG_CASE(w, sfx, name, sz, mb, nop_lse, acq, acq_lse, rel, cl)	\
static inline u##sz __xchg_case_##name##sz(u##sz x, volatile void *ptr)	\
{									\
	u##sz ret;							\
	unsigned long tmp;						\
									\
	asm volatile(ARM64_LSE_ATOMIC_INSN(				\
	/* LL/SC */							\
	"	prfm	pstl1strm, %2\n"				\
	"1:	ld" #acq "xr" #sfx "\t%" #w "0, %2\n"			\
	"	st" #rel "xr" #sfx "\t%w1, %" #w "3, %2\n"		\
	"	cbnz	%w1, 1b\n"					\
	"	" #mb,							\
	/* LSE atomics */						\
	"	swp" #acq_lse #rel #sfx "\t%" #w "3, %" #w "0, %2\n"	\
[...]

__XCHG_CASE(w, b, , 8, , , , , , )
__XCHG_CASE(w, h, , 16, , , , , , )
__XCHG_CASE(w, , , 32, , , , , , )

> >
> > PowerPC does use the masking for xchg, but I suspect there are no
> > actual users, at least it actually has its own qspinlock implementation
> > that avoids xchg().
> PowerPC still needs similar things, see publish_tail_cpu(), and more
> complex cmpxchg semantics.
>
> Paravirt qspinlock and CNA qspinlock still need more:
> - xchg8 (RCsc)
> - cmpxchg8/16_relaxed
> - cmpxchg8/16_release (RCpc)
> - cmpxchg8_acquire (RCpc)
> - cmpxchg8 (RCsc)
>
> > > > > This is also something that almost no architecture
> > > > > specific code relies on (generic qspinlock being a notable exception).
> > > > >
> > > >
> > > > 2 - As you mentioned, there should be very little code that will actually make
> > > > use of xchg for vars < 4B, so it should be safe to assume its fine to not
> > > > guarantee forward progress for those rare usages (like some of above mentioned
> > > > archs).
> >
> > I don't think this is a safe assumption, we've had endless discussions
> > about using qspinlock on architectures without a native xchg(), which
> > needs either hardware guarantees or special countermeasures in xchg() itself
> > to avoid this.

That seems like a nice discussion, do you have a link for it?

By what I could see, Guo Ren is doing great work on proving that using qspinlock (with smaller xchg) performs better on RISC-V.

> >
> > What I'd actually like to do here is to remove the special 8-bit and
> > 16-bit cases from the xchg() and cmpxchg() interfaces at all, leaving
> It needs to modify qspinlock, paravirt_qspinlock, and CNA_qspinlock
> code to prevent using 8-bit/16-bit xchg/cmpxchg, and cleanup all
> architectures' cmpxchg.h. What you do is just get them out of the
> common atomic.h, but architectures still need to solve them and
> connect to the qspinlock series.
>
> > only fixed 32-bit and native wordsize (either 32 or 64) as the option,
> > while dealing with the others the same way we treat the fixed
> > 64-bit cases that hardcode the 64-bit argument types and are only
> > usable on architectures that provide them.
> >
> > Arnd
>

IIUC, xchg for size 1 & 2 can still be useful if having a bigger lock variable causes the target struct to use more than a cacheline. This could reduce cache usage and avoid some cacheline misses. Even though on some arches those 'non-native' xchg can take longer, it can be perceived as a valid tradeoff for some scenarios.

Thanks,
Leo
On Thu, 2023-08-10 at 09:23 -0700, Palmer Dabbelt wrote: > On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@redhat.com wrote: > > On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote: > > > On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote: > > > > xchg for variables of size 1-byte and 2-bytes is not yet available for > > > > riscv, even though its present in other architectures such as arm64 and > > > > x86. This could lead to not being able to implement some locking mechanisms > > > > or requiring some rework to make it work properly. > > > > > > > > Implement 1-byte and 2-bytes xchg in order to achieve parity with other > > > > architectures. > > > > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com> > > > > > > > Hello Arnd Bergmann, thanks for reviewing! > > > > > Parity with other architectures by itself is not a reason to do this, > > > in particular the other architectures you listed have the instructions > > > in hardware while riscv does not. > > > > Sure, I understand RISC-V don't have native support for xchg on variables of > > size < 4B. My argument is that it's nice to have even an emulated version for > > this in case any future mechanism wants to use it. > > > > Not having it may mean we won't be able to enable given mechanism in RISC-V. > > IIUC the ask is to have a user within the kernel for these functions. > That's the general thing to do, and last time this came up there was no > in-kernel use of it -- the qspinlock stuff would, but we haven't enabled > it yet because we're worried about the performance/fairness stuff that > other ports have seen and nobody's got concrete benchmarks yet (though > there's another patch set out that I haven't had time to look through, > so that may have changed). > > So if something uses these I'm happy to go look closer. IIUC patches 4 & 5 will be used by qspinlock, which may not be done yet, so we don't have an use for them for the time being. Otherwise, any comments on patches 1, 2 & 3? > > > > Emulating the small xchg() through cmpxchg() is particularly tricky > > > since it's easy to run into a case where this does not guarantee > > > forward progress. > > > > > > > Didn't get this part: > > By "emulating small xchg() through cmpxchg()", did you mean like emulating an > > xchg (usually 1 instruction) with lr & sc (same used in cmpxchg) ? > > > > If so, yeah, it's a fair point: in some extreme case we could have multiple > > threads accessing given cacheline and have sc always failing. On the other hand, > > there are 2 arguments on that: > > > > 1 - Other architectures, (such as powerpc, arm and arm64 without LSE atomics) > > also seem to rely in this mechanism for every xchg size. Another archs like csky > > and loongarch use asm that look like mine to handle size < 4B xchg. > > > > > > > This is also something that almost no architecture > > > specific code relies on (generic qspinlock being a notable exception). > > > > > > > 2 - As you mentioned, there should be very little code that will actually make > > use of xchg for vars < 4B, so it should be safe to assume its fine to not > > guarantee forward progress for those rare usages (like some of above mentioned > > archs). > > > > > I would recommend just dropping this patch from the series, at least > > > until there is a need for it. > > > > While I agree this is a valid point, I believe its more interesting to have it > > implemented if any future mechanism wants to make use of this. > > > > > > Thanks! > > Leo >
On Wed, Aug 30, 2023 at 6:59 PM Leonardo Brás <leobras@redhat.com> wrote: > > On Thu, 2023-08-10 at 09:23 -0700, Palmer Dabbelt wrote: > > On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@redhat.com wrote: > > > On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote: > > > > On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote: > > > > > xchg for variables of size 1-byte and 2-bytes is not yet available for > > > > > riscv, even though its present in other architectures such as arm64 and > > > > > x86. This could lead to not being able to implement some locking mechanisms > > > > > or requiring some rework to make it work properly. > > > > > > > > > > Implement 1-byte and 2-bytes xchg in order to achieve parity with other > > > > > architectures. > > > > > > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com> > > > > > > > > > > Hello Arnd Bergmann, thanks for reviewing! > > > > > > > Parity with other architectures by itself is not a reason to do this, > > > > in particular the other architectures you listed have the instructions > > > > in hardware while riscv does not. > > > > > > Sure, I understand RISC-V don't have native support for xchg on variables of > > > size < 4B. My argument is that it's nice to have even an emulated version for > > > this in case any future mechanism wants to use it. > > > > > > Not having it may mean we won't be able to enable given mechanism in RISC-V. > > > > IIUC the ask is to have a user within the kernel for these functions. > > That's the general thing to do, and last time this came up there was no > > in-kernel use of it -- the qspinlock stuff would, but we haven't enabled > > it yet because we're worried about the performance/fairness stuff that > > other ports have seen and nobody's got concrete benchmarks yet (though > > there's another patch set out that I haven't had time to look through, > > so that may have changed). > > > > So if something uses these I'm happy to go look closer. > > IIUC patches 4 & 5 will be used by qspinlock, which may not be done yet, so we > don't have an use for them for the time being. > > Otherwise, any comments on patches 1, 2 & 3? Hello Palmer, Any chance of patches 1, 2 & 3 being merged in this merge window? Thanks! > > > > > > > Emulating the small xchg() through cmpxchg() is particularly tricky > > > > since it's easy to run into a case where this does not guarantee > > > > forward progress. > > > > > > > > > > Didn't get this part: > > > By "emulating small xchg() through cmpxchg()", did you mean like emulating an > > > xchg (usually 1 instruction) with lr & sc (same used in cmpxchg) ? > > > > > > If so, yeah, it's a fair point: in some extreme case we could have multiple > > > threads accessing given cacheline and have sc always failing. On the other hand, > > > there are 2 arguments on that: > > > > > > 1 - Other architectures, (such as powerpc, arm and arm64 without LSE atomics) > > > also seem to rely in this mechanism for every xchg size. Another archs like csky > > > and loongarch use asm that look like mine to handle size < 4B xchg. > > > > > > > > > > This is also something that almost no architecture > > > > specific code relies on (generic qspinlock being a notable exception). > > > > > > > > > > 2 - As you mentioned, there should be very little code that will actually make > > > use of xchg for vars < 4B, so it should be safe to assume its fine to not > > > guarantee forward progress for those rare usages (like some of above mentioned > > > archs). 
> > > > > > > I would recommend just dropping this patch from the series, at least > > > > until there is a need for it. > > > > > > While I agree this is a valid point, I believe its more interesting to have it > > > implemented if any future mechanism wants to make use of this. > > > > > > > > > Thanks! > > > Leo > > >
From: Leonardo Bras <leobras@redhat.com> On Wed, Aug 30, 2023 at 06:59:46PM -0300, Leonardo Brás wrote: > On Thu, 2023-08-10 at 09:23 -0700, Palmer Dabbelt wrote: > > On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@redhat.com wrote: > > > On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote: > > > > On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote: > > > > > xchg for variables of size 1-byte and 2-bytes is not yet available for > > > > > riscv, even though its present in other architectures such as arm64 and > > > > > x86. This could lead to not being able to implement some locking mechanisms > > > > > or requiring some rework to make it work properly. > > > > > > > > > > Implement 1-byte and 2-bytes xchg in order to achieve parity with other > > > > > architectures. > > > > > > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com> > > > > > > > > > > Hello Arnd Bergmann, thanks for reviewing! > > > > > > > Parity with other architectures by itself is not a reason to do this, > > > > in particular the other architectures you listed have the instructions > > > > in hardware while riscv does not. > > > > > > Sure, I understand RISC-V don't have native support for xchg on variables of > > > size < 4B. My argument is that it's nice to have even an emulated version for > > > this in case any future mechanism wants to use it. > > > > > > Not having it may mean we won't be able to enable given mechanism in RISC-V. > > > > IIUC the ask is to have a user within the kernel for these functions. > > That's the general thing to do, and last time this came up there was no > > in-kernel use of it -- the qspinlock stuff would, but we haven't enabled > > it yet because we're worried about the performance/fairness stuff that > > other ports have seen and nobody's got concrete benchmarks yet (though > > there's another patch set out that I haven't had time to look through, > > so that may have changed). > > > > So if something uses these I'm happy to go look closer. > > IIUC patches 4 & 5 will be used by qspinlock, which may not be done yet, so we > don't have an use for them for the time being. > > Otherwise, any comments on patches 1, 2 & 3? ping > > > > > > > Emulating the small xchg() through cmpxchg() is particularly tricky > > > > since it's easy to run into a case where this does not guarantee > > > > forward progress. > > > > > > > > > > Didn't get this part: > > > By "emulating small xchg() through cmpxchg()", did you mean like emulating an > > > xchg (usually 1 instruction) with lr & sc (same used in cmpxchg) ? > > > > > > If so, yeah, it's a fair point: in some extreme case we could have multiple > > > threads accessing given cacheline and have sc always failing. On the other hand, > > > there are 2 arguments on that: > > > > > > 1 - Other architectures, (such as powerpc, arm and arm64 without LSE atomics) > > > also seem to rely in this mechanism for every xchg size. Another archs like csky > > > and loongarch use asm that look like mine to handle size < 4B xchg. > > > > > > > > > > This is also something that almost no architecture > > > > specific code relies on (generic qspinlock being a notable exception). > > > > > > > > > > 2 - As you mentioned, there should be very little code that will actually make > > > use of xchg for vars < 4B, so it should be safe to assume its fine to not > > > guarantee forward progress for those rare usages (like some of above mentioned > > > archs). 
> > > > > > > I would recommend just dropping this patch from the series, at least > > > > until there is a need for it. > > > > > > While I agree this is a valid point, I believe its more interesting to have it > > > implemented if any future mechanism wants to make use of this. > > > > > > > > > Thanks! > > > Leo > > >
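The forward-progress concern raised in the quoted exchange can be made concrete with a rough, hypothetical sketch of how a 1-byte xchg() could be emulated on top of a word-sized cmpxchg(). This is only an illustration under that assumption, not the code proposed in the patch, and the helper name is made up:

/*
 * Hypothetical sketch only (not the patch's implementation): emulate a
 * 1-byte xchg with a 32-bit cmpxchg loop on the containing aligned word.
 * Assumes a little-endian byte layout, as on RISC-V.
 */
#include <linux/atomic.h>
#include <linux/bits.h>
#include <linux/compiler.h>
#include <linux/types.h>

static inline u8 xchg8_via_cmpxchg(u8 *p, u8 new)
{
	u32 *wp = (u32 *)((unsigned long)p & ~0x3UL);
	unsigned int shift = ((unsigned long)p & 0x3) * BITS_PER_BYTE;
	u32 mask = 0xffU << shift;
	u32 old, tmp;

	do {
		old = READ_ONCE(*wp);
		/* Replace only the target byte inside the loaded word. */
		tmp = (old & ~mask) | ((u32)new << shift);
		/*
		 * If any other byte of the same word changed between the
		 * load and the cmpxchg, the cmpxchg fails and we retry.
		 */
	} while (cmpxchg(wp, old, tmp) != old);

	return (u8)((old & mask) >> shift);
}

Any concurrent store to any byte of the containing word makes the cmpxchg() fail, so under heavy contention on that cacheline the loop can in principle retry indefinitely; the lr.w/sc.w loop used by the patch below has the same theoretical property.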
On Tue, Dec 05, 2023 at 09:56:44PM -0300, leobras.c@gmail.com wrote: > From: Leonardo Bras <leobras@redhat.com> > > On Wed, Aug 30, 2023 at 06:59:46PM -0300, Leonardo Brás wrote: > > On Thu, 2023-08-10 at 09:23 -0700, Palmer Dabbelt wrote: > > > On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@redhat.com wrote: > > > > On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote: > > > > > On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote: > > > > > > xchg for variables of size 1-byte and 2-bytes is not yet available for > > > > > > riscv, even though its present in other architectures such as arm64 and > > > > > > x86. This could lead to not being able to implement some locking mechanisms > > > > > > or requiring some rework to make it work properly. > > > > > > > > > > > > Implement 1-byte and 2-bytes xchg in order to achieve parity with other > > > > > > architectures. > > > > > > > > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com> > > > > > > > > > > > > > Hello Arnd Bergmann, thanks for reviewing! > > > > > > > > > Parity with other architectures by itself is not a reason to do this, > > > > > in particular the other architectures you listed have the instructions > > > > > in hardware while riscv does not. > > > > > > > > Sure, I understand RISC-V don't have native support for xchg on variables of > > > > size < 4B. My argument is that it's nice to have even an emulated version for > > > > this in case any future mechanism wants to use it. > > > > > > > > Not having it may mean we won't be able to enable given mechanism in RISC-V. > > > > > > IIUC the ask is to have a user within the kernel for these functions. > > > That's the general thing to do, and last time this came up there was no > > > in-kernel use of it -- the qspinlock stuff would, but we haven't enabled > > > it yet because we're worried about the performance/fairness stuff that > > > other ports have seen and nobody's got concrete benchmarks yet (though > > > there's another patch set out that I haven't had time to look through, > > > so that may have changed). > > > > > > So if something uses these I'm happy to go look closer. > > > > IIUC patches 4 & 5 will be used by qspinlock, which may not be done yet, so we > > don't have an use for them for the time being. > > > > Otherwise, any comments on patches 1, 2 & 3? > > ping Hi, I believe the "RFC" makes some reviewers think the series isn't ready for review, so could you please send a new one w/o RFC? thanks > > > > > > > > > > > Emulating the small xchg() through cmpxchg() is particularly tricky > > > > > since it's easy to run into a case where this does not guarantee > > > > > forward progress. > > > > > > > > > > > > > Didn't get this part: > > > > By "emulating small xchg() through cmpxchg()", did you mean like emulating an > > > > xchg (usually 1 instruction) with lr & sc (same used in cmpxchg) ? > > > > > > > > If so, yeah, it's a fair point: in some extreme case we could have multiple > > > > threads accessing given cacheline and have sc always failing. On the other hand, > > > > there are 2 arguments on that: > > > > > > > > 1 - Other architectures, (such as powerpc, arm and arm64 without LSE atomics) > > > > also seem to rely in this mechanism for every xchg size. Another archs like csky > > > > and loongarch use asm that look like mine to handle size < 4B xchg. > > > > > > > > > > > > > This is also something that almost no architecture > > > > > specific code relies on (generic qspinlock being a notable exception). 
> > > > > > > > > > > > > 2 - As you mentioned, there should be very little code that will actually make > > > > use of xchg for vars < 4B, so it should be safe to assume its fine to not > > > > guarantee forward progress for those rare usages (like some of above mentioned > > > > archs). > > > > > > > > > I would recommend just dropping this patch from the series, at least > > > > > until there is a need for it. > > > > > > > > While I agree this is a valid point, I believe its more interesting to have it > > > > implemented if any future mechanism wants to make use of this. > > > > > > > > > > > > Thanks! > > > > Leo > > > > > > > _______________________________________________ > linux-riscv mailing list > linux-riscv@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-riscv
On Wed, Jan 03, 2024 at 07:05:04PM +0800, Jisheng Zhang wrote: > On Tue, Dec 05, 2023 at 09:56:44PM -0300, leobras.c@gmail.com wrote: > > From: Leonardo Bras <leobras@redhat.com> > > > > On Wed, Aug 30, 2023 at 06:59:46PM -0300, Leonardo Brás wrote: > > > On Thu, 2023-08-10 at 09:23 -0700, Palmer Dabbelt wrote: > > > > On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@redhat.com wrote: > > > > > On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote: > > > > > > On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote: > > > > > > > xchg for variables of size 1-byte and 2-bytes is not yet available for > > > > > > > riscv, even though its present in other architectures such as arm64 and > > > > > > > x86. This could lead to not being able to implement some locking mechanisms > > > > > > > or requiring some rework to make it work properly. > > > > > > > > > > > > > > Implement 1-byte and 2-bytes xchg in order to achieve parity with other > > > > > > > architectures. > > > > > > > > > > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com> > > > > > > > > > > > > > > > > Hello Arnd Bergmann, thanks for reviewing! > > > > > > > > > > > Parity with other architectures by itself is not a reason to do this, > > > > > > in particular the other architectures you listed have the instructions > > > > > > in hardware while riscv does not. > > > > > > > > > > Sure, I understand RISC-V don't have native support for xchg on variables of > > > > > size < 4B. My argument is that it's nice to have even an emulated version for > > > > > this in case any future mechanism wants to use it. > > > > > > > > > > Not having it may mean we won't be able to enable given mechanism in RISC-V. > > > > > > > > IIUC the ask is to have a user within the kernel for these functions. > > > > That's the general thing to do, and last time this came up there was no > > > > in-kernel use of it -- the qspinlock stuff would, but we haven't enabled > > > > it yet because we're worried about the performance/fairness stuff that > > > > other ports have seen and nobody's got concrete benchmarks yet (though > > > > there's another patch set out that I haven't had time to look through, > > > > so that may have changed). > > > > > > > > So if something uses these I'm happy to go look closer. > > > > > > IIUC patches 4 & 5 will be used by qspinlock, which may not be done yet, so we > > > don't have an use for them for the time being. > > > > > > Otherwise, any comments on patches 1, 2 & 3? > > > > ping > > Hi, > > I believe the "RFC" makes some reviewers think the series isn't ready > for review, so could you please send a new one w/o RFC? > > thanks Sure, will do. Thanks! Leo > > > > > > > > > > > > > > > > Emulating the small xchg() through cmpxchg() is particularly tricky > > > > > > since it's easy to run into a case where this does not guarantee > > > > > > forward progress. > > > > > > > > > > > > > > > > Didn't get this part: > > > > > By "emulating small xchg() through cmpxchg()", did you mean like emulating an > > > > > xchg (usually 1 instruction) with lr & sc (same used in cmpxchg) ? > > > > > > > > > > If so, yeah, it's a fair point: in some extreme case we could have multiple > > > > > threads accessing given cacheline and have sc always failing. On the other hand, > > > > > there are 2 arguments on that: > > > > > > > > > > 1 - Other architectures, (such as powerpc, arm and arm64 without LSE atomics) > > > > > also seem to rely in this mechanism for every xchg size. 
Another archs like csky > > > > > and loongarch use asm that look like mine to handle size < 4B xchg. > > > > > > > > > > > > > > > > This is also something that almost no architecture > > > > > > specific code relies on (generic qspinlock being a notable exception). > > > > > > > > > > > > > > > > 2 - As you mentioned, there should be very little code that will actually make > > > > > use of xchg for vars < 4B, so it should be safe to assume its fine to not > > > > > guarantee forward progress for those rare usages (like some of above mentioned > > > > > archs). > > > > > > > > > > > I would recommend just dropping this patch from the series, at least > > > > > > until there is a need for it. > > > > > > > > > > While I agree this is a valid point, I believe its more interesting to have it > > > > > implemented if any future mechanism wants to make use of this. > > > > > > > > > > > > > > > Thanks! > > > > > Leo > > > > > > > > > > > _______________________________________________ > > linux-riscv mailing list > > linux-riscv@lists.infradead.org > > http://lists.infradead.org/mailman/listinfo/linux-riscv >
diff --git a/arch/riscv/include/asm/cmpxchg.h b/arch/riscv/include/asm/cmpxchg.h
index ac9d0eeb74e6..26cea2395aae 100644
--- a/arch/riscv/include/asm/cmpxchg.h
+++ b/arch/riscv/include/asm/cmpxchg.h
@@ -11,6 +11,31 @@
 #include <asm/barrier.h>
 #include <asm/fence.h>
 
+#define __arch_xchg_masked(prepend, append, r, p, n)			\
+({									\
+	u32 *__ptr32b = (u32 *)((ulong)(p) & ~0x3);			\
+	ulong __s = ((ulong)(p) & (0x4 - sizeof(*p))) * BITS_PER_BYTE;	\
+	ulong __mask = GENMASK(((sizeof(*p)) * BITS_PER_BYTE) - 1, 0)	\
+			<< __s;						\
+	ulong __newx = (ulong)(n) << __s;				\
+	ulong __retx;							\
+	ulong __rc;							\
+									\
+	__asm__ __volatile__ (						\
+	       prepend							\
+	       "0:	lr.w %0, %2\n"					\
+	       "	and  %1, %0, %z4\n"				\
+	       "	or   %1, %1, %z3\n"				\
+	       "	sc.w %1, %1, %2\n"				\
+	       "	bnez %1, 0b\n"					\
+	       append							\
+	       : "=&r" (__retx), "=&r" (__rc), "+A" (*(__ptr32b))	\
+	       : "rJ" (__newx), "rJ" (~__mask)				\
+	       : "memory");						\
+									\
+	r = (__typeof__(*(p)))((__retx & __mask) >> __s);		\
+})
+
 #define __arch_xchg(sfx, prepend, append, r, p, n)			\
 ({									\
 	__asm__ __volatile__ (						\
@@ -27,7 +52,13 @@
 	__typeof__(ptr) __ptr = (ptr);					\
 	__typeof__(*(__ptr)) __new = (new);				\
 	__typeof__(*(__ptr)) __ret;					\
+									\
 	switch (sizeof(*__ptr)) {					\
+	case 1:								\
+	case 2:								\
+		__arch_xchg_masked(prepend, append,			\
+				   __ret, __ptr, __new);		\
+		break;							\
 	case 4:								\
 		__arch_xchg(".w" sfx, prepend, append,			\
 			    __ret, __ptr, __new);			\
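To make the address arithmetic in __arch_xchg_masked easier to follow, here is a standalone sketch of the same shift/mask computation, worked for a hypothetical u16 living at the made-up address 0x1002; the values and variable names are illustrative only, not kernel code:

/*
 * Illustration of the __ptr32b/__s/__mask math for a 2-byte value at
 * address 0x1002 (address chosen only for the example).
 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uintptr_t p = 0x1002;			/* address of a 2-byte value */
	unsigned long size = sizeof(uint16_t);	/* plays sizeof(*p) in the macro */

	uintptr_t word = p & ~(uintptr_t)0x3;		/* __ptr32b -> 0x1000 */
	unsigned int shift = (p & (0x4 - size)) * 8;	/* __s      -> 16     */
	uint32_t mask = ((1u << (size * 8)) - 1) << shift; /* __mask -> 0xffff0000 */

	printf("word=%#lx shift=%u mask=%#x\n",
	       (unsigned long)word, shift, (unsigned int)mask);
	return 0;
}

The asm loop then operates on the whole 32-bit word: lr.w loads it, the and/or pair clears the masked bits and inserts the shifted new value, sc.w stores it back (retrying on failure), and the previous 1- or 2-byte value is recovered by masking and shifting the originally loaded word.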
xchg for variables of size 1 byte and 2 bytes is not yet available for
riscv, even though it's present in other architectures such as arm64 and
x86. This could prevent some locking mechanisms from being implemented,
or require rework to make them work properly.

Implement 1-byte and 2-byte xchg in order to achieve parity with other
architectures.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 arch/riscv/include/asm/cmpxchg.h | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)
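As a usage illustration only (the struct and function below are hypothetical and not part of the series), this is the kind of caller that 1-byte xchg support makes possible on riscv:

#include <linux/atomic.h>
#include <linux/types.h>

struct example_flag {
	u8 taken;	/* hypothetical 1-byte flag */
};

/* Atomically set the flag and report whether we were the one to set it. */
static inline bool example_try_take(struct example_flag *f)
{
	return xchg(&f->taken, 1) == 0;
}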