diff mbox series

[RFC,v5,5/5] riscv/cmpxchg: Implement xchg for variables of size 1 and 2

Message ID 20230810040349.92279-7-leobras@redhat.com (mailing list archive)
State RFC
Headers show
Series Rework & improve riscv cmpxchg.h and atomic.h | expand

Checks

Context Check Description
conchuod/cover_letter success Series has a cover letter
conchuod/tree_selection success Guessed tree name to be for-next at HEAD 174e8ac0272d
conchuod/fixes_present success Fixes tag not required for -next series
conchuod/maintainers_pattern success MAINTAINERS pattern errors before the patch: 4 and now 4
conchuod/verify_signedoff success Signed-off-by tag matches author and committer
conchuod/kdoc success Errors and warnings before: 0 this patch: 0
conchuod/build_rv64_clang_allmodconfig success Errors and warnings before: 2808 this patch: 2808
conchuod/module_param success Was 0 now: 0
conchuod/build_rv64_gcc_allmodconfig fail Errors and warnings before: 15918 this patch: 15928
conchuod/build_rv32_defconfig success Build OK
conchuod/dtb_warn_rv64 success Errors and warnings before: 12 this patch: 12
conchuod/header_inline success No static functions without inline keyword in header files
conchuod/checkpatch success total: 0 errors, 0 warnings, 0 checks, 44 lines checked
conchuod/build_rv64_nommu_k210_defconfig success Build OK
conchuod/verify_fixes success No Fixes tag
conchuod/build_rv64_nommu_virt_defconfig success Build OK

Commit Message

Leonardo Bras Aug. 10, 2023, 4:03 a.m. UTC
xchg for variables of size 1 byte and 2 bytes is not yet available for
riscv, even though it's present in other architectures such as arm64 and
x86. This could prevent some locking mechanisms from being implemented,
or require some rework to make them work properly.

Implement 1-byte and 2-byte xchg in order to achieve parity with other
architectures.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 arch/riscv/include/asm/cmpxchg.h | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)
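The diffstat above doesn't show the implementation itself, but the standard technique for sub-word xchg on architectures with only word-sized atomics is a masked read-modify-write on the aligned 32-bit word containing the operand. Below is a minimal C sketch of that idea, using GCC/Clang `__atomic` builtins in place of the patch's actual RISC-V lr.w/sc.w assembly; the function name, structure, and little-endian assumption are illustrative, not the patch's code:

```c
#include <stdint.h>

/* Emulate an atomic 1-byte exchange via a 32-bit compare-exchange loop
 * on the aligned word that contains the byte.  Illustrative only: the
 * real patch uses lr.w/sc.w inline assembly, and this sketch assumes a
 * little-endian layout when computing the byte's shift. */
static uint8_t xchg8_emulated(uint8_t *ptr, uint8_t newval)
{
	uintptr_t addr = (uintptr_t)ptr;
	uint32_t *word = (uint32_t *)(addr & ~(uintptr_t)3);
	unsigned int shift = (unsigned int)(addr & 3) * 8;
	uint32_t mask = 0xffu << shift;

	uint32_t old = __atomic_load_n(word, __ATOMIC_RELAXED);
	uint32_t new;
	do {
		/* Replace only the target byte; keep the other three. */
		new = (old & ~mask) | ((uint32_t)newval << shift);
		/* On failure, 'old' is refreshed with the current value. */
	} while (!__atomic_compare_exchange_n(word, &old, new, 1,
					      __ATOMIC_ACQ_REL,
					      __ATOMIC_RELAXED));
	return (uint8_t)((old & mask) >> shift);
}
```

Note that a failed compare-exchange here retries on *any* change to the containing word, not just the target byte, which is the root of the forward-progress discussion in the comments.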

Comments

Arnd Bergmann Aug. 10, 2023, 6:51 a.m. UTC | #1
On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote:
> xchg for variables of size 1-byte and 2-bytes is not yet available for
> riscv, even though its present in other architectures such as arm64 and
> x86. This could lead to not being able to implement some locking mechanisms
> or requiring some rework to make it work properly.
>
> Implement 1-byte and 2-bytes xchg in order to achieve parity with other
> architectures.
>
> Signed-off-by: Leonardo Bras <leobras@redhat.com>

Parity with other architectures by itself is not a reason to do this,
in particular the other architectures you listed have the instructions
in hardware while riscv does not.

Emulating the small xchg() through cmpxchg() is particularly tricky
since it's easy to run into a case where this does not guarantee
forward progress. This is also something that almost no architecture
specific code relies on (generic qspinlock being a notable exception).

I would recommend just dropping this patch from the series, at least
until there is a need for it.

    Arnd
Leonardo Bras Aug. 10, 2023, 4:04 p.m. UTC | #2
On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote:
> On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote:
> > xchg for variables of size 1-byte and 2-bytes is not yet available for
> > riscv, even though its present in other architectures such as arm64 and
> > x86. This could lead to not being able to implement some locking mechanisms
> > or requiring some rework to make it work properly.
> > 
> > Implement 1-byte and 2-bytes xchg in order to achieve parity with other
> > architectures.
> > 
> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> 

Hello Arnd Bergmann, thanks for reviewing!

> Parity with other architectures by itself is not a reason to do this,
> in particular the other architectures you listed have the instructions
> in hardware while riscv does not.

Sure, I understand RISC-V doesn't have native support for xchg on variables of
size < 4B. My argument is that it's nice to have even an emulated version of
this in case any future mechanism wants to use it.

Not having it may mean we won't be able to enable a given mechanism on RISC-V.

> 
> Emulating the small xchg() through cmpxchg() is particularly tricky
> since it's easy to run into a case where this does not guarantee
> forward progress.
> 

Didn't get this part:
By "emulating small xchg() through cmpxchg()", did you mean emulating an
xchg (usually 1 instruction) with lr & sc (the same as used in cmpxchg)?

If so, yeah, it's a fair point: in some extreme case we could have multiple
threads accessing a given cacheline and have sc always failing. On the other
hand, there are 2 counter-arguments to that:

1 - Other architectures (such as powerpc, arm, and arm64 without LSE atomics)
also seem to rely on this mechanism for every xchg size. Other archs like csky
and loongarch use asm that looks like mine to handle size < 4B xchg.

>  This is also something that almost no architecture
> specific code relies on (generic qspinlock being a notable exception).
> 

2 - As you mentioned, there should be very little code that will actually make
use of xchg for vars < 4B, so it should be safe to assume it's fine not to
guarantee forward progress for those rare usages (as some of the
above-mentioned archs do).

> I would recommend just dropping this patch from the series, at least
> until there is a need for it.

While I agree this is a valid point, I believe it's more interesting to have it
implemented in case any future mechanism wants to make use of it.


Thanks!
Leo
Palmer Dabbelt Aug. 10, 2023, 4:23 p.m. UTC | #3
On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@redhat.com wrote:
> On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote:
>> On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote:
>> > xchg for variables of size 1-byte and 2-bytes is not yet available for
>> > riscv, even though its present in other architectures such as arm64 and
>> > x86. This could lead to not being able to implement some locking mechanisms
>> > or requiring some rework to make it work properly.
>> > 
>> > Implement 1-byte and 2-bytes xchg in order to achieve parity with other
>> > architectures.
>> > 
>> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
>> 
>
> Hello Arnd Bergmann, thanks for reviewing!
>
>> Parity with other architectures by itself is not a reason to do this,
>> in particular the other architectures you listed have the instructions
>> in hardware while riscv does not.
>
> Sure, I understand RISC-V don't have native support for xchg on variables of
> size < 4B. My argument is that it's nice to have even an emulated version for
> this in case any future mechanism wants to use it.
>
> Not having it may mean we won't be able to enable given mechanism in RISC-V. 

IIUC the ask is to have a user within the kernel for these functions.  
That's the general thing to do, and last time this came up there was no 
in-kernel use of it -- the qspinlock stuff would, but we haven't enabled 
it yet because we're worried about the performance/fairness stuff that 
other ports have seen and nobody's got concrete benchmarks yet (though 
there's another patch set out that I haven't had time to look through, 
so that may have changed).

So if something uses these I'm happy to go look closer.

>> Emulating the small xchg() through cmpxchg() is particularly tricky
>> since it's easy to run into a case where this does not guarantee
>> forward progress.
>> 
>
> Didn't get this part:
> By "emulating small xchg() through cmpxchg()", did you mean like emulating an
> xchg (usually 1 instruction) with lr & sc (same used in cmpxchg) ?
>
> If so, yeah, it's a fair point: in some extreme case we could have multiple
> threads accessing given cacheline and have sc always failing. On the other hand,
> there are 2 arguments on that:
>
> 1 - Other architectures, (such as powerpc, arm and arm64 without LSE atomics)
> also seem to rely in this mechanism for every xchg size. Another archs like csky
> and loongarch use asm that look like mine to handle size < 4B xchg. 
>     
>
>>  This is also something that almost no architecture
>> specific code relies on (generic qspinlock being a notable exception).
>> 
>
> 2 - As you mentioned, there should be very little code that will actually make
> use of xchg for vars < 4B, so it should be safe to assume its fine to not
> guarantee forward progress for those rare usages (like some of above mentioned
> archs).
>
>> I would recommend just dropping this patch from the series, at least
>> until there is a need for it.
>
> While I agree this is a valid point, I believe its more interesting to have it
> implemented if any future mechanism wants to make use of this. 
>
>
> Thanks!
> Leo
Arnd Bergmann Aug. 10, 2023, 7:13 p.m. UTC | #4
On Thu, Aug 10, 2023, at 18:23, Palmer Dabbelt wrote:
> On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@redhat.com wrote:
>> On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote:
>>> On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote:
>>> > xchg for variables of size 1-byte and 2-bytes is not yet available for
>>> > riscv, even though its present in other architectures such as arm64 and
>>> > x86. This could lead to not being able to implement some locking mechanisms
>>> > or requiring some rework to make it work properly.
>>> > 
>>> > Implement 1-byte and 2-bytes xchg in order to achieve parity with other
>>> > architectures.
>>
>>> Parity with other architectures by itself is not a reason to do this,
>>> in particular the other architectures you listed have the instructions
>>> in hardware while riscv does not.
>>
>> Sure, I understand RISC-V don't have native support for xchg on variables of
>> size < 4B. My argument is that it's nice to have even an emulated version for
>> this in case any future mechanism wants to use it.
>>
>> Not having it may mean we won't be able to enable given mechanism in RISC-V. 
>
> IIUC the ask is to have a user within the kernel for these functions.  
> That's the general thing to do, and last time this came up there was no 
> in-kernel use of it -- the qspinlock stuff would, but we haven't enabled 
> it yet because we're worried about the performance/fairness stuff that 
> other ports have seen and nobody's got concrete benchmarks yet (though 
> there's another patch set out that I haven't had time to look through, 
> so that may have changed).

Right. In particular the qspinlock is a good example for something
where having the emulated 16-bit xchg() may end up less efficient
than a natively supported instruction.

The xchg() here is a performance optimization for CPUs that can
do this without touching the other half of the 32-bit word.
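To make this concrete: in generic qspinlock, xchg_tail() swaps the 16-bit tail
field of the lock word. The sketch below is not the kernel's actual code
(layout, names, and the little-endian assumption are illustrative); it
contrasts a native halfword exchange, which never observes the locked/pending
half, with the cmpxchg-loop emulation, which must retry whenever that half
changes concurrently:

```c
#include <stdint.h>

/* Simplified qspinlock word layout (illustrative only):
 *   bits  0-15 : locked + pending bytes
 *   bits 16-31 : tail (CPU + idx encoding)
 * 'tail' arguments below carry the new tail already shifted into the
 * high halfword, as in the kernel's xchg_tail(). */
#define TAIL_SHIFT 16

/* Native path: a 16-bit xchg touches only the tail halfword, so
 * concurrent updates to locked/pending cannot make it retry. */
static uint32_t xchg_tail_native(uint32_t *lock, uint32_t tail)
{
	uint16_t *tp = (uint16_t *)lock + 1;   /* little-endian layout */
	return (uint32_t)__atomic_exchange_n(tp,
					     (uint16_t)(tail >> TAIL_SHIFT),
					     __ATOMIC_ACQ_REL) << TAIL_SHIFT;
}

/* Emulated path: a cmpxchg loop over the full 32-bit word; any
 * concurrent change to the locked/pending half forces a retry, which
 * is exactly the forward-progress concern. */
static uint32_t xchg_tail_emulated(uint32_t *lock, uint32_t tail)
{
	uint32_t old = __atomic_load_n(lock, __ATOMIC_RELAXED);
	uint32_t new;
	do {
		new = (old & 0xffffu) | tail;
	} while (!__atomic_compare_exchange_n(lock, &old, new, 1,
					      __ATOMIC_ACQ_REL,
					      __ATOMIC_RELAXED));
	return old & 0xffff0000u;
}
```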

>>
>> Didn't get this part:
>> By "emulating small xchg() through cmpxchg()", did you mean like emulating an
>> xchg (usually 1 instruction) with lr & sc (same used in cmpxchg) ?
>>
>> If so, yeah, it's a fair point: in some extreme case we could have multiple
>> threads accessing given cacheline and have sc always failing. On the other hand,
>> there are 2 arguments on that:
>>
>> 1 - Other architectures, (such as powerpc, arm and arm64 without LSE atomics)
>> also seem to rely in this mechanism for every xchg size. Another archs like csky
>> and loongarch use asm that look like mine to handle size < 4B xchg. 

I think you misread the arm64 code, which should use native instructions
for all sizes, in both the armv8.0 and LSE atomics.

PowerPC does use the masking for xchg, but I suspect there are no
actual users, at least it actually has its own qspinlock implementation
that avoids xchg().

>>>  This is also something that almost no architecture
>>> specific code relies on (generic qspinlock being a notable exception).
>>> 
>>
>> 2 - As you mentioned, there should be very little code that will actually make
>> use of xchg for vars < 4B, so it should be safe to assume its fine to not
>> guarantee forward progress for those rare usages (like some of above mentioned
>> archs).

I don't think this is a safe assumption; we've had endless discussions
about using qspinlock on architectures without a native xchg(), which
needs either hardware guarantees or special countermeasures in xchg() itself
to avoid this.

What I'd actually like to do here is to remove the special 8-bit and
16-bit cases from the xchg() and cmpxchg() interfaces altogether,
leaving only fixed 32-bit and native wordsize (either 32 or 64) as the
options, while dealing with the others the same way we treat the fixed
64-bit cases that hardcode the 64-bit argument types and are only
usable on architectures that provide them.

    Arnd
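One way to read Arnd's proposal above: the generic wrappers would accept only
32-bit and native word-size operands and reject sub-word sizes at build time,
just as the fixed 64-bit helpers are only usable where provided. A
hypothetical sketch (the macro name and error message are invented for
illustration, and the statement-expression form is the GNU extension the
kernel already uses):

```c
#include <stdint.h>

/* Hypothetical xchg wrapper that supports only 4- and 8-byte operands.
 * A 1- or 2-byte pointer argument fails at compile time instead of
 * silently falling back to an emulated sub-word operation. */
#define my_xchg(ptr, new) ({                                           \
	_Static_assert(sizeof(*(ptr)) == 4 || sizeof(*(ptr)) == 8,     \
		       "xchg: only 32-bit and word-size supported");   \
	__atomic_exchange_n((ptr), (new), __ATOMIC_SEQ_CST);           \
})
```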
Guo Ren Aug. 11, 2023, 1:24 a.m. UTC | #5
On Fri, Aug 11, 2023 at 3:13 AM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Thu, Aug 10, 2023, at 18:23, Palmer Dabbelt wrote:
> > On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@redhat.com wrote:
> >> On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote:
> >>> On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote:
> >>> > xchg for variables of size 1-byte and 2-bytes is not yet available for
> >>> > riscv, even though its present in other architectures such as arm64 and
> >>> > x86. This could lead to not being able to implement some locking mechanisms
> >>> > or requiring some rework to make it work properly.
> >>> >
> >>> > Implement 1-byte and 2-bytes xchg in order to achieve parity with other
> >>> > architectures.
> >>
> >>> Parity with other architectures by itself is not a reason to do this,
> >>> in particular the other architectures you listed have the instructions
> >>> in hardware while riscv does not.
> >>
> >> Sure, I understand RISC-V don't have native support for xchg on variables of
> >> size < 4B. My argument is that it's nice to have even an emulated version for
> >> this in case any future mechanism wants to use it.
> >>
> >> Not having it may mean we won't be able to enable given mechanism in RISC-V.
> >
> > IIUC the ask is to have a user within the kernel for these functions.
> > That's the general thing to do, and last time this came up there was no
> > in-kernel use of it -- the qspinlock stuff would, but we haven't enabled
> > it yet because we're worried about the performance/fairness stuff that
> > other ports have seen and nobody's got concrete benchmarks yet (though
> > there's another patch set out that I haven't had time to look through,
> > so that may have changed).
>
> Right. In particular the qspinlock is a good example for something
> where having the emulated 16-bit xchg() may end up less efficient
> than a natively supported instruction.
The xchg() efficiency depends on the micro-architecture, and the number of
instructions is not the key; even one instruction may be split
into several micro-ops. I think the Power guys wouldn't agree with this
view :)

>
> The xchg() here is a performance optimization for CPUs that can
> do this without touching the other half of the 32-bit word.
It's useless on a non-SMT system because all operations are cacheline
based. (P.S.: Because xchg() has load semantics, CHI's "Dirty Partial"
& "Clean Empty" states can't help anymore.)

>
> >>
> >> Didn't get this part:
> >> By "emulating small xchg() through cmpxchg()", did you mean like emulating an
> >> xchg (usually 1 instruction) with lr & sc (same used in cmpxchg) ?
> >>
> >> If so, yeah, it's a fair point: in some extreme case we could have multiple
> >> threads accessing given cacheline and have sc always failing. On the other hand,
> >> there are 2 arguments on that:
> >>
> >> 1 - Other architectures, (such as powerpc, arm and arm64 without LSE atomics)
> >> also seem to rely in this mechanism for every xchg size. Another archs like csky
> >> and loongarch use asm that look like mine to handle size < 4B xchg.
>
> I think you misread the arm64 code, which should use native instructions
> for all sizes, in both the armv8.0 and LSE atomics.
>
> PowerPC does use the masking for xchg, but I suspect there are no
> actual users, at least it actually has its own qspinlock implementation
> that avoids xchg().
PowerPC still needs similar things; see publish_tail_cpu() and the more
complex cmpxchg semantics it relies on.

Paravirt qspinlock and CNA qspinlock still need more:
 - xchg8 (RCsc)
 - cmpxchg8/16_relaxed
 - cmpxchg8/16_release (RCpc)
 - cmpxchg8_acquire (RCpc)
 - cmpxchg8 (RCsc)
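The _relaxed/_acquire/_release suffixes above follow the kernel's standard
atomic-ordering naming. As a rough sketch of the ordering intent of each
variant using compiler builtins (illustrative only: the real arch code uses
lr/sc with explicit fence annotations, and these function names are invented):

```c
#include <stdint.h>

/* Each variant returns the previous value, like the kernel's cmpxchg().
 * Only the memory-order arguments differ between them. */

static uint8_t cmpxchg8_relaxed(uint8_t *p, uint8_t old, uint8_t new)
{
	__atomic_compare_exchange_n(p, &old, new, 0,
				    __ATOMIC_RELAXED, __ATOMIC_RELAXED);
	return old;   /* on failure, 'old' holds the observed value */
}

static uint8_t cmpxchg8_acquire(uint8_t *p, uint8_t old, uint8_t new)
{
	__atomic_compare_exchange_n(p, &old, new, 0,
				    __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE);
	return old;
}

static uint8_t cmpxchg8_release(uint8_t *p, uint8_t old, uint8_t new)
{
	/* failure order may not be RELEASE, so it drops to RELAXED */
	__atomic_compare_exchange_n(p, &old, new, 0,
				    __ATOMIC_RELEASE, __ATOMIC_RELAXED);
	return old;
}

/* Fully ordered: the RCsc case in the list above. */
static uint8_t cmpxchg8(uint8_t *p, uint8_t old, uint8_t new)
{
	__atomic_compare_exchange_n(p, &old, new, 0,
				    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
	return old;
}
```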

>
> >>>  This is also something that almost no architecture
> >>> specific code relies on (generic qspinlock being a notable exception).
> >>>
> >>
> >> 2 - As you mentioned, there should be very little code that will actually make
> >> use of xchg for vars < 4B, so it should be safe to assume its fine to not
> >> guarantee forward progress for those rare usages (like some of above mentioned
> >> archs).
>
> I don't this this is a safe assumption, we've had endless discussions
> about using qspinlock on architectures without a native xchg(), which
> needs either hardware guarantees or special countermeasures in xchg() itself
> to avoid this.
>
> What I'd actually like to do here is to remove the special 8-bit and
> 16-bit cases from the xchg() and cmpxchg() interfaces at all, leaving
It needs modifying the qspinlock, paravirt_qspinlock, and CNA_qspinlock
code to prevent the use of 8-bit/16-bit xchg/cmpxchg, plus cleaning up
all architectures' cmpxchg.h. What you propose just gets them out of the
common atomic.h; architectures would still need to implement them and
connect them to the qspinlock series.

> only fixed 32-bit and native wordsize (either 32 or 64) as the option,
> while dealing with the others the same way we treat the fixed
> 64-bit cases that hardcode the 64-bit argument types and are only
> usable on architectures that provide them.
>
>     Arnd
Guo Ren Aug. 11, 2023, 1:40 a.m. UTC | #6
On Fri, Aug 11, 2023 at 12:23 AM Palmer Dabbelt <palmer@rivosinc.com> wrote:
>
> On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@redhat.com wrote:
> > On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote:
> >> On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote:
> >> > xchg for variables of size 1-byte and 2-bytes is not yet available for
> >> > riscv, even though its present in other architectures such as arm64 and
> >> > x86. This could lead to not being able to implement some locking mechanisms
> >> > or requiring some rework to make it work properly.
> >> >
> >> > Implement 1-byte and 2-bytes xchg in order to achieve parity with other
> >> > architectures.
> >> >
> >> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> >>
> >
> > Hello Arnd Bergmann, thanks for reviewing!
> >
> >> Parity with other architectures by itself is not a reason to do this,
> >> in particular the other architectures you listed have the instructions
> >> in hardware while riscv does not.
> >
> > Sure, I understand RISC-V don't have native support for xchg on variables of
> > size < 4B. My argument is that it's nice to have even an emulated version for
> > this in case any future mechanism wants to use it.
> >
> > Not having it may mean we won't be able to enable given mechanism in RISC-V.
>
> IIUC the ask is to have a user within the kernel for these functions.
> That's the general thing to do, and last time this came up there was no
> in-kernel use of it -- the qspinlock stuff would, but we haven't enabled
> it yet because we're worried about the performance/fairness stuff that
> other ports have seen and nobody's got concrete benchmarks yet (though
> there's another patch set out that I haven't had time to look through,
> so that may have changed).
Conor doesn't agree with using an alternative as a detour mechanism
between qspinlock & ticket lock, so I'm preparing V11 in static_key
(jump_label) style. In the next version I will separate
paravirt_qspinlock & CNA_qspinlock out of V10; that should make the
qspinlock patch series easier to review. You can review the next
version, V11. Right now I'm debugging a static_key init problem during
module loading, which is triggered by our combo_qspinlock.

The qspinlock is being tested on a riscv platform [1] with 128 cores
and 8 NUMA nodes; next, I will post updated comparison results for
qspinlock & ticket lock.

[1]: https://www.sophon.ai/

>
> So if something uses these I'm happy to go look closer.
>
> >> Emulating the small xchg() through cmpxchg() is particularly tricky
> >> since it's easy to run into a case where this does not guarantee
> >> forward progress.
> >>
> >
> > Didn't get this part:
> > By "emulating small xchg() through cmpxchg()", did you mean like emulating an
> > xchg (usually 1 instruction) with lr & sc (same used in cmpxchg) ?
> >
> > If so, yeah, it's a fair point: in some extreme case we could have multiple
> > threads accessing given cacheline and have sc always failing. On the other hand,
> > there are 2 arguments on that:
> >
> > 1 - Other architectures, (such as powerpc, arm and arm64 without LSE atomics)
> > also seem to rely in this mechanism for every xchg size. Another archs like csky
> > and loongarch use asm that look like mine to handle size < 4B xchg.
> >
> >
> >>  This is also something that almost no architecture
> >> specific code relies on (generic qspinlock being a notable exception).
> >>
> >
> > 2 - As you mentioned, there should be very little code that will actually make
> > use of xchg for vars < 4B, so it should be safe to assume its fine to not
> > guarantee forward progress for those rare usages (like some of above mentioned
> > archs).
> >
> >> I would recommend just dropping this patch from the series, at least
> >> until there is a need for it.
> >
> > While I agree this is a valid point, I believe its more interesting to have it
> > implemented if any future mechanism wants to make use of this.
> >
> >
> > Thanks!
> > Leo
Conor Dooley Aug. 11, 2023, 6:27 a.m. UTC | #7
On Fri, Aug 11, 2023 at 09:40:30AM +0800, Guo Ren wrote:
> On Fri, Aug 11, 2023 at 12:23 AM Palmer Dabbelt <palmer@rivosinc.com> wrote:
> >
> > On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@redhat.com wrote:
> > > On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote:
> > >> On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote:
> > >> > xchg for variables of size 1-byte and 2-bytes is not yet available for
> > >> > riscv, even though its present in other architectures such as arm64 and
> > >> > x86. This could lead to not being able to implement some locking mechanisms
> > >> > or requiring some rework to make it work properly.
> > >> >
> > >> > Implement 1-byte and 2-bytes xchg in order to achieve parity with other
> > >> > architectures.
> > >> >
> > >> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > >>
> > >
> > > Hello Arnd Bergmann, thanks for reviewing!
> > >
> > >> Parity with other architectures by itself is not a reason to do this,
> > >> in particular the other architectures you listed have the instructions
> > >> in hardware while riscv does not.
> > >
> > > Sure, I understand RISC-V don't have native support for xchg on variables of
> > > size < 4B. My argument is that it's nice to have even an emulated version for
> > > this in case any future mechanism wants to use it.
> > >
> > > Not having it may mean we won't be able to enable given mechanism in RISC-V.
> >
> > IIUC the ask is to have a user within the kernel for these functions.
> > That's the general thing to do, and last time this came up there was no
> > in-kernel use of it -- the qspinlock stuff would, but we haven't enabled
> > it yet because we're worried about the performance/fairness stuff that
> > other ports have seen and nobody's got concrete benchmarks yet (though
> > there's another patch set out that I haven't had time to look through,
> > so that may have changed).
> Conor doesn't agree with using an alternative as a detour mechanism
> between qspinlock & ticket lock.

Hold on a sec, I don't recall having a problem with alternatives - it
was calling the stronger forward progress guarantee an erratum
(which it isn't) and an ISA extension w/o any "abusing" that framework
that I did not like.

> So I'm preparing V11 with static_key
> (jump_label) style.

I don't think there's much point rushing into making it based on static
keys when no progress has been made on implementing support for
non-standard extensions. Changing to a static key doesn't change the
detection mechanism; I've not got a problem with using alternatives for
this stuff.

Thanks,
Conor.

> Next version, I would separate paravirt_qspinlock
> & CNA_qspinlock from V10. That would make it easy to review the
> qspinlock patch series. You can review the next version V11. Now I'm
> debugging a static_key init problem when load_modules, which is
> triggered by our combo_qspinlock.
> 
> The qspinlock is being tested on the riscv platform [1] with 128 cores
> with 8 NUMA nodes, next, I would update the comparison results of
> qspinlock & ticket lock.
> 
> [1]: https://www.sophon.ai/
> 
> >
> > So if something uses these I'm happy to go look closer.
> >
> > >> Emulating the small xchg() through cmpxchg() is particularly tricky
> > >> since it's easy to run into a case where this does not guarantee
> > >> forward progress.
> > >>
> > >
> > > Didn't get this part:
> > > By "emulating small xchg() through cmpxchg()", did you mean like emulating an
> > > xchg (usually 1 instruction) with lr & sc (same used in cmpxchg) ?
> > >
> > > If so, yeah, it's a fair point: in some extreme case we could have multiple
> > > threads accessing given cacheline and have sc always failing. On the other hand,
> > > there are 2 arguments on that:
> > >
> > > 1 - Other architectures, (such as powerpc, arm and arm64 without LSE atomics)
> > > also seem to rely in this mechanism for every xchg size. Another archs like csky
> > > and loongarch use asm that look like mine to handle size < 4B xchg.
> > >
> > >
> > >>  This is also something that almost no architecture
> > >> specific code relies on (generic qspinlock being a notable exception).
> > >>
> > >
> > > 2 - As you mentioned, there should be very little code that will actually make
> > > use of xchg for vars < 4B, so it should be safe to assume its fine to not
> > > guarantee forward progress for those rare usages (like some of above mentioned
> > > archs).
> > >
> > >> I would recommend just dropping this patch from the series, at least
> > >> until there is a need for it.
> > >
> > > While I agree this is a valid point, I believe its more interesting to have it
> > > implemented if any future mechanism wants to make use of this.
> > >
> > >
> > > Thanks!
> > > Leo
> 
> 
> 
> -- 
> Best Regards
>  Guo Ren
Conor Dooley Aug. 11, 2023, 9:48 a.m. UTC | #8
On Fri, Aug 11, 2023 at 07:27:28AM +0100, Conor Dooley wrote:
> On Fri, Aug 11, 2023 at 09:40:30AM +0800, Guo Ren wrote:
> > On Fri, Aug 11, 2023 at 12:23 AM Palmer Dabbelt <palmer@rivosinc.com> wrote:
> > >
> > > On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@redhat.com wrote:
> > > > On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote:
> > > >> On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote:
> > > >> > xchg for variables of size 1-byte and 2-bytes is not yet available for
> > > >> > riscv, even though its present in other architectures such as arm64 and
> > > >> > x86. This could lead to not being able to implement some locking mechanisms
> > > >> > or requiring some rework to make it work properly.
> > > >> >
> > > >> > Implement 1-byte and 2-bytes xchg in order to achieve parity with other
> > > >> > architectures.
> > > >> >
> > > >> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > >>
> > > >
> > > > Hello Arnd Bergmann, thanks for reviewing!
> > > >
> > > >> Parity with other architectures by itself is not a reason to do this,
> > > >> in particular the other architectures you listed have the instructions
> > > >> in hardware while riscv does not.
> > > >
> > > > Sure, I understand RISC-V don't have native support for xchg on variables of
> > > > size < 4B. My argument is that it's nice to have even an emulated version for
> > > > this in case any future mechanism wants to use it.
> > > >
> > > > Not having it may mean we won't be able to enable given mechanism in RISC-V.
> > >
> > > IIUC the ask is to have a user within the kernel for these functions.
> > > That's the general thing to do, and last time this came up there was no
> > > in-kernel use of it -- the qspinlock stuff would, but we haven't enabled
> > > it yet because we're worried about the performance/fairness stuff that
> > > other ports have seen and nobody's got concrete benchmarks yet (though
> > > there's another patch set out that I haven't had time to look through,
> > > so that may have changed).
> > Conor doesn't agree with using an alternative as a detour mechanism
> > between qspinlock & ticket lock.
> 
> Hold on a sec, I don't recall having a problem with alternatives - it
> was calling the stronger forward progress guarantee an erratum
> (which it isn't) and an ISA extension w/o any "abusing" that framework
> that I did not like.

Hmm, what I wrote here makes no sense reading it back. Let me try again:

	Hold on a sec, I don't recall having a problem with alternatives - it
	was calling the stronger forward progress guarantee an erratum
	(which it isn't) and "abusing" that framework that I did not like.
	The series effectively mixed and matched whichever part of ISA
	extensions and errata that was convenient.

> 
> > So I'm preparing V11 with static_key
> > (jump_label) style.
> 
> I don't think there's much point rushing into making it based on static
> keys when no progress has been made on implementing support for
> non-standard extensions. Changing to a static key doesn't change the
> detection mechanism, I've not got a problem with using alternatives for
> this stuff.
> 
> Thanks,
> Conor.
> 
> > Next version, I would separate paravirt_qspinlock
> > & CNA_qspinlock from V10. That would make it easy to review the
> > qspinlock patch series. You can review the next version V11. Now I'm
> > debugging a static_key init problem when load_modules, which is
> > triggered by our combo_qspinlock.
> > 
> > The qspinlock is being tested on the riscv platform [1] with 128 cores
> > with 8 NUMA nodes, next, I would update the comparison results of
> > qspinlock & ticket lock.
> > 
> > [1]: https://www.sophon.ai/
> > 
> > >
> > > So if something uses these I'm happy to go look closer.
> > >
> > > >> Emulating the small xchg() through cmpxchg() is particularly tricky
> > > >> since it's easy to run into a case where this does not guarantee
> > > >> forward progress.
> > > >>
> > > >
> > > > Didn't get this part:
> > > > By "emulating small xchg() through cmpxchg()", did you mean like emulating an
> > > > xchg (usually 1 instruction) with lr & sc (same used in cmpxchg) ?
> > > >
> > > > If so, yeah, it's a fair point: in some extreme case we could have multiple
> > > > threads accessing given cacheline and have sc always failing. On the other hand,
> > > > there are 2 arguments on that:
> > > >
> > > > 1 - Other architectures, (such as powerpc, arm and arm64 without LSE atomics)
> > > > also seem to rely in this mechanism for every xchg size. Another archs like csky
> > > > and loongarch use asm that look like mine to handle size < 4B xchg.
> > > >
> > > >
> > > >>  This is also something that almost no architecture
> > > >> specific code relies on (generic qspinlock being a notable exception).
> > > >>
> > > >
> > > > 2 - As you mentioned, there should be very little code that will actually make
> > > > use of xchg for vars < 4B, so it should be safe to assume its fine to not
> > > > guarantee forward progress for those rare usages (like some of above mentioned
> > > > archs).
> > > >
> > > >> I would recommend just dropping this patch from the series, at least
> > > >> until there is a need for it.
> > > >
> > > > While I agree this is a valid point, I believe its more interesting to have it
> > > > implemented if any future mechanism wants to make use of this.
> > > >
> > > >
> > > > Thanks!
> > > > Leo
> > 
> > 
> > 
> > -- 
> > Best Regards
> >  Guo Ren
Andrew Jones Aug. 11, 2023, 11:10 a.m. UTC | #9
On Fri, Aug 11, 2023 at 09:40:30AM +0800, Guo Ren wrote:
> On Fri, Aug 11, 2023 at 12:23 AM Palmer Dabbelt <palmer@rivosinc.com> wrote:
> >
> > On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@redhat.com wrote:
> > > On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote:
> > >> On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote:
> > >> > xchg for variables of size 1-byte and 2-bytes is not yet available for
> > >> > riscv, even though its present in other architectures such as arm64 and
> > >> > x86. This could lead to not being able to implement some locking mechanisms
> > >> > or requiring some rework to make it work properly.
> > >> >
> > >> > Implement 1-byte and 2-bytes xchg in order to achieve parity with other
> > >> > architectures.
> > >> >
> > >> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > >>
> > >
> > > Hello Arnd Bergmann, thanks for reviewing!
> > >
> > >> Parity with other architectures by itself is not a reason to do this,
> > >> in particular the other architectures you listed have the instructions
> > >> in hardware while riscv does not.
> > >
> > > Sure, I understand RISC-V don't have native support for xchg on variables of
> > > size < 4B. My argument is that it's nice to have even an emulated version for
> > > this in case any future mechanism wants to use it.
> > >
> > > Not having it may mean we won't be able to enable given mechanism in RISC-V.
> >
> > IIUC the ask is to have a user within the kernel for these functions.
> > That's the general thing to do, and last time this came up there was no
> > in-kernel use of it -- the qspinlock stuff would, but we haven't enabled
> > it yet because we're worried about the performance/fairness stuff that
> > other ports have seen and nobody's got concrete benchmarks yet (though
> > there's another patch set out that I haven't had time to look through,
> > so that may have changed).
> Conor doesn't agree with using an alternative as a detour mechanism
> between qspinlock & ticket lock. So I'm preparing V11 with static_key
> (jump_label) style. Next version, I would separate paravirt_qspinlock
> & CNA_qspinlock from V10. That would make it easy to review the
> qspinlock patch series. You can review the next version V11. Now I'm
> debugging a static_key init problem when load_modules, which is
> triggered by our combo_qspinlock.

We've seen problems with static keys and module loading in the past. You
may want to take a look at commit eb6354e11630 ("riscv: Ensure isa-ext
static keys are writable")

Thanks,
drew

> 
> The qspinlock is being tested on the riscv platform [1] with 128 cores
> with 8 NUMA nodes, next, I would update the comparison results of
> qspinlock & ticket lock.
> 
> [1]: https://www.sophon.ai/
> 
> >
> > So if something uses these I'm happy to go look closer.
> >
> > >> Emulating the small xchg() through cmpxchg() is particularly tricky
> > >> since it's easy to run into a case where this does not guarantee
> > >> forward progress.
> > >>
> > >
> > > Didn't get this part:
> > > By "emulating small xchg() through cmpxchg()", did you mean like emulating an
> > > xchg (usually 1 instruction) with lr & sc (same used in cmpxchg) ?
> > >
> > > If so, yeah, it's a fair point: in some extreme case we could have multiple
> > > threads accessing given cacheline and have sc always failing. On the other hand,
> > > there are 2 arguments on that:
> > >
> > > 1 - Other architectures, (such as powerpc, arm and arm64 without LSE atomics)
> > > also seem to rely in this mechanism for every xchg size. Another archs like csky
> > > and loongarch use asm that look like mine to handle size < 4B xchg.
> > >
> > >
> > >>  This is also something that almost no architecture
> > >> specific code relies on (generic qspinlock being a notable exception).
> > >>
> > >
> > > 2 - As you mentioned, there should be very little code that will actually make
> > > use of xchg for vars < 4B, so it should be safe to assume its fine to not
> > > guarantee forward progress for those rare usages (like some of above mentioned
> > > archs).
> > >
> > >> I would recommend just dropping this patch from the series, at least
> > >> until there is a need for it.
> > >
> > > While I agree this is a valid point, I believe its more interesting to have it
> > > implemented if any future mechanism wants to make use of this.
> > >
> > >
> > > Thanks!
> > > Leo
> 
> 
> 
> -- 
> Best Regards
>  Guo Ren
> 
> _______________________________________________
> linux-riscv mailing list
> linux-riscv@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv
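[Editorial note: the static-key switch Guo Ren describes, and the module-loading pitfall drew points to, boil down to a branch whose direction is patched at runtime. Below is a rough userspace-only sketch of that control flow; the kernel's real APIs are DEFINE_STATIC_KEY_FALSE / static_branch_likely / static_branch_enable, which patch jump instructions in place, and everything here — including the mock macros and the lock_flavor() name — is purely illustrative.]

```c
/* Userspace mock of the combo-spinlock selection pattern discussed above.
 * In the kernel, static_branch_likely() compiles to a patchable jump and
 * static_branch_enable() rewrites that jump at runtime; here a plain bool
 * stands in, purely to show the control flow being selected. */
#include <stdbool.h>

static bool qspinlock_key;                 /* mock of a struct static_key */

#define static_branch_likely(k)  (*(k))    /* mock: the real one patches code */
#define static_branch_enable(k)  (*(k) = true)

/* Slowpath dispatcher, as a combo qspinlock/ticket lock might structure it. */
static const char *lock_flavor(void)
{
	if (static_branch_likely(&qspinlock_key))
		return "qspinlock";
	return "ticket";
}
```

In the real series the key would be flipped once during early boot; the module-loading problem mentioned above comes from patched jump sites living in text that must be writable at patch time, which is what commit eb6354e11630 addressed for isa-ext keys.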
Guo Ren Aug. 11, 2023, 11:16 a.m. UTC | #10
On Fri, Aug 11, 2023 at 7:10 PM Andrew Jones <ajones@ventanamicro.com> wrote:
>
> On Fri, Aug 11, 2023 at 09:40:30AM +0800, Guo Ren wrote:
> > On Fri, Aug 11, 2023 at 12:23 AM Palmer Dabbelt <palmer@rivosinc.com> wrote:
> > >
> > > On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@redhat.com wrote:
> > > > On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote:
> > > >> On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote:
> > > >> > xchg for variables of size 1-byte and 2-bytes is not yet available for
> > > >> > riscv, even though its present in other architectures such as arm64 and
> > > >> > x86. This could lead to not being able to implement some locking mechanisms
> > > >> > or requiring some rework to make it work properly.
> > > >> >
> > > >> > Implement 1-byte and 2-bytes xchg in order to achieve parity with other
> > > >> > architectures.
> > > >> >
> > > >> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > >>
> > > >
> > > > Hello Arnd Bergmann, thanks for reviewing!
> > > >
> > > >> Parity with other architectures by itself is not a reason to do this,
> > > >> in particular the other architectures you listed have the instructions
> > > >> in hardware while riscv does not.
> > > >
> > > > Sure, I understand RISC-V don't have native support for xchg on variables of
> > > > size < 4B. My argument is that it's nice to have even an emulated version for
> > > > this in case any future mechanism wants to use it.
> > > >
> > > > Not having it may mean we won't be able to enable given mechanism in RISC-V.
> > >
> > > IIUC the ask is to have a user within the kernel for these functions.
> > > That's the general thing to do, and last time this came up there was no
> > > in-kernel use of it -- the qspinlock stuff would, but we haven't enabled
> > > it yet because we're worried about the performance/fairness stuff that
> > > other ports have seen and nobody's got concrete benchmarks yet (though
> > > there's another patch set out that I haven't had time to look through,
> > > so that may have changed).
> > Conor doesn't agree with using an alternative as a detour mechanism
> > between qspinlock & ticket lock. So I'm preparing V11 with static_key
> > (jump_label) style. Next version, I would separate paravirt_qspinlock
> > & CNA_qspinlock from V10. That would make it easy to review the
> > qspinlock patch series. You can review the next version V11. Now I'm
> > debugging a static_key init problem when load_modules, which is
> > triggered by our combo_qspinlock.
>
> We've seen problems with static keys and module loading in the past. You
> may want to take a look at commit eb6354e11630 ("riscv: Ensure isa-ext
> static keys are writable")
Thank you for the tip. I'll take a look.

>
> Thanks,
> drew
>
> >
> > The qspinlock is being tested on the riscv platform [1] with 128 cores
> > with 8 NUMA nodes, next, I would update the comparison results of
> > qspinlock & ticket lock.
> >
> > [1]: https://www.sophon.ai/
> >
> > >
> > > So if something uses these I'm happy to go look closer.
> > >
> > > >> Emulating the small xchg() through cmpxchg() is particularly tricky
> > > >> since it's easy to run into a case where this does not guarantee
> > > >> forward progress.
> > > >>
> > > >
> > > > Didn't get this part:
> > > > By "emulating small xchg() through cmpxchg()", did you mean like emulating an
> > > > xchg (usually 1 instruction) with lr & sc (same used in cmpxchg) ?
> > > >
> > > > If so, yeah, it's a fair point: in some extreme case we could have multiple
> > > > threads accessing given cacheline and have sc always failing. On the other hand,
> > > > there are 2 arguments on that:
> > > >
> > > > 1 - Other architectures, (such as powerpc, arm and arm64 without LSE atomics)
> > > > also seem to rely in this mechanism for every xchg size. Another archs like csky
> > > > and loongarch use asm that look like mine to handle size < 4B xchg.
> > > >
> > > >
> > > >>  This is also something that almost no architecture
> > > >> specific code relies on (generic qspinlock being a notable exception).
> > > >>
> > > >
> > > > 2 - As you mentioned, there should be very little code that will actually make
> > > > use of xchg for vars < 4B, so it should be safe to assume its fine to not
> > > > guarantee forward progress for those rare usages (like some of above mentioned
> > > > archs).
> > > >
> > > >> I would recommend just dropping this patch from the series, at least
> > > >> until there is a need for it.
> > > >
> > > > While I agree this is a valid point, I believe its more interesting to have it
> > > > implemented if any future mechanism wants to make use of this.
> > > >
> > > >
> > > > Thanks!
> > > > Leo
> >
> >
> >
> > --
> > Best Regards
> >  Guo Ren
> >
Leonardo Bras Aug. 30, 2023, 9:51 p.m. UTC | #11
Hello everyone,

Sorry for the delay, I was out of office for a while.

On Fri, 2023-08-11 at 09:24 +0800, Guo Ren wrote:
> On Fri, Aug 11, 2023 at 3:13 AM Arnd Bergmann <arnd@arndb.de> wrote:
> > 
> > On Thu, Aug 10, 2023, at 18:23, Palmer Dabbelt wrote:
> > > On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@redhat.com wrote:
> > > > On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote:
> > > > > On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote:
> > > > > > xchg for variables of size 1-byte and 2-bytes is not yet available for
> > > > > > riscv, even though its present in other architectures such as arm64 and
> > > > > > x86. This could lead to not being able to implement some locking mechanisms
> > > > > > or requiring some rework to make it work properly.
> > > > > > 
> > > > > > Implement 1-byte and 2-bytes xchg in order to achieve parity with other
> > > > > > architectures.
> > > > 
> > > > > Parity with other architectures by itself is not a reason to do this,
> > > > > in particular the other architectures you listed have the instructions
> > > > > in hardware while riscv does not.
> > > > 
> > > > Sure, I understand RISC-V don't have native support for xchg on variables of
> > > > size < 4B. My argument is that it's nice to have even an emulated version for
> > > > this in case any future mechanism wants to use it.
> > > > 
> > > > Not having it may mean we won't be able to enable given mechanism in RISC-V.
> > > 
> > > IIUC the ask is to have a user within the kernel for these functions.
> > > That's the general thing to do, and last time this came up there was no
> > > in-kernel use of it -- the qspinlock stuff would, but we haven't enabled
> > > it yet because we're worried about the performance/fairness stuff that
> > > other ports have seen and nobody's got concrete benchmarks yet (though
> > > there's another patch set out that I haven't had time to look through,
> > > so that may have changed).
> > 
> > Right. In particular the qspinlock is a good example for something
> > where having the emulated 16-bit xchg() may end up less efficient
> > than a natively supported instruction.
> The xchg() efficiency depends on micro-architecture. and the number of
> instructions is not the key, even one instruction would be separated
> into several micro-ops. I thought the Power guys won't agree with this
> view :)
> 
> > 
> > The xchg() here is a performance optimization for CPUs that can
> > do this without touching the other half of the 32-bit word.
> It's useless on a non-SMT system because all operations are cacheline
> based. (Ps: Because xchg() has a load semantic, CHI's "Dirty Partial"
> & "Clean Empty" can't help anymore.)
> 
> > 
> > > > 
> > > > Didn't get this part:
> > > > By "emulating small xchg() through cmpxchg()", did you mean like emulating an
> > > > xchg (usually 1 instruction) with lr & sc (same used in cmpxchg) ?
> > > > 
> > > > If so, yeah, it's a fair point: in some extreme case we could have multiple
> > > > threads accessing given cacheline and have sc always failing. On the other hand,
> > > > there are 2 arguments on that:
> > > > 
> > > > 1 - Other architectures, (such as powerpc, arm and arm64 without LSE atomics)
> > > > also seem to rely in this mechanism for every xchg size. Another archs like csky
> > > > and loongarch use asm that look like mine to handle size < 4B xchg.
> > 
> > I think you misread the arm64 code, which should use native instructions
> > for all sizes, in both the armv8.0 and LSE atomics.

By native I understand you mean swp instead of ll/sc, right?

Well, that's right only if the kernel is compiled with LSE, and the ll/sc option
is still available for other arm64 CPUs that don't support it.

Also, according to Kconfig, LSE seems to have been introduced in ARMv8.1, meaning
(at least some) ARMv8.0 arm64 CPUs use ll/sc, and this is why the ll/sc xchg
code is available for 1, 2, 4 and 8 bytes in arch/arm64/include/asm/cmpxchg.h:

#define __XCHG_CASE(w, sfx, name, sz, mb, nop_lse, acq, acq_lse, rel, cl)	\
static inline u##sz __xchg_case_##name##sz(u##sz x, volatile void *ptr)		\
{										\
	u##sz ret;								\
	unsigned long tmp;							\
										\
	asm volatile(ARM64_LSE_ATOMIC_INSN(					\
	/* LL/SC */								\
	"	prfm	pstl1strm, %2\n"					\
	"1:	ld" #acq "xr" #sfx "\t%" #w "0, %2\n"				\
	"	st" #rel "xr" #sfx "\t%w1, %" #w "3, %2\n"			\
	"	cbnz	%w1, 1b\n"						\
	"	" #mb,								\
	/* LSE atomics */							\
	"	swp" #acq_lse #rel #sfx "\t%" #w "3, %" #w "0, %2\n"		\
[...]

__XCHG_CASE(w, b,     ,  8,        ,    ,  ,  ,  ,         )
__XCHG_CASE(w, h,     , 16,        ,    ,  ,  ,  ,         )
__XCHG_CASE(w,  ,     , 32,        ,    ,  ,  ,  ,         )
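[Editorial note: the "emulating small xchg() through cmpxchg()" point can be made concrete. Below is a minimal userspace sketch of the same mask-and-shift trick, using C11 atomics instead of the kernel's lr.w/sc.w asm, and assuming a little-endian byte layout; the function name xchg8 and its shape are illustrative, not the kernel's implementation.]

```c
/* Sketch: a 1-byte exchange built from a 32-bit compare-and-swap over the
 * containing aligned word. The kernel patch does the same job with RISC-V
 * lr.w/sc.w; the mask/shift arithmetic is the part that matters here.
 * Assumes little-endian byte numbering within the word. */
#include <stdatomic.h>
#include <stdint.h>

static uint8_t xchg8(volatile uint8_t *p, uint8_t newv)
{
	/* Align down to the 32-bit word containing the byte. */
	_Atomic uint32_t *word =
		(_Atomic uint32_t *)((uintptr_t)p & ~(uintptr_t)3);
	unsigned int shift = ((uintptr_t)p & 3) * 8;	/* little-endian */
	uint32_t mask = 0xffu << shift;
	uint32_t old = atomic_load(word);
	uint32_t val;

	do {
		/* Keep the other three bytes, replace only ours. */
		val = (old & ~mask) | ((uint32_t)newv << shift);
		/* The CAS fails if *any* byte of the word changed, not just
		 * ours: this retry loop is exactly where the forward-progress
		 * concern raised in the thread lives. */
	} while (!atomic_compare_exchange_weak(word, &old, val));

	return (uint8_t)((old & mask) >> shift);
}
```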

> > 
> > PowerPC does use the masking for xchg, but I suspect there are no
> > actual users, at least it actually has its own qspinlock implementation
> > that avoids xchg().
> PowerPC still needs similar things, see publish_tail_cpu(), and more
> complex cmpxchg semantics.
> 
> Paravrit qspinlock and CNA qspinlock still need more:
>  - xchg8 (RCsc)
>  - cmpxchg8/16_relaxed
>  - cmpxchg8/16_release (Rcpc)
>  - cmpxchg8_acquire (RCpc)
>  - cmpxchg8 (RCsc)
> 
> > 
> > > > >  This is also something that almost no architecture
> > > > > specific code relies on (generic qspinlock being a notable exception).
> > > > > 
> > > > 
> > > > 2 - As you mentioned, there should be very little code that will actually make
> > > > use of xchg for vars < 4B, so it should be safe to assume its fine to not
> > > > guarantee forward progress for those rare usages (like some of above mentioned
> > > > archs).
> > 
> > I don't this this is a safe assumption, we've had endless discussions
> > about using qspinlock on architectures without a native xchg(), which
> > needs either hardware guarantees or special countermeasures in xchg() itself
> > to avoid this.

That seems like an interesting discussion; do you have a link to it?

From what I could see, Guo Ren is doing great work on proving that using
qspinlock (with smaller xchg) performs better on RISC-V.

> > 
> > What I'd actually like to do here is to remove the special 8-bit and
> > 16-bit cases from the xchg() and cmpxchg() interfaces at all, leaving
> It needs to modify qspinlock, paravirt_qspinlock, and CNA_qspinlock
> code to prevent using 8-bit/16-bit xchg/cmpxchg, and cleanup all
> architectures' cmpxchg.h. What you do is just get them out of the
> common atomic.h, but architectures still need to solve them and
> connect to the qspinlock series.
> 
> > only fixed 32-bit and native wordsize (either 32 or 64) as the option,
> > while dealing with the others the same way we treat the fixed
> > 64-bit cases that hardcode the 64-bit argument types and are only
> > usable on architectures that provide them.
> > 
> >     Arnd
> 
> 
> 

IIUC, xchg for sizes 1 & 2 can still be useful if a bigger lock variable would
cause the target struct to use more than a cacheline: keeping it small reduces
cache usage and avoids some cacheline misses.

Even though on some arches those 'non-native' xchg can take longer, it can be
seen as a valid tradeoff for some scenarios.

Thanks,
Leo
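[Editorial note: the footprint argument in the message above can be illustrated with a hypothetical pair of structs. Struct and field names here are invented, and the exact padding is ABI-dependent; the sizes shown assume a typical LP64 ABI with natural alignment.]

```c
/* With a 1-byte lock the fields pack into one typical 64-byte cacheline;
 * widening the lock to 4 bytes adds alignment padding and spills the
 * struct past the cacheline boundary. */
#include <stdint.h>

struct node_small_lock {
	uint8_t  lock;		/* 1-byte lock: needs a sub-word xchg */
	uint8_t  flags;
	uint16_t count;
	uint32_t data[15];	/* 4 + 60 = 64 bytes total */
};

struct node_word_lock {
	uint32_t lock;		/* 4-byte lock: native amoswap.w on RISC-V */
	uint8_t  flags;
	uint16_t count;		/* one pad byte inserted before this */
	uint32_t data[15];	/* struct now totals 68 bytes */
};
```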
Leonardo Bras Aug. 30, 2023, 9:59 p.m. UTC | #12
On Thu, 2023-08-10 at 09:23 -0700, Palmer Dabbelt wrote:
> On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@redhat.com wrote:
> > On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote:
> > > On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote:
> > > > xchg for variables of size 1-byte and 2-bytes is not yet available for
> > > > riscv, even though its present in other architectures such as arm64 and
> > > > x86. This could lead to not being able to implement some locking mechanisms
> > > > or requiring some rework to make it work properly.
> > > > 
> > > > Implement 1-byte and 2-bytes xchg in order to achieve parity with other
> > > > architectures.
> > > > 
> > > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > 
> > 
> > Hello Arnd Bergmann, thanks for reviewing!
> > 
> > > Parity with other architectures by itself is not a reason to do this,
> > > in particular the other architectures you listed have the instructions
> > > in hardware while riscv does not.
> > 
> > Sure, I understand RISC-V don't have native support for xchg on variables of
> > size < 4B. My argument is that it's nice to have even an emulated version for
> > this in case any future mechanism wants to use it.
> > 
> > Not having it may mean we won't be able to enable given mechanism in RISC-V. 
> 
> IIUC the ask is to have a user within the kernel for these functions.  
> That's the general thing to do, and last time this came up there was no 
> in-kernel use of it -- the qspinlock stuff would, but we haven't enabled 
> it yet because we're worried about the performance/fairness stuff that 
> other ports have seen and nobody's got concrete benchmarks yet (though 
> there's another patch set out that I haven't had time to look through, 
> so that may have changed).
> 
> So if something uses these I'm happy to go look closer.

IIUC patches 4 & 5 will be used by qspinlock, which may not be done yet, so we
don't have a use for them for the time being.

Otherwise, any comments on patches 1, 2 & 3?

> 
> > > Emulating the small xchg() through cmpxchg() is particularly tricky
> > > since it's easy to run into a case where this does not guarantee
> > > forward progress.
> > > 
> > 
> > Didn't get this part:
> > By "emulating small xchg() through cmpxchg()", did you mean like emulating an
> > xchg (usually 1 instruction) with lr & sc (same used in cmpxchg) ?
> > 
> > If so, yeah, it's a fair point: in some extreme case we could have multiple
> > threads accessing given cacheline and have sc always failing. On the other hand,
> > there are 2 arguments on that:
> > 
> > 1 - Other architectures, (such as powerpc, arm and arm64 without LSE atomics)
> > also seem to rely in this mechanism for every xchg size. Another archs like csky
> > and loongarch use asm that look like mine to handle size < 4B xchg. 
> >     
> > 
> > >  This is also something that almost no architecture
> > > specific code relies on (generic qspinlock being a notable exception).
> > > 
> > 
> > 2 - As you mentioned, there should be very little code that will actually make
> > use of xchg for vars < 4B, so it should be safe to assume its fine to not
> > guarantee forward progress for those rare usages (like some of above mentioned
> > archs).
> > 
> > > I would recommend just dropping this patch from the series, at least
> > > until there is a need for it.
> > 
> > While I agree this is a valid point, I believe its more interesting to have it
> > implemented if any future mechanism wants to make use of this. 
> > 
> > 
> > Thanks!
> > Leo
>
Leonardo Bras Sept. 6, 2023, 9:15 p.m. UTC | #13
On Wed, Aug 30, 2023 at 6:59 PM Leonardo Brás <leobras@redhat.com> wrote:
>
> On Thu, 2023-08-10 at 09:23 -0700, Palmer Dabbelt wrote:
> > On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@redhat.com wrote:
> > > On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote:
> > > > On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote:
> > > > > xchg for variables of size 1-byte and 2-bytes is not yet available for
> > > > > riscv, even though its present in other architectures such as arm64 and
> > > > > x86. This could lead to not being able to implement some locking mechanisms
> > > > > or requiring some rework to make it work properly.
> > > > >
> > > > > Implement 1-byte and 2-bytes xchg in order to achieve parity with other
> > > > > architectures.
> > > > >
> > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > >
> > >
> > > Hello Arnd Bergmann, thanks for reviewing!
> > >
> > > > Parity with other architectures by itself is not a reason to do this,
> > > > in particular the other architectures you listed have the instructions
> > > > in hardware while riscv does not.
> > >
> > > Sure, I understand RISC-V don't have native support for xchg on variables of
> > > size < 4B. My argument is that it's nice to have even an emulated version for
> > > this in case any future mechanism wants to use it.
> > >
> > > Not having it may mean we won't be able to enable given mechanism in RISC-V.
> >
> > IIUC the ask is to have a user within the kernel for these functions.
> > That's the general thing to do, and last time this came up there was no
> > in-kernel use of it -- the qspinlock stuff would, but we haven't enabled
> > it yet because we're worried about the performance/fairness stuff that
> > other ports have seen and nobody's got concrete benchmarks yet (though
> > there's another patch set out that I haven't had time to look through,
> > so that may have changed).
> >
> > So if something uses these I'm happy to go look closer.
>
> IIUC patches 4 & 5 will be used by qspinlock, which may not be done yet, so we
> don't have an use for them for the time being.
>
> Otherwise, any comments on patches 1, 2 & 3?

Hello Palmer,
Any chance of patches 1, 2 & 3 being merged in this merge window?

Thanks!


>
> >
> > > > Emulating the small xchg() through cmpxchg() is particularly tricky
> > > > since it's easy to run into a case where this does not guarantee
> > > > forward progress.
> > > >
> > >
> > > Didn't get this part:
> > > By "emulating small xchg() through cmpxchg()", did you mean like emulating an
> > > xchg (usually 1 instruction) with lr & sc (same used in cmpxchg) ?
> > >
> > > If so, yeah, it's a fair point: in some extreme case we could have multiple
> > > threads accessing given cacheline and have sc always failing. On the other hand,
> > > there are 2 arguments on that:
> > >
> > > 1 - Other architectures, (such as powerpc, arm and arm64 without LSE atomics)
> > > also seem to rely in this mechanism for every xchg size. Another archs like csky
> > > and loongarch use asm that look like mine to handle size < 4B xchg.
> > >
> > >
> > > >  This is also something that almost no architecture
> > > > specific code relies on (generic qspinlock being a notable exception).
> > > >
> > >
> > > 2 - As you mentioned, there should be very little code that will actually make
> > > use of xchg for vars < 4B, so it should be safe to assume its fine to not
> > > guarantee forward progress for those rare usages (like some of above mentioned
> > > archs).
> > >
> > > > I would recommend just dropping this patch from the series, at least
> > > > until there is a need for it.
> > >
> > > While I agree this is a valid point, I believe its more interesting to have it
> > > implemented if any future mechanism wants to make use of this.
> > >
> > >
> > > Thanks!
> > > Leo
> >
>
Leonardo Bras Dec. 6, 2023, 12:56 a.m. UTC | #14
From: Leonardo Bras <leobras@redhat.com>

On Wed, Aug 30, 2023 at 06:59:46PM -0300, Leonardo Brás wrote:
> On Thu, 2023-08-10 at 09:23 -0700, Palmer Dabbelt wrote:
> > On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@redhat.com wrote:
> > > On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote:
> > > > On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote:
> > > > > xchg for variables of size 1-byte and 2-bytes is not yet available for
> > > > > riscv, even though its present in other architectures such as arm64 and
> > > > > x86. This could lead to not being able to implement some locking mechanisms
> > > > > or requiring some rework to make it work properly.
> > > > > 
> > > > > Implement 1-byte and 2-bytes xchg in order to achieve parity with other
> > > > > architectures.
> > > > > 
> > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > > 
> > > 
> > > Hello Arnd Bergmann, thanks for reviewing!
> > > 
> > > > Parity with other architectures by itself is not a reason to do this,
> > > > in particular the other architectures you listed have the instructions
> > > > in hardware while riscv does not.
> > > 
> > > Sure, I understand RISC-V don't have native support for xchg on variables of
> > > size < 4B. My argument is that it's nice to have even an emulated version for
> > > this in case any future mechanism wants to use it.
> > > 
> > > Not having it may mean we won't be able to enable given mechanism in RISC-V. 
> > 
> > IIUC the ask is to have a user within the kernel for these functions.  
> > That's the general thing to do, and last time this came up there was no 
> > in-kernel use of it -- the qspinlock stuff would, but we haven't enabled 
> > it yet because we're worried about the performance/fairness stuff that 
> > other ports have seen and nobody's got concrete benchmarks yet (though 
> > there's another patch set out that I haven't had time to look through, 
> > so that may have changed).
> > 
> > So if something uses these I'm happy to go look closer.
> 
> IIUC patches 4 & 5 will be used by qspinlock, which may not be done yet, so we
> don't have an use for them for the time being.
> 
> Otherwise, any comments on patches 1, 2 & 3?

ping

> 
> > 
> > > > Emulating the small xchg() through cmpxchg() is particularly tricky
> > > > since it's easy to run into a case where this does not guarantee
> > > > forward progress.
> > > > 
> > > 
> > > Didn't get this part:
> > > By "emulating small xchg() through cmpxchg()", did you mean like emulating an
> > > xchg (usually 1 instruction) with lr & sc (same used in cmpxchg) ?
> > > 
> > > If so, yeah, it's a fair point: in some extreme case we could have multiple
> > > threads accessing given cacheline and have sc always failing. On the other hand,
> > > there are 2 arguments on that:
> > > 
> > > 1 - Other architectures, (such as powerpc, arm and arm64 without LSE atomics)
> > > also seem to rely in this mechanism for every xchg size. Another archs like csky
> > > and loongarch use asm that look like mine to handle size < 4B xchg. 
> > >     
> > > 
> > > >  This is also something that almost no architecture
> > > > specific code relies on (generic qspinlock being a notable exception).
> > > > 
> > > 
> > > 2 - As you mentioned, there should be very little code that will actually make
> > > use of xchg for vars < 4B, so it should be safe to assume its fine to not
> > > guarantee forward progress for those rare usages (like some of above mentioned
> > > archs).
> > > 
> > > > I would recommend just dropping this patch from the series, at least
> > > > until there is a need for it.
> > > 
> > > While I agree this is a valid point, I believe its more interesting to have it
> > > implemented if any future mechanism wants to make use of this. 
> > > 
> > > 
> > > Thanks!
> > > Leo
> > 
>
Jisheng Zhang Jan. 3, 2024, 11:05 a.m. UTC | #15
On Tue, Dec 05, 2023 at 09:56:44PM -0300, leobras.c@gmail.com wrote:
> From: Leonardo Bras <leobras@redhat.com>
> 
> On Wed, Aug 30, 2023 at 06:59:46PM -0300, Leonardo Brás wrote:
> > On Thu, 2023-08-10 at 09:23 -0700, Palmer Dabbelt wrote:
> > > On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@redhat.com wrote:
> > > > On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote:
> > > > > On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote:
> > > > > > xchg for variables of size 1-byte and 2-bytes is not yet available for
> > > > > > riscv, even though its present in other architectures such as arm64 and
> > > > > > x86. This could lead to not being able to implement some locking mechanisms
> > > > > > or requiring some rework to make it work properly.
> > > > > > 
> > > > > > Implement 1-byte and 2-bytes xchg in order to achieve parity with other
> > > > > > architectures.
> > > > > > 
> > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > > > 
> > > > 
> > > > Hello Arnd Bergmann, thanks for reviewing!
> > > > 
> > > > > Parity with other architectures by itself is not a reason to do this,
> > > > > in particular the other architectures you listed have the instructions
> > > > > in hardware while riscv does not.
> > > > 
> > > > Sure, I understand RISC-V don't have native support for xchg on variables of
> > > > size < 4B. My argument is that it's nice to have even an emulated version for
> > > > this in case any future mechanism wants to use it.
> > > > 
> > > > Not having it may mean we won't be able to enable given mechanism in RISC-V. 
> > > 
> > > IIUC the ask is to have a user within the kernel for these functions.  
> > > That's the general thing to do, and last time this came up there was no 
> > > in-kernel use of it -- the qspinlock stuff would, but we haven't enabled 
> > > it yet because we're worried about the performance/fairness stuff that 
> > > other ports have seen and nobody's got concrete benchmarks yet (though 
> > > there's another patch set out that I haven't had time to look through, 
> > > so that may have changed).
> > > 
> > > So if something uses these I'm happy to go look closer.
> > 
> > IIUC patches 4 & 5 will be used by qspinlock, which may not be done yet, so we
> > don't have an use for them for the time being.
> > 
> > Otherwise, any comments on patches 1, 2 & 3?
> 
> ping

Hi,

I believe the "RFC" makes some reviewers think the series isn't ready
for review, so could you please send a new one w/o RFC?

thanks

> 
> > 
> > > 
> > > > > Emulating the small xchg() through cmpxchg() is particularly tricky
> > > > > since it's easy to run into a case where this does not guarantee
> > > > > forward progress.
> > > > > 
> > > > 
> > > > Didn't get this part:
> > > > By "emulating small xchg() through cmpxchg()", did you mean like emulating an
> > > > xchg (usually 1 instruction) with lr & sc (same used in cmpxchg) ?
> > > > 
> > > > If so, yeah, it's a fair point: in some extreme case we could have multiple
> > > > threads accessing given cacheline and have sc always failing. On the other hand,
> > > > there are 2 arguments on that:
> > > > 
> > > > 1 - Other architectures, (such as powerpc, arm and arm64 without LSE atomics)
> > > > also seem to rely in this mechanism for every xchg size. Another archs like csky
> > > > and loongarch use asm that look like mine to handle size < 4B xchg. 
> > > >     
> > > > 
> > > > >  This is also something that almost no architecture
> > > > > specific code relies on (generic qspinlock being a notable exception).
> > > > > 
> > > > 
> > > > 2 - As you mentioned, there should be very little code that will actually make
> > > > use of xchg for vars < 4B, so it should be safe to assume its fine to not
> > > > guarantee forward progress for those rare usages (like some of above mentioned
> > > > archs).
> > > > 
> > > > > I would recommend just dropping this patch from the series, at least
> > > > > until there is a need for it.
> > > > 
> > > > While I agree this is a valid point, I believe it's more interesting to have it
> > > > implemented in case any future mechanism wants to make use of it. 
> > > > 
> > > > 
> > > > Thanks!
> > > > Leo
> > > 
> > 
> 
> _______________________________________________
> linux-riscv mailing list
> linux-riscv@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv
Leonardo Bras Jan. 3, 2024, 3:31 p.m. UTC | #16
On Wed, Jan 03, 2024 at 07:05:04PM +0800, Jisheng Zhang wrote:
> On Tue, Dec 05, 2023 at 09:56:44PM -0300, leobras.c@gmail.com wrote:
> > From: Leonardo Bras <leobras@redhat.com>
> > 
> > On Wed, Aug 30, 2023 at 06:59:46PM -0300, Leonardo Brás wrote:
> > > On Thu, 2023-08-10 at 09:23 -0700, Palmer Dabbelt wrote:
> > > > On Thu, 10 Aug 2023 09:04:04 PDT (-0700), leobras@redhat.com wrote:
> > > > > On Thu, 2023-08-10 at 08:51 +0200, Arnd Bergmann wrote:
> > > > > > On Thu, Aug 10, 2023, at 06:03, Leonardo Bras wrote:
> > > > > > > xchg for variables of size 1-byte and 2-bytes is not yet available for
> > > > > > > riscv, even though it's present in other architectures such as arm64 and
> > > > > > > x86. This could lead to not being able to implement some locking mechanisms
> > > > > > > or requiring some rework to make it work properly.
> > > > > > > 
> > > > > > > Implement 1-byte and 2-bytes xchg in order to achieve parity with other
> > > > > > > architectures.
> > > > > > > 
> > > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > > > > 
> > > > > 
> > > > > Hello Arnd Bergmann, thanks for reviewing!
> > > > > 
> > > > > > Parity with other architectures by itself is not a reason to do this,
> > > > > > in particular the other architectures you listed have the instructions
> > > > > > in hardware while riscv does not.
> > > > > 
> > > > > Sure, I understand RISC-V doesn't have native support for xchg on variables of
> > > > > size < 4B. My argument is that it's nice to have even an emulated version for
> > > > > this in case any future mechanism wants to use it.
> > > > > 
> > > > > Not having it may mean we won't be able to enable a given mechanism on RISC-V. 
> > > > 
> > > > IIUC the ask is to have a user within the kernel for these functions.  
> > > > That's the general thing to do, and last time this came up there was no 
> > > > in-kernel use of it -- the qspinlock stuff would, but we haven't enabled 
> > > > it yet because we're worried about the performance/fairness stuff that 
> > > > other ports have seen and nobody's got concrete benchmarks yet (though 
> > > > there's another patch set out that I haven't had time to look through, 
> > > > so that may have changed).
> > > > 
> > > > So if something uses these I'm happy to go look closer.
> > > 
> > > IIUC patches 4 & 5 will be used by qspinlock, which may not be done yet, so we
> > > don't have a use for them for the time being.
> > > 
> > > Otherwise, any comments on patches 1, 2 & 3?
> > 
> > ping
> 
> Hi,
> 
> I believe the "RFC" makes some reviewers think the series isn't ready
> for review, so could you please send a new one w/o RFC?
> 
> thanks

Sure, will do.

Thanks!
Leo

> 
> > 
> > > 
> > > > 
> > > > > > Emulating the small xchg() through cmpxchg() is particularly tricky
> > > > > > since it's easy to run into a case where this does not guarantee
> > > > > > forward progress.
> > > > > > 
> > > > > 
> > > > > Didn't get this part:
> > > > > By "emulating small xchg() through cmpxchg()", did you mean emulating an
> > > > > xchg (usually 1 instruction) with lr & sc (the same as used in cmpxchg)?
> > > > > 
> > > > > If so, yeah, it's a fair point: in some extreme case we could have multiple
> > > > > threads accessing a given cacheline and have sc always failing. On the other
> > > > > hand, there are 2 arguments on that:
> > > > > 
> > > > > 1 - Other architectures (such as powerpc, arm, and arm64 without LSE atomics)
> > > > > also seem to rely on this mechanism for every xchg size. Other archs like csky
> > > > > and loongarch use asm that looks like mine to handle size < 4B xchg. 
> > > > >     
> > > > > 
> > > > > >  This is also something that almost no architecture-specific
> > > > > > code relies on (generic qspinlock being a notable exception).
> > > > > > 
> > > > > 
> > > > > 2 - As you mentioned, there should be very little code that will actually make
> > > > > use of xchg for vars < 4B, so it should be safe to assume it's fine to not
> > > > > guarantee forward progress for those rare usages (as in some of the
> > > > > above-mentioned archs).
> > > > > 
> > > > > > I would recommend just dropping this patch from the series, at least
> > > > > > until there is a need for it.
> > > > > 
> > > > > While I agree this is a valid point, I believe it's more interesting to have it
> > > > > implemented in case any future mechanism wants to make use of it. 
> > > > > 
> > > > > 
> > > > > Thanks!
> > > > > Leo
> > > > 
> > > 
> > 
>
Patch

diff --git a/arch/riscv/include/asm/cmpxchg.h b/arch/riscv/include/asm/cmpxchg.h
index ac9d0eeb74e6..26cea2395aae 100644
--- a/arch/riscv/include/asm/cmpxchg.h
+++ b/arch/riscv/include/asm/cmpxchg.h
@@ -11,6 +11,31 @@ 
 #include <asm/barrier.h>
 #include <asm/fence.h>
 
+#define __arch_xchg_masked(prepend, append, r, p, n)			\
+({									\
+	u32 *__ptr32b = (u32 *)((ulong)(p) & ~0x3);			\
+	ulong __s = ((ulong)(p) & (0x4 - sizeof(*p))) * BITS_PER_BYTE;	\
+	ulong __mask = GENMASK(((sizeof(*p)) * BITS_PER_BYTE) - 1, 0)	\
+			<< __s;						\
+	ulong __newx = (ulong)(n) << __s;				\
+	ulong __retx;							\
+	ulong __rc;							\
+									\
+	__asm__ __volatile__ (						\
+	       prepend							\
+	       "0:	lr.w %0, %2\n"					\
+	       "	and  %1, %0, %z4\n"				\
+	       "	or   %1, %1, %z3\n"				\
+	       "	sc.w %1, %1, %2\n"				\
+	       "	bnez %1, 0b\n"					\
+	       append							\
+	       : "=&r" (__retx), "=&r" (__rc), "+A" (*(__ptr32b))	\
+	       : "rJ" (__newx), "rJ" (~__mask)				\
+	       : "memory");						\
+									\
+	r = (__typeof__(*(p)))((__retx & __mask) >> __s);		\
+})
+
 #define __arch_xchg(sfx, prepend, append, r, p, n)			\
 ({									\
 	__asm__ __volatile__ (						\
@@ -27,7 +52,13 @@ 
 	__typeof__(ptr) __ptr = (ptr);					\
 	__typeof__(*(__ptr)) __new = (new);				\
 	__typeof__(*(__ptr)) __ret;					\
+									\
 	switch (sizeof(*__ptr)) {					\
+	case 1:								\
+	case 2:								\
+		__arch_xchg_masked(prepend, append,			\
+				   __ret, __ptr, __new);		\
+		break;							\
 	case 4:								\
 		__arch_xchg(".w" sfx, prepend, append,			\
 			      __ret, __ptr, __new);			\