[V11,04/17] locking/qspinlock: Improve xchg_tail for number of cpus >= 16k

Message ID 20230910082911.3378782-5-guoren@kernel.org (mailing list archive)
State Changes Requested
Series riscv: Add Native/Paravirt qspinlock support

Checks

Context: conchuod/tree_selection
Check: fail
Description: Failed to apply to next/pending-fixes, riscv/for-next or riscv/master

Commit Message

Guo Ren Sept. 10, 2023, 8:28 a.m. UTC
From: Guo Ren <guoren@linux.alibaba.com>

The purpose of xchg_tail is to write the tail into the lock value, so
adding a prefetchw helps the subsequent cmpxchg step and may reduce
the number of cmpxchg retry loops in xchg_tail. Some processors can
use this hint to provide a forward-progress guarantee, e.g., RISC-V
XuanTie processors block the snoop channel & irq for several cycles
after a prefetch.w instruction (from the Zicbop extension) retires,
which guarantees that the next cmpxchg succeeds.

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Signed-off-by: Guo Ren <guoren@kernel.org>
---
 kernel/locking/qspinlock.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Waiman Long Sept. 11, 2023, 2:35 a.m. UTC | #1
On 9/10/23 04:28, guoren@kernel.org wrote:
> From: Guo Ren <guoren@linux.alibaba.com>
>
> The target of xchg_tail is to write the tail to the lock value, so
> adding prefetchw could help the next cmpxchg step, which may
> decrease the cmpxchg retry loops of xchg_tail. Some processors may
> utilize this feature to give a forward guarantee, e.g., RISC-V
> XuanTie processors would block the snoop channel & irq for several
> cycles when prefetch.w instruction (from Zicbop extension) retired,
> which guarantees the next cmpxchg succeeds.
>
> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
> Signed-off-by: Guo Ren <guoren@kernel.org>
> ---
>   kernel/locking/qspinlock.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index d3f99060b60f..96b54e2ade86 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -223,7 +223,10 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
>    */
>   static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
>   {
> -	u32 old, new, val = atomic_read(&lock->val);
> +	u32 old, new, val;
> +
> +	prefetchw(&lock->val);
> +	val = atomic_read(&lock->val);
>   
>   	for (;;) {
>   		new = (val & _Q_LOCKED_PENDING_MASK) | tail;

That looks a bit weird. You pre-fetch and then immediately read it. How
much performance gain do you get from this change alone?

Maybe you can define an arch-specific primitive that defaults back to
atomic_read() if not defined.

Cheers,
Longman
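
For reference, a minimal sketch of the arch-overridable helper suggested
above, applied to the xchg_tail() in this patch; the name
atomic_read_prefetchw() and its placement are made up for illustration
and are not defined anywhere in the series:

#ifndef atomic_read_prefetchw
/*
 * Read lock->val while hinting that a store to it will follow shortly.
 * Architectures with a write-intent prefetch can override this; everyone
 * else falls back to a plain atomic_read().
 */
static __always_inline u32 atomic_read_prefetchw(atomic_t *v)
{
	return atomic_read(v);
}
#endif

static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
{
	u32 old, new, val = atomic_read_prefetchw(&lock->val);

	for (;;) {
		new = (val & _Q_LOCKED_PENDING_MASK) | tail;
		/*
		 * We can use relaxed semantics since the caller ensures that
		 * the MCS node is properly initialized before updating the
		 * tail.
		 */
		old = atomic_cmpxchg_relaxed(&lock->val, val, new);
		if (old == val)
			break;

		val = old;
	}
	return old;
}
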
Guo Ren Sept. 11, 2023, 3:09 a.m. UTC | #2
On Mon, Sep 11, 2023 at 10:35 AM Waiman Long <longman@redhat.com> wrote:
>
>
> On 9/10/23 04:28, guoren@kernel.org wrote:
> > From: Guo Ren <guoren@linux.alibaba.com>
> >
> > The target of xchg_tail is to write the tail to the lock value, so
> > adding prefetchw could help the next cmpxchg step, which may
> > decrease the cmpxchg retry loops of xchg_tail. Some processors may
> > utilize this feature to give a forward guarantee, e.g., RISC-V
> > XuanTie processors would block the snoop channel & irq for several
> > cycles when prefetch.w instruction (from Zicbop extension) retired,
> > which guarantees the next cmpxchg succeeds.
> >
> > Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
> > Signed-off-by: Guo Ren <guoren@kernel.org>
> > ---
> >   kernel/locking/qspinlock.c | 5 ++++-
> >   1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> > index d3f99060b60f..96b54e2ade86 100644
> > --- a/kernel/locking/qspinlock.c
> > +++ b/kernel/locking/qspinlock.c
> > @@ -223,7 +223,10 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
> >    */
> >   static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
> >   {
> > -     u32 old, new, val = atomic_read(&lock->val);
> > +     u32 old, new, val;
> > +
> > +     prefetchw(&lock->val);
> > +     val = atomic_read(&lock->val);
> >
> >       for (;;) {
> >               new = (val & _Q_LOCKED_PENDING_MASK) | tail;
>
> That looks a bit weird. You pre-fetch and then immediately read it. How
> much performance gain you get by this change alone?
>
> Maybe you can define an arch specific primitive that default back to
> atomic_read() if not defined.
Thx for the reply. This is a generic optimization point I would like
to discuss with you.

First, prefetchw() brings the cacheline into the exclusive state and
serves the following cmpxchg loop, which writes the idx_tail part of
arch_spin_lock. atomic_read() only brings the cacheline into the shared
state, which gives no guarantee for the following cmpxchg loop. A
micro-architecture can utilize prefetchw() to provide a strong
forward-progress guarantee for xchg_tail, e.g., the T-HEAD XuanTie
processor holds the exclusive cacheline state until the next cmpxchg
write succeeds.

In the end, let's go back to the principle: xchg_tail is an atomic swap
operation that eventually performs a write, so issuing a prefetchw() at
the beginning is acceptable for all architectures.

>
> Cheers,
> Longman
>
Waiman Long Sept. 11, 2023, 1:03 p.m. UTC | #3
On 9/10/23 23:09, Guo Ren wrote:
> On Mon, Sep 11, 2023 at 10:35 AM Waiman Long <longman@redhat.com> wrote:
>>
>> On 9/10/23 04:28, guoren@kernel.org wrote:
>>> From: Guo Ren <guoren@linux.alibaba.com>
>>>
>>> The target of xchg_tail is to write the tail to the lock value, so
>>> adding prefetchw could help the next cmpxchg step, which may
>>> decrease the cmpxchg retry loops of xchg_tail. Some processors may
>>> utilize this feature to give a forward guarantee, e.g., RISC-V
>>> XuanTie processors would block the snoop channel & irq for several
>>> cycles when prefetch.w instruction (from Zicbop extension) retired,
>>> which guarantees the next cmpxchg succeeds.
>>>
>>> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
>>> Signed-off-by: Guo Ren <guoren@kernel.org>
>>> ---
>>>    kernel/locking/qspinlock.c | 5 ++++-
>>>    1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
>>> index d3f99060b60f..96b54e2ade86 100644
>>> --- a/kernel/locking/qspinlock.c
>>> +++ b/kernel/locking/qspinlock.c
>>> @@ -223,7 +223,10 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
>>>     */
>>>    static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
>>>    {
>>> -     u32 old, new, val = atomic_read(&lock->val);
>>> +     u32 old, new, val;
>>> +
>>> +     prefetchw(&lock->val);
>>> +     val = atomic_read(&lock->val);
>>>
>>>        for (;;) {
>>>                new = (val & _Q_LOCKED_PENDING_MASK) | tail;
>> That looks a bit weird. You pre-fetch and then immediately read it. How
>> much performance gain you get by this change alone?
>>
>> Maybe you can define an arch specific primitive that default back to
>> atomic_read() if not defined.
> Thx for the reply. This is a generic optimization point I would like
> to talk about with you.
>
> First, prefetchw() makes cacheline an exclusive state and serves for
> the next cmpxchg loop semantic, which writes the idx_tail part of
> arch_spin_lock. The atomic_read only makes cacheline in the shared
> state, which couldn't give any guarantee for the next cmpxchg loop
> semantic. Micro-architecture could utilize prefetchw() to provide a
> strong forward progress guarantee for the xchg_tail, e.g., the T-HEAD
> XuanTie processor would hold the exclusive cacheline state until the
> next cmpxchg write success.
>
> In the end, Let's go back to the principle: the xchg_tail is an atomic
> swap operation that contains write eventually, so giving a prefetchw()
> at the beginning is acceptable for all architectures..

I did realize afterward that prefetchw gets the cacheline in exclusive
state. I suggest you mention that in your commit log as well as add a
comment about its purpose in the code.

Thanks,
Longman

>> Cheers,
>> Longman
>>
>
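
For illustration, a hunk carrying the kind of comment requested above
might look roughly like the sketch below; the comment wording is a guess,
not the actual v12 change:

--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -223,7 +223,16 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
  */
 static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 {
-	u32 old, new, val = atomic_read(&lock->val);
+	u32 old, new, val;
+
+	/*
+	 * The cmpxchg loop below writes lock->val, so prefetch the cacheline
+	 * with write intent (exclusive state) up front: this can reduce the
+	 * number of cmpxchg retries and lets some microarchitectures provide
+	 * a forward-progress guarantee for the subsequent cmpxchg.
+	 */
+	prefetchw(&lock->val);
+	val = atomic_read(&lock->val);
 
 	for (;;) {
 		new = (val & _Q_LOCKED_PENDING_MASK) | tail;
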
Guo Ren Sept. 12, 2023, 1:10 a.m. UTC | #4
On Mon, Sep 11, 2023 at 9:03 PM Waiman Long <longman@redhat.com> wrote:
>
> On 9/10/23 23:09, Guo Ren wrote:
> > On Mon, Sep 11, 2023 at 10:35 AM Waiman Long <longman@redhat.com> wrote:
> >>
> >> On 9/10/23 04:28, guoren@kernel.org wrote:
> >>> From: Guo Ren <guoren@linux.alibaba.com>
> >>>
> >>> The target of xchg_tail is to write the tail to the lock value, so
> >>> adding prefetchw could help the next cmpxchg step, which may
> >>> decrease the cmpxchg retry loops of xchg_tail. Some processors may
> >>> utilize this feature to give a forward guarantee, e.g., RISC-V
> >>> XuanTie processors would block the snoop channel & irq for several
> >>> cycles when prefetch.w instruction (from Zicbop extension) retired,
> >>> which guarantees the next cmpxchg succeeds.
> >>>
> >>> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
> >>> Signed-off-by: Guo Ren <guoren@kernel.org>
> >>> ---
> >>>    kernel/locking/qspinlock.c | 5 ++++-
> >>>    1 file changed, 4 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> >>> index d3f99060b60f..96b54e2ade86 100644
> >>> --- a/kernel/locking/qspinlock.c
> >>> +++ b/kernel/locking/qspinlock.c
> >>> @@ -223,7 +223,10 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
> >>>     */
> >>>    static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
> >>>    {
> >>> -     u32 old, new, val = atomic_read(&lock->val);
> >>> +     u32 old, new, val;
> >>> +
> >>> +     prefetchw(&lock->val);
> >>> +     val = atomic_read(&lock->val);
> >>>
> >>>        for (;;) {
> >>>                new = (val & _Q_LOCKED_PENDING_MASK) | tail;
> >> That looks a bit weird. You pre-fetch and then immediately read it. How
> >> much performance gain you get by this change alone?
> >>
> >> Maybe you can define an arch specific primitive that default back to
> >> atomic_read() if not defined.
> > Thx for the reply. This is a generic optimization point I would like
> > to talk about with you.
> >
> > First, prefetchw() makes cacheline an exclusive state and serves for
> > the next cmpxchg loop semantic, which writes the idx_tail part of
> > arch_spin_lock. The atomic_read only makes cacheline in the shared
> > state, which couldn't give any guarantee for the next cmpxchg loop
> > semantic. Micro-architecture could utilize prefetchw() to provide a
> > strong forward progress guarantee for the xchg_tail, e.g., the T-HEAD
> > XuanTie processor would hold the exclusive cacheline state until the
> > next cmpxchg write success.
> >
> > In the end, Let's go back to the principle: the xchg_tail is an atomic
> > swap operation that contains write eventually, so giving a prefetchw()
> > at the beginning is acceptable for all architectures..
>
> I did realize afterward that prefetchw gets the cacheline in exclusive
> state. I will suggest you mention that in your commit log as well as
> adding a comment about its purpose in the code.
Okay, I will do that in v12, thx.

>
> Thanks,
> Longman
>
> >> Cheers,
> >> Longman
> >>
> >
>
Leonardo Bras Sept. 13, 2023, 8:55 a.m. UTC | #5
On Tue, Sep 12, 2023 at 09:10:08AM +0800, Guo Ren wrote:
> On Mon, Sep 11, 2023 at 9:03 PM Waiman Long <longman@redhat.com> wrote:
> >
> > On 9/10/23 23:09, Guo Ren wrote:
> > > On Mon, Sep 11, 2023 at 10:35 AM Waiman Long <longman@redhat.com> wrote:
> > >>
> > >> On 9/10/23 04:28, guoren@kernel.org wrote:
> > >>> From: Guo Ren <guoren@linux.alibaba.com>
> > >>>
> > >>> The target of xchg_tail is to write the tail to the lock value, so
> > >>> adding prefetchw could help the next cmpxchg step, which may
> > >>> decrease the cmpxchg retry loops of xchg_tail. Some processors may
> > >>> utilize this feature to give a forward guarantee, e.g., RISC-V
> > >>> XuanTie processors would block the snoop channel & irq for several
> > >>> cycles when prefetch.w instruction (from Zicbop extension) retired,
> > >>> which guarantees the next cmpxchg succeeds.
> > >>>
> > >>> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
> > >>> Signed-off-by: Guo Ren <guoren@kernel.org>
> > >>> ---
> > >>>    kernel/locking/qspinlock.c | 5 ++++-
> > >>>    1 file changed, 4 insertions(+), 1 deletion(-)
> > >>>
> > >>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> > >>> index d3f99060b60f..96b54e2ade86 100644
> > >>> --- a/kernel/locking/qspinlock.c
> > >>> +++ b/kernel/locking/qspinlock.c
> > >>> @@ -223,7 +223,10 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
> > >>>     */
> > >>>    static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
> > >>>    {
> > >>> -     u32 old, new, val = atomic_read(&lock->val);
> > >>> +     u32 old, new, val;
> > >>> +
> > >>> +     prefetchw(&lock->val);
> > >>> +     val = atomic_read(&lock->val);
> > >>>
> > >>>        for (;;) {
> > >>>                new = (val & _Q_LOCKED_PENDING_MASK) | tail;
> > >> That looks a bit weird. You pre-fetch and then immediately read it. How
> > >> much performance gain you get by this change alone?
> > >>
> > >> Maybe you can define an arch specific primitive that default back to
> > >> atomic_read() if not defined.
> > > Thx for the reply. This is a generic optimization point I would like
> > > to talk about with you.
> > >
> > > First, prefetchw() makes cacheline an exclusive state and serves for
> > > the next cmpxchg loop semantic, which writes the idx_tail part of
> > > arch_spin_lock. The atomic_read only makes cacheline in the shared
> > > state, which couldn't give any guarantee for the next cmpxchg loop
> > > semantic. Micro-architecture could utilize prefetchw() to provide a
> > > strong forward progress guarantee for the xchg_tail, e.g., the T-HEAD
> > > XuanTie processor would hold the exclusive cacheline state until the
> > > next cmpxchg write success.
> > >
> > > In the end, Let's go back to the principle: the xchg_tail is an atomic
> > > swap operation that contains write eventually, so giving a prefetchw()
> > > at the beginning is acceptable for all architectures..
> >
> > I did realize afterward that prefetchw gets the cacheline in exclusive
> > state. I will suggest you mention that in your commit log as well as
> > adding a comment about its purpose in the code.
> Okay, I would do that in v12, thx.

I would suggest adding a snippet from the ISA Extension doc:

"A prefetch.w instruction indicates to hardware that the cache block whose 
effective address is the sum of the base address specified in rs1 and the  
sign-extended offset encoded in imm[11:0], where imm[4:0] equals 0b00011,
is likely to be accessed by a data write (i.e. store) in the near future."

Other than that,
Reviewed-by: Leonardo Bras <leobras@redhat.com>


> 
> >
> > Thanks,
> > Longman
> >
> > >> Cheers,
> > >> Longman
> > >>
> > >
> >
> 
> 
> -- 
> Best Regards
>  Guo Ren
>
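
As background for the quoted Zicbop text, a rough sketch of how an
architecture might emit prefetch.w from prefetchw(); the raw .insn
encoding is used here only to avoid depending on assembler mnemonic
support, the Kconfig/feature gating a real implementation needs is
omitted, and the actual wiring in this series lives in a separate patch
and may differ:

/* Stops linux/prefetch.h from falling back to __builtin_prefetch(x, 1). */
#define ARCH_HAS_PREFETCHW

static inline void prefetchw(const void *ptr)
{
	/*
	 * prefetch.w sits in the OP-IMM/ORI encoding space with rd=x0;
	 * imm[4:0]=0b00011 selects the write prefetch and imm[11:5] holds
	 * the cache-block-aligned offset (0 here).
	 */
	asm volatile(".insn i 0x13, 0x6, x0, %0, 0x3" : : "r" (ptr));
}
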
Guo Ren Sept. 13, 2023, 12:52 p.m. UTC | #6
On Wed, Sep 13, 2023 at 4:55 PM Leonardo Bras <leobras@redhat.com> wrote:
>
> On Tue, Sep 12, 2023 at 09:10:08AM +0800, Guo Ren wrote:
> > On Mon, Sep 11, 2023 at 9:03 PM Waiman Long <longman@redhat.com> wrote:
> > >
> > > On 9/10/23 23:09, Guo Ren wrote:
> > > > On Mon, Sep 11, 2023 at 10:35 AM Waiman Long <longman@redhat.com> wrote:
> > > >>
> > > >> On 9/10/23 04:28, guoren@kernel.org wrote:
> > > >>> From: Guo Ren <guoren@linux.alibaba.com>
> > > >>>
> > > >>> The target of xchg_tail is to write the tail to the lock value, so
> > > >>> adding prefetchw could help the next cmpxchg step, which may
> > > >>> decrease the cmpxchg retry loops of xchg_tail. Some processors may
> > > >>> utilize this feature to give a forward guarantee, e.g., RISC-V
> > > >>> XuanTie processors would block the snoop channel & irq for several
> > > >>> cycles when prefetch.w instruction (from Zicbop extension) retired,
> > > >>> which guarantees the next cmpxchg succeeds.
> > > >>>
> > > >>> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
> > > >>> Signed-off-by: Guo Ren <guoren@kernel.org>
> > > >>> ---
> > > >>>    kernel/locking/qspinlock.c | 5 ++++-
> > > >>>    1 file changed, 4 insertions(+), 1 deletion(-)
> > > >>>
> > > >>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> > > >>> index d3f99060b60f..96b54e2ade86 100644
> > > >>> --- a/kernel/locking/qspinlock.c
> > > >>> +++ b/kernel/locking/qspinlock.c
> > > >>> @@ -223,7 +223,10 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
> > > >>>     */
> > > >>>    static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
> > > >>>    {
> > > >>> -     u32 old, new, val = atomic_read(&lock->val);
> > > >>> +     u32 old, new, val;
> > > >>> +
> > > >>> +     prefetchw(&lock->val);
> > > >>> +     val = atomic_read(&lock->val);
> > > >>>
> > > >>>        for (;;) {
> > > >>>                new = (val & _Q_LOCKED_PENDING_MASK) | tail;
> > > >> That looks a bit weird. You pre-fetch and then immediately read it. How
> > > >> much performance gain you get by this change alone?
> > > >>
> > > >> Maybe you can define an arch specific primitive that default back to
> > > >> atomic_read() if not defined.
> > > > Thx for the reply. This is a generic optimization point I would like
> > > > to talk about with you.
> > > >
> > > > First, prefetchw() makes cacheline an exclusive state and serves for
> > > > the next cmpxchg loop semantic, which writes the idx_tail part of
> > > > arch_spin_lock. The atomic_read only makes cacheline in the shared
> > > > state, which couldn't give any guarantee for the next cmpxchg loop
> > > > semantic. Micro-architecture could utilize prefetchw() to provide a
> > > > strong forward progress guarantee for the xchg_tail, e.g., the T-HEAD
> > > > XuanTie processor would hold the exclusive cacheline state until the
> > > > next cmpxchg write success.
> > > >
> > > > In the end, Let's go back to the principle: the xchg_tail is an atomic
> > > > swap operation that contains write eventually, so giving a prefetchw()
> > > > at the beginning is acceptable for all architectures..
> > >
> > > I did realize afterward that prefetchw gets the cacheline in exclusive
> > > state. I will suggest you mention that in your commit log as well as
> > > adding a comment about its purpose in the code.
> > Okay, I would do that in v12, thx.
>
> I would suggest adding a snippet from the ISA Extenstion doc:
>
> "A prefetch.w instruction indicates to hardware that the cache block whose
> effective address is the sum of the base address specified in rs1 and the
> sign-extended offset encoded in imm[11:0], where imm[4:0] equals 0b00011,
> is likely to be accessed by a data write (i.e. store) in the near future."
Good point, thx.

>
> Other than that,
> Reviewed-by: Leonardo Bras <leobras@redhat.com>
>
>
> >
> > >
> > > Thanks,
> > > Longman
> > >
> > > >> Cheers,
> > > >> Longman
> > > >>
> > > >
> > >
> >
> >
> > --
> > Best Regards
> >  Guo Ren
> >
>
Waiman Long Sept. 13, 2023, 1:06 p.m. UTC | #7
On 9/13/23 08:52, Guo Ren wrote:
> On Wed, Sep 13, 2023 at 4:55 PM Leonardo Bras <leobras@redhat.com> wrote:
>> On Tue, Sep 12, 2023 at 09:10:08AM +0800, Guo Ren wrote:
>>> On Mon, Sep 11, 2023 at 9:03 PM Waiman Long <longman@redhat.com> wrote:
>>>> On 9/10/23 23:09, Guo Ren wrote:
>>>>> On Mon, Sep 11, 2023 at 10:35 AM Waiman Long <longman@redhat.com> wrote:
>>>>>> On 9/10/23 04:28, guoren@kernel.org wrote:
>>>>>>> From: Guo Ren <guoren@linux.alibaba.com>
>>>>>>>
>>>>>>> The target of xchg_tail is to write the tail to the lock value, so
>>>>>>> adding prefetchw could help the next cmpxchg step, which may
>>>>>>> decrease the cmpxchg retry loops of xchg_tail. Some processors may
>>>>>>> utilize this feature to give a forward guarantee, e.g., RISC-V
>>>>>>> XuanTie processors would block the snoop channel & irq for several
>>>>>>> cycles when prefetch.w instruction (from Zicbop extension) retired,
>>>>>>> which guarantees the next cmpxchg succeeds.
>>>>>>>
>>>>>>> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
>>>>>>> Signed-off-by: Guo Ren <guoren@kernel.org>
>>>>>>> ---
>>>>>>>     kernel/locking/qspinlock.c | 5 ++++-
>>>>>>>     1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
>>>>>>> index d3f99060b60f..96b54e2ade86 100644
>>>>>>> --- a/kernel/locking/qspinlock.c
>>>>>>> +++ b/kernel/locking/qspinlock.c
>>>>>>> @@ -223,7 +223,10 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
>>>>>>>      */
>>>>>>>     static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
>>>>>>>     {
>>>>>>> -     u32 old, new, val = atomic_read(&lock->val);
>>>>>>> +     u32 old, new, val;
>>>>>>> +
>>>>>>> +     prefetchw(&lock->val);
>>>>>>> +     val = atomic_read(&lock->val);
>>>>>>>
>>>>>>>         for (;;) {
>>>>>>>                 new = (val & _Q_LOCKED_PENDING_MASK) | tail;
>>>>>> That looks a bit weird. You pre-fetch and then immediately read it. How
>>>>>> much performance gain you get by this change alone?
>>>>>>
>>>>>> Maybe you can define an arch specific primitive that default back to
>>>>>> atomic_read() if not defined.
>>>>> Thx for the reply. This is a generic optimization point I would like
>>>>> to talk about with you.
>>>>>
>>>>> First, prefetchw() makes cacheline an exclusive state and serves for
>>>>> the next cmpxchg loop semantic, which writes the idx_tail part of
>>>>> arch_spin_lock. The atomic_read only makes cacheline in the shared
>>>>> state, which couldn't give any guarantee for the next cmpxchg loop
>>>>> semantic. Micro-architecture could utilize prefetchw() to provide a
>>>>> strong forward progress guarantee for the xchg_tail, e.g., the T-HEAD
>>>>> XuanTie processor would hold the exclusive cacheline state until the
>>>>> next cmpxchg write success.
>>>>>
>>>>> In the end, Let's go back to the principle: the xchg_tail is an atomic
>>>>> swap operation that contains write eventually, so giving a prefetchw()
>>>>> at the beginning is acceptable for all architectures..
>>>> I did realize afterward that prefetchw gets the cacheline in exclusive
>>>> state. I will suggest you mention that in your commit log as well as
>>>> adding a comment about its purpose in the code.
>>> Okay, I would do that in v12, thx.
>> I would suggest adding a snippet from the ISA Extenstion doc:
>>
>> "A prefetch.w instruction indicates to hardware that the cache block whose
>> effective address is the sum of the base address specified in rs1 and the
>> sign-extended offset encoded in imm[11:0], where imm[4:0] equals 0b00011,
>> is likely to be accessed by a data write (i.e. store) in the near future."
> Good point, thx.

qspinlock is generic code. I suppose this is for the RISC-V architecture.
You can mention that in the commit log as an example, but I prefer a more
generic comment, especially in the code.

Cheers,
Longman
Guo Ren Sept. 14, 2023, 3:45 a.m. UTC | #8
On Wed, Sep 13, 2023 at 9:06 PM Waiman Long <longman@redhat.com> wrote:
>
> On 9/13/23 08:52, Guo Ren wrote:
> > On Wed, Sep 13, 2023 at 4:55 PM Leonardo Bras <leobras@redhat.com> wrote:
> >> On Tue, Sep 12, 2023 at 09:10:08AM +0800, Guo Ren wrote:
> >>> On Mon, Sep 11, 2023 at 9:03 PM Waiman Long <longman@redhat.com> wrote:
> >>>> On 9/10/23 23:09, Guo Ren wrote:
> >>>>> On Mon, Sep 11, 2023 at 10:35 AM Waiman Long <longman@redhat.com> wrote:
> >>>>>> On 9/10/23 04:28, guoren@kernel.org wrote:
> >>>>>>> From: Guo Ren <guoren@linux.alibaba.com>
> >>>>>>>
> >>>>>>> The target of xchg_tail is to write the tail to the lock value, so
> >>>>>>> adding prefetchw could help the next cmpxchg step, which may
> >>>>>>> decrease the cmpxchg retry loops of xchg_tail. Some processors may
> >>>>>>> utilize this feature to give a forward guarantee, e.g., RISC-V
> >>>>>>> XuanTie processors would block the snoop channel & irq for several
> >>>>>>> cycles when prefetch.w instruction (from Zicbop extension) retired,
> >>>>>>> which guarantees the next cmpxchg succeeds.
> >>>>>>>
> >>>>>>> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
> >>>>>>> Signed-off-by: Guo Ren <guoren@kernel.org>
> >>>>>>> ---
> >>>>>>>     kernel/locking/qspinlock.c | 5 ++++-
> >>>>>>>     1 file changed, 4 insertions(+), 1 deletion(-)
> >>>>>>>
> >>>>>>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> >>>>>>> index d3f99060b60f..96b54e2ade86 100644
> >>>>>>> --- a/kernel/locking/qspinlock.c
> >>>>>>> +++ b/kernel/locking/qspinlock.c
> >>>>>>> @@ -223,7 +223,10 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
> >>>>>>>      */
> >>>>>>>     static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
> >>>>>>>     {
> >>>>>>> -     u32 old, new, val = atomic_read(&lock->val);
> >>>>>>> +     u32 old, new, val;
> >>>>>>> +
> >>>>>>> +     prefetchw(&lock->val);
> >>>>>>> +     val = atomic_read(&lock->val);
> >>>>>>>
> >>>>>>>         for (;;) {
> >>>>>>>                 new = (val & _Q_LOCKED_PENDING_MASK) | tail;
> >>>>>> That looks a bit weird. You pre-fetch and then immediately read it. How
> >>>>>> much performance gain you get by this change alone?
> >>>>>>
> >>>>>> Maybe you can define an arch specific primitive that default back to
> >>>>>> atomic_read() if not defined.
> >>>>> Thx for the reply. This is a generic optimization point I would like
> >>>>> to talk about with you.
> >>>>>
> >>>>> First, prefetchw() makes cacheline an exclusive state and serves for
> >>>>> the next cmpxchg loop semantic, which writes the idx_tail part of
> >>>>> arch_spin_lock. The atomic_read only makes cacheline in the shared
> >>>>> state, which couldn't give any guarantee for the next cmpxchg loop
> >>>>> semantic. Micro-architecture could utilize prefetchw() to provide a
> >>>>> strong forward progress guarantee for the xchg_tail, e.g., the T-HEAD
> >>>>> XuanTie processor would hold the exclusive cacheline state until the
> >>>>> next cmpxchg write success.
> >>>>>
> >>>>> In the end, Let's go back to the principle: the xchg_tail is an atomic
> >>>>> swap operation that contains write eventually, so giving a prefetchw()
> >>>>> at the beginning is acceptable for all architectures..
> >>>> I did realize afterward that prefetchw gets the cacheline in exclusive
> >>>> state. I will suggest you mention that in your commit log as well as
> >>>> adding a comment about its purpose in the code.
> >>> Okay, I would do that in v12, thx.
> >> I would suggest adding a snippet from the ISA Extenstion doc:
> >>
> >> "A prefetch.w instruction indicates to hardware that the cache block whose
> >> effective address is the sum of the base address specified in rs1 and the
> >> sign-extended offset encoded in imm[11:0], where imm[4:0] equals 0b00011,
> >> is likely to be accessed by a data write (i.e. store) in the near future."
> > Good point, thx.
>
> qspinlock is generic code. I suppose this is for the RISCV architecture.
> You can mention that in the commit log as an example, but I prefer more
> generic comment especially in the code.
Okay, I will only leave a generic comment on it and move Leonardo's
advice into this patch:
https://lore.kernel.org/linux-riscv/ZQF3qS1KRYAt3coC@redhat.com/


>
> Cheers,
> Longman
>
Patch

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index d3f99060b60f..96b54e2ade86 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -223,7 +223,10 @@  static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
  */
 static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 {
-	u32 old, new, val = atomic_read(&lock->val);
+	u32 old, new, val;
+
+	prefetchw(&lock->val);
+	val = atomic_read(&lock->val);
 
 	for (;;) {
 		new = (val & _Q_LOCKED_PENDING_MASK) | tail;