
lockless case of retain_dentry() (was Re: [PATCH 09/15] fold the call of retain_dentry() into fast_dput())

Message ID 20231110042041.GL1957730@ZenIV (mailing list archive)
State New
Series lockless case of retain_dentry() (was Re: [PATCH 09/15] fold the call of retain_dentry() into fast_dput())

Commit Message

Al Viro Nov. 10, 2023, 4:20 a.m. UTC
On Wed, Nov 01, 2023 at 06:19:10PM +0000, Al Viro wrote:
 
> gcc-12 on x86 turns the series of ifs into
>         movl    %edi, %eax
> 	andl    $32832, %eax
> 	cmpl    $32832, %eax
> 	jne     .L17
> 	andl    $168, %edi
> 	jne     .L17
> instead of combining that into
>         andl    $33000, %edi
> 	cmpl    $32832, %edi
> 	jne     .L17
> 
> OTOH, that's not much of pessimization...  Up to you.

	FWIW, on top of current #work.dcache2 the following delta might be worth
looking into.  Not sure if it's less confusing that way, though - I'd been staring
at that place for too long.  Code generation is slightly suboptimal with recent
gcc, but only marginally so.

Comments

Linus Torvalds Nov. 10, 2023, 5:57 a.m. UTC | #1
On Thu, 9 Nov 2023 at 20:20, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>         FWIW, on top of current #work.dcache2 the following delta might be worth
> looking into.  Not sure if it's less confusing that way, though - I'd been staring
> at that place for too long.  Code generation is slightly suboptimal with recent
> gcc, but only marginally so.

I doubt the pure ALU ops and a couple of extra conditional branches
(that _probably_ predict well) matter at all.

Especially since this is all after lockref_put_return() has done that
locked cmpxchg, which *is* expensive.
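
(For reference, the fast path being discussed is roughly the following,
paraphrased from the CMPXCHG_LOOP()/lockref_put_return() code in
lib/lockref.c that gets quoted later in this thread; the retry limit and
the exact bail-out conditions are elided:)

        struct lockref old, new;

        old.lock_count = READ_ONCE(lockref->lock_count);
        /* only try the lockless update while the embedded lock looks free */
        while (arch_spin_value_unlocked(old.lock.rlock.raw_lock)) {
                new = old;
                new.count--;
                /* one 64-bit cmpxchg covers both the spinlock and the count */
                if (try_cmpxchg64_relaxed(&lockref->lock_count,
                                          &old.lock_count, new.lock_count))
                        return new.count;
        }
        return -1;      /* caller falls back to taking the spinlock */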

My main reaction is that we use hlist_bl_unhashed() for d_unhashed(),
and we *intentionally* make it separate from the actual unhashing:

 - ___d_drop() does the __hlist_bl_del()

 - but d_unhashed() does hlist_bl_unhashed(), which checks
d_hash.pprev == NULL, and that's done by __d_drop

We even have a comment about this:

 * ___d_drop doesn't mark dentry as "unhashed"
 * (dentry->d_hash.pprev will be LIST_POISON2, not NULL).

and we depend on this in __d_move(), which will unhash things
temporarily, but not mark things unhashed, because they get re-hashed
again. Same goes for __d_add().
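
(Paraphrased sketch of that split, from fs/dcache.c and dcache.h; the
hash-chain locking and seqcount invalidation are elided:)

  static inline int d_unhashed(const struct dentry *dentry)
  {
          /* "unhashed" really means d_hash.pprev == NULL */
          return hlist_bl_unhashed(&dentry->d_hash);
  }

  void __d_drop(struct dentry *dentry)
  {
          if (!d_unhashed(dentry)) {
                  ___d_drop(dentry);            /* __hlist_bl_del() only */
                  dentry->d_hash.pprev = NULL;  /* now d_unhashed() is true */
          }
  }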

Anyway, what I'm actually getting at in a roundabout way is that maybe
we should make D_UNHASHED be another flag in d_flags, and *not* use
that d_hash.pprev field, and that would allow us to combine even more
of these tests in dput(), because now pretty much *all* of those
"retain_dentry()" checks would be about d_flags bits.

Hmm? As it is, it has that odd combination of d_flags and that
d_unhashed() test, so it's testing two different fields.
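
(Purely illustrative sketch of what that would buy - DCACHE_UNHASHED is
the hypothetical new d_flags bit, the helper name is made up, and the
mask below is not the real set of conditions retain_dentry() checks; the
point is just that the whole decision collapses to one load, mask and
compare on a single word:)

  static inline bool dput_fast_retain(const struct dentry *dentry)
  {
          unsigned int flags = READ_ONCE(dentry->d_flags);

          /* every "don't keep it cached" condition is now a d_flags bit */
          return !(flags & (DCACHE_UNHASHED |     /* hypothetical */
                            DCACHE_DISCONNECTED |
                            DCACHE_DONTCACHE |
                            DCACHE_OP_DELETE));
  }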

Anyway, I really don't think it matters much, but since you brought up
the whole suboptimal code generation..

I tried to look at dput() code generation, and it doesn't look
horrendous as-is in your dcache2 branch.

If anything, the thing that hurts is the lockref_put_return() being
out-of-line even though this is basically the only caller, plus people
have pessimized the arch_spin_value_unlocked() implementation *again*,
so that it uses a volatile read, when the *WHOLE*POINT* of that
"VALUE" part of "value_unlocked()" is that we've already read the
value, and we should *not* re-read it.

Damn.

The bug seems to affect both the generic qspinlock code, and the
ticket-based one.

For the ticket based ones, it's PeterZ and commit 1bce11126d57
("asm-generic: ticket-lock: New generic ticket-based spinlock"), which
does

  static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
  {
        return !arch_spin_is_locked(&lock);
  }

where we've got that "lock" value, but then it takes the address of
it, and uses arch_spin_is_locked() on it, so now it will force a flush
to memory, and then a READ_ONCE() on it.

And for the qspinlock code, we had a similar

  static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
  {
        return !atomic_read(&lock.val);
  }

thing, where it does 'atomic_read()' on the value it was passed as an argument.

Stupid, stupid. It's literally forcing a re-read of a value that is
guaranteed to be on stack.
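
(A toy illustration of the pessimization - not the kernel code, just the
shape of the problem with a stand-in lock type:)

  struct toy_lock { int counter; };

  static inline int toy_is_locked(struct toy_lock *lock)
  {
          return *(volatile int *)&lock->counter != 0;  /* forced memory read */
  }

  /*
   * bad: taking the address of the by-value copy forces a spill to the
   * stack followed by a volatile reload
   */
  static inline int toy_value_unlocked_bad(struct toy_lock lock)
  {
          return !toy_is_locked(&lock);
  }

  /* good: just test the value we were handed; no memory access at all */
  static inline int toy_value_unlocked_good(struct toy_lock lock)
  {
          return lock.counter == 0;
  }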

I know this worked at some point, but that may have been many years
ago, since I haven't looked at this part of lockref code generation in
ages.

Anyway, as a result now all the lockref functions will do silly "store
the old lockref value to memory, in order to read it again" dances in
that CMPXCHG_LOOP() loop.

It literally makes that whole "is this an unlocked value" function
completely pointless. The *whole* and only point was "look, I already
loaded the value from memory, is this *VALUE* unlocked?"

Compared to that complete braindamage in the fast-path loop, the small
extra ALU ops in fast_dput() are nothing.

Peter - those functions are done exactly the wrong way around.
arch_spin_is_locked() should be implemented using
arch_spin_value_unlocked(), not this way around.

And the queued spinlocks should not do an atomic_read() of the argument
they get, they should just do "!lock.val.counter".

So something like this should fix lockref. ENTIRELY UNTESTED, except
now the code generation of lockref_put_return() looks much better,
without a pointless flush to the stack, and now it has no pointless
stack frame as a result.

Of course, it should probably be inlined, since it has only one user
(ok, two, since fast_dput() gets used twice), and that should make the
return value testing much better.

               Linus
Linus Torvalds Nov. 10, 2023, 6:22 a.m. UTC | #2
On Thu, 9 Nov 2023 at 21:57, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So something like this should fix lockref. ENTIRELY UNTESTED, except
> now the code generation of lockref_put_return() looks much better,
> without a pointless flush to the stack, and now it has no pointless
> stack frame as a result.

Heh. And because I was looking at Al's tree, I didn't notice that
commit c6f4a9002252 ("asm-generic: ticket-lock: Optimize
arch_spin_value_unlocked()") had solved the ticket spinlock part of
this in this merge window in the meantime.

The qspinlock implementation - which is what x86 uses - is still
broken in mainline, though.

So that part of my patch still stands. Now attached just the small
one-liner part. Adding Ingo and Guo Ren, who did the ticket lock part
(and looks to have done it very similarly to my suggested patch).

Ingo?

                     Linus
Al Viro Nov. 10, 2023, 8:19 a.m. UTC | #3
On Thu, Nov 09, 2023 at 09:57:39PM -0800, Linus Torvalds wrote:

> Anyway, what I'm actually getting at in a roundabout way is that maybe
> we should make D_UNHASHED be another flag in d_flags, and *not* use
> that d_hash.pprev field, and that would allow us to combine even more
> of these tests in dput(), because now pretty much *all* of those
> "retain_dentry()" checks would be about d_flags bits.
> 
> Hmm? As it is, it has that odd combination of d_flags and that
> d_unhashed() test, so it's testing two different fields.

Hmm, indeed.  The trouble is, we are getting tight on the ->d_flags bits.
Only two unassigned bits left (0x08000000 and 0x80000000).

DCACHE_COOKIE is defined (0x00002000), but unused.  Should've been
taken out when dcookie stuff went.

DCACHE_DENTRY_KILLED might be mergeable with DCACHE_MAY_FREE now;
worth looking into.  In effect, DCACHE_MAY_FREE is set iff
we have both DCACHE_DENTRY_KILLED and DCACHE_SHRINK_LIST - and
the only place that checks it is guaranteed to have had
DCACHE_SHRINK_LIST.  Actually, that's nice - in terms of dentry
states we have
refcount > 0 <=> Busy
refcount == 0 <=> Retained
refcount < 0 && !KILLED <=> Dying
refcount < 0 && KILLED && !SHRINK_LIST <=> Freeing
refcount < 0 && KILLED && SHRINK_LIST <=> Husk.
<makes a note in the docs being written>

DCACHE_FALLTHRU is odd - it's never checked (or set, for that matter);
might be killable, might be intended for some overlayfs plans.

DCACHE_GENOCIDE might become killable, what with selinuxfs patch I've
got (apparently OK with selinux folks, will sort it out after -rc1).

OK, it's not as awful as I thought - one more bit won't hurt.
I'll go through the unlocked callers and see if any of those is
sensitive to separating setting that flag from hash list removal.
There might be dragons...

> Anyway, I really don't think it matters much, but since you brought up
> the whole suboptimal code generation..

FWIW, it's not all that suboptimal, at least with current gcc.  The thing
I'm really not sure about is whether that patch makes the whole thing
easier to follow - probably need to let it sit around for a week or so,
then look at it again; right now I don't trust my taste regarding that
particular change, having spent too much time today mucking with it ;-/
Guo Ren Nov. 22, 2023, 6:29 a.m. UTC | #4
On Thu, Nov 09, 2023 at 10:22:13PM -0800, Linus Torvalds wrote:
> On Thu, 9 Nov 2023 at 21:57, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So something like this should fix lockref. ENTIRELY UNTESTED, except
> > now the code generation of lockref_put_return() looks much better,
> > without a pointless flush to the stack, and now it has no pointless
> > stack frame as a result.
> 
> Heh. And because I was looking at Al's tree, I didn't notice that
> commit c6f4a9002252 ("asm-generic: ticket-lock: Optimize
> arch_spin_value_unlocked()") had solved the ticket spinlock part of
> this in this merge window in the meantime.
> 
> The qspinlock implementation - which is what x86 uses - is still
> broken in mainline, though.
> 
> So that part of my patch still stands. Now attached just the small
> one-liner part. Adding Ingo and Guo Ren, who did the ticket lock part
> (and looks to have done it very similarly to my suggested patch.
It's not only the generic ticket lock - I think Will Deacon recognized the
lockref problem with the arm32 ticket-lock back in 2013.

Now that my patch is merged, I think riscv could also select
ARCH_USE_CMPXCHG_LOCKREF in its Kconfig.

Ref:
commit 0cbad9c9dfe0c38e8ec7385b39087c005a6dee3e
Author: Will Deacon <will@kernel.org>
Date:   Wed Oct 9 17:19:22 2013 +0100

    ARM: 7854/1: lockref: add support for lockless lockrefs using cmpxchg64

    Our spinlocks are only 32-bit (2x16-bit tickets) and, on processors
    with 64-bit atomic instructions, cmpxchg64 makes use of the double-word
    exclusive accessors.

    This patch wires up the cmpxchg-based lockless lockref implementation
    for ARM.

    Signed-off-by: Will Deacon <will.deacon@arm.com>
    Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 1ad6fb6c094d..fc184bcd7848 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -5,6 +5,7 @@ config ARM
        select ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE
        select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
        select ARCH_HAVE_CUSTOM_GPIO_H
+       select ARCH_USE_CMPXCHG_LOCKREF
        select ARCH_WANT_IPC_PARSE_VERSION
        select BUILDTIME_EXTABLE_SORT if MMU
        select CLONE_BACKWARDS
diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h
index 4f2c28060c9a..ed6c22919e47 100644
--- a/arch/arm/include/asm/spinlock.h
+++ b/arch/arm/include/asm/spinlock.h
@@ -127,10 +127,14 @@ static inline void arch_spin_unlock(arch_spinlock_t *lock)
        dsb_sev();
 }

+static inline int arch_spin_value_unlocked(arch_spinlock_t lock)
+{
+       return lock.tickets.owner == lock.tickets.next;
+}
+
 static inline int arch_spin_is_locked(arch_spinlock_t *lock)
 {
-       struct __raw_tickets tickets = ACCESS_ONCE(lock->tickets);
-       return tickets.owner != tickets.next;
+       return !arch_spin_value_unlocked(ACCESS_ONCE(*lock));
 }

 static inline int arch_spin_is_contended(arch_spinlock_t *lock)

> 
> Ingo?
> 
>                      Linus

>  include/asm-generic/qspinlock.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
> index 995513fa2690..0655aa5b57b2 100644
> --- a/include/asm-generic/qspinlock.h
> +++ b/include/asm-generic/qspinlock.h
> @@ -70,7 +70,7 @@ static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
>   */
>  static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
>  {
> -	return !atomic_read(&lock.val);
> +	return !lock.val.counter;
>  }
>  
>  /**
Guo Ren Nov. 22, 2023, 7:19 a.m. UTC | #5
On Thu, Nov 09, 2023 at 09:57:39PM -0800, Linus Torvalds wrote:
> On Thu, 9 Nov 2023 at 20:20, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> >         FWIW, on top of current #work.dcache2 the following delta might be worth
> > looking into.  Not sure if it's less confusing that way, though - I'd been staring
> > at that place for too long.  Code generation is slightly suboptimal with recent
> > gcc, but only marginally so.
> 
> I doubt the pure ALU ops and a couple of extra conditional branches
> (that _probably_ predict well) matter at all.
> 
> Especially since this is all after lockref_put_return() has done that
> locked cmpxchg, which *is* expensive.
> 
> My main reaction is that we use hlist_bl_unhashed() for d_unhashed(),
> and we *intentionally* make it separate from the actual unhasing:
> 
>  - ___d_drop() does the __hlist_bl_del()
> 
>  - but d_unhashed() does hlist_bl_unhashed(), which checks
> d_hash.pprev == NULL, and that's done by __d_drop
> 
> We even have a comment about this:
> 
>  * ___d_drop doesn't mark dentry as "unhashed"
>  * (dentry->d_hash.pprev will be LIST_POISON2, not NULL).
> 
> and we depend on this in __d_move(), which will unhash things
> temporarily, but not mark things unhashed, because they get re-hashed
> again. Same goes for __d_add().
> 
> Anyway, what I'm actually getting at in a roundabout way is that maybe
> we should make D_UNHASHED be another flag in d_flags, and *not* use
> that d_hash.pprev field, and that would allow us to combine even more
> of these tests in dput(), because now pretty much *all* of those
> "retain_dentry()" checks would be about d_flags bits.
> 
> Hmm? As it is, it has that odd combination of d_flags and that
> d_unhashed() test, so it's testing two different fields.
> 
> Anyway, I really don't think it matters much, but since you brought up
> the whole suboptimal code generation..
> 
> I tried to look at dput() code generation, and it doesn't look
> horrendous as-is in your dcache2 branch.
> 
> If anything, the thing that hirs is the lockref_put_return() being
> out-of-line even though this is basically the only caller, plus people
> have pessimized the arch_spin_value_unlocked() implementation *again*,
> so that it uses a volatile read, when the *WHOLE*POINT* of that
> "VALUE" part of "value_unlocked()" is that we've already read the
> value, and we should *not* re-read it.
> 
> Damn.
> 
> The bug seems to affect both the generic qspinlock code, and the
> ticket-based one.
> 
> For the ticket based ones, it's PeterZ and commit 1bce11126d57
> ("asm-generic: ticket-lock: New generic ticket-based spinlock"), which
> does
> 
>   static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
>   {
>         return !arch_spin_is_locked(&lock);
>   }
> 
> where we've got that "lock" value, but then it takes the address of
> it, and uses arch_spin_is_locked() on it, so now it will force a flush
> to memory, and then an READ_ONCE() on it.
> 
> And for the qspinlock code, we had a similar

We discussed x86 qspinlock code generation. It didn't look as bad as I
thought, because qspinlock_spin_value_unlocked() is much cheaper than the
ticket-lock check. But the riscv ticket-lock code generation is terrible
because of the 16-bit left and right shifts.
https://lore.kernel.org/all/ZNG2tHFOABSXGCVi@gmail.com

> 
>   static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
>   {
>         return !atomic_read(&lock.val);
>   }
> 
> thing, where it does 'atomic_read()' on the value it was passed as an argument.
> 
> Stupid, stupid. It's literally forcing a re-read of a value that is
> guaranteed to be on stack.
> 
> I know this worked at some point, but that may have been many years
> ago, since I haven't looked at this part of lockref code generation in
> ages.
> 
> Anway, as a result now all the lockref functions will do silly "store
> the old lockref value to memory, in order to read it again" dances in
> that CMPXCHG_LOOP() loop.
> 
> It literally makes that whole "is this an unlocked value" function
> completely pointless. The *whole* and only point was "look, I already
> loaded the value from memory, is this *VALUE* unlocked.
> 
> Compared to that complete braindamage in the fast-path loop, the small
> extra ALU ops in fast_dput() are nothing.
> 
> Peter - those functions are done exactly the wrong way around.
> arch_spin_is_locked() should be implemented using
> arch_spin_value_unlocked(), not this way around.
> 
> And the queued spinlocks should not do an atomic_read()of the argument
> they get, they should just do "!lock.val.counter"
> 
> So something like this should fix lockref. ENTIRELY UNTESTED, except
> now the code generation of lockref_put_return() looks much better,
> without a pointless flush to the stack, and now it has no pointless
> stack frame as a result.
> 
> Of course, it should probably be inlined, since it has only one user
> (ok, two, since fast_dput() gets used twice), and that should make the
> return value testing much better.
> 
>                Linus

>  include/asm-generic/qspinlock.h |  2 +-
>  include/asm-generic/spinlock.h  | 17 +++++++++--------
>  2 files changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
> index 995513fa2690..0655aa5b57b2 100644
> --- a/include/asm-generic/qspinlock.h
> +++ b/include/asm-generic/qspinlock.h
> @@ -70,7 +70,7 @@ static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
>   */
>  static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
>  {
> -	return !atomic_read(&lock.val);
> +	return !lock.val.counter;
>  }
>  
>  /**
> diff --git a/include/asm-generic/spinlock.h b/include/asm-generic/spinlock.h
> index fdfebcb050f4..a35eda0ec2a2 100644
> --- a/include/asm-generic/spinlock.h
> +++ b/include/asm-generic/spinlock.h
> @@ -68,11 +68,17 @@ static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
>  	smp_store_release(ptr, (u16)val + 1);
>  }
>  
> +static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
> +{
> +	u32 val = lock.counter;
> +	return ((val >> 16) == (val & 0xffff));
> +}
> +
>  static __always_inline int arch_spin_is_locked(arch_spinlock_t *lock)
>  {
> -	u32 val = atomic_read(lock);
> -
> -	return ((val >> 16) != (val & 0xffff));
> +	arch_spinlock_t val;
> +	val.counter = atomic_read(lock);
> +	return !arch_spin_value_unlocked(val);
>  }
>  
>  static __always_inline int arch_spin_is_contended(arch_spinlock_t *lock)
> @@ -82,11 +88,6 @@ static __always_inline int arch_spin_is_contended(arch_spinlock_t *lock)
>  	return (s16)((val >> 16) - (val & 0xffff)) > 1;
>  }
>  
> -static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
> -{
> -	return !arch_spin_is_locked(&lock);
> -}
> -
>  #include <asm/qrwlock.h>
>  
>  #endif /* __ASM_GENERIC_SPINLOCK_H */
Linus Torvalds Nov. 22, 2023, 5:20 p.m. UTC | #6
On Tue, 21 Nov 2023 at 23:19, Guo Ren <guoren@kernel.org> wrote:
>
> We discussed x86 qspinlock code generation. It looked not too bad as I
> thought because qspinlock_spin_value_unlocked is much cheaper than the
> ticket-lock. But the riscv ticket-lock code generation is terrible
> because of the shift left & right 16-bit.
> https://lore.kernel.org/all/ZNG2tHFOABSXGCVi@gmail.com

No, it's not the 16-bit shifts in the spin_value_unlocked() check,
that just generates simple and straightforward code:

  a0:   0107569b                srlw    a3,a4,0x10
  a4:   00c77733                and     a4,a4,a2
  a8:   04e69063                bne     a3,a4,e8 <.L12>

(plus two stupid instructions for generating the immediate in a2 for
0xffff, but hey, that's the usual insane RISC-V encoding thing - you
can load a 20-bit U-immediate only shifted up by 12, if it's in the
lower bits you're kind of screwed and limited to 12-bit immediates).

The *bad* code generation is from the much simpler

        new.count++;

which sadly neither gcc nor clang is quite smart enough to understand
that "hey, I can do that in 64 bits".

It's incrementing the higher 32-bit word in a 64-bit union, and with a
smarter compiler it *should* basically become

        lock_count += 1 << 32;

but the compiler isn't that clever, so it splits the 64-bit word into
two 32-bit words, increments one of them, and then merges the two
words back into 64 bits:

  98:   4207d693                sra     a3,a5,0x20
  9c:   02079713                sll     a4,a5,0x20
  a0:   0016869b                addw    a3,a3,1
  a4:   02069693                sll     a3,a3,0x20
  a8:   02075713                srl     a4,a4,0x20
  ac:   00d76733                or      a4,a4,a3

which is pretty sad.

If you want to do the optimization that the compiler misses by hand,
it would be something like the attached patch.

NOTE! Very untested. But that *should* cause the compiler to just
generate a single "add" instruction (in addition to generating the
constant 0x100000000, of course).
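
(The attachment itself isn't inlined at this point in the archive; going
by the copy quoted later in the thread, the by-hand version amounts to
something like this, little-endian 64-bit only:)

  /*
   * bump the count by adding (1 << 32) to the combined 64-bit
   * {lock, count} word, instead of letting the compiler split and
   * re-merge the two 32-bit halves
   */
  #if defined(__LITTLE_ENDIAN) && BITS_PER_LONG == 64
   #define LOCKREF_INC(n) ((n).lock_count += 1ul << 32)
  #else
   #define LOCKREF_INC(n) ((n).count++)
  #endif

with the "new.count++" inside CMPXCHG_LOOP() then becoming
LOCKREF_INC(new).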

Of course, on a LL/SC architecture like RISC-V, in an *optimal* world,
the whole sequence would actually be done with one single LL/SC,
rather than the "load,add,cmpxchg" thing.

But then you'd have to do absolutely everything by hand in assembly.

                  Linus
Linus Torvalds Nov. 22, 2023, 5:52 p.m. UTC | #7
On Wed, 22 Nov 2023 at 09:20, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> If you want to do the optimization that the compiler misses by hand,
> it would be something like the attached patch.

Bah. Might as well do the reference decrements with the same logic,
not just the increments.

Of course, this is much more noticeable with the ticket locks, because
with the qspinlocks the "is this unlocked" test will check whether the
lock is all zeroes.

So with qspinlocks, the compiler sees that "oh, the low 32 bits are
zero", and the whole "merge the two words back to 64 bits" is much
cheaper, and doesn't generate quite the mess that it does for RISC-V
with ticket locks.

But this "treat the lockref as a 64-bit entity" thing is probably a
good thing on most 64-bit architectures, including x86 that has that
qspinlock thing.

Still not actually tested, but the code generation on x86 looks
reasonable, so it might be worth looking at whether it helps the
RISC-V case.

                 Linus
Linus Torvalds Nov. 22, 2023, 6:05 p.m. UTC | #8
On Wed, 22 Nov 2023 at 09:52, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Bah. Might as well do the reference decrements with the same logic,
> not just the increments.

And thanks for reminding me about this issue. I just committed the
trivial one-liner fix for qspinlock code generation that apparently
never went anywhere (mostly my own damn fault for not having pushed it
enough and made a proper commit message).

               Linus
Linus Torvalds Nov. 22, 2023, 7:11 p.m. UTC | #9
On Wed, 22 Nov 2023 at 09:52, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Still not actually tested, but the code generation on x86 looks
> reasonable, so it migth be worth looking at whether it helps the
> RISC-V case.

Doing some more munging, and actually looking at RISC-V code
generation too (I obviously had to enable ARCH_USE_CMPXCHG_LOCKREF for
RISC-V).

I get this:

  lockref_get:
        addi    sp,sp,-32
        sd      s0,16(sp)
        sd      s1,8(sp)
        sd      ra,24(sp)
        addi    s0,sp,32
        li      a1,65536
        ld      a5,0(a0)
        mv      s1,a0
        addi    a1,a1,-1
        li      a0,100
  .L43:
        sext.w  a3,a5
        li      a4,1
        srliw   a2,a5,16
        and     a3,a3,a1
        slli    a4,a4,32
        bne     a2,a3,.L49
        add     a4,a5,a4
  0:
        lr.d a3, 0(s1)
        bne a3, a5, 1f
        sc.d.rl a2, a4, 0(s1)
        bnez a2, 0b
        fence rw, rw
  1:
        bne     a5,a3,.L52
        ld      ra,24(sp)
        ld      s0,16(sp)
        ld      s1,8(sp)
        addi    sp,sp,32
        jr      ra
  ...

so now that single update is indeed just one single instruction:

        add     a4,a5,a4

is that "increment count in the high 32 bits".

The ticket lock being unlocked checks are those

        li      a1,65536
        sext.w  a3,a5
        srliw   a2,a5,16
        and     a3,a3,a1
        bne     a2,a3,.L49

instructions if I read it right.

That actually looks fairly close to optimal, although the frame setup
is kind of sad.

(The above does not include the "loop if the cmpxchg failed" part of
the code generation)

Anyway, apart from enabling LOCKREF, the patch to get this for RISC-V
is attached.

I'm not going to play with this any more, but you might want to check
whether this actually does work on RISC-V.

Because I only looked at the code generation, I didn't actually look at
whether it *worked*.

                Linus
Guo Ren Nov. 26, 2023, 4:39 p.m. UTC | #10
On Wed, Nov 22, 2023 at 09:20:53AM -0800, Linus Torvalds wrote:
> On Tue, 21 Nov 2023 at 23:19, Guo Ren <guoren@kernel.org> wrote:
> >
> > We discussed x86 qspinlock code generation. It looked not too bad as I
> > thought because qspinlock_spin_value_unlocked is much cheaper than the
> > ticket-lock. But the riscv ticket-lock code generation is terrible
> > because of the shift left & right 16-bit.
> > https://lore.kernel.org/all/ZNG2tHFOABSXGCVi@gmail.com
> 
> No, it's not the 16-bit shifts in the spin_value_unlocked() check,
> that just generates simple and straightforward code:
> 
>   a0:   0107569b                srlw    a3,a4,0x10
>   a4:   00c77733                and     a4,a4,a2
>   a8:   04e69063                bne     a3,a4,e8 <.L12>
> 
> (plus two stupid instructions for generating the immediate in a2 for
> 0xffff, but hey, that's the usual insane RISC-V encoding thing - you
> can load a 20-bit U-immediate only shifted up by 12, if it's in the
> lower bits you're kind of screwed and limited to 12-bit immediates).
> 
> The *bad* code generation is from the much simpler
> 
>         new.count++;
> 
> which sadly neither gcc not clang is quite smart enough to understand
> that "hey, I can do that in 64 bits".
> 
> It's incrementing the higher 32-bit word in a 64-bit union, and with a
> smarter compiler it *should* basically become
> 
>         lock_count += 1 << 32;
> 
> but the compiler isn't that clever, so it splits the 64-bit word into
> two 32-bit words, increments one of them, and then merges the two
> words back into 64 bits:
> 
>   98:   4207d693                sra     a3,a5,0x20
>   9c:   02079713                sll     a4,a5,0x20
>   a0:   0016869b                addw    a3,a3,1
>   a4:   02069693                sll     a3,a3,0x20
>   a8:   02075713                srl     a4,a4,0x20
>   ac:   00d76733                or      a4,a4,a3
> 
> which is pretty sad.
9c & a8 is for word-zero-extend; riscv would have zext.w in the future.
Your patch may improve above with:
	li      a4,1
	slli    a4,a4,32
	add     a4,a5,a4

v.s.
	sra     a3,a5,0x20
	zext.w	a4,a5
	addw    a3,a3,1
	or      a4,a4,a3
You win one instruction "or a4,a4,a3", which is less than one cycle.

The zext.w is important, and it could replace a lot of sll+srl pairs, so
I think it's a current ISA design shortcoming.

Here, what I want to improve is to prevent stack frame setup in the fast
path, and that's the most benefit my patch could give out. Unnecessary
memory access is the most important performance killer in SMP.

My patch removes the stack frame setup from the fast path.
void lockref_get(struct lockref *lockref)
{
  78:   00053783                ld      a5,0(a0)
000000000000007c <.LBB212>:
  7c:   00010637                lui     a2,0x10

0000000000000080 <.LBE212>:
  80:   06400593                li      a1,100

0000000000000084 <.LBB216>:
  84:   fff60613                add     a2,a2,-1 # ffff <.LLST8+0xf4aa>

0000000000000088 <.L8>:
  88:   0007871b                sext.w  a4,a5

000000000000008c <.LBB217>:
  8c:   0107d69b                srlw    a3,a5,0x10
  90:   00c77733                and     a4,a4,a2
  94:   04e69063                bne     a3,a4,d4 <.L12> --------+
						      		|
0000000000000098 <.LBB218>:					|
  98:   4207d693                sra     a3,a5,0x20		|
  9c:   02079713                sll     a4,a5,0x20		|
  a0:   0016869b                addw    a3,a3,1			|
  a4:   02069693                sll     a3,a3,0x20		|
  a8:   02075713                srl     a4,a4,0x20		|
  ac:   00d76733                or      a4,a4,a3		|
								|
00000000000000b0 <.L0^B1>:					|
  b0:   100536af                lr.d    a3,(a0)			|
  b4:   00f69863                bne     a3,a5,c4 <.L1^B1>	|
  b8:   1ae5382f                sc.d.rl a6,a4,(a0)		|
  bc:   fe081ae3                bnez    a6,b0 <.L0^B1>		|
  c0:   0330000f                fence   rw,rw			|
								|
00000000000000c4 <.L1^B1>:					|
  c4:   04d78a63                beq     a5,a3,118 <.L18>	|
								|
00000000000000c8 <.LBE228>:					|
  c8:   fff5859b                addw    a1,a1,-1		|	
								|
00000000000000cc <.LBB229>:					|
  cc:   00068793                mv      a5,a3			|
								|
00000000000000d0 <.LBE229>:					|
  d0:   fa059ce3                bnez    a1,88 <.L8>		|
						     		|
00000000000000d4 <.L12>: <--------------------------------------+
{						      slow_path
  d4:   fe010113                add     sp,sp,-32
  d8:   00113c23                sd      ra,24(sp)
  dc:   00813823                sd      s0,16(sp)
  e0:   02010413                add     s0,sp,32


> 
> If you want to do the optimization that the compiler misses by hand,
> it would be something like the attached patch.
> 
> NOTE! Very untested. But that *should* cause the compiler to just
> generate a single "add" instruction (in addition to generating the
> constant 0x100000000, of course).
> 
> Of course, on a LL/SC architecture like RISC-V, in an *optimal* world,
> the whole sequence would actually be done with one single LL/SC,
> rather than the "load,add,cmpxchg" thing.
> 
> But then you'd have to do absolutely everything by hand in assembly.
No, it's not worth doing that.
 - Linux only has atomic primitives; there is no ll/sc primitive in
   the real world. The world belongs to AMO, and the only use of ll/sc
   is to implement AMO and CAS.
 - Using a single ll/sc primitive instead of cmpxchg is similar to your
   patch, and you may win one cycle or not.
 - The critical work here is reducing bus transactions, preventing
   the cache dance, and guaranteeing forward progress.

Here is my optimization advice:

#define CMPXCHG_LOOP(CODE, SUCCESS) do {                                        \
        int retry = 100;                                                        \
        struct lockref old;                                                     \
        BUILD_BUG_ON(sizeof(old) != 8);                                         \
+       prefetchw(lockref);                                                     \
        old.lock_count = READ_ONCE(lockref->lock_count);                        \
        while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {     \
                struct lockref new = old;                                       \
                CODE                                                            \
                if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,          \
                                                 &old.lock_count,               \
                                                 new.lock_count))) {            \
                        SUCCESS;                                                \
                }                                                               \

A micro-arch could give prefetchw stronger guarantees:
 - Prefetch.w must guarantee cache line exclusiveness, even on a hit to a
   cache line in shareable state.
 - Hold the exclusive cache line for several cycles, until the next
   store or a timeout.
 - Mask interrupts during the holding cycles (optional).

The lockref slow path is killed on such a micro-architecture, which
means there is no chance of ever executing the spinlock path.

I've written down more details in my ppt:
https://docs.google.com/presentation/d/1UudBcj4cL_cjJexMpZNF9ppRzYxeYqsdBotIvU7sO2Q/edit?usp=sharing

This type of prefetchw could help large-size atomic operations within
one cache line. Compared to the transaction memory model, prefetchw
could give a forward progress guarantee and easier landing in Linux
without any new primitive.

> 
>                   Linus

>  lib/lockref.c | 17 ++++++++++++++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/lockref.c b/lib/lockref.c
> index 2afe4c5d8919..481b102a6476 100644
> --- a/lib/lockref.c
> +++ b/lib/lockref.c
> @@ -26,6 +26,17 @@
>  	}									\
>  } while (0)
>  
> +/*
> + * The compiler isn't smart enough to the the count
> + * increment in the high 32 bits of the 64-bit value,
> + * so do this optimization by hand.
> + */
> +#if defined(__LITTLE_ENDIAN) && BITS_PER_LONG == 64
> + #define LOCKREF_INC(n) ((n).lock_count += 1ul<<32)
> +#else
> + #define LOCKREF_INC(n) ((n).count++)
> +#endif
> +
>  #else
>  
>  #define CMPXCHG_LOOP(CODE, SUCCESS) do { } while (0)
> @@ -42,7 +53,7 @@
>  void lockref_get(struct lockref *lockref)
>  {
>  	CMPXCHG_LOOP(
> -		new.count++;
> +		LOCKREF_INC(new);
>  	,
>  		return;
>  	);
> @@ -63,7 +74,7 @@ int lockref_get_not_zero(struct lockref *lockref)
>  	int retval;
>  
>  	CMPXCHG_LOOP(
> -		new.count++;
> +		LOCKREF_INC(new);
>  		if (old.count <= 0)
>  			return 0;
>  	,
> @@ -174,7 +185,7 @@ int lockref_get_not_dead(struct lockref *lockref)
>  	int retval;
>  
>  	CMPXCHG_LOOP(
> -		new.count++;
> +		LOCKREF_INC(new);
>  		if (old.count < 0)
>  			return 0;
>  	,
Linus Torvalds Nov. 26, 2023, 4:51 p.m. UTC | #11
On Sun, 26 Nov 2023 at 08:39, Guo Ren <guoren@kernel.org> wrote:
>
> Here is my optimization advice:
>
> #define CMPXCHG_LOOP(CODE, SUCCESS) do {                                        \
>         int retry = 100;                                                        \
>         struct lockref old;                                                     \
>         BUILD_BUG_ON(sizeof(old) != 8);                                         \
> +       prefetchw(lockref);                                                     \\

No.

We're not adding software prefetches to generic code. Been there, done
that. They *never* improve performance on good hardware. They end up
helping on some random (usually particularly bad) microarchitecture,
and then they hurt everybody else.

And the real optimization advice is: "don't run on crap hardware".

It really is that simple. Good hardware does OoO and sees the future write.

> Micro-arch could give prefetchw more guarantee:

Well, in practice, they never do, and in fact they are often buggy and
cause problems because they weren't actually tested very much.

                 Linus
Guo Ren Nov. 26, 2023, 4:51 p.m. UTC | #12
On Sun, Nov 26, 2023 at 11:39:37AM -0500, Guo Ren wrote:
> On Wed, Nov 22, 2023 at 09:20:53AM -0800, Linus Torvalds wrote:
> > On Tue, 21 Nov 2023 at 23:19, Guo Ren <guoren@kernel.org> wrote:
> > >
> > > We discussed x86 qspinlock code generation. It looked not too bad as I
> > > thought because qspinlock_spin_value_unlocked is much cheaper than the
> > > ticket-lock. But the riscv ticket-lock code generation is terrible
> > > because of the shift left & right 16-bit.
> > > https://lore.kernel.org/all/ZNG2tHFOABSXGCVi@gmail.com
> > 
> > No, it's not the 16-bit shifts in the spin_value_unlocked() check,
> > that just generates simple and straightforward code:
> > 
> >   a0:   0107569b                srlw    a3,a4,0x10
> >   a4:   00c77733                and     a4,a4,a2
> >   a8:   04e69063                bne     a3,a4,e8 <.L12>
> > 
> > (plus two stupid instructions for generating the immediate in a2 for
> > 0xffff, but hey, that's the usual insane RISC-V encoding thing - you
> > can load a 20-bit U-immediate only shifted up by 12, if it's in the
> > lower bits you're kind of screwed and limited to 12-bit immediates).
> > 
> > The *bad* code generation is from the much simpler
> > 
> >         new.count++;
> > 
> > which sadly neither gcc not clang is quite smart enough to understand
> > that "hey, I can do that in 64 bits".
> > 
> > It's incrementing the higher 32-bit word in a 64-bit union, and with a
> > smarter compiler it *should* basically become
> > 
> >         lock_count += 1 << 32;
> > 
> > but the compiler isn't that clever, so it splits the 64-bit word into
> > two 32-bit words, increments one of them, and then merges the two
> > words back into 64 bits:
> > 
> >   98:   4207d693                sra     a3,a5,0x20
> >   9c:   02079713                sll     a4,a5,0x20
> >   a0:   0016869b                addw    a3,a3,1
> >   a4:   02069693                sll     a3,a3,0x20
> >   a8:   02075713                srl     a4,a4,0x20
> >   ac:   00d76733                or      a4,a4,a3
> > 
> > which is pretty sad.
> 9c & a8 is for word-zero-extend; riscv would have zext.w in the future.
> Your patch may improve above with:
> 	li      a4,1
> 	slli    a4,a4,32
> 	add     a4,a5,a4
> 
> v.s.
> 	sra     a3,a5,0x20
> 	zext.w	a4,a5
> 	addw    a3,a3,1
> 	or      a4,a4,a3
> You win one instruction "or a4,a4,a3", which is less than one cycle.
Sorry, I forgot "sll     a3,a3,0x20", so it's 1.5 cycles, but that doesn't
affect my opinion here; local core operations are a lower optimization
priority than memory transactions.

> 
> The zext.w is important, and it could replace sll+srl a lot, so I think
> it's a current ISA design short.
> 
> Here, what I want to improve is to prevent stack frame setup in the fast
> path, and that's the most benefit my patch could give out. Unnecessary
> memory access is the most important performance killer in SMP.
> 
> My patch removes the stack frame setup from the fast path.
> void lockref_get(struct lockref *lockref)
> {
>   78:   00053783                ld      a5,0(a0)
> 000000000000007c <.LBB212>:
>   7c:   00010637                lui     a2,0x10
> 
> 0000000000000080 <.LBE212>:
>   80:   06400593                li      a1,100
> 
> 0000000000000084 <.LBB216>:
>   84:   fff60613                add     a2,a2,-1 # ffff <.LLST8+0xf4aa>
> 
> 0000000000000088 <.L8>:
>   88:   0007871b                sext.w  a4,a5
> 
> 000000000000008c <.LBB217>:
>   8c:   0107d69b                srlw    a3,a5,0x10
>   90:   00c77733                and     a4,a4,a2
>   94:   04e69063                bne     a3,a4,d4 <.L12> --------+
> 						      		|
> 0000000000000098 <.LBB218>:					|
>   98:   4207d693                sra     a3,a5,0x20		|
>   9c:   02079713                sll     a4,a5,0x20		|
>   a0:   0016869b                addw    a3,a3,1			|
>   a4:   02069693                sll     a3,a3,0x20		|
>   a8:   02075713                srl     a4,a4,0x20		|
>   ac:   00d76733                or      a4,a4,a3		|
> 								|
> 00000000000000b0 <.L0^B1>:					|
>   b0:   100536af                lr.d    a3,(a0)			|
>   b4:   00f69863                bne     a3,a5,c4 <.L1^B1>	|
>   b8:   1ae5382f                sc.d.rl a6,a4,(a0)		|
>   bc:   fe081ae3                bnez    a6,b0 <.L0^B1>		|
>   c0:   0330000f                fence   rw,rw			|
> 								|
> 00000000000000c4 <.L1^B1>:					|
>   c4:   04d78a63                beq     a5,a3,118 <.L18>	|
> 								|
> 00000000000000c8 <.LBE228>:					|
>   c8:   fff5859b                addw    a1,a1,-1		|	
> 								|
> 00000000000000cc <.LBB229>:					|
>   cc:   00068793                mv      a5,a3			|
> 								|
> 00000000000000d0 <.LBE229>:					|
>   d0:   fa059ce3                bnez    a1,88 <.L8>		|
> 						     		|
> 00000000000000d4 <.L12>: <--------------------------------------+
> {						      slow_path
>   d4:   fe010113                add     sp,sp,-32
>   d8:   00113c23                sd      ra,24(sp)
>   dc:   00813823                sd      s0,16(sp)
>   e0:   02010413                add     s0,sp,32
> 
> 
> > 
> > If you want to do the optimization that the compiler misses by hand,
> > it would be something like the attached patch.
> > 
> > NOTE! Very untested. But that *should* cause the compiler to just
> > generate a single "add" instruction (in addition to generating the
> > constant 0x100000000, of course).
> > 
> > Of course, on a LL/SC architecture like RISC-V, in an *optimal* world,
> > the whole sequence would actually be done with one single LL/SC,
> > rather than the "load,add,cmpxchg" thing.
> > 
> > But then you'd have to do absolutely everything by hand in assembly.
> No, it's not worth to do that.
>  - There are only atomic primitives in Linux, but no ll/sc primitive in
>    the real world. The world belongs to AMO and the only usage of ll/sc
>    is to implement AMO and CAS.
>  - Using single ll/sc primitive instead of cmpxchg is similar to your
>    patch, and you may win 1 cycle or not.
>  - The critical work here are reducing bus transactions, preventing
>    cache dance, and forward progress guarantee.
> 
> Here is my optimization advice:
> 
> #define CMPXCHG_LOOP(CODE, SUCCESS) do {                                        \
>         int retry = 100;                                                        \
>         struct lockref old;                                                     \
>         BUILD_BUG_ON(sizeof(old) != 8);                                         \
> +       prefetchw(lockref);                                                     \
>         old.lock_count = READ_ONCE(lockref->lock_count);                        \
>         while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {     \
>                 struct lockref new = old;                                       \
>                 CODE                                                            \
>                 if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,          \
>                                                  &old.lock_count,               \
>                                                  new.lock_count))) {            \
>                         SUCCESS;                                                \
>                 }                                                               \
> 
> Micro-arch could give prefetchw more guarantee:
>  - Prefetch.w must guarantee cache line exclusiveness even when a
>    shareable state cache line hits.
>  - Hold the exclusive cache line for several cycles until the next
>    store or timeout
>  - Mask interrupt during the holding cycles (Optional)
> 
> The lockref slow path is killed in this micro-architecture, which
> means there is no chance to execute the spinlock.
> 
> I've written down more details in my ppt:
> https://docs.google.com/presentation/d/1UudBcj4cL_cjJexMpZNF9ppRzYxeYqsdBotIvU7sO2Q/edit?usp=sharing
> 
> This type of prefetchw could help large-size atomic operations within
> one cache line. Compared to the transaction memory model, prefetchw
> could give a forward progress guarantee and easier landing in Linux
> without any new primitive.
> 
> > 
> >                   Linus
> 
> >  lib/lockref.c | 17 ++++++++++++++---
> >  1 file changed, 14 insertions(+), 3 deletions(-)
> > 
> > diff --git a/lib/lockref.c b/lib/lockref.c
> > index 2afe4c5d8919..481b102a6476 100644
> > --- a/lib/lockref.c
> > +++ b/lib/lockref.c
> > @@ -26,6 +26,17 @@
> >  	}									\
> >  } while (0)
> >  
> > +/*
> > + * The compiler isn't smart enough to the the count
> > + * increment in the high 32 bits of the 64-bit value,
> > + * so do this optimization by hand.
> > + */
> > +#if defined(__LITTLE_ENDIAN) && BITS_PER_LONG == 64
> > + #define LOCKREF_INC(n) ((n).lock_count += 1ul<<32)
> > +#else
> > + #define LOCKREF_INC(n) ((n).count++)
> > +#endif
> > +
> >  #else
> >  
> >  #define CMPXCHG_LOOP(CODE, SUCCESS) do { } while (0)
> > @@ -42,7 +53,7 @@
> >  void lockref_get(struct lockref *lockref)
> >  {
> >  	CMPXCHG_LOOP(
> > -		new.count++;
> > +		LOCKREF_INC(new);
> >  	,
> >  		return;
> >  	);
> > @@ -63,7 +74,7 @@ int lockref_get_not_zero(struct lockref *lockref)
> >  	int retval;
> >  
> >  	CMPXCHG_LOOP(
> > -		new.count++;
> > +		LOCKREF_INC(new);
> >  		if (old.count <= 0)
> >  			return 0;
> >  	,
> > @@ -174,7 +185,7 @@ int lockref_get_not_dead(struct lockref *lockref)
> >  	int retval;
> >  
> >  	CMPXCHG_LOOP(
> > -		new.count++;
> > +		LOCKREF_INC(new);
> >  		if (old.count < 0)
> >  			return 0;
> >  	,
> 
>
Linus Torvalds Nov. 26, 2023, 5:06 p.m. UTC | #13
On Sun, 26 Nov 2023 at 08:39, Guo Ren <guoren@kernel.org> wrote:
>
> Here, what I want to improve is to prevent stack frame setup in the fast
> path, and that's the most benefit my patch could give out.

Side note: what patch do you have that avoids the stack frame setup?
Because I still saw the stack frame even with the
arch_spin_value_unlocked() fix and the improved code generation. The
compiler still does

        addi    sp,sp,-32
        sd      s0,16(sp)
        sd      s1,8(sp)
        sd      ra,24(sp)
        addi    s0,sp,32

at the top of the function for me - not because of the (now fixed)
lock value spilling, but just because it wants to save registers.

The reason seems to be that gcc isn't smart enough to delay the frame
setup to the slow path where it then has to do the actual spinlock, so
it has to generate a stack frame just for the return address and then
it does the whole frame setup thing.

I was using just the risc-v defconfig (with the cmpxchg lockrefs
enabled, and spinlock debugging disabled so that lockrefs actually do
something), so there might be some other config thing like "force
frame pointers" that then causes problems.

But while the current tree avoids the silly lock value spill and
reload, and my patch improved the integer instruction selection, I
really couldn't get rid of the stack frame entirely. The x86 code also
ends up looking quite nice, although part of that is that the
qspinlock test is a simple compare against zero:

  lockref_get:
        pushq   %rbx
        movq    %rdi, %rbx
        movq    (%rdi), %rax
        movl    $-100, %ecx
        movabsq $4294967296, %rdx
  .LBB0_1:
        testl   %eax, %eax
        jne     .LBB0_4
        leaq    (%rax,%rdx), %rsi
        lock
        cmpxchgq        %rsi, (%rbx)
        je      .LBB0_5
        incl    %ecx
        jne     .LBB0_1
  .LBB0_4:
        movq    %rbx, %rdi
        callq   _raw_spin_lock
        incl    4(%rbx)
        movb    $0, (%rbx)
  .LBB0_5:
        popq    %rbx
        retq

(That 'movabsq' thing is what generates the big constant that adds '1'
in the upper word - that add is then done as a 'leaq').

In this case, the 'retry' count is actually a noticeable part of the
code generation, and is probably also why it has to save/restore
'%rbx'. Oh well. We limited the cmpxchg loop because of horrible
issues with starvation on bad arm64 cores.  It turns out that SMP
cacheline bouncing is hard, and if you haven't been doing it for a
couple of decades, you'll do it wrong.

You'll find out the hard way that the same is probably true on any
early RISC-V SMP setups. You wanting to use prefetchw is a pretty
clear indication of the same kind of thing.

             Linus
Linus Torvalds Nov. 26, 2023, 5:59 p.m. UTC | #14
On Sun, 26 Nov 2023 at 09:06, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> In this case, the 'retry' count is actually a noticeable part of the
> code generation, and is probably also why it has to save/restore
> '%rbx'.

Nope. The reason for having to save/restore a register is the

        spin_lock(&lockref->lock);
        lockref->count++;

sequence: since spin_lock() is a function call, it will clobber all
the registers that a function can clobber, and the callee has to keep
the 'lockref' argument somewhere. So it needs a callee-saved register,
which it then itself needs to save.

Inlining the spinlock sequence entirely would fix it, but is the wrong
thing to do for the slow case.

Marking the spinlock functions with

  __attribute__((no_caller_saved_registers))

might actually be a reasonable option. It makes the spinlock itself
more expensive (since now it saves/restores all the registers it
uses), but in this case that's the right thing to do.
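
(A sketch of what that marking would look like - the attribute is an
x86-only gcc/clang extension and nothing in the kernel does this today;
_raw_spin_lock() is just the out-of-line entry point seen in the
disassembly above, used here for illustration:)

  /*
   * the callee saves/restores every register it touches, so callers no
   * longer have to treat the call as clobbering the caller-saved set
   */
  __attribute__((no_caller_saved_registers))
  void _raw_spin_lock(raw_spinlock_t *lock);

With that, lockref_get() above wouldn't need to keep the lockref pointer
in a callee-saved register across the slow-path call, and the push/pop
of %rbx would disappear.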

Of course, in this case, lockref has already done the optimistic
"check the lock" version, so our spinlock code that does that

        LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);

which first tries to do the trylock, is all kinds of wrong.

In a perfect world, the lockref code actually wants only the
slow-path, since it has already done the fast-path case. And it would
have that "slow path saves all registers" thing. That might be a good
idea for spinlocks in general, who knows..

Oh well. Probably not worth worrying about. In my profiles, lockref
looks pretty good even under heavy dentry load. Even if it's not
perfect.

                 Linus
Guo Ren Nov. 29, 2023, 7:14 a.m. UTC | #15
On Wed, Nov 22, 2023 at 11:11:38AM -0800, Linus Torvalds wrote:
> On Wed, 22 Nov 2023 at 09:52, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Still not actually tested, but the code generation on x86 looks
> > reasonable, so it migth be worth looking at whether it helps the
> > RISC-V case.
> 
> Doing some more munging, and actually looking at RISC-V code
> generation too (I obviously had to enable ARCH_USE_CMPXCHG_LOCKREF for
> RISC-V).
> 
> I get this:
> 
>   lockref_get:
>         addi    sp,sp,-32
>         sd      s0,16(sp)
>         sd      s1,8(sp)
>         sd      ra,24(sp)
>         addi    s0,sp,32
>         li      a1,65536
>         ld      a5,0(a0)
>         mv      s1,a0
>         addi    a1,a1,-1
>         li      a0,100
>   .L43:
>         sext.w  a3,a5
>         li      a4,1
>         srliw   a2,a5,16
>         and     a3,a3,a1
>         slli    a4,a4,32
>         bne     a2,a3,.L49
>         add     a4,a5,a4
>   0:
>         lr.d a3, 0(s1)
>         bne a3, a5, 1f
>         sc.d.rl a2, a4, 0(s1)
>         bnez a2, 0b
>         fence rw, rw
>   1:
>         bne     a5,a3,.L52
>         ld      ra,24(sp)
>         ld      s0,16(sp)
>         ld      s1,8(sp)
>         addi    sp,sp,32
>         jr      ra
>   ...
> 
> so now that single update is indeed just one single instruction:
> 
>         add     a4,a5,a4
> 
> is that "increment count in the high 32 bits".
> 
> The ticket lock being unlocked checks are those
> 
>         li      a1,65536
>         sext.w  a3,a5
>         srliw   a2,a5,16
>         and     a3,a3,a1
>         bne     a2,a3,.L49
> 
> instructions if I read it right.
> 
> That actually looks fairly close to optimal, although the frame setup
> is kind of sad.
> 
> (The above does not include the "loop if the cmpxchg failed" part of
> the code generation)
> 
> Anyway, apart from enabling LOCKREF, the patch to get this for RISC-V
> is attached.
> 
> I'm not going to play with this any more, but you might want to check
> whether this actually does work on RISC-V.
> 
> Becaue I only looked at the code generation, I didn't actually look at
> whether it *worked*.
> 
>                 Linus

> From 168f35850c15468941e597907e33daacd179d54a Mon Sep 17 00:00:00 2001
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Wed, 22 Nov 2023 09:33:29 -0800
> Subject: [PATCH] lockref: improve code generation for ref updates
> 
> Our lockref data structure is two 32-bit words laid out next to each
> other, combining the spinlock and the count into one entity that can be
> accessed atomically together.
> 
> In particular, the structure is laid out so that the count is the upper
> 32 bit word (on little-endian), so that you can do basic arithmetic on
> the count in 64 bits: instead of adding one to the 32-bit word, you can
> just add a value shifted by 32 to the full 64-bit word.
> 
> Sadly, neither gcc nor clang are quite clever enough to work that out on
> their own, so this does that "manually".
> 
> Also, try to do any compares against zero values, which generally
> improves the code generation.  So rather than check that the value was
> at least 1 before a decrement, check that it's positive or zero after
> the decrement.  We don't worry about the overflow point in lockrefs.
Tested-by: Guo Ren <guoren@kernel.org>

This patch saves 3 ALU instructions on riscv.

Before the patch:
000000000000020c <lockref_get>:
        CMPXCHG_LOOP(
 20c:   611c                    ld      a5,0(a0)

000000000000020e <.LBB492>:
 20e:   03079713                sll     a4,a5,0x30
 212:   0107d69b                srlw    a3,a5,0x10
 216:   9341                    srl     a4,a4,0x30
 218:   02e69663                bne     a3,a4,244 <.L40>

000000000000021c <.LBB494>:
 21c:   4207d693                sra     a3,a5,0x20    -------+
 220:   02079713                sll     a4,a5,0x20	     |
 224:   2685                    addw    a3,a3,1		     |
 226:   1682                    sll     a3,a3,0x20	     |
 228:   9301                    srl     a4,a4,0x20	     |
 22a:   8f55                    or      a4,a4,a3      -------+

000000000000022c <.L0^B4>:
 22c:   100536af                lr.d    a3,(a0)
 230:   00f69763                bne     a3,a5,23e <.L1^B5>
 234:   1ae5362f                sc.d.rl a2,a4,(a0)
 238:   fa75                    bnez    a2,22c <.L0^B4>
 23a:   0330000f                fence   rw,rw

After the patch:
000000000000020c <lockref_get>:
        CMPXCHG_LOOP(
 20c:   611c                    ld      a5,0(a0)

000000000000020e <.LBB526>:
 20e:   03079713                sll     a4,a5,0x30
 212:   0107d69b                srlw    a3,a5,0x10
 216:   9341                    srl     a4,a4,0x30
 218:   02e69163                bne     a3,a4,23a <.L40>

000000000000021c <.LBB528>:
 21c:   4705                    li      a4,1		------+
 21e:   1702                    sll     a4,a4,0x20	      |
 220:   973e                    add     a4,a4,a5	------+

0000000000000222 <.L0^B4>:
 222:   100536af                lr.d    a3,(a0)
 226:   00f69763                bne     a3,a5,234 <.L1^B5>
 22a:   1ae5362f                sc.d.rl a2,a4,(a0)
 22e:   fa75                    bnez    a2,222 <.L0^B4>
 230:   0330000f                fence   rw,rw

> 
> Cc: Guo Ren <guoren@kernel.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> ---
>  lib/lockref.c | 29 ++++++++++++++++++++---------
>  1 file changed, 20 insertions(+), 9 deletions(-)
> 
> diff --git a/lib/lockref.c b/lib/lockref.c
> index 2afe4c5d8919..f3c30c538af1 100644
> --- a/lib/lockref.c
> +++ b/lib/lockref.c
> @@ -26,6 +26,17 @@
>  	}									\
>  } while (0)
>  
> +/*
> + * The compiler isn't smart enough to the the count
> + * increment in the high 32 bits of the 64-bit value,
> + * so do this optimization by hand.
> + */
> +#if defined(__LITTLE_ENDIAN) && BITS_PER_LONG == 64
> + #define LOCKREF_ADD(n,x) ((n).lock_count += (unsigned long)(x)<<32)
> +#else
> + #define LOCKREF_ADD(n,x) ((n).count += (unsigned long)(x)<<32)
> +#endif
> +
>  #else
>  
>  #define CMPXCHG_LOOP(CODE, SUCCESS) do { } while (0)
> @@ -42,7 +53,7 @@
>  void lockref_get(struct lockref *lockref)
>  {
>  	CMPXCHG_LOOP(
> -		new.count++;
> +		LOCKREF_ADD(new,1);
>  	,
>  		return;
>  	);
> @@ -63,9 +74,9 @@ int lockref_get_not_zero(struct lockref *lockref)
>  	int retval;
>  
>  	CMPXCHG_LOOP(
> -		new.count++;
>  		if (old.count <= 0)
>  			return 0;
> +		LOCKREF_ADD(new,1);
>  	,
>  		return 1;
>  	);
> @@ -91,8 +102,8 @@ int lockref_put_not_zero(struct lockref *lockref)
>  	int retval;
>  
>  	CMPXCHG_LOOP(
> -		new.count--;
> -		if (old.count <= 1)
> +		LOCKREF_ADD(new,-1);
> +		if (new.count <= 0)
>  			return 0;
>  	,
>  		return 1;
> @@ -119,8 +130,8 @@ EXPORT_SYMBOL(lockref_put_not_zero);
>  int lockref_put_return(struct lockref *lockref)
>  {
>  	CMPXCHG_LOOP(
> -		new.count--;
> -		if (old.count <= 0)
> +		LOCKREF_ADD(new,-1);
> +		if (new.count < 0)
>  			return -1;
>  	,
>  		return new.count;
> @@ -137,8 +148,8 @@ EXPORT_SYMBOL(lockref_put_return);
>  int lockref_put_or_lock(struct lockref *lockref)
>  {
>  	CMPXCHG_LOOP(
> -		new.count--;
> -		if (old.count <= 1)
> +		LOCKREF_ADD(new,-1);
> +		if (new.count <= 0)
>  			break;
>  	,
>  		return 1;
> @@ -174,9 +185,9 @@ int lockref_get_not_dead(struct lockref *lockref)
>  	int retval;
>  
>  	CMPXCHG_LOOP(
> -		new.count++;
>  		if (old.count < 0)
>  			return 0;
> +		LOCKREF_ADD(new,1);
>  	,
>  		return 1;
>  	);
> -- 
> 2.43.0.5.g38fb137bdb
>
Guo Ren Nov. 29, 2023, 9:52 a.m. UTC | #16
On Sun, Nov 26, 2023 at 09:06:03AM -0800, Linus Torvalds wrote:
> On Sun, 26 Nov 2023 at 08:39, Guo Ren <guoren@kernel.org> wrote:
> >
> > Here, what I want to improve is to prevent stack frame setup in the fast
> > path, and that's the most benefit my patch could give out.
> 
> Side note: what patch do you have that avoids the stack frame setup?
> Because I still saw the stack frame even with the
> arch_spin_value_unlocked() fix and the improved code generation. The
> compiler still does
> 
>         addi    sp,sp,-32
>         sd      s0,16(sp)
>         sd      s1,8(sp)
>         sd      ra,24(sp)
>         addi    s0,sp,32
I found that the change below affects you:

 #define CMPXCHG_LOOP(CODE, SUCCESS) do {                                       \
-       int retry = 100;                                                        \
        struct lockref old;                                                     \
        BUILD_BUG_ON(sizeof(old) != 8);                                         \
        old.lock_count = READ_ONCE(lockref->lock_count);                        \
@@ -21,11 +20,21 @@
                                                 new.lock_count))) {            \
                        SUCCESS;                                                \
                }                                                               \
-               if (!--retry)                                                   \
-                       break;                                                  \

Yes, the 'retry' count patch [1] hurts us (a sketch of the effect follows the link below).

[1]: https://lore.kernel.org/lkml/20190607072652.GA5522@hc/T/#m091df9dca68c27c28f8f69a72cae0e1361dba4fa
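
As a rough sketch of why (this is not the actual lib/lockref.c macro, just an open-coded approximation of its fast path): the retry counter stays live across the cmpxchg, so the compiler keeps one more value in a callee-saved register and sets up a stack frame even when the first cmpxchg succeeds.

#include <linux/lockref.h>
#include <linux/atomic.h>
#include <linux/spinlock.h>

/* illustrative helper, not the real CMPXCHG_LOOP */
static inline bool lockref_inc_fast(struct lockref *lockref)
{
	struct lockref old, new;
	int retry = 100;	/* extra value live across the cmpxchg */

	old.lock_count = READ_ONCE(lockref->lock_count);
	while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {
		new = old;
		new.count++;
		if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,
						 &old.lock_count,
						 new.lock_count)))
			return true;		/* fast path done */
		if (!--retry)			/* dropping this frees the register */
			break;
		cpu_relax();
	}
	return false;				/* fall back to the spinlock path */
}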

> 
> at the top of the function for me - not because of the (now fixed)
> lock value spilling, but just because it wants to save registers.
> 
> The reason seems to be that gcc isn't smart enough to delay the frame
> setup to the slow path where it then has to do the actual spinlock, so
> it has to generate a stack frame just for the return address and then
> it does the whole frame setup thing.
> 
> I was using just the risc-v defconfig (with the cmpxchg lockrefs
> enabled, and spinlock debugging disabled so that lockrefs actually do
> something), so there might be some other config thing like "force
> frame pointers" that then causes problems.
> 
> But while the current tree avoids the silly lock value spill and
> reload, and my patch improved the integer instruction selection, I
> really couldn't get rid of the stack frame entirely. The x86 code also
> ends up looking quite nice, although part of that is that the
> qspinlock test is a simple compare against zero:
> 
>   lockref_get:
>         pushq   %rbx
>         movq    %rdi, %rbx
>         movq    (%rdi), %rax
>         movl    $-100, %ecx
>         movabsq $4294967296, %rdx
>   .LBB0_1:
>         testl   %eax, %eax
>         jne     .LBB0_4
>         leaq    (%rax,%rdx), %rsi
>         lock
>         cmpxchgq        %rsi, (%rbx)
>         je      .LBB0_5
>         incl    %ecx
>         jne     .LBB0_1
>   .LBB0_4:
>         movq    %rbx, %rdi
>         callq   _raw_spin_lock
>         incl    4(%rbx)
>         movb    $0, (%rbx)
>   .LBB0_5:
>         popq    %rbx
>         retq
> 
> (That 'movabsq' thing is what generates the big constant that adds '1'
> in the upper word - that add is then done as a 'leaq').
> 
> In this case, the 'retry' count is actually a noticeable part of the
> code generation, and is probably also why it has to save/restore
> '%rbx'. Oh well. We limited the cmpxchg loop because of horrible
> issues with starvation on bad arm64 cores.  It turns out that SMP
> cacheline bouncing is hard, and if you haven't been doing it for a
> couple of decades, you'll do it wrong.
> 
> You'll find out the hard way that the same is probably true on any
> early RISC-V SMP setups. You wanting to use prefetchw is a pretty
> clear indication of the same kind of thing.

The 'retry' count is a bad solution that hides the problem. ThunderX2's
problem is mainly unnecessary cpu_relax and too little cacheline stickiness.
The AMBA 5 CHI spec's "Home behavior" section says: [2]
"When a Home(CIU/LLcache) determines that an Exclusive Store transaction
has failed, the following rules must be followed: If the Requester has
lost the cache line, then the Home is expected to send SnpPreferUniqueFwd
or SnpPreferUnique to get a copy of the cache line."
SnpPreferUnique is not SnpUnique, which means it may return a shared
cacheline under serious contention; there is no guarantee for the next cmpxchg.

But we want a unique cache line, right? You said: [1]
"... And once one CPU gets ownership of the line, it doesn't lose it
immediately, so the next cmpxchg will *succeed*.
So at most, the *first* cmpxchg will fail (because that's the one that
was fed not by a previous cmpxchg, but by a regular load (which we'd
*like* to do as a "load-for-ownership" load, but we don't have the
interfaces to do that). But the second cmpxchg should basically always
succeed, ..."
(Sorry, I quoted you like this.)

My argument is:
why do we need to wait for a cmpxchg failure? We already have the
"load-for-ownership" interface: "prefetchw"!

   lockref_get:
         pushq   %rbx
  +------prefetchw (%rdi)    --------> doesn't lose it immediately,
st|				so the next cmpxchg will *succeed*
ic|						- Linus
ky|      movq    %rdi, %rbx
 t|      movq    (%rdi), %rax  ------> local acquire, comfortable!
im|      movl    $-100, %ecx
e |      movabsq $4294967296, %rdx
  |.LBB0_1:
  |      testl   %eax, %eax
  |      jne     .LBB0_4
  |      leaq    (%rax,%rdx), %rsi
  |      lock
  |      cmpxchgq        %rsi, (%rbx) --> local cas, success!
  +----- je      .LBB0_5          ------> Farewell to the slowpath!

If x86 is not a crap machine, "movq+movq+movl+movabsq+testl+jne+leaq+
cmpxchg" should be fast enough to fit within the sticky time.

The prefetchw primitive has been defined in include/linux/prefetch.h
for many years.

The prefetchw has been used in generic code:
➜  linux git:(master) ✗ grep prefetchw mm/ fs/ kernel/ -r
mm/slub.c:      prefetchw(object + s->offset);
mm/slab.c:      prefetchw(objp);
mm/page_alloc.c:        prefetchw(p);
mm/page_alloc.c:                prefetchw(p + 1);
mm/vmscan.c:#define prefetchw_prev_lru_folio(_folio, _base, _field)                     \
mm/vmscan.c:                    prefetchw(&prev->_field);                       \
mm/vmscan.c:#define prefetchw_prev_lru_folio(_folio, _base, _field) do { } while (0)
mm/vmscan.c:            prefetchw_prev_lru_folio(folio, src, flags);
fs/mpage.c:             prefetchw(&folio->flags);
fs/f2fs/data.c:                 prefetchw(&page->flags);
fs/ext4/readpage.c:             prefetchw(&folio->flags);
kernel/bpf/cpumap.c:                    prefetchw(page);
kernel/locking/qspinlock.c:                     prefetchw(next);
➜  linux git:(master) ✗ grep prefetchw drivers/ -r | wc -l
80

The prefetchw is fine on all good hardware, unlike the 'retry' workaround.

[1]: https://lore.kernel.org/lkml/CAHk-=wiEahkwDXpoy=-SzJHNMRXKVSjPa870+eKKenufhO_Hgw@mail.gmail.com/raw
[2]: https://kolegite.com/EE_library/datasheets_and_manuals/FPGA/AMBA/IHI0050E_a_amba_5_chi_architecture_spec.pdf

> 
>              Linus
>
Guo Ren Nov. 29, 2023, 12:25 p.m. UTC | #17
On Wed, Nov 22, 2023 at 11:11:38AM -0800, Linus Torvalds wrote:
> On Wed, 22 Nov 2023 at 09:52, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Still not actually tested, but the code generation on x86 looks
> > reasonable, so it migth be worth looking at whether it helps the
> > RISC-V case.
> 
> Doing some more munging, and actually looking at RISC-V code
> generation too (I obviously had to enable ARCH_USE_CMPXCHG_LOCKREF for
> RISC-V).
> 
> I get this:
> 
>   lockref_get:
>         addi    sp,sp,-32
>         sd      s0,16(sp)
>         sd      s1,8(sp)
>         sd      ra,24(sp)
>         addi    s0,sp,32
>         li      a1,65536
>         ld      a5,0(a0)
>         mv      s1,a0
>         addi    a1,a1,-1
>         li      a0,100
>   .L43:
>         sext.w  a3,a5
>         li      a4,1
>         srliw   a2,a5,16
>         and     a3,a3,a1
>         slli    a4,a4,32
>         bne     a2,a3,.L49
>         add     a4,a5,a4
>   0:
>         lr.d a3, 0(s1)
>         bne a3, a5, 1f
>         sc.d.rl a2, a4, 0(s1)
>         bnez a2, 0b
>         fence rw, rw
>   1:
>         bne     a5,a3,.L52
>         ld      ra,24(sp)
>         ld      s0,16(sp)
>         ld      s1,8(sp)
>         addi    sp,sp,32
>         jr      ra
>   ...
> 
> so now that single update is indeed just one single instruction:
> 
>         add     a4,a5,a4
> 
> is that "increment count in the high 32 bits".
> 
> The ticket lock being unlocked checks are those
> 
>         li      a1,65536
>         sext.w  a3,a5
>         srliw   a2,a5,16
>         and     a3,a3,a1
>         bne     a2,a3,.L49
> 
> instructions if I read it right.
> 
> That actually looks fairly close to optimal, although the frame setup
> is kind of sad.
> 
> (The above does not include the "loop if the cmpxchg failed" part of
> the code generation)
> 
> Anyway, apart from enabling LOCKREF, the patch to get this for RISC-V
> is attached.
> 
> I'm not going to play with this any more, but you might want to check
> whether this actually does work on RISC-V.
> 
> Becaue I only looked at the code generation, I didn't actually look at
> whether it *worked*.
> 
>                 Linus

> From 168f35850c15468941e597907e33daacd179d54a Mon Sep 17 00:00:00 2001
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Wed, 22 Nov 2023 09:33:29 -0800
> Subject: [PATCH] lockref: improve code generation for ref updates
> 
> Our lockref data structure is two 32-bit words laid out next to each
> other, combining the spinlock and the count into one entity that can be
> accessed atomically together.
> 
> In particular, the structure is laid out so that the count is the upper
> 32 bit word (on little-endian), so that you can do basic arithmetic on
> the count in 64 bits: instead of adding one to the 32-bit word, you can
> just add a value shifted by 32 to the full 64-bit word.
> 
> Sadly, neither gcc nor clang are quite clever enough to work that out on
> their own, so this does that "manually".
> 
> Also, try to do any compares against zero values, which generally
> improves the code generation.  So rather than check that the value was
> at least 1 before a decrement, check that it's positive or zero after
> the decrement.  We don't worry about the overflow point in lockrefs.
> 
> Cc: Guo Ren <guoren@kernel.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> ---
>  lib/lockref.c | 29 ++++++++++++++++++++---------
>  1 file changed, 20 insertions(+), 9 deletions(-)
> 
> diff --git a/lib/lockref.c b/lib/lockref.c
> index 2afe4c5d8919..f3c30c538af1 100644
> --- a/lib/lockref.c
> +++ b/lib/lockref.c
> @@ -26,6 +26,17 @@
>  	}									\
>  } while (0)
>  
> +/*
> + * The compiler isn't smart enough to do the count
> + * increment in the high 32 bits of the 64-bit value,
> + * so do this optimization by hand.
> + */
> +#if defined(__LITTLE_ENDIAN) && BITS_PER_LONG == 64
> + #define LOCKREF_ADD(n,x) ((n).lock_count += (unsigned long)(x)<<32)
> +#else
> + #define LOCKREF_ADD(n,x) ((n).count += (unsigned long)(x)<<32)
#define LOCKREF_ADD(n,x) ((n).count += (unsigned long)(x))
?

> +#endif
> +
>  #else
>  
>  #define CMPXCHG_LOOP(CODE, SUCCESS) do { } while (0)
> @@ -42,7 +53,7 @@
>  void lockref_get(struct lockref *lockref)
>  {
>  	CMPXCHG_LOOP(
> -		new.count++;
> +		LOCKREF_ADD(new,1);
>  	,
>  		return;
>  	);
> @@ -63,9 +74,9 @@ int lockref_get_not_zero(struct lockref *lockref)
>  	int retval;
>  
>  	CMPXCHG_LOOP(
> -		new.count++;
>  		if (old.count <= 0)
>  			return 0;
> +		LOCKREF_ADD(new,1);
>  	,
>  		return 1;
>  	);
> @@ -91,8 +102,8 @@ int lockref_put_not_zero(struct lockref *lockref)
>  	int retval;
>  
>  	CMPXCHG_LOOP(
> -		new.count--;
> -		if (old.count <= 1)
> +		LOCKREF_ADD(new,-1);
> +		if (new.count <= 0)
>  			return 0;
>  	,
>  		return 1;
> @@ -119,8 +130,8 @@ EXPORT_SYMBOL(lockref_put_not_zero);
>  int lockref_put_return(struct lockref *lockref)
>  {
>  	CMPXCHG_LOOP(
> -		new.count--;
> -		if (old.count <= 0)
> +		LOCKREF_ADD(new,-1);
> +		if (new.count < 0)
>  			return -1;
>  	,
>  		return new.count;
> @@ -137,8 +148,8 @@ EXPORT_SYMBOL(lockref_put_return);
>  int lockref_put_or_lock(struct lockref *lockref)
>  {
>  	CMPXCHG_LOOP(
> -		new.count--;
> -		if (old.count <= 1)
> +		LOCKREF_ADD(new,-1);
> +		if (new.count <= 0)
>  			break;
>  	,
>  		return 1;
> @@ -174,9 +185,9 @@ int lockref_get_not_dead(struct lockref *lockref)
>  	int retval;
>  
>  	CMPXCHG_LOOP(
> -		new.count++;
>  		if (old.count < 0)
>  			return 0;
> +		LOCKREF_ADD(new,1);
>  	,
>  		return 1;
>  	);
> -- 
> 2.43.0.5.g38fb137bdb
>
Linus Torvalds Nov. 29, 2023, 2:42 p.m. UTC | #18
On Wed, 29 Nov 2023 at 04:25, Guo Ren <guoren@kernel.org> wrote:
>
> > +#if defined(__LITTLE_ENDIAN) && BITS_PER_LONG == 64
> > + #define LOCKREF_ADD(n,x) ((n).lock_count += (unsigned long)(x)<<32)
> > +#else
> > + #define LOCKREF_ADD(n,x) ((n).count += (unsigned long)(x)<<32)
> #define LOCKREF_ADD(n,x) ((n).count += (unsigned long)(x))
> ?

Yes. I obviously only tested the little-endian case, and the BE case
was a bit too much cut-and-paste..
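
Presumably the fixed-up pair would look something like the sketch below
(based on the suggestion above; untested, the big-endian side just adjusts
the 32-bit count directly):

#if defined(__LITTLE_ENDIAN) && BITS_PER_LONG == 64
 #define LOCKREF_ADD(n,x) ((n).lock_count += (unsigned long)(x)<<32)
#else
 #define LOCKREF_ADD(n,x) ((n).count += (x))
#endif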

             Linus
Guo Ren Nov. 30, 2023, 10 a.m. UTC | #19
On Sun, Nov 26, 2023 at 08:51:35AM -0800, Linus Torvalds wrote:
> On Sun, 26 Nov 2023 at 08:39, Guo Ren <guoren@kernel.org> wrote:
> >
> > Here is my optimization advice:
> >
> > #define CMPXCHG_LOOP(CODE, SUCCESS) do {                                        \
> >         int retry = 100;                                                        \
> >         struct lockref old;                                                     \
> >         BUILD_BUG_ON(sizeof(old) != 8);                                         \
> > +       prefetchw(lockref);                                                     \\
> 
> No.
> 
> We're not adding software prefetches to generic code. Been there, done
> that. They *never* improve performance on good hardware. They end up
> helping on some random (usually particularly bad) microarchitecture,
> and then they hurt everybody else.
> 
> And the real optimization advice is: "don't run on crap hardware".
> 
> It really is that simple. Good hardware does OoO and sees the future write.
That needs the expensive DynAMO mechanism [1], which some power-efficient
cores lack. Yes, powerful OoO hardware could effectively satisfy you with
a minimal number of retries, but why couldn't we explicitly tell the
hardware with "prefetchw"?

Advanced hardware would treat cmpxchg as an interconnect transaction on a
cache miss (far atomic), which means the L3 cache wouldn't return a unique
cacheline even when the cmpxchg fails. The cmpxchg loop would keep reading
data bypassing the L1/L2 cache, which means every failed cmpxchg is a
cache-miss read. Because of the "new.count++"/CODE data dependency, each
cmpxchg request must wait for the previous one to finish. This creates a
gap between cmpxchg requests, which makes most CPUs' cmpxchgs keep failing
under serious contention.

   cas: Compare-And-Swap

   L1&L2          L3 cache
 +------+       +-----------
 | CPU1 | wait  |
 | cas2 |------>| CPU1_cas1 --+
 +------+       |             |
 +------+       |             |
 | CPU2 | wait  |             |
 | cas2 |------>| CPU2_cas1 --+--> If queued with CPU1_cas1 CPU2_cas1
 +------+       |             |    CPU3_cas1, and most of CPUs would
 +------+       |             |    fail and retry.
 | CPU3 | wait  |             |
 | cas2 |------>| CPU3_cas1---+
 +------+       +----------

The entire system moves forward inefficiently:
 - A large number of invalid read requests CPU->L3
 - High power consumption
 - Poor performance

But "far atomic" is suitable for scenarios where contention is not
particularly serious, so it is reasonable to let the software give a
hint. That hint is "prefetchw" (a minimal sketch follows the reference
below):
 - The prefetchw is preparation for a "load + cmpxchg" loop.
 - The prefetchw is not for a single AMO, CAS, or store.

[1] https://dl.acm.org/doi/10.1145/3579371.3589065
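
A minimal sketch of the pattern being argued for, using a plain atomic
counter instead of the real lockref (the helper name is made up;
prefetchw() is only a hint and the loop is correct without it):

#include <linux/atomic.h>
#include <linux/prefetch.h>

/* hypothetical helper: hint for ownership, then run the usual CAS loop */
static inline void counter_inc_hinted(atomic_t *cnt)
{
	int old;

	prefetchw(cnt);		/* ask for the line in a writable state up front */
	old = atomic_read(cnt);
	do {
		/* the new value is recomputed from 'old' on each retry */
	} while (!atomic_try_cmpxchg(cnt, &old, old + 1));
}

Whether that hint actually helps (or hurts) is exactly the hardware
question being debated in this thread.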

> 
> > Micro-arch could give prefetchw more guarantee:
> 
> Well, in practice, they never do, and in fact they are often buggy and
> cause problems because they weren't actually tested very much.
> 
>                  Linus
>
Linus Torvalds Dec. 1, 2023, 1:09 a.m. UTC | #20
On Thu, 30 Nov 2023 at 19:01, Guo Ren <guoren@kernel.org> wrote:
>
> That needs the expensive DynAMO mechanism [1], which some power-efficient
> cores lack. Yes, powerful OoO hardware could effectively satisfy you with
> a minimal number of retries, but why couldn't we explicitly tell the
> hardware with "prefetchw"?

Because every single time we've had a prefetch in the kernel, it has
caused problems. A bit like cpu_relax() - these things get added for
random hardware where it helps, and then a few years later it turns
out that it hurts almost everywhere else.

We've had particular problems with 'prefetch' because it turns out
that (a) nobody sane uses them so (b) hardware is often buggy. And
here "buggy" may be just performance (ie "prefetch actually stalls on
TLB lookup" etc broken behavior that means that prefetch is not even
remotely like a no-op that just hints to the cache subsystem), but
sometimes even in actual semantics (ie "prefetch causes spurious
faulting behavior")

> Advanced hardware would treat cmpxchg as an interconnect transaction on a
> cache miss (far atomic), which means the L3 cache wouldn't return a unique
> cacheline even when the cmpxchg fails. The cmpxchg loop would keep reading
> data bypassing the L1/L2 cache, which means every failed cmpxchg is a
> cache-miss read.

Honestly, I wouldn't call that "advanced hardware". I would call that
ridiculous.

If the cmpxchg isn't guaranteed to make progress, then the cmpxchg is
broken. It's really that simple.

It does sound like on your hardware, maybe you just want to make the
RISC-V cmpxchg function always do a "prefetchw" if the 'sc.d' fails,
something like

                        "0:     lr.w %0, %2\n"                          \
                        "       bne  %0, %z3, 1f\n"                     \
                        "       sc.w %1, %z4, %2\n"                     \
-                       "       bnez %1, 0b\n"                          \
+                       "       beqz %1, 1f\n"                          \
+                       "       prefetchw %2\n"                         \
+                       "       j 0b\n"                                 \
                        "1:\n"                                          \

(quick entirely untested hack, you get the idea). A better
implementation might use "asm goto" and expose the different error
cases to the compiler so that it can move things around, but I'm not
convinced it's worth the effort.

But no, we're *not* adding a prefetchw to generic code just because
apparently some RISC-V code is doing bad things. You need to keep
workarounds for RISC-V behavior to RISC-V.

And yes, the current "retry count" in our lockref implementation comes
from another "some hardware does bad things for cmpxchg". But that
workaround at most causes a few extra (regular) ALU instructions, and
while not optimal, it's at least not going to cause any bigger
problems.

           Linus
Guo Ren Dec. 1, 2023, 3:36 a.m. UTC | #21
On Fri, Dec 01, 2023 at 10:09:01AM +0900, Linus Torvalds wrote:
> On Thu, 30 Nov 2023 at 19:01, Guo Ren <guoren@kernel.org> wrote:
> >
> > That needs the expensive DynAMO mechanism [1], which some power-efficient
> > cores lack. Yes, powerful OoO hardware could effectively satisfy you with
> > a minimal number of retries, but why couldn't we explicitly tell the
> > hardware with "prefetchw"?
> 
> Because every single time we've had a prefetch in the kernel, it has
> caused problems. A bit like cpu_relax() - these things get added for
> random hardware where it helps, and then a few years later it turns
> out that it hurts almost everywhere else.
> 
> We've had particular problems with 'prefetch' because it turns out
> that (a) nobody sane uses them so (b) hardware is often buggy. And
> here "buggy" may be just performance (ie "prefetch actually stalls on
> TLB lookup" etc broken behavior that means that prefetch is not even
> remotely like a no-op that just hints to the cache subsystem), but
> sometimes even in actual semantics (ie "prefetch causes spurious
> faulting behavior")
Thanks for sharing your experience; now I understand the problem with
generic prefetchw.

But what should we do with these existing uses?
➜  linux git:(master) ✗ grep prefetchw mm/ fs/ kernel/ -r
mm/slub.c:      prefetchw(object + s->offset);
mm/slab.c:      prefetchw(objp);
mm/page_alloc.c:        prefetchw(p);
mm/page_alloc.c:                prefetchw(p + 1);
mm/vmscan.c:#define prefetchw_prev_lru_folio(_folio, _base, _field)
mm/vmscan.c:                    prefetchw(&prev->_field);
mm/vmscan.c:#define prefetchw_prev_lru_folio(_folio, _base, _field) do
mm/vmscan.c:            prefetchw_prev_lru_folio(folio, src, flags);
fs/mpage.c:             prefetchw(&folio->flags);
fs/f2fs/data.c:                 prefetchw(&page->flags);
fs/ext4/readpage.c:             prefetchw(&folio->flags);
kernel/bpf/cpumap.c:                    prefetchw(page);
kernel/locking/qspinlock.c:                     prefetchw(next);
➜  linux git:(master) ✗ grep prefetchw drivers/ -r | wc -l
80
...

> 
> > Advanced hardware would treat cmpxchg as an interconnect transaction on a
> > cache miss (far atomic), which means the L3 cache wouldn't return a unique
> > cacheline even when the cmpxchg fails. The cmpxchg loop would keep reading
> > data bypassing the L1/L2 cache, which means every failed cmpxchg is a
> > cache-miss read.
> 
> Honestly, I wouldn't call that "advanced hardware". I would call that
> ridiculous.
Ridiculous hardware:
When a CAS fails, the hardware stays in "far atomic" mode.

Correct hardware:
When a CAS fails, the hardware should switch to "near atomic," which means
acquiring an exclusive cache line and making progress.

> 
> If the cmpxchg isn't guaranteed to make progress, then the cmpxchg is
> broken. It's really that simple.
I totally agree; that's the right guideline, thanks.

> 
> It does sound like on your hardware, maybe you just want to make the
> RISC-V cmpxchg function always do a "prefetchw" if the 'sc.d' fails,
> something like
> 
>                         "0:     lr.w %0, %2\n"                          \
>                         "       bne  %0, %z3, 1f\n"                     \
>                         "       sc.w %1, %z4, %2\n"                     \
> -                       "       bnez %1, 0b\n"                          \
> +                       "       beqz %1, 1f\n"                          \
> +                       "       prefetchw %2\n"                         \
> +                       "       j 0b\n"                                 \
>                         "1:\n"                                          \

I modified your code to guarantee progress in the compare-failure case:
Final version (for easier reading):
                         "0:     lr.w %0, %2\n"                          \
                         "       bne  %0, %z3, 2f\n"                     \
                         "       sc.w %1, %z4, %2\n"                     \
                         "       beqz %1, 1f\n"                          \
                         "       prefetchw %2\n"                         \
                         "       j 0b\n"                         	 \
                         "2:\n"                                          \
                         "       prefetchw %2\n"                         \
                         "1:\n"                                          \

Diff version:
                         "0:     lr.w %0, %2\n"                          \
 -                       "       bne  %0, %z3, 1f\n"                     \
 +                       "       bne  %0, %z3, 2f\n"                     \
                         "       sc.w %1, %z4, %2\n"                     \
 -                       "       bnez %1, 0b\n"                          \
 +                       "       beqz %1, 1f\n"                          \
 +                       "       prefetchw %2\n"                         \
 +                       "       j 0b\n"                         	 \
 +                       "2:\n"                                          \
 +                       "       prefetchw %2\n"                         \
                         "1:\n"                                          \

> 
> (quick entirely untested hack, you get the idea). A better
> implementation might use "asm goto" and expose the different error
> cases to the compiler so that it can move things around, but I'm not
> convinced it's worth the effort.
> 
> But no, we're *not* adding a prefetchw to generic code just because
> apparently some RISC-V code is doing bad things. You need to keep
> workarounds for RISC-V behavior to RISC-V.
> 
> And yes, the current "retry count" in our lockref implementation comes
> from another "some hardware does bad things for cmpxchg". But that
> workaround at most causes a few extra (regular) ALU instructions, and
> while not optimal, it's at least not going to cause any bigger
> problems.
> 
>            Linus
>
Linus Torvalds Dec. 1, 2023, 5:15 a.m. UTC | #22
On Fri, 1 Dec 2023 at 12:36, Guo Ren <guoren@kernel.org> wrote:
>
> I modified your code to guarantee progress in the compare-failure
> case:

Are you sure you want to prefetch when the value doesn't even match
the existing value? Aren't you better off just looping doing just
reads until you at least have a valid value to exchange?

Otherwise you might easily find that your cmpxchg loops cause
horrendous cacheline ping-pong patterns.

Of course, if your hardware is bad at releasing the written state,
that may actually be what you want, to see changes in a timely manner.

At least some of our cmpxchg uses are the "try_cmpxchg()" pattern,
which wouldn't even loop - and won't write at all - on a value
mismatch.

And some of those try_cmpxchg cases are a lot more important than the
lockref code. Things like spin_trylock() etc. Of course, for best
results you might want to have an actual architecture-specific helper
for the try_cmpxchg case, and use the compiler for "outputs in
condition codes" (but then you need to have fallback cases for older
compilers that don't support it).

See the code for an example of the kinds of nasty support code you need, with

  /*
   * Macros to generate condition code outputs from inline assembly,
   * The output operand must be type "bool".
   */
  #ifdef __GCC_ASM_FLAG_OUTPUTS__
  # define CC_SET(c) "\n\t/* output condition code " #c "*/\n"
  # define CC_OUT(c) "=@cc" #c
  #else
  # define CC_SET(c) "\n\tset" #c " %[_cc_" #c "]\n"
  # define CC_OUT(c) [_cc_ ## c] "=qm"
  #endif

and then a lot of "CC_SET()/CC_OUT()" use in the inline asms in
<asm/cmpxchg.h>...
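
As an illustration of the kind of helper being described, a hedged
x86-flavoured sketch using the CC_SET()/CC_OUT() macros above (this is
not the actual <asm/cmpxchg.h> implementation):

static __always_inline bool my_try_cmpxchg64(u64 *ptr, u64 *oldp, u64 new)
{
	bool success;
	u64 old = *oldp;

	asm volatile("lock cmpxchgq %[new], %[ptr]"
		     CC_SET(z)
		     : CC_OUT(z) (success),
		       [ptr] "+m" (*ptr),
		       "+a" (old)
		     : [new] "r" (new)
		     : "memory");
	if (!success)
		*oldp = old;	/* let the caller retry with the value we saw */
	return success;
}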

IOW, you really should time this and then add the timing information
to whatever commit message.

             Linus
Guo Ren Dec. 1, 2023, 7:31 a.m. UTC | #23
On Fri, Dec 01, 2023 at 02:15:15PM +0900, Linus Torvalds wrote:
> On Fri, 1 Dec 2023 at 12:36, Guo Ren <guoren@kernel.org> wrote:
> >
> > I modified your code to guarantee progress in the compare-failure
> > case:
> 
> Are you sure you want to prefetch when the value doesn't even match
> the existing value? Aren't you better off just looping doing just
> reads until you at least have a valid value to exchange?
Oops, you are right, I was wrong there. Here is what I meant to say:
+          "       prefetch %2\n"                          \
           "0:     lr.w %0, %2\n"                          \
           "       bne  %0, %z3, 1f\n"                     \
           "       sc.w %1, %z4, %2\n"                     \
           "       beqz %1, 1f\n"                          \
           "       prefetchw %2\n"                         \
           "       j 0b\n"                                 \
           "1:\n"                                          \

Just add a prefetch of the shared cache line up front, which could keep
the shared cache line sticky for several cycles and ensure the outer
cmpxchg loop makes progress.

None of what we wrote here is meant as actual code; it's just a reference
for the hardware folks.
 - lr could imply a sticky shared cache line.
 - sc could imply a sticky unique cache line when the sc fails.

> 
> Otherwise you might easily find that your cmpxchg loops cause
> horrendous cacheline ping-pong patterns.
> 
> Of course, if your hardware is bad at releasing the written state,
> that may actually be what you want, to see changes in a timely manner.
> 
> At least some of our cmpxchg uses are the "try_cmpxchg()" pattern,
> which wouldn't even loop - and won't write at all - on a value
> mismatch.
> 
> And some of those try_cmpxchg cases are a lot more important than the
> lockref code. Things like spin_trylock() etc. Of course, for best
> results you might want to have an actual architecture-specific helper
> for the try_cmpxchg case, and use the compiler for "outputs in
> condition codes" (but then you need to have fallback cases for older
> compilers that don't support it).
> 
> See the code code for example of the kinds of nasty support code you need with
> 
>   /*
>    * Macros to generate condition code outputs from inline assembly,
>    * The output operand must be type "bool".
>    */
>   #ifdef __GCC_ASM_FLAG_OUTPUTS__
>   # define CC_SET(c) "\n\t/* output condition code " #c "*/\n"
>   # define CC_OUT(c) "=@cc" #c
>   #else
>   # define CC_SET(c) "\n\tset" #c " %[_cc_" #c "]\n"
>   # define CC_OUT(c) [_cc_ ## c] "=qm"
>   #endif
> 
> and then a lot of "CC_SET()/CC_OUT()" use in the inline asms in
> <asm/cmpxchg.h>...
Thanks for the tip. It's helpful for the try_cmpxchg optimization.

> 
> IOW, you really should time this and then add the timing information
> to whatever commit message.
> 
>              Linus
>
diff mbox series

Patch

diff --git a/fs/dcache.c b/fs/dcache.c
index bd57b9a08894..9e1486db64a7 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -665,30 +665,57 @@  static bool lock_for_kill(struct dentry *dentry)
 	return false;
 }
 
-static inline bool retain_dentry(struct dentry *dentry)
+/*
+ * Decide if dentry is worth retaining.  Usually this is called with dentry
+ * locked; if not locked, we are more limited and might not be able to tell
+ * without a lock.  False in this case means "punt to locked path and recheck".
+ *
+ * In case we aren't locked, these predicates are not "stable". However, it is
+ * sufficient that at some point after we dropped the reference the dentry was
+ * hashed and the flags had the proper value. Other dentry users may have
+ * re-gotten a reference to the dentry and change that, but our work is done -
+ * we can leave the dentry around with a zero refcount.
+ */
+static inline bool retain_dentry(struct dentry *dentry, bool locked)
 {
-	WARN_ON(d_in_lookup(dentry));
+	unsigned int d_flags;
 
-	/* Unreachable? Get rid of it */
+	smp_rmb();
+	d_flags = READ_ONCE(dentry->d_flags);
+
+	// Unreachable? Nobody would be able to look it up, no point retaining
 	if (unlikely(d_unhashed(dentry)))
 		return false;
 
-	if (unlikely(dentry->d_flags & DCACHE_DISCONNECTED))
+	// Same if it's disconnected
+	if (unlikely(d_flags & DCACHE_DISCONNECTED))
 		return false;
 
-	if (unlikely(dentry->d_flags & DCACHE_OP_DELETE)) {
-		if (dentry->d_op->d_delete(dentry))
+	// ->d_delete() might tell us not to bother, but that requires
+	// ->d_lock; can't decide without it
+	if (unlikely(d_flags & DCACHE_OP_DELETE)) {
+		if (!locked || dentry->d_op->d_delete(dentry))
 			return false;
 	}
 
-	if (unlikely(dentry->d_flags & DCACHE_DONTCACHE))
+	// Explicitly told not to bother
+	if (unlikely(d_flags & DCACHE_DONTCACHE))
 		return false;
 
-	/* retain; LRU fodder */
-	if (unlikely(!(dentry->d_flags & DCACHE_LRU_LIST)))
+	// At this point it looks like we ought to keep it.  We also might
+	// need to do something - put it on LRU if it wasn't there already
+	// and mark it referenced if it was on LRU, but not marked yet.
+	// Unfortunately, both actions require ->d_lock, so in lockless
+	// case we'd have to punt rather than doing those.
+	if (unlikely(!(d_flags & DCACHE_LRU_LIST))) {
+		if (!locked)
+			return false;
 		d_lru_add(dentry);
-	else if (unlikely(!(dentry->d_flags & DCACHE_REFERENCED)))
+	} else if (unlikely(!(d_flags & DCACHE_REFERENCED))) {
+		if (!locked)
+			return false;
 		dentry->d_flags |= DCACHE_REFERENCED;
+	}
 	return true;
 }
 
@@ -720,7 +747,6 @@  EXPORT_SYMBOL(d_mark_dontcache);
 static inline bool fast_dput(struct dentry *dentry)
 {
 	int ret;
-	unsigned int d_flags;
 
 	/*
 	 * try to decrement the lockref optimistically.
@@ -749,45 +775,18 @@  static inline bool fast_dput(struct dentry *dentry)
 		return true;
 
 	/*
-	 * Careful, careful. The reference count went down
-	 * to zero, but we don't hold the dentry lock, so
-	 * somebody else could get it again, and do another
-	 * dput(), and we need to not race with that.
-	 *
-	 * However, there is a very special and common case
-	 * where we don't care, because there is nothing to
-	 * do: the dentry is still hashed, it does not have
-	 * a 'delete' op, and it's referenced and already on
-	 * the LRU list.
-	 *
-	 * NOTE! Since we aren't locked, these values are
-	 * not "stable". However, it is sufficient that at
-	 * some point after we dropped the reference the
-	 * dentry was hashed and the flags had the proper
-	 * value. Other dentry users may have re-gotten
-	 * a reference to the dentry and change that, but
-	 * our work is done - we can leave the dentry
-	 * around with a zero refcount.
-	 *
-	 * Nevertheless, there are two cases that we should kill
-	 * the dentry anyway.
-	 * 1. free disconnected dentries as soon as their refcount
-	 *    reached zero.
-	 * 2. free dentries if they should not be cached.
+	 * Can we decide that decrement of refcount is all we needed without
+	 * taking the lock?  There's a very common case when it's all we need -
+	 * dentry looks like it ought to be retained and there's nothing else
+	 * to do.
 	 */
-	smp_rmb();
-	d_flags = READ_ONCE(dentry->d_flags);
-	d_flags &= DCACHE_REFERENCED | DCACHE_LRU_LIST | DCACHE_OP_DELETE |
-			DCACHE_DISCONNECTED | DCACHE_DONTCACHE;
-
-	/* Nothing to do? Dropping the reference was all we needed? */
-	if (d_flags == (DCACHE_REFERENCED | DCACHE_LRU_LIST) && !d_unhashed(dentry))
+	if (retain_dentry(dentry, false))
 		return true;
 
 	/*
-	 * Not the fast normal case? Get the lock. We've already decremented
-	 * the refcount, but we'll need to re-check the situation after
-	 * getting the lock.
+	 * Either not worth retaining or we can't tell without the lock.
+	 * Get the lock, then.  We've already decremented the refcount to 0,
+	 * but we'll need to re-check the situation after getting the lock.
 	 */
 	spin_lock(&dentry->d_lock);
 
@@ -798,7 +797,7 @@  static inline bool fast_dput(struct dentry *dentry)
 	 * don't need to do anything else.
 	 */
 locked:
-	if (dentry->d_lockref.count || retain_dentry(dentry)) {
+	if (dentry->d_lockref.count || retain_dentry(dentry, true)) {
 		spin_unlock(&dentry->d_lock);
 		return true;
 	}
@@ -847,7 +846,7 @@  void dput(struct dentry *dentry)
 		dentry = __dentry_kill(dentry);
 		if (!dentry)
 			return;
-		if (retain_dentry(dentry)) {
+		if (retain_dentry(dentry, true)) {
 			spin_unlock(&dentry->d_lock);
 			return;
 		}