Message ID | YXFli3mzMishRpEq@hirez.programming.kicks-ass.net (mailing list archive)
State      | New, archived
Series     | locking: Generic ticket lock
On Thu, Oct 21, 2021 at 03:05:15PM +0200, Peter Zijlstra wrote:
>
> There's currently a number of architectures that want/have graduated
> from test-and-set locks and are looking at qspinlock.
>
> *HOWEVER* qspinlock is very complicated and requires a lot of an
> architecture to actually work correctly. Specifically it requires
> forward progress between a fair number of atomic primitives, including
> an xchg16 operation, which I've seen a fair number of fundamentally
> broken implementations of in the tree (specifically for qspinlock no
> less).
>
> The benefit of qspinlock over ticket lock is also non-obvious, esp.
> at low contention (the vast majority of cases in the kernel), and it
> takes a fairly large number of CPUs (typically also NUMA) to make
> qspinlock beat ticket locks.
>
> Esp. things like ARM64's WFE can move the balance a lot in favour of
> simpler locks by reducing the cacheline pressure due to waiters (see
> their smp_cond_load_acquire() implementation for details).
>
> Unless you've audited qspinlock for your architecture and found it
> sound *and* can show actual benefit, simpler is better.
>
> Therefore provide ticket locks, which depend on a single atomic
> operation (fetch_add) while still providing fairness.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/asm-generic/qspinlock.h         | 30 +++++++++
>  include/asm-generic/ticket_lock_types.h | 11 +++
>  include/asm-generic/ticket_lock.h       | 97 ++++++++++++++++++++++++++++++++
>  3 files changed, 138 insertions(+)

A few notes...

> + * It relies on smp_store_release() + atomic_*_acquire() to be RCsc (or no
> + * weaker than RCtso if you're Power, also see smp_mb__after_unlock_lock()),

This should hold true to RISC-V in its current form, AFAICT
atomic_fetch_add ends up using AMOADD, and therefore the argument made
in the unlock+lock thread [1], gives that this results in RW,RW
ordering.

[1] https://lore.kernel.org/lkml/5412ab37-2979-5717-4951-6a61366df0f2@nvidia.com/


I've compile tested on openrisc/simple_smp_defconfig using the below.

--- a/arch/openrisc/Kconfig
+++ b/arch/openrisc/Kconfig
@@ -30,7 +30,6 @@ config OPENRISC
 	select HAVE_DEBUG_STACKOVERFLOW
 	select OR1K_PIC
 	select CPU_NO_EFFICIENT_FFS if !OPENRISC_HAVE_INST_FF1
-	select ARCH_USE_QUEUED_SPINLOCKS
 	select ARCH_USE_QUEUED_RWLOCKS
 	select OMPIC if SMP
 	select ARCH_WANT_FRAME_POINTERS
--- a/arch/openrisc/include/asm/Kbuild
+++ b/arch/openrisc/include/asm/Kbuild
@@ -1,9 +1,8 @@
 # SPDX-License-Identifier: GPL-2.0
 generic-y += extable.h
 generic-y += kvm_para.h
-generic-y += mcs_spinlock.h
-generic-y += qspinlock_types.h
-generic-y += qspinlock.h
+generic-y += ticket_lock_types.h
+generic-y += ticket_lock.h
 generic-y += qrwlock_types.h
 generic-y += qrwlock.h
 generic-y += user.h
--- a/arch/openrisc/include/asm/spinlock.h
+++ b/arch/openrisc/include/asm/spinlock.h
@@ -15,7 +15,7 @@
 #ifndef __ASM_OPENRISC_SPINLOCK_H
 #define __ASM_OPENRISC_SPINLOCK_H
 
-#include <asm/qspinlock.h>
+#include <asm/ticket_lock.h>
 
 #include <asm/qrwlock.h>
 
--- a/arch/openrisc/include/asm/spinlock_types.h
+++ b/arch/openrisc/include/asm/spinlock_types.h
@@ -1,7 +1,7 @@
 #ifndef _ASM_OPENRISC_SPINLOCK_TYPES_H
 #define _ASM_OPENRISC_SPINLOCK_TYPES_H
 
-#include <asm/qspinlock_types.h>
+#include <asm/ticket_lock_types.h>
 #include <asm/qrwlock_types.h>
 
 #endif /* _ASM_OPENRISC_SPINLOCK_TYPES_H */
On Thu, Oct 21, 2021 at 3:05 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Therefore provide ticket locks, which depend on a single atomic
> operation (fetch_add) while still providing fairness.

Nice!

Aside from the qspinlock vs ticket-lock question, can you describe the
tradeoffs between this generic ticket lock and a custom implementation
in architecture code? Should we convert most architectures over
to the generic code in the long run, or is there something they
can usually do better with an inline asm based ticket lock
or a trivial test-and-set?

> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/asm-generic/qspinlock.h         | 30 +++++++++
>  include/asm-generic/ticket_lock_types.h | 11 +++
>  include/asm-generic/ticket_lock.h       | 97 ++++++++++++++++++++++++++++++++
>  3 files changed, 138 insertions(+)

If anyone wants to use this for their architecture, feel free to add

Acked-by: Arnd Bergmann <arnd@arndb.de>

to merge it through the respective architecture git tree. If there is
more than one architecture that wants it right now, I could also take
them all through the asm-generic tree.

       Arnd
On Thu, Oct 21, 2021 at 03:12:25PM +0200, Peter Zijlstra wrote: > On Thu, Oct 21, 2021 at 03:05:15PM +0200, Peter Zijlstra wrote: > > > > There's currently a number of architectures that want/have graduated > > from test-and-set locks and are looking at qspinlock. > > > > *HOWEVER* qspinlock is very complicated and requires a lot of an > > architecture to actually work correctly. Specifically it requires > > forward progress between a fair number of atomic primitives, including > > an xchg16 operation, which I've seen a fair number of fundamentally > > broken implementations of in the tree (specifically for qspinlock no > > less). > > > > The benefit of qspinlock over ticket lock is also non-obvious, esp. > > at low contention (the vast majority of cases in the kernel), and it > > takes a fairly large number of CPUs (typically also NUMA) to make > > qspinlock beat ticket locks. > > > > Esp. things like ARM64's WFE can move the balance a lot in favour of > > simpler locks by reducing the cacheline pressure due to waiters (see > > their smp_cond_load_acquire() implementation for details). > > > > Unless you've audited qspinlock for your architecture and found it > > sound *and* can show actual benefit, simpler is better. For OpenRISC originally we had a custom ticket locking mechanism, but it was suggested to use qspinlocks as the genric implementation meant less code. Changed here: https://yhbt.net/lore/all/86vaix5fmr.fsf@arm.com/T/ I think moving to qspinlocks was suggested by you. But now that we have this generic infrastructure, I am good to switch. > > Therefore provide ticket locks, which depend on a single atomic > > operation (fetch_add) while still providing fairness. > > > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> > > --- > > include/asm-generic/qspinlock.h | 30 +++++++++ > > include/asm-generic/ticket_lock_types.h | 11 +++ > > include/asm-generic/ticket_lock.h | 97 ++++++++++++++++++++++++++++++++ > > 3 files changed, 138 insertions(+) > > A few notes... > > > + * It relies on smp_store_release() + atomic_*_acquire() to be RCsc (or no > > + * weaker than RCtso if you're Power, also see smp_mb__after_unlock_lock()), > > This should hold true to RISC-V in its current form, AFAICT > atomic_fetch_add ends up using AMOADD, and therefore the argument made > in the unlock+lock thread [1], gives that this results in RW,RW > ordering. > > [1] https://lore.kernel.org/lkml/5412ab37-2979-5717-4951-6a61366df0f2@nvidia.com/ > > > I've compile tested on openrisc/simple_smp_defconfig using the below. 
> > --- a/arch/openrisc/Kconfig > +++ b/arch/openrisc/Kconfig > @@ -30,7 +30,6 @@ config OPENRISC > select HAVE_DEBUG_STACKOVERFLOW > select OR1K_PIC > select CPU_NO_EFFICIENT_FFS if !OPENRISC_HAVE_INST_FF1 > - select ARCH_USE_QUEUED_SPINLOCKS > select ARCH_USE_QUEUED_RWLOCKS > select OMPIC if SMP > select ARCH_WANT_FRAME_POINTERS > --- a/arch/openrisc/include/asm/Kbuild > +++ b/arch/openrisc/include/asm/Kbuild > @@ -1,9 +1,8 @@ > # SPDX-License-Identifier: GPL-2.0 > generic-y += extable.h > generic-y += kvm_para.h > -generic-y += mcs_spinlock.h > -generic-y += qspinlock_types.h > -generic-y += qspinlock.h > +generic-y += ticket_lock_types.h > +generic-y += ticket_lock.h > generic-y += qrwlock_types.h > generic-y += qrwlock.h > generic-y += user.h > --- a/arch/openrisc/include/asm/spinlock.h > +++ b/arch/openrisc/include/asm/spinlock.h > @@ -15,7 +15,7 @@ > #ifndef __ASM_OPENRISC_SPINLOCK_H > #define __ASM_OPENRISC_SPINLOCK_H > > -#include <asm/qspinlock.h> > +#include <asm/ticket_lock.h> > > #include <asm/qrwlock.h> > > --- a/arch/openrisc/include/asm/spinlock_types.h > +++ b/arch/openrisc/include/asm/spinlock_types.h > @@ -1,7 +1,7 @@ > #ifndef _ASM_OPENRISC_SPINLOCK_TYPES_H > #define _ASM_OPENRISC_SPINLOCK_TYPES_H > > -#include <asm/qspinlock_types.h> > +#include <asm/ticket_lock_types.h> > #include <asm/qrwlock_types.h> > > #endif /* _ASM_OPENRISC_SPINLOCK_TYPES_H */ This looks good to me. Do you want to commit along with the generic ticket lock patch? Otherwise I can queue it after it is upstreamed. Another option is I can help merge the generic ticket lock code via the OpenRISC branch. Let me know what works. -Stafford
On Thu, Oct 21, 2021 at 03:49:51PM +0200, Arnd Bergmann wrote:
> On Thu, Oct 21, 2021 at 3:05 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Therefore provide ticket locks, which depend on a single atomic
> > operation (fetch_add) while still providing fairness.
>
> Nice!
>
> Aside from the qspinlock vs ticket-lock question, can you describe the
> tradeoffs between this generic ticket lock and a custom implementation
> in architecture code? Should we convert most architectures over
> to the generic code in the long run, or is there something they
> can usually do better with an inline asm based ticket lock

I think for a load-store arch this thing should generate pretty close to
optimal code. x86 can do ticket_unlock() slightly better using a single
INCW (or ADDW 1) on the owner subword, where this implementation will use
separate load-add-store instructions.

Whether that is actually measurable is something else entirely.

> or a trivial test-and-set?

If your SMP arch is halfway sane (no fwd progress issues etc..) then
ticket should behave well and avoid the starvation/variability of a TaS
lock.

The big exception there is virtualized architectures; ticket is
absolutely horrendous for 'guests' (any fair lock is, for that matter).
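[For illustration only: a sketch of the x86-style unlock override Peter describes above, not part of the patch. It assumes the little-endian layout of the generic lock word, where the owner field is the low 16 bits.]

/*
 * Hypothetical x86 ticket_unlock() sketch: bump the owner half-word with a
 * single RMW instruction. Only the lock holder ever writes this half-word,
 * so no LOCK prefix is needed, and x86's TSO memory model already gives the
 * store release semantics; the "memory" clobber acts as a compiler barrier.
 */
static __always_inline void ticket_unlock(arch_spinlock_t *lock)
{
	asm volatile("incw %0"
		     : "+m" (*(u16 *)lock)	/* owner is the low 16 bits on LE */
		     :
		     : "memory");
}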
On Thu, Oct 21, 2021 at 5:14 PM Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Oct 21, 2021 at 03:49:51PM +0200, Arnd Bergmann wrote:
> I think for a load-store arch this thing should generate pretty close to
> optimal code. x86 can do ticket_unlock() slightly better using a single
> INCW (or ADDW 1) on the owner subword, where this implementation will use
> separate load-add-store instructions.
>
> Whether that is actually measurable is something else entirely.

Ok, so I guess such an architecture could take the generic implementation
and override just arch_spin_unlock() or just arch_spin_lock(), if that
makes a difference for them.

Should we perhaps turn your modified openrisc asm/spinlock.h
and asm/spinlock_types.h into the fallback in asm-generic, and
remove the ones for the architectures that have no overrides
at all?

Or possibly a version that can do both based on
CONFIG_ARCH_USE_QUEUED_SPINLOCKS? That would let us remove even more
architecture-specific headers, but it increases the risk of some
architecture using qspinlock when it really should not.

> > or a trivial test-and-set?
>
> If your SMP arch is halfway sane (no fwd progress issues etc..) then
> ticket should behave well and avoid the starvation/variability of a TaS
> lock.

Ok, and I guess we still need to keep the parisc and sparc32 versions
anyway.

> The big exception there is virtualized architectures; ticket is
> absolutely horrendous for 'guests' (any fair lock is, for that matter).

This might be useful information to put into the header, at least I
had no idea about this distinction.

       Arnd
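[To make the suggestion concrete: a hypothetical asm-generic fallback along the lines Arnd proposes. The header name and include layout are assumptions, not something from this series.]

/*
 * Hypothetical asm-generic/spinlock.h sketch: pick qspinlock or the generic
 * ticket lock based on the architecture's Kconfig selection, so most arch
 * headers could go away entirely.
 */
#ifndef __ASM_GENERIC_SPINLOCK_H
#define __ASM_GENERIC_SPINLOCK_H

#ifdef CONFIG_ARCH_USE_QUEUED_SPINLOCKS
#include <asm/qspinlock.h>
#else
#include <asm-generic/ticket_lock.h>
#endif

#endif /* __ASM_GENERIC_SPINLOCK_H */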
On Thu, Oct 21, 2021 at 05:31:51PM +0200, Arnd Bergmann wrote:
> On Thu, Oct 21, 2021 at 5:14 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > On Thu, Oct 21, 2021 at 03:49:51PM +0200, Arnd Bergmann wrote:
> > I think for a load-store arch this thing should generate pretty close to
> > optimal code. x86 can do ticket_unlock() slightly better using a single
> > INCW (or ADDW 1) on the owner subword, where this implementation will use
> > separate load-add-store instructions.
> >
> > Whether that is actually measurable is something else entirely.
>
> Ok, so I guess such an architecture could take the generic implementation
> and override just arch_spin_unlock() or just arch_spin_lock(), if that
> makes a difference for them.

Also, pre-EV5 DEC Alpha might have issues, since it can only do 32bit wide
accesses and would need an ll/sc to unlock.

But yes, if/when needed we could allow overrides.

> Should we perhaps turn your modified openrisc asm/spinlock.h
> and asm/spinlock_types.h into the fallback in asm-generic, and
> remove the ones for the architectures that have no overrides
> at all?

Possibly, yes.

> > If your SMP arch is halfway sane (no fwd progress issues etc..) then
> > ticket should behave well and avoid the starvation/variability of a TaS
> > lock.
>
> Ok, and I guess we still need to keep the parisc and sparc32 versions
> anyway.

Yes, both of those only have an xchg()-like instruction, so they can
realistically only implement TaS locks and have to build everything else
on top of that... if only we could get rid of all that :-)

> > The big exception there is virtualized architectures; ticket is
> > absolutely horrendous for 'guests' (any fair lock is, for that matter).
>
> This might be useful information to put into the header, at least
> I had no idea about this distinction.

Yes indeed, I'd not thought of it until you asked.
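[As a reference point for the parisc/sparc32 case: a minimal test-and-set lock built on nothing more than xchg(). This is a sketch, not the actual parisc/sparc32 code, which uses ldcw and ldstub respectively.]

/*
 * Minimal test-and-set spinlock sketch. xchg() is assumed to be a full
 * barrier (as it is in the kernel), so the winning xchg() provides acquire
 * ordering; the inner read-only loop avoids hammering the cacheline with
 * failed atomics while the lock is held. Unfair by construction.
 */
typedef struct { unsigned int slock; } tas_lock_t;

static inline void tas_lock(tas_lock_t *l)
{
	while (xchg(&l->slock, 1)) {
		while (READ_ONCE(l->slock))
			cpu_relax();
	}
}

static inline int tas_trylock(tas_lock_t *l)
{
	return xchg(&l->slock, 1) == 0;
}

static inline void tas_unlock(tas_lock_t *l)
{
	smp_store_release(&l->slock, 0);
}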
On 10/21/21 9:05 AM, Peter Zijlstra wrote: > There's currently a number of architectures that want/have graduated > from test-and-set locks and are looking at qspinlock. > > *HOWEVER* qspinlock is very complicated and requires a lot of an > architecture to actually work correctly. Specifically it requires > forward progress between a fair number of atomic primitives, including > an xchg16 operation, which I've seen a fair number of fundamentally > broken implementations of in the tree (specifically for qspinlock no > less). > > The benefit of qspinlock over ticket lock is also non-obvious, esp. > at low contention (the vast majority of cases in the kernel), and it > takes a fairly large number of CPUs (typically also NUMA) to make > qspinlock beat ticket locks. > > Esp. things like ARM64's WFE can move the balance a lot in favour of > simpler locks by reducing the cacheline pressure due to waiters (see > their smp_cond_load_acquire() implementation for details). > > Unless you've audited qspinlock for your architecture and found it > sound *and* can show actual benefit, simpler is better. > > Therefore provide ticket locks, which depend on a single atomic > operation (fetch_add) while still providing fairness. > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> > --- > include/asm-generic/qspinlock.h | 30 +++++++++ > include/asm-generic/ticket_lock_types.h | 11 +++ > include/asm-generic/ticket_lock.h | 97 ++++++++++++++++++++++++++++++++ > 3 files changed, 138 insertions(+) > > --- a/include/asm-generic/qspinlock.h > +++ b/include/asm-generic/qspinlock.h > @@ -2,6 +2,36 @@ > /* > * Queued spinlock > * > + * A 'generic' spinlock implementation that is based on MCS locks. An > + * architecture that's looking for a 'generic' spinlock, please first consider > + * ticket_lock.h and only come looking here when you've considered all the > + * constraints below and can show your hardware does actually perform better > + * with qspinlock. > + * > + * > + * It relies on smp_store_release() + atomic_*_acquire() to be RCsc (or no > + * weaker than RCtso if you're Power, also see smp_mb__after_unlock_lock()), > + * > + * It relies on a far greater (compared to ticket_lock.h) set of atomic > + * operations to behave well together, please audit them carefully to ensure > + * they all have forward progress. Many atomic operations may default to > + * cmpxchg() loops which will not have good forward progress properties on > + * LL/SC architectures. > + * > + * One notable example is atomic_fetch_or_acquire(), which x86 cannot (cheaply) > + * do. Carefully read the patches that introduced > + * queued_fetch_set_pending_acquire(). > + * > + * It also heavily relies on mixed size atomic operations, in specific it > + * requires architectures to have xchg16; something which many LL/SC > + * architectures need to implement as a 32bit and+or in order to satisfy the > + * forward progress guarantees mentioned above. > + * > + * Further reading on mixed size atomics that might be relevant: > + * > + * http://www.cl.cam.ac.uk/~pes20/popl17/mixed-size.pdf > + * > + * > * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P. 
> * (C) Copyright 2015 Hewlett-Packard Enterprise Development LP > * > --- /dev/null > +++ b/include/asm-generic/ticket_lock_types.h > @@ -0,0 +1,11 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > + > +#ifndef __ASM_GENERIC_TICKET_LOCK_TYPES_H > +#define __ASM_GENERIC_TICKET_LOCK_TYPES_H > + > +#include <linux/types.h> > +typedef atomic_t arch_spinlock_t; > + > +#define __ARCH_SPIN_LOCK_UNLOCKED ATOMIC_INIT(0) > + > +#endif /* __ASM_GENERIC_TICKET_LOCK_TYPES_H */ > --- /dev/null > +++ b/include/asm-generic/ticket_lock.h > @@ -0,0 +1,97 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > + > +/* > + * 'Generic' ticket lock implementation. > + * > + * It relies on atomic_fetch_add() having well defined forward progress > + * guarantees under contention. If your architecture cannot provide this, stick > + * to a test-and-set lock. > + * > + * It also relies on atomic_fetch_add() being safe vs smp_store_release() on a > + * sub-word of the value. This is generally true for anything LL/SC although > + * you'd be hard pressed to find anything useful in architecture specifications > + * about this. If your architecture cannot do this you might be better off with > + * a test-and-set. > + * > + * It relies on smp_store_release() + atomic_*_acquire() to be RCsc (or no > + * weaker than RCtso if you're Power, also see smp_mb__after_unlock_lock()), > + * > + * The implementation uses smp_cond_load_acquire() to spin, so if the > + * architecture has WFE like instructions to sleep instead of poll for word > + * modifications be sure to implement that (see ARM64 for example). > + * > + */ > + > +#ifndef __ASM_GENERIC_TICKET_LOCK_H > +#define __ASM_GENERIC_TICKET_LOCK_H > + > +#include <linux/atomic.h> > +#include <asm/ticket_lock_types.h> > + > +#define ONE_TICKET (1 << 16) > +#define __ticket(val) (u16)((val) >> 16) > +#define __owner(val) (u16)((val) & 0xffff) > + > +static __always_inline bool __ticket_is_locked(u32 val) > +{ > + return __ticket(val) != __owner(val); > +} > + > +static __always_inline void ticket_lock(arch_spinlock_t *lock) > +{ > + u32 val = atomic_fetch_add_acquire(ONE_TICKET, lock); > + u16 ticket = __ticket(val); > + > + if (ticket == __owner(val)) > + return; > + > + atomic_cond_read_acquire(lock, ticket == __owner(VAL)); > +} > + > +static __always_inline bool ticket_trylock(arch_spinlock_t *lock) > +{ > + u32 old = atomic_read(lock); > + > + if (__ticket_is_locked(old)) > + return false; > + > + return atomic_try_cmpxchg_acquire(lock, &old, old + ONE_TICKET); > +} > + > +static __always_inline void ticket_unlock(arch_spinlock_t *lock) > +{ > + u16 *ptr = (u16 *)lock + __is_defined(__BIG_ENDIAN); > + u32 val = atomic_read(lock); > + > + smp_store_release(ptr, __owner(val) + 1); > +} > + > +static __always_inline int ticket_is_contended(arch_spinlock_t *lock) > +{ > + u32 val = atomic_read(lock); > + > + return (__ticket(val) - __owner(val)) > 1; Nit: The left side is unsigned, but the right is signed. I think you are relying on the implicit signed to unsigned conversion. It may be a bit clearer if you use 1U instead. 
> +} > + > +static __always_inline int ticket_is_locked(arch_spinlock_t *lock) > +{ > + return __ticket_is_locked(atomic_read(lock)); > +} > + > +static __always_inline int ticket_value_unlocked(arch_spinlock_t lock) > +{ > + return !__ticket_is_locked(lock.counter); > +} > + > +#undef __owner > +#undef __ticket > +#undef ONE_TICKET > + > +#define arch_spin_lock(l) ticket_lock(l) > +#define arch_spin_trylock(l) ticket_trylock(l) > +#define arch_spin_unlock(l) ticket_unlock(l) > +#define arch_spin_is_locked(l) ticket_is_locked(l) > +#define arch_spin_is_contended(l) ticket_is_contended(l) > +#define arch_spin_value_unlocked(l) ticket_value_unlocked(l) > + > +#endif /* __ASM_GENERIC_TICKET_LOCK_H */ Other than the nit above, the patch looks good to me. Reviewed-by: Waiman Long <longman@redhat.com>
C-SKY would follow this, thx. Reviewed-by: Guo Ren <guoren@kernel.org> On Thu, Oct 21, 2021 at 9:05 PM Peter Zijlstra <peterz@infradead.org> wrote: > > > There's currently a number of architectures that want/have graduated > from test-and-set locks and are looking at qspinlock. > > *HOWEVER* qspinlock is very complicated and requires a lot of an > architecture to actually work correctly. Specifically it requires > forward progress between a fair number of atomic primitives, including > an xchg16 operation, which I've seen a fair number of fundamentally > broken implementations of in the tree (specifically for qspinlock no > less). > > The benefit of qspinlock over ticket lock is also non-obvious, esp. > at low contention (the vast majority of cases in the kernel), and it > takes a fairly large number of CPUs (typically also NUMA) to make > qspinlock beat ticket locks. > > Esp. things like ARM64's WFE can move the balance a lot in favour of > simpler locks by reducing the cacheline pressure due to waiters (see > their smp_cond_load_acquire() implementation for details). > > Unless you've audited qspinlock for your architecture and found it > sound *and* can show actual benefit, simpler is better. > > Therefore provide ticket locks, which depend on a single atomic > operation (fetch_add) while still providing fairness. > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> > --- > include/asm-generic/qspinlock.h | 30 +++++++++ > include/asm-generic/ticket_lock_types.h | 11 +++ > include/asm-generic/ticket_lock.h | 97 ++++++++++++++++++++++++++++++++ > 3 files changed, 138 insertions(+) > > --- a/include/asm-generic/qspinlock.h > +++ b/include/asm-generic/qspinlock.h > @@ -2,6 +2,36 @@ > /* > * Queued spinlock > * > + * A 'generic' spinlock implementation that is based on MCS locks. An > + * architecture that's looking for a 'generic' spinlock, please first consider > + * ticket_lock.h and only come looking here when you've considered all the > + * constraints below and can show your hardware does actually perform better > + * with qspinlock. > + * > + * > + * It relies on smp_store_release() + atomic_*_acquire() to be RCsc (or no > + * weaker than RCtso if you're Power, also see smp_mb__after_unlock_lock()), > + * > + * It relies on a far greater (compared to ticket_lock.h) set of atomic > + * operations to behave well together, please audit them carefully to ensure > + * they all have forward progress. Many atomic operations may default to > + * cmpxchg() loops which will not have good forward progress properties on > + * LL/SC architectures. > + * > + * One notable example is atomic_fetch_or_acquire(), which x86 cannot (cheaply) > + * do. Carefully read the patches that introduced > + * queued_fetch_set_pending_acquire(). > + * > + * It also heavily relies on mixed size atomic operations, in specific it > + * requires architectures to have xchg16; something which many LL/SC > + * architectures need to implement as a 32bit and+or in order to satisfy the > + * forward progress guarantees mentioned above. > + * > + * Further reading on mixed size atomics that might be relevant: > + * > + * http://www.cl.cam.ac.uk/~pes20/popl17/mixed-size.pdf > + * > + * > * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P. 
> * (C) Copyright 2015 Hewlett-Packard Enterprise Development LP > * > --- /dev/null > +++ b/include/asm-generic/ticket_lock_types.h > @@ -0,0 +1,11 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > + > +#ifndef __ASM_GENERIC_TICKET_LOCK_TYPES_H > +#define __ASM_GENERIC_TICKET_LOCK_TYPES_H > + > +#include <linux/types.h> > +typedef atomic_t arch_spinlock_t; > + > +#define __ARCH_SPIN_LOCK_UNLOCKED ATOMIC_INIT(0) > + > +#endif /* __ASM_GENERIC_TICKET_LOCK_TYPES_H */ > --- /dev/null > +++ b/include/asm-generic/ticket_lock.h > @@ -0,0 +1,97 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > + > +/* > + * 'Generic' ticket lock implementation. > + * > + * It relies on atomic_fetch_add() having well defined forward progress > + * guarantees under contention. If your architecture cannot provide this, stick > + * to a test-and-set lock. > + * > + * It also relies on atomic_fetch_add() being safe vs smp_store_release() on a > + * sub-word of the value. This is generally true for anything LL/SC although > + * you'd be hard pressed to find anything useful in architecture specifications > + * about this. If your architecture cannot do this you might be better off with > + * a test-and-set. > + * > + * It relies on smp_store_release() + atomic_*_acquire() to be RCsc (or no > + * weaker than RCtso if you're Power, also see smp_mb__after_unlock_lock()), > + * > + * The implementation uses smp_cond_load_acquire() to spin, so if the > + * architecture has WFE like instructions to sleep instead of poll for word > + * modifications be sure to implement that (see ARM64 for example). > + * > + */ > + > +#ifndef __ASM_GENERIC_TICKET_LOCK_H > +#define __ASM_GENERIC_TICKET_LOCK_H > + > +#include <linux/atomic.h> > +#include <asm/ticket_lock_types.h> > + > +#define ONE_TICKET (1 << 16) > +#define __ticket(val) (u16)((val) >> 16) > +#define __owner(val) (u16)((val) & 0xffff) > + > +static __always_inline bool __ticket_is_locked(u32 val) > +{ > + return __ticket(val) != __owner(val); > +} > + > +static __always_inline void ticket_lock(arch_spinlock_t *lock) > +{ > + u32 val = atomic_fetch_add_acquire(ONE_TICKET, lock); > + u16 ticket = __ticket(val); > + > + if (ticket == __owner(val)) > + return; > + > + atomic_cond_read_acquire(lock, ticket == __owner(VAL)); > +} > + > +static __always_inline bool ticket_trylock(arch_spinlock_t *lock) > +{ > + u32 old = atomic_read(lock); > + > + if (__ticket_is_locked(old)) > + return false; > + > + return atomic_try_cmpxchg_acquire(lock, &old, old + ONE_TICKET); > +} > + > +static __always_inline void ticket_unlock(arch_spinlock_t *lock) > +{ > + u16 *ptr = (u16 *)lock + __is_defined(__BIG_ENDIAN); > + u32 val = atomic_read(lock); > + > + smp_store_release(ptr, __owner(val) + 1); > +} > + > +static __always_inline int ticket_is_contended(arch_spinlock_t *lock) > +{ > + u32 val = atomic_read(lock); > + > + return (__ticket(val) - __owner(val)) > 1; > +} > + > +static __always_inline int ticket_is_locked(arch_spinlock_t *lock) > +{ > + return __ticket_is_locked(atomic_read(lock)); > +} > + > +static __always_inline int ticket_value_unlocked(arch_spinlock_t lock) > +{ > + return !__ticket_is_locked(lock.counter); > +} > + > +#undef __owner > +#undef __ticket > +#undef ONE_TICKET > + > +#define arch_spin_lock(l) ticket_lock(l) > +#define arch_spin_trylock(l) ticket_trylock(l) > +#define arch_spin_unlock(l) ticket_unlock(l) > +#define arch_spin_is_locked(l) ticket_is_locked(l) > +#define arch_spin_is_contended(l) ticket_is_contended(l) > +#define arch_spin_value_unlocked(l) 
ticket_value_unlocked(l) > + > +#endif /* __ASM_GENERIC_TICKET_LOCK_H */
Hi Peter,

On Thu, Oct 21, 2021 at 03:05:15PM +0200, Peter Zijlstra wrote:
> +static __always_inline void ticket_lock(arch_spinlock_t *lock)
> +{
> +	u32 val = atomic_fetch_add_acquire(ONE_TICKET, lock);

I wonder, should these atomics be arch_atomic_*(), in case an arch_ or raw_
lock is used in noinstr code? The plain atomic_*() forms can have explicit
inline instrumentation.

I haven't seen any issues with qspinlock so far, and that also uses the
(instrumented) atomics, so maybe that's not actually a problem, but I'm not
sure what we intend here w.r.t. instrumentability.

Thanks,
Mark.
On Fri, Oct 22, 2021 at 10:23:02AM +0100, Mark Rutland wrote:
> Hi Peter,
>
> On Thu, Oct 21, 2021 at 03:05:15PM +0200, Peter Zijlstra wrote:
> > +static __always_inline void ticket_lock(arch_spinlock_t *lock)
> > +{
> > +	u32 val = atomic_fetch_add_acquire(ONE_TICKET, lock);
>
> I wonder, should these atomics be arch_atomic_*(), in case an arch_ or raw_
> lock is used in noinstr code? The plain atomic_*() forms can have explicit
> inline instrumentation.
>
> I haven't seen any issues with qspinlock so far, and that also uses the
> (instrumented) atomics, so maybe that's not actually a problem, but I'm not
> sure what we intend here w.r.t. instrumentability.

So far it's not been a problem, and as you say, if we want to change this,
we need a larger audit/cleanup.

IIRC there's a potential problem in the arm idle code (noinstr'ing the idle
code is still on the TODO list somewhere, hampered by the need to create
more tooling).
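[For reference, the direction Mark hints at would presumably look something like the sketch below; arch_atomic_fetch_add_acquire() is the uninstrumented counterpart of atomic_fetch_add_acquire(). This is only an illustration of the idea, not a proposed change.]

/*
 * Sketch only: a noinstr-friendly lock fast path using the raw arch_atomic_*()
 * op and smp_cond_load_acquire() on the underlying counter, so no explicit
 * KASAN/KCSAN instrumentation ends up in the locking path.
 */
static __always_inline void ticket_lock_noinstr(arch_spinlock_t *lock)
{
	u32 val = arch_atomic_fetch_add_acquire(1 << 16, lock);
	u16 ticket = (u16)(val >> 16);

	if (ticket == (u16)val)		/* our ticket is the current owner */
		return;

	/* spin until the owner half-word reaches our ticket */
	smp_cond_load_acquire(&lock->counter, ticket == (u16)VAL);
}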
On Thu, Oct 21, 2021 at 02:04:55PM -0400, Waiman Long wrote:

[...]

> > +static __always_inline int ticket_is_contended(arch_spinlock_t *lock)
> > +{
> > +	u32 val = atomic_read(lock);
> > +
> > +	return (__ticket(val) - __owner(val)) > 1;
> Nit: The left side is unsigned, but the right is signed. I think you are
> relying on the implicit signed to unsigned conversion. It may be a bit
> clearer if you use 1U instead.
> > +}
> > +
> > +static __always_inline int ticket_is_locked(arch_spinlock_t *lock)
> > +{
> > +	return __ticket_is_locked(atomic_read(lock));
> > +}
> > +
> > +static __always_inline int ticket_value_unlocked(arch_spinlock_t lock)
> > +{
> > +	return !__ticket_is_locked(lock.counter);
> > +}
> > +
> > +#undef __owner
> > +#undef __ticket
> > +#undef ONE_TICKET
> > +
> > +#define arch_spin_lock(l)		ticket_lock(l)
> > +#define arch_spin_trylock(l)		ticket_trylock(l)
> > +#define arch_spin_unlock(l)		ticket_unlock(l)
> > +#define arch_spin_is_locked(l)		ticket_is_locked(l)
> > +#define arch_spin_is_contended(l)	ticket_is_contended(l)
> > +#define arch_spin_value_unlocked(l)	ticket_value_unlocked(l)
> > +
> > +#endif /* __ASM_GENERIC_TICKET_LOCK_H */
>
> Other than the nit above, the patch looks good to me.
>
> Reviewed-by: Waiman Long <longman@redhat.com>

Same here ;-)

Reviewed-by: Boqun Feng <boqun.feng@gmail.com>

Regards,
Boqun
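[Spelled out, Waiman's suggestion would turn the helper into something like the following; illustrative only, __ticket()/__owner() being the helpers from ticket_lock.h above.]

static __always_inline int ticket_is_contended(arch_spinlock_t *lock)
{
	u32 val = atomic_read(lock);

	/* 1U keeps the comparison unsigned, as suggested above. */
	return (__ticket(val) - __owner(val)) > 1U;
}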
On Thu, Oct 21, 2021 at 03:05:15PM +0200, Peter Zijlstra wrote: > > There's currently a number of architectures that want/have graduated > from test-and-set locks and are looking at qspinlock. > > *HOWEVER* qspinlock is very complicated and requires a lot of an > architecture to actually work correctly. Specifically it requires > forward progress between a fair number of atomic primitives, including > an xchg16 operation, which I've seen a fair number of fundamentally > broken implementations of in the tree (specifically for qspinlock no > less). > > The benefit of qspinlock over ticket lock is also non-obvious, esp. > at low contention (the vast majority of cases in the kernel), and it > takes a fairly large number of CPUs (typically also NUMA) to make > qspinlock beat ticket locks. > > Esp. things like ARM64's WFE can move the balance a lot in favour of > simpler locks by reducing the cacheline pressure due to waiters (see > their smp_cond_load_acquire() implementation for details). > > Unless you've audited qspinlock for your architecture and found it > sound *and* can show actual benefit, simpler is better. > > Therefore provide ticket locks, which depend on a single atomic > operation (fetch_add) while still providing fairness. > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> > --- > include/asm-generic/qspinlock.h | 30 +++++++++ > include/asm-generic/ticket_lock_types.h | 11 +++ > include/asm-generic/ticket_lock.h | 97 ++++++++++++++++++++++++++++++++ > 3 files changed, 138 insertions(+) Huh. I looked quite closely at this a while back but seems like I forgot to actually reply here. So, given that it doesn't seem to be in linux-next yet: Acked-by: Will Deacon <will@kernel.org> Will
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -2,6 +2,36 @@
 /*
  * Queued spinlock
  *
+ * A 'generic' spinlock implementation that is based on MCS locks. An
+ * architecture that's looking for a 'generic' spinlock, please first consider
+ * ticket_lock.h and only come looking here when you've considered all the
+ * constraints below and can show your hardware does actually perform better
+ * with qspinlock.
+ *
+ *
+ * It relies on smp_store_release() + atomic_*_acquire() to be RCsc (or no
+ * weaker than RCtso if you're Power, also see smp_mb__after_unlock_lock()),
+ *
+ * It relies on a far greater (compared to ticket_lock.h) set of atomic
+ * operations to behave well together, please audit them carefully to ensure
+ * they all have forward progress. Many atomic operations may default to
+ * cmpxchg() loops which will not have good forward progress properties on
+ * LL/SC architectures.
+ *
+ * One notable example is atomic_fetch_or_acquire(), which x86 cannot (cheaply)
+ * do. Carefully read the patches that introduced
+ * queued_fetch_set_pending_acquire().
+ *
+ * It also heavily relies on mixed size atomic operations, in specific it
+ * requires architectures to have xchg16; something which many LL/SC
+ * architectures need to implement as a 32bit and+or in order to satisfy the
+ * forward progress guarantees mentioned above.
+ *
+ * Further reading on mixed size atomics that might be relevant:
+ *
+ *   http://www.cl.cam.ac.uk/~pes20/popl17/mixed-size.pdf
+ *
+ *
  * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
  * (C) Copyright 2015 Hewlett-Packard Enterprise Development LP
  *
--- /dev/null
+++ b/include/asm-generic/ticket_lock_types.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __ASM_GENERIC_TICKET_LOCK_TYPES_H
+#define __ASM_GENERIC_TICKET_LOCK_TYPES_H
+
+#include <linux/types.h>
+typedef atomic_t arch_spinlock_t;
+
+#define __ARCH_SPIN_LOCK_UNLOCKED	ATOMIC_INIT(0)
+
+#endif /* __ASM_GENERIC_TICKET_LOCK_TYPES_H */
--- /dev/null
+++ b/include/asm-generic/ticket_lock.h
@@ -0,0 +1,97 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * 'Generic' ticket lock implementation.
+ *
+ * It relies on atomic_fetch_add() having well defined forward progress
+ * guarantees under contention. If your architecture cannot provide this, stick
+ * to a test-and-set lock.
+ *
+ * It also relies on atomic_fetch_add() being safe vs smp_store_release() on a
+ * sub-word of the value. This is generally true for anything LL/SC although
+ * you'd be hard pressed to find anything useful in architecture specifications
+ * about this. If your architecture cannot do this you might be better off with
+ * a test-and-set.
+ *
+ * It relies on smp_store_release() + atomic_*_acquire() to be RCsc (or no
+ * weaker than RCtso if you're Power, also see smp_mb__after_unlock_lock()),
+ *
+ * The implementation uses smp_cond_load_acquire() to spin, so if the
+ * architecture has WFE like instructions to sleep instead of poll for word
+ * modifications be sure to implement that (see ARM64 for example).
+ *
+ */
+
+#ifndef __ASM_GENERIC_TICKET_LOCK_H
+#define __ASM_GENERIC_TICKET_LOCK_H
+
+#include <linux/atomic.h>
+#include <asm/ticket_lock_types.h>
+
+#define ONE_TICKET	(1 << 16)
+#define __ticket(val)	(u16)((val) >> 16)
+#define __owner(val)	(u16)((val) & 0xffff)
+
+static __always_inline bool __ticket_is_locked(u32 val)
+{
+	return __ticket(val) != __owner(val);
+}
+
+static __always_inline void ticket_lock(arch_spinlock_t *lock)
+{
+	u32 val = atomic_fetch_add_acquire(ONE_TICKET, lock);
+	u16 ticket = __ticket(val);
+
+	if (ticket == __owner(val))
+		return;
+
+	atomic_cond_read_acquire(lock, ticket == __owner(VAL));
+}
+
+static __always_inline bool ticket_trylock(arch_spinlock_t *lock)
+{
+	u32 old = atomic_read(lock);
+
+	if (__ticket_is_locked(old))
+		return false;
+
+	return atomic_try_cmpxchg_acquire(lock, &old, old + ONE_TICKET);
+}
+
+static __always_inline void ticket_unlock(arch_spinlock_t *lock)
+{
+	u16 *ptr = (u16 *)lock + __is_defined(__BIG_ENDIAN);
+	u32 val = atomic_read(lock);
+
+	smp_store_release(ptr, __owner(val) + 1);
+}
+
+static __always_inline int ticket_is_contended(arch_spinlock_t *lock)
+{
+	u32 val = atomic_read(lock);
+
+	return (__ticket(val) - __owner(val)) > 1;
+}
+
+static __always_inline int ticket_is_locked(arch_spinlock_t *lock)
+{
+	return __ticket_is_locked(atomic_read(lock));
+}
+
+static __always_inline int ticket_value_unlocked(arch_spinlock_t lock)
+{
+	return !__ticket_is_locked(lock.counter);
+}
+
+#undef __owner
+#undef __ticket
+#undef ONE_TICKET
+
+#define arch_spin_lock(l)		ticket_lock(l)
+#define arch_spin_trylock(l)		ticket_trylock(l)
+#define arch_spin_unlock(l)		ticket_unlock(l)
+#define arch_spin_is_locked(l)		ticket_is_locked(l)
+#define arch_spin_is_contended(l)	ticket_is_contended(l)
+#define arch_spin_value_unlocked(l)	ticket_value_unlocked(l)
+
+#endif /* __ASM_GENERIC_TICKET_LOCK_H */
There's currently a number of architectures that want/have graduated
from test-and-set locks and are looking at qspinlock.

*HOWEVER* qspinlock is very complicated and requires a lot of an
architecture to actually work correctly. Specifically it requires
forward progress between a fair number of atomic primitives, including
an xchg16 operation, which I've seen a fair number of fundamentally
broken implementations of in the tree (specifically for qspinlock no
less).

The benefit of qspinlock over ticket lock is also non-obvious, esp.
at low contention (the vast majority of cases in the kernel), and it
takes a fairly large number of CPUs (typically also NUMA) to make
qspinlock beat ticket locks.

Esp. things like ARM64's WFE can move the balance a lot in favour of
simpler locks by reducing the cacheline pressure due to waiters (see
their smp_cond_load_acquire() implementation for details).

Unless you've audited qspinlock for your architecture and found it
sound *and* can show actual benefit, simpler is better.

Therefore provide ticket locks, which depend on a single atomic
operation (fetch_add) while still providing fairness.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/asm-generic/qspinlock.h         | 30 +++++++++
 include/asm-generic/ticket_lock_types.h | 11 +++
 include/asm-generic/ticket_lock.h       | 97 ++++++++++++++++++++++++++++++++
 3 files changed, 138 insertions(+)
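[Closing illustration of the encoding the patch uses: the next ticket lives in the high 16 bits and the current owner in the low 16 bits of a single 32-bit word. This is a stand-alone userspace sketch of the value transitions only; it deliberately elides the atomics and the spinning.]

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t lock = 0;                       /* __ARCH_SPIN_LOCK_UNLOCKED */

	uint32_t cpu0 = lock; lock += 1u << 16;  /* CPU0: fetch_add(ONE_TICKET) */
	uint32_t cpu1 = lock; lock += 1u << 16;  /* CPU1: fetch_add(ONE_TICKET) */

	/* CPU0's ticket equals the owner field: it takes the lock immediately. */
	printf("CPU0: ticket=%u owner=%u\n", cpu0 >> 16, cpu0 & 0xffff);
	/* CPU1's ticket is ahead of the owner field: it spins. */
	printf("CPU1: ticket=%u owner=%u\n", cpu1 >> 16, cpu1 & 0xffff);

	lock += 1;                               /* CPU0 unlock: owner becomes 1 */
	printf("after unlock: owner=%u, CPU1 now proceeds\n", lock & 0xffff);
	return 0;
}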