From patchwork Wed May 7 15:01:33 2014 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Waiman Long X-Patchwork-Id: 4129391 Return-Path: X-Original-To: patchwork-kvm@patchwork.kernel.org Delivered-To: patchwork-parsemail@patchwork2.web.kernel.org Received: from mail.kernel.org (mail.kernel.org [198.145.19.201]) by patchwork2.web.kernel.org (Postfix) with ESMTP id 22ACBBFF02 for ; Wed, 7 May 2014 15:08:14 +0000 (UTC) Received: from mail.kernel.org (localhost [127.0.0.1]) by mail.kernel.org (Postfix) with ESMTP id 19E5420179 for ; Wed, 7 May 2014 15:08:13 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 0525F20176 for ; Wed, 7 May 2014 15:08:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933056AbaEGPCq (ORCPT ); Wed, 7 May 2014 11:02:46 -0400 Received: from g4t3425.houston.hp.com ([15.201.208.53]:32257 "EHLO g4t3425.houston.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932899AbaEGPCn (ORCPT ); Wed, 7 May 2014 11:02:43 -0400 Received: from g9t2301.houston.hp.com (g9t2301.houston.hp.com [16.216.185.78]) by g4t3425.houston.hp.com (Postfix) with ESMTP id A0EC919F; Wed, 7 May 2014 15:02:39 +0000 (UTC) Received: from RHEL65.localdomain (unknown [16.99.26.141]) by g9t2301.houston.hp.com (Postfix) with ESMTP id E387C6C; Wed, 7 May 2014 15:02:37 +0000 (UTC) From: Waiman Long To: Thomas Gleixner , Ingo Molnar , "H. Peter Anvin" , Peter Zijlstra Cc: linux-arch@vger.kernel.org, x86@kernel.org, linux-kernel@vger.kernel.org, virtualization@lists.linux-foundation.org, xen-devel@lists.xenproject.org, kvm@vger.kernel.org, Paolo Bonzini , Konrad Rzeszutek Wilk , Boris Ostrovsky , "Paul E. McKenney" , Rik van Riel , Linus Torvalds , Raghavendra K T , David Vrabel , Oleg Nesterov , Gleb Natapov , Scott J Norton , Chegu Vinod , Peter Zijlstra , Waiman Long Subject: [PATCH v10 05/19] qspinlock: Optimize for smaller NR_CPUS Date: Wed, 7 May 2014 11:01:33 -0400 Message-Id: <1399474907-22206-6-git-send-email-Waiman.Long@hp.com> X-Mailer: git-send-email 1.7.1 In-Reply-To: <1399474907-22206-1-git-send-email-Waiman.Long@hp.com> References: <1399474907-22206-1-git-send-email-Waiman.Long@hp.com> Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Spam-Status: No, score=-7.5 required=5.0 tests=BAYES_00, RCVD_IN_DNSWL_HI, RP_MATCHES_RCVD, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Peter Zijlstra When we allow for a max NR_CPUS < 2^14 we can optimize the pending wait-acquire and the xchg_tail() operations. By growing the pending bit to a byte, we reduce the tail to 16bit. This means we can use xchg16 for the tail part and do away with all the repeated compxchg() operations. This in turn allows us to unconditionally acquire; the locked state as observed by the wait loops cannot change. And because both locked and pending are now a full byte we can use simple stores for the state transition, obviating one atomic operation entirely. All this is horribly broken on Alpha pre EV56 (and any other arch that cannot do single-copy atomic byte stores). Signed-off-by: Peter Zijlstra Signed-off-by: Waiman Long --- include/asm-generic/qspinlock_types.h | 13 ++++ kernel/locking/qspinlock.c | 107 +++++++++++++++++++++++++++++--- 2 files changed, 110 insertions(+), 10 deletions(-) diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h index ed5d89a..4914abe 100644 --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -38,6 +38,14 @@ typedef struct qspinlock { /* * Bitfields in the atomic value: * + * When NR_CPUS < 16K + * 0- 7: locked byte + * 8: pending + * 9-15: not used + * 16-17: tail index + * 18-31: tail cpu (+1) + * + * When NR_CPUS >= 16K * 0- 7: locked byte * 8: pending * 9-10: tail index @@ -50,7 +58,11 @@ typedef struct qspinlock { #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED) #define _Q_PENDING_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#if CONFIG_NR_CPUS < (1U << 14) +#define _Q_PENDING_BITS 8 +#else #define _Q_PENDING_BITS 1 +#endif #define _Q_PENDING_MASK _Q_SET_MASK(PENDING) #define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS) @@ -61,6 +73,7 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET) #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) +#define _Q_TAIL_OFFSET _Q_TAIL_IDX_OFFSET #define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK) #define _Q_LOCKED_VAL (1U << _Q_LOCKED_OFFSET) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index a49b82b..3e908f7 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -22,6 +22,7 @@ #include #include #include +#include #include /* @@ -48,6 +49,9 @@ * We can further change the first spinner to spin on a bit in the lock word * instead of its node; whereby avoiding the need to carry a node from lock to * unlock, and preserving API. + * + * N.B. The current implementation only supports architectures that allow + * atomic operations on smaller 8-bit and 16-bit data types. */ #include "mcs_spinlock.h" @@ -85,6 +89,87 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) +/* + * By using the whole 2nd least significant byte for the pending bit, we + * can allow better optimization of the lock acquisition for the pending + * bit holder. + */ +#if _Q_PENDING_BITS == 8 + +struct __qspinlock { + union { + atomic_t val; + struct { +#ifdef __LITTLE_ENDIAN + u16 locked_pending; + u16 tail; +#else + u16 tail; + u16 locked_pending; +#endif + }; + }; +}; + +/** + * clear_pending_set_locked - take ownership and clear the pending bit. + * @lock: Pointer to queue spinlock structure + * @val : Current value of the queue spinlock 32-bit word + * + * *,1,0 -> *,0,1 + */ +static __always_inline void +clear_pending_set_locked(struct qspinlock *lock, u32 val) +{ + struct __qspinlock *l = (void *)lock; + + ACCESS_ONCE(l->locked_pending) = _Q_LOCKED_VAL; +} + +/* + * xchg_tail - Put in the new queue tail code word & retrieve previous one + * @lock : Pointer to queue spinlock structure + * @tail : The new queue tail code word + * @pval : Pointer to current value of the queue spinlock 32-bit word + * Return: The previous queue tail code word + * + * xchg(lock, tail) + * + * p,*,* -> n,*,* ; prev = xchg(lock, node) + */ +static __always_inline u32 +xchg_tail(struct qspinlock *lock, u32 tail, u32 *pval) +{ + struct __qspinlock *l = (void *)lock; + + return (u32)xchg(&l->tail, tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET; +} + +#else /* _Q_PENDING_BITS == 8 */ + +/** + * clear_pending_set_locked - take ownership and clear the pending bit. + * @lock: Pointer to queue spinlock structure + * @val : Current value of the queue spinlock 32-bit word + * + * *,1,0 -> *,0,1 + */ +static __always_inline void +clear_pending_set_locked(struct qspinlock *lock, u32 val) +{ + u32 new, old; + + for (;;) { + new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL; + + old = atomic_cmpxchg(&lock->val, val, new); + if (old == val) + break; + + val = old; + } +} + /** * xchg_tail - Put in the new queue tail code word & retrieve previous one * @lock : Pointer to queue spinlock structure @@ -112,12 +197,17 @@ xchg_tail(struct qspinlock *lock, u32 tail, u32 *pval) *pval = new; return old; } +#endif /* _Q_PENDING_BITS == 8 */ /** * trylock_pending - try to acquire queue spinlock using the pending bit * @lock : Pointer to queue spinlock structure * @pval : Pointer to value of the queue spinlock 32-bit word * Return: 1 if lock acquired, 0 otherwise + * + * The pending bit won't be set as soon as one or more tasks queue up. + * This function should only be called when lock stealing will not happen. + * Otherwise, it has to be disabled. */ static inline int trylock_pending(struct qspinlock *lock, u32 *pval) { @@ -157,8 +247,13 @@ static inline int trylock_pending(struct qspinlock *lock, u32 *pval) * we're pending, wait for the owner to go away. * * *,1,1 -> *,1,0 + * + * this wait loop must be a load-acquire such that we match the + * store-release that clears the locked bit and create lock + * sequentiality; this because not all clear_pending_set_locked() + * implementations imply full barriers. */ - while ((val = atomic_read(&lock->val)) & _Q_LOCKED_MASK) + while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_MASK) arch_mutex_cpu_relax(); /* @@ -166,15 +261,7 @@ static inline int trylock_pending(struct qspinlock *lock, u32 *pval) * * *,1,0 -> *,0,1 */ - for (;;) { - new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL; - - old = atomic_cmpxchg(&lock->val, val, new); - if (old == val) - break; - - val = old; - } + clear_pending_set_locked(lock, val); return 1; }