Message ID: 20140618155730.GA5107@laptop.dumpdata.com (mailing list archive)
State: New, archived
On Wed, Jun 18, 2014 at 11:57:30AM -0400, Konrad Rzeszutek Wilk wrote:
> On Sun, Jun 15, 2014 at 02:47:02PM +0200, Peter Zijlstra wrote:
> > From: Peter Zijlstra <peterz@infradead.org>
> >
> > When we allow for a max NR_CPUS < 2^14 we can optimize the pending
> > wait-acquire and the xchg_tail() operations.
> >
> > By growing the pending bit to a byte, we reduce the tail to 16bit.
> > This means we can use xchg16 for the tail part and do away with all
> > the repeated cmpxchg() operations.
> >
> > This in turn allows us to unconditionally acquire; the locked state
> > as observed by the wait loops cannot change. And because both locked
> > and pending are now a full byte we can use simple stores for the
> > state transition, obviating one atomic operation entirely.
>
> I have to ask - how much more performance do you get from this?
>
> Is this extra atomic operation hurting that much?

It's not extra, it's a cmpxchg loop vs an unconditional xchg. And yes,
it's somewhat tedious to show, but on 4 socket systems you can really
see it make a difference.

I'll try and run some numbers, I need to reinstall the box. (There were
numbers in the previous threads, but you're right, I should've put some
in the Changelog.)

> >  /**
> >   * queue_spin_lock_slowpath - acquire the queue spinlock
> > @@ -173,8 +259,13 @@ void queue_spin_lock_slowpath(struct qsp
> >  	 * we're pending, wait for the owner to go away.
> >  	 *
> >  	 * *,1,1 -> *,1,0
> > +	 *
> > +	 * this wait loop must be a load-acquire such that we match the
> > +	 * store-release that clears the locked bit and create lock
> > +	 * sequentiality; this because not all clear_pending_set_locked()
> > +	 * implementations imply full barriers.
> >  	 */
> > -	while ((val = atomic_read(&lock->val)) & _Q_LOCKED_MASK)
> > +	while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_MASK)
>
> lock->val.counter? Ugh, all to deal with the 'int' -> 'u32' (or 'u64')

No, to do atomic_t -> int.

> Could you introduce a macro in atomic.h called 'atomic_read_raw' which
> would do this? Like this:

That would be worse I think. It looks like a function returning an
rvalue whereas we really want an lvalue.
diff --git a/include/linux/atomic.h b/include/linux/atomic.h
index fef3a80..5a83750 100644
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -160,6 +160,8 @@ static inline void atomic_or(int i, atomic_t *v)
 }
 #endif /* #ifndef CONFIG_ARCH_HAS_ATOMIC_OR */
 
+#define atomic_read_raw(v)	(v.counter)
+
 #include <asm-generic/atomic-long.h>
 #ifdef CONFIG_GENERIC_ATOMIC64
 #include <asm-generic/atomic64.h>
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index fc7fd8c..2833fe1 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -265,7 +265,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * sequentiality; this because not all clear_pending_set_locked()
 	 * implementations imply full barriers.
 	 */
-	while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_MASK)
+	while ((val = smp_load_acquire(atomic_read_raw(&lock->val))) & _Q_LOCKED_MASK)
 		arch_mutex_cpu_relax();
 
 	/*
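
To make the "cmpxchg loop vs an unconditional xchg" point concrete, here is a
minimal user-space sketch of the two xchg_tail() flavours being discussed,
written with C11 atomics rather than the kernel's primitives. The union
layout, mask and function names below are illustrative assumptions, not the
kernel's definitions, and the mixed-size atomic accesses rely on the same
hardware property the patch itself depends on (coherent sub-word atomics on a
little-endian machine).

	#include <stdatomic.h>
	#include <stdint.h>

	/*
	 * Illustrative 32-bit lock word (little-endian assumed):
	 * byte 0 = locked, byte 1 = pending, bytes 2-3 = tail.
	 */
	union lockword {
		_Atomic uint32_t val;
		struct {
			uint8_t locked;			/* byte 0 */
			uint8_t pending;		/* byte 1 */
			_Atomic uint16_t tail;		/* bytes 2-3 */
		} b;
	};

	#define LOCKED_PENDING_MASK	0x0000ffffu	/* low 16 bits */
	#define TAIL_SHIFT		16

	/* Generic variant: must preserve locked+pending, so it retries the
	 * cmpxchg until no other CPU modified the word under it. */
	static uint32_t xchg_tail_cmpxchg(union lockword *lock, uint32_t tail)
	{
		uint32_t old = atomic_load(&lock->val);
		uint32_t new;

		do {
			new = (old & LOCKED_PENDING_MASK) | tail;
		} while (!atomic_compare_exchange_weak(&lock->val, &old, new));

		return old;
	}

	/* NR_CPUS < 2^14 variant: the tail owns its own half-word, so one
	 * unconditional 16-bit exchange replaces the whole retry loop. */
	static uint32_t xchg_tail_xchg16(union lockword *lock, uint32_t tail)
	{
		uint16_t old = atomic_exchange(&lock->b.tail,
					       (uint16_t)(tail >> TAIL_SHIFT));
		return (uint32_t)old << TAIL_SHIFT;
	}

Under contention the cmpxchg loop can fail and retry repeatedly, which is the
cost Peter says shows up on large 4-socket boxes; the xchg16 variant always
completes in a single atomic operation.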
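The quoted wait-loop comment can be sketched the same way. This is only an
illustration of the acquire/release pairing, assuming a stand-alone locked
byte and hypothetical unlock_byte()/wait_then_take_over() helpers; the real
slowpath clears pending and sets locked together via
clear_pending_set_locked().

	#include <stdatomic.h>
	#include <stdint.h>

	static _Atomic uint8_t locked;	/* stand-in for the locked byte */

	/* Owner side: the release store makes every write done inside the
	 * critical section visible before the locked byte reads as zero. */
	static void unlock_byte(void)
	{
		atomic_store_explicit(&locked, 0, memory_order_release);
	}

	/* Pending-waiter side: spinning with acquire loads pairs with the
	 * release store above, so once locked == 0 is observed, so are the
	 * previous owner's writes.  Because the pending waiter is the only
	 * CPU entitled to the lock at this point, a plain store suffices to
	 * take it over -- the "simple stores for the state transition"
	 * mentioned in the changelog. */
	static void wait_then_take_over(void)
	{
		while (atomic_load_explicit(&locked, memory_order_acquire))
			;	/* arch_mutex_cpu_relax() in the quoted code */

		atomic_store_explicit(&locked, 1, memory_order_relaxed);
	}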