new file mode 100644
@@ -0,0 +1,122 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_H
+#define __ASM_GENERIC_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+/*
+ * External function declarations
+ */
+extern void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval);
+
+/**
+ * queue_spin_is_locked - is the spinlock locked?
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queue_spin_is_locked(struct qspinlock *lock)
+{
+ return atomic_read(&lock->qlcode) & _QLOCK_LOCK_MASK;
+}
+
+/**
+ * queue_spin_value_unlocked - is the spinlock structure unlocked?
+ * @lock: queue spinlock structure
+ * Return: 1 if it is unlocked, 0 otherwise
+ */
+static __always_inline int queue_spin_value_unlocked(struct qspinlock lock)
+{
+ return !(atomic_read(&lock.qlcode) & _QLOCK_LOCK_MASK);
+}
+
+/**
+ * queue_spin_is_contended - check if the lock is contended
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static __always_inline int queue_spin_is_contended(struct qspinlock *lock)
+{
+ return atomic_read(&lock->qlcode) & ~_QLOCK_LOCK_MASK;
+}
+
+/**
+ * queue_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock(struct qspinlock *lock)
+{
+ if (!atomic_read(&lock->qlcode) &&
+ (atomic_cmpxchg(&lock->qlcode, 0, _QLOCK_LOCKED) == 0))
+ return 1;
+ return 0;
+}
+
+/**
+ * queue_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock(struct qspinlock *lock)
+{
+ int qsval;
+
+ /*
+ * To reduce memory accesses to a single one in the cold cache case,
+ * a direct cmpxchg() is performed in the fastpath to optimize the
+ * uncontended case. The contended performance, however, may suffer
+ * a bit because of that.
+ */
+ qsval = atomic_cmpxchg(&lock->qlcode, 0, _QLOCK_LOCKED);
+ if (likely(qsval == 0))
+ return;
+ queue_spin_lock_slowpath(lock, qsval);
+}
+
+#ifndef queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_unlock(struct qspinlock *lock)
+{
+ /*
+ * Use an atomic subtraction to clear the lock bit.
+ */
+ smp_mb__before_atomic_dec();
+ atomic_sub(_QLOCK_LOCKED, &lock->qlcode);
+}
+#endif
+
+/*
+ * Initializer
+ */
+#define __ARCH_SPIN_LOCK_UNLOCKED { ATOMIC_INIT(0) }
+
+/*
+ * Remapping spinlock architecture specific functions to the corresponding
+ * queue spinlock functions.
+ */
+#define arch_spin_is_locked(l) queue_spin_is_locked(l)
+#define arch_spin_is_contended(l) queue_spin_is_contended(l)
+#define arch_spin_value_unlocked(l) queue_spin_value_unlocked(l)
+#define arch_spin_lock(l) queue_spin_lock(l)
+#define arch_spin_trylock(l) queue_spin_trylock(l)
+#define arch_spin_unlock(l) queue_spin_unlock(l)
+#define arch_spin_lock_flags(l, f) queue_spin_lock(l)
+
+#endif /* __ASM_GENERIC_QSPINLOCK_H */
new file mode 100644
@@ -0,0 +1,49 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_TYPES_H
+#define __ASM_GENERIC_QSPINLOCK_TYPES_H
+
+/*
+ * Including atomic.h with PARAVIRT on will cause compilation errors because
+ * of recursive header file inclusion via paravirt_types.h. A workaround is
+ * to include paravirt_types.h here in this case.
+ */
+#ifdef CONFIG_PARAVIRT
+# include <asm/paravirt_types.h>
+#else
+# include <linux/types.h>
+# include <linux/atomic.h>
+#endif
+
+/*
+ * The queue spinlock data structure - a 32-bit word
+ *
+ * The bit assignments are:
+ * Bit 0 : Set if locked
+ * Bits 1-7 : Not used
+ * Bits 8-31: Queue code
+ */
+typedef struct qspinlock {
+ atomic_t qlcode; /* Lock + queue code */
+} arch_spinlock_t;
+
+#define _QCODE_OFFSET 8
+#define _QLOCK_LOCKED 1U
+#define _QLOCK_LOCK_MASK 0xff
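+
+/*
+ * For illustration: a lock value of 0 means unlocked with an empty queue,
+ * a value of 1 (only _QLOCK_LOCKED set) means locked with no CPU queued,
+ * and any non-zero bits above bit 7 mean that at least one CPU is queued
+ * up waiting for the lock.
+ */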
+
+#endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
@@ -223,3 +223,10 @@ endif
config MUTEX_SPIN_ON_OWNER
def_bool y
depends on SMP && !DEBUG_MUTEXES
+
+config ARCH_USE_QUEUE_SPINLOCK
+ bool
+
+config QUEUE_SPINLOCK
+ def_bool y if ARCH_USE_QUEUE_SPINLOCK
+ depends on SMP && !PARAVIRT_SPINLOCKS
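+
+# ARCH_USE_QUEUE_SPINLOCK has no prompt, so an architecture would opt in
+# by adding "select ARCH_USE_QUEUE_SPINLOCK" to its own Kconfig entry;
+# that arch-side change is not shown here.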
@@ -15,6 +15,7 @@ obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
endif
obj-$(CONFIG_SMP) += spinlock.o
obj-$(CONFIG_PROVE_LOCKING) += spinlock.o
+obj-$(CONFIG_QUEUE_SPINLOCK) += qspinlock.o
obj-$(CONFIG_RT_MUTEXES) += rtmutex.o
obj-$(CONFIG_DEBUG_RT_MUTEXES) += rtmutex-debug.o
obj-$(CONFIG_RT_MUTEX_TESTER) += rtmutex-tester.o
new file mode 100644
@@ -0,0 +1,371 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#include <linux/smp.h>
+#include <linux/bug.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
+#include <linux/mutex.h>
+#include <linux/spinlock.h>
+
+/*
+ * The basic principle of a queue-based spinlock can best be understood
+ * by studying a classic queue-based spinlock implementation called the
+ * MCS lock. The paper below provides a good description for this kind
+ * of lock.
+ *
+ * http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf
+ *
+ * This queue spinlock implementation is based on the MCS lock with twists
+ * to make it fit the following constraints:
+ * 1. A max spinlock size of 4 bytes
+ * 2. Good fastpath performance
+ * 3. No change in the locking APIs
+ *
+ * The queue spinlock fastpath is as simple as it can get; all the heavy
+ * lifting is done in the lock slowpath. The main idea behind this queue
+ * spinlock implementation is to keep the spinlock size at 4 bytes while
+ * at the same time implement a queue structure to queue up the waiting
+ * lock spinners.
+ *
+ * Since preemption is disabled before getting the lock, a given CPU will
+ * only need to use one queue node structure in a non-interrupt context.
+ * A percpu queue node structure will be allocated for this purpose and the
+ * cpu number will be put into the queue spinlock structure to indicate the
+ * tail of the queue.
+ *
+ * To handle spinlock acquisition at interrupt context (softirq or hardirq),
+ * the queue node structure is actually an array for supporting nested spin
+ * locking operations in interrupt handlers. If all the entries in the
+ * array are used up, the code hits a BUG() in get_qnode(), as that
+ * should never happen in normal circumstances.
+ */
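+
+/*
+ * For reference, the acquire side of a classic MCS lock looks roughly like
+ * the illustrative pseudo-C sketch below (not part of this implementation;
+ * the names are made up for the example):
+ *
+ *	node->next = NULL;
+ *	node->locked = 0;
+ *	prev = xchg(&lock->tail, node);		// join the queue tail
+ *	if (prev) {				// somebody was ahead of us
+ *		prev->next = node;
+ *		while (!node->locked)		// spin on our own node only
+ *			cpu_relax();
+ *	}
+ *
+ * The code below follows the same idea, but instead of storing a node
+ * pointer in the lock word it encodes the tail CPU number and per-cpu
+ * node index into the upper bits of a single 32-bit word.
+ */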
+
+/*
+ * The 24-bit queue node code is divided into the following 2 fields:
+ * Bits 0-1 : queue node index (4 nodes)
+ * Bits 2-23: CPU number + 1 (4M - 1 CPUs)
+ *
+ * A queue node code of 0 indicates that no one is waiting for the lock.
+ * Consequently, the value 0 cannot be used as a valid CPU number; we need
+ * to add 1 to the CPU number before putting it into the queue code.
+ */
+#define MAX_QNODES 4
+#ifndef _QCODE_VAL_OFFSET
+#define _QCODE_VAL_OFFSET _QCODE_OFFSET
+#endif
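+
+/*
+ * Worked example (illustrative only): with _QCODE_VAL_OFFSET = 8, a waiter
+ * on CPU 5 using per-cpu queue node index 1 is encoded as
+ *
+ *	((5 + 1) << 10) | (1 << 8) = 0x1900
+ *
+ * in the lock word. Decoding reverses the shifts: cpu = (0x1900 >> 10) - 1
+ * = 5 and index = (0x1900 >> 8) & 3 = 1. See queue_encode_qcode() and
+ * xlate_qcode() below.
+ */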
+
+/*
+ * Function exit status
+ */
+enum exitval {
+ NORMAL_EXIT = 0,
+ NOTIFY_NEXT, /* Notify the next waiting node CPU */
+ RELEASE_NODE /* Release current node directly */
+};
+
+/*
+ * The queue node structure
+ *
+ * This structure is essentially the same as the mcs_spinlock structure
+ * in mcs_spinlock.h file. It is retained for future extension where new
+ * fields may be added.
+ */
+struct qnode {
+ u32 qhead; /* Queue head flag */
+ struct qnode *next; /* Next queue node addr */
+};
+
+struct qnode_set {
+ struct qnode nodes[MAX_QNODES];
+ int node_idx; /* Current node to use */
+};
+
+/*
+ * Per-CPU queue node structures
+ */
+static DEFINE_PER_CPU_ALIGNED(struct qnode_set, qnset) = { { { 0 } }, 0 };
+
+/*
+ ************************************************************************
+ * Inline functions used by the queue_spin_lock_slowpath() function *
+ * that may get superseded by a more optimized version. *
+ ************************************************************************
+ */
+
+#ifndef __queue_spin_trylock
+/**
+ * __queue_spin_trylock - try to acquire the lock by setting the lock bit
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if lock bit set successfully, 0 if failed
+ *
+ * This is an unfair version of the trylock which should only be called
+ * by a caller who is entitled to acquire the lock.
+ */
+static __always_inline int __queue_spin_trylock(struct qspinlock *lock)
+{
+ int qlcode = atomic_read(&lock->qlcode);
+
+ if (!(qlcode & _QLOCK_LOCKED) && (atomic_cmpxchg(&lock->qlcode,
+ qlcode, qlcode|_QLOCK_LOCKED) == qlcode))
+ return 1;
+ return 0;
+}
+#endif /* __queue_spin_trylock */
+
+#ifndef qsval_to_qcode
+/**
+ * qsval_to_qcode - Convert a queue spinlock value to a queue code
+ * @qsval : Queue spinlock value
+ * Return : The corresponding queue code value
+ */
+static inline u32
+qsval_to_qcode(int qsval)
+{
+ return (u32)(qsval & ~_QLOCK_LOCK_MASK);
+}
+#endif /* qsval_to_qcode */
+
+#ifndef queue_spin_trylock_and_clr_qcode
+/**
+ * queue_spin_trylock_and_clr_qcode - Try to lock & clear qcode simultaneously
+ * @lock : Pointer to queue spinlock structure
+ * @qcode: The supposedly current qcode value
+ * Return: true if successful, false otherwise
+ */
+static inline int
+queue_spin_trylock_and_clr_qcode(struct qspinlock *lock, u32 qcode)
+{
+ return atomic_cmpxchg(&lock->qlcode, qcode, _QLOCK_LOCKED) == qcode;
+}
+#endif /* queue_spin_trylock_and_clr_qcode */
+
+#ifndef queue_encode_qcode
+/**
+ * queue_encode_qcode - Encode the CPU number & node index into a qnode code
+ * @cpu_nr: CPU number
+ * @qn_idx: Queue node index
+ * Return : A qnode code that can be saved into the qspinlock structure
+ */
+static inline u32 queue_encode_qcode(u32 cpu_nr, u8 qn_idx)
+{
+ return ((cpu_nr + 1) << (_QCODE_VAL_OFFSET + 2)) |
+ (qn_idx << _QCODE_VAL_OFFSET);
+}
+#endif /* queue_encode_qcode */
+
+#ifndef queue_code_xchg
+/**
+ * queue_code_xchg - exchange a queue code value
+ * @lock : Pointer to queue spinlock structure
+ * @ocode: Old queue code in the lock [OUT]
+ * @ncode: New queue code to be exchanged
+ * Return: An enum exitval value
+ */
+static inline enum exitval
+queue_code_xchg(struct qspinlock *lock, u32 *ocode, u32 ncode)
+{
+ ncode |= _QLOCK_LOCKED; /* Set lock bit */
+
+ /*
+ * Exchange current copy of the queue node code
+ */
+ *ocode = atomic_xchg(&lock->qlcode, ncode);
+
+ if (likely(*ocode & _QLOCK_LOCKED)) {
+ *ocode &= ~_QLOCK_LOCKED; /* Clear the lock bit */
+ return NORMAL_EXIT;
+ }
+ /*
+ * It is possible that we may accidentally steal the lock during
+ * the unlock-lock transition. If this is the case, we need to either
+ * release it if we are not at the head of the queue, or keep the lock
+ * and be done with it.
+ */
+ if (*ocode == 0) {
+ u32 qcode;
+
+ /*
+ * Got the lock since it is at the head of the queue
+ * Now try to atomically clear the queue code.
+ */
+ qcode = atomic_cmpxchg(&lock->qlcode, ncode, _QLOCK_LOCKED);
+ /*
+ * The cmpxchg fails only if one or more tasks are added to
+ * the queue. In this case, NOTIFY_NEXT is returned instead
+ * of RELEASE_NODE.
+ */
+ return (qcode != ncode) ? NOTIFY_NEXT : RELEASE_NODE;
+ }
+ /*
+ * The lock was accidentally stolen; release it and let the queue
+ * head get it.
+ */
+ queue_spin_unlock(lock);
+ return NORMAL_EXIT;
+}
+#endif /* queue_code_xchg */
+
+/*
+ ************************************************************************
+ * Other inline functions needed by the queue_spin_lock_slowpath() *
+ * function. *
+ ************************************************************************
+ */
+
+/**
+ * xlate_qcode - translate the queue code into the queue node address
+ * @qcode: Queue code to be translated
+ * Return: The corresponding queue node address
+ */
+static inline struct qnode *xlate_qcode(u32 qcode)
+{
+ u32 cpu_nr = (qcode >> (_QCODE_VAL_OFFSET + 2)) - 1;
+ u8 qn_idx = (qcode >> _QCODE_VAL_OFFSET) & 3;
+
+ return per_cpu_ptr(&qnset.nodes[qn_idx], cpu_nr);
+}
+
+/**
+ * get_qnode - Get a queue node address as well as the queue code
+ * @cpu : CPU number
+ * @qcode : Pointer to queue code value [out]
+ * Return : queue node address & queue code in qcode
+ */
+static inline struct qnode *get_qnode(int cpu, u32 *qcode)
+{
+ struct qnode_set *qset = this_cpu_ptr(&qnset);
+ int qn_idx = qset->node_idx++;
+
+ /*
+ * It should never happen that all the queue nodes are being used.
+ */
+ BUG_ON(qn_idx >= MAX_QNODES);
+ *qcode = queue_encode_qcode(cpu, qn_idx);
+ return qset->nodes + qn_idx;
+}
+
+/**
+ * put_qnode - Return a queue node to the pool
+ */
+static inline void put_qnode(void)
+{
+ this_cpu_dec(qnset.node_idx);
+}
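+
+/*
+ * Illustrative nesting example: a task contending for a qspinlock in
+ * process context uses node index 0 of its CPU. If a softirq interrupts
+ * the spin and contends for another qspinlock, it gets node index 1, and
+ * a hardirq on top of that would get index 2. Each context calls
+ * put_qnode() when it is done with its node, releasing the indexes in
+ * the reverse order they were taken.
+ */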
+
+/**
+ * queue_spin_lock_slowpath - acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * @qsval: Current value of the queue spinlock 32-bit word
+ */
+void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
+{
+ unsigned int cpu_nr;
+ struct qnode *node, *next;
+ u32 prev_qcode, my_qcode;
+ enum exitval exitval;
+
+ /*
+ * Get the queue node
+ */
+ cpu_nr = smp_processor_id();
+ node = get_qnode(cpu_nr, &my_qcode);
+
+ /*
+ * Initialize the queue node
+ */
+ node->qhead = false;
+ node->next = NULL;
+
+ /*
+ * The lock may be available at this point; try again if no task was
+ * waiting in the queue.
+ */
+ if (!(qsval >> _QCODE_OFFSET) && queue_spin_trylock(lock))
+ goto release_node;
+
+ /*
+ * Exchange current copy of the queue node code
+ */
+ exitval = queue_code_xchg(lock, &prev_qcode, my_qcode);
+ if (unlikely(exitval == NOTIFY_NEXT))
+ goto notify_next;
+ else if (unlikely(exitval == RELEASE_NODE))
+ goto release_node;
+
+ if (prev_qcode) {
+ /*
+ * Not at the queue head, get the address of the previous node
+ * and set up the "next" field of that node.
+ */
+ struct qnode *prev = xlate_qcode(prev_qcode);
+
+ ACCESS_ONCE(prev->next) = node;
+ /*
+ * Wait until the queue head flag is on
+ */
+ do {
+ arch_mutex_cpu_relax();
+ } while (!ACCESS_ONCE(node->qhead));
+ }
+
+ /*
+ * At the head of the wait queue now
+ */
+ for (;; arch_mutex_cpu_relax()) {
+ qsval = atomic_read(&lock->qlcode);
+ next = ACCESS_ONCE(node->next);
+ if (qsval & _QLOCK_LOCK_MASK)
+ continue; /* Lock not available yet */
+
+ if (likely(qsval_to_qcode(qsval) != my_qcode)) {
+ /*
+ * There are additional lock waiters in the queue.
+ */
+ if (unlikely(!__queue_spin_trylock(lock)))
+ continue; /* Trylock fails! */
+ if (likely(next))
+ goto set_qhead;
+ else
+ goto notify_next;
+ /*
+ * The queue head is the only lock waiter in the queue.
+ * Get the lock & clear the queue code simultaneously.
+ */
+ } else if (queue_spin_trylock_and_clr_qcode(lock, my_qcode)) {
+ goto release_node;
+ }
+ }
+
+notify_next:
+ /*
+ * Wait, if needed, until the next one in the queue has set up its next field
+ */
+ while (!(next = ACCESS_ONCE(node->next)))
+ arch_mutex_cpu_relax();
+set_qhead:
+ /*
+ * The next one in queue is now at the head
+ */
+ ACCESS_ONCE(next->qhead) = true;
+
+release_node:
+ put_qnode();
+}
+EXPORT_SYMBOL(queue_spin_lock_slowpath);