[RFC,v4,69/71] cputlb: queue async flush jobs without the BQL

This yields sizable scalability improvements, as the results below show.

Host: Two Intel E5-2683 v3 14-core CPUs at 2.00 GHz (Haswell)

Workload: Ubuntu 18.04 ppc64 compiling the Linux kernel with
"make -j N", where N is the number of cores in the guest.

                      Speedup vs a single thread (higher is better):

         14 +---------------------------------------------------------------+
            |       +    +       +      +       +      +      $$$$$$  +     |
            |                                            $$$$$              |
            |                                      $$$$$$                   |
         12 |-+                                $A$$                       +-|
            |                                $$                             |
            |                             $$$                               |
         10 |-+                         $$    ##D#####################D   +-|
            |                        $$$ #####**B****************           |
            |                      $$####*****                   *****      |
            |                    A$#*****                             B     |
          8 |-+                $$B**                                      +-|
            |                $$**                                           |
            |               $**                                             |
          6 |-+           $$*                                             +-|
            |            A**                                                |
            |           $B                                                  |
            |           $                                                   |
          4 |-+        $*                                                 +-|
            |          $                                                    |
            |         $                                                     |
          2 |-+      $                                                    +-|
            |        $                                 +cputlb-no-bql $$A$$ |
            |       A                                   +per-cpu-lock ##D## |
            |       +    +       +      +       +      +     baseline **B** |
          0 +---------------------------------------------------------------+
                    1    4       8      12      16     20      24     28
                                       Guest vCPUs
  png: https://imgur.com/zZRvS7q

Some notes:
- baseline corresponds to the commit before this series

- per-cpu-lock is the commit that converts the CPU loop to per-cpu locks.

- cputlb-no-bql is this commit.

- I'm using taskset to assign cores to threads, favouring locality whenever
  possible but not using SMT. When N=1, I'm using a single host core, which
  leads to superlinear speedups (since with more cores the I/O thread can execute
  while vCPU threads sleep). In the future I might use N+1 host cores for N
  guest cores to avoid this, or perhaps pin guest threads to cores one-by-one.

- Scalability is not good at 64 cores, where contention on the BQL for
  handling interrupts dominates. I measured this on another machine (a
  64-core one), which unfortunately is much slower than this 28-core one,
  so I don't have the numbers for 1-16 cores. The plot is normalized to
  16-core baseline performance, and is therefore very ugly :-)
  https://imgur.com/XyKGkAw
  See below for an example of the *huge* amount of waiting on the BQL:

(qemu) info sync-profile
Type               Object  Call site                             Wait Time (s)         Count  Average (us)
----------------------------------------------------------------------------------------------------------
BQL mutex  0x55ba286c9800  accel/tcg/cpu-exec.c:545                 2868.85676      14872596        192.90
BQL mutex  0x55ba286c9800  hw/ppc/ppc.c:70                           539.58924       3666820        147.15
BQL mutex  0x55ba286c9800  target/ppc/helper_regs.h:105              323.49283       2544959        127.11
mutex      [           2]  util/qemu-timer.c:426                     181.38420       3666839         49.47
condvar    [          61]  cpus.c:1327                               136.50872         15379       8876.31
BQL mutex  0x55ba286c9800  accel/tcg/cpu-exec.c:516                   86.14785        946301         91.04
condvar    0x55ba286eb6a0  cpus-common.c:196                          78.41010           126     622302.35
BQL mutex  0x55ba286c9800  util/main-loop.c:236                       28.14795        272940        103.13
mutex      [          64]  include/qom/cpu.h:514                      17.87662      75139413          0.24
BQL mutex  0x55ba286c9800  target/ppc/translate_init.inc.c:8665        7.04738         36528        192.93
----------------------------------------------------------------------------------------------------------
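
To put the table above in proportion, the wait times can be summed per lock
type. The following is a minimal sketch, not part of the patch: the rows are
copied by hand from the sync-profile output above, and the helper names
(bql_share, average_us) are made up for illustration. It only covers the
entries listed above, so the share it reports is relative to those entries,
not to all synchronization objects in the process.

```python
# Rows copied from the "info sync-profile" output above:
# (type, call site, total wait time in seconds, number of waits).
ROWS = [
    ("BQL mutex", "accel/tcg/cpu-exec.c:545", 2868.85676, 14872596),
    ("BQL mutex", "hw/ppc/ppc.c:70", 539.58924, 3666820),
    ("BQL mutex", "target/ppc/helper_regs.h:105", 323.49283, 2544959),
    ("mutex", "util/qemu-timer.c:426", 181.38420, 3666839),
    ("condvar", "cpus.c:1327", 136.50872, 15379),
    ("BQL mutex", "accel/tcg/cpu-exec.c:516", 86.14785, 946301),
    ("condvar", "cpus-common.c:196", 78.41010, 126),
    ("BQL mutex", "util/main-loop.c:236", 28.14795, 272940),
    ("mutex", "include/qom/cpu.h:514", 17.87662, 75139413),
    ("BQL mutex", "target/ppc/translate_init.inc.c:8665", 7.04738, 36528),
]


def bql_share(rows):
    """Fraction of the listed wait time that was spent waiting on the BQL."""
    total = sum(wait for _, _, wait, _ in rows)
    bql = sum(wait for kind, _, wait, _ in rows if kind == "BQL mutex")
    return bql / total


def average_us(wait_s, count):
    """Average wait per acquisition in microseconds (the table's last column)."""
    return wait_s / count * 1e6


if __name__ == "__main__":
    print(f"BQL share of listed wait time: {bql_share(ROWS):.1%}")
    for kind, site, wait, count in ROWS:
        print(f"{kind:10s} {site:40s} {average_us(wait, count):12.2f} us")
```

With the rows above, the BQL accounts for roughly 90% of the listed wait time.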

Single-threaded performance is only slightly affected. The results
below are for a Debian aarch64 bootup+test with the entire series
applied, on an Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz host:

- Before:

 Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):

       7269.033478      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.06% )
    30,659,870,302      cycles                    #    4.218 GHz                      ( +-  0.06% )
    54,790,540,051      instructions              #    1.79  insns per cycle          ( +-  0.05% )
     9,796,441,380      branches                  # 1347.695 M/sec                    ( +-  0.05% )
       165,132,201      branch-misses             #    1.69% of all branches          ( +-  0.12% )

       7.287011656 seconds time elapsed                                          ( +-  0.10% )

- After:

       7375.924053      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.13% )
    31,107,548,846      cycles                    #    4.217 GHz                      ( +-  0.12% )
    55,355,668,947      instructions              #    1.78  insns per cycle          ( +-  0.05% )
     9,929,917,664      branches                  # 1346.261 M/sec                    ( +-  0.04% )
       166,547,442      branch-misses             #    1.68% of all branches          ( +-  0.09% )

       7.389068145 seconds time elapsed                                          ( +-  0.13% )

That is, a slowdown of about 1.4% in elapsed time (7.287 s to 7.389 s).
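
The relative change can be recomputed from the two perf runs above. This is
a minimal sketch rather than the method used to obtain the quoted figure:
the message does not say which counter or which baseline that figure is
based on, so the script prints the change for several counters, always
relative to the "Before" run. The function name slowdown_pct is made up for
illustration.

```python
def slowdown_pct(before, after):
    """Relative increase of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# (before, after) pairs copied from the perf stat output above.
METRICS = {
    "task-clock (msec)": (7269.033478, 7375.924053),
    "cycles": (30_659_870_302, 31_107_548_846),
    "instructions": (54_790_540_051, 55_355_668_947),
    "elapsed (s)": (7.287011656, 7.389068145),
}

if __name__ == "__main__":
    for name, (before, after) in METRICS.items():
        print(f"{name:20s} {slowdown_pct(before, after):+.2f}%")
```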

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/cputlb.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

Message ID	20181025144644.15464-69-cota@braap.org (mailing list archive)
State	New, archived
Headers
Return-Path: <qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org>
From: "Emilio G. Cota" <cota@braap.org>
To: qemu-devel@nongnu.org
Date: Thu, 25 Oct 2018 10:46:42 -0400
Message-Id: <20181025144644.15464-69-cota@braap.org>
In-Reply-To: <20181025144644.15464-1-cota@braap.org>
References: <20181025144644.15464-1-cota@braap.org>
Subject: [Qemu-devel] [RFC v4 69/71] cputlb: queue async flush jobs without the BQL
Precedence: list
Cc: Paolo Bonzini <pbonzini@redhat.com>, Richard Henderson <richard.henderson@linaro.org>
Errors-To: qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org
Sender: "Qemu-devel" <qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org>