mbox series

[00/11] hvf and APIC fixes, improvements, and optimisations

Message ID 20241209203629.74436-1-phil@philjordan.eu (mailing list archive)
Headers show
Series hvf and APIC fixes, improvements, and optimisations | expand

Message

Phil Dennis-Jordan Dec. 9, 2024, 8:36 p.m. UTC
From: Phil Dennis-Jordan <phil@philjordan.eu>

These patches are a loosely related series of changes that I've
collected while working on the VMApple/PVG patch set and my
doomed project of integrating Hypervisor.framework's APIC
implementation. (The latter turns out to suffer from a number of
serious bugs which prevent it from being usable with
level-triggered interrupts, at least as of macOS 15.1.)

The motivations behind the patches are:

 * Correctness. A bunch of these fix bugs or improve compliance
   with the appropriate specification.
 * Performance. The biggest performance problem on macOS hosts
   and the hvf accel is BQL contention. Depending on workload
   and configuration, busy vCPU threads seem to spend between
   10% and 30% of their time in bql_lock(). A number of these
   patches either reduce total work done, or move work out of
   the BQL in certain configurations.
 * Code hygiene. I've encountered a few instances of incorrect
   types used, outdated comments, magic numbers, etc. which
   were easy to fix.

There are some dependencies between the patches:

 * Patch 1 is required for both patches 2 and 7. This is why
   I have not split the aarch64 (patch 2) and the x86 (3-10)
   specific changes into separate patch sets.
 * Patch 6 will not apply cleanly without 5 in its current state
   but can be modified to work.
 * Patches 6 and 7 are only useful in combination.

Beyond that, we can mix and match - i.e. pull
reviewed-and-approved patches while further iterating on any
that might need refinement or dropping unwanted ones entirely.

The patches, in brief:

 1. Adds a function for architecture-specific vCPU setup
    before the vCPU's first VM entry, but after QEMU init has
    concluded. This is useful for correctly setting the aarch64
    GIC flag (patch 2) and for telling Hypervisor.framework the
    initial APICBASE address. (patch 7)
 2. Set the GICv3 flag in the PFR0_EL1 system register after
    the GIC has actually been set up. Previously the flag was
    computed before the GIC was ready, and thus always 0.
 3. Don't send signal to thread when kicking. cpus_kick_thread
    calls pthread_kill() to send a SIG_IPI to the target vCPU's
    thread. There does not appear to be any need to do this
    with HVF, so let's not call cpus_kick_thread() when kicking
    a vCPU there.
 4. Prefetch all bytes of an instruction when emulating it in
    software. The x86_decode system in the HVF accel read guest
    memory in 1-, 2-, 4-, or 8-byte parcels when decoding
    instructions. The VM exit tells us how big the instruction
    is, so let's read all of it in one go for efficiency.
 5. x86 instruction decoding is somewhat expensive, and if we
    know that it must definitely be done before we acquire the
    BQL, we can actually move decoding to before the lock. The
    simplest case for this are APIC accesses, so we start with
    those.
 6. APIC access fast paths. EPT faults are somewhat expensive
    to process. This includes xAPIC MMIO. We can skip walking
    the memory graph and fully software-emulating the faulting
    instruction by obtaining the access type and MMIO offset
    provided by VMX/HVF's virtualised APIC access feature, and
    detecting common MOV instructions to call the APIC register
    read/write functions directly.
 7. To enable the APIC_ACCESS VM exit type, both the APIC base
    and the relevant ctl must be set in the VMCS; this change
    implements the former and toggles the latter according to
    APIC state, thus enabling the fast path implemented in patch
    6 and also adding better support for APIC relocation.
 8. This fixes a technically incorrect variable type. No change
    in functionality.
 9. The HVF "failed to decode instruction" error prints
    instruction bytes in variable-length hex; it's easier to
    read when the hex digits are always paired, even if the
    first one is 0.
10. This APIC change replaces a magic number with a symbolic
    constant and removes a comment which is no longer accurate.
11. This change causes the software APIC to raise an exception
    when the guest attempts to write to the reserved bits of the
    APICBASE MSR.


In combination, the patches improve QEMU/hvf's relevant
kvm-unit-tests results in the APIC/xAPIC/x2APIC suites (after
disabling the crashing test_pv_ipi test) from:

FAIL apic-split (56 tests, 9 unexpected failures, 2 skipped)
FAIL x2apic (56 tests, 9 unexpected failures, 2 skipped)
FAIL xapic (45 tests, 7 unexpected failures, 3 skipped)

To:

FAIL apic-split (56 tests, 2 unexpected failures, 2 skipped)
FAIL x2apic (56 tests, 2 unexpected failures, 2 skipped)
FAIL xapic (45 tests, 2 unexpected failures, 3 skipped)

The remaining test failures are, in each case:

FAIL: multiple nmi (received: 1)
FAIL: TMCCT should stay at zero

These are not regressions, they were part of the original
test failures; APIC logic relevant to them was not touched
in this patch set.


Performance improvements vary a lot depending on workload.
MMIO-based APIC accesses are most affected, so APIC-heavy
loads with x2apic disabled benefit most.

macOS guests seem to use the APIC very heavily, so it comes
as no surprise that the improvement is noticeable, with
the guest OS booting in about 33 seconds with xAPIC, down
from about 43 with unmodified QEMU 9.2-rc3. Even with x2APIC
enabled, the changes shaved about 1-2 seconds off boot times.

"Warm" builds of QEMU 9.2-rc3 with
make clean ; time ( ./configure --target-list=x86_64-softmmu ; make -j4)
inside an OpenSUSE 15.6 VM with -smp 4 and -cpu host,-x2apic
on my 8-core MacBook Pro 2019 with macOS 14 repeatably went from
  2m 58s (+/- 1.2s) real, 8m 39s (+/- 2s) user, 1m 14s (+/- <1s) sys
with 9.2-rc3 to
  2m 56s (+/- 1.2s) real, 8m 32s (+/- 2s) user, 1m 13s (+/- <1s) sys
with this patch set applied. It's not a huge improvement
(>1% for real/user times, 2% for sys) but the runs are very
consistently in favour of the modifications.

My wall-clock measurements with boot times for Windows 10,
Linux, and FreeBSD guests were all within margin of error,
though profiling the QEMU process did indicate a reduction of
vCPU thread time spent in bql_lock().

I did not spot any performance regressions, but I can't
entirely rule them out either.

There is scope for similar optimisations that yield more widely
applicable improvements. In particular EPT fault MMIO handling
can be split into non-BQL and BQL portions. I measured around 1-2%
additional improvement with a naive implementation of this, but the
hvf memory slot data structure needs to be made RCU-safe before such
an optimisation can be upstreamed.


I have done my best to do a reasonable amount of testing with
these patches, in particular I have tested Linux, FreeBSD,
macOS, Windows 10, and Windows XP guests on x86-64, each with
x2APIC alternately enabled and disabled. On aarch64, I have
checked Linux and macOS (VMApple) guests. All of this on
macOS hosts.

Phil Dennis-Jordan (11):
  hvf: Add facility for initialisation code prior to first vCPU run
  arm/hvf: Initialise GICv3 state just before first vCPU run
  i386/hvf: Don't send signal to thread when kicking
  i386/hvf: Pre-fetch emulated instructions
  i386/hvf: Decode APIC access x86 instruction outside BQL
  i386/hvf: APIC access exit with fast-path for common mov cases
  i386/hvf: Enables APIC_ACCESS VM exits by setting APICBASE
  i386/hvf: Variable type fixup in decoder
  i386/hvf: Print hex pairs for each opcode byte in decode error
  hw/intc/apic: Fixes magic number use, removes outdated comment
  hw/intc/apic: Raise exception when setting reserved APICBASE bits

 accel/hvf/hvf-accel-ops.c    |   5 ++
 hw/intc/apic.c               |  12 +++--
 include/hw/i386/apic.h       |   2 +
 include/sysemu/hvf_int.h     |   1 +
 meson.build                  |   1 +
 target/arm/hvf/hvf.c         |  21 +++++---
 target/i386/hvf/hvf.c        |  50 +++++++++++++----
 target/i386/hvf/trace-events |   9 ++++
 target/i386/hvf/trace.h      |   1 +
 target/i386/hvf/x86_decode.c |  24 ++++++---
 target/i386/hvf/x86_decode.h |   5 +-
 target/i386/hvf/x86_emu.c    | 102 +++++++++++++++++++++++++++++++++++
 target/i386/hvf/x86_emu.h    |   2 +
 13 files changed, 206 insertions(+), 29 deletions(-)
 create mode 100644 target/i386/hvf/trace-events
 create mode 100644 target/i386/hvf/trace.h