Message ID: 20230922140914.13906-1-phil@philjordan.eu (mailing list archive)
Series: hvf x86 correctness and efficiency improvements
Ping - let me know if there's anything particularly controversial, unclear,
etc. about these patches, or if I can do anything to make reviewing easier.
Thanks!

On Fri, 22 Sept 2023 at 16:09, Phil Dennis-Jordan <phil@philjordan.eu> wrote:
>
> This is a series of semi-related patches for the x86 macOS
> Hypervisor.framework (hvf) accelerator backend. The intention is to make
> VMs run slightly more efficiently on macOS host machines. They have been
> subject to some months of CI workloads with macOS guest VMs without
> issues, and they seem to give a few percent performance improvement,
> though this varies greatly with the type of workload.
>
> Patch 1 enables the INVTSC CPUID bit when running with hvf. This can
> enable some optimisations in the guest OS, and I've not found any reason
> it shouldn't be allowed for hvf-based hosts.
>
> Patch 2 fixes hvf_kick_vcpu_thread so it actually forces a VM exit
> instead of doing nothing. I guess this previously didn't cause any huge
> issues because hvf's hv_vcpu_run() would exit so extremely frequently of
> its own accord. The temp variable is needed because the pointer expected
> by the hv_vcpu_interrupt() call doesn't match the fd field's type in the
> hvf accel's struct AccelCPUState. I'm unsure whether it would be better
> to change that struct field to the relevant architecture's handle type,
> hv_vcpuid_t (x86, unsigned int) or hv_vcpu_t (aarch64, uint64_t),
> perhaps via an intermediate typedef?
>
> Patch 3 replaces the call to hv_vcpu_run() with the more modern
> hv_vcpu_run_until() for running the guest vCPU. The newer API is
> available from macOS 10.15 host systems onwards. This call causes
> significantly fewer VM exits, which also means we really need the
> exit-forcing interrupt from patch 2. The reduction in VM exits means
> less overhead from exits and less contention on the BQL. Using
> hv_vcpu_run_until() is also a prerequisite for using certain newer hvf
> features, though this patchset doesn't use any.
>
> Patches 2 & 3 must therefore be applied in that order; patch 1 is
> independent.
>
> This work has been sponsored by Sauce Labs Inc.
>
> Phil Dennis-Jordan (3):
>   i386: hvf: Adds support for INVTSC cpuid bit
>   i386: hvf: In kick_vcpu use hv_vcpu_interrupt to force exit
>   i386: hvf: Updates API usage to use modern vCPU run function
>
>  target/i386/hvf/hvf.c       | 26 +++++++++++++++++++++++++-
>  target/i386/hvf/x86_cpuid.c |  4 ++++
>  2 files changed, 29 insertions(+), 1 deletion(-)
>
> --
> 2.36.1
>
On 9/22/23 16:09, Phil Dennis-Jordan wrote:
> Patch 1 enables the INVTSC CPUID bit when running with hvf. This can
> enable some optimisations in the guest OS, and I've not found any reason
> it shouldn't be allowed for hvf based hosts.

It can be enabled, but it should include a migration blocker. In fact,
probably HVF itself should include a migration blocker, because QEMU
doesn't support TSC scaling.

> Patch 2 fixes hvf_kick_vcpu_thread so it actually forces a VM exit
> instead of doing nothing. I guess this previously didn't cause any huge
> issues because hvf's hv_vcpu_run() would exit so extremely frequently
> on its own accord.

No, it's because signal delivery should already kick the vCPU out of
guest mode. I guess it does with hv_vcpu_run(), but not with
hv_vcpu_run_until()?

hv_vcpu_interrupt() is a problematic API, in that it is not clear how it
handles races with hv_vcpu_run(). In particular, whether this causes an
immediate vmexit or not:

    thread 1                        thread 2
    -----------------------         -----------------------
    <CPU not in guest mode>
                                    hv_vcpu_interrupt()
    hv_vcpu_run() (or run_until)

Not that the current code is any better; there is no guarantee that the
signal is delivered before hv_vcpu_run() is called. However, if, as you
said, hv_vcpu_run() will exit often of its own accord but
hv_vcpu_run_until() does not, then this could cause difficult-to-reproduce
bugs by switching to the latter.

> The temp variable is needed because the pointer expected by the
> hv_vcpu_interrupt() call doesn't match the fd field's type in the hvf
> accel's struct AccelCPUState. I'm unsure if it would be better to change
> that struct field to the relevant architecture's handle types,
> hv_vcpuid_t (x86, unsigned int) and hv_vcpu_t (aarch64, uint64_t),
> perhaps via an intermediate typedef?

I think fd and the HVF type should be placed in an anonymous union.

Thanks,

Paolo
Hi Paolo,

On Mon, 16 Oct 2023 at 16:39, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 9/22/23 16:09, Phil Dennis-Jordan wrote:
> > Patch 1 enables the INVTSC CPUID bit when running with hvf. This can
> > enable some optimisations in the guest OS, and I've not found any reason
> > it shouldn't be allowed for hvf based hosts.
>
> It can be enabled, but it should include a migration blocker. In fact,
> probably HVF itself should include a migration blocker because QEMU
> doesn't support TSC scaling.

I didn't think Qemu's HVF backend supported migration in any form at
this point anyway? Or do you mean machine model versioning of the
default setting?

> > Patch 2 fixes hvf_kick_vcpu_thread so it actually forces a VM exit
> > instead of doing nothing. I guess this previously didn't cause any huge
> > issues because hvf's hv_vcpu_run() would exit so extremely frequently
> > on its own accord.
>
> No, it's because signal delivery should already kick the vCPU out of
> guest mode. I guess it does with hv_vcpu_run(), but not with
> hv_vcpu_run_until()?

It's possible! At any rate, hv_vcpu_run() only seems to run for VERY
short time slices in practice. I'll try gathering some data on exit
reasons and kick delivery latencies with the various combinations of
kick methods and vCPU run APIs if that helps.

> hv_vcpu_interrupt() is a problematic API, in that it is not clear how it
> handles races with hv_vcpu_run(). In particular, whether this causes an
> immediate vmexit or not:
>
>     thread 1                        thread 2
>     -----------------------         -----------------------
>     <CPU not in guest mode>
>                                     hv_vcpu_interrupt()
>     hv_vcpu_run() (or run_until)

Roman brought me up to speed on this issue in the thread here:
https://lists.gnu.org/archive/html/qemu-devel/2023-10/msg02156.html
including a patch and discussion on the subject from 2020. I've now
updated Roman's old patch, which addressed it thanks to your feedback at
the time but which never got merged. (Judging by the list archives, v4
just never got reviewed.) Details on my latest progress here:
https://lists.gnu.org/archive/html/qemu-devel/2023-10/msg04752.html

> Not that the current code is any better; there is no guarantee that the
> signal is delivered before hv_vcpu_run() is called. However, if as you
> said hv_vcpu_run() will exit often on its own accord but
> hv_vcpu_run_until() does not, then this could cause difficult to
> reproduce bugs by switching to the latter.

At least with Roman's updated kick patch, according to trace points I
inserted locally, I'm not seeing any slow or undelivered kicks with
hv_vcpu_run_until(). However, I'm still observing the (S)VGA issues to
which Roman drew my attention at the same time, so I'm wondering whether
the issues we're seeing with hv_vcpu_run_until() aren't (all) related to
interrupt delivery. (To be clear, switching to hv_vcpu_run_until()
WITHOUT hv_vcpu_interrupt() causes some very obvious problems where the
vCPU simply doesn't exit at all for long periods.)

From my point of view, there are at least 2 strong incentives for
switching to hv_vcpu_run_until():

1. hv_vcpu_run() exits very frequently, and often there is actually
   nothing for the VMM to do except call hv_vcpu_run() again. With
   Qemu's current hvf backend, each exit causes a BQL acquisition, and
   VMs with a bunch of vCPUs rapidly become limited by BQL contention
   according to my profiling.

2. The HVF in macOS 12 and newer contains an in-kernel APIC
   implementation, in a similar vein to KVM's kernel irqchip. This
   should further reduce apparent VM exits, but using it is conditional
   on using hv_vcpu_run_until(); calling hv_vcpu_run() errors out with
   it enabled. Integrating that into Qemu is still a work in progress,
   but I've got pretty far. (Most of the APIC and IOAPIC suites in
   kvm-unit-tests pass.)

For those 2 reasons, I'm pretty motivated to make things work with
hv_vcpu_run_until() one way or another.

> > The temp variable is needed because the pointer expected by the
> > hv_vcpu_interrupt() call doesn't match the fd field's type in the hvf
> > accel's struct AccelCPUState. I'm unsure if it would be better to
> > change that struct field to the relevant architecture's handle types,
> > hv_vcpuid_t (x86, unsigned int) and hv_vcpu_t (aarch64, uint64_t),
> > perhaps via an intermediate typedef?
>
> I think fd and the HVF type should be placed in an anonymous union.

That's a good idea, I'll put a patch together doing exactly that.

Thanks,
Phil
On Mon, Oct 16, 2023 at 6:45 PM Phil Dennis-Jordan <lists@philjordan.eu> wrote:
> I didn't think Qemu's HVF backend supported migration in any form at
> this point anyway? Or do you mean machine model versioning of the
> default setting?

If it doesn't support migration, it needs to register a migration
blocker.

> switching to hv_vcpu_run_until() WITHOUT hv_vcpu_interrupt()
> causes some very obvious problems where the vCPU simply
> doesn't exit at all for long periods.)

Yes, that makes sense. It looks like hv_vcpu_run_until() has an
equivalent of a "do ... while (errno == EINTR)" loop inside it.

> 1. hv_vcpu_run() exits very frequently, and often there is actually
> nothing for the VMM to do except call hv_vcpu_run() again. With
> Qemu's current hvf backend, each exit causes a BQL acquisition,
> and VMs with a bunch of vCPUs rapidly become limited by BQL
> contention according to my profiling.

Yes, that should be fixed anyway, but I agree it is a separate issue.

Paolo
On Mon, 16 Oct 2023 at 18:49, Paolo Bonzini <pbonzini@redhat.com> wrote:
> > switching to hv_vcpu_run_until() WITHOUT hv_vcpu_interrupt()
> > causes some very obvious problems where the vCPU simply
> > doesn't exit at all for long periods.)
>
> Yes, that makes sense. It looks like hv_vcpu_run_until() has an
> equivalent of a "do ... while (errno == EINTR)" loop inside it.

Thinking about this some more, I wonder if it's worth putting together
some test code to check empirically how signals and hv_vcpu_interrupt()
interact with each of the 2 vcpu_run implementations. It should be
pretty straightforward to establish whether calling hv_vcpu_interrupt()
*before* hv_vcpu_run*() causes an instant exit when it is called, and
whether the signal causes hv_vcpu_run() to exit. (Based on observed
behaviour, I'm already pretty confident hv_vcpu_run_until() ignores
signals until it exits for other reasons.)

> > 1. hv_vcpu_run() exits very frequently, and often there is actually
> > nothing for the VMM to do except call hv_vcpu_run() again. With
> > Qemu's current hvf backend, each exit causes a BQL acquisition,
> > and VMs with a bunch of vCPUs rapidly become limited by BQL
> > contention according to my profiling.
>
> Yes, that should be fixed anyway, but I agree it is a separate issue.

I've locally got a bunch of very messy patches for reducing BQL
acquisition in the x86 HVF vCPU loop. I found it difficult to make much
of a real difference, however, and the code gets a lot harder to follow.

- Decoding instructions that need emulating can be done outside the
  lock. Actually running the instruction typically ends up causing an
  action that needs it though, so the win isn't that big.

- With hv_vcpu_run() there are a bunch of (EPT fault) exits that don't
  really need anything in particular to be done. Or instead of detecting
  those outside the lock, you can switch to hv_vcpu_run_until(), which
  avoids those exits altogether.

- KVM's vCPU loop calls into MMIO faults without the BQL, but they
  acquire it internally, I think?

- Going from xAPIC to x2APIC reduces the number of exits, and using MSRs
  is a bit quicker than walking the memory hierarchy, so that definitely
  helps too. But the APIC implementation still needs the BQL held, and
  untangling that is probably harder than switching to an in-kernel
  APIC.

Beyond that: there's a good chance that turning the BQL into a
read-write lock would help, but that's a much bigger undertaking than
I'm currently willing to entertain!

Re hvf fd:

> I think fd and the HVF type should be placed in an anonymous union.

Taking a look at the code and thinking through the implications, I'm not
actually sure what the intention of the union is? IIRC Qemu is built
with strict aliasing rules disabled, but there seems little point in
actively using union aliasing here. We can just as easily modify this
block at the top of hvf_int.h:

    #ifdef __aarch64__
    #include <Hypervisor/Hypervisor.h>
    #else
    #include <Hypervisor/hv.h>
    #endif

to something like:

    #ifdef __aarch64__
    #include <Hypervisor/Hypervisor.h>
    typedef hv_vcpu_t hvf_vcpu_id;
    #else
    #include <Hypervisor/hv.h>
    typedef hv_vcpuid_t hvf_vcpu_id;
    #endif

And then:

    struct AccelCPUState {
        hvf_vcpu_id fd;
        void *exit;
        …

Or perhaps this variant, if we want AccelCPUState to have consistent
size/alignment properties: we can use a union after all, but never
actually use the fd_padding field:

    struct AccelCPUState {
        union {
            hvf_vcpu_id fd;
            uint64_t fd_padding;
        };
        void *exit;
        …

(Or is that what you were thinking of with the union idea in the first
place?)

Thanks,
Phil
On Mon, Oct 16, 2023 at 10:05 PM Phil Dennis-Jordan <lists@philjordan.eu> wrote:
> > I think fd and the HVF type should be placed in an anonymous union.
>
> Taking a look at the code and thinking through the implications, I'm
> not actually sure what the intention of the union is?

Never mind - I got confused; AccelCPUState depends on what the actual
accelerator is. So yeah, you can just change the type of "fd"; it was an
int only because, in the past, the field was defined for all
accelerators.

Paolo