Message ID: 20241211013302.1347853-1-seanjc@google.com
Series:     KVM: x86: Address xstate_required_size() perf regression
On Tue, Dec 10, 2024, Sean Christopherson wrote:
> CPUID is a mandatory intercept on both Intel and AMD
Jim pointed out that CPUID is NOT a mandatory intercept on AMD, and so while the
code is correct, the changelogs are not.
On 12/11/24 02:32, Sean Christopherson wrote:
> Fix a hilarious/revolting performance regression (relative to older CPU
> generations) in xstate_required_size() that pops up due to CPUID _in the
> host_ taking 3x-4x longer on Emerald Rapids than Skylake.
>
> The issue rears its head on nested virtualization transitions, as KVM
> (unnecessarily) performs runtime CPUID updates, including XSAVE sizes,
> multiple times per transition. And calculating XSAVE sizes, especially
> for vCPUs with a decent number of supported XSAVE features and compacted
> format support, can add up to thousands of cycles.
>
> To fix the immediate issue, cache the CPUID output at kvm.ko load. The
> information is static for a given CPU, i.e. doesn't need to be re-read
> from hardware every time. That's patch 1, and eliminates pretty much all
> of the meaningful overhead.

Queued this one, thanks!

Paolo
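For illustration, a minimal sketch of the caching approach from patch 1, using hypothetical names (kvm_init_xstate_sizes, xstate_sizes[]) rather than the actual patch's, with the compacted-format 64-byte alignment handling elided for brevity:

#include <linux/init.h>
#include <linux/types.h>
#include <linux/bits.h>
#include <linux/minmax.h>
#include <asm/processor.h>	/* cpuid_count() */
#include <asm/fpu/types.h>	/* XSAVE_HDR_SIZE, XSAVE_HDR_OFFSET */

#define NUM_XSTATE_LEAVES	64	/* illustrative bound */

static u32 xstate_sizes[NUM_XSTATE_LEAVES];
static u32 xstate_offsets[NUM_XSTATE_LEAVES];

/*
 * CPUID.0xD output is static for a given CPU: read it exactly once, at
 * kvm.ko load, instead of executing CPUID on every runtime update.
 */
void __init kvm_init_xstate_sizes(void)
{
	u32 eax, ebx, ecx, edx;
	int i;

	/*
	 * Sub-leaves 0 and 1 describe the legacy/header region; the
	 * extended state components start at sub-leaf 2 (YMM).
	 */
	for (i = 2; i < NUM_XSTATE_LEAVES; i++) {
		cpuid_count(0xD, i, &eax, &ebx, &ecx, &edx);
		xstate_sizes[i] = eax;		/* size of component i */
		xstate_offsets[i] = ebx;	/* offset, standard format */
	}
}

/*
 * Compute the XSAVE buffer size for a set of enabled features by walking
 * the cached entries with a plain for-loop; no CPUID at runtime.
 */
static u32 xstate_required_size(u64 xstate_bv, bool compacted)
{
	u32 ret = XSAVE_HDR_SIZE + XSAVE_HDR_OFFSET;
	int i;

	for (i = 2; i < NUM_XSTATE_LEAVES; i++) {
		if (!(xstate_bv & BIT_ULL(i)))
			continue;

		if (compacted)
			ret += xstate_sizes[i];
		else
			ret = max(ret, xstate_offsets[i] + xstate_sizes[i]);
	}
	return ret;
}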
On Tue, Dec 10, 2024 at 5:33 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Fix a hilarious/revolting performance regression (relative to older CPU
> generations) in xstate_required_size() that pops up due to CPUID _in the
> host_ taking 3x-4x longer on Emerald Rapids than Skylake.
>
> The issue rears its head on nested virtualization transitions, as KVM
> (unnecessarily) performs runtime CPUID updates, including XSAVE sizes,
> multiple times per transition. And calculating XSAVE sizes, especially
> for vCPUs with a decent number of supported XSAVE features and compacted
> format support, can add up to thousands of cycles.
>
> To fix the immediate issue, cache the CPUID output at kvm.ko load. The
> information is static for a given CPU, i.e. doesn't need to be re-read
> from hardware every time. That's patch 1, and eliminates pretty much all
> of the meaningful overhead.
>
> Patch 2 is a minor cleanup to try and make the code easier to read.
>
> Patch 3 fixes a wart in CPUID emulation where KVM does a moderately
> expensive (though cheap compared to CPUID, lol) MSR lookup that is likely
> unnecessary for the vast majority of VMs.
>
> Patches 4 and 5 address the problem of KVM doing runtime CPUID updates
> multiple times for each nested VM-Enter and VM-Exit, at least half of
> which are completely unnecessary (CPUID is a mandatory intercept on both
> Intel and AMD, so ensuring dynamic CPUID bits are up-to-date while running
> L2 is pointless). The idea is fairly simple: lazily do the CPUID updates
> by deferring them until something might actually consume the relevant
> guest bits.
>
> This applies on the cpu_caps overhaul[*], as patches 3-5 would otherwise
> conflict, and I didn't want to think about how safe patch 5 is without
> the rework.
>
> That said, patch 1, which is the most important and tagged for stable,
> applies cleanly on 6.1, 6.6, and 6.12 (and the backport for 5.15 and
> earlier shouldn't be too horrific).
>
> Side topic, I can't help but wonder if the CPUID latency on EMR is a CPU
> or ucode bug. For a number of leaves, KVM can emulate CPUID faster than
> the CPU can execute the instruction. I.e. the entire VM-Exit => emulate
> => VM-Enter sequence takes less time than executing CPUID on bare metal.
> Which seems absolutely insane. But, it shouldn't impact guest performance,
> so that's someone else's problem, at least for now.

Virtualization aside, perhaps Linux should set
MSR_FEATURE_ENABLES.CPUID_GP_ON_CPL_GT_0[bit 0] on EMR, and emulate
the CPUID instruction in the kernel? :)

> [*] https://lore.kernel.org/all/20241128013424.4096668-1-seanjc@google.com
>
> Sean Christopherson (5):
>   KVM: x86: Cache CPUID.0xD XSTATE offsets+sizes during module init
>   KVM: x86: Use for-loop to iterate over XSTATE size entries
>   KVM: x86: Apply TSX_CTRL_CPUID_CLEAR if and only if the vCPU has RTM
>     or HLE
>   KVM: x86: Query X86_FEATURE_MWAIT iff userspace owns the CPUID feature
>     bit
>   KVM: x86: Defer runtime updates of dynamic CPUID bits until CPUID
>     emulation
>
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/cpuid.c            | 63 ++++++++++++++++++++++++---------
>  arch/x86/kvm/cpuid.h            | 10 +++++-
>  arch/x86/kvm/lapic.c            |  2 +-
>  arch/x86/kvm/smm.c              |  2 +-
>  arch/x86/kvm/svm/sev.c          |  2 +-
>  arch/x86/kvm/svm/svm.c          |  2 +-
>  arch/x86/kvm/vmx/vmx.c          |  2 +-
>  arch/x86/kvm/x86.c              | 22 +++++++++---
>  9 files changed, 78 insertions(+), 28 deletions(-)
>
>
> base-commit: 06a4919e729761be751366716c00fb8c3f51d37e
> --
> 2.47.0.338.g60cca15819-goog
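To make the deferral idea in patches 4 and 5 concrete, a rough sketch with made-up names (the real patches plumb this through vcpu->arch, but the shape is the same): writers of dynamic state merely mark the cached CPUID stale, and only the CPUID emulation path pays for the recomputation.

#include <stdbool.h>

/* Illustrative only; the field and helper names are made up. */
struct vcpu_cpuid_state {
	bool dynamic_bits_dirty;
};

/*
 * Cheap: called from every path that can change dynamic CPUID state
 * (CR4/XCR0 writes, nested VM-Enter/VM-Exit, ...). No CPUID, no XSAVE
 * size computation, just a flag.
 */
static void mark_cpuid_dirty(struct vcpu_cpuid_state *s)
{
	s->dynamic_bits_dirty = true;
}

/*
 * Expensive, but lazy: called only from CPUID emulation, the one place
 * the dynamic bits (OSXSAVE, OSPKE, XSAVE sizes, ...) are consumed.
 * Multiple back-to-back nested transitions now cost at most one update,
 * or zero if the guest never executes CPUID in between.
 */
static void update_cpuid_if_dirty(struct vcpu_cpuid_state *s)
{
	if (!s->dynamic_bits_dirty)
		return;
	s->dynamic_bits_dirty = false;
	/* ... recompute dynamic leaves and XSAVE sizes here ... */
}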
On Mon, Dec 16, 2024, Jim Mattson wrote:
> On Tue, Dec 10, 2024 at 5:33 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > ...
> >
> > Side topic, I can't help but wonder if the CPUID latency on EMR is a CPU
> > or ucode bug. For a number of leaves, KVM can emulate CPUID faster than
> > the CPU can execute the instruction. I.e. the entire VM-Exit => emulate
> > => VM-Enter sequence takes less time than executing CPUID on bare metal.
> > Which seems absolutely insane. But, it shouldn't impact guest performance,
> > so that's someone else's problem, at least for now.
>
> Virtualization aside, perhaps Linux should set
> MSR_FEATURE_ENABLES.CPUID_GP_ON_CPL_GT_0[bit 0] on EMR, and emulate
> the CPUID instruction in the kernel? :)

Heh, that thought crossed my mind too.
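For the curious: the MSR Jim is referring to is what Linux calls MSR_MISC_FEATURES_ENABLES (0x140), whose bit 0 enables CPUID faulting, i.e. CPUID executed at CPL > 0 raises #GP. A sketch of the enable side, borrowing the probe Linux already does for the ARCH_SET_CPUID prctl; the #GP-time emulation itself is left as a comment, since nobody is seriously proposing this:

#include <linux/types.h>
#include <asm/msr.h>
#include <asm/msr-index.h>

/*
 * Sketch only. With CPUID faulting enabled, a CPL > 0 CPUID would #GP,
 * and the kernel's #GP handler could decode the 2-byte 0F A2 opcode,
 * serve the leaf from cached data, and advance RIP past the instruction.
 */
static bool enable_cpuid_faulting(void)
{
	u64 platform_info, val;

	rdmsrl(MSR_PLATFORM_INFO, platform_info);
	if (!(platform_info & MSR_PLATFORM_INFO_CPUID_FAULT))
		return false;	/* CPUID faulting not supported */

	rdmsrl(MSR_MISC_FEATURES_ENABLES, val);
	wrmsrl(MSR_MISC_FEATURES_ENABLES,
	       val | MSR_MISC_FEATURES_ENABLES_CPUID_FAULT);
	return true;
}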