Message ID | 20240308223702.1350851-5-seanjc@google.com (mailing list archive) |
---|---|
State | New, archived |
Series | [GIT,PULL] KVM: x86: MMU changes for 6.9 |
On 3/8/24 23:36, Sean Christopherson wrote:
> The bulk of the changes are TDP MMU improvements related to memslot deletion
> (ChromeOS has a use case that "requires" frequent deletion of a GPU buffer).
> The other highlight is allocating the write-tracking metadata on-demand, e.g.
> so that distro kernels pay the memory cost of the arrays if and only if KVM
> or KVMGT actually needs to shadow guest page tables.
>
> The following changes since commit 41bccc98fb7931d63d03f326a746ac4d429c1dd3:
>
>   Linux 6.8-rc2 (2024-01-28 17:01:12 -0800)
>
> are available in the Git repository at:
>
>   https://github.com/kvm-x86/linux.git tags/kvm-x86-mmu-6.9
>
> for you to fetch changes up to a364c014a2c1ad6e011bc5fdb8afb9d4ba316956:
>
>   kvm/x86: allocate the write-tracking metadata on-demand (2024-02-27 11:49:54 -0800)

Pulled, thanks.

Paolo

> ----------------------------------------------------------------
> KVM x86 MMU changes for 6.9:
>
>  - Clean up code related to unprotecting shadow pages when retrying a guest
>    instruction after failed #PF-induced emulation.
>
>  - Zap TDP MMU roots at 4KiB granularity to minimize the delay in yielding if
>    a reschedule is needed, e.g. if a high priority task needs to run. Because
>    KVM doesn't support yielding in the middle of processing a zapped non-leaf
>    SPTE, zapping at 1GiB granularity can result in multi-millisecond lag when
>    attempting to schedule in a high priority task.
>
>  - Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for
>    read, e.g. to avoid serializing vCPUs when userspace deletes a memslot.
>
>  - Allocate write-tracking metadata on-demand to avoid the memory overhead when
>    running kernels built with KVMGT support (external write-tracking enabled),
>    but for workloads that don't use nested virtualization (shadow paging) or
>    KVMGT.
>
> ----------------------------------------------------------------
> Andrei Vagin (1):
>       kvm/x86: allocate the write-tracking metadata on-demand
>
> Kunwu Chan (1):
>       KVM: x86/mmu: Use KMEM_CACHE instead of kmem_cache_create()
>
> Mingwei Zhang (1):
>       KVM: x86/mmu: Don't acquire mmu_lock when using indirect_shadow_pages as a heuristic
>
> Sean Christopherson (10):
>       KVM: x86: Drop dedicated logic for direct MMUs in reexecute_instruction()
>       KVM: x86: Drop superfluous check on direct MMU vs. WRITE_PF_TO_SP flag
>       KVM: x86/mmu: Zap invalidated TDP MMU roots at 4KiB granularity
>       KVM: x86/mmu: Don't do TLB flush when zappings SPTEs in invalid roots
>       KVM: x86/mmu: Allow passing '-1' for "all" as_id for TDP MMU iterators
>       KVM: x86/mmu: Skip invalid roots when zapping leaf SPTEs for GFN range
>       KVM: x86/mmu: Skip invalid TDP MMU roots when write-protecting SPTEs
>       KVM: x86/mmu: Check for usable TDP MMU root while holding mmu_lock for read
>       KVM: x86/mmu: Alloc TDP MMU roots while holding mmu_lock for read
>       KVM: x86/mmu: Free TDP MMU roots while holding mmy_lock for read
>
>  arch/x86/include/asm/kvm_host.h |   9 +++
>  arch/x86/kvm/mmu/mmu.c          |  37 +++++++-----
>  arch/x86/kvm/mmu/page_track.c   |  68 +++++++++++++++++++++-
>  arch/x86/kvm/mmu/tdp_mmu.c      | 124 ++++++++++++++++++++++++++++------------
>  arch/x86/kvm/mmu/tdp_mmu.h      |   2 +-
>  arch/x86/kvm/x86.c              |  35 +++++------
>  6 files changed, 201 insertions(+), 74 deletions(-)
>
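For readers following the granularity bullet in the pull request above, here is a rough sketch of why the zap level matters for yielding. It is not KVM code: it only counts how many 4KiB leaf SPTEs can sit underneath a single SPTE at each level, assuming standard x86-64 paging with 512 entries per page-table page and assuming (as the tag message states) that KVM cannot yield partway through processing one zapped non-leaf SPTE. The names `ENTRIES_PER_TABLE`, `LEVEL`, and `leaf_sptes_per_zap` are made up for this illustration.

```python
# Illustration only, not KVM code. All names here are hypothetical.

# Entries per x86-64 page-table page (4KiB page / 8-byte entry).
ENTRIES_PER_TABLE = 512

# Paging level at which an SPTE of each granularity lives (1 = 4KiB leaf).
LEVEL = {"4KiB": 1, "2MiB": 2, "1GiB": 3}


def leaf_sptes_per_zap(granularity: str) -> int:
    """Upper bound on the 4KiB leaf SPTEs torn down by zapping one SPTE."""
    return ENTRIES_PER_TABLE ** (LEVEL[granularity] - 1)


if __name__ == "__main__":
    for g in ("4KiB", "2MiB", "1GiB"):
        print(f"{g}: up to {leaf_sptes_per_zap(g)} leaf SPTEs per unyieldable step")
```

Under these assumptions a single 1GiB zap can cover up to 512 * 512 leaf SPTEs before the next opportunity to yield, versus one leaf SPTE per step at 4KiB.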
On Fri, Mar 8, 2024 at 11:37 PM Sean Christopherson <seanjc@google.com> wrote:
>
>  - Zap TDP MMU roots at 4KiB granularity to minimize the delay in yielding if
>    a reschedule is needed, e.g. if a high priority task needs to run. Because
>    KVM doesn't support yielding in the middle of processing a zapped non-leaf
>    SPTE, zapping at 1GiB granularity can result in multi-millisecond lag when
>    attempting to schedule in a high priority task.
>

Would 2 MiB provide a nice middle ground?

Paolo
On Thu, Mar 14, 2024, Paolo Bonzini wrote:
> On Fri, Mar 8, 2024 at 11:37 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> >  - Zap TDP MMU roots at 4KiB granularity to minimize the delay in yielding if
> >    a reschedule is needed, e.g. if a high priority task needs to run. Because
> >    KVM doesn't support yielding in the middle of processing a zapped non-leaf
> >    SPTE, zapping at 1GiB granularity can result in multi-millisecond lag when
> >    attempting to schedule in a high priority task.
> >
>
> Would 2 MiB provide a nice middle ground?

Not really?

Zapping at 2MiB definitely fixes the worst of the tail latencies, but there is
still a measurable difference between 2MiB and 4KiB. And on the other side of the
coin, I was unable to observe a meaningful difference in total runtime by zapping
at 2MiB, or even 1GiB, versus 4KiB.

In other words, AFAICT, there's no need to shoot for a middle ground because trying
to zap at larger granularities doesn't buy us anything.
On Thu, Mar 14, 2024 at 7:38 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Mar 14, 2024, Paolo Bonzini wrote:
> > On Fri, Mar 8, 2024 at 11:37 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > >  - Zap TDP MMU roots at 4KiB granularity to minimize the delay in yielding if
> > >    a reschedule is needed, e.g. if a high priority task needs to run. Because
> > >    KVM doesn't support yielding in the middle of processing a zapped non-leaf
> > >    SPTE, zapping at 1GiB granularity can result in multi-millisecond lag when
> > >    attempting to schedule in a high priority task.
> > >
> >
> > Would 2 MiB provide a nice middle ground?
>
> Not really?
>
> Zapping at 2MiB definitely fixes the worst of the tail latencies, but there is
> still a measurable difference between 2MiB and 4KiB.

Yeah, but you said multi-millisecond so I guessed 5/512 is a 10 microsecond
latency, which should be pretty acceptable (for PREEMPT_RT tests at Red Hat
we shoot at 10-15 worst case, so for CONFIG_PREEMPT it would be more than
enough).

> And on the other side of the
> coin, I was unable to observe a meaningful difference in total runtime by zapping
> at 2MiB, or even 1GiB, versus 4KiB.

Ok, that's the answer.

Paolo

> In other words, AFAICT, there's no need to shoot for a middle ground because trying
> to zap at larger granularities doesn't buy us anything.
>
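The back-of-the-envelope estimate in the reply above ("5/512") can be spelled out as a small sketch. It assumes the worst-case delay scales linearly with the amount of memory covered by one zap, and it takes 5 ms as a stand-in for "multi-millisecond"; neither figure is a measurement from this thread. The function name `scaled_latency_us` is hypothetical.

```python
# Illustration only. The 5 ms input is an assumed stand-in for
# "multi-millisecond", not a measured value.

# Number of 2MiB regions in one 1GiB region on x86-64.
ENTRIES_PER_TABLE = 512


def scaled_latency_us(latency_1g_ms: float) -> float:
    """Estimate the 2MiB-zap delay in microseconds from a 1GiB-zap delay
    in milliseconds, assuming the delay scales linearly with zap size."""
    return latency_1g_ms * 1000.0 / ENTRIES_PER_TABLE


if __name__ == "__main__":
    print(f"{scaled_latency_us(5.0):.2f} us")
```

With a 5 ms input this gives a bit under 10 microseconds, matching the figure quoted in the reply. As noted earlier in the thread, a measurable difference between 2MiB and 4KiB was still observed, so the linear model is only a rough guide.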