
[GIT,PULL] KVM: x86: MMU changes for 6.9

Message ID 20240308223702.1350851-5-seanjc@google.com (mailing list archive)
State New, archived

Pull-request

https://github.com/kvm-x86/linux.git tags/kvm-x86-mmu-6.9

Message

Sean Christopherson March 8, 2024, 10:36 p.m. UTC
The bulk of the changes are TDP MMU improvements related to memslot deletion
(ChromeOS has a use case that "requires" frequent deletion of a GPU buffer).
The other highlight is allocating the write-tracking metadata on-demand, e.g.
so that distro kernels pay the memory cost of the arrays if and only if KVM
or KVMGT actually needs to shadow guest page tables.
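
As a rough illustration of what "on-demand" means here (a standalone sketch,
not the actual KVM implementation; the struct and function names are made up
for the example), the per-slot tracking array stays NULL until the first
write-tracking user appears:

/*
 * Standalone sketch, NOT the KVM implementation: per-slot write-tracking
 * counters are left unallocated until something actually enables write
 * tracking, so kernels that merely *support* KVMGT pay no memory cost.
 */
#include <stdio.h>
#include <stdlib.h>

struct memslot {
        size_t npages;
        unsigned short *write_track;    /* NULL until tracking is first enabled */
};

/* Called only when a write-tracking user (shadow MMU, KVMGT) shows up. */
static int slot_enable_write_tracking(struct memslot *slot)
{
        if (slot->write_track)
                return 0;               /* already allocated */

        slot->write_track = calloc(slot->npages, sizeof(*slot->write_track));
        return slot->write_track ? 0 : -1;
}

int main(void)
{
        struct memslot slot = { .npages = 1 << 18 };    /* e.g. 1GiB of 4KiB pages */

        /* No write-tracking users yet => no metadata, no memory overhead. */
        printf("metadata allocated: %s\n", slot.write_track ? "yes" : "no");

        /* The first user (nested virt or KVMGT) triggers the allocation. */
        if (slot_enable_write_tracking(&slot))
                return 1;
        printf("metadata allocated: %s\n", slot.write_track ? "yes" : "no");

        free(slot.write_track);
        return 0;
}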

The following changes since commit 41bccc98fb7931d63d03f326a746ac4d429c1dd3:

  Linux 6.8-rc2 (2024-01-28 17:01:12 -0800)

are available in the Git repository at:

  https://github.com/kvm-x86/linux.git tags/kvm-x86-mmu-6.9

for you to fetch changes up to a364c014a2c1ad6e011bc5fdb8afb9d4ba316956:

  kvm/x86: allocate the write-tracking metadata on-demand (2024-02-27 11:49:54 -0800)

----------------------------------------------------------------
KVM x86 MMU changes for 6.9:

 - Clean up code related to unprotecting shadow pages when retrying a guest
   instruction after failed #PF-induced emulation.

 - Zap TDP MMU roots at 4KiB granularity to minimize the delay in yielding if
   a reschedule is needed, e.g. if a high priority task needs to run.  Because
   KVM doesn't support yielding in the middle of processing a zapped non-leaf
   SPTE, zapping at 1GiB granularity can result in multi-millisecond lag when
   attempting to schedule in a high priority task (see the first sketch after
   this list).

 - Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for
   read, e.g. to avoid serializing vCPUs when userspace deletes a memslot (see
   the second sketch after this list).

 - Allocate write-tracking metadata on-demand to avoid the memory overhead when
   running kernels built with KVMGT support (external write-tracking enabled)
   for workloads that don't use nested virtualization (shadow paging) or KVMGT.
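
As a purely illustrative aside, the first sketch below shows the fine-grained
zap idea: process one 4KiB unit at a time and check for a pending reschedule
between units, so the worst-case non-yieldable work is one small unit rather
than an entire 1GiB subtree.  This is a standalone sketch, NOT actual KVM
code; the names, sizes, and the resched hook are made up for the example.

/*
 * Standalone sketch, NOT KVM code: zap a region one 4KiB unit at a time so
 * that a pending reschedule can be honored between units, instead of only
 * after an entire 1GiB subtree has been processed.
 */
#include <sched.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SPTE_SIZE       4096ULL                 /* one 4KiB mapping */
#define REGION_SIZE     (1ULL << 30)            /* a 1GiB region to zap */
#define NR_SPTES        (REGION_SIZE / SPTE_SIZE)

static uint64_t sptes[NR_SPTES];                /* stand-in for leaf SPTEs */

/* Stand-in for need_resched()/rwlock_needbreak() in the kernel. */
static bool resched_requested(void)
{
        return false;
}

static void zap_region_fine_grained(void)
{
        for (size_t i = 0; i < NR_SPTES; i++) {
                sptes[i] = 0;                   /* "zap" one 4KiB mapping */

                /*
                 * The work done between checks is bounded by one 4KiB unit,
                 * so a pending reschedule is honored within microseconds.
                 */
                if (resched_requested())
                        sched_yield();
        }
}

int main(void)
{
        memset(sptes, 0xff, sizeof(sptes));     /* pretend the region is mapped */
        zap_region_fine_grained();
        return 0;
}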

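The second sketch illustrates the "run with the lock held only for read"
pattern for root allocation.  Again, this is a standalone sketch and not the
KVM implementation: a pthread rwlock stands in for mmu_lock and a C11 atomic
pointer for the shared root.  Because readers don't exclude each other, the
new root must be published with a compare-and-exchange, and a vCPU that loses
the race simply frees its speculative allocation and uses the winner's root.

/*
 * Standalone sketch, NOT KVM code: install a shared "root" while holding a
 * lock only for read.  Readers don't exclude each other, so publication is
 * done with a compare-and-exchange and the loser frees its allocation.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct root {
        int id;
};

static pthread_rwlock_t mmu_lock = PTHREAD_RWLOCK_INITIALIZER;
static _Atomic(struct root *) current_root;

static struct root *get_or_alloc_root(int id)
{
        struct root *root, *expected = NULL;

        pthread_rwlock_rdlock(&mmu_lock);

        /* Fast path: another vCPU already installed a usable root. */
        root = atomic_load(&current_root);
        if (root)
                goto out;

        /* Speculatively allocate; other readers may be doing the same. */
        root = malloc(sizeof(*root));
        if (!root)
                goto out;                       /* error handling elided */
        root->id = id;

        /* Publish with cmpxchg; exactly one allocation wins the race. */
        if (!atomic_compare_exchange_strong(&current_root, &expected, root)) {
                free(root);                     /* lost the race */
                root = expected;                /* use the winner's root */
        }
out:
        pthread_rwlock_unlock(&mmu_lock);
        return root;
}

int main(void)
{
        printf("root id = %d\n", get_or_alloc_root(1)->id);
        printf("root id = %d\n", get_or_alloc_root(2)->id);    /* reuses the first root */
        return 0;
}
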
----------------------------------------------------------------
Andrei Vagin (1):
      kvm/x86: allocate the write-tracking metadata on-demand

Kunwu Chan (1):
      KVM: x86/mmu: Use KMEM_CACHE instead of kmem_cache_create()

Mingwei Zhang (1):
      KVM: x86/mmu: Don't acquire mmu_lock when using indirect_shadow_pages as a heuristic

Sean Christopherson (10):
      KVM: x86: Drop dedicated logic for direct MMUs in reexecute_instruction()
      KVM: x86: Drop superfluous check on direct MMU vs. WRITE_PF_TO_SP flag
      KVM: x86/mmu: Zap invalidated TDP MMU roots at 4KiB granularity
      KVM: x86/mmu: Don't do TLB flush when zapping SPTEs in invalid roots
      KVM: x86/mmu: Allow passing '-1' for "all" as_id for TDP MMU iterators
      KVM: x86/mmu: Skip invalid roots when zapping leaf SPTEs for GFN range
      KVM: x86/mmu: Skip invalid TDP MMU roots when write-protecting SPTEs
      KVM: x86/mmu: Check for usable TDP MMU root while holding mmu_lock for read
      KVM: x86/mmu: Alloc TDP MMU roots while holding mmu_lock for read
      KVM: x86/mmu: Free TDP MMU roots while holding mmu_lock for read

 arch/x86/include/asm/kvm_host.h |   9 +++
 arch/x86/kvm/mmu/mmu.c          |  37 +++++++-----
 arch/x86/kvm/mmu/page_track.c   |  68 +++++++++++++++++++++-
 arch/x86/kvm/mmu/tdp_mmu.c      | 124 ++++++++++++++++++++++++++++------------
 arch/x86/kvm/mmu/tdp_mmu.h      |   2 +-
 arch/x86/kvm/x86.c              |  35 +++++-------
 6 files changed, 201 insertions(+), 74 deletions(-)

Comments

Paolo Bonzini March 11, 2024, 2:30 p.m. UTC | #1
On 3/8/24 23:36, Sean Christopherson wrote:
> The bulk of the changes are TDP MMU improvements related to memslot deletion
> (ChromeOS has a use case that "requires" frequent deletion of a GPU buffer).
> The other highlight is allocating the write-tracking metadata on-demand, e.g.
> so that distro kernels pay the memory cost of the arrays if and only if KVM
> or KVMGT actually needs to shadow guest page tables.
> 
> The following changes since commit 41bccc98fb7931d63d03f326a746ac4d429c1dd3:
> 
>    Linux 6.8-rc2 (2024-01-28 17:01:12 -0800)
> 
> are available in the Git repository at:
> 
>    https://github.com/kvm-x86/linux.git tags/kvm-x86-mmu-6.9
> 
> for you to fetch changes up to a364c014a2c1ad6e011bc5fdb8afb9d4ba316956:
> 
>    kvm/x86: allocate the write-tracking metadata on-demand (2024-02-27 11:49:54 -0800)

Pulled, thanks.

Paolo

Paolo Bonzini March 14, 2024, 6:31 p.m. UTC | #2
On Fri, Mar 8, 2024 at 11:37 PM Sean Christopherson <seanjc@google.com> wrote:
>
>  - Zap TDP MMU roots at 4KiB granularity to minimize the delay in yielding if
>    a reschedule is needed, e.g. if a high priority task needs to run.  Because
>    KVM doesn't support yielding in the middle of processing a zapped non-leaf
>    SPTE, zapping at 1GiB granularity can result in multi-millisecond lag when
>    attempting to schedule in a high priority.
>

Would 2 MiB provide a nice middle ground?

Paolo
Sean Christopherson March 14, 2024, 6:38 p.m. UTC | #3
On Thu, Mar 14, 2024, Paolo Bonzini wrote:
> On Fri, Mar 8, 2024 at 11:37 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> >  - Zap TDP MMU roots at 4KiB granularity to minimize the delay in yielding if
> >    a reschedule is needed, e.g. if a high priority task needs to run.  Because
> >    KVM doesn't support yielding in the middle of processing a zapped non-leaf
> >    SPTE, zapping at 1GiB granularity can result in multi-millisecond lag when
> >    attempting to schedule in a high priority.
> >
> 
> Would 2 MiB provide a nice middle ground?

Not really?

Zapping at 2MiB definitely fixes the worst of the tail latencies, but there is
still a measurable difference between 2MiB and 4KiB.  And on the other side of the
coin, I was unable to observe a meaningful difference in total runtime by zapping
at 2MiB, or even 1GiB, versus 4KiB.

In other words, AFAICT, there's no need to shoot for a middle ground because trying
to zap at larger granularities doesn't buy us anything.
Paolo Bonzini March 14, 2024, 6:43 p.m. UTC | #4
On Thu, Mar 14, 2024 at 7:38 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Mar 14, 2024, Paolo Bonzini wrote:
> > On Fri, Mar 8, 2024 at 11:37 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > >  - Zap TDP MMU roots at 4KiB granularity to minimize the delay in yielding if
> > >    a reschedule is needed, e.g. if a high priority task needs to run.  Because
> > >    KVM doesn't support yielding in the middle of processing a zapped non-leaf
> > >    SPTE, zapping at 1GiB granularity can result in multi-millisecond lag when
> > >    attempting to schedule in a high priority.
> > >
> >
> > Would 2 MiB provide a nice middle ground?
>
> Not really?
>
> Zapping at 2MiB definitely fixes the worst of the tail latencies, but there is
> still a measurable difference between 2MiB and 4KiB.

Yeah, but you said multi-millisecond, so I guessed 5/512 of that is ~10
microseconds of latency, which should be pretty acceptable (for the PREEMPT_RT
tests at Red Hat we shoot for 10-15 us worst case, so for CONFIG_PREEMPT it
would be more than enough).
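
(Back-of-the-envelope version of that estimate, assuming ~5 ms to zap a full
1GiB subtree without yielding: a 2MiB unit is 1/512 of 1GiB, so the worst-case
non-yieldable chunk drops to roughly 5 ms / 512 ~= 10 us.)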

> And on the other side of the
> coin, I was unable to observe a meaningful difference in total runtime by zapping
> at 2MiB, or even 1GiB, versus 4KiB.

Ok, that's the answer.

Paolo
