
[00/22] Introduce the TDP MMU

Message ID: 20200925212302.3979661-1-bgardon@google.com

Message

Ben Gardon Sept. 25, 2020, 9:22 p.m. UTC
Over the years, the needs for KVM's x86 MMU have grown from running small
guests to live migrating multi-terabyte VMs with hundreds of vCPUs. Where
we previously depended on shadow paging to run all guests, we now have
two-dimensional paging (TDP). This patch set introduces a new
implementation of much of the KVM MMU, optimized for running guests with
TDP. We have re-implemented many of the MMU functions to take advantage of
the relative simplicity of TDP and eliminate the need for an rmap.
Building on this simplified implementation, a future patch set will change
the synchronization model for this "TDP MMU" to enable more parallelism
than the monolithic MMU lock. A TDP MMU is currently in use at Google
and has given us the performance necessary to live migrate our 416 vCPU,
12TiB m2-ultramem-416 VMs.

This work was motivated by the need to handle page faults in parallel for
very large VMs. When VMs have hundreds of vCPUs and terabytes of memory,
KVM's MMU lock suffers extreme contention, resulting in soft-lockups and
long latency on guest page faults. This contention is easy to observe by
running the KVM selftests demand_paging_test with a couple hundred vCPUs.
Over a 1-second profile of the demand_paging_test, with 416 vCPUs and 4G
per vCPU, 98% of the time was spent waiting for the MMU lock. At Google,
the TDP MMU reduced the test duration by 89% and the execution was
dominated by get_user_pages and the userfaultfd ioctl instead of the
MMU lock.

This series is the first of two. In this series we add a basic
implementation of the TDP MMU. In the next series we will improve the
performance of the TDP MMU and allow it to execute MMU operations
in parallel.

The overall purpose of the KVM MMU is to program paging structures
(CR3/EPT/NPT) to encode the mapping of guest addresses to host physical
addresses (HPA), and to provide utilities for other KVM features, for
example dirty logging. The definition of the L1 guest physical address
(GPA) to HPA mapping comes in two parts: KVM's memslots map GPA to host
virtual addresses (HVA), and the kernel MM/x86 host page tables map
HVA -> HPA. Without TDP, the
MMU must program the x86 page tables to encode the full translation of
guest virtual addresses (GVA) to HPA. This requires "shadowing" the
guest's page tables to create a composite x86 paging structure. This
solution is complicated, requires separate paging structures for each
guest CR3, and requires emulating guest page table changes. The TDP case
is much simpler. In this case, KVM lets the guest control CR3 and programs
the EPT/NPT paging structures with the GPA -> HPA mapping. The guest has
no way to change this mapping and only one version of the paging structure
is needed per L1 paging mode. In this case the paging mode is some
combination of the number of levels in the paging structure, the address
space (normal execution or system management mode, on x86), and other
attributes. Most VMs only ever use one paging mode and so only ever need one
TDP structure.
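
To make the two halves of that GPA -> HPA mapping concrete, here is a
minimal sketch in the spirit of KVM's memslot lookup; the struct and
function names are simplified for illustration and are not the kernel's
actual identifiers:

    #include <linux/types.h>

    /* A memslot maps a contiguous range of guest frames to an HVA range. */
    struct memslot_sketch {
            u64 base_gfn;                 /* first guest frame number in slot */
            u64 npages;                   /* slot size in pages */
            unsigned long userspace_addr; /* HVA backing base_gfn */
    };

    /* Half 1: memslots define GPA (gfn) -> HVA. */
    static unsigned long gfn_to_hva_sketch(struct memslot_sketch *slot, u64 gfn)
    {
            return slot->userspace_addr + (gfn - slot->base_gfn) * 4096;
    }

    /*
     * Half 2: the host MM's page tables define HVA -> HPA; KVM resolves
     * that half (e.g. via get_user_pages()) and installs the resulting
     * HPA in the EPT/NPT paging structure.
     */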

This series implements a "TDP MMU" through alternative implementations of
MMU functions for running L1 guests with TDP. The TDP MMU falls back to
the existing shadow paging implementation when TDP is not available, and
interoperates with the existing shadow paging implementation for nesting.
The use of the TDP MMU can be controlled by a module parameter which is
snapshotted at VM creation and follows the life of the VM. This snapshot
is used in many functions to decide whether or not to use TDP MMU handlers
for a given operation.
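
A minimal sketch of that snapshot pattern (the parameter, struct, and
function names here are made up for illustration; they are not the
series' actual identifiers):

    #include <linux/module.h>

    static bool tdp_mmu_enabled = true;
    module_param(tdp_mmu_enabled, bool, 0644);

    struct vm_mmu_cfg {
            bool tdp_mmu_enabled;   /* snapshotted at VM creation */
    };

    static void vm_mmu_cfg_init(struct vm_mmu_cfg *cfg)
    {
            /* Later writes to the module param don't affect this VM. */
            cfg->tdp_mmu_enabled = READ_ONCE(tdp_mmu_enabled);
    }

    static bool vm_uses_tdp_mmu(struct vm_mmu_cfg *cfg)
    {
            /* Checked by MMU entry points to pick TDP MMU vs. shadow paths. */
            return cfg->tdp_mmu_enabled;
    }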

This series can also be viewed in Gerrit here:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538
(Thanks to Dmitry Vyukov <dvyukov@google.com> for setting up the
Gerrit instance)

Ben Gardon (22):
  kvm: mmu: Separate making SPTEs from set_spte
  kvm: mmu: Introduce tdp_iter
  kvm: mmu: Init / Uninit the TDP MMU
  kvm: mmu: Allocate and free TDP MMU roots
  kvm: mmu: Add functions to handle changed TDP SPTEs
  kvm: mmu: Make address space ID a property of memslots
  kvm: mmu: Support zapping SPTEs in the TDP MMU
  kvm: mmu: Separate making non-leaf sptes from link_shadow_page
  kvm: mmu: Remove disallowed_hugepage_adjust shadow_walk_iterator arg
  kvm: mmu: Add TDP MMU PF handler
  kvm: mmu: Factor out allocating a new tdp_mmu_page
  kvm: mmu: Allocate struct kvm_mmu_pages for all pages in TDP MMU
  kvm: mmu: Support invalidate range MMU notifier for TDP MMU
  kvm: mmu: Add access tracking for tdp_mmu
  kvm: mmu: Support changed pte notifier in tdp MMU
  kvm: mmu: Add dirty logging handler for changed sptes
  kvm: mmu: Support dirty logging for the TDP MMU
  kvm: mmu: Support disabling dirty logging for the tdp MMU
  kvm: mmu: Support write protection for nesting in tdp MMU
  kvm: mmu: NX largepage recovery for TDP MMU
  kvm: mmu: Support MMIO in the TDP MMU
  kvm: mmu: Don't clear write flooding count for direct roots

 arch/x86/include/asm/kvm_host.h |   17 +
 arch/x86/kvm/Makefile           |    3 +-
 arch/x86/kvm/mmu/mmu.c          |  437 ++++++----
 arch/x86/kvm/mmu/mmu_internal.h |   98 +++
 arch/x86/kvm/mmu/paging_tmpl.h  |    3 +-
 arch/x86/kvm/mmu/tdp_iter.c     |  198 +++++
 arch/x86/kvm/mmu/tdp_iter.h     |   55 ++
 arch/x86/kvm/mmu/tdp_mmu.c      | 1315 +++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.h      |   52 ++
 include/linux/kvm_host.h        |    2 +
 virt/kvm/kvm_main.c             |    7 +-
 11 files changed, 2022 insertions(+), 165 deletions(-)
 create mode 100644 arch/x86/kvm/mmu/tdp_iter.c
 create mode 100644 arch/x86/kvm/mmu/tdp_iter.h
 create mode 100644 arch/x86/kvm/mmu/tdp_mmu.c
 create mode 100644 arch/x86/kvm/mmu/tdp_mmu.h

Comments

Paolo Bonzini Sept. 26, 2020, 1:14 a.m. UTC | #1
On 25/09/20 23:22, Ben Gardon wrote:
> Over the years, the needs for KVM's x86 MMU have grown from running small
> guests to live migrating multi-terabyte VMs with hundreds of vCPUs. Where
> we previously depended on shadow paging to run all guests, we now have
> two-dimensional paging (TDP). This patch set introduces a new
> implementation of much of the KVM MMU, optimized for running guests with
> TDP.
> 
> [...]

Ok, I've not finished reading the code, but I already have an idea of
what it's like.  I really think we should fast-track this as the basis
for more 5.11 work.  I'll finish reviewing it and, if you don't mind, I
might make some of the changes myself so I have the occasion to play and
get accustomed to the code; speak up if you disagree with them though!
Another thing I'd like to add is a few tracepoints.

Paolo
Paolo Bonzini Sept. 28, 2020, 5:31 p.m. UTC | #2
On 25/09/20 23:22, Ben Gardon wrote:
> This series is the first of two. In this series we add a basic
> implementation of the TDP MMU. In the next series we will improve the
> performance of the TDP MMU and allow it to execute MMU operations
> in parallel.

I have finished rebasing and adding a few cleanups on top, but I don't
have time to test it today.  I think the changes shouldn't get too much
in the way of the second series, but I've also pushed your v1 unmodified
to kvm/tdp-mmu for future convenience.  I'll await your feedback in
the meanwhile!

One feature that I noticed is missing is the shrinker.  What are your
plans (or opinions) around it?

Also, the code generally assumes a 64-bit CPU (i.e. that writes to 64-bit
PTEs are atomic).  That is not a big issue; it just needs a small change
on top to make the TDP MMU conditional on CONFIG_X86_64.
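
Concretely, the SPTE updates boil down to plain 64-bit stores along the
lines of this sketch (the name is illustrative, not the actual helper):

    static void tdp_set_spte_sketch(u64 *sptep, u64 new_spte)
    {
            /* One store: atomic on a 64-bit host, can tear on 32-bit x86. */
            WRITE_ONCE(*sptep, new_spte);
    }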

Thanks,

Paolo
Ben Gardon Sept. 29, 2020, 5:40 p.m. UTC | #3
On Mon, Sep 28, 2020 at 10:31 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 25/09/20 23:22, Ben Gardon wrote:
> > This series is the first of two. In this series we add a basic
> > implementation of the TDP MMU. In the next series we will improve the
> > performance of the TDP MMU and allow it to execute MMU operations
> > in parallel.
>
> I have finished rebasing and adding a few cleanups on top, but I don't
> have time to test it today.  I think the changes shouldn't get too much
> in the way of the second series, but I've also pushed your v1 unmodified
> to kvm/tdp-mmu for future convenience.  I'll await for your feedback in
> the meanwhile!

Awesome, thank you for the reviews and positive response! I'll get to
work responding to your comments and preparing a v2.

>
> One feature that I noticed is missing is the shrinker.  What are your
> plans (or opinions) around it?

I assume by the shrinker you mean the page table quota that controls
how many pages the MMU can use at a time to back guest memory?
I think the shrinker is less important for the TDP MMU as there is an
implicit limit on how much memory it will use to back guest memory: with
only one TDP structure per paging mode, the page tables themselves are
bounded by the size of guest memory, roughly 1/512 of it when mapping at
4KiB granularity, since each 4KiB leaf table covers 2MiB.
You could set the limit smaller than the number of pages required to
fully map the guest's memory, but I'm not really sure why you would
want to in a practical scenario. I understand the quota's importance
for x86 shadow paging and nested TDP scenarios where the guest could
cause KVM to allocate an unbounded amount of memory for page tables,
but the guest does not have this power in the non-nested TDP scenario.
Really, I didn't include it in this series because we haven't needed
it at Google and so I never implemented the quota enforcement. It
shouldn't be difficult to implement if you think it's worth having,
and it'll be needed to support nested on the TDP MMU (without using
the shadow MMU) anyway. If you're okay with leaving it out of the
initial patch set though, I'm inclined to do that.

>
> Also, the code generally assumes a 64-bit CPU (i.e. that writes to 64-bit
> PTEs are atomic).  That is not a big issue; it just needs a small change
> on top to make the TDP MMU conditional on CONFIG_X86_64.

Ah, that didn't occur to me. Thank you for pointing that out. I'll fix
that oversight in v2.

Paolo Bonzini Sept. 29, 2020, 6:10 p.m. UTC | #4
On 29/09/20 19:40, Ben Gardon wrote:
> I'll get to work responding to your comments and preparing a v2.

Please do respond to the comments, but I've actually already done most
of the changes (I'm bad at reviewing code without tinkering).  NX
recovery seems broken, but we can leave it out in the beginning as it's
fairly self-contained.

I was going to post today, but I was undecided about whether to leave
out NX or try and fix it.

>> One feature that I noticed is missing is the shrinker.  What are your
>> plans (or opinions) around it?
> I assume by the shrinker you mean the page table quota that controls
> how many pages the MMU can use at a time to back guest memory?
> I think the shrinker is less important for the TDP MMU as there is an
> implicit limit on how much memory it will use to back guest memory.

Good point.  That's why I asked for opinions too.

Paolo
Sean Christopherson Sept. 30, 2020, 6:19 a.m. UTC | #5
In case Paolo is feeling trigger-happy, I'm going to try and get through the
second half of this series tomorrow.
Paolo Bonzini Sept. 30, 2020, 6:30 a.m. UTC | #6
On 30/09/20 08:19, Sean Christopherson wrote:
> In case Paolo is feeling trigger-happy, I'm going to try and get through the
> second half of this series tomorrow.

I'm indeed feeling trigger-happy about this series, but I wasn't
planning to include it in kvm.git this week.  I'll have my version
posted by tomorrow, and I'll include some of your feedback already when
it does not make incremental review too much harder.

Paolo