[RFC,00/28] kvm: mmu: Rework the x86 TDP direct mapped case

Message ID 20190926231824.149014-1-bgardon@google.com

Message

Ben Gardon Sept. 26, 2019, 11:17 p.m. UTC
Over the years, the needs for KVM's x86 MMU have grown from running small
guests to live migrating multi-terabyte VMs with hundreds of vCPUs. Where
we previously depended upon shadow paging to run all guests, we now have
the use of two dimensional paging (TDP). This RFC proposes and
demonstrates two major changes to the MMU. First, an iterator abstraction 
that simplifies traversal of TDP paging structures when running an L1
guest. This abstraction takes advantage of the relative simplicity of TDP
to simplify the implementation of MMU functions. Second, this RFC changes
the synchronization model to enable more parallelism than the monolithic
MMU lock. This "direct mode" MMU is currently in use at Google and has
given us the performance necessary to live migrate our 416 vCPU, 12TiB
m2-ultramem-416 VMs.

The primary motivation for this work was to handle page faults in
parallel. When VMs have hundreds of vCPUs and terabytes of memory, KVM's
MMU lock suffers from extreme contention, resulting in soft-lockups and
jitter in the guest. To demonstrate this, I have also written, and will
submit, a demand paging test to KVM selftests. The test creates N vCPUs, which
each touch disjoint regions of memory. Page faults are picked up by N
user fault FD handlers, one for each vCPU. Over a 1 second profile of
the demand paging test, with 416 vCPUs and 4G per vCPU, 98% of the
execution time was spent waiting for the MMU lock! With this patch
series the total execution time for the test was reduced by 89% and the
execution was dominated by get_user_pages and the user fault FD ioctl.
As a secondary benefit, the iterator-based implementation does not use
the rmap or struct kvm_mmu_pages, saving ~0.2% of guest memory in KVM
overheads.
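
In outline, the test's topology might look like the sketch below; the
thread functions and argument types are hypothetical stand-ins, not the
selftest's actual names:

  #include <pthread.h>

  /* Hypothetical thread mains: vcpu_worker() touches only its own
   * disjoint slice of guest memory; uffd_handler() resolves the
   * userfaultfd demand paging faults for that slice, e.g. with
   * UFFDIO_COPY. */
  extern void *vcpu_worker(void *arg);
  extern void *uffd_handler(void *arg);

  static void start_test(int nr_vcpus, void **vcpu_args, void **uffd_args,
                         pthread_t *vcpu_threads, pthread_t *uffd_threads)
  {
          for (int i = 0; i < nr_vcpus; i++) {
                  /* One vCPU and one fault handler per memory region. */
                  pthread_create(&vcpu_threads[i], NULL, vcpu_worker,
                                 vcpu_args[i]);
                  pthread_create(&uffd_threads[i], NULL, uffd_handler,
                                 uffd_args[i]);
          }
  }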

The goal of this RFC is to demonstrate and gather feedback on the
iterator pattern, the memory savings it enables for the "direct case"
and the changes to the synchronization model. Though they are interwoven
in this series, I will separate the iterator from the synchronization
changes in a future series. I recognize that some feature work will be
needed to make this patch set ready for merging. That work is detailed
at the end of this cover letter.

The overall purpose of the KVM MMU is to program paging structures
(CR3/EPT/NPT) to encode the mapping of guest addresses to host physical
addresses (HPA), and to provide utilities for other KVM features, for
example dirty logging. The definition of the L1 guest physical address
(GPA) to HPA mapping comes in two parts: KVM's memslots map GPA to HVA,
and the kernel MM/x86 host page tables map HVA -> HPA. Without TDP, the
MMU must program the x86 page tables to encode the full translation of
guest virtual addresses (GVA) to HPA. This requires "shadowing" the
guest's page tables to create a composite x86 paging structure. This
solution is complicated, requires separate paging structures for each
guest CR3, and requires emulating guest page table changes. The TDP case
is much simpler. In this case, KVM lets the guest control CR3 and
programs the EPT/NPT paging structures with the GPA -> HPA mapping. The
guest has no way to change this mapping and only one version of the
paging structure is needed per L1 address space (normal execution or
system management mode, on x86).
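
As a concrete sketch of the two-part GPA -> HPA definition, consider
the following; the struct and helper are simplified, illustrative
stand-ins, not KVM's actual definitions:

  #include <stdint.h>

  #define PAGE_SIZE 4096ULL

  /* Simplified stand-in for a KVM memslot: a contiguous range of guest
   * frame numbers backed by a host virtual address range. */
  struct memslot {
          uint64_t base_gfn;        /* first GFN covered by the slot */
          uint64_t npages;          /* guest pages in the slot */
          uint64_t userspace_addr;  /* HVA backing the slot */
  };

  /* Part 1: memslots translate an L1 GPA (as a GFN) to an HVA. */
  static uint64_t slot_gfn_to_hva(const struct memslot *slot, uint64_t gfn)
  {
          return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
  }

  /* Part 2 is the host MM's page tables, which map that HVA to an HPA.
   * With TDP, KVM installs the composed GPA -> HPA translation in the
   * EPT/NPT structure; the guest's own CR3 covers GVA -> GPA. */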

This RFC implements a "direct MMU" through alternative implementations
of MMU functions for running L1 guests with TDP. The direct MMU gets its
name from the direct role bit in struct kvm_mmu_page in the existing MMU
implementation, which indicates that the PTEs in a page table (and their
children) map a linear range of L1 GPAs. Though the direct MMU does not
currently use struct kvm_mmu_page, all of its pages would implicitly
have that bit set. The direct MMU falls back to the existing shadow
paging implementation when TDP is not available, and interoperates with
the existing shadow paging implementation for nesting. 

In order to handle page faults in parallel, the MMU needs to allow a
variety of changes to PTEs concurrently. The first step in this series
is to replace the MMU lock with a read/write lock to enable multiple
threads to perform operations at the same time and interoperate with
functions that still need the monolithic lock. With threads handling
page faults in parallel, the functions operating on the page table
need to: a) ensure PTE modifications are atomic, and b) ensure that page
table memory is freed and accessed safely. Conveniently, the iterator
pattern introduced in this series handles both concerns.
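
A minimal sketch of requirement a): under the read lock, a PTE update
becomes a compare-and-exchange that fails when it races with another
thread. The helper below is hypothetical; in this series the logic
lives inside the iterator.

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdint.h>

  /* Attempt to change a PTE from old_pte to new_pte. Returns false if
   * another thread modified the PTE first, in which case the caller
   * must re-read the PTE and retry (or abandon) its change. */
  static bool try_set_pte(_Atomic uint64_t *ptep, uint64_t old_pte,
                          uint64_t new_pte)
  {
          return atomic_compare_exchange_strong(ptep, &old_pte, new_pte);
  }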

The direct walk iterator implements a pre-order traversal of the TDP
paging structures. Threads can read and write page table memory safely
during this traversal through the use of RCU, and page table memory is
freed in RCU callbacks as part of a three-step process. (More on that
below.) To ensure that PTEs are updated atomically, the iterator
provides a function for updating the current PTE. If the update
succeeds, the iterator handles bookkeeping based on the current and
previous value of the PTE. If it fails, some other thread will have
succeeded, and the iterator repeats that PTE on the next iteration,
transparently retrying the operation. The iterator also handles
yielding and reacquiring the appropriate MMU lock, and flushing the TLB
or queuing work to be done on the next flush.
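
Code built on the iterator might look roughly like the sketch below.
The type and function names are guesses at the shape of the API based
on the description above, not the exact names used in the series:

  #include <stdbool.h>
  #include <stdint.h>

  /* Hypothetical iterator state; the series' actual struct differs. */
  struct direct_walk_iterator {
          uint64_t gfn;      /* GFN mapped by the current PTE */
          uint64_t old_pte;  /* PTE value observed at this position */
          int level;         /* paging level of the current PTE */
          /* ... walk position, lock and RCU bookkeeping ... */
  };

  /* Assumed API shape: advance the pre-order walk, and atomically
   * update the current PTE (with bookkeeping, retry on cmpxchg
   * failure, and TLB-dirty accounting handled by the iterator). */
  extern bool direct_walk_iterator_next_pte(struct direct_walk_iterator *it);
  extern void direct_walk_iterator_set_pte(struct direct_walk_iterator *it,
                                           uint64_t new_pte);

  /* Example traversal: write-protect every PTE the walk visits by
   * clearing bit 1, the x86 R/W bit. */
  static void sketch_wrprot_range(struct direct_walk_iterator *it)
  {
          while (direct_walk_iterator_next_pte(it))
                  direct_walk_iterator_set_pte(it,
                                               it->old_pte & ~(1ULL << 1));
  }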

In order to minimize TLB flushes, we expand the tlbs_dirty count to
track unflushed changes made through the iterator, so that other threads
know that the in-memory page tables they traverse might not be what the
guest is using to access memory. Page table pages that have been
disconnected from the paging structure root are freed in a three-step
process. First, the pages are filled with special, nonpresent PTEs so
that guest accesses to them through the paging structure caches result
in TDP page faults. Second, the pages are added to a disconnected list,
a snapshot of which is transferred to a free list after each TLB flush.
The TLB flush clears the paging structure caches, so the guest will no
longer use the disconnected pages. Lastly, the free list is processed
asynchronously to queue RCU callbacks which free the memory. The RCU
grace period ensures no kernel threads are using the disconnected pages.
This allows the MMU to leave the guest in an inconsistent, but safe,
state with respect to the in-memory paging structure. When functions
need to guarantee that the guest will use the in-memory state after a
traversal, they can either flush the TLBs unconditionally or, if using
the MMU lock in write mode, flush the TLBs under the lock only if the
tlbs_dirty count is elevated.
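
In kernel-flavored pseudocode, and with illustrative names for the
lists, fields, and helpers (these are not the series' identifiers), the
three steps look roughly like:

  /* Step 1: fill the disconnected page with nonpresent poison PTEs so
   * guest accesses through the paging structure caches fault rather
   * than observe stale translations, then record the page. */
  for (i = 0; i < PTES_PER_PAGE; i++)
          WRITE_ONCE(pt_page->sptes[i], NONPRESENT_POISON_SPTE);
  list_add(&pt_page->link, &kvm->arch.disconnected_pages);

  /* Step 2, after each TLB flush: the flush has cleared the paging
   * structure caches, so the guest can no longer reach these pages;
   * snapshot the disconnected list onto the free list. */
  list_splice_init(&kvm->arch.disconnected_pages, &kvm->arch.free_pages);

  /* Step 3, asynchronously: queue an RCU callback per page; the grace
   * period guarantees no kernel thread still holds a pointer obtained
   * during a concurrent walk before the memory is freed. */
  list_for_each_entry_safe(pt_page, tmp, &kvm->arch.free_pages, link)
          call_rcu(&pt_page->rcu_head, free_pt_rcu_callback);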

The use of the direct MMU can be controlled by a module parameter which
is snapshotted on VM creation and follows the life of the VM. This
snapshot is used in many functions to decide whether or not to use
direct MMU handlers for a given operation. This is a maintenance
burden, and in future versions of this series I will address it by
removing some of the code the direct MMU replaces. I am especially
interested in feedback from the community on how this series can best
be merged. I see two broad approaches: replacement and integration, or
modularization.

Replacement and integration would require amending the existing shadow
paging implementation to use a similar iterator pattern. This would mean
expanding the iterator to work with an rmap to support shadow paging and
reconciling the synchronization changes made to the direct case with the
complexities of shadow paging and nesting.

The modularization approach would require factoring out the "direct MMU"
or "TDP MMU" and "shadow MMU(s)." The function pointers in the MMU
struct would need to be expanded to fully encompass the interface of the
MMU and multiple, simpler, implementations of those functions would be
needed. As it is, use of the module parameter snapshot gives us a rough
outline of the previously undocumented shape of the MMU interface, which
could facilitate modularization. Modularization could also allow the
shadow paging implementation for running guests without TDP to be
separated from the one for running nested guests with TDP, and could
enable the breakup of paging_tmpl.h.

In addition to the integration question, below are some of the work
items I plan to address before sending the series out again:

Disentangle the iterator pattern from the synchronization changes
	Currently the direct_walk_iterator is very closely tied to the use
	of atomic operations, RCU, and a rwlock for MMU operations. This
	does not need to be the case: instead I would like to see those
	synchronization changes built on top of this iterator pattern.

Support 5-level paging and PAE
	Currently the direct walk iterator only supports 4-level paging
	on 64-bit architectures.

Support MMU memory reclaim
	Currently this patch series does not respect memory limits applied
	through kvm_vm_ioctl_set_nr_mmu_pages.

Support nonpaging guests
	Guests that are not using virtual addresses can be direct mapped,
	even without TDP.

Implement fast invalidation of all PTEs
	This series was prepared between when the fast invalidate_all
	mechanism was removed and when it was re-added. Currently, there
	is no fast path for invalidating all direct MMU PTEs.

Move more operations to execute concurrently
	In this patch series, only page faults are able to execute
	concurrently; however, several other functions could also run
	concurrently simply by changing the write lock acquisition to a
	read lock, as in the sketch below.
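
Illustrative only, not a hunk from this series: for a function whose
PTE updates are already atomic, the conversion has roughly this shape:

  -	write_lock(&kvm->mmu_lock);
  +	read_lock(&kvm->mmu_lock);
  	/* ... work whose PTE updates are already atomic ... */
  -	write_unlock(&kvm->mmu_lock);
  +	read_unlock(&kvm->mmu_lock);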

This series can also be viewed in Gerrit here:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/1416 (Thanks to
Dmitry Vyukov <dvyukov@google.com> for setting up the Gerrit instance)

Ben Gardon (28):
  kvm: mmu: Separate generating and setting mmio ptes
  kvm: mmu: Separate pte generation from set_spte
  kvm: mmu: Zero page cache memory at allocation time
  kvm: mmu: Update the lpages stat atomically
  sched: Add cond_resched_rwlock
  kvm: mmu: Replace mmu_lock with a read/write lock
  kvm: mmu: Add functions for handling changed PTEs
  kvm: mmu: Init / Uninit the direct MMU
  kvm: mmu: Free direct MMU page table memory in an RCU callback
  kvm: mmu: Flush TLBs before freeing direct MMU page table memory
  kvm: mmu: Optimize for freeing direct MMU PTs on teardown
  kvm: mmu: Set tlbs_dirty atomically
  kvm: mmu: Add an iterator for concurrent paging structure walks
  kvm: mmu: Batch updates to the direct mmu disconnected list
  kvm: mmu: Support invalidate_zap_all_pages
  kvm: mmu: Add direct MMU page fault handler
  kvm: mmu: Add direct MMU fast page fault handler
  kvm: mmu: Add an hva range iterator for memslot GFNs
  kvm: mmu: Make address space ID a property of memslots
  kvm: mmu: Implement the invalidation MMU notifiers for the direct MMU
  kvm: mmu: Integrate the direct mmu with the changed pte notifier
  kvm: mmu: Implement access tracking for the direct MMU
  kvm: mmu: Make mark_page_dirty_in_slot usable from outside kvm_main
  kvm: mmu: Support dirty logging in the direct MMU
  kvm: mmu: Support kvm_zap_gfn_range in the direct MMU
  kvm: mmu: Integrate direct MMU with nesting
  kvm: mmu: Lazily allocate rmap when direct MMU is enabled
  kvm: mmu: Support MMIO in the direct MMU

 arch/x86/include/asm/kvm_host.h |   66 +-
 arch/x86/kvm/Kconfig            |    1 +
 arch/x86/kvm/mmu.c              | 2578 ++++++++++++++++++++++++++-----
 arch/x86/kvm/mmu.h              |    2 +
 arch/x86/kvm/mmutrace.h         |   50 +
 arch/x86/kvm/page_track.c       |    8 +-
 arch/x86/kvm/paging_tmpl.h      |   37 +-
 arch/x86/kvm/vmx/vmx.c          |   10 +-
 arch/x86/kvm/x86.c              |   96 +-
 arch/x86/kvm/x86.h              |    2 +
 include/linux/kvm_host.h        |    6 +-
 include/linux/sched.h           |   11 +
 kernel/sched/core.c             |   23 +
 virt/kvm/kvm_main.c             |   57 +-
 14 files changed, 2503 insertions(+), 444 deletions(-)

Comments

Sean Christopherson Oct. 17, 2019, 6:50 p.m. UTC | #1
On Thu, Sep 26, 2019 at 04:17:56PM -0700, Ben Gardon wrote:
> Over the years, the needs for KVM's x86 MMU have grown from running small
> guests to live migrating multi-terabyte VMs with hundreds of vCPUs. Where
> we previously depended upon shadow paging to run all guests, we now have
> the use of two dimensional paging (TDP). This RFC proposes and
> demonstrates two major changes to the MMU. First, an iterator abstraction 
> that simplifies traversal of TDP paging structures when running an L1
> guest. This abstraction takes advantage of the relative simplicity of TDP
> to simplify the implementation of MMU functions. Second, this RFC changes
> the synchronization model to enable more parallelism than the monolithic
> MMU lock. This "direct mode" MMU is currently in use at Google and has
> given us the performance necessary to live migrate our 416 vCPU, 12TiB
> m2-ultramem-416 VMs.
> 
> The primary motivation for this work was to handle page faults in
> parallel. When VMs have hundreds of vCPUs and terabytes of memory, KVM's
> MMU lock suffers from extreme contention, resulting in soft-lockups and
> jitter in the guest. To demonstrate this, I have also written, and will
> submit, a demand paging test to KVM selftests. The test creates N vCPUs, which
> each touch disjoint regions of memory. Page faults are picked up by N
> user fault FD handlers, one for each vCPU. Over a 1 second profile of
> the demand paging test, with 416 vCPUs and 4G per vCPU, 98% of the
> execution time was spent waiting for the MMU lock! With this patch
> series the total execution time for the test was reduced by 89% and the
> execution was dominated by get_user_pages and the user fault FD ioctl.
> As a secondary benefit, the iterator-based implementation does not use
> the rmap or struct kvm_mmu_pages, saving ~0.2% of guest memory in KVM
> overheads.
> 
> The goal of this RFC is to demonstrate and gather feedback on the
> iterator pattern, the memory savings it enables for the "direct case"
> and the changes to the synchronization model. Though they are interwoven
> in this series, I will separate the iterator from the synchronization
> changes in a future series. I recognize that some feature work will be
> needed to make this patch set ready for merging. That work is detailed
> at the end of this cover letter.

Diving into this series is on my todo list, but realistically that's not
going to happen until after KVM forum.  Sorry I can't provide timely
feedback.
Paolo Bonzini Oct. 18, 2019, 1:42 p.m. UTC | #2
On 17/10/19 20:50, Sean Christopherson wrote:
> On Thu, Sep 26, 2019 at 04:17:56PM -0700, Ben Gardon wrote:
>> Over the years, the needs for KVM's x86 MMU have grown from running small
>> guests to live migrating multi-terabyte VMs with hundreds of vCPUs. Where
>> we previously depended upon shadow paging to run all guests, we now have
>> the use of two dimensional paging (TDP). This RFC proposes and
>> demonstrates two major changes to the MMU. First, an iterator abstraction 
>> that simplifies traversal of TDP paging structures when running an L1
>> guest. This abstraction takes advantage of the relative simplicity of TDP
>> to simplify the implementation of MMU functions. Second, this RFC changes
>> the synchronization model to enable more parallelism than the monolithic
>> MMU lock. This "direct mode" MMU is currently in use at Google and has
>> given us the performance necessary to live migrate our 416 vCPU, 12TiB
>> m2-ultramem-416 VMs.
>>
>> The primary motivation for this work was to handle page faults in
>> parallel. When VMs have hundreds of vCPUs and terabytes of memory, KVM's
>> MMU lock suffers from extreme contention, resulting in soft-lockups and
>> jitter in the guest. To demonstrate this, I have also written, and will
>> submit, a demand paging test to KVM selftests. The test creates N vCPUs, which
>> each touch disjoint regions of memory. Page faults are picked up by N
>> user fault FD handlers, one for each vCPU. Over a 1 second profile of
>> the demand paging test, with 416 vCPUs and 4G per vCPU, 98% of the
>> execution time was spent waiting for the MMU lock! With this patch
>> series the total execution time for the test was reduced by 89% and the
>> execution was dominated by get_user_pages and the user fault FD ioctl.
>> As a secondary benefit, the iterator-based implementation does not use
>> the rmap or struct kvm_mmu_pages, saving ~0.2% of guest memory in KVM
>> overheads.
>>
>> The goal of this RFC is to demonstrate and gather feedback on the
>> iterator pattern, the memory savings it enables for the "direct case"
>> and the changes to the synchronization model. Though they are interwoven
>> in this series, I will separate the iterator from the synchronization
>> changes in a future series. I recognize that some feature work will be
>> needed to make this patch set ready for merging. That work is detailed
>> at the end of this cover letter.
> 
> Diving into this series is on my todo list, but realistically that's not
> going to happen until after KVM forum.  Sorry I can't provide timely
> feedback.

Same here.  I was very lazily waiting to get the big picture from Ben's
talk.

Paolo
Sean Christopherson Nov. 27, 2019, 7:09 p.m. UTC | #3
On Thu, Sep 26, 2019 at 04:17:56PM -0700, Ben Gardon wrote:
> The goal of this RFC is to demonstrate and gather feedback on the
> iterator pattern, the memory savings it enables for the "direct case"
> and the changes to the synchronization model. Though they are interwoven
> in this series, I will separate the iterator from the synchronization
> changes in a future series. I recognize that some feature work will be
> needed to make this patch set ready for merging. That work is detailed
> at the end of this cover letter.

How difficult would it be to send the synchronization changes as a separate
series in the not-too-distant future?  At a brief glance, those changes
appear to be tiny relative to the direct iterator changes.  From a stability
perspective, it would be nice if the locking changes can get upstreamed and
tested in the wild for a few kernel versions before the iterator code is
introduced.
Ben Gardon Dec. 6, 2019, 7:55 p.m. UTC | #4
I'm finally back in the office. Sorry for not getting back to you sooner.
I don't think it would be easy to send the synchronization changes
first. The reason they seem so small is that they're all handled by
the iterator. If we tried to put the synchronization changes in
without the iterator we'd have to 1.) deal with struct kvm_mmu_pages,
2.) deal with the rmap, and 3.) change a huge amount of code to insert
the synchronization changes into the existing framework. The changes
wouldn't be mechanical or easy to insert either since a lot of
bookkeeping is currently done before PTEs are updated, with no
facility for rolling back the bookkeeping on PTE cmpxchg failure. We
could start with the iterator changes and then do the synchronization
changes, but the other way around would be very difficult.


On Wed, Nov 27, 2019 at 11:09 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Thu, Sep 26, 2019 at 04:17:56PM -0700, Ben Gardon wrote:
> > The goal of this RFC is to demonstrate and gather feedback on the
> > iterator pattern, the memory savings it enables for the "direct case"
> > and the changes to the synchronization model. Though they are interwoven
> > in this series, I will separate the iterator from the synchronization
> > changes in a future series. I recognize that some feature work will be
> > needed to make this patch set ready for merging. That work is detailed
> > at the end of this cover letter.
>
> How difficult would it be to send the synchronization changes as a separate
> series in the not-too-distant future?  At a brief glance, those changes
> appear to be tiny relative to the direct iterator changes.  From a stability
> perspective, it would be nice if the locking changes can get upstreamed and
> tested in the wild for a few kernel versions before the iterator code is
> introduced.
Sean Christopherson Dec. 6, 2019, 7:57 p.m. UTC | #5
On Fri, Dec 06, 2019 at 11:55:42AM -0800, Ben Gardon wrote:
> I'm finally back in the office. Sorry for not getting back to you sooner.
> I don't think it would be easy to send the synchronization changes
> first. The reason they seem so small is that they're all handled by
> the iterator. If we tried to put the synchronization changes in
> without the iterator we'd have to 1.) deal with struct kvm_mmu_pages,
> 2.) deal with the rmap, and 3.) change a huge amount of code to insert
> the synchronization changes into the existing framework. The changes
> wouldn't be mechanical or easy to insert either since a lot of
> bookkeeping is currently done before PTEs are updated, with no
> facility for rolling back the bookkeeping on PTE cmpxchg failure. We
> could start with the iterator changes and then do the synchronization
> changes, but the other way around would be very difficult.

By synchronization changes, I meant switching to a r/w lock instead of a
straight spinlock.  Is that doable in a smallish series?
Ben Gardon Dec. 6, 2019, 8:42 p.m. UTC | #6
Switching to an RW lock is easy, but nothing would be able to use the
read lock because it's not safe to make most kinds of changes to PTEs
in parallel in the existing code. If we sharded the spinlock based on
GFN it might be easier, but that would also take a lot of
re-engineering.

On Fri, Dec 6, 2019 at 11:57 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Fri, Dec 06, 2019 at 11:55:42AM -0800, Ben Gardon wrote:
> > I'm finally back in the office. Sorry for not getting back to you sooner.
> > I don't think it would be easy to send the synchronization changes
> > first. The reason they seem so small is that they're all handled by
> > the iterator. If we tried to put the synchronization changes in
> > without the iterator we'd have to 1.) deal with struct kvm_mmu_pages,
> > 2.) deal with the rmap, and 3.) change a huge amount of code to insert
> > the synchronization changes into the existing framework. The changes
> > wouldn't be mechanical or easy to insert either since a lot of
> > bookkeeping is currently done before PTEs are updated, with no
> > facility for rolling back the bookkeeping on PTE cmpxchg failure. We
> > could start with the iterator changes and then do the synchronization
> > changes, but the other way around would be very difficult.
>
> By synchronization changes, I meant switching to a r/w lock instead of a
> straight spinlock.  Is that doable in a smallish series?