Message ID | 20190926231824.149014-1-bgardon@google.com (mailing list archive)
---|---
Series | kvm: mmu: Rework the x86 TDP direct mapped case
On Thu, Sep 26, 2019 at 04:17:56PM -0700, Ben Gardon wrote:
> Over the years, the needs for KVM's x86 MMU have grown from running small guests to live migrating multi-terabyte VMs with hundreds of vCPUs. Where we previously depended upon shadow paging to run all guests, we now have the use of two dimensional paging (TDP). This RFC proposes and demonstrates two major changes to the MMU. First, an iterator abstraction that simplifies traversal of TDP paging structures when running an L1 guest. This abstraction takes advantage of the relative simplicity of TDP to simplify the implementation of MMU functions. Second, this RFC changes the synchronization model to enable more parallelism than the monolithic MMU lock. This "direct mode" MMU is currently in use at Google and has given us the performance necessary to live migrate our 416 vCPU, 12TiB m2-ultramem-416 VMs.
>
> The primary motivation for this work was to handle page faults in parallel. When VMs have hundreds of vCPUs and terabytes of memory, KVM's MMU lock suffers from extreme contention, resulting in soft-lockups and jitter in the guest. To demonstrate this I have also written, and will submit, a demand paging test to KVM selftests. The test creates N vCPUs, which each touch disjoint regions of memory. Page faults are picked up by N user fault FD handlers, one for each vCPU. Over a 1 second profile of the demand paging test, with 416 vCPUs and 4G per vCPU, 98% of the execution time was spent waiting for the MMU lock! With this patch series the total execution time for the test was reduced by 89% and the execution was dominated by get_user_pages and the user fault FD ioctl. As a secondary benefit, the iterator-based implementation does not use the rmap or struct kvm_mmu_pages, saving ~0.2% of guest memory in KVM overheads.
>
> The goal of this RFC is to demonstrate and gather feedback on the iterator pattern, the memory savings it enables for the "direct case" and the changes to the synchronization model. Though they are interwoven in this series, I will separate the iterator from the synchronization changes in a future series. I recognize that some feature work will be needed to make this patch set ready for merging. That work is detailed at the end of this cover letter.

Diving into this series is on my todo list, but realistically that's not going to happen until after KVM forum. Sorry I can't provide timely feedback.
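For context on the demand paging test described in the cover letter: each vCPU's faults are resolved by a dedicated userfaultfd handler thread. The following is a minimal sketch of such a handler loop, not the actual selftest code; it assumes the uffd was already created with userfaultfd(2) and registered over the vCPU's memory slice, and `uffd_handler`, `backing_page`, and the 4 KiB `PAGE_SIZE` are illustrative names.

```c
#include <linux/userfaultfd.h>
#include <poll.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define PAGE_SIZE 4096UL		/* assumed 4 KiB pages */

/* Hypothetical backing data copied into every faulting page. */
static char backing_page[PAGE_SIZE] __attribute__((aligned(PAGE_SIZE)));

/* One handler thread per vCPU; arg is the registered userfaultfd. */
static void *uffd_handler(void *arg)
{
	int uffd = (int)(long)arg;
	struct pollfd pfd = { .fd = uffd, .events = POLLIN };

	for (;;) {
		struct uffd_msg msg;
		struct uffdio_copy copy;

		/* Block until the vCPU faults on an unpopulated page. */
		if (poll(&pfd, 1, -1) <= 0 ||
		    read(uffd, &msg, sizeof(msg)) != sizeof(msg))
			break;
		if (msg.event != UFFD_EVENT_PAGEFAULT)
			continue;

		/* Resolve the fault by copying in a backing page. */
		copy.dst  = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);
		copy.src  = (unsigned long)backing_page;
		copy.len  = PAGE_SIZE;
		copy.mode = 0;
		ioctl(uffd, UFFDIO_COPY, &copy);
	}
	return NULL;
}
```

With N such handlers and N vCPUs touching disjoint memory, every UFFDIO_COPY completion immediately re-enters KVM's page fault path, which is what concentrates the contention on the single MMU lock.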
On 17/10/19 20:50, Sean Christopherson wrote:
> On Thu, Sep 26, 2019 at 04:17:56PM -0700, Ben Gardon wrote:
>> Over the years, the needs for KVM's x86 MMU have grown from running small guests to live migrating multi-terabyte VMs with hundreds of vCPUs. Where we previously depended upon shadow paging to run all guests, we now have the use of two dimensional paging (TDP). This RFC proposes and demonstrates two major changes to the MMU. First, an iterator abstraction that simplifies traversal of TDP paging structures when running an L1 guest. This abstraction takes advantage of the relative simplicity of TDP to simplify the implementation of MMU functions. Second, this RFC changes the synchronization model to enable more parallelism than the monolithic MMU lock. This "direct mode" MMU is currently in use at Google and has given us the performance necessary to live migrate our 416 vCPU, 12TiB m2-ultramem-416 VMs.
>>
>> The primary motivation for this work was to handle page faults in parallel. When VMs have hundreds of vCPUs and terabytes of memory, KVM's MMU lock suffers from extreme contention, resulting in soft-lockups and jitter in the guest. To demonstrate this I have also written, and will submit, a demand paging test to KVM selftests. The test creates N vCPUs, which each touch disjoint regions of memory. Page faults are picked up by N user fault FD handlers, one for each vCPU. Over a 1 second profile of the demand paging test, with 416 vCPUs and 4G per vCPU, 98% of the execution time was spent waiting for the MMU lock! With this patch series the total execution time for the test was reduced by 89% and the execution was dominated by get_user_pages and the user fault FD ioctl. As a secondary benefit, the iterator-based implementation does not use the rmap or struct kvm_mmu_pages, saving ~0.2% of guest memory in KVM overheads.
>>
>> The goal of this RFC is to demonstrate and gather feedback on the iterator pattern, the memory savings it enables for the "direct case" and the changes to the synchronization model. Though they are interwoven in this series, I will separate the iterator from the synchronization changes in a future series. I recognize that some feature work will be needed to make this patch set ready for merging. That work is detailed at the end of this cover letter.
>
> Diving into this series is on my todo list, but realistically that's not going to happen until after KVM forum. Sorry I can't provide timely feedback.

Same here. I was very lazily waiting to get the big picture from Ben's talk.

Paolo
On Thu, Sep 26, 2019 at 04:17:56PM -0700, Ben Gardon wrote:
> The goal of this RFC is to demonstrate and gather feedback on the iterator pattern, the memory savings it enables for the "direct case" and the changes to the synchronization model. Though they are interwoven in this series, I will separate the iterator from the synchronization changes in a future series. I recognize that some feature work will be needed to make this patch set ready for merging. That work is detailed at the end of this cover letter.

How difficult would it be to send the synchronization changes as a separate series in the not-too-distant future? At a brief glance, those changes appear to be tiny relative to the direct iterator changes. From a stability perspective, it would be nice if the locking changes could be upstreamed and tested in the wild for a few kernel versions before the iterator code is introduced.
I'm finally back in the office. Sorry for not getting back to you sooner.

I don't think it would be easy to send the synchronization changes first. The reason they seem so small is that they're all handled by the iterator. If we tried to put the synchronization changes in without the iterator, we'd have to 1.) deal with struct kvm_mmu_pages, 2.) deal with the rmap, and 3.) change a huge amount of code to insert the synchronization changes into the existing framework. The changes wouldn't be mechanical or easy to insert either, since a lot of bookkeeping is currently done before PTEs are updated, with no facility for rolling back the bookkeeping on PTE cmpxchg failure. We could start with the iterator changes and then do the synchronization changes, but the other way around would be very difficult.

On Wed, Nov 27, 2019 at 11:09 AM Sean Christopherson <sean.j.christopherson@intel.com> wrote:
>
> On Thu, Sep 26, 2019 at 04:17:56PM -0700, Ben Gardon wrote:
> > The goal of this RFC is to demonstrate and gather feedback on the iterator pattern, the memory savings it enables for the "direct case" and the changes to the synchronization model. Though they are interwoven in this series, I will separate the iterator from the synchronization changes in a future series. I recognize that some feature work will be needed to make this patch set ready for merging. That work is detailed at the end of this cover letter.
>
> How difficult would it be to send the synchronization changes as a separate series in the not-too-distant future? At a brief glance, those changes appear to be tiny relative to the direct iterator changes. From a stability perspective, it would be nice if the locking changes could be upstreamed and tested in the wild for a few kernel versions before the iterator code is introduced.
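The "PTE cmpxchg failure" constraint Ben mentions is the heart of the ordering problem: in a parallel MMU, a SPTE is installed only if it still holds the value that was read, and losing the race means redoing the whole operation, accounting included. A minimal kernel-style sketch of that pattern follows; `set_spte_atomic` and `install_spte` are hypothetical names for illustration, not code from the RFC.

```c
#include <linux/kvm_host.h>

/* Install new_spte only if the entry still holds old_spte. */
static bool set_spte_atomic(u64 *sptep, u64 old_spte, u64 new_spte)
{
	/* cmpxchg64() returns the prior contents; equality means we won the race. */
	return cmpxchg64(sptep, old_spte, new_spte) == old_spte;
}

static void install_spte(u64 *sptep, u64 new_spte)
{
	u64 old_spte;

	do {
		old_spte = READ_ONCE(*sptep);
		/*
		 * No bookkeeping here: any accounting tied to old_spte must be
		 * deferred until the exchange succeeds (or be safe to repeat),
		 * because a failed cmpxchg cannot be rolled back after the
		 * accounting has already been published.
		 */
	} while (!set_spte_atomic(sptep, old_spte, new_spte));
}
```

The existing shadow-MMU paths update the rmap and struct kvm_mmu_pages bookkeeping before touching the PTE, which is exactly the ordering this retry loop cannot tolerate; that is why the synchronization change is hard to separate from the iterator rewrite.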
On Fri, Dec 06, 2019 at 11:55:42AM -0800, Ben Gardon wrote:
> I'm finally back in the office. Sorry for not getting back to you sooner.
>
> I don't think it would be easy to send the synchronization changes first. The reason they seem so small is that they're all handled by the iterator. If we tried to put the synchronization changes in without the iterator, we'd have to 1.) deal with struct kvm_mmu_pages, 2.) deal with the rmap, and 3.) change a huge amount of code to insert the synchronization changes into the existing framework. The changes wouldn't be mechanical or easy to insert either, since a lot of bookkeeping is currently done before PTEs are updated, with no facility for rolling back the bookkeeping on PTE cmpxchg failure. We could start with the iterator changes and then do the synchronization changes, but the other way around would be very difficult.

By synchronization changes, I meant switching to a r/w lock instead of a straight spinlock. Is that doable in a smallish series?
Switching to an RW lock is easy, but nothing would be able to use the read lock, because it's not safe to make most kinds of changes to PTEs in parallel in the existing code. If we sharded the spinlock based on GFN it might be easier, but that would also take a lot of re-engineering.

On Fri, Dec 6, 2019 at 11:57 AM Sean Christopherson <sean.j.christopherson@intel.com> wrote:
>
> On Fri, Dec 06, 2019 at 11:55:42AM -0800, Ben Gardon wrote:
> > I'm finally back in the office. Sorry for not getting back to you sooner.
> >
> > I don't think it would be easy to send the synchronization changes first. The reason they seem so small is that they're all handled by the iterator. If we tried to put the synchronization changes in without the iterator, we'd have to 1.) deal with struct kvm_mmu_pages, 2.) deal with the rmap, and 3.) change a huge amount of code to insert the synchronization changes into the existing framework. The changes wouldn't be mechanical or easy to insert either, since a lot of bookkeeping is currently done before PTEs are updated, with no facility for rolling back the bookkeeping on PTE cmpxchg failure. We could start with the iterator changes and then do the synchronization changes, but the other way around would be very difficult.
>
> By synchronization changes, I meant switching to a r/w lock instead of a straight spinlock. Is that doable in a smallish series?
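To illustrate why the bare rwlock conversion buys little on its own: fault handlers can only take the lock for read once SPTE updates are already safe to perform concurrently (for example via the cmpxchg pattern sketched above). A hypothetical sketch, with `mmu_rwlock` as an assumed stand-in for kvm->mmu_lock and the function names invented for illustration:

```c
#include <linux/spinlock.h>

/* Hypothetical replacement for the kvm->mmu_lock spinlock (sketch only). */
static DEFINE_RWLOCK(mmu_rwlock);

static void handle_tdp_fault(void)
{
	read_lock(&mmu_rwlock);		/* many fault handlers in parallel... */
	/*
	 * ...but only if SPTEs are installed atomically; with the existing
	 * non-atomic update paths, concurrent readers modifying PTEs would
	 * corrupt the paging structures and their bookkeeping.
	 */
	read_unlock(&mmu_rwlock);
}

static void zap_all_mappings(void)
{
	write_lock(&mmu_rwlock);	/* exclusive, equivalent to the old spinlock */
	/* ... teardown work that still needs full exclusion ... */
	write_unlock(&mmu_rwlock);
}
```

The mechanical spin_lock to write_lock swap is the "smallish series" part; making any path eligible for read_lock is where the iterator and the atomic PTE updates become prerequisites.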