[0/9] Create a userfaultfd demand paging test

Message ID 20190927161836.57978-1-bgardon@google.com

Message

Ben Gardon Sept. 27, 2019, 4:18 p.m. UTC
When handling page faults for many vCPUs during demand paging, KVM's MMU
lock becomes highly contended. This series creates a test with a naive
userfaultfd based demand paging implementation to demonstrate that
contention. This test serves both as a functional test of userfaultfd
and a microbenchmark of demand paging performance with a variable number
of vCPUs and memory per vCPU.

The test creates N userfaultfd threads, N vCPUs, and a region of memory
with M pages per vCPU. The N userfaultfd polling threads are each set up
to serve faults on a region of memory corresponding to one of the vCPUs.
Each vCPU is then started and sequentially touches each page of its
disjoint memory region. In response to faults, the userfaultfd
threads copy a static buffer into the guest's memory. This creates a
worst case for MMU lock contention as we have removed most of the
contention between the userfaultfd threads and there is no time required
to fetch the contents of guest memory.
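
As a rough sketch of that per-vCPU polling loop (illustrative only;
identifiers such as uffd_poll_thread, copy_buf, and UFFD_PAGE_SIZE are
made up here and error handling is trimmed, so this is not the test's
actual code):

#include <poll.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

#define UFFD_PAGE_SIZE 4096UL

static char copy_buf[UFFD_PAGE_SIZE];	/* static source buffer for every fault */

/* One of the N polling threads; arg carries the userfaultfd that was
 * registered for this thread's vCPU memory region. */
static void *uffd_poll_thread(void *arg)
{
	int uffd = (int)(long)arg;
	struct pollfd pfd = { .fd = uffd, .events = POLLIN };
	struct uffd_msg msg;
	struct uffdio_copy copy;

	for (;;) {
		if (poll(&pfd, 1, -1) <= 0)
			break;
		if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
			continue;
		if (msg.event != UFFD_EVENT_PAGEFAULT)
			continue;

		/* Resolve the fault by copying the static buffer into the
		 * faulting guest page; nothing is fetched from a source. */
		copy.dst = msg.arg.pagefault.address &
			   ~((unsigned long long)UFFD_PAGE_SIZE - 1);
		copy.src = (unsigned long)copy_buf;
		copy.len = UFFD_PAGE_SIZE;
		copy.mode = 0;
		ioctl(uffd, UFFDIO_COPY, &copy);
	}
	return NULL;
}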

This test was run successfully on Intel Haswell, Broadwell, and
Cascadelake hosts with a variety of vCPU counts and memory sizes.

This test was adapted from the dirty_log_test.

The series can also be viewed in Gerrit here:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/1464
(Thanks to Dmitry Vyukov <dvyukov@google.com> for setting up the Gerrit
instance)

Ben Gardon (9):
  KVM: selftests: Create a demand paging test
  KVM: selftests: Add demand paging content to the demand paging test
  KVM: selftests: Add memory size parameter to the demand paging test
  KVM: selftests: Pass args to vCPU instead of using globals
  KVM: selftests: Support multiple vCPUs in demand paging test
  KVM: selftests: Time guest demand paging
  KVM: selftests: Add parameter to _vm_create for memslot 0 base paddr
  KVM: selftests: Support large VMs in demand paging test
  Add static flag

 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   4 +-
 .../selftests/kvm/demand_paging_test.c        | 610 ++++++++++++++++++
 tools/testing/selftests/kvm/dirty_log_test.c  |   2 +-
 .../testing/selftests/kvm/include/kvm_util.h  |   3 +-
 tools/testing/selftests/kvm/lib/kvm_util.c    |   7 +-
 6 files changed, 621 insertions(+), 6 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/demand_paging_test.c

Comments

Peter Xu Sept. 29, 2019, 7:22 a.m. UTC | #1
On Fri, Sep 27, 2019 at 09:18:28AM -0700, Ben Gardon wrote:
> When handling page faults for many vCPUs during demand paging, KVM's MMU
> lock becomes highly contended. This series creates a test with a naive
> userfaultfd based demand paging implementation to demonstrate that
> contention. This test serves both as a functional test of userfaultfd
> and a microbenchmark of demand paging performance with a variable number
> of vCPUs and memory per vCPU.
> 
> The test creates N userfaultfd threads, N vCPUs, and a region of memory
> with M pages per vCPU. The N userfaultfd polling threads are each set up
> to serve faults on a region of memory corresponding to one of the vCPUs.
> Each vCPU is then started and sequentially touches each page of its
> disjoint memory region. In response to faults, the userfaultfd
> threads copy a static buffer into the guest's memory. This creates a
> worst case for MMU lock contention as we have removed most of the
> contention between the userfaultfd threads and there is no time required
> to fetch the contents of guest memory.

Hi, Ben,

Even though I may not have enough MMU knowledge to say this... this of
course looks like a good test, at least to me.  I'm just curious
whether you have plans to customize the userfaultfd handler in the
future with this infrastructure?

I ask because, IIUC, with this series userfaultfd only plays the role
of introducing a relatively ad hoc delay into page faults.  In other
words, I'm also curious what the numbers would look like (as you
mentioned in your MMU rework cover letter) if you simply started
hundreds of vCPUs and ran the same test, but with default anonymous
page faults rather than uffd page faults.  I feel like even without
uffd there could already be huge contention.  Or did I miss anything
important in your decision to use userfaultfd?

Thanks,
Ben Gardon Sept. 30, 2019, 5:02 p.m. UTC | #2
Hi Peter,
You're absolutely right that we could demonstrate more contention by
avoiding UFFD and just letting the kernel resolve page faults. I used
UFFD in this test and in the benchmarking for the other MMU patch set
because I believe it's a more realistic scenario. A simpler page-access
benchmark would be better for identifying further scaling problems
within the MMU, but the only situation I can think of where that would
be used is VM boot, and we don't usually see many vCPUs touching
memory all over the place on boot. In a migration or restore without
demand paging, the memory would have to be pre-populated with the
contents of guest memory, so the KVM MMU fault handler wouldn't be
taking a fault in get_user_pages. In the interest of eliminating the
delay from UFFD, I will add an option to use anonymous page faults or
to prefault memory instead.
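
(For illustration only, and not code from this series: such a prefault
pass could be as simple as a hypothetical helper that touches every
page of the host mapping before any vCPU starts, e.g.

static void prefault_region(char *hva, unsigned long size,
			    unsigned long page_size)
{
	unsigned long off;

	/* A write to each page populates the anonymous backing pages up
	 * front, so no fault ever reaches the userfaultfd handlers. */
	for (off = 0; off < size; off += page_size)
		hva[off] = 0;
}

where hva is the host virtual address of the region backing guest
memory.)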

I don't have any plans to customize the UFFD implementation at the
moment, but experimenting with UFFD strategies will be useful for
building higher performance post-copy in QEMU and other userspaces in
the future.
Thank you for taking a look at these patches.
Ben
