[RFC,0/6] Pass memslot around during page fault handling

Message ID 20210813203504.2742757-1-dmatlack@google.com (mailing list archive)

David Matlack Aug. 13, 2021, 8:34 p.m. UTC
This series avoids kvm_vcpu_gfn_to_memslot() calls during page fault
handling by passing around the memslot in struct kvm_page_fault. This
idea came from Ben Gardon, who authored a similar series in Google's
kernel.

This series is an RFC because kvm_vcpu_gfn_to_memslot() calls are
actually quite cheap after commit fe22ed827c5b ("KVM: Cache the last
used slot index per vCPU") since we always hit the cache. However,
profiling shows there is still some time (1-2%) spent in
kvm_vcpu_gfn_to_memslot() and that the hot instructions are the memory loads
for kvm->memslots[as_id] and slots->used_slots. This series eliminates
this remaining overhead but at the cost of a bit of code churn.

Design
------

We can avoid the cost of kvm_vcpu_gfn_to_memslot() by looking up the
slot once and passing it around. In fact this is quite easy to do now
that KVM passes around struct kvm_page_fault to most of the page fault
handling code.  We can store the slot there without changing most of the
call sites.

The one exception to this is mmu_set_spte, which does not take a
kvm_page_fault since it is also used during spte prefetching. There are
three memslot lookups under mmu_set_spte:

mmu_set_spte
  rmap_add
    kvm_vcpu_gfn_to_memslot
  rmap_recycle
    kvm_vcpu_gfn_to_memslot
  set_spte
    make_spte
      mmu_try_to_unsync_pages
        kvm_page_track_is_active
          kvm_vcpu_gfn_to_memslot

Avoiding these lookups requires plumbing the slot through all of the
above functions. I explored creating a synthetic kvm_page_fault for
prefetching so that kvm_page_fault could be passed to all of these
functions instead, but that resulted in even more code churn.

Patches
-------

Patches 1-2 are small cleanups related to the series.

Patches 3-4 pass the memslot through kvm_page_fault and use it where
kvm_page_fault is already accessible.

Patches 5-6 plumb the memslot down into the guts of mmu_set_spte to
avoid the remaining memslot lookups.

Performance
-----------

I measured the performance using dirty_log_perf_test and taking the
average "Populate memory time" over 10 runs. To help inform whether or
not different parts of this series are worth the code churn, I measured
the performance of patches 1-4 and 1-6 separately.

Test                            | tdp_mmu | kvm/queue | Patches 1-4 | Patches 1-6
------------------------------- | ------- | --------- | ----------- | -----------
./dirty_log_perf_test -v64      | Y       | 5.22s     | 5.20s       | 5.20s
./dirty_log_perf_test -v64 -x64 | Y       | 5.23s     | 5.14s       | 5.14s
./dirty_log_perf_test -v64      | N       | 17.14s    | 16.39s      | 15.36s
./dirty_log_perf_test -v64 -x64 | N       | 17.17s    | 16.60s      | 15.31s

This series provides little to no performance improvement to the tdp_mmu
(under 2%) but improves legacy MMU page fault handling by about 10% with
all six patches applied.

David Matlack (6):
  KVM: x86/mmu: Rename try_async_pf to kvm_faultin_pfn in comment
  KVM: x86/mmu: Fold rmap_recycle into rmap_add
  KVM: x86/mmu: Pass around the memslot in kvm_page_fault
  KVM: x86/mmu: Avoid memslot lookup in page_fault_handle_page_track
  KVM: x86/mmu: Avoid memslot lookup in rmap_add
  KVM: x86/mmu: Avoid memslot lookup in mmu_try_to_unsync_pages

 arch/x86/include/asm/kvm_page_track.h |   4 +-
 arch/x86/kvm/mmu.h                    |   5 +-
 arch/x86/kvm/mmu/mmu.c                | 110 +++++++++-----------------
 arch/x86/kvm/mmu/mmu_internal.h       |   3 +-
 arch/x86/kvm/mmu/page_track.c         |   6 +-
 arch/x86/kvm/mmu/paging_tmpl.h        |  18 ++++-
 arch/x86/kvm/mmu/spte.c               |  11 +--
 arch/x86/kvm/mmu/spte.h               |   9 ++-
 arch/x86/kvm/mmu/tdp_mmu.c            |  12 +--
 9 files changed, 80 insertions(+), 98 deletions(-)

Comments

Paolo Bonzini Aug. 17, 2021, 11:12 a.m. UTC | #1
Queued patches 1-3, thanks.  For the others, see the reply to patch 6.

Paolo