[00/23] Extend Eager Page Splitting to the shadow MMU

Message ID 20220203010051.2813563-1-dmatlack@google.com (mailing list archive)

Message

David Matlack Feb. 3, 2022, 1 a.m. UTC
This series extends KVM's Eager Page Splitting to also split huge pages
mapped by the shadow MMU, i.e. huge pages present in the memslot rmaps.
This will be useful for configurations that use Nested Virtualization,
disable the TDP MMU, or disable/lack TDP hardware support.

For background on Eager Page Splitting, see:
 - Proposal: https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com/
 - TDP MMU support: https://lore.kernel.org/kvm/20220119230739.2234394-1-dmatlack@google.com/

Splitting huge pages mapped by the shadow MMU is more complicated than
the TDP MMU, but it is also more important for performance as the shadow
MMU handles huge page write-protection faults under the write lock.  See
the Performance section for more details.

The extra complexity of splitting huge pages mapped by the shadow MMU
comes from a few places:

(1) The shadow MMU has a limit on the number of shadow pages that are
    allowed to be allocated. So, as a policy, Eager Page Splitting
    refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
    pages available.

(2) Huge pages may be mapped by indirect shadow pages.

    - Indirect shadow pages have the possibility of being unsync. As a
      policy we opt not to split such pages as their translation may no
      longer be valid (see the sketch after this list).
    - Huge pages on indirect shadow pages may have access permission
      constraints from the guest (unlike the TDP MMU which is ACC_ALL
      by default).

(3) Splitting a huge page may end up re-using existing lower level
    shadow page tables. This is unlike the TDP MMU, which always allocates
    new shadow page tables when splitting.

(4) When installing the lower level SPTEs, they must be added to the
    rmap which may require allocating additional pte_list_desc structs.
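
As a rough illustration of policies (1) and (2), the check amounts to
something like the sketch below. The helper name is hypothetical and this
is not the actual patch code; kvm_mmu_available_pages(),
KVM_MIN_FREE_MMU_PAGES and sp->unsync are existing KVM concepts:

static bool eager_split_allowed(struct kvm *kvm, struct kvm_mmu_page *sp)
{
	/* (1) Refuse to split when shadow pages are running low. */
	if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES)
		return false;

	/* (2) Unsync shadow pages may hold stale translations; skip them. */
	if (sp->unsync)
		return false;

	return true;
}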

In Google's internal implementation of Eager Page Splitting, we do not
handle cases (3) and (4), and instead skip splitting entirely (case 3)
or split only partially (case 4). This series handles those additional
cases (patches 19-22), which comes with some extra complexity and an
additional 4KiB of memory per VM to store the extra pte_list_desc
cache. However, it also avoids the need for TLB flushes in most cases.
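
For illustration only, the extra per-VM state for splitting outside of a
vCPU context could be shaped roughly as below. The struct and field names
here are assumptions for the sketch, not the actual patch; the
pte_list_desc cache is the extra ~4KiB mentioned above, kept so that
split SPTEs can be added to the rmap without allocating under the MMU
lock:

struct kvm_eager_split_caches {
	struct kvm_mmu_memory_cache page_header_cache;   /* struct kvm_mmu_page */
	struct kvm_mmu_memory_cache shadow_page_cache;    /* page table pages */
	struct kvm_mmu_memory_cache pte_list_desc_cache;  /* rmap descriptors */
};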

About half of this series, patches 1-13, is just refactoring the
existing MMU code in preparation for splitting. The bulk of the
refactoring is to make it possible to operate on the MMU outside of a
vCPU context.

Performance
-----------

Eager page splitting moves the cost of splitting huge pages off of the
vCPU thread and onto the thread invoking VM-ioctls to configure dirty
logging. This is useful because:

 - Splitting on the vCPU thread interrupts vCPU execution and is
   disruptive to customers whereas splitting on VM ioctl threads can
   run in parallel with vCPU execution.

 - Splitting on the VM ioctl thread is more efficient because it does
   not require performing VM-exit handling and page table walks for every
   4K page.

To measure the performance impact of Eager Page Splitting, I ran
dirty_log_perf_test with tdp_mmu=N, various vCPU counts, and 1GiB of
memory per vCPU backed by 1GiB HugeTLB pages.

To measure the impact on customer performance, we can look at the time
it takes all vCPUs to dirty memory after dirty logging has been enabled.
Without Eager Page Splitting, such dirtying must take faults to split
huge pages and bottlenecks on the MMU lock.

             | "Iteration 1 dirty memory time"             |
             | ------------------------------------------- |
vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
------------ | -------------------- | -------------------- |
2            | 0.310786549s         | 0.058731929s         |
4            | 0.419165587s         | 0.059615316s         |
8            | 1.061233860s         | 0.060945457s         |
16           | 2.852955595s         | 0.067069980s         |
32           | 7.032750509s         | 0.078623606s         |
64           | 16.501287504s        | 0.083914116s         |

Eager Page Splitting does increase the time it takes to enable dirty
logging when not using initially-all-set, since that's when KVM splits
huge pages. However, this runs in parallel with vCPU execution and does
not bottleneck on the MMU lock.

             | "Enabling dirty logging time"               |
             | ------------------------------------------- |
vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
------------ | -------------------- | -------------------- |
2            | 0.001581619s         |  0.025699730s        |
4            | 0.003138664s         |  0.051510208s        |
8            | 0.006247177s         |  0.102960379s        |
16           | 0.012603892s         |  0.206949435s        |
32           | 0.026428036s         |  0.435855597s        |
64           | 0.103826796s         |  1.199686530s        |

Similarly, Eager Page Splitting increases the time it takes to clear the
dirty log when using initially-all-set. The first time userspace
clears the dirty log, KVM will split huge pages:

             | "Iteration 1 clear dirty log time"          |
             | ------------------------------------------- |
vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
------------ | -------------------- | -------------------- |
2            | 0.001544730s         | 0.055327916s         |
4            | 0.003145920s         | 0.111887354s         |
8            | 0.006306964s         | 0.223920530s         |
16           | 0.012681628s         | 0.447849488s         |
32           | 0.026827560s         | 0.943874520s         |
64           | 0.090461490s         | 2.664388025s         |

Subsequent calls to clear the dirty log incur almost no additional cost
since KVM can very quickly determine there are no more huge pages to
split via the RMAP. This is unlike the TDP MMU which must re-traverse
the entire page table to check for huge pages.

             | "Iteration 2 clear dirty log time"          |
             | ------------------------------------------- |
vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
------------ | -------------------- | -------------------- |
2            | 0.015613726s         | 0.015771982s         |
4            | 0.031456620s         | 0.031911594s         |
8            | 0.063341572s         | 0.063837403s         |
16           | 0.128409332s         | 0.127484064s         |
32           | 0.255635696s         | 0.268837996s         |
64           | 0.695572818s         | 0.700420727s         |
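
For reference, the two userspace operations that trigger eager splitting
in the runs above are enabling dirty logging on a memslot and, with
initially-all-set, the first clear of the dirty log. A minimal sketch
(assumptions: vm_fd is a KVM VM fd, the memslot was created earlier as
slot 0, and KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is enabled for the clear
path; error handling omitted):

#include <sys/ioctl.h>
#include <linux/kvm.h>

static void enable_dirty_logging(int vm_fd, __u64 gpa, __u64 size, void *hva)
{
	struct kvm_userspace_memory_region region = {
		.slot = 0,
		.flags = KVM_MEM_LOG_DIRTY_PAGES,
		.guest_phys_addr = gpa,
		.memory_size = size,
		.userspace_addr = (__u64)(unsigned long)hva,
	};

	/* Without initially-all-set, KVM splits huge pages here. */
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}

static void clear_dirty_log(int vm_fd, void *bitmap, __u64 first_page,
			    __u32 num_pages)
{
	struct kvm_clear_dirty_log clear = {
		.slot = 0,
		.num_pages = num_pages,
		.first_page = first_page,
		.dirty_bitmap = bitmap,
	};

	/* With initially-all-set, the first clear of a range splits huge pages. */
	ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear);
}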

Eager Page Splitting also improves performance for shadow paging
configurations, as measured with ept=N. The absolute gains are smaller,
though, since ept=N requires taking the MMU lock to track writes to 4KiB
pages (i.e. no fast_page_fault() or PML), which dominates the dirty
memory time.

             | "Iteration 1 dirty memory time"             |
             | ------------------------------------------- |
vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
------------ | -------------------- | -------------------- |
2            | 0.373022770s         | 0.348926043s         |
4            | 0.563697483s         | 0.453022037s         |
8            | 1.588492808s         | 1.524962010s         |
16           | 3.988934732s         | 3.369129917s         |
32           | 9.470333115s         | 8.292953856s         |
64           | 20.086419186s        | 18.531840021s        |

Testing
-------

- Ran all kvm-unit-tests and KVM selftests with all combinations of
  ept=[NY] and tdp_mmu=[NY].
- Tested VM live migration [*] with ept=N and ept=Y and observed pages
  being split via tracepoint and the pages_* stats.

[*] The live migration setup consisted of an 8 vCPU 8 GiB VM running
    on an Intel Cascade Lake host and backed by 1GiB HugeTLBFS memory.
    The VM was running Debian 10 and a workload that consisted of 16
    independent processes that each dirty memory. The tests were run
    with ept=N to exercise the interaction of Eager Page Splitting and
    shadow paging.

David Matlack (23):
  KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs
  KVM: x86/mmu: Derive shadow MMU page role from parent
  KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions
  KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages
  KVM: x86/mmu: Pass memslot to kvm_mmu_create_sp()
  KVM: x86/mmu: Separate shadow MMU sp allocation from initialization
  KVM: x86/mmu: Move huge page split sp allocation code to mmu.c
  KVM: x86/mmu: Use common code to free kvm_mmu_page structs
  KVM: x86/mmu: Use common code to allocate kvm_mmu_page structs from
    vCPU caches
  KVM: x86/mmu: Pass const memslot to rmap_add()
  KVM: x86/mmu: Pass const memslot to kvm_mmu_init_sp() and descendants
  KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu
  KVM: x86/mmu: Update page stats in __rmap_add()
  KVM: x86/mmu: Cache the access bits of shadowed translations
  KVM: x86/mmu: Pass access information to make_huge_page_split_spte()
  KVM: x86/mmu: Zap collapsible SPTEs at all levels in the shadow MMU
  KVM: x86/mmu: Pass bool flush parameter to drop_large_spte()
  KVM: x86/mmu: Extend Eager Page Splitting to the shadow MMU
  KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  KVM: Allow GFP flags to be passed when topping up MMU caches
  KVM: x86/mmu: Fully split huge pages that require extra pte_list_desc
    structs
  KVM: x86/mmu: Split huge pages aliased by multiple SPTEs
  KVM: selftests: Map x86_64 guest virtual memory with huge pages

 .../admin-guide/kernel-parameters.txt         |   3 -
 arch/arm64/include/asm/kvm_host.h             |   2 +-
 arch/arm64/kvm/mmu.c                          |  12 +-
 arch/mips/include/asm/kvm_host.h              |   2 +-
 arch/x86/include/asm/kvm_host.h               |  19 +-
 arch/x86/include/asm/kvm_page_track.h         |   2 +-
 arch/x86/kvm/mmu/mmu.c                        | 744 +++++++++++++++---
 arch/x86/kvm/mmu/mmu_internal.h               |  22 +-
 arch/x86/kvm/mmu/page_track.c                 |   4 +-
 arch/x86/kvm/mmu/paging_tmpl.h                |  25 +-
 arch/x86/kvm/mmu/spte.c                       |  10 +-
 arch/x86/kvm/mmu/spte.h                       |   3 +-
 arch/x86/kvm/mmu/tdp_mmu.c                    |  37 +-
 arch/x86/kvm/mmu/tdp_mmu.h                    |   2 +-
 include/linux/kvm_host.h                      |   1 +
 include/linux/kvm_types.h                     |  24 +-
 .../selftests/kvm/include/x86_64/processor.h  |   6 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |   4 +-
 .../selftests/kvm/lib/x86_64/processor.c      |  31 +
 virt/kvm/kvm_main.c                           |  17 +-
 20 files changed, 765 insertions(+), 205 deletions(-)


base-commit: f02ccc0f669341de1a831dfa7ca843ebbdbc8bd7

Comments

Peter Xu March 7, 2022, 5:21 a.m. UTC | #1
Hi, David,

Sorry for a very late comment.

On Thu, Feb 03, 2022 at 01:00:28AM +0000, David Matlack wrote:
> Performance
> -----------
> 
> Eager page splitting moves the cost of splitting huge pages off of the
> vCPU thread and onto the thread invoking VM-ioctls to configure dirty
> logging. This is useful because:
> 
>  - Splitting on the vCPU thread interrupts vCPUs execution and is
>    disruptive to customers whereas splitting on VM ioctl threads can
>    run in parallel with vCPU execution.
> 
>  - Splitting on the VM ioctl thread is more efficient because it does
>    no require performing VM-exit handling and page table walks for every
>    4K page.
> 
> To measure the performance impact of Eager Page Splitting I ran
> dirty_log_perf_test with tdp_mmu=N, various virtual CPU counts, 1GiB per
> vCPU, and backed by 1GiB HugeTLB memory.
> 
> To measure the imapct of customer performance, we can look at the time
> it takes all vCPUs to dirty memory after dirty logging has been enabled.
> Without Eager Page Splitting enabled, such dirtying must take faults to
> split huge pages and bottleneck on the MMU lock.
> 
>              | "Iteration 1 dirty memory time"             |
>              | ------------------------------------------- |
> vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> ------------ | -------------------- | -------------------- |
> 2            | 0.310786549s         | 0.058731929s         |
> 4            | 0.419165587s         | 0.059615316s         |
> 8            | 1.061233860s         | 0.060945457s         |
> 16           | 2.852955595s         | 0.067069980s         |
> 32           | 7.032750509s         | 0.078623606s         |
> 64           | 16.501287504s        | 0.083914116s         |
> 
> Eager Page Splitting does increase the time it takes to enable dirty
> logging when not using initially-all-set, since that's when KVM splits
> huge pages. However, this runs in parallel with vCPU execution and does
> not bottleneck on the MMU lock.
> 
>              | "Enabling dirty logging time"               |
>              | ------------------------------------------- |
> vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> ------------ | -------------------- | -------------------- |
> 2            | 0.001581619s         |  0.025699730s        |
> 4            | 0.003138664s         |  0.051510208s        |
> 8            | 0.006247177s         |  0.102960379s        |
> 16           | 0.012603892s         |  0.206949435s        |
> 32           | 0.026428036s         |  0.435855597s        |
> 64           | 0.103826796s         |  1.199686530s        |
> 
> Similarly, Eager Page Splitting increases the time it takes to clear the
> dirty log for when using initially-all-set. The first time userspace
> clears the dirty log, KVM will split huge pages:
> 
>              | "Iteration 1 clear dirty log time"          |
>              | ------------------------------------------- |
> vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> ------------ | -------------------- | -------------------- |
> 2            | 0.001544730s         | 0.055327916s         |
> 4            | 0.003145920s         | 0.111887354s         |
> 8            | 0.006306964s         | 0.223920530s         |
> 16           | 0.012681628s         | 0.447849488s         |
> 32           | 0.026827560s         | 0.943874520s         |
> 64           | 0.090461490s         | 2.664388025s         |
> 
> Subsequent calls to clear the dirty log incur almost no additional cost
> since KVM can very quickly determine there are no more huge pages to
> split via the RMAP. This is unlike the TDP MMU which must re-traverse
> the entire page table to check for huge pages.
> 
>              | "Iteration 2 clear dirty log time"          |
>              | ------------------------------------------- |
> vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> ------------ | -------------------- | -------------------- |
> 2            | 0.015613726s         | 0.015771982s         |
> 4            | 0.031456620s         | 0.031911594s         |
> 8            | 0.063341572s         | 0.063837403s         |
> 16           | 0.128409332s         | 0.127484064s         |
> 32           | 0.255635696s         | 0.268837996s         |
> 64           | 0.695572818s         | 0.700420727s         |

Are all the tests above with ept=Y (except the one below)?

> 
> Eager Page Splitting also improves the performance for shadow paging
> configurations, as measured with ept=N. Although the absolute gains are
> less since ept=N requires taking the MMU lock to track writes to 4KiB
> pages (i.e. no fast_page_fault() or PML), which dominates the dirty
> memory time.
> 
>              | "Iteration 1 dirty memory time"             |
>              | ------------------------------------------- |
> vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> ------------ | -------------------- | -------------------- |
> 2            | 0.373022770s         | 0.348926043s         |
> 4            | 0.563697483s         | 0.453022037s         |
> 8            | 1.588492808s         | 1.524962010s         |
> 16           | 3.988934732s         | 3.369129917s         |
> 32           | 9.470333115s         | 8.292953856s         |
> 64           | 20.086419186s        | 18.531840021s        |

This one is definitely for ept=N because it's written there. That's a
~10% performance increase, which still looks good, but IMHO that increase
is "debatable" since a normal guest may not simply write over the whole
guest mem.. So that 10% increase is based on some assumptions.

What if the guest writes 80% and reads 20%?  IIUC the split thread will
also start to block the readers too for the shadow mmu, while they were
not blocked previously?  From that pov, not sure whether the series needs
some more justification, as the changeset seems still large.

Are there other benefits besides the 10% increase on writes?

Thanks,
David Matlack March 7, 2022, 11:39 p.m. UTC | #2
On Sun, Mar 6, 2022 at 9:22 PM Peter Xu <peterx@redhat.com> wrote:
>
> Hi, David,
>
> Sorry for a very late comment.
>
> On Thu, Feb 03, 2022 at 01:00:28AM +0000, David Matlack wrote:
> > Performance
> > -----------
> >
> > Eager page splitting moves the cost of splitting huge pages off of the
> > vCPU thread and onto the thread invoking VM-ioctls to configure dirty
> > logging. This is useful because:
> >
> >  - Splitting on the vCPU thread interrupts vCPUs execution and is
> >    disruptive to customers whereas splitting on VM ioctl threads can
> >    run in parallel with vCPU execution.
> >
> >  - Splitting on the VM ioctl thread is more efficient because it does
> >    no require performing VM-exit handling and page table walks for every
> >    4K page.
> >
> > To measure the performance impact of Eager Page Splitting I ran
> > dirty_log_perf_test with tdp_mmu=N, various virtual CPU counts, 1GiB per
> > vCPU, and backed by 1GiB HugeTLB memory.
> >
> > To measure the imapct of customer performance, we can look at the time
> > it takes all vCPUs to dirty memory after dirty logging has been enabled.
> > Without Eager Page Splitting enabled, such dirtying must take faults to
> > split huge pages and bottleneck on the MMU lock.
> >
> >              | "Iteration 1 dirty memory time"             |
> >              | ------------------------------------------- |
> > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > ------------ | -------------------- | -------------------- |
> > 2            | 0.310786549s         | 0.058731929s         |
> > 4            | 0.419165587s         | 0.059615316s         |
> > 8            | 1.061233860s         | 0.060945457s         |
> > 16           | 2.852955595s         | 0.067069980s         |
> > 32           | 7.032750509s         | 0.078623606s         |
> > 64           | 16.501287504s        | 0.083914116s         |
> >
> > Eager Page Splitting does increase the time it takes to enable dirty
> > logging when not using initially-all-set, since that's when KVM splits
> > huge pages. However, this runs in parallel with vCPU execution and does
> > not bottleneck on the MMU lock.
> >
> >              | "Enabling dirty logging time"               |
> >              | ------------------------------------------- |
> > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > ------------ | -------------------- | -------------------- |
> > 2            | 0.001581619s         |  0.025699730s        |
> > 4            | 0.003138664s         |  0.051510208s        |
> > 8            | 0.006247177s         |  0.102960379s        |
> > 16           | 0.012603892s         |  0.206949435s        |
> > 32           | 0.026428036s         |  0.435855597s        |
> > 64           | 0.103826796s         |  1.199686530s        |
> >
> > Similarly, Eager Page Splitting increases the time it takes to clear the
> > dirty log for when using initially-all-set. The first time userspace
> > clears the dirty log, KVM will split huge pages:
> >
> >              | "Iteration 1 clear dirty log time"          |
> >              | ------------------------------------------- |
> > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > ------------ | -------------------- | -------------------- |
> > 2            | 0.001544730s         | 0.055327916s         |
> > 4            | 0.003145920s         | 0.111887354s         |
> > 8            | 0.006306964s         | 0.223920530s         |
> > 16           | 0.012681628s         | 0.447849488s         |
> > 32           | 0.026827560s         | 0.943874520s         |
> > 64           | 0.090461490s         | 2.664388025s         |
> >
> > Subsequent calls to clear the dirty log incur almost no additional cost
> > since KVM can very quickly determine there are no more huge pages to
> > split via the RMAP. This is unlike the TDP MMU which must re-traverse
> > the entire page table to check for huge pages.
> >
> >              | "Iteration 2 clear dirty log time"          |
> >              | ------------------------------------------- |
> > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > ------------ | -------------------- | -------------------- |
> > 2            | 0.015613726s         | 0.015771982s         |
> > 4            | 0.031456620s         | 0.031911594s         |
> > 8            | 0.063341572s         | 0.063837403s         |
> > 16           | 0.128409332s         | 0.127484064s         |
> > 32           | 0.255635696s         | 0.268837996s         |
> > 64           | 0.695572818s         | 0.700420727s         |
>
> Are all the tests above with ept=Y (except the one below)?

Yes.

>
> >
> > Eager Page Splitting also improves the performance for shadow paging
> > configurations, as measured with ept=N. Although the absolute gains are
> > less since ept=N requires taking the MMU lock to track writes to 4KiB
> > pages (i.e. no fast_page_fault() or PML), which dominates the dirty
> > memory time.
> >
> >              | "Iteration 1 dirty memory time"             |
> >              | ------------------------------------------- |
> > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > ------------ | -------------------- | -------------------- |
> > 2            | 0.373022770s         | 0.348926043s         |
> > 4            | 0.563697483s         | 0.453022037s         |
> > 8            | 1.588492808s         | 1.524962010s         |
> > 16           | 3.988934732s         | 3.369129917s         |
> > 32           | 9.470333115s         | 8.292953856s         |
> > 64           | 20.086419186s        | 18.531840021s        |
>
> This one is definitely for ept=N because it's written there. That's ~10%
> performance increase which looks still good, but IMHO that increase is
> "debatable" since a normal guest may not simply write over the whole guest
> mem.. So that 10% increase is based on some assumptions.
>
> What if the guest writes 80% and reads 20%?  IIUC the split thread will
> also start to block the readers too for shadow mmu while it was not blocked
> previusly?  From that pov, not sure whether the series needs some more
> justification, as the changeset seems still large.
>
> Is there other benefits besides the 10% increase on writes?

Yes, in fact workloads that perform some reads will benefit _more_
than workloads that perform only writes.

The reason is that the current lazy splitting approach unmaps the
entire huge page on write and then maps in just the faulting 4K
page. That means reads on the unmapped portion of the hugepage will
now take a fault and require the MMU lock. In contrast, Eager Page
Splitting fully splits each huge page so readers should never take
faults.

For example, here is the data with 20% writes and 80% reads (i.e. pass
`-f 5` to dirty_log_perf_test):

             | "Iteration 1 dirty memory time"             |
             | ------------------------------------------- |
vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
------------ | -------------------- | -------------------- |
2            | 0.403108098s         | 0.071808764s         |
4            | 0.562173582s         | 0.105272819s         |
8            | 1.382974557s         | 0.248713796s         |
16           | 3.608993666s         | 0.571990327s         |
32           | 9.100678321s         | 1.702453103s         |
64           | 19.784780903s        | 3.489443239s         |

>
> Thanks,

>
> --
> Peter Xu
>
Peter Xu March 9, 2022, 7:31 a.m. UTC | #3
On Mon, Mar 07, 2022 at 03:39:37PM -0800, David Matlack wrote:
> On Sun, Mar 6, 2022 at 9:22 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > Hi, David,
> >
> > Sorry for a very late comment.
> >
> > On Thu, Feb 03, 2022 at 01:00:28AM +0000, David Matlack wrote:
> > > Performance
> > > -----------
> > >
> > > Eager page splitting moves the cost of splitting huge pages off of the
> > > vCPU thread and onto the thread invoking VM-ioctls to configure dirty
> > > logging. This is useful because:
> > >
> > >  - Splitting on the vCPU thread interrupts vCPUs execution and is
> > >    disruptive to customers whereas splitting on VM ioctl threads can
> > >    run in parallel with vCPU execution.
> > >
> > >  - Splitting on the VM ioctl thread is more efficient because it does
> > >    no require performing VM-exit handling and page table walks for every
> > >    4K page.
> > >
> > > To measure the performance impact of Eager Page Splitting I ran
> > > dirty_log_perf_test with tdp_mmu=N, various virtual CPU counts, 1GiB per
> > > vCPU, and backed by 1GiB HugeTLB memory.
> > >
> > > To measure the imapct of customer performance, we can look at the time
> > > it takes all vCPUs to dirty memory after dirty logging has been enabled.
> > > Without Eager Page Splitting enabled, such dirtying must take faults to
> > > split huge pages and bottleneck on the MMU lock.
> > >
> > >              | "Iteration 1 dirty memory time"             |
> > >              | ------------------------------------------- |
> > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > ------------ | -------------------- | -------------------- |
> > > 2            | 0.310786549s         | 0.058731929s         |
> > > 4            | 0.419165587s         | 0.059615316s         |
> > > 8            | 1.061233860s         | 0.060945457s         |
> > > 16           | 2.852955595s         | 0.067069980s         |
> > > 32           | 7.032750509s         | 0.078623606s         |
> > > 64           | 16.501287504s        | 0.083914116s         |
> > >
> > > Eager Page Splitting does increase the time it takes to enable dirty
> > > logging when not using initially-all-set, since that's when KVM splits
> > > huge pages. However, this runs in parallel with vCPU execution and does
> > > not bottleneck on the MMU lock.
> > >
> > >              | "Enabling dirty logging time"               |
> > >              | ------------------------------------------- |
> > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > ------------ | -------------------- | -------------------- |
> > > 2            | 0.001581619s         |  0.025699730s        |
> > > 4            | 0.003138664s         |  0.051510208s        |
> > > 8            | 0.006247177s         |  0.102960379s        |
> > > 16           | 0.012603892s         |  0.206949435s        |
> > > 32           | 0.026428036s         |  0.435855597s        |
> > > 64           | 0.103826796s         |  1.199686530s        |
> > >
> > > Similarly, Eager Page Splitting increases the time it takes to clear the
> > > dirty log for when using initially-all-set. The first time userspace
> > > clears the dirty log, KVM will split huge pages:
> > >
> > >              | "Iteration 1 clear dirty log time"          |
> > >              | ------------------------------------------- |
> > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > ------------ | -------------------- | -------------------- |
> > > 2            | 0.001544730s         | 0.055327916s         |
> > > 4            | 0.003145920s         | 0.111887354s         |
> > > 8            | 0.006306964s         | 0.223920530s         |
> > > 16           | 0.012681628s         | 0.447849488s         |
> > > 32           | 0.026827560s         | 0.943874520s         |
> > > 64           | 0.090461490s         | 2.664388025s         |
> > >
> > > Subsequent calls to clear the dirty log incur almost no additional cost
> > > since KVM can very quickly determine there are no more huge pages to
> > > split via the RMAP. This is unlike the TDP MMU which must re-traverse
> > > the entire page table to check for huge pages.
> > >
> > >              | "Iteration 2 clear dirty log time"          |
> > >              | ------------------------------------------- |
> > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > ------------ | -------------------- | -------------------- |
> > > 2            | 0.015613726s         | 0.015771982s         |
> > > 4            | 0.031456620s         | 0.031911594s         |
> > > 8            | 0.063341572s         | 0.063837403s         |
> > > 16           | 0.128409332s         | 0.127484064s         |
> > > 32           | 0.255635696s         | 0.268837996s         |
> > > 64           | 0.695572818s         | 0.700420727s         |
> >
> > Are all the tests above with ept=Y (except the one below)?
> 
> Yes.
> 
> >
> > >
> > > Eager Page Splitting also improves the performance for shadow paging
> > > configurations, as measured with ept=N. Although the absolute gains are
> > > less since ept=N requires taking the MMU lock to track writes to 4KiB
> > > pages (i.e. no fast_page_fault() or PML), which dominates the dirty
> > > memory time.
> > >
> > >              | "Iteration 1 dirty memory time"             |
> > >              | ------------------------------------------- |
> > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > ------------ | -------------------- | -------------------- |
> > > 2            | 0.373022770s         | 0.348926043s         |
> > > 4            | 0.563697483s         | 0.453022037s         |
> > > 8            | 1.588492808s         | 1.524962010s         |
> > > 16           | 3.988934732s         | 3.369129917s         |
> > > 32           | 9.470333115s         | 8.292953856s         |
> > > 64           | 20.086419186s        | 18.531840021s        |
> >
> > This one is definitely for ept=N because it's written there. That's ~10%
> > performance increase which looks still good, but IMHO that increase is
> > "debatable" since a normal guest may not simply write over the whole guest
> > mem.. So that 10% increase is based on some assumptions.
> >
> > What if the guest writes 80% and reads 20%?  IIUC the split thread will
> > also start to block the readers too for shadow mmu while it was not blocked
> > previusly?  From that pov, not sure whether the series needs some more
> > justification, as the changeset seems still large.
> >
> > Is there other benefits besides the 10% increase on writes?
> 
> Yes, in fact workloads that perform some reads will benefit _more_
> than workloads that perform only writes.
> 
> The reason is that the current lazy splitting approach unmaps the
> entire huge page on write and then maps in the just the faulting 4K
> page. That means reads on the unmapped portion of the hugepage will
> now take a fault and require the MMU lock. In contrast, Eager Page
> Splitting fully splits each huge page so readers should never take
> faults.
> 
> For example, here is the data with 20% writes and 80% reads (i.e. pass
> `-f 5` to dirty_log_perf_test):
> 
>              | "Iteration 1 dirty memory time"             |
>              | ------------------------------------------- |
> vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> ------------ | -------------------- | -------------------- |
> 2            | 0.403108098s         | 0.071808764s         |
> 4            | 0.562173582s         | 0.105272819s         |
> 8            | 1.382974557s         | 0.248713796s         |
> 16           | 3.608993666s         | 0.571990327s         |
> 32           | 9.100678321s         | 1.702453103s         |
> 64           | 19.784780903s        | 3.489443239s        |

It's very interesting to know these numbers, thanks for sharing that.

The above reminded me that eager page split actually does two things:

(1) When a page is mapped as huge, we "assume" this whole page will be
    accessed in the near future, so when split is needed we map all the
    small ptes, and,

(2) We move the split operation from page faults to when enable-dirty-track
    happens.

We could have done (1) already without the whole eager split patchset: if
we see a read-only huge page on a page fault, we could populate the whole
range of ptes, only marking the current small pte writable but leaving the
rest of the small ptes wr-protected.  I have a feeling this would speed up
the above 19.78 seconds (64 core case) fairly much too, to some point.

Entry (1) makes a lot of sense to me; OTOH I can understand entry (2) but
not strongly.

My previous concern was mainly about having readers blocked during the
splitting of huge pages (not after).  For the shadow mmu, IIUC the split
thread will start to take the write lock rather than the read lock
(compared to the tdp mmu), hence any vcpu page faults (hmm, not only
readers but writers too I think, with non-present ptes..) will be blocked
longer than before, am I right?

Meanwhile for the shadow mmu I think there can be more page tables to walk
compared to the tdp mmu for a single huge page to split?  My understanding
is tdp mmu pgtables are mostly limited by the number of address spaces (?),
but shadow pgtables are per-task.  So I'm not sure whether, for a guest
with a lot of active tasks sharing pages, the split thread can spend quite
some time splitting, during which time the write lock is held without
releasing.

These are kind of against the purpose of eager split on shadowing, which is
to reduce the influence on guest vcpu threads?  But I can't tell, I could
have missed something else.  It's just that when applying the idea to the
shadow mmu it sounds less attractive than the tdp mmu case.

Thanks,
David Matlack March 9, 2022, 11:39 p.m. UTC | #4
On Tue, Mar 8, 2022 at 11:31 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Mar 07, 2022 at 03:39:37PM -0800, David Matlack wrote:
> > On Sun, Mar 6, 2022 at 9:22 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Hi, David,
> > >
> > > Sorry for a very late comment.
> > >
> > > On Thu, Feb 03, 2022 at 01:00:28AM +0000, David Matlack wrote:
> > > > Performance
> > > > -----------
> > > >
> > > > Eager page splitting moves the cost of splitting huge pages off of the
> > > > vCPU thread and onto the thread invoking VM-ioctls to configure dirty
> > > > logging. This is useful because:
> > > >
> > > >  - Splitting on the vCPU thread interrupts vCPUs execution and is
> > > >    disruptive to customers whereas splitting on VM ioctl threads can
> > > >    run in parallel with vCPU execution.
> > > >
> > > >  - Splitting on the VM ioctl thread is more efficient because it does
> > > >    no require performing VM-exit handling and page table walks for every
> > > >    4K page.
> > > >
> > > > To measure the performance impact of Eager Page Splitting I ran
> > > > dirty_log_perf_test with tdp_mmu=N, various virtual CPU counts, 1GiB per
> > > > vCPU, and backed by 1GiB HugeTLB memory.
> > > >
> > > > To measure the imapct of customer performance, we can look at the time
> > > > it takes all vCPUs to dirty memory after dirty logging has been enabled.
> > > > Without Eager Page Splitting enabled, such dirtying must take faults to
> > > > split huge pages and bottleneck on the MMU lock.
> > > >
> > > >              | "Iteration 1 dirty memory time"             |
> > > >              | ------------------------------------------- |
> > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > ------------ | -------------------- | -------------------- |
> > > > 2            | 0.310786549s         | 0.058731929s         |
> > > > 4            | 0.419165587s         | 0.059615316s         |
> > > > 8            | 1.061233860s         | 0.060945457s         |
> > > > 16           | 2.852955595s         | 0.067069980s         |
> > > > 32           | 7.032750509s         | 0.078623606s         |
> > > > 64           | 16.501287504s        | 0.083914116s         |
> > > >
> > > > Eager Page Splitting does increase the time it takes to enable dirty
> > > > logging when not using initially-all-set, since that's when KVM splits
> > > > huge pages. However, this runs in parallel with vCPU execution and does
> > > > not bottleneck on the MMU lock.
> > > >
> > > >              | "Enabling dirty logging time"               |
> > > >              | ------------------------------------------- |
> > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > ------------ | -------------------- | -------------------- |
> > > > 2            | 0.001581619s         |  0.025699730s        |
> > > > 4            | 0.003138664s         |  0.051510208s        |
> > > > 8            | 0.006247177s         |  0.102960379s        |
> > > > 16           | 0.012603892s         |  0.206949435s        |
> > > > 32           | 0.026428036s         |  0.435855597s        |
> > > > 64           | 0.103826796s         |  1.199686530s        |
> > > >
> > > > Similarly, Eager Page Splitting increases the time it takes to clear the
> > > > dirty log for when using initially-all-set. The first time userspace
> > > > clears the dirty log, KVM will split huge pages:
> > > >
> > > >              | "Iteration 1 clear dirty log time"          |
> > > >              | ------------------------------------------- |
> > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > ------------ | -------------------- | -------------------- |
> > > > 2            | 0.001544730s         | 0.055327916s         |
> > > > 4            | 0.003145920s         | 0.111887354s         |
> > > > 8            | 0.006306964s         | 0.223920530s         |
> > > > 16           | 0.012681628s         | 0.447849488s         |
> > > > 32           | 0.026827560s         | 0.943874520s         |
> > > > 64           | 0.090461490s         | 2.664388025s         |
> > > >
> > > > Subsequent calls to clear the dirty log incur almost no additional cost
> > > > since KVM can very quickly determine there are no more huge pages to
> > > > split via the RMAP. This is unlike the TDP MMU which must re-traverse
> > > > the entire page table to check for huge pages.
> > > >
> > > >              | "Iteration 2 clear dirty log time"          |
> > > >              | ------------------------------------------- |
> > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > ------------ | -------------------- | -------------------- |
> > > > 2            | 0.015613726s         | 0.015771982s         |
> > > > 4            | 0.031456620s         | 0.031911594s         |
> > > > 8            | 0.063341572s         | 0.063837403s         |
> > > > 16           | 0.128409332s         | 0.127484064s         |
> > > > 32           | 0.255635696s         | 0.268837996s         |
> > > > 64           | 0.695572818s         | 0.700420727s         |
> > >
> > > Are all the tests above with ept=Y (except the one below)?
> >
> > Yes.
> >
> > >
> > > >
> > > > Eager Page Splitting also improves the performance for shadow paging
> > > > configurations, as measured with ept=N. Although the absolute gains are
> > > > less since ept=N requires taking the MMU lock to track writes to 4KiB
> > > > pages (i.e. no fast_page_fault() or PML), which dominates the dirty
> > > > memory time.
> > > >
> > > >              | "Iteration 1 dirty memory time"             |
> > > >              | ------------------------------------------- |
> > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > ------------ | -------------------- | -------------------- |
> > > > 2            | 0.373022770s         | 0.348926043s         |
> > > > 4            | 0.563697483s         | 0.453022037s         |
> > > > 8            | 1.588492808s         | 1.524962010s         |
> > > > 16           | 3.988934732s         | 3.369129917s         |
> > > > 32           | 9.470333115s         | 8.292953856s         |
> > > > 64           | 20.086419186s        | 18.531840021s        |
> > >
> > > This one is definitely for ept=N because it's written there. That's ~10%
> > > performance increase which looks still good, but IMHO that increase is
> > > "debatable" since a normal guest may not simply write over the whole guest
> > > mem.. So that 10% increase is based on some assumptions.
> > >
> > > What if the guest writes 80% and reads 20%?  IIUC the split thread will
> > > also start to block the readers too for shadow mmu while it was not blocked
> > > previusly?  From that pov, not sure whether the series needs some more
> > > justification, as the changeset seems still large.
> > >
> > > Is there other benefits besides the 10% increase on writes?
> >
> > Yes, in fact workloads that perform some reads will benefit _more_
> > than workloads that perform only writes.
> >
> > The reason is that the current lazy splitting approach unmaps the
> > entire huge page on write and then maps in the just the faulting 4K
> > page. That means reads on the unmapped portion of the hugepage will
> > now take a fault and require the MMU lock. In contrast, Eager Page
> > Splitting fully splits each huge page so readers should never take
> > faults.
> >
> > For example, here is the data with 20% writes and 80% reads (i.e. pass
> > `-f 5` to dirty_log_perf_test):
> >
> >              | "Iteration 1 dirty memory time"             |
> >              | ------------------------------------------- |
> > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > ------------ | -------------------- | -------------------- |
> > 2            | 0.403108098s         | 0.071808764s         |
> > 4            | 0.562173582s         | 0.105272819s         |
> > 8            | 1.382974557s         | 0.248713796s         |
> > 16           | 3.608993666s         | 0.571990327s         |
> > 32           | 9.100678321s         | 1.702453103s         |
> > 64           | 19.784780903s        | 3.489443239s        |
>
> It's very interesting to know these numbers, thanks for sharing that.
>
> Above reminded me that eager page split actually does two things:
>
> (1) When a page is mapped as huge, we "assume" this whole page will be
>     accessed in the near future, so when split is needed we map all the
>     small ptes, and,

Note, this series does not add this behavior to the fault path.

>
> (2) We move the split operation from page faults to when enable-dirty-track
>     happens.
>
> We could have done (1) already without the whole eager split patchsets: if
> we see a read-only huge page on a page fault, we could populat the whole
> range of ptes, only marking current small pte writable but leaving the rest
> small ptes wr-protected.  I had a feeling this will speedup the above 19.78
> seconds (64 cores case) fairly much too to some point.

The problem with (1) is that it still requires faults to split the
huge pages. Those faults will need to contend for the MMU lock, and
will hold the lock for longer than they do today since they are doing
extra work.

I agree there might be some benefit for some workloads, but for write-heavy
workloads there will still be a "thundering herd" problem when dirty
logging is first enabled. I'll admit though I have not tested this
approach.

An alternative approach we're looking at for handling read-heavy
workloads is to perform dirty logging at 2M granularity.

>
> Entry (1) makes a lot of sense to me; OTOH I can understand entry (2) but
> not strongly.
>
> My previous concern was majorly about having readers being blocked during
> splitting of huge pages (not after).  For shadow mmu, IIUC the split thread
> will start to take write lock rather than read lock (comparing to tdp mmu),
> hence any vcpu page faults (hmm, not only reader but writters too I think
> with non-present pte..) will be blocked longer than before, am I right?
>
> Meanwhile for shadow mmu I think there can be more page tables to walk
> comparing to the tdp mmu for a single huge page to split?  My understanding
> is tdp mmu pgtables are mostly limited by the number of address spaces (?),
> but shadow pgtables are per-task.

Or per-L2 VM, in the case of nested virtualization.

> So I'm not sure whether for a guest with
> a lot of active tasks sharing pages, the split thread can spend quite some
> time splitting, during which time with write lock held without releasing.

The eager page splitting code does check for contention and drop the
MMU lock in between every SPTE it tries to split. But there still
might be some increase in contention due to eager page splitting.
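
For reference, the yielding follows the usual KVM pattern, roughly the
sketch below (not the exact code from this series; "flush" tracks whether
any SPTEs were modified since the last TLB flush):

if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
	/* Flush any pending TLB invalidations before dropping the lock. */
	if (flush)
		kvm_flush_remote_tlbs(kvm);
	cond_resched_rwlock_write(&kvm->mmu_lock);
}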

>
> These are kind of against the purpose of eager split on shadowing, which is
> to reduce influence for guest vcpu threads?  But I can't tell, I could have
> missed something else.  It's just that when applying the idea to shadow mmu
> it sounds less attractive than the tdp mmu case.

The shadow MMU is also used for Nested Virtualization, which is a bit
different from "typical" shadow paging (ept/npt=N) because VMs tend
not to share pages, their page tables are fairly static (compared to
process page tables), and they tend to be longer lived. So there will
not be as much steady-state MMU lock contention that would be
negatively impacted by eager page splitting.

You might be right though that ept/npt=N has enough steady-state MMU
lock contention that it will notice eager page splitting. But then
again, it would be even more affected by lazy splitting unless the
guest is doing very few writes.

>
> Thanks,
>
> --
> Peter Xu
>
Peter Xu March 10, 2022, 7:03 a.m. UTC | #5
On Wed, Mar 09, 2022 at 03:39:44PM -0800, David Matlack wrote:
> On Tue, Mar 8, 2022 at 11:31 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Mon, Mar 07, 2022 at 03:39:37PM -0800, David Matlack wrote:
> > > On Sun, Mar 6, 2022 at 9:22 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > Hi, David,
> > > >
> > > > Sorry for a very late comment.
> > > >
> > > > On Thu, Feb 03, 2022 at 01:00:28AM +0000, David Matlack wrote:
> > > > > Performance
> > > > > -----------
> > > > >
> > > > > Eager page splitting moves the cost of splitting huge pages off of the
> > > > > vCPU thread and onto the thread invoking VM-ioctls to configure dirty
> > > > > logging. This is useful because:
> > > > >
> > > > >  - Splitting on the vCPU thread interrupts vCPUs execution and is
> > > > >    disruptive to customers whereas splitting on VM ioctl threads can
> > > > >    run in parallel with vCPU execution.
> > > > >
> > > > >  - Splitting on the VM ioctl thread is more efficient because it does
> > > > >    no require performing VM-exit handling and page table walks for every
> > > > >    4K page.
> > > > >
> > > > > To measure the performance impact of Eager Page Splitting I ran
> > > > > dirty_log_perf_test with tdp_mmu=N, various virtual CPU counts, 1GiB per
> > > > > vCPU, and backed by 1GiB HugeTLB memory.
> > > > >
> > > > > To measure the imapct of customer performance, we can look at the time
> > > > > it takes all vCPUs to dirty memory after dirty logging has been enabled.
> > > > > Without Eager Page Splitting enabled, such dirtying must take faults to
> > > > > split huge pages and bottleneck on the MMU lock.
> > > > >
> > > > >              | "Iteration 1 dirty memory time"             |
> > > > >              | ------------------------------------------- |
> > > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > > ------------ | -------------------- | -------------------- |
> > > > > 2            | 0.310786549s         | 0.058731929s         |
> > > > > 4            | 0.419165587s         | 0.059615316s         |
> > > > > 8            | 1.061233860s         | 0.060945457s         |
> > > > > 16           | 2.852955595s         | 0.067069980s         |
> > > > > 32           | 7.032750509s         | 0.078623606s         |
> > > > > 64           | 16.501287504s        | 0.083914116s         |
> > > > >
> > > > > Eager Page Splitting does increase the time it takes to enable dirty
> > > > > logging when not using initially-all-set, since that's when KVM splits
> > > > > huge pages. However, this runs in parallel with vCPU execution and does
> > > > > not bottleneck on the MMU lock.
> > > > >
> > > > >              | "Enabling dirty logging time"               |
> > > > >              | ------------------------------------------- |
> > > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > > ------------ | -------------------- | -------------------- |
> > > > > 2            | 0.001581619s         |  0.025699730s        |
> > > > > 4            | 0.003138664s         |  0.051510208s        |
> > > > > 8            | 0.006247177s         |  0.102960379s        |
> > > > > 16           | 0.012603892s         |  0.206949435s        |
> > > > > 32           | 0.026428036s         |  0.435855597s        |
> > > > > 64           | 0.103826796s         |  1.199686530s        |
> > > > >
> > > > > Similarly, Eager Page Splitting increases the time it takes to clear the
> > > > > dirty log for when using initially-all-set. The first time userspace
> > > > > clears the dirty log, KVM will split huge pages:
> > > > >
> > > > >              | "Iteration 1 clear dirty log time"          |
> > > > >              | ------------------------------------------- |
> > > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > > ------------ | -------------------- | -------------------- |
> > > > > 2            | 0.001544730s         | 0.055327916s         |
> > > > > 4            | 0.003145920s         | 0.111887354s         |
> > > > > 8            | 0.006306964s         | 0.223920530s         |
> > > > > 16           | 0.012681628s         | 0.447849488s         |
> > > > > 32           | 0.026827560s         | 0.943874520s         |
> > > > > 64           | 0.090461490s         | 2.664388025s         |
> > > > >
> > > > > Subsequent calls to clear the dirty log incur almost no additional cost
> > > > > since KVM can very quickly determine there are no more huge pages to
> > > > > split via the RMAP. This is unlike the TDP MMU which must re-traverse
> > > > > the entire page table to check for huge pages.
> > > > >
> > > > >              | "Iteration 2 clear dirty log time"          |
> > > > >              | ------------------------------------------- |
> > > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > > ------------ | -------------------- | -------------------- |
> > > > > 2            | 0.015613726s         | 0.015771982s         |
> > > > > 4            | 0.031456620s         | 0.031911594s         |
> > > > > 8            | 0.063341572s         | 0.063837403s         |
> > > > > 16           | 0.128409332s         | 0.127484064s         |
> > > > > 32           | 0.255635696s         | 0.268837996s         |
> > > > > 64           | 0.695572818s         | 0.700420727s         |
> > > >
> > > > Are all the tests above with ept=Y (except the one below)?
> > >
> > > Yes.
> > >
> > > >
> > > > >
> > > > > Eager Page Splitting also improves the performance for shadow paging
> > > > > configurations, as measured with ept=N. Although the absolute gains are
> > > > > less since ept=N requires taking the MMU lock to track writes to 4KiB
> > > > > pages (i.e. no fast_page_fault() or PML), which dominates the dirty
> > > > > memory time.
> > > > >
> > > > >              | "Iteration 1 dirty memory time"             |
> > > > >              | ------------------------------------------- |
> > > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > > ------------ | -------------------- | -------------------- |
> > > > > 2            | 0.373022770s         | 0.348926043s         |
> > > > > 4            | 0.563697483s         | 0.453022037s         |
> > > > > 8            | 1.588492808s         | 1.524962010s         |
> > > > > 16           | 3.988934732s         | 3.369129917s         |
> > > > > 32           | 9.470333115s         | 8.292953856s         |
> > > > > 64           | 20.086419186s        | 18.531840021s        |
> > > >
> > > > This one is definitely for ept=N because it's written there. That's ~10%
> > > > performance increase which looks still good, but IMHO that increase is
> > > > "debatable" since a normal guest may not simply write over the whole guest
> > > > mem.. So that 10% increase is based on some assumptions.
> > > >
> > > > What if the guest writes 80% and reads 20%?  IIUC the split thread will
> > > > also start to block the readers too for shadow mmu while it was not blocked
> > > > previusly?  From that pov, not sure whether the series needs some more
> > > > justification, as the changeset seems still large.
> > > >
> > > > Is there other benefits besides the 10% increase on writes?
> > >
> > > Yes, in fact workloads that perform some reads will benefit _more_
> > > than workloads that perform only writes.
> > >
> > > The reason is that the current lazy splitting approach unmaps the
> > > entire huge page on write and then maps in the just the faulting 4K
> > > page. That means reads on the unmapped portion of the hugepage will
> > > now take a fault and require the MMU lock. In contrast, Eager Page
> > > Splitting fully splits each huge page so readers should never take
> > > faults.
> > >
> > > For example, here is the data with 20% writes and 80% reads (i.e. pass
> > > `-f 5` to dirty_log_perf_test):
> > >
> > >              | "Iteration 1 dirty memory time"             |
> > >              | ------------------------------------------- |
> > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > ------------ | -------------------- | -------------------- |
> > > 2            | 0.403108098s         | 0.071808764s         |
> > > 4            | 0.562173582s         | 0.105272819s         |
> > > 8            | 1.382974557s         | 0.248713796s         |
> > > 16           | 3.608993666s         | 0.571990327s         |
> > > 32           | 9.100678321s         | 1.702453103s         |
> > > 64           | 19.784780903s        | 3.489443239s        |
> >
> > It's very interesting to know these numbers, thanks for sharing that.
> >
> > Above reminded me that eager page split actually does two things:
> >
> > (1) When a page is mapped as huge, we "assume" this whole page will be
> >     accessed in the near future, so when split is needed we map all the
> >     small ptes, and,
> 
> Note, this series does not add this behavior to the fault path.
> 
> >
> > (2) We move the split operation from page faults to when enable-dirty-track
> >     happens.
> >
> > We could have done (1) already without the whole eager split patchsets: if
> > we see a read-only huge page on a page fault, we could populat the whole
> > range of ptes, only marking current small pte writable but leaving the rest
> > small ptes wr-protected.  I had a feeling this will speedup the above 19.78
> > seconds (64 cores case) fairly much too to some point.
> 
> The problem with (1) is that it still requires faults to split the
> huge pages. Those faults will need to contend for the MMU lock, and
> will hold the lock for longer than they do today since they are doing
> extra work.

Right.  But that overhead is very limited, IMHO.. per the numbers, it's the
20sec and 18sec difference for full write faults.

The thing is either the split thread or the vcpu will take the write lock
anyway.  So it either contends during split, or later.  Without tdp (so
never PML) it'll need a slow page fault anyway even if the split is done
beforehand..

> 
> I agree there might be some benefit for workloads, but for write-heavy
> workloads there will still be a "thundering herd" problem when dirty
> logging is first enable. I'll admit though I have not testing this
> approach.

Indeed that's mainly the core of my question: why this series cares
more about write than read workloads.  To me they are all possible
workloads, but maybe I'm wrong?  This series benefits heavy writes, but it
may not benefit (or may even slow down) heavy reads.

The tdp mmu case is more persuasive in that:

  (a) Split runs concurrently on vcpu faults,

  (b) With PML, the tdp mmu case can completely avoid the small write
      page faults.

All these benefits do not exist for shadow mmu.
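
To illustrate the locking difference I mean in (a), and its absence for
the shadow mmu, here is a rough sketch (not the actual upstream code:
the split helpers are made-up names, while kvm_memslots_have_rmaps(),
is_tdp_mmu_enabled() and the mmu_lock calls are existing primitives):

static void try_split_huge_pages(struct kvm *kvm,
                                 const struct kvm_memory_slot *slot)
{
        /*
         * Huge pages mapped by the shadow MMU (i.e. present in the
         * memslot rmaps) can only be split under the MMU write lock,
         * which blocks vcpu faults for as long as it is held.
         */
        if (kvm_memslots_have_rmaps(kvm)) {
                write_lock(&kvm->mmu_lock);
                split_shadow_mmu_huge_pages(kvm, slot);  /* made-up name */
                write_unlock(&kvm->mmu_lock);
        }

        /*
         * The tdp mmu can split under the MMU read lock, so vcpu
         * faults can be handled in parallel with the split.
         */
        if (is_tdp_mmu_enabled(kvm)) {
                read_lock(&kvm->mmu_lock);
                split_tdp_mmu_huge_pages(kvm, slot);     /* made-up name */
                read_unlock(&kvm->mmu_lock);
        }
}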

I don't think I'm against this series..  I think at least with the series
we can have a matching feature on tdp and !tdp, and meanwhile it still
benefits read+write mixed workloads a lot, as you proved in the follow-up
tests (PS: do you think that should be mentioned in the cover letter too?).

IMHO once a performance feature is merged it'll be hard to remove, because
once merged it'll also be harder to prove it wrong.  I hope it'll be worth
it when it gets merged and maintained in upstream kvm, so I raised these
questions in the hope that we at least thoroughly discuss the pros and cons.

> 
> An alternative approach to handling read-heavy workloads we're looking
> at is to perform dirty logging at 2M.

I agree that's still something worth exploring.

> 
> >
> > Entry (1) makes a lot of sense to me; OTOH I can understand entry (2) but
> > not strongly.
> >
> > My previous concern was mainly about readers being blocked during the
> > splitting of huge pages (not after).  For shadow mmu, IIUC the split thread
> > will start to take the write lock rather than the read lock (compared to
> > tdp mmu), hence any vcpu page faults (hmm, not only readers but writers too
> > I think, with a non-present pte..) will be blocked longer than before, am I
> > right?
> >
> > Meanwhile for shadow mmu I think there can be more page tables to walk
> > compared to the tdp mmu for a single huge page to split?  My understanding
> > is tdp mmu pgtables are mostly limited by the number of address spaces (?),
> > but shadow pgtables are per-task.
> 
> Or per-L2 VM, in the case of nested virtualization.
> 
> > So I'm not sure whether, for a guest with
> > a lot of active tasks sharing pages, the split thread could spend quite some
> > time splitting while holding the write lock without releasing it.
> 
> The eager page splitting code does check for contention and drop the
> MMU lock in between every SPTE it tries to split. But there still
> might be some increase in contention due to eager page splitting.

Ah right..

> 
> >
> > These are kind of against the purpose of eager split for the shadow mmu,
> > which is to reduce the impact on guest vcpu threads?  But I can't tell, I
> > could have missed something else.  It's just that when applying the idea to
> > the shadow mmu it sounds less attractive than the tdp mmu case.
> 
> The shadow MMU is also used for Nested Virtualization, which is a bit
> different from "typical" shadow paging (ept/npt=N) because VMs tend
> not to share pages, their page tables are fairly static (compared to
> process page tables), and they tend to be longer lived. So there will
> not be as much steady-state MMU lock contention that would be
> negatively impacted by eager page splitting.
> 
> You might be right though that ept/npt=N has enough steady-state MMU
> lock contention that it will notice eager page splitting. But then
> again, it would be even more affected by lazy splitting unless the
> guest is doing very few writes.

Yes, indeed I see no easy solution to this due to the same lock contention.

Thanks,
David Matlack March 10, 2022, 7:26 p.m. UTC | #6
On Wed, Mar 9, 2022 at 11:03 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Mar 09, 2022 at 03:39:44PM -0800, David Matlack wrote:
> > On Tue, Mar 8, 2022 at 11:31 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Mon, Mar 07, 2022 at 03:39:37PM -0800, David Matlack wrote:
> > > > On Sun, Mar 6, 2022 at 9:22 PM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > Hi, David,
> > > > >
> > > > > Sorry for a very late comment.
> > > > >
> > > > > On Thu, Feb 03, 2022 at 01:00:28AM +0000, David Matlack wrote:
> > > > > > Performance
> > > > > > -----------
> > > > > >
> > > > > > Eager page splitting moves the cost of splitting huge pages off of the
> > > > > > vCPU thread and onto the thread invoking VM-ioctls to configure dirty
> > > > > > logging. This is useful because:
> > > > > >
> > > > > >  - Splitting on the vCPU thread interrupts vCPUs execution and is
> > > > > >    disruptive to customers whereas splitting on VM ioctl threads can
> > > > > >    run in parallel with vCPU execution.
> > > > > >
> > > > > >  - Splitting on the VM ioctl thread is more efficient because it does
> > > > > >    not require performing VM-exit handling and page table walks for every
> > > > > >    4K page.
> > > > > >
> > > > > > To measure the performance impact of Eager Page Splitting I ran
> > > > > > dirty_log_perf_test with tdp_mmu=N, various virtual CPU counts, 1GiB per
> > > > > > vCPU, and backed by 1GiB HugeTLB memory.
> > > > > >
> > > > > > To measure the impact on customer performance, we can look at the time
> > > > > > it takes all vCPUs to dirty memory after dirty logging has been enabled.
> > > > > > Without Eager Page Splitting enabled, such dirtying must take faults to
> > > > > > split huge pages and bottleneck on the MMU lock.
> > > > > >
> > > > > >              | "Iteration 1 dirty memory time"             |
> > > > > >              | ------------------------------------------- |
> > > > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > > > ------------ | -------------------- | -------------------- |
> > > > > > 2            | 0.310786549s         | 0.058731929s         |
> > > > > > 4            | 0.419165587s         | 0.059615316s         |
> > > > > > 8            | 1.061233860s         | 0.060945457s         |
> > > > > > 16           | 2.852955595s         | 0.067069980s         |
> > > > > > 32           | 7.032750509s         | 0.078623606s         |
> > > > > > 64           | 16.501287504s        | 0.083914116s         |
> > > > > >
> > > > > > Eager Page Splitting does increase the time it takes to enable dirty
> > > > > > logging when not using initially-all-set, since that's when KVM splits
> > > > > > huge pages. However, this runs in parallel with vCPU execution and does
> > > > > > not bottleneck on the MMU lock.
> > > > > >
> > > > > >              | "Enabling dirty logging time"               |
> > > > > >              | ------------------------------------------- |
> > > > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > > > ------------ | -------------------- | -------------------- |
> > > > > > 2            | 0.001581619s         |  0.025699730s        |
> > > > > > 4            | 0.003138664s         |  0.051510208s        |
> > > > > > 8            | 0.006247177s         |  0.102960379s        |
> > > > > > 16           | 0.012603892s         |  0.206949435s        |
> > > > > > 32           | 0.026428036s         |  0.435855597s        |
> > > > > > 64           | 0.103826796s         |  1.199686530s        |
> > > > > >
> > > > > > Similarly, Eager Page Splitting increases the time it takes to clear the
> > > > > > dirty log for when using initially-all-set. The first time userspace
> > > > > > clears the dirty log, KVM will split huge pages:
> > > > > >
> > > > > >              | "Iteration 1 clear dirty log time"          |
> > > > > >              | ------------------------------------------- |
> > > > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > > > ------------ | -------------------- | -------------------- |
> > > > > > 2            | 0.001544730s         | 0.055327916s         |
> > > > > > 4            | 0.003145920s         | 0.111887354s         |
> > > > > > 8            | 0.006306964s         | 0.223920530s         |
> > > > > > 16           | 0.012681628s         | 0.447849488s         |
> > > > > > 32           | 0.026827560s         | 0.943874520s         |
> > > > > > 64           | 0.090461490s         | 2.664388025s         |
> > > > > >
> > > > > > Subsequent calls to clear the dirty log incur almost no additional cost
> > > > > > since KVM can very quickly determine there are no more huge pages to
> > > > > > split via the RMAP. This is unlike the TDP MMU which must re-traverse
> > > > > > the entire page table to check for huge pages.
> > > > > >
> > > > > >              | "Iteration 2 clear dirty log time"          |
> > > > > >              | ------------------------------------------- |
> > > > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > > > ------------ | -------------------- | -------------------- |
> > > > > > 2            | 0.015613726s         | 0.015771982s         |
> > > > > > 4            | 0.031456620s         | 0.031911594s         |
> > > > > > 8            | 0.063341572s         | 0.063837403s         |
> > > > > > 16           | 0.128409332s         | 0.127484064s         |
> > > > > > 32           | 0.255635696s         | 0.268837996s         |
> > > > > > 64           | 0.695572818s         | 0.700420727s         |
> > > > >
> > > > > Are all the tests above with ept=Y (except the one below)?
> > > >
> > > > Yes.
> > > >
> > > > >
> > > > > >
> > > > > > Eager Page Splitting also improves the performance for shadow paging
> > > > > > configurations, as measured with ept=N. Although the absolute gains are
> > > > > > less since ept=N requires taking the MMU lock to track writes to 4KiB
> > > > > > pages (i.e. no fast_page_fault() or PML), which dominates the dirty
> > > > > > memory time.
> > > > > >
> > > > > >              | "Iteration 1 dirty memory time"             |
> > > > > >              | ------------------------------------------- |
> > > > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > > > ------------ | -------------------- | -------------------- |
> > > > > > 2            | 0.373022770s         | 0.348926043s         |
> > > > > > 4            | 0.563697483s         | 0.453022037s         |
> > > > > > 8            | 1.588492808s         | 1.524962010s         |
> > > > > > 16           | 3.988934732s         | 3.369129917s         |
> > > > > > 32           | 9.470333115s         | 8.292953856s         |
> > > > > > 64           | 20.086419186s        | 18.531840021s        |
> > > > >
> > > > > This one is definitely for ept=N because it's written there. That's ~10%
> > > > > performance increase, which still looks good, but IMHO that increase is
> > > > > "debatable" since a normal guest may not simply write over the whole guest
> > > > > mem.. So that 10% increase is based on some assumptions.
> > > > >
> > > > > What if the guest writes 80% and reads 20%?  IIUC the split thread will
> > > > > also start to block the readers too for shadow mmu while it was not blocked
> > > > > previously?  From that pov, not sure whether the series needs some more
> > > > > justification, as the changeset still seems large.
> > > > >
> > > > > Are there other benefits besides the 10% increase on writes?
> > > >
> > > > Yes, in fact workloads that perform some reads will benefit _more_
> > > > than workloads that perform only writes.
> > > >
> > > > The reason is that the current lazy splitting approach unmaps the
> > > > entire huge page on write and then maps in just the faulting 4K
> > > > page. That means reads on the unmapped portion of the hugepage will
> > > > now take a fault and require the MMU lock. In contrast, Eager Page
> > > > Splitting fully splits each huge page so readers should never take
> > > > faults.
> > > >
> > > > For example, here is the data with 20% writes and 80% reads (i.e. pass
> > > > `-f 5` to dirty_log_perf_test):
> > > >
> > > >              | "Iteration 1 dirty memory time"             |
> > > >              | ------------------------------------------- |
> > > > vCPU Count   | eager_page_split=N   | eager_page_split=Y   |
> > > > ------------ | -------------------- | -------------------- |
> > > > 2            | 0.403108098s         | 0.071808764s         |
> > > > 4            | 0.562173582s         | 0.105272819s         |
> > > > 8            | 1.382974557s         | 0.248713796s         |
> > > > 16           | 3.608993666s         | 0.571990327s         |
> > > > 32           | 9.100678321s         | 1.702453103s         |
> > > > 64           | 19.784780903s        | 3.489443239s         |
> > >
> > > It's very interesting to know these numbers, thanks for sharing that.
> > >
> > > The above reminded me that eager page split actually does two things:
> > >
> > > (1) When a page is mapped as huge, we "assume" this whole page will be
> > >     accessed in the near future, so when split is needed we map all the
> > >     small ptes, and,
> >
> > Note, this series does not add this behavior to the fault path.
> >
> > >
> > > (2) We move the split operation from page faults to when enable-dirty-track
> > >     happens.
> > >
> > > We could have done (1) already without the whole eager split patchsets: if
> > > we see a read-only huge page on a page fault, we could populate the whole
> > > range of ptes, only marking the current small pte writable but leaving the
> > > rest of the small ptes write-protected.  I have a feeling this would also
> > > speed up the 19.78 seconds above (64 cores case) considerably.
> >
> > The problem with (1) is that it still requires faults to split the
> > huge pages. Those faults will need to contend for the MMU lock, and
> > will hold the lock for longer than they do today since they are doing
> > extra work.
>
> Right.  But that overhead is very limited, IMHO.. per the numbers, it's the
> difference between 20sec and 18sec for the 100% write case.
>
> The thing is that either the split thread or the vcpu will take the write
> lock anyway.  So it either contends during the split, or later.  Without tdp
> (so never PML) it'll need a slow page fault anyway even if the split is done
> beforehand..
>
> >
> > I agree there might be some benefit for some workloads, but for write-heavy
> > workloads there will still be a "thundering herd" problem when dirty
> > logging is first enabled. I'll admit though I have not tested this
> > approach.
>
> Indeed that's the core of my question: why does this series care more about
> write workloads than read workloads?  To me they are all possible workloads,
> but maybe I'm wrong?  This series benefits heavy writes, but it may not
> benefit (or may even slow down) heavy reads.

It's not that either workload is more important than the other, or
that we care about one more than the other. It's about the effects of
dirty logging on each workload.

Eager page splitting is all about avoiding the large (like 99%
degradation), abrupt, scales-with-the-number-of-vcpus drop in
performance when dirty logging is enabled. This drop can be
catastrophic to customer workloads, causing application failure. Eager
page splitting may introduce higher TLB miss costs for read-heavy
workloads, making them worse than without Eager page splitting, but
that is not something that causes application failure. Maybe this is
bias from working for a cloud provider, but it's much better to have
predictable performance for all workloads (even if it's slightly worse
for some workloads) than a system that causes catastrophic failure for
some workloads.

Now that being said, KVM's shadow paging can still cause "catastrophic
failure" since it requires the write lock to handle 4KiB
write-protection faults. That's something that would be worth
addressing as well, but separately.

>
> The tdp mmu case is more persuasive in that:
>
>   (a) The split runs concurrently with vcpu faults,
>
>   (b) With PML, the tdp mmu case can completely avoid the small write
>       page faults.
>
> All these benefits do not exist for shadow mmu.

Here's how I reason about the benefits of eager page splitting for the
shadow MMU. During dirty logging the shadow MMU suffers from:

(1) Write-protection faults on huge pages that take the MMU lock to
    unmap the huge page, map a 4KiB page, and update the dirty log.
(2) Non-present faults caused by (1) that take the MMU lock to map in
    the missing page.
(3) Write-protection faults on 4KiB pages that take the MMU lock to
    make the page writable and update the dirty log.

The benefit of eager page splitting is to eliminate (1) and (2).

(BTW, maybe to address (3) we could try to handle these
write-protection faults under the MMU read lock.)
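
On the lock contention point from earlier in the thread, the split loop
yields the MMU write lock between SPTEs roughly along the lines of the
sketch below. This is only a sketch rather than the code from the series
(the rmap walker and the split helper are made-up names), but
need_resched(), rwlock_needbreak() and cond_resched_rwlock_write() are
the existing kernel primitives it would build on:

static void eager_split_memslot(struct kvm *kvm,
                                const struct kvm_memory_slot *slot)
{
        u64 *huge_sptep;

        write_lock(&kvm->mmu_lock);

        /* Made-up walker: returns the next huge SPTE still to be split. */
        while ((huge_sptep = next_huge_sptep(kvm, slot))) {
                /*
                 * If a vcpu is waiting on the MMU lock or we need to
                 * reschedule, drop the write lock and reacquire it
                 * before continuing so vcpus are not blocked for the
                 * duration of the whole memslot.
                 */
                if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
                        cond_resched_rwlock_write(&kvm->mmu_lock);

                split_huge_spte(kvm, slot, huge_sptep);  /* made-up name */
        }

        write_unlock(&kvm->mmu_lock);
}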

>
> I don't think I'm against this series..  I think at least with the series
> we can have matching feature on tdp and !tdp, meanwhile it still benefits a
> lot on read+write mix workloads are you proved in the follow up tests (PS:
> do you think that should be mentioned in the cover letter too?).

Yes, will do!

>
> IMHO when a performance feature is merged, it'll be harder to be removed
> because once merged it'll be harder to be proved wrong.  I hope it'll be
> worth it when it gets merged and being maintained in upstream kvm, so I
> raised these questions, hope that we at least thoroughly discuss the pros
> and cons.
>
> >
> > An alternative approach to handling read-heavy workloads we're looking
> > at is to perform dirty logging at 2M.
>
> I agree that's still something worth exploring.
>
> >
> > >
> > > Entry (1) makes a lot of sense to me; OTOH I can understand entry (2) but
> > > not strongly.
> > >
> > > My previous concern was majorly about having readers being blocked during
> > > splitting of huge pages (not after).  For shadow mmu, IIUC the split thread
> > > will start to take write lock rather than read lock (comparing to tdp mmu),
> > > hence any vcpu page faults (hmm, not only reader but writters too I think
> > > with non-present pte..) will be blocked longer than before, am I right?
> > >
> > > Meanwhile for shadow mmu I think there can be more page tables to walk
> > > comparing to the tdp mmu for a single huge page to split?  My understanding
> > > is tdp mmu pgtables are mostly limited by the number of address spaces (?),
> > > but shadow pgtables are per-task.
> >
> > Or per-L2 VM, in the case of nested virtualization.
> >
> > > So I'm not sure whether for a guest with
> > > a lot of active tasks sharing pages, the split thread can spend quite some
> > > time splitting, during which time with write lock held without releasing.
> >
> > The eager page splitting code does check for contention and drop the
> > MMU lock in between every SPTE it tries to split. But there still
> > might be some increase in contention due to eager page splitting.
>
> Ah right..
>
> >
> > >
> > > These are kind of against the purpose of eager split on shadowing, which is
> > > to reduce influence for guest vcpu threads?  But I can't tell, I could have
> > > missed something else.  It's just that when applying the idea to shadow mmu
> > > it sounds less attractive than the tdp mmu case.
> >
> > The shadow MMU is also used for Nested Virtualization, which is a bit
> > different from "typical" shadow paging (ept/npt=N) because VMs tend
> > not to share pages, their page tables are fairly static (compared to
> > process page tables), and they tend to be longer lived. So there will
> > not be as much steady-state MMU lock contention that would be
> > negatively impacted by eager page splitting.
> >
> > You might be right though that ept/npt=N has enough steady-state MMU
> > lock contention that it will notice eager page splitting. But then
> > again, it would be even more affected by lazy splitting unless the
> > guest is doing very few writes.
>
> Yes, indeed I see no easy solution to this due to the same lock contention.
>
> Thanks,
>
> --
> Peter Xu
>