
[v3,00/23] KVM: Extend Eager Page Splitting to the shadow MMU

Message ID: 20220401175554.1931568-1-dmatlack@google.com

Message

David Matlack April 1, 2022, 5:55 p.m. UTC
This series extends KVM's Eager Page Splitting to also split huge pages
mapped by the shadow MMU, i.e. huge pages present in the memslot rmaps.
This will be useful for configurations that use Nested Virtualization,
disable the TDP MMU, or disable/lack TDP hardware support.

For background on Eager Page Splitting, see:
 - Proposal: https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com/
 - TDP MMU support: https://lore.kernel.org/kvm/20220119230739.2234394-1-dmatlack@google.com/

Splitting huge pages mapped by the shadow MMU is more complicated than
the TDP MMU, but it is also more important for performance as the shadow
MMU handles huge page write-protection faults under the write lock.  See
the Performance section for more details.

The extra complexity of splitting huge pages mapped by the shadow MMU
comes from a few places:

(1) The shadow MMU has a limit on the number of shadow pages that are
    allowed to be allocated. So, as a policy, Eager Page Splitting
    refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
    pages available.

(2) Huge pages may be mapped by indirect shadow pages.

    - Indirect shadow pages have the possibility of being unsync. As a
      policy, we opt not to split such pages as their translation may no
      longer be valid.
    - Huge pages on indirect shadow pages may have access permission
      constraints from the guest (unlike the TDP MMU which is ACC_ALL
      by default).

(3) Splitting a huge page may end up re-using an existing lower-level
    shadow page table. This is unlike the TDP MMU, which always allocates
    new shadow page tables when splitting.

(4) When installing the lower level SPTEs, they must be added to the
    rmap which may require allocating additional pte_list_desc structs.

In Google's internal implementation of Eager Page Splitting, we do not
handle cases (3) and (4), and instead opt to skip splitting entirely
(case 3) or to split only partially (case 4). This series handles the
additional cases, which requires an additional 4KiB of memory per VM to
store the extra pte_list_desc cache. However, it also avoids the need
for TLB flushes in most cases and allows KVM to split more pages mapped
by shadow paging.
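
To make (1) through (4) concrete, a single shadow-MMU split step might be
shaped roughly like the sketch below. This is illustrative only: the helper
names shadow_mmu_get_sp_for_split() and shadow_mmu_split_huge_page() are
hypothetical, not the names used in the patches.

static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
					  const struct kvm_memory_slot *slot,
					  u64 *huge_sptep)
{
	struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
	struct kvm_mmu_page *sp;

	/* (1) Refuse to split if the shadow page limit is nearly exhausted. */
	if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES)
		return -ENOSPC;

	/* (2) Skip unsync indirect pages; their translation may be stale. */
	if (huge_sp->unsync)
		return -EAGAIN;

	/*
	 * (3) Find or allocate the lower-level shadow page. Unlike the TDP
	 * MMU, an existing page table may be re-used here.
	 */
	sp = shadow_mmu_get_sp_for_split(kvm, slot, huge_sptep);

	/*
	 * (4) Install the lower-level SPTEs and add them to the rmap, which
	 * may consume pte_list_desc structs from a cache topped up before
	 * taking the MMU lock.
	 */
	return shadow_mmu_split_huge_page(kvm, slot, huge_sptep, sp);
}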

The bulk of this series is just refactoring the existing MMU code in
preparation for splitting, specifically to make it possible to operate
on the MMU outside of a vCPU context.

Motivation
----------

During dirty logging, VMs using the shadow MMU suffer from:

(1) Write-protection faults on huge pages that take the MMU lock to
    unmap the huge page, map a 4KiB page, and update the dirty log.

(2) Non-present faults caused by (1) that take the MMU lock to map in
    the missing page.

(3) Write-protection faults on 4KiB pages that take the MMU lock to
    make the page writable and update the dirty log. [Note: These faults
    only take the MMU lock during shadow paging.]

The lock contention from (1), (2) and (3) can severely degrade
application performance to the point of failure.  Eager page splitting
eliminates (1) by moving the splitting of huge pages off the vCPU
threads onto the thread invoking VM-ioctls to configure dirty logging,
and eliminates (2) by fully splitting each huge page into its
constituent small pages. (3) is still a concern for shadow paging
workloads (e.g. nested virtualization) but is not addressed by this
series.

Splitting in the VM-ioctl thread is useful because it can run in the
background without interrupting vCPU execution. However, it does take
the MMU lock so it may introduce some extra contention if vCPUs are
hammering the MMU lock. This is offset by the fact that eager page
splitting drops the MMU lock after splitting each SPTE if there is any
contention, and the fact that eager page splitting is reducing the MMU
lock contention from (1) and (2) above. Even workloads that only write
to 5% of their memory see massive MMU lock contention reduction during
dirty logging thanks to Eager Page Splitting (see Performance data
below).
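
For reference, the yield-on-contention behavior follows the usual KVM MMU
pattern; a minimal sketch (the helper name is illustrative and the enclosing
rmap walk is elided):

static void eager_split_maybe_yield(struct kvm *kvm)
{
	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
		/* Drops and re-acquires kvm->mmu_lock, rescheduling if needed. */
		cond_resched_rwlock_write(&kvm->mmu_lock);
	}
}

After re-acquiring the lock the walk has to be restarted, since the SPTEs
being visited may have been zapped while the lock was dropped.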

A downside of Eager Page Splitting is that it splits all huge pages,
which may include ranges of memory that are never written to by the
guest and thus could theoretically stay huge. Workloads that write to
only a fraction of their memory may see higher TLB miss costs with Eager
Page Splitting enabled. However, that is secondary to the application
failure that otherwise may occur without Eager Page Splitting.

Further work is necessary to improve the TLB miss performance for
read-heavy workloads, such as dirty logging at 2M instead of 4K.

Performance
-----------

To measure the performance impact of Eager Page Splitting I ran
dirty_log_perf_test with tdp_mmu=N, various virtual CPU counts, 1GiB per
vCPU, and backed by 1GiB HugeTLB memory. The amount of memory that was
written to versus read was controlled with the -f option.

To measure the impact on customer performance, we can look at the time
it takes all vCPUs to dirty memory after dirty logging has been enabled.
Without Eager Page Splitting enabled, such dirtying must take faults to
split huge pages and bottleneck on the MMU lock.

             | Config: ept=Y, tdp_mmu=N, 100% writes                   |
             | Iteration 1 dirty memory time                           |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.326340139s               | 0.058645119s               |
4            | 0.425658733s               | 0.059211364s               |
8            | 1.392495283s               | 0.059992269s               |
16           | 2.891475203s               | 0.074386427s               |
32           | 7.077453255s               | 0.074484273s               |
64           | 17.907075277s              | 0.080433025s               |

             | Config: ept=Y, tdp_mmu=N, 5% writes                     |
             | Iteration 1 dirty memory time                           |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.333304003s               | 0.005811521s               |
4            | 0.418382740s               | 0.006195093s               |
8            | 1.127732621s               | 0.007388453s               |
16           | 3.003522635s               | 0.007854799s               |
32           | 7.341293635s               | 0.012048705s               |
64           | 16.555752029s              | 0.016820654s               |

Eager Page Splitting does increase the time it takes to enable dirty
logging when not using initially-all-set, since that's when KVM splits
huge pages. However, this runs in parallel with vCPU execution and drops
the MMU lock whenever there is contention.

             | Config: ept=Y, tdp_mmu=N, 100% writes                   |
             | Enabling dirty logging time                             |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.001618723s               | 0.026362044s               |
4            | 0.003214434s               | 0.052079721s               |
8            | 0.006392632s               | 0.106090423s               |
16           | 0.012733697s               | 0.212877154s               |
32           | 0.060493391s               | 0.438669189s               |
64           | 0.104983842s               | 1.199435360s               |

Similarly, Eager Page Splitting increases the time it takes to clear the
dirty log when using initially-all-set. The first time userspace
clears the dirty log, KVM will split huge pages:

             | Config: ept=Y, tdp_mmu=N, 100% writes initially-all-set |
             | Iteration 1 clear dirty log time                        |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.001599286s               | 0.027095008s               |
4            | 0.003180496s               | 0.053376070s               |
8            | 0.006417064s               | 0.106120759s               |
16           | 0.012826026s               | 0.215146223s               |
32           | 0.027089937s               | 0.444193363s               |
64           | 0.090911208s               | 1.200149758s               |

Subsequent calls to clear the dirty log incur almost no additional cost
since KVM can very quickly determine there are no more huge pages to
split via the RMAP. This is unlike the TDP MMU which must re-traverse
the entire page table to check for huge pages.

             | Config: ept=Y, tdp_mmu=N, 100% writes initially-all-set |
             | Iteration 2 clear dirty log time                        |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.016009056s               | 0.016397093s               |
4            | 0.031897037s               | 0.032602744s               |
8            | 0.063964980s               | 0.064671106s               |
16           | 0.128522311s               | 0.131025221s               |
32           | 0.259647643s               | 0.273217378s               |
64           | 0.647532182s               | 0.705208495s               |


Eager Page Splitting also improves the performance of shadow paging
configurations, as measured with ept=N. The absolute gains are smaller
for write-heavy workloads, since KVM's shadow paging takes the write
lock to track 4KiB writes (i.e. no fast_page_fault() or PML), but there
are still major gains for read/write and read-heavy workloads.

             | Config: ept=N, tdp_mmu=Y, 100% writes                   |
             | Iteration 1 dirty memory time                           |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.369208141s               | 0.346649742s               |
4            | 0.586295755s               | 0.486400573s               |
8            | 1.605597377s               | 1.433775448s               |
16           | 3.499827702s               | 3.424215648s               |
32           | 9.219003076s               | 8.142864934s               |
64           | 20.876099825s              | 19.342073745s              |

             | Config: ept=N, tdp_mmu=Y, 50% writes                    |
             | Iteration 1 dirty memory time                           |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.381155928s               | 0.179281063s               |
4            | 0.530227136s               | 0.262919652s               |
8            | 1.771578193s               | 0.732303441s               |
16           | 3.744348496s               | 1.633369935s               |
32           | 9.558809131s               | 4.343945991s               |
64           | 20.257131790s              | 9.609166533s               |

             | Config: ept=N, tdp_mmu=Y, 5% writes                     |
             | Iteration 1 dirty memory time                           |
             | ------------------------------------------------------- |
vCPU Count   | eager_page_split=N         | eager_page_split=Y         |
------------ | -------------------------- | -------------------------- |
2            | 0.385243149s               | 0.020628192s               |
4            | 0.506447280s               | 0.024045445s               |
8            | 1.635995700s               | 0.064108556s               |
16           | 3.752015939s               | 0.131484155s               |
32           | 9.493783332s               | 0.422104253s               |
64           | 21.250099240s              | 0.984948363s               |

Testing
-------

- Ran all kvm-unit-tests and KVM selftests with all combinations of
  ept=[NY] and tdp_mmu=[NY].
- Booted a 32-bit non-PAE kernel with shadow paging to verify the
  quadrant change in patch 3.
- Tested VM live migration [*] with ept=N and ept=Y and observed pages
  being split via tracepoint and the pages_* stats.

[*] The live migration setup consisted of an 8 vCPU 8 GiB VM running
    on an Intel Cascade Lake host and backed by 1GiB HugeTLBFS memory.
    The VM was running Debian 10 and a workload that consisted of 16
    independent processes that each dirty memory. The tests were run
    with ept=N to exercise the interaction of Eager Page Splitting and
    shadow paging.

Version Log
-----------

v3:
 - Add R-b tags from Peter.
 - Explain direct SPs in indirect MMUs in commit message [Peter]
 - Change BUG_ON() to WARN_ON_ONCE() in quadrant calculation [me]
 - Eliminate unnecessary gotos [Peter]
 - Drop mmu_alloc_pte_list_desc() [Peter]
 - Also update access cache in mmu_set_spte() if was_rmapped [Peter]
 - Fix number of gfn bits in shadowed_translation cache [Peter]
 - Pass sp to make_huge_page_split_spte() to derive level and exec [me]
 - Eliminate flush var in kvm_rmap_zap_collapsible_sptes() [Peter]
 - Drop NULL pte_list_desc cache fallback [Peter]
 - Fix get_access to return sp->role.access. [me]
 - Re-use split cache across calls to CLEAR_DIRTY_LOG for better perf [me]
 - Top-up the split cache outside of the MMU lock when possible [me]
 - Refactor prepare_to_split_huge_page() into try_split_huge_page() [me]
 - Collapse PATCH 20, 23, and 24 to avoid intermediate complexity [Peter]
 - Update the RISC-V function stage2_ioremap() [Anup]

v2: https://lore.kernel.org/kvm/20220311002528.2230172-1-dmatlack@google.com/
 - Add performance data for workloads that mix reads and writes [Peter]
 - Collect R-b tags from Ben and Sean.
 - Fix quadrant calculation when deriving role from parent [Sean]
 - Tweak new shadow page function names [Sean]
 - Move set_page_private() to allocation functions [Ben]
 - Only zap collapsible SPTEs up to MAX_LEVEL-1 [Ben]
 - Always top-up pte_list_desc cache to reduce complexity [Ben]
 - Require mmu cache capacity field to be initialized and add WARN()
   to reduce chance of programmer error [Marc]
 - Fix up kvm_mmu_memory_cache struct initialization in arm64 [Marc]

v1: https://lore.kernel.org/kvm/20220203010051.2813563-1-dmatlack@google.com/


David Matlack (23):
  KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs
  KVM: x86/mmu: Use a bool for direct
  KVM: x86/mmu: Derive shadow MMU page role from parent
  KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions
  KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages
  KVM: x86/mmu: Pass memslot to kvm_mmu_new_shadow_page()
  KVM: x86/mmu: Separate shadow MMU sp allocation from initialization
  KVM: x86/mmu: Link spt to sp during allocation
  KVM: x86/mmu: Move huge page split sp allocation code to mmu.c
  KVM: x86/mmu: Use common code to free kvm_mmu_page structs
  KVM: x86/mmu: Use common code to allocate shadow pages from vCPU
    caches
  KVM: x86/mmu: Pass const memslot to rmap_add()
  KVM: x86/mmu: Pass const memslot to init_shadow_page() and descendants
  KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu
  KVM: x86/mmu: Update page stats in __rmap_add()
  KVM: x86/mmu: Cache the access bits of shadowed translations
  KVM: x86/mmu: Extend make_huge_page_split_spte() for the shadow MMU
  KVM: x86/mmu: Zap collapsible SPTEs at all levels in the shadow MMU
  KVM: x86/mmu: Refactor drop_large_spte()
  KVM: Allow for different capacities in kvm_mmu_memory_cache structs
  KVM: Allow GFP flags to be passed when topping up MMU caches
  KVM: x86/mmu: Support Eager Page Splitting in the shadow MMU
  KVM: selftests: Map x86_64 guest virtual memory with huge pages

 .../admin-guide/kernel-parameters.txt         |   3 -
 arch/arm64/include/asm/kvm_host.h             |   2 +-
 arch/arm64/kvm/arm.c                          |   1 +
 arch/arm64/kvm/mmu.c                          |  13 +-
 arch/mips/include/asm/kvm_host.h              |   2 +-
 arch/mips/kvm/mips.c                          |   2 +
 arch/riscv/include/asm/kvm_host.h             |   2 +-
 arch/riscv/kvm/mmu.c                          |  17 +-
 arch/riscv/kvm/vcpu.c                         |   1 +
 arch/x86/include/asm/kvm_host.h               |  20 +-
 arch/x86/include/asm/kvm_page_track.h         |   2 +-
 arch/x86/kvm/mmu/mmu.c                        | 732 ++++++++++++++----
 arch/x86/kvm/mmu/mmu_internal.h               |  27 +-
 arch/x86/kvm/mmu/page_track.c                 |   4 +-
 arch/x86/kvm/mmu/paging_tmpl.h                |  22 +-
 arch/x86/kvm/mmu/spte.c                       |  18 +-
 arch/x86/kvm/mmu/spte.h                       |   2 +-
 arch/x86/kvm/mmu/tdp_mmu.c                    |  48 +-
 arch/x86/kvm/mmu/tdp_mmu.h                    |   2 +-
 arch/x86/kvm/x86.c                            |   6 +
 include/linux/kvm_host.h                      |   1 +
 include/linux/kvm_types.h                     |  19 +-
 .../selftests/kvm/include/x86_64/processor.h  |   6 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |   4 +-
 .../selftests/kvm/lib/x86_64/processor.c      |  31 +
 virt/kvm/kvm_main.c                           |  19 +-
 26 files changed, 780 insertions(+), 226 deletions(-)


base-commit: 19164ad08bf668bca4f4bfbaacaa0a47c1b737a6

Comments

Sean Christopherson April 11, 2022, 5:12 p.m. UTC | #1
On Fri, Apr 01, 2022, David Matlack wrote:
> This series extends KVM's Eager Page Splitting to also split huge pages
> mapped by the shadow MMU, i.e. huge pages present in the memslot rmaps.
> This will be useful for configurations that use Nested Virtualization,
> disable the TDP MMU, or disable/lack TDP hardware support.
> 
> For background on Eager Page Splitting, see:
>  - Proposal: https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com/
>  - TDP MMU support: https://lore.kernel.org/kvm/20220119230739.2234394-1-dmatlack@google.com/
> 
> Splitting huge pages mapped by the shadow MMU is more complicated than
> the TDP MMU, but it is also more important for performance as the shadow
> MMU handles huge page write-protection faults under the write lock.  See
> the Performance section for more details.
> 
> The extra complexity of splitting huge pages mapped by the shadow MMU
> comes from a few places:

I think we should restrict eager page splitting to the TDP MMU being enabled,
i.e. restrict shadow MMU support to nested MMUs.

A decent chunk of the churn and complexity in this series comes from having to
deal with the intersection of things no one cares about in practice (!TDP shadow
paging), and/or things we should be putting into maintenance-only mode (legacy MMU
with TDP enabled).  I see zero reason to support this for legacy shadow paging
without a very concrete, very performance sensitive use case, and the legacy MMU
with TDP should be a hard "no".

With those out of the way, unsync support can also be jettisoned, because barring
use cases I don't know about, hypervisors don't modify TDP entries in the same way
that kernels modify native page tables, i.e. don't benefit from allowing SPTEs to
go unsync.

The other feature that I think we should deprecate (which I'm pretty sure someone on
our team, maybe even you, is planning on proposing upstream) is support for zapping
KVM shadow pages for the shrinker.  In hindsight, we should have done that a few
years ago instead of fixing the bug that made KVM's support meaningful (see commit
ebdb292dac79 ("KVM: x86/mmu: Batch zap MMU pages when shrinking the slab")).  Doing
that for nested MMUs only (or at least first) should be less controversial.

The other thing we want to do sooner than later is improve the scalability of the
nested MMU.  A relatively simple way to pick some juicy low hanging fruit, if we
drop the aforementioned features we don't actually need for nested MMUs, would be
to turn all of the tracking structures needed for handling a page fault into
per-root lists/structures, e.g. active_mmu_pages and mmu_page_hash.  Unless L1 is
doing something funky, there is unlikely to be overlap between nested TDP page
tables, i.e. per-root tracking shouldn't cause a memory explosion.
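
(As a rough illustration only, and not something proposed verbatim, per-root
tracking might mean hanging the lists/structures off each nested TDP root
rather than off struct kvm; all names below are hypothetical.)

struct nested_tdp_root_state {
	spinlock_t lock;
	struct list_head active_mmu_pages;
	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
};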

At that point, as a first step/stopgap toward a more scalable nested MMU implementation,
nested TDP page faults, zapping of obsolete pages (memslot updates), and eager page
splitting (I think) can take mmu_lock for read and then take a per-root spinlock.

At a bare minimum, taking mmu_lock for read would prevent a nested vCPU from blocking
the TDP MMU, which in itself should be a big win.  Zapping after a memslot update
would not interfere at all with re-faulting memory since zapping the obsolete roots
would never get a lock conflict.  And for use cases that spin up a large number of small
L2 VMs, per-root locking will allow KVM to handle page faults for each L2 in parallel,
which could be a huge performance boost for select use cases.

Circling back to eager page splitting, this series could be reworked to take the
first step of forking FNAME(page_fault), FNAME(fetch) and kvm_mmu_get_page() in
order to provide the necessary path for reworking nested MMU page faults.  Then it
can remove unsync and shrinker support for nested MMUs.  With those gone,
dissecting the nested MMU variant of kvm_mmu_get_page() should be simpler/cleaner
than dealing with the existing kvm_mmu_get_page(), i.e. should eliminate at least
some of the complexity/churn.

> Performance
> -----------
> 
> To measure the performance impact of Eager Page Splitting I ran
> dirty_log_perf_test with tdp_mmu=N, various virtual CPU counts, 1GiB per
> vCPU, and backed by 1GiB HugeTLB memory. The amount of memory that was
> written to versus read was controlled with the -f option.
> 
> To measure the impact on customer performance, we can look at the time
> it takes all vCPUs to dirty memory after dirty logging has been enabled.
> Without Eager Page Splitting enabled, such dirtying must take faults to
> split huge pages and bottleneck on the MMU lock.
> 
>              | Config: ept=Y, tdp_mmu=N, 100% writes                   |
>              | Config: ept=Y, tdp_mmu=N, 100% writes                   |
>              | Config: ept=Y, tdp_mmu=N, 100% writes initially-all-set |
>              | Config: ept=Y, tdp_mmu=N, 100% writes initially-all-set |
>              | Config: ept=N, tdp_mmu=Y, 100% writes                   |
>              | Config: ept=N, tdp_mmu=Y, 50% writes                    |
>              | Config: ept=N, tdp_mmu=Y, 5% writes                     |

IMO, to justify this there needs to be performance numbers for ept=Y, tdp_mmu=Y,
i.e. for the use case we actually care about.  I don't expect the outcome to be
any different, but it really should be explicitly tested.
David Matlack April 11, 2022, 5:54 p.m. UTC | #2
On Mon, Apr 11, 2022 at 10:12 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Apr 01, 2022, David Matlack wrote:
> > This series extends KVM's Eager Page Splitting to also split huge pages
> > mapped by the shadow MMU, i.e. huge pages present in the memslot rmaps.
> > This will be useful for configurations that use Nested Virtualization,
> > disable the TDP MMU, or disable/lack TDP hardware support.
> >
> > For background on Eager Page Splitting, see:
> >  - Proposal: https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com/
> >  - TDP MMU support: https://lore.kernel.org/kvm/20220119230739.2234394-1-dmatlack@google.com/
> >
> > Splitting huge pages mapped by the shadow MMU is more complicated than
> > the TDP MMU, but it is also more important for performance as the shadow
> > MMU handles huge page write-protection faults under the write lock.  See
> > the Performance section for more details.
> >
> > The extra complexity of splitting huge pages mapped by the shadow MMU
> > comes from a few places:
>
> I think we should restrict eager page splitting to the TDP MMU being enabled,
> i.e. restrict shadow MMU support to nested MMUs.
>
> A decent chunk of the churn and complexity in this series comes from having to
> deal with the intersection of things no one cares about in practice (!TDP shadow
> paging), and/or things we should be putting into maintenance-only mode (legacy MMU
> with TDP enabled).  I see zero reason to support this for legacy shadow paging
> without a very concrete, very performance sensitive use case, and the legacy MMU
> with TDP should be a hard "no".
>
> With those out of the way, unsync support can also be jettisoned, because barring
> use cases I don't know about, hypervisors don't modify TDP entries in the same way
> that kernels modify native page tables, i.e. don't benefit from allowing SPTEs to
> go unsync.
>
> The other feature that I think we should deprecate (which I'm pretty sure someone on
> our team, maybe even you, is planning on proposing upstream) is support for zapping
> KVM shadow pages for the shrinker.  In hindsight, we should have done that a few
> years ago instead of fixing the bug that made KVM's support meaningful (see commit
> ebdb292dac79 ("KVM: x86/mmu: Batch zap MMU pages when shrinking the slab")).  Doing
> that for nested MMUs only (or at least first) should be less controversial.
>
> The other thing we want to do sooner than later is improve the scalability of the
> nested MMU.  A relatively simple way to pick some juicy low hanging fruit, if we
> drop the aforementioned features we don't actually need for nested MMUs, would be
> to turn all of the tracking structures needed for handling a page fault into
> per-root lists/structures, e.g. active_mmu_pages and mmu_page_hash.  Unless L1 is
> doing something funky, there is unlikely to be overlap between nested TDP page
> tables, i.e. per-root tracking shouldn't cause a memory explosion.
>
> At that point, as a first step/stopgap toward a more scalable nested MMU implementation,
> nested TDP page faults, zapping of obsolete pages (memslot updates), and eager page
> splitting (I think) can take mmu_lock for read and then take a per-root spinlock.
>
> At a bare minimum, taking mmu_lock for read would prevent a nested vCPU from blocking
> the TDP MMU, which in itself should be a big win.  Zapping after a memslot update
> would not interfere at all with re-faulting memory since zapping the obsolete roots
> would never get a lock conflict.  And for use cases that spin up a large number of small
> L2 VMs, per-root locking will allow KVM to handle page faults for each L2 in parallel,
> which could be a huge performance boost for select use cases.
>
> Circling back to eager page splitting, this series could be reworked to take the
> first step of forking FNAME(page_fault), FNAME(fetch) and kvm_mmu_get_page() in
> order to provide the necessary path for reworking nested MMU page faults.  Then it
> can remove unsync and shrinker support for nested MMUs.  With those gone,
> dissecting the nested MMU variant of kvm_mmu_get_page() should be simpler/cleaner
> than dealing with the existing kvm_mmu_get_page(), i.e. should eliminate at least
> some of the complexity/churn.

These sound like useful improvements but I am not really seeing the
value of sequencing them before this series:

 - IMO the "churn" in patches 1-14 is a net improvement to the
existing code. They improve readability by decomposing the shadow page
creation path into smaller functions with better names, reduce the
amount of redundant calculations, and reduce the dependence on struct
kvm_vcpu where it is not needed. Even if eager page splitting is
completely dropped I think they would be useful to merge.

 - Patches 15-21 are necessary complexity to support eager page
splitting, but wouldn't change at all if this splitting was specific
to splitting nested MMUs.

 - Outside of patches 1-14, unsync really doesn't play a role other
than to skip splitting if sp->unsync is true. But as you pointed out
in patch 22, that check can already be dropped since SPs with roles
>4k are never marked unsync.

I'd be fine with limiting eager page splitting to tdp_mmu=Y since
nested is the primary use-case for Google and I agree TDP with the
shadow MMU should be phased out. This would be an artificial
limitation in the short term, but I imagine it would make all those
improvements easier to make down the road.

>
> > Performance
> > -----------
> >
> > To measure the performance impact of Eager Page Splitting I ran
> > dirty_log_perf_test with tdp_mmu=N, various virtual CPU counts, 1GiB per
> > vCPU, and backed by 1GiB HugeTLB memory. The amount of memory that was
> > written to versus read was controlled with the -f option.
> >
> > To measure the impact on customer performance, we can look at the time
> > it takes all vCPUs to dirty memory after dirty logging has been enabled.
> > Without Eager Page Splitting enabled, such dirtying must take faults to
> > split huge pages and bottleneck on the MMU lock.
> >
> >              | Config: ept=Y, tdp_mmu=N, 100% writes                   |
> >              | Config: ept=Y, tdp_mmu=N, 100% writes                   |
> >              | Config: ept=Y, tdp_mmu=N, 100% writes initially-all-set |
> >              | Config: ept=Y, tdp_mmu=N, 100% writes initially-all-set |
> >              | Config: ept=N, tdp_mmu=Y, 100% writes                   |
> >              | Config: ept=N, tdp_mmu=Y, 50% writes                    |
> >              | Config: ept=N, tdp_mmu=Y, 5% writes                     |
>
> IMO, to justify this there needs to be performance numbers for ept=Y, tdp_mmu=Y,
> i.e. for the use case we actually care about.  I don't expect the outcome to be
> any different, but it really should be explicitly tested.

That's a fair request I guess. There should be no difference in
performance from the ept=N results, but it requires a lot more effort to
rig up, which is why I tested this way.

I'll look into collecting some results with nested MMUs. On the plus
side, better selftests support for nested MMUs will be useful as the
various improvements you suggested are implemented.
Sean Christopherson April 11, 2022, 8:12 p.m. UTC | #3
On Mon, Apr 11, 2022, David Matlack wrote:
> On Mon, Apr 11, 2022 at 10:12 AM Sean Christopherson <seanjc@google.com> wrote:
> > Circling back to eager page splitting, this series could be reworked to take the
> > first step of forking FNAME(page_fault), FNAME(fetch) and kvm_mmu_get_page() in
> > order to provide the necessary path for reworking nested MMU page faults.  Then it
> > can remove unsync and shrinker support for nested MMUs.  With those gone,
> > dissecting the nested MMU variant of kvm_mmu_get_page() should be simpler/cleaner
> > than dealing with the existing kvm_mmu_get_page(), i.e. should eliminate at least
> > some of the complexity/churn.
> 
> These sound like useful improvements but I am not really seeing the
> value of sequencing them before this series:
> 
>  - IMO the "churn" in patches 1-14 is a net improvement to the
> existing code. They improve readability by decomposing the shadow page
> creation path into smaller functions with better names, reduce the
> amount of redundant calculations, and reduce the dependence on struct
> kvm_vcpu where it is not needed. Even if eager page splitting is
> completely dropped I think they would be useful to merge.

I definitely like some of patches 1-14, probably most after a few read throughs.
But there are key parts that I do not like that are motivated almost entirely by
the desire to support page splitting.  Specifically, I don't like splitting the
logic of finding a page, and I don't like having a separate alloc vs. initializer
(though I'm guessing this will be needed somewhere to split huge pages for nested
MMUs).

E.g. I'd prefer the "get" flow look like the below (completely untested, for
discussion purposes only).  There's still churn, but the core loop is almost
entirely unchanged.

And it's not just this series; I don't want future improvements to nested TDP to have
to deal with the legacy baggage.

Waaaay off topic, why do we still bother with stat.max_mmu_page_hash_collision?
I assume it was originally added to tune the hashing logic?  At this point is it
anything but wasted cycles?

static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
						     gfn_t gfn,
						     unsigned int gfn_hash,
						     union kvm_mmu_page_role role)
{
	struct kvm *kvm = vcpu->kvm;
	struct hlist_head *sp_list = &kvm->arch.mmu_page_hash[gfn_hash];
	struct kvm_mmu_page *sp;
	LIST_HEAD(invalid_list);
	int collisions = 0;

	for_each_valid_sp(kvm, sp, sp_list) {
		if (sp->gfn != gfn) {
			collisions++;
			continue;
		}

		if (sp->role.word != role.word) {
			/*
			 * If the guest is creating an upper-level page, zap
			 * unsync pages for the same gfn.  While it's possible
			 * the guest is using recursive page tables, in all
			 * likelihood the guest has stopped using the unsync
			 * page and is installing a completely unrelated page.
			 * Unsync pages must not be left as is, because the new
			 * upper-level page will be write-protected.
			 */
			if (role.level > PG_LEVEL_4K && sp->unsync)
				kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list);

			continue;
		}

		/* unsync and write-flooding only apply to indirect SPs. */
		if (sp->role.direct)
			goto out;

		if (sp->unsync) {
			/*
			 * The page is good, but is stale.  kvm_sync_page does
			 * get the latest guest state, but (unlike mmu_unsync_children)
			 * it doesn't write-protect the page or mark it synchronized!
			 * This way the validity of the mapping is ensured, but the
			 * overhead of write protection is not incurred until the
			 * guest invalidates the TLB mapping.  This allows multiple
			 * SPs for a single gfn to be unsync.
			 *
			 * If the sync fails, the page is zapped.  If so, break
			 * in order to rebuild it.
			 */
			if (!kvm_sync_page(vcpu, sp, &invalid_list))
				break;

			WARN_ON(!list_empty(&invalid_list));
			kvm_flush_remote_tlbs(vcpu->kvm);
		}

		__clear_sp_write_flooding_count(sp);
		goto out;
	}

	sp = NULL;

out:
	if (collisions > kvm->stat.max_mmu_page_hash_collisions)
		kvm->stat.max_mmu_page_hash_collisions = collisions;

	kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
	return sp;
}

static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
						      gfn_t gfn,
						      unsigned int gfn_hash,
						      union kvm_mmu_page_role role)
{
	struct kvm *kvm = vcpu->kvm;
	struct kvm_mmu_page *sp = __kvm_mmu_alloc_shadow_page(vcpu, role.direct);
	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
	struct hlist_head *sp_list = &kvm->arch.mmu_page_hash[gfn_hash];

	++kvm->stat.mmu_cache_miss;

	sp->gfn = gfn;
	sp->role = role;
	sp->mmu_valid_gen = kvm->arch.mmu_valid_gen;

	/*
	 * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
	 * depends on valid pages being added to the head of the list.  See
	 * comments in kvm_zap_obsolete_pages().
	 */
	list_add(&sp->link, &kvm->arch.active_mmu_pages);
	kvm_mod_used_mmu_pages(kvm, 1);

	hlist_add_head(&sp->hash_link, sp_list);

	if (!role.direct)
		account_shadowed(kvm, slot, sp);

	return sp;
}


static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
						    gfn_t gfn,
						    union kvm_mmu_page_role role)
{
	unsigned int gfn_hash = kvm_page_table_hashfn(gfn);
	struct kvm_mmu_page *sp;
	bool created = false;

	sp = kvm_mmu_find_shadow_page(vcpu, gfn, gfn_hash, role);
	if (!sp) {
		created = true;
		sp = kvm_mmu_alloc_shadow_page(vcpu, gfn, gfn_hash, role);
	}

	trace_kvm_mmu_get_page(sp, created);
	return sp;
}
David Matlack April 11, 2022, 11:41 p.m. UTC | #4
On Mon, Apr 11, 2022 at 1:12 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Apr 11, 2022, David Matlack wrote:
> > On Mon, Apr 11, 2022 at 10:12 AM Sean Christopherson <seanjc@google.com> wrote:
> > > Circling back to eager page splitting, this series could be reworked to take the
> > > first step of forking FNAME(page_fault), FNAME(fetch) and kvm_mmu_get_page() in
> > > order to provide the necessary path for reworking nested MMU page faults.  Then it
> > > can remove unsync and shrinker support for nested MMUs.  With those gone,
> > > dissecting the nested MMU variant of kvm_mmu_get_page() should be simpler/cleaner
> > > than dealing with the existing kvm_mmu_get_page(), i.e. should eliminate at least
> > > some of the complexity/churn.
> >
> > These sound like useful improvements but I am not really seeing the
> > value of sequencing them before this series:
> >
> >  - IMO the "churn" in patches 1-14 is a net improvement to the
> > existing code. They improve readability by decomposing the shadow page
> > creation path into smaller functions with better names, reduce the
> > amount of redundant calculations, and reduce the dependence on struct
> > kvm_vcpu where it is not needed. Even if eager page splitting is
> > completely dropped I think they would be useful to merge.
>
> I definitely like some of patches 1-14, probably most after a few read throughs.
> But there are key parts that I do not like that are motivated almost entirely by
> the desire to support page splitting.  Specifically, I don't like splitting the
> logic of finding a page, and I don't like having a separate alloc vs. initializer
> (though I'm guessing this will be needed somewhere to split huge pages for nested
> MMUs).
>
> E.g. I'd prefer the "get" flow look like the below (completely untested, for
> discussion purposes only).  There's still churn, but the core loop is almost
> entirely unchanged.
>
> And it's not just this series; I don't want future improvements to nested TDP to have
> to deal with the legacy baggage.

One thing that would be helpful is if you can explain in a bit more
specifically what you'd like to see. Part of the reason why I prefer
to sequence your proposal after eager page splitting is that I do not
fully understand what you're proposing, and how complex it would be.
e.g. Forking FNAME(fetch), FNAME(page_fault), and kvm_mmu_get_page()
for nested MMUs does not sound like less churn.

From my perspective, this series is a net improvement to the
readability and maintainability of existing code, while adding a
performance improvement (eager page splitting). All of the changes you
are proposing can still be implemented on top if and when they become
a priority (including merging {,__}kvm_find_shadow_page()). And if we
limit eager page splitting to nested MMUs, we don't have to worry
about maintaining eager page splitting with TDP shadow MMU or legacy
shadow paging over time.


>
> Waaaay off topic, why do we still bother with stat.max_mmu_page_hash_collision?
> I assume it was originally added to tune the hashing logic?  At this point is it
> anything but wasted cycles?
>
> static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
>                                                      gfn_t gfn,
>                                                      unsigned int gfn_hash,
>                                                      union kvm_mmu_page_role role)
> {
>         struct hlist_head *sp_list = &kvm->arch.mmu_page_hash[gfn_hash];
>         struct kvm_mmu_page *sp;
>         LIST_HEAD(invalid_list);
>
>         int collisions = 0;
>
>         for_each_valid_sp(kvm, sp, sp_list) {
>                 if (sp->gfn != gfn) {
>                         collisions++;
>                         continue;
>                 }
>
>                 if (sp->role.word != role.word) {
>                         /*
>                          * If the guest is creating an upper-level page, zap
>                          * unsync pages for the same gfn.  While it's possible
>                          * the guest is using recursive page tables, in all
>                          * likelihood the guest has stopped using the unsync
>                          * page and is installing a completely unrelated page.
>                          * Unsync pages must not be left as is, because the new
>                          * upper-level page will be write-protected.
>                          */
>                         if (role.level > PG_LEVEL_4K && sp->unsync)
>                                 kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list);
>
>                         continue;
>                 }
>
>                 /* unsync and write-flooding only apply to indirect SPs. */
>                 if (sp->role.direct)
>                         goto out;
>
>                 if (sp->unsync) {
>                         /*
>                          * The page is good, but is stale.  kvm_sync_page does
>                          * get the latest guest state, but (unlike mmu_unsync_children)
>                          * it doesn't write-protect the page or mark it synchronized!
>                          * This way the validity of the mapping is ensured, but the
>                          * overhead of write protection is not incurred until the
>                          * guest invalidates the TLB mapping.  This allows multiple
>                          * SPs for a single gfn to be unsync.
>                          *
>                          * If the sync fails, the page is zapped.  If so, break
>                          * in order to rebuild it.
>                          */
>                         if (!kvm_sync_page(vcpu, sp, &invalid_list))
>                                 break;
>
>                         WARN_ON(!list_empty(&invalid_list));
>                         kvm_flush_remote_tlbs(vcpu->kvm);
>                 }
>
>                 __clear_sp_write_flooding_count(sp);
>                 goto out;
>         }
>
>         sp = NULL;
>
> out:
>         if (collisions > kvm->stat.max_mmu_page_hash_collisions)
>                 kvm->stat.max_mmu_page_hash_collisions = collisions;
>
>         kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
>         return sp;
> }
>
> static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
>                                                       gfn_t gfn,
>                                                       unsigned int gfn_hash,
>                                                       union kvm_mmu_page_role role)
> {
>         struct kvm_mmu_page *sp = __kvm_mmu_alloc_shadow_page(vcpu, role.direct);
>         struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>         struct hlist_head *sp_list = &kvm->arch.mmu_page_hash[gfn_hash];
>
>         ++kvm->stat.mmu_cache_miss;
>
>         sp->gfn = gfn;
>         sp->role = role;
>         sp->mmu_valid_gen = kvm->arch.mmu_valid_gen;
>
>         /*
>          * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
>          * depends on valid pages being added to the head of the list.  See
>          * comments in kvm_zap_obsolete_pages().
>          */
>         list_add(&sp->link, &kvm->arch.active_mmu_pages);
>         kvm_mod_used_mmu_pages(kvm, 1);
>
>         sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
>         hlist_add_head(&sp->hash_link, sp_list);
>
>         if (!role.direct)
>                 account_shadowed(kvm, slot, sp);
> }
>
>
> static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
>                                                     gfn_t gfn,
>                                                     union kvm_mmu_page_role role)
> {
>         unsigned int gfn_hash = kvm_page_table_hashfn(gfn);
>         struct kvm_mmu_page *sp;
>         bool created = false;
>
>         sp = kvm_mmu_find_shadow_page(vcpu, gfn, gfn_hash, role);
>         if (!sp) {
>                 created = true;
>                 sp = kvm_mmu_alloc_shadow_page(vcpu, gfn, gfn_hash, role);
>         }
>
>         trace_kvm_mmu_get_page(sp, created);
>         return sp;
> }
Sean Christopherson April 12, 2022, 12:39 a.m. UTC | #5
On Mon, Apr 11, 2022, David Matlack wrote:
> On Mon, Apr 11, 2022 at 1:12 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Apr 11, 2022, David Matlack wrote:
> > > On Mon, Apr 11, 2022 at 10:12 AM Sean Christopherson <seanjc@google.com> wrote:
> > > > Circling back to eager page splitting, this series could be reworked to take the
> > > > first step of forking FNAME(page_fault), FNAME(fetch) and kvm_mmu_get_page() in
> > > > order to provide the necessary path for reworking nested MMU page faults.  Then it
> > > > can remove unsync and shrinker support for nested MMUs.  With those gone,
> > > > dissecting the nested MMU variant of kvm_mmu_get_page() should be simpler/cleaner
> > > > than dealing with the existing kvm_mmu_get_page(), i.e. should eliminate at least
> > > > some of the complexity/churn.
> > >
> > > These sound like useful improvements but I am not really seeing the
> > > value of sequencing them before this series:
> > >
> > >  - IMO the "churn" in patches 1-14 is a net improvement to the
> > > existing code. They improve readability by decomposing the shadow page
> > > creation path into smaller functions with better names, reduce the
> > > amount of redundant calculations, and reduce the dependence on struct
> > > kvm_vcpu where it is not needed. Even if eager page splitting is
> > > completely dropped I think they would be useful to merge.
> >
> > I definitely like some of patches 1-14, probably most after a few read throughs.
> > But there are key parts that I do not like that are motivated almost entirely by
> > the desire to support page splitting.  Specifically, I don't like splitting the
> > logic of finding a page, and I don't like having a separate alloc vs. initializer
> > (though I'm guessing this will be needed somewhere to split huge pages for nested
> > MMUs).
> >
> > E.g. I'd prefer the "get" flow look like the below (completely untested, for
> > discussion purposes only).  There's still churn, but the core loop is almost
> > entirely unchanged.
> >
> > And it's not just this series; I don't want future improvements to nested TDP to have
> > to deal with the legacy baggage.
> 
> One thing that would be helpful is if you can explain in a bit more
> specifically what you'd like to see. Part of the reason why I prefer
> to sequence your proposal after eager page splitting is that I do not
> fully understand what you're proposing, and how complex it would be.
> e.g. Forking FNAME(fetch), FNAME(page_fault), and kvm_mmu_get_page()
> for nested MMUs does not sound like less churn.

Oh, it's most definitely not less code, and probably more churn.  But, it's churn
that pushes us in a more favorable direction and that is desirable long term.  I
don't mind churning code, but I want the churn to make future life easier, not
harder.  Details below.

> From my perspective, this series is a net improvement to the
> readability and maintainability of existing code, while adding a
> performance improvement (eager page splitting). All of the changes you
> are proposing can still be implemented on top if

They can be implemented on top, but I want to avoid inheriting complexity we
don't actually want/need, unsync support being the most notable.

What I mean by "fork" is that after the cleanups that make sense irrespective of
eager page splitting, we make a copy of FNAME(page_fault) and add FNAME(get_shadow_page),
extracting common logic where we can and probably doing something fancy to avoid
having multiple copies of FNAME(get_shadow_page).  Looking again at the code, it's
probably best to keep FNAME(fetch), at least for now, as it's only the single unsync
check that we can purge at this point.

That gives us e.g. FNAME(nested_page_fault) that supports EPT and 64-bit NPT, and
a nested TDP specific get_shadow_page().

Then we rip out the unsync stuff for nested MMUs, which is quite clean because we
can key off of tdp_enabled.  It'll leave dead code for 32-bit hosts running nested
VMs, but I highly doubt anyone will notice the perf hit.

At that point, dissect kvm_nested_mmu_get_page() for eager page splitting and
continue on.

It's not drastically different than what you have now, but it avoids the nastiness
around unsync pages, e.g. I'm pretty sure kvm_mmu_alloc_shadow_page() can be reused
as I proposed and the "find" becomes something like:

static struct kvm_mmu_page *kvm_mmu_nested_tdp_find_sp(struct kvm_vcpu *vcpu,
						       gfn_t gfn,
						       unsigned int gfn_hash,
						       union kvm_mmu_page_role role)
{
	struct hlist_head *sp_list = &kvm->arch.mmu_page_hash[gfn_hash];
	struct kvm_mmu_page *sp;

	for_each_valid_sp(kvm, sp, sp_list) {
		if (sp->gfn != gfn || sp->role.word != role.word)
			continue;

		__clear_sp_write_flooding_count(sp);
		return sp;
	}

	return NULL;
}

Having the separate page fault and get_shadow_page(), without the baggage of unsync
in particular, sets us up for switching to taking mmu_lock for read, and in the
distant future, implementing whatever new scheme someone concocts for shadowing
nested TDP.
David Matlack April 12, 2022, 4:49 p.m. UTC | #6
On Mon, Apr 11, 2022 at 5:39 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Apr 11, 2022, David Matlack wrote:
> >
> > One thing that would be helpful is if you can explain in a bit more
> > specifically what you'd like to see. Part of the reason why I prefer
> > to sequence your proposal after eager page splitting is that I do not
> > fully understand what you're proposing, and how complex it would be.
> > e.g. Forking FNAME(fetch), FNAME(page_fault), and kvm_mmu_get_page()
> > for nested MMUs does not sound like less churn.
>
> Oh, it's most definitely not less code, and probably more churn.  But, it's churn
> that pushes us in a more favorable direction and that is desirable long term.  I
> don't mind churning code, but I want the churn to make future life easier, not
> harder.  Details below.

Of course. Let's make sure we're on the same page about what churn
introduced by this series will make future life harder that we hope to
avoid. If I understand you correctly, it's the following 2 changes:

 (a.) Using separate functions to allocate SPs and initialize SPs.
 (b.) Separating kvm_mmu_find_shadow_page() from __kvm_mmu_find_shadow_page().

(a.) stems from the fact that SP allocation during eager page
splitting is made directly rather than through kvm_mmu_memory_caches,
which was what you pushed for in the TDP MMU implementation. We could
instead use kvm_mmu_memory_caches for the shadow MMU eager page
splitting to eliminate (a.). But otherwise (a.) is necessary
complexity of eager page splitting because it needs to allocate SPs
differently from the vCPU fault path.

As for (b.), see below...

>
> > From my perspective, this series is a net improvement to the
> > readability and maintainability of existing code, while adding a
> > performance improvement (eager page splitting). All of the changes you
> > are proposing can still be implemented on top if
>
> They can be implemented on top, but I want to avoid inheriting complexity we
> don't actually want/need, unsync support being the most notable.
>
> What I mean by "fork" is that after the cleanups that make sense irrespective of
> eager page splitting, we make a copy of FNAME(page_fault) and add FNAME(get_shadow_page),
> extracting common logic where we can and probably doing something fancy to avoid
> having multiple copies of FNAME(get_shadow_page).  Looking again at the code, it's
> probably best to keep FNAME(fetch), at least for now, as it's only the single unsync
> check that we can purge at this point.
>
> That gives us e.g. FNAME(nested_page_fault) that supports EPT and 64-bit NPT, and
> a nested TDP specific get_shadow_page().
>
> Then we rip out the unsync stuff for nested MMUs, which is quite clean because we
> can key off of tdp_enabled.  It'll leave dead code for 32-bit hosts running nested
> VMs, but I highly doubt anyone will notice the perf hit.
>
> At that point, dissect kvm_nested_mmu_get_page() for eager page splitting and
> continue on.
>
> It's not drastically different than what you have now, but it avoids the nastiness
> around unsync pages, e.g. I'm pretty sure kvm_mmu_alloc_shadow_page() can be reused
> as I proposed and the "find" becomes something like:
>
> static struct kvm_mmu_page *kvm_mmu_nested_tdp_find_sp(struct kvm_vcpu *vcpu,
>                                                        gfn_t gfn,
>                                                        unsigned int gfn_hash,
>                                                        union kvm_mmu_page_role role)
> {
>         struct hlist_head *sp_list = &kvm->arch.mmu_page_hash[gfn_hash];
>         struct kvm_mmu_page *sp;
>
>         for_each_valid_sp(kvm, sp, sp_list) {
>                 if (sp->gfn != gfn || sp->role.word != role.word)
>                         continue;
>
>                 __clear_sp_write_flooding_count(sp);
>                 return sp;
>         }
>
>         return NULL;
> }

IIUC all of this would be to avoid separating
kvm_mmu_find_shadow_page() from __kvm_mmu_find_shadow_page() correct?
i.e. Nested MMUs would have their own "find" function, which is called
by eager page splitting, and thus no separate
__kvm_mmu_find_shadow_page().

But __kvm_mmu_find_shadow_page(), as implemented in this series, is
about 90% similar to what you proposed for
kvm_mmu_nested_tdp_find_sp(). And in fact it would work correctly to
use __kvm_mmu_find_shadow_page() for nested MMUs, since we know the
sp->unsync condition would just be skipped.

So even if we did everything you proposed (which seems like an awful
lot just to avoid __kvm_mmu_find_shadow_page()), there's a chance we
would still end up with the exact same code. i.e.
kvm_mmu_nested_tdp_find_sp() would be implemented by calling
__kvm_mmu_find_shadow_page(), because it would be a waste to
re-implement an almost identical function?

>
> Having the separate page fault and get_shadow_page(), without the baggage of unsync
> in particular, sets us up for switching to taking mmu_lock for read, and in the
> distant future, implementing whatever new scheme someone concocts for shadowing
> nested TDP.

Taking MMU read lock with per-root spinlocks for nested MMUs is a
great idea btw. I think it would be a great improvement.
Sean Christopherson April 13, 2022, 1:02 a.m. UTC | #7
On Tue, Apr 12, 2022, David Matlack wrote:
> On Mon, Apr 11, 2022 at 5:39 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Apr 11, 2022, David Matlack wrote:
> > >
> > > One thing that would be helpful is if you can explain in a bit more
> > > specifically what you'd like to see. Part of the reason why I prefer
> > > to sequence your proposal after eager page splitting is that I do not
> > > fully understand what you're proposing, and how complex it would be.
> > > e.g. Forking FNAME(fetch), FNAME(page_fault), and kvm_mmu_get_page()
> > > for nested MMUs does not sound like less churn.
> >
> > Oh, it's most definitely not less code, and probably more churn.  But, it's churn
> > that pushes us in a more favorable direction and that is desirable long term.  I
> > don't mind churning code, but I want the churn to make future life easier, not
> > harder.  Details below.
> 
> Of course. Let's make sure we're on the same page about what churn
> introduced by this series will make future life harder that we hope to
> avoid. If I understand you correctly, it's the following 2 changes:
> 
>  (a.) Using separate functions to allocate SPs and initialize SPs.
>  (b.) Separating kvm_mmu_find_shadow_page() from __kvm_mmu_find_shadow_page().
> 
> (a.) stems from the fact that SP allocation during eager page
> splitting is made directly rather than through kvm_mmu_memory_caches,
> which was what you pushed for in the TDP MMU implementation. We could
> instead use kvm_mmu_memory_caches for the shadow MMU eager page

...

> So even if we did everything you proposed (which seems like an awful
> lot just to avoid __kvm_mmu_find_shadow_page()), there's a chance we
> would still end up with the exact same code. i.e.
> kvm_mmu_nested_tdp_find_sp() would be implemented by calling
> __kvm_mmu_find_shadow_page(), because it would be a waste to
> re-implement an almost identical function?

I went far enough down this path to know that my idea isn't completely awful,
and wouldn't actually need to fork FNAME(page_fault) at this time, but sadly I
still dislike the end result.

Your assessment that we'd still end up with very similar (if not quite exact)
code is spot on.  Ditto for your other assertion in (a) about using the caches.

My vote for this series is to go the cache route, e.g. wrap kvm_mmu_memory_caches
in a struct and pass that into kvm_mmu_get_page().  I still think it was the right
call to ignore the caches for the TDP MMU, it gives the TDP MMU more flexibility
and it was trivial to bypass the caches since the TDP MMU was doing its own thing
anyways.

But for the shadow MMU, IMO the cons outweigh the pros.  E.g. in addition to
ending up with two similar but subtly different "get page" flows, passing around
"struct kvm_mmu_page **spp" is a bit unpleasant.  Ditto for having a partially
initialized kvm_mmu_page.  The split code also ends up in a weird state where it
uses the caches for the pte_list, but not the other allocations.
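
Roughly the shape I have in mind (struct and field names below are purely
illustrative, not a concrete proposal):

	struct shadow_page_caches {
		struct kvm_mmu_memory_cache *page_header_cache;
		struct kvm_mmu_memory_cache *shadow_page_cache;
		struct kvm_mmu_memory_cache *gfn_array_cache;
	};

	static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
						     struct shadow_page_caches *caches,
						     gfn_t gfn,
						     union kvm_mmu_page_role role);

The vCPU fault path would pass its per-vCPU caches, and eager page splitting
would pass per-VM caches that it tops up before walking the rmaps.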

There will be one wart due to unsync pages needing @vcpu, but we can pass in NULL
for the split case and assert that @vcpu is non-null since all of the children
should be direct.

		if (sp->unsync) {
			if (WARN_ON_ONCE(!vcpu)) {
				kvm_mmu_prepare_zap_page(kvm, sp,
							 &invalid_list);
				continue;
			}

			/*
			 * The page is good, but is stale.  kvm_sync_page does
			 * get the latest guest state, but (unlike mmu_unsync_children)
			 * it doesn't write-protect the page or mark it synchronized!
			 * This way the validity of the mapping is ensured, but the
			 * overhead of write protection is not incurred until the
			 * guest invalidates the TLB mapping.  This allows multiple
			 * SPs for a single gfn to be unsync.
			 *
			 * If the sync fails, the page is zapped.  If so, break
			 * in order to rebuild it.
			 */
			if (!kvm_sync_page(vcpu, sp, &invalid_list))
				break;

			WARN_ON(!list_empty(&invalid_list));
			kvm_flush_remote_tlbs(kvm);
		}
David Matlack April 13, 2022, 5:57 p.m. UTC | #8
On Wed, Apr 13, 2022 at 01:02:51AM +0000, Sean Christopherson wrote:
> On Tue, Apr 12, 2022, David Matlack wrote:
> > On Mon, Apr 11, 2022 at 5:39 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Mon, Apr 11, 2022, David Matlack wrote:
> > > >
> > > > One thing that would be helpful is if you can explain in a bit more
> > > > specifically what you'd like to see. Part of the reason why I prefer
> > > > to sequence your proposal after eager page splitting is that I do not
> > > > fully understand what you're proposing, and how complex it would be.
> > > > e.g. Forking FNAME(fetch), FNAME(page_fault), and kvm_mmu_get_page()
> > > > for nested MMUs does not sound like less churn.
> > >
> > > Oh, it's most definitely not less code, and probably more churn.  But, it's churn
> > > that pushes us in a more favorable direction and that is desirable long term.  I
> > > don't mind churning code, but I want the churn to make future life easier, not
> > > harder.  Details below.
> > 
> > Of course. Let's make sure we're on the same page about what churn
> > introduced by this series will make future life harder that we hope to
> > avoid. If I understand you correctly, it's the following 2 changes:
> > 
> >  (a.) Using separate functions to allocate SPs and initialize SPs.
> >  (b.) Separating kvm_mmu_find_shadow_page() from __kvm_mmu_find_shadow_page().
> > 
> > (a.) stems from the fact that SP allocation during eager page
> > splitting is made directly rather than through kvm_mmu_memory_caches,
> > which was what you pushed for in the TDP MMU implementation. We could
> > instead use kvm_mmu_memory_caches for the shadow MMU eager page
> 
> ...
> 
> > So even if we did everything you proposed (which seems like an awful
> > lot just to avoid __kvm_mmu_find_shadow_page()), there's a chance we
> > would still end up with the exact same code. i.e.
> > kvm_mmu_nested_tdp_find_sp() would be implemented by calling
> > __kvm_mmu_find_shadow_page(), because it would be a waste to
> > re-implement an almost identical function?
> 
> I went far enough down this path to know that my idea isn't completely awful,
> and wouldn't actually need to fork FNAME(page_fault) at this time, but sadly I
> still dislike the end result.

Thanks for looking into it so quickly so we could figure out a path
forward.

> 
> Your assessment that we'd still end up with very similar (if not quite exact)
> code is spot on.  Ditto for your other assertion in (a) about using the caches.
> 
> My vote for this series is to go the cache route, e.g. wrap kvm_mmu_memory_caches
> in a struct and pass that into kvm_mmu_get_page().  I still think it was the right
> call to ignore the caches for the TDP MMU, it gives the TDP MMU more flexibility
> and it was trivial to bypass the caches since the TDP MMU was doing its own thing
> anyways.
> 
> But for the shadow MMU, IMO the cons outweigh the pros.  E.g. in addition to
> ending up with two similar but subtly different "get page" flows, passing around
> "struct kvm_mmu_page **spp" is a bit unpleasant.  Ditto for having a partially
> initialized kvm_mmu_page.  The split code also ends up in a weird state where it
> uses the caches for the pte_list, but not the other allocations.

Sounds good. I will rework the series to use kvm_mmu_memory_cache
structs for the SP allocation during eager page splitting. That will
eliminate the separate allocation and initialization which will be a
nice cleanup. And it will be great to get rid of the spp crud.
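
As a rough sketch of what the split side could look like (the per-VM cache
names here are placeholders, not the final naming):

	static int topup_eager_split_caches(struct kvm *kvm)
	{
		int r;

		r = kvm_mmu_topup_memory_cache(&kvm->arch.split_page_header_cache, 1);
		if (r)
			return r;

		r = kvm_mmu_topup_memory_cache(&kvm->arch.split_shadow_page_cache, 1);
		if (r)
			return r;

		/* pte_list_desc entries for rmapping the new lower-level SPTEs. */
		return kvm_mmu_topup_memory_cache(&kvm->arch.split_desc_cache, 1);
	}

The caches would presumably be topped up outside the MMU lock and then
consumed by the common "get page" path when a huge page is actually split.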

And per your earlier feedback, I will also limit eager page splitting to
nested MMUs.

> 
> There will be one wart due to unsync pages needing @vcpu, but we can pass in NULL
> for the split case and assert that @vcpu is non-null since all of the children
> should be direct.

The NULL vcpu check will be a little gross, but it should never trigger
in practice since eager page splitting always requests direct SPs. My
preference has been to enforce that in code by splitting out
__kvm_mmu_find_shadow_page(), but I can see the advantage of your
proposal is that eager page splitting and faults will go through the
exact same code path to get a kvm_mmu_page.

> 
> 		if (sp->unsync) {
> 			if (WARN_ON_ONCE(!vcpu)) {
> 				kvm_mmu_prepare_zap_page(kvm, sp,
> 							 &invalid_list);
> 				continue;
> 			}
> 
> 			/*
> 			 * The page is good, but is stale.  kvm_sync_page does
> 			 * get the latest guest state, but (unlike mmu_unsync_children)
> 			 * it doesn't write-protect the page or mark it synchronized!
> 			 * This way the validity of the mapping is ensured, but the
> 			 * overhead of write protection is not incurred until the
> 			 * guest invalidates the TLB mapping.  This allows multiple
> 			 * SPs for a single gfn to be unsync.
> 			 *
> 			 * If the sync fails, the page is zapped.  If so, break
> 			 * in order to rebuild it.
> 			 */
> 			if (!kvm_sync_page(vcpu, sp, &invalid_list))
> 				break;
> 
> 			WARN_ON(!list_empty(&invalid_list));
> 			kvm_flush_remote_tlbs(kvm);
> 		}
Sean Christopherson April 13, 2022, 6:28 p.m. UTC | #9
On Wed, Apr 13, 2022, David Matlack wrote:
> On Wed, Apr 13, 2022 at 01:02:51AM +0000, Sean Christopherson wrote:
> > There will be one wart due to unsync pages needing @vcpu, but we can pass in NULL
> > for the split case and assert that @vcpu is non-null since all of the children
> > should be direct.
> 
> The NULL vcpu check will be a little gross,

Yeah, I would even call it a lot gross :-)

> but it should never trigger in practice since eager page splitting always
> requests direct SPs. My preference has been to enforce that in code by
> splitting out

It still is enforced in code, just at different points.  The split version WARNs
and continues after finding a page, whereas the below WARNs and rejects _while_
finding the page.

Speaking of WARNs, that reminds me... it might be worth adding a WARN in
kvm_mmu_get_child_sp() to document (and detect, but more to document) that @direct
should never encounter a page with unsync or unsync_children, e.g.

	union kvm_mmu_page_role role;
	struct kvm_mmu_page *sp;

	role = kvm_mmu_child_role(sptep, direct, access);
	sp = kvm_mmu_get_page(vcpu, gfn, role);

	/* Comment goes here about direct pages in shadow MMUs? */
	WARN_ON(direct && (sp->unsync || sp->unsync_children));
	return sp;

The indirect walk of FNAME(fetch)() handles unsync_children, but none of the other
callers do.  Obviously shouldn't happen, but especially in the huge page split
case it took me a second to understand exactly why it can't happen.

> __kvm_mmu_find_shadow_page(), but I can see the advantage of your
> proposal is that eager page splitting and faults will go through the
> exact same code path to get a kvm_mmu_page.
> 
> > 
> > 		if (sp->unsync) {
> > 			if (WARN_ON_ONCE(!vcpu)) {
> > 				kvm_mmu_prepare_zap_page(kvm, sp,
> > 							 &invalid_list);
> > 				continue;
> > 			}
> > 
> > 			/*
> > 			 * The page is good, but is stale.  kvm_sync_page does
> > 			 * get the latest guest state, but (unlike mmu_unsync_children)
> > 			 * it doesn't write-protect the page or mark it synchronized!
> > 			 * This way the validity of the mapping is ensured, but the
> > 			 * overhead of write protection is not incurred until the
> > 			 * guest invalidates the TLB mapping.  This allows multiple
> > 			 * SPs for a single gfn to be unsync.
> > 			 *
> > 			 * If the sync fails, the page is zapped.  If so, break
> > 			 * in order to rebuild it.
> > 			 */
> > 			if (!kvm_sync_page(vcpu, sp, &invalid_list))
> > 				break;
> > 
> > 			WARN_ON(!list_empty(&invalid_list));
> > 			kvm_flush_remote_tlbs(kvm);
> > 		}
David Matlack April 13, 2022, 9:22 p.m. UTC | #10
On Wed, Apr 13, 2022 at 11:28 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Apr 13, 2022, David Matlack wrote:
> > On Wed, Apr 13, 2022 at 01:02:51AM +0000, Sean Christopherson wrote:
> > > There will be one wart due to unsync pages needing @vcpu, but we can pass in NULL
> > > for the split case and assert that @vcpu is non-null since all of the children
> > > should be direct.
> >
> > The NULL vcpu check will be a little gross,
>
> Yeah, I would even call it a lot gross :-)
>
> > but it should never trigger in practice since eager page splitting always
> > requests direct SPs. My preference has been to enforce that in code by
> > splitting out
>
> It still is enforced in code, just at different points.  The split version WARNs
> and continues after finding a page, the below WARNs and rejects _while_ finding
> the page.
>
> Speaking of WARNs, that reminds me... it might be worth adding a WARN in
> kvm_mmu_get_child_sp() to document (and detect, but more to document) that @direct
> should never encounter a page with unsync or unsync_children, e.g.
>
>         union kvm_mmu_page_role role;
>         struct kvm_mmu_page *sp;
>
>         role = kvm_mmu_child_role(sptep, direct, access);
>         sp = kvm_mmu_get_page(vcpu, gfn, role);
>
>         /* Comment goes here about direct pages in shadow MMUs? */
>         WARN_ON(direct && (sp->unsync || sp->unsync_children));
>         return sp;
>
> The indirect walk of FNAME(fetch)() handles unsync_children, but none of the other
> callers do.  Obviously shouldn't happen, but especially in the huge page split
> case it took me a second to understand exactly why it can't happen.

Will do.

>
> > __kvm_mmu_find_shadow_page(), but I can see the advantage of your
> > proposal is that eager page splitting and faults will go through the
> > exact same code path to get a kvm_mmu_page.
> >
> > >
> > >             if (sp->unsync) {
> > >                     if (WARN_ON_ONCE(!vcpu)) {
> > >                             kvm_mmu_prepare_zap_page(kvm, sp,
> > >                                                      &invalid_list);
> > >                             continue;
> > >                     }
> > >
> > >                     /*
> > >                      * The page is good, but is stale.  kvm_sync_page does
> > >                      * get the latest guest state, but (unlike mmu_unsync_children)
> > >                      * it doesn't write-protect the page or mark it synchronized!
> > >                      * This way the validity of the mapping is ensured, but the
> > >                      * overhead of write protection is not incurred until the
> > >                      * guest invalidates the TLB mapping.  This allows multiple
> > >                      * SPs for a single gfn to be unsync.
> > >                      *
> > >                      * If the sync fails, the page is zapped.  If so, break
> > >                      * in order to rebuild it.
> > >                      */
> > >                     if (!kvm_sync_page(vcpu, sp, &invalid_list))
> > >                             break;
> > >
> > >                     WARN_ON(!list_empty(&invalid_list));
> > >                     kvm_flush_remote_tlbs(kvm);
> > >             }