[RFC,00/17] KVM: arm64: Parallelize stage 2 fault handling

Message ID 20220415215901.1737897-1-oupton@google.com (mailing list archive)

Message

Oliver Upton April 15, 2022, 9:58 p.m. UTC
Presently KVM only takes a read lock for stage 2 faults if it believes
the fault can be fixed by relaxing permissions on a PTE (write unprotect
for dirty logging). Otherwise, stage 2 faults grab the write lock, which
predictably can pile up all the vCPUs in a sufficiently large VM.

The x86 port of KVM has what it calls the TDP MMU. Basically, it is an
MMU protected by the combination of a read-write lock and RCU, allowing
page walkers to traverse in parallel.

This series is strongly inspired by the mechanics of the TDP MMU,
making use of RCU to protect parallel walks. Note that the TLB
invalidation mechanics are a bit different between x86 and ARM, so we
need to use the 'break-before-make' sequence to split/collapse a
block/table mapping, respectively.

Nonetheless, using atomics on the break side allows fault handlers to
acquire exclusive access to a PTE (let's just call it locked). Once the
PTE lock is acquired, the walker can safely assume exclusive access to
the mapping.
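
To make that concrete, here is a rough sketch of the break side (the
marker value and helper names are invented for illustration and don't
match the patches exactly):

/* Hypothetical encoding: an invalid PTE with a software bit set. */
#define STAGE2_PTE_LOCKED	BIT(58)

static bool stage2_try_break_pte(kvm_pte_t *ptep, kvm_pte_t old,
				 struct kvm_s2_mmu *mmu, u64 addr, u32 level)
{
	/*
	 * cmpxchg gives this walker exclusive ownership of the PTE; any
	 * concurrent walker that observes the locked marker backs off
	 * and retries.
	 */
	if (cmpxchg(ptep, old, STAGE2_PTE_LOCKED) != old)
		return false;

	/* Break-before-make: invalidate the old translation first. */
	kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, addr, level);
	return true;
}

static void stage2_make_pte(kvm_pte_t *ptep, kvm_pte_t new)
{
	/* 'Make': publish the new mapping only after the TLBI. */
	smp_store_release(ptep, new);
}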

Special consideration is required when pruning the page tables in
parallel. Suppose we are collapsing a table into a block. Allowing
parallel faults means that a software walker could be in the middle of
a lower level traversal when the table is unlinked. Table
walkers that prune the paging structures must now 'lock' all descendant
PTEs, effectively asserting exclusive ownership of the substructure
(no other walker can install something to an already locked PTE).
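
Roughly, with made-up names and the conflict handling glossed over (the
series implements this as kvm_pgtable walker callbacks and reuses the
existing pgtable.c helpers):

static void stage2_lock_unlinked_table(kvm_pte_t *table, u32 level,
				       struct kvm_pgtable_mm_ops *mm_ops)
{
	int i;

	for (i = 0; i < PAGE_SIZE / sizeof(kvm_pte_t); i++) {
		/*
		 * Assert exclusive ownership of the PTE; any walker still
		 * traversing underneath us sees the locked marker and
		 * backs off.
		 */
		kvm_pte_t old = xchg(&table[i], STAGE2_PTE_LOCKED);

		/* Guide the recursion by what got yanked from the PTE. */
		if (kvm_pte_valid(old) && kvm_pte_table(old, level))
			stage2_lock_unlinked_table(kvm_pte_follow(old, mm_ops),
						   level + 1, mm_ops);
	}
}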

Additionally, for parallel walks we need to punt the freeing of table
pages to the next RCU sync, as there could be multiple observers of the
table until all walkers exit the RCU critical section. For this I
decided to cram an rcu_head into page private data for every table page.
We wind up spending a bit more memory on table pages now, but lazily
allocating rcu_heads probably doesn't make a lot of sense. Not only
would we need a large cache of them (think about installing a level 1
block) to wire up callbacks on all descendant tables, but we also then
need to
spend memory to actually free memory.
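
Shape-wise it ends up looking something like this (names invented here;
patches 13-14 do the real wiring):

static struct kmem_cache *stage2_page_header_cache;

struct stage2_page_header {
	struct rcu_head rcu_head;
	struct page *page;
};

static void stage2_free_table_page_rcu(struct rcu_head *head)
{
	struct stage2_page_header *hdr =
		container_of(head, struct stage2_page_header, rcu_head);

	__free_page(hdr->page);
	kmem_cache_free(stage2_page_header_cache, hdr);
}

static void stage2_defer_free_table_page(void *pgtable)
{
	struct page *page = virt_to_page(pgtable);
	struct stage2_page_header *hdr = (void *)page_private(page);

	/*
	 * The page is actually freed only once every walker has left its
	 * RCU read-side critical section.
	 */
	call_rcu(&hdr->rcu_head, stage2_free_table_page_rcu);
}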

I tried to organize these patches as best I could w/o introducing
intermediate breakage.

The first 5 patches are meant mostly as preparatory reworks and, in
the case of RCU, a no-op.

Patch 6 is quite large, but I had a hard time deciding how to change the
way we link/unlink tables to use atomics without breaking things along
the way.

Patch 7 probably should come before patch 6, as it informs the other
read-side fault (perm relax) about when a map is in progress so it'll
back off.

Patches 8-10 take care of the pruning case, actually locking the child ptes
instead of simply dropping table page references along the way. Note
that we cannot assume a pte points to a table/page at this point, hence
the same helper is called for both the pre-order and leaf visits. The
recursion is guided by what got yanked from the PTE.

Patches 11-14 wire up everything to schedule rcu callbacks on
to-be-freed table pages. rcu_barrier() is called on the way out from
tearing down a stage 2 page table to guarantee all memory associated
with the VM has actually been cleaned up.
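
In other words, roughly (sketch; the outer helper name is invented,
kvm_pgtable_stage2_destroy() is the existing API):

static void stage2_destroy(struct kvm_pgtable *pgt)
{
	/* Zap the tables; page frees get queued via call_rcu(). */
	kvm_pgtable_stage2_destroy(pgt);

	/*
	 * Wait for every queued RCU callback so all deferred table-page
	 * frees have actually run before we return.
	 */
	rcu_barrier();
}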

Patches 15-16 bring the fault handler into the new table traversal
scheme.

Lastly, patch 17 is a nasty bit of debugging residue to spot possible
table page leaks. Please don't laugh ;-)

Smoke tested with KVM selftests + kvm_page_table_test w/ 2M hugetlb to
exercise the table pruning code. Haven't done anything beyond this,
sending as an RFC now to get eyes on the code.

Applies to commit fb649bda6f56 ("Merge tag 'block-5.18-2022-04-15' of
git://git.kernel.dk/linux-block")

Oliver Upton (17):
  KVM: arm64: Directly read owner id field in stage2_pte_is_counted()
  KVM: arm64: Only read the pte once per visit
  KVM: arm64: Return the next table from map callbacks
  KVM: arm64: Protect page table traversal with RCU
  KVM: arm64: Take an argument to indicate parallel walk
  KVM: arm64: Implement break-before-make sequence for parallel walks
  KVM: arm64: Enlighten perm relax path about parallel walks
  KVM: arm64: Spin off helper for initializing table pte
  KVM: arm64: Tear down unlinked page tables in parallel walk
  KVM: arm64: Assume a table pte is already owned in post-order
    traversal
  KVM: arm64: Move MMU cache init/destroy into helpers
  KVM: arm64: Stuff mmu page cache in sub struct
  KVM: arm64: Setup cache for stage2 page headers
  KVM: arm64: Punt last page reference to rcu callback for parallel walk
  KVM: arm64: Allow parallel calls to kvm_pgtable_stage2_map()
  KVM: arm64: Enable parallel stage 2 MMU faults
  TESTONLY: KVM: arm64: Add super lazy accounting of stage 2 table pages

 arch/arm64/include/asm/kvm_host.h     |   5 +-
 arch/arm64/include/asm/kvm_mmu.h      |   2 +
 arch/arm64/include/asm/kvm_pgtable.h  |  14 +-
 arch/arm64/kvm/arm.c                  |   4 +-
 arch/arm64/kvm/hyp/nvhe/mem_protect.c |  13 +-
 arch/arm64/kvm/hyp/nvhe/setup.c       |  13 +-
 arch/arm64/kvm/hyp/pgtable.c          | 518 +++++++++++++++++++-------
 arch/arm64/kvm/mmu.c                  | 120 ++++--
 8 files changed, 503 insertions(+), 186 deletions(-)

Comments

David Matlack April 15, 2022, 11:35 p.m. UTC | #1
On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@google.com> wrote:
>
> Presently KVM only takes a read lock for stage 2 faults if it believes
> the fault can be fixed by relaxing permissions on a PTE (write unprotect
> for dirty logging). Otherwise, stage 2 faults grab the write lock, which
> predictably can pile up all the vCPUs in a sufficiently large VM.
>
> The x86 port of KVM has what it calls the TDP MMU. Basically, it is an
> MMU protected by the combination of a read-write lock and RCU, allowing
> page walkers to traverse in parallel.
>
> This series is strongly inspired by the mechanics of the TDP MMU,
> making use of RCU to protect parallel walks. Note that the TLB
> invalidation mechanics are a bit different between x86 and ARM, so we
> need to use the 'break-before-make' sequence to split/collapse a
> block/table mapping, respectively.

An alternative (or perhaps "v2" [1]) is to make x86's TDP MMU
arch-neutral and port it to support ARM's stage-2 MMU. This is based
on a few observations:

- The problems that motivated the development of the TDP MMU are not
x86-specific (e.g. parallelizing faults during the post-copy phase of
Live Migration).
- The synchronization in the TDP MMU (read/write lock, RCU for PT
freeing, atomic compare-exchanges for modifying PTEs) is complex, but
would be equivalent across architectures.
- Eventually RISC-V is going to want similar performance (my
understanding is RISC-V MMU is already a copy-paste of the ARM MMU),
and it'd be a shame to re-implement TDP MMU synchronization a third
time.
- The TDP MMU includes support for various performance features that
would benefit other architectures, such as eager page splitting,
deferred zapping, lockless write-protection resolution, and (coming
soon) in-place huge page promotion.
- And then there are the obvious wins from less code duplication in KVM
(e.g. get rid of the RISC-V MMU copy, increased code test coverage,
...).

The side of this I haven't really looked into yet is ARM's stage-2
MMU, and how amenable it would be to being managed by the TDP MMU. But
I assume it's a conventional page table structure mapping GPAs to
HPAs, which is the most important overlap.

That all being said, an arch-neutral TDP MMU would be a larger, more
complex code change than something like this series (hence my "v2"
caveat above). But I wanted to get this idea out there since the
rubber is starting to hit the road on improving ARM MMU scalability.

[1] "v2" as in the "next evolution" sense, not the "PATCH v2" sense :)





>
> Nonetheless, using atomics on the break side allows fault handlers to
> acquire exclusive access to a PTE (lets just call it locked). Once the
> PTE lock is acquired it is then safe to assume exclusive access.
>
> Special consideration is required when pruning the page tables in
> parallel. Suppose we are collapsing a table into a block. Allowing
> parallel faults means that a software walker could be in the middle of
> a lower level traversal when the table is unlinked. Table
> walkers that prune the paging structures must now 'lock' all descendent
> PTEs, effectively asserting exclusive ownership of the substructure
> (no other walker can install something to an already locked pte).
>
> Additionally, for parallel walks we need to punt the freeing of table
> pages to the next RCU sync, as there could be multiple observers of the
> table until all walkers exit the RCU critical section. For this I
> decided to cram an rcu_head into page private data for every table page.
> We wind up spending a bit more on table pages now, but lazily allocating
> for rcu callbacks probably doesn't make a lot of sense. Not only would
> we need a large cache of them (think about installing a level 1 block)
> to wire up callbacks on all descendent tables, but we also then need to
> spend memory to actually free memory.
>
> I tried to organize these patches as best I could w/o introducing
> intermediate breakage.
>
> The first 5 patches are meant mostly as prepatory reworks, and, in the
> case of RCU a nop.
>
> Patch 6 is quite large, but I had a hard time deciding how to change the
> way we link/unlink tables to use atomics without breaking things along
> the way.
>
> Patch 7 probably should come before patch 6, as it informs the other
> read-side fault (perm relax) about when a map is in progress so it'll
> back off.
>
> Patches 8-10 take care of the pruning case, actually locking the child ptes
> instead of simply dropping table page references along the way. Note
> that we cannot assume a pte points to a table/page at this point, hence
> the same helper is called for pre- and leaf-traversal. Guide the
> recursion based on what got yanked from the PTE.
>
> Patches 11-14 wire up everything to schedule rcu callbacks on
> to-be-freed table pages. rcu_barrier() is called on the way out from
> tearing down a stage 2 page table to guarantee all memory associated
> with the VM has actually been cleaned up.
>
> Patches 15-16 loop in the fault handler to the new table traversal game.
>
> Lastly, patch 17 is a nasty bit of debugging residue to spot possible
> table page leaks. Please don't laugh ;-)
>
> Smoke tested with KVM selftests + kvm_page_table_test w/ 2M hugetlb to
> exercise the table pruning code. Haven't done anything beyond this,
> sending as an RFC now to get eyes on the code.
>
> Applies to commit fb649bda6f56 ("Merge tag 'block-5.18-2022-04-15' of
> git://git.kernel.dk/linux-block")
>
> Oliver Upton (17):
>   KVM: arm64: Directly read owner id field in stage2_pte_is_counted()
>   KVM: arm64: Only read the pte once per visit
>   KVM: arm64: Return the next table from map callbacks
>   KVM: arm64: Protect page table traversal with RCU
>   KVM: arm64: Take an argument to indicate parallel walk
>   KVM: arm64: Implement break-before-make sequence for parallel walks
>   KVM: arm64: Enlighten perm relax path about parallel walks
>   KVM: arm64: Spin off helper for initializing table pte
>   KVM: arm64: Tear down unlinked page tables in parallel walk
>   KVM: arm64: Assume a table pte is already owned in post-order
>     traversal
>   KVM: arm64: Move MMU cache init/destroy into helpers
>   KVM: arm64: Stuff mmu page cache in sub struct
>   KVM: arm64: Setup cache for stage2 page headers
>   KVM: arm64: Punt last page reference to rcu callback for parallel walk
>   KVM: arm64: Allow parallel calls to kvm_pgtable_stage2_map()
>   KVM: arm64: Enable parallel stage 2 MMU faults
>   TESTONLY: KVM: arm64: Add super lazy accounting of stage 2 table pages
>
>  arch/arm64/include/asm/kvm_host.h     |   5 +-
>  arch/arm64/include/asm/kvm_mmu.h      |   2 +
>  arch/arm64/include/asm/kvm_pgtable.h  |  14 +-
>  arch/arm64/kvm/arm.c                  |   4 +-
>  arch/arm64/kvm/hyp/nvhe/mem_protect.c |  13 +-
>  arch/arm64/kvm/hyp/nvhe/setup.c       |  13 +-
>  arch/arm64/kvm/hyp/pgtable.c          | 518 +++++++++++++++++++-------
>  arch/arm64/kvm/mmu.c                  | 120 ++++--
>  8 files changed, 503 insertions(+), 186 deletions(-)
>
> --
> 2.36.0.rc0.470.gd361397f0d-goog
>
Oliver Upton April 16, 2022, 12:04 a.m. UTC | #2
On Fri, Apr 15, 2022 at 04:35:24PM -0700, David Matlack wrote:

[...]

> An alternative (or perhaps "v2" [1]) is to make x86's TDP MMU
> arch-neutral and port it to support ARM's stage-2 MMU. This is based
> on a few observations:
> 
> - The problems that motivated the development of the TDP MMU are not
> x86-specific (e.g. parallelizing faults during the post-copy phase of
> Live Migration).
> - The synchronization in the TDP MMU (read/write lock, RCU for PT
> freeing, atomic compare-exchanges for modifying PTEs) is complex, but
> would be equivalent across architectures.
> - Eventually RISC-V is going to want similar performance (my
> understanding is RISC-V MMU is already a copy-paste of the ARM MMU),
> and it'd be a shame to re-implement TDP MMU synchronization a third
> time.
> - The TDP MMU includes support for various performance features that
> would benefit other architectures, such as eager page splitting,
> deferred zapping, lockless write-protection resolution, and (coming
> soon) in-place huge page promotion.
> - And then there's the obvious wins from less code duplication in KVM
> (e.g. get rid of the RISC-V MMU copy, increased code test coverage,
> ...).

I definitely agree with the observation -- we're all trying to solve the
same set of issues. And I completely agree that a good long term goal
would be to create some common parts for all architectures. Less work
for us ARM folks it would seem ;-)

What's top of mind is how we paper over the architectural differences
between all of the architectures, especially when we need to do entirely
different things because of the arch.

For example, I whine about break-before-make a lot throughout this
series, which is somewhat unique to ARM. I don't think we can do eager
page splitting on the base architecture w/o doing the TLBI for every
block. Not only that, we can't do a direct valid->valid change without
first making an invalid PTE visible to hardware. Things get even more
exciting when hardware revisions relax break-before-make requirements.
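
For reference, the base-architecture sequence for replacing one valid
mapping with another looks roughly like this (sketch, not the series'
actual code):

static void stage2_replace_pte(kvm_pte_t *ptep, kvm_pte_t new,
			       struct kvm_s2_mmu *mmu, u64 addr, u32 level)
{
	/* Break: the hardware walker must observe an invalid PTE first. */
	WRITE_ONCE(*ptep, 0);
	dsb(ishst);

	/* Invalidate any cached copies of the old translation. */
	kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, addr, level);

	/* Make: only now may the new valid PTE become visible. */
	WRITE_ONCE(*ptep, new);
}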

There are also significant architectural differences between KVM on x86
and KVM for ARM. Our paging code runs both in the host kernel and the
hyp/lowvisor, and does:

 - VM two dimensional paging (stage 2 MMU)
 - Hyp's own MMU (stage 1 MMU)
 - Host kernel isolation (stage 2 MMU)

each with its own quirks. The 'not exactly in the kernel' part will make
instrumentation a bit of a hassle too.

None of this is meant to disagree with you in the slightest. I firmly
agree we need to share as many parts between the architectures as
possible. I'm just trying to call out a few of the things relating to
ARM that will make this annoying so that whoever embarks on the
adventure will see it.

> The side of this I haven't really looked into yet is ARM's stage-2
> MMU, and how amenable it would be to being managed by the TDP MMU. But
> I assume it's a conventional page table structure mapping GPAs to
> HPAs, which is the most important overlap.
> 
> That all being said, an arch-neutral TDP MMU would be a larger, more
> complex code change than something like this series (hence my "v2"
> caveat above). But I wanted to get this idea out there since the
> rubber is starting to hit the road on improving ARM MMU scalability.

All for it. I cc'ed you on the series for this exact reason, I wanted to
grab your attention to spark the conversation :)

--
Thanks,
Oliver
Oliver Upton April 16, 2022, 6:23 a.m. UTC | #3
On Fri, Apr 15, 2022 at 09:58:44PM +0000, Oliver Upton wrote:

[...]

> 
> Smoke tested with KVM selftests + kvm_page_table_test w/ 2M hugetlb to
> exercise the table pruning code. Haven't done anything beyond this,
> sending as an RFC now to get eyes on the code.

Ok, got around to testing this thing a bit harder. Keep in mind that
permission faults at PAGE_SIZE granularity already go on the read side
of the lock. I used the dirty_log_perf_test with 4G/vCPU and anonymous
THP all the way up to 48 vCPUs. Here is the data as it compares to
5.18-rc2.

Dirty log time (split 2M -> 4K):

+-------+----------+-------------------+
| vCPUs | 5.18-rc2 | 5.18-rc2 + series |
+-------+----------+-------------------+
|     1 | 0.83s    | 0.85s             |
|     2 | 0.95s    | 1.07s             |
|     4 | 2.65s    | 1.13s             |
|     8 | 4.88s    | 1.33s             |
|    16 | 9.71s    | 1.73s             |
|    32 | 20.43s   | 3.99s             |
|    48 | 29.15s   | 6.28s             |
+-------+----------+-------------------+

The scaling of the prefaulting pass looks better too (same config):

+-------+----------+-------------------+
| vCPUs | 5.18-rc2 | 5.18-rc2 + series |
+-------+----------+-------------------+
|     1 | 0.42s    | 0.18s             |
|     2 | 0.55s    | 0.19s             |
|     4 | 0.79s    | 0.27s             |
|     8 | 1.29s    | 0.35s             |
|    16 | 2.03s    | 0.53s             |
|    32 | 4.03s    | 1.01s             |
|    48 | 6.10s    | 1.51s             |
+-------+----------+-------------------+

--
Thanks,
Oliver
Ben Gardon April 19, 2022, 5:57 p.m. UTC | #4
On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@google.com> wrote:

[...]

> Additionally, for parallel walks we need to punt the freeing of table
> pages to the next RCU sync, as there could be multiple observers of the
> table until all walkers exit the RCU critical section. For this I
> decided to cram an rcu_head into page private data for every table page.
> We wind up spending a bit more on table pages now, but lazily allocating
> for rcu callbacks probably doesn't make a lot of sense. Not only would
> we need a large cache of them (think about installing a level 1 block)
> to wire up callbacks on all descendent tables, but we also then need to
> spend memory to actually free memory.

FWIW we used a similar approach in early versions of the TDP MMU, but
used page->lru rather than page->private so that more metadata could
be stored in page->private.
Ultimately that ended up being too limiting and we decided to switch
to just using the associated struct kvm_mmu_page as the list element.
I don't know if ARM has an equivalent construct though.

Oliver Upton April 19, 2022, 6:36 p.m. UTC | #5
On Tue, Apr 19, 2022 at 10:57 AM Ben Gardon <bgardon@google.com> wrote:

[...]

> FWIW we used a similar approach in early versions of the TDP MMU, but
> instead of page->private used page->lru so that more metadata could be
> stored in page->private.
> Ultimately that ended up being too limiting and we decided to switch
> to just using the associated struct kvm_mmu_page as the list element.
> I don't know if ARM has an equivalent construct though.

ARM currently doesn't have any metadata it needs to tie with the table
pages. We just do very basic page reference counting for every valid
PTE. I was going to link together pages (hence the page header), but
we actually do not have a functional need for it at the moment. In
fact, struct page::rcu_head would probably fit the bill and we can
avoid extra metadata/memory for the time being.
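
i.e. something as simple as (sketch):

static void stage2_free_removed_table_rcu(struct rcu_head *head)
{
	struct page *page = container_of(head, struct page, rcu_head);

	__free_page(page);
}

static void stage2_put_table_page(struct page *page)
{
	/* Instead of the final put_page(), punt the free to RCU. */
	call_rcu(&page->rcu_head, stage2_free_removed_table_rcu);
}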

Perhaps best to keep it simple and do the rest when we have a genuine
need for it.

--
Thanks,
Oliver
Ben Gardon April 21, 2022, 4:30 p.m. UTC | #6
On Tue, Apr 19, 2022 at 11:36 AM Oliver Upton <oupton@google.com> wrote:

[...]

> > FWIW we used a similar approach in early versions of the TDP MMU, but
> > instead of page->private used page->lru so that more metadata could be
> > stored in page->private.
> > Ultimately that ended up being too limiting and we decided to switch
> > to just using the associated struct kvm_mmu_page as the list element.
> > I don't know if ARM has an equivalent construct though.
>
> ARM currently doesn't have any metadata it needs to tie with the table
> pages. We just do very basic page reference counting for every valid
> PTE. I was going to link together pages (hence the page header), but
> we actually do not have a functional need for it at the moment. In
> fact, struct page::rcu_head would probably fit the bill and we can
> avoid extra metadata/memory for the time being.

Ah, right! page::rcu_head was the field I was thinking of.

>
> Perhaps best to keep it simple and do the rest when we have a genuine
> need for it.

Completely agree. I'm surprised that ARM doesn't have a need for a
metadata structure associated with each page of the stage 2 paging
structure, but if you don't need it, that definitely makes things
simpler.

Paolo Bonzini April 21, 2022, 4:37 p.m. UTC | #7
On 4/21/22 18:30, Ben Gardon wrote:
> Completely agree. I'm surprised that ARM doesn't have a need for a
> metadata structure associated with each page of the stage 2 paging
> structure, but if you don't need it, that definitely makes things
> simpler.

The uses of struct kvm_mmu_page in the TDP MMU are all relatively new,
for the work_struct and the roots' reference count. sp->ptep is only
used in a very specific path, kvm_recover_nx_lpages.

I wouldn't be surprised if ARM grows more metadata later, but in fact 
it's not _that_ surprising that it doesn't need it yet!

Paolo
David Matlack April 21, 2022, 4:43 p.m. UTC | #8
On Fri, Apr 15, 2022 at 5:04 PM Oliver Upton <oupton@google.com> wrote:
>
> On Fri, Apr 15, 2022 at 04:35:24PM -0700, David Matlack wrote:

[...]

>
> I definitely agree with the observation -- we're all trying to solve the
> same set of issues. And I completely agree that a good long term goal
> would be to create some common parts for all architectures. Less work
> for us ARM folks it would seem ;-)
>
> What's top of mind is how we paper over the architectural differences
> between all of the architectures, especially when we need to do entirely
> different things because of the arch.
>
> For example, I whine about break-before-make a lot throughout this
> series which is somewhat unique to ARM. I don't think we can do eager
> page splitting on the base architecture w/o doing the TLBI for every
> block. Not only that, we can't do a direct valid->valid change without
> first making an invalid PTE visible to hardware. Things get even more
> exciting when hardware revisions relax break-before-make requirements.

Gotcha, so porting the TDP MMU to ARM would require adding
break-before-make support. That seems feasible and we could guard it
behind e.g. a static_key so there is no runtime overhead for
architectures (or ARM hardware revisions) that do not require it.
Anything else come to mind as major architectural differences?
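
Something like the following, purely illustrative (names invented, and
the flush hook is hypothetical):

DEFINE_STATIC_KEY_FALSE(tdp_mmu_bbm_required);

static void tdp_mmu_set_spte(u64 *sptep, u64 new_spte, gfn_t gfn, int level)
{
	if (static_branch_unlikely(&tdp_mmu_bbm_required)) {
		/* Break: tear down and invalidate the old translation. */
		WRITE_ONCE(*sptep, 0);
		arch_tdp_mmu_flush_gfn(gfn, level);	/* hypothetical hook */
	}

	/* Make: install the new PTE. */
	WRITE_ONCE(*sptep, new_spte);
}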

 >
> There's also significant architectural differences between KVM on x86
> and KVM for ARM. Our paging code runs both in the host kernel and the
> hyp/lowvisor, and does:
>
>  - VM two dimensional paging (stage 2 MMU)
>  - Hyp's own MMU (stage 1 MMU)
>  - Host kernel isolation (stage 2 MMU)
>
> each with its own quirks. The 'not exactly in the kernel' part will make
> instrumentation a bit of a hassle too.

Ah, interesting. It'd probably make sense to start with the VM
two-dimensional paging use-case, leave the other use-cases on the
existing MMU, and then investigate transitioning them later.
Similarly in x86 we still have the legacy MMU for shadow paging (e.g.
hosts with no stage-2 hardware, and nested virtualization).
