mbox series

[00/14] KVM: arm64: Parallel stage-2 fault handling

Message ID 20220830194132.962932-1-oliver.upton@linux.dev (mailing list archive)
Headers show
Series KVM: arm64: Parallel stage-2 fault handling | expand

Message

Oliver Upton Aug. 30, 2022, 7:41 p.m. UTC
Presently KVM only takes a read lock for stage 2 faults if it believes
the fault can be fixed by relaxing permissions on a PTE (write unprotect
for dirty logging). Otherwise, stage 2 faults grab the write lock, which
predictably can pile up all the vCPUs in a sufficiently large VM.

Like the TDP MMU for x86, this series loosens the locking around
manipulations of the stage 2 page tables to allow parallel faults. RCU
and atomics are exploited to safely build/destroy the stage 2 page
tables in light of multiple software observers.

Patches 1-2 are a cleanup to the way we collapse page tables, with the
added benefit of narrowing the window of time a range of memory is
unmapped.

Patches 3-7 are minor cleanups and refactorings to the way KVM reads
PTEs and traverses the stage 2 page tables to make it amenable to
concurrent modification.

Patches 8-9 use RCU to punt page table cleanup out of the vCPU fault
path, which should also improve fault latency a bit.

Patches 10-14 implement the meat of this series, extending the
'break-before-make' sequence with atomics to realize locking on PTEs.
Effectively a cmpxchg() is used to 'break' a PTE, thereby serializing
changes to a given PTE.

Finally, patch 15 flips the switch on all the new code and starts
grabbing the read side of the MMU lock for stage 2 faults.

Applies to 6.0-rc3. Tested with KVM selftests and benchmarked with
dirty_log_perf_test, scaling from 1 to 48 vCPUs with 4GB of memory per
vCPU backed by THP.

  ./dirty_log_perf_test -s anonymous_thp -m 2 -b 4G -v ${NR_VCPUS}

Time to dirty memory:

        +-------+---------+------------------+
        | vCPUs | 6.0-rc3 | 6.0-rc3 + series |
        +-------+---------+------------------+
        |     1 | 0.89s   | 0.92s            |
        |     2 | 1.13s   | 1.18s            |
        |     4 | 2.42s   | 1.25s            |
        |     8 | 5.03s   | 1.36s            |
        |    16 | 8.84s   | 2.09s            |
        |    32 | 19.60s  | 4.47s            |
        |    48 | 31.39s  | 6.22s            |
        +-------+---------+------------------+

It is also worth mentioning that the time to populate memory has
improved:

        +-------+---------+------------------+
        | vCPUs | 6.0-rc3 | 6.0-rc3 + series |
        +-------+---------+------------------+
        |     1 | 0.19s   | 0.18s            |
        |     2 | 0.25s   | 0.21s            |
        |     4 | 0.38s   | 0.32s            |
        |     8 | 0.64s   | 0.40s            |
        |    16 | 1.22s   | 0.54s            |
        |    32 | 2.50s   | 1.03s            |
        |    48 | 3.88s   | 1.52s            |
        +-------+---------+------------------+

RFC: https://lore.kernel.org/kvmarm/20220415215901.1737897-1-oupton@google.com/

RFC -> v1:
 - Factored out page table teardown from kvm_pgtable_stage2_map()
 - Use the RCU callback to tear down a subtree, instead of scheduling a
   callback for every individual table page.
 - Reorganized series to (hopefully) avoid intermediate breakage.
 - Dropped the use of page headers, instead stuffing KVM metadata into
   page::private directly

Oliver Upton (14):
  KVM: arm64: Add a helper to tear down unlinked stage-2 subtrees
  KVM: arm64: Tear down unlinked stage-2 subtree after break-before-make
  KVM: arm64: Directly read owner id field in stage2_pte_is_counted()
  KVM: arm64: Read the PTE once per visit
  KVM: arm64: Split init and set for table PTE
  KVM: arm64: Return next table from map callbacks
  KVM: arm64: Document behavior of pgtable visitor callback
  KVM: arm64: Protect page table traversal with RCU
  KVM: arm64: Free removed stage-2 tables in RCU callback
  KVM: arm64: Atomically update stage 2 leaf attributes in parallel
    walks
  KVM: arm64: Make changes block->table to leaf PTEs parallel-aware
  KVM: arm64: Make leaf->leaf PTE changes parallel-aware
  KVM: arm64: Make table->block changes parallel-aware
  KVM: arm64: Handle stage-2 faults in parallel

 arch/arm64/include/asm/kvm_pgtable.h  |  59 ++++-
 arch/arm64/kvm/hyp/nvhe/mem_protect.c |   7 +-
 arch/arm64/kvm/hyp/nvhe/setup.c       |   4 +-
 arch/arm64/kvm/hyp/pgtable.c          | 360 ++++++++++++++++----------
 arch/arm64/kvm/mmu.c                  |  65 +++--
 5 files changed, 325 insertions(+), 170 deletions(-)


base-commit: b90cb1053190353cc30f0fef0ef1f378ccc063c5

Comments

Marc Zyngier Sept. 6, 2022, 10 a.m. UTC | #1
On Tue, 30 Aug 2022 20:41:18 +0100,
Oliver Upton <oliver.upton@linux.dev> wrote:
> 
> Presently KVM only takes a read lock for stage 2 faults if it believes
> the fault can be fixed by relaxing permissions on a PTE (write unprotect
> for dirty logging). Otherwise, stage 2 faults grab the write lock, which
> predictably can pile up all the vCPUs in a sufficiently large VM.
> 
> Like the TDP MMU for x86, this series loosens the locking around
> manipulations of the stage 2 page tables to allow parallel faults. RCU
> and atomics are exploited to safely build/destroy the stage 2 page
> tables in light of multiple software observers.
> 
> Patches 1-2 are a cleanup to the way we collapse page tables, with the
> added benefit of narrowing the window of time a range of memory is
> unmapped.
> 
> Patches 3-7 are minor cleanups and refactorings to the way KVM reads
> PTEs and traverses the stage 2 page tables to make it amenable to
> concurrent modification.
> 
> Patches 8-9 use RCU to punt page table cleanup out of the vCPU fault
> path, which should also improve fault latency a bit.
> 
> Patches 10-14 implement the meat of this series, extending the
> 'break-before-make' sequence with atomics to realize locking on PTEs.
> Effectively a cmpxchg() is used to 'break' a PTE, thereby serializing
> changes to a given PTE.
> 
> Finally, patch 15 flips the switch on all the new code and starts
> grabbing the read side of the MMU lock for stage 2 faults.
> 
> Applies to 6.0-rc3. Tested with KVM selftests and benchmarked with
> dirty_log_perf_test, scaling from 1 to 48 vCPUs with 4GB of memory per
> vCPU backed by THP.
> 
>   ./dirty_log_perf_test -s anonymous_thp -m 2 -b 4G -v ${NR_VCPUS}
> 
> Time to dirty memory:
> 
>         +-------+---------+------------------+
>         | vCPUs | 6.0-rc3 | 6.0-rc3 + series |
>         +-------+---------+------------------+
>         |     1 | 0.89s   | 0.92s            |
>         |     2 | 1.13s   | 1.18s            |
>         |     4 | 2.42s   | 1.25s            |
>         |     8 | 5.03s   | 1.36s            |
>         |    16 | 8.84s   | 2.09s            |
>         |    32 | 19.60s  | 4.47s            |
>         |    48 | 31.39s  | 6.22s            |
>         +-------+---------+------------------+
> 
> It is also worth mentioning that the time to populate memory has
> improved:
> 
>         +-------+---------+------------------+
>         | vCPUs | 6.0-rc3 | 6.0-rc3 + series |
>         +-------+---------+------------------+
>         |     1 | 0.19s   | 0.18s            |
>         |     2 | 0.25s   | 0.21s            |
>         |     4 | 0.38s   | 0.32s            |
>         |     8 | 0.64s   | 0.40s            |
>         |    16 | 1.22s   | 0.54s            |
>         |    32 | 2.50s   | 1.03s            |
>         |    48 | 3.88s   | 1.52s            |
>         +-------+---------+------------------+
> 
> RFC: https://lore.kernel.org/kvmarm/20220415215901.1737897-1-oupton@google.com/
> 
> RFC -> v1:
>  - Factored out page table teardown from kvm_pgtable_stage2_map()
>  - Use the RCU callback to tear down a subtree, instead of scheduling a
>    callback for every individual table page.
>  - Reorganized series to (hopefully) avoid intermediate breakage.
>  - Dropped the use of page headers, instead stuffing KVM metadata into
>    page::private directly
> 
> Oliver Upton (14):
>   KVM: arm64: Add a helper to tear down unlinked stage-2 subtrees
>   KVM: arm64: Tear down unlinked stage-2 subtree after break-before-make
>   KVM: arm64: Directly read owner id field in stage2_pte_is_counted()
>   KVM: arm64: Read the PTE once per visit
>   KVM: arm64: Split init and set for table PTE
>   KVM: arm64: Return next table from map callbacks
>   KVM: arm64: Document behavior of pgtable visitor callback
>   KVM: arm64: Protect page table traversal with RCU
>   KVM: arm64: Free removed stage-2 tables in RCU callback
>   KVM: arm64: Atomically update stage 2 leaf attributes in parallel
>     walks
>   KVM: arm64: Make changes block->table to leaf PTEs parallel-aware
>   KVM: arm64: Make leaf->leaf PTE changes parallel-aware
>   KVM: arm64: Make table->block changes parallel-aware
>   KVM: arm64: Handle stage-2 faults in parallel
> 
>  arch/arm64/include/asm/kvm_pgtable.h  |  59 ++++-
>  arch/arm64/kvm/hyp/nvhe/mem_protect.c |   7 +-
>  arch/arm64/kvm/hyp/nvhe/setup.c       |   4 +-
>  arch/arm64/kvm/hyp/pgtable.c          | 360 ++++++++++++++++----------
>  arch/arm64/kvm/mmu.c                  |  65 +++--
>  5 files changed, 325 insertions(+), 170 deletions(-)

This fails to build on -rc4:

  MODPOST vmlinux.symvers
  MODINFO modules.builtin.modinfo
  GEN     modules.builtin
  CC      .vmlinux.export.o
  LD      .tmp_vmlinux.kallsyms1
ld: Unexpected GOT/PLT entries detected!
ld: Unexpected run-time procedure linkages detected!
ld: ID map text too big or misaligned
ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_walk':
(.hyp.text+0xdc0c): undefined reference to `__kvm_nvhe___rcu_read_lock'
ld: (.hyp.text+0xdc1c): undefined reference to `__kvm_nvhe___rcu_read_unlock'
ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_get_leaf':
(.hyp.text+0xdc80): undefined reference to `__kvm_nvhe___rcu_read_lock'
ld: (.hyp.text+0xdc90): undefined reference to `__kvm_nvhe___rcu_read_unlock'
ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_hyp_map':
(.hyp.text+0xddb0): undefined reference to `__kvm_nvhe___rcu_read_lock'
ld: (.hyp.text+0xddc0): undefined reference to `__kvm_nvhe___rcu_read_unlock'
ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_hyp_unmap':
(.hyp.text+0xde44): undefined reference to `__kvm_nvhe___rcu_read_lock'
ld: (.hyp.text+0xde50): undefined reference to `__kvm_nvhe___rcu_read_unlock'
ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_hyp_destroy':
(.hyp.text+0xdf40): undefined reference to `__kvm_nvhe___rcu_read_lock'
ld: (.hyp.text+0xdf50): undefined reference to `__kvm_nvhe___rcu_read_unlock'
ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_stage2_map':
(.hyp.text+0xe16c): undefined reference to `__kvm_nvhe___rcu_read_lock'
ld: (.hyp.text+0xe17c): undefined reference to `__kvm_nvhe___rcu_read_unlock'
ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_stage2_set_owner':
(.hyp.text+0xe264): undefined reference to `__kvm_nvhe___rcu_read_lock'
ld: (.hyp.text+0xe274): undefined reference to `__kvm_nvhe___rcu_read_unlock'
ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_stage2_unmap':
(.hyp.text+0xe2d4): undefined reference to `__kvm_nvhe___rcu_read_lock'
ld: (.hyp.text+0xe2e4): undefined reference to `__kvm_nvhe___rcu_read_unlock'
ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_stage2_flush':
(.hyp.text+0xe5b4): undefined reference to `__kvm_nvhe___rcu_read_lock'
ld: (.hyp.text+0xe5c4): undefined reference to `__kvm_nvhe___rcu_read_unlock'
ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_stage2_destroy':
(.hyp.text+0xe6f0): undefined reference to `__kvm_nvhe___rcu_read_lock'
ld: (.hyp.text+0xe700): undefined reference to `__kvm_nvhe___rcu_read_unlock'
make[3]: *** [Makefile:1169: vmlinux] Error 1
make[2]: *** [debian/rules:7: build-arch] Error 2

as this drags the RCU read-lock into EL2, and that's not going to
work... The following fixes it, but I wonder how you tested it.

Thanks,

	M.

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index dc839db86a1a..adf170122daf 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -580,7 +580,7 @@ enum kvm_pgtable_prot kvm_pgtable_stage2_pte_prot(kvm_pte_t pte);
  */
 enum kvm_pgtable_prot kvm_pgtable_hyp_pte_prot(kvm_pte_t pte);
 
-#if defined(__KVM_NVHE_HYPERVISOR___)
+#if defined(__KVM_NVHE_HYPERVISOR__)
 
 static inline void kvm_pgtable_walk_begin(void) {}
 static inline void kvm_pgtable_walk_end(void) {}
Oliver Upton Sept. 9, 2022, 10:01 a.m. UTC | #2
Hey Marc,

On Tue, Sep 06, 2022 at 11:00:09AM +0100, Marc Zyngier wrote:

[...]

> This fails to build on -rc4:
> 
>   MODPOST vmlinux.symvers
>   MODINFO modules.builtin.modinfo
>   GEN     modules.builtin
>   CC      .vmlinux.export.o
>   LD      .tmp_vmlinux.kallsyms1
> ld: Unexpected GOT/PLT entries detected!
> ld: Unexpected run-time procedure linkages detected!
> ld: ID map text too big or misaligned
> ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_walk':
> (.hyp.text+0xdc0c): undefined reference to `__kvm_nvhe___rcu_read_lock'
> ld: (.hyp.text+0xdc1c): undefined reference to `__kvm_nvhe___rcu_read_unlock'
> ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_get_leaf':
> (.hyp.text+0xdc80): undefined reference to `__kvm_nvhe___rcu_read_lock'
> ld: (.hyp.text+0xdc90): undefined reference to `__kvm_nvhe___rcu_read_unlock'
> ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_hyp_map':
> (.hyp.text+0xddb0): undefined reference to `__kvm_nvhe___rcu_read_lock'
> ld: (.hyp.text+0xddc0): undefined reference to `__kvm_nvhe___rcu_read_unlock'
> ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_hyp_unmap':
> (.hyp.text+0xde44): undefined reference to `__kvm_nvhe___rcu_read_lock'
> ld: (.hyp.text+0xde50): undefined reference to `__kvm_nvhe___rcu_read_unlock'
> ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_hyp_destroy':
> (.hyp.text+0xdf40): undefined reference to `__kvm_nvhe___rcu_read_lock'
> ld: (.hyp.text+0xdf50): undefined reference to `__kvm_nvhe___rcu_read_unlock'
> ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_stage2_map':
> (.hyp.text+0xe16c): undefined reference to `__kvm_nvhe___rcu_read_lock'
> ld: (.hyp.text+0xe17c): undefined reference to `__kvm_nvhe___rcu_read_unlock'
> ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_stage2_set_owner':
> (.hyp.text+0xe264): undefined reference to `__kvm_nvhe___rcu_read_lock'
> ld: (.hyp.text+0xe274): undefined reference to `__kvm_nvhe___rcu_read_unlock'
> ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_stage2_unmap':
> (.hyp.text+0xe2d4): undefined reference to `__kvm_nvhe___rcu_read_lock'
> ld: (.hyp.text+0xe2e4): undefined reference to `__kvm_nvhe___rcu_read_unlock'
> ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_stage2_flush':
> (.hyp.text+0xe5b4): undefined reference to `__kvm_nvhe___rcu_read_lock'
> ld: (.hyp.text+0xe5c4): undefined reference to `__kvm_nvhe___rcu_read_unlock'
> ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_stage2_destroy':
> (.hyp.text+0xe6f0): undefined reference to `__kvm_nvhe___rcu_read_lock'
> ld: (.hyp.text+0xe700): undefined reference to `__kvm_nvhe___rcu_read_unlock'
> make[3]: *** [Makefile:1169: vmlinux] Error 1
> make[2]: *** [debian/rules:7: build-arch] Error 2
> 
> as this drags the RCU read-lock into EL2, and that's not going to
> work... The following fixes it, but I wonder how you tested it.

Ugh. I was carrying a patch on top of my series to handle compilation
issues with rseq_test, I managed to squash the equivalent of below in
that patch.

Nonetheless, I *did* actually test it to get the numbers above :)

--
Thanks,
Oliver

> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index dc839db86a1a..adf170122daf 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -580,7 +580,7 @@ enum kvm_pgtable_prot kvm_pgtable_stage2_pte_prot(kvm_pte_t pte);
>   */
>  enum kvm_pgtable_prot kvm_pgtable_hyp_pte_prot(kvm_pte_t pte);
>  
> -#if defined(__KVM_NVHE_HYPERVISOR___)
> +#if defined(__KVM_NVHE_HYPERVISOR__)
>  
>  static inline void kvm_pgtable_walk_begin(void) {}
>  static inline void kvm_pgtable_walk_end(void) {}
> 
> -- 
> Without deviation from the norm, progress is not possible.