
[v2,25/30] KVM: x86/mmu: Zap roots in two passes to avoid inducing RCU stalls

Message ID 20211223222318.1039223-26-seanjc@google.com (mailing list archive)
State New, archived
Series KVM: x86/mmu: Overhaul TDP MMU zapping and flushing

Commit Message

Sean Christopherson Dec. 23, 2021, 10:23 p.m. UTC
When zapping a TDP MMU root, perform the zap in two passes to avoid
zapping an entire top-level SPTE while holding RCU, which can induce RCU
stalls.  In the first pass, zap SPTEs at PG_LEVEL_1G, and then
zap top-level entries in the second pass.

With 4-level paging, zapping a PGD that is fully populated with 4kb leaf
SPTEs takes up to ~7 or so seconds (time varies based on kernel config,
number of CPUs, vCPUs, etc...).  With 5-level paging, that time can
balloon well into hundreds of seconds.

Before remote TLB flushes were omitted, the problem was even worse as
waiting for all active vCPUs to respond to the IPI introduced significant
overhead for VMs with large numbers of vCPUs.

By zapping 1gb SPTEs (both shadow pages and hugepages) in the first pass,
the amount of work that is done without dropping RCU protection is
strictly bounded, with the worst case latency for a single operation
being less than 100ms.

Zapping at 1gb in the first pass is not arbitrary.  First and foremost,
KVM relies on being able to zap 1gb shadow pages in a single shot when
replacing a shadow page with a hugepage.  Zapping a 1gb shadow page that
is fully populated with 4kb dirty SPTEs also triggers the worst case
latency due to writing back the struct page accessed/dirty bits for each
4kb page, i.e. the two-pass approach is guaranteed to work so long as KVM
can cleanly zap a 1gb shadow page.
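
To get an intuition for the bound, the following toy userspace model (not
KVM code; the names, tree shape and fanout are made up for the demo) mimics
the recursion: removing an entry frees its entire subtree without yielding,
and zapping first at a level well below the root caps how much any single
removal frees.

  #include <stdio.h>
  #include <stdlib.h>

  #define FANOUT     8   /* entries per table; 512 on x86, shrunk for the demo */
  #define ROOT_LEVEL 5   /* model 5-level paging, the worst case above */
  #define ZAP_LEVEL  3   /* stand-in for PG_LEVEL_1G */

  struct toy_sp {
          int level;
          struct toy_sp *child[FANOUT];
  };

  static long chunk;        /* nodes freed since the last yield point */
  static long worst_chunk;  /* most work ever done without yielding */
  static long total;        /* total nodes freed */

  static struct toy_sp *build(int level)
  {
          struct toy_sp *sp = calloc(1, sizeof(*sp));
          int i;

          sp->level = level;
          if (level > 1)
                  for (i = 0; i < FANOUT; i++)
                          sp->child[i] = build(level - 1);
          return sp;
  }

  /*
   * Removing an entry frees everything below it without yielding, mirroring
   * how zapping an upper-level SPTE recurses on all of its children.
   */
  static void zap_entry(struct toy_sp *sp)
  {
          int i;

          for (i = 0; i < FANOUT; i++)
                  if (sp->child[i])
                          zap_entry(sp->child[i]);
          chunk++;
          total++;
          free(sp);
  }

  /* The iterator's resched point between entries at the target level. */
  static void yield_point(void)
  {
          if (chunk > worst_chunk)
                  worst_chunk = chunk;
          chunk = 0;
  }

  /* Zap all present entries at 'zap_level', yielding between entries. */
  static void zap_at_level(struct toy_sp *sp, int zap_level)
  {
          int i;

          for (i = 0; i < FANOUT; i++) {
                  struct toy_sp *c = sp->child[i];

                  if (!c)
                          continue;
                  if (c->level > zap_level) {
                          zap_at_level(c, zap_level);
                  } else {
                          zap_entry(c);
                          sp->child[i] = NULL;
                          yield_point();
                  }
          }
  }

  int main(void)
  {
          struct toy_sp *root = build(ROOT_LEVEL);

          zap_at_level(root, ZAP_LEVEL);        /* pass 1: bounded chunks */
          zap_at_level(root, ROOT_LEVEL - 1);   /* pass 2: now-empty top level */
          free(root);

          printf("freed %ld nodes, worst chunk without yielding: %ld\n",
                 total, worst_chunk);
          return 0;
  }

With the toy's fanout of 8 and a 5-level tree, the worst chunk freed without
yielding is 73 nodes instead of the 585 sitting under a single top-level
entry; with 512-entry tables the gap is correspondingly larger.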

  rcu: INFO: rcu_sched self-detected stall on CPU
  rcu:     52-....: (20999 ticks this GP) idle=7be/1/0x4000000000000000
                                          softirq=15759/15759 fqs=5058
   (t=21016 jiffies g=66453 q=238577)
  NMI backtrace for cpu 52
  Call Trace:
   ...
   mark_page_accessed+0x266/0x2f0
   kvm_set_pfn_accessed+0x31/0x40
   handle_removed_tdp_mmu_page+0x259/0x2e0
   __handle_changed_spte+0x223/0x2c0
   handle_removed_tdp_mmu_page+0x1c1/0x2e0
   __handle_changed_spte+0x223/0x2c0
   handle_removed_tdp_mmu_page+0x1c1/0x2e0
   __handle_changed_spte+0x223/0x2c0
   zap_gfn_range+0x141/0x3b0
   kvm_tdp_mmu_zap_invalidated_roots+0xc8/0x130
   kvm_mmu_zap_all_fast+0x121/0x190
   kvm_mmu_invalidate_zap_pages_in_memslot+0xe/0x10
   kvm_page_track_flush_slot+0x5c/0x80
   kvm_arch_flush_shadow_memslot+0xe/0x10
   kvm_set_memslot+0x172/0x4e0
   __kvm_set_memory_region+0x337/0x590
   kvm_vm_ioctl+0x49c/0xf80

Reported-by: David Matlack <dmatlack@google.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 27 ++++++++++++++++++++++-----
 1 file changed, 22 insertions(+), 5 deletions(-)

Comments

David Matlack Jan. 5, 2022, 12:19 a.m. UTC | #1
On Thu, Dec 23, 2021 at 10:23:13PM +0000, Sean Christopherson wrote:
> When zapping a TDP MMU root, perform the zap in two passes to avoid
> zapping an entire top-level SPTE while holding RCU, which can induce RCU
> stalls.  In the first pass, zap SPTEs at PG_LEVEL_1G, and then
> zap top-level entries in the second pass.
> 
> With 4-level paging, zapping a PGD that is fully populated with 4kb leaf
> SPTEs takes up to ~7 or so seconds (time varies based on kernel config,
> number of CPUs, vCPUs, etc...).  With 5-level paging, that time can
> balloon well into hundreds of seconds.
> 
> Before remote TLB flushes were omitted, the problem was even worse as
> waiting for all active vCPUs to respond to the IPI introduced significant
> overhead for VMs with large numbers of vCPUs.
> 
> By zapping 1gb SPTEs (both shadow pages and hugepages) in the first pass,
> the amount of work that is done without dropping RCU protection is
> strictly bounded, with the worst case latency for a single operation
> being less than 100ms.
> 
> Zapping at 1gb in the first pass is not arbitrary.  First and foremost,
> KVM relies on being able to zap 1gb shadow pages in a single shot when
> replacing a shadow page with a hugepage.

When dirty logging is disabled, zap_collapsible_spte_range() does the
bulk of the work zapping leaf SPTEs and allows yielding. I guess that
could race with a vCPU faulting in the huge page though and the vCPU
could do the bulk of the work.

Are there any other scenarios where KVM relies on zapping 1GB worth of
4KB SPTEs without yielding?

In any case, 100ms is a long time to hog the CPU. Why not just do the
safe thing and zap each level? 4K, then 2M, then 1GB, ..., then root
level. The only argument against it I can think of is performance (lots
of redundant walks through the page table). But I don't think root
zapping is especially latency critical.

> Zapping a 1gb shadow page
> that is fully populated with 4kb dirty SPTEs also triggers the worst case
> latency due to writing back the struct page accessed/dirty bits for each
> 4kb page, i.e. the two-pass approach is guaranteed to work so long as KVM
> can cleanly zap a 1gb shadow page.
> 
>   rcu: INFO: rcu_sched self-detected stall on CPU
>   rcu:     52-....: (20999 ticks this GP) idle=7be/1/0x4000000000000000
>                                           softirq=15759/15759 fqs=5058
>    (t=21016 jiffies g=66453 q=238577)
>   NMI backtrace for cpu 52
>   Call Trace:
>    ...
>    mark_page_accessed+0x266/0x2f0
>    kvm_set_pfn_accessed+0x31/0x40
>    handle_removed_tdp_mmu_page+0x259/0x2e0
>    __handle_changed_spte+0x223/0x2c0
>    handle_removed_tdp_mmu_page+0x1c1/0x2e0
>    __handle_changed_spte+0x223/0x2c0
>    handle_removed_tdp_mmu_page+0x1c1/0x2e0
>    __handle_changed_spte+0x223/0x2c0
>    zap_gfn_range+0x141/0x3b0
>    kvm_tdp_mmu_zap_invalidated_roots+0xc8/0x130
>    kvm_mmu_zap_all_fast+0x121/0x190
>    kvm_mmu_invalidate_zap_pages_in_memslot+0xe/0x10
>    kvm_page_track_flush_slot+0x5c/0x80
>    kvm_arch_flush_shadow_memslot+0xe/0x10
>    kvm_set_memslot+0x172/0x4e0
>    __kvm_set_memory_region+0x337/0x590
>    kvm_vm_ioctl+0x49c/0xf80
> 
> Reported-by: David Matlack <dmatlack@google.com>
> Cc: Ben Gardon <bgardon@google.com>
> Cc: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 27 ++++++++++++++++++++++-----
>  1 file changed, 22 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index aec97e037a8d..2e28f5e4b761 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -809,6 +809,18 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
>  	gfn_t end = tdp_mmu_max_gfn_host();
>  	gfn_t start = 0;
>  
> +	/*
> +	 * To avoid RCU stalls due to recursively removing huge swaths of SPs,
> +	 * split the zap into two passes.  On the first pass, zap at the 1gb
> +	 * level, and then zap top-level SPs on the second pass.  "1gb" is not
> +	 * arbitrary, as KVM must be able to zap a 1gb shadow page without
> +	 * inducing a stall to allow in-place replacement with a 1gb hugepage.
> +	 *
> +	 * Because zapping a SP recurses on its children, stepping down to
> +	 * PG_LEVEL_4K in the iterator itself is unnecessary.
> +	 */
> +	int zap_level = PG_LEVEL_1G;
> +
>  	/*
>  	 * The root must have an elevated refcount so that it's reachable via
>  	 * mmu_notifier callbacks, which allows this path to yield and drop
> @@ -825,12 +837,9 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
>  
>  	rcu_read_lock();
>  
> -	/*
> -	 * No need to try to step down in the iterator when zapping an entire
> -	 * root, zapping an upper-level SPTE will recurse on its children.
> -	 */
> +start:
>  	for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
> -				   root->role.level, start, end) {
> +				   zap_level, start, end) {
>  retry:
>  		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
>  			continue;
> @@ -838,6 +847,9 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
>  		if (!is_shadow_present_pte(iter.old_spte))
>  			continue;
>  
> +		if (iter.level > zap_level)
> +			continue;
> +
>  		if (!shared) {
>  			tdp_mmu_set_spte(kvm, &iter, 0);
>  		} else if (!tdp_mmu_set_spte_atomic(kvm, &iter, 0)) {
> @@ -846,6 +858,11 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
>  		}
>  	}
>  
> +	if (zap_level < root->role.level) {
> +		zap_level = root->role.level;
> +		goto start;
> +	}

This is probably just personal opinion but I find the 2-iteration goto
loop harder to understand than just open-coding the 2 passes.

e.g.

  static void tdp_mmu_zap_root(...)
  {
          /*
           * To avoid RCU stalls due to recursively removing huge swaths of SPs,
           * split the zap into two passes.  On the first pass, zap at the 1gb
           * level, and then zap top-level SPs on the second pass.  "1gb" is not
           * arbitrary, as KVM must be able to zap a 1gb shadow page without
           * inducing a stall to allow in-place replacement with a 1gb hugepage.
           *
           * Because zapping a SP recurses on its children, stepping down to
           * PG_LEVEL_4K in the iterator itself is unnecessary.
           */
          tdp_mmu_zap_root_level(..., PG_LEVEL_1G);
          tdp_mmu_zap_root_level(..., root->role.level);
  }

Or just go ahead and zap each level from 4K up to root->role.level as I
mentioned above.
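
e.g. (again just a sketch, reusing the hypothetical tdp_mmu_zap_root_level()
helper from above):

  static void tdp_mmu_zap_root(...)
  {
          int level;

          /*
           * Bottom-up: by the time a given level is reached, everything
           * below it has already been zapped, so no single zap tears down
           * more than one (already emptied) page table.
           */
          for (level = PG_LEVEL_4K; level <= root->role.level; level++)
                  tdp_mmu_zap_root_level(..., level);
  }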

> +
>  	rcu_read_unlock();
>  }
>  
> -- 
> 2.34.1.448.ga2b2bfdf31-goog
>
Sean Christopherson Jan. 5, 2022, 4:33 p.m. UTC | #2
On Wed, Jan 05, 2022, David Matlack wrote:
> On Thu, Dec 23, 2021 at 10:23:13PM +0000, Sean Christopherson wrote:
> > Zapping at 1gb in the first pass is not arbitrary.  First and foremost,
> > KVM relies on being able to zap 1gb shadow pages in a single shot when
> > replacing a shadow page with a hugepage.
> 
> When dirty logging is disabled, zap_collapsible_spte_range() does the
> bulk of the work zapping leaf SPTEs and allows yielding. I guess that
> could race with a vCPU faulting in the huge page though and the vCPU
> could do the bulk of the work.
> 
> Are there any other scenarios where KVM relies on zapping 1GB worth of
> 4KB SPTEs without yielding?

Yes.  Zapping executable shadow pages that were forced to be small because of the
iTLB multihit mitigation.  If the VM is using nested EPT and a shadow page is
unaccounted, decrementing disallow_lpage could allow a hugepage, and a fault in
the 1gb region that installs a 1gb hugepage would then zap the shadow page.

There are other scenarios, though they are much more contrived, e.g. if the guest
changes its MTRRs such that a previously disallowed hugepage is now allowed.

> In any case, 100ms is a long time to hog the CPU. Why not just do the
> safe thing and zap each level? 4K, then 2M, then 1GB, ..., then root
> level. The only argument against it I can think of is performance (lots
> of redundant walks through the page table). But I don't think root
> zapping is especially latency critical.

I'm not opposed to that approach, assuming putting a root is done asynchronously
so that the high latency isn't problematic for vCPUs.  Though from a test coverage
perspective, I do like zapping at the worst case level (for the above flows).

And regarding the latency, if it's problematic we could track the number of present
SPTEs and skip the walk if there are none.  The downside is that doing so would
require an atomic operation when installing SPTEs to play nice with parallel page
faults.
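
Roughly (purely illustrative, hypothetical field and helper names):

  /*
   * Hypothetical per-root count of present SPTEs; keeping it accurate is
   * what requires the atomic increment in the parallel page fault path.
   */
  static bool tdp_mmu_root_is_empty(struct kvm_mmu_page *root)
  {
          return !atomic64_read(&root->tdp_mmu_nr_present_sptes);
  }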

> > @@ -846,6 +858,11 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> >  		}
> >  	}
> >  
> > +	if (zap_level < root->role.level) {
> > +		zap_level = root->role.level;
> > +		goto start;
> > +	}
> 
> This is probably just personal opinion but I find the 2-iteration goto
> loop harder to understand than just open-coding the 2 passes.

Yeah, but it's clever!  I'll add another helper unless it turns out gross for
some reason. :-)

Patch

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index aec97e037a8d..2e28f5e4b761 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -809,6 +809,18 @@  static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
 	gfn_t end = tdp_mmu_max_gfn_host();
 	gfn_t start = 0;
 
+	/*
+	 * To avoid RCU stalls due to recursively removing huge swaths of SPs,
+	 * split the zap into two passes.  On the first pass, zap at the 1gb
+	 * level, and then zap top-level SPs on the second pass.  "1gb" is not
+	 * arbitrary, as KVM must be able to zap a 1gb shadow page without
+	 * inducing a stall to allow in-place replacement with a 1gb hugepage.
+	 *
+	 * Because zapping a SP recurses on its children, stepping down to
+	 * PG_LEVEL_4K in the iterator itself is unnecessary.
+	 */
+	int zap_level = PG_LEVEL_1G;
+
 	/*
 	 * The root must have an elevated refcount so that it's reachable via
 	 * mmu_notifier callbacks, which allows this path to yield and drop
@@ -825,12 +837,9 @@  static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
 
 	rcu_read_lock();
 
-	/*
-	 * No need to try to step down in the iterator when zapping an entire
-	 * root, zapping an upper-level SPTE will recurse on its children.
-	 */
+start:
 	for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
-				   root->role.level, start, end) {
+				   zap_level, start, end) {
 retry:
 		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
 			continue;
@@ -838,6 +847,9 @@  static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
 		if (!is_shadow_present_pte(iter.old_spte))
 			continue;
 
+		if (iter.level > zap_level)
+			continue;
+
 		if (!shared) {
 			tdp_mmu_set_spte(kvm, &iter, 0);
 		} else if (!tdp_mmu_set_spte_atomic(kvm, &iter, 0)) {
@@ -846,6 +858,11 @@  static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
 		}
 	}
 
+	if (zap_level < root->role.level) {
+		zap_level = root->role.level;
+		goto start;
+	}
+
 	rcu_read_unlock();
 }