[1/7] Revert "KVM: x86/mmu: Don't bottom out on leafs when zapping collapsible SPTEs"

Message ID 20240805233114.4060019-2-dmatlack@google.com (mailing list archive)
State New, archived
Headers show
Series KVM: x86/mmu: Optimize TDP MMU huge page recovery during disable-dirty-log

Commit Message

David Matlack Aug. 5, 2024, 11:31 p.m. UTC
This reverts commit 85f44f8cc07b5f61bef30fe5343d629fd4263230.

Bring back the logic that walks down to leafs when zapping collapsible
SPTEs. Stepping down to leafs is technically unnecessary when zapping,
but the leaf SPTE will be used in a subsequent commit to construct a
huge SPTE and recover the huge mapping in place.

Note, this revert does not revert the function comment changes above
zap_collapsible_spte_range() and kvm_tdp_mmu_zap_collapsible_sptes()
since those are still relevant.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_iter.c |  9 +++++++
 arch/x86/kvm/mmu/tdp_iter.h |  1 +
 arch/x86/kvm/mmu/tdp_mmu.c  | 47 ++++++++++++++++++-------------------
 3 files changed, 33 insertions(+), 24 deletions(-)
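
For context, a rough sketch of how a later patch in the series might use the
leaf SPTE that the restored step-down finds; the flow and the make_huge_spte()
helper are assumptions for illustration only, not part of this revert:

	u64 huge_spte, leaf_spte = iter.old_spte;	/* present leaf found below */

	/* Step back up to the level kvm_mmu_max_mapping_level() allows. */
	while (max_mapping_level > iter.level)
		tdp_iter_step_up(&iter);

	/* Build a huge SPTE from the leaf and install it in place. */
	huge_spte = make_huge_spte(kvm, leaf_spte, iter.level);	/* assumed helper */
	if (tdp_mmu_set_spte_atomic(kvm, &iter, huge_spte))
		continue;	/* raced with a vCPU; the iterator re-walks the subtree */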

Comments

Sean Christopherson Aug. 16, 2024, 10:47 p.m. UTC | #1
On Mon, Aug 05, 2024, David Matlack wrote:
> This reverts commit 85f44f8cc07b5f61bef30fe5343d629fd4263230.
> 
> Bring back the logic that walks down to leafs when zapping collapsible
> SPTEs. Stepping down to leafs is technically unnecessary when zapping,
> but the leaf SPTE will be used in a subsequent commit to construct a
> huge SPTE and recover the huge mapping in place.

Please no, getting rid of the step-up code made me so happy. :-D

It's also suboptimal, e.g. in the worst case scenario (which is comically
unlikely, but theoretically possible), if the first present leaf SPTE in a 1GiB
region is the last SPTE in the last 2MiB range, and all non-leaf SPTEs are
somehow present (this is the super unlikely part), then KVM will read 512*512
SPTEs before encountering a shadow-present leaf SPTE.
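
(For concreteness: a 1GiB region spans 512 2MiB ranges and each 2MiB range is
backed by a page table of 512 4KiB SPTEs, so that worst case is 512 * 512 =
262,144 SPTE reads before the first shadow-present leaf turns up.)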

The proposed approach will also ignore nx_huge_page_disallowed, and just always
create a huge NX page.  On the plus side, "free" NX hugepage recovery!  The
downside is that it means KVM is pretty much guaranteed to force the guest to
re-fault all of its code pages, and zap a non-trivial number of huge pages that
were just created.  IIRC, we deliberately did that for the zapping case, e.g. to
use the opportunity to recover NX huge pages, but zap+create+zap+create is a bit
different than zap+create (if the guest is still using the region for code).

So rather than looking for a present leaf SPTE, what about "stopping" as soon as
KVM finds a SP that can be replaced with a huge SPTE, pre-checking
nx_huge_page_disallowed, and invoking kvm_mmu_do_page_fault() to install a new
SPTE?  Or maybe even use kvm_tdp_map_page()?  Though it might be more work to
massage kvm_tdp_map_page() into a usable form.
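
Roughly something like the below (hand-wavy sketch, not tested; the
tdp_mmu_map_huge_gfn() call at the end is a made-up stand-in for the
kvm_mmu_do_page_fault()/kvm_tdp_map_page() piece, both of which currently
want a vCPU):

	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
			continue;

		/* Stop at a shadow-present non-leaf SPTE, i.e. at a child SP. */
		if (!is_shadow_present_pte(iter.old_spte) ||
		    is_last_spte(iter.old_spte, iter.level))
			continue;

		/* The SPTE's range must be fully contained by the slot. */
		if (iter.gfn < start || iter.gfn >= end)
			continue;

		if (kvm_mmu_max_mapping_level(kvm, slot, iter.gfn, PG_LEVEL_NUM) <
		    iter.level)
			continue;

		/* Pre-check the NX huge page mitigation. */
		if (spte_to_child_sp(iter.old_spte)->nx_huge_page_disallowed)
			continue;

		if (tdp_mmu_zap_spte_atomic(kvm, &iter))
			continue;	/* raced with a vCPU; a real version would retry */

		tdp_mmu_map_huge_gfn(kvm, slot, iter.gfn, iter.level);	/* hypothetical */
	}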

By virtue of there being a present non-leaf SPTE, KVM knows the guest accessed
the region at some point.  And now that MTRR virtualization is gone, the only time
KVM zaps _only_ leafs is for mmu_notifiers and the APIC access page, i.e. the
odds of stepping down and NOT finding a present SPTE somewhere in the region
are very small.  Lastly, kvm_mmu_max_mapping_level() has verified there is a
valid mapping in the primary MMU, else the max level would be PG_LEVEL_4K.  So
the odds of
getting a false positive and effectively pre-faulting memory the guest isn't using
are quite small.
David Matlack Aug. 16, 2024, 11:35 p.m. UTC | #2
On Fri, Aug 16, 2024 at 3:47 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Aug 05, 2024, David Matlack wrote:
> > This reverts commit 85f44f8cc07b5f61bef30fe5343d629fd4263230.
> >
> > Bring back the logic that walks down to leafs when zapping collapsible
> > SPTEs. Stepping down to leafs is technically unnecessary when zapping,
> > but the leaf SPTE will be used in a subsequent commit to construct a
> > huge SPTE and recover the huge mapping in place.
>
> Please no, getting rid of the step-up code made me so happy. :-D
>
> It's also suboptimal, e.g. in the worst case scenario (which is comically
> unlikely, but theoretically possible), if the first present leaf SPTE in a 1GiB
> region is the last SPTE in the last 2MiB range, and all non-leaf SPTEs are
> somehow present (this is the super unlikely part), then KVM will read 512*512
> SPTEs before encountering a shadow-present leaf SPTE.

Haha, I didn't consider that case. That seems extremely unlikely. And
even if it did happen, KVM is holding mmu_lock for read and is able to
drop the lock and yield. So I don't see a strong need to structure the
approach around that edge case.

>
> The proposed approach will also ignore nx_huge_page_disallowed, and just always
> create a huge NX page.  On the plus side, "free" NX hugepage recovery!  The
> downside is that it means KVM is pretty much guaranteed to force the guest to
> re-fault all of its code pages, and zap a non-trivial number of huge pages that
> were just created.  IIRC, we deliberately did that for the zapping case, e.g. to
> use the opportunity to recover NX huge pages, but zap+create+zap+create is a bit
> different than zap+create (if the guest is still using the region for code).

I'm ok with skipping nx_huge_page_disallowed pages during disable-dirty-log.

But why is recovering in-place worse/different than zapping? They
both incur the same TLB flushes. And recovering might even result in
fewer vCPU faults, since vCPU faults use tdp_mmu_split_huge_page() to
install a fully populated lower level page table (vs faulting on a
zapped mapping, which will just install a 4K SPTE).

>
> So rather than looking for a present leaf SPTE, what about "stopping" as soon as
> KVM finds a SP that can be replaced with a huge SPTE, pre-checking
> nx_huge_page_disallowed, and invoking kvm_mmu_do_page_fault() to install a new
> SPTE?  Or maybe even use kvm_tdp_map_page()?  Though it might be more work to
> massage kvm_tdp_map_page() into a usable form.

My intuition is that going through the full page fault flow would be
more expensive than just stepping down+up in 99.99999% of cases. And
will require more code churn. So it doesn't seem like a win?

>
> By virtue of there being a present non-leaf SPTE, KVM knows the guest accessed
> the region at some point.  And now that MTRR virtualization is gone, the only time
> KVM zaps _only_ leafs is for mmu_notifiers and the APIC access page, i.e. the
> odds of stepping down and NOT finding a present SPTE somewhere in the region
> are very small.  Lastly, kvm_mmu_max_mapping_level() has verified there is a
> valid mapping in the primary MMU, else the max level would be PG_LEVEL_4K.  So
> the odds of
> getting a false positive and effectively pre-faulting memory the guest isn't using
> are quite small.
Sean Christopherson Aug. 22, 2024, 3:10 p.m. UTC | #3
On Fri, Aug 16, 2024, David Matlack wrote:
> On Fri, Aug 16, 2024 at 3:47 PM Sean Christopherson <seanjc@google.com> wrote:
> > The proposed approach will also ignore nx_huge_page_disallowed, and just always
> > create a huge NX page.  On the plus side, "free" NX hugepage recovery!  The
> > downside is that it means KVM is pretty much guaranteed to force the guest to
> > re-fault all of its code pages, and zap a non-trivial number of huge pages that
> > were just created.  IIRC, we deliberately did that for the zapping case, e.g. to
> > use the opportunity to recover NX huge pages, but zap+create+zap+create is a bit
> > different than zap+create (if the guest is still using the region for code).
> 
> I'm ok with skipping nx_huge_page_disallowed pages during disable-dirty-log.
> 
> But why is recovering in-place worse/different than zapping? They
> both incur the same TLB flushes. And recovering might even result in
> fewer vCPU faults, since vCPU faults use tdp_mmu_split_huge_page() to
> install a fully populated lower level page table (vs faulting on a
> zapped mapping, which will just install a 4K SPTE).

Doh, never mind, I was thinking zapping collapsible SPTEs zapped leafs to induce
faults, but it does the opposite and zaps at the level KVM thinks can be huge.

> > So rather than looking for a present leaf SPTE, what about "stopping" as soon as
> KVM finds a SP that can be replaced with a huge SPTE, pre-checking
> > nx_huge_page_disallowed, and invoking kvm_mmu_do_page_fault() to install a new
> > SPTE?  Or maybe even use kvm_tdp_map_page()?  Though it might be more work to
> > massage kvm_tdp_map_page() into a usable form.
> 
> My intuition is that going through the full page fault flow would be
> more expensive than just stepping down+up in 99.99999% of cases.

Hmm, yeah, doing the full fault flow isn't the right place to hook in.  Ugh, right,
and it's the whole problem with not having a vCPU for tdp_mmu_map_handle_target_level().
But that's solvable as it's really just is_rsvd_spte(), which I would be a-ok
skipping.  Ah, no, host_writable is also problematic.  Blech.

That's solvable too, e.g. host_pfn_mapping_level() could get the protection, but
that would require checking for an ongoing mmu_notifier invalidation.  So again,
probably not worth it.  Double blech.

> And will require more code churn.

I'm not terribly concerned with code churn.  I care much more about long-term
maintenance, and especially about having multiple ways of doing the same thing
(installing a shadow-present leaf SPTE).  But I agree that trying to remedy that
last point (similar flows) is probably a fool's errand in this case, as creating
a new SPTE from scratch really is different than up-leveling an existing SPTE.

I still have concerns about the step-up code, but I'll respond to those in the
context of the patch I think is problematic.

Patch

diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index 04c247bfe318..1279babbc72c 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -142,6 +142,15 @@  static bool try_step_up(struct tdp_iter *iter)
 	return true;
 }
 
+/*
+ * Step the iterator back up a level in the paging structure. Should only be
+ * used when the iterator is below the root level.
+ */
+void tdp_iter_step_up(struct tdp_iter *iter)
+{
+	WARN_ON(!try_step_up(iter));
+}
+
 /*
  * Step to the next SPTE in a pre-order traversal of the paging structure.
  * To get to the next SPTE, the iterator either steps down towards the goal
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 2880fd392e0c..821fde2ac7b0 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -136,5 +136,6 @@  void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
 		    int min_level, gfn_t next_last_level_gfn);
 void tdp_iter_next(struct tdp_iter *iter);
 void tdp_iter_restart(struct tdp_iter *iter);
+void tdp_iter_step_up(struct tdp_iter *iter);
 
 #endif /* __KVM_X86_MMU_TDP_ITER_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c7dc49ee7388..ebe2ab3686c7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1628,49 +1628,48 @@  static void zap_collapsible_spte_range(struct kvm *kvm,
 
 	rcu_read_lock();
 
-	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
-retry:
+	tdp_root_for_each_pte(iter, root, start, end) {
 		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
 			continue;
 
-		if (iter.level > KVM_MAX_HUGEPAGE_LEVEL ||
-		    !is_shadow_present_pte(iter.old_spte))
+		if (!is_shadow_present_pte(iter.old_spte) ||
+		    !is_last_spte(iter.old_spte, iter.level))
 			continue;
 
+		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
+							      iter.gfn, PG_LEVEL_NUM);
+
+		WARN_ON(max_mapping_level < iter.level);
+
 		/*
-		 * Don't zap leaf SPTEs, if a leaf SPTE could be replaced with
-		 * a large page size, then its parent would have been zapped
-		 * instead of stepping down.
+		 * If this page is already mapped at the highest
+		 * viable level, there's nothing more to do.
 		 */
-		if (is_last_spte(iter.old_spte, iter.level))
+		if (max_mapping_level == iter.level)
 			continue;
 
 		/*
-		 * If iter.gfn resides outside of the slot, i.e. the page for
-		 * the current level overlaps but is not contained by the slot,
-		 * then the SPTE can't be made huge.  More importantly, trying
-		 * to query that info from slot->arch.lpage_info will cause an
-		 * out-of-bounds access.
+		 * The page can be remapped at a higher level, so step
+		 * up to zap the parent SPTE.
 		 */
-		if (iter.gfn < start || iter.gfn >= end)
-			continue;
-
-		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
-							      iter.gfn, PG_LEVEL_NUM);
-		if (max_mapping_level < iter.level)
-			continue;
+		while (max_mapping_level > iter.level)
+			tdp_iter_step_up(&iter);
 
 		/* Note, a successful atomic zap also does a remote TLB flush. */
-		if (tdp_mmu_zap_spte_atomic(kvm, &iter))
-			goto retry;
+		(void)tdp_mmu_zap_spte_atomic(kvm, &iter);
+
+		/*
+		 * If the atomic zap fails, the iter will recurse back into
+		 * the same subtree to retry.
+		 */
 	}
 
 	rcu_read_unlock();
 }
 
 /*
- * Zap non-leaf SPTEs (and free their associated page tables) which could
- * be replaced by huge pages, for GFNs within the slot.
+ * Zap non-leaf SPTEs (and free their associated page tables) which could be
+ * replaced by huge pages, for GFNs within the slot.
  */
 void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				       const struct kvm_memory_slot *slot)