
[7/7] KVM: x86/mmu: Recheck SPTE points to a PT during huge page recovery

Message ID 20240805233114.4060019-8-dmatlack@google.com (mailing list archive)
State New, archived
Series KVM: x86/mmu: Optimize TDP MMU huge page recovery during disable-dirty-log

Commit Message

David Matlack Aug. 5, 2024, 11:31 p.m. UTC
Recheck that iter.old_spte still points to a page table when recovering
huge pages. Since mmu_lock is held for read and tdp_iter_step_up()
re-reads iter.sptep, it's possible the SPTE was zapped or recovered by
another CPU in between stepping down and back up.

This avoids a useless cmpxchg (and possibly a remote TLB flush) if
another CPU is recovering huge SPTEs in parallel (e.g. the NX huge page
recovery worker, or vCPUs taking faults on the huge page region).

This also makes it clear that tdp_iter_step_up() re-reads the SPTE and
thus can see a different value, which is not immediately obvious when
reading the code.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

Comments

Sean Christopherson Aug. 22, 2024, 4:50 p.m. UTC | #1
On Mon, Aug 05, 2024, David Matlack wrote:
> Recheck that iter.old_spte still points to a page table when recovering
> huge pages. Since mmu_lock is held for read and tdp_iter_step_up()
> re-reads iter.sptep, it's possible the SPTE was zapped or recovered by
> another CPU in between stepping down and back up.
> 
> This avoids a useless cmpxchg (and possibly a remote TLB flush) if
> another CPU is recovering huge SPTEs in parallel (e.g. the NX huge page
> recovery worker, or vCPUs taking faults on the huge page region).
> 
> This also makes it clear that tdp_iter_step_up() re-reads the SPTE and
> thus can see a different value, which is not immediately obvious when
> reading the code.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 07d5363c9db7..bdc7fd476721 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1619,6 +1619,17 @@ static void recover_huge_pages_range(struct kvm *kvm,
>  		while (max_mapping_level > iter.level)
>  			tdp_iter_step_up(&iter);
>  
> +		/*
> +		 * Re-check that iter.old_spte still points to a page table.
> +		 * Since mmu_lock is held for read and tdp_iter_step_up()
> +		 * re-reads iter.sptep, it's possible the SPTE was zapped or
> +		 * recovered by another CPU in between stepping down and
> +		 * stepping back up.
> +		 */
> +		if (!is_shadow_present_pte(iter.old_spte) ||
> +		    is_last_spte(iter.old_spte, iter.level))
> +			continue;

This is the part of the step-up logic that I do not like.  Even this check doesn't
guarantee that the SPTE that is being replaced is the same non-leaf SPTE that was
used to reach the leaf SPTE.  E.g. in an absurdly theoretical situation, the SPTE
could be zapped and then re-set with another non-leaf SPTE.  Which is fine, but
only because of several very subtle mechanisms.

kvm_mmu_max_mapping_level() ensures that there are no write-tracked gfns anywhere
in the huge range, so it's safe to propagate any and all WRITABLE bits.  This
requires knowing/remembering that KVM disallows huge pages when a gfn is write-
tracked, and relies on that never changing (which is a fairly safe bet, but the
behavior isn't fully set in stone).

And the presence of a shadow-present leaf SPTE ensures there are no in-flight
mmu_notifier invalidations, as kvm_mmu_notifier_invalidate_range_start() won't
return until all relevant leaf SPTEs have been zapped.

And even more subtly, recover_huge_pages_range() can install a huge SPTE while
tdp_mmu_zap_leafs() is running, e.g. if tdp_mmu_zap_leafs() is processing 4KiB
SPTEs because the greater 2MiB page is being unmapped by the primary MMU, and
tdp_mmu_zap_leafs() yields.   That's again safe only because upon regaining
control, tdp_mmu_zap_leafs() will restart at the root and thus observe and zap
the new huge SPTE.

So while I'm pretty sure this logic is functionally ok, its safety is extremely
dependent on a number of behaviors in KVM.

That said, I can't tell which option I dislike less.  E.g. we could do something
like this, where kvm_mmu_name_tbd() grabs the pfn+writable information from the
primary MMU's PTE/PMD/PUD.  Ideally, KVM would use GUP, but then KVM wouldn't
be able to create huge SPTEs for non-GUP-able memory, e.g. PFNMAP'd memory.

I don't love this either, primarily because not using GUP makes this yet another
custom flow, i.e. isn't any less tricky than reusing a child SPTE.  It does have
the advantage of not having to find a shadow-present child though, i.e. is likely
the most performant option.  I agree that that likely doesn't matter in practice,
especially since the raw latency of disabling dirty logging isn't terribly
important.

	rcu_read_lock();

	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
retry:
		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
			continue;

		if (iter.level > KVM_MAX_HUGEPAGE_LEVEL ||
		    !is_shadow_present_pte(iter.old_spte))
			continue;

		/* 
		 * TODO: this should skip to the end of the parent, because if
		 * the first SPTE can't be made huger, then no SPTEs at this
		 * level can be huger.
		 */
		if (is_last_spte(iter.old_spte, iter.level))
			continue;

		/*
		 * If iter.gfn resides outside of the slot, i.e. the page for
		 * the current level overlaps but is not contained by the slot,
		 * then the SPTE can't be made huge.  More importantly, trying
		 * to query that info from slot->arch.lpage_info will cause an
		 * out-of-bounds access.
		 */
		if (iter.gfn < start || iter.gfn >= end)
			continue;

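		/*
		 * Name TBD: grab the pfn, writability, and max mapping level
		 * directly from the primary MMU's PTE/PMD/PUD, without GUP,
		 * so that non-GUP-able (e.g. PFNMAP'd) memory still works.
		 */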
		if (kvm_mmu_name_tbd(kvm, slot, iter.gfn, &pfn, &writable,
				     &max_mapping_level))
			continue;

		/*
		 * If the range is being invalidated, do not install a SPTE as
		 * KVM may have already zapped this specific gfn, e.g. if KVM's
		 * unmapping has run to completion, but the primary MMU hasn't
		 * zapped its PTEs.  There is no need to check for *past*
		 * invalidations, because all information is gathered while
		 * holding mmu_lock, i.e. it can't have become stale due to an
		 * entire mmu_notifier invalidation sequence completing.
		 */
		if (mmu_invalidate_retry_gfn(kvm, kvm->mmu_invalidate_seq, iter.gfn))
			continue;

		/*
		 * KVM disallows huge pages for write-protected gfns, so it
		 * should be impossible for make_spte() to encounter such a gfn
		 * since write-protecting a gfn requires holding mmu_lock for
		 * write.
		 */
		flush |= __tdp_mmu_map_gfn(...);
		WARN_ON_ONCE(r == RET_PF_EMULATE);
	}

	rcu_read_unlock();

Assuming you don't like the above idea (I'm not sure I do), what if instead of
doing the step-up, KVM starts a second iteration using the shadow page it wants
to replace as the root of the walk?

This has the same subtle dependencies on kvm_mmu_max_mapping_level() and the
ordering with respect to an mmu_notifier invalidation, but it at least avoids
having to reason about the correctness of re-reading a SPTE and modifying the
iteration level within the body of an iteration loop.

It should also yield smaller diffs overall, e.g. no revert, no separate commit
to recheck the SPTE, etc.  And I believe it's more performant than the step-up
approach when there are SPTEs that _can't_ be huge, as KVM won't traverse down
into the leafs in that case.

An alternative to the tdp_mmu_iter_needs_resched() logic would be to pass in
&flush, but I think that ends up being more confusing and harder to follow.

static int tdp_mmu_make_huge_spte(struct kvm *kvm, struct tdp_iter *parent,
				  u64 *huge_spte)
{
	struct kvm_mmu_page *root = sptep_to_sp(parent->sptep);
	gfn_t start = parent->gfn;
	gfn_t end = start + ???; /* use parent's level */
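	/* presumably start + KVM_PAGES_PER_HPAGE(parent->level) */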
	struct tdp_iter iter;

	tdp_root_for_each_leaf_pte(iter, root, start, end) 	{
		/*
		 * Use the parent iterator when checking for forward progress,
		 * so that KVM doesn't get stuck due to always yielding while
		 * walking child SPTEs.
		 */
		if (tdp_mmu_iter_needs_resched(kvm, parent))
			return -EAGAIN;

		*huge_spte = make_huge_spte(kvm, iter.old_spte);
		return 0;
	}

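	/* No shadow-present leaf SPTE was found in the parent's range. */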
	return -ENOENT;
}

static void recover_huge_pages_range(struct kvm *kvm,
				     struct kvm_mmu_page *root,
				     const struct kvm_memory_slot *slot)
{
	gfn_t start = slot->base_gfn;
	gfn_t end = start + slot->npages;
	struct tdp_iter iter;
	int max_mapping_level;
	bool flush = false;
	u64 huge_spte;

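	/*
	 * Installing huge SPTEs would defeat dirty logging, which needs to
	 * track writes at 4KiB granularity.
	 */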
	if (WARN_ON_ONCE(kvm_slot_dirty_track_enabled(slot)))
		return;

	rcu_read_lock();

	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
restart:
		if (tdp_mmu_iter_cond_resched(kvm, &iter, flush, true)) {
			flush = false;
			continue;
		}

		if (iter.level > KVM_MAX_HUGEPAGE_LEVEL ||
		    !is_shadow_present_pte(iter.old_spte))
			continue;

		/*
		 * If iter.gfn resides outside of the slot, i.e. the page for
		 * the current level overlaps but is not contained by the slot,
		 * then the SPTE can't be made huge.  More importantly, trying
		 * to query that info from slot->arch.lpage_info will cause an
		 * out-of-bounds access.
		 */
		if (iter.gfn < start || iter.gfn >= end)
			continue;

		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot, iter.gfn);
		if (max_mapping_level < iter.level)
			continue;

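		/*
		 * -EAGAIN means the walk over the child SPTEs needs to yield;
		 * jump back so tdp_mmu_iter_cond_resched() can do so.  Any
		 * other error means no shadow-present leaf SPTE was found,
		 * i.e. there is nothing to propagate into a huge SPTE.
		 */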
		r = tdp_mmu_make_huge_spte(kvm, &iter, &huge_spte);
		if (r == -EAGAIN)
			goto restart;
		else if (r)
			continue;

		/*
		 * If setting the SPTE fails, e.g. because it has been modified
		 * by a different task, iteration will naturally continue with
		 * the next SPTE.  Don't bother retrying this SPTE, races are
		 * uncommon and odds are good the SPTE can no longer be made
		 * huge anyway.
		 */
		if (!tdp_mmu_set_spte_atomic(kvm, &iter, huge_spte))
			flush = true;
	}

	if (flush)
		kvm_flush_remote_tlbs_memslot(kvm, slot);

	rcu_read_unlock();
}

static inline bool tdp_mmu_iter_needs_resched(struct kvm *kvm,
					      struct tdp_iter *iter)
{
	/* Ensure forward progress has been made before yielding. */
	return iter->next_last_level_gfn != iter->yielded_gfn &&
	       (need_resched() || rwlock_needbreak(&kvm->mmu_lock));

}

/*
 * Yield if the MMU lock is contended or this thread needs to return control
 * to the scheduler.
 *
 * If this function should yield and flush is set, it will perform a remote
 * TLB flush before yielding.
 *
 * If this function yields, iter->yielded is set and the caller must skip to
 * the next iteration, where tdp_iter_next() will reset the tdp_iter's walk
 * over the paging structures to allow the iterator to continue its traversal
 * from the paging structure root.
 *
 * Returns true if this function yielded.
 */
static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
							  struct tdp_iter *iter,
							  bool flush, bool shared)
{
	WARN_ON_ONCE(iter->yielded);

	if (!tdp_mmu_iter_needs_resched(kvm, iter))
		return false;

	if (flush)
		kvm_flush_remote_tlbs(kvm);

	rcu_read_unlock();

	if (shared)
		cond_resched_rwlock_read(&kvm->mmu_lock);
	else
		cond_resched_rwlock_write(&kvm->mmu_lock);

	rcu_read_lock();

	WARN_ON_ONCE(iter->gfn > iter->next_last_level_gfn);

	iter->yielded = true;
	return true;
}
David Matlack Aug. 22, 2024, 6:35 p.m. UTC | #2
On Thu, Aug 22, 2024 at 9:50 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Aug 05, 2024, David Matlack wrote:
> > Recheck that iter.old_spte still points to a page table when recovering
> > huge pages. Since mmu_lock is held for read and tdp_iter_step_up()
> > re-reads iter.sptep, it's possible the SPTE was zapped or recovered by
> > another CPU in between stepping down and back up.
> >
> > This avoids a useless cmpxchg (and possibly a remote TLB flush) if
> > another CPU is recovering huge SPTEs in parallel (e.g. the NX huge page
> > recovery worker, or vCPUs taking faults on the huge page region).
> >
> > This also makes it clear that tdp_iter_step_up() re-reads the SPTE and
> > thus can see a different value, which is not immediately obvious when
> > reading the code.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/tdp_mmu.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 07d5363c9db7..bdc7fd476721 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1619,6 +1619,17 @@ static void recover_huge_pages_range(struct kvm *kvm,
> >               while (max_mapping_level > iter.level)
> >                       tdp_iter_step_up(&iter);
> >
> > +             /*
> > +              * Re-check that iter.old_spte still points to a page table.
> > +              * Since mmu_lock is held for read and tdp_iter_step_up()
> > +              * re-reads iter.sptep, it's possible the SPTE was zapped or
> > +              * recovered by another CPU in between stepping down and
> > +              * stepping back up.
> > +              */
> > +             if (!is_shadow_present_pte(iter.old_spte) ||
> > +                 is_last_spte(iter.old_spte, iter.level))
> > +                     continue;
>
> This is the part of the step-up logic that I do not like.  Even this check doesn't
> guarantee that the SPTE that is being replaced is the same non-leaf SPTE that was
> used to reach the leaf SPTE.  E.g. in an absurdly theoretical situation, the SPTE
> could be zapped and then re-set with another non-leaf SPTE.  Which is fine, but
> only because of several very subtle mechanisms.

I'm not sure why that matters. The only thing that matters is that the
GFN->PFN and permissions cannot change, and that is guaranteed by
holding mmu_lock for read.

At the end of the day, we never actually care about the value of the
SPTE we are replacing. We only care that it's a non-leaf SPTE.

>
> kvm_mmu_max_mapping_level() ensures that there are no write-tracked gfns anywhere
> in the huge range, so it's safe to propagate any and all WRITABLE bits.  This
> requires knowing/remembering that KVM disallows huge pages when a gfn is write-
> tracked, and relies on that never changing (which is a fairly safe bet, but the
> behavior isn't fully set in stone).
>
> And the presence of a shadow-present leaf SPTE ensures there are no in-flight
> mmu_notifier invalidations, as kvm_mmu_notifier_invalidate_range_start() won't
> return until all relevant leaf SPTEs have been zapped.

As you point out in the next paragraph there could be an inflight
invalidate that yielded, no?

>
> And even more subtly, recover_huge_pages_range() can install a huge SPTE while
> tdp_mmu_zap_leafs() is running, e.g. if tdp_mmu_zap_leafs() is processing 4KiB
> SPTEs because the greater 2MiB page is being unmapped by the primary MMU, and
> tdp_mmu_zap_leafs() yields.   That's again safe only because upon regaining
> control, tdp_mmu_zap_leafs() will restart at the root and thus observe and zap
> the new huge SPTE.

I agree it's subtle, but only in the sense that the TDP MMU is subtle.
Restarting at the root after dropping mmu_lock is a fundamental
concept in the TDP MMU.

>
> So while I'm pretty sure this logic is functionally ok, its safety is extremely
> dependent on a number of behaviors in KVM.
>
> That said, I can't tell which option I dislike less.  E.g. we could do something
> like this, where kvm_mmu_name_tbd() grabs the pfn+writable information from the
> primary MMU's PTE/PMD/PUD.  Ideally, KVM would use GUP, but then KVM wouldn't
> be able to create huge SPTEs for non-GUP-able memory, e.g. PFNMAP'd memory.
>
> I don't love this either, primarily because not using GUP makes this yet another
> custom flow

Yeah. I don't like having the huge page recovery path needing its own
special way to construct SPTEs from scratch. e.g. I could see this
approach becoming a problem if KVM gains support for R/W/X GFN
attributes.

> i.e. isn't any less tricky than reusing a child SPTE.  It does have
> the advantage of not having to find a shadow-present child though, i.e. is likely
> the most performant option.  I agree that that likely doesn't matter in practice,
> especially since the raw latency of disabling dirty logging isn't terribly
> important.
>
>         rcu_read_lock();
>
>         for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
> retry:
>                 if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
>                         continue;
>
>                 if (iter.level > KVM_MAX_HUGEPAGE_LEVEL ||
>                     !is_shadow_present_pte(iter.old_spte))
>                         continue;
>
>                 /*
>                  * TODO: this should skip to the end of the parent, because if
>                  * the first SPTE can't be made huger, then no SPTEs at this
>                  * level can be huger.
>                  */
>                 if (is_last_spte(iter.old_spte, iter.level))
>                         continue;
>
>                 /*
>                  * If iter.gfn resides outside of the slot, i.e. the page for
>                  * the current level overlaps but is not contained by the slot,
>                  * then the SPTE can't be made huge.  More importantly, trying
>                  * to query that info from slot->arch.lpage_info will cause an
>                  * out-of-bounds access.
>                  */
>                 if (iter.gfn < start || iter.gfn >= end)
>                         continue;
>
>                 if (kvm_mmu_name_tbd(kvm, slot, iter.gfn, &pfn, &writable,
>                                      &max_mapping_level))
>                         continue;
>
>                 /*
>                  * If the range is being invalidated, do not install a SPTE as
>                  * KVM may have already zapped this specific gfn, e.g. if KVM's
>                  * unmapping has run to completion, but the primary MMU hasn't
>                  * zapped its PTEs.  There is no need to check for *past*
>                  * invalidations, because all information is gathered while
>                  * holding mmu_lock, i.e. it can't have become stale due to an
>                  * entire mmu_notifier invalidation sequence completing.
>                  */
>                 if (mmu_invalidate_retry_gfn(kvm, kvm->mmu_invalidate_seq, iter.gfn))
>                         continue;
>
>                 /*
>                  * KVM disallows huge pages for write-protected gfns, so it
>                  * should be impossible for make_spte() to encounter such a
>                  * gfn since write-protecting a gfn requires holding mmu_lock
>                  * for write.
>                  */
>                 flush |= __tdp_mmu_map_gfn(...);
>                 WARN_ON_ONCE(r == RET_PF_EMULATE);
>         }
>
>         rcu_read_unlock();
>
> Assuming you don't like the above idea (I'm not sure I do), what if instead of
> doing the step-up, KVM starts a second iteration using the shadow page it wants
> to replace as the root of the walk?
>
> This has the same subtle dependencies on kvm_mmu_max_mapping_level() and the
> ordering with respect to an mmu_notifier invalidation, but it at least avoids
> having to reason about the correctness of re-reading a SPTE and modifying the
> iteration level within the body of an iteration loop.
>
> It should also yield smaller diffs overall, e.g. no revert, no separate commit
> to recheck the SPTE, etc.  And I believe it's more performant than the step-up
> approach when there are SPTEs that _can't_ be huge, as KVM won't traverse down
> into the leafs in that case.

This approach looks good to me. I'll try it out and see if I run into
any issues.

>
> An alternative to the tdp_mmu_iter_needs_resched() logic would be to pass in
> &flush, but I think that ends up being more confusing and harder to follow.

Yeah I think that would be more complicated. If we drop mmu_lock then
we need to re-check kvm_mmu_max_mapping_level() and restart at the
root.

>
> static int tdp_mmu_make_huge_spte(struct kvm *kvm, struct tdp_iter *parent,
>                                   u64 *huge_spte)
> {
>         struct kvm_mmu_page *root = sptep_to_sp(parent->sptep);
>         gfn_t start = parent->gfn;
>         gfn_t end = start + ???; /* use parent's level */
>         struct tdp_iter iter;
>
>         tdp_root_for_each_leaf_pte(iter, root, start, end)      {
>                 /*
>                  * Use the parent iterator when checking for forward progress,
>                  * so that KVM doesn't get stuck due to always yielding while
>                  * walking child SPTEs.
>                  */
>                 if (tdp_mmu_iter_needs_resched(kvm, parent))
>                         return -EAGAIN;
>
>                 *huge_spte = make_huge_spte(kvm, iter.old_spte);
>                 return 0;
>         }
>
>         return -ENOENT;
> }
>
> static void recover_huge_pages_range(struct kvm *kvm,
>                                      struct kvm_mmu_page *root,
>                                      const struct kvm_memory_slot *slot)
> {
>         gfn_t start = slot->base_gfn;
>         gfn_t end = start + slot->npages;
>         struct tdp_iter iter;
>         int max_mapping_level;
>         bool flush = false;
>         u64 huge_spte;
>
>         if (WARN_ON_ONCE(kvm_slot_dirty_track_enabled(slot)))
>                 return;
>
>         rcu_read_lock();
>
>         for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
> restart:
>                 if (tdp_mmu_iter_cond_resched(kvm, &iter, flush, true)) {
>                         flush = false;
>                         continue;
>                 }
>
>                 if (iter.level > KVM_MAX_HUGEPAGE_LEVEL ||
>                     !is_shadow_present_pte(iter.old_spte))
>                         continue;
>
>                 /*
>                  * If iter.gfn resides outside of the slot, i.e. the page for
>                  * the current level overlaps but is not contained by the slot,
>                  * then the SPTE can't be made huge.  More importantly, trying
>                  * to query that info from slot->arch.lpage_info will cause an
>                  * out-of-bounds access.
>                  */
>                 if (iter.gfn < start || iter.gfn >= end)
>                         continue;
>
>                 max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot, iter.gfn);
>                 if (max_mapping_level < iter.level)
>                         continue;
>
>                 r = tdp_mmu_make_huge_spte(kvm, &iter, &huge_spte);
>                 if (r == -EAGAIN)
>                         goto restart;
>                 else if (r)
>                         continue;
>
>                 /*
>                  * If setting the SPTE fails, e.g. because it has been modified
>                  * by a different task, iteration will naturally continue with
>                  * the next SPTE.  Don't bother retrying this SPTE, races are
>                  * uncommon and odds are good the SPTE can no longer be made
>                  * huge anyway.
>                  */
>                 if (!tdp_mmu_set_spte_atomic(kvm, &iter, huge_spte))
>                         flush = true;
>         }
>
>         if (flush)
>                 kvm_flush_remote_tlbs_memslot(kvm, slot);
>
>         rcu_read_unlock();
> }
>
> static inline bool tdp_mmu_iter_needs_resched(struct kvm *kvm,
>                                               struct tdp_iter *iter)
> {
>         /* Ensure forward progress has been made before yielding. */
>         return iter->next_last_level_gfn != iter->yielded_gfn &&
>                (need_resched() || rwlock_needbreak(&kvm->mmu_lock));
>
> }
>
> /*
>  * Yield if the MMU lock is contended or this thread needs to return control
>  * to the scheduler.
>  *
>  * If this function should yield and flush is set, it will perform a remote
>  * TLB flush before yielding.
>  *
>  * If this function yields, iter->yielded is set and the caller must skip to
>  * the next iteration, where tdp_iter_next() will reset the tdp_iter's walk
>  * over the paging structures to allow the iterator to continue its traversal
>  * from the paging structure root.
>  *
>  * Returns true if this function yielded.
>  */
> static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
>                                                           struct tdp_iter *iter,
>                                                           bool flush, bool shared)
> {
>         WARN_ON_ONCE(iter->yielded);
>
>         if (!tdp_mmu_iter_needs_resched(kvm, iter))
>                 return false;
>
>         if (flush)
>                 kvm_flush_remote_tlbs(kvm);
>
>         rcu_read_unlock();
>
>         if (shared)
>                 cond_resched_rwlock_read(&kvm->mmu_lock);
>         else
>                 cond_resched_rwlock_write(&kvm->mmu_lock);
>
>         rcu_read_lock();
>
>         WARN_ON_ONCE(iter->gfn > iter->next_last_level_gfn);
>
>         iter->yielded = true;
>         return true;
> }
Sean Christopherson Aug. 22, 2024, 8:11 p.m. UTC | #3
On Thu, Aug 22, 2024, David Matlack wrote:
> On Thu, Aug 22, 2024 at 9:50 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Aug 05, 2024, David Matlack wrote:
> > > Recheck that iter.old_spte still points to a page table when recovering
> > > huge pages. Since mmu_lock is held for read and tdp_iter_step_up()
> > > re-reads iter.sptep, it's possible the SPTE was zapped or recovered by
> > > another CPU in between stepping down and back up.
> > >
> > > This avoids a useless cmpxchg (and possibly a remote TLB flush) if
> > > another CPU is recovering huge SPTEs in parallel (e.g. the NX huge page
> > > recovery worker, or vCPUs taking faults on the huge page region).
> > >
> > > This also makes it clear that tdp_iter_step_up() re-reads the SPTE and
> > > thus can see a different value, which is not immediately obvious when
> > > reading the code.
> > >
> > > Signed-off-by: David Matlack <dmatlack@google.com>
> > > ---
> > >  arch/x86/kvm/mmu/tdp_mmu.c | 11 +++++++++++
> > >  1 file changed, 11 insertions(+)
> > >
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index 07d5363c9db7..bdc7fd476721 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -1619,6 +1619,17 @@ static void recover_huge_pages_range(struct kvm *kvm,
> > >               while (max_mapping_level > iter.level)
> > >                       tdp_iter_step_up(&iter);
> > >
> > > +             /*
> > > +              * Re-check that iter.old_spte still points to a page table.
> > > +              * Since mmu_lock is held for read and tdp_iter_step_up()
> > > +              * re-reads iter.sptep, it's possible the SPTE was zapped or
> > > +              * recovered by another CPU in between stepping down and
> > > +              * stepping back up.
> > > +              */
> > > +             if (!is_shadow_present_pte(iter.old_spte) ||
> > > +                 is_last_spte(iter.old_spte, iter.level))
> > > +                     continue;
> >
> > This is the part of the step-up logic that I do not like.  Even this check doesn't
> > guarantee that the SPTE that is being replaced is the same non-leaf SPTE that was
> > used to reach the leaf SPTE.  E.g. in an absurdly theoretical situation, the SPTE
> > could be zapped and then re-set with another non-leaf SPTE.  Which is fine, but
> > only because of several very subtle mechanisms.
> 
> I'm not sure why that matters. The only thing that matters is that the
> GFN->PFN and permissions cannot change, and that is guaranteed by
> holding mmu_lock for read.

Because it introduces possible edge cases that, while benign, require the reader
to think about and reason through.  E.g. if _this_ task is trying to replace a
4KiB page with a 1GiB page, what happens if some other task replaces the parent
2MiB page?  It's not immediately obvious that looping on tdp_iter_step_up() will
happily blaze past a !PRESENT SPTE, which might not even be the actual SPTE in
the tree at this point.

> At the end of the day, we never actually care about the value of the
> SPTE we are replacing. We only care that it's a non-leaf SPTE.
>
> > kvm_mmu_max_mapping_level() ensures that there are no write-tracked gfns anywhere
> > in the huge range, so it's safe to propagate any and all WRITABLE bits.  This
> > requires knowing/remembering that KVM disallows huge pages when a gfn is write-
> > tracked, and relies on that never changing (which is a fairly safe bet, but the
> > behavior isn't fully set in stone).
> >
> > And the presence of a shadow-present leaf SPTE ensures there are no in-flight
> > mmu_notifier invalidations, as kvm_mmu_notifier_invalidate_range_start() won't
> > return until all relevant leaf SPTEs have been zapped.
> 
> As you point out in the next paragraph there could be an inflight invalidate
> that yielded, no?

Yeah, ignore this, I forgot to amend it after remembering that invalidation can
yield.

Patch

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 07d5363c9db7..bdc7fd476721 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1619,6 +1619,17 @@  static void recover_huge_pages_range(struct kvm *kvm,
 		while (max_mapping_level > iter.level)
 			tdp_iter_step_up(&iter);
 
+		/*
+		 * Re-check that iter.old_spte still points to a page table.
+		 * Since mmu_lock is held for read and tdp_iter_step_up()
+		 * re-reads iter.sptep, it's possible the SPTE was zapped or
+		 * recovered by another CPU in between stepping down and
+		 * stepping back up.
+		 */
+		if (!is_shadow_present_pte(iter.old_spte) ||
+		    is_last_spte(iter.old_spte, iter.level))
+			continue;
+
 		if (!tdp_mmu_set_spte_atomic(kvm, &iter, huge_spte))
 			flush = true;