KVM: x86/mmu: Allow per VM kvm_mmu_max_gfn()

Message ID	fdddad066c88c6cd8f2090f11e32e54f7d5c6178.1721092739.git.isaku.yamahata@intel.com (mailing list archive)
State	New, archived
Headers	show Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.14]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 44165139B; Tue, 16 Jul 2024 01:21:11 +0000 (UTC) From: isaku.yamahata@intel.com To: kvm@vger.kernel.org Cc: isaku.yamahata@intel.com, isaku.yamahata@gmail.com, Paolo Bonzini <pbonzini@redhat.com>, Sean Christopherson <seanjc@google.com>, rick.p.edgecombe@intel.com, linux-kernel@vger.kernel.org Subject: [PATCH] KVM: x86/mmu: Allow per VM kvm_mmu_max_gfn() Date: Mon, 15 Jul 2024 18:21:03 -0700 Message-ID: <fdddad066c88c6cd8f2090f11e32e54f7d5c6178.1721092739.git.isaku.yamahata@intel.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	KVM: x86/mmu: Allow per VM kvm_mmu_max_gfn() \| expand KVM: x86/mmu: Allow per VM kvm_mmu_max_gfn()

Isaku Yamahata July 16, 2024, 1:21 a.m. UTC

From: Isaku Yamahata <isaku.yamahata@intel.com>

Prepare for TDX support by making kvm_mmu_max_gfn() configurable.  Have
this preparation also be useful by non-TDX changes to improve correctness
associated with the combination of 4-level EPT and MAXPA > 48.  The issue
is analyzed at [1].

Like other confidential computing technologies, TDX has the concept of
private and shared memory.  For TDX, the private and shared mappings of the
same GFN are mapped at separate GPAs, with a configurable GPA bit selecting
between private and shared aliases of the same GFN.  This means that
operations working with the concept of a maximum GFN should only go up to
this configurable GPA bit instead of the existing behavior based on
shadow_phys_bits.  Other TDX changes will handle applying the operation
across both GFN aliases.

Using the existing kvm_mmu_max_gfn() based on shadow_phys_bits would cause
functional problems for TDX.  Specifically, because the function is used to
restrict the range where memslots can be created.  For TDX, if a memslot is
created at a GFN that includes the bit that selects between private/shared,
it would confuse the logic that zaps both aliases.  It would end up zapping
only the higher alias and missing the lower one.  In this case, freed pages
could remain mapped in the TDX guest.

Since this GPA bit is configurable per-VM, make kvm_mmu_max_gfn() per-VM by
having it take a struct kvm, and keep the max GFN as a member on that
struct.  Future TDX changes will set this member based on the configurable
position of the private/shared bit.

Besides functional issues, it is generally more correct and easier to
reason about in the context of solutions like TDX which have multiple
aliases for the same GFN.  For example, in other TDX changes the logic in
__tdp_mmu_zap_root() will made to iterate twice for each alias.  Using a
correct kvm_mmu_max_gfn() will avoid iterating over high GFN ranges that
could not have set PTEs.

It is worth noting, that existing VM types could get correctness/efficiency
benefits in the same way.  One case would be where a host MAXPA is above
48, but 4-level EPT is used.  In this case, KVM could skip iterating over
EPT that cannot be mapped.  In this case, the benefit of the per-VM value
is to cache the max GFN.  Another could be SEV, where it is unnecessary to
iterate over the C-bits position.  The benefit of the per-VM max GFN in
that case would be not all guests would have a C-bit.

No functional change intended.

Link: https://lore.kernel.org/kvm/20240522223413.GC212599@ls.amr.corp.intel.com/ [1]
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h | 2 ++
 arch/x86/kvm/mmu.h              | 7 ++++++-
 arch/x86/kvm/mmu/mmu.c          | 3 ++-
 arch/x86/kvm/mmu/tdp_mmu.c      | 8 ++++----
 arch/x86/kvm/x86.c              | 2 +-
 5 files changed, 15 insertions(+), 7 deletions(-)


base-commit: c8b8b8190a80b591aa73c27c70a668799f8db547

Edgecombe, Rick P July 16, 2024, 2:53 a.m. UTC | #1

On Mon, 2024-07-15 at 18:21 -0700, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Prepare for TDX support by making kvm_mmu_max_gfn() configurable.  Have
> this preparation also be useful by non-TDX changes to improve correctness
> associated with the combination of 4-level EPT and MAXPA > 48.  The issue
> is analyzed at [1].

It looks like the part where you were going to describe the TDX and other
testing got left off, can elaborate?

Sean Christopherson July 16, 2024, 7:31 p.m. UTC | #2

On Mon, Jul 15, 2024, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Prepare for TDX support by making kvm_mmu_max_gfn() configurable.  Have
> this preparation also be useful by non-TDX changes to improve correctness
> associated with the combination of 4-level EPT and MAXPA > 48.  The issue
> is analyzed at [1].
> 
> Like other confidential computing technologies, TDX has the concept of
> private and shared memory.  For TDX, the private and shared mappings of the
> same GFN are mapped at separate GPAs, with a configurable GPA bit selecting
> between private and shared aliases of the same GFN.  This means that
> operations working with the concept of a maximum GFN should only go up to
> this configurable GPA bit instead of the existing behavior based on
> shadow_phys_bits.  Other TDX changes will handle applying the operation
> across both GFN aliases.

This is going to be confusing.  A TDX will be all but guaranteed to generate an
EPT violation on a gfn > kvm->mmu_max_gfn.

> Using the existing kvm_mmu_max_gfn() based on shadow_phys_bits would cause
> functional problems for TDX.  Specifically, because the function is used to
> restrict the range where memslots can be created.  For TDX, if a memslot is
> created at a GFN that includes the bit that selects between private/shared,
> it would confuse the logic that zaps both aliases.  It would end up zapping
> only the higher alias and missing the lower one.  In this case, freed pages
> could remain mapped in the TDX guest.

So why don't we simply disallow creating aliased memslots?

> Since this GPA bit is configurable per-VM, make kvm_mmu_max_gfn() per-VM by
> having it take a struct kvm, and keep the max GFN as a member on that
> struct.  Future TDX changes will set this member based on the configurable
> position of the private/shared bit.
> 
> Besides functional issues, it is generally more correct and easier to

No, it most definitely is not more correct.  There is absolutely no issue zapping
SPTEs that should never exist.  In fact, restricting the zapping path is far more
likely to *cause* correctness issues, e.g. see 

  524a1e4e381f ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs")
  86931ff7207b ("KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR")

Creating SPTEs is a different matter, but unless I'm missing something, the only
path that _needs_ to be updated is kvm_arch_prepare_memory_region(), to disallow
aliased memslots.

I assume TDX will strip the shared bit from fault->gfn, and shove it back in when
creating MMIO SPTEs in the shared EPT page tables.

Why can't we simply do:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 842a3a4cdfe9..5ea428dde891 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4395,6 +4395,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
        struct kvm_memory_slot *slot = fault->slot;
        int ret;
 
+       if (WARN_ON_ONCE(vcpu->kvm, kvm_is_gfn_alias(fault->gfn)))
+               return -EFAULT;
+
        /*
         * Note that the mmu_invalidate_seq also serves to detect a concurrent
         * change in attributes.  is_page_fault_stale() will detect an
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 994743266480..091da7607025 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12979,6 +12979,9 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
                if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn())
                        return -EINVAL;
 
+               if (kvm_is_gfn_alias(kvm, new->base_gfn + new->npages - 1))
+                       return -EINVAL;
+
                return kvm_alloc_memslot_metadata(kvm, new);
        }

Edgecombe, Rick P July 16, 2024, 9:14 p.m. UTC | #3

On Tue, 2024-07-16 at 12:31 -0700, Sean Christopherson wrote:
> 
> No, it most definitely is not more correct.  There is absolutely no issue
> zapping
> SPTEs that should never exist.  In fact, restricting the zapping path is far
> more
> likely to *cause* correctness issues, e.g. see 
> 
>   524a1e4e381f ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all
> SPTEs")
>   86931ff7207b ("KVM: x86/mmu: Do not create SPTEs for GFNs that exceed
> host.MAXPHYADDR")

The type of correctness this was going for was around the new treatment of GFNs
not having the shared/alias bit. As you know it can get confusing which
variables have these bits and which have them stripped. Part of the recent MMU
work involved making sure at least GFN's didn't contain the shared bit.

Then in TDP MMU where it iterates from start to end, for example:
static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
			      gfn_t start, gfn_t end, bool can_yield, bool
flush)
{
	struct tdp_iter iter;

	end = min(end, tdp_mmu_max_gfn_exclusive());

	lockdep_assert_held_write(&kvm->mmu_lock);

	rcu_read_lock();

	for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_4K, start, end) {
		if (can_yield &&
		    tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
			flush = false;
			continue;
		}
...

The math gets a bit confused. For the private/mirror root, start will begin at
0, and iterate to a range that includes the shared bit. No functional problem
because we are zapping things that shouldn't be set. But it means the 'gfn' has
the bit position of the shared bit set. Although it is not acting as the shared
bit in this case, just an out of range bit.

For the shared/direct root, it will iterate from (shared_bit | 0) to (shared_bit
| max_gfn). So where the mirror root iterates through the whole range, the
shared case skips it in the current code anyway.

And then the fact that the code already takes care here to avoid zapping over
ranges that exceed the max gfn.

So it's a bit asymmetric, and just overall weird. We are weighing functional
correctness risk with known code weirdness. My inclination was to try to reduce
the places where TDX MMU needs paths happen to work for subtle reasons for the
cost of the VM field. I think I still lean that way, but not a strong opinion.

> 
> Creating SPTEs is a different matter, but unless I'm missing something, the
> only
> path that _needs_ to be updated is kvm_arch_prepare_memory_region(), to
> disallow
> aliased memslots.
> 
> I assume TDX will strip the shared bit from fault->gfn, and shove it back in
> when
> creating MMIO SPTEs in the shared EPT page tables.
> 
> Why can't we simply do:

Seems ok. It's not dissimilar to what this patch does for memslots. For the
fault path check, it is catching something that 86931ff7207b ("KVM: x86/mmu: Do
not create SPTEs for GFNs that exceed host.MAXPHYADDR") says not to worry about.
But maybe that calculus changes with the TDX module in the loop.

We also should be ok functionally without either.

Sean Christopherson July 16, 2024, 11:08 p.m. UTC | #4

On Tue, Jul 16, 2024, Rick P Edgecombe wrote:
> On Tue, 2024-07-16 at 12:31 -0700, Sean Christopherson wrote:
> > 
> > No, it most definitely is not more correct.  There is absolutely no issue
> > zapping
> > SPTEs that should never exist.  In fact, restricting the zapping path is far
> > more
> > likely to *cause* correctness issues, e.g. see 
> > 
> >   524a1e4e381f ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all
> > SPTEs")
> >   86931ff7207b ("KVM: x86/mmu: Do not create SPTEs for GFNs that exceed
> > host.MAXPHYADDR")
> 
> The type of correctness this was going for was around the new treatment of GFNs
> not having the shared/alias bit. As you know it can get confusing which
> variables have these bits and which have them stripped. Part of the recent MMU
> work involved making sure at least GFN's didn't contain the shared bit.

That's fine, and doesn't conflict with what I'm asserting, which is that it's
a-ok to process SPTEs a range of GFNs that, barring KVM bugs, should never have
"valid" entries.

> Then in TDP MMU where it iterates from start to end, for example:
> static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> 			      gfn_t start, gfn_t end, bool can_yield, bool
> flush)
> {
> 	struct tdp_iter iter;
> 
> 	end = min(end, tdp_mmu_max_gfn_exclusive());
> 
> 	lockdep_assert_held_write(&kvm->mmu_lock);
> 
> 	rcu_read_lock();
> 
> 	for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_4K, start, end) {
> 		if (can_yield &&
> 		    tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
> 			flush = false;
> 			continue;
> 		}
> ...
> 
> The math gets a bit confused. For the private/mirror root, start will begin at
> 0, and iterate to a range that includes the shared bit. No functional problem
> because we are zapping things that shouldn't be set. But it means the 'gfn' has
> the bit position of the shared bit set. Although it is not acting as the shared
> bit in this case, just an out of range bit.
>
> For the shared/direct root, it will iterate from (shared_bit | 0) to (shared_bit
> | max_gfn). So where the mirror root iterates through the whole range, the
> shared case skips it in the current code anyway.
> 
> And then the fact that the code already takes care here to avoid zapping over
> ranges that exceed the max gfn.
> 
> So it's a bit asymmetric, and just overall weird. We are weighing functional
> correctness risk with known code weirdness.

IMO, you're looking at it with too much of a TDX lens and not thinking about all
the people that don't care about TDX, which is the majority of KVM developers.

The unaliased GFN is definitely not the max GFN of all the VM's MMUs, since the
shared EPT must be able to process GPAs with bits set above the "max" GFN.  And
to me, _that's_ far more weird than saying that "S-EPT MMUs never set the shared
bit, and shared EPT MMUs never clear the shared bit".  I'm guessing the S-EPT
support ORs in the shared bit, but it's still a GFN.

If you were adding a per-MMU max GFN, then I'd buy that it legitimately is the max
GFN, but why not have a full GFN range for the MMU?  E.g.

  static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
			         bool shared, int zap_level)
  {
	struct tdp_iter iter;

	gfn_t end = tdp_mmu_max_gfn_exclusive(root);
	gfn_t start = tdp_mmu_min_gfn_inclusive(root);

and then have the helpers incorporated the S-EPT vs. EPT information.  That gets
us optimized, precise zapping without needing to muddy the waters by tracking a
per-VM "max" GFN that is only kinda sorta the max if you close your eyes and don't
think too hard about the shared MMU usage.

> My inclination was to try to reduce the places where TDX MMU needs paths
> happen to work for subtle reasons for the cost of the VM field. 

But it doesn't happen to work for subtle reasons.  It works because it has to
work.  Processing !PRESENT SPTEs should always work, regardless of why KVM can
guarantee there are no SPTEs in a given GFN range.

Edgecombe, Rick P July 17, 2024, 12:38 a.m. UTC | #5

On Tue, 2024-07-16 at 16:08 -0700, Sean Christopherson wrote:
> 
> IMO, you're looking at it with too much of a TDX lens and not thinking about
> all
> the people that don't care about TDX, which is the majority of KVM developers.

Well, I don't mean to. Actually trying to do the opposite.

I don't see how per-vm max gfn makes it easier or harder to look at things from
the normal VM perspective. I guess you'd have to figure out what set kvm-
>mmu_max_gfn.

> 
> The unaliased GFN is definitely not the max GFN of all the VM's MMUs, since
> the
> shared EPT must be able to process GPAs with bits set above the "max" GFN. 
> And
> to me, _that's_ far more weird than saying that "S-EPT MMUs never set the
> shared
> bit, and shared EPT MMUs never clear the shared bit".  I'm guessing the S-EPT
> support ORs in the shared bit, but it's still a GFN.

In the current solution GFNs always have the shared bit stripped. It gets added
back within the TDP MMU iterator. So for the regular VM reader's perspective,
TDP MMU behaves pretty much like normal for TDX direct (shared) roots, that is
memslot GFNs get mapped at the same TDP GFNs. The iterator handles writing the
PTE for the memslot GFN at the special position in the EPT tables (gfn |
shared_bit).

> 
> If you were adding a per-MMU max GFN, then I'd buy that it legitimately is the
> max
> GFN, but why not have a full GFN range for the MMU?  E.g.
> 
>   static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
>                                  bool shared, int zap_level)
>   {
>         struct tdp_iter iter;
> 
>         gfn_t end = tdp_mmu_max_gfn_exclusive(root);
>         gfn_t start = tdp_mmu_min_gfn_inclusive(root);
> 
> and then have the helpers incorporated the S-EPT vs. EPT information.  That
> gets
> us optimized, precise zapping without needing to muddy the waters by tracking
> a
> per-VM "max" GFN that is only kinda sorta the max if you close your eyes and
> don't
> think too hard about the shared MMU usage.

This is similar to what we had before with the kvm_gfn_for_root(). Have you
looked at a recent version? Here, this recent one has some discussion on it:
https://lore.kernel.org/kvm/20240530210714.364118-7-rick.p.edgecombe@intel.com/#t

Pushing the shared bit re-application into to the TDP MMU iterator pretty nicely
hides a lot of the TDX specific behavior. It wraps up the TDX bits so that other
developers *don't* working on the various operations don't have to think about
it.

> 
> > My inclination was to try to reduce the places where TDX MMU needs paths
> > happen to work for subtle reasons for the cost of the VM field. 
> 
> But it doesn't happen to work for subtle reasons.  It works because it has to
> work.  Processing !PRESENT SPTEs should always work, regardless of why KVM can
> guarantee there are no SPTEs in a given GFN range.

The current code (without this patch) doesn't zap the whole range that is
mappable by the EPT for the shared root case. It covers the whole range in the
private case, and only the range that is expected to be mapped in the shared
case. So this is good or bad? I think you are saying bad.

With this patch, it also doesn't zap the whole range mappable by the EPT, but
does it in a consistent way.

I think you are saying it's important to zap the whole range mappable by the
EPT. With TDX it is fuzzy, because the direct root range without the shared bit,
or beyond shouldn't be accessible via that root so it is not really mapped. We
would still be zapping the whole accessible paging structure, even if not the
whole part documented in the normal EPT.

But if we want to zap the whole structure I see the positives. I'd still rather
not go back to a kvm_gfn_for_root()-like solution for the TDP MMU support if
possible. Is that alright with you? What about something like this in
conjunction with your earlier diff?

diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 8bc19e65371c..bf19d5a45f87 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -129,6 +129,11 @@ struct tdp_iter {
             iter.valid && iter.gfn < end;                                     
\
             tdp_iter_next(&iter))

+#define for_each_tdp_pte_min_level_all(iter, root, min_level)          \
+       for (tdp_iter_start(&iter, root, min_level, 0, 0);              \
+            iter.valid && iter.gfn < tdp_mmu_max_gfn_exclusive();      \
+            tdp_iter_next(&iter))
+
 #define for_each_tdp_pte(iter, kvm, root, start, end)                         
\
        for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_4K, start, end)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c3ce43ce7b3f..6f425652e396 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -889,10 +889,7 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct
kvm_mmu_page *root,
 {
        struct tdp_iter iter;

-       gfn_t end = tdp_mmu_max_gfn_exclusive();
-       gfn_t start = 0;
-
-       for_each_tdp_pte_min_level(iter, kvm, root, zap_level, start, end) {
+       for_each_tdp_pte_min_level_all(iter, root, zap_level) {
 retry:
                if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
                        continue;

KVM: x86/mmu: Allow per VM kvm_mmu_max_gfn()

Commit Message

Comments

Patch