
[v15,21/23] KVM: MMU: Disable fast path for private memslots

Message ID 20240510015822.503071-1-michael.roth@amd.com (mailing list archive)
State New
Series Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support

Commit Message

Michael Roth May 10, 2024, 1:58 a.m. UTC
For hardware-protected VMs like SEV-SNP guests, certain conditions like
attempting to perform a write to a page which is not in the state that
the guest expects it to be in can result in a nested/extended #PF which
can only be satisfied by the host performing an implicit page state
change to transition the page into the expected shared/private state.
This is generally handled by generating a KVM_EXIT_MEMORY_FAULT event
that gets forwarded to userspace to handle via
KVM_SET_MEMORY_ATTRIBUTES.
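
As a rough illustration of that flow (not part of this patch, and assuming
only the standard KVM_EXIT_MEMORY_FAULT / KVM_SET_MEMORY_ATTRIBUTES UAPI,
with error handling mostly omitted), a VMM might handle the exit along
these lines:

        #include <stdio.h>
        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Sketch: flip the faulting range to the state the guest expects. */
        static void handle_memory_fault_exit(int vm_fd, struct kvm_run *run)
        {
                struct kvm_memory_attributes attrs = {
                        .address    = run->memory_fault.gpa,
                        .size       = run->memory_fault.size,
                        .attributes = (run->memory_fault.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE) ?
                                      KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
                };

                if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs) < 0)
                        perror("KVM_SET_MEMORY_ATTRIBUTES");
        }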

However, the fast_page_fault() code might misconstrue this situation as
being the result of a write-protected access, and treat it as a spurious
case when it sees that writes are already allowed for the SPTE. This
results in the KVM MMU trying to resume the guest rather than taking any
action to satisfy the real source of the #PF, such as generating a
KVM_EXIT_MEMORY_FAULT, resulting in the guest spinning on nested #PFs.
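
Specifically, the check that gets misled is the spurious-fault test in the
fast_page_fault() retry loop, which looks roughly like this (abridged):

        if (is_access_allowed(fault, spte)) {
                /*
                 * The SPTE already permits the access, so the fault looks
                 * spurious -- but for an SNP guest it may instead be waiting
                 * on an implicit page state change that never happens.
                 */
                ret = RET_PF_SPURIOUS;
                break;
        }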

For now, just skip the fast path for hardware-protected VMs since they
don't currently utilize any of this access-tracking machinery anyway. In
the future, these considerations will need to be taken into account if
there's any need/desire to re-enable the fast path for
hardware-protected VMs.

Since software-protected VMs don't have a notion of shared vs. private
that's separate from what KVM is tracking, the above
KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the special
handling for that case for now.

Cc: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 arch/x86/kvm/mmu/mmu.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

Comments

Sean Christopherson May 10, 2024, 1:47 p.m. UTC | #1
On Thu, May 09, 2024, Michael Roth wrote:
> ---
>  arch/x86/kvm/mmu/mmu.c | 30 ++++++++++++++++++++++++++++--
>  1 file changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 62ad38b2a8c9..cecd8360378f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3296,7 +3296,7 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
>  	return RET_PF_CONTINUE;
>  }
>  
> -static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
> +static bool page_fault_can_be_fast(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>  	/*
>  	 * Page faults with reserved bits set, i.e. faults on MMIO SPTEs, only
> @@ -3307,6 +3307,32 @@ static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
>  	if (fault->rsvd)
>  		return false;
>  
> +	/*
> +	 * For hardware-protected VMs, certain conditions like attempting to
> +	 * perform a write to a page which is not in the state that the guest
> +	 * expects it to be in can result in a nested/extended #PF. In this
> +	 * case, the below code might misconstrue this situation as being the
> +	 * result of a write-protected access, and treat it as a spurious case
> +	 * rather than taking any action to satisfy the real source of the #PF
> +	 * such as generating a KVM_EXIT_MEMORY_FAULT. This can lead to the
> +	 * guest spinning on a #PF indefinitely.
> +	 *
> +	 * For now, just skip the fast path for hardware-protected VMs since
> +	 * they don't currently utilize any of this machinery anyway. In the
> +	 * future, these considerations will need to be taken into account if
> +	 * there's any need/desire to re-enable the fast path for
> +	 * hardware-protected VMs.
> +	 *
> +	 * Since software-protected VMs don't have a notion of a shared vs.
> +	 * private that's separate from what KVM is tracking, the above
> +	 * KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the
> +	 * special handling for that case for now.

Very technically, it can occur if userspace _just_ modified the attributes.  And
as I've said multiple times, at least for now, I want to avoid special casing
SW-protected VMs unless it is *absolutely* necessary, because their sole purpose
is to allow testing flows that are impossible to exercise without SNP/TDX hardware.

> +	 */
> +	if (kvm_slot_can_be_private(fault->slot) &&
> +	    !(IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) &&
> +	      vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM))

Heh, !(x && y) kills me, I misread this like 4 times.

Anyways, I don't like the heuristic.  It doesn't tie the restriction back to the
cause in any reasonable way.  Can't this simply be?

	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
		return false;

Which is much, much more self-explanatory.
Paolo Bonzini May 10, 2024, 1:50 p.m. UTC | #2
On Fri, May 10, 2024 at 3:47 PM Sean Christopherson <seanjc@google.com> wrote:
>
> > +      * Since software-protected VMs don't have a notion of a shared vs.
> > +      * private that's separate from what KVM is tracking, the above
> > +      * KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the
> > +      * special handling for that case for now.
>
> Very technically, it can occur if userspace _just_ modified the attributes.  And
> as I've said multiple times, at least for now, I want to avoid special casing
> SW-protected VMs unless it is *absolutely* necessary, because their sole purpose
> is to allow testing flows that are impossible to exercise without SNP/TDX hardware.

Yep, it is not like they have to be optimized.

> > +      */
> > +     if (kvm_slot_can_be_private(fault->slot) &&
> > +         !(IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) &&
> > +           vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM))
>
> Heh, !(x && y) kills me, I misread this like 4 times.
>
> Anyways, I don't like the heuristic.  It doesn't tie the restriction back to the
> cause in any reasonable way.  Can't this simply be?
>
>         if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
>                 return false;

You beat me to it by seconds. And it can also be guarded by a check on
kvm->arch.has_private_mem to avoid the attributes lookup.
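
Put together, that would make the check roughly (a sketch; the exact
spelling of the has_private_mem guard is up to the final patch):

        if (vcpu->kvm->arch.has_private_mem &&
            fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
                return false;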

> Which is much, much more self-explanatory.

Both more self-explanatory and more correct.

Paolo
Michael Roth May 10, 2024, 3:27 p.m. UTC | #3
On Fri, May 10, 2024 at 03:50:26PM +0200, Paolo Bonzini wrote:
> On Fri, May 10, 2024 at 3:47 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > > +      * Since software-protected VMs don't have a notion of a shared vs.
> > > +      * private that's separate from what KVM is tracking, the above
> > > +      * KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the
> > > +      * special handling for that case for now.
> >
> > Very technically, it can occur if userspace _just_ modified the attributes.  And
> > as I've said multiple times, at least for now, I want to avoid special casing
> > SW-protected VMs unless it is *absolutely* necessary, because their sole purpose
> > is to allow testing flows that are impossible to exercise without SNP/TDX hardware.
> 
> Yep, it is not like they have to be optimized.

Ok, I thought there were maybe some future plans to use sw-protected VMs
to get some added protections from userspace. But even then there'd
probably still be extra considerations for how to handle access tracking
so white-listing them probably isn't right anyway.

I was also partly tempted to take this route because it would cover this
TDX patch as well:

  https://lore.kernel.org/lkml/91c797997b57056224571e22362321a23947172f.1705965635.git.isaku.yamahata@intel.com/

and avoid any weirdness about checking kvm_mem_is_private() without
checking mmu_invalidate_seq, but I think those cases all end up
resolving themselves eventually, and I've added some comments around that.

> 
> > > +      */
> > > +     if (kvm_slot_can_be_private(fault->slot) &&
> > > +         !(IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) &&
> > > +           vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM))
> >
> > Heh, !(x && y) kills me, I misread this like 4 times.
> >
> > Anyways, I don't like the heuristic.  It doesn't tie the restriction back to the
> > cause in any reasonable way.  Can't this simply be?
> >
> >         if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> >                 return false;
> 
> You beat me to it by seconds. And it can also be guarded by a check on
> kvm->arch.has_private_mem to avoid the attributes lookup.

I re-tested with things implemented this way and everything seems to
look good. It's not clear to me whether this would cover the cases the
above-mentioned TDX patch handles, but no biggie if that's still needed.

The new version of the patch is here:

  https://github.com/mdroth/linux/commit/39643f9f6da6265d39d633a703c53997985c1208

And I've updated my branches to replace the old patch and also
incorporate Sean's suggestions for patch 22:

  https://github.com/mdroth/linux/commits/snp-host-v15c3-unsquashed

and have them here with things already squashed in/relocated:

  https://github.com/mdroth/linux/commits/snp-host-v15c3

Thanks for the feedback Sean, Paolo.

-Mike
  
> 
> > Which is much, much more self-explanatory.
> 
> Both more self-explanatory and more correct.
> 
> Paolo
> 
>
Sean Christopherson May 10, 2024, 3:59 p.m. UTC | #4
On Fri, May 10, 2024, Michael Roth wrote:
> On Fri, May 10, 2024 at 03:50:26PM +0200, Paolo Bonzini wrote:
> > On Fri, May 10, 2024 at 3:47 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > > +      * Since software-protected VMs don't have a notion of a shared vs.
> > > > +      * private that's separate from what KVM is tracking, the above
> > > > +      * KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the
> > > > +      * special handling for that case for now.
> > >
> > > Very technically, it can occur if userspace _just_ modified the attributes.  And
> > > as I've said multiple times, at least for now, I want to avoid special casing
> > > SW-protected VMs unless it is *absolutely* necessary, because their sole purpose
> > > is to allow testing flows that are impossible to exercise without SNP/TDX hardware.
> > 
> > Yep, it is not like they have to be optimized.
> 
> Ok, I thought there were maybe some future plans to use sw-protected VMs
> to get some added protections from userspace. But even then there'd
> probably still be extra considerations for how to handle access tracking
> so white-listing them probably isn't right anyway.
> 
> I was also partly tempted to take this route because it would cover this
> TDX patch as well:
> 
>   https://lore.kernel.org/lkml/91c797997b57056224571e22362321a23947172f.1705965635.git.isaku.yamahata@intel.com/

Hmm, I'm pretty sure that patch is trying to fix the exact same issue you are
fixing, just in a less precise way.  S-EPT entries only support RWX=0 and RWX=111b,
i.e. it should be impossible to have a write-fault to a present S-EPT entry.

And if TDX is running afoul of this code:

	if (!fault->present)
		return !kvm_ad_enabled();

then KVM should do the sane thing and require A/D support be enabled for TDX.

And if it's something else entirely, that changelog has some explaining to do.

> and avoid any weirdness about checking kvm_mem_is_private() without
> checking mmu_invalidate_seq, but I think those cases all end up
> resolving themselves eventually and added some comments around that.

Yep, checking state that is protected by mmu_invalidate_seq outside of mmu_lock
is definitely allowed, e.g. the entire fast page fault path operates outside of
mmu_lock and thus outside of mmu_invalidate_seq's purview.

It's a-ok because the SPTE updates are done with an atomic CMPXCHG, and so KVM only needs
to ensure its page tables aren't outright _freed_.  If the zap triggered by the
attributes change "wins", then the fast #PF path will fail the CMPXCHG and be an
expensive NOP.  If the fast #PF wins, the zap will pave over the fast #PF fix,
and the IPI+flush that is needed for all zaps, to ensure vCPUs don't have stale
references, does the rest.
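
Concretely, the update in question is the compare-and-exchange at the end
of the fast path, roughly (abridged; the dirty-log handling is omitted):

        /*
         * Abridged from fast_pf_fix_direct_spte(): the fix is only ever
         * published with an atomic compare-and-exchange, so if a concurrent
         * zap (e.g. from an attributes change) already modified the SPTE,
         * the update fails and the fault falls back to the slow path.
         */
        if (!try_cmpxchg64(sptep, &old_spte, new_spte))
                return false;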

And if there's an attributes race that causes the fast #PF to bail early, the vCPU
will see the correct state on the next page fault.
Isaku Yamahata May 10, 2024, 5:47 p.m. UTC | #5
On Fri, May 10, 2024 at 08:59:09AM -0700,
Sean Christopherson <seanjc@google.com> wrote:

> On Fri, May 10, 2024, Michael Roth wrote:
> > On Fri, May 10, 2024 at 03:50:26PM +0200, Paolo Bonzini wrote:
> > > On Fri, May 10, 2024 at 3:47 PM Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > > +      * Since software-protected VMs don't have a notion of a shared vs.
> > > > > +      * private that's separate from what KVM is tracking, the above
> > > > > +      * KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the
> > > > > +      * special handling for that case for now.
> > > >
> > > > Very technically, it can occur if userspace _just_ modified the attributes.  And
> > > > as I've said multiple times, at least for now, I want to avoid special casing
> > > > SW-protected VMs unless it is *absolutely* necessary, because their sole purpose
> > > > is to allow testing flows that are impossible to exercise without SNP/TDX hardware.
> > > 
> > > Yep, it is not like they have to be optimized.
> > 
> > Ok, I thought there were maybe some future plans to use sw-protected VMs
> > to get some added protections from userspace. But even then there'd
> > probably still be extra considerations for how to handle access tracking
> > so white-listing them probably isn't right anyway.
> > 
> > I was also partly tempted to take this route because it would cover this
> > TDX patch as well:
> > 
> >   https://lore.kernel.org/lkml/91c797997b57056224571e22362321a23947172f.1705965635.git.isaku.yamahata@intel.com/
> 
> Hmm, I'm pretty sure that patch is trying to fix the exact same issue you are
> fixing, just in a less precise way.  S-EPT entries only support RWX=0 and RWX=111b,
> i.e. it should be impossible to have a write-fault to a present S-EPT entry.
> 
> And if TDX is running afoul of this code:
> 
> 	if (!fault->present)
> 		return !kvm_ad_enabled();
> 
> then KVM should do the sane thing and require A/D support be enabled for TDX.
> 
> And if it's something else entirely, that changelog has some explaining to do.

Yes, it's for the KVM_EXIT_MEMORY_FAULT case.  Because a Secure-EPT entry is
either non-present or has all of RWX allowed, the fast page fault path always
returns RET_PF_INVALID via the is_shadow_present_pte() check.
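
In other words, the relevant part of the fast_page_fault() loop is roughly
(abridged, TDP MMU case):

        int ret = RET_PF_INVALID;
        ...
        sptep = kvm_tdp_mmu_fast_pf_get_last_sptep(vcpu, fault->gfn, &spte);

        /*
         * A Secure-EPT mapping is either non-present or allows all of RWX,
         * so a fault that reaches the fast path bails out here and the
         * function returns with ret still set to RET_PF_INVALID.
         */
        if (!is_shadow_present_pte(spte))
                break;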

I lightly tested the patch at [1] and it works for TDX KVM.

[1] https://github.com/mdroth/linux/commit/39643f9f6da6265d39d633a703c53997985c1208

Just in case for that patch,
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 62ad38b2a8c9..cecd8360378f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3296,7 +3296,7 @@  static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
 	return RET_PF_CONTINUE;
 }
 
-static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
+static bool page_fault_can_be_fast(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	/*
 	 * Page faults with reserved bits set, i.e. faults on MMIO SPTEs, only
@@ -3307,6 +3307,32 @@  static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
 	if (fault->rsvd)
 		return false;
 
+	/*
+	 * For hardware-protected VMs, certain conditions like attempting to
+	 * perform a write to a page which is not in the state that the guest
+	 * expects it to be in can result in a nested/extended #PF. In this
+	 * case, the below code might misconstrue this situation as being the
+	 * result of a write-protected access, and treat it as a spurious case
+	 * rather than taking any action to satisfy the real source of the #PF
+	 * such as generating a KVM_EXIT_MEMORY_FAULT. This can lead to the
+	 * guest spinning on a #PF indefinitely.
+	 *
+	 * For now, just skip the fast path for hardware-protected VMs since
+	 * they don't currently utilize any of this machinery anyway. In the
+	 * future, these considerations will need to be taken into account if
+	 * there's any need/desire to re-enable the fast path for
+	 * hardware-protected VMs.
+	 *
+	 * Since software-protected VMs don't have a notion of a shared vs.
+	 * private that's separate from what KVM is tracking, the above
+	 * KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the
+	 * special handling for that case for now.
+	 */
+	if (kvm_slot_can_be_private(fault->slot) &&
+	    !(IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) &&
+	      vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM))
+		return false;
+
 	/*
 	 * #PF can be fast if:
 	 *
@@ -3407,7 +3433,7 @@  static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	u64 *sptep;
 	uint retry_count = 0;
 
-	if (!page_fault_can_be_fast(fault))
+	if (!page_fault_can_be_fast(vcpu, fault))
 		return ret;
 
 	walk_shadow_page_lockless_begin(vcpu);