diff mbox series

[02/16] KVM: x86/mmu: Introduce a slot flag to zap only slot leafs on slot deletion

Message ID 20240515005952.3410568-3-rick.p.edgecombe@intel.com (mailing list archive)
State New
Headers show
Series TDX MMU prep series part 1 | expand

Commit Message

Edgecombe, Rick P May 15, 2024, 12:59 a.m. UTC
From: Yan Zhao <yan.y.zhao@intel.com>

Introduce a per-memslot flag KVM_MEM_ZAP_LEAFS_ONLY to permit zap only leaf
SPTEs when deleting a memslot.

Today "zapping only memslot leaf SPTEs" on memslot deletion is not done.
Instead KVM will invalidate all old TDPs (i.e. EPT for Intel or NPT for
AMD) and generate fresh new TDPs based on the new memslot layout. This is
because zapping and re-generating TDPs is low overhead for most use cases,
and  more importantly, it's due to a bug [1] which caused VM instability
when a VM is with Nvidia Geforce GPU assigned.

There's a previous attempt [2] to introduce a per-VM flag to workaround bug
[1] by only allowing "zapping only memslot leaf SPTEs" for specific VMs.
However, [2] was not merged due to lacking of a clear explanation of
exactly what is broken [3] and it's not wise to "have a bug that is known
to happen when you enable the capability".

However, for some specific scenarios, e.g. TDX, invalidating and
re-generating a new page table is not viable for reasons:
- TDX requires root page of private page table remains unaltered throughout
  the TD life cycle.
- TDX mandates that leaf entries in private page table must be zapped prior
  to non-leaf entries.

So, Sean re-considered about introducing a per-VM flag or per-memslot flag
again for VMs like TDX. [4]

This patch is an implementation of per-memslot flag.
Compared to per-VM flag approach,
Pros:
(1) By allowing userspace to control the zapping behavior in fine-grained
    granularity, optimizations for specific use cases can be developed
    without future kernel changes.
(2) Allows developing new zapping behaviors without risking regressions by
    changing KVM behavior, as seen previously.

Cons:
(1) Users need to ensure all necessary memslots are with flag
    KVM_MEM_ZAP_LEAFS_ONLY set.e.g. QEMU needs to ensure all GUEST_MEMFD
    memslot is with ZAP_LEAFS_ONLY flag for TDX VM.
(2) Opens up the possibility that userspace could configure memslots for
    normal VM in such a way that the bug [1] is seen.

However, one thing deserves noting for TDX, is that TDX may potentially
meet bug [1] for either per-memslot flag or per-VM flag approach, since
there's a usage in radar to assign an untrusted & passthrough GPU device
in TDX. If that happens, it can be treated as a bug (not regression) and
fixed accordingly.

An alternative approach we can also consider is to always invalidate &
rebuild all shared page tables and zap only memslot leaf SPTEs for mirrored
and private page tables on memslot deletion. This approach could exempt TDX
from bug [1] when "untrusted & passthrough" devices are involved. But
downside is that this approach requires creating new very specific KVM
zapping ABI that could limit future changes in the same way that the bug
did for normal VMs.

Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@intel.com [1]
Link: https://lore.kernel.org/kvm/20200713190649.GE29725@linux.intel.com/T/#mabc0119583dacf621025e9d873c85f4fbaa66d5c [2]
Link: https://lore.kernel.org/kvm/20200713190649.GE29725@linux.intel.com/T/#m1839c85392a7a022df9e507876bb241c022c4f06 [3]
Link: https://lore.kernel.org/kvm/ZhSYEVCHqSOpVKMh@google.com [4]
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU Part 1:
 - New patch
---
 arch/x86/kvm/mmu/mmu.c   | 30 +++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c       | 17 +++++++++++++++++
 include/uapi/linux/kvm.h |  1 +
 virt/kvm/kvm_main.c      |  5 ++++-
 4 files changed, 51 insertions(+), 2 deletions(-)

Comments

Huang, Kai May 15, 2024, 1:24 p.m. UTC | #1
On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
> From: Yan Zhao <yan.y.zhao@intel.com>
> 
> Introduce a per-memslot flag KVM_MEM_ZAP_LEAFS_ONLY to permit zap only leaf
> SPTEs when deleting a memslot.
> 
> Today "zapping only memslot leaf SPTEs" on memslot deletion is not done.
> Instead KVM will invalidate all old TDPs (i.e. EPT for Intel or NPT for
> AMD) and generate fresh new TDPs based on the new memslot layout. This is
> because zapping and re-generating TDPs is low overhead for most use cases,
> and  more importantly, it's due to a bug [1] which caused VM instability
> when a VM is with Nvidia Geforce GPU assigned.
> 
> There's a previous attempt [2] to introduce a per-VM flag to workaround bug
> [1] by only allowing "zapping only memslot leaf SPTEs" for specific VMs.
> However, [2] was not merged due to lacking of a clear explanation of
> exactly what is broken [3] and it's not wise to "have a bug that is known
> to happen when you enable the capability".
> 
> However, for some specific scenarios, e.g. TDX, invalidating and
> re-generating a new page table is not viable for reasons:
> - TDX requires root page of private page table remains unaltered throughout
>   the TD life cycle.
> - TDX mandates that leaf entries in private page table must be zapped prior
>   to non-leaf entries.
> 
> So, Sean re-considered about introducing a per-VM flag or per-memslot flag
> again for VMs like TDX. [4]
> 
> This patch is an implementation of per-memslot flag.
> Compared to per-VM flag approach,
> Pros:
> (1) By allowing userspace to control the zapping behavior in fine-grained
>     granularity, optimizations for specific use cases can be developed
>     without future kernel changes.
> (2) Allows developing new zapping behaviors without risking regressions by
>     changing KVM behavior, as seen previously.
> 
> Cons:
> (1) Users need to ensure all necessary memslots are with flag
>     KVM_MEM_ZAP_LEAFS_ONLY set.e.g. QEMU needs to ensure all GUEST_MEMFD
>     memslot is with ZAP_LEAFS_ONLY flag for TDX VM.
> (2) Opens up the possibility that userspace could configure memslots for
>     normal VM in such a way that the bug [1] is seen.

I don't quite follow the logic why userspace should be involved.

TDX cannot use "page table fast zap", and need to use a different way to
zap, a.k.a, zap-leaf-only while holding MMU write lock, but this doesn't
necessarily mean such thing should be exposed to userspace?

It's weird that userspace needs to control how does KVM zap page table for
memslot delete/move.

The [2] mentions there are performance improvement for certain VMs if KVM
does zap-leaf-only, but AFAICT it doesn't provide concrete argument why it
needs to be exposed to userspace so it can be done as per-vm (that whole
thread basically was talking about the bug in [1]).  E.g., see:

https://lore.kernel.org/kvm/20200713190649.GE29725@linux.intel.com/T/#m702b273057cc318465cb5a1677d94e923dce9832

"
Ya, a capability is a bad idea. I was coming at it from the angle that, if
there is a fundamental requirement with e.g. GPU passthrough that requires
zapping all SPTEs, then enabling the precise capability on a per-VM basis
would make sense.  But adding something to the ABI on pure speculation is
silly.
"

So to me looks it's overkill to expose this "zap-leaf-only" to userspace.
We can just set this flag for a TDX guest when memslot is created in KVM.

> 
> However, one thing deserves noting for TDX, is that TDX may potentially
> meet bug [1] for either per-memslot flag or per-VM flag approach, since
> there's a usage in radar to assign an untrusted & passthrough GPU device
> in TDX. If that happens, it can be treated as a bug (not regression) and
> fixed accordingly.
> 
> An alternative approach we can also consider is to always invalidate &
> rebuild all shared page tables and zap only memslot leaf SPTEs for mirrored
> and private page tables on memslot deletion. This approach could exempt TDX
> from bug [1] when "untrusted & passthrough" devices are involved. But
> downside is that this approach requires creating new very specific KVM
> zapping ABI that could limit future changes in the same way that the bug
> did for normal VMs.
> 
> Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@intel.com [1]
> Link: https://lore.kernel.org/kvm/20200713190649.GE29725@linux.intel.com/T/#mabc0119583dacf621025e9d873c85f4fbaa66d5c [2]
> Link: https://lore.kernel.org/kvm/20200713190649.GE29725@linux.intel.com/T/#m1839c85392a7a022df9e507876bb241c022c4f06 [3]
> Link: https://lore.kernel.org/kvm/ZhSYEVCHqSOpVKMh@google.com [4]
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> TDX MMU Part 1:
>  - New patch
> ---
>  arch/x86/kvm/mmu/mmu.c   | 30 +++++++++++++++++++++++++++++-
>  arch/x86/kvm/x86.c       | 17 +++++++++++++++++
>  include/uapi/linux/kvm.h |  1 +
>  virt/kvm/kvm_main.c      |  5 ++++-
>  4 files changed, 51 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 61982da8c8b2..4a8e819794db 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6962,10 +6962,38 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm)
>  	kvm_mmu_zap_all(kvm);
>  }
>  
> +static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *slot)
> +{
> +	if (KVM_BUG_ON(!tdp_mmu_enabled, kvm))
> +		return;
> +
> +	write_lock(&kvm->mmu_lock);
> +
> +	/*
> +	 * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
> +	 * case scenario we'll have unused shadow pages lying around until they
> +	 * are recycled due to age or when the VM is destroyed.
> +	 */
> +	struct kvm_gfn_range range = {
> +		.slot = slot,
> +		.start = slot->base_gfn,
> +		.end = slot->base_gfn + slot->npages,
> +		.may_block = true,
> +	};
> +
> +	if (kvm_tdp_mmu_unmap_gfn_range(kvm, &range, false))
> +		kvm_flush_remote_tlbs(kvm);
> +
> +	write_unlock(&kvm->mmu_lock);
> +}
> +
>  void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
>  				   struct kvm_memory_slot *slot)
>  {
> -	kvm_mmu_zap_all_fast(kvm);
> +	if (slot->flags & KVM_MEM_ZAP_LEAFS_ONLY)
> +		kvm_mmu_zap_memslot_leafs(kvm, slot);
> +	else
> +		kvm_mmu_zap_all_fast(kvm);
>  }
>  
>  void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 7c593a081eba..4b3ec2ec79e9 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12952,6 +12952,23 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>  		if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn())
>  			return -EINVAL;
>  
> +		/*
> +		 * Since TDX private pages requires re-accepting after zap,
> +		 * and TDX private root page should not be zapped, TDX requires
> +		 * memslots for private memory must have flag
> +		 * KVM_MEM_ZAP_LEAFS_ONLY set too, so that only leaf SPTEs of
> +		 * the deleting memslot will be zapped and SPTEs in other
> +		 * memslots would not be affected.
> +		 */
> +		if (kvm->arch.vm_type == KVM_X86_TDX_VM &&
> +		    (new->flags & KVM_MEM_GUEST_MEMFD) &&
> +		    !(new->flags & KVM_MEM_ZAP_LEAFS_ONLY))
> +			return -EINVAL;
> +
> +		/* zap-leafs-only works only when TDP MMU is enabled for now */
> +		if ((new->flags & KVM_MEM_ZAP_LEAFS_ONLY) && !tdp_mmu_enabled)
> +			return -EINVAL;

If this zap-leaf-only is supposed to be generic, I don't see why we want
to make it only for TDP MMU?
Isaku Yamahata May 15, 2024, 6:03 p.m. UTC | #2
On Tue, May 14, 2024 at 05:59:38PM -0700,
Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:

> From: Yan Zhao <yan.y.zhao@intel.com>
> 
> Introduce a per-memslot flag KVM_MEM_ZAP_LEAFS_ONLY to permit zap only leaf
> SPTEs when deleting a memslot.
> 
> Today "zapping only memslot leaf SPTEs" on memslot deletion is not done.
> Instead KVM will invalidate all old TDPs (i.e. EPT for Intel or NPT for
> AMD) and generate fresh new TDPs based on the new memslot layout. This is
> because zapping and re-generating TDPs is low overhead for most use cases,
> and  more importantly, it's due to a bug [1] which caused VM instability
> when a VM is with Nvidia Geforce GPU assigned.
> 
> There's a previous attempt [2] to introduce a per-VM flag to workaround bug
> [1] by only allowing "zapping only memslot leaf SPTEs" for specific VMs.
> However, [2] was not merged due to lacking of a clear explanation of
> exactly what is broken [3] and it's not wise to "have a bug that is known
> to happen when you enable the capability".
> 
> However, for some specific scenarios, e.g. TDX, invalidating and
> re-generating a new page table is not viable for reasons:
> - TDX requires root page of private page table remains unaltered throughout
>   the TD life cycle.
> - TDX mandates that leaf entries in private page table must be zapped prior
>   to non-leaf entries.
> 
> So, Sean re-considered about introducing a per-VM flag or per-memslot flag
> again for VMs like TDX. [4]
> 
> This patch is an implementation of per-memslot flag.
> Compared to per-VM flag approach,
> Pros:
> (1) By allowing userspace to control the zapping behavior in fine-grained
>     granularity, optimizations for specific use cases can be developed
>     without future kernel changes.
> (2) Allows developing new zapping behaviors without risking regressions by
>     changing KVM behavior, as seen previously.
> 
> Cons:
> (1) Users need to ensure all necessary memslots are with flag
>     KVM_MEM_ZAP_LEAFS_ONLY set.e.g. QEMU needs to ensure all GUEST_MEMFD
>     memslot is with ZAP_LEAFS_ONLY flag for TDX VM.
> (2) Opens up the possibility that userspace could configure memslots for
>     normal VM in such a way that the bug [1] is seen.
> 
> However, one thing deserves noting for TDX, is that TDX may potentially
> meet bug [1] for either per-memslot flag or per-VM flag approach, since
> there's a usage in radar to assign an untrusted & passthrough GPU device
> in TDX. If that happens, it can be treated as a bug (not regression) and
> fixed accordingly.
> 
> An alternative approach we can also consider is to always invalidate &
> rebuild all shared page tables and zap only memslot leaf SPTEs for mirrored
> and private page tables on memslot deletion. This approach could exempt TDX
> from bug [1] when "untrusted & passthrough" devices are involved. But
> downside is that this approach requires creating new very specific KVM
> zapping ABI that could limit future changes in the same way that the bug
> did for normal VMs.
> 
> Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@intel.com [1]
> Link: https://lore.kernel.org/kvm/20200713190649.GE29725@linux.intel.com/T/#mabc0119583dacf621025e9d873c85f4fbaa66d5c [2]
> Link: https://lore.kernel.org/kvm/20200713190649.GE29725@linux.intel.com/T/#m1839c85392a7a022df9e507876bb241c022c4f06 [3]
> Link: https://lore.kernel.org/kvm/ZhSYEVCHqSOpVKMh@google.com [4]
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> TDX MMU Part 1:
>  - New patch
> ---
>  arch/x86/kvm/mmu/mmu.c   | 30 +++++++++++++++++++++++++++++-
>  arch/x86/kvm/x86.c       | 17 +++++++++++++++++
>  include/uapi/linux/kvm.h |  1 +
>  virt/kvm/kvm_main.c      |  5 ++++-
>  4 files changed, 51 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 61982da8c8b2..4a8e819794db 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6962,10 +6962,38 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm)
>  	kvm_mmu_zap_all(kvm);
>  }
>  
> +static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *slot)
> +{
> +	if (KVM_BUG_ON(!tdp_mmu_enabled, kvm))
> +		return;
> +
> +	write_lock(&kvm->mmu_lock);
> +
> +	/*
> +	 * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
> +	 * case scenario we'll have unused shadow pages lying around until they
> +	 * are recycled due to age or when the VM is destroyed.
> +	 */
> +	struct kvm_gfn_range range = {
> +		.slot = slot,
> +		.start = slot->base_gfn,
> +		.end = slot->base_gfn + slot->npages,
> +		.may_block = true,
> +	};

nit: move this up at the beginning of this function.
Compiler didn't complain?


> +
> +	if (kvm_tdp_mmu_unmap_gfn_range(kvm, &range, false))
> +		kvm_flush_remote_tlbs(kvm);
> +
> +	write_unlock(&kvm->mmu_lock);
> +}
> +
>  void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
>  				   struct kvm_memory_slot *slot)
>  {
> -	kvm_mmu_zap_all_fast(kvm);
> +	if (slot->flags & KVM_MEM_ZAP_LEAFS_ONLY)
> +		kvm_mmu_zap_memslot_leafs(kvm, slot);
> +	else
> +		kvm_mmu_zap_all_fast(kvm);
>  }
>  
>  void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 7c593a081eba..4b3ec2ec79e9 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12952,6 +12952,23 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>  		if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn())
>  			return -EINVAL;
>  
> +		/*
> +		 * Since TDX private pages requires re-accepting after zap,
> +		 * and TDX private root page should not be zapped, TDX requires
> +		 * memslots for private memory must have flag
> +		 * KVM_MEM_ZAP_LEAFS_ONLY set too, so that only leaf SPTEs of
> +		 * the deleting memslot will be zapped and SPTEs in other
> +		 * memslots would not be affected.
> +		 */
> +		if (kvm->arch.vm_type == KVM_X86_TDX_VM &&
> +		    (new->flags & KVM_MEM_GUEST_MEMFD) &&
> +		    !(new->flags & KVM_MEM_ZAP_LEAFS_ONLY))
> +			return -EINVAL;
> +
> +		/* zap-leafs-only works only when TDP MMU is enabled for now */
> +		if ((new->flags & KVM_MEM_ZAP_LEAFS_ONLY) && !tdp_mmu_enabled)
> +			return -EINVAL;
> +
>  		return kvm_alloc_memslot_metadata(kvm, new);
>  	}
>  
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index aee67912e71c..d53648c19b26 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -51,6 +51,7 @@ struct kvm_userspace_memory_region2 {
>  #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
>  #define KVM_MEM_READONLY	(1UL << 1)
>  #define KVM_MEM_GUEST_MEMFD	(1UL << 2)
> +#define KVM_MEM_ZAP_LEAFS_ONLY	(1UL << 3)

If we make this uAPI, please update Documentation/virt/kvm/api.rst too.


>  
>  /* for KVM_IRQ_LINE */
>  struct kvm_irq_level {
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 81b90bf03f2f..1b1ffb6fc786 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1568,6 +1568,8 @@ static int check_memory_region_flags(struct kvm *kvm,
>  	if (kvm_arch_has_private_mem(kvm))
>  		valid_flags |= KVM_MEM_GUEST_MEMFD;
>  
> +	valid_flags |= KVM_MEM_ZAP_LEAFS_ONLY;
> +

This is arch common code. We need a guard for other arch (non-x86).
Also feature enumeration. KVM_CAP_USER_MEMORY2 can be used?
Sean Christopherson May 15, 2024, 7:09 p.m. UTC | #3
On Wed, May 15, 2024, Kai Huang wrote:
> On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
> > So, Sean re-considered about introducing a per-VM flag or per-memslot flag
> > again for VMs like TDX. [4]

Not really.  I tried to be as clear as I could that my preference was that _if_
we exposing _something_ to userspace, then that something should probably be a
memslot flag.

 : There's no concrete motiviation, it's more that _if_ we're going to expose a knob
 : to userspace, then I'd prefer to make it as precise as possible to minimize the
 : changes of KVM ending up back in ABI hell again.

As I stressed in that thread, I hadn't thought about this deeply enough to have
an opinion one way or the other.

 : You're _really_ reading too much into my suggestion.  As above, my suggestion
 : was very spur of the momemnt.  I haven't put much thought into the tradeoffs and
 : side effects.

> > This patch is an implementation of per-memslot flag.
> > Compared to per-VM flag approach,
> > Pros:
> > (1) By allowing userspace to control the zapping behavior in fine-grained
> >     granularity, optimizations for specific use cases can be developed
> >     without future kernel changes.
> > (2) Allows developing new zapping behaviors without risking regressions by
> >     changing KVM behavior, as seen previously.
> > 
> > Cons:
> > (1) Users need to ensure all necessary memslots are with flag
> >     KVM_MEM_ZAP_LEAFS_ONLY set.e.g. QEMU needs to ensure all GUEST_MEMFD
> >     memslot is with ZAP_LEAFS_ONLY flag for TDX VM.
> > (2) Opens up the possibility that userspace could configure memslots for
> >     normal VM in such a way that the bug [1] is seen.
> 
> I don't quite follow the logic why userspace should be involved.
> 
> TDX cannot use "page table fast zap", and need to use a different way to
> zap, a.k.a, zap-leaf-only while holding MMU write lock, but this doesn't
> necessarily mean such thing should be exposed to userspace?
> 
> It's weird that userspace needs to control how does KVM zap page table for
> memslot delete/move.

Yeah, this isn't quite what I had in mind.  Granted, what I had in mind may not
be much any better, but I definitely don't want to let userspace dictate exactly
how KVM manages SPTEs.

My thinking for a memslot flag was more of a "deleting this memslot doesn't have
side effects", i.e. a way for userspace to give KVM the green light to deviate
from KVM's historical behavior of rebuilding the entire page tables.  Under the
hood, KVM would be allowed to do whatever it wants, e.g. for the initial
implementation, KVM would zap only leafs.  But critically, KVM wouldn't be
_required_ to zap only leafs.

> So to me looks it's overkill to expose this "zap-leaf-only" to userspace.
> We can just set this flag for a TDX guest when memslot is created in KVM.

100% agreed from a functionality perspective.  My thoughts/concerns are more about
KVM's ABI.

Hmm, actually, we already have new uAPI/ABI in the form of VM types.  What if
we squeeze a documentation update into 6.10 (which adds the SEV VM flavors) to
state that KVM's historical behavior of blasting all SPTEs is only _guaranteed_
for KVM_X86_DEFAULT_VM?

Anyone know if QEMU deletes shared-only, i.e. non-guest_memfd, memslots during
SEV-* boot?  If so, and assuming any such memslots are smallish, we could even
start enforcing the new ABI by doing a precise zap for small (arbitrary limit TBD)
shared-only memslots for !KVM_X86_DEFAULT_VM VMs.
Edgecombe, Rick P May 15, 2024, 7:23 p.m. UTC | #4
On Wed, 2024-05-15 at 12:09 -0700, Sean Christopherson wrote:
> > It's weird that userspace needs to control how does KVM zap page table for
> > memslot delete/move.
> 
> Yeah, this isn't quite what I had in mind.  Granted, what I had in mind may
> not
> be much any better, but I definitely don't want to let userspace dictate
> exactly
> how KVM manages SPTEs.

To me it doesn't seem completely unprecedented at least. Linux has a ton of
madvise() flags and other knobs to control this kind of PTE management for
userspace memory.

> 
> My thinking for a memslot flag was more of a "deleting this memslot doesn't
> have
> side effects", i.e. a way for userspace to give KVM the green light to deviate
> from KVM's historical behavior of rebuilding the entire page tables.  Under
> the
> hood, KVM would be allowed to do whatever it wants, e.g. for the initial
> implementation, KVM would zap only leafs.  But critically, KVM wouldn't be
> _required_ to zap only leafs.
> 
> > So to me looks it's overkill to expose this "zap-leaf-only" to userspace.
> > We can just set this flag for a TDX guest when memslot is created in KVM.
> 
> 100% agreed from a functionality perspective.  My thoughts/concerns are more
> about
> KVM's ABI.
> 
> Hmm, actually, we already have new uAPI/ABI in the form of VM types.  What if
> we squeeze a documentation update into 6.10 (which adds the SEV VM flavors) to
> state that KVM's historical behavior of blasting all SPTEs is only
> _guaranteed_
> for KVM_X86_DEFAULT_VM?
> 
> Anyone know if QEMU deletes shared-only, i.e. non-guest_memfd, memslots during
> SEV-* boot?  If so, and assuming any such memslots are smallish, we could even
> start enforcing the new ABI by doing a precise zap for small (arbitrary limit
> TBD)
> shared-only memslots for !KVM_X86_DEFAULT_VM VMs.

Again thinking of the userspace memory analogy... Aren't there some VMs where
the fast zap is faster? Like if you have guest with a small memslot that gets
deleted all the time, you could want it to be zapped specifically. But for the
giant memslot next to it, you might want to do the fast zap all thing.

So rather then try to optimize zapping more someday and hit similar issues, let
userspace decide how it wants it to be done. I'm not sure of the actual
performance tradeoffs here, to be clear.

That said, a per-vm know is easier for TDX purposes.
Sean Christopherson May 15, 2024, 8:05 p.m. UTC | #5
On Wed, May 15, 2024, Rick P Edgecombe wrote:
> On Wed, 2024-05-15 at 12:09 -0700, Sean Christopherson wrote:
> > > It's weird that userspace needs to control how does KVM zap page table for
> > > memslot delete/move.
> > 
> > Yeah, this isn't quite what I had in mind.  Granted, what I had in mind may
> > not be much any better, but I definitely don't want to let userspace
> > dictate exactly how KVM manages SPTEs.
> 
> To me it doesn't seem completely unprecedented at least. Linux has a ton of
> madvise() flags and other knobs to control this kind of PTE management for
> userspace memory.

Yes, but they all express their requests in terms of what behavior userspace wants
or to communicate userspace's access paterns.  They don't dictate exact low level
behavior to the kernel.

> > My thinking for a memslot flag was more of a "deleting this memslot doesn't
> > have side effects", i.e. a way for userspace to give KVM the green light to
> > deviate from KVM's historical behavior of rebuilding the entire page
> > tables.  Under the hood, KVM would be allowed to do whatever it wants, e.g.
> > for the initial implementation, KVM would zap only leafs.  But critically,
> > KVM wouldn't be _required_ to zap only leafs.
> > 
> > > So to me looks it's overkill to expose this "zap-leaf-only" to userspace.
> > > We can just set this flag for a TDX guest when memslot is created in KVM.
> > 
> > 100% agreed from a functionality perspective.  My thoughts/concerns are
> > more about KVM's ABI.
> > 
> > Hmm, actually, we already have new uAPI/ABI in the form of VM types.  What
> > if we squeeze a documentation update into 6.10 (which adds the SEV VM
> > flavors) to state that KVM's historical behavior of blasting all SPTEs is
> > only _guaranteed_ for KVM_X86_DEFAULT_VM?
> > 
> > Anyone know if QEMU deletes shared-only, i.e. non-guest_memfd, memslots
> > during SEV-* boot?  If so, and assuming any such memslots are smallish, we
> > could even start enforcing the new ABI by doing a precise zap for small
> > (arbitrary limit TBD) shared-only memslots for !KVM_X86_DEFAULT_VM VMs.
> 
> Again thinking of the userspace memory analogy... Aren't there some VMs where
> the fast zap is faster? Like if you have guest with a small memslot that gets
> deleted all the time, you could want it to be zapped specifically. But for the
> giant memslot next to it, you might want to do the fast zap all thing.

Yes.  But...

> So rather then try to optimize zapping more someday and hit similar issues, let
> userspace decide how it wants it to be done. I'm not sure of the actual
> performance tradeoffs here, to be clear.

...unless someone is able to root cause the VFIO regression, we don't have the
luxury of letting userspace give KVM a hint as to whether it might be better to
do a precise zap versus a nuke-and-pave.

And more importantly, it would be a _hint_, not the hard requirement that TDX
needs.

> That said, a per-vm know is easier for TDX purposes.
Edgecombe, Rick P May 15, 2024, 8:53 p.m. UTC | #6
On Wed, 2024-05-15 at 13:05 -0700, Sean Christopherson wrote:
> On Wed, May 15, 2024, Rick P Edgecombe wrote:
> > On Wed, 2024-05-15 at 12:09 -0700, Sean Christopherson wrote:
> > > > It's weird that userspace needs to control how does KVM zap page table
> > > > for
> > > > memslot delete/move.
> > > 
> > > Yeah, this isn't quite what I had in mind.  Granted, what I had in mind
> > > may
> > > not be much any better, but I definitely don't want to let userspace
> > > dictate exactly how KVM manages SPTEs.
> > 
> > To me it doesn't seem completely unprecedented at least. Linux has a ton of
> > madvise() flags and other knobs to control this kind of PTE management for
> > userspace memory.
> 
> Yes, but they all express their requests in terms of what behavior userspace
> wants
> or to communicate userspace's access paterns.  They don't dictate exact low
> level
> behavior to the kernel.
> 

There are a few for madvise that are like "don't do this". Of course also, some
of the implementations take direct action anyway and then become ABI. Otherwise
there is mlock(). There are so many mm features. It might actually be more of a
cautionary tale.

[snip]

> > So rather then try to optimize zapping more someday and hit similar issues,
> > let
> > userspace decide how it wants it to be done. I'm not sure of the actual
> > performance tradeoffs here, to be clear.
> 
> ...unless someone is able to root cause the VFIO regression, we don't have the
> luxury of letting userspace give KVM a hint as to whether it might be better
> to
> do a precise zap versus a nuke-and-pave.

Pedantry... I think it's not a regression if something requires a new flag. It
is still a bug though.

The thing I worry about on the bug is whether it might have been due to a guest
having access to page it shouldn't have. In which case we can't give the user
the opportunity to create it.

I didn't gather there was any proof of this. Did you have any hunch either way?

> 
> And more importantly, it would be a _hint_, not the hard requirement that TDX
> needs.
> 
> > That said, a per-vm know is easier for TDX purposes.

If we don't want it to be a mandate from userspace, then we need to do some per-
vm checking in TDX's case anyway. In which case we might as well go with the
per-vm option for TDX.

You had said up the thread, why not opt all non-normal VMs into the new
behavior. It will work great for TDX. But why do SEV and others want this
automatically?
Sean Christopherson May 15, 2024, 10:47 p.m. UTC | #7
On Wed, May 15, 2024, Rick P Edgecombe wrote:
> On Wed, 2024-05-15 at 13:05 -0700, Sean Christopherson wrote:
> > On Wed, May 15, 2024, Rick P Edgecombe wrote:
> > > So rather then try to optimize zapping more someday and hit similar
> > > issues, let userspace decide how it wants it to be done. I'm not sure of
> > > the actual performance tradeoffs here, to be clear.
> > 
> > ...unless someone is able to root cause the VFIO regression, we don't have
> > the luxury of letting userspace give KVM a hint as to whether it might be
> > better to do a precise zap versus a nuke-and-pave.
> 
> Pedantry... I think it's not a regression if something requires a new flag. It
> is still a bug though.

Heh, pedantry denied.  I was speaking in the past tense about the VFIO failure,
which was a regression as I changed KVM behavior without adding a flag.

> The thing I worry about on the bug is whether it might have been due to a guest
> having access to page it shouldn't have. In which case we can't give the user
> the opportunity to create it.
> 
> I didn't gather there was any proof of this. Did you have any hunch either way?

I doubt the guest was able to access memory it shouldn't have been able to access.
But that's a moot point, as the bigger problem is that, because we have no idea
what's at fault, KVM can't make any guarantees about the safety of such a flag.

TDX is a special case where we don't have a better option (we do have other options,
they're just horrible).  In other words, the choice is essentially to either:

 (a) cross our fingers and hope that the problem is limited to shared memory
     with QEMU+VFIO, i.e. and doesn't affect TDX private memory.

or 

 (b) don't merge TDX until the original regression is fully resolved.

FWIW, I would love to root cause and fix the failure, but I don't know how feasible
that is at this point.

> > And more importantly, it would be a _hint_, not the hard requirement that TDX
> > needs.
> > 
> > > That said, a per-vm know is easier for TDX purposes.
> 
> If we don't want it to be a mandate from userspace, then we need to do some per-
> vm checking in TDX's case anyway. In which case we might as well go with the
> per-vm option for TDX.
> 
> You had said up the thread, why not opt all non-normal VMs into the new
> behavior. It will work great for TDX. But why do SEV and others want this
> automatically?

Because I want flexibility in KVM, i.e. I want to take the opportunity to try and
break away from KVM's godawful ABI.  It might be a pipe dream, as keying off the
VM type obviously has similar risks to giving userspace a memslot flag.  The one
sliver of hope is that the VM types really are quite new (though less so for SEV
and SEV-ES), whereas a memslot flag would be easily applied to existing VMs.
Huang, Kai May 15, 2024, 11:06 p.m. UTC | #8
>>
>> You had said up the thread, why not opt all non-normal VMs into the new
>> behavior. It will work great for TDX. But why do SEV and others want this
>> automatically?
> 
> Because I want flexibility in KVM, i.e. I want to take the opportunity to try and
> break away from KVM's godawful ABI.  It might be a pipe dream, as keying off the
> VM type obviously has similar risks to giving userspace a memslot flag.  The one
> sliver of hope is that the VM types really are quite new (though less so for SEV
> and SEV-ES), whereas a memslot flag would be easily applied to existing VMs.

Btw, does the "zap-leaf-only" approach always have better performance, 
assuming we have to hold MMU write lock for that?

Consider a huge memslot being deleted/moved.

If we can always have a better performance for "zap-leaf-only", then 
instead of letting userspace to opt-in this feature, we perhaps can do 
the opposite:

We always do the "zap-leaf-only" in KVM, but add a quirk for the VMs 
that userspace know can have such bug and apply this quirk.

But again, I think it's just too overkill for TDX.  We can just set the 
ZAP_LEAF_ONLY flag for the slot when it is created in KVM.
Sean Christopherson May 15, 2024, 11:20 p.m. UTC | #9
On Thu, May 16, 2024, Kai Huang wrote:
> > > You had said up the thread, why not opt all non-normal VMs into the new
> > > behavior. It will work great for TDX. But why do SEV and others want this
> > > automatically?
> > 
> > Because I want flexibility in KVM, i.e. I want to take the opportunity to try and
> > break away from KVM's godawful ABI.  It might be a pipe dream, as keying off the
> > VM type obviously has similar risks to giving userspace a memslot flag.  The one
> > sliver of hope is that the VM types really are quite new (though less so for SEV
> > and SEV-ES), whereas a memslot flag would be easily applied to existing VMs.
> 
> Btw, does the "zap-leaf-only" approach always have better performance,
> assuming we have to hold MMU write lock for that?

I highly doubt it, especially given how much the TDP MMU can now do with mmu_lock
held for read.

> Consider a huge memslot being deleted/moved.
> 
> If we can always have a better performance for "zap-leaf-only", then instead
> of letting userspace to opt-in this feature, we perhaps can do the opposite:
> 
> We always do the "zap-leaf-only" in KVM, but add a quirk for the VMs that
> userspace know can have such bug and apply this quirk.

Hmm, a quirk isn't a bad idea.  It suffers the same problems as a memslot flag,
i.e. who knows when it's safe to disable the quirk, but I would hope userspace
would be much, much cautious about disabling a quirk that comes with a massive
disclaimer.

Though I suspect Paolo will shoot this down too ;-)

> But again, I think it's just too overkill for TDX.  We can just set the
> ZAP_LEAF_ONLY flag for the slot when it is created in KVM.

Ya, I'm convinced that adding uAPI is overkill at this point.
Huang, Kai May 15, 2024, 11:36 p.m. UTC | #10
On 16/05/2024 11:20 am, Sean Christopherson wrote:
> On Thu, May 16, 2024, Kai Huang wrote:
>>>> You had said up the thread, why not opt all non-normal VMs into the new
>>>> behavior. It will work great for TDX. But why do SEV and others want this
>>>> automatically?
>>>
>>> Because I want flexibility in KVM, i.e. I want to take the opportunity to try and
>>> break away from KVM's godawful ABI.  It might be a pipe dream, as keying off the
>>> VM type obviously has similar risks to giving userspace a memslot flag.  The one
>>> sliver of hope is that the VM types really are quite new (though less so for SEV
>>> and SEV-ES), whereas a memslot flag would be easily applied to existing VMs.
>>
>> Btw, does the "zap-leaf-only" approach always have better performance,
>> assuming we have to hold MMU write lock for that?
> 
> I highly doubt it, especially given how much the TDP MMU can now do with mmu_lock
> held for read.
> 
>> Consider a huge memslot being deleted/moved.
>>
>> If we can always have a better performance for "zap-leaf-only", then instead
>> of letting userspace to opt-in this feature, we perhaps can do the opposite:
>>
>> We always do the "zap-leaf-only" in KVM, but add a quirk for the VMs that
>> userspace know can have such bug and apply this quirk.
> 
> Hmm, a quirk isn't a bad idea.  It suffers the same problems as a memslot flag,
> i.e. who knows when it's safe to disable the quirk, but I would hope userspace
> would be much, much cautious about disabling a quirk that comes with a massive
> disclaimer.
> 
> Though I suspect Paolo will shoot this down too ;-)

The quirk only works based on the assumption that userspace _exactly_ 
knows what kinda VMs will have this bug.

But as mentioned above, the first step is we need to convince ourselves 
that doing "zap-leaf-only" by default is the right thing to do.

:-)
Edgecombe, Rick P May 15, 2024, 11:56 p.m. UTC | #11
On Wed, 2024-05-15 at 15:47 -0700, Sean Christopherson wrote:
> > I didn't gather there was any proof of this. Did you have any hunch either
> > way?
> 
> I doubt the guest was able to access memory it shouldn't have been able to
> access.
> But that's a moot point, as the bigger problem is that, because we have no
> idea
> what's at fault, KVM can't make any guarantees about the safety of such a
> flag.
> 
> TDX is a special case where we don't have a better option (we do have other
> options,
> they're just horrible).  In other words, the choice is essentially to either:
> 
>  (a) cross our fingers and hope that the problem is limited to shared memory
>      with QEMU+VFIO, i.e. and doesn't affect TDX private memory.
> 
> or 
> 
>  (b) don't merge TDX until the original regression is fully resolved.
> 
> FWIW, I would love to root cause and fix the failure, but I don't know how
> feasible
> that is at this point.

If we think it is not a security issue, and we don't even know if it can be hit
for TDX, then I'd be included to go with (a). Especially since we are just
aiming for the most basic support, and don't have to worry about regressions in
the classical sense.

I'm not sure how easy it will be to root cause it at this point. Hopefully Yan
will be coming online soon. She mentioned some previous Intel effort to
investigate it. Presumably we would have to start with the old kernel that
exhibited the issue. If it can still be found...
Xiaoyao Li May 16, 2024, 1:12 a.m. UTC | #12
On 5/16/2024 7:20 AM, Sean Christopherson wrote:
>> But again, I think it's just too overkill for TDX.  We can just set the
>> ZAP_LEAF_ONLY flag for the slot when it is created in KVM.
> Ya, I'm convinced that adding uAPI is overkill at this point.

+1.

Making it configurable by userspace needs justification common enough.

If it's just for TDX specific and mandatory for TDX, just make it KVM 
internal thing for TDX.
Edgecombe, Rick P May 16, 2024, 2:21 a.m. UTC | #13
On Wed, 2024-05-15 at 16:56 -0700, Rick Edgecombe wrote:
> 
> If we think it is not a security issue, and we don't even know if it can be
> hit
> for TDX, then I'd be included to go with (a). Especially since we are just
> aiming for the most basic support, and don't have to worry about regressions
> in
> the classical sense.
> 
> I'm not sure how easy it will be to root cause it at this point. Hopefully Yan
> will be coming online soon. She mentioned some previous Intel effort to
> investigate it. Presumably we would have to start with the old kernel that
> exhibited the issue. If it can still be found...
> 

Weijiang filled me in. It sounds like a really tough one, as he described: "From
my experience, to reproduce the issue, it requires specific HW config + specific
SW + specific operations, I almost died on it"

If it shows up in TDX, we might get lucky with an easier reproducer...
Yan Zhao May 16, 2024, 3:56 a.m. UTC | #14
On Thu, May 16, 2024 at 07:56:18AM +0800, Edgecombe, Rick P wrote:
> On Wed, 2024-05-15 at 15:47 -0700, Sean Christopherson wrote:
> > > I didn't gather there was any proof of this. Did you have any hunch either
> > > way?
> > 
> > I doubt the guest was able to access memory it shouldn't have been able to
> > access.
> > But that's a moot point, as the bigger problem is that, because we have no
> > idea
> > what's at fault, KVM can't make any guarantees about the safety of such a
> > flag.
> > 
> > TDX is a special case where we don't have a better option (we do have other
> > options,
> > they're just horrible).  In other words, the choice is essentially to either:
> > 
> >  (a) cross our fingers and hope that the problem is limited to shared memory
> >      with QEMU+VFIO, i.e. and doesn't affect TDX private memory.
> > 
> > or 
> > 
> >  (b) don't merge TDX until the original regression is fully resolved.
> > 
> > FWIW, I would love to root cause and fix the failure, but I don't know how
> > feasible
> > that is at this point.
Me too. So curious about what's exactly broken.

> 
> If we think it is not a security issue, and we don't even know if it can be hit
> for TDX, then I'd be included to go with (a). Especially since we are just
> aiming for the most basic support, and don't have to worry about regressions in
> the classical sense.
> 
> I'm not sure how easy it will be to root cause it at this point. Hopefully Yan
> will be coming online soon. She mentioned some previous Intel effort to
> investigate it. Presumably we would have to start with the old kernel that
> exhibited the issue. If it can still be found...
I tried to reproduce it under the direction from Weijiang, though my NVIDIA card
was of a little difference as the one used by Weijiang.
However, I failed. I'm not sure whether it was because I did it remotely or
whether it was because I didn't spend enough time (since it's not an official
tasks assigned to me and I just did it out of curiosity).

If you think it's worthwhile, I would like to try again locally to see if I will
be lucky enough to reproduce and root-cause it.

But is it possible not to have TDX be pending on this bug/regression?
Paolo Bonzini May 17, 2024, 3:25 p.m. UTC | #15
On 5/15/24 21:09, Sean Christopherson wrote:
> Hmm, actually, we already have new uAPI/ABI in the form of VM types.  What if
> we squeeze a documentation update into 6.10 (which adds the SEV VM flavors) to
> state that KVM's historical behavior of blasting all SPTEs is only_guaranteed_
> for KVM_X86_DEFAULT_VM?
> 
> Anyone know if QEMU deletes shared-only, i.e. non-guest_memfd, memslots during
> SEV-* boot?

Yes, the process is mostly the same for normal UEFI boot, SEV and SEV-ES.

However, it does so while the VM is paused (remember the atomic memslot
updates attempts?  that's now enforced by QEMU).  So it's quite possible
that the old bug is not visible anymore, independent of why VFIO caused
it.

Paolo

> If so, and assuming any such memslots are smallish, we could even
> start enforcing the new ABI by doing a precise zap for small (arbitrary limit TBD)
> shared-only memslots for !KVM_X86_DEFAULT_VM VMs.
Paolo Bonzini May 17, 2024, 3:27 p.m. UTC | #16
On 5/15/24 22:05, Sean Christopherson wrote:
>> Again thinking of the userspace memory analogy... Aren't there some VMs where
>> the fast zap is faster? Like if you have guest with a small memslot that gets
>> deleted all the time, you could want it to be zapped specifically. But for the
>> giant memslot next to it, you might want to do the fast zap all thing.
>
> Yes.  But...

On the other hand, tearing down a giant memslot isn't really common. 
The main occurrence of memslots going away is 1) BIOS fiddling with low 
memory permissions;  2) PCI BARs.  The former is only happening at boot 
and with small memslots, the latter can in principle involve large
memslots but... just don't do it.

Paolo
Paolo Bonzini May 17, 2024, 3:30 p.m. UTC | #17
On 5/16/24 01:20, Sean Christopherson wrote:
> Hmm, a quirk isn't a bad idea.  It suffers the same problems as a memslot flag,
> i.e. who knows when it's safe to disable the quirk, but I would hope userspace
> would be much, much cautious about disabling a quirk that comes with a massive
> disclaimer.
> 
> Though I suspect Paolo will shoot this down too 
Yan Zhao May 22, 2024, 1:29 a.m. UTC | #18
On Fri, May 17, 2024 at 05:30:50PM +0200, Paolo Bonzini wrote:
> On 5/16/24 01:20, Sean Christopherson wrote:
> > Hmm, a quirk isn't a bad idea.  It suffers the same problems as a memslot flag,
> > i.e. who knows when it's safe to disable the quirk, but I would hope userspace
> > would be much, much cautious about disabling a quirk that comes with a massive
> > disclaimer.
> > 
> > Though I suspect Paolo will shoot this down too 
Sean Christopherson May 22, 2024, 2:31 a.m. UTC | #19
On Wed, May 22, 2024, Yan Zhao wrote:
> On Fri, May 17, 2024 at 05:30:50PM +0200, Paolo Bonzini wrote:
> > On 5/16/24 01:20, Sean Christopherson wrote:
> > > Hmm, a quirk isn't a bad idea.  It suffers the same problems as a memslot flag,
> > > i.e. who knows when it's safe to disable the quirk, but I would hope userspace
> > > would be much, much cautious about disabling a quirk that comes with a massive
> > > disclaimer.
> > > 
> > > Though I suspect Paolo will shoot this down too 
Yan Zhao May 22, 2024, 6:48 a.m. UTC | #20
On Tue, May 21, 2024 at 07:31:31PM -0700, Sean Christopherson wrote:
> On Wed, May 22, 2024, Yan Zhao wrote:
> > On Fri, May 17, 2024 at 05:30:50PM +0200, Paolo Bonzini wrote:
> > > On 5/16/24 01:20, Sean Christopherson wrote:
> > > > Hmm, a quirk isn't a bad idea.  It suffers the same problems as a memslot flag,
> > > > i.e. who knows when it's safe to disable the quirk, but I would hope userspace
> > > > would be much, much cautious about disabling a quirk that comes with a massive
> > > > disclaimer.
> > > > 
> > > > Though I suspect Paolo will shoot this down too 
Paolo Bonzini May 22, 2024, 3:45 p.m. UTC | #21
On Wed, May 22, 2024 at 8:49 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > Disabling the quirk would allow KVM to choose between a slow/precise/partial zap,
> > and full/fast zap.
> TDX needs to disable the quirk for slow/precise/partial zap, right?

Yes - and since TDX is a separate VM type it might even start with the
quirk disabled. For sure, the memslot flag is the worst option and I'd
really prefer to avoid it.

> > I have the same feeling that the bug is probably not reproducible with latest
> > KVM code

Or with the latest QEMU code, if it was related somehow to non-atomic
changes to the memory map.

Paolo
Yan Zhao May 24, 2024, 1:50 a.m. UTC | #22
On Wed, May 22, 2024 at 05:45:39PM +0200, Paolo Bonzini wrote:
> On Wed, May 22, 2024 at 8:49 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > Disabling the quirk would allow KVM to choose between a slow/precise/partial zap,
> > > and full/fast zap.
> > TDX needs to disable the quirk for slow/precise/partial zap, right?
> 
> Yes - and since TDX is a separate VM type it might even start with the
> quirk disabled. For sure, the memslot flag is the worst option and I'd
> really prefer to avoid it.
Thanks. Will implement a quirk and let TDX code in QEMU to disable the
quirk.

> 
> > > I have the same feeling that the bug is probably not reproducible with latest
> > > KVM code
> 
> Or with the latest QEMU code, if it was related somehow to non-atomic
> changes to the memory map.
>
Thanks for this input. Will check if it's related.
diff mbox series

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 61982da8c8b2..4a8e819794db 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6962,10 +6962,38 @@  void kvm_arch_flush_shadow_all(struct kvm *kvm)
 	kvm_mmu_zap_all(kvm);
 }
 
+static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+	if (KVM_BUG_ON(!tdp_mmu_enabled, kvm))
+		return;
+
+	write_lock(&kvm->mmu_lock);
+
+	/*
+	 * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
+	 * case scenario we'll have unused shadow pages lying around until they
+	 * are recycled due to age or when the VM is destroyed.
+	 */
+	struct kvm_gfn_range range = {
+		.slot = slot,
+		.start = slot->base_gfn,
+		.end = slot->base_gfn + slot->npages,
+		.may_block = true,
+	};
+
+	if (kvm_tdp_mmu_unmap_gfn_range(kvm, &range, false))
+		kvm_flush_remote_tlbs(kvm);
+
+	write_unlock(&kvm->mmu_lock);
+}
+
 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
 				   struct kvm_memory_slot *slot)
 {
-	kvm_mmu_zap_all_fast(kvm);
+	if (slot->flags & KVM_MEM_ZAP_LEAFS_ONLY)
+		kvm_mmu_zap_memslot_leafs(kvm, slot);
+	else
+		kvm_mmu_zap_all_fast(kvm);
 }
 
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7c593a081eba..4b3ec2ec79e9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12952,6 +12952,23 @@  int kvm_arch_prepare_memory_region(struct kvm *kvm,
 		if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn())
 			return -EINVAL;
 
+		/*
+		 * Since TDX private pages requires re-accepting after zap,
+		 * and TDX private root page should not be zapped, TDX requires
+		 * memslots for private memory must have flag
+		 * KVM_MEM_ZAP_LEAFS_ONLY set too, so that only leaf SPTEs of
+		 * the deleting memslot will be zapped and SPTEs in other
+		 * memslots would not be affected.
+		 */
+		if (kvm->arch.vm_type == KVM_X86_TDX_VM &&
+		    (new->flags & KVM_MEM_GUEST_MEMFD) &&
+		    !(new->flags & KVM_MEM_ZAP_LEAFS_ONLY))
+			return -EINVAL;
+
+		/* zap-leafs-only works only when TDP MMU is enabled for now */
+		if ((new->flags & KVM_MEM_ZAP_LEAFS_ONLY) && !tdp_mmu_enabled)
+			return -EINVAL;
+
 		return kvm_alloc_memslot_metadata(kvm, new);
 	}
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index aee67912e71c..d53648c19b26 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -51,6 +51,7 @@  struct kvm_userspace_memory_region2 {
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
 #define KVM_MEM_GUEST_MEMFD	(1UL << 2)
+#define KVM_MEM_ZAP_LEAFS_ONLY	(1UL << 3)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 81b90bf03f2f..1b1ffb6fc786 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1568,6 +1568,8 @@  static int check_memory_region_flags(struct kvm *kvm,
 	if (kvm_arch_has_private_mem(kvm))
 		valid_flags |= KVM_MEM_GUEST_MEMFD;
 
+	valid_flags |= KVM_MEM_ZAP_LEAFS_ONLY;
+
 	/* Dirty logging private memory is not currently supported. */
 	if (mem->flags & KVM_MEM_GUEST_MEMFD)
 		valid_flags &= ~KVM_MEM_LOG_DIRTY_PAGES;
@@ -2052,7 +2054,8 @@  int __kvm_set_memory_region(struct kvm *kvm,
 			return -EINVAL;
 		if ((mem->userspace_addr != old->userspace_addr) ||
 		    (npages != old->npages) ||
-		    ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
+		    ((mem->flags ^ old->flags) &
+		     (KVM_MEM_READONLY | KVM_MEM_ZAP_LEAFS_ONLY)))
 			return -EINVAL;
 
 		if (base_gfn != old->base_gfn)