
[RFC] KVM: TDX: Defer guest memory removal to decrease shutdown time

Message ID 20250313181629.17764-1-adrian.hunter@intel.com (mailing list archive)
State New
Series [RFC] KVM: TDX: Defer guest memory removal to decrease shutdown time

Commit Message

Adrian Hunter March 13, 2025, 6:16 p.m. UTC
Improve TDX shutdown performance by adding a more efficient shutdown
operation at the cost of adding separate branches for the TDX MMU
operations for normal runtime and shutdown.  This more efficient method was
previously used in earlier versions of the TDX patches, but was removed to
simplify the initial upstreaming.  This is an RFC, and still needs a proper
upstream commit log. It is intended to be an eventual follow up to base
support.

== Background ==

TDX has 2 methods for the host to reclaim guest private memory, depending
on whether the TD (TDX VM) is in a runnable state or not.  These are
called, respectively:
  1. Dynamic Page Removal
  2. Reclaiming TD Pages in TD_TEARDOWN State

Dynamic Page Removal is much slower.  Reclaiming a 4K page in TD_TEARDOWN
state can be 5 times faster, although that is just one part of TD shutdown.

== Relevant TDX Architecture ==

Dynamic Page Removal is slow because it must potentially deal with a
running TD, and so involves a number of steps (see the sketch after this
list):
	Block further address translation
	Exit each VCPU
	Clear Secure EPT entry
	Flush/write-back/invalidate relevant caches
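
For illustration only, an outline of that path as C comments: the function
name below is made up, the real flow lives in tdx_sept_remove_private_spte()
and its helpers, and the TDH.* names are TDX Module SEAMCALL leaf functions:

	static int dynamic_page_removal_outline(struct kvm *kvm, gfn_t gfn,
						struct page *page)
	{
		/* 1. Block further address translation: TDH.MEM.RANGE.BLOCK. */
		/* 2. Exit each VCPU and track the TLB flush: TDH.MEM.TRACK plus
		 *    kicking every vCPU out of guest mode.
		 */
		/* 3. Clear the Secure EPT entry: TDH.MEM.PAGE.REMOVE. */
		/* 4. Flush/write-back/invalidate relevant caches:
		 *    TDH.PHYMEM.PAGE.WBINVD.
		 */
		return 0;
	}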

Reclaiming TD Pages in TD_TEARDOWN State is fast because the shutdown
procedure (see tdx_mmu_release_hkid()) has already performed the relevant
flushing.  For details, see TDX Module Base Spec October 2024 sections:

	7.5.   TD Teardown Sequence
	5.6.3. TD Keys Reclamation, TLB and Cache Flush

Essentially all that remains then is to take each page away from the
TDX Module and return it to the kernel.
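
For comparison, a minimal sketch of the TD_TEARDOWN path, essentially what
this RFC adds below as tdx_sept_teardown_private_spte(); the wrapper name
here is illustrative and assumes tdx_reclaim_page()/tdx_unpin() behave as
they do in the diff:

	static int td_teardown_reclaim_sketch(struct kvm *kvm, struct page *page)
	{
		int ret;

		/* Take the page back from the TDX Module (TDH.PHYMEM.PAGE.RECLAIM). */
		ret = tdx_reclaim_page(page);
		if (ret)
			return ret;

		/* Return it to the kernel by dropping the fault-time reference. */
		tdx_unpin(kvm, page);
		return 0;
	}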

== Problem ==

Currently, Dynamic Page Removal is used when the TD is being shut down,
for the sake of simpler initial code.

This happens when guest_memfds are closed; see kvm_gmem_release().
guest_memfds hold a reference to struct kvm, so VM destruction cannot
happen until after they are released.

Reclaiming TD Pages in TD_TEARDOWN State was seen to decrease the total
reclaim time.  For example:

	VCPUs	Size (GB)	Before (secs)	After (secs)
	 4	 18		 72		 24
	32	107		517		134

Note, the V19 patch set:

	https://lore.kernel.org/all/cover.1708933498.git.isaku.yamahata@intel.com/

did not have this issue because the HKID was released early, something that
Sean effectively NAK'ed:

	"No, the right answer is to not release the HKID until the VM is
	destroyed."

	https://lore.kernel.org/all/ZN+1QHGa6ltpQxZn@google.com/

That was taken on board in the "TDX MMU Part 2" patch set.  Refer
"Moving of tdx_mmu_release_hkid()" in:

	https://lore.kernel.org/kvm/20240904030751.117579-1-rick.p.edgecombe@intel.com/

== Options ==

  1. Start TD teardown earlier so that when pages are removed,
  they can be reclaimed faster.
  2. Defer page removal until after TD teardown has started.
  3. A combination of 1 and 2.

Option 1 is problematic because it means putting the TD into a non-runnable
state while it is potentially still active. Also, as mentioned above, Sean
effectively NAK'ed it.

Option 2 is possible because the lifetime of guest memory pages is separate
from the guest_memfd (struct kvm_gmem) lifetime.

A reference is taken on pages when they are faulted in; see
kvm_gmem_get_pfn().  That reference is not released until the pages are
removed from the mirror SEPT; see tdx_unpin().
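
For illustration, a simplified sketch of that reference handling, assuming
tdx_unpin() is still the thin put_page() wrapper it was in earlier TDX
series:

	/*
	 * kvm_gmem_get_pfn() hands out the page with a reference held; removing
	 * the page from the mirror SEPT drops it again.
	 */
	static void tdx_unpin(struct kvm *kvm, struct page *page)
	{
		put_page(page);
	}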

Option 3 is not needed because TD teardown begins during VM destruction
before pages are reclaimed.  TD_TEARDOWN state is entered by
tdx_mmu_release_hkid(), whereas pages are reclaimed by tdp_mmu_zap_root(),
as follows:

    kvm_arch_destroy_vm()
        ...
        vt_vm_pre_destroy()
            tdx_mmu_release_hkid()
        ...
        kvm_mmu_uninit_vm()
            kvm_mmu_uninit_tdp_mmu()
                kvm_tdp_mmu_invalidate_roots()
                kvm_tdp_mmu_zap_invalidated_roots()
                    tdp_mmu_zap_root()

== Proof of Concept for option 2 ==

Assume user space never needs to close a guest_memfd except as part of VM
shutdown.

Add a callback from kvm_gmem_release() to decide whether to defer removal.
For TDX, record the inode (up to a max. of 64 inodes) and pin it.

Amend the release of guest_memfds to skip removing pages from the MMU
in that case.

Amend TDX private memory page removal to detect TD_TEARDOWN state, and
reclaim the page accordingly.

Finally, for TDX, unpin any pinned inodes when the VM is destroyed.

This hopefully illustrates what needs to be done, but guidance is sought
for the best way to do it.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  3 ++
 arch/x86/kvm/Kconfig               |  1 +
 arch/x86/kvm/vmx/main.c            | 12 +++++++-
 arch/x86/kvm/vmx/tdx.c             | 47 +++++++++++++++++++++++++-----
 arch/x86/kvm/vmx/tdx.h             | 14 +++++++++
 arch/x86/kvm/vmx/x86_ops.h         |  2 ++
 arch/x86/kvm/x86.c                 |  7 +++++
 include/linux/kvm_host.h           |  5 ++++
 virt/kvm/Kconfig                   |  4 +++
 virt/kvm/guest_memfd.c             | 26 ++++++++++++-----
 11 files changed, 107 insertions(+), 15 deletions(-)

Comments

Paolo Bonzini March 13, 2025, 6:39 p.m. UTC | #1
On Thu, Mar 13, 2025 at 7:16 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
> Improve TDX shutdown performance by adding a more efficient shutdown
> operation at the cost of adding separate branches for the TDX MMU
> operations for normal runtime and shutdown.  This more efficient method was
> previously used in earlier versions of the TDX patches, but was removed to
> simplify the initial upstreaming.  This is an RFC, and still needs a proper
> upstream commit log. It is intended to be an eventual follow up to base
> support.

In the latest code the HKID is released in kvm_arch_pre_destroy_vm().
That is before kvm_free_memslot() calls kvm_gmem_unbind(), which
results in fput() and hence kvm_gmem_release().

So, as long as userspace doesn't remove the memslots and close the
guestmemfds, shouldn't the TD_TEARDOWN method be usable?

Paolo

Adrian Hunter March 13, 2025, 7:07 p.m. UTC | #2
On 13/03/25 20:39, Paolo Bonzini wrote:
> On Thu, Mar 13, 2025 at 7:16 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
>> Improve TDX shutdown performance by adding a more efficient shutdown
>> operation at the cost of adding separate branches for the TDX MMU
>> operations for normal runtime and shutdown.  This more efficient method was
>> previously used in earlier versions of the TDX patches, but was removed to
>> simplify the initial upstreaming.  This is an RFC, and still needs a proper
>> upstream commit log. It is intended to be an eventual follow up to base
>> support.
> 
> In the latest code the HKID is released in kvm_arch_pre_destroy_vm().

I am looking at kvm-coco-queue

> That is before kvm_free_memslot() calls kvm_gmem_unbind(), which
> results in fput() and hence kvm_gmem_release().
> 
> So, as long as userspace doesn't remove the memslots and close the
> guestmemfds, shouldn't the TD_TEARDOWN method be usable?

kvm_arch_pre_destroy_vm() is called from kvm_destroy_vm()
which won't happen before kvm_gmem_release() calls
kvm_put_kvm().
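
That is, the final guest_memfd reference going away is what eventually
triggers VM destruction, roughly (intermediate calls omitted):

	kvm_gmem_release()
	    kvm_put_kvm()
	        kvm_destroy_vm()
	            kvm_arch_pre_destroy_vm()
	                tdx_mmu_release_hkid()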

Patch

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 79406bf07a1c..e4728f1fe646 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -148,6 +148,7 @@  KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
 KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
 KVM_X86_OP_OPTIONAL_RET0(private_max_mapping_level)
 KVM_X86_OP_OPTIONAL(gmem_invalidate)
+KVM_X86_OP_OPTIONAL_RET0(gmem_defer_removal)
 
 #undef KVM_X86_OP
 #undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9b9dde476f3c..d1afb4e1c2ee 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1661,6 +1661,8 @@  static inline u16 kvm_lapic_irq_dest_mode(bool dest_mode_logical)
 	return dest_mode_logical ? APIC_DEST_LOGICAL : APIC_DEST_PHYSICAL;
 }
 
+struct inode;
+
 struct kvm_x86_ops {
 	const char *name;
 
@@ -1888,6 +1890,7 @@  struct kvm_x86_ops {
 	int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
 	void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
 	int (*private_max_mapping_level)(struct kvm *kvm, kvm_pfn_t pfn);
+	int (*gmem_defer_removal)(struct kvm *kvm, struct inode *inode);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 0d445a317f61..32c4b9922e7b 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -96,6 +96,7 @@  config KVM_INTEL
 	depends on KVM && IA32_FEAT_CTL
 	select KVM_GENERIC_PRIVATE_MEM if INTEL_TDX_HOST
 	select KVM_GENERIC_MEMORY_ATTRIBUTES if INTEL_TDX_HOST
+	select HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL if INTEL_TDX_HOST
 	help
 	  Provides support for KVM on processors equipped with Intel's VT
 	  extensions, a.k.a. Virtual Machine Extensions (VMX).
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 94d5d907d37b..b835006e1282 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -888,6 +888,14 @@  static int vt_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
 	return 0;
 }
 
+static int vt_gmem_defer_removal(struct kvm *kvm, struct inode *inode)
+{
+	if (is_td(kvm))
+		return tdx_gmem_defer_removal(kvm, inode);
+
+	return 0;
+}
+
 #define VMX_REQUIRED_APICV_INHIBITS				\
 	(BIT(APICV_INHIBIT_REASON_DISABLED) |			\
 	 BIT(APICV_INHIBIT_REASON_ABSENT) |			\
@@ -1046,7 +1054,9 @@  struct kvm_x86_ops vt_x86_ops __initdata = {
 	.mem_enc_ioctl = vt_mem_enc_ioctl,
 	.vcpu_mem_enc_ioctl = vt_vcpu_mem_enc_ioctl,
 
-	.private_max_mapping_level = vt_gmem_private_max_mapping_level
+	.private_max_mapping_level = vt_gmem_private_max_mapping_level,
+
+	.gmem_defer_removal = vt_gmem_defer_removal,
 };
 
 struct kvm_x86_init_ops vt_init_ops __initdata = {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index d9eb20516c71..51bbb44ac1bd 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -5,6 +5,7 @@ 
 #include <asm/fpu/xcr.h>
 #include <linux/misc_cgroup.h>
 #include <linux/mmu_context.h>
+#include <linux/fs.h>
 #include <asm/tdx.h>
 #include "capabilities.h"
 #include "mmu.h"
@@ -594,10 +595,20 @@  static void tdx_reclaim_td_control_pages(struct kvm *kvm)
 	kvm_tdx->td.tdr_page = NULL;
 }
 
+static void tdx_unpin_inodes(struct kvm *kvm)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
+	for (int i = 0; i < kvm_tdx->nr_gmem_inodes; i++)
+		iput(kvm_tdx->gmem_inodes[i]);
+}
+
 void tdx_vm_destroy(struct kvm *kvm)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 
+	tdx_unpin_inodes(kvm);
+
 	tdx_reclaim_td_control_pages(kvm);
 
 	kvm_tdx->state = TD_STATE_UNINITIALIZED;
@@ -1778,19 +1789,28 @@  int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
 	return tdx_reclaim_page(virt_to_page(private_spt));
 }
 
+static int tdx_sept_teardown_private_spte(struct kvm *kvm, enum pg_level level, struct page *page)
+{
+	int ret;
+
+	if (level != PG_LEVEL_4K)
+		return -EINVAL;
+
+	ret = tdx_reclaim_page(page);
+	if (!ret)
+		tdx_unpin(kvm, page);
+
+	return ret;
+}
+
 int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 				 enum pg_level level, kvm_pfn_t pfn)
 {
 	struct page *page = pfn_to_page(pfn);
 	int ret;
 
-	/*
-	 * HKID is released after all private pages have been removed, and set
-	 * before any might be populated. Warn if zapping is attempted when
-	 * there can't be anything populated in the private EPT.
-	 */
-	if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
-		return -EINVAL;
+	if (!is_hkid_assigned(to_kvm_tdx(kvm)))
+		return tdx_sept_teardown_private_spte(kvm, level, pfn_to_page(pfn));
 
 	ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
 	if (ret <= 0)
@@ -3221,6 +3241,19 @@  int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
 	return PG_LEVEL_4K;
 }
 
+int tdx_gmem_defer_removal(struct kvm *kvm, struct inode *inode)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
+	if (kvm_tdx->nr_gmem_inodes >= TDX_MAX_GMEM_INODES)
+		return 0;
+
+	kvm_tdx->gmem_inodes[kvm_tdx->nr_gmem_inodes++] = inode;
+	ihold(inode);
+
+	return 1;
+}
+
 static int tdx_online_cpu(unsigned int cpu)
 {
 	unsigned long flags;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 6b3bebebabfa..fb5c4face131 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -20,6 +20,10 @@  enum kvm_tdx_state {
 	TD_STATE_RUNNABLE,
 };
 
+struct inode;
+
+#define TDX_MAX_GMEM_INODES 64
+
 struct kvm_tdx {
 	struct kvm kvm;
 
@@ -43,6 +47,16 @@  struct kvm_tdx {
 	 * Set/unset is protected with kvm->mmu_lock.
 	 */
 	bool wait_for_sept_zap;
+
+	/*
+	 * For pages that will not be removed until TD shutdown, the associated
+	 * guest_memfd inode is pinned.  Allow for a fixed number of pinned
+	 * inodes.  If there are more, then when the guest_memfd is closed,
+	 * their pages will be removed safely but inefficiently prior to
+	 * shutdown.
+	 */
+	struct inode *gmem_inodes[TDX_MAX_GMEM_INODES];
+	int nr_gmem_inodes;
 };
 
 /* TDX module vCPU states */
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 4704bed033b1..4ee123289d85 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -164,6 +164,7 @@  void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
 void tdx_flush_tlb_all(struct kvm_vcpu *vcpu);
 void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
 int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn);
+int tdx_gmem_defer_removal(struct kvm *kvm, struct inode *inode);
 #else
 static inline void tdx_disable_virtualization_cpu(void) {}
 static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
@@ -229,6 +230,7 @@  static inline void tdx_flush_tlb_current(struct kvm_vcpu *vcpu) {}
 static inline void tdx_flush_tlb_all(struct kvm_vcpu *vcpu) {}
 static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
 static inline int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn) { return 0; }
+static inline int tdx_gmem_defer_removal(struct kvm *kvm, struct inode *inode) { return 0; }
 #endif
 
 #endif /* __KVM_X86_VMX_X86_OPS_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 03db366e794a..96ebf0303223 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13659,6 +13659,13 @@  void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
 }
 #endif
 
+#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL
+bool kvm_arch_gmem_defer_removal(struct kvm *kvm, struct inode *inode)
+{
+	return kvm_x86_call(gmem_defer_removal)(kvm, inode);
+}
+#endif
+
 int kvm_spec_ctrl_test_value(u64 value)
 {
 	/*
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d72ba0a4ca0e..f5c8b1923c24 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2534,6 +2534,11 @@  static inline int kvm_gmem_get_pfn(struct kvm *kvm,
 int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order);
 #endif
 
+#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL
+struct inode;
+bool kvm_arch_gmem_defer_removal(struct kvm *kvm, struct inode *inode);
+#endif
+
 #ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
 /**
  * kvm_gmem_populate() - Populate/prepare a GPA range with guest data
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 54e959e7d68f..589111505ad0 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -124,3 +124,7 @@  config HAVE_KVM_ARCH_GMEM_PREPARE
 config HAVE_KVM_ARCH_GMEM_INVALIDATE
        bool
        depends on KVM_PRIVATE_MEM
+
+config HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL
+       bool
+       depends on KVM_PRIVATE_MEM
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index b2aa6bf24d3a..cd485f45fdaf 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -248,6 +248,15 @@  static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset,
 	return ret;
 }
 
+inline bool kvm_gmem_defer_removal(struct kvm *kvm, struct inode *inode)
+{
+#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL
+	return kvm_arch_gmem_defer_removal(kvm, inode);
+#else
+	return false;
+#endif
+}
+
 static int kvm_gmem_release(struct inode *inode, struct file *file)
 {
 	struct kvm_gmem *gmem = file->private_data;
@@ -275,13 +284,16 @@  static int kvm_gmem_release(struct inode *inode, struct file *file)
 	xa_for_each(&gmem->bindings, index, slot)
 		WRITE_ONCE(slot->gmem.file, NULL);
 
-	/*
-	 * All in-flight operations are gone and new bindings can be created.
-	 * Zap all SPTEs pointed at by this file.  Do not free the backing
-	 * memory, as its lifetime is associated with the inode, not the file.
-	 */
-	kvm_gmem_invalidate_begin(gmem, 0, -1ul);
-	kvm_gmem_invalidate_end(gmem, 0, -1ul);
+	if (!kvm_gmem_defer_removal(kvm, inode)) {
+		/*
+		 * All in-flight operations are gone and new bindings can be
+		 * created.  Zap all SPTEs pointed at by this file.  Do not free
+		 * the backing memory, as its lifetime is associated with the
+		 * inode, not the file.
+		 */
+		kvm_gmem_invalidate_begin(gmem, 0, -1ul);
+		kvm_gmem_invalidate_end(gmem, 0, -1ul);
+	}
 
 	list_del(&gmem->entry);