
[RFC] KVM: TDX: Defer guest memory removal to decrease shutdown time

Message ID 20250313181629.17764-1-adrian.hunter@intel.com (mailing list archive)
State New
Series [RFC] KVM: TDX: Defer guest memory removal to decrease shutdown time

Commit Message

Adrian Hunter March 13, 2025, 6:16 p.m. UTC
Improve TDX shutdown performance by adding a more efficient shutdown
operation at the cost of adding separate branches for the TDX MMU
operations for normal runtime and shutdown.  This more efficient method was
previously used in earlier versions of the TDX patches, but was removed to
simplify the initial upstreaming.  This is an RFC, and still needs a proper
upstream commit log. It is intended to be an eventual follow up to base
support.

== Background ==

TDX has 2 methods for the host to reclaim guest private memory, depending
on whether the TD (TDX VM) is in a runnable state or not.  These are
called, respectively:
  1. Dynamic Page Removal
  2. Reclaiming TD Pages in TD_TEARDOWN State

Dynamic Page Removal is much slower.  Reclaiming a 4K page in TD_TEARDOWN
state can be 5 times faster, although that is just one part of TD shutdown.

== Relevant TDX Architecture ==

Dynamic Page Removal is slow because it must potentially deal with a
running TD, and so involves a number of steps (see the sketch after this
list):
	Block further address translation
	Exit each VCPU
	Clear Secure EPT entry
	Flush/write-back/invalidate relevant caches
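
For illustration only, an outline of that path as C comments: the function
name below is made up, the real flow lives in tdx_sept_remove_private_spte()
and its helpers, and the TDH.* names are TDX Module SEAMCALL leaf functions:

	static int dynamic_page_removal_outline(struct kvm *kvm, gfn_t gfn,
						struct page *page)
	{
		/* 1. Block further address translation: TDH.MEM.RANGE.BLOCK. */
		/* 2. Exit each VCPU and track the TLB flush: TDH.MEM.TRACK plus
		 *    kicking every vCPU out of guest mode.
		 */
		/* 3. Clear the Secure EPT entry: TDH.MEM.PAGE.REMOVE. */
		/* 4. Flush/write-back/invalidate relevant caches:
		 *    TDH.PHYMEM.PAGE.WBINVD.
		 */
		return 0;
	}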

Reclaiming TD Pages in TD_TEARDOWN State is fast because the shutdown
procedure (see tdx_mmu_release_hkid()) has already performed the relevant
flushing.  For details, see TDX Module Base Spec October 2024 sections:

	7.5.   TD Teardown Sequence
	5.6.3. TD Keys Reclamation, TLB and Cache Flush

Essentially all that remains then is to take each page away from the
TDX Module and return it to the kernel.
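
For comparison, a minimal sketch of the TD_TEARDOWN path, essentially what
this RFC adds below as tdx_sept_teardown_private_spte(); the wrapper name
here is illustrative and assumes tdx_reclaim_page()/tdx_unpin() behave as
they do in the diff:

	static int td_teardown_reclaim_sketch(struct kvm *kvm, struct page *page)
	{
		int ret;

		/* Take the page back from the TDX Module (TDH.PHYMEM.PAGE.RECLAIM). */
		ret = tdx_reclaim_page(page);
		if (ret)
			return ret;

		/* Return it to the kernel by dropping the fault-time reference. */
		tdx_unpin(kvm, page);
		return 0;
	}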

== Problem ==

Currently, Dynamic Page Removal is used when the TD is being shut down,
for the sake of simpler initial code.

This happens when guest_memfds are closed; see kvm_gmem_release().
guest_memfds hold a reference to struct kvm, so VM destruction cannot
happen until after they are released.

Reclaiming TD Pages in TD_TEARDOWN State was seen to decrease the total
reclaim time.  For example:

	VCPUs	Size (GB)	Before (secs)	After (secs)
	 4	 18		 72		 24
	32	107		517		134

Note, the V19 patch set:

	https://lore.kernel.org/all/cover.1708933498.git.isaku.yamahata@intel.com/

did not have this issue because the HKID was released early, something that
Sean effectively NAK'ed:

	"No, the right answer is to not release the HKID until the VM is
	destroyed."

	https://lore.kernel.org/all/ZN+1QHGa6ltpQxZn@google.com/

That was taken on board in the "TDX MMU Part 2" patch set.  Refer
"Moving of tdx_mmu_release_hkid()" in:

	https://lore.kernel.org/kvm/20240904030751.117579-1-rick.p.edgecombe@intel.com/

== Options ==

  1. Start TD teardown earlier so that when pages are removed,
  they can be reclaimed faster.
  2. Defer page removal until after TD teardown has started.
  3. A combination of 1 and 2.

Option 1 is problematic because it means putting the TD into a non-runnable
state while it is potentially still active. Also, as mentioned above, Sean
effectively NAK'ed it.

Option 2 is possible because the lifetime of guest memory pages is separate
from the guest_memfd (struct kvm_gmem) lifetime.

A reference is taken on pages when they are faulted in; see
kvm_gmem_get_pfn().  That reference is not released until the pages are
removed from the mirror SEPT; see tdx_unpin().
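
For illustration, a simplified sketch of that reference handling, assuming
tdx_unpin() is still the thin put_page() wrapper it was in earlier TDX
series:

	/*
	 * kvm_gmem_get_pfn() hands out the page with a reference held; removing
	 * the page from the mirror SEPT drops it again.
	 */
	static void tdx_unpin(struct kvm *kvm, struct page *page)
	{
		put_page(page);
	}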

Option 3 is not needed because TD teardown begins during VM destruction
before pages are reclaimed.  TD_TEARDOWN state is entered by
tdx_mmu_release_hkid(), whereas pages are reclaimed by tdp_mmu_zap_root(),
as follows:

    kvm_arch_destroy_vm()
        ...
        vt_vm_pre_destroy()
            tdx_mmu_release_hkid()
        ...
        kvm_mmu_uninit_vm()
            kvm_mmu_uninit_tdp_mmu()
                kvm_tdp_mmu_invalidate_roots()
                kvm_tdp_mmu_zap_invalidated_roots()
                    tdp_mmu_zap_root()

== Proof of Concept for option 2 ==

Assume user space never needs to close a guest_memfd except as part of VM
shutdown.

Add a callback from kvm_gmem_release() to decide whether to defer removal.
For TDX, record the inode (up to a max. of 64 inodes) and pin it.

Amend the release of guest_memfds to skip removing pages from the MMU
in that case.

Amend TDX private memory page removal to detect TD_TEARDOWN state, and
reclaim the page accordingly.

Finally, for TDX, unpin any pinned inodes when the VM is destroyed.

This hopefully illustrates what needs to be done, but guidance is sought
for the best way to do it.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  3 ++
 arch/x86/kvm/Kconfig               |  1 +
 arch/x86/kvm/vmx/main.c            | 12 +++++++-
 arch/x86/kvm/vmx/tdx.c             | 47 +++++++++++++++++++++++++-----
 arch/x86/kvm/vmx/tdx.h             | 14 +++++++++
 arch/x86/kvm/vmx/x86_ops.h         |  2 ++
 arch/x86/kvm/x86.c                 |  7 +++++
 include/linux/kvm_host.h           |  5 ++++
 virt/kvm/Kconfig                   |  4 +++
 virt/kvm/guest_memfd.c             | 26 ++++++++++++-----
 11 files changed, 107 insertions(+), 15 deletions(-)

Comments

Paolo Bonzini March 13, 2025, 6:39 p.m. UTC | #1
On Thu, Mar 13, 2025 at 7:16 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
> Improve TDX shutdown performance by adding a more efficient shutdown
> operation at the cost of adding separate branches for the TDX MMU
> operations for normal runtime and shutdown.  This more efficient method was
> previously used in earlier versions of the TDX patches, but was removed to
> simplify the initial upstreaming.  This is an RFC, and still needs a proper
> upstream commit log. It is intended to be an eventual follow up to base
> support.

In the latest code the HKID is released in kvm_arch_pre_destroy_vm().
That is before kvm_free_memslot() calls kvm_gmem_unbind(), which
results in fput() and hence kvm_gmem_release().

So, as long as userspace doesn't remove the memslots and close the
guestmemfds, shouldn't the TD_TEARDOWN method be usable?

Paolo

Adrian Hunter March 13, 2025, 7:07 p.m. UTC | #2
On 13/03/25 20:39, Paolo Bonzini wrote:
> On Thu, Mar 13, 2025 at 7:16 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
>> Improve TDX shutdown performance by adding a more efficient shutdown
>> operation at the cost of adding separate branches for the TDX MMU
>> operations for normal runtime and shutdown.  This more efficient method was
>> previously used in earlier versions of the TDX patches, but was removed to
>> simplify the initial upstreaming.  This is an RFC, and still needs a proper
>> upstream commit log. It is intended to be an eventual follow up to base
>> support.
> 
> In the latest code the HKID is released in kvm_arch_pre_destroy_vm().

I am looking at kvm-coco-queue

> That is before kvm_free_memslot() calls kvm_gmem_unbind(), which
> results in fput() and hence kvm_gmem_release().
> 
> So, as long as userspace doesn't remove the memslots and close the
> guestmemfds, shouldn't the TD_TEARDOWN method be usable?

kvm_arch_pre_destroy_vm() is called from kvm_destroy_vm()
which won't happen before kvm_gmem_release() calls
kvm_put_kvm().
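
That is, the final guest_memfd reference going away is what eventually
triggers VM destruction, roughly (intermediate calls omitted):

	kvm_gmem_release()
	    kvm_put_kvm()
	        kvm_destroy_vm()
	            kvm_arch_pre_destroy_vm()
	                tdx_mmu_release_hkid()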

Patch

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 79406bf07a1c..e4728f1fe646 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -148,6 +148,7 @@  KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
 KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
 KVM_X86_OP_OPTIONAL_RET0(private_max_mapping_level)
 KVM_X86_OP_OPTIONAL(gmem_invalidate)
+KVM_X86_OP_OPTIONAL_RET0(gmem_defer_removal)
 
 #undef KVM_X86_OP
 #undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9b9dde476f3c..d1afb4e1c2ee 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1661,6 +1661,8 @@  static inline u16 kvm_lapic_irq_dest_mode(bool dest_mode_logical)
 	return dest_mode_logical ? APIC_DEST_LOGICAL : APIC_DEST_PHYSICAL;
 }
 
+struct inode;
+
 struct kvm_x86_ops {
 	const char *name;
 
@@ -1888,6 +1890,7 @@  struct kvm_x86_ops {
 	int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
 	void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
 	int (*private_max_mapping_level)(struct kvm *kvm, kvm_pfn_t pfn);
+	int (*gmem_defer_removal)(struct kvm *kvm, struct inode *inode);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 0d445a317f61..32c4b9922e7b 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -96,6 +96,7 @@  config KVM_INTEL
 	depends on KVM && IA32_FEAT_CTL
 	select KVM_GENERIC_PRIVATE_MEM if INTEL_TDX_HOST
 	select KVM_GENERIC_MEMORY_ATTRIBUTES if INTEL_TDX_HOST
+	select HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL if INTEL_TDX_HOST
 	help
 	  Provides support for KVM on processors equipped with Intel's VT
 	  extensions, a.k.a. Virtual Machine Extensions (VMX).
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 94d5d907d37b..b835006e1282 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -888,6 +888,14 @@  static int vt_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
 	return 0;
 }
 
+static int vt_gmem_defer_removal(struct kvm *kvm, struct inode *inode)
+{
+	if (is_td(kvm))
+		return tdx_gmem_defer_removal(kvm, inode);
+
+	return 0;
+}
+
 #define VMX_REQUIRED_APICV_INHIBITS				\
 	(BIT(APICV_INHIBIT_REASON_DISABLED) |			\
 	 BIT(APICV_INHIBIT_REASON_ABSENT) |			\
@@ -1046,7 +1054,9 @@  struct kvm_x86_ops vt_x86_ops __initdata = {
 	.mem_enc_ioctl = vt_mem_enc_ioctl,
 	.vcpu_mem_enc_ioctl = vt_vcpu_mem_enc_ioctl,
 
-	.private_max_mapping_level = vt_gmem_private_max_mapping_level
+	.private_max_mapping_level = vt_gmem_private_max_mapping_level,
+
+	.gmem_defer_removal = vt_gmem_defer_removal,
 };
 
 struct kvm_x86_init_ops vt_init_ops __initdata = {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index d9eb20516c71..51bbb44ac1bd 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -5,6 +5,7 @@ 
 #include <asm/fpu/xcr.h>
 #include <linux/misc_cgroup.h>
 #include <linux/mmu_context.h>
+#include <linux/fs.h>
 #include <asm/tdx.h>
 #include "capabilities.h"
 #include "mmu.h"
@@ -594,10 +595,20 @@  static void tdx_reclaim_td_control_pages(struct kvm *kvm)
 	kvm_tdx->td.tdr_page = NULL;
 }
 
+static void tdx_unpin_inodes(struct kvm *kvm)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
+	for (int i = 0; i < kvm_tdx->nr_gmem_inodes; i++)
+		iput(kvm_tdx->gmem_inodes[i]);
+}
+
 void tdx_vm_destroy(struct kvm *kvm)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 
+	tdx_unpin_inodes(kvm);
+
 	tdx_reclaim_td_control_pages(kvm);
 
 	kvm_tdx->state = TD_STATE_UNINITIALIZED;
@@ -1778,19 +1789,28 @@  int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
 	return tdx_reclaim_page(virt_to_page(private_spt));
 }
 
+static int tdx_sept_teardown_private_spte(struct kvm *kvm, enum pg_level level, struct page *page)
+{
+	int ret;
+
+	if (level != PG_LEVEL_4K)
+		return -EINVAL;
+
+	ret = tdx_reclaim_page(page);
+	if (!ret)
+		tdx_unpin(kvm, page);
+
+	return ret;
+}
+
 int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 				 enum pg_level level, kvm_pfn_t pfn)
 {
 	struct page *page = pfn_to_page(pfn);
 	int ret;
 
-	/*
-	 * HKID is released after all private pages have been removed, and set
-	 * before any might be populated. Warn if zapping is attempted when
-	 * there can't be anything populated in the private EPT.
-	 */
-	if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
-		return -EINVAL;
+	if (!is_hkid_assigned(to_kvm_tdx(kvm)))
+		return tdx_sept_teardown_private_spte(kvm, level, pfn_to_page(pfn));
 
 	ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
 	if (ret <= 0)
@@ -3221,6 +3241,19 @@  int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
 	return PG_LEVEL_4K;
 }
 
+int tdx_gmem_defer_removal(struct kvm *kvm, struct inode *inode)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
+	if (kvm_tdx->nr_gmem_inodes >= TDX_MAX_GMEM_INODES)
+		return 0;
+
+	kvm_tdx->gmem_inodes[kvm_tdx->nr_gmem_inodes++] = inode;
+	ihold(inode);
+
+	return 1;
+}
+
 static int tdx_online_cpu(unsigned int cpu)
 {
 	unsigned long flags;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 6b3bebebabfa..fb5c4face131 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -20,6 +20,10 @@  enum kvm_tdx_state {
 	TD_STATE_RUNNABLE,
 };
 
+struct inode;
+
+#define TDX_MAX_GMEM_INODES 64
+
 struct kvm_tdx {
 	struct kvm kvm;
 
@@ -43,6 +47,16 @@  struct kvm_tdx {
 	 * Set/unset is protected with kvm->mmu_lock.
 	 */
 	bool wait_for_sept_zap;
+
+	/*
+	 * For pages that will not be removed until TD shutdown, the associated
+	 * guest_memfd inode is pinned.  Allow for a fixed number of pinned
+	 * inodes.  If there are more, then when the guest_memfd is closed,
+	 * their pages will be removed safely but inefficiently prior to
+	 * shutdown.
+	 */
+	struct inode *gmem_inodes[TDX_MAX_GMEM_INODES];
+	int nr_gmem_inodes;
 };
 
 /* TDX module vCPU states */
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 4704bed033b1..4ee123289d85 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -164,6 +164,7 @@  void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
 void tdx_flush_tlb_all(struct kvm_vcpu *vcpu);
 void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
 int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn);
+int tdx_gmem_defer_removal(struct kvm *kvm, struct inode *inode);
 #else
 static inline void tdx_disable_virtualization_cpu(void) {}
 static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
@@ -229,6 +230,7 @@  static inline void tdx_flush_tlb_current(struct kvm_vcpu *vcpu) {}
 static inline void tdx_flush_tlb_all(struct kvm_vcpu *vcpu) {}
 static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
 static inline int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn) { return 0; }
+static inline int tdx_gmem_defer_removal(struct kvm *kvm, struct inode *inode) { return 0; }
 #endif
 
 #endif /* __KVM_X86_VMX_X86_OPS_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 03db366e794a..96ebf0303223 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13659,6 +13659,13 @@  void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
 }
 #endif
 
+#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL
+bool kvm_arch_gmem_defer_removal(struct kvm *kvm, struct inode *inode)
+{
+	return kvm_x86_call(gmem_defer_removal)(kvm, inode);
+}
+#endif
+
 int kvm_spec_ctrl_test_value(u64 value)
 {
 	/*
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d72ba0a4ca0e..f5c8b1923c24 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2534,6 +2534,11 @@  static inline int kvm_gmem_get_pfn(struct kvm *kvm,
 int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order);
 #endif
 
+#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL
+struct inode;
+bool kvm_arch_gmem_defer_removal(struct kvm *kvm, struct inode *inode);
+#endif
+
 #ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
 /**
  * kvm_gmem_populate() - Populate/prepare a GPA range with guest data
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 54e959e7d68f..589111505ad0 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -124,3 +124,7 @@  config HAVE_KVM_ARCH_GMEM_PREPARE
 config HAVE_KVM_ARCH_GMEM_INVALIDATE
        bool
        depends on KVM_PRIVATE_MEM
+
+config HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL
+       bool
+       depends on KVM_PRIVATE_MEM
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index b2aa6bf24d3a..cd485f45fdaf 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -248,6 +248,15 @@  static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset,
 	return ret;
 }
 
+inline bool kvm_gmem_defer_removal(struct kvm *kvm, struct inode *inode)
+{
+#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_DEFER_REMOVAL
+	return kvm_arch_gmem_defer_removal(kvm, inode);
+#else
+	return false;
+#endif
+}
+
 static int kvm_gmem_release(struct inode *inode, struct file *file)
 {
 	struct kvm_gmem *gmem = file->private_data;
@@ -275,13 +284,16 @@  static int kvm_gmem_release(struct inode *inode, struct file *file)
 	xa_for_each(&gmem->bindings, index, slot)
 		WRITE_ONCE(slot->gmem.file, NULL);
 
-	/*
-	 * All in-flight operations are gone and new bindings can be created.
-	 * Zap all SPTEs pointed at by this file.  Do not free the backing
-	 * memory, as its lifetime is associated with the inode, not the file.
-	 */
-	kvm_gmem_invalidate_begin(gmem, 0, -1ul);
-	kvm_gmem_invalidate_end(gmem, 0, -1ul);
+	if (!kvm_gmem_defer_removal(kvm, inode)) {
+		/*
+		 * All in-flight operations are gone and new bindings can be
+		 * created.  Zap all SPTEs pointed at by this file.  Do not free
+		 * the backing memory, as its lifetime is associated with the
+		 * inode, not the file.
+		 */
+		kvm_gmem_invalidate_begin(gmem, 0, -1ul);
+		kvm_gmem_invalidate_end(gmem, 0, -1ul);
+	}
 
 	list_del(&gmem->entry);