[v18,060/121] KVM: TDX: TDP MMU TDX support


Message ID

a47c5a9442130f45fc09c1d4ae0e4352054be636.1705965635.git.isaku.yamahata@intel.com (mailing list archive)

State

New, archived

Headers

From: isaku.yamahata@intel.com
To: kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org
Cc: isaku.yamahata@intel.com,
	isaku.yamahata@gmail.com,
	Paolo Bonzini <pbonzini@redhat.com>,
	erdemaktas@google.com,
	Sean Christopherson <seanjc@google.com>,
	Sagi Shahar <sagis@google.com>,
	Kai Huang <kai.huang@intel.com>,
	chen.bo@intel.com,
	hang.yuan@intel.com,
	tina.zhang@intel.com
Subject: [PATCH v18 060/121] KVM: TDX: TDP MMU TDX support
Date: Mon, 22 Jan 2024 15:53:36 -0800
Message-Id: 
 <a47c5a9442130f45fc09c1d4ae0e4352054be636.1705965635.git.isaku.yamahata@intel.com>
In-Reply-To: <cover.1705965634.git.isaku.yamahata@intel.com>
References: <cover.1705965634.git.isaku.yamahata@intel.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Series

KVM TDX basic feature support

Commit Message

Isaku Yamahata Jan. 22, 2024, 11:53 p.m. UTC

From: Isaku Yamahata <isaku.yamahata@intel.com>

Implement the TDP MMU hooks for the TDX backend: TLB flush, TLB shootdown,
propagation of private EPT entry changes to the Secure EPT, and freeing of
Secure EPT pages.  The TLB flush handles both the shared EPT and the private
EPT.  It flushes the shared EPT in the same way as VMX and also waits for
the TDX TLB shootdown.  The hook that frees a Secure EPT page unlinks the
page from the Secure EPT so that it can be returned to the OS.

Propagate entry changes to the Secure EPT.  The possible entry changes are
present -> non-present (zapping) and non-present -> present (population).
On population, simply link the Secure EPT page or the private guest page to
the Secure EPT with a TDX SEAMCALL.  Because the TDP MMU allows concurrent
zapping/population, zapping requires a synchronous TLB shootdown with the
EPT entry frozen.  It zaps the secure entry, increments the TLB epoch
counter, sends IPIs to remote vcpus to trigger a TLB flush, and then unlinks
the private guest page from the Secure EPT.  For simplicity, batched zapping
with the exclusive lock held is handled the same way as concurrent zapping.
This is inefficient, but it can be optimized in the future.
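
As an illustration only (this is not code from the patch), the zapping
sequence above can be modelled in plain userspace C.  Every toy_* name is
hypothetical, and the real epoch counters live inside the TDX module; the
rule that removal is refused until all vcpus have caught up is the toy's
own simplification of what the module enforces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_VCPUS 4

/* Toy model only: the real epoch counters are owned by the TDX module. */
struct toy_td {
	uint64_t global_epoch;               /* advanced by "track" */
	uint64_t local_epoch[TOY_NR_VCPUS];  /* one per vcpu */
	unsigned int flushes[TOY_NR_VCPUS];  /* TLB flushes performed */
	bool blocked;                        /* entry blocked, not yet removed */
	bool mapped;                         /* private page still linked */
};

/* Step 1: block the entry so that no new translations are created. */
static void toy_block(struct toy_td *td)
{
	assert(td->mapped);
	td->blocked = true;
}

/* Step 2: advance the global epoch (stands in for TDH.MEM.TRACK). */
static void toy_track(struct toy_td *td)
{
	td->global_epoch++;
}

/* Step 3: a kicked vcpu exits and re-enters; flush if its epoch is stale. */
static void toy_reenter(struct toy_td *td, int vcpu)
{
	if (td->local_epoch[vcpu] < td->global_epoch) {
		td->local_epoch[vcpu] = td->global_epoch;
		td->flushes[vcpu]++;
	}
}

/* Step 4: unlink the page; the toy allows it only once all vcpus caught up. */
static bool toy_remove(struct toy_td *td)
{
	int i;

	if (!td->blocked)
		return false;
	for (i = 0; i < TOY_NR_VCPUS; i++)
		if (td->local_epoch[i] < td->global_epoch)
			return false;
	td->mapped = false;
	td->blocked = false;
	return true;
}
```

A vcpu that re-enters twice after a single track flushes only once, because
its local epoch already matches the global one on the second entry.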

For an MMIO SPTE, the SPTE value changes as follows:
initial value (suppress VE bit is set)
-> Guest issues MMIO and triggers an EPT violation
-> KVM updates the SPTE value to the MMIO value (suppress VE bit is cleared)
-> Guest retries the MMIO access, which triggers a #VE exception in the guest TD
-> Guest #VE handler issues TDG.VP.VMCALL<MMIO>
-> KVM handles the MMIO
-> Guest #VE handler resumes execution after the MMIO instruction
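
A simplified sketch of the first two states (illustration only, not part of
the patch): it assumes the EPT entry layout with R/W/X in bits 2:0 and the
suppress-#VE bit in bit 63, and ignores every other SPTE bit (MMIO
generation, access bits, GPA).  The toy_* and TOY_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_EPT_RWX_MASK    0x7ULL        /* bits 2:0: read/write/execute */
#define TOY_EPT_SUPPRESS_VE (1ULL << 63)  /* bit 63: suppress #VE */

enum toy_access_result {
	TOY_ACCESS_OK,    /* some permission bit set: treated as mapped */
	TOY_EXIT_TO_VMM,  /* EPT violation is delivered to KVM */
	TOY_VE_IN_GUEST,  /* #VE is injected into the guest TD */
};

/* Decide what a guest access to this (toy) SPTE results in. */
static enum toy_access_result toy_classify(uint64_t spte)
{
	if (spte & TOY_EPT_RWX_MASK)
		return TOY_ACCESS_OK;
	if (spte & TOY_EPT_SUPPRESS_VE)
		return TOY_EXIT_TO_VMM;
	return TOY_VE_IN_GUEST;
}
```

In this model the initial value (only suppress VE set) exits to the VMM,
while an MMIO value of 0 (RWX=0, suppress VE cleared) yields #VE in the
guest, which matches the first and third steps of the flow above.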

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>

---
v18:
- rename tdx_sept_page_aug() -> tdx_mem_page_aug()
- checkpatch: space => tab

v15 -> v16:
- Add the handling of TD_ATTR_SEPT_VE_DISABLE case.

v14 -> v15:
- Implemented tdx_flush_tlb_current()
- Removed unnecessary invept in tdx_flush_tlb().  It was carried over
  from a very old code base.
---
 arch/x86/kvm/mmu/spte.c    |   3 +-
 arch/x86/kvm/vmx/main.c    |  71 +++++++-
 arch/x86/kvm/vmx/tdx.c     | 342 +++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.h     |   2 +-
 arch/x86/kvm/vmx/tdx_ops.h |   6 +
 arch/x86/kvm/vmx/x86_ops.h |   6 +
 6 files changed, 424 insertions(+), 6 deletions(-)

Comments

Chenyi Qiang Jan. 23, 2024, 8:43 a.m. UTC | #1

On 1/23/2024 7:53 AM, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Implement hooks of TDP MMU for TDX backend.  TLB flush, TLB shootdown,
> propagating the change private EPT entry to Secure EPT and freeing Secure
> EPT page. TLB flush handles both shared EPT and private EPT.  It flushes
> shared EPT same as VMX.  It also waits for the TDX TLB shootdown.  For the
> hook to free Secure EPT page, unlinks the Secure EPT page from the Secure
> EPT so that the page can be freed to OS.
> 
> Propagate the entry change to Secure EPT.  The possible entry changes are
> present -> non-present(zapping) and non-present -> present(population).  On
> population just link the Secure EPT page or the private guest page to the
> Secure EPT by TDX SEAMCALL. Because TDP MMU allows concurrent
> zapping/population, zapping requires synchronous TLB shoot down with the
> frozen EPT entry.  It zaps the secure entry, increments TLB counter, sends
> IPI to remote vcpus to trigger TLB flush, and then unlinks the private
> guest page from the Secure EPT. For simplicity, batched zapping with
> exclude lock is handled as concurrent zapping.  Although it's inefficient,
> it can be optimized in the future.
> 
> For MMIO SPTE, the spte value changes as follows.
> initial value (suppress VE bit is set)
> -> Guest issues MMIO and triggers EPT violation
> -> KVM updates SPTE value to MMIO value (suppress VE bit is cleared)
> -> Guest MMIO resumes.  It triggers VE exception in guest TD
> -> Guest VE handler issues TDG.VP.VMCALL<MMIO>
> -> KVM handles MMIO
> -> Guest VE handler resumes its execution after MMIO instruction
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> ---
> v18:
> - rename tdx_sept_page_aug() -> tdx_mem_page_aug()
> - checkpatch: space => tab
> 
> v15 -> v16:
> - Add the handling of TD_ATTR_SEPT_VE_DISABLE case.
> 
> v14 -> v15:
> - Implemented tdx_flush_tlb_current()
> - Removed unnecessary invept in tdx_flush_tlb().  It was carry over
>   from the very old code base.
> ---
>  arch/x86/kvm/mmu/spte.c    |   3 +-
>  arch/x86/kvm/vmx/main.c    |  71 +++++++-
>  arch/x86/kvm/vmx/tdx.c     | 342 +++++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/vmx/tdx.h     |   2 +-
>  arch/x86/kvm/vmx/tdx_ops.h |   6 +
>  arch/x86/kvm/vmx/x86_ops.h |   6 +
>  6 files changed, 424 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 318135daf685..83926a35ea47 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -74,7 +74,8 @@ u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
>  	u64 spte = generation_mmio_spte_mask(gen);
>  	u64 gpa = gfn << PAGE_SHIFT;
>  
> -	WARN_ON_ONCE(!vcpu->kvm->arch.shadow_mmio_value);
> +	WARN_ON_ONCE(!vcpu->kvm->arch.shadow_mmio_value &&
> +		     !kvm_gfn_shared_mask(vcpu->kvm));
>  
>  	access &= shadow_mmio_access_mask;
>  	spte |= vcpu->kvm->arch.shadow_mmio_value | access;
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index e77c045dca84..569f2f67094c 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -28,6 +28,7 @@ static int vt_max_vcpus(struct kvm *kvm)
>  
>  	return kvm->max_vcpus;
>  }
> +static int vt_flush_remote_tlbs(struct kvm *kvm);
>  
>  static int vt_hardware_enable(void)
>  {
> @@ -74,8 +75,22 @@ static __init int vt_hardware_setup(void)
>  		pr_warn_ratelimited("TDX requires mmio caching.  Please enable mmio caching for TDX.\n");
>  	}
>  
> +	/*
> +	 * TDX KVM overrides flush_remote_tlbs method and assumes
> +	 * flush_remote_tlbs_range = NULL that falls back to
> +	 * flush_remote_tlbs.  Disable TDX if there are conflicts.
> +	 */
> +	if (vt_x86_ops.flush_remote_tlbs ||
> +	    vt_x86_ops.flush_remote_tlbs_range) {
> +		enable_tdx = false;
> +		pr_warn_ratelimited("TDX requires baremetal. Not Supported on VMM guest.\n");
> +	}
> +
>  	enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);
>  
> +	if (enable_tdx)
> +		vt_x86_ops.flush_remote_tlbs = vt_flush_remote_tlbs;
> +
>  	return 0;
>  }
>  
I hit some build issues when CONFIG_HYPERV=n:

error: ‘struct kvm_x86_ops’ has no member named ‘flush_remote_tlbs’
error: ‘struct kvm_x86_ops’ has no member named ‘flush_remote_tlbs_range’

I think this is related to the following commit:
https://lore.kernel.org/all/20231018192325.1893896-1-seanjc@google.com/

Binbin Wu Jan. 30, 2024, 3:31 p.m. UTC | #2

On 1/23/2024 7:53 AM, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> Implement hooks of TDP MMU for TDX backend.  TLB flush, TLB shootdown,
> propagating the change private EPT entry to Secure EPT and freeing Secure
> EPT page. TLB flush handles both shared EPT and private EPT.  It flushes
> shared EPT same as VMX.  It also waits for the TDX TLB shootdown.  For the
> hook to free Secure EPT page, unlinks the Secure EPT page from the Secure
> EPT so that the page can be freed to OS.
>
> Propagate the entry change to Secure EPT.  The possible entry changes are
> present -> non-present(zapping) and non-present -> present(population).  On
> population just link the Secure EPT page or the private guest page to the
> Secure EPT by TDX SEAMCALL. Because TDP MMU allows concurrent
> zapping/population, zapping requires synchronous TLB shoot down with the
> frozen EPT entry.  It zaps the secure entry, increments TLB counter, sends
> IPI to remote vcpus to trigger TLB flush, and then unlinks the private
> guest page from the Secure EPT. For simplicity, batched zapping with
> exclude lock is handled as concurrent zapping.  Although it's inefficient,
> it can be optimized in the future.
>
> For MMIO SPTE, the spte value changes as follows.
> initial value (suppress VE bit is set)
> -> Guest issues MMIO and triggers EPT violation
> -> KVM updates SPTE value to MMIO value (suppress VE bit is cleared)
> -> Guest MMIO resumes.  It triggers VE exception in guest TD
> -> Guest VE handler issues TDG.VP.VMCALL<MMIO>
> -> KVM handles MMIO
> -> Guest VE handler resumes its execution after MMIO instruction
>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>
> ---
> v18:
> - rename tdx_sept_page_aug() -> tdx_mem_page_aug()
> - checkpatch: space => tab
>
> v15 -> v16:
> - Add the handling of TD_ATTR_SEPT_VE_DISABLE case.
>
> v14 -> v15:
> - Implemented tdx_flush_tlb_current()
> - Removed unnecessary invept in tdx_flush_tlb().  It was carry over
>    from the very old code base.
> ---
>   arch/x86/kvm/mmu/spte.c    |   3 +-
>   arch/x86/kvm/vmx/main.c    |  71 +++++++-
>   arch/x86/kvm/vmx/tdx.c     | 342 +++++++++++++++++++++++++++++++++++++
>   arch/x86/kvm/vmx/tdx.h     |   2 +-
>   arch/x86/kvm/vmx/tdx_ops.h |   6 +
>   arch/x86/kvm/vmx/x86_ops.h |   6 +
>   6 files changed, 424 insertions(+), 6 deletions(-)
[...]
> +
> +/*
> + * TLB shoot down procedure:
> + * There is a global epoch counter and each vcpu has local epoch counter.
> + * - TDH.MEM.RANGE.BLOCK(TDR, level, range) on one vcpu
> + *   This blocks the subsequent creation of TLB translation on that range.
> + *   This corresponds to clearing the present bits (all RWX) in the EPT entry
> + * - TDH.MEM.TRACK(TDR): advances the epoch counter which is global.
> + * - IPI to remote vcpus
> + * - TDExit and re-entry with TDH.VP.ENTER on remote vcpus
> + * - On re-entry, TDX module compares the local epoch counter with the global
> + *   epoch counter.  If the local epoch counter is older than the global epoch
> + *   counter, update the local epoch counter and flushes TLB.
> + */
> +static void tdx_track(struct kvm *kvm)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	u64 err;
> +
> +	KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm);
> +	/* If TD isn't finalized, it's before any vcpu running. */
> +	if (unlikely(!is_td_finalized(kvm_tdx)))
> +		return;
> +
> +	/*
> +	 * tdx_flush_tlb() waits for this function to issue TDH.MEM.TRACK() by
> +	 * the counter.  The counter is used instead of bool because multiple
> +	 * TDH_MEM_TRACK() can be issued concurrently by multiple vcpus.
> +	 */
> +	atomic_inc(&kvm_tdx->tdh_mem_track);
> +	/*
> +	 * KVM_REQ_TLB_FLUSH waits for the empty IPI handler, ack_flush(), with
> +	 * KVM_REQUEST_WAIT.
> +	 */
> +	kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH);
> +
> +	do {
> +		/*
> +		 * kvm_flush_remote_tlbs() doesn't allow to return error and
> +		 * retry.
> +		 */
> +		err = tdh_mem_track(kvm_tdx->tdr_pa);
> +	} while (unlikely((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY));

Why is the order in the code different from the one in the function's
description?  The description does TDH.MEM.TRACK before the IPIs, but the
code does TDH.MEM.TRACK after the IPIs.


> +
> +	/* Release remote vcpu waiting for TDH.MEM.TRACK in tdx_flush_tlb(). */
> +	atomic_dec(&kvm_tdx->tdh_mem_track);
> +
> +	if (KVM_BUG_ON(err, kvm))
> +		pr_tdx_error(TDH_MEM_TRACK, err, NULL);
> +
> +}
> +
[...]

Isaku Yamahata Feb. 26, 2024, 7:21 p.m. UTC | #3

On Tue, Jan 30, 2024 at 11:31:22PM +0800,
Binbin Wu <binbin.wu@linux.intel.com> wrote:

> > +
> > +/*
> > + * TLB shoot down procedure:
> > + * There is a global epoch counter and each vcpu has local epoch counter.
> > + * - TDH.MEM.RANGE.BLOCK(TDR, level, range) on one vcpu
> > + *   This blocks the subsequent creation of TLB translation on that range.
> > + *   This corresponds to clearing the present bits (all RWX) in the EPT entry
> > + * - TDH.MEM.TRACK(TDR): advances the epoch counter which is global.
> > + * - IPI to remote vcpus
> > + * - TDExit and re-entry with TDH.VP.ENTER on remote vcpus
> > + * - On re-entry, TDX module compares the local epoch counter with the global
> > + *   epoch counter.  If the local epoch counter is older than the global epoch
> > + *   counter, update the local epoch counter and flushes TLB.
> > + */
> > +static void tdx_track(struct kvm *kvm)
> > +{
> > +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > +	u64 err;
> > +
> > +	KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm);
> > +	/* If TD isn't finalized, it's before any vcpu running. */
> > +	if (unlikely(!is_td_finalized(kvm_tdx)))
> > +		return;
> > +
> > +	/*
> > +	 * tdx_flush_tlb() waits for this function to issue TDH.MEM.TRACK() by
> > +	 * the counter.  The counter is used instead of bool because multiple
> > +	 * TDH_MEM_TRACK() can be issued concurrently by multiple vcpus.
> > +	 */
> > +	atomic_inc(&kvm_tdx->tdh_mem_track);
> > +	/*
> > +	 * KVM_REQ_TLB_FLUSH waits for the empty IPI handler, ack_flush(), with
> > +	 * KVM_REQUEST_WAIT.
> > +	 */
> > +	kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH);
> > +
> > +	do {
> > +		/*
> > +		 * kvm_flush_remote_tlbs() doesn't allow to return error and
> > +		 * retry.
> > +		 */
> > +		err = tdh_mem_track(kvm_tdx->tdr_pa);
> > +	} while (unlikely((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY));
> 
> Why is the order in the code different from the one in the function's
> description?  The description does TDH.MEM.TRACK before the IPIs, but the
> code does TDH.MEM.TRACK after the IPIs.

It's intentional: the IPIs are handled in parallel with TDH.MEM.TRACK,
which is possible because we already introduced the tdh_mem_track counter.
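
A minimal userspace sketch of that counter handshake (illustration only;
the toy_* names are hypothetical and the waiting side is inferred from the
comments in tdx_track(), not copied from tdx_flush_tlb()):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for kvm_tdx->tdh_mem_track. */
static atomic_int toy_track_pending = 0;

/* Initiator, before sending IPIs: announce that a TRACK is in flight. */
static void toy_track_begin(void)
{
	atomic_fetch_add(&toy_track_pending, 1);
}

/* Initiator, after the TRACK SEAMCALL has returned: release the waiters. */
static void toy_track_end(void)
{
	int old = atomic_fetch_sub(&toy_track_pending, 1);

	assert(old > 0);
	(void)old;
}

/* Remote vcpu: must it keep waiting before re-entering the guest? */
static bool toy_must_wait(void)
{
	return atomic_load(&toy_track_pending) != 0;
}
```

With a bool instead of a counter, the first of two concurrent initiators to
finish would release the waiters while the second TRACK is still pending;
the counter keeps them waiting until both are done.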

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 318135daf685..83926a35ea47 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -74,7 +74,8 @@  u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
 	u64 spte = generation_mmio_spte_mask(gen);
 	u64 gpa = gfn << PAGE_SHIFT;
 
-	WARN_ON_ONCE(!vcpu->kvm->arch.shadow_mmio_value);
+	WARN_ON_ONCE(!vcpu->kvm->arch.shadow_mmio_value &&
+		     !kvm_gfn_shared_mask(vcpu->kvm));
 
 	access &= shadow_mmio_access_mask;
 	spte |= vcpu->kvm->arch.shadow_mmio_value | access;
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index e77c045dca84..569f2f67094c 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -28,6 +28,7 @@  static int vt_max_vcpus(struct kvm *kvm)
 
 	return kvm->max_vcpus;
 }
+static int vt_flush_remote_tlbs(struct kvm *kvm);
 
 static int vt_hardware_enable(void)
 {
@@ -74,8 +75,22 @@  static __init int vt_hardware_setup(void)
 		pr_warn_ratelimited("TDX requires mmio caching.  Please enable mmio caching for TDX.\n");
 	}
 
+	/*
+	 * TDX KVM overrides flush_remote_tlbs method and assumes
+	 * flush_remote_tlbs_range = NULL that falls back to
+	 * flush_remote_tlbs.  Disable TDX if there are conflicts.
+	 */
+	if (vt_x86_ops.flush_remote_tlbs ||
+	    vt_x86_ops.flush_remote_tlbs_range) {
+		enable_tdx = false;
+		pr_warn_ratelimited("TDX requires baremetal. Not Supported on VMM guest.\n");
+	}
+
 	enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);
 
+	if (enable_tdx)
+		vt_x86_ops.flush_remote_tlbs = vt_flush_remote_tlbs;
+
 	return 0;
 }
 
@@ -158,6 +173,54 @@  static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vmx_vcpu_reset(vcpu, init_event);
 }
 
+static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu)) {
+		tdx_flush_tlb(vcpu);
+		return;
+	}
+
+	vmx_flush_tlb_all(vcpu);
+}
+
+static void vt_flush_tlb_current(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu)) {
+		tdx_flush_tlb_current(vcpu);
+		return;
+	}
+
+	vmx_flush_tlb_current(vcpu);
+}
+
+static int vt_flush_remote_tlbs(struct kvm *kvm)
+{
+	if (is_td(kvm))
+		return tdx_sept_flush_remote_tlbs(kvm);
+
+	/*
+	 * fallback to KVM_REQ_TLB_FLUSH.
+	 * See kvm_arch_flush_remote_tlb() and kvm_flush_remote_tlbs().
+	 */
+	return -EOPNOTSUPP;
+}
+
+static void vt_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
+{
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return;
+
+	vmx_flush_tlb_gva(vcpu, addr);
+}
+
+static void vt_flush_tlb_guest(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_flush_tlb_guest(vcpu);
+}
+
 static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 			int pgd_level)
 {
@@ -249,10 +312,10 @@  struct kvm_x86_ops vt_x86_ops __initdata = {
 	.set_rflags = vmx_set_rflags,
 	.get_if_flag = vmx_get_if_flag,
 
-	.flush_tlb_all = vmx_flush_tlb_all,
-	.flush_tlb_current = vmx_flush_tlb_current,
-	.flush_tlb_gva = vmx_flush_tlb_gva,
-	.flush_tlb_guest = vmx_flush_tlb_guest,
+	.flush_tlb_all = vt_flush_tlb_all,
+	.flush_tlb_current = vt_flush_tlb_current,
+	.flush_tlb_gva = vt_flush_tlb_gva,
+	.flush_tlb_guest = vt_flush_tlb_guest,
 
 	.vcpu_pre_run = vmx_vcpu_pre_run,
 	.vcpu_run = vmx_vcpu_run,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 25510b6740a3..4002e7e7b191 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -8,6 +8,7 @@ 
 #include "mmu.h"
 #include "tdx_arch.h"
 #include "tdx.h"
+#include "vmx.h"
 #include "x86.h"
 
 #undef pr_fmt
@@ -407,6 +408,22 @@  static int tdx_do_tdh_mng_key_config(void *param)
 
 int tdx_vm_init(struct kvm *kvm)
 {
+	/*
+	 * Because guest TD is protected, VMM can't parse the instruction in TD.
+	 * Instead, guest uses MMIO hypercall.  For unmodified device driver,
+	 * #VE needs to be injected for MMIO and #VE handler in TD converts MMIO
+	 * instruction into MMIO hypercall.
+	 *
+	 * SPTE value for MMIO needs to be setup so that #VE is injected into
+	 * TD instead of triggering EPT MISCONFIG.
+	 * - RWX=0 so that EPT violation is triggered.
+	 * - suppress #VE bit is cleared to inject #VE.
+	 */
+	kvm_mmu_set_mmio_spte_value(kvm, 0);
+
+	/* TODO: Enable 2mb and 1gb large page support. */
+	kvm->arch.tdp_max_page_level = PG_LEVEL_4K;
+
 	/*
 	 * This function initializes only KVM software construct.  It doesn't
 	 * initialize TDX stuff, e.g. TDCS, TDR, TDCX, HKID etc.
@@ -506,6 +523,285 @@  void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
 	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
 }
 
+static void tdx_unpin(struct kvm *kvm, kvm_pfn_t pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+
+	put_page(page);
+}
+
+static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
+			    enum pg_level level, kvm_pfn_t pfn)
+{
+	int tdx_level = pg_level_to_tdx_sept_level(level);
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	union tdx_sept_level_state level_state;
+	hpa_t hpa = pfn_to_hpa(pfn);
+	gpa_t gpa = gfn_to_gpa(gfn);
+	struct tdx_module_args out;
+	union tdx_sept_entry entry;
+	u64 err;
+
+	err = tdh_mem_page_aug(kvm_tdx->tdr_pa, gpa, hpa, &out);
+	if (unlikely(err == TDX_ERROR_SEPT_BUSY)) {
+		tdx_unpin(kvm, pfn);
+		return -EAGAIN;
+	}
+	if (unlikely(err == (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX))) {
+		entry.raw = out.rcx;
+		level_state.raw = out.rdx;
+		if (level_state.level == tdx_level &&
+		    level_state.state == TDX_SEPT_PENDING &&
+		    entry.leaf && entry.pfn == pfn && entry.sve) {
+			tdx_unpin(kvm, pfn);
+			WARN_ON_ONCE(!(to_kvm_tdx(kvm)->attributes &
+				       TDX_TD_ATTR_SEPT_VE_DISABLE));
+			return -EAGAIN;
+		}
+	}
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error(TDH_MEM_PAGE_AUG, err, &out);
+		tdx_unpin(kvm, pfn);
+		return -EIO;
+	}
+
+	return 0;
+}
+
+static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
+				     enum pg_level level, kvm_pfn_t pfn)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
+	/* TODO: handle large pages. */
+	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+		return -EINVAL;
+
+	/*
+	 * Because restricted mem doesn't support page migration with
+	 * a_ops->migrate_page (yet), no callback is triggered for KVM on
+	 * page migration.  Until restricted mem supports page migration,
+	 * prevent page migration.
+	 * TODO: Once restricted mem introduces callback on page migration,
+	 * implement it and remove get_page/put_page().
+	 */
+	get_page(pfn_to_page(pfn));
+
+	if (likely(is_td_finalized(kvm_tdx)))
+		return tdx_mem_page_aug(kvm, gfn, level, pfn);
+
+	/* TODO: tdh_mem_page_add() comes here for the initial memory. */
+
+	return 0;
+}
+
+static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
+				       enum pg_level level, kvm_pfn_t pfn)
+{
+	int tdx_level = pg_level_to_tdx_sept_level(level);
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	struct tdx_module_args out;
+	gpa_t gpa = gfn_to_gpa(gfn);
+	hpa_t hpa = pfn_to_hpa(pfn);
+	hpa_t hpa_with_hkid;
+	u64 err;
+
+	/* TODO: handle large pages. */
+	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+		return -EINVAL;
+
+	if (unlikely(!is_hkid_assigned(kvm_tdx))) {
+		/*
+		 * The HKID assigned to this TD was already freed and cache
+		 * was already flushed. We don't have to flush again.
+		 */
+		err = tdx_reclaim_page(hpa);
+		if (KVM_BUG_ON(err, kvm))
+			return -EIO;
+		tdx_unpin(kvm, pfn);
+		return 0;
+	}
+
+	do {
+		/*
+		 * When zapping private page, write lock is held. So no race
+		 * condition with other vcpu sept operation.  Race only with
+		 * TDH.VP.ENTER.
+		 */
+		err = tdh_mem_page_remove(kvm_tdx->tdr_pa, gpa, tdx_level, &out);
+	} while (unlikely(err == TDX_ERROR_SEPT_BUSY));
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error(TDH_MEM_PAGE_REMOVE, err, &out);
+		return -EIO;
+	}
+
+	hpa_with_hkid = set_hkid_to_hpa(hpa, (u16)kvm_tdx->hkid);
+	do {
+		/*
+		 * TDX_OPERAND_BUSY can happen when locking the PAMT entry.
+		 * Because this page was removed above, other threads shouldn't
+		 * be repeatedly operating on this page.  Just retry.
+		 */
+		err = tdh_phymem_page_wbinvd(hpa_with_hkid);
+	} while (unlikely(err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX)));
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
+		return -EIO;
+	}
+	tdx_clear_page(hpa);
+	tdx_unpin(kvm, pfn);
+	return 0;
+}
+
+static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
+				     enum pg_level level, void *private_spt)
+{
+	int tdx_level = pg_level_to_tdx_sept_level(level);
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	gpa_t gpa = gfn_to_gpa(gfn);
+	hpa_t hpa = __pa(private_spt);
+	struct tdx_module_args out;
+	u64 err;
+
+	err = tdh_mem_sept_add(kvm_tdx->tdr_pa, gpa, tdx_level, hpa, &out);
+	if (unlikely(err == TDX_ERROR_SEPT_BUSY))
+		return -EAGAIN;
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error(TDH_MEM_SEPT_ADD, err, &out);
+		return -EIO;
+	}
+
+	return 0;
+}
+
+static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
+				      enum pg_level level)
+{
+	int tdx_level = pg_level_to_tdx_sept_level(level);
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
+	struct tdx_module_args out;
+	u64 err;
+
+	/* This can be called when destructing guest TD after freeing HKID. */
+	if (unlikely(!is_hkid_assigned(kvm_tdx)))
+		return 0;
+
+	/* Large pages aren't supported yet. */
+	WARN_ON_ONCE(level != PG_LEVEL_4K);
+	err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, tdx_level, &out);
+	if (unlikely(err == TDX_ERROR_SEPT_BUSY))
+		return -EAGAIN;
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error(TDH_MEM_RANGE_BLOCK, err, &out);
+		return -EIO;
+	}
+	return 0;
+}
+
+/*
+ * TLB shoot down procedure:
+ * There is a global epoch counter and each vcpu has a local epoch counter.
+ * - TDH.MEM.RANGE.BLOCK(TDR, level, range) on one vcpu
+ *   This blocks the subsequent creation of TLB translations on that range.
+ *   This corresponds to clearing the present bits (RWX) in the EPT entry.
+ * - TDH.MEM.TRACK(TDR): advances the epoch counter, which is global.
+ * - IPI to remote vcpus
+ * - TDExit and re-entry with TDH.VP.ENTER on remote vcpus
+ * - On re-entry, the TDX module compares the local epoch counter with the
+ *   global epoch counter.  If the local epoch counter is older than the global
+ *   epoch counter, it updates the local epoch counter and flushes the TLB.
+ */
+static void tdx_track(struct kvm *kvm)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	u64 err;
+
+	KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm);
+	/* If the TD isn't finalized, no vcpu has run yet. */
+	if (unlikely(!is_td_finalized(kvm_tdx)))
+		return;
+
+	/*
+	 * tdx_flush_tlb() uses this counter to wait for this function to issue
+	 * TDH.MEM.TRACK.  A counter is used instead of a bool because multiple
+	 * vcpus can issue TDH.MEM.TRACK concurrently.
+	 */
+	atomic_inc(&kvm_tdx->tdh_mem_track);
+	/*
+	 * KVM_REQ_TLB_FLUSH waits for the empty IPI handler, ack_flush(), with
+	 * KVM_REQUEST_WAIT.
+	 */
+	kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH);
+
+	do {
+		/*
+		 * kvm_flush_remote_tlbs() can't return an error and be
+		 * retried, so retry here while the TDX module is busy.
+		 */
+		err = tdh_mem_track(kvm_tdx->tdr_pa);
+	} while (unlikely((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY));
+
+	/* Release remote vcpu waiting for TDH.MEM.TRACK in tdx_flush_tlb(). */
+	atomic_dec(&kvm_tdx->tdh_mem_track);
+
+	if (KVM_BUG_ON(err, kvm))
+		pr_tdx_error(TDH_MEM_TRACK, err, NULL);
+
+}
+
+static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
+				     enum pg_level level, void *private_spt)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
+	/*
+	 * The HKID assigned to this TD was already freed and cache was
+	 * already flushed. We don't have to flush again.
+	 */
+	if (!is_hkid_assigned(kvm_tdx))
+		return tdx_reclaim_page(__pa(private_spt));
+
+	/*
+	 * free_private_spt() is (obviously) called when a shadow page is being
+	 * zapped.  KVM doesn't (yet) zap private SPs while the TD is active.
+	 * Note: This function is for private shadow pages, not for private
+	 * guest pages.  Private guest pages can be zapped while the TD is
+	 * active, e.g. for shared <-> private conversion and slot move/deletion.
+	 */
+	KVM_BUG_ON(is_hkid_assigned(kvm_tdx), kvm);
+	return -EINVAL;
+}
+
+int tdx_sept_flush_remote_tlbs(struct kvm *kvm)
+{
+	if (unlikely(!is_td(kvm)))
+		return -EOPNOTSUPP;
+
+	if (is_hkid_assigned(to_kvm_tdx(kvm)))
+		tdx_track(kvm);
+
+	return 0;
+}
+
+static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
+					 enum pg_level level, kvm_pfn_t pfn)
+{
+	/*
+	 * TDX requires TLB tracking before dropping a private page.  Do
+	 * it here, although it is also done later.
+	 * If the hkid isn't assigned, the guest is being destroyed and no
+	 * vcpu will run again, so TLB shootdown isn't needed.
+	 *
+	 * TODO: Call TDH.MEM.TRACK() only when we have called
+	 * TDH.MEM.RANGE.BLOCK() but have not called TDH.MEM.TRACK() yet.
+	 */
+	if (is_hkid_assigned(to_kvm_tdx(kvm)))
+		tdx_track(kvm);
+
+	return tdx_sept_drop_private_spte(kvm, gfn, level, pfn);
+}
+
 static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
 {
 	struct kvm_tdx_capabilities __user *user_caps;
@@ -970,6 +1266,39 @@  static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 	return ret;
 }
 
+void tdx_flush_tlb(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * Don't need to flush shared EPTP:
+	 * "TD VCPU TLB Address Space Identifier" in the TDX module spec:
+	 * The TLB entries for TD are tagged with:
+	 *  SEAM (1 bit)
+	 *  VPID
+	 *  Secure EPT root (51:12 bits) with HKID = 0
+	 *  PCID
+	 * for *both* Secure-EPT and Shared-EPT.
+	 * A TLB flush with the Secure-EPT root by tdx_track() results in
+	 * flushing the translations of both Secure-EPT and Shared-EPT.
+	 */
+
+	/*
+	 * See tdx_track().  Wait for the TLB shootdown initiator to finish
+	 * TDH.MEM.TRACK so that the shared-EPT/secure-EPT TLB is flushed
+	 * on the next TDH.VP.ENTER.
+	 */
+	while (atomic_read(&to_kvm_tdx(vcpu->kvm)->tdh_mem_track))
+		cpu_relax();
+}
+
+void tdx_flush_tlb_current(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * flush_tlb_current() is used only the first time the vcpu runs.
+	 * As it isn't performance critical, keep this function simple.
+	 */
+	tdx_track(vcpu->kvm);
+}
+
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_tdx_cmd tdx_cmd;
@@ -1288,8 +1617,21 @@  int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
 	on_each_cpu(vmx_off, &vmx_tdx.vmx_enabled, true);
 	cpus_read_unlock();
 	free_cpumask_var(vmx_tdx.vmx_enabled);
+	if (r)
+		goto out;
+
+	x86_ops->link_private_spt = tdx_sept_link_private_spt;
+	x86_ops->free_private_spt = tdx_sept_free_private_spt;
+	x86_ops->set_private_spte = tdx_sept_set_private_spte;
+	x86_ops->remove_private_spte = tdx_sept_remove_private_spte;
+	x86_ops->zap_private_spte = tdx_sept_zap_private_spte;
+
+	return 0;
 
 out:
+	/* kfree() accepts NULL. */
+	kfree(tdx_mng_key_config_lock);
+	tdx_mng_key_config_lock = NULL;
 	return r;
 }
 
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index c8a52eedde02..783ce329d7da 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -18,6 +18,7 @@  struct kvm_tdx {
 	int hkid;
 
 	bool finalized;
+	atomic_t tdh_mem_track;
 
 	u64 tsc_offset;
 };
@@ -165,7 +166,6 @@  static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 fiel
 	}
 	return out.r8;
 }
-
 #else
 struct kvm_tdx {
 	struct kvm kvm;
diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
index 53a6c3f692b0..3513d5df10ee 100644
--- a/arch/x86/kvm/vmx/tdx_ops.h
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -52,6 +52,12 @@  static inline u64 tdx_seamcall(u64 op, struct tdx_module_args *in,
 void pr_tdx_error(u64 op, u64 error_code, const struct tdx_module_args *out);
 #endif
 
+static inline int pg_level_to_tdx_sept_level(enum pg_level level)
+{
+	WARN_ON_ONCE(level == PG_LEVEL_NONE);
+	return level - 1;
+}
+
 /*
  * TDX module acquires its internal lock for resources.  It doesn't spin to get
  * locks because of its restrictions of allowed execution time.  Instead, it
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index a9e5caf880dd..441915e9293e 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -153,6 +153,9 @@  void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
 
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
 
+void tdx_flush_tlb(struct kvm_vcpu *vcpu);
+void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
+int tdx_sept_flush_remote_tlbs(struct kvm *kvm);
 void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
 #else
 static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
@@ -176,6 +179,9 @@  static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
 
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
 
+static inline void tdx_flush_tlb(struct kvm_vcpu *vcpu) {}
+static inline void tdx_flush_tlb_current(struct kvm_vcpu *vcpu) {}
+static inline int tdx_sept_flush_remote_tlbs(struct kvm *kvm) { return 0; }
 static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
 #endif
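
The TLB shoot-down procedure described in the comment above tdx_track() can be modeled in user space. What follows is a minimal sketch under stated assumptions, not code from this patch and not the TDX module's actual algorithm. Every model_* name is hypothetical. The rule in model_can_remove() (a page may be dropped once every vcpu has re-entered after the epoch recorded at block time) is an illustrative simplification of why tdx_track() runs before tdx_sept_drop_private_spte().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names exist in KVM or TDX. */
struct model_vcpu {
	uint64_t local_epoch;	/* epoch observed at the last TD entry */
	unsigned int flushes;	/* TLB flushes performed on entry */
};

struct model_td {
	uint64_t global_epoch;	/* advanced by the TDH.MEM.TRACK stand-in */
	uint64_t block_epoch;	/* global epoch when the range was blocked */
	bool range_blocked;	/* set by the TDH.MEM.RANGE.BLOCK stand-in */
};

/* Stand-in for TDH.MEM.RANGE.BLOCK: no new translations for the range. */
static void model_range_block(struct model_td *td)
{
	td->range_blocked = true;
	td->block_epoch = td->global_epoch;
}

/* Stand-in for TDH.MEM.TRACK: advance the global epoch counter. */
static void model_track(struct model_td *td)
{
	td->global_epoch++;
}

/*
 * Stand-in for TDH.VP.ENTER: if the vcpu's local epoch is older than the
 * global epoch, update it and flush.  Returns true if a flush happened.
 */
static bool model_vp_enter(const struct model_td *td, struct model_vcpu *vcpu)
{
	if (vcpu->local_epoch < td->global_epoch) {
		vcpu->local_epoch = td->global_epoch;
		vcpu->flushes++;
		return true;
	}
	return false;
}

/*
 * Assumed safety rule for this model only: the blocked page may be removed
 * once every vcpu has entered with an epoch newer than the block epoch.
 */
static bool model_can_remove(const struct model_td *td,
			     const struct model_vcpu *vcpus, int nr)
{
	assert(nr > 0);
	if (!td->range_blocked)
		return false;
	for (int i = 0; i < nr; i++) {
		if (vcpus[i].local_epoch <= td->block_epoch)
			return false;
	}
	return true;
}
```

In this model, calling model_range_block() and model_track() is not enough on its own: model_can_remove() stays false until each vcpu has gone through model_vp_enter(), which mirrors the IPI and TD re-entry steps listed in the comment.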
