diff mbox series

[v17,092/116] KVM: TDX: Handle TDX PV HLT hypercall

Message ID 7ca4b7af33646e3f5693472b4394ba0179b550e1.1699368322.git.isaku.yamahata@intel.com (mailing list archive)
State New, archived
Headers show
Series KVM TDX basic feature support | expand

Commit Message

Isaku Yamahata Nov. 7, 2023, 2:56 p.m. UTC
From: Isaku Yamahata <isaku.yamahata@intel.com>

Wire up TDX PV HLT hypercall to the KVM backend function.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 42 +++++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/tdx.h |  3 +++
 2 files changed, 44 insertions(+), 1 deletion(-)

Comments

Sean Christopherson Jan. 5, 2024, 11:05 p.m. UTC | #1
On Tue, Nov 07, 2023, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Wire up TDX PV HLT hypercall to the KVM backend function.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/vmx/tdx.c | 42 +++++++++++++++++++++++++++++++++++++++++-
>  arch/x86/kvm/vmx/tdx.h |  3 +++
>  2 files changed, 44 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 3a1fe74b95c3..4e48989d364f 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -662,7 +662,32 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  
>  bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
>  {
> -	return pi_has_pending_interrupt(vcpu);
> +	bool ret = pi_has_pending_interrupt(vcpu);
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
> +	if (ret || vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
> +		return true;
> +
> +	if (tdx->interrupt_disabled_hlt)
> +		return false;
> +
> +	/*
> +	 * This is for the case where the virtual interrupt is recognized,
> +	 * i.e. set in vmcs.RVI, between the STI and "HLT".  KVM doesn't have
> +	 * access to RVI and the interrupt is no longer in the PID (because it
> +	 * was "recognized".  It doesn't get delivered in the guest because the
> +	 * TDCALL completes before interrupts are enabled.
> +	 *
> +	 * TDX modules sets RVI while in an STI interrupt shadow.
> +	 * - TDExit(typically TDG.VP.VMCALL<HLT>) from the guest to TDX module.
> +	 *   The interrupt shadow at this point is gone.
> +	 * - It knows that there is an interrupt that can be delivered
> +	 *   (RVI > PPR && EFLAGS.IF=1, the other conditions of 29.2.2 don't
> +	 *    matter)
> +	 * - It forwards the TDExit nevertheless, to a clueless hypervisor that
> +	 *   has no way to glean either RVI or PPR.

WTF.  Seriously, what in the absolute hell is going on.  I reported this internally
four ***YEARS*** ago.  This is not some obscure theoretical edge case, this is core
functionality and it's completely broken garbage.

NAK.  Hard NAK.  Fix the TDX module, full stop.

Even worse, TDX 1.5 apparently _already_ has the necessary logic for dealing with
interrupts that are pending in RVI when handling NESTED VM-Enter.  Really!?!?!
Y'all went and added nested virtualization support of some kind, but can't find
the time to get the basics right?
Chao Gao Jan. 8, 2024, 5:09 a.m. UTC | #2
On Fri, Jan 05, 2024 at 03:05:12PM -0800, Sean Christopherson wrote:
>On Tue, Nov 07, 2023, isaku.yamahata@intel.com wrote:
>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>> 
>> Wire up TDX PV HLT hypercall to the KVM backend function.
>> 
>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>> ---
>>  arch/x86/kvm/vmx/tdx.c | 42 +++++++++++++++++++++++++++++++++++++++++-
>>  arch/x86/kvm/vmx/tdx.h |  3 +++
>>  2 files changed, 44 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>> index 3a1fe74b95c3..4e48989d364f 100644
>> --- a/arch/x86/kvm/vmx/tdx.c
>> +++ b/arch/x86/kvm/vmx/tdx.c
>> @@ -662,7 +662,32 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>  
>>  bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
>>  {
>> -	return pi_has_pending_interrupt(vcpu);
>> +	bool ret = pi_has_pending_interrupt(vcpu);
>> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
>> +
>> +	if (ret || vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
>> +		return true;
>> +
>> +	if (tdx->interrupt_disabled_hlt)
>> +		return false;
>> +
>> +	/*
>> +	 * This is for the case where the virtual interrupt is recognized,
>> +	 * i.e. set in vmcs.RVI, between the STI and "HLT".  KVM doesn't have
>> +	 * access to RVI and the interrupt is no longer in the PID (because it
>> +	 * was "recognized".  It doesn't get delivered in the guest because the
>> +	 * TDCALL completes before interrupts are enabled.
>> +	 *
>> +	 * TDX modules sets RVI while in an STI interrupt shadow.
>> +	 * - TDExit(typically TDG.VP.VMCALL<HLT>) from the guest to TDX module.
>> +	 *   The interrupt shadow at this point is gone.
>> +	 * - It knows that there is an interrupt that can be delivered
>> +	 *   (RVI > PPR && EFLAGS.IF=1, the other conditions of 29.2.2 don't
>> +	 *    matter)
>> +	 * - It forwards the TDExit nevertheless, to a clueless hypervisor that
>> +	 *   has no way to glean either RVI or PPR.
>
>WTF.  Seriously, what in the absolute hell is going on.  I reported this internally
>four ***YEARS*** ago.  This is not some obscure theoretical edge case, this is core
>functionality and it's completely broken garbage.
>
>NAK.  Hard NAK.  Fix the TDX module, full stop.
>
>Even worse, TDX 1.5 apparently _already_ has the necessary logic for dealing with
>interrupts that are pending in RVI when handling NESTED VM-Enter.  Really!?!?!
>Y'all went and added nested virtualization support of some kind, but can't find
>the time to get the basics right?

We actually fixed the TDX module. See 11.9.5. Pending Virtual Interrupt
Delivery Indication in TDX module 1.5 spec [1]

  The host VMM can detect whether a virtual interrupt is pending delivery to a
  VCPU in the Virtual APIC page, using TDH.VP.RD to read the VCPU_STATE_DETAILS
  TDVPS field.
  
  The typical use case is when the guest TD VCPU indicates to the host VMM, using
  TDG.VP.VMCALL, that it has no work to do and can be halted. The guest TD is
  expected to pass an “interrupt blocked” flag. The guest TD is expected to set
  this flag to 0 if and only if RFLAGS.IF is 1 or the TDCALL instruction that
  invokes TDG.VP.VMCALL immediately follows an STI instruction. If the “interrupt
  blocked” flag is 0, the host VMM can determine whether to re-schedule the guest
  TD VCPU based on VCPU_STATE_DETAILS.

Isaku, this patch didn't read VCPU_STATE_DETAILS. Maybe you missed something
during rebase? Regarding buggy_hlt_workaround, do you aim to avoid reading
VCPU_STATE_DETAILS as much as possible (because reading it via SEAMCALL is
costly, ~3-4K cycles)? If so, please make it clear in the changelog/comment.

[1]: https://cdrdv2.intel.com/v1/dl/getContent/733575

>
Sean Christopherson Jan. 9, 2024, 4:21 p.m. UTC | #3
On Mon, Jan 08, 2024, Chao Gao wrote:
> On Fri, Jan 05, 2024 at 03:05:12PM -0800, Sean Christopherson wrote:
> >On Tue, Nov 07, 2023, isaku.yamahata@intel.com wrote:
> >> From: Isaku Yamahata <isaku.yamahata@intel.com>
> >> 
> >> Wire up TDX PV HLT hypercall to the KVM backend function.
> >> 
> >> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> >> ---
> >>  arch/x86/kvm/vmx/tdx.c | 42 +++++++++++++++++++++++++++++++++++++++++-
> >>  arch/x86/kvm/vmx/tdx.h |  3 +++
> >>  2 files changed, 44 insertions(+), 1 deletion(-)
> >> 
> >> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> >> index 3a1fe74b95c3..4e48989d364f 100644
> >> --- a/arch/x86/kvm/vmx/tdx.c
> >> +++ b/arch/x86/kvm/vmx/tdx.c
> >> @@ -662,7 +662,32 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> >>  
> >>  bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
> >>  {
> >> -	return pi_has_pending_interrupt(vcpu);
> >> +	bool ret = pi_has_pending_interrupt(vcpu);
> >> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> >> +
> >> +	if (ret || vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
> >> +		return true;
> >> +
> >> +	if (tdx->interrupt_disabled_hlt)
> >> +		return false;
> >> +
> >> +	/*
> >> +	 * This is for the case where the virtual interrupt is recognized,
> >> +	 * i.e. set in vmcs.RVI, between the STI and "HLT".  KVM doesn't have
> >> +	 * access to RVI and the interrupt is no longer in the PID (because it
> >> +	 * was "recognized".  It doesn't get delivered in the guest because the
> >> +	 * TDCALL completes before interrupts are enabled.
> >> +	 *
> >> +	 * TDX modules sets RVI while in an STI interrupt shadow.
> >> +	 * - TDExit(typically TDG.VP.VMCALL<HLT>) from the guest to TDX module.
> >> +	 *   The interrupt shadow at this point is gone.
> >> +	 * - It knows that there is an interrupt that can be delivered
> >> +	 *   (RVI > PPR && EFLAGS.IF=1, the other conditions of 29.2.2 don't
> >> +	 *    matter)
> >> +	 * - It forwards the TDExit nevertheless, to a clueless hypervisor that
> >> +	 *   has no way to glean either RVI or PPR.
> >
> >WTF.  Seriously, what in the absolute hell is going on.  I reported this internally
> >four ***YEARS*** ago.  This is not some obscure theoretical edge case, this is core
> >functionality and it's completely broken garbage.
> >
> >NAK.  Hard NAK.  Fix the TDX module, full stop.
> >
> >Even worse, TDX 1.5 apparently _already_ has the necessary logic for dealing with
> >interrupts that are pending in RVI when handling NESTED VM-Enter.  Really!?!?!
> >Y'all went and added nested virtualization support of some kind, but can't find
> >the time to get the basics right?
> 
> We actually fixed the TDX module. See 11.9.5. Pending Virtual Interrupt
> Delivery Indication in TDX module 1.5 spec [1]
> 
>   The host VMM can detect whether a virtual interrupt is pending delivery to a
>   VCPU in the Virtual APIC page, using TDH.VP.RD to read the VCPU_STATE_DETAILS
>   TDVPS field.
>   
>   The typical use case is when the guest TD VCPU indicates to the host VMM, using
>   TDG.VP.VMCALL, that it has no work to do and can be halted. The guest TD is
>   expected to pass an “interrupt blocked” flag. The guest TD is expected to set
>   this flag to 0 if and only if RFLAGS.IF is 1 or the TDCALL instruction that
>   invokes TDG.VP.VMCALL immediately follows an STI instruction. If the “interrupt
>   blocked” flag is 0, the host VMM can determine whether to re-schedule the guest
>   TD VCPU based on VCPU_STATE_DETAILS.
> 
> Isaku, this patch didn't read VCPU_STATE_DETAILS. Maybe you missed something
> during rebase? Regarding buggy_hlt_workaround, do you aim to avoid reading
> VCPU_STATE_DETAILS as much as possible (because reading it via SEAMCALL is
> costly, ~3-4K cycles)? 

*sigh*  Why on earth doesn't the TDX module simply compute VMXIP on TDVMCALL?
It's literally one bit and one extra VMREAD.  There are plenty of register bits
available, and I highly doubt ~20 cycles in the TDVMCALL path will be noticeable,
let alone problematic.  Such functionality could even be added on top in a TDX
module update, and Intel could even bill it as a performance optimization.

Eating 4k cycles in the HLT path isn't the end of the world, but it's far from
optimal and it's just so incredibly wasteful.  I wouldn't be surprised if the
latency is measurable for certain workloads, which will lead to guests using
idle=poll and/or other games being played in the guest.

And AFAICT, the TDX module doesn't support HLT passthrough, so fully dedicated
CPUs can't even mitigate the pain that way.

Anyways, regarding the "workaround", my NAK stands.  It has bad tradeoffs of its
own, e.g. will result in spurious wakeups, and can't possibly work for VMs with
passthrough devices.  Not to mention that the implementation has several races
and false positives.
Isaku Yamahata Jan. 9, 2024, 5:36 p.m. UTC | #4
On Tue, Jan 09, 2024 at 08:21:13AM -0800,
Sean Christopherson <seanjc@google.com> wrote:

> On Mon, Jan 08, 2024, Chao Gao wrote:
> > On Fri, Jan 05, 2024 at 03:05:12PM -0800, Sean Christopherson wrote:
> > >On Tue, Nov 07, 2023, isaku.yamahata@intel.com wrote:
> > >> From: Isaku Yamahata <isaku.yamahata@intel.com>
> > >> 
> > >> Wire up TDX PV HLT hypercall to the KVM backend function.
> > >> 
> > >> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > >> ---
> > >>  arch/x86/kvm/vmx/tdx.c | 42 +++++++++++++++++++++++++++++++++++++++++-
> > >>  arch/x86/kvm/vmx/tdx.h |  3 +++
> > >>  2 files changed, 44 insertions(+), 1 deletion(-)
> > >> 
> > >> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > >> index 3a1fe74b95c3..4e48989d364f 100644
> > >> --- a/arch/x86/kvm/vmx/tdx.c
> > >> +++ b/arch/x86/kvm/vmx/tdx.c
> > >> @@ -662,7 +662,32 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> > >>  
> > >>  bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
> > >>  {
> > >> -	return pi_has_pending_interrupt(vcpu);
> > >> +	bool ret = pi_has_pending_interrupt(vcpu);
> > >> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> > >> +
> > >> +	if (ret || vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
> > >> +		return true;
> > >> +
> > >> +	if (tdx->interrupt_disabled_hlt)
> > >> +		return false;
> > >> +
> > >> +	/*
> > >> +	 * This is for the case where the virtual interrupt is recognized,
> > >> +	 * i.e. set in vmcs.RVI, between the STI and "HLT".  KVM doesn't have
> > >> +	 * access to RVI and the interrupt is no longer in the PID (because it
> > >> +	 * was "recognized".  It doesn't get delivered in the guest because the
> > >> +	 * TDCALL completes before interrupts are enabled.
> > >> +	 *
> > >> +	 * TDX modules sets RVI while in an STI interrupt shadow.
> > >> +	 * - TDExit(typically TDG.VP.VMCALL<HLT>) from the guest to TDX module.
> > >> +	 *   The interrupt shadow at this point is gone.
> > >> +	 * - It knows that there is an interrupt that can be delivered
> > >> +	 *   (RVI > PPR && EFLAGS.IF=1, the other conditions of 29.2.2 don't
> > >> +	 *    matter)
> > >> +	 * - It forwards the TDExit nevertheless, to a clueless hypervisor that
> > >> +	 *   has no way to glean either RVI or PPR.
> > >
> > >WTF.  Seriously, what in the absolute hell is going on.  I reported this internally
> > >four ***YEARS*** ago.  This is not some obscure theoretical edge case, this is core
> > >functionality and it's completely broken garbage.
> > >
> > >NAK.  Hard NAK.  Fix the TDX module, full stop.
> > >
> > >Even worse, TDX 1.5 apparently _already_ has the necessary logic for dealing with
> > >interrupts that are pending in RVI when handling NESTED VM-Enter.  Really!?!?!
> > >Y'all went and added nested virtualization support of some kind, but can't find
> > >the time to get the basics right?
> > 
> > We actually fixed the TDX module. See 11.9.5. Pending Virtual Interrupt
> > Delivery Indication in TDX module 1.5 spec [1]
> > 
> >   The host VMM can detect whether a virtual interrupt is pending delivery to a
> >   VCPU in the Virtual APIC page, using TDH.VP.RD to read the VCPU_STATE_DETAILS
> >   TDVPS field.
> >   
> >   The typical use case is when the guest TD VCPU indicates to the host VMM, using
> >   TDG.VP.VMCALL, that it has no work to do and can be halted. The guest TD is
> >   expected to pass an “interrupt blocked” flag. The guest TD is expected to set
> >   this flag to 0 if and only if RFLAGS.IF is 1 or the TDCALL instruction that
> >   invokes TDG.VP.VMCALL immediately follows an STI instruction. If the “interrupt
> >   blocked” flag is 0, the host VMM can determine whether to re-schedule the guest
> >   TD VCPU based on VCPU_STATE_DETAILS.
> > 
> > Isaku, this patch didn't read VCPU_STATE_DETAILS. Maybe you missed something
> > during rebase? Regarding buggy_hlt_workaround, do you aim to avoid reading
> > VCPU_STATE_DETAILS as much as possible (because reading it via SEAMCALL is
> > costly, ~3-4K cycles)? 
> 
> *sigh*  Why on earth doesn't the TDX module simply compute VMXIP on TDVMCALL?
> It's literally one bit and one extra VMREAD.  There are plenty of register bits
> available, and I highly doubt ~20 cycles in the TDVMCALL path will be noticeable,
> let alone problematic.  Such functionality could even be added on top in a TDX
> module update, and Intel could even bill it as a performance optimization.
> 
> Eating 4k cycles in the HLT path isn't the end of the world, but it's far from
> optimal and it's just so incredibly wasteful.  I wouldn't be surprised if the
> latency is measurable for certain workloads, which will lead to guests using
> idle=poll and/or other games being played in the guest.
> 
> And AFAICT, the TDX module doesn't support HLT passthrough, so fully dedicated
> CPUs can't even mitigate the pain that way.
> 
> Anyways, regarding the "workaround", my NAK stands.  It has bad tradeoffs of its
> own, e.g. will result in spurious wakeups, and can't possibly work for VMs with
> passthrough devices.  Not to mention that the implementation has several races
> and false positives.

I'll drop the last part and use TDH.VP.RD(VCPU_STATE_DETAILS) with TDX 1.5.
diff mbox series

Patch

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 3a1fe74b95c3..4e48989d364f 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -662,7 +662,32 @@  void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
 bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
 {
-	return pi_has_pending_interrupt(vcpu);
+	bool ret = pi_has_pending_interrupt(vcpu);
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	if (ret || vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
+		return true;
+
+	if (tdx->interrupt_disabled_hlt)
+		return false;
+
+	/*
+	 * This is for the case where the virtual interrupt is recognized,
+	 * i.e. set in vmcs.RVI, between the STI and "HLT".  KVM doesn't have
+	 * access to RVI and the interrupt is no longer in the PID (because it
+	 * was "recognized".  It doesn't get delivered in the guest because the
+	 * TDCALL completes before interrupts are enabled.
+	 *
+	 * TDX modules sets RVI while in an STI interrupt shadow.
+	 * - TDExit(typically TDG.VP.VMCALL<HLT>) from the guest to TDX module.
+	 *   The interrupt shadow at this point is gone.
+	 * - It knows that there is an interrupt that can be delivered
+	 *   (RVI > PPR && EFLAGS.IF=1, the other conditions of 29.2.2 don't
+	 *    matter)
+	 * - It forwards the TDExit nevertheless, to a clueless hypervisor that
+	 *   has no way to glean either RVI or PPR.
+	 */
+	return !!xchg(&tdx->buggy_hlt_workaround, 0);
 }
 
 void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
@@ -1104,6 +1129,17 @@  static int tdx_emulate_cpuid(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+static int tdx_emulate_hlt(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	/* See tdx_protected_apic_has_interrupt() to avoid heavy seamcall */
+	tdx->interrupt_disabled_hlt = tdvmcall_a0_read(vcpu);
+
+	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+	return kvm_emulate_halt_noskip(vcpu);
+}
+
 static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 {
 	if (tdvmcall_exit_type(vcpu))
@@ -1112,6 +1148,8 @@  static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 	switch (tdvmcall_leaf(vcpu)) {
 	case EXIT_REASON_CPUID:
 		return tdx_emulate_cpuid(vcpu);
+	case EXIT_REASON_HLT:
+		return tdx_emulate_hlt(vcpu);
 	default:
 		break;
 	}
@@ -1486,6 +1524,8 @@  void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
 	struct kvm_vcpu *vcpu = apic->vcpu;
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
 
+	/* See comment in tdx_protected_apic_has_interrupt(). */
+	tdx->buggy_hlt_workaround = 1;
 	/* TDX supports only posted interrupt.  No lapic emulation. */
 	__vmx_deliver_posted_interrupt(vcpu, &tdx->pi_desc, vector);
 }
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 870a0d15e073..4ddcf804c0a4 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -101,6 +101,9 @@  struct vcpu_tdx {
 	bool host_state_need_restore;
 	u64 msr_host_kernel_gs_base;
 
+	bool interrupt_disabled_hlt;
+	unsigned int buggy_hlt_workaround;
+
 	/*
 	 * Dummy to make pmu_intel not corrupt memory.
 	 * TODO: Support PMU for TDX.  Future work.