Message ID | 20250211025442.3071607-6-binbin.wu@linux.intel.com (mailing list archive) |
---|---|
State | New |
Series | KVM: TDX: TDX hypercalls may exit to userspace |
On Tue, Feb 11, 2025 at 10:54:39AM +0800, Binbin Wu wrote:
> +static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
> +	if (vcpu->run->hypercall.ret) {
> +		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
> +		tdx->vp_enter_args.r11 = tdx->map_gpa_next;
> +		return 1;
> +	}
> +
> +	tdx->map_gpa_next += TDX_MAP_GPA_MAX_LEN;
> +	if (tdx->map_gpa_next >= tdx->map_gpa_end)
> +		return 1;
> +
> +	/*
> +	 * Stop processing the remaining part if there is pending interrupt.
> +	 * Skip checking pending virtual interrupt (reflected by
> +	 * TDX_VCPU_STATE_DETAILS_INTR_PENDING bit) to save a seamcall because
> +	 * if guest disabled interrupt, it's OK not returning back to guest
> +	 * due to non-NMI interrupt. Also it's rare to TDVMCALL_MAP_GPA
> +	 * immediately after STI or MOV/POP SS.
> +	 */
> +	if (pi_has_pending_interrupt(vcpu) ||
> +	    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending) {
Should here also use "kvm_vcpu_has_events()" to replace
"pi_has_pending_interrupt(vcpu) ||
 kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending" as Sean
suggested at [1]?

[1] https://lore.kernel.org/all/Z4rIGv4E7Jdmhl8P@google.com

> +		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_RETRY);
> +		tdx->vp_enter_args.r11 = tdx->map_gpa_next;
> +		return 1;
> +	}
> +
> +	__tdx_map_gpa(tdx);
> +	return 0;
> +}
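[Editor's note: for reference, a minimal sketch of what Yan's suggestion would look like in tdx_complete_vmcall_map_gpa(); illustrative only, not a patch from this thread.]

	/* Replace the open-coded pending-event checks with the common helper. */
	if (kvm_vcpu_has_events(vcpu)) {
		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_RETRY);
		tdx->vp_enter_args.r11 = tdx->map_gpa_next;
		return 1;
	}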
On 2/11/2025 2:54 PM, Yan Zhao wrote:
> On Tue, Feb 11, 2025 at 10:54:39AM +0800, Binbin Wu wrote:
>> +static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu)
>> +{
>> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
>> +
>> +	if (vcpu->run->hypercall.ret) {
>> +		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
>> +		tdx->vp_enter_args.r11 = tdx->map_gpa_next;
>> +		return 1;
>> +	}
>> +
>> +	tdx->map_gpa_next += TDX_MAP_GPA_MAX_LEN;
>> +	if (tdx->map_gpa_next >= tdx->map_gpa_end)
>> +		return 1;
>> +
>> +	/*
>> +	 * Stop processing the remaining part if there is pending interrupt.
>> +	 * Skip checking pending virtual interrupt (reflected by
>> +	 * TDX_VCPU_STATE_DETAILS_INTR_PENDING bit) to save a seamcall because
>> +	 * if guest disabled interrupt, it's OK not returning back to guest
>> +	 * due to non-NMI interrupt. Also it's rare to TDVMCALL_MAP_GPA
>> +	 * immediately after STI or MOV/POP SS.
>> +	 */
>> +	if (pi_has_pending_interrupt(vcpu) ||
>> +	    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending) {
> Should here also use "kvm_vcpu_has_events()" to replace
> "pi_has_pending_interrupt(vcpu) ||
>  kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending" as Sean
> suggested at [1]?
>
> [1] https://lore.kernel.org/all/Z4rIGv4E7Jdmhl8P@google.com

For TDX guests, kvm_vcpu_has_events() will check pending virtual interrupt
via a SEAM call. As noted in the comments, the check for pending virtual
interrupt is intentionally skipped to save the SEAM call. Additionally,
unnecessarily returning back to the guest would have a performance impact.

But according to the discussion thread above, it seems that Sean prioritized
code readability (i.e. reuse the common helper to make TDX code less special)
over performance considerations?

>
>> +		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_RETRY);
>> +		tdx->vp_enter_args.r11 = tdx->map_gpa_next;
>> +		return 1;
>> +	}
>> +
>> +	__tdx_map_gpa(tdx);
>> +	return 0;
>> +}
>
On Tue, Feb 11, 2025 at 04:11:19PM +0800, Binbin Wu wrote: > > >On 2/11/2025 2:54 PM, Yan Zhao wrote: >> On Tue, Feb 11, 2025 at 10:54:39AM +0800, Binbin Wu wrote: >> > +static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu) >> > +{ >> > + struct vcpu_tdx *tdx = to_tdx(vcpu); >> > + >> > + if (vcpu->run->hypercall.ret) { >> > + tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND); >> > + tdx->vp_enter_args.r11 = tdx->map_gpa_next; >> > + return 1; >> > + } >> > + >> > + tdx->map_gpa_next += TDX_MAP_GPA_MAX_LEN; >> > + if (tdx->map_gpa_next >= tdx->map_gpa_end) >> > + return 1; >> > + >> > + /* >> > + * Stop processing the remaining part if there is pending interrupt. >> > + * Skip checking pending virtual interrupt (reflected by >> > + * TDX_VCPU_STATE_DETAILS_INTR_PENDING bit) to save a seamcall because >> > + * if guest disabled interrupt, it's OK not returning back to guest >> > + * due to non-NMI interrupt. Also it's rare to TDVMCALL_MAP_GPA >> > + * immediately after STI or MOV/POP SS. >> > + */ >> > + if (pi_has_pending_interrupt(vcpu) || >> > + kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending) { >> Should here also use "kvm_vcpu_has_events()" to replace >> "pi_has_pending_interrupt(vcpu) || >> kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending" as Sean >> suggested at [1]? >> >> [1] https://lore.kernel.org/all/Z4rIGv4E7Jdmhl8P@google.com > >For TDX guests, kvm_vcpu_has_events() will check pending virtual interrupt >via a SEAM call. As noted in the comments, the check for pending virtual >interrupt is intentionally skipped to save the SEAM call. Additionally, >unnecessarily returning back to guest will has performance impact. > >But according to the discussion thread above, it seems that Sean prioritized >code readability (i.e. reuse the common helper to make TDX code less special) >over performance considerations? To mitigate the performance impact, we can cache the "pending interrupt" status on the first read, similar to how guest RSP/RBP are cached to avoid VMREADs for normal VMs. This optimization can be done in a separate patch or series. And, future TDX modules will report the status via registers.
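[Editor's note: a rough illustration of the caching idea Chao describes, not code from this series. The vcpu_state_details/vcpu_state_details_valid fields in struct vcpu_tdx are hypothetical, and the cached value would have to be invalidated before every TD entry.]

static bool tdx_vcpu_has_pending_intr_cached(struct kvm_vcpu *vcpu)
{
	struct vcpu_tdx *tdx = to_tdx(vcpu);

	if (pi_has_pending_interrupt(vcpu))
		return true;

	/* Read TD_VCPU_STATE_DETAILS_NON_ARCH via SEAMCALL at most once per run. */
	if (!tdx->vcpu_state_details_valid) {
		tdx->vcpu_state_details =
			td_state_non_arch_read64(tdx, TD_VCPU_STATE_DETAILS_NON_ARCH);
		tdx->vcpu_state_details_valid = true;
	}

	return tdx_vcpu_state_details_intr_pending(tdx->vcpu_state_details);
}

Clearing the hypothetical vcpu_state_details_valid flag right before TDH.VP.ENTER would keep the cache coherent, analogous to how the guest RSP/RBP register cache is flushed around VM-Entry for normal VMX guests.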
On Tue, Feb 11, 2025, Chao Gao wrote:
> On Tue, Feb 11, 2025 at 04:11:19PM +0800, Binbin Wu wrote:
> >
> >
> >On 2/11/2025 2:54 PM, Yan Zhao wrote:
> >> On Tue, Feb 11, 2025 at 10:54:39AM +0800, Binbin Wu wrote:
> >> > +static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu)
> >> > +{
> >> > +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> >> > +
> >> > +	if (vcpu->run->hypercall.ret) {
> >> > +		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
> >> > +		tdx->vp_enter_args.r11 = tdx->map_gpa_next;
> >> > +		return 1;
> >> > +	}
> >> > +
> >> > +	tdx->map_gpa_next += TDX_MAP_GPA_MAX_LEN;
> >> > +	if (tdx->map_gpa_next >= tdx->map_gpa_end)
> >> > +		return 1;
> >> > +
> >> > +	/*
> >> > +	 * Stop processing the remaining part if there is pending interrupt.
> >> > +	 * Skip checking pending virtual interrupt (reflected by
> >> > +	 * TDX_VCPU_STATE_DETAILS_INTR_PENDING bit) to save a seamcall because
> >> > +	 * if guest disabled interrupt, it's OK not returning back to guest
> >> > +	 * due to non-NMI interrupt. Also it's rare to TDVMCALL_MAP_GPA
> >> > +	 * immediately after STI or MOV/POP SS.
> >> > +	 */
> >> > +	if (pi_has_pending_interrupt(vcpu) ||
> >> > +	    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending) {
> >> Should here also use "kvm_vcpu_has_events()" to replace
> >> "pi_has_pending_interrupt(vcpu) ||
> >>  kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending" as Sean
> >> suggested at [1]?
> >>
> >> [1] https://lore.kernel.org/all/Z4rIGv4E7Jdmhl8P@google.com
> >
> >For TDX guests, kvm_vcpu_has_events() will check pending virtual interrupt
> >via a SEAM call. As noted in the comments, the check for pending virtual
> >interrupt is intentionally skipped to save the SEAM call. Additionally,

Drat, I had a whole response typed up and then discovered the implementation of
tdx_protected_apic_has_interrupt() had changed. But I think the basic gist
still holds.

The new version:

bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
-	return pi_has_pending_interrupt(vcpu);
+	u64 vcpu_state_details;
+
+	if (pi_has_pending_interrupt(vcpu))
+		return true;
+
+	vcpu_state_details =
+		td_state_non_arch_read64(to_tdx(vcpu), TD_VCPU_STATE_DETAILS_NON_ARCH);
+
+	return tdx_vcpu_state_details_intr_pending(vcpu_state_details);
}

is much better than the old:

bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
-	return pi_has_pending_interrupt(vcpu);
+	bool ret = pi_has_pending_interrupt(vcpu);
+	union tdx_vcpu_state_details details;
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	if (ret || vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
+		return true;
+
+	if (tdx->interrupt_disabled_hlt)
+		return false;
+
+	details.full = td_state_non_arch_read64(tdx, TD_VCPU_STATE_DETAILS_NON_ARCH);
+	return !!details.vmxip;
}

because assuming the vCPU has an interrupt if it's not HALTED is all kinds of
wrong.

However, checking VMXIP for the !HLT case is also wrong. And undesirable, as
evidenced by both this path and the EPT violation retry path wanting to avoid
checking VMXIP.

Except for the guest being stupid (non-HLT TDCALL in an interrupt shadow), having
an interrupt in RVI that is fully unmasked will be extremely rare. Actually,
outside of an interrupt shadow, I don't think it's even possible. I can't think
of any CPU flows that modify RVI in the middle of instruction execution. I.e. if
RVI is non-zero, then either the interrupt has been pending since before the
TDVMCALL, or the TDVMCALL is in an STI/SS shadow. And if the interrupt was
pending before TDVMCALL, then it _must_ be blocked, otherwise the interrupt
would have been serviced at the instruction boundary.

I am completely comfortable saying that KVM doesn't care about STI/SS shadows
outside of the HALTED case, and so unless I'm missing something, I think it makes
sense for tdx_protected_apic_has_interrupt() to not check RVI outside of the HALTED
case, because it's impossible to know if the interrupt is actually unmasked, and
statistically it's far, far more likely that it _is_ masked.

> >unnecessarily returning back to the guest would have a performance impact.
> >
> >But according to the discussion thread above, it seems that Sean prioritized
> >code readability (i.e. reuse the common helper to make TDX code less special)
> >over performance considerations?
>
> To mitigate the performance impact, we can cache the "pending interrupt" status
> on the first read, similar to how guest RSP/RBP are cached to avoid VMREADs for
> normal VMs. This optimization can be done in a separate patch or series.
>
> And, future TDX modules will report the status via registers.
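[Editor's note: put together, the direction Sean describes might look roughly like the sketch below, i.e. only pay for the SEAMCALL to read RVI state when the vCPU is HALTED. This is only an illustration of the idea, not the final code from the "TDX interrupts" series.]

bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
	u64 vcpu_state_details;

	if (pi_has_pending_interrupt(vcpu))
		return true;

	/*
	 * Outside of HALTED it is impossible to know whether an interrupt
	 * sitting in RVI is unmasked, and statistically it almost certainly
	 * is masked, so skip the SEAMCALL unless the vCPU is halted.
	 */
	if (vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
		return false;

	vcpu_state_details =
		td_state_non_arch_read64(to_tdx(vcpu), TD_VCPU_STATE_DETAILS_NON_ARCH);

	return tdx_vcpu_state_details_intr_pending(vcpu_state_details);
}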
On 2/12/2025 8:46 AM, Sean Christopherson wrote: > On Tue, Feb 11, 2025, Chao Gao wrote: >> On Tue, Feb 11, 2025 at 04:11:19PM +0800, Binbin Wu wrote: >>> >>> On 2/11/2025 2:54 PM, Yan Zhao wrote: >>>> On Tue, Feb 11, 2025 at 10:54:39AM +0800, Binbin Wu wrote: >>>>> +static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu) >>>>> +{ >>>>> + struct vcpu_tdx *tdx = to_tdx(vcpu); >>>>> + >>>>> + if (vcpu->run->hypercall.ret) { >>>>> + tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND); >>>>> + tdx->vp_enter_args.r11 = tdx->map_gpa_next; >>>>> + return 1; >>>>> + } >>>>> + >>>>> + tdx->map_gpa_next += TDX_MAP_GPA_MAX_LEN; >>>>> + if (tdx->map_gpa_next >= tdx->map_gpa_end) >>>>> + return 1; >>>>> + >>>>> + /* >>>>> + * Stop processing the remaining part if there is pending interrupt. >>>>> + * Skip checking pending virtual interrupt (reflected by >>>>> + * TDX_VCPU_STATE_DETAILS_INTR_PENDING bit) to save a seamcall because >>>>> + * if guest disabled interrupt, it's OK not returning back to guest >>>>> + * due to non-NMI interrupt. Also it's rare to TDVMCALL_MAP_GPA >>>>> + * immediately after STI or MOV/POP SS. >>>>> + */ >>>>> + if (pi_has_pending_interrupt(vcpu) || >>>>> + kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending) { >>>> Should here also use "kvm_vcpu_has_events()" to replace >>>> "pi_has_pending_interrupt(vcpu) || >>>> kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending" as Sean >>>> suggested at [1]? >>>> >>>> [1] https://lore.kernel.org/all/Z4rIGv4E7Jdmhl8P@google.com >>> For TDX guests, kvm_vcpu_has_events() will check pending virtual interrupt >>> via a SEAM call. As noted in the comments, the check for pending virtual >>> interrupt is intentionally skipped to save the SEAM call. Additionally, > Drat, I had a whole response typed up and then discovered the implementation of > tdx_protected_apic_has_interrupt() had changed. But I think the basic gist > still holds. > > The new version: > > bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu) > { > - return pi_has_pending_interrupt(vcpu); > + u64 vcpu_state_details; > + > + if (pi_has_pending_interrupt(vcpu)) > + return true; > + > + vcpu_state_details = > + td_state_non_arch_read64(to_tdx(vcpu), TD_VCPU_STATE_DETAILS_NON_ARCH); > + > + return tdx_vcpu_state_details_intr_pending(vcpu_state_details); > } > > is much better than the old: > > bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu) > { > - return pi_has_pending_interrupt(vcpu); > + bool ret = pi_has_pending_interrupt(vcpu); > + union tdx_vcpu_state_details details; > + struct vcpu_tdx *tdx = to_tdx(vcpu); > + > + if (ret || vcpu->arch.mp_state != KVM_MP_STATE_HALTED) > + return true; > + > + if (tdx->interrupt_disabled_hlt) > + return false; > + > + details.full = td_state_non_arch_read64(tdx, TD_VCPU_STATE_DETAILS_NON_ARCH); > + return !!details.vmxip; > } > > because assuming the vCPU has an interrupt if it's not HALTED is all kinds of > wrong. > > However, checking VMXIP for the !HLT case is also wrong. And undesirable, as > evidenced by both this path and the EPT violation retry path wanted to avoid > checking VMXIP. > > Except for the guest being stupid (non-HLT TDCALL in an interrupt shadow), having > an interrupt in RVI that is fully unmasked will be extremely rare. Actually, > outside of an interrupt shadow, I don't think it's even possible. I can't think > of any CPU flows that modify RVI in the middle of instruction execution. I.e. 
if > RVI is non-zero, then either the interrupt has been pending since before the > TDVMCALL, or the TDVMCALL is in an STI/SS shadow. And if the interrupt was > pending before TDVMCALL, then it _must_ be blocked, otherwise the interrupt > would have been serviced at the instruction boundary. Agree. > > I am completely comfortable saying that KVM doesn't care about STI/SS shadows > outside of the HALTED case, and so unless I'm missing something, I think it makes > sense for tdx_protected_apic_has_interrupt() to not check RVI outside of the HALTED > case, because it's impossible to know if the interrupt is actually unmasked, and > statistically it's far, far more likely that it _is_ masked. OK. Will update tdx_protected_apic_has_interrupt() in "TDX interrupts" part. And use kvm_vcpu_has_events() to replace the open code in this patch. Thanks! > >>> unnecessarily returning back to guest will has performance impact. >>> >>> But according to the discussion thread above, it seems that Sean prioritized >>> code readability (i.e. reuse the common helper to make TDX code less special) >>> over performance considerations? >> To mitigate the performance impact, we can cache the "pending interrupt" status >> on the first read, similar to how guest RSP/RBP are cached to avoid VMREADs for >> normal VMs. This optimization can be done in a separate patch or series. >> >> And, future TDX modules will report the status via registers.
On Wed, Feb 12, 2025, Binbin Wu wrote:
> On 2/12/2025 8:46 AM, Sean Christopherson wrote:
> > I am completely comfortable saying that KVM doesn't care about STI/SS shadows
> > outside of the HALTED case, and so unless I'm missing something, I think it makes
> > sense for tdx_protected_apic_has_interrupt() to not check RVI outside of the HALTED
> > case, because it's impossible to know if the interrupt is actually unmasked, and
> > statistically it's far, far more likely that it _is_ masked.
> OK. Will update tdx_protected_apic_has_interrupt() in "TDX interrupts" part.
> And use kvm_vcpu_has_events() to replace the open code in this patch.

Something to keep an eye on: kvm_vcpu_has_events() returns true if pv_unhalted
is set, and pv_unhalted is only cleared on transitions to KVM_MP_STATE_RUNNABLE.
If the guest initiates a spurious wakeup, pv_unhalted could be left set in
perpetuity.

I _think_ this would work and is generally desirable?

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8e77e61d4fbd..435ca2782c3c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11114,9 +11114,6 @@ static bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
 	    kvm_apic_init_sipi_allowed(vcpu))
 		return true;
 
-	if (vcpu->arch.pv.pv_unhalted)
-		return true;
-
 	if (kvm_is_exception_pending(vcpu))
 		return true;
 
@@ -11157,7 +11154,8 @@ static bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
 
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
 {
-	return kvm_vcpu_running(vcpu) || kvm_vcpu_has_events(vcpu);
+	return kvm_vcpu_running(vcpu) || vcpu->arch.pv.pv_unhalted ||
+	       kvm_vcpu_has_events(vcpu);
 }
 
 /* Called within kvm->srcu read side. */
@@ -11293,7 +11291,7 @@ static int __kvm_emulate_halt(struct kvm_vcpu *vcpu, int state, int reason)
 	 */
 	++vcpu->stat.halt_exits;
 	if (lapic_in_kernel(vcpu)) {
-		if (kvm_vcpu_has_events(vcpu))
+		if (kvm_vcpu_has_events(vcpu) || vcpu->arch.pv.pv_unhalted)
 			vcpu->arch.pv.pv_unhalted = false;
 		else
 			vcpu->arch.mp_state = state;
On 2/13/2025 2:56 AM, Sean Christopherson wrote: > On Wed, Feb 12, 2025, Binbin Wu wrote: >> On 2/12/2025 8:46 AM, Sean Christopherson wrote: >>> I am completely comfortable saying that KVM doesn't care about STI/SS shadows >>> outside of the HALTED case, and so unless I'm missing something, I think it makes >>> sense for tdx_protected_apic_has_interrupt() to not check RVI outside of the HALTED >>> case, because it's impossible to know if the interrupt is actually unmasked, and >>> statistically it's far, far more likely that it _is_ masked. >> OK. Will update tdx_protected_apic_has_interrupt() in "TDX interrupts" part. >> And use kvm_vcpu_has_events() to replace the open code in this patch. > Something to keep an eye on: kvm_vcpu_has_events() returns true if pv_unhalted > is set, and pv_unhalted is only cleared on transitions KVM_MP_STATE_RUNNABLE. > If the guest initiates a spurious wakeup, pv_unhalted could be left set in > perpetuity. Oh, yes. KVM_HC_KICK_CPU is allowed in TDX guests. The change below looks good to me. One minor issue is when guest initiates a spurious wakeup, pv_unhalted is left set, then later when the guest want to halt the vcpu, in __kvm_emulate_halt(), since pv_unhalted is still set and the state will not transit to KVM_MP_STATE_HALTED. But I guess it's guests' responsibility to not initiate spurious wakeup, guests need to bear the fact that HLT could fail due to a previous spurious wakeup? > > I _think_ this would work and is generally desirable? > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 8e77e61d4fbd..435ca2782c3c 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -11114,9 +11114,6 @@ static bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu) > kvm_apic_init_sipi_allowed(vcpu)) > return true; > > - if (vcpu->arch.pv.pv_unhalted) > - return true; > - > if (kvm_is_exception_pending(vcpu)) > return true; > > @@ -11157,7 +11154,8 @@ static bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu) > > int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu) > { > - return kvm_vcpu_running(vcpu) || kvm_vcpu_has_events(vcpu); > + return kvm_vcpu_running(vcpu) || vcpu->arch.pv.pv_unhalted || > + kvm_vcpu_has_events(vcpu); > } > > /* Called within kvm->srcu read side. */ > @@ -11293,7 +11291,7 @@ static int __kvm_emulate_halt(struct kvm_vcpu *vcpu, int state, int reason) > */ > ++vcpu->stat.halt_exits; > if (lapic_in_kernel(vcpu)) { > - if (kvm_vcpu_has_events(vcpu)) > + if (kvm_vcpu_has_events(vcpu) || vcpu->arch.pv.pv_unhalted) > vcpu->arch.pv.pv_unhalted = false; > else > vcpu->arch.mp_state = state; > >
On 2/13/2025 11:23 AM, Binbin Wu wrote: > > > On 2/13/2025 2:56 AM, Sean Christopherson wrote: >> On Wed, Feb 12, 2025, Binbin Wu wrote: >>> On 2/12/2025 8:46 AM, Sean Christopherson wrote: >>>> I am completely comfortable saying that KVM doesn't care about STI/SS shadows >>>> outside of the HALTED case, and so unless I'm missing something, I think it makes >>>> sense for tdx_protected_apic_has_interrupt() to not check RVI outside of the HALTED >>>> case, because it's impossible to know if the interrupt is actually unmasked, and >>>> statistically it's far, far more likely that it _is_ masked. >>> OK. Will update tdx_protected_apic_has_interrupt() in "TDX interrupts" part. >>> And use kvm_vcpu_has_events() to replace the open code in this patch. >> Something to keep an eye on: kvm_vcpu_has_events() returns true if pv_unhalted >> is set, and pv_unhalted is only cleared on transitions KVM_MP_STATE_RUNNABLE. >> If the guest initiates a spurious wakeup, pv_unhalted could be left set in >> perpetuity. > > Oh, yes. > KVM_HC_KICK_CPU is allowed in TDX guests. > > The change below looks good to me. > > One minor issue is when guest initiates a spurious wakeup, pv_unhalted is > left set, then later when the guest want to halt the vcpu, in > __kvm_emulate_halt(), since pv_unhalted is still set and the state will not > transit to KVM_MP_STATE_HALTED. > But I guess it's guests' responsibility to not initiate spurious wakeup, > guests need to bear the fact that HLT could fail due to a previous > spurious wakeup? Just found a patch set for fixing the issue. https://lore.kernel.org/kvm/20250113200150.487409-1-jmattson@google.com/ > >> >> I _think_ this would work and is generally desirable? >> >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c >> index 8e77e61d4fbd..435ca2782c3c 100644 >> --- a/arch/x86/kvm/x86.c >> +++ b/arch/x86/kvm/x86.c >> @@ -11114,9 +11114,6 @@ static bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu) >> kvm_apic_init_sipi_allowed(vcpu)) >> return true; >> - if (vcpu->arch.pv.pv_unhalted) >> - return true; >> - >> if (kvm_is_exception_pending(vcpu)) >> return true; >> @@ -11157,7 +11154,8 @@ static bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu) >> int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu) >> { >> - return kvm_vcpu_running(vcpu) || kvm_vcpu_has_events(vcpu); >> + return kvm_vcpu_running(vcpu) || vcpu->arch.pv.pv_unhalted || >> + kvm_vcpu_has_events(vcpu); >> } >> /* Called within kvm->srcu read side. */ >> @@ -11293,7 +11291,7 @@ static int __kvm_emulate_halt(struct kvm_vcpu *vcpu, int state, int reason) >> */ >> ++vcpu->stat.halt_exits; >> if (lapic_in_kernel(vcpu)) { >> - if (kvm_vcpu_has_events(vcpu)) >> + if (kvm_vcpu_has_events(vcpu) || vcpu->arch.pv.pv_unhalted) >> vcpu->arch.pv.pv_unhalted = false; >> else >> vcpu->arch.mp_state = state; >> >> > >
On Thu, Feb 13, 2025, Binbin Wu wrote: > On 2/13/2025 11:23 AM, Binbin Wu wrote: > > On 2/13/2025 2:56 AM, Sean Christopherson wrote: > > > On Wed, Feb 12, 2025, Binbin Wu wrote: > > > > On 2/12/2025 8:46 AM, Sean Christopherson wrote: > > > > > I am completely comfortable saying that KVM doesn't care about STI/SS shadows > > > > > outside of the HALTED case, and so unless I'm missing something, I think it makes > > > > > sense for tdx_protected_apic_has_interrupt() to not check RVI outside of the HALTED > > > > > case, because it's impossible to know if the interrupt is actually unmasked, and > > > > > statistically it's far, far more likely that it _is_ masked. > > > > OK. Will update tdx_protected_apic_has_interrupt() in "TDX interrupts" part. > > > > And use kvm_vcpu_has_events() to replace the open code in this patch. > > > Something to keep an eye on: kvm_vcpu_has_events() returns true if pv_unhalted > > > is set, and pv_unhalted is only cleared on transitions KVM_MP_STATE_RUNNABLE. > > > If the guest initiates a spurious wakeup, pv_unhalted could be left set in > > > perpetuity. > > > > Oh, yes. > > KVM_HC_KICK_CPU is allowed in TDX guests. And a clever guest can send a REMRD IPI. > > The change below looks good to me. > > > > One minor issue is when guest initiates a spurious wakeup, pv_unhalted is > > left set, then later when the guest want to halt the vcpu, in > > __kvm_emulate_halt(), since pv_unhalted is still set and the state will not > > transit to KVM_MP_STATE_HALTED. > > But I guess it's guests' responsibility to not initiate spurious wakeup, > > guests need to bear the fact that HLT could fail due to a previous > > spurious wakeup? > > Just found a patch set for fixing the issue. FWIW, Jim's series doesn't address spurious wakeups per se, it just ensures pv_unhalted is cleared when transitioning to RUNNING. If the vCPU is already RUNNING, __apic_accept_irq() will set pv_unhalted and nothing will clear it until the next transition to RUNNING (which implies at least an attempted transition away from RUNNING).
On 2/13/2025 11:17 PM, Sean Christopherson wrote: > On Thu, Feb 13, 2025, Binbin Wu wrote: >> On 2/13/2025 11:23 AM, Binbin Wu wrote: >>> On 2/13/2025 2:56 AM, Sean Christopherson wrote: >>>> On Wed, Feb 12, 2025, Binbin Wu wrote: >>>>> On 2/12/2025 8:46 AM, Sean Christopherson wrote: >>>>>> I am completely comfortable saying that KVM doesn't care about STI/SS shadows >>>>>> outside of the HALTED case, and so unless I'm missing something, I think it makes >>>>>> sense for tdx_protected_apic_has_interrupt() to not check RVI outside of the HALTED >>>>>> case, because it's impossible to know if the interrupt is actually unmasked, and >>>>>> statistically it's far, far more likely that it _is_ masked. >>>>> OK. Will update tdx_protected_apic_has_interrupt() in "TDX interrupts" part. >>>>> And use kvm_vcpu_has_events() to replace the open code in this patch. >>>> Something to keep an eye on: kvm_vcpu_has_events() returns true if pv_unhalted >>>> is set, and pv_unhalted is only cleared on transitions KVM_MP_STATE_RUNNABLE. >>>> If the guest initiates a spurious wakeup, pv_unhalted could be left set in >>>> perpetuity. >>> Oh, yes. >>> KVM_HC_KICK_CPU is allowed in TDX guests. > And a clever guest can send a REMRD IPI. > >>> The change below looks good to me. >>> >>> One minor issue is when guest initiates a spurious wakeup, pv_unhalted is >>> left set, then later when the guest want to halt the vcpu, in >>> __kvm_emulate_halt(), since pv_unhalted is still set and the state will not >>> transit to KVM_MP_STATE_HALTED. >>> But I guess it's guests' responsibility to not initiate spurious wakeup, >>> guests need to bear the fact that HLT could fail due to a previous >>> spurious wakeup? >> Just found a patch set for fixing the issue. > FWIW, Jim's series doesn't address spurious wakeups per se, it just ensures > pv_unhalted is cleared when transitioning to RUNNING. If the vCPU is already > RUNNING, __apic_accept_irq() will set pv_unhalted and nothing will clear it > until the next transition to RUNNING (which implies at least an attempted > transition away from RUNNING). > Indeed. I am wondering why KVM doesn't clear pv_unhalted before the vcpu entering guest? Is the additional memory access a concern or is there some other reason?
On Mon, Feb 17, 2025, Binbin Wu wrote: > On 2/13/2025 11:17 PM, Sean Christopherson wrote: > > On Thu, Feb 13, 2025, Binbin Wu wrote: > > > On 2/13/2025 11:23 AM, Binbin Wu wrote: > > > > On 2/13/2025 2:56 AM, Sean Christopherson wrote: > > > > > On Wed, Feb 12, 2025, Binbin Wu wrote: > > > > > > On 2/12/2025 8:46 AM, Sean Christopherson wrote: > > > > > > > I am completely comfortable saying that KVM doesn't care about STI/SS shadows > > > > > > > outside of the HALTED case, and so unless I'm missing something, I think it makes > > > > > > > sense for tdx_protected_apic_has_interrupt() to not check RVI outside of the HALTED > > > > > > > case, because it's impossible to know if the interrupt is actually unmasked, and > > > > > > > statistically it's far, far more likely that it _is_ masked. > > > > > > OK. Will update tdx_protected_apic_has_interrupt() in "TDX interrupts" part. > > > > > > And use kvm_vcpu_has_events() to replace the open code in this patch. > > > > > Something to keep an eye on: kvm_vcpu_has_events() returns true if pv_unhalted > > > > > is set, and pv_unhalted is only cleared on transitions KVM_MP_STATE_RUNNABLE. > > > > > If the guest initiates a spurious wakeup, pv_unhalted could be left set in > > > > > perpetuity. > > > > Oh, yes. > > > > KVM_HC_KICK_CPU is allowed in TDX guests. > > And a clever guest can send a REMRD IPI. > > > > > > The change below looks good to me. > > > > > > > > One minor issue is when guest initiates a spurious wakeup, pv_unhalted is > > > > left set, then later when the guest want to halt the vcpu, in > > > > __kvm_emulate_halt(), since pv_unhalted is still set and the state will not > > > > transit to KVM_MP_STATE_HALTED. > > > > But I guess it's guests' responsibility to not initiate spurious wakeup, > > > > guests need to bear the fact that HLT could fail due to a previous > > > > spurious wakeup? > > > Just found a patch set for fixing the issue. > > FWIW, Jim's series doesn't address spurious wakeups per se, it just ensures > > pv_unhalted is cleared when transitioning to RUNNING. If the vCPU is already > > RUNNING, __apic_accept_irq() will set pv_unhalted and nothing will clear it > > until the next transition to RUNNING (which implies at least an attempted > > transition away from RUNNING). > > > Indeed. > > I am wondering why KVM doesn't clear pv_unhalted before the vcpu entering guest? > Is the additional memory access a concern or is there some other reason? Not clearing pv_unhalted when entering the guest is necessary to avoid races. Stating the obvious, the guest must set up all of its lock tracking before executing HLT, which means that the soon-to-be-blocking vCPU is eligible for wakeup *before* it executes HLT. If an asynchronous exit happens on the vCPU at just the right time, KVM could clear pv_unhalted before the vCPU executes HLT.
On 2/19/2025 8:29 AM, Sean Christopherson wrote: > On Mon, Feb 17, 2025, Binbin Wu wrote: >> On 2/13/2025 11:17 PM, Sean Christopherson wrote: >>> On Thu, Feb 13, 2025, Binbin Wu wrote: >>>> On 2/13/2025 11:23 AM, Binbin Wu wrote: >>>>> On 2/13/2025 2:56 AM, Sean Christopherson wrote: >>>>>> On Wed, Feb 12, 2025, Binbin Wu wrote: >>>>>>> On 2/12/2025 8:46 AM, Sean Christopherson wrote: >>>>>>>> I am completely comfortable saying that KVM doesn't care about STI/SS shadows >>>>>>>> outside of the HALTED case, and so unless I'm missing something, I think it makes >>>>>>>> sense for tdx_protected_apic_has_interrupt() to not check RVI outside of the HALTED >>>>>>>> case, because it's impossible to know if the interrupt is actually unmasked, and >>>>>>>> statistically it's far, far more likely that it _is_ masked. >>>>>>> OK. Will update tdx_protected_apic_has_interrupt() in "TDX interrupts" part. >>>>>>> And use kvm_vcpu_has_events() to replace the open code in this patch. >>>>>> Something to keep an eye on: kvm_vcpu_has_events() returns true if pv_unhalted >>>>>> is set, and pv_unhalted is only cleared on transitions KVM_MP_STATE_RUNNABLE. >>>>>> If the guest initiates a spurious wakeup, pv_unhalted could be left set in >>>>>> perpetuity. >>>>> Oh, yes. >>>>> KVM_HC_KICK_CPU is allowed in TDX guests. >>> And a clever guest can send a REMRD IPI. >>> >>>>> The change below looks good to me. >>>>> >>>>> One minor issue is when guest initiates a spurious wakeup, pv_unhalted is >>>>> left set, then later when the guest want to halt the vcpu, in >>>>> __kvm_emulate_halt(), since pv_unhalted is still set and the state will not >>>>> transit to KVM_MP_STATE_HALTED. >>>>> But I guess it's guests' responsibility to not initiate spurious wakeup, >>>>> guests need to bear the fact that HLT could fail due to a previous >>>>> spurious wakeup? >>>> Just found a patch set for fixing the issue. >>> FWIW, Jim's series doesn't address spurious wakeups per se, it just ensures >>> pv_unhalted is cleared when transitioning to RUNNING. If the vCPU is already >>> RUNNING, __apic_accept_irq() will set pv_unhalted and nothing will clear it >>> until the next transition to RUNNING (which implies at least an attempted >>> transition away from RUNNING). >>> >> Indeed. >> >> I am wondering why KVM doesn't clear pv_unhalted before the vcpu entering guest? >> Is the additional memory access a concern or is there some other reason? > Not clearing pv_unhalted when entering the guest is necessary to avoid races. > Stating the obvious, the guest must set up all of its lock tracking before executing > HLT, which means that the soon-to-be-blocking vCPU is eligible for wakeup *before* > it executes HLT. If an asynchronous exit happens on the vCPU at just the right > time, KVM could clear pv_unhalted before the vCPU executes HLT. > Got it. Thanks for the explanation.
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index 4aedab1f2a1a..f23657350d28 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -77,6 +77,7 @@
 #define TDVMCALL_STATUS_SUCCESS		0x0000000000000000ULL
 #define TDVMCALL_STATUS_RETRY			0x0000000000000001ULL
 #define TDVMCALL_STATUS_INVALID_OPERAND	0x8000000000000000ULL
+#define TDVMCALL_STATUS_ALIGN_ERROR		0x8000000000000002ULL
 
 /*
  * Bitmasks of exposed registers (with VMM).
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index daa49f2ee2b3..8b51b4c937e9 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -981,9 +981,122 @@ static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
 	return __kvm_emulate_hypercall(vcpu, 0, complete_hypercall_exit);
 }
 
+/*
+ * Split into chunks and check interrupt pending between chunks. This allows
+ * for timely injection of interrupts to prevent issues with guest lockup
+ * detection.
+ */
+#define TDX_MAP_GPA_MAX_LEN (2 * 1024 * 1024)
+static void __tdx_map_gpa(struct vcpu_tdx *tdx);
+
+static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	if (vcpu->run->hypercall.ret) {
+		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
+		tdx->vp_enter_args.r11 = tdx->map_gpa_next;
+		return 1;
+	}
+
+	tdx->map_gpa_next += TDX_MAP_GPA_MAX_LEN;
+	if (tdx->map_gpa_next >= tdx->map_gpa_end)
+		return 1;
+
+	/*
+	 * Stop processing the remaining part if there is pending interrupt.
+	 * Skip checking pending virtual interrupt (reflected by
+	 * TDX_VCPU_STATE_DETAILS_INTR_PENDING bit) to save a seamcall because
+	 * if guest disabled interrupt, it's OK not returning back to guest
+	 * due to non-NMI interrupt. Also it's rare to TDVMCALL_MAP_GPA
+	 * immediately after STI or MOV/POP SS.
+	 */
+	if (pi_has_pending_interrupt(vcpu) ||
+	    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending) {
+		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_RETRY);
+		tdx->vp_enter_args.r11 = tdx->map_gpa_next;
+		return 1;
+	}
+
+	__tdx_map_gpa(tdx);
+	return 0;
+}
+
+static void __tdx_map_gpa(struct vcpu_tdx *tdx)
+{
+	u64 gpa = tdx->map_gpa_next;
+	u64 size = tdx->map_gpa_end - tdx->map_gpa_next;
+
+	if (size > TDX_MAP_GPA_MAX_LEN)
+		size = TDX_MAP_GPA_MAX_LEN;
+
+	tdx->vcpu.run->exit_reason = KVM_EXIT_HYPERCALL;
+	tdx->vcpu.run->hypercall.nr = KVM_HC_MAP_GPA_RANGE;
+	/*
+	 * In principle this should have been -KVM_ENOSYS, but userspace (QEMU <=9.2)
+	 * assumed that vcpu->run->hypercall.ret is never changed by KVM and thus that
+	 * it was always zero on KVM_EXIT_HYPERCALL. Since KVM is now overwriting
+	 * vcpu->run->hypercall.ret, ensuring that it is zero to not break QEMU.
+	 */
+	tdx->vcpu.run->hypercall.ret = 0;
+	tdx->vcpu.run->hypercall.args[0] = gpa & ~gfn_to_gpa(kvm_gfn_direct_bits(tdx->vcpu.kvm));
+	tdx->vcpu.run->hypercall.args[1] = size / PAGE_SIZE;
+	tdx->vcpu.run->hypercall.args[2] = vt_is_tdx_private_gpa(tdx->vcpu.kvm, gpa) ?
+					   KVM_MAP_GPA_RANGE_ENCRYPTED :
+					   KVM_MAP_GPA_RANGE_DECRYPTED;
+	tdx->vcpu.run->hypercall.flags = KVM_EXIT_HYPERCALL_LONG_MODE;
+
+	tdx->vcpu.arch.complete_userspace_io = tdx_complete_vmcall_map_gpa;
+}
+
+static int tdx_map_gpa(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	u64 gpa = tdx->vp_enter_args.r12;
+	u64 size = tdx->vp_enter_args.r13;
+	u64 ret;
+
+	/*
+	 * Converting TDVMCALL_MAP_GPA to KVM_HC_MAP_GPA_RANGE requires
+	 * userspace to enable KVM_CAP_EXIT_HYPERCALL with KVM_HC_MAP_GPA_RANGE
+	 * bit set. If not, the error code is not defined in GHCI for TDX, use
+	 * TDVMCALL_STATUS_INVALID_OPERAND for this case.
+	 */
+	if (!user_exit_on_hypercall(vcpu->kvm, KVM_HC_MAP_GPA_RANGE)) {
+		ret = TDVMCALL_STATUS_INVALID_OPERAND;
+		goto error;
+	}
+
+	if (gpa + size <= gpa || !kvm_vcpu_is_legal_gpa(vcpu, gpa) ||
+	    !kvm_vcpu_is_legal_gpa(vcpu, gpa + size - 1) ||
+	    (vt_is_tdx_private_gpa(vcpu->kvm, gpa) !=
+	     vt_is_tdx_private_gpa(vcpu->kvm, gpa + size - 1))) {
+		ret = TDVMCALL_STATUS_INVALID_OPERAND;
+		goto error;
+	}
+
+	if (!PAGE_ALIGNED(gpa) || !PAGE_ALIGNED(size)) {
+		ret = TDVMCALL_STATUS_ALIGN_ERROR;
+		goto error;
+	}
+
+	tdx->map_gpa_end = gpa + size;
+	tdx->map_gpa_next = gpa;
+
+	__tdx_map_gpa(tdx);
+	return 0;
+
+error:
+	tdvmcall_set_return_code(vcpu, ret);
+	tdx->vp_enter_args.r11 = gpa;
+	return 1;
+}
+
 static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 {
 	switch (tdvmcall_leaf(vcpu)) {
+	case TDVMCALL_MAP_GPA:
+		return tdx_map_gpa(vcpu);
 	default:
 		break;
 	}
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 0e3522e423cc..45c1d064b6b7 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -57,6 +57,9 @@ struct vcpu_tdx {
 	u64 vp_enter_ret;
 
 	enum vcpu_tdx_state state;
+
+	u64 map_gpa_next;
+	u64 map_gpa_end;
 };
 
 void tdh_vp_rd_failed(struct vcpu_tdx *tdx, char *uclass, u32 field, u64 err);
Convert TDG.VP.VMCALL<MapGPA> to KVM_EXIT_HYPERCALL with KVM_HC_MAP_GPA_RANGE
and forward it to userspace for handling.

MapGPA is used by a TDX guest to request that a GPA range be mapped as private
or shared memory, and it needs to exit to userspace for handling. KVM has
already implemented a similar hypercall, KVM_HC_MAP_GPA_RANGE, which exits to
userspace with exit reason KVM_EXIT_HYPERCALL. Do sanity checks, convert
TDVMCALL_MAP_GPA to KVM_HC_MAP_GPA_RANGE and forward the request to userspace.

To prevent a TDG.VP.VMCALL<MapGPA> call from taking too long, the MapGPA range
is split into 2MB chunks and pending interrupts are checked between chunks.
This allows for timely injection of interrupts and prevents issues with guest
lockup detection. The TDX guest should retry the operation for the GPA
starting at the address specified in R11 when the TDVMCALL returns
TDVMCALL_STATUS_RETRY as the status code.

Note that userspace needs to enable KVM_CAP_EXIT_HYPERCALL with the
KVM_HC_MAP_GPA_RANGE bit set for a TD VM.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
Hypercalls exit to userspace v2:
- Skip setting of return code as TDVMCALL_STATUS_SUCCESS.
- Use vp_enter_args instead of x86 registers.
- Remove unnecessary comments.
- Zero run->hypercall.ret in __tdx_map_gpa() following the pattern of Paolo's
  patch; the feedback of adding a helper is still pending. (Rick)
  https://lore.kernel.org/kvm/20241213194137.315304-1-pbonzini@redhat.com

Hypercalls exit to userspace v1:
- New added. Implement one of the hypercalls that needs to exit to userspace
  for handling after dropping "KVM: TDX: Add KVM Exit for TDX TDG.VP.VMCALL",
  which tries to resolve Sean's comment.
  https://lore.kernel.org/kvm/Zg18ul8Q4PGQMWam@google.com/
- Check interrupt pending between chunks suggested by Sean.
  https://lore.kernel.org/kvm/ZleJvmCawKqmpFIa@google.com/
- Use TDVMCALL_STATUS prefix for TDX call status codes (Binbin)
- Use vt_is_tdx_private_gpa()
---
 arch/x86/include/asm/shared/tdx.h |   1 +
 arch/x86/kvm/vmx/tdx.c            | 113 ++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.h            |   3 +
 3 files changed, 117 insertions(+)
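[Editor's note: for completeness, the userspace side of that requirement could look roughly like the snippet below. This is a hedged sketch, not taken from QEMU or this series; error handling is omitted.]

#include <linux/kvm.h>
#include <linux/kvm_para.h>
#include <sys/ioctl.h>

/* Enable forwarding of KVM_HC_MAP_GPA_RANGE hypercalls to userspace for a TD VM. */
static void enable_map_gpa_exit(int vm_fd)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_EXIT_HYPERCALL,
		/* args[0] is a bitmask of hypercall numbers that should exit to userspace. */
		.args[0] = 1ULL << KVM_HC_MAP_GPA_RANGE,
	};

	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}

With this enabled, each 2MB chunk of a MapGPA request surfaces as a KVM_EXIT_HYPERCALL with nr == KVM_HC_MAP_GPA_RANGE, and userspace converts the range between private and shared memory before resuming the vCPU.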