Message ID | 20250211025442.3071607-6-binbin.wu@linux.intel.com (mailing list archive) |
---|---|
State | New |
Series | KVM: TDX: TDX hypercalls may exit to userspace |
On Tue, Feb 11, 2025 at 10:54:39AM +0800, Binbin Wu wrote:
> +static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
> +	if (vcpu->run->hypercall.ret) {
> +		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
> +		tdx->vp_enter_args.r11 = tdx->map_gpa_next;
> +		return 1;
> +	}
> +
> +	tdx->map_gpa_next += TDX_MAP_GPA_MAX_LEN;
> +	if (tdx->map_gpa_next >= tdx->map_gpa_end)
> +		return 1;
> +
> +	/*
> +	 * Stop processing the remaining part if there is pending interrupt.
> +	 * Skip checking pending virtual interrupt (reflected by
> +	 * TDX_VCPU_STATE_DETAILS_INTR_PENDING bit) to save a seamcall because
> +	 * if guest disabled interrupt, it's OK not returning back to guest
> +	 * due to non-NMI interrupt. Also it's rare to TDVMCALL_MAP_GPA
> +	 * immediately after STI or MOV/POP SS.
> +	 */
> +	if (pi_has_pending_interrupt(vcpu) ||
> +	    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending) {
Should here also use "kvm_vcpu_has_events()" to replace
"pi_has_pending_interrupt(vcpu) ||
 kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending" as Sean
suggested at [1]?

[1] https://lore.kernel.org/all/Z4rIGv4E7Jdmhl8P@google.com

> +		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_RETRY);
> +		tdx->vp_enter_args.r11 = tdx->map_gpa_next;
> +		return 1;
> +	}
> +
> +	__tdx_map_gpa(tdx);
> +	return 0;
> +}
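[Editor's note: for reference, a minimal sketch of what Yan's suggestion would look like in tdx_complete_vmcall_map_gpa(); illustrative only, not a patch from this thread.]

	/* Replace the open-coded pending-event checks with the common helper. */
	if (kvm_vcpu_has_events(vcpu)) {
		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_RETRY);
		tdx->vp_enter_args.r11 = tdx->map_gpa_next;
		return 1;
	}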
On 2/11/2025 2:54 PM, Yan Zhao wrote:
> On Tue, Feb 11, 2025 at 10:54:39AM +0800, Binbin Wu wrote:
>> +static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu)
>> +{
>> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
>> +
>> +	if (vcpu->run->hypercall.ret) {
>> +		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
>> +		tdx->vp_enter_args.r11 = tdx->map_gpa_next;
>> +		return 1;
>> +	}
>> +
>> +	tdx->map_gpa_next += TDX_MAP_GPA_MAX_LEN;
>> +	if (tdx->map_gpa_next >= tdx->map_gpa_end)
>> +		return 1;
>> +
>> +	/*
>> +	 * Stop processing the remaining part if there is pending interrupt.
>> +	 * Skip checking pending virtual interrupt (reflected by
>> +	 * TDX_VCPU_STATE_DETAILS_INTR_PENDING bit) to save a seamcall because
>> +	 * if guest disabled interrupt, it's OK not returning back to guest
>> +	 * due to non-NMI interrupt. Also it's rare to TDVMCALL_MAP_GPA
>> +	 * immediately after STI or MOV/POP SS.
>> +	 */
>> +	if (pi_has_pending_interrupt(vcpu) ||
>> +	    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending) {
> Should here also use "kvm_vcpu_has_events()" to replace
> "pi_has_pending_interrupt(vcpu) ||
>  kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending" as Sean
> suggested at [1]?
>
> [1] https://lore.kernel.org/all/Z4rIGv4E7Jdmhl8P@google.com

For TDX guests, kvm_vcpu_has_events() will check pending virtual interrupt
via a SEAM call. As noted in the comments, the check for pending virtual
interrupt is intentionally skipped to save the SEAM call. Additionally,
unnecessarily returning back to the guest would have a performance impact.

But according to the discussion thread above, it seems that Sean prioritized
code readability (i.e. reuse the common helper to make TDX code less special)
over performance considerations?

>
>> +		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_RETRY);
>> +		tdx->vp_enter_args.r11 = tdx->map_gpa_next;
>> +		return 1;
>> +	}
>> +
>> +	__tdx_map_gpa(tdx);
>> +	return 0;
>> +}
>
On Tue, Feb 11, 2025 at 04:11:19PM +0800, Binbin Wu wrote: > > >On 2/11/2025 2:54 PM, Yan Zhao wrote: >> On Tue, Feb 11, 2025 at 10:54:39AM +0800, Binbin Wu wrote: >> > +static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu) >> > +{ >> > + struct vcpu_tdx *tdx = to_tdx(vcpu); >> > + >> > + if (vcpu->run->hypercall.ret) { >> > + tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND); >> > + tdx->vp_enter_args.r11 = tdx->map_gpa_next; >> > + return 1; >> > + } >> > + >> > + tdx->map_gpa_next += TDX_MAP_GPA_MAX_LEN; >> > + if (tdx->map_gpa_next >= tdx->map_gpa_end) >> > + return 1; >> > + >> > + /* >> > + * Stop processing the remaining part if there is pending interrupt. >> > + * Skip checking pending virtual interrupt (reflected by >> > + * TDX_VCPU_STATE_DETAILS_INTR_PENDING bit) to save a seamcall because >> > + * if guest disabled interrupt, it's OK not returning back to guest >> > + * due to non-NMI interrupt. Also it's rare to TDVMCALL_MAP_GPA >> > + * immediately after STI or MOV/POP SS. >> > + */ >> > + if (pi_has_pending_interrupt(vcpu) || >> > + kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending) { >> Should here also use "kvm_vcpu_has_events()" to replace >> "pi_has_pending_interrupt(vcpu) || >> kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending" as Sean >> suggested at [1]? >> >> [1] https://lore.kernel.org/all/Z4rIGv4E7Jdmhl8P@google.com > >For TDX guests, kvm_vcpu_has_events() will check pending virtual interrupt >via a SEAM call. As noted in the comments, the check for pending virtual >interrupt is intentionally skipped to save the SEAM call. Additionally, >unnecessarily returning back to guest will has performance impact. > >But according to the discussion thread above, it seems that Sean prioritized >code readability (i.e. reuse the common helper to make TDX code less special) >over performance considerations? To mitigate the performance impact, we can cache the "pending interrupt" status on the first read, similar to how guest RSP/RBP are cached to avoid VMREADs for normal VMs. This optimization can be done in a separate patch or series. And, future TDX modules will report the status via registers.
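[Editor's note: a rough illustration of the caching idea Chao describes, not code from this series. The vcpu_state_details/vcpu_state_details_valid fields in struct vcpu_tdx are hypothetical, and the cached value would have to be invalidated before every TD entry.]

static bool tdx_vcpu_has_pending_intr_cached(struct kvm_vcpu *vcpu)
{
	struct vcpu_tdx *tdx = to_tdx(vcpu);

	if (pi_has_pending_interrupt(vcpu))
		return true;

	/* Read TD_VCPU_STATE_DETAILS_NON_ARCH via SEAMCALL at most once per run. */
	if (!tdx->vcpu_state_details_valid) {
		tdx->vcpu_state_details =
			td_state_non_arch_read64(tdx, TD_VCPU_STATE_DETAILS_NON_ARCH);
		tdx->vcpu_state_details_valid = true;
	}

	return tdx_vcpu_state_details_intr_pending(tdx->vcpu_state_details);
}

Clearing the hypothetical vcpu_state_details_valid flag right before TDH.VP.ENTER would keep the cache coherent, analogous to how the guest RSP/RBP register cache is flushed around VM-Entry for normal VMX guests.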
On Tue, Feb 11, 2025, Chao Gao wrote:
> On Tue, Feb 11, 2025 at 04:11:19PM +0800, Binbin Wu wrote:
> >
> >
> >On 2/11/2025 2:54 PM, Yan Zhao wrote:
> >> On Tue, Feb 11, 2025 at 10:54:39AM +0800, Binbin Wu wrote:
> >> > +static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu)
> >> > +{
> >> > +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> >> > +
> >> > +	if (vcpu->run->hypercall.ret) {
> >> > +		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
> >> > +		tdx->vp_enter_args.r11 = tdx->map_gpa_next;
> >> > +		return 1;
> >> > +	}
> >> > +
> >> > +	tdx->map_gpa_next += TDX_MAP_GPA_MAX_LEN;
> >> > +	if (tdx->map_gpa_next >= tdx->map_gpa_end)
> >> > +		return 1;
> >> > +
> >> > +	/*
> >> > +	 * Stop processing the remaining part if there is pending interrupt.
> >> > +	 * Skip checking pending virtual interrupt (reflected by
> >> > +	 * TDX_VCPU_STATE_DETAILS_INTR_PENDING bit) to save a seamcall because
> >> > +	 * if guest disabled interrupt, it's OK not returning back to guest
> >> > +	 * due to non-NMI interrupt. Also it's rare to TDVMCALL_MAP_GPA
> >> > +	 * immediately after STI or MOV/POP SS.
> >> > +	 */
> >> > +	if (pi_has_pending_interrupt(vcpu) ||
> >> > +	    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending) {
> >> Should here also use "kvm_vcpu_has_events()" to replace
> >> "pi_has_pending_interrupt(vcpu) ||
> >>  kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending" as Sean
> >> suggested at [1]?
> >>
> >> [1] https://lore.kernel.org/all/Z4rIGv4E7Jdmhl8P@google.com
> >
> >For TDX guests, kvm_vcpu_has_events() will check pending virtual interrupt
> >via a SEAM call. As noted in the comments, the check for pending virtual
> >interrupt is intentionally skipped to save the SEAM call. Additionally,

Drat, I had a whole response typed up and then discovered the implementation of
tdx_protected_apic_has_interrupt() had changed. But I think the basic gist
still holds.

The new version:

bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
-	return pi_has_pending_interrupt(vcpu);
+	u64 vcpu_state_details;
+
+	if (pi_has_pending_interrupt(vcpu))
+		return true;
+
+	vcpu_state_details =
+		td_state_non_arch_read64(to_tdx(vcpu), TD_VCPU_STATE_DETAILS_NON_ARCH);
+
+	return tdx_vcpu_state_details_intr_pending(vcpu_state_details);
}

is much better than the old:

bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
-	return pi_has_pending_interrupt(vcpu);
+	bool ret = pi_has_pending_interrupt(vcpu);
+	union tdx_vcpu_state_details details;
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	if (ret || vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
+		return true;
+
+	if (tdx->interrupt_disabled_hlt)
+		return false;
+
+	details.full = td_state_non_arch_read64(tdx, TD_VCPU_STATE_DETAILS_NON_ARCH);
+	return !!details.vmxip;
}

because assuming the vCPU has an interrupt if it's not HALTED is all kinds of
wrong.

However, checking VMXIP for the !HLT case is also wrong. And undesirable, as
evidenced by both this path and the EPT violation retry path wanting to avoid
checking VMXIP.

Except for the guest being stupid (non-HLT TDCALL in an interrupt shadow), having
an interrupt in RVI that is fully unmasked will be extremely rare. Actually,
outside of an interrupt shadow, I don't think it's even possible. I can't think
of any CPU flows that modify RVI in the middle of instruction execution. I.e. if
RVI is non-zero, then either the interrupt has been pending since before the
TDVMCALL, or the TDVMCALL is in an STI/SS shadow. And if the interrupt was
pending before TDVMCALL, then it _must_ be blocked, otherwise the interrupt
would have been serviced at the instruction boundary.

I am completely comfortable saying that KVM doesn't care about STI/SS shadows
outside of the HALTED case, and so unless I'm missing something, I think it makes
sense for tdx_protected_apic_has_interrupt() to not check RVI outside of the HALTED
case, because it's impossible to know if the interrupt is actually unmasked, and
statistically it's far, far more likely that it _is_ masked.

> >unnecessarily returning back to the guest would have a performance impact.
> >
> >But according to the discussion thread above, it seems that Sean prioritized
> >code readability (i.e. reuse the common helper to make TDX code less special)
> >over performance considerations?
>
> To mitigate the performance impact, we can cache the "pending interrupt" status
> on the first read, similar to how guest RSP/RBP are cached to avoid VMREADs for
> normal VMs. This optimization can be done in a separate patch or series.
>
> And, future TDX modules will report the status via registers.
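[Editor's note: put together, the direction Sean describes might look roughly like the sketch below, i.e. only pay for the SEAMCALL to read RVI state when the vCPU is HALTED. This is only an illustration of the idea, not the final code from the "TDX interrupts" series.]

bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
	u64 vcpu_state_details;

	if (pi_has_pending_interrupt(vcpu))
		return true;

	/*
	 * Outside of HALTED it is impossible to know whether an interrupt
	 * sitting in RVI is unmasked, and statistically it almost certainly
	 * is masked, so skip the SEAMCALL unless the vCPU is halted.
	 */
	if (vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
		return false;

	vcpu_state_details =
		td_state_non_arch_read64(to_tdx(vcpu), TD_VCPU_STATE_DETAILS_NON_ARCH);

	return tdx_vcpu_state_details_intr_pending(vcpu_state_details);
}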
On 2/12/2025 8:46 AM, Sean Christopherson wrote: > On Tue, Feb 11, 2025, Chao Gao wrote: >> On Tue, Feb 11, 2025 at 04:11:19PM +0800, Binbin Wu wrote: >>> >>> On 2/11/2025 2:54 PM, Yan Zhao wrote: >>>> On Tue, Feb 11, 2025 at 10:54:39AM +0800, Binbin Wu wrote: >>>>> +static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu) >>>>> +{ >>>>> + struct vcpu_tdx *tdx = to_tdx(vcpu); >>>>> + >>>>> + if (vcpu->run->hypercall.ret) { >>>>> + tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND); >>>>> + tdx->vp_enter_args.r11 = tdx->map_gpa_next; >>>>> + return 1; >>>>> + } >>>>> + >>>>> + tdx->map_gpa_next += TDX_MAP_GPA_MAX_LEN; >>>>> + if (tdx->map_gpa_next >= tdx->map_gpa_end) >>>>> + return 1; >>>>> + >>>>> + /* >>>>> + * Stop processing the remaining part if there is pending interrupt. >>>>> + * Skip checking pending virtual interrupt (reflected by >>>>> + * TDX_VCPU_STATE_DETAILS_INTR_PENDING bit) to save a seamcall because >>>>> + * if guest disabled interrupt, it's OK not returning back to guest >>>>> + * due to non-NMI interrupt. Also it's rare to TDVMCALL_MAP_GPA >>>>> + * immediately after STI or MOV/POP SS. >>>>> + */ >>>>> + if (pi_has_pending_interrupt(vcpu) || >>>>> + kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending) { >>>> Should here also use "kvm_vcpu_has_events()" to replace >>>> "pi_has_pending_interrupt(vcpu) || >>>> kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending" as Sean >>>> suggested at [1]? >>>> >>>> [1] https://lore.kernel.org/all/Z4rIGv4E7Jdmhl8P@google.com >>> For TDX guests, kvm_vcpu_has_events() will check pending virtual interrupt >>> via a SEAM call. As noted in the comments, the check for pending virtual >>> interrupt is intentionally skipped to save the SEAM call. Additionally, > Drat, I had a whole response typed up and then discovered the implementation of > tdx_protected_apic_has_interrupt() had changed. But I think the basic gist > still holds. > > The new version: > > bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu) > { > - return pi_has_pending_interrupt(vcpu); > + u64 vcpu_state_details; > + > + if (pi_has_pending_interrupt(vcpu)) > + return true; > + > + vcpu_state_details = > + td_state_non_arch_read64(to_tdx(vcpu), TD_VCPU_STATE_DETAILS_NON_ARCH); > + > + return tdx_vcpu_state_details_intr_pending(vcpu_state_details); > } > > is much better than the old: > > bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu) > { > - return pi_has_pending_interrupt(vcpu); > + bool ret = pi_has_pending_interrupt(vcpu); > + union tdx_vcpu_state_details details; > + struct vcpu_tdx *tdx = to_tdx(vcpu); > + > + if (ret || vcpu->arch.mp_state != KVM_MP_STATE_HALTED) > + return true; > + > + if (tdx->interrupt_disabled_hlt) > + return false; > + > + details.full = td_state_non_arch_read64(tdx, TD_VCPU_STATE_DETAILS_NON_ARCH); > + return !!details.vmxip; > } > > because assuming the vCPU has an interrupt if it's not HALTED is all kinds of > wrong. > > However, checking VMXIP for the !HLT case is also wrong. And undesirable, as > evidenced by both this path and the EPT violation retry path wanted to avoid > checking VMXIP. > > Except for the guest being stupid (non-HLT TDCALL in an interrupt shadow), having > an interrupt in RVI that is fully unmasked will be extremely rare. Actually, > outside of an interrupt shadow, I don't think it's even possible. I can't think > of any CPU flows that modify RVI in the middle of instruction execution. I.e. 
if > RVI is non-zero, then either the interrupt has been pending since before the > TDVMCALL, or the TDVMCALL is in an STI/SS shadow. And if the interrupt was > pending before TDVMCALL, then it _must_ be blocked, otherwise the interrupt > would have been serviced at the instruction boundary. Agree. > > I am completely comfortable saying that KVM doesn't care about STI/SS shadows > outside of the HALTED case, and so unless I'm missing something, I think it makes > sense for tdx_protected_apic_has_interrupt() to not check RVI outside of the HALTED > case, because it's impossible to know if the interrupt is actually unmasked, and > statistically it's far, far more likely that it _is_ masked. OK. Will update tdx_protected_apic_has_interrupt() in "TDX interrupts" part. And use kvm_vcpu_has_events() to replace the open code in this patch. Thanks! > >>> unnecessarily returning back to guest will has performance impact. >>> >>> But according to the discussion thread above, it seems that Sean prioritized >>> code readability (i.e. reuse the common helper to make TDX code less special) >>> over performance considerations? >> To mitigate the performance impact, we can cache the "pending interrupt" status >> on the first read, similar to how guest RSP/RBP are cached to avoid VMREADs for >> normal VMs. This optimization can be done in a separate patch or series. >> >> And, future TDX modules will report the status via registers.
On Wed, Feb 12, 2025, Binbin Wu wrote:
> On 2/12/2025 8:46 AM, Sean Christopherson wrote:
> > I am completely comfortable saying that KVM doesn't care about STI/SS shadows
> > outside of the HALTED case, and so unless I'm missing something, I think it makes
> > sense for tdx_protected_apic_has_interrupt() to not check RVI outside of the HALTED
> > case, because it's impossible to know if the interrupt is actually unmasked, and
> > statistically it's far, far more likely that it _is_ masked.
> OK. Will update tdx_protected_apic_has_interrupt() in "TDX interrupts" part.
> And use kvm_vcpu_has_events() to replace the open code in this patch.

Something to keep an eye on: kvm_vcpu_has_events() returns true if pv_unhalted
is set, and pv_unhalted is only cleared on transitions to KVM_MP_STATE_RUNNABLE.
If the guest initiates a spurious wakeup, pv_unhalted could be left set in
perpetuity.

I _think_ this would work and is generally desirable?

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8e77e61d4fbd..435ca2782c3c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11114,9 +11114,6 @@ static bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
 	    kvm_apic_init_sipi_allowed(vcpu))
 		return true;
 
-	if (vcpu->arch.pv.pv_unhalted)
-		return true;
-
 	if (kvm_is_exception_pending(vcpu))
 		return true;
 
@@ -11157,7 +11154,8 @@ static bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
 
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
 {
-	return kvm_vcpu_running(vcpu) || kvm_vcpu_has_events(vcpu);
+	return kvm_vcpu_running(vcpu) || vcpu->arch.pv.pv_unhalted ||
+	       kvm_vcpu_has_events(vcpu);
 }
 
 /* Called within kvm->srcu read side. */
@@ -11293,7 +11291,7 @@ static int __kvm_emulate_halt(struct kvm_vcpu *vcpu, int state, int reason)
 	 */
 	++vcpu->stat.halt_exits;
 	if (lapic_in_kernel(vcpu)) {
-		if (kvm_vcpu_has_events(vcpu))
+		if (kvm_vcpu_has_events(vcpu) || vcpu->arch.pv.pv_unhalted)
 			vcpu->arch.pv.pv_unhalted = false;
 		else
 			vcpu->arch.mp_state = state;
On 2/13/2025 2:56 AM, Sean Christopherson wrote: > On Wed, Feb 12, 2025, Binbin Wu wrote: >> On 2/12/2025 8:46 AM, Sean Christopherson wrote: >>> I am completely comfortable saying that KVM doesn't care about STI/SS shadows >>> outside of the HALTED case, and so unless I'm missing something, I think it makes >>> sense for tdx_protected_apic_has_interrupt() to not check RVI outside of the HALTED >>> case, because it's impossible to know if the interrupt is actually unmasked, and >>> statistically it's far, far more likely that it _is_ masked. >> OK. Will update tdx_protected_apic_has_interrupt() in "TDX interrupts" part. >> And use kvm_vcpu_has_events() to replace the open code in this patch. > Something to keep an eye on: kvm_vcpu_has_events() returns true if pv_unhalted > is set, and pv_unhalted is only cleared on transitions KVM_MP_STATE_RUNNABLE. > If the guest initiates a spurious wakeup, pv_unhalted could be left set in > perpetuity. Oh, yes. KVM_HC_KICK_CPU is allowed in TDX guests. The change below looks good to me. One minor issue is when guest initiates a spurious wakeup, pv_unhalted is left set, then later when the guest want to halt the vcpu, in __kvm_emulate_halt(), since pv_unhalted is still set and the state will not transit to KVM_MP_STATE_HALTED. But I guess it's guests' responsibility to not initiate spurious wakeup, guests need to bear the fact that HLT could fail due to a previous spurious wakeup? > > I _think_ this would work and is generally desirable? > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 8e77e61d4fbd..435ca2782c3c 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -11114,9 +11114,6 @@ static bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu) > kvm_apic_init_sipi_allowed(vcpu)) > return true; > > - if (vcpu->arch.pv.pv_unhalted) > - return true; > - > if (kvm_is_exception_pending(vcpu)) > return true; > > @@ -11157,7 +11154,8 @@ static bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu) > > int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu) > { > - return kvm_vcpu_running(vcpu) || kvm_vcpu_has_events(vcpu); > + return kvm_vcpu_running(vcpu) || vcpu->arch.pv.pv_unhalted || > + kvm_vcpu_has_events(vcpu); > } > > /* Called within kvm->srcu read side. */ > @@ -11293,7 +11291,7 @@ static int __kvm_emulate_halt(struct kvm_vcpu *vcpu, int state, int reason) > */ > ++vcpu->stat.halt_exits; > if (lapic_in_kernel(vcpu)) { > - if (kvm_vcpu_has_events(vcpu)) > + if (kvm_vcpu_has_events(vcpu) || vcpu->arch.pv.pv_unhalted) > vcpu->arch.pv.pv_unhalted = false; > else > vcpu->arch.mp_state = state; > >
On 2/13/2025 11:23 AM, Binbin Wu wrote: > > > On 2/13/2025 2:56 AM, Sean Christopherson wrote: >> On Wed, Feb 12, 2025, Binbin Wu wrote: >>> On 2/12/2025 8:46 AM, Sean Christopherson wrote: >>>> I am completely comfortable saying that KVM doesn't care about STI/SS shadows >>>> outside of the HALTED case, and so unless I'm missing something, I think it makes >>>> sense for tdx_protected_apic_has_interrupt() to not check RVI outside of the HALTED >>>> case, because it's impossible to know if the interrupt is actually unmasked, and >>>> statistically it's far, far more likely that it _is_ masked. >>> OK. Will update tdx_protected_apic_has_interrupt() in "TDX interrupts" part. >>> And use kvm_vcpu_has_events() to replace the open code in this patch. >> Something to keep an eye on: kvm_vcpu_has_events() returns true if pv_unhalted >> is set, and pv_unhalted is only cleared on transitions KVM_MP_STATE_RUNNABLE. >> If the guest initiates a spurious wakeup, pv_unhalted could be left set in >> perpetuity. > > Oh, yes. > KVM_HC_KICK_CPU is allowed in TDX guests. > > The change below looks good to me. > > One minor issue is when guest initiates a spurious wakeup, pv_unhalted is > left set, then later when the guest want to halt the vcpu, in > __kvm_emulate_halt(), since pv_unhalted is still set and the state will not > transit to KVM_MP_STATE_HALTED. > But I guess it's guests' responsibility to not initiate spurious wakeup, > guests need to bear the fact that HLT could fail due to a previous > spurious wakeup? Just found a patch set for fixing the issue. https://lore.kernel.org/kvm/20250113200150.487409-1-jmattson@google.com/ > >> >> I _think_ this would work and is generally desirable? >> >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c >> index 8e77e61d4fbd..435ca2782c3c 100644 >> --- a/arch/x86/kvm/x86.c >> +++ b/arch/x86/kvm/x86.c >> @@ -11114,9 +11114,6 @@ static bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu) >> kvm_apic_init_sipi_allowed(vcpu)) >> return true; >> - if (vcpu->arch.pv.pv_unhalted) >> - return true; >> - >> if (kvm_is_exception_pending(vcpu)) >> return true; >> @@ -11157,7 +11154,8 @@ static bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu) >> int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu) >> { >> - return kvm_vcpu_running(vcpu) || kvm_vcpu_has_events(vcpu); >> + return kvm_vcpu_running(vcpu) || vcpu->arch.pv.pv_unhalted || >> + kvm_vcpu_has_events(vcpu); >> } >> /* Called within kvm->srcu read side. */ >> @@ -11293,7 +11291,7 @@ static int __kvm_emulate_halt(struct kvm_vcpu *vcpu, int state, int reason) >> */ >> ++vcpu->stat.halt_exits; >> if (lapic_in_kernel(vcpu)) { >> - if (kvm_vcpu_has_events(vcpu)) >> + if (kvm_vcpu_has_events(vcpu) || vcpu->arch.pv.pv_unhalted) >> vcpu->arch.pv.pv_unhalted = false; >> else >> vcpu->arch.mp_state = state; >> >> > >
On Thu, Feb 13, 2025, Binbin Wu wrote: > On 2/13/2025 11:23 AM, Binbin Wu wrote: > > On 2/13/2025 2:56 AM, Sean Christopherson wrote: > > > On Wed, Feb 12, 2025, Binbin Wu wrote: > > > > On 2/12/2025 8:46 AM, Sean Christopherson wrote: > > > > > I am completely comfortable saying that KVM doesn't care about STI/SS shadows > > > > > outside of the HALTED case, and so unless I'm missing something, I think it makes > > > > > sense for tdx_protected_apic_has_interrupt() to not check RVI outside of the HALTED > > > > > case, because it's impossible to know if the interrupt is actually unmasked, and > > > > > statistically it's far, far more likely that it _is_ masked. > > > > OK. Will update tdx_protected_apic_has_interrupt() in "TDX interrupts" part. > > > > And use kvm_vcpu_has_events() to replace the open code in this patch. > > > Something to keep an eye on: kvm_vcpu_has_events() returns true if pv_unhalted > > > is set, and pv_unhalted is only cleared on transitions KVM_MP_STATE_RUNNABLE. > > > If the guest initiates a spurious wakeup, pv_unhalted could be left set in > > > perpetuity. > > > > Oh, yes. > > KVM_HC_KICK_CPU is allowed in TDX guests. And a clever guest can send a REMRD IPI. > > The change below looks good to me. > > > > One minor issue is when guest initiates a spurious wakeup, pv_unhalted is > > left set, then later when the guest want to halt the vcpu, in > > __kvm_emulate_halt(), since pv_unhalted is still set and the state will not > > transit to KVM_MP_STATE_HALTED. > > But I guess it's guests' responsibility to not initiate spurious wakeup, > > guests need to bear the fact that HLT could fail due to a previous > > spurious wakeup? > > Just found a patch set for fixing the issue. FWIW, Jim's series doesn't address spurious wakeups per se, it just ensures pv_unhalted is cleared when transitioning to RUNNING. If the vCPU is already RUNNING, __apic_accept_irq() will set pv_unhalted and nothing will clear it until the next transition to RUNNING (which implies at least an attempted transition away from RUNNING).
On 2/13/2025 11:17 PM, Sean Christopherson wrote: > On Thu, Feb 13, 2025, Binbin Wu wrote: >> On 2/13/2025 11:23 AM, Binbin Wu wrote: >>> On 2/13/2025 2:56 AM, Sean Christopherson wrote: >>>> On Wed, Feb 12, 2025, Binbin Wu wrote: >>>>> On 2/12/2025 8:46 AM, Sean Christopherson wrote: >>>>>> I am completely comfortable saying that KVM doesn't care about STI/SS shadows >>>>>> outside of the HALTED case, and so unless I'm missing something, I think it makes >>>>>> sense for tdx_protected_apic_has_interrupt() to not check RVI outside of the HALTED >>>>>> case, because it's impossible to know if the interrupt is actually unmasked, and >>>>>> statistically it's far, far more likely that it _is_ masked. >>>>> OK. Will update tdx_protected_apic_has_interrupt() in "TDX interrupts" part. >>>>> And use kvm_vcpu_has_events() to replace the open code in this patch. >>>> Something to keep an eye on: kvm_vcpu_has_events() returns true if pv_unhalted >>>> is set, and pv_unhalted is only cleared on transitions KVM_MP_STATE_RUNNABLE. >>>> If the guest initiates a spurious wakeup, pv_unhalted could be left set in >>>> perpetuity. >>> Oh, yes. >>> KVM_HC_KICK_CPU is allowed in TDX guests. > And a clever guest can send a REMRD IPI. > >>> The change below looks good to me. >>> >>> One minor issue is when guest initiates a spurious wakeup, pv_unhalted is >>> left set, then later when the guest want to halt the vcpu, in >>> __kvm_emulate_halt(), since pv_unhalted is still set and the state will not >>> transit to KVM_MP_STATE_HALTED. >>> But I guess it's guests' responsibility to not initiate spurious wakeup, >>> guests need to bear the fact that HLT could fail due to a previous >>> spurious wakeup? >> Just found a patch set for fixing the issue. > FWIW, Jim's series doesn't address spurious wakeups per se, it just ensures > pv_unhalted is cleared when transitioning to RUNNING. If the vCPU is already > RUNNING, __apic_accept_irq() will set pv_unhalted and nothing will clear it > until the next transition to RUNNING (which implies at least an attempted > transition away from RUNNING). > Indeed. I am wondering why KVM doesn't clear pv_unhalted before the vcpu entering guest? Is the additional memory access a concern or is there some other reason?
On Mon, Feb 17, 2025, Binbin Wu wrote: > On 2/13/2025 11:17 PM, Sean Christopherson wrote: > > On Thu, Feb 13, 2025, Binbin Wu wrote: > > > On 2/13/2025 11:23 AM, Binbin Wu wrote: > > > > On 2/13/2025 2:56 AM, Sean Christopherson wrote: > > > > > On Wed, Feb 12, 2025, Binbin Wu wrote: > > > > > > On 2/12/2025 8:46 AM, Sean Christopherson wrote: > > > > > > > I am completely comfortable saying that KVM doesn't care about STI/SS shadows > > > > > > > outside of the HALTED case, and so unless I'm missing something, I think it makes > > > > > > > sense for tdx_protected_apic_has_interrupt() to not check RVI outside of the HALTED > > > > > > > case, because it's impossible to know if the interrupt is actually unmasked, and > > > > > > > statistically it's far, far more likely that it _is_ masked. > > > > > > OK. Will update tdx_protected_apic_has_interrupt() in "TDX interrupts" part. > > > > > > And use kvm_vcpu_has_events() to replace the open code in this patch. > > > > > Something to keep an eye on: kvm_vcpu_has_events() returns true if pv_unhalted > > > > > is set, and pv_unhalted is only cleared on transitions KVM_MP_STATE_RUNNABLE. > > > > > If the guest initiates a spurious wakeup, pv_unhalted could be left set in > > > > > perpetuity. > > > > Oh, yes. > > > > KVM_HC_KICK_CPU is allowed in TDX guests. > > And a clever guest can send a REMRD IPI. > > > > > > The change below looks good to me. > > > > > > > > One minor issue is when guest initiates a spurious wakeup, pv_unhalted is > > > > left set, then later when the guest want to halt the vcpu, in > > > > __kvm_emulate_halt(), since pv_unhalted is still set and the state will not > > > > transit to KVM_MP_STATE_HALTED. > > > > But I guess it's guests' responsibility to not initiate spurious wakeup, > > > > guests need to bear the fact that HLT could fail due to a previous > > > > spurious wakeup? > > > Just found a patch set for fixing the issue. > > FWIW, Jim's series doesn't address spurious wakeups per se, it just ensures > > pv_unhalted is cleared when transitioning to RUNNING. If the vCPU is already > > RUNNING, __apic_accept_irq() will set pv_unhalted and nothing will clear it > > until the next transition to RUNNING (which implies at least an attempted > > transition away from RUNNING). > > > Indeed. > > I am wondering why KVM doesn't clear pv_unhalted before the vcpu entering guest? > Is the additional memory access a concern or is there some other reason? Not clearing pv_unhalted when entering the guest is necessary to avoid races. Stating the obvious, the guest must set up all of its lock tracking before executing HLT, which means that the soon-to-be-blocking vCPU is eligible for wakeup *before* it executes HLT. If an asynchronous exit happens on the vCPU at just the right time, KVM could clear pv_unhalted before the vCPU executes HLT.
On 2/19/2025 8:29 AM, Sean Christopherson wrote: > On Mon, Feb 17, 2025, Binbin Wu wrote: >> On 2/13/2025 11:17 PM, Sean Christopherson wrote: >>> On Thu, Feb 13, 2025, Binbin Wu wrote: >>>> On 2/13/2025 11:23 AM, Binbin Wu wrote: >>>>> On 2/13/2025 2:56 AM, Sean Christopherson wrote: >>>>>> On Wed, Feb 12, 2025, Binbin Wu wrote: >>>>>>> On 2/12/2025 8:46 AM, Sean Christopherson wrote: >>>>>>>> I am completely comfortable saying that KVM doesn't care about STI/SS shadows >>>>>>>> outside of the HALTED case, and so unless I'm missing something, I think it makes >>>>>>>> sense for tdx_protected_apic_has_interrupt() to not check RVI outside of the HALTED >>>>>>>> case, because it's impossible to know if the interrupt is actually unmasked, and >>>>>>>> statistically it's far, far more likely that it _is_ masked. >>>>>>> OK. Will update tdx_protected_apic_has_interrupt() in "TDX interrupts" part. >>>>>>> And use kvm_vcpu_has_events() to replace the open code in this patch. >>>>>> Something to keep an eye on: kvm_vcpu_has_events() returns true if pv_unhalted >>>>>> is set, and pv_unhalted is only cleared on transitions KVM_MP_STATE_RUNNABLE. >>>>>> If the guest initiates a spurious wakeup, pv_unhalted could be left set in >>>>>> perpetuity. >>>>> Oh, yes. >>>>> KVM_HC_KICK_CPU is allowed in TDX guests. >>> And a clever guest can send a REMRD IPI. >>> >>>>> The change below looks good to me. >>>>> >>>>> One minor issue is when guest initiates a spurious wakeup, pv_unhalted is >>>>> left set, then later when the guest want to halt the vcpu, in >>>>> __kvm_emulate_halt(), since pv_unhalted is still set and the state will not >>>>> transit to KVM_MP_STATE_HALTED. >>>>> But I guess it's guests' responsibility to not initiate spurious wakeup, >>>>> guests need to bear the fact that HLT could fail due to a previous >>>>> spurious wakeup? >>>> Just found a patch set for fixing the issue. >>> FWIW, Jim's series doesn't address spurious wakeups per se, it just ensures >>> pv_unhalted is cleared when transitioning to RUNNING. If the vCPU is already >>> RUNNING, __apic_accept_irq() will set pv_unhalted and nothing will clear it >>> until the next transition to RUNNING (which implies at least an attempted >>> transition away from RUNNING). >>> >> Indeed. >> >> I am wondering why KVM doesn't clear pv_unhalted before the vcpu entering guest? >> Is the additional memory access a concern or is there some other reason? > Not clearing pv_unhalted when entering the guest is necessary to avoid races. > Stating the obvious, the guest must set up all of its lock tracking before executing > HLT, which means that the soon-to-be-blocking vCPU is eligible for wakeup *before* > it executes HLT. If an asynchronous exit happens on the vCPU at just the right > time, KVM could clear pv_unhalted before the vCPU executes HLT. > Got it. Thanks for the explanation.
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index 4aedab1f2a1a..f23657350d28 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -77,6 +77,7 @@
 #define TDVMCALL_STATUS_SUCCESS		0x0000000000000000ULL
 #define TDVMCALL_STATUS_RETRY			0x0000000000000001ULL
 #define TDVMCALL_STATUS_INVALID_OPERAND	0x8000000000000000ULL
+#define TDVMCALL_STATUS_ALIGN_ERROR		0x8000000000000002ULL
 
 /*
  * Bitmasks of exposed registers (with VMM).
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index daa49f2ee2b3..8b51b4c937e9 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -981,9 +981,122 @@ static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
 	return __kvm_emulate_hypercall(vcpu, 0, complete_hypercall_exit);
 }
 
+/*
+ * Split into chunks and check interrupt pending between chunks. This allows
+ * for timely injection of interrupts to prevent issues with guest lockup
+ * detection.
+ */
+#define TDX_MAP_GPA_MAX_LEN (2 * 1024 * 1024)
+static void __tdx_map_gpa(struct vcpu_tdx *tdx);
+
+static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	if (vcpu->run->hypercall.ret) {
+		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
+		tdx->vp_enter_args.r11 = tdx->map_gpa_next;
+		return 1;
+	}
+
+	tdx->map_gpa_next += TDX_MAP_GPA_MAX_LEN;
+	if (tdx->map_gpa_next >= tdx->map_gpa_end)
+		return 1;
+
+	/*
+	 * Stop processing the remaining part if there is pending interrupt.
+	 * Skip checking pending virtual interrupt (reflected by
+	 * TDX_VCPU_STATE_DETAILS_INTR_PENDING bit) to save a seamcall because
+	 * if guest disabled interrupt, it's OK not returning back to guest
+	 * due to non-NMI interrupt. Also it's rare to TDVMCALL_MAP_GPA
+	 * immediately after STI or MOV/POP SS.
+	 */
+	if (pi_has_pending_interrupt(vcpu) ||
+	    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending) {
+		tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_RETRY);
+		tdx->vp_enter_args.r11 = tdx->map_gpa_next;
+		return 1;
+	}
+
+	__tdx_map_gpa(tdx);
+	return 0;
+}
+
+static void __tdx_map_gpa(struct vcpu_tdx *tdx)
+{
+	u64 gpa = tdx->map_gpa_next;
+	u64 size = tdx->map_gpa_end - tdx->map_gpa_next;
+
+	if (size > TDX_MAP_GPA_MAX_LEN)
+		size = TDX_MAP_GPA_MAX_LEN;
+
+	tdx->vcpu.run->exit_reason = KVM_EXIT_HYPERCALL;
+	tdx->vcpu.run->hypercall.nr = KVM_HC_MAP_GPA_RANGE;
+	/*
+	 * In principle this should have been -KVM_ENOSYS, but userspace (QEMU <=9.2)
+	 * assumed that vcpu->run->hypercall.ret is never changed by KVM and thus that
+	 * it was always zero on KVM_EXIT_HYPERCALL. Since KVM is now overwriting
+	 * vcpu->run->hypercall.ret, ensuring that it is zero to not break QEMU.
+	 */
+	tdx->vcpu.run->hypercall.ret = 0;
+	tdx->vcpu.run->hypercall.args[0] = gpa & ~gfn_to_gpa(kvm_gfn_direct_bits(tdx->vcpu.kvm));
+	tdx->vcpu.run->hypercall.args[1] = size / PAGE_SIZE;
+	tdx->vcpu.run->hypercall.args[2] = vt_is_tdx_private_gpa(tdx->vcpu.kvm, gpa) ?
+					   KVM_MAP_GPA_RANGE_ENCRYPTED :
+					   KVM_MAP_GPA_RANGE_DECRYPTED;
+	tdx->vcpu.run->hypercall.flags = KVM_EXIT_HYPERCALL_LONG_MODE;
+
+	tdx->vcpu.arch.complete_userspace_io = tdx_complete_vmcall_map_gpa;
+}
+
+static int tdx_map_gpa(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	u64 gpa = tdx->vp_enter_args.r12;
+	u64 size = tdx->vp_enter_args.r13;
+	u64 ret;
+
+	/*
+	 * Converting TDVMCALL_MAP_GPA to KVM_HC_MAP_GPA_RANGE requires
+	 * userspace to enable KVM_CAP_EXIT_HYPERCALL with KVM_HC_MAP_GPA_RANGE
+	 * bit set. If not, the error code is not defined in GHCI for TDX, use
+	 * TDVMCALL_STATUS_INVALID_OPERAND for this case.
+	 */
+	if (!user_exit_on_hypercall(vcpu->kvm, KVM_HC_MAP_GPA_RANGE)) {
+		ret = TDVMCALL_STATUS_INVALID_OPERAND;
+		goto error;
+	}
+
+	if (gpa + size <= gpa || !kvm_vcpu_is_legal_gpa(vcpu, gpa) ||
+	    !kvm_vcpu_is_legal_gpa(vcpu, gpa + size - 1) ||
+	    (vt_is_tdx_private_gpa(vcpu->kvm, gpa) !=
+	     vt_is_tdx_private_gpa(vcpu->kvm, gpa + size - 1))) {
+		ret = TDVMCALL_STATUS_INVALID_OPERAND;
+		goto error;
+	}
+
+	if (!PAGE_ALIGNED(gpa) || !PAGE_ALIGNED(size)) {
+		ret = TDVMCALL_STATUS_ALIGN_ERROR;
+		goto error;
+	}
+
+	tdx->map_gpa_end = gpa + size;
+	tdx->map_gpa_next = gpa;
+
+	__tdx_map_gpa(tdx);
+	return 0;
+
+error:
+	tdvmcall_set_return_code(vcpu, ret);
+	tdx->vp_enter_args.r11 = gpa;
+	return 1;
+}
+
 static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 {
 	switch (tdvmcall_leaf(vcpu)) {
+	case TDVMCALL_MAP_GPA:
+		return tdx_map_gpa(vcpu);
 	default:
 		break;
 	}
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 0e3522e423cc..45c1d064b6b7 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -57,6 +57,9 @@ struct vcpu_tdx {
 	u64 vp_enter_ret;
 
 	enum vcpu_tdx_state state;
+
+	u64 map_gpa_next;
+	u64 map_gpa_end;
 };
 
 void tdh_vp_rd_failed(struct vcpu_tdx *tdx, char *uclass, u32 field, u64 err);
Convert TDG.VP.VMCALL<MapGPA> to KVM_EXIT_HYPERCALL with KVM_HC_MAP_GPA_RANGE
and forward it to userspace for handling.

MapGPA is used by a TDX guest to request that a GPA range be mapped as private
or shared memory, and it needs to exit to userspace for handling. KVM has
already implemented a similar hypercall, KVM_HC_MAP_GPA_RANGE, which exits to
userspace with exit reason KVM_EXIT_HYPERCALL. Do sanity checks, convert
TDVMCALL_MAP_GPA to KVM_HC_MAP_GPA_RANGE and forward the request to userspace.

To prevent a TDG.VP.VMCALL<MapGPA> call from taking too long, the MapGPA range
is split into 2MB chunks and pending interrupts are checked between chunks.
This allows for timely injection of interrupts and prevents issues with guest
lockup detection. The TDX guest should retry the operation for the GPA
starting at the address specified in R11 when the TDVMCALL returns
TDVMCALL_STATUS_RETRY as the status code.

Note that userspace needs to enable KVM_CAP_EXIT_HYPERCALL with the
KVM_HC_MAP_GPA_RANGE bit set for a TD VM.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
Hypercalls exit to userspace v2:
- Skip setting of return code as TDVMCALL_STATUS_SUCCESS.
- Use vp_enter_args instead of x86 registers.
- Remove unnecessary comments.
- Zero run->hypercall.ret in __tdx_map_gpa() following the pattern of Paolo's
  patch; the feedback of adding a helper is still pending. (Rick)
  https://lore.kernel.org/kvm/20241213194137.315304-1-pbonzini@redhat.com

Hypercalls exit to userspace v1:
- New added. Implement one of the hypercalls that needs to exit to userspace
  for handling after dropping "KVM: TDX: Add KVM Exit for TDX TDG.VP.VMCALL",
  which tries to resolve Sean's comment.
  https://lore.kernel.org/kvm/Zg18ul8Q4PGQMWam@google.com/
- Check interrupt pending between chunks suggested by Sean.
  https://lore.kernel.org/kvm/ZleJvmCawKqmpFIa@google.com/
- Use TDVMCALL_STATUS prefix for TDX call status codes (Binbin)
- Use vt_is_tdx_private_gpa()
---
 arch/x86/include/asm/shared/tdx.h |   1 +
 arch/x86/kvm/vmx/tdx.c            | 113 ++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.h            |   3 +
 3 files changed, 117 insertions(+)
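[Editor's note: for completeness, the userspace side of that requirement could look roughly like the snippet below. This is a hedged sketch, not taken from QEMU or this series; error handling is omitted.]

#include <linux/kvm.h>
#include <linux/kvm_para.h>
#include <sys/ioctl.h>

/* Enable forwarding of KVM_HC_MAP_GPA_RANGE hypercalls to userspace for a TD VM. */
static void enable_map_gpa_exit(int vm_fd)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_EXIT_HYPERCALL,
		/* args[0] is a bitmask of hypercall numbers that should exit to userspace. */
		.args[0] = 1ULL << KVM_HC_MAP_GPA_RANGE,
	};

	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}

With this enabled, each 2MB chunk of a MapGPA request surfaces as a KVM_EXIT_HYPERCALL with nr == KVM_HC_MAP_GPA_RANGE, and userspace converts the range between private and shared memory before resuming the vCPU.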