
[2/2] KVM: x86: Fix INIT signal handling in various CPU states

Message ID 20190826102449.142687-3-liran.alon@oracle.com (mailing list archive)
State New, archived
Series [1/2] KVM: VMX: Introduce exit reason for receiving INIT signal on guest-mode

Commit Message

Liran Alon Aug. 26, 2019, 10:24 a.m. UTC
Commit cd7764fe9f73 ("KVM: x86: latch INITs while in system management mode")
changed the code to latch INIT while the vCPU is in SMM and to process the
latched INIT when leaving SMM. It left a subtle remark in the commit message
that similar treatment should also be done while the vCPU is in VMX
non-root-mode.

However, INIT signals should actually be latched in various vCPU states:
(*) For both Intel and AMD, INIT signals should be latched while the vCPU
is in SMM.
(*) For Intel, INIT should also be latched while the vCPU is in VMX
operation and later processed when the vCPU leaves VMX operation by
executing VMXOFF.
(*) For AMD, INIT should also be latched while the vCPU runs with GIF=0
or is in guest-mode with an intercept defined on the INIT signal.

To fix this:
1) Add kvm_x86_ops->apic_init_signal_blocked() such that each CPU vendor
can define the various CPU states in which INIT signals should be
blocked and modify kvm_apic_accept_events() to use it.
2) Modify vmx_check_nested_events() to check for a pending INIT signal
while the vCPU is in guest-mode. If so, emulate a vmexit on
EXIT_REASON_INIT_SIGNAL. Note that nSVM should have similar behaviour,
but this is currently left as a TODO comment to implement in the future
because nSVM doesn't yet implement svm_check_nested_events().

Note: Currently the KVM nVMX implementation doesn't support the VMX
wait-for-SIPI activity state as specified in MSR_IA32_VMX_MISC bits 6:8
exposed to the guest (see nested_vmx_setup_ctls_msrs()).
If and when support for this activity state is implemented,
vmx_check_nested_events() would need to avoid emulating a vmexit on an
INIT signal in case the activity-state is wait-for-SIPI. In addition,
kvm_apic_accept_events() would need to be modified to avoid discarding
a SIPI when the VMX activity-state is wait-for-SIPI, and instead delay
SIPI processing to vmx_check_nested_events(), which would clear the
pending APIC events and emulate a vmexit on SIPI.

Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Co-developed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Signed-off-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/lapic.c            | 11 +++++++----
 arch/x86/kvm/svm.c              | 20 ++++++++++++++++++++
 arch/x86/kvm/vmx/nested.c       | 14 ++++++++++++++
 arch/x86/kvm/vmx/vmx.c          |  6 ++++++
 5 files changed, 49 insertions(+), 4 deletions(-)

Comments

Sean Christopherson Aug. 26, 2019, 4:03 p.m. UTC | #1
On Mon, Aug 26, 2019 at 01:24:49PM +0300, Liran Alon wrote:
> Commit cd7764fe9f73 ("KVM: x86: latch INITs while in system management mode")
> changed code to latch INIT while vCPU is in SMM and process latched INIT
> when leaving SMM. It left a subtle remark in commit message that similar
> treatment should also be done while vCPU is in VMX non-root-mode.
> 
> However, INIT signals should actually be latched in various vCPU states:
> (*) For both Intel and AMD, INIT signals should be latched while vCPU
> is in SMM.
> (*) For Intel, INIT should also be latched while vCPU is in VMX
> operation and later processed when vCPU leaves VMX operation by
> executing VMXOFF.

I think there's a {get,set}_nested_state() component missing, e.g. the SMM
case exposes and consumes a pending INIT via events->smi.latched_init.
Similar functionality is needed to preserve a pending INIT for post-VMXON
across migration.

> (*) For AMD, INIT should also be latched while vCPU runs with GIF=0
> or in guest-mode with intercept defined on INIT signal.
> 
> To fix this:
> 1) Add kvm_x86_ops->apic_init_signal_blocked() such that each CPU vendor
> can define the various CPU states in which INIT signals should be
> blocked and modify kvm_apic_accept_events() to use it.

kvm_arch_vcpu_ioctl_set_mpstate() should also use the new helper, e.g.
userspace shouldn't be able to put the vCPU into KVM_MP_STATE_SIPI_RECEIVED
or KVM_MP_STATE_INIT_RECEIVED if it's post-VMXON.
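
A minimal sketch of such a guard in kvm_arch_vcpu_ioctl_set_mpstate(),
extending the existing SMM check to the new callback (illustration only,
not the actual patch; the exact placement and the smi_pending handling
are assumptions):

	/* in kvm_arch_vcpu_ioctl_set_mpstate() */
	/*
	 * INITs are latched while the vCPU is in a state that blocks them
	 * (SMM, post-VMXON, ...), so refuse to move the vCPU into
	 * INIT_RECEIVED/SIPI_RECEIVED from userspace in those states.
	 */
	if ((kvm_x86_ops->apic_init_signal_blocked(vcpu) ||
	     vcpu->arch.smi_pending) &&
	    (mp_state->mp_state == KVM_MP_STATE_SIPI_RECEIVED ||
	     mp_state->mp_state == KVM_MP_STATE_INIT_RECEIVED))
		return -EINVAL;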

> 2) Modify vmx_check_nested_events() to check for pending INIT signal
> while vCPU in guest-mode. If so, emualte vmexit on
> EXIT_REASON_INIT_SIGNAL. Note that nSVM should have similar behaviour
> but is currently left as a TODO comment to implement in the future
> because nSVM don't yet implement svm_check_nested_events().
> 
> Note: Currently KVM nVMX implementation don't support VMX wait-for-SIPI
> activity state as specified in MSR_IA32_VMX_MISC bits 6:8 exposed to
> guest (See nested_vmx_setup_ctls_msrs()).
> If and when support for this activity state will be implemented,
> kvm_check_nested_events() would need to avoid emulating vmexit on
> INIT signal in case activity-state is wait-for-SIPI. In addition,
> kvm_apic_accept_events() would need to be modified to avoid discarding
> SIPI in case VMX activity-state is wait-for-SIPI but instead delay
> SIPI processing to vmx_check_nested_events() that would clear
> pending APIC events and emulate vmexit on SIPI.
> 
> Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
> Co-developed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
> Signed-off-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
> Signed-off-by: Liran Alon <liran.alon@oracle.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  2 ++
>  arch/x86/kvm/lapic.c            | 11 +++++++----
>  arch/x86/kvm/svm.c              | 20 ++++++++++++++++++++
>  arch/x86/kvm/vmx/nested.c       | 14 ++++++++++++++
>  arch/x86/kvm/vmx/vmx.c          |  6 ++++++
>  5 files changed, 49 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 74e88e5edd9c..158483538181 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1209,6 +1209,8 @@ struct kvm_x86_ops {
>  	uint16_t (*nested_get_evmcs_version)(struct kvm_vcpu *vcpu);
>  
>  	bool (*need_emulation_on_page_fault)(struct kvm_vcpu *vcpu);
> +
> +	bool (*apic_init_signal_blocked)(struct kvm_vcpu *vcpu);
>  };
>  
>  struct kvm_arch_async_pf {
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 685d17c11461..9620fe5ce8d1 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2702,11 +2702,14 @@ void kvm_apic_accept_events(struct kvm_vcpu *vcpu)
>  		return;
>  
>  	/*
> -	 * INITs are latched while in SMM.  Because an SMM CPU cannot
> -	 * be in KVM_MP_STATE_INIT_RECEIVED state, just eat SIPIs
> -	 * and delay processing of INIT until the next RSM.
> +	 * INITs are latched while CPU is in specific states.
> +	 * Because a CPU cannot be in these states immediately
> +	 * after it have processed an INIT signal (and thus in
> +	 * KVM_MP_STATE_INIT_RECEIVED state), just eat SIPIs
> +	 * and delay processing of INIT until CPU leaves
> +	 * the state which latch INIT signal.
>  	 */
> -	if (is_smm(vcpu)) {
> +	if (kvm_x86_ops->apic_init_signal_blocked(vcpu)) {

We may want to keep the SMM+INIT check outside of the helper so as to avoid
splitting the SMI+INIT logic across common x86 and vendor-specific code.
E.g. kvm_arch_vcpu_ioctl_set_mpstate() needs to handle the SMI pending
scenario and kvm_vcpu_ioctl_x86_set_vcpu_events() must prevent putting
the vCPU into SMM or pending an SMI when the vCPU is already in
KVM_MP_STATE_INIT_RECEIVED.

>  		WARN_ON_ONCE(vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED);
>  		if (test_bit(KVM_APIC_SIPI, &apic->pending_events))
>  			clear_bit(KVM_APIC_SIPI, &apic->pending_events);
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 6462c386015d..0e43acf7bea4 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -7205,6 +7205,24 @@ static bool svm_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
>  	return false;
>  }
>  
> +static bool svm_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_svm *svm = to_svm(vcpu);
> +
> +	/*
> +	 * TODO: Last condition latch INIT signals on vCPU when
> +	 * vCPU is in guest-mode and vmcb12 defines intercept on INIT.
> +	 * To properly emulate INIT intercept, SVM should also implement
> +	 * kvm_x86_ops->check_nested_events() and process pending INIT
> +	 * signal to cause nested_svm_vmexit(). As SVM currently don't
> +	 * implement check_nested_events(), this work is delayed
> +	 * for future improvement.
> +	 */
> +	return is_smm(vcpu) ||
> +		   !gif_set(svm) ||
> +		   (svm->vmcb->control.intercept & (1ULL << INTERCEPT_INIT));
> +}
> +
>  static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
>  	.cpu_has_kvm_support = has_svm,
>  	.disabled_by_bios = is_disabled,
> @@ -7341,6 +7359,8 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
>  	.nested_get_evmcs_version = nested_get_evmcs_version,
>  
>  	.need_emulation_on_page_fault = svm_need_emulation_on_page_fault,
> +
> +	.apic_init_signal_blocked = svm_apic_init_signal_blocked,
>  };
>  
>  static int __init svm_init(void)
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index ced9fba32598..d655fcd04c01 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -3401,6 +3401,15 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
>  	unsigned long exit_qual;
>  	bool block_nested_events =
>  	    vmx->nested.nested_run_pending || kvm_event_needs_reinjection(vcpu);
> +	struct kvm_lapic *apic = vcpu->arch.apic;
> +
> +	if (lapic_in_kernel(vcpu) &&
> +		test_bit(KVM_APIC_INIT, &apic->pending_events)) {

Prefer aligned indentation.

> +		if (block_nested_events)
> +			return -EBUSY;
> +		nested_vmx_vmexit(vcpu, EXIT_REASON_INIT_SIGNAL, 0, 0);
> +		return 0;
> +	}
>  
>  	if (vcpu->arch.exception.pending &&
>  		nested_vmx_check_exception(vcpu, &exit_qual)) {
> @@ -4466,7 +4475,12 @@ static int handle_vmoff(struct kvm_vcpu *vcpu)
>  {
>  	if (!nested_vmx_check_permission(vcpu))
>  		return 1;
> +
>  	free_nested(vcpu);
> +
> +	/* Process a latched INIT during time CPU was in VMX operation */
> +	kvm_make_request(KVM_REQ_EVENT, vcpu);
> +
>  	return nested_vmx_succeed(vcpu);
>  }
>  
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index b5b5b2e5dac5..5a1aa0640f2a 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7479,6 +7479,11 @@ static bool vmx_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
>  	return false;
>  }
>  
> +static bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
> +{
> +	return is_smm(vcpu) || to_vmx(vcpu)->nested.vmxon;
> +}
> +
>  static __init int hardware_setup(void)
>  {
>  	unsigned long host_bndcfgs;
> @@ -7803,6 +7808,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
>  	.get_vmcs12_pages = NULL,
>  	.nested_enable_evmcs = NULL,
>  	.need_emulation_on_page_fault = vmx_need_emulation_on_page_fault,
> +	.apic_init_signal_blocked = vmx_apic_init_signal_blocked,
>  };
>  
>  static void vmx_cleanup_l1d_flush(void)
> -- 
> 2.20.1
>
Liran Alon Aug. 26, 2019, 6:26 p.m. UTC | #2
> On 26 Aug 2019, at 19:03, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> On Mon, Aug 26, 2019 at 01:24:49PM +0300, Liran Alon wrote:
>> Commit cd7764fe9f73 ("KVM: x86: latch INITs while in system management mode")
>> changed code to latch INIT while vCPU is in SMM and process latched INIT
>> when leaving SMM. It left a subtle remark in commit message that similar
>> treatment should also be done while vCPU is in VMX non-root-mode.
>> 
>> However, INIT signals should actually be latched in various vCPU states:
>> (*) For both Intel and AMD, INIT signals should be latched while vCPU
>> is in SMM.
>> (*) For Intel, INIT should also be latched while vCPU is in VMX
>> operation and later processed when vCPU leaves VMX operation by
>> executing VMXOFF.
> 
> I think there's a {get,set}_nested_state() component missing, e.g. the SMM
> case exposes and consumes a pending INIT via events->smi.latched_init.
> Similar functionality is needed to preserve a pending INIT for post-VMXON
> across migration.

Good catch. I have missed this.

It seems that kvm_vcpu_ioctl_x86_get_vcpu_events() sets events->smi.latched_init
just based on KVM_APIC_INIT being set in the LAPIC pending events.
Therefore, in case INIT is latched outside of SMM (e.g. by the vCPU being in VMX operation),
it should also be recorded in this variable.
However, kvm_vcpu_ioctl_x86_set_vcpu_events() seems to only honour this
flag in case the vCPU is in SMM (as indicated by events->smi.smm).

I would actually like to add a new field (that deprecates events->smi.latched_init),
events->pending_lapic_events, which holds the pending LAPIC events
regardless of the specific CPU state we are in. I would need to use
one of the reserved fields in struct kvm_vcpu_events and introduce
a new KVM_VCPUEVENT_VALID_APIC_EVENTS flag to indicate that
this field is specified by userspace.
I will define that when this field is specified, events->smi.latched_init is ignored.

An alternative could be to just add a flag to events->flags that modifies the
behaviour to treat events->smi.latched_init as just events->latched_init.
But I prefer the previous option.
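
For illustration, a rough sketch of how the get/set side of that proposal
could look (hypothetical: pending_lapic_events and
KVM_VCPUEVENT_VALID_APIC_EVENTS are the proposed, not-yet-existing field
and flag, and the surrounding code is abbreviated):

	/* in kvm_vcpu_ioctl_x86_get_vcpu_events() */
	events->pending_lapic_events = lapic_in_kernel(vcpu) ?
				       vcpu->arch.apic->pending_events : 0;
	events->flags |= KVM_VCPUEVENT_VALID_APIC_EVENTS;

	/* in kvm_vcpu_ioctl_x86_set_vcpu_events() */
	if (events->flags & KVM_VCPUEVENT_VALID_APIC_EVENTS) {
		if (lapic_in_kernel(vcpu))
			vcpu->arch.apic->pending_events =
				events->pending_lapic_events;
	} else if (events->flags & KVM_VCPUEVENT_VALID_SMM) {
		/* legacy path: keep honouring events->smi.latched_init */
		...
	}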

I don’t want to introduce a new flag for "pending INIT during VMX operation" as
there are various vendor-specific CPU states that latch INIT (as handled by this commit…).

Do you agree?

> 
>> (*) For AMD, INIT should also be latched while vCPU runs with GIF=0
>> or in guest-mode with intercept defined on INIT signal.
>> 
>> To fix this:
>> 1) Add kvm_x86_ops->apic_init_signal_blocked() such that each CPU vendor
>> can define the various CPU states in which INIT signals should be
>> blocked and modify kvm_apic_accept_events() to use it.
> 
> kvm_arch_vcpu_ioctl_set_mpstate() should also use the new helper, e.g.
> userspace shouldn't be able to put the vCPU into KVM_MP_STATE_SIPI_RECEIVED
> or KVM_MP_STATE_INIT_RECEIVED if it's post-VMXON.

Good catch. I have missed this as well.
Will fix in v2.

> 
>> 2) Modify vmx_check_nested_events() to check for pending INIT signal
>> while vCPU in guest-mode. If so, emualte vmexit on
>> EXIT_REASON_INIT_SIGNAL. Note that nSVM should have similar behaviour
>> but is currently left as a TODO comment to implement in the future
>> because nSVM don't yet implement svm_check_nested_events().
>> 
>> Note: Currently KVM nVMX implementation don't support VMX wait-for-SIPI
>> activity state as specified in MSR_IA32_VMX_MISC bits 6:8 exposed to
>> guest (See nested_vmx_setup_ctls_msrs()).
>> If and when support for this activity state will be implemented,
>> kvm_check_nested_events() would need to avoid emulating vmexit on
>> INIT signal in case activity-state is wait-for-SIPI. In addition,
>> kvm_apic_accept_events() would need to be modified to avoid discarding
>> SIPI in case VMX activity-state is wait-for-SIPI but instead delay
>> SIPI processing to vmx_check_nested_events() that would clear
>> pending APIC events and emulate vmexit on SIPI.
>> 
>> Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
>> Co-developed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
>> Signed-off-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
>> Signed-off-by: Liran Alon <liran.alon@oracle.com>
>> ---
>> arch/x86/include/asm/kvm_host.h |  2 ++
>> arch/x86/kvm/lapic.c            | 11 +++++++----
>> arch/x86/kvm/svm.c              | 20 ++++++++++++++++++++
>> arch/x86/kvm/vmx/nested.c       | 14 ++++++++++++++
>> arch/x86/kvm/vmx/vmx.c          |  6 ++++++
>> 5 files changed, 49 insertions(+), 4 deletions(-)
>> 
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 74e88e5edd9c..158483538181 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -1209,6 +1209,8 @@ struct kvm_x86_ops {
>> 	uint16_t (*nested_get_evmcs_version)(struct kvm_vcpu *vcpu);
>> 
>> 	bool (*need_emulation_on_page_fault)(struct kvm_vcpu *vcpu);
>> +
>> +	bool (*apic_init_signal_blocked)(struct kvm_vcpu *vcpu);
>> };
>> 
>> struct kvm_arch_async_pf {
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index 685d17c11461..9620fe5ce8d1 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -2702,11 +2702,14 @@ void kvm_apic_accept_events(struct kvm_vcpu *vcpu)
>> 		return;
>> 
>> 	/*
>> -	 * INITs are latched while in SMM.  Because an SMM CPU cannot
>> -	 * be in KVM_MP_STATE_INIT_RECEIVED state, just eat SIPIs
>> -	 * and delay processing of INIT until the next RSM.
>> +	 * INITs are latched while CPU is in specific states.
>> +	 * Because a CPU cannot be in these states immediately
>> +	 * after it have processed an INIT signal (and thus in
>> +	 * KVM_MP_STATE_INIT_RECEIVED state), just eat SIPIs
>> +	 * and delay processing of INIT until CPU leaves
>> +	 * the state which latch INIT signal.
>> 	 */
>> -	if (is_smm(vcpu)) {
>> +	if (kvm_x86_ops->apic_init_signal_blocked(vcpu)) {
> 
> We may want to keep the SMM+INIT outside of the helper so as to avoid
> splitting the SMI+INIT logic across common x86 and vendor specific code.
> E.g. kvm_arch_vcpu_ioctl_set_mpstate() needs to handle the SMI pending
> scenario and kvm_vcpu_ioctl_x86_set_vcpu_events() must prevent putting
> the vCPU into SMM or pending an SMI when the vCPU is already in
> KVM_MP_STATE_INIT_RECEIVED.

I thought it would be nicer to have all the conditions for INIT latching defined
in a single place, i.e. in the per-vendor callback.

Why does kvm_arch_vcpu_ioctl_set_mpstate() need to prevent setting mpstate
to INIT_RECEIVED/SIPI_RECEIVED in case an SMI is pending?
I would expect it to only prevent doing so when the vCPU is in a state in which INIT is latched.

The commit which introduced this behaviour,
28bf28887976 ("KVM: x86: fix user triggerable warning in kvm_apic_accept_events()"),
seems to just want to avoid the WARN_ON_ONCE() at kvm_apic_accept_events(),
for which smi_pending is not relevant.
The same argument goes for kvm_vcpu_ioctl_x86_set_vcpu_events(), which was modified by the same commit.
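
For reference, a rough sketch of the check that commit added to
kvm_vcpu_ioctl_x86_set_vcpu_events() (quoted from memory, so the exact
form may differ):

	/* INITs are latched while in SMM */
	if (events->flags & KVM_VCPUEVENT_VALID_SMM &&
	    (events->smi.smm || events->smi.pending) &&
	    vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED)
		return -EINVAL;

Note that it checks events->smi.pending as well, not only events->smi.smm,
which is exactly the smi_pending handling being questioned above.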

Am I missing something?

> 
>> 		WARN_ON_ONCE(vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED);
>> 		if (test_bit(KVM_APIC_SIPI, &apic->pending_events))
>> 			clear_bit(KVM_APIC_SIPI, &apic->pending_events);
>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>> index 6462c386015d..0e43acf7bea4 100644
>> --- a/arch/x86/kvm/svm.c
>> +++ b/arch/x86/kvm/svm.c
>> @@ -7205,6 +7205,24 @@ static bool svm_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
>> 	return false;
>> }
>> 
>> +static bool svm_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
>> +{
>> +	struct vcpu_svm *svm = to_svm(vcpu);
>> +
>> +	/*
>> +	 * TODO: Last condition latch INIT signals on vCPU when
>> +	 * vCPU is in guest-mode and vmcb12 defines intercept on INIT.
>> +	 * To properly emulate INIT intercept, SVM should also implement
>> +	 * kvm_x86_ops->check_nested_events() and process pending INIT
>> +	 * signal to cause nested_svm_vmexit(). As SVM currently don't
>> +	 * implement check_nested_events(), this work is delayed
>> +	 * for future improvement.
>> +	 */
>> +	return is_smm(vcpu) ||
>> +		   !gif_set(svm) ||
>> +		   (svm->vmcb->control.intercept & (1ULL << INTERCEPT_INIT));
>> +}
>> +
>> static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
>> 	.cpu_has_kvm_support = has_svm,
>> 	.disabled_by_bios = is_disabled,
>> @@ -7341,6 +7359,8 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
>> 	.nested_get_evmcs_version = nested_get_evmcs_version,
>> 
>> 	.need_emulation_on_page_fault = svm_need_emulation_on_page_fault,
>> +
>> +	.apic_init_signal_blocked = svm_apic_init_signal_blocked,
>> };
>> 
>> static int __init svm_init(void)
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index ced9fba32598..d655fcd04c01 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -3401,6 +3401,15 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
>> 	unsigned long exit_qual;
>> 	bool block_nested_events =
>> 	    vmx->nested.nested_run_pending || kvm_event_needs_reinjection(vcpu);
>> +	struct kvm_lapic *apic = vcpu->arch.apic;
>> +
>> +	if (lapic_in_kernel(vcpu) &&
>> +		test_bit(KVM_APIC_INIT, &apic->pending_events)) {
> 
> Prefer aligned identation.

Actually I now see that I can just use kvm_lapic_latched_init() instead.
Therefore, I will switch to that.
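
For reference, the resulting hunk in vmx_check_nested_events() would then
look roughly like this (assuming kvm_lapic_latched_init() in
arch/x86/kvm/lapic.h already covers the lapic_in_kernel() check):

	if (kvm_lapic_latched_init(vcpu)) {
		if (block_nested_events)
			return -EBUSY;
		nested_vmx_vmexit(vcpu, EXIT_REASON_INIT_SIGNAL, 0, 0);
		return 0;
	}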

-Liran

> 
>> +		if (block_nested_events)
>> +			return -EBUSY;
>> +		nested_vmx_vmexit(vcpu, EXIT_REASON_INIT_SIGNAL, 0, 0);
>> +		return 0;
>> +	}
>> 
>> 	if (vcpu->arch.exception.pending &&
>> 		nested_vmx_check_exception(vcpu, &exit_qual)) {
>> @@ -4466,7 +4475,12 @@ static int handle_vmoff(struct kvm_vcpu *vcpu)
>> {
>> 	if (!nested_vmx_check_permission(vcpu))
>> 		return 1;
>> +
>> 	free_nested(vcpu);
>> +
>> +	/* Process a latched INIT during time CPU was in VMX operation */
>> +	kvm_make_request(KVM_REQ_EVENT, vcpu);
>> +
>> 	return nested_vmx_succeed(vcpu);
>> }
>> 
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index b5b5b2e5dac5..5a1aa0640f2a 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -7479,6 +7479,11 @@ static bool vmx_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
>> 	return false;
>> }
>> 
>> +static bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
>> +{
>> +	return is_smm(vcpu) || to_vmx(vcpu)->nested.vmxon;
>> +}
>> +
>> static __init int hardware_setup(void)
>> {
>> 	unsigned long host_bndcfgs;
>> @@ -7803,6 +7808,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
>> 	.get_vmcs12_pages = NULL,
>> 	.nested_enable_evmcs = NULL,
>> 	.need_emulation_on_page_fault = vmx_need_emulation_on_page_fault,
>> +	.apic_init_signal_blocked = vmx_apic_init_signal_blocked,
>> };
>> 
>> static void vmx_cleanup_l1d_flush(void)
>> -- 
>> 2.20.1
>>
Jim Mattson Aug. 26, 2019, 9:17 p.m. UTC | #3
On Mon, Aug 26, 2019 at 3:25 AM Liran Alon <liran.alon@oracle.com> wrote:
>
> Commit cd7764fe9f73 ("KVM: x86: latch INITs while in system management mode")
> changed code to latch INIT while vCPU is in SMM and process latched INIT
> when leaving SMM. It left a subtle remark in commit message that similar
> treatment should also be done while vCPU is in VMX non-root-mode.
>
> However, INIT signals should actually be latched in various vCPU states:
> (*) For both Intel and AMD, INIT signals should be latched while vCPU
> is in SMM.
> (*) For Intel, INIT should also be latched while vCPU is in VMX
> operation and later processed when vCPU leaves VMX operation by
> executing VMXOFF.
> (*) For AMD, INIT should also be latched while vCPU runs with GIF=0
> or in guest-mode with intercept defined on INIT signal.
>
> To fix this:
> 1) Add kvm_x86_ops->apic_init_signal_blocked() such that each CPU vendor
> can define the various CPU states in which INIT signals should be
> blocked and modify kvm_apic_accept_events() to use it.
> 2) Modify vmx_check_nested_events() to check for pending INIT signal
> while vCPU in guest-mode. If so, emualte vmexit on
> EXIT_REASON_INIT_SIGNAL. Note that nSVM should have similar behaviour
> but is currently left as a TODO comment to implement in the future
> because nSVM don't yet implement svm_check_nested_events().
>
> Note: Currently KVM nVMX implementation don't support VMX wait-for-SIPI
> activity state as specified in MSR_IA32_VMX_MISC bits 6:8 exposed to
> guest (See nested_vmx_setup_ctls_msrs()).
> If and when support for this activity state will be implemented,
> kvm_check_nested_events() would need to avoid emulating vmexit on
> INIT signal in case activity-state is wait-for-SIPI. In addition,
> kvm_apic_accept_events() would need to be modified to avoid discarding
> SIPI in case VMX activity-state is wait-for-SIPI but instead delay
> SIPI processing to vmx_check_nested_events() that would clear
> pending APIC events and emulate vmexit on SIPI.
>
> Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
> Co-developed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
> Signed-off-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
> Signed-off-by: Liran Alon <liran.alon@oracle.com>

Thanks for doing this! I asked Marc to take a look at it earlier this
month, but he's been swamped.

> ---
>  arch/x86/include/asm/kvm_host.h |  2 ++
>  arch/x86/kvm/lapic.c            | 11 +++++++----
>  arch/x86/kvm/svm.c              | 20 ++++++++++++++++++++
>  arch/x86/kvm/vmx/nested.c       | 14 ++++++++++++++
>  arch/x86/kvm/vmx/vmx.c          |  6 ++++++
>  5 files changed, 49 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 74e88e5edd9c..158483538181 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1209,6 +1209,8 @@ struct kvm_x86_ops {
>         uint16_t (*nested_get_evmcs_version)(struct kvm_vcpu *vcpu);
>
>         bool (*need_emulation_on_page_fault)(struct kvm_vcpu *vcpu);
> +
> +       bool (*apic_init_signal_blocked)(struct kvm_vcpu *vcpu);
>  };
>
>  struct kvm_arch_async_pf {
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 685d17c11461..9620fe5ce8d1 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2702,11 +2702,14 @@ void kvm_apic_accept_events(struct kvm_vcpu *vcpu)
>                 return;
>
>         /*
> -        * INITs are latched while in SMM.  Because an SMM CPU cannot
> -        * be in KVM_MP_STATE_INIT_RECEIVED state, just eat SIPIs
> -        * and delay processing of INIT until the next RSM.
> +        * INITs are latched while CPU is in specific states.
> +        * Because a CPU cannot be in these states immediately
> +        * after it have processed an INIT signal (and thus in
> +        * KVM_MP_STATE_INIT_RECEIVED state), just eat SIPIs
> +        * and delay processing of INIT until CPU leaves
> +        * the state which latch INIT signal.
>          */
> -       if (is_smm(vcpu)) {
> +       if (kvm_x86_ops->apic_init_signal_blocked(vcpu)) {
>                 WARN_ON_ONCE(vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED);
>                 if (test_bit(KVM_APIC_SIPI, &apic->pending_events))
>                         clear_bit(KVM_APIC_SIPI, &apic->pending_events);
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 6462c386015d..0e43acf7bea4 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -7205,6 +7205,24 @@ static bool svm_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
>         return false;
>  }
>
> +static bool svm_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
> +{
> +       struct vcpu_svm *svm = to_svm(vcpu);
> +
> +       /*
> +        * TODO: Last condition latch INIT signals on vCPU when
> +        * vCPU is in guest-mode and vmcb12 defines intercept on INIT.
> +        * To properly emulate INIT intercept, SVM should also implement
> +        * kvm_x86_ops->check_nested_events() and process pending INIT
> +        * signal to cause nested_svm_vmexit(). As SVM currently don't
> +        * implement check_nested_events(), this work is delayed
> +        * for future improvement.
> +        */
> +       return is_smm(vcpu) ||
> +                  !gif_set(svm) ||
> +                  (svm->vmcb->control.intercept & (1ULL << INTERCEPT_INIT));
> +}
> +
>  static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
>         .cpu_has_kvm_support = has_svm,
>         .disabled_by_bios = is_disabled,
> @@ -7341,6 +7359,8 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
>         .nested_get_evmcs_version = nested_get_evmcs_version,
>
>         .need_emulation_on_page_fault = svm_need_emulation_on_page_fault,
> +
> +       .apic_init_signal_blocked = svm_apic_init_signal_blocked,
>  };
>
>  static int __init svm_init(void)
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index ced9fba32598..d655fcd04c01 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -3401,6 +3401,15 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
>         unsigned long exit_qual;
>         bool block_nested_events =
>             vmx->nested.nested_run_pending || kvm_event_needs_reinjection(vcpu);
> +       struct kvm_lapic *apic = vcpu->arch.apic;
> +
> +       if (lapic_in_kernel(vcpu) &&
> +               test_bit(KVM_APIC_INIT, &apic->pending_events)) {
> +               if (block_nested_events)
> +                       return -EBUSY;
> +               nested_vmx_vmexit(vcpu, EXIT_REASON_INIT_SIGNAL, 0, 0);
> +               return 0;
> +       }

Suppose that L0 just finished emulating an L2 instruction with
EFLAGS.TF set. So, we've just queued up a #DB trap in
vcpu->arch.exception. On this emulated VM-exit from L2 to L1, the
guest pending debug exceptions field in the vmcs12 should get the
value of vcpu->arch.exception.payload, and the queued #DB should be
squashed.

>         if (vcpu->arch.exception.pending &&
>                 nested_vmx_check_exception(vcpu, &exit_qual)) {
> @@ -4466,7 +4475,12 @@ static int handle_vmoff(struct kvm_vcpu *vcpu)
>  {
>         if (!nested_vmx_check_permission(vcpu))
>                 return 1;
> +
>         free_nested(vcpu);
> +
> +       /* Process a latched INIT during time CPU was in VMX operation */
> +       kvm_make_request(KVM_REQ_EVENT, vcpu);
> +
>         return nested_vmx_succeed(vcpu);
>  }
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index b5b5b2e5dac5..5a1aa0640f2a 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7479,6 +7479,11 @@ static bool vmx_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
>         return false;
>  }
>
> +static bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
> +{
> +       return is_smm(vcpu) || to_vmx(vcpu)->nested.vmxon;
> +}
> +
>  static __init int hardware_setup(void)
>  {
>         unsigned long host_bndcfgs;
> @@ -7803,6 +7808,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
>         .get_vmcs12_pages = NULL,
>         .nested_enable_evmcs = NULL,
>         .need_emulation_on_page_fault = vmx_need_emulation_on_page_fault,
> +       .apic_init_signal_blocked = vmx_apic_init_signal_blocked,
>  };
>
>  static void vmx_cleanup_l1d_flush(void)
> --
> 2.20.1
>
Liran Alon Aug. 26, 2019, 10:04 p.m. UTC | #4
> On 27 Aug 2019, at 0:17, Jim Mattson <jmattson@google.com> wrote:
> 
> On Mon, Aug 26, 2019 at 3:25 AM Liran Alon <liran.alon@oracle.com> wrote:
>> 
>> Commit cd7764fe9f73 ("KVM: x86: latch INITs while in system management mode")
>> changed code to latch INIT while vCPU is in SMM and process latched INIT
>> when leaving SMM. It left a subtle remark in commit message that similar
>> treatment should also be done while vCPU is in VMX non-root-mode.
>> 
>> However, INIT signals should actually be latched in various vCPU states:
>> (*) For both Intel and AMD, INIT signals should be latched while vCPU
>> is in SMM.
>> (*) For Intel, INIT should also be latched while vCPU is in VMX
>> operation and later processed when vCPU leaves VMX operation by
>> executing VMXOFF.
>> (*) For AMD, INIT should also be latched while vCPU runs with GIF=0
>> or in guest-mode with intercept defined on INIT signal.
>> 
>> To fix this:
>> 1) Add kvm_x86_ops->apic_init_signal_blocked() such that each CPU vendor
>> can define the various CPU states in which INIT signals should be
>> blocked and modify kvm_apic_accept_events() to use it.
>> 2) Modify vmx_check_nested_events() to check for pending INIT signal
>> while vCPU in guest-mode. If so, emualte vmexit on
>> EXIT_REASON_INIT_SIGNAL. Note that nSVM should have similar behaviour
>> but is currently left as a TODO comment to implement in the future
>> because nSVM don't yet implement svm_check_nested_events().
>> 
>> Note: Currently KVM nVMX implementation don't support VMX wait-for-SIPI
>> activity state as specified in MSR_IA32_VMX_MISC bits 6:8 exposed to
>> guest (See nested_vmx_setup_ctls_msrs()).
>> If and when support for this activity state will be implemented,
>> kvm_check_nested_events() would need to avoid emulating vmexit on
>> INIT signal in case activity-state is wait-for-SIPI. In addition,
>> kvm_apic_accept_events() would need to be modified to avoid discarding
>> SIPI in case VMX activity-state is wait-for-SIPI but instead delay
>> SIPI processing to vmx_check_nested_events() that would clear
>> pending APIC events and emulate vmexit on SIPI.
>> 
>> Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
>> Co-developed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
>> Signed-off-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
>> Signed-off-by: Liran Alon <liran.alon@oracle.com>
> 
> Thanks for doing this! I asked Marc to take a look at it earlier this
> month, but he's been swamped.

No problem. BTW Nikita is also preparing a patch on top of this to support the wait-for-SIPI activity-state,
in case that was on your TODO list as well.

> 
>> ---
>> arch/x86/include/asm/kvm_host.h |  2 ++
>> arch/x86/kvm/lapic.c            | 11 +++++++----
>> arch/x86/kvm/svm.c              | 20 ++++++++++++++++++++
>> arch/x86/kvm/vmx/nested.c       | 14 ++++++++++++++
>> arch/x86/kvm/vmx/vmx.c          |  6 ++++++
>> 5 files changed, 49 insertions(+), 4 deletions(-)
>> 
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 74e88e5edd9c..158483538181 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -1209,6 +1209,8 @@ struct kvm_x86_ops {
>>        uint16_t (*nested_get_evmcs_version)(struct kvm_vcpu *vcpu);
>> 
>>        bool (*need_emulation_on_page_fault)(struct kvm_vcpu *vcpu);
>> +
>> +       bool (*apic_init_signal_blocked)(struct kvm_vcpu *vcpu);
>> };
>> 
>> struct kvm_arch_async_pf {
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index 685d17c11461..9620fe5ce8d1 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -2702,11 +2702,14 @@ void kvm_apic_accept_events(struct kvm_vcpu *vcpu)
>>                return;
>> 
>>        /*
>> -        * INITs are latched while in SMM.  Because an SMM CPU cannot
>> -        * be in KVM_MP_STATE_INIT_RECEIVED state, just eat SIPIs
>> -        * and delay processing of INIT until the next RSM.
>> +        * INITs are latched while CPU is in specific states.
>> +        * Because a CPU cannot be in these states immediately
>> +        * after it have processed an INIT signal (and thus in
>> +        * KVM_MP_STATE_INIT_RECEIVED state), just eat SIPIs
>> +        * and delay processing of INIT until CPU leaves
>> +        * the state which latch INIT signal.
>>         */
>> -       if (is_smm(vcpu)) {
>> +       if (kvm_x86_ops->apic_init_signal_blocked(vcpu)) {
>>                WARN_ON_ONCE(vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED);
>>                if (test_bit(KVM_APIC_SIPI, &apic->pending_events))
>>                        clear_bit(KVM_APIC_SIPI, &apic->pending_events);
>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>> index 6462c386015d..0e43acf7bea4 100644
>> --- a/arch/x86/kvm/svm.c
>> +++ b/arch/x86/kvm/svm.c
>> @@ -7205,6 +7205,24 @@ static bool svm_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
>>        return false;
>> }
>> 
>> +static bool svm_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
>> +{
>> +       struct vcpu_svm *svm = to_svm(vcpu);
>> +
>> +       /*
>> +        * TODO: Last condition latch INIT signals on vCPU when
>> +        * vCPU is in guest-mode and vmcb12 defines intercept on INIT.
>> +        * To properly emulate INIT intercept, SVM should also implement
>> +        * kvm_x86_ops->check_nested_events() and process pending INIT
>> +        * signal to cause nested_svm_vmexit(). As SVM currently don't
>> +        * implement check_nested_events(), this work is delayed
>> +        * for future improvement.
>> +        */
>> +       return is_smm(vcpu) ||
>> +                  !gif_set(svm) ||
>> +                  (svm->vmcb->control.intercept & (1ULL << INTERCEPT_INIT));
>> +}
>> +
>> static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
>>        .cpu_has_kvm_support = has_svm,
>>        .disabled_by_bios = is_disabled,
>> @@ -7341,6 +7359,8 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
>>        .nested_get_evmcs_version = nested_get_evmcs_version,
>> 
>>        .need_emulation_on_page_fault = svm_need_emulation_on_page_fault,
>> +
>> +       .apic_init_signal_blocked = svm_apic_init_signal_blocked,
>> };
>> 
>> static int __init svm_init(void)
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index ced9fba32598..d655fcd04c01 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -3401,6 +3401,15 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
>>        unsigned long exit_qual;
>>        bool block_nested_events =
>>            vmx->nested.nested_run_pending || kvm_event_needs_reinjection(vcpu);
>> +       struct kvm_lapic *apic = vcpu->arch.apic;
>> +
>> +       if (lapic_in_kernel(vcpu) &&
>> +               test_bit(KVM_APIC_INIT, &apic->pending_events)) {
>> +               if (block_nested_events)
>> +                       return -EBUSY;
>> +               nested_vmx_vmexit(vcpu, EXIT_REASON_INIT_SIGNAL, 0, 0);
>> +               return 0;
>> +       }
> 
> Suppose that L0 just finished emulating an L2 instruction with
> EFLAGS.TF set. So, we've just queued up a #DB trap in
> vcpu->arch.exception. On this emulated VM-exit from L2 to L1, the
> guest pending debug exceptions field in the vmcs12 should get the
> value of vcpu->arch.exception.payload, and the queued #DB should be
> squashed.

If I understand correctly, you are discussing a case where L2 exited to L0 for
emulating some instruction while L2’s RFLAGS.TF is set. Therefore, after the x86
emulator finished emulating the L2 instruction, it queued a #DB exception.

Then, before resuming the L2 guest, some other vCPU sends an INIT signal
to this vCPU. When L0 reaches vmx_check_nested_events() it will
see the pending INIT signal and emulate a vmexit on EXIT_REASON_INIT_SIGNAL,
but nested_vmx_vmexit() will basically drop the pending #DB exception
in prepare_vmcs12() when it calls kvm_clear_exception_queue(),
because vmcs12_save_pending_event() only saves injected exceptions
(as I changed a long time ago).

I think you are right that this is a bug.
I also think it could trivially be fixed by just making sure vmx_check_nested_events()
first evaluates pending exceptions and only then pending APIC events.
However, we also want to make sure to request an “immediate-exit” in case
evaluating a pending exception requires emulation of an exit from L2 to L1.
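
Roughly, that would mean moving the new INIT check after the pending
exception check in vmx_check_nested_events(), e.g. (a sketch, not
necessarily the final ordering):

	/* evaluate pending exceptions first ... */
	if (vcpu->arch.exception.pending &&
	    nested_vmx_check_exception(vcpu, &exit_qual)) {
		if (block_nested_events)
			return -EBUSY;
		nested_vmx_inject_exception_vmexit(vcpu, exit_qual);
		return 0;
	}

	/* ... and only then the pending INIT signal */
	if (lapic_in_kernel(vcpu) &&
	    test_bit(KVM_APIC_INIT, &apic->pending_events)) {
		if (block_nested_events)
			return -EBUSY;
		nested_vmx_vmexit(vcpu, EXIT_REASON_INIT_SIGNAL, 0, 0);
		return 0;
	}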

I will add this scenario to my kvm-unit-tests that I’m about to submit.

Thanks,
-Liran

> 
>>        if (vcpu->arch.exception.pending &&
>>                nested_vmx_check_exception(vcpu, &exit_qual)) {
>> @@ -4466,7 +4475,12 @@ static int handle_vmoff(struct kvm_vcpu *vcpu)
>> {
>>        if (!nested_vmx_check_permission(vcpu))
>>                return 1;
>> +
>>        free_nested(vcpu);
>> +
>> +       /* Process a latched INIT during time CPU was in VMX operation */
>> +       kvm_make_request(KVM_REQ_EVENT, vcpu);
>> +
>>        return nested_vmx_succeed(vcpu);
>> }
>> 
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index b5b5b2e5dac5..5a1aa0640f2a 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -7479,6 +7479,11 @@ static bool vmx_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
>>        return false;
>> }
>> 
>> +static bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
>> +{
>> +       return is_smm(vcpu) || to_vmx(vcpu)->nested.vmxon;
>> +}
>> +
>> static __init int hardware_setup(void)
>> {
>>        unsigned long host_bndcfgs;
>> @@ -7803,6 +7808,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
>>        .get_vmcs12_pages = NULL,
>>        .nested_enable_evmcs = NULL,
>>        .need_emulation_on_page_fault = vmx_need_emulation_on_page_fault,
>> +       .apic_init_signal_blocked = vmx_apic_init_signal_blocked,
>> };
>> 
>> static void vmx_cleanup_l1d_flush(void)
>> --
>> 2.20.1
>>
Jim Mattson Aug. 26, 2019, 10:38 p.m. UTC | #5
On Mon, Aug 26, 2019 at 3:04 PM Liran Alon <liran.alon@oracle.com> wrote:
>
>
>
> > On 27 Aug 2019, at 0:17, Jim Mattson <jmattson@google.com> wrote:

> > Suppose that L0 just finished emulating an L2 instruction with
> > EFLAGS.TF set. So, we've just queued up a #DB trap in
> > vcpu->arch.exception. On this emulated VM-exit from L2 to L1, the
> > guest pending debug exceptions field in the vmcs12 should get the
> > value of vcpu->arch.exception.payload, and the queued #DB should be
> > squashed.
>
> If I understand correctly you are discussing a case where L2 exited to L0 for
> emulating some instruction when L2’s RFLAGS.TF is set. Therefore, after x86
> emulator finished emulating L2 instruction, it queued a #DB exception.

Right. For example, L0 really likes to emulate L2's MOV-to-CR instructions.

For added complication, what if the emulated instruction is in the
shadow of a POPSS that triggered a data breakpoint? Then there will be
existing pending debug exceptions in vmcs02 that need to be merged
with DR6.BS. (This isn't anything new with your change, though.)

> Then before resuming L2 guest, some other vCPU send an INIT signal
> to this vCPU. When L0 will reach vmx_check_nested_events() it will
> see pending INIT signal and exit on EXIT_REASON_INIT_SIGNAL
> but nested_vmx_vmexit() will basically drop pending #DB exception
> in prepare_vmcs12() when it calls kvm_clear_exception_queue()
> because vmcs12_save_pending_event() only saves injected exceptions.
> (As changed by myself a long time ago)
>
> I think you are right this is a bug.
> I also think it could trivially be fixed by just making sure vmx_check_nested_events()
> first evaluates pending exceptions and only then pending apic events.
> However, we also want to make sure to request an “immediate-exit” in case
> eval of pending exception require emulation of an exit from L2 to L1.

Hmmm. Any exception other than a #DB trap should take precedence over
the INIT. However, the INIT takes precedence over a #DB trap. But
maybe you can fudge the ordering, because there is no way for the
guest to tell which came first?

What about a single-step trap on VMXOFF, with the INIT latched? In
that case, the guest could tell that the INIT was latched before the
VMXOFF, so the INIT must take precedence. Do we get that ordering
right?
Paolo Bonzini Sept. 11, 2019, 4:21 p.m. UTC | #6
On 26/08/19 20:26, Liran Alon wrote:
> An alternative could be to just add a flag to events->flags that modifies
> behaviour to treat events->smi.latched_init as just events->latched_init.
> But I prefer the previous option.

Why would you even need the flag?  I think you only need to move the "if
(lapic_in_kernel(vcpu))" outside, under "if (events->flags &
KVM_VCPUEVENT_VALID_SMM)".

In fact, I think it would make sense anyway to clear KVM_APIC_SIPI in
kvm_vcpu_ioctl_x86_set_vcpu_events (i.e. clear apic->pending_events and
then possibly set KVM_APIC_INIT if events->smi.latched_init is true).
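
A minimal sketch of what that could look like in
kvm_vcpu_ioctl_x86_set_vcpu_events() (illustration only; the surrounding
code is abbreviated and the exact structure is an assumption):

	if (events->flags & KVM_VCPUEVENT_VALID_SMM) {
		...
		if (lapic_in_kernel(vcpu)) {
			/* start from a clean slate, i.e. drop KVM_APIC_SIPI */
			vcpu->arch.apic->pending_events = 0;
			if (events->smi.latched_init)
				set_bit(KVM_APIC_INIT,
					&vcpu->arch.apic->pending_events);
		}
	}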

Paolo
Paolo Bonzini Sept. 11, 2019, 4:23 p.m. UTC | #7
On 26/08/19 12:24, Liran Alon wrote:
>  	/*
> -	 * INITs are latched while in SMM.  Because an SMM CPU cannot
> -	 * be in KVM_MP_STATE_INIT_RECEIVED state, just eat SIPIs
> -	 * and delay processing of INIT until the next RSM.
> +	 * INITs are latched while CPU is in specific states.
> +	 * Because a CPU cannot be in these states immediately
> +	 * after it have processed an INIT signal (and thus in
> +	 * KVM_MP_STATE_INIT_RECEIVED state), just eat SIPIs
> +	 * and delay processing of INIT until CPU leaves
> +	 * the state which latch INIT signal.
>  	 */
> -	if (is_smm(vcpu)) {
> +	if (kvm_x86_ops->apic_init_signal_blocked(vcpu)) {

I'd prefer keeping is_smm(vcpu) here, since that is not vendor-specific.

Together with some edits to the comments, this is what I got.
Let me know if you prefer to have the latched_init changes
on top, or you'd like to post v2 with everything.

Thanks,

Paolo

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c4f271a9b306..b523949a8df8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1209,6 +1209,8 @@ struct kvm_x86_ops {
 	uint16_t (*nested_get_evmcs_version)(struct kvm_vcpu *vcpu);
 
 	bool (*need_emulation_on_page_fault)(struct kvm_vcpu *vcpu);
+
+	bool (*apic_init_signal_blocked)(struct kvm_vcpu *vcpu);
 };
 
 struct kvm_arch_async_pf {
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 559e1c4c0832..dbbe4781fbb2 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2706,11 +2706,14 @@ void kvm_apic_accept_events(struct kvm_vcpu *vcpu)
 		return;
 
 	/*
-	 * INITs are latched while in SMM.  Because an SMM CPU cannot
-	 * be in KVM_MP_STATE_INIT_RECEIVED state, just eat SIPIs
-	 * and delay processing of INIT until the next RSM.
+	 * INITs are latched while CPU is in specific states
+	 * (SMM, VMX non-root mode, SVM with GIF=0).
+	 * Because a CPU cannot be in these states immediately
+	 * after it has processed an INIT signal (and thus in
+	 * KVM_MP_STATE_INIT_RECEIVED state), just eat SIPIs
+	 * and leave the INIT pending.
 	 */
-	if (is_smm(vcpu)) {
+	if (is_smm(vcpu) || kvm_x86_ops->apic_init_signal_blocked(vcpu)) {
 		WARN_ON_ONCE(vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED);
 		if (test_bit(KVM_APIC_SIPI, &apic->pending_events))
 			clear_bit(KVM_APIC_SIPI, &apic->pending_events);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 2854aafc489e..d24050b647c7 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -7170,6 +7170,21 @@ static bool svm_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
 	return false;
 }
 
+static bool svm_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	/*
+	 * TODO: Last condition latch INIT signals on vCPU when
+	 * vCPU is in guest-mode and vmcb12 defines intercept on INIT.
+	 * To properly emulate the INIT intercept, SVM should implement
+	 * kvm_x86_ops->check_nested_events() and call nested_svm_vmexit()
+	 * there if an INIT signal is pending.
+	 */
+	return !gif_set(svm) ||
+		   (svm->vmcb->control.intercept & (1ULL << INTERCEPT_INIT));
+}
+
 static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
 	.cpu_has_kvm_support = has_svm,
 	.disabled_by_bios = is_disabled,
@@ -7306,6 +7321,8 @@ static bool svm_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
 	.nested_get_evmcs_version = nested_get_evmcs_version,
 
 	.need_emulation_on_page_fault = svm_need_emulation_on_page_fault,
+
+	.apic_init_signal_blocked = svm_apic_init_signal_blocked,
 };
 
 static int __init svm_init(void)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index ad2453317c4b..6ce83c602e7f 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3409,6 +3409,15 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
 	unsigned long exit_qual;
 	bool block_nested_events =
 	    vmx->nested.nested_run_pending || kvm_event_needs_reinjection(vcpu);
+	struct kvm_lapic *apic = vcpu->arch.apic;
+
+	if (lapic_in_kernel(vcpu) &&
+		test_bit(KVM_APIC_INIT, &apic->pending_events)) {
+		if (block_nested_events)
+			return -EBUSY;
+		nested_vmx_vmexit(vcpu, EXIT_REASON_INIT_SIGNAL, 0, 0);
+		return 0;
+	}
 
 	if (vcpu->arch.exception.pending &&
 		nested_vmx_check_exception(vcpu, &exit_qual)) {
@@ -4470,7 +4479,12 @@ static int handle_vmoff(struct kvm_vcpu *vcpu)
 {
 	if (!nested_vmx_check_permission(vcpu))
 		return 1;
+
 	free_nested(vcpu);
+
+	/* Process a latched INIT during time CPU was in VMX operation */
+	kvm_make_request(KVM_REQ_EVENT, vcpu);
+
 	return nested_vmx_succeed(vcpu);
 }
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 99f52f8c969a..73bf9a2e6fb6 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7470,6 +7470,11 @@ static bool vmx_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
 	return false;
 }
 
+static bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
+{
+	return to_vmx(vcpu)->nested.vmxon;
+}
+
 static __init int hardware_setup(void)
 {
 	unsigned long host_bndcfgs;
@@ -7794,6 +7799,7 @@ static __exit void hardware_unsetup(void)
 	.get_vmcs12_pages = NULL,
 	.nested_enable_evmcs = NULL,
 	.need_emulation_on_page_fault = vmx_need_emulation_on_page_fault,
+	.apic_init_signal_blocked = vmx_apic_init_signal_blocked,
 };
 
 static void vmx_cleanup_l1d_flush(void)
Liran Alon Sept. 11, 2019, 4:42 p.m. UTC | #8
> On 11 Sep 2019, at 19:23, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 26/08/19 12:24, Liran Alon wrote:
>> 	/*
>> -	 * INITs are latched while in SMM.  Because an SMM CPU cannot
>> -	 * be in KVM_MP_STATE_INIT_RECEIVED state, just eat SIPIs
>> -	 * and delay processing of INIT until the next RSM.
>> +	 * INITs are latched while CPU is in specific states.
>> +	 * Because a CPU cannot be in these states immediately
>> +	 * after it have processed an INIT signal (and thus in
>> +	 * KVM_MP_STATE_INIT_RECEIVED state), just eat SIPIs
>> +	 * and delay processing of INIT until CPU leaves
>> +	 * the state which latch INIT signal.
>> 	 */
>> -	if (is_smm(vcpu)) {
>> +	if (kvm_x86_ops->apic_init_signal_blocked(vcpu)) {
> 
> I'd prefer keeping is_smm(vcpu) here, since that is not vendor-specific.
> 
> Together with some edits to the comments, this is what I got.
> Let me know if you prefer to have the latched_init changes
> on top, or you'd like to post v2 with everything.
> 
> Thanks,
> 
> Paolo

The code change below seems good to me.
(The only thing you changed is moving is_smm(vcpu) to the non-vendor-specific part and rephrasing the comments, right?)

I plan to submit a patch for the latched_init changes soon (and a respective kvm-unit-test which I already have ready and working).
We also have another small patch on top to add support for the wait-for-SIPI activity-state, which I will submit soon as well.
So I’m fine with either option: you first apply this change and then I submit another patch on top,
or I just submit a new v2 patch series with a new version of this patch and the rest of the changes.
I don’t have a strong preference. Whatever you prefer is fine by me. :)

-Liran

> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index c4f271a9b306..b523949a8df8 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1209,6 +1209,8 @@ struct kvm_x86_ops {
> 	uint16_t (*nested_get_evmcs_version)(struct kvm_vcpu *vcpu);
> 
> 	bool (*need_emulation_on_page_fault)(struct kvm_vcpu *vcpu);
> +
> +	bool (*apic_init_signal_blocked)(struct kvm_vcpu *vcpu);
> };
> 
> struct kvm_arch_async_pf {
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 559e1c4c0832..dbbe4781fbb2 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2706,11 +2706,14 @@ void kvm_apic_accept_events(struct kvm_vcpu *vcpu)
> 		return;
> 
> 	/*
> -	 * INITs are latched while in SMM.  Because an SMM CPU cannot
> -	 * be in KVM_MP_STATE_INIT_RECEIVED state, just eat SIPIs
> -	 * and delay processing of INIT until the next RSM.
> +	 * INITs are latched while CPU is in specific states
> +	 * (SMM, VMX non-root mode, SVM with GIF=0).
> +	 * Because a CPU cannot be in these states immediately
> +	 * after it has processed an INIT signal (and thus in
> +	 * KVM_MP_STATE_INIT_RECEIVED state), just eat SIPIs
> +	 * and leave the INIT pending.
> 	 */
> -	if (is_smm(vcpu)) {
> +	if (is_smm(vcpu) || kvm_x86_ops->apic_init_signal_blocked(vcpu)) {
> 		WARN_ON_ONCE(vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED);
> 		if (test_bit(KVM_APIC_SIPI, &apic->pending_events))
> 			clear_bit(KVM_APIC_SIPI, &apic->pending_events);
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 2854aafc489e..d24050b647c7 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -7170,6 +7170,21 @@ static bool svm_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
> 	return false;
> }
> 
> +static bool svm_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_svm *svm = to_svm(vcpu);
> +
> +	/*
> +	 * TODO: Last condition latch INIT signals on vCPU when
> +	 * vCPU is in guest-mode and vmcb12 defines intercept on INIT.
> +	 * To properly emulate the INIT intercept, SVM should implement
> +	 * kvm_x86_ops->check_nested_events() and call nested_svm_vmexit()
> +	 * there if an INIT signal is pending.
> +	 */
> +	return !gif_set(svm) ||
> +		   (svm->vmcb->control.intercept & (1ULL << INTERCEPT_INIT));
> +}
> +
> static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
> 	.cpu_has_kvm_support = has_svm,
> 	.disabled_by_bios = is_disabled,
> @@ -7306,6 +7321,8 @@ static bool svm_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
> 	.nested_get_evmcs_version = nested_get_evmcs_version,
> 
> 	.need_emulation_on_page_fault = svm_need_emulation_on_page_fault,
> +
> +	.apic_init_signal_blocked = svm_apic_init_signal_blocked,
> };
> 
> static int __init svm_init(void)
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index ad2453317c4b..6ce83c602e7f 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -3409,6 +3409,15 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
> 	unsigned long exit_qual;
> 	bool block_nested_events =
> 	    vmx->nested.nested_run_pending || kvm_event_needs_reinjection(vcpu);
> +	struct kvm_lapic *apic = vcpu->arch.apic;
> +
> +	if (lapic_in_kernel(vcpu) &&
> +		test_bit(KVM_APIC_INIT, &apic->pending_events)) {
> +		if (block_nested_events)
> +			return -EBUSY;
> +		nested_vmx_vmexit(vcpu, EXIT_REASON_INIT_SIGNAL, 0, 0);
> +		return 0;
> +	}
> 
> 	if (vcpu->arch.exception.pending &&
> 		nested_vmx_check_exception(vcpu, &exit_qual)) {
> @@ -4470,7 +4479,12 @@ static int handle_vmoff(struct kvm_vcpu *vcpu)
> {
> 	if (!nested_vmx_check_permission(vcpu))
> 		return 1;
> +
> 	free_nested(vcpu);
> +
> +	/* Process a latched INIT during time CPU was in VMX operation */
> +	kvm_make_request(KVM_REQ_EVENT, vcpu);
> +
> 	return nested_vmx_succeed(vcpu);
> }
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 99f52f8c969a..73bf9a2e6fb6 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7470,6 +7470,11 @@ static bool vmx_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
> 	return false;
> }
> 
> +static bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
> +{
> +	return to_vmx(vcpu)->nested.vmxon;
> +}
> +
> static __init int hardware_setup(void)
> {
> 	unsigned long host_bndcfgs;
> @@ -7794,6 +7799,7 @@ static __exit void hardware_unsetup(void)
> 	.get_vmcs12_pages = NULL,
> 	.nested_enable_evmcs = NULL,
> 	.need_emulation_on_page_fault = vmx_need_emulation_on_page_fault,
> +	.apic_init_signal_blocked = vmx_apic_init_signal_blocked,
> };
> 
> static void vmx_cleanup_l1d_flush(void)
Paolo Bonzini Sept. 11, 2019, 4:48 p.m. UTC | #9
On 11/09/19 18:42, Liran Alon wrote:
> The code change below seems good to me. (The only thing you changed
> is moving is_smm(vcpu) to the non-vendor specific part and rephrased
> comments right?)

Yes.

> I plan to submit a patch for the latched_init changes soon. (And
> respective kvm-unit-test which I already have ready and working).

Very good, thanks.

Paolo
Liran Alon Nov. 10, 2019, 12:23 p.m. UTC | #10
Sorry for the delay in handling this.
I'm now preparing a bunch of KVM commits to submit. :)

> On 11 Sep 2019, at 19:21, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 26/08/19 20:26, Liran Alon wrote:
>> An alternative could be to just add a flag to events->flags that modifies
>> behaviour to treat events->smi.latched_init as just events->latched_init.
>> But I prefer the previous option.
> 
> Why would you even need the flag?  I think you only need to move the "if
> (lapic_in_kernel(vcpu)) outside, under "if (events->flags &
> KVM_VCPUEVENT_VALID_SMM)”.

Without an additional flag, the events->smi.latched_init field will be evaluated
by kvm_vcpu_ioctl_x86_set_vcpu_events() even when userspace hasn’t
specified KVM_VCPUEVENT_VALID_SMM, which in theory would break
compatibility with userspace that doesn’t specify this flag.

If you are ok with breaking this compatibility, I will avoid adding an opt-in flag
that specifies this field should be evaluated even when KVM_VCPUEVENT_VALID_SMM
is not specified.

Because we are lucky and "latched_init" is the last field in "struct smi" inside "struct kvm_vcpu_events",
I will just move the "latched_init" field outside of "struct smi", just before the "reserved" field.
This would keep binary format compatibility while making the KVM code clearer.

> 
> In fact, I think it would make sense anyway to clear KVM_APIC_SIPI in
> kvm_vcpu_ioctl_x86_set_vcpu_events (i.e. clear apic->pending_events and
> then possibly set KVM_APIC_INIT if events->smi.latched_init is true).

Agree. Will do this as well.

-Liran

> 
> Paolo
Liran Alon Nov. 10, 2019, 12:57 p.m. UTC | #11
> On 10 Nov 2019, at 14:23, Liran Alon <liran.alon@oracle.com> wrote:
> 
> Because we are lucky and “latched_init" was last field in “struct smi” inside “struct kvm_vcpu_events”,
> I will just move “latched_init” field outside of “struct smi” just before the “reserved” field.
> Which would keep binary format compatibility while allowing making KVM code more clear.

Forget about this. Because of padding this will cause issues.
And besides, it would break with any additional field added to "struct smi".
So I will keep the field in "struct smi" even though it isn’t specifically related to SMM anymore…

-Liran

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 74e88e5edd9c..158483538181 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1209,6 +1209,8 @@  struct kvm_x86_ops {
 	uint16_t (*nested_get_evmcs_version)(struct kvm_vcpu *vcpu);
 
 	bool (*need_emulation_on_page_fault)(struct kvm_vcpu *vcpu);
+
+	bool (*apic_init_signal_blocked)(struct kvm_vcpu *vcpu);
 };
 
 struct kvm_arch_async_pf {
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 685d17c11461..9620fe5ce8d1 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2702,11 +2702,14 @@  void kvm_apic_accept_events(struct kvm_vcpu *vcpu)
 		return;
 
 	/*
-	 * INITs are latched while in SMM.  Because an SMM CPU cannot
-	 * be in KVM_MP_STATE_INIT_RECEIVED state, just eat SIPIs
-	 * and delay processing of INIT until the next RSM.
+	 * INITs are latched while CPU is in specific states.
+	 * Because a CPU cannot be in these states immediately
+	 * after it have processed an INIT signal (and thus in
+	 * KVM_MP_STATE_INIT_RECEIVED state), just eat SIPIs
+	 * and delay processing of INIT until CPU leaves
+	 * the state which latch INIT signal.
 	 */
-	if (is_smm(vcpu)) {
+	if (kvm_x86_ops->apic_init_signal_blocked(vcpu)) {
 		WARN_ON_ONCE(vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED);
 		if (test_bit(KVM_APIC_SIPI, &apic->pending_events))
 			clear_bit(KVM_APIC_SIPI, &apic->pending_events);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 6462c386015d..0e43acf7bea4 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -7205,6 +7205,24 @@  static bool svm_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
 	return false;
 }
 
+static bool svm_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	/*
+	 * TODO: Last condition latch INIT signals on vCPU when
+	 * vCPU is in guest-mode and vmcb12 defines intercept on INIT.
+	 * To properly emulate INIT intercept, SVM should also implement
+	 * kvm_x86_ops->check_nested_events() and process pending INIT
+	 * signal to cause nested_svm_vmexit(). As SVM currently don't
+	 * implement check_nested_events(), this work is delayed
+	 * for future improvement.
+	 */
+	return is_smm(vcpu) ||
+		   !gif_set(svm) ||
+		   (svm->vmcb->control.intercept & (1ULL << INTERCEPT_INIT));
+}
+
 static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
 	.cpu_has_kvm_support = has_svm,
 	.disabled_by_bios = is_disabled,
@@ -7341,6 +7359,8 @@  static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
 	.nested_get_evmcs_version = nested_get_evmcs_version,
 
 	.need_emulation_on_page_fault = svm_need_emulation_on_page_fault,
+
+	.apic_init_signal_blocked = svm_apic_init_signal_blocked,
 };
 
 static int __init svm_init(void)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index ced9fba32598..d655fcd04c01 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3401,6 +3401,15 @@  static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
 	unsigned long exit_qual;
 	bool block_nested_events =
 	    vmx->nested.nested_run_pending || kvm_event_needs_reinjection(vcpu);
+	struct kvm_lapic *apic = vcpu->arch.apic;
+
+	if (lapic_in_kernel(vcpu) &&
+		test_bit(KVM_APIC_INIT, &apic->pending_events)) {
+		if (block_nested_events)
+			return -EBUSY;
+		nested_vmx_vmexit(vcpu, EXIT_REASON_INIT_SIGNAL, 0, 0);
+		return 0;
+	}
 
 	if (vcpu->arch.exception.pending &&
 		nested_vmx_check_exception(vcpu, &exit_qual)) {
@@ -4466,7 +4475,12 @@  static int handle_vmoff(struct kvm_vcpu *vcpu)
 {
 	if (!nested_vmx_check_permission(vcpu))
 		return 1;
+
 	free_nested(vcpu);
+
+	/* Process a latched INIT during time CPU was in VMX operation */
+	kvm_make_request(KVM_REQ_EVENT, vcpu);
+
 	return nested_vmx_succeed(vcpu);
 }
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b5b5b2e5dac5..5a1aa0640f2a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7479,6 +7479,11 @@  static bool vmx_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
 	return false;
 }
 
+static bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
+{
+	return is_smm(vcpu) || to_vmx(vcpu)->nested.vmxon;
+}
+
 static __init int hardware_setup(void)
 {
 	unsigned long host_bndcfgs;
@@ -7803,6 +7808,7 @@  static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.get_vmcs12_pages = NULL,
 	.nested_enable_evmcs = NULL,
 	.need_emulation_on_page_fault = vmx_need_emulation_on_page_fault,
+	.apic_init_signal_blocked = vmx_apic_init_signal_blocked,
 };
 
 static void vmx_cleanup_l1d_flush(void)