
[v7,26/26] KVM: nVMX: Enable CET support for nested guest

Message ID 20231124055330.138870-27-weijiang.yang@intel.com (mailing list archive)
State New, archived
Series Enable CET Virtualization

Commit Message

Yang, Weijiang Nov. 24, 2023, 5:53 a.m. UTC
Set up CET MSRs, related VM_ENTRY/EXIT control bits and fixed CR4 setting
to enable CET for nested VM.

Note, generally the L1 VMM only touches the CET VMCS fields when live migration
happens or when vmcs_{read,write}() targets the fields, so the fields only need
to be synced in these "rare" cases. This patch only considers the case where the
L1 VMM has set VM_ENTRY_LOAD_CET_STATE in its VMCS vm_entry_controls, as that is
the common usage.

Suggested-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
---
 arch/x86/kvm/vmx/nested.c | 48 +++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/vmx/vmcs12.c |  6 +++++
 arch/x86/kvm/vmx/vmcs12.h | 14 +++++++++++-
 arch/x86/kvm/vmx/vmx.c    |  2 ++
 4 files changed, 67 insertions(+), 3 deletions(-)

Comments

Maxim Levitsky Nov. 30, 2023, 5:53 p.m. UTC | #1
On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> Set up CET MSRs, related VM_ENTRY/EXIT control bits and fixed CR4 setting
> to enable CET for nested VM.
> 
> Note, generally L1 VMM only touches CET VMCS fields when live migration or
> vmcs_{read,write}() to the fields happens, so the fields only need to be
> synced in these "rare" cases.
To be honest we can't assume anything about L1, but what we can assume
is that if a vmcs12 field is not shadowed, then L1 vmwrite/vmread will
always be intercepted, and the fields can be synced during that interception.
However, I studied this area long ago and I might be mistaken.


>  And here only considers the case that L1 VMM
> has set VM_ENTRY_LOAD_CET_STATE in its VMCS vm_entry_controls as it's the
> common usage.
> 
> Suggested-by: Chao Gao <chao.gao@intel.com>
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 48 +++++++++++++++++++++++++++++++++++++--
>  arch/x86/kvm/vmx/vmcs12.c |  6 +++++
>  arch/x86/kvm/vmx/vmcs12.h | 14 +++++++++++-
>  arch/x86/kvm/vmx/vmx.c    |  2 ++
>  4 files changed, 67 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index d8c32682ca76..965173650542 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -660,6 +660,28 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
>  	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>  					 MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
>  
> +	/* Pass CET MSRs to nested VM if L0 and L1 are set to pass-through. */
> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> +					 MSR_IA32_U_CET, MSR_TYPE_RW);
> +
> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> +					 MSR_IA32_S_CET, MSR_TYPE_RW);
> +
> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> +					 MSR_IA32_PL0_SSP, MSR_TYPE_RW);
> +
> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> +					 MSR_IA32_PL1_SSP, MSR_TYPE_RW);
> +
> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> +					 MSR_IA32_PL2_SSP, MSR_TYPE_RW);
> +
> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> +					 MSR_IA32_PL3_SSP, MSR_TYPE_RW);
> +
> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> +					 MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW);
> +
>  	kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
>  
>  	vmx->nested.force_msr_bitmap_recalc = false;
> @@ -2469,6 +2491,18 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
>  		if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
>  		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
>  			vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
> +
> +		if (vmx->nested.nested_run_pending &&

I don't think that nested.nested_run_pending check is needed. 
prepare_vmcs02_rare is not going to be called unless the nested run is pending.

> +		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE)) {
> +			if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
> +				vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
> +				vmcs_writel(GUEST_INTR_SSP_TABLE,
> +					    vmcs12->guest_ssp_tbl);
> +			}
> +			if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
> +			    guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
> +				vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);
> +		}
>  	}
>  
>  	if (nested_cpu_has_xsaves(vmcs12))
> @@ -4300,6 +4334,15 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
>  	vmcs12->guest_pending_dbg_exceptions =
>  		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
>  
> +	if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
> +		vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
> +		vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
> +	}
> +	if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
> +	    guest_can_use(&vmx->vcpu, X86_FEATURE_IBT)) {
> +		vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);
> +	}

The above code should be conditional on VM_ENTRY_LOAD_CET_STATE - if the guest (L2) state
was loaded, then it must be updated on exit - this is usually how VMX works.


Also I don't see any mention of usage of VM_EXIT_LOAD_CET_STATE, which, if set,
should reset the L1 CET state to the values in 'host_s_cet/host_ssp/host_ssp_tbl'.
(This is also a common theme in VMX - host state is reset to the values that the hypervisor
sets in the VMCS, and the hypervisor must take care to update these fields itself.)
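
For illustration only, a minimal sketch of the host-state restore being asked for, assuming it is called from load_vmcs12_host_state() after vmcs01 has been loaded; the helper name and the guest_can_use() checks are assumptions following the style of this patch, not the actual fix:

static void nested_vmx_load_host_cet_state(struct kvm_vcpu *vcpu,
					   struct vmcs12 *vmcs12)
{
	/* Only touch L1's CET state if vmcs12 asked for it to be loaded. */
	if (!(vmcs12->vm_exit_controls & VM_EXIT_LOAD_CET_STATE))
		return;

	/* vmcs01 is current here, so the GUEST_* fields describe L1. */
	if (guest_can_use(vcpu, X86_FEATURE_SHSTK)) {
		vmcs_writel(GUEST_SSP, vmcs12->host_ssp);
		vmcs_writel(GUEST_INTR_SSP_TABLE, vmcs12->host_ssp_tbl);
	}
	if (guest_can_use(vcpu, X86_FEATURE_SHSTK) ||
	    guest_can_use(vcpu, X86_FEATURE_IBT))
		vmcs_writel(GUEST_S_CET, vmcs12->host_s_cet);
}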


As a rule of thumb, if you add a field to vmcs12, you should use it somewhere,
and you should never use it unconditionally, as almost always its use
depends on entry or exit controls.

Same is true for entry/exit/execution controls - if you add one, you almost
always have to use it somewhere.


Best regards,
	Maxim Levitsky

> +
>  	vmx->nested.need_sync_vmcs02_to_vmcs12_rare = false;
>  }
>  
> @@ -6798,7 +6841,7 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
>  		VM_EXIT_HOST_ADDR_SPACE_SIZE |
>  #endif
>  		VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
> -		VM_EXIT_CLEAR_BNDCFGS;
> +		VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_CET_STATE;
>  	msrs->exit_ctls_high |=
>  		VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
>  		VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
> @@ -6820,7 +6863,8 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
>  #ifdef CONFIG_X86_64
>  		VM_ENTRY_IA32E_MODE |
>  #endif
> -		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
> +		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
> +		VM_ENTRY_LOAD_CET_STATE;
>  	msrs->entry_ctls_high |=
>  		(VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
>  		 VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
> diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
> index 106a72c923ca..4233b5ca9461 100644
> --- a/arch/x86/kvm/vmx/vmcs12.c
> +++ b/arch/x86/kvm/vmx/vmcs12.c
> @@ -139,6 +139,9 @@ const unsigned short vmcs12_field_offsets[] = {
>  	FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
>  	FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
>  	FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
> +	FIELD(GUEST_S_CET, guest_s_cet),
> +	FIELD(GUEST_SSP, guest_ssp),
> +	FIELD(GUEST_INTR_SSP_TABLE, guest_ssp_tbl),
>  	FIELD(HOST_CR0, host_cr0),
>  	FIELD(HOST_CR3, host_cr3),
>  	FIELD(HOST_CR4, host_cr4),
> @@ -151,5 +154,8 @@ const unsigned short vmcs12_field_offsets[] = {
>  	FIELD(HOST_IA32_SYSENTER_EIP, host_ia32_sysenter_eip),
>  	FIELD(HOST_RSP, host_rsp),
>  	FIELD(HOST_RIP, host_rip),
> +	FIELD(HOST_S_CET, host_s_cet),
> +	FIELD(HOST_SSP, host_ssp),
> +	FIELD(HOST_INTR_SSP_TABLE, host_ssp_tbl),
>  };
>  const unsigned int nr_vmcs12_fields = ARRAY_SIZE(vmcs12_field_offsets);
> diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
> index 01936013428b..3884489e7f7e 100644
> --- a/arch/x86/kvm/vmx/vmcs12.h
> +++ b/arch/x86/kvm/vmx/vmcs12.h
> @@ -117,7 +117,13 @@ struct __packed vmcs12 {
>  	natural_width host_ia32_sysenter_eip;
>  	natural_width host_rsp;
>  	natural_width host_rip;
> -	natural_width paddingl[8]; /* room for future expansion */
> +	natural_width host_s_cet;
> +	natural_width host_ssp;
> +	natural_width host_ssp_tbl;
> +	natural_width guest_s_cet;
> +	natural_width guest_ssp;
> +	natural_width guest_ssp_tbl;
> +	natural_width paddingl[2]; /* room for future expansion */
>  	u32 pin_based_vm_exec_control;
>  	u32 cpu_based_vm_exec_control;
>  	u32 exception_bitmap;
> @@ -292,6 +298,12 @@ static inline void vmx_check_vmcs12_offsets(void)
>  	CHECK_OFFSET(host_ia32_sysenter_eip, 656);
>  	CHECK_OFFSET(host_rsp, 664);
>  	CHECK_OFFSET(host_rip, 672);
> +	CHECK_OFFSET(host_s_cet, 680);
> +	CHECK_OFFSET(host_ssp, 688);
> +	CHECK_OFFSET(host_ssp_tbl, 696);
> +	CHECK_OFFSET(guest_s_cet, 704);
> +	CHECK_OFFSET(guest_ssp, 712);
> +	CHECK_OFFSET(guest_ssp_tbl, 720);
>  	CHECK_OFFSET(pin_based_vm_exec_control, 744);
>  	CHECK_OFFSET(cpu_based_vm_exec_control, 748);
>  	CHECK_OFFSET(exception_bitmap, 752);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index a1aae8709939..947028ff2e25 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7734,6 +7734,8 @@ static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
>  	cr4_fixed1_update(X86_CR4_PKE,        ecx, feature_bit(PKU));
>  	cr4_fixed1_update(X86_CR4_UMIP,       ecx, feature_bit(UMIP));
>  	cr4_fixed1_update(X86_CR4_LA57,       ecx, feature_bit(LA57));
> +	cr4_fixed1_update(X86_CR4_CET,	      ecx, feature_bit(SHSTK));
> +	cr4_fixed1_update(X86_CR4_CET,	      edx, feature_bit(IBT));
>  
>  #undef cr4_fixed1_update
>  }
Yang, Weijiang Dec. 4, 2023, 8:50 a.m. UTC | #2
On 12/1/2023 1:53 AM, Maxim Levitsky wrote:
> On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
>> Set up CET MSRs, related VM_ENTRY/EXIT control bits and fixed CR4 setting
>> to enable CET for nested VM.
>>
>> Note, generally L1 VMM only touches CET VMCS fields when live migration or
>> vmcs_{read,write}() to the fields happens, so the fields only need to be
>> synced in these "rare" cases.
> To be honest we can't assume anything about L1, but what we can assume
>
> is that if vmcs12 field is not shadowed, then L1 vmwrite/vmread will
> be always intercepted and during the interception the fields can be synced,
> however I studied this area long ago and I might be mistaken.

The changelog wording failed to express what I meant to say:
vmcs12 and vmcs02 should be synced so that they reflect the correct CET states L1 or L2 is expected
to see. In the live-migration case, the nested CET states should also be synced between L1 and L2 via the
control structures.

Will reword them, thanks for pointing it out!
>>   And here only considers the case that L1 VMM
>> has set VM_ENTRY_LOAD_CET_STATE in its VMCS vm_entry_controls as it's the
>> common usage.
>>
>> Suggested-by: Chao Gao <chao.gao@intel.com>
>> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
>> ---
>>   arch/x86/kvm/vmx/nested.c | 48 +++++++++++++++++++++++++++++++++++++--
>>   arch/x86/kvm/vmx/vmcs12.c |  6 +++++
>>   arch/x86/kvm/vmx/vmcs12.h | 14 +++++++++++-
>>   arch/x86/kvm/vmx/vmx.c    |  2 ++
>>   4 files changed, 67 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index d8c32682ca76..965173650542 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -660,6 +660,28 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
>>   	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>>   					 MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
>>   
>> +	/* Pass CET MSRs to nested VM if L0 and L1 are set to pass-through. */
>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> +					 MSR_IA32_U_CET, MSR_TYPE_RW);
>> +
>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> +					 MSR_IA32_S_CET, MSR_TYPE_RW);
>> +
>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> +					 MSR_IA32_PL0_SSP, MSR_TYPE_RW);
>> +
>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> +					 MSR_IA32_PL1_SSP, MSR_TYPE_RW);
>> +
>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> +					 MSR_IA32_PL2_SSP, MSR_TYPE_RW);
>> +
>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> +					 MSR_IA32_PL3_SSP, MSR_TYPE_RW);
>> +
>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> +					 MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW);
>> +
>>   	kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
>>   
>>   	vmx->nested.force_msr_bitmap_recalc = false;
>> @@ -2469,6 +2491,18 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
>>   		if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
>>   		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
>>   			vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
>> +
>> +		if (vmx->nested.nested_run_pending &&
> I don't think that nested.nested_run_pending check is needed.
> prepare_vmcs02_rare is not going to be called unless the nested run is pending.

But there are other paths that call prepare_vmcs02_rare(), e.g., vmx_set_nested_state() -> nested_vmx_enter_non_root_mode() -> prepare_vmcs02_rare(), especially when L1 instead of L2 was running. In that case nested.nested_run_pending == false,
and we don't need to update vmcs02's fields until L2 is actually resumed.

>> +		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE)) {
>> +			if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
>> +				vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
>> +				vmcs_writel(GUEST_INTR_SSP_TABLE,
>> +					    vmcs12->guest_ssp_tbl);
>> +			}
>> +			if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
>> +			    guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
>> +				vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);
>> +		}
>>   	}
>>   
>>   	if (nested_cpu_has_xsaves(vmcs12))
>> @@ -4300,6 +4334,15 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
>>   	vmcs12->guest_pending_dbg_exceptions =
>>   		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
>>   
>> +	if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
>> +		vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
>> +		vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
>> +	}
>> +	if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
>> +	    guest_can_use(&vmx->vcpu, X86_FEATURE_IBT)) {
>> +		vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);
>> +	}
> The above code should be conditional on VM_ENTRY_LOAD_CET_STATE - if the guest (L2) state
> was loaded, then it must be updated on exit - this is usually how VMX works.

I think this is not about L2's VM_ENTRY_LOAD_CET_STATE; that is handled in prepare_vmcs02_rare(). IIUC, the guest registers are saved into the VMCS fields unconditionally when a VM-exit happens,
so these fields for the L2 guest should be synced to L1 unconditionally.

> Also I don't see any mention of usage of VM_EXIT_LOAD_CET_STATE, which if set,
> should reset the L1 CET state to values in 'host_s_cet/host_ssp/host_ssp_tbl'
> (This is also a common theme in VMX - host state is reset to values that the hypervisor
> sets in VMCS, and the hypervisor must care to update these fields itself).

Yes, the host CET states for L1 also should be synced, I'll add the missing part, thanks!

> As a rule of thumb, if you add a field to vmcs12, you should use it somewhere,
> and you should never use it unconditionally, as almost always its use
> depends on entry or exit controls.
>
> Same is true for entry/exit/execution controls - if you add one, you almost
> always have to use it somewhere.

I'll double check if anything is lost in various cases, thanks!

> Best regards,
> 	Maxim Levitsky
>
>> +
>>   	vmx->nested.need_sync_vmcs02_to_vmcs12_rare = false;
>>   }
>>   
>> @@ -6798,7 +6841,7 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
>>   		VM_EXIT_HOST_ADDR_SPACE_SIZE |
>>   #endif
>>   		VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
>> -		VM_EXIT_CLEAR_BNDCFGS;
>> +		VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_CET_STATE;
>>   	msrs->exit_ctls_high |=
>>   		VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
>>   		VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
>> @@ -6820,7 +6863,8 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
>>   #ifdef CONFIG_X86_64
>>   		VM_ENTRY_IA32E_MODE |
>>   #endif
>> -		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
>> +		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
>> +		VM_ENTRY_LOAD_CET_STATE;
>>   	msrs->entry_ctls_high |=
>>   		(VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
>>   		 VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
>> diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
>> index 106a72c923ca..4233b5ca9461 100644
>> --- a/arch/x86/kvm/vmx/vmcs12.c
>> +++ b/arch/x86/kvm/vmx/vmcs12.c
>> @@ -139,6 +139,9 @@ const unsigned short vmcs12_field_offsets[] = {
>>   	FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
>>   	FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
>>   	FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
>> +	FIELD(GUEST_S_CET, guest_s_cet),
>> +	FIELD(GUEST_SSP, guest_ssp),
>> +	FIELD(GUEST_INTR_SSP_TABLE, guest_ssp_tbl),
>>   	FIELD(HOST_CR0, host_cr0),
>>   	FIELD(HOST_CR3, host_cr3),
>>   	FIELD(HOST_CR4, host_cr4),
>> @@ -151,5 +154,8 @@ const unsigned short vmcs12_field_offsets[] = {
>>   	FIELD(HOST_IA32_SYSENTER_EIP, host_ia32_sysenter_eip),
>>   	FIELD(HOST_RSP, host_rsp),
>>   	FIELD(HOST_RIP, host_rip),
>> +	FIELD(HOST_S_CET, host_s_cet),
>> +	FIELD(HOST_SSP, host_ssp),
>> +	FIELD(HOST_INTR_SSP_TABLE, host_ssp_tbl),
>>   };
>>   const unsigned int nr_vmcs12_fields = ARRAY_SIZE(vmcs12_field_offsets);
>> diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
>> index 01936013428b..3884489e7f7e 100644
>> --- a/arch/x86/kvm/vmx/vmcs12.h
>> +++ b/arch/x86/kvm/vmx/vmcs12.h
>> @@ -117,7 +117,13 @@ struct __packed vmcs12 {
>>   	natural_width host_ia32_sysenter_eip;
>>   	natural_width host_rsp;
>>   	natural_width host_rip;
>> -	natural_width paddingl[8]; /* room for future expansion */
>> +	natural_width host_s_cet;
>> +	natural_width host_ssp;
>> +	natural_width host_ssp_tbl;
>> +	natural_width guest_s_cet;
>> +	natural_width guest_ssp;
>> +	natural_width guest_ssp_tbl;
>> +	natural_width paddingl[2]; /* room for future expansion */
>>   	u32 pin_based_vm_exec_control;
>>   	u32 cpu_based_vm_exec_control;
>>   	u32 exception_bitmap;
>> @@ -292,6 +298,12 @@ static inline void vmx_check_vmcs12_offsets(void)
>>   	CHECK_OFFSET(host_ia32_sysenter_eip, 656);
>>   	CHECK_OFFSET(host_rsp, 664);
>>   	CHECK_OFFSET(host_rip, 672);
>> +	CHECK_OFFSET(host_s_cet, 680);
>> +	CHECK_OFFSET(host_ssp, 688);
>> +	CHECK_OFFSET(host_ssp_tbl, 696);
>> +	CHECK_OFFSET(guest_s_cet, 704);
>> +	CHECK_OFFSET(guest_ssp, 712);
>> +	CHECK_OFFSET(guest_ssp_tbl, 720);
>>   	CHECK_OFFSET(pin_based_vm_exec_control, 744);
>>   	CHECK_OFFSET(cpu_based_vm_exec_control, 748);
>>   	CHECK_OFFSET(exception_bitmap, 752);
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index a1aae8709939..947028ff2e25 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -7734,6 +7734,8 @@ static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
>>   	cr4_fixed1_update(X86_CR4_PKE,        ecx, feature_bit(PKU));
>>   	cr4_fixed1_update(X86_CR4_UMIP,       ecx, feature_bit(UMIP));
>>   	cr4_fixed1_update(X86_CR4_LA57,       ecx, feature_bit(LA57));
>> +	cr4_fixed1_update(X86_CR4_CET,	      ecx, feature_bit(SHSTK));
>> +	cr4_fixed1_update(X86_CR4_CET,	      edx, feature_bit(IBT));
>>   
>>   #undef cr4_fixed1_update
>>   }
>
>
Maxim Levitsky Dec. 5, 2023, 10:12 a.m. UTC | #3
On Mon, 2023-12-04 at 16:50 +0800, Yang, Weijiang wrote:
> On 12/1/2023 1:53 AM, Maxim Levitsky wrote:
> > On Fri, 2023-11-24 at 00:53 -0500, Yang Weijiang wrote:
> > > Set up CET MSRs, related VM_ENTRY/EXIT control bits and fixed CR4 setting
> > > to enable CET for nested VM.
> > > 
> > > Note, generally L1 VMM only touches CET VMCS fields when live migration or
> > > vmcs_{read,write}() to the fields happens, so the fields only need to be
> > > synced in these "rare" cases.
> > To be honest we can't assume anything about L1, but what we can assume
> > 
> > is that if vmcs12 field is not shadowed, then L1 vmwrite/vmread will
> > be always intercepted and during the interception the fields can be synced,
> > however I studied this area long ago and I might be mistaken.
> 
> The changelog wording failed to express what I meant to say:
> vmcs12 and vmcs02 should be synced to reflect the correct CET states L1 or L2 are expected
> to see. In LM case, the nested CET states should also be synced between L1 or L2 via the
> control structures.
> 
> Will reword them, thanks for pointing it out!
> > >   And here only considers the case that L1 VMM
> > > has set VM_ENTRY_LOAD_CET_STATE in its VMCS vm_entry_controls as it's the
> > > common usage.
> > > 
> > > Suggested-by: Chao Gao <chao.gao@intel.com>
> > > Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> > > ---
> > >   arch/x86/kvm/vmx/nested.c | 48 +++++++++++++++++++++++++++++++++++++--
> > >   arch/x86/kvm/vmx/vmcs12.c |  6 +++++
> > >   arch/x86/kvm/vmx/vmcs12.h | 14 +++++++++++-
> > >   arch/x86/kvm/vmx/vmx.c    |  2 ++
> > >   4 files changed, 67 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > > index d8c32682ca76..965173650542 100644
> > > --- a/arch/x86/kvm/vmx/nested.c
> > > +++ b/arch/x86/kvm/vmx/nested.c
> > > @@ -660,6 +660,28 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> > >   	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > >   					 MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
> > >   
> > > +	/* Pass CET MSRs to nested VM if L0 and L1 are set to pass-through. */
> > > +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > > +					 MSR_IA32_U_CET, MSR_TYPE_RW);
> > > +
> > > +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > > +					 MSR_IA32_S_CET, MSR_TYPE_RW);
> > > +
> > > +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > > +					 MSR_IA32_PL0_SSP, MSR_TYPE_RW);
> > > +
> > > +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > > +					 MSR_IA32_PL1_SSP, MSR_TYPE_RW);
> > > +
> > > +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > > +					 MSR_IA32_PL2_SSP, MSR_TYPE_RW);
> > > +
> > > +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > > +					 MSR_IA32_PL3_SSP, MSR_TYPE_RW);
> > > +
> > > +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > > +					 MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW);
> > > +
> > >   	kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
> > >   
> > >   	vmx->nested.force_msr_bitmap_recalc = false;
> > > @@ -2469,6 +2491,18 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
> > >   		if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
> > >   		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
> > >   			vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
> > > +
> > > +		if (vmx->nested.nested_run_pending &&
> > I don't think that nested.nested_run_pending check is needed.
> > prepare_vmcs02_rare is not going to be called unless the nested run is pending.
> 
> But there're other paths along to call prepare_vmcs02_rare(), e.g., vmx_set_nested_state()-> nested_vmx_enter_non_root_mode()-> prepare_vmcs02_rare(), especially when L1 instead of L2 was running. In this case, nested.nested_run_pending == false,
> we don't need to update vmcs02's fields at the point until L2 is being resumed.

- If we restore the VM from the migration stream when L2 is *not running*, then prepare_vmcs02_rare won't be called,
because nested_vmx_enter_non_root_mode will not be called, since there is no nested vmcs to load.

- If we restore the VM from the migration stream when L2 is *about to run* (KVM emulated the VMRESUME/VMLAUNCH,
but we didn't do the actual hardware VMLAUNCH/VMRESUME on vmcs02), then 'nested_run_pending' will be true; it is restored
from the migration stream.

- If we migrate while the nested guest has run at least once but hasn't VMEXITed to L1 yet, then yes, nested.nested_run_pending will indeed be false,
but we still need to set up vmcs02, otherwise it will be left with default zero values.

Remember that prior to setting nested state the VM usually hasn't run even once, unlike when the guest enters nested state normally.



> 
> > > +		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE)) {
> > > +			if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
> > > +				vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
> > > +				vmcs_writel(GUEST_INTR_SSP_TABLE,
> > > +					    vmcs12->guest_ssp_tbl);
> > > +			}
> > > +			if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
> > > +			    guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
> > > +				vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);
> > > +		}
> > >   	}
> > >   
> > >   	if (nested_cpu_has_xsaves(vmcs12))
> > > @@ -4300,6 +4334,15 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
> > >   	vmcs12->guest_pending_dbg_exceptions =
> > >   		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
> > >   
> > > +	if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
> > > +		vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
> > > +		vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
> > > +	}
> > > +	if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
> > > +	    guest_can_use(&vmx->vcpu, X86_FEATURE_IBT)) {
> > > +		vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);
> > > +	}
> > The above code should be conditional on VM_ENTRY_LOAD_CET_STATE - if the guest (L2) state
> > was loaded, then it must be updated on exit - this is usually how VMX works.
> 
> I think this is not for L2 VM_ENTRY_LOAD_CET_STATE, it happens in prepare_vmcs02_rare(). IIUC, the guest registers will be saved into VMCS fields unconditionally when vm-exit happens,
> so these fields for L2 guest should be synced to L1 unconditionally.

"the guest registers will be saved into VMCS fields unconditionally"
This is not true, unless there is a bug. the vmcs12 VM_ENTRY_LOAD_CET_STATE should be passed through as is to vmcs02, so if the nested guest doesn't set this bit
the entry/exit using vmcs02 will not touch the CET state, which is unusual but allowed by the spec I think - a nested hypervisor can opt for example to save/load
this state manually or use msr load/store lists instead.

Regardless of this,
if the guest didn't set VM_ENTRY_LOAD_CET_STATE, then vmcs12 guest fields should neither be loaded on VM entry (copied to vmcs02) nor updated on VM exit,
(that is copied back to vmcs12) this is what is written in the VMX spec.
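
For reference, a sketch of the conditional sync being described, assuming it replaces the unconditional reads in sync_vmcs02_to_vmcs12_rare(); whether this gating is architecturally required is debated later in the thread, so treat it only as an illustration:

	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE) {
		/* Sync L2's CET state back to vmcs12 only if it was loaded on entry. */
		if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
			vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
			vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
		}
		if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
		    guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
			vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);
	}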


Best regards,
	Maxim Levitsky

> 
> > Also I don't see any mention of usage of VM_EXIT_LOAD_CET_STATE, which if set,
> > should reset the L1 CET state to values in 'host_s_cet/host_ssp/host_ssp_tbl'
> > (This is also a common theme in VMX - host state is reset to values that the hypervisor
> > sets in VMCS, and the hypervisor must care to update these fields itself).
> 
> Yes, the host CET states for L1 also should be synced, I'll add the missing part, thanks!



> 
> > As a rule of thumb, if you add a field to vmcs12, you should use it somewhere,
> > and you should never use it unconditionally, as almost always its use
> > depends on entry or exit controls.
> > 
> > Same is true for entry/exit/execution controls - if you add one, you almost
> > always have to use it somewhere.
> 
> I'll double check if anything is lost in various cases, thanks!
> 
> > Best regards,
> > 	Maxim Levitsky
> > 
> > > +
> > >   	vmx->nested.need_sync_vmcs02_to_vmcs12_rare = false;
> > >   }
> > >   
> > > @@ -6798,7 +6841,7 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
> > >   		VM_EXIT_HOST_ADDR_SPACE_SIZE |
> > >   #endif
> > >   		VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
> > > -		VM_EXIT_CLEAR_BNDCFGS;
> > > +		VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_CET_STATE;
> > >   	msrs->exit_ctls_high |=
> > >   		VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
> > >   		VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
> > > @@ -6820,7 +6863,8 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
> > >   #ifdef CONFIG_X86_64
> > >   		VM_ENTRY_IA32E_MODE |
> > >   #endif
> > > -		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
> > > +		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
> > > +		VM_ENTRY_LOAD_CET_STATE;
> > >   	msrs->entry_ctls_high |=
> > >   		(VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
> > >   		 VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
> > > diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
> > > index 106a72c923ca..4233b5ca9461 100644
> > > --- a/arch/x86/kvm/vmx/vmcs12.c
> > > +++ b/arch/x86/kvm/vmx/vmcs12.c
> > > @@ -139,6 +139,9 @@ const unsigned short vmcs12_field_offsets[] = {
> > >   	FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
> > >   	FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
> > >   	FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
> > > +	FIELD(GUEST_S_CET, guest_s_cet),
> > > +	FIELD(GUEST_SSP, guest_ssp),
> > > +	FIELD(GUEST_INTR_SSP_TABLE, guest_ssp_tbl),
> > >   	FIELD(HOST_CR0, host_cr0),
> > >   	FIELD(HOST_CR3, host_cr3),
> > >   	FIELD(HOST_CR4, host_cr4),
> > > @@ -151,5 +154,8 @@ const unsigned short vmcs12_field_offsets[] = {
> > >   	FIELD(HOST_IA32_SYSENTER_EIP, host_ia32_sysenter_eip),
> > >   	FIELD(HOST_RSP, host_rsp),
> > >   	FIELD(HOST_RIP, host_rip),
> > > +	FIELD(HOST_S_CET, host_s_cet),
> > > +	FIELD(HOST_SSP, host_ssp),
> > > +	FIELD(HOST_INTR_SSP_TABLE, host_ssp_tbl),
> > >   };
> > >   const unsigned int nr_vmcs12_fields = ARRAY_SIZE(vmcs12_field_offsets);
> > > diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
> > > index 01936013428b..3884489e7f7e 100644
> > > --- a/arch/x86/kvm/vmx/vmcs12.h
> > > +++ b/arch/x86/kvm/vmx/vmcs12.h
> > > @@ -117,7 +117,13 @@ struct __packed vmcs12 {
> > >   	natural_width host_ia32_sysenter_eip;
> > >   	natural_width host_rsp;
> > >   	natural_width host_rip;
> > > -	natural_width paddingl[8]; /* room for future expansion */
> > > +	natural_width host_s_cet;
> > > +	natural_width host_ssp;
> > > +	natural_width host_ssp_tbl;
> > > +	natural_width guest_s_cet;
> > > +	natural_width guest_ssp;
> > > +	natural_width guest_ssp_tbl;
> > > +	natural_width paddingl[2]; /* room for future expansion */
> > >   	u32 pin_based_vm_exec_control;
> > >   	u32 cpu_based_vm_exec_control;
> > >   	u32 exception_bitmap;
> > > @@ -292,6 +298,12 @@ static inline void vmx_check_vmcs12_offsets(void)
> > >   	CHECK_OFFSET(host_ia32_sysenter_eip, 656);
> > >   	CHECK_OFFSET(host_rsp, 664);
> > >   	CHECK_OFFSET(host_rip, 672);
> > > +	CHECK_OFFSET(host_s_cet, 680);
> > > +	CHECK_OFFSET(host_ssp, 688);
> > > +	CHECK_OFFSET(host_ssp_tbl, 696);
> > > +	CHECK_OFFSET(guest_s_cet, 704);
> > > +	CHECK_OFFSET(guest_ssp, 712);
> > > +	CHECK_OFFSET(guest_ssp_tbl, 720);
> > >   	CHECK_OFFSET(pin_based_vm_exec_control, 744);
> > >   	CHECK_OFFSET(cpu_based_vm_exec_control, 748);
> > >   	CHECK_OFFSET(exception_bitmap, 752);
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index a1aae8709939..947028ff2e25 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -7734,6 +7734,8 @@ static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
> > >   	cr4_fixed1_update(X86_CR4_PKE,        ecx, feature_bit(PKU));
> > >   	cr4_fixed1_update(X86_CR4_UMIP,       ecx, feature_bit(UMIP));
> > >   	cr4_fixed1_update(X86_CR4_LA57,       ecx, feature_bit(LA57));
> > > +	cr4_fixed1_update(X86_CR4_CET,	      ecx, feature_bit(SHSTK));
> > > +	cr4_fixed1_update(X86_CR4_CET,	      edx, feature_bit(IBT));
> > >   
> > >   #undef cr4_fixed1_update
> > >   }
Yang, Weijiang Dec. 6, 2023, 9:22 a.m. UTC | #4
On 12/5/2023 6:12 PM, Maxim Levitsky wrote:
> On Mon, 2023-12-04 at 16:50 +0800, Yang, Weijiang wrote:

[...]

>>>>    	vmx->nested.force_msr_bitmap_recalc = false;
>>>> @@ -2469,6 +2491,18 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
>>>>    		if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
>>>>    		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
>>>>    			vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
>>>> +
>>>> +		if (vmx->nested.nested_run_pending &&
>>> I don't think that nested.nested_run_pending check is needed.
>>> prepare_vmcs02_rare is not going to be called unless the nested run is pending.
>> But there're other paths along to call prepare_vmcs02_rare(), e.g., vmx_set_nested_state()-> nested_vmx_enter_non_root_mode()-> prepare_vmcs02_rare(), especially when L1 instead of L2 was running. In this case, nested.nested_run_pending == false,
>> we don't need to update vmcs02's fields at the point until L2 is being resumed.
> - If we restore VM from migration stream when L2 is *not running*, then prepare_vmcs02_rare won't be called,
> because nested_vmx_enter_non_root_mode will not be called, because in turn there is no nested vmcs to load.
>
> - If we restore VM from migration stream when L2 is *about to run* (KVM emulated the VMRESUME/VMLAUNCH,
> but we didn't do the actual hardware VMLAUNCH/VMRESUME on vmcs02, then the 'nested_run_pending' will be true, it will be restored
> from the migration stream.
>
> - If we migrate while nested guest was run once but didn't VMEXIT to L1 yet, then yes, nested.nested_run_pending will be false indeed,
> but we still need to setup vmcs02, otherwise it will be left with default zero values.

Thanks a lot for recapping these cases! I overlooked some nested flags before. It makes sense to remove nested.nested_run_pending.
> Remember that prior to setting nested state the VM wasn't running even once usually, unlike when the guest enters nested state normally.
>
>>>> +		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE)) {
>>>> +			if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
>>>> +				vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
>>>> +				vmcs_writel(GUEST_INTR_SSP_TABLE,
>>>> +					    vmcs12->guest_ssp_tbl);
>>>> +			}
>>>> +			if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
>>>> +			    guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
>>>> +				vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);
>>>> +		}
>>>>    	}
>>>>    
>>>>    	if (nested_cpu_has_xsaves(vmcs12))
>>>> @@ -4300,6 +4334,15 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
>>>>    	vmcs12->guest_pending_dbg_exceptions =
>>>>    		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
>>>>    
>>>> +	if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
>>>> +		vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
>>>> +		vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
>>>> +	}
>>>> +	if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
>>>> +	    guest_can_use(&vmx->vcpu, X86_FEATURE_IBT)) {
>>>> +		vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);
>>>> +	}
>>> The above code should be conditional on VM_ENTRY_LOAD_CET_STATE - if the guest (L2) state
>>> was loaded, then it must be updated on exit - this is usually how VMX works.
>> I think this is not for L2 VM_ENTRY_LOAD_CET_STATE, it happens in prepare_vmcs02_rare(). IIUC, the guest registers will be saved into VMCS fields unconditionally when vm-exit happens,
>> so these fields for L2 guest should be synced to L1 unconditionally.
> "the guest registers will be saved into VMCS fields unconditionally"
> This is not true, unless there is a bug.

I checked the latest SDM; there's no such wording regarding the CET entry/exit control bits. The wording comes from
the standalone CET spec:
"10.6 VM Exit
On processors that support CET, the VM exit saves the state of IA32_S_CET, SSP and IA32_INTERRUPT_SSP_TABLE_ADDR MSR to the VMCS guest-state area unconditionally."
But since it doesn't appear in the SDM, I shouldn't take it for granted.

> the vmcs12 VM_ENTRY_LOAD_CET_STATE should be passed through as is to vmcs02, so if the nested guest doesn't set this bit
> the entry/exit using vmcs02 will not touch the CET state, which is unusual but allowed by the spec I think - a nested hypervisor can opt for example to save/load
> this state manually or use msr load/store lists instead.

Right, although the use case should be rare; I will modify the code to check VM_ENTRY_LOAD_CET_STATE. Thanks!
> Regardless of this,
> if the guest didn't set VM_ENTRY_LOAD_CET_STATE, then vmcs12 guest fields should neither be loaded on VM entry (copied to vmcs02) nor updated on VM exit,
> (that is copied back to vmcs12) this is what is written in the VMX spec.

What's the VMX spec you're referring to here?
Maxim Levitsky Dec. 6, 2023, 5:24 p.m. UTC | #5
On Wed, 2023-12-06 at 17:22 +0800, Yang, Weijiang wrote:
> On 12/5/2023 6:12 PM, Maxim Levitsky wrote:
> > On Mon, 2023-12-04 at 16:50 +0800, Yang, Weijiang wrote:
> 
> [...]
> 
> > > > >    	vmx->nested.force_msr_bitmap_recalc = false;
> > > > > @@ -2469,6 +2491,18 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
> > > > >    		if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
> > > > >    		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
> > > > >    			vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
> > > > > +
> > > > > +		if (vmx->nested.nested_run_pending &&
> > > > I don't think that nested.nested_run_pending check is needed.
> > > > prepare_vmcs02_rare is not going to be called unless the nested run is pending.
> > > But there're other paths along to call prepare_vmcs02_rare(), e.g., vmx_set_nested_state()-> nested_vmx_enter_non_root_mode()-> prepare_vmcs02_rare(), especially when L1 instead of L2 was running. In this case, nested.nested_run_pending == false,
> > > we don't need to update vmcs02's fields at the point until L2 is being resumed.
> > - If we restore VM from migration stream when L2 is *not running*, then prepare_vmcs02_rare won't be called,
> > because nested_vmx_enter_non_root_mode will not be called, because in turn there is no nested vmcs to load.
> > 
> > - If we restore VM from migration stream when L2 is *about to run* (KVM emulated the VMRESUME/VMLAUNCH,
> > but we didn't do the actual hardware VMLAUNCH/VMRESUME on vmcs02, then the 'nested_run_pending' will be true, it will be restored
> > from the migration stream.
> > 
> > - If we migrate while nested guest was run once but didn't VMEXIT to L1 yet, then yes, nested.nested_run_pending will be false indeed,
> > but we still need to setup vmcs02, otherwise it will be left with default zero values.
> 
> Thanks a lot for recapping these cases! I overlooked some nested flags before. It makes sense to remove nested.nested_run_pending.
> > Remember that prior to setting nested state the VM wasn't running even once usually, unlike when the guest enters nested state normally.
> > 
> > > > > +		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE)) {
> > > > > +			if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
> > > > > +				vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
> > > > > +				vmcs_writel(GUEST_INTR_SSP_TABLE,
> > > > > +					    vmcs12->guest_ssp_tbl);
> > > > > +			}
> > > > > +			if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
> > > > > +			    guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
> > > > > +				vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);
> > > > > +		}
> > > > >    	}
> > > > >    
> > > > >    	if (nested_cpu_has_xsaves(vmcs12))
> > > > > @@ -4300,6 +4334,15 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
> > > > >    	vmcs12->guest_pending_dbg_exceptions =
> > > > >    		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
> > > > >    
> > > > > +	if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
> > > > > +		vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
> > > > > +		vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
> > > > > +	}
> > > > > +	if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
> > > > > +	    guest_can_use(&vmx->vcpu, X86_FEATURE_IBT)) {
> > > > > +		vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);
> > > > > +	}
> > > > The above code should be conditional on VM_ENTRY_LOAD_CET_STATE - if the guest (L2) state
> > > > was loaded, then it must be updated on exit - this is usually how VMX works.
> > > I think this is not for L2 VM_ENTRY_LOAD_CET_STATE, it happens in prepare_vmcs02_rare(). IIUC, the guest registers will be saved into VMCS fields unconditionally when vm-exit happens,
> > > so these fields for L2 guest should be synced to L1 unconditionally.
> > "the guest registers will be saved into VMCS fields unconditionally"
> > This is not true, unless there is a bug.
> 
> I checked the latest SDM, there's no such kind of wording regarding CET entry/exit control bits. The wording comes from
> the individual CET spec.:
> "10.6 VM Exit
> On processors that support CET, the VM exit saves the state of IA32_S_CET, SSP and IA32_INTERRUPT_SSP_TABLE_ADDR MSR to the VMCS guest-state area unconditionally."
> But since it doesn't appear in SDM, I shouldn't take it for granted.

SDM spec from September 2023:

28.3.1 Saving Control Registers, Debug Registers, and MSRs

"If the processor supports the 1-setting of the “load CET” VM-entry control, the contents of the IA32_S_CET and
IA32_INTERRUPT_SSP_TABLE_ADDR MSRs are saved into the corresponding fields. On processors that do not
support Intel 64 architecture, bits 63:32 of these MSRs are not saved."

Honestly it's not 100% clear whether “load CET” has to be set to 1 to trigger the save, or whether this control merely needs to be
supported on the CPU.
It does feel like you are right here, that the CPU always saves the guest state but allows it not to be loaded on VM entry via the
“load CET” VM-entry control.

IMHO it's best to check what bare metal does by rigging a test: patch the host kernel to not set the 'load CET' control,
and see whether the CPU still updates the guest CET fields on VM exit.
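
A purely illustrative sketch of that experiment, assuming a temporary local hack on a CET-capable host with this series applied; the hook points and the debug print are assumptions, not proposed code:

	/* Where the VM-entry controls are computed: force the bit off. */
	vm_entry_controls_clearbit(vmx, VM_ENTRY_LOAD_CET_STATE);

	/* Somewhere on the VM-exit path: dump what the CPU saved. */
	pr_info("GUEST_S_CET=%lx GUEST_SSP=%lx GUEST_INTR_SSP_TABLE=%lx\n",
		vmcs_readl(GUEST_S_CET), vmcs_readl(GUEST_SSP),
		vmcs_readl(GUEST_INTR_SSP_TABLE));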


> 
> > the vmcs12 VM_ENTRY_LOAD_CET_STATE should be passed through as is to vmcs02, so if the nested guest doesn't set this bit
> > the entry/exit using vmcs02 will not touch the CET state, which is unusual but allowed by the spec I think - a nested hypervisor can opt for example to save/load
> > this state manually or use msr load/store lists instead.
> 
> Right although the use case should be rare, will modify the code to check VM_ENTRY_LOAD_CET_STATE. Thanks!


> > Regardless of this,
> > if the guest didn't set VM_ENTRY_LOAD_CET_STATE, then vmcs12 guest fields should neither be loaded on VM entry (copied to vmcs02) nor updated on VM exit,
> > (that is copied back to vmcs12) this is what is written in the VMX spec.
> 
> What's the VMX spec. your're referring to here?

SDM.

In fact, now that I am thinking about this again, it should be OK to unconditionally copy the CET fields from vmcs12 to vmcs02, because as long as the 
VM_ENTRY_LOAD_CET_STATE is not set, the CPU should care about their values in the vmcs02.

And about the other way around, assuming that I made a mistake as I said above, then the other way around is indeed unconditional.


Sorry for a bit of a confusion.

Best regards,
	Maxim Levitsky


> 
>
Yang, Weijiang Dec. 8, 2023, 3:15 p.m. UTC | #6
On 12/7/2023 1:24 AM, Maxim Levitsky wrote:
> On Wed, 2023-12-06 at 17:22 +0800, Yang, Weijiang wrote:
>> On 12/5/2023 6:12 PM, Maxim Levitsky wrote:
>>> On Mon, 2023-12-04 at 16:50 +0800, Yang, Weijiang wrote:
>> [...]
>>
>>>>>>     	vmx->nested.force_msr_bitmap_recalc = false;
>>>>>> @@ -2469,6 +2491,18 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
>>>>>>     		if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
>>>>>>     		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
>>>>>>     			vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
>>>>>> +
>>>>>> +		if (vmx->nested.nested_run_pending &&
>>>>> I don't think that nested.nested_run_pending check is needed.
>>>>> prepare_vmcs02_rare is not going to be called unless the nested run is pending.
>>>> But there're other paths along to call prepare_vmcs02_rare(), e.g., vmx_set_nested_state()-> nested_vmx_enter_non_root_mode()-> prepare_vmcs02_rare(), especially when L1 instead of L2 was running. In this case, nested.nested_run_pending == false,
>>>> we don't need to update vmcs02's fields at the point until L2 is being resumed.
>>> - If we restore VM from migration stream when L2 is *not running*, then prepare_vmcs02_rare won't be called,
>>> because nested_vmx_enter_non_root_mode will not be called, because in turn there is no nested vmcs to load.
>>>
>>> - If we restore VM from migration stream when L2 is *about to run* (KVM emulated the VMRESUME/VMLAUNCH,
>>> but we didn't do the actual hardware VMLAUNCH/VMRESUME on vmcs02, then the 'nested_run_pending' will be true, it will be restored
>>> from the migration stream.
>>>
>>> - If we migrate while nested guest was run once but didn't VMEXIT to L1 yet, then yes, nested.nested_run_pending will be false indeed,
>>> but we still need to setup vmcs02, otherwise it will be left with default zero values.
>> Thanks a lot for recapping these cases! I overlooked some nested flags before. It makes sense to remove nested.nested_run_pending.
>>> Remember that prior to setting nested state the VM wasn't running even once usually, unlike when the guest enters nested state normally.
>>>
>>>>>> +		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE)) {
>>>>>> +			if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
>>>>>> +				vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
>>>>>> +				vmcs_writel(GUEST_INTR_SSP_TABLE,
>>>>>> +					    vmcs12->guest_ssp_tbl);
>>>>>> +			}
>>>>>> +			if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
>>>>>> +			    guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
>>>>>> +				vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);
>>>>>> +		}
>>>>>>     	}
>>>>>>     
>>>>>>     	if (nested_cpu_has_xsaves(vmcs12))
>>>>>> @@ -4300,6 +4334,15 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
>>>>>>     	vmcs12->guest_pending_dbg_exceptions =
>>>>>>     		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
>>>>>>     
>>>>>> +	if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
>>>>>> +		vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
>>>>>> +		vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
>>>>>> +	}
>>>>>> +	if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
>>>>>> +	    guest_can_use(&vmx->vcpu, X86_FEATURE_IBT)) {
>>>>>> +		vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);
>>>>>> +	}
>>>>> The above code should be conditional on VM_ENTRY_LOAD_CET_STATE - if the guest (L2) state
>>>>> was loaded, then it must be updated on exit - this is usually how VMX works.
>>>> I think this is not for L2 VM_ENTRY_LOAD_CET_STATE, it happens in prepare_vmcs02_rare(). IIUC, the guest registers will be saved into VMCS fields unconditionally when vm-exit happens,
>>>> so these fields for L2 guest should be synced to L1 unconditionally.
>>> "the guest registers will be saved into VMCS fields unconditionally"
>>> This is not true, unless there is a bug.
>> I checked the latest SDM, there's no such kind of wording regarding CET entry/exit control bits. The wording comes from
>> the individual CET spec.:
>> "10.6 VM Exit
>> On processors that support CET, the VM exit saves the state of IA32_S_CET, SSP and IA32_INTERRUPT_SSP_TABLE_ADDR MSR to the VMCS guest-state area unconditionally."
>> But since it doesn't appear in SDM, I shouldn't take it for granted.
> SDM spec from September 2023:
>
> 28.3.1 Saving Control Registers, Debug Registers, and MSRs
>
> "If the processor supports the 1-setting of the “load CET” VM-entry control, the contents of the IA32_S_CET and
> IA32_INTERRUPT_SSP_TABLE_ADDR MSRs are saved into the corresponding fields. On processors that do not
> support Intel 64 architecture, bits 63:32 of these MSRs are not saved."
>
> Honestly it's not 100% clear if the “load CET” should be set to 1 to trigger the restore, or that this control just needs to be
> supported on the CPU.
> It does feel like you are right here, that CPU always saves the guest state, but allows to not load it on VM entry via
> “load CET” VM entry control.
>
> IMHO its best to check what the bare metal does by rigging a test by patching the host kernel to not set the 'load CET' control,
> and see if the CPU still updates the guest CET fields on the VM exit.

OK, I'll do some tests to see what's happening, thanks!
>>> the vmcs12 VM_ENTRY_LOAD_CET_STATE should be passed through as is to vmcs02, so if the nested guest doesn't set this bit
>>> the entry/exit using vmcs02 will not touch the CET state, which is unusual but allowed by the spec I think - a nested hypervisor can opt for example to save/load
>>> this state manually or use msr load/store lists instead.
>> Right although the use case should be rare, will modify the code to check VM_ENTRY_LOAD_CET_STATE. Thanks!
>
>>> Regardless of this,
>>> if the guest didn't set VM_ENTRY_LOAD_CET_STATE, then vmcs12 guest fields should neither be loaded on VM entry (copied to vmcs02) nor updated on VM exit,
>>> (that is copied back to vmcs12) this is what is written in the VMX spec.
>> What's the VMX spec. your're referring to here?
> SDM.
>
> In fact, now that I am thinking about this again, it should be OK to unconditionally copy the CET fields from vmcs12 to vmcs02, because as long as the
> VM_ENTRY_LOAD_CET_STATE is not set, the CPU should care about their values in the vmcs02.
>
> And about the other way around, assuming that I made a mistake as I said above, then the other way around is indeed unconditional.
>
>
> Sorry for a bit of a confusion.

NP, I'll also double-check it with the HW architects and get back to you.
Thanks for raising these questions!

> Best regards,
> 	Maxim Levitsky
>
>
>>
>
Maxim Levitsky Dec. 8, 2023, 3:22 p.m. UTC | #7
On Fri, 2023-12-08 at 23:15 +0800, Yang, Weijiang wrote:
> On 12/7/2023 1:24 AM, Maxim Levitsky wrote:
> > On Wed, 2023-12-06 at 17:22 +0800, Yang, Weijiang wrote:
> > > On 12/5/2023 6:12 PM, Maxim Levitsky wrote:
> > > > On Mon, 2023-12-04 at 16:50 +0800, Yang, Weijiang wrote:
> > > [...]
> > > 
> > > > > > >     	vmx->nested.force_msr_bitmap_recalc = false;
> > > > > > > @@ -2469,6 +2491,18 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
> > > > > > >     		if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
> > > > > > >     		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
> > > > > > >     			vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
> > > > > > > +
> > > > > > > +		if (vmx->nested.nested_run_pending &&
> > > > > > I don't think that nested.nested_run_pending check is needed.
> > > > > > prepare_vmcs02_rare is not going to be called unless the nested run is pending.
> > > > > But there're other paths along to call prepare_vmcs02_rare(), e.g., vmx_set_nested_state()-> nested_vmx_enter_non_root_mode()-> prepare_vmcs02_rare(), especially when L1 instead of L2 was running. In this case, nested.nested_run_pending == false,
> > > > > we don't need to update vmcs02's fields at the point until L2 is being resumed.
> > > > - If we restore VM from migration stream when L2 is *not running*, then prepare_vmcs02_rare won't be called,
> > > > because nested_vmx_enter_non_root_mode will not be called, because in turn there is no nested vmcs to load.
> > > > 
> > > > - If we restore VM from migration stream when L2 is *about to run* (KVM emulated the VMRESUME/VMLAUNCH,
> > > > but we didn't do the actual hardware VMLAUNCH/VMRESUME on vmcs02, then the 'nested_run_pending' will be true, it will be restored
> > > > from the migration stream.
> > > > 
> > > > - If we migrate while nested guest was run once but didn't VMEXIT to L1 yet, then yes, nested.nested_run_pending will be false indeed,
> > > > but we still need to setup vmcs02, otherwise it will be left with default zero values.
> > > Thanks a lot for recapping these cases! I overlooked some nested flags before. It makes sense to remove nested.nested_run_pending.
> > > > Remember that prior to setting nested state the VM wasn't running even once usually, unlike when the guest enters nested state normally.
> > > > 
> > > > > > > +		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE)) {
> > > > > > > +			if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
> > > > > > > +				vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
> > > > > > > +				vmcs_writel(GUEST_INTR_SSP_TABLE,
> > > > > > > +					    vmcs12->guest_ssp_tbl);
> > > > > > > +			}
> > > > > > > +			if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
> > > > > > > +			    guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
> > > > > > > +				vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);
> > > > > > > +		}
> > > > > > >     	}
> > > > > > >     
> > > > > > >     	if (nested_cpu_has_xsaves(vmcs12))
> > > > > > > @@ -4300,6 +4334,15 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
> > > > > > >     	vmcs12->guest_pending_dbg_exceptions =
> > > > > > >     		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
> > > > > > >     
> > > > > > > +	if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
> > > > > > > +		vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
> > > > > > > +		vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
> > > > > > > +	}
> > > > > > > +	if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
> > > > > > > +	    guest_can_use(&vmx->vcpu, X86_FEATURE_IBT)) {
> > > > > > > +		vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);
> > > > > > > +	}
> > > > > > The above code should be conditional on VM_ENTRY_LOAD_CET_STATE - if the guest (L2) state
> > > > > > was loaded, then it must be updated on exit - this is usually how VMX works.
> > > > > I think this is not for L2 VM_ENTRY_LOAD_CET_STATE, it happens in prepare_vmcs02_rare(). IIUC, the guest registers will be saved into VMCS fields unconditionally when vm-exit happens,
> > > > > so these fields for L2 guest should be synced to L1 unconditionally.
> > > > "the guest registers will be saved into VMCS fields unconditionally"
> > > > This is not true, unless there is a bug.
> > > I checked the latest SDM, there's no such kind of wording regarding CET entry/exit control bits. The wording comes from
> > > the individual CET spec.:
> > > "10.6 VM Exit
> > > On processors that support CET, the VM exit saves the state of IA32_S_CET, SSP and IA32_INTERRUPT_SSP_TABLE_ADDR MSR to the VMCS guest-state area unconditionally."
> > > But since it doesn't appear in SDM, I shouldn't take it for granted.
> > SDM spec from September 2023:
> > 
> > 28.3.1 Saving Control Registers, Debug Registers, and MSRs
> > 
> > "If the processor supports the 1-setting of the “load CET” VM-entry control, the contents of the IA32_S_CET and
> > IA32_INTERRUPT_SSP_TABLE_ADDR MSRs are saved into the corresponding fields. On processors that do not
> > support Intel 64 architecture, bits 63:32 of these MSRs are not saved."
> > 
> > Honestly it's not 100% clear if the “load CET” should be set to 1 to trigger the restore, or that this control just needs to be
> > supported on the CPU.
> > It does feel like you are right here, that CPU always saves the guest state, but allows to not load it on VM entry via
> > “load CET” VM entry control.
> > 
> > IMHO its best to check what the bare metal does by rigging a test by patching the host kernel to not set the 'load CET' control,
> > and see if the CPU still updates the guest CET fields on the VM exit.
> 
> OK, I'll do some tests to see what's happening, thanks!
> > > > the vmcs12 VM_ENTRY_LOAD_CET_STATE should be passed through as is to vmcs02, so if the nested guest doesn't set this bit
> > > > the entry/exit using vmcs02 will not touch the CET state, which is unusual but allowed by the spec I think - a nested hypervisor can opt for example to save/load
> > > > this state manually or use msr load/store lists instead.
> > > Right although the use case should be rare, will modify the code to check VM_ENTRY_LOAD_CET_STATE. Thanks!
> > > > Regardless of this,
> > > > if the guest didn't set VM_ENTRY_LOAD_CET_STATE, then vmcs12 guest fields should neither be loaded on VM entry (copied to vmcs02) nor updated on VM exit,
> > > > (that is copied back to vmcs12) this is what is written in the VMX spec.
> > > What's the VMX spec. your're referring to here?
> > SDM.
> > 
> > In fact, now that I am thinking about this again, it should be OK to unconditionally copy the CET fields from vmcs12 to vmcs02, because as long as the
> > VM_ENTRY_LOAD_CET_STATE is not set, the CPU should care about their values in the vmcs02.
I noticed a typo. I meant that the CPU shouldn't care about their values in the vmcs02.
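
To make that option concrete, here is a minimal sketch (an illustration, not the final patch) of what an unconditional vmcs12 -> vmcs02 copy could look like in prepare_vmcs02_rare(), reusing the helpers this series already uses; it relies on the CPU ignoring these guest-state fields whenever VM_ENTRY_LOAD_CET_STATE is clear in vmcs02:

	if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
		/* Shadow-stack pointer state only matters when SHSTK is exposed. */
		vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
		vmcs_writel(GUEST_INTR_SSP_TABLE, vmcs12->guest_ssp_tbl);
	}
	if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
	    guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
		/* GUEST_S_CET carries both the shadow-stack and IBT control bits. */
		vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);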

> > 
> > And about the other way around, assuming that I made a mistake as I said above, then the other way around is indeed unconditional.
> > 
> > 
> > Sorry for a bit of a confusion.
> 
> NP, I'll also double-check it with HW Arch and get back to you.
> Thanks for raising these questions!

Thanks to you too!


Best regards,
	Maxim Levitsky

> 
> > Best regards,
> > 	Maxim Levitsky
> > 
> >
Yang, Weijiang Dec. 12, 2023, 8:56 a.m. UTC | #8
On 12/8/2023 11:22 PM, Maxim Levitsky wrote:
> On Fri, 2023-12-08 at 23:15 +0800, Yang, Weijiang wrote:
>> On 12/7/2023 1:24 AM, Maxim Levitsky wrote:
>>> On Wed, 2023-12-06 at 17:22 +0800, Yang, Weijiang wrote:
>>>> On 12/5/2023 6:12 PM, Maxim Levitsky wrote:
>>>>> On Mon, 2023-12-04 at 16:50 +0800, Yang, Weijiang wrote:
>>>> [...]
>>>>
>>>>>>>>      	vmx->nested.force_msr_bitmap_recalc = false;
>>>>>>>> @@ -2469,6 +2491,18 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
>>>>>>>>      		if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
>>>>>>>>      		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
>>>>>>>>      			vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
>>>>>>>> +
>>>>>>>> +		if (vmx->nested.nested_run_pending &&
>>>>>>> I don't think that nested.nested_run_pending check is needed.
>>>>>>> prepare_vmcs02_rare is not going to be called unless the nested run is pending.
>>>>>> But there're other paths along to call prepare_vmcs02_rare(), e.g., vmx_set_nested_state()-> nested_vmx_enter_non_root_mode()-> prepare_vmcs02_rare(), especially when L1 instead of L2 was running. In this case, nested.nested_run_pending == false,
>>>>>> we don't need to update vmcs02's fields at the point until L2 is being resumed.
>>>>> - If we restore VM from migration stream when L2 is *not running*, then prepare_vmcs02_rare won't be called,
>>>>> because nested_vmx_enter_non_root_mode will not be called, because in turn there is no nested vmcs to load.
>>>>>
>>>>> - If we restore VM from migration stream when L2 is *about to run* (KVM emulated the VMRESUME/VMLAUNCH,
>>>>> but we didn't do the actual hardware VMLAUNCH/VMRESUME on vmcs02, then the 'nested_run_pending' will be true, it will be restored
>>>>> from the migration stream.
>>>>>
>>>>> - If we migrate while nested guest was run once but didn't VMEXIT to L1 yet, then yes, nested.nested_run_pending will be false indeed,
>>>>> but we still need to setup vmcs02, otherwise it will be left with default zero values.
>>>> Thanks a lot for recapping these cases! I overlooked some nested flags before. It makes sense to remove nested.nested_run_pending.
>>>>> Remember that prior to setting nested state the VM wasn't running even once usually, unlike when the guest enters nested state normally.
>>>>>
>>>>>>>> +		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE)) {
>>>>>>>> +			if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
>>>>>>>> +				vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
>>>>>>>> +				vmcs_writel(GUEST_INTR_SSP_TABLE,
>>>>>>>> +					    vmcs12->guest_ssp_tbl);
>>>>>>>> +			}
>>>>>>>> +			if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
>>>>>>>> +			    guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
>>>>>>>> +				vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);
>>>>>>>> +		}
>>>>>>>>      	}
>>>>>>>>      
>>>>>>>>      	if (nested_cpu_has_xsaves(vmcs12))
>>>>>>>> @@ -4300,6 +4334,15 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
>>>>>>>>      	vmcs12->guest_pending_dbg_exceptions =
>>>>>>>>      		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
>>>>>>>>      
>>>>>>>> +	if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
>>>>>>>> +		vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
>>>>>>>> +		vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
>>>>>>>> +	}
>>>>>>>> +	if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
>>>>>>>> +	    guest_can_use(&vmx->vcpu, X86_FEATURE_IBT)) {
>>>>>>>> +		vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);
>>>>>>>> +	}
>>>>>>> The above code should be conditional on VM_ENTRY_LOAD_CET_STATE - if the guest (L2) state
>>>>>>> was loaded, then it must be updated on exit - this is usually how VMX works.
>>>>>> I think this is not for L2 VM_ENTRY_LOAD_CET_STATE, it happens in prepare_vmcs02_rare(). IIUC, the guest registers will be saved into VMCS fields unconditionally when vm-exit happens,
>>>>>> so these fields for L2 guest should be synced to L1 unconditionally.
>>>>> "the guest registers will be saved into VMCS fields unconditionally"
>>>>> This is not true, unless there is a bug.
>>>> I checked the latest SDM, there's no such kind of wording regarding CET entry/exit control bits. The wording comes from
>>>> the individual CET spec.:
>>>> "10.6 VM Exit
>>>> On processors that support CET, the VM exit saves the state of IA32_S_CET, SSP and IA32_INTERRUPT_SSP_TABLE_ADDR MSR to the VMCS guest-state area unconditionally."
>>>> But since it doesn't appear in SDM, I shouldn't take it for granted.
>>> SDM spec from September 2023:
>>>
>>> 28.3.1 Saving Control Registers, Debug Registers, and MSRs
>>>
>>> "If the processor supports the 1-setting of the “load CET” VM-entry control, the contents of the IA32_S_CET and
>>> IA32_INTERRUPT_SSP_TABLE_ADDR MSRs are saved into the corresponding fields. On processors that do not
>>> support Intel 64 architecture, bits 63:32 of these MSRs are not saved."
>>>
>>> Honestly it's not 100% clear if the “load CET” should be set to 1 to trigger the restore, or that this control just needs to be
>>> supported on the CPU.
>>> It does feel like you are right here, that CPU always saves the guest state, but allows to not load it on VM entry via
>>> “load CET” VM entry control.
>>>
>>> IMHO it's best to check what the bare metal does by rigging a test that patches the host kernel to not set the 'load CET' control,
>>> and see if the CPU still updates the guest CET fields on the VM exit.
>> OK, I'll do some tests to see what's happening, thanks!
>>>>> the vmcs12 VM_ENTRY_LOAD_CET_STATE should be passed through as is to vmcs02, so if the nested guest doesn't set this bit
>>>>> the entry/exit using vmcs02 will not touch the CET state, which is unusual but allowed by the spec I think - a nested hypervisor can opt for example to save/load
>>>>> this state manually or use msr load/store lists instead.
>>>> Right although the use case should be rare, will modify the code to check VM_ENTRY_LOAD_CET_STATE. Thanks!
>>>>> Regardless of this,
>>>>> if the guest didn't set VM_ENTRY_LOAD_CET_STATE, then vmcs12 guest fields should neither be loaded on VM entry (copied to vmcs02) nor updated on VM exit,
>>>>> (that is copied back to vmcs12) this is what is written in the VMX spec.
>>>> What's the VMX spec you're referring to here?
>>> SDM.
>>>
>>> In fact, now that I am thinking about this again, it should be OK to unconditionally copy the CET fields from vmcs12 to vmcs02, because as long as the
>>> VM_ENTRY_LOAD_CET_STATE is not set, the CPU should care about their values in the vmcs02.
> I noticed a typo. I meant that the CPU shouldn't care about their values in the vmcs02.
>
>>> And about the other way around, assuming that I made a mistake as I said above, then the other way around is indeed unconditional.
>>>
>>>
>>> Sorry for a bit of a confusion.
>> NP, I'll also double-check it with HW Arch and get back to you.
>> Thanks for raising these questions!

I got a reply from HW Arch confirming that the guest CET state is saved unconditionally:

"On the state save side, uCode doesn’t check for an exit control (or the load CET VM-entry control), but rather since it supports (as of TGL/SPR) CET,
  it unconditionally saves the state to the VMCS guest-state area. "
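
Given that, a minimal sketch (illustration only, mirroring the hunk in this patch) of the exit-side sync that can then stay unconditional apart from the CPUID-based guards, e.g. in sync_vmcs02_to_vmcs12_rare():

	/*
	 * Hardware saves L2's CET state on every VM exit regardless of the
	 * entry/exit controls, so always reflect it back into vmcs12.
	 */
	if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
		vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
		vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
	}
	if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
	    guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
		vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);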

> Thanks to you too!
>
>
> Best regards,
> 	Maxim Levitsky
>
>>> Best regards,
>>> 	Maxim Levitsky
>>>
>>>
>
Maxim Levitsky Dec. 12, 2023, 11:09 a.m. UTC | #9
On Tue, 2023-12-12 at 16:56 +0800, Yang, Weijiang wrote:
> 
> On 12/8/2023 11:22 PM, Maxim Levitsky wrote:
> > On Fri, 2023-12-08 at 23:15 +0800, Yang, Weijiang wrote:
> > > On 12/7/2023 1:24 AM, Maxim Levitsky wrote:
> > > > On Wed, 2023-12-06 at 17:22 +0800, Yang, Weijiang wrote:
> > > > > On 12/5/2023 6:12 PM, Maxim Levitsky wrote:
> > > > > > On Mon, 2023-12-04 at 16:50 +0800, Yang, Weijiang wrote:
> > > > > [...]
> > > > > 
> > > > > > > > >      	vmx->nested.force_msr_bitmap_recalc = false;
> > > > > > > > > @@ -2469,6 +2491,18 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
> > > > > > > > >      		if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
> > > > > > > > >      		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
> > > > > > > > >      			vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
> > > > > > > > > +
> > > > > > > > > +		if (vmx->nested.nested_run_pending &&
> > > > > > > > I don't think that nested.nested_run_pending check is needed.
> > > > > > > > prepare_vmcs02_rare is not going to be called unless the nested run is pending.
> > > > > > > But there're other paths along to call prepare_vmcs02_rare(), e.g., vmx_set_nested_state()-> nested_vmx_enter_non_root_mode()-> prepare_vmcs02_rare(), especially when L1 instead of L2 was running. In this case, nested.nested_run_pending == false,
> > > > > > > we don't need to update vmcs02's fields at the point until L2 is being resumed.
> > > > > > - If we restore VM from migration stream when L2 is *not running*, then prepare_vmcs02_rare won't be called,
> > > > > > because nested_vmx_enter_non_root_mode will not be called, because in turn there is no nested vmcs to load.
> > > > > > 
> > > > > > - If we restore VM from migration stream when L2 is *about to run* (KVM emulated the VMRESUME/VMLAUNCH,
> > > > > > but we didn't do the actual hardware VMLAUNCH/VMRESUME on vmcs02, then the 'nested_run_pending' will be true, it will be restored
> > > > > > from the migration stream.
> > > > > > 
> > > > > > - If we migrate while nested guest was run once but didn't VMEXIT to L1 yet, then yes, nested.nested_run_pending will be false indeed,
> > > > > > but we still need to setup vmcs02, otherwise it will be left with default zero values.
> > > > > Thanks a lot for recapping these cases! I overlooked some nested flags before. It makes sense to remove nested.nested_run_pending.
> > > > > > Remember that prior to setting nested state the VM wasn't running even once usually, unlike when the guest enters nested state normally.
> > > > > > 
> > > > > > > > > +		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE)) {
> > > > > > > > > +			if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
> > > > > > > > > +				vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
> > > > > > > > > +				vmcs_writel(GUEST_INTR_SSP_TABLE,
> > > > > > > > > +					    vmcs12->guest_ssp_tbl);
> > > > > > > > > +			}
> > > > > > > > > +			if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
> > > > > > > > > +			    guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
> > > > > > > > > +				vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);
> > > > > > > > > +		}
> > > > > > > > >      	}
> > > > > > > > >      
> > > > > > > > >      	if (nested_cpu_has_xsaves(vmcs12))
> > > > > > > > > @@ -4300,6 +4334,15 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
> > > > > > > > >      	vmcs12->guest_pending_dbg_exceptions =
> > > > > > > > >      		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
> > > > > > > > >      
> > > > > > > > > +	if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
> > > > > > > > > +		vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
> > > > > > > > > +		vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
> > > > > > > > > +	}
> > > > > > > > > +	if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
> > > > > > > > > +	    guest_can_use(&vmx->vcpu, X86_FEATURE_IBT)) {
> > > > > > > > > +		vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);
> > > > > > > > > +	}
> > > > > > > > The above code should be conditional on VM_ENTRY_LOAD_CET_STATE - if the guest (L2) state
> > > > > > > > was loaded, then it must be updated on exit - this is usually how VMX works.
> > > > > > > I think this is not for L2 VM_ENTRY_LOAD_CET_STATE, it happens in prepare_vmcs02_rare(). IIUC, the guest registers will be saved into VMCS fields unconditionally when vm-exit happens,
> > > > > > > so these fields for L2 guest should be synced to L1 unconditionally.
> > > > > > "the guest registers will be saved into VMCS fields unconditionally"
> > > > > > This is not true, unless there is a bug.
> > > > > I checked the latest SDM; there's no such wording regarding the CET entry/exit control bits. The wording comes from
> > > > > the standalone CET spec:
> > > > > "10.6 VM Exit
> > > > > On processors that support CET, the VM exit saves the state of IA32_S_CET, SSP and IA32_INTERRUPT_SSP_TABLE_ADDR MSR to the VMCS guest-state area unconditionally."
> > > > > But since it doesn't appear in SDM, I shouldn't take it for granted.
> > > > SDM spec from September 2023:
> > > > 
> > > > 28.3.1 Saving Control Registers, Debug Registers, and MSRs
> > > > 
> > > > "If the processor supports the 1-setting of the “load CET” VM-entry control, the contents of the IA32_S_CET and
> > > > IA32_INTERRUPT_SSP_TABLE_ADDR MSRs are saved into the corresponding fields. On processors that do not
> > > > support Intel 64 architecture, bits 63:32 of these MSRs are not saved."
> > > > 
> > > > Honestly it's not 100% clear if the “load CET” should be set to 1 to trigger the restore, or that this control just needs to be
> > > > supported on the CPU.
> > > > It does feel like you are right here, that CPU always saves the guest state, but allows to not load it on VM entry via
> > > > “load CET” VM entry control.
> > > > 
> > > > IMHO it's best to check what the bare metal does by rigging a test that patches the host kernel to not set the 'load CET' control,
> > > > and see if the CPU still updates the guest CET fields on the VM exit.
> > > OK, I'll do some tests to see what's happening, thanks!
> > > > > > the vmcs12 VM_ENTRY_LOAD_CET_STATE should be passed through as is to vmcs02, so if the nested guest doesn't set this bit
> > > > > > the entry/exit using vmcs02 will not touch the CET state, which is unusual but allowed by the spec I think - a nested hypervisor can opt for example to save/load
> > > > > > this state manually or use msr load/store lists instead.
> > > > > Right, although the use case should be rare, I will modify the code to check VM_ENTRY_LOAD_CET_STATE. Thanks!
> > > > > > Regardless of this,
> > > > > > if the guest didn't set VM_ENTRY_LOAD_CET_STATE, then vmcs12 guest fields should neither be loaded on VM entry (copied to vmcs02) nor updated on VM exit,
> > > > > > (that is copied back to vmcs12) this is what is written in the VMX spec.
> > > > > What's the VMX spec you're referring to here?
> > > > SDM.
> > > > 
> > > > In fact, now that I am thinking about this again, it should be OK to unconditionally copy the CET fields from vmcs12 to vmcs02, because as long as the
> > > > VM_ENTRY_LOAD_CET_STATE is not set, the CPU should care about their values in the vmcs02.
> > I noticed a typo. I meant that the CPU shouldn't care about their values in the vmcs02.
> > 
> > > > And about the other way around, assuming that I made a mistake as I said above, then the other way around is indeed unconditional.
> > > > 
> > > > 
> > > > Sorry for a bit of a confusion.
> > > NP, I'll also double-check it with HW Arch and get back to you.
> > > Thanks for raising these questions!
> 
> I got a reply from HW Arch confirming that the guest CET state is saved unconditionally:
> 
> "On the state save side, uCode doesn’t check for an exit control (or the load CET VM-entry control), but rather since it supports (as of TGL/SPR) CET,
>   it unconditionally saves the state to the VMCS guest-state area. "

Great!

Best regards,
	Maxim Levitsky

> 
> > Thanks to you too!
> > 
> > 
> > Best regards,
> > 	Maxim Levitsky
> > 
> > > > Best regards,
> > > > 	Maxim Levitsky
> > > > 
> > > >
diff mbox series

Patch

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index d8c32682ca76..965173650542 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -660,6 +660,28 @@  static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
 	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
 					 MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
 
+	/* Pass CET MSRs to nested VM if L0 and L1 are set to pass-through. */
+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+					 MSR_IA32_U_CET, MSR_TYPE_RW);
+
+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+					 MSR_IA32_S_CET, MSR_TYPE_RW);
+
+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+					 MSR_IA32_PL0_SSP, MSR_TYPE_RW);
+
+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+					 MSR_IA32_PL1_SSP, MSR_TYPE_RW);
+
+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+					 MSR_IA32_PL2_SSP, MSR_TYPE_RW);
+
+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+					 MSR_IA32_PL3_SSP, MSR_TYPE_RW);
+
+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+					 MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW);
+
 	kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
 
 	vmx->nested.force_msr_bitmap_recalc = false;
@@ -2469,6 +2491,18 @@  static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
 		if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
 		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
 			vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
+
+		if (vmx->nested.nested_run_pending &&
+		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE)) {
+			if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
+				vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
+				vmcs_writel(GUEST_INTR_SSP_TABLE,
+					    vmcs12->guest_ssp_tbl);
+			}
+			if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
+			    guest_can_use(&vmx->vcpu, X86_FEATURE_IBT))
+				vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);
+		}
 	}
 
 	if (nested_cpu_has_xsaves(vmcs12))
@@ -4300,6 +4334,15 @@  static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
 	vmcs12->guest_pending_dbg_exceptions =
 		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
 
+	if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK)) {
+		vmcs12->guest_ssp = vmcs_readl(GUEST_SSP);
+		vmcs12->guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
+	}
+	if (guest_can_use(&vmx->vcpu, X86_FEATURE_SHSTK) ||
+	    guest_can_use(&vmx->vcpu, X86_FEATURE_IBT)) {
+		vmcs12->guest_s_cet = vmcs_readl(GUEST_S_CET);
+	}
+
 	vmx->nested.need_sync_vmcs02_to_vmcs12_rare = false;
 }
 
@@ -6798,7 +6841,7 @@  static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
 		VM_EXIT_HOST_ADDR_SPACE_SIZE |
 #endif
 		VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
-		VM_EXIT_CLEAR_BNDCFGS;
+		VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_CET_STATE;
 	msrs->exit_ctls_high |=
 		VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
 		VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
@@ -6820,7 +6863,8 @@  static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
 #ifdef CONFIG_X86_64
 		VM_ENTRY_IA32E_MODE |
 #endif
-		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
+		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
+		VM_ENTRY_LOAD_CET_STATE;
 	msrs->entry_ctls_high |=
 		(VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
 		 VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
index 106a72c923ca..4233b5ca9461 100644
--- a/arch/x86/kvm/vmx/vmcs12.c
+++ b/arch/x86/kvm/vmx/vmcs12.c
@@ -139,6 +139,9 @@  const unsigned short vmcs12_field_offsets[] = {
 	FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
 	FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
 	FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
+	FIELD(GUEST_S_CET, guest_s_cet),
+	FIELD(GUEST_SSP, guest_ssp),
+	FIELD(GUEST_INTR_SSP_TABLE, guest_ssp_tbl),
 	FIELD(HOST_CR0, host_cr0),
 	FIELD(HOST_CR3, host_cr3),
 	FIELD(HOST_CR4, host_cr4),
@@ -151,5 +154,8 @@  const unsigned short vmcs12_field_offsets[] = {
 	FIELD(HOST_IA32_SYSENTER_EIP, host_ia32_sysenter_eip),
 	FIELD(HOST_RSP, host_rsp),
 	FIELD(HOST_RIP, host_rip),
+	FIELD(HOST_S_CET, host_s_cet),
+	FIELD(HOST_SSP, host_ssp),
+	FIELD(HOST_INTR_SSP_TABLE, host_ssp_tbl),
 };
 const unsigned int nr_vmcs12_fields = ARRAY_SIZE(vmcs12_field_offsets);
diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
index 01936013428b..3884489e7f7e 100644
--- a/arch/x86/kvm/vmx/vmcs12.h
+++ b/arch/x86/kvm/vmx/vmcs12.h
@@ -117,7 +117,13 @@  struct __packed vmcs12 {
 	natural_width host_ia32_sysenter_eip;
 	natural_width host_rsp;
 	natural_width host_rip;
-	natural_width paddingl[8]; /* room for future expansion */
+	natural_width host_s_cet;
+	natural_width host_ssp;
+	natural_width host_ssp_tbl;
+	natural_width guest_s_cet;
+	natural_width guest_ssp;
+	natural_width guest_ssp_tbl;
+	natural_width paddingl[2]; /* room for future expansion */
 	u32 pin_based_vm_exec_control;
 	u32 cpu_based_vm_exec_control;
 	u32 exception_bitmap;
@@ -292,6 +298,12 @@  static inline void vmx_check_vmcs12_offsets(void)
 	CHECK_OFFSET(host_ia32_sysenter_eip, 656);
 	CHECK_OFFSET(host_rsp, 664);
 	CHECK_OFFSET(host_rip, 672);
+	CHECK_OFFSET(host_s_cet, 680);
+	CHECK_OFFSET(host_ssp, 688);
+	CHECK_OFFSET(host_ssp_tbl, 696);
+	CHECK_OFFSET(guest_s_cet, 704);
+	CHECK_OFFSET(guest_ssp, 712);
+	CHECK_OFFSET(guest_ssp_tbl, 720);
 	CHECK_OFFSET(pin_based_vm_exec_control, 744);
 	CHECK_OFFSET(cpu_based_vm_exec_control, 748);
 	CHECK_OFFSET(exception_bitmap, 752);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a1aae8709939..947028ff2e25 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7734,6 +7734,8 @@  static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
 	cr4_fixed1_update(X86_CR4_PKE,        ecx, feature_bit(PKU));
 	cr4_fixed1_update(X86_CR4_UMIP,       ecx, feature_bit(UMIP));
 	cr4_fixed1_update(X86_CR4_LA57,       ecx, feature_bit(LA57));
+	cr4_fixed1_update(X86_CR4_CET,	      ecx, feature_bit(SHSTK));
+	cr4_fixed1_update(X86_CR4_CET,	      edx, feature_bit(IBT));
 
 #undef cr4_fixed1_update
 }