Message ID | 20220613161611.3567556-1-anrayabh@linux.microsoft.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | KVM: nVMX: Don't expose TSC scaling to L1 when on Hyper-V | expand |
On Mon, Jun 13, 2022, Anirudh Rayabharam wrote: > As per the comments in arch/x86/kvm/vmx/evmcs.h, TSC multiplier field is It's not just the comments, it's also the code. It would be helpful to call out in the changelog that KVM clears unsupported controls via evmcs_sanitize_exec_ctrls() when using eVMCS. > currently not supported in EVMCS. As a result, there is no TSC scaling > support when KVM is running on Hyper-V i.e. kvm_has_tsc_control is > false. > > However, in nested_vmx_setup_ctls_msrs(), TSC scaling is exposed to L1. > When L1 tries to launch an L2 guest, vmcs12 has TSC scaling enabled. > This propagates to vmcs02. But KVM doesn't set the TSC multiplier value > because kvm_has_tsc_control is false. Due to this, VM entry for L2 guest > fails. (VM entry fails if "use TSC scaling" is 1 and TSC multiplier is 0.) > > To fix, expose TSC scaling to L1 only if kvm_has_tsc_control. > > Fixes: d041b5ea93352 ("KVM: nVMX: Enable nested TSC scaling") > Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com> > --- > arch/x86/kvm/vmx/nested.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c > index f5cb18e00e78..d773ddc6422b 100644 > --- a/arch/x86/kvm/vmx/nested.c > +++ b/arch/x86/kvm/vmx/nested.c > @@ -6656,6 +6656,9 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) > msrs->secondary_ctls_low, > msrs->secondary_ctls_high); > > + if (!kvm_has_tsc_control) > + msrs->secondary_ctls_high &= ~SECONDARY_EXEC_TSC_SCALING; I would much rather we fix the root of the problem and not play whack-a-mole, e.g. all of the other controls that aren't supported by eVMCS have the same bug,. nested_vmx_setup_ctls_msrs() should use vmcs_config to get the base MSR values, not read the MSRs from hardware. And it's not just eVMCS, e.g. the manipulation of VM_{ENTRY,EXIT}_IA32_PERF_GLOBAL_CTRL for a CPU errata isn't handled either. > msrs->secondary_ctls_low = 0; > msrs->secondary_ctls_high &= > SECONDARY_EXEC_DESC | > @@ -6667,8 +6670,7 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) > SECONDARY_EXEC_RDRAND_EXITING | > SECONDARY_EXEC_ENABLE_INVPCID | > SECONDARY_EXEC_RDSEED_EXITING | > - SECONDARY_EXEC_XSAVES | > - SECONDARY_EXEC_TSC_SCALING; > + SECONDARY_EXEC_XSAVES; > > /* > * We can emulate "VMCS shadowing," even if the hardware > -- > 2.34.1 >
On 6/13/22 18:16, Anirudh Rayabharam wrote: > + if (!kvm_has_tsc_control) > + msrs->secondary_ctls_high &= ~SECONDARY_EXEC_TSC_SCALING; > + > msrs->secondary_ctls_low = 0; > msrs->secondary_ctls_high &= > SECONDARY_EXEC_DESC | > @@ -6667,8 +6670,7 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) > SECONDARY_EXEC_RDRAND_EXITING | > SECONDARY_EXEC_ENABLE_INVPCID | > SECONDARY_EXEC_RDSEED_EXITING | > - SECONDARY_EXEC_XSAVES | > - SECONDARY_EXEC_TSC_SCALING; > + SECONDARY_EXEC_XSAVES; > > /* This is wrong because it _always_ disables SECONDARY_EXEC_TSC_SCALING, even if kvm_has_tsc_control == true. That said, I think a better implementation of this patch is to just add a version of evmcs_sanitize_exec_ctrls that takes a struct nested_vmx_msrs *, and call it at the end of nested_vmx_setup_ctl_msrs like evmcs_sanitize_nested_vmx_vsrs(msrs); Even better (but I cannot "mentally test it" offhand) would be just diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index e802f71a9e8d..b3425ce835c5 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -1862,7 +1862,7 @@ int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) * sanity checking and refuse to boot. Filter all unsupported * features out. */ - if (!msr_info->host_initiated && + if (static_branch_unlikely(&enable_evmcs) || vmx->nested.enlightened_vmcs_enabled) nested_evmcs_filter_control_msr(msr_info->index, &msr_info->data); I cannot quite understand the host_initiated check, so I'll defer to Vitaly on why it is needed. Most likely, removing it would cause some warnings in QEMU with e.g. "-cpu Haswell,+vmx"; but I think it's a userspace bug and we should remove that part of the condition. You don't need to worry about that part, we'll cross that bridge if the above patch works for your case. Thanks, Paolo
On Mon, Jun 13, 2022, Paolo Bonzini wrote: > On 6/13/22 18:16, Anirudh Rayabharam wrote: > > + if (!kvm_has_tsc_control) > > + msrs->secondary_ctls_high &= ~SECONDARY_EXEC_TSC_SCALING; > > + > > msrs->secondary_ctls_low = 0; > > msrs->secondary_ctls_high &= > > SECONDARY_EXEC_DESC | > > @@ -6667,8 +6670,7 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) > > SECONDARY_EXEC_RDRAND_EXITING | > > SECONDARY_EXEC_ENABLE_INVPCID | > > SECONDARY_EXEC_RDSEED_EXITING | > > - SECONDARY_EXEC_XSAVES | > > - SECONDARY_EXEC_TSC_SCALING; > > + SECONDARY_EXEC_XSAVES; > > /* > > This is wrong because it _always_ disables SECONDARY_EXEC_TSC_SCALING, > even if kvm_has_tsc_control == true. > > That said, I think a better implementation of this patch is to just add > a version of evmcs_sanitize_exec_ctrls that takes a struct > nested_vmx_msrs *, and call it at the end of nested_vmx_setup_ctl_msrs like > > evmcs_sanitize_nested_vmx_vsrs(msrs); Any reason not to use the already sanitized vmcs_config? I can't think of any reason why the nested path should blindly use the raw MSR values from hardware.
On Mon, Jun 13, 2022 at 06:49:17PM +0200, Paolo Bonzini wrote: > On 6/13/22 18:16, Anirudh Rayabharam wrote: > > + if (!kvm_has_tsc_control) > > + msrs->secondary_ctls_high &= ~SECONDARY_EXEC_TSC_SCALING; > > + > > msrs->secondary_ctls_low = 0; > > msrs->secondary_ctls_high &= > > SECONDARY_EXEC_DESC | > > @@ -6667,8 +6670,7 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) > > SECONDARY_EXEC_RDRAND_EXITING | > > SECONDARY_EXEC_ENABLE_INVPCID | > > SECONDARY_EXEC_RDSEED_EXITING | > > - SECONDARY_EXEC_XSAVES | > > - SECONDARY_EXEC_TSC_SCALING; > > + SECONDARY_EXEC_XSAVES; > > /* > > This is wrong because it _always_ disables SECONDARY_EXEC_TSC_SCALING, > even if kvm_has_tsc_control == true. The MSR actually allows 1-setting of the "use TSC scaling" control. So this line is redundant anyway. > > That said, I think a better implementation of this patch is to just add > a version of evmcs_sanitize_exec_ctrls that takes a struct > nested_vmx_msrs *, and call it at the end of nested_vmx_setup_ctl_msrs like > > evmcs_sanitize_nested_vmx_vsrs(msrs); Sanitize at the end might not work because I see some cases in nested_vmx_setup_ctls_msrs() where we want to expose some things to L1 even though the hardware doesn't support it. > > Even better (but I cannot "mentally test it" offhand) would be just > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > index e802f71a9e8d..b3425ce835c5 100644 > --- a/arch/x86/kvm/vmx/vmx.c > +++ b/arch/x86/kvm/vmx/vmx.c > @@ -1862,7 +1862,7 @@ int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) > * sanity checking and refuse to boot. Filter all unsupported > * features out. > */ > - if (!msr_info->host_initiated && > + if (static_branch_unlikely(&enable_evmcs) || > vmx->nested.enlightened_vmcs_enabled) > nested_evmcs_filter_control_msr(msr_info->index, > &msr_info->data); I will try this. Thanks, Anirudh. > > I cannot quite understand the host_initiated check, so I'll defer to > Vitaly on why it is needed. Most likely, removing it would cause some > warnings in QEMU with e.g. "-cpu Haswell,+vmx"; but I think it's a > userspace bug and we should remove that part of the condition. You > don't need to worry about that part, we'll cross that bridge if the > above patch works for your case. > > Thanks, > > Paolo
Paolo Bonzini <pbonzini@redhat.com> writes: > On 6/13/22 18:16, Anirudh Rayabharam wrote: >> + if (!kvm_has_tsc_control) >> + msrs->secondary_ctls_high &= ~SECONDARY_EXEC_TSC_SCALING; >> + >> msrs->secondary_ctls_low = 0; >> msrs->secondary_ctls_high &= >> SECONDARY_EXEC_DESC | >> @@ -6667,8 +6670,7 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) >> SECONDARY_EXEC_RDRAND_EXITING | >> SECONDARY_EXEC_ENABLE_INVPCID | >> SECONDARY_EXEC_RDSEED_EXITING | >> - SECONDARY_EXEC_XSAVES | >> - SECONDARY_EXEC_TSC_SCALING; >> + SECONDARY_EXEC_XSAVES; >> >> /* > > This is wrong because it _always_ disables SECONDARY_EXEC_TSC_SCALING, > even if kvm_has_tsc_control == true. > > That said, I think a better implementation of this patch is to just add > a version of evmcs_sanitize_exec_ctrls that takes a struct > nested_vmx_msrs *, and call it at the end of nested_vmx_setup_ctl_msrs like > > evmcs_sanitize_nested_vmx_vsrs(msrs); > > Even better (but I cannot "mentally test it" offhand) would be just > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > index e802f71a9e8d..b3425ce835c5 100644 > --- a/arch/x86/kvm/vmx/vmx.c > +++ b/arch/x86/kvm/vmx/vmx.c > @@ -1862,7 +1862,7 @@ int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) > * sanity checking and refuse to boot. Filter all unsupported > * features out. > */ > - if (!msr_info->host_initiated && > + if (static_branch_unlikely(&enable_evmcs) || > vmx->nested.enlightened_vmcs_enabled) > nested_evmcs_filter_control_msr(msr_info->index, > &msr_info->data); > > I cannot quite understand the host_initiated check, so I'll defer to > Vitaly on why it is needed. Most likely, removing it would cause some > warnings in QEMU with e.g. "-cpu Haswell,+vmx"; but I think it's a > userspace bug and we should remove that part of the condition. I forgot the details, of course, but 31de3d2500e4 says: ``` With fine grained VMX feature enablement QEMU>=4.2 tries to do KVM_SET_MSRS with default (matching CPU model) values and in case eVMCS is also enabled, fails. ``` so it certainly was a workaround for QEMU.
On 6/14/22 06:55, Anirudh Rayabharam wrote: >> That said, I think a better implementation of this patch is to just add >> a version of evmcs_sanitize_exec_ctrls that takes a struct >> nested_vmx_msrs *, and call it at the end of nested_vmx_setup_ctl_msrs like >> >> evmcs_sanitize_nested_vmx_vsrs(msrs); > Sanitize at the end might not work because I see some cases in > nested_vmx_setup_ctls_msrs() where we want to expose some things to L1 > even though the hardware doesn't support it. > Yes, but these will never include eVMCS-unsupported features. Paolo
Anirudh Rayabharam <anrayabh@linux.microsoft.com> writes: ... > > As per the comments in arch/x86/kvm/vmx/evmcs.h, TSC multiplier field is > currently not supported in EVMCS. The latest version: https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/datatypes/hv_vmx_enlightened_vmcs has it, actually. It was missing before (compare with e.g. 6.0b version here: https://github.com/MicrosoftDocs/Virtualization-Documentation/raw/live/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v6.0b.pdf) but AFAIR TSC scaling wasn't advertised by genuine Hyper-V either. Interestingly enough, eVMCS version didn't change when these fields were added, it is still '1'. I even have a patch in my stash (attached). I didn't send it out because it wasn't properly tested with different Hyper-V versions.
Vitaly Kuznetsov <vkuznets@redhat.com> writes: > Anirudh Rayabharam <anrayabh@linux.microsoft.com> writes: > > ... > >> >> As per the comments in arch/x86/kvm/vmx/evmcs.h, TSC multiplier field is >> currently not supported in EVMCS. > > The latest version: > https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/datatypes/hv_vmx_enlightened_vmcs > > has it, actually. It was missing before (compare with e.g. 6.0b version > here: > https://github.com/MicrosoftDocs/Virtualization-Documentation/raw/live/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v6.0b.pdf) > > but AFAIR TSC scaling wasn't advertised by genuine Hyper-V either. > Interestingly enough, eVMCS version didn't change when these fields were > added, it is still '1'. > > I even have a patch in my stash (attached). I didn't send it out because > it wasn't properly tested with different Hyper-V versions. And of course I forgot a pre-requisite patch which updates 'struct hv_enlightened_vmcs' to the latest: diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h index 0a9407dc0859..038e5ef9b4a6 100644 --- a/arch/x86/include/asm/hyperv-tlfs.h +++ b/arch/x86/include/asm/hyperv-tlfs.h @@ -559,9 +559,20 @@ struct hv_enlightened_vmcs { u64 partition_assist_page; u64 padding64_4[4]; u64 guest_bndcfgs; - u64 padding64_5[7]; + u64 guest_ia32_perf_global_ctrl; + u64 guest_ia32_s_cet; + u64 guest_ssp; + u64 guest_ia32_int_ssp_table_addr; + u64 guest_ia32_lbr_ctl; + u64 padding64_5[2]; u64 xss_exit_bitmap; - u64 padding64_6[7]; + u64 host_ia32_perf_global_ctrl; + u64 encls_exiting_bitmap; + u64 tsc_multiplier; + u64 host_ia32_s_cet; + u64 host_ssp; + u64 host_ia32_int_ssp_table_addr; + u64 padding64_6; } __packed; #define HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE 0
On Tue, Jun 14, 2022 at 02:16:00PM +0200, Paolo Bonzini wrote: > On 6/14/22 06:55, Anirudh Rayabharam wrote: > > > That said, I think a better implementation of this patch is to just add > > > a version of evmcs_sanitize_exec_ctrls that takes a struct > > > nested_vmx_msrs *, and call it at the end of nested_vmx_setup_ctl_msrs like > > > > > > evmcs_sanitize_nested_vmx_vsrs(msrs); > > Sanitize at the end might not work because I see some cases in > > nested_vmx_setup_ctls_msrs() where we want to expose some things to L1 > > even though the hardware doesn't support it. > > > > Yes, but these will never include eVMCS-unsupported features. How are you so sure? For example, SECONDARY_EXEC_SHADOW_VMCS is unsupported in eVMCS but in nested_vmx_setup_ctls_msrs() we do: 6675 /* 6676 * We can emulate "VMCS shadowing," even if the hardware 6677 * doesn't support it. 6678 */ 6679 msrs->secondary_ctls_high |= 6680 SECONDARY_EXEC_SHADOW_VMCS; If we sanitize this out it might cause some regression right? Thanks! Anirudh. > > Paolo
On Tue, Jun 14, 2022 at 10:25:52AM +0530, Anirudh Rayabharam wrote: > On Mon, Jun 13, 2022 at 06:49:17PM +0200, Paolo Bonzini wrote: > > On 6/13/22 18:16, Anirudh Rayabharam wrote: > > > + if (!kvm_has_tsc_control) > > > + msrs->secondary_ctls_high &= ~SECONDARY_EXEC_TSC_SCALING; > > > + > > > msrs->secondary_ctls_low = 0; > > > msrs->secondary_ctls_high &= > > > SECONDARY_EXEC_DESC | > > > @@ -6667,8 +6670,7 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) > > > SECONDARY_EXEC_RDRAND_EXITING | > > > SECONDARY_EXEC_ENABLE_INVPCID | > > > SECONDARY_EXEC_RDSEED_EXITING | > > > - SECONDARY_EXEC_XSAVES | > > > - SECONDARY_EXEC_TSC_SCALING; > > > + SECONDARY_EXEC_XSAVES; > > > /* > > > > This is wrong because it _always_ disables SECONDARY_EXEC_TSC_SCALING, > > even if kvm_has_tsc_control == true. > > The MSR actually allows 1-setting of the "use TSC scaling" control. So this > line is redundant anyway. > > > > > That said, I think a better implementation of this patch is to just add > > a version of evmcs_sanitize_exec_ctrls that takes a struct > > nested_vmx_msrs *, and call it at the end of nested_vmx_setup_ctl_msrs like > > > > evmcs_sanitize_nested_vmx_vsrs(msrs); > > Sanitize at the end might not work because I see some cases in > nested_vmx_setup_ctls_msrs() where we want to expose some things to L1 > even though the hardware doesn't support it. > > > > > Even better (but I cannot "mentally test it" offhand) would be just > > > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > > index e802f71a9e8d..b3425ce835c5 100644 > > --- a/arch/x86/kvm/vmx/vmx.c > > +++ b/arch/x86/kvm/vmx/vmx.c > > @@ -1862,7 +1862,7 @@ int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) > > * sanity checking and refuse to boot. Filter all unsupported > > * features out. > > */ > > - if (!msr_info->host_initiated && > > + if (static_branch_unlikely(&enable_evmcs) || > > vmx->nested.enlightened_vmcs_enabled) > > nested_evmcs_filter_control_msr(msr_info->index, > > &msr_info->data); > > I will try this. This patch fixed the issue for me. But again, this might filter out things that we wan't to expose to L1 even if not available in hardware. Thanks, Anirudh. > > Thanks, > > Anirudh. > > > > > I cannot quite understand the host_initiated check, so I'll defer to > > Vitaly on why it is needed. Most likely, removing it would cause some > > warnings in QEMU with e.g. "-cpu Haswell,+vmx"; but I think it's a > > userspace bug and we should remove that part of the condition. You > > don't need to worry about that part, we'll cross that bridge if the > > above patch works for your case. > > > > Thanks, > > > > Paolo
On Mon, Jun 13, 2022 at 04:57:49PM +0000, Sean Christopherson wrote: > On Mon, Jun 13, 2022, Paolo Bonzini wrote: > > On 6/13/22 18:16, Anirudh Rayabharam wrote: > > > + if (!kvm_has_tsc_control) > > > + msrs->secondary_ctls_high &= ~SECONDARY_EXEC_TSC_SCALING; > > > + > > > msrs->secondary_ctls_low = 0; > > > msrs->secondary_ctls_high &= > > > SECONDARY_EXEC_DESC | > > > @@ -6667,8 +6670,7 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) > > > SECONDARY_EXEC_RDRAND_EXITING | > > > SECONDARY_EXEC_ENABLE_INVPCID | > > > SECONDARY_EXEC_RDSEED_EXITING | > > > - SECONDARY_EXEC_XSAVES | > > > - SECONDARY_EXEC_TSC_SCALING; > > > + SECONDARY_EXEC_XSAVES; > > > /* > > > > This is wrong because it _always_ disables SECONDARY_EXEC_TSC_SCALING, > > even if kvm_has_tsc_control == true. > > > > That said, I think a better implementation of this patch is to just add > > a version of evmcs_sanitize_exec_ctrls that takes a struct > > nested_vmx_msrs *, and call it at the end of nested_vmx_setup_ctl_msrs like > > > > evmcs_sanitize_nested_vmx_vsrs(msrs); > > Any reason not to use the already sanitized vmcs_config? I can't think of any > reason why the nested path should blindly use the raw MSR values from hardware. vmcs_config has the sanitized exec controls. But how do we construct MSR values using them? Thanks, Anirudh.
On Tue, Jun 14, 2022, Anirudh Rayabharam wrote: > On Mon, Jun 13, 2022 at 04:57:49PM +0000, Sean Christopherson wrote: > > On Mon, Jun 13, 2022, Paolo Bonzini wrote: > > > On 6/13/22 18:16, Anirudh Rayabharam wrote: > > > > + if (!kvm_has_tsc_control) > > > > + msrs->secondary_ctls_high &= ~SECONDARY_EXEC_TSC_SCALING; > > > > + > > > > msrs->secondary_ctls_low = 0; > > > > msrs->secondary_ctls_high &= > > > > SECONDARY_EXEC_DESC | > > > > @@ -6667,8 +6670,7 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) > > > > SECONDARY_EXEC_RDRAND_EXITING | > > > > SECONDARY_EXEC_ENABLE_INVPCID | > > > > SECONDARY_EXEC_RDSEED_EXITING | > > > > - SECONDARY_EXEC_XSAVES | > > > > - SECONDARY_EXEC_TSC_SCALING; > > > > + SECONDARY_EXEC_XSAVES; > > > > /* > > > > > > This is wrong because it _always_ disables SECONDARY_EXEC_TSC_SCALING, > > > even if kvm_has_tsc_control == true. > > > > > > That said, I think a better implementation of this patch is to just add > > > a version of evmcs_sanitize_exec_ctrls that takes a struct > > > nested_vmx_msrs *, and call it at the end of nested_vmx_setup_ctl_msrs like > > > > > > evmcs_sanitize_nested_vmx_vsrs(msrs); > > > > Any reason not to use the already sanitized vmcs_config? I can't think of any > > reason why the nested path should blindly use the raw MSR values from hardware. > > vmcs_config has the sanitized exec controls. But how do we construct MSR > values using them? I was thinking we could use the sanitized controls for the allowed-1 bits, and then take the required-1 bits from the CPU. And then if we wanted to avoid the redundant RDMSRs in a follow-up patch we could add required-1 fields to vmcs_config. Hastily constructed and compile-tested only, proceed with caution :-) --- arch/x86/kvm/vmx/nested.c | 35 ++++++++++++++++++++--------------- arch/x86/kvm/vmx/nested.h | 2 +- arch/x86/kvm/vmx/vmx.c | 5 ++--- 3 files changed, 23 insertions(+), 19 deletions(-) diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index f5cb18e00e78..67cbb6643efa 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -6541,8 +6541,13 @@ static u64 nested_vmx_calc_vmcs_enum_msr(void) * bit in the high half is on if the corresponding bit in the control field * may be on. See also vmx_control_verify(). */ -void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) +void nested_vmx_setup_ctls_msrs(struct vmcs_config *vmcs_conf, u32 ept_caps) { + struct nested_vmx_msrs *msrs = &vmcs_config.nested; + + /* Take the allowed-1 bits from KVM's sanitized VMCS configuration. */ + u32 ignore_high; + /* * Note that as a general rule, the high half of the MSRs (bits in * the control fields which may be 1) should be initialized by the @@ -6559,11 +6564,11 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) */ /* pin-based controls */ - rdmsr(MSR_IA32_VMX_PINBASED_CTLS, - msrs->pinbased_ctls_low, - msrs->pinbased_ctls_high); + rdmsr(MSR_IA32_VMX_PINBASED_CTLS, msrs->pinbased_ctls_low, ignore_high); msrs->pinbased_ctls_low |= PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR; + + msrs->pinbased_ctls_high = vmcs_conf->pin_based_exec_ctrl; msrs->pinbased_ctls_high &= PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING | @@ -6574,12 +6579,11 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) PIN_BASED_VMX_PREEMPTION_TIMER; /* exit controls */ - rdmsr(MSR_IA32_VMX_EXIT_CTLS, - msrs->exit_ctls_low, - msrs->exit_ctls_high); + rdmsr(MSR_IA32_VMX_EXIT_CTLS, msrs->exit_ctls_low, ignore_high); msrs->exit_ctls_low = VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR; + msrs->exit_ctls_high = vmcs_conf->vmexit_ctrl; msrs->exit_ctls_high &= #ifdef CONFIG_X86_64 VM_EXIT_HOST_ADDR_SPACE_SIZE | @@ -6595,11 +6599,11 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) msrs->exit_ctls_low &= ~VM_EXIT_SAVE_DEBUG_CONTROLS; /* entry controls */ - rdmsr(MSR_IA32_VMX_ENTRY_CTLS, - msrs->entry_ctls_low, - msrs->entry_ctls_high); + rdmsr(MSR_IA32_VMX_ENTRY_CTLS, msrs->entry_ctls_low, ignore_high); msrs->entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR; + + msrs->entry_ctls_high = vmcs_conf->vmentry_ctrl; msrs->entry_ctls_high &= #ifdef CONFIG_X86_64 VM_ENTRY_IA32E_MODE | @@ -6613,11 +6617,11 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) msrs->entry_ctls_low &= ~VM_ENTRY_LOAD_DEBUG_CONTROLS; /* cpu-based controls */ - rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, - msrs->procbased_ctls_low, - msrs->procbased_ctls_high); + rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, msrs->procbased_ctls_low, ignore_high); msrs->procbased_ctls_low = CPU_BASED_ALWAYSON_WITHOUT_TRUE_MSR; + + msrs->procbased_ctls_high = vmcs_conf->cpu_based_exec_ctrl; msrs->procbased_ctls_high &= CPU_BASED_INTR_WINDOW_EXITING | CPU_BASED_NMI_WINDOW_EXITING | CPU_BASED_USE_TSC_OFFSETTING | @@ -6653,10 +6657,11 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) */ if (msrs->procbased_ctls_high & CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) rdmsr(MSR_IA32_VMX_PROCBASED_CTLS2, - msrs->secondary_ctls_low, - msrs->secondary_ctls_high); + msrs->secondary_ctls_low, ignore_high); msrs->secondary_ctls_low = 0; + + msrs->secondary_ctls_high = vmcs_conf->cpu_based_2nd_exec_ctrl; msrs->secondary_ctls_high &= SECONDARY_EXEC_DESC | SECONDARY_EXEC_ENABLE_RDTSCP | diff --git a/arch/x86/kvm/vmx/nested.h b/arch/x86/kvm/vmx/nested.h index c92cea0b8ccc..fae047c6204b 100644 --- a/arch/x86/kvm/vmx/nested.h +++ b/arch/x86/kvm/vmx/nested.h @@ -17,7 +17,7 @@ enum nvmx_vmentry_status { }; void vmx_leave_nested(struct kvm_vcpu *vcpu); -void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps); +void nested_vmx_setup_ctls_msrs(struct vmcs_config *vmcs_conf, u32 ept_caps); void nested_vmx_hardware_unsetup(void); __init int nested_vmx_hardware_setup(int (*exit_handlers[])(struct kvm_vcpu *)); void nested_vmx_set_vmcs_shadowing_bitmap(void); diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 9bd86ecccdab..cd0d0ffae0bf 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7139,7 +7139,7 @@ static int __init vmx_check_processor_compat(void) if (setup_vmcs_config(&vmcs_conf, &vmx_cap) < 0) return -EIO; if (nested) - nested_vmx_setup_ctls_msrs(&vmcs_conf.nested, vmx_cap.ept); + nested_vmx_setup_ctls_msrs(&vmcs_conf, vmx_cap.ept); if (memcmp(&vmcs_config, &vmcs_conf, sizeof(struct vmcs_config)) != 0) { printk(KERN_ERR "kvm: CPU %d feature inconsistency!\n", smp_processor_id()); @@ -8079,8 +8079,7 @@ static __init int hardware_setup(void) setup_default_sgx_lepubkeyhash(); if (nested) { - nested_vmx_setup_ctls_msrs(&vmcs_config.nested, - vmx_capability.ept); + nested_vmx_setup_ctls_msrs(&vmcs_config, vmx_capability.ept); r = nested_vmx_hardware_setup(kvm_vmx_exit_handlers); if (r) base-commit: b821e4ff9e35a8fc999685e8d44c0644cfeaa228 --
On 6/14/22 14:19, Vitaly Kuznetsov wrote: > The latest version: > https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/datatypes/hv_vmx_enlightened_vmcs > > has it, actually. It was missing before (compare with e.g. 6.0b version > here: > https://github.com/MicrosoftDocs/Virtualization-Documentation/raw/live/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v6.0b.pdf) > > but AFAIR TSC scaling wasn't advertised by genuine Hyper-V either. > Interestingly enough, eVMCS version didn't change when these fields were > added, it is still '1'. > > I even have a patch in my stash (attached). I didn't send it out because > it wasn't properly tested with different Hyper-V versions. > > -- Vitaly Anirudh, can you check if Vitaly's patches work for you? Paolo
On 6/14/22 17:13, Anirudh Rayabharam wrote: >>> Sanitize at the end might not work because I see some cases in >>> nested_vmx_setup_ctls_msrs() where we want to expose some things to L1 >>> even though the hardware doesn't support it. >> >> Yes, but these will never include eVMCS-unsupported features. > > How are you so sure? > > For example, SECONDARY_EXEC_SHADOW_VMCS is unsupported in eVMCS but in > nested_vmx_setup_ctls_msrs() we do: > > 6675 /* > 6676 * We can emulate "VMCS shadowing," even if the hardware > 6677 * doesn't support it. > 6678 */ > 6679 msrs->secondary_ctls_high |= > 6680 SECONDARY_EXEC_SHADOW_VMCS; > > If we sanitize this out it might cause some regression right? Yes, you're right, shadow VMCS is special: it is not supported by enlightened VMCS, but it is emulated rather than virtualized. Therefore, if L1 does not use the enlightened VMCS, it can indeed use shadow VMCS. Paolo
On Tue, Jun 14, 2022 at 07:20:34PM +0200, Paolo Bonzini wrote: > On 6/14/22 14:19, Vitaly Kuznetsov wrote: > > The latest version: > > https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/datatypes/hv_vmx_enlightened_vmcs > > > > has it, actually. It was missing before (compare with e.g. 6.0b version > > here: > > https://github.com/MicrosoftDocs/Virtualization-Documentation/raw/live/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v6.0b.pdf) > > > > but AFAIR TSC scaling wasn't advertised by genuine Hyper-V either. > > Interestingly enough, eVMCS version didn't change when these fields were > > added, it is still '1'. > > > > I even have a patch in my stash (attached). I didn't send it out because > > it wasn't properly tested with different Hyper-V versions. > > > > -- Vitaly > > Anirudh, can you check if Vitaly's patches work for you? I will check it. But I wonder if they fit the criteria for inclusion in stable trees... It is important for the fix to land in the stable trees since this issue is a regression that was introduced _after_ 5.13. (I probably should've mentioned this in the changelog.) Thanks, Anirudh. > > Paolo
Anirudh Rayabharam <anrayabh@linux.microsoft.com> writes: > On Tue, Jun 14, 2022 at 07:20:34PM +0200, Paolo Bonzini wrote: >> On 6/14/22 14:19, Vitaly Kuznetsov wrote: >> > The latest version: >> > https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/datatypes/hv_vmx_enlightened_vmcs >> > >> > has it, actually. It was missing before (compare with e.g. 6.0b version >> > here: >> > https://github.com/MicrosoftDocs/Virtualization-Documentation/raw/live/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v6.0b.pdf) >> > >> > but AFAIR TSC scaling wasn't advertised by genuine Hyper-V either. >> > Interestingly enough, eVMCS version didn't change when these fields were >> > added, it is still '1'. >> > >> > I even have a patch in my stash (attached). I didn't send it out because >> > it wasn't properly tested with different Hyper-V versions. >> > >> > -- Vitaly >> >> Anirudh, can you check if Vitaly's patches work for you? > > I will check it. But I wonder if they fit the criteria for inclusion in > stable trees... > > It is important for the fix to land in the stable trees since this issue > is a regression that was introduced _after_ 5.13. (I probably should've > mentioned this in the changelog.) > Personally, I see no problem with splitting off TscMultiplier part from my patch and marking it for stable@ fixing d041b5ea93352. I'm going to run some tests too.
Vitaly Kuznetsov <vkuznets@redhat.com> writes: > Vitaly Kuznetsov <vkuznets@redhat.com> writes: > >> Anirudh Rayabharam <anrayabh@linux.microsoft.com> writes: >> >> ... >> >>> >>> As per the comments in arch/x86/kvm/vmx/evmcs.h, TSC multiplier field is >>> currently not supported in EVMCS. >> >> The latest version: >> https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/datatypes/hv_vmx_enlightened_vmcs >> >> has it, actually. It was missing before (compare with e.g. 6.0b version >> here: >> https://github.com/MicrosoftDocs/Virtualization-Documentation/raw/live/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v6.0b.pdf) >> >> but AFAIR TSC scaling wasn't advertised by genuine Hyper-V either. >> Interestingly enough, eVMCS version didn't change when these fields were >> added, it is still '1'. >> >> I even have a patch in my stash (attached). I didn't send it out because >> it wasn't properly tested with different Hyper-V versions. > > And of course I forgot a pre-requisite patch which updates 'struct > hv_enlightened_vmcs' to the latest: > The good news is that TscMultiplies seems to work fine for me, at least with an Azure Dv5 instance where I can see Tsc scaling exposed. The bad news is that a few more patches are needed: 1) Fix 'struct hv_enlightened_vmcs' definition: https://lore.kernel.org/kvm/20220613133922.2875594-20-vkuznets@redhat.com/ 2) Define VMCS-to-EVMCS conversion for the new fields : diff --git a/arch/x86/kvm/vmx/evmcs.c b/arch/x86/kvm/vmx/evmcs.c index 6a61b1ae7942..707a8de11802 100644 --- a/arch/x86/kvm/vmx/evmcs.c +++ b/arch/x86/kvm/vmx/evmcs.c @@ -28,6 +28,8 @@ const struct evmcs_field vmcs_field_to_evmcs_1[] = { HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1), EVMCS1_FIELD(HOST_IA32_EFER, host_ia32_efer, HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1), + EVMCS1_FIELD(HOST_IA32_PERF_GLOBAL_CTRL, host_ia32_perf_global_ctrl, + HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1), EVMCS1_FIELD(HOST_CR0, host_cr0, HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1), EVMCS1_FIELD(HOST_CR3, host_cr3, @@ -78,6 +80,8 @@ const struct evmcs_field vmcs_field_to_evmcs_1[] = { HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1), EVMCS1_FIELD(GUEST_IA32_EFER, guest_ia32_efer, HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1), + EVMCS1_FIELD(GUEST_IA32_PERF_GLOBAL_CTRL, guest_ia32_perf_global_ctrl, + HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1), EVMCS1_FIELD(GUEST_PDPTR0, guest_pdptr0, HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1), EVMCS1_FIELD(GUEST_PDPTR1, guest_pdptr1, @@ -126,24 +130,47 @@ const struct evmcs_field vmcs_field_to_evmcs_1[] = { HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1), EVMCS1_FIELD(XSS_EXIT_BITMAP, xss_exit_bitmap, HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP2), + EVMCS1_FIELD(ENCLS_EXITING_BITMAP, encls_exiting_bitmap, + HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP2), + EVMCS1_FIELD(TSC_MULTIPLIER, tsc_multiplier, + HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP2), ... so it is becoming more and more complicated to assemble for testing. Let me finish my testing and I'll put a series together and send it out to simplify the process. Stay tuned!
Sean Christopherson <seanjc@google.com> writes: > On Tue, Jun 14, 2022, Anirudh Rayabharam wrote: >> On Mon, Jun 13, 2022 at 04:57:49PM +0000, Sean Christopherson wrote: ... >> > >> > Any reason not to use the already sanitized vmcs_config? I can't think of any >> > reason why the nested path should blindly use the raw MSR values from hardware. >> >> vmcs_config has the sanitized exec controls. But how do we construct MSR >> values using them? > > I was thinking we could use the sanitized controls for the allowed-1 bits, and then > take the required-1 bits from the CPU. And then if we wanted to avoid the redundant > RDMSRs in a follow-up patch we could add required-1 fields to vmcs_config. > > Hastily constructed and compile-tested only, proceed with caution :-) Independently from "[PATCH 00/11] KVM: VMX: Support TscScaling and EnclsExitingBitmap whith eVMCS" which is supposed to fix the particular TSC scaling issue, I like the idea to make nested_vmx_setup_ctls_msrs() use both allowed-1 and required-1 bits from vmcs_config. I'll pick up the suggested patch and try to construct something for required-1 bits. Thanks!
On Wed, Jun 22, 2022 at 10:00:29AM +0200, Vitaly Kuznetsov wrote: > Sean Christopherson <seanjc@google.com> writes: > > > On Tue, Jun 14, 2022, Anirudh Rayabharam wrote: > >> On Mon, Jun 13, 2022 at 04:57:49PM +0000, Sean Christopherson wrote: > > ... > > >> > > >> > Any reason not to use the already sanitized vmcs_config? I can't think of any > >> > reason why the nested path should blindly use the raw MSR values from hardware. > >> > >> vmcs_config has the sanitized exec controls. But how do we construct MSR > >> values using them? > > > > I was thinking we could use the sanitized controls for the allowed-1 bits, and then > > take the required-1 bits from the CPU. And then if we wanted to avoid the redundant > > RDMSRs in a follow-up patch we could add required-1 fields to vmcs_config. > > > > Hastily constructed and compile-tested only, proceed with caution :-) > > Independently from "[PATCH 00/11] KVM: VMX: Support TscScaling and > EnclsExitingBitmap whith eVMCS" which is supposed to fix the particular > TSC scaling issue, I like the idea to make nested_vmx_setup_ctls_msrs() > use both allowed-1 and required-1 bits from vmcs_config. I'll pick up > the suggested patch and try to construct something for required-1 bits. I tried this patch today but it causes some regression which causes /dev/kvm to be unavailable in L1. I didn't get a chance to look into it closely but I am guessing it has something to do with the fact that vmcs_config reflects the config that L0 chose to use rather than what is available to use. So constructing allowed-1 MSR bits based on what bits are set in exec controls maybe isn't correct. Thanks! - Anirudh.
Anirudh Rayabharam <anrayabh@linux.microsoft.com> writes: > On Wed, Jun 22, 2022 at 10:00:29AM +0200, Vitaly Kuznetsov wrote: >> Sean Christopherson <seanjc@google.com> writes: >> >> > On Tue, Jun 14, 2022, Anirudh Rayabharam wrote: >> >> On Mon, Jun 13, 2022 at 04:57:49PM +0000, Sean Christopherson wrote: >> >> ... >> >> >> > >> >> > Any reason not to use the already sanitized vmcs_config? I can't think of any >> >> > reason why the nested path should blindly use the raw MSR values from hardware. >> >> >> >> vmcs_config has the sanitized exec controls. But how do we construct MSR >> >> values using them? >> > >> > I was thinking we could use the sanitized controls for the allowed-1 bits, and then >> > take the required-1 bits from the CPU. And then if we wanted to avoid the redundant >> > RDMSRs in a follow-up patch we could add required-1 fields to vmcs_config. >> > >> > Hastily constructed and compile-tested only, proceed with caution :-) >> >> Independently from "[PATCH 00/11] KVM: VMX: Support TscScaling and >> EnclsExitingBitmap whith eVMCS" which is supposed to fix the particular >> TSC scaling issue, I like the idea to make nested_vmx_setup_ctls_msrs() >> use both allowed-1 and required-1 bits from vmcs_config. I'll pick up >> the suggested patch and try to construct something for required-1 bits. > > I tried this patch today but it causes some regression which causes > /dev/kvm to be unavailable in L1. I didn't get a chance to look into it > closely but I am guessing it has something to do with the fact that > vmcs_config reflects the config that L0 chose to use rather than what is > available to use. So constructing allowed-1 MSR bits based on what bits > are set in exec controls maybe isn't correct. I've tried to pick it up but it's actually much harder than I think. The patch has some minor issues ('&vmcs_config.nested' needs to be switched to '&vmcs_conf->nested' in nested_vmx_setup_ctls_msrs()), but the main problem is that the set of controls nested_vmx_setup_ctls_msrs() needs is NOT a subset of vmcs_config (setup_vmcs_config()). I was able to identify at least: diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 5e14e4c40007..8076352174ad 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -2483,8 +2483,14 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf, CPU_BASED_INVLPG_EXITING | CPU_BASED_RDPMC_EXITING; - opt = CPU_BASED_TPR_SHADOW | + opt = CPU_BASED_INTR_WINDOW_EXITING | + CPU_BASED_NMI_WINDOW_EXITING | + CPU_BASED_TPR_SHADOW | + CPU_BASED_USE_IO_BITMAPS | CPU_BASED_USE_MSR_BITMAPS | + CPU_BASED_MONITOR_TRAP_FLAG | + CPU_BASED_RDTSC_EXITING | + CPU_BASED_PAUSE_EXITING | CPU_BASED_ACTIVATE_SECONDARY_CONTROLS | CPU_BASED_ACTIVATE_TERTIARY_CONTROLS; if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PROCBASED_CTLS, @@ -2582,6 +2588,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf, #endif opt = VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_PAT | + VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_EFER | VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_PT_CONCEAL_PIP | @@ -2604,7 +2611,11 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf, _pin_based_exec_control &= ~PIN_BASED_POSTED_INTR; min = VM_ENTRY_LOAD_DEBUG_CONTROLS; - opt = VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL | + opt = +#ifdef CONFIG_X86_64 + VM_ENTRY_IA32E_MODE | +#endif + VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL | VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_IA32_EFER | VM_ENTRY_LOAD_BNDCFGS | but it is 1) not sufficient because some controls are smartly filtered out just because we don't want them for L1 -- and this doesn't mean that L2 doesn't need them and 2) because if we add some 'opt' controls to setup_vmcs_config() we need to filter them out somewhere else. I'm starting to think we may just want to store raw VMX MSR values in vmcs_config first, then sanitize them (eVMCS, vmx preemtoion timer bug, perf_ctrl bug,..) and then do the adjust_vmx_controls() magic. I'm not giving up yet but don't expect something small and backportable to stable :-)
On Wed, Jun 22, 2022 at 04:35:27PM +0200, Vitaly Kuznetsov wrote: > Anirudh Rayabharam <anrayabh@linux.microsoft.com> writes: > > > On Wed, Jun 22, 2022 at 10:00:29AM +0200, Vitaly Kuznetsov wrote: > >> Sean Christopherson <seanjc@google.com> writes: > >> > >> > On Tue, Jun 14, 2022, Anirudh Rayabharam wrote: > >> >> On Mon, Jun 13, 2022 at 04:57:49PM +0000, Sean Christopherson wrote: > >> > >> ... > >> > >> >> > > >> >> > Any reason not to use the already sanitized vmcs_config? I can't think of any > >> >> > reason why the nested path should blindly use the raw MSR values from hardware. > >> >> > >> >> vmcs_config has the sanitized exec controls. But how do we construct MSR > >> >> values using them? > >> > > >> > I was thinking we could use the sanitized controls for the allowed-1 bits, and then > >> > take the required-1 bits from the CPU. And then if we wanted to avoid the redundant > >> > RDMSRs in a follow-up patch we could add required-1 fields to vmcs_config. > >> > > >> > Hastily constructed and compile-tested only, proceed with caution :-) > >> > >> Independently from "[PATCH 00/11] KVM: VMX: Support TscScaling and > >> EnclsExitingBitmap whith eVMCS" which is supposed to fix the particular > >> TSC scaling issue, I like the idea to make nested_vmx_setup_ctls_msrs() > >> use both allowed-1 and required-1 bits from vmcs_config. I'll pick up > >> the suggested patch and try to construct something for required-1 bits. > > > > I tried this patch today but it causes some regression which causes > > /dev/kvm to be unavailable in L1. I didn't get a chance to look into it > > closely but I am guessing it has something to do with the fact that > > vmcs_config reflects the config that L0 chose to use rather than what is > > available to use. So constructing allowed-1 MSR bits based on what bits > > are set in exec controls maybe isn't correct. > > I've tried to pick it up but it's actually much harder than I think. The > patch has some minor issues ('&vmcs_config.nested' needs to be switched > to '&vmcs_conf->nested' in nested_vmx_setup_ctls_msrs()), but the main > problem is that the set of controls nested_vmx_setup_ctls_msrs() needs > is NOT a subset of vmcs_config (setup_vmcs_config()). I was able to > identify at least: > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > index 5e14e4c40007..8076352174ad 100644 > --- a/arch/x86/kvm/vmx/vmx.c > +++ b/arch/x86/kvm/vmx/vmx.c > @@ -2483,8 +2483,14 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf, > CPU_BASED_INVLPG_EXITING | > CPU_BASED_RDPMC_EXITING; > > - opt = CPU_BASED_TPR_SHADOW | > + opt = CPU_BASED_INTR_WINDOW_EXITING | > + CPU_BASED_NMI_WINDOW_EXITING | > + CPU_BASED_TPR_SHADOW | > + CPU_BASED_USE_IO_BITMAPS | > CPU_BASED_USE_MSR_BITMAPS | > + CPU_BASED_MONITOR_TRAP_FLAG | > + CPU_BASED_RDTSC_EXITING | > + CPU_BASED_PAUSE_EXITING | > CPU_BASED_ACTIVATE_SECONDARY_CONTROLS | > CPU_BASED_ACTIVATE_TERTIARY_CONTROLS; > if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PROCBASED_CTLS, > @@ -2582,6 +2588,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf, > #endif > opt = VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | > VM_EXIT_LOAD_IA32_PAT | > + VM_EXIT_SAVE_IA32_PAT | > VM_EXIT_LOAD_IA32_EFER | > VM_EXIT_CLEAR_BNDCFGS | > VM_EXIT_PT_CONCEAL_PIP | > @@ -2604,7 +2611,11 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf, > _pin_based_exec_control &= ~PIN_BASED_POSTED_INTR; > > min = VM_ENTRY_LOAD_DEBUG_CONTROLS; > - opt = VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL | > + opt = > +#ifdef CONFIG_X86_64 > + VM_ENTRY_IA32E_MODE | > +#endif > + VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL | > VM_ENTRY_LOAD_IA32_PAT | > VM_ENTRY_LOAD_IA32_EFER | > VM_ENTRY_LOAD_BNDCFGS | > > but it is 1) not sufficient because some controls are smartly filtered > out just because we don't want them for L1 -- and this doesn't mean that > L2 doesn't need them and 2) because if we add some 'opt' controls to > setup_vmcs_config() we need to filter them out somewhere else. > > I'm starting to think we may just want to store raw VMX MSR values in > vmcs_config first, then sanitize them (eVMCS, vmx preemtoion timer bug, > perf_ctrl bug,..) and then do the adjust_vmx_controls() magic. > > I'm not giving up yet but don't expect something small and backportable > to stable :-) How about we do something simple like the patch below to start with? This will easily apply to stable and we can continue improving upon it with follow up patches on mainline. diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index f5cb18e00e78..f88d748c7cc6 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -6564,6 +6564,10 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) msrs->pinbased_ctls_high); msrs->pinbased_ctls_low |= PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR; +#if IS_ENABLED(CONFIG_HYPERV) + if (static_branch_unlikely(&enable_evmcs)) + msrs->pinbased_ctls_high &= ~EVMCS1_UNSUPPORTED_PINCTRL; +#endif msrs->pinbased_ctls_high &= PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING | @@ -6580,6 +6584,10 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) msrs->exit_ctls_low = VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR; +#if IS_ENABLED(CONFIG_HYPERV) + if (static_branch_unlikely(&enable_evmcs)) + msrs->exit_ctls_high &= ~EVMCS1_UNSUPPORTED_VMEXIT_CTRL; +#endif msrs->exit_ctls_high &= #ifdef CONFIG_X86_64 VM_EXIT_HOST_ADDR_SPACE_SIZE | @@ -6600,6 +6608,10 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) msrs->entry_ctls_high); msrs->entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR; +#if IS_ENABLED(CONFIG_HYPERV) + if (static_branch_unlikely(&enable_evmcs)) + msrs->entry_ctls_high &= ~EVMCS1_UNSUPPORTED_VMENTRY_CTRL; +#endif msrs->entry_ctls_high &= #ifdef CONFIG_X86_64 VM_ENTRY_IA32E_MODE | @@ -6657,6 +6669,10 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) msrs->secondary_ctls_high); msrs->secondary_ctls_low = 0; +#if IS_ENABLED(CONFIG_HYPERV) + if (static_branch_unlikely(&enable_evmcs)) + msrs->secondary_ctls_high &= ~EVMCS1_UNSUPPORTED_2NDEXEC; +#endif msrs->secondary_ctls_high &= SECONDARY_EXEC_DESC | SECONDARY_EXEC_ENABLE_RDTSCP |
Anirudh Rayabharam <anrayabh@linux.microsoft.com> writes: > On Wed, Jun 22, 2022 at 04:35:27PM +0200, Vitaly Kuznetsov wrote: ... >> >> I've tried to pick it up but it's actually much harder than I think. The >> patch has some minor issues ('&vmcs_config.nested' needs to be switched >> to '&vmcs_conf->nested' in nested_vmx_setup_ctls_msrs()), but the main >> problem is that the set of controls nested_vmx_setup_ctls_msrs() needs >> is NOT a subset of vmcs_config (setup_vmcs_config()). I was able to >> identify at least: ... I've jsut sent "[PATCH RFC v1 00/10] KVM: nVMX: Use vmcs_config for setting up nested VMX MSRs" which implements Sean's suggestion. Hope this is the way to go for mainline. > > How about we do something simple like the patch below to start with? > This will easily apply to stable and we can continue improving upon > it with follow up patches on mainline. > Personally, I'm not against this for @stable. Alternatively, in case the only observed issue is with TSC scaling, we can add support for it for KVM-on-Hyper-V but not for Hyper-V-on-KVM (a small subset of "[PATCH 00/11] KVM: VMX: Support TscScaling and EnclsExitingBitmap whith eVMCS"). I can prepare patches if needed.
On Wed, Jun 22, 2022 at 06:48:50PM +0200, Vitaly Kuznetsov wrote: > Anirudh Rayabharam <anrayabh@linux.microsoft.com> writes: > > > On Wed, Jun 22, 2022 at 04:35:27PM +0200, Vitaly Kuznetsov wrote: > > ... > > >> > >> I've tried to pick it up but it's actually much harder than I think. The > >> patch has some minor issues ('&vmcs_config.nested' needs to be switched > >> to '&vmcs_conf->nested' in nested_vmx_setup_ctls_msrs()), but the main > >> problem is that the set of controls nested_vmx_setup_ctls_msrs() needs > >> is NOT a subset of vmcs_config (setup_vmcs_config()). I was able to > >> identify at least: > > ... > > I've jsut sent "[PATCH RFC v1 00/10] KVM: nVMX: Use vmcs_config for > setting up nested VMX MSRs" which implements Sean's suggestion. Hope > this is the way to go for mainline. > > > > > How about we do something simple like the patch below to start with? > > This will easily apply to stable and we can continue improving upon > > it with follow up patches on mainline. > > > > Personally, I'm not against this for @stable. Alternatively, in case the I think it's a good intermediate fix for mainline too. It is easier to land it in stable if it already exists in mainline. It can stay in mainline until your series lands and replaces it with the vmcs_config approach. What do you think? > only observed issue is with TSC scaling, we can add support for it for > KVM-on-Hyper-V but not for Hyper-V-on-KVM (a small subset of "[PATCH > 00/11] KVM: VMX: Support TscScaling and EnclsExitingBitmap whith > eVMCS"). I can prepare patches if needed. Will it fit in stable's 100 line rule? Thanks! - Anirudh.
Anirudh Rayabharam <anrayabh@linux.microsoft.com> writes: > On Wed, Jun 22, 2022 at 06:48:50PM +0200, Vitaly Kuznetsov wrote: >> Anirudh Rayabharam <anrayabh@linux.microsoft.com> writes: >> >> > On Wed, Jun 22, 2022 at 04:35:27PM +0200, Vitaly Kuznetsov wrote: >> >> ... >> >> >> >> >> I've tried to pick it up but it's actually much harder than I think. The >> >> patch has some minor issues ('&vmcs_config.nested' needs to be switched >> >> to '&vmcs_conf->nested' in nested_vmx_setup_ctls_msrs()), but the main >> >> problem is that the set of controls nested_vmx_setup_ctls_msrs() needs >> >> is NOT a subset of vmcs_config (setup_vmcs_config()). I was able to >> >> identify at least: >> >> ... >> >> I've jsut sent "[PATCH RFC v1 00/10] KVM: nVMX: Use vmcs_config for >> setting up nested VMX MSRs" which implements Sean's suggestion. Hope >> this is the way to go for mainline. >> >> > >> > How about we do something simple like the patch below to start with? >> > This will easily apply to stable and we can continue improving upon >> > it with follow up patches on mainline. >> > >> >> Personally, I'm not against this for @stable. Alternatively, in case the > > I think it's a good intermediate fix for mainline too. It is easier to land > it in stable if it already exists in mainline. It can stay in mainline > until your series lands and replaces it with the vmcs_config approach. > > What do you think? > Paolo's call but personally I think both series can make 5.20 so there's no need for an intermediate solution. >> only observed issue is with TSC scaling, we can add support for it for >> KVM-on-Hyper-V but not for Hyper-V-on-KVM (a small subset of "[PATCH >> 00/11] KVM: VMX: Support TscScaling and EnclsExitingBitmap whith >> eVMCS"). I can prepare patches if needed. > > Will it fit in stable's 100 line rule? > Yes, please take a look at the attached patches (5.18.y based). First 3 are identical to what I've sent for mainline, the last one is reduced to only support TSC scaling for KVM on Hyper-V (but not Hyper-V on KVM). Compile tested only, proceed with caution)
On Thu, Jun 23, 2022 at 01:49:30PM +0200, Vitaly Kuznetsov wrote: > Anirudh Rayabharam <anrayabh@linux.microsoft.com> writes: > > > On Wed, Jun 22, 2022 at 06:48:50PM +0200, Vitaly Kuznetsov wrote: > >> Anirudh Rayabharam <anrayabh@linux.microsoft.com> writes: > >> > >> > On Wed, Jun 22, 2022 at 04:35:27PM +0200, Vitaly Kuznetsov wrote: > >> > >> ... > >> > >> >> > >> >> I've tried to pick it up but it's actually much harder than I think. The > >> >> patch has some minor issues ('&vmcs_config.nested' needs to be switched > >> >> to '&vmcs_conf->nested' in nested_vmx_setup_ctls_msrs()), but the main > >> >> problem is that the set of controls nested_vmx_setup_ctls_msrs() needs > >> >> is NOT a subset of vmcs_config (setup_vmcs_config()). I was able to > >> >> identify at least: > >> > >> ... > >> > >> I've jsut sent "[PATCH RFC v1 00/10] KVM: nVMX: Use vmcs_config for > >> setting up nested VMX MSRs" which implements Sean's suggestion. Hope > >> this is the way to go for mainline. > >> > >> > > >> > How about we do something simple like the patch below to start with? > >> > This will easily apply to stable and we can continue improving upon > >> > it with follow up patches on mainline. > >> > > >> > >> Personally, I'm not against this for @stable. Alternatively, in case the > > > > I think it's a good intermediate fix for mainline too. It is easier to land > > it in stable if it already exists in mainline. It can stay in mainline > > until your series lands and replaces it with the vmcs_config approach. > > > > What do you think? > > > > Paolo's call but personally I think both series can make 5.20 so there's > no need for an intermediate solution. Only reason I see for this intermediate solution is to automatically land the fix in stable without bothering to write a special backport. I will send it as a proper patch and see if there is any interest in taking it. - Anirudh.
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index f5cb18e00e78..d773ddc6422b 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -6656,6 +6656,9 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) msrs->secondary_ctls_low, msrs->secondary_ctls_high); + if (!kvm_has_tsc_control) + msrs->secondary_ctls_high &= ~SECONDARY_EXEC_TSC_SCALING; + msrs->secondary_ctls_low = 0; msrs->secondary_ctls_high &= SECONDARY_EXEC_DESC | @@ -6667,8 +6670,7 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps) SECONDARY_EXEC_RDRAND_EXITING | SECONDARY_EXEC_ENABLE_INVPCID | SECONDARY_EXEC_RDSEED_EXITING | - SECONDARY_EXEC_XSAVES | - SECONDARY_EXEC_TSC_SCALING; + SECONDARY_EXEC_XSAVES; /* * We can emulate "VMCS shadowing," even if the hardware
VM entry into an L2 guest on KVM on Hyper-V fails with the following splat (stripped for brevity) when running cloud-hypervisor tests. [ 1481.600386] WARNING: CPU: 4 PID: 7641 at arch/x86/kvm/vmx/nested.c:4563 nested_vmx_vmexit+0x70d/0x790 [kvm_intel] [ 1481.600427] CPU: 4 PID: 7641 Comm: vcpu2 Not tainted 5.15.0-1008-azure #9-Ubuntu [ 1481.600429] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 07/22/2021 [ 1481.600430] RIP: 0010:nested_vmx_vmexit+0x70d/0x790 [kvm_intel] [ 1481.600447] Call Trace: [ 1481.600449] <TASK> [ 1481.600451] nested_vmx_reflect_vmexit+0x10b/0x440 [kvm_intel] [ 1481.600457] __vmx_handle_exit+0xef/0x670 [kvm_intel] [ 1481.600467] vmx_handle_exit+0x12/0x50 [kvm_intel] [ 1481.600472] vcpu_enter_guest+0x83a/0xfd0 [kvm] [ 1481.600524] vcpu_run+0x5e/0x240 [kvm] [ 1481.600560] kvm_arch_vcpu_ioctl_run+0xd7/0x550 [kvm] [ 1481.600597] kvm_vcpu_ioctl+0x29a/0x6d0 [kvm] [ 1481.600634] __x64_sys_ioctl+0x91/0xc0 [ 1481.600637] do_syscall_64+0x5c/0xc0 [ 1481.600667] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 1481.600670] RIP: 0033:0x7f688becdaff [ 1481.600686] </TASK> As per the comments in arch/x86/kvm/vmx/evmcs.h, TSC multiplier field is currently not supported in EVMCS. As a result, there is no TSC scaling support when KVM is running on Hyper-V i.e. kvm_has_tsc_control is false. However, in nested_vmx_setup_ctls_msrs(), TSC scaling is exposed to L1. When L1 tries to launch an L2 guest, vmcs12 has TSC scaling enabled. This propagates to vmcs02. But KVM doesn't set the TSC multiplier value because kvm_has_tsc_control is false. Due to this, VM entry for L2 guest fails. (VM entry fails if "use TSC scaling" is 1 and TSC multiplier is 0.) To fix, expose TSC scaling to L1 only if kvm_has_tsc_control. Fixes: d041b5ea93352 ("KVM: nVMX: Enable nested TSC scaling") Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com> --- arch/x86/kvm/vmx/nested.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)