mbox series

[v2,0/4] KVM: x86: Partially allow KVM_SET_CPUID{,2} after KVM_RUN for CPU hotplug

Message ID 20220117150542.2176196-1-vkuznets@redhat.com (mailing list archive)
Headers show
Series KVM: x86: Partially allow KVM_SET_CPUID{,2} after KVM_RUN for CPU hotplug | expand

Message

Vitaly Kuznetsov Jan. 17, 2022, 3:05 p.m. UTC
Changes since v1:
- Drop the allowlist of items which were allowed to change and just allow
the exact same CPUID data [Sean, Paolo]. Adjust selftest accordingly.
- Drop PATCH1 as the exact same change got merged upstream.

Recently, KVM made it illegal to change CPUID after KVM_RUN but
unfortunately this change is not fully compatible with existing VMMs.
In particular, QEMU reuses vCPU fds for CPU hotplug after unplug and it
calls KVM_SET_CPUID2. Relax the requirement by implementing an allowing
KVM_SET_CPUID{,2} with the exact same data.

Vitaly Kuznetsov (4):
  KVM: x86: Do runtime CPUID update before updating
    vcpu->arch.cpuid_entries
  KVM: x86: Partially allow KVM_SET_CPUID{,2} after KVM_RUN
  KVM: selftests: Rename 'get_cpuid_test' to 'cpuid_test'
  KVM: selftests: Test KVM_SET_CPUID2 after KVM_RUN

 arch/x86/kvm/cpuid.c                          | 67 ++++++++++++++++---
 arch/x86/kvm/x86.c                            | 19 ------
 tools/testing/selftests/kvm/.gitignore        |  2 +-
 tools/testing/selftests/kvm/Makefile          |  4 +-
 .../selftests/kvm/include/x86_64/processor.h  |  7 ++
 .../selftests/kvm/lib/x86_64/processor.c      | 33 +++++++--
 .../x86_64/{get_cpuid_test.c => cpuid_test.c} | 30 +++++++++
 7 files changed, 126 insertions(+), 36 deletions(-)
 rename tools/testing/selftests/kvm/x86_64/{get_cpuid_test.c => cpuid_test.c} (83%)

Comments

Igor Mammedov Jan. 18, 2022, 2:35 p.m. UTC | #1
On Mon, 17 Jan 2022 16:05:38 +0100
Vitaly Kuznetsov <vkuznets@redhat.com> wrote:

> Changes since v1:
> - Drop the allowlist of items which were allowed to change and just allow
> the exact same CPUID data [Sean, Paolo]. Adjust selftest accordingly.
> - Drop PATCH1 as the exact same change got merged upstream.
> 
> Recently, KVM made it illegal to change CPUID after KVM_RUN but
> unfortunately this change is not fully compatible with existing VMMs.
> In particular, QEMU reuses vCPU fds for CPU hotplug after unplug and it
> calls KVM_SET_CPUID2. Relax the requirement by implementing an allowing
> KVM_SET_CPUID{,2} with the exact same data.


Can you check following scenario:
 * on host that has IA32_TSX_CTRL and TSX enabled (RTM/HLE cpuid bits present)
 * boot 2 vcpus VM with TSX enabled on VMM side but with tsx=off on kernel CLI

     that should cause kernel to set MSR_IA32_TSX_CTRL to 3H from initial 0H
     and clear RTM+HLE bits in CPUID, check that RTM/HLE cpuid it cleared

 * hotunplug a VCPU and then replug it again
    if IA32_TSX_CTRL is reset to initial state, that should re-enable
    RTM/HLE cpuid bits and KVM_SET_CPUID2 might fail due to difference

and as Sean pointed out there might be other non constant leafs,
where exact match check could leave userspace broken.
     

> Vitaly Kuznetsov (4):
>   KVM: x86: Do runtime CPUID update before updating
>     vcpu->arch.cpuid_entries
>   KVM: x86: Partially allow KVM_SET_CPUID{,2} after KVM_RUN
>   KVM: selftests: Rename 'get_cpuid_test' to 'cpuid_test'
>   KVM: selftests: Test KVM_SET_CPUID2 after KVM_RUN
> 
>  arch/x86/kvm/cpuid.c                          | 67 ++++++++++++++++---
>  arch/x86/kvm/x86.c                            | 19 ------
>  tools/testing/selftests/kvm/.gitignore        |  2 +-
>  tools/testing/selftests/kvm/Makefile          |  4 +-
>  .../selftests/kvm/include/x86_64/processor.h  |  7 ++
>  .../selftests/kvm/lib/x86_64/processor.c      | 33 +++++++--
>  .../x86_64/{get_cpuid_test.c => cpuid_test.c} | 30 +++++++++
>  7 files changed, 126 insertions(+), 36 deletions(-)
>  rename tools/testing/selftests/kvm/x86_64/{get_cpuid_test.c => cpuid_test.c} (83%)
>
Paolo Bonzini Jan. 18, 2022, 4:17 p.m. UTC | #2
On 1/18/22 15:35, Igor Mammedov wrote:
> Can you check following scenario:
>   * on host that has IA32_TSX_CTRL and TSX enabled (RTM/HLE cpuid bits present)
>   * boot 2 vcpus VM with TSX enabled on VMM side but with tsx=off on kernel CLI
> 
>       that should cause kernel to set MSR_IA32_TSX_CTRL to 3H from initial 0H
>       and clear RTM+HLE bits in CPUID, check that RTM/HLE cpuid it cleared
> 
>   * hotunplug a VCPU and then replug it again
>      if IA32_TSX_CTRL is reset to initial state, that should re-enable
>      RTM/HLE cpuid bits and KVM_SET_CPUID2 might fail due to difference
> 
> and as Sean pointed out there might be other non constant leafs,
> where exact match check could leave userspace broken.


MSR_IA32_TSX_CTRL is handled differently straight during the CPUID call:

                 if (function == 7 && index == 0) {
                         u64 data;
                         if (!__kvm_get_msr(vcpu, MSR_IA32_TSX_CTRL, &data, true) &&
                             (data & TSX_CTRL_CPUID_CLEAR))
                                 *ebx &= ~(F(RTM) | F(HLE));
                 }


and I think we should redo all or most of kvm_update_cpuid_runtime
the same way.

Paolo
Vitaly Kuznetsov Jan. 18, 2022, 4:34 p.m. UTC | #3
Igor Mammedov <imammedo@redhat.com> writes:

> On Mon, 17 Jan 2022 16:05:38 +0100
> Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>
>> Changes since v1:
>> - Drop the allowlist of items which were allowed to change and just allow
>> the exact same CPUID data [Sean, Paolo]. Adjust selftest accordingly.
>> - Drop PATCH1 as the exact same change got merged upstream.
>> 
>> Recently, KVM made it illegal to change CPUID after KVM_RUN but
>> unfortunately this change is not fully compatible with existing VMMs.
>> In particular, QEMU reuses vCPU fds for CPU hotplug after unplug and it
>> calls KVM_SET_CPUID2. Relax the requirement by implementing an allowing
>> KVM_SET_CPUID{,2} with the exact same data.
>
>
> Can you check following scenario:
>  * on host that has IA32_TSX_CTRL and TSX enabled (RTM/HLE cpuid bits present)
>  * boot 2 vcpus VM with TSX enabled on VMM side but with tsx=off on kernel CLI
>
>      that should cause kernel to set MSR_IA32_TSX_CTRL to 3H from initial 0H
>      and clear RTM+HLE bits in CPUID, check that RTM/HLE cpuid it
>      cleared

Forgive me my ignorance around (not only) TSX :-) I took a "Intel(R)
Xeon(R) CPU E3-1270 v5 @ 3.60GHz" host which seems to have rtm/hle and
booted a guest with 'cpu=host' and with (and without) 'tsx=off' on the
kernel command line. I decided to check what's is MSR_IA32_TSX_CTRL but
I see the following:

# rdmsr 0x122
rdmsr: CPU 0 cannot read MSR 0x00000122

I tried adding 'tsx_ctrl' to my QEMU command line but it complains with
qemu-system-x86_64: warning: host doesn't support requested feature: MSR(10AH).tsx-ctrl [bit 7]

so I think my host is not good enough :-(

Also, I've looked at tsx_clear_cpuid() but it actually writes to
MSR_TSX_FORCE_ABORT MSR (0x10F), not MSR_IA32_TSX_CTRL so I'm confused.

>
>  * hotunplug a VCPU and then replug it again
>     if IA32_TSX_CTRL is reset to initial state, that should re-enable
>     RTM/HLE cpuid bits and KVM_SET_CPUID2 might fail due to difference

Could you please teach me this kung-fu, I mean hot to unplug a
cold-plugged CPU with QMP? Previoulsy, I only did un-plugging for what
I've hotplugged, something like:

(QEMU) device_add driver=host-x86_64-cpu socket-id=0 core-id=2 thread-id=0 id=cpu2
{"return": {}}
(QEMU) device_del id=cpu2
{"return": {}}

What's the ids of the cold-plugged CPUs?

>
> and as Sean pointed out there might be other non constant leafs,
> where exact match check could leave userspace broken.

Indeed, while testing your suggestion I've stumbled upon
CPUID.(EAX=0x12, ECX=1) (SGX) where we mangle ECX from
kvm_vcpu_after_set_cpuid():

        best = kvm_find_cpuid_entry(vcpu, 0x12, 0x1);
	if (best) {
                best->ecx &= vcpu->arch.guest_supported_xcr0 & 0xffffffff;
		best->edx &= vcpu->arch.guest_supported_xcr0 >> 32;
                best->ecx |= XFEATURE_MASK_FPSSE;
        }

In theory, we should just move this to __kvm_update_cpuid_runtime()...
I'll take a look tomorrow.
Sean Christopherson Jan. 18, 2022, 4:53 p.m. UTC | #4
On Tue, Jan 18, 2022, Paolo Bonzini wrote:
> On 1/18/22 15:35, Igor Mammedov wrote:
> > Can you check following scenario:
> >   * on host that has IA32_TSX_CTRL and TSX enabled (RTM/HLE cpuid bits present)
> >   * boot 2 vcpus VM with TSX enabled on VMM side but with tsx=off on kernel CLI
> > 
> >       that should cause kernel to set MSR_IA32_TSX_CTRL to 3H from initial 0H
> >       and clear RTM+HLE bits in CPUID, check that RTM/HLE cpuid it cleared
> > 
> >   * hotunplug a VCPU and then replug it again
> >      if IA32_TSX_CTRL is reset to initial state, that should re-enable
> >      RTM/HLE cpuid bits and KVM_SET_CPUID2 might fail due to difference
> > 
> > and as Sean pointed out there might be other non constant leafs,
> > where exact match check could leave userspace broken.
> 
> 
> MSR_IA32_TSX_CTRL is handled differently straight during the CPUID call:
> 
>                 if (function == 7 && index == 0) {
>                         u64 data;
>                         if (!__kvm_get_msr(vcpu, MSR_IA32_TSX_CTRL, &data, true) &&
>                             (data & TSX_CTRL_CPUID_CLEAR))
>                                 *ebx &= ~(F(RTM) | F(HLE));
>                 }
> 
> 
> and I think we should redo all or most of kvm_update_cpuid_runtime
> the same way.

Please no.  xstate_required_size() requires multiple host CPUID calls, and glibc
does CPUID.0xD.0x0 and CPUID.0xD.0x1 as part of its initialization, i.e. launching
a new userspace process in the guest will see additional performance overhread due
to KVM dynamically computing the XSAVE size instead of caching it based on vCPU
state.  Nested virtualization would be especially painful as every one of those
"host" CPUID invocations will trigger and exit from L1=>L0.
Paolo Bonzini Jan. 18, 2022, 5:12 p.m. UTC | #5
On 1/18/22 17:53, Sean Christopherson wrote:
>>
>> and I think we should redo all or most of kvm_update_cpuid_runtime
>> the same way.
> Please no.  xstate_required_size() requires multiple host CPUID calls, and glibc
> does CPUID.0xD.0x0 and CPUID.0xD.0x1 as part of its initialization, i.e. launching
> a new userspace process in the guest will see additional performance overhread due
> to KVM dynamically computing the XSAVE size instead of caching it based on vCPU
> state.  Nested virtualization would be especially painful as every one of those
> "host" CPUID invocations will trigger and exit from L1=>L0.
> 

Agreed, but all of the required data is by Linux cached in 
xstate_offsets, xstate_sizes and xstate_comp_offsets; moving 
xstate_required_size to xstate.c and skipping the host CPUID would 
probably be a good idea nevertheless.

Paolo
Igor Mammedov Jan. 19, 2022, 7:59 a.m. UTC | #6
On Tue, 18 Jan 2022 17:34:09 +0100
Vitaly Kuznetsov <vkuznets@redhat.com> wrote:

> Igor Mammedov <imammedo@redhat.com> writes:
> 
> > On Mon, 17 Jan 2022 16:05:38 +0100
> > Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
> >  
> >> Changes since v1:
> >> - Drop the allowlist of items which were allowed to change and just allow
> >> the exact same CPUID data [Sean, Paolo]. Adjust selftest accordingly.
> >> - Drop PATCH1 as the exact same change got merged upstream.
> >> 
> >> Recently, KVM made it illegal to change CPUID after KVM_RUN but
> >> unfortunately this change is not fully compatible with existing VMMs.
> >> In particular, QEMU reuses vCPU fds for CPU hotplug after unplug and it
> >> calls KVM_SET_CPUID2. Relax the requirement by implementing an allowing
> >> KVM_SET_CPUID{,2} with the exact same data.  
> >
> >
> > Can you check following scenario:
> >  * on host that has IA32_TSX_CTRL and TSX enabled (RTM/HLE cpuid bits present)
> >  * boot 2 vcpus VM with TSX enabled on VMM side but with tsx=off on kernel CLI
> >
> >      that should cause kernel to set MSR_IA32_TSX_CTRL to 3H from initial 0H
> >      and clear RTM+HLE bits in CPUID, check that RTM/HLE cpuid it
> >      cleared  
> 
> Forgive me my ignorance around (not only) TSX :-) I took a "Intel(R)
> Xeon(R) CPU E3-1270 v5 @ 3.60GHz" host which seems to have rtm/hle and
> booted a guest with 'cpu=host' and with (and without) 'tsx=off' on the
> kernel command line. I decided to check what's is MSR_IA32_TSX_CTRL but
> I see the following:
> 
> # rdmsr 0x122
> rdmsr: CPU 0 cannot read MSR 0x00000122
> 
> I tried adding 'tsx_ctrl' to my QEMU command line but it complains with
> qemu-system-x86_64: warning: host doesn't support requested feature: MSR(10AH).tsx-ctrl [bit 7]
> 
> so I think my host is not good enough :-(

I've seen it being available on "COOPER LAKE" Xeon

> 
> Also, I've looked at tsx_clear_cpuid() but it actually writes to
> MSR_TSX_FORCE_ABORT MSR (0x10F), not MSR_IA32_TSX_CTRL so I'm confused.

look at tsx_disable()

> >  * hotunplug a VCPU and then replug it again
> >     if IA32_TSX_CTRL is reset to initial state, that should re-enable
> >     RTM/HLE cpuid bits and KVM_SET_CPUID2 might fail due to difference  
> 
> Could you please teach me this kung-fu, I mean hot to unplug a
> cold-plugged CPU with QMP? Previoulsy, I only did un-plugging for what
> I've hotplugged, something like:
> 
> (QEMU) device_add driver=host-x86_64-cpu socket-id=0 core-id=2 thread-id=0 id=cpu2
> {"return": {}}
> (QEMU) device_del id=cpu2
> {"return": {}}
> 
> What's the ids of the cold-plugged CPUs?

it doesn't have to be coldplugged, hot(plug/unplug/plug) sequence is fine as well.
fyi you can use qom_path with device _del from 'info hotpluggable-cpus' output


> > and as Sean pointed out there might be other non constant leafs,
> > where exact match check could leave userspace broken.  
> 
> Indeed, while testing your suggestion I've stumbled upon
> CPUID.(EAX=0x12, ECX=1) (SGX) where we mangle ECX from
> kvm_vcpu_after_set_cpuid():
> 
>         best = kvm_find_cpuid_entry(vcpu, 0x12, 0x1);
> 	if (best) {
>                 best->ecx &= vcpu->arch.guest_supported_xcr0 & 0xffffffff;
> 		best->edx &= vcpu->arch.guest_supported_xcr0 >> 32;
>                 best->ecx |= XFEATURE_MASK_FPSSE;
>         }
> 
> In theory, we should just move this to __kvm_update_cpuid_runtime()...
> I'll take a look tomorrow.
>