regression: nested: L1 3.15+ fails to load kvm-intel on L0 <3.15

Message ID 55093B52.5090904@canonical.com (mailing list archive)
State New, archived

Commit Message

Stefan Bader March 18, 2015, 8:46 a.m. UTC
Someone reported[1] that some of their L1 guests fail to load the kvm-intel
module (without many details). It turns out that this was caused (at least in
part) by

KVM: vmx: Allow the guest to run with dirty debug registers

as this adds VM_EXIT_SAVE_DEBUG_CONTROLS to the required MSR_IA32_VMX_EXIT_CTLS
bits. I am not sure whether this should be fixed up in pre-3.15 kernels or the
other way round. Maybe naively asked, but would it be sufficient to add this bit
as required to the vmcs setup of older kernels (without the code to make any use
of it)?

Regardless of that, I wonder whether the below (this version untested) sounds
acceptable for upstream? At least it would make debugging much simpler. :)


Thanks,
-Stefan

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1431473

Comments

Paolo Bonzini March 18, 2015, 9:18 a.m. UTC | #1
On 18/03/2015 09:46, Stefan Bader wrote:
> 
> Regardless of that, I wonder whether the below (this version untested) sound
> acceptable for upstream? At least it would make debugging much simpler. :)
> 
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2953,8 +2953,11 @@ static __init int adjust_vmx_controls(u32 ctl_min, u32 ct
>         ctl |= vmx_msr_low;  /* bit == 1 in low word  ==> must be one  */
> 
>         /* Ensure minimum (required) set of control bits are supported. */
> -       if (ctl_min & ~ctl)
> +       if (ctl_min & ~ctl) {
> +               printk(KERN_ERR "vmx: msr(%08x) does not match requirements. "
> +                               "req=%08x cur=%08x\n", msr, ctl_min, ctl);
>                 return -EIO;
> +       }
> 
>         *result = ctl;
>         return 0;

Yes, this is nice.  Maybe -ENODEV.

Also, a minimal patch for Ubuntu would probably be:

@@ -2850,7 +2851,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 		      vmx_capability.ept, vmx_capability.vpid);
 	}
 
-	min = 0;
+	min = VM_EXIT_SAVE_DEBUG_CONTROLS;
 #ifdef CONFIG_X86_64
 	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
 #endif

but I don't think it's a good idea to add it to stable kernels.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Stefan Bader March 18, 2015, 9:59 a.m. UTC | #2
On 18.03.2015 10:18, Paolo Bonzini wrote:
> 
> 
> On 18/03/2015 09:46, Stefan Bader wrote:
>>
>> Regardless of that, I wonder whether the below (this version untested) sound
>> acceptable for upstream? At least it would make debugging much simpler. :)
>>
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -2953,8 +2953,11 @@ static __init int adjust_vmx_controls(u32 ctl_min, u32 ct
>>         ctl |= vmx_msr_low;  /* bit == 1 in low word  ==> must be one  */
>>
>>         /* Ensure minimum (required) set of control bits are supported. */
>> -       if (ctl_min & ~ctl)
>> +       if (ctl_min & ~ctl) {
>> +               printk(KERN_ERR "vmx: msr(%08x) does not match requirements. "
>> +                               "req=%08x cur=%08x\n", msr, ctl_min, ctl);
>>                 return -EIO;
>> +       }
>>
>>         *result = ctl;
>>         return 0;
> 
> Yes, this is nice.  Maybe -ENODEV.

Maybe, though I did not change that. I just added the message to give some kind
of hint when the module would otherwise fail with just an I/O error.

> 
> Also, a minimal patch for Ubuntu would probably be:
> 
> @@ -2850,7 +2851,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>  		      vmx_capability.ept, vmx_capability.vpid);
>  	}
>  
> -	min = 0;
> +	min = VM_EXIT_SAVE_DEBUG_CONTROLS;
>  #ifdef CONFIG_X86_64
>  	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
>  #endif
> 
> but I don't think it's a good idea to add it to stable kernels.

Why is that? Because it risks causing the module to fail to load on an L0
where it worked before? That is something I would rather avoid.
Generally I think it would be good to have something that can be widely
applied, given the speed at which cloud service providers tend to move forward
(ok, they may not actively push the ability to go nested).

-Stefan
> 
> Paolo
>
Paolo Bonzini March 18, 2015, 10:27 a.m. UTC | #3
On 18/03/2015 10:59, Stefan Bader wrote:
>> @@ -2850,7 +2851,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>>  		      vmx_capability.ept, vmx_capability.vpid);
>>  	}
>>  
>> -	min = 0;
>> +	min = VM_EXIT_SAVE_DEBUG_CONTROLS;
>>  #ifdef CONFIG_X86_64
>>  	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
>>  #endif
>> 
>> but I don't think it's a good idea to add it to stable kernels.
> 
> Why is that? Because it has a risk of causing the module failing to
> load on L0 where it did work before?

Because if we wanted to make 3.14 nested VMX stable-ish we would need
several more, at least these:

      KVM: nVMX: fix lifetime issues for vmcs02
      KVM: nVMX: clean up nested_release_vmcs12 and code around it
      KVM: nVMX: Rework interception of IRQs and NMIs
      KVM: nVMX: Do not inject NMI vmexits when L2 has a pending
                 interrupt
      KVM: nVMX: Disable preemption while reading from shadow VMCS

and for 3.13:

      KVM: nVMX: Leave VMX mode on clearing of feature control MSR

There are also fixes for several L2-crash-L1 bugs in Nadav Amit's patches.

Basically, nested VMX was never considered stable-worthy.  Perhaps
that can change soon---but not retroactively.

So I'd rather avoid giving false impressions of the stability of nVMX
in 3.14.

Even if we considered nVMX stable, I'd _really_ not want to consider
the L1<->L2 boundary a secure one for a longer time.

> Which would be something I would rather avoid. Generally I think it
> would be good to have something that can be generally applied.
> Given the speed that cloud service providers tend to move forward
> (ok they may not actively push the ability to go nested).

And if they did, I'd really not want them to do it with a 3.14 kernel.

Paolo
Stefan Bader March 18, 2015, 10:30 a.m. UTC | #4
On 18.03.2015 11:27, Paolo Bonzini wrote:
> 
> 
> On 18/03/2015 10:59, Stefan Bader wrote:
>>> @@ -2850,7 +2851,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>>>  		      vmx_capability.ept, vmx_capability.vpid);
>>>  	}
>>>  
>>> -	min = 0;
>>> +	min = VM_EXIT_SAVE_DEBUG_CONTROLS;
>>>  #ifdef CONFIG_X86_64
>>>  	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
>>>  #endif
>>>
>>> but I don't think it's a good idea to add it to stable kernels.
>>
>> Why is that? Because it has a risk of causing the module failing to
>> load on L0 where it did work before?
> 
> Because if we wanted to make 3.14 nested VMX stable-ish we would need
> several more, at least these:
> 
>       KVM: nVMX: fix lifetime issues for vmcs02
>       KVM: nVMX: clean up nested_release_vmcs12 and code around it
>       KVM: nVMX: Rework interception of IRQs and NMIs
>       KVM: nVMX: Do not inject NMI vmexits when L2 has a pending
>                  interrupt
>       KVM: nVMX: Disable preemption while reading from shadow VMCS
> 
> and for 3.13:
> 
>       KVM: nVMX: Leave VMX mode on clearing of feature control MSR
> 
> There are also several L2-crash-L1 bugs too in Nadav Amit's patches.
> 
> Basically, nested VMX was never considered stable-worthy.  Perhaps
> that can change soon---but not retroactively.
> 
> So I'd rather avoid giving false impressions of the stability of nVMX
> in 3.14.
> 
> Even if we considered nVMX stable, I'd _really_ not want to consider
> the L1<->L2 boundary a secure one for a longer time.
> 
>> Which would be something I would rather avoid. Generally I think it
>> would be good to have something that can be generally applied.
>> Given the speed that cloud service providers tend to move forward
>> (ok they may not actively push the ability to go nested).
> 
> And if they did, I'd really not want them to do it with a 3.14 kernel.

3.14... you are optimistic. :) But thanks a lot for the detailed info.

-Stefan

> 
> Paolo
>
Stefan Bader March 19, 2015, 7:58 p.m. UTC | #5
On 18.03.2015 10:18, Paolo Bonzini wrote:
> 
> 
> On 18/03/2015 09:46, Stefan Bader wrote:
>>
>> Regardless of that, I wonder whether the below (this version untested) sound
>> acceptable for upstream? At least it would make debugging much simpler. :)
>>
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -2953,8 +2953,11 @@ static __init int adjust_vmx_controls(u32 ctl_min, u32 ct
>>         ctl |= vmx_msr_low;  /* bit == 1 in low word  ==> must be one  */
>>
>>         /* Ensure minimum (required) set of control bits are supported. */
>> -       if (ctl_min & ~ctl)
>> +       if (ctl_min & ~ctl) {
>> +               printk(KERN_ERR "vmx: msr(%08x) does not match requirements. "
>> +                               "req=%08x cur=%08x\n", msr, ctl_min, ctl);
>>                 return -EIO;
>> +       }
>>
>>         *result = ctl;
>>         return 0;
> 
> Yes, this is nice.  Maybe -ENODEV.
> 
> Also, a minimal patch for Ubuntu would probably be:
> 
> @@ -2850,7 +2851,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>  		      vmx_capability.ept, vmx_capability.vpid);
>  	}
>  
> -	min = 0;
> +	min = VM_EXIT_SAVE_DEBUG_CONTROLS;
>  #ifdef CONFIG_X86_64
>  	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
>  #endif
> 
> but I don't think it's a good idea to add it to stable kernels.

Sorry, I got my assumptions a bit confused. The change above does cause guests
to fail, but my statement that this is caused by host kernels older than that
change was wrong, and I should have known better.

The hosts actually affected are those running 3.2, which (maybe not perfectly,
but well enough) allowed nested vmx for guest kernels <3.15; guests with 3.15+
break there. Running 3.13 on the host has no issues.

Comparing the rdmsr values of guests between those two host kernels, I found
that on 3.2 the exit control msr was very sparsely initialized. And looking at
the changes between 3.2 and 3.13 I found

commit 33fb20c39e98b90813b5ab2d9a0d6faa6300caca
Author: Jan Kiszka <jan.kiszka@siemens.com>
Date:   Wed Mar 6 15:44:03 2013 +0100

    KVM: nVMX: Fix content of MSR_IA32_VMX_ENTRY/EXIT_CTLS

This was added in 3.10. So the range of affected kernels is <3.10, back to when
nested vmx became somewhat usable. For 3.2, Ben (and obviously we) would be
affected. Apart from that, I believe only 3.4 has an active longterm release.
At least that change looks safer for stable, as it reads as correcting things
rather than adding a feature. I was able to cherry-pick it into a 3.2 kernel,
and a 3.16 guest can then successfully load the kvm-intel module again, of
course with the same shortcomings as before.

-Stefan
> 
> Paolo
>
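The rdmsr comparison described above can be sketched in plain C. This is only
an illustration, not code from the thread: the helper names are hypothetical,
and the 64-bit values in the usage note are made up rather than measured on
any host. It relies on the layout of the VMX capability MSRs, where the low
word reports bits that must be 1 and the high word reports bits that may be 1.

```c
#include <stdint.h>

/* Bit values as defined in the kernel's VMX headers. */
#define VM_EXIT_SAVE_DEBUG_CONTROLS   0x00000004u
#define VM_EXIT_HOST_ADDR_SPACE_SIZE  0x00000200u

/* Low word: control bits that are forced to 1. */
static uint32_t vmx_ctls_must_be_one(uint64_t msr_val)
{
    return (uint32_t)msr_val;
}

/* High word: control bits that are allowed to be 1. */
static uint32_t vmx_ctls_may_be_one(uint64_t msr_val)
{
    return (uint32_t)(msr_val >> 32);
}

/* Required bits that the reported capabilities cannot provide (0 if none). */
static uint32_t vmx_ctls_missing(uint64_t msr_val, uint32_t required)
{
    uint32_t available = vmx_ctls_may_be_one(msr_val) |
                         vmx_ctls_must_be_one(msr_val);

    return required & ~available;
}
```

With a sparsely initialized value such as 0x0000020000000000 (hypothetical),
vmx_ctls_missing() returns VM_EXIT_SAVE_DEBUG_CONTROLS for a guest that
requires both bits; with 0x0000020400000000 it returns 0.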
Paolo Bonzini March 19, 2015, 8:08 p.m. UTC | #6
On 19/03/2015 20:58, Stefan Bader wrote:
> This was added in 3.10. So the range of kernels affected <3.10 back
> to when nested vmx became somewhat usable. For 3.2 Ben (and
> obviously us) would be affected. Apart from that, I believe, it is
> only 3.4 which has an active longterm. At least that change looks
> safer for stable as it sounds like correcting things and not adding
> a feature. I was able to cherry-pick that into a 3.2 kernel and
> then a 3.16 guest successfully can load the kvm-intel module again,
> of course with the same shortcomings as before.

Feel free to backport whatever you want to distro kernels.  But I'm
going to NACK for stable@ anything that is related to nested virt.

The code has changed so much that I simply cannot do a meaningful
review of most patches when applied to old codebases.

Paolo

Patch

--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2953,8 +2953,11 @@  static __init int adjust_vmx_controls(u32 ctl_min, u32 ct
        ctl |= vmx_msr_low;  /* bit == 1 in low word  ==> must be one  */

        /* Ensure minimum (required) set of control bits are supported. */
-       if (ctl_min & ~ctl)
+       if (ctl_min & ~ctl) {
+               printk(KERN_ERR "vmx: msr(%08x) does not match requirements. "
+                               "req=%08x cur=%08x\n", msr, ctl_min, ctl);
                return -EIO;
+       }

        *result = ctl;
        return 0;
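
For illustration only, the check that this patch annotates can be modeled in
user space. The sketch below is not the kernel function: the name
model_adjust_vmx_controls is hypothetical, and the two MSR words are passed in
as parameters (an assumption, since the kernel obtains them with rdmsr()). The
message format follows the patch; fprintf() stands in for printk().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Values as defined in the kernel's VMX headers. */
#define MSR_IA32_VMX_EXIT_CTLS        0x00000483u
#define VM_EXIT_SAVE_DEBUG_CONTROLS   0x00000004u
#define VM_EXIT_HOST_ADDR_SPACE_SIZE  0x00000200u

/*
 * User-space model of the control adjustment. msr_low and msr_high are the
 * two halves of the capability MSR, supplied by the caller (assumption).
 */
static int model_adjust_vmx_controls(uint32_t ctl_min, uint32_t ctl_opt,
                                     uint32_t msr, uint32_t msr_low,
                                     uint32_t msr_high, uint32_t *result)
{
    uint32_t ctl = ctl_min | ctl_opt;

    assert(result != NULL);

    ctl &= msr_high; /* bit == 0 in high word ==> must be zero */
    ctl |= msr_low;  /* bit == 1 in low word  ==> must be one  */

    /* Report which requirement failed instead of failing silently. */
    if (ctl_min & ~ctl) {
        fprintf(stderr, "vmx: msr(%08x) does not match requirements. "
                        "req=%08x cur=%08x\n",
                (unsigned)msr, (unsigned)ctl_min, (unsigned)ctl);
        return -EIO;
    }

    *result = ctl;
    return 0;
}
```

Calling it with ctl_min set to VM_EXIT_SAVE_DEBUG_CONTROLS |
VM_EXIT_HOST_ADDR_SPACE_SIZE and an msr_high that lacks the debug-controls bit
returns -EIO and prints the req/cur pair, which is the hint the patch aims to
give when kvm-intel refuses to load.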