diff mbox series

[kvm-unit-tests] x86:VMX: Fixup for VMX test failures

Message ID 20230720115810.104890-1-weijiang.yang@intel.com (mailing list archive)
State New, archived
Headers show
Series [kvm-unit-tests] x86:VMX: Fixup for VMX test failures | expand

Commit Message

Yang, Weijiang July 20, 2023, 11:58 a.m. UTC
CET KVM enabling patch series introduces extra constraints
on CR0.WP and CR4.CET bits, i.e., setting CR4.CET=1 faults if
CR0.WP==0. Simply skip CR4.CET bit test to avoid setting it in
flexible_cr4 and finally triggering a #GP when write the CR4
with CET bit set while CR0.WP is cleared.

The enable series also introduces IA32_VMX_BASIC[56 bit] check before
inject exception to VM, per SDM(Vol 3D, A-1):
"If bit 56 is read as 1, software can use VM entry to deliver a hardware
exception with or without an error code, regardless of vector."

With the change, some test cases expected VM entry failure  will
end up with successful results which causes reporting failures. Now
checks the VM launch status conditionally against the bit support
to get consistent results with the change enforced by KVM.

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
---
 x86/vmx.c       |  2 +-
 x86/vmx.h       |  3 ++-
 x86/vmx_tests.c | 21 +++++++++++++++++----
 3 files changed, 20 insertions(+), 6 deletions(-)

Comments

Sean Christopherson Aug. 2, 2023, 7:43 p.m. UTC | #1
This is not "fixup", this is support for CET and for new CPU functionality.

On Thu, Jul 20, 2023, Yang Weijiang wrote:
> CET KVM enabling patch series introduces extra constraints
> on CR0.WP and CR4.CET bits, i.e., setting CR4.CET=1 faults if
> CR0.WP==0. Simply skip CR4.CET bit test to avoid setting it in
> flexible_cr4 and finally triggering a #GP when write the CR4
> with CET bit set while CR0.WP is cleared.
> 
> The enable series also introduces IA32_VMX_BASIC[56 bit] check before
> inject exception to VM, per SDM(Vol 3D, A-1):
> "If bit 56 is read as 1, software can use VM entry to deliver a hardware
> exception with or without an error code, regardless of vector."

This clearly should be at least two separate patches, maybe event three.

  1. Exclude CR4.CET from the test_vmxon_bad_cr()
  2. Add the bit in the "basic" MSR that says the error code consistency check
     is skipped for protected mode and tweak test_invalid_event_injection()

2 could arguably be split, but IMO that's overkill.

> With the change, some test cases expected VM entry failure  will
> end up with successful results which causes reporting failures. Now
> checks the VM launch status conditionally against the bit support
> to get consistent results with the change enforced by KVM.
> 
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> ---
>  x86/vmx.c       |  2 +-
>  x86/vmx.h       |  3 ++-
>  x86/vmx_tests.c | 21 +++++++++++++++++----
>  3 files changed, 20 insertions(+), 6 deletions(-)
> 
> diff --git a/x86/vmx.c b/x86/vmx.c
> index 12e42b0..1c27850 100644
> --- a/x86/vmx.c
> +++ b/x86/vmx.c
> @@ -1430,7 +1430,7 @@ static int test_vmxon_bad_cr(int cr_number, unsigned long orig_cr,
>  		 */
>  		if ((cr_number == 0 && (bit == X86_CR0_PE || bit == X86_CR0_PG)) ||
>  		    (cr_number == 4 && (bit == X86_CR4_PAE || bit == X86_CR4_SMAP ||
> -					bit == X86_CR4_SMEP)))
> +					bit == X86_CR4_SMEP || bit == X86_CR4_CET)))
>  			continue;
>  
>  		if (!(bit & required1) && !(bit & disallowed1)) {
> diff --git a/x86/vmx.h b/x86/vmx.h
> index 604c78f..e53f600 100644
> --- a/x86/vmx.h
> +++ b/x86/vmx.h
> @@ -167,7 +167,8 @@ union vmx_basic {
>  			type:4,
>  			insouts:1,
>  			ctrl:1,
> -			reserved2:8;
> +			errcode:1,

Way too terse.  Please something similar to whatever #define we use on the KVM
side.  Ignore the existing names, this is one of those "the existing code is
awful" scenarios.

Also, I wouldn't be opposed to a patch to rename the union to "vmx_basic_msr",
and the global variable to basic_msr.  At first glance, I thought "basic.errcode"
was somehow looking at whether or not the basic VM-Exit reason had an error code.

> +			reserved2:7;
>  	};
>  };
>  
> diff --git a/x86/vmx_tests.c b/x86/vmx_tests.c
> index 7952ccb..b6d4982 100644
> --- a/x86/vmx_tests.c
> +++ b/x86/vmx_tests.c
> @@ -4173,7 +4173,10 @@ static void test_invalid_event_injection(void)
>  			    ent_intr_info);
>  	vmcs_write(GUEST_CR0, guest_cr0_save & ~X86_CR0_PE & ~X86_CR0_PG);
>  	vmcs_write(ENT_INTR_INFO, ent_intr_info);
> -	test_vmx_invalid_controls();
> +	if (basic.errcode)
> +		test_vmx_valid_controls();
> +	else
> +		test_vmx_invalid_controls();

This is wrong, no?  The consistency check is only skipped for PM, the above CR0.PE
modification means the target is RM.

>  	report_prefix_pop();
>  
>  	ent_intr_info = ent_intr_info_base | INTR_INFO_DELIVER_CODE_MASK |
Yang, Weijiang Aug. 3, 2023, 5:56 a.m. UTC | #2
On 8/3/2023 3:43 AM, Sean Christopherson wrote:
> This is not "fixup", this is support for CET and for new CPU functionality.
>
> On Thu, Jul 20, 2023, Yang Weijiang wrote:
>> CET KVM enabling patch series introduces extra constraints
>> on CR0.WP and CR4.CET bits, i.e., setting CR4.CET=1 faults if
>> CR0.WP==0. Simply skip CR4.CET bit test to avoid setting it in
>> flexible_cr4 and finally triggering a #GP when write the CR4
>> with CET bit set while CR0.WP is cleared.
>>
>> The enable series also introduces IA32_VMX_BASIC[56 bit] check before
>> inject exception to VM, per SDM(Vol 3D, A-1):
>> "If bit 56 is read as 1, software can use VM entry to deliver a hardware
>> exception with or without an error code, regardless of vector."
> This clearly should be at least two separate patches, maybe event three.
>
>    1. Exclude CR4.CET from the test_vmxon_bad_cr()
>    2. Add the bit in the "basic" MSR that says the error code consistency check
>       is skipped for protected mode and tweak test_invalid_event_injection()
>
> 2 could arguably be split, but IMO that's overkill.
I'll do so in next version, thanks!
>> With the change, some test cases expected VM entry failure  will
>> end up with successful results which causes reporting failures. Now
>> checks the VM launch status conditionally against the bit support
>> to get consistent results with the change enforced by KVM.
>>
>> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
>> ---
>>   x86/vmx.c       |  2 +-
>>   x86/vmx.h       |  3 ++-
>>   x86/vmx_tests.c | 21 +++++++++++++++++----
>>   3 files changed, 20 insertions(+), 6 deletions(-)
>>
>> diff --git a/x86/vmx.c b/x86/vmx.c
>> index 12e42b0..1c27850 100644
>> --- a/x86/vmx.c
>> +++ b/x86/vmx.c
>> @@ -1430,7 +1430,7 @@ static int test_vmxon_bad_cr(int cr_number, unsigned long orig_cr,
>>   		 */
>>   		if ((cr_number == 0 && (bit == X86_CR0_PE || bit == X86_CR0_PG)) ||
>>   		    (cr_number == 4 && (bit == X86_CR4_PAE || bit == X86_CR4_SMAP ||
>> -					bit == X86_CR4_SMEP)))
>> +					bit == X86_CR4_SMEP || bit == X86_CR4_CET)))
>>   			continue;
>>   
>>   		if (!(bit & required1) && !(bit & disallowed1)) {
>> diff --git a/x86/vmx.h b/x86/vmx.h
>> index 604c78f..e53f600 100644
>> --- a/x86/vmx.h
>> +++ b/x86/vmx.h
>> @@ -167,7 +167,8 @@ union vmx_basic {
>>   			type:4,
>>   			insouts:1,
>>   			ctrl:1,
>> -			reserved2:8;
>> +			errcode:1,
> Way too terse.  Please something similar to whatever #define we use on the KVM
> side.  Ignore the existing names, this is one of those "the existing code is
> awful" scenarios.
>
> Also, I wouldn't be opposed to a patch to rename the union to "vmx_basic_msr",
> and the global variable to basic_msr.  At first glance, I thought "basic.errcode"
> was somehow looking at whether or not the basic VM-Exit reason had an error code.
OK, will add these changes in next version.
>> +			reserved2:7;
>>   	};
>>   };
>>   
>> diff --git a/x86/vmx_tests.c b/x86/vmx_tests.c
>> index 7952ccb..b6d4982 100644
>> --- a/x86/vmx_tests.c
>> +++ b/x86/vmx_tests.c
>> @@ -4173,7 +4173,10 @@ static void test_invalid_event_injection(void)
>>   			    ent_intr_info);
>>   	vmcs_write(GUEST_CR0, guest_cr0_save & ~X86_CR0_PE & ~X86_CR0_PG);
>>   	vmcs_write(ENT_INTR_INFO, ent_intr_info);
>> -	test_vmx_invalid_controls();
>> +	if (basic.errcode)
>> +		test_vmx_valid_controls();
>> +	else
>> +		test_vmx_invalid_controls();
> This is wrong, no?  The consistency check is only skipped for PM, the above CR0.PE
> modification means the target is RM.
I think this case is executed with !CPU_URG, so RM is "converted" to PM because we
have below in KVM:
                 bool urg = nested_cpu_has2(vmcs12,
SECONDARY_EXEC_UNRESTRICTED_GUEST);
                 bool prot_mode = !urg || vmcs12->guest_cr0 & X86_CR0_PE;
...
                 if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION ||
                     !nested_cpu_has_no_hw_errcode(vcpu)) {
                         /* VM-entry interruption-info field: deliver error code */
                         should_have_error_code =
                                 intr_type == INTR_TYPE_HARD_EXCEPTION &&
                                 prot_mode &&
x86_exception_has_error_code(vector);
                         if (CC(has_error_code != should_have_error_code))
                                 return -EINVAL;
                 }

so on platform with basic.errcode == 1, this case passes.
>>   	report_prefix_pop();
>>   
>>   	ent_intr_info = ent_intr_info_base | INTR_INFO_DELIVER_CODE_MASK |
Sean Christopherson Aug. 3, 2023, 5:11 p.m. UTC | #3
On Thu, Aug 03, 2023, Weijiang Yang wrote:
> On 8/3/2023 3:43 AM, Sean Christopherson wrote:
> > > diff --git a/x86/vmx_tests.c b/x86/vmx_tests.c
> > > index 7952ccb..b6d4982 100644
> > > --- a/x86/vmx_tests.c
> > > +++ b/x86/vmx_tests.c
> > > @@ -4173,7 +4173,10 @@ static void test_invalid_event_injection(void)
> > >   			    ent_intr_info);
> > >   	vmcs_write(GUEST_CR0, guest_cr0_save & ~X86_CR0_PE & ~X86_CR0_PG);
> > >   	vmcs_write(ENT_INTR_INFO, ent_intr_info);
> > > -	test_vmx_invalid_controls();
> > > +	if (basic.errcode)
> > > +		test_vmx_valid_controls();
> > > +	else
> > > +		test_vmx_invalid_controls();
> > This is wrong, no?  The consistency check is only skipped for PM, the above CR0.PE
> > modification means the target is RM.
> I think this case is executed with !CPU_URG, so RM is "converted" to PM because we
> have below in KVM:
>                 bool urg = nested_cpu_has2(vmcs12,
> SECONDARY_EXEC_UNRESTRICTED_GUEST);
>                 bool prot_mode = !urg || vmcs12->guest_cr0 & X86_CR0_PE;
> ...
>                 if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION ||
>                     !nested_cpu_has_no_hw_errcode(vcpu)) {
>                         /* VM-entry interruption-info field: deliver error code */
>                         should_have_error_code =
>                                 intr_type == INTR_TYPE_HARD_EXCEPTION &&
>                                 prot_mode &&
> x86_exception_has_error_code(vector);
>                         if (CC(has_error_code != should_have_error_code))
>                                 return -EINVAL;
>                 }
> 
> so on platform with basic.errcode == 1, this case passes.

Huh.  I get the logic, but IMO based on the SDM, that's a ucode bug that got
propagated into KVM (or an SDM bug, which is my bet for how this gets treated).

I verified HSW at least does indeed generate VM-Fail and not VM-Exit(INVALID_STATE),
so it doesn't appear that KVM is making stuff (for once).  Either that or I'm
misreading the SDM (definite possibility), but the only relevant condition I see is:

  bit 0 (corresponding to CR0.PE) is set in the CR0 field in the guest-state area

I don't see anything in the SDM that states the CR0.PE is assumed to be '1' for
consistency checks when unrestricted guest is disabled.

Can you bug a VMX architect again to get clarification, e.g. to get an SDM update?
Or just point out where I missed something in the SDM, again...
Yang, Weijiang Aug. 4, 2023, 2:07 a.m. UTC | #4
On 8/4/2023 1:11 AM, Sean Christopherson wrote:
> On Thu, Aug 03, 2023, Weijiang Yang wrote:
>> On 8/3/2023 3:43 AM, Sean Christopherson wrote:
>>>> diff --git a/x86/vmx_tests.c b/x86/vmx_tests.c
>>>> index 7952ccb..b6d4982 100644
>>>> --- a/x86/vmx_tests.c
>>>> +++ b/x86/vmx_tests.c
>>>> @@ -4173,7 +4173,10 @@ static void test_invalid_event_injection(void)
>>>>    			    ent_intr_info);
>>>>    	vmcs_write(GUEST_CR0, guest_cr0_save & ~X86_CR0_PE & ~X86_CR0_PG);
>>>>    	vmcs_write(ENT_INTR_INFO, ent_intr_info);
>>>> -	test_vmx_invalid_controls();
>>>> +	if (basic.errcode)
>>>> +		test_vmx_valid_controls();
>>>> +	else
>>>> +		test_vmx_invalid_controls();
>>> This is wrong, no?  The consistency check is only skipped for PM, the above CR0.PE
>>> modification means the target is RM.
>> I think this case is executed with !CPU_URG, so RM is "converted" to PM because we
>> have below in KVM:
>>                  bool urg = nested_cpu_has2(vmcs12,
>> SECONDARY_EXEC_UNRESTRICTED_GUEST);
>>                  bool prot_mode = !urg || vmcs12->guest_cr0 & X86_CR0_PE;
>> ...
>>                  if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION ||
>>                      !nested_cpu_has_no_hw_errcode(vcpu)) {
>>                          /* VM-entry interruption-info field: deliver error code */
>>                          should_have_error_code =
>>                                  intr_type == INTR_TYPE_HARD_EXCEPTION &&
>>                                  prot_mode &&
>> x86_exception_has_error_code(vector);
>>                          if (CC(has_error_code != should_have_error_code))
>>                                  return -EINVAL;
>>                  }
>>
>> so on platform with basic.errcode == 1, this case passes.
> Huh.  I get the logic, but IMO based on the SDM, that's a ucode bug that got
> propagated into KVM (or an SDM bug, which is my bet for how this gets treated).
>
> I verified HSW at least does indeed generate VM-Fail and not VM-Exit(INVALID_STATE),
> so it doesn't appear that KVM is making stuff (for once).  Either that or I'm
> misreading the SDM (definite possibility), but the only relevant condition I see is:
>
>    bit 0 (corresponding to CR0.PE) is set in the CR0 field in the guest-state area
>
> I don't see anything in the SDM that states the CR0.PE is assumed to be '1' for
> consistency checks when unrestricted guest is disabled.
>
> Can you bug a VMX architect again to get clarification, e.g. to get an SDM update?
> Or just point out where I missed something in the SDM, again...
Sure, let me throw the ball to the architect, will update here once got reply.
Thanks!
Yang, Weijiang Aug. 24, 2023, 7:25 a.m. UTC | #5
On 8/4/2023 1:11 AM, Sean Christopherson wrote:
> On Thu, Aug 03, 2023, Weijiang Yang wrote:
>> On 8/3/2023 3:43 AM, Sean Christopherson wrote:
>>>> diff --git a/x86/vmx_tests.c b/x86/vmx_tests.c
>>>> index 7952ccb..b6d4982 100644
>>>> --- a/x86/vmx_tests.c
>>>> +++ b/x86/vmx_tests.c
>>>> @@ -4173,7 +4173,10 @@ static void test_invalid_event_injection(void)
>>>>    			    ent_intr_info);
>>>>    	vmcs_write(GUEST_CR0, guest_cr0_save & ~X86_CR0_PE & ~X86_CR0_PG);
>>>>    	vmcs_write(ENT_INTR_INFO, ent_intr_info);
>>>> -	test_vmx_invalid_controls();
>>>> +	if (basic.errcode)
>>>> +		test_vmx_valid_controls();
>>>> +	else
>>>> +		test_vmx_invalid_controls();
>>> This is wrong, no?  The consistency check is only skipped for PM, the above CR0.PE
>>> modification means the target is RM.
>> I think this case is executed with !CPU_URG, so RM is "converted" to PM because we
>> have below in KVM:
>>                  bool urg = nested_cpu_has2(vmcs12,
>> SECONDARY_EXEC_UNRESTRICTED_GUEST);
>>                  bool prot_mode = !urg || vmcs12->guest_cr0 & X86_CR0_PE;
>> ...
>>                  if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION ||
>>                      !nested_cpu_has_no_hw_errcode(vcpu)) {
>>                          /* VM-entry interruption-info field: deliver error code */
>>                          should_have_error_code =
>>                                  intr_type == INTR_TYPE_HARD_EXCEPTION &&
>>                                  prot_mode &&
>> x86_exception_has_error_code(vector);
>>                          if (CC(has_error_code != should_have_error_code))
>>                                  return -EINVAL;
>>                  }
>>
>> so on platform with basic.errcode == 1, this case passes.
> Huh.  I get the logic, but IMO based on the SDM, that's a ucode bug that got
> propagated into KVM (or an SDM bug, which is my bet for how this gets treated).
>
> I verified HSW at least does indeed generate VM-Fail and not VM-Exit(INVALID_STATE),
> so it doesn't appear that KVM is making stuff (for once).  Either that or I'm
> misreading the SDM (definite possibility), but the only relevant condition I see is:
>
>    bit 0 (corresponding to CR0.PE) is set in the CR0 field in the guest-state area
>
> I don't see anything in the SDM that states the CR0.PE is assumed to be '1' for
> consistency checks when unrestricted guest is disabled.
>
> Can you bug a VMX architect again to get clarification, e.g. to get an SDM update?
> Or just point out where I missed something in the SDM, again...

Sorry for the delayed response! Also added Gil in cc.

I got reply from Gil as below:

"I am not sure whether you (or Sean) are referring to guest state or host state.
The IA32_VMX_CR0_FIXED0 and IA32_VMX_CR0_FIXED1 MSRs enumerate which bits must have fixed values in CR0 in VMX operation.
Every CPU that supports VMX operation (not just the first ones) should support these MSRs to set bits 0 and 31, corresponding to PE and PG.

Section 24.8 explains what this means:
1.    VMXON fails if attempting to activated VMX root operation when either of those bits is 0.
2.    Attempting to clear either bit (with MOV to CR) in VMX operation (root or non-root) will #GP.
3.    Attempting to clear either bit through VM entry (loading guest state) will cause the VM entry to fail (with what Sean calls “VM-Exit(INVALID_STATE)”).
4.    If a VM exit would clear either bit (loading host state), the VM entry that would indicate the host state will fail (with what Sean calls “VM-Fail”).

Exceptions to #2 and #3 are made only if “unrestricted guest” is 1. (If “unrestricted guest” is 0, the items above all apply.)

If “unrestricted guest” is 1, #2 is relaxed in VMX non-root operation (guest software can clear PE or PG with MOV to CR).
If “unrestricted guest” is 1, #3 is relaxed in that VM entry can load CR0 to clear either PE or PG (but cannot set PG without also setting PE).

I suppose that we could clarify that SDM to indicate that all VMX CPUs enumerate PE and PG as being VMX-fixed to 1 (despite the fact that many CPUs do support the 1-setting of “unrestricted guest”)."
Sean Christopherson Aug. 24, 2023, 4:18 p.m. UTC | #6
On Thu, Aug 24, 2023, Weijiang Yang wrote:
> On 8/4/2023 1:11 AM, Sean Christopherson wrote:
> > On Thu, Aug 03, 2023, Weijiang Yang wrote:
> > > > This is wrong, no?  The consistency check is only skipped for PM, the above CR0.PE
> > > > modification means the target is RM.
> > > I think this case is executed with !CPU_URG, so RM is "converted" to PM because we
> > > have below in KVM:
> > >                  bool urg = nested_cpu_has2(vmcs12,
> > > SECONDARY_EXEC_UNRESTRICTED_GUEST);
> > >                  bool prot_mode = !urg || vmcs12->guest_cr0 & X86_CR0_PE;
> > > ...
> > >                  if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION ||
> > >                      !nested_cpu_has_no_hw_errcode(vcpu)) {
> > >                          /* VM-entry interruption-info field: deliver error code */
> > >                          should_have_error_code =
> > >                                  intr_type == INTR_TYPE_HARD_EXCEPTION &&
> > >                                  prot_mode &&
> > > x86_exception_has_error_code(vector);
> > >                          if (CC(has_error_code != should_have_error_code))
> > >                                  return -EINVAL;
> > >                  }
> > > 
> > > so on platform with basic.errcode == 1, this case passes.
> > Huh.  I get the logic, but IMO based on the SDM, that's a ucode bug that got
> > propagated into KVM (or an SDM bug, which is my bet for how this gets treated).
> > 
> > I verified HSW at least does indeed generate VM-Fail and not VM-Exit(INVALID_STATE),
> > so it doesn't appear that KVM is making stuff (for once).  Either that or I'm
> > misreading the SDM (definite possibility), but the only relevant condition I see is:
> > 
> >    bit 0 (corresponding to CR0.PE) is set in the CR0 field in the guest-state area
> > 
> > I don't see anything in the SDM that states the CR0.PE is assumed to be '1' for
> > consistency checks when unrestricted guest is disabled.
> > 
> > Can you bug a VMX architect again to get clarification, e.g. to get an SDM update?
> > Or just point out where I missed something in the SDM, again...
> 
> Sorry for the delayed response! Also added Gil in cc.

Hey Gil!  Thanks for humoring me again.

> 
> I got reply from Gil as below:
> 
> "I am not sure whether you (or Sean) are referring to guest state or host state.

The question is whether trying to do VMLAUNCH/VMRESUME with this scenario

  1. unrestricted guest disabled
  2. GUEST_CR0.PE = 0
  3. #GP injection _without_ an error code

should VM-Fail due injecting a #GP without an error code, or VM-Exit(INVALID_STATE)
due to CR0.PE=0 without unrestricted guest support.

Hardware (I personally tested on Haswell) signals VM-Fail, which doesn't match
what's in the SDM:

  The field's deliver-error-code bit (bit 11) is 1 if each of the following holds:

   (1) the interruption type is hardware exception;
   (2) bit 0 (corresponding to CR0.PE) is set in the CR0 field in the guest-state area;
   (3) IA32_VMX_BASIC[56] is read as 0 (see Appendix A.1); and (4) the vector indicates
       one of the following exceptions: #DF (vector 8), #TS (10), #NP (11), #SS (12),
       #GP (13), #PF (14), or #AC (17).

Specifically #2 doesn't say anything about the check treating GUEST_CR0.PE as '1'
if unrestricted guest is disabled.
Neiger, Gil Aug. 24, 2023, 8:47 p.m. UTC | #7
> On Thu, Aug 24, 2023, Weijiang Yang wrote:
> > On 8/4/2023 1:11 AM, Sean Christopherson wrote:
> > > On Thu, Aug 03, 2023, Weijiang Yang wrote:
> > > > > This is wrong, no?  The consistency check is only skipped for
> > > > > PM, the above CR0.PE modification means the target is RM.
> > > > I think this case is executed with !CPU_URG, so RM is "converted"
> > > > to PM because we have below in KVM:
> > > >                  bool urg = nested_cpu_has2(vmcs12,
> > > > SECONDARY_EXEC_UNRESTRICTED_GUEST);
> > > >                  bool prot_mode = !urg || vmcs12->guest_cr0 &
> > > > X86_CR0_PE; ...
> > > >                  if (!prot_mode || intr_type !=
> > > > INTR_TYPE_HARD_EXCEPTION ||
> > > >                      !nested_cpu_has_no_hw_errcode(vcpu)) {
> > > >                          /* VM-entry interruption-info field:
> > > > deliver error code */
> > > >                          should_have_error_code =
> > > >                                  intr_type ==
> > > > INTR_TYPE_HARD_EXCEPTION &&
> > > >                                  prot_mode &&
> > > > x86_exception_has_error_code(vector);
> > > >                          if (CC(has_error_code !=
> > > > should_have_error_code))
> > > >                                  return -EINVAL;
> > > >                  }
> > > >
> > > > so on platform with basic.errcode == 1, this case passes.
> > > Huh.  I get the logic, but IMO based on the SDM, that's a ucode bug
> > > that got propagated into KVM (or an SDM bug, which is my bet for how
> this gets treated).
> > >
> > > I verified HSW at least does indeed generate VM-Fail and not
> > > VM-Exit(INVALID_STATE), so it doesn't appear that KVM is making
> > > stuff (for once).  Either that or I'm misreading the SDM (definite
> possibility), but the only relevant condition I see is:
> > >
> > >    bit 0 (corresponding to CR0.PE) is set in the CR0 field in the
> > > guest-state area
> > >
> > > I don't see anything in the SDM that states the CR0.PE is assumed to
> > > be '1' for consistency checks when unrestricted guest is disabled.
> > >
> > > Can you bug a VMX architect again to get clarification, e.g. to get an
> SDM update?
> > > Or just point out where I missed something in the SDM, again...
> >
> > Sorry for the delayed response! Also added Gil in cc.
> 
> Hey Gil!  Thanks for humoring me again.
> 
> >
> > I got reply from Gil as below:
> >
> > "I am not sure whether you (or Sean) are referring to guest state or
> host state.
> 
> The question is whether trying to do VMLAUNCH/VMRESUME with this scenario
> 
>   1. unrestricted guest disabled
>   2. GUEST_CR0.PE = 0
>   3. #GP injection _without_ an error code
> 
> should VM-Fail due injecting a #GP without an error code, or VM-
> Exit(INVALID_STATE) due to CR0.PE=0 without unrestricted guest support.
> 
> Hardware (I personally tested on Haswell) signals VM-Fail, which doesn't
> match what's in the SDM:
> 
>   The field's deliver-error-code bit (bit 11) is 1 if each of the
> following holds:
> 
>    (1) the interruption type is hardware exception;
>    (2) bit 0 (corresponding to CR0.PE) is set in the CR0 field in the
> guest-state area;
>    (3) IA32_VMX_BASIC[56] is read as 0 (see Appendix A.1); and (4) the
> vector indicates
>        one of the following exceptions: #DF (vector 8), #TS (10), #NP
> (11), #SS (12),
>        #GP (13), #PF (14), or #AC (17).
> 
> Specifically #2 doesn't say anything about the check treating GUEST_CR0.PE
> as '1'
> if unrestricted guest is disabled.

Thanks for clarifying that the case in question include injection of an event with error code.

This is a quirky situation, and it happens (coincidentally) that I am working similar questions right now internally.

In general, VM entry fails "early" (VM-Fail) if there is a problem with host state or controls and fails "late" (VM-Exit(INVALID_STATE)) if it doesn't fail early but there is a problem with guest state.

This distinction exists to allow the CPU to load and check guest state in one pass:  once VM entry starts to load state, any failure will reload registers with "VM-Exit(INVALID_STATE)".

The original checks on the injection controls (including the error code bit) were all done early as they were just checks on controls.  At that time, there was no conditioning on CR0.PE because "unrestricted guest" did not exist yet.

When "unrestricted guest" was added, those checks became conditional on the guest value of CR0.PE.  We consider the possibility of moving those checks to be "late" (because they depend on guest state), but that would have required more changes to the VM-entry implementation than seemed justified.  Instead, the checks were left "early".  (The guest CR0.PE was consulted, but it was not actually loaded into the CR0 registers, so VM-Fail was OK.)

That has left this architectural anomaly that we have one early check that depends on guest state - and it is guest state that may later cause a late check to fail.

As I said, I am working this issue internally to Intel so that everyone has a consistent view of how this all should work.  I will follow up (or work through Weijiang) as things develop.

					- Gil
diff mbox series

Patch

diff --git a/x86/vmx.c b/x86/vmx.c
index 12e42b0..1c27850 100644
--- a/x86/vmx.c
+++ b/x86/vmx.c
@@ -1430,7 +1430,7 @@  static int test_vmxon_bad_cr(int cr_number, unsigned long orig_cr,
 		 */
 		if ((cr_number == 0 && (bit == X86_CR0_PE || bit == X86_CR0_PG)) ||
 		    (cr_number == 4 && (bit == X86_CR4_PAE || bit == X86_CR4_SMAP ||
-					bit == X86_CR4_SMEP)))
+					bit == X86_CR4_SMEP || bit == X86_CR4_CET)))
 			continue;
 
 		if (!(bit & required1) && !(bit & disallowed1)) {
diff --git a/x86/vmx.h b/x86/vmx.h
index 604c78f..e53f600 100644
--- a/x86/vmx.h
+++ b/x86/vmx.h
@@ -167,7 +167,8 @@  union vmx_basic {
 			type:4,
 			insouts:1,
 			ctrl:1,
-			reserved2:8;
+			errcode:1,
+			reserved2:7;
 	};
 };
 
diff --git a/x86/vmx_tests.c b/x86/vmx_tests.c
index 7952ccb..b6d4982 100644
--- a/x86/vmx_tests.c
+++ b/x86/vmx_tests.c
@@ -4173,7 +4173,10 @@  static void test_invalid_event_injection(void)
 			    ent_intr_info);
 	vmcs_write(GUEST_CR0, guest_cr0_save & ~X86_CR0_PE & ~X86_CR0_PG);
 	vmcs_write(ENT_INTR_INFO, ent_intr_info);
-	test_vmx_invalid_controls();
+	if (basic.errcode)
+		test_vmx_valid_controls();
+	else
+		test_vmx_invalid_controls();
 	report_prefix_pop();
 
 	ent_intr_info = ent_intr_info_base | INTR_INFO_DELIVER_CODE_MASK |
@@ -4206,7 +4209,10 @@  static void test_invalid_event_injection(void)
 			    ent_intr_info);
 	vmcs_write(GUEST_CR0, guest_cr0_save | X86_CR0_PE);
 	vmcs_write(ENT_INTR_INFO, ent_intr_info);
-	test_vmx_invalid_controls();
+	if (basic.errcode)
+		test_vmx_valid_controls();
+	else
+		test_vmx_invalid_controls();
 	report_prefix_pop();
 
 	vmcs_write(CPU_EXEC_CTRL1, secondary_save);
@@ -4228,7 +4234,11 @@  skip_unrestricted_guest:
 		report_prefix_pushf("VM-entry intr info=0x%x [-]",
 				    ent_intr_info);
 		vmcs_write(ENT_INTR_INFO, ent_intr_info);
-		test_vmx_invalid_controls();
+		if (exception_type_mask == INTR_TYPE_HARD_EXCEPTION &&
+		    basic.errcode)
+			test_vmx_valid_controls();
+		else
+			test_vmx_invalid_controls();
 		report_prefix_pop();
 	}
 	report_prefix_pop();
@@ -4265,7 +4275,10 @@  skip_unrestricted_guest:
 		report_prefix_pushf("VM-entry intr info=0x%x [-]",
 				    ent_intr_info);
 		vmcs_write(ENT_INTR_INFO, ent_intr_info);
-		test_vmx_invalid_controls();
+		if (basic.errcode)
+			test_vmx_valid_controls();
+		else
+			test_vmx_invalid_controls();
 		report_prefix_pop();
 
 		/* Positive case */