diff mbox series

[v2,6/7] KVM: VMX: Check Intel PT related CPUID leaves

Message ID 20210827070249.924633-7-xiaoyao.li@intel.com (mailing list archive)
State New, archived
Headers show
Series KVM: VMX: PT (processor trace) optimization cleanup and fixes | expand

Commit Message

Xiaoyao Li Aug. 27, 2021, 7:02 a.m. UTC
CPUID 0xD leaves reports the capabilities of Intel PT, e.g. it decides
which bits are valid to be set in MSR_IA32_RTIT_CTL, and reports the
number of PT ADDR ranges.

KVM needs to check that guest CPUID values set by userspace doesn't
enable any bit which is not supported by bare metal. Otherwise,
1. it will trigger vm-entry failure if hardware unsupported bit is
   exposed to guest and set by guest.
2. it triggers #GP when context switch PT MSRs if exposing more
   RTIT_ADDR* MSRs than hardware capacity.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
There is bit 31 of CPUID(0xD, 0).ECX that doesn't restrict any bit in
MSR_IA32_RTIT_CTL. If guest has different value than host, it won't
cause any vm-entry failure, but guest will parse the PT packet with
wrong format.

I also check it to be same as host to ensure the virtualization correctness.

Changes in v2:
- Call out that if configuring more PT ADDR MSRs than hardware, it can
  cause #GP when context switch.
---
 arch/x86/kvm/cpuid.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

Comments

Sean Christopherson Sept. 9, 2021, 9:41 p.m. UTC | #1
On Fri, Aug 27, 2021, Xiaoyao Li wrote:
> CPUID 0xD leaves reports the capabilities of Intel PT, e.g. it decides
> which bits are valid to be set in MSR_IA32_RTIT_CTL, and reports the
> number of PT ADDR ranges.
> 
> KVM needs to check that guest CPUID values set by userspace doesn't
> enable any bit which is not supported by bare metal. Otherwise,
> 1. it will trigger vm-entry failure if hardware unsupported bit is
>    exposed to guest and set by guest.
> 2. it triggers #GP when context switch PT MSRs if exposing more
>    RTIT_ADDR* MSRs than hardware capacity.
> 
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> ---
> There is bit 31 of CPUID(0xD, 0).ECX that doesn't restrict any bit in
> MSR_IA32_RTIT_CTL. If guest has different value than host, it won't
> cause any vm-entry failure, but guest will parse the PT packet with
> wrong format.
> 
> I also check it to be same as host to ensure the virtualization correctness.
> 
> Changes in v2:
> - Call out that if configuring more PT ADDR MSRs than hardware, it can
>   cause #GP when context switch.
> ---
>  arch/x86/kvm/cpuid.c | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
> 
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 739be5da3bca..0c8e06a24156 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -76,6 +76,7 @@ static inline struct kvm_cpuid_entry2 *cpuid_entry2_find(
>  static int kvm_check_cpuid(struct kvm_cpuid_entry2 *entries, int nent)
>  {
>  	struct kvm_cpuid_entry2 *best;
> +	u32 eax, ebx, ecx, edx;
>  
>  	/*
>  	 * The existing code assumes virtual address is 48-bit or 57-bit in the
> @@ -89,6 +90,30 @@ static int kvm_check_cpuid(struct kvm_cpuid_entry2 *entries, int nent)
>  			return -EINVAL;
>  	}
>  
> +	/*
> +	 * CPUID 0xD leaves tell Intel PT capabilities, which decides

CPUID.0xD is XSAVE state, CPUID.0x14 is Intel PT.  This series needs tests...

> +	 * pt_desc.ctl_bitmask in later update_intel_pt_cfg().
> +	 *
> +	 * pt_desc.ctl_bitmask decides the legal value for guest
> +	 * MSR_IA32_RTIT_CTL. KVM cannot support PT capabilities beyond native,
> +	 * otherwise it will trigger vm-entry failure if guest sets native
> +	 * unsupported bits in MSR_IA32_RTIT_CTL.
> +	 */
> +	best = cpuid_entry2_find(entries, nent, 0xD, 0);
> +	if (best) {
> +		cpuid_count(0xD, 0, &eax, &ebx, &ecx, &edx);
> +		if (best->ebx & ~ebx || best->ecx & ~ecx)
> +			return -EINVAL;
> +	}
> +	best = cpuid_entry2_find(entries, nent, 0xD, 1);
> +	if (best) {
> +		cpuid_count(0xD, 0, &eax, &ebx, &ecx, &edx);
> +		if (((best->eax & 0x7) > (eax & 0x7)) ||

Ugh, looking at the rest of the code, even this isn't sufficient because
pt_desc.guest.addr_{a,b} are hardcoded at 4 entries, i.e. running KVM on hardware
with >4 entries will lead to buffer overflows.

One option would be to bump that to the theoretical max of 15, which doesn't seem
too horrible, especially if pt_desc as a whole is allocated on-demand, which it
probably should be since it isn't exactly tiny (nor ubiquitous)

A different option would be to let userspace define whatever it wants for guest
CPUID, and instead cap nr_addr_ranges at min(host.cpuid, guest.cpuid, RTIT_ADDR_RANGE).

Letting userspace generate a bad MSR_IA32_RTIT_CTL is not problematic, there are
plenty of ways userspace can deliberately trigger VM-Entry failure due to invalid
guest state (even if this is a VM-Fail condition, it's not a danger to KVM).

> +		    ((best->eax & ~eax) >> 16) ||
> +		    (best->ebx & ~ebx))
> +			return -EINVAL;
> +	}
> +
>  	return 0;
>  }
>  
> -- 
> 2.27.0
>
Xiaoyao Li Sept. 10, 2021, 1:59 a.m. UTC | #2
On 9/10/2021 5:41 AM, Sean Christopherson wrote:
> On Fri, Aug 27, 2021, Xiaoyao Li wrote:
>> CPUID 0xD leaves reports the capabilities of Intel PT, e.g. it decides
>> which bits are valid to be set in MSR_IA32_RTIT_CTL, and reports the
>> number of PT ADDR ranges.
>>
>> KVM needs to check that guest CPUID values set by userspace doesn't
>> enable any bit which is not supported by bare metal. Otherwise,
>> 1. it will trigger vm-entry failure if hardware unsupported bit is
>>     exposed to guest and set by guest.
>> 2. it triggers #GP when context switch PT MSRs if exposing more
>>     RTIT_ADDR* MSRs than hardware capacity.
>>
>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>> ---
>> There is bit 31 of CPUID(0xD, 0).ECX that doesn't restrict any bit in
>> MSR_IA32_RTIT_CTL. If guest has different value than host, it won't
>> cause any vm-entry failure, but guest will parse the PT packet with
>> wrong format.
>>
>> I also check it to be same as host to ensure the virtualization correctness.
>>
>> Changes in v2:
>> - Call out that if configuring more PT ADDR MSRs than hardware, it can
>>    cause #GP when context switch.
>> ---
>>   arch/x86/kvm/cpuid.c | 25 +++++++++++++++++++++++++
>>   1 file changed, 25 insertions(+)
>>
>> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
>> index 739be5da3bca..0c8e06a24156 100644
>> --- a/arch/x86/kvm/cpuid.c
>> +++ b/arch/x86/kvm/cpuid.c
>> @@ -76,6 +76,7 @@ static inline struct kvm_cpuid_entry2 *cpuid_entry2_find(
>>   static int kvm_check_cpuid(struct kvm_cpuid_entry2 *entries, int nent)
>>   {
>>   	struct kvm_cpuid_entry2 *best;
>> +	u32 eax, ebx, ecx, edx;
>>   
>>   	/*
>>   	 * The existing code assumes virtual address is 48-bit or 57-bit in the
>> @@ -89,6 +90,30 @@ static int kvm_check_cpuid(struct kvm_cpuid_entry2 *entries, int nent)
>>   			return -EINVAL;
>>   	}
>>   
>> +	/*
>> +	 * CPUID 0xD leaves tell Intel PT capabilities, which decides
> 
> CPUID.0xD is XSAVE state, CPUID.0x14 is Intel PT.  This series needs tests...

My apologize.

>> +	 * pt_desc.ctl_bitmask in later update_intel_pt_cfg().
>> +	 *
>> +	 * pt_desc.ctl_bitmask decides the legal value for guest
>> +	 * MSR_IA32_RTIT_CTL. KVM cannot support PT capabilities beyond native,
>> +	 * otherwise it will trigger vm-entry failure if guest sets native
>> +	 * unsupported bits in MSR_IA32_RTIT_CTL.
>> +	 */
>> +	best = cpuid_entry2_find(entries, nent, 0xD, 0);
>> +	if (best) {
>> +		cpuid_count(0xD, 0, &eax, &ebx, &ecx, &edx);
>> +		if (best->ebx & ~ebx || best->ecx & ~ecx)
>> +			return -EINVAL;
>> +	}
>> +	best = cpuid_entry2_find(entries, nent, 0xD, 1);
>> +	if (best) {
>> +		cpuid_count(0xD, 0, &eax, &ebx, &ecx, &edx);
>> +		if (((best->eax & 0x7) > (eax & 0x7)) ||
> 
> Ugh, looking at the rest of the code, even this isn't sufficient because
> pt_desc.guest.addr_{a,b} are hardcoded at 4 entries, i.e. running KVM on hardware
> with >4 entries will lead to buffer overflows.

it's hardcoded to 4 because there is a note of "no processors support 
more than 4 address ranges" in SDM vol.3 Chapter 31.3.1, table 31-11

> One option would be to bump that to the theoretical max of 15, which doesn't seem
> too horrible, especially if pt_desc as a whole is allocated on-demand, which it
> probably should be since it isn't exactly tiny (nor ubiquitous)
> 
> A different option would be to let userspace define whatever it wants for guest
> CPUID, and instead cap nr_addr_ranges at min(host.cpuid, guest.cpuid, RTIT_ADDR_RANGE).
> 
> Letting userspace generate a bad MSR_IA32_RTIT_CTL is not problematic, there are
> plenty of ways userspace can deliberately trigger VM-Entry failure due to invalid
> guest state (even if this is a VM-Fail condition, it's not a danger to KVM).

I'm fine to only safe guard the nr_addr_range if VM-Entry failure 
doesn't matter.

> 
>> +		    ((best->eax & ~eax) >> 16) ||
>> +		    (best->ebx & ~ebx))
>> +			return -EINVAL;
>> +	}
>> +
>>   	return 0;
>>   }
>>   
>> -- 
>> 2.27.0
>>
Xiaoyao Li Oct. 18, 2021, 7:01 a.m. UTC | #3
On 9/10/2021 9:59 AM, Xiaoyao Li wrote:
> On 9/10/2021 5:41 AM, Sean Christopherson wrote:
>> On Fri, Aug 27, 2021, Xiaoyao Li wrote:
>>> CPUID 0xD leaves reports the capabilities of Intel PT, e.g. it decides
>>> which bits are valid to be set in MSR_IA32_RTIT_CTL, and reports the
>>> number of PT ADDR ranges.
>>>
>>> KVM needs to check that guest CPUID values set by userspace doesn't
>>> enable any bit which is not supported by bare metal. Otherwise,
>>> 1. it will trigger vm-entry failure if hardware unsupported bit is
>>>     exposed to guest and set by guest.
>>> 2. it triggers #GP when context switch PT MSRs if exposing more
>>>     RTIT_ADDR* MSRs than hardware capacity.
>>>
>>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
..
> 
>>> +     * pt_desc.ctl_bitmask in later update_intel_pt_cfg().
>>> +     *
>>> +     * pt_desc.ctl_bitmask decides the legal value for guest
>>> +     * MSR_IA32_RTIT_CTL. KVM cannot support PT capabilities beyond 
>>> native,
>>> +     * otherwise it will trigger vm-entry failure if guest sets native
>>> +     * unsupported bits in MSR_IA32_RTIT_CTL.
>>> +     */
>>> +    best = cpuid_entry2_find(entries, nent, 0xD, 0);
>>> +    if (best) {
>>> +        cpuid_count(0xD, 0, &eax, &ebx, &ecx, &edx);
>>> +        if (best->ebx & ~ebx || best->ecx & ~ecx)
>>> +            return -EINVAL;
>>> +    }
>>> +    best = cpuid_entry2_find(entries, nent, 0xD, 1);
>>> +    if (best) {
>>> +        cpuid_count(0xD, 0, &eax, &ebx, &ecx, &edx);
>>> +        if (((best->eax & 0x7) > (eax & 0x7)) ||
>>
>> Ugh, looking at the rest of the code, even this isn't sufficient because
>> pt_desc.guest.addr_{a,b} are hardcoded at 4 entries, i.e. running KVM 
>> on hardware
>> with >4 entries will lead to buffer overflows.
> 
> it's hardcoded to 4 because there is a note of "no processors support 
> more than 4 address ranges" in SDM vol.3 Chapter 31.3.1, table 31-11
> 
>> One option would be to bump that to the theoretical max of 15, which 
>> doesn't seem
>> too horrible, especially if pt_desc as a whole is allocated on-demand, 
>> which it
>> probably should be since it isn't exactly tiny (nor ubiquitous)
>>
>> A different option would be to let userspace define whatever it wants 
>> for guest
>> CPUID, and instead cap nr_addr_ranges at min(host.cpuid, guest.cpuid, 
>> RTIT_ADDR_RANGE).
>>
>> Letting userspace generate a bad MSR_IA32_RTIT_CTL is not problematic, 
>> there are
>> plenty of ways userspace can deliberately trigger VM-Entry failure due 
>> to invalid
>> guest state (even if this is a VM-Fail condition, it's not a danger to 
>> KVM).
> 
> I'm fine to only safe guard the nr_addr_range if VM-Entry failure 
> doesn't matter.

Hi Sean.

It seems I misread your comment. All above you were talking about the 
check on nr_addr_range. Did you want to say the check is not necessary 
if it's to avoid VM-entry failure?

The problem is 1) the check on nr_addr_range is to avoid MSR read #GP, 
thought kernel will fix the #GP. It still prints the warning message.

2) Other check of this Patch on guest CPUID 0x14 is to avoid VM-entry 
failure.

So I want to ask that do you think both 1) and 2) are unnecessary, or 
only 2) ?
Paolo Bonzini Oct. 18, 2021, 12:47 p.m. UTC | #4
On 10/09/21 03:59, Xiaoyao Li wrote:
>> 
>> Ugh, looking at the rest of the code, even this isn't sufficient
>> because pt_desc.guest.addr_{a,b} are hardcoded at 4 entries, i.e.
>> running KVM on hardware with >4 entries will lead to buffer
>> overflows.
> 
> it's hardcoded to 4 because there is a note of "no processors support
>  more than 4 address ranges" in SDM vol.3 Chapter 31.3.1, table
> 31-11

True, but I agree with Sean that it's not pretty.

>> One option would be to bump that to the theoretical max of 15,
>> which doesn't seem too horrible, especially if pt_desc as a whole
>> is allocated on-demand, which it probably should be since it isn't
>> exactly tiny (nor ubiquitous)
>> 
>> A different option would be to let userspace define whatever it
>> wants for guest CPUID, and instead cap nr_addr_ranges at
>> min(host.cpuid, guest.cpuid, RTIT_ADDR_RANGE).

This is the safest option.

Paolo
Xiaoyao Li Oct. 18, 2021, 1:56 p.m. UTC | #5
On 10/18/2021 8:47 PM, Paolo Bonzini wrote:
> On 10/09/21 03:59, Xiaoyao Li wrote:
>>>
>>> Ugh, looking at the rest of the code, even this isn't sufficient
>>> because pt_desc.guest.addr_{a,b} are hardcoded at 4 entries, i.e.
>>> running KVM on hardware with >4 entries will lead to buffer
>>> overflows.
>>
>> it's hardcoded to 4 because there is a note of "no processors support
>>  more than 4 address ranges" in SDM vol.3 Chapter 31.3.1, table
>> 31-11
> 
> True, but I agree with Sean that it's not pretty.

Yes. We can add the check to validate the hardware is not bigger than 4 
for future processors. Intel folks are supposed to send new patches 
before silicon with more than 4 PT_ADDR_RANGE delivered.

>>> One option would be to bump that to the theoretical max of 15,
>>> which doesn't seem too horrible, especially if pt_desc as a whole
>>> is allocated on-demand, which it probably should be since it isn't
>>> exactly tiny (nor ubiquitous)
>>>
>>> A different option would be to let userspace define whatever it
>>> wants for guest CPUID, and instead cap nr_addr_ranges at
>>> min(host.cpuid, guest.cpuid, RTIT_ADDR_RANGE).
> 
> This is the safest option.

My concern was that change userspace's input silently is not good. If 
you prefer this, we certainly need to extend the userspace to query what 
value is finally accepted and set by KVM.

> Paolo
>
Sean Christopherson Oct. 18, 2021, 5:26 p.m. UTC | #6
On Mon, Oct 18, 2021, Xiaoyao Li wrote:
> On 10/18/2021 8:47 PM, Paolo Bonzini wrote:
> > > > One option would be to bump that to the theoretical max of 15,
> > > > which doesn't seem too horrible, especially if pt_desc as a whole
> > > > is allocated on-demand, which it probably should be since it isn't
> > > > exactly tiny (nor ubiquitous)
> > > > 
> > > > A different option would be to let userspace define whatever it
> > > > wants for guest CPUID, and instead cap nr_addr_ranges at
> > > > min(host.cpuid, guest.cpuid, RTIT_ADDR_RANGE).
> > 
> > This is the safest option.
> 
> My concern was that change userspace's input silently is not good.

Technically KVM isn't changing userspace's input, as KVM will still enumerate
CPUID as defined by userspace.  What KVM is doing is refusing to emulate/virtualize
a bogus vCPU model, e.g. by injecting #GP on an MSR access that userspace
incorrectly told the guest was legal.  That is standard operation procedure for
KVM, e.g. there are any number of instructions that will fault if userspace lies
about the vCPU model.

> prefer this, we certainly need to extend the userspace to query what value
> is finally accepted and set by KVM.

That would be __do_cpuid_func()'s responsibility to cap leaf 0x14 output with
RTIT_ADDR_RANGE.  I.e. enumerate the supported ranges in KVM_GET_SUPPORTED_CPUID,
after that it's userspace's responsibility to not mess up.
Xiaoyao Li Oct. 19, 2021, 1:46 a.m. UTC | #7
On 10/19/2021 1:26 AM, Sean Christopherson wrote:
> On Mon, Oct 18, 2021, Xiaoyao Li wrote:
>> On 10/18/2021 8:47 PM, Paolo Bonzini wrote:
>>>>> One option would be to bump that to the theoretical max of 15,
>>>>> which doesn't seem too horrible, especially if pt_desc as a whole
>>>>> is allocated on-demand, which it probably should be since it isn't
>>>>> exactly tiny (nor ubiquitous)
>>>>>
>>>>> A different option would be to let userspace define whatever it
>>>>> wants for guest CPUID, and instead cap nr_addr_ranges at
>>>>> min(host.cpuid, guest.cpuid, RTIT_ADDR_RANGE).
>>>
>>> This is the safest option.

I think I misunderstood it. sigh...

It's not architecture consistent that guest sees a certain number of 
RTIT_ADDR_RANGE from its CPUID but may get #GP when it accesses high index.

OK, you mean it's userspace's fault and KVM shouldn't get blamed for it. 
It seems reasonable for me now.

>> My concern was that change userspace's input silently is not good.
> 
> Technically KVM isn't changing userspace's input, as KVM will still enumerate
> CPUID as defined by userspace.  What KVM is doing is refusing to emulate/virtualize
> a bogus vCPU model, e.g. by injecting #GP on an MSR access that userspace
> incorrectly told the guest was legal.  That is standard operation procedure for
> KVM, e.g. there are any number of instructions that will fault if userspace lies
> about the vCPU model.
> 
>> prefer this, we certainly need to extend the userspace to query what value
>> is finally accepted and set by KVM.
> 
> That would be __do_cpuid_func()'s responsibility to cap leaf 0x14 output with
> RTIT_ADDR_RANGE.  I.e. enumerate the supported ranges in KVM_GET_SUPPORTED_CPUID,
> after that it's userspace's responsibility to not mess up.
>
diff mbox series

Patch

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 739be5da3bca..0c8e06a24156 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -76,6 +76,7 @@  static inline struct kvm_cpuid_entry2 *cpuid_entry2_find(
 static int kvm_check_cpuid(struct kvm_cpuid_entry2 *entries, int nent)
 {
 	struct kvm_cpuid_entry2 *best;
+	u32 eax, ebx, ecx, edx;
 
 	/*
 	 * The existing code assumes virtual address is 48-bit or 57-bit in the
@@ -89,6 +90,30 @@  static int kvm_check_cpuid(struct kvm_cpuid_entry2 *entries, int nent)
 			return -EINVAL;
 	}
 
+	/*
+	 * CPUID 0xD leaves tell Intel PT capabilities, which decides
+	 * pt_desc.ctl_bitmask in later update_intel_pt_cfg().
+	 *
+	 * pt_desc.ctl_bitmask decides the legal value for guest
+	 * MSR_IA32_RTIT_CTL. KVM cannot support PT capabilities beyond native,
+	 * otherwise it will trigger vm-entry failure if guest sets native
+	 * unsupported bits in MSR_IA32_RTIT_CTL.
+	 */
+	best = cpuid_entry2_find(entries, nent, 0xD, 0);
+	if (best) {
+		cpuid_count(0xD, 0, &eax, &ebx, &ecx, &edx);
+		if (best->ebx & ~ebx || best->ecx & ~ecx)
+			return -EINVAL;
+	}
+	best = cpuid_entry2_find(entries, nent, 0xD, 1);
+	if (best) {
+		cpuid_count(0xD, 0, &eax, &ebx, &ecx, &edx);
+		if (((best->eax & 0x7) > (eax & 0x7)) ||
+		    ((best->eax & ~eax) >> 16) ||
+		    (best->ebx & ~ebx))
+			return -EINVAL;
+	}
+
 	return 0;
 }