
[v4,04/10] KVM/x86: intel_pmu_lbr_enable

Message ID 1545816338-1171-5-git-send-email-wei.w.wang@intel.com (mailing list archive)
State New, archived
Series Guest LBR Enabling

Commit Message

Wang, Wei W Dec. 26, 2018, 9:25 a.m. UTC
The LBR stack is model specific: for example, SKX has 32 LBR stack
entries while HSW has 16, so a HSW guest running on an SKX machine may
not get accurate perf results. Currently, we forbid enabling the guest
LBR when the guest and host see a different number of LBR stack entries.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/kvm/cpuid.h     |   8 ++++
 arch/x86/kvm/pmu.c       |   8 ++++
 arch/x86/kvm/pmu.h       |   2 +
 arch/x86/kvm/pmu_intel.c | 117 +++++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c       |   3 +-
 5 files changed, 136 insertions(+), 2 deletions(-)

Comments

Liang, Kan Jan. 2, 2019, 4:33 p.m. UTC | #1
On 12/26/2018 4:25 AM, Wei Wang wrote:
> +
> +	/*
> +	 * It could be possible that people have vcpus of old model run on
> +	 * physcal cpus of newer model, for example a BDW guest on a SKX
> +	 * machine (but not possible to be the other way around).
> +	 * The BDW guest may not get accurate results on a SKX machine as it
> +	 * only reads 16 entries of the lbr stack while there are 32 entries
> +	 * of recordings. So we currently forbid the lbr enabling when the
> +	 * vcpu and physical cpu see different lbr stack entries.

I think it's not enough to only check the number of entries. The LBR from/to
MSRs may be different even if the number of entries is the same, e.g. SLM
and KNL.

> +	 */
> +	switch (vcpu_model) {

That's a duplicate of intel_pmu_init(). I think it's better to factor 
out the common part if you want to check LBR MSRs and entries. Then we 
don't need to add the same code in two different places when enabling 
new platforms.

Actually, I think we may just support LBR for guest if it has the 
identical CPU model as host. It should be good enough for now.

Thanks,
Kan
> +	case INTEL_FAM6_CORE2_MEROM:
> +	case INTEL_FAM6_CORE2_MEROM_L:
Jim Mattson Jan. 2, 2019, 11:26 p.m. UTC | #2
On Wed, Dec 26, 2018 at 2:01 AM Wei Wang <wei.w.wang@intel.com> wrote:
>
> The lbr stack is architecturally specific, for example, SKX has 32 lbr
> stack entries while HSW has 16 entries, so a HSW guest running on a SKX
> machine may not get accurate perf results. Currently, we forbid the
> guest lbr enabling when the guest and host see different lbr stack
> entries.

How do you handle live migration?
Wang, Wei W Jan. 3, 2019, 7:22 a.m. UTC | #3
On 01/03/2019 07:26 AM, Jim Mattson wrote:
> On Wed, Dec 26, 2018 at 2:01 AM Wei Wang <wei.w.wang@intel.com> wrote:
>> The lbr stack is architecturally specific, for example, SKX has 32 lbr
>> stack entries while HSW has 16 entries, so a HSW guest running on a SKX
>> machine may not get accurate perf results. Currently, we forbid the
>> guest lbr enabling when the guest and host see different lbr stack
>> entries.
> How do you handle live migration?

This feature is gated by the QEMU "lbr=true" option.
So if the lbr fails to work on the destination machine,
the destination side QEMU wouldn't be able to boot,
and migration will not happen.

Best,
Wei
Jim Mattson Jan. 3, 2019, 3:34 p.m. UTC | #4
On Wed, Jan 2, 2019 at 11:16 PM Wei Wang <wei.w.wang@intel.com> wrote:
>
> On 01/03/2019 07:26 AM, Jim Mattson wrote:
> > On Wed, Dec 26, 2018 at 2:01 AM Wei Wang <wei.w.wang@intel.com> wrote:
> >> The lbr stack is architecturally specific, for example, SKX has 32 lbr
> >> stack entries while HSW has 16 entries, so a HSW guest running on a SKX
> >> machine may not get accurate perf results. Currently, we forbid the
> >> guest lbr enabling when the guest and host see different lbr stack
> >> entries.
> > How do you handle live migration?
>
> This feature is gated by the QEMU "lbr=true" option.
> So if the lbr fails to work on the destination machine,
> the destination side QEMU wouldn't be able to boot,
> and migration will not happen.

Yes, but then what happens?

Fast forward to, say, 2021. You're decommissioning all Broadwell
servers in your data center. You have to migrate the running VMs off
of those Broadwell systems onto newer hardware. But, with the current
implementation, the migration cannot happen. So, what do you do? I
suppose you just never enable the feature in the first place. Right?
Andi Kleen Jan. 3, 2019, 5:18 p.m. UTC | #5
> Yes, but then what happens?
> 
> Fast forward to, say, 2021. You're decommissioning all Broadwell
> servers in your data center. You have to migrate the running VMs off
> of those Broadwell systems onto newer hardware. But, with the current
> implementation, the migration cannot happen. So, what do you do? I
> suppose you just never enable the feature in the first place. Right?

There would in theory be ways to handle this.

LBRs are normally not required for correct execution (they're not like
instructions), so it would actually be quite reasonable to just return 0
on reads and ignore writes to them on incompatible machines. Your profilers
might return some bogus data, but that is normally acceptable.

In some cases it might also be possible to emulate LBRs of
different types, but I don't think it would be worth the effort.
 
Yes, but the patches don't do this, so right now you should only enable it when
all systems in your migration pool have compatible LBRs.
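
A minimal sketch of that read-as-zero/ignore-writes fallback, purely for
illustration (the lbr_is_compatible() helper is hypothetical and not part
of this series; intel_pmu_get_msr()/intel_pmu_set_msr() are the existing
handlers in pmu_intel.c):

static int intel_pmu_get_lbr_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
{
	if (!lbr_is_compatible(vcpu)) {
		*data = 0;	/* reads return 0 on incompatible hosts */
		return 0;
	}
	/* otherwise fall back to the normal LBR MSR handling */
	return intel_pmu_get_msr(vcpu, msr, data);
}

static int intel_pmu_set_lbr_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
	if (!lbr_is_compatible(vcpu))
		return 0;	/* writes are silently dropped */
	return intel_pmu_set_msr(vcpu, msr_info);
}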

-Andi
Wang, Wei W Jan. 4, 2019, 9:58 a.m. UTC | #6
On 01/03/2019 12:33 AM, Liang, Kan wrote:
>
>
> On 12/26/2018 4:25 AM, Wei Wang wrote:
>> +
>> +    /*
>> +     * It could be possible that people have vcpus of old model run on
>> +     * physcal cpus of newer model, for example a BDW guest on a SKX
>> +     * machine (but not possible to be the other way around).
>> +     * The BDW guest may not get accurate results on a SKX machine 
>> as it
>> +     * only reads 16 entries of the lbr stack while there are 32 
>> entries
>> +     * of recordings. So we currently forbid the lbr enabling when the
>> +     * vcpu and physical cpu see different lbr stack entries.
>
> I think it's not enough to only check number of entries. The LBR 
> from/to MSRs may be different even the number of entries is the same, 
> e.g SLM and KNL.

Yes, we could add a comparison of the FROM MSRs.

>
>> +     */
>> +    switch (vcpu_model) {
>
> That's a duplicate of intel_pmu_init(). I think it's better to factor 
> out the common part if you want to check LBR MSRs and entries. Then we 
> don't need to add the same codes in two different places when enabling 
> new platforms.
>


Yes, I thought about this, but intel_pmu_init() does a lot of other things
in each "case xx". Any thoughts on how to factor them out?


> Actually, I think we may just support LBR for guest if it has the 
> identical CPU model as host. It should be good enough for now.
>

I actually tried this in the first place, but it failed to work with the
existing QEMU. For example, when we specify the "Broadwell" cpu in QEMU,
QEMU uses the Broadwell core model, but the physical machine I have is
Broadwell X. This patch supports this case.

Best,
Wei
Wang, Wei W Jan. 4, 2019, 10:09 a.m. UTC | #7
On 01/03/2019 11:34 PM, Jim Mattson wrote:
> On Wed, Jan 2, 2019 at 11:16 PM Wei Wang <wei.w.wang@intel.com> wrote:
>> On 01/03/2019 07:26 AM, Jim Mattson wrote:
>>> On Wed, Dec 26, 2018 at 2:01 AM Wei Wang <wei.w.wang@intel.com> wrote:
>>>> The lbr stack is architecturally specific, for example, SKX has 32 lbr
>>>> stack entries while HSW has 16 entries, so a HSW guest running on a SKX
>>>> machine may not get accurate perf results. Currently, we forbid the
>>>> guest lbr enabling when the guest and host see different lbr stack
>>>> entries.
>>> How do you handle live migration?
>> This feature is gated by the QEMU "lbr=true" option.
>> So if the lbr fails to work on the destination machine,
>> the destination side QEMU wouldn't be able to boot,
>> and migration will not happen.
> Yes, but then what happens?
>
> Fast forward to, say, 2021. You're decommissioning all Broadwell
> servers in your data center. You have to migrate the running VMs off
> of those Broadwell systems onto newer hardware. But, with the current
> implementation, the migration cannot happen. So, what do you do? I
> suppose you just never enable the feature in the first place. Right?

I'm not sure that's how people would run their data centers.
What would be the point of decommissioning all the BDW machines while
there are important BDW VMs still running?

The "lbr=true" option can also be disabled via QMP, which disables the
KVM-side LBR support. So if you really want to deal with the above case,
you could first disable the LBR feature on the source side and then boot
the destination-side QEMU without "lbr=true". The LBR feature will then
no longer be available to the guest by the time you decide to migrate it
to an incompatible physical machine.

The point of this patch is: if we can't offer our users accurate LBR
results, we'd better have the feature disabled rather than offer wrong
results that confuse them.


Best,
Wei
Jim Mattson Jan. 4, 2019, 3:53 p.m. UTC | #8
On Fri, Jan 4, 2019 at 2:03 AM Wei Wang <wei.w.wang@intel.com> wrote:
>
> On 01/03/2019 11:34 PM, Jim Mattson wrote:
> > Fast forward to, say, 2021. You're decommissioning all Broadwell
> > servers in your data center. You have to migrate the running VMs off
> > of those Broadwell systems onto newer hardware. But, with the current
> > implementation, the migration cannot happen. So, what do you do? I
> > suppose you just never enable the feature in the first place. Right?
>
> I'm not sure if that's the way people would do with their data centers.
> What would be the point of decommissioning all the BDW machines when
> there are important BDW VMs running?

TCO increases as hardware ages, while the TCO for each successive
generation of hardware tends to be lower than its predecessor. Thus,
the point of decommissioning all BDW hosts that are beyond their
useful service life is to save money. Our assumption for Google Cloud
is that we will always be able to emulate older Intel processors on
newer Intel processors, so the running BDW VMs should not be affected
by the decommissioning of BDW hardware. Obviously, that means that we
won't offer features that don't have a forward migration story, such
as this one.

Yes, someday Intel will drop support for some feature that we
currently offer (like MPX, perhaps), and that will cause us some
grief.
Liang, Kan Jan. 4, 2019, 3:57 p.m. UTC | #9
On 1/4/2019 4:58 AM, Wei Wang wrote:
> On 01/03/2019 12:33 AM, Liang, Kan wrote:
>>
>>
>> On 12/26/2018 4:25 AM, Wei Wang wrote:
>>> +
>>> +    /*
>>> +     * It could be possible that people have vcpus of old model run on
>>> +     * physcal cpus of newer model, for example a BDW guest on a SKX
>>> +     * machine (but not possible to be the other way around).
>>> +     * The BDW guest may not get accurate results on a SKX machine 
>>> as it
>>> +     * only reads 16 entries of the lbr stack while there are 32 
>>> entries
>>> +     * of recordings. So we currently forbid the lbr enabling when the
>>> +     * vcpu and physical cpu see different lbr stack entries.
>>
>> I think it's not enough to only check number of entries. The LBR 
>> from/to MSRs may be different even the number of entries is the same, 
>> e.g SLM and KNL.
> 
> Yes, we could add the comparison of the FROM msrs.
> 
>>
>>> +     */
>>> +    switch (vcpu_model) {
>>
>> That's a duplicate of intel_pmu_init(). I think it's better to factor 
>> out the common part if you want to check LBR MSRs and entries. Then we 
>> don't need to add the same codes in two different places when enabling 
>> new platforms.
>>
> 
> 
> Yes, I thought about this, but intel_pmu_init() does a lot more things 
> in each "Case xx". Any thought about how to factor them out?
>

I think we may only move the "switch (boot_cpu_data.x86_model) { ... }" 
to a new function, e.g. __intel_pmu_init(int model, struct x86_pmu *x86_pmu)

In __intel_pmu_init, if the model != boot_cpu_data.x86_model, you only 
need to update x86_pmu.*. Just ignore the global settings, e.g.
hw_cache_event_ids, mem_attr, extra_attr, etc.

You may also need to introduce another new function in lbr.c to check
whether the LBR is compatible with the guest, e.g.
bool is_lbr_compatible_with_guest(int model).

bool is_lbr_compatible_with_guest(int model) {
	struct x86_pmu fake_x86_pmu;

	if (boot_cpu_data.x86_model == model)
		return true;

	__intel_pmu_init(model, &fake_x86_pmu);

	if ((x86_pmu.lbr_nr == fake_x86_pmu.lbr_nr) &&
	    (x86_pmu.lbr_tos == fake_x86_pmu.lbr_tos) &&
	    (x86_pmu.lbr_from == fake_x86_pmu.lbr_from))
		return true;

	return false;
}
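
A sketch of how the KVM side could then consume such a helper is below;
this wiring is hypothetical and only for illustration (it assumes
is_lbr_compatible_with_guest() is exported from perf to KVM):

static bool intel_pmu_lbr_enable(struct kvm_vcpu *vcpu)
{
	if (guest_cpuid_family(vcpu) != boot_cpu_data.x86)
		return false;

	return is_lbr_compatible_with_guest(guest_cpuid_model(vcpu));
}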

> 
>> Actually, I think we may just support LBR for guest if it has the 
>> identical CPU model as host. It should be good enough for now.
>>
> 
> I actually tried this in the first place but it failed to work with the 
> existing QEMU.
> For example, when we specify "Broadwell" cpu from qemu, then qemu uses 
> Broadwell core model,
> but the physical machine I have is Broadwell X. This patch will support 
> this case.

I mean is it good enough if we only support "-cpu host"?

Thanks,
Kan
Wang, Wei W Jan. 5, 2019, 10:09 a.m. UTC | #10
On 01/04/2019 11:57 PM, Liang, Kan wrote:
>
>
> On 1/4/2019 4:58 AM, Wei Wang wrote:
>> On 01/03/2019 12:33 AM, Liang, Kan wrote:
>>>
>>>
>>> On 12/26/2018 4:25 AM, Wei Wang wrote:
>>>> +
>>>> +    /*
>>>> +     * It could be possible that people have vcpus of old model 
>>>> run on
>>>> +     * physcal cpus of newer model, for example a BDW guest on a SKX
>>>> +     * machine (but not possible to be the other way around).
>>>> +     * The BDW guest may not get accurate results on a SKX machine 
>>>> as it
>>>> +     * only reads 16 entries of the lbr stack while there are 32 
>>>> entries
>>>> +     * of recordings. So we currently forbid the lbr enabling when 
>>>> the
>>>> +     * vcpu and physical cpu see different lbr stack entries.
>>>
>>> I think it's not enough to only check number of entries. The LBR 
>>> from/to MSRs may be different even the number of entries is the 
>>> same, e.g SLM and KNL.
>>
>> Yes, we could add the comparison of the FROM msrs.
>>
>>>
>>>> +     */
>>>> +    switch (vcpu_model) {
>>>
>>> That's a duplicate of intel_pmu_init(). I think it's better to 
>>> factor out the common part if you want to check LBR MSRs and 
>>> entries. Then we don't need to add the same codes in two different 
>>> places when enabling new platforms.
>>>
>>
>>
>> Yes, I thought about this, but intel_pmu_init() does a lot more 
>> things in each "Case xx". Any thought about how to factor them out?
>>
>
> I think we may only move the "switch (boot_cpu_data.x86_model) { ... 
> }" to a new function, e.g. __intel_pmu_init(int model, struct x86_pmu 
> *x86_pmu)
>
> In __intel_pmu_init, if the model != boot_cpu_data.x86_model, you only 
> need to update x86_pmu.*. Just ignore global settings, e.g 
> hw_cache_event_ids, mem_attr, extra_attr etc.

Thanks for sharing. I understand the point of maintaining those models
in one place, but this factoring-out doesn't seem very elegant to me,
as shown below:

__intel_pmu_init(int model, struct x86_pmu *x86_pmu)
{
...
	switch (model) {
	case INTEL_FAM6_NEHALEM:
	case INTEL_FAM6_NEHALEM_EP:
	case INTEL_FAM6_NEHALEM_EX:
		intel_pmu_lbr_init(x86_pmu);
		if (model != boot_cpu_data.x86_model)
			return;

		/* A lot of other init like below... */
		memcpy(hw_cache_event_ids, nehalem_hw_cache_event_ids,
		       sizeof(hw_cache_event_ids));
		memcpy(hw_cache_extra_regs, nehalem_hw_cache_extra_regs,
		       sizeof(hw_cache_extra_regs));
		x86_pmu.event_constraints = intel_nehalem_event_constraints;
		x86_pmu.pebs_constraints = intel_nehalem_pebs_event_constraints;
		x86_pmu.enable_all = intel_pmu_nhm_enable_all;
		x86_pmu.extra_regs = intel_nehalem_extra_regs;
	...

	case ...
	}
}
We need to insert "if (model != boot_cpu_data.x86_model)" in every "case xx".

What would be the rationale for only doing lbr_init for "x86_pmu"
when model != boot_cpu_data.x86_model?
(It looks more like a workaround to factor out the function and get what
we want.)

I would prefer keeping them separate, as in this patch, for now; it is
logically clearer to me.


>
>>
>>> Actually, I think we may just support LBR for guest if it has the 
>>> identical CPU model as host. It should be good enough for now.
>>>
>>
>> I actually tried this in the first place but it failed to work with 
>> the existing QEMU.
>> For example, when we specify "Broadwell" cpu from qemu, then qemu 
>> uses Broadwell core model,
>> but the physical machine I have is Broadwell X. This patch will 
>> support this case.
>
> I mean is it good enough if we only support "-cpu host"?
>

Not really. AFAIK, people don't usually use this. It is more common to
specify the CPU type.

Best,
Wei
Wang, Wei W Jan. 5, 2019, 10:15 a.m. UTC | #11
On Friday, January 4, 2019 11:54 PM, Jim Mattson wrote:
> On Fri, Jan 4, 2019 at 2:03 AM Wei Wang <wei.w.wang@intel.com> wrote:
> >
> > On 01/03/2019 11:34 PM, Jim Mattson wrote:
> > > Fast forward to, say, 2021. You're decommissioning all Broadwell
> > > servers in your data center. You have to migrate the running VMs off
> > > of those Broadwell systems onto newer hardware. But, with the
> > > current implementation, the migration cannot happen. So, what do you
> > > do? I suppose you just never enable the feature in the first place. Right?
> >
> > I'm not sure if that's the way people would do with their data centers.
> > What would be the point of decommissioning all the BDW machines when
> > there are important BDW VMs running?
> 
> TCO increases as hardware ages, while the TCO for each successive
> generation of hardware tends to be lower than its predecessor. Thus, the
> point of decommissioning all BDW hosts that are beyond their useful service
> life is to save money. Our assumption for Google Cloud is that we will always
> be able to emulate older Intel processors on newer Intel processors, so the
> running BDW VMs should not be affected by the decommissioning of BDW
> hardware. Obviously, that means that we won't offer features that don't
> have a forward migration story, such as this one.
> 
> Yes, someday Intel will drop support for some feature that we currently offer
> (like MPX, perhaps), and that will cause us some grief.

This sounds like a pretty good opportunity to convince your customers
to upgrade their VMs :)

So the solution I currently have in mind for your problem above is:
- rely on QEMU's support for upgrading VMs (e.g. BDW to SKL)
- just disable the incompatible features for non-upgraded VMs
  (this could be in your SLA)

Anyway, I think it would be good to get this feature online first,
and then revisit that future concern.

Best,
Wei
Liang, Kan Jan. 7, 2019, 2:22 p.m. UTC | #12
On 1/5/2019 5:09 AM, Wei Wang wrote:
> On 01/04/2019 11:57 PM, Liang, Kan wrote:
>>
>>
>> On 1/4/2019 4:58 AM, Wei Wang wrote:
>>> On 01/03/2019 12:33 AM, Liang, Kan wrote:
>>>>
>>>>
>>>> On 12/26/2018 4:25 AM, Wei Wang wrote:
>>>>> +
>>>>> +    /*
>>>>> +     * It could be possible that people have vcpus of old model 
>>>>> run on
>>>>> +     * physcal cpus of newer model, for example a BDW guest on a SKX
>>>>> +     * machine (but not possible to be the other way around).
>>>>> +     * The BDW guest may not get accurate results on a SKX machine 
>>>>> as it
>>>>> +     * only reads 16 entries of the lbr stack while there are 32 
>>>>> entries
>>>>> +     * of recordings. So we currently forbid the lbr enabling when 
>>>>> the
>>>>> +     * vcpu and physical cpu see different lbr stack entries.
>>>>
>>>> I think it's not enough to only check number of entries. The LBR 
>>>> from/to MSRs may be different even the number of entries is the 
>>>> same, e.g SLM and KNL.
>>>
>>> Yes, we could add the comparison of the FROM msrs.
>>>
>>>>
>>>>> +     */
>>>>> +    switch (vcpu_model) {
>>>>
>>>> That's a duplicate of intel_pmu_init(). I think it's better to 
>>>> factor out the common part if you want to check LBR MSRs and 
>>>> entries. Then we don't need to add the same codes in two different 
>>>> places when enabling new platforms.
>>>>
>>>
>>>
>>> Yes, I thought about this, but intel_pmu_init() does a lot more 
>>> things in each "Case xx". Any thought about how to factor them out?
>>>
>>
>> I think we may only move the "switch (boot_cpu_data.x86_model) { ... 
>> }" to a new function, e.g. __intel_pmu_init(int model, struct x86_pmu 
>> *x86_pmu)
>>
>> In __intel_pmu_init, if the model != boot_cpu_data.x86_model, you only 
>> need to update x86_pmu.*. Just ignore global settings, e.g 
>> hw_cache_event_ids, mem_attr, extra_attr etc.
> 
> Thanks for sharing. I understand the point of maintaining those models 
> at one place,
> but this factor-out doesn't seem very elegant to me, like below
> 
> __intel_pmu_init (int model, struct x86_pmu *x86_pmu)
> {
> ...
> switch (model)
> case INTEL_FAM6_NEHALEM:
> case INTEL_FAM6_NEHALEM_EP:
> case INTEL_FAM6_NEHALEM_EX:
>      intel_pmu_lbr_init(x86_pmu);
>      if (model != boot_cpu_data.x86_model)
>          return;
> 
>      /* Other a lot of things init like below..*/
>      memcpy(hw_cache_event_ids, nehalem_hw_cache_event_ids,
>                     sizeof(hw_cache_event_ids));
>      memcpy(hw_cache_extra_regs, nehalem_hw_cache_extra_regs,
>                     sizeof(hw_cache_extra_regs));
>      x86_pmu.event_constraints = intel_nehalem_event_constraints;
>                  x86_pmu.pebs_constraints = 
> intel_nehalem_pebs_event_constraints;
>                  x86_pmu.enable_all = intel_pmu_nhm_enable_all;
>                  x86_pmu.extra_regs = intel_nehalem_extra_regs;
>   ...
> 
> Case...
> }
> We need insert "if (model != boot_cpu_data.x86_model)" in every "Case xx".
> 
> What would be the rationale that we only do lbr_init for "x86_pmu"
> when model != boot_cpu_data.x86_model?
> (It looks more like a workaround to factor-out the function and get what 
> we want)

I thought the new function could be extended to support a fake PMU, as below.
It's not only for LBR; the PMU has many CPU-specific features. It could be
used for other features if you want to check compatibility in the future,
but I don't have an example now.

__intel_pmu_init (int model, struct x86_pmu *x86_pmu)
{
bool fake_pmu = (model != boot_cpu_data.x86_model) ? true : false;
...
switch (model)
case INTEL_FAM6_NEHALEM:
case INTEL_FAM6_NEHALEM_EP:
case INTEL_FAM6_NEHALEM_EX:
      intel_pmu_lbr_init(x86_pmu);
      x86_pmu->event_constraints = intel_nehalem_event_constraints;
      x86_pmu->pebs_constraints = intel_nehalem_pebs_event_constraints;
      x86_pmu->enable_all = intel_pmu_nhm_enable_all;
      x86_pmu->extra_regs = intel_nehalem_extra_regs;

      if (fake_pmu)
          return;

      /* Global variables should not be updated for fake PMU */
      memcpy(hw_cache_event_ids, nehalem_hw_cache_event_ids,
                     sizeof(hw_cache_event_ids));
      memcpy(hw_cache_extra_regs, nehalem_hw_cache_extra_regs,
                     sizeof(hw_cache_extra_regs));


> 
> I would prefer having them separated as this patch for now - it is 
> logically more clear to me.
> 

But it will be a problem for maintenance. Perf developers will probably
forget to update the list in KVM. I think you would have to regularly check
the perf code.

Thanks,
Kan
Wang, Wei W Jan. 8, 2019, 6:13 a.m. UTC | #13
On 01/07/2019 10:22 PM, Liang, Kan wrote:
>
>> Thanks for sharing. I understand the point of maintaining those 
>> models at one place,
>> but this factor-out doesn't seem very elegant to me, like below
>>
>> __intel_pmu_init (int model, struct x86_pmu *x86_pmu)
>> {
>> ...
>> switch (model)
>> case INTEL_FAM6_NEHALEM:
>> case INTEL_FAM6_NEHALEM_EP:
>> case INTEL_FAM6_NEHALEM_EX:
>>      intel_pmu_lbr_init(x86_pmu);
>>      if (model != boot_cpu_data.x86_model)
>>          return;
>>
>>      /* Other a lot of things init like below..*/
>>      memcpy(hw_cache_event_ids, nehalem_hw_cache_event_ids,
>>                     sizeof(hw_cache_event_ids));
>>      memcpy(hw_cache_extra_regs, nehalem_hw_cache_extra_regs,
>>                     sizeof(hw_cache_extra_regs));
>>      x86_pmu.event_constraints = intel_nehalem_event_constraints;
>>                  x86_pmu.pebs_constraints = 
>> intel_nehalem_pebs_event_constraints;
>>                  x86_pmu.enable_all = intel_pmu_nhm_enable_all;
>>                  x86_pmu.extra_regs = intel_nehalem_extra_regs;
>>   ...
>>
>> Case...
>> }
>> We need insert "if (model != boot_cpu_data.x86_model)" in every "Case 
>> xx".
>>
>> What would be the rationale that we only do lbr_init for "x86_pmu"
>> when model != boot_cpu_data.x86_model?
>> (It looks more like a workaround to factor-out the function and get 
>> what we want)
>
> I thought the new function may be extended to support fake pmu as below.
> It's not only for lbr. PMU has many CPU specific features. It can be 
> used for other features, if you want to check the compatibility in 
> future. But I don't have an example now.
>
> __intel_pmu_init (int model, struct x86_pmu *x86_pmu)
> {
> bool fake_pmu = (model != boot_cpu_data.x86_model) ? true : false;
> ...
> switch (model)
> case INTEL_FAM6_NEHALEM:
> case INTEL_FAM6_NEHALEM_EP:
> case INTEL_FAM6_NEHALEM_EX:
>      intel_pmu_lbr_init(x86_pmu);
>      x86_pmu->event_constraints = intel_nehalem_event_constraints;
>      x86_pmu->pebs_constraints = intel_nehalem_pebs_event_constraints;
>      x86_pmu->enable_all = intel_pmu_nhm_enable_all;
>      x86_pmu->extra_regs = intel_nehalem_extra_regs;
>
>      if (fake_pmu)
>          return;

It looks similar to the one I shared above; the difference is that more
things (e.g. constraints) are assigned to the fake x86_pmu.
I'm not sure about the logic behind it (it still looks like a workaround).



>
>      /* Global variables should not be updated for fake PMU */
>      memcpy(hw_cache_event_ids, nehalem_hw_cache_event_ids,
>                     sizeof(hw_cache_event_ids));
>      memcpy(hw_cache_extra_regs, nehalem_hw_cache_extra_regs,
>                     sizeof(hw_cache_extra_regs));
>
>
>>
>> I would prefer having them separated as this patch for now - it is 
>> logically more clear to me.
>>
>
> But it will be a problem for maintenance. Perf developer probably 
> forget to update the list in KVM. I think you have to regularly check 
> the perf code.
>

This has been very common in hypervisor development; that's why we have
hypervisor developers here.
When a new platform is added, we will definitely get some work like this
to do.

Best,
Wei
Liang, Kan Jan. 8, 2019, 2:08 p.m. UTC | #14
On 1/8/2019 1:13 AM, Wei Wang wrote:
> On 01/07/2019 10:22 PM, Liang, Kan wrote:
>>
>>> Thanks for sharing. I understand the point of maintaining those 
>>> models at one place,
>>> but this factor-out doesn't seem very elegant to me, like below
>>>
>>> __intel_pmu_init (int model, struct x86_pmu *x86_pmu)
>>> {
>>> ...
>>> switch (model)
>>> case INTEL_FAM6_NEHALEM:
>>> case INTEL_FAM6_NEHALEM_EP:
>>> case INTEL_FAM6_NEHALEM_EX:
>>>      intel_pmu_lbr_init(x86_pmu);
>>>      if (model != boot_cpu_data.x86_model)
>>>          return;
>>>
>>>      /* Other a lot of things init like below..*/
>>>      memcpy(hw_cache_event_ids, nehalem_hw_cache_event_ids,
>>>                     sizeof(hw_cache_event_ids));
>>>      memcpy(hw_cache_extra_regs, nehalem_hw_cache_extra_regs,
>>>                     sizeof(hw_cache_extra_regs));
>>>      x86_pmu.event_constraints = intel_nehalem_event_constraints;
>>>                  x86_pmu.pebs_constraints = 
>>> intel_nehalem_pebs_event_constraints;
>>>                  x86_pmu.enable_all = intel_pmu_nhm_enable_all;
>>>                  x86_pmu.extra_regs = intel_nehalem_extra_regs;
>>>   ...
>>>
>>> Case...
>>> }
>>> We need insert "if (model != boot_cpu_data.x86_model)" in every "Case 
>>> xx".
>>>
>>> What would be the rationale that we only do lbr_init for "x86_pmu"
>>> when model != boot_cpu_data.x86_model?
>>> (It looks more like a workaround to factor-out the function and get 
>>> what we want)
>>
>> I thought the new function may be extended to support fake pmu as below.
>> It's not only for lbr. PMU has many CPU specific features. It can be 
>> used for other features, if you want to check the compatibility in 
>> future. But I don't have an example now.
>>
>> __intel_pmu_init (int model, struct x86_pmu *x86_pmu)
>> {
>> bool fake_pmu = (model != boot_cpu_data.x86_model) ? true : false;
>> ...
>> switch (model)
>> case INTEL_FAM6_NEHALEM:
>> case INTEL_FAM6_NEHALEM_EP:
>> case INTEL_FAM6_NEHALEM_EX:
>>      intel_pmu_lbr_init(x86_pmu);
>>      x86_pmu->event_constraints = intel_nehalem_event_constraints;
>>      x86_pmu->pebs_constraints = intel_nehalem_pebs_event_constraints;
>>      x86_pmu->enable_all = intel_pmu_nhm_enable_all;
>>      x86_pmu->extra_regs = intel_nehalem_extra_regs;
>>
>>      if (fake_pmu)
>>          return;
> 
> It looks similar as the one I shared above, the difference is that more 
> things
> (e.g. constraints) are assigned to x86_fake_pmu.
> I'm not sure about the logic behind it (still look like a workaround).

The fake x86_pmu will include all the features supported on the host. It
would be useful if you want to check other features in the future.

> 
> 
> 
>>
>>      /* Global variables should not be updated for fake PMU */
>>      memcpy(hw_cache_event_ids, nehalem_hw_cache_event_ids,
>>                     sizeof(hw_cache_event_ids));
>>      memcpy(hw_cache_extra_regs, nehalem_hw_cache_extra_regs,
>>                     sizeof(hw_cache_extra_regs));
>>
>>
>>>
>>> I would prefer having them separated as this patch for now - it is 
>>> logically more clear to me.
>>>
>>
>> But it will be a problem for maintenance. Perf developer probably 
>> forget to update the list in KVM. I think you have to regularly check 
>> the perf code.
>>
> 
> It's been very common in hypervisor development. That's why we have 
> hypervisor developers here.
> When a new platform is added, we will definitely get some work like this 
> to do.
>

If that's part of your job, I'm OK with it.

Thanks,
Kan
Wang, Wei W Jan. 9, 2019, 1:54 a.m. UTC | #15
On 01/08/2019 10:08 PM, Liang, Kan wrote:
>
>
> On 1/8/2019 1:13 AM, Wei Wang wrote:
>> On 01/07/2019 10:22 PM, Liang, Kan wrote:
>>>
>>>> Thanks for sharing. I understand the point of maintaining those 
>>>> models at one place,
>>>> but this factor-out doesn't seem very elegant to me, like below
>>>>
>>>> __intel_pmu_init (int model, struct x86_pmu *x86_pmu)
>>>> {
>>>> ...
>>>> switch (model)
>>>> case INTEL_FAM6_NEHALEM:
>>>> case INTEL_FAM6_NEHALEM_EP:
>>>> case INTEL_FAM6_NEHALEM_EX:
>>>>      intel_pmu_lbr_init(x86_pmu);
>>>>      if (model != boot_cpu_data.x86_model)
>>>>          return;
>>>>
>>>>      /* Other a lot of things init like below..*/
>>>>      memcpy(hw_cache_event_ids, nehalem_hw_cache_event_ids,
>>>>                     sizeof(hw_cache_event_ids));
>>>>      memcpy(hw_cache_extra_regs, nehalem_hw_cache_extra_regs,
>>>>                     sizeof(hw_cache_extra_regs));
>>>>      x86_pmu.event_constraints = intel_nehalem_event_constraints;
>>>>                  x86_pmu.pebs_constraints = 
>>>> intel_nehalem_pebs_event_constraints;
>>>>                  x86_pmu.enable_all = intel_pmu_nhm_enable_all;
>>>>                  x86_pmu.extra_regs = intel_nehalem_extra_regs;
>>>>   ...
>>>>
>>>> Case...
>>>> }
>>>> We need insert "if (model != boot_cpu_data.x86_model)" in every 
>>>> "Case xx".
>>>>
>>>> What would be the rationale that we only do lbr_init for "x86_pmu"
>>>> when model != boot_cpu_data.x86_model?
>>>> (It looks more like a workaround to factor-out the function and get 
>>>> what we want)
>>>
>>> I thought the new function may be extended to support fake pmu as 
>>> below.
>>> It's not only for lbr. PMU has many CPU specific features. It can be 
>>> used for other features, if you want to check the compatibility in 
>>> future. But I don't have an example now.
>>>
>>> __intel_pmu_init (int model, struct x86_pmu *x86_pmu)
>>> {
>>> bool fake_pmu = (model != boot_cpu_data.x86_model) ? true : false;
>>> ...
>>> switch (model)
>>> case INTEL_FAM6_NEHALEM:
>>> case INTEL_FAM6_NEHALEM_EP:
>>> case INTEL_FAM6_NEHALEM_EX:
>>>      intel_pmu_lbr_init(x86_pmu);
>>>      x86_pmu->event_constraints = intel_nehalem_event_constraints;
>>>      x86_pmu->pebs_constraints = intel_nehalem_pebs_event_constraints;
>>>      x86_pmu->enable_all = intel_pmu_nhm_enable_all;
>>>      x86_pmu->extra_regs = intel_nehalem_extra_regs;
>>>
>>>      if (fake_pmu)
>>>          return;
>>
>> It looks similar as the one I shared above, the difference is that 
>> more things
>> (e.g. constraints) are assigned to x86_fake_pmu.
>> I'm not sure about the logic behind it (still look like a workaround).
>
> The fake x86_pmu will include all the supported features in host. If 
> you want to check other features in future, it would be useful.
>

OK, I'll think more about whether we can find a cleaner way to factor this out.

Best,
Wei

Patch

diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index 9a327d5..92bdc7d 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -123,6 +123,14 @@  static inline bool guest_cpuid_is_amd(struct kvm_vcpu *vcpu)
 	return best && best->ebx == X86EMUL_CPUID_VENDOR_AuthenticAMD_ebx;
 }
 
+static inline bool guest_cpuid_is_intel(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpuid_entry2 *best;
+
+	best = kvm_find_cpuid_entry(vcpu, 0, 0);
+	return best && best->ebx == X86EMUL_CPUID_VENDOR_GenuineIntel_ebx;
+}
+
 static inline int guest_cpuid_family(struct kvm_vcpu *vcpu)
 {
 	struct kvm_cpuid_entry2 *best;
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 58ead7d..b438ffa 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -299,6 +299,14 @@  int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
 	return 0;
 }
 
+bool kvm_pmu_lbr_enable(struct kvm_vcpu *vcpu)
+{
+	if (guest_cpuid_is_intel(vcpu))
+		return kvm_x86_ops->pmu_ops->lbr_enable(vcpu);
+
+	return false;
+}
+
 void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu)
 {
 	if (lapic_in_kernel(vcpu))
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index ba8898e..5f3c7a4 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -28,6 +28,7 @@  struct kvm_pmu_ops {
 	struct kvm_pmc *(*msr_idx_to_pmc)(struct kvm_vcpu *vcpu, unsigned idx);
 	int (*is_valid_msr_idx)(struct kvm_vcpu *vcpu, unsigned idx);
 	bool (*is_valid_msr)(struct kvm_vcpu *vcpu, u32 msr);
+	bool (*lbr_enable)(struct kvm_vcpu *vcpu);
 	int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
 	int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
 	void (*refresh)(struct kvm_vcpu *vcpu);
@@ -106,6 +107,7 @@  void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel);
 void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int fixed_idx);
 void reprogram_counter(struct kvm_pmu *pmu, int pmc_idx);
 
+bool kvm_pmu_lbr_enable(struct kvm_vcpu *vcpu);
 void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu);
 void kvm_pmu_handle_event(struct kvm_vcpu *vcpu);
 int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
diff --git a/arch/x86/kvm/pmu_intel.c b/arch/x86/kvm/pmu_intel.c
index 5ab4a36..c04cb6d 100644
--- a/arch/x86/kvm/pmu_intel.c
+++ b/arch/x86/kvm/pmu_intel.c
@@ -15,6 +15,7 @@ 
 #include <linux/kvm_host.h>
 #include <linux/perf_event.h>
 #include <asm/perf_event.h>
+#include <asm/intel-family.h>
 #include "x86.h"
 #include "cpuid.h"
 #include "lapic.h"
@@ -164,6 +165,121 @@  static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
 	return ret;
 }
 
+static bool intel_pmu_lbr_enable(struct kvm_vcpu *vcpu)
+{
+	struct kvm *kvm = vcpu->kvm;
+	u8 vcpu_model = guest_cpuid_model(vcpu);
+	unsigned int vcpu_lbr_nr;
+
+	if (x86_perf_get_lbr_stack(&kvm->arch.lbr_stack))
+		return false;
+
+	if (guest_cpuid_family(vcpu) != boot_cpu_data.x86)
+		return false;
+
+	/*
+	 * It could be possible that people have vcpus of old model run on
+	 * physical cpus of newer model, for example a BDW guest on a SKX
+	 * machine (but not possible to be the other way around).
+	 * The BDW guest may not get accurate results on a SKX machine as it
+	 * only reads 16 entries of the lbr stack while there are 32 entries
+	 * of recordings. So we currently forbid the lbr enabling when the
+	 * vcpu and physical cpu see different lbr stack entries.
+	 */
+	switch (vcpu_model) {
+	case INTEL_FAM6_CORE2_MEROM:
+	case INTEL_FAM6_CORE2_MEROM_L:
+	case INTEL_FAM6_CORE2_PENRYN:
+	case INTEL_FAM6_CORE2_DUNNINGTON:
+		/* intel_pmu_lbr_init_core() */
+		vcpu_lbr_nr = 4;
+		break;
+	case INTEL_FAM6_NEHALEM:
+	case INTEL_FAM6_NEHALEM_EP:
+	case INTEL_FAM6_NEHALEM_EX:
+		/* intel_pmu_lbr_init_nhm() */
+		vcpu_lbr_nr = 16;
+		break;
+	case INTEL_FAM6_ATOM_BONNELL:
+	case INTEL_FAM6_ATOM_BONNELL_MID:
+	case INTEL_FAM6_ATOM_SALTWELL:
+	case INTEL_FAM6_ATOM_SALTWELL_MID:
+	case INTEL_FAM6_ATOM_SALTWELL_TABLET:
+		/* intel_pmu_lbr_init_atom() */
+		vcpu_lbr_nr = 8;
+		break;
+	case INTEL_FAM6_ATOM_SILVERMONT:
+	case INTEL_FAM6_ATOM_SILVERMONT_X:
+	case INTEL_FAM6_ATOM_SILVERMONT_MID:
+	case INTEL_FAM6_ATOM_AIRMONT:
+	case INTEL_FAM6_ATOM_AIRMONT_MID:
+		/* intel_pmu_lbr_init_slm() */
+		vcpu_lbr_nr = 8;
+		break;
+	case INTEL_FAM6_ATOM_GOLDMONT:
+	case INTEL_FAM6_ATOM_GOLDMONT_X:
+		/* intel_pmu_lbr_init_skl(); */
+		vcpu_lbr_nr = 32;
+		break;
+	case INTEL_FAM6_ATOM_GOLDMONT_PLUS:
+		/* intel_pmu_lbr_init_skl()*/
+		vcpu_lbr_nr = 32;
+		break;
+	case INTEL_FAM6_WESTMERE:
+	case INTEL_FAM6_WESTMERE_EP:
+	case INTEL_FAM6_WESTMERE_EX:
+		/* intel_pmu_lbr_init_nhm() */
+		vcpu_lbr_nr = 16;
+		break;
+	case INTEL_FAM6_SANDYBRIDGE:
+	case INTEL_FAM6_SANDYBRIDGE_X:
+		/* intel_pmu_lbr_init_snb() */
+		vcpu_lbr_nr = 16;
+		break;
+	case INTEL_FAM6_IVYBRIDGE:
+	case INTEL_FAM6_IVYBRIDGE_X:
+		/* intel_pmu_lbr_init_snb() */
+		vcpu_lbr_nr = 16;
+		break;
+	case INTEL_FAM6_HASWELL_CORE:
+	case INTEL_FAM6_HASWELL_X:
+	case INTEL_FAM6_HASWELL_ULT:
+	case INTEL_FAM6_HASWELL_GT3E:
+		/* intel_pmu_lbr_init_hsw() */
+		vcpu_lbr_nr = 16;
+		break;
+	case INTEL_FAM6_BROADWELL_CORE:
+	case INTEL_FAM6_BROADWELL_XEON_D:
+	case INTEL_FAM6_BROADWELL_GT3E:
+	case INTEL_FAM6_BROADWELL_X:
+		/* intel_pmu_lbr_init_hsw() */
+		vcpu_lbr_nr = 16;
+		break;
+	case INTEL_FAM6_XEON_PHI_KNL:
+	case INTEL_FAM6_XEON_PHI_KNM:
+		/* intel_pmu_lbr_init_knl() */
+		vcpu_lbr_nr = 8;
+		break;
+	case INTEL_FAM6_SKYLAKE_MOBILE:
+	case INTEL_FAM6_SKYLAKE_DESKTOP:
+	case INTEL_FAM6_SKYLAKE_X:
+	case INTEL_FAM6_KABYLAKE_MOBILE:
+	case INTEL_FAM6_KABYLAKE_DESKTOP:
+		/* intel_pmu_lbr_init_skl() */
+		vcpu_lbr_nr = 32;
+		break;
+	default:
+		vcpu_lbr_nr = 0;
+		pr_warn("%s: vcpu model not supported %d\n", __func__,
+			vcpu_model);
+	}
+
+	if (vcpu_lbr_nr != kvm->arch.lbr_stack.nr)
+		return false;
+
+	return true;
+}
+
 static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -350,6 +466,7 @@  struct kvm_pmu_ops intel_pmu_ops = {
 	.msr_idx_to_pmc = intel_msr_idx_to_pmc,
 	.is_valid_msr_idx = intel_is_valid_msr_idx,
 	.is_valid_msr = intel_is_valid_msr,
+	.lbr_enable = intel_pmu_lbr_enable,
 	.get_msr = intel_pmu_get_msr,
 	.set_msr = intel_pmu_set_msr,
 	.refresh = intel_pmu_refresh,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 50efee4..02e29fd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4505,8 +4505,7 @@  static int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		break;
 	case KVM_CAP_X86_GUEST_LBR:
 		r = -EINVAL;
-		if (cap->args[0] &&
-		    x86_perf_get_lbr_stack(&kvm->arch.lbr_stack)) {
+		if (cap->args[0] && !kvm_pmu_lbr_enable(kvm->vcpus[0])) {
 			pr_err("Failed to enable the guest lbr feature\n");
 			break;
 		}
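
For context, a minimal userspace sketch of how this capability could be
turned on through KVM_ENABLE_CAP on the VM fd. KVM_CAP_X86_GUEST_LBR is
the capability introduced earlier in this series; the surrounding VM setup
is assumed:

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* vm_fd is a VM file descriptor obtained via KVM_CREATE_VM. */
static int enable_guest_lbr(int vm_fd)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_X86_GUEST_LBR,
		.args = { 1 },	/* args[0] != 0 requests enabling guest LBR */
	};

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}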