diff mbox series

[v6,07/16] KVM: x86/pmu: Reprogram PEBS event to emulate guest PEBS counter

Message ID 20210511024214.280733-8-like.xu@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series KVM: x86/pmu: Add *basic* support to enable guest PEBS via DS | expand

Commit Message

Like Xu May 11, 2021, 2:42 a.m. UTC
When a guest counter is configured as a PEBS counter through
IA32_PEBS_ENABLE, a guest PEBS event will be reprogrammed by
configuring a non-zero precision level in the perf_event_attr.

The guest PEBS overflow PMI bit would be set in the guest
GLOBAL_STATUS MSR when PEBS facility generates a PEBS
overflow PMI based on guest IA32_DS_AREA MSR.

Even with the same counter index and the same event code and
mask, guest PEBS events will not be reused for non-PEBS events.

Originally-by: Andi Kleen <ak@linux.intel.com>
Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
---
 arch/x86/kvm/pmu.c | 34 ++++++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

Comments

Peter Zijlstra May 17, 2021, 8:39 a.m. UTC | #1
On Tue, May 11, 2021 at 10:42:05AM +0800, Like Xu wrote:
> +	if (pebs) {
> +		/*
> +		 * The non-zero precision level of guest event makes the ordinary
> +		 * guest event becomes a guest PEBS event and triggers the host
> +		 * PEBS PMI handler to determine whether the PEBS overflow PMI
> +		 * comes from the host counters or the guest.
> +		 *
> +		 * For most PEBS hardware events, the difference in the software
> +		 * precision levels of guest and host PEBS events will not affect
> +		 * the accuracy of the PEBS profiling result, because the "event IP"
> +		 * in the PEBS record is calibrated on the guest side.
> +		 */
> +		attr.precise_ip = 1;
> +	}

You've just destroyed precdist, no?
Peter Zijlstra May 17, 2021, 9:14 a.m. UTC | #2
On Tue, May 11, 2021 at 10:42:05AM +0800, Like Xu wrote:
> @@ -99,6 +109,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
>  				  bool exclude_kernel, bool intr,
>  				  bool in_tx, bool in_tx_cp)
>  {
> +	struct kvm_pmu *pmu = vcpu_to_pmu(pmc->vcpu);
>  	struct perf_event *event;
>  	struct perf_event_attr attr = {
>  		.type = type,
> @@ -110,6 +121,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
>  		.exclude_kernel = exclude_kernel,
>  		.config = config,
>  	};
> +	bool pebs = test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable);
>  
>  	attr.sample_period = get_sample_period(pmc, pmc->counter);
>  
> @@ -124,9 +136,23 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
>  		attr.sample_period = 0;
>  		attr.config |= HSW_IN_TX_CHECKPOINTED;
>  	}
> +	if (pebs) {
> +		/*
> +		 * The non-zero precision level of guest event makes the ordinary
> +		 * guest event becomes a guest PEBS event and triggers the host
> +		 * PEBS PMI handler to determine whether the PEBS overflow PMI
> +		 * comes from the host counters or the guest.
> +		 *
> +		 * For most PEBS hardware events, the difference in the software
> +		 * precision levels of guest and host PEBS events will not affect
> +		 * the accuracy of the PEBS profiling result, because the "event IP"
> +		 * in the PEBS record is calibrated on the guest side.
> +		 */
> +		attr.precise_ip = 1;
> +	}
>  
>  	event = perf_event_create_kernel_counter(&attr, -1, current,
> -						 intr ? kvm_perf_overflow_intr :
> +						 (intr || pebs) ? kvm_perf_overflow_intr :
>  						 kvm_perf_overflow, pmc);

How would pebs && !intr be possible? Also; wouldn't this be more legible
when written like:

	perf_overflow_handler_t ovf = kvm_perf_overflow;

	...

	if (intr)
		ovf = kvm_perf_overflow_intr;

	...

	event = perf_event_create_kernel_counter(&attr, -1, current, ovf, pmc);
Andi Kleen May 17, 2021, 2:44 p.m. UTC | #3
On 5/17/2021 1:39 AM, Peter Zijlstra wrote:
> On Tue, May 11, 2021 at 10:42:05AM +0800, Like Xu wrote:
>> +	if (pebs) {
>> +		/*
>> +		 * The non-zero precision level of guest event makes the ordinary
>> +		 * guest event becomes a guest PEBS event and triggers the host
>> +		 * PEBS PMI handler to determine whether the PEBS overflow PMI
>> +		 * comes from the host counters or the guest.
>> +		 *
>> +		 * For most PEBS hardware events, the difference in the software
>> +		 * precision levels of guest and host PEBS events will not affect
>> +		 * the accuracy of the PEBS profiling result, because the "event IP"
>> +		 * in the PEBS record is calibrated on the guest side.
>> +		 */
>> +		attr.precise_ip = 1;
>> +	}
> You've just destroyed precdist, no?

precdist can mean multiple things:

- Convert cycles to the precise INST_RETIRED event. That is not 
meaningful for virtualization because "cycles" doesn't exist, just the 
raw events.

- For GLC+ and TNT+ it will force the event to a specific counter that 
is more precise. This would be indeed "destroyed", but right now the 
patch kit only supports Icelake which doesn't support that anyways.

So I think the code is correct for now, but will need to be changed for 
later CPUs. Should perhaps fix the comment though to discuss this.


-Andi
Peter Zijlstra May 18, 2021, 8:47 a.m. UTC | #4
On Mon, May 17, 2021 at 07:44:15AM -0700, Andi Kleen wrote:
> 
> On 5/17/2021 1:39 AM, Peter Zijlstra wrote:
> > On Tue, May 11, 2021 at 10:42:05AM +0800, Like Xu wrote:
> > > +	if (pebs) {
> > > +		/*
> > > +		 * The non-zero precision level of guest event makes the ordinary
> > > +		 * guest event becomes a guest PEBS event and triggers the host
> > > +		 * PEBS PMI handler to determine whether the PEBS overflow PMI
> > > +		 * comes from the host counters or the guest.
> > > +		 *
> > > +		 * For most PEBS hardware events, the difference in the software
> > > +		 * precision levels of guest and host PEBS events will not affect
> > > +		 * the accuracy of the PEBS profiling result, because the "event IP"
> > > +		 * in the PEBS record is calibrated on the guest side.
> > > +		 */
> > > +		attr.precise_ip = 1;
> > > +	}
> > You've just destroyed precdist, no?
> 
> precdist can mean multiple things:
> 
> - Convert cycles to the precise INST_RETIRED event. That is not meaningful
> for virtualization because "cycles" doesn't exist, just the raw events.
> 
> - For GLC+ and TNT+ it will force the event to a specific counter that is
> more precise. This would be indeed "destroyed", but right now the patch kit
> only supports Icelake which doesn't support that anyways.
> 
> So I think the code is correct for now, but will need to be changed for
> later CPUs. Should perhaps fix the comment though to discuss this.

OK, can we then do a better comment that explains *why* this is correct
now and what needs help later?

Because IIUC the only reason it is correct now is because:

 - we only support ICL

   * and ICL has pebs_format>=2, so {1,2} are the same
   * and ICL doesn't have precise_ip==3 support

 - Other hardware (GLC+, TNT+) that could possibly care here
   is unsupported atm. but needs changes.

None of which is actually mentioned in that comment it does have.
Xu, Like May 18, 2021, 1:15 p.m. UTC | #5
On 2021/5/18 16:47, Peter Zijlstra wrote:
> On Mon, May 17, 2021 at 07:44:15AM -0700, Andi Kleen wrote:
>> On 5/17/2021 1:39 AM, Peter Zijlstra wrote:
>>> On Tue, May 11, 2021 at 10:42:05AM +0800, Like Xu wrote:
>>>> +	if (pebs) {
>>>> +		/*
>>>> +		 * The non-zero precision level of guest event makes the ordinary
>>>> +		 * guest event becomes a guest PEBS event and triggers the host
>>>> +		 * PEBS PMI handler to determine whether the PEBS overflow PMI
>>>> +		 * comes from the host counters or the guest.
>>>> +		 *
>>>> +		 * For most PEBS hardware events, the difference in the software
>>>> +		 * precision levels of guest and host PEBS events will not affect
>>>> +		 * the accuracy of the PEBS profiling result, because the "event IP"
>>>> +		 * in the PEBS record is calibrated on the guest side.
>>>> +		 */
>>>> +		attr.precise_ip = 1;
>>>> +	}
>>> You've just destroyed precdist, no?
>> precdist can mean multiple things:
>>
>> - Convert cycles to the precise INST_RETIRED event. That is not meaningful
>> for virtualization because "cycles" doesn't exist, just the raw events.
>>
>> - For GLC+ and TNT+ it will force the event to a specific counter that is
>> more precise. This would be indeed "destroyed", but right now the patch kit
>> only supports Icelake which doesn't support that anyways.
>>
>> So I think the code is correct for now, but will need to be changed for
>> later CPUs. Should perhaps fix the comment though to discuss this.
> OK, can we then do a better comment that explains *why* this is correct
> now and what needs help later?
>
> Because IIUC the only reason it is correct now is because:
>
>   - we only support ICL
>
>     * and ICL has pebs_format>=2, so {1,2} are the same
>     * and ICL doesn't have precise_ip==3 support
>
>   - Other hardware (GLC+, TNT+) that could possibly care here
>     is unsupported atm. but needs changes.
>
> None of which is actually mentioned in that comment it does have.

Hi Andi & Peter,

By "precdist", do you mean the"Precise Distribution of Instructions Retired 
(PDIR) Facility"?

The SDM says Ice Lake Microarchitecture does support PEBS-PDIR on 
IA32_FIXED0 only.
And this patch kit enables it in the patch 0011, please take a look.

Or do I miss something about precdist on ICL ?

Thanks,
Like Xu
Xu, Like May 18, 2021, 1:28 p.m. UTC | #6
On 2021/5/17 17:14, Peter Zijlstra wrote:
> On Tue, May 11, 2021 at 10:42:05AM +0800, Like Xu wrote:
>> @@ -99,6 +109,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
>>   				  bool exclude_kernel, bool intr,
>>   				  bool in_tx, bool in_tx_cp)
>>   {
>> +	struct kvm_pmu *pmu = vcpu_to_pmu(pmc->vcpu);
>>   	struct perf_event *event;
>>   	struct perf_event_attr attr = {
>>   		.type = type,
>> @@ -110,6 +121,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
>>   		.exclude_kernel = exclude_kernel,
>>   		.config = config,
>>   	};
>> +	bool pebs = test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable);
>>   
>>   	attr.sample_period = get_sample_period(pmc, pmc->counter);
>>   
>> @@ -124,9 +136,23 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
>>   		attr.sample_period = 0;
>>   		attr.config |= HSW_IN_TX_CHECKPOINTED;
>>   	}
>> +	if (pebs) {
>> +		/*
>> +		 * The non-zero precision level of guest event makes the ordinary
>> +		 * guest event becomes a guest PEBS event and triggers the host
>> +		 * PEBS PMI handler to determine whether the PEBS overflow PMI
>> +		 * comes from the host counters or the guest.
>> +		 *
>> +		 * For most PEBS hardware events, the difference in the software
>> +		 * precision levels of guest and host PEBS events will not affect
>> +		 * the accuracy of the PEBS profiling result, because the "event IP"
>> +		 * in the PEBS record is calibrated on the guest side.
>> +		 */
>> +		attr.precise_ip = 1;
>> +	}
>>   
>>   	event = perf_event_create_kernel_counter(&attr, -1, current,
>> -						 intr ? kvm_perf_overflow_intr :
>> +						 (intr || pebs) ? kvm_perf_overflow_intr :
>>   						 kvm_perf_overflow, pmc);
> How would pebs && !intr be possible?

I don't think it's possible.

> Also; wouldn't this be more legible
> when written like:
>
> 	perf_overflow_handler_t ovf = kvm_perf_overflow;
>
> 	...
>
> 	if (intr)
> 		ovf = kvm_perf_overflow_intr;
>
> 	...
>
> 	event = perf_event_create_kernel_counter(&attr, -1, current, ovf, pmc);
>

Please yell if you don't like this:

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 711294babb97..a607f5a1b9cd 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -122,6 +122,8 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, 
u32 type,
                 .config = config,
         };
         bool pebs = test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable);
+       perf_overflow_handler_t ovf = (intr || pebs) ?
+               kvm_perf_overflow_intr : kvm_perf_overflow;

         attr.sample_period = get_sample_period(pmc, pmc->counter);

@@ -151,9 +153,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, 
u32 type,
                 attr.precise_ip = 1;
         }

-       event = perf_event_create_kernel_counter(&attr, -1, current,
-                                                (intr || pebs) ? 
kvm_perf_overflow_intr :
-                                                kvm_perf_overflow, pmc);
+       event = perf_event_create_kernel_counter(&attr, -1, current, ovf, pmc);
         if (IS_ERR(event)) {
                 pr_debug_ratelimited("kvm_pmu: event creation failed %ld 
for pmc->idx = %d\n",
                             PTR_ERR(event), pmc->idx);
Peter Zijlstra May 18, 2021, 1:36 p.m. UTC | #7
On Tue, May 18, 2021 at 09:28:52PM +0800, Xu, Like wrote:

> > How would pebs && !intr be possible?
> 
> I don't think it's possible.

And yet you keep that 'intr||pebs' weirdness :/

> > Also; wouldn't this be more legible
> > when written like:
> > 
> > 	perf_overflow_handler_t ovf = kvm_perf_overflow;
> > 
> > 	...
> > 
> > 	if (intr)
> > 		ovf = kvm_perf_overflow_intr;
> > 
> > 	...
> > 
> > 	event = perf_event_create_kernel_counter(&attr, -1, current, ovf, pmc);
> > 
> 
> Please yell if you don't like this:
> 
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 711294babb97..a607f5a1b9cd 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -122,6 +122,8 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc,
> u32 type,
>                 .config = config,
>         };
>         bool pebs = test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable);
> +       perf_overflow_handler_t ovf = (intr || pebs) ?
> +               kvm_perf_overflow_intr : kvm_perf_overflow;

This, that's exactly the kind of code I wanted to get rid of. ?: has
it's place I suppose, but you're creating dense ugly code for no reason.

	perf_overflow_handle_t ovf = kvm_perf_overflow;

	if (intr)
		ovf = kvm_perf_overflow_intr;

Is so much easier to read. And if you really worry about that pebs
thing; you can add:

	WARN_ON_ONCE(pebs && !intr);
Xu, Like May 18, 2021, 2:05 p.m. UTC | #8
On 2021/5/18 21:36, Peter Zijlstra wrote:
> On Tue, May 18, 2021 at 09:28:52PM +0800, Xu, Like wrote:
>
>>> How would pebs && !intr be possible?
>> I don't think it's possible.
> And yet you keep that 'intr||pebs' weirdness :/
>
>>> Also; wouldn't this be more legible
>>> when written like:
>>>
>>> 	perf_overflow_handler_t ovf = kvm_perf_overflow;
>>>
>>> 	...
>>>
>>> 	if (intr)
>>> 		ovf = kvm_perf_overflow_intr;
>>>
>>> 	...
>>>
>>> 	event = perf_event_create_kernel_counter(&attr, -1, current, ovf, pmc);
>>>
>> Please yell if you don't like this:
>>
>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>> index 711294babb97..a607f5a1b9cd 100644
>> --- a/arch/x86/kvm/pmu.c
>> +++ b/arch/x86/kvm/pmu.c
>> @@ -122,6 +122,8 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc,
>> u32 type,
>>                  .config = config,
>>          };
>>          bool pebs = test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable);
>> +       perf_overflow_handler_t ovf = (intr || pebs) ?
>> +               kvm_perf_overflow_intr : kvm_perf_overflow;
> This, that's exactly the kind of code I wanted to get rid of. ?: has
> it's place I suppose, but you're creating dense ugly code for no reason.
>
> 	perf_overflow_handle_t ovf = kvm_perf_overflow;
>
> 	if (intr)
> 		ovf = kvm_perf_overflow_intr;
>
> Is so much easier to read. And if you really worry about that pebs
> thing; you can add:
>
> 	WARN_ON_ONCE(pebs && !intr);
>

Thanks!  Glad you could review my code.
As a new generation, we do appreciate your patient guidance on your taste 
in code.
Andi Kleen May 18, 2021, 3:58 p.m. UTC | #9
>
> By "precdist", do you mean the"Precise Distribution of Instructions 
> Retired (PDIR) Facility"?


This was referring to perf's precise_ip field


>
> The SDM says Ice Lake Microarchitecture does support PEBS-PDIR on 
> IA32_FIXED0 only.
> And this patch kit enables it in the patch 0011, please take a look.
>
> Or do I miss something about precdist on ICL ?


On Icelake everything is fine

It just would need changes on other CPUs, but that can be done later.


-Andi
diff mbox series

Patch

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 827886c12c16..0f86c1142f17 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -74,11 +74,21 @@  static void kvm_perf_overflow_intr(struct perf_event *perf_event,
 {
 	struct kvm_pmc *pmc = perf_event->overflow_handler_context;
 	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
+	bool skip_pmi = false;
 
 	if (!test_and_set_bit(pmc->idx, pmu->reprogram_pmi)) {
-		__set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
+		if (perf_event->attr.precise_ip) {
+			/* Indicate PEBS overflow PMI to guest. */
+			skip_pmi = __test_and_set_bit(GLOBAL_STATUS_BUFFER_OVF_BIT,
+						      (unsigned long *)&pmu->global_status);
+		} else {
+			__set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
+		}
 		kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
 
+		if (skip_pmi)
+			return;
+
 		/*
 		 * Inject PMI. If vcpu was in a guest mode during NMI PMI
 		 * can be ejected on a guest mode re-entry. Otherwise we can't
@@ -99,6 +109,7 @@  static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
 				  bool exclude_kernel, bool intr,
 				  bool in_tx, bool in_tx_cp)
 {
+	struct kvm_pmu *pmu = vcpu_to_pmu(pmc->vcpu);
 	struct perf_event *event;
 	struct perf_event_attr attr = {
 		.type = type,
@@ -110,6 +121,7 @@  static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
 		.exclude_kernel = exclude_kernel,
 		.config = config,
 	};
+	bool pebs = test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable);
 
 	attr.sample_period = get_sample_period(pmc, pmc->counter);
 
@@ -124,9 +136,23 @@  static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
 		attr.sample_period = 0;
 		attr.config |= HSW_IN_TX_CHECKPOINTED;
 	}
+	if (pebs) {
+		/*
+		 * The non-zero precision level of guest event makes the ordinary
+		 * guest event becomes a guest PEBS event and triggers the host
+		 * PEBS PMI handler to determine whether the PEBS overflow PMI
+		 * comes from the host counters or the guest.
+		 *
+		 * For most PEBS hardware events, the difference in the software
+		 * precision levels of guest and host PEBS events will not affect
+		 * the accuracy of the PEBS profiling result, because the "event IP"
+		 * in the PEBS record is calibrated on the guest side.
+		 */
+		attr.precise_ip = 1;
+	}
 
 	event = perf_event_create_kernel_counter(&attr, -1, current,
-						 intr ? kvm_perf_overflow_intr :
+						 (intr || pebs) ? kvm_perf_overflow_intr :
 						 kvm_perf_overflow, pmc);
 	if (IS_ERR(event)) {
 		pr_debug_ratelimited("kvm_pmu: event creation failed %ld for pmc->idx = %d\n",
@@ -161,6 +187,10 @@  static bool pmc_resume_counter(struct kvm_pmc *pmc)
 			      get_sample_period(pmc, pmc->counter)))
 		return false;
 
+	if (!test_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->pebs_enable) &&
+	    pmc->perf_event->attr.precise_ip)
+		return false;
+
 	/* reuse perf_event to serve as pmc_reprogram_counter() does*/
 	perf_event_enable(pmc->perf_event);