| Message ID | 20210511024214.280733-8-like.xu@linux.intel.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | KVM: x86/pmu: Add *basic* support to enable guest PEBS via DS |
On Tue, May 11, 2021 at 10:42:05AM +0800, Like Xu wrote:
> +	if (pebs) {
> +		/*
> +		 * The non-zero precision level of guest event makes the ordinary
> +		 * guest event becomes a guest PEBS event and triggers the host
> +		 * PEBS PMI handler to determine whether the PEBS overflow PMI
> +		 * comes from the host counters or the guest.
> +		 *
> +		 * For most PEBS hardware events, the difference in the software
> +		 * precision levels of guest and host PEBS events will not affect
> +		 * the accuracy of the PEBS profiling result, because the "event IP"
> +		 * in the PEBS record is calibrated on the guest side.
> +		 */
> +		attr.precise_ip = 1;
> +	}

You've just destroyed precdist, no?
On Tue, May 11, 2021 at 10:42:05AM +0800, Like Xu wrote:
> @@ -99,6 +109,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
>  				  bool exclude_kernel, bool intr,
>  				  bool in_tx, bool in_tx_cp)
>  {
> +	struct kvm_pmu *pmu = vcpu_to_pmu(pmc->vcpu);
>  	struct perf_event *event;
>  	struct perf_event_attr attr = {
>  		.type = type,
> @@ -110,6 +121,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
>  		.exclude_kernel = exclude_kernel,
>  		.config = config,
>  	};
> +	bool pebs = test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable);
>  
>  	attr.sample_period = get_sample_period(pmc, pmc->counter);
>  
> @@ -124,9 +136,23 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
>  		attr.sample_period = 0;
>  		attr.config |= HSW_IN_TX_CHECKPOINTED;
>  	}
> +	if (pebs) {
> +		/*
> +		 * The non-zero precision level of guest event makes the ordinary
> +		 * guest event becomes a guest PEBS event and triggers the host
> +		 * PEBS PMI handler to determine whether the PEBS overflow PMI
> +		 * comes from the host counters or the guest.
> +		 *
> +		 * For most PEBS hardware events, the difference in the software
> +		 * precision levels of guest and host PEBS events will not affect
> +		 * the accuracy of the PEBS profiling result, because the "event IP"
> +		 * in the PEBS record is calibrated on the guest side.
> +		 */
> +		attr.precise_ip = 1;
> +	}
>  
>  	event = perf_event_create_kernel_counter(&attr, -1, current,
> -						 intr ? kvm_perf_overflow_intr :
> +						 (intr || pebs) ? kvm_perf_overflow_intr :
>  						 kvm_perf_overflow, pmc);

How would pebs && !intr be possible?

Also; wouldn't this be more legible when written like:

	perf_overflow_handler_t ovf = kvm_perf_overflow;

	...

	if (intr)
		ovf = kvm_perf_overflow_intr;

	...

	event = perf_event_create_kernel_counter(&attr, -1, current, ovf, pmc);
On 5/17/2021 1:39 AM, Peter Zijlstra wrote:
> On Tue, May 11, 2021 at 10:42:05AM +0800, Like Xu wrote:
>> +	if (pebs) {
>> +		/*
>> +		 * The non-zero precision level of guest event makes the ordinary
>> +		 * guest event becomes a guest PEBS event and triggers the host
>> +		 * PEBS PMI handler to determine whether the PEBS overflow PMI
>> +		 * comes from the host counters or the guest.
>> +		 *
>> +		 * For most PEBS hardware events, the difference in the software
>> +		 * precision levels of guest and host PEBS events will not affect
>> +		 * the accuracy of the PEBS profiling result, because the "event IP"
>> +		 * in the PEBS record is calibrated on the guest side.
>> +		 */
>> +		attr.precise_ip = 1;
>> +	}
> You've just destroyed precdist, no?

precdist can mean multiple things:

- Convert cycles to the precise INST_RETIRED event. That is not meaningful
  for virtualization because "cycles" doesn't exist, just the raw events.

- For GLC+ and TNT+ it will force the event to a specific counter that is
  more precise. This would be indeed "destroyed", but right now the patch kit
  only supports Icelake which doesn't support that anyways.

So I think the code is correct for now, but will need to be changed for
later CPUs. Should perhaps fix the comment though to discuss this.

-Andi
On Mon, May 17, 2021 at 07:44:15AM -0700, Andi Kleen wrote:
> On 5/17/2021 1:39 AM, Peter Zijlstra wrote:
> > On Tue, May 11, 2021 at 10:42:05AM +0800, Like Xu wrote:
> > > +	if (pebs) {
> > > +		/*
> > > +		 * The non-zero precision level of guest event makes the ordinary
> > > +		 * guest event becomes a guest PEBS event and triggers the host
> > > +		 * PEBS PMI handler to determine whether the PEBS overflow PMI
> > > +		 * comes from the host counters or the guest.
> > > +		 *
> > > +		 * For most PEBS hardware events, the difference in the software
> > > +		 * precision levels of guest and host PEBS events will not affect
> > > +		 * the accuracy of the PEBS profiling result, because the "event IP"
> > > +		 * in the PEBS record is calibrated on the guest side.
> > > +		 */
> > > +		attr.precise_ip = 1;
> > > +	}
> > You've just destroyed precdist, no?
>
> precdist can mean multiple things:
>
> - Convert cycles to the precise INST_RETIRED event. That is not meaningful
>   for virtualization because "cycles" doesn't exist, just the raw events.
>
> - For GLC+ and TNT+ it will force the event to a specific counter that is
>   more precise. This would be indeed "destroyed", but right now the patch kit
>   only supports Icelake which doesn't support that anyways.
>
> So I think the code is correct for now, but will need to be changed for
> later CPUs. Should perhaps fix the comment though to discuss this.

OK, can we then do a better comment that explains *why* this is correct
now and what needs help later?

Because IIUC the only reason it is correct now is because:

 - we only support ICL

   * and ICL has pebs_format>=2, so {1,2} are the same
   * and ICL doesn't have precise_ip==3 support

 - Other hardware (GLC+, TNT+) that could possibly care here
   is unsupported atm. but needs changes.

None of which is actually mentioned in that comment it does have.
On 2021/5/18 16:47, Peter Zijlstra wrote:
> On Mon, May 17, 2021 at 07:44:15AM -0700, Andi Kleen wrote:
>> On 5/17/2021 1:39 AM, Peter Zijlstra wrote:
>>> On Tue, May 11, 2021 at 10:42:05AM +0800, Like Xu wrote:
>>>> +	if (pebs) {
>>>> +		/*
>>>> +		 * The non-zero precision level of guest event makes the ordinary
>>>> +		 * guest event becomes a guest PEBS event and triggers the host
>>>> +		 * PEBS PMI handler to determine whether the PEBS overflow PMI
>>>> +		 * comes from the host counters or the guest.
>>>> +		 *
>>>> +		 * For most PEBS hardware events, the difference in the software
>>>> +		 * precision levels of guest and host PEBS events will not affect
>>>> +		 * the accuracy of the PEBS profiling result, because the "event IP"
>>>> +		 * in the PEBS record is calibrated on the guest side.
>>>> +		 */
>>>> +		attr.precise_ip = 1;
>>>> +	}
>>> You've just destroyed precdist, no?
>> precdist can mean multiple things:
>>
>> - Convert cycles to the precise INST_RETIRED event. That is not meaningful
>>   for virtualization because "cycles" doesn't exist, just the raw events.
>>
>> - For GLC+ and TNT+ it will force the event to a specific counter that is
>>   more precise. This would be indeed "destroyed", but right now the patch kit
>>   only supports Icelake which doesn't support that anyways.
>>
>> So I think the code is correct for now, but will need to be changed for
>> later CPUs. Should perhaps fix the comment though to discuss this.
> OK, can we then do a better comment that explains *why* this is correct
> now and what needs help later?
>
> Because IIUC the only reason it is correct now is because:
>
>  - we only support ICL
>
>    * and ICL has pebs_format>=2, so {1,2} are the same
>    * and ICL doesn't have precise_ip==3 support
>
>  - Other hardware (GLC+, TNT+) that could possibly care here
>    is unsupported atm. but needs changes.
>
> None of which is actually mentioned in that comment it does have.

Hi Andi & Peter,

By "precdist", do you mean the "Precise Distribution of Instructions
Retired (PDIR) Facility"?

The SDM says the Ice Lake microarchitecture does support PEBS-PDIR on
IA32_FIXED0 only, and this patch kit enables it in patch 0011; please
take a look.

Or do I miss something about precdist on ICL?

Thanks,
Like Xu
On 2021/5/17 17:14, Peter Zijlstra wrote:
> On Tue, May 11, 2021 at 10:42:05AM +0800, Like Xu wrote:
>> @@ -99,6 +109,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
>>  				  bool exclude_kernel, bool intr,
>>  				  bool in_tx, bool in_tx_cp)
>>  {
>> +	struct kvm_pmu *pmu = vcpu_to_pmu(pmc->vcpu);
>>  	struct perf_event *event;
>>  	struct perf_event_attr attr = {
>>  		.type = type,
>> @@ -110,6 +121,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
>>  		.exclude_kernel = exclude_kernel,
>>  		.config = config,
>>  	};
>> +	bool pebs = test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable);
>>  
>>  	attr.sample_period = get_sample_period(pmc, pmc->counter);
>>  
>> @@ -124,9 +136,23 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
>>  		attr.sample_period = 0;
>>  		attr.config |= HSW_IN_TX_CHECKPOINTED;
>>  	}
>> +	if (pebs) {
>> +		/*
>> +		 * The non-zero precision level of guest event makes the ordinary
>> +		 * guest event becomes a guest PEBS event and triggers the host
>> +		 * PEBS PMI handler to determine whether the PEBS overflow PMI
>> +		 * comes from the host counters or the guest.
>> +		 *
>> +		 * For most PEBS hardware events, the difference in the software
>> +		 * precision levels of guest and host PEBS events will not affect
>> +		 * the accuracy of the PEBS profiling result, because the "event IP"
>> +		 * in the PEBS record is calibrated on the guest side.
>> +		 */
>> +		attr.precise_ip = 1;
>> +	}
>>  
>>  	event = perf_event_create_kernel_counter(&attr, -1, current,
>> -						 intr ? kvm_perf_overflow_intr :
>> +						 (intr || pebs) ? kvm_perf_overflow_intr :
>>  						 kvm_perf_overflow, pmc);
> How would pebs && !intr be possible?

I don't think it's possible.

> Also; wouldn't this be more legible
> when written like:
>
> 	perf_overflow_handler_t ovf = kvm_perf_overflow;
>
> 	...
>
> 	if (intr)
> 		ovf = kvm_perf_overflow_intr;
>
> 	...
>
> 	event = perf_event_create_kernel_counter(&attr, -1, current, ovf, pmc);

Please yell if you don't like this:

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 711294babb97..a607f5a1b9cd 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -122,6 +122,8 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
 		.config = config,
 	};
 	bool pebs = test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable);
+	perf_overflow_handler_t ovf = (intr || pebs) ?
+		kvm_perf_overflow_intr : kvm_perf_overflow;
 
 	attr.sample_period = get_sample_period(pmc, pmc->counter);
 
@@ -151,9 +153,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
 		attr.precise_ip = 1;
 	}
 
-	event = perf_event_create_kernel_counter(&attr, -1, current,
-						 (intr || pebs) ? kvm_perf_overflow_intr :
-						 kvm_perf_overflow, pmc);
+	event = perf_event_create_kernel_counter(&attr, -1, current, ovf, pmc);
 	if (IS_ERR(event)) {
 		pr_debug_ratelimited("kvm_pmu: event creation failed %ld for pmc->idx = %d\n",
 			    PTR_ERR(event), pmc->idx);
On Tue, May 18, 2021 at 09:28:52PM +0800, Xu, Like wrote:
> > How would pebs && !intr be possible?
>
> I don't think it's possible.

And yet you keep that 'intr||pebs' weirdness :/

> > Also; wouldn't this be more legible
> > when written like:
> >
> > 	perf_overflow_handler_t ovf = kvm_perf_overflow;
> >
> > 	...
> >
> > 	if (intr)
> > 		ovf = kvm_perf_overflow_intr;
> >
> > 	...
> >
> > 	event = perf_event_create_kernel_counter(&attr, -1, current, ovf, pmc);
>
> Please yell if you don't like this:
>
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 711294babb97..a607f5a1b9cd 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -122,6 +122,8 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
>  		.config = config,
>  	};
>  	bool pebs = test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable);
> +	perf_overflow_handler_t ovf = (intr || pebs) ?
> +		kvm_perf_overflow_intr : kvm_perf_overflow;

This, that's exactly the kind of code I wanted to get rid of. ?: has
its place I suppose, but you're creating dense ugly code for no reason.

	perf_overflow_handler_t ovf = kvm_perf_overflow;

	if (intr)
		ovf = kvm_perf_overflow_intr;

Is so much easier to read. And if you really worry about that pebs
thing; you can add:

	WARN_ON_ONCE(pebs && !intr);
On 2021/5/18 21:36, Peter Zijlstra wrote:
> On Tue, May 18, 2021 at 09:28:52PM +0800, Xu, Like wrote:
>
>>> How would pebs && !intr be possible?
>> I don't think it's possible.
> And yet you keep that 'intr||pebs' weirdness :/
>
>>> Also; wouldn't this be more legible
>>> when written like:
>>>
>>> 	perf_overflow_handler_t ovf = kvm_perf_overflow;
>>>
>>> 	...
>>>
>>> 	if (intr)
>>> 		ovf = kvm_perf_overflow_intr;
>>>
>>> 	...
>>>
>>> 	event = perf_event_create_kernel_counter(&attr, -1, current, ovf, pmc);
>>>
>> Please yell if you don't like this:
>>
>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>> index 711294babb97..a607f5a1b9cd 100644
>> --- a/arch/x86/kvm/pmu.c
>> +++ b/arch/x86/kvm/pmu.c
>> @@ -122,6 +122,8 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
>>  		.config = config,
>>  	};
>>  	bool pebs = test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable);
>> +	perf_overflow_handler_t ovf = (intr || pebs) ?
>> +		kvm_perf_overflow_intr : kvm_perf_overflow;
> This, that's exactly the kind of code I wanted to get rid of. ?: has
> its place I suppose, but you're creating dense ugly code for no reason.
>
> 	perf_overflow_handler_t ovf = kvm_perf_overflow;
>
> 	if (intr)
> 		ovf = kvm_perf_overflow_intr;
>
> Is so much easier to read. And if you really worry about that pebs
> thing; you can add:
>
> 	WARN_ON_ONCE(pebs && !intr);

Thanks! Glad you could review my code. As a new generation, we do
appreciate your patient guidance on your taste in code.
> By "precdist", do you mean the "Precise Distribution of Instructions
> Retired (PDIR) Facility"?

This was referring to perf's precise_ip field.

> The SDM says Ice Lake Microarchitecture does support PEBS-PDIR on
> IA32_FIXED0 only.
> And this patch kit enables it in patch 0011, please take a look.
>
> Or do I miss something about precdist on ICL?

On Icelake everything is fine. It just would need changes on other CPUs,
but that can be done later.

-Andi
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 827886c12c16..0f86c1142f17 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -74,11 +74,21 @@ static void kvm_perf_overflow_intr(struct perf_event *perf_event,
 {
 	struct kvm_pmc *pmc = perf_event->overflow_handler_context;
 	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
+	bool skip_pmi = false;
 
 	if (!test_and_set_bit(pmc->idx, pmu->reprogram_pmi)) {
-		__set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
+		if (perf_event->attr.precise_ip) {
+			/* Indicate PEBS overflow PMI to guest. */
+			skip_pmi = __test_and_set_bit(GLOBAL_STATUS_BUFFER_OVF_BIT,
+						      (unsigned long *)&pmu->global_status);
+		} else {
+			__set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
+		}
 		kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
 
+		if (skip_pmi)
+			return;
+
 		/*
 		 * Inject PMI. If vcpu was in a guest mode during NMI PMI
 		 * can be ejected on a guest mode re-entry. Otherwise we can't
@@ -99,6 +109,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
 				  bool exclude_kernel, bool intr,
 				  bool in_tx, bool in_tx_cp)
 {
+	struct kvm_pmu *pmu = vcpu_to_pmu(pmc->vcpu);
 	struct perf_event *event;
 	struct perf_event_attr attr = {
 		.type = type,
@@ -110,6 +121,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
 		.exclude_kernel = exclude_kernel,
 		.config = config,
 	};
+	bool pebs = test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable);
 
 	attr.sample_period = get_sample_period(pmc, pmc->counter);
 
@@ -124,9 +136,23 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
 		attr.sample_period = 0;
 		attr.config |= HSW_IN_TX_CHECKPOINTED;
 	}
+	if (pebs) {
+		/*
+		 * The non-zero precision level of guest event makes the ordinary
+		 * guest event becomes a guest PEBS event and triggers the host
+		 * PEBS PMI handler to determine whether the PEBS overflow PMI
+		 * comes from the host counters or the guest.
+		 *
+		 * For most PEBS hardware events, the difference in the software
+		 * precision levels of guest and host PEBS events will not affect
+		 * the accuracy of the PEBS profiling result, because the "event IP"
+		 * in the PEBS record is calibrated on the guest side.
+		 */
+		attr.precise_ip = 1;
+	}
 
 	event = perf_event_create_kernel_counter(&attr, -1, current,
-						 intr ? kvm_perf_overflow_intr :
+						 (intr || pebs) ? kvm_perf_overflow_intr :
 						 kvm_perf_overflow, pmc);
 	if (IS_ERR(event)) {
 		pr_debug_ratelimited("kvm_pmu: event creation failed %ld for pmc->idx = %d\n",
@@ -161,6 +187,10 @@ static bool pmc_resume_counter(struct kvm_pmc *pmc)
 			     get_sample_period(pmc, pmc->counter)))
 		return false;
 
+	if (!test_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->pebs_enable) &&
+	    pmc->perf_event->attr.precise_ip)
+		return false;
+
 	/* reuse perf_event to serve as pmc_reprogram_counter() does*/
 	perf_event_enable(pmc->perf_event);