
perf/intel: Remove Perfmon-v4 counter_freezing support

Message ID 20201110153721.GQ2651@hirez.programming.kicks-ass.net
State New, archived
Series perf/intel: Remove Perfmon-v4 counter_freezing support

Commit Message

Peter Zijlstra Nov. 10, 2020, 3:37 p.m. UTC
On Tue, Nov 10, 2020 at 04:12:57PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 09, 2020 at 10:12:37AM +0800, Like Xu wrote:
> > The Precise Event Based Sampling(PEBS) supported on Intel Ice Lake server
> > platforms can provide an architectural state of the instruction executed
> > after the instruction that caused the event. This patch set enables the
> > the PEBS via DS feature for KVM (also non) Linux guest on the Ice Lake.
> > The Linux guest can use PEBS feature like native:
> > 
> >   # perf record -e instructions:ppp ./br_instr a
> >   # perf record -c 100000 -e instructions:pp ./br_instr a
> > 
> > If the counter_freezing is not enabled on the host, the guest PEBS will
> > be disabled on purpose when host is using PEBS facility. By default,
> > KVM disables the co-existence of guest PEBS and host PEBS.
> 
> Uuhh, what?!? counter_freezing should never be enabled, its broken. Let
> me go delete all that code.

---
Subject: perf/intel: Remove Perfmon-v4 counter_freezing support

Perfmon-v4 counter freezing is fundamentally broken; remove this default-disabled
code to make sure nobody uses it.

The feature is called Freeze-on-PMI in the SDM, and if it actually did that
there wouldn't be a problem; *however*, it does something subtly
different: it globally disables the whole PMU when it raises the PMI,
not when the PMI hits.

This means there's a window between the PMI getting raised and the PMI
actually getting served in which we lose events, and this violates
perf counter independence. That is, a counting event should not result
in a different event count when a sampling event is co-scheduled.

This is known to break existing software.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/events/intel/core.c | 152 -------------------------------------------
 arch/x86/events/perf_event.h |   3 +-
 2 files changed, 1 insertion(+), 154 deletions(-)

Comments

Stephane Eranian Nov. 10, 2020, 8:52 p.m. UTC | #1
On Tue, Nov 10, 2020 at 7:37 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Nov 10, 2020 at 04:12:57PM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 09, 2020 at 10:12:37AM +0800, Like Xu wrote:
> > > The Precise Event Based Sampling(PEBS) supported on Intel Ice Lake server
> > > platforms can provide an architectural state of the instruction executed
> > > after the instruction that caused the event. This patch set enables the
> > > the PEBS via DS feature for KVM (also non) Linux guest on the Ice Lake.
> > > The Linux guest can use PEBS feature like native:
> > >
> > >   # perf record -e instructions:ppp ./br_instr a
> > >   # perf record -c 100000 -e instructions:pp ./br_instr a
> > >
> > > If the counter_freezing is not enabled on the host, the guest PEBS will
> > > be disabled on purpose when host is using PEBS facility. By default,
> > > KVM disables the co-existence of guest PEBS and host PEBS.
> >
> > Uuhh, what?!? counter_freezing should never be enabled, its broken. Let
> > me go delete all that code.
>
> ---
> Subject: perf/intel: Remove Perfmon-v4 counter_freezing support
>
> Perfmon-v4 counter freezing is fundamentally broken; remove this default
> disabled code to make sure nobody uses it.
>
> The feature is called Freeze-on-PMI in the SDM, and if it would do that,
> there wouldn't actually be a problem, *however* it does something subtly
> different. It globally disables the whole PMU when it raises the PMI,
> not when the PMI hits.
>
> This means there's a window between the PMI getting raised and the PMI
> actually getting served where we loose events and this violates the
> perf counter independence. That is, a counting event should not result
> in a different event count when there is a sampling event co-scheduled.
>

What is implemented is Freeze-on-Overflow, yet it is described as Freeze-on-PMI.
That, in itself, is a problem. I agree with you on that point.

However, there are use cases for both modes.

I can sample on event A and count on B and C, and when A overflows, I want
to snapshot B and C. For that I want B and C at the moment of the overflow,
not at the moment the PMI is delivered; thus you would want the
Freeze-on-Overflow behavior. You can collect in this mode with the perf tool,
IIRC: perf record -e '{cycles,instructions,branches:S}' ....
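
For reference, here is a minimal perf_event_open sketch of that kind of
leader/sibling group, where the sampling leader carries a read of the counting
siblings in each sample via PERF_SAMPLE_READ. The event choices, the period,
and the omitted ring-buffer handling are illustrative assumptions, not exactly
what the perf tool issues:

/* sample_group.c: one sampling leader, two counting siblings. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <stdio.h>

static int perf_open(struct perf_event_attr *attr, int group_fd)
{
	/* pid = 0: this task; cpu = -1: any CPU */
	return syscall(__NR_perf_event_open, attr, 0, -1, group_fd, 0);
}

int main(void)
{
	struct perf_event_attr leader = {
		.type		= PERF_TYPE_HARDWARE,
		.size		= sizeof(struct perf_event_attr),
		.config		= PERF_COUNT_HW_CPU_CYCLES,
		.sample_period	= 1000000,		/* illustrative period */
		.sample_type	= PERF_SAMPLE_READ,	/* each sample carries a read of the whole group */
		.read_format	= PERF_FORMAT_GROUP | PERF_FORMAT_ID,
		.disabled	= 1,
	};
	struct perf_event_attr sibling = {
		.type		= PERF_TYPE_HARDWARE,
		.size		= sizeof(struct perf_event_attr),
	};
	int lead_fd, insn_fd, br_fd;

	lead_fd = perf_open(&leader, -1);

	sibling.config = PERF_COUNT_HW_INSTRUCTIONS;
	insn_fd = perf_open(&sibling, lead_fd);

	sibling.config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS;
	br_fd = perf_open(&sibling, lead_fd);

	if (lead_fd < 0 || insn_fd < 0 || br_fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	/* Samples (with the group readings) land in the leader's mmap ring,
	 * which a real tool would map and parse; omitted here. */
	ioctl(lead_fd, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
	/* ... run the workload of interest ... */
	ioctl(lead_fd, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
	return 0;
}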

The other usage model is that of the replay debugger (rr), which you are
alluding to, and which needs a precise count of an event, including during the
skid window. For that, you need Freeze-on-PMI (delivered). Note that this tool
likely only cares about user-level occurrences of events.
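
For completeness, the user-only counting setup such a tool relies on might
look like the sketch below; the generic branch event is an illustrative
stand-in for the model-specific event rr actually counts:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>

/* Open a pinned, user-only counting event on @pid (illustrative sketch). */
static int open_user_only_counter(pid_t pid)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.type	    = PERF_TYPE_HARDWARE;
	attr.size	    = sizeof(attr);
	attr.config	    = PERF_COUNT_HW_BRANCH_INSTRUCTIONS;
	attr.exclude_kernel = 1;	/* count user level only */
	attr.exclude_hv	    = 1;
	attr.pinned	    = 1;	/* exact counts: never rotate it off the PMU */

	return syscall(__NR_perf_event_open, &attr, pid, -1 /* any cpu */,
		       -1 /* no group */, 0);
}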

As for counter independence, I am not sure it holds in all cases. If the
events are set up for user+kernel then, as soon as you co-schedule a sampling
event, you will likely get more counts on the counting event due to the
additional kernel entries/exits caused by interrupt-based profiling. Even if
you were to restrict to user level only, I would expect to see a few more
counts.


> This is known to break existing software.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/x86/events/intel/core.c | 152 -------------------------------------------
>  arch/x86/events/perf_event.h |   3 +-
>  2 files changed, 1 insertion(+), 154 deletions(-)
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index c79748f6921d..9909dfa6fb12 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -2121,18 +2121,6 @@ static void intel_tfa_pmu_enable_all(int added)
>         intel_pmu_enable_all(added);
>  }
>
> -static void enable_counter_freeze(void)
> -{
> -       update_debugctlmsr(get_debugctlmsr() |
> -                       DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI);
> -}
> -
> -static void disable_counter_freeze(void)
> -{
> -       update_debugctlmsr(get_debugctlmsr() &
> -                       ~DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI);
> -}
> -
>  static inline u64 intel_pmu_get_status(void)
>  {
>         u64 status;
> @@ -2696,95 +2684,6 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
>         return handled;
>  }
>
> -static bool disable_counter_freezing = true;
> -static int __init intel_perf_counter_freezing_setup(char *s)
> -{
> -       bool res;
> -
> -       if (kstrtobool(s, &res))
> -               return -EINVAL;
> -
> -       disable_counter_freezing = !res;
> -       return 1;
> -}
> -__setup("perf_v4_pmi=", intel_perf_counter_freezing_setup);
> -
> -/*
> - * Simplified handler for Arch Perfmon v4:
> - * - We rely on counter freezing/unfreezing to enable/disable the PMU.
> - * This is done automatically on PMU ack.
> - * - Ack the PMU only after the APIC.
> - */
> -
> -static int intel_pmu_handle_irq_v4(struct pt_regs *regs)
> -{
> -       struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> -       int handled = 0;
> -       bool bts = false;
> -       u64 status;
> -       int pmu_enabled = cpuc->enabled;
> -       int loops = 0;
> -
> -       /* PMU has been disabled because of counter freezing */
> -       cpuc->enabled = 0;
> -       if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask)) {
> -               bts = true;
> -               intel_bts_disable_local();
> -               handled = intel_pmu_drain_bts_buffer();
> -               handled += intel_bts_interrupt();
> -       }
> -       status = intel_pmu_get_status();
> -       if (!status)
> -               goto done;
> -again:
> -       intel_pmu_lbr_read();
> -       if (++loops > 100) {
> -               static bool warned;
> -
> -               if (!warned) {
> -                       WARN(1, "perfevents: irq loop stuck!\n");
> -                       perf_event_print_debug();
> -                       warned = true;
> -               }
> -               intel_pmu_reset();
> -               goto done;
> -       }
> -
> -
> -       handled += handle_pmi_common(regs, status);
> -done:
> -       /* Ack the PMI in the APIC */
> -       apic_write(APIC_LVTPC, APIC_DM_NMI);
> -
> -       /*
> -        * The counters start counting immediately while ack the status.
> -        * Make it as close as possible to IRET. This avoids bogus
> -        * freezing on Skylake CPUs.
> -        */
> -       if (status) {
> -               intel_pmu_ack_status(status);
> -       } else {
> -               /*
> -                * CPU may issues two PMIs very close to each other.
> -                * When the PMI handler services the first one, the
> -                * GLOBAL_STATUS is already updated to reflect both.
> -                * When it IRETs, the second PMI is immediately
> -                * handled and it sees clear status. At the meantime,
> -                * there may be a third PMI, because the freezing bit
> -                * isn't set since the ack in first PMI handlers.
> -                * Double check if there is more work to be done.
> -                */
> -               status = intel_pmu_get_status();
> -               if (status)
> -                       goto again;
> -       }
> -
> -       if (bts)
> -               intel_bts_enable_local();
> -       cpuc->enabled = pmu_enabled;
> -       return handled;
> -}
> -
>  /*
>   * This handler is triggered by the local APIC, so the APIC IRQ handling
>   * rules apply:
> @@ -4081,9 +3980,6 @@ static void intel_pmu_cpu_starting(int cpu)
>         if (x86_pmu.version > 1)
>                 flip_smm_bit(&x86_pmu.attr_freeze_on_smi);
>
> -       if (x86_pmu.counter_freezing)
> -               enable_counter_freeze();
> -
>         /* Disable perf metrics if any added CPU doesn't support it. */
>         if (x86_pmu.intel_cap.perf_metrics) {
>                 union perf_capabilities perf_cap;
> @@ -4154,9 +4050,6 @@ static void free_excl_cntrs(struct cpu_hw_events *cpuc)
>  static void intel_pmu_cpu_dying(int cpu)
>  {
>         fini_debug_store_on_cpu(cpu);
> -
> -       if (x86_pmu.counter_freezing)
> -               disable_counter_freeze();
>  }
>
>  void intel_cpuc_finish(struct cpu_hw_events *cpuc)
> @@ -4548,39 +4441,6 @@ static __init void intel_nehalem_quirk(void)
>         }
>  }
>
> -static const struct x86_cpu_desc counter_freezing_ucodes[] = {
> -       INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT,         2, 0x0000000e),
> -       INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT,         9, 0x0000002e),
> -       INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT,        10, 0x00000008),
> -       INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT_D,       1, 0x00000028),
> -       INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT_PLUS,    1, 0x00000028),
> -       INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT_PLUS,    8, 0x00000006),
> -       {}
> -};
> -
> -static bool intel_counter_freezing_broken(void)
> -{
> -       return !x86_cpu_has_min_microcode_rev(counter_freezing_ucodes);
> -}
> -
> -static __init void intel_counter_freezing_quirk(void)
> -{
> -       /* Check if it's already disabled */
> -       if (disable_counter_freezing)
> -               return;
> -
> -       /*
> -        * If the system starts with the wrong ucode, leave the
> -        * counter-freezing feature permanently disabled.
> -        */
> -       if (intel_counter_freezing_broken()) {
> -               pr_info("PMU counter freezing disabled due to CPU errata,"
> -                       "please upgrade microcode\n");
> -               x86_pmu.counter_freezing = false;
> -               x86_pmu.handle_irq = intel_pmu_handle_irq;
> -       }
> -}
> -
>  /*
>   * enable software workaround for errata:
>   * SNB: BJ122
> @@ -4966,9 +4826,6 @@ __init int intel_pmu_init(void)
>                         max((int)edx.split.num_counters_fixed, assume);
>         }
>
> -       if (version >= 4)
> -               x86_pmu.counter_freezing = !disable_counter_freezing;
> -
>         if (boot_cpu_has(X86_FEATURE_PDCM)) {
>                 u64 capabilities;
>
> @@ -5090,7 +4947,6 @@ __init int intel_pmu_init(void)
>
>         case INTEL_FAM6_ATOM_GOLDMONT:
>         case INTEL_FAM6_ATOM_GOLDMONT_D:
> -               x86_add_quirk(intel_counter_freezing_quirk);
>                 memcpy(hw_cache_event_ids, glm_hw_cache_event_ids,
>                        sizeof(hw_cache_event_ids));
>                 memcpy(hw_cache_extra_regs, glm_hw_cache_extra_regs,
> @@ -5117,7 +4973,6 @@ __init int intel_pmu_init(void)
>                 break;
>
>         case INTEL_FAM6_ATOM_GOLDMONT_PLUS:
> -               x86_add_quirk(intel_counter_freezing_quirk);
>                 memcpy(hw_cache_event_ids, glp_hw_cache_event_ids,
>                        sizeof(hw_cache_event_ids));
>                 memcpy(hw_cache_extra_regs, glp_hw_cache_extra_regs,
> @@ -5577,13 +5432,6 @@ __init int intel_pmu_init(void)
>                 pr_cont("full-width counters, ");
>         }
>
> -       /*
> -        * For arch perfmon 4 use counter freezing to avoid
> -        * several MSR accesses in the PMI.
> -        */
> -       if (x86_pmu.counter_freezing)
> -               x86_pmu.handle_irq = intel_pmu_handle_irq_v4;
> -
>         if (x86_pmu.intel_cap.perf_metrics)
>                 x86_pmu.intel_ctrl |= 1ULL << GLOBAL_CTRL_EN_PERF_METRICS;
>
> diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> index 10032f023fcc..4084fa31cc21 100644
> --- a/arch/x86/events/perf_event.h
> +++ b/arch/x86/events/perf_event.h
> @@ -681,8 +681,7 @@ struct x86_pmu {
>
>         /* PMI handler bits */
>         unsigned int    late_ack                :1,
> -                       enabled_ack             :1,
> -                       counter_freezing        :1;
> +                       enabled_ack             :1;
>         /*
>          * sysfs attrs
>          */
Xu, Like Nov. 11, 2020, 2:42 a.m. UTC | #2
Hi Peter,

On 2020/11/11 4:52, Stephane Eranian wrote:
> On Tue, Nov 10, 2020 at 7:37 AM Peter Zijlstra<peterz@infradead.org>  wrote:
>> On Tue, Nov 10, 2020 at 04:12:57PM +0100, Peter Zijlstra wrote:
>>> On Mon, Nov 09, 2020 at 10:12:37AM +0800, Like Xu wrote:
>>>> The Precise Event Based Sampling(PEBS) supported on Intel Ice Lake server
>>>> platforms can provide an architectural state of the instruction executed
>>>> after the instruction that caused the event. This patch set enables the
>>>> the PEBS via DS feature for KVM (also non) Linux guest on the Ice Lake.
>>>> The Linux guest can use PEBS feature like native:
>>>>
>>>>    # perf record -e instructions:ppp ./br_instr a
>>>>    # perf record -c 100000 -e instructions:pp ./br_instr a
>>>>
>>>> If the counter_freezing is not enabled on the host, the guest PEBS will
>>>> be disabled on purpose when host is using PEBS facility. By default,
>>>> KVM disables the co-existence of guest PEBS and host PEBS.
Thanks Stephane for clarifying the use cases for Freeze-on-[PMI|Overflow].

Please let me express it more clearly.

The goal of the whole patch set is to enable guest PEBS, regardless of
whether counter_freezing is enabled or not. By default, it will not
allow the guest and the host to use PEBS at the same time.

Please continue reviewing the patch set, especially the slow path
we proposed this time and the related host perf changes:

- add intel_pmu_handle_guest_pebs() to __intel_pmu_pebs_event();
- add switch MSRs (PEBS_ENABLE, DS_AREA, DATA_CFG) to intel_guest_get_msrs();
- the construction of the incoming parameters for
perf_event_create_kernel_counter() (see the sketch below).
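
To make the last item a bit more concrete, here is a rough sketch (not the
code from the series; the field choices are assumptions) of how KVM might
describe a guest-programmed PEBS counter to the host perf core:

#include <linux/perf_event.h>
#include <linux/sched.h>

/*
 * Illustrative only: build a host perf event for a counter the guest has
 * programmed, so the host perf core arbitrates the hardware resource.
 */
static struct perf_event *guest_pebs_counter_sketch(u64 guest_eventsel,
						    u64 sample_period)
{
	struct perf_event_attr attr = {
		.type		= PERF_TYPE_RAW,
		.size		= sizeof(attr),
		.config		= guest_eventsel,	/* event select + umask from the guest */
		.sample_period	= sample_period,	/* reload value the guest programmed */
		.precise_ip	= 1,			/* ask for a PEBS-capable counter */
		.exclude_host	= 1,			/* count only while the guest runs */
	};

	/*
	 * Follow the current (vCPU) task; a real implementation would also
	 * install an overflow callback instead of passing NULL.
	 */
	return perf_event_create_kernel_counter(&attr, -1, current, NULL, NULL);
}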

I believe that once you understand the general idea, your comments will be
very valuable.

Thanks,
Like Xu

>>> Uuhh, what?!? counter_freezing should never be enabled, its broken. Let
>>> me go delete all that code.
>> ---
>> Subject: perf/intel: Remove Perfmon-v4 counter_freezing support
>>
>> Perfmon-v4 counter freezing is fundamentally broken; remove this default
>> disabled code to make sure nobody uses it.
>>
>> The feature is called Freeze-on-PMI in the SDM, and if it would do that,
>> there wouldn't actually be a problem,*however*  it does something subtly
>> different. It globally disables the whole PMU when it raises the PMI,
>> not when the PMI hits.
>>
>> This means there's a window between the PMI getting raised and the PMI
>> actually getting served where we loose events and this violates the
>> perf counter independence. That is, a counting event should not result
>> in a different event count when there is a sampling event co-scheduled.
>>
> What is implemented is Freeze-on-Overflow, yet it is described as Freeze-on-PMI.
> That, in itself, is a problem. I agree with you on that point.
>
> However, there are use cases for both modes.
>
> I can sample on event A and count on B, C and when A overflows, I want
> to snapshot B, C.
> For that I want B, C at the moment of the overflow, not at the moment
> the PMI is delivered. Thus, youd
> would want the Freeze-on-overflow behavior. You can collect in this
> mode with the perf tool,
> IIRC: perf record -e '{cycles,instructions,branches:S}' ....
>
> The other usage model is that of the replay-debugger (rr) which you are alluding
> to, which needs precise count of an event including during the skid
> window. For that, you need
> Freeze-on-PMI (delivered). Note that this tool likely only cares about
> user level occurrences of events.
>
> As for counter independence, I am not sure it holds in all cases. If
> the events are setup for user+kernel
> then, as soon as you co-schedule a sampling event, you will likely get
> more counts on the counting
> event due to the additional kernel entries/exits caused by
> interrupt-based profiling. Even if you were to
> restrict to user level only, I would expect to see a few more counts.
>
>
>> This is known to break existing software.
>>
>> Signed-off-by: Peter Zijlstra (Intel)<peterz@infradead.org>
Peter Zijlstra Nov. 11, 2020, 8:38 a.m. UTC | #3
On Tue, Nov 10, 2020 at 12:52:04PM -0800, Stephane Eranian wrote:

> What is implemented is Freeze-on-Overflow, yet it is described as Freeze-on-PMI.
> That, in itself, is a problem. I agree with you on that point.

Exactly.

> However, there are use cases for both modes.
> 
> I can sample on event A and count on B, C and when A overflows, I want
> to snapshot B, C.
> For that I want B, C at the moment of the overflow, not at the moment
> the PMI is delivered. Thus, youd
> would want the Freeze-on-overflow behavior. You can collect in this
> mode with the perf tool,
> IIRC: perf record -e '{cycles,instructions,branches:S}' ....

Right, but we never supported that. Also, in that case the group must
then be fully exclusive so as not to mess with other groups. A better
solution might be an extension to Adaptive PEBS.

> The other usage model is that of the replay-debugger (rr) which you are alluding
> to, which needs precise count of an event including during the skid
> window. For that, you need
> Freeze-on-PMI (delivered). Note that this tool likely only cares about
> user level occurrences of events.

Correct, RR only cares about user-only counting.

> As for counter independence, I am not sure it holds in all cases. If
> the events are setup for user+kernel

This is true; however, if it were an actual Freeze-on-PMI we could
do u+k independence correctly too.


Anyway, as it stands I think the whole counter_freezing thing is a
trainwreck and it needs to go.
Like Xu Nov. 16, 2020, 3:22 a.m. UTC | #4
Hi Peter,

On 2020/11/10 23:37, Peter Zijlstra wrote:
> -static int __init intel_perf_counter_freezing_setup(char *s)
> -{
> -	bool res;
> -
> -	if (kstrtobool(s, &res))
> -		return -EINVAL;
> -
> -	disable_counter_freezing = !res;
> -	return 1;
> -}
> -__setup("perf_v4_pmi=", intel_perf_counter_freezing_setup);

...

> Anyway, as it stands I think the whole counter_freezing thing is a
> trainwreck and it needs to go.

If you really want to drop the counter_freezing stuff, we also need
to clean it up in Documentation/admin-guide/kernel-parameters.txt:

	perf_v4_pmi=	[X86,INTEL]
			Format: <bool>
			Disable Intel PMU counter freezing feature.
			The feature only exists starting from
			Arch Perfmon v4 (Skylake and newer).

However, someone may still need it based on a correct understanding
of "Freeze-on-Overflow", as Stephane said. How about renaming and
documenting it instead of discarding it completely?

Our guest PEBS enabling patches do not completely depend on it,
and we do not require the administrator to enable perf_v4_pmi for
guest PEBS.

Would you generously take a look at the perf part in this series?

Thanks,
Like Xu
Paolo Bonzini Jan. 26, 2021, 9:51 a.m. UTC | #5
On 11/11/20 03:42, Xu, Like wrote:
> Hi Peter,
> 
> On 2020/11/11 4:52, Stephane Eranian wrote:
>> On Tue, Nov 10, 2020 at 7:37 AM Peter Zijlstra<peterz@infradead.org>  
>> wrote:
>>> On Tue, Nov 10, 2020 at 04:12:57PM +0100, Peter Zijlstra wrote:
>>>> On Mon, Nov 09, 2020 at 10:12:37AM +0800, Like Xu wrote:
>>>>> The Precise Event Based Sampling(PEBS) supported on Intel Ice Lake 
>>>>> server
>>>>> platforms can provide an architectural state of the instruction 
>>>>> executed
>>>>> after the instruction that caused the event. This patch set enables 
>>>>> the
>>>>> the PEBS via DS feature for KVM (also non) Linux guest on the Ice 
>>>>> Lake.
>>>>> The Linux guest can use PEBS feature like native:
>>>>>
>>>>>    # perf record -e instructions:ppp ./br_instr a
>>>>>    # perf record -c 100000 -e instructions:pp ./br_instr a
>>>>>
>>>>> If the counter_freezing is not enabled on the host, the guest PEBS 
>>>>> will
>>>>> be disabled on purpose when host is using PEBS facility. By default,
>>>>> KVM disables the co-existence of guest PEBS and host PEBS.
> Thanks Stephane for clarifying the use cases for Freeze-on-[PMI|Overflow].
> 
> Please let me express it more clearly.
> 
> The goal of the whole patch set is to enable guest PEBS, regardless of
> whether the counter_freezing is frozen or not. By default, it will not
> support both the guest and the host to use PEBS at the same time.
> 
> Please continue reviewing the patch set, especially for the slow path
> we proposed this time and related host perf changes:
> 
> - add intel_pmu_handle_guest_pebs() to __intel_pmu_pebs_event();
> - add switch MSRs (PEBS_ENABLE, DS_AREA, DATA_CFG) to 
> intel_guest_get_msrs();
> - the construction of incoming parameters for 
> perf_event_create_kernel_counter();
> 
> I believe if you understand the general idea, the comments will be very 
> valuable.

What is the state of this work?  I was expecting a new version that 
doesn't use counter_freezing.  However, I see that counter_freezing is 
still in there, so this patch from Peter has never been applied.

Paolo
Peter Zijlstra Jan. 26, 2021, 10:36 a.m. UTC | #6
On Tue, Jan 26, 2021 at 10:51:39AM +0100, Paolo Bonzini wrote:
> What is the state of this work?  I was expecting a new version that doesn't
> use counter_freezing.  However, I see that counter_freezing is still in
> there, so this patch from Peter has never been applied.

Bah, I forgot about it. Lemme go find it.
Xu, Like Jan. 26, 2021, 11:35 a.m. UTC | #7
On 2021/1/26 17:51, Paolo Bonzini wrote:
> On 11/11/20 03:42, Xu, Like wrote:
>> Hi Peter,
>>
>> On 2020/11/11 4:52, Stephane Eranian wrote:
>>> On Tue, Nov 10, 2020 at 7:37 AM Peter Zijlstra<peterz@infradead.org>  
>>> wrote:
>>>> On Tue, Nov 10, 2020 at 04:12:57PM +0100, Peter Zijlstra wrote:
>>>>> On Mon, Nov 09, 2020 at 10:12:37AM +0800, Like Xu wrote:
>>>>>> The Precise Event Based Sampling(PEBS) supported on Intel Ice Lake 
>>>>>> server
>>>>>> platforms can provide an architectural state of the instruction 
>>>>>> executed
>>>>>> after the instruction that caused the event. This patch set enables the
>>>>>> the PEBS via DS feature for KVM (also non) Linux guest on the Ice Lake.
>>>>>> The Linux guest can use PEBS feature like native:
>>>>>>
>>>>>>    # perf record -e instructions:ppp ./br_instr a
>>>>>>    # perf record -c 100000 -e instructions:pp ./br_instr a
>>>>>>
>>>>>> If the counter_freezing is not enabled on the host, the guest PEBS will
>>>>>> be disabled on purpose when host is using PEBS facility. By default,
>>>>>> KVM disables the co-existence of guest PEBS and host PEBS.
>> Thanks Stephane for clarifying the use cases for Freeze-on-[PMI|Overflow].
>>
>> Please let me express it more clearly.
>>
>> The goal of the whole patch set is to enable guest PEBS, regardless of
>> whether the counter_freezing is frozen or not. By default, it will not
>> support both the guest and the host to use PEBS at the same time.
>>
>> Please continue reviewing the patch set, especially for the slow path
>> we proposed this time and related host perf changes:
>>
>> - add intel_pmu_handle_guest_pebs() to __intel_pmu_pebs_event();
>> - add switch MSRs (PEBS_ENABLE, DS_AREA, DATA_CFG) to 
>> intel_guest_get_msrs();
>> - the construction of incoming parameters for 
>> perf_event_create_kernel_counter();
>>
>> I believe if you understand the general idea, the comments will be very 
>> valuable.
>
> What is the state of this work?  I was expecting a new version that 
> doesn't use counter_freezing.  However, I see that counter_freezing is 
> still in there, so this patch from Peter has never been applied.
>
> Paolo

Ah, now we have the v3 version of the guest PEBS feature.
It does not rely on counter_freezing, but it disables the co-existence of
guest PEBS and host PEBS.
I am not clear about your position on this co-existence.

There are also more interesting topics for you to review and comment on.
Please check
https://lore.kernel.org/kvm/20210104131542.495413-1-like.xu@linux.intel.com/
Paolo Bonzini Jan. 26, 2021, 11:59 a.m. UTC | #8
On 26/01/21 12:35, Xu, Like wrote:
>>
> 
> Ah, now we have the v3 version on guest PEBS feature.
> It does not rely on counter_freezing, but disables the co-existence of 
> guest PEBS and host PEBS.
> I am not clear about your attitude towards this co-existence.
> 
> There are also more interesting topics for you to review and comment.
> Please check 
> https://lore.kernel.org/kvm/20210104131542.495413-1-like.xu@linux.intel.com/ 
> 

Thanks Like, I'll review that one.

Paolo

Patch

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index c79748f6921d..9909dfa6fb12 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2121,18 +2121,6 @@  static void intel_tfa_pmu_enable_all(int added)
 	intel_pmu_enable_all(added);
 }
 
-static void enable_counter_freeze(void)
-{
-	update_debugctlmsr(get_debugctlmsr() |
-			DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI);
-}
-
-static void disable_counter_freeze(void)
-{
-	update_debugctlmsr(get_debugctlmsr() &
-			~DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI);
-}
-
 static inline u64 intel_pmu_get_status(void)
 {
 	u64 status;
@@ -2696,95 +2684,6 @@  static int handle_pmi_common(struct pt_regs *regs, u64 status)
 	return handled;
 }
 
-static bool disable_counter_freezing = true;
-static int __init intel_perf_counter_freezing_setup(char *s)
-{
-	bool res;
-
-	if (kstrtobool(s, &res))
-		return -EINVAL;
-
-	disable_counter_freezing = !res;
-	return 1;
-}
-__setup("perf_v4_pmi=", intel_perf_counter_freezing_setup);
-
-/*
- * Simplified handler for Arch Perfmon v4:
- * - We rely on counter freezing/unfreezing to enable/disable the PMU.
- * This is done automatically on PMU ack.
- * - Ack the PMU only after the APIC.
- */
-
-static int intel_pmu_handle_irq_v4(struct pt_regs *regs)
-{
-	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
-	int handled = 0;
-	bool bts = false;
-	u64 status;
-	int pmu_enabled = cpuc->enabled;
-	int loops = 0;
-
-	/* PMU has been disabled because of counter freezing */
-	cpuc->enabled = 0;
-	if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask)) {
-		bts = true;
-		intel_bts_disable_local();
-		handled = intel_pmu_drain_bts_buffer();
-		handled += intel_bts_interrupt();
-	}
-	status = intel_pmu_get_status();
-	if (!status)
-		goto done;
-again:
-	intel_pmu_lbr_read();
-	if (++loops > 100) {
-		static bool warned;
-
-		if (!warned) {
-			WARN(1, "perfevents: irq loop stuck!\n");
-			perf_event_print_debug();
-			warned = true;
-		}
-		intel_pmu_reset();
-		goto done;
-	}
-
-
-	handled += handle_pmi_common(regs, status);
-done:
-	/* Ack the PMI in the APIC */
-	apic_write(APIC_LVTPC, APIC_DM_NMI);
-
-	/*
-	 * The counters start counting immediately while ack the status.
-	 * Make it as close as possible to IRET. This avoids bogus
-	 * freezing on Skylake CPUs.
-	 */
-	if (status) {
-		intel_pmu_ack_status(status);
-	} else {
-		/*
-		 * CPU may issues two PMIs very close to each other.
-		 * When the PMI handler services the first one, the
-		 * GLOBAL_STATUS is already updated to reflect both.
-		 * When it IRETs, the second PMI is immediately
-		 * handled and it sees clear status. At the meantime,
-		 * there may be a third PMI, because the freezing bit
-		 * isn't set since the ack in first PMI handlers.
-		 * Double check if there is more work to be done.
-		 */
-		status = intel_pmu_get_status();
-		if (status)
-			goto again;
-	}
-
-	if (bts)
-		intel_bts_enable_local();
-	cpuc->enabled = pmu_enabled;
-	return handled;
-}
-
 /*
  * This handler is triggered by the local APIC, so the APIC IRQ handling
  * rules apply:
@@ -4081,9 +3980,6 @@  static void intel_pmu_cpu_starting(int cpu)
 	if (x86_pmu.version > 1)
 		flip_smm_bit(&x86_pmu.attr_freeze_on_smi);
 
-	if (x86_pmu.counter_freezing)
-		enable_counter_freeze();
-
 	/* Disable perf metrics if any added CPU doesn't support it. */
 	if (x86_pmu.intel_cap.perf_metrics) {
 		union perf_capabilities perf_cap;
@@ -4154,9 +4050,6 @@  static void free_excl_cntrs(struct cpu_hw_events *cpuc)
 static void intel_pmu_cpu_dying(int cpu)
 {
 	fini_debug_store_on_cpu(cpu);
-
-	if (x86_pmu.counter_freezing)
-		disable_counter_freeze();
 }
 
 void intel_cpuc_finish(struct cpu_hw_events *cpuc)
@@ -4548,39 +4441,6 @@  static __init void intel_nehalem_quirk(void)
 	}
 }
 
-static const struct x86_cpu_desc counter_freezing_ucodes[] = {
-	INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT,	 2, 0x0000000e),
-	INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT,	 9, 0x0000002e),
-	INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT,	10, 0x00000008),
-	INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT_D,	 1, 0x00000028),
-	INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT_PLUS,	 1, 0x00000028),
-	INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT_PLUS,	 8, 0x00000006),
-	{}
-};
-
-static bool intel_counter_freezing_broken(void)
-{
-	return !x86_cpu_has_min_microcode_rev(counter_freezing_ucodes);
-}
-
-static __init void intel_counter_freezing_quirk(void)
-{
-	/* Check if it's already disabled */
-	if (disable_counter_freezing)
-		return;
-
-	/*
-	 * If the system starts with the wrong ucode, leave the
-	 * counter-freezing feature permanently disabled.
-	 */
-	if (intel_counter_freezing_broken()) {
-		pr_info("PMU counter freezing disabled due to CPU errata,"
-			"please upgrade microcode\n");
-		x86_pmu.counter_freezing = false;
-		x86_pmu.handle_irq = intel_pmu_handle_irq;
-	}
-}
-
 /*
  * enable software workaround for errata:
  * SNB: BJ122
@@ -4966,9 +4826,6 @@  __init int intel_pmu_init(void)
 			max((int)edx.split.num_counters_fixed, assume);
 	}
 
-	if (version >= 4)
-		x86_pmu.counter_freezing = !disable_counter_freezing;
-
 	if (boot_cpu_has(X86_FEATURE_PDCM)) {
 		u64 capabilities;
 
@@ -5090,7 +4947,6 @@  __init int intel_pmu_init(void)
 
 	case INTEL_FAM6_ATOM_GOLDMONT:
 	case INTEL_FAM6_ATOM_GOLDMONT_D:
-		x86_add_quirk(intel_counter_freezing_quirk);
 		memcpy(hw_cache_event_ids, glm_hw_cache_event_ids,
 		       sizeof(hw_cache_event_ids));
 		memcpy(hw_cache_extra_regs, glm_hw_cache_extra_regs,
@@ -5117,7 +4973,6 @@  __init int intel_pmu_init(void)
 		break;
 
 	case INTEL_FAM6_ATOM_GOLDMONT_PLUS:
-		x86_add_quirk(intel_counter_freezing_quirk);
 		memcpy(hw_cache_event_ids, glp_hw_cache_event_ids,
 		       sizeof(hw_cache_event_ids));
 		memcpy(hw_cache_extra_regs, glp_hw_cache_extra_regs,
@@ -5577,13 +5432,6 @@  __init int intel_pmu_init(void)
 		pr_cont("full-width counters, ");
 	}
 
-	/*
-	 * For arch perfmon 4 use counter freezing to avoid
-	 * several MSR accesses in the PMI.
-	 */
-	if (x86_pmu.counter_freezing)
-		x86_pmu.handle_irq = intel_pmu_handle_irq_v4;
-
 	if (x86_pmu.intel_cap.perf_metrics)
 		x86_pmu.intel_ctrl |= 1ULL << GLOBAL_CTRL_EN_PERF_METRICS;
 
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 10032f023fcc..4084fa31cc21 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -681,8 +681,7 @@  struct x86_pmu {
 
 	/* PMI handler bits */
 	unsigned int	late_ack		:1,
-			enabled_ack		:1,
-			counter_freezing	:1;
+			enabled_ack		:1;
 	/*
 	 * sysfs attrs
 	 */