
[v3,04/17] perf: x86/ds: Handle guest PEBS overflow PMI and inject it to guest

Message ID 20210104131542.495413-5-like.xu@linux.intel.com
State New, archived
Series KVM: x86/pmu: Add support to enable Guest PEBS via DS

Commit Message

Like Xu Jan. 4, 2021, 1:15 p.m. UTC
With PEBS virtualization, the PEBS records get delivered to the guest,
but the host still sees the PEBS overflow PMI from guest PEBS counters.
This would normally result in a spurious host PMI, so we need to inject
that PEBS overflow PMI into the guest, where the guest PMI handler
can handle the PEBS records.

Check for this case in the host perf PEBS handler. If a PEBS overflow
PMI occurs and it was not generated by the host side (determined by
checking the host DS), a fake event is triggered. The fake event causes
the KVM PMI callback to be called, thereby injecting the PEBS overflow
PMI into the guest.
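
For illustration only (this is not part of the patch), the host-side check
described above can be modelled in user space roughly as follows. The struct
and function names are hypothetical stand-ins for cpu_hw_events and
debug_store, and the reading of intel_ctrl_host_mask as "host-only counters"
is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct ds_model {
	uint64_t pebs_index;               /* current write position in the host PEBS buffer */
	uint64_t pebs_interrupt_threshold; /* host buffer level that raises a PMI */
};

struct cpuc_model {
	uint64_t pebs_enabled;         /* all PEBS-enabled counters on this CPU */
	uint64_t intel_ctrl_host_mask; /* counters used by host-only events (assumed) */
};

/* Returns true when the PMI should be attributed to guest PEBS counters. */
static bool pmi_is_guest_pebs(const struct cpuc_model *cpuc,
			      const struct ds_model *ds, bool in_nmi_ctx)
{
	uint64_t guest_pebs_idxs;

	assert(cpuc && ds);
	guest_pebs_idxs = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;

	/* No guest PEBS counter, or not called from the PMI path. */
	if (!guest_pebs_idxs || !in_nmi_ctx)
		return false;

	/* The host buffer reached its threshold: treat the PMI as the host's. */
	if (ds->pebs_index >= ds->pebs_interrupt_threshold)
		return false;

	return true;
}
```

The model only inspects host state, which is the point of the approach: the
guest DS area is never read on this path.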

No matter how many guest PEBS counters have overflowed, triggering a
single fake event is enough. The guest PEBS handler will retrieve the
correct information from its own PEBS records buffer.
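
A minimal user-space sketch of the "one fake event is enough" rule, under the
assumption that events are looked up by counter index. The event_model type,
MAX_COUNTERS and the NULL check are hypothetical additions for the sketch and
do not appear in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_COUNTERS 64

/* Hypothetical stand-in for struct perf_event. */
struct event_model {
	int precise_ip; /* non-zero for a PEBS (precise) event */
	int overflows;  /* fake overflows delivered to this event */
};

/*
 * Deliver at most one fake overflow, however many guest PEBS counters
 * are set in guest_pebs_idxs. Returns 1 if an event was triggered.
 */
static int inject_one_fake_event(struct event_model *events[MAX_COUNTERS],
				 uint64_t guest_pebs_idxs)
{
	assert(events != NULL);

	for (int bit = 0; bit < MAX_COUNTERS; bit++) {
		struct event_model *event;

		if (!(guest_pebs_idxs & (1ULL << bit)))
			continue;

		event = events[bit];
		if (!event || !event->precise_ip)
			continue;

		event->overflows++;
		return 1; /* the guest handler reads its own buffer for the rest */
	}

	return 0;
}
```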

A guest PEBS overflow PMI would be missed if a PEBS counter is also enabled
on the host side and a host PEBS overflow PMI (based on the host DS_AREA)
happens to be triggered right after the vm-exit caused by the guest PEBS
overflow PMI (based on the guest DS_AREA). To avoid that case, KVM disables
guest PEBS before vm-entry whenever a host PEBS counter is enabled
on the same CPU.

Originally-by: Andi Kleen <ak@linux.intel.com>
Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
---
 arch/x86/events/intel/ds.c | 62 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)

Comments

Peter Zijlstra Jan. 13, 2021, 6:22 p.m. UTC | #1
On Mon, Jan 04, 2021 at 09:15:29PM +0800, Like Xu wrote:

> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index b47cc4226934..c499bdb58373 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -1721,6 +1721,65 @@ intel_pmu_save_and_restart_reload(struct perf_event *event, int count)
>  	return 0;
>  }
>  
> +/*
> + * We may be running with guest PEBS events created by KVM, and the
> + * PEBS records are logged into the guest's DS and invisible to host.
> + *
> + * In the case of guest PEBS overflow, we only trigger a fake event
> + * to emulate the PEBS overflow PMI for guest PEBS counters in KVM.
> + * The guest will then vm-entry and check the guest DS area to read
> + * the guest PEBS records.
> + *
> + * The guest PEBS overflow PMI may be dropped when both the guest and
> + * the host use PEBS. Therefore, KVM will not enable guest PEBS once
> + * the host PEBS is enabled since it may bring a confused unknown NMI.
> + *
> + * The contents and other behavior of the guest event do not matter.
> + */
> +static int intel_pmu_handle_guest_pebs(struct cpu_hw_events *cpuc,
> +				       struct pt_regs *iregs,
> +				       struct debug_store *ds)
> +{
> +	struct perf_sample_data data;
> +	struct perf_event *event = NULL;
> +	u64 guest_pebs_idxs = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
> +	int bit;
> +
> +	/*
> +	 * Ideally, we should check guest DS to understand if it's
> +	 * a guest PEBS overflow PMI from guest PEBS counters.
> +	 * However, it brings high overhead to retrieve guest DS in host.
> +	 * So we check host DS instead for performance.

Again; for the virt illiterate people here (me); why is it expensive to
check guest DS?

Why do we need to? Can't we simply always forward the PMI if the guest
has bits set in MSR_IA32_PEBS_ENABLE ? Surely we can access the guest
MSRs at a reasonable rate..

Sure, it'll send too many PMIs, but is that really a problem?

> +	 *
> +	 * If PEBS interrupt threshold on host is not exceeded in a NMI, there
> +	 * must be a PEBS overflow PMI generated from the guest PEBS counters.
> +	 * There is no ambiguity since the reported event in the PMI is guest
> +	 * only. It gets handled correctly on a case-by-case basis for each event.
> +	 *
> +	 * Note: KVM disables the co-existence of guest PEBS and host PEBS.

Where; I need a code reference here.

> +	 */
> +	if (!guest_pebs_idxs || !in_nmi() ||

All the other code uses !iregs instead of !in_nmi(), also your
indentation is broken.

> +		ds->pebs_index >= ds->pebs_interrupt_threshold)
> +		return 0;
> +
> +	for_each_set_bit(bit, (unsigned long *)&guest_pebs_idxs,
> +			INTEL_PMC_IDX_FIXED + x86_pmu.num_counters_fixed) {
> +
> +		event = cpuc->events[bit];
> +		if (!event->attr.precise_ip)
> +			continue;
> +
> +		perf_sample_data_init(&data, 0, event->hw.last_period);
> +		if (perf_event_overflow(event, &data, iregs))
> +			x86_pmu_stop(event, 0);
> +
> +		/* Injecting one fake event is enough. */
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +
>  static __always_inline void
>  __intel_pmu_pebs_event(struct perf_event *event,
>  		       struct pt_regs *iregs,
> @@ -1965,6 +2024,9 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
>  	if (!x86_pmu.pebs_active)
>  		return;
>  
> +	if (intel_pmu_handle_guest_pebs(cpuc, iregs, ds))
> +		return;
> +
>  	base = (struct pebs_basic *)(unsigned long)ds->pebs_buffer_base;
>  	top = (struct pebs_basic *)(unsigned long)ds->pebs_index;
>  
> -- 
> 2.29.2
>

Peter Zijlstra Jan. 13, 2021, 6:27 p.m. UTC | #2
On Wed, Jan 13, 2021 at 07:22:09PM +0100, Peter Zijlstra wrote:
> Again; for the virt illiterate people here (me); why is it expensive to
> check guest DS?

Remember, you're trying to get someone that thinks virt is the devil's
work (me) to review virt patches. You get to spell things out in detail.

Xu, Like Jan. 14, 2021, 3:39 a.m. UTC | #3
On 2021/1/14 2:22, Peter Zijlstra wrote:
> On Mon, Jan 04, 2021 at 09:15:29PM +0800, Like Xu wrote:
>
>> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
>> index b47cc4226934..c499bdb58373 100644
>> --- a/arch/x86/events/intel/ds.c
>> +++ b/arch/x86/events/intel/ds.c
>> @@ -1721,6 +1721,65 @@ intel_pmu_save_and_restart_reload(struct perf_event *event, int count)
>>   	return 0;
>>   }
>>   
>> +/*
>> + * We may be running with guest PEBS events created by KVM, and the
>> + * PEBS records are logged into the guest's DS and invisible to host.
>> + *
>> + * In the case of guest PEBS overflow, we only trigger a fake event
>> + * to emulate the PEBS overflow PMI for guest PEBS counters in KVM.
>> + * The guest will then vm-entry and check the guest DS area to read
>> + * the guest PEBS records.
>> + *
>> + * The guest PEBS overflow PMI may be dropped when both the guest and
>> + * the host use PEBS. Therefore, KVM will not enable guest PEBS once
>> + * the host PEBS is enabled since it may bring a confused unknown NMI.
>> + *
>> + * The contents and other behavior of the guest event do not matter.
>> + */
>> +static int intel_pmu_handle_guest_pebs(struct cpu_hw_events *cpuc,
>> +				       struct pt_regs *iregs,
>> +				       struct debug_store *ds)
>> +{
>> +	struct perf_sample_data data;
>> +	struct perf_event *event = NULL;
>> +	u64 guest_pebs_idxs = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
>> +	int bit;
>> +
>> +	/*
>> +	 * Ideally, we should check guest DS to understand if it's
>> +	 * a guest PEBS overflow PMI from guest PEBS counters.
>> +	 * However, it brings high overhead to retrieve guest DS in host.
>> +	 * So we check host DS instead for performance.
> Again; for the virt illiterate people here (me); why is it expensive to
> check guest DS?

We are not checking the guest DS here for two reasons:
- it brings additional KVM memory locking operations and a guest page table
  traversal, which is very expensive for guests with large memory (even if
  we cached the mapped values, we would still need to check whether the
  cached ones are still valid);
- the current interface kvm_read_guest_*() might sleep and is not IRQ safe.

If you still need me to try this guest DS check approach, please let me know,
and I will provide more performance data.

>
> Why do we need to? Can't we simply always forward the PMI if the guest
> has bits set in MSR_IA32_PEBS_ENABLE ? Surely we can access the guest
> MSRs at a reasonable rate..
>
> Sure, it'll send too many PMIs, but is that really a problem?

More vPMIs mean more guest irq handler calls and
more PMI virtualization overhead.

In addition, the correctness of some workloads (rr?) depends on
the correct number of PMIs and the PMI trigger times,
and virt may not want to break this assumption.

>
>> +	 *
>> +	 * If PEBS interrupt threshold on host is not exceeded in a NMI, there
>> +	 * must be a PEBS overflow PMI generated from the guest PEBS counters.
>> +	 * There is no ambiguity since the reported event in the PMI is guest
>> +	 * only. It gets handled correctly on a case-by-case basis for each event.
>> +	 *
>> +	 * Note: KVM disables the co-existence of guest PEBS and host PEBS.
> Where; I need a code reference here.

How about:

Note: KVM will disable the co-existence of guest PEBS and host PEBS.
In intel_guest_get_msrs(), when any host PEBS ctrl bit is enabled,
KVM clears the guest PEBS ctrl enable bit(s) before vm-entry.
Guest PEBS users should be notified of this runtime restriction.
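
As a rough illustration of the restriction described in this note (a sketch
under stated assumptions, not the actual intel_guest_get_msrs() code), the
guest value of PEBS_ENABLE could be derived like this. The msr_pair_model
type and the function name are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host/guest value pair for one MSR, for illustration only. */
struct msr_pair_model {
	uint64_t host;
	uint64_t guest;
};

/*
 * Compute the PEBS_ENABLE values to load for host and guest.
 * If any host counter has PEBS enabled, guest PEBS is switched off
 * for this vm-entry.
 */
static struct msr_pair_model pebs_enable_for_entry(uint64_t host_pebs,
						   uint64_t guest_pebs)
{
	struct msr_pair_model m = { .host = host_pebs, .guest = guest_pebs };

	/* In this model a counter belongs to either the host or the guest. */
	assert((host_pebs & guest_pebs) == 0);

	if (host_pebs)
		m.guest = 0;

	return m;
}
```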

>
>> +	 */
>> +	if (!guest_pebs_idxs || !in_nmi() ||
> All the other code uses !iregs instead of !in_nmi(), also your
> indentation is broken.

Sure, I'll use !iregs and fix the indentation in the next version.

---
thx,likexu
>> +		ds->pebs_index >= ds->pebs_interrupt_threshold)
>> +		return 0;
>> +
>> +	for_each_set_bit(bit, (unsigned long *)&guest_pebs_idxs,
>> +			INTEL_PMC_IDX_FIXED + x86_pmu.num_counters_fixed) {
>> +
>> +		event = cpuc->events[bit];
>> +		if (!event->attr.precise_ip)
>> +			continue;
>> +
>> +		perf_sample_data_init(&data, 0, event->hw.last_period);
>> +		if (perf_event_overflow(event, &data, iregs))
>> +			x86_pmu_stop(event, 0);
>> +
>> +		/* Injecting one fake event is enough. */
>> +		return 1;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>>   static __always_inline void
>>   __intel_pmu_pebs_event(struct perf_event *event,
>>   		       struct pt_regs *iregs,
>> @@ -1965,6 +2024,9 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
>>   	if (!x86_pmu.pebs_active)
>>   		return;
>>   
>> +	if (intel_pmu_handle_guest_pebs(cpuc, iregs, ds))
>> +		return;
>> +
>>   	base = (struct pebs_basic *)(unsigned long)ds->pebs_buffer_base;
>>   	top = (struct pebs_basic *)(unsigned long)ds->pebs_index;
>>   
>> -- 
>> 2.29.2
>>

Sean Christopherson Jan. 14, 2021, 6:55 p.m. UTC | #4
On Mon, Jan 04, 2021, Like Xu wrote:
> ---
>  arch/x86/events/intel/ds.c | 62 ++++++++++++++++++++++++++++++++++++++
>  1 file changed, 62 insertions(+)
> 
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index b47cc4226934..c499bdb58373 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -1721,6 +1721,65 @@ intel_pmu_save_and_restart_reload(struct perf_event *event, int count)
>  	return 0;
>  }
>  
> +/*
> + * We may be running with guest PEBS events created by KVM, and the
> + * PEBS records are logged into the guest's DS and invisible to host.
> + *
> + * In the case of guest PEBS overflow, we only trigger a fake event
> + * to emulate the PEBS overflow PMI for guest PEBS counters in KVM.
> + * The guest will then vm-entry and check the guest DS area to read
> + * the guest PEBS records.
> + *
> + * The guest PEBS overflow PMI may be dropped when both the guest and
> + * the host use PEBS. Therefore, KVM will not enable guest PEBS once
> + * the host PEBS is enabled since it may bring a confused unknown NMI.
> + *
> + * The contents and other behavior of the guest event do not matter.
> + */
> +static int intel_pmu_handle_guest_pebs(struct cpu_hw_events *cpuc,
> +				       struct pt_regs *iregs,
> +				       struct debug_store *ds)
> +{
> +	struct perf_sample_data data;
> +	struct perf_event *event = NULL;
> +	u64 guest_pebs_idxs = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
> +	int bit;
> +
> +	/*
> +	 * Ideally, we should check guest DS to understand if it's
> +	 * a guest PEBS overflow PMI from guest PEBS counters.
> +	 * However, it brings high overhead to retrieve guest DS in host.
> +	 * So we check host DS instead for performance.
> +	 *
> +	 * If PEBS interrupt threshold on host is not exceeded in a NMI, there
> +	 * must be a PEBS overflow PMI generated from the guest PEBS counters.
> +	 * There is no ambiguity since the reported event in the PMI is guest
> +	 * only. It gets handled correctly on a case-by-case basis for each event.
> +	 *
> +	 * Note: KVM disables the co-existence of guest PEBS and host PEBS.

By "KVM", do you mean KVM's loading of the MSRs provided by intel_guest_get_msrs()?
Because the PMU should really be the entity that controls guest vs. host.  KVM
should just be a dumb pipe that handles the mechanics of how values are context
switched.

For example, commit 7099e2e1f4d9 ("KVM: VMX: disable PEBS before a guest entry"),
where KVM does an explicit WRMSR(PEBS_ENABLE) to (attempt to) force PEBS
quiescence, is flawed in that the PMU can re-enable PEBS after the WRMSR if a
PMI arrives between the WRMSR and VM-Enter (because VMX can't block NMIs).  The
PMU really needs to be involved in the WRMSR workaround.

> +	 */
> +	if (!guest_pebs_idxs || !in_nmi() ||

Are PEBS updates guaranteed to be isolated in both directions on relevant
hardware?  By that I mean, will host updates be fully processed before VM-Enter
completes, and guest updates before VM-Exit completes?  If that's the case,
then this path could be optimized to change the KVM invocation of the NMI
handler so that the "is this a guest PEBS PMI" check is done if and only if the
PMI originated from within the guest.

> +		ds->pebs_index >= ds->pebs_interrupt_threshold)
> +		return 0;
> +
> +	for_each_set_bit(bit, (unsigned long *)&guest_pebs_idxs,
> +			INTEL_PMC_IDX_FIXED + x86_pmu.num_counters_fixed) {
> +
> +		event = cpuc->events[bit];
> +		if (!event->attr.precise_ip)
> +			continue;
> +
> +		perf_sample_data_init(&data, 0, event->hw.last_period);
> +		if (perf_event_overflow(event, &data, iregs))
> +			x86_pmu_stop(event, 0);
> +
> +		/* Injecting one fake event is enough. */
> +		return 1;
> +	}
> +
> +	return 0;
> +}

Xu, Like Jan. 15, 2021, 2:49 a.m. UTC | #5
On 2021/1/15 2:55, Sean Christopherson wrote:
> On Mon, Jan 04, 2021, Like Xu wrote:
>> ---
>>   arch/x86/events/intel/ds.c | 62 ++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 62 insertions(+)
>>
>> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
>> index b47cc4226934..c499bdb58373 100644
>> --- a/arch/x86/events/intel/ds.c
>> +++ b/arch/x86/events/intel/ds.c
>> @@ -1721,6 +1721,65 @@ intel_pmu_save_and_restart_reload(struct perf_event *event, int count)
>>   	return 0;
>>   }
>>   
>> +/*
>> + * We may be running with guest PEBS events created by KVM, and the
>> + * PEBS records are logged into the guest's DS and invisible to host.
>> + *
>> + * In the case of guest PEBS overflow, we only trigger a fake event
>> + * to emulate the PEBS overflow PMI for guest PEBS counters in KVM.
>> + * The guest will then vm-entry and check the guest DS area to read
>> + * the guest PEBS records.
>> + *
>> + * The guest PEBS overflow PMI may be dropped when both the guest and
>> + * the host use PEBS. Therefore, KVM will not enable guest PEBS once
>> + * the host PEBS is enabled since it may bring a confused unknown NMI.
>> + *
>> + * The contents and other behavior of the guest event do not matter.
>> + */
>> +static int intel_pmu_handle_guest_pebs(struct cpu_hw_events *cpuc,
>> +				       struct pt_regs *iregs,
>> +				       struct debug_store *ds)
>> +{
>> +	struct perf_sample_data data;
>> +	struct perf_event *event = NULL;
>> +	u64 guest_pebs_idxs = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
>> +	int bit;
>> +
>> +	/*
>> +	 * Ideally, we should check guest DS to understand if it's
>> +	 * a guest PEBS overflow PMI from guest PEBS counters.
>> +	 * However, it brings high overhead to retrieve guest DS in host.
>> +	 * So we check host DS instead for performance.
>> +	 *
>> +	 * If PEBS interrupt threshold on host is not exceeded in a NMI, there
>> +	 * must be a PEBS overflow PMI generated from the guest PEBS counters.
>> +	 * There is no ambiguity since the reported event in the PMI is guest
>> +	 * only. It gets handled correctly on a case-by-case basis for each event.
>> +	 *
>> +	 * Note: KVM disables the co-existence of guest PEBS and host PEBS.
> By "KVM", do you mean KVM's loading of the MSRs provided by intel_guest_get_msrs()?
> Because the PMU should really be the entity that controls guest vs. host.  KVM
> should just be a dumb pipe that handles the mechanics of how values are context
> switch.

intel_guest_get_msrs() and atomic_switch_perf_msrs() work together
to disable the co-existence of guest PEBS and host PEBS:

https://lore.kernel.org/kvm/961e6135-ff6d-86d1-3b7b-a1846ad0e4c4@intel.com/

+

static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
...
     if (nr_msrs > 2 && (msrs[1].guest & msrs[0].guest)) {
         msrs[2].guest = pmu->ds_area;
         if (nr_msrs > 3)
             msrs[3].guest = pmu->pebs_data_cfg;
     }

    for (i = 0; i < nr_msrs; i++)
...

>
> For example, commit 7099e2e1f4d9 ("KVM: VMX: disable PEBS before a guest entry"),
> where KVM does an explicit WRMSR(PEBS_ENABLE) to (attempt to) force PEBS
> quiescence, is flawed in that the PMU can re-enable PEBS after the WRMSR if a
> PMI arrives between the WRMSR and VM-Enter (because VMX can't block NMIs).  The
> PMU really needs to be involved in the WRMSR workaround.

Thanks, I will carefully confirm the PEBS quiescence behavior on ICX.
But I think it's fine to keep "wrmsrl(MSR_IA32_PEBS_ENABLE, 0);" here,
since we will load a new guest value (if any) for this MSR later.

>
>> +	 */
>> +	if (!guest_pebs_idxs || !in_nmi() ||
> Are PEBS updates guaranteed to be isolated in both directions on relevant
> hardware?

I think it's true on the ICX.

> By that I mean, will host updates be fully processed before VM-Enter
> compeletes, and guest updates before VM-Exit completes?

The situation is more complicated.

> If that's the case,
> then this path could be optimized to change the KVM invocation of the NMI
> handler so that the "is this a guest PEBS PMI" check is done if and only if the
> PMI originated from with the guest.

When a PEBS PMI caused by the guest workload triggers a vm-exit,
the code path from the vm-exit to the host PEBS PMI handler may itself
raise a PEBS PMI and set the status bit. The current PMI handler
can't distinguish the two and would treat the later one as a suspicious
PMI and print a warning.

This is the main reason why we chose to disable the co-existence
of guest PEBS and host PEBS; future hardware enhancements
may lift this limitation.

---
thx, likexu
>
>> +		ds->pebs_index >= ds->pebs_interrupt_threshold)
>> +		return 0;
>> +
>> +	for_each_set_bit(bit, (unsigned long *)&guest_pebs_idxs,
>> +			INTEL_PMC_IDX_FIXED + x86_pmu.num_counters_fixed) {
>> +
>> +		event = cpuc->events[bit];
>> +		if (!event->attr.precise_ip)
>> +			continue;
>> +
>> +		perf_sample_data_init(&data, 0, event->hw.last_period);
>> +		if (perf_event_overflow(event, &data, iregs))
>> +			x86_pmu_stop(event, 0);
>> +
>> +		/* Injecting one fake event is enough. */
>> +		return 1;
>> +	}
>> +
>> +	return 0;
>> +}

Peter Zijlstra Jan. 15, 2021, 12:01 p.m. UTC | #6
On Thu, Jan 14, 2021 at 11:39:00AM +0800, Xu, Like wrote:

> > Why do we need to? Can't we simply always forward the PMI if the guest
> > has bits set in MSR_IA32_PEBS_ENABLE ? Surely we can access the guest
> > MSRs at a reasonable rate..
> > 
> > Sure, it'll send too many PMIs, but is that really a problem?
> 
> More vPMI means more guest irq handler calls and
> more PMI virtualization overhead.

Only if you have both guest and host PEBS. And in that case I really
can't be arsed about some overhead to the guest.

> In addition,
> the correctness of some workloads (RR?) depends on
> the correct number of PMIs and the PMI trigger times
> and virt may not want to break this assumption.

Are you sure? Spurious NMI/PMIs are known to happen anyway. We have far
too much code to deal with them.

> > > +	 * If PEBS interrupt threshold on host is not exceeded in a NMI, there
> > > +	 * must be a PEBS overflow PMI generated from the guest PEBS counters.
> > > +	 * There is no ambiguity since the reported event in the PMI is guest
> > > +	 * only. It gets handled correctly on a case by case base for each event.
> > > +	 *
> > > +	 * Note: KVM disables the co-existence of guest PEBS and host PEBS.
> > Where; I need a code reference here.
> 
> How about:
> 
> Note: KVM will disable the co-existence of guest PEBS and host PEBS.
> In the intel_guest_get_msrs(), when we have host PEBS ctrl bit(s) enabled,
> KVM will clear the guest PEBS ctrl enable bit(s) before vm-entry.
> The guest PEBS users should be notified of this runtime restriction.

Since you had me look at that function, can you clean up that
CONFIG_RETPOLINE crud and replace it with static_call()?

Xu, Like Jan. 15, 2021, 2:30 p.m. UTC | #7
On 2021/1/15 20:01, Peter Zijlstra wrote:
> On Thu, Jan 14, 2021 at 11:39:00AM +0800, Xu, Like wrote:
>
>>> Why do we need to? Can't we simply always forward the PMI if the guest
>>> has bits set in MSR_IA32_PEBS_ENABLE ? Surely we can access the guest
>>> MSRs at a reasonable rate..
>>>
>>> Sure, it'll send too many PMIs, but is that really a problem?
>> More vPMI means more guest irq handler calls and
>> more PMI virtualization overhead.
> Only if you have both guest and host PEBS. And in that case I really
> can't be arsed about some overhead to the guest.

Less overhead makes everyone happier.

Ah, can I assume that you're fine with disabling the
co-existence of guest PEBS and host PEBS as the first upstream step ?

>
>> In addition,
>> the correctness of some workloads (RR?) depends on
>> the correct number of PMIs and the PMI trigger times
>> and virt may not want to break this assumption.
> Are you sure? Spurious NMI/PMIs are known to happen anyway. We have far
> too much code to deal with them.

https://lore.kernel.org/lkml/20170628130748.GI5981@leverpostej/T/

For the rr workload, the commit "the PMI interrupts in skid region should be
dropped" was reverted since some users complained that:

> It seems to me that it might be reasonable to ignore the interrupt if
> the purpose of the interrupt is to trigger sampling of the CPUs
> register state.  But if the interrupt will trigger some other
> operation, such as a signal on an fd, then there's no reason to drop
> it.

I assume that if dropping the PMI is unacceptable, so is injecting a
spurious PMI.

I'm pretty open if you insist that we really need to do this for guest PEBS
enabling.

>
>>>> +	 * If PEBS interrupt threshold on host is not exceeded in a NMI, there
>>>> +	 * must be a PEBS overflow PMI generated from the guest PEBS counters.
>>>> +	 * There is no ambiguity since the reported event in the PMI is guest
>>>> +	 * only. It gets handled correctly on a case-by-case basis for each event.
>>>> +	 *
>>>> +	 * Note: KVM disables the co-existence of guest PEBS and host PEBS.
>>> Where; I need a code reference here.
>> How about:
>>
>> Note: KVM will disable the co-existence of guest PEBS and host PEBS.
>> In the intel_guest_get_msrs(), when we have host PEBS ctrl bit(s) enabled,
>> KVM will clear the guest PEBS ctrl enable bit(s) before vm-entry.
>> The guest PEBS users should be notified of this runtime restriction.
> Since you had me look at that function, can clean up that
> CONFIG_RETPOLINE crud and replace it with static_call() ?

Sure. Let me try it.

---
thx, likexu

Peter Zijlstra Jan. 15, 2021, 2:44 p.m. UTC | #8
On Fri, Jan 15, 2021 at 10:30:13PM +0800, Xu, Like wrote:

> > Are you sure? Spurious NMI/PMIs are known to happen anyway. We have far
> > too much code to deal with them.
> 
> https://lore.kernel.org/lkml/20170628130748.GI5981@leverpostej/T/
> 
> In the rr workload, the commit change "the PMI interrupts in skid region
> should be dropped"
> is reverted since some users complain that:
> 
> > It seems to me that it might be reasonable to ignore the interrupt if
> > the purpose of the interrupt is to trigger sampling of the CPUs
> > register state.  But if the interrupt will trigger some other
> > operation, such as a signal on an fd, then there's no reason to drop
> > it.
> 
> I assume that if the PMI drop is unacceptable, either will spurious PMI
> injection.
> 
> I'm pretty open if you insist that we really need to do this for guest PEBS
> enabling.

That was an entirely different issue. We were dropping events on the
floor because they'd passed priv boundaries. So there was an actual
event, and we made it go away.

What we're talking about here is raising a PMI with BUFFER_OVF set,
even if the DS is empty. That should really be harmless. We'll take the
PMI, find there's nothing there, and do nothing.
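
To make the "find there's nothing there, and do nothing" argument concrete,
here is a minimal user-space sketch of a drain loop over a record buffer. The
record_model type and drain_records() are hypothetical and do not mirror the
real PEBS record layout or the kernel drain functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified PEBS record. */
struct record_model {
	unsigned long ip;
};

/* Walk every record in [base, top) and return how many were handled. */
static size_t drain_records(const struct record_model *base,
			    const struct record_model *top)
{
	size_t handled = 0;

	assert(base <= top);

	/* With an empty buffer (base == top) the loop body never runs. */
	for (const struct record_model *r = base; r < top; r++)
		handled++;

	return handled;
}
```

In this model a PMI that arrives with an empty buffer costs one comparison
and produces no samples.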

Xu, Like Jan. 15, 2021, 3:12 p.m. UTC | #9
On 2021/1/15 22:44, Peter Zijlstra wrote:
> On Fri, Jan 15, 2021 at 10:30:13PM +0800, Xu, Like wrote:
>
>>> Are you sure? Spurious NMI/PMIs are known to happen anyway. We have far
>>> too much code to deal with them.
>> https://lore.kernel.org/lkml/20170628130748.GI5981@leverpostej/T/
>>
>> In the rr workload, the commit change "the PMI interrupts in skid region
>> should be dropped"
>> is reverted since some users complain that:
>>
>>> It seems to me that it might be reasonable to ignore the interrupt if
>>> the purpose of the interrupt is to trigger sampling of the CPUs
>>> register state.  But if the interrupt will trigger some other
>>> operation, such as a signal on an fd, then there's no reason to drop
>>> it.
>> I assume that if the PMI drop is unacceptable, either will spurious PMI
>> injection.
>>
>> I'm pretty open if you insist that we really need to do this for guest PEBS
>> enabling.
> That was an entirely different issue. We were dropping events on the
> floor because they'd passed priv boundaries. So there was an actual
> event, and we made it go away.

Thanks for your clarification and support.

> What we're talking about here is raising an PMI with BUFFER_OVF set,
> even if the DS is empty. That should really be harmless. We'll take the
> PMI, find there's nothing there, and do nothing.

The only harm is that it may confuse guest PEBS users about
the behavior of pebs_interrupt_threshold.

Since KVM has to break that behavior anyway due to the cross-mapping issue,
let me implement this idea in the next version with relevant performance data.

Sean Christopherson Jan. 15, 2021, 5:42 p.m. UTC | #10
On Fri, Jan 15, 2021, Xu, Like wrote:
> On 2021/1/15 2:55, Sean Christopherson wrote:
> > On Mon, Jan 04, 2021, Like Xu wrote:
> > > +	 * Note: KVM disables the co-existence of guest PEBS and host PEBS.
> > By "KVM", do you mean KVM's loading of the MSRs provided by intel_guest_get_msrs()?
> > Because the PMU should really be the entity that controls guest vs. host.  KVM
> > should just be a dumb pipe that handles the mechanics of how values are context
> > switch.
> 
> The intel_guest_get_msrs() and atomic_switch_perf_msrs()
> will work together to disable the co-existence of guest PEBS and host PEBS:
> 
> https://lore.kernel.org/kvm/961e6135-ff6d-86d1-3b7b-a1846ad0e4c4@intel.com/
> 
> +
> 
> static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
> ...
>     if (nr_msrs > 2 && (msrs[1].guest & msrs[0].guest)) {
>         msrs[2].guest = pmu->ds_area;
>         if (nr_msrs > 3)
>             msrs[3].guest = pmu->pebs_data_cfg;
>     }
> 
>    for (i = 0; i < nr_msrs; i++)
> ...

Yeah, that's exactly what I'm complaining about.  Splitting the logic for
determining the guest values is unnecessarily confusing, and as evidenced by the
PEBS_ENABLE bug, potentially fragile.  Perf should have full knowledge and
control of what values are loaded for the guest.  And, the above indexing magic
is nigh impossible to follow and _super_ fragile.

If we change .guest_get_msrs() to take a struct kvm_pmu pointer, then it can
generate the full set of guest values by grabbing ds_area and pebs_data_cfg.
Alternatively, .guest_get_msrs() could take the desired guest MSR values
directly (ds_area and pebs_data_cfg), but kvm_pmu is vendor agnostic, so I don't
see any reason to not just pass the pointer.
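
One possible shape of that suggestion, as a user-space sketch only. Every
name here (kvm_pmu_model, switch_msr_model, the enum values, the function
signature) is hypothetical and chosen for illustration; the real
.guest_get_msrs() interface and struct kvm_pmu differ. The point is that a
single function produces every host/guest pair, tagged by name instead of
relying on array positions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the PMU state that KVM would pass in. */
struct kvm_pmu_model {
	uint64_t pebs_enable;
	uint64_t ds_area;
	uint64_t pebs_data_cfg;
};

enum msr_id_model {
	MSR_PEBS_ENABLE_ID,
	MSR_DS_AREA_ID,
	MSR_PEBS_DATA_CFG_ID,
};

struct switch_msr_model {
	enum msr_id_model id;
	uint64_t host;
	uint64_t guest;
};

/*
 * Fill 'out' with every host/guest pair; the caller only loads them.
 * Returns the number of entries written.
 */
static size_t guest_get_msrs_model(const struct kvm_pmu_model *pmu,
				   uint64_t host_pebs_enable,
				   uint64_t host_ds_area,
				   uint64_t host_pebs_data_cfg,
				   struct switch_msr_model *out, size_t cap)
{
	size_t n = 0;
	uint64_t guest_enable;

	assert(pmu && out && cap >= 3);

	/* Guest PEBS is dropped while the host uses PEBS (restriction above). */
	guest_enable = host_pebs_enable ? 0 : pmu->pebs_enable;

	out[n++] = (struct switch_msr_model){
		MSR_PEBS_ENABLE_ID, host_pebs_enable, guest_enable };

	if (guest_enable) {
		out[n++] = (struct switch_msr_model){
			MSR_DS_AREA_ID, host_ds_area, pmu->ds_area };
		out[n++] = (struct switch_msr_model){
			MSR_PEBS_DATA_CFG_ID, host_pebs_data_cfg,
			pmu->pebs_data_cfg };
	}

	return n;
}
```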

Like Xu Jan. 22, 2021, 5:30 a.m. UTC | #11
On 2021/1/16 1:42, Sean Christopherson wrote:
> On Fri, Jan 15, 2021, Xu, Like wrote:
>> On 2021/1/15 2:55, Sean Christopherson wrote:
>>> On Mon, Jan 04, 2021, Like Xu wrote:
>>>> +	 * Note: KVM disables the co-existence of guest PEBS and host PEBS.
>>> By "KVM", do you mean KVM's loading of the MSRs provided by intel_guest_get_msrs()?
>>> Because the PMU should really be the entity that controls guest vs. host.  KVM
>>> should just be a dumb pipe that handles the mechanics of how values are context
>>> switch.
>>
>> The intel_guest_get_msrs() and atomic_switch_perf_msrs()
>> will work together to disable the co-existence of guest PEBS and host PEBS:
>>
>> https://lore.kernel.org/kvm/961e6135-ff6d-86d1-3b7b-a1846ad0e4c4@intel.com/
>>
>> +
>>
>> static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
>> ...
>>      if (nr_msrs > 2 && (msrs[1].guest & msrs[0].guest)) {
>>          msrs[2].guest = pmu->ds_area;
>>          if (nr_msrs > 3)
>>              msrs[3].guest = pmu->pebs_data_cfg;
>>      }
>>
>>     for (i = 0; i < nr_msrs; i++)
>> ...
> 
> Yeah, that's exactly what I'm complaining about.  Splitting the logic for
> determining the guest values is unnecessarily confusing, and as evidenced by the
> PEBS_ENABLE bug, potentially fragile.  Perf should have full knowledge and
> control of what values are loaded for the guest.  And, the above indexing magic
> is nigh impossible to follow and _super_ fragile.

Thanks for pointing this out.

> 
> If we change .guest_get_msrs() to take a struct kvm_pmu pointer, then it can
> generate the full set of guest values by grabbing ds_area and pebs_data_cfg.
> Alternatively, .guest_get_msrs() could take the desired guest MSR values
> directly (ds_area and pebs_data_cfg), but kvm_pmu is vendor agnostic, so I don't
> see any reason to not just pass the pointer.

Hi Peter,

What do you think of us passing a "struct kvm_pmu" pointer (defined in
arch/x86/include/asm/kvm_host.h) to guest_get_msrs(int *nr)?

---
thx,likexu

Like Xu Jan. 25, 2021, 8:26 a.m. UTC | #12
Hi Peter,

On 2021/1/15 22:44, Peter Zijlstra wrote:
> On Fri, Jan 15, 2021 at 10:30:13PM +0800, Xu, Like wrote:
> 
>>> Are you sure? Spurious NMI/PMIs are known to happen anyway. We have far
>>> too much code to deal with them.
>>
>> https://lore.kernel.org/lkml/20170628130748.GI5981@leverpostej/T/
>>
>> In the rr workload, the commit change "the PMI interrupts in skid region
>> should be dropped"
>> is reverted since some users complain that:
>>
>>> It seems to me that it might be reasonable to ignore the interrupt if
>>> the purpose of the interrupt is to trigger sampling of the CPUs
>>> register state.  But if the interrupt will trigger some other
>>> operation, such as a signal on an fd, then there's no reason to drop
>>> it.
>>
>> I assume that if the PMI drop is unacceptable, either will spurious PMI
>> injection.
>>
>> I'm pretty open if you insist that we really need to do this for guest PEBS
>> enabling.
> 
> That was an entirely different issue. We were dropping events on the
> floor because they'd passed priv boundaries. So there was an actual
> event, and we made it go away.
> 
> What we're talking about here is raising an PMI with BUFFER_OVF set,
> even if the DS is empty. That should really be harmless. We'll take the
> PMI, find there's nothing there, and do nothing.
> 

When both host and guest PEBS are enabled,
we'll get dmesg *bombed* with spurious PMI warnings
if we pass the host PEBS PMI "harmlessly" to the guest:

[11261.502536] Uhhuh. NMI received for unknown reason 2c on CPU 36.
[11261.502539] Do you have a strange power saving mode enabled?
[11261.502541] Dazed and confused, but trying to continue

Legacy guest users may be confused and dissatisfied by that.

I'd like to double-check with you whether it's acceptable to take the
proposal to "disable the co-existence of guest PEBS and host PEBS" as the
first step upstream, and to enable host and guest PEBS together later.

---
thx,likexu
Peter Zijlstra Jan. 25, 2021, 11:47 a.m. UTC | #13
On Mon, Jan 25, 2021 at 04:26:22PM +0800, Like Xu wrote:

> In the host and guest PEBS both enabled case,
> we'll get a crazy dmesg *bombing* about spurious PMI warning
> if we pass the host PEBS PMI "harmlessly" to the guest:
> 
> [11261.502536] Uhhuh. NMI received for unknown reason 2c on CPU 36.
> [11261.502539] Do you have a strange power saving mode enabled?
> [11261.502541] Dazed and confused, but trying to continue

How? AFAICT handle_pmi_common() will increment handled when
GLOBAL_STATUS_BUFFER_OVF_BIT is set, irrespective of DS containing
data.
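
The point above can be modelled in a few lines. This is a simplified sketch,
not the kernel's handle_pmi_common(): handle_pmi_model() and
nmi_is_unknown() are hypothetical names, and the only behaviour modelled is
that the BUFFER_OVF status bit is counted as handled before the DS buffer is
looked at.

```c
#include <stdbool.h>
#include <stdint.h>

#define GLOBAL_STATUS_BUFFER_OVF_BIT	62
#define GLOBAL_STATUS_BUFFER_OVF	(1ULL << GLOBAL_STATUS_BUFFER_OVF_BIT)

/*
 * Model of the PEBS part of a PMI handler (hypothetical helper).
 * 'pebs_records' is the number of records currently in the DS buffer;
 * '*drained' accumulates how many records were consumed.
 * Returns the number of overflow sources that were handled.
 */
static int handle_pmi_model(uint64_t status, unsigned int pebs_records,
			    unsigned int *drained)
{
	int handled = 0;

	if (status & GLOBAL_STATUS_BUFFER_OVF) {
		/* Counted as handled regardless of what DS contains. */
		handled++;
		/* Draining an empty buffer is a no-op. */
		*drained += pebs_records;
	}

	return handled;
}

/* An NMI is reported as "unknown" only if nothing claimed it. */
static bool nmi_is_unknown(int handled)
{
	return handled == 0;
}
```

With BUFFER_OVF set and an empty DS, handle_pmi_model() still returns 1, so
under this model the "unknown reason" message would not be printed.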
Xu, Like Feb. 2, 2021, 6:31 a.m. UTC | #14
On 2021/1/25 19:47, Peter Zijlstra wrote:
> On Mon, Jan 25, 2021 at 04:26:22PM +0800, Like Xu wrote:
>
>> In the host and guest PEBS both enabled case,
>> we'll get a crazy dmesg *bombing* about spurious PMI warning
>> if we pass the host PEBS PMI "harmlessly" to the guest:
>>
>> [11261.502536] Uhhuh. NMI received for unknown reason 2c on CPU 36.
>> [11261.502539] Do you have a strange power saving mode enabled?
>> [11261.502541] Dazed and confused, but trying to continue
> How? AFAICT handle_pmi_common() will increment handled when
> GLOBAL_STATUS_BUFFER_OVF_BIT is set, irrespective of DS containing
> data.

Thanks for this comment; it's enlightening.

When both host and guest PEBS are enabled, the host PEBS PMI will be
injected into the guest only when GLOBAL_STATUS_BUFFER_OVF_BIT is not
set in the guest's global_status.
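
As a sketch of that rule only (it is not taken from the patch, and
should_inject_host_pebs_pmi() is a hypothetical helper), the decision for
the case where host and guest PEBS are both enabled reduces to one bit test
on the guest's global_status value:

```c
#include <stdbool.h>
#include <stdint.h>

#define GLOBAL_STATUS_BUFFER_OVF_BIT	62
#define GLOBAL_STATUS_BUFFER_OVF	(1ULL << GLOBAL_STATUS_BUFFER_OVF_BIT)

/*
 * Hypothetical helper covering only the case where host and guest PEBS
 * are both enabled: inject the host PEBS PMI into the guest only if the
 * guest's global_status does not already have BUFFER_OVF set.
 */
static bool should_inject_host_pebs_pmi(uint64_t guest_global_status)
{
	return !(guest_global_status & GLOBAL_STATUS_BUFFER_OVF);
}
```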

>

Patch

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index b47cc4226934..c499bdb58373 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1721,6 +1721,65 @@  intel_pmu_save_and_restart_reload(struct perf_event *event, int count)
 	return 0;
 }
 
+/*
+ * We may be running with guest PEBS events created by KVM, and the
+ * PEBS records are logged into the guest's DS and invisible to host.
+ *
+ * In the case of guest PEBS overflow, we only trigger a fake event
+ * to emulate the PEBS overflow PMI for guest PEBS counters in KVM.
+ * After the next vm-entry, the guest checks the guest DS area to read
+ * the guest PEBS records.
+ *
+ * The guest PEBS overflow PMI may be dropped when both the guest and
+ * the host use PEBS. Therefore, KVM will not enable guest PEBS once
+ * the host PEBS is enabled since it may cause a confusing unknown NMI.
+ *
+ * The contents and other behavior of the guest event do not matter.
+ */
+static int intel_pmu_handle_guest_pebs(struct cpu_hw_events *cpuc,
+				       struct pt_regs *iregs,
+				       struct debug_store *ds)
+{
+	struct perf_sample_data data;
+	struct perf_event *event = NULL;
+	u64 guest_pebs_idxs = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
+	int bit;
+
+	/*
+	 * Ideally, we should check guest DS to understand if it's
+	 * a guest PEBS overflow PMI from guest PEBS counters.
+	 * However, it brings high overhead to retrieve guest DS in host.
+	 * So we check host DS instead for performance.
+	 *
+	 * If the PEBS interrupt threshold on the host is not exceeded in an
+	 * NMI, the PEBS overflow PMI must come from the guest PEBS counters.
+	 * There is no ambiguity since the reported event in the PMI is guest
+	 * only. It gets handled correctly on a case-by-case basis for each event.
+	 *
+	 * Note: KVM disables the co-existence of guest PEBS and host PEBS.
+	 */
+	if (!guest_pebs_idxs || !in_nmi() ||
+		ds->pebs_index >= ds->pebs_interrupt_threshold)
+		return 0;
+
+	for_each_set_bit(bit, (unsigned long *)&guest_pebs_idxs,
+			INTEL_PMC_IDX_FIXED + x86_pmu.num_counters_fixed) {
+
+		event = cpuc->events[bit];
+		if (!event->attr.precise_ip)
+			continue;
+
+		perf_sample_data_init(&data, 0, event->hw.last_period);
+		if (perf_event_overflow(event, &data, iregs))
+			x86_pmu_stop(event, 0);
+
+		/* Injecting one fake event is enough. */
+		return 1;
+	}
+
+	return 0;
+}
+
 static __always_inline void
 __intel_pmu_pebs_event(struct perf_event *event,
 		       struct pt_regs *iregs,
@@ -1965,6 +2024,9 @@  static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
 	if (!x86_pmu.pebs_active)
 		return;
 
+	if (intel_pmu_handle_guest_pebs(cpuc, iregs, ds))
+		return;
+
 	base = (struct pebs_basic *)(unsigned long)ds->pebs_buffer_base;
 	top = (struct pebs_basic *)(unsigned long)ds->pebs_index;