
[3/3] KVM: x86/vPMU: Add lazy mechanism to release perf_event per vPMC

Message ID: 20190930072257.43352-4-like.xu@linux.intel.com (mailing list archive)
State: New, archived
Series: KVM: x86/vPMU: Efficiency optimization by reusing last created perf_event

Commit Message

Like Xu Sept. 30, 2019, 7:22 a.m. UTC
Currently, a host perf_event is created to emulate each vPMC's
functionality. It is hard to predict whether a disabled perf_event will
ever be reused. If disabled perf_events are not reused for a
considerable period of time, these obsolete events add host
context-switch overhead that could have been avoided.

If the guest doesn't access any of a vPMC's MSRs (via set_msr/get_msr/
rdpmc) during an entire vcpu scheduling time slice, and the vPMC's
individual enable bit isn't set, we can predict that the guest has
finished using this vPMC. It is then time to release the unused
perf_event on the first call of vcpu_enter_guest() after the vcpu is
next scheduled in.

This lazy mechanism delays the event release to the beginning of the
next scheduled time slice when the vPMC's MSRs aren't accessed during
the current one. If the guest comes back to use this vPMC in the next
time slice, a new perf_event is re-created via
perf_event_create_kernel_counter() as usual.

Suggested-by: Wei W Wang <wei.w.wang@intel.com>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
---
 arch/x86/include/asm/kvm_host.h |  8 ++++++
 arch/x86/kvm/pmu.c              | 43 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/pmu.h              |  3 +++
 arch/x86/kvm/pmu_amd.c          | 13 ++++++++++
 arch/x86/kvm/vmx/pmu_intel.c    | 25 +++++++++++++++++++
 arch/x86/kvm/x86.c              |  6 +++++
 6 files changed, 98 insertions(+)
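
For reference, a condensed user-space model of the bookkeeping described
in the commit message above (the names model_pmu, have_event, etc. are
illustrative stand-ins rather than the kernel code, and the per-vPMC
enable-bit check is omitted for brevity):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NR_PMCS 8			/* stand-in for X86_PMC_IDX_MAX */

struct model_pmu {
	uint64_t lazy_release_ctrl;	/* vPMCs touched in this time slice */
	bool have_event[NR_PMCS];	/* "a perf_event exists" stand-in */
	bool need_cleanup;		/* armed at sched-in */
};

/* Any guest MSR access (set_msr/get_msr/rdpmc) marks the vPMC as in use. */
static void model_mark_in_use(struct model_pmu *pmu, int idx)
{
	pmu->lazy_release_ctrl |= 1ULL << idx;
}

/* kvm_arch_sched_in(): arm the cleanup for the upcoming time slice. */
static void model_sched_in(struct model_pmu *pmu)
{
	pmu->need_cleanup = true;
}

/*
 * First vcpu_enter_guest() after sched-in: drop the events whose vPMC
 * was not touched during the previous time slice, then reset the marks.
 */
static void model_cleanup(struct model_pmu *pmu)
{
	int i;

	if (!pmu->need_cleanup)
		return;
	pmu->need_cleanup = false;

	for (i = 0; i < NR_PMCS; i++) {
		if (pmu->have_event[i] &&
		    !(pmu->lazy_release_ctrl & (1ULL << i))) {
			pmu->have_event[i] = false;	/* "release" the event */
			printf("released vPMC %d\n", i);
		}
	}
	pmu->lazy_release_ctrl = 0;
}

int main(void)
{
	struct model_pmu pmu = { .have_event = { true, false, false, true } };

	model_mark_in_use(&pmu, 3);	/* guest still uses vPMC 3 */
	model_sched_in(&pmu);
	model_cleanup(&pmu);		/* releases vPMC 0, keeps vPMC 3 */
	return 0;
}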

Comments

Peter Zijlstra Oct. 1, 2019, 8:23 a.m. UTC | #1
On Mon, Sep 30, 2019 at 03:22:57PM +0800, Like Xu wrote:
> +	union {
> +		u8 event_count :7; /* the total number of created perf_events */
> +		bool enable_cleanup :1;

That's atrocious, don't ever create a bitfield with base _Bool.

> +	} state;
Like Xu Oct. 1, 2019, 12:33 p.m. UTC | #2
Hi Peter,

On 2019/10/1 16:23, Peter Zijlstra wrote:
> On Mon, Sep 30, 2019 at 03:22:57PM +0800, Like Xu wrote:
>> +	union {
>> +		u8 event_count :7; /* the total number of created perf_events */
>> +		bool enable_cleanup :1;
> 
> That's atrocious, don't ever create a bitfield with base _Bool.

I saw this kind of usage in the tree, for example in "struct
arm_smmu_master/tipc_mon_state/regmap_irq_chip".

I'm not sure whether this is your personal preference or whether there is
a technical reason, such as this usage being incompatible with union syntax?

My design point is to save a little bit of space without introducing
two variables such as "int event_count & bool enable_cleanup".

One of the alternatives is to introduce "u8 pmu_state", where the low
seven bits are event_count (enough for X86_PMC_IDX_MAX) and the highest
bit is the enable_cleanup bit. Are you OK with this?

By the way, does the lazy release mechanism look reasonable to you?

> 
>> +	} state;
>
Peter Zijlstra Oct. 8, 2019, 12:11 p.m. UTC | #3
On Tue, Oct 01, 2019 at 08:33:45PM +0800, Like Xu wrote:
> Hi Peter,
> 
> On 2019/10/1 16:23, Peter Zijlstra wrote:
> > On Mon, Sep 30, 2019 at 03:22:57PM +0800, Like Xu wrote:
> > > +	union {
> > > +		u8 event_count :7; /* the total number of created perf_events */
> > > +		bool enable_cleanup :1;
> > 
> > That's atrocious, don't ever create a bitfield with base _Bool.
> 
> I saw this kind of usage in the tree, for example in "struct
> arm_smmu_master/tipc_mon_state/regmap_irq_chip".

Just because other people do tasteless things doesn't make it right.

> I'm not sure whether this is your personal preference or whether there is
> a technical reason, such as this usage being incompatible with union syntax?

Apparently it 'works', so there is no hard technical reason, but
consider that _Bool is specified as an integer type large enough to
store the values 0 and 1, then consider it as a base type for a
bitfield. That's just disgusting.

Now, I suppose it 'works', but there is no actual benefit over just
using a single bit of any other base type.

> My design point is to save a little bit of space without introducing
> two variables such as "int event_count & bool enable_cleanup".

Your design is questionable, the structure is _huge_, and your union has
event_count:0 and enable_cleanup:0 as the same bit, which I don't think
was intentional.

Did you perhaps want to write:

	struct {
		u8 event_count : 7;
		u8 event_cleanup : 1;
	};

which has a total size of 1 byte and uses the low 7 bits as count and the
msb as cleanup.
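
A standalone snippet illustrating the difference (purely illustrative;
the exact bitfield layout and union punning are implementation-defined,
here assuming GCC's usual low-to-high bit allocation on x86):

#include <stdbool.h>
#include <stdio.h>

typedef unsigned char u8;

union buggy {				/* as posted: members overlap */
	u8 event_count :7;		/* bits 0-6 of byte 0 */
	bool enable_cleanup :1;		/* also starts at bit 0 of byte 0 */
};

struct fixed {				/* as suggested: one byte, two fields */
	u8 event_count :7;
	u8 event_cleanup :1;
};

int main(void)
{
	union buggy b = { .event_count = 4 };
	struct fixed f = { .event_count = 4 };

	b.enable_cleanup = true;	/* sets bit 0, corrupting the count */
	f.event_cleanup = 1;		/* sets the msb, count untouched */

	printf("union:  count=%d, size=%zu\n", b.event_count, sizeof(b));
	printf("struct: count=%d, size=%zu\n", f.event_count, sizeof(f));
	return 0;
}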

Also, the structure has plenty holes to stick proper variables in
without further growing it.

> By the way, does the lazy release mechanism look reasonable to you?

I've no idea how it works.. I don't know much about virt.
Like Xu Oct. 9, 2019, 3:14 a.m. UTC | #4
On 2019/10/8 20:11, Peter Zijlstra wrote:
> On Tue, Oct 01, 2019 at 08:33:45PM +0800, Like Xu wrote:
>> Hi Peter,
>>
>> On 2019/10/1 16:23, Peter Zijlstra wrote:
>>> On Mon, Sep 30, 2019 at 03:22:57PM +0800, Like Xu wrote:
>>>> +	union {
>>>> +		u8 event_count :7; /* the total number of created perf_events */
>>>> +		bool enable_cleanup :1;
>>>
>>> That's atrocious, don't ever create a bitfield with base _Bool.
>>
>> I saw this kind of usage in the tree, for example in "struct
>> arm_smmu_master/tipc_mon_state/regmap_irq_chip".
> 
> Just because other people do tasteless things doesn't make it right.
> 
>> I'm not sure whether this is your personal preference or whether there is
>> a technical reason, such as this usage being incompatible with union syntax?
> 
> Apparently it 'works', so there is no hard technical reason, but
> consider that _Bool is specified as an integer type large enough to
> store the values 0 and 1, then consider it as a base type for a
> bitfield. That's just disgusting.

It's reasonable. Thanks.

> 
> Now, I suppose it 'works', but there is no actual benefit over just
> using a single bit of any other base type.
> 
>> My design point is to save a little bit of space without introducing
>> two variables such as "int event_count & bool enable_cleanup".
> 
> Your design is questionable, the structure is _huge_, and your union has
> event_count:0 and enable_cleanup:0 as the same bit, which I don't think
> was intentional.
> 
> Did you perhaps want to write:
> 
> 	struct {
> 		u8 event_count : 7;
> 		u8 event_cleanup : 1;
> 	};
> 
> which has a total size of 1 byte and uses the low 7 bits as count and the
> msb as cleanup.

Yes, my union here is wrong and let me fix it in the next version.

> 
> Also, the structure has plenty holes to stick proper variables in
> without further growing it.

Yes, we could benefit from it.

> 
>> By the way, does the lazy release mechanism look reasonable to you?
> 
> I've no idea how it works.. I don't know much about virt.
>
Paolo Bonzini Oct. 9, 2019, 7:15 a.m. UTC | #5
On 09/10/19 05:14, Like Xu wrote:
>>
>>
>>> I'm not sure whether this is your personal preference or whether there is
>>> a technical reason, such as this usage being incompatible with union syntax?
>>
>> Apparently it 'works', so there is no hard technical reason, but
>> consider that _Bool is specified as an integer type large enough to
>> store the values 0 and 1, then consider it as a base type for a
>> bitfield. That's just disgusting.
> 
> It's reasonable. Thanks.

/me chimes in since this is KVM code after all...

For stuff like hardware registers, bitfields are probably a bad idea
anyway, so let's only consider the case of space optimization.

bool:2 would definitely cause an eyebrow raise, but I don't see why
bool:1 bitfields are a problem.  An integer type large enough to store
the values 0 and 1 can be of any size bigger than one bit.

bool bitfields preserve the magic behavior where something like this:

  foo->x = y;

(x is a bool bitfield) would be compiled as

  foo->x = (y != 0);

which can be a plus or a minus depending on the point of view. :)
Either way, bool bitfields are useful if you are using bitfields for
space optimization, especially if you have existing code using bool and
it might rely on the idiom above.

However, in this patch bitfields are unnecessary and they result in
worse code from the compiler.  There is plenty of padding in struct
kvm_pmu, with or without bitfields, so I'd go with "u8 event_count; bool
enable_cleanup;" (or better "need_cleanup").

Thanks,

Paolo
Like Xu Oct. 9, 2019, 8:07 a.m. UTC | #6
Hi Paolo,
On 2019/10/9 15:15, Paolo Bonzini wrote:
> On 09/10/19 05:14, Like Xu wrote:
>>>
>>>
>>>> I'm not sure whether this is your personal preference or whether there is
>>>> a technical reason, such as this usage being incompatible with union syntax?
>>>
>>> Apparently it 'works', so there is no hard technical reason, but
>>> consider that _Bool is specified as an integer type large enough to
>>> store the values 0 and 1, then consider it as a base type for a
>>> bitfield. That's just disgusting.
>>
>> It's reasonable. Thanks.
> 
> /me chimes in since this is KVM code after all...
> 
> For stuff like hardware registers, bitfields are probably a bad idea
> anyway, so let's only consider the case of space optimization.
> 
> bool:2 would definitely cause an eyebrow raise, but I don't see why
> bool:1 bitfields are a problem.  An integer type large enough to store
> the values 0 and 1 can be of any size bigger than one bit.
> 
> bool bitfields preserve the magic behavior where something like this:
> 
>    foo->x = y;
> 
> (x is a bool bitfield) would be compiled as
> 
>    foo->x = (y != 0);
> 
> which can be a plus or a minus depending on the point of view. :)
> Either way, bool bitfields are useful if you are using bitfields for
> space optimization, especially if you have existing code using bool and
> it might rely on the idiom above.
> 
> However, in this patch bitfields are unnecessary and they result in
> worse code from the compiler.  There is plenty of padding in struct
> kvm_pmu, with or without bitfields, so I'd go with "u8 event_count; bool
> enable_cleanup;" (or better "need_cleanup").

Thanks. The "u8 event_count; bool need_cleanup;" looks good to me.

So does the lazy release mechanism look reasonable to you?
If so, I can send out the next version based on the current feedback.

> 
> Thanks,
> 
> Paolo
>
Peter Zijlstra Oct. 9, 2019, 8:16 a.m. UTC | #7
On Wed, Oct 09, 2019 at 09:15:03AM +0200, Paolo Bonzini wrote:
> For stuff like hardware registers, bitfields are probably a bad idea
> anyway, so let's only consider the case of space optimization.

Except for hardware registers? I actually like bitfields to describe
hardware registers.

> bool:2 would definitely cause an eyebrow raise, but I don't see why
> bool:1 bitfields are a problem.  An integer type large enough to store
> the values 0 and 1 can be of any size bigger than one bit.

Consider:

	bool	foo:1;
	bool	bar:1;

Will bar use the second bit of _Bool? Does it have one? (yes it does,
but it's still weird).

But worse, as used in the parent thread:

	u8	count:7;
	bool	flag:1;

Who says the @flag thing will even be the msb of the initial u8 and not
a whole new variable due to change in base type?

> bool bitfields preserve the magic behavior where something like this:
> 
>   foo->x = y;
> 
> (x is a bool bitfield) would be compiled as
> 
>   foo->x = (y != 0);

This is confusion; if y is a single bit bitfield, then there is
absolutely _NO_ difference between these two expressions.

The _only_ thing about _Bool is that it magically casts values to 0,1.
Single bit bitfield variables have no choice but to already be in that
range.

So expressions where it matters are:

	x = (7&2)	// x == 2
vs
	x = !!(7&2)	// x == 1

But it is impossible for int:1 and _Bool to behave differently.

> However, in this patch bitfields are unnecessary and they result in
> worse code from the compiler.

Fully agreed :-)
Paolo Bonzini Oct. 9, 2019, 9:21 a.m. UTC | #8
On 09/10/19 10:16, Peter Zijlstra wrote:
> On Wed, Oct 09, 2019 at 09:15:03AM +0200, Paolo Bonzini wrote:
>> For stuff like hardware registers, bitfields are probably a bad idea
>> anyway, so let's only consider the case of space optimization.
> 
> Except for hardware registers? I actually like bitfields to describe
> hardware registers.

In theory yes, but in practice for MMIO it's a problem that you're not able
to see the exact reads or writes the compiler emits.  Of course you can do:

	union {
		struct {
			/* some bitfields here */
		} u;
		u32 val;
	}

and only use the bitfields after reading/writing from the register.
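
For instance, a sketch of that pattern for a made-up 32-bit control
register (the register layout, names and helpers here are invented for
illustration, not taken from any real driver):

#include <linux/io.h>
#include <linux/types.h>

union foo_ctrl {			/* hypothetical device register */
	struct {
		u32 enable   :1;
		u32 irq_mask :1;
		u32 reserved :30;
	};
	u32 val;
};

static bool foo_is_enabled(void __iomem *base)
{
	union foo_ctrl ctrl;

	ctrl.val = readl(base);		/* one explicit 32-bit MMIO read */
	return ctrl.enable;		/* decode only the local copy */
}

static void foo_set_enable(void __iomem *base)
{
	union foo_ctrl ctrl;

	ctrl.val = readl(base);
	ctrl.enable = 1;
	writel(ctrl.val, base);		/* one explicit 32-bit MMIO write */
}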

> But worse, as used in the parent thread:
> 
> 	u8	count:7;
> 	bool	flag:1;
> 
> Who says the @flag thing will even be the msb of the initial u8 and not
> a whole new variable due to change in base type?

Good point.

>> bool bitfields preserve the magic behavior where something like this:
>>
>>   foo->x = y;
>>
>> (x is a bool bitfield) would be compiled as
>>
>>   foo->x = (y != 0);
> 
> This is confusion; if y is a single bit bitfield, then there is
> absolutely _NO_ difference between these two expressions.

y is not in a struct so it cannot be a single bit bitfield. :) If y is
an int and foo->x is a bool bitfield, you get the following:

	foo->x = 6;	/* foo->x is 1, it would be 0 for int:1 */
	foo->x = 7;	/* foo->x is 1, it would be 1 for int:1 */

Anyway it's good that we agree on the important thing about the patch!

Paolo

> The _only_ thing about _Bool is that it magically casts values to 0,1.
> Single bit bitfield variables have no choice but to already be in that
> range.
Peter Zijlstra Oct. 9, 2019, 9:32 a.m. UTC | #9
On Wed, Oct 09, 2019 at 11:21:30AM +0200, Paolo Bonzini wrote:
> On 09/10/19 10:16, Peter Zijlstra wrote:

> >> bool bitfields preserve the magic behavior where something like this:
> >>
> >>   foo->x = y;
> >>
> >> (x is a bool bitfield) would be compiled as
> >>
> >>   foo->x = (y != 0);
> > 
> > This is confusion; if y is a single bit bitfield, then there is
> > absolutely _NO_ difference between these two expressions.
> 
> y is not in a struct so it cannot be a single bit bitfield. :) If y is
> an int and foo->x is a bool bitfield, you get the following:
> 
> 	foo->x = 6;	/* foo->x is 1, it would be 0 for int:1 */
> 	foo->x = 7;	/* foo->x is 1, it would be 1 for int:1 */
> 

Urgh, reading hard. You're right!
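
For the record, a tiny standalone program showing the behaviour the
thread converged on: assignment to a bool:1 field normalizes the value
to 0/1 first, whereas a plain single-bit integer field keeps only the
low bit (unsigned is used here to sidestep the implementation-defined
signedness of plain int bitfields):

#include <stdbool.h>
#include <stdio.h>

struct demo {
	bool b :1;		/* _Bool base type: value normalized to 0/1 */
	unsigned int u :1;	/* integer base type: only the low bit kept */
};

int main(void)
{
	struct demo d = { 0 };

	d.b = 6;		/* 6 != 0      -> d.b == 1 */
	d.u = 6;		/* 6 & 1 == 0  -> d.u == 0 */
	printf("b=%d u=%d\n", d.b, d.u);

	d.b = 7;		/* -> 1 */
	d.u = 7;		/* 7 & 1 == 1  -> 1 */
	printf("b=%d u=%d\n", d.b, d.u);
	return 0;
}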

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 15f2ebad94f9..6723c04c8dc6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -479,6 +479,14 @@  struct kvm_pmu {
 	struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
 	struct irq_work irq_work;
 	u64 reprogram_pmi;
+
+	/* for a PMC being set, do not release its perf_event (if any) */
+	u64 lazy_release_ctrl;
+
+	union {
+		u8 event_count :7; /* the total number of created perf_events */
+		bool enable_cleanup :1;
+	} state;
 };
 
 struct kvm_pmu_ops;
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 74bc5c42b8b5..1b3cec38b1a1 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -137,6 +137,7 @@  static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
 	}
 
 	pmc->perf_event = event;
+	pmc_to_pmu(pmc)->state.event_count++;
 	clear_bit(pmc->idx, (unsigned long*)&pmc_to_pmu(pmc)->reprogram_pmi);
 }
 
@@ -368,6 +369,7 @@  int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
 	if (!pmc)
 		return 1;
 
+	__set_bit(pmc->idx, (unsigned long *)&pmu->lazy_release_ctrl);
 	*data = pmc_read_counter(pmc) & mask;
 	return 0;
 }
@@ -385,11 +387,13 @@  bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
 
 int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
 {
+	kvm_x86_ops->pmu_ops->update_lazy_release_ctrl(vcpu, msr);
 	return kvm_x86_ops->pmu_ops->get_msr(vcpu, msr, data);
 }
 
 int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
+	kvm_x86_ops->pmu_ops->update_lazy_release_ctrl(vcpu, msr_info->index);
 	return kvm_x86_ops->pmu_ops->set_msr(vcpu, msr_info);
 }
 
@@ -417,9 +421,48 @@  void kvm_pmu_init(struct kvm_vcpu *vcpu)
 	memset(pmu, 0, sizeof(*pmu));
 	kvm_x86_ops->pmu_ops->init(vcpu);
 	init_irq_work(&pmu->irq_work, kvm_pmi_trigger_fn);
+	pmu->lazy_release_ctrl = 0;
+	pmu->state.event_count = 0;
+	pmu->state.enable_cleanup = false;
 	kvm_pmu_refresh(vcpu);
 }
 
+static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc)
+{
+	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
+
+	if (pmc_is_fixed(pmc))
+		return fixed_ctrl_field(pmu->fixed_ctr_ctrl,
+			pmc->idx - INTEL_PMC_IDX_FIXED) & 0x3;
+
+	return pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE;
+}
+
+void kvm_pmu_cleanup(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct kvm_pmc *pmc = NULL;
+	u64 bitmask = ~pmu->lazy_release_ctrl;
+	int i;
+
+	if (!unlikely(pmu->state.enable_cleanup))
+		return;
+
+	/* do cleanup before the first time of running vcpu after sched_in */
+	pmu->state.enable_cleanup = false;
+
+	/* cleanup unmarked vPMC in the last sched time slice */
+	for_each_set_bit(i, (unsigned long *)&bitmask, X86_PMC_IDX_MAX) {
+		pmc = kvm_x86_ops->pmu_ops->pmc_idx_to_pmc(pmu, i);
+
+		if (pmc && pmc->perf_event && !pmc_speculative_in_use(pmc))
+			pmc_stop_counter(pmc);
+	}
+
+	/* reset vPMC lazy-release states for this sched time slice */
+	pmu->lazy_release_ctrl = 0;
+}
+
 void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
 {
 	kvm_pmu_reset(vcpu);
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 3a95952702d2..c681738ba59c 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -34,6 +34,7 @@  struct kvm_pmu_ops {
 	void (*refresh)(struct kvm_vcpu *vcpu);
 	void (*init)(struct kvm_vcpu *vcpu);
 	void (*reset)(struct kvm_vcpu *vcpu);
+	void (*update_lazy_release_ctrl)(struct kvm_vcpu *vcpu, u32 msr);
 };
 
 static inline u64 pmc_bitmask(struct kvm_pmc *pmc)
@@ -61,6 +62,7 @@  static inline void pmc_release_perf_event(struct kvm_pmc *pmc)
 		perf_event_release_kernel(pmc->perf_event);
 		pmc->perf_event = NULL;
 		pmc->programed_config = 0;
+		pmc_to_pmu(pmc)->state.event_count--;
 	}
 }
 
@@ -125,6 +127,7 @@  int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
 void kvm_pmu_refresh(struct kvm_vcpu *vcpu);
 void kvm_pmu_reset(struct kvm_vcpu *vcpu);
 void kvm_pmu_init(struct kvm_vcpu *vcpu);
+void kvm_pmu_cleanup(struct kvm_vcpu *vcpu);
 void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
 int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
 
diff --git a/arch/x86/kvm/pmu_amd.c b/arch/x86/kvm/pmu_amd.c
index 3d656b2d439f..c74087dad5e8 100644
--- a/arch/x86/kvm/pmu_amd.c
+++ b/arch/x86/kvm/pmu_amd.c
@@ -208,6 +208,18 @@  static bool amd_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
 	return ret;
 }
 
+static void amd_update_lazy_release_ctrl(struct kvm_vcpu *vcpu, u32 msr)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct kvm_pmc *pmc = NULL;
+
+	pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER);
+	pmc = pmc ? pmc : get_gp_pmc_amd(pmu, msr, PMU_TYPE_EVNTSEL);
+
+	if (pmc)
+		__set_bit(pmc->idx, (unsigned long *)&pmu->lazy_release_ctrl);
+}
+
 static int amd_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -315,4 +327,5 @@  struct kvm_pmu_ops amd_pmu_ops = {
 	.refresh = amd_pmu_refresh,
 	.init = amd_pmu_init,
 	.reset = amd_pmu_reset,
+	.update_lazy_release_ctrl = amd_update_lazy_release_ctrl,
 };
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 73bbefa1d54e..4aa7d2eea5c8 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -140,6 +140,30 @@  static struct kvm_pmc *intel_msr_idx_to_pmc(struct kvm_vcpu *vcpu,
 	return &counters[idx];
 }
 
+static void intel_update_lazy_release_ctrl(struct kvm_vcpu *vcpu, u32 msr)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct kvm_pmc *pmc = NULL;
+	int i;
+
+	if (msr == MSR_CORE_PERF_FIXED_CTR_CTRL) {
+		for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
+			if (!fixed_ctrl_field(pmu->fixed_ctr_ctrl, i))
+				continue;
+			__set_bit(INTEL_PMC_IDX_FIXED + i,
+				(unsigned long *)&pmu->lazy_release_ctrl);
+		}
+		return;
+	}
+
+	pmc = get_fixed_pmc(pmu, msr);
+	pmc = pmc ? pmc : get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0);
+	pmc = pmc ? pmc : get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0);
+
+	if (pmc)
+		__set_bit(pmc->idx, (unsigned long *)&pmu->lazy_release_ctrl);
+}
+
 static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -373,4 +397,5 @@  struct kvm_pmu_ops intel_pmu_ops = {
 	.refresh = intel_pmu_refresh,
 	.init = intel_pmu_init,
 	.reset = intel_pmu_reset,
+	.update_lazy_release_ctrl = intel_update_lazy_release_ctrl,
 };
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0ed07d8d2caa..945b8be53a90 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8076,6 +8076,8 @@  static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		goto cancel_injection;
 	}
 
+	kvm_pmu_cleanup(vcpu);
+
 	preempt_disable();
 
 	kvm_x86_ops->prepare_guest_switch(vcpu);
@@ -9415,7 +9417,11 @@  void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 
 void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
 {
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
 	vcpu->arch.l1tf_flush_l1d = true;
+	if (pmu->version && unlikely(pmu->state.event_count))
+		pmu->state.enable_cleanup = true;
 	kvm_x86_ops->sched_in(vcpu, cpu);
 }