[00/15] perf: KVM: Fix, optimize, and clean up callbacks

Message ID 20210827005718.585190-1-seanjc@google.com (mailing list archive)
State New, archived

Commit Message

Sean Christopherson Aug. 27, 2021, 12:57 a.m. UTC
This started out as a small series[1] to fix a KVM bug related to Intel PT
interrupt handling and snowballed horribly.

The main problem being addressed is that the perf_guest_cbs are shared by
all CPUs, can be nullified by KVM during module unload, and are not
protected against concurrent access from NMI context.

The bug has escaped notice because all dereferences of perf_guest_cbs
follow the same "perf_guest_cbs && perf_guest_cbs->is_in_guest()" pattern,
and AFAICT the compiler never reloads perf_guest_cbs in this sequence.
The compiler does reload perf_guest_cbs for any future dereferences, but
the ->is_in_guest() guard all but guarantees the PMI handler will win the
race, e.g. to nullify perf_guest_cbs, KVM has to completely exit the guest
and tear down all VMs before it can be unloaded.
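
Concretely, the fragile pattern looks like this (an annotated sketch of
perf_misc_flags(); the comments are added for illustration):

	if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
		/*
		 * The guard above happens to use a single, cached load of
		 * perf_guest_cbs, but this dereference reloads the pointer
		 * and can observe NULL if unregistration races with an NMI.
		 */
		if (perf_guest_cbs->is_user_mode())
			misc |= PERF_RECORD_MISC_GUEST_USER;
		...
	}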

But with help, e.g. READ_ONCE(perf_guest_cbs), unloading kvm_intel can
trigger a NULL pointer dereference (see below).  Manual intervention aside,
the bug is a bit of a time bomb, e.g. my patch 3 from the original PT
handling series would have omitted the ->is_in_guest() guard.

This series fixes the problem by making the callbacks per-CPU, and
registering/unregistering the callbacks only with preemption disabled
(except for the Xen case, which doesn't unregister).
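
Roughly, the registration side ends up looking like this (a simplified
sketch, not the literal patch; the actual series differs in the details):

	static DEFINE_PER_CPU(struct perf_guest_info_callbacks *, perf_guest_cbs);

	void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
	{
		/*
		 * Publishing only this CPU's pointer, with preemption
		 * disabled, guarantees that an NMI on this CPU sees a
		 * stable (possibly NULL) value.
		 */
		lockdep_assert_preemption_disabled();
		__this_cpu_write(perf_guest_cbs, cbs);
	}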

This approach also allows for several nice cleanups in this series.
KVM x86 and arm64 can share callbacks, KVM x86 drops its somewhat
redundant current_vcpu, and the retpoline that is currently hit when KVM
is loaded (due to always checking ->is_in_guest()) goes away (it's still
there when running as Xen Dom0).

Changing to per-CPU callbacks also provides a good excuse to excise
copy+paste code from architectures that can't possibly have guest
callbacks.

This series conflicts horribly with a proposed patch[2] to use static
calls for perf_guest_cbs.  But that patch is broken as it completely
fails to handle unregister, and it's not clear to me whether or not
it can correctly handle unregister without fixing the underlying race
(I don't know enough about the code patching for static calls).

This tweak

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 1eb45139fcc6..202e5ad97f82 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2954,7 +2954,7 @@ unsigned long perf_misc_flags(struct pt_regs *regs)
 {
        int misc = 0;

-       if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+       if (READ_ONCE(perf_guest_cbs) && READ_ONCE(perf_guest_cbs)->is_in_guest()) {
                if (perf_guest_cbs->is_user_mode())
                        misc |= PERF_RECORD_MISC_GUEST_USER;
                else

while spamming module load/unload leads to:

  BUG: kernel NULL pointer dereference, address: 0000000000000000
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 [#1] PREEMPT SMP
  CPU: 6 PID: 1825 Comm: stress Not tainted 5.14.0-rc2+ #459
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:perf_misc_flags+0x1c/0x70
  Call Trace:
   perf_prepare_sample+0x53/0x6b0
   perf_event_output_forward+0x67/0x160
   __perf_event_overflow+0x52/0xf0
   handle_pmi_common+0x207/0x300
   intel_pmu_handle_irq+0xcf/0x410
   perf_event_nmi_handler+0x28/0x50
   nmi_handle+0xc7/0x260
   default_do_nmi+0x6b/0x170
   exc_nmi+0x103/0x130
   asm_exc_nmi+0x76/0xbf

[1] https://lkml.kernel.org/r/20210823193709.55886-1-seanjc@google.com
[2] https://lkml.kernel.org/r/20210806133802.3528-2-lingshan.zhu@intel.com

Sean Christopherson (15):
  KVM: x86: Register perf callbacks after calling vendor's
    hardware_setup()
  KVM: x86: Register Processor Trace interrupt hook iff PT enabled in
    guest
  perf: Stop pretending that perf can handle multiple guest callbacks
  perf: Force architectures to opt-in to guest callbacks
  perf: Track guest callbacks on a per-CPU basis
  KVM: x86: Register perf callbacks only when actively handling
    interrupt
  KVM: Use dedicated flag to track if KVM is handling an NMI from guest
  KVM: x86: Drop current_vcpu in favor of kvm_running_vcpu
  KVM: arm64: Register/unregister perf callbacks at vcpu load/put
  KVM: Move x86's perf guest info callbacks to generic KVM
  KVM: x86: Move Intel Processor Trace interrupt handler to vmx.c
  KVM: arm64: Convert to the generic perf callbacks
  KVM: arm64: Drop perf.c and fold its tiny bit of code into pmu.c
  perf: Disallow bulk unregistering of guest callbacks and do cleanup
  perf: KVM: Indicate "in guest" via NULL ->is_in_guest callback

 arch/arm/kernel/perf_callchain.c   | 28 ++------------
 arch/arm64/Kconfig                 |  1 +
 arch/arm64/include/asm/kvm_host.h  |  8 +++-
 arch/arm64/kernel/perf_callchain.c | 18 ++++++---
 arch/arm64/kvm/Makefile            |  2 +-
 arch/arm64/kvm/arm.c               | 13 ++++++-
 arch/arm64/kvm/perf.c              | 62 ------------------------------
 arch/arm64/kvm/pmu.c               |  8 ++++
 arch/csky/kernel/perf_callchain.c  | 10 -----
 arch/nds32/kernel/perf_event_cpu.c | 29 ++------------
 arch/riscv/kernel/perf_callchain.c | 10 -----
 arch/x86/Kconfig                   |  1 +
 arch/x86/events/core.c             | 17 +++++---
 arch/x86/events/intel/core.c       |  7 ++--
 arch/x86/include/asm/kvm_host.h    |  4 +-
 arch/x86/kvm/pmu.c                 |  2 +-
 arch/x86/kvm/pmu.h                 |  1 +
 arch/x86/kvm/svm/svm.c             |  2 +-
 arch/x86/kvm/vmx/vmx.c             | 25 ++++++++++--
 arch/x86/kvm/x86.c                 | 54 +++-----------------------
 arch/x86/kvm/x86.h                 | 12 +++---
 arch/x86/xen/pmu.c                 |  2 +-
 include/kvm/arm_pmu.h              |  1 +
 include/linux/kvm_host.h           | 12 ++++++
 include/linux/perf_event.h         | 33 ++++++++++++----
 init/Kconfig                       |  3 ++
 kernel/events/core.c               | 28 ++++++++------
 virt/kvm/kvm_main.c                | 42 ++++++++++++++++++++
 28 files changed, 204 insertions(+), 231 deletions(-)
 delete mode 100644 arch/arm64/kvm/perf.c

Comments

Like Xu Aug. 27, 2021, 6:52 a.m. UTC | #1
+ STATIC BRANCH/CALL friends.

On 27/8/2021 8:57 am, Sean Christopherson wrote:
> This started out as a small series[1] to fix a KVM bug related to Intel PT
> interrupt handling and snowballed horribly.
> 
> The main problem being addressed is that the perf_guest_cbs are shared by
> all CPUs, can be nullified by KVM during module unload, and are not
> protected against concurrent access from NMI context.

Shouldn't this be a generic issue with static_call() usage?

At the beginning, we set up the static entry assuming perf_guest_cbs != NULL:

	if (perf_guest_cbs && perf_guest_cbs->handle_intel_pt_intr) {
		static_call_update(x86_guest_handle_intel_pt_intr,
				   perf_guest_cbs->handle_intel_pt_intr);
	}

and then we unset the perf_guest_cbs and do the static function call like this:

DECLARE_STATIC_CALL(x86_guest_handle_intel_pt_intr,
		    *(perf_guest_cbs->handle_intel_pt_intr));
static int handle_pmi_common(struct pt_regs *regs, u64 status)
{
		...
		if (!static_call(x86_guest_handle_intel_pt_intr)())
			intel_pt_interrupt();
		...
}

Can we point the static_call() back at the original
"(void *)&__static_call_return0" in this case?
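
I.e., on unregister, something like this (untested sketch):

	/* Restore the default return-0 target on unregister. */
	static_call_update(x86_guest_handle_intel_pt_intr,
			   (void *)&__static_call_return0);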

> 
> The bug has escaped notice because all dereferences of perf_guest_cbs
> follow the same "perf_guest_cbs && perf_guest_cbs->is_in_guest()" pattern,
> and AFAICT the compiler never reloads perf_guest_cbs in this sequence.
> The compiler does reload perf_guest_cbs for any future dereferences, but
> the ->is_in_guest() guard all but guarantees the PMI handler will win the
> race, e.g. to nullify perf_guest_cbs, KVM has to completely exit the guest
> and tear down all VMs before it can be unloaded.
> 
> But with help, e.g. READ_ONCE(perf_guest_cbs), unloading kvm_intel can
> trigger a NULL pointer dereference (see below).  Manual intervention aside,
> the bug is a bit of a time bomb, e.g. my patch 3 from the original PT
> handling series would have omitted the ->is_in_guest() guard.
> 
> This series fixes the problem by making the callbacks per-CPU, and
> registering/unregistering the callbacks only with preemption disabled
> (except for the Xen case, which doesn't unregister).
> 
> This approach also allows for several nice cleanups in this series.
> KVM x86 and arm64 can share callbacks, KVM x86 drops its somewhat
> redundant current_vcpu, and the retpoline that is currently hit when KVM
> is loaded (due to always checking ->is_in_guest()) goes away (it's still
> there when running as Xen Dom0).
> 
> Changing to per-CPU callbacks also provides a good excuse to excise
> copy+paste code from architectures that can't possibly have guest
> callbacks.
> 
> This series conflicts horribly with a proposed patch[2] to use static
> calls for perf_guest_cbs.  But that patch is broken as it completely
> fails to handle unregister, and it's not clear to me whether or not
> it can correctly handle unregister without fixing the underlying race
> (I don't know enough about the code patching for static calls).
> 
> This tweak
> 
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 1eb45139fcc6..202e5ad97f82 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2954,7 +2954,7 @@ unsigned long perf_misc_flags(struct pt_regs *regs)
>   {
>          int misc = 0;
> 
> -       if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
> +       if (READ_ONCE(perf_guest_cbs) && READ_ONCE(perf_guest_cbs)->is_in_guest()) {
>                  if (perf_guest_cbs->is_user_mode())
>                          misc |= PERF_RECORD_MISC_GUEST_USER;
>                  else
> 
> while spamming module load/unload leads to:
> 
>    BUG: kernel NULL pointer dereference, address: 0000000000000000
>    #PF: supervisor read access in kernel mode
>    #PF: error_code(0x0000) - not-present page
>    PGD 0 P4D 0
>    Oops: 0000 [#1] PREEMPT SMP
>    CPU: 6 PID: 1825 Comm: stress Not tainted 5.14.0-rc2+ #459
>    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>    RIP: 0010:perf_misc_flags+0x1c/0x70
>    Call Trace:
>     perf_prepare_sample+0x53/0x6b0
>     perf_event_output_forward+0x67/0x160
>     __perf_event_overflow+0x52/0xf0
>     handle_pmi_common+0x207/0x300
>     intel_pmu_handle_irq+0xcf/0x410
>     perf_event_nmi_handler+0x28/0x50
>     nmi_handle+0xc7/0x260
>     default_do_nmi+0x6b/0x170
>     exc_nmi+0x103/0x130
>     asm_exc_nmi+0x76/0xbf
> 
> [1] https://lkml.kernel.org/r/20210823193709.55886-1-seanjc@google.com
> [2] https://lkml.kernel.org/r/20210806133802.3528-2-lingshan.zhu@intel.com
> 
> Sean Christopherson (15):
>    KVM: x86: Register perf callbacks after calling vendor's
>      hardware_setup()
>    KVM: x86: Register Processor Trace interrupt hook iff PT enabled in
>      guest
>    perf: Stop pretending that perf can handle multiple guest callbacks
>    perf: Force architectures to opt-in to guest callbacks
>    perf: Track guest callbacks on a per-CPU basis
>    KVM: x86: Register perf callbacks only when actively handling
>      interrupt
>    KVM: Use dedicated flag to track if KVM is handling an NMI from guest
>    KVM: x86: Drop current_vcpu in favor of kvm_running_vcpu
>    KVM: arm64: Register/unregister perf callbacks at vcpu load/put
>    KVM: Move x86's perf guest info callbacks to generic KVM
>    KVM: x86: Move Intel Processor Trace interrupt handler to vmx.c
>    KVM: arm64: Convert to the generic perf callbacks
>    KVM: arm64: Drop perf.c and fold its tiny bit of code into pmu.c
>    perf: Disallow bulk unregistering of guest callbacks and do cleanup
>    perf: KVM: Indicate "in guest" via NULL ->is_in_guest callback
> 
>   arch/arm/kernel/perf_callchain.c   | 28 ++------------
>   arch/arm64/Kconfig                 |  1 +
>   arch/arm64/include/asm/kvm_host.h  |  8 +++-
>   arch/arm64/kernel/perf_callchain.c | 18 ++++++---
>   arch/arm64/kvm/Makefile            |  2 +-
>   arch/arm64/kvm/arm.c               | 13 ++++++-
>   arch/arm64/kvm/perf.c              | 62 ------------------------------
>   arch/arm64/kvm/pmu.c               |  8 ++++
>   arch/csky/kernel/perf_callchain.c  | 10 -----
>   arch/nds32/kernel/perf_event_cpu.c | 29 ++------------
>   arch/riscv/kernel/perf_callchain.c | 10 -----
>   arch/x86/Kconfig                   |  1 +
>   arch/x86/events/core.c             | 17 +++++---
>   arch/x86/events/intel/core.c       |  7 ++--
>   arch/x86/include/asm/kvm_host.h    |  4 +-
>   arch/x86/kvm/pmu.c                 |  2 +-
>   arch/x86/kvm/pmu.h                 |  1 +
>   arch/x86/kvm/svm/svm.c             |  2 +-
>   arch/x86/kvm/vmx/vmx.c             | 25 ++++++++++--
>   arch/x86/kvm/x86.c                 | 54 +++-----------------------
>   arch/x86/kvm/x86.h                 | 12 +++---
>   arch/x86/xen/pmu.c                 |  2 +-
>   include/kvm/arm_pmu.h              |  1 +
>   include/linux/kvm_host.h           | 12 ++++++
>   include/linux/perf_event.h         | 33 ++++++++++++----
>   init/Kconfig                       |  3 ++
>   kernel/events/core.c               | 28 ++++++++------
>   virt/kvm/kvm_main.c                | 42 ++++++++++++++++++++
>   28 files changed, 204 insertions(+), 231 deletions(-)
>   delete mode 100644 arch/arm64/kvm/perf.c
>
Peter Zijlstra Aug. 27, 2021, 7:44 a.m. UTC | #2
On Fri, Aug 27, 2021 at 02:52:25PM +0800, Like Xu wrote:
> + STATIC BRANCH/CALL friends.
> 
> On 27/8/2021 8:57 am, Sean Christopherson wrote:
> > This started out as a small series[1] to fix a KVM bug related to Intel PT
> > interrupt handling and snowballed horribly.
> > 
> > The main problem being addressed is that the perf_guest_cbs are shared by
> > all CPUs, can be nullified by KVM during module unload, and are not
> > protected against concurrent access from NMI context.
> 
> Shouldn't this be a generic issue with static_call() usage?
> 
> At the beginning, we set up the static entry assuming perf_guest_cbs != NULL:
> 
> 	if (perf_guest_cbs && perf_guest_cbs->handle_intel_pt_intr) {
> 		static_call_update(x86_guest_handle_intel_pt_intr,
> 				   perf_guest_cbs->handle_intel_pt_intr);
> 	}
> 
> and then we unset the perf_guest_cbs and do the static function call like this:
> 
> DECLARE_STATIC_CALL(x86_guest_handle_intel_pt_intr,
> *(perf_guest_cbs->handle_intel_pt_intr));
> static int handle_pmi_common(struct pt_regs *regs, u64 status)
> {
> 		...
> 		if (!static_call(x86_guest_handle_intel_pt_intr)())
> 			intel_pt_interrupt();
> 		...
> }

You just have to make sure all static_call() invocations that started
before unreg are finished before continuing with the unload.
synchronize_rcu() can help with that.

This is module unload 101. Nothing specific to static_call().
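
I.e., roughly (a sketch of the ordering, not literal code):

	/* Unpublish the callbacks... */
	WRITE_ONCE(perf_guest_cbs, NULL);
	/* ...wait out every handler that might have seen the old value... */
	synchronize_rcu();
	/* ...and only then let the module text go away. */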
Like Xu Aug. 27, 2021, 8:01 a.m. UTC | #3
On 27/8/2021 3:44 pm, Peter Zijlstra wrote:
> On Fri, Aug 27, 2021 at 02:52:25PM +0800, Like Xu wrote:
>> + STATIC BRANCH/CALL friends.
>>
>> On 27/8/2021 8:57 am, Sean Christopherson wrote:
>>> This started out as a small series[1] to fix a KVM bug related to Intel PT
>>> interrupt handling and snowballed horribly.
>>>
>>> The main problem being addressed is that the perf_guest_cbs are shared by
>>> all CPUs, can be nullified by KVM during module unload, and are not
>>> protected against concurrent access from NMI context.
>>
>> Shouldn't this be a generic issue with static_call() usage?
>>
>> At the beginning, we set up the static entry assuming perf_guest_cbs != NULL:
>>
>> 	if (perf_guest_cbs && perf_guest_cbs->handle_intel_pt_intr) {
>> 		static_call_update(x86_guest_handle_intel_pt_intr,
>> 				   perf_guest_cbs->handle_intel_pt_intr);
>> 	}
>>
>> and then we unset the perf_guest_cbs and do the static function call like this:
>>
>> DECLARE_STATIC_CALL(x86_guest_handle_intel_pt_intr,
>> *(perf_guest_cbs->handle_intel_pt_intr));
>> static int handle_pmi_common(struct pt_regs *regs, u64 status)
>> {
>> 		...
>> 		if (!static_call(x86_guest_handle_intel_pt_intr)())
>> 			intel_pt_interrupt();
>> 		...
>> }
> 
> You just have to make sure all static_call() invocations that started
> before unreg are finished before continuing with the unload.
> synchronize_rcu() can help with that.

Do you mean something like this:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 64e310ff4f3a..e7d310af7509 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8465,6 +8465,7 @@ void kvm_arch_exit(void)
  #endif
  	kvm_lapic_exit();
  	perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
+	synchronize_rcu();

  	if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
  		cpufreq_unregister_notifier(&kvmclock_cpufreq_notifier_block,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index e466fc8176e1..63ae56c5d133 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6508,6 +6508,7 @@ EXPORT_SYMBOL_GPL(perf_register_guest_info_callbacks);
  int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
  {
  	perf_guest_cbs = NULL;
+	arch_perf_update_guest_cbs();
  	return 0;
  }
  EXPORT_SYMBOL_GPL(perf_unregister_guest_info_callbacks);

> 
> This is module unload 101. Nothing specific to static_call().
>
Peter Zijlstra Aug. 27, 2021, 10:47 a.m. UTC | #4
On Fri, Aug 27, 2021 at 04:01:45PM +0800, Like Xu wrote:
> On 27/8/2021 3:44 pm, Peter Zijlstra wrote:

> > You just have to make sure all static_call() invocations that started
> > before unreg are finished before continuing with the unload.
> > synchronize_rcu() can help with that.
> 
> Do you mean something like this:
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 64e310ff4f3a..e7d310af7509 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8465,6 +8465,7 @@ void kvm_arch_exit(void)
>  #endif
>  	kvm_lapic_exit();
>  	perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
> +	synchronize_rcu();
> 
>  	if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
>  		cpufreq_unregister_notifier(&kvmclock_cpufreq_notifier_block,
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index e466fc8176e1..63ae56c5d133 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6508,6 +6508,7 @@ EXPORT_SYMBOL_GPL(perf_register_guest_info_callbacks);
>  int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
>  {
>  	perf_guest_cbs = NULL;
> +	arch_perf_update_guest_cbs();

I'm thinking the synchronize_rcu() should go here, and access to
perf_guest_cbs should be wrapped to yell when called with preemption
enabled.

But yes..

>  	return 0;
>  }
>  EXPORT_SYMBOL_GPL(perf_unregister_guest_info_callbacks);
> 
> > 
> > This is module unload 101. Nothing specific to static_call().
> >
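
Such a wrapper could be as simple as this (sketch; the helper name is
illustrative, not from the series):

	static inline struct perf_guest_info_callbacks *perf_get_guest_cbs(void)
	{
		/* Yell if we could migrate and race with unregistration. */
		WARN_ON_ONCE(preemptible());
		return READ_ONCE(perf_guest_cbs);
	}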

Patch

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 1eb45139fcc6..202e5ad97f82 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2954,7 +2954,7 @@ unsigned long perf_misc_flags(struct pt_regs *regs)
 {
        int misc = 0;

-       if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+       if (READ_ONCE(perf_guest_cbs) && READ_ONCE(perf_guest_cbs)->is_in_guest()) {
                if (perf_guest_cbs->is_user_mode())
                        misc |= PERF_RECORD_MISC_GUEST_USER;
                else