Message ID | 20210827005718.585190-1-seanjc@google.com (mailing list archive) |
---|---|
State | New, archived |
+ STATIC BRANCH/CALL friends.

On 27/8/2021 8:57 am, Sean Christopherson wrote:
> This started out as a small series[1] to fix a KVM bug related to Intel PT
> interrupt handling and snowballed horribly.
>
> The main problem being addressed is that the perf_guest_cbs are shared by
> all CPUs, can be nullified by KVM during module unload, and are not
> protected against concurrent access from NMI context.

Shouldn't this be a generic issue of the static_call() usage ?

At the beginning, we set up the static entry assuming perf_guest_cbs != NULL:

	if (perf_guest_cbs && perf_guest_cbs->handle_intel_pt_intr) {
		static_call_update(x86_guest_handle_intel_pt_intr,
				   perf_guest_cbs->handle_intel_pt_intr);
	}

and then we unset the perf_guest_cbs and do the static function call like this:

	DECLARE_STATIC_CALL(x86_guest_handle_intel_pt_intr,
			    *(perf_guest_cbs->handle_intel_pt_intr));

	static int handle_pmi_common(struct pt_regs *regs, u64 status)
	{
		...
		if (!static_call(x86_guest_handle_intel_pt_intr)())
			intel_pt_interrupt();
		...
	}

Can we make static_call() back to the original "(void *)&__static_call_return0"
in this case ?

>
> The bug has escaped notice because all dereferences of perf_guest_cbs
> follow the same "perf_guest_cbs && perf_guest_cbs->is_in_guest()" pattern,
> and AFAICT the compiler never reloads perf_guest_cbs in this sequence.
> The compiler does reload perf_guest_cbs for any future dereferences, but
> the ->is_in_guest() guard all but guarantees the PMI handler will win the
> race, e.g. to nullify perf_guest_cbs, KVM has to completely exit the guest
> and tear down all VMs before it can be unloaded.
>
> But with help, e.g. READ_ONCE(perf_guest_cbs), unloading kvm_intel can
> trigger a NULL pointer dereference (see below). Manual intervention aside,
> the bug is a bit of a time bomb, e.g. my patch 3 from the original PT
> handling series would have omitted the ->is_in_guest() guard.
>
> This series fixes the problem by making the callbacks per-CPU, and
> registering/unregistering the callbacks only with preemption disabled
> (except for the Xen case, which doesn't unregister).
>
> This approach also allows for several nice cleanups in this series.
> KVM x86 and arm64 can share callbacks, KVM x86 drops its somewhat
> redundant current_vcpu, and the retpoline that is currently hit when KVM
> is loaded (due to always checking ->is_in_guest()) goes away (it's still
> there when running as Xen Dom0).
>
> Changing to per-CPU callbacks also provides a good excuse to excise
> copy+paste code from architectures that can't possibly have guest
> callbacks.
>
> This series conflicts horribly with a proposed patch[2] to use static
> calls for perf_guest_cbs. But that patch is broken as it completely
> fails to handle unregister, and it's not clear to me whether or not
> it can correctly handle unregister without fixing the underlying race
> (I don't know enough about the code patching for static calls).
>
> This tweak
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 1eb45139fcc6..202e5ad97f82 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2954,7 +2954,7 @@ unsigned long perf_misc_flags(struct pt_regs *regs)
>  {
>  	int misc = 0;
>
> -	if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
> +	if (READ_ONCE(perf_guest_cbs) && READ_ONCE(perf_guest_cbs)->is_in_guest()) {
>  		if (perf_guest_cbs->is_user_mode())
>  			misc |= PERF_RECORD_MISC_GUEST_USER;
>  		else
>
> while spamming module load/unload leads to:
>
>   BUG: kernel NULL pointer dereference, address: 0000000000000000
>   #PF: supervisor read access in kernel mode
>   #PF: error_code(0x0000) - not-present page
>   PGD 0 P4D 0
>   Oops: 0000 [#1] PREEMPT SMP
>   CPU: 6 PID: 1825 Comm: stress Not tainted 5.14.0-rc2+ #459
>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>   RIP: 0010:perf_misc_flags+0x1c/0x70
>   Call Trace:
>    perf_prepare_sample+0x53/0x6b0
>    perf_event_output_forward+0x67/0x160
>    __perf_event_overflow+0x52/0xf0
>    handle_pmi_common+0x207/0x300
>    intel_pmu_handle_irq+0xcf/0x410
>    perf_event_nmi_handler+0x28/0x50
>    nmi_handle+0xc7/0x260
>    default_do_nmi+0x6b/0x170
>    exc_nmi+0x103/0x130
>    asm_exc_nmi+0x76/0xbf
>
> [1] https://lkml.kernel.org/r/20210823193709.55886-1-seanjc@google.com
> [2] https://lkml.kernel.org/r/20210806133802.3528-2-lingshan.zhu@intel.com
>
> Sean Christopherson (15):
>   KVM: x86: Register perf callbacks after calling vendor's hardware_setup()
>   KVM: x86: Register Processor Trace interrupt hook iff PT enabled in guest
>   perf: Stop pretending that perf can handle multiple guest callbacks
>   perf: Force architectures to opt-in to guest callbacks
>   perf: Track guest callbacks on a per-CPU basis
>   KVM: x86: Register perf callbacks only when actively handling interrupt
>   KVM: Use dedicated flag to track if KVM is handling an NMI from guest
>   KVM: x86: Drop current_vcpu in favor of kvm_running_vcpu
>   KVM: arm64: Register/unregister perf callbacks at vcpu load/put
>   KVM: Move x86's perf guest info callbacks to generic KVM
>   KVM: x86: Move Intel Processor Trace interrupt handler to vmx.c
>   KVM: arm64: Convert to the generic perf callbacks
>   KVM: arm64: Drop perf.c and fold its tiny bit of code into pmu.c
>   perf: Disallow bulk unregistering of guest callbacks and do cleanup
>   perf: KVM: Indicate "in guest" via NULL ->is_in_guest callback
>
>  arch/arm/kernel/perf_callchain.c   | 28 ++------------
>  arch/arm64/Kconfig                 |  1 +
>  arch/arm64/include/asm/kvm_host.h  |  8 +++-
>  arch/arm64/kernel/perf_callchain.c | 18 ++++++---
>  arch/arm64/kvm/Makefile            |  2 +-
>  arch/arm64/kvm/arm.c               | 13 ++++++-
>  arch/arm64/kvm/perf.c              | 62 ------------------------------
>  arch/arm64/kvm/pmu.c               |  8 ++++
>  arch/csky/kernel/perf_callchain.c  | 10 -----
>  arch/nds32/kernel/perf_event_cpu.c | 29 ++------------
>  arch/riscv/kernel/perf_callchain.c | 10 -----
>  arch/x86/Kconfig                   |  1 +
>  arch/x86/events/core.c             | 17 +++++---
>  arch/x86/events/intel/core.c       |  7 ++--
>  arch/x86/include/asm/kvm_host.h    |  4 +-
>  arch/x86/kvm/pmu.c                 |  2 +-
>  arch/x86/kvm/pmu.h                 |  1 +
>  arch/x86/kvm/svm/svm.c             |  2 +-
>  arch/x86/kvm/vmx/vmx.c             | 25 ++++++++++--
>  arch/x86/kvm/x86.c                 | 54 +++-----------------------
>  arch/x86/kvm/x86.h                 | 12 +++---
>  arch/x86/xen/pmu.c                 |  2 +-
>  include/kvm/arm_pmu.h              |  1 +
>  include/linux/kvm_host.h           | 12 ++++++
>  include/linux/perf_event.h         | 33 ++++++++++++----
>  init/Kconfig                       |  3 ++
>  kernel/events/core.c               | 28 ++++++++------
>  virt/kvm/kvm_main.c                | 42 ++++++++++++++++++++
>  28 files changed, 204 insertions(+), 231 deletions(-)
>  delete mode 100644 arch/arm64/kvm/perf.c
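For context, a rough sketch of the reset Like Xu is asking about (illustrative only; this is not code from Sean's series or from the conflicting static-call patch, and the function name is made up). It assumes the DECLARE_STATIC_CALL() shown in the quoted snippet above, and simply points the static call back at the generic __static_call_return0 trampoline when no callback is registered, so the PMI path falls through to intel_pt_interrupt():

/*
 * Illustrative sketch, not from this series.  Assumes the
 * x86_guest_handle_intel_pt_intr static call from the snippet above.
 */
void x86_perf_update_guest_pt_intr(void)
{
	if (perf_guest_cbs && perf_guest_cbs->handle_intel_pt_intr)
		static_call_update(x86_guest_handle_intel_pt_intr,
				   perf_guest_cbs->handle_intel_pt_intr);
	else
		static_call_update(x86_guest_handle_intel_pt_intr,
				   (void *)&__static_call_return0);
}

Even with such a reset, an NMI that entered the old callback before the update still has to drain before the module can go away, which is the point Peter makes next.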
On Fri, Aug 27, 2021 at 02:52:25PM +0800, Like Xu wrote:
> + STATIC BRANCH/CALL friends.
>
> On 27/8/2021 8:57 am, Sean Christopherson wrote:
> > This started out as a small series[1] to fix a KVM bug related to Intel PT
> > interrupt handling and snowballed horribly.
> >
> > The main problem being addressed is that the perf_guest_cbs are shared by
> > all CPUs, can be nullified by KVM during module unload, and are not
> > protected against concurrent access from NMI context.
>
> Shouldn't this be a generic issue of the static_call() usage ?
>
> At the beginning, we set up the static entry assuming perf_guest_cbs != NULL:
>
> 	if (perf_guest_cbs && perf_guest_cbs->handle_intel_pt_intr) {
> 		static_call_update(x86_guest_handle_intel_pt_intr,
> 				   perf_guest_cbs->handle_intel_pt_intr);
> 	}
>
> and then we unset the perf_guest_cbs and do the static function call like this:
>
> 	DECLARE_STATIC_CALL(x86_guest_handle_intel_pt_intr,
> 			    *(perf_guest_cbs->handle_intel_pt_intr));
>
> 	static int handle_pmi_common(struct pt_regs *regs, u64 status)
> 	{
> 		...
> 		if (!static_call(x86_guest_handle_intel_pt_intr)())
> 			intel_pt_interrupt();
> 		...
> 	}

You just have to make sure all static_call() invocations that started
before unreg are finished before continuing with the unload.
synchronize_rcu() can help with that.

This is module unload 101. Nothing specific to static_call().
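To make that ordering concrete, here is a minimal sketch of the pattern Peter describes (the struct and function names are made up for illustration; they are not from the series):

#include <linux/rcupdate.h>

struct example_guest_cbs {
	unsigned int (*is_in_guest)(void);
};

static struct example_guest_cbs *example_cbs;

/* Called by the providing module on unload, before its code may vanish. */
static void example_unregister_guest_cbs(void)
{
	/* Step 1: no new caller will observe a non-NULL pointer. */
	WRITE_ONCE(example_cbs, NULL);

	/*
	 * Step 2: callers run with preemption disabled (NMI/IRQ context),
	 * so one RCU grace period guarantees that every invocation which
	 * loaded the old pointer (or old static_call target) has returned.
	 */
	synchronize_rcu();
}

The same ordering applies whether the callback is reached through a plain pointer or a static_call(); what matters is that the grace period only starts after new invocations have been cut off.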
On 27/8/2021 3:44 pm, Peter Zijlstra wrote:
> On Fri, Aug 27, 2021 at 02:52:25PM +0800, Like Xu wrote:
>> + STATIC BRANCH/CALL friends.
>>
>> On 27/8/2021 8:57 am, Sean Christopherson wrote:
>>> This started out as a small series[1] to fix a KVM bug related to Intel PT
>>> interrupt handling and snowballed horribly.
>>>
>>> The main problem being addressed is that the perf_guest_cbs are shared by
>>> all CPUs, can be nullified by KVM during module unload, and are not
>>> protected against concurrent access from NMI context.
>>
>> Shouldn't this be a generic issue of the static_call() usage ?
>>
>> At the beginning, we set up the static entry assuming perf_guest_cbs != NULL:
>>
>> 	if (perf_guest_cbs && perf_guest_cbs->handle_intel_pt_intr) {
>> 		static_call_update(x86_guest_handle_intel_pt_intr,
>> 				   perf_guest_cbs->handle_intel_pt_intr);
>> 	}
>>
>> and then we unset the perf_guest_cbs and do the static function call like this:
>>
>> 	DECLARE_STATIC_CALL(x86_guest_handle_intel_pt_intr,
>> 			    *(perf_guest_cbs->handle_intel_pt_intr));
>>
>> 	static int handle_pmi_common(struct pt_regs *regs, u64 status)
>> 	{
>> 		...
>> 		if (!static_call(x86_guest_handle_intel_pt_intr)())
>> 			intel_pt_interrupt();
>> 		...
>> 	}
>
> You just have to make sure all static_call() invocations that started
> before unreg are finished before continuing with the unload.
> synchronize_rcu() can help with that.

Do you mean something like that:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 64e310ff4f3a..e7d310af7509 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8465,6 +8465,7 @@ void kvm_arch_exit(void)
 #endif
 	kvm_lapic_exit();
 	perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
+	synchronize_rcu();

 	if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
 		cpufreq_unregister_notifier(&kvmclock_cpufreq_notifier_block,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index e466fc8176e1..63ae56c5d133 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6508,6 +6508,7 @@ EXPORT_SYMBOL_GPL(perf_register_guest_info_callbacks);
 int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
 {
 	perf_guest_cbs = NULL;
+	arch_perf_update_guest_cbs();
 	return 0;
 }
 EXPORT_SYMBOL_GPL(perf_unregister_guest_info_callbacks);

>
> This is module unload 101. Nothing specific to static_call().
>
On Fri, Aug 27, 2021 at 04:01:45PM +0800, Like Xu wrote:
> On 27/8/2021 3:44 pm, Peter Zijlstra wrote:
> > You just have to make sure all static_call() invocations that started
> > before unreg are finished before continuing with the unload.
> > synchronize_rcu() can help with that.
>
> Do you mean something like that:
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 64e310ff4f3a..e7d310af7509 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8465,6 +8465,7 @@ void kvm_arch_exit(void)
>  #endif
>  	kvm_lapic_exit();
>  	perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
> +	synchronize_rcu();
>
>  	if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
>  		cpufreq_unregister_notifier(&kvmclock_cpufreq_notifier_block,
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index e466fc8176e1..63ae56c5d133 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6508,6 +6508,7 @@ EXPORT_SYMBOL_GPL(perf_register_guest_info_callbacks);
>  int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
>  {
>  	perf_guest_cbs = NULL;
> +	arch_perf_update_guest_cbs();

I'm thinking the synchronize_rcu() should go here, and access to
perf_guest_cbs should be wrapped to yell when called with preemption
enabled. But yes..

>  	return 0;
>  }
>  EXPORT_SYMBOL_GPL(perf_unregister_guest_info_callbacks);
>
> >
> > This is module unload 101. Nothing specific to static_call().
> >
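A possible shape for the wrapper Peter is suggesting (the helper name and the specific lockdep assertion are guesses for illustration, not taken from the series): every reader of perf_guest_cbs goes through an accessor that warns when used from preemptible context, since the synchronize_rcu() in the unregister path only guarantees that non-preemptible readers have drained.

#include <linux/lockdep.h>
#include <linux/perf_event.h>

/* Hypothetical accessor; all users of perf_guest_cbs would go through it. */
static inline struct perf_guest_info_callbacks *perf_get_guest_cbs(void)
{
	/*
	 * Yell if called with preemption enabled: such a caller could still
	 * hold the callbacks pointer after the unregister side's
	 * synchronize_rcu() has returned and the providing module is gone.
	 */
	lockdep_assert_preemption_disabled();

	return READ_ONCE(perf_guest_cbs);
}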