Message ID | 20210827005718.585190-6-seanjc@google.com (mailing list archive) |
---|---|
State | New, archived |
Series | [01/15] KVM: x86: Register perf callbacks after calling vendor's hardware_setup() |
On Thu, Aug 26, 2021 at 05:57:08PM -0700, Sean Christopherson wrote:
> Use a per-CPU pointer to track perf's guest callbacks so that KVM can set
> the callbacks more precisely and avoid a lurking NULL pointer dereference.

I'm completely failing to see how per-cpu helps anything here...

> On x86, KVM supports being built as a module and thus can be unloaded.
> And because the shared callbacks are referenced from IRQ/NMI context,
> unloading KVM can run concurrently with perf, and thus all of perf's
> checks for a NULL perf_guest_cbs are flawed as perf_guest_cbs could be
> nullified between the check and dereference.

No longer allowing KVM to be a module would be *AWESOME*. I detest how
much we have to export for KVM :/

Still, what stops KVM from doing a coherent unreg? Even the
static_call() proposed in the other patch, unreg can do
static_call_update() + synchronize_rcu() to ensure everybody sees the
updated pointer (would require a quick audit to see all users are with
preempt disabled, but I think your using per-cpu here already imposes
the same).
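[Editor's note: a rough sketch of the "coherent unreg" being suggested here, assuming a static_call()-based hook. The names (__perf_guest_ip, perf_guest_get_ip, perf_register_guest_ip_cb, perf_unregister_guest_ip_cb) are illustrative stand-ins, not the series' actual API.]

	#include <linux/static_call.h>
	#include <linux/rcupdate.h>

	/* Default target for the "no hypervisor registered" case. */
	static unsigned long perf_no_guest_ip(void)
	{
		return 0;
	}

	DEFINE_STATIC_CALL(__perf_guest_ip, perf_no_guest_ip);

	unsigned long perf_guest_get_ip(void)
	{
		return static_call(__perf_guest_ip)();
	}

	void perf_register_guest_ip_cb(unsigned long (*get_ip)(void))
	{
		static_call_update(__perf_guest_ip, get_ip);
	}

	void perf_unregister_guest_ip_cb(void)
	{
		static_call_update(__perf_guest_ip, perf_no_guest_ip);

		/*
		 * All callers of perf_guest_get_ip() run with preemption
		 * disabled (IRQ/NMI context), so waiting out an RCU grace
		 * period guarantees no CPU is still executing KVM's callback
		 * once this returns and the module can be unloaded.
		 */
		synchronize_rcu();
	}

[A DEFINE_STATIC_CALL_RET0() variant would give the patched-in "return 0" for the non-KVM case; the sketch uses a plain default function only to keep it short.]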
On Fri, Aug 27, 2021, Peter Zijlstra wrote:
> On Thu, Aug 26, 2021 at 05:57:08PM -0700, Sean Christopherson wrote:
> > Use a per-CPU pointer to track perf's guest callbacks so that KVM can set
> > the callbacks more precisely and avoid a lurking NULL pointer dereference.
>
> I'm completely failing to see how per-cpu helps anything here...

It doesn't help until KVM is converted to set the per-cpu pointer in flows that
are protected against preemption, and more specifically when KVM only writes to
the pointer from the owning CPU.

> > On x86, KVM supports being built as a module and thus can be unloaded.
> > And because the shared callbacks are referenced from IRQ/NMI context,
> > unloading KVM can run concurrently with perf, and thus all of perf's
> > checks for a NULL perf_guest_cbs are flawed as perf_guest_cbs could be
> > nullified between the check and dereference.
>
> No longer allowing KVM to be a module would be *AWESOME*. I detest how
> much we have to export for KVM :/
>
> Still, what stops KVM from doing a coherent unreg? Even the
> static_call() proposed in the other patch, unreg can do
> static_call_update() + synchronize_rcu() to ensure everybody sees the
> updated pointer (would require a quick audit to see all users are with
> preempt disabled, but I think your using per-cpu here already imposes
> the same).

Ignoring static call for the moment, I don't see how the unreg side can be safe
using a bare single global pointer.  There is no way for KVM to prevent an NMI
from running in parallel on a different CPU.  If there's a more elegant solution,
especially something that can be backported, e.g. an rcu-protected pointer, I'm
all for it.  I went down the per-cpu path because it allowed for cleanups in KVM,
but similar cleanups can be done without per-cpu perf callbacks.

As for static calls, I certainly have no objection to employing static calls for
the callbacks, but IMO we should not be relying on static call for correctness,
i.e. the existing bug needs to be fixed first.
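[Editor's note: for illustration only — this is not what the posted patch does (it still writes the pointer on every CPU at registration) — a sketch of the end state described above, where KVM touches the per-CPU pointer strictly from the owning CPU with IRQs disabled. kvm_load_guest_perf_cbs/kvm_put_guest_perf_cbs are hypothetical names.]

	#include <linux/percpu.h>
	#include <linux/lockdep.h>
	#include <linux/perf_event.h>

	DECLARE_PER_CPU(struct perf_guest_info_callbacks *, perf_guest_cbs);

	static void kvm_load_guest_perf_cbs(struct perf_guest_info_callbacks *cbs)
	{
		/* Owning CPU only, with IRQs off, so the task can't migrate. */
		lockdep_assert_irqs_disabled();
		__this_cpu_write(perf_guest_cbs, cbs);
	}

	static void kvm_put_guest_perf_cbs(void)
	{
		lockdep_assert_irqs_disabled();
		__this_cpu_write(perf_guest_cbs, NULL);
	}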
On Fri, Aug 27, 2021 at 02:49:50PM +0000, Sean Christopherson wrote:
> On Fri, Aug 27, 2021, Peter Zijlstra wrote:
> > On Thu, Aug 26, 2021 at 05:57:08PM -0700, Sean Christopherson wrote:
> > > Use a per-CPU pointer to track perf's guest callbacks so that KVM can set
> > > the callbacks more precisely and avoid a lurking NULL pointer dereference.
> >
> > I'm completely failing to see how per-cpu helps anything here...
>
> It doesn't help until KVM is converted to set the per-cpu pointer in flows that
> are protected against preemption, and more specifically when KVM only writes to
> the pointer from the owning CPU.

So the 'problem' I have with this is that sane (!KVM using) people, will
still have to suffer that load, whereas with the static_call() we patch
in an 'xor %rax,%rax' and only have immediate code flow.

> Ignoring static call for the moment, I don't see how the unreg side can be safe
> using a bare single global pointer.  There is no way for KVM to prevent an NMI
> from running in parallel on a different CPU.  If there's a more elegant solution,
> especially something that can be backported, e.g. an rcu-protected pointer, I'm
> all for it.  I went down the per-cpu path because it allowed for cleanups in KVM,
> but similar cleanups can be done without per-cpu perf callbacks.

If all the perf_guest_cbs dereferences are with preemption disabled
(IRQs disabled, IRQ context, NMI context included), then the sequence:

	WRITE_ONCE(perf_guest_cbs, NULL);
	synchronize_rcu();

Ensures that all prior observers of perf_guest_cbs will have completed
and future observers must observe the NULL value.
On Fri, Aug 27, 2021, Peter Zijlstra wrote:
> On Fri, Aug 27, 2021 at 02:49:50PM +0000, Sean Christopherson wrote:
> > On Fri, Aug 27, 2021, Peter Zijlstra wrote:
> > > On Thu, Aug 26, 2021 at 05:57:08PM -0700, Sean Christopherson wrote:
> > > > Use a per-CPU pointer to track perf's guest callbacks so that KVM can set
> > > > the callbacks more precisely and avoid a lurking NULL pointer dereference.
> > >
> > > I'm completely failing to see how per-cpu helps anything here...
> >
> > It doesn't help until KVM is converted to set the per-cpu pointer in flows that
> > are protected against preemption, and more specifically when KVM only writes to
> > the pointer from the owning CPU.
>
> So the 'problem' I have with this is that sane (!KVM using) people, will
> still have to suffer that load, whereas with the static_call() we patch
> in an 'xor %rax,%rax' and only have immediate code flow.

Again, I've no objection to the static_call() approach.  I didn't even see the
patch until I had finished testing my series :-/

> > Ignoring static call for the moment, I don't see how the unreg side can be safe
> > using a bare single global pointer.  There is no way for KVM to prevent an NMI
> > from running in parallel on a different CPU.  If there's a more elegant solution,
> > especially something that can be backported, e.g. an rcu-protected pointer, I'm
> > all for it.  I went down the per-cpu path because it allowed for cleanups in KVM,
> > but similar cleanups can be done without per-cpu perf callbacks.
>
> If all the perf_guest_cbs dereferences are with preemption disabled
> (IRQs disabled, IRQ context, NMI context included), then the sequence:
>
> 	WRITE_ONCE(perf_guest_cbs, NULL);
> 	synchronize_rcu();
>
> Ensures that all prior observers of perf_guest_cbs will have completed
> and future observers must observe the NULL value.

That alone won't be sufficient, as the read side also needs to ensure it doesn't
reload perf_guest_cbs between NULL checks and dereferences.  But that's easy
enough to solve with a READ_ONCE and maybe a helper to make it more cumbersome
to use perf_guest_cbs directly.

How about this for a series?

  1. Use READ_ONCE/WRITE_ONCE + synchronize_rcu() to fix the underlying bug
  2. Fix KVM PT interrupt handler bug
  3. Kill off perf_guest_cbs usage in architectures that don't need the callbacks
  4. Replace ->is_in_guest()/->is_user_mode() with ->state(), and s/get_guest_ip/get_ip
  5. Implement static_call() support
  6. Cleanups, if there are any
  6..N KVM cleanups, e.g. to eliminate current_vcpu and share x86+arm64 callbacks
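[Editor's note: a minimal sketch of item 1 above, applied to the existing global perf_guest_cbs pointer (i.e. before any per-CPU or static_call conversion) and assuming every reader runs with preemption disabled; perf_get_guest_cbs() is a hypothetical helper name.]

	#include <linux/compiler.h>
	#include <linux/rcupdate.h>
	#include <linux/perf_event.h>

	static inline struct perf_guest_info_callbacks *perf_get_guest_cbs(void)
	{
		/*
		 * A single READ_ONCE() load guarantees the NULL check and the
		 * later dereferences in the caller act on the same snapshot,
		 * i.e. the compiler can't reload perf_guest_cbs in between.
		 */
		return READ_ONCE(perf_guest_cbs);
	}

	void perf_unregister_guest_info_callbacks(void)
	{
		WRITE_ONCE(perf_guest_cbs, NULL);

		/*
		 * Wait for all in-flight IRQ/NMI readers to finish with the
		 * old callbacks before allowing KVM to be unloaded.
		 */
		synchronize_rcu();
	}

[A reader such as perf_instruction_pointer() would then take one snapshot via the helper and use it for both the NULL check and the callback invocation.]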
diff --git a/arch/arm64/kernel/perf_callchain.c b/arch/arm64/kernel/perf_callchain.c
index 4a72c2727309..38555275c6a2 100644
--- a/arch/arm64/kernel/perf_callchain.c
+++ b/arch/arm64/kernel/perf_callchain.c
@@ -102,7 +102,9 @@ compat_user_backtrace(struct compat_frame_tail __user *tail,
 void perf_callchain_user(struct perf_callchain_entry_ctx *entry,
                          struct pt_regs *regs)
 {
-        if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+        struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs);
+
+        if (guest_cbs && guest_cbs->is_in_guest()) {
                 /* We don't support guest os callchain now */
                 return;
         }
@@ -147,9 +149,10 @@ static bool callchain_trace(void *data, unsigned long pc)
 void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry,
                            struct pt_regs *regs)
 {
+        struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs);
         struct stackframe frame;

-        if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+        if (guest_cbs && guest_cbs->is_in_guest()) {
                 /* We don't support guest os callchain now */
                 return;
         }
@@ -160,18 +163,21 @@ void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry,

 unsigned long perf_instruction_pointer(struct pt_regs *regs)
 {
-        if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
-                return perf_guest_cbs->get_guest_ip();
+        struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs);
+
+        if (guest_cbs && guest_cbs->is_in_guest())
+                return guest_cbs->get_guest_ip();

         return instruction_pointer(regs);
 }

 unsigned long perf_misc_flags(struct pt_regs *regs)
 {
+        struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs);
         int misc = 0;

-        if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
-                if (perf_guest_cbs->is_user_mode())
+        if (guest_cbs && guest_cbs->is_in_guest()) {
+                if (guest_cbs->is_user_mode())
                         misc |= PERF_RECORD_MISC_GUEST_USER;
                 else
                         misc |= PERF_RECORD_MISC_GUEST_KERNEL;
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 1eb45139fcc6..34155a52e498 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2761,10 +2761,11 @@ static bool perf_hw_regs(struct pt_regs *regs)
 void
 perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
 {
+        struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs);
         struct unwind_state state;
         unsigned long addr;

-        if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+        if (guest_cbs && guest_cbs->is_in_guest()) {
                 /* TODO: We don't support guest os callchain now */
                 return;
         }
@@ -2864,10 +2865,11 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent
 void
 perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
 {
+        struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs);
         struct stack_frame frame;
         const struct stack_frame __user *fp;

-        if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+        if (guest_cbs && guest_cbs->is_in_guest()) {
                 /* TODO: We don't support guest os callchain now */
                 return;
         }
@@ -2944,18 +2946,21 @@ static unsigned long code_segment_base(struct pt_regs *regs)

 unsigned long perf_instruction_pointer(struct pt_regs *regs)
 {
-        if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
-                return perf_guest_cbs->get_guest_ip();
+        struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs);
+
+        if (guest_cbs && guest_cbs->is_in_guest())
+                return guest_cbs->get_guest_ip();

         return regs->ip + code_segment_base(regs);
 }

 unsigned long perf_misc_flags(struct pt_regs *regs)
 {
+        struct perf_guest_info_callbacks *guest_cbs = this_cpu_read(perf_guest_cbs);
         int misc = 0;

-        if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
-                if (perf_guest_cbs->is_user_mode())
+        if (guest_cbs && guest_cbs->is_in_guest()) {
+                if (guest_cbs->is_user_mode())
                         misc |= PERF_RECORD_MISC_GUEST_USER;
                 else
                         misc |= PERF_RECORD_MISC_GUEST_KERNEL;
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index fca7a6e2242f..96001962c24d 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2784,6 +2784,7 @@ static void intel_pmu_reset(void)

 static int handle_pmi_common(struct pt_regs *regs, u64 status)
 {
+        struct perf_guest_info_callbacks *guest_cbs;
         struct perf_sample_data data;
         struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
         int bit;
@@ -2852,9 +2853,10 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
          */
         if (__test_and_clear_bit(GLOBAL_STATUS_TRACE_TOPAPMI_BIT, (unsigned long *)&status)) {
                 handled++;
-                if (unlikely(perf_guest_cbs && perf_guest_cbs->is_in_guest() &&
-                             perf_guest_cbs->handle_intel_pt_intr))
-                        perf_guest_cbs->handle_intel_pt_intr();
+                guest_cbs = this_cpu_read(perf_guest_cbs);
+                if (unlikely(guest_cbs && guest_cbs->is_in_guest() &&
+                             guest_cbs->handle_intel_pt_intr))
+                        guest_cbs->handle_intel_pt_intr();
                 else
                         intel_pt_interrupt();
         }
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 5eab690622ca..c98253dae037 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1237,7 +1237,7 @@ extern void perf_event_bpf_event(struct bpf_prog *prog,
                                  u16 flags);

 #ifdef CONFIG_HAVE_GUEST_PERF_EVENTS
-extern struct perf_guest_info_callbacks *perf_guest_cbs;
+DECLARE_PER_CPU(struct perf_guest_info_callbacks *, perf_guest_cbs);
 extern void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks);
 extern void perf_unregister_guest_info_callbacks(void);
 #endif /* CONFIG_HAVE_GUEST_PERF_EVENTS */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9820df7ee455..9bc1375d6ed9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6483,17 +6483,23 @@ static void perf_pending_event(struct irq_work *entry)
 }

 #ifdef CONFIG_HAVE_GUEST_PERF_EVENTS
-struct perf_guest_info_callbacks *perf_guest_cbs;
+DEFINE_PER_CPU(struct perf_guest_info_callbacks *, perf_guest_cbs);

 void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
 {
-        perf_guest_cbs = cbs;
+        int cpu;
+
+        for_each_possible_cpu(cpu)
+                per_cpu(perf_guest_cbs, cpu) = cbs;
 }
 EXPORT_SYMBOL_GPL(perf_register_guest_info_callbacks);

 void perf_unregister_guest_info_callbacks(void)
 {
-        perf_guest_cbs = NULL;
+        int cpu;
+
+        for_each_possible_cpu(cpu)
+                per_cpu(perf_guest_cbs, cpu) = NULL;
 }
 EXPORT_SYMBOL_GPL(perf_unregister_guest_info_callbacks);
 #endif
Use a per-CPU pointer to track perf's guest callbacks so that KVM can set
the callbacks more precisely and avoid a lurking NULL pointer dereference.

On x86, KVM supports being built as a module and thus can be unloaded.
And because the shared callbacks are referenced from IRQ/NMI context,
unloading KVM can run concurrently with perf, and thus all of perf's
checks for a NULL perf_guest_cbs are flawed as perf_guest_cbs could be
nullified between the check and dereference.

In practice, this has not been problematic because the callbacks are
always guarded with a "perf_guest_cbs && perf_guest_cbs->is_in_guest()"
pattern, and it's extremely unlikely the compiler will choose to reload
perf_guest_cbs in that particular sequence.  Because is_in_guest() is
obviously true only when KVM is running a guest, perf always wins the
race to the guarded code (which does often reload perf_guest_cbs) as
KVM has to stop running all guests and do a heavy teardown before
unloading.

Cc: Zhu Lingshan <lingshan.zhu@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/arm64/kernel/perf_callchain.c | 18 ++++++++++++------
 arch/x86/events/core.c             | 17 +++++++++++------
 arch/x86/events/intel/core.c       |  8 +++++---
 include/linux/perf_event.h         |  2 +-
 kernel/events/core.c               | 12 +++++++++---
 5 files changed, 38 insertions(+), 19 deletions(-)
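[Editor's note: to make the race described above concrete, this is the pre-patch x86 reader with the vulnerable window called out; the comment is an annotation added here, not part of the patch.]

	unsigned long perf_instruction_pointer(struct pt_regs *regs)
	{
		if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
			/*
			 * Nothing prevents the compiler from reloading
			 * perf_guest_cbs here; if KVM's module unload NULLs
			 * the pointer between the check above and this call,
			 * the dereference below faults on NULL.
			 */
			return perf_guest_cbs->get_guest_ip();

		return regs->ip + code_segment_base(regs);
	}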