Message ID: 20220825085625.867763-1-xiaoyao.li@intel.com (mailing list archive)
Series: KVM: VMX: Fix VM entry failure on PT_MODE_HOST_GUEST while host is using PT
On Thursday, August 25, 2022 4:56 PM, Xiaoyao Li wrote:
> There is one bug in KVM that hits a VM-entry failure 100% of the time on
> platforms supporting PT_MODE_HOST_GUEST mode, following the steps below:
>
> 1. #modprobe -r kvm_intel
> 2. #modprobe kvm_intel pt_mode=1
> 3. start a VM with QEMU
> 4. on host: #perf record -e intel_pt//
>
> The VM-entry failure happens because it violates the requirement stated in
> Intel SDM 26.2.1.1 VM-Execution Control Fields:
>
>   If the logical processor is operating with Intel PT enabled (if
>   IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the "load
>   IA32_RTIT_CTL" VM-entry control must be 0.
>
> In PT_MODE_HOST_GUEST mode, the "load IA32_RTIT_CTL" VM-entry control is
> always set. Thus KVM needs to ensure IA32_RTIT_CTL.TraceEn is 0 before
> VM-entry. Currently KVM manually does a WRMSR(IA32_RTIT_CTL) to clear the
> TraceEn bit. However, that doesn't work every time, since there is a
> possibility that IA32_RTIT_CTL.TraceEn is re-enabled in the PT PMI handler
> before VM-entry. This series tries to fix the issue by exposing two
> interfaces from the Intel PT driver for the purpose of stopping and
> resuming Intel PT on the host. That prevents the PT PMI handler from
> re-enabling PT. It also fixes another issue where the PT PMI handler
> touches PT MSRs, which makes the host state KVM has stored become stale.

I'm thinking about another approach to fixing it. I think we need to have the
running host PT event disabled when we switch to the guest, and not expect to
receive the host PT interrupt at that point. Also, the host PT context can be
saved/restored by the host perf core (instead of KVM) when we disable/enable
the event.
diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index 82ef87e9a897..1d3e03ecaf6a 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -1575,6 +1575,7 @@ static void pt_event_start(struct perf_event *event, int mode)
 
 	pt_config_buffer(buf);
 	pt_config(event);
+	pt->event = event;
 
 	return;
 
@@ -1600,6 +1601,7 @@ static void pt_event_stop(struct perf_event *event, int mode)
 		return;
 
 	event->hw.state = PERF_HES_STOPPED;
+	pt->event = NULL;
 
 	if (mode & PERF_EF_UPDATE) {
 		struct pt_buffer *buf = perf_get_aux(&pt->handle);
@@ -1624,6 +1626,15 @@ static void pt_event_stop(struct perf_event *event, int mode)
 	}
 }
 
+
+struct perf_event *pt_get_curr_event(void)
+{
+	struct pt *pt = this_cpu_ptr(&pt_ctx);
+
+	return pt->event;
+}
+EXPORT_SYMBOL_GPL(pt_get_curr_event);
+
 static long pt_event_snapshot_aux(struct perf_event *event,
 				  struct perf_output_handle *handle,
 				  unsigned long size)
diff --git a/arch/x86/events/intel/pt.h b/arch/x86/events/intel/pt.h
index 96906a62aacd..d46a85bb06bb 100644
--- a/arch/x86/events/intel/pt.h
+++ b/arch/x86/events/intel/pt.h
@@ -121,6 +121,7 @@ struct pt_filters {
  * @output_mask:	cached RTIT_OUTPUT_MASK MSR value
  */
 struct pt {
+	struct perf_event	*event;
 	struct perf_output_handle handle;
 	struct pt_filters	filters;
 	int			handle_nmi;
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index f6fc8dd51ef4..be8dd24922a7 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -553,11 +553,14 @@ static inline int x86_perf_get_lbr(struct x86_pmu_lbr *lbr)
 
 #ifdef CONFIG_CPU_SUP_INTEL
  extern void intel_pt_handle_vmx(int on);
+ extern struct perf_event *pt_get_curr_event(void);
 #else
 static inline void intel_pt_handle_vmx(int on)
 {
 
+
 }
+struct perf_event *pt_get_curr_event(void) { }
 #endif
 
 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_AMD)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d7f8331d6f7e..195debc1bff1 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1125,37 +1125,29 @@ static inline void pt_save_msr(struct pt_ctx *ctx, u32 addr_range)
 
 static void pt_guest_enter(struct vcpu_vmx *vmx)
 {
-	if (vmx_pt_mode_is_system())
+	struct perf_event *event;
+
+	if (vmx_pt_mode_is_system() ||
+	    !(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN))
 		return;
 
-	/*
-	 * GUEST_IA32_RTIT_CTL is already set in the VMCS.
-	 * Save host state before VM entry.
-	 */
-	rdmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
-	if (vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN) {
-		wrmsrl(MSR_IA32_RTIT_CTL, 0);
-		pt_save_msr(&vmx->pt_desc.host, vmx->pt_desc.num_address_ranges);
-		pt_load_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
-	}
+	event = pt_get_curr_event();
+	perf_event_disable(event);
+	vmx->pt_desc.host_event = event;
+	pt_load_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
 }
 
 static void pt_guest_exit(struct vcpu_vmx *vmx)
 {
-	if (vmx_pt_mode_is_system())
-		return;
+	struct perf_event *event = vmx->pt_desc.host_event;
 
-	if (vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN) {
-		pt_save_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
-		pt_load_msr(&vmx->pt_desc.host, vmx->pt_desc.num_address_ranges);
-	}
+	if (vmx_pt_mode_is_system() ||
+	    !(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN))
+		return;
 
-	/*
-	 * KVM requires VM_EXIT_CLEAR_IA32_RTIT_CTL to expose PT to the guest,
-	 * i.e. RTIT_CTL is always cleared on VM-Exit.  Restore it if necessary.
-	 */
-	if (vmx->pt_desc.host.ctl)
-		wrmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
+	pt_save_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
+	if (event)
+		perf_event_enable(event);
 }
 
 void vmx_set_host_fs_gs(struct vmcs_host_state *host, u16 fs_sel, u16 gs_sel,
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 24d58c2ffaa3..4c20bdabc85b 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -66,7 +66,7 @@ struct pt_desc {
 	u64 ctl_bitmask;
 	u32 num_address_ranges;
 	u32 caps[PT_CPUID_REGS_NUM * PT_CPUID_LEAVES];
-	struct pt_ctx host;
+	struct perf_event *host_event;
 	struct pt_ctx guest;
 };
On Mon, Aug 29, 2022, Wang, Wei W wrote:
> On Thursday, August 25, 2022 4:56 PM, Xiaoyao Li wrote:
> [...]
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index d7f8331d6f7e..195debc1bff1 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1125,37 +1125,29 @@ static inline void pt_save_msr(struct pt_ctx *ctx, u32 addr_range)
>
> static void pt_guest_enter(struct vcpu_vmx *vmx)
> {
> -	if (vmx_pt_mode_is_system())
> +	struct perf_event *event;
> +
> +	if (vmx_pt_mode_is_system() ||
> +	    !(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN))

I don't think the host should trace the guest in the host/guest mode just
because the guest isn't tracing itself.  I.e. the host still needs to turn off
its own tracing.

> 		return;
>
> -	/*
> -	 * GUEST_IA32_RTIT_CTL is already set in the VMCS.
> -	 * Save host state before VM entry.
> -	 */
> -	rdmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
> -	if (vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN) {
> -		wrmsrl(MSR_IA32_RTIT_CTL, 0);
> -		pt_save_msr(&vmx->pt_desc.host, vmx->pt_desc.num_address_ranges);
> -		pt_load_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
> -	}
> +	event = pt_get_curr_event();
> +	perf_event_disable(event);
> +	vmx->pt_desc.host_event = event;

This is effectively what I suggested[*], the main difference being that my
version adds dedicated enter/exit helpers so that perf can skip save/restore
of the other MSRs.  It's easy to extend if perf needs to hand back an event
to complete the "exit":

	bool guest_trace_enabled = vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN;

	vmx->pt_desc.host_event = intel_pt_guest_enter(guest_trace_enabled);

and then on exit

	bool guest_trace_enabled = vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN;

	intel_pt_guest_exit(vmx->pt_desc.host_event, guest_trace_enabled);

[*] https://lore.kernel.org/all/YwecducnM%2FU6tqJT@google.com

> +	pt_load_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
> }
On Tuesday, August 30, 2022 1:34 AM, Sean Christopherson wrote:
> On Mon, Aug 29, 2022, Wang, Wei W wrote:
> [...]
>
> I don't think the host should trace the guest in the host/guest mode just
> because the guest isn't tracing itself.  I.e. the host still needs to turn
> off its own tracing.

Right, need to fix this one.

> This is effectively what I suggested[*], the main difference being that my
> version adds dedicated enter/exit helpers so that perf can skip
> save/restore of the other MSRs.

What "other MSRs" were you referring to? (I suppose you meant
perf_event_disable() needs to save more MSRs.)

> It's easy to extend if perf needs to hand back an event to complete the
> "exit":
>
> 	bool guest_trace_enabled = vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN;
>
> 	vmx->pt_desc.host_event = intel_pt_guest_enter(guest_trace_enabled);
>
> and then on exit
>
> 	bool guest_trace_enabled = vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN;
>
> 	intel_pt_guest_exit(vmx->pt_desc.host_event, guest_trace_enabled);
>
> [*] https://lore.kernel.org/all/YwecducnM%2FU6tqJT@google.com

Yes, this can work. But I feel it somewhat violates the general rule I have
taken from previous experience: KVM should be a user of the perf subsystem,
instead of implementing a secondary driver beyond perf's management. Being a
user of perf means everything possible should go through a "perf event",
which is the interface that perf exposes to users.
On 8/29/2022 3:49 PM, Wang, Wei W wrote:
> On Thursday, August 25, 2022 4:56 PM, Xiaoyao Li wrote:
>> [...]
>
> I'm thinking about another approach to fixing it. I think we need to have
> the running host PT event disabled when we switch to the guest, and not
> expect to receive the host PT interrupt at that point. Also, the host PT
> context can be saved/restored by the host perf core (instead of KVM) when
> we disable/enable the event.
>
> [...]
> +struct perf_event *pt_get_curr_event(void)
> +{
> +	struct pt *pt = this_cpu_ptr(&pt_ctx);

Wei,

I'm not sure if we can use pt->handle.event instead or not.

> +	return pt->event;
> +}
> +EXPORT_SYMBOL_GPL(pt_get_curr_event);
> [...]
On Thursday, September 8, 2022 3:26 PM, Li, Xiaoyao wrote:
> > +
> > +struct perf_event *pt_get_curr_event(void)
> > +{
> > +	struct pt *pt = this_cpu_ptr(&pt_ctx);
>
> Wei,
>
> I'm not sure if we can use pt->handle.event instead or not.
>
> > +	return pt->event;

Yes, I think we could reuse that.
On 8/29/2022 3:49 PM, Wang, Wei W wrote:
> On Thursday, August 25, 2022 4:56 PM, Xiaoyao Li wrote:
>> [...]
>
> I'm thinking about another approach to fixing it. I think we need to have
> the running host PT event disabled when we switch to the guest, and not
> expect to receive the host PT interrupt at that point. Also, the host PT
> context can be saved/restored by the host perf core (instead of KVM) when
> we disable/enable the event.
>
> [...]
> static void pt_guest_enter(struct vcpu_vmx *vmx)
> {
> -	if (vmx_pt_mode_is_system())
> +	struct perf_event *event;
> +
> +	if (vmx_pt_mode_is_system() ||
> +	    !(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN))
> 		return;
> [...]
> +	event = pt_get_curr_event();
> +	perf_event_disable(event);

perf_event_disable() is not allowed in a preemption-disabled context, since

perf_event_disable()
-> perf_event_ctx_lock()
-> perf_event_ctx_lock_nested()
-> mutex_lock_nested()

and it causes

[ 3542.164553] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580
[ 3542.165140] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 1518, name: CPU 0/KVM
[ 3542.165703] preempt_count: 1, expected: 0
[ 3542.166006] RCU nest depth: 0, expected: 0
[ 3542.166315] INFO: lockdep is turned off.
[ 3542.166614] irq event stamp: 0
[ 3542.166857] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[ 3542.167304] hardirqs last disabled at (0): [<ffffffff94699ac8>] copy_process+0x8e8/0x1bd0
[ 3542.167874] softirqs last  enabled at (0): [<ffffffff94699ac8>] copy_process+0x8e8/0x1bd0
[ 3542.168443] softirqs last disabled at (0): [<0000000000000000>] 0x0
[ 3542.168891] Preemption disabled at:
[ 3542.168893] [<ffffffffc0c28f29>] vcpu_enter_guest+0x139/0x1350 [kvm]
[ 3542.169674] CPU: 20 PID: 1518 Comm: CPU 0/KVM Not tainted 6.0.0-rc5-fix-pt-vm-entry+ #3 f2d44ed9be3fc4a510291e2989c9432fce3cb5de
[ 3542.170457] Hardware name: Intel Corporation JACOBSVILLE/JACOBSVILLE, BIOS JBVLCRB1.86B.0012.D75.1912120439 12/12/2019
[ 3542.171188] Call Trace:
[ 3542.171392]  <TASK>
[ 3542.171572]  show_stack+0x52/0x5c
[ 3542.171831]  dump_stack_lvl+0x5b/0x86
[ 3542.172112]  dump_stack+0x10/0x16
[ 3542.172371]  __might_resched.cold+0x135/0x15b
[ 3542.172698]  __might_sleep+0x52/0xa0
[ 3542.172975]  __mutex_lock+0x4e/0x4d0
[ 3542.173251]  ? kvm_sched_in+0x4f/0x60 [kvm 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
[ 3542.173839]  ? perf_event_ctx_lock_nested+0xc8/0x230
[ 3542.174202]  ? rcu_read_lock_sched_held+0x16/0x90
[ 3542.174551]  ? lock_acquire+0xfc/0x150
[ 3542.174840]  ? perf_event_ctx_lock_nested+0x24/0x230
[ 3542.175205]  mutex_lock_nested+0x1c/0x30
[ 3542.175505]  perf_event_ctx_lock_nested+0xc8/0x230
[ 3542.181147]  ? perf_event_ctx_lock_nested+0x24/0x230
[ 3542.186839]  perf_event_disable+0x19/0x80
[ 3542.192502]  vmx_vcpu_run+0x3e5/0xfe0 [kvm_intel 7936a7891efe9306918aa504b0eb8bc1e7ba3aa6]
[ 3542.203771]  ? rcu_read_lock_sched_held+0x16/0x90
[ 3542.209378]  vcpu_enter_guest+0xa96/0x1350 [kvm 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
[ 3542.215218]  ? vcpu_enter_guest+0xbe1/0x1350 [kvm 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
[ 3542.226292]  ? rcu_read_lock_sched_held+0x16/0x90
[ 3542.231956]  ? rcu_read_lock_sched_held+0x16/0x90
[ 3542.237542]  ? lock_acquire+0xfc/0x150
[ 3542.243093]  ? __rseq_handle_notify_resume+0x3a/0x60
[ 3542.248689]  vcpu_run+0x53/0x490 [kvm 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
[ 3542.254533]  ? vcpu_run+0x35a/0x490 [kvm 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
[ 3542.260567]  kvm_arch_vcpu_ioctl_run+0x162/0x680 [kvm 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
[ 3542.272395]  ? kvm_arch_vcpu_ioctl_run+0x6d/0x680 [kvm 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
[ 3542.284586]  kvm_vcpu_ioctl+0x2ad/0x7a0 [kvm 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
[ 3542.290973]  ? lock_acquire+0xfc/0x150
[ 3542.296990]  ? rcu_read_lock_sched_held+0x16/0x90
[ 3542.302912]  ? lock_release+0x118/0x190
[ 3542.308800]  ? __fget_files+0xe8/0x1c0
[ 3542.314710]  ? __fget_files+0x5/0x1c0
[ 3542.320591]  __x64_sys_ioctl+0x96/0xd0
[ 3542.326500]  do_syscall_64+0x3a/0x90
[ 3542.332426]  entry_SYSCALL_64_after_hwframe+0x5e/0xc8

I know little about perf. It seems perf_event_disable() is not widely used by
other kernel components. Is there an alternative? If not, I think exposing
disable/enable helpers from the PT driver, as this series does, seems OK.

> +	vmx->pt_desc.host_event = event;
> +	pt_load_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
> }
> [...]
On Wednesday, September 14, 2022 12:16 PM, Li, Xiaoyao wrote:
> On 8/29/2022 3:49 PM, Wang, Wei W wrote:
>> [...]
>
> perf_event_disable() is not allowed in a preemption-disabled context, since
>
> perf_event_disable()
> -> perf_event_ctx_lock()
> -> perf_event_ctx_lock_nested()
> -> mutex_lock_nested()
>
> and it causes
>
> [ 3542.164553] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580
> [...]
>
> I know little about perf.

+Kan to double confirm if needed.

> It seems perf_event_disable() is not widely used by other kernel
> components. Is there an alternative? If not, I think exposing
> disable/enable helpers from the PT driver, as this series does, seems OK.

Oops, it was actually a mistake to disable the event on other CPUs. Can you
expose and use perf_event_disable_local()? For the enabling side, we need to
add and expose perf_event_enable_local():

	event_function_local(event, __perf_event_enable, NULL);
On 2022-09-14 2:16 a.m., Wang, Wei W wrote:
> On Wednesday, September 14, 2022 12:16 PM, Li, Xiaoyao wrote:
>> On 8/29/2022 3:49 PM, Wang, Wei W wrote:
>>> On Thursday, August 25, 2022 4:56 PM, Xiaoyao Li wrote:
>>>> There is one bug in KVM that hits vm-entry failure 100% on platforms
>>>> supporting PT_MODE_HOST_GUEST mode, following the steps below:
>>>>
>>>> 1. #modprobe -r kvm_intel
>>>> 2. #modprobe kvm_intel pt_mode=1
>>>> 3. start a VM with QEMU
>>>> 4. on host: #perf record -e intel_pt//
>>>>
>>>> The vm-entry failure happens because it violates the requirement
>>>> stated in Intel SDM 26.2.1.1 VM-Execution Control Fields:
>>>>
>>>>   If the logical processor is operating with Intel PT enabled (if
>>>>   IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the "load
>>>>   IA32_RTIT_CTL" VM-entry control must be 0.
>>>>
>>>> In PT_MODE_HOST_GUEST mode, the "load IA32_RTIT_CTL" VM-entry
>>>> control is always set. Thus KVM needs to ensure IA32_RTIT_CTL.TraceEn
>>>> is 0 before VM-entry. Currently KVM does a manual WRMSR(IA32_RTIT_CTL)
>>>> to clear the TraceEn bit. However, it doesn't work every time, since
>>>> there is a possibility that IA32_RTIT_CTL.TraceEn is re-enabled in
>>>> the PT PMI handler before vm-entry. This series tries to fix the
>>>> issue by exposing two interfaces from the Intel PT driver for the
>>>> purpose of stopping and resuming Intel PT on the host. It prevents
>>>> the PT PMI handler from re-enabling PT. It also fixes another issue:
>>>> the PT PMI handler touches PT MSRs, which makes what KVM stores for
>>>> the host become stale.
>>>
>>> I'm thinking about another approach to fixing it. I think we need to
>>> have the running host pt event disabled when we switch to the guest,
>>> as we don't expect to receive the host pt interrupt at this point.
>>> Also, the host pt context can be saved/restored by the host perf core
>>> (instead of KVM) when we disable/enable the event.
>>>
>>> diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
>>> index 82ef87e9a897..1d3e03ecaf6a 100644
>>> --- a/arch/x86/events/intel/pt.c
>>> +++ b/arch/x86/events/intel/pt.c
>>> @@ -1575,6 +1575,7 @@ static void pt_event_start(struct perf_event *event, int mode)
>>>
>>>  	pt_config_buffer(buf);
>>>  	pt_config(event);
>>> +	pt->event = event;
>>>
>>>  	return;
>>>
>>> @@ -1600,6 +1601,7 @@ static void pt_event_stop(struct perf_event *event, int mode)
>>>  		return;
>>>
>>>  	event->hw.state = PERF_HES_STOPPED;
>>> +	pt->event = NULL;
>>>
>>>  	if (mode & PERF_EF_UPDATE) {
>>>  		struct pt_buffer *buf = perf_get_aux(&pt->handle);
>>> @@ -1624,6 +1626,15 @@ static void pt_event_stop(struct perf_event *event, int mode)
>>>  	}
>>>  }
>>>
>>> +
>>> +struct perf_event *pt_get_curr_event(void)
>>> +{
>>> +	struct pt *pt = this_cpu_ptr(&pt_ctx);
>>> +
>>> +	return pt->event;
>>> +}
>>> +EXPORT_SYMBOL_GPL(pt_get_curr_event);
>>> +
>>>  static long pt_event_snapshot_aux(struct perf_event *event,
>>>  				  struct perf_output_handle *handle,
>>>  				  unsigned long size)
>>> diff --git a/arch/x86/events/intel/pt.h b/arch/x86/events/intel/pt.h
>>> index 96906a62aacd..d46a85bb06bb 100644
>>> --- a/arch/x86/events/intel/pt.h
>>> +++ b/arch/x86/events/intel/pt.h
>>> @@ -121,6 +121,7 @@ struct pt_filters {
>>>   * @output_mask:	cached RTIT_OUTPUT_MASK MSR value
>>>   */
>>>  struct pt {
>>> +	struct perf_event	*event;
>>>  	struct perf_output_handle handle;
>>>  	struct pt_filters	filters;
>>>  	int			handle_nmi;
>>> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
>>> index f6fc8dd51ef4..be8dd24922a7 100644
>>> --- a/arch/x86/include/asm/perf_event.h
>>> +++ b/arch/x86/include/asm/perf_event.h
>>> @@ -553,11 +553,14 @@ static inline int x86_perf_get_lbr(struct x86_pmu_lbr *lbr)
>>>
>>>  #ifdef CONFIG_CPU_SUP_INTEL
>>>  extern void intel_pt_handle_vmx(int on);
>>> + extern struct perf_event *pt_get_curr_event(void);
>>>  #else
>>>  static inline void intel_pt_handle_vmx(int on)
>>>  {
>>>
>>> +
>>>  }
>>> +struct perf_event *pt_get_curr_event(void) { }
>>>  #endif
>>>
>>>  #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_AMD)
>>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>>> index d7f8331d6f7e..195debc1bff1 100644
>>> --- a/arch/x86/kvm/vmx/vmx.c
>>> +++ b/arch/x86/kvm/vmx/vmx.c
>>> @@ -1125,37 +1125,29 @@ static inline void pt_save_msr(struct pt_ctx *ctx, u32 addr_range)
>>>
>>>  static void pt_guest_enter(struct vcpu_vmx *vmx)
>>>  {
>>> -	if (vmx_pt_mode_is_system())
>>> +	struct perf_event *event;
>>> +
>>> +	if (vmx_pt_mode_is_system() ||
>>> +	    !(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN))
>>>  		return;
>>>
>>> -	/*
>>> -	 * GUEST_IA32_RTIT_CTL is already set in the VMCS.
>>> -	 * Save host state before VM entry.
>>> -	 */
>>> -	rdmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
>>> -	if (vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN) {
>>> -		wrmsrl(MSR_IA32_RTIT_CTL, 0);
>>> -		pt_save_msr(&vmx->pt_desc.host, vmx->pt_desc.num_address_ranges);
>>> -		pt_load_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
>>> -	}
>>> +	event = pt_get_curr_event();
>>> +	perf_event_disable(event);
>>
>> perf_event_disable() is not allowed in preemption-disabled context, since
>>
>> perf_event_disable()
>>   -> perf_event_ctx_lock()
>>     -> perf_event_ctx_lock_nested()
>>       -> mutex_lock_nested()
>>
>> and it causes
>>
>> [ 3542.164553] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580
>> [ 3542.165140] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 1518, name: CPU 0/KVM
>> [ 3542.165703] preempt_count: 1, expected: 0
>> [ 3542.166006] RCU nest depth: 0, expected: 0
>> [ 3542.166315] INFO: lockdep is turned off.
>> [ 3542.166614] irq event stamp: 0
>> [ 3542.166857] hardirqs last enabled at (0): [<0000000000000000>] 0x0
>> [ 3542.167304] hardirqs last disabled at (0): [<ffffffff94699ac8>] copy_process+0x8e8/0x1bd0
>> [ 3542.167874] softirqs last enabled at (0): [<ffffffff94699ac8>] copy_process+0x8e8/0x1bd0
>> [ 3542.168443] softirqs last disabled at (0): [<0000000000000000>] 0x0
>> [ 3542.168891] Preemption disabled at:
>> [ 3542.168893] [<ffffffffc0c28f29>] vcpu_enter_guest+0x139/0x1350 [kvm]
>> [ 3542.169674] CPU: 20 PID: 1518 Comm: CPU 0/KVM Not tainted 6.0.0-rc5-fix-pt-vm-entry+ #3 f2d44ed9be3fc4a510291e2989c9432fce3cb5de
>> [ 3542.170457] Hardware name: Intel Corporation JACOBSVILLE/JACOBSVILLE, BIOS JBVLCRB1.86B.0012.D75.1912120439 12/12/2019
>> [ 3542.171188] Call Trace:
>> [ 3542.171392]  <TASK>
>> [ 3542.171572]  show_stack+0x52/0x5c
>> [ 3542.171831]  dump_stack_lvl+0x5b/0x86
>> [ 3542.172112]  dump_stack+0x10/0x16
>> [ 3542.172371]  __might_resched.cold+0x135/0x15b
>> [ 3542.172698]  __might_sleep+0x52/0xa0
>> [ 3542.172975]  __mutex_lock+0x4e/0x4d0
>> [ 3542.173251]  ? kvm_sched_in+0x4f/0x60 [kvm 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
>> [ 3542.173839]  ? perf_event_ctx_lock_nested+0xc8/0x230
>> [ 3542.174202]  ? rcu_read_lock_sched_held+0x16/0x90
>> [ 3542.174551]  ? lock_acquire+0xfc/0x150
>> [ 3542.174840]  ? perf_event_ctx_lock_nested+0x24/0x230
>> [ 3542.175205]  mutex_lock_nested+0x1c/0x30
>> [ 3542.175505]  perf_event_ctx_lock_nested+0xc8/0x230
>> [ 3542.181147]  ? perf_event_ctx_lock_nested+0x24/0x230
>> [ 3542.186839]  perf_event_disable+0x19/0x80
>> [ 3542.192502]  vmx_vcpu_run+0x3e5/0xfe0 [kvm_intel 7936a7891efe9306918aa504b0eb8bc1e7ba3aa6]
>> [ 3542.203771]  ? rcu_read_lock_sched_held+0x16/0x90
>> [ 3542.209378]  vcpu_enter_guest+0xa96/0x1350 [kvm 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
>> [ 3542.215218]  ? vcpu_enter_guest+0xbe1/0x1350 [kvm 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
>> [ 3542.226292]  ? rcu_read_lock_sched_held+0x16/0x90
>> [ 3542.231956]  ? rcu_read_lock_sched_held+0x16/0x90
>> [ 3542.237542]  ? lock_acquire+0xfc/0x150
>> [ 3542.243093]  ? __rseq_handle_notify_resume+0x3a/0x60
>> [ 3542.248689]  vcpu_run+0x53/0x490 [kvm 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
>> [ 3542.254533]  ? vcpu_run+0x35a/0x490 [kvm 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
>> [ 3542.260567]  kvm_arch_vcpu_ioctl_run+0x162/0x680 [kvm 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
>> [ 3542.272395]  ? kvm_arch_vcpu_ioctl_run+0x6d/0x680 [kvm 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
>> [ 3542.284586]  kvm_vcpu_ioctl+0x2ad/0x7a0 [kvm 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
>> [ 3542.290973]  ? lock_acquire+0xfc/0x150
>> [ 3542.296990]  ? rcu_read_lock_sched_held+0x16/0x90
>> [ 3542.302912]  ? lock_release+0x118/0x190
>> [ 3542.308800]  ? __fget_files+0xe8/0x1c0
>> [ 3542.314710]  ? __fget_files+0x5/0x1c0
>> [ 3542.320591]  __x64_sys_ioctl+0x96/0xd0
>> [ 3542.326500]  do_syscall_64+0x3a/0x90
>> [ 3542.332426]  entry_SYSCALL_64_after_hwframe+0x5e/0xc8
>>
>>
>> I know little about perf.
>
> +Kan to double confirm if needed.

The perf_event_disable() eventually invokes intel_pt_stop().
We already expose intel_pt_stop()/cpu_emergency_stop_pt() to other
modules. I don't think we have to use perf_event_disable(). Also,
perf_event_disable() requires extra code.

I went through the discussions. I agree with Sean's suggestion.
We should keep only the logic in KVM and move all the MSR access
details into the PT driver.

But I prefer a more generic and straightforward function name, e.g.,
intel_pt_stop_save()/intel_pt_start_load(), in case other modules want
to save/restore the PT information in their context switches later.

Thanks,
Kan

>
>> It seems perf_event_disable() is not used widely by
>> other kernel components. Is there an alternative? If not, I think exposing
>> disable/enable helpers from the pt driver like this series does seems OK.
>
> Oops, it was actually a mistake to disable the event on other CPUs.
> Can you expose and use perf_event_disable_local()?
>
> For the enabling side, we need to add and expose perf_event_enable_local():
>
> 	event_function_local(event, __perf_event_enable, NULL);
On Thursday, September 15, 2022 4:26 AM, Liang, Kan wrote:
> The perf_event_disable() eventually invokes intel_pt_stop().
> We already expose intel_pt_stop()/cpu_emergency_stop_pt() to other
> modules. I don't think we have to use perf_event_disable(). Also,
> perf_event_disable() requires extra code.
>
> I went through the discussions. I agree with Sean's suggestion.
> We should keep only the logic in KVM and move all the MSR access
> details into the PT driver.

Even the driver itself doesn't drive the save/restore of the MSRs; it is
driven by perf.

1. If we make KVM a user of perf, we should do this via
   perf_event_disable/enable_*.
2. If we make KVM an alternative to perf (i.e. have direct control over
   the PMU HW), we can do this via driver interfaces, like perf does.

Per my experience, we should go for 1. We probably need Peter's opinions
on this.

> But I prefer a more generic and straightforward function name, e.g.,
> intel_pt_stop_save()/intel_pt_start_load(), in case other modules want
> to save/restore the PT information in their context switches later.
>
> Thanks,
> Kan
>
>>
>>> It seems perf_event_disable() is not used widely by
>>> other kernel components.

Because there are not lots of kernel users.
You can check another user, watchdog_hld.c; perf_event_enable/disable
are used there.
On 2022-09-14 10:46 p.m., Wang, Wei W wrote:
> On Thursday, September 15, 2022 4:26 AM, Liang, Kan wrote:
>> The perf_event_disable() eventually invokes intel_pt_stop().
>> We already expose intel_pt_stop()/cpu_emergency_stop_pt() to other
>> modules. I don't think we have to use perf_event_disable(). Also,
>> perf_event_disable() requires extra code.
>>
>> I went through the discussions. I agree with Sean's suggestion.
>> We should keep only the logic in KVM and move all the MSR access
>> details into the PT driver.
>
> Even the driver itself doesn't drive the save/restore of the MSRs; it
> is driven by perf.

It goes through perf_event, but it is not driven by perf_event. The
perf_event generic code never knows when it should invoke each driver to
save/restore information. That has to be driven by another subsystem,
e.g., the scheduler. For this case, KVM should drive the save/restore,
and the PT driver eventually does all the MSR access details.

> 1. If we make KVM a user of perf, we should do this via
>    perf_event_disable/enable_*.
> 2. If we make KVM an alternative to perf (i.e. have direct control
>    over the PMU HW), we can do this via driver interfaces, like perf does.
>
> Per my experience, we should go for 1. We probably need Peter's
> opinions on this.

For 1, the perf_event_disable/enable_* are not enough. They don't
save/restore MSRs. If we go this way, we have to introduce a new generic
interface to ask each driver to save/restore its MSRs when the guest is
entering/exiting. We'd better combine the new interface with the
existing perf_guest_get_msrs() of the core driver.
I think that's an ideal solution, but it requires big changes in the code.

2 is the current KVM implementation. See pt_save_msr()/pt_load_msr().
I don't think it's the right way. We'd better fix it.

The suggestion should be 3: KVM notifies the PT driver via an interface
provided by PT, and the PT driver saves/restores all the registers.
I think it's an acceptable solution with small code changes. So I
prefer 3.

Thanks,
Kan

>> But I prefer a more generic and straightforward function name, e.g.,
>> intel_pt_stop_save()/intel_pt_start_load(), in case other modules want
>> to save/restore the PT information in their context switches later.
>>
>> Thanks,
>> Kan
>>
>>>
>>>> It seems perf_event_disable() is not used widely by other kernel
>>>> components.
>
> Because there are not lots of kernel users.
> You can check another user, watchdog_hld.c; perf_event_enable/disable
> are used there.
On Thursday, September 15, 2022 9:55 PM, Liang, Kan wrote:
> On 2022-09-14 10:46 p.m., Wang, Wei W wrote:
>> On Thursday, September 15, 2022 4:26 AM, Liang, Kan wrote:
>>> The perf_event_disable() eventually invokes intel_pt_stop().
>>> We already expose intel_pt_stop()/cpu_emergency_stop_pt() to other
>>> modules. I don't think we have to use perf_event_disable(). Also,
>>> perf_event_disable() requires extra code.
>>>
>>> I went through the discussions. I agree with Sean's suggestion.
>>> We should keep only the logic in KVM and move all the MSR access
>>> details into the PT driver.
>>
>> Even the driver itself doesn't drive the save/restore of the MSRs;
>> it is driven by perf.
>
> It goes through perf_event, but it is not driven by perf_event. The
> perf_event generic code never knows when it should invoke each driver
> to save/restore information. That has to be driven by another
> subsystem, e.g., the scheduler.

Yes. The cpu scheduler does this via the perf subsystem, though.

> For this case, KVM should drive the save/restore, and the PT driver
> eventually does all the MSR access details.
>
>> 1. If we make KVM a user of perf, we should do this via
>>    perf_event_disable/enable_*.
>> 2. If we make KVM an alternative to perf (i.e. have direct control
>>    over the PMU HW), we can do this via driver interfaces, like perf does.
>>
>> Per my experience, we should go for 1. We probably need Peter's
>> opinions on this.
>
> For 1, the perf_event_disable/enable_* are not enough. They don't
> save/restore MSRs.

perf_event_disable() will go through perf to call pt_event_stop(), which
saves the related MSRs, right? (If so, what large changes did you mean?)

> If we go this way, we have to introduce a new generic
> interface to ask each driver to save/restore its MSRs when the guest is
> entering/exiting. We'd better combine the new interface with the
> existing perf_guest_get_msrs() of the core driver.
> I think that's an ideal solution, but it requires big changes in the code.
>
> 2 is the current KVM implementation. See pt_save_msr()/pt_load_msr().
> I don't think it's the right way. We'd better fix it.
>
> The suggestion should be 3: KVM notifies the PT driver via an interface
> provided by PT, and the PT driver saves/restores all the registers.
> I think it's an acceptable solution with small code changes.

This looks like we just relocate the save/restore functions to the PT
driver, and KVM still calls them directly - still not going through
perf's management. Imagine if every user operated on the pmu h/w
directly like this; things would be a mess.

Thanks,
Wei
On 2022-09-15 10:39 a.m., Wang, Wei W wrote:
> On Thursday, September 15, 2022 9:55 PM, Liang, Kan wrote:
>> On 2022-09-14 10:46 p.m., Wang, Wei W wrote:
>>> On Thursday, September 15, 2022 4:26 AM, Liang, Kan wrote:
>>>> The perf_event_disable() eventually invokes intel_pt_stop().
>>>> We already expose intel_pt_stop()/cpu_emergency_stop_pt() to other
>>>> modules. I don't think we have to use perf_event_disable(). Also,
>>>> perf_event_disable() requires extra code.
>>>>
>>>> I went through the discussions. I agree with Sean's suggestion.
>>>> We should keep only the logic in KVM and move all the MSR access
>>>> details into the PT driver.
>>>
>>> Even the driver itself doesn't drive the save/restore of the MSRs;
>>> it is driven by perf.
>>
>> It goes through perf_event, but it is not driven by perf_event. The
>> perf_event generic code never knows when it should invoke each driver
>> to save/restore information. That has to be driven by another
>> subsystem, e.g., the scheduler.
>
> Yes. The cpu scheduler does this via the perf subsystem, though.
>
>> For this case, KVM should drive the save/restore, and the PT driver
>> eventually does all the MSR access details.
>>
>>> 1. If we make KVM a user of perf, we should do this via
>>>    perf_event_disable/enable_*.
>>> 2. If we make KVM an alternative to perf (i.e. have direct control
>>>    over the PMU HW), we can do this via driver interfaces, like perf does.
>>>
>>> Per my experience, we should go for 1. We probably need Peter's
>>> opinions on this.
>>
>> For 1, the perf_event_disable/enable_* are not enough. They don't
>> save/restore MSRs.
>
> perf_event_disable() will go through perf to call pt_event_stop(),
> which saves the related MSRs, right?

I don't think so. The pt_event_stop() doesn't save all the
MSR_IA32_RTIT_* MSRs.

> (If so, what large changes did you mean?)
>
>> If we go this way, we have to introduce a new generic interface to
>> ask each driver to save/restore its MSRs when the guest is
>> entering/exiting. We'd better combine the new interface with the
>> existing perf_guest_get_msrs() of the core driver.
>> I think that's an ideal solution, but it requires big changes in the code.
>>
>> 2 is the current KVM implementation. See pt_save_msr()/pt_load_msr().
>> I don't think it's the right way. We'd better fix it.
>>
>> The suggestion should be 3: KVM notifies the PT driver via an
>> interface provided by PT, and the PT driver saves/restores all the
>> registers.
>> I think it's an acceptable solution with small code changes.
>
> This looks like we just relocate the save/restore functions to the PT
> driver, and KVM still calls them directly - still not going through
> perf's management. Imagine if every user operated on the pmu h/w
> directly like this; things would be a mess.

The pt_event_stop() and the proposed interface still manipulate the PT
event, pt->handle.event. The event status is updated as well, so it's
still under the control of perf_event, while the current KVM
implementation implicitly updates the MSRs without updating the event
status.

Also, KVM doesn't know PT as well as the PT driver does. It's better to
let the dedicated driver maintain the details. Otherwise, if we add more
MSRs later, we have to maintain both KVM and PT.

Thanks,
Kan
On Thursday, September 15, 2022 11:43 PM, Liang, Kan wrote:
> On 2022-09-15 10:39 a.m., Wang, Wei W wrote:
>> On Thursday, September 15, 2022 9:55 PM, Liang, Kan wrote:
>>> On 2022-09-14 10:46 p.m., Wang, Wei W wrote:
>>>> On Thursday, September 15, 2022 4:26 AM, Liang, Kan wrote:
>>>>> The perf_event_disable() eventually invokes intel_pt_stop().
>>>>> We already expose intel_pt_stop()/cpu_emergency_stop_pt() to other
>>>>> modules. I don't think we have to use perf_event_disable(). Also,
>>>>> perf_event_disable() requires extra code.
>>>>>
>>>>> I went through the discussions. I agree with Sean's suggestion.
>>>>> We should keep only the logic in KVM and move all the MSR access
>>>>> details into the PT driver.
>>>>
>>>> Even the driver itself doesn't drive the save/restore of the MSRs;
>>>> it is driven by perf.
>>>
>>> It goes through perf_event, but it is not driven by perf_event. The
>>> perf_event generic code never knows when it should invoke each
>>> driver to save/restore information. That has to be driven by another
>>> subsystem, e.g., the scheduler.
>>
>> Yes. The cpu scheduler does this via the perf subsystem, though.
>>
>>> For this case, KVM should drive the save/restore, and the PT driver
>>> eventually does all the MSR access details.
>>>
>>>> 1. If we make KVM a user of perf, we should do this via
>>>>    perf_event_disable/enable_*.
>>>> 2. If we make KVM an alternative to perf (i.e. have direct control
>>>>    over the PMU HW), we can do this via driver interfaces, like
>>>>    perf does.
>>>>
>>>> Per my experience, we should go for 1. We probably need Peter's
>>>> opinions on this.
>>>
>>> For 1, the perf_event_disable/enable_* are not enough. They don't
>>> save/restore MSRs.
>>
>> perf_event_disable() will go through perf to call pt_event_stop(),
>> which saves the related MSRs, right?
>
> I don't think so. The pt_event_stop() doesn't save all the
> MSR_IA32_RTIT_* MSRs.

Not all the MSRs are required to be saved. In general, pt_event_stop()
should already save all the MSRs required for an event switch; otherwise
the general usages of PT would be broken. To be more precise, the
following MSRs are not saved by pt_event_stop(), but I don't see that
they need to be saved:

- MSR_IA32_RTIT_CR3_MATCH: I don't see it being used by perf.
  It seems KVM saved an MSR that's not even used by the host.

- Address range MSRs (MSR_IA32_RTIT_ADDR0_A etc.): those are provided
  by s/w and not updated by h/w, so they are simply written to the MSRs
  when the event gets scheduled in. There is no need to save them.

>
>> (If so, what large changes did you mean?)
>>
>>> If we go this way, we have to introduce a new generic interface to
>>> ask each driver to save/restore its MSRs when the guest is
>>> entering/exiting. We'd better combine the new interface with the
>>> existing perf_guest_get_msrs() of the core driver.
>>> I think that's an ideal solution, but it requires big changes in the
>>> code.
>>>
>>> 2 is the current KVM implementation. See pt_save_msr()/pt_load_msr().
>>> I don't think it's the right way. We'd better fix it.
>>>
>>> The suggestion should be 3: KVM notifies the PT driver via an
>>> interface provided by PT, and the PT driver saves/restores all the
>>> registers.
>>> I think it's an acceptable solution with small code changes.
>>
>> This looks like we just relocate the save/restore functions to the PT
>> driver, and KVM still calls them directly - still not going through
>> perf's management. Imagine if every user operated on the pmu h/w
>> directly like this; things would be a mess.
>
> The pt_event_stop() and the proposed interface still manipulate the PT
> event, pt->handle.event. The event status is updated as well, so it's
> still under the control of perf_event.

Did you mean to handle the PT event in the proposed driver API? Event
status is just one of the things. There are other things if we want to
make it complete, e.g. event->oncpu = -1, and eventually it seems we
would re-implement perf_event_disable_*.

Btw, Xiaoyao has made it work with perf_event_disable_local(), and it
doesn't require that many changes. If necessary, we can post the 2nd
version out to double check.

Thanks,
Wei
On 2022-09-15 10:30 p.m., Wang, Wei W wrote:
> On Thursday, September 15, 2022 11:43 PM, Liang, Kan wrote:
>> On 2022-09-15 10:39 a.m., Wang, Wei W wrote:
>>> On Thursday, September 15, 2022 9:55 PM, Liang, Kan wrote:
>>>> On 2022-09-14 10:46 p.m., Wang, Wei W wrote:
>>>>> On Thursday, September 15, 2022 4:26 AM, Liang, Kan wrote:
>>>>>> The perf_event_disable() eventually invokes intel_pt_stop().
>>>>>> We already expose intel_pt_stop()/cpu_emergency_stop_pt() to other
>>>>>> modules. I don't think we have to use perf_event_disable(). Also,
>>>>>> perf_event_disable() requires extra code.
>>>>>>
>>>>>> I went through the discussions. I agree with Sean's suggestion.
>>>>>> We should keep only the logic in KVM and move all the MSR access
>>>>>> details into the PT driver.
>>>>>
>>>>> Even the driver itself doesn't drive the save/restore of the MSRs;
>>>>> it is driven by perf.
>>>>
>>>> It goes through perf_event, but it is not driven by perf_event. The
>>>> perf_event generic code never knows when it should invoke each
>>>> driver to save/restore information. That has to be driven by
>>>> another subsystem, e.g., the scheduler.
>>>
>>> Yes. The cpu scheduler does this via the perf subsystem, though.
>>>
>>>> For this case, KVM should drive the save/restore, and the PT driver
>>>> eventually does all the MSR access details.
>>>>
>>>>> 1. If we make KVM a user of perf, we should do this via
>>>>>    perf_event_disable/enable_*.
>>>>> 2. If we make KVM an alternative to perf (i.e. have direct control
>>>>>    over the PMU HW), we can do this via driver interfaces, like
>>>>>    perf does.
>>>>>
>>>>> Per my experience, we should go for 1. We probably need Peter's
>>>>> opinions on this.
>>>>
>>>> For 1, the perf_event_disable/enable_* are not enough. They don't
>>>> save/restore MSRs.
>>>
>>> perf_event_disable() will go through perf to call pt_event_stop(),
>>> which saves the related MSRs, right?
>>
>> I don't think so. The pt_event_stop() doesn't save all the
>> MSR_IA32_RTIT_* MSRs.
>
> Not all the MSRs are required to be saved. In general, pt_event_stop()
> should already save all the MSRs required for an event switch;
> otherwise the general usages of PT would be broken. To be more
> precise, the following MSRs are not saved by pt_event_stop(), but I
> don't see that they need to be saved:
>
> - MSR_IA32_RTIT_CR3_MATCH: I don't see it being used by perf.
>   It seems KVM saved an MSR that's not even used by the host.
>
> - Address range MSRs (MSR_IA32_RTIT_ADDR0_A etc.): those are provided
>   by s/w and not updated by h/w, so they are simply written to the
>   MSRs when the event gets scheduled in. There is no need to save them.

OK. I think you need a clean-up patch to fix them first.

>>
>>> (If so, what large changes did you mean?)
>>>
>>>> If we go this way, we have to introduce a new generic interface to
>>>> ask each driver to save/restore its MSRs when the guest is
>>>> entering/exiting. We'd better combine the new interface with the
>>>> existing perf_guest_get_msrs() of the core driver.
>>>> I think that's an ideal solution, but it requires big changes in
>>>> the code.
>>>>
>>>> 2 is the current KVM implementation. See pt_save_msr()/pt_load_msr().
>>>> I don't think it's the right way. We'd better fix it.
>>>>
>>>> The suggestion should be 3: KVM notifies the PT driver via an
>>>> interface provided by PT, and the PT driver saves/restores all the
>>>> registers.
>>>> I think it's an acceptable solution with small code changes.
>>>
>>> This looks like we just relocate the save/restore functions to the
>>> PT driver, and KVM still calls them directly - still not going
>>> through perf's management. Imagine if every user operated on the pmu
>>> h/w directly like this; things would be a mess.
>>
>> The pt_event_stop() and the proposed interface still manipulate the
>> PT event, pt->handle.event. The event status is updated as well, so
>> it's still under the control of perf_event.
>
> Did you mean to handle the PT event in the proposed driver API? Event
> status is just one of the things. There are other things if we want to
> make it complete, e.g. event->oncpu = -1, and eventually it seems we
> would re-implement perf_event_disable_*.

As I understand it, perf always checks the status first. If it's a
stopped or inactivated event, I don't think event->oncpu will be
touched. That's why I think the proposed driver API should be
acceptable.

> Btw, Xiaoyao has made it work with perf_event_disable_local(), and it
> doesn't require that many changes. If necessary, we can post the 2nd
> version out to double check.

I'm not worried about which way (either perf_event_disable_local() or
the proposed PT driver API) is chosen to stop the PT. If the existing
perf_event interfaces can meet your requirement, that's perfect.

My real concern is pt_save_msr()/pt_load_msr(). I don't think it's a
job for KVM. See atomic_switch_perf_msrs(): it is the perf core driver,
rather than KVM, that tells which MSRs should be saved/restored in the
VMCS. We should do the same thing for PT. (Actually, I think we already
encounter issues with the current KVM-dominated method. KVM
saves/restores unnecessary MSRs, right?)

To do so, I think there may be two ways.
- Since MSRs have to be switched for both the PT and core drivers, it
  sounds reasonable to provide a new generic interface in perf_event.
  The new interface tells KVM which MSRs should be saved/restored; KVM
  can then decide whether to save/restore via the VMCS or via direct
  MSR access. I suspect this way requires big changes, but it would
  benefit all the drivers which have similar requirements.
- The proposed driver API. The MSRs are saved/restored in the PT driver.

Thanks,
Kan
On Friday, September 16, 2022 9:27 PM, Liang, Kan wrote:
>> Did you mean to handle the PT event in the proposed driver API? Event
>> status is just one of the things. There are other things if we want
>> to make it complete, e.g. event->oncpu = -1, and eventually it seems
>> we would re-implement perf_event_disable_*.
>
> As I understand it, perf always checks the status first. If it's a
> stopped or inactivated event, I don't think event->oncpu will be
> touched. That's why I think the proposed driver API should be
> acceptable.

That's an implementation detail, though. We need to make it
architecturally clean.

>> Btw, Xiaoyao has made it work with perf_event_disable_local(), and it
>> doesn't require that many changes. If necessary, we can post the 2nd
>> version out to double check.
>
> I'm not worried about which way (either perf_event_disable_local() or
> the proposed PT driver API) is chosen to stop the PT. If the existing
> perf_event interfaces can meet your requirement, that's perfect.
>
> My real concern is pt_save_msr()/pt_load_msr(). I don't think it's a
> job for KVM. See atomic_switch_perf_msrs(): it is the perf core
> driver, rather than KVM, that tells which MSRs should be
> saved/restored in the VMCS. We should do the same thing for PT.
> (Actually, I think we already encounter issues with the current
> KVM-dominated method. KVM saves/restores unnecessary MSRs, right?)

Right. It's on my plan to improve the current PT virtualization,
planned as the next step after this fix. The general rule is the same:
make KVM a user of perf. That is, we leave the save/restore work to be
done completely by the perf (driver) side, so we will eventually remove
the KVM-side pt_save/load_msr. To be more precise, it will work as
below:

- we will create a guest event, like what we did for lbr virtualization
- on VMEnter:
  -- perf_event_disable_local(host_event);
  -- perf_event_enable_local(guest_event);
- on VMExit:
  -- perf_event_disable_local(guest_event);
  -- perf_event_enable_local(host_event);

> To do so, I think there may be two ways.
> - Since MSRs have to be switched for both the PT and core drivers, it
>   sounds reasonable to provide a new generic interface in perf_event.
>   The new interface tells KVM which MSRs should be saved/restored; KVM
>   can then decide whether to save/restore via the VMCS or via direct
>   MSR access. I suspect this way requires big changes, but it would
>   benefit all the drivers which have similar requirements.
> - The proposed driver API. The MSRs are saved/restored in the PT driver.

As shown above, there is no need for those. We can completely reuse the
perf-side save/restore.

Thanks,
Wei
On 2022-09-19 9:46 a.m., Wang, Wei W wrote:
> On Friday, September 16, 2022 9:27 PM, Liang, Kan wrote:
>>> Did you mean to handle the PT event in the proposed driver API?
>>> Event status is just one of the things. There are other things if we
>>> want to make it complete, e.g. event->oncpu = -1, and eventually it
>>> seems we would re-implement perf_event_disable_*.
>>
>> As I understand it, perf always checks the status first. If it's a
>> stopped or inactivated event, I don't think event->oncpu will be
>> touched. That's why I think the proposed driver API should be
>> acceptable.
>
> That's an implementation detail, though. We need to make it
> architecturally clean.
>
>>> Btw, Xiaoyao has made it work with perf_event_disable_local(), and
>>> it doesn't require that many changes. If necessary, we can post the
>>> 2nd version out to double check.
>>
>> I'm not worried about which way (either perf_event_disable_local() or
>> the proposed PT driver API) is chosen to stop the PT. If the existing
>> perf_event interfaces can meet your requirement, that's perfect.
>>
>> My real concern is pt_save_msr()/pt_load_msr(). I don't think it's a
>> job for KVM. See atomic_switch_perf_msrs(): it is the perf core
>> driver, rather than KVM, that tells which MSRs should be
>> saved/restored in the VMCS. We should do the same thing for PT.
>> (Actually, I think we already encounter issues with the current
>> KVM-dominated method. KVM saves/restores unnecessary MSRs, right?)
>
> Right. It's on my plan to improve the current PT virtualization,
> planned as the next step after this fix. The general rule is the same:
> make KVM a user of perf. That is, we leave the save/restore work to be
> done completely by the perf (driver) side, so we will eventually
> remove the KVM-side pt_save/load_msr.
> To be more precise, it will work as below:
> - we will create a guest event, like what we did for lbr virtualization

Another fake event? We have to specially handle it in the perf code. I
don't think that's a clean way for perf.

> - on VMEnter:
>   -- perf_event_disable_local(host_event);
>   -- perf_event_enable_local(guest_event);
> - on VMExit:
>   -- perf_event_disable_local(guest_event);
>   -- perf_event_enable_local(host_event);

Why can't we use the same way as the perf core driver to switch the
MSRs in the VMCS? You just need one generic function,
perf_guest_get_msrs(), for both the PT and core drivers. If you have to
disable PT explicitly before the VMCS switch, I think you can do it in
the PT-specific perf_guest_get_msrs().

Anyway, that's an improvement to the current code. I don't have a
problem if you prefer to separate the fix patch and the improvement
patch.

Thanks,
Kan

>
>> To do so, I think there may be two ways.
>> - Since MSRs have to be switched for both the PT and core drivers, it
>>   sounds reasonable to provide a new generic interface in perf_event.
>>   The new interface tells KVM which MSRs should be saved/restored;
>>   KVM can then decide whether to save/restore via the VMCS or via
>>   direct MSR access. I suspect this way requires big changes, but it
>>   would benefit all the drivers which have similar requirements.
>> - The proposed driver API. The MSRs are saved/restored in the PT
>>   driver.
>
> As shown above, there is no need for those. We can completely reuse
> the perf-side save/restore.
>
> Thanks,
> Wei
On Monday, September 19, 2022 10:41 PM, Liang, Kan wrote:
> Another fake event? We have to specially handle it in the perf code. I
> don't think it's a clean way for perf.

We can check the patch later. I think it should be clean, like the LBR side.

>> - on VMEnter:
>> -- perf_disable_event_local(host_event);
>> -- perf_enable_event_local(guest_event);
>> - on VMExit:
>> -- perf_disable_event_local(guest_event);
>> -- perf_enable_event_local(host_event);
>
> Why we cannot use the same way as the perf core driver to switch the MSRs
> in the VMCS?

The current MSR-switching list in the VMCS isn't fast; it should be the last
resort, used only when really necessary.

> You just need one generic function, perf_guest_get_msrs(), for both PT and
> the core driver. If you have to disable PT explicitly before VMCS, I think
> you can do it in the PT-specific perf_guest_get_msrs().

The disable is done via the "Clear IA32_RTIT_CTL" VM-exit control. It
ensures PT is disabled in time on VM exit, so there is no need to go through
perf_guest_get_msrs().

Thanks,
Wei
On 2022-09-19 11:22 a.m., Wang, Wei W wrote:
> On Monday, September 19, 2022 10:41 PM, Liang, Kan wrote:
>> Another fake event? We have to specially handle it in the perf code. I
>> don't think it's a clean way for perf.
>
> We can check the patch later. I think it should be clean, like the LBR
> side.

I doubt it. Perf already specially handles the fake LBR event in several
places, from the core code to the LBR code. It also occupies a reserved bit.
If there is another choice, I don't think we want to go that way.

>>> - on VMEnter:
>>> -- perf_disable_event_local(host_event);
>>> -- perf_enable_event_local(guest_event);
>>> - on VMExit:
>>> -- perf_disable_event_local(guest_event);
>>> -- perf_enable_event_local(host_event);
>>
>> Why we cannot use the same way as the perf core driver to switch the MSRs
>> in the VMCS?
>
> The current MSR switching list from VMCS isn't fast,
> should be the last resort when really necessary.

It's a documented way in the SDM, and I believe there must be some reason
Intel introduced it. Since it's a documented (and recommended) way, I think
we'd better use it if possible.

Since both the PT and the core driver need to switch MSRs across VMCS
transitions, it's better to use the same way (function) to handle them.

Thanks,
Kan

>> You just need one generic function, perf_guest_get_msrs(), for both PT
>> and the core driver. If you have to disable PT explicitly before VMCS, I
>> think you can do it in the PT-specific perf_guest_get_msrs().
>
> The disable is done via the "Clear IA32_RTIT_CTL" VM-exit control.
> It can ensure PT is disabled in time on VM exit, so no need to go through
> perf_guest_get_msrs().