
[RFC,0/2] KVM: VMX: Fix VM entry failure on PT_MODE_HOST_GUEST while host is using PT

Message ID 20220825085625.867763-1-xiaoyao.li@intel.com (mailing list archive)

Message

Xiaoyao Li Aug. 25, 2022, 8:56 a.m. UTC
There is a bug in KVM that hits VM-entry failure 100% of the time on
platforms supporting PT_MODE_HOST_GUEST mode, with the following steps:

  1. #modprobe -r kvm_intel
  2. #modprobe kvm_intel pt_mode=1
  3. start a VM with QEMU
  4. on host: #perf record -e intel_pt//

The vm-entry failure happens because it violates the requirement stated
in Intel SDM 26.2.1.1 VM-Execution Control Fields

  If the logical processor is operating with Intel PT enabled (if
  IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the "load
  IA32_RTIT_CTL" VM-entry control must be 0.

In PT_MODE_HOST_GUEST mode, the "load IA32_RTIT_CTL" VM-entry control is
always set. Thus KVM needs to ensure IA32_RTIT_CTL.TraceEn is 0 before
VM entry. Currently KVM manually clears the TraceEn bit with
WRMSR(IA32_RTIT_CTL). However, this doesn't work every time, since there
is a possibility that IA32_RTIT_CTL.TraceEn is re-enabled in the PT PMI
handler before VM entry. This series tries to fix the issue by exposing
two interfaces from the Intel PT driver to stop and resume Intel PT on
the host, which prevents the PT PMI handler from re-enabling PT. It also
fixes another issue: the PT PMI handler touches PT MSRs, which makes the
host state that KVM has saved become stale.

Xiaoyao Li (2):
  perf/x86/intel/pt: Introduce intel_pt_{stop,resume}()
  KVM: VMX: Stop/resume host PT before/after VM entry when
    PT_MODE_HOST_GUEST

 arch/x86/events/intel/pt.c      | 11 ++++++++++-
 arch/x86/include/asm/intel_pt.h |  6 ++++--
 arch/x86/kernel/crash.c         |  4 ++--
 arch/x86/kvm/vmx/vmx.c          | 11 ++++++++++-
 4 files changed, 26 insertions(+), 6 deletions(-)

Comments

Wang, Wei W Aug. 29, 2022, 7:49 a.m. UTC | #1
On Thursday, August 25, 2022 4:56 PM, Xiaoyao Li wrote:
> There is one bug in KVM that can hit vm-entry failure 100% on platform
> supporting PT_MODE_HOST_GUEST mode following below steps:
> 
>   1. #modprobe -r kvm_intel
>   2. #modprobe kvm_intel pt_mode=1
>   3. start a VM with QEMU
>   4. on host: #perf record -e intel_pt//
> 
> The vm-entry failure happens because it violates the requirement stated in
> Intel SDM 26.2.1.1 VM-Execution Control Fields
> 
>   If the logical processor is operating with Intel PT enabled (if
>   IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the "load
>   IA32_RTIT_CTL" VM-entry control must be 0.
> 
> On PT_MODE_HOST_GUEST node, PT_MODE_HOST_GUEST is always set. Thus
> KVM needs to ensure IA32_RTIT_CTL.TraceEn is 0 before VM-entry. Currently
> KVM manually WRMSR(IA32_RTIT_CTL) to clear TraceEn bit. However, it
> doesn't work everytime since there is a posibility that IA32_RTIT_CTL.TraceEn
> is re-enabled in PT PMI handler before vm-entry. This series tries to fix the
> issue by exposing two interfaces from Intel PT driver for the purose to stop and
> resume Intel PT on host. It prevents PT PMI handler from re-enabling PT. By the
> way, it also fixes another issue that PT PMI touches PT MSRs whihc leads to
> what KVM stores for host bemomes stale.

I'm thinking about another approach to fixing it. I think we need to have
the running host PT event disabled when we switch to the guest, since we
don't expect to receive a host PT interrupt at that point. Also, the host
PT context can be saved/restored by the host perf core (instead of KVM)
when we disable/enable the event.

diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index 82ef87e9a897..1d3e03ecaf6a 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -1575,6 +1575,7 @@ static void pt_event_start(struct perf_event *event, int mode)

        pt_config_buffer(buf);
        pt_config(event);
+       pt->event = event;

        return;

@@ -1600,6 +1601,7 @@ static void pt_event_stop(struct perf_event *event, int mode)
                return;

        event->hw.state = PERF_HES_STOPPED;
+       pt->event = NULL;

        if (mode & PERF_EF_UPDATE) {
                struct pt_buffer *buf = perf_get_aux(&pt->handle);
@@ -1624,6 +1626,15 @@ static void pt_event_stop(struct perf_event *event, int mode)
        }
 }

+
+struct perf_event *pt_get_curr_event(void)
+{
+       struct pt *pt = this_cpu_ptr(&pt_ctx);
+
+       return pt->event;
+}
+EXPORT_SYMBOL_GPL(pt_get_curr_event);
+
 static long pt_event_snapshot_aux(struct perf_event *event,
                                  struct perf_output_handle *handle,
                                  unsigned long size)
diff --git a/arch/x86/events/intel/pt.h b/arch/x86/events/intel/pt.h
index 96906a62aacd..d46a85bb06bb 100644
--- a/arch/x86/events/intel/pt.h
+++ b/arch/x86/events/intel/pt.h
@@ -121,6 +121,7 @@ struct pt_filters {
  * @output_mask:       cached RTIT_OUTPUT_MASK MSR value
  */
 struct pt {
+       struct perf_event       *event;
        struct perf_output_handle handle;
        struct pt_filters       filters;
        int                     handle_nmi;
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index f6fc8dd51ef4..be8dd24922a7 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -553,11 +553,14 @@ static inline int x86_perf_get_lbr(struct x86_pmu_lbr *lbr)

 #ifdef CONFIG_CPU_SUP_INTEL
  extern void intel_pt_handle_vmx(int on);
+ extern struct perf_event *pt_get_curr_event(void);
 #else
 static inline void intel_pt_handle_vmx(int on)
 {

+
 }
+static inline struct perf_event *pt_get_curr_event(void) { return NULL; }
 #endif

 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_AMD)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d7f8331d6f7e..195debc1bff1 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1125,37 +1125,29 @@ static inline void pt_save_msr(struct pt_ctx *ctx, u32 addr_range)

 static void pt_guest_enter(struct vcpu_vmx *vmx)
 {
-       if (vmx_pt_mode_is_system())
+       struct perf_event *event;
+
+       if (vmx_pt_mode_is_system() ||
+           !(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN))
                return;

-       /*
-        * GUEST_IA32_RTIT_CTL is already set in the VMCS.
-        * Save host state before VM entry.
-        */
-       rdmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
-       if (vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN) {
-               wrmsrl(MSR_IA32_RTIT_CTL, 0);
-               pt_save_msr(&vmx->pt_desc.host, vmx->pt_desc.num_address_ranges);
-               pt_load_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
-       }
+       event = pt_get_curr_event();
+       perf_event_disable(event);
+       vmx->pt_desc.host_event = event;
+       pt_load_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
}

 static void pt_guest_exit(struct vcpu_vmx *vmx)
 {
-       if (vmx_pt_mode_is_system())
-               return;
+       struct perf_event *event = vmx->pt_desc.host_event;

-       if (vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN) {
-               pt_save_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
-               pt_load_msr(&vmx->pt_desc.host, vmx->pt_desc.num_address_ranges);
-       }
+       if (vmx_pt_mode_is_system() ||
+           !(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN))
+               return;

-       /*
-        * KVM requires VM_EXIT_CLEAR_IA32_RTIT_CTL to expose PT to the guest,
-        * i.e. RTIT_CTL is always cleared on VM-Exit.  Restore it if necessary.
-        */
-       if (vmx->pt_desc.host.ctl)
-               wrmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
+       pt_save_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
+       if (event)
+               perf_event_enable(event);
 }

 void vmx_set_host_fs_gs(struct vmcs_host_state *host, u16 fs_sel, u16 gs_sel,
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 24d58c2ffaa3..4c20bdabc85b 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -66,7 +66,7 @@ struct pt_desc {
        u64 ctl_bitmask;
        u32 num_address_ranges;
        u32 caps[PT_CPUID_REGS_NUM * PT_CPUID_LEAVES];
-       struct pt_ctx host;
+       struct perf_event *host_event;
        struct pt_ctx guest;
 };
Sean Christopherson Aug. 29, 2022, 5:33 p.m. UTC | #2
On Mon, Aug 29, 2022, Wang, Wei W wrote:
> On Thursday, August 25, 2022 4:56 PM, Xiaoyao Li wrote:
>  #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_AMD)
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index d7f8331d6f7e..195debc1bff1 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1125,37 +1125,29 @@ static inline void pt_save_msr(struct pt_ctx *ctx, u32 addr_range)
> 
>  static void pt_guest_enter(struct vcpu_vmx *vmx)
>  {
> -       if (vmx_pt_mode_is_system())
> +       struct perf_event *event;
> +
> +       if (vmx_pt_mode_is_system() ||
> +           !(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN))

I don't think the host should trace the guest in host/guest mode just because
the guest isn't tracing itself.  I.e. the host still needs to turn off its own
tracing.

>                 return;
> 
> -       /*
> -        * GUEST_IA32_RTIT_CTL is already set in the VMCS.
> -        * Save host state before VM entry.
> -        */
> -       rdmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
> -       if (vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN) {
> -               wrmsrl(MSR_IA32_RTIT_CTL, 0);
> -               pt_save_msr(&vmx->pt_desc.host, vmx->pt_desc.num_address_ranges);
> -               pt_load_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
> -       }
> +       event = pt_get_curr_event();
> +       perf_event_disable(event);
> +       vmx->pt_desc.host_event = event;

This is effectively what I suggested[*], the main difference being that my version
adds dedicated enter/exit helpers so that perf can skip save/restore of the other
MSRs.  It's easy to extend if perf needs to hand back an event to complete the "exit".

	bool guest_trace_enabled = vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN;
	
	vmx->pt_desc.host_event = intel_pt_guest_enter(guest_trace_enabled);


and then on exit

	bool guest_trace_enabled = vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN;

	intel_pt_guest_exit(vmx->pt_desc.host_event, guest_trace_enabled);

[*] https://lore.kernel.org/all/YwecducnM%2FU6tqJT@google.com

> +       pt_load_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
> }
Wang, Wei W Aug. 30, 2022, 6:02 a.m. UTC | #3
On Tuesday, August 30, 2022 1:34 AM, Sean Christopherson wrote:
> On Mon, Aug 29, 2022, Wang, Wei W wrote:
> > On Thursday, August 25, 2022 4:56 PM, Xiaoyao Li wrote:
> >  #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_AMD)
> diff
> > --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index
> > d7f8331d6f7e..195debc1bff1 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -1125,37 +1125,29 @@ static inline void pt_save_msr(struct pt_ctx
> > *ctx, u32 addr_range)
> >
> >  static void pt_guest_enter(struct vcpu_vmx *vmx)  {
> > -       if (vmx_pt_mode_is_system())
> > +       struct perf_event *event;
> > +
> > +       if (vmx_pt_mode_is_system() ||
> > +           !(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN))
> 
> I don't think the host should trace the guest in the host/guest mode just
> because the guest isn't tracing itself.  I.e. the host still needs to turn off it's
> own tracing.

Right, need to fix this one.

> This is effectively what I suggested[*], the main difference being that my
> version adds dedicated enter/exit helpers so that perf can skip save/restore of
> the other MSRs.  

What "other MSRs" were you referring to?
(I suppose you meant perf_event_disable needs to save more MSRs)

> It's easy to extend if perf needs to hand back an event to
> complete the "exit.
> 
> 	bool guest_trace_enabled = vmx->pt_desc.guest.ctl &
> RTIT_CTL_TRACEEN;
> 
> 	vmx->pt_desc.host_event = intel_pt_guest_enter(guest_trace_enabled);
> 
> 
> and then on exit
> 
> 	bool guest_trace_enabled = vmx->pt_desc.guest.ctl &
> RTIT_CTL_TRACEEN;
> 
> 	intel_pt_guest_exit(vmx->pt_desc.host_event, guest_trace_enabled);
> 
> [*] https://lore.kernel.org/all/YwecducnM%2FU6tqJT@google.com

Yes, this can work. But I feel it somewhat violates a general rule
I learned from previous experience:
KVM should be a user of the perf subsystem, rather than implementing a
secondary driver outside perf's management.
Being a user of perf means everything possible should go through a "perf
event", which is the interface that perf exposes to its users.
Xiaoyao Li Sept. 8, 2022, 7:25 a.m. UTC | #4
On 8/29/2022 3:49 PM, Wang, Wei W wrote:
> On Thursday, August 25, 2022 4:56 PM, Xiaoyao Li wrote:
>> There is one bug in KVM that can hit vm-entry failure 100% on platform
>> supporting PT_MODE_HOST_GUEST mode following below steps:
>>
>>    1. #modprobe -r kvm_intel
>>    2. #modprobe kvm_intel pt_mode=1
>>    3. start a VM with QEMU
>>    4. on host: #perf record -e intel_pt//
>>
>> The vm-entry failure happens because it violates the requirement stated in
>> Intel SDM 26.2.1.1 VM-Execution Control Fields
>>
>>    If the logical processor is operating with Intel PT enabled (if
>>    IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the "load
>>    IA32_RTIT_CTL" VM-entry control must be 0.
>>
>> On PT_MODE_HOST_GUEST node, PT_MODE_HOST_GUEST is always set. Thus
>> KVM needs to ensure IA32_RTIT_CTL.TraceEn is 0 before VM-entry. Currently
>> KVM manually WRMSR(IA32_RTIT_CTL) to clear TraceEn bit. However, it
>> doesn't work everytime since there is a posibility that IA32_RTIT_CTL.TraceEn
>> is re-enabled in PT PMI handler before vm-entry. This series tries to fix the
>> issue by exposing two interfaces from Intel PT driver for the purose to stop and
>> resume Intel PT on host. It prevents PT PMI handler from re-enabling PT. By the
>> way, it also fixes another issue that PT PMI touches PT MSRs whihc leads to
>> what KVM stores for host bemomes stale.
> 
> I'm thinking about another approach to fixing it. I think we need to have the
> running host pt event disabled when we switch to guest and don't expect to
> receive the host pt interrupt at this point. Also, the host pt context can be
> save/restored by host perf core (instead of KVM) when we disable/enable
> the event.
> 
> diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
> index 82ef87e9a897..1d3e03ecaf6a 100644
> --- a/arch/x86/events/intel/pt.c
> +++ b/arch/x86/events/intel/pt.c
> @@ -1575,6 +1575,7 @@ static void pt_event_start(struct perf_event *event, int mode)
> 
>          pt_config_buffer(buf);
>          pt_config(event);
> +       pt->event = event;
> 
>          return;
> 
> @@ -1600,6 +1601,7 @@ static void pt_event_stop(struct perf_event *event, int mode)
>                  return;
> 
>          event->hw.state = PERF_HES_STOPPED;
> +       pt->event = NULL;
> 
>          if (mode & PERF_EF_UPDATE) {
>                  struct pt_buffer *buf = perf_get_aux(&pt->handle);
> @@ -1624,6 +1626,15 @@ static void pt_event_stop(struct perf_event *event, int mode)
>          }
>   }
> 
> +
> +struct perf_event *pt_get_curr_event(void)
> +{
> +       struct pt *pt = this_cpu_ptr(&pt_ctx);

Wei,

I'm not sure whether we can use pt->handle.event here instead.

> +       return pt->event;
> +}
> +EXPORT_SYMBOL_GPL(pt_get_curr_event);
> +
>   static long pt_event_snapshot_aux(struct perf_event *event,
>                                    struct perf_output_handle *handle,
>                                    unsigned long size)
> diff --git a/arch/x86/events/intel/pt.h b/arch/x86/events/intel/pt.h
> index 96906a62aacd..d46a85bb06bb 100644
> --- a/arch/x86/events/intel/pt.h
> +++ b/arch/x86/events/intel/pt.h
> @@ -121,6 +121,7 @@ struct pt_filters {
>    * @output_mask:       cached RTIT_OUTPUT_MASK MSR value
>    */
>   struct pt {
> +       struct perf_event       *event;
>          struct perf_output_handle handle;
>          struct pt_filters       filters;
>          int                     handle_nmi;
> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
> index f6fc8dd51ef4..be8dd24922a7 100644
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -553,11 +553,14 @@ static inline int x86_perf_get_lbr(struct x86_pmu_lbr *lbr)
> 
>   #ifdef CONFIG_CPU_SUP_INTEL
>    extern void intel_pt_handle_vmx(int on);
> + extern struct perf_event *pt_get_curr_event(void);
>   #else
>   static inline void intel_pt_handle_vmx(int on)
>   {
> 
> +
>   }
> +struct perf_event *pt_get_curr_event(void) { }
>   #endif
>
Wang, Wei W Sept. 8, 2022, 8:53 a.m. UTC | #5
On Thursday, September 8, 2022 3:26 PM, Li, Xiaoyao wrote:
> > +
> > +struct perf_event *pt_get_curr_event(void) {
> > +       struct pt *pt = this_cpu_ptr(&pt_ctx);
> 
> Wei,
> 
> I'm not sure if we can use pt->handle.event instead or not.
> 
> > +       return pt->event;

Yes, I think we could reuse that.
Xiaoyao Li Sept. 14, 2022, 4:15 a.m. UTC | #6
On 8/29/2022 3:49 PM, Wang, Wei W wrote:
> On Thursday, August 25, 2022 4:56 PM, Xiaoyao Li wrote:
>> There is one bug in KVM that can hit vm-entry failure 100% on platform
>> supporting PT_MODE_HOST_GUEST mode following below steps:
>>
>>    1. #modprobe -r kvm_intel
>>    2. #modprobe kvm_intel pt_mode=1
>>    3. start a VM with QEMU
>>    4. on host: #perf record -e intel_pt//
>>
>> The vm-entry failure happens because it violates the requirement stated in
>> Intel SDM 26.2.1.1 VM-Execution Control Fields
>>
>>    If the logical processor is operating with Intel PT enabled (if
>>    IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the "load
>>    IA32_RTIT_CTL" VM-entry control must be 0.
>>
>> On PT_MODE_HOST_GUEST node, PT_MODE_HOST_GUEST is always set. Thus
>> KVM needs to ensure IA32_RTIT_CTL.TraceEn is 0 before VM-entry. Currently
>> KVM manually WRMSR(IA32_RTIT_CTL) to clear TraceEn bit. However, it
>> doesn't work everytime since there is a posibility that IA32_RTIT_CTL.TraceEn
>> is re-enabled in PT PMI handler before vm-entry. This series tries to fix the
>> issue by exposing two interfaces from Intel PT driver for the purose to stop and
>> resume Intel PT on host. It prevents PT PMI handler from re-enabling PT. By the
>> way, it also fixes another issue that PT PMI touches PT MSRs whihc leads to
>> what KVM stores for host bemomes stale.
> 
> I'm thinking about another approach to fixing it. I think we need to have the
> running host pt event disabled when we switch to guest and don't expect to
> receive the host pt interrupt at this point. Also, the host pt context can be
> save/restored by host perf core (instead of KVM) when we disable/enable
> the event.
> 
> diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
> index 82ef87e9a897..1d3e03ecaf6a 100644
> --- a/arch/x86/events/intel/pt.c
> +++ b/arch/x86/events/intel/pt.c
> @@ -1575,6 +1575,7 @@ static void pt_event_start(struct perf_event *event, int mode)
> 
>          pt_config_buffer(buf);
>          pt_config(event);
> +       pt->event = event;
> 
>          return;
> 
> @@ -1600,6 +1601,7 @@ static void pt_event_stop(struct perf_event *event, int mode)
>                  return;
> 
>          event->hw.state = PERF_HES_STOPPED;
> +       pt->event = NULL;
> 
>          if (mode & PERF_EF_UPDATE) {
>                  struct pt_buffer *buf = perf_get_aux(&pt->handle);
> @@ -1624,6 +1626,15 @@ static void pt_event_stop(struct perf_event *event, int mode)
>          }
>   }
> 
> +
> +struct perf_event *pt_get_curr_event(void)
> +{
> +       struct pt *pt = this_cpu_ptr(&pt_ctx);
> +
> +       return pt->event;
> +}
> +EXPORT_SYMBOL_GPL(pt_get_curr_event);
> +
>   static long pt_event_snapshot_aux(struct perf_event *event,
>                                    struct perf_output_handle *handle,
>                                    unsigned long size)
> diff --git a/arch/x86/events/intel/pt.h b/arch/x86/events/intel/pt.h
> index 96906a62aacd..d46a85bb06bb 100644
> --- a/arch/x86/events/intel/pt.h
> +++ b/arch/x86/events/intel/pt.h
> @@ -121,6 +121,7 @@ struct pt_filters {
>    * @output_mask:       cached RTIT_OUTPUT_MASK MSR value
>    */
>   struct pt {
> +       struct perf_event       *event;
>          struct perf_output_handle handle;
>          struct pt_filters       filters;
>          int                     handle_nmi;
> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
> index f6fc8dd51ef4..be8dd24922a7 100644
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -553,11 +553,14 @@ static inline int x86_perf_get_lbr(struct x86_pmu_lbr *lbr)
> 
>   #ifdef CONFIG_CPU_SUP_INTEL
>    extern void intel_pt_handle_vmx(int on);
> + extern struct perf_event *pt_get_curr_event(void);
>   #else
>   static inline void intel_pt_handle_vmx(int on)
>   {
> 
> +
>   }
> +struct perf_event *pt_get_curr_event(void) { }
>   #endif
> 
>   #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_AMD)
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index d7f8331d6f7e..195debc1bff1 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1125,37 +1125,29 @@ static inline void pt_save_msr(struct pt_ctx *ctx, u32 addr_range)
> 
>   static void pt_guest_enter(struct vcpu_vmx *vmx)
>   {
> -       if (vmx_pt_mode_is_system())
> +       struct perf_event *event;
> +
> +       if (vmx_pt_mode_is_system() ||
> +           !(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN))
>                  return;
> 
> -       /*
> -        * GUEST_IA32_RTIT_CTL is already set in the VMCS.
> -        * Save host state before VM entry.
> -        */
> -       rdmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
> -       if (vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN) {
> -               wrmsrl(MSR_IA32_RTIT_CTL, 0);
> -               pt_save_msr(&vmx->pt_desc.host, vmx->pt_desc.num_address_ranges);
> -               pt_load_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
> -       }
> +       event = pt_get_curr_event();
> +       perf_event_disable(event);

perf_event_disable() is not allowed in preemption-disabled context, since

perf_event_disable()
-> perf_event_ctx_lock()
    -> perf_event_ctx_lock_nested()
       -> mutex_lock_nested()

and it causes

[ 3542.164553] BUG: sleeping function called from invalid context at 
kernel/locking/mutex.c:580
[ 3542.165140] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 
1518, name: CPU 0/KVM
[ 3542.165703] preempt_count: 1, expected: 0
[ 3542.166006] RCU nest depth: 0, expected: 0
[ 3542.166315] INFO: lockdep is turned off.
[ 3542.166614] irq event stamp: 0
[ 3542.166857] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[ 3542.167304] hardirqs last disabled at (0): [<ffffffff94699ac8>] 
copy_process+0x8e8/0x1bd0
[ 3542.167874] softirqs last  enabled at (0): [<ffffffff94699ac8>] 
copy_process+0x8e8/0x1bd0
[ 3542.168443] softirqs last disabled at (0): [<0000000000000000>] 0x0
[ 3542.168891] Preemption disabled at:
[ 3542.168893] [<ffffffffc0c28f29>] vcpu_enter_guest+0x139/0x1350 [kvm]
[ 3542.169674] CPU: 20 PID: 1518 Comm: CPU 0/KVM Not tainted 
6.0.0-rc5-fix-pt-vm-entry+ #3 f2d44ed9be3fc4a510291e2989c9432fce3cb5de
[ 3542.170457] Hardware name: Intel Corporation JACOBSVILLE/JACOBSVILLE, 
BIOS JBVLCRB1.86B.0012.D75.1912120439 12/12/2019
[ 3542.171188] Call Trace:
[ 3542.171392]  <TASK>
[ 3542.171572]  show_stack+0x52/0x5c
[ 3542.171831]  dump_stack_lvl+0x5b/0x86
[ 3542.172112]  dump_stack+0x10/0x16
[ 3542.172371]  __might_resched.cold+0x135/0x15b
[ 3542.172698]  __might_sleep+0x52/0xa0
[ 3542.172975]  __mutex_lock+0x4e/0x4d0
[ 3542.173251]  ? kvm_sched_in+0x4f/0x60 [kvm 
1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
[ 3542.173839]  ? perf_event_ctx_lock_nested+0xc8/0x230
[ 3542.174202]  ? rcu_read_lock_sched_held+0x16/0x90
[ 3542.174551]  ? lock_acquire+0xfc/0x150
[ 3542.174840]  ? perf_event_ctx_lock_nested+0x24/0x230
[ 3542.175205]  mutex_lock_nested+0x1c/0x30
[ 3542.175505]  perf_event_ctx_lock_nested+0xc8/0x230
[ 3542.181147]  ? perf_event_ctx_lock_nested+0x24/0x230
[ 3542.186839]  perf_event_disable+0x19/0x80
[ 3542.192502]  vmx_vcpu_run+0x3e5/0xfe0 [kvm_intel 
7936a7891efe9306918aa504b0eb8bc1e7ba3aa6]
[ 3542.203771]  ? rcu_read_lock_sched_held+0x16/0x90
[ 3542.209378]  vcpu_enter_guest+0xa96/0x1350 [kvm 
1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
[ 3542.215218]  ? vcpu_enter_guest+0xbe1/0x1350 [kvm 
1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
[ 3542.226292]  ? rcu_read_lock_sched_held+0x16/0x90
[ 3542.231956]  ? rcu_read_lock_sched_held+0x16/0x90
[ 3542.237542]  ? lock_acquire+0xfc/0x150
[ 3542.243093]  ? __rseq_handle_notify_resume+0x3a/0x60
[ 3542.248689]  vcpu_run+0x53/0x490 [kvm 
1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
[ 3542.254533]  ? vcpu_run+0x35a/0x490 [kvm 
1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
[ 3542.260567]  kvm_arch_vcpu_ioctl_run+0x162/0x680 [kvm 
1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
[ 3542.272395]  ? kvm_arch_vcpu_ioctl_run+0x6d/0x680 [kvm 
1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
[ 3542.284586]  kvm_vcpu_ioctl+0x2ad/0x7a0 [kvm 
1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
[ 3542.290973]  ? lock_acquire+0xfc/0x150
[ 3542.296990]  ? rcu_read_lock_sched_held+0x16/0x90
[ 3542.302912]  ? lock_release+0x118/0x190
[ 3542.308800]  ? __fget_files+0xe8/0x1c0
[ 3542.314710]  ? __fget_files+0x5/0x1c0
[ 3542.320591]  __x64_sys_ioctl+0x96/0xd0
[ 3542.326500]  do_syscall_64+0x3a/0x90
[ 3542.332426]  entry_SYSCALL_64_after_hwframe+0x5e/0xc8


I know little about perf. It seems perf_event_disable() is not widely
used by other kernel components. Is there an alternative? If not, I think
exposing stop/resume helpers from the PT driver, as this series does,
seems OK.

> +       vmx->pt_desc.host_event = event;
> +       pt_load_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
> }
> 
>   static void pt_guest_exit(struct vcpu_vmx *vmx)
>   {
> -       if (vmx_pt_mode_is_system())
> -               return;
> +       struct perf_event *event = vmx->pt_desc.host_event;
> 
> -       if (vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN) {
> -               pt_save_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
> -               pt_load_msr(&vmx->pt_desc.host, vmx->pt_desc.num_address_ranges);
> -       }
> +       if (vmx_pt_mode_is_system() ||
> +           !(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN))
> +               return;
> 
> -       /*
> -        * KVM requires VM_EXIT_CLEAR_IA32_RTIT_CTL to expose PT to the guest,
> -        * i.e. RTIT_CTL is always cleared on VM-Exit.  Restore it if necessary.
> -        */
> -       if (vmx->pt_desc.host.ctl)
> -               wrmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
> +       pt_save_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
> +       if (event)
> +               perf_event_enable(event);
>   }
> 
>   void vmx_set_host_fs_gs(struct vmcs_host_state *host, u16 fs_sel, u16 gs_sel,
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 24d58c2ffaa3..4c20bdabc85b 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -66,7 +66,7 @@ struct pt_desc {
>          u64 ctl_bitmask;
>          u32 num_address_ranges;
>          u32 caps[PT_CPUID_REGS_NUM * PT_CPUID_LEAVES];
> -       struct pt_ctx host;
> +       struct perf_event *host_event;
>          struct pt_ctx guest;
>   };
> 
>
Wang, Wei W Sept. 14, 2022, 6:16 a.m. UTC | #7
On Wednesday, September 14, 2022 12:16 PM, Li, Xiaoyao wrote:
> On 8/29/2022 3:49 PM, Wang, Wei W wrote:
> > On Thursday, August 25, 2022 4:56 PM, Xiaoyao Li wrote:
> >> There is one bug in KVM that can hit vm-entry failure 100% on
> >> platform supporting PT_MODE_HOST_GUEST mode following below steps:
> >>
> >>    1. #modprobe -r kvm_intel
> >>    2. #modprobe kvm_intel pt_mode=1
> >>    3. start a VM with QEMU
> >>    4. on host: #perf record -e intel_pt//
> >>
> >> The vm-entry failure happens because it violates the requirement
> >> stated in Intel SDM 26.2.1.1 VM-Execution Control Fields
> >>
> >>    If the logical processor is operating with Intel PT enabled (if
> >>    IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the "load
> >>    IA32_RTIT_CTL" VM-entry control must be 0.
> >>
> >> On PT_MODE_HOST_GUEST node, PT_MODE_HOST_GUEST is always set.
> Thus
> >> KVM needs to ensure IA32_RTIT_CTL.TraceEn is 0 before VM-entry.
> >> Currently KVM manually WRMSR(IA32_RTIT_CTL) to clear TraceEn bit.
> >> However, it doesn't work everytime since there is a posibility that
> >> IA32_RTIT_CTL.TraceEn is re-enabled in PT PMI handler before
> >> vm-entry. This series tries to fix the issue by exposing two
> >> interfaces from Intel PT driver for the purose to stop and resume
> >> Intel PT on host. It prevents PT PMI handler from re-enabling PT. By
> >> the way, it also fixes another issue that PT PMI touches PT MSRs whihc
> leads to what KVM stores for host bemomes stale.
> >
> > I'm thinking about another approach to fixing it. I think we need to
> > have the running host pt event disabled when we switch to guest and
> > don't expect to receive the host pt interrupt at this point. Also, the
> > host pt context can be save/restored by host perf core (instead of
> > KVM) when we disable/enable the event.
> >
> > diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
> > index 82ef87e9a897..1d3e03ecaf6a 100644
> > --- a/arch/x86/events/intel/pt.c
> > +++ b/arch/x86/events/intel/pt.c
> > @@ -1575,6 +1575,7 @@ static void pt_event_start(struct perf_event
> > *event, int mode)
> >
> >          pt_config_buffer(buf);
> >          pt_config(event);
> > +       pt->event = event;
> >
> >          return;
> >
> > @@ -1600,6 +1601,7 @@ static void pt_event_stop(struct perf_event
> *event, int mode)
> >                  return;
> >
> >          event->hw.state = PERF_HES_STOPPED;
> > +       pt->event = NULL;
> >
> >          if (mode & PERF_EF_UPDATE) {
> >                  struct pt_buffer *buf = perf_get_aux(&pt->handle);
> @@
> > -1624,6 +1626,15 @@ static void pt_event_stop(struct perf_event *event,
> int mode)
> >          }
> >   }
> >
> > +
> > +struct perf_event *pt_get_curr_event(void) {
> > +       struct pt *pt = this_cpu_ptr(&pt_ctx);
> > +
> > +       return pt->event;
> > +}
> > +EXPORT_SYMBOL_GPL(pt_get_curr_event);
> > +
> >   static long pt_event_snapshot_aux(struct perf_event *event,
> >                                    struct perf_output_handle
> *handle,
> >                                    unsigned long size) diff --git
> > a/arch/x86/events/intel/pt.h b/arch/x86/events/intel/pt.h index
> > 96906a62aacd..d46a85bb06bb 100644
> > --- a/arch/x86/events/intel/pt.h
> > +++ b/arch/x86/events/intel/pt.h
> > @@ -121,6 +121,7 @@ struct pt_filters {
> >    * @output_mask:       cached RTIT_OUTPUT_MASK MSR value
> >    */
> >   struct pt {
> > +       struct perf_event       *event;
> >          struct perf_output_handle handle;
> >          struct pt_filters       filters;
> >          int                     handle_nmi;
> > diff --git a/arch/x86/include/asm/perf_event.h
> > b/arch/x86/include/asm/perf_event.h
> > index f6fc8dd51ef4..be8dd24922a7 100644
> > --- a/arch/x86/include/asm/perf_event.h
> > +++ b/arch/x86/include/asm/perf_event.h
> > @@ -553,11 +553,14 @@ static inline int x86_perf_get_lbr(struct
> > x86_pmu_lbr *lbr)
> >
> >   #ifdef CONFIG_CPU_SUP_INTEL
> >    extern void intel_pt_handle_vmx(int on);
> > + extern struct perf_event *pt_get_curr_event(void);
> >   #else
> >   static inline void intel_pt_handle_vmx(int on)
> >   {
> >
> > +
> >   }
> > +struct perf_event *pt_get_curr_event(void) { }
> >   #endif
> >
> >   #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_AMD)
> diff
> > --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index
> > d7f8331d6f7e..195debc1bff1 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -1125,37 +1125,29 @@ static inline void pt_save_msr(struct pt_ctx *ctx, u32 addr_range)
> >
> >   static void pt_guest_enter(struct vcpu_vmx *vmx)
> >   {
> > -       if (vmx_pt_mode_is_system())
> > +       struct perf_event *event;
> > +
> > +       if (vmx_pt_mode_is_system() ||
> > +           !(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN))
> >                  return;
> >
> > -       /*
> > -        * GUEST_IA32_RTIT_CTL is already set in the VMCS.
> > -        * Save host state before VM entry.
> > -        */
> > -       rdmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
> > -       if (vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN) {
> > -               wrmsrl(MSR_IA32_RTIT_CTL, 0);
> > -               pt_save_msr(&vmx->pt_desc.host, vmx->pt_desc.num_address_ranges);
> > -               pt_load_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
> > -       }
> > +       event = pt_get_curr_event();
> > +       perf_event_disable(event);
> 
> perf_event_disable() is not allowed in preemption disabled context, since
> 
> perf_event_disable()
> -> perf_event_ctx_lock()
>     -> perf_event_ctx_lock_nested()
>        -> mutex_lock_nested()
> 
> and it causes
> 
> [ 3542.164553] BUG: sleeping function called from invalid context at
> kernel/locking/mutex.c:580
> [ 3542.165140] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid:
> 1518, name: CPU 0/KVM
> [ 3542.165703] preempt_count: 1, expected: 0 [ 3542.166006] RCU nest depth:
> 0, expected: 0 [ 3542.166315] INFO: lockdep is turned off.
> [ 3542.166614] irq event stamp: 0
> [ 3542.166857] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
> [ 3542.167304] hardirqs last disabled at (0): [<ffffffff94699ac8>]
> copy_process+0x8e8/0x1bd0
> [ 3542.167874] softirqs last  enabled at (0): [<ffffffff94699ac8>]
> copy_process+0x8e8/0x1bd0
> [ 3542.168443] softirqs last disabled at (0): [<0000000000000000>] 0x0
> [ 3542.168891] Preemption disabled at:
> [ 3542.168893] [<ffffffffc0c28f29>] vcpu_enter_guest+0x139/0x1350 [kvm]
> [ 3542.169674] CPU: 20 PID: 1518 Comm: CPU 0/KVM Not tainted
> 6.0.0-rc5-fix-pt-vm-entry+ #3 f2d44ed9be3fc4a510291e2989c9432fce3cb5de
> [ 3542.170457] Hardware name: Intel Corporation JACOBSVILLE/JACOBSVILLE,
> BIOS JBVLCRB1.86B.0012.D75.1912120439 12/12/2019 [ 3542.171188] Call
> Trace:
> [ 3542.171392]  <TASK>
> [ 3542.171572]  show_stack+0x52/0x5c
> [ 3542.171831]  dump_stack_lvl+0x5b/0x86 [ 3542.172112]
> dump_stack+0x10/0x16 [ 3542.172371]  __might_resched.cold+0x135/0x15b
> [ 3542.172698]  __might_sleep+0x52/0xa0 [ 3542.172975]
> __mutex_lock+0x4e/0x4d0 [ 3542.173251]  ? kvm_sched_in+0x4f/0x60 [kvm
> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
> [ 3542.173839]  ? perf_event_ctx_lock_nested+0xc8/0x230
> [ 3542.174202]  ? rcu_read_lock_sched_held+0x16/0x90
> [ 3542.174551]  ? lock_acquire+0xfc/0x150 [ 3542.174840]  ?
> perf_event_ctx_lock_nested+0x24/0x230
> [ 3542.175205]  mutex_lock_nested+0x1c/0x30 [ 3542.175505]
> perf_event_ctx_lock_nested+0xc8/0x230
> [ 3542.181147]  ? perf_event_ctx_lock_nested+0x24/0x230
> [ 3542.186839]  perf_event_disable+0x19/0x80 [ 3542.192502]
> vmx_vcpu_run+0x3e5/0xfe0 [kvm_intel
> 7936a7891efe9306918aa504b0eb8bc1e7ba3aa6]
> [ 3542.203771]  ? rcu_read_lock_sched_held+0x16/0x90
> [ 3542.209378]  vcpu_enter_guest+0xa96/0x1350 [kvm
> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
> [ 3542.215218]  ? vcpu_enter_guest+0xbe1/0x1350 [kvm
> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
> [ 3542.226292]  ? rcu_read_lock_sched_held+0x16/0x90
> [ 3542.231956]  ? rcu_read_lock_sched_held+0x16/0x90
> [ 3542.237542]  ? lock_acquire+0xfc/0x150 [ 3542.243093]  ?
> __rseq_handle_notify_resume+0x3a/0x60
> [ 3542.248689]  vcpu_run+0x53/0x490 [kvm
> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
> [ 3542.254533]  ? vcpu_run+0x35a/0x490 [kvm
> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
> [ 3542.260567]  kvm_arch_vcpu_ioctl_run+0x162/0x680 [kvm
> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
> [ 3542.272395]  ? kvm_arch_vcpu_ioctl_run+0x6d/0x680 [kvm
> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
> [ 3542.284586]  kvm_vcpu_ioctl+0x2ad/0x7a0 [kvm
> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
> [ 3542.290973]  ? lock_acquire+0xfc/0x150 [ 3542.296990]  ?
> rcu_read_lock_sched_held+0x16/0x90
> [ 3542.302912]  ? lock_release+0x118/0x190 [ 3542.308800]  ?
> __fget_files+0xe8/0x1c0 [ 3542.314710]  ? __fget_files+0x5/0x1c0
> [ 3542.320591]  __x64_sys_ioctl+0x96/0xd0 [ 3542.326500]
> do_syscall_64+0x3a/0x90 [ 3542.332426]
> entry_SYSCALL_64_after_hwframe+0x5e/0xc8
> 
> 
> I know little about perf.
 
+Kan to double confirm if needed.

> It seems perf_event_disable() is not used widely by
> other kernel components. Is there an alternative? If not, I think exposing
> disable/enable helpers from the pt driver like this series does seems OK.

Oops, it was actually a mistake to disable the event on other CPUs.
Can you expose and use perf_event_disable_local?

For the enabling side, we need to add and expose perf_event_enable_local:
event_function_local(event, __perf_event_enable, NULL);
Liang, Kan Sept. 14, 2022, 8:25 p.m. UTC | #8
On 2022-09-14 2:16 a.m., Wang, Wei W wrote:
> On Wednesday, September 14, 2022 12:16 PM, Li, Xiaoyao wrote:
>> On 8/29/2022 3:49 PM, Wang, Wei W wrote:
>>> On Thursday, August 25, 2022 4:56 PM, Xiaoyao Li wrote:
>>>> There is a bug in KVM that hits a vm-entry failure 100% of the time on
>>>> platforms supporting PT_MODE_HOST_GUEST mode, with the following steps:
>>>>
>>>>    1. #modprobe -r kvm_intel
>>>>    2. #modprobe kvm_intel pt_mode=1
>>>>    3. start a VM with QEMU
>>>>    4. on host: #perf record -e intel_pt//
>>>>
>>>> The vm-entry failure happens because it violates the requirement
>>>> stated in Intel SDM 26.2.1.1 VM-Execution Control Fields
>>>>
>>>>    If the logical processor is operating with Intel PT enabled (if
>>>>    IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the "load
>>>>    IA32_RTIT_CTL" VM-entry control must be 0.
>>>>
>>>> In PT_MODE_HOST_GUEST mode, the "load IA32_RTIT_CTL" VM-entry control
>>>> is always set. Thus KVM needs to ensure IA32_RTIT_CTL.TraceEn is 0
>>>> before VM-entry. Currently KVM manually does WRMSR(IA32_RTIT_CTL) to
>>>> clear the TraceEn bit. However, it doesn't work every time, since there
>>>> is a possibility that IA32_RTIT_CTL.TraceEn is re-enabled in the PT PMI
>>>> handler before vm-entry. This series tries to fix the issue by exposing
>>>> two interfaces from the Intel PT driver for the purpose of stopping and
>>>> resuming Intel PT on the host. It prevents the PT PMI handler from
>>>> re-enabling PT. It also fixes another issue where a PT PMI touches the
>>>> PT MSRs, which makes the host state that KVM stores become stale.
>>>
>>> I'm thinking about another approach to fixing it. I think we need to
>>> have the running host pt event disabled when we switch to guest and
>>> don't expect to receive the host pt interrupt at this point. Also, the
>>> host pt context can be save/restored by host perf core (instead of
>>> KVM) when we disable/enable the event.
>>>
>>> diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
>>> index 82ef87e9a897..1d3e03ecaf6a 100644
>>> --- a/arch/x86/events/intel/pt.c
>>> +++ b/arch/x86/events/intel/pt.c
>>> @@ -1575,6 +1575,7 @@ static void pt_event_start(struct perf_event
>>> *event, int mode)
>>>
>>>          pt_config_buffer(buf);
>>>          pt_config(event);
>>> +       pt->event = event;
>>>
>>>          return;
>>>
>>> @@ -1600,6 +1601,7 @@ static void pt_event_stop(struct perf_event
>> *event, int mode)
>>>                  return;
>>>
>>>          event->hw.state = PERF_HES_STOPPED;
>>> +       pt->event = NULL;
>>>
>>>          if (mode & PERF_EF_UPDATE) {
>>>                  struct pt_buffer *buf = perf_get_aux(&pt->handle);
>>> @@ -1624,6 +1626,15 @@ static void pt_event_stop(struct perf_event *event, int mode)
>>>          }
>>>   }
>>>
>>> +
>>> +struct perf_event *pt_get_curr_event(void)
>>> +{
>>> +       struct pt *pt = this_cpu_ptr(&pt_ctx);
>>> +
>>> +       return pt->event;
>>> +}
>>> +EXPORT_SYMBOL_GPL(pt_get_curr_event);
>>> +
>>>   static long pt_event_snapshot_aux(struct perf_event *event,
>>>                                    struct perf_output_handle *handle,
>>>                                    unsigned long size)
>>>
>>> diff --git a/arch/x86/events/intel/pt.h b/arch/x86/events/intel/pt.h
>>> index 96906a62aacd..d46a85bb06bb 100644
>>> --- a/arch/x86/events/intel/pt.h
>>> +++ b/arch/x86/events/intel/pt.h
>>> @@ -121,6 +121,7 @@ struct pt_filters {
>>>    * @output_mask:       cached RTIT_OUTPUT_MASK MSR value
>>>    */
>>>   struct pt {
>>> +       struct perf_event       *event;
>>>          struct perf_output_handle handle;
>>>          struct pt_filters       filters;
>>>          int                     handle_nmi;
>>> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
>>> index f6fc8dd51ef4..be8dd24922a7 100644
>>> --- a/arch/x86/include/asm/perf_event.h
>>> +++ b/arch/x86/include/asm/perf_event.h
>>> @@ -553,11 +553,14 @@ static inline int x86_perf_get_lbr(struct x86_pmu_lbr *lbr)
>>>
>>>   #ifdef CONFIG_CPU_SUP_INTEL
>>>    extern void intel_pt_handle_vmx(int on);
>>> + extern struct perf_event *pt_get_curr_event(void);
>>>   #else
>>>   static inline void intel_pt_handle_vmx(int on)
>>>   {
>>>
>>> +
>>>   }
>>> +static inline struct perf_event *pt_get_curr_event(void) { return NULL; }
>>>   #endif
>>>
>>>   #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_AMD)
>>>
>>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>>> index d7f8331d6f7e..195debc1bff1 100644
>>> --- a/arch/x86/kvm/vmx/vmx.c
>>> +++ b/arch/x86/kvm/vmx/vmx.c
>>> @@ -1125,37 +1125,29 @@ static inline void pt_save_msr(struct pt_ctx *ctx, u32 addr_range)
>>>
>>>   static void pt_guest_enter(struct vcpu_vmx *vmx)
>>>   {
>>> -       if (vmx_pt_mode_is_system())
>>> +       struct perf_event *event;
>>> +
>>> +       if (vmx_pt_mode_is_system() ||
>>> +           !(vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN))
>>>                  return;
>>>
>>> -       /*
>>> -        * GUEST_IA32_RTIT_CTL is already set in the VMCS.
>>> -        * Save host state before VM entry.
>>> -        */
>>> -       rdmsrl(MSR_IA32_RTIT_CTL, vmx->pt_desc.host.ctl);
>>> -       if (vmx->pt_desc.guest.ctl & RTIT_CTL_TRACEEN) {
>>> -               wrmsrl(MSR_IA32_RTIT_CTL, 0);
>>> -               pt_save_msr(&vmx->pt_desc.host, vmx->pt_desc.num_address_ranges);
>>> -               pt_load_msr(&vmx->pt_desc.guest, vmx->pt_desc.num_address_ranges);
>>> -       }
>>> +       event = pt_get_curr_event();
>>> +       perf_event_disable(event);
>>
>> perf_event_disable() is not allowed in preemption disabled context, since
>>
>> perf_event_disable()
>> -> perf_event_ctx_lock()
>>     -> perf_event_ctx_lock_nested()
>>        -> mutex_lock_nested()
>>
>> and it causes
>>
>> [ 3542.164553] BUG: sleeping function called from invalid context at
>> kernel/locking/mutex.c:580
>> [ 3542.165140] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid:
>> 1518, name: CPU 0/KVM
>> [ 3542.165703] preempt_count: 1, expected: 0 [ 3542.166006] RCU nest depth:
>> 0, expected: 0 [ 3542.166315] INFO: lockdep is turned off.
>> [ 3542.166614] irq event stamp: 0
>> [ 3542.166857] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
>> [ 3542.167304] hardirqs last disabled at (0): [<ffffffff94699ac8>]
>> copy_process+0x8e8/0x1bd0
>> [ 3542.167874] softirqs last  enabled at (0): [<ffffffff94699ac8>]
>> copy_process+0x8e8/0x1bd0
>> [ 3542.168443] softirqs last disabled at (0): [<0000000000000000>] 0x0
>> [ 3542.168891] Preemption disabled at:
>> [ 3542.168893] [<ffffffffc0c28f29>] vcpu_enter_guest+0x139/0x1350 [kvm]
>> [ 3542.169674] CPU: 20 PID: 1518 Comm: CPU 0/KVM Not tainted
>> 6.0.0-rc5-fix-pt-vm-entry+ #3 f2d44ed9be3fc4a510291e2989c9432fce3cb5de
>> [ 3542.170457] Hardware name: Intel Corporation JACOBSVILLE/JACOBSVILLE,
>> BIOS JBVLCRB1.86B.0012.D75.1912120439 12/12/2019 [ 3542.171188] Call
>> Trace:
>> [ 3542.171392]  <TASK>
>> [ 3542.171572]  show_stack+0x52/0x5c
>> [ 3542.171831]  dump_stack_lvl+0x5b/0x86 [ 3542.172112]
>> dump_stack+0x10/0x16 [ 3542.172371]  __might_resched.cold+0x135/0x15b
>> [ 3542.172698]  __might_sleep+0x52/0xa0 [ 3542.172975]
>> __mutex_lock+0x4e/0x4d0 [ 3542.173251]  ? kvm_sched_in+0x4f/0x60 [kvm
>> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
>> [ 3542.173839]  ? perf_event_ctx_lock_nested+0xc8/0x230
>> [ 3542.174202]  ? rcu_read_lock_sched_held+0x16/0x90
>> [ 3542.174551]  ? lock_acquire+0xfc/0x150 [ 3542.174840]  ?
>> perf_event_ctx_lock_nested+0x24/0x230
>> [ 3542.175205]  mutex_lock_nested+0x1c/0x30 [ 3542.175505]
>> perf_event_ctx_lock_nested+0xc8/0x230
>> [ 3542.181147]  ? perf_event_ctx_lock_nested+0x24/0x230
>> [ 3542.186839]  perf_event_disable+0x19/0x80 [ 3542.192502]
>> vmx_vcpu_run+0x3e5/0xfe0 [kvm_intel
>> 7936a7891efe9306918aa504b0eb8bc1e7ba3aa6]
>> [ 3542.203771]  ? rcu_read_lock_sched_held+0x16/0x90
>> [ 3542.209378]  vcpu_enter_guest+0xa96/0x1350 [kvm
>> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
>> [ 3542.215218]  ? vcpu_enter_guest+0xbe1/0x1350 [kvm
>> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
>> [ 3542.226292]  ? rcu_read_lock_sched_held+0x16/0x90
>> [ 3542.231956]  ? rcu_read_lock_sched_held+0x16/0x90
>> [ 3542.237542]  ? lock_acquire+0xfc/0x150 [ 3542.243093]  ?
>> __rseq_handle_notify_resume+0x3a/0x60
>> [ 3542.248689]  vcpu_run+0x53/0x490 [kvm
>> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
>> [ 3542.254533]  ? vcpu_run+0x35a/0x490 [kvm
>> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
>> [ 3542.260567]  kvm_arch_vcpu_ioctl_run+0x162/0x680 [kvm
>> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
>> [ 3542.272395]  ? kvm_arch_vcpu_ioctl_run+0x6d/0x680 [kvm
>> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
>> [ 3542.284586]  kvm_vcpu_ioctl+0x2ad/0x7a0 [kvm
>> 1f4ec973e90cea60cd68c22c1431d8e7f534c92b]
>> [ 3542.290973]  ? lock_acquire+0xfc/0x150 [ 3542.296990]  ?
>> rcu_read_lock_sched_held+0x16/0x90
>> [ 3542.302912]  ? lock_release+0x118/0x190 [ 3542.308800]  ?
>> __fget_files+0xe8/0x1c0 [ 3542.314710]  ? __fget_files+0x5/0x1c0
>> [ 3542.320591]  __x64_sys_ioctl+0x96/0xd0 [ 3542.326500]
>> do_syscall_64+0x3a/0x90 [ 3542.332426]
>> entry_SYSCALL_64_after_hwframe+0x5e/0xc8
>>
>>
>> I know little about perf.
>  
> +Kan to double confirm if needed.


The perf_event_disable() eventually invokes the intel_pt_stop().
We already expose the intel_pt_stop()/cpu_emergency_stop_pt() to other
modules. I don't think we have to use the perf_event_disable(). Also,
the perf_event_disable() requires extra code.

I went through the discussions. I agree with Sean's suggestion.
We should keep the logic in KVM but put all the MSR access details
into the PT driver. But I prefer a more generic and straightforward
function name, e.g., intel_pt_stop_save()/intel_pt_start_load(), in case
other modules may want to save/restore the PT information in their
context switch later.
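The intended semantics of such a pair can be modeled in user space. The function names follow the suggestion above; the bodies are assumptions drawn from this thread (save host state and clear TraceEn before VM entry, reload it after VM exit), not real driver code:

```c
#include <assert.h>
#include <stdint.h>

#define RTIT_CTL_TRACEEN (1ULL << 0)	/* IA32_RTIT_CTL.TraceEn */

/* Stand-ins for MSR state; in the kernel these would be rdmsrl/wrmsrl. */
uint64_t fake_rtit_ctl;
uint64_t saved_host_ctl;

/*
 * Called by KVM before VM entry: save the host PT configuration, then
 * clear TraceEn so the SDM 26.2.1.1 consistency check (TraceEn must be 0
 * when the "load IA32_RTIT_CTL" VM-entry control is set) cannot fail.
 */
void intel_pt_stop_save(void)
{
	saved_host_ctl = fake_rtit_ctl;
	fake_rtit_ctl &= ~RTIT_CTL_TRACEEN;
}

/*
 * Called by KVM after VM exit: reload the saved host configuration,
 * re-enabling tracing only if the host had it enabled before entry.
 */
void intel_pt_start_load(void)
{
	fake_rtit_ctl = saved_host_ctl;
}
```

The point of the split is that KVM only decides *when* to stop and resume; what "stop" and "resume" mean for the PT MSRs stays inside the driver.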

Thanks,
Kan

> 
>> It seems perf_event_disable() is not used widely by
>> other kernel component. Is there a alternative? If no, I think expose
>> disable/enable helper from pt driver like this series seems OK.
> 
> Oops, it was actually a mistake to disable the event on other CPUs.
> Can you expose and use perf_event_disable_local?
> 
> For the enabling side, we need to add and expose perf_event_enable_local:
> event_function_local(event, __perf_event_enable, NULL);
Wang, Wei W Sept. 15, 2022, 2:46 a.m. UTC | #9
On Thursday, September 15, 2022 4:26 AM, Liang, Kan wrote:
> The perf_event_disable() eventually invokes the intel_pt_stop().
> We already expose the intel_pt_stop()/cpu_emergency_stop_pt() to other
> modules. I don't think we have to use the perf_event_disable(). Also, the
> perf_event_disable() requires extra code.
> 
> I went through the discussions. I agree with Sean's suggestion.
> We should keep the logic in KVM but put all the MSR access details into the PT
> driver.

Even the driver itself doesn't drive the save/restore of the MSRs; it is driven by perf.
1. If we make KVM a user of perf, we should do this via perf_event_disable/enable_*.
2. If we make KVM an alternative to perf (i.e. have direct control over PMU HW),
we can do this via driver interfaces, like perf does.
In my experience, we should go for 1. We probably need Peter's opinion on this.

> But I prefer a more generic and straightforward function name, e.g.,
> intel_pt_stop_save()/intel_pt_start_load(), in case other modules may want to
> save/restore the PT information in their context switch later.
> 
> Thanks,
> Kan
> 
> >
> >> It seems perf_event_disable() is not used widely by other kernel
> >> component. 

Because there are not many kernel users.
You can check another user, watchdog_hld.c, where perf_event_enable/disable are used.
Liang, Kan Sept. 15, 2022, 1:54 p.m. UTC | #10
On 2022-09-14 10:46 p.m., Wang, Wei W wrote:
> On Thursday, September 15, 2022 4:26 AM, Liang, Kan wrote:
>> The perf_event_disable() eventually invokes the intel_pt_stop().
>> We already expose the intel_pt_stop()/cpu_emergency_stop_pt() to other
>> modules. I don't think we have to use the perf_event_disable(). Also, the
>> perf_event_disable() requires extra codes.
>>
>> I went through the discussions. I agree with Sean's suggestion.
>> We should only put the logic in the KVM but all the MSR access details into the PT
>> driver. 
> 
> Even the driver itself doesn’t drive the save/restore of the MSRs, it is drived by perf.

It goes through perf_event, but is not driven by perf_event. The perf_event
generic code never knows when it should invoke each driver to save/restore
information. That should be driven by another subsystem, e.g., the scheduler.

For this case, KVM should drive the save/restore, and the PT driver
eventually does all the MSR access details.

> 1. If we make KVM a user of perf, we should do this via perf_event_disable/enable_*.
> 2. If we make KVM an alternative to perf (i.e. have direct control over PMU HW),
> we can do this via driver interfaces like perf.
> Per my experience, we should go for 1. Probably need Peter's opinions on this.
>

For 1, the perf_event_disable/enable_* are not enough. They don't
save/restore MSRs. If we go this way, we have to introduce a new
generic interface to ask each driver to save/restore its MSRs when the
guest is entering/exiting. We'd better combine the new interface with
the existing perf_guest_get_msrs() of the core driver.
I think that's an ideal solution, but it requires big changes in the code.

2 is the current KVM implementation. See pt_save_msr()/pt_load_msr(). I
don't think it's the right way. We'd better fix it.

My suggestion would be 3: KVM notifies the PT driver via the
interface provided by PT, and the PT driver saves/restores all the registers.
I think it's an acceptable solution with small code changes.

So I prefer 3.

Thanks,
Kan

>> But I prefer a more generic and straightforward function name, e.g.,
>> intel_pt_stop_save()/intel_pt_start_load(), in case other modules may want to
>> save/restore the PT information in their context switch later.
>>
>> Thanks,
>> Kan
>>
>>>
>>>> It seems perf_event_disable() is not used widely by other kernel
>>>> component. 
> 
> Because there are not lots of kernel users.
> You can check another user, watchdog_hld.c, perf_event_enable/disable are used there.
Wang, Wei W Sept. 15, 2022, 2:39 p.m. UTC | #11
On Thursday, September 15, 2022 9:55 PM Liang, Kan wrote:
> On 2022-09-14 10:46 p.m., Wang, Wei W wrote:
> > On Thursday, September 15, 2022 4:26 AM, Liang, Kan wrote:
> >> The perf_event_disable() eventually invokes the intel_pt_stop().
> >> We already expose the intel_pt_stop()/cpu_emergency_stop_pt() to
> >> other modules. I don't think we have to use the perf_event_disable().
> >> Also, the
> >> perf_event_disable() requires extra codes.
> >>
> >> I went through the discussions. I agree with Sean's suggestion.
> >> We should only put the logic in the KVM but all the MSR access
> >> details into the PT driver.
> >
> > Even the driver itself doesn’t drive the save/restore of the MSRs, it is drived
> by perf.
> 
> It goes through perf_event, but is not driven by perf_event. The perf_event
> generic code never knows when it should invoke each driver to save/restore
> information. That should be driven by another subsystem, e.g., the scheduler.

Yes. The cpu scheduler does this via the perf subsystem, though.

> 
> For this case, KVM should drive the save/restore, and the PT driver eventually
> does all the MSR access details.
> 
> > 1. If we make KVM a user of perf, we should do this via
> perf_event_disable/enable_*.
> > 2. If we make KVM an alternative to perf (i.e. have direct control
> > over PMU HW), we can do this via driver interfaces like perf.
> > Per my experience, we should go for 1. Probably need Peter's opinions on
> this.
> >
> 
> For 1, the perf_event_disable/enable_* are not enough. They don't
> save/restore MSRs. 

perf_event_disable will go through perf to call pt_event_stop, which saves the related MSRs, right?
(if so, what large changes did you mean?)

> If we go this way, we have to introduce a new generic
> interface to ask each driver to save/restore their MSRs when the guest is
> entering/exiting. We'd better combine the new interface with the existing
> perf_guest_get_msrs() of the core driver.
> I think that's an ideal solution, but it requires big changes in the code.
> 
> 2 is the current KVM implementation. See pt_save_msr()/pt_load_msr(). I don't
> think it's the right way. We'd better fix it.
> 
> My suggestion would be 3: KVM notifies the PT driver via the interface
> provided by PT, and the PT driver saves/restores all the registers.
> I think it's an acceptable solution with small code changes.

This looks like we just relocate the save/restore functions to the PT driver
and KVM still directly calls them - still not going through perf's management.
Imagine every user operating on the PMU h/w directly like this; things would
be a mess.

Thanks,
Wei
Liang, Kan Sept. 15, 2022, 3:42 p.m. UTC | #12
On 2022-09-15 10:39 a.m., Wang, Wei W wrote:
> On Thursday, September 15, 2022 9:55 PM Liang, Kan wrote:
>> On 2022-09-14 10:46 p.m., Wang, Wei W wrote:
>>> On Thursday, September 15, 2022 4:26 AM, Liang, Kan wrote:
>>>> The perf_event_disable() eventually invokes the intel_pt_stop().
>>>> We already expose the intel_pt_stop()/cpu_emergency_stop_pt() to
>>>> other modules. I don't think we have to use the perf_event_disable().
>>>> Also, the
>>>> perf_event_disable() requires extra codes.
>>>>
>>>> I went through the discussions. I agree with Sean's suggestion.
>>>> We should only put the logic in the KVM but all the MSR access
>>>> details into the PT driver.
>>>
>>> Even the driver itself doesn’t drive the save/restore of the MSRs, it is drived
>> by perf.
>>
>> It through perf_event, not driven by perf_event. The perf_event generic code
>> never knows when should invokes each driver to save/restore information. It
>> should be driven by the other subsystem e.g., scheduler.
> 
> Yes. The cpu scheduler does this via the perf subsystem, though.
> 
>>
>> For this case, KVM should drive the save/restore, and the PT driver eventually
>> does all the MSR access details.
>>
>>> 1. If we make KVM a user of perf, we should do this via
>> perf_event_disable/enable_*.
>>> 2. If we make KVM an alternative to perf (i.e. have direct control
>>> over PMU HW), we can do this via driver interfaces like perf.
>>> Per my experience, we should go for 1. Probably need Peter's opinions on
>> this.
>>>
>>
>> For 1, the perf_event_disable/enable_* are not enough. They don't
>> save/restore MSRs. 
> 
> perf_event_disable will go through perf to call pt_event_stop which saves the related MSRs, right?

I don't think so. The pt_event_stop() doesn't save all the
MSR_IA32_RTIT_* MSRs.

> (if so, what large changes did you mean?)
> 
>> If we go to this way, we have to introduce a new generic
>> interface to ask each driver to save/restore their MSRs when the guest is
>> entering/exiting. We'd better combine the new interface with the existing
>> perf_guest_get_msrs() of the core driver.
>> I think that's an ideal solution, but requires big changes in the code.
>>
>> 2 is the current KVM implementation. See pt_save_msr()/pt_load_msr(). I don't
>> think it's a right way. We'd better fix it.
>>
>> The suggestion should be 3. The KVM notify the PT driver via the interface
>> provided by PT. The PT driver save/restore all the registers.
>> I think it's an acceptable solution with small code changes.
> 
> This looks like we just relocate the save/restore functions to the PT driver
> and KVM still directly calls them - still not going through perf's management.
> Imagine every user operating on the PMU h/w directly like this; things would
> be a mess.
> 


The pt_event_stop() and the proposed interface still manipulate the PT
event pt->handle.event. The event status is updated as well, so it's
still under the control of perf_event, while the current KVM
implementation implicitly updates the MSRs without updating the event
status.

Also, KVM doesn't know PT as well as the PT driver does. It's better to
let the dedicated driver maintain the details. Otherwise, if we add more
MSRs later, we have to maintain both KVM and PT.
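That inconsistency can be shown with a toy model. PERF_HES_STOPPED and the field names mirror the kernel's; everything else below is a simplified assumption for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define RTIT_CTL_TRACEEN  (1ULL << 0)
#define PERF_HES_STOPPED  0x01		/* hw.state flag, as in the kernel */

struct pt_event_model {
	uint64_t rtit_ctl;	/* stands in for MSR_IA32_RTIT_CTL */
	int hw_state;		/* stands in for event->hw.state */
};

/* Driver-owned stop: the hardware and perf's view stay in sync. */
void model_pt_event_stop(struct pt_event_model *ev)
{
	ev->rtit_ctl &= ~RTIT_CTL_TRACEEN;
	ev->hw_state |= PERF_HES_STOPPED;
}

/*
 * The current KVM behavior: clear the MSR behind perf's back, leaving
 * perf convinced that the event is still running.
 */
void model_kvm_wrmsr_zero(struct pt_event_model *ev)
{
	ev->rtit_ctl = 0;
	/* ev->hw_state deliberately untouched: this is the inconsistency */
}
```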

Thanks,
Kan
Wang, Wei W Sept. 16, 2022, 2:30 a.m. UTC | #13
On Thursday, September 15, 2022 11:43 PM, Liang, Kan wrote:
> On 2022-09-15 10:39 a.m., Wang, Wei W wrote:
> > On Thursday, September 15, 2022 9:55 PM Liang, Kan wrote:
> >> On 2022-09-14 10:46 p.m., Wang, Wei W wrote:
> >>> On Thursday, September 15, 2022 4:26 AM, Liang, Kan wrote:
> >>>> The perf_event_disable() eventually invokes the intel_pt_stop().
> >>>> We already expose the intel_pt_stop()/cpu_emergency_stop_pt() to
> >>>> other modules. I don't think we have to use the perf_event_disable().
> >>>> Also, the
> >>>> perf_event_disable() requires extra codes.
> >>>>
> >>>> I went through the discussions. I agree with Sean's suggestion.
> >>>> We should only put the logic in the KVM but all the MSR access
> >>>> details into the PT driver.
> >>>
> >>> Even the driver itself doesn’t drive the save/restore of the MSRs,
> >>> it is drived
> >> by perf.
> >>
> >> It through perf_event, not driven by perf_event. The perf_event
> >> generic code never knows when should invokes each driver to
> >> save/restore information. It should be driven by the other subsystem e.g.,
> scheduler.
> >
> > Yes. The cpu scheduler does this via the perf subsystem, though.
> >
> >>
> >> For this case, KVM should drive the save/restore, and the PT driver
> >> eventually does all the MSR access details.
> >>
> >>> 1. If we make KVM a user of perf, we should do this via
> >> perf_event_disable/enable_*.
> >>> 2. If we make KVM an alternative to perf (i.e. have direct control
> >>> over PMU HW), we can do this via driver interfaces like perf.
> >>> Per my experience, we should go for 1. Probably need Peter's
> >>> opinions on
> >> this.
> >>>
> >>
> >> For 1, the perf_event_disable/enable_* are not enough. They don't
> >> save/restore MSRs.
> >
> > perf_event_disable will go through perf to call pt_event_stop which saves
> the related MSRs, right?
> 
> I don't think so. The pt_event_stop() doesn't save all the
> MSR_IA32_RTIT_* MSRs.

Not all the MSRs are required to be saved. In general, pt_event_stop should have
saved all the MSRs required for event switching; otherwise the general usages
of PT would already be broken. To be more precise, the following MSRs are not
saved by pt_event_stop, but I don't see that they are required to be saved:

- MSR_IA32_RTIT_CR3_MATCH: I don't see it being used by perf.
It seems KVM saves an MSR that's not even used by the host.

- Address range MSRs (MSR_IA32_RTIT_ADDR0_A etc.): Those are provided by s/w and not updated by h/w.
So they're just written to the MSRs when the event gets scheduled in. There is no need to save them.
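That split can be summarized in one place. The MSR numbers below are the kernel's (msr-index.h); which registers count as hardware-modified versus software-owned is this thread's claim, sketched here for illustration only:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* MSR numbers from arch/x86/include/asm/msr-index.h. */
#define MSR_IA32_RTIT_OUTPUT_BASE	0x00000560
#define MSR_IA32_RTIT_OUTPUT_MASK	0x00000561
#define MSR_IA32_RTIT_CTL		0x00000570
#define MSR_IA32_RTIT_STATUS		0x00000571
#define MSR_IA32_RTIT_CR3_MATCH		0x00000572
#define MSR_IA32_RTIT_ADDR0_A		0x00000580

/*
 * Only state the hardware itself advances needs a read-back on save;
 * software-owned configuration can simply be rewritten on the next
 * schedule-in. This classification follows the argument above and is
 * not an authoritative list.
 */
bool rtit_msr_hw_modified(uint32_t msr)
{
	switch (msr) {
	case MSR_IA32_RTIT_CTL:		/* TraceEn may be toggled by the PMI handler */
	case MSR_IA32_RTIT_STATUS:	/* error/stop bits set by hardware */
	case MSR_IA32_RTIT_OUTPUT_BASE:	/* advanced as the ToPA output fills */
	case MSR_IA32_RTIT_OUTPUT_MASK:	/* table/output offsets, likewise */
		return true;
	default:			/* CR3_MATCH, ADDRn_A/B: s/w-owned */
		return false;
	}
}
```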

> 
> > (if so, what large changes did you mean?)
> >
> >> If we go to this way, we have to introduce a new generic interface to
> >> ask each driver to save/restore their MSRs when the guest is
> >> entering/exiting. We'd better combine the new interface with the
> >> existing
> >> perf_guest_get_msrs() of the core driver.
> >> I think that's an ideal solution, but requires big changes in the code.
> >>
> >> 2 is the current KVM implementation. See pt_save_msr()/pt_load_msr().
> >> I don't think it's a right way. We'd better fix it.
> >>
> >> The suggestion should be 3. The KVM notify the PT driver via the
> >> interface provided by PT. The PT driver save/restore all the registers.
> >> I think it's an acceptable solution with small code changes.
> >
> > This looks like we just relocate the save/restore functions to the PT driver
> and KVM still directly call them - still not going through perf's management.
> Imagine every user operates on the pmu h/w directly like this, things would be
> a mess.
> >
> 
> 
> The pt_event_stop() and the proposed interface still manipulate the PT event
> pt->handle.event. The event status is updated as well. It's still under control of
> the perf_event.

Did you mean to handle the PT event in the proposed driver API? Event status is just
one of the things. There are other things if we want to make it complete for this,
e.g. event->oncpu = -1, and eventually it seems we will re-implement perf_event_disable_*.

Btw, Xiaoyao has made it work with perf_event_disable_local, and it doesn't require that many changes.
If necessary, we can post the 2nd version out for double checking.

Thanks,
Wei
Liang, Kan Sept. 16, 2022, 1:27 p.m. UTC | #14
On 2022-09-15 10:30 p.m., Wang, Wei W wrote:
> On Thursday, September 15, 2022 11:43 PM, Liang, Kan wrote:
>> On 2022-09-15 10:39 a.m., Wang, Wei W wrote:
>>> On Thursday, September 15, 2022 9:55 PM Liang, Kan wrote:
>>>> On 2022-09-14 10:46 p.m., Wang, Wei W wrote:
>>>>> On Thursday, September 15, 2022 4:26 AM, Liang, Kan wrote:
>>>>>> The perf_event_disable() eventually invokes the intel_pt_stop().
>>>>>> We already expose the intel_pt_stop()/cpu_emergency_stop_pt() to
>>>>>> other modules. I don't think we have to use the perf_event_disable().
>>>>>> Also, the
>>>>>> perf_event_disable() requires extra codes.
>>>>>>
>>>>>> I went through the discussions. I agree with Sean's suggestion.
>>>>>> We should only put the logic in the KVM but all the MSR access
>>>>>> details into the PT driver.
>>>>>
>>>>> Even the driver itself doesn’t drive the save/restore of the MSRs,
>>>>> it is drived
>>>> by perf.
>>>>
>>>> It through perf_event, not driven by perf_event. The perf_event
>>>> generic code never knows when should invokes each driver to
>>>> save/restore information. It should be driven by the other subsystem e.g.,
>> scheduler.
>>>
>>> Yes. The cpu scheduler does this via the perf subsystem, though.
>>>
>>>>
>>>> For this case, KVM should drive the save/restore, and the PT driver
>>>> eventually does all the MSR access details.
>>>>
>>>>> 1. If we make KVM a user of perf, we should do this via
>>>> perf_event_disable/enable_*.
>>>>> 2. If we make KVM an alternative to perf (i.e. have direct control
>>>>> over PMU HW), we can do this via driver interfaces like perf.
>>>>> Per my experience, we should go for 1. Probably need Peter's
>>>>> opinions on
>>>> this.
>>>>>
>>>>
>>>> For 1, the perf_event_disable/enable_* are not enough. They don't
>>>> save/restore MSRs.
>>>
>>> perf_event_disable will go through perf to call pt_event_stop which saves
>> the related MSRs, right?
>>
>> I don't think so. The pt_event_stop() doesn't save all the
>> MSR_IA32_RTIT_* MSRs.
> 
> Not all the MSRs are required to be saved. In general, pt_event_stop should have
> saved all the MSRs required for event switching; otherwise the general usages
> of PT would already be broken. To be more precise, the following MSRs are not
> saved by pt_event_stop, but I don't see that they are required to be saved:
> 
> - MSR_IA32_RTIT_CR3_MATCH: I don't see it being used by perf.
> It seems KVM saves an MSR that's not even used by the host.
> 
> - Address range MSRs (MSR_IA32_RTIT_ADDR0_A etc.): Those are provided by s/w and not updated by h/w.
> So they're just set to MSRs when event gets scheduled in. There is no need to save. 
>

OK. I think you need a clean-up patch to fix them first.


>>
>>> (if so, what large changes did you mean?)
>>>
>>>> If we go to this way, we have to introduce a new generic interface to
>>>> ask each driver to save/restore their MSRs when the guest is
>>>> entering/exiting. We'd better combine the new interface with the
>>>> existing
>>>> perf_guest_get_msrs() of the core driver.
>>>> I think that's an ideal solution, but requires big changes in the code.
>>>>
>>>> 2 is the current KVM implementation. See pt_save_msr()/pt_load_msr().
>>>> I don't think it's a right way. We'd better fix it.
>>>>
>>>> The suggestion should be 3: KVM notifies the PT driver via the
>>>> interface provided by PT, and the PT driver saves/restores all the registers.
>>>> I think it's an acceptable solution with small code changes.
>>>
>>> This looks like we just relocate the save/restore functions to the PT driver
>> and KVM still directly calls them - still not going through perf's management.
>> Imagine if every user operated on the PMU h/w directly like this; things would be
>> a mess.
>>>
>>
>>
>> The pt_event_stop() and the proposed interface still manipulate the PT event
>> pt->handle.event. The event status is updated as well, so it's still under the
>> control of perf_event.
> 
> Did you mean to handle the PT event in the proposed driver API? Event status is just
> one of the things. There are other things if we want to make this complete,
> e.g. event->oncpu = -1, and eventually it seems we would re-implement perf_event_disable_*().
>

As I understand it, perf always checks the status first. If it's a stopped
or inactive event, I don't think event->oncpu will be touched. That's
why I think the proposed driver API should be acceptable.

> Btw, Xiaoyao has made it work with perf_event_disable_local(), and it doesn't require that many changes.
> If necessary, we can post the 2nd version out to double-check.
>

I'm not worried about which way (either perf_event_disable_local() or the
proposed PT driver API) is chosen to stop PT. If the existing
perf_event interfaces can meet your requirement, that's perfect.

My real concern is pt_save_msr()/pt_load_msr(). I don't think that's a
job for KVM. See atomic_switch_perf_msrs(): it is the perf core driver,
rather than KVM, that tells which MSRs should be saved/restored in the VMCS.
We should do the same thing for PT. (Actually, I think we have already
encountered issues with the current KVM-dominated method: KVM
saves/restores unnecessary MSRs, right?)
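As a concrete illustration of that contract, here is a small userspace sketch (not kernel code). The struct mirrors the kernel's struct perf_guest_switch_msr, but the function bodies, the MSR number, and the host/guest values below are made-up placeholders: the point is only that the driver owns the MSR list and KVM merely consumes it.

```c
#include <assert.h>

/* Simplified mirror of the kernel's struct perf_guest_switch_msr. */
struct perf_guest_switch_msr {
	unsigned int msr;
	unsigned long long host, guest;
};

/* Hypothetical driver side: the PMU driver, not KVM, decides which
 * MSRs must be switched around VM entry/exit and with what values. */
static struct perf_guest_switch_msr *fake_perf_guest_get_msrs(int *nr)
{
	static struct perf_guest_switch_msr msrs[] = {
		/* Placeholder MSR number and values, for illustration only. */
		{ 0x38f /* "global ctrl" */, 0xful /* host */, 0x0ul /* guest */ },
	};
	*nr = (int)(sizeof(msrs) / sizeof(msrs[0]));
	return msrs;
}

/* Hypothetical KVM side, modeled on atomic_switch_perf_msrs(): it only
 * copies out whatever the driver reported (in the kernel this would be
 * programmed into the VMCS atomic-switch MSR lists), without knowing
 * any MSR details itself. */
static int fake_atomic_switch_perf_msrs(struct perf_guest_switch_msr *out,
					int cap)
{
	int nr;
	struct perf_guest_switch_msr *msrs = fake_perf_guest_get_msrs(&nr);

	for (int i = 0; i < nr && i < cap; i++)
		out[i] = msrs[i];
	return nr < cap ? nr : cap;
}
```

The same shape would let the PT driver report its RTIT MSRs through the identical interface, which is the unification being argued for here.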

To do so, I think there may be two ways:
- Since MSRs have to be switched for both the PT and core drivers, it sounds
reasonable to provide a new generic interface in perf_event. The new
interface would tell KVM which MSRs should be saved/restored, and KVM
could then decide to save/restore them via the VMCS or by direct MSR access.
I suspect this way requires a big change, but it would benefit all the
drivers that have similar requirements.
- The proposed driver API. The MSRs are saved/restored in the PT driver.


Thanks,
Kan
Wang, Wei W Sept. 19, 2022, 1:46 p.m. UTC | #15
On Friday, September 16, 2022 9:27 PM, Liang, Kan wrote:
> > Did you mean to handle the PT event in the proposed driver API? Event
> > status is just one of the things. There are other things if we want to
> > make this complete, e.g. event->oncpu = -1, and eventually it seems we would
> re-implement perf_event_disable_*().
> >
> 
> As I understand it, perf always checks the status first. If it's a stopped or
> inactive event, I don't think event->oncpu will be touched. That's why I think
> the proposed driver API should be acceptable.

That's an implementation detail. We need to make it architecturally clean, though.

> 
> > Btw, Xiaoyao has made it work with perf_event_disable_local(), and it doesn't require
> that many changes.
> > If necessary, we can post the 2nd version out to double check.
> >
> 
> I'm not worried about which way (either perf_event_disable_local() or the
> proposed PT driver API) is chosen to stop PT. If the existing perf_event
> interfaces can meet your requirement, that's perfect.
> 
> My real concern is pt_save_msr()/pt_load_msr(). I don't think that's a job for
> KVM. See atomic_switch_perf_msrs(): it is the perf core driver rather than KVM
> that tells which MSRs should be saved/restored in the VMCS.
> We should do the same thing for PT. (Actually, I think we have already encountered
> issues with the current KVM-dominated method: KVM saves/restores
> unnecessary MSRs, right?)
>

Right. Improving the current PT virtualization is on my plan, and it is
planned as the next step after this fix. The general rule is the same: make KVM a user
of perf, that is, leave the save/restore work to be done completely by the
perf (driver) side, so that we eventually remove the KVM-side pt_save_msr()/pt_load_msr().
To be more precise, it would work as below:
- we create a guest event, like what we did for LBR virtualization
- on VMEnter:
  -- perf_disable_event_local(host_event);
  -- perf_enable_event_local(guest_event);
- on VMExit:
  -- perf_disable_event_local(guest_event);
  -- perf_enable_event_local(host_event);
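The sequence above can be modeled with a toy state machine. The disable/enable names follow this proposal (the real kernel counterparts, such as perf_event_disable_local(), do far more); here each event is reduced to an "active" flag, so this shows only the intended ordering, not real perf behavior:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a perf event: only the "tracing active" state. */
struct toy_event {
	bool active;
};

static void toy_disable_local(struct toy_event *e) { e->active = false; }
static void toy_enable_local(struct toy_event *e)  { e->active = true; }

/* VM entry: the host PT event is stopped before the guest event runs,
 * so IA32_RTIT_CTL.TraceEn is 0 by the time the "load IA32_RTIT_CTL"
 * entry control takes effect (the SDM 26.2.1.1 requirement). */
static void vm_enter(struct toy_event *host, struct toy_event *guest)
{
	toy_disable_local(host);
	toy_enable_local(guest);
}

/* VM exit: the mirror image, restoring host tracing afterward. */
static void vm_exit(struct toy_event *host, struct toy_event *guest)
{
	toy_disable_local(guest);
	toy_enable_local(host);
}
```

Because the host event is disabled at the perf level rather than by a raw WRMSR, a PT PMI handler running in this window has no active host event to re-enable, which is exactly the race the cover letter describes.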

> To do so, I think there may be two ways.
> - Since MSRs have to be switched for both PT and core drivers, it sounds
> reasonable to provide a new generic interface in the perf_event. The new
> interface is to tell KVM which MSRs should be saved/restored. Then KVM can
> decide to save/restore via VMCS or direct MSR access. I suspect this way
> requires big change, but it will benefit all the drivers which have similar
> requirements.
> - The proposed driver API. The MSRs are saved/restored in the PT driver.

As shown above, no need for those. We can completely reuse the
perf side save/restore.

Thanks,
Wei
Liang, Kan Sept. 19, 2022, 2:41 p.m. UTC | #16
On 2022-09-19 9:46 a.m., Wang, Wei W wrote:
> On Friday, September 16, 2022 9:27 PM, Liang, Kan wrote:
>>> Did you mean to handle the PT event in the proposed driver API? Event
>>> status is just one of the things. There are other things if we want to
>>> make this complete, e.g. event->oncpu = -1, and eventually it seems we would
>> re-implement perf_event_disable_*().
>>>
>>
>> As I understand it, perf always checks the status first. If it's a stopped or
>> inactive event, I don't think event->oncpu will be touched. That's why I think
>> the proposed driver API should be acceptable.
> 
> That's the implementation thing. We need to make it architecturally clean though.
> 
>>
>>> Btw, Xiaoyao has made it work with perf_event_disable_local(), and it doesn't require
>> that many changes.
>>> If necessary, we can post the 2nd version out to double check.
>>>
>>
>> I'm not worried about which way (either perf_event_disable_local() or the
>> proposed PT driver API) is chosen to stop PT. If the existing perf_event
>> interfaces can meet your requirement, that's perfect.
>>
>> My real concern is pt_save_msr()/pt_load_msr(). I don't think that's a job for
>> KVM. See atomic_switch_perf_msrs(): it is the perf core driver rather than KVM
>> that tells which MSRs should be saved/restored in the VMCS.
>> We should do the same thing for PT. (Actually, I think we have already encountered
>> issues with the current KVM-dominated method: KVM saves/restores
>> unnecessary MSRs, right?)
>>
> 
> Right. Improving the current PT virtualization is on my plan, and it is
> planned as the next step after this fix. The general rule is the same: make KVM a user
> of perf, that is, leave the save/restore work to be done completely by the
> perf (driver) side, so that we eventually remove the KVM-side pt_save_msr()/pt_load_msr().
> To be more precise, it would work as below:
> - we create a guest event, like what we did for LBR virtualization

Another fake event? We would have to handle it specially in the perf code. I
don't think that's a clean way for perf.

> - on VMEnter:
>   -- perf_disable_event_local(host_event);
>   -- perf_enable_event_local(guest_event);
> - on VMExit:
>   -- perf_disable_event_local(guest_event);
>   -- perf_enable_event_local(host_event);

Why can't we use the same way as the perf core driver to switch the
MSRs via the VMCS?

You just need one generic function, perf_guest_get_msrs(), for both the PT
and core drivers. If you have to disable PT explicitly before VM entry, I
think you can do it in the PT-specific perf_guest_get_msrs().

Anyway, that's an improvement to the current code. I don't have a
problem if you prefer to separate the fix patch from the improvement patch.

Thanks,
Kan
> 
>> To do so, I think there may be two ways.
>> - Since MSRs have to be switched for both PT and core drivers, it sounds
>> reasonable to provide a new generic interface in the perf_event. The new
>> interface is to tell KVM which MSRs should be saved/restored. Then KVM can
>> decide to save/restore via VMCS or direct MSR access. I suspect this way
>> requires big change, but it will benefit all the drivers which have similar
>> requirements.
>> - The proposed driver API. The MSRs are saved/restored in the PT driver.
> 
> As shown above, no need for those. We can completely reuse the
> perf side save/restore.
> 
> Thanks,
> Wei
Wang, Wei W Sept. 19, 2022, 3:22 p.m. UTC | #17
On Monday, September 19, 2022 10:41 PM, Liang, Kan wrote:
> Another fake event? We have to specially handle it in the perf code. I don't
> think it's a clean way for perf.

We can check the patch later. I think it should be clean, like the LBR side.

> 
> > - on VMEnter:
> >   -- perf_disable_event_local(host_event);
> >   -- perf_enable_event_local(guest_event);
> > - on VMExit:
> >   -- perf_disable_event_local(guest_event);
> >   -- perf_enable_event_local(host_event);
> 
>> Why can't we use the same way as the perf core driver to switch the MSRs via
> the VMCS?

The current VMCS MSR-switching list isn’t fast;
it should be the last resort, used only when really necessary.

> 
> You just need one generic function, perf_guest_get_msrs(), for both PT and
> core driver. If you have to disable PT explicitly before VMCS, I think you can do
> it in the PT specific perf_guest_get_msrs().

The disable is done via the "clear IA32_RTIT_CTL" VM-exit control.
It ensures PT is disabled in time on VM exit, so there is no need to go through perf_guest_get_msrs().

Thanks,
Wei
Liang, Kan Sept. 19, 2022, 3:55 p.m. UTC | #18
On 2022-09-19 11:22 a.m., Wang, Wei W wrote:
> On Monday, September 19, 2022 10:41 PM, Liang, Kan wrote:
>> Another fake event? We have to specially handle it in the perf code. I don't
>> think it's a clean way for perf.
> 
> We can check the patch later. I think it should be clean, like the LBR side.

I doubt it. Perf already handles the fake LBR event specially in several
places, from the core code to the LBR code. It also occupies a reserved
bit. If there is another choice, I don't think we want to go that way.

> 
>>
>>> - on VMEnter:
>>>   -- perf_disable_event_local(host_event);
>>>   -- perf_enable_event_local(guest_event);
>>> - on VMExit:
>>>   -- perf_disable_event_local(guest_event);
>>>   -- perf_enable_event_local(host_event);
>>
>> Why can't we use the same way as the perf core driver to switch the MSRs via
>> the VMCS?
> 
> The current VMCS MSR-switching list isn’t fast;
> it should be the last resort, used only when really necessary.
> 

It's a documented way in the SDM. I believe there must be some reason
Intel introduced it. Since it's a documented (or recommended) way, I
think we'd better use it if possible.

Since both the PT and the core drivers need to switch MSRs across VM
entry/exit, it's better to use the same way (function) to handle them.

Thanks,
Kan

>>
>> You just need one generic function, perf_guest_get_msrs(), for both the PT and
>> core drivers. If you have to disable PT explicitly before VM entry, I think you can do
>> it in the PT-specific perf_guest_get_msrs().
> 
> The disable is done via the "clear IA32_RTIT_CTL" VM-exit control.
> It ensures PT is disabled in time on VM exit, so there is no need to go through perf_guest_get_msrs().
>