diff mbox series

[v2] KVM: x86/intr: Explicitly check NMI from guest to eliminate false positives

Message ID 20231206032054.55070-1-likexu@tencent.com (mailing list archive)
State New, archived
Headers show
Series [v2] KVM: x86/intr: Explicitly check NMI from guest to eliminate false positives | expand

Commit Message

Like Xu Dec. 6, 2023, 3:20 a.m. UTC
From: Like Xu <likexu@tencent.com>

Explicitly checking the source of external interrupt is indeed NMI and not
other types in the kvm_arch_pmi_in_guest(), which prevents perf-kvm false
positive samples generated in perf/core NMI mode after vm-exit but before
kvm_before_interrupt() from being incorrectly labelled as guest samples:

# test: perf-record + cpu-cycles:HP (which collects host-only precise samples)
# Symbol                                   Overhead       sys       usr  guest sys  guest usr
# .......................................  ........  ........  ........  .........  .........
#
# Before:
  [g] entry_SYSCALL_64                       24.63%     0.00%     0.00%     24.63%      0.00%
  [g] syscall_return_via_sysret              23.23%     0.00%     0.00%     23.23%      0.00%
  [g] files_lookup_fd_raw                     6.35%     0.00%     0.00%      6.35%      0.00%
# After:
  [k] perf_adjust_freq_unthr_context         57.23%    57.23%     0.00%      0.00%      0.00%
  [k] __vmx_vcpu_run                          4.09%     4.09%     0.00%      0.00%      0.00%
  [k] vmx_update_host_rsp                     3.17%     3.17%     0.00%      0.00%      0.00%

In the above case, perf records the samples labelled '[g]', the RIPs behind
the weird samples are actually being queried by perf_instruction_pointer()
after determining whether it's in GUEST state or not, and here's the issue:

If vm-exit is caused by a non-NMI interrupt (such as hrtimer_interrupt) and
at least one PMU counter is enabled on host, the kvm_arch_pmi_in_guest()
will remain true (KVM_HANDLING_IRQ is set) until kvm_before_interrupt().

During this window, if a PMI occurs on host (since the KVM instructions on
host are being executed), the control flow, with the help of the host NMI
context, will be transferred to perf/core to generate performance samples,
thus perf_instruction_pointer() and perf_guest_get_ip() is called.

Since kvm_arch_pmi_in_guest() only checks if there is an interrupt, it may
cause perf/core to mistakenly assume that the source RIP of the host NMI
belongs to the guest world and use perf_guest_get_ip() to get the RIP of
a vCPU that has already exited by a non-NMI interrupt.

Error samples are recorded and presented to the end-user via perf-report.
Such false positive samples could be eliminated by explicitly determining
if the exit reason is KVM_HANDLING_NMI.

Note that when vm-exit is indeed triggered by PMI and before HANDLING_NMI
is cleared, it's also still possible that another PMI is generated on host.
Also for perf/core timer mode, the false positives are still possible since
that non-NMI sources of interrupts are not always being used by perf/core.
In both cases above, perf/core should correctly distinguish between real
RIP sources or even need to generate two samples, belonging to host and
guest separately, but that's perf/core's story for interested warriors.

Fixes: dd60d217062f ("KVM: x86: Fix perf timer mode IP reporting")
Signed-off-by: Like Xu <likexu@tencent.com>
---
V1 -> V2 Changelog:
- Refine commit message to cover both perf/core timer and NMI modes;
- Use in_nmi() to distinguish whether it's NMI mode or not; (Sean)
V1: https://lore.kernel.org/kvm/20231204074535.9567-1-likexu@tencent.com/
 arch/x86/include/asm/kvm_host.h | 10 +++++++++-
 arch/x86/kvm/x86.h              |  6 ------
 2 files changed, 9 insertions(+), 7 deletions(-)


base-commit: 1ab097653e4dd8d23272d028a61352c23486fd4a

Comments

Sean Christopherson Dec. 6, 2023, 3:30 p.m. UTC | #1
Please don't make up random prefixes.  This should really be "x86/pmu".

From Documentation/process/maintainer-kvm-x86.rst:

Shortlog
~~~~~~~~
The preferred prefix format is ``KVM: <topic>:``, where ``<topic>`` is one of::

  - x86
  - x86/mmu
  - x86/pmu
  - x86/xen
  - selftests
  - SVM
  - nSVM
  - VMX
  - nVMX


...

New topics do occasionally pop up, but please start an on-list discussion if
you want to propose introducing a new topic, i.e. don't go rogue.
Like Xu Dec. 7, 2023, 2:12 a.m. UTC | #2
On 6/12/2023 11:30 pm, Sean Christopherson wrote:
> Please don't make up random prefixes.  This should really be "x86/pmu".

Thanks.

I'm hesitant to categorize about NMI handling into kvm/pmu scope. But you
have clear idea on it and it's fine to me. Please feel free to refine commit
message to fit your understanding and taste as you did.

> 
>  From Documentation/process/maintainer-kvm-x86.rst:
> 
> Shortlog
> ~~~~~~~~
> The preferred prefix format is ``KVM: <topic>:``, where ``<topic>`` is one of::
> 
>    - x86
>    - x86/mmu
>    - x86/pmu
>    - x86/xen
>    - selftests
>    - SVM
>    - nSVM
>    - VMX
>    - nVMX
> 
> 
> ...
> 
> New topics do occasionally pop up, but please start an on-list discussion if
> you want to propose introducing a new topic, i.e. don't go rogue.
Maxim Levitsky Dec. 7, 2023, 3:32 p.m. UTC | #3
On Wed, 2023-12-06 at 11:20 +0800, Like Xu wrote:
> From: Like Xu <likexu@tencent.com>
> 
> Explicitly checking the source of external interrupt is indeed NMI and not
> other types in the kvm_arch_pmi_in_guest(), which prevents perf-kvm false
> positive samples generated in perf/core NMI mode after vm-exit but before
> kvm_before_interrupt() from being incorrectly labelled as guest samples:
> 
> # test: perf-record + cpu-cycles:HP (which collects host-only precise samples)
> # Symbol                                   Overhead       sys       usr  guest sys  guest usr
> # .......................................  ........  ........  ........  .........  .........
> #
> # Before:
>   [g] entry_SYSCALL_64                       24.63%     0.00%     0.00%     24.63%      0.00%
>   [g] syscall_return_via_sysret              23.23%     0.00%     0.00%     23.23%      0.00%
>   [g] files_lookup_fd_raw                     6.35%     0.00%     0.00%      6.35%      0.00%
> # After:
>   [k] perf_adjust_freq_unthr_context         57.23%    57.23%     0.00%      0.00%      0.00%
>   [k] __vmx_vcpu_run                          4.09%     4.09%     0.00%      0.00%      0.00%
>   [k] vmx_update_host_rsp                     3.17%     3.17%     0.00%      0.00%      0.00%
> 
> In the above case, perf records the samples labelled '[g]', the RIPs behind
> the weird samples are actually being queried by perf_instruction_pointer()
> after determining whether it's in GUEST state or not, and here's the issue:
> 
> If vm-exit is caused by a non-NMI interrupt (such as hrtimer_interrupt) and
> at least one PMU counter is enabled on host, the kvm_arch_pmi_in_guest()
> will remain true (KVM_HANDLING_IRQ is set) until kvm_before_interrupt().
> 
> During this window, if a PMI occurs on host (since the KVM instructions on
> host are being executed), the control flow, with the help of the host NMI
> context, will be transferred to perf/core to generate performance samples,
> thus perf_instruction_pointer() and perf_guest_get_ip() is called.
> 
> Since kvm_arch_pmi_in_guest() only checks if there is an interrupt, it may
> cause perf/core to mistakenly assume that the source RIP of the host NMI
> belongs to the guest world and use perf_guest_get_ip() to get the RIP of
> a vCPU that has already exited by a non-NMI interrupt.
> 
> Error samples are recorded and presented to the end-user via perf-report.
> Such false positive samples could be eliminated by explicitly determining
> if the exit reason is KVM_HANDLING_NMI.
> 
> Note that when vm-exit is indeed triggered by PMI and before HANDLING_NMI
> is cleared, it's also still possible that another PMI is generated on host.
> Also for perf/core timer mode, the false positives are still possible since
> that non-NMI sources of interrupts are not always being used by perf/core.
> In both cases above, perf/core should correctly distinguish between real
> RIP sources or even need to generate two samples, belonging to host and
> guest separately, but that's perf/core's story for interested warriors.
> 
> Fixes: dd60d217062f ("KVM: x86: Fix perf timer mode IP reporting")
> Signed-off-by: Like Xu <likexu@tencent.com>
> ---
> V1 -> V2 Changelog:
> - Refine commit message to cover both perf/core timer and NMI modes;
> - Use in_nmi() to distinguish whether it's NMI mode or not; (Sean)
> V1: https://lore.kernel.org/kvm/20231204074535.9567-1-likexu@tencent.com/
>  arch/x86/include/asm/kvm_host.h | 10 +++++++++-
>  arch/x86/kvm/x86.h              |  6 ------
>  2 files changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index c8c7e2475a18..167d592e08d0 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1868,8 +1868,16 @@ static inline int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn,
>  }
>  #endif /* CONFIG_HYPERV */
>  
> +enum kvm_intr_type {
> +	/* Values are arbitrary, but must be non-zero. */
> +	KVM_HANDLING_IRQ = 1,
> +	KVM_HANDLING_NMI,
> +};
> +
> +/* Enable perf NMI and timer modes to work, and minimise false positives. */
>  #define kvm_arch_pmi_in_guest(vcpu) \
> -	((vcpu) && (vcpu)->arch.handling_intr_from_guest)
> +	((vcpu) && (vcpu)->arch.handling_intr_from_guest && \
> +	 (in_nmi() == ((vcpu)->arch.handling_intr_from_guest == KVM_HANDLING_NMI)))
>  
>  void __init kvm_mmu_x86_module_init(void);
>  int kvm_mmu_vendor_module_init(void);
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 2f7e19166658..4dc38092d599 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -431,12 +431,6 @@ static inline bool kvm_notify_vmexit_enabled(struct kvm *kvm)
>  	return kvm->arch.notify_vmexit_flags & KVM_X86_NOTIFY_VMEXIT_ENABLED;
>  }
>  
> -enum kvm_intr_type {
> -	/* Values are arbitrary, but must be non-zero. */
> -	KVM_HANDLING_IRQ = 1,
> -	KVM_HANDLING_NMI,
> -};
> -
>  static __always_inline void kvm_before_interrupt(struct kvm_vcpu *vcpu,
>  						 enum kvm_intr_type intr)
>  {

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky


> 
> base-commit: 1ab097653e4dd8d23272d028a61352c23486fd4a
Sean Christopherson Dec. 7, 2023, 4:31 p.m. UTC | #4
On Thu, Dec 07, 2023, Like Xu wrote:
> On 6/12/2023 11:30 pm, Sean Christopherson wrote:
> > Please don't make up random prefixes.  This should really be "x86/pmu".
> 
> Thanks.
> 
> I'm hesitant to categorize about NMI handling into kvm/pmu scope.

Why?  Literally the only thing this can affect is PMU behavior.  Even if there's
a bug that affects the kernel's PMU, that's still x86/pmu as far as KVM is
concerned.
Dongli Zhang Dec. 13, 2023, 7:28 a.m. UTC | #5
Hi Like,

On 12/5/23 19:20, Like Xu wrote:
> From: Like Xu <likexu@tencent.com>
> 
> Explicitly checking the source of external interrupt is indeed NMI and not
> other types in the kvm_arch_pmi_in_guest(), which prevents perf-kvm false
> positive samples generated in perf/core NMI mode after vm-exit but before
> kvm_before_interrupt() from being incorrectly labelled as guest samples:

About the before kvm_before_interrupt() ...

> 
> # test: perf-record + cpu-cycles:HP (which collects host-only precise samples)
> # Symbol                                   Overhead       sys       usr  guest sys  guest usr
> # .......................................  ........  ........  ........  .........  .........
> #
> # Before:
>   [g] entry_SYSCALL_64                       24.63%     0.00%     0.00%     24.63%      0.00%
>   [g] syscall_return_via_sysret              23.23%     0.00%     0.00%     23.23%      0.00%
>   [g] files_lookup_fd_raw                     6.35%     0.00%     0.00%      6.35%      0.00%
> # After:
>   [k] perf_adjust_freq_unthr_context         57.23%    57.23%     0.00%      0.00%      0.00%
>   [k] __vmx_vcpu_run                          4.09%     4.09%     0.00%      0.00%      0.00%
>   [k] vmx_update_host_rsp                     3.17%     3.17%     0.00%      0.00%      0.00%
> 
> In the above case, perf records the samples labelled '[g]', the RIPs behind
> the weird samples are actually being queried by perf_instruction_pointer()
> after determining whether it's in GUEST state or not, and here's the issue:
> 
> If vm-exit is caused by a non-NMI interrupt (such as hrtimer_interrupt) and
> at least one PMU counter is enabled on host, the kvm_arch_pmi_in_guest()
> will remain true (KVM_HANDLING_IRQ is set) until kvm_before_interrupt().

... and here.

Would you mind helping why kvm_arch_pmi_in_guest() remains true before
*kvm_before_interrupt()*.

According to the source code, the vcpu->arch.handling_intr_from_guest
is set to non-zero only at kvm_before_interrupt(), and cleared at
kvm_after_interrupt().

Or would you mean kvm_after_interrupt()?

Thank you very much!

Dongli Zhang

> 
> During this window, if a PMI occurs on host (since the KVM instructions on
> host are being executed), the control flow, with the help of the host NMI
> context, will be transferred to perf/core to generate performance samples,
> thus perf_instruction_pointer() and perf_guest_get_ip() is called.
> 
> Since kvm_arch_pmi_in_guest() only checks if there is an interrupt, it may
> cause perf/core to mistakenly assume that the source RIP of the host NMI
> belongs to the guest world and use perf_guest_get_ip() to get the RIP of
> a vCPU that has already exited by a non-NMI interrupt.
> 
> Error samples are recorded and presented to the end-user via perf-report.
> Such false positive samples could be eliminated by explicitly determining
> if the exit reason is KVM_HANDLING_NMI.
> 
> Note that when vm-exit is indeed triggered by PMI and before HANDLING_NMI
> is cleared, it's also still possible that another PMI is generated on host.
> Also for perf/core timer mode, the false positives are still possible since
> that non-NMI sources of interrupts are not always being used by perf/core.
> In both cases above, perf/core should correctly distinguish between real
> RIP sources or even need to generate two samples, belonging to host and
> guest separately, but that's perf/core's story for interested warriors.
> 
> Fixes: dd60d217062f ("KVM: x86: Fix perf timer mode IP reporting")
> Signed-off-by: Like Xu <likexu@tencent.com>
> ---
> V1 -> V2 Changelog:
> - Refine commit message to cover both perf/core timer and NMI modes;
> - Use in_nmi() to distinguish whether it's NMI mode or not; (Sean)
> V1: https://urldefense.com/v3/__https://lore.kernel.org/kvm/20231204074535.9567-1-likexu@tencent.com/__;!!ACWV5N9M2RV99hQ!MQ8FetD27SVKN34CS_P-K3qrhspFnpf_Mqb0McFN9y5vSUeScc5b0TlZ3ZMDvt4Cn4b3g0h9ci6EO9k3PBEQXpePrg$ 
>  arch/x86/include/asm/kvm_host.h | 10 +++++++++-
>  arch/x86/kvm/x86.h              |  6 ------
>  2 files changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index c8c7e2475a18..167d592e08d0 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1868,8 +1868,16 @@ static inline int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn,
>  }
>  #endif /* CONFIG_HYPERV */
>  
> +enum kvm_intr_type {
> +	/* Values are arbitrary, but must be non-zero. */
> +	KVM_HANDLING_IRQ = 1,
> +	KVM_HANDLING_NMI,
> +};
> +
> +/* Enable perf NMI and timer modes to work, and minimise false positives. */
>  #define kvm_arch_pmi_in_guest(vcpu) \
> -	((vcpu) && (vcpu)->arch.handling_intr_from_guest)
> +	((vcpu) && (vcpu)->arch.handling_intr_from_guest && \
> +	 (in_nmi() == ((vcpu)->arch.handling_intr_from_guest == KVM_HANDLING_NMI)))
>  
>  void __init kvm_mmu_x86_module_init(void);
>  int kvm_mmu_vendor_module_init(void);
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 2f7e19166658..4dc38092d599 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -431,12 +431,6 @@ static inline bool kvm_notify_vmexit_enabled(struct kvm *kvm)
>  	return kvm->arch.notify_vmexit_flags & KVM_X86_NOTIFY_VMEXIT_ENABLED;
>  }
>  
> -enum kvm_intr_type {
> -	/* Values are arbitrary, but must be non-zero. */
> -	KVM_HANDLING_IRQ = 1,
> -	KVM_HANDLING_NMI,
> -};
> -
>  static __always_inline void kvm_before_interrupt(struct kvm_vcpu *vcpu,
>  						 enum kvm_intr_type intr)
>  {
> 
> base-commit: 1ab097653e4dd8d23272d028a61352c23486fd4a
Like Xu Dec. 13, 2023, 8:24 a.m. UTC | #6
On 13/12/2023 3:28 pm, Dongli Zhang wrote:
> Hi Like,
> 
> On 12/5/23 19:20, Like Xu wrote:
>> From: Like Xu <likexu@tencent.com>
>>
>> Explicitly checking the source of external interrupt is indeed NMI and not
>> other types in the kvm_arch_pmi_in_guest(), which prevents perf-kvm false
>> positive samples generated in perf/core NMI mode after vm-exit but before
>> kvm_before_interrupt() from being incorrectly labelled as guest samples:
> 
> About the before kvm_before_interrupt() ...
> 
>>
>> # test: perf-record + cpu-cycles:HP (which collects host-only precise samples)
>> # Symbol                                   Overhead       sys       usr  guest sys  guest usr
>> # .......................................  ........  ........  ........  .........  .........
>> #
>> # Before:
>>    [g] entry_SYSCALL_64                       24.63%     0.00%     0.00%     24.63%      0.00%
>>    [g] syscall_return_via_sysret              23.23%     0.00%     0.00%     23.23%      0.00%
>>    [g] files_lookup_fd_raw                     6.35%     0.00%     0.00%      6.35%      0.00%
>> # After:
>>    [k] perf_adjust_freq_unthr_context         57.23%    57.23%     0.00%      0.00%      0.00%
>>    [k] __vmx_vcpu_run                          4.09%     4.09%     0.00%      0.00%      0.00%
>>    [k] vmx_update_host_rsp                     3.17%     3.17%     0.00%      0.00%      0.00%
>>
>> In the above case, perf records the samples labelled '[g]', the RIPs behind
>> the weird samples are actually being queried by perf_instruction_pointer()
>> after determining whether it's in GUEST state or not, and here's the issue:
>>
>> If vm-exit is caused by a non-NMI interrupt (such as hrtimer_interrupt) and
>> at least one PMU counter is enabled on host, the kvm_arch_pmi_in_guest()
>> will remain true (KVM_HANDLING_IRQ is set) until kvm_before_interrupt().
> 
> ... and here.
> 
> Would you mind helping why kvm_arch_pmi_in_guest() remains true before
> *kvm_before_interrupt()*.
> 
> According to the source code, the vcpu->arch.handling_intr_from_guest
> is set to non-zero only at kvm_before_interrupt(), and cleared at
> kvm_after_interrupt().
> 
> Or would you mean kvm_after_interrupt()?

Oops, it should refer to kvm_after_interrupt() as the code fixed. Thank you.

> 
> Thank you very much!
> 
> Dongli Zhang
> 
>>
>> During this window, if a PMI occurs on host (since the KVM instructions on
>> host are being executed), the control flow, with the help of the host NMI
>> context, will be transferred to perf/core to generate performance samples,
>> thus perf_instruction_pointer() and perf_guest_get_ip() is called.
>>
>> Since kvm_arch_pmi_in_guest() only checks if there is an interrupt, it may
>> cause perf/core to mistakenly assume that the source RIP of the host NMI
>> belongs to the guest world and use perf_guest_get_ip() to get the RIP of
>> a vCPU that has already exited by a non-NMI interrupt.
>>
>> Error samples are recorded and presented to the end-user via perf-report.
>> Such false positive samples could be eliminated by explicitly determining
>> if the exit reason is KVM_HANDLING_NMI.
>>
>> Note that when vm-exit is indeed triggered by PMI and before HANDLING_NMI
>> is cleared, it's also still possible that another PMI is generated on host.
>> Also for perf/core timer mode, the false positives are still possible since
>> that non-NMI sources of interrupts are not always being used by perf/core.
>> In both cases above, perf/core should correctly distinguish between real
>> RIP sources or even need to generate two samples, belonging to host and
>> guest separately, but that's perf/core's story for interested warriors.
>>
>> Fixes: dd60d217062f ("KVM: x86: Fix perf timer mode IP reporting")
>> Signed-off-by: Like Xu <likexu@tencent.com>
>> ---
>> V1 -> V2 Changelog:
>> - Refine commit message to cover both perf/core timer and NMI modes;
>> - Use in_nmi() to distinguish whether it's NMI mode or not; (Sean)
>> V1: https://urldefense.com/v3/__https://lore.kernel.org/kvm/20231204074535.9567-1-likexu@tencent.com/__;!!ACWV5N9M2RV99hQ!MQ8FetD27SVKN34CS_P-K3qrhspFnpf_Mqb0McFN9y5vSUeScc5b0TlZ3ZMDvt4Cn4b3g0h9ci6EO9k3PBEQXpePrg$
>>   arch/x86/include/asm/kvm_host.h | 10 +++++++++-
>>   arch/x86/kvm/x86.h              |  6 ------
>>   2 files changed, 9 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index c8c7e2475a18..167d592e08d0 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -1868,8 +1868,16 @@ static inline int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn,
>>   }
>>   #endif /* CONFIG_HYPERV */
>>   
>> +enum kvm_intr_type {
>> +	/* Values are arbitrary, but must be non-zero. */
>> +	KVM_HANDLING_IRQ = 1,
>> +	KVM_HANDLING_NMI,
>> +};
>> +
>> +/* Enable perf NMI and timer modes to work, and minimise false positives. */
>>   #define kvm_arch_pmi_in_guest(vcpu) \
>> -	((vcpu) && (vcpu)->arch.handling_intr_from_guest)
>> +	((vcpu) && (vcpu)->arch.handling_intr_from_guest && \
>> +	 (in_nmi() == ((vcpu)->arch.handling_intr_from_guest == KVM_HANDLING_NMI)))
>>   
>>   void __init kvm_mmu_x86_module_init(void);
>>   int kvm_mmu_vendor_module_init(void);
>> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
>> index 2f7e19166658..4dc38092d599 100644
>> --- a/arch/x86/kvm/x86.h
>> +++ b/arch/x86/kvm/x86.h
>> @@ -431,12 +431,6 @@ static inline bool kvm_notify_vmexit_enabled(struct kvm *kvm)
>>   	return kvm->arch.notify_vmexit_flags & KVM_X86_NOTIFY_VMEXIT_ENABLED;
>>   }
>>   
>> -enum kvm_intr_type {
>> -	/* Values are arbitrary, but must be non-zero. */
>> -	KVM_HANDLING_IRQ = 1,
>> -	KVM_HANDLING_NMI,
>> -};
>> -
>>   static __always_inline void kvm_before_interrupt(struct kvm_vcpu *vcpu,
>>   						 enum kvm_intr_type intr)
>>   {
>>
>> base-commit: 1ab097653e4dd8d23272d028a61352c23486fd4a
Sean Christopherson Dec. 13, 2023, 5:01 p.m. UTC | #7
On Wed, Dec 13, 2023, Like Xu wrote:
> 
> 
> On 13/12/2023 3:28 pm, Dongli Zhang wrote:
> > Hi Like,
> > 
> > On 12/5/23 19:20, Like Xu wrote:
> > > From: Like Xu <likexu@tencent.com>
> > > 
> > > Explicitly checking the source of external interrupt is indeed NMI and not
> > > other types in the kvm_arch_pmi_in_guest(), which prevents perf-kvm false
> > > positive samples generated in perf/core NMI mode after vm-exit but before
> > > kvm_before_interrupt() from being incorrectly labelled as guest samples:
> > 
> > About the before kvm_before_interrupt() ...
> > 
> > > 
> > > # test: perf-record + cpu-cycles:HP (which collects host-only precise samples)
> > > # Symbol                                   Overhead       sys       usr  guest sys  guest usr
> > > # .......................................  ........  ........  ........  .........  .........
> > > #
> > > # Before:
> > >    [g] entry_SYSCALL_64                       24.63%     0.00%     0.00%     24.63%      0.00%
> > >    [g] syscall_return_via_sysret              23.23%     0.00%     0.00%     23.23%      0.00%
> > >    [g] files_lookup_fd_raw                     6.35%     0.00%     0.00%      6.35%      0.00%
> > > # After:
> > >    [k] perf_adjust_freq_unthr_context         57.23%    57.23%     0.00%      0.00%      0.00%
> > >    [k] __vmx_vcpu_run                          4.09%     4.09%     0.00%      0.00%      0.00%
> > >    [k] vmx_update_host_rsp                     3.17%     3.17%     0.00%      0.00%      0.00%
> > > 
> > > In the above case, perf records the samples labelled '[g]', the RIPs behind
> > > the weird samples are actually being queried by perf_instruction_pointer()
> > > after determining whether it's in GUEST state or not, and here's the issue:
> > > 
> > > If vm-exit is caused by a non-NMI interrupt (such as hrtimer_interrupt) and
> > > at least one PMU counter is enabled on host, the kvm_arch_pmi_in_guest()
> > > will remain true (KVM_HANDLING_IRQ is set) until kvm_before_interrupt().
> > 
> > ... and here.
> > 
> > Would you mind helping why kvm_arch_pmi_in_guest() remains true before
> > *kvm_before_interrupt()*.
> > 
> > According to the source code, the vcpu->arch.handling_intr_from_guest
> > is set to non-zero only at kvm_before_interrupt(), and cleared at
> > kvm_after_interrupt().
> > 
> > Or would you mean kvm_after_interrupt()?
> 
> Oops, it should refer to kvm_after_interrupt() as the code fixed. Thank you.

No need for another version if that's the only hiccup, I can fixup when applying.
Sean Christopherson Feb. 6, 2024, 7:38 p.m. UTC | #8
+Oliver

On Wed, Dec 06, 2023, Like Xu wrote:
> Note that when vm-exit is indeed triggered by PMI and before HANDLING_NMI
> is cleared, it's also still possible that another PMI is generated on host.
> Also for perf/core timer mode, the false positives are still possible since
> that non-NMI sources of interrupts are not always being used by perf/core.
> In both cases above, perf/core should correctly distinguish between real
> RIP sources or even need to generate two samples, belonging to host and
> guest separately, but that's perf/core's story for interested warriors.

Oliver has a patch[*] that he promised he would send "soon" (wink wink) to
properly fix events that are configured to exclude the guest.  Unless someone
objects, I'm going to tweak the last part of the changelog to be:

    Note that when VM-exit is indeed triggered by PMI and before HANDLING_NMI
    is cleared, it's also still possible that another PMI is generated on host.
    Also for perf/core timer mode, the false positives are still possible since
    that non-NMI sources of interrupts are not always being used by perf/core.
    
    For events that are host-only, perf/core can and should eliminate false
    positives by checking event->attr.exclude_guest, i.e. events that are
    configured to exclude KVM guests should never fire in the guest.
    
    Events that are configured to count host and guest are trickier, perhaps
    impossible to handle with 100% accuracy?  And regardless of what accuracy
    is provided by perf/core, improving KVM's accuracy is cheap and easy, with
    no real downsides.

[*] https://git.kernel.org/pub/scm/linux/kernel/git/oupton/linux.git/commit/?h=perf/unfudge-sampling&id=6a35fa884b378f704b485c6bf887125af5da6077
Sean Christopherson Feb. 6, 2024, 9:08 p.m. UTC | #9
On Tue, Feb 06, 2024, Sean Christopherson wrote:
> +Oliver
> 
> On Wed, Dec 06, 2023, Like Xu wrote:
> > Note that when vm-exit is indeed triggered by PMI and before HANDLING_NMI
> > is cleared, it's also still possible that another PMI is generated on host.
> > Also for perf/core timer mode, the false positives are still possible since
> > that non-NMI sources of interrupts are not always being used by perf/core.
> > In both cases above, perf/core should correctly distinguish between real
> > RIP sources or even need to generate two samples, belonging to host and
> > guest separately, but that's perf/core's story for interested warriors.
> 
> Oliver has a patch[*] that he promised he would send "soon" (wink wink) to
> properly fix events that are configured to exclude the guest.  Unless someone
> objects, I'm going to tweak the last part of the changelog to be:
> 
>     Note that when VM-exit is indeed triggered by PMI and before HANDLING_NMI
>     is cleared, it's also still possible that another PMI is generated on host.
>     Also for perf/core timer mode, the false positives are still possible since
>     that non-NMI sources of interrupts are not always being used by perf/core.
>     
>     For events that are host-only, perf/core can and should eliminate false
>     positives by checking event->attr.exclude_guest, i.e. events that are
>     configured to exclude KVM guests should never fire in the guest.
>     
>     Events that are configured to count host and guest are trickier, perhaps
>     impossible to handle with 100% accuracy?  And regardless of what accuracy
>     is provided by perf/core, improving KVM's accuracy is cheap and easy, with
>     no real downsides.

Never mind, this causes KUT's pmu_pebs test to fail:

  FAIL: Multiple (0x700000055): No OVF irq, none PEBS records.
  FAIL: Adaptive (0x1): Multiple (0x700000055): No OVF irq, none PEBS records.
  FAIL: Adaptive (0x2): Multiple (0x700000055): No OVF irq, none PEBS records.
  FAIL: Adaptive (0x4): Multiple (0x700000055): No OVF irq, none PEBS records.
  FAIL: Adaptive (0x1f000008): Multiple (0x700000055): No OVF irq, none PEBS records.
  FAIL: GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records.
  FAIL: Multiple (0x700000055): No OVF irq, none PEBS records.
  FAIL: Adaptive (0x1): GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records.
  FAIL: Adaptive (0x1): Multiple (0x700000055): No OVF irq, none PEBS records.
  FAIL: Adaptive (0x2): GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records.
  FAIL: Adaptive (0x2): Multiple (0x700000055): No OVF irq, none PEBS records.
  FAIL: Adaptive (0x4): GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records.
  FAIL: Adaptive (0x4): Multiple (0x700000055): No OVF irq, none PEBS records.
  FAIL: Adaptive (0x1f000008): GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records.
  FAIL: Adaptive (0x1f000008): Multiple (0x700000055): No OVF irq, none PEBS records.
  FAIL: Multiple (0x700000055): No OVF irq, none PEBS records.
  FAIL: Adaptive (0x1): Multiple (0x700000055): No OVF irq, none PEBS records.
  FAIL: Adaptive (0x2): Multiple (0x700000055): No OVF irq, none PEBS records.
  FAIL: Adaptive (0x4): Multiple (0x700000055): No OVF irq, none PEBS records.
  FAIL: Adaptive (0x1f000008): Multiple (0x700000055): No OVF irq, none PEBS records.

It might be a test bug, but I have neither the time nor the inclination to
investigate.


Like,

If you want any chance of your patches going anywhere but my trash folder, you
need to change your upstream workflow to actually run tests.  I would give most
people the benefit of the doubt, e.g. assume they didn't have the requisite
hardware, or didn't realize which tests would be relevant/important.  But this
is a recurring problem, and you have been warned, multiple times.
Like Xu Feb. 18, 2024, 9:51 a.m. UTC | #10
On 7/2/2024 5:08 am, Sean Christopherson wrote:
> On Tue, Feb 06, 2024, Sean Christopherson wrote:
>> +Oliver
>>
>> On Wed, Dec 06, 2023, Like Xu wrote:
>>> Note that when vm-exit is indeed triggered by PMI and before HANDLING_NMI
>>> is cleared, it's also still possible that another PMI is generated on host.
>>> Also for perf/core timer mode, the false positives are still possible since
>>> that non-NMI sources of interrupts are not always being used by perf/core.
>>> In both cases above, perf/core should correctly distinguish between real
>>> RIP sources or even need to generate two samples, belonging to host and
>>> guest separately, but that's perf/core's story for interested warriors.
>>
>> Oliver has a patch[*] that he promised he would send "soon" (wink wink) to
>> properly fix events that are configured to exclude the guest.  Unless someone
>> objects, I'm going to tweak the last part of the changelog to be:
>>
>>      Note that when VM-exit is indeed triggered by PMI and before HANDLING_NMI
>>      is cleared, it's also still possible that another PMI is generated on host.
>>      Also for perf/core timer mode, the false positives are still possible since
>>      that non-NMI sources of interrupts are not always being used by perf/core.
>>      
>>      For events that are host-only, perf/core can and should eliminate false
>>      positives by checking event->attr.exclude_guest, i.e. events that are
>>      configured to exclude KVM guests should never fire in the guest.
>>      
>>      Events that are configured to count host and guest are trickier, perhaps
>>      impossible to handle with 100% accuracy?  And regardless of what accuracy
>>      is provided by perf/core, improving KVM's accuracy is cheap and easy, with
>>      no real downsides.
> 
> Never mind, this causes KUT's pmu_pebs test to fail:
> 
>    FAIL: Multiple (0x700000055): No OVF irq, none PEBS records.
>    FAIL: Adaptive (0x1): Multiple (0x700000055): No OVF irq, none PEBS records.
>    FAIL: Adaptive (0x2): Multiple (0x700000055): No OVF irq, none PEBS records.
>    FAIL: Adaptive (0x4): Multiple (0x700000055): No OVF irq, none PEBS records.
>    FAIL: Adaptive (0x1f000008): Multiple (0x700000055): No OVF irq, none PEBS records.
>    FAIL: GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records.
>    FAIL: Multiple (0x700000055): No OVF irq, none PEBS records.
>    FAIL: Adaptive (0x1): GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records.
>    FAIL: Adaptive (0x1): Multiple (0x700000055): No OVF irq, none PEBS records.
>    FAIL: Adaptive (0x2): GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records.
>    FAIL: Adaptive (0x2): Multiple (0x700000055): No OVF irq, none PEBS records.
>    FAIL: Adaptive (0x4): GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records.
>    FAIL: Adaptive (0x4): Multiple (0x700000055): No OVF irq, none PEBS records.
>    FAIL: Adaptive (0x1f000008): GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records.
>    FAIL: Adaptive (0x1f000008): Multiple (0x700000055): No OVF irq, none PEBS records.
>    FAIL: Multiple (0x700000055): No OVF irq, none PEBS records.
>    FAIL: Adaptive (0x1): Multiple (0x700000055): No OVF irq, none PEBS records.
>    FAIL: Adaptive (0x2): Multiple (0x700000055): No OVF irq, none PEBS records.
>    FAIL: Adaptive (0x4): Multiple (0x700000055): No OVF irq, none PEBS records.
>    FAIL: Adaptive (0x1f000008): Multiple (0x700000055): No OVF irq, none PEBS records.
> 
> It might be a test bug, but I have neither the time nor the inclination to
> investigate.

For PEBS ovf case, we have "in_nmi() = 0x100000" from the core kernel and
the following diff fixes the issue:

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 995760ba072f..dcf665251fce 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1891,7 +1891,7 @@ enum kvm_intr_type {
  /* Enable perf NMI and timer modes to work, and minimise false positives. */
  #define kvm_arch_pmi_in_guest(vcpu) \
  	((vcpu) && (vcpu)->arch.handling_intr_from_guest && \
-	 (in_nmi() == ((vcpu)->arch.handling_intr_from_guest == KVM_HANDLING_NMI)))
+	 (!!in_nmi() == ((vcpu)->arch.handling_intr_from_guest == KVM_HANDLING_NMI)))

  void __init kvm_mmu_x86_module_init(void);
  int kvm_mmu_vendor_module_init(void);

, does it help (tests passed on ICX) ?

> 
> 
> Like,
> 
> If you want any chance of your patches going anywhere but my trash folder, you
> need to change your upstream workflow to actually run tests.  I would give most
> people the benefit of the doubt, e.g. assume they didn't have the requisite
> hardware, or didn't realize which tests would be relevant/important.  But this
> is a recurring problem, and you have been warned, multiple times.

Sorry, my CI resources are diverted to other downstream projects.
But there's no doubt it's my fault and this behavior will be corrected.
Sean Christopherson Feb. 27, 2024, 12:11 a.m. UTC | #11
On Sun, Feb 18, 2024, Like Xu wrote:
> On 7/2/2024 5:08 am, Sean Christopherson wrote:
> > On Tue, Feb 06, 2024, Sean Christopherson wrote:
> > Never mind, this causes KUT's pmu_pebs test to fail:
> > 
> >    FAIL: Multiple (0x700000055): No OVF irq, none PEBS records.
> >    FAIL: Adaptive (0x1): Multiple (0x700000055): No OVF irq, none PEBS records.
> >    FAIL: Adaptive (0x2): Multiple (0x700000055): No OVF irq, none PEBS records.
> >    FAIL: Adaptive (0x4): Multiple (0x700000055): No OVF irq, none PEBS records.
> >    FAIL: Adaptive (0x1f000008): Multiple (0x700000055): No OVF irq, none PEBS records.
> >    FAIL: GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records.
> >    FAIL: Multiple (0x700000055): No OVF irq, none PEBS records.
> >    FAIL: Adaptive (0x1): GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records.
> >    FAIL: Adaptive (0x1): Multiple (0x700000055): No OVF irq, none PEBS records.
> >    FAIL: Adaptive (0x2): GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records.
> >    FAIL: Adaptive (0x2): Multiple (0x700000055): No OVF irq, none PEBS records.
> >    FAIL: Adaptive (0x4): GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records.
> >    FAIL: Adaptive (0x4): Multiple (0x700000055): No OVF irq, none PEBS records.
> >    FAIL: Adaptive (0x1f000008): GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records.
> >    FAIL: Adaptive (0x1f000008): Multiple (0x700000055): No OVF irq, none PEBS records.
> >    FAIL: Multiple (0x700000055): No OVF irq, none PEBS records.
> >    FAIL: Adaptive (0x1): Multiple (0x700000055): No OVF irq, none PEBS records.
> >    FAIL: Adaptive (0x2): Multiple (0x700000055): No OVF irq, none PEBS records.
> >    FAIL: Adaptive (0x4): Multiple (0x700000055): No OVF irq, none PEBS records.
> >    FAIL: Adaptive (0x1f000008): Multiple (0x700000055): No OVF irq, none PEBS records.
> > 
> > It might be a test bug, but I have neither the time nor the inclination to
> > investigate.
> 
> For PEBS ovf case, we have "in_nmi() = 0x100000" from the core kernel and
> the following diff fixes the issue:
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 995760ba072f..dcf665251fce 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1891,7 +1891,7 @@ enum kvm_intr_type {
>  /* Enable perf NMI and timer modes to work, and minimise false positives. */
>  #define kvm_arch_pmi_in_guest(vcpu) \
>  	((vcpu) && (vcpu)->arch.handling_intr_from_guest && \
> -	 (in_nmi() == ((vcpu)->arch.handling_intr_from_guest == KVM_HANDLING_NMI)))
> +	 (!!in_nmi() == ((vcpu)->arch.handling_intr_from_guest == KVM_HANDLING_NMI)))
> 
>  void __init kvm_mmu_x86_module_init(void);
>  int kvm_mmu_vendor_module_init(void);
> 
> , does it help (tests passed on ICX) ?

Yes, that resolves the issues I was seeing.  I'll get this applied with the above
squashed.

I'll also see if the tip tree folks would be open to converting the in_{nmi,hardirq,...}()
macros to functions that return bools (or at least casting to bools in the macros).
I can't see any reason for in_nmi() to effectively return an int since it's just
a wrapper to nmi_count(), and this seems like a disaster waiting to happen.

> > If you want any chance of your patches going anywhere but my trash folder, you
> > need to change your upstream workflow to actually run tests.  I would give most
> > people the benefit of the doubt, e.g. assume they didn't have the requisite
> > hardware, or didn't realize which tests would be relevant/important.  But this
> > is a recurring problem, and you have been warned, multiple times.
> 
> Sorry, my CI resources are diverted to other downstream projects.
> But there's no doubt it's my fault and this behavior will be corrected.

Thank you.
Sean Christopherson Feb. 27, 2024, 2:21 a.m. UTC | #12
On Wed, 06 Dec 2023 11:20:54 +0800, Like Xu wrote:
> Explicitly checking the source of external interrupt is indeed NMI and not
> other types in the kvm_arch_pmi_in_guest(), which prevents perf-kvm false
> positive samples generated in perf/core NMI mode after vm-exit but before
> kvm_before_interrupt() from being incorrectly labelled as guest samples:
> 
> # test: perf-record + cpu-cycles:HP (which collects host-only precise samples)
> # Symbol                                   Overhead       sys       usr  guest sys  guest usr
> # .......................................  ........  ........  ........  .........  .........
> #
> # Before:
>   [g] entry_SYSCALL_64                       24.63%     0.00%     0.00%     24.63%      0.00%
>   [g] syscall_return_via_sysret              23.23%     0.00%     0.00%     23.23%      0.00%
>   [g] files_lookup_fd_raw                     6.35%     0.00%     0.00%      6.35%      0.00%
> # After:
>   [k] perf_adjust_freq_unthr_context         57.23%    57.23%     0.00%      0.00%      0.00%
>   [k] __vmx_vcpu_run                          4.09%     4.09%     0.00%      0.00%      0.00%
>   [k] vmx_update_host_rsp                     3.17%     3.17%     0.00%      0.00%      0.00%
> 
> [...]

Applied to kvm-x86 pmu, with the !!in_nmi() fixup squashed.  Thanks!

[1/1] KVM: x86/intr: Explicitly check NMI from guest to eliminate false positives
      https://github.com/kvm-x86/linux/commit/812d432373f6

--
https://github.com/kvm-x86/linux/tree/next
diff mbox series

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c8c7e2475a18..167d592e08d0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1868,8 +1868,16 @@  static inline int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn,
 }
 #endif /* CONFIG_HYPERV */
 
+enum kvm_intr_type {
+	/* Values are arbitrary, but must be non-zero. */
+	KVM_HANDLING_IRQ = 1,
+	KVM_HANDLING_NMI,
+};
+
+/* Enable perf NMI and timer modes to work, and minimise false positives. */
 #define kvm_arch_pmi_in_guest(vcpu) \
-	((vcpu) && (vcpu)->arch.handling_intr_from_guest)
+	((vcpu) && (vcpu)->arch.handling_intr_from_guest && \
+	 (in_nmi() == ((vcpu)->arch.handling_intr_from_guest == KVM_HANDLING_NMI)))
 
 void __init kvm_mmu_x86_module_init(void);
 int kvm_mmu_vendor_module_init(void);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 2f7e19166658..4dc38092d599 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -431,12 +431,6 @@  static inline bool kvm_notify_vmexit_enabled(struct kvm *kvm)
 	return kvm->arch.notify_vmexit_flags & KVM_X86_NOTIFY_VMEXIT_ENABLED;
 }
 
-enum kvm_intr_type {
-	/* Values are arbitrary, but must be non-zero. */
-	KVM_HANDLING_IRQ = 1,
-	KVM_HANDLING_NMI,
-};
-
 static __always_inline void kvm_before_interrupt(struct kvm_vcpu *vcpu,
 						 enum kvm_intr_type intr)
 {