Message ID | 20231206032054.55070-1-likexu@tencent.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] KVM: x86/intr: Explicitly check NMI from guest to eliminate false positives | expand |
Please don't make up random prefixes. This should really be "x86/pmu". From Documentation/process/maintainer-kvm-x86.rst: Shortlog ~~~~~~~~ The preferred prefix format is ``KVM: <topic>:``, where ``<topic>`` is one of:: - x86 - x86/mmu - x86/pmu - x86/xen - selftests - SVM - nSVM - VMX - nVMX ... New topics do occasionally pop up, but please start an on-list discussion if you want to propose introducing a new topic, i.e. don't go rogue.
On 6/12/2023 11:30 pm, Sean Christopherson wrote: > Please don't make up random prefixes. This should really be "x86/pmu". Thanks. I'm hesitant to categorize about NMI handling into kvm/pmu scope. But you have clear idea on it and it's fine to me. Please feel free to refine commit message to fit your understanding and taste as you did. > > From Documentation/process/maintainer-kvm-x86.rst: > > Shortlog > ~~~~~~~~ > The preferred prefix format is ``KVM: <topic>:``, where ``<topic>`` is one of:: > > - x86 > - x86/mmu > - x86/pmu > - x86/xen > - selftests > - SVM > - nSVM > - VMX > - nVMX > > > ... > > New topics do occasionally pop up, but please start an on-list discussion if > you want to propose introducing a new topic, i.e. don't go rogue.
On Wed, 2023-12-06 at 11:20 +0800, Like Xu wrote: > From: Like Xu <likexu@tencent.com> > > Explicitly checking the source of external interrupt is indeed NMI and not > other types in the kvm_arch_pmi_in_guest(), which prevents perf-kvm false > positive samples generated in perf/core NMI mode after vm-exit but before > kvm_before_interrupt() from being incorrectly labelled as guest samples: > > # test: perf-record + cpu-cycles:HP (which collects host-only precise samples) > # Symbol Overhead sys usr guest sys guest usr > # ....................................... ........ ........ ........ ......... ......... > # > # Before: > [g] entry_SYSCALL_64 24.63% 0.00% 0.00% 24.63% 0.00% > [g] syscall_return_via_sysret 23.23% 0.00% 0.00% 23.23% 0.00% > [g] files_lookup_fd_raw 6.35% 0.00% 0.00% 6.35% 0.00% > # After: > [k] perf_adjust_freq_unthr_context 57.23% 57.23% 0.00% 0.00% 0.00% > [k] __vmx_vcpu_run 4.09% 4.09% 0.00% 0.00% 0.00% > [k] vmx_update_host_rsp 3.17% 3.17% 0.00% 0.00% 0.00% > > In the above case, perf records the samples labelled '[g]', the RIPs behind > the weird samples are actually being queried by perf_instruction_pointer() > after determining whether it's in GUEST state or not, and here's the issue: > > If vm-exit is caused by a non-NMI interrupt (such as hrtimer_interrupt) and > at least one PMU counter is enabled on host, the kvm_arch_pmi_in_guest() > will remain true (KVM_HANDLING_IRQ is set) until kvm_before_interrupt(). > > During this window, if a PMI occurs on host (since the KVM instructions on > host are being executed), the control flow, with the help of the host NMI > context, will be transferred to perf/core to generate performance samples, > thus perf_instruction_pointer() and perf_guest_get_ip() is called. > > Since kvm_arch_pmi_in_guest() only checks if there is an interrupt, it may > cause perf/core to mistakenly assume that the source RIP of the host NMI > belongs to the guest world and use perf_guest_get_ip() to get the RIP of > a vCPU that has already exited by a non-NMI interrupt. > > Error samples are recorded and presented to the end-user via perf-report. > Such false positive samples could be eliminated by explicitly determining > if the exit reason is KVM_HANDLING_NMI. > > Note that when vm-exit is indeed triggered by PMI and before HANDLING_NMI > is cleared, it's also still possible that another PMI is generated on host. > Also for perf/core timer mode, the false positives are still possible since > that non-NMI sources of interrupts are not always being used by perf/core. > In both cases above, perf/core should correctly distinguish between real > RIP sources or even need to generate two samples, belonging to host and > guest separately, but that's perf/core's story for interested warriors. > > Fixes: dd60d217062f ("KVM: x86: Fix perf timer mode IP reporting") > Signed-off-by: Like Xu <likexu@tencent.com> > --- > V1 -> V2 Changelog: > - Refine commit message to cover both perf/core timer and NMI modes; > - Use in_nmi() to distinguish whether it's NMI mode or not; (Sean) > V1: https://lore.kernel.org/kvm/20231204074535.9567-1-likexu@tencent.com/ > arch/x86/include/asm/kvm_host.h | 10 +++++++++- > arch/x86/kvm/x86.h | 6 ------ > 2 files changed, 9 insertions(+), 7 deletions(-) > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index c8c7e2475a18..167d592e08d0 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -1868,8 +1868,16 @@ static inline int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, > } > #endif /* CONFIG_HYPERV */ > > +enum kvm_intr_type { > + /* Values are arbitrary, but must be non-zero. */ > + KVM_HANDLING_IRQ = 1, > + KVM_HANDLING_NMI, > +}; > + > +/* Enable perf NMI and timer modes to work, and minimise false positives. */ > #define kvm_arch_pmi_in_guest(vcpu) \ > - ((vcpu) && (vcpu)->arch.handling_intr_from_guest) > + ((vcpu) && (vcpu)->arch.handling_intr_from_guest && \ > + (in_nmi() == ((vcpu)->arch.handling_intr_from_guest == KVM_HANDLING_NMI))) > > void __init kvm_mmu_x86_module_init(void); > int kvm_mmu_vendor_module_init(void); > diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h > index 2f7e19166658..4dc38092d599 100644 > --- a/arch/x86/kvm/x86.h > +++ b/arch/x86/kvm/x86.h > @@ -431,12 +431,6 @@ static inline bool kvm_notify_vmexit_enabled(struct kvm *kvm) > return kvm->arch.notify_vmexit_flags & KVM_X86_NOTIFY_VMEXIT_ENABLED; > } > > -enum kvm_intr_type { > - /* Values are arbitrary, but must be non-zero. */ > - KVM_HANDLING_IRQ = 1, > - KVM_HANDLING_NMI, > -}; > - > static __always_inline void kvm_before_interrupt(struct kvm_vcpu *vcpu, > enum kvm_intr_type intr) > { Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Best regards, Maxim Levitsky > > base-commit: 1ab097653e4dd8d23272d028a61352c23486fd4a
On Thu, Dec 07, 2023, Like Xu wrote: > On 6/12/2023 11:30 pm, Sean Christopherson wrote: > > Please don't make up random prefixes. This should really be "x86/pmu". > > Thanks. > > I'm hesitant to categorize about NMI handling into kvm/pmu scope. Why? Literally the only thing this can affect is PMU behavior. Even if there's a bug that affects the kernel's PMU, that's still x86/pmu as far as KVM is concerned.
Hi Like, On 12/5/23 19:20, Like Xu wrote: > From: Like Xu <likexu@tencent.com> > > Explicitly checking the source of external interrupt is indeed NMI and not > other types in the kvm_arch_pmi_in_guest(), which prevents perf-kvm false > positive samples generated in perf/core NMI mode after vm-exit but before > kvm_before_interrupt() from being incorrectly labelled as guest samples: About the before kvm_before_interrupt() ... > > # test: perf-record + cpu-cycles:HP (which collects host-only precise samples) > # Symbol Overhead sys usr guest sys guest usr > # ....................................... ........ ........ ........ ......... ......... > # > # Before: > [g] entry_SYSCALL_64 24.63% 0.00% 0.00% 24.63% 0.00% > [g] syscall_return_via_sysret 23.23% 0.00% 0.00% 23.23% 0.00% > [g] files_lookup_fd_raw 6.35% 0.00% 0.00% 6.35% 0.00% > # After: > [k] perf_adjust_freq_unthr_context 57.23% 57.23% 0.00% 0.00% 0.00% > [k] __vmx_vcpu_run 4.09% 4.09% 0.00% 0.00% 0.00% > [k] vmx_update_host_rsp 3.17% 3.17% 0.00% 0.00% 0.00% > > In the above case, perf records the samples labelled '[g]', the RIPs behind > the weird samples are actually being queried by perf_instruction_pointer() > after determining whether it's in GUEST state or not, and here's the issue: > > If vm-exit is caused by a non-NMI interrupt (such as hrtimer_interrupt) and > at least one PMU counter is enabled on host, the kvm_arch_pmi_in_guest() > will remain true (KVM_HANDLING_IRQ is set) until kvm_before_interrupt(). ... and here. Would you mind helping why kvm_arch_pmi_in_guest() remains true before *kvm_before_interrupt()*. According to the source code, the vcpu->arch.handling_intr_from_guest is set to non-zero only at kvm_before_interrupt(), and cleared at kvm_after_interrupt(). Or would you mean kvm_after_interrupt()? Thank you very much! Dongli Zhang > > During this window, if a PMI occurs on host (since the KVM instructions on > host are being executed), the control flow, with the help of the host NMI > context, will be transferred to perf/core to generate performance samples, > thus perf_instruction_pointer() and perf_guest_get_ip() is called. > > Since kvm_arch_pmi_in_guest() only checks if there is an interrupt, it may > cause perf/core to mistakenly assume that the source RIP of the host NMI > belongs to the guest world and use perf_guest_get_ip() to get the RIP of > a vCPU that has already exited by a non-NMI interrupt. > > Error samples are recorded and presented to the end-user via perf-report. > Such false positive samples could be eliminated by explicitly determining > if the exit reason is KVM_HANDLING_NMI. > > Note that when vm-exit is indeed triggered by PMI and before HANDLING_NMI > is cleared, it's also still possible that another PMI is generated on host. > Also for perf/core timer mode, the false positives are still possible since > that non-NMI sources of interrupts are not always being used by perf/core. > In both cases above, perf/core should correctly distinguish between real > RIP sources or even need to generate two samples, belonging to host and > guest separately, but that's perf/core's story for interested warriors. > > Fixes: dd60d217062f ("KVM: x86: Fix perf timer mode IP reporting") > Signed-off-by: Like Xu <likexu@tencent.com> > --- > V1 -> V2 Changelog: > - Refine commit message to cover both perf/core timer and NMI modes; > - Use in_nmi() to distinguish whether it's NMI mode or not; (Sean) > V1: https://urldefense.com/v3/__https://lore.kernel.org/kvm/20231204074535.9567-1-likexu@tencent.com/__;!!ACWV5N9M2RV99hQ!MQ8FetD27SVKN34CS_P-K3qrhspFnpf_Mqb0McFN9y5vSUeScc5b0TlZ3ZMDvt4Cn4b3g0h9ci6EO9k3PBEQXpePrg$ > arch/x86/include/asm/kvm_host.h | 10 +++++++++- > arch/x86/kvm/x86.h | 6 ------ > 2 files changed, 9 insertions(+), 7 deletions(-) > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index c8c7e2475a18..167d592e08d0 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -1868,8 +1868,16 @@ static inline int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, > } > #endif /* CONFIG_HYPERV */ > > +enum kvm_intr_type { > + /* Values are arbitrary, but must be non-zero. */ > + KVM_HANDLING_IRQ = 1, > + KVM_HANDLING_NMI, > +}; > + > +/* Enable perf NMI and timer modes to work, and minimise false positives. */ > #define kvm_arch_pmi_in_guest(vcpu) \ > - ((vcpu) && (vcpu)->arch.handling_intr_from_guest) > + ((vcpu) && (vcpu)->arch.handling_intr_from_guest && \ > + (in_nmi() == ((vcpu)->arch.handling_intr_from_guest == KVM_HANDLING_NMI))) > > void __init kvm_mmu_x86_module_init(void); > int kvm_mmu_vendor_module_init(void); > diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h > index 2f7e19166658..4dc38092d599 100644 > --- a/arch/x86/kvm/x86.h > +++ b/arch/x86/kvm/x86.h > @@ -431,12 +431,6 @@ static inline bool kvm_notify_vmexit_enabled(struct kvm *kvm) > return kvm->arch.notify_vmexit_flags & KVM_X86_NOTIFY_VMEXIT_ENABLED; > } > > -enum kvm_intr_type { > - /* Values are arbitrary, but must be non-zero. */ > - KVM_HANDLING_IRQ = 1, > - KVM_HANDLING_NMI, > -}; > - > static __always_inline void kvm_before_interrupt(struct kvm_vcpu *vcpu, > enum kvm_intr_type intr) > { > > base-commit: 1ab097653e4dd8d23272d028a61352c23486fd4a
On 13/12/2023 3:28 pm, Dongli Zhang wrote: > Hi Like, > > On 12/5/23 19:20, Like Xu wrote: >> From: Like Xu <likexu@tencent.com> >> >> Explicitly checking the source of external interrupt is indeed NMI and not >> other types in the kvm_arch_pmi_in_guest(), which prevents perf-kvm false >> positive samples generated in perf/core NMI mode after vm-exit but before >> kvm_before_interrupt() from being incorrectly labelled as guest samples: > > About the before kvm_before_interrupt() ... > >> >> # test: perf-record + cpu-cycles:HP (which collects host-only precise samples) >> # Symbol Overhead sys usr guest sys guest usr >> # ....................................... ........ ........ ........ ......... ......... >> # >> # Before: >> [g] entry_SYSCALL_64 24.63% 0.00% 0.00% 24.63% 0.00% >> [g] syscall_return_via_sysret 23.23% 0.00% 0.00% 23.23% 0.00% >> [g] files_lookup_fd_raw 6.35% 0.00% 0.00% 6.35% 0.00% >> # After: >> [k] perf_adjust_freq_unthr_context 57.23% 57.23% 0.00% 0.00% 0.00% >> [k] __vmx_vcpu_run 4.09% 4.09% 0.00% 0.00% 0.00% >> [k] vmx_update_host_rsp 3.17% 3.17% 0.00% 0.00% 0.00% >> >> In the above case, perf records the samples labelled '[g]', the RIPs behind >> the weird samples are actually being queried by perf_instruction_pointer() >> after determining whether it's in GUEST state or not, and here's the issue: >> >> If vm-exit is caused by a non-NMI interrupt (such as hrtimer_interrupt) and >> at least one PMU counter is enabled on host, the kvm_arch_pmi_in_guest() >> will remain true (KVM_HANDLING_IRQ is set) until kvm_before_interrupt(). > > ... and here. > > Would you mind helping why kvm_arch_pmi_in_guest() remains true before > *kvm_before_interrupt()*. > > According to the source code, the vcpu->arch.handling_intr_from_guest > is set to non-zero only at kvm_before_interrupt(), and cleared at > kvm_after_interrupt(). > > Or would you mean kvm_after_interrupt()? Oops, it should refer to kvm_after_interrupt() as the code fixed. Thank you. > > Thank you very much! > > Dongli Zhang > >> >> During this window, if a PMI occurs on host (since the KVM instructions on >> host are being executed), the control flow, with the help of the host NMI >> context, will be transferred to perf/core to generate performance samples, >> thus perf_instruction_pointer() and perf_guest_get_ip() is called. >> >> Since kvm_arch_pmi_in_guest() only checks if there is an interrupt, it may >> cause perf/core to mistakenly assume that the source RIP of the host NMI >> belongs to the guest world and use perf_guest_get_ip() to get the RIP of >> a vCPU that has already exited by a non-NMI interrupt. >> >> Error samples are recorded and presented to the end-user via perf-report. >> Such false positive samples could be eliminated by explicitly determining >> if the exit reason is KVM_HANDLING_NMI. >> >> Note that when vm-exit is indeed triggered by PMI and before HANDLING_NMI >> is cleared, it's also still possible that another PMI is generated on host. >> Also for perf/core timer mode, the false positives are still possible since >> that non-NMI sources of interrupts are not always being used by perf/core. >> In both cases above, perf/core should correctly distinguish between real >> RIP sources or even need to generate two samples, belonging to host and >> guest separately, but that's perf/core's story for interested warriors. >> >> Fixes: dd60d217062f ("KVM: x86: Fix perf timer mode IP reporting") >> Signed-off-by: Like Xu <likexu@tencent.com> >> --- >> V1 -> V2 Changelog: >> - Refine commit message to cover both perf/core timer and NMI modes; >> - Use in_nmi() to distinguish whether it's NMI mode or not; (Sean) >> V1: https://urldefense.com/v3/__https://lore.kernel.org/kvm/20231204074535.9567-1-likexu@tencent.com/__;!!ACWV5N9M2RV99hQ!MQ8FetD27SVKN34CS_P-K3qrhspFnpf_Mqb0McFN9y5vSUeScc5b0TlZ3ZMDvt4Cn4b3g0h9ci6EO9k3PBEQXpePrg$ >> arch/x86/include/asm/kvm_host.h | 10 +++++++++- >> arch/x86/kvm/x86.h | 6 ------ >> 2 files changed, 9 insertions(+), 7 deletions(-) >> >> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h >> index c8c7e2475a18..167d592e08d0 100644 >> --- a/arch/x86/include/asm/kvm_host.h >> +++ b/arch/x86/include/asm/kvm_host.h >> @@ -1868,8 +1868,16 @@ static inline int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, >> } >> #endif /* CONFIG_HYPERV */ >> >> +enum kvm_intr_type { >> + /* Values are arbitrary, but must be non-zero. */ >> + KVM_HANDLING_IRQ = 1, >> + KVM_HANDLING_NMI, >> +}; >> + >> +/* Enable perf NMI and timer modes to work, and minimise false positives. */ >> #define kvm_arch_pmi_in_guest(vcpu) \ >> - ((vcpu) && (vcpu)->arch.handling_intr_from_guest) >> + ((vcpu) && (vcpu)->arch.handling_intr_from_guest && \ >> + (in_nmi() == ((vcpu)->arch.handling_intr_from_guest == KVM_HANDLING_NMI))) >> >> void __init kvm_mmu_x86_module_init(void); >> int kvm_mmu_vendor_module_init(void); >> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h >> index 2f7e19166658..4dc38092d599 100644 >> --- a/arch/x86/kvm/x86.h >> +++ b/arch/x86/kvm/x86.h >> @@ -431,12 +431,6 @@ static inline bool kvm_notify_vmexit_enabled(struct kvm *kvm) >> return kvm->arch.notify_vmexit_flags & KVM_X86_NOTIFY_VMEXIT_ENABLED; >> } >> >> -enum kvm_intr_type { >> - /* Values are arbitrary, but must be non-zero. */ >> - KVM_HANDLING_IRQ = 1, >> - KVM_HANDLING_NMI, >> -}; >> - >> static __always_inline void kvm_before_interrupt(struct kvm_vcpu *vcpu, >> enum kvm_intr_type intr) >> { >> >> base-commit: 1ab097653e4dd8d23272d028a61352c23486fd4a
On Wed, Dec 13, 2023, Like Xu wrote: > > > On 13/12/2023 3:28 pm, Dongli Zhang wrote: > > Hi Like, > > > > On 12/5/23 19:20, Like Xu wrote: > > > From: Like Xu <likexu@tencent.com> > > > > > > Explicitly checking the source of external interrupt is indeed NMI and not > > > other types in the kvm_arch_pmi_in_guest(), which prevents perf-kvm false > > > positive samples generated in perf/core NMI mode after vm-exit but before > > > kvm_before_interrupt() from being incorrectly labelled as guest samples: > > > > About the before kvm_before_interrupt() ... > > > > > > > > # test: perf-record + cpu-cycles:HP (which collects host-only precise samples) > > > # Symbol Overhead sys usr guest sys guest usr > > > # ....................................... ........ ........ ........ ......... ......... > > > # > > > # Before: > > > [g] entry_SYSCALL_64 24.63% 0.00% 0.00% 24.63% 0.00% > > > [g] syscall_return_via_sysret 23.23% 0.00% 0.00% 23.23% 0.00% > > > [g] files_lookup_fd_raw 6.35% 0.00% 0.00% 6.35% 0.00% > > > # After: > > > [k] perf_adjust_freq_unthr_context 57.23% 57.23% 0.00% 0.00% 0.00% > > > [k] __vmx_vcpu_run 4.09% 4.09% 0.00% 0.00% 0.00% > > > [k] vmx_update_host_rsp 3.17% 3.17% 0.00% 0.00% 0.00% > > > > > > In the above case, perf records the samples labelled '[g]', the RIPs behind > > > the weird samples are actually being queried by perf_instruction_pointer() > > > after determining whether it's in GUEST state or not, and here's the issue: > > > > > > If vm-exit is caused by a non-NMI interrupt (such as hrtimer_interrupt) and > > > at least one PMU counter is enabled on host, the kvm_arch_pmi_in_guest() > > > will remain true (KVM_HANDLING_IRQ is set) until kvm_before_interrupt(). > > > > ... and here. > > > > Would you mind helping why kvm_arch_pmi_in_guest() remains true before > > *kvm_before_interrupt()*. > > > > According to the source code, the vcpu->arch.handling_intr_from_guest > > is set to non-zero only at kvm_before_interrupt(), and cleared at > > kvm_after_interrupt(). > > > > Or would you mean kvm_after_interrupt()? > > Oops, it should refer to kvm_after_interrupt() as the code fixed. Thank you. No need for another version if that's the only hiccup, I can fixup when applying.
+Oliver On Wed, Dec 06, 2023, Like Xu wrote: > Note that when vm-exit is indeed triggered by PMI and before HANDLING_NMI > is cleared, it's also still possible that another PMI is generated on host. > Also for perf/core timer mode, the false positives are still possible since > that non-NMI sources of interrupts are not always being used by perf/core. > In both cases above, perf/core should correctly distinguish between real > RIP sources or even need to generate two samples, belonging to host and > guest separately, but that's perf/core's story for interested warriors. Oliver has a patch[*] that he promised he would send "soon" (wink wink) to properly fix events that are configured to exclude the guest. Unless someone objects, I'm going to tweak the last part of the changelog to be: Note that when VM-exit is indeed triggered by PMI and before HANDLING_NMI is cleared, it's also still possible that another PMI is generated on host. Also for perf/core timer mode, the false positives are still possible since that non-NMI sources of interrupts are not always being used by perf/core. For events that are host-only, perf/core can and should eliminate false positives by checking event->attr.exclude_guest, i.e. events that are configured to exclude KVM guests should never fire in the guest. Events that are configured to count host and guest are trickier, perhaps impossible to handle with 100% accuracy? And regardless of what accuracy is provided by perf/core, improving KVM's accuracy is cheap and easy, with no real downsides. [*] https://git.kernel.org/pub/scm/linux/kernel/git/oupton/linux.git/commit/?h=perf/unfudge-sampling&id=6a35fa884b378f704b485c6bf887125af5da6077
On Tue, Feb 06, 2024, Sean Christopherson wrote: > +Oliver > > On Wed, Dec 06, 2023, Like Xu wrote: > > Note that when vm-exit is indeed triggered by PMI and before HANDLING_NMI > > is cleared, it's also still possible that another PMI is generated on host. > > Also for perf/core timer mode, the false positives are still possible since > > that non-NMI sources of interrupts are not always being used by perf/core. > > In both cases above, perf/core should correctly distinguish between real > > RIP sources or even need to generate two samples, belonging to host and > > guest separately, but that's perf/core's story for interested warriors. > > Oliver has a patch[*] that he promised he would send "soon" (wink wink) to > properly fix events that are configured to exclude the guest. Unless someone > objects, I'm going to tweak the last part of the changelog to be: > > Note that when VM-exit is indeed triggered by PMI and before HANDLING_NMI > is cleared, it's also still possible that another PMI is generated on host. > Also for perf/core timer mode, the false positives are still possible since > that non-NMI sources of interrupts are not always being used by perf/core. > > For events that are host-only, perf/core can and should eliminate false > positives by checking event->attr.exclude_guest, i.e. events that are > configured to exclude KVM guests should never fire in the guest. > > Events that are configured to count host and guest are trickier, perhaps > impossible to handle with 100% accuracy? And regardless of what accuracy > is provided by perf/core, improving KVM's accuracy is cheap and easy, with > no real downsides. Never mind, this causes KUT's pmu_pebs test to fail: FAIL: Multiple (0x700000055): No OVF irq, none PEBS records. FAIL: Adaptive (0x1): Multiple (0x700000055): No OVF irq, none PEBS records. FAIL: Adaptive (0x2): Multiple (0x700000055): No OVF irq, none PEBS records. FAIL: Adaptive (0x4): Multiple (0x700000055): No OVF irq, none PEBS records. FAIL: Adaptive (0x1f000008): Multiple (0x700000055): No OVF irq, none PEBS records. FAIL: GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records. FAIL: Multiple (0x700000055): No OVF irq, none PEBS records. FAIL: Adaptive (0x1): GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records. FAIL: Adaptive (0x1): Multiple (0x700000055): No OVF irq, none PEBS records. FAIL: Adaptive (0x2): GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records. FAIL: Adaptive (0x2): Multiple (0x700000055): No OVF irq, none PEBS records. FAIL: Adaptive (0x4): GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records. FAIL: Adaptive (0x4): Multiple (0x700000055): No OVF irq, none PEBS records. FAIL: Adaptive (0x1f000008): GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records. FAIL: Adaptive (0x1f000008): Multiple (0x700000055): No OVF irq, none PEBS records. FAIL: Multiple (0x700000055): No OVF irq, none PEBS records. FAIL: Adaptive (0x1): Multiple (0x700000055): No OVF irq, none PEBS records. FAIL: Adaptive (0x2): Multiple (0x700000055): No OVF irq, none PEBS records. FAIL: Adaptive (0x4): Multiple (0x700000055): No OVF irq, none PEBS records. FAIL: Adaptive (0x1f000008): Multiple (0x700000055): No OVF irq, none PEBS records. It might be a test bug, but I have neither the time nor the inclination to investigate. Like, If you want any chance of your patches going anywhere but my trash folder, you need to change your upstream workflow to actually run tests. I would give most people the benefit of the doubt, e.g. assume they didn't have the requisite hardware, or didn't realize which tests would be relevant/important. But this is a recurring problem, and you have been warned, multiple times.
On 7/2/2024 5:08 am, Sean Christopherson wrote: > On Tue, Feb 06, 2024, Sean Christopherson wrote: >> +Oliver >> >> On Wed, Dec 06, 2023, Like Xu wrote: >>> Note that when vm-exit is indeed triggered by PMI and before HANDLING_NMI >>> is cleared, it's also still possible that another PMI is generated on host. >>> Also for perf/core timer mode, the false positives are still possible since >>> that non-NMI sources of interrupts are not always being used by perf/core. >>> In both cases above, perf/core should correctly distinguish between real >>> RIP sources or even need to generate two samples, belonging to host and >>> guest separately, but that's perf/core's story for interested warriors. >> >> Oliver has a patch[*] that he promised he would send "soon" (wink wink) to >> properly fix events that are configured to exclude the guest. Unless someone >> objects, I'm going to tweak the last part of the changelog to be: >> >> Note that when VM-exit is indeed triggered by PMI and before HANDLING_NMI >> is cleared, it's also still possible that another PMI is generated on host. >> Also for perf/core timer mode, the false positives are still possible since >> that non-NMI sources of interrupts are not always being used by perf/core. >> >> For events that are host-only, perf/core can and should eliminate false >> positives by checking event->attr.exclude_guest, i.e. events that are >> configured to exclude KVM guests should never fire in the guest. >> >> Events that are configured to count host and guest are trickier, perhaps >> impossible to handle with 100% accuracy? And regardless of what accuracy >> is provided by perf/core, improving KVM's accuracy is cheap and easy, with >> no real downsides. > > Never mind, this causes KUT's pmu_pebs test to fail: > > FAIL: Multiple (0x700000055): No OVF irq, none PEBS records. > FAIL: Adaptive (0x1): Multiple (0x700000055): No OVF irq, none PEBS records. > FAIL: Adaptive (0x2): Multiple (0x700000055): No OVF irq, none PEBS records. > FAIL: Adaptive (0x4): Multiple (0x700000055): No OVF irq, none PEBS records. > FAIL: Adaptive (0x1f000008): Multiple (0x700000055): No OVF irq, none PEBS records. > FAIL: GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records. > FAIL: Multiple (0x700000055): No OVF irq, none PEBS records. > FAIL: Adaptive (0x1): GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records. > FAIL: Adaptive (0x1): Multiple (0x700000055): No OVF irq, none PEBS records. > FAIL: Adaptive (0x2): GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records. > FAIL: Adaptive (0x2): Multiple (0x700000055): No OVF irq, none PEBS records. > FAIL: Adaptive (0x4): GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records. > FAIL: Adaptive (0x4): Multiple (0x700000055): No OVF irq, none PEBS records. > FAIL: Adaptive (0x1f000008): GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records. > FAIL: Adaptive (0x1f000008): Multiple (0x700000055): No OVF irq, none PEBS records. > FAIL: Multiple (0x700000055): No OVF irq, none PEBS records. > FAIL: Adaptive (0x1): Multiple (0x700000055): No OVF irq, none PEBS records. > FAIL: Adaptive (0x2): Multiple (0x700000055): No OVF irq, none PEBS records. > FAIL: Adaptive (0x4): Multiple (0x700000055): No OVF irq, none PEBS records. > FAIL: Adaptive (0x1f000008): Multiple (0x700000055): No OVF irq, none PEBS records. > > It might be a test bug, but I have neither the time nor the inclination to > investigate. For PEBS ovf case, we have "in_nmi() = 0x100000" from the core kernel and the following diff fixes the issue: diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 995760ba072f..dcf665251fce 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1891,7 +1891,7 @@ enum kvm_intr_type { /* Enable perf NMI and timer modes to work, and minimise false positives. */ #define kvm_arch_pmi_in_guest(vcpu) \ ((vcpu) && (vcpu)->arch.handling_intr_from_guest && \ - (in_nmi() == ((vcpu)->arch.handling_intr_from_guest == KVM_HANDLING_NMI))) + (!!in_nmi() == ((vcpu)->arch.handling_intr_from_guest == KVM_HANDLING_NMI))) void __init kvm_mmu_x86_module_init(void); int kvm_mmu_vendor_module_init(void); , does it help (tests passed on ICX) ? > > > Like, > > If you want any chance of your patches going anywhere but my trash folder, you > need to change your upstream workflow to actually run tests. I would give most > people the benefit of the doubt, e.g. assume they didn't have the requisite > hardware, or didn't realize which tests would be relevant/important. But this > is a recurring problem, and you have been warned, multiple times. Sorry, my CI resources are diverted to other downstream projects. But there's no doubt it's my fault and this behavior will be corrected.
On Sun, Feb 18, 2024, Like Xu wrote: > On 7/2/2024 5:08 am, Sean Christopherson wrote: > > On Tue, Feb 06, 2024, Sean Christopherson wrote: > > Never mind, this causes KUT's pmu_pebs test to fail: > > > > FAIL: Multiple (0x700000055): No OVF irq, none PEBS records. > > FAIL: Adaptive (0x1): Multiple (0x700000055): No OVF irq, none PEBS records. > > FAIL: Adaptive (0x2): Multiple (0x700000055): No OVF irq, none PEBS records. > > FAIL: Adaptive (0x4): Multiple (0x700000055): No OVF irq, none PEBS records. > > FAIL: Adaptive (0x1f000008): Multiple (0x700000055): No OVF irq, none PEBS records. > > FAIL: GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records. > > FAIL: Multiple (0x700000055): No OVF irq, none PEBS records. > > FAIL: Adaptive (0x1): GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records. > > FAIL: Adaptive (0x1): Multiple (0x700000055): No OVF irq, none PEBS records. > > FAIL: Adaptive (0x2): GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records. > > FAIL: Adaptive (0x2): Multiple (0x700000055): No OVF irq, none PEBS records. > > FAIL: Adaptive (0x4): GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records. > > FAIL: Adaptive (0x4): Multiple (0x700000055): No OVF irq, none PEBS records. > > FAIL: Adaptive (0x1f000008): GP counter 0 (0xfffffffffffe): No OVF irq, none PEBS records. > > FAIL: Adaptive (0x1f000008): Multiple (0x700000055): No OVF irq, none PEBS records. > > FAIL: Multiple (0x700000055): No OVF irq, none PEBS records. > > FAIL: Adaptive (0x1): Multiple (0x700000055): No OVF irq, none PEBS records. > > FAIL: Adaptive (0x2): Multiple (0x700000055): No OVF irq, none PEBS records. > > FAIL: Adaptive (0x4): Multiple (0x700000055): No OVF irq, none PEBS records. > > FAIL: Adaptive (0x1f000008): Multiple (0x700000055): No OVF irq, none PEBS records. > > > > It might be a test bug, but I have neither the time nor the inclination to > > investigate. > > For PEBS ovf case, we have "in_nmi() = 0x100000" from the core kernel and > the following diff fixes the issue: > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index 995760ba072f..dcf665251fce 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -1891,7 +1891,7 @@ enum kvm_intr_type { > /* Enable perf NMI and timer modes to work, and minimise false positives. */ > #define kvm_arch_pmi_in_guest(vcpu) \ > ((vcpu) && (vcpu)->arch.handling_intr_from_guest && \ > - (in_nmi() == ((vcpu)->arch.handling_intr_from_guest == KVM_HANDLING_NMI))) > + (!!in_nmi() == ((vcpu)->arch.handling_intr_from_guest == KVM_HANDLING_NMI))) > > void __init kvm_mmu_x86_module_init(void); > int kvm_mmu_vendor_module_init(void); > > , does it help (tests passed on ICX) ? Yes, that resolves the issues I was seeing. I'll get this applied with the above squashed. I'll also see if the tip tree folks would be open to converting the in_{nmi,hardirq,...}() macros to functions that return bools (or at least casting to bools in the macros). I can't see any reason for in_nmi() to effectively return an int since it's just a wrapper to nmi_count(), and this seems like a disaster waiting to happen. > > If you want any chance of your patches going anywhere but my trash folder, you > > need to change your upstream workflow to actually run tests. I would give most > > people the benefit of the doubt, e.g. assume they didn't have the requisite > > hardware, or didn't realize which tests would be relevant/important. But this > > is a recurring problem, and you have been warned, multiple times. > > Sorry, my CI resources are diverted to other downstream projects. > But there's no doubt it's my fault and this behavior will be corrected. Thank you.
On Wed, 06 Dec 2023 11:20:54 +0800, Like Xu wrote: > Explicitly checking the source of external interrupt is indeed NMI and not > other types in the kvm_arch_pmi_in_guest(), which prevents perf-kvm false > positive samples generated in perf/core NMI mode after vm-exit but before > kvm_before_interrupt() from being incorrectly labelled as guest samples: > > # test: perf-record + cpu-cycles:HP (which collects host-only precise samples) > # Symbol Overhead sys usr guest sys guest usr > # ....................................... ........ ........ ........ ......... ......... > # > # Before: > [g] entry_SYSCALL_64 24.63% 0.00% 0.00% 24.63% 0.00% > [g] syscall_return_via_sysret 23.23% 0.00% 0.00% 23.23% 0.00% > [g] files_lookup_fd_raw 6.35% 0.00% 0.00% 6.35% 0.00% > # After: > [k] perf_adjust_freq_unthr_context 57.23% 57.23% 0.00% 0.00% 0.00% > [k] __vmx_vcpu_run 4.09% 4.09% 0.00% 0.00% 0.00% > [k] vmx_update_host_rsp 3.17% 3.17% 0.00% 0.00% 0.00% > > [...] Applied to kvm-x86 pmu, with the !!in_nmi() fixup squashed. Thanks! [1/1] KVM: x86/intr: Explicitly check NMI from guest to eliminate false positives https://github.com/kvm-x86/linux/commit/812d432373f6 -- https://github.com/kvm-x86/linux/tree/next
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index c8c7e2475a18..167d592e08d0 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1868,8 +1868,16 @@ static inline int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, } #endif /* CONFIG_HYPERV */ +enum kvm_intr_type { + /* Values are arbitrary, but must be non-zero. */ + KVM_HANDLING_IRQ = 1, + KVM_HANDLING_NMI, +}; + +/* Enable perf NMI and timer modes to work, and minimise false positives. */ #define kvm_arch_pmi_in_guest(vcpu) \ - ((vcpu) && (vcpu)->arch.handling_intr_from_guest) + ((vcpu) && (vcpu)->arch.handling_intr_from_guest && \ + (in_nmi() == ((vcpu)->arch.handling_intr_from_guest == KVM_HANDLING_NMI))) void __init kvm_mmu_x86_module_init(void); int kvm_mmu_vendor_module_init(void); diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index 2f7e19166658..4dc38092d599 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -431,12 +431,6 @@ static inline bool kvm_notify_vmexit_enabled(struct kvm *kvm) return kvm->arch.notify_vmexit_flags & KVM_X86_NOTIFY_VMEXIT_ENABLED; } -enum kvm_intr_type { - /* Values are arbitrary, but must be non-zero. */ - KVM_HANDLING_IRQ = 1, - KVM_HANDLING_NMI, -}; - static __always_inline void kvm_before_interrupt(struct kvm_vcpu *vcpu, enum kvm_intr_type intr) {