diff mbox series

[bpf-next,1/3] perf: enable branch record for software events

Message ID 20210824060157.3889139-2-songliubraving@fb.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Headers show
Series bpf: introduce bpf_get_branch_trace | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for bpf-next
netdev/subject_prefix success Link
netdev/cc_maintainers warning 17 maintainers not CCed: andrii@kernel.org kafai@fb.com yhs@fb.com john.fastabend@gmail.com jolsa@redhat.com daniel@iogearbox.net bp@alien8.de mark.rutland@arm.com tglx@linutronix.de kpsingh@kernel.org ast@kernel.org alexander.shishkin@linux.intel.com hpa@zytor.com linux-perf-users@vger.kernel.org netdev@vger.kernel.org namhyung@kernel.org x86@kernel.org
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 2461 this patch: 2461
netdev/kdoc success Errors and warnings before: 181 this patch: 181
netdev/verify_fixes success Link
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 109 lines checked
netdev/build_allmodconfig_warn success Errors and warnings before: 2388 this patch: 2388
netdev/header_inline success Link
bpf/vmtest success Kernel LATEST + selftests

Commit Message

Song Liu Aug. 24, 2021, 6:01 a.m. UTC
The typical way to access branch record (e.g. Intel LBR) is via hardware
perf_event. For CPUs with FREEZE_LBRS_ON_PMI support, PMI could capture
reliable LBR. On the other hand, LBR could also be useful in non-PMI
scenario. For example, in kretprobe or bpf fexit program, LBR could
provide a lot of information on what happened with the function. Add API
to use branch record for software use.

Note that, when the software event triggers, it is necessary to stop the
branch record hardware asap. Therefore, static_call is used to remove some
branch instructions in this process.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 arch/x86/events/intel/core.c |  5 ++++-
 arch/x86/events/intel/lbr.c  | 12 ++++++++++++
 arch/x86/events/perf_event.h |  2 ++
 include/linux/perf_event.h   | 33 +++++++++++++++++++++++++++++++++
 kernel/events/core.c         | 28 ++++++++++++++++++++++++++++
 5 files changed, 79 insertions(+), 1 deletion(-)

Comments

Peter Zijlstra Aug. 25, 2021, 12:09 p.m. UTC | #1
On Mon, Aug 23, 2021 at 11:01:55PM -0700, Song Liu wrote:

>  arch/x86/events/intel/core.c |  5 ++++-
>  arch/x86/events/intel/lbr.c  | 12 ++++++++++++
>  arch/x86/events/perf_event.h |  2 ++
>  include/linux/perf_event.h   | 33 +++++++++++++++++++++++++++++++++
>  kernel/events/core.c         | 28 ++++++++++++++++++++++++++++
>  5 files changed, 79 insertions(+), 1 deletion(-)

No PowerPC support :/

> +void intel_pmu_snapshot_branch_stack(void)
> +{
> +	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> +
> +	intel_pmu_lbr_disable_all();
> +	intel_pmu_lbr_read();
> +	memcpy(this_cpu_ptr(&perf_branch_snapshot_entries), cpuc->lbr_entries,
> +	       sizeof(struct perf_branch_entry) * x86_pmu.lbr_nr);
> +	*this_cpu_ptr(&perf_branch_snapshot_size) = x86_pmu.lbr_nr;
> +	intel_pmu_lbr_enable_all(false);
> +}

Still has the layering violation and issues vs PMI.

> +#ifdef CONFIG_HAVE_STATIC_CALL
> +DECLARE_STATIC_CALL(perf_snapshot_branch_stack,
> +		    perf_default_snapshot_branch_stack);
> +#else
> +extern void (*perf_snapshot_branch_stack)(void);
> +#endif

That's weird, static call should work unconditionally, and fall back to
a regular function pointer exactly like you do here. Search for:
"Generic Implementation" in include/linux/static_call.h

> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 011cc5069b7ba..b42cc20451709 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c

> +#ifdef CONFIG_HAVE_STATIC_CALL
> +DEFINE_STATIC_CALL(perf_snapshot_branch_stack,
> +		   perf_default_snapshot_branch_stack);
> +#else
> +void (*perf_snapshot_branch_stack)(void) = perf_default_snapshot_branch_stack;
> +#endif

Idem.

Something like:

DEFINE_STATIC_CALL_NULL(perf_snapshot_branch_stack, void (*)(void));

with usage like: static_call_cond(perf_snapshot_branch_stack)();

Should unconditionally work.

> +int perf_read_branch_snapshot(void *buf, size_t len)
> +{
> +	int cnt;
> +
> +	memcpy(buf, *this_cpu_ptr(&perf_branch_snapshot_entries),
> +	       min_t(u32, (u32)len,
> +		     sizeof(struct perf_branch_entry) * MAX_BRANCH_SNAPSHOT));
> +	cnt =  *this_cpu_ptr(&perf_branch_snapshot_size);
> +
> +	return (cnt > 0) ? cnt : -EOPNOTSUPP;
> +}

Doesn't seem used at all..
Song Liu Aug. 25, 2021, 3:22 p.m. UTC | #2
> On Aug 25, 2021, at 5:09 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Mon, Aug 23, 2021 at 11:01:55PM -0700, Song Liu wrote:
> 
>> arch/x86/events/intel/core.c |  5 ++++-
>> arch/x86/events/intel/lbr.c  | 12 ++++++++++++
>> arch/x86/events/perf_event.h |  2 ++
>> include/linux/perf_event.h   | 33 +++++++++++++++++++++++++++++++++
>> kernel/events/core.c         | 28 ++++++++++++++++++++++++++++
>> 5 files changed, 79 insertions(+), 1 deletion(-)
> 
> No PowerPC support :/

I don't have PowerPC system for testing at the moment. I guess we can decide
the overall framework now, and ask PowerPC folks' help on PowerPC support
later? 

> 
>> +void intel_pmu_snapshot_branch_stack(void)
>> +{
>> +	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>> +
>> +	intel_pmu_lbr_disable_all();
>> +	intel_pmu_lbr_read();
>> +	memcpy(this_cpu_ptr(&perf_branch_snapshot_entries), cpuc->lbr_entries,
>> +	       sizeof(struct perf_branch_entry) * x86_pmu.lbr_nr);
>> +	*this_cpu_ptr(&perf_branch_snapshot_size) = x86_pmu.lbr_nr;
>> +	intel_pmu_lbr_enable_all(false);
>> +}
> 
> Still has the layering violation and issues vs PMI.

Yes, this is the biggest change after I test with this more. I tested with 
perf_[disable|enable]_pmu(), and function pointer in "struct pmu". However,
all these logic consumes LBR entries. In one of the version, 22 out of the
32 LBR entries are branches after the fexit event. Most of them are from
perf_disable_pmu(). And each function pointer consumes 1 or 2 entries. 
This would be worse for systems with fewer LBR entries. 

On the other hand, I think current version was not too bad. It may corrupt
some samples when there is collision between this and PMI. But it should not
cause serious issues. Did I miss anything more serious? 

> 
>> +#ifdef CONFIG_HAVE_STATIC_CALL
>> +DECLARE_STATIC_CALL(perf_snapshot_branch_stack,
>> +		    perf_default_snapshot_branch_stack);
>> +#else
>> +extern void (*perf_snapshot_branch_stack)(void);
>> +#endif
> 
> That's weird, static call should work unconditionally, and fall back to
> a regular function pointer exactly like you do here. Search for:
> "Generic Implementation" in include/linux/static_call.h

Thanks for the pointer. Let me look into it. 
> 
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 011cc5069b7ba..b42cc20451709 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
> 
>> +#ifdef CONFIG_HAVE_STATIC_CALL
>> +DEFINE_STATIC_CALL(perf_snapshot_branch_stack,
>> +		   perf_default_snapshot_branch_stack);
>> +#else
>> +void (*perf_snapshot_branch_stack)(void) = perf_default_snapshot_branch_stack;
>> +#endif
> 
> Idem.
> 
> Something like:
> 
> DEFINE_STATIC_CALL_NULL(perf_snapshot_branch_stack, void (*)(void));
> 
> with usage like: static_call_cond(perf_snapshot_branch_stack)();
> 
> Should unconditionally work.
> 
>> +int perf_read_branch_snapshot(void *buf, size_t len)
>> +{
>> +	int cnt;
>> +
>> +	memcpy(buf, *this_cpu_ptr(&perf_branch_snapshot_entries),
>> +	       min_t(u32, (u32)len,
>> +		     sizeof(struct perf_branch_entry) * MAX_BRANCH_SNAPSHOT));
>> +	cnt =  *this_cpu_ptr(&perf_branch_snapshot_size);
>> +
>> +	return (cnt > 0) ? cnt : -EOPNOTSUPP;
>> +}
> 
> Doesn't seem used at all..

At the moment, we only use this from BPF side (see 2/3). We sure can use it
from perf side, but that would require discussions on the user interface. 
How about we have that discussion later? 

Thanks,
Song
kajoljain Aug. 26, 2021, 7:56 a.m. UTC | #3
On 8/25/21 8:52 PM, Song Liu wrote:
> 
> 
>> On Aug 25, 2021, at 5:09 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> On Mon, Aug 23, 2021 at 11:01:55PM -0700, Song Liu wrote:
>>
>>> arch/x86/events/intel/core.c |  5 ++++-
>>> arch/x86/events/intel/lbr.c  | 12 ++++++++++++
>>> arch/x86/events/perf_event.h |  2 ++
>>> include/linux/perf_event.h   | 33 +++++++++++++++++++++++++++++++++
>>> kernel/events/core.c         | 28 ++++++++++++++++++++++++++++
>>> 5 files changed, 79 insertions(+), 1 deletion(-)
>>
>> No PowerPC support :/
> 
> I don't have PowerPC system for testing at the moment. I guess we can decide
> the overall framework now, and ask PowerPC folks' help on PowerPC support
> later? 

Hi Song,
   I will look at powerpc side to enable this.

Thanks,
Kajol Jain

> 
>>
>>> +void intel_pmu_snapshot_branch_stack(void)
>>> +{
>>> +	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>>> +
>>> +	intel_pmu_lbr_disable_all();
>>> +	intel_pmu_lbr_read();
>>> +	memcpy(this_cpu_ptr(&perf_branch_snapshot_entries), cpuc->lbr_entries,
>>> +	       sizeof(struct perf_branch_entry) * x86_pmu.lbr_nr);
>>> +	*this_cpu_ptr(&perf_branch_snapshot_size) = x86_pmu.lbr_nr;
>>> +	intel_pmu_lbr_enable_all(false);
>>> +}
>>
>> Still has the layering violation and issues vs PMI.
> 
> Yes, this is the biggest change after I test with this more. I tested with 
> perf_[disable|enable]_pmu(), and function pointer in "struct pmu". However,
> all these logic consumes LBR entries. In one of the version, 22 out of the
> 32 LBR entries are branches after the fexit event. Most of them are from
> perf_disable_pmu(). And each function pointer consumes 1 or 2 entries. 
> This would be worse for systems with fewer LBR entries. 
> 
> On the other hand, I think current version was not too bad. It may corrupt
> some samples when there is collision between this and PMI. But it should not
> cause serious issues. Did I miss anything more serious? 
> 
>>
>>> +#ifdef CONFIG_HAVE_STATIC_CALL
>>> +DECLARE_STATIC_CALL(perf_snapshot_branch_stack,
>>> +		    perf_default_snapshot_branch_stack);
>>> +#else
>>> +extern void (*perf_snapshot_branch_stack)(void);
>>> +#endif
>>
>> That's weird, static call should work unconditionally, and fall back to
>> a regular function pointer exactly like you do here. Search for:
>> "Generic Implementation" in include/linux/static_call.h
> 
> Thanks for the pointer. Let me look into it. 
>>
>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>> index 011cc5069b7ba..b42cc20451709 100644
>>> --- a/kernel/events/core.c
>>> +++ b/kernel/events/core.c
>>
>>> +#ifdef CONFIG_HAVE_STATIC_CALL
>>> +DEFINE_STATIC_CALL(perf_snapshot_branch_stack,
>>> +		   perf_default_snapshot_branch_stack);
>>> +#else
>>> +void (*perf_snapshot_branch_stack)(void) = perf_default_snapshot_branch_stack;
>>> +#endif
>>
>> Idem.
>>
>> Something like:
>>
>> DEFINE_STATIC_CALL_NULL(perf_snapshot_branch_stack, void (*)(void));
>>
>> with usage like: static_call_cond(perf_snapshot_branch_stack)();
>>
>> Should unconditionally work.
>>
>>> +int perf_read_branch_snapshot(void *buf, size_t len)
>>> +{
>>> +	int cnt;
>>> +
>>> +	memcpy(buf, *this_cpu_ptr(&perf_branch_snapshot_entries),
>>> +	       min_t(u32, (u32)len,
>>> +		     sizeof(struct perf_branch_entry) * MAX_BRANCH_SNAPSHOT));
>>> +	cnt =  *this_cpu_ptr(&perf_branch_snapshot_size);
>>> +
>>> +	return (cnt > 0) ? cnt : -EOPNOTSUPP;
>>> +}
>>
>> Doesn't seem used at all..
> 
> At the moment, we only use this from BPF side (see 2/3). We sure can use it
> from perf side, but that would require discussions on the user interface. 
> How about we have that discussion later? 
> 
> Thanks,
> Song
>
Song Liu Aug. 26, 2021, 4:41 p.m. UTC | #4
> On Aug 26, 2021, at 12:56 AM, kajoljain <kjain@linux.ibm.com> wrote:
> 
> 
> 
> On 8/25/21 8:52 PM, Song Liu wrote:
>> 
>> 
>>> On Aug 25, 2021, at 5:09 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> 
>>> On Mon, Aug 23, 2021 at 11:01:55PM -0700, Song Liu wrote:
>>> 
>>>> arch/x86/events/intel/core.c |  5 ++++-
>>>> arch/x86/events/intel/lbr.c  | 12 ++++++++++++
>>>> arch/x86/events/perf_event.h |  2 ++
>>>> include/linux/perf_event.h   | 33 +++++++++++++++++++++++++++++++++
>>>> kernel/events/core.c         | 28 ++++++++++++++++++++++++++++
>>>> 5 files changed, 79 insertions(+), 1 deletion(-)
>>> 
>>> No PowerPC support :/
>> 
>> I don't have PowerPC system for testing at the moment. I guess we can decide
>> the overall framework now, and ask PowerPC folks' help on PowerPC support
>> later? 
> 
> Hi Song,
>   I will look at powerpc side to enable this.
> 
> Thanks,
> Kajol Jain

Thanks Kajol! 

Let me address Peter's other comments and send v2. We can then merge in  
PowerPC support.

Song
diff mbox series

Patch

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index ac6fd2dabf6a2..a29649e7241cc 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -6283,8 +6283,11 @@  __init int intel_pmu_init(void)
 			x86_pmu.lbr_nr = 0;
 	}
 
-	if (x86_pmu.lbr_nr)
+	if (x86_pmu.lbr_nr) {
 		pr_cont("%d-deep LBR, ", x86_pmu.lbr_nr);
+		static_call_update(perf_snapshot_branch_stack,
+				   intel_pmu_snapshot_branch_stack);
+	}
 
 	intel_pmu_check_extra_regs(x86_pmu.extra_regs);
 
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 9e6d6eaeb4cb6..b73b444cf229d 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -1862,3 +1862,15 @@  EXPORT_SYMBOL_GPL(x86_perf_get_lbr);
 struct event_constraint vlbr_constraint =
 	__EVENT_CONSTRAINT(INTEL_FIXED_VLBR_EVENT, (1ULL << INTEL_PMC_IDX_FIXED_VLBR),
 			  FIXED_EVENT_FLAGS, 1, 0, PERF_X86_EVENT_LBR_SELECT);
+
+void intel_pmu_snapshot_branch_stack(void)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	intel_pmu_lbr_disable_all();
+	intel_pmu_lbr_read();
+	memcpy(this_cpu_ptr(&perf_branch_snapshot_entries), cpuc->lbr_entries,
+	       sizeof(struct perf_branch_entry) * x86_pmu.lbr_nr);
+	*this_cpu_ptr(&perf_branch_snapshot_size) = x86_pmu.lbr_nr;
+	intel_pmu_lbr_enable_all(false);
+}
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index e3ac05c97b5e5..5262083f4e13b 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1379,6 +1379,8 @@  void intel_pmu_pebs_data_source_skl(bool pmem);
 
 int intel_pmu_setup_lbr_filter(struct perf_event *event);
 
+void intel_pmu_snapshot_branch_stack(void);
+
 void intel_pt_interrupt(void);
 
 int intel_bts_interrupt(void);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fe156a8170aa3..7cd2af7c5eda6 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -57,6 +57,7 @@  struct perf_guest_info_callbacks {
 #include <linux/cgroup.h>
 #include <linux/refcount.h>
 #include <linux/security.h>
+#include <linux/static_call.h>
 #include <asm/local.h>
 
 struct perf_callchain_entry {
@@ -1612,4 +1613,36 @@  extern void __weak arch_perf_update_userpage(struct perf_event *event,
 extern __weak u64 arch_perf_get_page_size(struct mm_struct *mm, unsigned long addr);
 #endif
 
+/*
+ * Snapshot branch stack on software events.
+ *
+ * Branch stack can be very useful in understanding software events. For
+ * example, when a long function, e.g. sys_perf_event_open, returns an
+ * errno, it is not obvious why the function failed. Branch stack could
+ * provide very helpful information in this type of scenarios.
+ *
+ * On software event, it is necessary to stop the hardware branch recorder
+ * fast. Otherwise, the hardware register/buffer will be flushed with
+ * entries af the triggering event. Therefore, static call is used to
+ * stop the hardware recorder.
+ *
+ * To use the snapshot:
+ * 1) After the event triggers, call perf_snapshot_branch_stack asap;
+ * 2) On the same cpu, access the snapshot with perf_read_branch_snapshot;
+ */
+#define MAX_BRANCH_SNAPSHOT 32
+DECLARE_PER_CPU(struct perf_branch_entry,
+		perf_branch_snapshot_entries[MAX_BRANCH_SNAPSHOT]);
+DECLARE_PER_CPU(int, perf_branch_snapshot_size);
+
+void perf_default_snapshot_branch_stack(void);
+
+#ifdef CONFIG_HAVE_STATIC_CALL
+DECLARE_STATIC_CALL(perf_snapshot_branch_stack,
+		    perf_default_snapshot_branch_stack);
+#else
+extern void (*perf_snapshot_branch_stack)(void);
+#endif
+
+int perf_read_branch_snapshot(void *buf, size_t len);
 #endif /* _LINUX_PERF_EVENT_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 011cc5069b7ba..b42cc20451709 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -13437,3 +13437,31 @@  struct cgroup_subsys perf_event_cgrp_subsys = {
 	.threaded	= true,
 };
 #endif /* CONFIG_CGROUP_PERF */
+
+DEFINE_PER_CPU(struct perf_branch_entry,
+	       perf_branch_snapshot_entries[MAX_BRANCH_SNAPSHOT]);
+DEFINE_PER_CPU(int, perf_branch_snapshot_size);
+
+void perf_default_snapshot_branch_stack(void)
+{
+	*this_cpu_ptr(&perf_branch_snapshot_size) = 0;
+}
+
+#ifdef CONFIG_HAVE_STATIC_CALL
+DEFINE_STATIC_CALL(perf_snapshot_branch_stack,
+		   perf_default_snapshot_branch_stack);
+#else
+void (*perf_snapshot_branch_stack)(void) = perf_default_snapshot_branch_stack;
+#endif
+
+int perf_read_branch_snapshot(void *buf, size_t len)
+{
+	int cnt;
+
+	memcpy(buf, *this_cpu_ptr(&perf_branch_snapshot_entries),
+	       min_t(u32, (u32)len,
+		     sizeof(struct perf_branch_entry) * MAX_BRANCH_SNAPSHOT));
+	cnt =  *this_cpu_ptr(&perf_branch_snapshot_size);
+
+	return (cnt > 0) ? cnt : -EOPNOTSUPP;
+}