diff mbox series

[2/3] perf/core: Set data->sample_flags in perf_prepare_sample()

Message ID 20221229204101.1099430-2-namhyung@kernel.org (mailing list archive)
State Not Applicable
Delegated to: BPF
Headers show
Series [1/3] perf/core: Change the layout of perf_sample_data | expand

Checks

Context Check Description
netdev/tree_selection success Not a local patch
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-7 success Logs for llvm-toolchain
bpf/vmtest-bpf-next-VM_Test-8 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-2 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-5 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-4 success Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for test_maps on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-12 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-14 success Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-15 success Logs for test_progs on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-17 success Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-18 success Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-19 fail Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for test_progs_no_alu32 on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-22 success Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-24 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for test_progs_no_alu32_parallel on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-27 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-29 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-30 success Logs for test_progs_parallel on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-32 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-33 success Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-34 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-35 success Logs for test_verifier on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-37 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-38 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-36 success Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-26 success Logs for test_progs_no_alu32_parallel on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-31 success Logs for test_progs_parallel on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-16 success Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-11 success Logs for test_maps on s390x with gcc

Commit Message

Namhyung Kim Dec. 29, 2022, 8:41 p.m. UTC
The perf_prepare_sample() sets the perf_sample_data according to the
attr->sample_type before copying it to the ring buffer.  But BPF also
wants to access the sample data so it needs to prepare the sample even
before the regular path.

That means the perf_prepare_sample() can be called more than once.  Set
the data->sample_flags consistently so that it can indicate which fields
are set already and skip them if sets.

Mostly it's just a matter of checking filtered_sample_type which is a
bitmask for unset bits in the attr->sample_type.  But some of sample
data is implied by others even if it's not in the attr->sample_type
(like PERF_SAMPLE_ADDR for PERF_SAMPLE_PHYS_ADDR).  So they need to
check data->sample_flags separately.

Also some of them like callchain, user regs/stack and aux data require
more calculations.  Protect them using the data->sample_flags to avoid
the duplicate work.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
Maybe we don't need this change to prevent duplication in favor of the
next patch using the data->saved_size.  But I think it's still useful
to set data->sample_flags consistently.  Anyway it's up to you.

 kernel/events/core.c | 86 ++++++++++++++++++++++++++++++++------------
 1 file changed, 63 insertions(+), 23 deletions(-)

Comments

Peter Zijlstra Jan. 9, 2023, 12:14 p.m. UTC | #1
On Thu, Dec 29, 2022 at 12:41:00PM -0800, Namhyung Kim wrote:

So I like the general idea; I just think it's turned into a bit of a
mess. That is code is already overly branchy which is known to hurt
performance, we should really try and not make it worse than absolutely
needed.

>  kernel/events/core.c | 86 ++++++++++++++++++++++++++++++++------------
>  1 file changed, 63 insertions(+), 23 deletions(-)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index eacc3702654d..70bff8a04583 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -7582,14 +7582,21 @@ void perf_prepare_sample(struct perf_event_header *header,
>  	filtered_sample_type = sample_type & ~data->sample_flags;
>  	__perf_event_header__init_id(header, data, event, filtered_sample_type);
>  
> -	if (sample_type & (PERF_SAMPLE_IP | PERF_SAMPLE_CODE_PAGE_SIZE))
> -		data->ip = perf_instruction_pointer(regs);
> +	if (sample_type & (PERF_SAMPLE_IP | PERF_SAMPLE_CODE_PAGE_SIZE)) {
> +		/* attr.sample_type may not have PERF_SAMPLE_IP */

Right, but that shouldn't matter, IIRC its OK to have more bits set in
data->sample_flags than we have set in attr.sample_type. It just means
we have data available for sample types we're (possibly) not using.

That is, I think you can simply write this like:

> +		if (!(data->sample_flags & PERF_SAMPLE_IP)) {
> +			data->ip = perf_instruction_pointer(regs);
> +			data->sample_flags |= PERF_SAMPLE_IP;
> +		}
> +	}

	if (filtered_sample_type & (PERF_SAMPLE_IP | PERF_SAMPLE_CODE_PAGE_SIZE)) {
		data->ip = perf_instruction_pointer(regs);
		data->sample_flags |= PERF_SAMPLE_IP);
	}

	...

	if (filtered_sample_type & PERF_SAMPLE_CODE_PAGE_SIZE) {
		data->code_page_size = perf_get_page_size(data->ip);
		data->sample_flags |= PERF_SAMPLE_CODE_PAGE_SIZE;
	}

Then after a single perf_prepare_sample() run we have:

  pre			|	post
  ----------------------------------------
  0			|	0
  IP			|	IP
  CODE_PAGE_SIZE	|	IP|CODE_PAGE_SIZE
  IP|CODE_PAGE_SIZE	|	IP|CODE_PAGE_SIZE

So while data->sample_flags will have an extra bit set in the 3rd case,
that will not affect perf_sample_outout() which only looks at data->type
(== attr.sample_type).

And since data->sample_flags will have both bits set, a second run will
filter out both and avoid the extra work (except doing that will mess up
the branch predictors).


>  	if (sample_type & PERF_SAMPLE_CALLCHAIN) {
>  		int size = 1;
>  
> -		if (filtered_sample_type & PERF_SAMPLE_CALLCHAIN)
> +		if (filtered_sample_type & PERF_SAMPLE_CALLCHAIN) {
>  			data->callchain = perf_callchain(event, regs);
> +			data->sample_flags |= PERF_SAMPLE_CALLCHAIN;
> +		}
>  
>  		size += data->callchain->nr;
>  

This, why can't this be:

	if (filtered_sample_type & PERF_SAMPLE_CALLCHAIN) {
		data->callchain = perf_callchain(event, regs);
		data->sample_flags |= PERF_SAMPLE_CALLCHAIN;

		header->size += (1 + data->callchain->nr) * sizeof(u64);
	}

I suppose this is because perf_event_header lives on the stack of the
overflow handler and all that isn't available / relevant for the BPF
thing.

And we can't pull that out into anther function without adding yet
another branch fest.

However; inspired by your next patch; we can do something like so:

	if (filtered_sample_type & PERF_SAMPLE_CALLCHAIN) {
		data->callchain = perf_callchain(event, regs);
		data->sample_flags |= PERF_SAMPLE_CALLCHAIN;

		data->size += (1 + data->callchain->nr) * sizeof(u64);
	}

And then have __perf_event_output() (or something thereabout) do:

	perf_prepare_sample(data, event, regs);
	perf_prepare_header(&header, data, event);
	err = output_begin(&handle, data, event, header.size);
	if (err)
		goto exit;
	perf_output_sample(&handle, &header, data, event);
	perf_output_end(&handle);

With perf_prepare_header() being something like:

	header->type = PERF_RECORD_SAMPLE;
	header->size = sizeof(*header) + event->header_size + data->size;
	header->misc = perf_misc_flags(regs);
	...

Hmm ?

(same for all the other sites)
Namhyung Kim Jan. 9, 2023, 8:21 p.m. UTC | #2
Hi Peter,

On Mon, Jan 09, 2023 at 01:14:31PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 29, 2022 at 12:41:00PM -0800, Namhyung Kim wrote:
> 
> So I like the general idea; I just think it's turned into a bit of a
> mess. That is code is already overly branchy which is known to hurt
> performance, we should really try and not make it worse than absolutely
> needed.

Agreed.

> 
> >  kernel/events/core.c | 86 ++++++++++++++++++++++++++++++++------------
> >  1 file changed, 63 insertions(+), 23 deletions(-)
> > 
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index eacc3702654d..70bff8a04583 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -7582,14 +7582,21 @@ void perf_prepare_sample(struct perf_event_header *header,
> >  	filtered_sample_type = sample_type & ~data->sample_flags;
> >  	__perf_event_header__init_id(header, data, event, filtered_sample_type);
> >  
> > -	if (sample_type & (PERF_SAMPLE_IP | PERF_SAMPLE_CODE_PAGE_SIZE))
> > -		data->ip = perf_instruction_pointer(regs);
> > +	if (sample_type & (PERF_SAMPLE_IP | PERF_SAMPLE_CODE_PAGE_SIZE)) {
> > +		/* attr.sample_type may not have PERF_SAMPLE_IP */
> 
> Right, but that shouldn't matter, IIRC its OK to have more bits set in
> data->sample_flags than we have set in attr.sample_type. It just means
> we have data available for sample types we're (possibly) not using.
> 
> That is, I think you can simply write this like:
> 
> > +		if (!(data->sample_flags & PERF_SAMPLE_IP)) {
> > +			data->ip = perf_instruction_pointer(regs);
> > +			data->sample_flags |= PERF_SAMPLE_IP;
> > +		}
> > +	}
> 
> 	if (filtered_sample_type & (PERF_SAMPLE_IP | PERF_SAMPLE_CODE_PAGE_SIZE)) {
> 		data->ip = perf_instruction_pointer(regs);
> 		data->sample_flags |= PERF_SAMPLE_IP);
> 	}
> 
> 	...
> 
> 	if (filtered_sample_type & PERF_SAMPLE_CODE_PAGE_SIZE) {
> 		data->code_page_size = perf_get_page_size(data->ip);
> 		data->sample_flags |= PERF_SAMPLE_CODE_PAGE_SIZE;
> 	}
> 
> Then after a single perf_prepare_sample() run we have:
> 
>   pre			|	post
>   ----------------------------------------
>   0			|	0
>   IP			|	IP
>   CODE_PAGE_SIZE	|	IP|CODE_PAGE_SIZE
>   IP|CODE_PAGE_SIZE	|	IP|CODE_PAGE_SIZE
> 
> So while data->sample_flags will have an extra bit set in the 3rd case,
> that will not affect perf_sample_outout() which only looks at data->type
> (== attr.sample_type).
> 
> And since data->sample_flags will have both bits set, a second run will
> filter out both and avoid the extra work (except doing that will mess up
> the branch predictors).

Yeah, it'd be better to check filtered_sample_type in the first place.

Btw, I was thinking about a hypothetical scenario that IP set by a PMU
driver not from the regs.  In this case, having CODE_PAGE_SIZE will
overwrite the IP.  I don't think we need to worry about that for now
since PMU drivers updates the regs (using set_linear_ip).  But it seems
like a possible scenario for something like PEBS or IBS.

> 
> 
> >  	if (sample_type & PERF_SAMPLE_CALLCHAIN) {
> >  		int size = 1;
> >  
> > -		if (filtered_sample_type & PERF_SAMPLE_CALLCHAIN)
> > +		if (filtered_sample_type & PERF_SAMPLE_CALLCHAIN) {
> >  			data->callchain = perf_callchain(event, regs);
> > +			data->sample_flags |= PERF_SAMPLE_CALLCHAIN;
> > +		}
> >  
> >  		size += data->callchain->nr;
> >  
> 
> This, why can't this be:
> 
> 	if (filtered_sample_type & PERF_SAMPLE_CALLCHAIN) {
> 		data->callchain = perf_callchain(event, regs);
> 		data->sample_flags |= PERF_SAMPLE_CALLCHAIN;
> 
> 		header->size += (1 + data->callchain->nr) * sizeof(u64);
> 	}
> 
> I suppose this is because perf_event_header lives on the stack of the
> overflow handler and all that isn't available / relevant for the BPF
> thing.

Right, it needs to calculate the data size for each sample data.

> 
> And we can't pull that out into anther function without adding yet
> another branch fest.
> 
> However; inspired by your next patch; we can do something like so:
> 
> 	if (filtered_sample_type & PERF_SAMPLE_CALLCHAIN) {
> 		data->callchain = perf_callchain(event, regs);
> 		data->sample_flags |= PERF_SAMPLE_CALLCHAIN;
> 
> 		data->size += (1 + data->callchain->nr) * sizeof(u64);
> 	}

This is fine as long as all other places (like in PMU drivers) set the
callchain update the sample data size accordingly.  If not, we can get
the callchain but the data size will be wrong.

> 
> And then have __perf_event_output() (or something thereabout) do:
> 
> 	perf_prepare_sample(data, event, regs);
> 	perf_prepare_header(&header, data, event);
> 	err = output_begin(&handle, data, event, header.size);
> 	if (err)
> 		goto exit;
> 	perf_output_sample(&handle, &header, data, event);
> 	perf_output_end(&handle);
> 
> With perf_prepare_header() being something like:
> 
> 	header->type = PERF_RECORD_SAMPLE;
> 	header->size = sizeof(*header) + event->header_size + data->size;
> 	header->misc = perf_misc_flags(regs);
> 	...
> 
> Hmm ?
> 
> (same for all the other sites)

Looks good.  But I'm confused by the tip-bot2 messages saying it's
merged.  Do you want me to work on it as a follow up?

Thanks,
Namhyung
Peter Zijlstra Jan. 10, 2023, 10:54 a.m. UTC | #3
On Mon, Jan 09, 2023 at 12:21:25PM -0800, Namhyung Kim wrote:

> Looks good.  But I'm confused by the tip-bot2 messages saying it's
> merged.  Do you want me to work on it as a follow up?

Ingo and me talked past one another, I agreed with 1/3 and he applied
the whole series. Just talked to him again and he's just zapped these
last two patches.

Sorry for the confusion.
Peter Zijlstra Jan. 10, 2023, 10:55 a.m. UTC | #4
On Mon, Jan 09, 2023 at 12:21:25PM -0800, Namhyung Kim wrote:

> > However; inspired by your next patch; we can do something like so:
> > 
> > 	if (filtered_sample_type & PERF_SAMPLE_CALLCHAIN) {
> > 		data->callchain = perf_callchain(event, regs);
> > 		data->sample_flags |= PERF_SAMPLE_CALLCHAIN;
> > 
> > 		data->size += (1 + data->callchain->nr) * sizeof(u64);
> > 	}
> 
> This is fine as long as all other places (like in PMU drivers) set the
> callchain update the sample data size accordingly.  If not, we can get
> the callchain but the data size will be wrong.

Good point, maybe add a helper there to ensure that code doesn't
duplicate/diverge?
Ingo Molnar Jan. 10, 2023, 11:10 a.m. UTC | #5
* Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Jan 09, 2023 at 12:21:25PM -0800, Namhyung Kim wrote:
> 
> > Looks good.  But I'm confused by the tip-bot2 messages saying it's
> > merged.  Do you want me to work on it as a follow up?
> 
> Ingo and me talked past one another, I agreed with 1/3 and he applied
> the whole series. Just talked to him again and he's just zapped these
> last two patches.

Yeah - perf/core is now:

   7bdb1767bf01 ("perf/core: Change the layout of perf_sample_data")

which can be used for further work. Sorry about the confusion ...

Thanks,

	Ingo
Namhyung Kim Jan. 10, 2023, 7 p.m. UTC | #6
Hi Ingo,

On Tue, Jan 10, 2023 at 3:10 AM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Mon, Jan 09, 2023 at 12:21:25PM -0800, Namhyung Kim wrote:
> >
> > > Looks good.  But I'm confused by the tip-bot2 messages saying it's
> > > merged.  Do you want me to work on it as a follow up?
> >
> > Ingo and me talked past one another, I agreed with 1/3 and he applied
> > the whole series. Just talked to him again and he's just zapped these
> > last two patches.
>
> Yeah - perf/core is now:
>
>    7bdb1767bf01 ("perf/core: Change the layout of perf_sample_data")
>
> which can be used for further work. Sorry about the confusion ...

No problem, thanks for your work!

Thanks,
Namhyung
Namhyung Kim Jan. 10, 2023, 7:01 p.m. UTC | #7
On Tue, Jan 10, 2023 at 2:55 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Jan 09, 2023 at 12:21:25PM -0800, Namhyung Kim wrote:
>
> > > However; inspired by your next patch; we can do something like so:
> > >
> > >     if (filtered_sample_type & PERF_SAMPLE_CALLCHAIN) {
> > >             data->callchain = perf_callchain(event, regs);
> > >             data->sample_flags |= PERF_SAMPLE_CALLCHAIN;
> > >
> > >             data->size += (1 + data->callchain->nr) * sizeof(u64);
> > >     }
> >
> > This is fine as long as all other places (like in PMU drivers) set the
> > callchain update the sample data size accordingly.  If not, we can get
> > the callchain but the data size will be wrong.
>
> Good point, maybe add a helper there to ensure that code doesn't
> duplicate/diverge?

Sure, will do.

Thanks,
Namhyung
Namhyung Kim Jan. 10, 2023, 8:06 p.m. UTC | #8
On Mon, Jan 9, 2023 at 12:21 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> Hi Peter,
>
> On Mon, Jan 09, 2023 at 01:14:31PM +0100, Peter Zijlstra wrote:
> > On Thu, Dec 29, 2022 at 12:41:00PM -0800, Namhyung Kim wrote:
> >
> > So I like the general idea; I just think it's turned into a bit of a
> > mess. That is code is already overly branchy which is known to hurt
> > performance, we should really try and not make it worse than absolutely
> > needed.
>
> Agreed.
>
> >
> > >  kernel/events/core.c | 86 ++++++++++++++++++++++++++++++++------------
> > >  1 file changed, 63 insertions(+), 23 deletions(-)
> > >
> > > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > > index eacc3702654d..70bff8a04583 100644
> > > --- a/kernel/events/core.c
> > > +++ b/kernel/events/core.c
> > > @@ -7582,14 +7582,21 @@ void perf_prepare_sample(struct perf_event_header *header,
> > >     filtered_sample_type = sample_type & ~data->sample_flags;
> > >     __perf_event_header__init_id(header, data, event, filtered_sample_type);
> > >
> > > -   if (sample_type & (PERF_SAMPLE_IP | PERF_SAMPLE_CODE_PAGE_SIZE))
> > > -           data->ip = perf_instruction_pointer(regs);
> > > +   if (sample_type & (PERF_SAMPLE_IP | PERF_SAMPLE_CODE_PAGE_SIZE)) {
> > > +           /* attr.sample_type may not have PERF_SAMPLE_IP */
> >
> > Right, but that shouldn't matter, IIRC its OK to have more bits set in
> > data->sample_flags than we have set in attr.sample_type. It just means
> > we have data available for sample types we're (possibly) not using.
> >
> > That is, I think you can simply write this like:
> >
> > > +           if (!(data->sample_flags & PERF_SAMPLE_IP)) {
> > > +                   data->ip = perf_instruction_pointer(regs);
> > > +                   data->sample_flags |= PERF_SAMPLE_IP;
> > > +           }
> > > +   }
> >
> >       if (filtered_sample_type & (PERF_SAMPLE_IP | PERF_SAMPLE_CODE_PAGE_SIZE)) {
> >               data->ip = perf_instruction_pointer(regs);
> >               data->sample_flags |= PERF_SAMPLE_IP);
> >       }
> >
> >       ...
> >
> >       if (filtered_sample_type & PERF_SAMPLE_CODE_PAGE_SIZE) {
> >               data->code_page_size = perf_get_page_size(data->ip);
> >               data->sample_flags |= PERF_SAMPLE_CODE_PAGE_SIZE;
> >       }
> >
> > Then after a single perf_prepare_sample() run we have:
> >
> >   pre                 |       post
> >   ----------------------------------------
> >   0                   |       0
> >   IP                  |       IP
> >   CODE_PAGE_SIZE      |       IP|CODE_PAGE_SIZE
> >   IP|CODE_PAGE_SIZE   |       IP|CODE_PAGE_SIZE
> >
> > So while data->sample_flags will have an extra bit set in the 3rd case,
> > that will not affect perf_sample_outout() which only looks at data->type
> > (== attr.sample_type).
> >
> > And since data->sample_flags will have both bits set, a second run will
> > filter out both and avoid the extra work (except doing that will mess up
> > the branch predictors).
>
> Yeah, it'd be better to check filtered_sample_type in the first place.
>
> Btw, I was thinking about a hypothetical scenario that IP set by a PMU
> driver not from the regs.  In this case, having CODE_PAGE_SIZE will
> overwrite the IP.  I don't think we need to worry about that for now
> since PMU drivers updates the regs (using set_linear_ip).  But it seems
> like a possible scenario for something like PEBS or IBS.

Another example, but in this case it's real, is ADDR.  We cannot update
the data->addr just because filtered_sample_type has PHYS_ADDR or
DATA_PAGE_SIZE as it'd lose the original value.

Other than that, I'll update the other paths to minimized the branches.

Thanks,
Namhyung
Peter Zijlstra Jan. 11, 2023, 12:54 p.m. UTC | #9
On Tue, Jan 10, 2023 at 12:06:00PM -0800, Namhyung Kim wrote:

> Another example, but in this case it's real, is ADDR.  We cannot update
> the data->addr just because filtered_sample_type has PHYS_ADDR or
> DATA_PAGE_SIZE as it'd lose the original value.

Hmm, how about something like so?

/*
 * if (flags & s) flags |= d; // without branches
 */
static __always_inline unsigned long
__cond_set(unsigned long flags, unsigned long s, unsigned long d)
{
	return flags | (d * !!(flags & s));
}

Then:

	fst = sample_type;
	fst = __cond_set(fst, PERF_SAMPLE_CODE_PAGE_SIZE, PERF_SAMPLE_IP);
	fst = __cond_set(fst, PERF_SAMPLE_DATA_PAGE_SIZE |
			      PERF_SAMPLE_PHYS_ADDR,	  PERF_SAMPLE_ADDR);
	fst = __cond_set(fst, PERF_SAMPLE_STACK_USER,     PERF_SAMPLE_REGS_USER);
	fst &= ~data->sample_flags;

This way we express the implicit conditions by setting the required
sample data flags, then we mask those we already have set.

After the above something like:

	if (fst & PERF_SAMPLE_ADDR) {
		data->addr = 0;
		data->sample_flags |= PERF_SAMPLE_ADDR;
	}

	if (fst & PERF_SAMPLE_PHYS_ADDR) {
		data->phys_addr = perf_virt_to_phys(data->addr);
		data->sample_flags |= PERF_SAMPLE_PHYS_ADDR;
	}

	if (fst & PERF_SAMPLE_DATA_PAGE_SIZE) {
		data->data_page_size = perf_get_page_size(data->addr);
		data->sample_flags |= PERF_SAMPLE_DATA_PAGE_SIZE;
	}

And maybe something like:

#define __IF_SAMPLE_DATA(f_)		({		\
	bool __f = fst & PERF_SAMPLE_##f_;		\
	if (__f) data->sample_flags |= PERF_SAMPLE_##f_;\
	__f;				})

#define IF_SAMPLE_DATA(f_) if (__IF_SAMPLE_DATA(f_))

Then we can write:

	IF_SAMPLE_DATA(ADDR)
		data->addr = 0;

	IF_SAMPLE_DATA(PHYS_ADDR)
		data->phys_addr = perf_virt_to_phys(data->addr);

	IF_SAMPLE_DATA(DATA_PAGE_SIZE)
		data->data_page_size = perf_get_page_size(data->addr);

But I didn't check code-gen for this last suggestion.
Peter Zijlstra Jan. 11, 2023, 4:45 p.m. UTC | #10
On Wed, Jan 11, 2023 at 01:54:54PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 10, 2023 at 12:06:00PM -0800, Namhyung Kim wrote:
> 
> > Another example, but in this case it's real, is ADDR.  We cannot update
> > the data->addr just because filtered_sample_type has PHYS_ADDR or
> > DATA_PAGE_SIZE as it'd lose the original value.
> 
> Hmm, how about something like so?
> 
> /*
>  * if (flags & s) flags |= d; // without branches
>  */
> static __always_inline unsigned long
> __cond_set(unsigned long flags, unsigned long s, unsigned long d)
> {
> 	return flags | (d * !!(flags & s));
> }
> 
> Then:
> 
> 	fst = sample_type;
> 	fst = __cond_set(fst, PERF_SAMPLE_CODE_PAGE_SIZE, PERF_SAMPLE_IP);
> 	fst = __cond_set(fst, PERF_SAMPLE_DATA_PAGE_SIZE |
> 			      PERF_SAMPLE_PHYS_ADDR,	  PERF_SAMPLE_ADDR);
> 	fst = __cond_set(fst, PERF_SAMPLE_STACK_USER,     PERF_SAMPLE_REGS_USER);
> 	fst &= ~data->sample_flags;
> 

Hmm, I think it's better to write this like:

static __always_inline unsigned long
__cond_set(unsigned long flags, unsigned long s, unsigned long d)
{
	return d * !!(flags & s);
}

	fst = sample_type;
	fst |= __cond_set(sample_type, PERF_SAMPLE_CODE_PAGE_SIZE, PERF_SAMPLE_IP);
	fst |= __cond_set(sample_type, PERF_SAMPLE_DATA_PAGE_SIZE |
			               PERF_SAMPLE_PHYS_ADDR,	   PERF_SAMPLE_ADDR);
	fst |= __cond_set(sample_type, PERF_SAMPLE_STACK_USER,     PERF_SAMPLE_REGS_USER);
	fst &= ~data->sample_flags;

Which should be identical but has less data dependencies and thus gives
an OoO CPU more leaway to paralleize things.
Namhyung Kim Jan. 11, 2023, 5:59 p.m. UTC | #11
On Wed, Jan 11, 2023 at 8:45 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Jan 11, 2023 at 01:54:54PM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 10, 2023 at 12:06:00PM -0800, Namhyung Kim wrote:
> >
> > > Another example, but in this case it's real, is ADDR.  We cannot update
> > > the data->addr just because filtered_sample_type has PHYS_ADDR or
> > > DATA_PAGE_SIZE as it'd lose the original value.
> >
> > Hmm, how about something like so?
> >
> > /*
> >  * if (flags & s) flags |= d; // without branches
> >  */
> > static __always_inline unsigned long
> > __cond_set(unsigned long flags, unsigned long s, unsigned long d)
> > {
> >       return flags | (d * !!(flags & s));
> > }
> >
> > Then:
> >
> >       fst = sample_type;
> >       fst = __cond_set(fst, PERF_SAMPLE_CODE_PAGE_SIZE, PERF_SAMPLE_IP);
> >       fst = __cond_set(fst, PERF_SAMPLE_DATA_PAGE_SIZE |
> >                             PERF_SAMPLE_PHYS_ADDR,      PERF_SAMPLE_ADDR);
> >       fst = __cond_set(fst, PERF_SAMPLE_STACK_USER,     PERF_SAMPLE_REGS_USER);
> >       fst &= ~data->sample_flags;
> >
>
> Hmm, I think it's better to write this like:
>
> static __always_inline unsigned long
> __cond_set(unsigned long flags, unsigned long s, unsigned long d)
> {
>         return d * !!(flags & s);
> }
>
>         fst = sample_type;
>         fst |= __cond_set(sample_type, PERF_SAMPLE_CODE_PAGE_SIZE, PERF_SAMPLE_IP);
>         fst |= __cond_set(sample_type, PERF_SAMPLE_DATA_PAGE_SIZE |
>                                        PERF_SAMPLE_PHYS_ADDR,      PERF_SAMPLE_ADDR);
>         fst |= __cond_set(sample_type, PERF_SAMPLE_STACK_USER,     PERF_SAMPLE_REGS_USER);
>         fst &= ~data->sample_flags;
>
> Which should be identical but has less data dependencies and thus gives
> an OoO CPU more leaway to paralleize things.

Looks good.  Let me try this.

Thanks,
Namhyung
diff mbox series

Patch

diff --git a/kernel/events/core.c b/kernel/events/core.c
index eacc3702654d..70bff8a04583 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7582,14 +7582,21 @@  void perf_prepare_sample(struct perf_event_header *header,
 	filtered_sample_type = sample_type & ~data->sample_flags;
 	__perf_event_header__init_id(header, data, event, filtered_sample_type);
 
-	if (sample_type & (PERF_SAMPLE_IP | PERF_SAMPLE_CODE_PAGE_SIZE))
-		data->ip = perf_instruction_pointer(regs);
+	if (sample_type & (PERF_SAMPLE_IP | PERF_SAMPLE_CODE_PAGE_SIZE)) {
+		/* attr.sample_type may not have PERF_SAMPLE_IP */
+		if (!(data->sample_flags & PERF_SAMPLE_IP)) {
+			data->ip = perf_instruction_pointer(regs);
+			data->sample_flags |= PERF_SAMPLE_IP;
+		}
+	}
 
 	if (sample_type & PERF_SAMPLE_CALLCHAIN) {
 		int size = 1;
 
-		if (filtered_sample_type & PERF_SAMPLE_CALLCHAIN)
+		if (filtered_sample_type & PERF_SAMPLE_CALLCHAIN) {
 			data->callchain = perf_callchain(event, regs);
+			data->sample_flags |= PERF_SAMPLE_CALLCHAIN;
+		}
 
 		size += data->callchain->nr;
 
@@ -7634,8 +7641,13 @@  void perf_prepare_sample(struct perf_event_header *header,
 		header->size += size;
 	}
 
-	if (sample_type & (PERF_SAMPLE_REGS_USER | PERF_SAMPLE_STACK_USER))
-		perf_sample_regs_user(&data->regs_user, regs);
+	if (sample_type & (PERF_SAMPLE_REGS_USER | PERF_SAMPLE_STACK_USER)) {
+		/* attr.sample_type may not have PERF_SAMPLE_REGS_USER */
+		if (!(data->sample_flags & PERF_SAMPLE_REGS_USER)) {
+			perf_sample_regs_user(&data->regs_user, regs);
+			data->sample_flags |= PERF_SAMPLE_REGS_USER;
+		}
+	}
 
 	if (sample_type & PERF_SAMPLE_REGS_USER) {
 		/* regs dump ABI info */
@@ -7656,11 +7668,18 @@  void perf_prepare_sample(struct perf_event_header *header,
 		 * in case new sample type is added, because we could eat
 		 * up the rest of the sample size.
 		 */
-		u16 stack_size = event->attr.sample_stack_user;
 		u16 size = sizeof(u64);
+		u16 stack_size;
+
+		if (filtered_sample_type & PERF_SAMPLE_STACK_USER) {
+			stack_size = event->attr.sample_stack_user;
+			stack_size = perf_sample_ustack_size(stack_size, header->size,
+							     data->regs_user.regs);
 
-		stack_size = perf_sample_ustack_size(stack_size, header->size,
-						     data->regs_user.regs);
+			data->stack_user_size = stack_size;
+			data->sample_flags |= PERF_SAMPLE_STACK_USER;
+		}
+		stack_size = data->stack_user_size;
 
 		/*
 		 * If there is something to dump, add space for the dump
@@ -7670,29 +7689,40 @@  void perf_prepare_sample(struct perf_event_header *header,
 		if (stack_size)
 			size += sizeof(u64) + stack_size;
 
-		data->stack_user_size = stack_size;
 		header->size += size;
 	}
 
-	if (filtered_sample_type & PERF_SAMPLE_WEIGHT_TYPE)
+	if (filtered_sample_type & PERF_SAMPLE_WEIGHT_TYPE) {
 		data->weight.full = 0;
+		data->sample_flags |= PERF_SAMPLE_WEIGHT_TYPE;
+	}
 
-	if (filtered_sample_type & PERF_SAMPLE_DATA_SRC)
+	if (filtered_sample_type & PERF_SAMPLE_DATA_SRC) {
 		data->data_src.val = PERF_MEM_NA;
+		data->sample_flags |= PERF_SAMPLE_DATA_SRC;
+	}
 
-	if (filtered_sample_type & PERF_SAMPLE_TRANSACTION)
+	if (filtered_sample_type & PERF_SAMPLE_TRANSACTION) {
 		data->txn = 0;
+		data->sample_flags |= PERF_SAMPLE_TRANSACTION;
+	}
 
 	if (sample_type & (PERF_SAMPLE_ADDR | PERF_SAMPLE_PHYS_ADDR | PERF_SAMPLE_DATA_PAGE_SIZE)) {
-		if (filtered_sample_type & PERF_SAMPLE_ADDR)
+		/* attr.sample_type may not have PERF_SAMPLE_ADDR */
+		if (!(data->sample_flags & PERF_SAMPLE_ADDR)) {
 			data->addr = 0;
+			data->sample_flags |= PERF_SAMPLE_ADDR;
+		}
 	}
 
 	if (sample_type & PERF_SAMPLE_REGS_INTR) {
 		/* regs dump ABI info */
 		int size = sizeof(u64);
 
-		perf_sample_regs_intr(&data->regs_intr, regs);
+		if (filtered_sample_type & PERF_SAMPLE_REGS_INTR) {
+			perf_sample_regs_intr(&data->regs_intr, regs);
+			data->sample_flags |= PERF_SAMPLE_REGS_INTR;
+		}
 
 		if (data->regs_intr.regs) {
 			u64 mask = event->attr.sample_regs_intr;
@@ -7703,17 +7733,19 @@  void perf_prepare_sample(struct perf_event_header *header,
 		header->size += size;
 	}
 
-	if (sample_type & PERF_SAMPLE_PHYS_ADDR &&
-	    filtered_sample_type & PERF_SAMPLE_PHYS_ADDR)
+	if (filtered_sample_type & PERF_SAMPLE_PHYS_ADDR) {
 		data->phys_addr = perf_virt_to_phys(data->addr);
+		data->sample_flags |= PERF_SAMPLE_PHYS_ADDR;
+	}
 
 #ifdef CONFIG_CGROUP_PERF
-	if (sample_type & PERF_SAMPLE_CGROUP) {
+	if (filtered_sample_type & PERF_SAMPLE_CGROUP) {
 		struct cgroup *cgrp;
 
 		/* protected by RCU */
 		cgrp = task_css_check(current, perf_event_cgrp_id, 1)->cgroup;
 		data->cgroup = cgroup_id(cgrp);
+		data->sample_flags |= PERF_SAMPLE_CGROUP;
 	}
 #endif
 
@@ -7722,11 +7754,15 @@  void perf_prepare_sample(struct perf_event_header *header,
 	 * require PERF_SAMPLE_ADDR, kernel implicitly retrieve the data->addr,
 	 * but the value will not dump to the userspace.
 	 */
-	if (sample_type & PERF_SAMPLE_DATA_PAGE_SIZE)
+	if (filtered_sample_type & PERF_SAMPLE_DATA_PAGE_SIZE) {
 		data->data_page_size = perf_get_page_size(data->addr);
+		data->sample_flags |= PERF_SAMPLE_DATA_PAGE_SIZE;
+	}
 
-	if (sample_type & PERF_SAMPLE_CODE_PAGE_SIZE)
+	if (filtered_sample_type & PERF_SAMPLE_CODE_PAGE_SIZE) {
 		data->code_page_size = perf_get_page_size(data->ip);
+		data->sample_flags |= PERF_SAMPLE_CODE_PAGE_SIZE;
+	}
 
 	if (sample_type & PERF_SAMPLE_AUX) {
 		u64 size;
@@ -7739,10 +7775,14 @@  void perf_prepare_sample(struct perf_event_header *header,
 		 * Make sure this doesn't happen by using up to U16_MAX bytes
 		 * per sample in total (rounded down to 8 byte boundary).
 		 */
-		size = min_t(size_t, U16_MAX - header->size,
-			     event->attr.aux_sample_size);
-		size = rounddown(size, 8);
-		size = perf_prepare_sample_aux(event, data, size);
+		if (filtered_sample_type & PERF_SAMPLE_AUX) {
+			size = min_t(size_t, U16_MAX - header->size,
+				     event->attr.aux_sample_size);
+			size = rounddown(size, 8);
+			perf_prepare_sample_aux(event, data, size);
+			data->sample_flags |= PERF_SAMPLE_AUX;
+		}
+		size = data->aux_size;
 
 		WARN_ON_ONCE(size + header->size > U16_MAX);
 		header->size += size;