
[PATCHv2,2/4] coresight: tmc-etf: Fix NULL ptr dereference in tmc_enable_etf_sink_perf()

Message ID aa6e571156d6e26e54da0bb3015ba474e4a08da0.1603363729.git.saiprakash.ranjan@codeaurora.org (mailing list archive)
State New, archived
Series coresight: etf/etb10/etr: Fix NULL pointer dereference crashes

Commit Message

Sai Prakash Ranjan Oct. 22, 2020, 10:57 a.m. UTC
There was a report of a NULL pointer dereference in the ETF enable
path for perf CS mode with PID monitoring. It is almost 100%
reproducible when the process being monitored is very active, such as
chrome, and when ETF rather than ETR is the sink. Currently, to find
the pid, the owner is dereferenced via a task_pid_nr() call in
tmc_enable_etf_sink_perf(), and with the owner being NULL we get a
NULL pointer dereference.

Looking at the ETR and other places in the kernel, ETF and the
ETB are the only places trying to dereference the task(owner)
in tmc_enable_etf_sink_perf() which is also called from the
sched_in path as in the call trace. The owner (task) is NULL even in
the case of ETR in tmc_enable_etr_sink_perf(), but since we cache the
PID in the alloc_buffer() callback, which is invoked as part of
etm_setup_aux() when allocating the buffer for the ETR sink, we never
dereference this NULL pointer and are safe. So let's do the same thing
with ETF and cache the PID to which the cs_buffer belongs in
tmc_alloc_etf_buffer(), as done for ETR. This also removes the
unnecessary task_pid_nr() calls since we are caching the PID. In
addition, add a check to validate event->owner, which will prevent any
possible NULL pointer dereferences, and a check for kernel events.
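
In essence (see the full patch at the bottom of this page):

	/* tmc_alloc_etf_buffer(): runs once, at event creation time */
	struct task_struct *task = READ_ONCE(event->owner);

	if (!task || is_kernel_event(event))
		return NULL;
	...
	buf->pid = task_pid_nr(task);

and, in the enable path that can run from sched_in:

	/* tmc_enable_etf_sink_perf() */
	pid = buf->pid;	/* was: task_pid_nr(handle->event->owner) */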

Easily reproducible by running:

 perf record -e cs_etm/@tmc_etf0/ -N -p <pid>

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000548
Mem abort info:
  ESR = 0x96000006
  EC = 0x25: DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
Data abort info:
  ISV = 0, ISS = 0x00000006
  CM = 0, WnR = 0
<snip>...
Call trace:
 tmc_enable_etf_sink+0xe4/0x280
 coresight_enable_path+0x168/0x1fc
 etm_event_start+0x8c/0xf8
 etm_event_add+0x38/0x54
 event_sched_in+0x194/0x2ac
 group_sched_in+0x54/0x12c
 flexible_sched_in+0xd8/0x120
 visit_groups_merge+0x100/0x16c
 ctx_flexible_sched_in+0x50/0x74
 ctx_sched_in+0xa4/0xa8
 perf_event_sched_in+0x60/0x6c
 perf_event_context_sched_in+0x98/0xe0
 __perf_event_task_sched_in+0x5c/0xd8
 finish_task_switch+0x184/0x1cc
 schedule_tail+0x20/0xec
 ret_from_fork+0x4/0x18

Fixes: 880af782c6e8 ("coresight: tmc-etf: Add support for CPU-wide trace scenarios")
Suggested-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
---
 drivers/hwtracing/coresight/coresight-priv.h    | 2 ++
 drivers/hwtracing/coresight/coresight-tmc-etf.c | 8 +++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)

Comments

Peter Zijlstra Oct. 22, 2020, 11:32 a.m. UTC | #1
On Thu, Oct 22, 2020 at 04:27:52PM +0530, Sai Prakash Ranjan wrote:

> Looking at the ETR and other places in the kernel, ETF and the
> ETB are the only places trying to dereference the task(owner)
> in tmc_enable_etf_sink_perf() which is also called from the
> sched_in path as in the call trace.

> @@ -391,6 +392,10 @@ static void *tmc_alloc_etf_buffer(struct coresight_device *csdev,
>  {
>  	int node;
>  	struct cs_buffers *buf;
> +	struct task_struct *task = READ_ONCE(event->owner);
> +
> +	if (!task || is_kernel_event(event))
> +		return NULL;


This is *wrong*... why do you care about who owns the events?
Sai Prakash Ranjan Oct. 22, 2020, 12:49 p.m. UTC | #2
On 2020-10-22 17:02, Peter Zijlstra wrote:
> On Thu, Oct 22, 2020 at 04:27:52PM +0530, Sai Prakash Ranjan wrote:
> 
>> Looking at the ETR and other places in the kernel, ETF and the
>> ETB are the only places trying to dereference the task(owner)
>> in tmc_enable_etf_sink_perf() which is also called from the
>> sched_in path as in the call trace.
> 
>> @@ -391,6 +392,10 @@ static void *tmc_alloc_etf_buffer(struct 
>> coresight_device *csdev,
>>  {
>>  	int node;
>>  	struct cs_buffers *buf;
>> +	struct task_struct *task = READ_ONCE(event->owner);
>> +
>> +	if (!task || is_kernel_event(event))
>> +		return NULL;
> 
> 
> This is *wrong*... why do you care about who owns the events?
> 

The original issue was the owner being NULL and causing
a NULL pointer dereference. I did ask some time back
if it is valid for the owner to be NULL [1] and should
probably be handled in events core?

[1] https://lore.kernel.org/lkml/c0e1f99a0a2480dfc8d788bb424d3f08@codeaurora.org/

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000548
Mem abort info:
   ESR = 0x96000006
   EC = 0x25: DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
Data abort info:
   ISV = 0, ISS = 0x00000006
   CM = 0, WnR = 0
<snip>...
Call trace:
  tmc_enable_etf_sink+0xe4/0x280
  coresight_enable_path+0x168/0x1fc
  etm_event_start+0x8c/0xf8
  etm_event_add+0x38/0x54
  event_sched_in+0x194/0x2ac
  group_sched_in+0x54/0x12c
  flexible_sched_in+0xd8/0x120
  visit_groups_merge+0x100/0x16c
  ctx_flexible_sched_in+0x50/0x74
  ctx_sched_in+0xa4/0xa8
  perf_event_sched_in+0x60/0x6c
  perf_event_context_sched_in+0x98/0xe0
  __perf_event_task_sched_in+0x5c/0xd8
  finish_task_switch+0x184/0x1cc
  schedule_tail+0x20/0xec
  ret_from_fork+0x4/0x18


Thanks,
Sai
Suzuki K Poulose Oct. 22, 2020, 1:30 p.m. UTC | #3
On 10/22/20 12:32 PM, Peter Zijlstra wrote:
> On Thu, Oct 22, 2020 at 04:27:52PM +0530, Sai Prakash Ranjan wrote:
> 
>> Looking at the ETR and other places in the kernel, ETF and the
>> ETB are the only places trying to dereference the task(owner)
>> in tmc_enable_etf_sink_perf() which is also called from the
>> sched_in path as in the call trace.
> 
>> @@ -391,6 +392,10 @@ static void *tmc_alloc_etf_buffer(struct coresight_device *csdev,
>>   {
>>   	int node;
>>   	struct cs_buffers *buf;
>> +	struct task_struct *task = READ_ONCE(event->owner);
>> +
>> +	if (!task || is_kernel_event(event))
>> +		return NULL;
> 
> 
> This is *wrong*... why do you care about who owns the events?
> 

This is due to the special case of the CoreSight configuration, where
a "sink" (where the trace data is captured) is shared by multiple Trace
units. So, we could share the "sink" for multiple trace units if they
are tracing the events that belong to the same "perf" session. (The
userspace tool could decode the trace data based on the TraceID
in the trace packets). Is there a better way to do this?

Suzuki
Peter Zijlstra Oct. 22, 2020, 1:34 p.m. UTC | #4
On Thu, Oct 22, 2020 at 06:19:37PM +0530, Sai Prakash Ranjan wrote:
> On 2020-10-22 17:02, Peter Zijlstra wrote:
> > On Thu, Oct 22, 2020 at 04:27:52PM +0530, Sai Prakash Ranjan wrote:
> > 
> > > Looking at the ETR and other places in the kernel, ETF and the
> > > ETB are the only places trying to dereference the task(owner)
> > > in tmc_enable_etf_sink_perf() which is also called from the
> > > sched_in path as in the call trace.
> > 
> > > @@ -391,6 +392,10 @@ static void *tmc_alloc_etf_buffer(struct
> > > coresight_device *csdev,
> > >  {
> > >  	int node;
> > >  	struct cs_buffers *buf;
> > > +	struct task_struct *task = READ_ONCE(event->owner);
> > > +
> > > +	if (!task || is_kernel_event(event))
> > > +		return NULL;
> > 
> > 
> > This is *wrong*... why do you care about who owns the events?
> > 
> 
> The original issue was the owner being NULL and causing
> a NULL pointer dereference. I did ask some time back
> if it is valid for the owner to be NULL [1] and should
> probably be handled in events core?

No, what I asked is why do you care about ->owner to begin with? That
seems wrong. A driver should not touch ->owner _at_all_.
Sai Prakash Ranjan Oct. 22, 2020, 2:23 p.m. UTC | #5
On 2020-10-22 19:04, Peter Zijlstra wrote:
> On Thu, Oct 22, 2020 at 06:19:37PM +0530, Sai Prakash Ranjan wrote:
>> On 2020-10-22 17:02, Peter Zijlstra wrote:
>> > On Thu, Oct 22, 2020 at 04:27:52PM +0530, Sai Prakash Ranjan wrote:
>> >
>> > > Looking at the ETR and other places in the kernel, ETF and the
>> > > ETB are the only places trying to dereference the task(owner)
>> > > in tmc_enable_etf_sink_perf() which is also called from the
>> > > sched_in path as in the call trace.
>> >
>> > > @@ -391,6 +392,10 @@ static void *tmc_alloc_etf_buffer(struct
>> > > coresight_device *csdev,
>> > >  {
>> > >  	int node;
>> > >  	struct cs_buffers *buf;
>> > > +	struct task_struct *task = READ_ONCE(event->owner);
>> > > +
>> > > +	if (!task || is_kernel_event(event))
>> > > +		return NULL;
>> >
>> >
>> > This is *wrong*... why do you care about who owns the events?
>> >
>> 
>> The original issue was the owner being NULL and causing
>> a NULL pointer dereference. I did ask some time back
>> if it is valid for the owner to be NULL [1] and should
>> probably be handled in events core?
> 
> No, what I asked is why do you care about ->owner to begin with? That
> seems wrong. A driver should not touch ->owner _at_all_.
> 

Ah ok, Suzuki explained that in the other reply. Is there some
better way?

Thanks,
Sai
Peter Zijlstra Oct. 22, 2020, 3:06 p.m. UTC | #6
On Thu, Oct 22, 2020 at 02:30:21PM +0100, Suzuki Poulose wrote:
> On 10/22/20 12:32 PM, Peter Zijlstra wrote:
> > On Thu, Oct 22, 2020 at 04:27:52PM +0530, Sai Prakash Ranjan wrote:
> > 
> > > Looking at the ETR and other places in the kernel, ETF and the
> > > ETB are the only places trying to dereference the task(owner)
> > > in tmc_enable_etf_sink_perf() which is also called from the
> > > sched_in path as in the call trace.
> > 
> > > @@ -391,6 +392,10 @@ static void *tmc_alloc_etf_buffer(struct coresight_device *csdev,
> > >   {
> > >   	int node;
> > >   	struct cs_buffers *buf;
> > > +	struct task_struct *task = READ_ONCE(event->owner);
> > > +
> > > +	if (!task || is_kernel_event(event))
> > > +		return NULL;
> > 
> > 
> > This is *wrong*... why do you care about who owns the events?
> > 
> 
> This is due to the special case of the CoreSight configuration, where
> a "sink" (where the trace data is captured) is shared by multiple Trace
> units. So, we could share the "sink" for multiple trace units if they
> are tracing the events that belong to the same "perf" session. (The
> userspace tool could decode the trace data based on the TraceID
> in the trace packets). Is there a better way to do this ?

I thought we added sink identification through perf_event_attr::config2?
Suzuki K Poulose Oct. 22, 2020, 3:32 p.m. UTC | #7
On 10/22/20 4:06 PM, Peter Zijlstra wrote:
> On Thu, Oct 22, 2020 at 02:30:21PM +0100, Suzuki Poulose wrote:
>> On 10/22/20 12:32 PM, Peter Zijlstra wrote:
>>> On Thu, Oct 22, 2020 at 04:27:52PM +0530, Sai Prakash Ranjan wrote:
>>>
>>>> Looking at the ETR and other places in the kernel, ETF and the
>>>> ETB are the only places trying to dereference the task(owner)
>>>> in tmc_enable_etf_sink_perf() which is also called from the
>>>> sched_in path as in the call trace.
>>>
>>>> @@ -391,6 +392,10 @@ static void *tmc_alloc_etf_buffer(struct coresight_device *csdev,
>>>>    {
>>>>    	int node;
>>>>    	struct cs_buffers *buf;
>>>> +	struct task_struct *task = READ_ONCE(event->owner);
>>>> +
>>>> +	if (!task || is_kernel_event(event))
>>>> +		return NULL;
>>>
>>>
>>> This is *wrong*... why do you care about who owns the events?
>>>
>>
>> This is due to the special case of the CoreSight configuration, where
>> a "sink" (where the trace data is captured) is shared by multiple Trace
>> units. So, we could share the "sink" for multiple trace units if they
>> are tracing the events that belong to the same "perf" session. (The
>> userspace tool could decode the trace data based on the TraceID
>> in the trace packets). Is there a better way to do this ?
> 
> I thought we added sink identification through perf_event_attr::config2
> ?
> 

Correct. attr:config2 identifies the "sink" for the collection. But,
that doesn't solve the problem we have here. If two separate perf
sessions use the "same sink", we don't want to mix the
trace data into the same sink for events from different sessions.

Thus, we need a way to check if a new event starting the tracing on
an ETM belongs to the same session as the one already pumping the trace
into the sink.

We use the event->owner pid for this check, and that's where we
encountered a NULL event->owner. Looking at the code further, we
identified that kernel events could also trigger this issue.
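
Concretely, the check in tmc_enable_etf_sink_perf() today is:

	/* Get a handle on the pid of the process to monitor */
	pid = task_pid_nr(handle->event->owner);

	if (drvdata->pid != -1 && drvdata->pid != pid) {
		ret = -EBUSY;
		break;
	}

and that task_pid_nr() call is the dereference that blows up when
->owner is NULL.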

Suzuki
Mathieu Poirier Oct. 22, 2020, 9:20 p.m. UTC | #8
Hi Peter,

On Thu, Oct 22, 2020 at 04:32:36PM +0100, Suzuki Poulose wrote:
> On 10/22/20 4:06 PM, Peter Zijlstra wrote:
> > On Thu, Oct 22, 2020 at 02:30:21PM +0100, Suzuki Poulose wrote:
> > > On 10/22/20 12:32 PM, Peter Zijlstra wrote:
> > > > On Thu, Oct 22, 2020 at 04:27:52PM +0530, Sai Prakash Ranjan wrote:
> > > > 
> > > > > Looking at the ETR and other places in the kernel, ETF and the
> > > > > ETB are the only places trying to dereference the task(owner)
> > > > > in tmc_enable_etf_sink_perf() which is also called from the
> > > > > sched_in path as in the call trace.
> > > > 
> > > > > @@ -391,6 +392,10 @@ static void *tmc_alloc_etf_buffer(struct coresight_device *csdev,
> > > > >    {
> > > > >    	int node;
> > > > >    	struct cs_buffers *buf;
> > > > > +	struct task_struct *task = READ_ONCE(event->owner);
> > > > > +
> > > > > +	if (!task || is_kernel_event(event))
> > > > > +		return NULL;
> > > > 
> > > > 
> > > > This is *wrong*... why do you care about who owns the events?
> > > > 
> > > 
> > > This is due to the special case of the CoreSight configuration, where
> > > a "sink" (where the trace data is captured) is shared by multiple Trace
> > > units. So, we could share the "sink" for multiple trace units if they
> > > are tracing the events that belong to the same "perf" session. (The
> > > userspace tool could decode the trace data based on the TraceID
> > > in the trace packets). Is there a better way to do this ?
> > 
> > I thought we added sink identification through perf_event_attr::config2
> > ?
> > 
> 
> Correct. attr:config2 identifies the "sink" for the collection. But,
> that doesn't solve the problem we have here. If two separate perf
> sessions use the "same sink", we don't want to mix the
> trace data into the same sink for events from different sessions.
> 
> Thus, we need a way to check if a new event starting the tracing on
> an ETM belongs to the same session as the one already pumping the trace
> into the sink.

Suzuki's depiction of the usecase is accurate.  Using the pid of the process
that created the events comes out of a discussion you and I had in the common
area by the Intel booth at ELC in Edinburgh in the fall of 2018.  At the time I
exposed the problem of having multiple events sharing the same HW resources and
you advised to proceed this way.

That being said, it is plausible that I did not express myself clearly enough
for you to understand the full extent of the problem.  If that is the case we
are more than willing to revisit that solution.  Do you see a better option than
what has currently been implemented?

Many thanks,
Mathieu

> 
> We use event->owner pid for this check and thats where we encountered
> a NULL event->owner. Looking at the code further, we identified that
> kernel events could also trigger this issue.
> 
> Suzuki
Peter Zijlstra Oct. 23, 2020, 7:39 a.m. UTC | #9
On Thu, Oct 22, 2020 at 03:20:33PM -0600, Mathieu Poirier wrote:
> Suzuki's depiction of the usecase is accurate.  Using the pid of the process
> that created the events comes out of a discussion you and I had in the common
> area by the Intel booth at ELC in Edinburgh in the fall of 2018.  At the time I
> exposed the problem of having multiple events sharing the same HW resources and
> you advised to proceed this way.

Bah, I was afraid of that. I desperately tried to find correspondence on
it, but alas, verbal crap doesn't end up in the Sent folder :-/

> That being said it is plausible that I did not expressed myself clearly enough
> for you to understand the full extend of the problem.  If that is the case we
> are more than willing to revisit that solution.  Do you see a better option than
> what has currently been implemented?

Moo... that really could've done with a comment I suppose.

So then I don't understand the !->owner issue, that only happens when
the task dies, which cannot be concurrent with event creation. Are you
somehow accessing ->owner later?

As for the kernel events.. why do you care about the actual task_struct
* in there? I see you're using it to grab the task-pid, but how is that
useful?
Suzuki K Poulose Oct. 23, 2020, 8:49 a.m. UTC | #10
Hi Peter

On 10/23/20 8:39 AM, Peter Zijlstra wrote:
> On Thu, Oct 22, 2020 at 03:20:33PM -0600, Mathieu Poirier wrote:
>> Suzuki's depiction of the usecase is accurate.  Using the pid of the process
>> that created the events comes out of a discussion you and I had in the common
>> area by the Intel booth at ELC in Edinburgh in the fall of 2018.  At the time I
>> exposed the problem of having multiple events sharing the same HW resources and
>> you advised to proceed this way.
> 
> Bah, I was afraid of that. I desperately tried to find correspondence on
> it, but alas, verbal crap doesn't end up in the Sent folder :-/
> 
>> That being said it is plausible that I did not expressed myself clearly enough
>> for you to understand the full extend of the problem.  If that is the case we
>> are more than willing to revisit that solution.  Do you see a better option than
>> what has currently been implemented?
> 
> Moo... that really could've done with a comment I suppose.
> 
> So then I don't understand the !->owner issue, that only happens when
> the task dies, which cannot be concurrent with event creation. Are you

Part of the patch from Sai fixes this by avoiding the dereference
after event creation (by caching the pid). But the kernel events need
fixing.

One follow-up question on the !->owner issue. Given the ->owner is
dying, does it prevent events from being scheduled? Or is there a delay
between that and eventually stopping the events? In this case, we hit
the issue when:

A					  A or B ?

event_start()
   ...					event->owner = NULL

  READ_ONCE(event->owner);

Is this expected ?

> somehow accessing ->owner later?

> 
> As for the kernel events.. why do you care about the actual task_struct
> * in there? I see you're using it to grab the task-pid, but how is that
> useful?

Correct, kernel events are something that the driver didn't account for.
Maybe we could handle this case with a "special pid" and simply
disallow sharing (which is fine I believe, given there is no grouping
for kernel-created events).

Suzuki
Peter Zijlstra Oct. 23, 2020, 9:23 a.m. UTC | #11
On Fri, Oct 23, 2020 at 09:49:53AM +0100, Suzuki Poulose wrote:
> On 10/23/20 8:39 AM, Peter Zijlstra wrote:

> > So then I don't understand the !->owner issue, that only happens when
> > the task dies, which cannot be concurrent with event creation. Are you
> 
> Part of the patch from Sai, fixes this by avoiding the dereferencing
> after event creation (by caching it). But the kernel events needs
> fixing.
> 
> One follow up question on the !->owner issue. Given the ->owner is
> dying, does it prevent events from being scheduled ? Or is there a delay
> between that and eventually stopping the events. In this case, we hit
> the issue when :
> 
> A					  A or B ?
> 
> event_start()
>   ...					event->owner = NULL
> 
>  READ_ONCE(event->owner);
> 
> Is this expected ?

Yeah, teardown is a bit of an effort. Also, you can pass an fd over a
unix socket to another process, so this isn't something you can rely on
in any case.

The perf tool doesn't do it, but the kernel infra should be able to deal
with someone doing a perf-daemon of sorts, where you can request a perf
event and receive an fd from it.

Imagine the fun ;-)

> > As for the kernel events.. why do you care about the actual task_struct
> > * in there? I see you're using it to grab the task-pid, but how is that
> > useful?
> 
> Correct, kernel events are something that the driver didn't account for.
> May be we could handle this case with a "special pid" and simply
> disallow sharing (which is fine I believe, given there are not grouping
> for the kernel created events).

Why do you need a pid in the first place? Can't you use the "task_struct
*" as a value?
Peter Zijlstra Oct. 23, 2020, 9:41 a.m. UTC | #12
On Fri, Oct 23, 2020 at 09:49:53AM +0100, Suzuki Poulose wrote:
> On 10/23/20 8:39 AM, Peter Zijlstra wrote:

> > So then I don't understand the !->owner issue, that only happens when
> > the task dies, which cannot be concurrent with event creation. Are you
> 
> Part of the patch from Sai, fixes this by avoiding the dereferencing
> after event creation (by caching it). But the kernel events needs
> fixing.

I'm fundamentally failing here. Creating a link to the sink is strictly
event-creation time. Why would you ever need it again later? Later you
already have the sink setup.
Suzuki K Poulose Oct. 23, 2020, 10:34 a.m. UTC | #13
On 10/23/20 10:41 AM, Peter Zijlstra wrote:
> On Fri, Oct 23, 2020 at 09:49:53AM +0100, Suzuki Poulose wrote:
>> On 10/23/20 8:39 AM, Peter Zijlstra wrote:
> 
>>> So then I don't understand the !->owner issue, that only happens when
>>> the task dies, which cannot be concurrent with event creation. Are you
>>
>> Part of the patch from Sai, fixes this by avoiding the dereferencing
>> after event creation (by caching it). But the kernel events needs
>> fixing.
> 
> I'm fundamentally failing here. Creating a link to the sink is strictly
> event-creation time. Why would you ever need it again later? Later you
> already have the sink setup.
> 

Sorry for the lack of clarity here, and you are not alone unless you
have drowned in the CoreSight topologies ;-)

Typically current generation of systems have the following topology :

CPU0
  etm0   \
          \  ________
          /          \
CPU1    /            \
   etm1                \
                        \
                        /-------  sink0
CPU2                  /
   etm2  \            /
          \ ________ /
          /
CPU3    /
   etm3


i.e., multiple ETMs share a sink. [For the sake of simplicity, I have
used one sink. Even though there could be multiple sinks (of different
types), none of them are private to the ETMs. So, in a nutshell, a sink
can be reached by multiple ETMs.]

Now, for a session :

perf record -e cs_etm/sinkid=sink0/u workload

We create an event per CPU (say eventN), which are scheduled based on
the threads that could execute on the CPU. At this point we have
finalized the sink0, and have allocated the necessary buffer for it.

Now, when the threads are scheduled on the CPUs, we start the
appropriate events for the CPUs.

e.g.,
  CPU0 sched -> workload:0 -> etm0->event0_start -> turns on all
the components up to sink0, starting the trace collection in the buffer.

Now, if another CPU, CPU1, starts tracing event1 for the workload:1
thread, it will eventually try to turn ON sink0. Since sink0 is
already active tracing event0, we could allow this to go through and
collect the trace in the *same hardware buffer* (which can be demuxed
from the single AUX record using the TraceID in the packets). Please
note that we do double buffering and the hardware buffer is copied
only when sink0 is stopped (see below).

But, if the event scheduled on CPU1 doesn't belong to the above
session, but belongs to a different perf session
  (say, perf record -e cs_etm/sinkid=sink0/u benchmark),

we can't allow this to succeed and mix the trace data into that of
the workload, and thus we fail the operation.

In a nutshell, since the sinks are shared, we start the sink on the
first event and keep sharing the sink buffer with any event that
belongs to the same session (using refcounts). The sink is only
released for other sessions when there are no more events in the
session tracing on any of the ETMs.
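
In pseudo-code, the per-sink rule is roughly the following (a sketch
modelled on the ETF driver, where drvdata->pid is -1 while the sink is
idle):

	if (drvdata->pid != -1 && drvdata->pid != pid) {
		ret = -EBUSY;		/* held by another session */
		break;
	}

	if (drvdata->pid == pid) {	/* same session: share the sink */
		atomic_inc(csdev->refcnt);
		break;
	}

	/* first event of the session: claim the sink and start it */
	ret = tmc_etb_enable_hw(drvdata);
	if (!ret) {
		drvdata->pid = pid;
		atomic_inc(csdev->refcnt);
	}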

I know this is fundamentally a topology issue, but that is not something
we can fix. But the situation is changing and we are starting to see
systems with per-CPU sinks.

Hope this helps.

Suzuki
Suzuki K Poulose Oct. 23, 2020, 10:49 a.m. UTC | #14
On 10/23/20 10:23 AM, Peter Zijlstra wrote:
> On Fri, Oct 23, 2020 at 09:49:53AM +0100, Suzuki Poulose wrote:
>> On 10/23/20 8:39 AM, Peter Zijlstra wrote:
> 
>>> So then I don't understand the !->owner issue, that only happens when
>>> the task dies, which cannot be concurrent with event creation. Are you
>>
>> Part of the patch from Sai, fixes this by avoiding the dereferencing
>> after event creation (by caching it). But the kernel events needs
>> fixing.
>>
>> One follow up question on the !->owner issue. Given the ->owner is
>> dying, does it prevent events from being scheduled ? Or is there a delay
>> between that and eventually stopping the events. In this case, we hit
>> the issue when :
>>
>> A					  A or B ?
>>
>> event_start()
>>    ...					event->owner = NULL
>>
>>   READ_ONCE(event->owner);
>>
>> Is this expected ?
> 
> Yeah, teardown is a bit of an effort. Also, you can pass an fd over a
> unix socket to another process, so this isn't something you can rely on
> in any case.
> 
> The perf tool doesn't do it, but the kernel infra should be able to deal
> with someone doing a perf-deamon of sorts, where you can request a perf
> event and recieve a fd from it.
> 
> Imagine the fun ;-)
> 
>>> As for the kernel events.. why do you care about the actual task_struct
>>> * in there? I see you're using it to grab the task-pid, but how is that
>>> useful?
>>
>> Correct, kernel events are something that the driver didn't account for.
>> May be we could handle this case with a "special pid" and simply
>> disallow sharing (which is fine I believe, given there are not grouping
>> for the kernel created events).
> 
> Why do you need a pid in the first place? Can't you use the "task_struct
> *" as a value?

We could. But without a refcount on the task pointer, that could be
tricky, even though we don't dereference it. In the same situation, if
the task owner dies, is freed, and is reallocated to a new perf
session's task but with a different PID, could we be mixing things up
again?

The special pid here could be -1.
Peter Zijlstra Oct. 23, 2020, 10:54 a.m. UTC | #15
On Fri, Oct 23, 2020 at 11:34:32AM +0100, Suzuki Poulose wrote:
> On 10/23/20 10:41 AM, Peter Zijlstra wrote:
> > On Fri, Oct 23, 2020 at 09:49:53AM +0100, Suzuki Poulose wrote:
> > > On 10/23/20 8:39 AM, Peter Zijlstra wrote:
> > 
> > > > So then I don't understand the !->owner issue, that only happens when
> > > > the task dies, which cannot be concurrent with event creation. Are you
> > > 
> > > Part of the patch from Sai, fixes this by avoiding the dereferencing
> > > after event creation (by caching it). But the kernel events needs
> > > fixing.
> > 
> > I'm fundamentally failing here. Creating a link to the sink is strictly
> > event-creation time. Why would you ever need it again later? Later you
> > already have the sink setup.
> > 
> 
> Sorry for the lack of clarity here, and you are not alone unless you
> have drowned in the CoreSight topologies ;-)
> 
> Typically current generation of systems have the following topology :
> 
> CPU0
>  etm0   \
>          \  ________
>          /          \
> CPU1    /            \
>   etm1                \
>                        \
>                        /-------  sink0
> CPU2                  /
>   etm2  \            /
>          \ ________ /
>          /
> CPU3    /
>   etm3
> 
> 
> i.e, Multiple ETMs share a sink. [for the sake of simplicity, I have
> used one sink. Even though there could be potential sinks (of different
> types), none of them are private to the ETMs. So, in a nutshell, a sink
> can be reached by multiple ETMs. ]
> 
> Now, for a session :
> 
> perf record -e cs_etm/sinkid=sink0/u workload
> 
> We create an event per CPU (say eventN, which are scheduled based on the
> threads that could execute on the CPU. At this point we have finalized
> the sink0, and have allocated necessary buffer for the sink0.
> 
> Now, when the threads are scheduled on the CPUs, we start the
> appropriate events for the CPUs.
> 
> e.g,
>  CPU0 sched -> workload:0 - > etm0->event0_start -> Turns all
> the components upto sink0, starting the trace collection in the buffer.
> 
> Now, if another CPU, CPU1 starts tracing event1 for workload:1 thread,
> it will eventually try to turn ON the sink0.Since sink0 is already
> active tracing event0, we could allow this to go through and collect
> the trace in the *same hardware buffer* (which can be demuxed from the
> single AUX record using the TraceID in the packets). Please note that
> we do double buffering and hardware buffer is copied only when the sink0
> is stopped (see below).
> 
> But, if the event scheduled on CPU1 doesn't belong to the above session, but
> belongs to different perf session
>  (say, perf record -e  cs_etm/sinkid=sink0/u benchmark),
> 
> we can't allow this to succeed and mix the trace data in to that of workload
> and thus fail the operation.
> 
> In a nutshell, since the sinks are shared, we start the sink on the
> first event and keeps sharing the sink buffer with any event that
> belongs to the same session (using refcounts). The sink is only released
> for other sessions, when there are no more events in the session tracing
> on any of the ETMs.
> 
> I know this is fundamentally a topology issue, but that is not something
> we can fix. But the situation is changing and we are starting to see
> systems with per-CPU sinks.
> 
> Hope this helps.

I think I'm more confused now :-/

Where do we use ->owner after event creation? The moment you create your
eventN you create the link to sink0. That link either succeeds (same
'cookie') or fails.

If it fails, event creation fails, the end.

On success, we have the sink pointer in our event and we never ever need
to look at ->owner ever again.

I'm also not seeing why exactly we need ->owner in the first place.

Suppose we make the sink0 device return -EBUSY on open() when it is
active. Then a perf session can open the sink0 device, create perf
events and attach them to the sink0 device using
perf_event_attr::config2. The events will attach to sink0 and increment
its usage count, such that any further open() will fail.

Once the events are created, the perf tool close()s the sink0 device,
which is now in use by the events. No other events can be attached
to it.
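
Hypothetically, the open() side of that scheme could look like the
below (none of these names exist in the coresight driver today; as the
reply below notes, the sinks currently expose no fops at all):

	static int sink_open(struct inode *inode, struct file *file)
	{
		struct sink_device *sink = container_of(inode->i_cdev,
							struct sink_device, cdev);

		/* claimed either by an open() or by attached events */
		if (atomic_cmpxchg(&sink->users, 0, 1))
			return -EBUSY;

		file->private_data = sink;
		return 0;
	}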

Or are you doing the event->sink mapping every time you do: pmu::add()?
That sounds insane.
Suzuki K Poulose Oct. 23, 2020, 12:56 p.m. UTC | #16
On 10/23/20 11:54 AM, Peter Zijlstra wrote:
> On Fri, Oct 23, 2020 at 11:34:32AM +0100, Suzuki Poulose wrote:
>> On 10/23/20 10:41 AM, Peter Zijlstra wrote:
>>> On Fri, Oct 23, 2020 at 09:49:53AM +0100, Suzuki Poulose wrote:
>>>> On 10/23/20 8:39 AM, Peter Zijlstra wrote:
>>>
>>>>> So then I don't understand the !->owner issue, that only happens when
>>>>> the task dies, which cannot be concurrent with event creation. Are you
>>>>
>>>> Part of the patch from Sai, fixes this by avoiding the dereferencing
>>>> after event creation (by caching it). But the kernel events needs
>>>> fixing.
>>>
>>> I'm fundamentally failing here. Creating a link to the sink is strictly
>>> event-creation time. Why would you ever need it again later? Later you
>>> already have the sink setup.
>>>
>>
>> Sorry for the lack of clarity here, and you are not alone unless you
>> have drowned in the CoreSight topologies ;-)
>>
>> Typically current generation of systems have the following topology :
>>
>> CPU0
>>   etm0   \
>>           \  ________
>>           /          \
>> CPU1    /            \
>>    etm1                \
>>                         \
>>                         /-------  sink0
>> CPU2                  /
>>    etm2  \            /
>>           \ ________ /
>>           /
>> CPU3    /
>>    etm3
>>
>>
>> i.e, Multiple ETMs share a sink. [for the sake of simplicity, I have
>> used one sink. Even though there could be potential sinks (of different
>> types), none of them are private to the ETMs. So, in a nutshell, a sink
>> can be reached by multiple ETMs. ]
>>
>> Now, for a session :
>>
>> perf record -e cs_etm/sinkid=sink0/u workload
>>
>> We create an event per CPU (say eventN, which are scheduled based on the
>> threads that could execute on the CPU. At this point we have finalized
>> the sink0, and have allocated necessary buffer for the sink0.
>>
>> Now, when the threads are scheduled on the CPUs, we start the
>> appropriate events for the CPUs.
>>
>> e.g,
>>   CPU0 sched -> workload:0 - > etm0->event0_start -> Turns all
>> the components upto sink0, starting the trace collection in the buffer.
>>
>> Now, if another CPU, CPU1 starts tracing event1 for workload:1 thread,
>> it will eventually try to turn ON the sink0.Since sink0 is already
>> active tracing event0, we could allow this to go through and collect
>> the trace in the *same hardware buffer* (which can be demuxed from the
>> single AUX record using the TraceID in the packets). Please note that
>> we do double buffering and hardware buffer is copied only when the sink0
>> is stopped (see below).
>>
>> But, if the event scheduled on CPU1 doesn't belong to the above session, but
>> belongs to different perf session
>>   (say, perf record -e  cs_etm/sinkid=sink0/u benchmark),
>>
>> we can't allow this to succeed and mix the trace data in to that of workload
>> and thus fail the operation.
>>
>> In a nutshell, since the sinks are shared, we start the sink on the
>> first event and keeps sharing the sink buffer with any event that
>> belongs to the same session (using refcounts). The sink is only released
>> for other sessions, when there are no more events in the session tracing
>> on any of the ETMs.
>>
>> I know this is fundamentally a topology issue, but that is not something
>> we can fix. But the situation is changing and we are starting to see
>> systems with per-CPU sinks.
>>
>> Hope this helps.
> 
> I think I'm more confused now :-/
> 
> Where do we use ->owner after event creation? The moment you create your
> eventN you create the link to sink0. That link either succeeds (same
> 'cookie') or fails.

The event->sink link is established at creation. At event::add(), we
check that the sink is free (i.e., it is inactive) or is used by an
event of the same session (this is where the owner field *was*
required, but it is not needed anymore, as we cache the pid read from
"owner" in handle->rb->aux_priv for each event, and this is compared
against the pid from the handle currently driving the hardware).
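
For reference, etm_perf_sink_config() just pulls that cached per-event
data back out of the handle; from memory it is roughly:

	static inline void *etm_perf_sink_config(struct perf_output_handle *handle)
	{
		struct etm_event_data *data = perf_get_aux(handle);

		if (data)
			return data->snk_config;
		return NULL;
	}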


> 
> If it fails, event creation fails, the end.
> 

> On success, we have the sink pointer in our event and we never ever need
> to look at ->owner ever again.

Correct, we don't need it after the event creation (the "one part of
the patch", as explained above).

> 
> I'm also not seeing why exactly we need ->owner in the first place.
> 
> Suppose we make the sink0 device return -EBUSY on open() when it is
> active. Then a perf session can open the sink0 device, create perf
> events and attach them to the sink0 device using
> perf_event_attr::config2. The events will attach to sink0 and increment
> its usage count, such that any further open() will fail.

That's where we are diverging. The sink device doesn't have any fops.
It is all managed by the coresight driver, transparently to the perf
tool. All the perf tool does is specify which sink to use (btw, we now
have automatic sink selection support which gets rid of this, and uses
the best possible sink, e.g., in the case of per-CPU sinks).

> 
> Once the events are created, the perf tool close()s the sink0 device,
> which is now will in-use by the events. No other events can be attached
> to it.
> 
> Or are you doing the event->sink mapping every time you do: pmu::add()?
> That sounds insane.

The sink is already mapped at event creation. But yes, the refcount on
the sink is managed at start/stop. That's when we need to make sure
that the event being scheduled belongs to the same owner as the one
already driving the sink.

That way another session could use the same sink if it is free. i.e

perf record -e cs_etm/@sink0/u --per-thread app1

and

perf record -e cs_etm/@sink0/u --per-thread app2

both can work as long as the sink is not used by the other session.

Suzuki
Peter Zijlstra Oct. 23, 2020, 1:16 p.m. UTC | #17
On Fri, Oct 23, 2020 at 01:56:47PM +0100, Suzuki Poulose wrote:
> On 10/23/20 11:54 AM, Peter Zijlstra wrote:

> > I think I'm more confused now :-/
> > 
> > Where do we use ->owner after event creation? The moment you create your
> > eventN you create the link to sink0. That link either succeeds (same
> > 'cookie') or fails.
> 
> The event->sink link is established at creation. At event::add(), we
> check the sink is free (i.e, it is inactive) or is used by an event
> of the same session (this is where the owner field *was* required. But
> this is not needed anymore, as we cache the "owner" read pid in the
> handle->rb->aux_priv for each event and this is compared against the
> pid from the handle currently driving the hardware)

*groan*.. that's going to be a mess with sinks that are shared between
CPUs :/

> > I'm also not seeing why exactly we need ->owner in the first place.
> > 
> > Suppose we make the sink0 device return -EBUSY on open() when it is
> > active. Then a perf session can open the sink0 device, create perf
> > events and attach them to the sink0 device using
> > perf_event_attr::config2. The events will attach to sink0 and increment
> > its usage count, such that any further open() will fail.
> 
> Thats where we are diverging. The sink device doesn't have any fops. It
> is all managed by the coresight driver transparent to the perf tool. All
> the perf tool does is, specifying which sink to use (btw, we now have
> automatic sink selection support which gets rid of this, and uses
> the best possible sink e.g, in case of per-CPU sinks).

per-CPU sinks sound a lot better.

I'm really not convinced it makes sense to do what you do with shared
sinks though. You'll lose random parts of the execution trace because
of what the other CPUs do.

Full exclusive sink access is far more deterministic.

> > Once the events are created, the perf tool close()s the sink0 device,
> > which is now will in-use by the events. No other events can be attached
> > to it.
> > 
> > Or are you doing the event->sink mapping every time you do: pmu::add()?
> > That sounds insane.
> 
> Sink is already mapped at event create. But yes, the refcount on the
> sink is managed at start/stop. Thats when we need to make sure that the
> event being scheduled belongs to the same owner as the one already
> driving the sink.

pmu::add() I might hope, because pmu::start() is not allowed to fail.

> That way another session could use the same sink if it is free. i.e
> 
> perf record -e cs_etm/@sink0/u --per-thread app1
> 
> and
> 
> perf record -e cs_etm/@sink0/u --per-thread app2
> 
> both can work as long as the sink is not used by the other session.

Like said above, if sink is shared between CPUs, that's going to be a
trainwreck :/ Why do you want that?

And once you have per-CPU sinks like mentioned above, the whole problem
goes away.
Suzuki K Poulose Oct. 23, 2020, 1:29 p.m. UTC | #18
On 10/23/20 2:16 PM, Peter Zijlstra wrote:
> On Fri, Oct 23, 2020 at 01:56:47PM +0100, Suzuki Poulose wrote:
>> On 10/23/20 11:54 AM, Peter Zijlstra wrote:
> 
>>> I think I'm more confused now :-/
>>>
>>> Where do we use ->owner after event creation? The moment you create your
>>> eventN you create the link to sink0. That link either succeeds (same
>>> 'cookie') or fails.
>>
>> The event->sink link is established at creation. At event::add(), we
>> check the sink is free (i.e, it is inactive) or is used by an event
>> of the same session (this is where the owner field *was* required. But
>> this is not needed anymore, as we cache the "owner" read pid in the
>> handle->rb->aux_priv for each event and this is compared against the
>> pid from the handle currently driving the hardware)
> 
> *groan*.. that's going to be a mess with sinks that are shared between
> CPUs :/
> 
>>> I'm also not seeing why exactly we need ->owner in the first place.
>>>
>>> Suppose we make the sink0 device return -EBUSY on open() when it is
>>> active. Then a perf session can open the sink0 device, create perf
>>> events and attach them to the sink0 device using
>>> perf_event_attr::config2. The events will attach to sink0 and increment
>>> its usage count, such that any further open() will fail.
>>
>> Thats where we are diverging. The sink device doesn't have any fops. It
>> is all managed by the coresight driver transparent to the perf tool. All
>> the perf tool does is, specifying which sink to use (btw, we now have
>> automatic sink selection support which gets rid of this, and uses
>> the best possible sink e.g, in case of per-CPU sinks).
> 
> per-CPU sinks sounds a lot better.
> 
> I'm really not convinced it makes sense to do what you do with shared
> sinks though. You'll loose random parts of the execution trace because
> of what the other CPUs do.

The ETM trace protocol has a built-in TraceID to distinguish the
packets, and thus we could decode the trace streams from the shared
buffer. [ But we don't have buffer overflow interrupts (I am keeping
the lid closed on that can, for the sake of keeping sanity ;-) ), and
thus any shared session could easily lose data unless we tune the AUX
buffer size to a really large buffer. ]

> 
> Full exclusive sink access is far more deterministic.
> 
>>> Once the events are created, the perf tool close()s the sink0 device,
>>> which is now will in-use by the events. No other events can be attached
>>> to it.
>>>
>>> Or are you doing the event->sink mapping every time you do: pmu::add()?
>>> That sounds insane.
>>
>> Sink is already mapped at event create. But yes, the refcount on the
>> sink is managed at start/stop. Thats when we need to make sure that the
>> event being scheduled belongs to the same owner as the one already
>> driving the sink.
> 
> pmu::add() I might hope, because pmu::start() is not allowed to fail.
> 

Right. If we can't get the sink, we simply truncate the buffer.

>> That way another session could use the same sink if it is free. i.e
>>
>> perf record -e cs_etm/@sink0/u --per-thread app1
>>
>> and
>>
>> perf record -e cs_etm/@sink0/u --per-thread app2
>>
>> both can work as long as the sink is not used by the other session.
> 
> Like said above, if sink is shared between CPUs, that's going to be a
> trainwreck :/ Why do you want that?

That ship has sailed. That is how the current generation of systems
is, unfortunately. But as I said, this is changing and there are
guidelines in place to avoid this kind of topology. With future
technologies, this will be completely gone.

> 
> And once you have per-CPU sinks like mentioned above, the whole problem
> goes away.

True, until then, this is the best we could do.

Suzuki
Peter Zijlstra Oct. 23, 2020, 1:44 p.m. UTC | #19
On Fri, Oct 23, 2020 at 02:29:54PM +0100, Suzuki Poulose wrote:
> On 10/23/20 2:16 PM, Peter Zijlstra wrote:
> > On Fri, Oct 23, 2020 at 01:56:47PM +0100, Suzuki Poulose wrote:

> > > That way another session could use the same sink if it is free. i.e
> > > 
> > > perf record -e cs_etm/@sink0/u --per-thread app1
> > > 
> > > and
> > > 
> > > perf record -e cs_etm/@sink0/u --per-thread app2
> > > 
> > > both can work as long as the sink is not used by the other session.
> > 
> > Like said above, if sink is shared between CPUs, that's going to be a
> > trainwreck :/ Why do you want that?
> 
> That ship has sailed. That is how the current generation of systems are,
> unfortunately. But as I said, this is changing and there are guidelines
> in place to avoid these kind of topologies. With the future
> technologies, this will be completely gone.

I understand that the hardware is like that, but why do you want to
support this insanity in software?

If you only allow a single sink user (group) at the same time, your
problem goes away. Simply disallow the above scenario, do not allow
concurrent sink users if sinks are shared like this.

Have the perf-record of app2 above fail because the sink is in use
already.

Only if the hardware has per-CPU sinks can you allow this.
Mathieu Poirier Oct. 23, 2020, 8:37 p.m. UTC | #20
On Fri, Oct 23, 2020 at 03:44:16PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 23, 2020 at 02:29:54PM +0100, Suzuki Poulose wrote:
> > On 10/23/20 2:16 PM, Peter Zijlstra wrote:
> > > On Fri, Oct 23, 2020 at 01:56:47PM +0100, Suzuki Poulose wrote:
> 
> > > > That way another session could use the same sink if it is free. i.e
> > > > 
> > > > perf record -e cs_etm/@sink0/u --per-thread app1
> > > > 
> > > > and
> > > > 
> > > > perf record -e cs_etm/@sink0/u --per-thread app2
> > > > 
> > > > both can work as long as the sink is not used by the other session.
> > > 
> > > Like said above, if sink is shared between CPUs, that's going to be a
> > > trainwreck :/ Why do you want that?
> > 
> > That ship has sailed. That is how the current generation of systems are,
> > unfortunately. But as I said, this is changing and there are guidelines
> > in place to avoid these kind of topologies. With the future
> > technologies, this will be completely gone.
> 
> I understand that the hardware is like that, but why do you want to
> support this insanity in software?
> 
> If you only allow a single sink user (group) at the same time, your
> problem goes away. Simply disallow the above scenario, do not allow
> concurrent sink users if sinks are shared like this.
> 
> Have the perf-record of app2 above fail because the sink is in-user
> already.

I agree with you that --per-thread scenarios are easy to deal with, but to
support cpu-wide scenarios events must share a sink (because there is one event
per CPU).  CPU-wide support can't be removed because it has been around for
close to a couple of years and is heavily used. I also think using the pid of
the process that created the events, i.e. perf, is a good idea.  We just need
to agree on how to gain access to it.

In Sai's patch you objected to the following:

> +     struct task_struct *task = READ_ONCE(event->owner);
> +
> +     if (!task || is_kernel_event(event))

Would it be better to use task_pid_nr(current) instead of event->owner?  The
end result will be exactly the same.  There is also no need to check the
validity of @current since it is a user process.
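
In tmc_alloc_etf_buffer() that would be something like (a sketch):

	/*
	 * alloc_buffer() is called from etm_setup_aux(), i.e. in the
	 * context of the process creating the event, so current is
	 * the perf process itself.
	 */
	buf->pid = task_pid_nr(current);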

Thanks,
Mathieu 

[1]. https://elixir.bootlin.com/linux/latest/source/kernel/events/core.c#L6170

> 
> Only if the hardware has per-CPU sinks can you allow this.
Sai Prakash Ranjan Oct. 30, 2020, 7:59 a.m. UTC | #21
Hello guys,

On 2020-10-24 02:07, Mathieu Poirier wrote:
> On Fri, Oct 23, 2020 at 03:44:16PM +0200, Peter Zijlstra wrote:
>> On Fri, Oct 23, 2020 at 02:29:54PM +0100, Suzuki Poulose wrote:
>> > On 10/23/20 2:16 PM, Peter Zijlstra wrote:
>> > > On Fri, Oct 23, 2020 at 01:56:47PM +0100, Suzuki Poulose wrote:
>> 
>> > > > That way another session could use the same sink if it is free. i.e
>> > > >
>> > > > perf record -e cs_etm/@sink0/u --per-thread app1
>> > > >
>> > > > and
>> > > >
>> > > > perf record -e cs_etm/@sink0/u --per-thread app2
>> > > >
>> > > > both can work as long as the sink is not used by the other session.
>> > >
>> > > Like said above, if sink is shared between CPUs, that's going to be a
>> > > trainwreck :/ Why do you want that?
>> >
>> > That ship has sailed. That is how the current generation of systems are,
>> > unfortunately. But as I said, this is changing and there are guidelines
>> > in place to avoid these kind of topologies. With the future
>> > technologies, this will be completely gone.
>> 
>> I understand that the hardware is like that, but why do you want to
>> support this insanity in software?
>> 
>> If you only allow a single sink user (group) at the same time, your
>> problem goes away. Simply disallow the above scenario, do not allow
>> concurrent sink users if sinks are shared like this.
>> 
>> Have the perf-record of app2 above fail because the sink is in-user
>> already.
> 
> I agree with you that --per-thread scenarios are easy to deal with, but 
> to
> support cpu-wide scenarios events must share a sink (because there is 
> one event
> per CPU).  CPU-wide support can't be removed because it has been around
> for close to a couple of years and heavily used. I also think using the 
> pid of
> the process that created the events, i.e perf, is a good idea.  We just 
> need to
> agree on how to gain access to it.
> 
> In Sai's patch you objected to the following:
> 
>> +     struct task_struct *task = READ_ONCE(event->owner);
>> +
>> +     if (!task || is_kernel_event(event))
> 
> Would it be better to use task_pid_nr(current) instead of event->owner?
>  The end
> result will be exactly the same.  There is also no need to check the 
> validity of
> @current since it is a user process.
> 

We have devices deployed where these crashes are seen consistently,
so for some immediate relief, could we at least get some fix in this
cycle without a major design overhaul, which would likely take more
time? Perhaps my first patch [1] without any check for the owner, or
I can post a new version as Suzuki suggested [2], dropping the export
of is_kernel_event(). Then we can always work on top of it based on
the conclusion of this discussion; we will at least not have the
systems crash in the meantime. Thoughts?

[1] https://lore.kernel.org/patchwork/patch/1318098/
[2] https://lore.kernel.org/lkml/fa6cdf34-88a0-1050-b9ea-556d0a9438cb@arm.com/

Thanks,
Sai
Mathieu Poirier Oct. 30, 2020, 4:48 p.m. UTC | #22
On Fri, Oct 30, 2020 at 01:29:56PM +0530, Sai Prakash Ranjan wrote:
> Hello guys,
> 
> On 2020-10-24 02:07, Mathieu Poirier wrote:
> > On Fri, Oct 23, 2020 at 03:44:16PM +0200, Peter Zijlstra wrote:
> > > On Fri, Oct 23, 2020 at 02:29:54PM +0100, Suzuki Poulose wrote:
> > > > On 10/23/20 2:16 PM, Peter Zijlstra wrote:
> > > > > On Fri, Oct 23, 2020 at 01:56:47PM +0100, Suzuki Poulose wrote:
> > > 
> > > > > > That way another session could use the same sink if it is free. i.e
> > > > > >
> > > > > > perf record -e cs_etm/@sink0/u --per-thread app1
> > > > > >
> > > > > > and
> > > > > >
> > > > > > perf record -e cs_etm/@sink0/u --per-thread app2
> > > > > >
> > > > > > both can work as long as the sink is not used by the other session.
> > > > >
> > > > > Like said above, if sink is shared between CPUs, that's going to be a
> > > > > trainwreck :/ Why do you want that?
> > > >
> > > > That ship has sailed. That is how the current generation of systems are,
> > > > unfortunately. But as I said, this is changing and there are guidelines
> > > > in place to avoid these kind of topologies. With the future
> > > > technologies, this will be completely gone.
> > > 
> > > I understand that the hardware is like that, but why do you want to
> > > support this insanity in software?
> > > 
> > > If you only allow a single sink user (group) at the same time, your
> > > problem goes away. Simply disallow the above scenario, do not allow
> > > concurrent sink users if sinks are shared like this.
> > > 
> > > Have the perf-record of app2 above fail because the sink is in-user
> > > already.
> > 
> > I agree with you that --per-thread scenarios are easy to deal with, but
> > to
> > support cpu-wide scenarios events must share a sink (because there is
> > one event
> > per CPU).  CPU-wide support can't be removed because it has been around
> > for close to a couple of years and heavily used. I also think using the
> > pid of
> > the process that created the events, i.e perf, is a good idea.  We just
> > need to
> > agree on how to gain access to it.
> > 
> > In Sai's patch you objected to the following:
> > 
> > > +     struct task_struct *task = READ_ONCE(event->owner);
> > > +
> > > +     if (!task || is_kernel_event(event))
> > 
> > Would it be better to use task_pid_nr(current) instead of event->owner?
> > The end
> > result will be exactly the same.  There is also no need to check the
> > validity of
> > @current since it is a user process.
> > 
> 
> We have devices deployed where these crashes are seen consistently,
> so for some immediate relief, could we atleast get some fix in this
> cycle without major design overhaul which would likely take more time.
> Perhaps my first patch [1] without any check for owner or
> I can post a new version as Suzuki suggested [2] dropping the export
> of is_kernel_event(). Then we can always work on top of it based on the
> conclusion of this discussion, we will atleast not have the systems
> crash in the meantime, thoughts?

For the time being I think [1], exactly the way it is, is a reasonable way
forward.

Regards,
Mathieu

> 
> [1] https://lore.kernel.org/patchwork/patch/1318098/
> [2]
> https://lore.kernel.org/lkml/fa6cdf34-88a0-1050-b9ea-556d0a9438cb@arm.com/
> 
> Thanks,
> Sai
> 
> -- 
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
> of Code Aurora Forum, hosted by The Linux Foundation
Sai Prakash Ranjan Oct. 30, 2020, 5:26 p.m. UTC | #23
Hi Mathieu,

On 2020-10-30 22:18, Mathieu Poirier wrote:
> On Fri, Oct 30, 2020 at 01:29:56PM +0530, Sai Prakash Ranjan wrote:
>> Hello guys,
>> 
>> On 2020-10-24 02:07, Mathieu Poirier wrote:
>> > On Fri, Oct 23, 2020 at 03:44:16PM +0200, Peter Zijlstra wrote:
>> > > On Fri, Oct 23, 2020 at 02:29:54PM +0100, Suzuki Poulose wrote:
>> > > > On 10/23/20 2:16 PM, Peter Zijlstra wrote:
>> > > > > On Fri, Oct 23, 2020 at 01:56:47PM +0100, Suzuki Poulose wrote:
>> > >
>> > > > > > That way another session could use the same sink if it is free. i.e
>> > > > > >
>> > > > > > perf record -e cs_etm/@sink0/u --per-thread app1
>> > > > > >
>> > > > > > and
>> > > > > >
>> > > > > > perf record -e cs_etm/@sink0/u --per-thread app2
>> > > > > >
>> > > > > > both can work as long as the sink is not used by the other session.
>> > > > >
>> > > > > Like said above, if sink is shared between CPUs, that's going to be a
>> > > > > trainwreck :/ Why do you want that?
>> > > >
>> > > > That ship has sailed. That is how the current generation of systems are,
>> > > > unfortunately. But as I said, this is changing and there are guidelines
>> > > > in place to avoid these kind of topologies. With the future
>> > > > technologies, this will be completely gone.
>> > >
>> > > I understand that the hardware is like that, but why do you want to
>> > > support this insanity in software?
>> > >
>> > > If you only allow a single sink user (group) at the same time, your
>> > > problem goes away. Simply disallow the above scenario, do not allow
>> > > concurrent sink users if sinks are shared like this.
>> > >
>> > > Have the perf-record of app2 above fail because the sink is in-user
>> > > already.
>> >
>> > I agree with you that --per-thread scenarios are easy to deal with, but
>> > to
>> > support cpu-wide scenarios events must share a sink (because there is
>> > one event
>> > per CPU).  CPU-wide support can't be removed because it has been around
>> > for close to a couple of years and heavily used. I also think using the
>> > pid of
>> > the process that created the events, i.e perf, is a good idea.  We just
>> > need to
>> > agree on how to gain access to it.
>> >
>> > In Sai's patch you objected to the following:
>> >
>> > > +     struct task_struct *task = READ_ONCE(event->owner);
>> > > +
>> > > +     if (!task || is_kernel_event(event))
>> >
>> > Would it be better to use task_pid_nr(current) instead of event->owner?
>> > The end
>> > result will be exactly the same.  There is also no need to check the
>> > validity of
>> > @current since it is a user process.
>> >
>> 
>> We have devices deployed where these crashes are seen consistently,
>> so for some immediate relief, could we atleast get some fix in this
>> cycle without major design overhaul which would likely take more time.
>> Perhaps my first patch [1] without any check for owner or
>> I can post a new version as Suzuki suggested [2] dropping the export
>> of is_kernel_event(). Then we can always work on top of it based on 
>> the
>> conclusion of this discussion, we will atleast not have the systems
>> crash in the meantime, thoughts?
> 
> For the time being I think [1], exactly the way it is, is a reasonable 
> way
> forward.
> 

Sure, I just checked now and [1] still applies neatly on top of the
coresight next branch.

[1] https://lore.kernel.org/patchwork/patch/1318098/

Thanks,
Sai
Mathieu Poirier Nov. 4, 2020, 5:03 p.m. UTC | #24
On Fri, Oct 30, 2020 at 10:56:09PM +0530, Sai Prakash Ranjan wrote:
> Hi Mathieu,
> 
> On 2020-10-30 22:18, Mathieu Poirier wrote:
> > On Fri, Oct 30, 2020 at 01:29:56PM +0530, Sai Prakash Ranjan wrote:
> > > Hello guys,
> > > 
> > > On 2020-10-24 02:07, Mathieu Poirier wrote:
> > > > On Fri, Oct 23, 2020 at 03:44:16PM +0200, Peter Zijlstra wrote:
> > > > > On Fri, Oct 23, 2020 at 02:29:54PM +0100, Suzuki Poulose wrote:
> > > > > > On 10/23/20 2:16 PM, Peter Zijlstra wrote:
> > > > > > > On Fri, Oct 23, 2020 at 01:56:47PM +0100, Suzuki Poulose wrote:
> > > > >
> > > > > > > > That way another session could use the same sink if it is free. i.e
> > > > > > > >
> > > > > > > > perf record -e cs_etm/@sink0/u --per-thread app1
> > > > > > > >
> > > > > > > > and
> > > > > > > >
> > > > > > > > perf record -e cs_etm/@sink0/u --per-thread app2
> > > > > > > >
> > > > > > > > both can work as long as the sink is not used by the other session.
> > > > > > >
> > > > > > > Like said above, if sink is shared between CPUs, that's going to be a
> > > > > > > trainwreck :/ Why do you want that?
> > > > > >
> > > > > > That ship has sailed. That is how the current generation of systems are,
> > > > > > unfortunately. But as I said, this is changing and there are guidelines
> > > > > > in place to avoid these kind of topologies. With the future
> > > > > > technologies, this will be completely gone.
> > > > >
> > > > > I understand that the hardware is like that, but why do you want to
> > > > > support this insanity in software?
> > > > >
> > > > > If you only allow a single sink user (group) at the same time, your
> > > > > problem goes away. Simply disallow the above scenario, do not allow
> > > > > concurrent sink users if sinks are shared like this.
> > > > >
> > > > > Have the perf-record of app2 above fail because the sink is in-user
> > > > > already.
> > > >
> > > > I agree with you that --per-thread scenarios are easy to deal with, but
> > > > to
> > > > support cpu-wide scenarios events must share a sink (because there is
> > > > one event
> > > > per CPU).  CPU-wide support can't be removed because it has been around
> > > > for close to a couple of years and heavily used. I also think using the
> > > > pid of
> > > > the process that created the events, i.e perf, is a good idea.  We just
> > > > need to
> > > > agree on how to gain access to it.
> > > >
> > > > In Sai's patch you objected to the following:
> > > >
> > > > > +     struct task_struct *task = READ_ONCE(event->owner);
> > > > > +
> > > > > +     if (!task || is_kernel_event(event))
> > > >
> > > > Would it be better to use task_pid_nr(current) instead of event->owner?
> > > > The end
> > > > result will be exactly the same.  There is also no need to check the
> > > > validity of
> > > > @current since it is a user process.
> > > >
> > > 
> > > We have devices deployed where these crashes are seen consistently,
> > > so for some immediate relief, could we atleast get some fix in this
> > > cycle without major design overhaul which would likely take more time.
> > > Perhaps my first patch [1] without any check for owner or
> > > I can post a new version as Suzuki suggested [2] dropping the export
> > > of is_kernel_event(). Then we can always work on top of it based on
> > > the
> > > conclusion of this discussion, we will atleast not have the systems
> > > crash in the meantime, thoughts?
> > 
> > For the time being I think [1], exactly the way it is, is a reasonable
> > way
> > forward.
> > 
> 
> Sure, I just checked now and [1] still applies neatly on top of coresight
> next branch.
> 
> [1] https://lore.kernel.org/patchwork/patch/1318098/

I have applied both patches that were part of the set.

> 
> Thanks,
> Sai
> 
> -- 
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
> of Code Aurora Forum, hosted by The Linux Foundation

Patch

diff --git a/drivers/hwtracing/coresight/coresight-priv.h b/drivers/hwtracing/coresight/coresight-priv.h
index 65a29293b6cb..f5f654ea2994 100644
--- a/drivers/hwtracing/coresight/coresight-priv.h
+++ b/drivers/hwtracing/coresight/coresight-priv.h
@@ -87,6 +87,7 @@  enum cs_mode {
  * struct cs_buffer - keep track of a recording session' specifics
  * @cur:	index of the current buffer
  * @nr_pages:	max number of pages granted to us
+ * @pid:	PID this cs_buffer belongs to
  * @offset:	offset within the current buffer
  * @data_size:	how much we collected in this run
  * @snapshot:	is this run in snapshot mode
@@ -95,6 +96,7 @@  enum cs_mode {
 struct cs_buffers {
 	unsigned int		cur;
 	unsigned int		nr_pages;
+	pid_t			pid;
 	unsigned long		offset;
 	local_t			data_size;
 	bool			snapshot;
diff --git a/drivers/hwtracing/coresight/coresight-tmc-etf.c b/drivers/hwtracing/coresight/coresight-tmc-etf.c
index 44402d413ebb..86ff0dda0444 100644
--- a/drivers/hwtracing/coresight/coresight-tmc-etf.c
+++ b/drivers/hwtracing/coresight/coresight-tmc-etf.c
@@ -227,6 +227,7 @@  static int tmc_enable_etf_sink_perf(struct coresight_device *csdev, void *data)
 	unsigned long flags;
 	struct tmc_drvdata *drvdata = dev_get_drvdata(csdev->dev.parent);
 	struct perf_output_handle *handle = data;
+	struct cs_buffers *buf = etm_perf_sink_config(handle);
 
 	spin_lock_irqsave(&drvdata->spinlock, flags);
 	do {
@@ -243,7 +244,7 @@  static int tmc_enable_etf_sink_perf(struct coresight_device *csdev, void *data)
 		}
 
 		/* Get a handle on the pid of the process to monitor */
-		pid = task_pid_nr(handle->event->owner);
+		pid = buf->pid;
 
 		if (drvdata->pid != -1 && drvdata->pid != pid) {
 			ret = -EBUSY;
@@ -391,6 +392,10 @@  static void *tmc_alloc_etf_buffer(struct coresight_device *csdev,
 {
 	int node;
 	struct cs_buffers *buf;
+	struct task_struct *task = READ_ONCE(event->owner);
+
+	if (!task || is_kernel_event(event))
+		return NULL;
 
 	node = (event->cpu == -1) ? NUMA_NO_NODE : cpu_to_node(event->cpu);
 
@@ -399,6 +404,7 @@  static void *tmc_alloc_etf_buffer(struct coresight_device *csdev,
 	if (!buf)
 		return NULL;
 
+	buf->pid = task_pid_nr(task);
 	buf->snapshot = overwrite;
 	buf->nr_pages = nr_pages;
 	buf->data_pages = pages;