[RFC,5/7] drm/i915: Wait for GPU to finish before event stop in Gen Perf PMU
diff mbox

Message ID 1434966909-4113-6-git-send-email-sourab.gupta@intel.com
State New
Headers show

Commit Message

sourab.gupta@intel.com June 22, 2015, 9:55 a.m. UTC
From: Sourab Gupta <sourab.gupta@intel.com>

To collect timestamps around any GPU workload, we need to insert
commands to capture them into the ringbuffer. Therefore, during the stop event
call, we need to wait for GPU to complete processing the last request for
which these commands were inserted.
We need to ensure this processing is done before event_destroy callback which
deallocates the buffer for holding the data.

Signed-off-by: Sourab Gupta <sourab.gupta@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h     |  2 ++
 drivers/gpu/drm/i915/i915_oa_perf.c | 54 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 55 insertions(+), 1 deletion(-)

Comments

Chris Wilson June 22, 2015, 1:22 p.m. UTC | #1
On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@intel.com wrote:
> From: Sourab Gupta <sourab.gupta@intel.com>
> 
> To collect timestamps around any GPU workload, we need to insert
> commands to capture them into the ringbuffer. Therefore, during the stop event
> call, we need to wait for GPU to complete processing the last request for
> which these commands were inserted.
> We need to ensure this processing is done before event_destroy callback which
> deallocates the buffer for holding the data.

There's no reason for this to be synchronous. Just that you need an
active reference on the output buffer to be only released after the
final request.
-Chris
Daniel Vetter June 22, 2015, 4:09 p.m. UTC | #2
On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
> On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@intel.com wrote:
> > From: Sourab Gupta <sourab.gupta@intel.com>
> > 
> > To collect timestamps around any GPU workload, we need to insert
> > commands to capture them into the ringbuffer. Therefore, during the stop event
> > call, we need to wait for GPU to complete processing the last request for
> > which these commands were inserted.
> > We need to ensure this processing is done before event_destroy callback which
> > deallocates the buffer for holding the data.
> 
> There's no reason for this to be synchronous. Just that you need an
> active reference on the output buffer to be only released after the
> final request.

Yeah I think the interaction between OA sampling and GEM is the critical
piece here for both patch series. Step one is to have a per-pmu lock to
keep track of data private to OA and mmio based sampling as a starting
point. Then we need to figure out how to structure the connection without
OA/PMU and gem trampling over each another's feet.

Wrt requests those are refcounted already and just waiting on them doesn't
need dev->struct_mutex. That should be all you need, assuming you do
correctly refcount them ...
-Daniel
sourab.gupta@intel.com June 25, 2015, 6:02 a.m. UTC | #3
On Mon, 2015-06-22 at 16:09 +0000, Daniel Vetter wrote:
> On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:

> > On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@intel.com wrote:

> > > From: Sourab Gupta <sourab.gupta@intel.com>

> > > 

> > > To collect timestamps around any GPU workload, we need to insert

> > > commands to capture them into the ringbuffer. Therefore, during the stop event

> > > call, we need to wait for GPU to complete processing the last request for

> > > which these commands were inserted.

> > > We need to ensure this processing is done before event_destroy callback which

> > > deallocates the buffer for holding the data.

> > 

> > There's no reason for this to be synchronous. Just that you need an

> > active reference on the output buffer to be only released after the

> > final request.

> 

> Yeah I think the interaction between OA sampling and GEM is the critical

> piece here for both patch series. Step one is to have a per-pmu lock to

> keep track of data private to OA and mmio based sampling as a starting

> point. Then we need to figure out how to structure the connection without

> OA/PMU and gem trampling over each another's feet.

> 

> Wrt requests those are refcounted already and just waiting on them doesn't

> need dev->struct_mutex. That should be all you need, assuming you do

> correctly refcount them ...

> -Daniel


Hi Daniel/Chris,

Thanks for your feedback. I acknowledge the issue with increasing
coverage scope of i915 dev mutex. I'll have a per-pmu lock to keep track
of data private to OA in my next version of patches.

W.r.t. having the synchronous wait during event stop, I thought through
the method of having active reference on output buffer. This will let us
have a delayed buffer destroy (controlled by decrement of output buffer
refcount as and when gpu requests complete). This will be decoupled from
the event stop here. But that is not the only thing at play here.

The event stop is expected to disable the OA unit (which it is doing
right now). In my usecase, I can't disable OA unit, till the time GPU
processes the MI_RPC requests scheduled already, which otherwise may
lead to hangs.
So, I'm waiting synchronously for requests to complete before disabling
OA unit.

Also, the event_destroy callback should be destroying the buffers
associated with event. (Right now, in current design, event_destroy
callback waits for the above operations to be completed before
proceeding to destroy the buffer).

If I try to create a function which destroys the output buffer according
to all references being released, all these operations will have to be
performed together in that function, in this serial order (so that when
refcount drops to 0, i.e. requests have completed, OA unit will be
disabled, and then the buffer destroyed). This would lead to these
operations being decoupled from the event_stop and event_destroy
functions. This may be non-intentional behaviour w.r.t. these callbacks
from userspace perspective.

Robert, any thoughts?
Daniel Vetter June 25, 2015, 7:42 a.m. UTC | #4
On Thu, Jun 25, 2015 at 06:02:35AM +0000, Gupta, Sourab wrote:
> On Mon, 2015-06-22 at 16:09 +0000, Daniel Vetter wrote:
> > On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
> > > On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@intel.com wrote:
> > > > From: Sourab Gupta <sourab.gupta@intel.com>
> > > > 
> > > > To collect timestamps around any GPU workload, we need to insert
> > > > commands to capture them into the ringbuffer. Therefore, during the stop event
> > > > call, we need to wait for GPU to complete processing the last request for
> > > > which these commands were inserted.
> > > > We need to ensure this processing is done before event_destroy callback which
> > > > deallocates the buffer for holding the data.
> > > 
> > > There's no reason for this to be synchronous. Just that you need an
> > > active reference on the output buffer to be only released after the
> > > final request.
> > 
> > Yeah I think the interaction between OA sampling and GEM is the critical
> > piece here for both patch series. Step one is to have a per-pmu lock to
> > keep track of data private to OA and mmio based sampling as a starting
> > point. Then we need to figure out how to structure the connection without
> > OA/PMU and gem trampling over each another's feet.
> > 
> > Wrt requests those are refcounted already and just waiting on them doesn't
> > need dev->struct_mutex. That should be all you need, assuming you do
> > correctly refcount them ...
> > -Daniel
> 
> Hi Daniel/Chris,
> 
> Thanks for your feedback. I acknowledge the issue with increasing
> coverage scope of i915 dev mutex. I'll have a per-pmu lock to keep track
> of data private to OA in my next version of patches.

If you create new locking schemes please coordinate with Rob about them.
Better if we come up with a consistent locking scheme for all of OA. That
ofc doesn't exclude that we'll have per-pmu locking, just that the locking
design should match if it makes sense. The example I'm thinking of is
drrs&psr which both use the frontbuffer tracking code and both have their
private mutex. But the overall approach to locking and cleaning up the
async work is the same.

> W.r.t. having the synchronous wait during event stop, I thought through
> the method of having active reference on output buffer. This will let us
> have a delayed buffer destroy (controlled by decrement of output buffer
> refcount as and when gpu requests complete). This will be decoupled from
> the event stop here. But that is not the only thing at play here.
> 
> The event stop is expected to disable the OA unit (which it is doing
> right now). In my usecase, I can't disable OA unit, till the time GPU
> processes the MI_RPC requests scheduled already, which otherwise may
> lead to hangs.
> So, I'm waiting synchronously for requests to complete before disabling
> OA unit.
> 
> Also, the event_destroy callback should be destroying the buffers
> associated with event. (Right now, in current design, event_destroy
> callback waits for the above operations to be completed before
> proceeding to destroy the buffer).
> 
> If I try to create a function which destroys the output buffer according
> to all references being released, all these operations will have to be
> performed together in that function, in this serial order (so that when
> refcount drops to 0, i.e. requests have completed, OA unit will be
> disabled, and then the buffer destroyed). This would lead to these
> operations being decoupled from the event_stop and event_destroy
> functions. This may be non-intentional behaviour w.r.t. these callbacks
> from userspace perspective.
> 
> Robert, any thoughts?

Yeah now idea whether those perf callbacks are allowed to stall for a
longer time. If not we could simply push the cleanup/OA disabling into an
async work.
-Daniel
Chris Wilson June 25, 2015, 8:02 a.m. UTC | #5
On Thu, Jun 25, 2015 at 06:02:35AM +0000, Gupta, Sourab wrote:
> On Mon, 2015-06-22 at 16:09 +0000, Daniel Vetter wrote:
> > On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
> > > On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@intel.com wrote:
> > > > From: Sourab Gupta <sourab.gupta@intel.com>
> > > > 
> > > > To collect timestamps around any GPU workload, we need to insert
> > > > commands to capture them into the ringbuffer. Therefore, during the stop event
> > > > call, we need to wait for GPU to complete processing the last request for
> > > > which these commands were inserted.
> > > > We need to ensure this processing is done before event_destroy callback which
> > > > deallocates the buffer for holding the data.
> > > 
> > > There's no reason for this to be synchronous. Just that you need an
> > > active reference on the output buffer to be only released after the
> > > final request.
> > 
> > Yeah I think the interaction between OA sampling and GEM is the critical
> > piece here for both patch series. Step one is to have a per-pmu lock to
> > keep track of data private to OA and mmio based sampling as a starting
> > point. Then we need to figure out how to structure the connection without
> > OA/PMU and gem trampling over each another's feet.
> > 
> > Wrt requests those are refcounted already and just waiting on them doesn't
> > need dev->struct_mutex. That should be all you need, assuming you do
> > correctly refcount them ...
> > -Daniel
> 
> Hi Daniel/Chris,
> 
> Thanks for your feedback. I acknowledge the issue with increasing
> coverage scope of i915 dev mutex. I'll have a per-pmu lock to keep track
> of data private to OA in my next version of patches.
> 
> W.r.t. having the synchronous wait during event stop, I thought through
> the method of having active reference on output buffer. This will let us
> have a delayed buffer destroy (controlled by decrement of output buffer
> refcount as and when gpu requests complete). This will be decoupled from
> the event stop here. But that is not the only thing at play here.
> 
> The event stop is expected to disable the OA unit (which it is doing
> right now). In my usecase, I can't disable OA unit, till the time GPU
> processes the MI_RPC requests scheduled already, which otherwise may
> lead to hangs.
> So, I'm waiting synchronously for requests to complete before disabling
> OA unit.

There's no issue here. Just hook into the retire-notification callback
for the last active request of the oa buffer. If you are using LRI to
disable the OA Context writing to the buffer, you can simply create a
request for the LRI and use the active reference as the retire
notification. (Otoh, if the oa buffer is inactive you can simply do the
decoupling immediately.) You shouldn't need a object-free-notification
callback aiui.
 
> Also, the event_destroy callback should be destroying the buffers
> associated with event. (Right now, in current design, event_destroy
> callback waits for the above operations to be completed before
> proceeding to destroy the buffer).
> 
> If I try to create a function which destroys the output buffer according
> to all references being released, all these operations will have to be
> performed together in that function, in this serial order (so that when
> refcount drops to 0, i.e. requests have completed, OA unit will be
> disabled, and then the buffer destroyed). This would lead to these
> operations being decoupled from the event_stop and event_destroy
> functions. This may be non-intentional behaviour w.r.t. these callbacks
> from userspace perspective.

When the perf event is stopped, you stop generating perf events. But the
buffer may still be alive, and may be resurrected if you a new event is
started, but you will want to be sure to ensure that pending oa writes
are ignored.
-Chris
sourab.gupta@intel.com June 25, 2015, 8:27 a.m. UTC | #6
On Thu, 2015-06-25 at 07:42 +0000, Daniel Vetter wrote:
> On Thu, Jun 25, 2015 at 06:02:35AM +0000, Gupta, Sourab wrote:

> > On Mon, 2015-06-22 at 16:09 +0000, Daniel Vetter wrote:

> > > On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:

> > > > On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@intel.com wrote:

> > > > > From: Sourab Gupta <sourab.gupta@intel.com>

> > > > > 

> > > > > To collect timestamps around any GPU workload, we need to insert

> > > > > commands to capture them into the ringbuffer. Therefore, during the stop event

> > > > > call, we need to wait for GPU to complete processing the last request for

> > > > > which these commands were inserted.

> > > > > We need to ensure this processing is done before event_destroy callback which

> > > > > deallocates the buffer for holding the data.

> > > > 

> > > > There's no reason for this to be synchronous. Just that you need an

> > > > active reference on the output buffer to be only released after the

> > > > final request.

> > > 

> > > Yeah I think the interaction between OA sampling and GEM is the critical

> > > piece here for both patch series. Step one is to have a per-pmu lock to

> > > keep track of data private to OA and mmio based sampling as a starting

> > > point. Then we need to figure out how to structure the connection without

> > > OA/PMU and gem trampling over each another's feet.

> > > 

> > > Wrt requests those are refcounted already and just waiting on them doesn't

> > > need dev->struct_mutex. That should be all you need, assuming you do

> > > correctly refcount them ...

> > > -Daniel

> > 

> > Hi Daniel/Chris,

> > 

> > Thanks for your feedback. I acknowledge the issue with increasing

> > coverage scope of i915 dev mutex. I'll have a per-pmu lock to keep track

> > of data private to OA in my next version of patches.

> 

> If you create new locking schemes please coordinate with Rob about them.

> Better if we come up with a consistent locking scheme for all of OA. That

> ofc doesn't exclude that we'll have per-pmu locking, just that the locking

> design should match if it makes sense. The example I'm thinking of is

> drrs&psr which both use the frontbuffer tracking code and both have their

> private mutex. But the overall approach to locking and cleaning up the

> async work is the same.

> 

Ok, I'll coordinate with Robert about a consistent locking scheme here.

> > W.r.t. having the synchronous wait during event stop, I thought through

> > the method of having active reference on output buffer. This will let us

> > have a delayed buffer destroy (controlled by decrement of output buffer

> > refcount as and when gpu requests complete). This will be decoupled from

> > the event stop here. But that is not the only thing at play here.

> > 

> > The event stop is expected to disable the OA unit (which it is doing

> > right now). In my usecase, I can't disable OA unit, till the time GPU

> > processes the MI_RPC requests scheduled already, which otherwise may

> > lead to hangs.

> > So, I'm waiting synchronously for requests to complete before disabling

> > OA unit.

> > 

> > Also, the event_destroy callback should be destroying the buffers

> > associated with event. (Right now, in current design, event_destroy

> > callback waits for the above operations to be completed before

> > proceeding to destroy the buffer).

> > 

> > If I try to create a function which destroys the output buffer according

> > to all references being released, all these operations will have to be

> > performed together in that function, in this serial order (so that when

> > refcount drops to 0, i.e. requests have completed, OA unit will be

> > disabled, and then the buffer destroyed). This would lead to these

> > operations being decoupled from the event_stop and event_destroy

> > functions. This may be non-intentional behaviour w.r.t. these callbacks

> > from userspace perspective.

> > 

> > Robert, any thoughts?

> 

> Yeah now idea whether those perf callbacks are allowed to stall for a

> longer time. If not we could simply push the cleanup/OA disabling into an

> async work.

> -Daniel


Hi Daniel,
Actually right now, the stop_event callback is not stalled. Instead I'm
scheduling an async work fn to do the job. The event destroy is then
waiting for the work fn to finish processing, before proceeding on to
destroy the buffer.
I'm also thinking through Chris' suggestions here and will try to come
up with solution based on them.

Thanks,
Sourab
Robert Bragg June 25, 2015, 11:47 a.m. UTC | #7
On Thu, Jun 25, 2015 at 9:27 AM, Gupta, Sourab <sourab.gupta@intel.com> wrote:
> On Thu, 2015-06-25 at 07:42 +0000, Daniel Vetter wrote:
>> On Thu, Jun 25, 2015 at 06:02:35AM +0000, Gupta, Sourab wrote:
>> > On Mon, 2015-06-22 at 16:09 +0000, Daniel Vetter wrote:
>> > > On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
>> > > > On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@intel.com wrote:
>> > > > > From: Sourab Gupta <sourab.gupta@intel.com>
>> > > > >
>> > > > > To collect timestamps around any GPU workload, we need to insert
>> > > > > commands to capture them into the ringbuffer. Therefore, during the stop event
>> > > > > call, we need to wait for GPU to complete processing the last request for
>> > > > > which these commands were inserted.
>> > > > > We need to ensure this processing is done before event_destroy callback which
>> > > > > deallocates the buffer for holding the data.
>> > > >
>> > > > There's no reason for this to be synchronous. Just that you need an
>> > > > active reference on the output buffer to be only released after the
>> > > > final request.
>> > >
>> > > Yeah I think the interaction between OA sampling and GEM is the critical
>> > > piece here for both patch series. Step one is to have a per-pmu lock to
>> > > keep track of data private to OA and mmio based sampling as a starting
>> > > point. Then we need to figure out how to structure the connection without
>> > > OA/PMU and gem trampling over each another's feet.
>> > >
>> > > Wrt requests those are refcounted already and just waiting on them doesn't
>> > > need dev->struct_mutex. That should be all you need, assuming you do
>> > > correctly refcount them ...
>> > > -Daniel
>> >
>> > Hi Daniel/Chris,
>> >
>> > Thanks for your feedback. I acknowledge the issue with increasing
>> > coverage scope of i915 dev mutex. I'll have a per-pmu lock to keep track
>> > of data private to OA in my next version of patches.
>>
>> If you create new locking schemes please coordinate with Rob about them.
>> Better if we come up with a consistent locking scheme for all of OA. That
>> ofc doesn't exclude that we'll have per-pmu locking, just that the locking
>> design should match if it makes sense. The example I'm thinking of is
>> drrs&psr which both use the frontbuffer tracking code and both have their
>> private mutex. But the overall approach to locking and cleaning up the
>> async work is the same.
>>
> Ok, I'll coordinate with Robert about a consistent locking scheme here.
>
>> > W.r.t. having the synchronous wait during event stop, I thought through
>> > the method of having active reference on output buffer. This will let us
>> > have a delayed buffer destroy (controlled by decrement of output buffer
>> > refcount as and when gpu requests complete). This will be decoupled from
>> > the event stop here. But that is not the only thing at play here.
>> >
>> > The event stop is expected to disable the OA unit (which it is doing
>> > right now). In my usecase, I can't disable OA unit, till the time GPU
>> > processes the MI_RPC requests scheduled already, which otherwise may
>> > lead to hangs.
>> > So, I'm waiting synchronously for requests to complete before disabling
>> > OA unit.
>> >
>> > Also, the event_destroy callback should be destroying the buffers
>> > associated with event. (Right now, in current design, event_destroy
>> > callback waits for the above operations to be completed before
>> > proceeding to destroy the buffer).
>> >
>> > If I try to create a function which destroys the output buffer according
>> > to all references being released, all these operations will have to be
>> > performed together in that function, in this serial order (so that when
>> > refcount drops to 0, i.e. requests have completed, OA unit will be
>> > disabled, and then the buffer destroyed). This would lead to these
>> > operations being decoupled from the event_stop and event_destroy
>> > functions. This may be non-intentional behaviour w.r.t. these callbacks
>> > from userspace perspective.
>> >
>> > Robert, any thoughts?
>>
>> Yeah now idea whether those perf callbacks are allowed to stall for a
>> longer time. If not we could simply push the cleanup/OA disabling into an
>> async work.
>> -Daniel
>
> Hi Daniel,
> Actually right now, the stop_event callback is not stalled. Instead I'm
> scheduling an async work fn to do the job. The event destroy is then
> waiting for the work fn to finish processing, before proceeding on to
> destroy the buffer.
> I'm also thinking through Chris' suggestions here and will try to come
> up with solution based on them.

A notable detail here is that most pmu methods are called in atomic
context by events/core.c via inter-processor interrupts (as a somewhat
unintended side effect of userspace opening i915_oa events as
specific-cpu events perf will make sure all methods are invoked on
that specific cpu). This means waiting in pmu->event_stop() isn't even
an option for us, though as noted it's only the destroy currently that
waits for the completion of the stop work fn. event_init() and
event->destroy() are two exceptions that are called in process
context. That said though, it does seem worth considering if we can
avoid waiting in the event_destroy() if not strictly necessary, and
chris's suggestion to follow a refcounting scheme sounds workable.

Regards,
- Robert

>
> Thanks,
> Sourab
>
Robert Bragg June 25, 2015, 1:02 p.m. UTC | #8
On Thu, Jun 25, 2015 at 7:02 AM, Gupta, Sourab <sourab.gupta@intel.com> wrote:
> On Mon, 2015-06-22 at 16:09 +0000, Daniel Vetter wrote:
>> On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
>> > On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@intel.com wrote:
>> > > From: Sourab Gupta <sourab.gupta@intel.com>
>> > >
>> > > To collect timestamps around any GPU workload, we need to insert
>> > > commands to capture them into the ringbuffer. Therefore, during the stop event
>> > > call, we need to wait for GPU to complete processing the last request for
>> > > which these commands were inserted.
>> > > We need to ensure this processing is done before event_destroy callback which
>> > > deallocates the buffer for holding the data.
>> >
>> > There's no reason for this to be synchronous. Just that you need an
>> > active reference on the output buffer to be only released after the
>> > final request.
>>
>> Yeah I think the interaction between OA sampling and GEM is the critical
>> piece here for both patch series. Step one is to have a per-pmu lock to
>> keep track of data private to OA and mmio based sampling as a starting
>> point. Then we need to figure out how to structure the connection without
>> OA/PMU and gem trampling over each another's feet.

From the initial i915_oa code I think the notable locks are:

the struct perf_event_context::lock spin lock taken by events/core.c
before calling any of our pmu methods (with the exception of
event_init and event->destroy()) so we at least don't need to worry
about pmu methods trampling each other.

I think we should only take struct_mutex to meet the existing
assumptions of gem code we're using. Since most pmu methods are in
interrupt context our calling into gem is generally constrained to
event_init or event->destroy() without having to use workers.

we have a pmu spin lock to protect OA register state, notably because
besides the pmu methods we also have hooks into gem context switching
or pinning/unpinning that may trigger updates to the OACONTROL state
vs e.g. disabling OACONTROL in pmu->stop() which may concurrently on
different cpus.

we have an oa_buffer.flush_lock spin lock since in addition to a
hrtimer periodically forwarding samples from OABUFFER to perf,
userspace may also explicitly request a flush via fsync() or we might
flush at pmu->stop(). As a further detail here, if the hrtimer
contends with userspace flushing the hrtimer won't ever actually wait
on flush_lock, assuming it's not necessary to then have a back-to-back
flush while the buffer will probably be empty.

I have to admit I haven't really scrutinized the locking details of
Sourab's work yet, but it may make sense to introduce some additional
exclusion scheme for managing access to to the fifo of pending RPC
commands. I need to double check with Sourab, but I think we'd already
discussed some changes to how forwarding of these RPC based reports
will be managed where I thought we might be able to avoid the need for
a worker and struct_mutex locking while forwarding. I think Sourab may
have just preferred to send something out before this refactor to try
and solicit broader, high level feedback on his work sooner. I imagine
if we do refactor to avoid the worker for forwarding though that will
affect the locking details regarding the fifo structure.

Regards,
- Robert

>>
>> Wrt requests those are refcounted already and just waiting on them doesn't
>> need dev->struct_mutex. That should be all you need, assuming you do
>> correctly refcount them ...
>> -Daniel
>
> Hi Daniel/Chris,
>
> Thanks for your feedback. I acknowledge the issue with increasing
> coverage scope of i915 dev mutex. I'll have a per-pmu lock to keep track
> of data private to OA in my next version of patches.
>
> W.r.t. having the synchronous wait during event stop, I thought through
> the method of having active reference on output buffer. This will let us
> have a delayed buffer destroy (controlled by decrement of output buffer
> refcount as and when gpu requests complete). This will be decoupled from
> the event stop here. But that is not the only thing at play here.
>
> The event stop is expected to disable the OA unit (which it is doing
> right now). In my usecase, I can't disable OA unit, till the time GPU
> processes the MI_RPC requests scheduled already, which otherwise may
> lead to hangs.
> So, I'm waiting synchronously for requests to complete before disabling
> OA unit.
>
> Also, the event_destroy callback should be destroying the buffers
> associated with event. (Right now, in current design, event_destroy
> callback waits for the above operations to be completed before
> proceeding to destroy the buffer).
>
> If I try to create a function which destroys the output buffer according
> to all references being released, all these operations will have to be
> performed together in that function, in this serial order (so that when
> refcount drops to 0, i.e. requests have completed, OA unit will be
> disabled, and then the buffer destroyed). This would lead to these
> operations being decoupled from the event_stop and event_destroy
> functions. This may be non-intentional behaviour w.r.t. these callbacks
> from userspace perspective.
>
> Robert, any thoughts?
>

I think a few pertinent details here that inform how we need to handle this are:

1) Almost all the pmu methods are called in atomic context (except
event_init) as they are invoked via events/core.c via an
inter-processor interrupt so waiting for a completion in event_destroy
isn't really an option for us.
Robert Bragg June 25, 2015, 1:07 p.m. UTC | #9
>> Robert, any thoughts?
>>
>
> I think a few pertinent details here that inform how we need to handle this are:
>
> 1) Almost all the pmu methods are called in atomic context (except
> event_init) as they are invoked via events/core.c via an
> inter-processor interrupt so waiting for a completion in event_destroy
> isn't really an option for us.

Oops meant to delete this incomplete comment. Before double checking
events/core.c I was worried that event->destroy() might be called in
interrupt context, but that's not the case.

- Robert
Robert Bragg June 25, 2015, 5:31 p.m. UTC | #10
On Thu, Jun 25, 2015 at 9:02 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Thu, Jun 25, 2015 at 06:02:35AM +0000, Gupta, Sourab wrote:
>> On Mon, 2015-06-22 at 16:09 +0000, Daniel Vetter wrote:
>> > On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
>> > > On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@intel.com wrote:
>> > > > From: Sourab Gupta <sourab.gupta@intel.com>
>> > > >
>> > > > To collect timestamps around any GPU workload, we need to insert
>> > > > commands to capture them into the ringbuffer. Therefore, during the stop event
>> > > > call, we need to wait for GPU to complete processing the last request for
>> > > > which these commands were inserted.
>> > > > We need to ensure this processing is done before event_destroy callback which
>> > > > deallocates the buffer for holding the data.
>> > >
>> > > There's no reason for this to be synchronous. Just that you need an
>> > > active reference on the output buffer to be only released after the
>> > > final request.
>> >
>> > Yeah I think the interaction between OA sampling and GEM is the critical
>> > piece here for both patch series. Step one is to have a per-pmu lock to
>> > keep track of data private to OA and mmio based sampling as a starting
>> > point. Then we need to figure out how to structure the connection without
>> > OA/PMU and gem trampling over each another's feet.
>> >
>> > Wrt requests those are refcounted already and just waiting on them doesn't
>> > need dev->struct_mutex. That should be all you need, assuming you do
>> > correctly refcount them ...
>> > -Daniel
>>
>> Hi Daniel/Chris,
>>
>> Thanks for your feedback. I acknowledge the issue with increasing
>> coverage scope of i915 dev mutex. I'll have a per-pmu lock to keep track
>> of data private to OA in my next version of patches.
>>
>> W.r.t. having the synchronous wait during event stop, I thought through
>> the method of having active reference on output buffer. This will let us
>> have a delayed buffer destroy (controlled by decrement of output buffer
>> refcount as and when gpu requests complete). This will be decoupled from
>> the event stop here. But that is not the only thing at play here.
>>
>> The event stop is expected to disable the OA unit (which it is doing
>> right now). In my usecase, I can't disable OA unit, till the time GPU
>> processes the MI_RPC requests scheduled already, which otherwise may
>> lead to hangs.
>> So, I'm waiting synchronously for requests to complete before disabling
>> OA unit.
>
> There's no issue here. Just hook into the retire-notification callback
> for the last active request of the oa buffer. If you are using LRI to
> disable the OA Context writing to the buffer, you can simply create a
> request for the LRI and use the active reference as the retire
> notification. (Otoh, if the oa buffer is inactive you can simply do the
> decoupling immediately.) You shouldn't need a object-free-notification
> callback aiui.
>
>> Also, the event_destroy callback should be destroying the buffers
>> associated with event. (Right now, in current design, event_destroy
>> callback waits for the above operations to be completed before
>> proceeding to destroy the buffer).
>>
>> If I try to create a function which destroys the output buffer according
>> to all references being released, all these operations will have to be
>> performed together in that function, in this serial order (so that when
>> refcount drops to 0, i.e. requests have completed, OA unit will be
>> disabled, and then the buffer destroyed). This would lead to these
>> operations being decoupled from the event_stop and event_destroy
>> functions. This may be non-intentional behaviour w.r.t. these callbacks
>> from userspace perspective.
>
> When the perf event is stopped, you stop generating perf events. But the
> buffer may still be alive, and may be resurrected if you a new event is
> started, but you will want to be sure to ensure that pending oa writes
> are ignored.
> -Chris

Just to summarise some things I know Sourab discussed with Chris on
IRC and I then also discussed with Sourab offline.

It does look like we can avoid any waiting in event->destroy(),
decoupling tracking of outstanding RPC commands+oacontrol disable and
the event active status. There's no strict reason OACONTROL needs to
be disabled when stopping or destroying an event if there are still
pending RPC commands, we just need to stop forwarding samples while
disabled and discard reports that land after an event is destroyed. We
can also update the hw.state and active state before OACONTROL is
disabled, whereas currently the code is updating this state outside of
core perf's context lock which isn't good and we might also end up
continuing to forward periodic samples from the oabuffer to perf when
the user expects the event to be disabled.

There was a discussion of updating oacontrol via LRIs, but my concern
here is with wanting to update oacontrol state for periodic sampling
use cases where it would seem awkward to handle that via LRIs.
Enabling/disabling periodic sampling can be considered orthogonal from
fully disabling oacontrol and it will be useful to use periodic
sampling in conjunction with RPC sampling. When we stop an event we
should be able to immediately stop periodic sampling even if we must
leave OA enabled for a pending RPC command. If the event is started
again we can quickly re-enable periodic sampling but it might be
awkward to consider an in-flight LRI we know is going to disable
oacontrol soon. Related to using LRIs though was the idea of using the
LRI request to help manage the lifetime of the destination buffer, by
adding the vmas of the dest buffer to the request's active list. It
seems like we should be able to do this kind of thing with the
requests associated with the MI_RPC commands instead.

Thinking about the details of waiting for the last RPC command before
destroying the dest buffer and disabling OACONTROL these are the
requirements I see:
- we want free the dest buffer in a finite time, since it's large
(i.e. don't want to assume it's ok to keep around if allocated once)
- must wait for last RPC commands to complete before disabling
oacontrol (and can free the dest buffer at the same point)
- in any subsequent event_init() we need to be able to verify we don't
/still/ have pending RPC commands, because an incompatible oacontrol
requirement at this point is a corner case where we really do have to
block and wait for those RPC commands to complete or say return
-EBUSY.

Without the retire-notification api that Chris has been experimenting
with (that could provide us a convenient callback when our RPC
commands retire), the current plan is to still schedule some work in
event->destroy() to wait for the last outstanding RPC command to
retire before disabling oacontrol as well as free the RPC destination
buffer, but notably there's no need to block on its completion. I
imagine this could later easily be adapted to work with a
retire-notification api, and maybe Sourab could even experiment with
Chris's wip code for this to compare.

In the end, I think the only time we'll really need to block waiting
for outstanding requests is in event_init() in cases where there are
still pending RPC commands and the previous oacontrol report-format is
smaller than the newly requested format (which should basically be
never I guess if in practice, most users will be requesting the
largest report format... and technically I suppose there are even lots
of cases where you could safely allow the report size to increase if
you know there's adequate space in the old dest buffer; though
probably not worth fussing about.).


Besides avoiding blocking in event->destroy(); I'd been thinking we
wouldn't need the worker for forwarding the RPC reports to perf and
could instead just use a hrtimer like we do for periodic samples.
Talking this through with Sourab though, he tried this, and it looks
like it would be awkward due to needing to drop the references on
request structs which may in-turn need to take struct_mutex to finally
free.

In terms of locking while forwarding samples and in the insert_cmd
hooks, for the fifo structures tracking pending commands we should
stop using struct_mutex (except if necessary for calling into gem) and
we should also avoid locking around all of the forwarding work with
any mutex that (if anything) just needs to protect updates to the fifo
itself. I guess there's some reasonable lock free way to handle adding
and removing from these fifos, but haven't considered that carefully
and don't have a strong opinion on particular details here.


Ok I think that pretty much summarises what was discussed offline,

Regards,
- Robert

>
> --
> Chris Wilson, Intel Open Source Technology Centre
Chris Wilson June 25, 2015, 5:37 p.m. UTC | #11
On Thu, Jun 25, 2015 at 06:31:52PM +0100, Robert Bragg wrote:
> Thinking about the details of waiting for the last RPC command before
> destroying the dest buffer and disabling OACONTROL these are the
> requirements I see:
> - we want free the dest buffer in a finite time, since it's large
> (i.e. don't want to assume it's ok to keep around if allocated once)

Skipping to this point, no you don't. If you mark the buffer as
purgeable after you stop tracking it (though they may still be pending
writes from the GPU), the system will preferentially reuse the buffer's
backing storage when memory is tight (including waiting for the GPU).
Similarly, you need to unpin it as soon as possible then the address
space is available for reuse asap.
-Chris
Chris Wilson June 25, 2015, 6:20 p.m. UTC | #12
On Thu, Jun 25, 2015 at 06:37:18PM +0100, Chris Wilson wrote:
> On Thu, Jun 25, 2015 at 06:31:52PM +0100, Robert Bragg wrote:
> > Thinking about the details of waiting for the last RPC command before
> > destroying the dest buffer and disabling OACONTROL these are the
> > requirements I see:
> > - we want free the dest buffer in a finite time, since it's large
> > (i.e. don't want to assume it's ok to keep around if allocated once)
> 
> Skipping to this point, no you don't. If you mark the buffer as
> purgeable after you stop tracking it (though they may still be pending
> writes from the GPU), the system will preferentially reuse the buffer's
> backing storage when memory is tight (including waiting for the GPU).
> Similarly, you need to unpin it as soon as possible then the address
> space is available for reuse asap.

Or perhaps more to the point, you cannot free the buffer before the GPU
is finished. It doesn't matter if you want to free it synchronously, it
cannot happen before it would have been freed by dropping the active
reference anyway.
-Chris

Patch
diff mbox

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 25c0938..a0e1d17 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2022,6 +2022,8 @@  struct drm_i915_private {
 			u32 tail;
 		} buffer;
 		struct work_struct work_timer;
+		struct work_struct work_event_stop;
+		struct completion complete;
 	} gen_pmu;
 
 	struct list_head profile_cmd;
diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
index e3e867f..574b6d3 100644
--- a/drivers/gpu/drm/i915/i915_oa_perf.c
+++ b/drivers/gpu/drm/i915/i915_oa_perf.c
@@ -306,6 +306,9 @@  void forward_gen_pmu_snapshots_work(struct work_struct *__work)
 	int head, tail, num_nodes, ret;
 	struct drm_i915_gem_request *req;
 
+	if (dev_priv->gen_pmu.event_active == false)
+		return;
+
 	first_node = (struct drm_i915_ts_node *)
 			((char *)hdr + hdr->data_offset);
 	num_nodes = (hdr->size_in_bytes - hdr->data_offset) /
@@ -335,6 +338,50 @@  void forward_gen_pmu_snapshots_work(struct work_struct *__work)
 	mutex_unlock(&dev_priv->dev->struct_mutex);
 }
 
+void i915_gen_pmu_stop_work_fn(struct work_struct *__work)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(__work, typeof(*dev_priv),
+			gen_pmu.work_event_stop);
+	struct perf_event *event = dev_priv->gen_pmu.exclusive_event;
+	struct drm_i915_ts_queue_header *hdr =
+		(struct drm_i915_ts_queue_header *)
+		dev_priv->gen_pmu.buffer.addr;
+	struct drm_i915_ts_node *first_node, *node;
+	int head, tail, num_nodes, ret;
+	struct drm_i915_gem_request *req;
+
+	first_node = (struct drm_i915_ts_node *)
+			((char *)hdr + hdr->data_offset);
+	num_nodes = (hdr->size_in_bytes - hdr->data_offset) /
+			sizeof(*node);
+
+
+	ret = i915_mutex_lock_interruptible(dev_priv->dev);
+	if (ret)
+		return;
+
+	i915_gen_pmu_wait_gpu(dev_priv);
+
+	/* Ensure that all requests are completed*/
+	tail = hdr->node_count;
+	head = dev_priv->gen_pmu.buffer.head;
+	while ((head % num_nodes) != (tail % num_nodes)) {
+		node = &first_node[head % num_nodes];
+		req = node->node_info.req;
+		if (req && !i915_gem_request_completed(req, true))
+			WARN_ON(1);
+		head++;
+	}
+
+	event->hw.state = PERF_HES_STOPPED;
+	dev_priv->gen_pmu.buffer.tail = 0;
+	dev_priv->gen_pmu.buffer.head = 0;
+
+	mutex_unlock(&dev_priv->dev->struct_mutex);
+	complete(&dev_priv->gen_pmu.complete);
+}
+
 static void gen_pmu_flush_snapshots(struct drm_i915_private *dev_priv)
 {
 	WARN_ON(!dev_priv->gen_pmu.buffer.addr);
@@ -562,6 +609,7 @@  static void i915_oa_event_destroy(struct perf_event *event)
 
 static void gen_buffer_destroy(struct drm_i915_private *i915)
 {
+	wait_for_completion(&i915->gen_pmu.complete);
 	mutex_lock(&i915->dev->struct_mutex);
 
 	vunmap(i915->gen_pmu.buffer.addr);
@@ -1409,7 +1457,7 @@  static void i915_gen_event_stop(struct perf_event *event, int flags)
 	hrtimer_cancel(&dev_priv->gen_pmu.timer);
 	gen_pmu_flush_snapshots(dev_priv);
 
-	event->hw.state = PERF_HES_STOPPED;
+	schedule_work(&dev_priv->gen_pmu.work_event_stop);
 }
 
 static int i915_gen_event_add(struct perf_event *event, int flags)
@@ -1595,6 +1643,9 @@  void i915_gen_pmu_register(struct drm_device *dev)
 	i915->gen_pmu.timer.function = hrtimer_sample_gen;
 
 	INIT_WORK(&i915->gen_pmu.work_timer, forward_gen_pmu_snapshots_work);
+	INIT_WORK(&i915->gen_pmu.work_event_stop, i915_gen_pmu_stop_work_fn);
+	init_completion(&i915->gen_pmu.complete);
+
 	spin_lock_init(&i915->gen_pmu.lock);
 
 	i915->gen_pmu.pmu.capabilities  = PERF_PMU_CAP_IS_DEVICE;
@@ -1625,6 +1676,7 @@  void i915_gen_pmu_unregister(struct drm_device *dev)
 		return;
 
 	cancel_work_sync(&i915->gen_pmu.work_timer);
+	cancel_work_sync(&i915->gen_pmu.work_event_stop);
 
 	perf_pmu_unregister(&i915->gen_pmu.pmu);
 	i915->gen_pmu.pmu.event_init = NULL;