drm/i915: Mark the removal of the i915_request from the sched.link
diff mbox series

Message ID 20200120175704.36340-1-chris@chris-wilson.co.uk
State New
Headers show
Series
  • drm/i915: Mark the removal of the i915_request from the sched.link
Related show

Commit Message

Chris Wilson Jan. 20, 2020, 5:57 p.m. UTC
Keep the rq->fence.flags consistent with the status of the
rq->sched.link, and clear the associated bits when decoupling the link
on retirement (as we may wish to inspect those flags independent of
other state).

Fixes: 32ff621fd744 ("drm/i915/gt: Allow temporary suspension of inflight requests")
References: https://gitlab.freedesktop.org/drm/intel/issues/997
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/i915_request.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Tvrtko Ursulin Jan. 20, 2020, 7:47 p.m. UTC | #1
On 20/01/2020 17:57, Chris Wilson wrote:
> Keep the rq->fence.flags consistent with the status of the
> rq->sched.link, and clear the associated bits when decoupling the link
> on retirement (as we may wish to inspect those flags independent of
> other state).
> 
> Fixes: 32ff621fd744 ("drm/i915/gt: Allow temporary suspension of inflight requests")
> References: https://gitlab.freedesktop.org/drm/intel/issues/997
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_request.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index 9ed0d3bc7249..78a5f5d3c070 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -221,6 +221,8 @@ static void remove_from_engine(struct i915_request *rq)
>   		locked = engine;
>   	}
>   	list_del_init(&rq->sched.link);
> +	clear_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);

This one I think can not be set in retirement. Or there is a path?

[comes back after writing the comment below]

Race between completion to hold puts the request on hold, then request 
completes just as it is un-held? It needs retire to happen at the right 
time, driven by ...? Is this it?

> +	clear_bit(I915_FENCE_FLAG_HOLD, &rq->fence.flags);

This one I think indeed can race with completion.

Regards,

Tvrtko

>   	spin_unlock_irq(&locked->active.lock);
>   }
>   
>
Chris Wilson Jan. 20, 2020, 8:27 p.m. UTC | #2
Quoting Tvrtko Ursulin (2020-01-20 19:47:08)
> 
> On 20/01/2020 17:57, Chris Wilson wrote:
> > Keep the rq->fence.flags consistent with the status of the
> > rq->sched.link, and clear the associated bits when decoupling the link
> > on retirement (as we may wish to inspect those flags independent of
> > other state).
> > 
> > Fixes: 32ff621fd744 ("drm/i915/gt: Allow temporary suspension of inflight requests")
> > References: https://gitlab.freedesktop.org/drm/intel/issues/997
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > ---
> >   drivers/gpu/drm/i915/i915_request.c | 2 ++
> >   1 file changed, 2 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> > index 9ed0d3bc7249..78a5f5d3c070 100644
> > --- a/drivers/gpu/drm/i915/i915_request.c
> > +++ b/drivers/gpu/drm/i915/i915_request.c
> > @@ -221,6 +221,8 @@ static void remove_from_engine(struct i915_request *rq)
> >               locked = engine;
> >       }
> >       list_del_init(&rq->sched.link);
> > +     clear_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
> 
> This one I think can not be set in retirement. Or there is a path?

No, I don't think there's one for pqueue, it was just being consistent.
> 
> [comes back after writing the comment below]
> 
> Race between completion to hold puts the request on hold, then request 
> completes just as it is un-held? It needs retire to happen at the right 
> time, driven by ...? Is this it?

Yeah, but the clear one I was thinking about is 

static bool hold_request(const struct i915_request *rq)
{
        struct i915_dependency *p;

        /*
         * If one of our ancestors is on hold, we must also be on hold,
         * otherwise we will bypass it and execute before it.
         */
        list_for_each_entry(p, &rq->sched.signalers_list, signal_link) {
                const struct i915_request *s =
                        container_of(p->signaler, typeof(*s), sched);

                if (s->engine != rq->engine)
                        continue;

                if (i915_request_on_hold(s))
                        return true;
        }

        return false;
}

where we check the rq->fence.flags which holds stale information.

> 
> > +     clear_bit(I915_FENCE_FLAG_HOLD, &rq->fence.flags);
> 
> This one I think indeed can race with completion.

Clear both for consistency, caught out once, may be caught out again on
the other.
-Chris
Chris Wilson Jan. 20, 2020, 8:32 p.m. UTC | #3
Quoting Tvrtko Ursulin (2020-01-20 19:47:08)
> 
> On 20/01/2020 17:57, Chris Wilson wrote:
> > Keep the rq->fence.flags consistent with the status of the
> > rq->sched.link, and clear the associated bits when decoupling the link
> > on retirement (as we may wish to inspect those flags independent of
> > other state).
> > 
> > Fixes: 32ff621fd744 ("drm/i915/gt: Allow temporary suspension of inflight requests")
> > References: https://gitlab.freedesktop.org/drm/intel/issues/997
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > ---
> >   drivers/gpu/drm/i915/i915_request.c | 2 ++
> >   1 file changed, 2 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> > index 9ed0d3bc7249..78a5f5d3c070 100644
> > --- a/drivers/gpu/drm/i915/i915_request.c
> > +++ b/drivers/gpu/drm/i915/i915_request.c
> > @@ -221,6 +221,8 @@ static void remove_from_engine(struct i915_request *rq)
> >               locked = engine;
> >       }
> >       list_del_init(&rq->sched.link);
> > +     clear_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
> 
> This one I think can not be set in retirement. Or there is a path?
> 
> [comes back after writing the comment below]
> 
> Race between completion to hold puts the request on hold, then request 
> completes just as it is un-held? It needs retire to happen at the right 
> time, driven by ...? Is this it?
> 
> > +     clear_bit(I915_FENCE_FLAG_HOLD, &rq->fence.flags);
> 
> This one I think indeed can race with completion.

Fwiw, this appears to be the active part of the trace

<0>[   56.132334]   <idle>-0       1.Ns1 52042407us : execlists_reset_finish: 0000:00:02.0 vcs1: depth->1
<0>[   56.132415] rs:main -428     0..s. 52042429us : process_csb: 0000:00:02.0 vcs1: cs-irq head=5, tail=1
<0>[   56.132501] rs:main -428     0..s. 52042432us : process_csb: 0000:00:02.0 vcs1: csb[0]: status=0x00000001:0x00000000
<0>[   56.132591] rs:main -428     0..s. 52042437us : trace_ports: 0000:00:02.0 vcs1: promote { b:6!, 0:0 }
<0>[   56.132676] rs:main -428     0..s. 52042442us : process_csb: 0000:00:02.0 vcs1: csb[1]: status=0x00000818:0x00000020
<0>[   56.132766] rs:main -428     0..s. 52042445us : trace_ports: 0000:00:02.0 vcs1: completed { b:6!, 0:0 }
<0>[   56.132860] kworker/-12      0.... 52042771us : i915_request_retire: 0000:00:02.0 vcs1: fence b:6, current 6
<0>[   56.132949] kworker/-65      1d..1 52044886us : execlists_unhold: 0000:00:02.0 vcs0: fence 27:2, current 2 hold release
<0>[   56.133041] ksoftirq-16      1..s. 52044912us : process_csb: 0000:00:02.0 vcs0: cs-irq head=2, tail=2
<0>[   56.133118] ksoftirq-16      1d.s1 52044916us : __i915_request_submit: 0000:00:02.0 vcs0: fence 27:2, current 2
<0>[   56.133216] kworker/-65      1.... 52044946us : i915_request_retire: 0000:00:02.0 vcs0: fence 9:14, current 14
<0>[   56.133314] kworker/-65      1d... 52044986us : __i915_request_commit: 0000:00:02.0 vcs0: fence 9:16, current 14
<0>[   56.133402] kworker/-65      1d... 52044993us : __engine_park: 0000:00:02.0 vcs0:
<0>[   56.133490] kworker/-65      1d..2 52045032us : __i915_request_submit: 0000:00:02.0 vcs0: fence 9:16, current 14
<0>[   56.133581] kworker/-65      1d..2 52045892us : trace_ports: 0000:00:02.0 vcs0: submit { 9:16, 0:0 }
<0>[   56.133664]   <idle>-0       0dNH2 52045932us : __intel_context_retire: 0000:00:02.0 vcs0: context:27 retire
<0>[   56.133751]   <idle>-0       0.Ns1 52045949us : process_csb: 0000:00:02.0 vcs0: cs-irq head=2, tail=4
<0>[   56.133838]   <idle>-0       0.Ns1 52045950us : process_csb: 0000:00:02.0 vcs0: csb[3]: status=0x00000001:0x00000000
<0>[   56.133927]   <idle>-0       0.Ns1 52045951us : trace_ports: 0000:00:02.0 vcs0: promote { 9:16!, 0:0 }
<0>[   56.134013]   <idle>-0       0.Ns1 52045951us : process_csb: 0000:00:02.0 vcs0: csb[4]: status=0x00000818:0x00000060
<0>[   56.134102]   <idle>-0       0.Ns1 52045952us : trace_ports: 0000:00:02.0 vcs0: completed { 9:16!, 0:0 }
<0>[   56.134198] kworker/-12      0.... 52045960us : i915_request_retire: 0000:00:02.0 vcs0: fence 9:16, current 16
<0>[   56.134285] kworker/-12      0.... 52045962us : __engine_park: 0000:00:02.0 vcs0:
<0>[   56.134361] kworker/-65      1d..1 52046503us : execlists_unhold: 0000:00:02.0 vcs1: fence 2a:2, current 1 hold release
<0>[   56.134427] ksoftirq-16      1..s. 52046510us : process_csb: 0000:00:02.0 vcs1: cs-irq head=1, tail=1

It doesn't look like there's overlap between the hold and retire. :|
-Chris
Tvrtko Ursulin Jan. 21, 2020, 11:11 a.m. UTC | #4
On 20/01/2020 20:32, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2020-01-20 19:47:08)
>>
>> On 20/01/2020 17:57, Chris Wilson wrote:
>>> Keep the rq->fence.flags consistent with the status of the
>>> rq->sched.link, and clear the associated bits when decoupling the link
>>> on retirement (as we may wish to inspect those flags independent of
>>> other state).
>>>
>>> Fixes: 32ff621fd744 ("drm/i915/gt: Allow temporary suspension of inflight requests")
>>> References: https://gitlab.freedesktop.org/drm/intel/issues/997
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/i915_request.c | 2 ++
>>>    1 file changed, 2 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
>>> index 9ed0d3bc7249..78a5f5d3c070 100644
>>> --- a/drivers/gpu/drm/i915/i915_request.c
>>> +++ b/drivers/gpu/drm/i915/i915_request.c
>>> @@ -221,6 +221,8 @@ static void remove_from_engine(struct i915_request *rq)
>>>                locked = engine;
>>>        }
>>>        list_del_init(&rq->sched.link);
>>> +     clear_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
>>
>> This one I think can not be set in retirement. Or there is a path?
>>
>> [comes back after writing the comment below]
>>
>> Race between completion to hold puts the request on hold, then request
>> completes just as it is un-held? It needs retire to happen at the right
>> time, driven by ...? Is this it?
>>
>>> +     clear_bit(I915_FENCE_FLAG_HOLD, &rq->fence.flags);
>>
>> This one I think indeed can race with completion.
> 
> Fwiw, this appears to be the active part of the trace
> 
> <0>[   56.132334]   <idle>-0       1.Ns1 52042407us : execlists_reset_finish: 0000:00:02.0 vcs1: depth->1
> <0>[   56.132415] rs:main -428     0..s. 52042429us : process_csb: 0000:00:02.0 vcs1: cs-irq head=5, tail=1
> <0>[   56.132501] rs:main -428     0..s. 52042432us : process_csb: 0000:00:02.0 vcs1: csb[0]: status=0x00000001:0x00000000
> <0>[   56.132591] rs:main -428     0..s. 52042437us : trace_ports: 0000:00:02.0 vcs1: promote { b:6!, 0:0 }
> <0>[   56.132676] rs:main -428     0..s. 52042442us : process_csb: 0000:00:02.0 vcs1: csb[1]: status=0x00000818:0x00000020
> <0>[   56.132766] rs:main -428     0..s. 52042445us : trace_ports: 0000:00:02.0 vcs1: completed { b:6!, 0:0 }
> <0>[   56.132860] kworker/-12      0.... 52042771us : i915_request_retire: 0000:00:02.0 vcs1: fence b:6, current 6
> <0>[   56.132949] kworker/-65      1d..1 52044886us : execlists_unhold: 0000:00:02.0 vcs0: fence 27:2, current 2 hold release
> <0>[   56.133041] ksoftirq-16      1..s. 52044912us : process_csb: 0000:00:02.0 vcs0: cs-irq head=2, tail=2
> <0>[   56.133118] ksoftirq-16      1d.s1 52044916us : __i915_request_submit: 0000:00:02.0 vcs0: fence 27:2, current 2
> <0>[   56.133216] kworker/-65      1.... 52044946us : i915_request_retire: 0000:00:02.0 vcs0: fence 9:14, current 14
> <0>[   56.133314] kworker/-65      1d... 52044986us : __i915_request_commit: 0000:00:02.0 vcs0: fence 9:16, current 14
> <0>[   56.133402] kworker/-65      1d... 52044993us : __engine_park: 0000:00:02.0 vcs0:
> <0>[   56.133490] kworker/-65      1d..2 52045032us : __i915_request_submit: 0000:00:02.0 vcs0: fence 9:16, current 14
> <0>[   56.133581] kworker/-65      1d..2 52045892us : trace_ports: 0000:00:02.0 vcs0: submit { 9:16, 0:0 }
> <0>[   56.133664]   <idle>-0       0dNH2 52045932us : __intel_context_retire: 0000:00:02.0 vcs0: context:27 retire
> <0>[   56.133751]   <idle>-0       0.Ns1 52045949us : process_csb: 0000:00:02.0 vcs0: cs-irq head=2, tail=4
> <0>[   56.133838]   <idle>-0       0.Ns1 52045950us : process_csb: 0000:00:02.0 vcs0: csb[3]: status=0x00000001:0x00000000
> <0>[   56.133927]   <idle>-0       0.Ns1 52045951us : trace_ports: 0000:00:02.0 vcs0: promote { 9:16!, 0:0 }
> <0>[   56.134013]   <idle>-0       0.Ns1 52045951us : process_csb: 0000:00:02.0 vcs0: csb[4]: status=0x00000818:0x00000060
> <0>[   56.134102]   <idle>-0       0.Ns1 52045952us : trace_ports: 0000:00:02.0 vcs0: completed { 9:16!, 0:0 }
> <0>[   56.134198] kworker/-12      0.... 52045960us : i915_request_retire: 0000:00:02.0 vcs0: fence 9:16, current 16
> <0>[   56.134285] kworker/-12      0.... 52045962us : __engine_park: 0000:00:02.0 vcs0:
> <0>[   56.134361] kworker/-65      1d..1 52046503us : execlists_unhold: 0000:00:02.0 vcs1: fence 2a:2, current 1 hold release
> <0>[   56.134427] ksoftirq-16      1..s. 52046510us : process_csb: 0000:00:02.0 vcs1: cs-irq head=1, tail=1
> 
> It doesn't look like there's overlap between the hold and retire. :|

Which bit is the race? I coudln't spot the same request being mentioned 
on two separate paths.

I mean I have no issues with the patch. Just trying to understand the 
possible races and whether or not there should be an assert for PQUEUE 
instead of clearing it on retire.

Regards,

Tvrtko
Chris Wilson Jan. 21, 2020, 11:15 a.m. UTC | #5
Quoting Tvrtko Ursulin (2020-01-21 11:11:27)
> 
> On 20/01/2020 20:32, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2020-01-20 19:47:08)
> >>
> >> On 20/01/2020 17:57, Chris Wilson wrote:
> >>> Keep the rq->fence.flags consistent with the status of the
> >>> rq->sched.link, and clear the associated bits when decoupling the link
> >>> on retirement (as we may wish to inspect those flags independent of
> >>> other state).
> >>>
> >>> Fixes: 32ff621fd744 ("drm/i915/gt: Allow temporary suspension of inflight requests")
> >>> References: https://gitlab.freedesktop.org/drm/intel/issues/997
> >>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>> ---
> >>>    drivers/gpu/drm/i915/i915_request.c | 2 ++
> >>>    1 file changed, 2 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> >>> index 9ed0d3bc7249..78a5f5d3c070 100644
> >>> --- a/drivers/gpu/drm/i915/i915_request.c
> >>> +++ b/drivers/gpu/drm/i915/i915_request.c
> >>> @@ -221,6 +221,8 @@ static void remove_from_engine(struct i915_request *rq)
> >>>                locked = engine;
> >>>        }
> >>>        list_del_init(&rq->sched.link);
> >>> +     clear_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
> >>
> >> This one I think can not be set in retirement. Or there is a path?
> >>
> >> [comes back after writing the comment below]
> >>
> >> Race between completion to hold puts the request on hold, then request
> >> completes just as it is un-held? It needs retire to happen at the right
> >> time, driven by ...? Is this it?
> >>
> >>> +     clear_bit(I915_FENCE_FLAG_HOLD, &rq->fence.flags);
> >>
> >> This one I think indeed can race with completion.
> > 
> > Fwiw, this appears to be the active part of the trace
> > 
> > <0>[   56.132334]   <idle>-0       1.Ns1 52042407us : execlists_reset_finish: 0000:00:02.0 vcs1: depth->1
> > <0>[   56.132415] rs:main -428     0..s. 52042429us : process_csb: 0000:00:02.0 vcs1: cs-irq head=5, tail=1
> > <0>[   56.132501] rs:main -428     0..s. 52042432us : process_csb: 0000:00:02.0 vcs1: csb[0]: status=0x00000001:0x00000000
> > <0>[   56.132591] rs:main -428     0..s. 52042437us : trace_ports: 0000:00:02.0 vcs1: promote { b:6!, 0:0 }
> > <0>[   56.132676] rs:main -428     0..s. 52042442us : process_csb: 0000:00:02.0 vcs1: csb[1]: status=0x00000818:0x00000020
> > <0>[   56.132766] rs:main -428     0..s. 52042445us : trace_ports: 0000:00:02.0 vcs1: completed { b:6!, 0:0 }
> > <0>[   56.132860] kworker/-12      0.... 52042771us : i915_request_retire: 0000:00:02.0 vcs1: fence b:6, current 6
> > <0>[   56.132949] kworker/-65      1d..1 52044886us : execlists_unhold: 0000:00:02.0 vcs0: fence 27:2, current 2 hold release
> > <0>[   56.133041] ksoftirq-16      1..s. 52044912us : process_csb: 0000:00:02.0 vcs0: cs-irq head=2, tail=2
> > <0>[   56.133118] ksoftirq-16      1d.s1 52044916us : __i915_request_submit: 0000:00:02.0 vcs0: fence 27:2, current 2
> > <0>[   56.133216] kworker/-65      1.... 52044946us : i915_request_retire: 0000:00:02.0 vcs0: fence 9:14, current 14
> > <0>[   56.133314] kworker/-65      1d... 52044986us : __i915_request_commit: 0000:00:02.0 vcs0: fence 9:16, current 14
> > <0>[   56.133402] kworker/-65      1d... 52044993us : __engine_park: 0000:00:02.0 vcs0:
> > <0>[   56.133490] kworker/-65      1d..2 52045032us : __i915_request_submit: 0000:00:02.0 vcs0: fence 9:16, current 14
> > <0>[   56.133581] kworker/-65      1d..2 52045892us : trace_ports: 0000:00:02.0 vcs0: submit { 9:16, 0:0 }
> > <0>[   56.133664]   <idle>-0       0dNH2 52045932us : __intel_context_retire: 0000:00:02.0 vcs0: context:27 retire
> > <0>[   56.133751]   <idle>-0       0.Ns1 52045949us : process_csb: 0000:00:02.0 vcs0: cs-irq head=2, tail=4
> > <0>[   56.133838]   <idle>-0       0.Ns1 52045950us : process_csb: 0000:00:02.0 vcs0: csb[3]: status=0x00000001:0x00000000
> > <0>[   56.133927]   <idle>-0       0.Ns1 52045951us : trace_ports: 0000:00:02.0 vcs0: promote { 9:16!, 0:0 }
> > <0>[   56.134013]   <idle>-0       0.Ns1 52045951us : process_csb: 0000:00:02.0 vcs0: csb[4]: status=0x00000818:0x00000060
> > <0>[   56.134102]   <idle>-0       0.Ns1 52045952us : trace_ports: 0000:00:02.0 vcs0: completed { 9:16!, 0:0 }
> > <0>[   56.134198] kworker/-12      0.... 52045960us : i915_request_retire: 0000:00:02.0 vcs0: fence 9:16, current 16
> > <0>[   56.134285] kworker/-12      0.... 52045962us : __engine_park: 0000:00:02.0 vcs0:
> > <0>[   56.134361] kworker/-65      1d..1 52046503us : execlists_unhold: 0000:00:02.0 vcs1: fence 2a:2, current 1 hold release
> > <0>[   56.134427] ksoftirq-16      1..s. 52046510us : process_csb: 0000:00:02.0 vcs1: cs-irq head=1, tail=1
> > 
> > It doesn't look like there's overlap between the hold and retire. :|
> 
> Which bit is the race? I coudln't spot the same request being mentioned 
> on two separate paths.

Neither could I. I don't think this patch is actually associated with
the bug, but I think the timing of the GEM_BUG_ON implicates the
execlists_hold() patch.

> I mean I have no issues with the patch. Just trying to understand the 
> possible races and whether or not there should be an assert for PQUEUE 
> instead of clearing it on retire.

It can be on the pqueue during retirement, we're just a bit more careful
during dequeuing.
-Chris
Tvrtko Ursulin Jan. 21, 2020, 5:33 p.m. UTC | #6
On 20/01/2020 20:27, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2020-01-20 19:47:08)
>>
>> On 20/01/2020 17:57, Chris Wilson wrote:
>>> Keep the rq->fence.flags consistent with the status of the
>>> rq->sched.link, and clear the associated bits when decoupling the link
>>> on retirement (as we may wish to inspect those flags independent of
>>> other state).
>>>
>>> Fixes: 32ff621fd744 ("drm/i915/gt: Allow temporary suspension of inflight requests")
>>> References: https://gitlab.freedesktop.org/drm/intel/issues/997
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/i915_request.c | 2 ++
>>>    1 file changed, 2 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
>>> index 9ed0d3bc7249..78a5f5d3c070 100644
>>> --- a/drivers/gpu/drm/i915/i915_request.c
>>> +++ b/drivers/gpu/drm/i915/i915_request.c
>>> @@ -221,6 +221,8 @@ static void remove_from_engine(struct i915_request *rq)
>>>                locked = engine;
>>>        }
>>>        list_del_init(&rq->sched.link);
>>> +     clear_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
>>
>> This one I think can not be set in retirement. Or there is a path?
> 
> No, I don't think there's one for pqueue, it was just being consistent.
>>
>> [comes back after writing the comment below]
>>
>> Race between completion to hold puts the request on hold, then request
>> completes just as it is un-held? It needs retire to happen at the right
>> time, driven by ...? Is this it?
> 
> Yeah, but the clear one I was thinking about is
> 
> static bool hold_request(const struct i915_request *rq)
> {
>          struct i915_dependency *p;
> 
>          /*
>           * If one of our ancestors is on hold, we must also be on hold,
>           * otherwise we will bypass it and execute before it.
>           */
>          list_for_each_entry(p, &rq->sched.signalers_list, signal_link) {
>                  const struct i915_request *s =
>                          container_of(p->signaler, typeof(*s), sched);
> 
>                  if (s->engine != rq->engine)
>                          continue;
> 
>                  if (i915_request_on_hold(s))
>                          return true;
>          }
> 
>          return false;
> }
> 
> where we check the rq->fence.flags which holds stale information.

i915_request_on_hold? Why would that hold stale information?
 
>>
>>> +     clear_bit(I915_FENCE_FLAG_HOLD, &rq->fence.flags);
>>
>> This one I think indeed can race with completion.
> 
> Clear both for consistency, caught out once, may be caught out again on
> the other.

From the bug we have this:

<0>[   61.788268] gem_exec-2120    3.... 56070451us : i915_sched_node_reinit: i915_sched_node_reinit:400 GEM_BUG_ON(!list_empty(&node->link))
...
2>[   61.793703] kernel BUG at drivers/gpu/drm/i915/i915_scheduler.c:400!
<4>[   61.793736] invalid opcode: 0000 [#1] PREEMPT SMP PTI
<4>[   61.793744] CPU: 0 PID: 2120 Comm: gem_exec_fence Tainted: G     U  W         5.5.0-rc6-CI-CI_DRM_7762+ #1
<4>[   61.793755] Hardware name:  /NUC7i5BNB, BIOS BNKBL357.86A.0054.2017.1025.1822 10/25/2017
<4>[   61.793822] RIP: 0010:i915_sched_node_reinit+0x14a/0x150 [i915]
<4>[   61.793831] Code: 00 48 c7 c2 10 27 79 a0 48 c7 c7 1f c2 69 a0 e8 0c 37 b0 e0 bf 01 00 00 00 e8 22 09 b0 e0 31 f6 bf 09 00 00 00 e8 86 48 a1 e0 <0f> 0b 0f 1f 40 00 48 8d 47 10 48 89 3f 48 89 7f 08 48 89 47 10 48
<4>[   61.793850] RSP: 0018:ffffc900020d3a58 EFLAGS: 00010296
<4>[   61.793857] RAX: ffff888261bed140 RBX: ffff88825c429140 RCX: 0000000000000006
<4>[   61.793865] RDX: 00000000000018dd RSI: 0000000000000000 RDI: 0000000000000009
<4>[   61.793873] RBP: ffff888272f58440 R08: 0000000000000000 R09: 0000000000000000
<4>[   61.793881] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88825c4292e8
<4>[   61.793889] R13: 0000000000000000 R14: ffffc900020d3dc0 R15: ffff888256a50068
<4>[   61.793898] FS:  00007f321ab41e40(0000) GS:ffff888276c00000(0000) knlGS:0000000000000000
<4>[   61.793907] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[   61.793914] CR2: 00007f275c1470f8 CR3: 0000000260a28002 CR4: 00000000003606f0
<4>[   61.793922] Call Trace:
<4>[   61.793979]  __i915_request_create+0x1c6/0x5a0 [i915]
<4>[   61.794034]  i915_request_create+0x86/0x1c0 [i915]
<4>[   61.794085]  i915_gem_do_execbuffer+0xbf2/0x2500 [i915]

This would mean in use request returned to the RCU slab cache, right? Do we expect to keep any rq.fence->flags across rq recycling? Perhaps we need dma_fence_reinit or something with the asserts similar to what you added in:

commit 67a3acaab7167157fb827595019eaf55df244824
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Nov 22 09:49:24 2019 +0000

    drm/i915: Use a ctor for TYPESAFE_BY_RCU i915_request

And I reviewed, although of course the substance mostly evaporated from my head by now.

Question seems to be how an on list request could have been returned to the slab. Does hold needs to have a reference if it races with parallel retire?

Regards,

Tvrtko

Patch
diff mbox series

diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 9ed0d3bc7249..78a5f5d3c070 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -221,6 +221,8 @@  static void remove_from_engine(struct i915_request *rq)
 		locked = engine;
 	}
 	list_del_init(&rq->sched.link);
+	clear_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
+	clear_bit(I915_FENCE_FLAG_HOLD, &rq->fence.flags);
 	spin_unlock_irq(&locked->active.lock);
 }