[3/4] drm/i915/gt: Perform an arbitration check before busywaiting

Message ID 20210111105735.21515-3-chris@chris-wilson.co.uk (mailing list archive)
State New, archived
Series [1/4] drm/i915/gt: Disable arbitration around Braswell's pdp updates

Commit Message

Chris Wilson Jan. 11, 2021, 10:57 a.m. UTC
During igt_reset_nop_engine, it was observed that an unexpected failed
engine reset led to us busywaiting on the stop-ring semaphore (set
during the reset preparations) on the first request afterwards. There was
no explicit MI_ARB_CHECK in this sequence as the presumption was that
the failed MI_SEMAPHORE_WAIT would itself act as an arbitration point.
It did not in this circumstance, so force it.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/gen8_engine_cs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
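
For context (an editorial sketch, not part of the patch): the "stop-ring
semaphore" is the HWS_PREEMPT dword in the engine's status page. Reset
preparation arms it with ring_set_paused(engine, 1), and the driver expects
it to be cleared again when the CS IDLE->ACTIVE (promotion) event is
processed. The helpers below are simplified from the i915 code of this era
and may not match the tree exactly:

static void ring_set_paused(const struct intel_engine_cs *engine, int state)
{
	/*
	 * The fini breadcrumb polls HWS_PREEMPT with MI_SEMAPHORE_WAIT:
	 * while this dword is nonzero, the ring is paused as the semaphore
	 * busywaits until the dword reads back as zero.
	 */
	engine->status_page.addr[I915_GEM_HWS_PREEMPT] = state;
	if (state)
		wmb();	/* make the pause visible before stopping the ring */
}

/* The GGTT address of that dword, as polled by the busywait. */
static u32 preempt_address(struct intel_engine_cs *engine)
{
	return (i915_ggtt_offset(engine->status_page.vma) +
		I915_GEM_HWS_PREEMPT_ADDR);
}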

Comments

Tvrtko Ursulin Jan. 11, 2021, 4:19 p.m. UTC | #1
On 11/01/2021 10:57, Chris Wilson wrote:
> During igt_reset_nop_engine, it was observed that an unexpected failed
> engine reset led to us busywaiting on the stop-ring semaphore (set
> during the reset preparations) on the first request afterwards. There was
> no explicit MI_ARB_CHECK in this sequence as the presumption was that
> the failed MI_SEMAPHORE_WAIT would itself act as an arbitration point.
> It did not in this circumstance, so force it.

In other words MI_SEMAPHORE_POLL is not a preemption point? Can't 
remember if I knew that or not..

1)
Why not the same handling in !gen12 version?

2)
Failed reset leads to a busy-hang in the following request _tail_? But there 
is an arb check at the start of the following request as well. Or in cases 
where we context switch into the middle of a previously executing request?

But why would that busy hang? Hasn't the failed request unpaused the ring?

Regards,

Tvrtko

> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>   drivers/gpu/drm/i915/gt/gen8_engine_cs.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/gen8_engine_cs.c b/drivers/gpu/drm/i915/gt/gen8_engine_cs.c
> index 9a182652a35e..e9ac281644cc 100644
> --- a/drivers/gpu/drm/i915/gt/gen8_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/gen8_engine_cs.c
> @@ -567,6 +567,7 @@ u32 *gen11_emit_fini_breadcrumb_rcs(struct i915_request *rq, u32 *cs)
>   
>   static u32 *gen12_emit_preempt_busywait(struct i915_request *rq, u32 *cs)
>   {
> +	*cs++ = MI_ARB_CHECK;
>   	*cs++ = MI_SEMAPHORE_WAIT_TOKEN |
>   		MI_SEMAPHORE_GLOBAL_GTT |
>   		MI_SEMAPHORE_POLL |
> @@ -575,7 +576,6 @@ static u32 *gen12_emit_preempt_busywait(struct i915_request *rq, u32 *cs)
>   	*cs++ = preempt_address(rq->engine);
>   	*cs++ = 0;
>   	*cs++ = 0;
> -	*cs++ = MI_NOOP;
>   
>   	return cs;
>   }
>
Chris Wilson Jan. 11, 2021, 4:27 p.m. UTC | #2
Quoting Tvrtko Ursulin (2021-01-11 16:19:40)
> 
> On 11/01/2021 10:57, Chris Wilson wrote:
> > During igt_reset_nop_engine, it was observed that an unexpected failed
> > engine reset led to us busywaiting on the stop-ring semaphore (set
> > during the reset preparations) on the first request afterwards. There was
> > no explicit MI_ARB_CHECK in this sequence as the presumption was that
> > the failed MI_SEMAPHORE_WAIT would itself act as an arbitration point.
> > It did not in this circumstance, so force it.
> 
> In other words MI_SEMAPHORE_POLL is not a preemption point? Can't 
> remember if I knew that or not..

MI_SEMAPHORE_WAIT | POLL is most definitely a preemption point on a
miss.

> 1)
> Why not the same handling in !gen12 version?

Because I think it's a bug in tgl [a0 at least]. I think I've seen the
same symptoms on tgl before, but not on earlier gens. This is the first time the
sequence clicked as to why it was busy spinning. Random engine reset
failures are rare enough -- I was meant to also write a test case to
inject failure.
 
> 2)
> Failed reset leads to a busy-hang in the following request _tail_? But there 
> is an arb check at the start of the following request as well. Or in cases 
> where we context switch into the middle of a previously executing request?

It was the first request submitted after the failed reset. We expect to
clear the ring-stop flag on the CS IDLE->ACTIVE event.

> But why would that busy hang? Hasn't the failed request unpaused the ring?

The engine was idle at the time of the failed reset. We left the
ring-stop set, and submitted the next batch of requests. We hit the
MI_SEMAPHORE_WAIT(ring-stop) at the end of the first request, but
without hitting an arbitration point (first request, no init-breadcrumb
in this case), the semaphore was stuck.
-Chris
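
To picture the stuck state: in the pre-patch emitter (reproduced from the
code this patch touches, with editorial annotations), nothing ahead of the
poll is an explicit arbitration point, and in this failure the semaphore
miss itself did not act as one, so a request without an initial breadcrumb
reaches the poll without ever giving the CS a chance to service the pending
ELSP write:

static u32 *gen12_emit_preempt_busywait(struct i915_request *rq, u32 *cs)
{
	*cs++ = MI_SEMAPHORE_WAIT_TOKEN |	/* no arbitration point before this */
		MI_SEMAPHORE_GLOBAL_GTT |
		MI_SEMAPHORE_POLL |
		MI_SEMAPHORE_SAD_EQ_SDD;
	*cs++ = 0;				/* release only when HWS_PREEMPT == 0 */
	*cs++ = preempt_address(rq->engine);	/* ring-stop dword in the GGTT */
	*cs++ = 0;
	*cs++ = 0;
	*cs++ = MI_NOOP;			/* pad to an even number of dwords */

	return cs;
}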
Tvrtko Ursulin Jan. 11, 2021, 5:12 p.m. UTC | #3
On 11/01/2021 16:27, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2021-01-11 16:19:40)
>>
>> On 11/01/2021 10:57, Chris Wilson wrote:
>>> During igt_reset_nop_engine, it was observed that an unexpected failed
>>> engine reset led to us busywaiting on the stop-ring semaphore (set
>>> during the reset preparations) on the first request afterwards. There was
>>> no explicit MI_ARB_CHECK in this sequence as the presumption was that
>>> the failed MI_SEMAPHORE_WAIT would itself act as an arbitration point.
>>> It did not in this circumstance, so force it.
>>
>> In other words MI_SEMAPHORE_POLL is not a preemption point? Can't
>> remember if I knew that or not..
> 
> MI_SEMAPHORE_WAIT | POLL is most definitely a preemption point on a
> miss.
> 
>> 1)
>> Why not the same handling in !gen12 version?
> 
> Because I think it's a bug in tgl [a0 at least]. I think I've seen the
> same symptoms on tgl before, but not on earlier gens. This is the first time the
> sequence clicked as to why it was busy spinning. Random engine reset
> failures are rare enough -- I was meant to also write a test case to
> inject failure.

Random engine reset failure you think is a TGL issue?

>   
>> 2)
>> Failed reset leads to a busy-hang in the following request _tail_? But there
>> is an arb check at the start of the following request as well. Or in cases
>> where we context switch into the middle of a previously executing request?
> 
> It was the first request submitted after the failed reset. We expect to
> clear the ring-stop flag on the CS IDLE->ACTIVE event.
> 
>> But why would that busy hang? Hasn't the failed request unpaused the ring?
> 
> The engine was idle at the time of the failed reset. We left the
> ring-stop set, and submitted the next batch of requests. We hit the
> MI_SEMAPHORE_WAIT(ring-stop) at the end of the first request, but
> without hitting an arbitration point (first request, no init-breadcrumb
> in this case), the semaphore was stuck.

So a kernel context request? Why hasn't IDLE->ACTIVE cleared ring stop? 
Presumably this CSB must come after the first request has been submitted 
so apparently I am still not getting how it hangs.

Just because igt_reset_nop_engine does things "quickly"? It prevents the 
CSB from arriving? So ARB_CHECK picks up on the fact ELSP has been 
rewritten before the IDLE->ACTIVE event is received and/or engine reset 
prevented it from arriving?

Regards,

Tvrtko
Chris Wilson Jan. 11, 2021, 9:54 p.m. UTC | #4
Quoting Tvrtko Ursulin (2021-01-11 17:12:57)
> 
> On 11/01/2021 16:27, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2021-01-11 16:19:40)
> >>
> >> On 11/01/2021 10:57, Chris Wilson wrote:
> >>> During igt_reset_nop_engine, it was observed that an unexpected failed
> >>> engine reset led to us busywaiting on the stop-ring semaphore (set
> >>> during the reset preparations) on the first request afterwards. There was
> >>> no explicit MI_ARB_CHECK in this sequence as the presumption was that
> >>> the failed MI_SEMAPHORE_WAIT would itself act as an arbitration point.
> >>> It did not in this circumstance, so force it.
> >>
> >> In other words MI_SEMAPHORE_POLL is not a preemption point? Can't
> >> remember if I knew that or not..
> > 
> > MI_SEMAPHORE_WAIT | POLL is most definitely a preemption point on a
> > miss.
> > 
> >> 1)
> >> Why not the same handling in !gen12 version?
> > 
> > Because I think it's a bug in tgl [a0 at least]. I think I've seen the
> > same symptoms on tgl before, but not on earlier gens. This is the first time the
> > sequence clicked as to why it was busy spinning. Random engine reset
> > failures are rare enough -- I was meant to also write a test case to
> > inject failure.
> 
> Random engine reset failure you think is a TGL issue?

The MI_SEMAPHORE_WAIT | POLL miss not generating an arbitration point.
We have quite a few selftests and IGT that use this feature.

So I was wondering if this was similar to one of those tgl issues with
semaphores and CS events.

The random engine reset failure here is also decidedly odd. The engine
was idle!

> >> 2)
> >> Failed reset leads to a busy-hang in the following request _tail_? But there
> >> is an arb check at the start of the following request as well. Or in cases
> >> where we context switch into the middle of a previously executing request?
> > 
> > It was the first request submitted after the failed reset. We expect to
> > clear the ring-stop flag on the CS IDLE->ACTIVE event.
> > 
> >> But why would that busy hang? Hasn't the failed request unpaused the ring?
> > 
> > The engine was idle at the time of the failed reset. We left the
> > ring-stop set, and submitted the next batch of requests. We hit the
> > MI_SEMAPHORE_WAIT(ring-stop) at the end of the first request, but
> > without hitting an arbitration point (first request, no init-breadcrumb
> > in this case), the semaphore was stuck.
> 
> So a kernel context request?

Ish. The selftest is using empty requests, and not emitting the
initial breadcrumb. (So acting like a kernel context.)

> Why hasn't IDLE->ACTIVE cleared ring stop? 

There hasn't been an idle->active event, not a single CS event after
writing to ELSP and timing out while still spinning on the semaphore.

> Presumably this CSB must come after the first request has been submitted 
> so apparently I am still not getting how it hangs.

It was never sent. The context is still in pending[0] (not active[0])
and there's no sign in the trace of any interrupts/tasklet handling other
than the semaphore-wait interrupt.
 
> Just because igt_reset_nop_engine does things "quickly"? It prevents the 
> CSB from arriving?

More that, since we do very little, we hit the semaphore before the CS 
has recovered from the shock of being asked to do something.

> So ARB_CHECK picks up on the fact ELSP has been 
> rewritten before the IDLE->ACTIVE event is received and/or engine reset 
> prevented it from arriving?

The ARB_CHECK should trigger the CS to generate the IDLE->ACTIVE event.
(Of course assuming that the bug is in the semaphore not triggering the
event due to strange circumstances and not a bug in the event generator
itself.) I'm suspicious of the semaphore due to the earlier CS bugs with
lite-restores + semaphores, and am expecting that since the MI_ARB_CHECK
is explicit, it actually works.
-Chris
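
For reference, the emitter as it looks with the patch applied (editorial
annotations): MI_ARB_CHECK takes the place of the trailing MI_NOOP, so the
emission stays at six dwords, presumably keeping the ring qword-aligned,
while guaranteeing an explicit arbitration point immediately ahead of the
busywait:

static u32 *gen12_emit_preempt_busywait(struct i915_request *rq, u32 *cs)
{
	*cs++ = MI_ARB_CHECK;			/* explicit arbitration point */
	*cs++ = MI_SEMAPHORE_WAIT_TOKEN |
		MI_SEMAPHORE_GLOBAL_GTT |
		MI_SEMAPHORE_POLL |
		MI_SEMAPHORE_SAD_EQ_SDD;
	*cs++ = 0;				/* release only when HWS_PREEMPT == 0 */
	*cs++ = preempt_address(rq->engine);
	*cs++ = 0;
	*cs++ = 0;

	return cs;
}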
Tvrtko Ursulin Jan. 12, 2021, 9:21 a.m. UTC | #5
On 11/01/2021 21:54, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2021-01-11 17:12:57)
>>
>> On 11/01/2021 16:27, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2021-01-11 16:19:40)
>>>>
>>>> On 11/01/2021 10:57, Chris Wilson wrote:
>>>>> During igt_reset_nop_engine, it was observed that an unexpected failed
>>>>> engine reset led to us busywaiting on the stop-ring semaphore (set
>>>>> during the reset preparations) on the first request afterwards. There was
>>>>> no explicit MI_ARB_CHECK in this sequence as the presumption was that
>>>>> the failed MI_SEMAPHORE_WAIT would itself act as an arbitration point.
>>>>> It did not in this circumstance, so force it.
>>>>
>>>> In other words MI_SEMAPHORE_POLL is not a preemption point? Can't
>>>> remember if I knew that or not..
>>>
>>> MI_SEMAPHORE_WAIT | POLL is most definitely a preemption point on a
>>> miss.
>>>
>>>> 1)
>>>> Why not the same handling in !gen12 version?
>>>
>>> Because I think it's a bug in tgl [a0 at least]. I think I've seen the
>>> same symptoms on tgl before, but not on earlier gens. This is the first time the
>>> sequence clicked as to why it was busy spinning. Random engine reset
>>> failures are rare enough -- I was meant to also write a test case to
>>> inject failure.
>>
>> Random engine reset failure you think is a TGL issue?
> 
> The MI_SEMAPHORE_WAIT | POLL miss not generating an arbitration point.
> We have quite a few selftests and IGT that use this feature.
> 
> So I was wondering if this was similar to one of those tgl issues with
> semaphores and CS events.
> 
> The random engine reset failure here is also decidedly odd. The engine
> was idle!
> 
>>>> 2)
>>>> Failed reset leads to a busy-hang in the following request _tail_? But there
>>>> is an arb check at the start of the following request as well. Or in cases
>>>> where we context switch into the middle of a previously executing request?
>>>
>>> It was the first request submitted after the failed reset. We expect to
>>> clear the ring-stop flag on the CS IDLE->ACTIVE event.
>>>
>>>> But why would that busy hang? Hasn't the failed request unpaused the ring?
>>>
>>> The engine was idle at the time of the failed reset. We left the
>>> ring-stop set, and submitted the next batch of requests. We hit the
>>> MI_SEMAPHORE_WAIT(ring-stop) at the end of the first request, but
>>> without hitting an arbitration point (first request, no init-breadcrumb
>>> in this case), the semaphore was stuck.
>>
>> So a kernel context request?
> 
> Ish. The selftest is using empty requests, and not emitting the
> initial breadcrumb. (So acting like a kernel context.)
> 
>> Why hasn't IDLE->ACTIVE cleared ring stop?
> 
> There hasn't been an idle->active event, not a single CS event after
> writing to ELSP and timing out while still spinning on the semaphore.
> 
>> Presumably this CSB must come after the first request has been submitted
>> so apparently I am still not getting how it hangs.
> 
> It was never sent. The context is still in pending[0] (not active[0])
> and there's no sign in the trace of any interrupts/tasklet handling other
> than the semaphore-wait interrupt.
>   
>> Just because igt_reset_nop_engine does things "quickly"? It prevents the
>> CSB from arriving?
> 
> More that, since we do very little, we hit the semaphore before the CS
> has recovered from the shock of being asked to do something.
> 
>> So ARB_CHECK picks up on the fact ELSP has been
>> rewritten before the IDLE->ACTIVE event is received and/or engine reset
>> prevented it from arriving?
> 
> The ARB_CHECK should trigger the CS to generate the IDLE->ACTIVE event.
> (Of course assuming that the bug is in the semaphore not triggering the
> event due to strange circumstances and not a bug in the event generator
> itself.) I'm suspicious of the semaphore due to the earlier CS bugs with
> lite-restores + semaphores, and am expecting that since the MI_ARB_CHECK
> is explicit, it actually works.

Okay, got it, thanks. I suggest it would be good to slightly improve the 
commit message so it is clear what the suspected TGL quirks are. But in 
general:

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko

Patch

diff --git a/drivers/gpu/drm/i915/gt/gen8_engine_cs.c b/drivers/gpu/drm/i915/gt/gen8_engine_cs.c
index 9a182652a35e..e9ac281644cc 100644
--- a/drivers/gpu/drm/i915/gt/gen8_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/gen8_engine_cs.c
@@ -567,6 +567,7 @@ u32 *gen11_emit_fini_breadcrumb_rcs(struct i915_request *rq, u32 *cs)
 
 static u32 *gen12_emit_preempt_busywait(struct i915_request *rq, u32 *cs)
 {
+	*cs++ = MI_ARB_CHECK;
 	*cs++ = MI_SEMAPHORE_WAIT_TOKEN |
 		MI_SEMAPHORE_GLOBAL_GTT |
 		MI_SEMAPHORE_POLL |
@@ -575,7 +576,6 @@ static u32 *gen12_emit_preempt_busywait(struct i915_request *rq, u32 *cs)
 	*cs++ = preempt_address(rq->engine);
 	*cs++ = 0;
 	*cs++ = 0;
-	*cs++ = MI_NOOP;
 
 	return cs;
 }