
drm/i915/execlists: Double check breadcrumb before crying foul

Message ID 20200330234318.30638-1-chris@chris-wilson.co.uk (mailing list archive)
State New, archived
Series drm/i915/execlists: Double check breadcrumb before crying foul

Commit Message

Chris Wilson March 30, 2020, 11:43 p.m. UTC
  process_csb: 0000:00:02.0 bcs0: cs-irq head=4, tail=5
  process_csb: 0000:00:02.0 bcs0: csb[5]: status=0x00008002:0x60000020
  trace_ports: 0000:00:02.0 bcs0: preempted { ff84:45154! prio 2 }
  trace_ports: 0000:00:02.0 bcs0: promote { ff84:45155* prio 2 }
  trace_ports: 0000:00:02.0 bcs0: submit { ff84:45156 prio 2 }

  process_csb: 0000:00:02.0 bcs0: cs-irq head=5, tail=6
  process_csb: 0000:00:02.0 bcs0: csb[6]: status=0x00000018:0x60000020
  trace_ports: 0000:00:02.0 bcs0: completed { ff84:45155* prio 2 }
  process_csb: 0000:00:02.0 bcs0: ring:{start:0x00178000, head:0928, tail:0928, ctl:00000000, mode:00000200}
  process_csb: 0000:00:02.0 bcs0: rq:{start:00178000, head:08b0, tail:08f0, seqno:ff84:45155, hwsp:45156},
  process_csb: 0000:00:02.0 bcs0: ctx:{start:00178000, head:e000928, tail:0928},
  process_csb: GEM_BUG_ON("context completed before request")

In this sequence, we can see that although we have submitted the next
request [ff84:45156] to HW (via ELSP[]) it has not yet reported the
lite-restore. Instead, we see the completion event of the currently
active request [ff84:45155] but at the time of processing that event,
the breadcrumb has not yet been written. Though by the time we do print
out the debug info, the seqno write of ff84:45156 has landed!

Therefore there is a serialisation problem between the seqno writes and
CS events, not just between the CS buffer and its head/tail pointers as
previously observed on Icelake.

This is not a huge problem, as we don't strictly rely on the breadcrumb
to determine HW activity, but it may indicate that interrupt delivery is
before the seqno write, aka bringing back the plague of missed
interrupts from yesteryear. However, there is no indication of this
wider problem, so let's just flush the seqno read before reporting an
error. If it persists after the fresh read we can worry again.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 29 ++++++++++++++++++++++-------
 1 file changed, 22 insertions(+), 7 deletions(-)
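
The shape of the fix, reduced to its essentials: on a "completed" CS event,
check the breadcrumb with a plain cached read; only if it looks stale, flush
and re-read before asserting. Below is a minimal stand-alone sketch of that
pattern, using placeholder names (seqno_passed(), flush_breadcrumb(),
check_completed()) rather than the real i915 helpers
(i915_request_completed(), mb()/clflush(), GEM_BUG_ON()):

/*
 * Minimal sketch of the "double check before crying foul" pattern.
 * Placeholder names only; not the i915 code itself.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Wrap-safe seqno comparison, in the spirit of i915_seqno_passed(). */
static bool seqno_passed(uint32_t hwsp_seqno, uint32_t seqno)
{
	return (int32_t)(hwsp_seqno - seqno) >= 0;
}

/* Stand-in for the mb(); clflush(); mb() sequence added by the patch. */
static void flush_breadcrumb(const volatile uint32_t *hwsp)
{
	(void)hwsp; /* clflush(hwsp) would go here in the driver */
	__atomic_thread_fence(__ATOMIC_SEQ_CST);
}

/* Called on a CSB "completed" event for the active request. */
static void check_completed(const volatile uint32_t *hwsp, uint32_t seqno)
{
	if (seqno_passed(*hwsp, seqno))
		return; /* breadcrumb already landed, nothing to report */

	/* Not there yet: flush and take a second look before crying foul. */
	flush_breadcrumb(hwsp);

	if (!seqno_passed(*hwsp, seqno))
		fprintf(stderr, "context completed before request\n");
}

int main(void)
{
	uint32_t hwsp = 45155;

	check_completed(&hwsp, 45155);	/* completed: silent */
	check_completed(&hwsp, 45156);	/* still pending after re-read: reported */
	return 0;
}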

Comments

Mika Kuoppala March 31, 2020, 7:50 a.m. UTC | #1
Chris Wilson <chris@chris-wilson.co.uk> writes:

>   process_csb: 0000:00:02.0 bcs0: cs-irq head=4, tail=5
>   process_csb: 0000:00:02.0 bcs0: csb[5]: status=0x00008002:0x60000020
>   trace_ports: 0000:00:02.0 bcs0: preempted { ff84:45154! prio 2 }
>   trace_ports: 0000:00:02.0 bcs0: promote { ff84:45155* prio 2 }
>   trace_ports: 0000:00:02.0 bcs0: submit { ff84:45156 prio 2 }
>
>   process_csb: 0000:00:02.0 bcs0: cs-irq head=5, tail=6
>   process_csb: 0000:00:02.0 bcs0: csb[6]: status=0x00000018:0x60000020
>   trace_ports: 0000:00:02.0 bcs0: completed { ff84:45155* prio 2 }
>   process_csb: 0000:00:02.0 bcs0: ring:{start:0x00178000, head:0928, tail:0928, ctl:00000000, mode:00000200}
>   process_csb: 0000:00:02.0 bcs0: rq:{start:00178000, head:08b0, tail:08f0, seqno:ff84:45155, hwsp:45156},
>   process_csb: 0000:00:02.0 bcs0: ctx:{start:00178000, head:e000928, tail:0928},
>   process_csb: GEM_BUG_ON("context completed before request")
>
> In this sequence, we can see that although we have submitted the next
> request [ff84:45156] to HW (via ELSP[]) it has not yet reported the
> lite-restore. Instead, we see the completion event of the currently
> active request [ff84:45155] but at the time of processing that event,
> the breadcrumb has not yet been written. Though by the time we do print
> out the debug info, the seqno write of ff84:45156 has landed!

I see it.

But I have now a more generic confusion about the sequence:
why would the same context preempt itself?

>   trace_ports: 0000:00:02.0 bcs0: preempted { ff84:45154! prio 2 }
>   trace_ports: 0000:00:02.0 bcs0: promote { ff84:45155* prio 2 }

>
> Therefore there is a serialisation problem between the seqno writes and
> CS events, not just between the CS buffer and its head/tail pointers as
> previously observed on Icelake.
>
> This is not a huge problem, as we don't strictly rely on the breadcrumb
> to determine HW activity, but it may indicate that interrupt delivery is
> before the seqno write, aka bringing back the plague of missed
> interrupts from yesteryear. However, there is no indication of this
> wider problem, so let's just flush the seqno read before reporting an
> error. If it persists after the fresh read we can worry again.

Well you are preempting part of my concerns.

Still, you will have a serialisation point now only with debug builds.
So I assume the rest of the request completion flow is agnostic to
intr/seqno coherency and we can cope with this only for debug...

Having this in the generic hwsp read path would be expensive and thus overkill?

Could we lift the flushing into the non-show_debug path so that it is
always part of the process_csb cycle?

-Mika

>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_lrc.c | 29 ++++++++++++++++++++++-------
>  1 file changed, 22 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> index 3d5f3f7677bb..afeca7eb1e3a 100644
> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> @@ -2418,8 +2418,6 @@ static void process_csb(struct intel_engine_cs *engine)
>  		if (promote) {
>  			struct i915_request * const *old = execlists->active;
>  
> -			GEM_BUG_ON(!assert_pending_valid(execlists, "promote"));
> -
>  			ring_set_paused(engine, 0);
>  
>  			/* Point active to the new ELSP; prevent overwriting */
> @@ -2432,6 +2430,7 @@ static void process_csb(struct intel_engine_cs *engine)
>  				execlists_schedule_out(*old++);
>  
>  			/* switch pending to inflight */
> +			GEM_BUG_ON(!assert_pending_valid(execlists, "promote"));
>  			memcpy(execlists->inflight,
>  			       execlists->pending,
>  			       execlists_num_ports(execlists) *
> @@ -2453,13 +2452,26 @@ static void process_csb(struct intel_engine_cs *engine)
>  			 * user interrupt and CSB is processed.
>  			 */
>  			if (GEM_SHOW_DEBUG() &&
> -			    !i915_request_completed(*execlists->active) &&
> -			    !reset_in_progress(execlists)) {
> -				struct i915_request *rq __maybe_unused =
> -					*execlists->active;
> +			    !i915_request_completed(*execlists->active)) {
> +				struct i915_request *rq = *execlists->active;
>  				const u32 *regs __maybe_unused =
>  					rq->context->lrc_reg_state;
>  
> +				/*
> +				 * Flush the breadcrumb before crying foul.
> +				 *
> +				 * Since we have hit this on icl and seen the
> +				 * breadcrumb advance as we print out the debug
> +				 * info (so the problem corrected itself without
> +				 * lasting damage), and we know that icl suffers
> +				 * from missing global observation points in
> +				 * execlists, presume that affects even more
> +				 * coherency.
> +				 */
> +				mb();
> +				clflush((void *)READ_ONCE(rq->hwsp_seqno));
> +				mb();
> +
>  				ENGINE_TRACE(engine,
>  					     "ring:{start:0x%08x, head:%04x, tail:%04x, ctl:%08x, mode:%08x}\n",
>  					     ENGINE_READ(engine, RING_START),
> @@ -2480,7 +2492,10 @@ static void process_csb(struct intel_engine_cs *engine)
>  					     regs[CTX_RING_HEAD],
>  					     regs[CTX_RING_TAIL]);
>  
> -				GEM_BUG_ON("context completed before request");
> +				/* Still? Declare it caput! */
> +				if (!i915_request_completed(rq) &&
> +				    !reset_in_progress(execlists))
> +					GEM_BUG_ON("context completed before request");
>  			}
>  
>  			execlists_schedule_out(*execlists->active++);
> -- 
> 2.20.1
Chris Wilson March 31, 2020, 8:06 a.m. UTC | #2
Quoting Mika Kuoppala (2020-03-31 08:50:54)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> >   process_csb: 0000:00:02.0 bcs0: cs-irq head=4, tail=5
> >   process_csb: 0000:00:02.0 bcs0: csb[5]: status=0x00008002:0x60000020
> >   trace_ports: 0000:00:02.0 bcs0: preempted { ff84:45154! prio 2 }
> >   trace_ports: 0000:00:02.0 bcs0: promote { ff84:45155* prio 2 }
> >   trace_ports: 0000:00:02.0 bcs0: submit { ff84:45156 prio 2 }
> >
> >   process_csb: 0000:00:02.0 bcs0: cs-irq head=5, tail=6
> >   process_csb: 0000:00:02.0 bcs0: csb[6]: status=0x00000018:0x60000020
> >   trace_ports: 0000:00:02.0 bcs0: completed { ff84:45155* prio 2 }
> >   process_csb: 0000:00:02.0 bcs0: ring:{start:0x00178000, head:0928, tail:0928, ctl:00000000, mode:00000200}
> >   process_csb: 0000:00:02.0 bcs0: rq:{start:00178000, head:08b0, tail:08f0, seqno:ff84:45155, hwsp:45156},
> >   process_csb: 0000:00:02.0 bcs0: ctx:{start:00178000, head:e000928, tail:0928},
> >   process_csb: GEM_BUG_ON("context completed before request")
> >
> > In this sequence, we can see that although we have submitted the next
> > request [ff84:45156] to HW (via ELSP[]) it has not yet reported the
> > lite-restore. Instead, we see the completion event of the currently
> > active request [ff84:45155] but at the time of processing that event,
> > the breadcrumb has not yet been written. Though by the time we do print
> > out the debug info, the seqno write of ff84:45156 has landed!
> 
> I see it.
> 
> But I have now a more generic confusion about the sequence:
> why would the same context preempt itself?

lite-restore continuations.

> 
> >   trace_ports: 0000:00:02.0 bcs0: preempted { ff84:45154! prio 2 }
> >   trace_ports: 0000:00:02.0 bcs0: promote { ff84:45155* prio 2 }
> 
> >
> > Therefore there is a serialisation problem between the seqno writes and
> > CS events, not just between the CS buffer and its head/tail pointers as
> > previously observed on Icelake.
> >
> > This is not a huge problem, as we don't strictly rely on the breadcrumb
> > to determine HW activity, but it may indicate that interrupt delivery is
> > before the seqno write, aka bringing back the plague of missed
> > interrupts from yesteryear. However, there is no indication of this
> > wider problem, so let's just flush the seqno read before reporting an
> > error. If it persists after the fresh read we can worry again.
> 
> Well you are preempting part of my concerns.
> 
> Still, you will have a serialisation point now only with debug builds.

We only check in debug.

> So I assume the rest of the request completion flow is agnostic to
> intr/seqno coherency and we can cope with this only for debug...

Right.
 
> Having this in the generic hwsp read path would be expensive and thus overkill?

Very expensive. There's no evidence that the GPU is so broken that it
returns garbage in a *cached* CPU read.
 
> Could we lift the flushing into the non-show_debug path so that it is
> always part of the process_csb cycle?

No. What would be the point? The breadcrumb is not part of the CS flow.
The debug is our sanity check that the two events are ordered, but that
is not a strict requirement of the code. If it fails this test, the
presumption is that either the GPU is very dead (e.g. the GO problem,
RING_HEAD/TAIL issues, a variety of other means of killing the gpu) and
the problem will remain, or that we have a programming error, which
again will not resolve itself as we print the debug.
-Chris
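
To put a rough number on "very expensive": the stand-alone, x86-only sketch
below compares a plain cached read (the normal completion check) against a
read preceded by a cacheline flush (what the debug path now does once). It is
only an illustration (the numbers depend on the machine and say nothing about
the GPU side), not driver code:

/* Rough x86 user-space comparison of cached read vs. flush + read. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

#define ITERS 1000000

static volatile uint32_t hwsp_seqno; /* stand-in for the HWSP slot */

static uint64_t cycles_per_read(int flush)
{
	uint64_t start, end;
	uint32_t sink = 0;

	start = __rdtsc();
	for (int i = 0; i < ITERS; i++) {
		if (flush) {
			_mm_mfence();
			_mm_clflush((const void *)&hwsp_seqno);
			_mm_mfence();
		}
		sink += hwsp_seqno; /* the completion-check read */
	}
	end = __rdtsc();
	(void)sink;

	return (end - start) / ITERS;
}

int main(void)
{
	printf("cached read : ~%llu cycles/read\n",
	       (unsigned long long)cycles_per_read(0));
	printf("flush + read: ~%llu cycles/read\n",
	       (unsigned long long)cycles_per_read(1));
	return 0;
}

On typical hardware the flushed read is expected to be orders of magnitude
slower, which is why the flush stays on the debug-only path.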
Mika Kuoppala March 31, 2020, 8:52 a.m. UTC | #3
Chris Wilson <chris@chris-wilson.co.uk> writes:

> Quoting Mika Kuoppala (2020-03-31 08:50:54)
>> Chris Wilson <chris@chris-wilson.co.uk> writes:
>> 
>> >   process_csb: 0000:00:02.0 bcs0: cs-irq head=4, tail=5
>> >   process_csb: 0000:00:02.0 bcs0: csb[5]: status=0x00008002:0x60000020
>> >   trace_ports: 0000:00:02.0 bcs0: preempted { ff84:45154! prio 2 }
>> >   trace_ports: 0000:00:02.0 bcs0: promote { ff84:45155* prio 2 }
>> >   trace_ports: 0000:00:02.0 bcs0: submit { ff84:45156 prio 2 }
>> >
>> >   process_csb: 0000:00:02.0 bcs0: cs-irq head=5, tail=6
>> >   process_csb: 0000:00:02.0 bcs0: csb[6]: status=0x00000018:0x60000020
>> >   trace_ports: 0000:00:02.0 bcs0: completed { ff84:45155* prio 2 }
>> >   process_csb: 0000:00:02.0 bcs0: ring:{start:0x00178000, head:0928, tail:0928, ctl:00000000, mode:00000200}
>> >   process_csb: 0000:00:02.0 bcs0: rq:{start:00178000, head:08b0, tail:08f0, seqno:ff84:45155, hwsp:45156},
>> >   process_csb: 0000:00:02.0 bcs0: ctx:{start:00178000, head:e000928, tail:0928},
>> >   process_csb: GEM_BUG_ON("context completed before request")
>> >
>> > In this sequence, we can see that although we have submitted the next
>> > request [ff84:45156] to HW (via ELSP[]) it has not yet reported the
>> > lite-restore. Instead, we see the completion event of the currently
>> > active request [ff84:45155] but at the time of processing that event,
>> > the breadcrumb has not yet been written. Though by the time we do print
>> > out the debug info, the seqno write of ff84:45156 has landed!
>> 
>> I see it.
>> 
>> But I have now a more generic confusion about the sequence:
>> why would the same context preempt itself?
>
> lite-restore continuations.
>
>> 
>> >   trace_ports: 0000:00:02.0 bcs0: preempted { ff84:45154! prio 2 }
>> >   trace_ports: 0000:00:02.0 bcs0: promote { ff84:45155* prio 2 }
>> 
>> >
>> > Therefore there is a serialisation problem between the seqno writes and
>> > CS events, not just between the CS buffer and its head/tail pointers as
>> > previously observed on Icelake.
>> >
>> > This is not a huge problem, as we don't strictly rely on the breadcrumb
>> > to determine HW activity, but it may indicate that interrupt delivery is
>> > before the seqno write, aka bringing back the plague of missed
>> > interrupts from yesteryear. However, there is no indication of this
>> > wider problem, so let's just flush the seqno read before reporting an
>> > error. If it persists after the fresh read we can worry again.
>> 
>> Well you are preempting part of my concerns.
>> 
>> Still, you will have a serialisation point now only with debug builds.
>
> We only check in debug.
>
>> So I assume the rest of the request completion flow is agnostic to
>> intr/seqno coherency and we can cope with this only for debug...
>
> Right.
>  
>> Having this in the generic hwsp read path would be expensive and thus overkill?
>
> Very expensive. There's no evidence that the GPU is so broken that it
> returns garbage in a *cached* CPU read.
>  
>> Could we lift the flushing into the non-show_debug path so that it is
>> always part of the process_csb cycle?
>
> No. What would be the point? The breadcrumb is not part of the CS flow.
> The debug is our sanity check that the two events are ordered, but that
> is not a strict requirement of the code. If it fails this test, the
> presumption is that either the GPU is very dead (e.g. the GO problem,
> RING_HEAD/TAIL issues, a variety of other means of killing the gpu) and
> the problem will remain, or that we have a programming error, which
> again will not resolve itself as we print the debug.

Agreed that there is no point if it is only forcing a fresh
read for debug. My concern was that it affects the
validity of other reads.

But you amended that in the commit message, preemptively.
So,
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

Patch

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 3d5f3f7677bb..afeca7eb1e3a 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -2418,8 +2418,6 @@  static void process_csb(struct intel_engine_cs *engine)
 		if (promote) {
 			struct i915_request * const *old = execlists->active;
 
-			GEM_BUG_ON(!assert_pending_valid(execlists, "promote"));
-
 			ring_set_paused(engine, 0);
 
 			/* Point active to the new ELSP; prevent overwriting */
@@ -2432,6 +2430,7 @@  static void process_csb(struct intel_engine_cs *engine)
 				execlists_schedule_out(*old++);
 
 			/* switch pending to inflight */
+			GEM_BUG_ON(!assert_pending_valid(execlists, "promote"));
 			memcpy(execlists->inflight,
 			       execlists->pending,
 			       execlists_num_ports(execlists) *
@@ -2453,13 +2452,26 @@  static void process_csb(struct intel_engine_cs *engine)
 			 * user interrupt and CSB is processed.
 			 */
 			if (GEM_SHOW_DEBUG() &&
-			    !i915_request_completed(*execlists->active) &&
-			    !reset_in_progress(execlists)) {
-				struct i915_request *rq __maybe_unused =
-					*execlists->active;
+			    !i915_request_completed(*execlists->active)) {
+				struct i915_request *rq = *execlists->active;
 				const u32 *regs __maybe_unused =
 					rq->context->lrc_reg_state;
 
+				/*
+				 * Flush the breadcrumb before crying foul.
+				 *
+				 * Since we have hit this on icl and seen the
+				 * breadcrumb advance as we print out the debug
+				 * info (so the problem corrected itself without
+				 * lasting damage), and we know that icl suffers
+				 * from missing global observation points in
+				 * execlists, presume that affects even more
+				 * coherency.
+				 */
+				mb();
+				clflush((void *)READ_ONCE(rq->hwsp_seqno));
+				mb();
+
 				ENGINE_TRACE(engine,
 					     "ring:{start:0x%08x, head:%04x, tail:%04x, ctl:%08x, mode:%08x}\n",
 					     ENGINE_READ(engine, RING_START),
@@ -2480,7 +2492,10 @@  static void process_csb(struct intel_engine_cs *engine)
 					     regs[CTX_RING_HEAD],
 					     regs[CTX_RING_TAIL]);
 
-				GEM_BUG_ON("context completed before request");
+				/* Still? Declare it caput! */
+				if (!i915_request_completed(rq) &&
+				    !reset_in_progress(execlists))
+					GEM_BUG_ON("context completed before request");
 			}
 
 			execlists_schedule_out(*execlists->active++);