diff mbox

[v2] drm/i915/lrc: Scrub the GPU state of the guilty hanging request

Message ID 20180427202446.29747-1-chris@chris-wilson.co.uk (mailing list archive)
State New, archived
Headers show

Commit Message

Chris Wilson April 27, 2018, 8:24 p.m. UTC
Previously, we just reset the ring register in the context image such
that we could skip over the broken batch and emit the closing
breadcrumb. However, on resume the context image and GPU state would be
reloaded, which may have been left in an inconsistent state by the
reset. The presumption was that at worst it would just cause another
reset and skip again until it recovered, however it seems just as likely
to cause an unrecoverable hang. Instead of risking loading an incomplete
context image, restore it back to the default state.

v2: Fix up off-by-one from including the ppHSWP in with the register
state.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Michel Thierry <michel.thierry@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

Comments

Michel Thierry April 27, 2018, 8:27 p.m. UTC | #1
On 4/27/2018 1:24 PM, Chris Wilson wrote:
> Previously, we just reset the ring register in the context image such
> that we could skip over the broken batch and emit the closing
> breadcrumb. However, on resume the context image and GPU state would be
> reloaded, which may have been left in an inconsistent state by the
> reset. The presumption was that at worst it would just cause another
> reset and skip again until it recovered, however it seems just as likely
> to cause an unrecoverable hang. Instead of risking loading an incomplete
> context image, restore it back to the default state.
> 
> v2: Fix up off-by-one from including the ppHSWP in with the register
> state.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> Cc: Michel Thierry <michel.thierry@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Reviewed-by: Michel Thierry <michel.thierry@intel.com>

Does it need a 'Fixes:' tag or has a bugzilla reference?
> ---
>   drivers/gpu/drm/i915/intel_lrc.c | 24 +++++++++++++++++-------
>   1 file changed, 17 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index ce23d5116482..01750a4c2f3f 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1804,8 +1804,8 @@ static void reset_common_ring(struct intel_engine_cs *engine,
>   			      struct i915_request *request)
>   {
>   	struct intel_engine_execlists * const execlists = &engine->execlists;
> -	struct intel_context *ce;
>   	unsigned long flags;
> +	u32 *regs;
>   
>   	GEM_TRACE("%s request global=%x, current=%d\n",
>   		  engine->name, request ? request->global_seqno : 0,
> @@ -1855,14 +1855,24 @@ static void reset_common_ring(struct intel_engine_cs *engine,
>   	 * future request will be after userspace has had the opportunity
>   	 * to recreate its own state.
>   	 */
> -	ce = &request->ctx->engine[engine->id];
> -	execlists_init_reg_state(ce->lrc_reg_state,
> -				 request->ctx, engine, ce->ring);
> +	regs = request->ctx->engine[engine->id].lrc_reg_state;
> +	if (engine->default_state) {
> +		void *defaults;
> +
> +		defaults = i915_gem_object_pin_map(engine->default_state,
> +						   I915_MAP_WB);
> +		if (!IS_ERR(defaults)) {
> +			memcpy(regs, /* skip restoring to the vanilla PPHWSP */
> +			       defaults + LRC_STATE_PN * PAGE_SIZE,
> +			       engine->context_size - PAGE_SIZE);
> +			i915_gem_object_unpin_map(engine->default_state);
> +		}
> +	}
> +	execlists_init_reg_state(regs, request->ctx, engine, request->ring);
>   
>   	/* Move the RING_HEAD onto the breadcrumb, past the hanging batch */
> -	ce->lrc_reg_state[CTX_RING_BUFFER_START+1] =
> -		i915_ggtt_offset(ce->ring->vma);
> -	ce->lrc_reg_state[CTX_RING_HEAD+1] = request->postfix;
> +	regs[CTX_RING_BUFFER_START + 1] = i915_ggtt_offset(request->ring->vma);
> +	regs[CTX_RING_HEAD + 1] = request->postfix;
>   
>   	request->ring->head = request->postfix;
>   	intel_ring_update_space(request->ring);
>
Chris Wilson April 27, 2018, 8:35 p.m. UTC | #2
Quoting Michel Thierry (2018-04-27 21:27:46)
> On 4/27/2018 1:24 PM, Chris Wilson wrote:
> > Previously, we just reset the ring register in the context image such
> > that we could skip over the broken batch and emit the closing
> > breadcrumb. However, on resume the context image and GPU state would be
> > reloaded, which may have been left in an inconsistent state by the
> > reset. The presumption was that at worst it would just cause another
> > reset and skip again until it recovered, however it seems just as likely
> > to cause an unrecoverable hang. Instead of risking loading an incomplete
> > context image, restore it back to the default state.
> > 
> > v2: Fix up off-by-one from including the ppHSWP in with the register
> > state.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > Cc: Michał Winiarski <michal.winiarski@intel.com>
> > Cc: Michel Thierry <michel.thierry@intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> Reviewed-by: Michel Thierry <michel.thierry@intel.com>
> 
> Does it need a 'Fixes:' tag or has a bugzilla reference?

I suspect it's rare enough that the unrecoverable hang might not be
recognisable in bugzilla. I was just looking at 

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4108/fi-bsw-n3050/dmesg0.log

trying to think of ways how the reset might appear to work but the
recovery fail with 

<7>[  521.765114] missed_breadcrumb vecs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x5a/0x80 [i915]
<7>[  521.765176] missed_breadcrumb 	current seqno e4e, last e4f, hangcheck e4e [2048 ms], inflight 1
<7>[  521.765191] missed_breadcrumb 	Reset count: 0 (global 0)
<7>[  521.765206] missed_breadcrumb 	Requests:
<7>[  521.765223] missed_breadcrumb 		first  e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
<7>[  521.765239] missed_breadcrumb 		last   e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
<7>[  521.765256] missed_breadcrumb 		active e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
<7>[  521.765274] missed_breadcrumb 		[head 3900, postfix 3930, tail 3948, batch 0x00000000_00042000]
<7>[  521.765289] missed_breadcrumb 		ring->start:  0x008ef000
<7>[  521.765301] missed_breadcrumb 		ring->head:   0x000038f8
<7>[  521.765313] missed_breadcrumb 		ring->tail:   0x00003948
<7>[  521.765325] missed_breadcrumb 		ring->emit:   0x00003950
<7>[  521.765337] missed_breadcrumb 		ring->space:  0x00002618
<7>[  521.765372] missed_breadcrumb 	RING_START: 0x008ef000
<7>[  521.765389] missed_breadcrumb 	RING_HEAD:  0x000038f8
<7>[  521.765404] missed_breadcrumb 	RING_TAIL:  0x00003948
<7>[  521.765422] missed_breadcrumb 	RING_CTL:   0x00003001
<7>[  521.765438] missed_breadcrumb 	RING_MODE:  0x00000000
<7>[  521.765453] missed_breadcrumb 	RING_IMR: fffffefe
<7>[  521.765473] missed_breadcrumb 	ACTHD:  0x00000000_022039b8
<7>[  521.765492] missed_breadcrumb 	BBADDR: 0x00000000_00042004
<7>[  521.765511] missed_breadcrumb 	DMA_FADDR: 0x00000000_008f28f8
<7>[  521.765537] missed_breadcrumb 	IPEIR: 0x00000000
<7>[  521.765552] missed_breadcrumb 	IPEHR: 0x11000011
<7>[  521.765570] missed_breadcrumb 	Execlist status: 0x00044032 00000002
<7>[  521.765586] missed_breadcrumb 	Execlist CSB read 1 [1 cached], write 2 [2 from hws], interrupt posted? no, tasklet queued? no (enabled)
<7>[  521.765604] missed_breadcrumb 	Execlist CSB[2]: 0x00000001 [0x00000001 in hwsp], context: 0 [0 in hwsp]
<7>[  521.765619] missed_breadcrumb 		ELSP[0] count=1, rq: e4f [9b82:e4f] prio=0 @ 3767ms: gem_sync[3107]/0
<7>[  521.765632] missed_breadcrumb 		ELSP[1] idle
<7>[  521.765645] missed_breadcrumb 		HW active? 0x1
<7>[  521.765660] missed_breadcrumb 		E e4f [9b82:e4f] prio=0 @ 3767ms: gem_sync[3107]/0
<7>[  521.765670] missed_breadcrumb 		Queue priority: -2147483648
<7>[  521.765684] missed_breadcrumb 	gem_sync [3112] waiting for e4f
<7>[  521.765697] missed_breadcrumb IRQ? 0x1 (breadcrumbs? yes) (execlists? no)
<7>[  521.765707] missed_breadcrumb HWSP:
<7>[  521.765723] missed_breadcrumb 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  521.765733] missed_breadcrumb *
<7>[  521.765747] missed_breadcrumb 00000040 00000001 00000000 00000018 00000002 00000001 00000000 00000018 00000002
<7>[  521.765760] missed_breadcrumb 00000060 00000001 00000000 00000018 00000002 00000000 00000000 00000000 00000002
<7>[  521.765774] missed_breadcrumb 00000080 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  521.765784] missed_breadcrumb *
<7>[  521.765809] missed_breadcrumb 000000c0 00000e4e 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  521.765823] missed_breadcrumb 000000e0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  521.765833] missed_breadcrumb *
<7>[  521.765845] missed_breadcrumb Idle? no

Of particular note being the IPEHR being MI_LRI, the ring being idle (it
hasn't moved on from the earlier reset) and the fetch address being
unconnected to the rings, so naturally I assume it died loading the
context image on resume.
-Chris
Michel Thierry April 27, 2018, 10:30 p.m. UTC | #3
On 4/27/2018 1:35 PM, Chris Wilson wrote:
> Quoting Michel Thierry (2018-04-27 21:27:46)
>> On 4/27/2018 1:24 PM, Chris Wilson wrote:
>>> Previously, we just reset the ring register in the context image such
>>> that we could skip over the broken batch and emit the closing
>>> breadcrumb. However, on resume the context image and GPU state would be
>>> reloaded, which may have been left in an inconsistent state by the
>>> reset. The presumption was that at worst it would just cause another
>>> reset and skip again until it recovered, however it seems just as likely
>>> to cause an unrecoverable hang. Instead of risking loading an incomplete
>>> context image, restore it back to the default state.
>>>
>>> v2: Fix up off-by-one from including the ppHSWP in with the register
>>> state.
>>>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>>> Cc: Michał Winiarski <michal.winiarski@intel.com>
>>> Cc: Michel Thierry <michel.thierry@intel.com>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> Reviewed-by: Michel Thierry <michel.thierry@intel.com>
>>
>> Does it need a 'Fixes:' tag or has a bugzilla reference?
> 
> I suspect it's rare enough that the unrecoverable hang might not be
> recognisable in bugzilla. I was just looking at
> 
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4108/fi-bsw-n3050/dmesg0.log
> 
> trying to think of ways how the reset might appear to work but the
> recovery fail with
> 
> <7>[  521.765114] missed_breadcrumb vecs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x5a/0x80 [i915]
> <7>[  521.765176] missed_breadcrumb 	current seqno e4e, last e4f, hangcheck e4e [2048 ms], inflight 1
> <7>[  521.765191] missed_breadcrumb 	Reset count: 0 (global 0)
> <7>[  521.765206] missed_breadcrumb 	Requests:
> <7>[  521.765223] missed_breadcrumb 		first  e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
> <7>[  521.765239] missed_breadcrumb 		last   e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
> <7>[  521.765256] missed_breadcrumb 		active e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
> <7>[  521.765274] missed_breadcrumb 		[head 3900, postfix 3930, tail 3948, batch 0x00000000_00042000]
> <7>[  521.765289] missed_breadcrumb 		ring->start:  0x008ef000
> <7>[  521.765301] missed_breadcrumb 		ring->head:   0x000038f8
> <7>[  521.765313] missed_breadcrumb 		ring->tail:   0x00003948
> <7>[  521.765325] missed_breadcrumb 		ring->emit:   0x00003950
> <7>[  521.765337] missed_breadcrumb 		ring->space:  0x00002618
> <7>[  521.765372] missed_breadcrumb 	RING_START: 0x008ef000
> <7>[  521.765389] missed_breadcrumb 	RING_HEAD:  0x000038f8
> <7>[  521.765404] missed_breadcrumb 	RING_TAIL:  0x00003948
> <7>[  521.765422] missed_breadcrumb 	RING_CTL:   0x00003001
> <7>[  521.765438] missed_breadcrumb 	RING_MODE:  0x00000000
> <7>[  521.765453] missed_breadcrumb 	RING_IMR: fffffefe
> <7>[  521.765473] missed_breadcrumb 	ACTHD:  0x00000000_022039b8
> <7>[  521.765492] missed_breadcrumb 	BBADDR: 0x00000000_00042004
> <7>[  521.765511] missed_breadcrumb 	DMA_FADDR: 0x00000000_008f28f8
> <7>[  521.765537] missed_breadcrumb 	IPEIR: 0x00000000
> <7>[  521.765552] missed_breadcrumb 	IPEHR: 0x11000011
> <7>[  521.765570] missed_breadcrumb 	Execlist status: 0x00044032 00000002
> <7>[  521.765586] missed_breadcrumb 	Execlist CSB read 1 [1 cached], write 2 [2 from hws], interrupt posted? no, tasklet queued? no (enabled)
> <7>[  521.765604] missed_breadcrumb 	Execlist CSB[2]: 0x00000001 [0x00000001 in hwsp], context: 0 [0 in hwsp]
> <7>[  521.765619] missed_breadcrumb 		ELSP[0] count=1, rq: e4f [9b82:e4f] prio=0 @ 3767ms: gem_sync[3107]/0
> <7>[  521.765632] missed_breadcrumb 		ELSP[1] idle
> <7>[  521.765645] missed_breadcrumb 		HW active? 0x1
> <7>[  521.765660] missed_breadcrumb 		E e4f [9b82:e4f] prio=0 @ 3767ms: gem_sync[3107]/0
> <7>[  521.765670] missed_breadcrumb 		Queue priority: -2147483648
> <7>[  521.765684] missed_breadcrumb 	gem_sync [3112] waiting for e4f
> <7>[  521.765697] missed_breadcrumb IRQ? 0x1 (breadcrumbs? yes) (execlists? no)
> <7>[  521.765707] missed_breadcrumb HWSP:
> <7>[  521.765723] missed_breadcrumb 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> <7>[  521.765733] missed_breadcrumb *
> <7>[  521.765747] missed_breadcrumb 00000040 00000001 00000000 00000018 00000002 00000001 00000000 00000018 00000002
> <7>[  521.765760] missed_breadcrumb 00000060 00000001 00000000 00000018 00000002 00000000 00000000 00000000 00000002
> <7>[  521.765774] missed_breadcrumb 00000080 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> <7>[  521.765784] missed_breadcrumb *
> <7>[  521.765809] missed_breadcrumb 000000c0 00000e4e 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> <7>[  521.765823] missed_breadcrumb 000000e0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> <7>[  521.765833] missed_breadcrumb *
> <7>[  521.765845] missed_breadcrumb Idle? no
> 
> Of particular note being the IPEHR being MI_LRI, the ring being idle (it
> hasn't moved on from the earlier reset) and the fetch address being
> unconnected to the rings, so naturally I assume it died loading the
> context image on resume.
Plus it is a bsw...
Agreed, this looks like an issue during the ctx restore.

> -Chris
>
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index ce23d5116482..01750a4c2f3f 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1804,8 +1804,8 @@  static void reset_common_ring(struct intel_engine_cs *engine,
 			      struct i915_request *request)
 {
 	struct intel_engine_execlists * const execlists = &engine->execlists;
-	struct intel_context *ce;
 	unsigned long flags;
+	u32 *regs;
 
 	GEM_TRACE("%s request global=%x, current=%d\n",
 		  engine->name, request ? request->global_seqno : 0,
@@ -1855,14 +1855,24 @@  static void reset_common_ring(struct intel_engine_cs *engine,
 	 * future request will be after userspace has had the opportunity
 	 * to recreate its own state.
 	 */
-	ce = &request->ctx->engine[engine->id];
-	execlists_init_reg_state(ce->lrc_reg_state,
-				 request->ctx, engine, ce->ring);
+	regs = request->ctx->engine[engine->id].lrc_reg_state;
+	if (engine->default_state) {
+		void *defaults;
+
+		defaults = i915_gem_object_pin_map(engine->default_state,
+						   I915_MAP_WB);
+		if (!IS_ERR(defaults)) {
+			memcpy(regs, /* skip restoring to the vanilla PPHWSP */
+			       defaults + LRC_STATE_PN * PAGE_SIZE,
+			       engine->context_size - PAGE_SIZE);
+			i915_gem_object_unpin_map(engine->default_state);
+		}
+	}
+	execlists_init_reg_state(regs, request->ctx, engine, request->ring);
 
 	/* Move the RING_HEAD onto the breadcrumb, past the hanging batch */
-	ce->lrc_reg_state[CTX_RING_BUFFER_START+1] =
-		i915_ggtt_offset(ce->ring->vma);
-	ce->lrc_reg_state[CTX_RING_HEAD+1] = request->postfix;
+	regs[CTX_RING_BUFFER_START + 1] = i915_ggtt_offset(request->ring->vma);
+	regs[CTX_RING_HEAD + 1] = request->postfix;
 
 	request->ring->head = request->postfix;
 	intel_ring_update_space(request->ring);