Message ID | 20180427202446.29747-1-chris@chris-wilson.co.uk (mailing list archive) |
---|---|
State | New, archived |
On 4/27/2018 1:24 PM, Chris Wilson wrote:
> Previously, we just reset the ring register in the context image such
> that we could skip over the broken batch and emit the closing
> breadcrumb. However, on resume the context image and GPU state would be
> reloaded, which may have been left in an inconsistent state by the
> reset. The presumption was that at worst it would just cause another
> reset and skip again until it recovered; however, it seems just as
> likely to cause an unrecoverable hang. Instead of risking loading an
> incomplete context image, restore it back to the default state.
>
> v2: Fix up off-by-one from including the PPHWSP in with the register
> state.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> Cc: Michel Thierry <michel.thierry@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Reviewed-by: Michel Thierry <michel.thierry@intel.com>

Does it need a 'Fixes:' tag or a bugzilla reference?

> ---
>  drivers/gpu/drm/i915/intel_lrc.c | 24 +++++++++++++++++-------
>  1 file changed, 17 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index ce23d5116482..01750a4c2f3f 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1804,8 +1804,8 @@ static void reset_common_ring(struct intel_engine_cs *engine,
>  			       struct i915_request *request)
>  {
>  	struct intel_engine_execlists * const execlists = &engine->execlists;
> -	struct intel_context *ce;
>  	unsigned long flags;
> +	u32 *regs;
>
>  	GEM_TRACE("%s request global=%x, current=%d\n",
>  		  engine->name, request ? request->global_seqno : 0,
> @@ -1855,14 +1855,24 @@ static void reset_common_ring(struct intel_engine_cs *engine,
>  	 * future request will be after userspace has had the opportunity
>  	 * to recreate its own state.
>  	 */
> -	ce = &request->ctx->engine[engine->id];
> -	execlists_init_reg_state(ce->lrc_reg_state,
> -				 request->ctx, engine, ce->ring);
> +	regs = request->ctx->engine[engine->id].lrc_reg_state;
> +	if (engine->default_state) {
> +		void *defaults;
> +
> +		defaults = i915_gem_object_pin_map(engine->default_state,
> +						   I915_MAP_WB);
> +		if (!IS_ERR(defaults)) {
> +			memcpy(regs, /* skip restoring to the vanilla PPHWSP */
> +			       defaults + LRC_STATE_PN * PAGE_SIZE,
> +			       engine->context_size - PAGE_SIZE);
> +			i915_gem_object_unpin_map(engine->default_state);
> +		}
> +	}
> +	execlists_init_reg_state(regs, request->ctx, engine, request->ring);
>
>  	/* Move the RING_HEAD onto the breadcrumb, past the hanging batch */
> -	ce->lrc_reg_state[CTX_RING_BUFFER_START+1] =
> -		i915_ggtt_offset(ce->ring->vma);
> -	ce->lrc_reg_state[CTX_RING_HEAD+1] = request->postfix;
> +	regs[CTX_RING_BUFFER_START + 1] = i915_ggtt_offset(request->ring->vma);
> +	regs[CTX_RING_HEAD + 1] = request->postfix;
>
>  	request->ring->head = request->postfix;
>  	intel_ring_update_space(request->ring);
Quoting Michel Thierry (2018-04-27 21:27:46)
> On 4/27/2018 1:24 PM, Chris Wilson wrote:
> > Previously, we just reset the ring register in the context image such
> > that we could skip over the broken batch and emit the closing
> > breadcrumb. However, on resume the context image and GPU state would be
> > reloaded, which may have been left in an inconsistent state by the
> > reset. The presumption was that at worst it would just cause another
> > reset and skip again until it recovered; however, it seems just as
> > likely to cause an unrecoverable hang. Instead of risking loading an
> > incomplete context image, restore it back to the default state.
> >
> > v2: Fix up off-by-one from including the PPHWSP in with the register
> > state.
> >
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > Cc: Michał Winiarski <michal.winiarski@intel.com>
> > Cc: Michel Thierry <michel.thierry@intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> Reviewed-by: Michel Thierry <michel.thierry@intel.com>
>
> Does it need a 'Fixes:' tag or a bugzilla reference?

I suspect it's rare enough that the unrecoverable hang might not be
recognisable in bugzilla. I was just looking at

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4108/fi-bsw-n3050/dmesg0.log

trying to think of ways in which the reset might appear to work but the
recovery fail, with

<7>[ 521.765114] missed_breadcrumb vecs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x5a/0x80 [i915]
<7>[ 521.765176] missed_breadcrumb current seqno e4e, last e4f, hangcheck e4e [2048 ms], inflight 1
<7>[ 521.765191] missed_breadcrumb Reset count: 0 (global 0)
<7>[ 521.765206] missed_breadcrumb Requests:
<7>[ 521.765223] missed_breadcrumb first e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
<7>[ 521.765239] missed_breadcrumb last e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
<7>[ 521.765256] missed_breadcrumb active e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
<7>[ 521.765274] missed_breadcrumb [head 3900, postfix 3930, tail 3948, batch 0x00000000_00042000]
<7>[ 521.765289] missed_breadcrumb ring->start: 0x008ef000
<7>[ 521.765301] missed_breadcrumb ring->head: 0x000038f8
<7>[ 521.765313] missed_breadcrumb ring->tail: 0x00003948
<7>[ 521.765325] missed_breadcrumb ring->emit: 0x00003950
<7>[ 521.765337] missed_breadcrumb ring->space: 0x00002618
<7>[ 521.765372] missed_breadcrumb RING_START: 0x008ef000
<7>[ 521.765389] missed_breadcrumb RING_HEAD: 0x000038f8
<7>[ 521.765404] missed_breadcrumb RING_TAIL: 0x00003948
<7>[ 521.765422] missed_breadcrumb RING_CTL: 0x00003001
<7>[ 521.765438] missed_breadcrumb RING_MODE: 0x00000000
<7>[ 521.765453] missed_breadcrumb RING_IMR: fffffefe
<7>[ 521.765473] missed_breadcrumb ACTHD: 0x00000000_022039b8
<7>[ 521.765492] missed_breadcrumb BBADDR: 0x00000000_00042004
<7>[ 521.765511] missed_breadcrumb DMA_FADDR: 0x00000000_008f28f8
<7>[ 521.765537] missed_breadcrumb IPEIR: 0x00000000
<7>[ 521.765552] missed_breadcrumb IPEHR: 0x11000011
<7>[ 521.765570] missed_breadcrumb Execlist status: 0x00044032 00000002
<7>[ 521.765586] missed_breadcrumb Execlist CSB read 1 [1 cached], write 2 [2 from hws], interrupt posted? no, tasklet queued? no (enabled)
<7>[ 521.765604] missed_breadcrumb Execlist CSB[2]: 0x00000001 [0x00000001 in hwsp], context: 0 [0 in hwsp]
<7>[ 521.765619] missed_breadcrumb ELSP[0] count=1, rq: e4f [9b82:e4f] prio=0 @ 3767ms: gem_sync[3107]/0
<7>[ 521.765632] missed_breadcrumb ELSP[1] idle
<7>[ 521.765645] missed_breadcrumb HW active? 0x1
<7>[ 521.765660] missed_breadcrumb E e4f [9b82:e4f] prio=0 @ 3767ms: gem_sync[3107]/0
<7>[ 521.765670] missed_breadcrumb Queue priority: -2147483648
<7>[ 521.765684] missed_breadcrumb gem_sync [3112] waiting for e4f
<7>[ 521.765697] missed_breadcrumb IRQ? 0x1 (breadcrumbs? yes) (execlists? no)
<7>[ 521.765707] missed_breadcrumb HWSP:
<7>[ 521.765723] missed_breadcrumb 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[ 521.765733] missed_breadcrumb *
<7>[ 521.765747] missed_breadcrumb 00000040 00000001 00000000 00000018 00000002 00000001 00000000 00000018 00000002
<7>[ 521.765760] missed_breadcrumb 00000060 00000001 00000000 00000018 00000002 00000000 00000000 00000000 00000002
<7>[ 521.765774] missed_breadcrumb 00000080 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[ 521.765784] missed_breadcrumb *
<7>[ 521.765809] missed_breadcrumb 000000c0 00000e4e 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[ 521.765823] missed_breadcrumb 000000e0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[ 521.765833] missed_breadcrumb *
<7>[ 521.765845] missed_breadcrumb Idle? no

Of particular note: the IPEHR is MI_LRI, the ring is idle (it hasn't
moved on from the earlier reset) and the fetch address is unconnected
to the rings, so naturally I assume it died loading the context image
on resume.
-Chris
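[Editor's note: the IPEHR value in that dump can be decoded by hand. Below is a minimal user-space sketch, assuming the MI command encoding used by the i915 headers (the opcode sits in bits 28:23, and MI_LOAD_REGISTER_IMM is opcode 0x22); the macro names are local to this example, not kernel identifiers.]

```c
/*
 * Standalone sketch, not kernel code: decode the IPEHR value from the
 * dmesg above. i915 encodes MI commands as (opcode << 23) | flags;
 * MI_LOAD_REGISTER_IMM uses opcode 0x22 and carries its dword length
 * in the low byte.
 */
#include <stdint.h>
#include <stdio.h>

#define MI_OPCODE(dw)    (((dw) >> 23) & 0x3f) /* bits 28:23 */
#define MI_DW_LENGTH(dw) ((dw) & 0xff)         /* bits 7:0 */
#define MI_LRI_OPCODE    0x22

int main(void)
{
	uint32_t ipehr = 0x11000011; /* IPEHR from the log */

	if (MI_OPCODE(ipehr) == MI_LRI_OPCODE)
		printf("MI_LOAD_REGISTER_IMM, length %u: %u reg/value pairs\n",
		       MI_DW_LENGTH(ipehr), (MI_DW_LENGTH(ipehr) + 1) / 2);

	return 0;
}
```

This prints "MI_LOAD_REGISTER_IMM, length 17: 9 reg/value pairs"; a long run of register/value loads is exactly what replaying a context image consists of, consistent with the analysis above.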
On 4/27/2018 1:35 PM, Chris Wilson wrote:
> Quoting Michel Thierry (2018-04-27 21:27:46)
>> On 4/27/2018 1:24 PM, Chris Wilson wrote:
>>> Previously, we just reset the ring register in the context image such
>>> that we could skip over the broken batch and emit the closing
>>> breadcrumb. However, on resume the context image and GPU state would be
>>> reloaded, which may have been left in an inconsistent state by the
>>> reset. The presumption was that at worst it would just cause another
>>> reset and skip again until it recovered; however, it seems just as
>>> likely to cause an unrecoverable hang. Instead of risking loading an
>>> incomplete context image, restore it back to the default state.
>>>
>>> v2: Fix up off-by-one from including the PPHWSP in with the register
>>> state.
>>>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>>> Cc: Michał Winiarski <michal.winiarski@intel.com>
>>> Cc: Michel Thierry <michel.thierry@intel.com>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> Reviewed-by: Michel Thierry <michel.thierry@intel.com>
>>
>> Does it need a 'Fixes:' tag or a bugzilla reference?
>
> I suspect it's rare enough that the unrecoverable hang might not be
> recognisable in bugzilla. I was just looking at
>
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4108/fi-bsw-n3050/dmesg0.log
>
> trying to think of ways in which the reset might appear to work but the
> recovery fail, with
>
> <7>[ 521.765114] missed_breadcrumb vecs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x5a/0x80 [i915]
> <7>[ 521.765176] missed_breadcrumb current seqno e4e, last e4f, hangcheck e4e [2048 ms], inflight 1
> <7>[ 521.765191] missed_breadcrumb Reset count: 0 (global 0)
> <7>[ 521.765206] missed_breadcrumb Requests:
> <7>[ 521.765223] missed_breadcrumb first e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
> <7>[ 521.765239] missed_breadcrumb last e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
> <7>[ 521.765256] missed_breadcrumb active e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
> <7>[ 521.765274] missed_breadcrumb [head 3900, postfix 3930, tail 3948, batch 0x00000000_00042000]
> <7>[ 521.765289] missed_breadcrumb ring->start: 0x008ef000
> <7>[ 521.765301] missed_breadcrumb ring->head: 0x000038f8
> <7>[ 521.765313] missed_breadcrumb ring->tail: 0x00003948
> <7>[ 521.765325] missed_breadcrumb ring->emit: 0x00003950
> <7>[ 521.765337] missed_breadcrumb ring->space: 0x00002618
> <7>[ 521.765372] missed_breadcrumb RING_START: 0x008ef000
> <7>[ 521.765389] missed_breadcrumb RING_HEAD: 0x000038f8
> <7>[ 521.765404] missed_breadcrumb RING_TAIL: 0x00003948
> <7>[ 521.765422] missed_breadcrumb RING_CTL: 0x00003001
> <7>[ 521.765438] missed_breadcrumb RING_MODE: 0x00000000
> <7>[ 521.765453] missed_breadcrumb RING_IMR: fffffefe
> <7>[ 521.765473] missed_breadcrumb ACTHD: 0x00000000_022039b8
> <7>[ 521.765492] missed_breadcrumb BBADDR: 0x00000000_00042004
> <7>[ 521.765511] missed_breadcrumb DMA_FADDR: 0x00000000_008f28f8
> <7>[ 521.765537] missed_breadcrumb IPEIR: 0x00000000
> <7>[ 521.765552] missed_breadcrumb IPEHR: 0x11000011
> <7>[ 521.765570] missed_breadcrumb Execlist status: 0x00044032 00000002
> <7>[ 521.765586] missed_breadcrumb Execlist CSB read 1 [1 cached], write 2 [2 from hws], interrupt posted? no, tasklet queued? no (enabled)
> <7>[ 521.765604] missed_breadcrumb Execlist CSB[2]: 0x00000001 [0x00000001 in hwsp], context: 0 [0 in hwsp]
> <7>[ 521.765619] missed_breadcrumb ELSP[0] count=1, rq: e4f [9b82:e4f] prio=0 @ 3767ms: gem_sync[3107]/0
> <7>[ 521.765632] missed_breadcrumb ELSP[1] idle
> <7>[ 521.765645] missed_breadcrumb HW active? 0x1
> <7>[ 521.765660] missed_breadcrumb E e4f [9b82:e4f] prio=0 @ 3767ms: gem_sync[3107]/0
> <7>[ 521.765670] missed_breadcrumb Queue priority: -2147483648
> <7>[ 521.765684] missed_breadcrumb gem_sync [3112] waiting for e4f
> <7>[ 521.765697] missed_breadcrumb IRQ? 0x1 (breadcrumbs? yes) (execlists? no)
> <7>[ 521.765707] missed_breadcrumb HWSP:
> <7>[ 521.765723] missed_breadcrumb 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> <7>[ 521.765733] missed_breadcrumb *
> <7>[ 521.765747] missed_breadcrumb 00000040 00000001 00000000 00000018 00000002 00000001 00000000 00000018 00000002
> <7>[ 521.765760] missed_breadcrumb 00000060 00000001 00000000 00000018 00000002 00000000 00000000 00000000 00000002
> <7>[ 521.765774] missed_breadcrumb 00000080 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> <7>[ 521.765784] missed_breadcrumb *
> <7>[ 521.765809] missed_breadcrumb 000000c0 00000e4e 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> <7>[ 521.765823] missed_breadcrumb 000000e0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> <7>[ 521.765833] missed_breadcrumb *
> <7>[ 521.765845] missed_breadcrumb Idle? no
>
> Of particular note: the IPEHR is MI_LRI, the ring is idle (it hasn't
> moved on from the earlier reset) and the fetch address is unconnected
> to the rings, so naturally I assume it died loading the context image
> on resume.

Plus it is a bsw... Agreed, this looks like an issue during the ctx
restore.

> -Chris
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index ce23d5116482..01750a4c2f3f 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1804,8 +1804,8 @@ static void reset_common_ring(struct intel_engine_cs *engine,
 			       struct i915_request *request)
 {
 	struct intel_engine_execlists * const execlists = &engine->execlists;
-	struct intel_context *ce;
 	unsigned long flags;
+	u32 *regs;

 	GEM_TRACE("%s request global=%x, current=%d\n",
 		  engine->name, request ? request->global_seqno : 0,
@@ -1855,14 +1855,24 @@ static void reset_common_ring(struct intel_engine_cs *engine,
 	 * future request will be after userspace has had the opportunity
 	 * to recreate its own state.
 	 */
-	ce = &request->ctx->engine[engine->id];
-	execlists_init_reg_state(ce->lrc_reg_state,
-				 request->ctx, engine, ce->ring);
+	regs = request->ctx->engine[engine->id].lrc_reg_state;
+	if (engine->default_state) {
+		void *defaults;
+
+		defaults = i915_gem_object_pin_map(engine->default_state,
+						   I915_MAP_WB);
+		if (!IS_ERR(defaults)) {
+			memcpy(regs, /* skip restoring to the vanilla PPHWSP */
+			       defaults + LRC_STATE_PN * PAGE_SIZE,
+			       engine->context_size - PAGE_SIZE);
+			i915_gem_object_unpin_map(engine->default_state);
+		}
+	}
+	execlists_init_reg_state(regs, request->ctx, engine, request->ring);

 	/* Move the RING_HEAD onto the breadcrumb, past the hanging batch */
-	ce->lrc_reg_state[CTX_RING_BUFFER_START+1] =
-		i915_ggtt_offset(ce->ring->vma);
-	ce->lrc_reg_state[CTX_RING_HEAD+1] = request->postfix;
+	regs[CTX_RING_BUFFER_START + 1] = i915_ggtt_offset(request->ring->vma);
+	regs[CTX_RING_HEAD + 1] = request->postfix;

 	request->ring->head = request->postfix;
 	intel_ring_update_space(request->ring);
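[Editor's note: to make the memcpy bounds easier to follow, here is a standalone sketch of the layout the copy assumes. As the patch's arithmetic implies (start at LRC_STATE_PN pages in, copy context_size - PAGE_SIZE bytes), there is one header page, the PPHWSP, ahead of the register state; the page numbering and context size used below are assumptions of the sketch, and the buffers are plain arrays standing in for engine->default_state and the hung context image.]

```c
/*
 * Standalone sketch, not kernel code: the default-state restore skips
 * the PPHWSP (per-process hardware status page) at page 0 and copies
 * the register state and the rest of the context image from page
 * LRC_STATE_PN onwards, exactly one page less than context_size.
 */
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE     4096
#define LRC_PPHWSP_PN 0                   /* PPHWSP page, assumed */
#define LRC_STATE_PN  (LRC_PPHWSP_PN + 1) /* register state page, assumed */
#define CONTEXT_PAGES 20                  /* hypothetical context size */

static char defaults[CONTEXT_PAGES * PAGE_SIZE]; /* engine->default_state */
static char live[CONTEXT_PAGES * PAGE_SIZE];     /* hung context image */

int main(void)
{
	size_t context_size = sizeof(live);
	char *regs = live + LRC_STATE_PN * PAGE_SIZE; /* lrc_reg_state */

	/* skip restoring to the vanilla PPHWSP, as in the patch */
	memcpy(regs, defaults + LRC_STATE_PN * PAGE_SIZE,
	       context_size - PAGE_SIZE);

	printf("restored %zu bytes; PPHWSP page left untouched\n",
	       context_size - PAGE_SIZE);
	return 0;
}
```

The v2 note about an off-by-one is visible in exactly this arithmetic: counting the PPHWSP in with the register state would overrun the copy by one page.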
Previously, we just reset the ring register in the context image such
that we could skip over the broken batch and emit the closing
breadcrumb. However, on resume the context image and GPU state would be
reloaded, which may have been left in an inconsistent state by the
reset. The presumption was that at worst it would just cause another
reset and skip again until it recovered; however, it seems just as
likely to cause an unrecoverable hang. Instead of risking loading an
incomplete context image, restore it back to the default state.

v2: Fix up off-by-one from including the PPHWSP in with the register
state.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Michel Thierry <michel.thierry@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)
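[Editor's note: as a closing illustration of the "skip over the broken batch" behaviour the commit message describes, the reset handler points RING_HEAD at the request's postfix, so only the closing breadcrumb between postfix and tail is replayed. A toy sketch using the head/postfix/tail offsets from the dmesg dump earlier in the thread; the struct is made up for illustration.]

```c
/*
 * Toy sketch, not kernel code: moving the ring head onto the
 * breadcrumb. The region [head, postfix) contains the commands that
 * call into the hung batch; [postfix, tail) is the closing breadcrumb
 * (seqno write plus interrupt) that must still execute.
 */
#include <stdint.h>
#include <stdio.h>

struct toy_request {
	uint32_t head;    /* start of the request in the ring */
	uint32_t postfix; /* start of the closing breadcrumb */
	uint32_t tail;    /* end of the request */
};

int main(void)
{
	/* offsets from the dmesg dump above */
	struct toy_request rq = { .head = 0x3900, .postfix = 0x3930,
				  .tail = 0x3948 };
	uint32_t ring_head = rq.postfix; /* regs[CTX_RING_HEAD + 1] */

	printf("RING_HEAD -> 0x%x: skip 0x%x bytes (including the call "
	       "into the hung batch), replay 0x%x bytes of breadcrumb\n",
	       ring_head, rq.postfix - rq.head, rq.tail - rq.postfix);
	return 0;
}
```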