Message ID | 20180604073441.6737-2-chris@chris-wilson.co.uk (mailing list archive) |
---|---|
State | New, archived |
On 04/06/2018 08:34, Chris Wilson wrote:
> In the unlikely case where we have failed to keep submitting to the GPU,
> we end up with the ELSP queue empty but a pending queue of requests.

How does this happen? We have nothing in ports but a queue of requests,
but we managed to declare a GPU hang, even though there is nothing in
ports so GPU looks idle from the outside?

Regards,

Tvrtko

> Here, we skip the per-engine reset as there is no guilty request, but in
> doing so we also skip the engine restart leaving ourselves with a
> permanently hung engine. A quick way to recover is by moving the tasklet
> kick to execlists_reset_finish() (from init_hw). We still emit the error
> on hanging, so the error is not lost but we should be able to recover.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Michel Thierry <michel.thierry@intel.com>
> ---
>  drivers/gpu/drm/i915/intel_lrc.c | 13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 8d912d0c8fc1..c8d9b5aed94a 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1803,7 +1803,6 @@ static bool unexpected_starting_state(struct intel_engine_cs *engine)
>
>  static int gen8_init_common_ring(struct intel_engine_cs *engine)
>  {
> -	struct intel_engine_execlists * const execlists = &engine->execlists;
>  	int ret;
>
>  	ret = intel_mocs_init_engine(engine);
> @@ -1821,10 +1820,6 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine)
>
>  	enable_execlists(engine);
>
> -	/* After a GPU reset, we may have requests to replay */
> -	if (execlists->first)
> -		tasklet_schedule(&execlists->tasklet);
> -
>  	return 0;
>  }
>
> @@ -2006,6 +2001,12 @@ static void execlists_reset(struct intel_engine_cs *engine,
>
>  static void execlists_reset_finish(struct intel_engine_cs *engine)
>  {
> +	struct intel_engine_execlists * const execlists = &engine->execlists;
> +
> +	/* After a GPU reset, we may have requests to replay */
> +	if (execlists->first)
> +		tasklet_schedule(&execlists->tasklet);
> +
>  	/*
>  	 * Flush the tasklet while we still have the forcewake to be sure
>  	 * that it is not allowed to sleep before we restart and reload a
> @@ -2015,7 +2016,7 @@ static void execlists_reset_finish(struct intel_engine_cs *engine)
>  	 * serialising multiple attempts to reset so that we know that we
>  	 * are the only one manipulating tasklet state.
>  	 */
> -	__tasklet_enable_sync_once(&engine->execlists.tasklet);
> +	__tasklet_enable_sync_once(&execlists->tasklet);
>
>  	GEM_TRACE("%s\n", engine->name);
>  }
>
Quoting Tvrtko Ursulin (2018-06-04 16:17:47)
>
> On 04/06/2018 08:34, Chris Wilson wrote:
> > In the unlikely case where we have failed to keep submitting to the GPU,
> > we end up with the ELSP queue empty but a pending queue of requests.
>
> How does this happen? We have nothing in ports but a queue of requests,
> but we managed to declare a GPU hang, even though there is nothing in
> ports so GPU looks idle from the outside?

Driver bug. A buggy driver is no excuse for us to fail to recover
though.
-Chris
Chris Wilson <chris@chris-wilson.co.uk> writes:

> In the unlikely case where we have failed to keep submitting to the GPU,
> we end up with the ELSP queue empty but a pending queue of requests.
> Here, we skip the per-engine reset as there is no guilty request, but in
> doing so we also skip the engine restart leaving ourselves with a
> permanently hung engine. A quick way to recover is by moving the tasklet
> kick to execlists_reset_finish() (from init_hw). We still emit the error
> on hanging, so the error is not lost but we should be able to recover.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Michel Thierry <michel.thierry@intel.com>

Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

> ---
>  drivers/gpu/drm/i915/intel_lrc.c | 13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 8d912d0c8fc1..c8d9b5aed94a 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1803,7 +1803,6 @@ static bool unexpected_starting_state(struct intel_engine_cs *engine)
>
>  static int gen8_init_common_ring(struct intel_engine_cs *engine)
>  {
> -	struct intel_engine_execlists * const execlists = &engine->execlists;
>  	int ret;
>
>  	ret = intel_mocs_init_engine(engine);
> @@ -1821,10 +1820,6 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine)
>
>  	enable_execlists(engine);
>
> -	/* After a GPU reset, we may have requests to replay */
> -	if (execlists->first)
> -		tasklet_schedule(&execlists->tasklet);
> -
>  	return 0;
>  }
>
> @@ -2006,6 +2001,12 @@ static void execlists_reset(struct intel_engine_cs *engine,
>
>  static void execlists_reset_finish(struct intel_engine_cs *engine)
>  {
> +	struct intel_engine_execlists * const execlists = &engine->execlists;
> +
> +	/* After a GPU reset, we may have requests to replay */
> +	if (execlists->first)
> +		tasklet_schedule(&execlists->tasklet);
> +
>  	/*
>  	 * Flush the tasklet while we still have the forcewake to be sure
>  	 * that it is not allowed to sleep before we restart and reload a
> @@ -2015,7 +2016,7 @@ static void execlists_reset_finish(struct intel_engine_cs *engine)
>  	 * serialising multiple attempts to reset so that we know that we
>  	 * are the only one manipulating tasklet state.
>  	 */
> -	__tasklet_enable_sync_once(&engine->execlists.tasklet);
> +	__tasklet_enable_sync_once(&execlists->tasklet);
>
>  	GEM_TRACE("%s\n", engine->name);
>  }
> --
> 2.17.1
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Quoting Mika Kuoppala (2018-06-14 16:48:48)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
>
> > In the unlikely case where we have failed to keep submitting to the GPU,
> > we end up with the ELSP queue empty but a pending queue of requests.
> > Here, we skip the per-engine reset as there is no guilty request, but in
> > doing so we also skip the engine restart leaving ourselves with a
> > permanently hung engine. A quick way to recover is by moving the tasklet
> > kick to execlists_reset_finish() (from init_hw). We still emit the error
> > on hanging, so the error is not lost but we should be able to recover.
> >
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: Michel Thierry <michel.thierry@intel.com>
>
> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

Thanks for the review, pushed.
-Chris
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 8d912d0c8fc1..c8d9b5aed94a 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1803,7 +1803,6 @@ static bool unexpected_starting_state(struct intel_engine_cs *engine)
 
 static int gen8_init_common_ring(struct intel_engine_cs *engine)
 {
-	struct intel_engine_execlists * const execlists = &engine->execlists;
 	int ret;
 
 	ret = intel_mocs_init_engine(engine);
@@ -1821,10 +1820,6 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine)
 
 	enable_execlists(engine);
 
-	/* After a GPU reset, we may have requests to replay */
-	if (execlists->first)
-		tasklet_schedule(&execlists->tasklet);
-
 	return 0;
 }
 
@@ -2006,6 +2001,12 @@ static void execlists_reset(struct intel_engine_cs *engine,
 
 static void execlists_reset_finish(struct intel_engine_cs *engine)
 {
+	struct intel_engine_execlists * const execlists = &engine->execlists;
+
+	/* After a GPU reset, we may have requests to replay */
+	if (execlists->first)
+		tasklet_schedule(&execlists->tasklet);
+
 	/*
 	 * Flush the tasklet while we still have the forcewake to be sure
 	 * that it is not allowed to sleep before we restart and reload a
@@ -2015,7 +2016,7 @@ static void execlists_reset_finish(struct intel_engine_cs *engine)
 	 * serialising multiple attempts to reset so that we know that we
 	 * are the only one manipulating tasklet state.
 	 */
-	__tasklet_enable_sync_once(&engine->execlists.tasklet);
+	__tasklet_enable_sync_once(&execlists->tasklet);
 
 	GEM_TRACE("%s\n", engine->name);
 }
In the unlikely case where we have failed to keep submitting to the GPU,
we end up with the ELSP queue empty but a pending queue of requests.
Here, we skip the per-engine reset as there is no guilty request, but in
doing so we also skip the engine restart leaving ourselves with a
permanently hung engine. A quick way to recover is by moving the tasklet
kick to execlists_reset_finish() (from init_hw). We still emit the error
on hanging, so the error is not lost but we should be able to recover.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)
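
The control-flow problem described in the commit message can be illustrated with a small userspace model. This is only a sketch, not driver code: `struct toy_engine`, `toy_kick()`, `toy_init_hw()`, `toy_reset_finish()` and `toy_reset_engine()` are hypothetical names invented for this example. It assumes, as the commit message states, that the init_hw path is skipped when no request is in the ports, while the reset-finish step always runs.

```c
#include <stdbool.h>

/* Toy stand-in for engine state; NOT the real struct intel_engine_cs. */
struct toy_engine {
	bool port_busy;      /* a request currently occupies the ELSP ports */
	int queued;          /* requests waiting in the pending queue */
	bool tasklet_kicked; /* submission tasklet has been scheduled */
};

/* Models "if (execlists->first) tasklet_schedule(...)". */
static void toy_kick(struct toy_engine *e)
{
	if (e->queued > 0)
		e->tasklet_kicked = true;
}

/* Before the patch, the kick lived in the init_hw path. */
static void toy_init_hw(struct toy_engine *e, bool kick_here)
{
	if (kick_here)
		toy_kick(e);
}

/* After the patch, the kick lives in the reset-finish path. */
static void toy_reset_finish(struct toy_engine *e, bool kick_here)
{
	if (kick_here)
		toy_kick(e);
}

/*
 * Assumed flow: the engine is only reset and restarted (init_hw) when a
 * guilty request sits in the ports; reset_finish runs unconditionally.
 * Returns true if the submission tasklet was kicked.
 */
static bool toy_reset_engine(struct toy_engine *e, bool patched)
{
	e->tasklet_kicked = false;

	if (e->port_busy) {
		e->port_busy = false;     /* the hung request is cleared */
		toy_init_hw(e, !patched); /* engine restart */
	}

	toy_reset_finish(e, patched);
	return e->tasklet_kicked;
}
```

With empty ports and a non-empty queue, `toy_reset_engine(&e, false)` returns false (the queue is never resubmitted, which models the permanently hung engine), while `toy_reset_engine(&e, true)` returns true. When a request is in the ports, both variants kick the tasklet, so in this model the move only changes behaviour in the empty-ports case.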