[2/3] drm/i915/execlists: Push the tasklet kick after reset to reset_finish

Message ID 20180604073441.6737-2-chris@chris-wilson.co.uk
State New, archived

Commit Message

Chris Wilson June 4, 2018, 7:34 a.m. UTC
In the unlikely case where we have failed to keep submitting to the GPU,
we end up with the ELSP queue empty but a pending queue of requests.
Here, we skip the per-engine reset as there is no guilty request, but in
doing so we also skip the engine restart leaving ourselves with a
permanently hung engine. A quick way to recover is by moving the tasklet
kick to execlists_reset_finish() (from init_hw). We still emit the error
on hanging, so the error is not lost but we should be able to recover.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)
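
The failure mode described in the commit message can be sketched with a small toy model. This is only an illustration under stated assumptions: `toy_engine`, `toy_kick`, `toy_tasklet`, `toy_init_hw`, `toy_reset_finish` and `toy_stuck_after_reset` are hypothetical names, not i915 code, and the real reset path is considerably more involved. The model assumes that the engine restart only happens when a guilty request is found, while the reset-finish step always runs.

```c
#include <stdbool.h>

/* Toy model only: these are NOT the real i915 structures or functions. */
struct toy_engine {
	int queued;     /* requests waiting in the software queue */
	int in_ports;   /* requests already submitted to the hardware */
	bool scheduled; /* submission tasklet has been kicked */
};

/* Kick the submission tasklet if there is anything to replay. */
static void toy_kick(struct toy_engine *e)
{
	if (e->queued > 0)
		e->scheduled = true;
}

/* Submission tasklet: moves every queued request into the ports. */
static void toy_tasklet(struct toy_engine *e)
{
	if (!e->scheduled)
		return;
	e->scheduled = false;
	e->in_ports += e->queued;
	e->queued = 0;
}

/* Engine restart; in this model only reached when a guilty request exists. */
static void toy_init_hw(struct toy_engine *e, bool patched)
{
	if (!patched)
		toy_kick(e); /* old location of the kick */
}

/* Always runs at the end of a reset. */
static void toy_reset_finish(struct toy_engine *e, bool patched)
{
	if (patched)
		toy_kick(e); /* new location of the kick */
	toy_tasklet(e);      /* stands in for re-enabling the tasklet */
}

/* Returns how many requests are still stuck in the queue after a reset. */
int toy_stuck_after_reset(int queued, int in_ports, bool patched)
{
	struct toy_engine e = { .queued = queued, .in_ports = in_ports };

	if (e.in_ports > 0) {
		/* Guilty request found: drop it, requeue the rest, restart. */
		e.queued += e.in_ports - 1;
		e.in_ports = 0;
		toy_init_hw(&e, patched);
	}
	toy_reset_finish(&e, patched);
	return e.queued;
}
```

In this model, `toy_stuck_after_reset(3, 0, false)` leaves all 3 requests stuck, because the ports are empty, no guilty request is found, and the old kick is never reached. With `patched` set to `true` the same call returns 0, since the kick now sits in the step that always runs. When a guilty request is present, for example `toy_stuck_after_reset(2, 1, ...)`, both variants drain the queue.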

Comments

Tvrtko Ursulin June 4, 2018, 3:17 p.m. UTC | #1
On 04/06/2018 08:34, Chris Wilson wrote:
> In the unlikely case where we have failed to keep submitting to the GPU,
> we end up with the ELSP queue empty but a pending queue of requests.

How does this happen? We have nothing in ports but a queue of requests, 
but we managed to declare a GPU hang, even though there is nothing in 
ports so GPU looks idle from the outside?

Regards,

Tvrtko

> Here, we skip the per-engine reset as there is no guilty request, but in
> doing so we also skip the engine restart leaving ourselves with a
> permanently hung engine. A quick way to recover is by moving the tasklet
> kick to execlists_reset_finish() (from init_hw). We still emit the error
> on hanging, so the error is not lost but we should be able to recover.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Michel Thierry <michel.thierry@intel.com>
> ---
>   drivers/gpu/drm/i915/intel_lrc.c | 13 +++++++------
>   1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 8d912d0c8fc1..c8d9b5aed94a 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1803,7 +1803,6 @@ static bool unexpected_starting_state(struct intel_engine_cs *engine)
>   
>   static int gen8_init_common_ring(struct intel_engine_cs *engine)
>   {
> -	struct intel_engine_execlists * const execlists = &engine->execlists;
>   	int ret;
>   
>   	ret = intel_mocs_init_engine(engine);
> @@ -1821,10 +1820,6 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine)
>   
>   	enable_execlists(engine);
>   
> -	/* After a GPU reset, we may have requests to replay */
> -	if (execlists->first)
> -		tasklet_schedule(&execlists->tasklet);
> -
>   	return 0;
>   }
>   
> @@ -2006,6 +2001,12 @@ static void execlists_reset(struct intel_engine_cs *engine,
>   
>   static void execlists_reset_finish(struct intel_engine_cs *engine)
>   {
> +	struct intel_engine_execlists * const execlists = &engine->execlists;
> +
> +	/* After a GPU reset, we may have requests to replay */
> +	if (execlists->first)
> +		tasklet_schedule(&execlists->tasklet);
> +
>   	/*
>   	 * Flush the tasklet while we still have the forcewake to be sure
>   	 * that it is not allowed to sleep before we restart and reload a
> @@ -2015,7 +2016,7 @@ static void execlists_reset_finish(struct intel_engine_cs *engine)
>   	 * serialising multiple attempts to reset so that we know that we
>   	 * are the only one manipulating tasklet state.
>   	 */
> -	__tasklet_enable_sync_once(&engine->execlists.tasklet);
> +	__tasklet_enable_sync_once(&execlists->tasklet);
>   
>   	GEM_TRACE("%s\n", engine->name);
>   }
>

Chris Wilson June 5, 2018, 9:31 a.m. UTC | #2
Quoting Tvrtko Ursulin (2018-06-04 16:17:47)
> 
> On 04/06/2018 08:34, Chris Wilson wrote:
> > In the unlikely case where we have failed to keep submitting to the GPU,
> > we end up with the ELSP queue empty but a pending queue of requests.
> 
> How does this happen? We have nothing in ports but a queue of requests, 
> but we managed to declare a GPU hang, even though there is nothing in 
> ports so GPU looks idle from the outside?

Driver bug. A buggy driver is no excuse for us to fail to recover
though.
-Chris

Mika Kuoppala June 14, 2018, 3:48 p.m. UTC | #3
Chris Wilson <chris@chris-wilson.co.uk> writes:

> In the unlikely case where we have failed to keep submitting to the GPU,
> we end up with the ELSP queue empty but a pending queue of requests.
> Here, we skip the per-engine reset as there is no guilty request, but in
> doing so we also skip the engine restart leaving ourselves with a
> permanently hung engine. A quick way to recover is by moving the tasklet
> kick to execlists_reset_finish() (from init_hw). We still emit the error
> on hanging, so the error is not lost but we should be able to recover.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Michel Thierry <michel.thierry@intel.com>

Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

> ---
>  drivers/gpu/drm/i915/intel_lrc.c | 13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 8d912d0c8fc1..c8d9b5aed94a 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1803,7 +1803,6 @@ static bool unexpected_starting_state(struct intel_engine_cs *engine)
>  
>  static int gen8_init_common_ring(struct intel_engine_cs *engine)
>  {
> -	struct intel_engine_execlists * const execlists = &engine->execlists;
>  	int ret;
>  
>  	ret = intel_mocs_init_engine(engine);
> @@ -1821,10 +1820,6 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine)
>  
>  	enable_execlists(engine);
>  
> -	/* After a GPU reset, we may have requests to replay */
> -	if (execlists->first)
> -		tasklet_schedule(&execlists->tasklet);
> -
>  	return 0;
>  }
>  
> @@ -2006,6 +2001,12 @@ static void execlists_reset(struct intel_engine_cs *engine,
>  
>  static void execlists_reset_finish(struct intel_engine_cs *engine)
>  {
> +	struct intel_engine_execlists * const execlists = &engine->execlists;
> +
> +	/* After a GPU reset, we may have requests to replay */
> +	if (execlists->first)
> +		tasklet_schedule(&execlists->tasklet);
> +
>  	/*
>  	 * Flush the tasklet while we still have the forcewake to be sure
>  	 * that it is not allowed to sleep before we restart and reload a
> @@ -2015,7 +2016,7 @@ static void execlists_reset_finish(struct intel_engine_cs *engine)
>  	 * serialising multiple attempts to reset so that we know that we
>  	 * are the only one manipulating tasklet state.
>  	 */
> -	__tasklet_enable_sync_once(&engine->execlists.tasklet);
> +	__tasklet_enable_sync_once(&execlists->tasklet);
>  
>  	GEM_TRACE("%s\n", engine->name);
>  }
> -- 
> 2.17.1

Chris Wilson June 14, 2018, 6:38 p.m. UTC | #4
Quoting Mika Kuoppala (2018-06-14 16:48:48)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > In the unlikely case where we have failed to keep submitting to the GPU,
> > we end up with the ELSP queue empty but a pending queue of requests.
> > Here, we skip the per-engine reset as there is no guilty request, but in
> > doing so we also skip the engine restart leaving ourselves with a
> > permanently hung engine. A quick way to recover is by moving the tasklet
> > kick to execlists_reset_finish() (from init_hw). We still emit the error
> > on hanging, so the error is not lost but we should be able to recover.
> >
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: Michel Thierry <michel.thierry@intel.com>
> 
> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

Thanks for the review, pushed.
-Chris

Patch

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 8d912d0c8fc1..c8d9b5aed94a 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1803,7 +1803,6 @@  static bool unexpected_starting_state(struct intel_engine_cs *engine)
 
 static int gen8_init_common_ring(struct intel_engine_cs *engine)
 {
-	struct intel_engine_execlists * const execlists = &engine->execlists;
 	int ret;
 
 	ret = intel_mocs_init_engine(engine);
@@ -1821,10 +1820,6 @@  static int gen8_init_common_ring(struct intel_engine_cs *engine)
 
 	enable_execlists(engine);
 
-	/* After a GPU reset, we may have requests to replay */
-	if (execlists->first)
-		tasklet_schedule(&execlists->tasklet);
-
 	return 0;
 }
 
@@ -2006,6 +2001,12 @@  static void execlists_reset(struct intel_engine_cs *engine,
 
 static void execlists_reset_finish(struct intel_engine_cs *engine)
 {
+	struct intel_engine_execlists * const execlists = &engine->execlists;
+
+	/* After a GPU reset, we may have requests to replay */
+	if (execlists->first)
+		tasklet_schedule(&execlists->tasklet);
+
 	/*
 	 * Flush the tasklet while we still have the forcewake to be sure
 	 * that it is not allowed to sleep before we restart and reload a
@@ -2015,7 +2016,7 @@  static void execlists_reset_finish(struct intel_engine_cs *engine)
 	 * serialising multiple attempts to reset so that we know that we
 	 * are the only one manipulating tasklet state.
 	 */
-	__tasklet_enable_sync_once(&engine->execlists.tasklet);
+	__tasklet_enable_sync_once(&execlists->tasklet);
 
 	GEM_TRACE("%s\n", engine->name);
 }