diff mbox series

drm/i915/gt: Restrict forced preemption to the active context

Message ID 20220921135258.1714873-1-andrzej.hajda@intel.com (mailing list archive)
State New, archived
Headers show
Series drm/i915/gt: Restrict forced preemption to the active context | expand

Commit Message

Andrzej Hajda Sept. 21, 2022, 1:52 p.m. UTC
From: Chris Wilson <chris.p.wilson@intel.com>

When we submit a new pair of contexts to ELSP for execution, we start a
timer by which point we expect the HW to have switched execution to the
pending contexts. If the promotion to the new pair of contexts has not
occurred, we declare the executing context to have hung and force the
preemption to take place by resetting the engine and resubmitting the
new contexts.

This can lead to an unfair situation where almost all of the preemption
timeout is consumed by the first context which just switches into the
second context immediately prior to the timer firing and triggering the
preemption reset (assuming that the timer interrupts before we process
the CS events for the context switch). The second context hasn't yet had
a chance to yield to the incoming ELSP (and send the ACk for the
promotion) and so ends up being blamed for the reset.

If we see that a context switch has occurred since setting the
preemption timeout, but have not yet received the ACK for the ELSP
promotion, rearm the preemption timer and check again. This is
especially significant if the first context was not schedulable and so
we used the shortest timer possible, greatly increasing the chance of
accidentally blaming the second innocent context.

Fixes: 3a7a92aba8fb ("drm/i915/execlists: Force preemption")
Fixes: d12acee84ffb ("drm/i915/execlists: Cancel banned contexts on schedule-out")
Reported-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Andi Shyti <andi.shyti@linux.intel.com>
Reviewed-by: Andrzej Hajda <andrzej.hajda@intel.com>
Tested-by: Andrzej Hajda <andrzej.hajda@intel.com>
Cc: <stable@vger.kernel.org> # v5.5+
---
Hi,

This patch is upstreamed from internal branch. So I have removed
R-B by Andi. Andi let me know if your R-B still apply.

Regards
Andrzej
---
 drivers/gpu/drm/i915/gt/intel_engine_types.h  | 15 +++++++++++++
 .../drm/i915/gt/intel_execlists_submission.c  | 21 ++++++++++++++++++-
 2 files changed, 35 insertions(+), 1 deletion(-)

Comments

Andi Shyti Sept. 21, 2022, 3:43 p.m. UTC | #1
Hi Andrzej and Chris,

On Wed, Sep 21, 2022 at 03:52:58PM +0200, Andrzej Hajda wrote:
> From: Chris Wilson <chris.p.wilson@intel.com>
> 
> When we submit a new pair of contexts to ELSP for execution, we start a
> timer by which point we expect the HW to have switched execution to the
> pending contexts. If the promotion to the new pair of contexts has not
> occurred, we declare the executing context to have hung and force the
> preemption to take place by resetting the engine and resubmitting the
> new contexts.
> 
> This can lead to an unfair situation where almost all of the preemption
> timeout is consumed by the first context which just switches into the
> second context immediately prior to the timer firing and triggering the
> preemption reset (assuming that the timer interrupts before we process
> the CS events for the context switch). The second context hasn't yet had
> a chance to yield to the incoming ELSP (and send the ACk for the
> promotion) and so ends up being blamed for the reset.
> 
> If we see that a context switch has occurred since setting the
> preemption timeout, but have not yet received the ACK for the ELSP
> promotion, rearm the preemption timer and check again. This is
> especially significant if the first context was not schedulable and so
> we used the shortest timer possible, greatly increasing the chance of
> accidentally blaming the second innocent context.
> 
> Fixes: 3a7a92aba8fb ("drm/i915/execlists: Force preemption")
> Fixes: d12acee84ffb ("drm/i915/execlists: Cancel banned contexts on schedule-out")
> Reported-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Andi Shyti <andi.shyti@linux.intel.com>
> Reviewed-by: Andrzej Hajda <andrzej.hajda@intel.com>
> Tested-by: Andrzej Hajda <andrzej.hajda@intel.com>
> Cc: <stable@vger.kernel.org> # v5.5+
> ---
> Hi,
> 
> This patch is upstreamed from internal branch. So I have removed
> R-B by Andi. Andi let me know if your R-B still apply.

yes, I know this patch and my r-b holds:

Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>

Anyway, thanks Chris for the comments and the clear explanation
both in the commit log and in between the code.

Andi

> Regards
> Andrzej
> ---
>  drivers/gpu/drm/i915/gt/intel_engine_types.h  | 15 +++++++++++++
>  .../drm/i915/gt/intel_execlists_submission.c  | 21 ++++++++++++++++++-
>  2 files changed, 35 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> index 633a7e5dba3b4b..6b5d4ea22b673b 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> @@ -165,6 +165,21 @@ struct intel_engine_execlists {
>  	 */
>  	struct timer_list preempt;
>  
> +	/**
> +	 * @preempt_target: active request at the time of the preemption request
> +	 *
> +	 * We force a preemption to occur if the pending contexts have not
> +	 * been promoted to active upon receipt of the CS ack event within
> +	 * the timeout. This timeout maybe chosen based on the target,
> +	 * using a very short timeout if the context is no longer schedulable.
> +	 * That short timeout may not be applicable to other contexts, so
> +	 * if a context switch should happen within before the preemption
> +	 * timeout, we may shoot early at an innocent context. To prevent this,
> +	 * we record which context was active at the time of the preemption
> +	 * request and only reset that context upon the timeout.
> +	 */
> +	const struct i915_request *preempt_target;
> +
>  	/**
>  	 * @ccid: identifier for contexts submitted to this engine
>  	 */
> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> index 4b909cb88cdfb7..c718e6dc40b515 100644
> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> @@ -1241,6 +1241,9 @@ static unsigned long active_preempt_timeout(struct intel_engine_cs *engine,
>  	if (!rq)
>  		return 0;
>  
> +	/* Only allow ourselves to force reset the currently active context */
> +	engine->execlists.preempt_target = rq;
> +
>  	/* Force a fast reset for terminated contexts (ignoring sysfs!) */
>  	if (unlikely(intel_context_is_banned(rq->context) || bad_request(rq)))
>  		return INTEL_CONTEXT_BANNED_PREEMPT_TIMEOUT_MS;
> @@ -2427,8 +2430,24 @@ static void execlists_submission_tasklet(struct tasklet_struct *t)
>  	GEM_BUG_ON(inactive - post > ARRAY_SIZE(post));
>  
>  	if (unlikely(preempt_timeout(engine))) {
> +		const struct i915_request *rq = *engine->execlists.active;
> +
> +		/*
> +		 * If after the preempt-timeout expired, we are still on the
> +		 * same active request/context as before we initiated the
> +		 * preemption, reset the engine.
> +		 *
> +		 * However, if we have processed a CS event to switch contexts,
> +		 * but not yet processed the CS event for the pending
> +		 * preemption, reset the timer allowing the new context to
> +		 * gracefully exit.
> +		 */
>  		cancel_timer(&engine->execlists.preempt);
> -		engine->execlists.error_interrupt |= ERROR_PREEMPT;
> +		if (rq == engine->execlists.preempt_target)
> +			engine->execlists.error_interrupt |= ERROR_PREEMPT;
> +		else
> +			set_timer_ms(&engine->execlists.preempt,
> +				     active_preempt_timeout(engine, rq));
>  	}
>  
>  	if (unlikely(READ_ONCE(engine->execlists.error_interrupt))) {
> -- 
> 2.34.1
Andi Shyti Sept. 26, 2022, 11:26 a.m. UTC | #2
Hi Andrzej and Chris,

pushed to drm-intel-gt-next

Thanks!
Andi

On Wed, Sep 21, 2022 at 03:52:58PM +0200, Andrzej Hajda wrote:
> From: Chris Wilson <chris.p.wilson@intel.com>
> 
> When we submit a new pair of contexts to ELSP for execution, we start a
> timer by which point we expect the HW to have switched execution to the
> pending contexts. If the promotion to the new pair of contexts has not
> occurred, we declare the executing context to have hung and force the
> preemption to take place by resetting the engine and resubmitting the
> new contexts.
> 
> This can lead to an unfair situation where almost all of the preemption
> timeout is consumed by the first context which just switches into the
> second context immediately prior to the timer firing and triggering the
> preemption reset (assuming that the timer interrupts before we process
> the CS events for the context switch). The second context hasn't yet had
> a chance to yield to the incoming ELSP (and send the ACk for the
> promotion) and so ends up being blamed for the reset.
> 
> If we see that a context switch has occurred since setting the
> preemption timeout, but have not yet received the ACK for the ELSP
> promotion, rearm the preemption timer and check again. This is
> especially significant if the first context was not schedulable and so
> we used the shortest timer possible, greatly increasing the chance of
> accidentally blaming the second innocent context.
> 
> Fixes: 3a7a92aba8fb ("drm/i915/execlists: Force preemption")
> Fixes: d12acee84ffb ("drm/i915/execlists: Cancel banned contexts on schedule-out")
> Reported-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Andi Shyti <andi.shyti@linux.intel.com>
> Reviewed-by: Andrzej Hajda <andrzej.hajda@intel.com>
> Tested-by: Andrzej Hajda <andrzej.hajda@intel.com>
> Cc: <stable@vger.kernel.org> # v5.5+
> ---
> Hi,
> 
> This patch is upstreamed from internal branch. So I have removed
> R-B by Andi. Andi let me know if your R-B still apply.
> 
> Regards
> Andrzej
> ---
>  drivers/gpu/drm/i915/gt/intel_engine_types.h  | 15 +++++++++++++
>  .../drm/i915/gt/intel_execlists_submission.c  | 21 ++++++++++++++++++-
>  2 files changed, 35 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> index 633a7e5dba3b4b..6b5d4ea22b673b 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> @@ -165,6 +165,21 @@ struct intel_engine_execlists {
>  	 */
>  	struct timer_list preempt;
>  
> +	/**
> +	 * @preempt_target: active request at the time of the preemption request
> +	 *
> +	 * We force a preemption to occur if the pending contexts have not
> +	 * been promoted to active upon receipt of the CS ack event within
> +	 * the timeout. This timeout maybe chosen based on the target,
> +	 * using a very short timeout if the context is no longer schedulable.
> +	 * That short timeout may not be applicable to other contexts, so
> +	 * if a context switch should happen within before the preemption
> +	 * timeout, we may shoot early at an innocent context. To prevent this,
> +	 * we record which context was active at the time of the preemption
> +	 * request and only reset that context upon the timeout.
> +	 */
> +	const struct i915_request *preempt_target;
> +
>  	/**
>  	 * @ccid: identifier for contexts submitted to this engine
>  	 */
> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> index 4b909cb88cdfb7..c718e6dc40b515 100644
> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> @@ -1241,6 +1241,9 @@ static unsigned long active_preempt_timeout(struct intel_engine_cs *engine,
>  	if (!rq)
>  		return 0;
>  
> +	/* Only allow ourselves to force reset the currently active context */
> +	engine->execlists.preempt_target = rq;
> +
>  	/* Force a fast reset for terminated contexts (ignoring sysfs!) */
>  	if (unlikely(intel_context_is_banned(rq->context) || bad_request(rq)))
>  		return INTEL_CONTEXT_BANNED_PREEMPT_TIMEOUT_MS;
> @@ -2427,8 +2430,24 @@ static void execlists_submission_tasklet(struct tasklet_struct *t)
>  	GEM_BUG_ON(inactive - post > ARRAY_SIZE(post));
>  
>  	if (unlikely(preempt_timeout(engine))) {
> +		const struct i915_request *rq = *engine->execlists.active;
> +
> +		/*
> +		 * If after the preempt-timeout expired, we are still on the
> +		 * same active request/context as before we initiated the
> +		 * preemption, reset the engine.
> +		 *
> +		 * However, if we have processed a CS event to switch contexts,
> +		 * but not yet processed the CS event for the pending
> +		 * preemption, reset the timer allowing the new context to
> +		 * gracefully exit.
> +		 */
>  		cancel_timer(&engine->execlists.preempt);
> -		engine->execlists.error_interrupt |= ERROR_PREEMPT;
> +		if (rq == engine->execlists.preempt_target)
> +			engine->execlists.error_interrupt |= ERROR_PREEMPT;
> +		else
> +			set_timer_ms(&engine->execlists.preempt,
> +				     active_preempt_timeout(engine, rq));
>  	}
>  
>  	if (unlikely(READ_ONCE(engine->execlists.error_interrupt))) {
> -- 
> 2.34.1
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index 633a7e5dba3b4b..6b5d4ea22b673b 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -165,6 +165,21 @@  struct intel_engine_execlists {
 	 */
 	struct timer_list preempt;
 
+	/**
+	 * @preempt_target: active request at the time of the preemption request
+	 *
+	 * We force a preemption to occur if the pending contexts have not
+	 * been promoted to active upon receipt of the CS ack event within
+	 * the timeout. This timeout maybe chosen based on the target,
+	 * using a very short timeout if the context is no longer schedulable.
+	 * That short timeout may not be applicable to other contexts, so
+	 * if a context switch should happen within before the preemption
+	 * timeout, we may shoot early at an innocent context. To prevent this,
+	 * we record which context was active at the time of the preemption
+	 * request and only reset that context upon the timeout.
+	 */
+	const struct i915_request *preempt_target;
+
 	/**
 	 * @ccid: identifier for contexts submitted to this engine
 	 */
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index 4b909cb88cdfb7..c718e6dc40b515 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -1241,6 +1241,9 @@  static unsigned long active_preempt_timeout(struct intel_engine_cs *engine,
 	if (!rq)
 		return 0;
 
+	/* Only allow ourselves to force reset the currently active context */
+	engine->execlists.preempt_target = rq;
+
 	/* Force a fast reset for terminated contexts (ignoring sysfs!) */
 	if (unlikely(intel_context_is_banned(rq->context) || bad_request(rq)))
 		return INTEL_CONTEXT_BANNED_PREEMPT_TIMEOUT_MS;
@@ -2427,8 +2430,24 @@  static void execlists_submission_tasklet(struct tasklet_struct *t)
 	GEM_BUG_ON(inactive - post > ARRAY_SIZE(post));
 
 	if (unlikely(preempt_timeout(engine))) {
+		const struct i915_request *rq = *engine->execlists.active;
+
+		/*
+		 * If after the preempt-timeout expired, we are still on the
+		 * same active request/context as before we initiated the
+		 * preemption, reset the engine.
+		 *
+		 * However, if we have processed a CS event to switch contexts,
+		 * but not yet processed the CS event for the pending
+		 * preemption, reset the timer allowing the new context to
+		 * gracefully exit.
+		 */
 		cancel_timer(&engine->execlists.preempt);
-		engine->execlists.error_interrupt |= ERROR_PREEMPT;
+		if (rq == engine->execlists.preempt_target)
+			engine->execlists.error_interrupt |= ERROR_PREEMPT;
+		else
+			set_timer_ms(&engine->execlists.preempt,
+				     active_preempt_timeout(engine, rq));
 	}
 
 	if (unlikely(READ_ONCE(engine->execlists.error_interrupt))) {