diff mbox series

[3/9] drm/i915: Prevent using semaphores to chain up to external fences

Message ID 20200508092933.738-3-chris@chris-wilson.co.uk (mailing list archive)
State New, archived
Headers show
Series [1/9] drm/i915: Ignore submit-fences on the same timeline | expand

Commit Message

Chris Wilson May 8, 2020, 9:29 a.m. UTC
The downside of using semaphores is that we lose metadata passing
along the signaling chain. This is particularly nasty when we
need to pass along a fatal error such as EFAULT or EDEADLK. For
fatal errors we want to scrub the request before it is executed,
which means that we cannot preload the request onto HW and have
it wait upon a semaphore.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_request.c         | 26 +++++++++++++++++++++
 drivers/gpu/drm/i915/i915_scheduler_types.h |  1 +
 2 files changed, 27 insertions(+)

Comments

Mika Kuoppala May 8, 2020, 3:37 p.m. UTC | #1
Chris Wilson <chris@chris-wilson.co.uk> writes:

> The downside of using semaphores is that we lose metadata passing
> along the signaling chain. This is particularly nasty when we
> need to pass along a fatal error such as EFAULT or EDEADLK. For
> fatal errors we want to scrub the request before it is executed,
> which means that we cannot preload the request onto HW and have
> it wait upon a semaphore.

b is waiting on a, a fails and we want to release b with error?

>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  drivers/gpu/drm/i915/i915_request.c         | 26 +++++++++++++++++++++
>  drivers/gpu/drm/i915/i915_scheduler_types.h |  1 +
>  2 files changed, 27 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index 94189c7d43cd..f0f9393e2ade 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -1002,6 +1002,15 @@ emit_semaphore_wait(struct i915_request *to,
>  	if (!rcu_access_pointer(from->hwsp_cacheline))
>  		goto await_fence;
>  
> +	/*
> +	 * If this or its dependents are waiting on an external fence
> +	 * that may fail catastrophically, then we want to avoid using
> +	 * sempahores as they bypass the fence signaling metadata, and we

semaphore
-Mika

> +	 * lose the fence->error propagation.
> +	 */
> +	if (from->sched.flags & I915_SCHED_HAS_EXTERNAL_CHAIN)
> +		goto await_fence;
> +
>  	/* Just emit the first semaphore we see as request space is limited. */
>  	if (already_busywaiting(to) & mask)
>  		goto await_fence;
> @@ -1064,12 +1073,29 @@ i915_request_await_request(struct i915_request *to, struct i915_request *from)
>  			return ret;
>  	}
>  
> +	if (from->sched.flags & I915_SCHED_HAS_EXTERNAL_CHAIN)
> +		to->sched.flags |= I915_SCHED_HAS_EXTERNAL_CHAIN;
> +
>  	return 0;
>  }
>  
> +static void mark_external(struct i915_request *rq)
> +{
> +	/*
> +	 * The downside of using semaphores is that we lose metadata passing
> +	 * along the signaling chain. This is particularly nasty when we
> +	 * need to pass along a fatal error such as EFAULT or EDEADLK. For
> +	 * fatal errors we want to scrub the request before it is executed,
> +	 * which means that we cannot preload the request onto HW and have
> +	 * it wait upon a semaphore.
> +	 */
> +	rq->sched.flags |= I915_SCHED_HAS_EXTERNAL_CHAIN;
> +}
> +
>  static int
>  i915_request_await_external(struct i915_request *rq, struct dma_fence *fence)
>  {
> +	mark_external(rq);
>  	return i915_sw_fence_await_dma_fence(&rq->submit, fence,
>  					     fence->context ? I915_FENCE_TIMEOUT : 0,
>  					     I915_FENCE_GFP);
> diff --git a/drivers/gpu/drm/i915/i915_scheduler_types.h b/drivers/gpu/drm/i915/i915_scheduler_types.h
> index 7186875088a0..6ab2c5289bed 100644
> --- a/drivers/gpu/drm/i915/i915_scheduler_types.h
> +++ b/drivers/gpu/drm/i915/i915_scheduler_types.h
> @@ -66,6 +66,7 @@ struct i915_sched_node {
>  	struct i915_sched_attr attr;
>  	unsigned int flags;
>  #define I915_SCHED_HAS_SEMAPHORE_CHAIN	BIT(0)
> +#define I915_SCHED_HAS_EXTERNAL_CHAIN	BIT(1)
>  	intel_engine_mask_t semaphores;
>  };
>  
> -- 
> 2.20.1
Chris Wilson May 8, 2020, 3:43 p.m. UTC | #2
Quoting Mika Kuoppala (2020-05-08 16:37:15)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > The downside of using semaphores is that we lose metadata passing
> > along the signaling chain. This is particularly nasty when we
> > need to pass along a fatal error such as EFAULT or EDEADLK. For
> > fatal errors we want to scrub the request before it is executed,
> > which means that we cannot preload the request onto HW and have
> > it wait upon a semaphore.
> 
> b is waiting on a, a fails and we want to release b with error?

Yes. B is submitted before A, and if B is relying on A to setup GPU page
tables or other interesting things, we can't let B run if A dies. For
the EDEADLK if B is waiting on A, but then A is submitted with a wait on
B -- we have to untangle that mess.
-Chris
Mika Kuoppala May 8, 2020, 3:44 p.m. UTC | #3
Chris Wilson <chris@chris-wilson.co.uk> writes:

> Quoting Mika Kuoppala (2020-05-08 16:37:15)
>> Chris Wilson <chris@chris-wilson.co.uk> writes:
>> 
>> > The downside of using semaphores is that we lose metadata passing
>> > along the signaling chain. This is particularly nasty when we
>> > need to pass along a fatal error such as EFAULT or EDEADLK. For
>> > fatal errors we want to scrub the request before it is executed,
>> > which means that we cannot preload the request onto HW and have
>> > it wait upon a semaphore.
>> 
>> b is waiting on a, a fails and we want to release b with error?
>
> Yes. B is submitted before A, and if B is relying on A to setup GPU page

I guess this has to be A is before B.

> tables or other interesting things, we can't let B run if A dies. For
> the EDEADLK if B is waiting on A, but then A is submitted with a wait on
> B -- we have to untangle that mess.

Avoiding the hw semaphore makes sense

Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

> -Chris
Chris Wilson May 8, 2020, 3:56 p.m. UTC | #4
Quoting Mika Kuoppala (2020-05-08 16:44:48)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > Quoting Mika Kuoppala (2020-05-08 16:37:15)
> >> Chris Wilson <chris@chris-wilson.co.uk> writes:
> >> 
> >> > The downside of using semaphores is that we lose metadata passing
> >> > along the signaling chain. This is particularly nasty when we
> >> > need to pass along a fatal error such as EFAULT or EDEADLK. For
> >> > fatal errors we want to scrub the request before it is executed,
> >> > which means that we cannot preload the request onto HW and have
> >> > it wait upon a semaphore.
> >> 
> >> b is waiting on a, a fails and we want to release b with error?
> >
> > Yes. B is submitted before A, and if B is relying on A to setup GPU page
> 
> I guess this has to be A is before B.
> 
> > tables or other interesting things, we can't let B run if A dies. For
> > the EDEADLK if B is waiting on A, but then A is submitted with a wait on
> > B -- we have to untangle that mess.
> 
> Avoiding the hw semaphore makes sense

It's an unfortunate necessity, unless we can work in a conditional jump
after the semaphore. But it also requires sampling all the possible
error signals :(
-Chris
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 94189c7d43cd..f0f9393e2ade 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1002,6 +1002,15 @@  emit_semaphore_wait(struct i915_request *to,
 	if (!rcu_access_pointer(from->hwsp_cacheline))
 		goto await_fence;
 
+	/*
+	 * If this or its dependents are waiting on an external fence
+	 * that may fail catastrophically, then we want to avoid using
+	 * sempahores as they bypass the fence signaling metadata, and we
+	 * lose the fence->error propagation.
+	 */
+	if (from->sched.flags & I915_SCHED_HAS_EXTERNAL_CHAIN)
+		goto await_fence;
+
 	/* Just emit the first semaphore we see as request space is limited. */
 	if (already_busywaiting(to) & mask)
 		goto await_fence;
@@ -1064,12 +1073,29 @@  i915_request_await_request(struct i915_request *to, struct i915_request *from)
 			return ret;
 	}
 
+	if (from->sched.flags & I915_SCHED_HAS_EXTERNAL_CHAIN)
+		to->sched.flags |= I915_SCHED_HAS_EXTERNAL_CHAIN;
+
 	return 0;
 }
 
+static void mark_external(struct i915_request *rq)
+{
+	/*
+	 * The downside of using semaphores is that we lose metadata passing
+	 * along the signaling chain. This is particularly nasty when we
+	 * need to pass along a fatal error such as EFAULT or EDEADLK. For
+	 * fatal errors we want to scrub the request before it is executed,
+	 * which means that we cannot preload the request onto HW and have
+	 * it wait upon a semaphore.
+	 */
+	rq->sched.flags |= I915_SCHED_HAS_EXTERNAL_CHAIN;
+}
+
 static int
 i915_request_await_external(struct i915_request *rq, struct dma_fence *fence)
 {
+	mark_external(rq);
 	return i915_sw_fence_await_dma_fence(&rq->submit, fence,
 					     fence->context ? I915_FENCE_TIMEOUT : 0,
 					     I915_FENCE_GFP);
diff --git a/drivers/gpu/drm/i915/i915_scheduler_types.h b/drivers/gpu/drm/i915/i915_scheduler_types.h
index 7186875088a0..6ab2c5289bed 100644
--- a/drivers/gpu/drm/i915/i915_scheduler_types.h
+++ b/drivers/gpu/drm/i915/i915_scheduler_types.h
@@ -66,6 +66,7 @@  struct i915_sched_node {
 	struct i915_sched_attr attr;
 	unsigned int flags;
 #define I915_SCHED_HAS_SEMAPHORE_CHAIN	BIT(0)
+#define I915_SCHED_HAS_EXTERNAL_CHAIN	BIT(1)
 	intel_engine_mask_t semaphores;
 };