[07/25] drm/i915: Cancel context if it hangs after it is closed

Message ID: 20191110185806.17413-7-chris@chris-wilson.co.uk
State: New, archived
Series: [01/25] drm/i915: Protect context while grabbing its name for the request

Commit Message

Chris Wilson Nov. 10, 2019, 6:57 p.m. UTC
If we detect a hang in a closed context, just flush all of its requests
and cancel any remaining execution along the context. Note that after
closing the context, the last reference to the context may be dropped,
leaving it only valid under RCU.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/intel_reset.c | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

Mika Kuoppala Nov. 11, 2019, 10:54 a.m. UTC | #1
Chris Wilson <chris@chris-wilson.co.uk> writes:

> If we detect a hang in a closed context, just flush all of its requests
> and cancel any remaining execution along the context. Note that after
> closing the context, the last reference to the context may be dropped,
> leaving it only valid under RCU.

Sounds good. But is there a window for userspace to start
seeing -EIO if it resubmits to a closed context?

In other words, after userspace does gem_ctx_destroy(ctx_handle),
would we return -EINVAL due to the ctx_handle being stale
before we check the banned status and return -EIO?

-Mika

>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  drivers/gpu/drm/i915/gt/intel_reset.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> index f03e000051c1..a6b0d00c3a51 100644
> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> @@ -81,6 +81,11 @@ static bool context_mark_guilty(struct i915_gem_context *ctx)
>  	bool banned;
>  	int i;
>  
> +	if (i915_gem_context_is_closed(ctx)) {
> +		i915_gem_context_set_banned(ctx);
> +		return true;
> +	}
> +
>  	atomic_inc(&ctx->guilty_count);
>  
>  	/* Cool contexts are too cool to be banned! (Used for reset testing.) */
> @@ -124,6 +129,7 @@ void __i915_request_reset(struct i915_request *rq, bool guilty)
>  
>  	GEM_BUG_ON(i915_request_completed(rq));
>  
> +	rcu_read_lock(); /* protect the GEM context */
>  	if (guilty) {
>  		i915_request_skip(rq, -EIO);
>  		if (context_mark_guilty(rq->gem_context))
> @@ -132,6 +138,7 @@ void __i915_request_reset(struct i915_request *rq, bool guilty)
>  		dma_fence_set_error(&rq->fence, -EAGAIN);
>  		context_mark_innocent(rq->gem_context);
>  	}
> +	rcu_read_unlock();
>  }
>  
>  static bool i915_in_reset(struct pci_dev *pdev)
> -- 
> 2.24.0
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Chris Wilson Nov. 11, 2019, 11:04 a.m. UTC | #2
Quoting Mika Kuoppala (2019-11-11 10:54:14)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > If we detect a hang in a closed context, just flush all of its requests
> > and cancel any remaining execution along the context. Note that after
> > closing the context, the last reference to the context may be dropped,
> > leaving it only valid under RCU.
> 
> Sounds good. But is there a window for userspace to start
> seeing -EIO if it resubmits to a closed context?

Userspace cannot submit to a closed context (it gets -ENOENT), as that
would be tantamount to a use-after-free kernel bug.
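
Roughly, the lookup goes like this (a sketch in the spirit of
i915_gem_context_lookup(), not a verbatim copy):

	static struct i915_gem_context *
	lookup_ctx(struct drm_i915_file_private *file_priv, u32 id)
	{
		struct i915_gem_context *ctx;

		rcu_read_lock();
		ctx = idr_find(&file_priv->context_idr, id);
		if (ctx && !kref_get_unless_zero(&ctx->ref))
			ctx = NULL; /* final unreference in flight */
		rcu_read_unlock();

		return ctx; /* NULL becomes -ENOENT in execbuf */
	}

So a stale handle fails the lookup with -ENOENT before we ever get as
far as checking the banned status.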
 
> In other words, after userspace does gem_ctx_destroy(ctx_handle),
> would we return -EINVAL due to the ctx_handle being stale
> before we check the banned status and return -EIO?

It's as simple as this: once the context is closed, it is removed from
the file->context_idr and userspace can no longer access it. If userspace
is racing with itself, there's not much we can do other than protect our
own references. If userspace succeeds in submitting to the context prior
to closing it in another thread, it has the context to continue with (and
if it then hangs, it will be shot down immediately). If it loses that
race, it gets -ENOENT. If it loses that race so badly that the context id
is replaced by a new context, it submits to that new context; which will
surely end in tears and GPU hangs, but that is not our fault and there is
nothing we can do to prevent it.
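
To illustrate the race (a hypothetical userspace sketch; submit_batch()
is a stand-in wrapper around DRM_IOCTL_I915_GEM_EXECBUFFER2 with
eb.rsvd1 = ctx as the context handle):

	/* Thread A */
	err = submit_batch(fd, ctx);
	/*
	 * err == 0      : won the race; if the batch then hangs, the
	 *                 closed context is shot down immediately
	 * err == -ENOENT: lost the race, the handle is already gone
	 * err == 0      : lost so badly the id was reused -- the batch
	 *                 runs on the new context, tears to follow
	 */

	/* Thread B, concurrently */
	struct drm_i915_gem_context_destroy destroy = { .ctx_id = ctx };
	ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_DESTROY, &destroy);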
-Chris
Mika Kuoppala Nov. 11, 2019, 11:25 a.m. UTC | #3
Chris Wilson <chris@chris-wilson.co.uk> writes:

> Quoting Mika Kuoppala (2019-11-11 10:54:14)
>> Chris Wilson <chris@chris-wilson.co.uk> writes:
>> 
>> > If we detect a hang in a closed context, just flush all of its requests
>> > and cancel any remaining execution along the context. Note that after
>> > closing the context, the last reference to the context may be dropped,
>> > leaving it only valid under RCU.
>> 
>> Sounds good. But is there a window for userspace to start
>> seeing -EIO if it resubmits to a closed context?
>
> Userspace cannot submit to a closed context (it gets -ENOENT), as that
> would be tantamount to a use-after-free kernel bug.
>  
>> In other words, after userspace does gem_ctx_destroy(ctx_handle),
>> would we return -EINVAL due to the ctx_handle being stale
>> before we check the banned status and return -EIO?
>
> It's as simple as this: once the context is closed, it is removed from
> the file->context_idr and userspace can no longer access it. If userspace
> is racing with itself, there's not much we can do other than protect our
> own references. If userspace succeeds in submitting to the context prior
> to closing it in another thread, it has the context to continue with (and
> if it then hangs, it will be shot down immediately). If it loses that
> race, it gets -ENOENT. If it loses that race so badly that the context id
> is replaced by a new context, it submits to that new context; which will
> surely end in tears and GPU hangs, but that is not our fault and there is
> nothing we can do to prevent it.

Let them shed tears if they bring it on themselves. I was concerned
about a behavioural change in the close/resubmit race. But as you
explained, if they race onto a different id, they deserve what they
begged for.

We are in the business of protecting the state of all the sane
ones.

Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

Patch

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
index f03e000051c1..a6b0d00c3a51 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -81,6 +81,11 @@ static bool context_mark_guilty(struct i915_gem_context *ctx)
 	bool banned;
 	int i;
 
+	if (i915_gem_context_is_closed(ctx)) {
+		i915_gem_context_set_banned(ctx);
+		return true;
+	}
+
 	atomic_inc(&ctx->guilty_count);
 
 	/* Cool contexts are too cool to be banned! (Used for reset testing.) */
@@ -124,6 +129,7 @@ void __i915_request_reset(struct i915_request *rq, bool guilty)
 
 	GEM_BUG_ON(i915_request_completed(rq));
 
+	rcu_read_lock(); /* protect the GEM context */
 	if (guilty) {
 		i915_request_skip(rq, -EIO);
 		if (context_mark_guilty(rq->gem_context))
@@ -132,6 +138,7 @@  void __i915_request_reset(struct i915_request *rq, bool guilty)
 		dma_fence_set_error(&rq->fence, -EAGAIN);
 		context_mark_innocent(rq->gem_context);
 	}
+	rcu_read_unlock();
 }
 
 static bool i915_in_reset(struct pci_dev *pdev)