Message ID | 20191110185806.17413-7-chris@chris-wilson.co.uk (mailing list archive)
---|---
State | New, archived
Series | [01/25] drm/i915: Protect context while grabbing its name for the request
Chris Wilson <chris@chris-wilson.co.uk> writes:

> If we detect a hang in a closed context, just flush all of its requests
> and cancel any remaining execution along the context. Note that after
> closing the context, the last reference to the context may be dropped,
> leaving it only valid under RCU.

Sounds good. But is there a window for userspace to start to see -EIO if
it resubmits to a closed context?

In other words, after userspace does gem_ctx_destroy(ctx_handle), would
we return -EINVAL due to ctx_handle being stale earlier than we check
for banned status and return -EIO?

-Mika

>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  drivers/gpu/drm/i915/gt/intel_reset.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> index f03e000051c1..a6b0d00c3a51 100644
> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> @@ -81,6 +81,11 @@ static bool context_mark_guilty(struct i915_gem_context *ctx)
>  	bool banned;
>  	int i;
>
> +	if (i915_gem_context_is_closed(ctx)) {
> +		i915_gem_context_set_banned(ctx);
> +		return true;
> +	}
> +
>  	atomic_inc(&ctx->guilty_count);
>
>  	/* Cool contexts are too cool to be banned! (Used for reset testing.) */
> @@ -124,6 +129,7 @@ void __i915_request_reset(struct i915_request *rq, bool guilty)
>
>  	GEM_BUG_ON(i915_request_completed(rq));
>
> +	rcu_read_lock(); /* protect the GEM context */
>  	if (guilty) {
>  		i915_request_skip(rq, -EIO);
>  		if (context_mark_guilty(rq->gem_context))
> @@ -132,6 +138,7 @@ void __i915_request_reset(struct i915_request *rq, bool guilty)
>  		dma_fence_set_error(&rq->fence, -EAGAIN);
>  		context_mark_innocent(rq->gem_context);
>  	}
> +	rcu_read_unlock();
>  }
>
>  static bool i915_in_reset(struct pci_dev *pdev)
> --
> 2.24.0
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Quoting Mika Kuoppala (2019-11-11 10:54:14)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
>
> > If we detect a hang in a closed context, just flush all of its requests
> > and cancel any remaining execution along the context. Note that after
> > closing the context, the last reference to the context may be dropped,
> > leaving it only valid under RCU.
>
> Sounds good. But is there a window for userspace to start
> to see -EIO if it resubmits to a closed context?

Userspace cannot submit to a closed context (-ENOENT), as that would be
tantamount to a use-after-free kernel bug.

> In other words, after userspace does gem_ctx_destroy(ctx_handle),
> would we return -EINVAL due to ctx_handle being stale
> earlier than we check for banned status and return -EIO?

It's as simple as: once the context is closed, it is removed from the
file->context_idr and userspace cannot access it. If userspace is racing
with itself, there's not much we can do other than protect our
references. If userspace succeeds in submitting to the context prior to
closing it in another thread, it has the context to continue (and if it
then hangs, it will be shot down immediately). If it loses that race, it
gets an -ENOENT. If it loses that race so badly that the context id is
replaced by a new context, it submits to that new context; which will
surely end in tears and GPU hangs, but that is not our fault and there
is nothing we can do to prevent it.

-Chris
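The lookup semantics Chris describes can be illustrated with a small
standalone model (the struct and helper names below are illustrative
stand-ins, not the real i915 code): closing a context unlinks its handle
from the per-file table first, so any later lookup on the stale handle
fails with -ENOENT before the banned/-EIO check is ever reached.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; names are hypothetical. */
struct toy_context {
	int closed;
	int banned;
};

#define MAX_HANDLES 8

/* Models file->context_idr: handle -> context, NULL when unused. */
static struct toy_context *context_idr[MAX_HANDLES];

/*
 * Models the submit-time lookup: a closed context has already been
 * removed from the table, so the caller sees -ENOENT and never reaches
 * the banned check that would return -EIO.
 */
static int lookup_context(int handle, struct toy_context **out)
{
	if (handle < 0 || handle >= MAX_HANDLES || !context_idr[handle])
		return -ENOENT;
	*out = context_idr[handle];
	if ((*out)->banned)
		return -EIO;
	return 0;
}

/* Models gem_ctx_destroy(): unlink the handle, then close the context. */
static void destroy_context(int handle)
{
	struct toy_context *ctx = context_idr[handle];

	context_idr[handle] = NULL;	/* userspace loses access here... */
	ctx->closed = 1;		/* ...before the context is torn down */
}
```

Even if a hang later bans the closed context (as the patch does), a
resubmit on the stale handle fails at the table lookup with -ENOENT, so
userspace never observes the -EIO window Mika asked about.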
Chris Wilson <chris@chris-wilson.co.uk> writes:

> Quoting Mika Kuoppala (2019-11-11 10:54:14)
>> Chris Wilson <chris@chris-wilson.co.uk> writes:
>>
>> > If we detect a hang in a closed context, just flush all of its requests
>> > and cancel any remaining execution along the context. Note that after
>> > closing the context, the last reference to the context may be dropped,
>> > leaving it only valid under RCU.
>>
>> Sounds good. But is there a window for userspace to start
>> to see -EIO if it resubmits to a closed context?
>
> Userspace cannot submit to a closed context (-ENOENT), as that would be
> tantamount to a use-after-free kernel bug.
>
>> In other words, after userspace does gem_ctx_destroy(ctx_handle),
>> would we return -EINVAL due to ctx_handle being stale
>> earlier than we check for banned status and return -EIO?
>
> It's as simple as: once the context is closed, it is removed from the
> file->context_idr and userspace cannot access it. If userspace is racing
> with itself, there's not much we can do other than protect our
> references. If userspace succeeds in submitting to the context prior to
> closing it in another thread, it has the context to continue (and if it
> then hangs, it will be shot down immediately). If it loses that race, it
> gets an -ENOENT. If it loses that race so badly that the context id is
> replaced by a new context, it submits to that new context; which will
> surely end in tears and GPU hangs, but that is not our fault and there
> is nothing we can do to prevent it.

Let them shed tears if they bring it on themselves. I was concerned
about a behavioural change in the close/resubmit race, but as you
explained, racing on a different id gets them exactly what they begged
for. We are in the business of protecting the state of all the sane
ones.

Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
index f03e000051c1..a6b0d00c3a51 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -81,6 +81,11 @@ static bool context_mark_guilty(struct i915_gem_context *ctx)
 	bool banned;
 	int i;
 
+	if (i915_gem_context_is_closed(ctx)) {
+		i915_gem_context_set_banned(ctx);
+		return true;
+	}
+
 	atomic_inc(&ctx->guilty_count);
 
 	/* Cool contexts are too cool to be banned! (Used for reset testing.) */
@@ -124,6 +129,7 @@ void __i915_request_reset(struct i915_request *rq, bool guilty)
 
 	GEM_BUG_ON(i915_request_completed(rq));
 
+	rcu_read_lock(); /* protect the GEM context */
 	if (guilty) {
 		i915_request_skip(rq, -EIO);
 		if (context_mark_guilty(rq->gem_context))
@@ -132,6 +138,7 @@ void __i915_request_reset(struct i915_request *rq, bool guilty)
 		dma_fence_set_error(&rq->fence, -EAGAIN);
 		context_mark_innocent(rq->gem_context);
 	}
+	rcu_read_unlock();
 }
 
 static bool i915_in_reset(struct pci_dev *pdev)
If we detect a hang in a closed context, just flush all of its requests
and cancel any remaining execution along the context. Note that after
closing the context, the last reference to the context may be dropped,
leaving it only valid under RCU.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/intel_reset.c | 7 +++++++
 1 file changed, 7 insertions(+)
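The early-exit the patch adds to context_mark_guilty() can be exercised
in a self-contained sketch. The struct and helpers below are simplified
stand-ins for the i915 ones (no atomics, no ban heuristics), kept only
to show the control flow: a closed context is banned outright and the
function returns true so the caller cancels any remaining execution,
while an open context just accumulates guilt.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for struct i915_gem_context; names hypothetical. */
struct toy_gem_context {
	bool closed;
	bool banned;
	int guilty_count;	/* models atomic_t ctx->guilty_count */
};

static bool context_is_closed(const struct toy_gem_context *ctx)
{
	return ctx->closed;
}

static void context_set_banned(struct toy_gem_context *ctx)
{
	ctx->banned = true;
}

/*
 * Mirrors the patched control flow of context_mark_guilty(): a closed
 * context is banned unconditionally, skipping the guilty_count
 * bookkeeping; an open one is counted (the real ban threshold logic is
 * omitted for brevity).
 */
static bool toy_context_mark_guilty(struct toy_gem_context *ctx)
{
	if (context_is_closed(ctx)) {
		context_set_banned(ctx);
		return true;	/* caller cancels remaining execution */
	}

	ctx->guilty_count++;
	return ctx->banned;
}
```

Returning true here is what lets __i915_request_reset() shoot down the
rest of the closed context's queue, since no userspace handle can reach
it any more.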