| Message ID | 20200205121313.1834548-1-chris@chris-wilson.co.uk (mailing list archive) |
|---|---|
| State | New, archived |
| Series | [v2] drm/i915/gem: Don't leak non-persistent requests on changing engines |
On 05/02/2020 12:13, Chris Wilson wrote:
> If we have a set of active engines marked as being non-persistent, we
> lose track of those if the user replaces those engines with
> I915_CONTEXT_PARAM_ENGINES. Part of our uABI contract is that
> non-persistent requests are terminated if they are no longer being
> tracked by the user's context (in order to prevent a lost request
> causing an untracked and so unstoppable GPU hang), so we need to apply
> the same context cancellation upon changing engines.
>
> Fixes: a0e047156cde ("drm/i915/gem: Make context persistence optional")
> Testcase: XXX
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>  drivers/gpu/drm/i915/gem/i915_gem_context.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> index 52a749691a8d..20f1d3e0221f 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> @@ -1624,11 +1624,18 @@ set_engines(struct i915_gem_context *ctx,
>
>  replace:
>  	mutex_lock(&ctx->engines_mutex);
> +
> +	/* Flush stale requests off the old engines if required */
> +	if (!i915_gem_context_is_persistent(ctx) ||
> +	    !i915_modparams.enable_hangcheck)
> +		kill_context(ctx);
> +
>  	if (args->size)
>  		i915_gem_context_set_user_engines(ctx);
>  	else
>  		i915_gem_context_clear_user_engines(ctx);
>  	set.engines = rcu_replace_pointer(ctx->engines, set.engines, 1);
> +
>  	mutex_unlock(&ctx->engines_mutex);
>
>  	call_rcu(&set.engines->rcu, free_engines_rcu);

Is the negative effect of this that legitimate contexts can't keep
submitting and changing the map? Only if PREEMPT_TIMEOUT is disabled, I
think, but still. Might break legitimate userspace. Not that I offer
solutions.. :(

Banning changing engines once a context went non-persistent? That too
can break someone.

Regards,

Tvrtko
Quoting Tvrtko Ursulin (2020-02-05 16:22:34)
> On 05/02/2020 12:13, Chris Wilson wrote:
> > If we have a set of active engines marked as being non-persistent, we
> > lose track of those if the user replaces those engines with
> > I915_CONTEXT_PARAM_ENGINES.
[snip]
> > +	/* Flush stale requests off the old engines if required */
> > +	if (!i915_gem_context_is_persistent(ctx) ||
> > +	    !i915_modparams.enable_hangcheck)
> > +		kill_context(ctx);
>
> Is the negative effect of this that legitimate contexts can't keep
> submitting and changing the map? Only if PREEMPT_TIMEOUT is disabled, I
> think, but still. Might break legitimate userspace. Not that I offer
> solutions.. :(
> Banning changing engines once a context went non-persistent? That too
> can break someone.

It closes the hole we have. To do otherwise, we need to keep track of
the old engines. Not an impossible task, certainly inconvenient.

struct old_engines {
	struct i915_active active;
	struct list_head link;
	struct i915_gem_context *ctx;
	void *engines;
	int num_engines;
};

With a list+spinlock in the ctx that we can work in kill_context.

The biggest catch there is actually worrying about attaching the active
to an already executing request, and making sure the coupling doesn't
bug on a concurrent completion. Hmm, it's just a completion callback,
but more convenient to use a ready-made one.
-Chris
On 05/02/2020 16:44, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2020-02-05 16:22:34)
>> On 05/02/2020 12:13, Chris Wilson wrote:
[snip]
>
> It closes the hole we have. To do otherwise, we need to keep track of
> the old engines. Not an impossible task, certainly inconvenient.
>
> struct old_engines {
> 	struct i915_active active;
> 	struct list_head link;
> 	struct i915_gem_context *ctx;
> 	void *engines;
> 	int num_engines;
> };
>
> With a list+spinlock in the ctx that we can work in kill_context.
>
> The biggest catch there is actually worrying about attaching the active
> to an already executing request, and making sure the coupling doesn't
> bug on a concurrent completion. Hmm, it's just a completion callback,
> but more convenient to use a ready-made one.

What would you do with the old engines? We don't have a mechanism to
mark an intel_context closed. Hm, right, it would get unreachable by
definition. But how to terminate it if it doesn't play nicely?

Regards,

Tvrtko
Quoting Tvrtko Ursulin (2020-02-05 18:33:57)
> On 05/02/2020 16:44, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2020-02-05 16:22:34)
> >> On 05/02/2020 12:13, Chris Wilson wrote:
[snip]
>
> What would you do with the old engines? We don't have a mechanism to
> mark an intel_context closed. Hm, right, it would get unreachable by
> definition. But how to terminate it if it doesn't play nicely?

Wait 30 minutes (if it passes the tests) and you'll find out :)
-Chris
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 52a749691a8d..20f1d3e0221f 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -1624,11 +1624,18 @@ set_engines(struct i915_gem_context *ctx,
 
 replace:
 	mutex_lock(&ctx->engines_mutex);
+
+	/* Flush stale requests off the old engines if required */
+	if (!i915_gem_context_is_persistent(ctx) ||
+	    !i915_modparams.enable_hangcheck)
+		kill_context(ctx);
+
 	if (args->size)
 		i915_gem_context_set_user_engines(ctx);
 	else
 		i915_gem_context_clear_user_engines(ctx);
 	set.engines = rcu_replace_pointer(ctx->engines, set.engines, 1);
+
 	mutex_unlock(&ctx->engines_mutex);
 
 	call_rcu(&set.engines->rcu, free_engines_rcu);
If we have a set of active engines marked as being non-persistent, we
lose track of those if the user replaces those engines with
I915_CONTEXT_PARAM_ENGINES. Part of our uABI contract is that
non-persistent requests are terminated if they are no longer being
tracked by the user's context (in order to prevent a lost request
causing an untracked and so unstoppable GPU hang), so we need to apply
the same context cancellation upon changing engines.

Fixes: a0e047156cde ("drm/i915/gem: Make context persistence optional")
Testcase: XXX
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c | 7 +++++++
 1 file changed, 7 insertions(+)