[v2] drm/i915/gem: Don't leak non-persistent requests on changing engines

Message ID 20200205121313.1834548-1-chris@chris-wilson.co.uk (mailing list archive)
State New, archived
Series [v2] drm/i915/gem: Don't leak non-persistent requests on changing engines

Commit Message

Chris Wilson Feb. 5, 2020, 12:13 p.m. UTC
If we have a set of active engines marked as being non-persistent, we
lose track of those if the user replaces those engines with
I915_CONTEXT_PARAM_ENGINES. Since part of our uABI contract is that
non-persistent requests are terminated if they are no longer tracked
by the user's context (in order to prevent a lost request causing an
untracked, and so unstoppable, GPU hang), we need to apply the
same context cancellation upon changing engines.

Fixes: a0e047156cde ("drm/i915/gem: Make context persistence optional")
Testcase: XXX
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

Tvrtko Ursulin Feb. 5, 2020, 4:22 p.m. UTC | #1
On 05/02/2020 12:13, Chris Wilson wrote:
> If we have a set of active engines marked as being non-persistent, we
> lose track of those if the user replaces those engines with
> I915_CONTEXT_PARAM_ENGINES. Since part of our uABI contract is that
> non-persistent requests are terminated if they are no longer tracked
> by the user's context (in order to prevent a lost request causing an
> untracked, and so unstoppable, GPU hang), we need to apply the
> same context cancellation upon changing engines.
> 
> Fixes: a0e047156cde ("drm/i915/gem: Make context persistence optional")
> Testcase: XXX
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>   drivers/gpu/drm/i915/gem/i915_gem_context.c | 7 +++++++
>   1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> index 52a749691a8d..20f1d3e0221f 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> @@ -1624,11 +1624,18 @@ set_engines(struct i915_gem_context *ctx,
>   
>   replace:
>   	mutex_lock(&ctx->engines_mutex);
> +
> +	/* Flush stale requests off the old engines if required */
> +	if (!i915_gem_context_is_persistent(ctx) ||
> +	    !i915_modparams.enable_hangcheck)
> +		kill_context(ctx);

Is the negative effect of this that legit contexts can't keep
submitting and changing the map? Only if PREEMPT_TIMEOUT is disabled,
I think, but still. Might break legitimate userspace. Not that I can
offer solutions.. :( Ban changing engines once a context has gone
non-persistent? That too could break someone.

Regards,

Tvrtko

> +
>   	if (args->size)
>   		i915_gem_context_set_user_engines(ctx);
>   	else
>   		i915_gem_context_clear_user_engines(ctx);
>   	set.engines = rcu_replace_pointer(ctx->engines, set.engines, 1);
> +
>   	mutex_unlock(&ctx->engines_mutex);
>   
>   	call_rcu(&set.engines->rcu, free_engines_rcu);
>
Chris Wilson Feb. 5, 2020, 4:44 p.m. UTC | #2
Quoting Tvrtko Ursulin (2020-02-05 16:22:34)
> 
> 
> On 05/02/2020 12:13, Chris Wilson wrote:
> > If we have a set of active engines marked as being non-persistent, we
> > lose track of those if the user replaces those engines with
> > I915_CONTEXT_PARAM_ENGINES. Since part of our uABI contract is that
> > non-persistent requests are terminated if they are no longer tracked
> > by the user's context (in order to prevent a lost request causing an
> > untracked, and so unstoppable, GPU hang), we need to apply the
> > same context cancellation upon changing engines.
> > 
> > Fixes: a0e047156cde ("drm/i915/gem: Make context persistence optional")
> > Testcase: XXX
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gem/i915_gem_context.c | 7 +++++++
> >   1 file changed, 7 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > index 52a749691a8d..20f1d3e0221f 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> > @@ -1624,11 +1624,18 @@ set_engines(struct i915_gem_context *ctx,
> >   
> >   replace:
> >       mutex_lock(&ctx->engines_mutex);
> > +
> > +     /* Flush stale requests off the old engines if required */
> > +     if (!i915_gem_context_is_persistent(ctx) ||
> > +         !i915_modparams.enable_hangcheck)
> > +             kill_context(ctx);
> 
> Is the negative effect of this that legit contexts can't keep
> submitting and changing the map? Only if PREEMPT_TIMEOUT is disabled,
> I think, but still. Might break legitimate userspace. Not that I can
> offer solutions.. :( Ban changing engines once a context has gone
> non-persistent? That too could break someone.

It closes the hole we have. To do otherwise, we need to keep track of
the old engines. Not an impossible task, but certainly inconvenient.

struct old_engines {
	struct i915_active active;
	struct list_head link;
	struct i915_gem_context *ctx;
	void *engines;
	int num_engines;
};

With a list+spinlock in the ctx that we can work in kill_context.

The biggest catch there is actually worrying about attaching the active
to an already executing request, and making sure the coupling doesn't
bug out on a concurrent completion. Hmm, it's essentially just a
completion callback, but it's more convenient to use a ready-made one.
-Chris
Tvrtko Ursulin Feb. 5, 2020, 6:33 p.m. UTC | #3
On 05/02/2020 16:44, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2020-02-05 16:22:34)
>> On 05/02/2020 12:13, Chris Wilson wrote:
>>> If we have a set of active engines marked as being non-persistent, we
>>> lose track of those if the user replaces those engines with
>>> I915_CONTEXT_PARAM_ENGINES. Since part of our uABI contract is that
>>> non-persistent requests are terminated if they are no longer tracked
>>> by the user's context (in order to prevent a lost request causing an
>>> untracked, and so unstoppable, GPU hang), we need to apply the
>>> same context cancellation upon changing engines.
>>>
>>> Fixes: a0e047156cde ("drm/i915/gem: Make context persistence optional")
>>> Testcase: XXX
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/gem/i915_gem_context.c | 7 +++++++
>>>    1 file changed, 7 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
>>> index 52a749691a8d..20f1d3e0221f 100644
>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
>>> @@ -1624,11 +1624,18 @@ set_engines(struct i915_gem_context *ctx,
>>>    
>>>    replace:
>>>        mutex_lock(&ctx->engines_mutex);
>>> +
>>> +     /* Flush stale requests off the old engines if required */
>>> +     if (!i915_gem_context_is_persistent(ctx) ||
>>> +         !i915_modparams.enable_hangcheck)
>>> +             kill_context(ctx);
>>
>> Is the negative effect of this that legit contexts can't keep
>> submitting and changing the map? Only if PREEMPT_TIMEOUT is disabled,
>> I think, but still. Might break legitimate userspace. Not that I can
>> offer solutions.. :( Ban changing engines once a context has gone
>> non-persistent? That too could break someone.
> 
> It closes the hole we have. To do otherwise, we need to keep track of
> the old engines. Not an impossible task, but certainly inconvenient.
> 
> struct old_engines {
> 	struct i915_active active;
> 	struct list_head link;
> 	struct i915_gem_context *ctx;
> 	void *engines;
> 	int num_engines;
> };
> 
> With a list+spinlock in the ctx that we can work in kill_context.
> 
> The biggest catch there is actually worrying about attaching the active
> to an already executing request, and making sure the coupling doesn't
> bug out on a concurrent completion. Hmm, it's essentially just a
> completion callback, but it's more convenient to use a ready-made one.

What would you do with the old engines? We don't have a mechanism to
mark an intel_context closed. Hm, right, it would become unreachable by
definition. But how do you terminate it if it doesn't play nicely?

Regards,

Tvrtko
Chris Wilson Feb. 5, 2020, 6:39 p.m. UTC | #4
Quoting Tvrtko Ursulin (2020-02-05 18:33:57)
> 
> On 05/02/2020 16:44, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2020-02-05 16:22:34)
> >> On 05/02/2020 12:13, Chris Wilson wrote:
> >>> If we have a set of active engines marked as being non-persistent, we
> >>> lose track of those if the user replaces those engines with
> >>> I915_CONTEXT_PARAM_ENGINES. Since part of our uABI contract is that
> >>> non-persistent requests are terminated if they are no longer tracked
> >>> by the user's context (in order to prevent a lost request causing an
> >>> untracked, and so unstoppable, GPU hang), we need to apply the
> >>> same context cancellation upon changing engines.
> >>>
> >>> Fixes: a0e047156cde ("drm/i915/gem: Make context persistence optional")
> >>> Testcase: XXX
> >>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>> ---
> >>>    drivers/gpu/drm/i915/gem/i915_gem_context.c | 7 +++++++
> >>>    1 file changed, 7 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> >>> index 52a749691a8d..20f1d3e0221f 100644
> >>> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> >>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> >>> @@ -1624,11 +1624,18 @@ set_engines(struct i915_gem_context *ctx,
> >>>    
> >>>    replace:
> >>>        mutex_lock(&ctx->engines_mutex);
> >>> +
> >>> +     /* Flush stale requests off the old engines if required */
> >>> +     if (!i915_gem_context_is_persistent(ctx) ||
> >>> +         !i915_modparams.enable_hangcheck)
> >>> +             kill_context(ctx);
> >>
> >> Is the negative effect of this that legit contexts can't keep
> >> submitting and changing the map? Only if PREEMPT_TIMEOUT is disabled,
> >> I think, but still. Might break legitimate userspace. Not that I can
> >> offer solutions.. :( Ban changing engines once a context has gone
> >> non-persistent? That too could break someone.
> > 
> > It closes the hole we have. To do otherwise, we need to keep track of
> > the old engines. Not an impossible task, but certainly inconvenient.
> > 
> > struct old_engines {
> >       struct i915_active active;
> >       struct list_head link;
> >       struct i915_gem_context *ctx;
> >       void *engines;
> >       int num_engines;
> > };
> > 
> > With a list+spinlock in the ctx that we can work in kill_context.
> > 
> > The biggest catch there is actually worrying about attaching the active
> > to an already executing request, and making sure the coupling doesn't
> > bug out on a concurrent completion. Hmm, it's essentially just a
> > completion callback, but it's more convenient to use a ready-made one.
> 
> What would you do with the old engines? We don't have a mechanism to
> mark an intel_context closed. Hm, right, it would become unreachable by
> definition. But how do you terminate it if it doesn't play nicely?

Wait 30 minutes (if it passes the tests) and you'll find out :)
-Chris

Patch

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 52a749691a8d..20f1d3e0221f 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -1624,11 +1624,18 @@  set_engines(struct i915_gem_context *ctx,
 
 replace:
 	mutex_lock(&ctx->engines_mutex);
+
+	/* Flush stale requests off the old engines if required */
+	if (!i915_gem_context_is_persistent(ctx) ||
+	    !i915_modparams.enable_hangcheck)
+		kill_context(ctx);
+
 	if (args->size)
 		i915_gem_context_set_user_engines(ctx);
 	else
 		i915_gem_context_clear_user_engines(ctx);
 	set.engines = rcu_replace_pointer(ctx->engines, set.engines, 1);
+
 	mutex_unlock(&ctx->engines_mutex);
 
 	call_rcu(&set.engines->rcu, free_engines_rcu);