[01/11] drm/i915/gem: Make context persistence optional

Message ID	20191024114028.6170-2-chris@chris-wilson.co.uk (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=U0wV=YR=lists.freedesktop.org=intel-gfx-bounces@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 908722166E From: Chris Wilson <chris@chris-wilson.co.uk> To: intel-gfx@lists.freedesktop.org Date: Thu, 24 Oct 2019 12:40:18 +0100 Message-Id: <20191024114028.6170-2-chris@chris-wilson.co.uk> In-Reply-To: <20191024114028.6170-1-chris@chris-wilson.co.uk> References: <20191024114028.6170-1-chris@chris-wilson.co.uk> MIME-Version: 1.0 Subject: [Intel-gfx] [PATCH 01/11] drm/i915/gem: Make context persistence optional Precedence: list Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>
Series	[01/11] drm/i915/gem: Make context persistence optional \| expand [01/11] drm/i915/gem: Make context persistence optional [02/11] drm/i915: Put future HW and their uAPIs under STAGING & BROKEN [03/11] drm/i915/gt: Expose engine properties via sysfs [04/11] drm/i915/gt: Expose engine->mmio_base via sysfs [05/11] drm/i915/gt: Expose timeslice duration to sysfs [06/11] drm/i915/gt: Expose reset stop timeout via sysfs [07/11] drm/i915/gt: Expose preempt reset timeout via sysfs [08/11] drm/i915/gt: Expose heartbeat interval via sysfs [09/11] drm/i915: Flush idle barriers when waiting [10/11] drm/i915: Allow userspace to specify ringsize on construction [11/11] drm/i915/gem: Honour O_NONBLOCK before throttling execbuf submissions

Chris Wilson Oct. 24, 2019, 11:40 a.m. UTC

Our existing behaviour is to allow contexts and their GPU requests to
persist past the point of closure until the requests are complete. This
allows clients to operate in a 'fire-and-forget' manner where they can
setup a rendering pipeline and hand it over to the display server and
immediately exiting. As the rendering pipeline is kept alive until
completion, the display server (or other consumer) can use the results
in the future and present them to the user.

However, not all clients want this persistent behaviour and would prefer
that the contexts are cleaned up immediately upon closure. This ensures
that when clients are run without hangchecking, any GPU hang is
terminated with the process and does not continue to hog resources.

By defining a context property to allow clients to control persistence
explicitly, we can remove the blanket advice to disable hangchecking
that seems to be far too prevalent.

The default behaviour for new controls is the legacy persistence mode.
New clients will have to opt out for immediate cleanup on context
closure. If the hangchecking modparam is disabled, so is persistent
context support -- all contexts will be terminated on closure.

Testcase: igt/gem_ctx_persistence
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Jon Bloomfield <jon.bloomfield@intel.com>
Reviewed-by: Jon Bloomfield <jon.bloomfield@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c   | 50 ++++++++++++++++++-
 drivers/gpu/drm/i915/gem/i915_gem_context.h   | 15 ++++++
 .../gpu/drm/i915/gem/i915_gem_context_types.h |  1 +
 .../gpu/drm/i915/gem/selftests/mock_context.c |  2 +
 include/uapi/drm/i915_drm.h                   | 15 ++++++
 5 files changed, 82 insertions(+), 1 deletion(-)

Joonas Lahtinen Oct. 25, 2019, 9:10 a.m. UTC | #1

Quoting Chris Wilson (2019-10-24 14:40:18)
> Our existing behaviour is to allow contexts and their GPU requests to
> persist past the point of closure until the requests are complete. This
> allows clients to operate in a 'fire-and-forget' manner where they can
> setup a rendering pipeline and hand it over to the display server and
> immediately exiting. As the rendering pipeline is kept alive until
> completion, the display server (or other consumer) can use the results
> in the future and present them to the user.
> 
> However, not all clients want this persistent behaviour and would prefer
> that the contexts are cleaned up immediately upon closure. This ensures
> that when clients are run without hangchecking, any GPU hang is
> terminated with the process and does not continue to hog resources.

It's worth mentioning GPU compute workloads of indeterminate length
as an example for increased clarity.

> By defining a context property to allow clients to control persistence
> explicitly, we can remove the blanket advice to disable hangchecking
> that seems to be far too prevalent.

I would drop this paragraph from this patch, as it doesn't (yet) make
sense.

> The default behaviour for new controls is the legacy persistence mode.
> New clients will have to opt out for immediate cleanup on context
> closure.

"... have to opt in to immediate ..." reads more clear.

> If the hangchecking modparam is disabled, so is persistent
> context support -- all contexts will be terminated on closure.

Let's add here that it has actually been a source of bug reports
in the past that we don't terminate the workoads. So we expect
this behaviour change to be welcomed by the compute users who
have been instructed to disable the hanghceck in the past.

A couple of comments below. But anyway, with the uAPI comment
and commit message clarified this is:

Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

Still needs the Link:, though.

> Testcase: igt/gem_ctx_persistence
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> Cc: Jon Bloomfield <jon.bloomfield@intel.com>
> Reviewed-by: Jon Bloomfield <jon.bloomfield@intel.com>
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

<SNIP>

> +static int __context_set_persistence(struct i915_gem_context *ctx, bool state)
> +{
> +       if (i915_gem_context_is_persistent(ctx) == state)
> +               return 0;
> +
> +       if (state) {
> +               /*
> +                * Only contexts that are short-lived [that will expire or be
> +                * reset] are allowed to survive past termination. We require
> +                * hangcheck to ensure that the persistent requests are healthy.
> +                */
> +               if (!i915_modparams.enable_hangcheck)
> +                       return -EINVAL;

This is slightly confusing as the default is to enable persistence.

Disabling and re-enabling would result -EINVAL. But I guess it's no
problem to lift such restriction later.

> +++ b/include/uapi/drm/i915_drm.h
> @@ -1572,6 +1572,21 @@ struct drm_i915_gem_context_param {
>   *   i915_context_engines_bond (I915_CONTEXT_ENGINES_EXT_BOND)
>   */
>  #define I915_CONTEXT_PARAM_ENGINES     0xa
> +
> +/*
> + * I915_CONTEXT_PARAM_PERSISTENCE:
> + *
> + * Allow the context and active rendering to survive the process until
> + * completion. Persistence allows fire-and-forget clients to queue up a
> + * bunch of work, hand the output over to a display server and the quit.
> + * If the context is not marked as persistent, upon closing (either via

".. is marked as not persistent," is more clear.

Regards, Joonas

> + * an explicit DRM_I915_GEM_CONTEXT_DESTROY or implicitly from file closure
> + * or process termination), the context and any outstanding requests will be
> + * cancelled (and exported fences for cancelled requests marked as -EIO).
> + *
> + * By default, new contexts allow persistence.
> + */

Jason Ekstrand Oct. 25, 2019, 6:22 p.m. UTC | #2

On Thu, Oct 24, 2019 at 6:40 AM Chris Wilson <chris@chris-wilson.co.uk>
wrote:

> Our existing behaviour is to allow contexts and their GPU requests to
> persist past the point of closure until the requests are complete. This
> allows clients to operate in a 'fire-and-forget' manner where they can
> setup a rendering pipeline and hand it over to the display server and
> immediately exiting. As the rendering pipeline is kept alive until
> completion, the display server (or other consumer) can use the results
> in the future and present them to the user.
>
> However, not all clients want this persistent behaviour and would prefer
> that the contexts are cleaned up immediately upon closure. This ensures
> that when clients are run without hangchecking, any GPU hang is
> terminated with the process and does not continue to hog resources.
>
> By defining a context property to allow clients to control persistence
> explicitly, we can remove the blanket advice to disable hangchecking
> that seems to be far too prevalent.
>

Just to be clear, when you say "disable hangchecking" do you mean disabling
it for all processes via a kernel parameter at boot time or a sysfs entry
or similar?  Or is there some mechanism whereby a context can request no
hang checking?


> The default behaviour for new controls is the legacy persistence mode.
> New clients will have to opt out for immediate cleanup on context
> closure. If the hangchecking modparam is disabled, so is persistent
> context support -- all contexts will be terminated on closure.
>

What happens to fences when the context is cancelled?  Is it the same
behavior as we have today for when a GPU hang is detected and a context is
banned?

--Jason



> Testcase: igt/gem_ctx_persistence
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> Cc: Jon Bloomfield <jon.bloomfield@intel.com>
> Reviewed-by: Jon Bloomfield <jon.bloomfield@intel.com>
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>  drivers/gpu/drm/i915/gem/i915_gem_context.c   | 50 ++++++++++++++++++-
>  drivers/gpu/drm/i915/gem/i915_gem_context.h   | 15 ++++++
>  .../gpu/drm/i915/gem/i915_gem_context_types.h |  1 +
>  .../gpu/drm/i915/gem/selftests/mock_context.c |  2 +
>  include/uapi/drm/i915_drm.h                   | 15 ++++++
>  5 files changed, 82 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> index 55f1f93c0925..0f1bbf5d1a11 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> @@ -437,12 +437,39 @@ static void context_close(struct i915_gem_context
> *ctx)
>          * case we opt to forcibly kill off all remaining requests on
>          * context close.
>          */
> -       if (!i915_modparams.enable_hangcheck)
> +       if (!i915_gem_context_is_persistent(ctx) ||
> +           !i915_modparams.enable_hangcheck)
>                 kill_context(ctx);
>
>         i915_gem_context_put(ctx);
>  }
>
> +static int __context_set_persistence(struct i915_gem_context *ctx, bool
> state)
> +{
> +       if (i915_gem_context_is_persistent(ctx) == state)
> +               return 0;
> +
> +       if (state) {
> +               /*
> +                * Only contexts that are short-lived [that will expire or
> be
> +                * reset] are allowed to survive past termination. We
> require
> +                * hangcheck to ensure that the persistent requests are
> healthy.
> +                */
> +               if (!i915_modparams.enable_hangcheck)
> +                       return -EINVAL;
> +
> +               i915_gem_context_set_persistence(ctx);
> +       } else {
> +               /* To cancel a context we use "preempt-to-idle" */
> +               if (!(ctx->i915->caps.scheduler &
> I915_SCHEDULER_CAP_PREEMPTION))
> +                       return -ENODEV;
> +
> +               i915_gem_context_clear_persistence(ctx);
> +       }
> +
> +       return 0;
> +}
> +
>  static struct i915_gem_context *
>  __create_context(struct drm_i915_private *i915)
>  {
> @@ -477,6 +504,7 @@ __create_context(struct drm_i915_private *i915)
>
>         i915_gem_context_set_bannable(ctx);
>         i915_gem_context_set_recoverable(ctx);
> +       __context_set_persistence(ctx, true /* cgroup hook? */);
>
>         for (i = 0; i < ARRAY_SIZE(ctx->hang_timestamp); i++)
>                 ctx->hang_timestamp[i] = jiffies -
> CONTEXT_FAST_HANG_JIFFIES;
> @@ -633,6 +661,7 @@ i915_gem_context_create_kernel(struct drm_i915_private
> *i915, int prio)
>                 return ctx;
>
>         i915_gem_context_clear_bannable(ctx);
> +       i915_gem_context_set_persistence(ctx);
>         ctx->sched.priority = I915_USER_PRIORITY(prio);
>
>         GEM_BUG_ON(!i915_gem_context_is_kernel(ctx));
> @@ -1743,6 +1772,16 @@ get_engines(struct i915_gem_context *ctx,
>         return err;
>  }
>
> +static int
> +set_persistence(struct i915_gem_context *ctx,
> +               const struct drm_i915_gem_context_param *args)
> +{
> +       if (args->size)
> +               return -EINVAL;
> +
> +       return __context_set_persistence(ctx, args->value);
> +}
> +
>  static int ctx_setparam(struct drm_i915_file_private *fpriv,
>                         struct i915_gem_context *ctx,
>                         struct drm_i915_gem_context_param *args)
> @@ -1820,6 +1859,10 @@ static int ctx_setparam(struct
> drm_i915_file_private *fpriv,
>                 ret = set_engines(ctx, args);
>                 break;
>
> +       case I915_CONTEXT_PARAM_PERSISTENCE:
> +               ret = set_persistence(ctx, args);
> +               break;
> +
>         case I915_CONTEXT_PARAM_BAN_PERIOD:
>         default:
>                 ret = -EINVAL;
> @@ -2272,6 +2315,11 @@ int i915_gem_context_getparam_ioctl(struct
> drm_device *dev, void *data,
>                 ret = get_engines(ctx, args);
>                 break;
>
> +       case I915_CONTEXT_PARAM_PERSISTENCE:
> +               args->size = 0;
> +               args->value = i915_gem_context_is_persistent(ctx);
> +               break;
> +
>         case I915_CONTEXT_PARAM_BAN_PERIOD:
>         default:
>                 ret = -EINVAL;
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.h
> b/drivers/gpu/drm/i915/gem/i915_gem_context.h
> index cfe80590f0ed..18e50a769a6e 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.h
> @@ -76,6 +76,21 @@ static inline void
> i915_gem_context_clear_recoverable(struct i915_gem_context *c
>         clear_bit(UCONTEXT_RECOVERABLE, &ctx->user_flags);
>  }
>
> +static inline bool i915_gem_context_is_persistent(const struct
> i915_gem_context *ctx)
> +{
> +       return test_bit(UCONTEXT_PERSISTENCE, &ctx->user_flags);
> +}
> +
> +static inline void i915_gem_context_set_persistence(struct
> i915_gem_context *ctx)
> +{
> +       set_bit(UCONTEXT_PERSISTENCE, &ctx->user_flags);
> +}
> +
> +static inline void i915_gem_context_clear_persistence(struct
> i915_gem_context *ctx)
> +{
> +       clear_bit(UCONTEXT_PERSISTENCE, &ctx->user_flags);
> +}
> +
>  static inline bool i915_gem_context_is_banned(const struct
> i915_gem_context *ctx)
>  {
>         return test_bit(CONTEXT_BANNED, &ctx->flags);
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
> b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
> index fe97b8ba4fda..861d7d92fe9f 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
> @@ -137,6 +137,7 @@ struct i915_gem_context {
>  #define UCONTEXT_NO_ERROR_CAPTURE      1
>  #define UCONTEXT_BANNABLE              2
>  #define UCONTEXT_RECOVERABLE           3
> +#define UCONTEXT_PERSISTENCE           4
>
>         /**
>          * @flags: small set of booleans
> diff --git a/drivers/gpu/drm/i915/gem/selftests/mock_context.c
> b/drivers/gpu/drm/i915/gem/selftests/mock_context.c
> index 74ddd682c9cd..29b8984f0e47 100644
> --- a/drivers/gpu/drm/i915/gem/selftests/mock_context.c
> +++ b/drivers/gpu/drm/i915/gem/selftests/mock_context.c
> @@ -22,6 +22,8 @@ mock_context(struct drm_i915_private *i915,
>         INIT_LIST_HEAD(&ctx->link);
>         ctx->i915 = i915;
>
> +       i915_gem_context_set_persistence(ctx);
> +
>         mutex_init(&ctx->engines_mutex);
>         e = default_engines(ctx);
>         if (IS_ERR(e))
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 63d40cba97e0..b5fd5e4bd16e 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1572,6 +1572,21 @@ struct drm_i915_gem_context_param {
>   *   i915_context_engines_bond (I915_CONTEXT_ENGINES_EXT_BOND)
>   */
>  #define I915_CONTEXT_PARAM_ENGINES     0xa
> +
> +/*
> + * I915_CONTEXT_PARAM_PERSISTENCE:
> + *
> + * Allow the context and active rendering to survive the process until
> + * completion. Persistence allows fire-and-forget clients to queue up a
> + * bunch of work, hand the output over to a display server and the quit.
> + * If the context is not marked as persistent, upon closing (either via
> + * an explicit DRM_I915_GEM_CONTEXT_DESTROY or implicitly from file
> closure
> + * or process termination), the context and any outstanding requests will
> be
> + * cancelled (and exported fences for cancelled requests marked as -EIO).
> + *
> + * By default, new contexts allow persistence.
> + */
> +#define I915_CONTEXT_PARAM_PERSISTENCE 0xb
>  /* Must be kept compact -- no holes and well documented */
>
>         __u64 value;
> --
> 2.24.0.rc0
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx

Chris Wilson Oct. 25, 2019, 9:29 p.m. UTC | #3

Quoting Jason Ekstrand (2019-10-25 19:22:04)
> On Thu, Oct 24, 2019 at 6:40 AM Chris Wilson <chris@chris-wilson.co.uk> wrote:
> 
>     Our existing behaviour is to allow contexts and their GPU requests to
>     persist past the point of closure until the requests are complete. This
>     allows clients to operate in a 'fire-and-forget' manner where they can
>     setup a rendering pipeline and hand it over to the display server and
>     immediately exiting. As the rendering pipeline is kept alive until
>     completion, the display server (or other consumer) can use the results
>     in the future and present them to the user.
> 
>     However, not all clients want this persistent behaviour and would prefer
>     that the contexts are cleaned up immediately upon closure. This ensures
>     that when clients are run without hangchecking, any GPU hang is
>     terminated with the process and does not continue to hog resources.
> 
>     By defining a context property to allow clients to control persistence
>     explicitly, we can remove the blanket advice to disable hangchecking
>     that seems to be far too prevalent.
> 
> 
> Just to be clear, when you say "disable hangchecking" do you mean disabling it
> for all processes via a kernel parameter at boot time or a sysfs entry or
> similar?  Or is there some mechanism whereby a context can request no hang
> checking?

They are being told to use the module parameter i915.enable_hangcheck=0
to globally disable hangchecking. This is what we are trying to wean
them off, and yet still allow indefinitely long kernels. The softer
hangcheck is focused on if you block scheduling or preemption of higher
priority work, then you are forcibly removed from the GPU. However, even
that is too much for some workloads, where they really do expect to
permanently hog the GPU. (All I can say is that they better be dedicated
systems because if you demand interactivity on top of disabling
preemption...)

>     The default behaviour for new controls is the legacy persistence mode.
>     New clients will have to opt out for immediate cleanup on context
>     closure. If the hangchecking modparam is disabled, so is persistent
>     context support -- all contexts will be terminated on closure.
> 
> 
> What happens to fences when the context is cancelled?  Is it the same behavior
> as we have today for when a GPU hang is detected and a context is banned?

Yes. The incomplete fence statuses are set to -EIO -- it is the very same
mechanism used to remove this context's future work from the GPU as is
used for banning.
-Chris

Jason Ekstrand Oct. 29, 2019, 4:19 p.m. UTC | #4

On Fri, Oct 25, 2019 at 4:29 PM Chris Wilson <chris@chris-wilson.co.uk>
wrote:

> Quoting Jason Ekstrand (2019-10-25 19:22:04)
> > On Thu, Oct 24, 2019 at 6:40 AM Chris Wilson <chris@chris-wilson.co.uk>
> wrote:
> >
> >     Our existing behaviour is to allow contexts and their GPU requests to
> >     persist past the point of closure until the requests are complete.
> This
> >     allows clients to operate in a 'fire-and-forget' manner where they
> can
> >     setup a rendering pipeline and hand it over to the display server and
> >     immediately exiting. As the rendering pipeline is kept alive until
> >     completion, the display server (or other consumer) can use the
> results
> >     in the future and present them to the user.
> >
> >     However, not all clients want this persistent behaviour and would
> prefer
> >     that the contexts are cleaned up immediately upon closure. This
> ensures
> >     that when clients are run without hangchecking, any GPU hang is
> >     terminated with the process and does not continue to hog resources.
> >
> >     By defining a context property to allow clients to control
> persistence
> >     explicitly, we can remove the blanket advice to disable hangchecking
> >     that seems to be far too prevalent.
> >
> >
> > Just to be clear, when you say "disable hangchecking" do you mean
> disabling it
> > for all processes via a kernel parameter at boot time or a sysfs entry or
> > similar?  Or is there some mechanism whereby a context can request no
> hang
> > checking?
>
> They are being told to use the module parameter i915.enable_hangcheck=0
> to globally disable hangchecking. This is what we are trying to wean
> them off, and yet still allow indefinitely long kernels. The softer
> hangcheck is focused on if you block scheduling or preemption of higher
> priority work, then you are forcibly removed from the GPU. However, even
> that is too much for some workloads, where they really do expect to
> permanently hog the GPU. (All I can say is that they better be dedicated
> systems because if you demand interactivity on top of disabling
> preemption...)
>

Ok, thinking out loud here (no need for you to respond):  Why should we
take this approach?  It seems like there are several other ways we could
solve this:

 1. Have a per-context flag (that's what we did here)
 2. Have a per-execbuf flag for "don't allow this execbuf to outlive the
process".
 3. Have a DRM_IOCTL_I915_KILL_CONTEXT which lets the client manually kill
the context

Option 2 seems like a lot more work in i915 and it doesn't seem that
advantageous.  Most drivers are going to either want their batches to
outlive them or not; they aren't going to be making that decision on a
per-batch basis.

Option 3 would work for some cases but it doesn't let the kernel terminate
work if the client is killed unexpectedly by, for instance a segfault.  The
client could try to insert a crash handler but calling a DRM ioctl from a
crash handler sounds like a bad plan.  On the other hand, the client can
just as easily implement 3 by setting the new context flag and then calling
GEM_CONTEXT_DESTROY.

With that, I think I'm convinced that a context param is the best way to do
this.  We may even consider using it in Vulkan when running headless to let
us kill stuff quicker.  We aren't seeing any long-running Vulkan compute
workloads yet but they may be coming.

Acked-by: Jason Ekstrand <jason@jlekstrand.net>

One more question: Does this bit fully support being turned on and off or
is it a set-once?  I ask because how I'd likely go about doing this in
Vulkan would be to set it on context create and then unset it the moment we
see a buffer shared with the outside world.

--Jason

Chris Wilson Oct. 29, 2019, 6:02 p.m. UTC | #5

Quoting Jason Ekstrand (2019-10-29 16:19:09)
> 
> 
> On Fri, Oct 25, 2019 at 4:29 PM Chris Wilson <chris@chris-wilson.co.uk> wrote:
> 
>     Quoting Jason Ekstrand (2019-10-25 19:22:04)
>     > On Thu, Oct 24, 2019 at 6:40 AM Chris Wilson <chris@chris-wilson.co.uk>
>     wrote:
>     >
>     >     Our existing behaviour is to allow contexts and their GPU requests to
>     >     persist past the point of closure until the requests are complete.
>     This
>     >     allows clients to operate in a 'fire-and-forget' manner where they
>     can
>     >     setup a rendering pipeline and hand it over to the display server and
>     >     immediately exiting. As the rendering pipeline is kept alive until
>     >     completion, the display server (or other consumer) can use the
>     results
>     >     in the future and present them to the user.
>     >
>     >     However, not all clients want this persistent behaviour and would
>     prefer
>     >     that the contexts are cleaned up immediately upon closure. This
>     ensures
>     >     that when clients are run without hangchecking, any GPU hang is
>     >     terminated with the process and does not continue to hog resources.
>     >
>     >     By defining a context property to allow clients to control
>     persistence
>     >     explicitly, we can remove the blanket advice to disable hangchecking
>     >     that seems to be far too prevalent.
>     >
>     >
>     > Just to be clear, when you say "disable hangchecking" do you mean
>     disabling it
>     > for all processes via a kernel parameter at boot time or a sysfs entry or
>     > similar?  Or is there some mechanism whereby a context can request no
>     hang
>     > checking?
> 
>     They are being told to use the module parameter i915.enable_hangcheck=0
>     to globally disable hangchecking. This is what we are trying to wean
>     them off, and yet still allow indefinitely long kernels. The softer
>     hangcheck is focused on if you block scheduling or preemption of higher
>     priority work, then you are forcibly removed from the GPU. However, even
>     that is too much for some workloads, where they really do expect to
>     permanently hog the GPU. (All I can say is that they better be dedicated
>     systems because if you demand interactivity on top of disabling
>     preemption...)
> 
> 
> Ok, thinking out loud here (no need for you to respond):  Why should we take
> this approach?  It seems like there are several other ways we could solve this:
> 
>  1. Have a per-context flag (that's what we did here)
>  2. Have a per-execbuf flag for "don't allow this execbuf to outlive the
> process".
>  3. Have a DRM_IOCTL_I915_KILL_CONTEXT which lets the client manually kill the
> context
> 
> Option 2 seems like a lot more work in i915 and it doesn't seem that
> advantageous.  Most drivers are going to either want their batches to outlive
> them or not; they aren't going to be making that decision on a per-batch basis.

And processing each batch on context close, deciding how to carry
forward the different options doesn't sound like fun.
 
> Option 3 would work for some cases but it doesn't let the kernel terminate work
> if the client is killed unexpectedly by, for instance a segfault.  The client
> could try to insert a crash handler but calling a DRM ioctl from a crash
> handler sounds like a bad plan.  On the other hand, the client can just as
> easily implement 3 by setting the new context flag and then calling
> GEM_CONTEXT_DESTROY.

Exactly. Abnormal process termination is the name of the game.
 
> With that, I think I'm convinced that a context param is the best way to do
> this.  We may even consider using it in Vulkan when running headless to let us
> kill stuff quicker.  We aren't seeing any long-running Vulkan compute workloads
> yet but they may be coming.

Yeah, might it find a use for GL robustness? If the app wants a
guarantee that its resources are garbage collected along with it?
 
> Acked-by: Jason Ekstrand <jason@jlekstrand.net>
> 
> 
> One more question: Does this bit fully support being turned on and off or is it
> a set-once?  I ask because how I'd likely go about doing this in Vulkan would
> be to set it on context create and then unset it the moment we see a buffer
> shared with the outside world.

You can set it as many times as you like, it only takes effect on
context termination. That comes back to whether you want to evaluate it
as a batch/execbuf attribute or on the context. The starting point for
the patches was to close the process termination hole, of which this is
the natural extension for a context to opt into being cancelled on
closure. I think an all-or-nothing makes sense, or rather I don't see
enough advantage in the per-batch attribute to justify the processing.
-Chris

[01/11] drm/i915/gem: Make context persistence optional

Commit Message

Comments

Patch