
[8/8] drm/i915: Expose RPCS (SSEU) configuration to userspace

Message ID 20180814144058.19286-9-tvrtko.ursulin@linux.intel.com (mailing list archive)
State New, archived
Series Per context dynamic (sub)slice power-gating

Commit Message

Tvrtko Ursulin Aug. 14, 2018, 2:40 p.m. UTC
From: Chris Wilson <chris@chris-wilson.co.uk>

We want to allow userspace to reconfigure the subslice configuration for
its own use case. To do so, we expose a context parameter to allow
adjustment of the RPCS register stored within the context image (and
currently not accessible via LRI). If the context is adjusted before
first use, the adjustment is "free"; otherwise, if the context is
active, we flush the context off the GPU (stalling all users), forcing
the GPU to save the context to memory, where we can modify it and
so ensure that the register is reloaded on next execution.

The overhead of managing additional EU subslices can be significant,
especially in multi-context workloads. Non-GPGPU contexts should
preferably disable the subslices they are not using, and others should
fine-tune the number to match their workload.

We expose complete control over the RPCS register, allowing
configuration of slice/subslice, via masks packed into a u64 for
simplicity. For example,

	struct drm_i915_gem_context_param arg;
	struct drm_i915_gem_context_param_sseu sseu = { .class = 0,
	                                                .instance = 0, };

	memset(&arg, 0, sizeof(arg));
	arg.ctx_id = ctx;
	arg.param = I915_CONTEXT_PARAM_SSEU;
	arg.value = (uintptr_t) &sseu;
	if (drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_GETPARAM, &arg) == 0) {
		sseu.packed.subslice_mask = 0;

		drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &arg);
	}

could be used to disable all subslices where supported.

v2: Fix offset of CTX_R_PWR_CLK_STATE in intel_lr_context_set_sseu() (Lionel)

v3: Add ability to program this per engine (Chris)

v4: Move most get_sseu() into i915_gem_context.c (Lionel)

v5: Validate sseu configuration against the device's capabilities (Lionel)

v6: Change context powergating settings through MI_SDM on kernel context (Chris)

v7: Synchronize the requests following a powergating setting change using a global
    dependency (Chris)
    Iterate timelines through dev_priv.gt.active_rings (Tvrtko)
    Disable RPCS configuration setting for non capable users (Lionel/Tvrtko)

v8: s/union intel_sseu/struct intel_sseu/ (Lionel)
    s/dev_priv/i915/ (Tvrtko)
    Change uapi class/instance fields to u16 (Tvrtko)
    Bump mask fields to 64bits (Lionel)
    Don't return EPERM when dynamic sseu is disabled (Tvrtko)

v9: Import context image into kernel context's ppgtt only when
    reconfiguring powergated slice/subslices (Chris)
    Use aliasing ppgtt when needed (Michel)

Tvrtko Ursulin:

v10:
 * Update for upstream changes.
 * Request submit needs a RPM reference.
 * Reject on !FULL_PPGTT for simplicity.
 * Pull out get/set param to helpers for readability and less indent.
 * Use i915_request_await_dma_fence in add_global_barrier to skip waits
   on the same timeline and avoid GEM_BUG_ON.
 * No need to explicitly assign a NULL pointer to engine in legacy mode.
 * No need to move gen8_make_rpcs up.
 * Factored out global barrier as prep patch.
 * Allow only CAP_SYS_ADMIN if !Gen11.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100899
Issue: https://github.com/intel/media-driver/issues/267
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Zhipeng Gong <zhipeng.gong@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 187 +++++++++++++++++++++++-
 drivers/gpu/drm/i915/intel_lrc.c        |  55 +++++++
 drivers/gpu/drm/i915/intel_ringbuffer.h |   4 +
 include/uapi/drm/i915_drm.h             |  43 ++++++
 4 files changed, 288 insertions(+), 1 deletion(-)
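Taking the v8 changelog notes above at face value (u16 class/instance fields, 64-bit masks), the uapi addition to i915_drm.h presumably looks along these lines. Every field name and the exact layout here are assumptions reconstructed from the changelog, not copied from the patch; the authoritative definition is the include/uapi/drm/i915_drm.h hunk itself:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical reconstruction of the proposed uapi struct, based only
 * on the changelog above: u16 engine class/instance selectors and
 * 64-bit slice/subslice masks.  Field names are assumptions. */
struct drm_i915_gem_context_param_sseu {
	uint16_t class;                /* engine class to reconfigure */
	uint16_t instance;             /* engine instance within the class */
	uint32_t flags;                /* assumed reserved, must be zero */
	uint64_t slice_mask;           /* bitmask of enabled slices */
	uint64_t subslice_mask;        /* bitmask of enabled subslices */
	uint16_t min_eus_per_subslice; /* EU count bounds per subslice */
	uint16_t max_eus_per_subslice;
	uint32_t rsvd;                 /* pad to a multiple of 8 bytes */
};
```

With this layout the struct packs to 32 bytes with no implicit padding, which is the usual requirement for i915 uapi structures.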

Comments

Chris Wilson Aug. 14, 2018, 2:59 p.m. UTC | #1
Quoting Tvrtko Ursulin (2018-08-14 15:40:58)
> [...]
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index 8a12984e7495..6d6220634e9e 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -773,6 +773,91 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
>         return 0;
>  }
>  
> +static int
> +i915_gem_context_reconfigure_sseu(struct i915_gem_context *ctx,
> +                                 struct intel_engine_cs *engine,
> +                                 struct intel_sseu sseu)
> +{
> +       struct drm_i915_private *i915 = ctx->i915;
> +       struct i915_request *rq;
> +       struct intel_ring *ring;
> +       int ret;
> +
> +       lockdep_assert_held(&i915->drm.struct_mutex);
> +
> +       /* Submitting requests etc needs the hw awake. */
> +       intel_runtime_pm_get(i915);
> +
> +       i915_retire_requests(i915);

?

> +
> +       /* Now use the RCS to actually reconfigure. */
> +       engine = i915->engine[RCS];

? Modifying registers stored in another engine's context image.

> +
> +       rq = i915_request_alloc(engine, i915->kernel_context);
> +       if (IS_ERR(rq)) {
> +               ret = PTR_ERR(rq);
> +               goto out_put;
> +       }
> +
> +       ret = engine->emit_rpcs_config(rq, ctx, sseu);

It's just an LRI, I'd rather we do it directly unless there's evidence
that there will be an explicit rpcs config instruction in future. It
just doesn't seem general enough.

> +       if (ret)
> +               goto out_add;
> +
> +       /* Queue this switch after all other activity */

Only needs to be after the target ctx.

> +       list_for_each_entry(ring, &i915->gt.active_rings, active_link) {
> +               struct i915_request *prev;
> +
> +               prev = last_request_on_engine(ring->timeline, engine);

As constructed above you need target-engine + RCS.

> +               if (prev)
> +                       i915_sw_fence_await_sw_fence_gfp(&rq->submit,
> +                                                        &prev->submit,
> +                                                        I915_FENCE_GFP);
> +       }
> +
> +       i915_gem_set_global_barrier(i915, rq);

This is just for a link from ctx-engine to this rq. Overkill much?
Presumably this stems from using the wrong engine.

> +
> +out_add:
> +       i915_request_add(rq);

And I'd still recommend not using indirect access if we can apply the
changes immediately.
-Chris
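Chris's suggestion above is to emit the register write inline rather than through an emit_rpcs_config vfunc, since it is just an LRI. A rough, driver-independent sketch of the dword packet such an inline emission would place in the ring; the opcode (0x22 for MI_LOAD_REGISTER_IMM) and register offset (0x20c8 for R_PWR_CLK_STATE) follow common i915 definitions but should be treated as illustrative, not as the patch's actual code:

```c
#include <stdint.h>
#include <stddef.h>

#define MI_INSTR(opcode, flags)  (((uint32_t)(opcode) << 23) | (flags))
#define MI_LOAD_REGISTER_IMM(n)  MI_INSTR(0x22, 2 * (n) - 1)
#define MI_NOOP                  MI_INSTR(0, 0)
#define GEN8_R_PWR_CLK_STATE     0x20c8  /* RPCS register offset (assumption) */

/* Emit the 4-dword LRI packet: one register write carrying the new
 * RPCS value, padded to a qword boundary with a NOOP.  Returns the
 * number of dwords written into cs. */
static size_t emit_rpcs_lri(uint32_t *cs, uint32_t rpcs)
{
	size_t i = 0;

	cs[i++] = MI_LOAD_REGISTER_IMM(1);
	cs[i++] = GEN8_R_PWR_CLK_STATE;
	cs[i++] = rpcs;
	cs[i++] = MI_NOOP;

	return i;
}
```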
Lionel Landwerlin Aug. 14, 2018, 3:11 p.m. UTC | #2
On 14/08/18 15:59, Chris Wilson wrote:
> And I'd still recommend not using indirect access if we can apply the
> changes immediately.
> -Chris
>

Hangs on Gen9 :(

-
Lionel
Chris Wilson Aug. 14, 2018, 3:18 p.m. UTC | #3
Quoting Lionel Landwerlin (2018-08-14 16:11:36)
> On 14/08/18 15:59, Chris Wilson wrote:
> > And I'd still recommend not using indirect access if we can apply the
> > changes immediately.
> > -Chris
> >
> 
> Hangs on Gen9 :(

How does modifying the context image of an idle (unpinned) context cause
a hang?
-Chris
Chris Wilson Aug. 14, 2018, 3:22 p.m. UTC | #4
Quoting Tvrtko Ursulin (2018-08-14 15:40:58)
> +static int
> +i915_gem_context_reconfigure_sseu(struct i915_gem_context *ctx,
> +                                 struct intel_engine_cs *engine,
> +                                 struct intel_sseu sseu)
> +{
> +       struct drm_i915_private *i915 = ctx->i915;
> +       struct i915_request *rq;
> +       struct intel_ring *ring;
> +       int ret;
> +
> +       lockdep_assert_held(&i915->drm.struct_mutex);
> +
> +       /* Submitting requests etc needs the hw awake. */
> +       intel_runtime_pm_get(i915);
> +
> +       i915_retire_requests(i915);
> +
> +       /* Now use the RCS to actually reconfigure. */
> +       engine = i915->engine[RCS];
> +
> +       rq = i915_request_alloc(engine, i915->kernel_context);
> +       if (IS_ERR(rq)) {
> +               ret = PTR_ERR(rq);
> +               goto out_put;
> +       }
> +
> +       ret = engine->emit_rpcs_config(rq, ctx, sseu);
> +       if (ret)
> +               goto out_add;
> +
> +       /* Queue this switch after all other activity */
> +       list_for_each_entry(ring, &i915->gt.active_rings, active_link) {
> +               struct i915_request *prev;
> +
> +               prev = last_request_on_engine(ring->timeline, engine);
> +               if (prev)
> +                       i915_sw_fence_await_sw_fence_gfp(&rq->submit,
> +                                                        &prev->submit,
> +                                                        I915_FENCE_GFP);
> +       }
> +
> +       i915_gem_set_global_barrier(i915, rq);
> +
> +out_add:
> +       i915_request_add(rq);
> +out_put:
> +       intel_runtime_pm_put(i915);
> +
> +       return ret;

Looks like we should be able to hook this up to a selftest to confirm
the modification does land in the target context image, and a SRM to
confirm it loaded.
-Chris
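The selftest Chris outlines would pair the context-image modification with an SRM readback to confirm the value actually loaded. A hedged sketch of the MI_STORE_REGISTER_MEM packet such a test could emit, again encoded by hand with the opcode and register offset as assumptions rather than the series' real selftest code:

```c
#include <stdint.h>
#include <stddef.h>

#define MI_INSTR(opcode, flags)     (((uint32_t)(opcode) << 23) | (flags))
#define MI_STORE_REGISTER_MEM_GEN8  MI_INSTR(0x24, 2)
#define GEN8_R_PWR_CLK_STATE        0x20c8  /* assumption, as above */

/* Emit a 4-dword SRM that dumps the live RPCS value to a dword at
 * scratch_gtt_addr, so a selftest can compare it against what was
 * poked into the context image.  Returns dwords written. */
static size_t emit_rpcs_srm(uint32_t *cs, uint64_t scratch_gtt_addr)
{
	size_t i = 0;

	cs[i++] = MI_STORE_REGISTER_MEM_GEN8;
	cs[i++] = GEN8_R_PWR_CLK_STATE;
	cs[i++] = (uint32_t)scratch_gtt_addr;         /* address, low dword */
	cs[i++] = (uint32_t)(scratch_gtt_addr >> 32); /* address, high dword */

	return i;
}
```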
Lionel Landwerlin Aug. 14, 2018, 4:05 p.m. UTC | #5
On 14/08/18 16:18, Chris Wilson wrote:
> Quoting Lionel Landwerlin (2018-08-14 16:11:36)
>> On 14/08/18 15:59, Chris Wilson wrote:
>>> And I'd still recommend not using indirect access if we can apply the
>>> changes immediately.
>>> -Chris
>>>
>> Hangs on Gen9 :(
> How does modifying the context image of an idle (unpinned) context cause
> a hang?
> -Chris
>
I thought you meant a context modifying its own RPCS register... no?

-
Lionel
Lionel Landwerlin Aug. 14, 2018, 4:09 p.m. UTC | #6
On 14/08/18 17:05, Lionel Landwerlin wrote:
> On 14/08/18 16:18, Chris Wilson wrote:
>> Quoting Lionel Landwerlin (2018-08-14 16:11:36)
>>> On 14/08/18 15:59, Chris Wilson wrote:
>>>> And I'd still recommend not using indirect access if we can apply the
>>>> changes immediately.
>>>> -Chris
>>>>
>>> Hangs on Gen9 :(
>> How does modifying the context image of an idle (unpinned) context cause
>> a hang?
>> -Chris
>>
> I thought you meant a context modifying it's own RPCS register... no?
>
> -
> Lionel

Oh I get it now... Sorry, forget me :)

-
Lionel
Tvrtko Ursulin Aug. 14, 2018, 6:44 p.m. UTC | #7
On 14/08/2018 15:59, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-08-14 15:40:58)
>> [...]
>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
>> index 8a12984e7495..6d6220634e9e 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>> @@ -773,6 +773,91 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
>>          return 0;
>>   }
>>   
>> +static int
>> +i915_gem_context_reconfigure_sseu(struct i915_gem_context *ctx,
>> +                                 struct intel_engine_cs *engine,
>> +                                 struct intel_sseu sseu)
>> +{
>> +       struct drm_i915_private *i915 = ctx->i915;
>> +       struct i915_request *rq;
>> +       struct intel_ring *ring;
>> +       int ret;
>> +
>> +       lockdep_assert_held(&i915->drm.struct_mutex);
>> +
>> +       /* Submitting requests etc needs the hw awake. */
>> +       intel_runtime_pm_get(i915);
>> +
>> +       i915_retire_requests(i915);
> 
> ?

I wondered myself but did not make myself dig through all the history of 
the series. Cannot think that it does anything useful in the current design.

>> +
>> +       /* Now use the RCS to actually reconfigure. */
>> +       engine = i915->engine[RCS];
> 
> ? Modifying registers stored in another engine's context image.

Well, I was hoping design was kind of agreed by now.

>> +
>> +       rq = i915_request_alloc(engine, i915->kernel_context);
>> +       if (IS_ERR(rq)) {
>> +               ret = PTR_ERR(rq);
>> +               goto out_put;
>> +       }
>> +
>> +       ret = engine->emit_rpcs_config(rq, ctx, sseu);
> 
> It's just an LRI, I'd rather we do it directly unless there's evidence
> that there will be na explicit rpcs config instruction in future. It
> just doesn't seem general enough.

No complaints, can do.

>> +       if (ret)
>> +               goto out_add;
>> +
>> +       /* Queue this switch after all other activity */
> 
> Only needs to be after the target ctx.

True, so find just the timeline belonging to the target context. Some 
backpointers would be needed to find it. Or a walk and compare against 
target ce->ring->timeline->fence_context. Anything more than the 
latter doesn't sound justified for this use case.

> 
>> +       list_for_each_entry(ring, &i915->gt.active_rings, active_link) {
>> +               struct i915_request *prev;
>> +
>> +               prev = last_request_on_engine(ring->timeline, engine);
> 
> As constructed above you need target-engine + RCS.

Target engine is always RCS. Looks like the engine param to this 
function is pointless.

> 
>> +               if (prev)
>> +                       i915_sw_fence_await_sw_fence_gfp(&rq->submit,
>> +                                                        &prev->submit,
>> +                                                        I915_FENCE_GFP);
>> +       }
>> +
>> +       i915_gem_set_global_barrier(i915, rq);
> 
> This is just for a link from ctx-engine to this rq. Overkill much?
> Presumably this stems from using the wrong engine.

AFAIU this is to ensure the target context cannot get ahead of this request. 
Without it, it could overtake, and then there is no guarantee that execbufs 
following the set param will run with the new SSEU configuration.
> 
>> +
>> +out_add:
>> +       i915_request_add(rq);
> 
> And I'd still recommend not using indirect access if we can apply the
> changes immediately.

Direct (CPU) access means blocking in set param until the context is 
idle. AFAIR in one of the earlier versions a GPU idling approach was used, 
but I thought that was a very dangerous API. Can we somehow take just one 
context out of circulation? Still have to idle since it could be deep in 
the queue. The current approach is nicely async. And I don't like having 
two paths - direct if unpinned, or via GPU if pinned. In fact it would 
even be a third path since normally it happens via init_reg_state of 
pristine contexts.

Regards,

Tvrtko
Chris Wilson Aug. 14, 2018, 6:53 p.m. UTC | #8
Quoting Tvrtko Ursulin (2018-08-14 19:44:09)
> 
> On 14/08/2018 15:59, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-08-14 15:40:58)
> >> [...]
> >> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> >> index 8a12984e7495..6d6220634e9e 100644
> >> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> >> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> >> @@ -773,6 +773,91 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
> >>          return 0;
> >>   }
> >>   
> >> +static int
> >> +i915_gem_context_reconfigure_sseu(struct i915_gem_context *ctx,
> >> +                                 struct intel_engine_cs *engine,
> >> +                                 struct intel_sseu sseu)
> >> +{
> >> +       struct drm_i915_private *i915 = ctx->i915;
> >> +       struct i915_request *rq;
> >> +       struct intel_ring *ring;
> >> +       int ret;
> >> +
> >> +       lockdep_assert_held(&i915->drm.struct_mutex);
> >> +
> >> +       /* Submitting requests etc needs the hw awake. */
> >> +       intel_runtime_pm_get(i915);
> >> +
> >> +       i915_retire_requests(i915);
> > 
> > ?
> 
> I wondered myself but did not make myself dig through all the history of 
> the series. Cannot think that it does anything useful in the current design.
> 
> >> +
> >> +       /* Now use the RCS to actually reconfigure. */
> >> +       engine = i915->engine[RCS];
> > 
> > ? Modifying registers stored in another engine's context image.
> 
> Well, I was hoping design was kind of agreed by now.

:-p

This wasn't about using requests per-se, but raising the question of why
use rcs to adjust the vcs context image. If we used vcs, the
question of serialisation is then only on that engine.
 
> >> +       rq = i915_request_alloc(engine, i915->kernel_context);
> >> +       if (IS_ERR(rq)) {
> >> +               ret = PTR_ERR(rq);
> >> +               goto out_put;
> >> +       }
> >> +
> >> +       ret = engine->emit_rpcs_config(rq, ctx, sseu);
> > 
> > It's just an LRI, I'd rather we do it directly unless there's evidence
> > that there will be na explicit rpcs config instruction in future. It
> > just doesn't seem general enough.
> 
> No complaints, can do.
> 
> >> +       if (ret)
> >> +               goto out_add;
> >> +
> >> +       /* Queue this switch after all other activity */
> > 
> > Only needs to be after the target ctx.
> 
> True, so find just the timeline belonging to target context. Some 
> backpointers would be needed to find it. Or a walk and compare against 
> target ce->ring->timeline->fence_context. Sounds like more than the 
> latter isn't justified for this use case.

Right, we should be able to use ce->ring->timeline.

> >> +       list_for_each_entry(ring, &i915->gt.active_rings, active_link) {
> >> +               struct i915_request *prev;
> >> +
> >> +               prev = last_request_on_engine(ring->timeline, engine);
> > 
> > As constructed above you need target-engine + RCS.
> 
> Target engine is always RCS. Looks like the engine param to this 
> function is pointless.

Always always? At the beginning the speculation was that the other
engines were going to get their own subslice parameters. I guess that
was on the same design sheet as the VCS commands...

> >> +               if (prev)
> >> +                       i915_sw_fence_await_sw_fence_gfp(&rq->submit,
> >> +                                                        &prev->submit,
> >> +                                                        I915_FENCE_GFP);
> >> +       }
> >> +
> >> +       i915_gem_set_global_barrier(i915, rq);
> > 
> > This is just for a link from ctx-engine to this rq. Overkill much?
> > Presumably this stems from using the wrong engine.
> 
> AFAIU this is to ensure target context cannot get ahead of this request. 
> Without it could overtake and then there is no guarantee execbufs 
> following set param will be with new SSEU configuration.

Right, but we only need a fence on the target context to this request.
The existing way would be to queue an empty request on the target with
the await on this rq, but if you feel that is one request too many we
can use a timeline barrier. Or if that is too narrow, an engine barrier.
Over time, I can see that we may want all 3 barrier levels. The cost
isn't too onerous.

> >> +
> >> +out_add:
> >> +       i915_request_add(rq);
> > 
> > And I'd still recommend not using indirect access if we can apply the
> > changes immediately.
> 
> Direct (CPU) access means blocking in set param until the context is 
> idle.

No, it doesn't. It is just a choice of whether you use a request to a
pinned context, or direct access if idle (literally ce->pin_count).
Direct access is going to be near zero overhead.
-Chris
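Chris's point can be modelled outside the driver: the choice of update path hinges only on whether the context is currently pinned. A toy sketch of that dispatch, with stand-in types and names rather than real i915 internals:

```c
#include <stdbool.h>

/* Toy model of the design choice under discussion: if the context is
 * idle (pin_count == 0) its image lives only in memory and can be
 * rewritten directly from the CPU; if it is pinned, the hardware may
 * hold its own copy, so the update must travel via a request.  All
 * identifiers here are hypothetical stand-ins. */
enum sseu_update_path {
	UPDATE_DIRECT_CPU,   /* rewrite the context image in place */
	UPDATE_VIA_REQUEST,  /* queue an LRI on the kernel context */
};

struct toy_context_engine {
	int pin_count;  /* >0 while the context may be live on the GPU */
};

static enum sseu_update_path
choose_sseu_path(const struct toy_context_engine *ce)
{
	return ce->pin_count ? UPDATE_VIA_REQUEST : UPDATE_DIRECT_CPU;
}
```

The attraction of this split is that the common case (reconfiguring before first use) stays near zero overhead, at the cost of the two code paths Tvrtko objects to above.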
Tvrtko Ursulin Aug. 15, 2018, 9:12 a.m. UTC | #9
On 14/08/2018 19:53, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-08-14 19:44:09)
>>
>> On 14/08/2018 15:59, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2018-08-14 15:40:58)
>>>> From: Chris Wilson <chris@chris-wilson.co.uk>
>>>>
>>>> We want to allow userspace to reconfigure the subslice configuration for
>>>> its own use case. To do so, we expose a context parameter to allow
>>>> adjustment of the RPCS register stored within the context image (and
>>>> currently not accessible via LRI). If the context is adjusted before
>>>> first use, the adjustment is for "free"; otherwise if the context is
>>>> active we flush the context off the GPU (stalling all users) and force
>>>> the GPU to save the context to memory where we can modify it and so
>>>> ensure that the register is reloaded on next execution.
>>>>
>>>> The overhead of managing additional EU subslices can be significant,
>>>> especially in multi-context workloads. Non-GPGPU contexts should
>>>> preferably disable the subslices they are not using, and others should
>>>> fine-tune the number to match their workload.
>>>>
>>>> We expose complete control over the RPCS register, allowing
>>>> configuration of slice/subslice, via masks packed into a u64 for
>>>> simplicity. For example,
>>>>
>>>>           struct drm_i915_gem_context_param arg;
>>>>           struct drm_i915_gem_context_param_sseu sseu = { .class = 0,
>>>>                                                           .instance = 0, };
>>>>
>>>>           memset(&arg, 0, sizeof(arg));
>>>>           arg.ctx_id = ctx;
>>>>           arg.param = I915_CONTEXT_PARAM_SSEU;
>>>>           arg.value = (uintptr_t) &sseu;
>>>>           if (drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_GETPARAM, &arg) == 0) {
>>>>                   sseu.packed.subslice_mask = 0;
>>>>
>>>>                   drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &arg);
>>>>           }
>>>>
>>>> could be used to disable all subslices where supported.
>>>>
>>>> v2: Fix offset of CTX_R_PWR_CLK_STATE in intel_lr_context_set_sseu() (Lionel)
>>>>
>>>> v3: Add ability to program this per engine (Chris)
>>>>
>>>> v4: Move most get_sseu() into i915_gem_context.c (Lionel)
>>>>
>>>> v5: Validate sseu configuration against the device's capabilities (Lionel)
>>>>
>>>> v6: Change context powergating settings through MI_SDM on kernel context (Chris)
>>>>
>>>> v7: Synchronize the requests following a powergating setting change using a global
>>>>       dependency (Chris)
>>>>       Iterate timelines through dev_priv.gt.active_rings (Tvrtko)
>>>>       Disable RPCS configuration setting for non capable users (Lionel/Tvrtko)
>>>>
>>>> v8: s/union intel_sseu/struct intel_sseu/ (Lionel)
>>>>       s/dev_priv/i915/ (Tvrtko)
>>>>       Change uapi class/instance fields to u16 (Tvrtko)
>>>>       Bump mask fields to 64bits (Lionel)
>>>>       Don't return EPERM when dynamic sseu is disabled (Tvrtko)
>>>>
>>>> v9: Import context image into kernel context's ppgtt only when
>>>>       reconfiguring powergated slice/subslices (Chris)
>>>>       Use aliasing ppgtt when needed (Michel)
>>>>
>>>> Tvrtko Ursulin:
>>>>
>>>> v10:
>>>>    * Update for upstream changes.
>>>>    * Request submit needs a RPM reference.
>>>>    * Reject on !FULL_PPGTT for simplicity.
>>>>    * Pull out get/set param to helpers for readability and less indent.
>>>>    * Use i915_request_await_dma_fence in add_global_barrier to skip waits
>>>>      on the same timeline and avoid GEM_BUG_ON.
>>>>    * No need to explicitly assign a NULL pointer to engine in legacy mode.
>>>>    * No need to move gen8_make_rpcs up.
>>>>    * Factored out global barrier as prep patch.
>>>>    * Allow to only CAP_SYS_ADMIN if !Gen11.
>>>>
>>>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100899
>>>> Issue: https://github.com/intel/media-driver/issues/267
>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>>> Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
>>>> Cc: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>> Cc: Zhipeng Gong <zhipeng.gong@intel.com>
>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>> ---
>>>>    drivers/gpu/drm/i915/i915_gem_context.c | 187 +++++++++++++++++++++++-
>>>>    drivers/gpu/drm/i915/intel_lrc.c        |  55 +++++++
>>>>    drivers/gpu/drm/i915/intel_ringbuffer.h |   4 +
>>>>    include/uapi/drm/i915_drm.h             |  43 ++++++
>>>>    4 files changed, 288 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
>>>> index 8a12984e7495..6d6220634e9e 100644
>>>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>>>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>>>> @@ -773,6 +773,91 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
>>>>           return 0;
>>>>    }
>>>>    
>>>> +static int
>>>> +i915_gem_context_reconfigure_sseu(struct i915_gem_context *ctx,
>>>> +                                 struct intel_engine_cs *engine,
>>>> +                                 struct intel_sseu sseu)
>>>> +{
>>>> +       struct drm_i915_private *i915 = ctx->i915;
>>>> +       struct i915_request *rq;
>>>> +       struct intel_ring *ring;
>>>> +       int ret;
>>>> +
>>>> +       lockdep_assert_held(&i915->drm.struct_mutex);
>>>> +
>>>> +       /* Submitting requests etc needs the hw awake. */
>>>> +       intel_runtime_pm_get(i915);
>>>> +
>>>> +       i915_retire_requests(i915);
>>>
>>> ?
>>
>> I wondered myself but did not make myself dig through all the history of
>> the series. Cannot think that it does anything useful in the current design.
>>
>>>> +
>>>> +       /* Now use the RCS to actually reconfigure. */
>>>> +       engine = i915->engine[RCS];
>>>
>>> ? Modifying registers stored in another engine's context image.
>>
>> Well, I was hoping design was kind of agreed by now.
> 
> :-p
> 
> This wasn't about using requests per-se, but raising the question of why
> use rcs to adjust the vcs context image. If we used vcs, the
> question of serialisation is then only on that engine.

It only ever modifies the RCS context image, since AFAIU that is the only 
engine on which the feature is supported. And it does so via the RCS, 
which again AFAIU ensures not only that we received the user interrupt 
for the previous rq on the timeline, but that it has been fully context 
saved as well.

In the light of this perhaps the uAPI allowing set param on any engine 
is incorrect for now, and it should instead return -ENODEV for !RCS. It 
is misleading that SSEU config on other engines will succeed, but in 
fact the RCS context image is modified and SSEU configuration only 
applies if/when the RCS context runs.

So now I am thinking the uAPI is fine to support other engines for now, 
to be future proof, but unless I am missing something it should really 
return -ENODEV for !RCS.

>>>> +       rq = i915_request_alloc(engine, i915->kernel_context);
>>>> +       if (IS_ERR(rq)) {
>>>> +               ret = PTR_ERR(rq);
>>>> +               goto out_put;
>>>> +       }
>>>> +
>>>> +       ret = engine->emit_rpcs_config(rq, ctx, sseu);
>>>
>>> It's just an LRI, I'd rather we do it directly unless there's evidence
>>> that there will be an explicit rpcs config instruction in future. It
>>> just doesn't seem general enough.
>>
>> No complaints, can do.
>>
>>>> +       if (ret)
>>>> +               goto out_add;
>>>> +
>>>> +       /* Queue this switch after all other activity */
>>>
>>> Only needs to be after the target ctx.
>>
>> True, so find just the timeline belonging to the target context. Some
>> backpointers would be needed to find it. Or a walk and compare against
>> the target ce->ring->timeline->fence_context. Anything more than the
>> latter doesn't sound justified for this use case.
> 
> Right, we should be able to use ce->ring->timeline.
> 
>>>> +       list_for_each_entry(ring, &i915->gt.active_rings, active_link) {
>>>> +               struct i915_request *prev;
>>>> +
>>>> +               prev = last_request_on_engine(ring->timeline, engine);
>>>
>>> As constructed above you need target-engine + RCS.
>>
>> Target engine is always RCS. Looks like the engine param to this
>> function is pointless.
> 
> Always always? At the beginning the speculation was that the other
> engines were going to get their own subslice parameters. I guess that
> was on the same design sheet as the VCS commands...

See above.

> 
>>>> +               if (prev)
>>>> +                       i915_sw_fence_await_sw_fence_gfp(&rq->submit,
>>>> +                                                        &prev->submit,
>>>> +                                                        I915_FENCE_GFP);
>>>> +       }
>>>> +
>>>> +       i915_gem_set_global_barrier(i915, rq);
>>>
>>> This is just for a link from ctx-engine to this rq. Overkill much?
>>> Presumably this stems from using the wrong engine.
>>
>> AFAIU this is to ensure the target context cannot get ahead of this request.
>> Without it, it could overtake and then there is no guarantee execbufs
>> following set param will run with the new SSEU configuration.
> 
> Right, but we only need a fence on the target context to this request.
> The existing way would be to queue an empty request on the target with
> the await on this rq, but if you feel that is one request too many we
> can use a timeline barrier. Or if that is too narrow, an engine barrier.
> Over time, I can see that we may want all 3 barrier levels. The cost
> isn't too onerous.

Not sure exactly what you are thinking here but I think we do need a 
global barrier, or at least a barrier local to the RCS engine.

Since the context update is done from the kernel context, we have to 
ensure new requests against the target context cannot overtake the 
context update request.

Perhaps it would be sufficient to set the highest, non-user-accessible 
priority on the update request? A barrier sounds more future proof and 
correct from a design point of view though.

> 
>>>> +
>>>> +out_add:
>>>> +       i915_request_add(rq);
>>>
>>> And I'd still recommend not using indirect access if we can apply the
>>> changes immediately.
>>
>> Direct (CPU) access means blocking in set param until the context is
>> idle.
> 
> No, it doesn't. It is just a choice of whether you use a request to a
> pinned context, or direct access if idle (literally ce->pin_count).
> Direct access is going to be near zero overhead.

You cut all the rest of my comments here so it is unclear what you meant.

It was mostly about avoiding multiple paths to do the same thing. As I 
see it, we cannot have only the CPU update path, since that implies 
idling the GPU. If you disagree with that please explain.

And if we do need the SDM path anyway, the CPU update path could be 
added later. I agree it would bring a small advantage of not having to 
add a global barrier *if* the target context is idle *and* RCS is busy 
with other contexts.

Regards,

Tvrtko
Tvrtko Ursulin Aug. 15, 2018, 11:51 a.m. UTC | #10
On 14/08/2018 16:22, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-08-14 15:40:58)
>> +static int
>> +i915_gem_context_reconfigure_sseu(struct i915_gem_context *ctx,
>> +                                 struct intel_engine_cs *engine,
>> +                                 struct intel_sseu sseu)
>> +{
>> +       struct drm_i915_private *i915 = ctx->i915;
>> +       struct i915_request *rq;
>> +       struct intel_ring *ring;
>> +       int ret;
>> +
>> +       lockdep_assert_held(&i915->drm.struct_mutex);
>> +
>> +       /* Submitting requests etc needs the hw awake. */
>> +       intel_runtime_pm_get(i915);
>> +
>> +       i915_retire_requests(i915);
>> +
>> +       /* Now use the RCS to actually reconfigure. */
>> +       engine = i915->engine[RCS];
>> +
>> +       rq = i915_request_alloc(engine, i915->kernel_context);
>> +       if (IS_ERR(rq)) {
>> +               ret = PTR_ERR(rq);
>> +               goto out_put;
>> +       }
>> +
>> +       ret = engine->emit_rpcs_config(rq, ctx, sseu);
>> +       if (ret)
>> +               goto out_add;
>> +
>> +       /* Queue this switch after all other activity */
>> +       list_for_each_entry(ring, &i915->gt.active_rings, active_link) {
>> +               struct i915_request *prev;
>> +
>> +               prev = last_request_on_engine(ring->timeline, engine);
>> +               if (prev)
>> +                       i915_sw_fence_await_sw_fence_gfp(&rq->submit,
>> +                                                        &prev->submit,
>> +                                                        I915_FENCE_GFP);
>> +       }
>> +
>> +       i915_gem_set_global_barrier(i915, rq);
>> +
>> +out_add:
>> +       i915_request_add(rq);
>> +out_put:
>> +       intel_runtime_pm_put(i915);
>> +
>> +       return ret;
> 
> Looks like we should be able to hook this up to a selftest to confirm
> the modification does land in the target context image, and a SRM to
> confirm it loaded.

Lionel wrote an IGT which reads it back via SRM so that should be 
covered. I will be posting the IGT counterpart at some point as well, 
ideally when the agreement on the i915 side is there.

Regards,

Tvrtko
Chris Wilson Aug. 15, 2018, 11:56 a.m. UTC | #11
Quoting Tvrtko Ursulin (2018-08-15 12:51:28)
> 
> On 14/08/2018 16:22, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-08-14 15:40:58)
> > Looks like we should be able to hook this up to a selftest to confirm
> > the modification does land in the target context image, and a SRM to
> > confirm it loaded.
> 
> Lionel wrote an IGT which reads it back via SRM so that should be 
> covered. I will be posting the IGT counterpart at some point as well, 
> ideally when the agreement on the i915 side is there.

Wouldn't you rather have both? I know I would :-p

The emphasis is slightly different, IGT wants to cover the ABI
behaviour, selftests looking towards the edge cases hard to reach
otherwise. But even a trivial selftest would let us know the series
functions as intended.
-Chris
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 8a12984e7495..6d6220634e9e 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -773,6 +773,91 @@  int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
 	return 0;
 }
 
+static int
+i915_gem_context_reconfigure_sseu(struct i915_gem_context *ctx,
+				  struct intel_engine_cs *engine,
+				  struct intel_sseu sseu)
+{
+	struct drm_i915_private *i915 = ctx->i915;
+	struct i915_request *rq;
+	struct intel_ring *ring;
+	int ret;
+
+	lockdep_assert_held(&i915->drm.struct_mutex);
+
+	/* Submitting requests etc needs the hw awake. */
+	intel_runtime_pm_get(i915);
+
+	i915_retire_requests(i915);
+
+	/* Now use the RCS to actually reconfigure. */
+	engine = i915->engine[RCS];
+
+	rq = i915_request_alloc(engine, i915->kernel_context);
+	if (IS_ERR(rq)) {
+		ret = PTR_ERR(rq);
+		goto out_put;
+	}
+
+	ret = engine->emit_rpcs_config(rq, ctx, sseu);
+	if (ret)
+		goto out_add;
+
+	/* Queue this switch after all other activity */
+	list_for_each_entry(ring, &i915->gt.active_rings, active_link) {
+		struct i915_request *prev;
+
+		prev = last_request_on_engine(ring->timeline, engine);
+		if (prev)
+			i915_sw_fence_await_sw_fence_gfp(&rq->submit,
+							 &prev->submit,
+							 I915_FENCE_GFP);
+	}
+
+	i915_gem_set_global_barrier(i915, rq);
+
+out_add:
+	i915_request_add(rq);
+out_put:
+	intel_runtime_pm_put(i915);
+
+	return ret;
+}
+
+static int get_sseu(struct i915_gem_context *ctx,
+		    struct drm_i915_gem_context_param *args)
+{
+	struct drm_i915_gem_context_param_sseu user_sseu;
+	struct intel_engine_cs *engine;
+	struct intel_context *ce;
+
+	if (copy_from_user(&user_sseu, u64_to_user_ptr(args->value),
+			   sizeof(user_sseu)))
+		return -EFAULT;
+
+	if (user_sseu.rsvd1 || user_sseu.rsvd2)
+		return -EINVAL;
+
+	engine = intel_engine_lookup_user(ctx->i915,
+					  user_sseu.class,
+					  user_sseu.instance);
+	if (!engine)
+		return -EINVAL;
+
+	ce = to_intel_context(ctx, engine);
+
+	user_sseu.slice_mask = ce->sseu.slice_mask;
+	user_sseu.subslice_mask = ce->sseu.subslice_mask;
+	user_sseu.min_eus_per_subslice = ce->sseu.min_eus_per_subslice;
+	user_sseu.max_eus_per_subslice = ce->sseu.max_eus_per_subslice;
+
+	if (copy_to_user(u64_to_user_ptr(args->value), &user_sseu,
+			 sizeof(user_sseu)))
+		return -EFAULT;
+
+	return 0;
+}
+
 int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 				    struct drm_file *file)
 {
@@ -810,6 +895,9 @@  int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 	case I915_CONTEXT_PARAM_PRIORITY:
 		args->value = ctx->sched.priority;
 		break;
+	case I915_CONTEXT_PARAM_SSEU:
+		ret = get_sseu(ctx, args);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -819,6 +907,101 @@  int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 	return ret;
 }
 
+static int
+__user_to_context_sseu(const struct sseu_dev_info *device,
+		       const struct drm_i915_gem_context_param_sseu *user,
+		       struct intel_sseu *context)
+{
+	/* No zeros in any field. */
+	if (!user->slice_mask || !user->subslice_mask ||
+	    !user->min_eus_per_subslice || !user->max_eus_per_subslice)
+		return -EINVAL;
+
+	/* Max > min. */
+	if (user->max_eus_per_subslice < user->min_eus_per_subslice)
+		return -EINVAL;
+
+	/* Check validity against hardware. */
+	if (user->slice_mask & ~device->slice_mask)
+		return -EINVAL;
+
+	if (user->subslice_mask & ~device->subslice_mask[0])
+		return -EINVAL;
+
+	if (user->max_eus_per_subslice > device->max_eus_per_subslice)
+		return -EINVAL;
+
+	context->slice_mask = user->slice_mask;
+	context->subslice_mask = user->subslice_mask;
+	context->min_eus_per_subslice = user->min_eus_per_subslice;
+	context->max_eus_per_subslice = user->max_eus_per_subslice;
+
+	return 0;
+}
+
+static int set_sseu(struct i915_gem_context *ctx,
+		    struct drm_i915_gem_context_param *args)
+{
+	struct drm_i915_private *i915 = ctx->i915;
+	struct drm_i915_gem_context_param_sseu user_sseu;
+	struct intel_engine_cs *engine;
+	struct intel_sseu ctx_sseu;
+	struct intel_context *ce;
+	enum intel_engine_id id;
+	int ret;
+
+	if (args->size)
+		return -EINVAL;
+
+	if (!USES_FULL_PPGTT(i915))
+		return -ENODEV;
+
+	if (!IS_GEN11(i915) && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (copy_from_user(&user_sseu, u64_to_user_ptr(args->value),
+			   sizeof(user_sseu)))
+		return -EFAULT;
+
+	if (user_sseu.rsvd1 || user_sseu.rsvd2)
+		return -EINVAL;
+
+	engine = intel_engine_lookup_user(i915,
+					  user_sseu.class,
+					  user_sseu.instance);
+	if (!engine)
+		return -EINVAL;
+
+	if (!engine->emit_rpcs_config)
+		return -ENODEV;
+
+	ret = __user_to_context_sseu(&INTEL_INFO(i915)->sseu, &user_sseu,
+				     &ctx_sseu);
+	if (ret)
+		return ret;
+
+	ce = to_intel_context(ctx, engine);
+
+	/* Nothing to do if unmodified. */
+	if (!memcmp(&ce->sseu, &ctx_sseu, sizeof(ctx_sseu)))
+		return 0;
+
+	ret = i915_gem_context_reconfigure_sseu(ctx, engine, ctx_sseu);
+	if (ret)
+		return ret;
+
+	/*
+	 * Copy the configuration to all engines. Our hardware doesn't
+	 * currently support different configurations for each engine.
+	 */
+	for_each_engine(engine, i915, id) {
+		ce = to_intel_context(ctx, engine);
+		ce->sseu = ctx_sseu;
+	}
+
+	return 0;
+}
+
 int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 				    struct drm_file *file)
 {
@@ -884,7 +1067,9 @@  int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 				ctx->sched.priority = priority;
 		}
 		break;
-
+	case I915_CONTEXT_PARAM_SSEU:
+		ret = set_sseu(ctx, args);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 8a2997be7ef7..0f780c666e98 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -2232,6 +2232,60 @@  static void gen8_emit_breadcrumb_rcs(struct i915_request *request, u32 *cs)
 }
 static const int gen8_emit_breadcrumb_rcs_sz = 8 + WA_TAIL_DWORDS;
 
+static int gen8_emit_rpcs_config(struct i915_request *rq,
+				 struct i915_gem_context *ctx,
+				 struct intel_sseu sseu)
+{
+	struct drm_i915_private *i915 = rq->i915;
+	struct intel_context *ce = to_intel_context(ctx, i915->engine[RCS]);
+	struct i915_vma *vma;
+	u64 offset;
+	u32 *cs;
+	int err;
+
+	/* Let the deferred state allocation take care of this. */
+	if (!ce->state)
+		return 0;
+
+	vma = i915_vma_instance(ce->state->obj,
+				&i915->kernel_context->ppgtt->vm,
+				NULL);
+	if (IS_ERR(vma))
+		return PTR_ERR(vma);
+
+	err = i915_vma_pin(vma, 0, 0, PIN_USER);
+	if (err) {
+		i915_vma_close(vma);
+		return err;
+	}
+
+	err = i915_vma_move_to_active(vma, rq, EXEC_OBJECT_WRITE);
+	if (unlikely(err)) {
+		i915_vma_close(vma);
+		return err;
+	}
+
+	i915_vma_unpin(vma);
+
+	cs = intel_ring_begin(rq, 4);
+	if (IS_ERR(cs))
+		return PTR_ERR(cs);
+
+	offset = vma->node.start +
+		LRC_STATE_PN * PAGE_SIZE +
+		(CTX_R_PWR_CLK_STATE + 1) * 4;
+
+	*cs++ = MI_STORE_DWORD_IMM_GEN4;
+	*cs++ = lower_32_bits(offset);
+	*cs++ = upper_32_bits(offset);
+	*cs++ = gen8_make_rpcs(&INTEL_INFO(i915)->sseu,
+			       intel_engine_prepare_sseu(rq->engine, sseu));
+
+	intel_ring_advance(rq, cs);
+
+	return 0;
+}
+
 static int gen8_init_rcs_context(struct i915_request *rq)
 {
 	int ret;
@@ -2324,6 +2378,7 @@  logical_ring_default_vfuncs(struct intel_engine_cs *engine)
 	engine->emit_flush = gen8_emit_flush;
 	engine->emit_breadcrumb = gen8_emit_breadcrumb;
 	engine->emit_breadcrumb_sz = gen8_emit_breadcrumb_sz;
+	engine->emit_rpcs_config = gen8_emit_rpcs_config;
 
 	engine->set_default_submission = intel_execlists_set_default_submission;
 
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 9090885d57de..acb8b6fe912a 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -477,6 +477,10 @@  struct intel_engine_cs {
 	void		(*emit_breadcrumb)(struct i915_request *rq, u32 *cs);
 	int		emit_breadcrumb_sz;
 
+	int		(*emit_rpcs_config)(struct i915_request *rq,
+					    struct i915_gem_context *ctx,
+					    struct intel_sseu sseu);
+
 	/* Pass the request to the hardware queue (e.g. directly into
 	 * the legacy ringbuffer or to the end of an execlist).
 	 *
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index a4446f452040..e195c38b15a6 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1478,9 +1478,52 @@  struct drm_i915_gem_context_param {
 #define   I915_CONTEXT_MAX_USER_PRIORITY	1023 /* inclusive */
 #define   I915_CONTEXT_DEFAULT_PRIORITY		0
 #define   I915_CONTEXT_MIN_USER_PRIORITY	-1023 /* inclusive */
+	/*
+	 * When using the following param, value should be a pointer to
+	 * drm_i915_gem_context_param_sseu.
+	 */
+#define I915_CONTEXT_PARAM_SSEU		0x7
 	__u64 value;
 };
 
+struct drm_i915_gem_context_param_sseu {
+	/*
+	 * Engine class & instance to be configured or queried.
+	 */
+	__u16 class;
+	__u16 instance;
+
+	/*
+	 * Unused for now. Must be cleared to zero.
+	 */
+	__u32 rsvd1;
+
+	/*
+	 * Mask of slices to enable for the context. Valid values are a subset
+	 * of the bitmask value returned for I915_PARAM_SLICE_MASK.
+	 */
+	__u64 slice_mask;
+
+	/*
+	 * Mask of subslices to enable for the context. Valid values are a
+	 * subset of the bitmask value returned by I915_PARAM_SUBSLICE_MASK.
+	 */
+	__u64 subslice_mask;
+
+	/*
+	 * Minimum/Maximum number of EUs to enable per subslice for the
+	 * context. min_eus_per_subslice must be less than or equal to
+	 * max_eus_per_subslice.
+	 */
+	__u16 min_eus_per_subslice;
+	__u16 max_eus_per_subslice;
+
+	/*
+	 * Unused for now. Must be cleared to zero.
+	 */
+	__u32 rsvd2;
+};
+
 enum drm_i915_oa_format {
 	I915_OA_FORMAT_A13 = 1,	    /* HSW only */
 	I915_OA_FORMAT_A29,	    /* HSW only */