[4/7] drm/i915/perf: lock powergating configuration to default when active

Message ID 20180905142222.3251-5-tvrtko.ursulin@linux.intel.com (mailing list archive)
State New, archived
Series Per context dynamic (sub)slice power-gating

Commit Message

Tvrtko Ursulin Sept. 5, 2018, 2:22 p.m. UTC
From: Lionel Landwerlin <lionel.g.landwerlin@intel.com>

If some of the contexts submitting workloads to the GPU have been
configured to shut down slices/subslices, we might lose the NOA
configurations written in the NOA muxes.

One possible solution to this problem is to reprogram the NOA muxes
when we switch to a new context. We initially tried this in the
workaround batchbuffer but some concerns were raised about the cost
of reprogramming at every context switch. This solution is also not
without consequences from the userspace point of view. Reprogramming
of the muxes can only happen once the powergating configuration has
changed (which happens after context switch). This means for a window
of time during the recording, counters recorded by the OA unit might
be invalid. This requires userspace dealing with OA reports to discard
the invalid values.

Minimizing the reprogramming could be implemented by tracking the
last programmed configuration somewhere in GGTT and using MI_PREDICATE
to discard some of the programming commands, but the command streamer
would still have to parse all the MI_LRI instructions in the
workaround batchbuffer.

Another solution, which this change implements, is to simply disregard
the user requested configuration for the period of time when i915/perf
is active. There is no known issue with this apart from a performance
penalty for some media workloads that benefit from running on a
partially powergated GPU. We already prevent RC6 from affecting the
programming so it doesn't sound completely unreasonable to hold on
powergating for the same reason.
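
The override described above can be sketched in isolation. This is a minimal, self-contained model and not the driver code: struct sseu_config, struct device_model, perf_active and effective_sseu() are hypothetical stand-ins for struct intel_sseu, the device private data, the perf.oa.exclusive_stream check and the selection done inside gen8_make_rpcs().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct intel_sseu. */
struct sseu_config {
	uint8_t slice_mask;
	uint8_t subslice_mask;
};

/* Hypothetical stand-in for the device: default config plus perf state. */
struct device_model {
	struct sseu_config default_sseu; /* everything the hardware offers */
	bool perf_active;                /* models exclusive_stream != NULL */
};

/* Population count of an 8-bit mask, in the spirit of hweight8(). */
static unsigned int popcount8(uint8_t v)
{
	unsigned int n = 0;

	for (; v; v &= (uint8_t)(v - 1))
		n++;
	return n;
}

/*
 * Pick the configuration that actually gets programmed: the device
 * default while perf is active, the per-context request otherwise.
 */
static struct sseu_config effective_sseu(const struct device_model *dev,
					 const struct sseu_config *req)
{
	assert(dev && req);
	return dev->perf_active ? dev->default_sseu : *req;
}
```

In this model the per-context request is never modified; only the value derived from it changes, so the request takes effect again as soon as perf is no longer active and the register value is recomputed.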

v2: Leave RPCS programming in intel_lrc.c (Lionel)

v3: Update for s/union intel_sseu/struct intel_sseu/ (Lionel)
    More to_intel_context() (Tvrtko)
    s/dev_priv/i915/ (Tvrtko)

Tvrtko Ursulin:

v4:
 * Rebase for make_rpcs changes.

v5:
 * Apply OA restriction from make_rpcs directly.

v6:
 * Rebase for context image setup changes.

Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/i915_perf.c |  5 +++++
 drivers/gpu/drm/i915/intel_lrc.c | 30 ++++++++++++++++++++----------
 drivers/gpu/drm/i915/intel_lrc.h |  3 +++
 3 files changed, 28 insertions(+), 10 deletions(-)

Comments

Chris Wilson Sept. 5, 2018, 3:21 p.m. UTC | #1
Quoting Tvrtko Ursulin (2018-09-05 15:22:19)
> -static u32 make_rpcs(struct drm_i915_private *dev_priv,
> -                    struct intel_sseu *ctx_sseu)
> +u32 gen8_make_rpcs(struct drm_i915_private *dev_priv,
> +                  struct intel_sseu *req_sseu)

Should we retrospectively make this const?

(And any chance for a s/dev_priv/i915?)

>  {
>         const struct sseu_dev_info *sseu = &INTEL_INFO(dev_priv)->sseu;
>         bool subslice_pg = sseu->has_subslice_pg;
> -       u8 slices = hweight8(ctx_sseu->slice_mask);
> -       u8 subslices = hweight8(ctx_sseu->subslice_mask);
> +       struct intel_sseu ctx_sseu;
> +       u8 slices, subslices;
>         u32 rpcs = 0;
>  
> +       /*
> +        * If i915/perf is active, we want a stable powergating configuration
> +        * on the system. The most natural configuration to take in that case
> +        * is the default (i.e maximum the hardware can do).
> +        */
> +       if (unlikely(dev_priv->perf.oa.exclusive_stream))
> +               ctx_sseu = intel_device_default_sseu(dev_priv);
> +       else
> +               ctx_sseu = *req_sseu;

:(

I'm not sure if I can suggest anything better, but this does feel like a
layering violation.

It makes sense, which only makes it feel worse.
-Chris
Tvrtko Ursulin Sept. 6, 2018, 9:41 a.m. UTC | #2
On 05/09/2018 16:21, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-09-05 15:22:19)
>> -static u32 make_rpcs(struct drm_i915_private *dev_priv,
>> -                    struct intel_sseu *ctx_sseu)
>> +u32 gen8_make_rpcs(struct drm_i915_private *dev_priv,
>> +                  struct intel_sseu *req_sseu)
> 
> Should we retrospectively make this const?

Can do, but generally I try to avoid it in kernel code since most of the
time it is way more pain than benefit.

> (And any chance for a s/dev_priv/i915?)

Will check if it is doable without much noise at any of the stages.

>>   {
>>          const struct sseu_dev_info *sseu = &INTEL_INFO(dev_priv)->sseu;
>>          bool subslice_pg = sseu->has_subslice_pg;
>> -       u8 slices = hweight8(ctx_sseu->slice_mask);
>> -       u8 subslices = hweight8(ctx_sseu->subslice_mask);
>> +       struct intel_sseu ctx_sseu;
>> +       u8 slices, subslices;
>>          u32 rpcs = 0;
>>   
>> +       /*
>> +        * If i915/perf is active, we want a stable powergating configuration
>> +        * on the system. The most natural configuration to take in that case
>> +        * is the default (i.e maximum the hardware can do).
>> +        */
>> +       if (unlikely(dev_priv->perf.oa.exclusive_stream))
>> +               ctx_sseu = intel_device_default_sseu(dev_priv);
>> +       else
>> +               ctx_sseu = *req_sseu;
> 
> :(
> 
> I'm not sure if I can suggest anything better, but this does feel like a
> layering violation.
> 
> It makes sense, which only makes it feel worse.

It used to be a helper which applied the adjustment, but I wasn't happy
with how callers then had to know to call the helper, and decided that
handling it at the core is better in more than one way.

I think the bottom line is that there is a fundamental interaction
between the two, so some layering violation has to happen.

Regards,

Tvrtko
Lionel Landwerlin Sept. 6, 2018, 9:57 a.m. UTC | #3
On 05/09/2018 15:22, Tvrtko Ursulin wrote:
> diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
> index ccb20230df2c..dd65b72bddd4 100644
> --- a/drivers/gpu/drm/i915/i915_perf.c
> +++ b/drivers/gpu/drm/i915/i915_perf.c
> @@ -1677,6 +1677,11 @@ static void gen8_update_reg_state_unlocked(struct i915_gem_context *ctx,
>   
>   		CTX_REG(reg_state, state_offset, flex_regs[i], value);
>   	}
> +
> +	CTX_REG(reg_state, CTX_R_PWR_CLK_STATE, GEN8_R_PWR_CLK_STATE,
> +		gen8_make_rpcs(dev_priv,
> +			       &to_intel_context(ctx,
> +						 dev_priv->engine[RCS])->sseu));


I think there is one issue I missed in the previous iterations of this
patch.

gen8_update_reg_state_unlocked() is called when the GPU is parked
on the kernel context.

It's supposed to update all contexts, but I think we might not be able
to update the kernel context image while the GPU is using it.

A context save might happen after we edit the image, and that would
overwrite the values we just put in there.


The OA config is emitted by editing the context image in this function,
but also through the ring buffer in
gen8_switch_to_updated_kernel_context() for the kernel context.

Since we can't have a context modify its own RPCS value, we'll have to
resort to yet another context to do that for the kernel context.


I remember having a patch that created yet another kernel context (let's
call it the RPCS editing context), which is used to reconfigure RPCS for
every context but itself, and then had the kernel context reconfigure
this RPCS editing context.

Or alternatively, we could not do anything to it, because it's only going
to run to edit other contexts at a time when we don't care about power
configuration stability.
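
The lost-update concern above can be illustrated with a toy model. Everything in this sketch is hypothetical (struct ctx_model, struct hw_model, context_restore(), context_save(), edit_image()); a single 32-bit value stands in for the RPCS slot of a context image, and the real save/restore is of course done by the hardware, not by C functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of a context image: one saved register slot in memory. */
struct ctx_model {
	uint32_t image_rpcs;
};

/* Toy model of the hardware: the live register and the running context. */
struct hw_model {
	uint32_t live_rpcs;
	struct ctx_model *active;
};

/* On switching to a context, the register is loaded from its image. */
static void context_restore(struct hw_model *hw, struct ctx_model *ctx)
{
	hw->live_rpcs = ctx->image_rpcs;
	hw->active = ctx;
}

/* On switching away, the live register is written back to the image. */
static void context_save(struct hw_model *hw)
{
	assert(hw->active);
	hw->active->image_rpcs = hw->live_rpcs;
	hw->active = NULL;
}

/* CPU-side edit of an image; it only sticks if the context is idle. */
static void edit_image(struct ctx_model *ctx, uint32_t rpcs)
{
	ctx->image_rpcs = rpcs;
}
```

In this model, editing the image of an idle context sticks, whereas an edit to the image of the context the hardware is currently running is overwritten by the next context_save().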


-

Lionel


>   }
>   
>   /*
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 8a477e43dbca..9709c1fbe836 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1305,9 +1305,6 @@ static int __context_pin(struct i915_gem_context *ctx, struct i915_vma *vma)
>   	return i915_vma_pin(vma, 0, 0, flags);
>   }
>   
> -static u32 make_rpcs(struct drm_i915_private *dev_priv,
> -		     struct intel_sseu *ctx_sseu);
> -
>   static struct intel_context *
>   __execlists_context_pin(struct intel_engine_cs *engine,
>   			struct i915_gem_context *ctx,
> @@ -1350,7 +1347,7 @@ __execlists_context_pin(struct intel_engine_cs *engine,
>   	/* RPCS */
>   	if (engine->class == RENDER_CLASS) {
>   		ce->lrc_reg_state[CTX_R_PWR_CLK_STATE + 1] =
> -					make_rpcs(engine->i915, &ce->sseu);
> +					gen8_make_rpcs(engine->i915, &ce->sseu);
>   	}
>   
>   	ce->state->obj->pin_global++;
> @@ -2494,15 +2491,28 @@ int logical_xcs_ring_init(struct intel_engine_cs *engine)
>   	return logical_ring_init(engine);
>   }
>   
> -static u32 make_rpcs(struct drm_i915_private *dev_priv,
> -		     struct intel_sseu *ctx_sseu)
> +u32 gen8_make_rpcs(struct drm_i915_private *dev_priv,
> +		   struct intel_sseu *req_sseu)
>   {
>   	const struct sseu_dev_info *sseu = &INTEL_INFO(dev_priv)->sseu;
>   	bool subslice_pg = sseu->has_subslice_pg;
> -	u8 slices = hweight8(ctx_sseu->slice_mask);
> -	u8 subslices = hweight8(ctx_sseu->subslice_mask);
> +	struct intel_sseu ctx_sseu;
> +	u8 slices, subslices;
>   	u32 rpcs = 0;
>   
> +	/*
> +	 * If i915/perf is active, we want a stable powergating configuration
> +	 * on the system. The most natural configuration to take in that case
> +	 * is the default (i.e maximum the hardware can do).
> +	 */
> +	if (unlikely(dev_priv->perf.oa.exclusive_stream))
> +		ctx_sseu = intel_device_default_sseu(dev_priv);
> +	else
> +		ctx_sseu = *req_sseu;
> +
> +	slices = hweight8(ctx_sseu.slice_mask);
> +	subslices = hweight8(ctx_sseu.subslice_mask);
> +
>   	/*
>   	 * Since the SScount bitfield in GEN8_R_PWR_CLK_STATE is only three bits
>   	 * wide and Icelake has up to eight subslices, specfial programming is
> @@ -2572,13 +2582,13 @@ static u32 make_rpcs(struct drm_i915_private *dev_priv,
>   	if (sseu->has_eu_pg) {
>   		u32 val;
>   
> -		val = ctx_sseu->min_eus_per_subslice << GEN8_RPCS_EU_MIN_SHIFT;
> +		val = ctx_sseu.min_eus_per_subslice << GEN8_RPCS_EU_MIN_SHIFT;
>   		GEM_BUG_ON(val & ~GEN8_RPCS_EU_MIN_MASK);
>   		val &= GEN8_RPCS_EU_MIN_MASK;
>   
>   		rpcs |= val;
>   
> -		val = ctx_sseu->max_eus_per_subslice << GEN8_RPCS_EU_MAX_SHIFT;
> +		val = ctx_sseu.max_eus_per_subslice << GEN8_RPCS_EU_MAX_SHIFT;
>   		GEM_BUG_ON(val & ~GEN8_RPCS_EU_MAX_MASK);
>   		val &= GEN8_RPCS_EU_MAX_MASK;
>   
> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
> index f5a5502ecf70..11da6fc0002d 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.h
> +++ b/drivers/gpu/drm/i915/intel_lrc.h
> @@ -104,4 +104,7 @@ void intel_lr_context_resume(struct drm_i915_private *dev_priv);
>   
>   void intel_execlists_set_default_submission(struct intel_engine_cs *engine);
>   
> +u32 gen8_make_rpcs(struct drm_i915_private *dev_priv,
> +		   struct intel_sseu *ctx_sseu);
> +
>   #endif /* _INTEL_LRC_H_ */
Chris Wilson Sept. 6, 2018, 10:10 a.m. UTC | #4
Quoting Lionel Landwerlin (2018-09-06 10:57:47)
> On 05/09/2018 15:22, Tvrtko Ursulin wrote:
> > diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
> > index ccb20230df2c..dd65b72bddd4 100644
> > --- a/drivers/gpu/drm/i915/i915_perf.c
> > +++ b/drivers/gpu/drm/i915/i915_perf.c
> > @@ -1677,6 +1677,11 @@ static void gen8_update_reg_state_unlocked(struct i915_gem_context *ctx,
> >   
> >               CTX_REG(reg_state, state_offset, flex_regs[i], value);
> >       }
> > +
> > +     CTX_REG(reg_state, CTX_R_PWR_CLK_STATE, GEN8_R_PWR_CLK_STATE,
> > +             gen8_make_rpcs(dev_priv,
> > +                            &to_intel_context(ctx,
> > +                                              dev_priv->engine[RCS])->sseu));
> 
> 
> I think there is one issue I missed on the previous iterations of this 
> patch.
> 
> This gen8_update_reg_state_unlocked() is called when the GPU is parked 
> on the kernel context.
> 
> It's supposed to update all contexts, but I think we might not be able 
> to update the kernel context image while the GPU is using it.

The kernel context is only ever taken in extremis (you are either
parking or stalling userspace) so I don't care.
 
> Context save might happen after we edited the image and that would 
> override the values we just put in there.
> 
> 
> The OA config is emitted through context image edition in this function 
> but also through the ring buffer in 
> gen8_switch_to_updated_kernel_context() for the kernel context.
> 
> > Since we can't have a context modify its own RPCS value, we'll have to 
> resort to yet another context to do that for the kernel context.
> 
> 
> I remember having a patch that created yet another kernel context (let's 
> call it rpcs edition context), which is used to reconfigure rpcs for 
> every context but itself and then have the kernel context reconfigure 
> this rpcs edition context.
> 
> Or alternatively not do anything to it, because it's only going to run 
> to edit other contexts at a time when we don't care about power 
> configuration stability.

Exactly.
-Chris
Lionel Landwerlin Sept. 6, 2018, 10:18 a.m. UTC | #5
On 06/09/2018 11:10, Chris Wilson wrote:
> Quoting Lionel Landwerlin (2018-09-06 10:57:47)
>> On 05/09/2018 15:22, Tvrtko Ursulin wrote:
>>> diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
>>> index ccb20230df2c..dd65b72bddd4 100644
>>> --- a/drivers/gpu/drm/i915/i915_perf.c
>>> +++ b/drivers/gpu/drm/i915/i915_perf.c
>>> @@ -1677,6 +1677,11 @@ static void gen8_update_reg_state_unlocked(struct i915_gem_context *ctx,
>>>    
>>>                CTX_REG(reg_state, state_offset, flex_regs[i], value);
>>>        }
>>> +
>>> +     CTX_REG(reg_state, CTX_R_PWR_CLK_STATE, GEN8_R_PWR_CLK_STATE,
>>> +             gen8_make_rpcs(dev_priv,
>>> +                            &to_intel_context(ctx,
>>> +                                              dev_priv->engine[RCS])->sseu));
>>
>> I think there is one issue I missed on the previous iterations of this
>> patch.
>>
>> This gen8_update_reg_state_unlocked() is called when the GPU is parked
>> on the kernel context.
>>
>> It's supposed to update all contexts, but I think we might not be able
>> to update the kernel context image while the GPU is using it.
> The kernel context is only ever taken in extremis (you are either
> parking or stalling userspace) so I don't care.


The patch exposing the RPCS configuration to userspace will make use of
the kernel context while OA/perf is enabled. Even if it reprograms the
locked value, that will break the power configuration stability on Gen11
(because the locked configuration will be different from the kernel
context configuration).


-

Lionel

>   
>> Context save might happen after we edited the image and that would
>> override the values we just put in there.
>>
>>
>> The OA config is emitted through context image edition in this function
>> but also through the ring buffer in
>> gen8_switch_to_updated_kernel_context() for the kernel context.
>>
>> Since we can't have a context modify its own RPCS value, we'll have to
>> resort to yet another context to do that for the kernel context.
>>
>>
>> I remember having a patch that created yet another kernel context (let's
>> call it rpcs edition context), which is used to reconfigure rpcs for
>> every context but itself and then have the kernel context reconfigure
>> this rpcs edition context.
>>
>> Or alternatively not do anything to it, because it's only going to run
>> to edit other contexts at a time when we don't care about power
>> configuration stability.
> Exactly.
> -Chris
>
Chris Wilson Sept. 6, 2018, 10:22 a.m. UTC | #6
Quoting Lionel Landwerlin (2018-09-06 11:18:01)
> On 06/09/2018 11:10, Chris Wilson wrote:
> > Quoting Lionel Landwerlin (2018-09-06 10:57:47)
> >> On 05/09/2018 15:22, Tvrtko Ursulin wrote:
> >>> diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
> >>> index ccb20230df2c..dd65b72bddd4 100644
> >>> --- a/drivers/gpu/drm/i915/i915_perf.c
> >>> +++ b/drivers/gpu/drm/i915/i915_perf.c
> >>> @@ -1677,6 +1677,11 @@ static void gen8_update_reg_state_unlocked(struct i915_gem_context *ctx,
> >>>    
> >>>                CTX_REG(reg_state, state_offset, flex_regs[i], value);
> >>>        }
> >>> +
> >>> +     CTX_REG(reg_state, CTX_R_PWR_CLK_STATE, GEN8_R_PWR_CLK_STATE,
> >>> +             gen8_make_rpcs(dev_priv,
> >>> +                            &to_intel_context(ctx,
> >>> +                                              dev_priv->engine[RCS])->sseu));
> >>
> >> I think there is one issue I missed on the previous iterations of this
> >> patch.
> >>
> >> This gen8_update_reg_state_unlocked() is called when the GPU is parked
> >> on the kernel context.
> >>
> >> It's supposed to update all contexts, but I think we might not be able
> >> to update the kernel context image while the GPU is using it.
> > The kernel context is only ever taken in extremis (you are either
> > parking or stalling userspace) so I don't care.
> 
> 
> The patch exposing the RPCS configuration to userspace will make use of 
> the kernel context while OA/perf is enabled. Even if it reprograms the 
> locked value that will break the power configuration stability on Gen11 
> (because the locked configuration will be different from the kernel 
> context configuration).

Sure, but as you point out that's only on changing configuration.

What's missing in the patch is that we only bail early if the new sseu
matches the ce->sseu, but that doesn't necessarily match what's in the
context due to OA. (Or maybe I missed the conversion to rpcs value and
checking.)
-Chris
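
One way to read the concern above: an early-out that compares only the requested configuration with the stored request cannot see that the value in the context image was written with the perf override applied. The sketch below compares resulting register values instead. It is an assumption-laden illustration, not the driver code: struct sseu_req, model_make_rpcs(), needs_update_naive() and needs_update() are hypothetical names, and the bit layout is made up rather than the real RPCS encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-context sseu request. */
struct sseu_req {
	uint8_t slice_mask;
	uint8_t subslice_mask;
};

/* Made-up encoding for illustration only; not the real RPCS layout. */
static uint32_t model_make_rpcs(const struct sseu_req *req,
				const struct sseu_req *dflt, bool perf_active)
{
	const struct sseu_req *eff = perf_active ? dflt : req;

	return ((uint32_t)eff->slice_mask << 8) | eff->subslice_mask;
}

/* Naive early-out: compares requests only, blind to the perf override. */
static bool needs_update_naive(const struct sseu_req *stored,
			       const struct sseu_req *wanted)
{
	return stored->slice_mask != wanted->slice_mask ||
	       stored->subslice_mask != wanted->subslice_mask;
}

/* Compares what is in the image with what would be written right now. */
static bool needs_update(uint32_t image_rpcs, const struct sseu_req *wanted,
			 const struct sseu_req *dflt, bool perf_active)
{
	return image_rpcs != model_make_rpcs(wanted, dflt, perf_active);
}
```

With an image written while perf was active, the naive check reports "nothing to do" for an unchanged request even after perf has stopped, while the value-based check reports that the image is stale. The value-based check also avoids a redundant rewrite while perf is still active, since the locked value is already in place.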
Lionel Landwerlin Sept. 6, 2018, 10:36 a.m. UTC | #7
On 06/09/2018 11:22, Chris Wilson wrote:
> Quoting Lionel Landwerlin (2018-09-06 11:18:01)
>> On 06/09/2018 11:10, Chris Wilson wrote:
>>> Quoting Lionel Landwerlin (2018-09-06 10:57:47)
>>>> On 05/09/2018 15:22, Tvrtko Ursulin wrote:
>>>>> diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
>>>>> index ccb20230df2c..dd65b72bddd4 100644
>>>>> --- a/drivers/gpu/drm/i915/i915_perf.c
>>>>> +++ b/drivers/gpu/drm/i915/i915_perf.c
>>>>> @@ -1677,6 +1677,11 @@ static void gen8_update_reg_state_unlocked(struct i915_gem_context *ctx,
>>>>>     
>>>>>                 CTX_REG(reg_state, state_offset, flex_regs[i], value);
>>>>>         }
>>>>> +
>>>>> +     CTX_REG(reg_state, CTX_R_PWR_CLK_STATE, GEN8_R_PWR_CLK_STATE,
>>>>> +             gen8_make_rpcs(dev_priv,
>>>>> +                            &to_intel_context(ctx,
>>>>> +                                              dev_priv->engine[RCS])->sseu));
>>>> I think there is one issue I missed on the previous iterations of this
>>>> patch.
>>>>
>>>> This gen8_update_reg_state_unlocked() is called when the GPU is parked
>>>> on the kernel context.
>>>>
>>>> It's supposed to update all contexts, but I think we might not be able
>>>> to update the kernel context image while the GPU is using it.
>>> The kernel context is only ever taken in extremis (you are either
>>> parking or stalling userspace) so I don't care.
>>
>> The patch exposing the RPCS configuration to userspace will make use of
>> the kernel context while OA/perf is enabled. Even if it reprograms the
>> locked value that will break the power configuration stability on Gen11
>> (because the locked configuration will be different from the kernel
>> context configuration).
> Sure, but as you point out that's only on changing configuration.
>
> What's missing in the patch is that we only bail early if the new sseu
> matches the ce->sseu, but that doesn't necessarily match whats in the
> context due to OA. (Or maybe I missed the conversion to rpcs value and
> checking.)
> -Chris
>

Yep, because gen8_make_rpcs() post-processes the values stored at the 
gem context level, we risk rerunning the kernel context to write the 
existing value.
Sorry this is all so messy :(

-
Lionel
Tvrtko Ursulin Sept. 7, 2018, 8:26 a.m. UTC | #8
On 06/09/2018 11:36, Lionel Landwerlin wrote:
> On 06/09/2018 11:22, Chris Wilson wrote:
>> Quoting Lionel Landwerlin (2018-09-06 11:18:01)
>>> On 06/09/2018 11:10, Chris Wilson wrote:
>>>> Quoting Lionel Landwerlin (2018-09-06 10:57:47)
>>>>> On 05/09/2018 15:22, Tvrtko Ursulin wrote:
>>>>>> From: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
>>>>>>
>>>>>> If some of the contexts submitting workloads to the GPU have been
>>>>>> configured to shutdown slices/subslices, we might loose the NOA
>>>>>> configurations written in the NOA muxes.
>>>>>>
>>>>>> One possible solution to this problem is to reprogram the NOA muxes
>>>>>> when we switch to a new context. We initially tried this in the
>>>>>> workaround batchbuffer but some concerns where raised about the cost
>>>>>> of reprogramming at every context switch. This solution is also not
>>>>>> without consequences from the userspace point of view. Reprogramming
>>>>>> of the muxes can only happen once the powergating configuration has
>>>>>> changed (which happens after context switch). This means for a window
>>>>>> of time during the recording, counters recorded by the OA unit might
>>>>>> be invalid. This requires userspace dealing with OA reports to 
>>>>>> discard
>>>>>> the invalid values.
>>>>>>
>>>>>> Minimizing the reprogramming could be implemented by tracking of the
>>>>>> last programmed configuration somewhere in GGTT and use MI_PREDICATE
>>>>>> to discard some of the programming commands, but the command streamer
>>>>>> would still have to parse all the MI_LRI instructions in the
>>>>>> workaround batchbuffer.
>>>>>>
>>>>>> Another solution, which this change implements, is to simply 
>>>>>> disregard
>>>>>> the user requested configuration for the period of time when 
>>>>>> i915/perf
>>>>>> is active. There is no known issue with this apart from a performance
>>>>>> penality for some media workloads that benefit from running on a
>>>>>> partially powergated GPU. We already prevent RC6 from affecting the
>>>>>> programming so it doesn't sound completely unreasonable to hold on
>>>>>> powergating for the same reason.
>>>>>>
>>>>>> v2: Leave RPCS programming in intel_lrc.c (Lionel)
>>>>>>
>>>>>> v3: Update for s/union intel_sseu/struct intel_sseu/ (Lionel)
>>>>>>        More to_intel_context() (Tvrtko)
>>>>>>        s/dev_priv/i915/ (Tvrtko)
>>>>>>
>>>>>> Tvrtko Ursulin:
>>>>>>
>>>>>> v4:
>>>>>>     * Rebase for make_rpcs changes.
>>>>>>
>>>>>> v5:
>>>>>>     * Apply OA restriction from make_rpcs directly.
>>>>>>
>>>>>> v6:
>>>>>>     * Rebase for context image setup changes.
>>>>>>
>>>>>> Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
>>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>> ---
>>>>>>     drivers/gpu/drm/i915/i915_perf.c |  5 +++++
>>>>>>     drivers/gpu/drm/i915/intel_lrc.c | 30 
>>>>>> ++++++++++++++++++++----------
>>>>>>     drivers/gpu/drm/i915/intel_lrc.h |  3 +++
>>>>>>     3 files changed, 28 insertions(+), 10 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/i915_perf.c 
>>>>>> b/drivers/gpu/drm/i915/i915_perf.c
>>>>>> index ccb20230df2c..dd65b72bddd4 100644
>>>>>> --- a/drivers/gpu/drm/i915/i915_perf.c
>>>>>> +++ b/drivers/gpu/drm/i915/i915_perf.c
>>>>>> @@ -1677,6 +1677,11 @@ static void 
>>>>>> gen8_update_reg_state_unlocked(struct i915_gem_context *ctx,
>>>>>>                 CTX_REG(reg_state, state_offset, flex_regs[i], 
>>>>>> value);
>>>>>>         }
>>>>>> +
>>>>>> +     CTX_REG(reg_state, CTX_R_PWR_CLK_STATE, GEN8_R_PWR_CLK_STATE,
>>>>>> +             gen8_make_rpcs(dev_priv,
>>>>>> +                            &to_intel_context(ctx,
>>>>>> +                                              
>>>>>> dev_priv->engine[RCS])->sseu));
>>>>> I think there is one issue I missed on the previous iterations of this
>>>>> patch.
>>>>>
>>>>> This gen8_update_reg_state_unlocked() is called when the GPU is parked
>>>>> on the kernel context.
>>>>>
>>>>> It's supposed to update all contexts, but I think we might not be able
>>>>> to update the kernel context image while the GPU is using it.
>>>> The kernel context is only ever taken in extremis (you are either
>>>> parking or stalling userspace) so I don't care.
>>>
>>> The patch exposing the RPCS configuration to userspace will make use of
>>> the kernel context while OA/perf is enabled. Even if it reprograms the
>>> locked value that will break the power configuration stability on Gen11
>>> (because the locked configuration will be different from the kernel
>>> context configuration).
>> Sure, but as you point out that's only on changing configuration.
>>
>> What's missing in the patch is that we only bail early if the new sseu
>> matches the ce->sseu, but that doesn't necessarily match whats in the
>> context due to OA. (Or maybe I missed the conversion to rpcs value and
>> checking.)
>> -Chris
>>
> 
> Yep, because the gen8_make_rpcs() post processes the values store at the 
> gem context level, we risk rerunning the kernel context to write the 
> exiting value.
> Sorry this is all so messy :(

Let's see if I managed to follow here.

The current code indeed bails out at the set ctx param level if the 
requested state matches the ce->state. My thinking was that ce->state is 
the master state and whatever happens in "post processing" via 
gen8_make_rpcs should be hidden from it since the design is that the 
i915_perf.c will re-configure all contexts when the OA active status 
changes (to either direction).

So I don't see a problem in those two interactions.
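
A minimal sketch of the split described above, assuming simplified stand-in types (struct sseu_cfg, struct ctx_state, effective_sseu() and set_sseu() are hypothetical names, not the driver's): the context stores only what userspace requested, the OA lock is applied when the register value is derived, and the set-param bail-out compares against the stored request alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the driver's sseu structure. */
struct sseu_cfg {
	uint8_t slice_mask;
	uint8_t subslice_mask;
};

/* Hypothetical per-context state: only the user request is stored. */
struct ctx_state {
	struct sseu_cfg requested;
};

/* "Post processing": while OA is active the locked configuration wins. */
static struct sseu_cfg effective_sseu(const struct ctx_state *ce,
				      bool oa_active,
				      struct sseu_cfg locked)
{
	assert(ce);
	return oa_active ? locked : ce->requested;
}

/*
 * Set-param path. Returns true if the context image would need rewriting.
 * The early bail-out looks at the stored request only, never at the
 * effective (possibly OA-locked) value.
 */
static bool set_sseu(struct ctx_state *ce, struct sseu_cfg req)
{
	assert(ce);
	if (ce->requested.slice_mask == req.slice_mask &&
	    ce->requested.subslice_mask == req.subslice_mask)
		return false;

	ce->requested = req;
	return true;
}
```

With this shape, a request that differs from the stored one still triggers a rewrite while OA is active, even though the effective value stays at the locked configuration.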

Apart from one: get_param_sseu will lie a bit. We can discuss this one 
further. At one point I suggested we have two sets of masks in the 
uAPI, requested and active, so that userspace could query both what it 
set and what is actually active.

Now the second issue is whether i915_perf.c is able to reprogram the 
kernel context config.

Here it's true: it will write to the context image and that will get 
overwritten by context save.

If that is a problem for OA, I was initially wondering if a throw-away 
second "kernel" context could be used to re-program the real one, but 
perhaps even simpler - what about an mmio write to program the RPCS 
while the kernel context is active?

Regards,

Tvrtko
Chris Wilson Sept. 7, 2018, 8:59 a.m. UTC | #9
Quoting Tvrtko Ursulin (2018-09-07 09:26:27)
> 
> On 06/09/2018 11:36, Lionel Landwerlin wrote:
> > On 06/09/2018 11:22, Chris Wilson wrote:
> >> Quoting Lionel Landwerlin (2018-09-06 11:18:01)
> >>> On 06/09/2018 11:10, Chris Wilson wrote:
> >>>> Quoting Lionel Landwerlin (2018-09-06 10:57:47)
> >>>>> On 05/09/2018 15:22, Tvrtko Ursulin wrote:
> >>>>>> From: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
> >>>>>>
> >>>>>> If some of the contexts submitting workloads to the GPU have been
> >>>>>> configured to shutdown slices/subslices, we might loose the NOA
> >>>>>> configurations written in the NOA muxes.
> >>>>>>
> >>>>>> One possible solution to this problem is to reprogram the NOA muxes
> >>>>>> when we switch to a new context. We initially tried this in the
> >>>>>> workaround batchbuffer but some concerns where raised about the cost
> >>>>>> of reprogramming at every context switch. This solution is also not
> >>>>>> without consequences from the userspace point of view. Reprogramming
> >>>>>> of the muxes can only happen once the powergating configuration has
> >>>>>> changed (which happens after context switch). This means for a window
> >>>>>> of time during the recording, counters recorded by the OA unit might
> >>>>>> be invalid. This requires userspace dealing with OA reports to 
> >>>>>> discard
> >>>>>> the invalid values.
> >>>>>>
> >>>>>> Minimizing the reprogramming could be implemented by tracking of the
> >>>>>> last programmed configuration somewhere in GGTT and use MI_PREDICATE
> >>>>>> to discard some of the programming commands, but the command streamer
> >>>>>> would still have to parse all the MI_LRI instructions in the
> >>>>>> workaround batchbuffer.
> >>>>>>
> >>>>>> Another solution, which this change implements, is to simply 
> >>>>>> disregard
> >>>>>> the user requested configuration for the period of time when 
> >>>>>> i915/perf
> >>>>>> is active. There is no known issue with this apart from a performance
> >>>>>> penality for some media workloads that benefit from running on a
> >>>>>> partially powergated GPU. We already prevent RC6 from affecting the
> >>>>>> programming so it doesn't sound completely unreasonable to hold on
> >>>>>> powergating for the same reason.
> >>>>>>
> >>>>>> v2: Leave RPCS programming in intel_lrc.c (Lionel)
> >>>>>>
> >>>>>> v3: Update for s/union intel_sseu/struct intel_sseu/ (Lionel)
> >>>>>>        More to_intel_context() (Tvrtko)
> >>>>>>        s/dev_priv/i915/ (Tvrtko)
> >>>>>>
> >>>>>> Tvrtko Ursulin:
> >>>>>>
> >>>>>> v4:
> >>>>>>     * Rebase for make_rpcs changes.
> >>>>>>
> >>>>>> v5:
> >>>>>>     * Apply OA restriction from make_rpcs directly.
> >>>>>>
> >>>>>> v6:
> >>>>>>     * Rebase for context image setup changes.
> >>>>>>
> >>>>>> Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
> >>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>>>> ---
> >>>>>>     drivers/gpu/drm/i915/i915_perf.c |  5 +++++
> >>>>>>     drivers/gpu/drm/i915/intel_lrc.c | 30 
> >>>>>> ++++++++++++++++++++----------
> >>>>>>     drivers/gpu/drm/i915/intel_lrc.h |  3 +++
> >>>>>>     3 files changed, 28 insertions(+), 10 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/i915/i915_perf.c 
> >>>>>> b/drivers/gpu/drm/i915/i915_perf.c
> >>>>>> index ccb20230df2c..dd65b72bddd4 100644
> >>>>>> --- a/drivers/gpu/drm/i915/i915_perf.c
> >>>>>> +++ b/drivers/gpu/drm/i915/i915_perf.c
> >>>>>> @@ -1677,6 +1677,11 @@ static void 
> >>>>>> gen8_update_reg_state_unlocked(struct i915_gem_context *ctx,
> >>>>>>                 CTX_REG(reg_state, state_offset, flex_regs[i], 
> >>>>>> value);
> >>>>>>         }
> >>>>>> +
> >>>>>> +     CTX_REG(reg_state, CTX_R_PWR_CLK_STATE, GEN8_R_PWR_CLK_STATE,
> >>>>>> +             gen8_make_rpcs(dev_priv,
> >>>>>> +                            &to_intel_context(ctx,
> >>>>>> +                                              
> >>>>>> dev_priv->engine[RCS])->sseu));
> >>>>> I think there is one issue I missed on the previous iterations of this
> >>>>> patch.
> >>>>>
> >>>>> This gen8_update_reg_state_unlocked() is called when the GPU is parked
> >>>>> on the kernel context.
> >>>>>
> >>>>> It's supposed to update all contexts, but I think we might not be able
> >>>>> to update the kernel context image while the GPU is using it.
> >>>> The kernel context is only ever taken in extremis (you are either
> >>>> parking or stalling userspace) so I don't care.
> >>>
> >>> The patch exposing the RPCS configuration to userspace will make use of
> >>> the kernel context while OA/perf is enabled. Even if it reprograms the
> >>> locked value that will break the power configuration stability on Gen11
> >>> (because the locked configuration will be different from the kernel
> >>> context configuration).
> >> Sure, but as you point out that's only on changing configuration.
> >>
> >> What's missing in the patch is that we only bail early if the new sseu
> >> matches the ce->sseu, but that doesn't necessarily match whats in the
> >> context due to OA. (Or maybe I missed the conversion to rpcs value and
> >> checking.)
> >> -Chris
> >>
> > 
> > Yep, because the gen8_make_rpcs() post processes the values store at the 
> > gem context level, we risk rerunning the kernel context to write the 
> > exiting value.
> > Sorry this is all so messy :(
> 
> Lets see if I managed to follow here.
> 
> The current code indeed bails out at the set ctx param level if the 
> requested state matches the ce->state. My thinking was that ce->state is 
> the master state and whatever happens in "post processing" via 
> gen8_make_rpcs should be hidden from it since the design is that the 
> i915_perf.c will re-configure all contexts when the OA active status 
> changes (to either direction).
> 
> So I don't see a problem in those two interactions.

Our muttering was just along the lines that we can skip the update via
the GPU if OA was already active.
 
> Apart from one, get_param_sseu will lie a bit - we can discuss about 
> this one more. At one point I suggested we have two sets of masks in the 
> uAPI, requested and active in a way. So userspace could query what it 
> set and what is actually active.

In essence, the context should only get to see its own value, not the
system value since that is privileged information (of the OA user in
this case). It's always a nasty dilemma and I think idempotence of the
user interface is far more important (i.e. the current paired setparam,
getparam is the correct starting point for the API).
 
> Now second issue is if i915_perf.c is able to reprogram the kernel config.
> 
> Here its true, it will write to the context image and that will get 
> overwritten by context save.
> 
> If that is a problem for OA, I was initially if a throw-away second 
> "kernel" context could be use to re-program the real one, but perhaps 
> even simpler - what about a mmio write to program the RPCS while kernel 
> context is active?

I object to OA reporting on the kernel context. I think it should never
provide information about the system contexts as that is privileged
information.
* crawls back under his rock
-Chris
Lionel Landwerlin Sept. 7, 2018, 9:23 a.m. UTC | #10
On 07/09/2018 09:26, Tvrtko Ursulin wrote:
>
> On 06/09/2018 11:36, Lionel Landwerlin wrote:
>> On 06/09/2018 11:22, Chris Wilson wrote:
>>> Quoting Lionel Landwerlin (2018-09-06 11:18:01)
>>>> On 06/09/2018 11:10, Chris Wilson wrote:
>>>>> Quoting Lionel Landwerlin (2018-09-06 10:57:47)
>>>>>> On 05/09/2018 15:22, Tvrtko Ursulin wrote:
>>>>>>> From: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
>>>>>>>
>>>>>>> If some of the contexts submitting workloads to the GPU have been
>>>>>>> configured to shutdown slices/subslices, we might loose the NOA
>>>>>>> configurations written in the NOA muxes.
>>>>>>>
>>>>>>> One possible solution to this problem is to reprogram the NOA muxes
>>>>>>> when we switch to a new context. We initially tried this in the
>>>>>>> workaround batchbuffer but some concerns where raised about the 
>>>>>>> cost
>>>>>>> of reprogramming at every context switch. This solution is also not
>>>>>>> without consequences from the userspace point of view. 
>>>>>>> Reprogramming
>>>>>>> of the muxes can only happen once the powergating configuration has
>>>>>>> changed (which happens after context switch). This means for a 
>>>>>>> window
>>>>>>> of time during the recording, counters recorded by the OA unit 
>>>>>>> might
>>>>>>> be invalid. This requires userspace dealing with OA reports to 
>>>>>>> discard
>>>>>>> the invalid values.
>>>>>>>
>>>>>>> Minimizing the reprogramming could be implemented by tracking of 
>>>>>>> the
>>>>>>> last programmed configuration somewhere in GGTT and use 
>>>>>>> MI_PREDICATE
>>>>>>> to discard some of the programming commands, but the command 
>>>>>>> streamer
>>>>>>> would still have to parse all the MI_LRI instructions in the
>>>>>>> workaround batchbuffer.
>>>>>>>
>>>>>>> Another solution, which this change implements, is to simply 
>>>>>>> disregard
>>>>>>> the user requested configuration for the period of time when 
>>>>>>> i915/perf
>>>>>>> is active. There is no known issue with this apart from a 
>>>>>>> performance
>>>>>>> penality for some media workloads that benefit from running on a
>>>>>>> partially powergated GPU. We already prevent RC6 from affecting the
>>>>>>> programming so it doesn't sound completely unreasonable to hold on
>>>>>>> powergating for the same reason.
>>>>>>>
>>>>>>> v2: Leave RPCS programming in intel_lrc.c (Lionel)
>>>>>>>
>>>>>>> v3: Update for s/union intel_sseu/struct intel_sseu/ (Lionel)
>>>>>>>        More to_intel_context() (Tvrtko)
>>>>>>>        s/dev_priv/i915/ (Tvrtko)
>>>>>>>
>>>>>>> Tvrtko Ursulin:
>>>>>>>
>>>>>>> v4:
>>>>>>>     * Rebase for make_rpcs changes.
>>>>>>>
>>>>>>> v5:
>>>>>>>     * Apply OA restriction from make_rpcs directly.
>>>>>>>
>>>>>>> v6:
>>>>>>>     * Rebase for context image setup changes.
>>>>>>>
>>>>>>> Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
>>>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>>> ---
>>>>>>>     drivers/gpu/drm/i915/i915_perf.c |  5 +++++
>>>>>>>     drivers/gpu/drm/i915/intel_lrc.c | 30 
>>>>>>> ++++++++++++++++++++----------
>>>>>>>     drivers/gpu/drm/i915/intel_lrc.h |  3 +++
>>>>>>>     3 files changed, 28 insertions(+), 10 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/i915/i915_perf.c 
>>>>>>> b/drivers/gpu/drm/i915/i915_perf.c
>>>>>>> index ccb20230df2c..dd65b72bddd4 100644
>>>>>>> --- a/drivers/gpu/drm/i915/i915_perf.c
>>>>>>> +++ b/drivers/gpu/drm/i915/i915_perf.c
>>>>>>> @@ -1677,6 +1677,11 @@ static void 
>>>>>>> gen8_update_reg_state_unlocked(struct i915_gem_context *ctx,
>>>>>>>                 CTX_REG(reg_state, state_offset, flex_regs[i], 
>>>>>>> value);
>>>>>>>         }
>>>>>>> +
>>>>>>> +     CTX_REG(reg_state, CTX_R_PWR_CLK_STATE, GEN8_R_PWR_CLK_STATE,
>>>>>>> +             gen8_make_rpcs(dev_priv,
>>>>>>> + &to_intel_context(ctx,
>>>>>>> + dev_priv->engine[RCS])->sseu));
>>>>>> I think there is one issue I missed on the previous iterations of 
>>>>>> this
>>>>>> patch.
>>>>>>
>>>>>> This gen8_update_reg_state_unlocked() is called when the GPU is 
>>>>>> parked
>>>>>> on the kernel context.
>>>>>>
>>>>>> It's supposed to update all contexts, but I think we might not be 
>>>>>> able
>>>>>> to update the kernel context image while the GPU is using it.
>>>>> The kernel context is only ever taken in extremis (you are either
>>>>> parking or stalling userspace) so I don't care.
>>>>
>>>> The patch exposing the RPCS configuration to userspace will make 
>>>> use of
>>>> the kernel context while OA/perf is enabled. Even if it reprograms the
>>>> locked value that will break the power configuration stability on 
>>>> Gen11
>>>> (because the locked configuration will be different from the kernel
>>>> context configuration).
>>> Sure, but as you point out that's only on changing configuration.
>>>
>>> What's missing in the patch is that we only bail early if the new sseu
>>> matches the ce->sseu, but that doesn't necessarily match whats in the
>>> context due to OA. (Or maybe I missed the conversion to rpcs value and
>>> checking.)
>>> -Chris
>>>
>>
>> Yep, because the gen8_make_rpcs() post processes the values store at 
>> the gem context level, we risk rerunning the kernel context to write 
>> the exiting value.
>> Sorry this is all so messy :(
>
> Lets see if I managed to follow here.
>
> The current code indeed bails out at the set ctx param level if the 
> requested state matches the ce->state. My thinking was that ce->state 
> is the master state and whatever happens in "post processing" via 
> gen8_make_rpcs should be hidden from it since the design is that the 
> i915_perf.c will re-configure all contexts when the OA active status 
> changes (to either direction).
>
> So I don't see a problem in those two interactions.


Let's say you have contextA with sseu(slice,subslice)=(0x1,0xff) for ICL.

You then enable OA, which locks the configuration at (0x1,0xf).

The kernel context has retained its (0x1,0xff) configuration.


After that, you change the config of contextA to (0x1,0x7).


This would lead to the kernel context being scheduled with (0x1,0xff) 
while OA is active.
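
The scenario above can be sketched as a toy model, assuming (as discussed earlier in the thread) that an image rewritten while its context is running is lost again on context save. struct ctx_image, oa_enable_update() and breaks_oa_lock() are hypothetical names for illustration only, and the lock value is the 0xf subslice mask from the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the subslice mask held for one context. */
struct ctx_image {
	uint8_t requested_subslices;	/* what the owner asked for */
	uint8_t image_subslices;	/* what the saved image holds */
};

/* Assumed OA lock value, taken from the example in this mail. */
#define OA_LOCKED_SUBSLICES 0x0f

/*
 * Enabling OA rewrites the image of every context that is not running.
 * A rewrite of the active context's image is overwritten by the next
 * context save, so that case is modelled as a no-op.
 */
static void oa_enable_update(struct ctx_image *ctx, bool is_active)
{
	assert(ctx);
	if (!is_active)
		ctx->image_subslices = OA_LOCKED_SUBSLICES;
}

/* True if running this context would disturb the locked configuration. */
static bool breaks_oa_lock(const struct ctx_image *ctx)
{
	assert(ctx);
	return ctx->image_subslices != OA_LOCKED_SUBSLICES;
}
```

In this model contextA ends up on the locked value, while the kernel context, which was active during the update, keeps 0xff and so breaks the lock whenever it is used to apply a later change.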


>
> Apart from one, get_param_sseu will lie a bit - we can discuss about 
> this one more. At one point I suggested we have two sets of masks in 
> the uAPI, requested and active in a way. So userspace could query what 
> it set and what is actually active.
>
> Now second issue is if i915_perf.c is able to reprogram the kernel 
> config.
>
> Here its true, it will write to the context image and that will get 
> overwritten by context save.
>
> If that is a problem for OA, I was initially if a throw-away second 
> "kernel" context could be use to re-program the real one, but perhaps 
> even simpler - what about a mmio write to program the RPCS while 
> kernel context is active?


The documentation says: "This register must not be programmed directly 
through CPU MMIO cycle."


Sorry :(


-

Lionel


>
> Regards,
>
> Tvrtko
>
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <div class="moz-cite-prefix">On 07/09/2018 09:26, Tvrtko Ursulin
      wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:4bfe6648-203f-fbc8-9bcd-7b4e28279d02@linux.intel.com">
      <br>
      On 06/09/2018 11:36, Lionel Landwerlin wrote:
      <br>
      <blockquote type="cite">On 06/09/2018 11:22, Chris Wilson wrote:
        <br>
        <blockquote type="cite">Quoting Lionel Landwerlin (2018-09-06
          11:18:01)
          <br>
          <blockquote type="cite">On 06/09/2018 11:10, Chris Wilson
            wrote:
            <br>
            <blockquote type="cite">Quoting Lionel Landwerlin
              (2018-09-06 10:57:47)
              <br>
              <blockquote type="cite">On 05/09/2018 15:22, Tvrtko
                Ursulin wrote:
                <br>
                <blockquote type="cite">From: Lionel Landwerlin
                  <a class="moz-txt-link-rfc2396E" href="mailto:lionel.g.landwerlin@intel.com">&lt;lionel.g.landwerlin@intel.com&gt;</a>
                  <br>
                  <br>
                  If some of the contexts submitting workloads to the
                  GPU have been
                  <br>
                  configured to shutdown slices/subslices, we might
                  loose the NOA
                  <br>
                  configurations written in the NOA muxes.
                  <br>
                  <br>
                  One possible solution to this problem is to reprogram
                  the NOA muxes
                  <br>
                  when we switch to a new context. We initially tried
                  this in the
                  <br>
                  workaround batchbuffer but some concerns where raised
                  about the cost
                  <br>
                  of reprogramming at every context switch. This
                  solution is also not
                  <br>
                  without consequences from the userspace point of view.
                  Reprogramming
                  <br>
                  of the muxes can only happen once the powergating
                  configuration has
                  <br>
                  changed (which happens after context switch). This
                  means for a window
                  <br>
                  of time during the recording, counters recorded by the
                  OA unit might
                  <br>
                  be invalid. This requires userspace dealing with OA
                  reports to discard
                  <br>
                  the invalid values.
                  <br>
                  <br>
                  Minimizing the reprogramming could be implemented by
                  tracking of the
                  <br>
                  last programmed configuration somewhere in GGTT and
                  use MI_PREDICATE
                  <br>
                  to discard some of the programming commands, but the
                  command streamer
                  <br>
                  would still have to parse all the MI_LRI instructions
                  in the
                  <br>
                  workaround batchbuffer.
                  <br>
                  <br>
                  Another solution, which this change implements, is to
                  simply disregard
                  <br>
                  the user requested configuration for the period of
                  time when i915/perf
                  <br>
                  is active. There is no known issue with this apart
                  from a performance
                  <br>
                  penality for some media workloads that benefit from
                  running on a
                  <br>
                  partially powergated GPU. We already prevent RC6 from
                  affecting the
                  <br>
                  programming so it doesn't sound completely
                  unreasonable to hold on
                  <br>
                  powergating for the same reason.
                  <br>
                  <br>
                  v2: Leave RPCS programming in intel_lrc.c (Lionel)
                  <br>
                  <br>
                  v3: Update for s/union intel_sseu/struct intel_sseu/
                  (Lionel)
                  <br>
                         More to_intel_context() (Tvrtko)
                  <br>
                         s/dev_priv/i915/ (Tvrtko)
                  <br>
                  <br>
                  Tvrtko Ursulin:
                  <br>
                  <br>
                  v4:
                  <br>
                      * Rebase for make_rpcs changes.
                  <br>
                  <br>
                  v5:
                  <br>
                      * Apply OA restriction from make_rpcs directly.
                  <br>
                  <br>
                  v6:
                  <br>
                      * Rebase for context image setup changes.
                  <br>
                  <br>
                  Signed-off-by: Lionel Landwerlin
                  <a class="moz-txt-link-rfc2396E" href="mailto:lionel.g.landwerlin@intel.com">&lt;lionel.g.landwerlin@intel.com&gt;</a>
                  <br>
                  Signed-off-by: Tvrtko Ursulin
                  <a class="moz-txt-link-rfc2396E" href="mailto:tvrtko.ursulin@intel.com">&lt;tvrtko.ursulin@intel.com&gt;</a>
                  <br>
                  ---
                  <br>
                      drivers/gpu/drm/i915/i915_perf.c |  5 +++++
                  <br>
                      drivers/gpu/drm/i915/intel_lrc.c | 30
                  ++++++++++++++++++++----------
                  <br>
                      drivers/gpu/drm/i915/intel_lrc.h |  3 +++
                  <br>
                      3 files changed, 28 insertions(+), 10 deletions(-)
                  <br>
                  <br>
                  diff --git a/drivers/gpu/drm/i915/i915_perf.c
                  b/drivers/gpu/drm/i915/i915_perf.c
                  <br>
                  index ccb20230df2c..dd65b72bddd4 100644
                  <br>
                  --- a/drivers/gpu/drm/i915/i915_perf.c
                  <br>
                  +++ b/drivers/gpu/drm/i915/i915_perf.c
                  <br>
                  @@ -1677,6 +1677,11 @@ static void
                  gen8_update_reg_state_unlocked(struct i915_gem_context
                  *ctx,
                  <br>
                                  CTX_REG(reg_state, state_offset,
                  flex_regs[i], value);
                  <br>
                          }
                  <br>
                  +
                  <br>
                  +     CTX_REG(reg_state, CTX_R_PWR_CLK_STATE,
                  GEN8_R_PWR_CLK_STATE,
                  <br>
                  +             gen8_make_rpcs(dev_priv,
                  <br>
                  +                           
                  &amp;to_intel_context(ctx,
                  <br>
                  +                                             
                  dev_priv-&gt;engine[RCS])-&gt;sseu));
                  <br>
                </blockquote>
                I think there is one issue I missed on the previous
                iterations of this
                <br>
                patch.
                <br>
                <br>
                This gen8_update_reg_state_unlocked() is called when the
                GPU is parked
                <br>
                on the kernel context.
                <br>
                <br>
                It's supposed to update all contexts, but I think we
                might not be able
                <br>
                to update the kernel context image while the GPU is
                using it.
                <br>
              </blockquote>
              The kernel context is only ever taken in extremis (you are
              either
              <br>
              parking or stalling userspace) so I don't care.
              <br>
            </blockquote>
            <br>
            The patch exposing the RPCS configuration to userspace will
            make use of
            <br>
            the kernel context while OA/perf is enabled. Even if it
            reprograms the
            <br>
            locked value that will break the power configuration
            stability on Gen11
            <br>
            (because the locked configuration will be different from the
            kernel
            <br>
            context configuration).
            <br>
          </blockquote>
          Sure, but as you point out that's only on changing
          configuration.
          <br>
          <br>
          What's missing in the patch is that we only bail early if the
          new sseu
          <br>
          matches the ce-&gt;sseu, but that doesn't necessarily match
          whats in the
          <br>
          context due to OA. (Or maybe I missed the conversion to rpcs
          value and
          <br>
          checking.)
          <br>
          -Chris
          <br>
          <br>
        </blockquote>
        <br>
        Yep, because the gen8_make_rpcs() post processes the values
        store at the gem context level, we risk rerunning the kernel
        context to write the exiting value.
        <br>
        Sorry this is all so messy :(
        <br>
      </blockquote>
      <br>
      Lets see if I managed to follow here.
      <br>
      <br>
      The current code indeed bails out at the set ctx param level if
      the requested state matches the ce-&gt;state. My thinking was that
      ce-&gt;state is the master state and whatever happens in "post
      processing" via gen8_make_rpcs should be hidden from it since the
      design is that the i915_perf.c will re-configure all contexts when
      the OA active status changes (to either direction).
      <br>
      <br>
      So I don't see a problem in those two interactions.
      <br>
    </blockquote>
    <p><br>
    </p>
    <p>Let's say you have contextA with sseu(slice,subslice)=(0x1/0xff)
      for ICL.<br>
    </p>
    <p>You then enable OA which locks the configuration at (0x1,0xf).</p>
    <p>The kernel context has retained its (0x1/0xff) configuration.<br>
    </p>
    <p><br>
    </p>
    <p>And after you change the config of contextA to (0x1,0x7).</p>
    <p><br>
    </p>
    <p>This would lead to the kernel context scheduled with (0x1,0xff)
      while OA is active.<br>
    </p>
    <p><br>
    </p>
    <blockquote type="cite"
      cite="mid:4bfe6648-203f-fbc8-9bcd-7b4e28279d02@linux.intel.com">
      <br>
      Apart from one, get_param_sseu will lie a bit - we can discuss
      about this one more. At one point I suggested we have two sets of
      masks in the uAPI, requested and active in a way. So userspace
      could query what it set and what is actually active.
      <br>
      <br>
      Now second issue is if i915_perf.c is able to reprogram the kernel
      config.
      <br>
      <br>
      Here its true, it will write to the context image and that will
      get overwritten by context save.
      <br>
      <br>
      If that is a problem for OA, I was initially if a throw-away
      second "kernel" context could be use to re-program the real one,
      but perhaps even simpler - what about a mmio write to program the
      RPCS while kernel context is active?
      <br>
    </blockquote>
    <p><br>
    </p>
    <p>Documentation says: "This register must not be programmed
      directly through CPU MMIO cycle."<br>
    </p>
    <p><br>
    </p>
    <p>Sorry :(</p>
    <p><br>
    </p>
    <p>-</p>
    <p>Lionel<br>
    </p>
    <p><br>
    </p>
    <blockquote type="cite"
      cite="mid:4bfe6648-203f-fbc8-9bcd-7b4e28279d02@linux.intel.com">
      <br>
      Regards,
      <br>
      <br>
      Tvrtko
      <br>
      <br>
    </blockquote>
    <p><br>
    </p>
  </body>
</html>
Tvrtko Ursulin Sept. 7, 2018, 9:39 a.m. UTC | #11
On 07/09/2018 10:23, Lionel Landwerlin wrote:
> On 07/09/2018 09:26, Tvrtko Ursulin wrote:
>>
>> On 06/09/2018 11:36, Lionel Landwerlin wrote:
>>> On 06/09/2018 11:22, Chris Wilson wrote:
>>>> Quoting Lionel Landwerlin (2018-09-06 11:18:01)
>>>>> On 06/09/2018 11:10, Chris Wilson wrote:
>>>>>> Quoting Lionel Landwerlin (2018-09-06 10:57:47)
>>>>>>> On 05/09/2018 15:22, Tvrtko Ursulin wrote:
>>>>>>>> From: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
>>>>>>>>
>>>>>>>> Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
>>>>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>>>> ---
>>>>>>>>     drivers/gpu/drm/i915/i915_perf.c |  5 +++++
>>>>>>>>     drivers/gpu/drm/i915/intel_lrc.c | 30 
>>>>>>>> ++++++++++++++++++++----------
>>>>>>>>     drivers/gpu/drm/i915/intel_lrc.h |  3 +++
>>>>>>>>     3 files changed, 28 insertions(+), 10 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/i915/i915_perf.c 
>>>>>>>> b/drivers/gpu/drm/i915/i915_perf.c
>>>>>>>> index ccb20230df2c..dd65b72bddd4 100644
>>>>>>>> --- a/drivers/gpu/drm/i915/i915_perf.c
>>>>>>>> +++ b/drivers/gpu/drm/i915/i915_perf.c
>>>>>>>> @@ -1677,6 +1677,11 @@ static void 
>>>>>>>> gen8_update_reg_state_unlocked(struct i915_gem_context *ctx,
>>>>>>>>                 CTX_REG(reg_state, state_offset, flex_regs[i], 
>>>>>>>> value);
>>>>>>>>         }
>>>>>>>> +
>>>>>>>> +     CTX_REG(reg_state, CTX_R_PWR_CLK_STATE, GEN8_R_PWR_CLK_STATE,
>>>>>>>> +             gen8_make_rpcs(dev_priv,
>>>>>>>> + &to_intel_context(ctx,
>>>>>>>> + dev_priv->engine[RCS])->sseu));
>>>>>>> I think there is one issue I missed on the previous iterations of 
>>>>>>> this
>>>>>>> patch.
>>>>>>>
>>>>>>> This gen8_update_reg_state_unlocked() is called when the GPU is 
>>>>>>> parked
>>>>>>> on the kernel context.
>>>>>>>
>>>>>>> It's supposed to update all contexts, but I think we might not be 
>>>>>>> able
>>>>>>> to update the kernel context image while the GPU is using it.
>>>>>> The kernel context is only ever taken in extremis (you are either
>>>>>> parking or stalling userspace) so I don't care.
>>>>>
>>>>> The patch exposing the RPCS configuration to userspace will make 
>>>>> use of
>>>>> the kernel context while OA/perf is enabled. Even if it reprograms the
>>>>> locked value that will break the power configuration stability on 
>>>>> Gen11
>>>>> (because the locked configuration will be different from the kernel
>>>>> context configuration).
>>>> Sure, but as you point out that's only on changing configuration.
>>>>
>>>> What's missing in the patch is that we only bail early if the new sseu
>>>> matches the ce->sseu, but that doesn't necessarily match whats in the
>>>> context due to OA. (Or maybe I missed the conversion to rpcs value and
>>>> checking.)
>>>> -Chris
>>>>
>>>
>>> Yep, because the gen8_make_rpcs() post processes the values store at 
>>> the gem context level, we risk rerunning the kernel context to write 
>>> the exiting value.
>>> Sorry this is all so messy :(
>>
>> Lets see if I managed to follow here.
>>
>> The current code indeed bails out at the set ctx param level if the 
>> requested state matches the ce->state. My thinking was that ce->state 
>> is the master state and whatever happens in "post processing" via 
>> gen8_make_rpcs should be hidden from it since the design is that the 
>> i915_perf.c will re-configure all contexts when the OA active status 
>> changes (to either direction).
>>
>> So I don't see a problem in those two interactions.
> 
> 
> Let's say you have contextA with sseu(slice,subslice)=(0x1/0xff) for ICL.
> 
> You then enable OA which locks the configuration at (0x1,0xf).
> 
> The kernel context has retained its (0x1/0xff) configuration.
> 
> 
> And after you change the config of contextA to (0x1,0x7).
> 
> 
> This would lead to the kernel context scheduled with (0x1,0xff) while OA 
> is active.

Okay, that's the problem discussed in the paragraph below: the kernel
context is not updated at all. But is it a problem for OA? Will it mess
up some counters even if the kernel context isn't executing anything
that interacts with them? Or is it?
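
For anyone following along, the scenario quoted above can be modelled in
a few lines of standalone C. This is only a sketch, not driver code:
struct sseu_model, OA_LOCKED, effective_sseu() and
switch_changes_power_config() are hypothetical names, effective_sseu()
merely stands in for the post-processing that gen8_make_rpcs() is
described as doing, and the mask values are the ones from the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the sseu state stored per context. */
struct sseu_model {
	uint8_t slice_mask;
	uint8_t subslice_mask;
};

/* Configuration assumed to be forced while OA is active (ICL example). */
static const struct sseu_model OA_LOCKED = { 0x1, 0xf };

/*
 * Hypothetical stand-in for the gen8_make_rpcs() post-processing: the
 * requested state is left alone, the locked one is what gets programmed.
 */
static struct sseu_model effective_sseu(struct sseu_model requested,
					bool oa_active)
{
	if (oa_active)
		return OA_LOCKED;
	return requested;
}

static bool sseu_equal(struct sseu_model a, struct sseu_model b)
{
	return a.slice_mask == b.slice_mask &&
	       a.subslice_mask == b.subslice_mask;
}

/*
 * True if switching between a context whose image went through
 * effective_sseu() and a kernel context whose image still holds
 * kernel_image would change the power configuration.
 */
static bool switch_changes_power_config(struct sseu_model requested,
					struct sseu_model kernel_image,
					bool oa_active)
{
	return !sseu_equal(effective_sseu(requested, oa_active), kernel_image);
}

/* Walk through the quoted scenario. */
void sseu_model_example(void)
{
	/* Kernel context image that was never rewritten. */
	const struct sseu_model kernel_image = { 0x1, 0xff };
	struct sseu_model ctx_a = { 0x1, 0xff };

	/* contextA is changed by userspace while OA is active. */
	ctx_a.subslice_mask = 0x7;

	/* contextA gets programmed with the locked (0x1,0xf) value... */
	assert(sseu_equal(effective_sseu(ctx_a, true), OA_LOCKED));
	/* ...so switching to the stale kernel context changes the config. */
	assert(switch_changes_power_config(ctx_a, kernel_image, true));
}
```

In this model the mismatch disappears as soon as the kernel context
image also holds the locked value, which is the open question here.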

> 
>>
>> Apart from one, get_param_sseu will lie a bit - we can discuss about 
>> this one more. At one point I suggested we have two sets of masks in 
>> the uAPI, requested and active in a way. So userspace could query what 
>> it set and what is actually active.
>>
>> Now second issue is if i915_perf.c is able to reprogram the kernel 
>> config.
>>
>> Here its true, it will write to the context image and that will get 
>> overwritten by context save.
>>
>> If that is a problem for OA, I was initially if a throw-away second 
>> "kernel" context could be use to re-program the real one, but perhaps 
>> even simpler - what about a mmio write to program the RPCS while 
>> kernel context is active?
> 
> 
> Documentation says : "This register must not be programmed directly 
> through CPU MMIO cycle."
> 
> 
> Sorry :(

Ugh... okay, help me understand whether the kernel context absolutely
needs to follow the "lock" for OA to work, and then we'll see what to do.

Regards,

Tvrtko
Lionel Landwerlin Sept. 7, 2018, 9:55 a.m. UTC | #12
On 07/09/2018 10:39, Tvrtko Ursulin wrote:
>
> On 07/09/2018 10:23, Lionel Landwerlin wrote:
>> On 07/09/2018 09:26, Tvrtko Ursulin wrote:
>>>
>>> On 06/09/2018 11:36, Lionel Landwerlin wrote:
>>>> On 06/09/2018 11:22, Chris Wilson wrote:
>>>>> Quoting Lionel Landwerlin (2018-09-06 11:18:01)
>>>>>> On 06/09/2018 11:10, Chris Wilson wrote:
>>>>>>> Quoting Lionel Landwerlin (2018-09-06 10:57:47)
>>>>>>>> On 05/09/2018 15:22, Tvrtko Ursulin wrote:
>>>>>>>>> From: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
>>>>>>>>>
>>>>>>>>> Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
>>>>>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>>>>> ---
>>>>>>>>>     drivers/gpu/drm/i915/i915_perf.c |  5 +++++
>>>>>>>>>     drivers/gpu/drm/i915/intel_lrc.c | 30 
>>>>>>>>> ++++++++++++++++++++----------
>>>>>>>>>     drivers/gpu/drm/i915/intel_lrc.h |  3 +++
>>>>>>>>>     3 files changed, 28 insertions(+), 10 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/i915/i915_perf.c 
>>>>>>>>> b/drivers/gpu/drm/i915/i915_perf.c
>>>>>>>>> index ccb20230df2c..dd65b72bddd4 100644
>>>>>>>>> --- a/drivers/gpu/drm/i915/i915_perf.c
>>>>>>>>> +++ b/drivers/gpu/drm/i915/i915_perf.c
>>>>>>>>> @@ -1677,6 +1677,11 @@ static void 
>>>>>>>>> gen8_update_reg_state_unlocked(struct i915_gem_context *ctx,
>>>>>>>>>                 CTX_REG(reg_state, state_offset, flex_regs[i], 
>>>>>>>>> value);
>>>>>>>>>         }
>>>>>>>>> +
>>>>>>>>> +     CTX_REG(reg_state, CTX_R_PWR_CLK_STATE, 
>>>>>>>>> GEN8_R_PWR_CLK_STATE,
>>>>>>>>> +             gen8_make_rpcs(dev_priv,
>>>>>>>>> + &to_intel_context(ctx,
>>>>>>>>> + dev_priv->engine[RCS])->sseu));
>>>>>>>> I think there is one issue I missed on the previous iterations 
>>>>>>>> of this
>>>>>>>> patch.
>>>>>>>>
>>>>>>>> This gen8_update_reg_state_unlocked() is called when the GPU is 
>>>>>>>> parked
>>>>>>>> on the kernel context.
>>>>>>>>
>>>>>>>> It's supposed to update all contexts, but I think we might not 
>>>>>>>> be able
>>>>>>>> to update the kernel context image while the GPU is using it.
>>>>>>> The kernel context is only ever taken in extremis (you are either
>>>>>>> parking or stalling userspace) so I don't care.
>>>>>>
>>>>>> The patch exposing the RPCS configuration to userspace will make 
>>>>>> use of
>>>>>> the kernel context while OA/perf is enabled. Even if it 
>>>>>> reprograms the
>>>>>> locked value that will break the power configuration stability on 
>>>>>> Gen11
>>>>>> (because the locked configuration will be different from the kernel
>>>>>> context configuration).
>>>>> Sure, but as you point out that's only on changing configuration.
>>>>>
>>>>> What's missing in the patch is that we only bail early if the new 
>>>>> sseu
>>>>> matches the ce->sseu, but that doesn't necessarily match whats in the
>>>>> context due to OA. (Or maybe I missed the conversion to rpcs value 
>>>>> and
>>>>> checking.)
>>>>> -Chris
>>>>>
>>>>
>>>> Yep, because the gen8_make_rpcs() post processes the values store 
>>>> at the gem context level, we risk rerunning the kernel context to 
>>>> write the exiting value.
>>>> Sorry this is all so messy :(
>>>
>>> Lets see if I managed to follow here.
>>>
>>> The current code indeed bails out at the set ctx param level if the 
>>> requested state matches the ce->state. My thinking was that 
>>> ce->state is the master state and whatever happens in "post 
>>> processing" via gen8_make_rpcs should be hidden from it since the 
>>> design is that the i915_perf.c will re-configure all contexts when 
>>> the OA active status changes (to either direction).
>>>
>>> So I don't see a problem in those two interactions.
>>
>>
>> Let's say you have contextA with sseu(slice,subslice)=(0x1/0xff) for 
>> ICL.
>>
>> You then enable OA which locks the configuration at (0x1,0xf).
>>
>> The kernel context has retained its (0x1/0xff) configuration.
>>
>>
>> And after you change the config of contextA to (0x1,0x7).
>>
>>
>> This would lead to the kernel context scheduled with (0x1,0xff) while 
>> OA is active.
>
> Okay, that's the problem discussed in the paragraph below: the
> kernel context is not updated at all. But is it a problem for OA? Will
> it mess up some counters even if the kernel context isn't executing
> anything that interacts with them? Or is it?


What the HW does to the NOA logic on power configuration changes is not
really documented.
Experimentally, on SKL GT4, a change in power configuration seems to
trigger a power-off of everything before power is applied at the new
configuration.
That would imply losing the NOA programming when we switch to the
kernel context, which means invalid values in the counters.
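
If the SKL GT4 observation holds in general, a rough way to picture the
consequence is to count how often the power configuration changes over a
sequence of context switches. The sketch below is an illustrative model
only: count_power_config_changes() and the RPCS values are made up, and
it assumes that any change in value drops the NOA programming, which is
exactly the part that is not documented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count the context switches at which the RPCS value differs from the
 * previously scheduled one. Under the assumption stated above, each
 * such switch would lose the NOA programming.
 */
static size_t count_power_config_changes(const uint32_t *rpcs, size_t n)
{
	size_t changes = 0;

	for (size_t i = 1; i < n; i++)
		if (rpcs[i] != rpcs[i - 1])
			changes++;

	return changes;
}

/* Usage example with made-up values: 0xf is "locked", 0xff is stale. */
void power_config_example(void)
{
	const uint32_t all_locked[] = { 0xf, 0xf, 0xf, 0xf };
	const uint32_t stale_kernel[] = { 0xf, 0xff, 0xf, 0xf };

	assert(count_power_config_changes(all_locked, 4) == 0);
	/* Switching to the stale kernel context and back: two changes. */
	assert(count_power_config_changes(stale_kernel, 4) == 2);
}
```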


>
>>
>>>
>>> Apart from one, get_param_sseu will lie a bit - we can discuss about 
>>> this one more. At one point I suggested we have two sets of masks in 
>>> the uAPI, requested and active in a way. So userspace could query 
>>> what it set and what is actually active.
>>>
>>> Now second issue is if i915_perf.c is able to reprogram the kernel 
>>> config.
>>>
>>> Here its true, it will write to the context image and that will get 
>>> overwritten by context save.
>>>
>>> If that is a problem for OA, I was initially if a throw-away second 
>>> "kernel" context could be use to re-program the real one, but 
>>> perhaps even simpler - what about a mmio write to program the RPCS 
>>> while kernel context is active?
>>
>>
>> Documentation says : "This register must not be programmed directly 
>> through CPU MMIO cycle."
>>
>>
>> Sorry :(
>
> Ugh... okay, help me understand whether the kernel context absolutely
> needs to follow the "lock" for OA to work, and then we'll see what to do.

I think so.

Your idea of a throw-away context to reprogram everything seems sound.

Thanks,

-
Lionel


>
> Regards,
>
> Tvrtko
>
Tvrtko Ursulin Sept. 10, 2018, 1:44 p.m. UTC | #13
On 07/09/2018 10:55, Lionel Landwerlin wrote:
> On 07/09/2018 10:39, Tvrtko Ursulin wrote:
>>
>> On 07/09/2018 10:23, Lionel Landwerlin wrote:
>>> On 07/09/2018 09:26, Tvrtko Ursulin wrote:
>>>>
>>>> On 06/09/2018 11:36, Lionel Landwerlin wrote:
>>>>> On 06/09/2018 11:22, Chris Wilson wrote:
>>>>>> Quoting Lionel Landwerlin (2018-09-06 11:18:01)
>>>>>>> On 06/09/2018 11:10, Chris Wilson wrote:
>>>>>>>> Quoting Lionel Landwerlin (2018-09-06 10:57:47)
>>>>>>>>> On 05/09/2018 15:22, Tvrtko Ursulin wrote:
>>>>>>>>>> From: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
>>>>>>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>>>>>> ---
>>>>>>>>>>     drivers/gpu/drm/i915/i915_perf.c |  5 +++++
>>>>>>>>>>     drivers/gpu/drm/i915/intel_lrc.c | 30 
>>>>>>>>>> ++++++++++++++++++++----------
>>>>>>>>>>     drivers/gpu/drm/i915/intel_lrc.h |  3 +++
>>>>>>>>>>     3 files changed, 28 insertions(+), 10 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/i915_perf.c 
>>>>>>>>>> b/drivers/gpu/drm/i915/i915_perf.c
>>>>>>>>>> index ccb20230df2c..dd65b72bddd4 100644
>>>>>>>>>> --- a/drivers/gpu/drm/i915/i915_perf.c
>>>>>>>>>> +++ b/drivers/gpu/drm/i915/i915_perf.c
>>>>>>>>>> @@ -1677,6 +1677,11 @@ static void 
>>>>>>>>>> gen8_update_reg_state_unlocked(struct i915_gem_context *ctx,
>>>>>>>>>>                 CTX_REG(reg_state, state_offset, flex_regs[i], 
>>>>>>>>>> value);
>>>>>>>>>>         }
>>>>>>>>>> +
>>>>>>>>>> +     CTX_REG(reg_state, CTX_R_PWR_CLK_STATE, 
>>>>>>>>>> GEN8_R_PWR_CLK_STATE,
>>>>>>>>>> +             gen8_make_rpcs(dev_priv,
>>>>>>>>>> + &to_intel_context(ctx,
>>>>>>>>>> + dev_priv->engine[RCS])->sseu));
>>>>>>>>> I think there is one issue I missed on the previous iterations 
>>>>>>>>> of this
>>>>>>>>> patch.
>>>>>>>>>
>>>>>>>>> This gen8_update_reg_state_unlocked() is called when the GPU is 
>>>>>>>>> parked
>>>>>>>>> on the kernel context.
>>>>>>>>>
>>>>>>>>> It's supposed to update all contexts, but I think we might not 
>>>>>>>>> be able
>>>>>>>>> to update the kernel context image while the GPU is using it.
>>>>>>>> The kernel context is only ever taken in extremis (you are either
>>>>>>>> parking or stalling userspace) so I don't care.
>>>>>>>
>>>>>>> The patch exposing the RPCS configuration to userspace will make 
>>>>>>> use of
>>>>>>> the kernel context while OA/perf is enabled. Even if it 
>>>>>>> reprograms the
>>>>>>> locked value that will break the power configuration stability on 
>>>>>>> Gen11
>>>>>>> (because the locked configuration will be different from the kernel
>>>>>>> context configuration).
>>>>>> Sure, but as you point out that's only on changing configuration.
>>>>>>
>>>>>> What's missing in the patch is that we only bail early if the new 
>>>>>> sseu
>>>>>> matches the ce->sseu, but that doesn't necessarily match whats in the
>>>>>> context due to OA. (Or maybe I missed the conversion to rpcs value 
>>>>>> and
>>>>>> checking.)
>>>>>> -Chris
>>>>>>
>>>>>
>>>>> Yep, because the gen8_make_rpcs() post processes the values store 
>>>>> at the gem context level, we risk rerunning the kernel context to 
>>>>> write the exiting value.
>>>>> Sorry this is all so messy :(
>>>>
>>>> Lets see if I managed to follow here.
>>>>
>>>> The current code indeed bails out at the set ctx param level if the 
>>>> requested state matches the ce->state. My thinking was that 
>>>> ce->state is the master state and whatever happens in "post 
>>>> processing" via gen8_make_rpcs should be hidden from it since the 
>>>> design is that the i915_perf.c will re-configure all contexts when 
>>>> the OA active status changes (to either direction).
>>>>
>>>> So I don't see a problem in those two interactions.
>>>
>>>
>>> Let's say you have contextA with sseu(slice,subslice)=(0x1/0xff) for 
>>> ICL.
>>>
>>> You then enable OA which locks the configuration at (0x1,0xf).
>>>
>>> The kernel context has retained its (0x1/0xff) configuration.
>>>
>>>
>>> And after you change the config of contextA to (0x1,0x7).
>>>
>>>
>>> This would lead to the kernel context scheduled with (0x1,0xff) while 
>>> OA is active.
>>
>> Okay, that's the problem discussed in the paragraph below: the
>> kernel context is not updated at all. But is it a problem for OA? Will
>> it mess up some counters even if the kernel context isn't executing
>> anything that interacts with them? Or is it?
> 
> 
> What the HW does to the NOA logic on power configuration changes is not
> really documented.
> Experimentally, on SKL GT4, a change in power configuration seems to
> trigger a power-off of everything before power is applied at the new
> configuration.
> That would imply losing the NOA programming when we switch to the
> kernel context, which means invalid values in the counters.
> 
> 
>>
>>>
>>>>
>>>> Apart from one, get_param_sseu will lie a bit - we can discuss about 
>>>> this one more. At one point I suggested we have two sets of masks in 
>>>> the uAPI, requested and active in a way. So userspace could query 
>>>> what it set and what is actually active.
>>>>
>>>> Now second issue is if i915_perf.c is able to reprogram the kernel 
>>>> config.
>>>>
>>>> Here its true, it will write to the context image and that will get 
>>>> overwritten by context save.
>>>>
>>>> If that is a problem for OA, I was initially if a throw-away second 
>>>> "kernel" context could be use to re-program the real one, but 
>>>> perhaps even simpler - what about a mmio write to program the RPCS 
>>>> while kernel context is active?
>>>
>>>
>>> Documentation says : "This register must not be programmed directly 
>>> through CPU MMIO cycle."
>>>
>>>
>>> Sorry :(
>>
>> Ugh... okay, help me understand whether the kernel context absolutely
>> needs to follow the "lock" for OA to work, and then we'll see what to do.
> 
> I think so.
> 
> Your idea of a throw-away context to reprogram everything seems sound.

I was in the middle of refactoring the series to implement this when I
started doubting the need for it.

The premise of the problem statement was that the kernel context doesn't
get updated by the OA code, but there is a loop in
gen8_configure_all_contexts which goes through all of them after it has
idled the GPU.

AFAICS that means it is able to map the kernel context state and edit
it, so everything seems fine from this angle. (The kernel context is
perma-pinned in software, but after idling the GPU we know it is not
actually on the GPU, so it is safe to edit its image.)
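
The difference between editing a context image while the context is on
the GPU and editing it after idling can be sketched with a toy model.
Everything here is hypothetical (struct ctx_model, context_restore(),
context_save(), edit_image()); it only assumes, as described earlier in
the thread, that a context save writes the hardware state back over the
image.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one context: saved image plus live hardware state. */
struct ctx_model {
	uint32_t image_rpcs; /* value stored in the context image */
	uint32_t hw_rpcs;    /* value live in hardware while on the GPU */
	bool on_gpu;
};

/* Context restore: hardware picks up whatever the image holds. */
static void context_restore(struct ctx_model *ctx)
{
	ctx->hw_rpcs = ctx->image_rpcs;
	ctx->on_gpu = true;
}

/* Context save: hardware state is written back over the image. */
static void context_save(struct ctx_model *ctx)
{
	ctx->image_rpcs = ctx->hw_rpcs;
	ctx->on_gpu = false;
}

/* CPU-side edit of the image, as done when walking all contexts. */
static void edit_image(struct ctx_model *ctx, uint32_t rpcs)
{
	ctx->image_rpcs = rpcs;
}

void image_edit_example(void)
{
	struct ctx_model kctx = { .image_rpcs = 0xff };

	/* Edit while the context is on the GPU: lost at context save. */
	context_restore(&kctx);
	edit_image(&kctx, 0xf);
	context_save(&kctx);
	assert(kctx.image_rpcs == 0xff);

	/* Edit after the context was saved: picked up at next restore. */
	edit_image(&kctx, 0xf);
	context_restore(&kctx);
	assert(kctx.hw_rpcs == 0xf);
}
```

Whether the real kernel context is guaranteed to be off the GPU after
idling is the assumption this model takes for granted.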

Am I missing some hole here? If not, why does the code need a primary
update via a request in gen8_switch_to_updated_kernel_context? Editing
the context image after idling again seems sufficient.

Would it be possible to exercise the hypothetical loss of NOA
configuration from an IGT, if the kernel context wasn't correctly updated?

Regards,

Tvrtko
Lionel Landwerlin Sept. 11, 2018, 8:11 p.m. UTC | #14
On 10/09/2018 14:44, Tvrtko Ursulin wrote:
>
> On 07/09/2018 10:55, Lionel Landwerlin wrote:
>> On 07/09/2018 10:39, Tvrtko Ursulin wrote:
>>>
>>> On 07/09/2018 10:23, Lionel Landwerlin wrote:
>>>> On 07/09/2018 09:26, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 06/09/2018 11:36, Lionel Landwerlin wrote:
>>>>>> On 06/09/2018 11:22, Chris Wilson wrote:
>>>>>>> Quoting Lionel Landwerlin (2018-09-06 11:18:01)
>>>>>>>> On 06/09/2018 11:10, Chris Wilson wrote:
>>>>>>>>> Quoting Lionel Landwerlin (2018-09-06 10:57:47)
>>>>>>>>>> On 05/09/2018 15:22, Tvrtko Ursulin wrote:
>>>>>>>>>>> From: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Lionel Landwerlin 
>>>>>>>>>>> <lionel.g.landwerlin@intel.com>
>>>>>>>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>>>>>>> ---
>>>>>>>>>>>     drivers/gpu/drm/i915/i915_perf.c |  5 +++++
>>>>>>>>>>>     drivers/gpu/drm/i915/intel_lrc.c | 30 
>>>>>>>>>>> ++++++++++++++++++++----------
>>>>>>>>>>>     drivers/gpu/drm/i915/intel_lrc.h |  3 +++
>>>>>>>>>>>     3 files changed, 28 insertions(+), 10 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/i915_perf.c 
>>>>>>>>>>> b/drivers/gpu/drm/i915/i915_perf.c
>>>>>>>>>>> index ccb20230df2c..dd65b72bddd4 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/i915/i915_perf.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/i915/i915_perf.c
>>>>>>>>>>> @@ -1677,6 +1677,11 @@ static void 
>>>>>>>>>>> gen8_update_reg_state_unlocked(struct i915_gem_context *ctx,
>>>>>>>>>>>                 CTX_REG(reg_state, state_offset, 
>>>>>>>>>>> flex_regs[i], value);
>>>>>>>>>>>         }
>>>>>>>>>>> +
>>>>>>>>>>> +     CTX_REG(reg_state, CTX_R_PWR_CLK_STATE, 
>>>>>>>>>>> GEN8_R_PWR_CLK_STATE,
>>>>>>>>>>> +             gen8_make_rpcs(dev_priv,
>>>>>>>>>>> + &to_intel_context(ctx,
>>>>>>>>>>> + dev_priv->engine[RCS])->sseu));
>>>>>>>>>> I think there is one issue I missed on the previous 
>>>>>>>>>> iterations of this
>>>>>>>>>> patch.
>>>>>>>>>>
>>>>>>>>>> This gen8_update_reg_state_unlocked() is called when the GPU 
>>>>>>>>>> is parked
>>>>>>>>>> on the kernel context.
>>>>>>>>>>
>>>>>>>>>> It's supposed to update all contexts, but I think we might 
>>>>>>>>>> not be able
>>>>>>>>>> to update the kernel context image while the GPU is using it.
>>>>>>>>> The kernel context is only ever taken in extremis (you are either
>>>>>>>>> parking or stalling userspace) so I don't care.
>>>>>>>>
>>>>>>>> The patch exposing the RPCS configuration to userspace will 
>>>>>>>> make use of
>>>>>>>> the kernel context while OA/perf is enabled. Even if it 
>>>>>>>> reprograms the
>>>>>>>> locked value that will break the power configuration stability 
>>>>>>>> on Gen11
>>>>>>>> (because the locked configuration will be different from the 
>>>>>>>> kernel
>>>>>>>> context configuration).
>>>>>>> Sure, but as you point out that's only on changing configuration.
>>>>>>>
>>>>>>> What's missing in the patch is that we only bail early if the 
>>>>>>> new sseu
>>>>>>> matches the ce->sseu, but that doesn't necessarily match whats 
>>>>>>> in the
>>>>>>> context due to OA. (Or maybe I missed the conversion to rpcs 
>>>>>>> value and
>>>>>>> checking.)
>>>>>>> -Chris
>>>>>>>
>>>>>>
>>>>>> Yep, because gen8_make_rpcs() post-processes the values stored 
>>>>>> at the gem context level, we risk rerunning the kernel context to 
>>>>>> write the existing value.
>>>>>> Sorry this is all so messy :(
>>>>>
>>>>> Lets see if I managed to follow here.
>>>>>
>>>>> The current code indeed bails out at the set ctx param level if 
>>>>> the requested state matches the ce->state. My thinking was that 
>>>>> ce->state is the master state and whatever happens in "post 
>>>>> processing" via gen8_make_rpcs should be hidden from it since the 
>>>>> design is that the i915_perf.c will re-configure all contexts when 
>>>>> the OA active status changes (to either direction).
>>>>>
>>>>> So I don't see a problem in those two interactions.
>>>>
>>>>
>>>> Let's say you have contextA with sseu(slice,subslice)=(0x1/0xff) 
>>>> for ICL.
>>>>
>>>> You then enable OA which locks the configuration at (0x1,0xf).
>>>>
>>>> The kernel context has retained its (0x1/0xff) configuration.
>>>>
>>>>
>>>> And after you change the config of contextA to (0x1,0x7).
>>>>
>>>>
>>>> This would lead to the kernel context scheduled with (0x1,0xff) 
>>>> while OA is active.
>>>
>>> Okay that's a problem discussed in the paragraph below - that the 
>>> kernel context is not updated at all. But is it a problem for OA? 
>>> Will it mess up some counters even if kernel context isn't executing 
>>> anything interacting with them? Or is it?
>>
>>
>> What the HW is going to do to the NOA logic in power configuration 
>> changes is not really documented.
>> Experimentally on SKL GT4, it seems a change in power configuration 
>> will trigger a power off of everything before applying the power at 
>> the new configuration.
>> So that would imply losing the NOA programming when we switch to the 
>> kernel context which means invalid values in the counters.
>>
>>
>>>
>>>>
>>>>>
>>>>> Apart from one, get_param_sseu will lie a bit - we can discuss 
>>>>> about this one more. At one point I suggested we have two sets of 
>>>>> masks in the uAPI, requested and active in a way. So userspace 
>>>>> could query what it set and what is actually active.
>>>>>
>>>>> Now second issue is if i915_perf.c is able to reprogram the kernel 
>>>>> config.
>>>>>
>>>>> Here it's true, it will write to the context image and that will 
>>>>> get overwritten by context save.
>>>>>
>>>>> If that is a problem for OA, I was initially wondering if a throw-away 
>>>>> second "kernel" context could be used to re-program the real one, 
>>>>> but perhaps even simpler - what about a mmio write to program the 
>>>>> RPCS while kernel context is active?
>>>>
>>>>
>>>> Documentation says : "This register must not be programmed directly 
>>>> through CPU MMIO cycle."
>>>>
>>>>
>>>> Sorry :(
>>>
>>> Ugh.. okay, help me understand if kernel context absolutely needs to 
>>> follow the "lock" for OA to work and then we'll see what to do.
>>
>> I think so.
>>
>> Your idea of a throw-away context to reprogram everything seems sound.
>
> I was in the middle of refactoring the series to implement this when I 
> started suspecting the need for it.
>
> Premise of the problem statement was the kernel context doesn't get 
> updated by the OA code, but, there is a loop in 
> gen8_configure_all_contexts which goes through exactly all of them 
> after it has idled the GPU.
>
> AFAICS that means it is able to map the kernel context state and edit 
> it so everything seems fine from this angle. (Kernel context is 
> perma-pinned in software, but after idling the GPU we know it is not 
> actually on the GPU so it is safe to edit its image.)
>
> Am I missing some hole here, and if not, why does the code need to 
> have a primary update via a request in 
> gen8_switch_to_updated_kernel_context? Context image after idling 
> seems again sufficient.


My understanding is that the current code idles the GPU on the 
kernel context.
When idling stops, the GPU will save the last values from the HW 
registers into the kernel context image.
We currently don't see any issue because, prior to idling, we also run a 
set of commands on the kernel context that mirror the context image edits.

That won't be the case for the RPCS register because we can't load it 
from the command streamer.
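
To make the ordering concern concrete, here is a toy model in C. It is only
a sketch: struct toy_ctx and the toy_* helpers are hypothetical names invented
for illustration, not i915 code, and the model assumes a single register that
the hardware writes back on context save and reloads on context restore.

```c
#include <stdint.h>

/* Toy model, not driver code: one register tracked in the HW and in memory. */
struct toy_ctx {
	uint32_t hw_rpcs;    /* value live in the hardware */
	uint32_t image_rpcs; /* value stored in the context image */
};

/* CPU-side edit of the saved image, as done for every context on OA enable. */
static void toy_cpu_edit_image(struct toy_ctx *ctx, uint32_t rpcs)
{
	ctx->image_rpcs = rpcs;
}

/* Context save: the hardware writes its live value back into the image. */
static void toy_context_save(struct toy_ctx *ctx)
{
	ctx->image_rpcs = ctx->hw_rpcs;
}

/* Context restore: the hardware reloads its register from the image. */
static void toy_context_restore(struct toy_ctx *ctx)
{
	ctx->hw_rpcs = ctx->image_rpcs;
}

/*
 * Edit the image while the context is still resident, then let the
 * hardware save and later restore it. Returns the value the hardware
 * ends up running with.
 */
static uint32_t toy_edit_while_resident(uint32_t live, uint32_t wanted)
{
	struct toy_ctx ctx = { .hw_rpcs = live, .image_rpcs = live };

	toy_cpu_edit_image(&ctx, wanted);
	toy_context_save(&ctx);    /* overwrites the CPU edit */
	toy_context_restore(&ctx);
	return ctx.hw_rpcs;
}

/* Same edit, but made only after the context has been saved. */
static uint32_t toy_edit_after_save(uint32_t live, uint32_t wanted)
{
	struct toy_ctx ctx = { .hw_rpcs = live, .image_rpcs = live };

	toy_context_save(&ctx);
	toy_cpu_edit_image(&ctx, wanted);
	toy_context_restore(&ctx); /* picks up the CPU edit */
	return ctx.hw_rpcs;
}
```

In this model toy_edit_while_resident(0xff, 0x0f) leaves the hardware at 0xff,
while toy_edit_after_save(0xff, 0x0f) yields 0x0f. Whether the real kernel
context is still resident when the image is edited is exactly the open
question here.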


>
> Would it be possible to exercise the hypothetical loss of NOA 
> configuration from an IGT, if kernel context wasn't correctly updated?


The test configs saved in the kernel use NOA, and their counters progress 
at different ratios relative to the "core clock" (RING_TIMESTAMP 
register).
In IGT tests/perf.c, gen8_sanity_check_test_oa_reports() verifies that 
the counters progress as expected.

But I can't tell which part of the GT the test configs source their 
signals from. If they source them from the unslice, then that won't work :(

-
Lionel



>
> Regards,
>
> Tvrtko
>
Tvrtko Ursulin Sept. 12, 2018, 8:03 a.m. UTC | #15
On 11/09/2018 21:11, Lionel Landwerlin wrote:
> On 10/09/2018 14:44, Tvrtko Ursulin wrote:
>>
>> On 07/09/2018 10:55, Lionel Landwerlin wrote:
>>> On 07/09/2018 10:39, Tvrtko Ursulin wrote:
>>>>
>>>> On 07/09/2018 10:23, Lionel Landwerlin wrote:
>>>>> On 07/09/2018 09:26, Tvrtko Ursulin wrote:
>>>>>>
>>>>>> On 06/09/2018 11:36, Lionel Landwerlin wrote:
>>>>>>> On 06/09/2018 11:22, Chris Wilson wrote:
>>>>>>>> Quoting Lionel Landwerlin (2018-09-06 11:18:01)
>>>>>>>>> On 06/09/2018 11:10, Chris Wilson wrote:
>>>>>>>>>> Quoting Lionel Landwerlin (2018-09-06 10:57:47)
>>>>>>>>>>> On 05/09/2018 15:22, Tvrtko Ursulin wrote:
>>>>>>>>>>>> From: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
>>>>>>>>>>>>
>>>>>>>>>>>> If some of the contexts submitting workloads to the GPU have 
>>>>>>>>>>>> been
>>>>>>>>>>>> configured to shutdown slices/subslices, we might loose the NOA
>>>>>>>>>>>> configurations written in the NOA muxes.
>>>>>>>>>>>>
>>>>>>>>>>>> One possible solution to this problem is to reprogram the 
>>>>>>>>>>>> NOA muxes
>>>>>>>>>>>> when we switch to a new context. We initially tried this in the
>>>>>>>>>>>> workaround batchbuffer but some concerns where raised about 
>>>>>>>>>>>> the cost
>>>>>>>>>>>> of reprogramming at every context switch. This solution is 
>>>>>>>>>>>> also not
>>>>>>>>>>>> without consequences from the userspace point of view. 
>>>>>>>>>>>> Reprogramming
>>>>>>>>>>>> of the muxes can only happen once the powergating 
>>>>>>>>>>>> configuration has
>>>>>>>>>>>> changed (which happens after context switch). This means for 
>>>>>>>>>>>> a window
>>>>>>>>>>>> of time during the recording, counters recorded by the OA 
>>>>>>>>>>>> unit might
>>>>>>>>>>>> be invalid. This requires userspace dealing with OA reports 
>>>>>>>>>>>> to discard
>>>>>>>>>>>> the invalid values.
>>>>>>>>>>>>
>>>>>>>>>>>> Minimizing the reprogramming could be implemented by 
>>>>>>>>>>>> tracking of the
>>>>>>>>>>>> last programmed configuration somewhere in GGTT and use 
>>>>>>>>>>>> MI_PREDICATE
>>>>>>>>>>>> to discard some of the programming commands, but the command 
>>>>>>>>>>>> streamer
>>>>>>>>>>>> would still have to parse all the MI_LRI instructions in the
>>>>>>>>>>>> workaround batchbuffer.
>>>>>>>>>>>>
>>>>>>>>>>>> Another solution, which this change implements, is to simply 
>>>>>>>>>>>> disregard
>>>>>>>>>>>> the user requested configuration for the period of time when 
>>>>>>>>>>>> i915/perf
>>>>>>>>>>>> is active. There is no known issue with this apart from a 
>>>>>>>>>>>> performance
>>>>>>>>>>>> penality for some media workloads that benefit from running 
>>>>>>>>>>>> on a
>>>>>>>>>>>> partially powergated GPU. We already prevent RC6 from 
>>>>>>>>>>>> affecting the
>>>>>>>>>>>> programming so it doesn't sound completely unreasonable to 
>>>>>>>>>>>> hold on
>>>>>>>>>>>> powergating for the same reason.
>>>>>>>>>>>>
>>>>>>>>>>>> v2: Leave RPCS programming in intel_lrc.c (Lionel)
>>>>>>>>>>>>
>>>>>>>>>>>> v3: Update for s/union intel_sseu/struct intel_sseu/ (Lionel)
>>>>>>>>>>>>        More to_intel_context() (Tvrtko)
>>>>>>>>>>>>        s/dev_priv/i915/ (Tvrtko)
>>>>>>>>>>>>
>>>>>>>>>>>> Tvrtko Ursulin:
>>>>>>>>>>>>
>>>>>>>>>>>> v4:
>>>>>>>>>>>>     * Rebase for make_rpcs changes.
>>>>>>>>>>>>
>>>>>>>>>>>> v5:
>>>>>>>>>>>>     * Apply OA restriction from make_rpcs directly.
>>>>>>>>>>>>
>>>>>>>>>>>> v6:
>>>>>>>>>>>>     * Rebase for context image setup changes.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Lionel Landwerlin 
>>>>>>>>>>>> <lionel.g.landwerlin@intel.com>
>>>>>>>>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>>     drivers/gpu/drm/i915/i915_perf.c |  5 +++++
>>>>>>>>>>>>     drivers/gpu/drm/i915/intel_lrc.c | 30 
>>>>>>>>>>>> ++++++++++++++++++++----------
>>>>>>>>>>>>     drivers/gpu/drm/i915/intel_lrc.h |  3 +++
>>>>>>>>>>>>     3 files changed, 28 insertions(+), 10 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/i915_perf.c 
>>>>>>>>>>>> b/drivers/gpu/drm/i915/i915_perf.c
>>>>>>>>>>>> index ccb20230df2c..dd65b72bddd4 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/i915/i915_perf.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/i915/i915_perf.c
>>>>>>>>>>>> @@ -1677,6 +1677,11 @@ static void 
>>>>>>>>>>>> gen8_update_reg_state_unlocked(struct i915_gem_context *ctx,
>>>>>>>>>>>>                 CTX_REG(reg_state, state_offset, 
>>>>>>>>>>>> flex_regs[i], value);
>>>>>>>>>>>>         }
>>>>>>>>>>>> +
>>>>>>>>>>>> +     CTX_REG(reg_state, CTX_R_PWR_CLK_STATE, 
>>>>>>>>>>>> GEN8_R_PWR_CLK_STATE,
>>>>>>>>>>>> +             gen8_make_rpcs(dev_priv,
>>>>>>>>>>>> + &to_intel_context(ctx,
>>>>>>>>>>>> + dev_priv->engine[RCS])->sseu));
>>>>>>>>>>> I think there is one issue I missed on the previous 
>>>>>>>>>>> iterations of this
>>>>>>>>>>> patch.
>>>>>>>>>>>
>>>>>>>>>>> This gen8_update_reg_state_unlocked() is called when the GPU 
>>>>>>>>>>> is parked
>>>>>>>>>>> on the kernel context.
>>>>>>>>>>>
>>>>>>>>>>> It's supposed to update all contexts, but I think we might 
>>>>>>>>>>> not be able
>>>>>>>>>>> to update the kernel context image while the GPU is using it.
>>>>>>>>>> The kernel context is only ever taken in extremis (you are either
>>>>>>>>>> parking or stalling userspace) so I don't care.
>>>>>>>>>
>>>>>>>>> The patch exposing the RPCS configuration to userspace will 
>>>>>>>>> make use of
>>>>>>>>> the kernel context while OA/perf is enabled. Even if it 
>>>>>>>>> reprograms the
>>>>>>>>> locked value that will break the power configuration stability 
>>>>>>>>> on Gen11
>>>>>>>>> (because the locked configuration will be different from the 
>>>>>>>>> kernel
>>>>>>>>> context configuration).
>>>>>>>> Sure, but as you point out that's only on changing configuration.
>>>>>>>>
>>>>>>>> What's missing in the patch is that we only bail early if the 
>>>>>>>> new sseu
>>>>>>>> matches the ce->sseu, but that doesn't necessarily match whats 
>>>>>>>> in the
>>>>>>>> context due to OA. (Or maybe I missed the conversion to rpcs 
>>>>>>>> value and
>>>>>>>> checking.)
>>>>>>>> -Chris
>>>>>>>>
>>>>>>>
>>>>>>> Yep, because gen8_make_rpcs() post-processes the values stored 
>>>>>>> at the gem context level, we risk rerunning the kernel context to 
>>>>>>> write the existing value.
>>>>>>> Sorry this is all so messy :(
>>>>>>
>>>>>> Lets see if I managed to follow here.
>>>>>>
>>>>>> The current code indeed bails out at the set ctx param level if 
>>>>>> the requested state matches the ce->state. My thinking was that 
>>>>>> ce->state is the master state and whatever happens in "post 
>>>>>> processing" via gen8_make_rpcs should be hidden from it since the 
>>>>>> design is that the i915_perf.c will re-configure all contexts when 
>>>>>> the OA active status changes (to either direction).
>>>>>>
>>>>>> So I don't see a problem in those two interactions.
>>>>>
>>>>>
>>>>> Let's say you have contextA with sseu(slice,subslice)=(0x1/0xff) 
>>>>> for ICL.
>>>>>
>>>>> You then enable OA which locks the configuration at (0x1,0xf).
>>>>>
>>>>> The kernel context has retained its (0x1/0xff) configuration.
>>>>>
>>>>>
>>>>> And after you change the config of contextA to (0x1,0x7).
>>>>>
>>>>>
>>>>> This would lead to the kernel context scheduled with (0x1,0xff) 
>>>>> while OA is active.
>>>>
>>>> Okay that's a problem discussed in the paragraph below - that the 
>>>> kernel context is not updated at all. But is it a problem for OA? 
>>>> Will it mess up some counters even if kernel context isn't executing 
>>>> anything interacting with them? Or is it?
>>>
>>>
>>> What the HW is going to do to the NOA logic in power configuration 
>>> changes is not really documented.
>>> Experimentally on SKL GT4, it seems a change in power configuration 
>>> will trigger a power off of everything before applying the power at 
>>> the new configuration.
>>> So that would imply losing the NOA programming when we switch to the 
>>> kernel context which means invalid values in the counters.
>>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>> Apart from one, get_param_sseu will lie a bit - we can discuss 
>>>>>> about this one more. At one point I suggested we have two sets of 
>>>>>> masks in the uAPI, requested and active in a way. So userspace 
>>>>>> could query what it set and what is actually active.
>>>>>>
>>>>>> Now second issue is if i915_perf.c is able to reprogram the kernel 
>>>>>> config.
>>>>>>
>>>>>> Here it's true, it will write to the context image and that will 
>>>>>> get overwritten by context save.
>>>>>>
>>>>>> If that is a problem for OA, I was initially wondering if a throw-away 
>>>>>> second "kernel" context could be used to re-program the real one, 
>>>>>> but perhaps even simpler - what about a mmio write to program the 
>>>>>> RPCS while kernel context is active?
>>>>>
>>>>>
>>>>> Documentation says : "This register must not be programmed directly 
>>>>> through CPU MMIO cycle."
>>>>>
>>>>>
>>>>> Sorry :(
>>>>
>>>> Ugh.. okay, help me understand if kernel context absolutely needs to 
>>>> follow the "lock" for OA to work and then we'll see what to do.
>>>
>>> I think so.
>>>
>>> Your idea of a throw-away context to reprogram everything seems sound.
>>
>> I was in the middle of refactoring the series to implement this when I 
>> started suspecting the need for it.
>>
>> Premise of the problem statement was the kernel context doesn't get 
>> updated by the OA code, but, there is a loop in 
>> gen8_configure_all_contexts which goes through exactly all of them 
>> after it has idled the GPU.
>>
>> AFAICS that means it is able to map the kernel context state and edit 
>> it so everything seems fine from this angle. (Kernel context is 
>> perma-pinned in software, but after idling the GPU we know it is not 
>> actually on the GPU so it is safe to edit its image.)
>>
>> Am I missing some hole here, and if not, why does the code need to 
>> have a primary update via a request in 
>> gen8_switch_to_updated_kernel_context? Context image after idling 
>> seems again sufficient.
> 
> 
> My understanding is that the current code idles the GPU on the 
> kernel context.
> When idling stops, the GPU will save the last values from the HW 
> registers into the kernel context image.
> We currently don't see any issue because, prior to idling, we also run a 
> set of commands on the kernel context that mirror the context image edits.

We actually retire everything, including the kernel context, wait for any 
pending interrupts (context save) and check that the CS reports idle. So I 
think it is actually simpler than how the OA code currently does it. I just 
wanted a way to express it as a test.

> That won't be the case for the RPCS register because we can't load it 
> from the command streamer.
> 
>>
>> Would it be possible to exercise the hypothetical loss of NOA 
>> configuration from an IGT, if kernel context wasn't correctly updated?
> 
> 
> The test configs saved in the kernel use NOA, and their counters progress 
> at different ratios relative to the "core clock" (RING_TIMESTAMP 
> register).
> In IGT tests/perf.c, gen8_sanity_check_test_oa_reports() verifies that 
> the counters progress as expected.
> 
> But I can't tell which part of the GT the test configs source their 
> signals from. If they source them from the unslice, then that won't work :(

So not testable? Shall we still apply the doctrine of "if in doubt, rip 
it out"? Or is that too risky? It shouldn't be; it looks pretty obvious 
that we check properly for idle. If we simplify, the worst that can happen 
is that OA becomes glitchy when paired with Gen11 media workloads, which is 
not so critical, but I don't think it will break to start with.

Regards,

Tvrtko

Patch

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index ccb20230df2c..dd65b72bddd4 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -1677,6 +1677,11 @@  static void gen8_update_reg_state_unlocked(struct i915_gem_context *ctx,
 
 		CTX_REG(reg_state, state_offset, flex_regs[i], value);
 	}
+
+	CTX_REG(reg_state, CTX_R_PWR_CLK_STATE, GEN8_R_PWR_CLK_STATE,
+		gen8_make_rpcs(dev_priv,
+			       &to_intel_context(ctx,
+						 dev_priv->engine[RCS])->sseu));
 }
 
 /*
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 8a477e43dbca..9709c1fbe836 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1305,9 +1305,6 @@  static int __context_pin(struct i915_gem_context *ctx, struct i915_vma *vma)
 	return i915_vma_pin(vma, 0, 0, flags);
 }
 
-static u32 make_rpcs(struct drm_i915_private *dev_priv,
-		     struct intel_sseu *ctx_sseu);
-
 static struct intel_context *
 __execlists_context_pin(struct intel_engine_cs *engine,
 			struct i915_gem_context *ctx,
@@ -1350,7 +1347,7 @@  __execlists_context_pin(struct intel_engine_cs *engine,
 	/* RPCS */
 	if (engine->class == RENDER_CLASS) {
 		ce->lrc_reg_state[CTX_R_PWR_CLK_STATE + 1] =
-					make_rpcs(engine->i915, &ce->sseu);
+					gen8_make_rpcs(engine->i915, &ce->sseu);
 	}
 
 	ce->state->obj->pin_global++;
@@ -2494,15 +2491,28 @@  int logical_xcs_ring_init(struct intel_engine_cs *engine)
 	return logical_ring_init(engine);
 }
 
-static u32 make_rpcs(struct drm_i915_private *dev_priv,
-		     struct intel_sseu *ctx_sseu)
+u32 gen8_make_rpcs(struct drm_i915_private *dev_priv,
+		   struct intel_sseu *req_sseu)
 {
 	const struct sseu_dev_info *sseu = &INTEL_INFO(dev_priv)->sseu;
 	bool subslice_pg = sseu->has_subslice_pg;
-	u8 slices = hweight8(ctx_sseu->slice_mask);
-	u8 subslices = hweight8(ctx_sseu->subslice_mask);
+	struct intel_sseu ctx_sseu;
+	u8 slices, subslices;
 	u32 rpcs = 0;
 
+	/*
+	 * If i915/perf is active, we want a stable powergating configuration
+	 * on the system. The most natural configuration to take in that case
+	 * is the default (i.e maximum the hardware can do).
+	 */
+	if (unlikely(dev_priv->perf.oa.exclusive_stream))
+		ctx_sseu = intel_device_default_sseu(dev_priv);
+	else
+		ctx_sseu = *req_sseu;
+
+	slices = hweight8(ctx_sseu.slice_mask);
+	subslices = hweight8(ctx_sseu.subslice_mask);
+
 	/*
 	 * Since the SScount bitfield in GEN8_R_PWR_CLK_STATE is only three bits
 	 * wide and Icelake has up to eight subslices, specfial programming is
@@ -2572,13 +2582,13 @@  static u32 make_rpcs(struct drm_i915_private *dev_priv,
 	if (sseu->has_eu_pg) {
 		u32 val;
 
-		val = ctx_sseu->min_eus_per_subslice << GEN8_RPCS_EU_MIN_SHIFT;
+		val = ctx_sseu.min_eus_per_subslice << GEN8_RPCS_EU_MIN_SHIFT;
 		GEM_BUG_ON(val & ~GEN8_RPCS_EU_MIN_MASK);
 		val &= GEN8_RPCS_EU_MIN_MASK;
 
 		rpcs |= val;
 
-		val = ctx_sseu->max_eus_per_subslice << GEN8_RPCS_EU_MAX_SHIFT;
+		val = ctx_sseu.max_eus_per_subslice << GEN8_RPCS_EU_MAX_SHIFT;
 		GEM_BUG_ON(val & ~GEN8_RPCS_EU_MAX_MASK);
 		val &= GEN8_RPCS_EU_MAX_MASK;
 
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index f5a5502ecf70..11da6fc0002d 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -104,4 +104,7 @@  void intel_lr_context_resume(struct drm_i915_private *dev_priv);
 
 void intel_execlists_set_default_submission(struct intel_engine_cs *engine);
 
+u32 gen8_make_rpcs(struct drm_i915_private *dev_priv,
+		   struct intel_sseu *ctx_sseu);
+
 #endif /* _INTEL_LRC_H_ */
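
For readers skimming the diff, the selection step added to gen8_make_rpcs()
can be modelled in isolation. The following is a minimal sketch under stated
assumptions: struct toy_sseu, toy_hweight8() and toy_effective_sseu() are
hypothetical stand-ins written for this illustration (they are not kernel
APIs), and only the slice/subslice masks are modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct intel_sseu; only the masks are modelled. */
struct toy_sseu {
	uint8_t slice_mask;
	uint8_t subslice_mask;
};

/* Count set bits in an 8-bit mask (the role hweight8() plays in the patch). */
static unsigned int toy_hweight8(uint8_t mask)
{
	unsigned int count = 0;

	while (mask) {
		count += mask & 1u;
		mask >>= 1;
	}
	return count;
}

/*
 * Pick the configuration to program: the device default while OA is
 * active, otherwise whatever the context requested.
 */
static struct toy_sseu toy_effective_sseu(bool oa_active,
					  struct toy_sseu device_default,
					  struct toy_sseu requested)
{
	return oa_active ? device_default : requested;
}
```

With a default of (0x1, 0xff) and a request of (0x1, 0x7), the sketch selects
eight subslices while OA is active and three otherwise. The request itself is
left untouched, which matches the design discussed in the thread, where
ce->sseu remains the master state and the override is applied only when the
RPCS value is generated.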