drm/i915: Restore GT performance in headless mode with DMC loaded

Message ID 20170505114321.20348-1-tvrtko.ursulin@linux.intel.com (mailing list archive)
State New, archived

Commit Message

Tvrtko Ursulin May 5, 2017, 11:43 a.m. UTC
From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

It seems that the DMC likes to transition between the DC states
a lot when there are no connected displays (no active power
domains) during simple command submission.

This frantic activity on DC states has a terrible impact on the
performance of the overall chip with huge latencies observed in
the interrupt handlers and elsewhere. Simple tests like
igt/gem_latency -n 0 are slowed down by a factor of eight.

Work around it by grabbing a modeset display power domain whilst
there is any GT activity. This seems to be effective in making
the DMC keep its paws off the chip.

On the other hand this may have a negative impact on the overall
power budget of the chip and so could still affect performance.

This version limits the workaround to SKL GT3 and GT4 parts, but
this is just due to the absence of testing on other platforms. It
is possible we will have to apply it more widely.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100572
Testcase: igt/gem_exec_nop/headless
Cc: Imre Deak <imre.deak@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         | 5 +++++
 drivers/gpu/drm/i915/i915_gem.c         | 4 ++++
 drivers/gpu/drm/i915/i915_gem_request.c | 3 +++
 3 files changed, 12 insertions(+)

Comments

Chris Wilson May 5, 2017, 11:54 a.m. UTC | #1
On Fri, May 05, 2017 at 12:43:21PM +0100, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> It seems that the DMC likes to transition between the DC states
> a lot when there are no connected displays (no active power
> domains) during simple command submission.
> 
> This frantic activity on DC states has a terrible impact on the
> performance of the overall chip with huge latencies observed in
> the interrupt handlers and elsewhere. Simple tests like
> igt/gem_latency -n 0 are slowed down by a factor of eight.
> 
> Work around it by grabbing a modeset display power domain whilst
> there is any GT activity. This seems to be effective in making
> the DMC keep its paws off the chip.
> 
> On the other hand this may have a negative impact on the overall
> power budget of the chip and so could still affect performance.

Please add this as a comment to the code, I think in mark_busy(). I
don't think this w/a will remain applicable forever and so merits a
continual reminder and being discussed again in future.
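
Something along these lines perhaps - just an illustrative sketch, the
exact wording is up to you:

	/*
	 * With no connected displays (no active power domains) the DMC
	 * frantically cycles between DC states during command submission,
	 * adding huge latency to interrupt handling and elsewhere. Hold
	 * POWER_DOMAIN_MODESET while the GT is awake to keep the DMC out
	 * of the way, at the cost of a larger power budget. Revisit
	 * whether this is still needed as the firmware evolves.
	 */
	if (NEEDS_CSR_GT_PERF_WA(dev_priv))
		intel_display_power_get(dev_priv, POWER_DOMAIN_MODESET);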
 
> This version limits the workaround to SKL GT3 and GT4 parts, but
> this is just due to the absence of testing on other platforms. It
> is possible we will have to apply it more widely.
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100572
> Testcase: igt/gem_exec_nop/headless
> Cc: Imre Deak <imre.deak@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.h         | 5 +++++
>  drivers/gpu/drm/i915/i915_gem.c         | 4 ++++
>  drivers/gpu/drm/i915/i915_gem_request.c | 3 +++
>  3 files changed, 12 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 320c16df1c9c..4d58e2e28c2f 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2990,6 +2990,11 @@ intel_info(const struct drm_i915_private *dev_priv)
>  
>  #define HAS_DECOUPLED_MMIO(dev_priv) (INTEL_INFO(dev_priv)->has_decoupled_mmio)
>  
> +#define NEEDS_CSR_GT_PERF_WA(dev_priv) \
> +	HAS_CSR(dev_priv) && \
> +	(IS_SKL_GT3(dev_priv) || IS_SKL_GT4(dev_priv)) && \
> +	(dev_priv)->csr.dmc_payload

csr.dmc_payload is a bit of a surprise, but looks correct.

Acked-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris
Ville Syrjälä May 5, 2017, 2:49 p.m. UTC | #2
On Fri, May 05, 2017 at 12:43:21PM +0100, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> It seems that the DMC likes to transition between the DC states
> a lot when there are no connected displays (no active power
> domains) during simple command submission.

Is it trapping on some interrupt register accesses or what? And
if so which registers are affected?

> 
> This frantic activity on DC states has a terrible impact on the
> performance of the overall chip with huge latencies observed in
> the interrupt handlers and elsewhere. Simple tests like
> igt/gem_latency -n 0 are slowed down by a factor of eight.
> 
> Work around it by grabbing a modeset display power domain whilst
> there is any GT activity. This seems to be effective in making
> the DMC keep its paws off the chip.
> 
> On the other hand this may have a negative impact on the overall
> power budget of the chip and so could still affect performance.
> 
> This version limits the workaround to SKL GT3 and GT4 parts, but
> this is just due to the absence of testing on other platforms. It
> is possible we will have to apply it more widely.
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100572
> Testcase: igt/gem_exec_nop/headless
> Cc: Imre Deak <imre.deak@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.h         | 5 +++++
>  drivers/gpu/drm/i915/i915_gem.c         | 4 ++++
>  drivers/gpu/drm/i915/i915_gem_request.c | 3 +++
>  3 files changed, 12 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 320c16df1c9c..4d58e2e28c2f 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2990,6 +2990,11 @@ intel_info(const struct drm_i915_private *dev_priv)
>  
>  #define HAS_DECOUPLED_MMIO(dev_priv) (INTEL_INFO(dev_priv)->has_decoupled_mmio)
>  
> +#define NEEDS_CSR_GT_PERF_WA(dev_priv) \
> +	HAS_CSR(dev_priv) && \
> +	(IS_SKL_GT3(dev_priv) || IS_SKL_GT4(dev_priv)) && \
> +	(dev_priv)->csr.dmc_payload
> +
>  #include "i915_trace.h"
>  
>  static inline bool intel_scanout_needs_vtd_wa(struct drm_i915_private *dev_priv)
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index b2727905ef2b..c52d863f409c 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -3200,7 +3200,11 @@ i915_gem_idle_work_handler(struct work_struct *work)
>  
>  	if (INTEL_GEN(dev_priv) >= 6)
>  		gen6_rps_idle(dev_priv);
> +
>  	intel_runtime_pm_put(dev_priv);
> +
> +	if (NEEDS_CSR_GT_PERF_WA(dev_priv))
> +		intel_display_power_put(dev_priv, POWER_DOMAIN_MODESET);
>  out_unlock:
>  	mutex_unlock(&dev->struct_mutex);
>  
> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> index 10361c7e3b37..10a3b51f6362 100644
> --- a/drivers/gpu/drm/i915/i915_gem_request.c
> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> @@ -873,6 +873,9 @@ static void i915_gem_mark_busy(const struct intel_engine_cs *engine)
>  
>  	GEM_BUG_ON(!dev_priv->gt.active_requests);
>  
> +	if (NEEDS_CSR_GT_PERF_WA(dev_priv))
> +		intel_display_power_get(dev_priv, POWER_DOMAIN_MODESET);
> +
>  	intel_runtime_pm_get_noresume(dev_priv);
>  	dev_priv->gt.awake = true;
>  
> -- 
> 2.9.3
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Tvrtko Ursulin May 5, 2017, 4:13 p.m. UTC | #3
On 05/05/2017 15:49, Ville Syrjälä wrote:
> On Fri, May 05, 2017 at 12:43:21PM +0100, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> It seems that the DMC likes to transition between the DC states
>> a lot when there are no connected displays (no active power
>> domains) during simple command submission.
>
> Is it trapping on some interrupt register accesses or what? And
> if so which registers are affected?

It looks like GT IIR or something along those lines, but I couldn't 
say with total confidence. It is just a guess. The firmware binary 
definitely "mentions" those registers, as can be seen by inspecting it 
with a hex editor.

The data I collected at least seems to present a correlation between the 
batch frequency and DC state transition frequency:

tgt	DC		irqs	irqs/s	irq	batch/s	DC/s	DC/batch
submit	transitions			/
freq					batch
========================================================================
10000	20000		78300	7830.00	1.96	4000.00	2000.00	0.50
9901	14000		52855	7550.71	1.32	5714.29	2000.00	0.35
9524	13500		49100	7328.36	1.23	5970.15	2014.93	0.34
9091	13500		49200	7235.29	1.23	5882.35	1985.29	0.34
5000	16900		33290	3916.47	0.83	4705.88	1988.24	0.42
3333	27800		69550	4932.62	1.74	2836.88	1971.63	0.70
1667	57200		80200	2655.63	2.01	1324.50	1894.04	1.43
909	80000		80034	1482.11	2.00	740.74	1481.48	2.00
476	87000		80039	820.91	2.00	410.26	892.31	2.18
196	160000		80055	334.40	2.00	167.08	668.34	4.00

Submitted batches were ~100us long in all cases. So with low batch 
frequency it looks pretty believable. For example, when we have 167.08 
batches/s we have 334.40 irqs/s - double, as expected for execlists - 
and double that again in DC transitions per second. Each irq is one GT 
IIR write from the GPU side, and another from the CPU side.

This was actually Imre's suggestion btw. Before, I was only looking at 
the irq/s to DC/s correlation, which was confusing me a lot since there 
are more of the latter, so I thought it couldn't be the trigger. But once 
Imre mentioned the possibility that things are triggered by IIR register 
writes, the numbers started making more sense.

Then with higher batch frequencies the ratio starts falling - whether 
because DC transitions are too slow to keep up, or because of something 
else, I am not sure.

Regards,

Tvrtko
Ville Syrjälä May 5, 2017, 4:28 p.m. UTC | #4
On Fri, May 05, 2017 at 05:13:58PM +0100, Tvrtko Ursulin wrote:
> 
> On 05/05/2017 15:49, Ville Syrjälä wrote:
> > On Fri, May 05, 2017 at 12:43:21PM +0100, Tvrtko Ursulin wrote:
> >> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>
> >> It seems that the DMC likes to transition between the DC states
> >> a lot when there are no connected displays (no active power
> >> domains) during simple command submission.
> >
> > Is it trapping on some interrupt register accesses or what? And
> > if so which registers are affected?
> 
> It looks like GT IIR or something along those lines, but I couldn't 
> say with total confidence.

<read DC counters>
for i in `seq 1 100` ; do IGT_NO_FORCEWAKE=1 intel_reg read <whatever> ; done
<read DC counters>

Should be pretty trivial to run that against the suspect
registers.

> It is just a guess. The firmware binary 
> definitely "mentions" those registers, as can be seen by inspecting it 
> with a hex editor.
> 
> The data I collected at least seems to present a correlation between the 
> batch frequency and DC state transition frequency:
> 
> tgt	DC		irqs	irqs/s	irq	batch/s	DC/s	DC/batch
> submit	transitions			/
> freq					batch
> ========================================================================
> 10000	20000		78300	7830.00	1.96	4000.00	2000.00	0.50
> 9901	14000		52855	7550.71	1.32	5714.29	2000.00	0.35
> 9524	13500		49100	7328.36	1.23	5970.15	2014.93	0.34
> 9091	13500		49200	7235.29	1.23	5882.35	1985.29	0.34
> 5000	16900		33290	3916.47	0.83	4705.88	1988.24	0.42
> 3333	27800		69550	4932.62	1.74	2836.88	1971.63	0.70
> 1667	57200		80200	2655.63	2.01	1324.50	1894.04	1.43
> 909	80000		80034	1482.11	2.00	740.74	1481.48	2.00
> 476	87000		80039	820.91	2.00	410.26	892.31	2.18
> 196	160000		80055	334.40	2.00	167.08	668.34	4.00
> 
> Submitted batches were ~100us long in all cases. So with low batch 
> frequency it looks pretty believable. For example, when we have 167.08 
> batches/s we have 334.40 irqs/s - double, as expected for execlists - 
> and double that again in DC transitions per second. Each irq is one GT 
> IIR write from the GPU side, and another from the CPU side.

GPU doesn't actually write the IIRs. It's just latching stuff from the
ISR. Whether the ISR edge or some higher level interrupt event actually
causes the DMC to kick into action isn't clear at all. My original
impression was that it just traps the register accesses.
Jani Nikula May 8, 2017, 9:23 a.m. UTC | #5
On Fri, 05 May 2017, Tvrtko Ursulin <tursulin@ursulin.net> wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> It seems that the DMC likes to transition between the DC states
> a lot when there are no connected displays (no active power
> domains) during simple command submission.
>
> This frantic activity on DC states has a terrible impact on the
> performance of the overall chip with huge latencies observed in
> the interrupt handlers and elsewhere. Simple tests like
> igt/gem_latency -n 0 are slowed down by a factor of eight.
>
> Work around it by grabbing a modeset display power domain whilst
> there is any GT activity. This seems to be effective in making
> the DMC keep its paws off the chip.
>
> On the other hand this may have a negative impact on the overall
> power budget of the chip and so could still affect performance.
>
> This version limits the workaround to SKL GT3 and GT4 parts, but
> this is just due to the absence of testing on other platforms. It
> is possible we will have to apply it more widely.
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100572
> Testcase: igt/gem_exec_nop/headless
> Cc: Imre Deak <imre.deak@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.h         | 5 +++++
>  drivers/gpu/drm/i915/i915_gem.c         | 4 ++++
>  drivers/gpu/drm/i915/i915_gem_request.c | 3 +++
>  3 files changed, 12 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 320c16df1c9c..4d58e2e28c2f 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2990,6 +2990,11 @@ intel_info(const struct drm_i915_private *dev_priv)
>  
>  #define HAS_DECOUPLED_MMIO(dev_priv) (INTEL_INFO(dev_priv)->has_decoupled_mmio)
>  
> +#define NEEDS_CSR_GT_PERF_WA(dev_priv) \
> +	HAS_CSR(dev_priv) && \
> +	(IS_SKL_GT3(dev_priv) || IS_SKL_GT4(dev_priv)) && \
> +	(dev_priv)->csr.dmc_payload

Nitpick: the whole thing could use parentheses around it for
consistency, although I don't see any sane use of the macro causing
precedence surprises.
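
I.e. parenthesize the whole expression, something like (untested, just
to illustrate):

#define NEEDS_CSR_GT_PERF_WA(dev_priv) \
	(HAS_CSR(dev_priv) && \
	 (IS_SKL_GT3(dev_priv) || IS_SKL_GT4(dev_priv)) && \
	 (dev_priv)->csr.dmc_payload)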

BR,
Jani.

> +
>  #include "i915_trace.h"
>  
>  static inline bool intel_scanout_needs_vtd_wa(struct drm_i915_private *dev_priv)
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index b2727905ef2b..c52d863f409c 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -3200,7 +3200,11 @@ i915_gem_idle_work_handler(struct work_struct *work)
>  
>  	if (INTEL_GEN(dev_priv) >= 6)
>  		gen6_rps_idle(dev_priv);
> +
>  	intel_runtime_pm_put(dev_priv);
> +
> +	if (NEEDS_CSR_GT_PERF_WA(dev_priv))
> +		intel_display_power_put(dev_priv, POWER_DOMAIN_MODESET);
>  out_unlock:
>  	mutex_unlock(&dev->struct_mutex);
>  
> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> index 10361c7e3b37..10a3b51f6362 100644
> --- a/drivers/gpu/drm/i915/i915_gem_request.c
> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> @@ -873,6 +873,9 @@ static void i915_gem_mark_busy(const struct intel_engine_cs *engine)
>  
>  	GEM_BUG_ON(!dev_priv->gt.active_requests);
>  
> +	if (NEEDS_CSR_GT_PERF_WA(dev_priv))
> +		intel_display_power_get(dev_priv, POWER_DOMAIN_MODESET);
> +
>  	intel_runtime_pm_get_noresume(dev_priv);
>  	dev_priv->gt.awake = true;

Patch

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 320c16df1c9c..4d58e2e28c2f 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2990,6 +2990,11 @@  intel_info(const struct drm_i915_private *dev_priv)
 
 #define HAS_DECOUPLED_MMIO(dev_priv) (INTEL_INFO(dev_priv)->has_decoupled_mmio)
 
+#define NEEDS_CSR_GT_PERF_WA(dev_priv) \
+	HAS_CSR(dev_priv) && \
+	(IS_SKL_GT3(dev_priv) || IS_SKL_GT4(dev_priv)) && \
+	(dev_priv)->csr.dmc_payload
+
 #include "i915_trace.h"
 
 static inline bool intel_scanout_needs_vtd_wa(struct drm_i915_private *dev_priv)
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index b2727905ef2b..c52d863f409c 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3200,7 +3200,11 @@  i915_gem_idle_work_handler(struct work_struct *work)
 
 	if (INTEL_GEN(dev_priv) >= 6)
 		gen6_rps_idle(dev_priv);
+
 	intel_runtime_pm_put(dev_priv);
+
+	if (NEEDS_CSR_GT_PERF_WA(dev_priv))
+		intel_display_power_put(dev_priv, POWER_DOMAIN_MODESET);
 out_unlock:
 	mutex_unlock(&dev->struct_mutex);
 
diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index 10361c7e3b37..10a3b51f6362 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -873,6 +873,9 @@  static void i915_gem_mark_busy(const struct intel_engine_cs *engine)
 
 	GEM_BUG_ON(!dev_priv->gt.active_requests);
 
+	if (NEEDS_CSR_GT_PERF_WA(dev_priv))
+		intel_display_power_get(dev_priv, POWER_DOMAIN_MODESET);
+
 	intel_runtime_pm_get_noresume(dev_priv);
 	dev_priv->gt.awake = true;