drm/i915: Do not use iowait while waiting for the GPU

Message ID 20180727184312.29937-1-chris@chris-wilson.co.uk (mailing list archive)
State New, archived
Series drm/i915: Do not use iowait while waiting for the GPU

Commit Message

Chris Wilson July 27, 2018, 6:43 p.m. UTC
A recent trend for cpufreq is to boost the CPU frequencies for
iowaiters, in particular to benefit high-frequency I/O. We do the same
and boost the GPU clocks to try and minimise time spent waiting for the
GPU. However, as the igfx and CPU share the same TDP, boosting the CPU
frequency will result in the GPU being throttled and its frequency being
reduced. Thus declaring iowait negatively impacts GPU throughput.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107410
References: 52ccc4314293 ("cpufreq: intel_pstate: HWP boost performance on IO wakeup")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Eero Tamminen <eero.t.tamminen@intel.com>
Cc: Francisco Jerez <currojerez@riseup.net>
---
 drivers/gpu/drm/i915/i915_request.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
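
For context: the patch replaces io_schedule_timeout() with
schedule_timeout() in i915_request_wait() (full diff below). The
semantic difference between the two calls is small but is exactly the
point of the change: io_schedule_timeout() is just schedule_timeout()
bracketed by marking the task as an iowaiter, which is the hint cpufreq
governors key their boost off. Roughly, from kernel/sched/core.c of
this era (details vary between kernel versions):

int io_schedule_prepare(void)
{
	int old_iowait = current->in_iowait;

	current->in_iowait = 1;	/* visible to cpufreq/iowait accounting */
	blk_schedule_flush_plug(current);

	return old_iowait;
}

long __sched io_schedule_timeout(long timeout)
{
	int token;
	long ret;

	token = io_schedule_prepare();
	ret = schedule_timeout(timeout);	/* the sleep itself is identical */
	io_schedule_finish(token);		/* restores ->in_iowait */

	return ret;
}

So the patch does not change how the wait sleeps, only whether the
scheduler (and hence intel_pstate/schedutil) accounts it as I/O wait.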

Comments

Chris Wilson July 27, 2018, 6:58 p.m. UTC | #1
Quoting Chris Wilson (2018-07-27 19:43:12)
> A recent trend for cpufreq is to boost the CPU frequencies for
> iowaiters, in particular to benefit high-frequency I/O. We do the same
> and boost the GPU clocks to try and minimise time spent waiting for the
> GPU. However, as the igfx and CPU share the same TDP, boosting the CPU
> frequency will result in the GPU being throttled and its frequency being
> reduced. Thus declaring iowait negatively impacts GPU throughput.
> 
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107410
> References: 52ccc4314293 ("cpufreq: intel_pstate: HWP boost performance on IO wakeup")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> Cc: Francisco Jerez <currojerez@riseup.net>
> ---
>  drivers/gpu/drm/i915/i915_request.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index 5c2c93cbab12..7ef7ade12073 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -1330,7 +1330,7 @@ long i915_request_wait(struct i915_request *rq,
>                         goto complete;
>                 }
>  
> -               timeout = io_schedule_timeout(timeout);
> +               timeout = schedule_timeout(timeout);

Wrong one...
-Chris
Francisco Jerez July 28, 2018, 5:20 a.m. UTC | #2
Chris Wilson <chris@chris-wilson.co.uk> writes:

> A recent trend for cpufreq is to boost the CPU frequencies for
> iowaiters, in particular to benefit high-frequency I/O. We do the same
> and boost the GPU clocks to try and minimise time spent waiting for the
> GPU. However, as the igfx and CPU share the same TDP, boosting the CPU
> frequency will result in the GPU being throttled and its frequency being
> reduced. Thus declaring iowait negatively impacts GPU throughput.
>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107410
> References: 52ccc4314293 ("cpufreq: intel_pstate: HWP boost performance on IO wakeup")

This patch causes up to ~13% performance regressions (with significance
5%) on several latency-sensitive tests on my BXT:

 jxrendermark/rendering-test=Linear Gradient Blend/rendering-size=128x128:     XXX ±35.69% x53 -> XXX ±32.57% x61 d=-13.52% ±31.88% p=2.58%
 jxrendermark/rendering-test=Transformed Blit Bilinear/rendering-size=128x128: XXX ±3.51% x21 ->  XXX ±3.77% x21   d=-12.08% ±3.41% p=0.00%
 gtkperf/gtk-test=GtkComboBox:                                                 XXX ±1.90% x19 ->  XXX ±1.59% x20    d=-4.74% ±1.71% p=0.00%
 x11perf/test=500px Compositing From Pixmap To Window:                         XXX ±2.35% x21 ->  XXX ±1.73% x21    d=-2.69% ±2.04% p=0.01%
 qgears2/render-backend=XRender Extension/test-mode=Text:                      XXX ±0.38% x21 ->  XXX ±0.40% x25    d=-2.20% ±0.38% p=0.00%
 x11perf/test=500px Compositing From Pixmap To Window:                         XXX ±2.78% x53 ->  XXX ±2.27% x61    d=-1.77% ±2.50% p=0.03%

It's unsurprising to see latency-sensitive workloads relying on the
lower latency offered by io_schedule_timeout(), since the CPUFREQ
governor will have a substantial downward bias without it, in response to
the intermittent CPU usage pattern of those benchmarks.

We could possibly have the best of both worlds if the CPUFREQ governor
didn't attempt to EPP-boost the CPU frequency on IOWAIT while the system
is heavily IO-bound: the occurrence of both conditions simultaneously
indicates that the CPU workload is itself likely to be IO-bound, so
boosting the CPU frequency will leave its performance unchanged and can
only pessimize the performance of the system as a whole.  This could be
achieved by using the statistic implemented here [1].  I think the
offending patch should probably be reverted for the time being...

[1] https://patchwork.kernel.org/patch/10312259/
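
To illustrate the bias in question, the governor's IO-wait boosting
behaves roughly like the sketch below. This is an illustrative model,
not the actual intel_pstate or schedutil code: each wakeup out of
iowait doubles a transient boost term fed into frequency selection, and
the boost decays once the iowait wakeups stop. A task sleeping in
schedule_timeout() never triggers it, so the intermittent CPU bursts of
these benchmarks get averaged down to a low frequency.

#define BOOST_MAX	1024	/* boost as a fraction of max freq, in 1/1024ths */

static unsigned int iowait_boost;

static unsigned int freq_fraction(unsigned int util, bool woke_from_iowait)
{
	if (woke_from_iowait) {
		/* double the boost on each iowait wakeup, up to the max */
		iowait_boost = iowait_boost ? iowait_boost * 2 : BOOST_MAX / 8;
		if (iowait_boost > BOOST_MAX)
			iowait_boost = BOOST_MAX;
	} else {
		/* decay the boost once iowait wakeups stop */
		iowait_boost /= 2;
	}

	return util > iowait_boost ? util : iowait_boost;
}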

> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> Cc: Francisco Jerez <currojerez@riseup.net>
> ---
>  drivers/gpu/drm/i915/i915_request.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index 5c2c93cbab12..7ef7ade12073 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -1330,7 +1330,7 @@ long i915_request_wait(struct i915_request *rq,
>  			goto complete;
>  		}
>  
> -		timeout = io_schedule_timeout(timeout);
> +		timeout = schedule_timeout(timeout);
>  	} while (1);
>  
>  	GEM_BUG_ON(!intel_wait_has_seqno(&wait));
> -- 
> 2.18.0
Chris Wilson July 28, 2018, 2:22 p.m. UTC | #3
Quoting Francisco Jerez (2018-07-28 06:20:12)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > A recent trend for cpufreq is to boost the CPU frequencies for
> > iowaiters, in particular to benefit high-frequency I/O. We do the same
> > and boost the GPU clocks to try and minimise time spent waiting for the
> > GPU. However, as the igfx and CPU share the same TDP, boosting the CPU
> > frequency will result in the GPU being throttled and its frequency being
> > reduced. Thus declaring iowait negatively impacts GPU throughput.
> >
> > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107410
> > References: 52ccc4314293 ("cpufreq: intel_pstate: HWP boost performance on IO wakeup")
> 
> This patch causes up to ~13% performance regressions (with significance
> 5%) on several latency-sensitive tests on my BXT:
> 
>  jxrendermark/rendering-test=Linear Gradient Blend/rendering-size=128x128:     XXX ±35.69% x53 -> XXX ±32.57% x61 d=-13.52% ±31.88% p=2.58%

Curious, as this is just a bunch of composites and as with the others,
should never be latency sensitive (at least under bare X11).
Fwiw, I double checked this result:

Broxton J3455, jxrend -num $(for i in $(seq 1 100); do echo 12 128; done)
x noio-1.txt
+ io-1.txt
+------------------------------------------------------------------------+
|                 +                                                      |
|                 +                                                      |
|                 *                                                      |
|                +*x                                                     |
|              + +***+                                                   |
|              + +***++                                                  |
|              + ****+*   +                                              |
|             ++x******  x+   x                                          |
|       xx    **x*******+x* xx*                                          |
|   + + xx*xx+***********x**x***x x+                                     |
|x x+** x**x****************x***x***+  x + x x                    ++    +|
|           |_______MA_______|                                           |
|          |________MA__________|                                        |
+------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x 100     16109.095     16211.579     16152.497      16154.87     19.270749
+ 100      16116.47     16274.973     16152.365     16156.954     25.304398
No difference proven at 95.0% confidence

Your variance is much, much higher, are you still using the original
jxrendermark that doesn't wait for rendering completion?
-Chris
Chris Wilson July 28, 2018, 4:27 p.m. UTC | #4
Quoting Francisco Jerez (2018-07-28 06:20:12)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > A recent trend for cpufreq is to boost the CPU frequencies for
> > iowaiters, in particular to benefit high-frequency I/O. We do the same
> > and boost the GPU clocks to try and minimise time spent waiting for the
> > GPU. However, as the igfx and CPU share the same TDP, boosting the CPU
> > frequency will result in the GPU being throttled and its frequency being
> > reduced. Thus declaring iowait negatively impacts GPU throughput.
> >
> > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107410
> > References: 52ccc4314293 ("cpufreq: intel_pstate: HWP boost performance on IO wakeup")
> 
> This patch causes up to ~13% performance regressions (with significance
> 5%) on several latency-sensitive tests on my BXT:
> 
>  jxrendermark/rendering-test=Linear Gradient Blend/rendering-size=128x128:     XXX ±35.69% x53 -> XXX ±32.57% x61 d=-13.52% ±31.88% p=2.58%
>  jxrendermark/rendering-test=Transformed Blit Bilinear/rendering-size=128x128: XXX ±3.51% x21 ->  XXX ±3.77% x21   d=-12.08% ±3.41% p=0.00%
>  gtkperf/gtk-test=GtkComboBox:                                                 XXX ±1.90% x19 ->  XXX ±1.59% x20    d=-4.74% ±1.71% p=0.00%
>  x11perf/test=500px Compositing From Pixmap To Window:                         XXX ±2.35% x21 ->  XXX ±1.73% x21    d=-2.69% ±2.04% p=0.01%
>  qgears2/render-backend=XRender Extension/test-mode=Text:                      XXX ±0.38% x21 ->  XXX ±0.40% x25    d=-2.20% ±0.38% p=0.00%
>  x11perf/test=500px Compositing From Pixmap To Window:                         XXX ±2.78% x53 ->  XXX ±2.27% x61    d=-1.77% ±2.50% p=0.03%
> 
> It's unsurprising to see latency-sensitive workloads relying on the
> lower latency offered by io_schedule_timeout(), since the CPUFREQ
> governor will have a substantial downward bias without it, in response to
> the intermittent CPU usage pattern of those benchmarks.

Fwiw, I have a better example, gem_sync --run store-default. This test
waits on a short batch:

with io_schedule_timeout:
Completed 987136 cycles: 152.092 us

with schedule_timeout:
Completed 157696 cycles: 956.403 us

Though note that for a no-op batch we see no difference, as the sleep is
short enough; both take on average 52us. But microbenchmarks be micro.
-Chris
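
For reference, the pattern gem_sync exercises is essentially a tight
submit/wait loop, so once the batch is short the wakeup latency of
i915_request_wait() dominates the cycle time. A sketch of that loop,
not the actual gem_sync source; gem_execbuf() and gem_sync() are IGT's
usual ioctl wrappers, and fd/handle/done come from test setup:

	struct drm_i915_gem_execbuffer2 execbuf = { 0 }; /* short batch */
	unsigned long cycles = 0;

	while (!done) {
		gem_execbuf(fd, &execbuf);	/* submit the short batch */
		gem_sync(fd, handle);		/* blocks in i915_request_wait() */
		cycles++;
	}
	/* report elapsed time / cycles as the per-wait latency */
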
Francisco Jerez July 28, 2018, 8:18 p.m. UTC | #5
Chris Wilson <chris@chris-wilson.co.uk> writes:

> Quoting Francisco Jerez (2018-07-28 06:20:12)
>> Chris Wilson <chris@chris-wilson.co.uk> writes:
>> 
>> > A recent trend for cpufreq is to boost the CPU frequencies for
>> > iowaiters, in particular to benefit high-frequency I/O. We do the same
>> > and boost the GPU clocks to try and minimise time spent waiting for the
>> > GPU. However, as the igfx and CPU share the same TDP, boosting the CPU
>> > frequency will result in the GPU being throttled and its frequency being
>> > reduced. Thus declaring iowait negatively impacts GPU throughput.
>> >
>> > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107410
>> > References: 52ccc4314293 ("cpufreq: intel_pstate: HWP boost performance on IO wakeup")
>> 
>> This patch causes up to ~13% performance regressions (with significance
>> 5%) on several latency-sensitive tests on my BXT:
>> 
>>  jxrendermark/rendering-test=Linear Gradient Blend/rendering-size=128x128:     XXX ±35.69% x53 -> XXX ±32.57% x61 d=-13.52% ±31.88% p=2.58%
>

The jxrendermark Linear Gradient Blend test-case had probably the
smallest effect size of all the regressions I noticed...  Can you take a
look at any of the other ones instead?

> Curious, as this is just a bunch of composites and as with the others,
> should never be latency sensitive (at least under bare X11).

They are largely latency-sensitive due to the poor pipelining they seem
to achieve between their GPU rendering work and the X11 thread.

> Fwiw, I double checked this result:
>
> Broxton J3455, jxrend -num $(for i in $(seq 1 100); do echo 12 128; done)
> x noio-1.txt
> + io-1.txt
> +------------------------------------------------------------------------+
> |                 +                                                      |
> |                 +                                                      |
> |                 *                                                      |
> |                +*x                                                     |
> |              + +***+                                                   |
> |              + +***++                                                  |
> |              + ****+*   +                                              |
> |             ++x******  x+   x                                          |
> |       xx    **x*******+x* xx*                                          |
> |   + + xx*xx+***********x**x***x x+                                     |
> |x x+** x**x****************x***x***+  x + x x                    ++    +|
> |           |_______MA_______|                                           |
> |          |________MA__________|                                        |
> +------------------------------------------------------------------------+
>     N           Min           Max        Median           Avg        Stddev
> x 100     16109.095     16211.579     16152.497      16154.87     19.270749
> + 100      16116.47     16274.973     16152.365     16156.954     25.304398
> No difference proven at 95.0% confidence
>
> Your variance is much, much higher, are you still using the original
> jxrendermark that doesn't wait for rendering completion?

I bet, but the other regressing benchmarks shouldn't be affected.

> -Chris
Chris Wilson July 28, 2018, 8:58 p.m. UTC | #6
Quoting Francisco Jerez (2018-07-28 21:18:50)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > Quoting Francisco Jerez (2018-07-28 06:20:12)
> >> Chris Wilson <chris@chris-wilson.co.uk> writes:
> >> 
> >> > A recent trend for cpufreq is to boost the CPU frequencies for
> >> > iowaiters, in particular to benefit high-frequency I/O. We do the same
> >> > and boost the GPU clocks to try and minimise time spent waiting for the
> >> > GPU. However, as the igfx and CPU share the same TDP, boosting the CPU
> >> > frequency will result in the GPU being throttled and its frequency being
> >> > reduced. Thus declaring iowait negatively impacts GPU throughput.
> >> >
> >> > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107410
> >> > References: 52ccc4314293 ("cpufreq: intel_pstate: HWP boost performance on IO wakeup")
> >> 
> >> This patch causes up to ~13% performance regressions (with significance
> >> 5%) on several latency-sensitive tests on my BXT:
> >> 
> >>  jxrendermark/rendering-test=Linear Gradient Blend/rendering-size=128x128:     XXX ±35.69% x53 -> XXX ±32.57% x61 d=-13.52% ±31.88% p=2.58%
> >
> 
> The jxrendermark Linear Gradient Blend test-case had probably the
> smallest effect size of all the regressions I noticed...  Can you take a
> look at any of the other ones instead?

It was the biggest in the list, was it not? I didn't observe anything of
note in a quick look at x11perf, but didn't let it run for a good sample
size. They didn't seem to be as relevant as jxrendermark so I went and
dug that out.
 
> > Curious, as this is just a bunch of composites and as with the others,
> > should never be latency sensitive (at least under bare X11).
> 
> They are largely latency-sensitive due to the poor pipelining they seem
> to achieve between their GPU rendering work and the X11 thread.

Only the X11 thread is touching the GPU, and in the cases I looked at
it, we were either waiting for the ring to drain or on throttling.
Synchronisation with the GPU was only for draining the queue on timing,
and the cpu was able to stay ahead during the benchmark.

Off the top of my head, for X to be latency sensitive you need to mix
client and Xserver rendering, along the lines of Paint; GetImage, in the
extreme becoming gem_sync. Adding a compositor is also interesting for
the context switching will prevent us merging requests (but that all
depends on the frequency of compositor updates ofc), and we would
need more CPU and require reasonably low latency (less than the next
request) to keep the GPU busy. However, that is driven directly off
interrupts, iowait isn't a factor -- but your hook could still be useful
to provide pm_qos.
-Chris
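
A pm_qos hook of the kind suggested could look roughly like the
following sketch, using the 4.18-era PM QoS API (the 50us bound is
purely illustrative): instead of declaring iowait, the wait path would
bound cpuidle's exit latency while a waiter is present, keeping wakeups
fast without feeding the governor's iowait boost.

#include <linux/pm_qos.h>

static struct pm_qos_request wait_qos;

	/* before sleeping on the request */
	pm_qos_add_request(&wait_qos, PM_QOS_CPU_DMA_LATENCY, 50);

	timeout = schedule_timeout(timeout);

	/* once the wait completes */
	pm_qos_remove_request(&wait_qos);
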
Francisco Jerez July 29, 2018, 7:29 p.m. UTC | #7
Chris Wilson <chris@chris-wilson.co.uk> writes:

> Quoting Francisco Jerez (2018-07-28 21:18:50)
>> Chris Wilson <chris@chris-wilson.co.uk> writes:
>> 
>> > Quoting Francisco Jerez (2018-07-28 06:20:12)
>> >> Chris Wilson <chris@chris-wilson.co.uk> writes:
>> >> 
>> >> > A recent trend for cpufreq is to boost the CPU frequencies for
>> >> > iowaiters, in particular to benefit high-frequency I/O. We do the same
>> >> > and boost the GPU clocks to try and minimise time spent waiting for the
>> >> > GPU. However, as the igfx and CPU share the same TDP, boosting the CPU
>> >> > frequency will result in the GPU being throttled and its frequency being
>> >> > reduced. Thus declaring iowait negatively impacts GPU throughput.
>> >> >
>> >> > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107410
>> >> > References: 52ccc4314293 ("cpufreq: intel_pstate: HWP boost performance on IO wakeup")
>> >> 
>> >> This patch causes up to ~13% performance regressions (with significance
>> >> 5%) on several latency-sensitive tests on my BXT:
>> >> 
>> >>  jxrendermark/rendering-test=Linear Gradient Blend/rendering-size=128x128:     XXX ±35.69% x53 -> XXX ±32.57% x61 d=-13.52% ±31.88% p=2.58%
>> >
>> 
>> The jxrendermark Linear Gradient Blend test-case had probably the
>> smallest effect size of all the regressions I noticed...  Can you take a
>> look at any of the other ones instead?
>
> It was the biggest in the list, was it not? I didn't observe anything of
> note in a quick look at x11perf, but didn't let it run for a good sample
> size. They didn't seem to be as relevant as jxrendermark so I went and
> dug that out.
>

That was the biggest regression in absolute value, but the smallest in
effect size (roughly 0.4 standard deviations).

>> > Curious, as this is just a bunch of composites and as with the others,
>> > should never be latency sensitive (at least under bare X11).
>> 
>> They are largely latency-sensitive due to the poor pipelining they seem
>> to achieve between their GPU rendering work and the X11 thread.
>
> Only the X11 thread is touching the GPU, and in the cases I looked at
> it, we were either waiting for the ring to drain or on throttling.
> Synchronisation with the GPU was only for draining the queue on timing,
> and the cpu was able to stay ahead during the benchmark.
>

Apparently the CPU doesn't get ahead enough for the GPU to be
consistently loaded, which prevents us from hiding the latency of the
CPU computation even in those cases.

> Off the top of my head, for X to be latency sensitive you need to mix
> client and Xserver rendering, along the lines of Paint; GetImage, in the
> extreme becoming gem_sync. Adding a compositor is also interesting for
> the context switching will prevent us merging requests (but that all
> depends on the frequency of compositor updates ofc), and we would
> need more CPU and require reasonably low latency (less than the next
> request) to keep the GPU busy. However, that is driven directly off
> interrupts, iowait isn't a factor -- but your hook could still be useful
> to provide pm_qos.
> -Chris
Chris Wilson July 30, 2018, 12:56 p.m. UTC | #8
Quoting Francisco Jerez (2018-07-29 20:29:42)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > Quoting Francisco Jerez (2018-07-28 21:18:50)
> >> Chris Wilson <chris@chris-wilson.co.uk> writes:
> >> 
> >> > Quoting Francisco Jerez (2018-07-28 06:20:12)
> >> >> Chris Wilson <chris@chris-wilson.co.uk> writes:
> >> >> 
> >> >> > A recent trend for cpufreq is to boost the CPU frequencies for
> >> >> > iowaiters, in particular to benefit high-frequency I/O. We do the same
> >> >> > and boost the GPU clocks to try and minimise time spent waiting for the
> >> >> > GPU. However, as the igfx and CPU share the same TDP, boosting the CPU
> >> >> > frequency will result in the GPU being throttled and its frequency being
> >> >> > reduced. Thus declaring iowait negatively impacts GPU throughput.
> >> >> >
> >> >> > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107410
> >> >> > References: 52ccc4314293 ("cpufreq: intel_pstate: HWP boost performance on IO wakeup")
> >> >> 
> >> >> This patch causes up to ~13% performance regressions (with significance
> >> >> 5%) on several latency-sensitive tests on my BXT:
> >> >> 
> >> >>  jxrendermark/rendering-test=Linear Gradient Blend/rendering-size=128x128:     XXX ±35.69% x53 -> XXX ±32.57% x61 d=-13.52% ±31.88% p=2.58%
> >> >
> >> 
> >> The jxrendermark Linear Gradient Blend test-case had probably the
> >> smallest effect size of all the regressions I noticed...  Can you take a
> >> look at any of the other ones instead?
> >
> > It was the biggest in the list, was it not? I didn't observe anything of
> > note in a quick look at x11perf, but didn't let it run for a good sample
> > size. They didn't seem to be as relevant as jxrendermark so I went and
> > dug that out.
> >
> 
> That was the biggest regression in absolute value, but the smallest in
> effect size (roughly 0.4 standard deviations).

d=-13.52% wasn't the delta between the two runs?

Sorry, but it appears to be redacted beyond my comprehension.

> >> > Curious, as this is just a bunch of composites and as with the others,
> >> > should never be latency sensitive (at least under bare X11).
> >> 
> >> They are largely latency-sensitive due to the poor pipelining they seem
> >> to achieve between their GPU rendering work and the X11 thread.
> >
> > Only the X11 thread is touching the GPU, and in the cases I looked at
> > it, we were either waiting for the ring to drain or on throttling.
> > Synchronisation with the GPU was only for draining the queue on timing,
> > and the cpu was able to stay ahead during the benchmark.
> >
> 
> Apparently the CPU doesn't get ahead enough for the GPU to be
> consistently loaded, which prevents us from hiding the latency of the
> CPU computation even in those cases.

The curse of reproducibility. On my bxt, I don't see the issue, so we
have a significant difference in setup.
-Chris
Francisco Jerez July 30, 2018, 6:55 p.m. UTC | #9
Chris Wilson <chris@chris-wilson.co.uk> writes:

> Quoting Francisco Jerez (2018-07-29 20:29:42)
>> Chris Wilson <chris@chris-wilson.co.uk> writes:
>> 
>> > Quoting Francisco Jerez (2018-07-28 21:18:50)
>> >> Chris Wilson <chris@chris-wilson.co.uk> writes:
>> >> 
>> >> > Quoting Francisco Jerez (2018-07-28 06:20:12)
>> >> >> Chris Wilson <chris@chris-wilson.co.uk> writes:
>> >> >> 
>> >> >> > A recent trend for cpufreq is to boost the CPU frequencies for
>> >> >> > iowaiters, in particular to benefit high-frequency I/O. We do the same
>> >> >> > and boost the GPU clocks to try and minimise time spent waiting for the
>> >> >> > GPU. However, as the igfx and CPU share the same TDP, boosting the CPU
>> >> >> > frequency will result in the GPU being throttled and its frequency being
>> >> >> > reduced. Thus declaring iowait negatively impacts GPU throughput.
>> >> >> >
>> >> >> > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107410
>> >> >> > References: 52ccc4314293 ("cpufreq: intel_pstate: HWP boost performance on IO wakeup")
>> >> >> 
>> >> >> This patch causes up to ~13% performance regressions (with significance
>> >> >> 5%) on several latency-sensitive tests on my BXT:
>> >> >> 
>> >> >>  jxrendermark/rendering-test=Linear Gradient Blend/rendering-size=128x128:     XXX ±35.69% x53 -> XXX ±32.57% x61 d=-13.52% ±31.88% p=2.58%
>> >> >
>> >> 
>> >> The jxrendermark Linear Gradient Blend test-case had probably the
>> >> smallest effect size of all the regressions I noticed...  Can you take a
>> >> look at any of the other ones instead?
>> >
>> > It was the biggest in the list, was it not? I didn't observe anything of
>> > note in a quick look at x11perf, but didn't let it run for a good sample
>> > size. They didn't seem to be as relevant as jxrendermark so I went and
>> > dug that out.
>> >
>> 
>> That was the biggest regression in absolute value, but the smallest in
>> effect size (roughly 0.4 standard deviations).
>
> d=-13.52% wasn't the delta between the two runs?
>

It is, less than half of 31.88%, which is the pooled standard deviation.

> Sorry, but it appears to be redacted beyond my comprehension.
>
>> >> > Curious, as this is just a bunch of composites and as with the others,
>> >> > should never be latency sensitive (at least under bare X11).
>> >> 
>> >> They are largely latency-sensitive due to the poor pipelining they seem
>> >> to achieve between their GPU rendering work and the X11 thread.
>> >
>> > Only the X11 thread is touching the GPU, and in the cases I looked at
>> > it, we were either waiting for the ring to drain or on throttling.
>> > Synchronisation with the GPU was only for draining the queue on timing,
>> > and the cpu was able to stay ahead during the benchmark.
>> >
>> 
>> Apparently the CPU doesn't get ahead enough for the GPU to be
>> consistently loaded, which prevents us from hiding the latency of the
>> CPU computation even in those cases.
>
> The curse of reproducibility. On my bxt, I don't see the issue, so we
> have a significant difference in setup.
> -Chris

Patch

diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 5c2c93cbab12..7ef7ade12073 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1330,7 +1330,7 @@ long i915_request_wait(struct i915_request *rq,
 			goto complete;
 		}
 
-		timeout = io_schedule_timeout(timeout);
+		timeout = schedule_timeout(timeout);
 	} while (1);
 
 	GEM_BUG_ON(!intel_wait_has_seqno(&wait));