[3/5] drm/i915: Increase busyspin limit before a context-switch

Message ID 20180728164623.10613-3-chris@chris-wilson.co.uk (mailing list archive)
State New, archived
Series [1/5] drm/i915: Expose the busyspin durations for i915_wait_request

Commit Message

Chris Wilson July 28, 2018, 4:46 p.m. UTC
Looking at the distribution of i915_wait_request for a set of GL
benchmarks, we see:

broadwell# python bcc/tools/funclatency.py -u i915_wait_request
   usecs               : count     distribution
       0 -> 1          : 29184    |****************************************|
       2 -> 3          : 5767     |*******                                 |
       4 -> 7          : 3000     |****                                    |
       8 -> 15         : 491      |                                        |
      16 -> 31         : 140      |                                        |
      32 -> 63         : 203      |                                        |
      64 -> 127        : 543      |                                        |
     128 -> 255        : 881      |*                                       |
     256 -> 511        : 1209     |*                                       |
     512 -> 1023       : 1739     |**                                      |
    1024 -> 2047       : 22855    |*******************************         |
    2048 -> 4095       : 1725     |**                                      |
    4096 -> 8191       : 5813     |*******                                 |
    8192 -> 16383      : 5348     |*******                                 |
   16384 -> 32767      : 1000     |*                                       |
   32768 -> 65535      : 4400     |******                                  |
   65536 -> 131071     : 296      |                                        |
  131072 -> 262143     : 225      |                                        |
  262144 -> 524287     : 4        |                                        |
  524288 -> 1048575    : 1        |                                        |
 1048576 -> 2097151    : 1        |                                        |
 2097152 -> 4194303    : 1        |                                        |

broxton# python bcc/tools/funclatency.py -u i915_wait_request
   usecs               : count     distribution
       0 -> 1          : 5523     |*************************************   |
       2 -> 3          : 1340     |*********                               |
       4 -> 7          : 2100     |**************                          |
       8 -> 15         : 755      |*****                                   |
      16 -> 31         : 211      |*                                       |
      32 -> 63         : 53       |                                        |
      64 -> 127        : 71       |                                        |
     128 -> 255        : 113      |                                        |
     256 -> 511        : 262      |*                                       |
     512 -> 1023       : 358      |**                                      |
    1024 -> 2047       : 1105     |*******                                 |
    2048 -> 4095       : 848      |*****                                   |
    4096 -> 8191       : 1295     |********                                |
    8192 -> 16383      : 5894     |****************************************|
   16384 -> 32767      : 4270     |****************************            |
   32768 -> 65535      : 5622     |**************************************  |
   65536 -> 131071     : 306      |**                                      |
  131072 -> 262143     : 50       |                                        |
  262144 -> 524287     : 76       |                                        |
  524288 -> 1048575    : 34       |                                        |
 1048576 -> 2097151    : 0        |                                        |
 2097152 -> 4194303    : 1        |                                        |

Picking 20us for the context-switch busyspin has the dual advantage of
catching most frequent short waits while avoiding the cost of a context
switch. 20us is a typical latency of 2 context-switches, i.e. the cost
of taking the sleep, without the secondary effects of cache flushing.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Sagar Kamble <sagar.a.kamble@intel.com>
Cc: Eero Tamminen <eero.t.tamminen@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Ben Widawsky <ben@bwidawsk.net>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
---
 drivers/gpu/drm/i915/Kconfig.profile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
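
To put rough numbers on the histograms above, the sketch below sums the
buckets that lie entirely at or below a candidate spin limit. The bucket
bounds and counts are copied from the funclatency output; the helper names
(total_waits, waits_within) are made up for illustration. Note that the
histogram records the whole duration of i915_wait_request, not the time
remaining after a wake-up, so this is only a crude proxy for what the
context-switch busyspin would catch.

```python
# Bucket upper bounds (usecs) and counts, copied from the funclatency output.
BROADWELL = [
    (1, 29184), (3, 5767), (7, 3000), (15, 491), (31, 140), (63, 203),
    (127, 543), (255, 881), (511, 1209), (1023, 1739), (2047, 22855),
    (4095, 1725), (8191, 5813), (16383, 5348), (32767, 1000), (65535, 4400),
    (131071, 296), (262143, 225), (524287, 4), (1048575, 1), (2097151, 1),
    (4194303, 1),
]
BROXTON = [
    (1, 5523), (3, 1340), (7, 2100), (15, 755), (31, 211), (63, 53),
    (127, 71), (255, 113), (511, 262), (1023, 358), (2047, 1105),
    (4095, 848), (8191, 1295), (16383, 5894), (32767, 4270), (65535, 5622),
    (131071, 306), (262143, 50), (524287, 76), (1048575, 34), (2097151, 0),
    (4194303, 1),
]


def total_waits(hist):
    """Total number of recorded waits (hypothetical helper)."""
    return sum(count for _, count in hist)


def waits_within(hist, limit_us):
    """Count waits in buckets that lie entirely at or below limit_us.

    Buckets are power-of-two wide, so a limit that falls inside a bucket
    (e.g. 20us inside 16 -> 31) is only bounded, not resolved exactly.
    """
    return sum(count for upper, count in hist if upper <= limit_us)


if __name__ == "__main__":
    for name, hist in (("broadwell", BROADWELL), ("broxton", BROXTON)):
        for limit_us in (7, 15, 31):
            share = waits_within(hist, limit_us) / total_waits(hist)
            print(f"{name}: <= {limit_us}us covers {share:.1%} of waits")
```

Because 20us sits inside the 16 -> 31 bucket, the 15us and 31us rows
bracket what a 20us limit could cover under this reading.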

Comments

Tvrtko Ursulin Aug. 2, 2018, 2:40 p.m. UTC | #1
On 28/07/2018 17:46, Chris Wilson wrote:
> Looking at the distribution of i915_wait_request for a set of GL

What was the set?

> benchmarks, we see:
> 
> broadwell# python bcc/tools/funclatency.py -u i915_wait_request
>     usecs               : count     distribution
>         0 -> 1          : 29184    |****************************************|
>         2 -> 3          : 5767     |*******                                 |
>         4 -> 7          : 3000     |****                                    |
>         8 -> 15         : 491      |                                        |
>        16 -> 31         : 140      |                                        |
>        32 -> 63         : 203      |                                        |
>        64 -> 127        : 543      |                                        |
>       128 -> 255        : 881      |*                                       |
>       256 -> 511        : 1209     |*                                       |
>       512 -> 1023       : 1739     |**                                      |
>      1024 -> 2047       : 22855    |*******************************         |
>      2048 -> 4095       : 1725     |**                                      |
>      4096 -> 8191       : 5813     |*******                                 |
>      8192 -> 16383      : 5348     |*******                                 |
>     16384 -> 32767      : 1000     |*                                       |
>     32768 -> 65535      : 4400     |******                                  |
>     65536 -> 131071     : 296      |                                        |
>    131072 -> 262143     : 225      |                                        |
>    262144 -> 524287     : 4        |                                        |
>    524288 -> 1048575    : 1        |                                        |
>   1048576 -> 2097151    : 1        |                                        |
>   2097152 -> 4194303    : 1        |                                        |
> 
> broxton# python bcc/tools/funclatency.py -u i915_wait_request
>     usecs               : count     distribution
>         0 -> 1          : 5523     |*************************************   |
>         2 -> 3          : 1340     |*********                               |
>         4 -> 7          : 2100     |**************                          |
>         8 -> 15         : 755      |*****                                   |
>        16 -> 31         : 211      |*                                       |
>        32 -> 63         : 53       |                                        |
>        64 -> 127        : 71       |                                        |
>       128 -> 255        : 113      |                                        |
>       256 -> 511        : 262      |*                                       |
>       512 -> 1023       : 358      |**                                      |
>      1024 -> 2047       : 1105     |*******                                 |
>      2048 -> 4095       : 848      |*****                                   |
>      4096 -> 8191       : 1295     |********                                |
>      8192 -> 16383      : 5894     |****************************************|
>     16384 -> 32767      : 4270     |****************************            |
>     32768 -> 65535      : 5622     |**************************************  |
>     65536 -> 131071     : 306      |**                                      |
>    131072 -> 262143     : 50       |                                        |
>    262144 -> 524287     : 76       |                                        |
>    524288 -> 1048575    : 34       |                                        |
>   1048576 -> 2097151    : 0        |                                        |
>   2097152 -> 4194303    : 1        |                                        |
> 
> Picking 20us for the context-switch busyspin has the dual advantage of
> catching most frequent short waits while avoiding the cost of a context
> switch. 20us is a typical latency of 2 context-switches, i.e. the cost
> of taking the sleep, without the secondary effects of cache flushing.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Sagar Kamble <sagar.a.kamble@intel.com>
> Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Ben Widawsky <ben@bwidawsk.net>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> ---
>   drivers/gpu/drm/i915/Kconfig.profile | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> index 63cb744d920d..de394dea4a14 100644
> --- a/drivers/gpu/drm/i915/Kconfig.profile
> +++ b/drivers/gpu/drm/i915/Kconfig.profile
> @@ -14,7 +14,7 @@ config DRM_I915_SPIN_REQUEST_IRQ
>   
>   config DRM_I915_SPIN_REQUEST_CS
>   	int
> -	default 2 # microseconds
> +	default 20 # microseconds
>   	help
>   	  After sleeping for a request (GPU operation) to complete, we will
>   	  be woken up on the completion of every request prior to the one
> 

I'd be more tempted to pick 10us given the histograms. It would avoid 
wasting cycles on Broadwell and keep the majority of the benefit on Broxton.

However, it also raises the question of whether we want to have this
initialized per-platform at runtime. That would open up the way of
auto-tuning it, if the goal is to eliminate the low part of the histogram.
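
As a rough illustration of what per-platform tuning could look like, the
sketch below picks the smallest bucket bound that covers a target share of
the short waits (those up to 31us) on each platform. Everything here is an
assumption for discussion: pick_spin_limit and the 95% target are
hypothetical, and the driver has no such interface today. Only the counts
come from the histograms in the commit message.

```python
# Low part of each histogram: (bucket upper bound in usecs, count).
LOW_BUCKETS = {
    "broadwell": [(1, 29184), (3, 5767), (7, 3000), (15, 491), (31, 140)],
    "broxton": [(1, 5523), (3, 1340), (7, 2100), (15, 755), (31, 211)],
}


def pick_spin_limit(low_buckets, target_pct=95):
    """Return the smallest bucket bound covering target_pct of short waits.

    Hypothetical tuning rule: walk the buckets in increasing order and stop
    once the running count reaches the requested percentage of the total.
    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    total = sum(count for _, count in low_buckets)
    running = 0
    for upper_us, count in low_buckets:
        running += count
        if running * 100 >= target_pct * total:
            return upper_us
    return low_buckets[-1][0]


if __name__ == "__main__":
    for platform, buckets in LOW_BUCKETS.items():
        print(platform, pick_spin_limit(buckets), "us")
```

With these counts the rule lands on a different bucket for each platform,
which is the sort of divergence a single Kconfig default cannot express.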

Also, please add to the commit what kind of perf/watt or something 
effect on the benchmarks we get with it.

Regards,

Tvrtko
Tvrtko Ursulin Aug. 2, 2018, 2:46 p.m. UTC | #2
On 02/08/2018 15:40, Tvrtko Ursulin wrote:
> 
> On 28/07/2018 17:46, Chris Wilson wrote:
>> Looking at the distribution of i915_wait_request for a set of GL
> 
> What was the set?
> 
>> benchmarks, we see:
>>
>> broadwell# python bcc/tools/funclatency.py -u i915_wait_request
>>     usecs               : count     distribution
>>         0 -> 1          : 29184    |****************************************|
>>         2 -> 3          : 5767     |*******                                 |
>>         4 -> 7          : 3000     |****                                    |
>>         8 -> 15         : 491      |                                        |
>>        16 -> 31         : 140      |                                        |
>>        32 -> 63         : 203      |                                        |
>>        64 -> 127        : 543      |                                        |
>>       128 -> 255        : 881      |*                                       |
>>       256 -> 511        : 1209     |*                                       |
>>       512 -> 1023       : 1739     |**                                      |
>>      1024 -> 2047       : 22855    |*******************************         |
>>      2048 -> 4095       : 1725     |**                                      |
>>      4096 -> 8191       : 5813     |*******                                 |
>>      8192 -> 16383      : 5348     |*******                                 |
>>     16384 -> 32767      : 1000     |*                                       |
>>     32768 -> 65535      : 4400     |******                                  |
>>     65536 -> 131071     : 296      |                                        |
>>    131072 -> 262143     : 225      |                                        |
>>    262144 -> 524287     : 4        |                                        |
>>    524288 -> 1048575    : 1        |                                        |
>>   1048576 -> 2097151    : 1        |                                        |
>>   2097152 -> 4194303    : 1        |                                        |
>>
>> broxton# python bcc/tools/funclatency.py -u i915_wait_request
>>     usecs               : count     distribution
>>         0 -> 1          : 5523     |*************************************   |
>>         2 -> 3          : 1340     |*********                               |
>>         4 -> 7          : 2100     |**************                          |
>>         8 -> 15         : 755      |*****                                   |
>>        16 -> 31         : 211      |*                                       |
>>        32 -> 63         : 53       |                                        |
>>        64 -> 127        : 71       |                                        |
>>       128 -> 255        : 113      |                                        |
>>       256 -> 511        : 262      |*                                       |
>>       512 -> 1023       : 358      |**                                      |
>>      1024 -> 2047       : 1105     |*******                                 |
>>      2048 -> 4095       : 848      |*****                                   |
>>      4096 -> 8191       : 1295     |********                                |
>>      8192 -> 16383      : 5894     |****************************************|
>>     16384 -> 32767      : 4270     |****************************            |
>>     32768 -> 65535      : 5622     |**************************************  |
>>     65536 -> 131071     : 306      |**                                      |
>>    131072 -> 262143     : 50       |                                        |
>>    262144 -> 524287     : 76       |                                        |
>>    524288 -> 1048575    : 34       |                                        |
>>   1048576 -> 2097151    : 0        |                                        |
>>   2097152 -> 4194303    : 1        |                                        |
>>
>> Picking 20us for the context-switch busyspin has the dual advantage of
>> catching most frequent short waits while avoiding the cost of a context
>> switch. 20us is a typical latency of 2 context-switches, i.e. the cost
>> of taking the sleep, without the secondary effects of cache flushing.
>>
>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Sagar Kamble <sagar.a.kamble@intel.com>
>> Cc: Eero Tamminen <eero.t.tamminen@intel.com>
>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> Cc: Ben Widawsky <ben@bwidawsk.net>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Michał Winiarski <michal.winiarski@intel.com>
>> ---
>>   drivers/gpu/drm/i915/Kconfig.profile | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
>> index 63cb744d920d..de394dea4a14 100644
>> --- a/drivers/gpu/drm/i915/Kconfig.profile
>> +++ b/drivers/gpu/drm/i915/Kconfig.profile
>> @@ -14,7 +14,7 @@ config DRM_I915_SPIN_REQUEST_IRQ
>>   config DRM_I915_SPIN_REQUEST_CS
>>       int
>> -    default 2 # microseconds
>> +    default 20 # microseconds
>>       help
>>         After sleeping for a request (GPU operation) to complete, we will
>>         be woken up on the completion of every request prior to the one
>>
> 
> I'd be more tempted to pick 10us given the histograms. It would avoid 
> wasting cycles on Broadwell and keep the majority of the benefit on 
> Broxton.

Actually the first spin is 5us so are you sure bumping of the second 
spin should be the first step? In other words, wouldn't bumping the 
first one to 10us eliminate most of the low bars from the histogram?

Regards,

Tvrtko

> However.. it also raises the question if we perhaps want to have this 
> initialized per-platform at runtime.. ? That would open up the way of 
> auto-tuning it, if the goal is to eliminate the low part of the histogram.
> 
> Also, please add to the commit what kind of perf/watt or something 
> effect on the benchmarks we get with it.
> 
> Regards,
> 
> Tvrtko
Chris Wilson Aug. 2, 2018, 2:47 p.m. UTC | #3
Quoting Tvrtko Ursulin (2018-08-02 15:40:27)
> 
> On 28/07/2018 17:46, Chris Wilson wrote:
> > Looking at the distribution of i915_wait_request for a set of GL
> 
> What was the set?
> 
> > benchmarks, we see:
> > 
> > broadwell# python bcc/tools/funclatency.py -u i915_wait_request
> >     usecs               : count     distribution
> >         0 -> 1          : 29184    |****************************************|
> >         2 -> 3          : 5767     |*******                                 |
> >         4 -> 7          : 3000     |****                                    |
> >         8 -> 15         : 491      |                                        |
> >        16 -> 31         : 140      |                                        |
> >        32 -> 63         : 203      |                                        |
> >        64 -> 127        : 543      |                                        |
> >       128 -> 255        : 881      |*                                       |
> >       256 -> 511        : 1209     |*                                       |
> >       512 -> 1023       : 1739     |**                                      |
> >      1024 -> 2047       : 22855    |*******************************         |
> >      2048 -> 4095       : 1725     |**                                      |
> >      4096 -> 8191       : 5813     |*******                                 |
> >      8192 -> 16383      : 5348     |*******                                 |
> >     16384 -> 32767      : 1000     |*                                       |
> >     32768 -> 65535      : 4400     |******                                  |
> >     65536 -> 131071     : 296      |                                        |
> >    131072 -> 262143     : 225      |                                        |
> >    262144 -> 524287     : 4        |                                        |
> >    524288 -> 1048575    : 1        |                                        |
> >   1048576 -> 2097151    : 1        |                                        |
> >   2097152 -> 4194303    : 1        |                                        |
> > 
> > broxton# python bcc/tools/funclatency.py -u i915_wait_request
> >     usecs               : count     distribution
> >         0 -> 1          : 5523     |*************************************   |
> >         2 -> 3          : 1340     |*********                               |
> >         4 -> 7          : 2100     |**************                          |
> >         8 -> 15         : 755      |*****                                   |
> >        16 -> 31         : 211      |*                                       |
> >        32 -> 63         : 53       |                                        |
> >        64 -> 127        : 71       |                                        |
> >       128 -> 255        : 113      |                                        |
> >       256 -> 511        : 262      |*                                       |
> >       512 -> 1023       : 358      |**                                      |
> >      1024 -> 2047       : 1105     |*******                                 |
> >      2048 -> 4095       : 848      |*****                                   |
> >      4096 -> 8191       : 1295     |********                                |
> >      8192 -> 16383      : 5894     |****************************************|
> >     16384 -> 32767      : 4270     |****************************            |
> >     32768 -> 65535      : 5622     |**************************************  |
> >     65536 -> 131071     : 306      |**                                      |
> >    131072 -> 262143     : 50       |                                        |
> >    262144 -> 524287     : 76       |                                        |
> >    524288 -> 1048575    : 34       |                                        |
> >   1048576 -> 2097151    : 0        |                                        |
> >   2097152 -> 4194303    : 1        |                                        |
> > 
> > Picking 20us for the context-switch busyspin has the dual advantage of
> > catching most frequent short waits while avoiding the cost of a context
> > switch. 20us is a typical latency of 2 context-switches, i.e. the cost
> > of taking the sleep, without the secondary effects of cache flushing.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Sagar Kamble <sagar.a.kamble@intel.com>
> > Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: Ben Widawsky <ben@bwidawsk.net>
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > Cc: Michał Winiarski <michal.winiarski@intel.com>
> > ---
> >   drivers/gpu/drm/i915/Kconfig.profile | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> > index 63cb744d920d..de394dea4a14 100644
> > --- a/drivers/gpu/drm/i915/Kconfig.profile
> > +++ b/drivers/gpu/drm/i915/Kconfig.profile
> > @@ -14,7 +14,7 @@ config DRM_I915_SPIN_REQUEST_IRQ
> >   
> >   config DRM_I915_SPIN_REQUEST_CS
> >       int
> > -     default 2 # microseconds
> > +     default 20 # microseconds
> >       help
> >         After sleeping for a request (GPU operation) to complete, we will
> >         be woken up on the completion of every request prior to the one
> > 
> 
> I'd be more tempted to pick 10us given the histograms. It would avoid 
> wasting cycles on Broadwell and keep the majority of the benefit on Broxton.
> 
> However.. it also raises the question if we perhaps want to have this 
> initialized per-platform at runtime.. ? That would open up the way of 
> auto-tuning it, if the goal is to eliminate the low part of the histogram.
> 
> Also, please add to the commit what kind of perf/watt or something 
> effect on the benchmarks we get with it.

There's a reason why I keep sending it to people supposed to be
interested in such things ;)

But honestly I don't value this patch much in the grand scheme of
things, since this caters to being woken up at the end of the previous
request with the expectation that the request of interest is super
short. That does not seem likely. I'd even float the opposite patch to
set it to 0 by default; iirc you suggested I was crazy for putting
a spin here in the first place.
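
To make that concrete, here is a toy model of the decision taken after a
wake-up. It is not the driver code: wake_outcome is a made-up helper, and
the 20us sleep cost is simply the "2 context-switches" figure from the
commit message. The waiter is woken when the previous request completes,
and the request of interest completes gap_us later.

```python
def wake_outcome(gap_us, spin_cs_us, sleep_cost_us=20):
    """Toy model of the post-wake-up busyspin (all values in usecs).

    gap_us        time between the wake-up and our request completing
    spin_cs_us    busyspin limit after the wake-up (the Kconfig default)
    sleep_cost_us assumed cost of going back to sleep and waking again

    Returns (latency_us, spun_us): how long after the wake-up the waiter
    sees its request complete, and how much CPU time it burned spinning.
    """
    if gap_us <= spin_cs_us:
        # The spin catches the completion; no further sleep is needed.
        return gap_us, gap_us
    # The spin expires first: the CPU time is wasted and we sleep anyway.
    return gap_us + sleep_cost_us, spin_cs_us


if __name__ == "__main__":
    for gap_us in (5, 15, 1500):
        for limit_us in (0, 2, 20):
            print(gap_us, limit_us, wake_outcome(gap_us, limit_us))
```

In this model the spin only pays off when the gap is shorter than the
limit; for anything longer, a larger limit just burns more cycles before
the sleep, which is the argument for a default of 0.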

The only place where it might help is when some other process held onto
the first_waiter slot for too long.

The initial spin is much more interesting wrt stall latencies.
-Chris
Patch

diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
index 63cb744d920d..de394dea4a14 100644
--- a/drivers/gpu/drm/i915/Kconfig.profile
+++ b/drivers/gpu/drm/i915/Kconfig.profile
@@ -14,7 +14,7 @@  config DRM_I915_SPIN_REQUEST_IRQ
 
 config DRM_I915_SPIN_REQUEST_CS
 	int
-	default 2 # microseconds
+	default 20 # microseconds
 	help
 	  After sleeping for a request (GPU operation) to complete, we will
 	  be woken up on the completion of every request prior to the one