Message ID | 20180728164623.10613-1-chris@chris-wilson.co.uk (mailing list archive) |
---|---|
State | New, archived |
Series | [1/5] drm/i915: Expose the busyspin durations for i915_wait_request |
On 28/07/2018 17:46, Chris Wilson wrote:
> An interesting discussion regarding "hybrid interrupt polling" for NVMe
> came to the conclusion that the ideal busyspin before sleeping was half
> of the expected request latency (and better if it was already halfway
> through that request). This suggested that we too should look again at
> our tradeoff between spinning and waiting. Currently, our spin simply
> tries to hide the cost of enabling the interrupt, which is good to avoid
> penalising nop requests (i.e. test throughput) and not much else.
> Studying real world workloads suggests that a spin of upto 500us can
> dramatically boost performance, but the suggestion is that this is not
> from avoiding interrupt latency per-se, but from secondary effects of
> sleeping such as allowing the CPU reduce cstate and context switch away.
>
> In a truly hybrid interrupt polling scheme, we would aim to sleep until
> just before the request completed and then wake up in advance of the
> interrupt and do a quick poll to handle completion. This is tricky for
> ourselves at the moment as we are not recording request times, and since
> we allow preemption, our requests are not on as a nicely ordered
> timeline as IO. However, the idea is interesting, for it will certainly
> help us decide when busyspinning is worthwhile.
>
> v2: Expose the spin setting via Kconfig options for easier adjustment
> and testing.
> v3: Don't get caught sneaking in a change to the busyspin parameters.
> v4: Explain more about the "hybrid interrupt polling" scheme that we
> want to migrate towards.
>
> Suggested-by: Sagar Kamble <sagar.a.kamble@intel.com>
> References: http://events.linuxfoundation.org/sites/events/files/slides/lemoal-nvme-polling-vault-2017-final_0.pdf
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Sagar Kamble <sagar.a.kamble@intel.com>
> Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Ben Widawsky <ben@bwidawsk.net>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> Reviewed-by: Sagar Kamble <sagar.a.kamble@intel.com>
> ---
>  drivers/gpu/drm/i915/Kconfig         |  6 +++++
>  drivers/gpu/drm/i915/Kconfig.profile | 26 +++++++++++++++++++
>  drivers/gpu/drm/i915/i915_request.c  | 39 +++++++++++++++++++++++++---
>  3 files changed, 67 insertions(+), 4 deletions(-)
>  create mode 100644 drivers/gpu/drm/i915/Kconfig.profile
>
> diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
> index 5c607f2c707b..387613f29cb0 100644
> --- a/drivers/gpu/drm/i915/Kconfig
> +++ b/drivers/gpu/drm/i915/Kconfig
> @@ -132,3 +132,9 @@ depends on DRM_I915
>  depends on EXPERT
>  source drivers/gpu/drm/i915/Kconfig.debug
>  endmenu
> +
> +menu "drm/i915 Profile Guided Optimisation"

Or something like a more generic drm/i915 Manual Tuning Options so it
sounds less like an automated thing where you can feed the results of
something straight in?

> +	visible if EXPERT
> +	depends on DRM_I915
> +	source drivers/gpu/drm/i915/Kconfig.profile
> +endmenu
> diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> new file mode 100644
> index 000000000000..8a230eeb98df
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/Kconfig.profile
> @@ -0,0 +1,26 @@
> +config DRM_I915_SPIN_REQUEST_IRQ
> +	int
> +	default 5 # microseconds
> +	help
> +	  Before sleeping waiting for a request (GPU operation) to complete,
> +	  we may spend some time polling for its completion. As the IRQ may
> +	  take a non-negligible time to setup, we do a short spin first to
> +	  check if the request will complete in the time it would have taken
> +	  us to enable the interrupt.
> +
> +	  May be 0 to disable the initial spin. In practice, we estimate
> +	  the cost of enabling the interrupt (if currently disabled) to be
> +	  a few microseconds.
> +
> +config DRM_I915_SPIN_REQUEST_CS
> +	int
> +	default 2 # microseconds
> +	help
> +	  After sleeping for a request (GPU operation) to complete, we will
> +	  be woken up on the completion of every request prior to the one
> +	  being waited on. For very short requests, going back to sleep and
> +	  be woken up again may add considerably to the wakeup latency. To
> +	  avoid incurring extra latency from the scheduler, we may choose to
> +	  spin prior to sleeping again.
> +
> +	  May be 0 to disable spinning after being woken.
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index 5c2c93cbab12..ca25c53f8daa 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -1336,8 +1336,32 @@ long i915_request_wait(struct i915_request *rq,
>  	GEM_BUG_ON(!intel_wait_has_seqno(&wait));
>  	GEM_BUG_ON(!i915_sw_fence_signaled(&rq->submit));
>
> -	/* Optimistic short spin before touching IRQs */
> -	if (__i915_spin_request(rq, wait.seqno, state, 5))
> +	/*
> +	 * Optimistic spin before touching IRQs.
> +	 *
> +	 * We may use a rather large value here to offset the penalty of
> +	 * switching away from the active task. Frequently, the client will
> +	 * wait upon an old swapbuffer to throttle itself to remain within a
> +	 * frame of the gpu. If the client is running in lockstep with the gpu,
> +	 * then it should not be waiting long at all, and a sleep now will incur
> +	 * extra scheduler latency in producing the next frame. To try to
> +	 * avoid adding the cost of enabling/disabling the interrupt to the
> +	 * short wait, we first spin to see if the request would have completed
> +	 * in the time taken to setup the interrupt.
> +	 *
> +	 * We need upto 5us to enable the irq, and upto 20us to hide the
> +	 * scheduler latency of a context switch, ignoring the secondary
> +	 * impacts from a context switch such as cache eviction.
> +	 *
> +	 * The scheme used for low-latency IO is called "hybrid interrupt
> +	 * polling". The suggestion there is to sleep until just before you
> +	 * expect to be woken by the device interrupt and then poll for its
> +	 * completion. That requires having a good predictor for the request
> +	 * duration, which we currently lack.
> +	 */
> +	if (CONFIG_DRM_I915_SPIN_REQUEST_IRQ &&
> +	    __i915_spin_request(rq, wait.seqno, state,
> +				CONFIG_DRM_I915_SPIN_REQUEST_IRQ))
>  		goto complete;
>
>  	set_current_state(state);
> @@ -1396,8 +1420,15 @@ long i915_request_wait(struct i915_request *rq,
>  		    __i915_wait_request_check_and_reset(rq))
>  			continue;
>
> -		/* Only spin if we know the GPU is processing this request */
> -		if (__i915_spin_request(rq, wait.seqno, state, 2))
> +		/*
> +		 * A quick spin now we are on the CPU to offset the cost of
> +		 * context switching away (and so spin for roughly the same as
> +		 * the scheduler latency). We only spin if we know the GPU is
> +		 * processing this request, and so likely to finish shortly.
> +		 */
> +		if (CONFIG_DRM_I915_SPIN_REQUEST_CS &&
> +		    __i915_spin_request(rq, wait.seqno, state,
> +					CONFIG_DRM_I915_SPIN_REQUEST_CS))
>  			break;
>
>  		if (!intel_wait_check_request(&wait, rq)) {

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

We could add PMU metrics to count hit/missed busy spins so profile
guided bit becomes more direct?

Regards,

Tvrtko
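To make the pattern under review easier to see outside the kernel, here is a minimal, self-contained userspace sketch of the same idea (all names below are hypothetical and not from the patch): an integer build-time knob doubles as an enable flag, mirroring the "if (CONFIG_DRM_I915_SPIN_REQUEST_IRQ && __i915_spin_request(...))" guard, so a value of 0 lets the compiler drop the optimistic spin entirely while any non-zero value bounds how long we poll before falling back to a sleeping wait.

/*
 * Userspace sketch of the spin-before-sleep wait that the patch makes
 * tunable.  SPIN_TIMEOUT_US stands in for CONFIG_DRM_I915_SPIN_REQUEST_IRQ;
 * request_done stands in for the seqno/breadcrumb check.
 * Build with: cc -std=c11 -O2 spin_wait.c
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define SPIN_TIMEOUT_US 5	/* 0 disables the optimistic spin */

static atomic_bool request_done;

static unsigned long long now_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000ull + ts.tv_nsec / 1000;
}

/* Poll for completion for at most timeout_us; true if it completed. */
static bool spin_for_request(unsigned int timeout_us)
{
	unsigned long long deadline = now_us() + timeout_us;

	do {
		if (atomic_load(&request_done))
			return true;	/* finished while polling, no sleep needed */
	} while (now_us() < deadline);

	return false;
}

static void wait_for_request(void)
{
	/* With the knob at 0, the whole spin branch is dead code. */
	if (SPIN_TIMEOUT_US && spin_for_request(SPIN_TIMEOUT_US))
		return;

	/* Fallback: the (here simulated) interrupt-driven sleeping wait. */
	while (!atomic_load(&request_done))
		usleep(100);
}

int main(void)
{
	atomic_store(&request_done, true);	/* pretend the GPU already finished */
	wait_for_request();
	printf("request complete\n");
	return 0;
}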
Quoting Tvrtko Ursulin (2018-08-02 15:15:35)
>
> On 28/07/2018 17:46, Chris Wilson wrote:
> > An interesting discussion regarding "hybrid interrupt polling" for NVMe
> > came to the conclusion that the ideal busyspin before sleeping was half
> > of the expected request latency (and better if it was already halfway
> > through that request). This suggested that we too should look again at
> > our tradeoff between spinning and waiting. Currently, our spin simply
> > tries to hide the cost of enabling the interrupt, which is good to avoid
> > penalising nop requests (i.e. test throughput) and not much else.
> > Studying real world workloads suggests that a spin of upto 500us can
> > dramatically boost performance, but the suggestion is that this is not
> > from avoiding interrupt latency per-se, but from secondary effects of
> > sleeping such as allowing the CPU reduce cstate and context switch away.
> >
> > In a truly hybrid interrupt polling scheme, we would aim to sleep until
> > just before the request completed and then wake up in advance of the
> > interrupt and do a quick poll to handle completion. This is tricky for
> > ourselves at the moment as we are not recording request times, and since
> > we allow preemption, our requests are not on as a nicely ordered
> > timeline as IO. However, the idea is interesting, for it will certainly
> > help us decide when busyspinning is worthwhile.
> >
> > v2: Expose the spin setting via Kconfig options for easier adjustment
> > and testing.
> > v3: Don't get caught sneaking in a change to the busyspin parameters.
> > v4: Explain more about the "hybrid interrupt polling" scheme that we
> > want to migrate towards.
> >
> > Suggested-by: Sagar Kamble <sagar.a.kamble@intel.com>
> > References: http://events.linuxfoundation.org/sites/events/files/slides/lemoal-nvme-polling-vault-2017-final_0.pdf
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Sagar Kamble <sagar.a.kamble@intel.com>
> > Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: Ben Widawsky <ben@bwidawsk.net>
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > Cc: Michał Winiarski <michal.winiarski@intel.com>
> > Reviewed-by: Sagar Kamble <sagar.a.kamble@intel.com>
> > ---
> >  drivers/gpu/drm/i915/Kconfig         |  6 +++++
> >  drivers/gpu/drm/i915/Kconfig.profile | 26 +++++++++++++++++++
> >  drivers/gpu/drm/i915/i915_request.c  | 39 +++++++++++++++++++++++++---
> >  3 files changed, 67 insertions(+), 4 deletions(-)
> >  create mode 100644 drivers/gpu/drm/i915/Kconfig.profile
> >
> > diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
> > index 5c607f2c707b..387613f29cb0 100644
> > --- a/drivers/gpu/drm/i915/Kconfig
> > +++ b/drivers/gpu/drm/i915/Kconfig
> > @@ -132,3 +132,9 @@ depends on DRM_I915
> >  depends on EXPERT
> >  source drivers/gpu/drm/i915/Kconfig.debug
> >  endmenu
> > +
> > +menu "drm/i915 Profile Guided Optimisation"
>
> Or something like a more generic drm/i915 Manual Tuning Options so it
> sounds less like an automated thing where you can feed the results of
> something straight in?

Hah, good call. PGO was taken right out of the gcc manual, so yeah it's
a bit misleading.

> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> We could add PMU metrics to count hit/missed busy spins so profile
> guided bit becomes more direct?

Sensible. I wonder though how much we can do already with kprobes?

Another one I think I'd like here is the DMA_LATENCY qos parameter. If
you haven't seen https://patchwork.freedesktop.org/patch/241571/ that's
worth throwing against media-bench.
-Chris
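For reference, the DMA_LATENCY qos idea mentioned above maps onto the kernel's PM QoS interface of this era (PM_QOS_CPU_DMA_LATENCY). The helpers below are only an illustrative guess at the shape, not taken from the linked patch: a latency bound is held for the duration of a wait so the CPU avoids dropping into deep C-states between the sleep and the completion interrupt.

#include <linux/pm_qos.h>

/*
 * Hypothetical helpers (illustration only): hold a cpu_dma_latency
 * constraint while a latency-sensitive wait is outstanding, and release
 * it once the wait completes.
 */
static void wait_qos_begin(struct pm_qos_request *req, s32 max_latency_us)
{
	/* 0 forbids C-state entry entirely; a small bound is usually enough. */
	pm_qos_add_request(req, PM_QOS_CPU_DMA_LATENCY, max_latency_us);
}

static void wait_qos_end(struct pm_qos_request *req)
{
	pm_qos_remove_request(req);
}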
diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
index 5c607f2c707b..387613f29cb0 100644
--- a/drivers/gpu/drm/i915/Kconfig
+++ b/drivers/gpu/drm/i915/Kconfig
@@ -132,3 +132,9 @@ depends on DRM_I915
 depends on EXPERT
 source drivers/gpu/drm/i915/Kconfig.debug
 endmenu
+
+menu "drm/i915 Profile Guided Optimisation"
+	visible if EXPERT
+	depends on DRM_I915
+	source drivers/gpu/drm/i915/Kconfig.profile
+endmenu
diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
new file mode 100644
index 000000000000..8a230eeb98df
--- /dev/null
+++ b/drivers/gpu/drm/i915/Kconfig.profile
@@ -0,0 +1,26 @@
+config DRM_I915_SPIN_REQUEST_IRQ
+	int
+	default 5 # microseconds
+	help
+	  Before sleeping waiting for a request (GPU operation) to complete,
+	  we may spend some time polling for its completion. As the IRQ may
+	  take a non-negligible time to setup, we do a short spin first to
+	  check if the request will complete in the time it would have taken
+	  us to enable the interrupt.
+
+	  May be 0 to disable the initial spin. In practice, we estimate
+	  the cost of enabling the interrupt (if currently disabled) to be
+	  a few microseconds.
+
+config DRM_I915_SPIN_REQUEST_CS
+	int
+	default 2 # microseconds
+	help
+	  After sleeping for a request (GPU operation) to complete, we will
+	  be woken up on the completion of every request prior to the one
+	  being waited on. For very short requests, going back to sleep and
+	  be woken up again may add considerably to the wakeup latency. To
+	  avoid incurring extra latency from the scheduler, we may choose to
+	  spin prior to sleeping again.
+
+	  May be 0 to disable spinning after being woken.
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 5c2c93cbab12..ca25c53f8daa 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1336,8 +1336,32 @@ long i915_request_wait(struct i915_request *rq,
 	GEM_BUG_ON(!intel_wait_has_seqno(&wait));
 	GEM_BUG_ON(!i915_sw_fence_signaled(&rq->submit));

-	/* Optimistic short spin before touching IRQs */
-	if (__i915_spin_request(rq, wait.seqno, state, 5))
+	/*
+	 * Optimistic spin before touching IRQs.
+	 *
+	 * We may use a rather large value here to offset the penalty of
+	 * switching away from the active task. Frequently, the client will
+	 * wait upon an old swapbuffer to throttle itself to remain within a
+	 * frame of the gpu. If the client is running in lockstep with the gpu,
+	 * then it should not be waiting long at all, and a sleep now will incur
+	 * extra scheduler latency in producing the next frame. To try to
+	 * avoid adding the cost of enabling/disabling the interrupt to the
+	 * short wait, we first spin to see if the request would have completed
+	 * in the time taken to setup the interrupt.
+	 *
+	 * We need upto 5us to enable the irq, and upto 20us to hide the
+	 * scheduler latency of a context switch, ignoring the secondary
+	 * impacts from a context switch such as cache eviction.
+	 *
+	 * The scheme used for low-latency IO is called "hybrid interrupt
+	 * polling". The suggestion there is to sleep until just before you
+	 * expect to be woken by the device interrupt and then poll for its
+	 * completion. That requires having a good predictor for the request
+	 * duration, which we currently lack.
+	 */
+	if (CONFIG_DRM_I915_SPIN_REQUEST_IRQ &&
+	    __i915_spin_request(rq, wait.seqno, state,
+				CONFIG_DRM_I915_SPIN_REQUEST_IRQ))
 		goto complete;

 	set_current_state(state);
@@ -1396,8 +1420,15 @@ long i915_request_wait(struct i915_request *rq,
 		    __i915_wait_request_check_and_reset(rq))
 			continue;

-		/* Only spin if we know the GPU is processing this request */
-		if (__i915_spin_request(rq, wait.seqno, state, 2))
+		/*
+		 * A quick spin now we are on the CPU to offset the cost of
+		 * context switching away (and so spin for roughly the same as
+		 * the scheduler latency). We only spin if we know the GPU is
+		 * processing this request, and so likely to finish shortly.
+		 */
+		if (CONFIG_DRM_I915_SPIN_REQUEST_CS &&
+		    __i915_spin_request(rq, wait.seqno, state,
+					CONFIG_DRM_I915_SPIN_REQUEST_CS))
 			break;

 		if (!intel_wait_check_request(&wait, rq)) {
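The comment added above sketches the longer-term direction rather than what the patch implements. Purely as an illustration (the predictor and all names below are invented, not part of this patch), a hybrid interrupt polling wait would look roughly like this in userspace terms: sleep through about half of the predicted request duration, then poll the tail so the waiter is already running when the request completes.

/*
 * Toy sketch of "hybrid interrupt polling": sleep for roughly half of the
 * predicted request duration, then busy-poll the remainder.  request_done
 * stands in for the completion check; the prediction comes from elsewhere.
 * Build with: cc -std=c11 -O2 hybrid_wait.c
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static atomic_bool request_done;

static unsigned long long now_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000ull + ts.tv_nsec / 1000;
}

/* Busy-poll until the deadline passes; true if the request completed. */
static bool poll_until(unsigned long long deadline_us)
{
	do {
		if (atomic_load(&request_done))
			return true;
	} while (now_us() < deadline_us);

	return false;
}

static void hybrid_wait(unsigned long long predicted_duration_us)
{
	unsigned long long start = now_us();

	/* Sleep through the bulk of the request (half is the NVMe heuristic). */
	usleep(predicted_duration_us / 2);

	/* Poll the tail so we are already running when completion arrives. */
	if (poll_until(start + predicted_duration_us))
		return;

	/* Prediction was too short: fall back to a normal sleeping wait. */
	while (!atomic_load(&request_done))
		usleep(100);
}

int main(void)
{
	atomic_store(&request_done, true);	/* pretend completion is imminent */
	hybrid_wait(1000);			/* predicted 1ms request */
	printf("request complete\n");
	return 0;
}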