
[1/5] drm/i915: Expose the busyspin durations for i915_wait_request

Message ID 20180728164623.10613-1-chris@chris-wilson.co.uk (mailing list archive)
State New, archived

Commit Message

Chris Wilson July 28, 2018, 4:46 p.m. UTC
An interesting discussion regarding "hybrid interrupt polling" for NVMe
came to the conclusion that the ideal busyspin before sleeping was half
of the expected request latency (and better still if it was already
halfway through that request). This suggested that we too should look
again at our tradeoff between spinning and waiting. Currently, our spin
simply tries to hide the cost of enabling the interrupt, which is good
for avoiding a penalty on nop requests (i.e. test throughput) but not
much else. Studying real world workloads suggests that a spin of up to
500us can dramatically boost performance, but the suggestion is that
this is not from avoiding interrupt latency per se, but from the
secondary effects of sleeping, such as allowing the CPU to drop into a
deeper C-state and to context switch away.

In a truly hybrid interrupt polling scheme, we would aim to sleep until
just before the request completed and then wake up in advance of the
interrupt and do a quick poll to handle completion. This is tricky for
us at the moment as we are not recording request times, and since we
allow preemption, our requests do not sit on as nicely ordered a
timeline as IO. However, the idea is interesting, for it will certainly
help us decide when busyspinning is worthwhile.
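
For illustration only, a rough userspace sketch of the hybrid scheme
(the predictor, the margins and the request_completed() stub are all
invented for the example; this is not the i915 wait path):

/*
 * Rough model of a hybrid wait: spin for up to half the predicted
 * latency, sleep until just before the predicted completion, then poll.
 * All names and numbers here are invented for the example.
 */
#define _POSIX_C_SOURCE 199309L
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

static atomic_bool request_done;	/* stub: set by the "interrupt" */

static bool request_completed(void)
{
	return atomic_load_explicit(&request_done, memory_order_acquire);
}

static long now_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000L + ts.tv_nsec / 1000;
}

static void sleep_us(long us)
{
	struct timespec ts = { us / 1000000, (us % 1000000) * 1000 };

	nanosleep(&ts, NULL);
}

void hybrid_wait(long predicted_us)
{
	long start = now_us();
	long remaining;

	/* 1. Optimistic busyspin for up to half the expected latency. */
	while (now_us() - start < predicted_us / 2)
		if (request_completed())
			return;

	/* 2. Sleep until just before the predicted completion (5us margin). */
	remaining = predicted_us - (now_us() - start) - 5;
	if (remaining > 0 && !request_completed())
		sleep_us(remaining);

	/* 3. Short final poll, aiming to beat the interrupt/wakeup. */
	while (!request_completed())
		sleep_us(1);
}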

v2: Expose the spin setting via Kconfig options for easier adjustment
and testing.
v3: Don't get caught sneaking in a change to the busyspin parameters.
v4: Explain more about the "hybrid interrupt polling" scheme that we
want to migrate towards.

Suggested-by: Sagar Kamble <sagar.a.kamble@intel.com>
References: http://events.linuxfoundation.org/sites/events/files/slides/lemoal-nvme-polling-vault-2017-final_0.pdf
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Sagar Kamble <sagar.a.kamble@intel.com>
Cc: Eero Tamminen <eero.t.tamminen@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Ben Widawsky <ben@bwidawsk.net>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Reviewed-by: Sagar Kamble <sagar.a.kamble@intel.com>
---
 drivers/gpu/drm/i915/Kconfig         |  6 +++++
 drivers/gpu/drm/i915/Kconfig.profile | 26 +++++++++++++++++++
 drivers/gpu/drm/i915/i915_request.c  | 39 +++++++++++++++++++++++++---
 3 files changed, 67 insertions(+), 4 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/Kconfig.profile

Comments

Tvrtko Ursulin Aug. 2, 2018, 2:15 p.m. UTC | #1
On 28/07/2018 17:46, Chris Wilson wrote:
> An interesting discussion regarding "hybrid interrupt polling" for NVMe
> came to the conclusion that the ideal busyspin before sleeping was half
> of the expected request latency (and better still if it was already
> halfway through that request). This suggested that we too should look
> again at our tradeoff between spinning and waiting. Currently, our spin
> simply tries to hide the cost of enabling the interrupt, which is good
> for avoiding a penalty on nop requests (i.e. test throughput) but not
> much else. Studying real world workloads suggests that a spin of up to
> 500us can dramatically boost performance, but the suggestion is that
> this is not from avoiding interrupt latency per se, but from the
> secondary effects of sleeping, such as allowing the CPU to drop into a
> deeper C-state and to context switch away.
> 
> In a truly hybrid interrupt polling scheme, we would aim to sleep until
> just before the request completed and then wake up in advance of the
> interrupt and do a quick poll to handle completion. This is tricky for
> us at the moment as we are not recording request times, and since we
> allow preemption, our requests do not sit on as nicely ordered a
> timeline as IO. However, the idea is interesting, for it will certainly
> help us decide when busyspinning is worthwhile.
> 
> v2: Expose the spin setting via Kconfig options for easier adjustment
> and testing.
> v3: Don't get caught sneaking in a change to the busyspin parameters.
> v4: Explain more about the "hybrid interrupt polling" scheme that we
> want to migrate towards.
> 
> Suggested-by: Sagar Kamble <sagar.a.kamble@intel.com>
> References: http://events.linuxfoundation.org/sites/events/files/slides/lemoal-nvme-polling-vault-2017-final_0.pdf
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Sagar Kamble <sagar.a.kamble@intel.com>
> Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Ben Widawsky <ben@bwidawsk.net>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> Reviewed-by: Sagar Kamble <sagar.a.kamble@intel.com>
> ---
>   drivers/gpu/drm/i915/Kconfig         |  6 +++++
>   drivers/gpu/drm/i915/Kconfig.profile | 26 +++++++++++++++++++
>   drivers/gpu/drm/i915/i915_request.c  | 39 +++++++++++++++++++++++++---
>   3 files changed, 67 insertions(+), 4 deletions(-)
>   create mode 100644 drivers/gpu/drm/i915/Kconfig.profile
> 
> diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
> index 5c607f2c707b..387613f29cb0 100644
> --- a/drivers/gpu/drm/i915/Kconfig
> +++ b/drivers/gpu/drm/i915/Kconfig
> @@ -132,3 +132,9 @@ depends on DRM_I915
>   depends on EXPERT
>   source drivers/gpu/drm/i915/Kconfig.debug
>   endmenu
> +
> +menu "drm/i915 Profile Guided Optimisation"

Or something like a more generic drm/i915 Manual Tuning Options so it 
sounds less like an automated thing where you can feed the results of 
something straight in?

> +	visible if EXPERT
> +	depends on DRM_I915
> +	source drivers/gpu/drm/i915/Kconfig.profile
> +endmenu
> diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> new file mode 100644
> index 000000000000..8a230eeb98df
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/Kconfig.profile
> @@ -0,0 +1,26 @@
> +config DRM_I915_SPIN_REQUEST_IRQ
> +	int
> +	default 5 # microseconds
> +	help
> +	  Before sleeping waiting for a request (GPU operation) to complete,
> +	  we may spend some time polling for its completion. As the IRQ may
> +	  take a non-negligible time to set up, we do a short spin first to
> +	  check if the request will complete in the time it would have taken
> +	  us to enable the interrupt.
> +
> +	  May be 0 to disable the initial spin. In practice, we estimate
> +	  the cost of enabling the interrupt (if currently disabled) to be
> +	  a few microseconds.
> +
> +config DRM_I915_SPIN_REQUEST_CS
> +	int
> +	default 2 # microseconds
> +	help
> +	  After sleeping for a request (GPU operation) to complete, we will
> +	  be woken up on the completion of every request prior to the one
> +	  being waited on. For very short requests, going back to sleep and
> +	  being woken up again may add considerably to the wakeup latency. To
> +	  avoid incurring extra latency from the scheduler, we may choose to
> +	  spin prior to sleeping again.
> +
> +	  May be 0 to disable spinning after being woken.
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index 5c2c93cbab12..ca25c53f8daa 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -1336,8 +1336,32 @@ long i915_request_wait(struct i915_request *rq,
>   	GEM_BUG_ON(!intel_wait_has_seqno(&wait));
>   	GEM_BUG_ON(!i915_sw_fence_signaled(&rq->submit));
>   
> -	/* Optimistic short spin before touching IRQs */
> -	if (__i915_spin_request(rq, wait.seqno, state, 5))
> +	/*
> +	 * Optimistic spin before touching IRQs.
> +	 *
> +	 * We may use a rather large value here to offset the penalty of
> +	 * switching away from the active task. Frequently, the client will
> +	 * wait upon an old swapbuffer to throttle itself to remain within a
> +	 * frame of the gpu. If the client is running in lockstep with the gpu,
> +	 * then it should not be waiting long at all, and a sleep now will incur
> +	 * extra scheduler latency in producing the next frame. To try to
> +	 * avoid adding the cost of enabling/disabling the interrupt to the
> +	 * short wait, we first spin to see if the request would have completed
> +	 * in the time taken to set up the interrupt.
> +	 *
> +	 * We need up to 5us to enable the irq, and up to 20us to hide the
> +	 * scheduler latency of a context switch, ignoring the secondary
> +	 * impacts from a context switch such as cache eviction.
> +	 *
> +	 * The scheme used for low-latency IO is called "hybrid interrupt
> +	 * polling". The suggestion there is to sleep until just before you
> +	 * expect to be woken by the device interrupt and then poll for its
> +	 * completion. That requires having a good predictor for the request
> +	 * duration, which we currently lack.
> +	 */
> +	if (CONFIG_DRM_I915_SPIN_REQUEST_IRQ &&
> +	    __i915_spin_request(rq, wait.seqno, state,
> +				CONFIG_DRM_I915_SPIN_REQUEST_IRQ))
>   		goto complete;
>   
>   	set_current_state(state);
> @@ -1396,8 +1420,15 @@ long i915_request_wait(struct i915_request *rq,
>   		    __i915_wait_request_check_and_reset(rq))
>   			continue;
>   
> -		/* Only spin if we know the GPU is processing this request */
> -		if (__i915_spin_request(rq, wait.seqno, state, 2))
> +		/*
> +		 * A quick spin now we are on the CPU to offset the cost of
> +	 * context switching away (and so spin for roughly as long as
> +		 * the scheduler latency). We only spin if we know the GPU is
> +		 * processing this request, and so likely to finish shortly.
> +		 */
> +		if (CONFIG_DRM_I915_SPIN_REQUEST_CS &&
> +		    __i915_spin_request(rq, wait.seqno, state,
> +					CONFIG_DRM_I915_SPIN_REQUEST_CS))
>   			break;
>   
>   		if (!intel_wait_check_request(&wait, rq)) {
> 

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

We could add PMU metrics to count hit/missed busy spins so that the 
profile guided bit becomes more direct?

Regards,

Tvrtko
Chris Wilson Aug. 2, 2018, 2:29 p.m. UTC | #2
Quoting Tvrtko Ursulin (2018-08-02 15:15:35)
> 
> On 28/07/2018 17:46, Chris Wilson wrote:
> > An interesting discussion regarding "hybrid interrupt polling" for NVMe
> > came to the conclusion that the ideal busyspin before sleeping was half
> > of the expected request latency (and better still if it was already
> > halfway through that request). This suggested that we too should look
> > again at our tradeoff between spinning and waiting. Currently, our spin
> > simply tries to hide the cost of enabling the interrupt, which is good
> > for avoiding a penalty on nop requests (i.e. test throughput) but not
> > much else. Studying real world workloads suggests that a spin of up to
> > 500us can dramatically boost performance, but the suggestion is that
> > this is not from avoiding interrupt latency per se, but from the
> > secondary effects of sleeping, such as allowing the CPU to drop into a
> > deeper C-state and to context switch away.
> > 
> > In a truly hybrid interrupt polling scheme, we would aim to sleep until
> > just before the request completed and then wake up in advance of the
> > interrupt and do a quick poll to handle completion. This is tricky for
> > us at the moment as we are not recording request times, and since we
> > allow preemption, our requests do not sit on as nicely ordered a
> > timeline as IO. However, the idea is interesting, for it will certainly
> > help us decide when busyspinning is worthwhile.
> > 
> > v2: Expose the spin setting via Kconfig options for easier adjustment
> > and testing.
> > v3: Don't get caught sneaking in a change to the busyspin parameters.
> > v4: Explain more about the "hybrid interrupt polling" scheme that we
> > want to migrate towards.
> > 
> > Suggested-by: Sagar Kamble <sagar.a.kamble@intel.com>
> > References: http://events.linuxfoundation.org/sites/events/files/slides/lemoal-nvme-polling-vault-2017-final_0.pdf
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Sagar Kamble <sagar.a.kamble@intel.com>
> > Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: Ben Widawsky <ben@bwidawsk.net>
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > Cc: Michał Winiarski <michal.winiarski@intel.com>
> > Reviewed-by: Sagar Kamble <sagar.a.kamble@intel.com>
> > ---
> >   drivers/gpu/drm/i915/Kconfig         |  6 +++++
> >   drivers/gpu/drm/i915/Kconfig.profile | 26 +++++++++++++++++++
> >   drivers/gpu/drm/i915/i915_request.c  | 39 +++++++++++++++++++++++++---
> >   3 files changed, 67 insertions(+), 4 deletions(-)
> >   create mode 100644 drivers/gpu/drm/i915/Kconfig.profile
> > 
> > diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
> > index 5c607f2c707b..387613f29cb0 100644
> > --- a/drivers/gpu/drm/i915/Kconfig
> > +++ b/drivers/gpu/drm/i915/Kconfig
> > @@ -132,3 +132,9 @@ depends on DRM_I915
> >   depends on EXPERT
> >   source drivers/gpu/drm/i915/Kconfig.debug
> >   endmenu
> > +
> > +menu "drm/i915 Profile Guided Optimisation"
> 
> Or something like a more generic drm/i915 Manual Tuning Options so it 
> sounds less like an automated thing where you can feed the results of 
> something straight in?

Hah, good call. PGO was taken right out of the gcc manual, so yeah
it's a bit misleading.
 
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> We could add PMU metrics to count hit/missed busy spins so that the 
> profile guided bit becomes more direct?

Sensible. I wonder though how much we can do already with kprobes?
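
Something along these lines would already give us the wait durations
from a standalone module, without touching the driver (rough sketch,
untested; assumes i915_request_wait is visible to kprobes):

/* Rough sketch: time every i915_request_wait() call with a kretprobe. */
#include <linux/kernel.h>
#include <linux/kprobes.h>
#include <linux/ktime.h>
#include <linux/module.h>

struct wait_data {
	ktime_t entry_stamp;
};

static int wait_entry(struct kretprobe_instance *ri, struct pt_regs *regs)
{
	struct wait_data *data = (struct wait_data *)ri->data;

	data->entry_stamp = ktime_get();
	return 0;
}

static int wait_ret(struct kretprobe_instance *ri, struct pt_regs *regs)
{
	struct wait_data *data = (struct wait_data *)ri->data;
	s64 delta_us = ktime_to_us(ktime_sub(ktime_get(), data->entry_stamp));

	/* A real probe would ratelimit this or feed a histogram instead. */
	pr_info("i915_request_wait: %lld us\n", delta_us);
	return 0;
}

static struct kretprobe wait_kretprobe = {
	.kp.symbol_name	= "i915_request_wait",
	.entry_handler	= wait_entry,
	.handler	= wait_ret,
	.data_size	= sizeof(struct wait_data),
	.maxactive	= 16,
};

static int __init wait_probe_init(void)
{
	return register_kretprobe(&wait_kretprobe);
}

static void __exit wait_probe_exit(void)
{
	unregister_kretprobe(&wait_kretprobe);
}

module_init(wait_probe_init);
module_exit(wait_probe_exit);
MODULE_LICENSE("GPL");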

Another one I think I'd like here is the DMA_LATENCY qos parameter. If
you haven't seen https://patchwork.freedesktop.org/patch/241571/ that's
worth throwing against media-bench.
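
On the userspace side, the same QoS class can be held for the length of
a run by keeping /dev/cpu_dma_latency open at 0; a rough sketch (again
not part of the series):

/*
 * Hold the CPU DMA latency PM QoS at 0us while running a benchmark, to
 * see how much of the busyspin win is just C-state avoidance.  The
 * request lasts for as long as the fd stays open.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int32_t target_us = 0;
	int fd = open("/dev/cpu_dma_latency", O_WRONLY);

	if (fd < 0) {
		perror("open /dev/cpu_dma_latency");
		return 1;
	}
	if (write(fd, &target_us, sizeof(target_us)) != (ssize_t)sizeof(target_us)) {
		perror("write");
		return 1;
	}

	if (argc > 1)
		system(argv[1]);	/* run the workload with the request held */

	close(fd);			/* dropping the fd releases the request */
	return 0;
}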
-Chris

Patch

diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
index 5c607f2c707b..387613f29cb0 100644
--- a/drivers/gpu/drm/i915/Kconfig
+++ b/drivers/gpu/drm/i915/Kconfig
@@ -132,3 +132,9 @@  depends on DRM_I915
 depends on EXPERT
 source drivers/gpu/drm/i915/Kconfig.debug
 endmenu
+
+menu "drm/i915 Profile Guided Optimisation"
+	visible if EXPERT
+	depends on DRM_I915
+	source drivers/gpu/drm/i915/Kconfig.profile
+endmenu
diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
new file mode 100644
index 000000000000..8a230eeb98df
--- /dev/null
+++ b/drivers/gpu/drm/i915/Kconfig.profile
@@ -0,0 +1,26 @@ 
+config DRM_I915_SPIN_REQUEST_IRQ
+	int
+	default 5 # microseconds
+	help
+	  Before sleeping waiting for a request (GPU operation) to complete,
+	  we may spend some time polling for its completion. As the IRQ may
+	  take a non-negligible time to set up, we do a short spin first to
+	  check if the request will complete in the time it would have taken
+	  us to enable the interrupt.
+
+	  May be 0 to disable the initial spin. In practice, we estimate
+	  the cost of enabling the interrupt (if currently disabled) to be
+	  a few microseconds.
+
+config DRM_I915_SPIN_REQUEST_CS
+	int
+	default 2 # microseconds
+	help
+	  After sleeping for a request (GPU operation) to complete, we will
+	  be woken up on the completion of every request prior to the one
+	  being waited on. For very short requests, going back to sleep and
+	  being woken up again may add considerably to the wakeup latency. To
+	  avoid incurring extra latency from the scheduler, we may choose to
+	  spin prior to sleeping again.
+
+	  May be 0 to disable spinning after being woken.
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 5c2c93cbab12..ca25c53f8daa 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1336,8 +1336,32 @@  long i915_request_wait(struct i915_request *rq,
 	GEM_BUG_ON(!intel_wait_has_seqno(&wait));
 	GEM_BUG_ON(!i915_sw_fence_signaled(&rq->submit));
 
-	/* Optimistic short spin before touching IRQs */
-	if (__i915_spin_request(rq, wait.seqno, state, 5))
+	/*
+	 * Optimistic spin before touching IRQs.
+	 *
+	 * We may use a rather large value here to offset the penalty of
+	 * switching away from the active task. Frequently, the client will
+	 * wait upon an old swapbuffer to throttle itself to remain within a
+	 * frame of the gpu. If the client is running in lockstep with the gpu,
+	 * then it should not be waiting long at all, and a sleep now will incur
+	 * extra scheduler latency in producing the next frame. To try to
+	 * avoid adding the cost of enabling/disabling the interrupt to the
+	 * short wait, we first spin to see if the request would have completed
+	 * in the time taken to set up the interrupt.
+	 *
+	 * We need up to 5us to enable the irq, and up to 20us to hide the
+	 * scheduler latency of a context switch, ignoring the secondary
+	 * impacts from a context switch such as cache eviction.
+	 *
+	 * The scheme used for low-latency IO is called "hybrid interrupt
+	 * polling". The suggestion there is to sleep until just before you
+	 * expect to be woken by the device interrupt and then poll for its
+	 * completion. That requires having a good predictor for the request
+	 * duration, which we currently lack.
+	 */
+	if (CONFIG_DRM_I915_SPIN_REQUEST_IRQ &&
+	    __i915_spin_request(rq, wait.seqno, state,
+				CONFIG_DRM_I915_SPIN_REQUEST_IRQ))
 		goto complete;
 
 	set_current_state(state);
@@ -1396,8 +1420,15 @@  long i915_request_wait(struct i915_request *rq,
 		    __i915_wait_request_check_and_reset(rq))
 			continue;
 
-		/* Only spin if we know the GPU is processing this request */
-		if (__i915_spin_request(rq, wait.seqno, state, 2))
+		/*
+		 * A quick spin now we are on the CPU to offset the cost of
+	 * context switching away (and so spin for roughly as long as
+		 * the scheduler latency). We only spin if we know the GPU is
+		 * processing this request, and so likely to finish shortly.
+		 */
+		if (CONFIG_DRM_I915_SPIN_REQUEST_CS &&
+		    __i915_spin_request(rq, wait.seqno, state,
+					CONFIG_DRM_I915_SPIN_REQUEST_CS))
 			break;
 
 		if (!intel_wait_check_request(&wait, rq)) {