Message ID | 1448786893-2522-3-git-send-email-chris@chris-wilson.co.uk (mailing list archive) |
---|---|
State | New, archived |
On 29/11/15 08:48, Chris Wilson wrote:
> When waiting for high frequency requests, the finite amount of time
> required to set up the irq and wait upon it limits the response rate. By
> busywaiting on the request completion for a short while we can service
> the high frequency waits as quick as possible. However, if it is a slow
> request, we want to sleep as quickly as possible. The tradeoff between
> waiting and sleeping is roughly the time it takes to sleep on a request,
> on the order of a microsecond. Based on measurements of synchronous
> workloads from across big core and little atom, I have set the limit for
> busywaiting as 10 microseconds. In most of the synchronous cases, we can
> reduce the limit down to as little as 2 microseconds, but that leaves
> quite a few test cases regressing by factors of 3 and more.
>
> The code currently uses the jiffie clock, but that is far too coarse (on
> the order of 10 milliseconds) and results in poor interactivity as the
> CPU ends up being hogged by slow requests. To get microsecond resolution
> we need to use a high resolution timer. The cheapest of which is polling
> local_clock(), but that is only valid on the same CPU. If we switch CPUs
> because the task was preempted, we can also use that as an indicator that
> the system is too busy to waste cycles on spinning and we should sleep
> instead.
>
> __i915_spin_request was introduced in
> commit 2def4ad99befa25775dd2f714fdd4d92faec6e34 [v4.2]
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date: Tue Apr 7 16:20:41 2015 +0100
>
>     drm/i915: Optimistically spin for the request completion
>
> v2: Drop full u64 for unsigned long - the timer is 32bit wraparound safe,
> so we can use native register sizes on smaller architectures. Mention
> the approximate microseconds units for elapsed time and add some extra
> comments describing the reason for busywaiting.
>
> v3: Raise the limit to 10us
>
> Reported-by: Jens Axboe <axboe@kernel.dk>
> Link: https://lkml.org/lkml/2015/11/12/621
> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> Cc: "Rantala, Valtteri" <valtteri.rantala@intel.com>
> Cc: stable@kernel.vger.org
> ---
>  drivers/gpu/drm/i915/i915_gem.c | 47 +++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 45 insertions(+), 2 deletions(-)

Again,

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko

> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 87fc34f5899f..bad112abb16b 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -1146,14 +1146,57 @@ static bool missed_irq(struct drm_i915_private *dev_priv,
>  	return test_bit(ring->id, &dev_priv->gpu_error.missed_irq_rings);
>  }
>
> +static unsigned long local_clock_us(unsigned *cpu)
> +{
> +	unsigned long t;
> +
> +	/* Cheaply and approximately convert from nanoseconds to microseconds.
> +	 * The result and subsequent calculations are also defined in the same
> +	 * approximate microseconds units. The principal source of timing
> +	 * error here is from the simple truncation.
> +	 *
> +	 * Note that local_clock() is only defined wrt to the current CPU;
> +	 * the comparisons are no longer valid if we switch CPUs. Instead of
> +	 * blocking preemption for the entire busywait, we can detect the CPU
> +	 * switch and use that as indicator of system load and a reason to
> +	 * stop busywaiting, see busywait_stop().
> +	 */
> +	*cpu = get_cpu();
> +	t = local_clock() >> 10;
> +	put_cpu();
> +
> +	return t;
> +}
> +
> +static bool busywait_stop(unsigned long timeout, unsigned cpu)
> +{
> +	unsigned this_cpu;
> +
> +	if (time_after(local_clock_us(&this_cpu), timeout))
> +		return true;
> +
> +	return this_cpu != cpu;
> +}
> +
>  static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
>  {
>  	unsigned long timeout;
> +	unsigned cpu;
> +
> +	/* When waiting for high frequency requests, e.g. during synchronous
> +	 * rendering split between the CPU and GPU, the finite amount of time
> +	 * required to set up the irq and wait upon it limits the response
> +	 * rate. By busywaiting on the request completion for a short while we
> +	 * can service the high frequency waits as quick as possible. However,
> +	 * if it is a slow request, we want to sleep as quickly as possible.
> +	 * The tradeoff between waiting and sleeping is roughly the time it
> +	 * takes to sleep on a request, on the order of a microsecond.
> +	 */
>
>  	if (i915_gem_request_get_ring(req)->irq_refcount)
>  		return -EBUSY;
>
> -	timeout = jiffies + 1;
> +	timeout = local_clock_us(&cpu) + 10;
>  	while (!need_resched()) {
>  		if (i915_gem_request_completed(req, true))
>  			return 0;
> @@ -1161,7 +1204,7 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
>  		if (signal_pending_state(state, current))
>  			break;
>
> -		if (time_after_eq(jiffies, timeout))
> +		if (busywait_stop(timeout, cpu))
>  			break;
>
>  		cpu_relax_lowlatency();
>
On Mon, Nov 30, 2015 at 10:02:30AM +0000, Tvrtko Ursulin wrote:
>
>
> On 29/11/15 08:48, Chris Wilson wrote:
> >When waiting for high frequency requests, the finite amount of time
> >required to set up the irq and wait upon it limits the response rate. By
> >busywaiting on the request completion for a short while we can service
> >the high frequency waits as quick as possible. However, if it is a slow
> >request, we want to sleep as quickly as possible. The tradeoff between
> >waiting and sleeping is roughly the time it takes to sleep on a request,
> >on the order of a microsecond. Based on measurements of synchronous
> >workloads from across big core and little atom, I have set the limit for
> >busywaiting as 10 microseconds. In most of the synchronous cases, we can
> >reduce the limit down to as little as 2 microseconds, but that leaves
> >quite a few test cases regressing by factors of 3 and more.
> >
> >The code currently uses the jiffie clock, but that is far too coarse (on
> >the order of 10 milliseconds) and results in poor interactivity as the
> >CPU ends up being hogged by slow requests. To get microsecond resolution
> >we need to use a high resolution timer. The cheapest of which is polling
> >local_clock(), but that is only valid on the same CPU. If we switch CPUs
> >because the task was preempted, we can also use that as an indicator that
> > the system is too busy to waste cycles on spinning and we should sleep
> >instead.
> >
> >__i915_spin_request was introduced in
> >commit 2def4ad99befa25775dd2f714fdd4d92faec6e34 [v4.2]
> >Author: Chris Wilson <chris@chris-wilson.co.uk>
> >Date: Tue Apr 7 16:20:41 2015 +0100
> >
> >    drm/i915: Optimistically spin for the request completion
> >
> >v2: Drop full u64 for unsigned long - the timer is 32bit wraparound safe,
> >so we can use native register sizes on smaller architectures. Mention
> >the approximate microseconds units for elapsed time and add some extra
> >comments describing the reason for busywaiting.
> >
> >v3: Raise the limit to 10us
> >
> >Reported-by: Jens Axboe <axboe@kernel.dk>
> >Link: https://lkml.org/lkml/2015/11/12/621
> >Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com>
> >Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> >Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> >Cc: Eero Tamminen <eero.t.tamminen@intel.com>
> >Cc: "Rantala, Valtteri" <valtteri.rantala@intel.com>
> >Cc: stable@kernel.vger.org
> >---
> > drivers/gpu/drm/i915/i915_gem.c | 47 +++++++++++++++++++++++++++++++++++++++--
> > 1 file changed, 45 insertions(+), 2 deletions(-)
>
> Again,
>
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Sorry, I was just including these patches here so that anyone who wanted to
look at the wait-request patches only had one series to pull :)
-Chris
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 87fc34f5899f..bad112abb16b 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1146,14 +1146,57 @@ static bool missed_irq(struct drm_i915_private *dev_priv,
 	return test_bit(ring->id, &dev_priv->gpu_error.missed_irq_rings);
 }
 
+static unsigned long local_clock_us(unsigned *cpu)
+{
+	unsigned long t;
+
+	/* Cheaply and approximately convert from nanoseconds to microseconds.
+	 * The result and subsequent calculations are also defined in the same
+	 * approximate microseconds units. The principal source of timing
+	 * error here is from the simple truncation.
+	 *
+	 * Note that local_clock() is only defined wrt to the current CPU;
+	 * the comparisons are no longer valid if we switch CPUs. Instead of
+	 * blocking preemption for the entire busywait, we can detect the CPU
+	 * switch and use that as indicator of system load and a reason to
+	 * stop busywaiting, see busywait_stop().
+	 */
+	*cpu = get_cpu();
+	t = local_clock() >> 10;
+	put_cpu();
+
+	return t;
+}
+
+static bool busywait_stop(unsigned long timeout, unsigned cpu)
+{
+	unsigned this_cpu;
+
+	if (time_after(local_clock_us(&this_cpu), timeout))
+		return true;
+
+	return this_cpu != cpu;
+}
+
 static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
 {
 	unsigned long timeout;
+	unsigned cpu;
+
+	/* When waiting for high frequency requests, e.g. during synchronous
+	 * rendering split between the CPU and GPU, the finite amount of time
+	 * required to set up the irq and wait upon it limits the response
+	 * rate. By busywaiting on the request completion for a short while we
+	 * can service the high frequency waits as quick as possible. However,
+	 * if it is a slow request, we want to sleep as quickly as possible.
+	 * The tradeoff between waiting and sleeping is roughly the time it
+	 * takes to sleep on a request, on the order of a microsecond.
+	 */
 
 	if (i915_gem_request_get_ring(req)->irq_refcount)
 		return -EBUSY;
 
-	timeout = jiffies + 1;
+	timeout = local_clock_us(&cpu) + 10;
 	while (!need_resched()) {
 		if (i915_gem_request_completed(req, true))
 			return 0;
@@ -1161,7 +1204,7 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
 		if (signal_pending_state(state, current))
 			break;
 
-		if (time_after_eq(jiffies, timeout))
+		if (busywait_stop(timeout, cpu))
 			break;
 
 		cpu_relax_lowlatency();
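
For readers who want to check the arithmetic behind the patch, here is a minimal userspace sketch in C. It is not kernel code and not part of the patch: `approx_us()`, `after()`, `should_stop()` and `selfcheck()` are hypothetical stand-ins for `local_clock() >> 10`, the kernel's `time_after()` and `busywait_stop()`, with the clock reading and CPU ids passed in as plain arguments. It illustrates two points from the commit message: shifting right by 10 divides by 1024 rather than 1000, so the "10 microsecond" budget is really about 10.24 us plus truncation error, and the signed-difference comparison stays correct when the `unsigned long` counter wraps.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for "local_clock() >> 10": divides nanoseconds by
 * 1024 instead of 1000, so one unit is 1.024 us and the result is truncated. */
static unsigned long approx_us(unsigned long long ns)
{
	return (unsigned long)(ns >> 10);
}

/* Hypothetical stand-in for the kernel's time_after(a, b): true if a is
 * strictly later than b. The signed difference keeps the answer correct
 * across wraparound while the two values are less than half the counter
 * range apart. The unsigned-to-signed conversion assumes the usual
 * two's complement behaviour (as on GCC and Clang). */
static bool after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Same shape as busywait_stop() in the patch, but with the current time
 * and CPU ids supplied by the caller instead of read from the system. */
static bool should_stop(unsigned long now_us, unsigned long timeout,
			unsigned this_cpu, unsigned start_cpu)
{
	if (after(now_us, timeout))
		return true;

	return this_cpu != start_cpu;
}

/* Small self-check; aborts via assert() if any expectation is wrong. */
static void selfcheck(void)
{
	unsigned long start = ULONG_MAX - 3;
	unsigned long timeout = start + 10;	/* wraps around to 6 */

	assert(approx_us(10000) == 9);		/* 10 us of ns reads as 9 units */
	assert(approx_us(10240) == 10);		/* 10 units span 10240 ns */
	assert(timeout == 6);
	assert(!should_stop(start, timeout, 0, 0));	  /* still inside budget */
	assert(!should_stop(timeout, timeout, 0, 0));	  /* comparison is strict */
	assert(should_stop(timeout + 1, timeout, 0, 0)); /* budget exceeded */
	assert(should_stop(start, timeout, 1, 0));	  /* CPU changed */
}
```

Calling `selfcheck()` from a `main()` exercises all four cases. In the real code the CPU id comes from `get_cpu()`, so a mismatch means the task migrated during the spin, which the patch treats as a sign of system load and a reason to stop busywaiting.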