diff mbox

__i915_spin_request() sucks

Message ID 20151112221908.GA26194@nuc-i3427.alporthouse.com (mailing list archive)
State New, archived
Headers show

Commit Message

Chris Wilson Nov. 12, 2015, 10:19 p.m. UTC
On Thu, Nov 12, 2015 at 01:40:33PM -0700, Jens Axboe wrote:
> On 11/12/2015 01:36 PM, Jens Axboe wrote:
> >Hi,
> >
> >So a few months ago I got an XPS13 laptop, the one with the high res
> >screen. GUI performance was never really that great, I attributed it to
> >coming from a more powerful laptop, and the i915 driving a lot of
> >pixels. But yesterday I browsed from my wife's macbook, and was blown
> >away. Wow, scrolling in chrome SUCKS on the xps13. Not just scrolling,
> >basically anything in chrome. Molasses. So I got sick of it, fired up a
> >quick perf record, did a bunch of stuff in chrome. No super smoking
> >guns, but one thing did stick out - the path leading to
> >__i915_spin_request().

smoking gun pointing at the messenger normally.

> >So today, I figured I'd try just killing that spin. If it fails, we'll
> >punt to normal completions, so easy change. And wow, MASSIVE difference.
> >I can now scroll in chrome and not rage! It's like the laptop is 10x
> >faster now.
> >
> >Ran git blame, and found:
> >
> >commit 2def4ad99befa25775dd2f714fdd4d92faec6e34
> >Author: Chris Wilson <chris@chris-wilson.co.uk>
> >Date:   Tue Apr 7 16:20:41 2015 +0100
> >
> >     drm/i915: Optimistically spin for the request completion
> >
> >and read the commit message. Doesn't sound that impressive. Especially
> >not for something that screws up interactive performance by a LOT.
> >
> >What's the deal? Revert?

The tests that it improved the most were the latency sensitive tests and
since my Broadwell xps13 behaves itself, I'd like to understand how it
culminates in an interactivity loss.

1. Maybe it is the uninterruptible nature of the polling, making X's
SIGIO jerky:

gitt a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 19e8f5442cf8..8099c2a9f88e 100644

Or maybe it is an indirect effect, such as power balancing between the
CPU and GPU, or just thermal throttling, or it may be the task being
penalised for consuming its timeslice (for which any completion polling
seems susceptible).

> BTW, this:
> 
> "Limit the spinning to a single jiffie (~1us) at most"
> 
> is totally wrong. I have HZ=100 on my laptop. That's 10ms. 10ms!
> Even if I had HZ=1000, that'd still be 1ms of spinning. That's
> seriously screwed up, guys.

That's over and above the termination condition for blk_poll().
-Chris

Comments

Jens Axboe Nov. 12, 2015, 10:52 p.m. UTC | #1
On 11/12/2015 03:19 PM, Chris Wilson wrote:
>>> So today, I figured I'd try just killing that spin. If it fails, we'll
>>> punt to normal completions, so easy change. And wow, MASSIVE difference.
>>> I can now scroll in chrome and not rage! It's like the laptop is 10x
>>> faster now.
>>>
>>> Ran git blame, and found:
>>>
>>> commit 2def4ad99befa25775dd2f714fdd4d92faec6e34
>>> Author: Chris Wilson <chris@chris-wilson.co.uk>
>>> Date:   Tue Apr 7 16:20:41 2015 +0100
>>>
>>>      drm/i915: Optimistically spin for the request completion
>>>
>>> and read the commit message. Doesn't sound that impressive. Especially
>>> not for something that screws up interactive performance by a LOT.
>>>
>>> What's the deal? Revert?
>
> The tests that it improved the most were the latency sensitive tests and
> since my Broadwell xps13 behaves itself, I'd like to understand how it
> culminates in an interactivity loss.
>
> 1. Maybe it is the uninterruptible nature of the polling, making X's
> SIGIO jerky:

This one still feels bad.

> 2. Or maybe it is increased mutex contention:

And so does this one... I had to manually apply hunks 2-3, and after 
doing seat-of-the-pants testing for both variants, I confirmed with perf 
that we're still seeing a ton of time in __i915_wait_request() for both 
of them.

> Or maybe it is an indirect effect, such as power balancing between the
> CPU and GPU, or just thermal throttling, or it may be the task being
> penalised for consuming its timeslice (for which any completion polling
> seems susceptible).

Look, polls in the 1-10ms range are just insane. Either you botched the 
commit message and really meant "~1ms at most" and in which case I'd 
suspect you of smoking something good, or you hacked it up wrong and 
used jiffies when you really wanted to be using some other time check 
that really did give you 1us.

I'll take an IRQ over 10 msecs of busy looping on my laptop, thanks.

>> "Limit the spinning to a single jiffie (~1us) at most"
>>
>> is totally wrong. I have HZ=100 on my laptop. That's 10ms. 10ms!
>> Even if I had HZ=1000, that'd still be 1ms of spinning. That's
>> seriously screwed up, guys.
>
> That's over and above the termination condition for blk_poll().

?! And this is related, how? Comparing apples and oranges. One is a test 
opt-in feature for experimentation, the other is unconditionally enabled 
for everyone. I believe the commit even says so. See the difference? 
Would I use busy loop spinning waiting for rotating storage completions, 
which are in the 1-10ms range? No, with the reason being that the 
potential wins for spins are in the usec range.
Jens Axboe Nov. 12, 2015, 10:59 p.m. UTC | #2
On 11/12/2015 03:52 PM, Jens Axboe wrote:
> On 11/12/2015 03:19 PM, Chris Wilson wrote:
>>>> So today, I figured I'd try just killing that spin. If it fails, we'll
>>>> punt to normal completions, so easy change. And wow, MASSIVE
>>>> difference.
>>>> I can now scroll in chrome and not rage! It's like the laptop is 10x
>>>> faster now.
>>>>
>>>> Ran git blame, and found:
>>>>
>>>> commit 2def4ad99befa25775dd2f714fdd4d92faec6e34
>>>> Author: Chris Wilson <chris@chris-wilson.co.uk>
>>>> Date:   Tue Apr 7 16:20:41 2015 +0100
>>>>
>>>>      drm/i915: Optimistically spin for the request completion
>>>>
>>>> and read the commit message. Doesn't sound that impressive. Especially
>>>> not for something that screws up interactive performance by a LOT.
>>>>
>>>> What's the deal? Revert?
>>
>> The tests that it improved the most were the latency sensitive tests and
>> since my Broadwell xps13 behaves itself, I'd like to understand how it
>> culminates in an interactivity loss.
>>
>> 1. Maybe it is the uninterruptible nature of the polling, making X's
>> SIGIO jerky:
>
> This one still feels bad.
>
>> 2. Or maybe it is increased mutex contention:
>
> And so does this one... I had to manually apply hunks 2-3, and after
> doing seat-of-the-pants testing for both variants, I confirmed with perf
> that we're still seeing a ton of time in __i915_wait_request() for both
> of them.

I don't see how #2 could make any difference, you're passing in 0x3 hard 
coded for most call sites, so we poll. The ones that don't, pass a bool 
(?!).

I should note that with the basic patch of just never spinning, I don't 
see __i915_wait_request() in the profiles. At all.
Chris Wilson Nov. 13, 2015, 9:15 a.m. UTC | #3
On Thu, Nov 12, 2015 at 03:52:02PM -0700, Jens Axboe wrote:
> On 11/12/2015 03:19 PM, Chris Wilson wrote:
> >>>So today, I figured I'd try just killing that spin. If it fails, we'll
> >>>punt to normal completions, so easy change. And wow, MASSIVE difference.
> >>>I can now scroll in chrome and not rage! It's like the laptop is 10x
> >>>faster now.
> >>>
> >>>Ran git blame, and found:
> >>>
> >>>commit 2def4ad99befa25775dd2f714fdd4d92faec6e34
> >>>Author: Chris Wilson <chris@chris-wilson.co.uk>
> >>>Date:   Tue Apr 7 16:20:41 2015 +0100
> >>>
> >>>     drm/i915: Optimistically spin for the request completion
> >>>
> >>>and read the commit message. Doesn't sound that impressive. Especially
> >>>not for something that screws up interactive performance by a LOT.
> >>>
> >>>What's the deal? Revert?
> >
> >The tests that it improved the most were the latency sensitive tests and
> >since my Broadwell xps13 behaves itself, I'd like to understand how it
> >culminates in an interactivity loss.
> >
> >1. Maybe it is the uninterruptible nature of the polling, making X's
> >SIGIO jerky:
> 
> This one still feels bad.
> 
> >2. Or maybe it is increased mutex contention:
> 
> And so does this one... I had to manually apply hunks 2-3, and after
> doing seat-of-the-pants testing for both variants, I confirmed with
> perf that we're still seeing a ton of time in __i915_wait_request()
> for both of them.
> 
> >Or maybe it is an indirect effect, such as power balancing between the
> >CPU and GPU, or just thermal throttling, or it may be the task being
> >penalised for consuming its timeslice (for which any completion polling
> >seems susceptible).
> 
> Look, polls in the 1-10ms range are just insane. Either you botched
> the commit message and really meant "~1ms at most" and in which case
> I'd suspect you of smoking something good, or you hacked it up wrong
> and used jiffies when you really wanted to be using some other time
> check that really did give you 1us.

What other time check? I was under the impression of setting up a
hrtimer was expensive and jiffies was available.
 
> I'll take an IRQ over 10 msecs of busy looping on my laptop, thanks.
> 
> >>"Limit the spinning to a single jiffie (~1us) at most"
> >>
> >>is totally wrong. I have HZ=100 on my laptop. That's 10ms. 10ms!
> >>Even if I had HZ=1000, that'd still be 1ms of spinning. That's
> >>seriously screwed up, guys.
> >
> >That's over and above the termination condition for blk_poll().
> 
> ?! And this is related, how? Comparing apples and oranges. One is a
> test opt-in feature for experimentation, the other is
> unconditionally enabled for everyone. I believe the commit even says
> so. See the difference? Would I use busy loop spinning waiting for
> rotating storage completions, which are in the 1-10ms range? No,
> with the reason being that the potential wins for spins are in the
> usec range.

Equally I expect the service interval for a batch to be around 2-20us.
There are many workloads that execute 30-50k requests/s, and you can
appreciate that they are sensitive to the latency in setting up an irq
and receiving it - just as equally leaving that irq enabled saturates a
CPU with simply executing the irq handler. So what mechanism do you use
to guard against either a very long queue depth or an abnormal request
causing msec+ spins?
-Chris
Mike Galbraith Nov. 13, 2015, 4:13 p.m. UTC | #4
On Fri, 2015-11-13 at 08:36 -0700, Jens Axboe wrote:
> Previous patch was obvious pre-coffee crap, this should work for using
> ktime to spin max 1usec.

1us seems a tad low.  I doubt the little wooden gears and pulleys of my
core2 Toshiba Satellite lappy can get one loop ground out in a usec :)

> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 5cf4a1998273..21192e55c33c 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -1148,17 +1148,18 @@ static bool missed_irq(struct drm_i915_private *dev_priv,
>  
>  static int __i915_spin_request(struct drm_i915_gem_request *req)
>  {
> -	unsigned long timeout;
> +	ktime_t timeout;
>  
>  	if (i915_gem_request_get_ring(req)->irq_refcount)
>  		return -EBUSY;
>  
> -	timeout = jiffies + 1;
> +	timeout = ktime_get();
> +	ktime_add_us(timeout, 1);
>  	while (!need_resched()) {
>  		if (i915_gem_request_completed(req, true))
>  			return 0;
>  
> -		if (time_after_eq(jiffies, timeout))
> +		if (ktime_after(ktime_get(), timeout))
>  			break;
>  
>  		cpu_relax_lowlatency();
>
Jens Axboe Nov. 13, 2015, 4:22 p.m. UTC | #5
On 11/13/2015 09:13 AM, Mike Galbraith wrote:
> On Fri, 2015-11-13 at 08:36 -0700, Jens Axboe wrote:
>> Previous patch was obvious pre-coffee crap, this should work for using
>> ktime to spin max 1usec.
>
> 1us seems a tad low.  I doubt the little wooden gears and pulleys of my
> core2 Toshiba Satellite lappy can get one loop ground out in a usec :)

Maybe it is, it's based off the original intent of the function, though. 
See the original commit referenced.
Chris Wilson Nov. 13, 2015, 10:12 p.m. UTC | #6
On Fri, Nov 13, 2015 at 09:22:52AM -0700, Jens Axboe wrote:
> On 11/13/2015 09:13 AM, Mike Galbraith wrote:
> >On Fri, 2015-11-13 at 08:36 -0700, Jens Axboe wrote:
> >>Previous patch was obvious pre-coffee crap, this should work for using
> >>ktime to spin max 1usec.
> >
> >1us seems a tad low.  I doubt the little wooden gears and pulleys of my
> >core2 Toshiba Satellite lappy can get one loop ground out in a usec :)
> 
> Maybe it is, it's based off the original intent of the function,
> though. See the original commit referenced.

I've been looking at numbers from one laptop and I can set the timeout
at 2us before we see a steep decline in what is more or less synchronous
request handling (which affects a variety of rendering workloads).

Looking around, other busy loops seem to use local_clock() (i.e. rdstcll
with a fair wind). Is that worth using here?
-Chris
Jens Axboe Nov. 13, 2015, 10:16 p.m. UTC | #7
On 11/13/2015 03:12 PM, Chris Wilson wrote:
> On Fri, Nov 13, 2015 at 09:22:52AM -0700, Jens Axboe wrote:
>> On 11/13/2015 09:13 AM, Mike Galbraith wrote:
>>> On Fri, 2015-11-13 at 08:36 -0700, Jens Axboe wrote:
>>>> Previous patch was obvious pre-coffee crap, this should work for using
>>>> ktime to spin max 1usec.
>>>
>>> 1us seems a tad low.  I doubt the little wooden gears and pulleys of my
>>> core2 Toshiba Satellite lappy can get one loop ground out in a usec :)
>>
>> Maybe it is, it's based off the original intent of the function,
>> though. See the original commit referenced.
>
> I've been looking at numbers from one laptop and I can set the timeout
> at 2us before we see a steep decline in what is more or less synchronous
> request handling (which affects a variety of rendering workloads).

Alright, at least that's a vast improvement from 10ms. If you send me 
something tested, I can try it here.

> Looking around, other busy loops seem to use local_clock() (i.e. rdstcll
> with a fair wind). Is that worth using here?

Honestly, don't think it matters too much for this case. You'd have to 
disable preempt to use local_clock(), fwiw. It is a faster variant 
though, but the RT people might hate you for 2us preempt disables :-)
diff mbox

Patch

--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1146,7 +1146,7 @@  static bool missed_irq(struct drm_i915_private *dev_priv,
        return test_bit(ring->id, &dev_priv->gpu_error.missed_irq_rings);
 }
 
-static int __i915_spin_request(struct drm_i915_gem_request *req)
+static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
 {
        unsigned long timeout;
 
@@ -1161,6 +1161,9 @@  static int __i915_spin_request(struct drm_i915_gem_request *req)
                if (time_after_eq(jiffies, timeout))
                        break;
 
+               if (signal_pending_state(state, current))
+                       break;
+
                cpu_relax_lowlatency();
        }
        if (i915_gem_request_completed(req, false))
@@ -1197,6 +1200,7 @@  int __i915_wait_request(struct drm_i915_gem_request *req,
        struct drm_i915_private *dev_priv = dev->dev_private;
        const bool irq_test_in_progress =
                ACCESS_ONCE(dev_priv->gpu_error.test_irq_rings) & intel_ring_flag(ring);
+       int state = interruptible ? TASK_INTERRUPTIBLE : TASK_UNINTERRUPTIBLE;
        DEFINE_WAIT(wait);
        unsigned long timeout_expire;
        s64 before, now;
@@ -1221,7 +1225,7 @@  int __i915_wait_request(struct drm_i915_gem_request *req,
        before = ktime_get_raw_ns();
 
        /* Optimistic spin for the next jiffie before touching IRQs */
-       ret = __i915_spin_request(req);
+       ret = __i915_spin_request(req, state);
        if (ret == 0)
                goto out;
 
@@ -1233,8 +1237,7 @@  int __i915_wait_request(struct drm_i915_gem_request *req,
        for (;;) {
                struct timer_list timer;
 
-               prepare_to_wait(&ring->irq_queue, &wait,
-                               interruptible ? TASK_INTERRUPTIBLE : TASK_UNINTERRUPTIBLE);
+               prepare_to_wait(&ring->irq_queue, &wait, state);
 
                /* We need to check whether any gpu reset happened in between
                 * the caller grabbing the seqno and now ... */
@@ -1252,7 +1255,7 @@  int __i915_wait_request(struct drm_i915_gem_request *req,
                        break;
                }
 
-               if (interruptible && signal_pending(current)) {
+               if (signal_pending_state(state, current)) {
                        ret = -ERESTARTSYS;
                        break;
                }

2. Or maybe it is increased mutex contention:

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index a275b0478200..1e52a7444e0c 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2973,9 +2973,11 @@  void __i915_add_request(struct drm_i915_gem_request *req,
        __i915_add_request(req, NULL, false)
 int __i915_wait_request(struct drm_i915_gem_request *req,
                        unsigned reset_counter,
-                       bool interruptible,
+                       unsigned flags,
                        s64 *timeout,
                        struct intel_rps_client *rps);
+#define WAIT_INTERRUPTIBLE 0x1
+#define WAIT_UNLOCKED 0x2
 int __must_check i915_wait_request(struct drm_i915_gem_request *req);
 int i915_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf);
 int __must_check
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 8099c2a9f88e..ce17d42f1c62 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1191,7 +1191,7 @@  static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
  */
 int __i915_wait_request(struct drm_i915_gem_request *req,
                        unsigned reset_counter,
-                       bool interruptible,
+                       unsigned flags,
                        s64 *timeout,
                        struct intel_rps_client *rps)
 {
@@ -1200,7 +1200,7 @@  int __i915_wait_request(struct drm_i915_gem_request *req,
        struct drm_i915_private *dev_priv = dev->dev_private;
        const bool irq_test_in_progress =
                ACCESS_ONCE(dev_priv->gpu_error.test_irq_rings) & intel_ring_flag(ring);
-       int state = interruptible ? TASK_INTERRUPTIBLE : TASK_UNINTERRUPTIBLE;
+       int state = flags & WAIT_INTERRUPTIBLE ? TASK_INTERRUPTIBLE : TASK_UNINTERRUPTIBLE;
        DEFINE_WAIT(wait);
        unsigned long timeout_expire;
        s64 before, now;
@@ -1225,9 +1225,11 @@  int __i915_wait_request(struct drm_i915_gem_request *req,
        before = ktime_get_raw_ns();
 
        /* Optimistic spin for the next jiffie before touching IRQs */
-       ret = __i915_spin_request(req, state);
-       if (ret == 0)
-               goto out;
+       if (flags & WAIT_UNLOCKED) {
+               ret = __i915_spin_request(req, state);
+               if (ret == 0)
+                       goto out;
+       }
 
        if (!irq_test_in_progress && WARN_ON(!ring->irq_get(ring))) {
                ret = -ENODEV;
@@ -1244,7 +1246,8 @@  int __i915_wait_request(struct drm_i915_gem_request *req,
                if (reset_counter != atomic_read(&dev_priv->gpu_error.reset_counter)) {
                        /* ... but upgrade the -EAGAIN to an -EIO if the gpu
                         * is truely gone. */
-                       ret = i915_gem_check_wedge(&dev_priv->gpu_error, interruptible);
+                       ret = i915_gem_check_wedge(&dev_priv->gpu_error,
+                                                  flags & WAIT_INTERRUPTIBLE);
                        if (ret == 0)
                                ret = -EAGAIN;
                        break;
@@ -1532,7 +1535,7 @@  i915_gem_object_wait_rendering__nonblocking(struct drm_i915_gem_object *obj,
 
        mutex_unlock(&dev->struct_mutex);
        for (i = 0; ret == 0 && i < n; i++)
-               ret = __i915_wait_request(requests[i], reset_counter, true,
+               ret = __i915_wait_request(requests[i], reset_counter, 0x3,
                                          NULL, rps);
        mutex_lock(&dev->struct_mutex);
 
@@ -3067,7 +3070,7 @@  i915_gem_wait_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 
        for (i = 0; i < n; i++) {
                if (ret == 0)
-                       ret = __i915_wait_request(req[i], reset_counter, true,
+                       ret = __i915_wait_request(req[i], reset_counter, 0x3,
                                                  args->timeout_ns > 0 ? &args->timeout_ns : NULL,
                                                  file->driver_priv);
                i915_gem_request_unreference__unlocked(req[i]);
@@ -4043,7 +4046,7 @@  i915_gem_ring_throttle(struct drm_device *dev, struct drm_file *file)
        if (target == NULL)
                return 0;
 
-       ret = __i915_wait_request(target, reset_counter, true, NULL, NULL);
+       ret = __i915_wait_request(target, reset_counter, 0x3, NULL, NULL);
        if (ret == 0)
                queue_delayed_work(dev_priv->wq, &dev_priv->mm.retire_work, 0);