[RFC,v2] drm/i915: Move execlists irq handler to a bottom half

Message ID 1458745056-25673-1-git-send-email-tvrtko.ursulin@linux.intel.com
State New

Commit Message

Tvrtko Ursulin March 23, 2016, 2:57 p.m. UTC
From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Doing a lot of work in the interrupt handler introduces huge
latencies to the system as a whole.

Most dramatic effect can be seen by running an all engine
stress test like igt/gem_exec_nop/all where, when the kernel
config is lean enough, the whole system can be brought into
multi-second periods of complete non-interactivity. That can
look for example like this:

 NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143]
 Modules linked in: [redacted for brevity]
 CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G     U       L  4.5.0-160321+ #183
 Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1
 Workqueue: i915 gen6_pm_rps_work [i915]
 task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000
 RIP: 0010:[<ffffffff8104a3c2>]  [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0
 RSP: 0000:ffff88014f403f38  EFLAGS: 00000206
 RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0
 RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80
 RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022
 R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030
 R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082
 FS:  0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Stack:
  042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a
  0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080
  0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8
 Call Trace:
  <IRQ>
  [<ffffffff8104a716>] irq_exit+0x86/0x90
  [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50
  [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90
  <EOI>
  [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915]
  [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20
  [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915]
  [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0
  [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915]
  [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915]
  [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915]
  [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915]
  [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0
  [<ffffffff8105ab29>] process_one_work+0x139/0x350
  [<ffffffff8105b186>] worker_thread+0x126/0x490
  [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320
  [<ffffffff8105fa64>] kthread+0xc4/0xe0
  [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
  [<ffffffff814f351f>] ret_from_fork+0x3f/0x70
  [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170

I could not find a code path which would explain a 20+
second lockup, but from some instrumentation it was apparent
that interrupts were disabled for 10-25% of the time under
heavy load, which is quite bad.

By moving the GT interrupt handling to a tasklet in the
simplest possible way, the problem above disappears completely.

Also, gem_latency -n 100 shows 25% better throughput and CPU
usage, and 14% better latencies.

I did not find any gains or regressions with Synmark2 or
GLbench under light testing. More benchmarking is certainly
required.

v2:
   * execlist_lock should be taken as spin_lock_bh when
     queuing work from userspace now. (Chris Wilson)
   * uncore.lock must be taken with spin_lock_irq when
     submitting requests since that now runs from either
     softirq or process context.
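
To illustrate the shape of the change (not the i915 code itself), here is a
minimal userspace model of the top half / bottom half split. All names
(model_engine, model_irq_top_half, model_run_softirq) are made up for
illustration, and the "pending" flag only approximates how a tasklet that is
already scheduled ignores further schedule requests until it has run:

```c
/* Userspace model only: none of these names are i915 or kernel APIs. */
#include <stdbool.h>

struct model_engine {
	bool tasklet_pending;    /* stands in for the tasklet "scheduled" state */
	unsigned int csb_events; /* context-switch events posted by "hardware" */
	unsigned int handled;    /* events consumed by the bottom half */
	unsigned int bh_runs;    /* how many times the bottom half really ran */
};

/* Top half: note the event and request deferred work, nothing more. */
void model_irq_top_half(struct model_engine *e)
{
	e->csb_events++;
	e->tasklet_pending = true; /* requesting again before a run is a no-op */
}

/* Bottom half: runs later and drains everything posted so far in one go. */
void model_run_softirq(struct model_engine *e)
{
	if (!e->tasklet_pending)
		return;

	e->tasklet_pending = false;
	e->bh_runs++;
	e->handled = e->csb_events;
}
```

Calling model_irq_top_half() three times and then model_run_softirq() once
leaves bh_runs at 1 and handled at 3: the hard IRQ path stays short and the
expensive part runs once, later, outside of hard IRQ context.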

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_irq.c         |  2 +-
 drivers/gpu/drm/i915/intel_lrc.c        | 24 ++++++++++++++----------
 drivers/gpu/drm/i915/intel_lrc.h        |  1 -
 drivers/gpu/drm/i915/intel_ringbuffer.h |  1 +
 4 files changed, 16 insertions(+), 12 deletions(-)

Comments

Chris Wilson March 24, 2016, 10:56 a.m. UTC | #1
On Wed, Mar 23, 2016 at 02:57:36PM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> Doing a lot of work in the interrupt handler introduces huge
> latencies to the system as a whole.
> 
> Most dramatic effect can be seen by running an all engine
> stress test like igt/gem_exec_nop/all where, when the kernel
> config is lean enough, the whole system can be brought into
> multi-second periods of complete non-interactivity. That can
> look for example like this:
> 
>  NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143]
>  Modules linked in: [redacted for brevity]
>  CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G     U       L  4.5.0-160321+ #183
>  Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1
>  Workqueue: i915 gen6_pm_rps_work [i915]
>  task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000
>  RIP: 0010:[<ffffffff8104a3c2>]  [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0
>  RSP: 0000:ffff88014f403f38  EFLAGS: 00000206
>  RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0
>  RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80
>  RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022
>  R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030
>  R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082
>  FS:  0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0
>  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>  Stack:
>   042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a
>   0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080
>   0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8
>  Call Trace:
>   <IRQ>
>   [<ffffffff8104a716>] irq_exit+0x86/0x90
>   [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50
>   [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90
>   <EOI>
>   [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915]
>   [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20
>   [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915]
>   [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0
>   [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915]
>   [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915]
>   [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915]
>   [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915]
>   [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0
>   [<ffffffff8105ab29>] process_one_work+0x139/0x350
>   [<ffffffff8105b186>] worker_thread+0x126/0x490
>   [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320
>   [<ffffffff8105fa64>] kthread+0xc4/0xe0
>   [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
>   [<ffffffff814f351f>] ret_from_fork+0x3f/0x70
>   [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
> 
> I could not find a code path which would explain a 20+
> second lockup, but from some instrumentation it was apparent
> that interrupts were disabled for 10-25% of the time under
> heavy load, which is quite bad.
> 
> By moving the GT interrupt handling to a tasklet in the
> simplest possible way, the problem above disappears completely.

Perfect segue into gem_syslatency. I think gem_syslatency is the better
tool to correlate disruptive system behaviour. And then continue on with
gem_latency to demonstrate that it doesn't adversely affect our
performance.

> Also, gem_latency -n 100 shows 25% better throughput and CPU
> usage, and 14% better latencies.

Mention the benefits of parallelising dispatch.

As fairly hit-and-miss as perf testing is on these machines, it is
looking in favour of using tasklets vs the rt kthread. The numbers swing
between 2-10%, but consistently improve in the nop sync latencies.
There are still several hours to go in this run before we cover the
dispatch latencies, but so far reasonable.

(Hmm, looks like there may be a possible degradation on the single nop
dispatch but an improvement on the continuous nop dispatch.)

> I did not find any gains or regressions with Synmark2 or
> GLbench under light testing. More benchmarking is certainly
> required.
> 
> v2:
>    * execlist_lock should be taken as spin_lock_bh when
>      queuing work from userspace now. (Chris Wilson)
>    * uncore.lock must be taken with spin_lock_irq when
>      submitting requests since that now runs from either
>      softirq or process context.

There are a couple of execlist_lock usages outside of intel_lrc that may
or may not be useful to convert (low frequency reset / debug paths, so
way off the critical paths, but consistency in locking is invaluable).

> +	tasklet_init(&engine->irq_tasklet, intel_lrc_irq_handler,
> +		     (unsigned long)engine);

I like trying to split lines to cluster arguments if possible. Here I
think intel_lrc_irq_handler pairs with engine,

	tasklet_init(&engine->irq_tasklet,
		     intel_lrc_irq_handler, (unsigned long)engine);

*shrug*

> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> index 221a94627aab..29810cba8a8c 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -266,6 +266,7 @@ struct  intel_engine_cs {
>  	} semaphore;
>  
>  	/* Execlists */
> +	struct tasklet_struct irq_tasklet;
>  	spinlock_t execlist_lock;

spinlock_t execlist_lock; /* used inside tasklet, use spin_lock_bh */

It's looking good, but once this run completes, I'm going to repeat it
just to confirm how stable my numbers are.

Critical bugfix, improvements, simpler patch than my kthread
implementation,
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris
Tvrtko Ursulin March 24, 2016, 11:50 a.m. UTC | #2
On 24/03/16 10:56, Chris Wilson wrote:
> On Wed, Mar 23, 2016 at 02:57:36PM +0000, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> Doing a lot of work in the interrupt handler introduces huge
>> latencies to the system as a whole.
>>
>> Most dramatic effect can be seen by running an all engine
>> stress test like igt/gem_exec_nop/all where, when the kernel
>> config is lean enough, the whole system can be brought into
>> multi-second periods of complete non-interactivity. That can
>> look for example like this:
>>
>>   NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143]
>>   Modules linked in: [redacted for brevity]
>>   CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G     U       L  4.5.0-160321+ #183
>>   Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1
>>   Workqueue: i915 gen6_pm_rps_work [i915]
>>   task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000
>>   RIP: 0010:[<ffffffff8104a3c2>]  [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0
>>   RSP: 0000:ffff88014f403f38  EFLAGS: 00000206
>>   RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0
>>   RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80
>>   RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022
>>   R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030
>>   R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082
>>   FS:  0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000
>>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>   CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0
>>   DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>   DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>   Stack:
>>    042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a
>>    0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080
>>    0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8
>>   Call Trace:
>>    <IRQ>
>>    [<ffffffff8104a716>] irq_exit+0x86/0x90
>>    [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50
>>    [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90
>>    <EOI>
>>    [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915]
>>    [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20
>>    [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915]
>>    [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0
>>    [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915]
>>    [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915]
>>    [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915]
>>    [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915]
>>    [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0
>>    [<ffffffff8105ab29>] process_one_work+0x139/0x350
>>    [<ffffffff8105b186>] worker_thread+0x126/0x490
>>    [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320
>>    [<ffffffff8105fa64>] kthread+0xc4/0xe0
>>    [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
>>    [<ffffffff814f351f>] ret_from_fork+0x3f/0x70
>>    [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
>>
>> I could not find a code path which would explain a 20+
>> second lockup, but from some instrumentation it was apparent
>> that interrupts were disabled for 10-25% of the time under
>> heavy load, which is quite bad.
>>
>> By moving the GT interrupt handling to a tasklet in the
>> simplest possible way, the problem above disappears completely.
>
> Perfect segue into gem_syslatency. I think gem_syslatency is the better
> tool to correlate disruptive system behaviour. And then continue on with
> gem_latency to demonstrate that it doesn't adversely affect our
> performance.

Will do.

>> Also, gem_latency -n 100 shows 25% better throughput and CPU
>> usage, and 14% better latencies.
>
> Mention the benefits of parallelising dispatch.

Hm, actually this should be the same as before I think.

> As fairly hit-and-miss as perf testing is on these machines, it is
> looking in favour of using tasklets vs the rt kthread. The numbers swing
> between 2-10%, but consistently improve in the nop sync latencies.
> There are still several hours to go in this run before we cover the
> dispatch latencies, but so far reasonable.
>
> (Hmm, looks like there may be a possible degradation on the single nop
> dispatch but an improvement on the continuous nop dispatch.)

We can add all the numbers you get to the commit message as well.

>> I did not find any gains or regressions with Synmark2 or
>> GLbench under light testing. More benchmarking is certainly
>> required.
>>
>> v2:
>>     * execlist_lock should be taken as spin_lock_bh when
>>       queuing work from userspace now. (Chris Wilson)
>>     * uncore.lock must be taken with spin_lock_irq when
>>       submitting requests since that now runs from either
>>       softirq or process context.
>
> There are a couple of execlist_lock usages outside of intel_lrc that may
> or may not be useful to convert (low frequency reset / debug paths, so
> way off the critical paths, but consistency in locking is invaluable).

Oh right, I've missed those.

>
>> +	tasklet_init(&engine->irq_tasklet, intel_lrc_irq_handler,
>> +		     (unsigned long)engine);
>
> I like trying to split lines to cluster arguments if possible. Here I
> think intel_lrc_irq_handler pairs with engine,
>
> 	tasklet_init(&engine->irq_tasklet,
> 		     intel_lrc_irq_handler, (unsigned long)engine);
>
> *shrug*

Yeah it is nicer.

>> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
>> index 221a94627aab..29810cba8a8c 100644
>> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
>> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
>> @@ -266,6 +266,7 @@ struct  intel_engine_cs {
>>   	} semaphore;
>>
>>   	/* Execlists */
>> +	struct tasklet_struct irq_tasklet;
>>   	spinlock_t execlist_lock;
>
> spinlock_t execlist_lock; /* used inside tasklet, use spin_lock_bh */

Will do.

> It's looking good, but once this run completes, I'm going to repeat it
> just to confirm how stable my numbers are.
>
> Critical bugfix, improvements, simpler patch than my kthread
> implementation,
> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>

Okay I will respin with the above and we'll see.

Unfortunately my test platform just died so there will be a delay.

Regards,

Tvrtko
Tvrtko Ursulin March 24, 2016, 12:58 p.m. UTC | #3
On 24/03/16 11:50, Tvrtko Ursulin wrote:
>>> Also, gem_latency -n 100 shows 25% better throughput and CPU
>>> usage, and 14% better latencies.
>>
>> Mention the benefits of parallelising dispatch.
>
> Hm, actually this should be the same as before I think.

Of course not, silly me. Will add this at the next opportunity then.

Regards,

Tvrtko
Imre Deak March 24, 2016, 3:56 p.m. UTC | #4
On Wed, 2016-03-23 at 14:57 +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> Doing a lot of work in the interrupt handler introduces huge
> latencies to the system as a whole.
> 
> Most dramatic effect can be seen by running an all engine
> stress test like igt/gem_exec_nop/all where, when the kernel
> config is lean enough, the whole system can be brought into
> multi-second periods of complete non-interactivity. That can
> look for example like this:
> 
>  NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143]
>  Modules linked in: [redacted for brevity]
>  CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G     U       L  4.5.0-160321+ #183
>  Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1
>  Workqueue: i915 gen6_pm_rps_work [i915]
>  task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000
>  RIP: 0010:[<ffffffff8104a3c2>]  [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0
>  RSP: 0000:ffff88014f403f38  EFLAGS: 00000206
>  RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0
>  RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80
>  RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022
>  R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030
>  R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082
>  FS:  0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0
>  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>  Stack:
>   042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a
>   0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080
>   0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8
>  Call Trace:
>   <IRQ>
>   [<ffffffff8104a716>] irq_exit+0x86/0x90
>   [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50
>   [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90
>   <EOI>
>   [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915]
>   [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20
>   [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915]
>   [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0
>   [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915]
>   [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915]
>   [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915]
>   [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915]
>   [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0
>   [<ffffffff8105ab29>] process_one_work+0x139/0x350
>   [<ffffffff8105b186>] worker_thread+0x126/0x490
>   [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320
>   [<ffffffff8105fa64>] kthread+0xc4/0xe0
>   [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
>   [<ffffffff814f351f>] ret_from_fork+0x3f/0x70
>   [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
> 
> I could not find a code path which would explain a 20+
> second lockup, but from some instrumentation it was apparent
> that interrupts were disabled for 10-25% of the time under
> heavy load, which is quite bad.
> 
> By moving the GT interrupt handling to a tasklet in the
> simplest possible way, the problem above disappears completely.
> 
> Also, gem_latency -n 100 shows 25% better throughput and CPU
> usage, and 14% better latencies.
> 
> I did not find any gains or regressions with Synmark2 or
> GLbench under light testing. More benchmarking is certainly
> required.
> 
> v2:
>    * execlist_lock should be taken as spin_lock_bh when
>      queuing work from userspace now. (Chris Wilson)
>    * uncore.lock must be taken with spin_lock_irq when
>      submitting requests since that now runs from either
>      softirq or process context.
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>

You also have to synchronize against the tasklet now whenever we
synchronize against the IRQ, see gen6_disable_rps_interrupts(),
gen8_irq_power_well_pre_disable() and
intel_runtime_pm_disable_interrupts(). Not saying you should use a
threaded IRQ instead, but it does provide for this automatically.
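
As a rough illustration of the concern (a userspace model under made-up
names, not driver code): quiescing only the interrupt source does not cancel
bottom-half work that was already scheduled, so a disable path would have to
flush it as well. In the kernel the flush step would presumably be something
like tasklet_kill(), which this patch already calls from
intel_logical_ring_cleanup():

```c
/* Userspace model only: all names here are hypothetical. */
#include <stdbool.h>

struct model_dev {
	bool irq_enabled;              /* interrupt source armed */
	bool tasklet_pending;          /* bottom half scheduled but not yet run */
	bool hw_powered;               /* device may be touched safely */
	unsigned int bh_runs;          /* bottom-half executions */
	unsigned int bh_ran_unpowered; /* executions after power was dropped */
};

/* Top half: only schedules the bottom half while the source is armed. */
void model_irq(struct model_dev *d)
{
	if (d->irq_enabled)
		d->tasklet_pending = true;
}

/* Bottom half: records whether it ran against powered-down hardware. */
void model_softirq(struct model_dev *d)
{
	if (!d->tasklet_pending)
		return;

	d->tasklet_pending = false;
	d->bh_runs++;
	if (!d->hw_powered)
		d->bh_ran_unpowered++;
}

/* Disable path that synchronizes against the IRQ only. */
void model_disable_irq_only(struct model_dev *d)
{
	d->irq_enabled = false;
	d->hw_powered = false;
}

/* Disable path that also flushes pending deferred work first. */
void model_disable_and_flush(struct model_dev *d)
{
	d->irq_enabled = false;
	model_softirq(d); /* stands in for waiting on the tasklet */
	d->hw_powered = false;
}
```

With one interrupt pending, model_disable_irq_only() followed by a later
model_softirq() bumps bh_ran_unpowered, while model_disable_and_flush() does
not. A threaded IRQ would get the second behaviour from synchronize_irq().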

--Imre

> ---
>  drivers/gpu/drm/i915/i915_irq.c         |  2 +-
>  drivers/gpu/drm/i915/intel_lrc.c        | 24 ++++++++++++++----------
>  drivers/gpu/drm/i915/intel_lrc.h        |  1 -
>  drivers/gpu/drm/i915/intel_ringbuffer.h |  1 +
>  4 files changed, 16 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 8f3e3309c3ab..e68134347007 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1324,7 +1324,7 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift)
>  	if (iir & (GT_RENDER_USER_INTERRUPT << test_shift))
>  		notify_ring(engine);
>  	if (iir & (GT_CONTEXT_SWITCH_INTERRUPT << test_shift))
> -		intel_lrc_irq_handler(engine);
> +		tasklet_schedule(&engine->irq_tasklet);
>  }
>  
>  static irqreturn_t gen8_gt_irq_handler(struct drm_i915_private *dev_priv,
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 67592f8354d6..b3b62b3cd90d 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -418,20 +418,18 @@ static void execlists_submit_requests(struct drm_i915_gem_request *rq0,
>  {
>  	struct drm_i915_private *dev_priv = rq0->i915;
>  
> -	/* BUG_ON(!irqs_disabled());  */
> -
>  	execlists_update_context(rq0);
>  
>  	if (rq1)
>  		execlists_update_context(rq1);
>  
> -	spin_lock(&dev_priv->uncore.lock);
> +	spin_lock_irq(&dev_priv->uncore.lock);
>  	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);
>  
>  	execlists_elsp_write(rq0, rq1);
>  
>  	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
> -	spin_unlock(&dev_priv->uncore.lock);
> +	spin_unlock_irq(&dev_priv->uncore.lock);
>  }
>  
>  static void execlists_context_unqueue(struct intel_engine_cs *engine)
> @@ -536,13 +534,14 @@ get_context_status(struct drm_i915_private *dev_priv, u32 csb_base,
>  
>  /**
>   * intel_lrc_irq_handler() - handle Context Switch interrupts
> - * @ring: Engine Command Streamer to handle.
> + * @engine: Engine Command Streamer to handle.
>   *
>   * Check the unread Context Status Buffers and manage the submission of new
>   * contexts to the ELSP accordingly.
>   */
> -void intel_lrc_irq_handler(struct intel_engine_cs *engine)
> +void intel_lrc_irq_handler(unsigned long data)
>  {
> +	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
>  	struct drm_i915_private *dev_priv = engine->dev->dev_private;
>  	u32 status_pointer;
>  	unsigned int read_pointer, write_pointer;
> @@ -551,7 +550,7 @@ void intel_lrc_irq_handler(struct intel_engine_cs *engine)
>  	unsigned int csb_read = 0, i;
>  	unsigned int submit_contexts = 0;
>  
> -	spin_lock(&dev_priv->uncore.lock);
> +	spin_lock_irq(&dev_priv->uncore.lock);
>  	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);
>  
>  	status_pointer = I915_READ_FW(RING_CONTEXT_STATUS_PTR(engine));
> @@ -579,7 +578,7 @@ void intel_lrc_irq_handler(struct intel_engine_cs *engine)
>  				    engine->next_context_status_buffer << 8));
>  
>  	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
> -	spin_unlock(&dev_priv->uncore.lock);
> +	spin_unlock_irq(&dev_priv->uncore.lock);
>  
>  	spin_lock(&engine->execlist_lock);
>  
> @@ -621,7 +620,7 @@ static void execlists_context_queue(struct drm_i915_gem_request *request)
>  
>  	i915_gem_request_reference(request);
>  
> -	spin_lock_irq(&engine->execlist_lock);
> +	spin_lock_bh(&engine->execlist_lock);
>  
>  	list_for_each_entry(cursor, &engine->execlist_queue, execlist_link)
>  		if (++num_elements > 2)
> @@ -646,7 +645,7 @@ static void execlists_context_queue(struct drm_i915_gem_request *request)
>  	if (num_elements == 0)
>  		execlists_context_unqueue(engine);
>  
> -	spin_unlock_irq(&engine->execlist_lock);
> +	spin_unlock_bh(&engine->execlist_lock);
>  }
>  
>  static int logical_ring_invalidate_all_caches(struct drm_i915_gem_request *req)
> @@ -2016,6 +2015,8 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *engine)
>  	if (!intel_engine_initialized(engine))
>  		return;
>  
> +	tasklet_kill(&engine->irq_tasklet);
> +
>  	dev_priv = engine->dev->dev_private;
>  
>  	if (engine->buffer) {
> @@ -2089,6 +2090,9 @@ logical_ring_init(struct drm_device *dev, struct intel_engine_cs *engine)
>  	INIT_LIST_HEAD(&engine->execlist_retired_req_list);
>  	spin_lock_init(&engine->execlist_lock);
>  
> +	tasklet_init(&engine->irq_tasklet, intel_lrc_irq_handler,
> +		     (unsigned long)engine);
> +
>  	logical_ring_init_platform_invariants(engine);
>  
>  	ret = i915_cmd_parser_init_ring(engine);
> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
> index 6690d93d603f..efcbd7bf9cc9 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.h
> +++ b/drivers/gpu/drm/i915/intel_lrc.h
> @@ -123,7 +123,6 @@ int intel_execlists_submission(struct i915_execbuffer_params *params,
>  			       struct drm_i915_gem_execbuffer2 *args,
>  			       struct list_head *vmas);
>  
> -void intel_lrc_irq_handler(struct intel_engine_cs *engine);
>  void intel_execlists_retire_requests(struct intel_engine_cs *engine);
>  
>  #endif /* _INTEL_LRC_H_ */
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> index 221a94627aab..29810cba8a8c 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -266,6 +266,7 @@ struct  intel_engine_cs {
>  	} semaphore;
>  
>  	/* Execlists */
> +	struct tasklet_struct irq_tasklet;
>  	spinlock_t execlist_lock;
>  	struct list_head execlist_queue;
>  	struct list_head execlist_retired_req_list;
Chris Wilson March 24, 2016, 4:05 p.m. UTC | #5
On Thu, Mar 24, 2016 at 05:56:40PM +0200, Imre Deak wrote:
> On Wed, 2016-03-23 at 14:57 +0000, Tvrtko Ursulin wrote:
> > From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > 
> > Doing a lot of work in the interrupt handler introduces huge
> > latencies to the system as a whole.
> > 
> > Most dramatic effect can be seen by running an all engine
> > stress test like igt/gem_exec_nop/all where, when the kernel
> > config is lean enough, the whole system can be brought into
> > multi-second periods of complete non-interactivity. That can
> > look for example like this:
> > 
> >  NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143]
> >  Modules linked in: [redacted for brevity]
> >  CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G     U       L  4.5.0-160321+ #183
> >  Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1
> >  Workqueue: i915 gen6_pm_rps_work [i915]
> >  task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000
> >  RIP: 0010:[<ffffffff8104a3c2>]  [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0
> >  RSP: 0000:ffff88014f403f38  EFLAGS: 00000206
> >  RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0
> >  RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80
> >  RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022
> >  R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030
> >  R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082
> >  FS:  0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000
> >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >  CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0
> >  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >  Stack:
> >   042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a
> >   0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080
> >   0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8
> >  Call Trace:
> >   <IRQ>
> >   [<ffffffff8104a716>] irq_exit+0x86/0x90
> >   [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50
> >   [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90
> >   <EOI>
> >   [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915]
> >   [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20
> >   [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915]
> >   [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0
> >   [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915]
> >   [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915]
> >   [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915]
> >   [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915]
> >   [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0
> >   [<ffffffff8105ab29>] process_one_work+0x139/0x350
> >   [<ffffffff8105b186>] worker_thread+0x126/0x490
> >   [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320
> >   [<ffffffff8105fa64>] kthread+0xc4/0xe0
> >   [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
> >   [<ffffffff814f351f>] ret_from_fork+0x3f/0x70
> >   [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
> > 
> > I could not find a code path which would explain a 20+
> > second lockup, but from some instrumentation it was apparent
> > that interrupts were disabled for 10-25% of the time under
> > heavy load, which is quite bad.
> > 
> > By moving the GT interrupt handling to a tasklet in the
> > simplest possible way, the problem above disappears completely.
> > 
> > Also, gem_latency -n 100 shows 25% better throughput and CPU
> > usage, and 14% better latencies.
> > 
> > I did not find any gains or regressions with Synmark2 or
> > GLbench under light testing. More benchmarking is certainly
> > required.
> > 
> > v2:
> >    * execlist_lock should be taken as spin_lock_bh when
> >      queuing work from userspace now. (Chris Wilson)
> >    * uncore.lock must be taken with spin_lock_irq when
> >      submitting requests since that now runs from either
> >      softirq or process context.
> > 
> > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> 
> You also have to synchronize against the tasklet now whenever we
> synchronize against the IRQ, see gen6_disable_rps_interrupts(),
> gen8_irq_power_well_pre_disable() and
> intel_runtime_pm_disable_interrupts(). Not saying you should use a
> threaded IRQ instead, but it does provide for this automatically.

But we don't synchronize against the irq for execlists since this
tasklet is guarded by the rpm wakeref (through mark_busy / mark_idle)
and we stop it before we finally release the irq. Or have I missed
something?
-Chris
Imre Deak March 24, 2016, 4:40 p.m. UTC | #6
On Thu, 2016-03-24 at 16:05 +0000, Chris Wilson wrote:
> On Thu, Mar 24, 2016 at 05:56:40PM +0200, Imre Deak wrote:
> > On Wed, 2016-03-23 at 14:57 +0000, Tvrtko Ursulin wrote:
> > > From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > 
> > > Doing a lot of work in the interrupt handler introduces huge
> > > latencies to the system as a whole.
> > > 
> > > Most dramatic effect can be seen by running an all engine
> > > stress test like igt/gem_exec_nop/all where, when the kernel
> > > config is lean enough, the whole system can be brought into
> > > multi-second periods of complete non-interactivity. That can
> > > look for example like this:
> > > 
> > >  NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143]
> > >  Modules linked in: [redacted for brevity]
> > >  CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G     U       L  4.5.0-160321+ #183
> > >  Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1
> > >  Workqueue: i915 gen6_pm_rps_work [i915]
> > >  task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000
> > >  RIP: 0010:[<ffffffff8104a3c2>]  [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0
> > >  RSP: 0000:ffff88014f403f38  EFLAGS: 00000206
> > >  RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0
> > >  RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80
> > >  RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022
> > >  R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030
> > >  R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082
> > >  FS:  0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000
> > >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >  CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0
> > >  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > >  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > >  Stack:
> > >   042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a
> > >   0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080
> > >   0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8
> > >  Call Trace:
> > >   <IRQ>
> > >   [<ffffffff8104a716>] irq_exit+0x86/0x90
> > >   [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50
> > >   [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90
> > >   <EOI>
> > >   [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915]
> > >   [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20
> > >   [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915]
> > >   [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0
> > >   [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915]
> > >   [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915]
> > >   [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915]
> > >   [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915]
> > >   [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0
> > >   [<ffffffff8105ab29>] process_one_work+0x139/0x350
> > >   [<ffffffff8105b186>] worker_thread+0x126/0x490
> > >   [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320
> > >   [<ffffffff8105fa64>] kthread+0xc4/0xe0
> > >   [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
> > >   [<ffffffff814f351f>] ret_from_fork+0x3f/0x70
> > >   [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
> > > 
> > > I could not explain, or find a code path, which would explain
> > > a +20 second lockup, but from some instrumentation it was
> > > apparent the interrupts off proportion of time was between
> > > 10-25% under heavy load which is quite bad.
> > > 
> > > By moving the GT interrupt handling to a tasklet in a most
> > > simple way, the problem above disappears completely.
> > > 
> > > Also, gem_latency -n 100 shows 25% better throughput and CPU
> > > usage, and 14% better latencies.
> > > 
> > > I did not find any gains or regressions with Synmark2 or
> > > GLbench under light testing. More benchmarking is certainly
> > > required.
> > > 
> > > v2:
> > >    * execlists_lock should be taken as spin_lock_bh when
> > >      queuing work from userspace now. (Chris Wilson)
> > >    * uncore.lock must be taken with spin_lock_irq when
> > >      submitting requests since that now runs from either
> > >      softirq or process context.
> > > 
> > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > 
> > You also have to synchronize against the tasklet now whenever we
> > synchronize against the IRQ, see gen6_disable_rps_interrupts(),
> > gen8_irq_power_well_pre_disable() and
> > intel_runtime_pm_disable_interrupts(). Not saying you should use a
> > threaded IRQ instead, but it does provide for this automatically.
> 
> But we don't synchronize against the irq for execlists since this
> tasklet is guarded by the rpm wakeref (through mark_busy / mark_idle)
> and we stop it before we finally release the irq. 

Hm yea, I missed that it's only an execlist tasklet and so there
shouldn't be any pending tasklet after mark_idle(). Perhaps it would
still make sense to assert for this in gen8_logical_ring_put_irq() or
somewhere? Similarly there is a tasklet_kill() in
intel_logical_ring_cleanup(), but there shouldn't be any pending
tasklet there either, so should we add an assert there too?

--Imre

> Or have I missed something?
> -Chris
>
Chris Wilson March 24, 2016, 7:56 p.m. UTC | #7
On Thu, Mar 24, 2016 at 06:40:55PM +0200, Imre Deak wrote:
> Hm yea, I missed that it's only an execlist tasklet and so there
> shouldn't be any pending tasklet after mark_idle(). Perhaps it would
> still make sense to assert for this in gen8_logical_ring_put_irq() or
> somewhere? Similarly there is a tasklet_kill() in
> intel_logical_ring_cleanup(), but there shouldn't be any pending
> tasklet there either, so should we add an assert there too?

Yes, tasklet_kill() should be a nop. We could

if (WARN_ON(test_bit(TASKLET_STATE_SCHED, &tasklet->state)))
	tasklet_kill(tasklet);

I don't see a particular sensible spot to assert that the engines are
off before irq uninstall other than the assertions we have in execlists
that irqs are actually enabled when we try to submit, and the battery
of WARNs we have for trying to access the hardware whilst !rpm.
-Chris
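
For illustration only, the check suggested above can be modelled in plain
userspace C. This is a rough sketch, not driver code: every model_* name
is a hypothetical stand-in invented for this example, and the real
tasklet_kill() waits for a pending or running tasklet to finish, whereas
the model merely clears the scheduled bit.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel's TASKLET_STATE_SCHED bit index. */
enum { MODEL_TASKLET_STATE_SCHED = 0 };

/* Hypothetical, heavily reduced model of a tasklet. */
struct model_tasklet {
	unsigned long state;	/* SCHED bit is set while the tasklet is queued */
	unsigned int warnings;	/* how many times the cleanup check fired */
};

static bool model_test_bit(int nr, const unsigned long *addr)
{
	return (*addr >> nr) & 1UL;
}

/* Queue the tasklet; returns false if it was already pending. */
static bool model_tasklet_schedule(struct model_tasklet *t)
{
	if (model_test_bit(MODEL_TASKLET_STATE_SCHED, &t->state))
		return false;
	t->state |= 1UL << MODEL_TASKLET_STATE_SCHED;
	return true;
}

/* Model of tasklet_kill(): only drops the pending bit, does not wait. */
static void model_tasklet_kill(struct model_tasklet *t)
{
	t->state &= ~(1UL << MODEL_TASKLET_STATE_SCHED);
}

/*
 * Cleanup path: nothing should be pending here, so warn if the bit is
 * still set and only then fall back to killing the tasklet.
 * Returns true if the unexpected pending state was seen.
 */
static bool model_engine_cleanup(struct model_tasklet *t)
{
	bool pending = model_test_bit(MODEL_TASKLET_STATE_SCHED, &t->state);

	if (pending) {
		t->warnings++;
		fprintf(stderr, "WARN: tasklet still scheduled at cleanup\n");
		model_tasklet_kill(t);
	}
	return pending;
}
```

In the expected case the cleanup does nothing and stays silent; the
warning only fires if something scheduled the tasklet after the engine
was supposed to be idle.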
Imre Deak March 24, 2016, 10:13 p.m. UTC | #8
On Thu, 2016-03-24 at 19:56 +0000, Chris Wilson wrote:
> On Thu, Mar 24, 2016 at 06:40:55PM +0200, Imre Deak wrote:
> > Hm yea, I missed that it's only an execlist tasklet and so there
> > shouldn't be any pending tasklet after mark_idle(). Perhaps it would
> > still make sense to assert for this in gen8_logical_ring_put_irq() or
> > somewhere? Similarly there is a tasklet_kill() in
> > intel_logical_ring_cleanup(), but there shouldn't be any pending
> > tasklet there either, so should we add an assert there too?
> 
> Yes, tasklet_kill() should be a nop. We could
> 
> if (WARN_ON(test_bit(TASKLET_STATE_SCHED, &tasklet->state)))
> 	tasklet_kill(tasklet);
> 
> I don't see a particular sensible spot to assert that the engines are
> off before irq uninstall other than the assertions we have in execlists
> that irqs are actually enabled when we try to submit, and the battery
> of WARNs we have for trying to access the hardware whilst !rpm.

Ok, this was just a hand-wavy idea then and also this tasklet isn't
much different from other work we schedule from the interrupt handler
and we don't have special checks for those either. The above WARN_ON
would be still useful for documentation imo.

--Imre

Patch

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 8f3e3309c3ab..e68134347007 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1324,7 +1324,7 @@  gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift)
 	if (iir & (GT_RENDER_USER_INTERRUPT << test_shift))
 		notify_ring(engine);
 	if (iir & (GT_CONTEXT_SWITCH_INTERRUPT << test_shift))
-		intel_lrc_irq_handler(engine);
+		tasklet_schedule(&engine->irq_tasklet);
 }
 
 static irqreturn_t gen8_gt_irq_handler(struct drm_i915_private *dev_priv,
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 67592f8354d6..b3b62b3cd90d 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -418,20 +418,18 @@  static void execlists_submit_requests(struct drm_i915_gem_request *rq0,
 {
 	struct drm_i915_private *dev_priv = rq0->i915;
 
-	/* BUG_ON(!irqs_disabled());  */
-
 	execlists_update_context(rq0);
 
 	if (rq1)
 		execlists_update_context(rq1);
 
-	spin_lock(&dev_priv->uncore.lock);
+	spin_lock_irq(&dev_priv->uncore.lock);
 	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);
 
 	execlists_elsp_write(rq0, rq1);
 
 	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
-	spin_unlock(&dev_priv->uncore.lock);
+	spin_unlock_irq(&dev_priv->uncore.lock);
 }
 
 static void execlists_context_unqueue(struct intel_engine_cs *engine)
@@ -536,13 +534,14 @@  get_context_status(struct drm_i915_private *dev_priv, u32 csb_base,
 
 /**
  * intel_lrc_irq_handler() - handle Context Switch interrupts
- * @ring: Engine Command Streamer to handle.
+ * @engine: Engine Command Streamer to handle.
  *
  * Check the unread Context Status Buffers and manage the submission of new
  * contexts to the ELSP accordingly.
  */
-void intel_lrc_irq_handler(struct intel_engine_cs *engine)
+void intel_lrc_irq_handler(unsigned long data)
 {
+	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
 	struct drm_i915_private *dev_priv = engine->dev->dev_private;
 	u32 status_pointer;
 	unsigned int read_pointer, write_pointer;
@@ -551,7 +550,7 @@  void intel_lrc_irq_handler(struct intel_engine_cs *engine)
 	unsigned int csb_read = 0, i;
 	unsigned int submit_contexts = 0;
 
-	spin_lock(&dev_priv->uncore.lock);
+	spin_lock_irq(&dev_priv->uncore.lock);
 	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);
 
 	status_pointer = I915_READ_FW(RING_CONTEXT_STATUS_PTR(engine));
@@ -579,7 +578,7 @@  void intel_lrc_irq_handler(struct intel_engine_cs *engine)
 				    engine->next_context_status_buffer << 8));
 
 	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
-	spin_unlock(&dev_priv->uncore.lock);
+	spin_unlock_irq(&dev_priv->uncore.lock);
 
 	spin_lock(&engine->execlist_lock);
 
@@ -621,7 +620,7 @@  static void execlists_context_queue(struct drm_i915_gem_request *request)
 
 	i915_gem_request_reference(request);
 
-	spin_lock_irq(&engine->execlist_lock);
+	spin_lock_bh(&engine->execlist_lock);
 
 	list_for_each_entry(cursor, &engine->execlist_queue, execlist_link)
 		if (++num_elements > 2)
@@ -646,7 +645,7 @@  static void execlists_context_queue(struct drm_i915_gem_request *request)
 	if (num_elements == 0)
 		execlists_context_unqueue(engine);
 
-	spin_unlock_irq(&engine->execlist_lock);
+	spin_unlock_bh(&engine->execlist_lock);
 }
 
 static int logical_ring_invalidate_all_caches(struct drm_i915_gem_request *req)
@@ -2016,6 +2015,8 @@  void intel_logical_ring_cleanup(struct intel_engine_cs *engine)
 	if (!intel_engine_initialized(engine))
 		return;
 
+	tasklet_kill(&engine->irq_tasklet);
+
 	dev_priv = engine->dev->dev_private;
 
 	if (engine->buffer) {
@@ -2089,6 +2090,9 @@  logical_ring_init(struct drm_device *dev, struct intel_engine_cs *engine)
 	INIT_LIST_HEAD(&engine->execlist_retired_req_list);
 	spin_lock_init(&engine->execlist_lock);
 
+	tasklet_init(&engine->irq_tasklet, intel_lrc_irq_handler,
+		     (unsigned long)engine);
+
 	logical_ring_init_platform_invariants(engine);
 
 	ret = i915_cmd_parser_init_ring(engine);
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index 6690d93d603f..efcbd7bf9cc9 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -123,7 +123,6 @@  int intel_execlists_submission(struct i915_execbuffer_params *params,
 			       struct drm_i915_gem_execbuffer2 *args,
 			       struct list_head *vmas);
 
-void intel_lrc_irq_handler(struct intel_engine_cs *engine);
 void intel_execlists_retire_requests(struct intel_engine_cs *engine);
 
 #endif /* _INTEL_LRC_H_ */
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 221a94627aab..29810cba8a8c 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -266,6 +266,7 @@  struct  intel_engine_cs {
 	} semaphore;
 
 	/* Execlists */
+	struct tasklet_struct irq_tasklet;
 	spinlock_t execlist_lock;
 	struct list_head execlist_queue;
 	struct list_head execlist_retired_req_list;