[RFC] drm/i915: Move execlists irq handler to a bottom half

Message ID 1458667804-14121-1-git-send-email-tvrtko.ursulin@linux.intel.com
State New

Commit Message

Tvrtko Ursulin March 22, 2016, 5:30 p.m. UTC
From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Doing a lot of work in the interrupt handler introduces huge
latencies to the system as a whole.

The most dramatic effect can be seen by running an all-engine
stress test like igt/gem_exec_nop/all where, when the kernel
config is lean enough, the whole system can be brought into
multi-second periods of complete non-interactivity. That can
look for example like this:

 NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143]
 Modules linked in: [redacted for brevity]
 CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G     U       L  4.5.0-160321+ #183
 Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1
 Workqueue: i915 gen6_pm_rps_work [i915]
 task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000
 RIP: 0010:[<ffffffff8104a3c2>]  [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0
 RSP: 0000:ffff88014f403f38  EFLAGS: 00000206
 RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0
 RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80
 RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022
 R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030
 R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082
 FS:  0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Stack:
  042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a
  0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080
  0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8
 Call Trace:
  <IRQ>
  [<ffffffff8104a716>] irq_exit+0x86/0x90
  [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50
  [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90
  <EOI>
  [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915]
  [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20
  [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915]
  [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0
  [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915]
  [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915]
  [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915]
  [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915]
  [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0
  [<ffffffff8105ab29>] process_one_work+0x139/0x350
  [<ffffffff8105b186>] worker_thread+0x126/0x490
  [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320
  [<ffffffff8105fa64>] kthread+0xc4/0xe0
  [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
  [<ffffffff814f351f>] ret_from_fork+0x3f/0x70
  [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170

I could not explain, or find a code path, which would explain
a 20+ second lockup, but from some instrumentation it was
apparent that the interrupts-off proportion of time was between
10-25% under heavy load, which is quite bad.

By moving the GT interrupt handling to a tasklet in the most
simple way, the problem above disappears completely.

Also, gem_latency -n 100 shows 25% better throughput and CPU
usage, and 14% better latencies.

I did not find any gains or regressions with Synmark2 or
GLbench under light testing. More benchmarking is certainly
required.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_irq.c         |  2 +-
 drivers/gpu/drm/i915/intel_lrc.c        | 19 +++++++++++++------
 drivers/gpu/drm/i915/intel_lrc.h        |  1 -
 drivers/gpu/drm/i915/intel_ringbuffer.h |  1 +
 4 files changed, 15 insertions(+), 8 deletions(-)

Comments

Daniel Vetter March 23, 2016, 9:07 a.m. UTC | #1
On Tue, Mar 22, 2016 at 05:30:04PM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> [commit message and soft lockup trace snipped]
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>

I thought tasklets are considered unpopular nowadays? They still steal cpu
time, just have the benefit of not also disabling hard interrupts. There
should be mitigation though to offload these softinterrupts to threads.
Have you tried to create a threaded interrupt thread just for these pins
instead? A bit of boilerplate, but not much using the genirq stuff iirc.

Anyway just an idea to play with/benchmark on top of this one here.
-Daniel

> ---
>  drivers/gpu/drm/i915/i915_irq.c         |  2 +-
>  drivers/gpu/drm/i915/intel_lrc.c        | 19 +++++++++++++------
>  drivers/gpu/drm/i915/intel_lrc.h        |  1 -
>  drivers/gpu/drm/i915/intel_ringbuffer.h |  1 +
>  4 files changed, 15 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 8f3e3309c3ab..e68134347007 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1324,7 +1324,7 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift)
>  	if (iir & (GT_RENDER_USER_INTERRUPT << test_shift))
>  		notify_ring(engine);
>  	if (iir & (GT_CONTEXT_SWITCH_INTERRUPT << test_shift))
> -		intel_lrc_irq_handler(engine);
> +		tasklet_schedule(&engine->irq_tasklet);
>  }
>  
>  static irqreturn_t gen8_gt_irq_handler(struct drm_i915_private *dev_priv,
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 6916991bdceb..283426c02f8b 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -538,21 +538,23 @@ get_context_status(struct intel_engine_cs *engine, unsigned int read_pointer,
>  
>  /**
>   * intel_lrc_irq_handler() - handle Context Switch interrupts
> - * @ring: Engine Command Streamer to handle.
> + * @engine: Engine Command Streamer to handle.
>   *
>   * Check the unread Context Status Buffers and manage the submission of new
>   * contexts to the ELSP accordingly.
>   */
> -void intel_lrc_irq_handler(struct intel_engine_cs *engine)
> +void intel_lrc_irq_handler(unsigned long data)
>  {
> +	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
>  	struct drm_i915_private *dev_priv = engine->dev->dev_private;
>  	u32 status_pointer;
>  	unsigned int read_pointer, write_pointer;
>  	u32 csb[GEN8_CSB_ENTRIES][2];
>  	unsigned int csb_read = 0, i;
>  	unsigned int submit_contexts = 0;
> +	unsigned long flags;
>  
> -	spin_lock(&dev_priv->uncore.lock);
> +	spin_lock_irqsave(&dev_priv->uncore.lock, flags);
>  	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);
>  
>  	status_pointer = I915_READ_FW(RING_CONTEXT_STATUS_PTR(engine));
> @@ -579,9 +581,9 @@ void intel_lrc_irq_handler(struct intel_engine_cs *engine)
>  				    engine->next_context_status_buffer << 8));
>  
>  	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
> -	spin_unlock(&dev_priv->uncore.lock);
> +	spin_unlock_irqrestore(&dev_priv->uncore.lock, flags);
>  
> -	spin_lock(&engine->execlist_lock);
> +	spin_lock_irqsave(&engine->execlist_lock, flags);
>  
>  	for (i = 0; i < csb_read; i++) {
>  		if (unlikely(csb[i][0] & GEN8_CTX_STATUS_PREEMPTED)) {
> @@ -604,7 +606,7 @@ void intel_lrc_irq_handler(struct intel_engine_cs *engine)
>  			execlists_context_unqueue(engine);
>  	}
>  
> -	spin_unlock(&engine->execlist_lock);
> +	spin_unlock_irqrestore(&engine->execlist_lock, flags);
>  
>  	if (unlikely(submit_contexts > 2))
>  		DRM_ERROR("More than two context complete events?\n");
> @@ -2020,6 +2022,8 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *engine)
>  	if (!intel_engine_initialized(engine))
>  		return;
>  
> +	tasklet_kill(&engine->irq_tasklet);
> +
>  	dev_priv = engine->dev->dev_private;
>  
>  	if (engine->buffer) {
> @@ -2093,6 +2097,9 @@ logical_ring_init(struct drm_device *dev, struct intel_engine_cs *engine)
>  	INIT_LIST_HEAD(&engine->execlist_retired_req_list);
>  	spin_lock_init(&engine->execlist_lock);
>  
> +	tasklet_init(&engine->irq_tasklet, intel_lrc_irq_handler,
> +		     (unsigned long)engine);
> +
>  	logical_ring_init_platform_invariants(engine);
>  
>  	ret = i915_cmd_parser_init_ring(engine);
> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
> index a17cb12221ba..0b0853eee91e 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.h
> +++ b/drivers/gpu/drm/i915/intel_lrc.h
> @@ -118,7 +118,6 @@ int intel_execlists_submission(struct i915_execbuffer_params *params,
>  			       struct drm_i915_gem_execbuffer2 *args,
>  			       struct list_head *vmas);
>  
> -void intel_lrc_irq_handler(struct intel_engine_cs *engine);
>  void intel_execlists_retire_requests(struct intel_engine_cs *engine);
>  
>  #endif /* _INTEL_LRC_H_ */
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> index 221a94627aab..29810cba8a8c 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -266,6 +266,7 @@ struct  intel_engine_cs {
>  	} semaphore;
>  
>  	/* Execlists */
> +	struct tasklet_struct irq_tasklet;
>  	spinlock_t execlist_lock;
>  	struct list_head execlist_queue;
>  	struct list_head execlist_retired_req_list;
> -- 
> 1.9.1
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Chris Wilson March 23, 2016, 9:14 a.m. UTC | #2
On Wed, Mar 23, 2016 at 10:07:35AM +0100, Daniel Vetter wrote:
> On Tue, Mar 22, 2016 at 05:30:04PM +0000, Tvrtko Ursulin wrote:
> > From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > 
> > [commit message and lockup trace snipped]
> > 
> > Also, gem_latency -n 100 shows 25% better throughput and CPU
> > usage, and 14% better latencies.

Forgot gem_syslatency, since that does reflect UX under load really
startlingly well.

> > I did not find any gains or regressions with Synmark2 or
> > GLbench under light testing. More benchmarking is certainly
> > required.
> > 

Bugzilla?
> > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> 
> I thought tasklets are considered unpopular nowadays? They still steal cpu
> time, just have the benefit of not also disabling hard interrupts. There
> should be mitigation though to offload these softinterrupts to threads.
> Have you tried to create a threaded interrupt thread just for these pins
> instead? A bit of boilerplate, but not much using the genirq stuff iirc.

Ah, you haven't been reading patches. Yes, there's been a patch to fix
the hardlockup using kthreads for a few months. Tvrtko is trying to move
this forward since he too has found a way of locking up his machine
using execlist under load.


So far kthreads seems to have a slight edge in the benchmarks, or rather
using tasklet I have some very wild results on Braswell. Using tasklets
the CPU time is accounted to the process (i.e. whoever was running at
the time of the irq, typically the benchmark), using kthread we have
independent entries in the process table/top (which is quite nice to see
just how much time has been eaten up by the context switches).

Benchmarks still progressing, haven't yet got on to the latency
measurements....
-Chris
Tvrtko Ursulin March 23, 2016, 10:08 a.m. UTC | #3
On 23/03/16 09:14, Chris Wilson wrote:
> On Wed, Mar 23, 2016 at 10:07:35AM +0100, Daniel Vetter wrote:
>> On Tue, Mar 22, 2016 at 05:30:04PM +0000, Tvrtko Ursulin wrote:
>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>
>>> [commit message and lockup trace snipped]
>>> Also, gem_latency -n 100 shows 25% better throughput and CPU
>>> usage, and 14% better latencies.
>
> Forgot gem_syslatency, since that does reflect UX under load really
> startlingly well.

gem_syslatency, before:

gem_syslatency: cycles=1532739, latency mean=416531.829us max=2499237us
gem_syslatency: cycles=1839434, latency mean=1458099.157us max=4998944us
gem_syslatency: cycles=1432570, latency mean=2688.451us max=1201185us
gem_syslatency: cycles=1533543, latency mean=416520.499us max=2498886us

with tasklet:

gem_syslatency: cycles=808907, latency mean=53.133us max=1640us
gem_syslatency: cycles=862154, latency mean=62.778us max=2117us
gem_syslatency: cycles=856039, latency mean=58.079us max=2123us
gem_syslatency: cycles=841683, latency mean=56.914us max=1667us

Is this smaller throughput and better latency?

gem_syslatency -n, before:

gem_syslatency: cycles=0, latency mean=2.446us max=18us
gem_syslatency: cycles=0, latency mean=7.220us max=37us
gem_syslatency: cycles=0, latency mean=6.949us max=36us
gem_syslatency: cycles=0, latency mean=5.931us max=39us

with tasklet:

gem_syslatency: cycles=0, latency mean=2.477us max=5us
gem_syslatency: cycles=0, latency mean=2.471us max=6us
gem_syslatency: cycles=0, latency mean=2.696us max=24us
gem_syslatency: cycles=0, latency mean=6.414us max=39us

This looks potentially the same or very similar. May need more runs to 
get a definitive picture.

gem_exec_nop/all shows a huge improvement also, if we ignore the fact
that it locks up the system with the current irq handler at full tilt.
When I limit the max GPU frequency a bit it can avoid that problem, but
tasklets make it twice as fast here.

>>> I did not find any gains or regressions with Synmark2 or
>>> GLbench under light testing. More benchmarking is certainly
>>> required.
>>>
>
> Bugzilla?

You think it is OK to continue sharing your one, 
https://bugs.freedesktop.org/show_bug.cgi?id=93467?

>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>
>> I thought tasklets are considered unpopular nowadays? They still steal cpu

Did not know, last (and first) time I've used them was ~15 years ago. :) 
You got any links to read about it? Since (see below) I am not sure they 
"steal" CPU time.

>> time, just have the benefit of not also disabling hard interrupts. There
>> should be mitigation though to offload these softinterrupts to threads.
>> Have you tried to create a threaded interrupt thread just for these pins
>> instead? A bit of boilerplate, but not much using the genirq stuff iirc.
>
> Ah, you haven't been reading patches. Yes, there's been a patch to fix
> the hardlockup using kthreads for a few months. Tvrtko is trying to move
> this forward since he too has found a way of locking up his machine
> using execlist under load.

Correct.

> So far kthreads seems to have a slight edge in the benchmarks, or rather
> using tasklet I have some very wild results on Braswell. Using tasklets
> the CPU time is accounted to the process (i.e. whoever was running at
> the time of the irq, typically the benchmark), using kthread we have

I thought they run from ksoftirqd so the CPU time is accounted against
it. And looking at top, that seems to be what actually happens.

> independent entries in the process table/top (which is quite nice to see
> just how much time has been eaten up by the context switches).
>
> Benchmarks still progressing, haven't yet got on to the latency
> measurements....

My tasklet hack required surprisingly little code change, at least if
there are no missed corner cases to handle, but I don't mind your
threads either.

Regards,

Tvrtko
Chris Wilson March 23, 2016, 11:31 a.m. UTC | #4
On Wed, Mar 23, 2016 at 10:08:46AM +0000, Tvrtko Ursulin wrote:
> 
> On 23/03/16 09:14, Chris Wilson wrote:
> >On Wed, Mar 23, 2016 at 10:07:35AM +0100, Daniel Vetter wrote:
> >>On Tue, Mar 22, 2016 at 05:30:04PM +0000, Tvrtko Ursulin wrote:
> >>>From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>
> >>>[commit message and lockup trace snipped]
> >>>Also, gem_latency -n 100 shows 25% better throughput and CPU
> >>>usage, and 14% better latencies.
> >
> >Forgot gem_syslatency, since that does reflect UX under load really
> >startlingly well.
> 
> gem_syslatency, before:
> 
> gem_syslatency: cycles=1532739, latency mean=416531.829us max=2499237us
> gem_syslatency: cycles=1839434, latency mean=1458099.157us max=4998944us
> gem_syslatency: cycles=1432570, latency mean=2688.451us max=1201185us
> gem_syslatency: cycles=1533543, latency mean=416520.499us max=2498886us
> 
> with tasklet:
> 
> gem_syslatency: cycles=808907, latency mean=53.133us max=1640us
> gem_syslatency: cycles=862154, latency mean=62.778us max=2117us
> gem_syslatency: cycles=856039, latency mean=58.079us max=2123us
> gem_syslatency: cycles=841683, latency mean=56.914us max=1667us
> 
> Is this smaller throughput and better latency?

Yeah. I wasn't expecting the smaller throughput, but the impact on other
users is massive. You should be able to feel the difference if you try
to use the machine whilst gem_syslatency or gem_exec_nop is running; a
delay of up to 2s in responding to human input can be annoying!

> gem_syslatency -n, before:
> 
> gem_syslatency: cycles=0, latency mean=2.446us max=18us
> gem_syslatency: cycles=0, latency mean=7.220us max=37us
> gem_syslatency: cycles=0, latency mean=6.949us max=36us
> gem_syslatency: cycles=0, latency mean=5.931us max=39us
> 
> with tasklet:
> 
> gem_syslatency: cycles=0, latency mean=2.477us max=5us
> gem_syslatency: cycles=0, latency mean=2.471us max=6us
> gem_syslatency: cycles=0, latency mean=2.696us max=24us
> gem_syslatency: cycles=0, latency mean=6.414us max=39us
> 
> This looks potentially the same or very similar. May need more runs
> to get a definitive picture.

-n should be unaffected, since it measures the background without gem
operations (cycles=0), so should tell us how stable the numbers are for
the timers.
 
> gem_exec_nop/all shows a huge improvement also, if we ignore the fact
> that it locks up the system with the current irq handler at full
> tilt. When I limit the max GPU frequency a bit it can avoid that
> problem, but tasklets make it twice as fast here.

Yes, with threaded submission we can then concurrently submit requests
to multiple rings. I take it you have a 2-processor machine? We should
ideally see linear scaling up to min(num_engines, nproc-1) if we assume
that one cpu is enough to sustain gem_execbuf() ioctls.

> >>>I did not find any gains or regressions with Synmark2 or
> >>>GLbench under light testing. More benchmarking is certainly
> >>>required.
> >>>
> >
> >Bugzilla?
> 
> You think it is OK to continue sharing your one,
> https://bugs.freedesktop.org/show_bug.cgi?id=93467?

Yes, it fixes the same freeze (and we've removed the loop from chv irq
so there really shouldn't be any others left!)

> >>>Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>Cc: Chris Wilson <chris@chris-wilson.co.uk>
> >>
> >>I thought tasklets are considered unpopular nowadays? They still steal cpu
> 
> Did not know, last (and first) time I've used them was ~15 years
> ago. :) You got any links to read about it? Since (see below) I am
> not sure they "steal" CPU time.
> 
> >>time, just have the benefit of not also disabling hard interrupts. There
> >>should be mitigation though to offload these softinterrupts to threads.
> >>Have you tried to create a threaded interrupt thread just for these pins
> >>instead? A bit of boilerplate, but not much using the genirq stuff iirc.
> >
> >Ah, you haven't been reading patches. Yes, there's been a patch to fix
> >the hardlockup using kthreads for a few months. Tvrtko is trying to move
> >this forward since he too has found a way of locking up his machine
> >using execlist under load.
> 
> Correct.
> 
> >So far kthreads seems to have a slight edge in the benchmarks, or rather
> >using tasklet I have some very wild results on Braswell. Using tasklets
> >the CPU time is accounted to the process (i.e. whoever was running at
> > the time of the irq, typically the benchmark), using kthread we have
> 
> I thought they run from ksoftirqd so the CPU time is accounted
> against it. And looking at top, that even seems what actually
> happens.

Not for me. :| Though I'm using simple CPU time accounting.

> >independent entries in the process table/top (which is quite nice to see
> >just how much time is being eaten up by the context-switches).
> >
> >Benchmarks still progressing, haven't yet got on to the latency
> >measurements....
> 
> My tasklets hack required surprisingly little code change, at least
> if there are not some missed corner cases to handle, but I don't
> mind your threads either.

Yes, though when moving to kthreads I dropped the requirement for
spin_lock_irq(engine->execlists_lock) and so there is a large amount of
fluff in changing those callsites to spin_lock(). (For tasklet, we could
argue that requirement is now changed to spin_lock_bh()...) The real meat
of the change is that with kthreads we have to worry about doing the
scheduling() ourselves, and that impacts upon the forcewake dance so
certainly more complex than tasklets! I liked how simple this patch is
and so far it looks as good as making our own kthread. The biggest
difference really is just who gets the CPU time!
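
For illustration, the kthread variant boils down to a loop of roughly this
shape (sketch only, not the actual patch -- process_csb() and the irq_posted
bit are invented names here, and the forcewake dance is elided):

```c
/* Hypothetical sketch of the kthread-based bottom half. The hard irq
 * handler would only set the bit and call wake_up_process() on this
 * thread, instead of draining the CSB itself.
 */
static int execlists_submit_thread(void *arg)
{
	struct intel_engine_cs *engine = arg;

	while (!kthread_should_stop()) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (!test_and_clear_bit(0, &engine->irq_posted)) {
			schedule();	/* sleep until the irq wakes us */
			continue;
		}
		__set_current_state(TASK_RUNNING);

		/* drain the context status buffer, as in
		 * intel_lrc_irq_handler(), but from process context */
		process_csb(engine);
	}

	return 0;
}
```

That explicit sleep/wake handling is exactly the scheduling we have to do
ourselves, which the tasklet gives us for free.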
-Chris
Tvrtko Ursulin March 23, 2016, 12:46 p.m. UTC | #5
On 23/03/16 11:31, Chris Wilson wrote:
> On Wed, Mar 23, 2016 at 10:08:46AM +0000, Tvrtko Ursulin wrote:
>>
>> On 23/03/16 09:14, Chris Wilson wrote:
>>> On Wed, Mar 23, 2016 at 10:07:35AM +0100, Daniel Vetter wrote:
>>>> On Tue, Mar 22, 2016 at 05:30:04PM +0000, Tvrtko Ursulin wrote:
>>>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>
>>>>> Doing a lot of work in the interrupt handler introduces huge
>>>>> latencies to the system as a whole.
>>>>>
>>>>> Most dramatic effect can be seen by running an all engine
>>>>> stress test like igt/gem_exec_nop/all where, when the kernel
>>>>> config is lean enough, the whole system can be brought into
>>>>> multi-second periods of complete non-interactivity. That can
>>>>> look for example like this:
>>>>>
>>>>>   NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143]
>>>>>   Modules linked in: [redacted for brevity]
>>>>>   CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G     U       L  4.5.0-160321+ #183
>>>>>   Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1
>>>>>   Workqueue: i915 gen6_pm_rps_work [i915]
>>>>>   task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000
>>>>>   RIP: 0010:[<ffffffff8104a3c2>]  [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0
>>>>>   RSP: 0000:ffff88014f403f38  EFLAGS: 00000206
>>>>>   RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0
>>>>>   RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80
>>>>>   RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022
>>>>>   R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030
>>>>>   R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082
>>>>>   FS:  0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000
>>>>>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>   CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0
>>>>>   DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>   DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>   Stack:
>>>>>    042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a
>>>>>    0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080
>>>>>    0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8
>>>>>   Call Trace:
>>>>>    <IRQ>
>>>>>    [<ffffffff8104a716>] irq_exit+0x86/0x90
>>>>>    [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50
>>>>>    [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90
>>>>>    <EOI>
>>>>>    [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915]
>>>>>    [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20
>>>>>    [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915]
>>>>>    [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0
>>>>>    [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915]
>>>>>    [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915]
>>>>>    [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915]
>>>>>    [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915]
>>>>>    [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0
>>>>>    [<ffffffff8105ab29>] process_one_work+0x139/0x350
>>>>>    [<ffffffff8105b186>] worker_thread+0x126/0x490
>>>>>    [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320
>>>>>    [<ffffffff8105fa64>] kthread+0xc4/0xe0
>>>>>    [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
>>>>>    [<ffffffff814f351f>] ret_from_fork+0x3f/0x70
>>>>>    [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
>>>>>
>>>>> I could not explain, or find a code path, which would explain
>>>>> a +20 second lockup, but from some instrumentation it was
>>>>> apparent the interrupts off proportion of time was between
>>>>> 10-25% under heavy load which is quite bad.
>>>>>
>>>>> By moving the GT interrupt handling to a tasklet in a most
>>>>> simple way, the problem above disappears completely.
>>>>>
>>>>> Also, gem_latency -n 100 shows 25% better throughput and CPU
>>>>> usage, and 14% better latencies.
>>>
>>> Forgot gem_syslatency, since that does reflect UX under load really
>>> startlingly well.
>>
>> gem_syslatency, before:
>>
>> gem_syslatency: cycles=1532739, latency mean=416531.829us max=2499237us
>> gem_syslatency: cycles=1839434, latency mean=1458099.157us max=4998944us
>> gem_syslatency: cycles=1432570, latency mean=2688.451us max=1201185us
>> gem_syslatency: cycles=1533543, latency mean=416520.499us max=2498886us
>>
>> with tasklet:
>>
>> gem_syslatency: cycles=808907, latency mean=53.133us max=1640us
>> gem_syslatency: cycles=862154, latency mean=62.778us max=2117us
>> gem_syslatency: cycles=856039, latency mean=58.079us max=2123us
>> gem_syslatency: cycles=841683, latency mean=56.914us max=1667us
>>
>> Is this smaller throughput and better latency?
>
> Yeah. I wasn't expecting the smaller throughput, but the impact on other
> users is massive. You should be able to feel the difference if you try
> to use the machine whilst gem_syslatency or gem_exec_nop is running, a
> delay of up to 2s in responding to human input can be annoying!

Yes, impact is easily felt.

>> gem_exec_nop/all has a huge improvement also, if we ignore the fact
>> it locks up the system with the current irq handler on full tilt,
>> when I limit the max GPU frequency a bit it can avoid that problem
>> but tasklets make it twice as fast here.
>
> Yes, with threaded submission we can then concurrently submit requests to
> multiple rings. I take it you have a 2-processor machine? We should
> ideally see linear scaling up to min(num_engines, nproc-1) if we assume
> that one cpu is enough to sustain gem_execbuf() ioctls.

2C/4T correct.

>>>>> I did not find any gains or regressions with Synmark2 or
>>>>> GLbench under light testing. More benchmarking is certainly
>>>>> required.
>>>>>
>>>
>>> Bugzilla?
>>
>> You think it is OK to continue sharing your one,
>> https://bugs.freedesktop.org/show_bug.cgi?id=93467?
>
> Yes, it fixes the same freeze (and we've removed the loop from chv irq
> so there really shouldn't be any others left!)

I don't see that it has been merged. Is it all ready CI-wise so we could merge it?

>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>>>
>>>> I thought tasklets are considered unpopular nowadays? They still steal cpu
>>
>> Did not know, last (and first) time I've used them was ~15 years
>> ago. :) You got any links to read about it? Since (see below) I am
>> not sure they "steal" CPU time.
>>
>>>> time, just have the benefit of not also disabling hard interrupts. There
>>>> should be mitigation though to offload these softinterrupts to threads.
>>>> Have you tried to create a threaded interrupt thread just for these pins
>>>> instead? A bit of boilerplate, but not much using the genirq stuff iirc.
>>>
>>> Ah, you haven't been reading patches. Yes, there's been a patch to fix
>>> the hardlockup using kthreads for a few months. Tvrtko is trying to move
>>> this forward since he too has found a way of locking up his machine
>>> using execlist under load.
>>
>> Correct.
>>
>>> So far kthreads seems to have a slight edge in the benchmarks, or rather
>>> using tasklet I have some very wild results on Braswell. Using tasklets
>>> the CPU time is accounted to the process (i.e. whoever was running at
>>> the time of the irq, typically the benchmark), using kthread we have
>>
>> I thought they run from ksoftirqd so the CPU time is accounted
>> against it. And looking at top, that even seems what actually
>> happens.
>
> Not for me. :| Though I'm using simple CPU time accounting.

I suppose it must be this one you don't have then:

config IRQ_TIME_ACCOUNTING
         bool "Fine granularity task level IRQ time accounting"
         depends on HAVE_IRQ_TIME_ACCOUNTING && !NO_HZ_FULL
         help
           Select this option to enable fine granularity task irq time
           accounting. This is done by reading a timestamp on each
           transitions between softirq and hardirq state, so there can
           be a small performance impact.

>>> independent entries in the process table/top (which is quite nice to see
>>> just how much time is being eaten up by the context-switches).
>>>
>>> Benchmarks still progressing, haven't yet got on to the latency
>>> measurements....
>>
>> My tasklets hack required surprisingly little code change, at least
>> if there are not some missed corner cases to handle, but I don't
>> mind your threads either.
>
> Yes, though when moving to kthreads I dropped the requirement for
> spin_lock_irq(engine->execlists_lock) and so there is a large amount of
> fluff in changing those callsites to spin_lock(). (For tasklet, we could
> argue that requirement is now changed to spin_lock_bh()...) The real meat

Oops, yes, the _bh variant is the correct one. I wonder if that would 
further improve things. Will try.
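
Concretely I think it means plain spin_lock() inside the tasklet (softirqs 
are already off while it runs) and spin_lock_bh() in the process-context 
callers, roughly like this (untested sketch on top of this patch):

```diff
 	/* in intel_lrc_irq_handler(), now running as a tasklet */
-	spin_lock_irqsave(&engine->execlist_lock, flags);
+	spin_lock(&engine->execlist_lock);
 	...
-	spin_unlock_irqrestore(&engine->execlist_lock, flags);
+	spin_unlock(&engine->execlist_lock);

 	/* in process-context callers taking execlist_lock */
-	spin_lock_irq(&engine->execlist_lock);
+	spin_lock_bh(&engine->execlist_lock);
 	...
-	spin_unlock_irq(&engine->execlist_lock);
+	spin_unlock_bh(&engine->execlist_lock);
```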

> of the change is that with kthreads we have to worry about doing the
> scheduling() ourselves, and that impacts upon the forcewake dance so
> certainly more complex than tasklets! I liked how simple this patch is
> and so far it looks as good as making our own kthread. The biggest
> difference really is just who gets the CPU time!

Note that in that respect it is then no worse than the current situation 
wrt CPU time accounting.

Regards,

Tvrtko
Chris Wilson March 23, 2016, 12:56 p.m. UTC | #6
On Wed, Mar 23, 2016 at 12:46:35PM +0000, Tvrtko Ursulin wrote:
> 
> On 23/03/16 11:31, Chris Wilson wrote:
> >On Wed, Mar 23, 2016 at 10:08:46AM +0000, Tvrtko Ursulin wrote:
> >>You think it is OK to continue sharing your one,
> >>https://bugs.freedesktop.org/show_bug.cgi?id=93467?
> >
> >Yes, it fixes the same freeze (and we've removed the loop from chv irq
> >so there really shouldn't be any others left!)
> 
> I don't see that it has been merged. Is it all ready CI-wise so we could merge it?

On the CI ping:
id:20160314103014.30028.12472@emeril.freedesktop.org

== Summary ==

Series 4298v3 drm/i915: Exit cherryview_irq_handler() after one pass
http://patchwork.freedesktop.org/api/1.0/series/4298/revisions/3/mbox/

Test drv_module_reload_basic:
                pass       -> SKIP       (skl-i5k-2)
                pass       -> INCOMPLETE (bsw-nuc-2)
Test gem_ringfill:
        Subgroup basic-default-s3:
                dmesg-warn -> PASS       (bsw-nuc-2)
Test gem_tiled_pread_basic:
                incomplete -> PASS       (byt-nuc)
Test kms_flip:
        Subgroup basic-flip-vs-dpms:
                dmesg-warn -> PASS       (bdw-ultra)
        Subgroup basic-flip-vs-modeset:
                incomplete -> PASS       (bsw-nuc-2)
Test kms_pipe_crc_basic:
        Subgroup suspend-read-crc-pipe-a:
                incomplete -> PASS       (hsw-gt2)
Test pm_rpm:
        Subgroup basic-pci-d3-state:
                dmesg-warn -> PASS       (bsw-nuc-2)

bdw-nuci7        total:194  pass:182  dwarn:0   dfail:0   fail:0   skip:12
bdw-ultra        total:194  pass:173  dwarn:0   dfail:0   fail:0   skip:21
bsw-nuc-2        total:189  pass:151  dwarn:0   dfail:0   fail:0   skip:37
byt-nuc          total:194  pass:154  dwarn:4   dfail:0   fail:1   skip:35
hsw-brixbox      total:194  pass:172  dwarn:0   dfail:0   fail:0   skip:22
hsw-gt2          total:194  pass:176  dwarn:1   dfail:0   fail:0   skip:17
ivb-t430s        total:194  pass:169  dwarn:0   dfail:0   fail:0   skip:25
skl-i5k-2        total:194  pass:170  dwarn:0   dfail:0   fail:0   skip:24
skl-i7k-2        total:194  pass:171  dwarn:0   dfail:0   fail:0   skip:23
snb-dellxps      total:194  pass:159  dwarn:1   dfail:0   fail:0   skip:34
snb-x220t        total:194  pass:159  dwarn:1   dfail:0   fail:1   skip:33

Results at /archive/results/CI_IGT_test/Patchwork_1589/

3e5ecc8c5ff80cb1fb635ce1cf16b7cd4cfb1979 drm-intel-nightly: 2016y-03m-14d-09h-06m-00s UTC integration manifest
7928c2133b16eb2f26866ca05d1cb7bb6d41c765 drm/i915: Exit cherryview_irq_handler() after one pass

==

drv_module_reload_basic is weird, but it appears the hiccups CI had on the
previous run were external (and affected several CI runs afaict).
-Chris
Tvrtko Ursulin March 23, 2016, 3:23 p.m. UTC | #7
On 23/03/16 12:56, Chris Wilson wrote:
> On Wed, Mar 23, 2016 at 12:46:35PM +0000, Tvrtko Ursulin wrote:
>>
>> On 23/03/16 11:31, Chris Wilson wrote:
>>> On Wed, Mar 23, 2016 at 10:08:46AM +0000, Tvrtko Ursulin wrote:
>>>> You think it is OK to continue sharing your one,
>>>> https://bugs.freedesktop.org/show_bug.cgi?id=93467?
>>>
>>> Yes, it fixes the same freeze (and we've removed the loop from chv irq
>>> so there really shouldn't be any others left!)
>>
>> I don't see that it has been merged. Is it all ready CI-wise so we could merge it?
>
> On the CI ping:
> id:20160314103014.30028.12472@emeril.freedesktop.org
>
> == Summary ==
>
> Series 4298v3 drm/i915: Exit cherryview_irq_handler() after one pass
> http://patchwork.freedesktop.org/api/1.0/series/4298/revisions/3/mbox/
>
> Test drv_module_reload_basic:
>                  pass       -> SKIP       (skl-i5k-2)
>                  pass       -> INCOMPLETE (bsw-nuc-2)
> Test gem_ringfill:
>          Subgroup basic-default-s3:
>                  dmesg-warn -> PASS       (bsw-nuc-2)
> Test gem_tiled_pread_basic:
>                  incomplete -> PASS       (byt-nuc)
> Test kms_flip:
>          Subgroup basic-flip-vs-dpms:
>                  dmesg-warn -> PASS       (bdw-ultra)
>          Subgroup basic-flip-vs-modeset:
>                  incomplete -> PASS       (bsw-nuc-2)
> Test kms_pipe_crc_basic:
>          Subgroup suspend-read-crc-pipe-a:
>                  incomplete -> PASS       (hsw-gt2)
> Test pm_rpm:
>          Subgroup basic-pci-d3-state:
>                  dmesg-warn -> PASS       (bsw-nuc-2)
>
> bdw-nuci7        total:194  pass:182  dwarn:0   dfail:0   fail:0   skip:12
> bdw-ultra        total:194  pass:173  dwarn:0   dfail:0   fail:0   skip:21
> bsw-nuc-2        total:189  pass:151  dwarn:0   dfail:0   fail:0   skip:37
> byt-nuc          total:194  pass:154  dwarn:4   dfail:0   fail:1   skip:35
> hsw-brixbox      total:194  pass:172  dwarn:0   dfail:0   fail:0   skip:22
> hsw-gt2          total:194  pass:176  dwarn:1   dfail:0   fail:0   skip:17
> ivb-t430s        total:194  pass:169  dwarn:0   dfail:0   fail:0   skip:25
> skl-i5k-2        total:194  pass:170  dwarn:0   dfail:0   fail:0   skip:24
> skl-i7k-2        total:194  pass:171  dwarn:0   dfail:0   fail:0   skip:23
> snb-dellxps      total:194  pass:159  dwarn:1   dfail:0   fail:0   skip:34
> snb-x220t        total:194  pass:159  dwarn:1   dfail:0   fail:1   skip:33
>
> Results at /archive/results/CI_IGT_test/Patchwork_1589/
>
> 3e5ecc8c5ff80cb1fb635ce1cf16b7cd4cfb1979 drm-intel-nightly: 2016y-03m-14d-09h-06m-00s UTC integration manifest
> 7928c2133b16eb2f26866ca05d1cb7bb6d41c765 drm/i915: Exit cherryview_irq_handler() after one pass
>
> ==
>
> drv_module_reload_basic is weird, but it appears the hiccups CI had on the
> previous run were external (and affected several CI runs afaict).

That part is goog then, but I am not sure what to do with the incomplete 
run. Maybe have it do another one? Although that is quite weak. Problem 
is it has no other hangs with that test in the history.

Regards,

Tvrtko
Tvrtko Ursulin March 24, 2016, 9:37 a.m. UTC | #8
On 23/03/16 15:23, Tvrtko Ursulin wrote:
>
> On 23/03/16 12:56, Chris Wilson wrote:
>> On Wed, Mar 23, 2016 at 12:46:35PM +0000, Tvrtko Ursulin wrote:
>>>
>>> On 23/03/16 11:31, Chris Wilson wrote:
>>>> On Wed, Mar 23, 2016 at 10:08:46AM +0000, Tvrtko Ursulin wrote:
>>>>> You think it is OK to continue sharing your one,
>>>>> https://bugs.freedesktop.org/show_bug.cgi?id=93467?
>>>>
>>>> Yes, it fixes the same freeze (and we've removed the loop from chv irq
>>>> so there really shouldn't be any others left!)
>>>
>>> I don't see that it has been merged. Is it all ready CI-wise so we could merge it?
>>
>> On the CI ping:
>> id:20160314103014.30028.12472@emeril.freedesktop.org
>>
>> == Summary ==
>>
>> Series 4298v3 drm/i915: Exit cherryview_irq_handler() after one pass
>> http://patchwork.freedesktop.org/api/1.0/series/4298/revisions/3/mbox/
>>
>> Test drv_module_reload_basic:
>>                  pass       -> SKIP       (skl-i5k-2)
>>                  pass       -> INCOMPLETE (bsw-nuc-2)
>> Test gem_ringfill:
>>          Subgroup basic-default-s3:
>>                  dmesg-warn -> PASS       (bsw-nuc-2)
>> Test gem_tiled_pread_basic:
>>                  incomplete -> PASS       (byt-nuc)
>> Test kms_flip:
>>          Subgroup basic-flip-vs-dpms:
>>                  dmesg-warn -> PASS       (bdw-ultra)
>>          Subgroup basic-flip-vs-modeset:
>>                  incomplete -> PASS       (bsw-nuc-2)
>> Test kms_pipe_crc_basic:
>>          Subgroup suspend-read-crc-pipe-a:
>>                  incomplete -> PASS       (hsw-gt2)
>> Test pm_rpm:
>>          Subgroup basic-pci-d3-state:
>>                  dmesg-warn -> PASS       (bsw-nuc-2)
>>
>> bdw-nuci7        total:194  pass:182  dwarn:0   dfail:0   fail:0
>> skip:12
>> bdw-ultra        total:194  pass:173  dwarn:0   dfail:0   fail:0
>> skip:21
>> bsw-nuc-2        total:189  pass:151  dwarn:0   dfail:0   fail:0
>> skip:37
>> byt-nuc          total:194  pass:154  dwarn:4   dfail:0   fail:1
>> skip:35
>> hsw-brixbox      total:194  pass:172  dwarn:0   dfail:0   fail:0
>> skip:22
>> hsw-gt2          total:194  pass:176  dwarn:1   dfail:0   fail:0
>> skip:17
>> ivb-t430s        total:194  pass:169  dwarn:0   dfail:0   fail:0
>> skip:25
>> skl-i5k-2        total:194  pass:170  dwarn:0   dfail:0   fail:0
>> skip:24
>> skl-i7k-2        total:194  pass:171  dwarn:0   dfail:0   fail:0
>> skip:23
>> snb-dellxps      total:194  pass:159  dwarn:1   dfail:0   fail:0
>> skip:34
>> snb-x220t        total:194  pass:159  dwarn:1   dfail:0   fail:1
>> skip:33
>>
>> Results at /archive/results/CI_IGT_test/Patchwork_1589/
>>
>> 3e5ecc8c5ff80cb1fb635ce1cf16b7cd4cfb1979 drm-intel-nightly:
>> 2016y-03m-14d-09h-06m-00s UTC integration manifest
>> 7928c2133b16eb2f26866ca05d1cb7bb6d41c765 drm/i915: Exit
>> cherryview_irq_handler() after one pass
>>
>> ==
>>
>> drv_module_reload_basic is weird, but it appears the hiccups CI had on the
>> previous run were external (and affected several CI runs afaict).
>
> That part is goog then, but I am not sure what to do with the incomplete
> run. Maybe have it do another one? Although that is quite weak. Problem
> is it has no other hangs with that test in the history.

goog yes :) I got a bsw-nuc2 hang in results yesterday for a series 
which I don't think could really have caused it. So I think there might 
be something really wrong either with that machine or with the driver 
reload on chv/bsw in general.

Regards,

Tvrtko
Tvrtko Ursulin March 24, 2016, 3:17 p.m. UTC | #9
On 24/03/16 14:03, Patchwork wrote:
> == Series Details ==
>
> Series: drm/i915: Move execlists irq handler to a bottom half (rev3)
> URL   : https://patchwork.freedesktop.org/series/4764/
> State : warning
>
> == Summary ==
>
> Series 4764v3 drm/i915: Move execlists irq handler to a bottom half
> http://patchwork.freedesktop.org/api/1.0/series/4764/revisions/3/mbox/
>
> Test gem_exec_suspend:
>          Subgroup basic-s3:
>                  dmesg-warn -> PASS       (bsw-nuc-2)
> Test pm_rpm:
>          Subgroup basic-pci-d3-state:
>                  dmesg-warn -> PASS       (bsw-nuc-2)
>                  pass       -> DMESG-WARN (byt-nuc)

Unclaimed register prior to suspending on BYT:
https://bugs.freedesktop.org/show_bug.cgi?id=94164

>          Subgroup basic-rte:
>                  dmesg-warn -> PASS       (byt-nuc) UNSTABLE
>
> bdw-nuci7        total:192  pass:179  dwarn:0   dfail:0   fail:1   skip:12
> bdw-ultra        total:192  pass:170  dwarn:0   dfail:0   fail:1   skip:21
> bsw-nuc-2        total:192  pass:155  dwarn:0   dfail:0   fail:0   skip:37
> byt-nuc          total:192  pass:156  dwarn:1   dfail:0   fail:0   skip:35
> hsw-brixbox      total:192  pass:170  dwarn:0   dfail:0   fail:0   skip:22
> hsw-gt2          total:192  pass:175  dwarn:0   dfail:0   fail:0   skip:17
> ivb-t430s        total:192  pass:167  dwarn:0   dfail:0   fail:0   skip:25
> skl-i7k-2        total:192  pass:169  dwarn:0   dfail:0   fail:0   skip:23
> skl-nuci5        total:192  pass:181  dwarn:0   dfail:0   fail:0   skip:11
> snb-dellxps      total:192  pass:158  dwarn:0   dfail:0   fail:0   skip:34
> snb-x220t        total:192  pass:158  dwarn:0   dfail:0   fail:1   skip:33
>
> Results at /archive/results/CI_IGT_test/Patchwork_1707/
>
> 83ec122b900baae1aca2bc11eedc28f2d9ea5060 drm-intel-nightly: 2016y-03m-24d-12h-48m-43s UTC integration manifest
> b4a4e726b4f10b0782c821bf73c945533ec882e8 drm/i915: Move execlists irq handler to a bottom half

Sooo... who dares to merge this? It kind of looks too simple not to 
result in some fallout.

Regards,

Tvrtko
Tvrtko Ursulin April 4, 2016, 12:42 p.m. UTC | #10
On 04/04/16 13:33, Patchwork wrote:
> == Series Details ==
>
> Series: drm/i915: Move execlists irq handler to a bottom half (rev4)
> URL   : https://patchwork.freedesktop.org/series/4764/
> State : failure
>
> == Summary ==
>
> Series 4764v4 drm/i915: Move execlists irq handler to a bottom half
> http://patchwork.freedesktop.org/api/1.0/series/4764/revisions/4/mbox/
>
> Test gem_sync:
>          Subgroup basic-bsd:
>                  pass       -> DMESG-FAIL (ilk-hp8440p)

Unrelated hangcheck timer elapsed on ILK: 
https://bugs.freedesktop.org/show_bug.cgi?id=94307

> Test kms_pipe_crc_basic:
>          Subgroup suspend-read-crc-pipe-a:
>                  incomplete -> PASS       (skl-nuci5)
>
> bdw-nuci7        total:196  pass:184  dwarn:0   dfail:0   fail:0   skip:12
> bdw-ultra        total:196  pass:175  dwarn:0   dfail:0   fail:0   skip:21
> bsw-nuc-2        total:196  pass:159  dwarn:0   dfail:0   fail:0   skip:37
> byt-nuc          total:196  pass:161  dwarn:0   dfail:0   fail:0   skip:35
> hsw-brixbox      total:196  pass:174  dwarn:0   dfail:0   fail:0   skip:22
> hsw-gt2          total:196  pass:179  dwarn:0   dfail:0   fail:0   skip:17
> ilk-hp8440p      total:196  pass:131  dwarn:0   dfail:1   fail:0   skip:64
> ivb-t430s        total:196  pass:171  dwarn:0   dfail:0   fail:0   skip:25
> skl-i7k-2        total:196  pass:173  dwarn:0   dfail:0   fail:0   skip:23
> skl-nuci5        total:105  pass:100  dwarn:0   dfail:0   fail:0   skip:4
> snb-dellxps      total:196  pass:162  dwarn:0   dfail:0   fail:0   skip:34
> snb-x220t        total:164  pass:139  dwarn:0   dfail:0   fail:0   skip:25
>
> Results at /archive/results/CI_IGT_test/Patchwork_1786/
>
> 3e353ec38c8fe68e9a243a9388389a8815115451 drm-intel-nightly: 2016y-04m-04d-11h-13m-54s UTC integration manifest
> 95dc10d4f71a6cf473aa874b0a74036f251aef8c drm/i915: Move execlists irq handler to a bottom half

So cross fingers and merge?

Regards,

Tvrtko
Chris Wilson April 4, 2016, 12:53 p.m. UTC | #11
On Mon, Apr 04, 2016 at 01:42:06PM +0100, Tvrtko Ursulin wrote:
> 
> 
> On 04/04/16 13:33, Patchwork wrote:
> >== Series Details ==
> >
> >Series: drm/i915: Move execlists irq handler to a bottom half (rev4)
> >URL   : https://patchwork.freedesktop.org/series/4764/
> >State : failure
> >
> >== Summary ==
> >
> >Series 4764v4 drm/i915: Move execlists irq handler to a bottom half
> >http://patchwork.freedesktop.org/api/1.0/series/4764/revisions/4/mbox/
> >
> >Test gem_sync:
> >         Subgroup basic-bsd:
> >                 pass       -> DMESG-FAIL (ilk-hp8440p)
> 
> Unrelated hangcheck timer elapsed on ILK:
> https://bugs.freedesktop.org/show_bug.cgi?id=94307
> 
> >Test kms_pipe_crc_basic:
> >         Subgroup suspend-read-crc-pipe-a:
> >                 incomplete -> PASS       (skl-nuci5)
> >
> >bdw-nuci7        total:196  pass:184  dwarn:0   dfail:0   fail:0   skip:12
> >bdw-ultra        total:196  pass:175  dwarn:0   dfail:0   fail:0   skip:21
> >bsw-nuc-2        total:196  pass:159  dwarn:0   dfail:0   fail:0   skip:37
> >byt-nuc          total:196  pass:161  dwarn:0   dfail:0   fail:0   skip:35
> >hsw-brixbox      total:196  pass:174  dwarn:0   dfail:0   fail:0   skip:22
> >hsw-gt2          total:196  pass:179  dwarn:0   dfail:0   fail:0   skip:17
> >ilk-hp8440p      total:196  pass:131  dwarn:0   dfail:1   fail:0   skip:64
> >ivb-t430s        total:196  pass:171  dwarn:0   dfail:0   fail:0   skip:25
> >skl-i7k-2        total:196  pass:173  dwarn:0   dfail:0   fail:0   skip:23
> >skl-nuci5        total:105  pass:100  dwarn:0   dfail:0   fail:0   skip:4
> >snb-dellxps      total:196  pass:162  dwarn:0   dfail:0   fail:0   skip:34
> >snb-x220t        total:164  pass:139  dwarn:0   dfail:0   fail:0   skip:25
> >
> >Results at /archive/results/CI_IGT_test/Patchwork_1786/
> >
> >3e353ec38c8fe68e9a243a9388389a8815115451 drm-intel-nightly: 2016y-04m-04d-11h-13m-54s UTC integration manifest
> >95dc10d4f71a6cf473aa874b0a74036f251aef8c drm/i915: Move execlists irq handler to a bottom half
> 
> So cross fingers and merge?

Yes!
-Chris
Tvrtko Ursulin April 4, 2016, 1:14 p.m. UTC | #12
On 04/04/16 13:53, Chris Wilson wrote:
> On Mon, Apr 04, 2016 at 01:42:06PM +0100, Tvrtko Ursulin wrote:
>>
>>
>> On 04/04/16 13:33, Patchwork wrote:
>>> == Series Details ==
>>>
>>> Series: drm/i915: Move execlists irq handler to a bottom half (rev4)
>>> URL   : https://patchwork.freedesktop.org/series/4764/
>>> State : failure
>>>
>>> == Summary ==
>>>
>>> Series 4764v4 drm/i915: Move execlists irq handler to a bottom half
>>> http://patchwork.freedesktop.org/api/1.0/series/4764/revisions/4/mbox/
>>>
>>> Test gem_sync:
>>>          Subgroup basic-bsd:
>>>                  pass       -> DMESG-FAIL (ilk-hp8440p)
>>
>> Unrelated hangcheck timer elapsed on ILK:
>> https://bugs.freedesktop.org/show_bug.cgi?id=94307
>>
>>> Test kms_pipe_crc_basic:
>>>          Subgroup suspend-read-crc-pipe-a:
>>>                  incomplete -> PASS       (skl-nuci5)
>>>
>>> bdw-nuci7        total:196  pass:184  dwarn:0   dfail:0   fail:0   skip:12
>>> bdw-ultra        total:196  pass:175  dwarn:0   dfail:0   fail:0   skip:21
>>> bsw-nuc-2        total:196  pass:159  dwarn:0   dfail:0   fail:0   skip:37
>>> byt-nuc          total:196  pass:161  dwarn:0   dfail:0   fail:0   skip:35
>>> hsw-brixbox      total:196  pass:174  dwarn:0   dfail:0   fail:0   skip:22
>>> hsw-gt2          total:196  pass:179  dwarn:0   dfail:0   fail:0   skip:17
>>> ilk-hp8440p      total:196  pass:131  dwarn:0   dfail:1   fail:0   skip:64
>>> ivb-t430s        total:196  pass:171  dwarn:0   dfail:0   fail:0   skip:25
>>> skl-i7k-2        total:196  pass:173  dwarn:0   dfail:0   fail:0   skip:23
>>> skl-nuci5        total:105  pass:100  dwarn:0   dfail:0   fail:0   skip:4
>>> snb-dellxps      total:196  pass:162  dwarn:0   dfail:0   fail:0   skip:34
>>> snb-x220t        total:164  pass:139  dwarn:0   dfail:0   fail:0   skip:25
>>>
>>> Results at /archive/results/CI_IGT_test/Patchwork_1786/
>>>
>>> 3e353ec38c8fe68e9a243a9388389a8815115451 drm-intel-nightly: 2016y-04m-04d-11h-13m-54s UTC integration manifest
>>> 95dc10d4f71a6cf473aa874b0a74036f251aef8c drm/i915: Move execlists irq handler to a bottom half
>>
>> So cross fingers and merge?
>
> Yes!

Okay, it's done, we'll see what happens next. :)

Regards,

Tvrtko

Patch

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 8f3e3309c3ab..e68134347007 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1324,7 +1324,7 @@  gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift)
 	if (iir & (GT_RENDER_USER_INTERRUPT << test_shift))
 		notify_ring(engine);
 	if (iir & (GT_CONTEXT_SWITCH_INTERRUPT << test_shift))
-		intel_lrc_irq_handler(engine);
+		tasklet_schedule(&engine->irq_tasklet);
 }
 
 static irqreturn_t gen8_gt_irq_handler(struct drm_i915_private *dev_priv,
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 6916991bdceb..283426c02f8b 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -538,21 +538,23 @@  get_context_status(struct intel_engine_cs *engine, unsigned int read_pointer,
 
 /**
  * intel_lrc_irq_handler() - handle Context Switch interrupts
- * @ring: Engine Command Streamer to handle.
+ * @engine: Engine Command Streamer to handle.
  *
  * Check the unread Context Status Buffers and manage the submission of new
  * contexts to the ELSP accordingly.
  */
-void intel_lrc_irq_handler(struct intel_engine_cs *engine)
+void intel_lrc_irq_handler(unsigned long data)
 {
+	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
 	struct drm_i915_private *dev_priv = engine->dev->dev_private;
 	u32 status_pointer;
 	unsigned int read_pointer, write_pointer;
 	u32 csb[GEN8_CSB_ENTRIES][2];
 	unsigned int csb_read = 0, i;
 	unsigned int submit_contexts = 0;
+	unsigned long flags;
 
-	spin_lock(&dev_priv->uncore.lock);
+	spin_lock_irqsave(&dev_priv->uncore.lock, flags);
 	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);
 
 	status_pointer = I915_READ_FW(RING_CONTEXT_STATUS_PTR(engine));
@@ -579,9 +581,9 @@  void intel_lrc_irq_handler(struct intel_engine_cs *engine)
 				    engine->next_context_status_buffer << 8));
 
 	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
-	spin_unlock(&dev_priv->uncore.lock);
+	spin_unlock_irqrestore(&dev_priv->uncore.lock, flags);
 
-	spin_lock(&engine->execlist_lock);
+	spin_lock_irqsave(&engine->execlist_lock, flags);
 
 	for (i = 0; i < csb_read; i++) {
 		if (unlikely(csb[i][0] & GEN8_CTX_STATUS_PREEMPTED)) {
@@ -604,7 +606,7 @@  void intel_lrc_irq_handler(struct intel_engine_cs *engine)
 			execlists_context_unqueue(engine);
 	}
 
-	spin_unlock(&engine->execlist_lock);
+	spin_unlock_irqrestore(&engine->execlist_lock, flags);
 
 	if (unlikely(submit_contexts > 2))
 		DRM_ERROR("More than two context complete events?\n");
@@ -2020,6 +2022,8 @@  void intel_logical_ring_cleanup(struct intel_engine_cs *engine)
 	if (!intel_engine_initialized(engine))
 		return;
 
+	tasklet_kill(&engine->irq_tasklet);
+
 	dev_priv = engine->dev->dev_private;
 
 	if (engine->buffer) {
@@ -2093,6 +2097,9 @@  logical_ring_init(struct drm_device *dev, struct intel_engine_cs *engine)
 	INIT_LIST_HEAD(&engine->execlist_retired_req_list);
 	spin_lock_init(&engine->execlist_lock);
 
+	tasklet_init(&engine->irq_tasklet, intel_lrc_irq_handler,
+		     (unsigned long)engine);
+
 	logical_ring_init_platform_invariants(engine);
 
 	ret = i915_cmd_parser_init_ring(engine);
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index a17cb12221ba..0b0853eee91e 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -118,7 +118,6 @@  int intel_execlists_submission(struct i915_execbuffer_params *params,
 			       struct drm_i915_gem_execbuffer2 *args,
 			       struct list_head *vmas);
 
-void intel_lrc_irq_handler(struct intel_engine_cs *engine);
 void intel_execlists_retire_requests(struct intel_engine_cs *engine);
 
 #endif /* _INTEL_LRC_H_ */
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 221a94627aab..29810cba8a8c 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -266,6 +266,7 @@  struct  intel_engine_cs {
 	} semaphore;
 
 	/* Execlists */
+	struct tasklet_struct irq_tasklet;
 	spinlock_t execlist_lock;
 	struct list_head execlist_queue;
 	struct list_head execlist_retired_req_list;