[v3] drm/i915: Move execlists irq handler to a bottom half
diff mbox

Message ID 1458821929-20785-1-git-send-email-tvrtko.ursulin@linux.intel.com
State New
Headers show

Commit Message

Tvrtko Ursulin March 24, 2016, 12:18 p.m. UTC
From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Doing a lot of work in the interrupt handler introduces huge
latencies to the system as a whole.

Most dramatic effect can be seen by running an all engine
stress test like igt/gem_exec_nop/all where, when the kernel
config is lean enough, the whole system can be brought into
multi-second periods of complete non-interactivty. That can
look for example like this:

 NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143]
 Modules linked in: [redacted for brevity]
 CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G     U       L  4.5.0-160321+ #183
 Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1
 Workqueue: i915 gen6_pm_rps_work [i915]
 task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000
 RIP: 0010:[<ffffffff8104a3c2>]  [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0
 RSP: 0000:ffff88014f403f38  EFLAGS: 00000206
 RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0
 RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80
 RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022
 R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030
 R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082
 FS:  0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Stack:
  042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a
  0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080
  0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8
 Call Trace:
  <IRQ>
  [<ffffffff8104a716>] irq_exit+0x86/0x90
  [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50
  [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90
  <EOI>
  [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915]
  [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20
  [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915]
  [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0
  [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915]
  [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915]
  [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915]
  [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915]
  [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0
  [<ffffffff8105ab29>] process_one_work+0x139/0x350
  [<ffffffff8105b186>] worker_thread+0x126/0x490
  [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320
  [<ffffffff8105fa64>] kthread+0xc4/0xe0
  [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
  [<ffffffff814f351f>] ret_from_fork+0x3f/0x70
  [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170

I could not explain, or find a code path, which would explain
a +20 second lockup, but from some instrumentation it was
apparent the interrupts off proportion of time was between
10-25% under heavy load which is quite bad.

When a interrupt "cliff" is reached, which was >~320k irq/s on
my machine, the whole system goes into a terrible state of the
above described multi-second lockups.

By moving the GT interrupt handling to a tasklet in a most
simple way, the problem above disappears completely.

Testing the effect on sytem-wide latencies using
igt/gem_syslatency shows the following before this patch:

gem_syslatency: cycles=1532739, latency mean=416531.829us max=2499237us
gem_syslatency: cycles=1839434, latency mean=1458099.157us max=4998944us
gem_syslatency: cycles=1432570, latency mean=2688.451us max=1201185us
gem_syslatency: cycles=1533543, latency mean=416520.499us max=2498886us

This shows that the unrelated process is experiencing huge
delays in its wake-up latency. After the patch the results
look like this:

gem_syslatency: cycles=808907, latency mean=53.133us max=1640us
gem_syslatency: cycles=862154, latency mean=62.778us max=2117us
gem_syslatency: cycles=856039, latency mean=58.079us max=2123us
gem_syslatency: cycles=841683, latency mean=56.914us max=1667us

Showing a huge improvement in the unrelated process wake-up
latency. It also shows an approximate halving in the number
of total empty batches submitted during the test. This may
not be worrying since the test puts the driver under
a very unrealistic load with ncpu threads doing empty batch
submission to all GPU engines each.

More interesting scenario with regards to throughput is
"gem_latency -n 100" which  shows 25% better throughput and
CPU usage, and 14% better dispatch latencies.

I did not find any gains or regressions with Synmark2 or
GLbench under light testing. More benchmarking is certainly
required.

v2:
   * execlists_lock should be taken as spin_lock_bh when
     queuing work from userspace now. (Chris Wilson)
   * uncore.lock must be taken with spin_lock_irq when
     submitting requests since that now runs from either
     softirq or process context.

v3:
   * Expanded commit message with more testing data;
   * converted missed locking sites to _bh;
   * added execlist_lock comment. (Chris Wilson)

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Testcase: igt/gem_exec_nop/all
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94350
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_debugfs.c     |  5 ++---
 drivers/gpu/drm/i915/i915_gem.c         |  8 ++++----
 drivers/gpu/drm/i915/i915_irq.c         |  2 +-
 drivers/gpu/drm/i915/intel_lrc.c        | 28 ++++++++++++++++------------
 drivers/gpu/drm/i915/intel_lrc.h        |  1 -
 drivers/gpu/drm/i915/intel_ringbuffer.h |  3 ++-
 6 files changed, 25 insertions(+), 22 deletions(-)

Comments

Chris Wilson March 24, 2016, 10:24 p.m. UTC | #1
On Thu, Mar 24, 2016 at 12:18:49PM +0000, Tvrtko Ursulin wrote:
>  /**
>   * intel_lrc_irq_handler() - handle Context Switch interrupts
> - * @ring: Engine Command Streamer to handle.
> + * @engine: Engine Command Streamer to handle.
>   *
>   * Check the unread Context Status Buffers and manage the submission of new
>   * contexts to the ELSP accordingly.
>   */
> -void intel_lrc_irq_handler(struct intel_engine_cs *engine)
> +void intel_lrc_irq_handler(unsigned long data)

Now should be made static.

> @@ -552,7 +551,7 @@ void intel_lrc_irq_handler(struct intel_engine_cs *engine)
>  	unsigned int csb_read = 0, i;
>  	unsigned int submit_contexts = 0;
>  
> -	spin_lock(&dev_priv->uncore.lock);
> +	spin_lock_irq(&dev_priv->uncore.lock);
>  	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);
>  
>  	status_pointer = I915_READ_FW(RING_CONTEXT_STATUS_PTR(engine));
> @@ -579,7 +578,7 @@ void intel_lrc_irq_handler(struct intel_engine_cs *engine)
>  				    engine->next_context_status_buffer << 8));
>  
>  	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
> -	spin_unlock(&dev_priv->uncore.lock);
> +	spin_unlock_irq(&dev_priv->uncore.lock);

We actually can make this section just guarded by
intel_uncore_forcewake_get/intel_uncore_forcewake_put and drop the
explicit spin_lock_irq(uncore.lock). We know no one else can do
simultaneous access to this cacheline of mmio space (though that
shouldn't affect any of these machines!) and so can reduce the critical
section. At the cost of the extra lock/unlock though, versus the
potential for multiple cores to be reading from mmio.

> @@ -2016,6 +2015,8 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *engine)
>  	if (!intel_engine_initialized(engine))
>  		return;
>  
> +	tasklet_kill(&engine->irq_tasklet);

Imre suggested that we assert that the irq_tasklet is idle.
WARN_ON(test_bit(TASK_STATE_SCHED, &engine->irq_tasklet.state));
-Chris
Chris Wilson March 25, 2016, 12:56 p.m. UTC | #2
On Thu, Mar 24, 2016 at 10:24:49PM +0000, Chris Wilson wrote:
> On Thu, Mar 24, 2016 at 12:18:49PM +0000, Tvrtko Ursulin wrote:
> > @@ -2016,6 +2015,8 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *engine)
> >  	if (!intel_engine_initialized(engine))
> >  		return;
> >  
> > +	tasklet_kill(&engine->irq_tasklet);
> 
> Imre suggested that we assert that the irq_tasklet is idle.
> WARN_ON(test_bit(TASK_STATE_SCHED, &engine->irq_tasklet.state));

Whilst it may not be necessary in cleanup(), we should put a
tasklet_kill() in i915_reset_engine_cleanup()
-Chris

Patch
diff mbox

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index e0ba3e38000f..daf8e76966ae 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -2092,7 +2092,6 @@  static int i915_execlists(struct seq_file *m, void *data)
 	for_each_engine(engine, dev_priv, ring_id) {
 		struct drm_i915_gem_request *head_req = NULL;
 		int count = 0;
-		unsigned long flags;
 
 		seq_printf(m, "%s\n", engine->name);
 
@@ -2119,13 +2118,13 @@  static int i915_execlists(struct seq_file *m, void *data)
 				   i, status, ctx_id);
 		}
 
-		spin_lock_irqsave(&engine->execlist_lock, flags);
+		spin_lock_bh(&engine->execlist_lock);
 		list_for_each(cursor, &engine->execlist_queue)
 			count++;
 		head_req = list_first_entry_or_null(&engine->execlist_queue,
 						    struct drm_i915_gem_request,
 						    execlist_link);
-		spin_unlock_irqrestore(&engine->execlist_lock, flags);
+		spin_unlock_bh(&engine->execlist_lock);
 
 		seq_printf(m, "\t%d requests in queue\n", count);
 		if (head_req) {
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 8588c83abb35..898a99a630b8 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2840,13 +2840,13 @@  static void i915_gem_reset_engine_cleanup(struct drm_i915_private *dev_priv,
 	 */
 
 	if (i915.enable_execlists) {
-		spin_lock_irq(&engine->execlist_lock);
+		spin_lock_bh(&engine->execlist_lock);
 
 		/* list_splice_tail_init checks for empty lists */
 		list_splice_tail_init(&engine->execlist_queue,
 				      &engine->execlist_retired_req_list);
 
-		spin_unlock_irq(&engine->execlist_lock);
+		spin_unlock_bh(&engine->execlist_lock);
 		intel_execlists_retire_requests(engine);
 	}
 
@@ -2968,9 +2968,9 @@  i915_gem_retire_requests(struct drm_device *dev)
 		i915_gem_retire_requests_ring(engine);
 		idle &= list_empty(&engine->request_list);
 		if (i915.enable_execlists) {
-			spin_lock_irq(&engine->execlist_lock);
+			spin_lock_bh(&engine->execlist_lock);
 			idle &= list_empty(&engine->execlist_queue);
-			spin_unlock_irq(&engine->execlist_lock);
+			spin_unlock_bh(&engine->execlist_lock);
 
 			intel_execlists_retire_requests(engine);
 		}
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index a55a7cc317f8..0637c01d3c01 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1324,7 +1324,7 @@  gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift)
 	if (iir & (GT_RENDER_USER_INTERRUPT << test_shift))
 		notify_ring(engine);
 	if (iir & (GT_CONTEXT_SWITCH_INTERRUPT << test_shift))
-		intel_lrc_irq_handler(engine);
+		tasklet_schedule(&engine->irq_tasklet);
 }
 
 static irqreturn_t gen8_gt_irq_handler(struct drm_i915_private *dev_priv,
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 40ef4eaf580f..2d5e833b14d1 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -418,20 +418,18 @@  static void execlists_submit_requests(struct drm_i915_gem_request *rq0,
 {
 	struct drm_i915_private *dev_priv = rq0->i915;
 
-	/* BUG_ON(!irqs_disabled());  */
-
 	execlists_update_context(rq0);
 
 	if (rq1)
 		execlists_update_context(rq1);
 
-	spin_lock(&dev_priv->uncore.lock);
+	spin_lock_irq(&dev_priv->uncore.lock);
 	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);
 
 	execlists_elsp_write(rq0, rq1);
 
 	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
-	spin_unlock(&dev_priv->uncore.lock);
+	spin_unlock_irq(&dev_priv->uncore.lock);
 }
 
 static void execlists_context_unqueue(struct intel_engine_cs *engine)
@@ -538,13 +536,14 @@  get_context_status(struct intel_engine_cs *engine, unsigned int read_pointer,
 
 /**
  * intel_lrc_irq_handler() - handle Context Switch interrupts
- * @ring: Engine Command Streamer to handle.
+ * @engine: Engine Command Streamer to handle.
  *
  * Check the unread Context Status Buffers and manage the submission of new
  * contexts to the ELSP accordingly.
  */
-void intel_lrc_irq_handler(struct intel_engine_cs *engine)
+void intel_lrc_irq_handler(unsigned long data)
 {
+	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
 	struct drm_i915_private *dev_priv = engine->dev->dev_private;
 	u32 status_pointer;
 	unsigned int read_pointer, write_pointer;
@@ -552,7 +551,7 @@  void intel_lrc_irq_handler(struct intel_engine_cs *engine)
 	unsigned int csb_read = 0, i;
 	unsigned int submit_contexts = 0;
 
-	spin_lock(&dev_priv->uncore.lock);
+	spin_lock_irq(&dev_priv->uncore.lock);
 	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);
 
 	status_pointer = I915_READ_FW(RING_CONTEXT_STATUS_PTR(engine));
@@ -579,7 +578,7 @@  void intel_lrc_irq_handler(struct intel_engine_cs *engine)
 				    engine->next_context_status_buffer << 8));
 
 	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
-	spin_unlock(&dev_priv->uncore.lock);
+	spin_unlock_irq(&dev_priv->uncore.lock);
 
 	spin_lock(&engine->execlist_lock);
 
@@ -621,7 +620,7 @@  static void execlists_context_queue(struct drm_i915_gem_request *request)
 
 	i915_gem_request_reference(request);
 
-	spin_lock_irq(&engine->execlist_lock);
+	spin_lock_bh(&engine->execlist_lock);
 
 	list_for_each_entry(cursor, &engine->execlist_queue, execlist_link)
 		if (++num_elements > 2)
@@ -646,7 +645,7 @@  static void execlists_context_queue(struct drm_i915_gem_request *request)
 	if (num_elements == 0)
 		execlists_context_unqueue(engine);
 
-	spin_unlock_irq(&engine->execlist_lock);
+	spin_unlock_bh(&engine->execlist_lock);
 }
 
 static int logical_ring_invalidate_all_caches(struct drm_i915_gem_request *req)
@@ -1033,9 +1032,9 @@  void intel_execlists_retire_requests(struct intel_engine_cs *engine)
 		return;
 
 	INIT_LIST_HEAD(&retired_list);
-	spin_lock_irq(&engine->execlist_lock);
+	spin_lock_bh(&engine->execlist_lock);
 	list_replace_init(&engine->execlist_retired_req_list, &retired_list);
-	spin_unlock_irq(&engine->execlist_lock);
+	spin_unlock_bh(&engine->execlist_lock);
 
 	list_for_each_entry_safe(req, tmp, &retired_list, execlist_link) {
 		struct intel_context *ctx = req->ctx;
@@ -2016,6 +2015,8 @@  void intel_logical_ring_cleanup(struct intel_engine_cs *engine)
 	if (!intel_engine_initialized(engine))
 		return;
 
+	tasklet_kill(&engine->irq_tasklet);
+
 	dev_priv = engine->dev->dev_private;
 
 	if (engine->buffer) {
@@ -2089,6 +2090,9 @@  logical_ring_init(struct drm_device *dev, struct intel_engine_cs *engine)
 	INIT_LIST_HEAD(&engine->execlist_retired_req_list);
 	spin_lock_init(&engine->execlist_lock);
 
+	tasklet_init(&engine->irq_tasklet,
+		     intel_lrc_irq_handler, (unsigned long)engine);
+
 	logical_ring_init_platform_invariants(engine);
 
 	ret = i915_cmd_parser_init_ring(engine);
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index a17cb12221ba..0b0853eee91e 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -118,7 +118,6 @@  int intel_execlists_submission(struct i915_execbuffer_params *params,
 			       struct drm_i915_gem_execbuffer2 *args,
 			       struct list_head *vmas);
 
-void intel_lrc_irq_handler(struct intel_engine_cs *engine);
 void intel_execlists_retire_requests(struct intel_engine_cs *engine);
 
 #endif /* _INTEL_LRC_H_ */
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 221a94627aab..18074ab55f61 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -266,7 +266,8 @@  struct  intel_engine_cs {
 	} semaphore;
 
 	/* Execlists */
-	spinlock_t execlist_lock;
+	struct tasklet_struct irq_tasklet;
+	spinlock_t execlist_lock; /* used inside tasklet, use spin_lock_bh */
 	struct list_head execlist_queue;
 	struct list_head execlist_retired_req_list;
 	unsigned int next_context_status_buffer;