Message ID | 1458745056-25673-1-git-send-email-tvrtko.ursulin@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
On Wed, Mar 23, 2016 at 02:57:36PM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> Doing a lot of work in the interrupt handler introduces huge
> latencies to the system as a whole.
>
> Most dramatic effect can be seen by running an all engine
> stress test like igt/gem_exec_nop/all where, when the kernel
> config is lean enough, the whole system can be brought into
> multi-second periods of complete non-interactivity. That can
> look for example like this:
>
> NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143]
> Modules linked in: [redacted for brevity]
> CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G U L 4.5.0-160321+ #183
> Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1
> Workqueue: i915 gen6_pm_rps_work [i915]
> task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000
> RIP: 0010:[<ffffffff8104a3c2>] [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0
> RSP: 0000:ffff88014f403f38 EFLAGS: 00000206
> RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0
> RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80
> RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022
> R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030
> R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082
> FS: 0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Stack:
> 042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a
> 0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080
> 0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8
> Call Trace:
> <IRQ>
> [<ffffffff8104a716>] irq_exit+0x86/0x90
> [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50
> [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90
> <EOI>
> [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915]
> [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20
> [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915]
> [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0
> [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915]
> [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915]
> [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915]
> [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915]
> [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0
> [<ffffffff8105ab29>] process_one_work+0x139/0x350
> [<ffffffff8105b186>] worker_thread+0x126/0x490
> [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320
> [<ffffffff8105fa64>] kthread+0xc4/0xe0
> [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
> [<ffffffff814f351f>] ret_from_fork+0x3f/0x70
> [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
>
> I could not explain, or find a code path, which would explain
> a +20 second lockup, but from some instrumentation it was
> apparent the interrupts off proportion of time was between
> 10-25% under heavy load which is quite bad.
>
> By moving the GT interrupt handling to a tasklet in a most
> simple way, the problem above disappears completely.

Perfect segue into gem_syslatency. I think gem_syslatency is the better
tool to correlate disruptive system behaviour. And then continue on with
gem_latency to demonstrate that it doesn't adversely affect our
performance.

> Also, gem_latency -n 100 shows 25% better throughput and CPU
> usage, and 14% better latencies.

Mention the benefits of parallelising dispatch.

As fairly hit-and-miss as perf testing is on these machines, it is
looking in favour of using tasklets vs the rt kthread. The numbers swing
between 2-10%, but consistently improve in the nop sync latencies.
There are still several hours to go in this run before we cover the
dispatch latencies, but so far reasonable.

(Hmm, looks like there may be a possible degradation on the single nop
dispatch but an improvement on the continuous nop dispatch.)

> I did not find any gains or regressions with Synmark2 or
> GLbench under light testing. More benchmarking is certainly
> required.
>
> v2:
>  * execlists_lock should be taken as spin_lock_bh when
>    queuing work from userspace now. (Chris Wilson)
>  * uncore.lock must be taken with spin_lock_irq when
>    submitting requests since that now runs from either
>    softirq or process context.

There are a couple of execlist_lock usages outside of intel_lrc that may
or may not be useful to convert (low frequency reset / debug paths, so
way off the critical paths, but consistency in locking is invaluable).

> +	tasklet_init(&engine->irq_tasklet, intel_lrc_irq_handler,
> +		     (unsigned long)engine);

I like trying to split lines to cluster arguments if possible. Here I
think intel_lrc_irq_handler pairs with engine,

	tasklet_init(&engine->irq_tasklet,
		     intel_lrc_irq_handler, (unsigned long)engine);

*shrug*

> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> index 221a94627aab..29810cba8a8c 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -266,6 +266,7 @@ struct intel_engine_cs {
>  	} semaphore;
>
>  	/* Execlists */
> +	struct tasklet_struct irq_tasklet;
>  	spinlock_t execlist_lock;

	spinlock_t execlist_lock; /* used inside tasklet, use spin_lock_bh */

It's looking good, but once this run completes, I'm going to repeat it
just to confirm how stable my numbers are.

Critical bugfix, improvements, simpler patch than my kthread
implementation,
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris
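The split discussed in the review above (the hard-IRQ path only marks work as pending, and a deferred handler drains it later) can be sketched in plain userspace C. This is a rough model under stated assumptions, not the i915 code: `fake_engine`, `fake_irq` and `fake_tasklet_run` are hypothetical names, and a plain boolean stands in for the kernel's "tasklet scheduled" state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct fake_engine {
	bool scheduled;        /* plays the role of the tasklet's "scheduled" bit */
	unsigned int pending;  /* unread context-status events */
	unsigned int handled;  /* events processed by the deferred handler */
	unsigned int runs;     /* number of times the deferred handler did work */
};

/* "Hard IRQ" part: note the event, mark the deferred handler as scheduled. */
static void fake_irq(struct fake_engine *e)
{
	e->pending++;
	e->scheduled = true;   /* setting it again before a run changes nothing */
}

/* Deferred part: runs later and drains everything that accumulated. */
static void fake_tasklet_run(struct fake_engine *e)
{
	if (!e->scheduled)
		return;
	e->scheduled = false;
	e->runs++;
	while (e->pending > 0) {
		e->pending--;
		e->handled++;
	}
	assert(e->pending == 0);
}
```

Calling `fake_irq()` three times and then `fake_tasklet_run()` twice processes all three events in a single run; the second call finds nothing scheduled and returns. That coalescing is the property that keeps the time spent with interrupts disabled short in this model.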
On 24/03/16 10:56, Chris Wilson wrote:
> On Wed, Mar 23, 2016 at 02:57:36PM +0000, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> Doing a lot of work in the interrupt handler introduces huge
>> latencies to the system as a whole.
>>
>> Most dramatic effect can be seen by running an all engine
>> stress test like igt/gem_exec_nop/all where, when the kernel
>> config is lean enough, the whole system can be brought into
>> multi-second periods of complete non-interactivity. That can
>> look for example like this:
>>
>> NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143]
>> Modules linked in: [redacted for brevity]
>> CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G U L 4.5.0-160321+ #183
>> Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1
>> Workqueue: i915 gen6_pm_rps_work [i915]
>> task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000
>> RIP: 0010:[<ffffffff8104a3c2>] [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0
>> RSP: 0000:ffff88014f403f38 EFLAGS: 00000206
>> RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0
>> RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80
>> RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022
>> R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030
>> R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082
>> FS: 0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Stack:
>> 042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a
>> 0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080
>> 0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8
>> Call Trace:
>> <IRQ>
>> [<ffffffff8104a716>] irq_exit+0x86/0x90
>> [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50
>> [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90
>> <EOI>
>> [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915]
>> [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20
>> [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915]
>> [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0
>> [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915]
>> [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915]
>> [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915]
>> [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915]
>> [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0
>> [<ffffffff8105ab29>] process_one_work+0x139/0x350
>> [<ffffffff8105b186>] worker_thread+0x126/0x490
>> [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320
>> [<ffffffff8105fa64>] kthread+0xc4/0xe0
>> [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
>> [<ffffffff814f351f>] ret_from_fork+0x3f/0x70
>> [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
>>
>> I could not explain, or find a code path, which would explain
>> a +20 second lockup, but from some instrumentation it was
>> apparent the interrupts off proportion of time was between
>> 10-25% under heavy load which is quite bad.
>>
>> By moving the GT interrupt handling to a tasklet in a most
>> simple way, the problem above disappears completely.
>
> Perfect segue into gem_syslatency. I think gem_syslatency is the better
> tool to correlate disruptive system behaviour. And then continue on with
> gem_latency to demonstrate that it doesn't adversely affect our
> performance.

Will do.

>> Also, gem_latency -n 100 shows 25% better throughput and CPU
>> usage, and 14% better latencies.
>
> Mention the benefits of parallelising dispatch.

Hm, actually this should be the same as before I think.

> As fairly hit-and-miss as perf testing is on these machines, it is
> looking in favour of using tasklets vs the rt kthread. The numbers swing
> between 2-10%, but consistently improve in the nop sync latencies.
> There are still several hours to go in this run before we cover the
> dispatch latencies, but so far reasonable.
>
> (Hmm, looks like there may be a possible degradation on the single nop
> dispatch but an improvement on the continuous nop dispatch.)

We can add all the numbers you get to the commit message as well.

>> I did not find any gains or regressions with Synmark2 or
>> GLbench under light testing. More benchmarking is certainly
>> required.
>>
>> v2:
>>  * execlists_lock should be taken as spin_lock_bh when
>>    queuing work from userspace now. (Chris Wilson)
>>  * uncore.lock must be taken with spin_lock_irq when
>>    submitting requests since that now runs from either
>>    softirq or process context.
>
> There are a couple of execlist_lock usages outside of intel_lrc that may
> or may not be useful to convert (low frequency reset / debug paths, so
> way off the critical paths, but consistency in locking is invaluable).

Oh right, I've missed those.

>
>> +	tasklet_init(&engine->irq_tasklet, intel_lrc_irq_handler,
>> +		     (unsigned long)engine);
>
> I like trying to split lines to cluster arguments if possible. Here I
> think intel_lrc_irq_handler pairs with engine,
>
> 	tasklet_init(&engine->irq_tasklet,
> 		     intel_lrc_irq_handler, (unsigned long)engine);
>
> *shrug*

Yeah it is nicer.

>> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
>> index 221a94627aab..29810cba8a8c 100644
>> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
>> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
>> @@ -266,6 +266,7 @@ struct intel_engine_cs {
>>  	} semaphore;
>>
>>  	/* Execlists */
>> +	struct tasklet_struct irq_tasklet;
>>  	spinlock_t execlist_lock;
>
> 	spinlock_t execlist_lock; /* used inside tasklet, use spin_lock_bh */

Will do.

> It's looking good, but once this run completes, I'm going to repeat it
> just to confirm how stable my numbers are.
>
> Critical bugfix, improvements, simpler patch than my kthread
> implementation,
> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>

Okay I will respin with the above and we'll see. Unfortunately my test
platform just died so there will be a delay.

Regards,

Tvrtko
On 24/03/16 11:50, Tvrtko Ursulin wrote:
>>> Also, gem_latency -n 100 shows 25% better throughput and CPU
>>> usage, and 14% better latencies.
>>
>> Mention the benefits of parallelising dispatch.
>
> Hm, actually this should be the same as before I think.

Of course not, silly me. Will add this at the next opportunity then.

Regards,

Tvrtko
On Wed, 2016-03-23 at 14:57 +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> Doing a lot of work in the interrupt handler introduces huge
> latencies to the system as a whole.
>
> Most dramatic effect can be seen by running an all engine
> stress test like igt/gem_exec_nop/all where, when the kernel
> config is lean enough, the whole system can be brought into
> multi-second periods of complete non-interactivity. That can
> look for example like this:
>
> NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143]
> Modules linked in: [redacted for brevity]
> CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G U L 4.5.0-160321+ #183
> Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1
> Workqueue: i915 gen6_pm_rps_work [i915]
> task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000
> RIP: 0010:[<ffffffff8104a3c2>] [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0
> RSP: 0000:ffff88014f403f38 EFLAGS: 00000206
> RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0
> RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80
> RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022
> R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030
> R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082
> FS: 0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Stack:
> 042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a
> 0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080
> 0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8
> Call Trace:
> <IRQ>
> [<ffffffff8104a716>] irq_exit+0x86/0x90
> [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50
> [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90
> <EOI>
> [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915]
> [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20
> [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915]
> [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0
> [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915]
> [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915]
> [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915]
> [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915]
> [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0
> [<ffffffff8105ab29>] process_one_work+0x139/0x350
> [<ffffffff8105b186>] worker_thread+0x126/0x490
> [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320
> [<ffffffff8105fa64>] kthread+0xc4/0xe0
> [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
> [<ffffffff814f351f>] ret_from_fork+0x3f/0x70
> [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
>
> I could not explain, or find a code path, which would explain
> a +20 second lockup, but from some instrumentation it was
> apparent the interrupts off proportion of time was between
> 10-25% under heavy load which is quite bad.
>
> By moving the GT interrupt handling to a tasklet in a most
> simple way, the problem above disappears completely.
>
> Also, gem_latency -n 100 shows 25% better throughput and CPU
> usage, and 14% better latencies.
>
> I did not find any gains or regressions with Synmark2 or
> GLbench under light testing. More benchmarking is certainly
> required.
>
> v2:
>  * execlists_lock should be taken as spin_lock_bh when
>    queuing work from userspace now. (Chris Wilson)
>  * uncore.lock must be taken with spin_lock_irq when
>    submitting requests since that now runs from either
>    softirq or process context.
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>

You also have to synchronize against the tasklet now whenever we
synchronize against the IRQ, see gen6_disable_rps_interrupts(),
gen8_irq_power_well_pre_disable() and
intel_runtime_pm_disable_interrupts(). Not saying you should use a
threaded IRQ instead, but it does provide for this automatically.

--Imre

> ---
>  drivers/gpu/drm/i915/i915_irq.c         |  2 +-
>  drivers/gpu/drm/i915/intel_lrc.c        | 24 ++++++++++++++----------
>  drivers/gpu/drm/i915/intel_lrc.h        |  1 -
>  drivers/gpu/drm/i915/intel_ringbuffer.h |  1 +
>  4 files changed, 16 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 8f3e3309c3ab..e68134347007 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1324,7 +1324,7 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift)
>  	if (iir & (GT_RENDER_USER_INTERRUPT << test_shift))
>  		notify_ring(engine);
>  	if (iir & (GT_CONTEXT_SWITCH_INTERRUPT << test_shift))
> -		intel_lrc_irq_handler(engine);
> +		tasklet_schedule(&engine->irq_tasklet);
>  }
>
>  static irqreturn_t gen8_gt_irq_handler(struct drm_i915_private *dev_priv,
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 67592f8354d6..b3b62b3cd90d 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -418,20 +418,18 @@ static void execlists_submit_requests(struct drm_i915_gem_request *rq0,
>  {
>  	struct drm_i915_private *dev_priv = rq0->i915;
>
> -	/* BUG_ON(!irqs_disabled()); */
> -
>  	execlists_update_context(rq0);
>
>  	if (rq1)
>  		execlists_update_context(rq1);
>
> -	spin_lock(&dev_priv->uncore.lock);
> +	spin_lock_irq(&dev_priv->uncore.lock);
>  	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);
>
>  	execlists_elsp_write(rq0, rq1);
>
>  	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
> -	spin_unlock(&dev_priv->uncore.lock);
> +	spin_unlock_irq(&dev_priv->uncore.lock);
>  }
>
>  static void execlists_context_unqueue(struct intel_engine_cs *engine)
> @@ -536,13 +534,14 @@ get_context_status(struct drm_i915_private *dev_priv, u32 csb_base,
>
>  /**
>   * intel_lrc_irq_handler() - handle Context Switch interrupts
> - * @ring: Engine Command Streamer to handle.
> + * @engine: Engine Command Streamer to handle.
>   *
>   * Check the unread Context Status Buffers and manage the submission of new
>   * contexts to the ELSP accordingly.
>   */
> -void intel_lrc_irq_handler(struct intel_engine_cs *engine)
> +void intel_lrc_irq_handler(unsigned long data)
>  {
> +	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
>  	struct drm_i915_private *dev_priv = engine->dev->dev_private;
>  	u32 status_pointer;
>  	unsigned int read_pointer, write_pointer;
> @@ -551,7 +550,7 @@ void intel_lrc_irq_handler(struct intel_engine_cs *engine)
>  	unsigned int csb_read = 0, i;
>  	unsigned int submit_contexts = 0;
>
> -	spin_lock(&dev_priv->uncore.lock);
> +	spin_lock_irq(&dev_priv->uncore.lock);
>  	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);
>
>  	status_pointer = I915_READ_FW(RING_CONTEXT_STATUS_PTR(engine));
> @@ -579,7 +578,7 @@ void intel_lrc_irq_handler(struct intel_engine_cs *engine)
>  				    engine->next_context_status_buffer << 8));
>
>  	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
> -	spin_unlock(&dev_priv->uncore.lock);
> +	spin_unlock_irq(&dev_priv->uncore.lock);
>
>  	spin_lock(&engine->execlist_lock);
>
> @@ -621,7 +620,7 @@ static void execlists_context_queue(struct drm_i915_gem_request *request)
>
>  	i915_gem_request_reference(request);
>
> -	spin_lock_irq(&engine->execlist_lock);
> +	spin_lock_bh(&engine->execlist_lock);
>
>  	list_for_each_entry(cursor, &engine->execlist_queue, execlist_link)
>  		if (++num_elements > 2)
> @@ -646,7 +645,7 @@ static void execlists_context_queue(struct drm_i915_gem_request *request)
>  	if (num_elements == 0)
>  		execlists_context_unqueue(engine);
>
> -	spin_unlock_irq(&engine->execlist_lock);
> +	spin_unlock_bh(&engine->execlist_lock);
>  }
>
>  static int logical_ring_invalidate_all_caches(struct drm_i915_gem_request *req)
> @@ -2016,6 +2015,8 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *engine)
>  	if (!intel_engine_initialized(engine))
>  		return;
>
> +	tasklet_kill(&engine->irq_tasklet);
> +
>  	dev_priv = engine->dev->dev_private;
>
>  	if (engine->buffer) {
> @@ -2089,6 +2090,9 @@ logical_ring_init(struct drm_device *dev, struct intel_engine_cs *engine)
>  	INIT_LIST_HEAD(&engine->execlist_retired_req_list);
>  	spin_lock_init(&engine->execlist_lock);
>
> +	tasklet_init(&engine->irq_tasklet, intel_lrc_irq_handler,
> +		     (unsigned long)engine);
> +
>  	logical_ring_init_platform_invariants(engine);
>
>  	ret = i915_cmd_parser_init_ring(engine);
> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
> index 6690d93d603f..efcbd7bf9cc9 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.h
> +++ b/drivers/gpu/drm/i915/intel_lrc.h
> @@ -123,7 +123,6 @@ int intel_execlists_submission(struct i915_execbuffer_params *params,
>  			       struct drm_i915_gem_execbuffer2 *args,
>  			       struct list_head *vmas);
>
> -void intel_lrc_irq_handler(struct intel_engine_cs *engine);
>  void intel_execlists_retire_requests(struct intel_engine_cs *engine);
>
>  #endif /* _INTEL_LRC_H_ */
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> index 221a94627aab..29810cba8a8c 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -266,6 +266,7 @@ struct intel_engine_cs {
>  	} semaphore;
>
>  	/* Execlists */
> +	struct tasklet_struct irq_tasklet;
>  	spinlock_t execlist_lock;
>  	struct list_head execlist_queue;
>  	struct list_head execlist_retired_req_list;
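The patch above changes `intel_lrc_irq_handler()` to take an `unsigned long` and cast it back to the engine pointer, which is the callback shape `tasklet_init()` expects in this kernel version. A small userspace sketch of that cookie round-trip follows. Everything named `model_*` is a hypothetical stand-in, not the kernel's `tasklet_struct` or API, and the sketch assumes, as the kernel does, that a pointer fits in an `unsigned long`.

```c
#include <assert.h>

/* Hypothetical userspace stand-ins; not the kernel's tasklet_struct or API. */
struct model_tasklet {
	void (*func)(unsigned long);
	unsigned long data;
};

struct model_engine {
	struct model_tasklet irq_tasklet;
	int csb_events_seen;
};

static void model_tasklet_init(struct model_tasklet *t,
			       void (*func)(unsigned long), unsigned long data)
{
	t->func = func;
	t->data = data;
}

/*
 * Same shape as the patched handler: recover the engine from the opaque
 * cookie. Assumes sizeof(unsigned long) >= sizeof(void *), which holds on
 * the LP64 and ILP32 targets Linux runs on but not on every platform.
 */
static void model_irq_handler(unsigned long data)
{
	struct model_engine *engine = (struct model_engine *)data;

	engine->csb_events_seen++;
}

/* What the softirq core would eventually do with a scheduled tasklet. */
static void model_tasklet_invoke(struct model_tasklet *t)
{
	t->func(t->data);
}
```

Initialising with `model_tasklet_init(&engine.irq_tasklet, model_irq_handler, (unsigned long)&engine)` and invoking the tasklet twice bumps `csb_events_seen` to 2 on the same engine, showing that the cast is lossless under the stated size assumption.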
On Thu, Mar 24, 2016 at 05:56:40PM +0200, Imre Deak wrote:
> On Wed, 2016-03-23 at 14:57 +0000, Tvrtko Ursulin wrote:
> > From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >
> > Doing a lot of work in the interrupt handler introduces huge
> > latencies to the system as a whole.
> >
> > Most dramatic effect can be seen by running an all engine
> > stress test like igt/gem_exec_nop/all where, when the kernel
> > config is lean enough, the whole system can be brought into
> > multi-second periods of complete non-interactivity. That can
> > look for example like this:
> >
> > NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143]
> > Modules linked in: [redacted for brevity]
> > CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G U L 4.5.0-160321+ #183
> > Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1
> > Workqueue: i915 gen6_pm_rps_work [i915]
> > task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000
> > RIP: 0010:[<ffffffff8104a3c2>] [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0
> > RSP: 0000:ffff88014f403f38 EFLAGS: 00000206
> > RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0
> > RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80
> > RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022
> > R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030
> > R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082
> > FS: 0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > Stack:
> > 042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a
> > 0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080
> > 0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8
> > Call Trace:
> > <IRQ>
> > [<ffffffff8104a716>] irq_exit+0x86/0x90
> > [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50
> > [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90
> > <EOI>
> > [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915]
> > [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20
> > [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915]
> > [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0
> > [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915]
> > [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915]
> > [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915]
> > [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915]
> > [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0
> > [<ffffffff8105ab29>] process_one_work+0x139/0x350
> > [<ffffffff8105b186>] worker_thread+0x126/0x490
> > [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320
> > [<ffffffff8105fa64>] kthread+0xc4/0xe0
> > [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
> > [<ffffffff814f351f>] ret_from_fork+0x3f/0x70
> > [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
> >
> > I could not explain, or find a code path, which would explain
> > a +20 second lockup, but from some instrumentation it was
> > apparent the interrupts off proportion of time was between
> > 10-25% under heavy load which is quite bad.
> >
> > By moving the GT interrupt handling to a tasklet in a most
> > simple way, the problem above disappears completely.
> >
> > Also, gem_latency -n 100 shows 25% better throughput and CPU
> > usage, and 14% better latencies.
> >
> > I did not find any gains or regressions with Synmark2 or
> > GLbench under light testing. More benchmarking is certainly
> > required.
> >
> > v2:
> >  * execlists_lock should be taken as spin_lock_bh when
> >    queuing work from userspace now. (Chris Wilson)
> >  * uncore.lock must be taken with spin_lock_irq when
> >    submitting requests since that now runs from either
> >    softirq or process context.
> >
> > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
>
> You also have to synchronize against the tasklet now whenever we
> synchronize against the IRQ, see gen6_disable_rps_interrupts(),
> gen8_irq_power_well_pre_disable() and
> intel_runtime_pm_disable_interrupts(). Not saying you should use a
> threaded IRQ instead, but it does provide for this automatically.

But we don't synchronize against the irq for execlists since this
tasklet is guarded by the rpm wakeref (through mark_busy / mark_idle)
and we stop it before we finally release the irq.

Or have I missed something?
-Chris
On Thu, 2016-03-24 at 16:05 +0000, Chris Wilson wrote:
> On Thu, Mar 24, 2016 at 05:56:40PM +0200, Imre Deak wrote:
> > On Wed, 2016-03-23 at 14:57 +0000, Tvrtko Ursulin wrote:
> > > From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > >
> > > Doing a lot of work in the interrupt handler introduces huge
> > > latencies to the system as a whole.
> > >
> > > Most dramatic effect can be seen by running an all engine
> > > stress test like igt/gem_exec_nop/all where, when the kernel
> > > config is lean enough, the whole system can be brought into
> > > multi-second periods of complete non-interactivity. That can
> > > look for example like this:
> > >
> > > NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143]
> > > Modules linked in: [redacted for brevity]
> > > CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G U L 4.5.0-160321+ #183
> > > Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1
> > > Workqueue: i915 gen6_pm_rps_work [i915]
> > > task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000
> > > RIP: 0010:[<ffffffff8104a3c2>] [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0
> > > RSP: 0000:ffff88014f403f38 EFLAGS: 00000206
> > > RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0
> > > RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80
> > > RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022
> > > R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030
> > > R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082
> > > FS: 0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000
> > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0
> > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > Stack:
> > > 042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a
> > > 0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080
> > > 0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8
> > > Call Trace:
> > > <IRQ>
> > > [<ffffffff8104a716>] irq_exit+0x86/0x90
> > > [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50
> > > [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90
> > > <EOI>
> > > [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915]
> > > [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20
> > > [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915]
> > > [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0
> > > [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915]
> > > [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915]
> > > [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915]
> > > [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915]
> > > [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0
> > > [<ffffffff8105ab29>] process_one_work+0x139/0x350
> > > [<ffffffff8105b186>] worker_thread+0x126/0x490
> > > [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320
> > > [<ffffffff8105fa64>] kthread+0xc4/0xe0
> > > [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
> > > [<ffffffff814f351f>] ret_from_fork+0x3f/0x70
> > > [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
> > >
> > > I could not explain, or find a code path, which would explain
> > > a +20 second lockup, but from some instrumentation it was
> > > apparent the interrupts off proportion of time was between
> > > 10-25% under heavy load which is quite bad.
> > >
> > > By moving the GT interrupt handling to a tasklet in a most
> > > simple way, the problem above disappears completely.
> > >
> > > Also, gem_latency -n 100 shows 25% better throughput and CPU
> > > usage, and 14% better latencies.
> > >
> > > I did not find any gains or regressions with Synmark2 or
> > > GLbench under light testing. More benchmarking is certainly
> > > required.
> > >
> > > v2:
> > >  * execlists_lock should be taken as spin_lock_bh when
> > >    queuing work from userspace now. (Chris Wilson)
> > >  * uncore.lock must be taken with spin_lock_irq when
> > >    submitting requests since that now runs from either
> > >    softirq or process context.
> > >
> > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> >
> > You also have to synchronize against the tasklet now whenever we
> > synchronize against the IRQ, see gen6_disable_rps_interrupts(),
> > gen8_irq_power_well_pre_disable() and
> > intel_runtime_pm_disable_interrupts(). Not saying you should use a
> > threaded IRQ instead, but it does provide for this automatically.
>
> But we don't synchronize against the irq for execlists since this
> tasklet is guarded by the rpm wakeref (through mark_busy / mark_idle)
> and we stop it before we finally release the irq.

Hm yea, I missed that it's only an execlist tasklet and so there
shouldn't be any pending tasklet after mark_idle(). Perhaps it would
still make sense to assert for this in gen8_logical_ring_put_irq() or
somewhere? Similarly there is a tasklet_kill() in
intel_logical_ring_cleanup(), but there shouldn't be any pending
tasklet there either, so should we add an assert there too?

--Imre

> Or have I missed something?
> -Chris
>
On Thu, Mar 24, 2016 at 06:40:55PM +0200, Imre Deak wrote:
> Hm yea, I missed that it's only an execlist tasklet and so there
> shouldn't be any pending tasklet after mark_idle(). Perhaps it would
> still make sense to assert for this in gen8_logical_ring_put_irq() or
> somewhere? Similarly there is a tasklet_kill() in
> intel_logical_ring_cleanup(), but there shouldn't be any pending
> tasklet there either, so should we add an assert there too?

Yes, tasklet_kill() should be a nop. We could

	if (WARN_ON(test_bit(TASKLET_STATE_SCHED, &tasklet->state)))
		tasklet_kill(tasklet);

I don't see a particularly sensible spot to assert that the engines are
off before irq uninstall other than the assertions we have in execlists
that irqs are actually enabled when we try to submit, and the battery
of WARNs we have for trying to access the hardware whilst !rpm.
-Chris
On Thu, 2016-03-24 at 19:56 +0000, Chris Wilson wrote:
> On Thu, Mar 24, 2016 at 06:40:55PM +0200, Imre Deak wrote:
> > Hm yea, I missed that it's only an execlist tasklet and so there
> > shouldn't be any pending tasklet after mark_idle(). Perhaps it
> > would still make sense to assert for this in
> > gen8_logical_ring_put_irq() or somewhere? Similarly there is a
> > tasklet_kill() in intel_logical_ring_cleanup(), but there
> > shouldn't be any pending tasklet there either, so should we add
> > an assert there too?
>
> Yes, tasklet_kill() should be a nop. We could
>
> 	if (WARN_ON(test_bit(TASKLET_STATE_SCHED, &tasklet->state)))
> 		tasklet_kill(tasklet);
>
> I don't see a particularly sensible spot to assert that the engines
> are off before irq uninstall other than the assertions we have in
> execlists that irqs are actually enabled when we try to submit, and
> the battery of WARNs we have for trying to access the hardware
> whilst !rpm.

Ok, this was just a hand-wavy idea then. This tasklet also isn't much
different from other work we schedule from the interrupt handler, and
we don't have special checks for those either. The above WARN_ON would
still be useful as documentation imo.

--Imre
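For readers following the thread, the "warn if still pending, then kill" idea can be modelled outside the kernel. What follows is a minimal userspace sketch, not i915 or kernel code: `model_tasklet`, `model_schedule`, `model_run`, `model_kill_checked` and `model_count` are hypothetical names, and a C11 atomic flag stands in for the `TASKLET_STATE_SCHED` bit. It only illustrates why a kill on an idle tasklet should find nothing to do, and how a warning could document that expectation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for a tasklet; not the kernel structure. */
struct model_tasklet {
	atomic_bool scheduled;      /* plays the role of TASKLET_STATE_SCHED */
	void (*func)(void *data);   /* deferred handler */
	void *data;                 /* opaque argument passed to the handler */
	unsigned int warnings;      /* how often the "should be idle" check fired */
};

/* Mark the tasklet pending; returns true only if it was not pending already. */
static bool model_schedule(struct model_tasklet *t)
{
	return !atomic_exchange(&t->scheduled, true);
}

/* Run the handler if pending, clearing the flag before calling it. */
static void model_run(struct model_tasklet *t)
{
	if (atomic_exchange(&t->scheduled, false))
		t->func(t->data);
}

/*
 * Teardown helper: a pending tasklet here means the caller's assumption
 * ("everything is idle by now") was wrong, so record a warning and flush it.
 * On an idle tasklet this function does nothing at all.
 */
static void model_kill_checked(struct model_tasklet *t)
{
	if (atomic_load(&t->scheduled)) {
		t->warnings++;
		fprintf(stderr, "model: tasklet still pending at teardown\n");
		model_run(t);
	}
}

/* Example handler: counts invocations through the opaque data pointer. */
static void model_count(void *data)
{
	++*(unsigned int *)data;
}
```

In this model, calling `model_kill_checked()` after the last `model_run()` leaves `warnings` at zero, which corresponds to the "should be a nop" case; the warning only fires if something was scheduled after the point where the caller believed the engine to be idle.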
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 8f3e3309c3ab..e68134347007 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1324,7 +1324,7 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift)
 	if (iir & (GT_RENDER_USER_INTERRUPT << test_shift))
 		notify_ring(engine);
 	if (iir & (GT_CONTEXT_SWITCH_INTERRUPT << test_shift))
-		intel_lrc_irq_handler(engine);
+		tasklet_schedule(&engine->irq_tasklet);
 }
 
 static irqreturn_t gen8_gt_irq_handler(struct drm_i915_private *dev_priv,
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 67592f8354d6..b3b62b3cd90d 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -418,20 +418,18 @@ static void execlists_submit_requests(struct drm_i915_gem_request *rq0,
 {
 	struct drm_i915_private *dev_priv = rq0->i915;
 
-	/* BUG_ON(!irqs_disabled());  */
-
 	execlists_update_context(rq0);
 
 	if (rq1)
 		execlists_update_context(rq1);
 
-	spin_lock(&dev_priv->uncore.lock);
+	spin_lock_irq(&dev_priv->uncore.lock);
 	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);
 
 	execlists_elsp_write(rq0, rq1);
 
 	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
-	spin_unlock(&dev_priv->uncore.lock);
+	spin_unlock_irq(&dev_priv->uncore.lock);
 }
 
 static void execlists_context_unqueue(struct intel_engine_cs *engine)
@@ -536,13 +534,14 @@ get_context_status(struct drm_i915_private *dev_priv, u32 csb_base,
 
 /**
  * intel_lrc_irq_handler() - handle Context Switch interrupts
- * @ring: Engine Command Streamer to handle.
+ * @engine: Engine Command Streamer to handle.
  *
  * Check the unread Context Status Buffers and manage the submission of new
  * contexts to the ELSP accordingly.
  */
-void intel_lrc_irq_handler(struct intel_engine_cs *engine)
+void intel_lrc_irq_handler(unsigned long data)
 {
+	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
 	struct drm_i915_private *dev_priv = engine->dev->dev_private;
 	u32 status_pointer;
 	unsigned int read_pointer, write_pointer;
@@ -551,7 +550,7 @@ void intel_lrc_irq_handler(struct intel_engine_cs *engine)
 	unsigned int csb_read = 0, i;
 	unsigned int submit_contexts = 0;
 
-	spin_lock(&dev_priv->uncore.lock);
+	spin_lock_irq(&dev_priv->uncore.lock);
 	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);
 
 	status_pointer = I915_READ_FW(RING_CONTEXT_STATUS_PTR(engine));
@@ -579,7 +578,7 @@ void intel_lrc_irq_handler(struct intel_engine_cs *engine)
 				    engine->next_context_status_buffer << 8));
 
 	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
-	spin_unlock(&dev_priv->uncore.lock);
+	spin_unlock_irq(&dev_priv->uncore.lock);
 
 	spin_lock(&engine->execlist_lock);
 
@@ -621,7 +620,7 @@ static void execlists_context_queue(struct drm_i915_gem_request *request)
 
 	i915_gem_request_reference(request);
 
-	spin_lock_irq(&engine->execlist_lock);
+	spin_lock_bh(&engine->execlist_lock);
 
 	list_for_each_entry(cursor, &engine->execlist_queue, execlist_link)
 		if (++num_elements > 2)
@@ -646,7 +645,7 @@ static void execlists_context_queue(struct drm_i915_gem_request *request)
 	if (num_elements == 0)
 		execlists_context_unqueue(engine);
 
-	spin_unlock_irq(&engine->execlist_lock);
+	spin_unlock_bh(&engine->execlist_lock);
 }
 
 static int logical_ring_invalidate_all_caches(struct drm_i915_gem_request *req)
@@ -2016,6 +2015,8 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *engine)
 	if (!intel_engine_initialized(engine))
 		return;
 
+	tasklet_kill(&engine->irq_tasklet);
+
 	dev_priv = engine->dev->dev_private;
 
 	if (engine->buffer) {
@@ -2089,6 +2090,9 @@ logical_ring_init(struct drm_device *dev, struct intel_engine_cs *engine)
 	INIT_LIST_HEAD(&engine->execlist_retired_req_list);
 	spin_lock_init(&engine->execlist_lock);
 
+	tasklet_init(&engine->irq_tasklet, intel_lrc_irq_handler,
+		     (unsigned long)engine);
+
 	logical_ring_init_platform_invariants(engine);
 
 	ret = i915_cmd_parser_init_ring(engine);
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index 6690d93d603f..efcbd7bf9cc9 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -123,7 +123,6 @@ int intel_execlists_submission(struct i915_execbuffer_params *params,
 			       struct drm_i915_gem_execbuffer2 *args,
 			       struct list_head *vmas);
 
-void intel_lrc_irq_handler(struct intel_engine_cs *engine);
 void intel_execlists_retire_requests(struct intel_engine_cs *engine);
 
 #endif /* _INTEL_LRC_H_ */
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 221a94627aab..29810cba8a8c 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -266,6 +266,7 @@ struct intel_engine_cs {
 	} semaphore;
 
 	/* Execlists */
+	struct tasklet_struct irq_tasklet;
 	spinlock_t execlist_lock;
 	struct list_head execlist_queue;
 	struct list_head execlist_retired_req_list;
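
The shape of the change in the diff above, where the interrupt handler only requests a deferred run and the expensive processing happens later with the engine passed as an integer cookie, can be sketched as a single-threaded toy in userspace C. This is an illustration under stated assumptions, not driver code: `toy_engine`, `toy_irq`, `toy_softirq` and `toy_handler` are hypothetical, events are reduced to a counter, there is no locking, and `uintptr_t` replaces the kernel's `unsigned long` argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical engine state; field names are illustrative only. */
struct toy_engine {
	unsigned int pending_events;  /* raised by "hardware", not yet processed */
	unsigned int processed;       /* events consumed by the deferred handler */
	unsigned int handler_runs;    /* how many times the deferred handler ran */
	bool scheduled;               /* set while a deferred run is outstanding */
};

/* Deferred handler: recovers the engine from an integer cookie. */
static void toy_handler(uintptr_t data)
{
	struct toy_engine *engine = (struct toy_engine *)data;

	engine->handler_runs++;
	engine->processed += engine->pending_events;  /* drain everything at once */
	engine->pending_events = 0;
}

/* "Top half": record the event and request a deferred run, nothing more. */
static void toy_irq(struct toy_engine *engine)
{
	engine->pending_events++;
	engine->scheduled = true;     /* repeated requests coalesce into one run */
}

/* "Bottom half": run the handler once if a run was requested. */
static void toy_softirq(struct toy_engine *engine)
{
	if (engine->scheduled) {
		engine->scheduled = false;
		toy_handler((uintptr_t)engine);
	}
}
```

Three calls to `toy_irq()` followed by one `toy_softirq()` result in a single handler run that consumes all three events, which mirrors how several context-switch interrupts arriving before the tasklet runs collapse into one pass over the status buffer. The real code additionally needs the locking changes shown in the diff, which this toy deliberately leaves out.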