From patchwork Sun Aug 7 14:45:11 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chris Wilson X-Patchwork-Id: 9266403 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 57F7E60839 for ; Sun, 7 Aug 2016 14:45:55 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 4811E27165 for ; Sun, 7 Aug 2016 14:45:55 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 3CEAE27BFC; Sun, 7 Aug 2016 14:45:55 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-4.1 required=2.0 tests=BAYES_00,DKIM_SIGNED, RCVD_IN_DNSWL_MED,T_DKIM_INVALID autolearn=ham version=3.3.1 Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher DHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id 7699D27165 for ; Sun, 7 Aug 2016 14:45:54 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 0F42B6E2ED; Sun, 7 Aug 2016 14:45:54 +0000 (UTC) X-Original-To: intel-gfx@lists.freedesktop.org Delivered-To: intel-gfx@lists.freedesktop.org Received: from mail-wm0-x242.google.com (mail-wm0-x242.google.com [IPv6:2a00:1450:400c:c09::242]) by gabe.freedesktop.org (Postfix) with ESMTPS id 580736E2EA for ; Sun, 7 Aug 2016 14:45:51 +0000 (UTC) Received: by mail-wm0-x242.google.com with SMTP id x83so11037916wma.3 for ; Sun, 07 Aug 2016 07:45:51 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:from:to:cc:subject:date:message-id:in-reply-to:references; bh=IrdYKmN/HyvT9hoPtKLeV076riY4A3OX/kyHQOwi+Ro=; b=o5wTBDGGpNPiF1APNkozo6xXggVReXpFZGVRFx864pA019tsLyzdzSmaD1jBe64hHA LG1SDqU9n3xzam6Jvjy+QjmDcD1GMQeJ7mh99lZvm/wUIgjAlhLv1TTItg9h+4m3PHtC ss4L0oSZfRILnlmOVCM+/W75SlBDnU6adcgXIltN80gv3a3+ebWEk7Us92s5KlczlIfN CSelGDeskRWnM76QYSrMcsnk/csopW+kCCgMBP8glulClF4XdKWSIiYZEPBpJFNkR1Zu CGjTITAGed1gP30wJBhXZxIv7OYO1jZoIfUUMCE5A1+cuCuf1361hkbH6m5qhqu/Vfv4 JbLg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:sender:from:to:cc:subject:date:message-id :in-reply-to:references; bh=IrdYKmN/HyvT9hoPtKLeV076riY4A3OX/kyHQOwi+Ro=; b=hg4sTS24eZpr3fiu1OTCxRhTE0zQOc/pcvio/FHwvX82wlqGCqz40SJclnufDNXobD 9sMgRA5LtwiAf9O7Ew5l4b1iCscaLDTOhVi5nVL40N001v1OWJ679Pd05fuWe5plWiZO dtwuJe2I3FmFLxAMYpXjGJjXeeYHXDage/i3Kso23JrLH2RKaRADUWthaZny3qrBMWYC GQoZ0zC0jD90M5fE0sQHbE7XfG5FQ8pVXUyw/a2ZNo+p5f1HjQafFtdqhBhqT1gKFuga nGFcOZzPvMpCh6loZ+xzftmulAeo7lQxLxBDW2yOOOKBKLXv28mh/adG0fSfVCM8h/YJ T5bw== X-Gm-Message-State: AEkoouvcGeGlWgw4K/qtvc+AkCgkqKY3eL0KKLHWocdP9DDh/0Zzv6HPRRsy77HpjqqdKg== X-Received: by 10.194.186.231 with SMTP id fn7mr80545836wjc.164.1470581149745; Sun, 07 Aug 2016 07:45:49 -0700 (PDT) Received: from haswell.alporthouse.com ([78.156.65.138]) by smtp.gmail.com with ESMTPSA id o2sm1917611wjo.3.2016.08.07.07.45.48 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 07 Aug 2016 07:45:49 -0700 (PDT) From: Chris Wilson To: intel-gfx@lists.freedesktop.org Date: Sun, 7 Aug 2016 15:45:11 +0100 Message-Id: <1470581141-14432-4-git-send-email-chris@chris-wilson.co.uk> X-Mailer: git-send-email 2.8.1 In-Reply-To: <1470581141-14432-1-git-send-email-chris@chris-wilson.co.uk> References: <1470581141-14432-1-git-send-email-chris@chris-wilson.co.uk> Cc: Mika Kuoppala Subject: [Intel-gfx] [PATCH 03/33] drm/i915: Move missed interrupt detection from hangcheck to breadcrumbs X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" X-Virus-Scanned: ClamAV using ClamSMTP In commit 2529d57050af ("drm/i915: Drop racy markup of missed-irqs from idle-worker") the racy detection of missed interrupts was removed when we went idle. This however opened up the issue that the stuck waiters were not being reported, causing a test case failure. If we move the stuck waiter detection out of hangcheck and into the breadcrumb mechanims (i.e. the waiter) itself, we can avoid this issue entirely. This leaves hangcheck looking for a stuck GPU (inspecting for request advancement and HEAD motion), and breadcrumbs looking for a stuck waiter - hopefully make both easier to understand by their segregation. v2: Reduce the error message as we now run independently of hangcheck, and the hanging batch used by igt also counts as a stuck waiter causing extra warnings in dmesg. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=97104 Fixes: 2529d57050af (waiter"drm/i915: Drop racy markup of missed-irqs...") Testcase: igt/drv_missed_irq Signed-off-by: Chris Wilson Cc: Joonas Lahtinen Cc: Tvrtko Ursulin Cc: Mika Kuoppala --- drivers/gpu/drm/i915/i915_debugfs.c | 11 +++---- drivers/gpu/drm/i915/i915_gem.c | 10 ------- drivers/gpu/drm/i915/i915_irq.c | 26 +---------------- drivers/gpu/drm/i915/intel_breadcrumbs.c | 49 +++++++++++++++++++++++--------- drivers/gpu/drm/i915/intel_engine_cs.c | 1 + drivers/gpu/drm/i915/intel_ringbuffer.h | 6 ++-- 6 files changed, 45 insertions(+), 58 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c index 9bd41581b592..0627e170ea25 100644 --- a/drivers/gpu/drm/i915/i915_debugfs.c +++ b/drivers/gpu/drm/i915/i915_debugfs.c @@ -787,8 +787,6 @@ static void i915_ring_seqno_info(struct seq_file *m, seq_printf(m, "Current sequence (%s): %x\n", engine->name, intel_engine_get_seqno(engine)); - seq_printf(m, "Current user interrupts (%s): %lx\n", - engine->name, READ_ONCE(engine->breadcrumbs.irq_wakeups)); spin_lock(&b->lock); for (rb = rb_first(&b->waiters); rb; rb = rb_next(rb)) { @@ -1434,11 +1432,10 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused) engine->hangcheck.seqno, seqno[id], engine->last_submitted_seqno); - seq_printf(m, "\twaiters? %d\n", - intel_engine_has_waiter(engine)); - seq_printf(m, "\tuser interrupts = %lx [current %lx]\n", - engine->hangcheck.user_interrupts, - READ_ONCE(engine->breadcrumbs.irq_wakeups)); + seq_printf(m, "\twaiters? %s, fake irq active? %s\n", + yesno(intel_engine_has_waiter(engine)), + yesno(test_bit(engine->id, + &dev_priv->gpu_error.missed_irq_rings))); seq_printf(m, "\tACTHD = 0x%08llx [current 0x%08llx]\n", (long long)engine->hangcheck.acthd, (long long)acthd[id]); diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c index 654f0b015f97..cf94c6ed0ff5 100644 --- a/drivers/gpu/drm/i915/i915_gem.c +++ b/drivers/gpu/drm/i915/i915_gem.c @@ -2526,7 +2526,6 @@ i915_gem_idle_work_handler(struct work_struct *work) container_of(work, typeof(*dev_priv), gt.idle_work.work); struct drm_device *dev = &dev_priv->drm; struct intel_engine_cs *engine; - unsigned int stuck_engines; bool rearm_hangcheck; if (!READ_ONCE(dev_priv->gt.awake)) @@ -2556,15 +2555,6 @@ i915_gem_idle_work_handler(struct work_struct *work) dev_priv->gt.awake = false; rearm_hangcheck = false; - /* As we have disabled hangcheck, we need to unstick any waiters still - * hanging around. However, as we may be racing against the interrupt - * handler or the waiters themselves, we skip enabling the fake-irq. - */ - stuck_engines = intel_kick_waiters(dev_priv); - if (unlikely(stuck_engines)) - DRM_DEBUG_DRIVER("kicked stuck waiters (%x)...missed irq?\n", - stuck_engines); - if (INTEL_GEN(dev_priv) >= 6) gen6_rps_idle(dev_priv); intel_runtime_pm_put(dev_priv); diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c index 591f452ece68..ebb83d5a448b 100644 --- a/drivers/gpu/drm/i915/i915_irq.c +++ b/drivers/gpu/drm/i915/i915_irq.c @@ -972,10 +972,8 @@ static void ironlake_rps_change_irq_handler(struct drm_i915_private *dev_priv) static void notify_ring(struct intel_engine_cs *engine) { smp_store_mb(engine->breadcrumbs.irq_posted, true); - if (intel_engine_wakeup(engine)) { + if (intel_engine_wakeup(engine)) trace_i915_gem_request_notify(engine); - engine->breadcrumbs.irq_wakeups++; - } } static void vlv_c0_read(struct drm_i915_private *dev_priv, @@ -3044,22 +3042,6 @@ engine_stuck(struct intel_engine_cs *engine, u64 acthd) return HANGCHECK_HUNG; } -static unsigned long kick_waiters(struct intel_engine_cs *engine) -{ - struct drm_i915_private *i915 = engine->i915; - unsigned long irq_count = READ_ONCE(engine->breadcrumbs.irq_wakeups); - - if (engine->hangcheck.user_interrupts == irq_count && - !test_and_set_bit(engine->id, &i915->gpu_error.missed_irq_rings)) { - if (!test_bit(engine->id, &i915->gpu_error.test_irq_rings)) - DRM_ERROR("Hangcheck timer elapsed... %s idle\n", - engine->name); - - intel_engine_enable_fake_irq(engine); - } - - return irq_count; -} /* * This is called when the chip hasn't reported back with completed * batchbuffers in a long time. We keep track per ring seqno progress and @@ -3097,7 +3079,6 @@ static void i915_hangcheck_elapsed(struct work_struct *work) bool busy = intel_engine_has_waiter(engine); u64 acthd; u32 seqno; - unsigned user_interrupts; semaphore_clear_deadlocks(dev_priv); @@ -3114,15 +3095,11 @@ static void i915_hangcheck_elapsed(struct work_struct *work) acthd = intel_engine_get_active_head(engine); seqno = intel_engine_get_seqno(engine); - /* Reset stuck interrupts between batch advances */ - user_interrupts = 0; - if (engine->hangcheck.seqno == seqno) { if (!intel_engine_is_active(engine)) { engine->hangcheck.action = HANGCHECK_IDLE; if (busy) { /* Safeguard against driver failure */ - user_interrupts = kick_waiters(engine); engine->hangcheck.score += BUSY; } } else { @@ -3185,7 +3162,6 @@ static void i915_hangcheck_elapsed(struct work_struct *work) engine->hangcheck.seqno = seqno; engine->hangcheck.acthd = acthd; - engine->hangcheck.user_interrupts = user_interrupts; busy_count += busy; } diff --git a/drivers/gpu/drm/i915/intel_breadcrumbs.c b/drivers/gpu/drm/i915/intel_breadcrumbs.c index 90867446f1a5..8ecb3b6538fc 100644 --- a/drivers/gpu/drm/i915/intel_breadcrumbs.c +++ b/drivers/gpu/drm/i915/intel_breadcrumbs.c @@ -26,6 +26,29 @@ #include "i915_drv.h" +static void intel_breadcrumbs_hangcheck(unsigned long data) +{ + struct intel_engine_cs *engine = (struct intel_engine_cs *)data; + struct intel_breadcrumbs *b = &engine->breadcrumbs; + + if (!b->irq_enabled) + return; + + if (time_before(jiffies, b->timeout)) { + mod_timer(&b->hangcheck, b->timeout); + return; + } + + DRM_DEBUG("Hangcheck timer elapsed... %s idle\n", engine->name); + set_bit(engine->id, &engine->i915->gpu_error.missed_irq_rings); + mod_timer(&engine->breadcrumbs.fake_irq, jiffies + 1); +} + +static unsigned long wait_timeout(void) +{ + return round_jiffies_up(jiffies + DRM_I915_HANGCHECK_JIFFIES); +} + static void intel_breadcrumbs_fake_irq(unsigned long data) { struct intel_engine_cs *engine = (struct intel_engine_cs *)data; @@ -51,13 +74,6 @@ static void irq_enable(struct intel_engine_cs *engine) */ engine->breadcrumbs.irq_posted = true; - /* Make sure the current hangcheck doesn't falsely accuse a just - * started irq handler from missing an interrupt (because the - * interrupt count still matches the stale value from when - * the irq handler was disabled, many hangchecks ago). - */ - engine->breadcrumbs.irq_wakeups++; - spin_lock_irq(&engine->i915->irq_lock); engine->irq_enable(engine); spin_unlock_irq(&engine->i915->irq_lock); @@ -98,8 +114,13 @@ static void __intel_breadcrumbs_enable_irq(struct intel_breadcrumbs *b) } if (!b->irq_enabled || - test_bit(engine->id, &i915->gpu_error.missed_irq_rings)) + test_bit(engine->id, &i915->gpu_error.missed_irq_rings)) { mod_timer(&b->fake_irq, jiffies + 1); + } else { + /* Ensure we never sleep indefinitely */ + GEM_BUG_ON(!time_after(b->timeout, jiffies)); + mod_timer(&b->hangcheck, b->timeout); + } /* Ensure that even if the GPU hangs, we get woken up. * @@ -219,6 +240,7 @@ static bool __intel_engine_add_wait(struct intel_engine_cs *engine, GEM_BUG_ON(!next && !first); if (next && next != &wait->node) { GEM_BUG_ON(first); + b->timeout = wait_timeout(); b->first_wait = to_wait(next); smp_store_mb(b->irq_seqno_bh, b->first_wait->tsk); /* As there is a delay between reading the current @@ -245,6 +267,7 @@ static bool __intel_engine_add_wait(struct intel_engine_cs *engine, if (first) { GEM_BUG_ON(rb_first(&b->waiters) != &wait->node); + b->timeout = wait_timeout(); b->first_wait = wait; smp_store_mb(b->irq_seqno_bh, wait->tsk); /* After assigning ourselves as the new bottom-half, we must @@ -277,11 +300,6 @@ bool intel_engine_add_wait(struct intel_engine_cs *engine, return first; } -void intel_engine_enable_fake_irq(struct intel_engine_cs *engine) -{ - mod_timer(&engine->breadcrumbs.fake_irq, jiffies + 1); -} - static inline bool chain_wakeup(struct rb_node *rb, int priority) { return rb && to_wait(rb)->tsk->prio <= priority; @@ -359,6 +377,7 @@ void intel_engine_remove_wait(struct intel_engine_cs *engine, * the interrupt, or if we have to handle an * exception rather than a seqno completion. */ + b->timeout = wait_timeout(); b->first_wait = to_wait(next); smp_store_mb(b->irq_seqno_bh, b->first_wait->tsk); if (b->first_wait->seqno != wait->seqno) @@ -533,6 +552,9 @@ int intel_engine_init_breadcrumbs(struct intel_engine_cs *engine) struct task_struct *tsk; spin_lock_init(&b->lock); + setup_timer(&b->hangcheck, + intel_breadcrumbs_hangcheck, + (unsigned long)engine); setup_timer(&b->fake_irq, intel_breadcrumbs_fake_irq, (unsigned long)engine); @@ -561,6 +583,7 @@ void intel_engine_fini_breadcrumbs(struct intel_engine_cs *engine) kthread_stop(b->signaler); del_timer_sync(&b->fake_irq); + del_timer_sync(&b->hangcheck); } unsigned int intel_kick_waiters(struct drm_i915_private *i915) diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c index e9b301ae2d0c..0dd3d1de18aa 100644 --- a/drivers/gpu/drm/i915/intel_engine_cs.c +++ b/drivers/gpu/drm/i915/intel_engine_cs.c @@ -164,6 +164,7 @@ cleanup: void intel_engine_init_hangcheck(struct intel_engine_cs *engine) { memset(&engine->hangcheck, 0, sizeof(engine->hangcheck)); + clear_bit(engine->id, &engine->i915->gpu_error.missed_irq_rings); } static void intel_engine_init_requests(struct intel_engine_cs *engine) diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h index 43e545e44352..4aed4586b0b6 100644 --- a/drivers/gpu/drm/i915/intel_ringbuffer.h +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h @@ -75,7 +75,6 @@ enum intel_engine_hangcheck_action { struct intel_engine_hangcheck { u64 acthd; - unsigned long user_interrupts; u32 seqno; int score; enum intel_engine_hangcheck_action action; @@ -173,7 +172,6 @@ struct intel_engine_cs { */ struct intel_breadcrumbs { struct task_struct *irq_seqno_bh; /* bh for user interrupts */ - unsigned long irq_wakeups; bool irq_posted; spinlock_t lock; /* protects the lists of requests */ @@ -183,6 +181,9 @@ struct intel_engine_cs { struct task_struct *signaler; /* used for fence signalling */ struct drm_i915_gem_request *first_signal; struct timer_list fake_irq; /* used after a missed interrupt */ + struct timer_list hangcheck; /* detect missed interrupts */ + + unsigned long timeout; bool irq_enabled : 1; bool rpm_wakelock : 1; @@ -560,7 +561,6 @@ static inline bool intel_engine_wakeup(struct intel_engine_cs *engine) return wakeup; } -void intel_engine_enable_fake_irq(struct intel_engine_cs *engine); void intel_engine_fini_breadcrumbs(struct intel_engine_cs *engine); unsigned int intel_kick_waiters(struct drm_i915_private *i915); unsigned int intel_kick_signalers(struct drm_i915_private *i915);