Message ID | 1357784617-18956-1-git-send-email-chris@chris-wilson.co.uk (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu, Jan 10, 2013 at 02:23:37AM +0000, Chris Wilson wrote: > Once we thought we got semaphores working, we disabled kicking the ring > if hangcheck fired whilst waiting upon a ring as it was doing more harm > than good: > > commit 4e0e90dcb8a7df1229c69e30abebb59b0b3c2a1f > Author: Daniel Vetter <daniel.vetter@ffwll.ch> > Date: Wed Dec 14 13:56:58 2011 +0100 > > drm/i915: kicking rings stuck on semaphores considered harmful > > However, life is never that easy and semaphores are still causing > problems whereby the value written by one ring (bcs) is not being > propagated to the waiter (rcs). Thus the waiter never wakes up and we > declare the GPU hung, which often has unfortunate consequences, even if > we successfully reset the GPU. > > But the GPU is idle as it has completed the work, just didn't notify its > clients. So we can detect the incomplete wait during hang check and > probe the target ring to see if has indeed emitted the breadcrumb seqno > following the work and then and only then kick the waiter. > > Based on a suggestion by Ben Widawsky. > > References: https://bugs.freedesktop.org/show_bug.cgi?id=54226 > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch> > Cc: Ben Widawsky <ben@bwidawsk.net> > --- > drivers/gpu/drm/i915/i915_irq.c | 34 ++++++++++++++++++++++++++++++++++ > 1 file changed, 34 insertions(+) > > diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c > index 39e4177..3cc52b2 100644 > --- a/drivers/gpu/drm/i915/i915_irq.c > +++ b/drivers/gpu/drm/i915/i915_irq.c > @@ -1693,6 +1693,32 @@ static bool i915_hangcheck_ring_idle(struct intel_ring_buffer *ring, bool *err) > return false; > } > > +static bool semaphore_passed(struct intel_ring_buffer *waiter) > +{ > + struct drm_i915_private *dev_priv = waiter->dev->dev_private; > + int acthd = intel_ring_get_active_head(waiter) & HEAD_ADDR; > + struct intel_ring_buffer *signaller; > + u32 cmd; > + > + /* ACTHD is likely pointing to the dword after the actual command, > + * so scan backwards until we find the MBOX. > + */ > +#define WAIT (MI_SEMAPHORE_MBOX | MI_SEMAPHORE_COMPARE | MI_SEMAPHORE_REGISTER) > + do { > + cmd = ioread32(waiter->virtual_start+acthd); > + if ((cmd & ~(0x3 << 16)) == WAIT) > + break; > + acthd -= 4; > + if (acthd < 0) > + return false; > + } while (1); > +#undef WAIT > + > + signaller = &dev_priv->ring[(waiter->id + (((cmd >> 16) & 3) - 1)) % 3]; I hate you for this line ^^ > + return i915_seqno_passed(signaller->get_seqno(signaller, false), > + ioread32(waiter->virtual_start+acthd+4)+1); > +} > + For the sake of just testing if this makes the problem go away, this could have been a lot simpler. Just check the RCS mbox (since in our discussion, you said all the bugs had been RCS waiting). i915_seqno_passed(bcs->get_seqno, GEN6_RBSYNC) > static bool kick_ring(struct intel_ring_buffer *ring) > { > struct drm_device *dev = ring->dev; > @@ -1704,6 +1730,14 @@ static bool kick_ring(struct intel_ring_buffer *ring) > I915_WRITE_CTL(ring, tmp); > return true; > } > + if (INTEL_INFO(dev)->gen >= 6 && > + tmp & RING_WAIT_SEMAPHORE && > + semaphore_passed(ring)) { > + DRM_ERROR("Kicking stuck semaphore on %s\n", > + ring->name); > + I915_WRITE_CTL(ring, tmp); > + return true; > + } > return false; > } I'd go with just gen == 6, for now. Also, since we know the waiter from your horrifically complex calculation above, maybe we can get that info in the DRM_ERROR as well? On the idea: Acked-by: Ben Widawsky <ben@bwidawsk.net> I'll happily take time to turn it into an r-b if it fixes a bug.
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c index 39e4177..3cc52b2 100644 --- a/drivers/gpu/drm/i915/i915_irq.c +++ b/drivers/gpu/drm/i915/i915_irq.c @@ -1693,6 +1693,32 @@ static bool i915_hangcheck_ring_idle(struct intel_ring_buffer *ring, bool *err) return false; } +static bool semaphore_passed(struct intel_ring_buffer *waiter) +{ + struct drm_i915_private *dev_priv = waiter->dev->dev_private; + int acthd = intel_ring_get_active_head(waiter) & HEAD_ADDR; + struct intel_ring_buffer *signaller; + u32 cmd; + + /* ACTHD is likely pointing to the dword after the actual command, + * so scan backwards until we find the MBOX. + */ +#define WAIT (MI_SEMAPHORE_MBOX | MI_SEMAPHORE_COMPARE | MI_SEMAPHORE_REGISTER) + do { + cmd = ioread32(waiter->virtual_start+acthd); + if ((cmd & ~(0x3 << 16)) == WAIT) + break; + acthd -= 4; + if (acthd < 0) + return false; + } while (1); +#undef WAIT + + signaller = &dev_priv->ring[(waiter->id + (((cmd >> 16) & 3) - 1)) % 3]; + return i915_seqno_passed(signaller->get_seqno(signaller, false), + ioread32(waiter->virtual_start+acthd+4)+1); +} + static bool kick_ring(struct intel_ring_buffer *ring) { struct drm_device *dev = ring->dev; @@ -1704,6 +1730,14 @@ static bool kick_ring(struct intel_ring_buffer *ring) I915_WRITE_CTL(ring, tmp); return true; } + if (INTEL_INFO(dev)->gen >= 6 && + tmp & RING_WAIT_SEMAPHORE && + semaphore_passed(ring)) { + DRM_ERROR("Kicking stuck semaphore on %s\n", + ring->name); + I915_WRITE_CTL(ring, tmp); + return true; + } return false; }
Once we thought we got semaphores working, we disabled kicking the ring if hangcheck fired whilst waiting upon a ring as it was doing more harm than good: commit 4e0e90dcb8a7df1229c69e30abebb59b0b3c2a1f Author: Daniel Vetter <daniel.vetter@ffwll.ch> Date: Wed Dec 14 13:56:58 2011 +0100 drm/i915: kicking rings stuck on semaphores considered harmful However, life is never that easy and semaphores are still causing problems whereby the value written by one ring (bcs) is not being propagated to the waiter (rcs). Thus the waiter never wakes up and we declare the GPU hung, which often has unfortunate consequences, even if we successfully reset the GPU. But the GPU is idle as it has completed the work, just didn't notify its clients. So we can detect the incomplete wait during hang check and probe the target ring to see if has indeed emitted the breadcrumb seqno following the work and then and only then kick the waiter. Based on a suggestion by Ben Widawsky. References: https://bugs.freedesktop.org/show_bug.cgi?id=54226 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Ben Widawsky <ben@bwidawsk.net> --- drivers/gpu/drm/i915/i915_irq.c | 34 ++++++++++++++++++++++++++++++++++ 1 file changed, 34 insertions(+)