Message ID | 1368019770-4653-1-git-send-email-chris@chris-wilson.co.uk (mailing list archive)
State      | New, archived
On Wed, May 08, 2013 at 02:29:30PM +0100, Chris Wilson wrote:
> There is an unlikely corner case whereby a lockless wait may not notice
> a GPU hang and reset, and so continue to wait for the device to advance
> beyond the chosen seqno. This of course may never happen as the waiter
> may be the only user. Instead, we can explicitly advance the device
> seqno to match the requests that are forcibly retired following the
> hang.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

This race is why the reset counter must always increase and can't just
flip-flop between the reset-in-progress and everything-works states.

Now if we want to unwedge on resume we need to reconsider this, but imo it
would be easier to simply remember the reset counter before we wedge the
gpu and restore that one (incremented as if the gpu reset worked). We
already assume that wedged will never collide with a real reset counter,
so this should work.
-Daniel

> ---
>  drivers/gpu/drm/i915/i915_gem.c | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 84ee1f2..b3c8abd 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2118,8 +2118,11 @@ static void i915_gem_free_request(struct drm_i915_gem_request *request)
>  }
>
>  static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
> -				      struct intel_ring_buffer *ring)
> +				      struct intel_ring_buffer *ring,
> +				      u32 seqno)
>  {
> +	int i;
> +
>  	while (!list_empty(&ring->request_list)) {
>  		struct drm_i915_gem_request *request;
>
> @@ -2139,6 +2142,10 @@ static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
>
>  		i915_gem_object_move_to_inactive(obj);
>  	}
> +
> +	intel_ring_init_seqno(ring, seqno);
> +	for (i = 0; i < ARRAY_SIZE(ring->sync_seqno); i++)
> +		ring->sync_seqno[i] = 0;
>  }
>
>  static void i915_gem_reset_fences(struct drm_device *dev)
> @@ -2167,10 +2174,14 @@ void i915_gem_reset(struct drm_device *dev)
>  	struct drm_i915_private *dev_priv = dev->dev_private;
>  	struct drm_i915_gem_object *obj;
>  	struct intel_ring_buffer *ring;
> +	u32 seqno;
>  	int i;
>
> +	if (i915_gem_get_seqno(dev, &seqno))
> +		seqno = dev_priv->next_seqno - 1;
> +
>  	for_each_ring(ring, dev_priv, i)
> -		i915_gem_reset_ring_lists(dev_priv, ring);
> +		i915_gem_reset_ring_lists(dev_priv, ring, seqno);
>
>  	/* Move everything out of the GPU domains to ensure we do any
>  	 * necessary invalidation upon reuse.
> --
> 1.7.10.4
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
On Wed, May 08, 2013 at 04:02:00PM +0200, Daniel Vetter wrote:
> On Wed, May 08, 2013 at 02:29:30PM +0100, Chris Wilson wrote:
> > There is an unlikely corner case whereby a lockless wait may not notice
> > a GPU hang and reset, and so continue to wait for the device to advance
> > beyond the chosen seqno. This of course may never happen as the waiter
> > may be the only user. Instead, we can explicitly advance the device
> > seqno to match the requests that are forcibly retired following the
> > hang.
> >
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>
> This race is why the reset counter must always increase and can't just
> flip-flop between the reset-in-progress and everything-works states.
>
> Now if we want to unwedge on resume we need to reconsider this, but imo it
> would be easier to simply remember the reset counter before we wedge the
> gpu and restore that one (incremented as if the gpu reset worked). We
> already assume that wedged will never collide with a real reset counter,
> so this should work.

Agree that this is an unwedge-upon-resume issue, but my argument here is
that this leaves the hardware state consistent with what we forcibly
reset it to. From that perspective your suggestion is papering over this
here bug and this is the neat solution.
-Chris
On Wed, May 8, 2013 at 4:06 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Wed, May 08, 2013 at 04:02:00PM +0200, Daniel Vetter wrote:
>> On Wed, May 08, 2013 at 02:29:30PM +0100, Chris Wilson wrote:
>> > There is an unlikely corner case whereby a lockless wait may not notice
>> > a GPU hang and reset, and so continue to wait for the device to advance
>> > beyond the chosen seqno. This of course may never happen as the waiter
>> > may be the only user. Instead, we can explicitly advance the device
>> > seqno to match the requests that are forcibly retired following the
>> > hang.
>> >
>> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>
>> This race is why the reset counter must always increase and can't just
>> flip-flop between the reset-in-progress and everything-works states.
>>
>> Now if we want to unwedge on resume we need to reconsider this, but imo it
>> would be easier to simply remember the reset counter before we wedge the
>> gpu and restore that one (incremented as if the gpu reset worked). We
>> already assume that wedged will never collide with a real reset counter,
>> so this should work.
>
> Agree that this a unwedge-upon-resume issue, but my argument here is
> that this leaves the hardware state consistent with what we forcibly
> reset it to. From that perspective your suggestion is papering over this
> here bug and this is the neat solution.

Yeah, for the reset case I agree that just continuing in the sequence
would be more resilient. I'm still a bit unsure though what to do across
suspend/resume (where we currently force-reset the sequence numbers, too).
Maybe we need the poke-y stick there, too (in the form of kicking waiters
and incrementing the reset counter).
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
Chris Wilson <chris@chris-wilson.co.uk> writes:

> There is an unlikely corner case whereby a lockless wait may not notice
> a GPU hang and reset, and so continue to wait for the device to advance
> beyond the chosen seqno. This of course may never happen as the waiter
> may be the only user. Instead, we can explicitly advance the device
> seqno to match the requests that are forcibly retired following the
> hang.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  drivers/gpu/drm/i915/i915_gem.c | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 84ee1f2..b3c8abd 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2118,8 +2118,11 @@ static void i915_gem_free_request(struct drm_i915_gem_request *request)
>  }
>
>  static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
> -				      struct intel_ring_buffer *ring)
> +				      struct intel_ring_buffer *ring,
> +				      u32 seqno)
>  {
> +	int i;
> +
>  	while (!list_empty(&ring->request_list)) {
>  		struct drm_i915_gem_request *request;
>
> @@ -2139,6 +2142,10 @@ static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
>
>  		i915_gem_object_move_to_inactive(obj);
>  	}
> +
> +	intel_ring_init_seqno(ring, seqno);
> +	for (i = 0; i < ARRAY_SIZE(ring->sync_seqno); i++)
> +		ring->sync_seqno[i] = 0;
>  }

I remember pondering about resetting sync_seqno's
inside intel_ring_init_seqno(). Is there reason
not to?

>  static void i915_gem_reset_fences(struct drm_device *dev)
> @@ -2167,10 +2174,14 @@ void i915_gem_reset(struct drm_device *dev)
>  	struct drm_i915_private *dev_priv = dev->dev_private;
>  	struct drm_i915_gem_object *obj;
>  	struct intel_ring_buffer *ring;
> +	u32 seqno;
>  	int i;
>
> +	if (i915_gem_get_seqno(dev, &seqno))
> +		seqno = dev_priv->next_seqno - 1;
> +
>  	for_each_ring(ring, dev_priv, i)
> -		i915_gem_reset_ring_lists(dev_priv, ring);
> +		i915_gem_reset_ring_lists(dev_priv, ring, seqno);
>
>  	/* Move everything out of the GPU domains to ensure we do any
>  	 * necessary invalidation upon reuse.
> --
> 1.7.10.4
On Mon, May 13, 2013 at 04:10:18PM +0300, Mika Kuoppala wrote:
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> > +
> > +	intel_ring_init_seqno(ring, seqno);
> > +	for (i = 0; i < ARRAY_SIZE(ring->sync_seqno); i++)
> > +		ring->sync_seqno[i] = 0;
> >  }
>
> I remember pondering about resetting sync_seqno's
> inside intel_ring_init_seqno(). Is there reason
> not to?

Not a strong one. Conceptually the ring->sync_seqno[] belong to the other
rings, so I felt it was clumsy for intel_ring_init_seqno() to falsely
claim ownership and reset its own sync_seqno. But I think we can
refactor away those qualms with a comment.
-Chris
Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Mon, May 13, 2013 at 04:10:18PM +0300, Mika Kuoppala wrote:
>> Chris Wilson <chris@chris-wilson.co.uk> writes:
>> > +
>> > +	intel_ring_init_seqno(ring, seqno);
>> > +	for (i = 0; i < ARRAY_SIZE(ring->sync_seqno); i++)
>> > +		ring->sync_seqno[i] = 0;
>> >  }
>>
>> I remember pondering about resetting sync_seqno's
>> inside intel_ring_init_seqno(). Is there reason
>> not to?
>
> Not a strong one. Conceptually the ring->sync_seqno[] belong to the other
> rings, so I felt it was clumsy for intel_ring_init_seqno() to falsely
> claim ownership and reset its own sync_seqno. But I think we can
> refactor away those qualms with a comment.

The existing intel_ring_init_seqno() already clumsily resets the sync
registers of other rings. As we can't and won't initialize anything but
all of the ring seqnos at once, the existing code could be more explicit
on that. But for this patch:

Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 84ee1f2..b3c8abd 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2118,8 +2118,11 @@ static void i915_gem_free_request(struct drm_i915_gem_request *request)
 }
 
 static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
-				      struct intel_ring_buffer *ring)
+				      struct intel_ring_buffer *ring,
+				      u32 seqno)
 {
+	int i;
+
 	while (!list_empty(&ring->request_list)) {
 		struct drm_i915_gem_request *request;
 
@@ -2139,6 +2142,10 @@ static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
 
 		i915_gem_object_move_to_inactive(obj);
 	}
+
+	intel_ring_init_seqno(ring, seqno);
+	for (i = 0; i < ARRAY_SIZE(ring->sync_seqno); i++)
+		ring->sync_seqno[i] = 0;
 }
 
 static void i915_gem_reset_fences(struct drm_device *dev)
@@ -2167,10 +2174,14 @@ void i915_gem_reset(struct drm_device *dev)
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	struct drm_i915_gem_object *obj;
 	struct intel_ring_buffer *ring;
+	u32 seqno;
 	int i;
 
+	if (i915_gem_get_seqno(dev, &seqno))
+		seqno = dev_priv->next_seqno - 1;
+
 	for_each_ring(ring, dev_priv, i)
-		i915_gem_reset_ring_lists(dev_priv, ring);
+		i915_gem_reset_ring_lists(dev_priv, ring, seqno);
 
 	/* Move everything out of the GPU domains to ensure we do any
 	 * necessary invalidation upon reuse.
There is an unlikely corner case whereby a lockless wait may not notice
a GPU hang and reset, and so continue to wait for the device to advance
beyond the chosen seqno. This of course may never happen as the waiter
may be the only user. Instead, we can explicitly advance the device
seqno to match the requests that are forcibly retired following the
hang.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_gem.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)