
drm/i915: Advance seqno upon resetting the GPU following a hang

Message ID 1368019770-4653-1-git-send-email-chris@chris-wilson.co.uk (mailing list archive)
State New, archived

Commit Message

Chris Wilson May 8, 2013, 1:29 p.m. UTC
There is an unlikely corner case whereby a lockless wait may not notice
a GPU hang and reset, and so continue to wait for the device to advance
beyond the chosen seqno. This of course may never happen as the waiter
may be the only user. Instead, we can explicitly advance the device
seqno to match the requests that are forcibly retired following the
hang.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_gem.c |   15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

Comments

Daniel Vetter May 8, 2013, 2:02 p.m. UTC | #1
On Wed, May 08, 2013 at 02:29:30PM +0100, Chris Wilson wrote:
> There is an unlikely corner case whereby a lockless wait may not notice
> a GPU hang and reset, and so continue to wait for the device to advance
> beyond the chosen seqno. This of course may never happen as the waiter
> may be the only user. Instead, we can explicitly advance the device
> seqno to match the requests that are forcibly retired following the
> hang.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

This race is why the reset counter must always increase and can't just
flip-flop between the reset-in-progress and everything-works states.

Now if we want to unwedge on resume we need to reconsider this, but imo it
would be easier to simply remember the reset counter before we wedge the
gpu and restore that one (incremented as if the gpu reset worked). We
already assume that wedged will never collide with a real reset counter,
so this should work.
-Daniel

> ---
>  drivers/gpu/drm/i915/i915_gem.c |   15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 84ee1f2..b3c8abd 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2118,8 +2118,11 @@ static void i915_gem_free_request(struct drm_i915_gem_request *request)
>  }
>  
>  static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
> -				      struct intel_ring_buffer *ring)
> +				      struct intel_ring_buffer *ring,
> +				      u32 seqno)
>  {
> +	int i;
> +
>  	while (!list_empty(&ring->request_list)) {
>  		struct drm_i915_gem_request *request;
>  
> @@ -2139,6 +2142,10 @@ static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
>  
>  		i915_gem_object_move_to_inactive(obj);
>  	}
> +
> +	intel_ring_init_seqno(ring, seqno);
> +	for (i = 0; i < ARRAY_SIZE(ring->sync_seqno); i++)
> +		ring->sync_seqno[i] = 0;
>  }
>  
>  static void i915_gem_reset_fences(struct drm_device *dev)
> @@ -2167,10 +2174,14 @@ void i915_gem_reset(struct drm_device *dev)
>  	struct drm_i915_private *dev_priv = dev->dev_private;
>  	struct drm_i915_gem_object *obj;
>  	struct intel_ring_buffer *ring;
> +	u32 seqno;
>  	int i;
>  
> +	if (i915_gem_get_seqno(dev, &seqno))
> +		seqno = dev_priv->next_seqno - 1;
> +
>  	for_each_ring(ring, dev_priv, i)
> -		i915_gem_reset_ring_lists(dev_priv, ring);
> +		i915_gem_reset_ring_lists(dev_priv, ring, seqno);
>  
>  	/* Move everything out of the GPU domains to ensure we do any
>  	 * necessary invalidation upon reuse.
> -- 
> 1.7.10.4
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Chris Wilson May 8, 2013, 2:06 p.m. UTC | #2
On Wed, May 08, 2013 at 04:02:00PM +0200, Daniel Vetter wrote:
> On Wed, May 08, 2013 at 02:29:30PM +0100, Chris Wilson wrote:
> > There is an unlikely corner case whereby a lockless wait may not notice
> > a GPU hang and reset, and so continue to wait for the device to advance
> > beyond the chosen seqno. This of course may never happen as the waiter
> > may be the only user. Instead, we can explicitly advance the device
> > seqno to match the requests that are forcibly retired following the
> > hang.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> 
> This race is why the reset counter must always increase and can't just
> flip-flop between the reset-in-progress and everything-works states.
> 
> Now if we want to unwedge on resume we need to reconsider this, but imo it
> would be easier to simply remember the reset counter before we wedge the
> gpu and restore that one (incremented as if the gpu reset worked). We
> already assume that wedged will never collide with a real reset counter,
> so this should work.

Agreed that this is an unwedge-upon-resume issue, but my argument here is
that this leaves the hardware state consistent with what we forcibly
reset it to. From that perspective, your suggestion papers over the bug,
whereas this is the neat solution.
-Chris
Daniel Vetter May 10, 2013, 3:02 p.m. UTC | #3
On Wed, May 8, 2013 at 4:06 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Wed, May 08, 2013 at 04:02:00PM +0200, Daniel Vetter wrote:
>> On Wed, May 08, 2013 at 02:29:30PM +0100, Chris Wilson wrote:
>> > There is an unlikely corner case whereby a lockless wait may not notice
>> > a GPU hang and reset, and so continue to wait for the device to advance
>> > beyond the chosen seqno. This of course may never happen as the waiter
>> > may be the only user. Instead, we can explicitly advance the device
>> > seqno to match the requests that are forcibly retired following the
>> > hang.
>> >
>> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>
>> This race is why the reset counter must always increase and can't just
>> flip-flop between the reset-in-progress and everything-works states.
>>
>> Now if we want to unwedge on resume we need to reconsider this, but imo it
>> would be easier to simply remember the reset counter before we wedge the
>> gpu and restore that one (incremented as if the gpu reset worked). We
>> already assume that wedged will never collide with a real reset counter,
>> so this should work.
>
> Agreed that this is an unwedge-upon-resume issue, but my argument here is
> that this leaves the hardware state consistent with what we forcibly
> reset it to. From that perspective, your suggestion papers over the bug,
> whereas this is the neat solution.

Yeah, for the reset case I agree that just continuing in the sequence
would be more resilient. I'm still a bit unsure though what to do
across suspend/resume (where we currently force-reset the sequence
numbers, too). Maybe we need the poke-y stick there, too (in the form
of kicking waiters and incrementing the reset counter).
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
Mika Kuoppala May 13, 2013, 1:10 p.m. UTC | #4
Chris Wilson <chris@chris-wilson.co.uk> writes:

> There is an unlikely corner case whereby a lockless wait may not notice
> a GPU hang and reset, and so continue to wait for the device to advance
> beyond the chosen seqno. This of course may never happen as the waiter
> may be the only user. Instead, we can explicitly advance the device
> seqno to match the requests that are forcibly retired following the
> hang.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  drivers/gpu/drm/i915/i915_gem.c |   15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 84ee1f2..b3c8abd 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2118,8 +2118,11 @@ static void i915_gem_free_request(struct drm_i915_gem_request *request)
>  }
>  
>  static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
> -				      struct intel_ring_buffer *ring)
> +				      struct intel_ring_buffer *ring,
> +				      u32 seqno)
>  {
> +	int i;
> +
>  	while (!list_empty(&ring->request_list)) {
>  		struct drm_i915_gem_request *request;
>  
> @@ -2139,6 +2142,10 @@ static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
>  
>  		i915_gem_object_move_to_inactive(obj);
>  	}
> +
> +	intel_ring_init_seqno(ring, seqno);
> +	for (i = 0; i < ARRAY_SIZE(ring->sync_seqno); i++)
> +		ring->sync_seqno[i] = 0;
>  }

I remember pondering about resetting the sync_seqnos
inside intel_ring_init_seqno(). Is there a reason
not to?

>  static void i915_gem_reset_fences(struct drm_device *dev)
> @@ -2167,10 +2174,14 @@ void i915_gem_reset(struct drm_device *dev)
>  	struct drm_i915_private *dev_priv = dev->dev_private;
>  	struct drm_i915_gem_object *obj;
>  	struct intel_ring_buffer *ring;
> +	u32 seqno;
>  	int i;
>  
> +	if (i915_gem_get_seqno(dev, &seqno))
> +		seqno = dev_priv->next_seqno - 1;
> +
>  	for_each_ring(ring, dev_priv, i)
> -		i915_gem_reset_ring_lists(dev_priv, ring);
> +		i915_gem_reset_ring_lists(dev_priv, ring, seqno);
>  
>  	/* Move everything out of the GPU domains to ensure we do any
>  	 * necessary invalidation upon reuse.
> -- 
> 1.7.10.4
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Chris Wilson May 14, 2013, 10:34 a.m. UTC | #5
On Mon, May 13, 2013 at 04:10:18PM +0300, Mika Kuoppala wrote:
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> > +
> > +	intel_ring_init_seqno(ring, seqno);
> > +	for (i = 0; i < ARRAY_SIZE(ring->sync_seqno); i++)
> > +		ring->sync_seqno[i] = 0;
> >  }
> 
> I remember pondering about resetting the sync_seqnos
> inside intel_ring_init_seqno(). Is there a reason
> not to?

Not a strong one. Conceptually the ring->sync_seqno[] belong to the other
rings, so I felt it was clumsy for intel_ring_init_seqno() to falsely
claim ownership and reset its own sync_seqno. But I think we can
refactor away those qualms with a comment.
-Chris
Mika Kuoppala May 14, 2013, 12:31 p.m. UTC | #6
Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Mon, May 13, 2013 at 04:10:18PM +0300, Mika Kuoppala wrote:
>> Chris Wilson <chris@chris-wilson.co.uk> writes:
>> > +
>> > +	intel_ring_init_seqno(ring, seqno);
>> > +	for (i = 0; i < ARRAY_SIZE(ring->sync_seqno); i++)
>> > +		ring->sync_seqno[i] = 0;
>> >  }
>> 
>> I remember pondering about resetting the sync_seqnos
>> inside intel_ring_init_seqno(). Is there a reason
>> not to?
>
> Not a strong one. Conceptually the ring->sync_seqno[] belong to the other
> rings, so I felt it was clumsy for intel_ring_init_seqno() to falsely
> claim ownership and reset its own sync_seqno. But I think we can
> refactor away those qualms with a comment.

The existing intel_ring_init_seqno() already clumsily
resets the sync registers of the other rings. Since we can't,
and won't, initialize anything but all of the ring seqnos
at once, the existing code could be more explicit about that.

But for this patch:
Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>

Patch

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 84ee1f2..b3c8abd 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2118,8 +2118,11 @@  static void i915_gem_free_request(struct drm_i915_gem_request *request)
 }
 
 static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
-				      struct intel_ring_buffer *ring)
+				      struct intel_ring_buffer *ring,
+				      u32 seqno)
 {
+	int i;
+
 	while (!list_empty(&ring->request_list)) {
 		struct drm_i915_gem_request *request;
 
@@ -2139,6 +2142,10 @@  static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
 
 		i915_gem_object_move_to_inactive(obj);
 	}
+
+	intel_ring_init_seqno(ring, seqno);
+	for (i = 0; i < ARRAY_SIZE(ring->sync_seqno); i++)
+		ring->sync_seqno[i] = 0;
 }
 
 static void i915_gem_reset_fences(struct drm_device *dev)
@@ -2167,10 +2174,14 @@  void i915_gem_reset(struct drm_device *dev)
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	struct drm_i915_gem_object *obj;
 	struct intel_ring_buffer *ring;
+	u32 seqno;
 	int i;
 
+	if (i915_gem_get_seqno(dev, &seqno))
+		seqno = dev_priv->next_seqno - 1;
+
 	for_each_ring(ring, dev_priv, i)
-		i915_gem_reset_ring_lists(dev_priv, ring);
+		i915_gem_reset_ring_lists(dev_priv, ring, seqno);
 
 	/* Move everything out of the GPU domains to ensure we do any
 	 * necessary invalidation upon reuse.