diff mbox

drm/i915: Upgrade execbuffer fail after resume failure to EIO

Message ID 1395734608-6561-1-git-send-email-chris@chris-wilson.co.uk (mailing list archive)
State New, archived
Headers show

Commit Message

Chris Wilson March 25, 2014, 8:03 a.m. UTC
If we try to execute on a known ring, but it has failed to be
initialised correctly, report that the GPU is hung rather than the
command invalid. This leaves us reporting EINVAL only if the user
requests execution on a ring that is not supported by the device.

This should prevent UXA from getting stuck in a null render loop after a
failed resume.

Reported-by: Jiri Kosina <jikos@jikos.cz>
References: https://bugs.freedesktop.org/show_bug.cgi?id=76554
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

Comments

Daniel Vetter March 25, 2014, 8:07 a.m. UTC | #1
On Tue, Mar 25, 2014 at 9:03 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> If we try to execute on a known ring, but it has failed to be
> initialised correctly, report that the GPU is hung rather than the
> command invalid. This leaves us reporting EINVAL only if the user
> requests execution on a ring that is not supported by the device.
>
> This should prevent UXA from getting stuck in a null render loop after a
> failed resume.
>
> Reported-by: Jiri Kosina <jikos@jikos.cz>
> References: https://bugs.freedesktop.org/show_bug.cgi?id=76554
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Feels a bit like duct-tape ... Shouldn't we instead stop tearing down
ringbuffer structures over suspend/resume? And maybe the same for init
with your patch applied.

Or we simply check for wedged first thing in execbuf ...
-Daniel
Chris Wilson March 25, 2014, 8:15 a.m. UTC | #2
On Tue, Mar 25, 2014 at 09:07:00AM +0100, Daniel Vetter wrote:
> On Tue, Mar 25, 2014 at 9:03 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > If we try to execute on a known ring, but it has failed to be
> > initialised correctly, report that the GPU is hung rather than the
> > command invalid. This leaves us reporting EINVAL only if the user
> > requests execution on a ring that is not supported by the device.
> >
> > This should prevent UXA from getting stuck in a null render loop after a
> > failed resume.
> >
> > Reported-by: Jiri Kosina <jikos@jikos.cz>
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=76554
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> 
> Feels a bit like duct-tape ... Shouldn't we instead stop tearing down
> ringbuffer structures over suspend/resume? And maybe the same for init
> with your patch applied.

Even if we did, we would still end up with g45 failing to restart
the ring at random, so we need some w/a.
 
> Or we simply check for wedged first thing in execbuf ...

See the first 2 patches ;-) The first is actually essential as we have
no other guard against writing into the non-existent ring.

I thought about doing that. However, I prefer doing arg validation
first i.e. get all (or as much as is feasible) of the EINVAL checks out
of the way so that we avoid touching hardware or leaking any information
to a malicious client. The problem in this case is where is not actually
an invalid arg.

Note that we will detect the EIO later before touching hardware (so long
as the first two patches are in place).
-Chris
Ben Widawsky March 25, 2014, 3:24 p.m. UTC | #3
On Tue, Mar 25, 2014 at 08:03:28AM +0000, Chris Wilson wrote:
> If we try to execute on a known ring, but it has failed to be
> initialised correctly, report that the GPU is hung rather than the
> command invalid. This leaves us reporting EINVAL only if the user
> requests execution on a ring that is not supported by the device.
> 
> This should prevent UXA from getting stuck in a null render loop after a
> failed resume.
> 
> Reported-by: Jiri Kosina <jikos@jikos.cz>
> References: https://bugs.freedesktop.org/show_bug.cgi?id=76554
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  drivers/gpu/drm/i915/i915_gem_execbuffer.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> index 1b45163e19f3..22c650490f54 100644
> --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> @@ -991,6 +991,18 @@ i915_reset_gen7_sol_offsets(struct drm_device *dev,
>  	return 0;
>  }
>  
> +static bool
> +intel_ring_valid(struct intel_ring_buffer *ring)
> +{
> +	switch (ring->id) {
> +	case RCS: return true;
> +	case VCS: return HAS_BSD(ring->dev);
> +	case BCS: return HAS_BLT(ring->dev);

intel_enable_blt()?

> +	case VECS: return HAS_VEBOX(ring->dev);
> +	default: return false;
> +	}
> +}
> +
>  static int
>  i915_gem_do_execbuffer(struct drm_device *dev, void *data,
>  		       struct drm_file *file,
> @@ -1041,7 +1053,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
>  	if (!intel_ring_initialized(ring)) {
>  		DRM_DEBUG("execbuf with invalid ring: %d\n",
>  			  (int)(args->flags & I915_EXEC_RING_MASK));
> -		return -EINVAL;
> +		return intel_ring_valid(ring) ? -EIO : -EINVAL;
>  	}
>  
>  	mode = args->flags & I915_EXEC_CONSTANTS_MASK;
> -- 
> 1.9.1
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Daniel Vetter March 25, 2014, 3:36 p.m. UTC | #4
On Tue, Mar 25, 2014 at 08:15:54AM +0000, Chris Wilson wrote:
> On Tue, Mar 25, 2014 at 09:07:00AM +0100, Daniel Vetter wrote:
> > On Tue, Mar 25, 2014 at 9:03 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > > If we try to execute on a known ring, but it has failed to be
> > > initialised correctly, report that the GPU is hung rather than the
> > > command invalid. This leaves us reporting EINVAL only if the user
> > > requests execution on a ring that is not supported by the device.
> > >
> > > This should prevent UXA from getting stuck in a null render loop after a
> > > failed resume.
> > >
> > > Reported-by: Jiri Kosina <jikos@jikos.cz>
> > > References: https://bugs.freedesktop.org/show_bug.cgi?id=76554
> > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > 
> > Feels a bit like duct-tape ... Shouldn't we instead stop tearing down
> > ringbuffer structures over suspend/resume? And maybe the same for init
> > with your patch applied.
> 
> Even if we did, we would still end up with g45 failing to restart
> the ring at random, so we need some w/a.
>  
> > Or we simply check for wedged first thing in execbuf ...
> 
> See the first 2 patches ;-) The first is actually essential as we have
> no other guard against writing into the non-existent ring.
> 
> I thought about doing that. However, I prefer doing arg validation
> first i.e. get all (or as much as is feasible) of the EINVAL checks out
> of the way so that we avoid touching hardware or leaking any information
> to a malicious client. The problem in this case is where is not actually
> an invalid arg.
> 
> Note that we will detect the EIO later before touching hardware (so long
> as the first two patches are in place).

Yeah I've seen the other patches. I think we should try to keep all the
ring structures around even when the hw init failed. I've made some feeble
attempts a while ago to split the structure init from the hw init stuff,
but kinda never fully materialized ...

Imo if our set of valid rings semi-randomly changes at runtime even,
that's not good.
-Daniel
Chris Wilson March 25, 2014, 4:28 p.m. UTC | #5
On Tue, Mar 25, 2014 at 08:24:00AM -0700, Ben Widawsky wrote:
> On Tue, Mar 25, 2014 at 08:03:28AM +0000, Chris Wilson wrote:
> > If we try to execute on a known ring, but it has failed to be
> > initialised correctly, report that the GPU is hung rather than the
> > command invalid. This leaves us reporting EINVAL only if the user
> > requests execution on a ring that is not supported by the device.
> > 
> > This should prevent UXA from getting stuck in a null render loop after a
> > failed resume.
> > 
> > Reported-by: Jiri Kosina <jikos@jikos.cz>
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=76554
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > ---
> >  drivers/gpu/drm/i915/i915_gem_execbuffer.c | 14 +++++++++++++-
> >  1 file changed, 13 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> > index 1b45163e19f3..22c650490f54 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> > @@ -991,6 +991,18 @@ i915_reset_gen7_sol_offsets(struct drm_device *dev,
> >  	return 0;
> >  }
> >  
> > +static bool
> > +intel_ring_valid(struct intel_ring_buffer *ring)
> > +{
> > +	switch (ring->id) {
> > +	case RCS: return true;
> > +	case VCS: return HAS_BSD(ring->dev);
> > +	case BCS: return HAS_BLT(ring->dev);
> 
> intel_enable_blt()?

But not exported, and below my level of caring ;-)
The cases were intel_enable_blt() != HAS_BLT() are those where the hw
would hang anyway.
-Chris
Chris Wilson March 25, 2014, 4:29 p.m. UTC | #6
On Tue, Mar 25, 2014 at 04:36:05PM +0100, Daniel Vetter wrote:
> On Tue, Mar 25, 2014 at 08:15:54AM +0000, Chris Wilson wrote:
> > On Tue, Mar 25, 2014 at 09:07:00AM +0100, Daniel Vetter wrote:
> > > On Tue, Mar 25, 2014 at 9:03 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > > > If we try to execute on a known ring, but it has failed to be
> > > > initialised correctly, report that the GPU is hung rather than the
> > > > command invalid. This leaves us reporting EINVAL only if the user
> > > > requests execution on a ring that is not supported by the device.
> > > >
> > > > This should prevent UXA from getting stuck in a null render loop after a
> > > > failed resume.
> > > >
> > > > Reported-by: Jiri Kosina <jikos@jikos.cz>
> > > > References: https://bugs.freedesktop.org/show_bug.cgi?id=76554
> > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > 
> > > Feels a bit like duct-tape ... Shouldn't we instead stop tearing down
> > > ringbuffer structures over suspend/resume? And maybe the same for init
> > > with your patch applied.
> > 
> > Even if we did, we would still end up with g45 failing to restart
> > the ring at random, so we need some w/a.
> >  
> > > Or we simply check for wedged first thing in execbuf ...
> > 
> > See the first 2 patches ;-) The first is actually essential as we have
> > no other guard against writing into the non-existent ring.
> > 
> > I thought about doing that. However, I prefer doing arg validation
> > first i.e. get all (or as much as is feasible) of the EINVAL checks out
> > of the way so that we avoid touching hardware or leaking any information
> > to a malicious client. The problem in this case is where is not actually
> > an invalid arg.
> > 
> > Note that we will detect the EIO later before touching hardware (so long
> > as the first two patches are in place).
> 
> Yeah I've seen the other patches. I think we should try to keep all the
> ring structures around even when the hw init failed. I've made some feeble
> attempts a while ago to split the structure init from the hw init stuff,
> but kinda never fully materialized ...
> 
> Imo if our set of valid rings semi-randomly changes at runtime even,
> that's not good.

Agreed, but sadly we can't trust hardware to always work, and we need
something to prevent explosions. I quite like the idea of marking the
GPU wedged if hw init fails so that we lose acceleration but keep
modesetting around.
-Chris
Daniel Vetter March 25, 2014, 4:52 p.m. UTC | #7
On Tue, Mar 25, 2014 at 5:29 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>> Yeah I've seen the other patches. I think we should try to keep all the
>> ring structures around even when the hw init failed. I've made some feeble
>> attempts a while ago to split the structure init from the hw init stuff,
>> but kinda never fully materialized ...
>>
>> Imo if our set of valid rings semi-randomly changes at runtime even,
>> that's not good.
>
> Agreed, but sadly we can't trust hardware to always work, and we need
> something to prevent explosions. I quite like the idea of marking the
> GPU wedged if hw init fails so that we lose acceleration but keep
> modesetting around.

Yeah, I agree that the  other two patches are neat indeed, it's this
one here where the shiny starts to come off a bit ;-) tbh I'd prefer a
simply if (terminally_wedged) return -EIO; here before the ring
checks, maybe with a comment stating why we need to have this order.

That, or fix the mess called ring init code ...
-Daniel
Chris Wilson March 25, 2014, 9:33 p.m. UTC | #8
On Tue, Mar 25, 2014 at 05:52:00PM +0100, Daniel Vetter wrote:
> On Tue, Mar 25, 2014 at 5:29 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> >> Yeah I've seen the other patches. I think we should try to keep all the
> >> ring structures around even when the hw init failed. I've made some feeble
> >> attempts a while ago to split the structure init from the hw init stuff,
> >> but kinda never fully materialized ...
> >>
> >> Imo if our set of valid rings semi-randomly changes at runtime even,
> >> that's not good.
> >
> > Agreed, but sadly we can't trust hardware to always work, and we need
> > something to prevent explosions. I quite like the idea of marking the
> > GPU wedged if hw init fails so that we lose acceleration but keep
> > modesetting around.
> 
> Yeah, I agree that the  other two patches are neat indeed, it's this
> one here where the shiny starts to come off a bit ;-) tbh I'd prefer a
> simply if (terminally_wedged) return -EIO; here before the ring
> checks, maybe with a comment stating why we need to have this order.

It's ok, it is only to prevent UXA from going off the rails after the
odd resume hang on g45...

> That, or fix the mess called ring init code ...

So if we fixed resume to avoid reallocating the ringbuffers across
resume, g45 would still fail to restart, but now we still have valid
objects (or would we tear them down because of the failure?) and so this
check passes and we later hit the EIO checks?
-Chris
Daniel Vetter March 25, 2014, 10:49 p.m. UTC | #9
On Tue, Mar 25, 2014 at 10:33 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>> That, or fix the mess called ring init code ...
>
> So if we fixed resume to avoid reallocating the ringbuffers across
> resume, g45 would still fail to restart, but now we still have valid
> objects (or would we tear them down because of the failure?) and so this
> check passes and we later hit the EIO checks?

Yeah that's my idea: We always set up the objects for all valid rings
and never tear them down. Upon failure we just set wedged and execbuf
would fall through until it notices the wedged states and returns
-EIO. Gives us tidy unified code and no special-casing for the ring
init failures. And if we do the split in ring init correctly the
ring_hw_init could unconditionally set wedged internally if it fails
and our code would dtrt no matter whether it's driver load, resume or
failure to resurrect the hw after reset. So callers wouldn't need to
care. Of course the ring_init functions which allocates the structures
in driver load still needs to be able to bail out in case that fails.
I can dream ...
-Daniel
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 1b45163e19f3..22c650490f54 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -991,6 +991,18 @@  i915_reset_gen7_sol_offsets(struct drm_device *dev,
 	return 0;
 }
 
+static bool
+intel_ring_valid(struct intel_ring_buffer *ring)
+{
+	switch (ring->id) {
+	case RCS: return true;
+	case VCS: return HAS_BSD(ring->dev);
+	case BCS: return HAS_BLT(ring->dev);
+	case VECS: return HAS_VEBOX(ring->dev);
+	default: return false;
+	}
+}
+
 static int
 i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 		       struct drm_file *file,
@@ -1041,7 +1053,7 @@  i915_gem_do_execbuffer(struct drm_device *dev, void *data,
 	if (!intel_ring_initialized(ring)) {
 		DRM_DEBUG("execbuf with invalid ring: %d\n",
 			  (int)(args->flags & I915_EXEC_RING_MASK));
-		return -EINVAL;
+		return intel_ring_valid(ring) ? -EIO : -EINVAL;
 	}
 
 	mode = args->flags & I915_EXEC_CONSTANTS_MASK;