Message ID | 1395734608-6561-1-git-send-email-chris@chris-wilson.co.uk (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Tue, Mar 25, 2014 at 9:03 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote: > If we try to execute on a known ring, but it has failed to be > initialised correctly, report that the GPU is hung rather than the > command invalid. This leaves us reporting EINVAL only if the user > requests execution on a ring that is not supported by the device. > > This should prevent UXA from getting stuck in a null render loop after a > failed resume. > > Reported-by: Jiri Kosina <jikos@jikos.cz> > References: https://bugs.freedesktop.org/show_bug.cgi?id=76554 > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Feels a bit like duct-tape ... Shouldn't we instead stop tearing down ringbuffer structures over suspend/resume? And maybe the same for init with your patch applied. Or we simply check for wedged first thing in execbuf ... -Daniel
On Tue, Mar 25, 2014 at 09:07:00AM +0100, Daniel Vetter wrote: > On Tue, Mar 25, 2014 at 9:03 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote: > > If we try to execute on a known ring, but it has failed to be > > initialised correctly, report that the GPU is hung rather than the > > command invalid. This leaves us reporting EINVAL only if the user > > requests execution on a ring that is not supported by the device. > > > > This should prevent UXA from getting stuck in a null render loop after a > > failed resume. > > > > Reported-by: Jiri Kosina <jikos@jikos.cz> > > References: https://bugs.freedesktop.org/show_bug.cgi?id=76554 > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > > Feels a bit like duct-tape ... Shouldn't we instead stop tearing down > ringbuffer structures over suspend/resume? And maybe the same for init > with your patch applied. Even if we did, we would still end up with g45 failing to restart the ring at random, so we need some w/a. > Or we simply check for wedged first thing in execbuf ... See the first 2 patches ;-) The first is actually essential as we have no other guard against writing into the non-existent ring. I thought about doing that. However, I prefer doing arg validation first i.e. get all (or as much as is feasible) of the EINVAL checks out of the way so that we avoid touching hardware or leaking any information to a malicious client. The problem in this case is where is not actually an invalid arg. Note that we will detect the EIO later before touching hardware (so long as the first two patches are in place). -Chris
On Tue, Mar 25, 2014 at 08:03:28AM +0000, Chris Wilson wrote: > If we try to execute on a known ring, but it has failed to be > initialised correctly, report that the GPU is hung rather than the > command invalid. This leaves us reporting EINVAL only if the user > requests execution on a ring that is not supported by the device. > > This should prevent UXA from getting stuck in a null render loop after a > failed resume. > > Reported-by: Jiri Kosina <jikos@jikos.cz> > References: https://bugs.freedesktop.org/show_bug.cgi?id=76554 > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > --- > drivers/gpu/drm/i915/i915_gem_execbuffer.c | 14 +++++++++++++- > 1 file changed, 13 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c > index 1b45163e19f3..22c650490f54 100644 > --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c > +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c > @@ -991,6 +991,18 @@ i915_reset_gen7_sol_offsets(struct drm_device *dev, > return 0; > } > > +static bool > +intel_ring_valid(struct intel_ring_buffer *ring) > +{ > + switch (ring->id) { > + case RCS: return true; > + case VCS: return HAS_BSD(ring->dev); > + case BCS: return HAS_BLT(ring->dev); intel_enable_blt()? > + case VECS: return HAS_VEBOX(ring->dev); > + default: return false; > + } > +} > + > static int > i915_gem_do_execbuffer(struct drm_device *dev, void *data, > struct drm_file *file, > @@ -1041,7 +1053,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data, > if (!intel_ring_initialized(ring)) { > DRM_DEBUG("execbuf with invalid ring: %d\n", > (int)(args->flags & I915_EXEC_RING_MASK)); > - return -EINVAL; > + return intel_ring_valid(ring) ? -EIO : -EINVAL; > } > > mode = args->flags & I915_EXEC_CONSTANTS_MASK; > -- > 1.9.1 > > _______________________________________________ > Intel-gfx mailing list > Intel-gfx@lists.freedesktop.org > http://lists.freedesktop.org/mailman/listinfo/intel-gfx
On Tue, Mar 25, 2014 at 08:15:54AM +0000, Chris Wilson wrote: > On Tue, Mar 25, 2014 at 09:07:00AM +0100, Daniel Vetter wrote: > > On Tue, Mar 25, 2014 at 9:03 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote: > > > If we try to execute on a known ring, but it has failed to be > > > initialised correctly, report that the GPU is hung rather than the > > > command invalid. This leaves us reporting EINVAL only if the user > > > requests execution on a ring that is not supported by the device. > > > > > > This should prevent UXA from getting stuck in a null render loop after a > > > failed resume. > > > > > > Reported-by: Jiri Kosina <jikos@jikos.cz> > > > References: https://bugs.freedesktop.org/show_bug.cgi?id=76554 > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > > > > Feels a bit like duct-tape ... Shouldn't we instead stop tearing down > > ringbuffer structures over suspend/resume? And maybe the same for init > > with your patch applied. > > Even if we did, we would still end up with g45 failing to restart > the ring at random, so we need some w/a. > > > Or we simply check for wedged first thing in execbuf ... > > See the first 2 patches ;-) The first is actually essential as we have > no other guard against writing into the non-existent ring. > > I thought about doing that. However, I prefer doing arg validation > first i.e. get all (or as much as is feasible) of the EINVAL checks out > of the way so that we avoid touching hardware or leaking any information > to a malicious client. The problem in this case is where is not actually > an invalid arg. > > Note that we will detect the EIO later before touching hardware (so long > as the first two patches are in place). Yeah I've seen the other patches. I think we should try to keep all the ring structures around even when the hw init failed. I've made some feeble attempts a while ago to split the structure init from the hw init stuff, but kinda never fully materialized ... Imo if our set of valid rings semi-randomly changes at runtime even, that's not good. -Daniel
On Tue, Mar 25, 2014 at 08:24:00AM -0700, Ben Widawsky wrote: > On Tue, Mar 25, 2014 at 08:03:28AM +0000, Chris Wilson wrote: > > If we try to execute on a known ring, but it has failed to be > > initialised correctly, report that the GPU is hung rather than the > > command invalid. This leaves us reporting EINVAL only if the user > > requests execution on a ring that is not supported by the device. > > > > This should prevent UXA from getting stuck in a null render loop after a > > failed resume. > > > > Reported-by: Jiri Kosina <jikos@jikos.cz> > > References: https://bugs.freedesktop.org/show_bug.cgi?id=76554 > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > > --- > > drivers/gpu/drm/i915/i915_gem_execbuffer.c | 14 +++++++++++++- > > 1 file changed, 13 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c > > index 1b45163e19f3..22c650490f54 100644 > > --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c > > +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c > > @@ -991,6 +991,18 @@ i915_reset_gen7_sol_offsets(struct drm_device *dev, > > return 0; > > } > > > > +static bool > > +intel_ring_valid(struct intel_ring_buffer *ring) > > +{ > > + switch (ring->id) { > > + case RCS: return true; > > + case VCS: return HAS_BSD(ring->dev); > > + case BCS: return HAS_BLT(ring->dev); > > intel_enable_blt()? But not exported, and below my level of caring ;-) The cases were intel_enable_blt() != HAS_BLT() are those where the hw would hang anyway. -Chris
On Tue, Mar 25, 2014 at 04:36:05PM +0100, Daniel Vetter wrote: > On Tue, Mar 25, 2014 at 08:15:54AM +0000, Chris Wilson wrote: > > On Tue, Mar 25, 2014 at 09:07:00AM +0100, Daniel Vetter wrote: > > > On Tue, Mar 25, 2014 at 9:03 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote: > > > > If we try to execute on a known ring, but it has failed to be > > > > initialised correctly, report that the GPU is hung rather than the > > > > command invalid. This leaves us reporting EINVAL only if the user > > > > requests execution on a ring that is not supported by the device. > > > > > > > > This should prevent UXA from getting stuck in a null render loop after a > > > > failed resume. > > > > > > > > Reported-by: Jiri Kosina <jikos@jikos.cz> > > > > References: https://bugs.freedesktop.org/show_bug.cgi?id=76554 > > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > > > > > > Feels a bit like duct-tape ... Shouldn't we instead stop tearing down > > > ringbuffer structures over suspend/resume? And maybe the same for init > > > with your patch applied. > > > > Even if we did, we would still end up with g45 failing to restart > > the ring at random, so we need some w/a. > > > > > Or we simply check for wedged first thing in execbuf ... > > > > See the first 2 patches ;-) The first is actually essential as we have > > no other guard against writing into the non-existent ring. > > > > I thought about doing that. However, I prefer doing arg validation > > first i.e. get all (or as much as is feasible) of the EINVAL checks out > > of the way so that we avoid touching hardware or leaking any information > > to a malicious client. The problem in this case is where is not actually > > an invalid arg. > > > > Note that we will detect the EIO later before touching hardware (so long > > as the first two patches are in place). > > Yeah I've seen the other patches. I think we should try to keep all the > ring structures around even when the hw init failed. I've made some feeble > attempts a while ago to split the structure init from the hw init stuff, > but kinda never fully materialized ... > > Imo if our set of valid rings semi-randomly changes at runtime even, > that's not good. Agreed, but sadly we can't trust hardware to always work, and we need something to prevent explosions. I quite like the idea of marking the GPU wedged if hw init fails so that we lose acceleration but keep modesetting around. -Chris
On Tue, Mar 25, 2014 at 5:29 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote: >> Yeah I've seen the other patches. I think we should try to keep all the >> ring structures around even when the hw init failed. I've made some feeble >> attempts a while ago to split the structure init from the hw init stuff, >> but kinda never fully materialized ... >> >> Imo if our set of valid rings semi-randomly changes at runtime even, >> that's not good. > > Agreed, but sadly we can't trust hardware to always work, and we need > something to prevent explosions. I quite like the idea of marking the > GPU wedged if hw init fails so that we lose acceleration but keep > modesetting around. Yeah, I agree that the other two patches are neat indeed, it's this one here where the shiny starts to come off a bit ;-) tbh I'd prefer a simply if (terminally_wedged) return -EIO; here before the ring checks, maybe with a comment stating why we need to have this order. That, or fix the mess called ring init code ... -Daniel
On Tue, Mar 25, 2014 at 05:52:00PM +0100, Daniel Vetter wrote: > On Tue, Mar 25, 2014 at 5:29 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote: > >> Yeah I've seen the other patches. I think we should try to keep all the > >> ring structures around even when the hw init failed. I've made some feeble > >> attempts a while ago to split the structure init from the hw init stuff, > >> but kinda never fully materialized ... > >> > >> Imo if our set of valid rings semi-randomly changes at runtime even, > >> that's not good. > > > > Agreed, but sadly we can't trust hardware to always work, and we need > > something to prevent explosions. I quite like the idea of marking the > > GPU wedged if hw init fails so that we lose acceleration but keep > > modesetting around. > > Yeah, I agree that the other two patches are neat indeed, it's this > one here where the shiny starts to come off a bit ;-) tbh I'd prefer a > simply if (terminally_wedged) return -EIO; here before the ring > checks, maybe with a comment stating why we need to have this order. It's ok, it is only to prevent UXA from going off the rails after the odd resume hang on g45... > That, or fix the mess called ring init code ... So if we fixed resume to avoid reallocating the ringbuffers across resume, g45 would still fail to restart, but now we still have valid objects (or would we tear them down because of the failure?) and so this check passes and we later hit the EIO checks? -Chris
On Tue, Mar 25, 2014 at 10:33 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote: >> That, or fix the mess called ring init code ... > > So if we fixed resume to avoid reallocating the ringbuffers across > resume, g45 would still fail to restart, but now we still have valid > objects (or would we tear them down because of the failure?) and so this > check passes and we later hit the EIO checks? Yeah that's my idea: We always set up the objects for all valid rings and never tear them down. Upon failure we just set wedged and execbuf would fall through until it notices the wedged states and returns -EIO. Gives us tidy unified code and no special-casing for the ring init failures. And if we do the split in ring init correctly the ring_hw_init could unconditionally set wedged internally if it fails and our code would dtrt no matter whether it's driver load, resume or failure to resurrect the hw after reset. So callers wouldn't need to care. Of course the ring_init functions which allocates the structures in driver load still needs to be able to bail out in case that fails. I can dream ... -Daniel
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c index 1b45163e19f3..22c650490f54 100644 --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c @@ -991,6 +991,18 @@ i915_reset_gen7_sol_offsets(struct drm_device *dev, return 0; } +static bool +intel_ring_valid(struct intel_ring_buffer *ring) +{ + switch (ring->id) { + case RCS: return true; + case VCS: return HAS_BSD(ring->dev); + case BCS: return HAS_BLT(ring->dev); + case VECS: return HAS_VEBOX(ring->dev); + default: return false; + } +} + static int i915_gem_do_execbuffer(struct drm_device *dev, void *data, struct drm_file *file, @@ -1041,7 +1053,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data, if (!intel_ring_initialized(ring)) { DRM_DEBUG("execbuf with invalid ring: %d\n", (int)(args->flags & I915_EXEC_RING_MASK)); - return -EINVAL; + return intel_ring_valid(ring) ? -EIO : -EINVAL; } mode = args->flags & I915_EXEC_CONSTANTS_MASK;
If we try to execute on a known ring, but it has failed to be initialised correctly, report that the GPU is hung rather than the command invalid. This leaves us reporting EINVAL only if the user requests execution on a ring that is not supported by the device. This should prevent UXA from getting stuck in a null render loop after a failed resume. Reported-by: Jiri Kosina <jikos@jikos.cz> References: https://bugs.freedesktop.org/show_bug.cgi?id=76554 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> --- drivers/gpu/drm/i915/i915_gem_execbuffer.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-)