Message ID | 1446216229-26474-1-git-send-email-mika.kuoppala@intel.com (mailing list archive) |
---|---|
State | New, archived |
On Fri, Oct 30, 2015 at 04:43:49PM +0200, Mika Kuoppala wrote:
> Gen9 has had demonstrated cases where forcing a not ready gpu
> into reset has caused system hang [1].
>
> Gen8 has never to this date demonstrated such behaviour.
>
> In our CI tests bsw sometimes ends up in a state where it claims it
> is not ready for reset, based on reset request, after gpu hang.
>
> Allow gen8 to reset even after claims of nonreadiness in order
> to keep the gpu accessible. Enhance logging so that it will be
> clear what conditions led to decision of proceeding or bailing out,
> so that we will spot if this way of forcing our will against gpu turns
> out to be foolhardy.
>
> References [1]: https://bugs.freedesktop.org/show_bug.cgi?id=89959
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Tomi Sarvela <tomix.p.sarvela@intel.com>
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>  drivers/gpu/drm/i915/intel_uncore.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> index f0f97b2..47c17f2 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -1504,7 +1504,14 @@ not_ready:
>  		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
>  			   _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET));
>
> -	return -EIO;

Where's the reference for where we hit this EIO on gen8?

> +	if (INTEL_INFO(dev)->gen == 9) {
> +		DRM_ERROR("Reset would risk system stability, bailing out\n");
> +		return -EIO;
> +	}
> +
> +	DRM_ERROR("Forcing non ready gpu into reset\n");
> +
> +	return gen6_do_reset(dev);
>  }
>
>  static int (*intel_get_gpu_reset(struct drm_device *dev))(struct drm_device *)
> --
> 2.5.0
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Fri, Oct 30, 2015 at 04:43:49PM +0200, Mika Kuoppala wrote:
>> @@ -1504,7 +1504,14 @@ not_ready:
>>  		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
>>  			   _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET));
>>
>> -	return -EIO;
>
> Where's the reference for where we hit this EIO on gen8?

Internal CI logs, relevant part cutpasted below. If you want
full log holler me in irc.

[ 119.147727] kms_pipe_crc_basic: starting subtest hang-read-crc-pipe-A
[ 124.785063] [drm] stuck on render ring
[ 124.800850] [drm] GPU HANG: ecode 8:0:0xfffffffe, in kms_pipe_crc_ba [5590], reason: Ring hung, action: reset
[ 124.801154] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 124.801161] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 124.801167] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 124.801173] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 124.801179] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 124.801785] kobject: 'card0' (ffff880174ad92a0): kobject_uevent_env
[ 124.801940] kobject: 'card0' (ffff880174ad92a0): fill_kobj_path: path = '/devices/pci0000:00/0000:00:02.0/drm/card0'
[ 124.805032] kobject: 'card0' (ffff880174ad92a0): kobject_uevent_env
[ 124.805089] kobject: 'card0' (ffff880174ad92a0): fill_kobj_path: path = '/devices/pci0000:00/0000:00:02.0/drm/card0'
[ 125.511744] [drm:gen8_do_reset [i915]] *ERROR* render ring: reset request timeout
[ 125.511922] [drm] Simulated gpu hang, resetting stop_rings
[ 125.511927] drm/i915: Resetting chip after gpu hang
[ 125.511954] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5
[ 125.637612] kms_pipe_crc_basic: exiting, ret=0
[ 125.653608] [drm:intel_lr_context_deferred_alloc [i915]] *ERROR* ring create req: -5
[ 125.847695] gem_ctx_param_basic: executing
[ 125.850086] [drm:intel_lr_context_deferred_alloc [i915]] *ERROR* ring create req: -5
[ 125.854482] gem_ctx_param_basic: exiting, ret=99
[ 126.038693] kms_addfb_basic: executing
[ 126.041754] [drm:intel_lr_context_deferred_alloc [i915]] *ERROR* ring create req: -5

-Mika
On Fri, Oct 30, 2015 at 05:18:18PM +0200, Mika Kuoppala wrote:
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> > Where's the reference for where we hit this EIO on gen8?
>
> Internal CI logs, relevant part cutpasted below. If you want
> full log holler me in irc.

So you are saying that there's no bugzilla for this... :-p
-Chris
Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Fri, Oct 30, 2015 at 05:18:18PM +0200, Mika Kuoppala wrote:
>> Internal CI logs, relevant part cutpasted below. If you want
>> full log holler me in irc.
>
> So you are saying that there's no bugzilla for this... :-p

Bugzilla fairy might surprise us after a good weekend rest.

-Mika
> From: Mika Kuoppala [mailto:mika.kuoppala@linux.intel.com]
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> > So you are saying that there's no bugzilla for this... :-p
>
> Bugzilla fairy might surprise us after a good weekend rest.

https://bugs.freedesktop.org/show_bug.cgi?id=92774

Regards,
Tomi
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index f0f97b2..47c17f2 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1504,7 +1504,14 @@ not_ready:
 		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
 			   _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET));
 
-	return -EIO;
+	if (INTEL_INFO(dev)->gen == 9) {
+		DRM_ERROR("Reset would risk system stability, bailing out\n");
+		return -EIO;
+	}
+
+	DRM_ERROR("Forcing non ready gpu into reset\n");
+
+	return gen6_do_reset(dev);
 }
 
 static int (*intel_get_gpu_reset(struct drm_device *dev))(struct drm_device *)
Gen9 has had demonstrated cases where forcing a not ready gpu
into reset has caused system hang [1].

Gen8 has never to this date demonstrated such behaviour.

In our CI tests bsw sometimes ends up in a state where it claims it
is not ready for reset, based on reset request, after gpu hang.

Allow gen8 to reset even after claims of nonreadiness in order
to keep the gpu accessible. Enhance logging so that it will be
clear what conditions led to decision of proceeding or bailing out,
so that we will spot if this way of forcing our will against gpu turns
out to be foolhardy.

References [1]: https://bugs.freedesktop.org/show_bug.cgi?id=89959
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Tomi Sarvela <tomix.p.sarvela@intel.com>
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/intel_uncore.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)
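
For readers without a kernel tree at hand, the decision the patch adds on the not_ready path can be sketched as a small userspace C model. This is only an illustration under stated assumptions: not_ready_policy() and fake_full_reset() are hypothetical names that do not exist in the driver, fake_full_reset() merely stands in for gen6_do_reset() and always reports success, and fprintf() stands in for DRM_ERROR(). The -5 seen in the CI log above corresponds to -EIO, since EIO is errno 5 on Linux.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for the full-chip reset that the patch reaches
 * through gen6_do_reset(). It always reports success in this model.
 */
int fake_full_reset(void)
{
	return 0;
}

/*
 * Model of the not_ready path after the patch: an engine did not
 * acknowledge the reset request. On gen9 the reset is refused with
 * -EIO; on any other generation the reset is forced anyway.
 */
int not_ready_policy(int gen)
{
	if (gen == 9) {
		fprintf(stderr, "Reset would risk system stability, bailing out\n");
		return -EIO;
	}

	fprintf(stderr, "Forcing non ready gpu into reset\n");
	return fake_full_reset();
}
```

In this model, not_ready_policy(9) returns -EIO and the caller gives up on the reset, while not_ready_policy(8) logs the forced-reset message and returns whatever the full reset returns. Note that the check in the patch is `gen == 9`, so every generation other than 9 that reaches this path takes the forced-reset branch, not only gen8.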