Message ID | 1446456308-13773-1-git-send-email-mika.kuoppala@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Mon, Nov 02, 2015 at 11:25:08AM +0200, Mika Kuoppala wrote: > Gen9 has had demonstrated cases where forcing a not ready gpu > into reset has caused system hang [1]. > > Gen8 has never to this date demonstrated such behaviour. > > In our CI tests there have been two cases of bsw ending in a state > where it claims it is not ready for reset, based on reset request, > after gpu hang [2]. Both of these cases have happened with kernel > where there are lots of debugs enabled. So it is possible that > something timing related is at play here like that wait_for_register() > usleep wakeups collide badly with forcewake. > > If we assume that gen8 is safe to reset even with claims of nonreadiness, > we can alleviate this situations by forcing a reset and revive the gpu. > Enhance logging so that it will be clear what conditions led to decision > of proceeding or bailing out, so that we will spot if this way of forcing > our will against gpu turns out to be foolhardy. > > v2: - add bugzilla entry for bsw behaviour (Chris) > - improve commit message > > References [1]: https://bugs.freedesktop.org/show_bug.cgi?id=89959 > References [2]: https://bugs.freedesktop.org/show_bug.cgi?id=92774 > Cc: Daniel Vetter <daniel.vetter@ffwll.ch> > Cc: Tomi Sarvela <tomix.p.sarvela@intel.com> > Cc: Chris Wilson <chris@chris-wilson.co.uk> > Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> > --- > drivers/gpu/drm/i915/intel_uncore.c | 9 ++++++++- > 1 file changed, 8 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c > index f0f97b2..47c17f2 100644 > --- a/drivers/gpu/drm/i915/intel_uncore.c > +++ b/drivers/gpu/drm/i915/intel_uncore.c > @@ -1504,7 +1504,14 @@ not_ready: > I915_WRITE(RING_RESET_CTL(engine->mmio_base), > _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET)); > > - return -EIO; > + if (INTEL_INFO(dev)->gen == 9) { > + DRM_ERROR("Reset would risk system stability, bailing out\n"); > + return -EIO; > + } > + > + DRM_ERROR("Forcing non ready gpu into reset\n"); DRM_ERROR will cause dmes-warn failures when running igt. If we agree that this is ok, it should be at most DRM_INFO. I'd got with DRM_DEBUG_DRIVER. Ack with that change. -Daniel > + > + return gen6_do_reset(dev); > } > > static int (*intel_get_gpu_reset(struct drm_device *dev))(struct drm_device *) > -- > 2.5.0 >
Daniel Vetter <daniel@ffwll.ch> writes: > On Mon, Nov 02, 2015 at 11:25:08AM +0200, Mika Kuoppala wrote: >> Gen9 has had demonstrated cases where forcing a not ready gpu >> into reset has caused system hang [1]. >> >> Gen8 has never to this date demonstrated such behaviour. >> >> In our CI tests there have been two cases of bsw ending in a state >> where it claims it is not ready for reset, based on reset request, >> after gpu hang [2]. Both of these cases have happened with kernel >> where there are lots of debugs enabled. So it is possible that >> something timing related is at play here like that wait_for_register() >> usleep wakeups collide badly with forcewake. >> >> If we assume that gen8 is safe to reset even with claims of nonreadiness, >> we can alleviate this situations by forcing a reset and revive the gpu. >> Enhance logging so that it will be clear what conditions led to decision >> of proceeding or bailing out, so that we will spot if this way of forcing >> our will against gpu turns out to be foolhardy. >> >> v2: - add bugzilla entry for bsw behaviour (Chris) >> - improve commit message >> >> References [1]: https://bugs.freedesktop.org/show_bug.cgi?id=89959 >> References [2]: https://bugs.freedesktop.org/show_bug.cgi?id=92774 >> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> >> Cc: Tomi Sarvela <tomix.p.sarvela@intel.com> >> Cc: Chris Wilson <chris@chris-wilson.co.uk> >> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> >> --- >> drivers/gpu/drm/i915/intel_uncore.c | 9 ++++++++- >> 1 file changed, 8 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c >> index f0f97b2..47c17f2 100644 >> --- a/drivers/gpu/drm/i915/intel_uncore.c >> +++ b/drivers/gpu/drm/i915/intel_uncore.c >> @@ -1504,7 +1504,14 @@ not_ready: >> I915_WRITE(RING_RESET_CTL(engine->mmio_base), >> _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET)); >> >> - return -EIO; >> + if (INTEL_INFO(dev)->gen == 9) { >> + DRM_ERROR("Reset would risk system stability, bailing out\n"); >> + return -EIO; >> + } >> + >> + DRM_ERROR("Forcing non ready gpu into reset\n"); > > DRM_ERROR will cause dmes-warn failures when running igt. If we agree that > this is ok, it should be at most DRM_INFO. I'd got with DRM_DEBUG_DRIVER. > Ack with that change. This is in the category 'shouldn't never happen' so I would prefer igt to complain loudly if it can't request reset. The real culprit was forcewakes missing on reset path, so this patch can be forgotten and only resurrected if, even with forcewake fix in, gen8 fails to request reset. Here is the patch that fixed the underlying issue: commit 99106bc17e667989b4c0af0a6afcbd6ddbada8fb Author: Mika Kuoppala <mika.kuoppala@linux.intel.com> Date: Thu Nov 5 13:11:38 2015 +0200 drm/i915: Do graphics device reset under forcewake Thanks, -Mika > -Daniel > >> + >> + return gen6_do_reset(dev); >> } >> >> static int (*intel_get_gpu_reset(struct drm_device *dev))(struct drm_device *) >> -- >> 2.5.0 >> > > -- > Daniel Vetter > Software Engineer, Intel Corporation > http://blog.ffwll.ch
On Thu, Nov 19, 2015 at 11:41:07AM +0200, Mika Kuoppala wrote: > Daniel Vetter <daniel@ffwll.ch> writes: > > > On Mon, Nov 02, 2015 at 11:25:08AM +0200, Mika Kuoppala wrote: > >> Gen9 has had demonstrated cases where forcing a not ready gpu > >> into reset has caused system hang [1]. > >> > >> Gen8 has never to this date demonstrated such behaviour. > >> > >> In our CI tests there have been two cases of bsw ending in a state > >> where it claims it is not ready for reset, based on reset request, > >> after gpu hang [2]. Both of these cases have happened with kernel > >> where there are lots of debugs enabled. So it is possible that > >> something timing related is at play here like that wait_for_register() > >> usleep wakeups collide badly with forcewake. > >> > >> If we assume that gen8 is safe to reset even with claims of nonreadiness, > >> we can alleviate this situations by forcing a reset and revive the gpu. > >> Enhance logging so that it will be clear what conditions led to decision > >> of proceeding or bailing out, so that we will spot if this way of forcing > >> our will against gpu turns out to be foolhardy. > >> > >> v2: - add bugzilla entry for bsw behaviour (Chris) > >> - improve commit message > >> > >> References [1]: https://bugs.freedesktop.org/show_bug.cgi?id=89959 > >> References [2]: https://bugs.freedesktop.org/show_bug.cgi?id=92774 > >> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> > >> Cc: Tomi Sarvela <tomix.p.sarvela@intel.com> > >> Cc: Chris Wilson <chris@chris-wilson.co.uk> > >> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> > >> --- > >> drivers/gpu/drm/i915/intel_uncore.c | 9 ++++++++- > >> 1 file changed, 8 insertions(+), 1 deletion(-) > >> > >> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c > >> index f0f97b2..47c17f2 100644 > >> --- a/drivers/gpu/drm/i915/intel_uncore.c > >> +++ b/drivers/gpu/drm/i915/intel_uncore.c > >> @@ -1504,7 +1504,14 @@ not_ready: > >> I915_WRITE(RING_RESET_CTL(engine->mmio_base), > >> _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET)); > >> > >> - return -EIO; > >> + if (INTEL_INFO(dev)->gen == 9) { > >> + DRM_ERROR("Reset would risk system stability, bailing out\n"); > >> + return -EIO; > >> + } > >> + > >> + DRM_ERROR("Forcing non ready gpu into reset\n"); > > > > DRM_ERROR will cause dmes-warn failures when running igt. If we agree that > > this is ok, it should be at most DRM_INFO. I'd got with DRM_DEBUG_DRIVER. > > Ack with that change. > > This is in the category 'shouldn't never happen' so I would prefer > igt to complain loudly if it can't request reset. > > The real culprit was forcewakes missing on reset path, so > this patch can be forgotten and only resurrected if, > even with forcewake fix in, gen8 fails to request reset. > > Here is the patch that fixed the underlying issue: > > commit 99106bc17e667989b4c0af0a6afcbd6ddbada8fb > Author: Mika Kuoppala <mika.kuoppala@linux.intel.com> > Date: Thu Nov 5 13:11:38 2015 +0200 > > drm/i915: Do graphics device reset under forcewake Oh that story. With the above explanation added to the commit message: Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> > > > Thanks, > -Mika > > > -Daniel > > > >> + > >> + return gen6_do_reset(dev); > >> } > >> > >> static int (*intel_get_gpu_reset(struct drm_device *dev))(struct drm_device *) > >> -- > >> 2.5.0 > >> > > > > -- > > Daniel Vetter > > Software Engineer, Intel Corporation > > http://blog.ffwll.ch
On Thu, 19 Nov 2015, Daniel Vetter <daniel@ffwll.ch> wrote: > On Thu, Nov 19, 2015 at 11:41:07AM +0200, Mika Kuoppala wrote: >> Daniel Vetter <daniel@ffwll.ch> writes: >> >> > On Mon, Nov 02, 2015 at 11:25:08AM +0200, Mika Kuoppala wrote: >> >> Gen9 has had demonstrated cases where forcing a not ready gpu >> >> into reset has caused system hang [1]. >> >> >> >> Gen8 has never to this date demonstrated such behaviour. >> >> >> >> In our CI tests there have been two cases of bsw ending in a state >> >> where it claims it is not ready for reset, based on reset request, >> >> after gpu hang [2]. Both of these cases have happened with kernel >> >> where there are lots of debugs enabled. So it is possible that >> >> something timing related is at play here like that wait_for_register() >> >> usleep wakeups collide badly with forcewake. >> >> >> >> If we assume that gen8 is safe to reset even with claims of nonreadiness, >> >> we can alleviate this situations by forcing a reset and revive the gpu. >> >> Enhance logging so that it will be clear what conditions led to decision >> >> of proceeding or bailing out, so that we will spot if this way of forcing >> >> our will against gpu turns out to be foolhardy. >> >> >> >> v2: - add bugzilla entry for bsw behaviour (Chris) >> >> - improve commit message >> >> >> >> References [1]: https://bugs.freedesktop.org/show_bug.cgi?id=89959 >> >> References [2]: https://bugs.freedesktop.org/show_bug.cgi?id=92774 >> >> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> >> >> Cc: Tomi Sarvela <tomix.p.sarvela@intel.com> >> >> Cc: Chris Wilson <chris@chris-wilson.co.uk> >> >> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> >> >> --- >> >> drivers/gpu/drm/i915/intel_uncore.c | 9 ++++++++- >> >> 1 file changed, 8 insertions(+), 1 deletion(-) >> >> >> >> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c >> >> index f0f97b2..47c17f2 100644 >> >> --- a/drivers/gpu/drm/i915/intel_uncore.c >> >> +++ b/drivers/gpu/drm/i915/intel_uncore.c >> >> @@ -1504,7 +1504,14 @@ not_ready: >> >> I915_WRITE(RING_RESET_CTL(engine->mmio_base), >> >> _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET)); >> >> >> >> - return -EIO; >> >> + if (INTEL_INFO(dev)->gen == 9) { >> >> + DRM_ERROR("Reset would risk system stability, bailing out\n"); >> >> + return -EIO; >> >> + } >> >> + >> >> + DRM_ERROR("Forcing non ready gpu into reset\n"); >> > >> > DRM_ERROR will cause dmes-warn failures when running igt. If we agree that >> > this is ok, it should be at most DRM_INFO. I'd got with DRM_DEBUG_DRIVER. >> > Ack with that change. >> >> This is in the category 'shouldn't never happen' so I would prefer >> igt to complain loudly if it can't request reset. >> >> The real culprit was forcewakes missing on reset path, so >> this patch can be forgotten and only resurrected if, >> even with forcewake fix in, gen8 fails to request reset. >> >> Here is the patch that fixed the underlying issue: >> >> commit 99106bc17e667989b4c0af0a6afcbd6ddbada8fb >> Author: Mika Kuoppala <mika.kuoppala@linux.intel.com> >> Date: Thu Nov 5 13:11:38 2015 +0200 >> >> drm/i915: Do graphics device reset under forcewake > > Oh that story. With the above explanation added to the commit message: > > Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> Mika, is this patch still relevant? Please repost with the above fixed if it is. BR, Jani. > >> >> >> Thanks, >> -Mika >> >> > -Daniel >> > >> >> + >> >> + return gen6_do_reset(dev); >> >> } >> >> >> >> static int (*intel_get_gpu_reset(struct drm_device *dev))(struct drm_device *) >> >> -- >> >> 2.5.0 >> >> >> > >> > -- >> > Daniel Vetter >> > Software Engineer, Intel Corporation >> > http://blog.ffwll.ch
Jani Nikula <jani.nikula@linux.intel.com> writes: > On Thu, 19 Nov 2015, Daniel Vetter <daniel@ffwll.ch> wrote: >> On Thu, Nov 19, 2015 at 11:41:07AM +0200, Mika Kuoppala wrote: >>> Daniel Vetter <daniel@ffwll.ch> writes: >>> >>> > On Mon, Nov 02, 2015 at 11:25:08AM +0200, Mika Kuoppala wrote: >>> >> Gen9 has had demonstrated cases where forcing a not ready gpu >>> >> into reset has caused system hang [1]. >>> >> >>> >> Gen8 has never to this date demonstrated such behaviour. >>> >> >>> >> In our CI tests there have been two cases of bsw ending in a state >>> >> where it claims it is not ready for reset, based on reset request, >>> >> after gpu hang [2]. Both of these cases have happened with kernel >>> >> where there are lots of debugs enabled. So it is possible that >>> >> something timing related is at play here like that wait_for_register() >>> >> usleep wakeups collide badly with forcewake. >>> >> >>> >> If we assume that gen8 is safe to reset even with claims of nonreadiness, >>> >> we can alleviate this situations by forcing a reset and revive the gpu. >>> >> Enhance logging so that it will be clear what conditions led to decision >>> >> of proceeding or bailing out, so that we will spot if this way of forcing >>> >> our will against gpu turns out to be foolhardy. >>> >> >>> >> v2: - add bugzilla entry for bsw behaviour (Chris) >>> >> - improve commit message >>> >> >>> >> References [1]: https://bugs.freedesktop.org/show_bug.cgi?id=89959 >>> >> References [2]: https://bugs.freedesktop.org/show_bug.cgi?id=92774 >>> >> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> >>> >> Cc: Tomi Sarvela <tomix.p.sarvela@intel.com> >>> >> Cc: Chris Wilson <chris@chris-wilson.co.uk> >>> >> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> >>> >> --- >>> >> drivers/gpu/drm/i915/intel_uncore.c | 9 ++++++++- >>> >> 1 file changed, 8 insertions(+), 1 deletion(-) >>> >> >>> >> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c >>> >> index f0f97b2..47c17f2 100644 >>> >> --- a/drivers/gpu/drm/i915/intel_uncore.c >>> >> +++ b/drivers/gpu/drm/i915/intel_uncore.c >>> >> @@ -1504,7 +1504,14 @@ not_ready: >>> >> I915_WRITE(RING_RESET_CTL(engine->mmio_base), >>> >> _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET)); >>> >> >>> >> - return -EIO; >>> >> + if (INTEL_INFO(dev)->gen == 9) { >>> >> + DRM_ERROR("Reset would risk system stability, bailing out\n"); >>> >> + return -EIO; >>> >> + } >>> >> + >>> >> + DRM_ERROR("Forcing non ready gpu into reset\n"); >>> > >>> > DRM_ERROR will cause dmes-warn failures when running igt. If we agree that >>> > this is ok, it should be at most DRM_INFO. I'd got with DRM_DEBUG_DRIVER. >>> > Ack with that change. >>> >>> This is in the category 'shouldn't never happen' so I would prefer >>> igt to complain loudly if it can't request reset. >>> >>> The real culprit was forcewakes missing on reset path, so >>> this patch can be forgotten and only resurrected if, >>> even with forcewake fix in, gen8 fails to request reset. >>> >>> Here is the patch that fixed the underlying issue: >>> >>> commit 99106bc17e667989b4c0af0a6afcbd6ddbada8fb >>> Author: Mika Kuoppala <mika.kuoppala@linux.intel.com> >>> Date: Thu Nov 5 13:11:38 2015 +0200 >>> >>> drm/i915: Do graphics device reset under forcewake >> >> Oh that story. With the above explanation added to the commit message: >> >> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> > > Mika, is this patch still relevant? Please repost with the above fixed > if it is. > I think we should forget this and only resurrect it if we ever again see failed resets. As commit 99106bc17e667989b4c0af0a6afcbd6ddbada8fb Author: Mika Kuoppala <mika.kuoppala@linux.intel.com> Date: Thu Nov 5 13:11:38 2015 +0200 drm/i915: Do graphics device reset under forcewake should have fixed atleast those that we saw on CI. -Mika > BR, > Jani. > > >> >>> >>> >>> Thanks, >>> -Mika >>> >>> > -Daniel >>> > >>> >> + >>> >> + return gen6_do_reset(dev); >>> >> } >>> >> >>> >> static int (*intel_get_gpu_reset(struct drm_device *dev))(struct drm_device *) >>> >> -- >>> >> 2.5.0 >>> >> >>> > >>> > -- >>> > Daniel Vetter >>> > Software Engineer, Intel Corporation >>> > http://blog.ffwll.ch > > -- > Jani Nikula, Intel Open Source Technology Center
On Fri, Jan 08, 2016 at 06:31:44PM +0200, Mika Kuoppala wrote: > I think we should forget this and only resurrect it > if we ever again see failed resets. Bingo! I have been able to wedge the machine (bdw and bsw) recently so badly that the resets failed with the "reset request timeout". -Chris
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c index f0f97b2..47c17f2 100644 --- a/drivers/gpu/drm/i915/intel_uncore.c +++ b/drivers/gpu/drm/i915/intel_uncore.c @@ -1504,7 +1504,14 @@ not_ready: I915_WRITE(RING_RESET_CTL(engine->mmio_base), _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET)); - return -EIO; + if (INTEL_INFO(dev)->gen == 9) { + DRM_ERROR("Reset would risk system stability, bailing out\n"); + return -EIO; + } + + DRM_ERROR("Forcing non ready gpu into reset\n"); + + return gen6_do_reset(dev); } static int (*intel_get_gpu_reset(struct drm_device *dev))(struct drm_device *)
Gen9 has had demonstrated cases where forcing a not ready gpu into reset has caused system hang [1]. Gen8 has never to this date demonstrated such behaviour. In our CI tests there have been two cases of bsw ending in a state where it claims it is not ready for reset, based on reset request, after gpu hang [2]. Both of these cases have happened with kernel where there are lots of debugs enabled. So it is possible that something timing related is at play here like that wait_for_register() usleep wakeups collide badly with forcewake. If we assume that gen8 is safe to reset even with claims of nonreadiness, we can alleviate this situations by forcing a reset and revive the gpu. Enhance logging so that it will be clear what conditions led to decision of proceeding or bailing out, so that we will spot if this way of forcing our will against gpu turns out to be foolhardy. v2: - add bugzilla entry for bsw behaviour (Chris) - improve commit message References [1]: https://bugs.freedesktop.org/show_bug.cgi?id=89959 References [2]: https://bugs.freedesktop.org/show_bug.cgi?id=92774 Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Tomi Sarvela <tomix.p.sarvela@intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> --- drivers/gpu/drm/i915/intel_uncore.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-)