drm/i915: Allow unready gpu to be reset on gen8

Message ID 1446216229-26474-1-git-send-email-mika.kuoppala@intel.com (mailing list archive)
State New, archived

Commit Message

Mika Kuoppala Oct. 30, 2015, 2:43 p.m. UTC
On gen9, forcing a gpu that is not ready into reset has been shown
to hang the whole system [1].

Gen8 has never, to this date, demonstrated such behaviour.

In our CI tests bsw sometimes ends up, after a gpu hang, in a state
where it does not acknowledge the reset request and thus claims it is
not ready for reset.

Allow gen8 to reset even after claims of nonreadiness in order
to keep the gpu accessible. Enhance logging so that it is clear
which conditions led to the decision to proceed or to bail out,
so that we will spot it if forcing our will against the gpu in this
way turns out to be foolhardy.

References [1]: https://bugs.freedesktop.org/show_bug.cgi?id=89959
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Tomi Sarvela <tomix.p.sarvela@intel.com>
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/intel_uncore.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Comments

Chris Wilson Oct. 30, 2015, 2:58 p.m. UTC | #1
On Fri, Oct 30, 2015 at 04:43:49PM +0200, Mika Kuoppala wrote:
> Gen9 has had demonstrated cases where forcing a not ready gpu
> into reset has caused system hang [1].
> 
> Gen8 has never to this date demonstrated such behaviour.
> 
> In our CI tests bsw sometimes ends up in a state where it claims it
> is not ready for reset, based on reset request, after gpu hang.
> 
> Allow gen8 to reset even after claims of nonreadiness in order
> to keep the gpu accessible. Enhance logging so that it will be
> clear what conditions led to decision of proceeding or bailing out,
> so that we will spot if this way of forcing our will against gpu turns
> out to be foolhardy.
> 
> References [1]: https://bugs.freedesktop.org/show_bug.cgi?id=89959
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Tomi Sarvela <tomix.p.sarvela@intel.com>
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>  drivers/gpu/drm/i915/intel_uncore.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> index f0f97b2..47c17f2 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -1504,7 +1504,14 @@ not_ready:
>  		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
>  			   _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET));
>  
> -	return -EIO;

Where's the reference for where we hit this EIO on gen8?

> +	if (INTEL_INFO(dev)->gen == 9) {
> +		DRM_ERROR("Reset would risk system stability, bailing out\n");
> +		return -EIO;
> +	}
> +
> +	DRM_ERROR("Forcing non ready gpu into reset\n");
> +
> +	return gen6_do_reset(dev);
>  }
>  
>  static int (*intel_get_gpu_reset(struct drm_device *dev))(struct drm_device *)
> -- 
> 2.5.0
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Mika Kuoppala Oct. 30, 2015, 3:18 p.m. UTC | #2
Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Fri, Oct 30, 2015 at 04:43:49PM +0200, Mika Kuoppala wrote:
>> Gen9 has had demonstrated cases where forcing a not ready gpu
>> into reset has caused system hang [1].
>> 
>> Gen8 has never to this date demonstrated such behaviour.
>> 
>> In our CI tests bsw sometimes ends up in a state where it claims it
>> is not ready for reset, based on reset request, after gpu hang.
>> 
>> Allow gen8 to reset even after claims of nonreadiness in order
>> to keep the gpu accessible. Enhance logging so that it will be
>> clear what conditions led to decision of proceeding or bailing out,
>> so that we will spot if this way of forcing our will against gpu turns
>> out to be foolhardy.
>> 
>> References [1]: https://bugs.freedesktop.org/show_bug.cgi?id=89959
>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>> Cc: Tomi Sarvela <tomix.p.sarvela@intel.com>
>> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
>> ---
>>  drivers/gpu/drm/i915/intel_uncore.c | 9 ++++++++-
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
>> index f0f97b2..47c17f2 100644
>> --- a/drivers/gpu/drm/i915/intel_uncore.c
>> +++ b/drivers/gpu/drm/i915/intel_uncore.c
>> @@ -1504,7 +1504,14 @@ not_ready:
>>  		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
>>  			   _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET));
>>  
>> -	return -EIO;
>
> Where's the reference for where we hit this EIO on gen8?
>

Internal CI logs, relevant part cutpasted below. If you want
full log holler me in irc.

[  119.147727] kms_pipe_crc_basic: starting subtest hang-read-crc-pipe-A
[  124.785063] [drm] stuck on render ring
[  124.800850] [drm] GPU HANG: ecode 8:0:0xfffffffe, in kms_pipe_crc_ba [5590], reason: Ring hung, action: reset
[  124.801154] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  124.801161] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  124.801167] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  124.801173] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  124.801179] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  124.801785] kobject: 'card0' (ffff880174ad92a0): kobject_uevent_env
[  124.801940] kobject: 'card0' (ffff880174ad92a0): fill_kobj_path: path = '/devices/pci0000:00/0000:00:02.0/drm/card0'
[  124.805032] kobject: 'card0' (ffff880174ad92a0): kobject_uevent_env
[  124.805089] kobject: 'card0' (ffff880174ad92a0): fill_kobj_path: path = '/devices/pci0000:00/0000:00:02.0/drm/card0'
[  125.511744] [drm:gen8_do_reset [i915]] *ERROR* render ring: reset request timeout
[  125.511922] [drm] Simulated gpu hang, resetting stop_rings
[  125.511927] drm/i915: Resetting chip after gpu hang
[  125.511954] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5
[  125.637612] kms_pipe_crc_basic: exiting, ret=0
[  125.653608] [drm:intel_lr_context_deferred_alloc [i915]] *ERROR* ring create req: -5
[  125.847695] gem_ctx_param_basic: executing
[  125.850086] [drm:intel_lr_context_deferred_alloc [i915]] *ERROR* ring create req: -5
[  125.854482] gem_ctx_param_basic: exiting, ret=99
[  126.038693] kms_addfb_basic: executing
[  126.041754] [drm:intel_lr_context_deferred_alloc [i915]] *ERROR* ring create req: -5

-Mika

>> +	if (INTEL_INFO(dev)->gen == 9) {
>> +		DRM_ERROR("Reset would risk system stability, bailing out\n");
>> +		return -EIO;
>> +	}
>> +
>> +	DRM_ERROR("Forcing non ready gpu into reset\n");
>> +
>> +	return gen6_do_reset(dev);
>>  }
>>  
>>  static int (*intel_get_gpu_reset(struct drm_device *dev))(struct drm_device *)
>> -- 
>> 2.5.0
>> 
>
> -- 
> Chris Wilson, Intel Open Source Technology Centre
Chris Wilson Oct. 30, 2015, 3:28 p.m. UTC | #3
On Fri, Oct 30, 2015 at 05:18:18PM +0200, Mika Kuoppala wrote:
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > On Fri, Oct 30, 2015 at 04:43:49PM +0200, Mika Kuoppala wrote:
> >> Gen9 has had demonstrated cases where forcing a not ready gpu
> >> into reset has caused system hang [1].
> >> 
> >> Gen8 has never to this date demonstrated such behaviour.
> >> 
> >> In our CI tests bsw sometimes ends up in a state where it claims it
> >> is not ready for reset, based on reset request, after gpu hang.
> >> 
> >> Allow gen8 to reset even after claims of nonreadiness in order
> >> to keep the gpu accessible. Enhance logging so that it will be
> >> clear what conditions led to decision of proceeding or bailing out,
> >> so that we will spot if this way of forcing our will against gpu turns
> >> out to be foolhardy.
> >> 
> >> References [1]: https://bugs.freedesktop.org/show_bug.cgi?id=89959
> >> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> >> Cc: Tomi Sarvela <tomix.p.sarvela@intel.com>
> >> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> >> ---
> >>  drivers/gpu/drm/i915/intel_uncore.c | 9 ++++++++-
> >>  1 file changed, 8 insertions(+), 1 deletion(-)
> >> 
> >> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> >> index f0f97b2..47c17f2 100644
> >> --- a/drivers/gpu/drm/i915/intel_uncore.c
> >> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> >> @@ -1504,7 +1504,14 @@ not_ready:
> >>  		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
> >>  			   _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET));
> >>  
> >> -	return -EIO;
> >
> > Where's the reference for where we hit this EIO on gen8?
> >
> 
> Internal CI logs, relevant part cutpasted below. If you want
> full log holler me in irc.

So you are saying that there's no bugzilla for this... :-p
-Chris
Mika Kuoppala Oct. 30, 2015, 4:07 p.m. UTC | #4
Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Fri, Oct 30, 2015 at 05:18:18PM +0200, Mika Kuoppala wrote:
>> Chris Wilson <chris@chris-wilson.co.uk> writes:
>> 
>> > On Fri, Oct 30, 2015 at 04:43:49PM +0200, Mika Kuoppala wrote:
>> >> Gen9 has had demonstrated cases where forcing a not ready gpu
>> >> into reset has caused system hang [1].
>> >> 
>> >> Gen8 has never to this date demonstrated such behaviour.
>> >> 
>> >> In our CI tests bsw sometimes ends up in a state where it claims it
>> >> is not ready for reset, based on reset request, after gpu hang.
>> >> 
>> >> Allow gen8 to reset even after claims of nonreadiness in order
>> >> to keep the gpu accessible. Enhance logging so that it will be
>> >> clear what conditions led to decision of proceeding or bailing out,
>> >> so that we will spot if this way of forcing our will against gpu turns
>> >> out to be foolhardy.
>> >> 
>> >> References [1]: https://bugs.freedesktop.org/show_bug.cgi?id=89959
>> >> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>> >> Cc: Tomi Sarvela <tomix.p.sarvela@intel.com>
>> >> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
>> >> ---
>> >>  drivers/gpu/drm/i915/intel_uncore.c | 9 ++++++++-
>> >>  1 file changed, 8 insertions(+), 1 deletion(-)
>> >> 
>> >> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
>> >> index f0f97b2..47c17f2 100644
>> >> --- a/drivers/gpu/drm/i915/intel_uncore.c
>> >> +++ b/drivers/gpu/drm/i915/intel_uncore.c
>> >> @@ -1504,7 +1504,14 @@ not_ready:
>> >>  		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
>> >>  			   _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET));
>> >>  
>> >> -	return -EIO;
>> >
>> > Where's the reference for where we hit this EIO on gen8?
>> >
>> 
>> Internal CI logs, relevant part cutpasted below. If you want
>> full log holler me in irc.
>
> So you are saying that there's no bugzilla for this... :-p

Bugzilla fairy might surprise us after a good weekend rest.

-Mika

> -Chris
>
> -- 
> Chris Wilson, Intel Open Source Technology Centre
Sarvela, TomiX P Nov. 2, 2015, 8:35 a.m. UTC | #5
> From: Mika Kuoppala [mailto:mika.kuoppala@linux.intel.com]
> Chris Wilson <chris@chris-wilson.co.uk> writes:

> > So you are saying that there's no bugzilla for this... :-p
> 
> Bugzilla fairy might surprise us after a good weekend rest.

https://bugs.freedesktop.org/show_bug.cgi?id=92774 

Regards,

Tomi
Patch

diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index f0f97b2..47c17f2 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1504,7 +1504,14 @@  not_ready:
 		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
 			   _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET));
 
-	return -EIO;
+	if (INTEL_INFO(dev)->gen == 9) {
+		DRM_ERROR("Reset would risk system stability, bailing out\n");
+		return -EIO;
+	}
+
+	DRM_ERROR("Forcing non ready gpu into reset\n");
+
+	return gen6_do_reset(dev);
 }
 
 static int (*intel_get_gpu_reset(struct drm_device *dev))(struct drm_device *)
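
For readers who want to try the decision logic outside the kernel, the not_ready tail of the patch can be modelled as a small userspace sketch. This is an illustration only, not the driver code: handle_not_ready(), full_reset_fn, fake_reset_ok() and reset_calls are hypothetical names, the callback stands in for gen6_do_reset(dev), and the register writes that cancel the pending reset requests are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for the full-chip reset (gen6_do_reset in the patch). */
typedef int (*full_reset_fn)(void);

/*
 * Userspace model of the not_ready tail after the patch: gen9 bails out
 * with -EIO, any other generation reaching this point (gen8 in practice)
 * is forced into a full reset and returns whatever that reset returns.
 */
static int handle_not_ready(int gen, full_reset_fn full_reset)
{
	assert(full_reset != NULL);

	if (gen == 9) {
		fprintf(stderr, "Reset would risk system stability, bailing out\n");
		return -EIO;
	}

	fprintf(stderr, "Forcing non ready gpu into reset\n");
	return full_reset();
}

/* Hypothetical stub that counts how often the forced reset was attempted. */
static int reset_calls;

static int fake_reset_ok(void)
{
	reset_calls++;
	return 0;
}
```

Calling handle_not_ready(9, fake_reset_ok) leaves reset_calls untouched and returns -EIO, the same -5 that appears in the CI log, while handle_not_ready(8, fake_reset_ok) runs the stub and passes its result through.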