
drm/i915: Show stack (by WARN) for hitting forcewake errors

Message ID 20180720111102.11549-1-chris@chris-wilson.co.uk (mailing list archive)
State New, archived

Commit Message

Chris Wilson July 20, 2018, 11:11 a.m. UTC
On Sandybridge, we need a workaround to wait for the CPU thread to wake
up before we are sure that we have enabled the GT power well. However,
we do see the errors being reported and failed reads returning spurious
results. To try to capture more details as it fails, promote the error
into a WARN so that we grab the stacktrace, and to try to reduce the
frequency of errors, increase the timeout from 500us to 5ms.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/intel_uncore.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)
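
For context, this wait runs on every forcewake get on the affected platforms.
The caller visible in the hunk context, fw_domains_get_with_thread_status(),
sits roughly as in the sketch below (paraphrased from the surrounding file for
illustration, not part of this patch; fw_domains_get() is the existing helper
that asserts the forcewake request bits):

static void fw_domains_get_with_thread_status(struct drm_i915_private *dev_priv,
					      enum forcewake_domains fw_domains)
{
	/* Request the GT power wells by asserting the forcewake bits */
	fw_domains_get(dev_priv, fw_domains);

	/*
	 * Then apply the w/a: poll the GT thread status register until the
	 * hardware thread has woken, so that subsequent MMIO reads do not
	 * sporadically return 0.
	 */
	__gen6_gt_wait_for_thread_c0(dev_priv);
}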

Comments

Tvrtko Ursulin July 24, 2018, 10:33 a.m. UTC | #1
On 20/07/2018 12:11, Chris Wilson wrote:
> On Sandybridge, we need a workaround to wait for the CPU thread to wake
> up before we are sure that we have enabled the GT power well. However,
> we do see the errors being reported and failed reads returning spurious
> results. To try to capture more details as it fails, promote the error
> into a WARN so that we grab the stacktrace, and to try to reduce the
> frequency of errors, increase the timeout from 500us to 5ms.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>   drivers/gpu/drm/i915/intel_uncore.c | 18 ++++++++++++++----
>   1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> index b892ca8396e8..284be151f645 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -283,14 +283,24 @@ fw_domains_reset(struct drm_i915_private *i915,
>   		fw_domain_reset(i915, d);
>   }
>   
> +static inline u32 gt_thread_status(struct drm_i915_private *dev_priv)
> +{
> +	u32 val;
> +
> +	val = __raw_i915_read32(dev_priv, GEN6_GT_THREAD_STATUS_REG);
> +	val &= GEN6_GT_THREAD_STATUS_CORE_MASK;
> +
> +	return val;
> +}
> +
>   static void __gen6_gt_wait_for_thread_c0(struct drm_i915_private *dev_priv)
>   {
> -	/* w/a for a sporadic read returning 0 by waiting for the GT
> +	/*
> +	 * w/a for a sporadic read returning 0 by waiting for the GT
>   	 * thread to wake up.
>   	 */
> -	if (wait_for_atomic_us((__raw_i915_read32(dev_priv, GEN6_GT_THREAD_STATUS_REG) &
> -				GEN6_GT_THREAD_STATUS_CORE_MASK) == 0, 500))
> -		DRM_ERROR("GT thread status wait timed out\n");
> +	WARN_ONCE(wait_for_atomic_us(gt_thread_status(dev_priv) == 0, 5000),
> +		  "GT thread status wait timed out\n");
>   }
>   
>   static void fw_domains_get_with_thread_status(struct drm_i915_private *dev_priv,
> 

Quite a large increase - but I guess there's nothing we can do if it's hit, 
nor is it better to time out faster if the wait needs to be longer.

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko
Chris Wilson July 24, 2018, 10:59 a.m. UTC | #2
Quoting Tvrtko Ursulin (2018-07-24 11:33:47)
> 
> On 20/07/2018 12:11, Chris Wilson wrote:
> > On Sandybridge, we need a workaround to wait for the CPU thread to wake
> > up before we are sure that we have enabled the GT power well. However,
> > we do see the errors being reported and failed reads returning spurious
> > results. To try to capture more details as it fails, promote the error
> > into a WARN so that we grab the stacktrace, and to try to reduce the
> > frequency of errors, increase the timeout from 500us to 5ms.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > ---
> >   drivers/gpu/drm/i915/intel_uncore.c | 18 ++++++++++++++----
> >   1 file changed, 14 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> > index b892ca8396e8..284be151f645 100644
> > --- a/drivers/gpu/drm/i915/intel_uncore.c
> > +++ b/drivers/gpu/drm/i915/intel_uncore.c
> > @@ -283,14 +283,24 @@ fw_domains_reset(struct drm_i915_private *i915,
> >               fw_domain_reset(i915, d);
> >   }
> >   
> > +static inline u32 gt_thread_status(struct drm_i915_private *dev_priv)
> > +{
> > +     u32 val;
> > +
> > +     val = __raw_i915_read32(dev_priv, GEN6_GT_THREAD_STATUS_REG);
> > +     val &= GEN6_GT_THREAD_STATUS_CORE_MASK;
> > +
> > +     return val;
> > +}
> > +
> >   static void __gen6_gt_wait_for_thread_c0(struct drm_i915_private *dev_priv)
> >   {
> > -     /* w/a for a sporadic read returning 0 by waiting for the GT
> > +     /*
> > +      * w/a for a sporadic read returning 0 by waiting for the GT
> >        * thread to wake up.
> >        */
> > -     if (wait_for_atomic_us((__raw_i915_read32(dev_priv, GEN6_GT_THREAD_STATUS_REG) &
> > -                             GEN6_GT_THREAD_STATUS_CORE_MASK) == 0, 500))
> > -             DRM_ERROR("GT thread status wait timed out\n");
> > +     WARN_ONCE(wait_for_atomic_us(gt_thread_status(dev_priv) == 0, 5000),
> > +               "GT thread status wait timed out\n");
> >   }
> >   
> >   static void fw_domains_get_with_thread_status(struct drm_i915_private *dev_priv,
> > 
> 
> Quite a large increase - but I guess there's nothing we can do if it's hit, 
> nor is it better to time out faster if the wait needs to be longer.

It was a bit of a jump, but I felt that if we are prematurely declaring
the machine dead, we end up killing it anyway.

A long, long time ago, when we saw these errors it was terminal. The
recent sightings in CI, however, looked transient (hard to tell, as we
panicked shortly after), so I'm betting that a longer timeout will help
with the false positives (additional hardirq latency is better than death).
-Chris
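
For reference, wait_for_atomic_us() is a busy-wait, so the timeout bounds how
long a CPU may spin (potentially with interrupts off) on the failure path;
that is the hardirq latency trade-off mentioned above. A simplified sketch of
that kind of poll loop (poll_gt_thread_us() is an illustrative name, not the
actual i915 macro):

static int poll_gt_thread_us(struct drm_i915_private *dev_priv,
			     unsigned int timeout_us)
{
	ktime_t end = ktime_add_us(ktime_get_raw(), timeout_us);

	for (;;) {
		/* Success: the masked thread-status bits have cleared */
		if (gt_thread_status(dev_priv) == 0)
			return 0;
		/* Timed out: the caller promotes this into a WARN */
		if (ktime_after(ktime_get_raw(), end))
			return -ETIMEDOUT;
		/* Spin politely; we may be in atomic context here */
		cpu_relax();
	}
}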

Patch

diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index b892ca8396e8..284be151f645 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -283,14 +283,24 @@  fw_domains_reset(struct drm_i915_private *i915,
 		fw_domain_reset(i915, d);
 }
 
+static inline u32 gt_thread_status(struct drm_i915_private *dev_priv)
+{
+	u32 val;
+
+	val = __raw_i915_read32(dev_priv, GEN6_GT_THREAD_STATUS_REG);
+	val &= GEN6_GT_THREAD_STATUS_CORE_MASK;
+
+	return val;
+}
+
 static void __gen6_gt_wait_for_thread_c0(struct drm_i915_private *dev_priv)
 {
-	/* w/a for a sporadic read returning 0 by waiting for the GT
+	/*
+	 * w/a for a sporadic read returning 0 by waiting for the GT
 	 * thread to wake up.
 	 */
-	if (wait_for_atomic_us((__raw_i915_read32(dev_priv, GEN6_GT_THREAD_STATUS_REG) &
-				GEN6_GT_THREAD_STATUS_CORE_MASK) == 0, 500))
-		DRM_ERROR("GT thread status wait timed out\n");
+	WARN_ONCE(wait_for_atomic_us(gt_thread_status(dev_priv) == 0, 5000),
+		  "GT thread status wait timed out\n");
 }
 
 static void fw_domains_get_with_thread_status(struct drm_i915_private *dev_priv,