diff mbox

drm/i915: Avoid tweaking evaluation thresholds on Baytrail v3

Message ID 1487166779-26945-1-git-send-email-mika.kuoppala@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Mika Kuoppala Feb. 15, 2017, 1:52 p.m. UTC
Certain Baytrails, namely the 4 cpu core variants, have been
plaqued by spurious system hangs, mostly occurring with light loads.

Multiple bisects by various people point to a commit which changes the
reclocking strategy for Baytrail to follow its bigger brethen:
commit 8fb55197e64d ("drm/i915: Agressive downclocking on Baytrail")

There is also a review comment attached to this commit from Deepak S
on avoiding punit access on Cherryview and thus it was excluded on
common reclocking path. By taking the same approach and omitting
the punit access by not tweaking the thresholds when the hardware
has been asked to move into different frequency, considerable gains
in stability have been observed.

With J1900 box, light render/video load would end up in system hang
in usually less than 12 hours. With this patch applied, the cumulative
uptime has now been 34 days without issues. To provoke system hang,
light loads on both render and bsd engines in parallel have been used:
glxgears >/dev/null 2>/dev/null &
mpv --vo=vaapi --hwdec=vaapi --loop=inf vid.mp4

So far, author has not witnessed system hang with above load
and this patch applied. Reports from the tenacious people at
kernel bugzilla are also promising.

Considering that the punit access frequency with this patch is
considerably less, there is a possibility that this will push
the, still unknown, root cause past the triggering point on most loads.

But as we now can reliably reproduce the hang independently,
we can reduce the pain that users are having and use a
static thresholds until a root cause is found.

v3: don't break debugfs and simplification (Chris Wilson)

References: https://bugzilla.kernel.org/show_bug.cgi?id=109051
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: fritsch@xbmc.org
Cc: miku@iki.fi
Cc: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar>
CC: Michal Feix <michal@feix.cz>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Deepak S <deepak.s@linux.intel.com>
Cc: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Cc: <stable@vger.kernel.org> # v4.2+
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/intel_pm.c | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

Chris Wilson Feb. 27, 2017, 9:25 a.m. UTC | #1
On Wed, Feb 15, 2017 at 03:52:59PM +0200, Mika Kuoppala wrote:
> Certain Baytrails, namely the 4 cpu core variants, have been
> plaqued by spurious system hangs, mostly occurring with light loads.
> 
> Multiple bisects by various people point to a commit which changes the
> reclocking strategy for Baytrail to follow its bigger brethen:
> commit 8fb55197e64d ("drm/i915: Agressive downclocking on Baytrail")
> 
> There is also a review comment attached to this commit from Deepak S
> on avoiding punit access on Cherryview and thus it was excluded on
> common reclocking path. By taking the same approach and omitting
> the punit access by not tweaking the thresholds when the hardware
> has been asked to move into different frequency, considerable gains
> in stability have been observed.
> 
> With J1900 box, light render/video load would end up in system hang
> in usually less than 12 hours. With this patch applied, the cumulative
> uptime has now been 34 days without issues. To provoke system hang,
> light loads on both render and bsd engines in parallel have been used:
> glxgears >/dev/null 2>/dev/null &
> mpv --vo=vaapi --hwdec=vaapi --loop=inf vid.mp4
> 
> So far, author has not witnessed system hang with above load
> and this patch applied. Reports from the tenacious people at
> kernel bugzilla are also promising.
> 
> Considering that the punit access frequency with this patch is
> considerably less, there is a possibility that this will push
> the, still unknown, root cause past the triggering point on most loads.
> 
> But as we now can reliably reproduce the hang independently,
> we can reduce the pain that users are having and use a
> static thresholds until a root cause is found.
> 
> v3: don't break debugfs and simplification (Chris Wilson)
> 
> References: https://bugzilla.kernel.org/show_bug.cgi?id=109051
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
> Cc: Len Brown <len.brown@intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Jani Nikula <jani.nikula@intel.com>
> Cc: fritsch@xbmc.org
> Cc: miku@iki.fi
> Cc: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar>
> CC: Michal Feix <michal@feix.cz>
> Cc: Hans de Goede <hdegoede@redhat.com>
> Cc: Deepak S <deepak.s@linux.intel.com>
> Cc: Jarkko Nikula <jarkko.nikula@linux.intel.com>
> Cc: <stable@vger.kernel.org> # v4.2+
> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>

Had a couple of weekends to try and find an alternative explanation
(a root cause for the hangs would be nice!). If it is just the writes to
the RPS registers, are we safe on resume (etc)?

However, I've drawn a blank on explaining what the hw is doing wrong
(but found a couple of bugs in the byt manual RPS evaluation which
desire review), so
Acked-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris
Mika Kuoppala Feb. 27, 2017, 1:22 p.m. UTC | #2
Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Wed, Feb 15, 2017 at 03:52:59PM +0200, Mika Kuoppala wrote:
>> Certain Baytrails, namely the 4 cpu core variants, have been
>> plaqued by spurious system hangs, mostly occurring with light loads.
>> 
>> Multiple bisects by various people point to a commit which changes the
>> reclocking strategy for Baytrail to follow its bigger brethen:
>> commit 8fb55197e64d ("drm/i915: Agressive downclocking on Baytrail")
>> 
>> There is also a review comment attached to this commit from Deepak S
>> on avoiding punit access on Cherryview and thus it was excluded on
>> common reclocking path. By taking the same approach and omitting
>> the punit access by not tweaking the thresholds when the hardware
>> has been asked to move into different frequency, considerable gains
>> in stability have been observed.
>> 
>> With J1900 box, light render/video load would end up in system hang
>> in usually less than 12 hours. With this patch applied, the cumulative
>> uptime has now been 34 days without issues. To provoke system hang,
>> light loads on both render and bsd engines in parallel have been used:
>> glxgears >/dev/null 2>/dev/null &
>> mpv --vo=vaapi --hwdec=vaapi --loop=inf vid.mp4
>> 
>> So far, author has not witnessed system hang with above load
>> and this patch applied. Reports from the tenacious people at
>> kernel bugzilla are also promising.
>> 
>> Considering that the punit access frequency with this patch is
>> considerably less, there is a possibility that this will push
>> the, still unknown, root cause past the triggering point on most loads.
>> 
>> But as we now can reliably reproduce the hang independently,
>> we can reduce the pain that users are having and use a
>> static thresholds until a root cause is found.
>> 
>> v3: don't break debugfs and simplification (Chris Wilson)
>> 
>> References: https://bugzilla.kernel.org/show_bug.cgi?id=109051
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
>> Cc: Len Brown <len.brown@intel.com>
>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>> Cc: Jani Nikula <jani.nikula@intel.com>
>> Cc: fritsch@xbmc.org
>> Cc: miku@iki.fi
>> Cc: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar>
>> CC: Michal Feix <michal@feix.cz>
>> Cc: Hans de Goede <hdegoede@redhat.com>
>> Cc: Deepak S <deepak.s@linux.intel.com>
>> Cc: Jarkko Nikula <jarkko.nikula@linux.intel.com>
>> Cc: <stable@vger.kernel.org> # v4.2+
>> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
>> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
>
> Had a couple of weekends to try and find an alternative explanation
> (a root cause for the hangs would be nice!). If it is just the writes to
> the RPS registers, are we safe on resume (etc)?
>
> However, I've drawn a blank on explaining what the hw is doing wrong
> (but found a couple of bugs in the byt manual RPS evaluation which
> desire review), so
> Acked-by: Chris Wilson <chris@chris-wilson.co.uk>

Pushed, thanks.
-Mika

> -Chris
>
> -- 
> Chris Wilson, Intel Open Source Technology Centre
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
index 3d311e1..732256c 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -4870,6 +4870,12 @@  static void gen6_set_rps_thresholds(struct drm_i915_private *dev_priv, u8 val)
 		break;
 	}
 
+	/* When byt can survive without system hang with dynamic
+	 * sw freq adjustments, this restriction can be lifted.
+	 */
+	if (IS_VALLEYVIEW(dev_priv))
+		goto skip_hw_write;
+
 	I915_WRITE(GEN6_RP_UP_EI,
 		   GT_INTERVAL_FROM_US(dev_priv, ei_up));
 	I915_WRITE(GEN6_RP_UP_THRESHOLD,
@@ -4890,6 +4896,7 @@  static void gen6_set_rps_thresholds(struct drm_i915_private *dev_priv, u8 val)
 		   GEN6_RP_UP_BUSY_AVG |
 		   GEN6_RP_DOWN_IDLE_AVG);
 
+skip_hw_write:
 	dev_priv->rps.power = new_power;
 	dev_priv->rps.up_threshold = threshold_up;
 	dev_priv->rps.down_threshold = threshold_down;