diff mbox

[2/4] drm/i915: Let hangcheck score decay faster than loop increment

Message ID 1448902389-12477-2-git-send-email-mika.kuoppala@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Mika Kuoppala Nov. 30, 2015, 4:53 p.m. UTC
We decay hangcheck score, instead of setting it to zero,
when seqno moves forward. This was added as mechanism to
detect batches full of invalid waits. But with multiple runs of
very intensive batches (shader tests), the score accumulates
even without wait/kick pairs only by engine being active inside
shader loops multiple times in succession.

Prevent this mechanism to falsely trigger on loops by
decaying faster than we accumulate during active looping.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Comments

Chris Wilson Nov. 30, 2015, 5:18 p.m. UTC | #1
On Mon, Nov 30, 2015 at 06:53:07PM +0200, Mika Kuoppala wrote:
> We decay hangcheck score, instead of setting it to zero,
> when seqno moves forward. This was added as mechanism to
> detect batches full of invalid waits.

And slow DoS. So no.
-Chris
Daniel Vetter Dec. 1, 2015, 8:55 a.m. UTC | #2
On Mon, Nov 30, 2015 at 05:18:43PM +0000, Chris Wilson wrote:
> On Mon, Nov 30, 2015 at 06:53:07PM +0200, Mika Kuoppala wrote:
> > We decay hangcheck score, instead of setting it to zero,
> > when seqno moves forward. This was added as mechanism to
> > detect batches full of invalid waits.
> 
> And slow DoS. So no.

Yeah we need to fix the deqp CI issues with tunables. Running a shader for
10s and monopolizing the gpu might be ok in a stress tests, but it's
definitely not ok by default on a normal system.
-Daniel
Mika Kuoppala Dec. 1, 2015, 12:09 p.m. UTC | #3
Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Mon, Nov 30, 2015 at 06:53:07PM +0200, Mika Kuoppala wrote:
>> We decay hangcheck score, instead of setting it to zero,
>> when seqno moves forward. This was added as mechanism to
>> detect batches full of invalid waits.
>
> And slow DoS. So no.

Point was to decay faster than 'loop detection' (-2) but slower
than kicking (+5) to keep the score decaying even if there is
one 'loop detection' per batch.

The criticism about pipe control (patch 1/4) is warranted as it just
happens to help now when the synchronization happens on ring, and thus
only helps with batches that follow this pattern wrt pipe control.

I will post a patch that tries to achieve more generic way
by inspecting subunit states if head seems to be stuck, so
that substitutes 1/4. And also makes 4/4 even more moot.

There is not much to salvage in this series so please ignore.

Thanks,
-Mika

> -Chris
>
> -- 
> Chris Wilson, Intel Open Source Technology Centre
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 6ed6571..3507269 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -3009,6 +3009,7 @@  static void i915_hangcheck_elapsed(struct work_struct *work)
 #define BUSY 1
 #define KICK 5
 #define HUNG 20
+#define BUSY_DECAY (2*BUSY)
 
 	if (!i915.enable_hangcheck)
 		return;
@@ -3084,8 +3085,8 @@  static void i915_hangcheck_elapsed(struct work_struct *work)
 			/* Gradually reduce the count so that we catch DoS
 			 * attempts across multiple batches.
 			 */
-			if (ring->hangcheck.score > 0)
-				ring->hangcheck.score--;
+			if (ring->hangcheck.score > BUSY_DECAY)
+				ring->hangcheck.score -= BUSY_DECAY;
 
 			ring->hangcheck.acthd = ring->hangcheck.max_acthd = 0;
 		}