
drm/i915/selftests: Use preemption timeout on cleanup

Message ID 20241213190122.513709-2-janusz.krzysztofik@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series drm/i915/selftests: Use preemption timeout on cleanup | expand

Commit Message

Janusz Krzysztofik Dec. 13, 2024, 6:59 p.m. UTC
Many selftests call igt_flush_test() on cleanup.  With the default preemption
timeout of compute engines raised to 7.5 seconds, the hardcoded flush timeout
of 3 seconds is too short.  That results in the GPU being forcibly wedged and
the kernel tainted, then an IGT abort triggered.  CI BAT runs lose a part of
their expected coverage.

Calculate the flush timeout based on the longest preemption timeout
currently configured for any engine.  That way, selftests can still report
detected issues as non-critical, and the GPU gets a chance to recover from
preemptible hangs and prepare for smooth execution of subsequent test cases.

Link: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12061
Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
---
 drivers/gpu/drm/i915/selftests/igt_flush_test.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

Comments

Andi Shyti Dec. 16, 2024, 1:26 p.m. UTC | #1
Hi Janusz,

...

>  	for_each_gt(gt, i915, i) {
> +		struct intel_engine_cs *engine;
> +		unsigned long timeout_ms = 0;
> +		unsigned int id;
> +
>  		if (intel_gt_is_wedged(gt))
>  			ret = -EIO;
>  
> +		for_each_engine(engine, gt, id) {
> +			if (engine->props.preempt_timeout_ms > timeout_ms)
> +				timeout_ms = engine->props.preempt_timeout_ms;
> +		}


the brackets are not really required here.

> +
>  		cond_resched();
>  
> -		if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> +		if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {

where is this 500 coming from?

Thanks,
Andi

>  			pr_err("%pS timed out, cancelling all further testing.\n",
>  			       __builtin_return_address(0));
>  
> -- 
> 2.47.1
Janusz Krzysztofik Dec. 16, 2024, 6:06 p.m. UTC | #2
Hi Andi,

Thanks for review.

On Monday, 16 December 2024 14:26:58 CET Andi Shyti wrote:
> Hi Janusz,
> 
> ...
> 
> >  	for_each_gt(gt, i915, i) {
> > +		struct intel_engine_cs *engine;
> > +		unsigned long timeout_ms = 0;
> > +		unsigned int id;
> > +
> >  		if (intel_gt_is_wedged(gt))
> >  			ret = -EIO;
> >  
> > +		for_each_engine(engine, gt, id) {
> > +			if (engine->props.preempt_timeout_ms > timeout_ms)
> > +				timeout_ms = engine->props.preempt_timeout_ms;
> > +		}
> 
> 
> the brackets are not really required here.

OK, I was not sure whether for_each_if used inside for_each_engine is supposed
to resolve potential issues with confusing if nesting, but from your comment I
conclude it does.  I'll fix it.
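
(For reference, for_each_if is roughly this one-liner from
include/drm/drm_util.h -- quoting from memory here, so treat it as a sketch
rather than the exact source:

	/*
	 * Wrap the loop condition in if/else so that an if statement placed
	 * directly in the loop body binds unambiguously, i.e. no dangling-else
	 * surprise from the macro's own hidden if.
	 */
	#define for_each_if(condition) if (!(condition)) {} else

which is why an unbraced if statement as the for_each_engine() body still
parses the way one would expect.)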

> 
> > +
> >  		cond_resched();
> >  
> > -		if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> > +		if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {
> 
> where is this 500 coming from?

/ 1000 would convert it to seconds as needed, and / 500 used instead was 
supposed to mean that we are willing to wait for preempt_timeout_ms * 2.  
Sorry for that shortcut.  Would you like me to provide a clarifying comment, 
or maybe better use explicit 2 * preempt_timeout / 1000 ?
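
For illustration, assuming for example CONFIG_HZ=250 and the raised 7.5 second
compute preemption timeout (so timeout_ms = 7500):

	HZ * timeout_ms / 1000 = 250 * 7500 / 1000 = 1875 jiffies  (7.5 s)
	HZ * timeout_ms / 500  = 250 * 7500 / 500  = 3750 jiffies  (15 s)

i.e. the / 500 silently folds the factor of 2 into the ms-to-jiffies
conversion, which is why it looked obscure.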

Thanks,
Janusz

> 
> Thanks,
> Andi
> 
>  			pr_err("%pS timed out, cancelling all further testing.\n",
> >  			       __builtin_return_address(0));
> >  
>
Andi Shyti Dec. 17, 2024, 5:12 p.m. UTC | #3
Hi Janusz,

...

> > > +
> > >  		cond_resched();
> > >  
> > > -		if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> > > +		if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {
> > 
> > where is this 500 coming from?
> 
> / 1000 would convert it to seconds as needed, and / 500 used instead was 
> supposed to mean that we are willing to wait for preempt_timeout_ms * 2.  
> Sorry for that shortcut.  Would you like me to provide a clarifying comment, 
> or maybe better use explicit 2 * preempt_timeout / 1000 ?

It was clear that you were doubling it, but what's more
interesting to know (perhaps in a comment) is why you are
choosing to use the double of the timeout_ms instead of other
values.

Makes sense?

Thanks,
Andi
Janusz Krzysztofik Dec. 17, 2024, 6 p.m. UTC | #4
Hi Andi,

On Tuesday, 17 December 2024 18:12:08 CET Andi Shyti wrote:
> Hi Janusz,
> 
> ...
> 
> > > > +
> > > >  		cond_resched();
> > > >  
> > > > -		if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> > > > +		if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {
> > > 
> > > where is this 500 coming from?
> > 
> > / 1000 would convert it to seconds as needed, and / 500 used instead was 
> > supposed to mean that we are willing to wait for preempt_timeout_ms * 2.  
> > Sorry for that shortcut.  Would you like me to provide a clarifying comment, 
> > or maybe better use explicit 2 * preempt_timeout / 1000 ?
> 
> It was clear that you were doubling it, but what's more
> interesting to know (perhaps in a comment) is why you are
> choosing to use the double of the timeout_ms instead of other
> values.
> 
> Makes sense?

Yes, good question.

Is it possible for more than one bb to hang?  If so, should we wait longer 
than the longest preemption timeout?  Earlier I assumed that maybe we should, 
just in case, but now, having revisited and reconsidered that, I tend to agree 
that the longest preempt timeout, perhaps with a small margin (let's say 
+100ms), should be enough to recover from a single failing test case.  Let me 
verify whether that works for the linked case.

Thanks,
Janusz

> 
> Thanks,
> Andi
>
Janusz Krzysztofik Dec. 20, 2024, 11:49 a.m. UTC | #5
Hi Andi,

On Tuesday, 17 December 2024 19:00:40 CET Janusz Krzysztofik wrote:
> Hi Andi,
> 
> On Tuesday, 17 December 2024 18:12:08 CET Andi Shyti wrote:
> > Hi Janusz,
> > 
> > ...
> > 
> > > > > +
> > > > >  		cond_resched();
> > > > >  
> > > > > -		if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> > > > > +		if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {
> > > > 
> > > > where is this 500 coming from?
> > > 
> > > / 1000 would convert it to seconds as needed, and / 500 used instead was 
> > > supposed to mean that we are willing to wait for preempt_timeout_ms * 2.  
> > > Sorry for that shortcut.  Would you like me to provide a clarifying comment, 
> > > or maybe better use explicit 2 * preempt_timeout / 1000 ?
> > 
> > It was clear that you were doubling it, but what's more
> > interesting to know (perhaps in a comment) is why you are
> > choosing to use the double of the timeout_ms instead of other
> > values.
> > 
> > Makes sense?
> 
> Yes, good question.
> 
> Is it possible for more than one bb to hang?  If so, should we wait longer 
> than the longest preemption timeout?  Earlier I assumed that maybe we should, 
> just in case, but now, having revisited and reconsidered that, I tend to agree 
> that the longest preempt timeout, perhaps with a small margin (let's say 
> +100ms), should be enough to recover from a single failing test case.  Let me 
> verify whether that works for the linked case.

I've done some testing and got a confirmation that the issue I'm trying to 
address in the first place requires a timeout almost twice as long as the 
longest preemption timeout.

I propose the following correction:

-	if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
+	/* 2 x longest preempt timeout, experimentally determined */
+	if (intel_gt_wait_for_idle(gt, 2 * timeout_ms * HZ / 1000) == -ETIME) {
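
For the record, with e.g. CONFIG_HZ=1000 and timeout_ms = 7500 (the raised
compute engine default), that works out to 2 * 7500 * 1000 / 1000 = 15000
jiffies, i.e. a 15 second budget instead of the previous fixed 3 seconds.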

Thanks,
Janusz

> 
> Thanks,
> Janusz
> 
> > 
> > Thanks,
> > Andi
> > 
> 
> 
> 
> 
>
Andi Shyti Dec. 20, 2024, 2:02 p.m. UTC | #6
Hi Janusz,

> > > > > +
> > > > >  		cond_resched();
> > > > >  
> > > > > -		if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> > > > > +		if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {
> > > > 
> > > > where is this 500 coming from?
> > > 
> > > / 1000 would convert it to seconds as needed, and / 500 used instead was 
> > > supposed to mean that we are willing to wait for preempt_timeout_ms * 2.  
> > > Sorry for that shortcut.  Would you like me to provide a clarifying comment, 
> > > or maybe better use explicit 2 * preempt_timeout / 1000 ?
> > 
> > It was clear that you were doubling it, but what's more
> > interesting to know (perhaps in a comment) is why you are
> > choosing to use the double of the timeout_ms instead of other
> > values.
> > 
> > Makes sense?
> 
> Yes, good question.
> 
> Is it possible for more than one bb to hang?  If so, should we wait longer 
> than the longest preemption timeout?  Earlier I assumed that maybe we should, 
> just in case, but now, having revisited and reconsidered that, I tend to agree 
> that the longest preempt timeout, perhaps with a small margin (let's say 
> +100ms), should be enough to recover from a single failing test case.  Let me 
> verify whether that works for the linked case.

As we agreed offline, I'm going to add the comment you suggested
to your change as a justification for the "/ 500":

/* 2x longest preempt timeout, experimentally determined */

With this:

Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>

Thanks,
Andi
Andi Shyti Dec. 20, 2024, 10:38 p.m. UTC | #7
Hi Janusz,

> > > > > > +
> > > > > >  		cond_resched();
> > > > > >  
> > > > > > -		if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> > > > > > +		if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {
> > > > > 
> > > > > where is this 500 coming from?
> > > > 
> > > > / 1000 would convert it to seconds as needed, and / 500 used instead was 
> > > > supposed to mean that we are willing to wait for preempt_timeout_ms * 2.  
> > > > Sorry for that shortcut.  Would you like me to provide a clarifying comment, 
> > > > or maybe better use explicit 2 * preempt_timeout / 1000 ?
> > > 
> > > It was clear that you were doubling it, but what's more
> > > interesting to know (perhaps in a comment) is why you are
> > > choosing to use the double of the timeout_ms instead of other
> > > values.
> > > 
> > > Makes sense?
> > 
> > Yes, good question.
> > 
> > Is it possible for more than one bb to hang?  If so, should we wait longer 
> > than the longest preemption timeout?  Earlier I assumed that maybe we should, 
> > just in case, but now, having revisited and reconsidered that, I tend to agree 
> > that the longest preempt timeout, perhaps with a small margin (let's say 
> > +100ms), should be enough to recover from a single failing test case.  Let me 
> > verify whether that works for the linked case.
> 
> I've done some testing and got a confirmation that the issue I'm trying to 
> address in the first place requires a timeout almost twice as long as the 
> longest preemption timeout.
> 
> I propose the following correction:
> 
> -	if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> +	/* 2 x longest preempt timeout, experimentally determined */
> +	if (intel_gt_wait_for_idle(gt, 2 * timeout_ms * HZ / 1000) == -ETIME) {

With this change, I'll merge your patch to drm-intel-next.

Thanks,
Andi

Patch

diff --git a/drivers/gpu/drm/i915/selftests/igt_flush_test.c b/drivers/gpu/drm/i915/selftests/igt_flush_test.c
index 29110abb4fe05..d4b216065f2eb 100644
--- a/drivers/gpu/drm/i915/selftests/igt_flush_test.c
+++ b/drivers/gpu/drm/i915/selftests/igt_flush_test.c
@@ -19,12 +19,21 @@  int igt_flush_test(struct drm_i915_private *i915)
 	int ret = 0;
 
 	for_each_gt(gt, i915, i) {
+		struct intel_engine_cs *engine;
+		unsigned long timeout_ms = 0;
+		unsigned int id;
+
 		if (intel_gt_is_wedged(gt))
 			ret = -EIO;
 
+		for_each_engine(engine, gt, id) {
+			if (engine->props.preempt_timeout_ms > timeout_ms)
+				timeout_ms = engine->props.preempt_timeout_ms;
+		}
+
 		cond_resched();
 
-		if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
+		if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {
 			pr_err("%pS timed out, cancelling all further testing.\n",
 			       __builtin_return_address(0));