
drm/i915/selftests: Use preemption timeout on cleanup

Message ID 20241213190122.513709-2-janusz.krzysztofik@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series drm/i915/selftests: Use preemption timeout on cleanup | expand

Commit Message

Janusz Krzysztofik Dec. 13, 2024, 6:59 p.m. UTC
Many selftests call igt_flush_test() on cleanup.  With the default preemption
timeout of compute engines raised to 7.5 seconds, the hardcoded flush timeout
of 3 seconds is too short.  That results in the GPU being forcibly wedged and
the kernel tainted, then an IGT abort triggered.  CI BAT runs lose a part of
their expected coverage.

Calculate the flush timeout based on the longest preemption timeout
currently configured for any engine.  That way, selftests can still report
detected issues as non-critical, and the GPU gets a chance to recover from
preemptible hangs and prepare for smooth execution of subsequent test cases.

Link: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12061
Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
---
 drivers/gpu/drm/i915/selftests/igt_flush_test.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

Comments

Andi Shyti Dec. 16, 2024, 1:26 p.m. UTC | #1
Hi Janusz,

...

>  	for_each_gt(gt, i915, i) {
> +		struct intel_engine_cs *engine;
> +		unsigned long timeout_ms = 0;
> +		unsigned int id;
> +
>  		if (intel_gt_is_wedged(gt))
>  			ret = -EIO;
>  
> +		for_each_engine(engine, gt, id) {
> +			if (engine->props.preempt_timeout_ms > timeout_ms)
> +				timeout_ms = engine->props.preempt_timeout_ms;
> +		}


the brackets are not really required here.

> +
>  		cond_resched();
>  
> -		if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> +		if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {

where is this 500 coming from?

Thanks,
Andi

>  			pr_err("%pS timed out, cancelling all further testing.\n",
>  			       __builtin_return_address(0));
>  
> -- 
> 2.47.1
Janusz Krzysztofik Dec. 16, 2024, 6:06 p.m. UTC | #2
Hi Andi,

Thanks for review.

On Monday, 16 December 2024 14:26:58 CET Andi Shyti wrote:
> Hi Janusz,
> 
> ...
> 
> >  	for_each_gt(gt, i915, i) {
> > +		struct intel_engine_cs *engine;
> > +		unsigned long timeout_ms = 0;
> > +		unsigned int id;
> > +
> >  		if (intel_gt_is_wedged(gt))
> >  			ret = -EIO;
> >  
> > +		for_each_engine(engine, gt, id) {
> > +			if (engine->props.preempt_timeout_ms > timeout_ms)
> > +				timeout_ms = engine->props.preempt_timeout_ms;
> > +		}
> 
> 
> the brackets are not really required here.

OK, I was not sure whether for_each_if used inside for_each_engine is supposed
to resolve potential issues with confusing if nesting, but from your comment I
conclude it does.  I'll fix it.
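
(For reference, for_each_if is roughly this one-liner from
include/drm/drm_util.h -- quoting from memory here, so treat it as a sketch
rather than the exact source:

	/*
	 * Wrap the loop condition in if/else so that an if statement placed
	 * directly in the loop body binds unambiguously, i.e. no dangling-else
	 * surprise from the macro's own hidden if.
	 */
	#define for_each_if(condition) if (!(condition)) {} else

which is why an unbraced if statement as the for_each_engine() body still
parses the way one would expect.)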

> 
> > +
> >  		cond_resched();
> >  
> > -		if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> > +		if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {
> 
> where is this 500 coming from?

/ 1000 would convert it to seconds as needed, and / 500 used instead was 
supposed to mean that we are willing to wait for preempt_timeout_ms * 2.  
Sorry for that shortcut.  Would you like me to provide a clarifying comment, 
or maybe better use explicit 2 * preempt_timeout / 1000 ?
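
For illustration, assuming for example CONFIG_HZ=250 and the raised 7.5 second
compute preemption timeout (so timeout_ms = 7500):

	HZ * timeout_ms / 1000 = 250 * 7500 / 1000 = 1875 jiffies  (7.5 s)
	HZ * timeout_ms / 500  = 250 * 7500 / 500  = 3750 jiffies  (15 s)

i.e. the / 500 silently folds the factor of 2 into the ms-to-jiffies
conversion, which is why it looked obscure.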

Thanks,
Janusz

> 
> Thanks,
> Andi
> 
>  			pr_err("%pS timed out, cancelling all further testing.\n",
> >  			       __builtin_return_address(0));
> >  
>
Andi Shyti Dec. 17, 2024, 5:12 p.m. UTC | #3
Hi Janusz,

...

> > > +
> > >  		cond_resched();
> > >  
> > > -		if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> > > +		if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {
> > 
> > where is this 500 coming from?
> 
> / 1000 would convert it to seconds as needed, and / 500 used instead was 
> supposed to mean that we are willing to wait for preempt_timeout_ms * 2.  
> Sorry for that shortcut.  Would you like me to provide a clarifying comment, 
> or maybe better use explicit 2 * preempt_timeout / 1000 ?

It was clear that you were doubling it, but what's more
interesting to know (perhaps in a comment) is why you are
choosing to use the double of the timeout_ms instead of other
values.

Makes sense?

Thanks,
Andi
Janusz Krzysztofik Dec. 17, 2024, 6 p.m. UTC | #4
Hi Andi,

On Tuesday, 17 December 2024 18:12:08 CET Andi Shyti wrote:
> Hi Janusz,
> 
> ...
> 
> > > > +
> > > >  		cond_resched();
> > > >  
> > > > -		if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> > > > +		if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {
> > > 
> > > where is this 500 coming from?
> > 
> > / 1000 would convert it to seconds as needed, and / 500 used instead was 
> > supposed to mean that we are willing to wait for preempt_timeout_ms * 2.  
> > Sorry for that shortcut.  Would you like me to provide a clarifying comment, 
> > or maybe better use explicit 2 * preempt_timeout / 1000 ?
> 
> It was clear that you were doubling it, but what's more
> interesting to know (perhaps in a comment) is why you are
> choosing to use the double of the timeout_ms instead of other
> values.
> 
> Makes sense?

Yes, good question.

Is it possible for more than one bb to hang?  If so, should we wait longer 
than the longest preemption timeout?  Earlier I assumed that maybe we should, 
just in case, but now, having revisited and reconsidered that, I tend to agree 
that the longest preempt timeout, perhaps with a small margin (let's say 
+100ms), should be enough to recover from a single failing test case.  Let me 
verify whether that works for the linked case.

Thanks,
Janusz

> 
> Thanks,
> Andi
>
Janusz Krzysztofik Dec. 20, 2024, 11:49 a.m. UTC | #5
Hi Andi,

On Tuesday, 17 December 2024 19:00:40 CET Janusz Krzysztofik wrote:
> Hi Andi,
> 
> On Tuesday, 17 December 2024 18:12:08 CET Andi Shyti wrote:
> > Hi Janusz,
> > 
> > ...
> > 
> > > > > +
> > > > >  		cond_resched();
> > > > >  
> > > > > -		if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> > > > > +		if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {
> > > > 
> > > > where is this 500 coming from?
> > > 
> > > / 1000 would convert it to seconds as needed, and / 500 used instead was 
> > > supposed to mean that we are willing to wait for preempt_timeout_ms * 2.  
> > > Sorry for that shortcut.  Would you like me to provide a clarifying comment, 
> > > or maybe better use explicit 2 * preempt_timeout / 1000 ?
> > 
> > It was clear that you were doubling it, but what's more
> > interesting to know (perhaps in a comment) is why you are
> > choosing to use the double of the timeout_ms instead of other
> > values.
> > 
> > Makes sense?
> 
> Yes, good question.
> 
> Is it possible for more than one bb to hang?  If so, should we wait longer 
> than the longest preemption timeout?  Earlier I assumed that maybe we should, 
> just in case, but now, having revisited and reconsidered that, I tend to agree 
> that the longest preempt timeout, perhaps with a small margin (let's say 
> +100ms), should be enough to recover from a single failing test case.  Let me 
> verify whether that works for the linked case.

I've done some testing and got a confirmation that the issue I'm trying to 
address in the first place requires a timeout almost twice as long as the 
longest preemption timeout.

I propose the following correction:

-	if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
+	/* 2 x longest preempt timeout, experimentally determined */
+	if (intel_gt_wait_for_idle(gt, 2 * timeout_ms * HZ / 1000) == -ETIME) {
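
For the record, with e.g. CONFIG_HZ=1000 and timeout_ms = 7500 (the raised
compute engine default), that works out to 2 * 7500 * 1000 / 1000 = 15000
jiffies, i.e. a 15 second budget instead of the previous fixed 3 seconds.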

Thanks,
Janusz

> 
> Thanks,
> Janusz
> 
> > 
> > Thanks,
> > Andi
> > 
> 
> 
> 
> 
>
Andi Shyti Dec. 20, 2024, 2:02 p.m. UTC | #6
Hi Janusz,

> > > > > +
> > > > >  		cond_resched();
> > > > >  
> > > > > -		if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> > > > > +		if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {
> > > > 
> > > > where is this 500 coming from?
> > > 
> > > / 1000 would convert it to seconds as needed, and / 500 used instead was 
> > > supposed to mean that we are willing to wait for preempt_timeout_ms * 2.  
> > > Sorry for that shortcut.  Would you like me to provide a clarifying comment, 
> > > or maybe better use explicit 2 * preempt_timeout / 1000 ?
> > 
> > It was clear that you were doubling it, but what's more
> > interesting to know (perhaps in a comment) is why you are
> > choosing to use the double of the timeout_ms instead of other
> > values.
> > 
> > Makes sense?
> 
> Yes, good question.
> 
> Is it possible for more than one bb to hang?  If so, should we wait longer 
> than the longest preemption timeout?  Earlier I assumed that maybe we should, 
> just in case, but now, having revisited and reconsidered that, I tend to agree 
> that the longest preempt timeout, perhaps with a small margin (let's say 
> +100ms), should be enough to recover from a single failing test case.  Let me 
> verify whether that works for the linked case.

As we agreed offline, I'm going to add the comment you suggested
to your change as a justification for the "/ 500":

/* 2x longest preempt timeout, experimentally determined */

With this:

Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>

Thanks,
Andi
Andi Shyti Dec. 20, 2024, 10:38 p.m. UTC | #7
Hi Janusz,

> > > > > > +
> > > > > >  		cond_resched();
> > > > > >  
> > > > > > -		if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> > > > > > +		if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {
> > > > > 
> > > > > where is this 500 coming from?
> > > > 
> > > > / 1000 would convert it to seconds as needed, and / 500 used instead was 
> > > > supposed to mean that we are willing to wait for preempt_timeout_ms * 2.  
> > > > Sorry for that shortcut.  Would you like me to provide a clarifying comment, 
> > > > or maybe better use explicit 2 * preempt_timeout / 1000 ?
> > > 
> > > It was clear that you were doubling it, but what's more
> > > interesting to know (perhaps in a comment) is why you are
> > > choosing to use the double of the timeout_ms instead of other
> > > values.
> > > 
> > > Makes sense?
> > 
> > Yes, good question.
> > 
> > Is it possible for more than one bb to hang?  If so, should we wait longer 
> > than the longest preemption timeout?  Earlier I assumed that maybe we should, 
> > just in case, but now, having revisited and reconsidered that, I tend to agree 
> > that the longest preempt timeout, perhaps with a small margin (let's say 
> > +100ms), should be enough to recover from a single failing test case.  Let me 
> > verify whether that works for the linked case.
> 
> I've done some testing and got a confirmation that the issue I'm trying to 
> address in the first place requires a timeout almost twice as long as the 
> longest preemption timeout.
> 
> I propose the following correction:
> 
> -	if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> +	/* 2 x longest preempt timeout, experimentally determined */
> +	if (intel_gt_wait_for_idle(gt, 2 * timeout_ms * HZ / 1000) == -ETIME) {

With this change, I'll merge your patch to drm-intel-next.

Thanks,
Andi

Patch

diff --git a/drivers/gpu/drm/i915/selftests/igt_flush_test.c b/drivers/gpu/drm/i915/selftests/igt_flush_test.c
index 29110abb4fe05..d4b216065f2eb 100644
--- a/drivers/gpu/drm/i915/selftests/igt_flush_test.c
+++ b/drivers/gpu/drm/i915/selftests/igt_flush_test.c
@@ -19,12 +19,21 @@  int igt_flush_test(struct drm_i915_private *i915)
 	int ret = 0;
 
 	for_each_gt(gt, i915, i) {
+		struct intel_engine_cs *engine;
+		unsigned long timeout_ms = 0;
+		unsigned int id;
+
 		if (intel_gt_is_wedged(gt))
 			ret = -EIO;
 
+		for_each_engine(engine, gt, id) {
+			if (engine->props.preempt_timeout_ms > timeout_ms)
+				timeout_ms = engine->props.preempt_timeout_ms;
+		}
+
 		cond_resched();
 
-		if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
+		if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {
 			pr_err("%pS timed out, cancelling all further testing.\n",
 			       __builtin_return_address(0));