Message ID | 20241213190122.513709-2-janusz.krzysztofik@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | drm/i915/selftests: Use preemption timeout on cleanup |
Hi Janusz,

...

> 	for_each_gt(gt, i915, i) {
> +		struct intel_engine_cs *engine;
> +		unsigned long timeout_ms = 0;
> +		unsigned int id;
> +
> 		if (intel_gt_is_wedged(gt))
> 			ret = -EIO;
>
> +		for_each_engine(engine, gt, id) {
> +			if (engine->props.preempt_timeout_ms > timeout_ms)
> +				timeout_ms = engine->props.preempt_timeout_ms;
> +		}

The brackets are not really required here.

> +
> 		cond_resched();
>
> -		if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> +		if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {

Where is this 500 coming from?

Thanks,
Andi

> 			pr_err("%pS timed out, cancelling all further testing.\n",
> 			       __builtin_return_address(0));
>
> --
> 2.47.1
Hi Andi,

Thanks for the review.

On Monday, 16 December 2024 14:26:58 CET Andi Shyti wrote:
> Hi Janusz,
>
> ...
>
> > 	for_each_gt(gt, i915, i) {
> > +		struct intel_engine_cs *engine;
> > +		unsigned long timeout_ms = 0;
> > +		unsigned int id;
> > +
> > 		if (intel_gt_is_wedged(gt))
> > 			ret = -EIO;
> >
> > +		for_each_engine(engine, gt, id) {
> > +			if (engine->props.preempt_timeout_ms > timeout_ms)
> > +				timeout_ms = engine->props.preempt_timeout_ms;
> > +		}
>
> The brackets are not really required here.

OK, I was not sure whether for_each_if used inside for_each_engine is
supposed to resolve potential issues with confusing if nesting, but from your
comment I conclude it does.  I'll fix it.

> > +
> > 		cond_resched();
> >
> > -		if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> > +		if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {
>
> Where is this 500 coming from?

/ 1000 would convert it to seconds as needed, and / 500 used instead was
supposed to mean that we are willing to wait for preempt_timeout_ms * 2.
Sorry for that shortcut.  Would you like me to provide a clarifying comment,
or maybe better use an explicit 2 * preempt_timeout / 1000?

Thanks,
Janusz

>
> Thanks,
> Andi
>
> > 			pr_err("%pS timed out, cancelling all further testing.\n",
> > 			       __builtin_return_address(0));
> >
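To spell out the arithmetic discussed above: with HZ scheduler ticks per
second, a preemption timeout of timeout_ms milliseconds corresponds to
timeout_ms * HZ / 1000 jiffies, so dividing by 500 instead of 1000 yields
twice that value.  A minimal illustrative sketch follows; the helper name is
hypothetical and not part of the patch, and kernel context with HZ from
<linux/jiffies.h> is assumed:

#include <linux/jiffies.h>	/* HZ: scheduler ticks per second */

/* Twice the longest preemption timeout, expressed in jiffies. */
static unsigned long flush_timeout_jiffies(unsigned long timeout_ms)
{
	/* HZ * timeout_ms / 500 == 2 * (timeout_ms * HZ / 1000) */
	return HZ * timeout_ms / 500;
}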
Hi Janusz,

...

> > > +
> > > 		cond_resched();
> > >
> > > -		if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> > > +		if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {
> >
> > Where is this 500 coming from?
>
> / 1000 would convert it to seconds as needed, and / 500 used instead was
> supposed to mean that we are willing to wait for preempt_timeout_ms * 2.
> Sorry for that shortcut.  Would you like me to provide a clarifying comment,
> or maybe better use an explicit 2 * preempt_timeout / 1000?

It was clear that you were doubling it, but what's more interesting to know
(perhaps in a comment) is why you chose to double timeout_ms rather than use
some other value.

Makes sense?

Thanks,
Andi
Hi Andi,

On Tuesday, 17 December 2024 18:12:08 CET Andi Shyti wrote:
> Hi Janusz,
>
> ...
>
> > > > +
> > > > 		cond_resched();
> > > >
> > > > -		if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> > > > +		if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {
> > >
> > > Where is this 500 coming from?
> >
> > / 1000 would convert it to seconds as needed, and / 500 used instead was
> > supposed to mean that we are willing to wait for preempt_timeout_ms * 2.
> > Sorry for that shortcut.  Would you like me to provide a clarifying comment,
> > or maybe better use an explicit 2 * preempt_timeout / 1000?
>
> It was clear that you were doubling it, but what's more interesting to know
> (perhaps in a comment) is why you chose to double timeout_ms rather than use
> some other value.
>
> Makes sense?

Yes, good question.

Is it possible for more than one batch buffer to hang?  If so, should we wait
longer than the longest preemption timeout?  I previously assumed that maybe
we should, just in case, but having revisited and reconsidered that, I now
tend to agree that the longest preempt timeout, perhaps with a small margin
(say +100 ms), should be enough to recover from a single failing test case.
Let me verify whether that works for the linked case.

Thanks,
Janusz

>
> Thanks,
> Andi
>
Hi Andi,

On Tuesday, 17 December 2024 19:00:40 CET Janusz Krzysztofik wrote:
> Hi Andi,
>
> On Tuesday, 17 December 2024 18:12:08 CET Andi Shyti wrote:
> > Hi Janusz,
> >
> > ...
> >
> > > > > +
> > > > > 		cond_resched();
> > > > >
> > > > > -		if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> > > > > +		if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {
> > > >
> > > > Where is this 500 coming from?
> > >
> > > / 1000 would convert it to seconds as needed, and / 500 used instead was
> > > supposed to mean that we are willing to wait for preempt_timeout_ms * 2.
> > > Sorry for that shortcut.  Would you like me to provide a clarifying
> > > comment, or maybe better use an explicit 2 * preempt_timeout / 1000?
> >
> > It was clear that you were doubling it, but what's more interesting to
> > know (perhaps in a comment) is why you chose to double timeout_ms rather
> > than use some other value.
> >
> > Makes sense?
>
> Yes, good question.
>
> Is it possible for more than one batch buffer to hang?  If so, should we
> wait longer than the longest preemption timeout?  I previously assumed that
> maybe we should, just in case, but having revisited and reconsidered that,
> I now tend to agree that the longest preempt timeout, perhaps with a small
> margin (say +100 ms), should be enough to recover from a single failing
> test case.  Let me verify whether that works for the linked case.

I've done some testing and got confirmation that the issue I'm trying to
address in the first place requires a timeout almost twice as long as the
longest preemption timeout.

I propose the following correction:

-		if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
+		/* 2 x longest preempt timeout, experimentally determined */
+		if (intel_gt_wait_for_idle(gt, 2 * timeout_ms * HZ / 1000) == -ETIME) {

Thanks,
Janusz

>
> Thanks,
> Janusz
>
> > Thanks,
> > Andi
Hi Janusz,

> > > > > +
> > > > > 		cond_resched();
> > > > >
> > > > > -		if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> > > > > +		if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {
> > > >
> > > > Where is this 500 coming from?
> > >
> > > / 1000 would convert it to seconds as needed, and / 500 used instead was
> > > supposed to mean that we are willing to wait for preempt_timeout_ms * 2.
> > > Sorry for that shortcut.  Would you like me to provide a clarifying
> > > comment, or maybe better use an explicit 2 * preempt_timeout / 1000?
> >
> > It was clear that you were doubling it, but what's more interesting to
> > know (perhaps in a comment) is why you chose to double timeout_ms rather
> > than use some other value.
> >
> > Makes sense?
>
> Yes, good question.
>
> Is it possible for more than one batch buffer to hang?  If so, should we
> wait longer than the longest preemption timeout?  I previously assumed that
> maybe we should, just in case, but having revisited and reconsidered that,
> I now tend to agree that the longest preempt timeout, perhaps with a small
> margin (say +100 ms), should be enough to recover from a single failing
> test case.  Let me verify whether that works for the linked case.

As we agreed offline, I'm going to add the comment you suggested to your
change as a justification for the "/ 500":

	/* 2x longest preempt timeout, experimentally determined */

With this:

Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>

Thanks,
Andi
Hi Janusz,

> > > > > > +
> > > > > > 		cond_resched();
> > > > > >
> > > > > > -		if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> > > > > > +		if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {
> > > > >
> > > > > Where is this 500 coming from?
> > > >
> > > > / 1000 would convert it to seconds as needed, and / 500 used instead
> > > > was supposed to mean that we are willing to wait for
> > > > preempt_timeout_ms * 2.  Sorry for that shortcut.  Would you like me
> > > > to provide a clarifying comment, or maybe better use an explicit
> > > > 2 * preempt_timeout / 1000?
> > >
> > > It was clear that you were doubling it, but what's more interesting to
> > > know (perhaps in a comment) is why you chose to double timeout_ms rather
> > > than use some other value.
> > >
> > > Makes sense?
> >
> > Yes, good question.
> >
> > Is it possible for more than one batch buffer to hang?  If so, should we
> > wait longer than the longest preemption timeout?  I previously assumed
> > that maybe we should, just in case, but having revisited and reconsidered
> > that, I now tend to agree that the longest preempt timeout, perhaps with
> > a small margin (say +100 ms), should be enough to recover from a single
> > failing test case.  Let me verify whether that works for the linked case.
>
> I've done some testing and got confirmation that the issue I'm trying to
> address in the first place requires a timeout almost twice as long as the
> longest preemption timeout.
>
> I propose the following correction:
>
> -		if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> +		/* 2 x longest preempt timeout, experimentally determined */
> +		if (intel_gt_wait_for_idle(gt, 2 * timeout_ms * HZ / 1000) == -ETIME) {

With this change, I'm merging your patch to drm-intel-next.

Thanks,
Andi
diff --git a/drivers/gpu/drm/i915/selftests/igt_flush_test.c b/drivers/gpu/drm/i915/selftests/igt_flush_test.c
index 29110abb4fe05..d4b216065f2eb 100644
--- a/drivers/gpu/drm/i915/selftests/igt_flush_test.c
+++ b/drivers/gpu/drm/i915/selftests/igt_flush_test.c
@@ -19,12 +19,21 @@ int igt_flush_test(struct drm_i915_private *i915)
 	int ret = 0;
 
 	for_each_gt(gt, i915, i) {
+		struct intel_engine_cs *engine;
+		unsigned long timeout_ms = 0;
+		unsigned int id;
+
 		if (intel_gt_is_wedged(gt))
 			ret = -EIO;
 
+		for_each_engine(engine, gt, id) {
+			if (engine->props.preempt_timeout_ms > timeout_ms)
+				timeout_ms = engine->props.preempt_timeout_ms;
+		}
+
 		cond_resched();
 
-		if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
+		if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {
 			pr_err("%pS timed out, cancelling all further testing.\n",
 			       __builtin_return_address(0));
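For reference, a minimal sketch of how the cleanup loop reads once the
correction agreed in the thread above is folded in (the explicit
2 * timeout_ms * HZ / 1000 form with its justifying comment, and the braces
around for_each_engine() dropped).  This is an illustration of the
discussion, not necessarily the exact code as merged, and it assumes the
igt_flush_test() context (i915, gt, i, ret) shown in the diff above:

	for_each_gt(gt, i915, i) {
		struct intel_engine_cs *engine;
		unsigned long timeout_ms = 0;
		unsigned int id;

		if (intel_gt_is_wedged(gt))
			ret = -EIO;

		/* Find the longest preemption timeout configured on any engine. */
		for_each_engine(engine, gt, id)
			if (engine->props.preempt_timeout_ms > timeout_ms)
				timeout_ms = engine->props.preempt_timeout_ms;

		cond_resched();

		/* 2 x longest preempt timeout, experimentally determined */
		if (intel_gt_wait_for_idle(gt, 2 * timeout_ms * HZ / 1000) == -ETIME) {
			pr_err("%pS timed out, cancelling all further testing.\n",
			       __builtin_return_address(0));
			/* remaining error handling as in the existing function */
		}
	}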
Many selftests call igt_flush_test() on cleanup.  With the default preemption
timeout of compute engines raised to 7.5 seconds, the hardcoded flush timeout
of 3 seconds is too short.  That results in the GPU being forcibly wedged and
the kernel tainted, then an IGT abort being triggered.  CI BAT runs lose a
part of their expected coverage.

Calculate the flush timeout based on the longest preemption timeout currently
configured for any engine.  That way, selftests can still report detected
issues as non-critical, and the GPU gets a chance to recover from preemptible
hangs and prepare for smooth execution of the next test cases.

Link: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12061
Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
---
 drivers/gpu/drm/i915/selftests/igt_flush_test.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)