Message ID | 20221121145655.75141-2-janusz.krzysztofik@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | drm/i915: Fix timeout handling when retiring requests | expand |
On 21.11.2022 15:56, Janusz Krzysztofik wrote: > Commit b97060a99b01 ("drm/i915/guc: Update intel_gt_wait_for_idle to work > with GuC") extended the API of intel_gt_retire_requests_timeout() with an > extra argument 'remaining_timeout', intended for passing back unconsumed > portion of requested timeout when 0 (success) is returned. However, when > request retirement happens to succeed despite an error returned by a call > to dma_fence_wait_timeout(), that error code (a negative value) is passed > back instead of remaining time. If we then pass that negative value > forward as requested timeout to intel_uc_wait_for_idle(), an explicit BUG > will be triggered. > > If request retirement succeeds but an error code is passed back via > remaininig_timeout, we may have no clue on how much of the initial timeout > might have been left for spending it on waiting for GuC to become idle. > OTOH, since all pending requests have been successfully retired, that > error code has been already ignored by intel_gt_retire_requests_timeout(), > then we shouldn't fail. > > Assume no more time has been left on error and pass 0 timeout value to > intel_uc_wait_for_idle() to give it a chance to return success if GuC is > already idle. > > v3: Don't fail on any error passed back via remaining_timeout. > > v2: Fix the issue on the caller side, not the provider. > > Fixes: b97060a99b01 ("drm/i915/guc: Update intel_gt_wait_for_idle to work with GuC") > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> > Cc: stable@vger.kernel.org # v5.15+ Reviewed-by: Andrzej Hajda <andrzej.hajda@intel.com> Regards Andrzej > --- > drivers/gpu/drm/i915/gt/intel_gt.c | 9 +++++++-- > 1 file changed, 7 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c > index b5ad9caa55372..7ef0edb2e37cd 100644 > --- a/drivers/gpu/drm/i915/gt/intel_gt.c > +++ b/drivers/gpu/drm/i915/gt/intel_gt.c > @@ -677,8 +677,13 @@ int intel_gt_wait_for_idle(struct intel_gt *gt, long timeout) > return -EINTR; > } > > - return timeout ? timeout : intel_uc_wait_for_idle(>->uc, > - remaining_timeout); > + if (timeout) > + return timeout; > + > + if (remaining_timeout < 0) > + remaining_timeout = 0; > + > + return intel_uc_wait_for_idle(>->uc, remaining_timeout); > } > > int intel_gt_init(struct intel_gt *gt)
Hi Andrzej, Thanks for providing your R-b. On Monday, 21 November 2022 18:40:51 CET Andrzej Hajda wrote: > On 21.11.2022 15:56, Janusz Krzysztofik wrote: > > Commit b97060a99b01 ("drm/i915/guc: Update intel_gt_wait_for_idle to work > > with GuC") extended the API of intel_gt_retire_requests_timeout() with an > > extra argument 'remaining_timeout', intended for passing back unconsumed > > portion of requested timeout when 0 (success) is returned. However, when > > request retirement happens to succeed despite an error returned by a call > > to dma_fence_wait_timeout(), that error code (a negative value) is passed > > back instead of remaining time. If we then pass that negative value > > forward as requested timeout to intel_uc_wait_for_idle(), an explicit BUG > > will be triggered. > > > > If request retirement succeeds but an error code is passed back via > > remaininig_timeout, we may have no clue on how much of the initial timeout > > might have been left for spending it on waiting for GuC to become idle. > > OTOH, since all pending requests have been successfully retired, that > > error code has been already ignored by intel_gt_retire_requests_timeout(), > > then we shouldn't fail. > > > > Assume no more time has been left on error and pass 0 timeout value to > > intel_uc_wait_for_idle() to give it a chance to return success if GuC is > > already idle. > > > > v3: Don't fail on any error passed back via remaining_timeout. > > > > v2: Fix the issue on the caller side, not the provider. > > > > Fixes: b97060a99b01 ("drm/i915/guc: Update intel_gt_wait_for_idle to work with GuC") > > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> > > Cc: stable@vger.kernel.org # v5.15+ > > Reviewed-by: Andrzej Hajda <andrzej.hajda@intel.com> While still open for comments from others, I'm now looking for potential committer. Thanks, Janusz > > Regards > Andrzej > > > --- > > drivers/gpu/drm/i915/gt/intel_gt.c | 9 +++++++-- > > 1 file changed, 7 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/ intel_gt.c > > index b5ad9caa55372..7ef0edb2e37cd 100644 > > --- a/drivers/gpu/drm/i915/gt/intel_gt.c > > +++ b/drivers/gpu/drm/i915/gt/intel_gt.c > > @@ -677,8 +677,13 @@ int intel_gt_wait_for_idle(struct intel_gt *gt, long timeout) > > return -EINTR; > > } > > > > - return timeout ? timeout : intel_uc_wait_for_idle(>->uc, > > - remaining_timeout); > > + if (timeout) > > + return timeout; > > + > > + if (remaining_timeout < 0) > > + remaining_timeout = 0; > > + > > + return intel_uc_wait_for_idle(>->uc, remaining_timeout); > > } > > > > int intel_gt_init(struct intel_gt *gt) > >
On 21/11/2022 23:19, Janusz Krzysztofik wrote: > Hi Andrzej, > > Thanks for providing your R-b. > > On Monday, 21 November 2022 18:40:51 CET Andrzej Hajda wrote: >> On 21.11.2022 15:56, Janusz Krzysztofik wrote: >>> Commit b97060a99b01 ("drm/i915/guc: Update intel_gt_wait_for_idle to work >>> with GuC") extended the API of intel_gt_retire_requests_timeout() with an >>> extra argument 'remaining_timeout', intended for passing back unconsumed >>> portion of requested timeout when 0 (success) is returned. However, when >>> request retirement happens to succeed despite an error returned by a call >>> to dma_fence_wait_timeout(), that error code (a negative value) is passed >>> back instead of remaining time. If we then pass that negative value >>> forward as requested timeout to intel_uc_wait_for_idle(), an explicit BUG >>> will be triggered. Right, AFAICT a GEM_BUG_ON in debug builds, but in production builds negative timeout will get passed along all the way to schedule_timeout where error and call trace will be dumped. So fix appears warranted indeed. >>> If request retirement succeeds but an error code is passed back via >>> remaininig_timeout, we may have no clue on how much of the initial timeout >>> might have been left for spending it on waiting for GuC to become idle. >>> OTOH, since all pending requests have been successfully retired, that >>> error code has been already ignored by intel_gt_retire_requests_timeout(), >>> then we shouldn't fail. >>> >>> Assume no more time has been left on error and pass 0 timeout value to >>> intel_uc_wait_for_idle() to give it a chance to return success if GuC is >>> already idle. >>> >>> v3: Don't fail on any error passed back via remaining_timeout. >>> >>> v2: Fix the issue on the caller side, not the provider. >>> >>> Fixes: b97060a99b01 ("drm/i915/guc: Update intel_gt_wait_for_idle to work > with GuC") >>> Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> >>> Cc: stable@vger.kernel.org # v5.15+ >> >> Reviewed-by: Andrzej Hajda <andrzej.hajda@intel.com> > > While still open for comments from others, I'm now looking for potential > committer. Both patches are considered good to go? Regards, Tvrtko > > Thanks, > Janusz > > >> >> Regards >> Andrzej >> >>> --- >>> drivers/gpu/drm/i915/gt/intel_gt.c | 9 +++++++-- >>> 1 file changed, 7 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/ > intel_gt.c >>> index b5ad9caa55372..7ef0edb2e37cd 100644 >>> --- a/drivers/gpu/drm/i915/gt/intel_gt.c >>> +++ b/drivers/gpu/drm/i915/gt/intel_gt.c >>> @@ -677,8 +677,13 @@ int intel_gt_wait_for_idle(struct intel_gt *gt, long > timeout) >>> return -EINTR; >>> } >>> >>> - return timeout ? timeout : intel_uc_wait_for_idle(>->uc, >>> - > remaining_timeout); >>> + if (timeout) >>> + return timeout; >>> + >>> + if (remaining_timeout < 0) >>> + remaining_timeout = 0; >>> + >>> + return intel_uc_wait_for_idle(>->uc, remaining_timeout); >>> } >>> >>> int intel_gt_init(struct intel_gt *gt) >> >> > > > >
On Tuesday, 22 November 2022 11:41:29 CET Tvrtko Ursulin wrote: > > On 21/11/2022 23:19, Janusz Krzysztofik wrote: > > Hi Andrzej, > > > > Thanks for providing your R-b. > > > > On Monday, 21 November 2022 18:40:51 CET Andrzej Hajda wrote: > >> On 21.11.2022 15:56, Janusz Krzysztofik wrote: > >>> Commit b97060a99b01 ("drm/i915/guc: Update intel_gt_wait_for_idle to work > >>> with GuC") extended the API of intel_gt_retire_requests_timeout() with an > >>> extra argument 'remaining_timeout', intended for passing back unconsumed > >>> portion of requested timeout when 0 (success) is returned. However, when > >>> request retirement happens to succeed despite an error returned by a call > >>> to dma_fence_wait_timeout(), that error code (a negative value) is passed > >>> back instead of remaining time. If we then pass that negative value > >>> forward as requested timeout to intel_uc_wait_for_idle(), an explicit BUG > >>> will be triggered. > > Right, AFAICT a GEM_BUG_ON in debug builds, but in production builds > negative timeout will get passed along all the way to schedule_timeout > where error and call trace will be dumped. So fix appears warranted indeed. > > >>> If request retirement succeeds but an error code is passed back via > >>> remaininig_timeout, we may have no clue on how much of the initial timeout > >>> might have been left for spending it on waiting for GuC to become idle. > >>> OTOH, since all pending requests have been successfully retired, that > >>> error code has been already ignored by intel_gt_retire_requests_timeout(), > >>> then we shouldn't fail. > >>> > >>> Assume no more time has been left on error and pass 0 timeout value to > >>> intel_uc_wait_for_idle() to give it a chance to return success if GuC is > >>> already idle. > >>> > >>> v3: Don't fail on any error passed back via remaining_timeout. > >>> > >>> v2: Fix the issue on the caller side, not the provider. > >>> > >>> Fixes: b97060a99b01 ("drm/i915/guc: Update intel_gt_wait_for_idle to work > > with GuC") > >>> Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> > >>> Cc: stable@vger.kernel.org # v5.15+ > >> > >> Reviewed-by: Andrzej Hajda <andrzej.hajda@intel.com> > > > > While still open for comments from others, I'm now looking for potential > > committer. > > Both patches are considered good to go? Yes, I hope so. The actual diff of 2/2 v3 has received R-b from Andrzej still under discussion of v2, then I decided to add that tag to v3. Andrzej, can you please respond to 2/2 v3 and confirm your R-b applies? Thanks, Janusz > > Regards, > > Tvrtko > > > > > Thanks, > > Janusz > > > > > >> > >> Regards > >> Andrzej > >> > >>> --- > >>> drivers/gpu/drm/i915/gt/intel_gt.c | 9 +++++++-- > >>> 1 file changed, 7 insertions(+), 2 deletions(-) > >>> > >>> diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/ > > intel_gt.c > >>> index b5ad9caa55372..7ef0edb2e37cd 100644 > >>> --- a/drivers/gpu/drm/i915/gt/intel_gt.c > >>> +++ b/drivers/gpu/drm/i915/gt/intel_gt.c > >>> @@ -677,8 +677,13 @@ int intel_gt_wait_for_idle(struct intel_gt *gt, long > > timeout) > >>> return -EINTR; > >>> } > >>> > >>> - return timeout ? timeout : intel_uc_wait_for_idle(>->uc, > >>> - > > remaining_timeout); > >>> + if (timeout) > >>> + return timeout; > >>> + > >>> + if (remaining_timeout < 0) > >>> + remaining_timeout = 0; > >>> + > >>> + return intel_uc_wait_for_idle(>->uc, remaining_timeout); > >>> } > >>> > >>> int intel_gt_init(struct intel_gt *gt) > >> > >> > > > > > > > > >
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c index b5ad9caa55372..7ef0edb2e37cd 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt.c +++ b/drivers/gpu/drm/i915/gt/intel_gt.c @@ -677,8 +677,13 @@ int intel_gt_wait_for_idle(struct intel_gt *gt, long timeout) return -EINTR; } - return timeout ? timeout : intel_uc_wait_for_idle(>->uc, - remaining_timeout); + if (timeout) + return timeout; + + if (remaining_timeout < 0) + remaining_timeout = 0; + + return intel_uc_wait_for_idle(>->uc, remaining_timeout); } int intel_gt_init(struct intel_gt *gt)
Commit b97060a99b01 ("drm/i915/guc: Update intel_gt_wait_for_idle to work with GuC") extended the API of intel_gt_retire_requests_timeout() with an extra argument 'remaining_timeout', intended for passing back unconsumed portion of requested timeout when 0 (success) is returned. However, when request retirement happens to succeed despite an error returned by a call to dma_fence_wait_timeout(), that error code (a negative value) is passed back instead of remaining time. If we then pass that negative value forward as requested timeout to intel_uc_wait_for_idle(), an explicit BUG will be triggered. If request retirement succeeds but an error code is passed back via remaininig_timeout, we may have no clue on how much of the initial timeout might have been left for spending it on waiting for GuC to become idle. OTOH, since all pending requests have been successfully retired, that error code has been already ignored by intel_gt_retire_requests_timeout(), then we shouldn't fail. Assume no more time has been left on error and pass 0 timeout value to intel_uc_wait_for_idle() to give it a chance to return success if GuC is already idle. v3: Don't fail on any error passed back via remaining_timeout. v2: Fix the issue on the caller side, not the provider. Fixes: b97060a99b01 ("drm/i915/guc: Update intel_gt_wait_for_idle to work with GuC") Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Cc: stable@vger.kernel.org # v5.15+ --- drivers/gpu/drm/i915/gt/intel_gt.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-)