mbox series

[v1,0/6] drm/lima: fixes and improvements to error recovery

Message ID 20240117031212.1104034-1-nunes.erico@gmail.com (mailing list archive)
Headers show
Series drm/lima: fixes and improvements to error recovery | expand

Message

Erico Nunes Jan. 17, 2024, 3:12 a.m. UTC
There have been reports from users for a long time about applications
hitting random "*ERROR* lima job timeout" and the application or GPU
becoming unresponsive.

This series addresses a number of related bugs, especially in the pp
timeout recovery path.

Patch 1 fixes a stack trace initially featured in a user report here:
https://gitlab.freedesktop.org/mesa/mesa/-/issues/8415

Patch 2 fixes a "pp reset time out" message which was fairly confusing
at first. Debugging with an application which just submits one single
job at a time made it clear that the timeout actually was reported in
the application that runs next, even if that is a good application.

Patch 3 is one of the most important fixes and stops random "mmu command
2 timeout" ppmmu timeouts that I hit while running my tests. Initially I
thought it came from some race condition on running/resetting jobs, but
it was actually due to hard resetting live tasks.
After setting this there was never a mmu timeout anymore (even after
dropping the guilty flag in the upcoming patch, which by itself was
actually the easiest reproducer of the ppmmu timeouts).

Patch 4 might be debatable, but hopefully we can agree to go with it
since it was already discussed in Panfrost here:
https://patchwork.freedesktop.org/series/120820/
It is actually not that hard to reproduce both "spurious timeout" and
"unexpectedly high interrupt latency" by creating an irq which takes a
longer than usual to execute, and paired with the issues in timeout
recovery this could cause an unrelated application to hang.

Patch 5 removes the guilty drm_sched from context handling in lima.
Often users report issues with a user-visible effect of "application
flipping the last 2 rendered frames":
https://gitlab.freedesktop.org/mesa/mesa/-/issues/8410
This was ultimately caused because lima sets the guilty flag to a
context which causes all further rendering jobs from that context to be
dropped.
Without the fixes from patches 1-4 it was not possible to just drop the
guilty flag as that could trigger the mentioned issues and still hang
the GPU by attempting a recovery.
After the fixes above it seems to be reliable and survives some fairly
torturing tests mentioned below.
Other similar GPU drivers like panfrost, v3d don't make use of the
guilty context flag. So I think we can drop it for lima too.

Patch 6 is just some debug message cleanup.


Some of the tests which I ran with this patchset:

- Application which renders normal frames, then frames which /barely/
timeout, then frames which legitimely timeout, then frames which timeout
enough to trigger the hardware watchdog irq, then goes back to render
normal frames. This used to hang the application at first but now
survives the entire process repeated indefinitely.

- High irq latency from an unrelated source while rendering. I put a
mdelay() in the touchscreen driver irq and could cause all "spurious
timeout", "unexpectedly high interrupt latency" and actual timeouts.
Those are now all reported to dmesg for information and I am no longer
able to cause an unrelated application to hang.

- Game running with lower configured drm_sched timeout (locally set to
200ms) with other applications triggering timeouts in the background.
Different applications trigger resets independently but none of them
cause the GPU to hang or the game to stop working.


Erico Nunes (6):
  drm/lima: fix devfreq refcount imbalance for job timeouts
  drm/lima: reset async_reset on pp hard reset
  drm/lima: set bus_stop bit before hard reset
  drm/lima: handle spurious timeouts due to high irq latency
  drm/lima: remove guilty drm_sched context handling
  drm/lima: improve some pp debug messages

 drivers/gpu/drm/lima/lima_ctx.c   |  2 +-
 drivers/gpu/drm/lima/lima_ctx.h   |  1 -
 drivers/gpu/drm/lima/lima_pp.c    | 26 +++++++++++++++++---
 drivers/gpu/drm/lima/lima_sched.c | 40 ++++++++++++++++++++++++-------
 drivers/gpu/drm/lima/lima_sched.h |  5 ++--
 5 files changed, 58 insertions(+), 16 deletions(-)

Comments

Christian König Jan. 17, 2024, 7:22 a.m. UTC | #1
Am 17.01.24 um 04:12 schrieb Erico Nunes:
> There have been reports from users for a long time about applications
> hitting random "*ERROR* lima job timeout" and the application or GPU
> becoming unresponsive.
>
> This series addresses a number of related bugs, especially in the pp
> timeout recovery path.
>
> Patch 1 fixes a stack trace initially featured in a user report here:
> https://gitlab.freedesktop.org/mesa/mesa/-/issues/8415
>
> Patch 2 fixes a "pp reset time out" message which was fairly confusing
> at first. Debugging with an application which just submits one single
> job at a time made it clear that the timeout actually was reported in
> the application that runs next, even if that is a good application.
>
> Patch 3 is one of the most important fixes and stops random "mmu command
> 2 timeout" ppmmu timeouts that I hit while running my tests. Initially I
> thought it came from some race condition on running/resetting jobs, but
> it was actually due to hard resetting live tasks.
> After setting this there was never a mmu timeout anymore (even after
> dropping the guilty flag in the upcoming patch, which by itself was
> actually the easiest reproducer of the ppmmu timeouts).
>
> Patch 4 might be debatable, but hopefully we can agree to go with it
> since it was already discussed in Panfrost here:
> https://patchwork.freedesktop.org/series/120820/
> It is actually not that hard to reproduce both "spurious timeout" and
> "unexpectedly high interrupt latency" by creating an irq which takes a
> longer than usual to execute, and paired with the issues in timeout
> recovery this could cause an unrelated application to hang.
>
> Patch 5 removes the guilty drm_sched from context handling in lima.
> Often users report issues with a user-visible effect of "application
> flipping the last 2 rendered frames":
> https://gitlab.freedesktop.org/mesa/mesa/-/issues/8410
> This was ultimately caused because lima sets the guilty flag to a
> context which causes all further rendering jobs from that context to be
> dropped.
> Without the fixes from patches 1-4 it was not possible to just drop the
> guilty flag as that could trigger the mentioned issues and still hang
> the GPU by attempting a recovery.
> After the fixes above it seems to be reliable and survives some fairly
> torturing tests mentioned below.
> Other similar GPU drivers like panfrost, v3d don't make use of the
> guilty context flag. So I think we can drop it for lima too.

Oh, yes please! Thanks so much for that.

I'm working on removing the guilty handling in amdgpu as well and 
together this will most likely allow us to completely remove the guilty 
handling from the scheduler.

In general driver should use the dma_fence error handling instead. That 
one is well defined and way more reliable.

Feel free to add my acked-by to patch 5.

Thanks,
Christian.

>
> Patch 6 is just some debug message cleanup.
>
>
> Some of the tests which I ran with this patchset:
>
> - Application which renders normal frames, then frames which /barely/
> timeout, then frames which legitimely timeout, then frames which timeout
> enough to trigger the hardware watchdog irq, then goes back to render
> normal frames. This used to hang the application at first but now
> survives the entire process repeated indefinitely.
>
> - High irq latency from an unrelated source while rendering. I put a
> mdelay() in the touchscreen driver irq and could cause all "spurious
> timeout", "unexpectedly high interrupt latency" and actual timeouts.
> Those are now all reported to dmesg for information and I am no longer
> able to cause an unrelated application to hang.
>
> - Game running with lower configured drm_sched timeout (locally set to
> 200ms) with other applications triggering timeouts in the background.
> Different applications trigger resets independently but none of them
> cause the GPU to hang or the game to stop working.
>
>
> Erico Nunes (6):
>    drm/lima: fix devfreq refcount imbalance for job timeouts
>    drm/lima: reset async_reset on pp hard reset
>    drm/lima: set bus_stop bit before hard reset
>    drm/lima: handle spurious timeouts due to high irq latency
>    drm/lima: remove guilty drm_sched context handling
>    drm/lima: improve some pp debug messages
>
>   drivers/gpu/drm/lima/lima_ctx.c   |  2 +-
>   drivers/gpu/drm/lima/lima_ctx.h   |  1 -
>   drivers/gpu/drm/lima/lima_pp.c    | 26 +++++++++++++++++---
>   drivers/gpu/drm/lima/lima_sched.c | 40 ++++++++++++++++++++++++-------
>   drivers/gpu/drm/lima/lima_sched.h |  5 ++--
>   5 files changed, 58 insertions(+), 16 deletions(-)
>