[PULL] drm-xe-fixes

Message ID trlkoiewtc4x2cyhsxmj3atayyq4zwto4iryea5pvya2ymc3yp@fdx5nhwmiyem (mailing list archive)
State New, archived

Pull-request

https://gitlab.freedesktop.org/drm/xe/kernel.git tags/drm-xe-fixes-2024-10-24-1

Message

Lucas De Marchi Oct. 24, 2024, 11:15 p.m. UTC
Hi Dave and Simona,

drm-xe-fixes for 6.12-rc5, with commits mostly improving error handling.
The g2h flush helps with some LNL issues we are seeing, but we still have 2
other similar ones. However, they didn't make it to drm-xe-next in time to
be properly tested, so I'm leaving them for later.

There are 2 conflicts when merging drm-next on top that I fixed
in drm-tip: the first is trivial, just taking drm-next. The second is
also trivial, preferring xa_erase() over xa_erase_irq(), but the diff
context is scarier, so I'm pasting it here (with a | prefix so bots
don't try anything funny):

| remerge CONFLICT (content): Merge conflict in drivers/gpu/drm/xe/xe_guc_ct.c
| index c6caf8f92421..c260d8840990 100644
| --- a/drivers/gpu/drm/xe/xe_guc_ct.c
| +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
| @@ -1019,7 +1019,6 @@ static int guc_ct_send_recv(struct xe_guc_ct *ct, const u32 *action, u32 len,
|          ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
|   
|          /*
| -<<<<<<< 3cf59b00bd34 (Merge remote-tracking branch 'drm-xe/drm-xe-fixes' into drm-tip)
|           * Occasionally it is seen that the G2H worker starts running after a delay of more than
|           * a second even after being queued and activated by the Linux workqueue subsystem. This
|           * leads to G2H timeout error. The root cause of issue lies with scheduling latency of
| @@ -1044,22 +1043,10 @@ static int guc_ct_send_recv(struct xe_guc_ct *ct, const u32 *action, u32 len,
|           * correct ordering, and we lack the needed barriers.
|           */
|          mutex_lock(&ct->lock);
| -       if (!ret) {
| -               xe_gt_err(gt, "Timed out wait for G2H, fence %u, action %04x, done %s",
| -                         g2h_fence.seqno, action[0], str_yes_no(g2h_fence.done));
| -               xa_erase_irq(&ct->fence_lookup, g2h_fence.seqno);
| -=======
| -        * Ensure we serialize with completion side to prevent UAF with fence going out of scope on
| -        * the stack, since we have no clue if it will fire after the timeout before we can erase
| -        * from the xa. Also we have some dependent loads and stores below for which we need the
| -        * correct ordering, and we lack the needed barriers.
| -        */
| -       mutex_lock(&ct->lock);
|          if (!ret) {
|                  xe_gt_err(gt, "Timed out wait for G2H, fence %u, action %04x, done %s",
|                            g2h_fence.seqno, action[0], str_yes_no(g2h_fence.done));
|                  xa_erase(&ct->fence_lookup, g2h_fence.seqno);
| ->>>>>>> c9ff14d0339a (Merge tag 'drm-intel-gt-next-2024-10-23' of https://gitlab.freedesktop.org/drm/i915/kernel into drm-next)
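
For readers less familiar with the pattern in the resolved hunk, here is a rough userspace sketch of why the timeout path takes the lock before erasing the fence. It is not the xe code: `struct ct_sketch`, `struct fence_sketch` and the `ct_sketch_*` helpers are hypothetical names, a pthread mutex stands in for ct->lock, and a small array stands in for the ct->fence_lookup xarray. Returning -ETIME on timeout and clearing the entry on completion are assumptions made for the sketch, since the hunk above does not show those parts.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_FENCES 16

/* Hypothetical stand-in for the on-stack g2h_fence. */
struct fence_sketch {
	unsigned int seqno;
	bool done;
};

/* Hypothetical stand-in for the CT state: a lock plus a seqno -> fence lookup. */
struct ct_sketch {
	pthread_mutex_t lock;                           /* plays the role of ct->lock */
	struct fence_sketch *lookup[SKETCH_MAX_FENCES]; /* plays the role of ct->fence_lookup */
};

static void ct_sketch_init(struct ct_sketch *ct)
{
	pthread_mutex_init(&ct->lock, NULL);
	for (size_t i = 0; i < SKETCH_MAX_FENCES; i++)
		ct->lookup[i] = NULL;
}

/* Sender: publish the fence so the completion side can find it by seqno. */
static void ct_sketch_register(struct ct_sketch *ct, struct fence_sketch *f)
{
	pthread_mutex_lock(&ct->lock);
	assert(ct->lookup[f->seqno % SKETCH_MAX_FENCES] == NULL);
	ct->lookup[f->seqno % SKETCH_MAX_FENCES] = f;
	pthread_mutex_unlock(&ct->lock);
}

/*
 * Completion side: look up, signal and drop the entry, all under the lock.
 * Returns false if the sender already erased the fence after a timeout.
 */
static bool ct_sketch_complete(struct ct_sketch *ct, unsigned int seqno)
{
	struct fence_sketch *f;
	bool found = false;

	pthread_mutex_lock(&ct->lock);
	f = ct->lookup[seqno % SKETCH_MAX_FENCES];
	if (f && f->seqno == seqno) {
		f->done = true;
		ct->lookup[seqno % SKETCH_MAX_FENCES] = NULL;
		found = true;
	}
	pthread_mutex_unlock(&ct->lock);
	return found;
}

/*
 * Sender, after the timed wait returns. wait_ret == 0 means the wait timed
 * out. Taking the lock first guarantees no completion is still writing to
 * the fence; erasing under the lock guarantees none can find it later, so
 * the fence may safely go out of scope on the caller's stack.
 */
static int ct_sketch_finish_wait(struct ct_sketch *ct, struct fence_sketch *f,
				 long wait_ret)
{
	int err = 0;

	pthread_mutex_lock(&ct->lock);
	if (!wait_ret) {
		ct->lookup[f->seqno % SKETCH_MAX_FENCES] = NULL;
		err = -ETIME; /* assumed error code for the sketch */
	}
	pthread_mutex_unlock(&ct->lock);
	return err;
}
```

In this sketch a late completion for a timed-out seqno simply finds nothing in the lookup, instead of writing to a stack slot that no longer holds the fence.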

thanks
Lucas De Marchi

drm-xe-fixes-2024-10-24-1:
Driver Changes:
- Increase invalidation timeout to avoid errors in some hosts (Shuicheng)
- Flush worker on timeout (Badal)
- Better handling for force wake failure (Shuicheng)
- Improve argument check on user fence creation (Nirmoy)
- Don't restart parallel queues multiple times on GT reset (Nirmoy)

The following changes since commit 42f7652d3eb527d03665b09edac47f85fb600924:

   Linux 6.12-rc4 (2024-10-20 15:19:38 -0700)

are available in the Git repository at:

   https://gitlab.freedesktop.org/drm/xe/kernel.git tags/drm-xe-fixes-2024-10-24-1

for you to fetch changes up to cdc21021f0351226a4845715564afd5dc50ed44b:

   drm/xe: Don't restart parallel queues multiple times on GT reset (2024-10-24 12:42:52 -0500)

----------------------------------------------------------------
Badal Nilawar (1):
       drm/xe/guc/ct: Flush g2h worker in case of g2h response timeout

Nirmoy Das (2):
       drm/xe/ufence: Prefetch ufence addr to catch bogus address
       drm/xe: Don't restart parallel queues multiple times on GT reset

Shuicheng Lin (2):
       drm/xe: Enlarge the invalidation timeout from 150 to 500
       drm/xe: Handle unreliable MMIO reads during forcewake

  drivers/gpu/drm/xe/xe_device.c     |  2 +-
  drivers/gpu/drm/xe/xe_force_wake.c | 12 +++++++++---
  drivers/gpu/drm/xe/xe_guc_ct.c     | 18 ++++++++++++++++++
  drivers/gpu/drm/xe/xe_guc_submit.c | 14 ++++++++++++--
  drivers/gpu/drm/xe/xe_sync.c       |  3 ++-
  5 files changed, 42 insertions(+), 7 deletions(-)