Message ID | 1554998624-25799-4-git-send-email-andrey.grodzovsky@amd.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/4] drm/scheduler: rework job destruction | expand |
On 4/11/19 12:03 PM, Andrey Grodzovsky wrote: > Patch '5edb0c9b Fix deadlock with display during hanged ring recovery' > was accidentaly removed during one of DALs code merges. > > Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com> Probably got lost in a refactor. Also, didn't Christian have a patch recently to not lock the reservation object when waiting for the fence? Looks like that's missing too, or maybe it didn't get merged. Nicholas Kazlauskas > --- > drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 19 +++++++++++++------ > 1 file changed, 13 insertions(+), 6 deletions(-) > > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > index 0648794..27e0383 100644 > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > @@ -5138,14 +5138,21 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state, > */ > abo = gem_to_amdgpu_bo(fb->obj[0]); > r = amdgpu_bo_reserve(abo, true); > - if (unlikely(r != 0)) { > + if (unlikely(r != 0)) > DRM_ERROR("failed to reserve buffer before flip\n"); > - WARN_ON(1); > - } > > - /* Wait for all fences on this FB */ > - WARN_ON(reservation_object_wait_timeout_rcu(abo->tbo.resv, true, false, > - MAX_SCHEDULE_TIMEOUT) < 0); > + /* > + * Wait for all fences on this FB. Do limited wait to avoid > + * deadlock during GPU reset when this fence will not signal > + * but we hold reservation lock for the BO. > + */ > + r = reservation_object_wait_timeout_rcu(abo->tbo.resv, > + true, false, > + msecs_to_jiffies(5000)); > + if (unlikely(r == 0)) > + DRM_ERROR("Waiting for fences timed out."); > + > + > > amdgpu_bo_get_tiling_flags(abo, &tiling_flags); > >
On 4/11/19 12:41 PM, Kazlauskas, Nicholas wrote: > On 4/11/19 12:03 PM, Andrey Grodzovsky wrote: >> Patch '5edb0c9b Fix deadlock with display during hanged ring recovery' >> was accidentaly removed during one of DALs code merges. >> >> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> > Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com> > > Probably got lost in a refactor. > > Also, didn't Christian have a patch recently to not lock the reservation > object when waiting for the fence? Looks like that's missing too, or > maybe it didn't get merged. > > Nicholas Kazlauskas This patch actually didn't help to resolve the deadlock - take a look at https://bugs.freedesktop.org/show_bug.cgi?id=109692 toward the end. I believe that the reason is that the fences attached to the FB BO in the reservation object are SW fences, they are dettached from HW fences during reset and twill only be attached back/signaled later in the reset sequence when we resubmit the jobs in the ring mirror list. So force signaling the HW fences that we do before suspending the display will have no effect. But I am not 100% sure about this. Andrey > >> --- >> drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 19 +++++++++++++------ >> 1 file changed, 13 insertions(+), 6 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c >> index 0648794..27e0383 100644 >> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c >> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c >> @@ -5138,14 +5138,21 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state, >> */ >> abo = gem_to_amdgpu_bo(fb->obj[0]); >> r = amdgpu_bo_reserve(abo, true); >> - if (unlikely(r != 0)) { >> + if (unlikely(r != 0)) >> DRM_ERROR("failed to reserve buffer before flip\n"); >> - WARN_ON(1); >> - } >> >> - /* Wait for all fences on this FB */ >> - WARN_ON(reservation_object_wait_timeout_rcu(abo->tbo.resv, true, false, >> - MAX_SCHEDULE_TIMEOUT) < 0); >> + /* >> + * Wait for all fences on this FB. Do limited wait to avoid >> + * deadlock during GPU reset when this fence will not signal >> + * but we hold reservation lock for the BO. >> + */ >> + r = reservation_object_wait_timeout_rcu(abo->tbo.resv, >> + true, false, >> + msecs_to_jiffies(5000)); >> + if (unlikely(r == 0)) >> + DRM_ERROR("Waiting for fences timed out."); >> + >> + >> >> amdgpu_bo_get_tiling_flags(abo, &tiling_flags); >> >>
Am 11.04.19 um 18:03 schrieb Andrey Grodzovsky: > Patch '5edb0c9b Fix deadlock with display during hanged ring recovery' > was accidentaly removed during one of DALs code merges. > > Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> > --- > drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 19 +++++++++++++------ > 1 file changed, 13 insertions(+), 6 deletions(-) > > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > index 0648794..27e0383 100644 > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > @@ -5138,14 +5138,21 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state, > */ > abo = gem_to_amdgpu_bo(fb->obj[0]); > r = amdgpu_bo_reserve(abo, true); > - if (unlikely(r != 0)) { > + if (unlikely(r != 0)) > DRM_ERROR("failed to reserve buffer before flip\n"); > - WARN_ON(1); > - } I also already suggested to completely stop waiting while the BO is being reserved, but looks like that got dropped as well. I would say something is seriously wrong with DALs development process here. Christian. > > - /* Wait for all fences on this FB */ > - WARN_ON(reservation_object_wait_timeout_rcu(abo->tbo.resv, true, false, > - MAX_SCHEDULE_TIMEOUT) < 0); > + /* > + * Wait for all fences on this FB. Do limited wait to avoid > + * deadlock during GPU reset when this fence will not signal > + * but we hold reservation lock for the BO. > + */ > + r = reservation_object_wait_timeout_rcu(abo->tbo.resv, > + true, false, > + msecs_to_jiffies(5000)); > + if (unlikely(r == 0)) > + DRM_ERROR("Waiting for fences timed out."); > + > + > > amdgpu_bo_get_tiling_flags(abo, &tiling_flags); >
On 4/12/19 3:40 AM, Christian König wrote: > Am 11.04.19 um 18:03 schrieb Andrey Grodzovsky: >> Patch '5edb0c9b Fix deadlock with display during hanged ring recovery' >> was accidentaly removed during one of DALs code merges. >> >> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> >> --- >> drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 19 >> +++++++++++++------ >> 1 file changed, 13 insertions(+), 6 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c >> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c >> index 0648794..27e0383 100644 >> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c >> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c >> @@ -5138,14 +5138,21 @@ static void amdgpu_dm_commit_planes(struct >> drm_atomic_state *state, >> */ >> abo = gem_to_amdgpu_bo(fb->obj[0]); >> r = amdgpu_bo_reserve(abo, true); >> - if (unlikely(r != 0)) { >> + if (unlikely(r != 0)) >> DRM_ERROR("failed to reserve buffer before flip\n"); >> - WARN_ON(1); >> - } > > I also already suggested to completely stop waiting while the BO is > being reserved, but looks like that got dropped as well. > > I would say something is seriously wrong with DALs development process > here. > > Christian. Yea, I think your patch that moved the wait out of the reserved section got dropped as well, when I re spin the series with your comments for the TDR stuff I will also add a patch restoring your change. Andrey > >> - /* Wait for all fences on this FB */ >> - WARN_ON(reservation_object_wait_timeout_rcu(abo->tbo.resv, true, >> false, >> - MAX_SCHEDULE_TIMEOUT) < 0); >> + /* >> + * Wait for all fences on this FB. Do limited wait to avoid >> + * deadlock during GPU reset when this fence will not signal >> + * but we hold reservation lock for the BO. >> + */ >> + r = reservation_object_wait_timeout_rcu(abo->tbo.resv, >> + true, false, >> + msecs_to_jiffies(5000)); >> + if (unlikely(r == 0)) >> + DRM_ERROR("Waiting for fences timed out."); >> + >> + >> amdgpu_bo_get_tiling_flags(abo, &tiling_flags); >
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c index 0648794..27e0383 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c @@ -5138,14 +5138,21 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state, */ abo = gem_to_amdgpu_bo(fb->obj[0]); r = amdgpu_bo_reserve(abo, true); - if (unlikely(r != 0)) { + if (unlikely(r != 0)) DRM_ERROR("failed to reserve buffer before flip\n"); - WARN_ON(1); - } - /* Wait for all fences on this FB */ - WARN_ON(reservation_object_wait_timeout_rcu(abo->tbo.resv, true, false, - MAX_SCHEDULE_TIMEOUT) < 0); + /* + * Wait for all fences on this FB. Do limited wait to avoid + * deadlock during GPU reset when this fence will not signal + * but we hold reservation lock for the BO. + */ + r = reservation_object_wait_timeout_rcu(abo->tbo.resv, + true, false, + msecs_to_jiffies(5000)); + if (unlikely(r == 0)) + DRM_ERROR("Waiting for fences timed out."); + + amdgpu_bo_get_tiling_flags(abo, &tiling_flags);
Patch '5edb0c9b Fix deadlock with display during hanged ring recovery' was accidentaly removed during one of DALs code merges. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> --- drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 19 +++++++++++++------ 1 file changed, 13 insertions(+), 6 deletions(-)