diff mbox series

[4/4] drm/amd/display: Restore deleted patch to resolve reset deadlock.

Message ID 1554998624-25799-4-git-send-email-andrey.grodzovsky@amd.com (mailing list archive)
State New, archived
Headers show
Series [1/4] drm/scheduler: rework job destruction | expand

Commit Message

Andrey Grodzovsky April 11, 2019, 4:03 p.m. UTC
Patch '5edb0c9b Fix deadlock with display during hanged ring recovery'
was accidentaly removed during one of DALs code merges.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

Comments

Kazlauskas, Nicholas April 11, 2019, 4:41 p.m. UTC | #1
On 4/11/19 12:03 PM, Andrey Grodzovsky wrote:
> Patch '5edb0c9b Fix deadlock with display during hanged ring recovery'
> was accidentaly removed during one of DALs code merges.
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>

Probably got lost in a refactor.

Also, didn't Christian have a patch recently to not lock the reservation 
object when waiting for the fence? Looks like that's missing too, or 
maybe it didn't get merged.

Nicholas Kazlauskas

> ---
>   drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 19 +++++++++++++------
>   1 file changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index 0648794..27e0383 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -5138,14 +5138,21 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
>   		 */
>   		abo = gem_to_amdgpu_bo(fb->obj[0]);
>   		r = amdgpu_bo_reserve(abo, true);
> -		if (unlikely(r != 0)) {
> +		if (unlikely(r != 0))
>   			DRM_ERROR("failed to reserve buffer before flip\n");
> -			WARN_ON(1);
> -		}
>   
> -		/* Wait for all fences on this FB */
> -		WARN_ON(reservation_object_wait_timeout_rcu(abo->tbo.resv, true, false,
> -									    MAX_SCHEDULE_TIMEOUT) < 0);
> +		/*
> +		 * Wait for all fences on this FB. Do limited wait to avoid
> +		 * deadlock during GPU reset when this fence will not signal
> +		 * but we hold reservation lock for the BO.
> +		 */
> +		r = reservation_object_wait_timeout_rcu(abo->tbo.resv,
> +							true, false,
> +							msecs_to_jiffies(5000));
> +		if (unlikely(r == 0))
> +			DRM_ERROR("Waiting for fences timed out.");
> +
> +
>   
>   		amdgpu_bo_get_tiling_flags(abo, &tiling_flags);
>   
>
Andrey Grodzovsky April 11, 2019, 4:57 p.m. UTC | #2
On 4/11/19 12:41 PM, Kazlauskas, Nicholas wrote:
> On 4/11/19 12:03 PM, Andrey Grodzovsky wrote:
>> Patch '5edb0c9b Fix deadlock with display during hanged ring recovery'
>> was accidentaly removed during one of DALs code merges.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
>
> Probably got lost in a refactor.
>
> Also, didn't Christian have a patch recently to not lock the reservation
> object when waiting for the fence? Looks like that's missing too, or
> maybe it didn't get merged.
>
> Nicholas Kazlauskas


This patch actually didn't help to resolve the deadlock - take a look at 
https://bugs.freedesktop.org/show_bug.cgi?id=109692 toward the end. I 
believe that the reason is that the fences attached to the FB BO in the 
reservation object are SW fences, they are dettached from HW fences 
during reset  and twill only be attached back/signaled later in the 
reset sequence when we resubmit the jobs in the ring mirror list. So 
force signaling the HW fences that we do before suspending the display 
will have no effect. But I am not 100% sure about this.

Andrey

>
>> ---
>>    drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 19 +++++++++++++------
>>    1 file changed, 13 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> index 0648794..27e0383 100644
>> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> @@ -5138,14 +5138,21 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
>>    		 */
>>    		abo = gem_to_amdgpu_bo(fb->obj[0]);
>>    		r = amdgpu_bo_reserve(abo, true);
>> -		if (unlikely(r != 0)) {
>> +		if (unlikely(r != 0))
>>    			DRM_ERROR("failed to reserve buffer before flip\n");
>> -			WARN_ON(1);
>> -		}
>>    
>> -		/* Wait for all fences on this FB */
>> -		WARN_ON(reservation_object_wait_timeout_rcu(abo->tbo.resv, true, false,
>> -									    MAX_SCHEDULE_TIMEOUT) < 0);
>> +		/*
>> +		 * Wait for all fences on this FB. Do limited wait to avoid
>> +		 * deadlock during GPU reset when this fence will not signal
>> +		 * but we hold reservation lock for the BO.
>> +		 */
>> +		r = reservation_object_wait_timeout_rcu(abo->tbo.resv,
>> +							true, false,
>> +							msecs_to_jiffies(5000));
>> +		if (unlikely(r == 0))
>> +			DRM_ERROR("Waiting for fences timed out.");
>> +
>> +
>>    
>>    		amdgpu_bo_get_tiling_flags(abo, &tiling_flags);
>>    
>>
Christian König April 12, 2019, 7:40 a.m. UTC | #3
Am 11.04.19 um 18:03 schrieb Andrey Grodzovsky:
> Patch '5edb0c9b Fix deadlock with display during hanged ring recovery'
> was accidentaly removed during one of DALs code merges.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>   drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 19 +++++++++++++------
>   1 file changed, 13 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index 0648794..27e0383 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -5138,14 +5138,21 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
>   		 */
>   		abo = gem_to_amdgpu_bo(fb->obj[0]);
>   		r = amdgpu_bo_reserve(abo, true);
> -		if (unlikely(r != 0)) {
> +		if (unlikely(r != 0))
>   			DRM_ERROR("failed to reserve buffer before flip\n");
> -			WARN_ON(1);
> -		}

I also already suggested to completely stop waiting while the BO is 
being reserved, but looks like that got dropped as well.

I would say something is seriously wrong with DALs development process here.

Christian.

>   
> -		/* Wait for all fences on this FB */
> -		WARN_ON(reservation_object_wait_timeout_rcu(abo->tbo.resv, true, false,
> -									    MAX_SCHEDULE_TIMEOUT) < 0);
> +		/*
> +		 * Wait for all fences on this FB. Do limited wait to avoid
> +		 * deadlock during GPU reset when this fence will not signal
> +		 * but we hold reservation lock for the BO.
> +		 */
> +		r = reservation_object_wait_timeout_rcu(abo->tbo.resv,
> +							true, false,
> +							msecs_to_jiffies(5000));
> +		if (unlikely(r == 0))
> +			DRM_ERROR("Waiting for fences timed out.");
> +
> +
>   
>   		amdgpu_bo_get_tiling_flags(abo, &tiling_flags);
>
Andrey Grodzovsky April 12, 2019, 2:28 p.m. UTC | #4
On 4/12/19 3:40 AM, Christian König wrote:
> Am 11.04.19 um 18:03 schrieb Andrey Grodzovsky:
>> Patch '5edb0c9b Fix deadlock with display during hanged ring recovery'
>> was accidentaly removed during one of DALs code merges.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 19 
>> +++++++++++++------
>>   1 file changed, 13 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
>> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> index 0648794..27e0383 100644
>> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> @@ -5138,14 +5138,21 @@ static void amdgpu_dm_commit_planes(struct 
>> drm_atomic_state *state,
>>            */
>>           abo = gem_to_amdgpu_bo(fb->obj[0]);
>>           r = amdgpu_bo_reserve(abo, true);
>> -        if (unlikely(r != 0)) {
>> +        if (unlikely(r != 0))
>>               DRM_ERROR("failed to reserve buffer before flip\n");
>> -            WARN_ON(1);
>> -        }
>
> I also already suggested to completely stop waiting while the BO is 
> being reserved, but looks like that got dropped as well.
>
> I would say something is seriously wrong with DALs development process 
> here.
>
> Christian.


Yea, I think your patch that moved the wait out of the reserved section 
got dropped as well, when I re spin the series with your comments for 
the TDR stuff I will also add a patch restoring your change.

Andrey


>
>>   -        /* Wait for all fences on this FB */
>> - WARN_ON(reservation_object_wait_timeout_rcu(abo->tbo.resv, true, 
>> false,
>> -                                        MAX_SCHEDULE_TIMEOUT) < 0);
>> +        /*
>> +         * Wait for all fences on this FB. Do limited wait to avoid
>> +         * deadlock during GPU reset when this fence will not signal
>> +         * but we hold reservation lock for the BO.
>> +         */
>> +        r = reservation_object_wait_timeout_rcu(abo->tbo.resv,
>> +                            true, false,
>> +                            msecs_to_jiffies(5000));
>> +        if (unlikely(r == 0))
>> +            DRM_ERROR("Waiting for fences timed out.");
>> +
>> +
>>             amdgpu_bo_get_tiling_flags(abo, &tiling_flags);
>
diff mbox series

Patch

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 0648794..27e0383 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -5138,14 +5138,21 @@  static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
 		 */
 		abo = gem_to_amdgpu_bo(fb->obj[0]);
 		r = amdgpu_bo_reserve(abo, true);
-		if (unlikely(r != 0)) {
+		if (unlikely(r != 0))
 			DRM_ERROR("failed to reserve buffer before flip\n");
-			WARN_ON(1);
-		}
 
-		/* Wait for all fences on this FB */
-		WARN_ON(reservation_object_wait_timeout_rcu(abo->tbo.resv, true, false,
-									    MAX_SCHEDULE_TIMEOUT) < 0);
+		/*
+		 * Wait for all fences on this FB. Do limited wait to avoid
+		 * deadlock during GPU reset when this fence will not signal
+		 * but we hold reservation lock for the BO.
+		 */
+		r = reservation_object_wait_timeout_rcu(abo->tbo.resv,
+							true, false,
+							msecs_to_jiffies(5000));
+		if (unlikely(r == 0))
+			DRM_ERROR("Waiting for fences timed out.");
+
+
 
 		amdgpu_bo_get_tiling_flags(abo, &tiling_flags);