[RFC,3/6] drm/amdgpu: Fix crash on modprobe

Message ID 20211217222745.881637-4-andrey.grodzovsky@amd.com (mailing list archive)
State New, archived
Series Define and use reset domain for GPU recovery in amdgpu

Commit Message

Andrey Grodzovsky Dec. 17, 2021, 10:27 p.m. UTC
Restrict job resubmission to the suspend case
only, since the schedulers are not initialised yet on
probe.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Christian König Dec. 20, 2021, 7:17 a.m. UTC | #1
On 17.12.21 at 23:27, Andrey Grodzovsky wrote:
> Restrict jobs resubmission to suspend case
> only since schedulers not initialised yet on
> probe.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 5527c68c51de..8ebd954e06c6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -582,7 +582,7 @@ void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev)
>   		if (!ring || !ring->fence_drv.initialized)
>   			continue;
>   
> -		if (!ring->no_scheduler) {
> +		if (adev->in_suspend && !ring->no_scheduler) {

Uff, why is that suddenly necessary? Because of the changed order?

Christian.

>   			drm_sched_resubmit_jobs(&ring->sched);
>   			drm_sched_start(&ring->sched, true);
>   		}
Andrey Grodzovsky Dec. 20, 2021, 7:22 p.m. UTC | #2
On 2021-12-20 2:17 a.m., Christian König wrote:
> On 17.12.21 at 23:27, Andrey Grodzovsky wrote:
>> Restrict jobs resubmission to suspend case
>> only since schedulers not initialised yet on
>> probe.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> index 5527c68c51de..8ebd954e06c6 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> @@ -582,7 +582,7 @@ void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev)
>>           if (!ring || !ring->fence_drv.initialized)
>>               continue;
>>
>> -        if (!ring->no_scheduler) {
>> +        if (adev->in_suspend && !ring->no_scheduler) {
>
> Uff, why is that suddenly necessary? Because of the changed order?
>
> Christian.


Yes.

Andrey


>
>>               drm_sched_resubmit_jobs(&ring->sched);
>>               drm_sched_start(&ring->sched, true);
>>           }
>
Christian König Dec. 21, 2021, 7:02 a.m. UTC | #3
On 20.12.21 at 20:22, Andrey Grodzovsky wrote:
>
> On 2021-12-20 2:17 a.m., Christian König wrote:
>> On 17.12.21 at 23:27, Andrey Grodzovsky wrote:
>>> Restrict jobs resubmission to suspend case
>>> only since schedulers not initialised yet on
>>> probe.
>>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> index 5527c68c51de..8ebd954e06c6 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> @@ -582,7 +582,7 @@ void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev)
>>>           if (!ring || !ring->fence_drv.initialized)
>>>               continue;
>>>
>>> -        if (!ring->no_scheduler) {
>>> +        if (adev->in_suspend && !ring->no_scheduler) {
>>
>> Uff, why is that suddenly necessary? Because of the changed order?
>>
>> Christian.
>
>
> Yes.

Mhm, that's quite bad design then.

How about we keep the order as is and allow specifying the reset work 
queue with drm_sched_start() ?

Christian.

>
> Andrey
>
>
>>
>>>               drm_sched_resubmit_jobs(&ring->sched);
>>>               drm_sched_start(&ring->sched, true);
>>>           }
>>
Andrey Grodzovsky Dec. 21, 2021, 4:03 p.m. UTC | #4
On 2021-12-21 2:02 a.m., Christian König wrote:
>
>
> On 20.12.21 at 20:22, Andrey Grodzovsky wrote:
>>
>> On 2021-12-20 2:17 a.m., Christian König wrote:
>>> On 17.12.21 at 23:27, Andrey Grodzovsky wrote:
>>>> Restrict jobs resubmission to suspend case
>>>> only since schedulers not initialised yet on
>>>> probe.
>>>>
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> index 5527c68c51de..8ebd954e06c6 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> @@ -582,7 +582,7 @@ void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev)
>>>>           if (!ring || !ring->fence_drv.initialized)
>>>>               continue;
>>>>
>>>> -        if (!ring->no_scheduler) {
>>>> +        if (adev->in_suspend && !ring->no_scheduler) {
>>>
>>> Uff, why is that suddenly necessary? Because of the changed order?
>>>
>>> Christian.
>>
>>
>> Yes.
>
> Mhm, that's quite bad design then.


If you look at the original patch for this
(https://www.spinics.net/lists/amd-gfx/msg67560.html), you will see that
restarting the scheduler here is only relevant for the suspend/resume case,
because there was a race to fix. There is no point in running this code on
driver init because nothing has been submitted to the scheduler yet, so it
seems OK to me to add a condition that this code runs only in the in_suspend
case.
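
To make the effect of the extra condition concrete, here is a rough user-space model of the loop. It is a sketch only, not kernel code: the model_ring and model_device structs and model_fence_driver_hw_init() are invented for illustration, the resubmit/start calls are replaced by a counter, and it assumes in_suspend is still set while the resume path runs this function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_RINGS 4

/* Hypothetical stand-ins for the driver structures; not the kernel types. */
struct model_ring {
    bool fence_drv_initialized; /* mirrors ring->fence_drv.initialized */
    bool no_scheduler;          /* mirrors ring->no_scheduler */
    int  restarts;              /* counts simulated resubmit + start calls */
};

struct model_device {
    bool in_suspend;            /* mirrors adev->in_suspend */
    struct model_ring *rings[MAX_RINGS];
};

/* Returns the number of rings whose scheduler was (notionally) restarted. */
static int model_fence_driver_hw_init(struct model_device *dev)
{
    int restarted = 0;

    assert(dev != NULL);

    for (size_t i = 0; i < MAX_RINGS; i++) {
        struct model_ring *ring = dev->rings[i];

        if (!ring || !ring->fence_drv_initialized)
            continue;

        /* The guard from the patch: only restart on the resume path. */
        if (dev->in_suspend && !ring->no_scheduler) {
            ring->restarts++;   /* stands in for resubmit_jobs + start */
            restarted++;
        }
    }
    return restarted;
}
```

With in_suspend cleared (the probe case) the function restarts nothing; with it set, only rings that have a scheduler are restarted.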


>
> How about we keep the order as is and allow specifying the reset work 
> queue with drm_sched_start() ?


As I mentioned above, the fact that we even have drm_sched_start there is
just part of a solution to resolve a race during suspend/resume. It is not
meant for device initialization, and indeed other client drivers of the GPU
scheduler never call drm_sched_start on device init. We must guarantee that
the reset work queue is already initialized before any job is submitted to
the scheduler, and because of that, IMHO, the right place for this is
drm_sched_init.
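
The ordering argument can be sketched in user space roughly as follows. This is an illustration under assumptions, not the DRM scheduler API: demo_wq, demo_sched and the demo_sched_* functions are made-up names, and the real drm_sched_init() signature is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; not the DRM scheduler types. */
struct demo_wq {
    int queued;                 /* number of reset work items queued */
};

struct demo_sched {
    struct demo_wq *reset_wq;   /* fixed once at init time */
    bool ready;                 /* set by demo_sched_init() */
    int pending;                /* jobs accepted so far */
};

/* Bind the reset work queue at init, before any job can be submitted. */
static int demo_sched_init(struct demo_sched *s, struct demo_wq *reset_wq)
{
    if (!s || !reset_wq)
        return -1;
    s->reset_wq = reset_wq;
    s->pending = 0;
    s->ready = true;
    return 0;
}

/* Submission is refused until init has run. */
static int demo_sched_submit(struct demo_sched *s)
{
    if (!s || !s->ready)
        return -1;
    s->pending++;
    return 0;
}

/* A job timeout queues reset work; the queue is guaranteed to exist here. */
static void demo_sched_timeout(struct demo_sched *s)
{
    assert(s && s->ready && s->reset_wq);
    s->reset_wq->queued++;
}
```

Because the queue is bound in the init function, no submission or timeout path can ever observe a scheduler without a reset work queue, which is the guarantee argued for above.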

Andrey


>
> Christian.
>
>>
>> Andrey
>>
>>
>>>
>>>>               drm_sched_resubmit_jobs(&ring->sched);
>>>>               drm_sched_start(&ring->sched, true);
>>>>           }
>>>
>
Christian König Dec. 22, 2021, 7:50 a.m. UTC | #5
On 21.12.21 at 17:03, Andrey Grodzovsky wrote:
>
> On 2021-12-21 2:02 a.m., Christian König wrote:
>>
>>
>> On 20.12.21 at 20:22, Andrey Grodzovsky wrote:
>>>
>>> On 2021-12-20 2:17 a.m., Christian König wrote:
>>>> On 17.12.21 at 23:27, Andrey Grodzovsky wrote:
>>>>> Restrict jobs resubmission to suspend case
>>>>> only since schedulers not initialised yet on
>>>>> probe.
>>>>>
>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>> ---
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
>>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> index 5527c68c51de..8ebd954e06c6 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> @@ -582,7 +582,7 @@ void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev)
>>>>>           if (!ring || !ring->fence_drv.initialized)
>>>>>               continue;
>>>>>
>>>>> -        if (!ring->no_scheduler) {
>>>>> +        if (adev->in_suspend && !ring->no_scheduler) {
>>>>
>>>> Uff, why is that suddenly necessary? Because of the changed order?
>>>>
>>>> Christian.
>>>
>>>
>>> Yes.
>>
>> Mhm, that's quite bad design then.
>
>
> If you look at the original patch for this
> (https://www.spinics.net/lists/amd-gfx/msg67560.html), you will see that
> restarting the scheduler here is only relevant for the suspend/resume case,
> because there was a race to fix. There is no point in running this code on
> driver init because nothing has been submitted to the scheduler yet, so it
> seems OK to me to add a condition that this code runs only in the in_suspend
> case.

Yeah, but having extra logic like this means that we have some design 
issue in the IP block handling.

We need to clean that and some other odd approaches up at some point, 
but probably not now.

Christian.

>
>
>>
>> How about we keep the order as is and allow specifying the reset work 
>> queue with drm_sched_start() ?
>
>
> As I mentioned above, the fact that we even have drm_sched_start there is
> just part of a solution to resolve a race during suspend/resume. It is not
> meant for device initialization, and indeed other client drivers of the GPU
> scheduler never call drm_sched_start on device init. We must guarantee that
> the reset work queue is already initialized before any job is submitted to
> the scheduler, and because of that, IMHO, the right place for this is
> drm_sched_init.
>
> Andrey
>
>
>>
>> Christian.
>>
>>>
>>> Andrey
>>>
>>>
>>>>
>>>>>               drm_sched_resubmit_jobs(&ring->sched);
>>>>>               drm_sched_start(&ring->sched, true);
>>>>>           }
>>>>
>>
Patch

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 5527c68c51de..8ebd954e06c6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -582,7 +582,7 @@  void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev)
 		if (!ring || !ring->fence_drv.initialized)
 			continue;
 
-		if (!ring->no_scheduler) {
+		if (adev->in_suspend && !ring->no_scheduler) {
 			drm_sched_resubmit_jobs(&ring->sched);
 			drm_sched_start(&ring->sched, true);
 		}