Message ID | 20220811072540.964309-1-lizhenneng@kylinos.cn (mailing list archive)
---|---
State | New, archived
Series | drm/radeon: add a force flush to delay work when radeon
On 11.08.22 09:25, Zhenneng Li wrote:
> Although the radeon driver fences and waits for the GPU to finish processing
> the current batch of rings, there is still a corner case in which the radeon
> lockup work queue may not be fully flushed while radeon_suspend_kms() has
> already called pci_set_power_state() to put the device into the D3hot state.

If I'm not completely mistaken, the reset worker uses the suspend/resume
functionality as well to get the hardware into a working state again.

So if I'm not completely mistaken, this here would lead to a deadlock;
please double-check that.

Regards,
Christian.

> Per PCI spec rev 4.0, section 5.3.1.4.1 (D3hot State):
>> Configuration and Message requests are the only TLPs accepted by a Function in
>> the D3hot state. All other received Requests must be handled as Unsupported
>> Requests, and all received Completions may optionally be handled as Unexpected
>> Completions.
> This issue shows up in the following log:
> Unable to handle kernel paging request at virtual address 00008800e0008010
> CPU 0 kworker/0:3(131): Oops 0
> pc = [<ffffffff811bea5c>]  ra = [<ffffffff81240844>]  ps = 0000  Tainted: G W
> pc is at si_gpu_check_soft_reset+0x3c/0x240
> ra is at si_dma_is_lockup+0x34/0xd0
> v0 = 0000000000000000  t0 = fff08800e0008010  t1 = 0000000000010000
> t2 = 0000000000008010  t3 = fff00007e3c00000  t4 = fff00007e3c00258
> t5 = 000000000000ffff  t6 = 0000000000000001  t7 = fff00007ef078000
> s0 = fff00007e3c016e8  s1 = fff00007e3c00000  s2 = fff00007e3c00018
> s3 = fff00007e3c00000  s4 = fff00007fff59d80  s5 = 0000000000000000
> s6 = fff00007ef07bd98
> a0 = fff00007e3c00000  a1 = fff00007e3c016e8  a2 = 0000000000000008
> a3 = 0000000000000001  a4 = 8f5c28f5c28f5c29  a5 = ffffffff810f4338
> t8 = 0000000000000275  t9 = ffffffff809b66f8  t10 = ff6769c5d964b800
> t11= 000000000000b886  pv = ffffffff811bea20  at = 0000000000000000
> gp = ffffffff81d89690  sp = 00000000aa814126
> Disabling lock debugging due to kernel taint
> Trace:
> [<ffffffff81240844>] si_dma_is_lockup+0x34/0xd0
> [<ffffffff81119610>] radeon_fence_check_lockup+0xd0/0x290
> [<ffffffff80977010>] process_one_work+0x280/0x550
> [<ffffffff80977350>] worker_thread+0x70/0x7c0
> [<ffffffff80977410>] worker_thread+0x130/0x7c0
> [<ffffffff80982040>] kthread+0x200/0x210
> [<ffffffff809772e0>] worker_thread+0x0/0x7c0
> [<ffffffff80981f8c>] kthread+0x14c/0x210
> [<ffffffff80911658>] ret_from_kernel_thread+0x18/0x20
> [<ffffffff80981e40>] kthread+0x0/0x210
> Code: ad3e0008 43f0074a ad7e0018 ad9e0020 8c3001e8 40230101
> <88210000> 4821ed21
>
> So force a flush of the lockup work queue to fix this problem.
>
> Signed-off-by: Zhenneng Li <lizhenneng@kylinos.cn>
> ---
>  drivers/gpu/drm/radeon/radeon_device.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
> index 15692cb241fc..e608ca26780a 100644
> --- a/drivers/gpu/drm/radeon/radeon_device.c
> +++ b/drivers/gpu/drm/radeon/radeon_device.c
> @@ -1604,6 +1604,9 @@ int radeon_suspend_kms(struct drm_device *dev, bool suspend,
>  		if (r) {
>  			/* delay GPU reset to resume */
>  			radeon_fence_driver_force_completion(rdev, i);
> +		} else {
> +			/* finish executing delayed work */
> +			flush_delayed_work(&rdev->fence_drv[i].lockup_work);
>  		}
>  	}
On 2022/8/12 18:55, Christian König wrote:
> On 11.08.22 09:25, Zhenneng Li wrote:
>> Although the radeon driver fences and waits for the GPU to finish
>> processing the current batch of rings, there is still a corner case in
>> which the radeon lockup work queue may not be fully flushed while
>> radeon_suspend_kms() has already called pci_set_power_state() to put the
>> device into the D3hot state.
>
> If I'm not completely mistaken, the reset worker uses the suspend/resume
> functionality as well to get the hardware into a working state again.
>
> So if I'm not completely mistaken, this here would lead to a deadlock;
> please double-check that.

We have tested this many times; there is no deadlock.

In which situation would this lead to a deadlock?

> Regards,
> Christian.
On 15.08.22 09:34, 李真能 wrote:
> On 2022/8/12 18:55, Christian König wrote:
>> If I'm not completely mistaken, the reset worker uses the suspend/resume
>> functionality as well to get the hardware into a working state again.
>>
>> So if I'm not completely mistaken, this here would lead to a deadlock;
>> please double-check that.
>
> We have tested this many times; there is no deadlock.

Testing doesn't tell you anything; you need to audit the call paths.

> In which situation would this lead to a deadlock?

GPU resets.

Regards,
Christian.
On 2022/8/15 21:12, Christian König wrote:
> On 15.08.22 09:34, 李真能 wrote:
>> We have tested this many times; there is no deadlock.
>
> Testing doesn't tell you anything; you need to audit the call paths.
>
>> In which situation would this lead to a deadlock?
>
> GPU resets.

Although flush_delayed_work(&rdev->fence_drv[i].lockup_work) waits for the
last queued lockup_work to finish executing, this kernel function does not
take any lock itself, and lockup_work runs in another kernel thread, so I
think flush_delayed_work cannot lead to a deadlock.

Therefore, even if radeon_gpu_reset is called in another thread while
radeon_suspend_kms is blocked in flush_delayed_work, this cannot lead to a
deadlock.
On 17.08.22 09:31, 李真能 wrote:
> Although flush_delayed_work(&rdev->fence_drv[i].lockup_work) waits for the
> last queued lockup_work to finish executing, this kernel function does not
> take any lock itself, and lockup_work runs in another kernel thread, so I
> think flush_delayed_work cannot lead to a deadlock.
>
> Therefore, even if radeon_gpu_reset is called in another thread while
> radeon_suspend_kms is blocked in flush_delayed_work, this cannot lead to a
> deadlock.

Ok, sounds like you didn't get what I wanted to say.

The key problem is that radeon_gpu_reset() calls radeon_suspend(), which in
turn calls rdev->asic->suspend().

And this function in turn could end up in radeon_suspend_kms() again, but
I'm not 100% sure about that.

Just double-check the order of the functions called here (e.g. whether
radeon_suspend_kms() calls radeon_suspend() or the other way around).

Apart from that, your patch looks correct to me as well.

Regards,
Christian.
On 2022/8/17 19:40, Christian König wrote:
> Ok, sounds like you didn't get what I wanted to say.
>
> The key problem is that radeon_gpu_reset() calls radeon_suspend(), which
> in turn calls rdev->asic->suspend().
>
> And this function in turn could end up in radeon_suspend_kms() again, but
> I'm not 100% sure about that.
>
> Just double-check the order of the functions called here (e.g. whether
> radeon_suspend_kms() calls radeon_suspend() or the other way around).
>
> Apart from that, your patch looks correct to me as well.

radeon_gpu_reset will call radeon_suspend, but radeon_suspend will not call
radeon_suspend_kms; for more detail on the call flow, see the attached file
radeon-suspend-reset.png.

Sorry, I may have mistaken your meaning.

> Regards,
> Christian.
On 19.08.22 11:34, 李真能 wrote:
> On 2022/8/17 19:40, Christian König wrote:
>> Just double-check the order of the functions called here (e.g. whether
>> radeon_suspend_kms() calls radeon_suspend() or the other way around).
>>
>> Apart from that, your patch looks correct to me as well.
>
> radeon_gpu_reset will call radeon_suspend, but radeon_suspend will not
> call radeon_suspend_kms; for more detail on the call flow, see the
> attached file radeon-suspend-reset.png.
>
> Sorry, I may have mistaken your meaning.

Ok, in this case my memory of the call flow wasn't correct any more.

Feel free to add an
Acked-by: Christian König <christian.koenig@amd.com>
to the patch.

Alex should then pick it up for upstreaming.

Thanks,
Christian.
Applied. Thanks!

Alex

On Fri, Aug 19, 2022 at 6:07 AM Christian König <christian.koenig@amd.com> wrote:
>
> Ok, in this case my memory of the call flow wasn't correct any more.
>
> Feel free to add an
> Acked-by: Christian König <christian.koenig@amd.com>
> to the patch.
>
> Alex should then pick it up for upstreaming.
>
> Thanks,
> Christian.
Apart from that your patch looks correct to me as well. > > radeon_gpu_reset will call radeon_suspend, but radeon_suspend will not call radeon_suspend_kms, the more detail of call flow, we can see the attachement file: radeon-suspend-reset.png. > > Sorry I may have mistaken your meaning. > > > Ok in this case my memory of the call flow wasn't correct any more. > > Feel free to add an Acked-by: Christian König <christian.koenig@amd.com> to the patch. > > Alex should then pick it up for upstreaming. > > Thanks, > Christian. > > > Regards, > Christian. > > > > Regards, > Christian. > > > > Regards, > Christian. > > Per PCI spec rev 4.0 on 5.3.1.4.1 D3hot State. > > Configuration and Message requests are the only TLPs accepted by a Function in > the D3hot state. All other received Requests must be handled as Unsupported Requests, > and all received Completions may optionally be handled as Unexpected Completions. > > This issue will happen in following logs: > Unable to handle kernel paging request at virtual address 00008800e0008010 > CPU 0 kworker/0:3(131): Oops 0 > pc = [<ffffffff811bea5c>] ra = [<ffffffff81240844>] ps = 0000 Tainted: G W > pc is at si_gpu_check_soft_reset+0x3c/0x240 > ra is at si_dma_is_lockup+0x34/0xd0 > v0 = 0000000000000000 t0 = fff08800e0008010 t1 = 0000000000010000 > t2 = 0000000000008010 t3 = fff00007e3c00000 t4 = fff00007e3c00258 > t5 = 000000000000ffff t6 = 0000000000000001 t7 = fff00007ef078000 > s0 = fff00007e3c016e8 s1 = fff00007e3c00000 s2 = fff00007e3c00018 > s3 = fff00007e3c00000 s4 = fff00007fff59d80 s5 = 0000000000000000 > s6 = fff00007ef07bd98 > a0 = fff00007e3c00000 a1 = fff00007e3c016e8 a2 = 0000000000000008 > a3 = 0000000000000001 a4 = 8f5c28f5c28f5c29 a5 = ffffffff810f4338 > t8 = 0000000000000275 t9 = ffffffff809b66f8 t10 = ff6769c5d964b800 > t11= 000000000000b886 pv = ffffffff811bea20 at = 0000000000000000 > gp = ffffffff81d89690 sp = 00000000aa814126 > Disabling lock debugging due to kernel taint > Trace: > 
[<ffffffff81240844>] si_dma_is_lockup+0x34/0xd0 > [<ffffffff81119610>] radeon_fence_check_lockup+0xd0/0x290 > [<ffffffff80977010>] process_one_work+0x280/0x550 > [<ffffffff80977350>] worker_thread+0x70/0x7c0 > [<ffffffff80977410>] worker_thread+0x130/0x7c0 > [<ffffffff80982040>] kthread+0x200/0x210 > [<ffffffff809772e0>] worker_thread+0x0/0x7c0 > [<ffffffff80981f8c>] kthread+0x14c/0x210 > [<ffffffff80911658>] ret_from_kernel_thread+0x18/0x20 > [<ffffffff80981e40>] kthread+0x0/0x210 > Code: ad3e0008 43f0074a ad7e0018 ad9e0020 8c3001e8 40230101 > <88210000> 4821ed21 > So force lockup work queue flush to fix this problem. > > Signed-off-by: Zhenneng Li <lizhenneng@kylinos.cn> > --- > drivers/gpu/drm/radeon/radeon_device.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c > index 15692cb241fc..e608ca26780a 100644 > --- a/drivers/gpu/drm/radeon/radeon_device.c > +++ b/drivers/gpu/drm/radeon/radeon_device.c > @@ -1604,6 +1604,9 @@ int radeon_suspend_kms(struct drm_device *dev, bool suspend, > if (r) { > /* delay GPU reset to resume */ > radeon_fence_driver_force_completion(rdev, i); > + } else { > + /* finish executing delayed work */ > + flush_delayed_work(&rdev->fence_drv[i].lockup_work); > } > } > > > > >
diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
index 15692cb241fc..e608ca26780a 100644
--- a/drivers/gpu/drm/radeon/radeon_device.c
+++ b/drivers/gpu/drm/radeon/radeon_device.c
@@ -1604,6 +1604,9 @@ int radeon_suspend_kms(struct drm_device *dev, bool suspend,
 		if (r) {
 			/* delay GPU reset to resume */
 			radeon_fence_driver_force_completion(rdev, i);
+		} else {
+			/* finish executing delayed work */
+			flush_delayed_work(&rdev->fence_drv[i].lockup_work);
 		}
 	}
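The race the patch closes — the delayed lockup worker touching device registers after pci_set_power_state() has put the card in D3hot — can be sketched with a toy userspace analogy. This is not driver code: FakeDevice, check_lockup, and suspend are invented stand-ins, and Python's Timer.join() plays the role of flush_delayed_work() (wait for the pending work item to finish before cutting power).

```python
import threading

class FakeDevice:
    """Toy stand-in for the GPU: state is only safe to touch while powered."""
    def __init__(self):
        self.powered = True
        self.bad_access = False

    def check_lockup(self):
        # Stand-in for the lockup worker reading hardware registers.
        if not self.powered:
            self.bad_access = True  # analogous to the MMIO oops in D3hot

def suspend(dev, pending_work, flush_first):
    if flush_first:
        pending_work.join()  # analogous to flush_delayed_work()
    dev.powered = False      # analogous to pci_set_power_state(D3hot)

dev = FakeDevice()
work = threading.Timer(0.05, dev.check_lockup)  # a "delayed work" item
work.start()
suspend(dev, work, flush_first=True)
work.join()  # ensure the worker has completed before inspecting state
print(dev.bad_access)
```

With flush_first=True the pending check is guaranteed to run while the device is still powered, so bad_access stays False; with flush_first=False the worker can fire after power-off, which is the corner case the patch's flush_delayed_work() call prevents.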