Message ID | 20220811072540.964309-1-lizhenneng@kylinos.cn (mailing list archive)
---|---
State | New, archived
Series | drm/radeon: add a force flush to delay work when radeon
On 11.08.22 09:25, Zhenneng Li wrote:
> Although the radeon driver fences and waits for the GPU to finish processing
> the current batch of rings, there is still a corner case in which the radeon
> lockup work queue may not be fully flushed while radeon_suspend_kms() has
> already called pci_set_power_state() to put the device into the D3hot state.

If I'm not completely mistaken, the reset worker uses the suspend/resume
functionality as well to get the hardware into a working state again.

So if I'm not completely mistaken, this here would lead to a deadlock;
please double-check that.

Regards,
Christian.

> Per PCI spec rev 4.0, section 5.3.1.4.1 (D3hot State):
>> Configuration and Message requests are the only TLPs accepted by a Function in
>> the D3hot state. All other received Requests must be handled as Unsupported
>> Requests, and all received Completions may optionally be handled as Unexpected
>> Completions.
> This issue shows up in the following log:
> Unable to handle kernel paging request at virtual address 00008800e0008010
> CPU 0 kworker/0:3(131): Oops 0
> pc = [<ffffffff811bea5c>]  ra = [<ffffffff81240844>]  ps = 0000  Tainted: G W
> pc is at si_gpu_check_soft_reset+0x3c/0x240
> ra is at si_dma_is_lockup+0x34/0xd0
> v0 = 0000000000000000  t0 = fff08800e0008010  t1 = 0000000000010000
> t2 = 0000000000008010  t3 = fff00007e3c00000  t4 = fff00007e3c00258
> t5 = 000000000000ffff  t6 = 0000000000000001  t7 = fff00007ef078000
> s0 = fff00007e3c016e8  s1 = fff00007e3c00000  s2 = fff00007e3c00018
> s3 = fff00007e3c00000  s4 = fff00007fff59d80  s5 = 0000000000000000
> s6 = fff00007ef07bd98
> a0 = fff00007e3c00000  a1 = fff00007e3c016e8  a2 = 0000000000000008
> a3 = 0000000000000001  a4 = 8f5c28f5c28f5c29  a5 = ffffffff810f4338
> t8 = 0000000000000275  t9 = ffffffff809b66f8  t10 = ff6769c5d964b800
> t11= 000000000000b886  pv = ffffffff811bea20  at = 0000000000000000
> gp = ffffffff81d89690  sp = 00000000aa814126
> Disabling lock debugging due to kernel taint
> Trace:
> [<ffffffff81240844>] si_dma_is_lockup+0x34/0xd0
> [<ffffffff81119610>] radeon_fence_check_lockup+0xd0/0x290
> [<ffffffff80977010>] process_one_work+0x280/0x550
> [<ffffffff80977350>] worker_thread+0x70/0x7c0
> [<ffffffff80977410>] worker_thread+0x130/0x7c0
> [<ffffffff80982040>] kthread+0x200/0x210
> [<ffffffff809772e0>] worker_thread+0x0/0x7c0
> [<ffffffff80981f8c>] kthread+0x14c/0x210
> [<ffffffff80911658>] ret_from_kernel_thread+0x18/0x20
> [<ffffffff80981e40>] kthread+0x0/0x210
> Code: ad3e0008 43f0074a ad7e0018 ad9e0020 8c3001e8 40230101
> <88210000> 4821ed21
>
> So force a flush of the lockup work queue to fix this problem.
>
> Signed-off-by: Zhenneng Li <lizhenneng@kylinos.cn>
> ---
>  drivers/gpu/drm/radeon/radeon_device.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
> index 15692cb241fc..e608ca26780a 100644
> --- a/drivers/gpu/drm/radeon/radeon_device.c
> +++ b/drivers/gpu/drm/radeon/radeon_device.c
> @@ -1604,6 +1604,9 @@ int radeon_suspend_kms(struct drm_device *dev, bool suspend,
>  		if (r) {
>  			/* delay GPU reset to resume */
>  			radeon_fence_driver_force_completion(rdev, i);
> +		} else {
> +			/* finish executing delayed work */
> +			flush_delayed_work(&rdev->fence_drv[i].lockup_work);
>  		}
>  	}
On 2022/8/12 18:55, Christian König wrote:
> On 11.08.22 09:25, Zhenneng Li wrote:
>> Although the radeon driver fences and waits for the GPU to finish
>> processing the current batch of rings, there is still a corner case in
>> which the radeon lockup work queue may not be fully flushed while
>> radeon_suspend_kms() has already called pci_set_power_state() to put the
>> device into the D3hot state.
>
> If I'm not completely mistaken, the reset worker uses the suspend/resume
> functionality as well to get the hardware into a working state again.
>
> So if I'm not completely mistaken, this here would lead to a deadlock;
> please double-check that.

We have tested this many times; there is no deadlock.

In which situation would this lead to a deadlock?

> Regards,
> Christian.
On 15.08.22 09:34, 李真能 wrote:
> On 2022/8/12 18:55, Christian König wrote:
>> If I'm not completely mistaken, the reset worker uses the suspend/resume
>> functionality as well to get the hardware into a working state again.
>>
>> So if I'm not completely mistaken, this here would lead to a deadlock;
>> please double-check that.
>
> We have tested this many times; there is no deadlock.

Testing doesn't tell you anything; you need to audit the call paths.

> In which situation would this lead to a deadlock?

GPU resets.

Regards,
Christian.
On 2022/8/15 21:12, Christian König wrote:
> On 15.08.22 09:34, 李真能 wrote:
>> We have tested this many times; there is no deadlock.
>
> Testing doesn't tell you anything; you need to audit the call paths.
>
>> In which situation would this lead to a deadlock?
>
> GPU resets.

Although flush_delayed_work(&rdev->fence_drv[i].lockup_work) waits for the
last queued lockup_work to finish executing, this kernel function does not
take any lock itself, and lockup_work runs in another kernel thread, so I
think flush_delayed_work cannot lead to a deadlock.

Therefore, even if radeon_gpu_reset is called in another thread while
radeon_suspend_kms is blocked in flush_delayed_work, this cannot lead to a
deadlock.
On 17.08.22 09:31, 李真能 wrote:
> Although flush_delayed_work(&rdev->fence_drv[i].lockup_work) waits for the
> last queued lockup_work to finish executing, this kernel function does not
> take any lock itself, and lockup_work runs in another kernel thread, so I
> think flush_delayed_work cannot lead to a deadlock.
>
> Therefore, even if radeon_gpu_reset is called in another thread while
> radeon_suspend_kms is blocked in flush_delayed_work, this cannot lead to a
> deadlock.

Ok, sounds like you didn't get what I wanted to say.

The key problem is that radeon_gpu_reset() calls radeon_suspend(), which in
turn calls rdev->asic->suspend().

And this function in turn could end up in radeon_suspend_kms() again, but
I'm not 100% sure about that.

Just double-check the order of the functions called here (e.g. whether
radeon_suspend_kms() calls radeon_suspend() or the other way around).

Apart from that, your patch looks correct to me as well.

Regards,
Christian.
On 2022/8/17 19:40, Christian König wrote:
> Ok, sounds like you didn't get what I wanted to say.
>
> The key problem is that radeon_gpu_reset() calls radeon_suspend(), which
> in turn calls rdev->asic->suspend().
>
> And this function in turn could end up in radeon_suspend_kms() again, but
> I'm not 100% sure about that.
>
> Just double-check the order of the functions called here (e.g. whether
> radeon_suspend_kms() calls radeon_suspend() or the other way around).
>
> Apart from that, your patch looks correct to me as well.

radeon_gpu_reset will call radeon_suspend, but radeon_suspend will not call
radeon_suspend_kms; for more detail on the call flow, see the attached file
radeon-suspend-reset.png.

Sorry, I may have mistaken your meaning.

> Regards,
> Christian.
On 19.08.22 11:34, 李真能 wrote:
> On 2022/8/17 19:40, Christian König wrote:
>> Just double-check the order of the functions called here (e.g. whether
>> radeon_suspend_kms() calls radeon_suspend() or the other way around).
>>
>> Apart from that, your patch looks correct to me as well.
>
> radeon_gpu_reset will call radeon_suspend, but radeon_suspend will not
> call radeon_suspend_kms; for more detail on the call flow, see the
> attached file radeon-suspend-reset.png.
>
> Sorry, I may have mistaken your meaning.

Ok, in this case my memory of the call flow wasn't correct any more.

Feel free to add an
Acked-by: Christian König <christian.koenig@amd.com>
to the patch.

Alex should then pick it up for upstreaming.

Thanks,
Christian.
Applied. Thanks!

Alex

On Fri, Aug 19, 2022 at 6:07 AM Christian König <christian.koenig@amd.com> wrote:
>
> Ok, in this case my memory of the call flow wasn't correct any more.
>
> Feel free to add an
> Acked-by: Christian König <christian.koenig@amd.com>
> to the patch.
>
> Alex should then pick it up for upstreaming.
>
> Thanks,
> Christian.
Apart from that your patch looks correct to me as well. > > radeon_gpu_reset will call radeon_suspend, but radeon_suspend will not call radeon_suspend_kms, the more detail of call flow, we can see the attachement file: radeon-suspend-reset.png. > > Sorry I may have mistaken your meaning. > > > Ok in this case my memory of the call flow wasn't correct any more. > > Feel free to add an Acked-by: Christian König <christian.koenig@amd.com> to the patch. > > Alex should then pick it up for upstreaming. > > Thanks, > Christian. > > > Regards, > Christian. > > > > Regards, > Christian. > > > > Regards, > Christian. > > Per PCI spec rev 4.0 on 5.3.1.4.1 D3hot State. > > Configuration and Message requests are the only TLPs accepted by a Function in > the D3hot state. All other received Requests must be handled as Unsupported Requests, > and all received Completions may optionally be handled as Unexpected Completions. > > This issue will happen in following logs: > Unable to handle kernel paging request at virtual address 00008800e0008010 > CPU 0 kworker/0:3(131): Oops 0 > pc = [<ffffffff811bea5c>] ra = [<ffffffff81240844>] ps = 0000 Tainted: G W > pc is at si_gpu_check_soft_reset+0x3c/0x240 > ra is at si_dma_is_lockup+0x34/0xd0 > v0 = 0000000000000000 t0 = fff08800e0008010 t1 = 0000000000010000 > t2 = 0000000000008010 t3 = fff00007e3c00000 t4 = fff00007e3c00258 > t5 = 000000000000ffff t6 = 0000000000000001 t7 = fff00007ef078000 > s0 = fff00007e3c016e8 s1 = fff00007e3c00000 s2 = fff00007e3c00018 > s3 = fff00007e3c00000 s4 = fff00007fff59d80 s5 = 0000000000000000 > s6 = fff00007ef07bd98 > a0 = fff00007e3c00000 a1 = fff00007e3c016e8 a2 = 0000000000000008 > a3 = 0000000000000001 a4 = 8f5c28f5c28f5c29 a5 = ffffffff810f4338 > t8 = 0000000000000275 t9 = ffffffff809b66f8 t10 = ff6769c5d964b800 > t11= 000000000000b886 pv = ffffffff811bea20 at = 0000000000000000 > gp = ffffffff81d89690 sp = 00000000aa814126 > Disabling lock debugging due to kernel taint > Trace: > 
[<ffffffff81240844>] si_dma_is_lockup+0x34/0xd0 > [<ffffffff81119610>] radeon_fence_check_lockup+0xd0/0x290 > [<ffffffff80977010>] process_one_work+0x280/0x550 > [<ffffffff80977350>] worker_thread+0x70/0x7c0 > [<ffffffff80977410>] worker_thread+0x130/0x7c0 > [<ffffffff80982040>] kthread+0x200/0x210 > [<ffffffff809772e0>] worker_thread+0x0/0x7c0 > [<ffffffff80981f8c>] kthread+0x14c/0x210 > [<ffffffff80911658>] ret_from_kernel_thread+0x18/0x20 > [<ffffffff80981e40>] kthread+0x0/0x210 > Code: ad3e0008 43f0074a ad7e0018 ad9e0020 8c3001e8 40230101 > <88210000> 4821ed21 > So force lockup work queue flush to fix this problem. > > Signed-off-by: Zhenneng Li <lizhenneng@kylinos.cn> > --- > drivers/gpu/drm/radeon/radeon_device.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c > index 15692cb241fc..e608ca26780a 100644 > --- a/drivers/gpu/drm/radeon/radeon_device.c > +++ b/drivers/gpu/drm/radeon/radeon_device.c > @@ -1604,6 +1604,9 @@ int radeon_suspend_kms(struct drm_device *dev, bool suspend, > if (r) { > /* delay GPU reset to resume */ > radeon_fence_driver_force_completion(rdev, i); > + } else { > + /* finish executing delayed work */ > + flush_delayed_work(&rdev->fence_drv[i].lockup_work); > } > } > > > > >
diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
index 15692cb241fc..e608ca26780a 100644
--- a/drivers/gpu/drm/radeon/radeon_device.c
+++ b/drivers/gpu/drm/radeon/radeon_device.c
@@ -1604,6 +1604,9 @@ int radeon_suspend_kms(struct drm_device *dev, bool suspend,
 		if (r) {
 			/* delay GPU reset to resume */
 			radeon_fence_driver_force_completion(rdev, i);
+		} else {
+			/* finish executing delayed work */
+			flush_delayed_work(&rdev->fence_drv[i].lockup_work);
 		}
 	}
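The race the patch closes — the delayed lockup worker touching device registers after pci_set_power_state() has put the card in D3hot — can be sketched with a toy userspace analogy. This is not driver code: FakeDevice, check_lockup, and suspend are invented stand-ins, and Python's Timer.join() plays the role of flush_delayed_work() (wait for the pending work item to finish before cutting power).

```python
import threading

class FakeDevice:
    """Toy stand-in for the GPU: state is only safe to touch while powered."""
    def __init__(self):
        self.powered = True
        self.bad_access = False

    def check_lockup(self):
        # Stand-in for the lockup worker reading hardware registers.
        if not self.powered:
            self.bad_access = True  # analogous to the MMIO oops in D3hot

def suspend(dev, pending_work, flush_first):
    if flush_first:
        pending_work.join()  # analogous to flush_delayed_work()
    dev.powered = False      # analogous to pci_set_power_state(D3hot)

dev = FakeDevice()
work = threading.Timer(0.05, dev.check_lockup)  # a "delayed work" item
work.start()
suspend(dev, work, flush_first=True)
work.join()  # ensure the worker has completed before inspecting state
print(dev.bad_access)
```

With flush_first=True the pending check is guaranteed to run while the device is still powered, so bad_access stays False; with flush_first=False the worker can fire after power-off, which is the corner case the patch's flush_delayed_work() call prevents.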