Re: [PATCH 2/2] drm/amdgpu: Use mod_delayed_work in JPEG/UVD/VCE/VCN ring_end_use hooks

Message ID MN2PR12MB3775E6C1ECA915283108E6D783F99@MN2PR12MB3775.namprd12.prod.outlook.com (mailing list archive)
State New, archived
Series Re: [PATCH 2/2] drm/amdgpu: Use mod_delayed_work in JPEG/UVD/VCE/VCN ring_end_use hooks

Commit Message

Christian König Aug. 12, 2021, 5:55 a.m. UTC
Hi James,

Evan seems to have understood how this all works together.

See, while any begin/end use critical section is active, the work should not be active.

When you handle only one ring you can just call cancel in begin use and schedule in end use. But when you have more than one ring, you need a lock or counter to prevent concurrent work items from being started.
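
For example, the pattern looks roughly like this (an untested sketch; the total_submission_cnt/idle_work names are borrowed from the amdgpu VCN code, the bodies are simplified):

void ring_begin_use(struct amdgpu_ring *ring)
{
        atomic_inc(&ring->adev->vcn.total_submission_cnt);
        cancel_delayed_work_sync(&ring->adev->vcn.idle_work);
        /* ... make sure the block is powered up ... */
}

void ring_end_use(struct amdgpu_ring *ring)
{
        atomic_dec(&ring->adev->vcn.total_submission_cnt);
        schedule_delayed_work(&ring->adev->vcn.idle_work, VCN_IDLE_TIMEOUT);
}

static void idle_work_handler(struct work_struct *work)
{
        struct amdgpu_device *adev =
                container_of(work, struct amdgpu_device, vcn.idle_work.work);

        /* The counter is what keeps the work from powering the block
         * down while any ring is still inside a begin/end use section. */
        if (!atomic_read(&adev->vcn.total_submission_cnt)) {
                /* ... power the block down ... */
        }
}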

Michel's idea to use mod_delayed_work is a bad one because it assumes that the delayed work is still running.

Something similar applies to the first patch I think, so when this makes a difference it is actually a bug.

Regards,
Christian.

Comments

Michel Dänzer Aug. 12, 2021, 8:11 a.m. UTC | #1
On 2021-08-12 7:55 a.m., Koenig, Christian wrote:
> Hi James,
> 
> Evan seems to have understood how this all works together.
> 
> See, while any begin/end use critical section is active, the work should not be active.
> 
> When you handle only one ring you can just call cancel in begin use and schedule in end use. But when you have more than one ring, you need a lock or counter to prevent concurrent work items from being started.
> 
> Michel's idea to use mod_delayed_work is a bad one because it assumes that the delayed work is still running.

It merely assumes that the work may already have been scheduled before.

Admittedly, I missed the cancel_delayed_work_sync calls for patch 2. While I think it can still have some effect when there's a single work item for multiple rings, as described by James, it's probably negligible, since presumably the time intervals between ring_begin_use and ring_end_use are normally much shorter than a second.

So, while patch 2 is at worst a no-op (since mod_delayed_work is the same as schedule_delayed_work if the work hasn't been scheduled yet), I'm fine with dropping it.
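
For reference, my reading of the semantics of the two calls (paraphrasing the kernel-doc in kernel/workqueue.c, so treat this as a summary rather than gospel):

/*
 * schedule_delayed_work(): if the work is already pending, does nothing
 *                          and keeps the original expiry time.
 * mod_delayed_work():      if the work is already pending, moves its
 *                          timer out to the new delay; otherwise behaves
 *                          like schedule_delayed_work().
 */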


> Something similar applies to the first patch I think,

There are no cancel work calls in that case, so the commit log is accurate TTBOMK.

I noticed this because current mutter Git main wasn't able to sustain 60 fps on Navi 14 with a simple glxgears -fullscreen. mutter was dropping frames because its CPU work for a frame update occasionally took up to 3 ms, instead of the normal 200-300 microseconds. sysprof showed a lot of cycles spent in the functions which enable/disable GFXOFF in the HW.


> so when this makes a difference it is actually a bug.

There was certainly a bug though, which patch 1 fixes. :)
Lazar, Lijo Aug. 12, 2021, 11:33 a.m. UTC | #2
On 8/12/2021 1:41 PM, Michel Dänzer wrote:
> On 2021-08-12 7:55 a.m., Koenig, Christian wrote:
>> Hi James,
>>
>> Evan seems to have understood how this all works together.
>>
>> See, while any begin/end use critical section is active, the work should not be active.
>>
>> When you handle only one ring you can just call cancel in begin use and schedule in end use. But when you have more than one ring, you need a lock or counter to prevent concurrent work items from being started.
>>
>> Michel's idea to use mod_delayed_work is a bad one because it assumes that the delayed work is still running.
> 
> It merely assumes that the work may already have been scheduled before.
> 
> Admittedly, I missed the cancel_delayed_work_sync calls for patch 2. While I think it can still have some effect when there's a single work item for multiple rings, as described by James, it's probably negligible, since presumably the time intervals between ring_begin_use and ring_end_use are normally much shorter than a second.
> 
> So, while patch 2 is at worst a no-op (since mod_delayed_work is the same as schedule_delayed_work if the work hasn't been scheduled yet), I'm fine with dropping it.
> 
> 
>> Something similar applies to the first patch I think,
> 
> There are no cancel work calls in that case, so the commit log is accurate TTBOMK.

Curious -

For patch 1, does it make a difference if any delayed work scheduled is 
cancelled in the else part before proceeding?

} else if (!enable && adev->gfx.gfx_off_state) {
        cancel_delayed_work(&adev->gfx.gfx_off_delay_work);


Thanks,
Lijo

> 
> I noticed this because current mutter Git main wasn't able to sustain 60 fps on Navi 14 with a simple glxgears -fullscreen. mutter was dropping frames because its CPU work for a frame update occasionally took up to 3 ms, instead of the normal 200-300 microseconds. sysprof showed a lot of cycles spent in the functions which enable/disable GFXOFF in the HW.
> 
> 
>> so when this makes a difference it is actually a bug.
> 
> There was certainly a bug though, which patch 1 fixes. :)
> 
>
Michel Dänzer Aug. 12, 2021, 4:54 p.m. UTC | #3
On 2021-08-12 1:33 p.m., Lazar, Lijo wrote:
> On 8/12/2021 1:41 PM, Michel Dänzer wrote:
>> On 2021-08-12 7:55 a.m., Koenig, Christian wrote:
>>> Hi James,
>>>
>>> Evan seems to have understood how this all works together.
>>>
>>> See, while any begin/end use critical section is active, the work should not be active.
>>>
>>> When you handle only one ring you can just call cancel in begin use and schedule in end use. But when you have more than one ring, you need a lock or counter to prevent concurrent work items from being started.
>>>
>>> Michel's idea to use mod_delayed_work is a bad one because it assumes that the delayed work is still running.
>>
>> It merely assumes that the work may already have been scheduled before.
>>
>> Admittedly, I missed the cancel_delayed_work_sync calls for patch 2. While I think it can still have some effect when there's a single work item for multiple rings, as described by James, it's probably negligible, since presumably the time intervals between ring_begin_use and ring_end_use are normally much shorter than a second.
>>
>> So, while patch 2 is at worst a no-op (since mod_delayed_work is the same as schedule_delayed_work if the work hasn't been scheduled yet), I'm fine with dropping it.
>>
>>
>>> Something similar applies to the first patch I think,
>>
>> There are no cancel work calls in that case, so the commit log is accurate TTBOMK.
> 
> Curious -
> 
> For patch 1, does it make a difference if any delayed work scheduled is cancelled in the else part before proceeding?
> 
> } else if (!enable && adev->gfx.gfx_off_state) {
>         cancel_delayed_work(&adev->gfx.gfx_off_delay_work);

I tried the patch below.

While this does seem to fix the problem as well, I see a potential issue:

1. amdgpu_gfx_off_ctrl locks adev->gfx.gfx_off_mutex
2. amdgpu_device_delay_enable_gfx_off runs, blocks in mutex_lock
3. amdgpu_gfx_off_ctrl calls cancel_delayed_work_sync

I'm afraid this would deadlock? (CONFIG_PROVE_LOCKING doesn't complain though)
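
In code terms, the cycle I'm worried about would be (simplified sketch, not the actual functions):

static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
{
        struct amdgpu_device *adev =
                container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);

        mutex_lock(&adev->gfx.gfx_off_mutex);   /* (2) blocks, ctrl holds it */
        /* ... */
        mutex_unlock(&adev->gfx.gfx_off_mutex);
}

void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
{
        mutex_lock(&adev->gfx.gfx_off_mutex);   /* (1) */
        /* (3) waits for the handler to finish, while the handler is
         * waiting for the mutex we hold -> deadlock */
        cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
        mutex_unlock(&adev->gfx.gfx_off_mutex);
}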


Maybe it's possible to fix it with cancel_delayed_work_sync somehow, but I'm not sure how offhand. (With cancel_delayed_work instead, I'm worried amdgpu_device_delay_enable_gfx_off might still enable GFXOFF in the HW immediately after amdgpu_gfx_off_ctrl unlocks the mutex. Then again, that might happen with mod_delayed_work as well...)


diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index a0be0772c8b3..3e4585ffb9af 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -570,8 +570,11 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
 
        if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
                schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
-       } else if (!enable && adev->gfx.gfx_off_state) {
-               if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
+       } else if (!enable) {
+               cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
+
+               if (adev->gfx.gfx_off_state &&
+                   !amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
                        adev->gfx.gfx_off_state = false;
 
                        if (adev->gfx.funcs->init_spm_golden) {
Lazar, Lijo Aug. 13, 2021, 4:23 a.m. UTC | #4
On 8/12/2021 10:24 PM, Michel Dänzer wrote:
> On 2021-08-12 1:33 p.m., Lazar, Lijo wrote:
>> On 8/12/2021 1:41 PM, Michel Dänzer wrote:
>>> On 2021-08-12 7:55 a.m., Koenig, Christian wrote:
>>>> Hi James,
>>>>
>>>> Evan seems to have understood how this all works together.
>>>>
>>>> See, while any begin/end use critical section is active, the work should not be active.
>>>>
>>>> When you handle only one ring you can just call cancel in begin use and schedule in end use. But when you have more than one ring, you need a lock or counter to prevent concurrent work items from being started.
>>>>
>>>> Michel's idea to use mod_delayed_work is a bad one because it assumes that the delayed work is still running.
>>>
>>> It merely assumes that the work may already have been scheduled before.
>>>
>>> Admittedly, I missed the cancel_delayed_work_sync calls for patch 2. While I think it can still have some effect when there's a single work item for multiple rings, as described by James, it's probably negligible, since presumably the time intervals between ring_begin_use and ring_end_use are normally much shorter than a second.
>>>
>>> So, while patch 2 is at worst a no-op (since mod_delayed_work is the same as schedule_delayed_work if the work hasn't been scheduled yet), I'm fine with dropping it.
>>>
>>>
>>>> Something similar applies to the first patch I think,
>>>
>>> There are no cancel work calls in that case, so the commit log is accurate TTBOMK.
>>
>> Curious -
>>
>> For patch 1, does it make a difference if any delayed work scheduled is cancelled in the else part before proceeding?
>>
>> } else if (!enable && adev->gfx.gfx_off_state) {
>>         cancel_delayed_work(&adev->gfx.gfx_off_delay_work);
> 
> I tried the patch below.
> 
> While this does seem to fix the problem as well, I see a potential issue:
> 
> 1. amdgpu_gfx_off_ctrl locks adev->gfx.gfx_off_mutex
> 2. amdgpu_device_delay_enable_gfx_off runs, blocks in mutex_lock
> 3. amdgpu_gfx_off_ctrl calls cancel_delayed_work_sync
> 
> I'm afraid this would deadlock? (CONFIG_PROVE_LOCKING doesn't complain though)
> 

Should use cancel_delayed_work instead of the _sync version. As you
mentioned - at best the work is not scheduled yet and is cancelled
successfully, or at worst it's waiting for the mutex. In the worst case,
if amdgpu_device_delay_enable_gfx_off gets the mutex after
amdgpu_gfx_off_ctrl unlocks it, there is an extra check as below.

if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count)

The count wouldn't be 0 and hence it won't enable GFXOFF.
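
That is, the work handler looks roughly like this (simplified from my reading of amdgpu_device_delay_enable_gfx_off):

static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
{
        struct amdgpu_device *adev =
                container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);

        mutex_lock(&adev->gfx.gfx_off_mutex);
        /* If a disable request raced in, gfx_off_req_count is non-zero
         * here, so a late-running work item still won't enable GFXOFF. */
        if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
                if (!amdgpu_dpm_set_powergating_by_smu(adev,
                                AMD_IP_BLOCK_TYPE_GFX, true))
                        adev->gfx.gfx_off_state = true;
        }
        mutex_unlock(&adev->gfx.gfx_off_mutex);
}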

> 
> Maybe it's possible to fix it with cancel_delayed_work_sync somehow, but I'm not sure how offhand. (With cancel_delayed_work instead, I'm worried amdgpu_device_delay_enable_gfx_off might still enable GFXOFF in the HW immediately after amdgpu_gfx_off_ctrl unlocks the mutex. Then again, that might happen with mod_delayed_work as well...)

As mentioned earlier, cancel_delayed_work won't cause this issue.

In the mod_delayed_ patch, the mod_ version is called only when
req_count is 0. While that is a good thing, it keeps alive one more
contender for the mutex.

The cancel_ version eliminates that contender if it happens to be called
at the right time (more likely if there are multiple requests to disable
gfxoff). On the other hand, I don't know how costly it is to call cancel_
every time in the else part (or maybe call it only once, when the count
increments to 1?).
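
Something like this for the "only once" variant (untested sketch; it assumes gfx_off_req_count has already been incremented at this point in amdgpu_gfx_off_ctrl, and the inner body is abbreviated):

        } else if (!enable) {
                /* req_count was just incremented, so 1 means this is the
                 * first disable request and the work may still be pending. */
                if (adev->gfx.gfx_off_req_count == 1)
                        cancel_delayed_work(&adev->gfx.gfx_off_delay_work);

                if (adev->gfx.gfx_off_state &&
                    !amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false))
                        adev->gfx.gfx_off_state = false;
        }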

Thanks,
Lijo

> 
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index a0be0772c8b3..3e4585ffb9af 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -570,8 +570,11 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
> 
>         if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>                 schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
> -       } else if (!enable && adev->gfx.gfx_off_state) {
> -               if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
> +       } else if (!enable) {
> +               cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
> +
> +               if (adev->gfx.gfx_off_state &&
> +                   !amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
>                         adev->gfx.gfx_off_state = false;
> 
>                         if (adev->gfx.funcs->init_spm_golden) {
Michel Dänzer Aug. 13, 2021, 10:31 a.m. UTC | #5
On 2021-08-13 6:23 a.m., Lazar, Lijo wrote:
> 
> 
> On 8/12/2021 10:24 PM, Michel Dänzer wrote:
>> On 2021-08-12 1:33 p.m., Lazar, Lijo wrote:
>>> On 8/12/2021 1:41 PM, Michel Dänzer wrote:
>>>> On 2021-08-12 7:55 a.m., Koenig, Christian wrote:
>>>>> Hi James,
>>>>>
>>>>> Evan seems to have understood how this all works together.
>>>>>
>>>>> See, while any begin/end use critical section is active, the work should not be active.
>>>>>
>>>>> When you handle only one ring you can just call cancel in begin use and schedule in end use. But when you have more than one ring, you need a lock or counter to prevent concurrent work items from being started.
>>>>>
>>>>> Michel's idea to use mod_delayed_work is a bad one because it assumes that the delayed work is still running.
>>>>
>>>> It merely assumes that the work may already have been scheduled before.
>>>>
>>>> Admittedly, I missed the cancel_delayed_work_sync calls for patch 2. While I think it can still have some effect when there's a single work item for multiple rings, as described by James, it's probably negligible, since presumably the time intervals between ring_begin_use and ring_end_use are normally much shorter than a second.
>>>>
>>>> So, while patch 2 is at worst a no-op (since mod_delayed_work is the same as schedule_delayed_work if the work hasn't been scheduled yet), I'm fine with dropping it.
>>>>
>>>>
>>>>> Something similar applies to the first patch I think,
>>>>
>>>> There are no cancel work calls in that case, so the commit log is accurate TTBOMK.
>>>
>>> Curious -
>>>
>>> For patch 1, does it make a difference if any delayed work scheduled is cancelled in the else part before proceeding?
>>>
>>> } else if (!enable && adev->gfx.gfx_off_state) {
>>>         cancel_delayed_work(&adev->gfx.gfx_off_delay_work);
>>
>> I tried the patch below.
>>
>> While this does seem to fix the problem as well, I see a potential issue:
>>
>> 1. amdgpu_gfx_off_ctrl locks adev->gfx.gfx_off_mutex
>> 2. amdgpu_device_delay_enable_gfx_off runs, blocks in mutex_lock
>> 3. amdgpu_gfx_off_ctrl calls cancel_delayed_work_sync
>>
>> I'm afraid this would deadlock? (CONFIG_PROVE_LOCKING doesn't complain though)
> 
> Should use cancel_delayed_work instead of the _sync version.

The thing is, it's not clear to me from cancel_delayed_work's description that it's guaranteed not to wait for amdgpu_device_delay_enable_gfx_off to finish if it's already running. If that's not guaranteed, it's prone to the same deadlock.

> As you mentioned - at best the work is not scheduled yet and is cancelled successfully, or at worst it's waiting for the mutex. In the worst case, if amdgpu_device_delay_enable_gfx_off gets the mutex after amdgpu_gfx_off_ctrl unlocks it, there is an extra check as below.
> 
> if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count)
> 
> The count wouldn't be 0 and hence it won't enable GFXOFF.

I'm not sure, but it might also be possible for amdgpu_device_delay_enable_gfx_off to get the mutex only after amdgpu_gfx_off_ctrl was called again and set adev->gfx.gfx_off_req_count back to 0.


>> Maybe it's possible to fix it with cancel_delayed_work_sync somehow, but I'm not sure how offhand. (With cancel_delayed_work instead, I'm worried amdgpu_device_delay_enable_gfx_off might still enable GFXOFF in the HW immediately after amdgpu_gfx_off_ctrl unlocks the mutex. Then again, that might happen with mod_delayed_work as well...)
> 
> As mentioned earlier, cancel_delayed_work won't cause this issue.
> 
> In the mod_delayed_ patch, mod_ version is called only when req_count is 0. While that is a good thing, it keeps alive one more contender for the mutex.

Not sure what you mean. It leaves the possibility of amdgpu_device_delay_enable_gfx_off running just after amdgpu_gfx_off_ctrl tried to postpone it. As discussed above, something similar might be possible with cancel_delayed_work as well.

> The cancel_ version eliminates that contender if it happens to be called at the right time (more likely if there are multiple requests to disable gfxoff). On the other hand, I don't know how costly it is to call cancel_ every time in the else part (or maybe call it only once, when the count increments to 1?).

Sure, why not, though I doubt it matters much — I expect adev->gfx.gfx_off_req_count transitioning between 0 <-> 1 to be the most common case by far.


I sent out a v2 patch which should address all these issues.
Lazar, Lijo Aug. 13, 2021, 11:18 a.m. UTC | #6
On 8/13/2021 4:01 PM, Michel Dänzer wrote:
> On 2021-08-13 6:23 a.m., Lazar, Lijo wrote:
>>
>>
>> On 8/12/2021 10:24 PM, Michel Dänzer wrote:
>>> On 2021-08-12 1:33 p.m., Lazar, Lijo wrote:
>>>> On 8/12/2021 1:41 PM, Michel Dänzer wrote:
>>>>> On 2021-08-12 7:55 a.m., Koenig, Christian wrote:
>>>>>> Hi James,
>>>>>>
>>>>>> Evan seems to have understood how this all works together.
>>>>>>
>>>>>> See, while any begin/end use critical section is active, the work should not be active.
>>>>>>
>>>>>> When you handle only one ring you can just call cancel in begin use and schedule in end use. But when you have more than one ring, you need a lock or counter to prevent concurrent work items from being started.
>>>>>>
>>>>>> Michel's idea to use mod_delayed_work is a bad one because it assumes that the delayed work is still running.
>>>>>
>>>>> It merely assumes that the work may already have been scheduled before.
>>>>>
>>>>> Admittedly, I missed the cancel_delayed_work_sync calls for patch 2. While I think it can still have some effect when there's a single work item for multiple rings, as described by James, it's probably negligible, since presumably the time intervals between ring_begin_use and ring_end_use are normally much shorter than a second.
>>>>>
>>>>> So, while patch 2 is at worst a no-op (since mod_delayed_work is the same as schedule_delayed_work if the work hasn't been scheduled yet), I'm fine with dropping it.
>>>>>
>>>>>
>>>>>> Something similar applies to the first patch I think,
>>>>>
>>>>> There are no cancel work calls in that case, so the commit log is accurate TTBOMK.
>>>>
>>>> Curious -
>>>>
>>>> For patch 1, does it make a difference if any delayed work scheduled is cancelled in the else part before proceeding?
>>>>
>>>> } else if (!enable && adev->gfx.gfx_off_state) {
>>>>         cancel_delayed_work(&adev->gfx.gfx_off_delay_work);
>>>
>>> I tried the patch below.
>>>
>>> While this does seem to fix the problem as well, I see a potential issue:
>>>
>>> 1. amdgpu_gfx_off_ctrl locks adev->gfx.gfx_off_mutex
>>> 2. amdgpu_device_delay_enable_gfx_off runs, blocks in mutex_lock
>>> 3. amdgpu_gfx_off_ctrl calls cancel_delayed_work_sync
>>>
>>> I'm afraid this would deadlock? (CONFIG_PROVE_LOCKING doesn't complain though)
>>
>> Should use cancel_delayed_work instead of the _sync version.
> 
> The thing is, it's not clear to me from cancel_delayed_work's description that it's guaranteed not to wait for amdgpu_device_delay_enable_gfx_off to finish if it's already running. If that's not guaranteed, it's prone to the same deadlock.

From what I understood from the description, cancel initiates a cancel.
If the work has already started, it returns false saying it couldn't
succeed; otherwise it cancels out the scheduled work and returns true.
In the note below, it asks to specifically use the _sync version if we
need to wait for an already started work, and that definitely has the
deadlock problem you mentioned above.

 * Note:
 * The work callback function may still be running on return, unless
 * it returns %true and the work doesn't re-arm itself.  Explicitly flush or
 * use cancel_delayed_work_sync() to wait on it.


> 
>> As you mentioned - at best the work is not scheduled yet and is cancelled successfully, or at worst it's waiting for the mutex. In the worst case, if amdgpu_device_delay_enable_gfx_off gets the mutex after amdgpu_gfx_off_ctrl unlocks it, there is an extra check as below.
>>
>> if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count)
>>
>> The count wouldn't be 0 and hence it won't enable GFXOFF.
> 
> I'm not sure, but it might also be possible for amdgpu_device_delay_enable_gfx_off to get the mutex only after amdgpu_gfx_off_ctrl was called again and set adev->gfx.gfx_off_req_count back to 0.
> 

Yes, this is a case we can't avoid either way. If the work has already
started, then mod_delayed_ also doesn't have any impact. Another case is
when the work thread has already got the mutex and a disable request
comes just at that time. It needs to wait till the mutex is released by
the work, which could mean gfxoff gets enabled and then immediately
disabled.

> 
>>> Maybe it's possible to fix it with cancel_delayed_work_sync somehow, but I'm not sure how offhand. (With cancel_delayed_work instead, I'm worried amdgpu_device_delay_enable_gfx_off might still enable GFXOFF in the HW immediately after amdgpu_gfx_off_ctrl unlocks the mutex. Then again, that might happen with mod_delayed_work as well...)
>>
>> As mentioned earlier, cancel_delayed_work won't cause this issue.
>>
>> In the mod_delayed_ patch, the mod_ version is called only when req_count is 0. While that is a good thing, it keeps alive one more contender for the mutex.
> 
> Not sure what you mean. It leaves the possibility of amdgpu_device_delay_enable_gfx_off running just after amdgpu_gfx_off_ctrl tried to postpone it. As discussed above, something similar might be possible with cancel_delayed_work as well.
> 

mod_delayed_ is called only when req_count gets back to 0. If another
disable request comes after that, it doesn't cancel out the scheduled
work, nor does it adjust the delay.

Ex:
Disable gfxoff -> Enable gfxoff (now the work is scheduled) -> Disable
gfxoff (within 5ms or whatever the delay may be, but this call won't go
through the mod_delayed path to delay it further) -> Work starts after
5ms and creates contention for the mutex -> Enable gfxoff

When cancel_ is used, the second disable call immediately cancels out
any work that is scheduled but not started, and it doesn't create
unnecessary contention for the mutex. It's a matter of who gets the
mutex first. Cancel has a better chance of eliminating the second-thread
possibility.

>> The cancel_ version eliminates that contender if it happens to be called at the right time (more likely if there are multiple requests to disable gfxoff). On the other hand, I don't know how costly it is to call cancel_ every time in the else part (or maybe call it only once, when the count increments to 1?).
> 
> Sure, why not, though I doubt it matters much — I expect adev->gfx.gfx_off_req_count transitioning between 0 <-> 1 to be the most common case by far.
> 
> 
> I sent out a v2 patch which should address all these issues.
> 

Will check that.

Thanks,
Lijo

>
Christian König Aug. 16, 2021, 7:33 a.m. UTC | #7
On 12.08.21 at 10:11, Michel Dänzer wrote:
> On 2021-08-12 7:55 a.m., Koenig, Christian wrote:
>> Hi James,
>>
>> Evan seems to have understood how this all works together.
>>
>> See, while any begin/end use critical section is active, the work should not be active.
>>
>> When you handle only one ring you can just call cancel in begin use and schedule in end use. But when you have more than one ring, you need a lock or counter to prevent concurrent work items from being started.
>>
>> Michel's idea to use mod_delayed_work is a bad one because it assumes that the delayed work is still running.
> It merely assumes that the work may already have been scheduled before.
>
> Admittedly, I missed the cancel_delayed_work_sync calls for patch 2. While I think it can still have some effect when there's a single work item for multiple rings, as described by James, it's probably negligible, since presumably the time intervals between ring_begin_use and ring_end_use are normally much shorter than a second.
>
> So, while patch 2 is at worst a no-op (since mod_delayed_work is the same as schedule_delayed_work if the work hasn't been scheduled yet), I'm fine with dropping it.

Yeah, I think that would be much better.

>> Something similar applies to the first patch I think,
> There are no cancel work calls in that case, so the commit log is accurate TTBOMK.
>
> I noticed this because current mutter Git main wasn't able to sustain 60 fps on Navi 14 with a simple glxgears -fullscreen. mutter was dropping frames because its CPU work for a frame update occasionally took up to 3 ms, instead of the normal 200-300 microseconds. sysprof showed a lot of cycles spent in the functions which enable/disable GFXOFF in the HW.
>
>
>> so when this makes a difference it is actually a bug.
> There was certainly a bug though, which patch 1 fixes. :)

Agreed, just wanted to note that this is most likely not the right 
solution since Alex was already picking it up.

Going to reply separately on the new patch as well.

Regards,
Christian.

Patch

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c
index 8996cb4ed57a..2c0040153f6c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c
@@ -110,7 +110,7 @@ void amdgpu_jpeg_ring_begin_use(struct amdgpu_ring *ring)
 void amdgpu_jpeg_ring_end_use(struct amdgpu_ring *ring)
 {
         atomic_dec(&ring->adev->jpeg.total_submission_cnt);
-       schedule_delayed_work(&ring->adev->jpeg.idle_work, JPEG_IDLE_TIMEOUT);
+       mod_delayed_work(system_wq, &ring->adev->jpeg.idle_work, JPEG_IDLE_TIMEOUT);
 }

 int amdgpu_jpeg_dec_ring_test_ring(struct amdgpu_ring *ring)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
index 0f576f294d8a..b6b1d7eeb8e5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
@@ -1283,7 +1283,7 @@ void amdgpu_uvd_ring_begin_use(struct amdgpu_ring *ring)
 void amdgpu_uvd_ring_end_use(struct amdgpu_ring *ring)
 {
         if (!amdgpu_sriov_vf(ring->adev))
-               schedule_delayed_work(&ring->adev->uvd.idle_work, UVD_IDLE_TIMEOUT);
+               mod_delayed_work(system_wq, &ring->adev->uvd.idle_work, UVD_IDLE_TIMEOUT);
 }

 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
index 1ae7f824adc7..2253c18a6688 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
@@ -401,7 +401,7 @@ void amdgpu_vce_ring_begin_use(struct amdgpu_ring *ring)
 void amdgpu_vce_ring_end_use(struct amdgpu_ring *ring)
 {
         if (!amdgpu_sriov_vf(ring->adev))
-               schedule_delayed_work(&ring->adev->vce.idle_work, VCE_IDLE_TIMEOUT);
+               mod_delayed_work(system_wq, &ring->adev->vce.idle_work, VCE_IDLE_TIMEOUT);
 }

 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
index 284bb42d6c86..d5937ab5ac80 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c
@@ -1874,7 +1874,7 @@ void vcn_v1_0_set_pg_for_begin_use(struct amdgpu_ring *ring, bool set_clocks)

 void vcn_v1_0_ring_end_use(struct amdgpu_ring *ring)
 {
-       schedule_delayed_work(&ring->adev->vcn.idle_work, VCN_IDLE_TIMEOUT);
+       mod_delayed_work(system_wq, &ring->adev->vcn.idle_work, VCN_IDLE_TIMEOUT);
         mutex_unlock(&ring->adev->vcn.vcn1_jpeg1_workaround);
 }