drm/ttm: update bulk move object of ghost BO

Message ID 20220901092946.2030744-1-zhenguo.yin@amd.com (mailing list archive)
State New, archived

Commit Message

ZhenGuo Yin Sept. 1, 2022, 9:29 a.m. UTC
[Why]
Ghost BO is released with non-empty bulk move object. There is a
warning trace:
WARNING: CPU: 19 PID: 1582 at ttm/ttm_bo.c:366 ttm_bo_release+0x2e1/0x2f0 [amdttm]
Call Trace:
  amddma_resv_reserve_fences+0x10d/0x1f0 [amdkcl]
  amdttm_bo_put+0x28/0x30 [amdttm]
  amdttm_bo_move_accel_cleanup+0x126/0x200 [amdttm]
  amdgpu_bo_move+0x1a8/0x770 [amdgpu]
  ttm_bo_handle_move_mem+0xb0/0x140 [amdttm]
  amdttm_bo_validate+0xbf/0x100 [amdttm]

[How]
The resource of the ghost BO should be moved to the LRU directly, instead of
using bulk move. The bulk move object of the ghost BO should be set to NULL
before calling ttm_bo_move_to_lru_tail_unlocked.

Fixed: 5b951e487fd6bf5f ("drm/ttm: fix bulk move handling v2")
Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com>
---
 drivers/gpu/drm/ttm/ttm_bo_util.c | 1 +
 1 file changed, 1 insertion(+)

Comments

JingWen Chen Sept. 1, 2022, 9:55 a.m. UTC | #1
Acked-by: Jingwen Chen <Jingwen.Chen2@amd.com>

Still needs confirmation from Christian.

On 9/1/22 5:29 PM, ZhenGuo Yin wrote:
> [Why]
> Ghost BO is released with non-empty bulk move object. There is a
> warning trace:
> WARNING: CPU: 19 PID: 1582 at ttm/ttm_bo.c:366 ttm_bo_release+0x2e1/0x2f0 [amdttm]
> Call Trace:
>   amddma_resv_reserve_fences+0x10d/0x1f0 [amdkcl]
>   amdttm_bo_put+0x28/0x30 [amdttm]
>   amdttm_bo_move_accel_cleanup+0x126/0x200 [amdttm]
>   amdgpu_bo_move+0x1a8/0x770 [amdgpu]
>   ttm_bo_handle_move_mem+0xb0/0x140 [amdttm]
>   amdttm_bo_validate+0xbf/0x100 [amdttm]
>
> [How]
> The resource of the ghost BO should be moved to the LRU directly, instead of
> using bulk move. The bulk move object of the ghost BO should be set to NULL
> before calling ttm_bo_move_to_lru_tail_unlocked.
>
> Fixed: 5b951e487fd6bf5f ("drm/ttm: fix bulk move handling v2")
> Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com>
> ---
>  drivers/gpu/drm/ttm/ttm_bo_util.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
> index 1cbfb00c1d65..a90bbbd91910 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo_util.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
> @@ -238,6 +238,7 @@ static int ttm_buffer_object_transfer(struct ttm_buffer_object *bo,
>  
>  	if (fbo->base.resource) {
>  		ttm_resource_set_bo(fbo->base.resource, &fbo->base);
> +		ttm_bo_set_bulk_move(&fbo->base, NULL);
>  		bo->resource = NULL;
>  	}
>
Christian König Sept. 1, 2022, 11:11 a.m. UTC | #2
Am 01.09.22 um 11:29 schrieb ZhenGuo Yin:
> [Why]
> Ghost BO is released with non-empty bulk move object. There is a
> warning trace:
> WARNING: CPU: 19 PID: 1582 at ttm/ttm_bo.c:366 ttm_bo_release+0x2e1/0x2f0 [amdttm]
> Call Trace:
>    amddma_resv_reserve_fences+0x10d/0x1f0 [amdkcl]
>    amdttm_bo_put+0x28/0x30 [amdttm]
>    amdttm_bo_move_accel_cleanup+0x126/0x200 [amdttm]
>    amdgpu_bo_move+0x1a8/0x770 [amdgpu]
>    ttm_bo_handle_move_mem+0xb0/0x140 [amdttm]
>    amdttm_bo_validate+0xbf/0x100 [amdttm]
>
> [How]
> The resource of the ghost BO should be moved to the LRU directly, instead of
> using bulk move. The bulk move object of the ghost BO should be set to NULL
> before calling ttm_bo_move_to_lru_tail_unlocked.
>
> Fixed: 5b951e487fd6bf5f ("drm/ttm: fix bulk move handling v2")
> Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com>

Good catch, but the fix is not 100% correct. Please rather just NULL the 
member while initializing the BO structure.

E.g. something like this:

  ....
  fbo->base.pin_count = 0;
+fbo->base.bulk_move= NULL;
  if (bo->type != ttm_bo_type_sg)
  ....

Thanks,
Christian.

> ---
>   drivers/gpu/drm/ttm/ttm_bo_util.c | 1 +
>   1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
> index 1cbfb00c1d65..a90bbbd91910 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo_util.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
> @@ -238,6 +238,7 @@ static int ttm_buffer_object_transfer(struct ttm_buffer_object *bo,
>   
>   	if (fbo->base.resource) {
>   		ttm_resource_set_bo(fbo->base.resource, &fbo->base);
> +		ttm_bo_set_bulk_move(&fbo->base, NULL);
>   		bo->resource = NULL;
>   	}
>
Christian König Sept. 1, 2022, 11:13 a.m. UTC | #3
Am 01.09.22 um 13:11 schrieb Christian König:
> Am 01.09.22 um 11:29 schrieb ZhenGuo Yin:
>> [Why]
>> Ghost BO is released with non-empty bulk move object. There is a
>> warning trace:
>> WARNING: CPU: 19 PID: 1582 at ttm/ttm_bo.c:366 
>> ttm_bo_release+0x2e1/0x2f0 [amdttm]
>> Call Trace:
>>    amddma_resv_reserve_fences+0x10d/0x1f0 [amdkcl]
>>    amdttm_bo_put+0x28/0x30 [amdttm]
>>    amdttm_bo_move_accel_cleanup+0x126/0x200 [amdttm]
>>    amdgpu_bo_move+0x1a8/0x770 [amdgpu]
>>    ttm_bo_handle_move_mem+0xb0/0x140 [amdttm]
>>    amdttm_bo_validate+0xbf/0x100 [amdttm]
>>
>> [How]
>> The resource of the ghost BO should be moved to the LRU directly, instead of
>> using bulk move. The bulk move object of the ghost BO should be set to NULL
>> before calling ttm_bo_move_to_lru_tail_unlocked.
>>
>> Fixed: 5b951e487fd6bf5f ("drm/ttm: fix bulk move handling v2")
>> Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com>
>
> Good catch, but the fix is not 100% correct. Please rather just NULL 
> the member while initializing the BO structure.
>
> E.g. something like this:
>
>  ....
>  fbo->base.pin_count = 0;
> +fbo->base.bulk_move= NULL;
>  if (bo->type != ttm_bo_type_sg)
>  ....

On the other hand, thinking about it, that won't work either.

You need to set bulk_move to NULL manually in an else clause or
something like that.

Regards,
Christian.

>
> Thanks,
> Christian.
>
>> ---
>>   drivers/gpu/drm/ttm/ttm_bo_util.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c 
>> b/drivers/gpu/drm/ttm/ttm_bo_util.c
>> index 1cbfb00c1d65..a90bbbd91910 100644
>> --- a/drivers/gpu/drm/ttm/ttm_bo_util.c
>> +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
>> @@ -238,6 +238,7 @@ static int ttm_buffer_object_transfer(struct 
>> ttm_buffer_object *bo,
>>         if (fbo->base.resource) {
>>           ttm_resource_set_bo(fbo->base.resource, &fbo->base);
>> +        ttm_bo_set_bulk_move(&fbo->base, NULL);
>>           bo->resource = NULL;
>>       }
>
Yin, ZhenGuo (Chris) Sept. 5, 2022, 7:59 a.m. UTC | #4
Inside the function ttm_bo_set_bulk_move, it calls 
ttm_resource_del_bulk_move to remove the old resource from the bulk_move 
list.

If we set bulk_move to NULL manually as suggested, it seems the old
resource attached to the ghost BO won't be removed from the bulk_move list.
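
To make the concern concrete, here is a minimal user-space sketch of the difference between going through the setter and assigning NULL directly. Everything in it is a toy stand-in: the `toy_*` types and functions are hypothetical and only model the behaviour described above (the setter drops the resource from the old bulk move before switching); they are not the kernel's definitions and they ignore locking entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for illustration only; NOT the kernel's structures. */
struct toy_bulk_move { int tracked; };	/* resources currently tracked */
struct toy_resource { int unused; };
struct toy_bo {
	struct toy_resource *resource;
	struct toy_bulk_move *bulk_move;
};

/* Simplified setter: untrack the resource from the old bulk move, then
 * switch to the new one and track the resource there. */
static void toy_bo_set_bulk_move(struct toy_bo *bo, struct toy_bulk_move *bulk)
{
	if (bo->bulk_move == bulk)
		return;
	if (bo->resource && bo->bulk_move)
		bo->bulk_move->tracked--;
	bo->bulk_move = bulk;
	if (bo->resource && bulk)
		bulk->tracked++;
}

/* Detach with a plain assignment; returns what the bulk move still tracks. */
static int toy_detach_by_assignment(void)
{
	struct toy_bulk_move bulk = { 0 };
	struct toy_resource res = { 0 };
	struct toy_bo bo = { &res, NULL };

	toy_bo_set_bulk_move(&bo, &bulk);
	bo.bulk_move = NULL;		/* pointer cleared, accounting untouched */
	return bulk.tracked;
}

/* Detach through the setter; returns what the bulk move still tracks. */
static int toy_detach_by_setter(void)
{
	struct toy_bulk_move bulk = { 0 };
	struct toy_resource res = { 0 };
	struct toy_bo bo = { &res, NULL };

	toy_bo_set_bulk_move(&bo, &bulk);
	toy_bo_set_bulk_move(&bo, NULL);	/* untracks before clearing */
	return bulk.tracked;
}
```

In this model the plain assignment leaves the resource counted in the bulk move even though the BO no longer points at it, while the setter leaves nothing behind. That is the situation I am worried about when a resource is still attached.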

On 9/1/2022 7:13 PM, Christian König wrote:
> Am 01.09.22 um 13:11 schrieb Christian König:
>> Am 01.09.22 um 11:29 schrieb ZhenGuo Yin:
>>> [Why]
>>> Ghost BO is released with non-empty bulk move object. There is a
>>> warning trace:
>>> WARNING: CPU: 19 PID: 1582 at ttm/ttm_bo.c:366 
>>> ttm_bo_release+0x2e1/0x2f0 [amdttm]
>>> Call Trace:
>>>    amddma_resv_reserve_fences+0x10d/0x1f0 [amdkcl]
>>>    amdttm_bo_put+0x28/0x30 [amdttm]
>>>    amdttm_bo_move_accel_cleanup+0x126/0x200 [amdttm]
>>>    amdgpu_bo_move+0x1a8/0x770 [amdgpu]
>>>    ttm_bo_handle_move_mem+0xb0/0x140 [amdttm]
>>>    amdttm_bo_validate+0xbf/0x100 [amdttm]
>>>
>>> [How]
>>> The resource of the ghost BO should be moved to the LRU directly, instead of
>>> using bulk move. The bulk move object of the ghost BO should be set to NULL
>>> before calling ttm_bo_move_to_lru_tail_unlocked.
>>>
>>> Fixed: 5b951e487fd6bf5f ("drm/ttm: fix bulk move handling v2")
>>> Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com>
>>
>> Good catch, but the fix is not 100% correct. Please rather just NULL 
>> the member while initializing the BO structure.
>>
>> E.g. something like this:
>>
>>  ....
>>  fbo->base.pin_count = 0;
>> +fbo->base.bulk_move= NULL;
>>  if (bo->type != ttm_bo_type_sg)
>>  ....
>
> On the other hand, thinking about it, that won't work either.
>
> You need to set bulk_move to NULL manually in an else clause or
> something like that.
>
> Regards,
> Christian.
>
>>
>> Thanks,
>> Christian.
>>
>>> ---
>>>   drivers/gpu/drm/ttm/ttm_bo_util.c | 1 +
>>>   1 file changed, 1 insertion(+)
>>>
>>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c 
>>> b/drivers/gpu/drm/ttm/ttm_bo_util.c
>>> index 1cbfb00c1d65..a90bbbd91910 100644
>>> --- a/drivers/gpu/drm/ttm/ttm_bo_util.c
>>> +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
>>> @@ -238,6 +238,7 @@ static int ttm_buffer_object_transfer(struct 
>>> ttm_buffer_object *bo,
>>>         if (fbo->base.resource) {
>>>           ttm_resource_set_bo(fbo->base.resource, &fbo->base);
>>> +        ttm_bo_set_bulk_move(&fbo->base, NULL);
>>>           bo->resource = NULL;
>>>       }
>>
>
Christian König Sept. 5, 2022, 11:05 a.m. UTC | #5
Yeah, I realized that as well after sending the first mail.

The problem is that we keep the bulk move around when there currently 
isn't any resource associated with the template.

So the correct code should look something like this:

if (fbo->base.resource) {
     ttm_resource_set_bo(fbo->base.resource, &fbo->base);
     bo->resource = NULL;
     ttm_bo_set_bulk_move(&fbo->base, NULL);
} else {
     fbo->base.bulk_move = NULL;
}
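
As a rough illustration of why both branches are needed, the following user-space sketch models the if/else shape above. All `toy_*` names are hypothetical stand-ins, not kernel definitions; the release check only mimics the warning from the commit message (a BO released with a non-NULL bulk move), and locking is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for illustration only; NOT the kernel's structures. */
struct toy_bulk_move { int tracked; };	/* resources currently tracked */
struct toy_resource { int unused; };
struct toy_bo {
	struct toy_resource *resource;
	struct toy_bulk_move *bulk_move;
};

/* Simplified setter: untrack from the old bulk move, then switch. */
static void toy_bo_set_bulk_move(struct toy_bo *bo, struct toy_bulk_move *bulk)
{
	if (bo->bulk_move == bulk)
		return;
	if (bo->resource && bo->bulk_move)
		bo->bulk_move->tracked--;
	bo->bulk_move = bulk;
	if (bo->resource && bulk)
		bulk->tracked++;
}

/* Build a ghost from bo, following the if/else shape discussed above. */
static struct toy_bo toy_make_ghost(struct toy_bo *bo)
{
	struct toy_bo ghost = *bo;	/* the copy inherits bulk_move */

	if (ghost.resource) {
		bo->resource = NULL;			/* ghost owns the resource now */
		toy_bo_set_bulk_move(&ghost, NULL);	/* untrack, then clear */
	} else {
		ghost.bulk_move = NULL;			/* nothing tracked, just clear */
	}
	return ghost;
}

/* Models the release-time check: true means the warning would fire. */
static bool toy_release_would_warn(const struct toy_bo *bo)
{
	return bo->bulk_move != NULL;
}
```

In this model the ghost never reaches release with a bulk move still set, whether or not it took over a resource, and the bulk move's accounting drops back to zero in the resource case.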

Regards,
Christian.

Am 05.09.22 um 09:59 schrieb Yin, ZhenGuo (Chris):
> Inside the function ttm_bo_set_bulk_move, it calls 
> ttm_resource_del_bulk_move to remove the old resource from the 
> bulk_move list.
>
> If we set bulk_move to NULL manually as suggested, it seems the old
> resource attached to the ghost BO won't be removed from the
> bulk_move list.
>
> On 9/1/2022 7:13 PM, Christian König wrote:
>> Am 01.09.22 um 13:11 schrieb Christian König:
>>> Am 01.09.22 um 11:29 schrieb ZhenGuo Yin:
>>>> [Why]
>>>> Ghost BO is released with non-empty bulk move object. There is a
>>>> warning trace:
>>>> WARNING: CPU: 19 PID: 1582 at ttm/ttm_bo.c:366 
>>>> ttm_bo_release+0x2e1/0x2f0 [amdttm]
>>>> Call Trace:
>>>>    amddma_resv_reserve_fences+0x10d/0x1f0 [amdkcl]
>>>>    amdttm_bo_put+0x28/0x30 [amdttm]
>>>>    amdttm_bo_move_accel_cleanup+0x126/0x200 [amdttm]
>>>>    amdgpu_bo_move+0x1a8/0x770 [amdgpu]
>>>>    ttm_bo_handle_move_mem+0xb0/0x140 [amdttm]
>>>>    amdttm_bo_validate+0xbf/0x100 [amdttm]
>>>>
>>>> [How]
>>>> The resource of the ghost BO should be moved to the LRU directly, instead of
>>>> using bulk move. The bulk move object of the ghost BO should be set to NULL
>>>> before calling ttm_bo_move_to_lru_tail_unlocked.
>>>>
>>>> Fixed: 5b951e487fd6bf5f ("drm/ttm: fix bulk move handling v2")
>>>> Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com>
>>>
>>> Good catch, but the fix is not 100% correct. Please rather just NULL 
>>> the member while initializing the BO structure.
>>>
>>> E.g. something like this:
>>>
>>>  ....
>>>  fbo->base.pin_count = 0;
>>> +fbo->base.bulk_move= NULL;
>>>  if (bo->type != ttm_bo_type_sg)
>>>  ....
>>
>> On the other hand, thinking about it, that won't work either.
>>
>> You need to set bulk_move to NULL manually in an else clause or
>> something like that.
>>
>> Regards,
>> Christian.
>>
>>>
>>> Thanks,
>>> Christian.
>>>
>>>> ---
>>>>   drivers/gpu/drm/ttm/ttm_bo_util.c | 1 +
>>>>   1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c 
>>>> b/drivers/gpu/drm/ttm/ttm_bo_util.c
>>>> index 1cbfb00c1d65..a90bbbd91910 100644
>>>> --- a/drivers/gpu/drm/ttm/ttm_bo_util.c
>>>> +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
>>>> @@ -238,6 +238,7 @@ static int ttm_buffer_object_transfer(struct 
>>>> ttm_buffer_object *bo,
>>>>         if (fbo->base.resource) {
>>>>           ttm_resource_set_bo(fbo->base.resource, &fbo->base);
>>>> +        ttm_bo_set_bulk_move(&fbo->base, NULL);
>>>>           bo->resource = NULL;
>>>>       }
>>>
>>
Patch

diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
index 1cbfb00c1d65..a90bbbd91910 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_util.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
@@ -238,6 +238,7 @@  static int ttm_buffer_object_transfer(struct ttm_buffer_object *bo,
 
 	if (fbo->base.resource) {
 		ttm_resource_set_bo(fbo->base.resource, &fbo->base);
+		ttm_bo_set_bulk_move(&fbo->base, NULL);
 		bo->resource = NULL;
 	}