
Re: Re: [PATCH] drm/ttm: Put BO in its memory manager's lru list

Message ID DM4PR12MB51657C53FAA6C096884118AD87929@DM4PR12MB5165.namprd12.prod.outlook.com (mailing list archive)
State New, archived
Series Re: Re: [PATCH] drm/ttm: Put BO in its memory manager's lru list

Commit Message

Pan, Xinhui Nov. 9, 2021, 1:05 p.m. UTC
[AMD Official Use Only]

Yes, a stable tag is needed. The Vulkan guys say 5.14 hits this issue too.

I think that amdgpu_bo_move() does not support copying from sysMem to sysMem correctly;
maybe something like the change below is needed.

Otherwise, amdgpu_move_blit() is called to do the system memory copy, which uses a wrong address:
 206         /* Map only what can't be accessed directly */
 207         if (!tmz && mem->start != AMDGPU_BO_INVALID_OFFSET) {
 208                 *addr = amdgpu_ttm_domain_start(adev, mem->mem_type) +
 209                         mm_cur->start;
 210                 return 0;
 211         }

At line 208, *addr is zero. So when amdgpu_copy_buffer() submits a job with such an address, a page fault happens.
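
For reference, amdgpu_ttm_domain_start() only knows a real aperture start for GTT and
VRAM; roughly like this (paraphrased from memory, not a verbatim copy of the file):

        uint64_t amdgpu_ttm_domain_start(struct amdgpu_device *adev, uint32_t type)
        {
                switch (type) {
                case TTM_PL_TT:
                        return adev->gmc.gart_start;
                case TTM_PL_VRAM:
                        return adev->gmc.vram_start;
                }
                /* TTM_PL_SYSTEM and anything else falls through to 0 */
                return 0;
        }

So for a TTM_PL_SYSTEM resource the domain start is 0 and *addr ends up being just
mm_cur->start, which would line up with the 0x0/0x1000/0x2000 fault addresses from the
dmesg in the earlier mail.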

Comments

Christian König Nov. 9, 2021, 1:16 p.m. UTC | #1
Yeah, but that should never happen in the first place.

Even when the BO is on the wrong LRU, TTM should check that beforehand.

In other words, when we pick a BO from the LRU we should still double-check
bo->resource->mem_type to make sure it is what we are searching for.
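
Roughly something like this (completely untested sketch, just to illustrate the idea;
the names bo->resource->mem_type and place->mem_type are taken from the 5.15-era TTM
code and this is not meant as the actual fix):

        /* While walking man->lru[i] in ttm_mem_evict_first() */
        list_for_each_entry(bo, &man->lru[i], lru) {
                /* Skip stale entries whose resource no longer matches this manager */
                if (!bo->resource || bo->resource->mem_type != place->mem_type)
                        continue;
                /* ... existing eviction candidate checks ... */
        }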

Christian.

On 09.11.21 at 14:05, Pan, Xinhui wrote:
> [AMD Official Use Only]
>
> Yes, a stable tag is needed. The Vulkan guys say 5.14 hits this issue too.
>
> I think that amdgpu_bo_move() does not support copying from sysMem to sysMem correctly;
> maybe something like the change below is needed.
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index c83ef42ca702..aa63ae7ddf1e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -485,7 +485,8 @@ static int amdgpu_bo_move(struct ttm_buffer_object *bo, bool evict,
>          }
>          if (old_mem->mem_type == TTM_PL_SYSTEM &&
>              (new_mem->mem_type == TTM_PL_TT ||
> -            new_mem->mem_type == AMDGPU_PL_PREEMPT)) {
> +            new_mem->mem_type == AMDGPU_PL_PREEMPT ||
> +            new_mem->mem_type == TTM_PL_SYSTEM)) {
>                  ttm_bo_move_null(bo, new_mem);
>                  goto out;
>          }
>
> Otherwise, amdgpu_move_blit() is called to do the system memory copy, which uses a wrong address:
>   206         /* Map only what can't be accessed directly */
>   207         if (!tmz && mem->start != AMDGPU_BO_INVALID_OFFSET) {
>   208                 *addr = amdgpu_ttm_domain_start(adev, mem->mem_type) +
>   209                         mm_cur->start;
>   210                 return 0;
>   211         }
>
> At line 208, *addr is zero. So when amdgpu_copy_buffer() submits a job with such an address, a page fault happens.
>
>
> ________________________________________
> From: Koenig, Christian <Christian.Koenig@amd.com>
> Sent: November 9, 2021 20:35
> To: Pan, Xinhui; amd-gfx@lists.freedesktop.org
> Cc: dri-devel@lists.freedesktop.org
> Subject: Re: Re: [PATCH] drm/ttm: Put BO in its memory manager's lru list
>
> Mhm, I'm not sure what the rationale behind that is.
>
> Not moving the BO would make things less efficient, but should never
> cause a crash.
>
> Maybe we should add a CC: stable tag and push it to -fixes instead?
>
> Christian.
>
> On 09.11.21 at 13:28, Pan, Xinhui wrote:
>> [AMD Official Use Only]
>>
>> I hit a Vulkan CTS test hang with Navi23.
>>
>> dmesg shows GMC page faults at addresses 0x0, 0x1000, 0x2000, ...
>> And some debug logs also show amdgpu copying one BO from the system domain to the system domain, which is really weird.
>> ________________________________________
>> From: Koenig, Christian <Christian.Koenig@amd.com>
>> Sent: November 9, 2021 20:20
>> To: Pan, Xinhui; amd-gfx@lists.freedesktop.org
>> Cc: dri-devel@lists.freedesktop.org
>> Subject: Re: [PATCH] drm/ttm: Put BO in its memory manager's lru list
>>
>> On 09.11.21 at 12:19, xinhui pan wrote:
>>> After we move a BO to a new memory region, we should put it on the
>>> new memory manager's LRU list, regardless of whether we unlock the resv or not.
>>>
>>> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>> Interesting find, did you trigger that somehow or did you just stumble
>> over it by reading the code?
>>
>> Patch is Reviewed-by: Christian König <christian.koenig@amd.com>, I will
>> pick that up for drm-misc-next.
>>
>> Thanks,
>> Christian.
>>
>>> ---
>>>     drivers/gpu/drm/ttm/ttm_bo.c | 2 ++
>>>     1 file changed, 2 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
>>> index f1367107925b..e307004f0b28 100644
>>> --- a/drivers/gpu/drm/ttm/ttm_bo.c
>>> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
>>> @@ -701,6 +701,8 @@ int ttm_mem_evict_first(struct ttm_device *bdev,
>>>         ret = ttm_bo_evict(bo, ctx);
>>>         if (locked)
>>>                 ttm_bo_unreserve(bo);
>>> +     else
>>> +             ttm_bo_move_to_lru_tail_unlocked(bo);
>>>
>>>         ttm_bo_put(bo);
>>>         return ret;

Patch

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index c83ef42ca702..aa63ae7ddf1e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -485,7 +485,8 @@ static int amdgpu_bo_move(struct ttm_buffer_object *bo, bool evict,
        }
        if (old_mem->mem_type == TTM_PL_SYSTEM &&
            (new_mem->mem_type == TTM_PL_TT ||
-            new_mem->mem_type == AMDGPU_PL_PREEMPT)) {
+            new_mem->mem_type == AMDGPU_PL_PREEMPT ||
+            new_mem->mem_type == TTM_PL_SYSTEM)) {
                ttm_bo_move_null(bo, new_mem);
                goto out;
        }