drm/amdgpu: Let userptr BO ttm have TTM_PAGE_FLAG_SG set

Message ID 20210520031523.12834-1-xinhui.pan@amd.com (mailing list archive)
State New, archived

Commit Message

Pan, Xinhui May 20, 2021, 3:15 a.m. UTC
We have seen memory corruption caused by an unexpected swapout/swapin.

The swapout function creates a swap storage that is filled with zeros and
sets TTM_PAGE_FLAG_SWAPPED in ttm->page_flags. But because the userptr BO
ttm has no backend pages at that time, no real data is swapped out to the
swap storage.

The swapin function is called during userptr BO populate because
TTM_PAGE_FLAG_SWAPPED is set. Here is the problem: we swap in data from
the swap storage to the ttm backend memory, which causes that memory to
be overwritten.

CPU 1                                           CPU 2
kfd alloc BO A(userptr)                         alloc BO B(GTT)
    -> init -> validate(create ttm)             -> init -> validate -> populate
    init_user_pages                               -> swapout BO A
        -> get_user_pages (fill up ttm->pages)
         -> validate -> populate
          -> swapin BO A // memory overwritten

To fix this issue, we can set TTM_PAGE_FLAG_SG when we create the userptr
BO ttm. The swapout function will then not swap it out.

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 +---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c          | 4 ++++
 2 files changed, 5 insertions(+), 3 deletions(-)

Comments

Felix Kuehling May 20, 2021, 3:37 a.m. UTC | #1
I think this works for KFD userptr BOs. But this problem is probably not
specific to KFD. It's only most obvious with KFD because we rely so
heavily on userptrs.

I don't really understand why we're messing with TTM_PAGE_FLAG_SG in
amdgpu_ttm_tt_populate and amdgpu_ttm_tt_unpopulate. And why are userptr
BOs created as ttm_bo_type_device, not ttm_bo_type_sg? Christian, do you
know about the history of this code?

Either way, the patch is

Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>

Thanks for looking into this!

Regards,
  Felix

Christian König May 20, 2021, 6:58 a.m. UTC | #2
On 20.05.21 at 05:15, xinhui pan wrote:
> We have seen memory corruption caused by an unexpected swapout/swapin.
>
> The swapout function creates a swap storage that is filled with zeros and
> sets TTM_PAGE_FLAG_SWAPPED in ttm->page_flags. But because the userptr BO
> ttm has no backend pages at that time, no real data is swapped out to the
> swap storage.
>
> The swapin function is called during userptr BO populate because
> TTM_PAGE_FLAG_SWAPPED is set. Here is the problem: we swap in data from
> the swap storage to the ttm backend memory, which causes that memory to
> be overwritten.
>
> CPU 1                                           CPU 2
> kfd alloc BO A(userptr)                         alloc BO B(GTT)
>     -> init -> validate(create ttm)             -> init -> validate -> populate
>     init_user_pages                               -> swapout BO A
>         -> get_user_pages (fill up ttm->pages)
>          -> validate -> populate
>           -> swapin BO A // memory overwritten
>
> To fix this issue, we can set TTM_PAGE_FLAG_SG when we create the userptr
> BO ttm. The swapout function will then not swap it out.

That's a possible solution, but I would rather have the underlying
problem in TTM fixed.

Christian.

Patch

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 928e8d57cd08..9a6ea966ddb2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1410,7 +1410,7 @@  int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
 	} else if (flags & KFD_IOC_ALLOC_MEM_FLAGS_USERPTR) {
 		domain = AMDGPU_GEM_DOMAIN_GTT;
 		alloc_domain = AMDGPU_GEM_DOMAIN_CPU;
-		alloc_flags = 0;
+		alloc_flags = AMDGPU_AMDKFD_CREATE_USERPTR_BO;
 		if (!offset || !*offset)
 			return -EINVAL;
 		user_addr = untagged_addr(*offset);
@@ -1477,8 +1477,6 @@  int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
 	}
 	bo->kfd_bo = *mem;
 	(*mem)->bo = bo;
-	if (user_addr)
-		bo->flags |= AMDGPU_AMDKFD_CREATE_USERPTR_BO;
 
 	(*mem)->va = va;
 	(*mem)->domain = domain;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index c7f5cc503601..5b3f45637fb5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1119,6 +1119,10 @@  static struct ttm_tt *amdgpu_ttm_tt_create(struct ttm_buffer_object *bo,
 		kfree(gtt);
 		return NULL;
 	}
+
+	if (abo->flags & AMDGPU_AMDKFD_CREATE_USERPTR_BO)
+		gtt->ttm.page_flags |= TTM_PAGE_FLAG_SG;
+
 	return &gtt->ttm;
 }