diff mbox series

[1/2] drm/amdgpu: fix missing fence reserve in amdgpu_vm_sdma_commit

Message ID 20230621162652.10875-2-Yunxiang.Li@amd.com (mailing list archive)
State New, archived
Headers show
Series drm: fix soft lockup when saturating vram | expand

Commit Message

Yunxiang Li June 21, 2023, 4:23 p.m. UTC
When amdgpu_bo_fence is converted to dma_resv_add_fence, the reserve was
removed in that process, so putting it back.

Fixes: 4247084057cf ("drm/amdgpu: use DMA_RESV_USAGE_BOOKKEEP v2")
Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Christian König June 22, 2023, 7:39 a.m. UTC | #1
Am 21.06.23 um 18:23 schrieb Yunxiang Li:
> When amdgpu_bo_fence is converted to dma_resv_add_fence, the reserve was
> removed in that process, so putting it back.

The slots for this are reserved in amdgpu_vm_get_pd_bo():

         /* Two for VM updates, one for TTM and one for the CS job */
         entry->tv.num_shared = 4;

The problem here is rather that we didn't took the gang submit into 
account. See my patch earlier this week I've CCed you on.
.
>
> Fixes: 4247084057cf ("drm/amdgpu: use DMA_RESV_USAGE_BOOKKEEP v2")
> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 4 ++++
>   1 file changed, 4 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index 349416e176a1..f590b97853d9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -120,6 +120,7 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>   	struct amdgpu_ib *ib = p->job->ibs;
>   	struct amdgpu_ring *ring;
>   	struct dma_fence *f;
> +	int r;
>   
>   	ring = container_of(p->vm->delayed.rq->sched, struct amdgpu_ring,
>   			    sched);
> @@ -135,6 +136,9 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>   		swap(p->vm->last_unlocked, tmp);
>   		dma_fence_put(tmp);
>   	} else {
> +		r = dma_resv_reserve_fences(p->vm->root.bo->tbo.base.resv, 1);
> +		if (r)
> +			return r;

That is simply illegal and would potentially lead to memory corruption.

Regards,
Christian.

>   		dma_resv_add_fence(p->vm->root.bo->tbo.base.resv, f,
>   				   DMA_RESV_USAGE_BOOKKEEP);
>   	}
Yunxiang Li June 22, 2023, 2:11 p.m. UTC | #2
[Public]

If I take out this reserve and the change that makes dma_resv_reserve_fences nestable, I get a BUG in the add_fence, so we might be missing a reserve somewhere else then.

Teddy

-----Original Message-----
From: Koenig, Christian <Christian.Koenig@amd.com>
Sent: Thursday, June 22, 2023 3:39
To: Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>; dri-devel@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>
Subject: Re: [PATCH 1/2] drm/amdgpu: fix missing fence reserve in amdgpu_vm_sdma_commit

Am 21.06.23 um 18:23 schrieb Yunxiang Li:
> When amdgpu_bo_fence is converted to dma_resv_add_fence, the reserve
> was removed in that process, so putting it back.

The slots for this are reserved in amdgpu_vm_get_pd_bo():

         /* Two for VM updates, one for TTM and one for the CS job */
         entry->tv.num_shared = 4;

The problem here is rather that we didn't took the gang submit into account. See my patch earlier this week I've CCed you on.
.
>
> Fixes: 4247084057cf ("drm/amdgpu: use DMA_RESV_USAGE_BOOKKEEP v2")
> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 4 ++++
>   1 file changed, 4 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index 349416e176a1..f590b97853d9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -120,6 +120,7 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>       struct amdgpu_ib *ib = p->job->ibs;
>       struct amdgpu_ring *ring;
>       struct dma_fence *f;
> +     int r;
>
>       ring = container_of(p->vm->delayed.rq->sched, struct amdgpu_ring,
>                           sched);
> @@ -135,6 +136,9 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>               swap(p->vm->last_unlocked, tmp);
>               dma_fence_put(tmp);
>       } else {
> +             r = dma_resv_reserve_fences(p->vm->root.bo->tbo.base.resv, 1);
> +             if (r)
> +                     return r;

That is simply illegal and would potentially lead to memory corruption.

Regards,
Christian.

>               dma_resv_add_fence(p->vm->root.bo->tbo.base.resv, f,
>                                  DMA_RESV_USAGE_BOOKKEEP);
>       }
Yunxiang Li June 22, 2023, 2:44 p.m. UTC | #3
[Public]

Hmm, seems all the stack trace I get now are from amdgpu_gem_va_ioctl, so I suppose the issue is only on that path now, the gang submit patch fixed it for amdgpu_cs_ioctl.

Teddy

-----Original Message-----
From: Li, Yunxiang (Teddy)
Sent: Thursday, June 22, 2023 10:12
To: Koenig, Christian <Christian.Koenig@amd.com>; dri-devel@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>
Subject: RE: [PATCH 1/2] drm/amdgpu: fix missing fence reserve in amdgpu_vm_sdma_commit

If I take out this reserve and the change that makes dma_resv_reserve_fences nestable, I get a BUG in the add_fence, so we might be missing a reserve somewhere else then.

Teddy

-----Original Message-----
From: Koenig, Christian <Christian.Koenig@amd.com>
Sent: Thursday, June 22, 2023 3:39
To: Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>; dri-devel@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>
Subject: Re: [PATCH 1/2] drm/amdgpu: fix missing fence reserve in amdgpu_vm_sdma_commit

Am 21.06.23 um 18:23 schrieb Yunxiang Li:
> When amdgpu_bo_fence is converted to dma_resv_add_fence, the reserve
> was removed in that process, so putting it back.

The slots for this are reserved in amdgpu_vm_get_pd_bo():

         /* Two for VM updates, one for TTM and one for the CS job */
         entry->tv.num_shared = 4;

The problem here is rather that we didn't took the gang submit into account. See my patch earlier this week I've CCed you on.
.
>
> Fixes: 4247084057cf ("drm/amdgpu: use DMA_RESV_USAGE_BOOKKEEP v2")
> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 4 ++++
>   1 file changed, 4 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index 349416e176a1..f590b97853d9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -120,6 +120,7 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>       struct amdgpu_ib *ib = p->job->ibs;
>       struct amdgpu_ring *ring;
>       struct dma_fence *f;
> +     int r;
>
>       ring = container_of(p->vm->delayed.rq->sched, struct amdgpu_ring,
>                           sched);
> @@ -135,6 +136,9 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>               swap(p->vm->last_unlocked, tmp);
>               dma_fence_put(tmp);
>       } else {
> +             r = dma_resv_reserve_fences(p->vm->root.bo->tbo.base.resv, 1);
> +             if (r)
> +                     return r;

That is simply illegal and would potentially lead to memory corruption.

Regards,
Christian.

>               dma_resv_add_fence(p->vm->root.bo->tbo.base.resv, f,
>                                  DMA_RESV_USAGE_BOOKKEEP);
>       }
diff mbox series

Patch

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
index 349416e176a1..f590b97853d9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
@@ -120,6 +120,7 @@  static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
 	struct amdgpu_ib *ib = p->job->ibs;
 	struct amdgpu_ring *ring;
 	struct dma_fence *f;
+	int r;
 
 	ring = container_of(p->vm->delayed.rq->sched, struct amdgpu_ring,
 			    sched);
@@ -135,6 +136,9 @@  static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
 		swap(p->vm->last_unlocked, tmp);
 		dma_fence_put(tmp);
 	} else {
+		r = dma_resv_reserve_fences(p->vm->root.bo->tbo.base.resv, 1);
+		if (r)
+			return r;
 		dma_resv_add_fence(p->vm->root.bo->tbo.base.resv, f,
 				   DMA_RESV_USAGE_BOOKKEEP);
 	}