diff mbox series

drm/nouveau: prime: fix ttm_bo_delayed_delete oops

Message ID Z9GHj-edWJmyzpdY@debian.local (mailing list archive)
State New

Commit Message

Chris Bainbridge March 12, 2025, 1:09 p.m. UTC
Fix an oops in ttm_bo_delayed_delete which results from dereferencing a
dangling pointer:

Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b7b: 0000 [#1] PREEMPT SMP
CPU: 4 UID: 0 PID: 1082 Comm: kworker/u65:2 Not tainted 6.14.0-rc4-00267-g505460b44513-dirty #216
Hardware name: LENOVO 82N6/LNVNB161216, BIOS GKCN65WW 01/16/2024
Workqueue: ttm ttm_bo_delayed_delete [ttm]
RIP: 0010:dma_resv_iter_first_unlocked+0x55/0x290
Code: 31 f6 48 c7 c7 00 2b fa aa e8 97 bd 52 ff e8 a2 c1 53 00 5a 85 c0 74 48 e9 88 01 00 00 4c 89 63 20 4d 85 e4 0f 84 30 01 00 00 <41> 8b 44 24 10 c6 43 2c 01 48 89 df 89 43 28 e8 97 fd ff ff 4c 8b
RSP: 0018:ffffbf9383473d60 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffffbf9383473d88 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffbf9383473d78 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 6b6b6b6b6b6b6b6b
R13: ffffa003bbf78580 R14: ffffa003a6728040 R15: 00000000000383cc
FS:  0000000000000000(0000) GS:ffffa00991c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000758348024dd0 CR3: 000000012c259000 CR4: 0000000000f50ef0
PKRU: 55555554
Call Trace:
 <TASK>
 ? __die_body.cold+0x19/0x26
 ? die_addr+0x3d/0x70
 ? exc_general_protection+0x159/0x460
 ? asm_exc_general_protection+0x27/0x30
 ? dma_resv_iter_first_unlocked+0x55/0x290
 dma_resv_wait_timeout+0x56/0x100
 ttm_bo_delayed_delete+0x69/0xb0 [ttm]
 process_one_work+0x217/0x5c0
 worker_thread+0x1c8/0x3d0
 ? apply_wqattrs_cleanup.part.0+0xc0/0xc0
 kthread+0x10b/0x240
 ? kthreads_online_cpu+0x140/0x140
 ret_from_fork+0x40/0x70
 ? kthreads_online_cpu+0x140/0x140
 ret_from_fork_asm+0x11/0x20
 </TASK>
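
The register dump already hints at a use-after-free. R12 holds
0x6b6b6b6b6b6b6b6b, which matches POISON_FREE (0x6b), the byte the slab
allocator writes into freed objects when poisoning is enabled, and the
faulting instruction bytes "<41> 8b 44 24 10" decode to a load from
0x10(%r12), which gives the reported address 0x6b6b6b6b6b6b6b7b. The
user-space C sketch below only redoes that arithmetic. The helper names are
made up for illustration, and the canonical-address check assumes 4-level
paging (48-bit virtual addresses).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define POISON_FREE 0x6b /* same value as in include/linux/poison.h */

/* True if every byte of v equals POISON_FREE. */
static bool is_free_poison(uint64_t v)
{
	for (int i = 0; i < 8; i++)
		if (((v >> (8 * i)) & 0xff) != POISON_FREE)
			return false;
	return true;
}

/* Assumes 48-bit virtual addresses: bits 63..47 must all be equal. */
static bool is_canonical_48(uint64_t addr)
{
	uint64_t top = addr >> 47;

	return top == 0 || top == 0x1ffff;
}

/* Address touched by "mov 0x10(%r12),%eax" for a given R12. */
static uint64_t fault_addr(uint64_t r12)
{
	return r12 + 0x10;
}

/* Checks the values from the oops above against the helpers. */
static void check_oops_values(void)
{
	const uint64_t r12 = 0x6b6b6b6b6b6b6b6bULL; /* R12 in the dump */

	assert(is_free_poison(r12));
	assert(fault_addr(r12) == 0x6b6b6b6b6b6b6b7bULL);
	assert(!is_canonical_48(fault_addr(r12)));
	assert(is_canonical_48(0xffffa003bbf78580ULL)); /* R13, a valid pointer */
}
```

Calling check_oops_values() passes silently, which is consistent with the
pointer in R12 having been read from memory that was already freed.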

The cause of this is:

- drm_prime_gem_destroy calls dma_buf_put(dma_buf) which releases the
  reference to the shared dma_buf. The reference count drops to 0, so the
  dma_buf is destroyed, which in turn decrements the corresponding
  amdgpu_bo reference count to 0, and the amdgpu_bo is destroyed -
  calling drm_gem_object_release then dma_resv_fini (which destroys the
  reservation object), then finally freeing the amdgpu_bo.

- nouveau_bo obj->bo.base.resv is now a dangling pointer to the memory
  formerly allocated to the amdgpu_bo.

- nouveau_gem_object_del calls ttm_bo_put(&nvbo->bo) which calls
  ttm_bo_release, which schedules ttm_bo_delayed_delete.

- ttm_bo_delayed_delete runs and dereferences the dangling resv pointer,
  resulting in a general protection fault.

Fix this by moving the drm_prime_gem_destroy call from
nouveau_gem_object_del to nouveau_bo_del_ttm. This ensures that it will
be run after ttm_bo_delayed_delete.
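
The ordering problem and the fix can be sketched as a small user-space
model. This is a toy under stated assumptions, not the driver code: every
toy_* name is hypothetical, reference counting is reduced to a plain int,
and a resv_valid flag stands in for "the memory behind resv is still
allocated" so that the model never touches freed memory itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-ins; none of these are the kernel's real structures. */
struct toy_resv { int fences; };

struct toy_exporter_bo {		/* plays the role of the amdgpu_bo */
	int refcount;
	struct toy_resv resv;
};

struct toy_importer_bo {		/* plays the role of the nouveau_bo */
	struct toy_exporter_bo *exporter; /* reference held via the dma_buf */
	struct toy_resv *resv;		  /* borrowed, points into exporter */
	bool resv_valid;		  /* model-only dangling tracker */
};

static struct toy_importer_bo *toy_import(void)
{
	struct toy_exporter_bo *xbo = calloc(1, sizeof(*xbo));
	struct toy_importer_bo *imp = calloc(1, sizeof(*imp));

	assert(xbo && imp);
	xbo->refcount = 1;		/* the importer's only reference */
	imp->exporter = xbo;
	imp->resv = &xbo->resv;
	imp->resv_valid = true;
	return imp;
}

/* Models drm_prime_gem_destroy(): drops the last exporter reference. */
static void toy_prime_destroy(struct toy_importer_bo *imp)
{
	if (--imp->exporter->refcount == 0) {
		free(imp->exporter);
		imp->resv_valid = false; /* imp->resv now dangles */
	}
	imp->exporter = NULL;
}

/* Models ttm_bo_delayed_delete(): false means it would have oopsed. */
static bool toy_delayed_delete(const struct toy_importer_bo *imp)
{
	if (!imp->resv_valid)
		return false;
	return imp->resv->fences == 0;	/* safe: exporter still alive */
}

/* Runs one teardown; returns true if delayed delete saw a valid resv. */
static bool toy_release(bool destroy_in_gem_free)
{
	struct toy_importer_bo *imp = toy_import();
	bool ok;

	if (destroy_in_gem_free)	/* old: in nouveau_gem_object_del */
		toy_prime_destroy(imp);
	ok = toy_delayed_delete(imp);
	if (!destroy_in_gem_free)	/* new: in nouveau_bo_del_ttm */
		toy_prime_destroy(imp);
	free(imp);
	return ok;
}
```

In this model toy_release(true) returns false (the old ordering reaches the
delayed delete with a dangling resv) and toy_release(false) returns true
(the new ordering keeps the exporter alive until the delayed delete is done).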

Signed-off-by: Chris Bainbridge <chris.bainbridge@gmail.com>
Co-developed-by: Christian König <christian.koenig@amd.com>
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3937
---
 drivers/gpu/drm/drm_prime.c           | 8 ++++++--
 drivers/gpu/drm/nouveau/nouveau_bo.c  | 3 +++
 drivers/gpu/drm/nouveau/nouveau_gem.c | 3 ---
 3 files changed, 9 insertions(+), 5 deletions(-)

Comments

Christian König March 12, 2025, 1:23 p.m. UTC | #1
Am 12.03.25 um 14:09 schrieb Chris Bainbridge:
> Fix an oops in ttm_bo_delayed_delete which results from dereferencing a
> dangling pointer:
>
> Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b7b: 0000 [#1] PREEMPT SMP
> CPU: 4 UID: 0 PID: 1082 Comm: kworker/u65:2 Not tainted 6.14.0-rc4-00267-g505460b44513-dirty #216
> Hardware name: LENOVO 82N6/LNVNB161216, BIOS GKCN65WW 01/16/2024
> Workqueue: ttm ttm_bo_delayed_delete [ttm]
> RIP: 0010:dma_resv_iter_first_unlocked+0x55/0x290
> Code: 31 f6 48 c7 c7 00 2b fa aa e8 97 bd 52 ff e8 a2 c1 53 00 5a 85 c0 74 48 e9 88 01 00 00 4c 89 63 20 4d 85 e4 0f 84 30 01 00 00 <41> 8b 44 24 10 c6 43 2c 01 48 89 df 89 43 28 e8 97 fd ff ff 4c 8b
> RSP: 0018:ffffbf9383473d60 EFLAGS: 00010202
> RAX: 0000000000000001 RBX: ffffbf9383473d88 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> RBP: ffffbf9383473d78 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 6b6b6b6b6b6b6b6b
> R13: ffffa003bbf78580 R14: ffffa003a6728040 R15: 00000000000383cc
> FS:  0000000000000000(0000) GS:ffffa00991c00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000758348024dd0 CR3: 000000012c259000 CR4: 0000000000f50ef0
> PKRU: 55555554
> Call Trace:
>  <TASK>
>  ? __die_body.cold+0x19/0x26
>  ? die_addr+0x3d/0x70
>  ? exc_general_protection+0x159/0x460
>  ? asm_exc_general_protection+0x27/0x30
>  ? dma_resv_iter_first_unlocked+0x55/0x290
>  dma_resv_wait_timeout+0x56/0x100
>  ttm_bo_delayed_delete+0x69/0xb0 [ttm]
>  process_one_work+0x217/0x5c0
>  worker_thread+0x1c8/0x3d0
>  ? apply_wqattrs_cleanup.part.0+0xc0/0xc0
>  kthread+0x10b/0x240
>  ? kthreads_online_cpu+0x140/0x140
>  ret_from_fork+0x40/0x70
>  ? kthreads_online_cpu+0x140/0x140
>  ret_from_fork_asm+0x11/0x20
>  </TASK>
>
> The cause of this is:
>
> - drm_prime_gem_destroy calls dma_buf_put(dma_buf) which releases the
>   reference to the shared dma_buf. The reference count drops to 0, so the
>   dma_buf is destroyed, which in turn decrements the corresponding
>   amdgpu_bo reference count to 0, and the amdgpu_bo is destroyed -
>   calling drm_gem_object_release then dma_resv_fini (which destroys the
>   reservation object), then finally freeing the amdgpu_bo.
>
> - nouveau_bo obj->bo.base.resv is now a dangling pointer to the memory
>   formerly allocated to the amdgpu_bo.
>
> - nouveau_gem_object_del calls ttm_bo_put(&nvbo->bo) which calls
>   ttm_bo_release, which schedules ttm_bo_delayed_delete.
>
> - ttm_bo_delayed_delete runs and dereferences the dangling resv pointer,
>   resulting in a general protection fault.
>
> Fix this by moving the drm_prime_gem_destroy call from
> nouveau_gem_object_del to nouveau_bo_del_ttm. This ensures that it will
> be run after ttm_bo_delayed_delete.
>
> Signed-off-by: Chris Bainbridge <chris.bainbridge@gmail.com>
> Co-developed-by: Christian König <christian.koenig@amd.com>
> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3937

That should probably be Fixes instead of Link; Link is for referencing discussions, not bug reports.

> ---
>  drivers/gpu/drm/drm_prime.c           | 8 ++++++--
>  drivers/gpu/drm/nouveau/nouveau_bo.c  | 3 +++
>  drivers/gpu/drm/nouveau/nouveau_gem.c | 3 ---
>  3 files changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
> index 32a8781cfd67..4b90fa8954d7 100644
> --- a/drivers/gpu/drm/drm_prime.c
> +++ b/drivers/gpu/drm/drm_prime.c
> @@ -929,7 +929,9 @@ EXPORT_SYMBOL(drm_gem_prime_export);
>   * &drm_driver.gem_prime_import_sg_table internally.
>   *
>   * Drivers must arrange to call drm_prime_gem_destroy() from their
> - * &drm_gem_object_funcs.free hook when using this function.
> + * &ttm_buffer_object.destroy hook when using this function,
> + * to avoid the dma_buf being freed while the ttm_buffer_object can still
> + * dereference it.

Looks mostly good to me, but I would write here:

Drivers must arrange to call drm_prime_gem_destroy() from their
&drm_gem_object_funcs.free hook or &ttm_buffer_object.destroy
hook when using this function,

Since it is perfectly possible that drivers don't use TTM.

I also skimmed over all users of drm_prime_gem_destroy(), and except for i915 every driver seems to do the right thing and call the function directly before drm_gem_object_release().

For i915 the code is not straightforward to follow, but since it isn't using TTM I'm pretty sure it should work there as well.

I'm really wondering if we couldn't add the call to drm_prime_gem_destroy() into drm_gem_object_release() and call it a day. It's just one less thing that drivers could get wrong.

Regards,
Christian.

>   */
>  struct drm_gem_object *drm_gem_prime_import_dev(struct drm_device *dev,
>  					    struct dma_buf *dma_buf,
> @@ -999,7 +1001,9 @@ EXPORT_SYMBOL(drm_gem_prime_import_dev);
>   * implementation in drm_gem_prime_fd_to_handle().
>   *
>   * Drivers must arrange to call drm_prime_gem_destroy() from their
> - * &drm_gem_object_funcs.free hook when using this function.
> + * &ttm_buffer_object.destroy hook when using this function,
> + * to avoid the dma_buf being freed while the ttm_buffer_object can still
> + * dereference it.
>   */
>  struct drm_gem_object *drm_gem_prime_import(struct drm_device *dev,
>  					    struct dma_buf *dma_buf)
> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
> index db961eade225..2016c1e7242f 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> @@ -144,6 +144,9 @@ nouveau_bo_del_ttm(struct ttm_buffer_object *bo)
>  	nouveau_bo_del_io_reserve_lru(bo);
>  	nv10_bo_put_tile_region(dev, nvbo->tile, NULL);
>  
> +	if (bo->base.import_attach)
> +		drm_prime_gem_destroy(&bo->base, bo->sg);
> +
>  	/*
>  	 * If nouveau_bo_new() allocated this buffer, the GEM object was never
>  	 * initialized, so don't attempt to release it.
> diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
> index 9ae2cee1c7c5..67e3c99de73a 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_gem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
> @@ -87,9 +87,6 @@ nouveau_gem_object_del(struct drm_gem_object *gem)
>  		return;
>  	}
>  
> -	if (gem->import_attach)
> -		drm_prime_gem_destroy(gem, nvbo->bo.sg);
> -
>  	ttm_bo_put(&nvbo->bo);
>  
>  	pm_runtime_mark_last_busy(dev);

Patch

diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
index 32a8781cfd67..4b90fa8954d7 100644
--- a/drivers/gpu/drm/drm_prime.c
+++ b/drivers/gpu/drm/drm_prime.c
@@ -929,7 +929,9 @@  EXPORT_SYMBOL(drm_gem_prime_export);
  * &drm_driver.gem_prime_import_sg_table internally.
  *
  * Drivers must arrange to call drm_prime_gem_destroy() from their
- * &drm_gem_object_funcs.free hook when using this function.
+ * &ttm_buffer_object.destroy hook when using this function,
+ * to avoid the dma_buf being freed while the ttm_buffer_object can still
+ * dereference it.
  */
 struct drm_gem_object *drm_gem_prime_import_dev(struct drm_device *dev,
 					    struct dma_buf *dma_buf,
@@ -999,7 +1001,9 @@  EXPORT_SYMBOL(drm_gem_prime_import_dev);
  * implementation in drm_gem_prime_fd_to_handle().
  *
  * Drivers must arrange to call drm_prime_gem_destroy() from their
- * &drm_gem_object_funcs.free hook when using this function.
+ * &ttm_buffer_object.destroy hook when using this function,
+ * to avoid the dma_buf being freed while the ttm_buffer_object can still
+ * dereference it.
  */
 struct drm_gem_object *drm_gem_prime_import(struct drm_device *dev,
 					    struct dma_buf *dma_buf)
diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
index db961eade225..2016c1e7242f 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -144,6 +144,9 @@  nouveau_bo_del_ttm(struct ttm_buffer_object *bo)
 	nouveau_bo_del_io_reserve_lru(bo);
 	nv10_bo_put_tile_region(dev, nvbo->tile, NULL);
 
+	if (bo->base.import_attach)
+		drm_prime_gem_destroy(&bo->base, bo->sg);
+
 	/*
 	 * If nouveau_bo_new() allocated this buffer, the GEM object was never
 	 * initialized, so don't attempt to release it.
diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
index 9ae2cee1c7c5..67e3c99de73a 100644
--- a/drivers/gpu/drm/nouveau/nouveau_gem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
@@ -87,9 +87,6 @@  nouveau_gem_object_del(struct drm_gem_object *gem)
 		return;
 	}
 
-	if (gem->import_attach)
-		drm_prime_gem_destroy(gem, nvbo->bo.sg);
-
 	ttm_bo_put(&nvbo->bo);
 
 	pm_runtime_mark_last_busy(dev);