diff mbox series

[v2] drm/i915/ttm: never purge busy objects

Message ID 20221115104620.120432-1-matthew.auld@intel.com (mailing list archive)
State New, archived
Headers show
Series [v2] drm/i915/ttm: never purge busy objects | expand

Commit Message

Matthew Auld Nov. 15, 2022, 10:46 a.m. UTC
In i915_gem_madvise_ioctl() we immediately purge the object is not
currently used, like when the mm.pages are NULL.  With shmem the pages
might still be hanging around or are perhaps swapped out. Similarly with
ttm we might still have the pages hanging around on the ttm resource,
like with lmem or shmem, but here we need to be extra careful since
async unbinds are possible as well as in-progress kernel moves. In
i915_ttm_purge() we expect the pipeline-gutting to nuke the ttm resource
for us, however if it's busy the memory is only moved to a ghost object,
which then leads to broken behaviour when for example clearing the
i915_tt->filp, since the actual ttm_tt is still alive and populated,
even though it's been moved to the ghost object.  When we later destroy
the ghost object we hit the following, since the filp is now NULL:

[  +0.006982] #PF: supervisor read access in kernel mode
[  +0.005149] #PF: error_code(0x0000) - not-present page
[  +0.005147] PGD 11631d067 P4D 11631d067 PUD 115972067 PMD 0
[  +0.005676] Oops: 0000 [#1] PREEMPT SMP NOPTI
[  +0.012962] Workqueue: events ttm_device_delayed_workqueue [ttm]
[  +0.006022] RIP: 0010:i915_ttm_tt_unpopulate+0x3a/0x70 [i915]
[  +0.005879] Code: 89 fb 48 85 f6 74 11 8b 55 4c 48 8b 7d 30 45 31 c0 31 c9 e8 18 6a e5 e0 80 7d 60 00 74 20 48 8b 45 68
8b 55 08 4c 89 e7 5b 5d <48> 8b 40 20 83 e2 01 41 5c 89 d1 48 8b 70
 30 e9 42 b2 ff ff 4c 89
[  +0.018782] RSP: 0000:ffffc9000bf6fd70 EFLAGS: 00010202
[  +0.005244] RAX: 0000000000000000 RBX: ffff8883e12ae380 RCX: 0000000000000000
[  +0.007150] RDX: 000000008000000e RSI: ffffffff823559b4 RDI: ffff8883e12ae3c0
[  +0.007142] RBP: ffff888103b65d48 R08: 0000000000000001 R09: 0000000000000001
[  +0.007144] R10: 0000000000000001 R11: ffff88829c2c8040 R12: ffff8883e12ae3c0
[  +0.007148] R13: 0000000000000001 R14: ffff888115184140 R15: ffff888115184248
[  +0.007154] FS:  0000000000000000(0000) GS:ffff88844db00000(0000) knlGS:0000000000000000
[  +0.008108] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.005763] CR2: 0000000000000020 CR3: 000000013fdb4004 CR4: 00000000003706e0
[  +0.007152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.007145] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.007154] Call Trace:
[  +0.002459]  <TASK>
[  +0.002126]  ttm_tt_unpopulate.part.0+0x17/0x70 [ttm]
[  +0.005068]  ttm_bo_tt_destroy+0x1c/0x50 [ttm]
[  +0.004464]  ttm_bo_cleanup_memtype_use+0x25/0x40 [ttm]
[  +0.005244]  ttm_bo_cleanup_refs+0x90/0x2c0 [ttm]
[  +0.004721]  ttm_bo_delayed_delete+0x235/0x250 [ttm]
[  +0.004981]  ttm_device_delayed_workqueue+0x13/0x40 [ttm]
[  +0.005422]  process_one_work+0x248/0x560
[  +0.004028]  worker_thread+0x4b/0x390
[  +0.003682]  ? process_one_work+0x560/0x560
[  +0.004199]  kthread+0xeb/0x120
[  +0.003163]  ? kthread_complete_and_exit+0x20/0x20
[  +0.004815]  ret_from_fork+0x1f/0x30

v2:
 - Just use ttm_bo_wait() directly (Niranjana)
 - Add testcase reference

Testcase: igt@gem_madvise@dontneed-evict-race
Fixes: 213d50927763 ("drm/i915/ttm: Introduce a TTM i915 gem object backend")
Reported-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Andrzej Hajda <andrzej.hajda@intel.com>
Cc: Nirmoy Das <nirmoy.das@intel.com>
Cc: <stable@vger.kernel.org> # v5.15+
Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_ttm.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Nirmoy Das Nov. 15, 2022, 1:45 p.m. UTC | #1
On 11/15/2022 11:46 AM, Matthew Auld wrote:
> In i915_gem_madvise_ioctl() we immediately purge the object is not
> currently used, like when the mm.pages are NULL.  With shmem the pages
> might still be hanging around or are perhaps swapped out. Similarly with
> ttm we might still have the pages hanging around on the ttm resource,
> like with lmem or shmem, but here we need to be extra careful since
> async unbinds are possible as well as in-progress kernel moves. In
> i915_ttm_purge() we expect the pipeline-gutting to nuke the ttm resource
> for us, however if it's busy the memory is only moved to a ghost object,
> which then leads to broken behaviour when for example clearing the
> i915_tt->filp, since the actual ttm_tt is still alive and populated,
> even though it's been moved to the ghost object.  When we later destroy
> the ghost object we hit the following, since the filp is now NULL:
>
> [  +0.006982] #PF: supervisor read access in kernel mode
> [  +0.005149] #PF: error_code(0x0000) - not-present page
> [  +0.005147] PGD 11631d067 P4D 11631d067 PUD 115972067 PMD 0
> [  +0.005676] Oops: 0000 [#1] PREEMPT SMP NOPTI
> [  +0.012962] Workqueue: events ttm_device_delayed_workqueue [ttm]
> [  +0.006022] RIP: 0010:i915_ttm_tt_unpopulate+0x3a/0x70 [i915]
> [  +0.005879] Code: 89 fb 48 85 f6 74 11 8b 55 4c 48 8b 7d 30 45 31 c0 31 c9 e8 18 6a e5 e0 80 7d 60 00 74 20 48 8b 45 68
> 8b 55 08 4c 89 e7 5b 5d <48> 8b 40 20 83 e2 01 41 5c 89 d1 48 8b 70
>   30 e9 42 b2 ff ff 4c 89
> [  +0.018782] RSP: 0000:ffffc9000bf6fd70 EFLAGS: 00010202
> [  +0.005244] RAX: 0000000000000000 RBX: ffff8883e12ae380 RCX: 0000000000000000
> [  +0.007150] RDX: 000000008000000e RSI: ffffffff823559b4 RDI: ffff8883e12ae3c0
> [  +0.007142] RBP: ffff888103b65d48 R08: 0000000000000001 R09: 0000000000000001
> [  +0.007144] R10: 0000000000000001 R11: ffff88829c2c8040 R12: ffff8883e12ae3c0
> [  +0.007148] R13: 0000000000000001 R14: ffff888115184140 R15: ffff888115184248
> [  +0.007154] FS:  0000000000000000(0000) GS:ffff88844db00000(0000) knlGS:0000000000000000
> [  +0.008108] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  +0.005763] CR2: 0000000000000020 CR3: 000000013fdb4004 CR4: 00000000003706e0
> [  +0.007152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  +0.007145] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  +0.007154] Call Trace:
> [  +0.002459]  <TASK>
> [  +0.002126]  ttm_tt_unpopulate.part.0+0x17/0x70 [ttm]
> [  +0.005068]  ttm_bo_tt_destroy+0x1c/0x50 [ttm]
> [  +0.004464]  ttm_bo_cleanup_memtype_use+0x25/0x40 [ttm]
> [  +0.005244]  ttm_bo_cleanup_refs+0x90/0x2c0 [ttm]
> [  +0.004721]  ttm_bo_delayed_delete+0x235/0x250 [ttm]
> [  +0.004981]  ttm_device_delayed_workqueue+0x13/0x40 [ttm]
> [  +0.005422]  process_one_work+0x248/0x560
> [  +0.004028]  worker_thread+0x4b/0x390
> [  +0.003682]  ? process_one_work+0x560/0x560
> [  +0.004199]  kthread+0xeb/0x120
> [  +0.003163]  ? kthread_complete_and_exit+0x20/0x20
> [  +0.004815]  ret_from_fork+0x1f/0x30
>
> v2:
>   - Just use ttm_bo_wait() directly (Niranjana)
>   - Add testcase reference
>
> Testcase: igt@gem_madvise@dontneed-evict-race
> Fixes: 213d50927763 ("drm/i915/ttm: Introduce a TTM i915 gem object backend")
> Reported-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> Cc: Andrzej Hajda <andrzej.hajda@intel.com>
> Cc: Nirmoy Das <nirmoy.das@intel.com>


Acked-by: Nirmoy Das <Nirmoy.Das@intel.com>


> Cc: <stable@vger.kernel.org> # v5.15+
> Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> ---
>   drivers/gpu/drm/i915/gem/i915_gem_ttm.c | 4 ++++
>   1 file changed, 4 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> index e4e55e3f4e41..34793863ea45 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> @@ -603,6 +603,10 @@ static int i915_ttm_truncate(struct drm_i915_gem_object *obj)
>   
>   	WARN_ON_ONCE(obj->mm.madv == I915_MADV_WILLNEED);
>   
> +	err = ttm_bo_wait(bo, true, false);
> +	if (err)
> +		return err;
> +
>   	err = i915_ttm_move_notify(bo);
>   	if (err)
>   		return err;
Andrzej Hajda Nov. 15, 2022, 4:20 p.m. UTC | #2
On 15.11.2022 11:46, Matthew Auld wrote:
> In i915_gem_madvise_ioctl() we immediately purge the object is not
> currently used, like when the mm.pages are NULL.  With shmem the pages
> might still be hanging around or are perhaps swapped out. Similarly with
> ttm we might still have the pages hanging around on the ttm resource,
> like with lmem or shmem, but here we need to be extra careful since
> async unbinds are possible as well as in-progress kernel moves. In
> i915_ttm_purge() we expect the pipeline-gutting to nuke the ttm resource
> for us, however if it's busy the memory is only moved to a ghost object,
> which then leads to broken behaviour when for example clearing the
> i915_tt->filp, since the actual ttm_tt is still alive and populated,
> even though it's been moved to the ghost object.  When we later destroy
> the ghost object we hit the following, since the filp is now NULL:
>
> [  +0.006982] #PF: supervisor read access in kernel mode
> [  +0.005149] #PF: error_code(0x0000) - not-present page
> [  +0.005147] PGD 11631d067 P4D 11631d067 PUD 115972067 PMD 0
> [  +0.005676] Oops: 0000 [#1] PREEMPT SMP NOPTI
> [  +0.012962] Workqueue: events ttm_device_delayed_workqueue [ttm]
> [  +0.006022] RIP: 0010:i915_ttm_tt_unpopulate+0x3a/0x70 [i915]
> [  +0.005879] Code: 89 fb 48 85 f6 74 11 8b 55 4c 48 8b 7d 30 45 31 c0 31 c9 e8 18 6a e5 e0 80 7d 60 00 74 20 48 8b 45 68
> 8b 55 08 4c 89 e7 5b 5d <48> 8b 40 20 83 e2 01 41 5c 89 d1 48 8b 70
>   30 e9 42 b2 ff ff 4c 89
> [  +0.018782] RSP: 0000:ffffc9000bf6fd70 EFLAGS: 00010202
> [  +0.005244] RAX: 0000000000000000 RBX: ffff8883e12ae380 RCX: 0000000000000000
> [  +0.007150] RDX: 000000008000000e RSI: ffffffff823559b4 RDI: ffff8883e12ae3c0
> [  +0.007142] RBP: ffff888103b65d48 R08: 0000000000000001 R09: 0000000000000001
> [  +0.007144] R10: 0000000000000001 R11: ffff88829c2c8040 R12: ffff8883e12ae3c0
> [  +0.007148] R13: 0000000000000001 R14: ffff888115184140 R15: ffff888115184248
> [  +0.007154] FS:  0000000000000000(0000) GS:ffff88844db00000(0000) knlGS:0000000000000000
> [  +0.008108] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  +0.005763] CR2: 0000000000000020 CR3: 000000013fdb4004 CR4: 00000000003706e0
> [  +0.007152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  +0.007145] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  +0.007154] Call Trace:
> [  +0.002459]  <TASK>
> [  +0.002126]  ttm_tt_unpopulate.part.0+0x17/0x70 [ttm]
> [  +0.005068]  ttm_bo_tt_destroy+0x1c/0x50 [ttm]
> [  +0.004464]  ttm_bo_cleanup_memtype_use+0x25/0x40 [ttm]
> [  +0.005244]  ttm_bo_cleanup_refs+0x90/0x2c0 [ttm]
> [  +0.004721]  ttm_bo_delayed_delete+0x235/0x250 [ttm]
> [  +0.004981]  ttm_device_delayed_workqueue+0x13/0x40 [ttm]
> [  +0.005422]  process_one_work+0x248/0x560
> [  +0.004028]  worker_thread+0x4b/0x390
> [  +0.003682]  ? process_one_work+0x560/0x560
> [  +0.004199]  kthread+0xeb/0x120
> [  +0.003163]  ? kthread_complete_and_exit+0x20/0x20
> [  +0.004815]  ret_from_fork+0x1f/0x30
>
> v2:
>   - Just use ttm_bo_wait() directly (Niranjana)
>   - Add testcase reference
>
> Testcase: igt@gem_madvise@dontneed-evict-race
> Fixes: 213d50927763 ("drm/i915/ttm: Introduce a TTM i915 gem object backend")
> Reported-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> Cc: Andrzej Hajda <andrzej.hajda@intel.com>
> Cc: Nirmoy Das <nirmoy.das@intel.com>
> Cc: <stable@vger.kernel.org> # v5.15+
> Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>

Reviewed-by: Andrzej Hajda <andrzej.hajda@intel.com>

Regards
Andrzej
> ---
>   drivers/gpu/drm/i915/gem/i915_gem_ttm.c | 4 ++++
>   1 file changed, 4 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> index e4e55e3f4e41..34793863ea45 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> @@ -603,6 +603,10 @@ static int i915_ttm_truncate(struct drm_i915_gem_object *obj)
>   
>   	WARN_ON_ONCE(obj->mm.madv == I915_MADV_WILLNEED);
>   
> +	err = ttm_bo_wait(bo, true, false);
> +	if (err)
> +		return err;
> +
>   	err = i915_ttm_move_notify(bo);
>   	if (err)
>   		return err;
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
index e4e55e3f4e41..34793863ea45 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
@@ -603,6 +603,10 @@  static int i915_ttm_truncate(struct drm_i915_gem_object *obj)
 
 	WARN_ON_ONCE(obj->mm.madv == I915_MADV_WILLNEED);
 
+	err = ttm_bo_wait(bo, true, false);
+	if (err)
+		return err;
+
 	err = i915_ttm_move_notify(bo);
 	if (err)
 		return err;