diff mbox series

[1/5] drm/i915: audit bo->resource usage v3

Message ID 20230124125726.13323-1-christian.koenig@amd.com (mailing list archive)
State New, archived
Headers show
Series [1/5] drm/i915: audit bo->resource usage v3 | expand

Commit Message

Christian König Jan. 24, 2023, 12:57 p.m. UTC
From: Christian König <ckoenig.leichtzumerken@gmail.com>

Make sure we can at least move and alloc TT objects without backing store.

v2: clear the tt object even when no resource is allocated.
v3: add Matthews changes for i915 as well.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_ttm.c      | 27 ++++++++--
 drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 56 +++++++++++++++++---
 2 files changed, 71 insertions(+), 12 deletions(-)

Comments

Matthew Auld Jan. 24, 2023, 1:48 p.m. UTC | #1
On Tue, 24 Jan 2023 at 12:57, Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>
> Make sure we can at least move and alloc TT objects without backing store.
>
> v2: clear the tt object even when no resource is allocated.
> v3: add Matthews changes for i915 as well.
>
> Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Matthew Auld Jan. 24, 2023, 5:15 p.m. UTC | #2
On Tue, 24 Jan 2023 at 13:48, Matthew Auld
<matthew.william.auld@gmail.com> wrote:
>
> On Tue, 24 Jan 2023 at 12:57, Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
> >
> > From: Christian König <ckoenig.leichtzumerken@gmail.com>
> >
> > Make sure we can at least move and alloc TT objects without backing store.
> >
> > v2: clear the tt object even when no resource is allocated.
> > v3: add Matthews changes for i915 as well.
> >
> > Signed-off-by: Christian König <christian.koenig@amd.com>
> Reviewed-by: Matthew Auld <matthew.auld@intel.com>

Ofc that assumes intel-gfx CI is now happy with the series.
Matthew Auld Jan. 25, 2023, 9:56 a.m. UTC | #3
On Tue, 24 Jan 2023 at 17:15, Matthew Auld
<matthew.william.auld@gmail.com> wrote:
>
> On Tue, 24 Jan 2023 at 13:48, Matthew Auld
> <matthew.william.auld@gmail.com> wrote:
> >
> > On Tue, 24 Jan 2023 at 12:57, Christian König
> > <ckoenig.leichtzumerken@gmail.com> wrote:
> > >
> > > From: Christian König <ckoenig.leichtzumerken@gmail.com>
> > >
> > > Make sure we can at least move and alloc TT objects without backing store.
> > >
> > > v2: clear the tt object even when no resource is allocated.
> > > v3: add Matthews changes for i915 as well.
> > >
> > > Signed-off-by: Christian König <christian.koenig@amd.com>
> > Reviewed-by: Matthew Auld <matthew.auld@intel.com>
>
> Ofc that assumes intel-gfx CI is now happy with the series.

There are still some nasty failures it seems (in the extended test
list). But it looks like the series is already merged. Can we quickly
revert and try again?
Christian König Jan. 25, 2023, 10:07 a.m. UTC | #4
Am 25.01.23 um 10:56 schrieb Matthew Auld:
> On Tue, 24 Jan 2023 at 17:15, Matthew Auld
> <matthew.william.auld@gmail.com> wrote:
>> On Tue, 24 Jan 2023 at 13:48, Matthew Auld
>> <matthew.william.auld@gmail.com> wrote:
>>> On Tue, 24 Jan 2023 at 12:57, Christian König
>>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>>>>
>>>> Make sure we can at least move and alloc TT objects without backing store.
>>>>
>>>> v2: clear the tt object even when no resource is allocated.
>>>> v3: add Matthews changes for i915 as well.
>>>>
>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>> Reviewed-by: Matthew Auld <matthew.auld@intel.com>
>> Ofc that assumes intel-gfx CI is now happy with the series.
> There are still some nasty failures it seems (in the extended test
> list). But it looks like the series is already merged. Can we quickly
> revert and try again?

Ah, crap. I thought everything would be fine after the CI gave it's go.

Which patch is causing the fallout?

Christian.
Matthew Auld Jan. 25, 2023, 10:21 a.m. UTC | #5
On Wed, 25 Jan 2023 at 10:07, Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
>
>
> Am 25.01.23 um 10:56 schrieb Matthew Auld:
> > On Tue, 24 Jan 2023 at 17:15, Matthew Auld
> > <matthew.william.auld@gmail.com> wrote:
> >> On Tue, 24 Jan 2023 at 13:48, Matthew Auld
> >> <matthew.william.auld@gmail.com> wrote:
> >>> On Tue, 24 Jan 2023 at 12:57, Christian König
> >>> <ckoenig.leichtzumerken@gmail.com> wrote:
> >>>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> >>>>
> >>>> Make sure we can at least move and alloc TT objects without backing store.
> >>>>
> >>>> v2: clear the tt object even when no resource is allocated.
> >>>> v3: add Matthews changes for i915 as well.
> >>>>
> >>>> Signed-off-by: Christian König <christian.koenig@amd.com>
> >>> Reviewed-by: Matthew Auld <matthew.auld@intel.com>
> >> Ofc that assumes intel-gfx CI is now happy with the series.
> > There are still some nasty failures it seems (in the extended test
> > list). But it looks like the series is already merged. Can we quickly
> > revert and try again?
>
> Ah, crap. I thought everything would be fine after the CI gave it's go.
>
> Which patch is causing the fallout?

I'm not sure. I think all of the patches kind of interact with each
other, but for sure there is an issue with the first patch. There is
one splat like:

<1>[  109.735148] BUG: kernel NULL pointer dereference, address:
0000000000000010
<1>[  109.735151] #PF: supervisor read access in kernel mode
<1>[  109.735152] #PF: error_code(0x0000) - not-present page
<6>[  109.735153] PGD 0 P4D 0
<4>[  109.735155] Oops: 0000 [#1] PREEMPT SMP NOPTI
<4>[  109.735157] CPU: 1 PID: 92 Comm: kworker/u12:6 Not tainted
6.2.0-rc5-Patchwork_113269v1-gc4d436608c4e+ #1
<4>[  109.735159] Hardware name: Gigabyte Technology Co., Ltd. GB-Z390
Garuda/GB-Z390 Garuda-CF, BIOS IG1c 11/19/2019
<4>[  109.735160] Workqueue: events_unbound async_run_entry_fn
<4>[  109.735163] RIP: 0010:i915_ttm_resource_mappable+0x4/0x30 [i915]
<4>[  109.735286] Code: b8 f9 ff ff ff eb c2 e8 aa 5e 52 e1 e9 4f 0f
18 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
66 0f 1f 00 <8b> 57 10 b8 01 00 00 00 85 d2 74 15 48 8b 47 08 48 05 ff
0f 00 00
<4>[  109.735288] RSP: 0018:ffffc90000f339a8 EFLAGS: 00010246
<4>[  109.735289] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
ffff88810cea3a00
<4>[  109.735290] RDX: 0000000000000000 RSI: ffffc90000f33af0 RDI:
0000000000000000
<4>[  109.735292] RBP: ffff88811645d7c0 R08: 0000000000000000 R09:
ffff888123afa940
<4>[  109.735292] R10: 0000000000000001 R11: ffff888104b70040 R12:
0000000000000000
<4>[  109.735293] R13: 0000000000000000 R14: ffffc90000f33b08 R15:
ffffc90000f33af0
<4>[  109.735294] FS:  0000000000000000(0000)
GS:ffff8884ad680000(0000) knlGS:0000000000000000
<4>[  109.735295] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[  109.735296] CR2: 0000000000000010 CR3: 000000011f9c6003 CR4:
00000000003706e0
<4>[  109.735297] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
<4>[  109.735298] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
<4>[  109.735299] Call Trace:
<4>[  109.735300]  <TASK>
<4>[  109.735301]  __i915_ttm_move+0x128/0x940 [i915]
<4>[  109.735408]  ? dma_resv_iter_next+0x91/0xb0
<4>[  109.735412]  ? dma_resv_iter_first+0x42/0xb0
<4>[  109.735414]  ? i915_deps_add_resv+0x4c/0xc0 [i915]
<4>[  109.735520]  i915_gem_obj_copy_ttm+0x12f/0x250 [i915]
<4>[  109.735625]  i915_ttm_restore+0x167/0x250 [i915]
<4>[  109.735759]  i915_gem_process_region+0x27a/0x3b0 [i915]
<4>[  109.735881]  i915_ttm_restore_region+0x4b/0x70 [i915]
<4>[  109.735999]  lmem_restore+0x3a/0x60 [i915]
<4>[  109.736101]  i915_gem_resume+0x4c/0x100 [i915]
<4>[  109.736202]  i915_drm_resume+0xc2/0x170 [i915]

Plus some other less obvious issue(s) with some tests failing.

>
> Christian.
Christian König Jan. 25, 2023, 11:35 a.m. UTC | #6
Am 25.01.23 um 11:21 schrieb Matthew Auld:
> On Wed, 25 Jan 2023 at 10:07, Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> Am 25.01.23 um 10:56 schrieb Matthew Auld:
>>> On Tue, 24 Jan 2023 at 17:15, Matthew Auld
>>> <matthew.william.auld@gmail.com> wrote:
>>>> On Tue, 24 Jan 2023 at 13:48, Matthew Auld
>>>> <matthew.william.auld@gmail.com> wrote:
>>>>> On Tue, 24 Jan 2023 at 12:57, Christian König
>>>>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>>>>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>>>>>>
>>>>>> Make sure we can at least move and alloc TT objects without backing store.
>>>>>>
>>>>>> v2: clear the tt object even when no resource is allocated.
>>>>>> v3: add Matthews changes for i915 as well.
>>>>>>
>>>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>>> Reviewed-by: Matthew Auld <matthew.auld@intel.com>
>>>> Ofc that assumes intel-gfx CI is now happy with the series.
>>> There are still some nasty failures it seems (in the extended test
>>> list). But it looks like the series is already merged. Can we quickly
>>> revert and try again?
>> Ah, crap. I thought everything would be fine after the CI gave it's go.
>>
>> Which patch is causing the fallout?
> I'm not sure. I think all of the patches kind of interact with each
> other, but for sure there is an issue with the first patch. There is
> one splat like:

Well I would rather like to revert as less as possible.

Are you sure that this isn't only on some i915 specific branch with not 
yet upstream changes?

I can't even find the i915_gem_obj_copy_ttm function in drm-misc-next 
nor drm-next.

Regards,
Christian.

>
> <1>[  109.735148] BUG: kernel NULL pointer dereference, address:
> 0000000000000010
> <1>[  109.735151] #PF: supervisor read access in kernel mode
> <1>[  109.735152] #PF: error_code(0x0000) - not-present page
> <6>[  109.735153] PGD 0 P4D 0
> <4>[  109.735155] Oops: 0000 [#1] PREEMPT SMP NOPTI
> <4>[  109.735157] CPU: 1 PID: 92 Comm: kworker/u12:6 Not tainted
> 6.2.0-rc5-Patchwork_113269v1-gc4d436608c4e+ #1
> <4>[  109.735159] Hardware name: Gigabyte Technology Co., Ltd. GB-Z390
> Garuda/GB-Z390 Garuda-CF, BIOS IG1c 11/19/2019
> <4>[  109.735160] Workqueue: events_unbound async_run_entry_fn
> <4>[  109.735163] RIP: 0010:i915_ttm_resource_mappable+0x4/0x30 [i915]
> <4>[  109.735286] Code: b8 f9 ff ff ff eb c2 e8 aa 5e 52 e1 e9 4f 0f
> 18 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
> 66 0f 1f 00 <8b> 57 10 b8 01 00 00 00 85 d2 74 15 48 8b 47 08 48 05 ff
> 0f 00 00
> <4>[  109.735288] RSP: 0018:ffffc90000f339a8 EFLAGS: 00010246
> <4>[  109.735289] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
> ffff88810cea3a00
> <4>[  109.735290] RDX: 0000000000000000 RSI: ffffc90000f33af0 RDI:
> 0000000000000000
> <4>[  109.735292] RBP: ffff88811645d7c0 R08: 0000000000000000 R09:
> ffff888123afa940
> <4>[  109.735292] R10: 0000000000000001 R11: ffff888104b70040 R12:
> 0000000000000000
> <4>[  109.735293] R13: 0000000000000000 R14: ffffc90000f33b08 R15:
> ffffc90000f33af0
> <4>[  109.735294] FS:  0000000000000000(0000)
> GS:ffff8884ad680000(0000) knlGS:0000000000000000
> <4>[  109.735295] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4>[  109.735296] CR2: 0000000000000010 CR3: 000000011f9c6003 CR4:
> 00000000003706e0
> <4>[  109.735297] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> <4>[  109.735298] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> <4>[  109.735299] Call Trace:
> <4>[  109.735300]  <TASK>
> <4>[  109.735301]  __i915_ttm_move+0x128/0x940 [i915]
> <4>[  109.735408]  ? dma_resv_iter_next+0x91/0xb0
> <4>[  109.735412]  ? dma_resv_iter_first+0x42/0xb0
> <4>[  109.735414]  ? i915_deps_add_resv+0x4c/0xc0 [i915]
> <4>[  109.735520]  i915_gem_obj_copy_ttm+0x12f/0x250 [i915]
> <4>[  109.735625]  i915_ttm_restore+0x167/0x250 [i915]
> <4>[  109.735759]  i915_gem_process_region+0x27a/0x3b0 [i915]
> <4>[  109.735881]  i915_ttm_restore_region+0x4b/0x70 [i915]
> <4>[  109.735999]  lmem_restore+0x3a/0x60 [i915]
> <4>[  109.736101]  i915_gem_resume+0x4c/0x100 [i915]
> <4>[  109.736202]  i915_drm_resume+0xc2/0x170 [i915]
>
> Plus some other less obvious issue(s) with some tests failing.
>
>> Christian.
Matthew Auld Jan. 25, 2023, 12:53 p.m. UTC | #7
On Wed, 25 Jan 2023 at 11:35, Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 25.01.23 um 11:21 schrieb Matthew Auld:
> > On Wed, 25 Jan 2023 at 10:07, Christian König
> > <ckoenig.leichtzumerken@gmail.com> wrote:
> >> Am 25.01.23 um 10:56 schrieb Matthew Auld:
> >>> On Tue, 24 Jan 2023 at 17:15, Matthew Auld
> >>> <matthew.william.auld@gmail.com> wrote:
> >>>> On Tue, 24 Jan 2023 at 13:48, Matthew Auld
> >>>> <matthew.william.auld@gmail.com> wrote:
> >>>>> On Tue, 24 Jan 2023 at 12:57, Christian König
> >>>>> <ckoenig.leichtzumerken@gmail.com> wrote:
> >>>>>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> >>>>>>
> >>>>>> Make sure we can at least move and alloc TT objects without backing store.
> >>>>>>
> >>>>>> v2: clear the tt object even when no resource is allocated.
> >>>>>> v3: add Matthews changes for i915 as well.
> >>>>>>
> >>>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
> >>>>> Reviewed-by: Matthew Auld <matthew.auld@intel.com>
> >>>> Ofc that assumes intel-gfx CI is now happy with the series.
> >>> There are still some nasty failures it seems (in the extended test
> >>> list). But it looks like the series is already merged. Can we quickly
> >>> revert and try again?
> >> Ah, crap. I thought everything would be fine after the CI gave it's go.
> >>
> >> Which patch is causing the fallout?
> > I'm not sure. I think all of the patches kind of interact with each
> > other, but for sure there is an issue with the first patch. There is
> > one splat like:
>
> Well I would rather like to revert as less as possible.
>
> Are you sure that this isn't only on some i915 specific branch with not
> yet upstream changes?

Yeah, that splat is taken directly from the CI results reported with
this series. So it's just your series applied on top of drm-tip.

Can you take a look at the first patch here:
https://patchwork.freedesktop.org/series/113332/

Maybe you have a better idea? For reference the IGTs that we have for
verifying userspace object clearing are now failing, so hoping that
fixes it. The other two patches I'm hoping will fix the splat.

>
> I can't even find the i915_gem_obj_copy_ttm function in drm-misc-next
> nor drm-next.
>
> Regards,
> Christian.
>
> >
> > <1>[  109.735148] BUG: kernel NULL pointer dereference, address:
> > 0000000000000010
> > <1>[  109.735151] #PF: supervisor read access in kernel mode
> > <1>[  109.735152] #PF: error_code(0x0000) - not-present page
> > <6>[  109.735153] PGD 0 P4D 0
> > <4>[  109.735155] Oops: 0000 [#1] PREEMPT SMP NOPTI
> > <4>[  109.735157] CPU: 1 PID: 92 Comm: kworker/u12:6 Not tainted
> > 6.2.0-rc5-Patchwork_113269v1-gc4d436608c4e+ #1
> > <4>[  109.735159] Hardware name: Gigabyte Technology Co., Ltd. GB-Z390
> > Garuda/GB-Z390 Garuda-CF, BIOS IG1c 11/19/2019
> > <4>[  109.735160] Workqueue: events_unbound async_run_entry_fn
> > <4>[  109.735163] RIP: 0010:i915_ttm_resource_mappable+0x4/0x30 [i915]
> > <4>[  109.735286] Code: b8 f9 ff ff ff eb c2 e8 aa 5e 52 e1 e9 4f 0f
> > 18 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
> > 66 0f 1f 00 <8b> 57 10 b8 01 00 00 00 85 d2 74 15 48 8b 47 08 48 05 ff
> > 0f 00 00
> > <4>[  109.735288] RSP: 0018:ffffc90000f339a8 EFLAGS: 00010246
> > <4>[  109.735289] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
> > ffff88810cea3a00
> > <4>[  109.735290] RDX: 0000000000000000 RSI: ffffc90000f33af0 RDI:
> > 0000000000000000
> > <4>[  109.735292] RBP: ffff88811645d7c0 R08: 0000000000000000 R09:
> > ffff888123afa940
> > <4>[  109.735292] R10: 0000000000000001 R11: ffff888104b70040 R12:
> > 0000000000000000
> > <4>[  109.735293] R13: 0000000000000000 R14: ffffc90000f33b08 R15:
> > ffffc90000f33af0
> > <4>[  109.735294] FS:  0000000000000000(0000)
> > GS:ffff8884ad680000(0000) knlGS:0000000000000000
> > <4>[  109.735295] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > <4>[  109.735296] CR2: 0000000000000010 CR3: 000000011f9c6003 CR4:
> > 00000000003706e0
> > <4>[  109.735297] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > 0000000000000000
> > <4>[  109.735298] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > 0000000000000400
> > <4>[  109.735299] Call Trace:
> > <4>[  109.735300]  <TASK>
> > <4>[  109.735301]  __i915_ttm_move+0x128/0x940 [i915]
> > <4>[  109.735408]  ? dma_resv_iter_next+0x91/0xb0
> > <4>[  109.735412]  ? dma_resv_iter_first+0x42/0xb0
> > <4>[  109.735414]  ? i915_deps_add_resv+0x4c/0xc0 [i915]
> > <4>[  109.735520]  i915_gem_obj_copy_ttm+0x12f/0x250 [i915]
> > <4>[  109.735625]  i915_ttm_restore+0x167/0x250 [i915]
> > <4>[  109.735759]  i915_gem_process_region+0x27a/0x3b0 [i915]
> > <4>[  109.735881]  i915_ttm_restore_region+0x4b/0x70 [i915]
> > <4>[  109.735999]  lmem_restore+0x3a/0x60 [i915]
> > <4>[  109.736101]  i915_gem_resume+0x4c/0x100 [i915]
> > <4>[  109.736202]  i915_drm_resume+0xc2/0x170 [i915]
> >
> > Plus some other less obvious issue(s) with some tests failing.
> >
> >> Christian.
>
Christian König Jan. 25, 2023, 2:20 p.m. UTC | #8
Am 25.01.23 um 13:53 schrieb Matthew Auld:
> On Wed, 25 Jan 2023 at 11:35, Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> Am 25.01.23 um 11:21 schrieb Matthew Auld:
>>> On Wed, 25 Jan 2023 at 10:07, Christian König
>>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>>> Am 25.01.23 um 10:56 schrieb Matthew Auld:
>>>>> On Tue, 24 Jan 2023 at 17:15, Matthew Auld
>>>>> <matthew.william.auld@gmail.com> wrote:
>>>>>> On Tue, 24 Jan 2023 at 13:48, Matthew Auld
>>>>>> <matthew.william.auld@gmail.com> wrote:
>>>>>>> On Tue, 24 Jan 2023 at 12:57, Christian König
>>>>>>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>>>>>>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>>>>>>>>
>>>>>>>> Make sure we can at least move and alloc TT objects without backing store.
>>>>>>>>
>>>>>>>> v2: clear the tt object even when no resource is allocated.
>>>>>>>> v3: add Matthews changes for i915 as well.
>>>>>>>>
>>>>>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>>>>> Reviewed-by: Matthew Auld <matthew.auld@intel.com>
>>>>>> Ofc that assumes intel-gfx CI is now happy with the series.
>>>>> There are still some nasty failures it seems (in the extended test
>>>>> list). But it looks like the series is already merged. Can we quickly
>>>>> revert and try again?
>>>> Ah, crap. I thought everything would be fine after the CI gave it's go.
>>>>
>>>> Which patch is causing the fallout?
>>> I'm not sure. I think all of the patches kind of interact with each
>>> other, but for sure there is an issue with the first patch. There is
>>> one splat like:
>> Well I would rather like to revert as less as possible.
>>
>> Are you sure that this isn't only on some i915 specific branch with not
>> yet upstream changes?
> Yeah, that splat is taken directly from the CI results reported with
> this series. So it's just your series applied on top of drm-tip.
>
> Can you take a look at the first patch here:
> https://patchwork.freedesktop.org/series/113332/
>
> Maybe you have a better idea? For reference the IGTs that we have for
> verifying userspace object clearing are now failing, so hoping that
> fixes it. The other two patches I'm hoping will fix the splat.

The TTM change looks like a good idea to me. Feel free to add my rb to 
this one.

I can't say much about the i915 changes.

Maybe we should revert the two TTM patches to not allocate resources for 
now and fix i915 first?

Christian.

>
>> I can't even find the i915_gem_obj_copy_ttm function in drm-misc-next
>> nor drm-next.
>>
>> Regards,
>> Christian.
>>
>>> <1>[  109.735148] BUG: kernel NULL pointer dereference, address:
>>> 0000000000000010
>>> <1>[  109.735151] #PF: supervisor read access in kernel mode
>>> <1>[  109.735152] #PF: error_code(0x0000) - not-present page
>>> <6>[  109.735153] PGD 0 P4D 0
>>> <4>[  109.735155] Oops: 0000 [#1] PREEMPT SMP NOPTI
>>> <4>[  109.735157] CPU: 1 PID: 92 Comm: kworker/u12:6 Not tainted
>>> 6.2.0-rc5-Patchwork_113269v1-gc4d436608c4e+ #1
>>> <4>[  109.735159] Hardware name: Gigabyte Technology Co., Ltd. GB-Z390
>>> Garuda/GB-Z390 Garuda-CF, BIOS IG1c 11/19/2019
>>> <4>[  109.735160] Workqueue: events_unbound async_run_entry_fn
>>> <4>[  109.735163] RIP: 0010:i915_ttm_resource_mappable+0x4/0x30 [i915]
>>> <4>[  109.735286] Code: b8 f9 ff ff ff eb c2 e8 aa 5e 52 e1 e9 4f 0f
>>> 18 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
>>> 66 0f 1f 00 <8b> 57 10 b8 01 00 00 00 85 d2 74 15 48 8b 47 08 48 05 ff
>>> 0f 00 00
>>> <4>[  109.735288] RSP: 0018:ffffc90000f339a8 EFLAGS: 00010246
>>> <4>[  109.735289] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
>>> ffff88810cea3a00
>>> <4>[  109.735290] RDX: 0000000000000000 RSI: ffffc90000f33af0 RDI:
>>> 0000000000000000
>>> <4>[  109.735292] RBP: ffff88811645d7c0 R08: 0000000000000000 R09:
>>> ffff888123afa940
>>> <4>[  109.735292] R10: 0000000000000001 R11: ffff888104b70040 R12:
>>> 0000000000000000
>>> <4>[  109.735293] R13: 0000000000000000 R14: ffffc90000f33b08 R15:
>>> ffffc90000f33af0
>>> <4>[  109.735294] FS:  0000000000000000(0000)
>>> GS:ffff8884ad680000(0000) knlGS:0000000000000000
>>> <4>[  109.735295] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> <4>[  109.735296] CR2: 0000000000000010 CR3: 000000011f9c6003 CR4:
>>> 00000000003706e0
>>> <4>[  109.735297] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>> 0000000000000000
>>> <4>[  109.735298] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>>> 0000000000000400
>>> <4>[  109.735299] Call Trace:
>>> <4>[  109.735300]  <TASK>
>>> <4>[  109.735301]  __i915_ttm_move+0x128/0x940 [i915]
>>> <4>[  109.735408]  ? dma_resv_iter_next+0x91/0xb0
>>> <4>[  109.735412]  ? dma_resv_iter_first+0x42/0xb0
>>> <4>[  109.735414]  ? i915_deps_add_resv+0x4c/0xc0 [i915]
>>> <4>[  109.735520]  i915_gem_obj_copy_ttm+0x12f/0x250 [i915]
>>> <4>[  109.735625]  i915_ttm_restore+0x167/0x250 [i915]
>>> <4>[  109.735759]  i915_gem_process_region+0x27a/0x3b0 [i915]
>>> <4>[  109.735881]  i915_ttm_restore_region+0x4b/0x70 [i915]
>>> <4>[  109.735999]  lmem_restore+0x3a/0x60 [i915]
>>> <4>[  109.736101]  i915_gem_resume+0x4c/0x100 [i915]
>>> <4>[  109.736202]  i915_drm_resume+0xc2/0x170 [i915]
>>>
>>> Plus some other less obvious issue(s) with some tests failing.
>>>
>>>> Christian.
Matthew Auld Jan. 25, 2023, 3:46 p.m. UTC | #9
On Wed, 25 Jan 2023 at 14:20, Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 25.01.23 um 13:53 schrieb Matthew Auld:
> > On Wed, 25 Jan 2023 at 11:35, Christian König
> > <ckoenig.leichtzumerken@gmail.com> wrote:
> >> Am 25.01.23 um 11:21 schrieb Matthew Auld:
> >>> On Wed, 25 Jan 2023 at 10:07, Christian König
> >>> <ckoenig.leichtzumerken@gmail.com> wrote:
> >>>> Am 25.01.23 um 10:56 schrieb Matthew Auld:
> >>>>> On Tue, 24 Jan 2023 at 17:15, Matthew Auld
> >>>>> <matthew.william.auld@gmail.com> wrote:
> >>>>>> On Tue, 24 Jan 2023 at 13:48, Matthew Auld
> >>>>>> <matthew.william.auld@gmail.com> wrote:
> >>>>>>> On Tue, 24 Jan 2023 at 12:57, Christian König
> >>>>>>> <ckoenig.leichtzumerken@gmail.com> wrote:
> >>>>>>>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> >>>>>>>>
> >>>>>>>> Make sure we can at least move and alloc TT objects without backing store.
> >>>>>>>>
> >>>>>>>> v2: clear the tt object even when no resource is allocated.
> >>>>>>>> v3: add Matthews changes for i915 as well.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
> >>>>>>> Reviewed-by: Matthew Auld <matthew.auld@intel.com>
> >>>>>> Ofc that assumes intel-gfx CI is now happy with the series.
> >>>>> There are still some nasty failures it seems (in the extended test
> >>>>> list). But it looks like the series is already merged. Can we quickly
> >>>>> revert and try again?
> >>>> Ah, crap. I thought everything would be fine after the CI gave it's go.
> >>>>
> >>>> Which patch is causing the fallout?
> >>> I'm not sure. I think all of the patches kind of interact with each
> >>> other, but for sure there is an issue with the first patch. There is
> >>> one splat like:
> >> Well I would rather like to revert as less as possible.
> >>
> >> Are you sure that this isn't only on some i915 specific branch with not
> >> yet upstream changes?
> > Yeah, that splat is taken directly from the CI results reported with
> > this series. So it's just your series applied on top of drm-tip.
> >
> > Can you take a look at the first patch here:
> > https://patchwork.freedesktop.org/series/113332/
> >
> > Maybe you have a better idea? For reference the IGTs that we have for
> > verifying userspace object clearing are now failing, so hoping that
> > fixes it. The other two patches I'm hoping will fix the splat.
>
> The TTM change looks like a good idea to me. Feel free to add my rb to
> this one.
>
> I can't say much about the i915 changes.
>
> Maybe we should revert the two TTM patches to not allocate resources for
> now and fix i915 first?

From what I can see, we would need to revert all three TTM patches,
keeping just the i915 one. Reverting for now I think makes sense.

>
> Christian.
>
> >
> >> I can't even find the i915_gem_obj_copy_ttm function in drm-misc-next
> >> nor drm-next.
> >>
> >> Regards,
> >> Christian.
> >>
> >>> <1>[  109.735148] BUG: kernel NULL pointer dereference, address:
> >>> 0000000000000010
> >>> <1>[  109.735151] #PF: supervisor read access in kernel mode
> >>> <1>[  109.735152] #PF: error_code(0x0000) - not-present page
> >>> <6>[  109.735153] PGD 0 P4D 0
> >>> <4>[  109.735155] Oops: 0000 [#1] PREEMPT SMP NOPTI
> >>> <4>[  109.735157] CPU: 1 PID: 92 Comm: kworker/u12:6 Not tainted
> >>> 6.2.0-rc5-Patchwork_113269v1-gc4d436608c4e+ #1
> >>> <4>[  109.735159] Hardware name: Gigabyte Technology Co., Ltd. GB-Z390
> >>> Garuda/GB-Z390 Garuda-CF, BIOS IG1c 11/19/2019
> >>> <4>[  109.735160] Workqueue: events_unbound async_run_entry_fn
> >>> <4>[  109.735163] RIP: 0010:i915_ttm_resource_mappable+0x4/0x30 [i915]
> >>> <4>[  109.735286] Code: b8 f9 ff ff ff eb c2 e8 aa 5e 52 e1 e9 4f 0f
> >>> 18 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
> >>> 66 0f 1f 00 <8b> 57 10 b8 01 00 00 00 85 d2 74 15 48 8b 47 08 48 05 ff
> >>> 0f 00 00
> >>> <4>[  109.735288] RSP: 0018:ffffc90000f339a8 EFLAGS: 00010246
> >>> <4>[  109.735289] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
> >>> ffff88810cea3a00
> >>> <4>[  109.735290] RDX: 0000000000000000 RSI: ffffc90000f33af0 RDI:
> >>> 0000000000000000
> >>> <4>[  109.735292] RBP: ffff88811645d7c0 R08: 0000000000000000 R09:
> >>> ffff888123afa940
> >>> <4>[  109.735292] R10: 0000000000000001 R11: ffff888104b70040 R12:
> >>> 0000000000000000
> >>> <4>[  109.735293] R13: 0000000000000000 R14: ffffc90000f33b08 R15:
> >>> ffffc90000f33af0
> >>> <4>[  109.735294] FS:  0000000000000000(0000)
> >>> GS:ffff8884ad680000(0000) knlGS:0000000000000000
> >>> <4>[  109.735295] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>> <4>[  109.735296] CR2: 0000000000000010 CR3: 000000011f9c6003 CR4:
> >>> 00000000003706e0
> >>> <4>[  109.735297] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> >>> 0000000000000000
> >>> <4>[  109.735298] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> >>> 0000000000000400
> >>> <4>[  109.735299] Call Trace:
> >>> <4>[  109.735300]  <TASK>
> >>> <4>[  109.735301]  __i915_ttm_move+0x128/0x940 [i915]
> >>> <4>[  109.735408]  ? dma_resv_iter_next+0x91/0xb0
> >>> <4>[  109.735412]  ? dma_resv_iter_first+0x42/0xb0
> >>> <4>[  109.735414]  ? i915_deps_add_resv+0x4c/0xc0 [i915]
> >>> <4>[  109.735520]  i915_gem_obj_copy_ttm+0x12f/0x250 [i915]
> >>> <4>[  109.735625]  i915_ttm_restore+0x167/0x250 [i915]
> >>> <4>[  109.735759]  i915_gem_process_region+0x27a/0x3b0 [i915]
> >>> <4>[  109.735881]  i915_ttm_restore_region+0x4b/0x70 [i915]
> >>> <4>[  109.735999]  lmem_restore+0x3a/0x60 [i915]
> >>> <4>[  109.736101]  i915_gem_resume+0x4c/0x100 [i915]
> >>> <4>[  109.736202]  i915_drm_resume+0xc2/0x170 [i915]
> >>>
> >>> Plus some other less obvious issue(s) with some tests failing.
> >>>
> >>>> Christian.
>
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
index 8cfed1bef629..7420276827a5 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
@@ -274,8 +274,6 @@  static struct ttm_tt *i915_ttm_tt_create(struct ttm_buffer_object *bo,
 {
 	struct drm_i915_private *i915 = container_of(bo->bdev, typeof(*i915),
 						     bdev);
-	struct ttm_resource_manager *man =
-		ttm_manager_type(bo->bdev, bo->resource->mem_type);
 	struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
 	unsigned long ccs_pages = 0;
 	enum ttm_caching caching;
@@ -289,8 +287,8 @@  static struct ttm_tt *i915_ttm_tt_create(struct ttm_buffer_object *bo,
 	if (!i915_tt)
 		return NULL;
 
-	if (obj->flags & I915_BO_ALLOC_CPU_CLEAR &&
-	    man->use_tt)
+	if (obj->flags & I915_BO_ALLOC_CPU_CLEAR && (!bo->resource ||
+	    ttm_manager_type(bo->bdev, bo->resource->mem_type)->use_tt))
 		page_flags |= TTM_TT_FLAG_ZERO_ALLOC;
 
 	caching = i915_ttm_select_tt_caching(obj);
@@ -1058,7 +1056,26 @@  static vm_fault_t vm_fault_ttm(struct vm_fault *vmf)
 		return VM_FAULT_SIGBUS;
 	}
 
-	if (!i915_ttm_resource_mappable(bo->resource)) {
+	/*
+	 * This must be swapped out with shmem ttm_tt (pipeline-gutting).
+	 * Calling ttm_bo_validate() here with TTM_PL_SYSTEM should only go as
+	 * far as far doing a ttm_bo_move_null(), which should skip all the
+	 * other junk.
+	 */
+	if (!bo->resource) {
+		struct ttm_operation_ctx ctx = {
+			.interruptible = true,
+			.no_wait_gpu = true, /* should be idle already */
+		};
+
+		GEM_BUG_ON(!bo->ttm || !(bo->ttm->page_flags & TTM_TT_FLAG_SWAPPED));
+
+		ret = ttm_bo_validate(bo, i915_ttm_sys_placement(), &ctx);
+		if (ret) {
+			dma_resv_unlock(bo->base.resv);
+			return VM_FAULT_SIGBUS;
+		}
+	} else if (!i915_ttm_resource_mappable(bo->resource)) {
 		int err = -ENODEV;
 		int i;
 
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
index 2ebaaf4d663c..76dd9e5e1a8b 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
@@ -103,7 +103,27 @@  void i915_ttm_adjust_gem_after_move(struct drm_i915_gem_object *obj)
 {
 	struct ttm_buffer_object *bo = i915_gem_to_ttm(obj);
 	unsigned int cache_level;
+	unsigned int mem_flags;
 	unsigned int i;
+	int mem_type;
+
+	/*
+	 * We might have been purged (or swapped out) if the resource is NULL,
+	 * in which case the SYSTEM placement is the closest match to describe
+	 * the current domain. If the object is ever used in this state then we
+	 * will require moving it again.
+	 */
+	if (!bo->resource) {
+		mem_flags = I915_BO_FLAG_STRUCT_PAGE;
+		mem_type = I915_PL_SYSTEM;
+		cache_level = I915_CACHE_NONE;
+	} else {
+		mem_flags = i915_ttm_cpu_maps_iomem(bo->resource) ? I915_BO_FLAG_IOMEM :
+			I915_BO_FLAG_STRUCT_PAGE;
+		mem_type = bo->resource->mem_type;
+		cache_level = i915_ttm_cache_level(to_i915(bo->base.dev), bo->resource,
+						   bo->ttm);
+	}
 
 	/*
 	 * If object was moved to an allowable region, update the object
@@ -111,11 +131,11 @@  void i915_ttm_adjust_gem_after_move(struct drm_i915_gem_object *obj)
 	 * in an allowable region, it's evicted and we don't update the
 	 * object region.
 	 */
-	if (intel_region_to_ttm_type(obj->mm.region) != bo->resource->mem_type) {
+	if (intel_region_to_ttm_type(obj->mm.region) != mem_type) {
 		for (i = 0; i < obj->mm.n_placements; ++i) {
 			struct intel_memory_region *mr = obj->mm.placements[i];
 
-			if (intel_region_to_ttm_type(mr) == bo->resource->mem_type &&
+			if (intel_region_to_ttm_type(mr) == mem_type &&
 			    mr != obj->mm.region) {
 				i915_gem_object_release_memory_region(obj);
 				i915_gem_object_init_memory_region(obj, mr);
@@ -125,12 +145,8 @@  void i915_ttm_adjust_gem_after_move(struct drm_i915_gem_object *obj)
 	}
 
 	obj->mem_flags &= ~(I915_BO_FLAG_STRUCT_PAGE | I915_BO_FLAG_IOMEM);
+	obj->mem_flags |= mem_flags;
 
-	obj->mem_flags |= i915_ttm_cpu_maps_iomem(bo->resource) ? I915_BO_FLAG_IOMEM :
-		I915_BO_FLAG_STRUCT_PAGE;
-
-	cache_level = i915_ttm_cache_level(to_i915(bo->base.dev), bo->resource,
-					   bo->ttm);
 	i915_gem_object_set_cache_coherency(obj, cache_level);
 }
 
@@ -565,6 +581,32 @@  int i915_ttm_move(struct ttm_buffer_object *bo, bool evict,
 		return 0;
 	}
 
+	if (!bo->resource) {
+		if (dst_mem->mem_type != TTM_PL_SYSTEM) {
+			hop->mem_type = TTM_PL_SYSTEM;
+			hop->flags = TTM_PL_FLAG_TEMPORARY;
+			return -EMULTIHOP;
+		}
+
+		/*
+		 * This is only reached when first creating the object, or if
+		 * the object was purged or swapped out (pipeline-gutting). For
+		 * the former we can safely skip all of the below since we are
+		 * only using a dummy SYSTEM placement here. And with the latter
+		 * we will always re-enter here with bo->resource set correctly
+		 * (as per the above), since this is part of a multi-hop
+		 * sequence, where at the end we can do the move for real.
+		 *
+		 * The special case here is when the dst_mem is TTM_PL_SYSTEM,
+		 * which doens't require any kind of move, so it should be safe
+		 * to skip all the below and call ttm_bo_move_null() here, where
+		 * the caller in __i915_ttm_get_pages() will take care of the
+		 * rest, since we should have a valid ttm_tt.
+		 */
+		ttm_bo_move_null(bo, dst_mem);
+		return 0;
+	}
+
 	ret = i915_ttm_move_notify(bo);
 	if (ret)
 		return ret;