[v9,2/8] drm/i915/ttm: add tt shmem backend

Message ID 20211018091055.1998191-2-matthew.auld@intel.com (mailing list archive)
State New, archived
Series [v9,1/8] drm/i915/gem: Break out some shmem backend utils

Commit Message

Matthew Auld Oct. 18, 2021, 9:10 a.m. UTC
For cached objects we can allocate our pages directly in shmem. This
should make it possible (in a later patch) to utilise the existing
i915-gem shrinker code for such objects. For now this is still disabled.

v2(Thomas):
  - Add optional try_to_writeback hook for objects. Importantly we need
    to check if the object is even still shrinkable; in between us
    dropping the shrinker LRU lock and acquiring the object lock it
    could, for example, have been moved. We also need to differentiate
    between "lazy" shrinking and the immediate writeback mode. Later we
    also need to handle objects which don't even have mm.pages, so
    bundling this into put_pages() would require somehow handling that
    edge case; hence just letting the ttm backend handle everything in
    try_to_writeback doesn't seem too bad.
v3(Thomas):
  - Likely a bad idea to touch the object from the unpopulate hook,
    since it's not possible to hold a reference, without also creating
    circular dependency, so likely this is too fragile. For now just
    ensure we at least mark the pages as dirty/accessed when called from the
    shrinker on WILLNEED objects.
  - s/try_to_writeback/shrinker_release_pages, since this can do more
    than just writeback.
  - Get rid of do_backup boolean and just set the SWAPPED flag prior to
    calling unpopulate.
  - Keep shmem_tt as lowest priority for the TTM LRU bo_swapout walk, since
    these just get skipped anyway. We can try to come up with something
    better later.
v4(Thomas):
  - s/PCI_DMA/DMA/. Also drop NO_KERNEL_MAPPING and NO_WARN, which
    apparently don't do anything with streaming mappings.
  - Just pass along the error for ->truncate, and assume nothing.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Oak Zeng <oak.zeng@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Acked-by: Oak Zeng <oak.zeng@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_object.h    |  11 +-
 .../gpu/drm/i915/gem/i915_gem_object_types.h  |   4 +-
 drivers/gpu/drm/i915/gem/i915_gem_pages.c     |   6 +-
 drivers/gpu/drm/i915/gem/i915_gem_shmem.c     |  18 +-
 drivers/gpu/drm/i915/gem/i915_gem_shrinker.c  |  17 +-
 drivers/gpu/drm/i915/gem/i915_gem_ttm.c       | 233 ++++++++++++++++--
 6 files changed, 247 insertions(+), 42 deletions(-)

Comments

Tvrtko Ursulin Dec. 7, 2021, 1:10 p.m. UTC | #1
On 18/10/2021 10:10, Matthew Auld wrote:

[snip]

> -static void try_to_writeback(struct drm_i915_gem_object *obj,
> -			     unsigned int flags)
> +static int try_to_writeback(struct drm_i915_gem_object *obj, unsigned int flags)
>   {
> +	if (obj->ops->shrinker_release_pages)
> +		return obj->ops->shrinker_release_pages(obj,
> +							flags & I915_SHRINK_WRITEBACK);

I have a high-level question about how this new vfunc fits with the existing code.

Because I notice in the implementation (i915_ttm_shrinker_release_pages) it ends up doing:
...

        switch (obj->mm.madv) {
        case I915_MADV_DONTNEED:
                return i915_ttm_purge(obj);
        case __I915_MADV_PURGED:
                return 0;
        }

... and then...

        if (should_writeback)
                __shmem_writeback(obj->base.size, i915_tt->filp->f_mapping);

Which at a glance looks like a possible conceptual duplication of this very function (try_to_writeback):

> +
>   	switch (obj->mm.madv) {
>   	case I915_MADV_DONTNEED:
>   		i915_gem_object_truncate(obj);
> -		return;
> +		return 0;
>   	case __I915_MADV_PURGED:
> -		return;
> +		return 0;
>   	}
>   
>   	if (flags & I915_SHRINK_WRITEBACK)
>   		i915_gem_object_writeback(obj);

So the question is: is this the final design, or is some further tidying possible/planned?

Background to my question is that I will float a patch which proposes to remove writeback altogether. There are reports from the field that it causes deadlocks, and looking at 2d6692e642e7 ("drm/i915: Start writeback from the shrinker") and its history it seems like it was a known risk.

Apart from the code organisation questions, on the practical level: do you need writeback in the TTM backend, or, while I am proposing to remove it from the "legacy" paths, can I propose removing it from the TTM flow as well?

Regards,

Tvrtko
Matthew Auld Dec. 7, 2021, 2:05 p.m. UTC | #2
On 07/12/2021 13:10, Tvrtko Ursulin wrote:
> 
> On 18/10/2021 10:10, Matthew Auld wrote:
>> For cached objects we can allocate our pages directly in shmem. This
>> should make it possible(in a later patch) to utilise the existing
>> i915-gem shrinker code for such objects. For now this is still disabled.
>>
>> v2(Thomas):
>>    - Add optional try_to_writeback hook for objects. Importantly we need
>>      to check if the object is even still shrinkable; in between us
>>      dropping the shrinker LRU lock and acquiring the object lock it 
>> could for
>>      example have been moved. Also we need to differentiate between
>>      "lazy" shrinking and the immediate writeback mode. Also later we 
>> need to
>>      handle objects which don't even have mm.pages, so bundling this into
>>      put_pages() would require somehow handling that edge case, hence
>>      just letting the ttm backend handle everything in try_to_writeback
>>      doesn't seem too bad.
>> v3(Thomas):
>>    - Likely a bad idea to touch the object from the unpopulate hook,
>>      since it's not possible to hold a reference, without also creating
>>      circular dependency, so likely this is too fragile. For now just
>>      ensure we at least mark the pages as dirty/accessed when called 
>> from the
>>      shrinker on WILLNEED objects.
>>    - s/try_to_writeback/shrinker_release_pages, since this can do more
>>      than just writeback.
>>    - Get rid of do_backup boolean and just set the SWAPPED flag prior to
>>      calling unpopulate.
>>    - Keep shmem_tt as lowest priority for the TTM LRU bo_swapout walk, 
>> since
>>      these just get skipped anyway. We can try to come up with something
>>      better later.
>> v4(Thomas):
>>    - s/PCI_DMA/DMA/. Also drop NO_KERNEL_MAPPING and NO_WARN, which
>>      apparently doesn't do anything with streaming mappings.
>>    - Just pass along the error for ->truncate, and assume nothing.
>>
>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: Oak Zeng <oak.zeng@intel.com>
>> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>> Acked-by: Oak Zeng <oak.zeng@intel.com>
> 
> [snip]
> 
>> -static void try_to_writeback(struct drm_i915_gem_object *obj,
>> -                 unsigned int flags)
>> +static int try_to_writeback(struct drm_i915_gem_object *obj, unsigned int flags)
>>   {
>> +    if (obj->ops->shrinker_release_pages)
>> +        return obj->ops->shrinker_release_pages(obj,
>> +                            flags & I915_SHRINK_WRITEBACK);
> 
> I have a high-level question about how this new vfunc fits with the
> existing code.
> 
> Because I notice in the implementation (i915_ttm_shrinker_release_pages) 
> it ends up doing:
> ...
> 
>         switch (obj->mm.madv) {
>         case I915_MADV_DONTNEED:
>                 return i915_ttm_purge(obj);
>         case __I915_MADV_PURGED:
>                 return 0;
>         }
> 
> ... and then...
> 
>         if (should_writeback)
>                 __shmem_writeback(obj->base.size, i915_tt->filp->f_mapping);
> 
> Which at a glance looks like a possible conceptual duplication of this
> very function (try_to_writeback):
> 
>> +
>>       switch (obj->mm.madv) {
>>       case I915_MADV_DONTNEED:
>>           i915_gem_object_truncate(obj);
>> -        return;
>> +        return 0;
>>       case __I915_MADV_PURGED:
>> -        return;
>> +        return 0;
>>       }
>>       if (flags & I915_SHRINK_WRITEBACK)
>>           i915_gem_object_writeback(obj);
> 
> So the question is: is this the final design, or is some further
> tidying possible/planned?

It seems ok to me, all things considered. The TTM version needs to check
if the object is still shrinkable at the start (plus some other stuff),
upon acquiring the object lock. If that succeeds we can do the above
madv checks and potentially just call truncate. Otherwise we can proceed
with shrinking, but again TTM is special here, and we have to call
ttm_bo_validate() underneath (we might not even have mm.pages here). And
then if that all works we might be able to also perform the writeback,
if needed. So I suppose we could add all that directly in
try_to_writeback() and make it conditional for TTM devices, or I guess
we need two separate hooks, one to do some pre-checking and another to
do the actual swap step. Not sure if that is better or worse though.
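
Condensed, the flow described above looks roughly like this, as a sketch
(mirroring i915_ttm_shrinker_release_pages from the patch below, with
the SWAPPED-flag handling and error unwinding trimmed):

static int ttm_shrink_sketch(struct drm_i915_gem_object *obj,
			     bool should_writeback)
{
	struct ttm_buffer_object *bo = i915_gem_to_ttm(obj);
	struct i915_ttm_tt *i915_tt =
		container_of(bo->ttm, typeof(*i915_tt), ttm);
	struct ttm_operation_ctx ctx = { .interruptible = true };
	struct ttm_placement place = {}; /* empty placement == swap out */

	/* Check we are still a shrinkable, shmem-backed smem object. */
	if (!bo->ttm || bo->resource->mem_type != TTM_PL_SYSTEM ||
	    !i915_tt->filp)
		return 0;

	/* The same madv checks as try_to_writeback(). */
	switch (obj->mm.madv) {
	case I915_MADV_DONTNEED:
		return i915_ttm_purge(obj);
	case __I915_MADV_PURGED:
		return 0;
	}

	/* TTM is special: unpopulate via validate; mm.pages may be absent. */
	if (ttm_bo_validate(bo, &place, &ctx))
		return -EBUSY;

	/* Only then, optionally, kick writeback on the shmem mapping. */
	if (should_writeback)
		__shmem_writeback(obj->base.size, i915_tt->filp->f_mapping);

	return 0;
}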

> 
> Background to my question is that I will float a patch which proposes to
> remove writeback altogether. There are reports from the field that it
> causes deadlocks, and looking at 2d6692e642e7 ("drm/i915: Start writeback
> from the shrinker") and its history it seems like it was a known risk.
> 
> Apart from the code organisation questions, on the practical level: do
> you need writeback in the TTM backend, or, while I am proposing to
> remove it from the "legacy" paths, can I propose removing it from the
> TTM flow as well?

Yeah, if that is somehow busted then we should remove it from the TTM backend also.

> 
> Regards,
> 
> Tvrtko
Tvrtko Ursulin Dec. 8, 2021, 8:30 a.m. UTC | #3
On 07/12/2021 14:05, Matthew Auld wrote:
> On 07/12/2021 13:10, Tvrtko Ursulin wrote:
>>
>> [snip]
>>
>> So the question is: is this the final design, or is some further
>> tidying possible/planned?
> 
> It seems ok to me, all things considered. The TTM version needs to check
> if the object is still shrinkable at the start (plus some other stuff),
> upon acquiring the object lock. If that succeeds we can do the above
> madv checks and potentially just call truncate. Otherwise we can proceed
> with shrinking, but again TTM is special here, and we have to call
> ttm_bo_validate() underneath (we might not even have mm.pages here). And
> then if that all works we might be able to also perform the writeback,
> if needed. So I suppose we could add all that directly in
> try_to_writeback() and make it conditional for TTM devices, or I guess
> we need two separate hooks, one to do some pre-checking and another to
> do the actual swap step. Not sure if that is better or worse though.

Would implementing shrinker_release_pages for all objects work? It
would contain what is currently in try_to_writeback, and so the two
paths, if not compatible, would diverge cleanly on the same level of
indirection. I mean we would not have two effectively mutually exclusive
vfuncs (shrinker_release_pages and writeback), and could unexport
i915_gem_object_writeback.
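
Something like this, as a sketch (shmem_shrinker_release_pages is a
made-up name; the body is just what try_to_writeback carries today):

static int shmem_shrinker_release_pages(struct drm_i915_gem_object *obj,
					bool should_writeback)
{
	/* Discard the backing storage for objects the user marked as such. */
	switch (obj->mm.madv) {
	case I915_MADV_DONTNEED:
		return i915_gem_object_truncate(obj);
	case __I915_MADV_PURGED:
		return 0;
	}

	/* Otherwise, optionally start writeback on the unpinned pages. */
	if (should_writeback)
		i915_gem_object_writeback(obj);

	return 0;
}

try_to_writeback() would then reduce to just the vfunc call, and
i915_gem_object_writeback could be made static to the shmem backend.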

>> Background to my question is that I will float a patch which proposes
>> to remove writeback altogether. There are reports from the field that
>> it causes deadlocks, and looking at 2d6692e642e7 ("drm/i915: Start
>> writeback from the shrinker") and its history it seems like it was a
>> known risk.
>>
>> Apart from the code organisation questions, on the practical level:
>> do you need writeback in the TTM backend, or, while I am proposing to
>> remove it from the "legacy" paths, can I propose removing it from the
>> TTM flow as well?
> 
> Yeah, if that is somehow busted then we should remove it from the TTM
> backend also.

Okay thanks, I wanted to check in case there was an extra need in TTM. I
will hopefully float a patch soon, but testing will be a problem since
it seems very hard to repro at the moment.

Regards,

Tvrtko
Thomas Hellström Dec. 8, 2021, 8:39 a.m. UTC | #4
On 12/8/21 09:30, Tvrtko Ursulin wrote:


...


>>> Apart from the code organisation questions, on the practical level:
>>> do you need writeback in the TTM backend, or, while I am proposing
>>> to remove it from the "legacy" paths, can I propose removing it from
>>> the TTM flow as well?
>>
>> Yeah, if that is somehow busted then we should remove it from the TTM
>> backend also.
>
> Okay thanks, I wanted to check in case there was an extra need in TTM.
> I will hopefully float a patch soon, but testing will be a problem
> since it seems very hard to repro at the moment.

Do we have some information about what's causing the deadlock or a 
signature? I'm asking because if some sort of shrinker was added to TTM 
itself, for the TTM page vectors, it would need to allocate shmem pages 
at shrink time rather than to unpin them at shrink time as we do here. 
And for that to have any chance of working sort of reliably, I think 
writeback is needed.
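
To illustrate, such a shrink step could look something like this (a
sketch only; all names hypothetical, locking and error unwinding
omitted):

static int ttm_tt_backup_to_shmem(struct ttm_tt *ttm,
				  struct address_space *mapping)
{
	pgoff_t i;

	for (i = 0; i < ttm->num_pages; i++) {
		/* Allocating under memory pressure is the fragile part... */
		struct page *page = shmem_read_mapping_page(mapping, i);

		if (IS_ERR(page))
			return PTR_ERR(page);

		copy_highpage(page, ttm->pages[i]);
		set_page_dirty(page);
		put_page(page);
	}

	/*
	 * ...which is why kicking writeback, so that shmem pages can
	 * actually be freed again, would help the next allocation succeed.
	 */
	return 0;
}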

But I agree that for this implementation, the need for writeback isn't
different than for the non-TTM shmem objects.

Thanks,

Thomas


>
> Regards,
>
> Tvrtko
Tvrtko Ursulin Dec. 8, 2021, 9:24 a.m. UTC | #5
On 08/12/2021 08:39, Thomas Hellström wrote:
> 
> On 12/8/21 09:30, Tvrtko Ursulin wrote:
> 
> 
> ...
> 
> 
>>>> Apart from the code organisation questions, on the practical level:
>>>> do you need writeback in the TTM backend, or, while I am proposing
>>>> to remove it from the "legacy" paths, can I propose removing it from
>>>> the TTM flow as well?
>>>
>>> Yeah, if that is somehow busted then we should remove it from the TTM
>>> backend also.
>>
>> Okay thanks, I wanted to check in case there was an extra need in TTM.
>> I will hopefully float a patch soon, but testing will be a problem
>> since it seems very hard to repro at the moment.
> 
> Do we have some information about what's causing the deadlock or a 
> signature? I'm asking because if some sort of shrinker was added to TTM 

Yes, the signature is the hung task detector kicking in, and pretty much a system standstill ensues. You will find a bunch of tasks stuck like this:

[  247.165578] chrome          D    0  1791   1785 0x00004080
[  247.171732] Call Trace:
[  247.174492]  __schedule+0x57e/0x10d2
[  247.178514]  ? pcpu_alloc_area+0x25d/0x273
[  247.183115]  schedule+0x7c/0x9f
[  247.186647]  rwsem_down_write_slowpath+0x2ae/0x494
[  247.192025]  register_shrinker_prepared+0x19/0x48
[  247.197310]  ? test_single_super+0x10/0x10
[  247.201910]  sget_fc+0x1fc/0x20e
[  247.205540]  ? kill_litter_super+0x40/0x40
[  247.210139]  ? proc_apply_options+0x42/0x42
[  247.214838]  vfs_get_super+0x3a/0xdf
[  247.218855]  vfs_get_tree+0x2b/0xc3
[  247.222911]  fc_mount+0x12/0x39
[  247.226814]  pid_ns_prepare_proc+0x9d/0xc5
[  247.232468]  alloc_pid+0x275/0x289
[  247.236432]  copy_process+0x5e5/0xeea
[  247.240640]  _do_fork+0x95/0x303
[  247.244261]  __se_sys_clone+0x65/0x7f
[  247.248366]  do_syscall_64+0x54/0x7e
[  247.252365]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Or this:

[  247.030274] minijail-init   D    0  1773   1770 0x80004082
[  247.036419] Call Trace:
[  247.039167]  __schedule+0x57e/0x10d2
[  247.043175]  ? __schedule+0x586/0x10d2
[  247.047381]  ? _raw_spin_unlock+0xe/0x20
[  247.051779]  ? __queue_work+0x316/0x371
[  247.056079]  schedule+0x7c/0x9f
[  247.059602]  rwsem_down_write_slowpath+0x2ae/0x494
[  247.064971]  unregister_shrinker+0x20/0x65
[  247.069562]  deactivate_locked_super+0x38/0x88
[  247.074538]  cleanup_mnt+0xcc/0x10e
[  247.078447]  task_work_run+0x7d/0xa6
[  247.082459]  do_exit+0x23d/0x831
[  247.086079]  ? syscall_trace_enter+0x207/0x20e
[  247.091055]  do_group_exit+0x8d/0x9d
[  247.095062]  __x64_sys_exit_group+0x17/0x17
[  247.099750]  do_syscall_64+0x54/0x7e
[  247.103758]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

And smoking gun is:

[  247.383338] CPU: 3 PID: 88 Comm: kswapd0 Tainted: G     U            5.4.154 #36
[  247.383338] Hardware name: Google Delbin/Delbin, BIOS Google_Delbin.13672.57.0 02/09/2021
[  247.383339] RIP: 0010:__rcu_read_lock+0x0/0x1a
[  247.383339] Code: ff ff 0f 0b e9 61 fe ff ff 0f 0b e9 76 fe ff ff 0f 0b 49 8b 44 24 20 e9 59 ff ff ff 0f 0b e9 67 ff ff ff 0f 0b e9 1b ff ff ff <0f> 1f 44 00 00 55 48 89 e5 65 48 8b 04 25 80 5d 01 00 ff 80 f8 03
[  247.383340] RSP: 0018:ffffb0aa0031b978 EFLAGS: 00000286
[  247.383340] RAX: 0000000000000000 RBX: fffff6b944ca8040 RCX: fffff6b944ca8001
[  247.383341] RDX: 0000000000000028 RSI: 0000000000000001 RDI: ffff8b52bc618c18
[  247.383341] RBP: ffffb0aa0031b9d0 R08: 0000000000000000 R09: ffff8b52fb5f00d8
[  247.383341] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  247.383342] R13: 61c8864680b583eb R14: 0000000000000001 R15: ffffb0aa0031b980
[  247.383342] FS:  0000000000000000(0000) GS:ffff8b52fbf80000(0000) knlGS:0000000000000000
[  247.383343] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  247.383343] CR2: 00007c78a400d680 CR3: 0000000120f46006 CR4: 0000000000762ee0
[  247.383344] PKRU: 55555554
[  247.383344] Call Trace:
[  247.383345]  find_get_entry+0x4c/0x116
[  247.383345]  find_lock_entry+0xc8/0xec
[  247.383346]  shmem_writeback+0x7b/0x163
[  247.383346]  i915_gem_shrink+0x302/0x40b
[  247.383347]  ? __intel_runtime_pm_get+0x22/0x82
[  247.383347]  i915_gem_shrinker_scan+0x86/0xa8
[  247.383348]  shrink_slab+0x272/0x48b
[  247.383348]  shrink_node+0x784/0xbea
[  247.383348]  ? rcu_read_unlock_special+0x66/0x15f
[  247.383349]  ? update_batch_size+0x78/0x78
[  247.383349]  kswapd+0x75c/0xa56
[  247.383350]  kthread+0x147/0x156
[  247.383350]  ? kswapd_run+0xb6/0xb6
[  247.383351]  ? kthread_blkcg+0x2e/0x2e
[  247.383351]  ret_from_fork+0x1f/0x40

Sadly, getting logs or a repro from 5.16-rc is more difficult due to other issues, or the problem is altogether gone, which is also a possibility. It is also possible that transparent hugepages either enable the hang, or just make it more likely.

However, given the history of writeback, I think it sounds plausible that it is indeed unsafe. I will try to dig out a reply from Hugh Dickins who advised against doing it, and I think that advice did not change, or I failed to find a later thread. There is at least a mention of that discussion in the patch which added writeback.

> itself, for the TTM page vectors, it would need to allocate shmem pages 
> at shrink time rather than to unpin them at shrink time as we do here. 
> And for that to have any chance of working sort of reliably, I think 
> writeback is needed.

I don't claim to understand it fully, but won't the system take care of that, with the only difference being that allocation might work a little less reliably under extreme memory pressure? And I did not find other drivers using it, at least, which may be an indication we should indeed steer clear of it.

Regards,

Tvrtko

> But I agree that for this implementation, the need for writeback isn't
> different than for the non-TTM shmem objects.
>
> Thanks,
> 
> Thomas
> 
> 
>>
>> Regards,
>>
>> Tvrtko
>
Thomas Hellström Dec. 8, 2021, 9:32 a.m. UTC | #6
On 12/8/21 10:24, Tvrtko Ursulin wrote:
>
> On 08/12/2021 08:39, Thomas Hellström wrote:
>>
>> On 12/8/21 09:30, Tvrtko Ursulin wrote:
>>
>> [snip]
>>
>> Do we have some information about what's causing the deadlock or a 
>> signature? I'm asking because if some sort of shrinker was added to TTM 
>
> [snip]
>
>> itself, for the TTM page vectors, it would need to allocate shmem 
>> pages at shrink time rather than to unpin them at shrink time as we 
>> do here. And for that to have any chance of working sort of reliably, 
>> I think writeback is needed.
>
> I don't claim to understand it fully, but won't the system take care 
> of that, with the only difference being that allocation might work a 
> little less reliably under extreme memory pressure? 

Yes, IIRC it's exactly this that made the previous attempt at a generic
TTM shrinker fail.


> And I did not find other drivers using it, at least, which may be an
> indication we should indeed steer clear of it.

You're probably right.

>
> Regards,
>
> Tvrtko

Thanks,

Thomas




>
>> But I agree that for this implementation, the need for writeback isn't
>> different than for the non-TTM shmem objects.
>>
>> Thanks,
>>
>> Thomas
>>
>>
>>>
>>> Regards,
>>>
>>> Tvrtko
>>
Patch

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.h b/drivers/gpu/drm/i915/gem/i915_gem_object.h
index 9df3ee60604e..e641db297e0e 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object.h
@@ -93,7 +93,6 @@  void i915_gem_flush_free_objects(struct drm_i915_private *i915);
 
 struct sg_table *
 __i915_gem_object_unset_pages(struct drm_i915_gem_object *obj);
-void i915_gem_object_truncate(struct drm_i915_gem_object *obj);
 
 /**
  * i915_gem_object_lookup_rcu - look up a temporary GEM object from its handle
@@ -449,7 +448,7 @@  i915_gem_object_unpin_pages(struct drm_i915_gem_object *obj)
 }
 
 int __i915_gem_object_put_pages(struct drm_i915_gem_object *obj);
-void i915_gem_object_truncate(struct drm_i915_gem_object *obj);
+int i915_gem_object_truncate(struct drm_i915_gem_object *obj);
 void i915_gem_object_writeback(struct drm_i915_gem_object *obj);
 
 /**
@@ -612,6 +611,14 @@  int i915_gem_object_wait_migration(struct drm_i915_gem_object *obj,
 bool i915_gem_object_placement_possible(struct drm_i915_gem_object *obj,
 					enum intel_memory_type type);
 
+struct sg_table *shmem_alloc_st(struct drm_i915_private *i915,
+				size_t size, struct intel_memory_region *mr,
+				struct address_space *mapping,
+				unsigned int max_segment);
+void shmem_free_st(struct sg_table *st, struct address_space *mapping,
+		   bool dirty, bool backup);
+void __shmem_writeback(size_t size, struct address_space *mapping);
+
 #ifdef CONFIG_MMU_NOTIFIER
 static inline bool
 i915_gem_object_is_userptr(struct drm_i915_gem_object *obj)
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
index 7c3da4e3e737..7dd5f804aab3 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
@@ -54,8 +54,10 @@  struct drm_i915_gem_object_ops {
 	int (*get_pages)(struct drm_i915_gem_object *obj);
 	void (*put_pages)(struct drm_i915_gem_object *obj,
 			  struct sg_table *pages);
-	void (*truncate)(struct drm_i915_gem_object *obj);
+	int (*truncate)(struct drm_i915_gem_object *obj);
 	void (*writeback)(struct drm_i915_gem_object *obj);
+	int (*shrinker_release_pages)(struct drm_i915_gem_object *obj,
+				      bool should_writeback);
 
 	int (*pread)(struct drm_i915_gem_object *obj,
 		     const struct drm_i915_gem_pread *arg);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_pages.c b/drivers/gpu/drm/i915/gem/i915_gem_pages.c
index 8eb1c3a6fc9c..ea6d9b3d2d6b 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_pages.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_pages.c
@@ -158,11 +158,13 @@  int i915_gem_object_pin_pages_unlocked(struct drm_i915_gem_object *obj)
 }
 
 /* Immediately discard the backing storage */
-void i915_gem_object_truncate(struct drm_i915_gem_object *obj)
+int i915_gem_object_truncate(struct drm_i915_gem_object *obj)
 {
 	drm_gem_free_mmap_offset(&obj->base);
 	if (obj->ops->truncate)
-		obj->ops->truncate(obj);
+		return obj->ops->truncate(obj);
+
+	return 0;
 }
 
 /* Try to discard unwanted pages */
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
index 36b711ae9e28..8375dc23f233 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -25,8 +25,8 @@  static void check_release_pagevec(struct pagevec *pvec)
 	cond_resched();
 }
 
-static void shmem_free_st(struct sg_table *st, struct address_space *mapping,
-			  bool dirty, bool backup)
+void shmem_free_st(struct sg_table *st, struct address_space *mapping,
+		   bool dirty, bool backup)
 {
 	struct sgt_iter sgt_iter;
 	struct pagevec pvec;
@@ -52,10 +52,10 @@  static void shmem_free_st(struct sg_table *st, struct address_space *mapping,
 	kfree(st);
 }
 
-static struct sg_table *shmem_alloc_st(struct drm_i915_private *i915,
-				       size_t size, struct intel_memory_region *mr,
-				       struct address_space *mapping,
-				       unsigned int max_segment)
+struct sg_table *shmem_alloc_st(struct drm_i915_private *i915,
+				size_t size, struct intel_memory_region *mr,
+				struct address_space *mapping,
+				unsigned int max_segment)
 {
 	const unsigned long page_count = size / PAGE_SIZE;
 	unsigned long i;
@@ -286,7 +286,7 @@  static int shmem_get_pages(struct drm_i915_gem_object *obj)
 	return ret;
 }
 
-static void
+static int
 shmem_truncate(struct drm_i915_gem_object *obj)
 {
 	/*
@@ -298,9 +298,11 @@  shmem_truncate(struct drm_i915_gem_object *obj)
 	shmem_truncate_range(file_inode(obj->base.filp), 0, (loff_t)-1);
 	obj->mm.madv = __I915_MADV_PURGED;
 	obj->mm.pages = ERR_PTR(-EFAULT);
+
+	return 0;
 }
 
-static void __shmem_writeback(size_t size, struct address_space *mapping)
+void __shmem_writeback(size_t size, struct address_space *mapping)
 {
 	struct writeback_control wbc = {
 		.sync_mode = WB_SYNC_NONE,
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
index 5ab136ffdeb2..ae2a8d54b7a4 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
@@ -56,19 +56,24 @@  static bool unsafe_drop_pages(struct drm_i915_gem_object *obj,
 	return false;
 }
 
-static void try_to_writeback(struct drm_i915_gem_object *obj,
-			     unsigned int flags)
+static int try_to_writeback(struct drm_i915_gem_object *obj, unsigned int flags)
 {
+	if (obj->ops->shrinker_release_pages)
+		return obj->ops->shrinker_release_pages(obj,
+							flags & I915_SHRINK_WRITEBACK);
+
 	switch (obj->mm.madv) {
 	case I915_MADV_DONTNEED:
 		i915_gem_object_truncate(obj);
-		return;
+		return 0;
 	case __I915_MADV_PURGED:
-		return;
+		return 0;
 	}
 
 	if (flags & I915_SHRINK_WRITEBACK)
 		i915_gem_object_writeback(obj);
+
+	return 0;
 }
 
 /**
@@ -222,8 +227,8 @@  i915_gem_shrink(struct i915_gem_ww_ctx *ww,
 				}
 
 				if (!__i915_gem_object_put_pages(obj)) {
-					try_to_writeback(obj, shrink);
-					count += obj->base.size >> PAGE_SHIFT;
+					if (!try_to_writeback(obj, shrink))
+						count += obj->base.size >> PAGE_SHIFT;
 				}
 				if (!ww)
 					i915_gem_object_unlock(obj);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
index 74a1ffd0d7dd..0fe1eb8f38e9 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
@@ -35,6 +35,8 @@ 
  * @ttm: The base TTM page vector.
  * @dev: The struct device used for dma mapping and unmapping.
  * @cached_st: The cached scatter-gather table.
+ * @is_shmem: Set if using shmem.
+ * @filp: The shmem file, if using shmem backend.
  *
  * Note that DMA may be going on right up to the point where the page-
  * vector is unpopulated in delayed destroy. Hence keep the
@@ -46,6 +48,9 @@  struct i915_ttm_tt {
 	struct ttm_tt ttm;
 	struct device *dev;
 	struct sg_table *cached_st;
+
+	bool is_shmem;
+	struct file *filp;
 };
 
 static const struct ttm_place sys_placement_flags = {
@@ -179,12 +184,88 @@  i915_ttm_placement_from_obj(const struct drm_i915_gem_object *obj,
 	placement->busy_placement = busy;
 }
 
+static int i915_ttm_tt_shmem_populate(struct ttm_device *bdev,
+				      struct ttm_tt *ttm,
+				      struct ttm_operation_ctx *ctx)
+{
+	struct drm_i915_private *i915 = container_of(bdev, typeof(*i915), bdev);
+	struct intel_memory_region *mr = i915->mm.regions[INTEL_MEMORY_SYSTEM];
+	struct i915_ttm_tt *i915_tt = container_of(ttm, typeof(*i915_tt), ttm);
+	const unsigned int max_segment = i915_sg_segment_size();
+	const size_t size = ttm->num_pages << PAGE_SHIFT;
+	struct file *filp = i915_tt->filp;
+	struct sgt_iter sgt_iter;
+	struct sg_table *st;
+	struct page *page;
+	unsigned long i;
+	int err;
+
+	if (!filp) {
+		struct address_space *mapping;
+		gfp_t mask;
+
+		filp = shmem_file_setup("i915-shmem-tt", size, VM_NORESERVE);
+		if (IS_ERR(filp))
+			return PTR_ERR(filp);
+
+		mask = GFP_HIGHUSER | __GFP_RECLAIMABLE;
+
+		mapping = filp->f_mapping;
+		mapping_set_gfp_mask(mapping, mask);
+		GEM_BUG_ON(!(mapping_gfp_mask(mapping) & __GFP_RECLAIM));
+
+		i915_tt->filp = filp;
+	}
+
+	st = shmem_alloc_st(i915, size, mr, filp->f_mapping, max_segment);
+	if (IS_ERR(st))
+		return PTR_ERR(st);
+
+	err = dma_map_sg_attrs(i915_tt->dev,
+			       st->sgl, st->nents,
+			       DMA_BIDIRECTIONAL,
+			       DMA_ATTR_SKIP_CPU_SYNC);
+	if (err <= 0) {
+		err = -EINVAL;
+		goto err_free_st;
+	}
+
+	i = 0;
+	for_each_sgt_page(page, sgt_iter, st)
+		ttm->pages[i++] = page;
+
+	if (ttm->page_flags & TTM_TT_FLAG_SWAPPED)
+		ttm->page_flags &= ~TTM_TT_FLAG_SWAPPED;
+
+	i915_tt->cached_st = st;
+	return 0;
+
+err_free_st:
+	shmem_free_st(st, filp->f_mapping, false, false);
+	return err;
+}
+
+static void i915_ttm_tt_shmem_unpopulate(struct ttm_tt *ttm)
+{
+	struct i915_ttm_tt *i915_tt = container_of(ttm, typeof(*i915_tt), ttm);
+	bool backup = ttm->page_flags & TTM_TT_FLAG_SWAPPED;
+
+	dma_unmap_sg(i915_tt->dev, i915_tt->cached_st->sgl,
+		     i915_tt->cached_st->nents,
+		     DMA_BIDIRECTIONAL);
+
+	shmem_free_st(fetch_and_zero(&i915_tt->cached_st),
+		      file_inode(i915_tt->filp)->i_mapping,
+		      backup, backup);
+}
+
 static struct ttm_tt *i915_ttm_tt_create(struct ttm_buffer_object *bo,
 					 uint32_t page_flags)
 {
 	struct ttm_resource_manager *man =
 		ttm_manager_type(bo->bdev, bo->resource->mem_type);
 	struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
+	enum ttm_caching caching = i915_ttm_select_tt_caching(obj);
 	struct i915_ttm_tt *i915_tt;
 	int ret;
 
@@ -196,36 +277,62 @@  static struct ttm_tt *i915_ttm_tt_create(struct ttm_buffer_object *bo,
 	    man->use_tt)
 		page_flags |= TTM_TT_FLAG_ZERO_ALLOC;
 
-	ret = ttm_tt_init(&i915_tt->ttm, bo, page_flags,
-			  i915_ttm_select_tt_caching(obj));
-	if (ret) {
-		kfree(i915_tt);
-		return NULL;
+	if (i915_gem_object_is_shrinkable(obj) && caching == ttm_cached) {
+		page_flags |= TTM_TT_FLAG_EXTERNAL |
+			      TTM_TT_FLAG_EXTERNAL_MAPPABLE;
+		i915_tt->is_shmem = true;
 	}
 
+	ret = ttm_tt_init(&i915_tt->ttm, bo, page_flags, caching);
+	if (ret)
+		goto err_free;
+
 	i915_tt->dev = obj->base.dev->dev;
 
 	return &i915_tt->ttm;
+
+err_free:
+	kfree(i915_tt);
+	return NULL;
+}
+
+static int i915_ttm_tt_populate(struct ttm_device *bdev,
+				struct ttm_tt *ttm,
+				struct ttm_operation_ctx *ctx)
+{
+	struct i915_ttm_tt *i915_tt = container_of(ttm, typeof(*i915_tt), ttm);
+
+	if (i915_tt->is_shmem)
+		return i915_ttm_tt_shmem_populate(bdev, ttm, ctx);
+
+	return ttm_pool_alloc(&bdev->pool, ttm, ctx);
 }
 
 static void i915_ttm_tt_unpopulate(struct ttm_device *bdev, struct ttm_tt *ttm)
 {
 	struct i915_ttm_tt *i915_tt = container_of(ttm, typeof(*i915_tt), ttm);
 
-	if (i915_tt->cached_st) {
-		dma_unmap_sgtable(i915_tt->dev, i915_tt->cached_st,
-				  DMA_BIDIRECTIONAL, 0);
-		sg_free_table(i915_tt->cached_st);
-		kfree(i915_tt->cached_st);
-		i915_tt->cached_st = NULL;
+	if (i915_tt->is_shmem) {
+		i915_ttm_tt_shmem_unpopulate(ttm);
+	} else {
+		if (i915_tt->cached_st) {
+			dma_unmap_sgtable(i915_tt->dev, i915_tt->cached_st,
+					  DMA_BIDIRECTIONAL, 0);
+			sg_free_table(i915_tt->cached_st);
+			kfree(i915_tt->cached_st);
+			i915_tt->cached_st = NULL;
+		}
+		ttm_pool_free(&bdev->pool, ttm);
 	}
-	ttm_pool_free(&bdev->pool, ttm);
 }
 
 static void i915_ttm_tt_destroy(struct ttm_device *bdev, struct ttm_tt *ttm)
 {
 	struct i915_ttm_tt *i915_tt = container_of(ttm, typeof(*i915_tt), ttm);
 
+	if (i915_tt->filp)
+		fput(i915_tt->filp);
+
 	ttm_tt_fini(ttm);
 	kfree(i915_tt);
 }
@@ -235,6 +342,14 @@  static bool i915_ttm_eviction_valuable(struct ttm_buffer_object *bo,
 {
 	struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
 
+	/*
+	 * EXTERNAL objects should never be swapped out by TTM, instead we need
+	 * to handle that ourselves. TTM will already skip such objects for us,
+	 * but we would like to avoid grabbing locks for no good reason.
+	 */
+	if (bo->ttm && bo->ttm->page_flags & TTM_TT_FLAG_EXTERNAL)
+		return false;
+
 	/* Will do for now. Our pinned objects are still on TTM's LRU lists */
 	return i915_gem_object_evictable(obj);
 }
@@ -328,9 +443,11 @@  static void i915_ttm_adjust_gem_after_move(struct drm_i915_gem_object *obj)
 	i915_gem_object_set_cache_coherency(obj, cache_level);
 }
 
-static void i915_ttm_purge(struct drm_i915_gem_object *obj)
+static int i915_ttm_purge(struct drm_i915_gem_object *obj)
 {
 	struct ttm_buffer_object *bo = i915_gem_to_ttm(obj);
+	struct i915_ttm_tt *i915_tt =
+		container_of(bo->ttm, typeof(*i915_tt), ttm);
 	struct ttm_operation_ctx ctx = {
 		.interruptible = true,
 		.no_wait_gpu = false,
@@ -339,17 +456,74 @@  static void i915_ttm_purge(struct drm_i915_gem_object *obj)
 	int ret;
 
 	if (obj->mm.madv == __I915_MADV_PURGED)
-		return;
+		return 0;
 
-	/* TTM's purge interface. Note that we might be reentering. */
 	ret = ttm_bo_validate(bo, &place, &ctx);
-	if (!ret) {
-		obj->write_domain = 0;
-		obj->read_domains = 0;
-		i915_ttm_adjust_gem_after_move(obj);
-		i915_ttm_free_cached_io_st(obj);
-		obj->mm.madv = __I915_MADV_PURGED;
+	if (ret)
+		return ret;
+
+	if (bo->ttm && i915_tt->filp) {
+		/*
+		 * The below fput() (which eventually calls shmem_truncate)
+		 * might be delayed by a worker, so when directly called to
+		 * purge the pages (like by the shrinker) we should try to be
+		 * more aggressive and release the pages immediately.
+		 */
+		shmem_truncate_range(file_inode(i915_tt->filp),
+				     0, (loff_t)-1);
+		fput(fetch_and_zero(&i915_tt->filp));
 	}
+
+	obj->write_domain = 0;
+	obj->read_domains = 0;
+	i915_ttm_adjust_gem_after_move(obj);
+	i915_ttm_free_cached_io_st(obj);
+	obj->mm.madv = __I915_MADV_PURGED;
+	return 0;
+}
+
+static int i915_ttm_shrinker_release_pages(struct drm_i915_gem_object *obj,
+					   bool should_writeback)
+{
+	struct ttm_buffer_object *bo = i915_gem_to_ttm(obj);
+	struct i915_ttm_tt *i915_tt =
+		container_of(bo->ttm, typeof(*i915_tt), ttm);
+	struct ttm_operation_ctx ctx = {
+		.interruptible = true,
+		.no_wait_gpu = false,
+	};
+	struct ttm_placement place = {};
+	int ret;
+
+	if (!bo->ttm || bo->resource->mem_type != TTM_PL_SYSTEM)
+		return 0;
+
+	GEM_BUG_ON(!i915_tt->is_shmem);
+
+	if (!i915_tt->filp)
+		return 0;
+
+	switch (obj->mm.madv) {
+	case I915_MADV_DONTNEED:
+		return i915_ttm_purge(obj);
+	case __I915_MADV_PURGED:
+		return 0;
+	}
+
+	if (bo->ttm->page_flags & TTM_TT_FLAG_SWAPPED)
+		return 0;
+
+	bo->ttm->page_flags |= TTM_TT_FLAG_SWAPPED;
+	ret = ttm_bo_validate(bo, &place, &ctx);
+	if (ret) {
+		bo->ttm->page_flags &= ~TTM_TT_FLAG_SWAPPED;
+		return ret;
+	}
+
+	if (should_writeback)
+		__shmem_writeback(obj->base.size, i915_tt->filp->f_mapping);
+
+	return 0;
 }
 
 static void i915_ttm_swap_notify(struct ttm_buffer_object *bo)
@@ -620,6 +794,7 @@  static unsigned long i915_ttm_io_mem_pfn(struct ttm_buffer_object *bo,
 
 static struct ttm_device_funcs i915_ttm_bo_driver = {
 	.ttm_tt_create = i915_ttm_tt_create,
+	.ttm_tt_populate = i915_ttm_tt_populate,
 	.ttm_tt_unpopulate = i915_ttm_tt_unpopulate,
 	.ttm_tt_destroy = i915_ttm_tt_destroy,
 	.eviction_valuable = i915_ttm_eviction_valuable,
@@ -687,12 +862,17 @@  static int __i915_ttm_get_pages(struct drm_i915_gem_object *obj,
 	}
 
 	if (!i915_gem_object_has_pages(obj)) {
+		struct i915_ttm_tt *i915_tt =
+			container_of(bo->ttm, typeof(*i915_tt), ttm);
+
 		/* Object either has a page vector or is an iomem object */
 		st = bo->ttm ? i915_ttm_tt_get_st(bo->ttm) : obj->ttm.cached_io_st;
 		if (IS_ERR(st))
 			return PTR_ERR(st);
 
 		__i915_gem_object_set_pages(obj, st, i915_sg_dma_sizes(st->sgl));
+		if (!bo->ttm || !i915_tt->is_shmem)
+			i915_gem_object_make_unshrinkable(obj);
 	}
 
 	return ret;
@@ -772,6 +952,8 @@  static void i915_ttm_put_pages(struct drm_i915_gem_object *obj,
 static void i915_ttm_adjust_lru(struct drm_i915_gem_object *obj)
 {
 	struct ttm_buffer_object *bo = i915_gem_to_ttm(obj);
+	struct i915_ttm_tt *i915_tt =
+		container_of(bo->ttm, typeof(*i915_tt), ttm);
 
 	/*
 	 * Don't manipulate the TTM LRUs while in TTM bo destruction.
@@ -784,7 +966,10 @@  static void i915_ttm_adjust_lru(struct drm_i915_gem_object *obj)
 	 * Put on the correct LRU list depending on the MADV status
 	 */
 	spin_lock(&bo->bdev->lru_lock);
-	if (obj->mm.madv != I915_MADV_WILLNEED) {
+	if (bo->ttm && i915_tt->filp) {
+		/* Try to keep shmem_tt from being considered for shrinking. */
+		bo->priority = TTM_MAX_BO_PRIORITY - 1;
+	} else if (obj->mm.madv != I915_MADV_WILLNEED) {
 		bo->priority = I915_TTM_PRIO_PURGE;
 	} else if (!i915_gem_object_has_pages(obj)) {
 		if (bo->priority < I915_TTM_PRIO_HAS_PAGES)
@@ -886,9 +1071,12 @@  static const struct drm_i915_gem_object_ops i915_gem_ttm_obj_ops = {
 	.get_pages = i915_ttm_get_pages,
 	.put_pages = i915_ttm_put_pages,
 	.truncate = i915_ttm_purge,
+	.shrinker_release_pages = i915_ttm_shrinker_release_pages,
+
 	.adjust_lru = i915_ttm_adjust_lru,
 	.delayed_free = i915_ttm_delayed_free,
 	.migrate = i915_ttm_migrate,
+
 	.mmap_offset = i915_ttm_mmap_offset,
 	.mmap_ops = &vm_ops_ttm,
 };
@@ -943,7 +1131,6 @@  int __i915_gem_ttm_object_init(struct intel_memory_region *mem,
 	obj->mm.region = intel_memory_region_get(mem);
 	INIT_LIST_HEAD(&obj->mm.region_link);
 
-	i915_gem_object_make_unshrinkable(obj);
 	INIT_RADIX_TREE(&obj->ttm.get_io_page.radix, GFP_KERNEL | __GFP_NOWARN);
 	mutex_init(&obj->ttm.get_io_page.lock);
 	bo_type = (obj->flags & I915_BO_ALLOC_USER) ? ttm_bo_type_device :