diff mbox series

[1/3] drm/i915: audit bo->resource usage

Message ID 20220824142353.10293-1-luben.tuikov@amd.com (mailing list archive)
State New, archived
Headers show
Series [1/3] drm/i915: audit bo->resource usage | expand

Commit Message

Luben Tuikov Aug. 24, 2022, 2:23 p.m. UTC
From: Christian König <christian.koenig@amd.com>

Make sure we can at least move and alloc TT objects without backing store.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_ttm.c      | 6 ++----
 drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 2 +-
 2 files changed, 3 insertions(+), 5 deletions(-)

Comments

Christian König Aug. 30, 2022, 7:33 a.m. UTC | #1
Hi guys,

can we get an rb/acked-by for this i915 change?

Basically we are just making sure that the driver doesn't crash when 
bo->resource is NULL and a bo doesn't have any backing store assigned to it.

The Intel CI seems to be happy with this change, so I'm pretty sure it 
is correct.

Thanks,
Christian.

Am 24.08.22 um 16:23 schrieb Luben Tuikov:
> From: Christian König <christian.koenig@amd.com>
>
> Make sure we can at least move and alloc TT objects without backing store.
>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>   drivers/gpu/drm/i915/gem/i915_gem_ttm.c      | 6 ++----
>   drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 2 +-
>   2 files changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> index bc9c432edffe03..45ce2d1f754cc4 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> @@ -271,8 +271,6 @@ static struct ttm_tt *i915_ttm_tt_create(struct ttm_buffer_object *bo,
>   {
>   	struct drm_i915_private *i915 = container_of(bo->bdev, typeof(*i915),
>   						     bdev);
> -	struct ttm_resource_manager *man =
> -		ttm_manager_type(bo->bdev, bo->resource->mem_type);
>   	struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
>   	unsigned long ccs_pages = 0;
>   	enum ttm_caching caching;
> @@ -286,8 +284,8 @@ static struct ttm_tt *i915_ttm_tt_create(struct ttm_buffer_object *bo,
>   	if (!i915_tt)
>   		return NULL;
>   
> -	if (obj->flags & I915_BO_ALLOC_CPU_CLEAR &&
> -	    man->use_tt)
> +	if (obj->flags & I915_BO_ALLOC_CPU_CLEAR && bo->resource &&
> +	    ttm_manager_type(bo->bdev, bo->resource->mem_type)->use_tt)
>   		page_flags |= TTM_TT_FLAG_ZERO_ALLOC;
>   
>   	caching = i915_ttm_select_tt_caching(obj);
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> index 9a7e50534b84bb..c420d1ab605b6f 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object *bo, bool evict,
>   	bool clear;
>   	int ret;
>   
> -	if (GEM_WARN_ON(!obj)) {
> +	if (GEM_WARN_ON(!obj) || !bo->resource) {
>   		ttm_bo_move_null(bo, dst_mem);
>   		return 0;
>   	}
Matthew Auld Aug. 30, 2022, 10:45 a.m. UTC | #2
Hi,

On 30/08/2022 08:33, Christian König wrote:
> Hi guys,
> 
> can we get an rb/acked-by for this i915 change?
> 
> Basically we are just making sure that the driver doesn't crash when 
> bo->resource is NULL and a bo doesn't have any backing store assigned to 
> it.
> 
> The Intel CI seems to be happy with this change, so I'm pretty sure it 
> is correct.

It looks like DG2/DG1 (which happen to use TTM here) are no longer 
loading the module:
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_107680v1/index.html?

According to the logs the firmware is failing to load, so perhaps 
related to I915_BO_ALLOC_CPU_CLEAR, since that is one of the rare users. 
See below.

> 
> Thanks,
> Christian.
> 
> Am 24.08.22 um 16:23 schrieb Luben Tuikov:
>> From: Christian König <christian.koenig@amd.com>
>>
>> Make sure we can at least move and alloc TT objects without backing 
>> store.
>>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> ---
>>   drivers/gpu/drm/i915/gem/i915_gem_ttm.c      | 6 ++----
>>   drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 2 +-
>>   2 files changed, 3 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c 
>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>> index bc9c432edffe03..45ce2d1f754cc4 100644
>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>> @@ -271,8 +271,6 @@ static struct ttm_tt *i915_ttm_tt_create(struct 
>> ttm_buffer_object *bo,
>>   {
>>       struct drm_i915_private *i915 = container_of(bo->bdev, 
>> typeof(*i915),
>>                                bdev);
>> -    struct ttm_resource_manager *man =
>> -        ttm_manager_type(bo->bdev, bo->resource->mem_type);
>>       struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
>>       unsigned long ccs_pages = 0;
>>       enum ttm_caching caching;
>> @@ -286,8 +284,8 @@ static struct ttm_tt *i915_ttm_tt_create(struct 
>> ttm_buffer_object *bo,
>>       if (!i915_tt)
>>           return NULL;
>> -    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR &&
>> -        man->use_tt)
>> +    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR && bo->resource &&
>> +        ttm_manager_type(bo->bdev, bo->resource->mem_type)->use_tt)
>>           page_flags |= TTM_TT_FLAG_ZERO_ALLOC;

AFAICT i915 was massively relying on everything starting out in a 
"dummy" system memory resource (ttm_tt), where it then later 
"transitions" to the real resource. And if we need to clear the memory 
we rely on ZERO_ALLOC being set before calling into the i915_ttm_move() 
callback (even when allocating local-memory).

For ttm_bo_type_device objects (userspace stuff) it looks like this was 
previously handled by ttm_bo_validate() always doing:

   ret = ttm_tt_create(bo, true); /* clear = true */

Which we would always hit since the resource was always "compatible" for 
the dummy case. But it looks like this is no longer even called, since 
we can now call into ttm_move with bo->resource == NULL, which still 
calls tt_create eventually, which not always with clear = true.

All other objects are then ttm_bo_type_kernel so we don't care about 
clearing, except in the rare case of ALLOC_CPU_CLEAR, which was handled 
as per above in i915_ttm_tt_create(). But I think here bo->resource is 
NULL at the start when first creating the object, which will skip 
setting ZERO_ALLOC, which might explain the CI failure.

The other possible concern (not sure since CI didn't get that far) is 
around ttm_bo_pipeline_gutting(), which now leaves bo->resource = NULL. 
It looks like i915_ttm_shrink() was relying on that to unpopulate the 
ttm_tt. When later calling ttm_bo_validate(), i915_ttm_move() would see 
the SWAPPED flag set on the ttm_tt , re-populate it and then potentially 
move it back to local-memory.

What are your thoughts here? Also sorry if i915 is making a bit of mess 
here.

>>       caching = i915_ttm_select_tt_caching(obj);
>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object *bo, 
>> bool evict,
>>       bool clear;
>>       int ret;
>> -    if (GEM_WARN_ON(!obj)) {
>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>           ttm_bo_move_null(bo, dst_mem);
>>           return 0;
>>       }
>
Christian König Aug. 31, 2022, 8:16 a.m. UTC | #3
Hi Matthew,

Am 30.08.22 um 12:45 schrieb Matthew Auld:
> Hi,
>
> On 30/08/2022 08:33, Christian König wrote:
>> Hi guys,
>>
>> can we get an rb/acked-by for this i915 change?
>>
>> Basically we are just making sure that the driver doesn't crash when 
>> bo->resource is NULL and a bo doesn't have any backing store assigned 
>> to it.
>>
>> The Intel CI seems to be happy with this change, so I'm pretty sure 
>> it is correct.
>
> It looks like DG2/DG1 (which happen to use TTM here) are no longer 
> loading the module:
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fintel-gfx-ci.01.org%2Ftree%2Fdrm-tip%2FPatchwork_107680v1%2Findex.html&amp;data=05%7C01%7Cchristian.koenig%40amd.com%7Caa9bdb0e31064859568708da8a74b899%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637974531164663116%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=UW8BEnIFXHawhAfLUcknGmE88g2wwAiTLAQ3Y5v1pFA%3D&amp;reserved=0? 
>
>
> According to the logs the firmware is failing to load, so perhaps 
> related to I915_BO_ALLOC_CPU_CLEAR, since that is one of the rare 
> users. See below.
>
>>
>> Thanks,
>> Christian.
>>
>> Am 24.08.22 um 16:23 schrieb Luben Tuikov:
>>> From: Christian König <christian.koenig@amd.com>
>>>
>>> Make sure we can at least move and alloc TT objects without backing 
>>> store.
>>>
>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>> ---
>>>   drivers/gpu/drm/i915/gem/i915_gem_ttm.c      | 6 ++----
>>>   drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 2 +-
>>>   2 files changed, 3 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c 
>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>> index bc9c432edffe03..45ce2d1f754cc4 100644
>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>> @@ -271,8 +271,6 @@ static struct ttm_tt *i915_ttm_tt_create(struct 
>>> ttm_buffer_object *bo,
>>>   {
>>>       struct drm_i915_private *i915 = container_of(bo->bdev, 
>>> typeof(*i915),
>>>                                bdev);
>>> -    struct ttm_resource_manager *man =
>>> -        ttm_manager_type(bo->bdev, bo->resource->mem_type);
>>>       struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
>>>       unsigned long ccs_pages = 0;
>>>       enum ttm_caching caching;
>>> @@ -286,8 +284,8 @@ static struct ttm_tt *i915_ttm_tt_create(struct 
>>> ttm_buffer_object *bo,
>>>       if (!i915_tt)
>>>           return NULL;
>>> -    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR &&
>>> -        man->use_tt)
>>> +    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR && bo->resource &&
>>> +        ttm_manager_type(bo->bdev, bo->resource->mem_type)->use_tt)
>>>           page_flags |= TTM_TT_FLAG_ZERO_ALLOC;
>
> AFAICT i915 was massively relying on everything starting out in a 
> "dummy" system memory resource (ttm_tt), where it then later 
> "transitions" to the real resource. And if we need to clear the memory 
> we rely on ZERO_ALLOC being set before calling into the 
> i915_ttm_move() callback (even when allocating local-memory).
>
> For ttm_bo_type_device objects (userspace stuff) it looks like this 
> was previously handled by ttm_bo_validate() always doing:
>
>   ret = ttm_tt_create(bo, true); /* clear = true */
>
> Which we would always hit since the resource was always "compatible" 
> for the dummy case. But it looks like this is no longer even called, 
> since we can now call into ttm_move with bo->resource == NULL, which 
> still calls tt_create eventually, which not always with clear = true.
>
> All other objects are then ttm_bo_type_kernel so we don't care about 
> clearing, except in the rare case of ALLOC_CPU_CLEAR, which was 
> handled as per above in i915_ttm_tt_create(). But I think here 
> bo->resource is NULL at the start when first creating the object, 
> which will skip setting ZERO_ALLOC, which might explain the CI failure.
>
> The other possible concern (not sure since CI didn't get that far) is 
> around ttm_bo_pipeline_gutting(), which now leaves bo->resource = 
> NULL. It looks like i915_ttm_shrink() was relying on that to 
> unpopulate the ttm_tt. When later calling ttm_bo_validate(), 
> i915_ttm_move() would see the SWAPPED flag set on the ttm_tt , 
> re-populate it and then potentially move it back to local-memory.
>
> What are your thoughts here? Also sorry if i915 is making a bit of 
> mess here.

First of all thanks a lot for taking a look. We previously got reports 
about strange crashes with this patch, but couldn't really reproduce 
them (even not by sending them out again). This explains that a bit.

The simplest solution would be to just change the && into a ||, e.g. 
when previously either no resource is allocated or the resource requires 
to use a tt we clear it.

That should give you the same behavior as before, but I agree that this 
is a bit messy.

I've been considering to replacing the ttm_bo_type with a bunch of 
behavior flags for a bo. I'm hoping that this will clean things up a bit.

Regards,
Christian.

>
>>>       caching = i915_ttm_select_tt_caching(obj);
>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object *bo, 
>>> bool evict,
>>>       bool clear;
>>>       int ret;
>>> -    if (GEM_WARN_ON(!obj)) {
>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>           ttm_bo_move_null(bo, dst_mem);
>>>           return 0;
>>>       }
>>
Matthew Auld Aug. 31, 2022, 9:26 a.m. UTC | #4
On 31/08/2022 09:16, Christian König wrote:
> Hi Matthew,
> 
> Am 30.08.22 um 12:45 schrieb Matthew Auld:
>> Hi,
>>
>> On 30/08/2022 08:33, Christian König wrote:
>>> Hi guys,
>>>
>>> can we get an rb/acked-by for this i915 change?
>>>
>>> Basically we are just making sure that the driver doesn't crash when 
>>> bo->resource is NULL and a bo doesn't have any backing store assigned 
>>> to it.
>>>
>>> The Intel CI seems to be happy with this change, so I'm pretty sure 
>>> it is correct.
>>
>> It looks like DG2/DG1 (which happen to use TTM here) are no longer 
>> loading the module:
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fintel-gfx-ci.01.org%2Ftree%2Fdrm-tip%2FPatchwork_107680v1%2Findex.html&amp;data=05%7C01%7Cchristian.koenig%40amd.com%7Caa9bdb0e31064859568708da8a74b899%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637974531164663116%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=UW8BEnIFXHawhAfLUcknGmE88g2wwAiTLAQ3Y5v1pFA%3D&amp;reserved=0?
>>
>> According to the logs the firmware is failing to load, so perhaps 
>> related to I915_BO_ALLOC_CPU_CLEAR, since that is one of the rare 
>> users. See below.
>>
>>>
>>> Thanks,
>>> Christian.
>>>
>>> Am 24.08.22 um 16:23 schrieb Luben Tuikov:
>>>> From: Christian König <christian.koenig@amd.com>
>>>>
>>>> Make sure we can at least move and alloc TT objects without backing 
>>>> store.
>>>>
>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>> ---
>>>>   drivers/gpu/drm/i915/gem/i915_gem_ttm.c      | 6 ++----
>>>>   drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 2 +-
>>>>   2 files changed, 3 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c 
>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>> index bc9c432edffe03..45ce2d1f754cc4 100644
>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>> @@ -271,8 +271,6 @@ static struct ttm_tt *i915_ttm_tt_create(struct 
>>>> ttm_buffer_object *bo,
>>>>   {
>>>>       struct drm_i915_private *i915 = container_of(bo->bdev, 
>>>> typeof(*i915),
>>>>                                bdev);
>>>> -    struct ttm_resource_manager *man =
>>>> -        ttm_manager_type(bo->bdev, bo->resource->mem_type);
>>>>       struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
>>>>       unsigned long ccs_pages = 0;
>>>>       enum ttm_caching caching;
>>>> @@ -286,8 +284,8 @@ static struct ttm_tt *i915_ttm_tt_create(struct 
>>>> ttm_buffer_object *bo,
>>>>       if (!i915_tt)
>>>>           return NULL;
>>>> -    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR &&
>>>> -        man->use_tt)
>>>> +    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR && bo->resource &&
>>>> +        ttm_manager_type(bo->bdev, bo->resource->mem_type)->use_tt)
>>>>           page_flags |= TTM_TT_FLAG_ZERO_ALLOC;
>>
>> AFAICT i915 was massively relying on everything starting out in a 
>> "dummy" system memory resource (ttm_tt), where it then later 
>> "transitions" to the real resource. And if we need to clear the memory 
>> we rely on ZERO_ALLOC being set before calling into the 
>> i915_ttm_move() callback (even when allocating local-memory).
>>
>> For ttm_bo_type_device objects (userspace stuff) it looks like this 
>> was previously handled by ttm_bo_validate() always doing:
>>
>>   ret = ttm_tt_create(bo, true); /* clear = true */
>>
>> Which we would always hit since the resource was always "compatible" 
>> for the dummy case. But it looks like this is no longer even called, 
>> since we can now call into ttm_move with bo->resource == NULL, which 
>> still calls tt_create eventually, which not always with clear = true.
>>
>> All other objects are then ttm_bo_type_kernel so we don't care about 
>> clearing, except in the rare case of ALLOC_CPU_CLEAR, which was 
>> handled as per above in i915_ttm_tt_create(). But I think here 
>> bo->resource is NULL at the start when first creating the object, 
>> which will skip setting ZERO_ALLOC, which might explain the CI failure.
>>
>> The other possible concern (not sure since CI didn't get that far) is 
>> around ttm_bo_pipeline_gutting(), which now leaves bo->resource = 
>> NULL. It looks like i915_ttm_shrink() was relying on that to 
>> unpopulate the ttm_tt. When later calling ttm_bo_validate(), 
>> i915_ttm_move() would see the SWAPPED flag set on the ttm_tt , 
>> re-populate it and then potentially move it back to local-memory.
>>
>> What are your thoughts here? Also sorry if i915 is making a bit of 
>> mess here.
> 
> First of all thanks a lot for taking a look. We previously got reports 
> about strange crashes with this patch, but couldn't really reproduce 
> them (even not by sending them out again). This explains that a bit.
> 
> The simplest solution would be to just change the && into a ||, e.g. 
> when previously either no resource is allocated or the resource requires 
> to use a tt we clear it.
> 
> That should give you the same behavior as before, but I agree that this 
> is a bit messy.

Yeah, that should do the trick.

That hopefully just leaves i915_ttm_shrink(), which is swapping out 
shmem ttm_tt and is calling ttm_bo_validate() with empty placements to 
force the pipeline-gutting path, which importantly unpopulates the 
ttm_tt for us (since ttm_tt_unpopulate is not exported it seems). But 
AFAICT it looks like that will now also nuke the bo->resource, instead 
of just leaving it in system memory. My assumption is that when later 
calling ttm_bo_validate(), it will just do the bo_move_null() in 
i915_ttm_move(), instead of re-populating the ttm_tt and then 
potentially copying it back to local-memory?

> 
> I've been considering to replacing the ttm_bo_type with a bunch of 
> behavior flags for a bo. I'm hoping that this will clean things up a bit.
> 
> Regards,
> Christian.
> 
>>
>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object *bo, 
>>>> bool evict,
>>>>       bool clear;
>>>>       int ret;
>>>> -    if (GEM_WARN_ON(!obj)) {
>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>           return 0;
>>>>       }
>>>
>
Christian König Aug. 31, 2022, 9:38 a.m. UTC | #5
Am 31.08.22 um 11:26 schrieb Matthew Auld:
> On 31/08/2022 09:16, Christian König wrote:
>> Hi Matthew,
>>
>> Am 30.08.22 um 12:45 schrieb Matthew Auld:
>>> Hi,
>>>
>>> On 30/08/2022 08:33, Christian König wrote:
>>>> Hi guys,
>>>>
>>>> can we get an rb/acked-by for this i915 change?
>>>>
>>>> Basically we are just making sure that the driver doesn't crash 
>>>> when bo->resource is NULL and a bo doesn't have any backing store 
>>>> assigned to it.
>>>>
>>>> The Intel CI seems to be happy with this change, so I'm pretty sure 
>>>> it is correct.
>>>
>>> It looks like DG2/DG1 (which happen to use TTM here) are no longer 
>>> loading the module:
>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fintel-gfx-ci.01.org%2Ftree%2Fdrm-tip%2FPatchwork_107680v1%2Findex.html&amp;data=05%7C01%7Cchristian.koenig%40amd.com%7Ccaca567b3279492450fd08da8b32e598%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637975347967354305%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=apanfOjzSWD2vduINzr2j6F2DiIBC93hLRnnGJcGQ5o%3D&amp;reserved=0? 
>>>
>>>
>>> According to the logs the firmware is failing to load, so perhaps 
>>> related to I915_BO_ALLOC_CPU_CLEAR, since that is one of the rare 
>>> users. See below.
>>>
>>>>
>>>> Thanks,
>>>> Christian.
>>>>
>>>> Am 24.08.22 um 16:23 schrieb Luben Tuikov:
>>>>> From: Christian König <christian.koenig@amd.com>
>>>>>
>>>>> Make sure we can at least move and alloc TT objects without 
>>>>> backing store.
>>>>>
>>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>>> ---
>>>>>   drivers/gpu/drm/i915/gem/i915_gem_ttm.c      | 6 ++----
>>>>>   drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 2 +-
>>>>>   2 files changed, 3 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c 
>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>>> index bc9c432edffe03..45ce2d1f754cc4 100644
>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>>> @@ -271,8 +271,6 @@ static struct ttm_tt 
>>>>> *i915_ttm_tt_create(struct ttm_buffer_object *bo,
>>>>>   {
>>>>>       struct drm_i915_private *i915 = container_of(bo->bdev, 
>>>>> typeof(*i915),
>>>>>                                bdev);
>>>>> -    struct ttm_resource_manager *man =
>>>>> -        ttm_manager_type(bo->bdev, bo->resource->mem_type);
>>>>>       struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
>>>>>       unsigned long ccs_pages = 0;
>>>>>       enum ttm_caching caching;
>>>>> @@ -286,8 +284,8 @@ static struct ttm_tt 
>>>>> *i915_ttm_tt_create(struct ttm_buffer_object *bo,
>>>>>       if (!i915_tt)
>>>>>           return NULL;
>>>>> -    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR &&
>>>>> -        man->use_tt)
>>>>> +    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR && bo->resource &&
>>>>> +        ttm_manager_type(bo->bdev, bo->resource->mem_type)->use_tt)
>>>>>           page_flags |= TTM_TT_FLAG_ZERO_ALLOC;
>>>
>>> AFAICT i915 was massively relying on everything starting out in a 
>>> "dummy" system memory resource (ttm_tt), where it then later 
>>> "transitions" to the real resource. And if we need to clear the 
>>> memory we rely on ZERO_ALLOC being set before calling into the 
>>> i915_ttm_move() callback (even when allocating local-memory).
>>>
>>> For ttm_bo_type_device objects (userspace stuff) it looks like this 
>>> was previously handled by ttm_bo_validate() always doing:
>>>
>>>   ret = ttm_tt_create(bo, true); /* clear = true */
>>>
>>> Which we would always hit since the resource was always "compatible" 
>>> for the dummy case. But it looks like this is no longer even called, 
>>> since we can now call into ttm_move with bo->resource == NULL, which 
>>> still calls tt_create eventually, which not always with clear = true.
>>>
>>> All other objects are then ttm_bo_type_kernel so we don't care about 
>>> clearing, except in the rare case of ALLOC_CPU_CLEAR, which was 
>>> handled as per above in i915_ttm_tt_create(). But I think here 
>>> bo->resource is NULL at the start when first creating the object, 
>>> which will skip setting ZERO_ALLOC, which might explain the CI failure.
>>>
>>> The other possible concern (not sure since CI didn't get that far) 
>>> is around ttm_bo_pipeline_gutting(), which now leaves bo->resource = 
>>> NULL. It looks like i915_ttm_shrink() was relying on that to 
>>> unpopulate the ttm_tt. When later calling ttm_bo_validate(), 
>>> i915_ttm_move() would see the SWAPPED flag set on the ttm_tt , 
>>> re-populate it and then potentially move it back to local-memory.
>>>
>>> What are your thoughts here? Also sorry if i915 is making a bit of 
>>> mess here.
>>
>> First of all thanks a lot for taking a look. We previously got 
>> reports about strange crashes with this patch, but couldn't really 
>> reproduce them (even not by sending them out again). This explains 
>> that a bit.
>>
>> The simplest solution would be to just change the && into a ||, e.g. 
>> when previously either no resource is allocated or the resource 
>> requires to use a tt we clear it.
>>
>> That should give you the same behavior as before, but I agree that 
>> this is a bit messy.
>
> Yeah, that should do the trick.
>
> That hopefully just leaves i915_ttm_shrink(), which is swapping out 
> shmem ttm_tt and is calling ttm_bo_validate() with empty placements to 
> force the pipeline-gutting path, which importantly unpopulates the 
> ttm_tt for us (since ttm_tt_unpopulate is not exported it seems). But 
> AFAICT it looks like that will now also nuke the bo->resource, instead 
> of just leaving it in system memory. My assumption is that when later 
> calling ttm_bo_validate(), it will just do the bo_move_null() in 
> i915_ttm_move(), instead of re-populating the ttm_tt and then 
> potentially copying it back to local-memory?

Well you do ttm_bo_validate() with something like GTT domain, don't you? 
This should result in re-populating the tt object, but I'm not 100% sure 
if that really works as expected.

Thanks,
Christian.

>
>>
>> I've been considering to replacing the ttm_bo_type with a bunch of 
>> behavior flags for a bo. I'm hoping that this will clean things up a 
>> bit.
>>
>> Regards,
>> Christian.
>>
>>>
>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object 
>>>>> *bo, bool evict,
>>>>>       bool clear;
>>>>>       int ret;
>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>           return 0;
>>>>>       }
>>>>
>>
Matthew Auld Aug. 31, 2022, 10:37 a.m. UTC | #6
On 31/08/2022 10:38, Christian König wrote:
> Am 31.08.22 um 11:26 schrieb Matthew Auld:
>> On 31/08/2022 09:16, Christian König wrote:
>>> Hi Matthew,
>>>
>>> Am 30.08.22 um 12:45 schrieb Matthew Auld:
>>>> Hi,
>>>>
>>>> On 30/08/2022 08:33, Christian König wrote:
>>>>> Hi guys,
>>>>>
>>>>> can we get an rb/acked-by for this i915 change?
>>>>>
>>>>> Basically we are just making sure that the driver doesn't crash 
>>>>> when bo->resource is NULL and a bo doesn't have any backing store 
>>>>> assigned to it.
>>>>>
>>>>> The Intel CI seems to be happy with this change, so I'm pretty sure 
>>>>> it is correct.
>>>>
>>>> It looks like DG2/DG1 (which happen to use TTM here) are no longer 
>>>> loading the module:
>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fintel-gfx-ci.01.org%2Ftree%2Fdrm-tip%2FPatchwork_107680v1%2Findex.html&amp;data=05%7C01%7Cchristian.koenig%40amd.com%7Ccaca567b3279492450fd08da8b32e598%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637975347967354305%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=apanfOjzSWD2vduINzr2j6F2DiIBC93hLRnnGJcGQ5o%3D&amp;reserved=0?
>>>>
>>>> According to the logs the firmware is failing to load, so perhaps 
>>>> related to I915_BO_ALLOC_CPU_CLEAR, since that is one of the rare 
>>>> users. See below.
>>>>
>>>>>
>>>>> Thanks,
>>>>> Christian.
>>>>>
>>>>> Am 24.08.22 um 16:23 schrieb Luben Tuikov:
>>>>>> From: Christian König <christian.koenig@amd.com>
>>>>>>
>>>>>> Make sure we can at least move and alloc TT objects without 
>>>>>> backing store.
>>>>>>
>>>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>>>> ---
>>>>>>   drivers/gpu/drm/i915/gem/i915_gem_ttm.c      | 6 ++----
>>>>>>   drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 2 +-
>>>>>>   2 files changed, 3 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c 
>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>>>> index bc9c432edffe03..45ce2d1f754cc4 100644
>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>>>> @@ -271,8 +271,6 @@ static struct ttm_tt 
>>>>>> *i915_ttm_tt_create(struct ttm_buffer_object *bo,
>>>>>>   {
>>>>>>       struct drm_i915_private *i915 = container_of(bo->bdev, 
>>>>>> typeof(*i915),
>>>>>>                                bdev);
>>>>>> -    struct ttm_resource_manager *man =
>>>>>> -        ttm_manager_type(bo->bdev, bo->resource->mem_type);
>>>>>>       struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
>>>>>>       unsigned long ccs_pages = 0;
>>>>>>       enum ttm_caching caching;
>>>>>> @@ -286,8 +284,8 @@ static struct ttm_tt 
>>>>>> *i915_ttm_tt_create(struct ttm_buffer_object *bo,
>>>>>>       if (!i915_tt)
>>>>>>           return NULL;
>>>>>> -    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR &&
>>>>>> -        man->use_tt)
>>>>>> +    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR && bo->resource &&
>>>>>> +        ttm_manager_type(bo->bdev, bo->resource->mem_type)->use_tt)
>>>>>>           page_flags |= TTM_TT_FLAG_ZERO_ALLOC;
>>>>
>>>> AFAICT i915 was massively relying on everything starting out in a 
>>>> "dummy" system memory resource (ttm_tt), where it then later 
>>>> "transitions" to the real resource. And if we need to clear the 
>>>> memory we rely on ZERO_ALLOC being set before calling into the 
>>>> i915_ttm_move() callback (even when allocating local-memory).
>>>>
>>>> For ttm_bo_type_device objects (userspace stuff) it looks like this 
>>>> was previously handled by ttm_bo_validate() always doing:
>>>>
>>>>   ret = ttm_tt_create(bo, true); /* clear = true */
>>>>
>>>> Which we would always hit since the resource was always "compatible" 
>>>> for the dummy case. But it looks like this is no longer even called, 
>>>> since we can now call into ttm_move with bo->resource == NULL, which 
>>>> still calls tt_create eventually, which not always with clear = true.
>>>>
>>>> All other objects are then ttm_bo_type_kernel so we don't care about 
>>>> clearing, except in the rare case of ALLOC_CPU_CLEAR, which was 
>>>> handled as per above in i915_ttm_tt_create(). But I think here 
>>>> bo->resource is NULL at the start when first creating the object, 
>>>> which will skip setting ZERO_ALLOC, which might explain the CI failure.
>>>>
>>>> The other possible concern (not sure since CI didn't get that far) 
>>>> is around ttm_bo_pipeline_gutting(), which now leaves bo->resource = 
>>>> NULL. It looks like i915_ttm_shrink() was relying on that to 
>>>> unpopulate the ttm_tt. When later calling ttm_bo_validate(), 
>>>> i915_ttm_move() would see the SWAPPED flag set on the ttm_tt , 
>>>> re-populate it and then potentially move it back to local-memory.
>>>>
>>>> What are your thoughts here? Also sorry if i915 is making a bit of 
>>>> mess here.
>>>
>>> First of all thanks a lot for taking a look. We previously got 
>>> reports about strange crashes with this patch, but couldn't really 
>>> reproduce them (even not by sending them out again). This explains 
>>> that a bit.
>>>
>>> The simplest solution would be to just change the && into a ||, e.g. 
>>> when previously either no resource is allocated or the resource 
>>> requires to use a tt we clear it.
>>>
>>> That should give you the same behavior as before, but I agree that 
>>> this is a bit messy.
>>
>> Yeah, that should do the trick.
>>
>> That hopefully just leaves i915_ttm_shrink(), which is swapping out 
>> shmem ttm_tt and is calling ttm_bo_validate() with empty placements to 
>> force the pipeline-gutting path, which importantly unpopulates the 
>> ttm_tt for us (since ttm_tt_unpopulate is not exported it seems). But 
>> AFAICT it looks like that will now also nuke the bo->resource, instead 
>> of just leaving it in system memory. My assumption is that when later 
>> calling ttm_bo_validate(), it will just do the bo_move_null() in 
>> i915_ttm_move(), instead of re-populating the ttm_tt and then 
>> potentially copying it back to local-memory?
> 
> Well you do ttm_bo_validate() with something like GTT domain, don't you? 
> This should result in re-populating the tt object, but I'm not 100% sure 
> if that really works as expected.

AFAIK for domains we either have system memory (which uses ttm_tt and 
might be shmem underneath) or local-memory. But perhaps i915 is doing 
something wrong here, or abusing TTM in some way. I'm not sure tbh.

Anyway, I think we have two cases here:

- We have some system memory only object. After doing i915_ttm_shrink(), 
bo->resource is now NULL. We then call ttm_bo_validate() at some later 
point, but here we don't need to copy anything, but it also looks like 
ttm_bo_handle_move_mem() won't populate the ttm_tt or us either, since 
mem_type == TTM_PL_SYSTEM. It looks like i915_ttm_move() was taking care 
of this, but now we just call ttm_bo_move_null().

- We have a local-memory only object, which was evicted to shmem, and 
then swapped out by the shrinker like above. The bo->resource is NULL. 
However this time when calling ttm_bo_validate() we need to actually do 
a copy in i915_ttm_move(), as well as re-populate the ttm_tt. 
i915_ttm_move() was taking care of this, but now we just call 
ttm_bo_move_null().

Perhaps i915 is doing something wrong in the above two cases?

> 
> Thanks,
> Christian.
> 
>>
>>>
>>> I've been considering to replacing the ttm_bo_type with a bunch of 
>>> behavior flags for a bo. I'm hoping that this will clean things up a 
>>> bit.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object 
>>>>>> *bo, bool evict,
>>>>>>       bool clear;
>>>>>>       int ret;
>>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>>           return 0;
>>>>>>       }
>>>>>
>>>
>
Christian König Aug. 31, 2022, 11:03 a.m. UTC | #7
Am 31.08.22 um 12:37 schrieb Matthew Auld:
> [SNIP]
>>>
>>> That hopefully just leaves i915_ttm_shrink(), which is swapping out 
>>> shmem ttm_tt and is calling ttm_bo_validate() with empty placements 
>>> to force the pipeline-gutting path, which importantly unpopulates 
>>> the ttm_tt for us (since ttm_tt_unpopulate is not exported it 
>>> seems). But AFAICT it looks like that will now also nuke the 
>>> bo->resource, instead of just leaving it in system memory. My 
>>> assumption is that when later calling ttm_bo_validate(), it will 
>>> just do the bo_move_null() in i915_ttm_move(), instead of 
>>> re-populating the ttm_tt and then potentially copying it back to 
>>> local-memory?
>>
>> Well you do ttm_bo_validate() with something like GTT domain, don't 
>> you? This should result in re-populating the tt object, but I'm not 
>> 100% sure if that really works as expected.
>
> AFAIK for domains we either have system memory (which uses ttm_tt and 
> might be shmem underneath) or local-memory. But perhaps i915 is doing 
> something wrong here, or abusing TTM in some way. I'm not sure tbh.
>
> Anyway, I think we have two cases here:
>
> - We have some system memory only object. After doing 
> i915_ttm_shrink(), bo->resource is now NULL. We then call 
> ttm_bo_validate() at some later point, but here we don't need to copy 
> anything, but it also looks like ttm_bo_handle_move_mem() won't 
> populate the ttm_tt or us either, since mem_type == TTM_PL_SYSTEM. It 
> looks like i915_ttm_move() was taking care of this, but now we just 
> call ttm_bo_move_null().
>
> - We have a local-memory only object, which was evicted to shmem, and 
> then swapped out by the shrinker like above. The bo->resource is NULL. 
> However this time when calling ttm_bo_validate() we need to actually 
> do a copy in i915_ttm_move(), as well as re-populate the ttm_tt. 
> i915_ttm_move() was taking care of this, but now we just call 
> ttm_bo_move_null().
>
> Perhaps i915 is doing something wrong in the above two cases?

Mhm, as far as I can see that should still work.

See previously you should got a transition from SYSTEM->GTT in 
i915_ttm_move() to re-create your backing store. Not you get 
NULL->SYSTEM which is handled by ttm_bo_move_null() and then SYSTEM->GTT.

If you just validated to SYSTEM memory before I think the tt object 
wouldn't have been populated either.

Regards,
Christian.

>
>>
>> Thanks,
>> Christian.
>>
>>>
>>>>
>>>> I've been considering to replacing the ttm_bo_type with a bunch of 
>>>> behavior flags for a bo. I'm hoping that this will clean things up 
>>>> a bit.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object 
>>>>>>> *bo, bool evict,
>>>>>>>       bool clear;
>>>>>>>       int ret;
>>>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>>>           return 0;
>>>>>>>       }
>>>>>>
>>>>
>>
Matthew Auld Aug. 31, 2022, 12:06 p.m. UTC | #8
On 31/08/2022 12:03, Christian König wrote:
> Am 31.08.22 um 12:37 schrieb Matthew Auld:
>> [SNIP]
>>>>
>>>> That hopefully just leaves i915_ttm_shrink(), which is swapping out 
>>>> shmem ttm_tt and is calling ttm_bo_validate() with empty placements 
>>>> to force the pipeline-gutting path, which importantly unpopulates 
>>>> the ttm_tt for us (since ttm_tt_unpopulate is not exported it 
>>>> seems). But AFAICT it looks like that will now also nuke the 
>>>> bo->resource, instead of just leaving it in system memory. My 
>>>> assumption is that when later calling ttm_bo_validate(), it will 
>>>> just do the bo_move_null() in i915_ttm_move(), instead of 
>>>> re-populating the ttm_tt and then potentially copying it back to 
>>>> local-memory?
>>>
>>> Well you do ttm_bo_validate() with something like GTT domain, don't 
>>> you? This should result in re-populating the tt object, but I'm not 
>>> 100% sure if that really works as expected.
>>
>> AFAIK for domains we either have system memory (which uses ttm_tt and 
>> might be shmem underneath) or local-memory. But perhaps i915 is doing 
>> something wrong here, or abusing TTM in some way. I'm not sure tbh.
>>
>> Anyway, I think we have two cases here:
>>
>> - We have some system memory only object. After doing 
>> i915_ttm_shrink(), bo->resource is now NULL. We then call 
>> ttm_bo_validate() at some later point, but here we don't need to copy 
>> anything, but it also looks like ttm_bo_handle_move_mem() won't 
>> populate the ttm_tt or us either, since mem_type == TTM_PL_SYSTEM. It 
>> looks like i915_ttm_move() was taking care of this, but now we just 
>> call ttm_bo_move_null().
>>
>> - We have a local-memory only object, which was evicted to shmem, and 
>> then swapped out by the shrinker like above. The bo->resource is NULL. 
>> However this time when calling ttm_bo_validate() we need to actually 
>> do a copy in i915_ttm_move(), as well as re-populate the ttm_tt. 
>> i915_ttm_move() was taking care of this, but now we just call 
>> ttm_bo_move_null().
>>
>> Perhaps i915 is doing something wrong in the above two cases?
> 
> Mhm, as far as I can see that should still work.
> 
> See previously you should got a transition from SYSTEM->GTT in 
> i915_ttm_move() to re-create your backing store. Not you get 
> NULL->SYSTEM which is handled by ttm_bo_move_null() and then SYSTEM->GTT.

What is GTT here in TTM world? Also I'm not seeing where there is this 
SYSTEM->GTT transition? Maybe I'm blind. Just to be clear, i915 is only 
calling ttm_bo_validate() once when acquiring the pages, and we don't 
call it again, unless it was evicted (and potentially swapped out).

> 
> If you just validated to SYSTEM memory before I think the tt object 
> wouldn't have been populated either.
> 
> Regards,
> Christian.
> 
>>
>>>
>>> Thanks,
>>> Christian.
>>>
>>>>
>>>>>
>>>>> I've been considering to replacing the ttm_bo_type with a bunch of 
>>>>> behavior flags for a bo. I'm hoping that this will clean things up 
>>>>> a bit.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object 
>>>>>>>> *bo, bool evict,
>>>>>>>>       bool clear;
>>>>>>>>       int ret;
>>>>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>>>>           return 0;
>>>>>>>>       }
>>>>>>>
>>>>>
>>>
>
Christian König Aug. 31, 2022, 12:35 p.m. UTC | #9
Am 31.08.22 um 14:06 schrieb Matthew Auld:
> On 31/08/2022 12:03, Christian König wrote:
>> Am 31.08.22 um 12:37 schrieb Matthew Auld:
>>> [SNIP]
>>>>>
>>>>> That hopefully just leaves i915_ttm_shrink(), which is swapping 
>>>>> out shmem ttm_tt and is calling ttm_bo_validate() with empty 
>>>>> placements to force the pipeline-gutting path, which importantly 
>>>>> unpopulates the ttm_tt for us (since ttm_tt_unpopulate is not 
>>>>> exported it seems). But AFAICT it looks like that will now also 
>>>>> nuke the bo->resource, instead of just leaving it in system 
>>>>> memory. My assumption is that when later calling 
>>>>> ttm_bo_validate(), it will just do the bo_move_null() in 
>>>>> i915_ttm_move(), instead of re-populating the ttm_tt and then 
>>>>> potentially copying it back to local-memory?
>>>>
>>>> Well you do ttm_bo_validate() with something like GTT domain, don't 
>>>> you? This should result in re-populating the tt object, but I'm not 
>>>> 100% sure if that really works as expected.
>>>
>>> AFAIK for domains we either have system memory (which uses ttm_tt 
>>> and might be shmem underneath) or local-memory. But perhaps i915 is 
>>> doing something wrong here, or abusing TTM in some way. I'm not sure 
>>> tbh.
>>>
>>> Anyway, I think we have two cases here:
>>>
>>> - We have some system memory only object. After doing 
>>> i915_ttm_shrink(), bo->resource is now NULL. We then call 
>>> ttm_bo_validate() at some later point, but here we don't need to 
>>> copy anything, but it also looks like ttm_bo_handle_move_mem() won't 
>>> populate the ttm_tt or us either, since mem_type == TTM_PL_SYSTEM. 
>>> It looks like i915_ttm_move() was taking care of this, but now we 
>>> just call ttm_bo_move_null().
>>>
>>> - We have a local-memory only object, which was evicted to shmem, 
>>> and then swapped out by the shrinker like above. The bo->resource is 
>>> NULL. However this time when calling ttm_bo_validate() we need to 
>>> actually do a copy in i915_ttm_move(), as well as re-populate the 
>>> ttm_tt. i915_ttm_move() was taking care of this, but now we just 
>>> call ttm_bo_move_null().
>>>
>>> Perhaps i915 is doing something wrong in the above two cases?
>>
>> Mhm, as far as I can see that should still work.
>>
>> See previously you should got a transition from SYSTEM->GTT in 
>> i915_ttm_move() to re-create your backing store. Not you get 
>> NULL->SYSTEM which is handled by ttm_bo_move_null() and then 
>> SYSTEM->GTT.
>
> What is GTT here in TTM world? Also I'm not seeing where there is this 
> SYSTEM->GTT transition? Maybe I'm blind. Just to be clear, i915 is 
> only calling ttm_bo_validate() once when acquiring the pages, and we 
> don't call it again, unless it was evicted (and potentially swapped out).

Well GTT means TTM_PL_TT.

And calling it only once is perfectly fine, TTM will internally see that 
we need two hops to reach TTM_PL_TT and so does the NULL->SYSTEM 
transition and then SYSTEM->TT.

As far as I can see that should work like it did before.

Christian.

>
>>
>> If you just validated to SYSTEM memory before I think the tt object 
>> wouldn't have been populated either.
>>
>> Regards,
>> Christian.
>>
>>>
>>>>
>>>> Thanks,
>>>> Christian.
>>>>
>>>>>
>>>>>>
>>>>>> I've been considering to replacing the ttm_bo_type with a bunch 
>>>>>> of behavior flags for a bo. I'm hoping that this will clean 
>>>>>> things up a bit.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object 
>>>>>>>>> *bo, bool evict,
>>>>>>>>>       bool clear;
>>>>>>>>>       int ret;
>>>>>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>>>>>           return 0;
>>>>>>>>>       }
>>>>>>>>
>>>>>>
>>>>
>>
Matthew Auld Aug. 31, 2022, 12:50 p.m. UTC | #10
On 31/08/2022 13:35, Christian König wrote:
> Am 31.08.22 um 14:06 schrieb Matthew Auld:
>> On 31/08/2022 12:03, Christian König wrote:
>>> Am 31.08.22 um 12:37 schrieb Matthew Auld:
>>>> [SNIP]
>>>>>>
>>>>>> That hopefully just leaves i915_ttm_shrink(), which is swapping 
>>>>>> out shmem ttm_tt and is calling ttm_bo_validate() with empty 
>>>>>> placements to force the pipeline-gutting path, which importantly 
>>>>>> unpopulates the ttm_tt for us (since ttm_tt_unpopulate is not 
>>>>>> exported it seems). But AFAICT it looks like that will now also 
>>>>>> nuke the bo->resource, instead of just leaving it in system 
>>>>>> memory. My assumption is that when later calling 
>>>>>> ttm_bo_validate(), it will just do the bo_move_null() in 
>>>>>> i915_ttm_move(), instead of re-populating the ttm_tt and then 
>>>>>> potentially copying it back to local-memory?
>>>>>
>>>>> Well you do ttm_bo_validate() with something like GTT domain, don't 
>>>>> you? This should result in re-populating the tt object, but I'm not 
>>>>> 100% sure if that really works as expected.
>>>>
>>>> AFAIK for domains we either have system memory (which uses ttm_tt 
>>>> and might be shmem underneath) or local-memory. But perhaps i915 is 
>>>> doing something wrong here, or abusing TTM in some way. I'm not sure 
>>>> tbh.
>>>>
>>>> Anyway, I think we have two cases here:
>>>>
>>>> - We have some system memory only object. After doing 
>>>> i915_ttm_shrink(), bo->resource is now NULL. We then call 
>>>> ttm_bo_validate() at some later point, but here we don't need to 
>>>> copy anything, but it also looks like ttm_bo_handle_move_mem() won't 
>>>> populate the ttm_tt or us either, since mem_type == TTM_PL_SYSTEM. 
>>>> It looks like i915_ttm_move() was taking care of this, but now we 
>>>> just call ttm_bo_move_null().
>>>>
>>>> - We have a local-memory only object, which was evicted to shmem, 
>>>> and then swapped out by the shrinker like above. The bo->resource is 
>>>> NULL. However this time when calling ttm_bo_validate() we need to 
>>>> actually do a copy in i915_ttm_move(), as well as re-populate the 
>>>> ttm_tt. i915_ttm_move() was taking care of this, but now we just 
>>>> call ttm_bo_move_null().
>>>>
>>>> Perhaps i915 is doing something wrong in the above two cases?
>>>
>>> Mhm, as far as I can see that should still work.
>>>
>>> See previously you should got a transition from SYSTEM->GTT in 
>>> i915_ttm_move() to re-create your backing store. Not you get 
>>> NULL->SYSTEM which is handled by ttm_bo_move_null() and then 
>>> SYSTEM->GTT.
>>
>> What is GTT here in TTM world? Also I'm not seeing where there is this 
>> SYSTEM->GTT transition? Maybe I'm blind. Just to be clear, i915 is 
>> only calling ttm_bo_validate() once when acquiring the pages, and we 
>> don't call it again, unless it was evicted (and potentially swapped out).
> 
> Well GTT means TTM_PL_TT.
> 
> And calling it only once is perfectly fine, TTM will internally see that 
> we need two hops to reach TTM_PL_TT and so does the NULL->SYSTEM 
> transition and then SYSTEM->TT.

Ah interesting, so that's what the multi-hop thing does. But AFAICT i915 
is not using either TTM_PL_TT or -EMULTIHOP.

Also what is the difference between TTM_PL_TT and TM_PL_SYSTEM? When 
should you use one over the other?

> 
> As far as I can see that should work like it did before.
> 
> Christian.
> 
>>
>>>
>>> If you just validated to SYSTEM memory before I think the tt object 
>>> wouldn't have been populated either.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> I've been considering to replacing the ttm_bo_type with a bunch 
>>>>>>> of behavior flags for a bo. I'm hoping that this will clean 
>>>>>>> things up a bit.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>>
>>>>>>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object 
>>>>>>>>>> *bo, bool evict,
>>>>>>>>>>       bool clear;
>>>>>>>>>>       int ret;
>>>>>>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>>>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>>>>>>           return 0;
>>>>>>>>>>       }
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>
Christian König Aug. 31, 2022, 1:34 p.m. UTC | #11
Am 31.08.22 um 14:50 schrieb Matthew Auld:
> On 31/08/2022 13:35, Christian König wrote:
>> Am 31.08.22 um 14:06 schrieb Matthew Auld:
>>> On 31/08/2022 12:03, Christian König wrote:
>>>> Am 31.08.22 um 12:37 schrieb Matthew Auld:
>>>>> [SNIP]
>>>>>>>
>>>>>>> That hopefully just leaves i915_ttm_shrink(), which is swapping 
>>>>>>> out shmem ttm_tt and is calling ttm_bo_validate() with empty 
>>>>>>> placements to force the pipeline-gutting path, which importantly 
>>>>>>> unpopulates the ttm_tt for us (since ttm_tt_unpopulate is not 
>>>>>>> exported it seems). But AFAICT it looks like that will now also 
>>>>>>> nuke the bo->resource, instead of just leaving it in system 
>>>>>>> memory. My assumption is that when later calling 
>>>>>>> ttm_bo_validate(), it will just do the bo_move_null() in 
>>>>>>> i915_ttm_move(), instead of re-populating the ttm_tt and then 
>>>>>>> potentially copying it back to local-memory?
>>>>>>
>>>>>> Well you do ttm_bo_validate() with something like GTT domain, 
>>>>>> don't you? This should result in re-populating the tt object, but 
>>>>>> I'm not 100% sure if that really works as expected.
>>>>>
>>>>> AFAIK for domains we either have system memory (which uses ttm_tt 
>>>>> and might be shmem underneath) or local-memory. But perhaps i915 
>>>>> is doing something wrong here, or abusing TTM in some way. I'm not 
>>>>> sure tbh.
>>>>>
>>>>> Anyway, I think we have two cases here:
>>>>>
>>>>> - We have some system memory only object. After doing 
>>>>> i915_ttm_shrink(), bo->resource is now NULL. We then call 
>>>>> ttm_bo_validate() at some later point, but here we don't need to 
>>>>> copy anything, but it also looks like ttm_bo_handle_move_mem() 
>>>>> won't populate the ttm_tt or us either, since mem_type == 
>>>>> TTM_PL_SYSTEM. It looks like i915_ttm_move() was taking care of 
>>>>> this, but now we just call ttm_bo_move_null().
>>>>>
>>>>> - We have a local-memory only object, which was evicted to shmem, 
>>>>> and then swapped out by the shrinker like above. The bo->resource 
>>>>> is NULL. However this time when calling ttm_bo_validate() we need 
>>>>> to actually do a copy in i915_ttm_move(), as well as re-populate 
>>>>> the ttm_tt. i915_ttm_move() was taking care of this, but now we 
>>>>> just call ttm_bo_move_null().
>>>>>
>>>>> Perhaps i915 is doing something wrong in the above two cases?
>>>>
>>>> Mhm, as far as I can see that should still work.
>>>>
>>>> See previously you should got a transition from SYSTEM->GTT in 
>>>> i915_ttm_move() to re-create your backing store. Not you get 
>>>> NULL->SYSTEM which is handled by ttm_bo_move_null() and then 
>>>> SYSTEM->GTT.
>>>
>>> What is GTT here in TTM world? Also I'm not seeing where there is 
>>> this SYSTEM->GTT transition? Maybe I'm blind. Just to be clear, i915 
>>> is only calling ttm_bo_validate() once when acquiring the pages, and 
>>> we don't call it again, unless it was evicted (and potentially 
>>> swapped out).
>>
>> Well GTT means TTM_PL_TT.
>>
>> And calling it only once is perfectly fine, TTM will internally see 
>> that we need two hops to reach TTM_PL_TT and so does the NULL->SYSTEM 
>> transition and then SYSTEM->TT.
>
> Ah interesting, so that's what the multi-hop thing does. But AFAICT 
> i915 is not using either TTM_PL_TT or -EMULTIHOP.

Mhm, it could be that we then have a problem and the i915 driver only 
sees NULL->TT directly. But I really don't know the i915 driver code 
good enough to judge that.

Can you take a look at this and test it maybe?

>
> Also what is the difference between TTM_PL_TT and TM_PL_SYSTEM? When 
> should you use one over the other?

TTM_PL_SYSTEM means the device is not accessing the buffer and TTM has 
the control over the backing store and can swapout/swapin as it wants it.

TTM_PL_TT means that the device is accessing the data (TT stands for 
translation table) and so TTM can't swap the backing store in/out.

TTM_PL_VRAM well that one is obvious.

Thanks,
Christian.

>
>>
>> As far as I can see that should work like it did before.
>>
>> Christian.
>>
>>>
>>>>
>>>> If you just validated to SYSTEM memory before I think the tt object 
>>>> wouldn't have been populated either.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> I've been considering to replacing the ttm_bo_type with a bunch 
>>>>>>>> of behavior flags for a bo. I'm hoping that this will clean 
>>>>>>>> things up a bit.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>>>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct 
>>>>>>>>>>> ttm_buffer_object *bo, bool evict,
>>>>>>>>>>>       bool clear;
>>>>>>>>>>>       int ret;
>>>>>>>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>>>>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>>>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>>>>>>>           return 0;
>>>>>>>>>>>       }
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>
Matthew Auld Aug. 31, 2022, 2:53 p.m. UTC | #12
On 31/08/2022 14:34, Christian König wrote:
> Am 31.08.22 um 14:50 schrieb Matthew Auld:
>> On 31/08/2022 13:35, Christian König wrote:
>>> Am 31.08.22 um 14:06 schrieb Matthew Auld:
>>>> On 31/08/2022 12:03, Christian König wrote:
>>>>> Am 31.08.22 um 12:37 schrieb Matthew Auld:
>>>>>> [SNIP]
>>>>>>>>
>>>>>>>> That hopefully just leaves i915_ttm_shrink(), which is swapping 
>>>>>>>> out shmem ttm_tt and is calling ttm_bo_validate() with empty 
>>>>>>>> placements to force the pipeline-gutting path, which importantly 
>>>>>>>> unpopulates the ttm_tt for us (since ttm_tt_unpopulate is not 
>>>>>>>> exported it seems). But AFAICT it looks like that will now also 
>>>>>>>> nuke the bo->resource, instead of just leaving it in system 
>>>>>>>> memory. My assumption is that when later calling 
>>>>>>>> ttm_bo_validate(), it will just do the bo_move_null() in 
>>>>>>>> i915_ttm_move(), instead of re-populating the ttm_tt and then 
>>>>>>>> potentially copying it back to local-memory?
>>>>>>>
>>>>>>> Well you do ttm_bo_validate() with something like GTT domain, 
>>>>>>> don't you? This should result in re-populating the tt object, but 
>>>>>>> I'm not 100% sure if that really works as expected.
>>>>>>
>>>>>> AFAIK for domains we either have system memory (which uses ttm_tt 
>>>>>> and might be shmem underneath) or local-memory. But perhaps i915 
>>>>>> is doing something wrong here, or abusing TTM in some way. I'm not 
>>>>>> sure tbh.
>>>>>>
>>>>>> Anyway, I think we have two cases here:
>>>>>>
>>>>>> - We have some system memory only object. After doing 
>>>>>> i915_ttm_shrink(), bo->resource is now NULL. We then call 
>>>>>> ttm_bo_validate() at some later point, but here we don't need to 
>>>>>> copy anything, but it also looks like ttm_bo_handle_move_mem() 
>>>>>> won't populate the ttm_tt or us either, since mem_type == 
>>>>>> TTM_PL_SYSTEM. It looks like i915_ttm_move() was taking care of 
>>>>>> this, but now we just call ttm_bo_move_null().
>>>>>>
>>>>>> - We have a local-memory only object, which was evicted to shmem, 
>>>>>> and then swapped out by the shrinker like above. The bo->resource 
>>>>>> is NULL. However this time when calling ttm_bo_validate() we need 
>>>>>> to actually do a copy in i915_ttm_move(), as well as re-populate 
>>>>>> the ttm_tt. i915_ttm_move() was taking care of this, but now we 
>>>>>> just call ttm_bo_move_null().
>>>>>>
>>>>>> Perhaps i915 is doing something wrong in the above two cases?
>>>>>
>>>>> Mhm, as far as I can see that should still work.
>>>>>
>>>>> See previously you should got a transition from SYSTEM->GTT in 
>>>>> i915_ttm_move() to re-create your backing store. Not you get 
>>>>> NULL->SYSTEM which is handled by ttm_bo_move_null() and then 
>>>>> SYSTEM->GTT.
>>>>
>>>> What is GTT here in TTM world? Also I'm not seeing where there is 
>>>> this SYSTEM->GTT transition? Maybe I'm blind. Just to be clear, i915 
>>>> is only calling ttm_bo_validate() once when acquiring the pages, and 
>>>> we don't call it again, unless it was evicted (and potentially 
>>>> swapped out).
>>>
>>> Well GTT means TTM_PL_TT.
>>>
>>> And calling it only once is perfectly fine, TTM will internally see 
>>> that we need two hops to reach TTM_PL_TT and so does the NULL->SYSTEM 
>>> transition and then SYSTEM->TT.
>>
>> Ah interesting, so that's what the multi-hop thing does. But AFAICT 
>> i915 is not using either TTM_PL_TT or -EMULTIHOP.
> 
> Mhm, it could be that we then have a problem and the i915 driver only 
> sees NULL->TT directly. But I really don't know the i915 driver code 
> good enough to judge that.
> 
> Can you take a look at this and test it maybe?

I'll grab a machine and try to see what is going on here.

> 
>>
>> Also what is the difference between TTM_PL_TT and TM_PL_SYSTEM? When 
>> should you use one over the other?
> 
> TTM_PL_SYSTEM means the device is not accessing the buffer and TTM has 
> the control over the backing store and can swapout/swapin as it wants it.
> 
> TTM_PL_TT means that the device is accessing the data (TT stands for 
> translation table) and so TTM can't swap the backing store in/out.
> 
> TTM_PL_VRAM well that one is obvious.

Thanks for the explanation. So it looks like i915 is using TTM_PL_SYSTEM 
even for device access it seems.

> 
> Thanks,
> Christian.
> 
>>
>>>
>>> As far as I can see that should work like it did before.
>>>
>>> Christian.
>>>
>>>>
>>>>>
>>>>> If you just validated to SYSTEM memory before I think the tt object 
>>>>> wouldn't have been populated either.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Christian.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I've been considering to replacing the ttm_bo_type with a bunch 
>>>>>>>>> of behavior flags for a bo. I'm hoping that this will clean 
>>>>>>>>> things up a bit.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>>>>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct 
>>>>>>>>>>>> ttm_buffer_object *bo, bool evict,
>>>>>>>>>>>>       bool clear;
>>>>>>>>>>>>       int ret;
>>>>>>>>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>>>>>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>>>>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>>>>>>>>           return 0;
>>>>>>>>>>>>       }
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>
Matthew Auld Aug. 31, 2022, 4:32 p.m. UTC | #13
On 31/08/2022 15:53, Matthew Auld wrote:
> On 31/08/2022 14:34, Christian König wrote:
>> Am 31.08.22 um 14:50 schrieb Matthew Auld:
>>> On 31/08/2022 13:35, Christian König wrote:
>>>> Am 31.08.22 um 14:06 schrieb Matthew Auld:
>>>>> On 31/08/2022 12:03, Christian König wrote:
>>>>>> Am 31.08.22 um 12:37 schrieb Matthew Auld:
>>>>>>> [SNIP]
>>>>>>>>>
>>>>>>>>> That hopefully just leaves i915_ttm_shrink(), which is swapping 
>>>>>>>>> out shmem ttm_tt and is calling ttm_bo_validate() with empty 
>>>>>>>>> placements to force the pipeline-gutting path, which 
>>>>>>>>> importantly unpopulates the ttm_tt for us (since 
>>>>>>>>> ttm_tt_unpopulate is not exported it seems). But AFAICT it 
>>>>>>>>> looks like that will now also nuke the bo->resource, instead of 
>>>>>>>>> just leaving it in system memory. My assumption is that when 
>>>>>>>>> later calling ttm_bo_validate(), it will just do the 
>>>>>>>>> bo_move_null() in i915_ttm_move(), instead of re-populating the 
>>>>>>>>> ttm_tt and then potentially copying it back to local-memory?
>>>>>>>>
>>>>>>>> Well you do ttm_bo_validate() with something like GTT domain, 
>>>>>>>> don't you? This should result in re-populating the tt object, 
>>>>>>>> but I'm not 100% sure if that really works as expected.
>>>>>>>
>>>>>>> AFAIK for domains we either have system memory (which uses ttm_tt 
>>>>>>> and might be shmem underneath) or local-memory. But perhaps i915 
>>>>>>> is doing something wrong here, or abusing TTM in some way. I'm 
>>>>>>> not sure tbh.
>>>>>>>
>>>>>>> Anyway, I think we have two cases here:
>>>>>>>
>>>>>>> - We have some system memory only object. After doing 
>>>>>>> i915_ttm_shrink(), bo->resource is now NULL. We then call 
>>>>>>> ttm_bo_validate() at some later point, but here we don't need to 
>>>>>>> copy anything, but it also looks like ttm_bo_handle_move_mem() 
>>>>>>> won't populate the ttm_tt or us either, since mem_type == 
>>>>>>> TTM_PL_SYSTEM. It looks like i915_ttm_move() was taking care of 
>>>>>>> this, but now we just call ttm_bo_move_null().
>>>>>>>
>>>>>>> - We have a local-memory only object, which was evicted to shmem, 
>>>>>>> and then swapped out by the shrinker like above. The bo->resource 
>>>>>>> is NULL. However this time when calling ttm_bo_validate() we need 
>>>>>>> to actually do a copy in i915_ttm_move(), as well as re-populate 
>>>>>>> the ttm_tt. i915_ttm_move() was taking care of this, but now we 
>>>>>>> just call ttm_bo_move_null().
>>>>>>>
>>>>>>> Perhaps i915 is doing something wrong in the above two cases?
>>>>>>
>>>>>> Mhm, as far as I can see that should still work.
>>>>>>
>>>>>> See previously you should got a transition from SYSTEM->GTT in 
>>>>>> i915_ttm_move() to re-create your backing store. Not you get 
>>>>>> NULL->SYSTEM which is handled by ttm_bo_move_null() and then 
>>>>>> SYSTEM->GTT.
>>>>>
>>>>> What is GTT here in TTM world? Also I'm not seeing where there is 
>>>>> this SYSTEM->GTT transition? Maybe I'm blind. Just to be clear, 
>>>>> i915 is only calling ttm_bo_validate() once when acquiring the 
>>>>> pages, and we don't call it again, unless it was evicted (and 
>>>>> potentially swapped out).
>>>>
>>>> Well GTT means TTM_PL_TT.
>>>>
>>>> And calling it only once is perfectly fine, TTM will internally see 
>>>> that we need two hops to reach TTM_PL_TT and so does the 
>>>> NULL->SYSTEM transition and then SYSTEM->TT.
>>>
>>> Ah interesting, so that's what the multi-hop thing does. But AFAICT 
>>> i915 is not using either TTM_PL_TT or -EMULTIHOP.
>>
>> Mhm, it could be that we then have a problem and the i915 driver only 
>> sees NULL->TT directly. But I really don't know the i915 driver code 
>> good enough to judge that.
>>
>> Can you take a look at this and test it maybe?
> 
> I'll grab a machine and try to see what is going on here.

Well at least the issue with the firmware not loading looks to be fixed now.

So running some eviction + oom tests it looks it now does:

/* eviction kicks in */
i915_ttm_move(bo):  LMEM -> PL_SYSTEM

/* shrinker/oom kicks in at some point */
i915_ttm_shrink(bo):
     bo->resource = NULL, /* pipeline_gutting */
     shmem ttm_tt is unpopulated and pages are correctly swapped out

/* user touches the same object later */
i915_ttm_move(bo):  NULL -> LMEM, bo_move_null()

So seems to incorrectly skip swapping it back in and then copy over to 
lmem. It just allocates directly in lmem.

And previously the last two steps would have been:

i915_ttm_shrink(bo):
     bo->resource = PL_SYSTEM, /* pipeline_gutting */
     shmem ttm_tt is unpopulated and pages are correctly swapped out

i915_ttm_move(bo):
     PL_SYSTEM -> LMEM,
     ttm_tt is repopulated and pages are copied over to lmem

> 
>>
>>>
>>> Also what is the difference between TTM_PL_TT and TM_PL_SYSTEM? When 
>>> should you use one over the other?
>>
>> TTM_PL_SYSTEM means the device is not accessing the buffer and TTM has 
>> the control over the backing store and can swapout/swapin as it wants it.
>>
>> TTM_PL_TT means that the device is accessing the data (TT stands for 
>> translation table) and so TTM can't swap the backing store in/out.
>>
>> TTM_PL_VRAM well that one is obvious.
> 
> Thanks for the explanation. So it looks like i915 is using TTM_PL_SYSTEM 
> even for device access it seems.
> 
>>
>> Thanks,
>> Christian.
>>
>>>
>>>>
>>>> As far as I can see that should work like it did before.
>>>>
>>>> Christian.
>>>>
>>>>>
>>>>>>
>>>>>> If you just validated to SYSTEM memory before I think the tt 
>>>>>> object wouldn't have been populated either.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I've been considering to replacing the ttm_bo_type with a 
>>>>>>>>>> bunch of behavior flags for a bo. I'm hoping that this will 
>>>>>>>>>> clean things up a bit.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>>>>>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>>>>>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct 
>>>>>>>>>>>>> ttm_buffer_object *bo, bool evict,
>>>>>>>>>>>>>       bool clear;
>>>>>>>>>>>>>       int ret;
>>>>>>>>>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>>>>>>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>>>>>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>>>>>>>>>           return 0;
>>>>>>>>>>>>>       }
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>
Christian König Sept. 1, 2022, 8 a.m. UTC | #14
Am 31.08.22 um 18:32 schrieb Matthew Auld:
> On 31/08/2022 15:53, Matthew Auld wrote:
>> On 31/08/2022 14:34, Christian König wrote:
>>> Am 31.08.22 um 14:50 schrieb Matthew Auld:
>>>> On 31/08/2022 13:35, Christian König wrote:
>>>>> Am 31.08.22 um 14:06 schrieb Matthew Auld:
>>>>>> On 31/08/2022 12:03, Christian König wrote:
>>>>>>> Am 31.08.22 um 12:37 schrieb Matthew Auld:
>>>>>>>> [SNIP]
>>>>>>>>>>
>>>>>>>>>> That hopefully just leaves i915_ttm_shrink(), which is 
>>>>>>>>>> swapping out shmem ttm_tt and is calling ttm_bo_validate() 
>>>>>>>>>> with empty placements to force the pipeline-gutting path, 
>>>>>>>>>> which importantly unpopulates the ttm_tt for us (since 
>>>>>>>>>> ttm_tt_unpopulate is not exported it seems). But AFAICT it 
>>>>>>>>>> looks like that will now also nuke the bo->resource, instead 
>>>>>>>>>> of just leaving it in system memory. My assumption is that 
>>>>>>>>>> when later calling ttm_bo_validate(), it will just do the 
>>>>>>>>>> bo_move_null() in i915_ttm_move(), instead of re-populating 
>>>>>>>>>> the ttm_tt and then potentially copying it back to local-memory?
>>>>>>>>>
>>>>>>>>> Well you do ttm_bo_validate() with something like GTT domain, 
>>>>>>>>> don't you? This should result in re-populating the tt object, 
>>>>>>>>> but I'm not 100% sure if that really works as expected.
>>>>>>>>
>>>>>>>> AFAIK for domains we either have system memory (which uses 
>>>>>>>> ttm_tt and might be shmem underneath) or local-memory. But 
>>>>>>>> perhaps i915 is doing something wrong here, or abusing TTM in 
>>>>>>>> some way. I'm not sure tbh.
>>>>>>>>
>>>>>>>> Anyway, I think we have two cases here:
>>>>>>>>
>>>>>>>> - We have some system memory only object. After doing 
>>>>>>>> i915_ttm_shrink(), bo->resource is now NULL. We then call 
>>>>>>>> ttm_bo_validate() at some later point, but here we don't need 
>>>>>>>> to copy anything, but it also looks like 
>>>>>>>> ttm_bo_handle_move_mem() won't populate the ttm_tt or us 
>>>>>>>> either, since mem_type == TTM_PL_SYSTEM. It looks like 
>>>>>>>> i915_ttm_move() was taking care of this, but now we just call 
>>>>>>>> ttm_bo_move_null().
>>>>>>>>
>>>>>>>> - We have a local-memory only object, which was evicted to 
>>>>>>>> shmem, and then swapped out by the shrinker like above. The 
>>>>>>>> bo->resource is NULL. However this time when calling 
>>>>>>>> ttm_bo_validate() we need to actually do a copy in 
>>>>>>>> i915_ttm_move(), as well as re-populate the ttm_tt. 
>>>>>>>> i915_ttm_move() was taking care of this, but now we just call 
>>>>>>>> ttm_bo_move_null().
>>>>>>>>
>>>>>>>> Perhaps i915 is doing something wrong in the above two cases?
>>>>>>>
>>>>>>> Mhm, as far as I can see that should still work.
>>>>>>>
>>>>>>> See previously you should got a transition from SYSTEM->GTT in 
>>>>>>> i915_ttm_move() to re-create your backing store. Not you get 
>>>>>>> NULL->SYSTEM which is handled by ttm_bo_move_null() and then 
>>>>>>> SYSTEM->GTT.
>>>>>>
>>>>>> What is GTT here in TTM world? Also I'm not seeing where there is 
>>>>>> this SYSTEM->GTT transition? Maybe I'm blind. Just to be clear, 
>>>>>> i915 is only calling ttm_bo_validate() once when acquiring the 
>>>>>> pages, and we don't call it again, unless it was evicted (and 
>>>>>> potentially swapped out).
>>>>>
>>>>> Well GTT means TTM_PL_TT.
>>>>>
>>>>> And calling it only once is perfectly fine, TTM will internally 
>>>>> see that we need two hops to reach TTM_PL_TT and so does the 
>>>>> NULL->SYSTEM transition and then SYSTEM->TT.
>>>>
>>>> Ah interesting, so that's what the multi-hop thing does. But AFAICT 
>>>> i915 is not using either TTM_PL_TT or -EMULTIHOP.
>>>
>>> Mhm, it could be that we then have a problem and the i915 driver 
>>> only sees NULL->TT directly. But I really don't know the i915 driver 
>>> code good enough to judge that.
>>>
>>> Can you take a look at this and test it maybe?
>>
>> I'll grab a machine and try to see what is going on here.
>
> Well at least the issue with the firmware not loading looks to be 
> fixed now.
>
> So running some eviction + oom tests it looks it now does:
>
> /* eviction kicks in */
> i915_ttm_move(bo):  LMEM -> PL_SYSTEM
>
> /* shrinker/oom kicks in at some point */
> i915_ttm_shrink(bo):
>     bo->resource = NULL, /* pipeline_gutting */
>     shmem ttm_tt is unpopulated and pages are correctly swapped out
>
> /* user touches the same object later */
> i915_ttm_move(bo):  NULL -> LMEM, bo_move_null()
>
> So seems to incorrectly skip swapping it back in and then copy over to 
> lmem. It just allocates directly in lmem.
>
> And previously the last two steps would have been:
>
> i915_ttm_shrink(bo):
>     bo->resource = PL_SYSTEM, /* pipeline_gutting */
>     shmem ttm_tt is unpopulated and pages are correctly swapped out
>
> i915_ttm_move(bo):
>     PL_SYSTEM -> LMEM,
>     ttm_tt is repopulated and pages are copied over to lmem

Mhm, crap. That indeed looks like it won't work.

How about changing the code like this:

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
index c420d1ab605b..1ee7d81ddcbc 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
@@ -560,7 +560,17 @@ int i915_ttm_move(struct ttm_buffer_object *bo, 
bool evict,
         bool clear;
         int ret;

-       if (GEM_WARN_ON(!obj) || !bo->resource) {
+       if (GEM_WARN_ON(!obj)) {
+               ttm_bo_move_null(bo, dst_mem);
+               return 0;
+       }
+
+       if (!bo->resource) {
+               if (dst_mem->mem_type != TTM_PL_SYSTEM) {
+                        hop->mem_type = TTM_PL_SYSTEM;
+                        hop->flags = TTM_PL_FLAG_TEMPORARY;
+                       return -EMULTIHOP;
+               }
                 ttm_bo_move_null(bo, dst_mem);
                 return 0;
         }

That should result in allocating a TTM_PL_SYSTEM resource first and then 
moving from that to the final destination.

Thanks,
Christian.
Matthew Auld Sept. 1, 2022, 12:52 p.m. UTC | #15
On 01/09/2022 09:00, Christian König wrote:
> Am 31.08.22 um 18:32 schrieb Matthew Auld:
>> On 31/08/2022 15:53, Matthew Auld wrote:
>>> On 31/08/2022 14:34, Christian König wrote:
>>>> Am 31.08.22 um 14:50 schrieb Matthew Auld:
>>>>> On 31/08/2022 13:35, Christian König wrote:
>>>>>> Am 31.08.22 um 14:06 schrieb Matthew Auld:
>>>>>>> On 31/08/2022 12:03, Christian König wrote:
>>>>>>>> Am 31.08.22 um 12:37 schrieb Matthew Auld:
>>>>>>>>> [SNIP]
>>>>>>>>>>>
>>>>>>>>>>> That hopefully just leaves i915_ttm_shrink(), which is 
>>>>>>>>>>> swapping out shmem ttm_tt and is calling ttm_bo_validate() 
>>>>>>>>>>> with empty placements to force the pipeline-gutting path, 
>>>>>>>>>>> which importantly unpopulates the ttm_tt for us (since 
>>>>>>>>>>> ttm_tt_unpopulate is not exported it seems). But AFAICT it 
>>>>>>>>>>> looks like that will now also nuke the bo->resource, instead 
>>>>>>>>>>> of just leaving it in system memory. My assumption is that 
>>>>>>>>>>> when later calling ttm_bo_validate(), it will just do the 
>>>>>>>>>>> bo_move_null() in i915_ttm_move(), instead of re-populating 
>>>>>>>>>>> the ttm_tt and then potentially copying it back to local-memory?
>>>>>>>>>>
>>>>>>>>>> Well you do ttm_bo_validate() with something like GTT domain, 
>>>>>>>>>> don't you? This should result in re-populating the tt object, 
>>>>>>>>>> but I'm not 100% sure if that really works as expected.
>>>>>>>>>
>>>>>>>>> AFAIK for domains we either have system memory (which uses 
>>>>>>>>> ttm_tt and might be shmem underneath) or local-memory. But 
>>>>>>>>> perhaps i915 is doing something wrong here, or abusing TTM in 
>>>>>>>>> some way. I'm not sure tbh.
>>>>>>>>>
>>>>>>>>> Anyway, I think we have two cases here:
>>>>>>>>>
>>>>>>>>> - We have some system memory only object. After doing 
>>>>>>>>> i915_ttm_shrink(), bo->resource is now NULL. We then call 
>>>>>>>>> ttm_bo_validate() at some later point, but here we don't need 
>>>>>>>>> to copy anything, but it also looks like 
>>>>>>>>> ttm_bo_handle_move_mem() won't populate the ttm_tt or us 
>>>>>>>>> either, since mem_type == TTM_PL_SYSTEM. It looks like 
>>>>>>>>> i915_ttm_move() was taking care of this, but now we just call 
>>>>>>>>> ttm_bo_move_null().
>>>>>>>>>
>>>>>>>>> - We have a local-memory only object, which was evicted to 
>>>>>>>>> shmem, and then swapped out by the shrinker like above. The 
>>>>>>>>> bo->resource is NULL. However this time when calling 
>>>>>>>>> ttm_bo_validate() we need to actually do a copy in 
>>>>>>>>> i915_ttm_move(), as well as re-populate the ttm_tt. 
>>>>>>>>> i915_ttm_move() was taking care of this, but now we just call 
>>>>>>>>> ttm_bo_move_null().
>>>>>>>>>
>>>>>>>>> Perhaps i915 is doing something wrong in the above two cases?
>>>>>>>>
>>>>>>>> Mhm, as far as I can see that should still work.
>>>>>>>>
>>>>>>>> See previously you should got a transition from SYSTEM->GTT in 
>>>>>>>> i915_ttm_move() to re-create your backing store. Not you get 
>>>>>>>> NULL->SYSTEM which is handled by ttm_bo_move_null() and then 
>>>>>>>> SYSTEM->GTT.
>>>>>>>
>>>>>>> What is GTT here in TTM world? Also I'm not seeing where there is 
>>>>>>> this SYSTEM->GTT transition? Maybe I'm blind. Just to be clear, 
>>>>>>> i915 is only calling ttm_bo_validate() once when acquiring the 
>>>>>>> pages, and we don't call it again, unless it was evicted (and 
>>>>>>> potentially swapped out).
>>>>>>
>>>>>> Well GTT means TTM_PL_TT.
>>>>>>
>>>>>> And calling it only once is perfectly fine, TTM will internally 
>>>>>> see that we need two hops to reach TTM_PL_TT and so does the 
>>>>>> NULL->SYSTEM transition and then SYSTEM->TT.
>>>>>
>>>>> Ah interesting, so that's what the multi-hop thing does. But AFAICT 
>>>>> i915 is not using either TTM_PL_TT or -EMULTIHOP.
>>>>
>>>> Mhm, it could be that we then have a problem and the i915 driver 
>>>> only sees NULL->TT directly. But I really don't know the i915 driver 
>>>> code good enough to judge that.
>>>>
>>>> Can you take a look at this and test it maybe?
>>>
>>> I'll grab a machine and try to see what is going on here.
>>
>> Well at least the issue with the firmware not loading looks to be 
>> fixed now.
>>
>> So running some eviction + oom tests it looks it now does:
>>
>> /* eviction kicks in */
>> i915_ttm_move(bo):  LMEM -> PL_SYSTEM
>>
>> /* shrinker/oom kicks in at some point */
>> i915_ttm_shrink(bo):
>>     bo->resource = NULL, /* pipeline_gutting */
>>     shmem ttm_tt is unpopulated and pages are correctly swapped out
>>
>> /* user touches the same object later */
>> i915_ttm_move(bo):  NULL -> LMEM, bo_move_null()
>>
>> So seems to incorrectly skip swapping it back in and then copy over to 
>> lmem. It just allocates directly in lmem.
>>
>> And previously the last two steps would have been:
>>
>> i915_ttm_shrink(bo):
>>     bo->resource = PL_SYSTEM, /* pipeline_gutting */
>>     shmem ttm_tt is unpopulated and pages are correctly swapped out
>>
>> i915_ttm_move(bo):
>>     PL_SYSTEM -> LMEM,
>>     ttm_tt is repopulated and pages are copied over to lmem
> 
> Mhm, crap. That indeed looks like it won't work.
> 
> How about changing the code like this:
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> index c420d1ab605b..1ee7d81ddcbc 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> @@ -560,7 +560,17 @@ int i915_ttm_move(struct ttm_buffer_object *bo, 
> bool evict,
>          bool clear;
>          int ret;
> 
> -       if (GEM_WARN_ON(!obj) || !bo->resource) {
> +       if (GEM_WARN_ON(!obj)) {
> +               ttm_bo_move_null(bo, dst_mem);
> +               return 0;
> +       }
> +
> +       if (!bo->resource) {
> +               if (dst_mem->mem_type != TTM_PL_SYSTEM) {
> +                        hop->mem_type = TTM_PL_SYSTEM;
> +                        hop->flags = TTM_PL_FLAG_TEMPORARY;
> +                       return -EMULTIHOP;
> +               }
>                  ttm_bo_move_null(bo, dst_mem);
>                  return 0;
>          }
> 
> That should result in allocating a TTM_PL_SYSTEM resource first and then 
> moving from that to the final destination.

Ok, I played around with this some more. The final diff looks like:
https://patchwork.freedesktop.org/patch/500786/?series=108027&rev=1

It looks like we had some more places where bo->resource == NULL didn't 
end well. It at least now seems to survive my local testing.

> 
> Thanks,
> Christian.
Thomas Hellström Sept. 1, 2022, 5:48 p.m. UTC | #16
On Wed, 2022-08-31 at 15:34 +0200, Christian König wrote:
> Am 31.08.22 um 14:50 schrieb Matthew Auld:
> > On 31/08/2022 13:35, Christian König wrote:
> > > Am 31.08.22 um 14:06 schrieb Matthew Auld:
> > > > On 31/08/2022 12:03, Christian König wrote:
> > > > > Am 31.08.22 um 12:37 schrieb Matthew Auld:
> > > > > > [SNIP]
> > > > > > > > 
> > > > > > > > That hopefully just leaves i915_ttm_shrink(), which is
> > > > > > > > swapping 
> > > > > > > > out shmem ttm_tt and is calling ttm_bo_validate() with
> > > > > > > > empty 
> > > > > > > > placements to force the pipeline-gutting path, which
> > > > > > > > importantly 
> > > > > > > > unpopulates the ttm_tt for us (since ttm_tt_unpopulate
> > > > > > > > is not 
> > > > > > > > exported it seems). But AFAICT it looks like that will
> > > > > > > > now also 
> > > > > > > > nuke the bo->resource, instead of just leaving it in
> > > > > > > > system 
> > > > > > > > memory. My assumption is that when later calling 
> > > > > > > > ttm_bo_validate(), it will just do the bo_move_null()
> > > > > > > > in 
> > > > > > > > i915_ttm_move(), instead of re-populating the ttm_tt
> > > > > > > > and then 
> > > > > > > > potentially copying it back to local-memory?
> > > > > > > 
> > > > > > > Well you do ttm_bo_validate() with something like GTT
> > > > > > > domain, 
> > > > > > > don't you? This should result in re-populating the tt
> > > > > > > object, but 
> > > > > > > I'm not 100% sure if that really works as expected.
> > > > > > 
> > > > > > AFAIK for domains we either have system memory (which uses
> > > > > > ttm_tt 
> > > > > > and might be shmem underneath) or local-memory. But perhaps
> > > > > > i915 
> > > > > > is doing something wrong here, or abusing TTM in some way.
> > > > > > I'm not 
> > > > > > sure tbh.
> > > > > > 
> > > > > > Anyway, I think we have two cases here:
> > > > > > 
> > > > > > - We have some system memory only object. After doing 
> > > > > > i915_ttm_shrink(), bo->resource is now NULL. We then call 
> > > > > > ttm_bo_validate() at some later point, but here we don't
> > > > > > need to 
> > > > > > copy anything, but it also looks like
> > > > > > ttm_bo_handle_move_mem() 
> > > > > > won't populate the ttm_tt or us either, since mem_type == 
> > > > > > TTM_PL_SYSTEM. It looks like i915_ttm_move() was taking
> > > > > > care of 
> > > > > > this, but now we just call ttm_bo_move_null().
> > > > > > 
> > > > > > - We have a local-memory only object, which was evicted to
> > > > > > shmem, 
> > > > > > and then swapped out by the shrinker like above. The bo-
> > > > > > >resource 
> > > > > > is NULL. However this time when calling ttm_bo_validate()
> > > > > > we need 
> > > > > > to actually do a copy in i915_ttm_move(), as well as re-
> > > > > > populate 
> > > > > > the ttm_tt. i915_ttm_move() was taking care of this, but
> > > > > > now we 
> > > > > > just call ttm_bo_move_null().
> > > > > > 
> > > > > > Perhaps i915 is doing something wrong in the above two
> > > > > > cases?
> > > > > 
> > > > > Mhm, as far as I can see that should still work.
> > > > > 
> > > > > See previously you should got a transition from SYSTEM->GTT
> > > > > in 
> > > > > i915_ttm_move() to re-create your backing store. Not you get 
> > > > > NULL->SYSTEM which is handled by ttm_bo_move_null() and then 
> > > > > SYSTEM->GTT.
> > > > 
> > > > What is GTT here in TTM world? Also I'm not seeing where there
> > > > is 
> > > > this SYSTEM->GTT transition? Maybe I'm blind. Just to be clear,
> > > > i915 
> > > > is only calling ttm_bo_validate() once when acquiring the
> > > > pages, and 
> > > > we don't call it again, unless it was evicted (and potentially 
> > > > swapped out).
> > > 
> > > Well GTT means TTM_PL_TT.
> > > 
> > > And calling it only once is perfectly fine, TTM will internally
> > > see 
> > > that we need two hops to reach TTM_PL_TT and so does the NULL-
> > > >SYSTEM 
> > > transition and then SYSTEM->TT.
> > 
> > Ah interesting, so that's what the multi-hop thing does. But AFAICT
> > i915 is not using either TTM_PL_TT or -EMULTIHOP.
> 
> Mhm, it could be that we then have a problem and the i915 driver only
> sees NULL->TT directly. But I really don't know the i915 driver code 
> good enough to judge that.
> 
> Can you take a look at this and test it maybe?
> 
> > 
> > Also what is the difference between TTM_PL_TT and TM_PL_SYSTEM?
> > When 
> > should you use one over the other?
> 
> TTM_PL_SYSTEM means the device is not accessing the buffer and TTM
> has 
> the control over the backing store and can swapout/swapin as it wants
> it.
> 
> TTM_PL_TT means that the device is accessing the data (TT stands for 
> translation table) and so TTM can't swap the backing store in/out.
> 
> TTM_PL_VRAM well that one is obvious.
> 
> Thanks,
> Christian.

We've had a previous long discussions on this listing pros and cons,
and IIRC concluded that either binding to the device from system needed
some TTM fixes and documentation to be straightforward, or a driver
should use the above scheme bouncing in TT. IIRC we concluded that the
best thing for i915 would be to transition and introduce a dummy TT
region and obey the scheme outlined above by Christian.
We unfortunately never gotten around to that, though, due to other work
got prioritized. Also need to solve the limbo (not populated) -> vram
transition without populating when moving to TT....

Originaly TT was intended for GGTT-like and AGP apertures that needed
cpu-mapping to the PCI address. Using it like Christian outlines helps
avoid special casing for swapout. Devices that bind to System memory
needs the swap notifier to unbind.

/Thomas



> 
> > 
> > > 
> > > As far as I can see that should work like it did before.
> > > 
> > > Christian.
> > > 
> > > > 
> > > > > 
> > > > > If you just validated to SYSTEM memory before I think the tt
> > > > > object 
> > > > > wouldn't have been populated either.
> > > > > 
> > > > > Regards,
> > > > > Christian.
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > Christian.
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > I've been considering to replacing the ttm_bo_type
> > > > > > > > > with a bunch 
> > > > > > > > > of behavior flags for a bo. I'm hoping that this will
> > > > > > > > > clean 
> > > > > > > > > things up a bit.
> > > > > > > > > 
> > > > > > > > > Regards,
> > > > > > > > > Christian.
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > >       caching =
> > > > > > > > > > > > i915_ttm_select_tt_caching(obj);
> > > > > > > > > > > > diff --git
> > > > > > > > > > > > a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
> > > > > > > > > > > > b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> > > > > > > > > > > > index 9a7e50534b84bb..c420d1ab605b6f 100644
> > > > > > > > > > > > ---
> > > > > > > > > > > > a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> > > > > > > > > > > > +++
> > > > > > > > > > > > b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> > > > > > > > > > > > @@ -560,7 +560,7 @@ int i915_ttm_move(struct 
> > > > > > > > > > > > ttm_buffer_object *bo, bool evict,
> > > > > > > > > > > >       bool clear;
> > > > > > > > > > > >       int ret;
> > > > > > > > > > > > -    if (GEM_WARN_ON(!obj)) {
> > > > > > > > > > > > +    if (GEM_WARN_ON(!obj) || !bo->resource) {
> > > > > > > > > > > >           ttm_bo_move_null(bo, dst_mem);
> > > > > > > > > > > >           return 0;
> > > > > > > > > > > >       }
> > > > > > > > > > > 
> > > > > > > > > 
> > > > > > > 
> > > > > 
> > > 
>
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
index bc9c432edffe03..45ce2d1f754cc4 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
@@ -271,8 +271,6 @@  static struct ttm_tt *i915_ttm_tt_create(struct ttm_buffer_object *bo,
 {
 	struct drm_i915_private *i915 = container_of(bo->bdev, typeof(*i915),
 						     bdev);
-	struct ttm_resource_manager *man =
-		ttm_manager_type(bo->bdev, bo->resource->mem_type);
 	struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
 	unsigned long ccs_pages = 0;
 	enum ttm_caching caching;
@@ -286,8 +284,8 @@  static struct ttm_tt *i915_ttm_tt_create(struct ttm_buffer_object *bo,
 	if (!i915_tt)
 		return NULL;
 
-	if (obj->flags & I915_BO_ALLOC_CPU_CLEAR &&
-	    man->use_tt)
+	if (obj->flags & I915_BO_ALLOC_CPU_CLEAR && bo->resource &&
+	    ttm_manager_type(bo->bdev, bo->resource->mem_type)->use_tt)
 		page_flags |= TTM_TT_FLAG_ZERO_ALLOC;
 
 	caching = i915_ttm_select_tt_caching(obj);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
index 9a7e50534b84bb..c420d1ab605b6f 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
@@ -560,7 +560,7 @@  int i915_ttm_move(struct ttm_buffer_object *bo, bool evict,
 	bool clear;
 	int ret;
 
-	if (GEM_WARN_ON(!obj)) {
+	if (GEM_WARN_ON(!obj) || !bo->resource) {
 		ttm_bo_move_null(bo, dst_mem);
 		return 0;
 	}