diff mbox series

[v4] drm/ttm: Clarify that the TTM_PL_SYSTEM is under TTMs control

Message ID 20211110145034.487512-1-zackr@vmware.com (mailing list archive)
State New, archived
Headers show
Series [v4] drm/ttm: Clarify that the TTM_PL_SYSTEM is under TTMs control | expand

Commit Message

Zack Rusin Nov. 10, 2021, 2:50 p.m. UTC
TTM takes full control over TTM_PL_SYSTEM placed buffers. This makes
driver-internal usage of TTM_PL_SYSTEM error prone, because it requires
the drivers to manually handle every interaction between the driver and
TTM, which can swap those buffers out whenever it thinks that is the
right thing to do.

CPU buffers which need to be fenced and shared with accelerators should
be placed in driver-specific placements that can explicitly handle
CPU/accelerator buffer fencing.
Currently, apart from things silently failing, nothing enforces that
requirement, which makes it easy for drivers and new developers to get
this wrong. To avoid the confusion we can document this requirement and
clarify the solution.

This came up during a discussion on dri-devel:
https://lore.kernel.org/dri-devel/232f45e9-8748-1243-09bf-56763e6668b3@amd.com

Signed-off-by: Zack Rusin <zackr@vmware.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 include/drm/ttm/ttm_placement.h | 11 +++++++++++
 1 file changed, 11 insertions(+)
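As a rough illustration of the driver-specific placement the commit
message recommends (a minimal sketch only, loosely modeled on the
vmwgfx approach; the DRV_/drv_ names are hypothetical and the
structures assume the 5.15-era TTM API):

#include <drm/ttm/ttm_placement.h>

/* First driver-private memory type; TTM_PL_PRIV is reserved for this. */
#define DRV_PL_SYSTEM	(TTM_PL_PRIV + 0)

/* Single-entry placement that keeps a BO in the driver's own
 * system-like domain instead of TTM_PL_SYSTEM, so the driver (not TTM)
 * decides when the buffer may be swapped out or moved. */
static const struct ttm_place drv_sys_place = {
	.fpfn = 0,
	.lpfn = 0,
	.mem_type = DRV_PL_SYSTEM,
	.flags = 0,
};

static struct ttm_placement drv_sys_placement = {
	.num_placement = 1,
	.placement = &drv_sys_place,
	.num_busy_placement = 1,
	.busy_placement = &drv_sys_place,
};

A driver would pass drv_sys_placement to ttm_bo_validate() for buffers
it wants to fence and share with the accelerator, instead of relying on
TTM_PL_SYSTEM.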

Comments

Zack Rusin Nov. 11, 2021, 4:44 p.m. UTC | #1
On Wed, 2021-11-10 at 09:50 -0500, Zack Rusin wrote:
> TTM takes full control over TTM_PL_SYSTEM placed buffers. This makes
> driver internal usage of TTM_PL_SYSTEM prone to errors because it
> requires the drivers to manually handle all interactions between TTM
> which can swap out those buffers whenever it thinks it's the right
> thing to do and driver.
> 
> CPU buffers which need to be fenced and shared with accelerators
> should
> be placed in driver specific placements that can explicitly handle
> CPU/accelerator buffer fencing.
> Currently, apart, from things silently failing nothing is enforcing
> that requirement which means that it's easy for drivers and new
> developers to get this wrong. To avoid the confusion we can document
> this requirement and clarify the solution.
> 
> This came up during a discussion on dri-devel:
> https://lore.kernel.org/dri-devel/232f45e9-8748-1243-09bf-56763e6668b3@amd.com


Polite and gentle ping on that one. Are we ok with the wording here?

z
Thomas Hellström Nov. 13, 2021, 11:26 a.m. UTC | #2
Hi, Zack,

On 11/11/21 17:44, Zack Rusin wrote:
> On Wed, 2021-11-10 at 09:50 -0500, Zack Rusin wrote:
>> TTM takes full control over TTM_PL_SYSTEM placed buffers. This makes
>> driver internal usage of TTM_PL_SYSTEM prone to errors because it
>> requires the drivers to manually handle all interactions between TTM
>> which can swap out those buffers whenever it thinks it's the right
>> thing to do and driver.
>>
>> CPU buffers which need to be fenced and shared with accelerators
>> should
>> be placed in driver specific placements that can explicitly handle
>> CPU/accelerator buffer fencing.
>> Currently, apart, from things silently failing nothing is enforcing
>> that requirement which means that it's easy for drivers and new
>> developers to get this wrong. To avoid the confusion we can document
>> this requirement and clarify the solution.
>>
>> This came up during a discussion on dri-devel:
>> https://lore.kernel.org/dri-devel/232f45e9-8748-1243-09bf-56763e6668b3@amd.com

I took a slightly deeper look into this. I think we need to formalize
this a bit more to understand the pros and cons and what the restrictions
are really all about. Anybody looking at the previous discussion will
mostly see arguments similar to "this is stupid and difficult" and "it
has always been this way", which are not really constructive.

First disregarding all accounting stuff, I think this all boils down to 
TTM_PL_SYSTEM having three distinct states:
1) POPULATED
2) LIMBO (Or whatever we want to call it. No pages present)
3) SWAPPED.

The ttm_bo_move_memcpy() helper understands these, and any driver
implementation of the move() callback that does not use the
ttm_bo_move_memcpy() helper _currently_ needs to understand them as well.

Now using a bounce domain to proxy SYSTEM means that the driver can
forget about the SWAPPED state; it is automatically handled by the move
setup code. However, another pitfall is LIMBO, in that when you move
from SYSTEM/LIMBO to your bounce domain, the BO will be populated. So
any naive accelerated move() implementation creating a 1GB BO in fixed
memory, like VRAM, will needlessly allocate and free 1GB of system
memory in the process instead of just performing a clear operation.
Looks like amdgpu suffers from this?
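A minimal sketch of what a LIMBO-aware move() could look like, assuming
the 5.15-era ttm_device_funcs.move() signature; drv_bo_move() and
drv_clear_vram() are hypothetical names, not existing TTM or driver
functions:

#include <drm/ttm/ttm_bo_driver.h>
#include <drm/ttm/ttm_tt.h>
#include <drm/ttm/ttm_placement.h>

static int drv_bo_move(struct ttm_buffer_object *bo, bool evict,
		       struct ttm_operation_ctx *ctx,
		       struct ttm_resource *new_mem,
		       struct ttm_place *hop)
{
	struct ttm_resource *old_mem = bo->resource;

	if (old_mem->mem_type == TTM_PL_SYSTEM &&
	    new_mem->mem_type == TTM_PL_VRAM &&
	    (!bo->ttm || !ttm_tt_is_populated(bo->ttm))) {
		/*
		 * Source was never populated ("LIMBO"): there is no data
		 * to preserve, so clearing the VRAM allocation is enough.
		 * A populating copy would allocate and then free the full
		 * buffer size in system memory for nothing.
		 */
		drv_clear_vram(bo, new_mem);	/* hypothetical helper */
		ttm_bo_move_null(bo, new_mem);
		return 0;
	}

	/* Everything else falls back to the generic memcpy path, which
	 * knows how to handle the populated and swapped states. */
	return ttm_bo_move_memcpy(bo, ctx, new_mem);
}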

I think what is really needed is either

a) A TTM helper that helps move callback implementations resolve the
issue of populating SYSTEM from LIMBO or SWAP, and that also formalizes
driver notification for swapping. At a minimum, I think the
swap_notify() callback needs to be able to return a late error.

b) Make LIMBO and SWAPPED distinct memory regions. (I think I'd vote for 
this without looking into it in detail).

In both these cases, we should really make SYSTEM bindable by the GPU,
otherwise we'd just be trading one pitfall for another related one
without really resolving the root problem.

As for fencing not being supported by SYSTEM, I'm not sure why we don't
want this, because not supporting it would for example prohibit async
ttm_move_memcpy(), and also async unbinding of ttm_tt memory like MOB
on vmwgfx. (I think that is still sync.)

There might be an accounting issue related to this as well, but I guess 
Christian would need to chime in on this. If so, I think it needs to be 
well understood and documented (in TTM, not in AMD drivers).

Thanks,

/Thomas


>
> Polite and gentle ping on that one. Are we ok with the wording here?
>
> z
Christian König Nov. 16, 2021, 7:19 a.m. UTC | #3
On 13.11.21 at 12:26, Thomas Hellström wrote:
> Hi, Zack,
>
> On 11/11/21 17:44, Zack Rusin wrote:
>> On Wed, 2021-11-10 at 09:50 -0500, Zack Rusin wrote:
>>> TTM takes full control over TTM_PL_SYSTEM placed buffers. This makes
>>> driver internal usage of TTM_PL_SYSTEM prone to errors because it
>>> requires the drivers to manually handle all interactions between TTM
>>> which can swap out those buffers whenever it thinks it's the right
>>> thing to do and driver.
>>>
>>> CPU buffers which need to be fenced and shared with accelerators
>>> should
>>> be placed in driver specific placements that can explicitly handle
>>> CPU/accelerator buffer fencing.
>>> Currently, apart, from things silently failing nothing is enforcing
>>> that requirement which means that it's easy for drivers and new
>>> developers to get this wrong. To avoid the confusion we can document
>>> this requirement and clarify the solution.
>>>
>>> This came up during a discussion on dri-devel:
>>> https://lore.kernel.org/dri-devel/232f45e9-8748-1243-09bf-56763e6668b3@amd.com
>>>
>
> I took a slightly deeper look into this. I think we need to formalize 
> this a bit more to understand pros and cons and what the restrictions 
> are really all about. Anybody looking at the prevous discussion will 
> mostly see arguments similar to "this is stupid and difficult" and "it 
> has always been this way" which are not really constructive.
>
> First disregarding all accounting stuff, I think this all boils down 
> to TTM_PL_SYSTEM having three distinct states:
> 1) POPULATED
> 2) LIMBO (Or whatever we want to call it. No pages present)
> 3) SWAPPED.
>
> The ttm_bo_move_memcpy() helper understands these, and any standalone 
> driver implementation of the move() callback _currently_ needs to 
> understand these as well, unless using the ttm_bo_move_memcpy() helper.
>
> Now using a bounce domain to proxy SYSTEM means that the driver can 
> forget about the SWAPPED state, it's automatically handled by the move 
> setup code. However, another pitfall is LIMBO, in that if when you 
> move from SYSTEM/LIMBO to your bounce domain, the BO will be 
> populated. So any naive accelerated move() implementation creating a 
> 1GB BO in fixed memory, like VRAM, will needlessly allocate and free 
> 1GB of system memory in the process instead of just performing a clear 
> operation. Looks like amdgpu suffers from this?
>
> I think what is really needed is either
>
> a) A TTM helper that helps move callback implementations resolve the 
> issues populating system from LIMBO or SWAP, and then also formalize 
> driver notification for swapping. At a minimum, I think the 
> swap_notify() callback needs to be able to return a late error.
>
> b) Make LIMBO and SWAPPED distinct memory regions. (I think I'd vote 
> for this without looking into it in detail).
>
> In both these cases, we should really make SYSTEM bindable by GPU, 
> otherwise we'd just be trading one pitfall for another related without 
> really resolving the root problem.
>
> As for fencing not being supported by SYSTEM, I'm not sure why we 
> don't want this, because it would for example prohibit async 
> ttm_move_memcpy(), and also, async unbinding of ttm_tt memory like MOB 
> on vmgfx. (I think it's still sync).
>
> There might be an accounting issue related to this as well, but I 
> guess Christian would need to chime in on this. If so, I think it 
> needs to be well understood and documented (in TTM, not in AMD drivers).

I think the problem goes deeper than what has been mentioned here so far.

Having fences attached to BOs in the system domain is probably ok, but 
the key point is that BOs in the system domain are under TTM's control 
and should not be touched by the driver.

What we have now is that TTM internals like the allocation state of BOs 
in system memory (the populated, limbo and swapped states you mentioned 
above) are leaking into the drivers, and I think exactly that is the 
part which doesn't work reliably here. You can of course get that 
working, but it requires knowledge of the internal state, which in my 
eyes was always illegal.

What we could do is split the system domain into SYSTEM and SWAP and 
then say that only the swap domain is under TTM's control.

Regards,
Christian.

>
> Thanks,
>
> /Thomas
>
>
>>
>> Polite and gentle ping on that one. Are we ok with the wording here?
>>
>> z
Thomas Hellström Nov. 16, 2021, 7:43 a.m. UTC | #4
On 11/16/21 08:19, Christian König wrote:
> Am 13.11.21 um 12:26 schrieb Thomas Hellström:
>> Hi, Zack,
>>
>> On 11/11/21 17:44, Zack Rusin wrote:
>>> On Wed, 2021-11-10 at 09:50 -0500, Zack Rusin wrote:
>>>> TTM takes full control over TTM_PL_SYSTEM placed buffers. This makes
>>>> driver internal usage of TTM_PL_SYSTEM prone to errors because it
>>>> requires the drivers to manually handle all interactions between TTM
>>>> which can swap out those buffers whenever it thinks it's the right
>>>> thing to do and driver.
>>>>
>>>> CPU buffers which need to be fenced and shared with accelerators
>>>> should
>>>> be placed in driver specific placements that can explicitly handle
>>>> CPU/accelerator buffer fencing.
>>>> Currently, apart, from things silently failing nothing is enforcing
>>>> that requirement which means that it's easy for drivers and new
>>>> developers to get this wrong. To avoid the confusion we can document
>>>> this requirement and clarify the solution.
>>>>
>>>> This came up during a discussion on dri-devel:
>>>> https://lore.kernel.org/dri-devel/232f45e9-8748-1243-09bf-56763e6668b3@amd.com
>>>>
>>
>> I took a slightly deeper look into this. I think we need to formalize 
>> this a bit more to understand pros and cons and what the restrictions 
>> are really all about. Anybody looking at the prevous discussion will 
>> mostly see arguments similar to "this is stupid and difficult" and 
>> "it has always been this way" which are not really constructive.
>>
>> First disregarding all accounting stuff, I think this all boils down 
>> to TTM_PL_SYSTEM having three distinct states:
>> 1) POPULATED
>> 2) LIMBO (Or whatever we want to call it. No pages present)
>> 3) SWAPPED.
>>
>> The ttm_bo_move_memcpy() helper understands these, and any standalone 
>> driver implementation of the move() callback _currently_ needs to 
>> understand these as well, unless using the ttm_bo_move_memcpy() helper.
>>
>> Now using a bounce domain to proxy SYSTEM means that the driver can 
>> forget about the SWAPPED state, it's automatically handled by the 
>> move setup code. However, another pitfall is LIMBO, in that if when 
>> you move from SYSTEM/LIMBO to your bounce domain, the BO will be 
>> populated. So any naive accelerated move() implementation creating a 
>> 1GB BO in fixed memory, like VRAM, will needlessly allocate and free 
>> 1GB of system memory in the process instead of just performing a 
>> clear operation. Looks like amdgpu suffers from this?
>>
>> I think what is really needed is either
>>
>> a) A TTM helper that helps move callback implementations resolve the 
>> issues populating system from LIMBO or SWAP, and then also formalize 
>> driver notification for swapping. At a minimum, I think the 
>> swap_notify() callback needs to be able to return a late error.
>>
>> b) Make LIMBO and SWAPPED distinct memory regions. (I think I'd vote 
>> for this without looking into it in detail).
>>
>> In both these cases, we should really make SYSTEM bindable by GPU, 
>> otherwise we'd just be trading one pitfall for another related 
>> without really resolving the root problem.
>>
>> As for fencing not being supported by SYSTEM, I'm not sure why we 
>> don't want this, because it would for example prohibit async 
>> ttm_move_memcpy(), and also, async unbinding of ttm_tt memory like 
>> MOB on vmgfx. (I think it's still sync).
>>
>> There might be an accounting issue related to this as well, but I 
>> guess Christian would need to chime in on this. If so, I think it 
>> needs to be well understood and documented (in TTM, not in AMD drivers).
>
> I think the problem goes deeper than what has been mentioned here so far.
>
> Having fences attached to BOs in the system domain is probably ok, but 
> the key point is that the BOs in the system domain are under TTMs 
> control and should not be touched by the driver.
>
> What we have now is that TTMs internals like the allocation state of 
> BOs in system memory (the populated, limbo, swapped you mentioned 
> above) is leaking into the drivers and I think exactly that is the 
> part which doesn't work reliable here. You can of course can get that 
> working, but that requires knowledge of the internal state which in my 
> eyes was always illegal.
>
Well, I tend to agree to some extent, but then, as said above, even 
disregarding swap will cause trouble with the limbo state, because the 
driver's move callback would need knowledge of it to implement 
limbo -> vram moves efficiently.

> What we could do is to split the system domain into SYSTEM and SWAP 
> and then say only the swap domain is under TTMs control.

This probably also needs some thought on how to handle the limbo state?

I could craft an RFC hiding these states with helpers that we could 
compare against the SYSTEM + SWAP memory type.
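A minimal sketch of the kind of state-hiding helpers meant here (these
do not exist in TTM; the names and the use of the 5.15-era ttm_tt page
flags are purely illustrative):

#include <drm/ttm/ttm_bo_api.h>
#include <drm/ttm/ttm_tt.h>

/* Hypothetical helpers: would let drivers ask about the SYSTEM state
 * without poking at TTM internals directly. */
static inline bool ttm_bo_sys_is_swapped(struct ttm_buffer_object *bo)
{
	return bo->ttm && (bo->ttm->page_flags & TTM_PAGE_FLAG_SWAPPED);
}

static inline bool ttm_bo_sys_is_populated(struct ttm_buffer_object *bo)
{
	return bo->ttm && ttm_tt_is_populated(bo->ttm);
}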

/Thomas

>
> Regards,
> Christian.
>
>>
>> Thanks,
>>
>> /Thomas
>>
>>
>>>
>>> Polite and gentle ping on that one. Are we ok with the wording here?
>>>
>>> z
>
Christian König Nov. 16, 2021, 8:20 a.m. UTC | #5
On 16.11.21 at 08:43, Thomas Hellström wrote:
> On 11/16/21 08:19, Christian König wrote:
>> Am 13.11.21 um 12:26 schrieb Thomas Hellström:
>>> Hi, Zack,
>>>
>>> On 11/11/21 17:44, Zack Rusin wrote:
>>>> On Wed, 2021-11-10 at 09:50 -0500, Zack Rusin wrote:
>>>>> TTM takes full control over TTM_PL_SYSTEM placed buffers. This makes
>>>>> driver internal usage of TTM_PL_SYSTEM prone to errors because it
>>>>> requires the drivers to manually handle all interactions between TTM
>>>>> which can swap out those buffers whenever it thinks it's the right
>>>>> thing to do and driver.
>>>>>
>>>>> CPU buffers which need to be fenced and shared with accelerators
>>>>> should
>>>>> be placed in driver specific placements that can explicitly handle
>>>>> CPU/accelerator buffer fencing.
>>>>> Currently, apart, from things silently failing nothing is enforcing
>>>>> that requirement which means that it's easy for drivers and new
>>>>> developers to get this wrong. To avoid the confusion we can document
>>>>> this requirement and clarify the solution.
>>>>>
>>>>> This came up during a discussion on dri-devel:
>>>>> https://lore.kernel.org/dri-devel/232f45e9-8748-1243-09bf-56763e6668b3@amd.com
>>>>>
>>>
>>> I took a slightly deeper look into this. I think we need to 
>>> formalize this a bit more to understand pros and cons and what the 
>>> restrictions are really all about. Anybody looking at the prevous 
>>> discussion will mostly see arguments similar to "this is stupid and 
>>> difficult" and "it has always been this way" which are not really 
>>> constructive.
>>>
>>> First disregarding all accounting stuff, I think this all boils down 
>>> to TTM_PL_SYSTEM having three distinct states:
>>> 1) POPULATED
>>> 2) LIMBO (Or whatever we want to call it. No pages present)
>>> 3) SWAPPED.
>>>
>>> The ttm_bo_move_memcpy() helper understands these, and any 
>>> standalone driver implementation of the move() callback _currently_ 
>>> needs to understand these as well, unless using the 
>>> ttm_bo_move_memcpy() helper.
>>>
>>> Now using a bounce domain to proxy SYSTEM means that the driver can 
>>> forget about the SWAPPED state, it's automatically handled by the 
>>> move setup code. However, another pitfall is LIMBO, in that if when 
>>> you move from SYSTEM/LIMBO to your bounce domain, the BO will be 
>>> populated. So any naive accelerated move() implementation creating a 
>>> 1GB BO in fixed memory, like VRAM, will needlessly allocate and free 
>>> 1GB of system memory in the process instead of just performing a 
>>> clear operation. Looks like amdgpu suffers from this?
>>>
>>> I think what is really needed is either
>>>
>>> a) A TTM helper that helps move callback implementations resolve the 
>>> issues populating system from LIMBO or SWAP, and then also formalize 
>>> driver notification for swapping. At a minimum, I think the 
>>> swap_notify() callback needs to be able to return a late error.
>>>
>>> b) Make LIMBO and SWAPPED distinct memory regions. (I think I'd vote 
>>> for this without looking into it in detail).
>>>
>>> In both these cases, we should really make SYSTEM bindable by GPU, 
>>> otherwise we'd just be trading one pitfall for another related 
>>> without really resolving the root problem.
>>>
>>> As for fencing not being supported by SYSTEM, I'm not sure why we 
>>> don't want this, because it would for example prohibit async 
>>> ttm_move_memcpy(), and also, async unbinding of ttm_tt memory like 
>>> MOB on vmgfx. (I think it's still sync).
>>>
>>> There might be an accounting issue related to this as well, but I 
>>> guess Christian would need to chime in on this. If so, I think it 
>>> needs to be well understood and documented (in TTM, not in AMD 
>>> drivers).
>>
>> I think the problem goes deeper than what has been mentioned here so 
>> far.
>>
>> Having fences attached to BOs in the system domain is probably ok, 
>> but the key point is that the BOs in the system domain are under TTMs 
>> control and should not be touched by the driver.
>>
>> What we have now is that TTMs internals like the allocation state of 
>> BOs in system memory (the populated, limbo, swapped you mentioned 
>> above) is leaking into the drivers and I think exactly that is the 
>> part which doesn't work reliable here. You can of course can get that 
>> working, but that requires knowledge of the internal state which in 
>> my eyes was always illegal.
>>
> Well, I tend to agree to some extent, but then, like said above even 
> disregarding swap will cause trouble with the limbo state, because the 
> driver's move callback would need knowledge of that to implement moves 
> limbo -> vram efficiently.

Well, my long-term plan is to audit the code base once more and remove 
the limbo state from the SYSTEM domain.

E.g. instead of a SYSTEM BO without pages, you allocate a BO without a 
resource at all, which is now possible since bo->resource is a pointer.

This would still allow us to allocate "empty shell" BOs, but validating 
those BOs wouldn't cause a move; it would just allocate the resource for 
the first time.

The problem so far was just that we access bo->resource way too often 
without checking it.
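A minimal sketch of what that would mean for code touching a BO
(illustrative only; drv_bo_has_backing() is a hypothetical helper and
the fields assume the 5.15-era structures):

#include <drm/ttm/ttm_bo_api.h>
#include <drm/ttm/ttm_tt.h>
#include <drm/ttm/ttm_placement.h>

static bool drv_bo_has_backing(struct ttm_buffer_object *bo)
{
	/* "Empty shell" BO: no resource has been allocated yet, so a
	 * validation would allocate one for the first time instead of
	 * triggering a move. */
	if (!bo->resource)
		return false;

	/* A SYSTEM resource may still be unpopulated or swapped out. */
	if (bo->resource->mem_type == TTM_PL_SYSTEM)
		return bo->ttm && ttm_tt_is_populated(bo->ttm);

	return true;
}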

Regards,
Christian.

>
>> What we could do is to split the system domain into SYSTEM and SWAP 
>> and then say only the swap domain is under TTMs control.
>
> This probably needs some thought also on how to handle the limbo state?
>
> I could craft an RFC hiding these states with helpers that we could 
> compare against the SYSTEM + SWAP memory type.
>
> /Thomas
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> Thanks,
>>>
>>> /Thomas
>>>
>>>
>>>>
>>>> Polite and gentle ping on that one. Are we ok with the wording here?
>>>>
>>>> z
>>
Thomas Hellström Nov. 16, 2021, 8:33 a.m. UTC | #6
On 11/16/21 09:20, Christian König wrote:
> Am 16.11.21 um 08:43 schrieb Thomas Hellström:
>> On 11/16/21 08:19, Christian König wrote:
>>> Am 13.11.21 um 12:26 schrieb Thomas Hellström:
>>>> Hi, Zack,
>>>>
>>>> On 11/11/21 17:44, Zack Rusin wrote:
>>>>> On Wed, 2021-11-10 at 09:50 -0500, Zack Rusin wrote:
>>>>>> TTM takes full control over TTM_PL_SYSTEM placed buffers. This makes
>>>>>> driver internal usage of TTM_PL_SYSTEM prone to errors because it
>>>>>> requires the drivers to manually handle all interactions between TTM
>>>>>> which can swap out those buffers whenever it thinks it's the right
>>>>>> thing to do and driver.
>>>>>>
>>>>>> CPU buffers which need to be fenced and shared with accelerators
>>>>>> should
>>>>>> be placed in driver specific placements that can explicitly handle
>>>>>> CPU/accelerator buffer fencing.
>>>>>> Currently, apart, from things silently failing nothing is enforcing
>>>>>> that requirement which means that it's easy for drivers and new
>>>>>> developers to get this wrong. To avoid the confusion we can document
>>>>>> this requirement and clarify the solution.
>>>>>>
>>>>>> This came up during a discussion on dri-devel:
>>>>>> https://lore.kernel.org/dri-devel/232f45e9-8748-1243-09bf-56763e6668b3@amd.com
>>>>>>
>>>>
>>>> I took a slightly deeper look into this. I think we need to 
>>>> formalize this a bit more to understand pros and cons and what the 
>>>> restrictions are really all about. Anybody looking at the prevous 
>>>> discussion will mostly see arguments similar to "this is stupid and 
>>>> difficult" and "it has always been this way" which are not really 
>>>> constructive.
>>>>
>>>> First disregarding all accounting stuff, I think this all boils 
>>>> down to TTM_PL_SYSTEM having three distinct states:
>>>> 1) POPULATED
>>>> 2) LIMBO (Or whatever we want to call it. No pages present)
>>>> 3) SWAPPED.
>>>>
>>>> The ttm_bo_move_memcpy() helper understands these, and any 
>>>> standalone driver implementation of the move() callback _currently_ 
>>>> needs to understand these as well, unless using the 
>>>> ttm_bo_move_memcpy() helper.
>>>>
>>>> Now using a bounce domain to proxy SYSTEM means that the driver can 
>>>> forget about the SWAPPED state, it's automatically handled by the 
>>>> move setup code. However, another pitfall is LIMBO, in that if when 
>>>> you move from SYSTEM/LIMBO to your bounce domain, the BO will be 
>>>> populated. So any naive accelerated move() implementation creating 
>>>> a 1GB BO in fixed memory, like VRAM, will needlessly allocate and 
>>>> free 1GB of system memory in the process instead of just performing 
>>>> a clear operation. Looks like amdgpu suffers from this?
>>>>
>>>> I think what is really needed is either
>>>>
>>>> a) A TTM helper that helps move callback implementations resolve 
>>>> the issues populating system from LIMBO or SWAP, and then also 
>>>> formalize driver notification for swapping. At a minimum, I think 
>>>> the swap_notify() callback needs to be able to return a late error.
>>>>
>>>> b) Make LIMBO and SWAPPED distinct memory regions. (I think I'd 
>>>> vote for this without looking into it in detail).
>>>>
>>>> In both these cases, we should really make SYSTEM bindable by GPU, 
>>>> otherwise we'd just be trading one pitfall for another related 
>>>> without really resolving the root problem.
>>>>
>>>> As for fencing not being supported by SYSTEM, I'm not sure why we 
>>>> don't want this, because it would for example prohibit async 
>>>> ttm_move_memcpy(), and also, async unbinding of ttm_tt memory like 
>>>> MOB on vmgfx. (I think it's still sync).
>>>>
>>>> There might be an accounting issue related to this as well, but I 
>>>> guess Christian would need to chime in on this. If so, I think it 
>>>> needs to be well understood and documented (in TTM, not in AMD 
>>>> drivers).
>>>
>>> I think the problem goes deeper than what has been mentioned here so 
>>> far.
>>>
>>> Having fences attached to BOs in the system domain is probably ok, 
>>> but the key point is that the BOs in the system domain are under 
>>> TTMs control and should not be touched by the driver.
>>>
>>> What we have now is that TTMs internals like the allocation state of 
>>> BOs in system memory (the populated, limbo, swapped you mentioned 
>>> above) is leaking into the drivers and I think exactly that is the 
>>> part which doesn't work reliable here. You can of course can get 
>>> that working, but that requires knowledge of the internal state 
>>> which in my eyes was always illegal.
>>>
>> Well, I tend to agree to some extent, but then, like said above even 
>> disregarding swap will cause trouble with the limbo state, because 
>> the driver's move callback would need knowledge of that to implement 
>> moves limbo -> vram efficiently.
>
> Well my long term plan is to audit the code base once more and remove 
> the limbo state from the SYSTEM domain.
>
> E.g. instead of a SYSTEM BO without pages you allocate a BO without a 
> resource in general which is now possible since bo->resource is a 
> pointer.
>
> This would still allow us to allocate "empty shell" BOs. But a 
> validation of those BOs doesn't cause a move, but rather just 
> allocates the resource for the first time.
>
> The problem so far was just that we access bo->resource way to often 
> without checking it.

So the driver would then at least need to be aware of these empty-shell 
BOs without a resource in their move callbacks? (Again thinking of the 
move from empty shell -> VRAM.)

Thanks,

/Thomas
Christian König Nov. 16, 2021, 8:54 a.m. UTC | #7
On 16.11.21 at 09:33, Thomas Hellström wrote:
> On 11/16/21 09:20, Christian König wrote:
>> Am 16.11.21 um 08:43 schrieb Thomas Hellström:
>>> On 11/16/21 08:19, Christian König wrote:
>>>> Am 13.11.21 um 12:26 schrieb Thomas Hellström:
>>>>> Hi, Zack,
>>>>>
>>>>> On 11/11/21 17:44, Zack Rusin wrote:
>>>>>> On Wed, 2021-11-10 at 09:50 -0500, Zack Rusin wrote:
>>>>>>> TTM takes full control over TTM_PL_SYSTEM placed buffers. This 
>>>>>>> makes
>>>>>>> driver internal usage of TTM_PL_SYSTEM prone to errors because it
>>>>>>> requires the drivers to manually handle all interactions between 
>>>>>>> TTM
>>>>>>> which can swap out those buffers whenever it thinks it's the right
>>>>>>> thing to do and driver.
>>>>>>>
>>>>>>> CPU buffers which need to be fenced and shared with accelerators
>>>>>>> should
>>>>>>> be placed in driver specific placements that can explicitly handle
>>>>>>> CPU/accelerator buffer fencing.
>>>>>>> Currently, apart, from things silently failing nothing is enforcing
>>>>>>> that requirement which means that it's easy for drivers and new
>>>>>>> developers to get this wrong. To avoid the confusion we can 
>>>>>>> document
>>>>>>> this requirement and clarify the solution.
>>>>>>>
>>>>>>> This came up during a discussion on dri-devel:
>>>>>>> https://lore.kernel.org/dri-devel/232f45e9-8748-1243-09bf-56763e6668b3@amd.com
>>>>>>>
>>>>>
>>>>> I took a slightly deeper look into this. I think we need to 
>>>>> formalize this a bit more to understand pros and cons and what the 
>>>>> restrictions are really all about. Anybody looking at the prevous 
>>>>> discussion will mostly see arguments similar to "this is stupid 
>>>>> and difficult" and "it has always been this way" which are not 
>>>>> really constructive.
>>>>>
>>>>> First disregarding all accounting stuff, I think this all boils 
>>>>> down to TTM_PL_SYSTEM having three distinct states:
>>>>> 1) POPULATED
>>>>> 2) LIMBO (Or whatever we want to call it. No pages present)
>>>>> 3) SWAPPED.
>>>>>
>>>>> The ttm_bo_move_memcpy() helper understands these, and any 
>>>>> standalone driver implementation of the move() callback 
>>>>> _currently_ needs to understand these as well, unless using the 
>>>>> ttm_bo_move_memcpy() helper.
>>>>>
>>>>> Now using a bounce domain to proxy SYSTEM means that the driver 
>>>>> can forget about the SWAPPED state, it's automatically handled by 
>>>>> the move setup code. However, another pitfall is LIMBO, in that if 
>>>>> when you move from SYSTEM/LIMBO to your bounce domain, the BO will 
>>>>> be populated. So any naive accelerated move() implementation 
>>>>> creating a 1GB BO in fixed memory, like VRAM, will needlessly 
>>>>> allocate and free 1GB of system memory in the process instead of 
>>>>> just performing a clear operation. Looks like amdgpu suffers from 
>>>>> this?
>>>>>
>>>>> I think what is really needed is either
>>>>>
>>>>> a) A TTM helper that helps move callback implementations resolve 
>>>>> the issues populating system from LIMBO or SWAP, and then also 
>>>>> formalize driver notification for swapping. At a minimum, I think 
>>>>> the swap_notify() callback needs to be able to return a late error.
>>>>>
>>>>> b) Make LIMBO and SWAPPED distinct memory regions. (I think I'd 
>>>>> vote for this without looking into it in detail).
>>>>>
>>>>> In both these cases, we should really make SYSTEM bindable by GPU, 
>>>>> otherwise we'd just be trading one pitfall for another related 
>>>>> without really resolving the root problem.
>>>>>
>>>>> As for fencing not being supported by SYSTEM, I'm not sure why we 
>>>>> don't want this, because it would for example prohibit async 
>>>>> ttm_move_memcpy(), and also, async unbinding of ttm_tt memory like 
>>>>> MOB on vmgfx. (I think it's still sync).
>>>>>
>>>>> There might be an accounting issue related to this as well, but I 
>>>>> guess Christian would need to chime in on this. If so, I think it 
>>>>> needs to be well understood and documented (in TTM, not in AMD 
>>>>> drivers).
>>>>
>>>> I think the problem goes deeper than what has been mentioned here 
>>>> so far.
>>>>
>>>> Having fences attached to BOs in the system domain is probably ok, 
>>>> but the key point is that the BOs in the system domain are under 
>>>> TTMs control and should not be touched by the driver.
>>>>
>>>> What we have now is that TTMs internals like the allocation state 
>>>> of BOs in system memory (the populated, limbo, swapped you 
>>>> mentioned above) is leaking into the drivers and I think exactly 
>>>> that is the part which doesn't work reliable here. You can of 
>>>> course can get that working, but that requires knowledge of the 
>>>> internal state which in my eyes was always illegal.
>>>>
>>> Well, I tend to agree to some extent, but then, like said above even 
>>> disregarding swap will cause trouble with the limbo state, because 
>>> the driver's move callback would need knowledge of that to implement 
>>> moves limbo -> vram efficiently.
>>
>> Well my long term plan is to audit the code base once more and remove 
>> the limbo state from the SYSTEM domain.
>>
>> E.g. instead of a SYSTEM BO without pages you allocate a BO without a 
>> resource in general which is now possible since bo->resource is a 
>> pointer.
>>
>> This would still allow us to allocate "empty shell" BOs. But a 
>> validation of those BOs doesn't cause a move, but rather just 
>> allocates the resource for the first time.
>>
>> The problem so far was just that we access bo->resource way to often 
>> without checking it.
>
> So the driver would then at least need to be aware of these empty 
> shell bos without resource for their move callbacks? (Again thinking 
> of the move from empty shell -> VRAM).

My thinking goes more in the direction that, to the driver, this looks 
like a BO directly allocated in VRAM.

We could of course also make it a move, but offhand I don't see a reason 
for it.

Christian.

>
> Thanks,
>
> /Thomas
>
>
Thomas Hellström Nov. 16, 2021, 9 a.m. UTC | #8
On 11/16/21 09:54, Christian König wrote:
> Am 16.11.21 um 09:33 schrieb Thomas Hellström:
>> On 11/16/21 09:20, Christian König wrote:
>>> Am 16.11.21 um 08:43 schrieb Thomas Hellström:
>>>> On 11/16/21 08:19, Christian König wrote:
>>>>> Am 13.11.21 um 12:26 schrieb Thomas Hellström:
>>>>>> Hi, Zack,
>>>>>>
>>>>>> On 11/11/21 17:44, Zack Rusin wrote:
>>>>>>> On Wed, 2021-11-10 at 09:50 -0500, Zack Rusin wrote:
>>>>>>>> TTM takes full control over TTM_PL_SYSTEM placed buffers. This 
>>>>>>>> makes
>>>>>>>> driver internal usage of TTM_PL_SYSTEM prone to errors because it
>>>>>>>> requires the drivers to manually handle all interactions 
>>>>>>>> between TTM
>>>>>>>> which can swap out those buffers whenever it thinks it's the right
>>>>>>>> thing to do and driver.
>>>>>>>>
>>>>>>>> CPU buffers which need to be fenced and shared with accelerators
>>>>>>>> should
>>>>>>>> be placed in driver specific placements that can explicitly handle
>>>>>>>> CPU/accelerator buffer fencing.
>>>>>>>> Currently, apart, from things silently failing nothing is 
>>>>>>>> enforcing
>>>>>>>> that requirement which means that it's easy for drivers and new
>>>>>>>> developers to get this wrong. To avoid the confusion we can 
>>>>>>>> document
>>>>>>>> this requirement and clarify the solution.
>>>>>>>>
>>>>>>>> This came up during a discussion on dri-devel:
>>>>>>>> https://lore.kernel.org/dri-devel/232f45e9-8748-1243-09bf-56763e6668b3@amd.com
>>>>>>>>
>>>>>>
>>>>>> I took a slightly deeper look into this. I think we need to 
>>>>>> formalize this a bit more to understand pros and cons and what 
>>>>>> the restrictions are really all about. Anybody looking at the 
>>>>>> prevous discussion will mostly see arguments similar to "this is 
>>>>>> stupid and difficult" and "it has always been this way" which are 
>>>>>> not really constructive.
>>>>>>
>>>>>> First disregarding all accounting stuff, I think this all boils 
>>>>>> down to TTM_PL_SYSTEM having three distinct states:
>>>>>> 1) POPULATED
>>>>>> 2) LIMBO (Or whatever we want to call it. No pages present)
>>>>>> 3) SWAPPED.
>>>>>>
>>>>>> The ttm_bo_move_memcpy() helper understands these, and any 
>>>>>> standalone driver implementation of the move() callback 
>>>>>> _currently_ needs to understand these as well, unless using the 
>>>>>> ttm_bo_move_memcpy() helper.
>>>>>>
>>>>>> Now using a bounce domain to proxy SYSTEM means that the driver 
>>>>>> can forget about the SWAPPED state, it's automatically handled by 
>>>>>> the move setup code. However, another pitfall is LIMBO, in that 
>>>>>> if when you move from SYSTEM/LIMBO to your bounce domain, the BO 
>>>>>> will be populated. So any naive accelerated move() implementation 
>>>>>> creating a 1GB BO in fixed memory, like VRAM, will needlessly 
>>>>>> allocate and free 1GB of system memory in the process instead of 
>>>>>> just performing a clear operation. Looks like amdgpu suffers from 
>>>>>> this?
>>>>>>
>>>>>> I think what is really needed is either
>>>>>>
>>>>>> a) A TTM helper that helps move callback implementations resolve 
>>>>>> the issues populating system from LIMBO or SWAP, and then also 
>>>>>> formalize driver notification for swapping. At a minimum, I think 
>>>>>> the swap_notify() callback needs to be able to return a late error.
>>>>>>
>>>>>> b) Make LIMBO and SWAPPED distinct memory regions. (I think I'd 
>>>>>> vote for this without looking into it in detail).
>>>>>>
>>>>>> In both these cases, we should really make SYSTEM bindable by 
>>>>>> GPU, otherwise we'd just be trading one pitfall for another 
>>>>>> related without really resolving the root problem.
>>>>>>
>>>>>> As for fencing not being supported by SYSTEM, I'm not sure why we 
>>>>>> don't want this, because it would for example prohibit async 
>>>>>> ttm_move_memcpy(), and also, async unbinding of ttm_tt memory 
>>>>>> like MOB on vmgfx. (I think it's still sync).
>>>>>>
>>>>>> There might be an accounting issue related to this as well, but I 
>>>>>> guess Christian would need to chime in on this. If so, I think it 
>>>>>> needs to be well understood and documented (in TTM, not in AMD 
>>>>>> drivers).
>>>>>
>>>>> I think the problem goes deeper than what has been mentioned here 
>>>>> so far.
>>>>>
>>>>> Having fences attached to BOs in the system domain is probably ok, 
>>>>> but the key point is that the BOs in the system domain are under 
>>>>> TTMs control and should not be touched by the driver.
>>>>>
>>>>> What we have now is that TTMs internals like the allocation state 
>>>>> of BOs in system memory (the populated, limbo, swapped you 
>>>>> mentioned above) is leaking into the drivers and I think exactly 
>>>>> that is the part which doesn't work reliable here. You can of 
>>>>> course can get that working, but that requires knowledge of the 
>>>>> internal state which in my eyes was always illegal.
>>>>>
>>>> Well, I tend to agree to some extent, but then, like said above 
>>>> even disregarding swap will cause trouble with the limbo state, 
>>>> because the driver's move callback would need knowledge of that to 
>>>> implement moves limbo -> vram efficiently.
>>>
>>> Well my long term plan is to audit the code base once more and 
>>> remove the limbo state from the SYSTEM domain.
>>>
>>> E.g. instead of a SYSTEM BO without pages you allocate a BO without 
>>> a resource in general which is now possible since bo->resource is a 
>>> pointer.
>>>
>>> This would still allow us to allocate "empty shell" BOs. But a 
>>> validation of those BOs doesn't cause a move, but rather just 
>>> allocates the resource for the first time.
>>>
>>> The problem so far was just that we access bo->resource way to often 
>>> without checking it.
>>
>> So the driver would then at least need to be aware of these empty 
>> shell bos without resource for their move callbacks? (Again thinking 
>> of the move from empty shell -> VRAM).
>
> My thinking goes more into the direction that this looks like a BO 
> directly allocated in VRAM to the driver.
>
> We could of course also make it a move, but of hand I don't see a 
> reason for it.

As long as there is a way to provide accelerated VRAM clearing if 
necessary, the directly-allocated view sounds fine with me. (Looking at 
amdgpu, it looks like you clear on resource destroy? I'm not fully sure 
that would work with all i915 use cases.)

/Thomas


>
> Christian.
>
>>
>> Thanks,
>>
>> /Thomas
>>
>>
>
Christian König Nov. 16, 2021, 9:09 a.m. UTC | #9
On 16.11.21 at 10:00, Thomas Hellström wrote:
> On 11/16/21 09:54, Christian König wrote:
>> Am 16.11.21 um 09:33 schrieb Thomas Hellström:
>>> On 11/16/21 09:20, Christian König wrote:
>>>> Am 16.11.21 um 08:43 schrieb Thomas Hellström:
>>>>> On 11/16/21 08:19, Christian König wrote:
>>>>>> [SNIP]
>>>>
>>>> Well my long term plan is to audit the code base once more and 
>>>> remove the limbo state from the SYSTEM domain.
>>>>
>>>> E.g. instead of a SYSTEM BO without pages you allocate a BO without 
>>>> a resource in general which is now possible since bo->resource is a 
>>>> pointer.
>>>>
>>>> This would still allow us to allocate "empty shell" BOs. But a 
>>>> validation of those BOs doesn't cause a move, but rather just 
>>>> allocates the resource for the first time.
>>>>
>>>> The problem so far was just that we access bo->resource way to 
>>>> often without checking it.
>>>
>>> So the driver would then at least need to be aware of these empty 
>>> shell bos without resource for their move callbacks? (Again thinking 
>>> of the move from empty shell -> VRAM).
>>
>> My thinking goes more into the direction that this looks like a BO 
>> directly allocated in VRAM to the driver.
>>
>> We could of course also make it a move, but of hand I don't see a 
>> reason for it.
>
> As long as there is a way to provide accelerated VRAM clearing if 
> necessary the directly allocated view sounds fine with me. (Looking at 
> amdgpu it looks like you clear on resource destroy? I'm not fully sure 
> that would work with all i915 use cases)

In amdgpu we have both. The AMDGPU_GEM_CREATE_VRAM_CLEARED flag clears 
the memory on allocation, and the AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE 
flag makes sure it is wiped on release.

Wiping on release is sometimes faster because you don't need to wait for 
the clear to finish before you can first use it.
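For reference, this is roughly how userspace selects between the two
behaviours through the libdrm amdgpu wrapper (a sketch; the function
name is made up, "dev" is assumed to be an already initialized
amdgpu_device_handle, and the size is arbitrary):

#include <amdgpu.h>
#include <amdgpu_drm.h>

static int alloc_cleared_vram(amdgpu_device_handle dev,
			      amdgpu_bo_handle *bo)
{
	struct amdgpu_bo_alloc_request req = {
		.alloc_size = 64ull << 20,	/* 64 MiB, arbitrary */
		.phys_alignment = 4096,
		.preferred_heap = AMDGPU_GEM_DOMAIN_VRAM,
		/* Zeroed at allocation time. Using
		 * AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE instead defers
		 * the clear until the BO is released. */
		.flags = AMDGPU_GEM_CREATE_VRAM_CLEARED,
	};

	return amdgpu_bo_alloc(dev, &req, bo);
}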

But thinking about it once more it might be a good idea to have move 
callbacks for empty shells and freshly allocated BOs as well, so that 
the driver is informed about the state change of the BO.

Regards,
Christian.

>
> /Thomas
>
>
>>
>> Christian.
>>
>>>
>>> Thanks,
>>>
>>> /Thomas
>>>
>>>
>>
Daniel Vetter Nov. 16, 2021, 3:49 p.m. UTC | #10
On Tue, Nov 16, 2021 at 10:09:44AM +0100, Christian König wrote:
> Am 16.11.21 um 10:00 schrieb Thomas Hellström:
> > On 11/16/21 09:54, Christian König wrote:
> > > Am 16.11.21 um 09:33 schrieb Thomas Hellström:
> > > > On 11/16/21 09:20, Christian König wrote:
> > > > > Am 16.11.21 um 08:43 schrieb Thomas Hellström:
> > > > > > On 11/16/21 08:19, Christian König wrote:
> > > > > > > [SNIP]
> > > > > 
> > > > > Well my long term plan is to audit the code base once more
> > > > > and remove the limbo state from the SYSTEM domain.
> > > > > 
> > > > > E.g. instead of a SYSTEM BO without pages you allocate a BO
> > > > > without a resource in general which is now possible since
> > > > > bo->resource is a pointer.
> > > > > 
> > > > > This would still allow us to allocate "empty shell" BOs. But
> > > > > a validation of those BOs doesn't cause a move, but rather
> > > > > just allocates the resource for the first time.
> > > > > 
> > > > > The problem so far was just that we access bo->resource way
> > > > > to often without checking it.
> > > > 
> > > > So the driver would then at least need to be aware of these
> > > > empty shell bos without resource for their move callbacks?
> > > > (Again thinking of the move from empty shell -> VRAM).
> > > 
> > > My thinking goes more into the direction that this looks like a BO
> > > directly allocated in VRAM to the driver.
> > > 
> > > We could of course also make it a move, but of hand I don't see a
> > > reason for it.
> > 
> > As long as there is a way to provide accelerated VRAM clearing if
> > necessary the directly allocated view sounds fine with me. (Looking at
> > amdgpu it looks like you clear on resource destroy? I'm not fully sure
> > that would work with all i915 use cases)
> 
> In amdgpu we have both. The AMDGPU_GEM_CREATE_VRAM_CLEARED flag clears the
> memory on allocation and AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE flag makes
> sure it is wiped on release.

btw, how does this work when there was random previous stuff in that
memory region?

The more we do hmm-style integration, the less you can assume that
critical memory is cleared on free/release, because that's just not how
core mm/ works: there, pages are all cleared on alloc before userspace
sees them.

Entirely sidetrack discussion :-)
-Daniel

> Wiping on release is sometimes faster because you don't need to wait for the
> clear to finish before you can first use it.
> 
> But thinking about it once more it might be a good idea to have move
> callbacks for empty shells and freshly allocated BOs as well, so that the
> driver is informed about the state change of the BO.
> 
> Regards,
> Christian.
> 
> > 
> > /Thomas
> > 
> > 
> > > 
> > > Christian.
> > > 
> > > > 
> > > > Thanks,
> > > > 
> > > > /Thomas
> > > > 
> > > > 
> > > 
>
Zack Rusin Nov. 16, 2021, 3:53 p.m. UTC | #11
> On Nov 16, 2021, at 03:20, Christian König <christian.koenig@amd.com> wrote:
> 
> Am 16.11.21 um 08:43 schrieb Thomas Hellström:
>> On 11/16/21 08:19, Christian König wrote:
>>> Am 13.11.21 um 12:26 schrieb Thomas Hellström:
>>>> Hi, Zack,
>>>> 
>>>> On 11/11/21 17:44, Zack Rusin wrote:
>>>>> On Wed, 2021-11-10 at 09:50 -0500, Zack Rusin wrote:
>>>>>> TTM takes full control over TTM_PL_SYSTEM placed buffers. This makes
>>>>>> driver internal usage of TTM_PL_SYSTEM prone to errors because it
>>>>>> requires the drivers to manually handle all interactions between TTM
>>>>>> which can swap out those buffers whenever it thinks it's the right
>>>>>> thing to do and driver.
>>>>>> 
>>>>>> CPU buffers which need to be fenced and shared with accelerators
>>>>>> should
>>>>>> be placed in driver specific placements that can explicitly handle
>>>>>> CPU/accelerator buffer fencing.
>>>>>> Currently, apart, from things silently failing nothing is enforcing
>>>>>> that requirement which means that it's easy for drivers and new
>>>>>> developers to get this wrong. To avoid the confusion we can document
>>>>>> this requirement and clarify the solution.
>>>>>> 
>>>>>> This came up during a discussion on dri-devel:
>>>>>> https://lore.kernel.org/dri-devel/232f45e9-8748-1243-09bf-56763e6668b3@amd.com
>>>> 
>>>> I took a slightly deeper look into this. I think we need to formalize this a bit more to understand pros and cons and what the restrictions are really all about. Anybody looking at the prevous discussion will mostly see arguments similar to "this is stupid and difficult" and "it has always been this way" which are not really constructive.
>>>> 
>>>> First disregarding all accounting stuff, I think this all boils down to TTM_PL_SYSTEM having three distinct states:
>>>> 1) POPULATED
>>>> 2) LIMBO (Or whatever we want to call it. No pages present)
>>>> 3) SWAPPED.
>>>> 
>>>> The ttm_bo_move_memcpy() helper understands these, and any standalone driver implementation of the move() callback _currently_ needs to understand these as well, unless using the ttm_bo_move_memcpy() helper.
>>>> 
>>>> Now using a bounce domain to proxy SYSTEM means that the driver can forget about the SWAPPED state, it's automatically handled by the move setup code. However, another pitfall is LIMBO, in that if when you move from SYSTEM/LIMBO to your bounce domain, the BO will be populated. So any naive accelerated move() implementation creating a 1GB BO in fixed memory, like VRAM, will needlessly allocate and free 1GB of system memory in the process instead of just performing a clear operation. Looks like amdgpu suffers from this?
>>>> 
>>>> I think what is really needed is either
>>>> 
>>>> a) A TTM helper that helps move callback implementations resolve the issues populating system from LIMBO or SWAP, and then also formalize driver notification for swapping. At a minimum, I think the swap_notify() callback needs to be able to return a late error.
>>>> 
>>>> b) Make LIMBO and SWAPPED distinct memory regions. (I think I'd vote for this without looking into it in detail).
>>>> 
>>>> In both these cases, we should really make SYSTEM bindable by GPU, otherwise we'd just be trading one pitfall for another related without really resolving the root problem.
>>>> 
>>>> As for fencing not being supported by SYSTEM, I'm not sure why we don't want this, because it would for example prohibit async ttm_move_memcpy(), and also, async unbinding of ttm_tt memory like MOB on vmgfx. (I think it's still sync).
>>>> 
>>>> There might be an accounting issue related to this as well, but I guess Christian would need to chime in on this. If so, I think it needs to be well understood and documented (in TTM, not in AMD drivers).
>>> 
>>> I think the problem goes deeper than what has been mentioned here so far.
>>> 
>>> Having fences attached to BOs in the system domain is probably ok, but the key point is that the BOs in the system domain are under TTMs control and should not be touched by the driver.
>>> 
>>> What we have now is that TTMs internals like the allocation state of BOs in system memory (the populated, limbo, swapped you mentioned above) is leaking into the drivers and I think exactly that is the part which doesn't work reliable here. You can of course can get that working, but that requires knowledge of the internal state which in my eyes was always illegal.
>>> 
>> Well, I tend to agree to some extent, but then, like said above even disregarding swap will cause trouble with the limbo state, because the driver's move callback would need knowledge of that to implement moves limbo -> vram efficiently.
> 
> Well my long term plan is to audit the code base once more and remove the limbo state from the SYSTEM domain.
> 
> E.g. instead of a SYSTEM BO without pages you allocate a BO without a resource in general which is now possible since bo->resource is a pointer.
> 
> This would still allow us to allocate "empty shell" BOs. But a validation of those BOs doesn't cause a move, but rather just allocates the resource for the first time.
> 
> The problem so far was just that we access bo->resource way to often without checking it.

So all in all this would be a two-step process: 1) eliminate the “limbo” state, 2) split PL_SYSTEM into PL_SYSTEM, which drivers could use for really anything, and PL_SWAP, which would be under complete TTM control, yes? If that’s the case, it would certainly make my life a lot easier (because the drivers wouldn’t need to introduce/manage their own system placements) and I think it would make the code a lot easier to understand.

That’s a bit of work though, so the question of what happens until this lands comes to mind. Is introducing VMW_PL_SYSTEM and documenting that PL_SYSTEM is under complete TTM control (like this patch does) the way to go, or do we want to start working on the above immediately? Because I’d love to be able to unload vmwgfx without the kernel oopsing =)
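For context, a rough sketch of what such a driver-private system
placement could look like against the 5.15/5.16 TTM API (the drv_/DRV_
names are illustrative, not the actual vmwgfx implementation):

#include <linux/slab.h>
#include <drm/ttm/ttm_device.h>
#include <drm/ttm/ttm_placement.h>
#include <drm/ttm/ttm_resource.h>

#define DRV_PL_SYSTEM	(TTM_PL_PRIV + 0)

static int drv_sys_man_alloc(struct ttm_resource_manager *man,
			     struct ttm_buffer_object *bo,
			     const struct ttm_place *place,
			     struct ttm_resource **res)
{
	*res = kzalloc(sizeof(**res), GFP_KERNEL);
	if (!*res)
		return -ENOMEM;

	ttm_resource_init(bo, place, *res);
	return 0;
}

static void drv_sys_man_free(struct ttm_resource_manager *man,
			     struct ttm_resource *res)
{
	kfree(res);
}

static const struct ttm_resource_manager_func drv_sys_manager_func = {
	.alloc = drv_sys_man_alloc,
	.free = drv_sys_man_free,
};

int drv_sys_man_init(struct ttm_device *bdev)
{
	struct ttm_resource_manager *man = kzalloc(sizeof(*man), GFP_KERNEL);

	if (!man)
		return -ENOMEM;

	man->use_tt = true;	/* backed by a ttm_tt, like system memory */
	man->func = &drv_sys_manager_func;

	ttm_resource_manager_init(man, 0);
	ttm_set_driver_manager(bdev, DRV_PL_SYSTEM, man);
	ttm_resource_manager_set_used(man, true);
	return 0;
}

Driver-controlled buffers would then use DRV_PL_SYSTEM in their
ttm_place instead of TTM_PL_SYSTEM, keeping TTM's swap handling out of
the picture.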

z
Christian König Nov. 22, 2021, 2:15 p.m. UTC | #12
On 16.11.21 at 16:53, Zack Rusin wrote:
>> On Nov 16, 2021, at 03:20, Christian König <christian.koenig@amd.com> wrote:
>>
>> Am 16.11.21 um 08:43 schrieb Thomas Hellström:
>>> On 11/16/21 08:19, Christian König wrote:
>>>> Am 13.11.21 um 12:26 schrieb Thomas Hellström:
>>>>> Hi, Zack,
>>>>>
>>>>> On 11/11/21 17:44, Zack Rusin wrote:
>>>>>> On Wed, 2021-11-10 at 09:50 -0500, Zack Rusin wrote:
>>>>>>> TTM takes full control over TTM_PL_SYSTEM placed buffers. This makes
>>>>>>> driver internal usage of TTM_PL_SYSTEM prone to errors because it
>>>>>>> requires the drivers to manually handle all interactions between TTM
>>>>>>> which can swap out those buffers whenever it thinks it's the right
>>>>>>> thing to do and driver.
>>>>>>>
>>>>>>> CPU buffers which need to be fenced and shared with accelerators
>>>>>>> should
>>>>>>> be placed in driver specific placements that can explicitly handle
>>>>>>> CPU/accelerator buffer fencing.
>>>>>>> Currently, apart, from things silently failing nothing is enforcing
>>>>>>> that requirement which means that it's easy for drivers and new
>>>>>>> developers to get this wrong. To avoid the confusion we can document
>>>>>>> this requirement and clarify the solution.
>>>>>>>
>>>>>>> This came up during a discussion on dri-devel:
>>>>>>> https://lore.kernel.org/dri-devel/232f45e9-8748-1243-09bf-56763e6668b3@amd.com
>>>>> I took a slightly deeper look into this. I think we need to formalize this a bit more to understand pros and cons and what the restrictions are really all about. Anybody looking at the prevous discussion will mostly see arguments similar to "this is stupid and difficult" and "it has always been this way" which are not really constructive.
>>>>>
>>>>> First disregarding all accounting stuff, I think this all boils down to TTM_PL_SYSTEM having three distinct states:
>>>>> 1) POPULATED
>>>>> 2) LIMBO (Or whatever we want to call it. No pages present)
>>>>> 3) SWAPPED.
>>>>>
>>>>> The ttm_bo_move_memcpy() helper understands these, and any standalone driver implementation of the move() callback _currently_ needs to understand these as well, unless using the ttm_bo_move_memcpy() helper.
>>>>>
>>>>> Now using a bounce domain to proxy SYSTEM means that the driver can forget about the SWAPPED state, it's automatically handled by the move setup code. However, another pitfall is LIMBO, in that if when you move from SYSTEM/LIMBO to your bounce domain, the BO will be populated. So any naive accelerated move() implementation creating a 1GB BO in fixed memory, like VRAM, will needlessly allocate and free 1GB of system memory in the process instead of just performing a clear operation. Looks like amdgpu suffers from this?
>>>>>
>>>>> I think what is really needed is either
>>>>>
>>>>> a) A TTM helper that helps move callback implementations resolve the issues populating system from LIMBO or SWAP, and then also formalize driver notification for swapping. At a minimum, I think the swap_notify() callback needs to be able to return a late error.
>>>>>
>>>>> b) Make LIMBO and SWAPPED distinct memory regions. (I think I'd vote for this without looking into it in detail).
>>>>>
>>>>> In both these cases, we should really make SYSTEM bindable by GPU, otherwise we'd just be trading one pitfall for another related without really resolving the root problem.
>>>>>
>>>>> As for fencing not being supported by SYSTEM, I'm not sure why we don't want this, because it would for example prohibit async ttm_move_memcpy(), and also, async unbinding of ttm_tt memory like MOB on vmgfx. (I think it's still sync).
>>>>>
>>>>> There might be an accounting issue related to this as well, but I guess Christian would need to chime in on this. If so, I think it needs to be well understood and documented (in TTM, not in AMD drivers).
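
For illustration only: a minimal sketch of a driver move() callback that
sidesteps the LIMBO pitfall described above, written against the TTM
interfaces of this time frame. The my_drv_* names are hypothetical; only
the structure is meant to carry over.

#include <drm/ttm/ttm_bo_api.h>
#include <drm/ttm/ttm_bo_driver.h>
#include <drm/ttm/ttm_placement.h>

/* Hypothetical driver helper: clear the freshly allocated VRAM resource
 * (blit or memset through an aperture); the body is driver specific and
 * omitted here. */
static void my_drv_clear_vram(struct ttm_buffer_object *bo,
			      struct ttm_resource *new_mem)
{
}

static int my_drv_bo_move(struct ttm_buffer_object *bo, bool evict,
			  struct ttm_operation_ctx *ctx,
			  struct ttm_resource *new_mem,
			  struct ttm_place *hop)
{
	struct ttm_resource *old_mem = bo->resource;

	/*
	 * SYSTEM in the "limbo" state -> VRAM: no ttm_tt has been created,
	 * so there is nothing to copy. Clearing the destination and doing a
	 * null move avoids allocating (and then freeing) the whole buffer
	 * size in system memory just to bounce through it.
	 */
	if (old_mem->mem_type == TTM_PL_SYSTEM && !bo->ttm &&
	    new_mem->mem_type == TTM_PL_VRAM) {
		my_drv_clear_vram(bo, new_mem);
		ttm_bo_move_null(bo, new_mem);
		return 0;
	}

	/*
	 * Everything else falls back to the generic memcpy path, which
	 * already understands the POPULATED and SWAPPED states.
	 */
	return ttm_bo_move_memcpy(bo, ctx, new_mem);
}

A real implementation would of course also handle accelerated copies and
the multihop case via *hop; the point here is only where the LIMBO special
case fits.
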
>>>> I think the problem goes deeper than what has been mentioned here so far.
>>>>
>>>> Having fences attached to BOs in the system domain is probably ok, but the key point is that the BOs in the system domain are under TTMs control and should not be touched by the driver.
>>>>
>>>> What we have now is that TTMs internals like the allocation state of BOs in system memory (the populated, limbo, swapped you mentioned above) is leaking into the drivers and I think exactly that is the part which doesn't work reliably here. You can of course get that working, but that requires knowledge of the internal state, which in my eyes was always illegal.
>>>>
>>> Well, I tend to agree to some extent, but then, as said above, even disregarding swap will cause trouble with the limbo state, because the driver's move callback would need knowledge of it to implement limbo -> vram moves efficiently.
>> Well my long term plan is to audit the code base once more and remove the limbo state from the SYSTEM domain.
>>
>> E.g. instead of a SYSTEM BO without pages, you allocate a BO without a resource in general, which is now possible since bo->resource is a pointer.
>>
>> This would still allow us to allocate "empty shell" BOs. A validation of those BOs then doesn't cause a move, but rather just allocates the resource for the first time.
>>
>> The problem so far was just that we access bo->resource way too often without checking it.
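
To make that direction concrete, a small sketch of what the "empty shell"
model might look like from a driver's point of view, assuming bo->resource
is allowed to be NULL as proposed above. my_drv_pin_in_vram() and the
placement passed in are hypothetical.

#include <drm/ttm/ttm_bo_api.h>
#include <drm/ttm/ttm_placement.h>

static int my_drv_pin_in_vram(struct ttm_buffer_object *bo,
			      struct ttm_placement *vram_placement,
			      struct ttm_operation_ctx *ctx)
{
	/*
	 * An "empty shell" BO has no resource at all, so check the pointer
	 * before looking at mem_type. Validation then allocates the
	 * resource for the first time rather than being treated as a move.
	 */
	if (bo->resource && bo->resource->mem_type == TTM_PL_VRAM)
		return 0;

	return ttm_bo_validate(bo, vram_placement, ctx);
}

Until such an audit lands, bo->resource is not expected to be NULL, so the
check is purely forward looking.
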
> So all in all this would be a two-step process: 1) eliminate the “limbo” state, 2) split PL_SYSTEM into PL_SYSTEM, which drivers could use for really anything, and PL_SWAP, which would be under complete TTM control, yes? If that’s the case it would certainly make my life a lot easier (because the drivers wouldn’t need to introduce/manage their own system placements) and I think it would make the code a lot easier to understand.

We also have a couple of other use cases for this, yes.

> That’s a bit of work though, so the question of what happens until this lands comes to mind. Is introducing VMW_PL_SYSTEM and documenting that PL_SYSTEM is under complete TTM control (like this patch does) the way to go, or do we want to start working on the above immediately? Because I’d love to be able to unload vmwgfx without the kernel oopsing =)

I think documenting and getting into a clean state should be a
prerequisite for new development (even if the fix for the unload is
trivial).

Regards,
Christian.

>
> z
>
Zack Rusin Nov. 22, 2021, 3:05 p.m. UTC | #13
On Mon, 2021-11-22 at 15:15 +0100, Christian König wrote:
> Am 16.11.21 um 16:53 schrieb Zack Rusin:
> > > On Nov 16, 2021, at 03:20, Christian König
> > > <christian.koenig@amd.com> wrote:
> > > 
> > > Am 16.11.21 um 08:43 schrieb Thomas Hellström:
> > > > On 11/16/21 08:19, Christian König wrote:
> > > > > Am 13.11.21 um 12:26 schrieb Thomas Hellström:
> > > > > > Hi, Zack,
> > > > > > 
> > > > > > On 11/11/21 17:44, Zack Rusin wrote:
> > > > > > > On Wed, 2021-11-10 at 09:50 -0500, Zack Rusin wrote:
> > > > > > > > TTM takes full control over TTM_PL_SYSTEM placed
> > > > > > > > buffers. This makes
> > > > > > > > driver internal usage of TTM_PL_SYSTEM prone to errors
> > > > > > > > because it
> > > > > > > > requires the drivers to manually handle all
> > > > > > > > interactions between TTM
> > > > > > > > which can swap out those buffers whenever it thinks
> > > > > > > > it's the right
> > > > > > > > thing to do and driver.
> > > > > > > > 
> > > > > > > > CPU buffers which need to be fenced and shared with
> > > > > > > > accelerators
> > > > > > > > should
> > > > > > > > be placed in driver specific placements that can
> > > > > > > > explicitly handle
> > > > > > > > CPU/accelerator buffer fencing.
> > > > > > > > Currently, apart, from things silently failing nothing
> > > > > > > > is enforcing
> > > > > > > > that requirement which means that it's easy for drivers
> > > > > > > > and new
> > > > > > > > developers to get this wrong. To avoid the confusion we
> > > > > > > > can document
> > > > > > > > this requirement and clarify the solution.
> > > > > > > > 
> > > > > > > > This came up during a discussion on dri-devel:
> > > > > > > > https://lore.kernel.org/dri-devel/232f45e9-8748-1243-09bf-56763e6668b3@amd.com
> > > > > > I took a slightly deeper look into this. I think we need to
> > > > > > formalize this a bit more to understand pros and cons and
> > > > > > what the restrictions are really all about. Anybody looking
> > > > > > at the prevous discussion will mostly see arguments similar
> > > > > > to "this is stupid and difficult" and "it has always been
> > > > > > this way" which are not really constructive.
> > > > > > 
> > > > > > First disregarding all accounting stuff, I think this all
> > > > > > boils down to TTM_PL_SYSTEM having three distinct states:
> > > > > > 1) POPULATED
> > > > > > 2) LIMBO (Or whatever we want to call it. No pages present)
> > > > > > 3) SWAPPED.
> > > > > > 
> > > > > > The ttm_bo_move_memcpy() helper understands these, and any
> > > > > > standalone driver implementation of the move() callback
> > > > > > _currently_ needs to understand these as well, unless using
> > > > > > the ttm_bo_move_memcpy() helper.
> > > > > > 
> > > > > > Now using a bounce domain to proxy SYSTEM means that the
> > > > > > driver can forget about the SWAPPED state, it's
> > > > > > automatically handled by the move setup code. However,
> > > > > > another pitfall is LIMBO, in that when you move from
> > > > > > SYSTEM/LIMBO to your bounce domain, the BO will be
> > > > > > populated. So any naive accelerated move() implementation
> > > > > > creating a 1GB BO in fixed memory, like VRAM, will
> > > > > > needlessly allocate and free 1GB of system memory in the
> > > > > > process instead of just performing a clear operation. Looks
> > > > > > like amdgpu suffers from this?
> > > > > > 
> > > > > > I think what is really needed is either
> > > > > > 
> > > > > > a) A TTM helper that helps move callback implementations
> > > > > > resolve the issues populating system from LIMBO or SWAP,
> > > > > > and then also formalize driver notification for swapping.
> > > > > > At a minimum, I think the swap_notify() callback needs to
> > > > > > be able to return a late error.
> > > > > > 
> > > > > > b) Make LIMBO and SWAPPED distinct memory regions. (I think
> > > > > > I'd vote for this without looking into it in detail).
> > > > > > 
> > > > > > In both these cases, we should really make SYSTEM bindable
> > > > > > by GPU, otherwise we'd just be trading one pitfall for
> > > > > > a related one without really resolving the root problem.
> > > > > > 
> > > > > > As for fencing not being supported by SYSTEM, I'm not sure
> > > > > > why we don't want this, because it would for example
> > > > > > prohibit async ttm_move_memcpy(), and also, async unbinding
> > > > > > of ttm_tt memory like MOBs on vmwgfx. (I think it's still
> > > > > > sync).
> > > > > > 
> > > > > > There might be an accounting issue related to this as well,
> > > > > > but I guess Christian would need to chime in on this. If
> > > > > > so, I think it needs to be well understood and documented
> > > > > > (in TTM, not in AMD drivers).
> > > > > I think the problem goes deeper than what has been mentioned
> > > > > here so far.
> > > > > 
> > > > > Having fences attached to BOs in the system domain is
> > > > > probably ok, but the key point is that the BOs in the system
> > > > > domain are under TTMs control and should not be touched by
> > > > > the driver.
> > > > > 
> > > > > What we have now is that TTMs internals like the allocation
> > > > > state of BOs in system memory (the populated, limbo, swapped
> > > > > you mentioned above) is leaking into the drivers and I think
> > > > > exactly that is the part which doesn't work reliably here.
> > > > > You can of course get that working, but that requires
> > > > > knowledge of the internal state which in my eyes was always
> > > > > illegal.
> > > > > 
> > > > Well, I tend to agree to some extent, but then, like said above
> > > > even disregarding swap will cause trouble with the limbo state,
> > > > because the driver's move callback would need knowledge of that
> > > > to implement moves limbo -> vram efficiently.
> > > Well my long term plan is to audit the code base once more and
> > > remove the limbo state from the SYSTEM domain.
> > > 
> > > E.g. instead of a SYSTEM BO without pages you allocate a BO
> > > without a resource in general which is now possible since bo-
> > > >resource is a pointer.
> > > 
> > > This would still allow us to allocate "empty shell" BOs. But a
> > > validation of those BOs doesn't cause a move, but rather just
> > > allocates the resource for the first time.
> > > 
> > > The problem so far was just that we access bo->resource way too
> > > often without checking it.
> > So all in all this would be a two step process 1) Eliminate the
> > “limbo” state, 2) Split PL_SYSTEM into PL_SYSTEM, that drivers
> > could use for really anything, and PL_SWAP which would be under
> > complete TTM control, yes? If that’s the case that would certainly
> > make my life a lot easier (because the drivers wouldn’t need to
> > introduce/manage their own system placements) and I think it would
> > make the code a lot easier to understand.
> 
> We also have a couple of other use cases for this, yes.
> 
> > That’s a bit of work though so the question what happens until this
> > lands comes to mind. Is introduction of VMW_PL_SYSTEM and
> > documenting that PL_SYSTEM is under complete TTM control (like this
> > patch does) the way to go or do we want to start working on the
> > above immediately? Because I’d love to be able to unload vmwgfx
> > without kernel oopsing =)
> 
> I think documenting and getting into a clean state should be 
> a prerequisite for new development (even if the fix for the unload is 
> trivial).

That sounds great (both points). Could you double check the wording and
ack it (ideally both the docs and the VMW_PL_SYSTEM patch) if you think
it's ready to go in?

z
Christian König Nov. 24, 2021, 8:22 a.m. UTC | #14
Am 10.11.21 um 15:50 schrieb Zack Rusin:
> TTM takes full control over TTM_PL_SYSTEM placed buffers. This makes
> driver internal usage of TTM_PL_SYSTEM prone to errors because it
> requires the drivers to manually handle all interactions between TTM
> which can swap out those buffers whenever it thinks it's the right
> thing to do and driver.
>
> CPU buffers which need to be fenced and shared with accelerators should
> be placed in driver specific placements that can explicitly handle
> CPU/accelerator buffer fencing.
> Currently, apart, from things silently failing nothing is enforcing
> that requirement which means that it's easy for drivers and new
> developers to get this wrong. To avoid the confusion we can document
> this requirement and clarify the solution.
>
> This came up during a discussion on dri-devel:
> https://lore.kernel.org/dri-devel/232f45e9-8748-1243-09bf-56763e6668b3@amd.com
>
> Signed-off-by: Zack Rusin <zackr@vmware.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>   include/drm/ttm/ttm_placement.h | 11 +++++++++++
>   1 file changed, 11 insertions(+)
>
> diff --git a/include/drm/ttm/ttm_placement.h b/include/drm/ttm/ttm_placement.h
> index 76d1b9119a2b..8074d0f6cae5 100644
> --- a/include/drm/ttm/ttm_placement.h
> +++ b/include/drm/ttm/ttm_placement.h
> @@ -35,6 +35,17 @@
>   
>   /*
>    * Memory regions for data placement.
> + *
> + * Buffers placed in TTM_PL_SYSTEM are considered under TTMs control and can
> + * be swapped out whenever TTM thinks it is a good idea.
> + * In cases where drivers would like to use TTM_PL_SYSTEM as a valid
> + * placement, they need to be able to handle the issues that arise due to the
> + * above manually.
> + *
> + * For BOs which reside in system memory but for which the accelerator
> + * requires direct access (i.e. their usage needs to be synchronized
> + * between the CPU and accelerator via fences) a new, driver private
> + * placement that can handle such scenarios is a good idea.
>    */
>   
>   #define TTM_PL_SYSTEM           0
diff mbox series

Patch

diff --git a/include/drm/ttm/ttm_placement.h b/include/drm/ttm/ttm_placement.h
index 76d1b9119a2b..8074d0f6cae5 100644
--- a/include/drm/ttm/ttm_placement.h
+++ b/include/drm/ttm/ttm_placement.h
@@ -35,6 +35,17 @@ 
 
 /*
  * Memory regions for data placement.
+ *
+ * Buffers placed in TTM_PL_SYSTEM are considered under TTMs control and can
+ * be swapped out whenever TTM thinks it is a good idea.
+ * In cases where drivers would like to use TTM_PL_SYSTEM as a valid
+ * placement, they need to be able to handle the issues that arise due to the
+ * above manually.
+ *
+ * For BOs which reside in system memory but for which the accelerator
+ * requires direct access (i.e. their usage needs to be synchronized
+ * between the CPU and accelerator via fences) a new, driver private
+ * placement that can handle such scenarios is a good idea.
  */
 
 #define TTM_PL_SYSTEM           0
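
As a rough illustration of the driver-private placement the comment above
recommends (and of the VMW_PL_SYSTEM idea discussed in the thread), the
sketch below shows one possible shape. All MY_DRV_*/my_drv_* names are
made up, and using ttm_range_man_init() is only one option; a driver can
just as well register its own resource manager.

#include <drm/ttm/ttm_device.h>
#include <drm/ttm/ttm_placement.h>
#include <drm/ttm/ttm_range_manager.h>

/* Driver-private memory type, outside the core TTM_PL_* set. */
#define MY_DRV_PL_SYSTEM	(TTM_PL_PRIV + 0)

static const struct ttm_place my_drv_sys_place = {
	.fpfn = 0,
	.lpfn = 0,
	.mem_type = MY_DRV_PL_SYSTEM,
	.flags = 0,
};

/* Placement used instead of TTM_PL_SYSTEM for BOs that need to be fenced
 * and shared with the accelerator. */
static const struct ttm_placement my_drv_sys_placement = {
	.num_placement = 1,
	.placement = &my_drv_sys_place,
	.num_busy_placement = 1,
	.busy_placement = &my_drv_sys_place,
};

/*
 * Register a manager for the private type. use_tt = true means the BOs
 * are backed by ttm_tt pages just like system memory, but because the
 * type is not TTM_PL_SYSTEM, TTM will not swap the buffers out behind
 * the driver's back.
 */
static int my_drv_sys_man_init(struct ttm_device *bdev, unsigned long p_size)
{
	return ttm_range_man_init(bdev, MY_DRV_PL_SYSTEM, true, p_size);
}

BOs validated into my_drv_sys_placement stay CPU-accessible and can carry
fences, and TTM will not swap them out on its own, which is the property
the comment above asks drivers to provide for themselves.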