
[v3,05/12] drm/ttm: Expose ttm_tt_unpopulate for driver use

Message ID 1605936082-3099-6-git-send-email-andrey.grodzovsky@amd.com (mailing list archive)
State: New, archived
Series: RFC Support hot device unplug in amdgpu

Commit Message

Andrey Grodzovsky Nov. 21, 2020, 5:21 a.m. UTC
This is needed to drop IOMMU-backed pages on device unplug
before the device's IOMMU group is released.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/ttm/ttm_tt.c | 1 +
 1 file changed, 1 insertion(+)
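
With the symbol exported, a driver's hot-unplug path could walk its BOs and force their backing pages out before the IOMMU group goes away. A rough userspace model of that teardown logic (all names here are illustrative stand-ins, not the real TTM/amdgpu API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Illustrative stand-in for struct ttm_tt and its backing page array. */
struct model_tt {
	void **pages;
	unsigned long num_pages;
	bool populated;
};

/* Models ttm_tt_populate(): allocate the DMA/IOMMU-backed pages. */
static bool model_tt_populate(struct model_tt *ttm, unsigned long n)
{
	ttm->pages = calloc(n, sizeof(void *));
	if (!ttm->pages)
		return false;
	for (unsigned long i = 0; i < n; i++) {
		ttm->pages[i] = malloc(64);
		if (!ttm->pages[i])
			return false;
	}
	ttm->num_pages = n;
	ttm->populated = true;
	return true;
}

/* Models ttm_tt_unpopulate(): release the backing pages. */
static void model_tt_unpopulate(struct model_tt *ttm)
{
	if (!ttm->populated)
		return;
	for (unsigned long i = 0; i < ttm->num_pages; i++) {
		free(ttm->pages[i]);
		ttm->pages[i] = NULL;
	}
	ttm->populated = false;
}

/* Models the driver's unplug handler: drop every BO's backing pages
 * before the device's IOMMU group is released. */
static void model_device_unplug(struct model_tt **tts, int count)
{
	for (int i = 0; i < count; i++)
		model_tt_unpopulate(tts[i]);
}
```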

Comments

Andrey Grodzovsky Nov. 23, 2020, 8:05 p.m. UTC | #1
On 11/25/20 5:42 AM, Christian König wrote:
> Am 21.11.20 um 06:21 schrieb Andrey Grodzovsky:
>> It's needed to drop iommu backed pages on device unplug
>> before device's IOMMU group is released.
>
> It would be cleaner if we could do the whole handling in TTM. I also need to 
> double check what you are doing with this function.
>
> Christian.


Check the patch "drm/amdgpu: Register IOMMU topology notifier per device." to see
how I use it. I don't see why this should go into the TTM mid-layer - the stuff
I do inside is vendor specific, and I don't think TTM is explicitly aware of the
IOMMU. Do you mean you would prefer the IOMMU notifier to be registered from
within TTM, which then uses a hook to call into a vendor-specific handler?

Andrey


>
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/ttm/ttm_tt.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
>> index 1ccf1ef..29248a5 100644
>> --- a/drivers/gpu/drm/ttm/ttm_tt.c
>> +++ b/drivers/gpu/drm/ttm/ttm_tt.c
>> @@ -495,3 +495,4 @@ void ttm_tt_unpopulate(struct ttm_tt *ttm)
>>       else
>>           ttm_pool_unpopulate(ttm);
>>   }
>> +EXPORT_SYMBOL(ttm_tt_unpopulate);
>
Christian König Nov. 23, 2020, 8:20 p.m. UTC | #2
Am 23.11.20 um 21:05 schrieb Andrey Grodzovsky:
>
> On 11/25/20 5:42 AM, Christian König wrote:
>> Am 21.11.20 um 06:21 schrieb Andrey Grodzovsky:
>>> It's needed to drop iommu backed pages on device unplug
>>> before device's IOMMU group is released.
>>
>> It would be cleaner if we could do the whole handling in TTM. I also 
>> need to double check what you are doing with this function.
>>
>> Christian.
>
>
> Check patch "drm/amdgpu: Register IOMMU topology notifier per device." 
> to see
> how i use it. I don't see why this should go into TTM mid-layer - the 
> stuff I do inside
> is vendor specific and also I don't think TTM is explicitly aware of 
> IOMMU ?
> Do you mean you prefer the IOMMU notifier to be registered from within 
> TTM
> and then use a hook to call into vendor specific handler ?

No, that is really vendor specific.

What I meant is to have a function like ttm_resource_manager_evict_all() 
which you only need to call and all tt objects are unpopulated.
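
What Christian sketches here is a single device-level entry point, analogous to ttm_resource_manager_evict_all(), that walks every tt object the device knows about. A simplified userspace model of that control flow (the intrusive list and all names are illustrative, not the actual TTM implementation):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal intrusive list, in the spirit of the kernel's list_head. */
struct node { struct node *next; };

struct tt_obj {
	bool populated;
	struct node link;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Models the proposed "unpopulate everything" entry point: one call
 * walks all tt objects registered with the device. Returns how many
 * objects were actually unpopulated. */
static int device_unpopulate_all(struct node *head)
{
	int n = 0;

	for (struct node *it = head; it; it = it->next) {
		struct tt_obj *tt = container_of(it, struct tt_obj, link);

		if (tt->populated) {
			tt->populated = false;
			n++;
		}
	}
	return n;
}
```

The appeal of this shape is that the driver needs no BO bookkeeping of its own; a second call is a harmless no-op.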

Give me a day or two to look into this.

Christian.

> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Andrey Grodzovsky Nov. 23, 2020, 8:38 p.m. UTC | #3
On 11/23/20 3:20 PM, Christian König wrote:
> Am 23.11.20 um 21:05 schrieb Andrey Grodzovsky:
>>
>> On 11/25/20 5:42 AM, Christian König wrote:
>>> Am 21.11.20 um 06:21 schrieb Andrey Grodzovsky:
>>>> It's needed to drop iommu backed pages on device unplug
>>>> before device's IOMMU group is released.
>>>
>>> It would be cleaner if we could do the whole handling in TTM. I also need to 
>>> double check what you are doing with this function.
>>>
>>> Christian.
>>
>>
>> Check patch "drm/amdgpu: Register IOMMU topology notifier per device." to see
>> how i use it. I don't see why this should go into TTM mid-layer - the stuff I 
>> do inside
>> is vendor specific and also I don't think TTM is explicitly aware of IOMMU ?
>> Do you mean you prefer the IOMMU notifier to be registered from within TTM
>> and then use a hook to call into vendor specific handler ?
>
> No, that is really vendor specific.
>
> What I meant is to have a function like ttm_resource_manager_evict_all() which 
> you only need to call and all tt objects are unpopulated.


So instead of the BO list I create and later iterate in amdgpu from the IOMMU
patch, you just want to do it within TTM with a single function? That makes
much more sense.

Andrey


Christian König Nov. 23, 2020, 8:41 p.m. UTC | #4
Am 23.11.20 um 21:38 schrieb Andrey Grodzovsky:
>
> On 11/23/20 3:20 PM, Christian König wrote:
>> Am 23.11.20 um 21:05 schrieb Andrey Grodzovsky:
>>>
>>> On 11/25/20 5:42 AM, Christian König wrote:
>>>> Am 21.11.20 um 06:21 schrieb Andrey Grodzovsky:
>>>>> It's needed to drop iommu backed pages on device unplug
>>>>> before device's IOMMU group is released.
>>>>
>>>> It would be cleaner if we could do the whole handling in TTM. I 
>>>> also need to double check what you are doing with this function.
>>>>
>>>> Christian.
>>>
>>>
>>> Check patch "drm/amdgpu: Register IOMMU topology notifier per 
>>> device." to see
>>> how i use it. I don't see why this should go into TTM mid-layer - 
>>> the stuff I do inside
>>> is vendor specific and also I don't think TTM is explicitly aware of 
>>> IOMMU ?
>>> Do you mean you prefer the IOMMU notifier to be registered from 
>>> within TTM
>>> and then use a hook to call into vendor specific handler ?
>>
>> No, that is really vendor specific.
>>
>> What I meant is to have a function like 
>> ttm_resource_manager_evict_all() which you only need to call and all 
>> tt objects are unpopulated.
>
>
> So instead of this BO list i create and later iterate in amdgpu from 
> the IOMMU patch you just want to do it within
> TTM with a single function ? Makes much more sense.

Yes, exactly.

The list_empty() checks we have in TTM for the LRU are actually not the 
best idea; we should now check the pin_count instead. This way we could 
also have a list of the pinned BOs in TTM.
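
The distinction drawn here: a pinned BO drops off the LRU, so testing list_empty() on its LRU link conflates "pinned" with other states, while pin_count expresses the intent directly. A toy illustration of the idea (not TTM's actual struct layout or locking):

```c
#include <assert.h>
#include <stdbool.h>

struct toy_bo {
	int pin_count;	/* explicit pin state */
	bool on_lru;	/* whether the BO currently sits on an LRU list */
};

static void toy_bo_pin(struct toy_bo *bo)
{
	if (bo->pin_count++ == 0)
		bo->on_lru = false;	/* pinning removes the BO from the LRU */
}

static void toy_bo_unpin(struct toy_bo *bo)
{
	if (--bo->pin_count == 0)
		bo->on_lru = true;	/* last unpin puts it back */
}

/* Preferred check: ask the pin_count, not the list linkage. */
static bool toy_bo_is_pinned(const struct toy_bo *bo)
{
	return bo->pin_count > 0;
}
```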

BTW: Have you thought about what happens when we unpopulate a BO while 
we still try to use a kernel mapping for it? That could have unforeseen 
consequences.

Christian.

Andrey Grodzovsky Nov. 23, 2020, 9:08 p.m. UTC | #5
On 11/23/20 3:41 PM, Christian König wrote:
> Am 23.11.20 um 21:38 schrieb Andrey Grodzovsky:
>>
>> On 11/23/20 3:20 PM, Christian König wrote:
>>> Am 23.11.20 um 21:05 schrieb Andrey Grodzovsky:
>>>>
>>>> On 11/25/20 5:42 AM, Christian König wrote:
>>>>> Am 21.11.20 um 06:21 schrieb Andrey Grodzovsky:
>>>>>> It's needed to drop iommu backed pages on device unplug
>>>>>> before device's IOMMU group is released.
>>>>>
>>>>> It would be cleaner if we could do the whole handling in TTM. I also need 
>>>>> to double check what you are doing with this function.
>>>>>
>>>>> Christian.
>>>>
>>>>
>>>> Check patch "drm/amdgpu: Register IOMMU topology notifier per device." to see
>>>> how i use it. I don't see why this should go into TTM mid-layer - the stuff 
>>>> I do inside
>>>> is vendor specific and also I don't think TTM is explicitly aware of IOMMU ?
>>>> Do you mean you prefer the IOMMU notifier to be registered from within TTM
>>>> and then use a hook to call into vendor specific handler ?
>>>
>>> No, that is really vendor specific.
>>>
>>> What I meant is to have a function like ttm_resource_manager_evict_all() 
>>> which you only need to call and all tt objects are unpopulated.
>>
>>
>> So instead of this BO list i create and later iterate in amdgpu from the 
>> IOMMU patch you just want to do it within
>> TTM with a single function ? Makes much more sense.
>
> Yes, exactly.
>
> The list_empty() checks we have in TTM for the LRU are actually not the best 
> idea, we should now check the pin_count instead. This way we could also have a 
> list of the pinned BOs in TTM.


So from my IOMMU topology handler I will iterate the TTM LRU for the unpinned 
BOs and use this new function for the pinned ones?
It's probably a good idea to combine both iterations into this new function to 
cover all the BOs allocated on the device.


>
> BTW: Have you thought about what happens when we unpopulate a BO while we 
> still try to use a kernel mapping for it? That could have unforeseen 
> consequences.


Are you asking what happens to kmap- or vmap-style mapped CPU accesses once we 
drop all the DMA backing pages for a particular BO? Because for user mappings
(mmap) we took care of this with the dummy page reroute, but indeed nothing was 
done for in-kernel CPU mappings.
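
The dummy-page reroute mentioned for mmap works roughly like this: after unplug, the CPU fault handler installs a harmless placeholder page in place of the (now gone) device-backed page, so userspace accesses still land on valid memory. A minimal userspace model of that decision, not the real vm_fault path:

```c
#include <assert.h>
#include <stdbool.h>

static char real_page[4096];	/* stands in for a DMA-backed page */
static char dummy_page[4096];	/* harmless placeholder page */

struct mapping {
	bool device_unplugged;
};

/* Models the fault handler: while the device is alive, faults resolve
 * to the real backing page; after unplug they are rerouted to the
 * dummy page so userspace never touches freed memory. */
static void *fault_resolve(struct mapping *m)
{
	return m->device_unplugged ? (void *)dummy_page : (void *)real_page;
}
```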

Andrey


Christian König Nov. 24, 2020, 7:41 a.m. UTC | #6
Am 23.11.20 um 22:08 schrieb Andrey Grodzovsky:
>
> On 11/23/20 3:41 PM, Christian König wrote:
>> Am 23.11.20 um 21:38 schrieb Andrey Grodzovsky:
>>>
>>> On 11/23/20 3:20 PM, Christian König wrote:
>>>> Am 23.11.20 um 21:05 schrieb Andrey Grodzovsky:
>>>>>
>>>>> On 11/25/20 5:42 AM, Christian König wrote:
>>>>>> Am 21.11.20 um 06:21 schrieb Andrey Grodzovsky:
>>>>>>> It's needed to drop iommu backed pages on device unplug
>>>>>>> before device's IOMMU group is released.
>>>>>>
>>>>>> It would be cleaner if we could do the whole handling in TTM. I 
>>>>>> also need to double check what you are doing with this function.
>>>>>>
>>>>>> Christian.
>>>>>
>>>>>
>>>>> Check patch "drm/amdgpu: Register IOMMU topology notifier per 
>>>>> device." to see
>>>>> how i use it. I don't see why this should go into TTM mid-layer - 
>>>>> the stuff I do inside
>>>>> is vendor specific and also I don't think TTM is explicitly aware 
>>>>> of IOMMU ?
>>>>> Do you mean you prefer the IOMMU notifier to be registered from 
>>>>> within TTM
>>>>> and then use a hook to call into vendor specific handler ?
>>>>
>>>> No, that is really vendor specific.
>>>>
>>>> What I meant is to have a function like 
>>>> ttm_resource_manager_evict_all() which you only need to call and 
>>>> all tt objects are unpopulated.
>>>
>>>
>>> So instead of this BO list i create and later iterate in amdgpu from 
>>> the IOMMU patch you just want to do it within
>>> TTM with a single function ? Makes much more sense.
>>
>> Yes, exactly.
>>
>> The list_empty() checks we have in TTM for the LRU are actually not 
>> the best idea, we should now check the pin_count instead. This way we 
>> could also have a list of the pinned BOs in TTM.
>
>
> So from my IOMMU topology handler I will iterate the TTM LRU for the 
> unpinned BOs and this new function for the pinned ones  ?
> It's probably a good idea to combine both iterations into this new 
> function to cover all the BOs allocated on the device.

Yes, that's what I had in my mind as well.

>
>
>>
>> BTW: Have you thought about what happens when we unpopulate a BO 
>> while we still try to use a kernel mapping for it? That could have 
>> unforeseen consequences.
>
>
> Are you asking what happens to kmap or vmap style mapped CPU accesses 
> once we drop all the DMA backing pages for a particular BO ? Because 
> for user mappings
> (mmap) we took care of this with dummy page reroute but indeed nothing 
> was done for in kernel CPU mappings.

Yes exactly that.

In other words what happens if we free the ring buffer while the kernel 
still writes to it?

Christian.

Andrey Grodzovsky Nov. 24, 2020, 4:22 p.m. UTC | #7
On 11/24/20 2:41 AM, Christian König wrote:
> Am 23.11.20 um 22:08 schrieb Andrey Grodzovsky:
>>
>> On 11/23/20 3:41 PM, Christian König wrote:
>>> Am 23.11.20 um 21:38 schrieb Andrey Grodzovsky:
>>>>
>>>> On 11/23/20 3:20 PM, Christian König wrote:
>>>>> Am 23.11.20 um 21:05 schrieb Andrey Grodzovsky:
>>>>>>
>>>>>> On 11/25/20 5:42 AM, Christian König wrote:
>>>>>>> Am 21.11.20 um 06:21 schrieb Andrey Grodzovsky:
>>>>>>>> It's needed to drop iommu backed pages on device unplug
>>>>>>>> before device's IOMMU group is released.
>>>>>>>
>>>>>>> It would be cleaner if we could do the whole handling in TTM. I also 
>>>>>>> need to double check what you are doing with this function.
>>>>>>>
>>>>>>> Christian.
>>>>>>
>>>>>>
>>>>>> Check patch "drm/amdgpu: Register IOMMU topology notifier per device." to 
>>>>>> see
>>>>>> how i use it. I don't see why this should go into TTM mid-layer - the 
>>>>>> stuff I do inside
>>>>>> is vendor specific and also I don't think TTM is explicitly aware of IOMMU ?
>>>>>> Do you mean you prefer the IOMMU notifier to be registered from within TTM
>>>>>> and then use a hook to call into vendor specific handler ?
>>>>>
>>>>> No, that is really vendor specific.
>>>>>
>>>>> What I meant is to have a function like ttm_resource_manager_evict_all() 
>>>>> which you only need to call and all tt objects are unpopulated.
>>>>
>>>>
>>>> So instead of this BO list i create and later iterate in amdgpu from the 
>>>> IOMMU patch you just want to do it within
>>>> TTM with a single function ? Makes much more sense.
>>>
>>> Yes, exactly.
>>>
>>> The list_empty() checks we have in TTM for the LRU are actually not the best 
>>> idea, we should now check the pin_count instead. This way we could also have 
>>> a list of the pinned BOs in TTM.
>>
>>
>> So from my IOMMU topology handler I will iterate the TTM LRU for the unpinned 
>> BOs and this new function for the pinned ones  ?
>> It's probably a good idea to combine both iterations into this new function 
>> to cover all the BOs allocated on the device.
>
> Yes, that's what I had in my mind as well.
>
>>
>>
>>>
>>> BTW: Have you thought about what happens when we unpopulate a BO while we 
>>> still try to use a kernel mapping for it? That could have unforeseen 
>>> consequences.
>>
>>
>> Are you asking what happens to kmap or vmap style mapped CPU accesses once we 
>> drop all the DMA backing pages for a particular BO ? Because for user mappings
>> (mmap) we took care of this with dummy page reroute but indeed nothing was 
>> done for in kernel CPU mappings.
>
> Yes exactly that.
>
> In other words what happens if we free the ring buffer while the kernel still 
> writes to it?
>
> Christian.


While we can't control user application accesses to the mapped buffers 
explicitly (hence the page fault rerouting), I am thinking that in this case we 
may be able to sprinkle drm_dev_enter/exit in any such sensitive place where we 
might CPU-access a DMA buffer from the kernel. Things like CPU page table 
updates, ring buffer accesses and FW memcpy? Are there other places?
Another point is that at this stage the driver shouldn't access any such 
buffers, as we are in the process of finishing the device. AFAIK there is no 
page fault mechanism for kernel mappings, so I don't think there is anything 
else to do?
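
The drm_dev_enter()/drm_dev_exit() pattern referred to here brackets each sensitive kernel-side access so it becomes a no-op once the device is unplugged. The real helpers use SRCU and an index cookie; this is only a simplified model of the control flow:

```c
#include <assert.h>
#include <stdbool.h>

struct toy_drm_device {
	bool unplugged;
};

/* Models drm_dev_enter(): returns true if the device is still present
 * and the critical section may proceed. The real helper also hands
 * back an index for the matching drm_dev_exit(). */
static bool toy_dev_enter(struct toy_drm_device *dev)
{
	return !dev->unplugged;
}

static int ring_write_count;

/* A sensitive spot (e.g. a ring buffer write) wrapped in the guard. */
static bool toy_ring_write(struct toy_drm_device *dev)
{
	if (!toy_dev_enter(dev))
		return false;	/* device gone: skip the hardware access */
	ring_write_count++;	/* ...would touch device memory here... */
	/* toy_dev_exit() would close the section in the real pattern */
	return true;
}
```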

Andrey


Christian König Nov. 24, 2020, 4:44 p.m. UTC | #8
Am 24.11.20 um 17:22 schrieb Andrey Grodzovsky:
>
> On 11/24/20 2:41 AM, Christian König wrote:
>> Am 23.11.20 um 22:08 schrieb Andrey Grodzovsky:
>>>
>>> On 11/23/20 3:41 PM, Christian König wrote:
>>>> Am 23.11.20 um 21:38 schrieb Andrey Grodzovsky:
>>>>>
>>>>> On 11/23/20 3:20 PM, Christian König wrote:
>>>>>> Am 23.11.20 um 21:05 schrieb Andrey Grodzovsky:
>>>>>>>
>>>>>>> On 11/25/20 5:42 AM, Christian König wrote:
>>>>>>>> Am 21.11.20 um 06:21 schrieb Andrey Grodzovsky:
>>>>>>>>> It's needed to drop iommu backed pages on device unplug
>>>>>>>>> before device's IOMMU group is released.
>>>>>>>>
>>>>>>>> It would be cleaner if we could do the whole handling in TTM. I 
>>>>>>>> also need to double check what you are doing with this function.
>>>>>>>>
>>>>>>>> Christian.
>>>>>>>
>>>>>>>
>>>>>>> Check patch "drm/amdgpu: Register IOMMU topology notifier per 
>>>>>>> device." to see
>>>>>>> how i use it. I don't see why this should go into TTM mid-layer 
>>>>>>> - the stuff I do inside
>>>>>>> is vendor specific and also I don't think TTM is explicitly 
>>>>>>> aware of IOMMU ?
>>>>>>> Do you mean you prefer the IOMMU notifier to be registered from 
>>>>>>> within TTM
>>>>>>> and then use a hook to call into vendor specific handler ?
>>>>>>
>>>>>> No, that is really vendor specific.
>>>>>>
>>>>>> What I meant is to have a function like 
>>>>>> ttm_resource_manager_evict_all() which you only need to call and 
>>>>>> all tt objects are unpopulated.
>>>>>
>>>>>
>>>>> So instead of this BO list i create and later iterate in amdgpu 
>>>>> from the IOMMU patch you just want to do it within
>>>>> TTM with a single function ? Makes much more sense.
>>>>
>>>> Yes, exactly.
>>>>
>>>> The list_empty() checks we have in TTM for the LRU are actually not 
>>>> the best idea, we should now check the pin_count instead. This way 
>>>> we could also have a list of the pinned BOs in TTM.
>>>
>>>
>>> So from my IOMMU topology handler I will iterate the TTM LRU for the 
>>> unpinned BOs and this new function for the pinned ones  ?
>>> It's probably a good idea to combine both iterations into this new 
>>> function to cover all the BOs allocated on the device.
>>
>> Yes, that's what I had in my mind as well.
>>
>>>
>>>
>>>>
>>>> BTW: Have you thought about what happens when we unpopulate a BO 
>>>> while we still try to use a kernel mapping for it? That could have 
>>>> unforeseen consequences.
>>>
>>>
>>> Are you asking what happens to kmap or vmap style mapped CPU 
>>> accesses once we drop all the DMA backing pages for a particular BO 
>>> ? Because for user mappings
>>> (mmap) we took care of this with dummy page reroute but indeed 
>>> nothing was done for in kernel CPU mappings.
>>
>> Yes exactly that.
>>
>> In other words what happens if we free the ring buffer while the 
>> kernel still writes to it?
>>
>> Christian.
>
>
> While we can't control user application accesses to the mapped buffers 
> explicitly and hence we use page fault rerouting
> I am thinking that in this  case we may be able to sprinkle 
> drm_dev_enter/exit in any such sensitive place were we might
> CPU access a DMA buffer from the kernel ?

Yes, I fear we are going to need that.

> Things like CPU page table updates, ring buffer accesses and FW memcpy 
> ? Is there other places ?

Puh, good question. I have no idea.

> Another point is that at this point the driver shouldn't access any 
> such buffers as we are at the process finishing the device.
> AFAIK there is no page fault mechanism for kernel mappings so I don't 
> think there is anything else to do ?

Well there is a page fault handler for kernel mappings, but that one 
just prints the stack trace into the system log and calls BUG(); :)

Long story short we need to avoid any access to released pages after 
unplug. No matter if it's from the kernel or userspace.

Regards,
Christian.

Daniel Vetter Nov. 25, 2020, 10:40 a.m. UTC | #9
On Tue, Nov 24, 2020 at 05:44:07PM +0100, Christian König wrote:
> Am 24.11.20 um 17:22 schrieb Andrey Grodzovsky:
> > 
> > On 11/24/20 2:41 AM, Christian König wrote:
> > > Am 23.11.20 um 22:08 schrieb Andrey Grodzovsky:
> > > > 
> > > > On 11/23/20 3:41 PM, Christian König wrote:
> > > > > Am 23.11.20 um 21:38 schrieb Andrey Grodzovsky:
> > > > > > 
> > > > > > On 11/23/20 3:20 PM, Christian König wrote:
> > > > > > > Am 23.11.20 um 21:05 schrieb Andrey Grodzovsky:
> > > > > > > > 
> > > > > > > > On 11/25/20 5:42 AM, Christian König wrote:
> > > > > > > > > Am 21.11.20 um 06:21 schrieb Andrey Grodzovsky:
> > > > > > > > > > It's needed to drop iommu backed pages on device unplug
> > > > > > > > > > before device's IOMMU group is released.
> > > > > > > > > 
> > > > > > > > > It would be cleaner if we could do the whole
> > > > > > > > > handling in TTM. I also need to double check
> > > > > > > > > what you are doing with this function.
> > > > > > > > > 
> > > > > > > > > Christian.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Check patch "drm/amdgpu: Register IOMMU topology
> > > > > > > > notifier per device." to see
> > > > > > > > how i use it. I don't see why this should go
> > > > > > > > into TTM mid-layer - the stuff I do inside
> > > > > > > > is vendor specific and also I don't think TTM is
> > > > > > > > explicitly aware of IOMMU ?
> > > > > > > > Do you mean you prefer the IOMMU notifier to be
> > > > > > > > registered from within TTM
> > > > > > > > and then use a hook to call into vendor specific handler ?
> > > > > > > 
> > > > > > > No, that is really vendor specific.
> > > > > > > 
> > > > > > > What I meant is to have a function like
> > > > > > > ttm_resource_manager_evict_all() which you only need
> > > > > > > to call and all tt objects are unpopulated.
> > > > > > 
> > > > > > 
> > > > > > So instead of this BO list i create and later iterate in
> > > > > > amdgpu from the IOMMU patch you just want to do it
> > > > > > within
> > > > > > TTM with a single function ? Makes much more sense.
> > > > > 
> > > > > Yes, exactly.
> > > > > 
> > > > > The list_empty() checks we have in TTM for the LRU are
> > > > > actually not the best idea, we should now check the
> > > > > pin_count instead. This way we could also have a list of the
> > > > > pinned BOs in TTM.
> > > > 
> > > > 
> > > > So from my IOMMU topology handler I will iterate the TTM LRU for
> > > > the unpinned BOs and this new function for the pinned ones  ?
> > > > It's probably a good idea to combine both iterations into this
> > > > new function to cover all the BOs allocated on the device.
> > > 
> > > Yes, that's what I had in my mind as well.
> > > 
> > > > 
> > > > 
> > > > > 
> > > > > BTW: Have you thought about what happens when we unpopulate
> > > > > a BO while we still try to use a kernel mapping for it? That
> > > > > could have unforeseen consequences.
> > > > 
> > > > 
> > > > Are you asking what happens to kmap or vmap style mapped CPU
> > > > accesses once we drop all the DMA backing pages for a particular
> > > > BO ? Because for user mappings
> > > > (mmap) we took care of this with dummy page reroute but indeed
> > > > nothing was done for in kernel CPU mappings.
> > > 
> > > Yes exactly that.
> > > 
> > > In other words what happens if we free the ring buffer while the
> > > kernel still writes to it?
> > > 
> > > Christian.
> > 
> > 
> > While we can't control user application accesses to the mapped buffers
> > explicitly and hence we use page fault rerouting
> > I am thinking that in this  case we may be able to sprinkle
> > drm_dev_enter/exit in any such sensitive place were we might
> > CPU access a DMA buffer from the kernel ?
> 
> Yes, I fear we are going to need that.

Uh ... the problem is that dma_buf_vmap mappings are usually permanent things. 
Maybe we could stuff this into begin/end_cpu_access (but only for the kernel, 
so a bit tricky)?

btw the other issue with dma-buf (and even worse with dma_fence) is
refcounting of the underlying drm_device. I'd expect that all your
callbacks go boom if the dma_buf outlives your drm_device. That part isn't
yet solved in your series here.
-Daniel
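
Daniel's refcounting point: a dma_buf exported by the device can be held by an importer long after unplug, so it must keep the drm_device alive for as long as it exists, otherwise its callbacks dereference freed state. Modeled here with a plain refcount (the real code uses drm_dev_get()/drm_dev_put(); the structs are illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_device {
	int refcount;
	bool freed;
};

static void toy_dev_get(struct toy_device *d)
{
	d->refcount++;
}

static void toy_dev_put(struct toy_device *d)
{
	if (--d->refcount == 0)
		d->freed = true;	/* last reference: device state goes away */
}

struct toy_dmabuf {
	struct toy_device *dev;	/* exporter; the dma_buf holds a reference */
};

static void toy_dmabuf_init(struct toy_dmabuf *buf, struct toy_device *d)
{
	toy_dev_get(d);		/* pin the device for the dma_buf's lifetime */
	buf->dev = d;
}

static void toy_dmabuf_release(struct toy_dmabuf *buf)
{
	toy_dev_put(buf->dev);
	buf->dev = NULL;
}
```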

Christian König Nov. 25, 2020, 10:42 a.m. UTC | #10
Am 21.11.20 um 06:21 schrieb Andrey Grodzovsky:
> It's needed to drop iommu backed pages on device unplug
> before device's IOMMU group is released.

It would be cleaner if we could do the whole handling in TTM. I also 
need to double check what you are doing with this function.

Christian.

>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>   drivers/gpu/drm/ttm/ttm_tt.c | 1 +
>   1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
> index 1ccf1ef..29248a5 100644
> --- a/drivers/gpu/drm/ttm/ttm_tt.c
> +++ b/drivers/gpu/drm/ttm/ttm_tt.c
> @@ -495,3 +495,4 @@ void ttm_tt_unpopulate(struct ttm_tt *ttm)
>   	else
>   		ttm_pool_unpopulate(ttm);
>   }
> +EXPORT_SYMBOL(ttm_tt_unpopulate);
Christian König Nov. 25, 2020, 12:57 p.m. UTC | #11
Am 25.11.20 um 11:40 schrieb Daniel Vetter:
> On Tue, Nov 24, 2020 at 05:44:07PM +0100, Christian König wrote:
>> Am 24.11.20 um 17:22 schrieb Andrey Grodzovsky:
>>> On 11/24/20 2:41 AM, Christian König wrote:
>>>> Am 23.11.20 um 22:08 schrieb Andrey Grodzovsky:
>>>>> On 11/23/20 3:41 PM, Christian König wrote:
>>>>>> Am 23.11.20 um 21:38 schrieb Andrey Grodzovsky:
>>>>>>> On 11/23/20 3:20 PM, Christian König wrote:
>>>>>>>> Am 23.11.20 um 21:05 schrieb Andrey Grodzovsky:
>>>>>>>>> On 11/25/20 5:42 AM, Christian König wrote:
>>>>>>>>>> Am 21.11.20 um 06:21 schrieb Andrey Grodzovsky:
>>>>>>>>>>> It's needed to drop iommu backed pages on device unplug
>>>>>>>>>>> before device's IOMMU group is released.
>>>>>>>>>> It would be cleaner if we could do the whole
>>>>>>>>>> handling in TTM. I also need to double check
>>>>>>>>>> what you are doing with this function.
>>>>>>>>>>
>>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>> Check patch "drm/amdgpu: Register IOMMU topology
>>>>>>>>> notifier per device." to see
>>>>>>>>> how i use it. I don't see why this should go
>>>>>>>>> into TTM mid-layer - the stuff I do inside
>>>>>>>>> is vendor specific and also I don't think TTM is
>>>>>>>>> explicitly aware of IOMMU ?
>>>>>>>>> Do you mean you prefer the IOMMU notifier to be
>>>>>>>>> registered from within TTM
>>>>>>>>> and then use a hook to call into vendor specific handler ?
>>>>>>>> No, that is really vendor specific.
>>>>>>>>
>>>>>>>> What I meant is to have a function like
>>>>>>>> ttm_resource_manager_evict_all() which you only need
>>>>>>>> to call and all tt objects are unpopulated.
>>>>>>>
>>>>>>> So instead of this BO list i create and later iterate in
>>>>>>> amdgpu from the IOMMU patch you just want to do it
>>>>>>> within
>>>>>>> TTM with a single function ? Makes much more sense.
>>>>>> Yes, exactly.
>>>>>>
>>>>>> The list_empty() checks we have in TTM for the LRU are
>>>>>> actually not the best idea, we should now check the
>>>>>> pin_count instead. This way we could also have a list of the
>>>>>> pinned BOs in TTM.
>>>>>
>>>>> So from my IOMMU topology handler I will iterate the TTM LRU for
>>>>> the unpinned BOs and this new function for the pinned ones  ?
>>>>> It's probably a good idea to combine both iterations into this
>>>>> new function to cover all the BOs allocated on the device.
>>>> Yes, that's what I had in my mind as well.
>>>>
>>>>>
>>>>>> BTW: Have you thought about what happens when we unpopulate
>>>>>> a BO while we still try to use a kernel mapping for it? That
>>>>>> could have unforeseen consequences.
>>>>>
>>>>> Are you asking what happens to kmap or vmap style mapped CPU
>>>>> accesses once we drop all the DMA backing pages for a particular
>>>>> BO ? Because for user mappings
>>>>> (mmap) we took care of this with dummy page reroute but indeed
>>>>> nothing was done for in kernel CPU mappings.
>>>> Yes exactly that.
>>>>
>>>> In other words what happens if we free the ring buffer while the
>>>> kernel still writes to it?
>>>>
>>>> Christian.
>>>
>>> While we can't control user application accesses to the mapped buffers
>>> explicitly and hence we use page fault rerouting
>>> I am thinking that in this  case we may be able to sprinkle
>>> drm_dev_enter/exit in any such sensitive place were we might
>>> CPU access a DMA buffer from the kernel ?
>> Yes, I fear we are going to need that.
> Uh ... problem is that dma_buf_vmap are usually permanent things. Maybe we
> could stuff this into begin/end_cpu_access (but only for the kernel, so a
> bit tricky)?

Oh very very good point! I haven't thought about DMA-buf mmaps in this 
context yet.


> btw the other issue with dma-buf (and even worse with dma_fence) is
> refcounting of the underlying drm_device. I'd expect that all your
> callbacks go boom if the dma_buf outlives your drm_device. That part isn't
> yet solved in your series here.

Well thinking more about this, it seems to be another really good 
argument why mapping pages from DMA-bufs into application address space 
directly is a very bad idea :)

But yes, we essentially can't remove the device as long as there is a 
DMA-buf with mappings. No idea how to clean that one up.

Christian.

> -Daniel
>
>>> Things like CPU page table updates, ring buffer accesses and FW memcpy ?
>>> Is there other places ?
>> Puh, good question. I have no idea.
>>
>>> Another point is that at this point the driver shouldn't access any such
>>> buffers as we are at the process finishing the device.
>>> AFAIK there is no page fault mechanism for kernel mappings so I don't
>>> think there is anything else to do ?
>> Well there is a page fault handler for kernel mappings, but that one just
>> prints the stack trace into the system log and calls BUG(); :)
>>
>> Long story short we need to avoid any access to released pages after unplug.
>> No matter if it's from the kernel or userspace.
>>
>> Regards,
>> Christian.
>>
>>> Andrey
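For illustration, the "sprinkle drm_dev_enter/exit" idea from the exchange above could look roughly like this around an in-kernel ring-buffer write. A sketch under assumptions only: the amdgpu field names are illustrative, and real code would also have to decide what to do about the skipped write:

```c
/*
 * Sketch: guard an in-kernel CPU access to a DMA-backed buffer so it
 * is skipped once the device is unplugged.  drm_dev_enter() returns
 * false after drm_dev_unplug(), and drm_dev_exit() ends the SRCU
 * read-side section.  ring->adev->ddev etc. are illustrative names.
 */
static void ring_write_guarded(struct amdgpu_ring *ring, u32 v)
{
	int idx;

	if (!drm_dev_enter(ring->adev->ddev, &idx))
		return;		/* device gone: don't touch freed pages */

	ring->ring[ring->wptr++ & ring->buf_mask] = v;

	drm_dev_exit(idx);
}
```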
Daniel Vetter Nov. 25, 2020, 4:36 p.m. UTC | #12
On Wed, Nov 25, 2020 at 01:57:40PM +0100, Christian König wrote:
> Am 25.11.20 um 11:40 schrieb Daniel Vetter:
> > On Tue, Nov 24, 2020 at 05:44:07PM +0100, Christian König wrote:
> > > Am 24.11.20 um 17:22 schrieb Andrey Grodzovsky:
> > > > On 11/24/20 2:41 AM, Christian König wrote:
> > > > > Am 23.11.20 um 22:08 schrieb Andrey Grodzovsky:
> > > > > > On 11/23/20 3:41 PM, Christian König wrote:
> > > > > > > Am 23.11.20 um 21:38 schrieb Andrey Grodzovsky:
> > > > > > > > On 11/23/20 3:20 PM, Christian König wrote:
> > > > > > > > > Am 23.11.20 um 21:05 schrieb Andrey Grodzovsky:
> > > > > > > > > > On 11/25/20 5:42 AM, Christian König wrote:
> > > > > > > > > > > Am 21.11.20 um 06:21 schrieb Andrey Grodzovsky:
> > > > > > > > > > > > It's needed to drop iommu backed pages on device unplug
> > > > > > > > > > > > before device's IOMMU group is released.
> > > > > > > > > > > It would be cleaner if we could do the whole
> > > > > > > > > > > handling in TTM. I also need to double check
> > > > > > > > > > > what you are doing with this function.
> > > > > > > > > > > 
> > > > > > > > > > > Christian.
> > > > > > > > > > 
> > > > > > > > > > Check patch "drm/amdgpu: Register IOMMU topology
> > > > > > > > > > notifier per device." to see
> > > > > > > > > > how i use it. I don't see why this should go
> > > > > > > > > > into TTM mid-layer - the stuff I do inside
> > > > > > > > > > is vendor specific and also I don't think TTM is
> > > > > > > > > > explicitly aware of IOMMU ?
> > > > > > > > > > Do you mean you prefer the IOMMU notifier to be
> > > > > > > > > > registered from within TTM
> > > > > > > > > > and then use a hook to call into vendor specific handler ?
> > > > > > > > > No, that is really vendor specific.
> > > > > > > > > 
> > > > > > > > > What I meant is to have a function like
> > > > > > > > > ttm_resource_manager_evict_all() which you only need
> > > > > > > > > to call and all tt objects are unpopulated.
> > > > > > > > 
> > > > > > > > So instead of this BO list i create and later iterate in
> > > > > > > > amdgpu from the IOMMU patch you just want to do it
> > > > > > > > within
> > > > > > > > TTM with a single function ? Makes much more sense.
> > > > > > > Yes, exactly.
> > > > > > > 
> > > > > > > The list_empty() checks we have in TTM for the LRU are
> > > > > > > actually not the best idea, we should now check the
> > > > > > > pin_count instead. This way we could also have a list of the
> > > > > > > pinned BOs in TTM.
> > > > > > 
> > > > > > So from my IOMMU topology handler I will iterate the TTM LRU for
> > > > > > the unpinned BOs and this new function for the pinned ones  ?
> > > > > > It's probably a good idea to combine both iterations into this
> > > > > > new function to cover all the BOs allocated on the device.
> > > > > Yes, that's what I had in my mind as well.
> > > > > 
> > > > > > 
> > > > > > > BTW: Have you thought about what happens when we unpopulate
> > > > > > > a BO while we still try to use a kernel mapping for it? That
> > > > > > > could have unforeseen consequences.
> > > > > > 
> > > > > > Are you asking what happens to kmap or vmap style mapped CPU
> > > > > > accesses once we drop all the DMA backing pages for a particular
> > > > > > BO ? Because for user mappings
> > > > > > (mmap) we took care of this with dummy page reroute but indeed
> > > > > > nothing was done for in kernel CPU mappings.
> > > > > Yes exactly that.
> > > > > 
> > > > > In other words what happens if we free the ring buffer while the
> > > > > kernel still writes to it?
> > > > > 
> > > > > Christian.
> > > > 
> > > > While we can't control user application accesses to the mapped buffers
> > > > explicitly and hence we use page fault rerouting
> > > > I am thinking that in this  case we may be able to sprinkle
> > > > drm_dev_enter/exit in any such sensitive place were we might
> > > > CPU access a DMA buffer from the kernel ?
> > > Yes, I fear we are going to need that.
> > Uh ... problem is that dma_buf_vmap are usually permanent things. Maybe we
> > could stuff this into begin/end_cpu_access (but only for the kernel, so a
> > bit tricky)?
> 
> Oh very very good point! I haven't thought about DMA-buf mmaps in this
> context yet.
> 
> 
> > btw the other issue with dma-buf (and even worse with dma_fence) is
> > refcounting of the underlying drm_device. I'd expect that all your
> > callbacks go boom if the dma_buf outlives your drm_device. That part isn't
> > yet solved in your series here.
> 
> Well thinking more about this, it seems to be a another really good argument
> why mapping pages from DMA-bufs into application address space directly is a
> very bad idea :)
> 
> But yes, we essentially can't remove the device as long as there is a
> DMA-buf with mappings. No idea how to clean that one up.

drm_dev_get/put in drm_prime helpers should get us like 90% there I think.

The even more worrying thing is random dma_fence attached to the dma_resv
object. We could try to clean all of ours up, but they could have escaped
already into some other driver. And since we're talking about egpu
hotunplug, dma_fence escaping to the igpu is a pretty reasonable use-case.

I have no idea how to fix that one :-/
-Daniel
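A sketch of the drm_dev_get/put-in-prime-helpers idea mentioned above (the placement and the wrapper name are assumptions; the real change would live in the drm_prime export and release paths):

```c
/*
 * Sketch: take a drm_device reference when a GEM object is exported
 * as a dma-buf, so the device structure outlives the dma-buf.  The
 * matching drm_dev_put() would go in the dma_buf release callback.
 */
static struct dma_buf *export_with_dev_ref(struct drm_gem_object *obj,
					   int flags)
{
	struct dma_buf *dbuf = drm_gem_prime_export(obj, flags);

	if (!IS_ERR(dbuf))
		drm_dev_get(obj->dev);
	return dbuf;
}
```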

> 
> Christian.
> 
> > -Daniel
> > 
> > > > Things like CPU page table updates, ring buffer accesses and FW memcpy ?
> > > > Is there other places ?
> > > Puh, good question. I have no idea.
> > > 
> > > > Another point is that at this point the driver shouldn't access any such
> > > > buffers as we are at the process finishing the device.
> > > > AFAIK there is no page fault mechanism for kernel mappings so I don't
> > > > think there is anything else to do ?
> > > Well there is a page fault handler for kernel mappings, but that one just
> > > prints the stack trace into the system log and calls BUG(); :)
> > > 
> > > Long story short we need to avoid any access to released pages after unplug.
> > > No matter if it's from the kernel or userspace.
> > > 
> > > Regards,
> > > Christian.
> > > 
> > > > Andrey
>
Michel Dänzer Nov. 25, 2020, 4:56 p.m. UTC | #13
On 2020-11-25 1:57 p.m., Christian König wrote:
> 
> Well thinking more about this, it seems to be a another really good 
> argument why mapping pages from DMA-bufs into application address space 
> directly is a very bad idea :)

Apologies for going off on a tangent here...

Since allowing userspace mmap with dma-buf fds seems to be a trap in 
general[0], I wonder if there's any way we could stop supporting that?


[0] E.g. mutter had to disable handing out dma-bufs for screen capture 
by default with non-i915 for now, because in particular with discrete 
GPUs, direct CPU reads can be unusably slow (think single-digit frames 
per second), and of course there's other userspace which goes "ooh, 
dma-buf, let's map and read!".
Daniel Vetter Nov. 25, 2020, 5:02 p.m. UTC | #14
On Wed, Nov 25, 2020 at 5:56 PM Michel Dänzer <michel@daenzer.net> wrote:
>
> On 2020-11-25 1:57 p.m., Christian König wrote:
> >
> > Well thinking more about this, it seems to be a another really good
> > argument why mapping pages from DMA-bufs into application address space
> > directly is a very bad idea :)
>
> Apologies for going off on a tangent here...
>
> Since allowing userspace mmap with dma-buf fds seems to be a trap in
> general[0], I wonder if there's any way we could stop supporting that?
>
>
> [0] E.g. mutter had to disable handing out dma-bufs for screen capture
> by default with non-i915 for now, because in particular with discrete
> GPUs, direct CPU reads can be unusably slow (think single-digit frames
> per second), and of course there's other userspace which goes "ooh,
> dma-buf, let's map and read!".

I think a pile of applications (cros included) use it to do uploads
across process boundaries. Think locked down jpeg decoder and stuff
like that. For that use-case it seems to work ok.

But yeah don't read from dma-buf. I'm pretty sure it's dead slow on
almost everything, except integrated gpu which have A) a coherent
fabric with the gpu and B) that fabric is actually faster for
rendering in general, not just for dedicated buffers allocated for
down/upload.
-Daniel
Andrey Grodzovsky Nov. 25, 2020, 7:34 p.m. UTC | #15
On 11/25/20 11:36 AM, Daniel Vetter wrote:
> On Wed, Nov 25, 2020 at 01:57:40PM +0100, Christian König wrote:
>> Am 25.11.20 um 11:40 schrieb Daniel Vetter:
>>> On Tue, Nov 24, 2020 at 05:44:07PM +0100, Christian König wrote:
>>>> Am 24.11.20 um 17:22 schrieb Andrey Grodzovsky:
>>>>> On 11/24/20 2:41 AM, Christian König wrote:
>>>>>> Am 23.11.20 um 22:08 schrieb Andrey Grodzovsky:
>>>>>>> On 11/23/20 3:41 PM, Christian König wrote:
>>>>>>>> Am 23.11.20 um 21:38 schrieb Andrey Grodzovsky:
>>>>>>>>> On 11/23/20 3:20 PM, Christian König wrote:
>>>>>>>>>> Am 23.11.20 um 21:05 schrieb Andrey Grodzovsky:
>>>>>>>>>>> On 11/25/20 5:42 AM, Christian König wrote:
>>>>>>>>>>>> Am 21.11.20 um 06:21 schrieb Andrey Grodzovsky:
>>>>>>>>>>>>> It's needed to drop iommu backed pages on device unplug
>>>>>>>>>>>>> before device's IOMMU group is released.
>>>>>>>>>>>> It would be cleaner if we could do the whole
>>>>>>>>>>>> handling in TTM. I also need to double check
>>>>>>>>>>>> what you are doing with this function.
>>>>>>>>>>>>
>>>>>>>>>>>> Christian.
>>>>>>>>>>> Check patch "drm/amdgpu: Register IOMMU topology
>>>>>>>>>>> notifier per device." to see
>>>>>>>>>>> how i use it. I don't see why this should go
>>>>>>>>>>> into TTM mid-layer - the stuff I do inside
>>>>>>>>>>> is vendor specific and also I don't think TTM is
>>>>>>>>>>> explicitly aware of IOMMU ?
>>>>>>>>>>> Do you mean you prefer the IOMMU notifier to be
>>>>>>>>>>> registered from within TTM
>>>>>>>>>>> and then use a hook to call into vendor specific handler ?
>>>>>>>>>> No, that is really vendor specific.
>>>>>>>>>>
>>>>>>>>>> What I meant is to have a function like
>>>>>>>>>> ttm_resource_manager_evict_all() which you only need
>>>>>>>>>> to call and all tt objects are unpopulated.
>>>>>>>>> So instead of this BO list i create and later iterate in
>>>>>>>>> amdgpu from the IOMMU patch you just want to do it
>>>>>>>>> within
>>>>>>>>> TTM with a single function ? Makes much more sense.
>>>>>>>> Yes, exactly.
>>>>>>>>
>>>>>>>> The list_empty() checks we have in TTM for the LRU are
>>>>>>>> actually not the best idea, we should now check the
>>>>>>>> pin_count instead. This way we could also have a list of the
>>>>>>>> pinned BOs in TTM.
>>>>>>> So from my IOMMU topology handler I will iterate the TTM LRU for
>>>>>>> the unpinned BOs and this new function for the pinned ones  ?
>>>>>>> It's probably a good idea to combine both iterations into this
>>>>>>> new function to cover all the BOs allocated on the device.
>>>>>> Yes, that's what I had in my mind as well.
>>>>>>
>>>>>>>> BTW: Have you thought about what happens when we unpopulate
>>>>>>>> a BO while we still try to use a kernel mapping for it? That
>>>>>>>> could have unforeseen consequences.
>>>>>>> Are you asking what happens to kmap or vmap style mapped CPU
>>>>>>> accesses once we drop all the DMA backing pages for a particular
>>>>>>> BO ? Because for user mappings
>>>>>>> (mmap) we took care of this with dummy page reroute but indeed
>>>>>>> nothing was done for in kernel CPU mappings.
>>>>>> Yes exactly that.
>>>>>>
>>>>>> In other words what happens if we free the ring buffer while the
>>>>>> kernel still writes to it?
>>>>>>
>>>>>> Christian.
>>>>> While we can't control user application accesses to the mapped buffers
>>>>> explicitly and hence we use page fault rerouting
>>>>> I am thinking that in this  case we may be able to sprinkle
>>>>> drm_dev_enter/exit in any such sensitive place were we might
>>>>> CPU access a DMA buffer from the kernel ?
>>>> Yes, I fear we are going to need that.
>>> Uh ... problem is that dma_buf_vmap are usually permanent things. Maybe we
>>> could stuff this into begin/end_cpu_access


Do you mean guarding with drm_dev_enter/exit in the driver-specific
dma_buf_ops.begin/end_cpu_access hooks ?


>>> (but only for the kernel, so a
>>> bit tricky)?


Why only kernel ? Why is it a problem to do it if it comes from dma_buf_ioctl by
some user process ? And if we do need this distinction I think we should be able to
differentiate by looking at the current->mm (i.e. mm_struct) pointer, which is NULL
for a kernel thread.
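The current->mm test suggested here would be a one-liner; note the caveat that a kthread which temporarily adopted a user mm (via kthread_use_mm()) would not be caught by it:

```c
/*
 * Sketch: kernel threads have no address space of their own, so
 * current->mm is NULL for them.  Caveat: a kthread that called
 * kthread_use_mm() temporarily looks like a user process here.
 */
static bool in_kernel_thread_context(void)
{
	return current->mm == NULL;
}
```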


>> Oh very very good point! I haven't thought about DMA-buf mmaps in this
>> context yet.
>>
>>
>>> btw the other issue with dma-buf (and even worse with dma_fence) is
>>> refcounting of the underlying drm_device. I'd expect that all your
>>> callbacks go boom if the dma_buf outlives your drm_device. That part isn't
>>> yet solved in your series here.
>> Well thinking more about this, it seems to be a another really good argument
>> why mapping pages from DMA-bufs into application address space directly is a
>> very bad idea :)
>>
>> But yes, we essentially can't remove the device as long as there is a
>> DMA-buf with mappings. No idea how to clean that one up.
> drm_dev_get/put in drm_prime helpers should get us like 90% there I think.


What are the other 10% ?


>
> The even more worrying thing is random dma_fence attached to the dma_resv
> object. We could try to clean all of ours up, but they could have escaped
> already into some other driver. And since we're talking about egpu
> hotunplug, dma_fence escaping to the igpu is a pretty reasonable use-case.
>
> I have no how to fix that one :-/
> -Daniel


I assume you are referring to the sync_file_create/sync_file_get_fence API for
dma_fence export/import ?
With DMA-bufs we have the drm_gem_object as exporter-specific private data,
so we can do drm_dev_get and put at the drm_gem_object layer to bind the
device life cycle to that of each GEM object. But we don't have such a
mid-layer for dma_fence which could allow us to increment the device
reference for each fence out there related to that device - is my
understanding correct ?

Andrey




>> Christian.
>>
>>> -Daniel
>>>
>>>>> Things like CPU page table updates, ring buffer accesses and FW memcpy ?
>>>>> Is there other places ?
>>>> Puh, good question. I have no idea.
>>>>
>>>>> Another point is that at this point the driver shouldn't access any such
>>>>> buffers as we are at the process finishing the device.
>>>>> AFAIK there is no page fault mechanism for kernel mappings so I don't
>>>>> think there is anything else to do ?
>>>> Well there is a page fault handler for kernel mappings, but that one just
>>>> prints the stack trace into the system log and calls BUG(); :)
>>>>
>>>> Long story short we need to avoid any access to released pages after unplug.
>>>> No matter if it's from the kernel or userspace.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Andrey
Andrey Grodzovsky Nov. 27, 2020, 1:10 p.m. UTC | #16
Hey Daniel, just a ping on a bunch of questions I posted below.

Andrey
Daniel Vetter Nov. 27, 2020, 2:59 p.m. UTC | #17
On Wed, Nov 25, 2020 at 02:34:44PM -0500, Andrey Grodzovsky wrote:
> 
> On 11/25/20 11:36 AM, Daniel Vetter wrote:
> > On Wed, Nov 25, 2020 at 01:57:40PM +0100, Christian König wrote:
> > > Am 25.11.20 um 11:40 schrieb Daniel Vetter:
> > > > On Tue, Nov 24, 2020 at 05:44:07PM +0100, Christian König wrote:
> > > > > Am 24.11.20 um 17:22 schrieb Andrey Grodzovsky:
> > > > > > On 11/24/20 2:41 AM, Christian König wrote:
> > > > > > > Am 23.11.20 um 22:08 schrieb Andrey Grodzovsky:
> > > > > > > > On 11/23/20 3:41 PM, Christian König wrote:
> > > > > > > > > Am 23.11.20 um 21:38 schrieb Andrey Grodzovsky:
> > > > > > > > > > On 11/23/20 3:20 PM, Christian König wrote:
> > > > > > > > > > > Am 23.11.20 um 21:05 schrieb Andrey Grodzovsky:
> > > > > > > > > > > > On 11/25/20 5:42 AM, Christian König wrote:
> > > > > > > > > > > > > Am 21.11.20 um 06:21 schrieb Andrey Grodzovsky:
> > > > > > > > > > > > > > It's needed to drop iommu backed pages on device unplug
> > > > > > > > > > > > > > before device's IOMMU group is released.
> > > > > > > > > > > > > It would be cleaner if we could do the whole
> > > > > > > > > > > > > handling in TTM. I also need to double check
> > > > > > > > > > > > > what you are doing with this function.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Christian.
> > > > > > > > > > > > Check patch "drm/amdgpu: Register IOMMU topology
> > > > > > > > > > > > notifier per device." to see
> > > > > > > > > > > > how i use it. I don't see why this should go
> > > > > > > > > > > > into TTM mid-layer - the stuff I do inside
> > > > > > > > > > > > is vendor specific and also I don't think TTM is
> > > > > > > > > > > > explicitly aware of IOMMU ?
> > > > > > > > > > > > Do you mean you prefer the IOMMU notifier to be
> > > > > > > > > > > > registered from within TTM
> > > > > > > > > > > > and then use a hook to call into vendor specific handler ?
> > > > > > > > > > > No, that is really vendor specific.
> > > > > > > > > > > 
> > > > > > > > > > > What I meant is to have a function like
> > > > > > > > > > > ttm_resource_manager_evict_all() which you only need
> > > > > > > > > > > to call and all tt objects are unpopulated.
> > > > > > > > > > So instead of this BO list i create and later iterate in
> > > > > > > > > > amdgpu from the IOMMU patch you just want to do it
> > > > > > > > > > within
> > > > > > > > > > TTM with a single function ? Makes much more sense.
> > > > > > > > > Yes, exactly.
> > > > > > > > > 
> > > > > > > > > The list_empty() checks we have in TTM for the LRU are
> > > > > > > > > actually not the best idea, we should now check the
> > > > > > > > > pin_count instead. This way we could also have a list of the
> > > > > > > > > pinned BOs in TTM.
> > > > > > > > So from my IOMMU topology handler I will iterate the TTM LRU for
> > > > > > > > the unpinned BOs and this new function for the pinned ones  ?
> > > > > > > > It's probably a good idea to combine both iterations into this
> > > > > > > > new function to cover all the BOs allocated on the device.
> > > > > > > Yes, that's what I had in my mind as well.
> > > > > > > 
> > > > > > > > > BTW: Have you thought about what happens when we unpopulate
> > > > > > > > > a BO while we still try to use a kernel mapping for it? That
> > > > > > > > > could have unforeseen consequences.
> > > > > > > > Are you asking what happens to kmap or vmap style mapped CPU
> > > > > > > > accesses once we drop all the DMA backing pages for a particular
> > > > > > > > BO ? Because for user mappings
> > > > > > > > (mmap) we took care of this with dummy page reroute but indeed
> > > > > > > > nothing was done for in kernel CPU mappings.
> > > > > > > Yes exactly that.
> > > > > > > 
> > > > > > > In other words what happens if we free the ring buffer while the
> > > > > > > kernel still writes to it?
> > > > > > > 
> > > > > > > Christian.
> > > > > > While we can't control user application accesses to the mapped buffers
> > > > > > explicitly and hence we use page fault rerouting
> > > > > > I am thinking that in this  case we may be able to sprinkle
> > > > > > drm_dev_enter/exit in any such sensitive place were we might
> > > > > > CPU access a DMA buffer from the kernel ?
> > > > > Yes, I fear we are going to need that.
> > > > Uh ... problem is that dma_buf_vmap are usually permanent things. Maybe we
> > > > could stuff this into begin/end_cpu_access
> 
> 
> Do you mean guarding with drm_dev_enter/exit in dma_buf_ops.begin/end_cpu_access
> driver specific hook ?
> 
> 
> > > > (but only for the kernel, so a
> > > > bit tricky)?
> 
> 
> Why only kernel ? Why is it a problem to do it if it comes from dma_buf_ioctl by
> some user process ? And  if we do need this distinction I think we should be able to
> differentiate by looking at current->mm (i.e. mm_struct) pointer being NULL
> for kernel thread.

Userspace mmap is handled by punching out the pte. So we don't need to do
anything special there.

For kernel mmap the begin/end should be all in the same context (so we
could use the srcu lock that works underneath drm_dev_enter/exit), since
at least right now kernel vmaps of dma-buf are very long-lived.

But the good news is that Thomas Zimmermann is working on this problem
already for different reasons, so it might be that we won't have any
long-lived kernel vmap anymore. And we could put the drm_dev_enter/exit in
there.

> > > Oh very very good point! I haven't thought about DMA-buf mmaps in this
> > > context yet.
> > > 
> > > 
> > > > btw the other issue with dma-buf (and even worse with dma_fence) is
> > > > refcounting of the underlying drm_device. I'd expect that all your
> > > > callbacks go boom if the dma_buf outlives your drm_device. That part isn't
> > > > yet solved in your series here.
> > > Well thinking more about this, it seems to be a another really good argument
> > > why mapping pages from DMA-bufs into application address space directly is a
> > > very bad idea :)
> > > 
> > > But yes, we essentially can't remove the device as long as there is a
> > > DMA-buf with mappings. No idea how to clean that one up.
> > drm_dev_get/put in drm_prime helpers should get us like 90% there I think.
> 
> 
> What are the other 10% ?

dma_fence, which is also about 90% of the work probably. But I'm
guesstimating only 10% of the oopses you can hit. Since generally the
dma_fence for a buffer don't outlive the underlying buffer. So usually no
problems happen when we've solved the dma-buf sharing, but the dma_fence
can outlive the dma-buf, so there's still possibilities of crashing.

> > The even more worrying thing is random dma_fence attached to the dma_resv
> > object. We could try to clean all of ours up, but they could have escaped
> > already into some other driver. And since we're talking about egpu
> > hotunplug, dma_fence escaping to the igpu is a pretty reasonable use-case.
> > 
> > I have no how to fix that one :-/
> > -Daniel
> 
> 
> I assume you are referring to sync_file_create/sync_file_get_fence API  for
> dma_fence export/import ?

So dma_fence is a general issue, there's a pile of interfaces that result
in sharing with other drivers:
- dma_resv in the dma_buf
- sync_file
- drm_syncobj (but I think that's not yet cross driver, but probably
  changes)

In each of these cases drivers can pick up the dma_fence and use it
internally for all kinds of purposes (could end up in the scheduler or
wherever).

> So with DMA bufs we have the drm_gem_object as exporter specific private data
> and so we can do drm_dev_get and put at the drm_gem_object layer to bind
> device life cycle
> to that of each GEM object but, we don't have such mid-layer for dma_fence
> which could allow
> us to increment device reference for each fence out there related to that
> device - is my understanding correct ?

Yeah that's the annoying part with dma-fence. No existing generic place to
put the drm_dev_get/put. tbf I'd note this as a todo and try to solve the
other problems first.
-Daniel

> 
> Andrey
> 
> 
> Andrey
> 
> 
> > > Christian.
> > > 
> > > > -Daniel
> > > > 
> > > > > > Things like CPU page table updates, ring buffer accesses and FW memcpy ?
> > > > > > Is there other places ?
> > > > > Puh, good question. I have no idea.
> > > > > 
> > > > > > Another point is that at this point the driver shouldn't access any such
> > > > > > buffers as we are at the process finishing the device.
> > > > > > AFAIK there is no page fault mechanism for kernel mappings so I don't
> > > > > > think there is anything else to do ?
> > > > > Well there is a page fault handler for kernel mappings, but that one just
> > > > > prints the stack trace into the system log and calls BUG(); :)
> > > > > 
> > > > > Long story short we need to avoid any access to released pages after unplug.
> > > > > No matter if it's from the kernel or userspace.
> > > > > 
> > > > > Regards,
> > > > > Christian.
> > > > > 
> > > > > > Andrey
Andrey Grodzovsky Nov. 27, 2020, 4:04 p.m. UTC | #18
On 11/27/20 9:59 AM, Daniel Vetter wrote:
> On Wed, Nov 25, 2020 at 02:34:44PM -0500, Andrey Grodzovsky wrote:
>> On 11/25/20 11:36 AM, Daniel Vetter wrote:
>>> On Wed, Nov 25, 2020 at 01:57:40PM +0100, Christian König wrote:
>>>> Am 25.11.20 um 11:40 schrieb Daniel Vetter:
>>>>> On Tue, Nov 24, 2020 at 05:44:07PM +0100, Christian König wrote:
>>>>>> Am 24.11.20 um 17:22 schrieb Andrey Grodzovsky:
>>>>>>> On 11/24/20 2:41 AM, Christian König wrote:
>>>>>>>> Am 23.11.20 um 22:08 schrieb Andrey Grodzovsky:
>>>>>>>>> On 11/23/20 3:41 PM, Christian König wrote:
>>>>>>>>>> Am 23.11.20 um 21:38 schrieb Andrey Grodzovsky:
>>>>>>>>>>> On 11/23/20 3:20 PM, Christian König wrote:
>>>>>>>>>>>> Am 23.11.20 um 21:05 schrieb Andrey Grodzovsky:
>>>>>>>>>>>>> On 11/25/20 5:42 AM, Christian König wrote:
>>>>>>>>>>>>>> Am 21.11.20 um 06:21 schrieb Andrey Grodzovsky:
>>>>>>>>>>>>>>> It's needed to drop iommu backed pages on device unplug
>>>>>>>>>>>>>>> before device's IOMMU group is released.
>>>>>>>>>>>>>> It would be cleaner if we could do the whole
>>>>>>>>>>>>>> handling in TTM. I also need to double check
>>>>>>>>>>>>>> what you are doing with this function.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>> Check patch "drm/amdgpu: Register IOMMU topology
>>>>>>>>>>>>> notifier per device." to see
>>>>>>>>>>>>> how i use it. I don't see why this should go
>>>>>>>>>>>>> into TTM mid-layer - the stuff I do inside
>>>>>>>>>>>>> is vendor specific and also I don't think TTM is
>>>>>>>>>>>>> explicitly aware of IOMMU ?
>>>>>>>>>>>>> Do you mean you prefer the IOMMU notifier to be
>>>>>>>>>>>>> registered from within TTM
>>>>>>>>>>>>> and then use a hook to call into vendor specific handler ?
>>>>>>>>>>>> No, that is really vendor specific.
>>>>>>>>>>>>
>>>>>>>>>>>> What I meant is to have a function like
>>>>>>>>>>>> ttm_resource_manager_evict_all() which you only need
>>>>>>>>>>>> to call and all tt objects are unpopulated.
>>>>>>>>>>> So instead of this BO list i create and later iterate in
>>>>>>>>>>> amdgpu from the IOMMU patch you just want to do it
>>>>>>>>>>> within
>>>>>>>>>>> TTM with a single function ? Makes much more sense.
>>>>>>>>>> Yes, exactly.
>>>>>>>>>>
>>>>>>>>>> The list_empty() checks we have in TTM for the LRU are
>>>>>>>>>> actually not the best idea, we should now check the
>>>>>>>>>> pin_count instead. This way we could also have a list of the
>>>>>>>>>> pinned BOs in TTM.
>>>>>>>>> So from my IOMMU topology handler I will iterate the TTM LRU for
>>>>>>>>> the unpinned BOs and this new function for the pinned ones  ?
>>>>>>>>> It's probably a good idea to combine both iterations into this
>>>>>>>>> new function to cover all the BOs allocated on the device.
>>>>>>>> Yes, that's what I had in my mind as well.
>>>>>>>>
>>>>>>>>>> BTW: Have you thought about what happens when we unpopulate
>>>>>>>>>> a BO while we still try to use a kernel mapping for it? That
>>>>>>>>>> could have unforeseen consequences.
>>>>>>>>> Are you asking what happens to kmap or vmap style mapped CPU
>>>>>>>>> accesses once we drop all the DMA backing pages for a particular
>>>>>>>>> BO ? Because for user mappings
>>>>>>>>> (mmap) we took care of this with dummy page reroute but indeed
>>>>>>>>> nothing was done for in kernel CPU mappings.
>>>>>>>> Yes exactly that.
>>>>>>>>
>>>>>>>> In other words what happens if we free the ring buffer while the
>>>>>>>> kernel still writes to it?
>>>>>>>>
>>>>>>>> Christian.
>>>>>>> While we can't control user application accesses to the mapped buffers
>>>>>>> explicitly and hence we use page fault rerouting
>>>>>>> I am thinking that in this  case we may be able to sprinkle
>>>>>>> drm_dev_enter/exit in any such sensitive place were we might
>>>>>>> CPU access a DMA buffer from the kernel ?
>>>>>> Yes, I fear we are going to need that.
>>>>> Uh ... problem is that dma_buf_vmap are usually permanent things. Maybe we
>>>>> could stuff this into begin/end_cpu_access
>>
>> Do you mean guarding with drm_dev_enter/exit in dma_buf_ops.begin/end_cpu_access
>> driver specific hook ?
>>
>>
>>>>> (but only for the kernel, so a
>>>>> bit tricky)?
>>
>> Why only kernel ? Why is it a problem to do it if it comes from dma_buf_ioctl by
>> some user process ? And  if we do need this distinction I think we should be able to
>> differentiate by looking at current->mm (i.e. mm_struct) pointer being NULL
>> for kernel thread.
> Userspace mmap is handled by punching out the pte. So we don't need to do
> anything special there.
>
> For kernel mmap the begin/end should be all in the same context (so we
> could use the srcu lock that works underneath drm_dev_enter/exit), since
> at least right now kernel vmaps of dma-buf are very long-lived.


If by same context you mean the right drm_device (the exporter's one),
then this should be OK, judging from the amdgpu implementation
of the callback, amdgpu_dma_buf_begin_cpu_access. We just need to add a
handler for the .end_cpu_access callback and call drm_dev_exit there.

Andrey


>
> But the good news is that Thomas Zimmermann is working on this problem
> already for different reasons, so it might be that we won't have any
> long-lived kernel vmap anymore. And we could put the drm_dev_enter/exit in
> there.
>
>>>> Oh very very good point! I haven't thought about DMA-buf mmaps in this
>>>> context yet.
>>>>
>>>>
>>>>> btw the other issue with dma-buf (and even worse with dma_fence) is
>>>>> refcounting of the underlying drm_device. I'd expect that all your
>>>>> callbacks go boom if the dma_buf outlives your drm_device. That part isn't
>>>>> yet solved in your series here.
>>>> Well thinking more about this, it seems to be another really good argument
>>>> why mapping pages from DMA-bufs into application address space directly is a
>>>> very bad idea :)
>>>>
>>>> But yes, we essentially can't remove the device as long as there is a
>>>> DMA-buf with mappings. No idea how to clean that one up.
>>> drm_dev_get/put in drm_prime helpers should get us like 90% there I think.
>>
>> What are the other 10% ?
> dma_fence, which is also about 90% of the work probably. But I'm
> guesstimating only 10% of the oopses you can hit. Since generally the
> dma_fence for a buffer don't outlive the underlying buffer. So usually no
> problems happen when we've solved the dma-buf sharing, but the dma_fence
> can outlive the dma-buf, so there's still possibilities of crashing.
>
>>> The even more worrying thing is random dma_fence attached to the dma_resv
>>> object. We could try to clean all of ours up, but they could have escaped
>>> already into some other driver. And since we're talking about egpu
>>> hotunplug, dma_fence escaping to the igpu is a pretty reasonable use-case.
>>>
>>> I have no idea how to fix that one :-/
>>> -Daniel
>>
>> I assume you are referring to sync_file_create/sync_file_get_fence API  for
>> dma_fence export/import ?
> So dma_fence is a general issue, there's a pile of interfaces that result
> in sharing with other drivers:
> - dma_resv in the dma_buf
> - sync_file
> - drm_syncobj (but I think that's not yet cross driver, but probably
>    changes)
>
> In each of these cases drivers can pick up the dma_fence and use it
> internally for all kinds of purposes (could end up in the scheduler or
> wherever).
>
>> So with DMA bufs we have the drm_gem_object as exporter specific private data
>> and so we can do drm_dev_get and put at the drm_gem_object layer to bind
>> device life cycle
>> to that of each GEM object but, we don't have such mid-layer for dma_fence
>> which could allow
>> us to increment device reference for each fence out there related to that
>> device - is my understanding correct ?
> Yeah that's the annoying part with dma-fence. No existing generic place to
> put the drm_dev_get/put. tbf I'd note this as a todo and try to solve the
> other problems first.
> -Daniel
>
>> Andrey
>>
>>
>> Andrey
>>
>>
>>>> Christian.
>>>>
>>>>> -Daniel
>>>>>
>>>>>>> Things like CPU page table updates, ring buffer accesses and FW memcpy ?
>>>>>>> Is there other places ?
>>>>>> Puh, good question. I have no idea.
>>>>>>
>>>>>>> Another point is that at this point the driver shouldn't access any such
>>>>>>> buffers as we are at the process finishing the device.
>>>>>>> AFAIK there is no page fault mechanism for kernel mappings so I don't
>>>>>>> think there is anything else to do ?
>>>>>> Well there is a page fault handler for kernel mappings, but that one just
>>>>>> prints the stack trace into the system log and calls BUG(); :)
>>>>>>
>>>>>> Long story short we need to avoid any access to released pages after unplug.
>>>>>> No matter if it's from the kernel or userspace.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>> Andrey
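The drm_dev_enter/exit guard being debated above can be sketched in miniature. Below is a minimal userspace toy of the pattern: bail out of a begin_cpu_access-style hook with -ENODEV instead of touching buffers once the device is gone. The real helpers live in the DRM core and take an SRCU read lock on the drm_device; everything named *_stub or *_sketch here is an illustrative stand-in, not the real API.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the drm_dev_enter/exit guard. A plain flag stands in for
 * the SRCU-protected "device still plugged" state; all names here are
 * illustrative, not the real DRM interfaces. */
struct dev_stub {
	bool unplugged;
};

static bool dev_enter(struct dev_stub *dev)
{
	return !dev->unplugged; /* real code: drm_dev_enter(dev, &idx) */
}

static void dev_exit(struct dev_stub *dev)
{
	(void)dev; /* real code: drm_dev_exit(idx) drops the SRCU read lock */
}

/* Sketch of a begin_cpu_access-style hook: skip the hardware work when
 * the device has already been unplugged, instead of touching freed pages. */
static int begin_cpu_access_sketch(struct dev_stub *dev, int *touched)
{
	if (!dev_enter(dev))
		return -19; /* -ENODEV */
	*touched = 1; /* the actual cache-flush / CPU-access work would go here */
	dev_exit(dev);
	return 0;
}
```

This only shows the shape of the guard; as the thread notes, where such guards may legally sit (not across a return to userspace) is the hard part.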
Daniel Vetter Nov. 30, 2020, 2:15 p.m. UTC | #19
On Fri, Nov 27, 2020 at 11:04:55AM -0500, Andrey Grodzovsky wrote:
> 
> On 11/27/20 9:59 AM, Daniel Vetter wrote:
> > On Wed, Nov 25, 2020 at 02:34:44PM -0500, Andrey Grodzovsky wrote:
> > > On 11/25/20 11:36 AM, Daniel Vetter wrote:
> > > > On Wed, Nov 25, 2020 at 01:57:40PM +0100, Christian König wrote:
> > > > > Am 25.11.20 um 11:40 schrieb Daniel Vetter:
> > > > > > On Tue, Nov 24, 2020 at 05:44:07PM +0100, Christian König wrote:
> > > > > > > Am 24.11.20 um 17:22 schrieb Andrey Grodzovsky:
> > > > > > > > On 11/24/20 2:41 AM, Christian König wrote:
> > > > > > > > > Am 23.11.20 um 22:08 schrieb Andrey Grodzovsky:
> > > > > > > > > > On 11/23/20 3:41 PM, Christian König wrote:
> > > > > > > > > > > Am 23.11.20 um 21:38 schrieb Andrey Grodzovsky:
> > > > > > > > > > > > On 11/23/20 3:20 PM, Christian König wrote:
> > > > > > > > > > > > > Am 23.11.20 um 21:05 schrieb Andrey Grodzovsky:
> > > > > > > > > > > > > > On 11/25/20 5:42 AM, Christian König wrote:
> > > > > > > > > > > > > > > Am 21.11.20 um 06:21 schrieb Andrey Grodzovsky:
> > > > > > > > > > > > > > > > It's needed to drop iommu backed pages on device unplug
> > > > > > > > > > > > > > > > before device's IOMMU group is released.
> > > > > > > > > > > > > > > It would be cleaner if we could do the whole
> > > > > > > > > > > > > > > handling in TTM. I also need to double check
> > > > > > > > > > > > > > > what you are doing with this function.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Christian.
> > > > > > > > > > > > > > Check patch "drm/amdgpu: Register IOMMU topology
> > > > > > > > > > > > > > notifier per device." to see
> > > > > > > > > > > > > > how i use it. I don't see why this should go
> > > > > > > > > > > > > > into TTM mid-layer - the stuff I do inside
> > > > > > > > > > > > > > is vendor specific and also I don't think TTM is
> > > > > > > > > > > > > > explicitly aware of IOMMU ?
> > > > > > > > > > > > > > Do you mean you prefer the IOMMU notifier to be
> > > > > > > > > > > > > > registered from within TTM
> > > > > > > > > > > > > > and then use a hook to call into vendor specific handler ?
> > > > > > > > > > > > > No, that is really vendor specific.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > What I meant is to have a function like
> > > > > > > > > > > > > ttm_resource_manager_evict_all() which you only need
> > > > > > > > > > > > > to call and all tt objects are unpopulated.
> > > > > > > > > > > > So instead of this BO list i create and later iterate in
> > > > > > > > > > > > amdgpu from the IOMMU patch you just want to do it
> > > > > > > > > > > > within
> > > > > > > > > > > > TTM with a single function ? Makes much more sense.
> > > > > > > > > > > Yes, exactly.
> > > > > > > > > > > 
> > > > > > > > > > > The list_empty() checks we have in TTM for the LRU are
> > > > > > > > > > > actually not the best idea, we should now check the
> > > > > > > > > > > pin_count instead. This way we could also have a list of the
> > > > > > > > > > > pinned BOs in TTM.
> > > > > > > > > > So from my IOMMU topology handler I will iterate the TTM LRU for
> > > > > > > > > > the unpinned BOs and this new function for the pinned ones  ?
> > > > > > > > > > It's probably a good idea to combine both iterations into this
> > > > > > > > > > new function to cover all the BOs allocated on the device.
> > > > > > > > > Yes, that's what I had in my mind as well.
> > > > > > > > > 
> > > > > > > > > > > BTW: Have you thought about what happens when we unpopulate
> > > > > > > > > > > a BO while we still try to use a kernel mapping for it? That
> > > > > > > > > > > could have unforeseen consequences.
> > > > > > > > > > Are you asking what happens to kmap or vmap style mapped CPU
> > > > > > > > > > accesses once we drop all the DMA backing pages for a particular
> > > > > > > > > > BO ? Because for user mappings
> > > > > > > > > > (mmap) we took care of this with dummy page reroute but indeed
> > > > > > > > > > nothing was done for in kernel CPU mappings.
> > > > > > > > > Yes exactly that.
> > > > > > > > > 
> > > > > > > > > In other words what happens if we free the ring buffer while the
> > > > > > > > > kernel still writes to it?
> > > > > > > > > 
> > > > > > > > > Christian.
> > > > > > > > While we can't control user application accesses to the mapped buffers
> > > > > > > > explicitly and hence we use page fault rerouting
> > > > > > > > I am thinking that in this case we may be able to sprinkle
> > > > > > > > drm_dev_enter/exit in any such sensitive place where we might
> > > > > > > > CPU access a DMA buffer from the kernel ?
> > > > > > > Yes, I fear we are going to need that.
> > > > > > Uh ... problem is that dma_buf_vmap are usually permanent things. Maybe we
> > > > > > could stuff this into begin/end_cpu_access
> > > 
> > > Do you mean guarding with drm_dev_enter/exit in dma_buf_ops.begin/end_cpu_access
> > > driver specific hook ?
> > > 
> > > 
> > > > > > (but only for the kernel, so a
> > > > > > bit tricky)?
> > > 
> > > Why only kernel ? Why is it a problem to do it if it comes from dma_buf_ioctl by
> > > some user process ? And  if we do need this distinction I think we should be able to
> > > differentiate by looking at current->mm (i.e. mm_struct) pointer being NULL
> > > for kernel thread.
> > Userspace mmap is handled by punching out the pte. So we don't need to do
> > anything special there.
> > 
> > For kernel mmap the begin/end should be all in the same context (so we
> > could use the srcu lock that works underneath drm_dev_enter/exit), since
> > at least right now kernel vmaps of dma-buf are very long-lived.
> 
> 
> If by same context you mean the right drm_device (the exporter's one)
> then this should be ok as I am seeing from amdgpu implementation
> of the callback - amdgpu_dma_buf_begin_cpu_access. We just need to add
> handler for .end_cpu_access callback to call drm_dev_exit there.

Same context = same system call, essentially. You cannot hold locks while
returning to userspace. And current userspace can call the
begin/end_cpu_access callbacks through ioctls, so just putting a
drm_dev_enter/exit in them will break really badly. IIRC there's also an igt
for testing these ioctls - if there isn't, we really should have one.

Hence we need to be more careful here about who's calling and where we
can put the drm_dev_enter/exit.
-Daniel

> 
> Andrey
> 
> 
> > 
> > But the good news is that Thomas Zimmermann is working on this problem
> > already for different reasons, so it might be that we won't have any
> > long-lived kernel vmap anymore. And we could put the drm_dev_enter/exit in
> > there.
> > 
> > > > > Oh very very good point! I haven't thought about DMA-buf mmaps in this
> > > > > context yet.
> > > > > 
> > > > > 
> > > > > > btw the other issue with dma-buf (and even worse with dma_fence) is
> > > > > > refcounting of the underlying drm_device. I'd expect that all your
> > > > > > callbacks go boom if the dma_buf outlives your drm_device. That part isn't
> > > > > > yet solved in your series here.
> > > > > Well thinking more about this, it seems to be another really good argument
> > > > > why mapping pages from DMA-bufs into application address space directly is a
> > > > > very bad idea :)
> > > > > 
> > > > > But yes, we essentially can't remove the device as long as there is a
> > > > > DMA-buf with mappings. No idea how to clean that one up.
> > > > drm_dev_get/put in drm_prime helpers should get us like 90% there I think.
> > > 
> > > What are the other 10% ?
> > dma_fence, which is also about 90% of the work probably. But I'm
> > guesstimating only 10% of the oopses you can hit. Since generally the
> > dma_fence for a buffer don't outlive the underlying buffer. So usually no
> > problems happen when we've solved the dma-buf sharing, but the dma_fence
> > can outlive the dma-buf, so there's still possibilities of crashing.
> > 
> > > > The even more worrying thing is random dma_fence attached to the dma_resv
> > > > object. We could try to clean all of ours up, but they could have escaped
> > > > already into some other driver. And since we're talking about egpu
> > > > hotunplug, dma_fence escaping to the igpu is a pretty reasonable use-case.
> > > > 
> > > > I have no idea how to fix that one :-/
> > > > -Daniel
> > > 
> > > I assume you are referring to sync_file_create/sync_file_get_fence API  for
> > > dma_fence export/import ?
> > So dma_fence is a general issue, there's a pile of interfaces that result
> > in sharing with other drivers:
> > - dma_resv in the dma_buf
> > - sync_file
> > - drm_syncobj (but I think that's not yet cross driver, but probably
> >    changes)
> > 
> > In each of these cases drivers can pick up the dma_fence and use it
> > internally for all kinds of purposes (could end up in the scheduler or
> > wherever).
> > 
> > > So with DMA bufs we have the drm_gem_object as exporter specific private data
> > > and so we can do drm_dev_get and put at the drm_gem_object layer to bind
> > > device life cycle
> > > to that of each GEM object but, we don't have such mid-layer for dma_fence
> > > which could allow
> > > us to increment device reference for each fence out there related to that
> > > device - is my understanding correct ?
> > Yeah that's the annoying part with dma-fence. No existing generic place to
> > put the drm_dev_get/put. tbf I'd note this as a todo and try to solve the
> > other problems first.
> > -Daniel
> > 
> > > Andrey
> > > 
> > > 
> > > Andrey
> > > 
> > > 
> > > > > Christian.
> > > > > 
> > > > > > -Daniel
> > > > > > 
> > > > > > > > Things like CPU page table updates, ring buffer accesses and FW memcpy ?
> > > > > > > > Is there other places ?
> > > > > > > Puh, good question. I have no idea.
> > > > > > > 
> > > > > > > > Another point is that at this point the driver shouldn't access any such
> > > > > > > > buffers as we are at the process finishing the device.
> > > > > > > > AFAIK there is no page fault mechanism for kernel mappings so I don't
> > > > > > > > think there is anything else to do ?
> > > > > > > Well there is a page fault handler for kernel mappings, but that one just
> > > > > > > prints the stack trace into the system log and calls BUG(); :)
> > > > > > > 
> > > > > > > Long story short we need to avoid any access to released pages after unplug.
> > > > > > > No matter if it's from the kernel or userspace.
> > > > > > > 
> > > > > > > Regards,
> > > > > > > Christian.
> > > > > > > 
> > > > > > > > Andrey
Andrey Grodzovsky Dec. 15, 2020, 8:18 p.m. UTC | #20
On 11/24/20 11:44 AM, Christian König wrote:
> Am 24.11.20 um 17:22 schrieb Andrey Grodzovsky:
>>
>> On 11/24/20 2:41 AM, Christian König wrote:
>>> Am 23.11.20 um 22:08 schrieb Andrey Grodzovsky:
>>>>
>>>> On 11/23/20 3:41 PM, Christian König wrote:
>>>>> Am 23.11.20 um 21:38 schrieb Andrey Grodzovsky:
>>>>>>
>>>>>> On 11/23/20 3:20 PM, Christian König wrote:
>>>>>>> Am 23.11.20 um 21:05 schrieb Andrey Grodzovsky:
>>>>>>>>
>>>>>>>> On 11/25/20 5:42 AM, Christian König wrote:
>>>>>>>>> Am 21.11.20 um 06:21 schrieb Andrey Grodzovsky:
>>>>>>>>>> It's needed to drop iommu backed pages on device unplug
>>>>>>>>>> before device's IOMMU group is released.
>>>>>>>>>
>>>>>>>>> It would be cleaner if we could do the whole handling in TTM. I also 
>>>>>>>>> need to double check what you are doing with this function.
>>>>>>>>>
>>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>
>>>>>>>> Check patch "drm/amdgpu: Register IOMMU topology notifier per device." 
>>>>>>>> to see
>>>>>>>> how i use it. I don't see why this should go into TTM mid-layer - the 
>>>>>>>> stuff I do inside
>>>>>>>> is vendor specific and also I don't think TTM is explicitly aware of 
>>>>>>>> IOMMU ?
>>>>>>>> Do you mean you prefer the IOMMU notifier to be registered from within TTM
>>>>>>>> and then use a hook to call into vendor specific handler ?
>>>>>>>
>>>>>>> No, that is really vendor specific.
>>>>>>>
>>>>>>> What I meant is to have a function like ttm_resource_manager_evict_all() 
>>>>>>> which you only need to call and all tt objects are unpopulated.
>>>>>>
>>>>>>
>>>>>> So instead of this BO list i create and later iterate in amdgpu from the 
>>>>>> IOMMU patch you just want to do it within
>>>>>> TTM with a single function ? Makes much more sense.
>>>>>
>>>>> Yes, exactly.
>>>>>
>>>>> The list_empty() checks we have in TTM for the LRU are actually not the 
>>>>> best idea, we should now check the pin_count instead. This way we could 
>>>>> also have a list of the pinned BOs in TTM.
>>>>
>>>>
>>>> So from my IOMMU topology handler I will iterate the TTM LRU for the 
>>>> unpinned BOs and this new function for the pinned ones  ?
>>>> It's probably a good idea to combine both iterations into this new function 
>>>> to cover all the BOs allocated on the device.
>>>
>>> Yes, that's what I had in my mind as well.
>>>
>>>>
>>>>
>>>>>
>>>>> BTW: Have you thought about what happens when we unpopulate a BO while we 
>>>>> still try to use a kernel mapping for it? That could have unforeseen 
>>>>> consequences.
>>>>
>>>>
>>>> Are you asking what happens to kmap or vmap style mapped CPU accesses once 
>>>> we drop all the DMA backing pages for a particular BO ? Because for user 
>>>> mappings
>>>> (mmap) we took care of this with dummy page reroute but indeed nothing was 
>>>> done for in kernel CPU mappings.
>>>
>>> Yes exactly that.
>>>
>>> In other words what happens if we free the ring buffer while the kernel 
>>> still writes to it?
>>>
>>> Christian.
>>
>>
>> While we can't control user application accesses to the mapped buffers 
>> explicitly and hence we use page fault rerouting
>> I am thinking that in this case we may be able to sprinkle
>> drm_dev_enter/exit in any such sensitive place where we might
>> CPU access a DMA buffer from the kernel ?
>
> Yes, I fear we are going to need that.
>
>> Things like CPU page table updates, ring buffer accesses and FW memcpy ? Is 
>> there other places ?
>
> Puh, good question. I have no idea.
>
>> Another point is that at this point the driver shouldn't access any such 
>> buffers as we are at the process finishing the device.
>> AFAIK there is no page fault mechanism for kernel mappings so I don't think 
>> there is anything else to do ?
>
> Well there is a page fault handler for kernel mappings, but that one just 
> prints the stack trace into the system log and calls BUG(); :)
>
> Long story short we need to avoid any access to released pages after unplug. 
> No matter if it's from the kernel or userspace.


I was just about to start guarding CPU accesses from the kernel to GTT or
VRAM buffers with drm_dev_enter/exit, but then I looked more into the code
and it seems like ttm_tt_unpopulate just deletes DMA mappings (for the sake of
device-to-main-memory access). The kernel page table is not touched
until the last bo refcount is dropped and the bo is released
(ttm_bo_release->destroy->amdgpu_bo_destroy->amdgpu_bo_kunmap). This is true both
for GTT BOs mapped to the kernel by kmap (or vmap) and for VRAM BOs mapped by
ioremap. So as I see it, nothing bad will happen if we
unpopulate a BO while we still try to use a kernel mapping for it: the system memory
pages backing GTT BOs are still mapped and not freed, and the same holds for
the IO physical ranges of VRAM BOs mapped into the kernel page table,
since iounmap wasn't called yet. I loaded the driver with vm_update_mode=3,
meaning all VM updates are done using the CPU, and haven't seen any oopses after
removing the device. I guess I can test it more by allocating GTT and VRAM BOs
and trying to read/write to them after the device is removed.

Andrey


>
> Regards,
> Christian.
>
>>
>> Andrey
>
>
Christian König Dec. 16, 2020, 8:04 a.m. UTC | #21
Am 15.12.20 um 21:18 schrieb Andrey Grodzovsky:
> [SNIP]
>>>
>>> While we can't control user application accesses to the mapped 
>>> buffers explicitly and hence we use page fault rerouting
>>> I am thinking that in this case we may be able to sprinkle
>>> drm_dev_enter/exit in any such sensitive place where we might
>>> CPU access a DMA buffer from the kernel ?
>>
>> Yes, I fear we are going to need that.
>>
>>> Things like CPU page table updates, ring buffer accesses and FW 
>>> memcpy ? Is there other places ?
>>
>> Puh, good question. I have no idea.
>>
>>> Another point is that at this point the driver shouldn't access any 
>>> such buffers as we are at the process finishing the device.
>>> AFAIK there is no page fault mechanism for kernel mappings so I 
>>> don't think there is anything else to do ?
>>
>> Well there is a page fault handler for kernel mappings, but that one 
>> just prints the stack trace into the system log and calls BUG(); :)
>>
>> Long story short we need to avoid any access to released pages after 
>> unplug. No matter if it's from the kernel or userspace.
>
>
> I was just about to start guarding with drm_dev_enter/exit CPU 
> accesses from kernel to GTT or VRAM buffers but then I looked more in
> the code
> and seems like ttm_tt_unpopulate just deletes DMA mappings (for the 
> sake of device to main memory access). Kernel page table is not touched
> until last bo refcount is dropped and the bo is released 
> (ttm_bo_release->destroy->amdgpu_bo_destroy->amdgpu_bo_kunmap). This 
> is both
> for GTT BOs mapped to kernel by kmap (or vmap) and for VRAM BOs mapped
> by ioremap. So as I see it, nothing bad will happen after we
> unpopulate a BO while we still try to use a kernel mapping for it,
> system memory pages backing GTT BOs are still mapped and not freed and 
> for
> VRAM BOs same is for the IO physical ranges mapped into the kernel 
> page table since iounmap wasn't called yet.

The problem is that the system pages would be freed, and if the kernel
driver still happily writes to them we are pretty much busted, because we
write to freed-up memory.

Christian.

> I loaded the driver with vm_update_mode=3
> meaning all VM updates done using CPU and hasn't seen any OOPs after 
> removing the device. I guess i can test it more by allocating GTT and 
> VRAM BOs
> and trying to read/write to them after device is removed.
>
> Andrey
>
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> Andrey
>>
>>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
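As a toy illustration of the failure mode Christian describes - writing through a mapping whose backing pages were already returned to the allocator - here is a minimal sketch of the safe pattern of refusing access once the pages are unpopulated. All names are made up for illustration and are not the real TTM interfaces.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model: once the backing pages are freed (as ttm_tt_unpopulate and
 * friends eventually do via the pool), any write through the old pointer
 * is a use-after-free. Guarding accesses on a populated flag models the
 * safe behaviour. Illustrative names only, not the real TTM API. */
struct tt_stub {
	void *pages;
	int populated;
};

static void tt_populate_stub(struct tt_stub *tt)
{
	tt->pages = malloc(4096); /* stands in for allocating backing pages */
	tt->populated = 1;
}

static void tt_unpopulate_stub(struct tt_stub *tt)
{
	free(tt->pages); /* pages go back to the allocator here */
	tt->pages = NULL;
	tt->populated = 0;
}

/* Returns 0 on success, an -ENODEV-style error if the pages are gone. */
static int tt_write_stub(struct tt_stub *tt, const char *msg)
{
	if (!tt->populated)
		return -19; /* refuse instead of scribbling on freed memory */
	memcpy(tt->pages, msg, strlen(msg) + 1);
	return 0;
}
```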
Daniel Vetter Dec. 16, 2020, 2:21 p.m. UTC | #22
On Wed, Dec 16, 2020 at 9:04 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 15.12.20 um 21:18 schrieb Andrey Grodzovsky:
> > [SNIP]
> >>>
> >>> While we can't control user application accesses to the mapped
> >>> buffers explicitly and hence we use page fault rerouting
> >>> I am thinking that in this case we may be able to sprinkle
> >>> drm_dev_enter/exit in any such sensitive place where we might
> >>> CPU access a DMA buffer from the kernel ?
> >>
> >> Yes, I fear we are going to need that.
> >>
> >>> Things like CPU page table updates, ring buffer accesses and FW
> >>> memcpy ? Is there other places ?
> >>
> >> Puh, good question. I have no idea.
> >>
> >>> Another point is that at this point the driver shouldn't access any
> >>> such buffers as we are at the process finishing the device.
> >>> AFAIK there is no page fault mechanism for kernel mappings so I
> >>> don't think there is anything else to do ?
> >>
> >> Well there is a page fault handler for kernel mappings, but that one
> >> just prints the stack trace into the system log and calls BUG(); :)
> >>
> >> Long story short we need to avoid any access to released pages after
> >> unplug. No matter if it's from the kernel or userspace.
> >
> >
> > I was just about to start guarding with drm_dev_enter/exit CPU
> > accesses from kernel to GTT or VRAM buffers but then I looked more in
> > the code
> > and seems like ttm_tt_unpopulate just deletes DMA mappings (for the
> > sake of device to main memory access). Kernel page table is not touched
> > until last bo refcount is dropped and the bo is released
> > (ttm_bo_release->destroy->amdgpu_bo_destroy->amdgpu_bo_kunmap). This
> > is both
> > for GTT BOs mapped to kernel by kmap (or vmap) and for VRAM BOs mapped
> > by ioremap. So as I see it, nothing bad will happen after we
> > unpopulate a BO while we still try to use a kernel mapping for it,
> > system memory pages backing GTT BOs are still mapped and not freed and
> > for
> > VRAM BOs same is for the IO physical ranges mapped into the kernel
> > page table since iounmap wasn't called yet.
>
> The problem is that the system pages would be freed, and if the kernel
> driver still happily writes to them we are pretty much busted, because we
> write to freed-up memory.

Similar for vram: if this is an actual hotunplug and then replug, there's
going to be a different device behind the same mmio bar range most
likely (the higher bridges will all have the same windows assigned),
and that's bad news if we keep using it for current drivers. So we
really have to point all these cpu ptes to some other place.
-Daniel

>
> Christian.
>
> > I loaded the driver with vm_update_mode=3
> > meaning all VM updates done using CPU and hasn't seen any OOPs after
> > removing the device. I guess i can test it more by allocating GTT and
> > VRAM BOs
> > and trying to read/write to them after device is removed.
> >
> > Andrey
> >
> >
> >>
> >> Regards,
> >> Christian.
> >>
> >>>
> >>> Andrey
> >>
> >>
>
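The PTE-rerouting idea Daniel raises can be sketched in miniature: after unplug, redirect the mapping to a scratch dummy page so stray kernel writes no longer reach the (possibly recycled) BAR range. A plain pointer swap stands in for the actual page-table rewrite, and all names are illustrative, not real kernel interfaces.

```c
#include <assert.h>
#include <string.h>

/* Toy model of rerouting CPU accesses after hot-unplug. In real code the
 * driver would punch out and remap page-table entries; here swapping a
 * pointer to a scratch buffer plays that role. Illustrative names only. */
static char bar_window[64]; /* stands in for the ioremap'ed VRAM BAR */
static char dummy_page[64]; /* scratch backing used after unplug */
static char *mapping = bar_window;

static void reroute_on_unplug(void)
{
	mapping = dummy_page; /* real code: rewrite the cpu ptes */
}

static void kernel_write(const char *msg)
{
	/* All in-kernel accesses go through the (possibly rerouted) mapping,
	 * so after unplug they land in harmless scratch memory. */
	strncpy(mapping, msg, sizeof(bar_window) - 1);
}
```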
Andrey Grodzovsky Dec. 16, 2020, 4:13 p.m. UTC | #23
On 12/16/20 9:21 AM, Daniel Vetter wrote:
> On Wed, Dec 16, 2020 at 9:04 AM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> Am 15.12.20 um 21:18 schrieb Andrey Grodzovsky:
>>> [SNIP]
>>>>> While we can't control user application accesses to the mapped
>>>>> buffers explicitly and hence we use page fault rerouting
>>>>> I am thinking that in this case we may be able to sprinkle
>>>>> drm_dev_enter/exit in any such sensitive place where we might
>>>>> CPU access a DMA buffer from the kernel ?
>>>> Yes, I fear we are going to need that.
>>>>
>>>>> Things like CPU page table updates, ring buffer accesses and FW
>>>>> memcpy ? Is there other places ?
>>>> Puh, good question. I have no idea.
>>>>
>>>>> Another point is that at this point the driver shouldn't access any
>>>>> such buffers as we are at the process finishing the device.
>>>>> AFAIK there is no page fault mechanism for kernel mappings so I
>>>>> don't think there is anything else to do ?
>>>> Well there is a page fault handler for kernel mappings, but that one
>>>> just prints the stack trace into the system log and calls BUG(); :)
>>>>
>>>> Long story short we need to avoid any access to released pages after
>>>> unplug. No matter if it's from the kernel or userspace.
>>>
>>> I was just about to start guarding with drm_dev_enter/exit CPU
>>> accesses from kernel to GTT or VRAM buffers but then I looked more in
>>> the code
>>> and seems like ttm_tt_unpopulate just deletes DMA mappings (for the
>>> sake of device to main memory access). Kernel page table is not touched
>>> until last bo refcount is dropped and the bo is released
>>> (ttm_bo_release->destroy->amdgpu_bo_destroy->amdgpu_bo_kunmap). This
>>> is both
>>> for GTT BOs mapped to kernel by kmap (or vmap) and for VRAM BOs mapped
>>> by ioremap. So as I see it, nothing bad will happen after we
>>> unpopulate a BO while we still try to use a kernel mapping for it,
>>> system memory pages backing GTT BOs are still mapped and not freed and
>>> for
>>> VRAM BOs same is for the IO physical ranges mapped into the kernel
>>> page table since iounmap wasn't called yet.
>> The problem is that the system pages would be freed, and if the kernel
>> driver still happily writes to them we are pretty much busted, because we
>> write to freed-up memory.


OK, I see I missed ttm_tt_unpopulate->..->ttm_pool_free, which will release
the GTT BO pages. But then isn't there a problem in ttm_bo_release, since
ttm_bo_cleanup_memtype_use (which also leads to page release) comes
before bo->destroy, which unmaps the pages from the kernel page table? Won't
we end up writing to freed memory in this time interval? Don't we
need to postpone the page freeing until after the kernel page table unmapping?


> Similar for vram, if this is actual hotunplug and then replug, there's
> going to be a different device behind the same mmio bar range most
> likely (the higher bridges will all have the same windows assigned),


No idea how this actually works, but if we haven't called iounmap yet,
doesn't that mean those physical ranges that are still mapped into the page
table should be reserved and cannot be reused for another
device? As a guess, maybe another subrange from the higher bridge's total
range will be allocated.

> and that's bad news if we keep using it for current drivers. So we
> really have to point all these cpu ptes to some other place.


We can't just unmap it without syncing against any in-kernel accesses to those
buffers, and since the page-fault rerouting technique we use for user-mapped
buffers doesn't seem possible for kernel-mapped buffers, I am not sure how to
do it gracefully...

Andrey


> -Daniel
>
>> Christian.
>>
>>> I loaded the driver with vm_update_mode=3
>>> meaning all VM updates done using CPU and hasn't seen any OOPs after
>>> removing the device. I guess i can test it more by allocating GTT and
>>> VRAM BOs
>>> and trying to read/write to them after device is removed.
>>>
>>> Andrey
>>>
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Andrey
>>>>
>
Christian König Dec. 16, 2020, 4:18 p.m. UTC | #24
Am 16.12.20 um 17:13 schrieb Andrey Grodzovsky:
>
> On 12/16/20 9:21 AM, Daniel Vetter wrote:
>> On Wed, Dec 16, 2020 at 9:04 AM Christian König
>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>> Am 15.12.20 um 21:18 schrieb Andrey Grodzovsky:
>>>> [SNIP]
>>>>>> While we can't control user application accesses to the mapped
>>>>>> buffers explicitly and hence we use page fault rerouting
>>>>>> I am thinking that in this case we may be able to sprinkle
>>>>>> drm_dev_enter/exit in any such sensitive place where we might
>>>>>> CPU access a DMA buffer from the kernel ?
>>>>> Yes, I fear we are going to need that.
>>>>>
>>>>>> Things like CPU page table updates, ring buffer accesses and FW
>>>>>> memcpy ? Is there other places ?
>>>>> Puh, good question. I have no idea.
>>>>>
>>>>>> Another point is that at this point the driver shouldn't access any
>>>>>> such buffers as we are at the process finishing the device.
>>>>>> AFAIK there is no page fault mechanism for kernel mappings so I
>>>>>> don't think there is anything else to do ?
>>>>> Well there is a page fault handler for kernel mappings, but that one
>>>>> just prints the stack trace into the system log and calls BUG(); :)
>>>>>
>>>>> Long story short we need to avoid any access to released pages after
>>>>> unplug. No matter if it's from the kernel or userspace.
>>>>
>>>> I was just about to start guarding with drm_dev_enter/exit CPU
>>>> accesses from kernel to GTT ot VRAM buffers but then i looked more in
>>>> the code
>>>> and seems like ttm_tt_unpopulate just deletes DMA mappings (for the
>>>> sake of device to main memory access). Kernel page table is not 
>>>> touched
>>>> until last bo refcount is dropped and the bo is released
>>>> (ttm_bo_release->destroy->amdgpu_bo_destroy->amdgpu_bo_kunmap). This
>>>> is both
>>>> for GTT BOs maped to kernel by kmap (or vmap) and for VRAM BOs mapped
>>>> by ioremap. So as i see it, nothing will bad will happen after we
>>>> unpopulate a BO while we still try to use a kernel mapping for it,
>>>> system memory pages backing GTT BOs are still mapped and not freed and
>>>> for
>>>> VRAM BOs same is for the IO physical ranges mapped into the kernel
>>>> page table since iounmap wasn't called yet.
>>> The problem is the system pages would be freed and if we kernel driver
>>> still happily write to them we are pretty much busted because we write
>>> to freed up memory.
>
>
> OK, i see i missed ttm_tt_unpopulate->..->ttm_pool_free which will 
> release
> the GTT BO pages. But then isn't there a problem in ttm_bo_release since
> ttm_bo_cleanup_memtype_use which also leads to pages release comes
> before bo->destroy which unmaps the pages from kernel page table ? Won't
> we have end up writing to freed memory in this time interval ? Don't we
> need to postpone pages freeing to after kernel page table unmapping ?

BOs are only destroyed when there is a guarantee that nobody is 
accessing them any more.

The problem here is that the pages as well as the VRAM can be 
immediately reused after the hotplug event.

>
>
>> Similar for vram, if this is actual hotunplug and then replug, there's
>> going to be a different device behind the same mmio bar range most
>> likely (the higher bridges all this have the same windows assigned),
>
>
> No idea how this actually works but if we haven't called iounmap yet
> doesn't it mean that those physical ranges that are still mapped into 
> page
> table should be reserved and cannot be reused for another
> device ? As a guess, maybe another subrange from the higher bridge's 
> total
> range will be allocated.

Nope, the PCIe subsystem doesn't care about any ioremap still active for 
a range when it is hotplugged.

>
>> and that's bad news if we keep using it for current drivers. So we
>> really have to point all these cpu ptes to some other place.
>
>
> We can't just unmap it without syncing against any in kernel accesses 
> to those buffers
> and since page faulting technique we use for user mapped buffers seems 
> to not be possible
> for kernel mapped buffers I am not sure how to do it gracefully...

We could try to replace the kmap with a dummy page under the hood, but 
that is extremely tricky.

Especially since BOs which are just 1 page in size could point to the 
linear mapping directly.

Christian.
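
The "replace the kmap with a dummy page under the hood" idea can be sketched as a
toy model. This is purely hypothetical (the `toy_*` names are invented, and TTM
does not do this today): on unplug, every outstanding kernel mapping is retargeted
at one scratch page, so a late write lands somewhere harmless instead of in freed
memory. The genuinely hard parts called out above -- doing the switch atomically
against concurrent users, and single-page BOs that alias the kernel's linear
mapping -- are not modeled here.

```c
enum { TOY_PAGE_SIZE = 64 };

/* One shared scratch page that absorbs accesses after unplug. */
static char toy_dummy_page[TOY_PAGE_SIZE];

/* What a bo_kmap()-style call handed out to the driver. */
struct toy_kmap {
	char *vaddr;
};

/* Called from the hotunplug path: point the mapping at the dummy page
 * instead of the (about to be freed) backing store. */
static void toy_kmap_neuter(struct toy_kmap *map)
{
	map->vaddr = toy_dummy_page;
}
```

A reader of the mapping would see stale/garbage data afterwards, which is the
accepted trade-off for a device that no longer exists.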

>
> Andrey
>
>
>> -Daniel
>>
>>> Christian.
>>>
>>>> I loaded the driver with vm_update_mode=3
>>>> meaning all VM updates done using CPU and hasn't seen any OOPs after
>>>> removing the device. I guess i can test it more by allocating GTT and
>>>> VRAM BOs
>>>> and trying to read/write to them after device is removed.
>>>>
>>>> Andrey
>>>>
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> Andrey
>>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx@lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>
>>
Daniel Vetter Dec. 16, 2020, 5:12 p.m. UTC | #25
On Wed, Dec 16, 2020 at 5:18 PM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 16.12.20 um 17:13 schrieb Andrey Grodzovsky:
> >
> > On 12/16/20 9:21 AM, Daniel Vetter wrote:
> >> On Wed, Dec 16, 2020 at 9:04 AM Christian König
> >> <ckoenig.leichtzumerken@gmail.com> wrote:
> >>> Am 15.12.20 um 21:18 schrieb Andrey Grodzovsky:
> >>>> [SNIP]
> >>>>>> While we can't control user application accesses to the mapped
> >>>>>> buffers explicitly and hence we use page fault rerouting
> >>>>>> I am thinking that in this  case we may be able to sprinkle
> >>>>>> drm_dev_enter/exit in any such sensitive place were we might
> >>>>>> CPU access a DMA buffer from the kernel ?
> >>>>> Yes, I fear we are going to need that.
> >>>>>
> >>>>>> Things like CPU page table updates, ring buffer accesses and FW
> >>>>>> memcpy ? Is there other places ?
> >>>>> Puh, good question. I have no idea.
> >>>>>
> >>>>>> Another point is that at this point the driver shouldn't access any
> >>>>>> such buffers as we are at the process finishing the device.
> >>>>>> AFAIK there is no page fault mechanism for kernel mappings so I
> >>>>>> don't think there is anything else to do ?
> >>>>> Well there is a page fault handler for kernel mappings, but that one
> >>>>> just prints the stack trace into the system log and calls BUG(); :)
> >>>>>
> >>>>> Long story short we need to avoid any access to released pages after
> >>>>> unplug. No matter if it's from the kernel or userspace.
> >>>>
> >>>> I was just about to start guarding with drm_dev_enter/exit CPU
> >>>> accesses from kernel to GTT ot VRAM buffers but then i looked more in
> >>>> the code
> >>>> and seems like ttm_tt_unpopulate just deletes DMA mappings (for the
> >>>> sake of device to main memory access). Kernel page table is not
> >>>> touched
> >>>> until last bo refcount is dropped and the bo is released
> >>>> (ttm_bo_release->destroy->amdgpu_bo_destroy->amdgpu_bo_kunmap). This
> >>>> is both
> >>>> for GTT BOs maped to kernel by kmap (or vmap) and for VRAM BOs mapped
> >>>> by ioremap. So as i see it, nothing will bad will happen after we
> >>>> unpopulate a BO while we still try to use a kernel mapping for it,
> >>>> system memory pages backing GTT BOs are still mapped and not freed and
> >>>> for
> >>>> VRAM BOs same is for the IO physical ranges mapped into the kernel
> >>>> page table since iounmap wasn't called yet.
> >>> The problem is the system pages would be freed and if we kernel driver
> >>> still happily write to them we are pretty much busted because we write
> >>> to freed up memory.
> >
> >
> > OK, i see i missed ttm_tt_unpopulate->..->ttm_pool_free which will
> > release
> > the GTT BO pages. But then isn't there a problem in ttm_bo_release since
> > ttm_bo_cleanup_memtype_use which also leads to pages release comes
> > before bo->destroy which unmaps the pages from kernel page table ? Won't
> > we have end up writing to freed memory in this time interval ? Don't we
> > need to postpone pages freeing to after kernel page table unmapping ?
>
> BOs are only destroyed when there is a guarantee that nobody is
> accessing them any more.
>
> The problem here is that the pages as well as the VRAM can be
> immediately reused after the hotplug event.
>
> >
> >
> >> Similar for vram, if this is actual hotunplug and then replug, there's
> >> going to be a different device behind the same mmio bar range most
> >> likely (the higher bridges all this have the same windows assigned),
> >
> >
> > No idea how this actually works but if we haven't called iounmap yet
> > doesn't it mean that those physical ranges that are still mapped into
> > page
> > table should be reserved and cannot be reused for another
> > device ? As a guess, maybe another subrange from the higher bridge's
> > total
> > range will be allocated.
>
> Nope, the PCIe subsystem doesn't care about any ioremap still active for
> a range when it is hotplugged.
>
> >
> >> and that's bad news if we keep using it for current drivers. So we
> >> really have to point all these cpu ptes to some other place.
> >
> >
> > We can't just unmap it without syncing against any in kernel accesses
> > to those buffers
> > and since page faulting technique we use for user mapped buffers seems
> > to not be possible
> > for kernel mapped buffers I am not sure how to do it gracefully...
>
> We could try to replace the kmap with a dummy page under the hood, but
> that is extremely tricky.
>
> Especially since BOs which are just 1 page in size could point to the
> linear mapping directly.

I think it's just more work. Essentially
- convert as much as possible of the kernel mappings to vmap_local,
which Thomas Zimmermann is rolling out. That way a dma_resv_lock will
serve as a barrier, and ofc any new vmap needs to fail or hand out a
dummy mapping.
- handle fbcon somehow. I think shutting it all down should work out.
- worst case keep the system backing storage around for shared dma-buf
until the other non-dynamic driver releases it. for vram we require
dynamic importers (and maybe it wasn't such a bright idea to allow
pinning of importer buffers, might need to revisit that).

Cheers, Daniel
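
The vmap_local idea in the first bullet can be modeled in a few lines. This is a
hedged sketch with invented `toy_*` names, not the real dma-buf API: a kernel
mapping is only valid while a lock (standing in for dma_resv_lock) is held, so
the hotunplug path can take that same lock as a barrier against in-flight
mappings, and any vmap attempted after unplug fails.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_bo {
	bool locked;		/* stands in for the BO's dma_resv lock */
	bool unplugged;
	char pages[16];		/* backing storage */
};

static void toy_resv_lock(struct toy_bo *bo)   { bo->locked = true; }
static void toy_resv_unlock(struct toy_bo *bo) { bo->locked = false; }

/* vmap_local: only valid under the lock; fails once the device is gone,
 * at which point the caller gets NULL (or, in the scheme above, could be
 * handed a dummy mapping instead). */
static char *toy_vmap_local(struct toy_bo *bo)
{
	if (!bo->locked || bo->unplugged)
		return NULL;
	return bo->pages;
}

/* Hotunplug: taking the lock serializes against any short-lived mapping
 * still in flight; once we own it, no vmap_local can be live. */
static void toy_unplug(struct toy_bo *bo)
{
	toy_resv_lock(bo);
	bo->unplugged = true;
	toy_resv_unlock(bo);
}
```

The key property is that mapping lifetime is strictly bounded by the lock, which
is exactly what makes the unplug barrier possible.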

>
> Christian.
>
> >
> > Andrey
> >
> >
> >> -Daniel
> >>
> >>> Christian.
> >>>
> >>>> I loaded the driver with vm_update_mode=3
> >>>> meaning all VM updates done using CPU and hasn't seen any OOPs after
> >>>> removing the device. I guess i can test it more by allocating GTT and
> >>>> VRAM BOs
> >>>> and trying to read/write to them after device is removed.
> >>>>
> >>>> Andrey
> >>>>
> >>>>
> >>>>> Regards,
> >>>>> Christian.
> >>>>>
> >>>>>> Andrey
> >>>>>
> >>>> _______________________________________________
> >>>> amd-gfx mailing list
> >>>> amd-gfx@lists.freedesktop.org
> >>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> >>>>
> >>
>
Daniel Vetter Dec. 16, 2020, 5:20 p.m. UTC | #26
On Wed, Dec 16, 2020 at 6:12 PM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> On Wed, Dec 16, 2020 at 5:18 PM Christian König
> <christian.koenig@amd.com> wrote:
> >
> > Am 16.12.20 um 17:13 schrieb Andrey Grodzovsky:
> > >
> > > On 12/16/20 9:21 AM, Daniel Vetter wrote:
> > >> On Wed, Dec 16, 2020 at 9:04 AM Christian König
> > >> <ckoenig.leichtzumerken@gmail.com> wrote:
> > >>> Am 15.12.20 um 21:18 schrieb Andrey Grodzovsky:
> > >>>> [SNIP]
> > >>>>>> While we can't control user application accesses to the mapped
> > >>>>>> buffers explicitly and hence we use page fault rerouting
> > >>>>>> I am thinking that in this  case we may be able to sprinkle
> > >>>>>> drm_dev_enter/exit in any such sensitive place were we might
> > >>>>>> CPU access a DMA buffer from the kernel ?
> > >>>>> Yes, I fear we are going to need that.
> > >>>>>
> > >>>>>> Things like CPU page table updates, ring buffer accesses and FW
> > >>>>>> memcpy ? Is there other places ?
> > >>>>> Puh, good question. I have no idea.
> > >>>>>
> > >>>>>> Another point is that at this point the driver shouldn't access any
> > >>>>>> such buffers as we are at the process finishing the device.
> > >>>>>> AFAIK there is no page fault mechanism for kernel mappings so I
> > >>>>>> don't think there is anything else to do ?
> > >>>>> Well there is a page fault handler for kernel mappings, but that one
> > >>>>> just prints the stack trace into the system log and calls BUG(); :)
> > >>>>>
> > >>>>> Long story short we need to avoid any access to released pages after
> > >>>>> unplug. No matter if it's from the kernel or userspace.
> > >>>>
> > >>>> I was just about to start guarding with drm_dev_enter/exit CPU
> > >>>> accesses from kernel to GTT ot VRAM buffers but then i looked more in
> > >>>> the code
> > >>>> and seems like ttm_tt_unpopulate just deletes DMA mappings (for the
> > >>>> sake of device to main memory access). Kernel page table is not
> > >>>> touched
> > >>>> until last bo refcount is dropped and the bo is released
> > >>>> (ttm_bo_release->destroy->amdgpu_bo_destroy->amdgpu_bo_kunmap). This
> > >>>> is both
> > >>>> for GTT BOs maped to kernel by kmap (or vmap) and for VRAM BOs mapped
> > >>>> by ioremap. So as i see it, nothing will bad will happen after we
> > >>>> unpopulate a BO while we still try to use a kernel mapping for it,
> > >>>> system memory pages backing GTT BOs are still mapped and not freed and
> > >>>> for
> > >>>> VRAM BOs same is for the IO physical ranges mapped into the kernel
> > >>>> page table since iounmap wasn't called yet.
> > >>> The problem is the system pages would be freed and if we kernel driver
> > >>> still happily write to them we are pretty much busted because we write
> > >>> to freed up memory.
> > >
> > >
> > > OK, i see i missed ttm_tt_unpopulate->..->ttm_pool_free which will
> > > release
> > > the GTT BO pages. But then isn't there a problem in ttm_bo_release since
> > > ttm_bo_cleanup_memtype_use which also leads to pages release comes
> > > before bo->destroy which unmaps the pages from kernel page table ? Won't
> > > we have end up writing to freed memory in this time interval ? Don't we
> > > need to postpone pages freeing to after kernel page table unmapping ?
> >
> > BOs are only destroyed when there is a guarantee that nobody is
> > accessing them any more.
> >
> > The problem here is that the pages as well as the VRAM can be
> > immediately reused after the hotplug event.
> >
> > >
> > >
> > >> Similar for vram, if this is actual hotunplug and then replug, there's
> > >> going to be a different device behind the same mmio bar range most
> > >> likely (the higher bridges all this have the same windows assigned),
> > >
> > >
> > > No idea how this actually works but if we haven't called iounmap yet
> > > doesn't it mean that those physical ranges that are still mapped into
> > > page
> > > table should be reserved and cannot be reused for another
> > > device ? As a guess, maybe another subrange from the higher bridge's
> > > total
> > > range will be allocated.
> >
> > Nope, the PCIe subsystem doesn't care about any ioremap still active for
> > a range when it is hotplugged.
> >
> > >
> > >> and that's bad news if we keep using it for current drivers. So we
> > >> really have to point all these cpu ptes to some other place.
> > >
> > >
> > > We can't just unmap it without syncing against any in kernel accesses
> > > to those buffers
> > > and since page faulting technique we use for user mapped buffers seems
> > > to not be possible
> > > for kernel mapped buffers I am not sure how to do it gracefully...
> >
> > We could try to replace the kmap with a dummy page under the hood, but
> > that is extremely tricky.
> >
> > Especially since BOs which are just 1 page in size could point to the
> > linear mapping directly.
>
> I think it's just more work. Essentially
> - convert as much as possible of the kernel mappings to vmap_local,
> which Thomas Zimmermann is rolling out. That way a dma_resv_lock will
> serve as a barrier, and ofc any new vmap needs to fail or hand out a
> dummy mapping.
> - handle fbcon somehow. I think shutting it all down should work out.

Oh also for fbdev I think best to switch over to
drm_fbdev_generic_setup(). That should handle all the lifetime fun
correctly already, minus the vram complication. So at least fewer
oopses for other reasons :-)
-Daniel

> - worst case keep the system backing storage around for shared dma-buf
> until the other non-dynamic driver releases it. for vram we require
> dynamic importers (and maybe it wasn't such a bright idea to allow
> pinning of importer buffers, might need to revisit that).
>
> Cheers, Daniel
>
> >
> > Christian.
> >
> > >
> > > Andrey
> > >
> > >
> > >> -Daniel
> > >>
> > >>> Christian.
> > >>>
> > >>>> I loaded the driver with vm_update_mode=3
> > >>>> meaning all VM updates done using CPU and hasn't seen any OOPs after
> > >>>> removing the device. I guess i can test it more by allocating GTT and
> > >>>> VRAM BOs
> > >>>> and trying to read/write to them after device is removed.
> > >>>>
> > >>>> Andrey
> > >>>>
> > >>>>
> > >>>>> Regards,
> > >>>>> Christian.
> > >>>>>
> > >>>>>> Andrey
> > >>>>>
> > >>>> _______________________________________________
> > >>>> amd-gfx mailing list
> > >>>> amd-gfx@lists.freedesktop.org
> > >>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> > >>>>
> > >>
> >
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
Andrey Grodzovsky Dec. 16, 2020, 6:26 p.m. UTC | #27
On 12/16/20 12:12 PM, Daniel Vetter wrote:
> On Wed, Dec 16, 2020 at 5:18 PM Christian König
> <christian.koenig@amd.com> wrote:
>> Am 16.12.20 um 17:13 schrieb Andrey Grodzovsky:
>>> On 12/16/20 9:21 AM, Daniel Vetter wrote:
>>>> On Wed, Dec 16, 2020 at 9:04 AM Christian König
>>>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>>>> Am 15.12.20 um 21:18 schrieb Andrey Grodzovsky:
>>>>>> [SNIP]
>>>>>>>> While we can't control user application accesses to the mapped
>>>>>>>> buffers explicitly and hence we use page fault rerouting
>>>>>>>> I am thinking that in this  case we may be able to sprinkle
>>>>>>>> drm_dev_enter/exit in any such sensitive place were we might
>>>>>>>> CPU access a DMA buffer from the kernel ?
>>>>>>> Yes, I fear we are going to need that.
>>>>>>>
>>>>>>>> Things like CPU page table updates, ring buffer accesses and FW
>>>>>>>> memcpy ? Is there other places ?
>>>>>>> Puh, good question. I have no idea.
>>>>>>>
>>>>>>>> Another point is that at this point the driver shouldn't access any
>>>>>>>> such buffers as we are at the process finishing the device.
>>>>>>>> AFAIK there is no page fault mechanism for kernel mappings so I
>>>>>>>> don't think there is anything else to do ?
>>>>>>> Well there is a page fault handler for kernel mappings, but that one
>>>>>>> just prints the stack trace into the system log and calls BUG(); :)
>>>>>>>
>>>>>>> Long story short we need to avoid any access to released pages after
>>>>>>> unplug. No matter if it's from the kernel or userspace.
>>>>>> I was just about to start guarding with drm_dev_enter/exit CPU
>>>>>> accesses from kernel to GTT ot VRAM buffers but then i looked more in
>>>>>> the code
>>>>>> and seems like ttm_tt_unpopulate just deletes DMA mappings (for the
>>>>>> sake of device to main memory access). Kernel page table is not
>>>>>> touched
>>>>>> until last bo refcount is dropped and the bo is released
>>>>>> (ttm_bo_release->destroy->amdgpu_bo_destroy->amdgpu_bo_kunmap). This
>>>>>> is both
>>>>>> for GTT BOs maped to kernel by kmap (or vmap) and for VRAM BOs mapped
>>>>>> by ioremap. So as i see it, nothing will bad will happen after we
>>>>>> unpopulate a BO while we still try to use a kernel mapping for it,
>>>>>> system memory pages backing GTT BOs are still mapped and not freed and
>>>>>> for
>>>>>> VRAM BOs same is for the IO physical ranges mapped into the kernel
>>>>>> page table since iounmap wasn't called yet.
>>>>> The problem is the system pages would be freed and if we kernel driver
>>>>> still happily write to them we are pretty much busted because we write
>>>>> to freed up memory.
>>>
>>> OK, i see i missed ttm_tt_unpopulate->..->ttm_pool_free which will
>>> release
>>> the GTT BO pages. But then isn't there a problem in ttm_bo_release since
>>> ttm_bo_cleanup_memtype_use which also leads to pages release comes
>>> before bo->destroy which unmaps the pages from kernel page table ? Won't
>>> we have end up writing to freed memory in this time interval ? Don't we
>>> need to postpone pages freeing to after kernel page table unmapping ?
>> BOs are only destroyed when there is a guarantee that nobody is
>> accessing them any more.
>>
>> The problem here is that the pages as well as the VRAM can be
>> immediately reused after the hotplug event.
>>
>>>
>>>> Similar for vram, if this is actual hotunplug and then replug, there's
>>>> going to be a different device behind the same mmio bar range most
>>>> likely (the higher bridges all this have the same windows assigned),
>>>
>>> No idea how this actually works but if we haven't called iounmap yet
>>> doesn't it mean that those physical ranges that are still mapped into
>>> page
>>> table should be reserved and cannot be reused for another
>>> device ? As a guess, maybe another subrange from the higher bridge's
>>> total
>>> range will be allocated.
>> Nope, the PCIe subsystem doesn't care about any ioremap still active for
>> a range when it is hotplugged.
>>
>>>> and that's bad news if we keep using it for current drivers. So we
>>>> really have to point all these cpu ptes to some other place.
>>>
>>> We can't just unmap it without syncing against any in kernel accesses
>>> to those buffers
>>> and since page faulting technique we use for user mapped buffers seems
>>> to not be possible
>>> for kernel mapped buffers I am not sure how to do it gracefully...
>> We could try to replace the kmap with a dummy page under the hood, but
>> that is extremely tricky.
>>
>> Especially since BOs which are just 1 page in size could point to the
>> linear mapping directly.
> I think it's just more work. Essentially
> - convert as much as possible of the kernel mappings to vmap_local,
> which Thomas Zimmermann is rolling out. That way a dma_resv_lock will
> serve as a barrier, and ofc any new vmap needs to fail or hand out a
> dummy mapping.

I read those patches. I am not sure how this helps protect against accesses
to released backing pages, or to IO physical ranges of a BO that is already
mapped at the time of the unplug event?

Andrey


> - handle fbcon somehow. I think shutting it all down should work out.
> - worst case keep the system backing storage around for shared dma-buf
> until the other non-dynamic driver releases it. for vram we require
> dynamic importers (and maybe it wasn't such a bright idea to allow
> pinning of importer buffers, might need to revisit that).
>
> Cheers, Daniel
>
>> Christian.
>>
>>> Andrey
>>>
>>>
>>>> -Daniel
>>>>
>>>>> Christian.
>>>>>
>>>>>> I loaded the driver with vm_update_mode=3
>>>>>> meaning all VM updates done using CPU and hasn't seen any OOPs after
>>>>>> removing the device. I guess i can test it more by allocating GTT and
>>>>>> VRAM BOs
>>>>>> and trying to read/write to them after device is removed.
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>> Andrey
>>>>>> _______________________________________________
>>>>>> amd-gfx mailing list
>>>>>> amd-gfx@lists.freedesktop.org
>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>
>
Daniel Vetter Dec. 16, 2020, 11:15 p.m. UTC | #28
On Wed, Dec 16, 2020 at 7:26 PM Andrey Grodzovsky
<Andrey.Grodzovsky@amd.com> wrote:
>
>
> On 12/16/20 12:12 PM, Daniel Vetter wrote:
> > On Wed, Dec 16, 2020 at 5:18 PM Christian König
> > <christian.koenig@amd.com> wrote:
> >> Am 16.12.20 um 17:13 schrieb Andrey Grodzovsky:
> >>> On 12/16/20 9:21 AM, Daniel Vetter wrote:
> >>>> On Wed, Dec 16, 2020 at 9:04 AM Christian König
> >>>> <ckoenig.leichtzumerken@gmail.com> wrote:
> >>>>> Am 15.12.20 um 21:18 schrieb Andrey Grodzovsky:
> >>>>>> [SNIP]
> >>>>>>>> While we can't control user application accesses to the mapped
> >>>>>>>> buffers explicitly and hence we use page fault rerouting
> >>>>>>>> I am thinking that in this  case we may be able to sprinkle
> >>>>>>>> drm_dev_enter/exit in any such sensitive place were we might
> >>>>>>>> CPU access a DMA buffer from the kernel ?
> >>>>>>> Yes, I fear we are going to need that.
> >>>>>>>
> >>>>>>>> Things like CPU page table updates, ring buffer accesses and FW
> >>>>>>>> memcpy ? Is there other places ?
> >>>>>>> Puh, good question. I have no idea.
> >>>>>>>
> >>>>>>>> Another point is that at this point the driver shouldn't access any
> >>>>>>>> such buffers as we are at the process finishing the device.
> >>>>>>>> AFAIK there is no page fault mechanism for kernel mappings so I
> >>>>>>>> don't think there is anything else to do ?
> >>>>>>> Well there is a page fault handler for kernel mappings, but that one
> >>>>>>> just prints the stack trace into the system log and calls BUG(); :)
> >>>>>>>
> >>>>>>> Long story short we need to avoid any access to released pages after
> >>>>>>> unplug. No matter if it's from the kernel or userspace.
> >>>>>> I was just about to start guarding with drm_dev_enter/exit CPU
> >>>>>> accesses from kernel to GTT ot VRAM buffers but then i looked more in
> >>>>>> the code
> >>>>>> and seems like ttm_tt_unpopulate just deletes DMA mappings (for the
> >>>>>> sake of device to main memory access). Kernel page table is not
> >>>>>> touched
> >>>>>> until last bo refcount is dropped and the bo is released
> >>>>>> (ttm_bo_release->destroy->amdgpu_bo_destroy->amdgpu_bo_kunmap). This
> >>>>>> is both
> >>>>>> for GTT BOs maped to kernel by kmap (or vmap) and for VRAM BOs mapped
> >>>>>> by ioremap. So as i see it, nothing will bad will happen after we
> >>>>>> unpopulate a BO while we still try to use a kernel mapping for it,
> >>>>>> system memory pages backing GTT BOs are still mapped and not freed and
> >>>>>> for
> >>>>>> VRAM BOs same is for the IO physical ranges mapped into the kernel
> >>>>>> page table since iounmap wasn't called yet.
> >>>>> The problem is the system pages would be freed and if we kernel driver
> >>>>> still happily write to them we are pretty much busted because we write
> >>>>> to freed up memory.
> >>>
> >>> OK, i see i missed ttm_tt_unpopulate->..->ttm_pool_free which will
> >>> release
> >>> the GTT BO pages. But then isn't there a problem in ttm_bo_release since
> >>> ttm_bo_cleanup_memtype_use which also leads to pages release comes
> >>> before bo->destroy which unmaps the pages from kernel page table ? Won't
> >>> we have end up writing to freed memory in this time interval ? Don't we
> >>> need to postpone pages freeing to after kernel page table unmapping ?
> >> BOs are only destroyed when there is a guarantee that nobody is
> >> accessing them any more.
> >>
> >> The problem here is that the pages as well as the VRAM can be
> >> immediately reused after the hotplug event.
> >>
> >>>
> >>>> Similar for vram, if this is actual hotunplug and then replug, there's
> >>>> going to be a different device behind the same mmio bar range most
> >>>> likely (the higher bridges all this have the same windows assigned),
> >>>
> >>> No idea how this actually works but if we haven't called iounmap yet
> >>> doesn't it mean that those physical ranges that are still mapped into
> >>> page
> >>> table should be reserved and cannot be reused for another
> >>> device ? As a guess, maybe another subrange from the higher bridge's
> >>> total
> >>> range will be allocated.
> >> Nope, the PCIe subsystem doesn't care about any ioremap still active for
> >> a range when it is hotplugged.
> >>
> >>>> and that's bad news if we keep using it for current drivers. So we
> >>>> really have to point all these cpu ptes to some other place.
> >>>
> >>> We can't just unmap it without syncing against any in kernel accesses
> >>> to those buffers
> >>> and since page faulting technique we use for user mapped buffers seems
> >>> to not be possible
> >>> for kernel mapped buffers I am not sure how to do it gracefully...
> >> We could try to replace the kmap with a dummy page under the hood, but
> >> that is extremely tricky.
> >>
> >> Especially since BOs which are just 1 page in size could point to the
> >> linear mapping directly.
> > I think it's just more work. Essentially
> > - convert as much as possible of the kernel mappings to vmap_local,
> > which Thomas Zimmermann is rolling out. That way a dma_resv_lock will
> > serve as a barrier, and ofc any new vmap needs to fail or hand out a
> > dummy mapping.
>
> Read those patches. I am not sure how this helps with protecting
> against accesses to released backing pages or IO physical ranges of BO
> which is already mapped during the unplug event ?

By eliminating such users, and replacing them with local maps which
are strictly bound in how long they can exist (and hence we can
serialize against them finishing in our hotunplug code). It doesn't
solve all your problems, but it's a tool to get there.
-Daniel

> Andrey
>
>
> > - handle fbcon somehow. I think shutting it all down should work out.
> > - worst case keep the system backing storage around for shared dma-buf
> > until the other non-dynamic driver releases it. for vram we require
> > dynamic importers (and maybe it wasn't such a bright idea to allow
> > pinning of importer buffers, might need to revisit that).
> >
> > Cheers, Daniel
> >
> >> Christian.
> >>
> >>> Andrey
> >>>
> >>>
> >>>> -Daniel
> >>>>
> >>>>> Christian.
> >>>>>
> >>>>>> I loaded the driver with vm_update_mode=3
> >>>>>> meaning all VM updates done using CPU and hasn't seen any OOPs after
> >>>>>> removing the device. I guess i can test it more by allocating GTT and
> >>>>>> VRAM BOs
> >>>>>> and trying to read/write to them after device is removed.
> >>>>>>
> >>>>>> Andrey
> >>>>>>
> >>>>>>
> >>>>>>> Regards,
> >>>>>>> Christian.
> >>>>>>>
> >>>>>>>> Andrey
> >>>>>> _______________________________________________
> >>>>>> amd-gfx mailing list
> >>>>>> amd-gfx@lists.freedesktop.org
> >>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> >>>>>>
> >
Andrey Grodzovsky Dec. 17, 2020, 12:20 a.m. UTC | #29
On 12/16/20 6:15 PM, Daniel Vetter wrote:
> On Wed, Dec 16, 2020 at 7:26 PM Andrey Grodzovsky
> <Andrey.Grodzovsky@amd.com> wrote:
>>
>> On 12/16/20 12:12 PM, Daniel Vetter wrote:
>>> On Wed, Dec 16, 2020 at 5:18 PM Christian König
>>> <christian.koenig@amd.com> wrote:
>>>> Am 16.12.20 um 17:13 schrieb Andrey Grodzovsky:
>>>>> On 12/16/20 9:21 AM, Daniel Vetter wrote:
>>>>>> On Wed, Dec 16, 2020 at 9:04 AM Christian König
>>>>>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>>>>>> Am 15.12.20 um 21:18 schrieb Andrey Grodzovsky:
>>>>>>>> [SNIP]
>>>>>>>>>> While we can't control user application accesses to the mapped
>>>>>>>>>> buffers explicitly and hence we use page fault rerouting
>>>>>>>>>> I am thinking that in this  case we may be able to sprinkle
>>>>>>>>>> drm_dev_enter/exit in any such sensitive place were we might
>>>>>>>>>> CPU access a DMA buffer from the kernel ?
>>>>>>>>> Yes, I fear we are going to need that.
>>>>>>>>>
>>>>>>>>>> Things like CPU page table updates, ring buffer accesses and FW
>>>>>>>>>> memcpy ? Is there other places ?
>>>>>>>>> Puh, good question. I have no idea.
>>>>>>>>>
>>>>>>>>>> Another point is that at this point the driver shouldn't access any
>>>>>>>>>> such buffers as we are at the process finishing the device.
>>>>>>>>>> AFAIK there is no page fault mechanism for kernel mappings so I
>>>>>>>>>> don't think there is anything else to do ?
>>>>>>>>> Well there is a page fault handler for kernel mappings, but that one
>>>>>>>>> just prints the stack trace into the system log and calls BUG(); :)
>>>>>>>>>
>>>>>>>>> Long story short we need to avoid any access to released pages after
>>>>>>>>> unplug. No matter if it's from the kernel or userspace.
>>>>>>>> I was just about to start guarding with drm_dev_enter/exit CPU
>>>>>>>> accesses from kernel to GTT ot VRAM buffers but then i looked more in
>>>>>>>> the code
>>>>>>>> and seems like ttm_tt_unpopulate just deletes DMA mappings (for the
>>>>>>>> sake of device to main memory access). Kernel page table is not
>>>>>>>> touched
>>>>>>>> until last bo refcount is dropped and the bo is released
>>>>>>>> (ttm_bo_release->destroy->amdgpu_bo_destroy->amdgpu_bo_kunmap). This
>>>>>>>> is both
>>>>>>>> for GTT BOs maped to kernel by kmap (or vmap) and for VRAM BOs mapped
>>>>>>>> by ioremap. So as i see it, nothing will bad will happen after we
>>>>>>>> unpopulate a BO while we still try to use a kernel mapping for it,
>>>>>>>> system memory pages backing GTT BOs are still mapped and not freed and
>>>>>>>> for
>>>>>>>> VRAM BOs same is for the IO physical ranges mapped into the kernel
>>>>>>>> page table since iounmap wasn't called yet.
>>>>>>> The problem is the system pages would be freed and if we kernel driver
>>>>>>> still happily write to them we are pretty much busted because we write
>>>>>>> to freed up memory.
>>>>> OK, i see i missed ttm_tt_unpopulate->..->ttm_pool_free which will
>>>>> release
>>>>> the GTT BO pages. But then isn't there a problem in ttm_bo_release since
>>>>> ttm_bo_cleanup_memtype_use which also leads to pages release comes
>>>>> before bo->destroy which unmaps the pages from kernel page table ? Won't
>>>>> we have end up writing to freed memory in this time interval ? Don't we
>>>>> need to postpone pages freeing to after kernel page table unmapping ?
>>>> BOs are only destroyed when there is a guarantee that nobody is
>>>> accessing them any more.
>>>>
>>>> The problem here is that the pages as well as the VRAM can be
>>>> immediately reused after the hotplug event.
>>>>
>>>>>> Similar for vram, if this is actual hotunplug and then replug, there's
>>>>>> going to be a different device behind the same mmio bar range most
>>>>>> likely (the higher bridges all this have the same windows assigned),
>>>>> No idea how this actually works but if we haven't called iounmap yet
>>>>> doesn't it mean that those physical ranges that are still mapped into
>>>>> page
>>>>> table should be reserved and cannot be reused for another
>>>>> device ? As a guess, maybe another subrange from the higher bridge's
>>>>> total
>>>>> range will be allocated.
>>>> Nope, the PCIe subsystem doesn't care about any ioremap still active for
>>>> a range when it is hotplugged.
>>>>
>>>>>> and that's bad news if we keep using it for current drivers. So we
>>>>>> really have to point all these cpu ptes to some other place.
>>>>> We can't just unmap it without syncing against any in kernel accesses
>>>>> to those buffers
>>>>> and since page faulting technique we use for user mapped buffers seems
>>>>> to not be possible
>>>>> for kernel mapped buffers I am not sure how to do it gracefully...
>>>> We could try to replace the kmap with a dummy page under the hood, but
>>>> that is extremely tricky.
>>>>
>>>> Especially since BOs which are just 1 page in size could point to the
>>>> linear mapping directly.
>>> I think it's just more work. Essentially
>>> - convert as much as possible of the kernel mappings to vmap_local,
>>> which Thomas Zimmermann is rolling out. That way a dma_resv_lock will
>>> serve as a barrier, and ofc any new vmap needs to fail or hand out a
>>> dummy mapping.
>> Read those patches. I am not sure how this helps with protecting
>> against accesses to released backing pages or IO physical ranges of BO
>> which is already mapped during the unplug event ?
> By eliminating such users, and replacing them with local maps which
> are strictly bound in how long they can exist (and hence we can
> serialize against them finishing in our hotunplug code).

Not sure I see how serializing against BO map/unmap helps - our problem, as you
described it, is that once
the device is extracted, something else can quickly take its place in the PCI
topology
and get assigned the same physical IO ranges, and then our driver will start accessing this
new device because our 'zombie' BOs are still pointing to those ranges.

Another point regarding serializing - the problem is that some of those BOs are
very long lived. Take for example the HW command
ring buffer Christian mentioned before
(amdgpu_ring_init->amdgpu_bo_create_kernel): its life span
is basically the entire time the device exists, and it's destroyed only in the
SW fini stage (when the last drm_dev
reference is dropped). So should I grab its dma_resv_lock from the
amdgpu_pci_remove code and wait
for it to be unmapped before proceeding with the PCI remove code? This can take
unbounded time, and that's why I don't understand
how serializing will help.

Andrey


> It doesn't
> solve all your problems, but it's a tool to get there.
> -Daniel
>
>> Andrey
>>
>>
>>> - handle fbcon somehow. I think shutting it all down should work out.
>>> - worst case keep the system backing storage around for shared dma-buf
>>> until the other non-dynamic driver releases it. for vram we require
>>> dynamic importers (and maybe it wasn't such a bright idea to allow
>>> pinning of importer buffers, might need to revisit that).
>>>
>>> Cheers, Daniel
>>>
>>>> Christian.
>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>> -Daniel
>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>>> I loaded the driver with vm_update_mode=3
>>>>>>>> meaning all VM updates done using CPU and hasn't seen any OOPs after
>>>>>>>> removing the device. I guess i can test it more by allocating GTT and
>>>>>>>> VRAM BOs
>>>>>>>> and trying to read/write to them after device is removed.
>>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>> Andrey
>>>>>>>>
>
>
Daniel Vetter Dec. 17, 2020, 12:01 p.m. UTC | #30
On Wed, Dec 16, 2020 at 07:20:02PM -0500, Andrey Grodzovsky wrote:
> 
> On 12/16/20 6:15 PM, Daniel Vetter wrote:
> > On Wed, Dec 16, 2020 at 7:26 PM Andrey Grodzovsky
> > <Andrey.Grodzovsky@amd.com> wrote:
> > > 
> > > On 12/16/20 12:12 PM, Daniel Vetter wrote:
> > > > On Wed, Dec 16, 2020 at 5:18 PM Christian König
> > > > <christian.koenig@amd.com> wrote:
> > > > > Am 16.12.20 um 17:13 schrieb Andrey Grodzovsky:
> > > > > > On 12/16/20 9:21 AM, Daniel Vetter wrote:
> > > > > > > On Wed, Dec 16, 2020 at 9:04 AM Christian König
> > > > > > > <ckoenig.leichtzumerken@gmail.com> wrote:
> > > > > > > > Am 15.12.20 um 21:18 schrieb Andrey Grodzovsky:
> > > > > > > > > [SNIP]
> > > > > > > > > > > While we can't control user application accesses to the mapped
> > > > > > > > > > > buffers explicitly and hence we use page fault rerouting
> > > > > > > > > > > I am thinking that in this  case we may be able to sprinkle
> > > > > > > > > > > drm_dev_enter/exit in any such sensitive place were we might
> > > > > > > > > > > CPU access a DMA buffer from the kernel ?
> > > > > > > > > > Yes, I fear we are going to need that.
> > > > > > > > > > 
> > > > > > > > > > > Things like CPU page table updates, ring buffer accesses and FW
> > > > > > > > > > > memcpy ? Is there other places ?
> > > > > > > > > > Puh, good question. I have no idea.
> > > > > > > > > > 
> > > > > > > > > > > Another point is that at this point the driver shouldn't access any
> > > > > > > > > > > such buffers as we are at the process finishing the device.
> > > > > > > > > > > AFAIK there is no page fault mechanism for kernel mappings so I
> > > > > > > > > > > don't think there is anything else to do ?
> > > > > > > > > > Well there is a page fault handler for kernel mappings, but that one
> > > > > > > > > > just prints the stack trace into the system log and calls BUG(); :)
> > > > > > > > > > 
> > > > > > > > > > Long story short we need to avoid any access to released pages after
> > > > > > > > > > unplug. No matter if it's from the kernel or userspace.
> > > > > > > > > I was just about to start guarding with drm_dev_enter/exit CPU
> > > > > > > > > accesses from kernel to GTT ot VRAM buffers but then i looked more in
> > > > > > > > > the code
> > > > > > > > > and seems like ttm_tt_unpopulate just deletes DMA mappings (for the
> > > > > > > > > sake of device to main memory access). Kernel page table is not
> > > > > > > > > touched
> > > > > > > > > until last bo refcount is dropped and the bo is released
> > > > > > > > > (ttm_bo_release->destroy->amdgpu_bo_destroy->amdgpu_bo_kunmap). This
> > > > > > > > > is both
> > > > > > > > > for GTT BOs maped to kernel by kmap (or vmap) and for VRAM BOs mapped
> > > > > > > > > by ioremap. So as i see it, nothing will bad will happen after we
> > > > > > > > > unpopulate a BO while we still try to use a kernel mapping for it,
> > > > > > > > > system memory pages backing GTT BOs are still mapped and not freed and
> > > > > > > > > for
> > > > > > > > > VRAM BOs same is for the IO physical ranges mapped into the kernel
> > > > > > > > > page table since iounmap wasn't called yet.
> > > > > > > > The problem is the system pages would be freed and if we kernel driver
> > > > > > > > still happily write to them we are pretty much busted because we write
> > > > > > > > to freed up memory.
> > > > > > OK, i see i missed ttm_tt_unpopulate->..->ttm_pool_free which will
> > > > > > release
> > > > > > the GTT BO pages. But then isn't there a problem in ttm_bo_release since
> > > > > > ttm_bo_cleanup_memtype_use which also leads to pages release comes
> > > > > > before bo->destroy which unmaps the pages from kernel page table ? Won't
> > > > > > we have end up writing to freed memory in this time interval ? Don't we
> > > > > > need to postpone pages freeing to after kernel page table unmapping ?
> > > > > BOs are only destroyed when there is a guarantee that nobody is
> > > > > accessing them any more.
> > > > > 
> > > > > The problem here is that the pages as well as the VRAM can be
> > > > > immediately reused after the hotplug event.
> > > > > 
> > > > > > > Similar for vram, if this is actual hotunplug and then replug, there's
> > > > > > > going to be a different device behind the same mmio bar range most
> > > > > > > likely (the higher bridges all this have the same windows assigned),
> > > > > > No idea how this actually works but if we haven't called iounmap yet
> > > > > > doesn't it mean that those physical ranges that are still mapped into
> > > > > > page
> > > > > > table should be reserved and cannot be reused for another
> > > > > > device ? As a guess, maybe another subrange from the higher bridge's
> > > > > > total
> > > > > > range will be allocated.
> > > > > Nope, the PCIe subsystem doesn't care about any ioremap still active for
> > > > > a range when it is hotplugged.
> > > > > 
> > > > > > > and that's bad news if we keep using it for current drivers. So we
> > > > > > > really have to point all these cpu ptes to some other place.
> > > > > > We can't just unmap it without syncing against any in kernel accesses
> > > > > > to those buffers
> > > > > > and since page faulting technique we use for user mapped buffers seems
> > > > > > to not be possible
> > > > > > for kernel mapped buffers I am not sure how to do it gracefully...
> > > > > We could try to replace the kmap with a dummy page under the hood, but
> > > > > that is extremely tricky.
> > > > > 
> > > > > Especially since BOs which are just 1 page in size could point to the
> > > > > linear mapping directly.
> > > > I think it's just more work. Essentially
> > > > - convert as much as possible of the kernel mappings to vmap_local,
> > > > which Thomas Zimmermann is rolling out. That way a dma_resv_lock will
> > > > serve as a barrier, and ofc any new vmap needs to fail or hand out a
> > > > dummy mapping.
> > > Read those patches. I am not sure how this helps with protecting
> > > against accesses to released backing pages or IO physical ranges of BO
> > > which is already mapped during the unplug event ?
> > By eliminating such users, and replacing them with local maps which
> > are strictly bound in how long they can exist (and hence we can
> > serialize against them finishing in our hotunplug code).
> 
> Not sure I see how serializing against BO map/unmap helps -  our problem as
> you described is that once
> device is extracted and then something else quickly takes it's place in the
> PCI topology
> and gets assigned same physical IO ranges, then our driver will start accessing this
> new device because our 'zombie' BOs are still pointing to those ranges.

Until your driver's remove callback is finished, the ranges stay reserved.
If that's not the case, then hotunplug would be fundamentally impossible
to handle correctly.

Of course all the mmio actions will time out, so it might take some time
to get through it all.

> Another point regarding serializing - problem  is that some of those BOs are
> very long lived, take for example the HW command
> ring buffer Christian mentioned before -
> (amdgpu_ring_init->amdgpu_bo_create_kernel), it's life span
> is basically for the entire time the device exists, it's destroyed only in
> the SW fini stage (when last drm_dev
> reference is dropped) and so should I grab it's dma_resv_lock from
> amdgpu_pci_remove code and wait
> for it to be unmapped before proceeding with the PCI remove code ? This can
> take unbound time and that why I don't understand
> how serializing will help.

Uh, you need to untangle that. After hw cleanup is done no one is allowed
to touch that ringbuffer bo anymore from the kernel. That's what the
drm_dev_enter/exit guards are for. Like you say, we can't wait for all sw
references to disappear.

The vmap_local is for mappings done by other drivers, through the dma-buf
interface (where "other drivers" can include fbdev/fbcon, if you use the
generic helpers).
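To make the idea concrete, here is a toy userspace model of the "a local map only lives while the resv lock is held" pattern (all names here are illustrative stand-ins, not the actual dma_resv/vmap_local API):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: local maps may only exist while the (fake) resv lock is
 * held, so whoever takes the lock knows no mapping is live.  These
 * names are stand-ins, not the real dma_resv/vmap_local API. */
struct fake_bo {
	bool resv_locked;
	int live_vmaps;
	char backing[16];
};

void fake_resv_lock(struct fake_bo *bo)   { bo->resv_locked = true; }
void fake_resv_unlock(struct fake_bo *bo) { bo->resv_locked = false; }

/* vmap_local: legal only while the resv lock is held. */
char *fake_vmap_local(struct fake_bo *bo)
{
	assert(bo->resv_locked);
	bo->live_vmaps++;
	return bo->backing;
}

void fake_vunmap_local(struct fake_bo *bo)
{
	assert(bo->resv_locked);
	bo->live_vmaps--;
}

/* Hotunplug side: taking the same lock serializes against every local
 * map, because a map cannot outlive the locked section it was made in. */
bool fake_hotunplug_quiescent(struct fake_bo *bo)
{
	bool quiescent;

	fake_resv_lock(bo);
	quiescent = (bo->live_vmaps == 0);
	fake_resv_unlock(bo);
	return quiescent;
}
```

The point of the shape is that the hotunplug path never has to wait an unbounded time: it only waits for the currently held locked sections, each of which is short by construction.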
-Daniel

> 
> Andrey
> 
> 
> > It doesn't
> > solve all your problems, but it's a tool to get there.
> > -Daniel
> > 
> > > Andrey
> > > 
> > > 
> > > > - handle fbcon somehow. I think shutting it all down should work out.
> > > > - worst case keep the system backing storage around for shared dma-buf
> > > > until the other non-dynamic driver releases it. for vram we require
> > > > dynamic importers (and maybe it wasn't such a bright idea to allow
> > > > pinning of importer buffers, might need to revisit that).
> > > > 
> > > > Cheers, Daniel
> > > > 
> > > > > Christian.
> > > > > 
> > > > > > Andrey
> > > > > > 
> > > > > > 
> > > > > > > -Daniel
> > > > > > > 
> > > > > > > > Christian.
> > > > > > > > 
> > > > > > > > > I loaded the driver with vm_update_mode=3
> > > > > > > > > meaning all VM updates done using CPU and hasn't seen any OOPs after
> > > > > > > > > removing the device. I guess i can test it more by allocating GTT and
> > > > > > > > > VRAM BOs
> > > > > > > > > and trying to read/write to them after device is removed.
> > > > > > > > > 
> > > > > > > > > Andrey
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > Regards,
> > > > > > > > > > Christian.
> > > > > > > > > > 
> > > > > > > > > > > Andrey
> > > > > > > > > 
> > 
> >
Andrey Grodzovsky Dec. 17, 2020, 7:19 p.m. UTC | #31
On 12/17/20 7:01 AM, Daniel Vetter wrote:
> On Wed, Dec 16, 2020 at 07:20:02PM -0500, Andrey Grodzovsky wrote:
>> On 12/16/20 6:15 PM, Daniel Vetter wrote:
>>> On Wed, Dec 16, 2020 at 7:26 PM Andrey Grodzovsky
>>> <Andrey.Grodzovsky@amd.com> wrote:
>>>> On 12/16/20 12:12 PM, Daniel Vetter wrote:
>>>>> On Wed, Dec 16, 2020 at 5:18 PM Christian König
>>>>> <christian.koenig@amd.com> wrote:
>>>>>> Am 16.12.20 um 17:13 schrieb Andrey Grodzovsky:
>>>>>>> On 12/16/20 9:21 AM, Daniel Vetter wrote:
>>>>>>>> On Wed, Dec 16, 2020 at 9:04 AM Christian König
>>>>>>>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>>>>>>>> Am 15.12.20 um 21:18 schrieb Andrey Grodzovsky:
>>>>>>>>>> [SNIP]
>>>>>>>>>>>> While we can't control user application accesses to the mapped
>>>>>>>>>>>> buffers explicitly and hence we use page fault rerouting
>>>>>>>>>>>> I am thinking that in this  case we may be able to sprinkle
>>>>>>>>>>>> drm_dev_enter/exit in any such sensitive place were we might
>>>>>>>>>>>> CPU access a DMA buffer from the kernel ?
>>>>>>>>>>> Yes, I fear we are going to need that.
>>>>>>>>>>>
>>>>>>>>>>>> Things like CPU page table updates, ring buffer accesses and FW
>>>>>>>>>>>> memcpy ? Is there other places ?
>>>>>>>>>>> Puh, good question. I have no idea.
>>>>>>>>>>>
>>>>>>>>>>>> Another point is that at this point the driver shouldn't access any
>>>>>>>>>>>> such buffers as we are at the process finishing the device.
>>>>>>>>>>>> AFAIK there is no page fault mechanism for kernel mappings so I
>>>>>>>>>>>> don't think there is anything else to do ?
>>>>>>>>>>> Well there is a page fault handler for kernel mappings, but that one
>>>>>>>>>>> just prints the stack trace into the system log and calls BUG(); :)
>>>>>>>>>>>
>>>>>>>>>>> Long story short we need to avoid any access to released pages after
>>>>>>>>>>> unplug. No matter if it's from the kernel or userspace.
>>>>>>>>>> I was just about to start guarding with drm_dev_enter/exit CPU
>>>>>>>>>> accesses from kernel to GTT ot VRAM buffers but then i looked more in
>>>>>>>>>> the code
>>>>>>>>>> and seems like ttm_tt_unpopulate just deletes DMA mappings (for the
>>>>>>>>>> sake of device to main memory access). Kernel page table is not
>>>>>>>>>> touched
>>>>>>>>>> until last bo refcount is dropped and the bo is released
>>>>>>>>>> (ttm_bo_release->destroy->amdgpu_bo_destroy->amdgpu_bo_kunmap). This
>>>>>>>>>> is both
>>>>>>>>>> for GTT BOs maped to kernel by kmap (or vmap) and for VRAM BOs mapped
>>>>>>>>>> by ioremap. So as i see it, nothing will bad will happen after we
>>>>>>>>>> unpopulate a BO while we still try to use a kernel mapping for it,
>>>>>>>>>> system memory pages backing GTT BOs are still mapped and not freed and
>>>>>>>>>> for
>>>>>>>>>> VRAM BOs same is for the IO physical ranges mapped into the kernel
>>>>>>>>>> page table since iounmap wasn't called yet.
>>>>>>>>> The problem is the system pages would be freed and if we kernel driver
>>>>>>>>> still happily write to them we are pretty much busted because we write
>>>>>>>>> to freed up memory.
>>>>>>> OK, i see i missed ttm_tt_unpopulate->..->ttm_pool_free which will
>>>>>>> release
>>>>>>> the GTT BO pages. But then isn't there a problem in ttm_bo_release since
>>>>>>> ttm_bo_cleanup_memtype_use which also leads to pages release comes
>>>>>>> before bo->destroy which unmaps the pages from kernel page table ? Won't
>>>>>>> we have end up writing to freed memory in this time interval ? Don't we
>>>>>>> need to postpone pages freeing to after kernel page table unmapping ?
>>>>>> BOs are only destroyed when there is a guarantee that nobody is
>>>>>> accessing them any more.
>>>>>>
>>>>>> The problem here is that the pages as well as the VRAM can be
>>>>>> immediately reused after the hotplug event.
>>>>>>
>>>>>>>> Similar for vram, if this is actual hotunplug and then replug, there's
>>>>>>>> going to be a different device behind the same mmio bar range most
>>>>>>>> likely (the higher bridges all this have the same windows assigned),
>>>>>>> No idea how this actually works but if we haven't called iounmap yet
>>>>>>> doesn't it mean that those physical ranges that are still mapped into
>>>>>>> page
>>>>>>> table should be reserved and cannot be reused for another
>>>>>>> device ? As a guess, maybe another subrange from the higher bridge's
>>>>>>> total
>>>>>>> range will be allocated.
>>>>>> Nope, the PCIe subsystem doesn't care about any ioremap still active for
>>>>>> a range when it is hotplugged.
>>>>>>
>>>>>>>> and that's bad news if we keep using it for current drivers. So we
>>>>>>>> really have to point all these cpu ptes to some other place.
>>>>>>> We can't just unmap it without syncing against any in kernel accesses
>>>>>>> to those buffers
>>>>>>> and since page faulting technique we use for user mapped buffers seems
>>>>>>> to not be possible
>>>>>>> for kernel mapped buffers I am not sure how to do it gracefully...
>>>>>> We could try to replace the kmap with a dummy page under the hood, but
>>>>>> that is extremely tricky.
>>>>>>
>>>>>> Especially since BOs which are just 1 page in size could point to the
>>>>>> linear mapping directly.
>>>>> I think it's just more work. Essentially
>>>>> - convert as much as possible of the kernel mappings to vmap_local,
>>>>> which Thomas Zimmermann is rolling out. That way a dma_resv_lock will
>>>>> serve as a barrier, and ofc any new vmap needs to fail or hand out a
>>>>> dummy mapping.
>>>> Read those patches. I am not sure how this helps with protecting
>>>> against accesses to released backing pages or IO physical ranges of BO
>>>> which is already mapped during the unplug event ?
>>> By eliminating such users, and replacing them with local maps which
>>> are strictly bound in how long they can exist (and hence we can
>>> serialize against them finishing in our hotunplug code).
>> Not sure I see how serializing against BO map/unmap helps -  our problem as
>> you described is that once
>> device is extracted and then something else quickly takes it's place in the
>> PCI topology
>> and gets assigned same physical IO ranges, then our driver will start accessing this
>> new device because our 'zombie' BOs are still pointing to those ranges.
> Until your driver's remove callback is finished the ranges stay reserved.


The ranges stay reserved until unmapped, which happens in bo->destroy,
which for most internally allocated buffers is during sw_fini, when the last
drm_put is called.


> If that's not the case, then hotunplug would be fundamentally impossible
> ot handle correctly.
>
> Of course all the mmio actions will time out, so it might take some time
> to get through it all.


I found that the PCI code provides a pci_device_is_present function
we can use to avoid timeouts - it reads the device vendor ID and checks
whether all 1s is returned. We can call it from within the register
accessors before trying to read/write.
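For illustration, a userspace sketch of that check (the names here are made-up stand-ins, not the real pci_device_is_present or the amdgpu register accessors):

```c
#include <stdbool.h>
#include <stdint.h>

/* Made-up stand-in for a PCI device; not the kernel's struct pci_dev. */
struct fake_pci_dev {
	bool plugged;
};

/* A removed PCI device answers config-space reads with all 1s. */
uint32_t fake_config_read_vendor(const struct fake_pci_dev *pdev)
{
	return pdev->plugged ? 0x1002u : 0xffffffffu;
}

/* Same idea as pci_device_is_present(): the device is considered gone
 * when the vendor ID reads back as all 1s (or all 0s). */
bool fake_pci_device_is_present(const struct fake_pci_dev *pdev)
{
	uint32_t vendor = fake_config_read_vendor(pdev);

	return vendor != 0xffffffffu && vendor != 0x00000000u;
}

/* Register read that checks for presence first instead of timing out
 * on a dead MMIO range. */
uint32_t fake_guarded_rreg32(const struct fake_pci_dev *pdev,
			     const uint32_t *mmio, unsigned int reg)
{
	if (!fake_pci_device_is_present(pdev))
		return 0; /* caller must treat 0 as "device gone" here */
	return mmio[reg];
}
```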


>
>> Another point regarding serializing - problem  is that some of those BOs are
>> very long lived, take for example the HW command
>> ring buffer Christian mentioned before -
>> (amdgpu_ring_init->amdgpu_bo_create_kernel), it's life span
>> is basically for the entire time the device exists, it's destroyed only in
>> the SW fini stage (when last drm_dev
>> reference is dropped) and so should I grab it's dma_resv_lock from
>> amdgpu_pci_remove code and wait
>> for it to be unmapped before proceeding with the PCI remove code ? This can
>> take unbound time and that why I don't understand
>> how serializing will help.
> Uh you need to untangle that. After hw cleanup is done no one is allowed
> to touch that ringbuffer bo anymore from the kernel.


I would assume we are not allowed to touch it once we have identified that the
device is
gone, in order to minimize the chance of accidental writes to some other device
which might now
occupy those IO ranges?


>   That's what
> drm_dev_enter/exit guards are for. Like you say we cant wait for all sw
> references to disappear.


Yes, it didn't make sense to me why we would use vmap_local for internally
allocated buffers. I think we should also guard register reads/writes for the
same reason as above.


>
> The vmap_local is for mappings done by other drivers, through the dma-buf
> interface (where "other drivers" can include fbdev/fbcon, if you use the
> generic helpers).
> -Daniel


Ok, so I assumed that with vmap_local you were trying to solve the problem of
quick reinsertion
of another device into the same MMIO range that my driver still points to, but
are you actually trying to solve
the issue of exported dma buffers outliving the device? For that we have the
drm_device refcount in the GEM layer,
I think.

Andrey


>
>> Andrey
>>
>>
>>> It doesn't
>>> solve all your problems, but it's a tool to get there.
>>> -Daniel
>>>
>>>> Andrey
>>>>
>>>>
>>>>> - handle fbcon somehow. I think shutting it all down should work out.
>>>>> - worst case keep the system backing storage around for shared dma-buf
>>>>> until the other non-dynamic driver releases it. for vram we require
>>>>> dynamic importers (and maybe it wasn't such a bright idea to allow
>>>>> pinning of importer buffers, might need to revisit that).
>>>>>
>>>>> Cheers, Daniel
>>>>>
>>>>>> Christian.
>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>>
>>>>>>>> -Daniel
>>>>>>>>
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>> I loaded the driver with vm_update_mode=3
>>>>>>>>>> meaning all VM updates done using CPU and hasn't seen any OOPs after
>>>>>>>>>> removing the device. I guess i can test it more by allocating GTT and
>>>>>>>>>> VRAM BOs
>>>>>>>>>> and trying to read/write to them after device is removed.
>>>>>>>>>>
>>>>>>>>>> Andrey
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>>> Andrey
>>>>>>>>>>
>>>
Christian König Dec. 17, 2020, 8:10 p.m. UTC | #32
[SNIP]
>>> By eliminating such users, and replacing them with local maps which
>>>> are strictly bound in how long they can exist (and hence we can
>>>> serialize against them finishing in our hotunplug code).
>>> Not sure I see how serializing against BO map/unmap helps - our 
>>> problem as
>>> you described is that once
>>> device is extracted and then something else quickly takes it's place 
>>> in the
>>> PCI topology
>>> and gets assigned same physical IO ranges, then our driver will 
>>> start accessing this
>>> new device because our 'zombie' BOs are still pointing to those ranges.
>> Until your driver's remove callback is finished the ranges stay 
>> reserved.
>
>
> The ranges stay reserved until unmapped which happens in bo->destroy

I'm not sure of that. Why do you think that?

> which for most internally allocated  buffers is during sw_fini when 
> last drm_put
> is called.
>
>
>> If that's not the case, then hotunplug would be fundamentally impossible
>> ot handle correctly.
>>
>> Of course all the mmio actions will time out, so it might take some time
>> to get through it all.
>
>
> I found that PCI code provides pci_device_is_present function
> we can use to avoid timeouts - it reads device vendor and checks if 
> all 1s is returned
> or not. We can call it from within register accessors before trying 
> read/write

That's way too much overhead! We need to keep that much lower or it will
result in quite a performance drop.

I suggest rather thinking about adding drm_dev_enter/exit guards.
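Roughly this shape, as a userspace model of the drm_dev_enter()/drm_dev_exit() pattern (the real API takes a struct drm_device and hands back an SRCU index; the names below are made up):

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace model of drm_dev_enter()/drm_dev_exit(); not the real
 * DRM API, which uses an SRCU read section under the hood. */
struct fake_drm_device {
	bool unplugged;
};

/* Returns false once the device has been unplugged, in which case the
 * caller must skip the hardware access entirely. */
bool fake_drm_dev_enter(struct fake_drm_device *dev, int *idx)
{
	*idx = 0; /* the real call hands back an SRCU read-section index */
	return !dev->unplugged;
}

void fake_drm_dev_exit(int idx)
{
	(void)idx; /* the real call ends the SRCU read section */
}

/* MMIO write that becomes a no-op after unplug instead of landing on
 * whatever device now owns the old BAR range. */
void fake_guarded_wreg32(struct fake_drm_device *dev,
			 uint32_t *mmio, unsigned int reg, uint32_t val)
{
	int idx;

	if (!fake_drm_dev_enter(dev, &idx))
		return; /* device gone: drop the access */
	mmio[reg] = val;
	fake_drm_dev_exit(idx);
}
```

Compared to a config-space presence read on every access, the enter/exit check is just a flag test plus an SRCU read lock, which is why it is so much cheaper.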

Christian.

>
>>> Another point regarding serializing - problem  is that some of those 
>>> BOs are
>>> very long lived, take for example the HW command
>>> ring buffer Christian mentioned before -
>>> (amdgpu_ring_init->amdgpu_bo_create_kernel), it's life span
>>> is basically for the entire time the device exists, it's destroyed 
>>> only in
>>> the SW fini stage (when last drm_dev
>>> reference is dropped) and so should I grab it's dma_resv_lock from
>>> amdgpu_pci_remove code and wait
>>> for it to be unmapped before proceeding with the PCI remove code ? 
>>> This can
>>> take unbound time and that why I don't understand
>>> how serializing will help.
>> Uh you need to untangle that. After hw cleanup is done no one is allowed
>> to touch that ringbuffer bo anymore from the kernel.
>
>
> I would assume we are not allowed to touch it once we identified the 
> device is
> gone in order to minimize the chance of accidental writes to some 
> other device which might now
> occupy those IO ranges ?
>
>
>>   That's what
>> drm_dev_enter/exit guards are for. Like you say we cant wait for all sw
>> references to disappear.
>
>
> Yes, didn't make sense to me why would we use vmap_local for internally
> allocated buffers. I think we should also guard registers read/writes 
> for the
> same reason as above.
>
>
>>
>> The vmap_local is for mappings done by other drivers, through the 
>> dma-buf
>> interface (where "other drivers" can include fbdev/fbcon, if you use the
>> generic helpers).
>> -Daniel
>
>
> Ok, so I assumed that with vmap_local you were trying to solve the 
> problem of quick reinsertion
> of another device into same MMIO range that my driver still points too 
> but actually are you trying to solve
> the issue of exported dma buffers outliving the device ? For this we 
> have drm_device refcount in the GEM layer
> i think.
>
> Andrey
>
>
>>
>>> Andrey
>>>
>>>
>>>> It doesn't
>>>> solve all your problems, but it's a tool to get there.
>>>> -Daniel
>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>> - handle fbcon somehow. I think shutting it all down should work 
>>>>>> out.
>>>>>> - worst case keep the system backing storage around for shared 
>>>>>> dma-buf
>>>>>> until the other non-dynamic driver releases it. for vram we require
>>>>>> dynamic importers (and maybe it wasn't such a bright idea to allow
>>>>>> pinning of importer buffers, might need to revisit that).
>>>>>>
>>>>>> Cheers, Daniel
>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>>
>>>>>>>>> -Daniel
>>>>>>>>>
>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>>> I loaded the driver with vm_update_mode=3
>>>>>>>>>>> meaning all VM updates done using CPU and hasn't seen any 
>>>>>>>>>>> OOPs after
>>>>>>>>>>> removing the device. I guess i can test it more by 
>>>>>>>>>>> allocating GTT and
>>>>>>>>>>> VRAM BOs
>>>>>>>>>>> and trying to read/write to them after device is removed.
>>>>>>>>>>>
>>>>>>>>>>> Andrey
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Christian.
>>>>>>>>>>>>
>>>>>>>>>>>>> Andrey
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>
>>>>>>>>>>>
>>>>
Andrey Grodzovsky Dec. 17, 2020, 8:38 p.m. UTC | #33
On 12/17/20 3:10 PM, Christian König wrote:
> [SNIP]
>>>> By eliminating such users, and replacing them with local maps which
>>>>> are strictly bound in how long they can exist (and hence we can
>>>>> serialize against them finishing in our hotunplug code).
>>>> Not sure I see how serializing against BO map/unmap helps - our problem as
>>>> you described is that once
>>>> device is extracted and then something else quickly takes it's place in the
>>>> PCI topology
>>>> and gets assigned same physical IO ranges, then our driver will start 
>>>> accessing this
>>>> new device because our 'zombie' BOs are still pointing to those ranges.
>>> Until your driver's remove callback is finished the ranges stay reserved.
>>
>>
>> The ranges stay reserved until unmapped which happens in bo->destroy
>
> I'm not sure of that. Why do you think that?


Because of this sequence:
ttm_bo_release->destroy->amdgpu_bo_destroy->amdgpu_bo_kunmap->...->iounmap
Is there another place I am missing?


>
>> which for most internally allocated buffers is during sw_fini when last drm_put
>> is called.
>>
>>
>>> If that's not the case, then hotunplug would be fundamentally impossible
>>> ot handle correctly.
>>>
>>> Of course all the mmio actions will time out, so it might take some time
>>> to get through it all.
>>
>>
>> I found that PCI code provides pci_device_is_present function
>> we can use to avoid timeouts - it reads device vendor and checks if all 1s is 
>> returned
>> or not. We can call it from within register accessors before trying read/write
>
> That's way to much overhead! We need to keep that much lower or it will result 
> in quite a performance drop.
>
> I suggest to rather think about adding drm_dev_enter/exit guards.


Sure, this check just sits a bit upstream of the disconnect event. Ultimately
none of these approaches is watertight.

Andrey


>
> Christian.
>
>>
>>>> Another point regarding serializing - problem  is that some of those BOs are
>>>> very long lived, take for example the HW command
>>>> ring buffer Christian mentioned before -
>>>> (amdgpu_ring_init->amdgpu_bo_create_kernel), it's life span
>>>> is basically for the entire time the device exists, it's destroyed only in
>>>> the SW fini stage (when last drm_dev
>>>> reference is dropped) and so should I grab it's dma_resv_lock from
>>>> amdgpu_pci_remove code and wait
>>>> for it to be unmapped before proceeding with the PCI remove code ? This can
>>>> take unbound time and that why I don't understand
>>>> how serializing will help.
>>> Uh you need to untangle that. After hw cleanup is done no one is allowed
>>> to touch that ringbuffer bo anymore from the kernel.
>>
>>
>> I would assume we are not allowed to touch it once we identified the device is
>> gone in order to minimize the chance of accidental writes to some other 
>> device which might now
>> occupy those IO ranges ?
>>
>>
>>>   That's what
>>> drm_dev_enter/exit guards are for. Like you say we cant wait for all sw
>>> references to disappear.
>>
>>
>> Yes, didn't make sense to me why would we use vmap_local for internally
>> allocated buffers. I think we should also guard registers read/writes for the
>> same reason as above.
>>
>>
>>>
>>> The vmap_local is for mappings done by other drivers, through the dma-buf
>>> interface (where "other drivers" can include fbdev/fbcon, if you use the
>>> generic helpers).
>>> -Daniel
>>
>>
>> Ok, so I assumed that with vmap_local you were trying to solve the problem of 
>> quick reinsertion
>> of another device into same MMIO range that my driver still points too but 
>> actually are you trying to solve
>> the issue of exported dma buffers outliving the device ? For this we have 
>> drm_device refcount in the GEM layer
>> i think.
>>
>> Andrey
>>
>>
>>>
>>>> Andrey
>>>>
>>>>
>>>>> It doesn't
>>>>> solve all your problems, but it's a tool to get there.
>>>>> -Daniel
>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>> - handle fbcon somehow. I think shutting it all down should work out.
>>>>>>> - worst case keep the system backing storage around for shared dma-buf
>>>>>>> until the other non-dynamic driver releases it. for vram we require
>>>>>>> dynamic importers (and maybe it wasn't such a bright idea to allow
>>>>>>> pinning of importer buffers, might need to revisit that).
>>>>>>>
>>>>>>> Cheers, Daniel
>>>>>>>
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>> Andrey
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> -Daniel
>>>>>>>>>>
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>>> I loaded the driver with vm_update_mode=3
>>>>>>>>>>>> meaning all VM updates done using CPU and hasn't seen any OOPs after
>>>>>>>>>>>> removing the device. I guess i can test it more by allocating GTT and
>>>>>>>>>>>> VRAM BOs
>>>>>>>>>>>> and trying to read/write to them after device is removed.
>>>>>>>>>>>>
>>>>>>>>>>>> Andrey
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>
>
Daniel Vetter Dec. 17, 2020, 8:42 p.m. UTC | #34
On Thu, Dec 17, 2020 at 8:19 PM Andrey Grodzovsky
<Andrey.Grodzovsky@amd.com> wrote:
>
>
> On 12/17/20 7:01 AM, Daniel Vetter wrote:
> > On Wed, Dec 16, 2020 at 07:20:02PM -0500, Andrey Grodzovsky wrote:
> >> On 12/16/20 6:15 PM, Daniel Vetter wrote:
> >>> On Wed, Dec 16, 2020 at 7:26 PM Andrey Grodzovsky
> >>> <Andrey.Grodzovsky@amd.com> wrote:
> >>>> On 12/16/20 12:12 PM, Daniel Vetter wrote:
> >>>>> On Wed, Dec 16, 2020 at 5:18 PM Christian König
> >>>>> <christian.koenig@amd.com> wrote:
> >>>>>> Am 16.12.20 um 17:13 schrieb Andrey Grodzovsky:
> >>>>>>> On 12/16/20 9:21 AM, Daniel Vetter wrote:
> >>>>>>>> On Wed, Dec 16, 2020 at 9:04 AM Christian König
> >>>>>>>> <ckoenig.leichtzumerken@gmail.com> wrote:
> >>>>>>>>> Am 15.12.20 um 21:18 schrieb Andrey Grodzovsky:
> >>>>>>>>>> [SNIP]
> >>>>>>>>>>>> While we can't control user application accesses to the mapped
> >>>>>>>>>>>> buffers explicitly and hence we use page fault rerouting
> >>>>>>>>>>>> I am thinking that in this  case we may be able to sprinkle
> >>>>>>>>>>>> drm_dev_enter/exit in any such sensitive place were we might
> >>>>>>>>>>>> CPU access a DMA buffer from the kernel ?
> >>>>>>>>>>> Yes, I fear we are going to need that.
> >>>>>>>>>>>
> >>>>>>>>>>>> Things like CPU page table updates, ring buffer accesses and FW
> >>>>>>>>>>>> memcpy ? Is there other places ?
> >>>>>>>>>>> Puh, good question. I have no idea.
> >>>>>>>>>>>
> >>>>>>>>>>>> Another point is that at this point the driver shouldn't access any
> >>>>>>>>>>>> such buffers as we are at the process finishing the device.
> >>>>>>>>>>>> AFAIK there is no page fault mechanism for kernel mappings so I
> >>>>>>>>>>>> don't think there is anything else to do ?
> >>>>>>>>>>> Well there is a page fault handler for kernel mappings, but that one
> >>>>>>>>>>> just prints the stack trace into the system log and calls BUG(); :)
> >>>>>>>>>>>
> >>>>>>>>>>> Long story short we need to avoid any access to released pages after
> >>>>>>>>>>> unplug. No matter if it's from the kernel or userspace.
> >>>>>>>>>> I was just about to start guarding with drm_dev_enter/exit CPU
> >>>>>>>>>> accesses from kernel to GTT ot VRAM buffers but then i looked more in
> >>>>>>>>>> the code
> >>>>>>>>>> and seems like ttm_tt_unpopulate just deletes DMA mappings (for the
> >>>>>>>>>> sake of device to main memory access). Kernel page table is not
> >>>>>>>>>> touched
> >>>>>>>>>> until last bo refcount is dropped and the bo is released
> >>>>>>>>>> (ttm_bo_release->destroy->amdgpu_bo_destroy->amdgpu_bo_kunmap). This
> >>>>>>>>>> is both
> >>>>>>>>>> for GTT BOs maped to kernel by kmap (or vmap) and for VRAM BOs mapped
> >>>>>>>>>> by ioremap. So as i see it, nothing will bad will happen after we
> >>>>>>>>>> unpopulate a BO while we still try to use a kernel mapping for it,
> >>>>>>>>>> system memory pages backing GTT BOs are still mapped and not freed and
> >>>>>>>>>> for
> >>>>>>>>>> VRAM BOs same is for the IO physical ranges mapped into the kernel
> >>>>>>>>>> page table since iounmap wasn't called yet.
> >>>>>>>>> The problem is the system pages would be freed and if we kernel driver
> >>>>>>>>> still happily write to them we are pretty much busted because we write
> >>>>>>>>> to freed up memory.
> >>>>>>> OK, i see i missed ttm_tt_unpopulate->..->ttm_pool_free which will
> >>>>>>> release
> >>>>>>> the GTT BO pages. But then isn't there a problem in ttm_bo_release since
> >>>>>>> ttm_bo_cleanup_memtype_use which also leads to pages release comes
> >>>>>>> before bo->destroy which unmaps the pages from kernel page table ? Won't
> >>>>>>> we have end up writing to freed memory in this time interval ? Don't we
> >>>>>>> need to postpone pages freeing to after kernel page table unmapping ?
> >>>>>> BOs are only destroyed when there is a guarantee that nobody is
> >>>>>> accessing them any more.
> >>>>>>
> >>>>>> The problem here is that the pages as well as the VRAM can be
> >>>>>> immediately reused after the hotplug event.
> >>>>>>
> >>>>>>>> Similar for vram, if this is actual hotunplug and then replug, there's
> >>>>>>>> going to be a different device behind the same mmio bar range most
> >>>>>>>> likely (the higher bridges all this have the same windows assigned),
> >>>>>>> No idea how this actually works but if we haven't called iounmap yet
> >>>>>>> doesn't it mean that those physical ranges that are still mapped into
> >>>>>>> page
> >>>>>>> table should be reserved and cannot be reused for another
> >>>>>>> device ? As a guess, maybe another subrange from the higher bridge's
> >>>>>>> total
> >>>>>>> range will be allocated.
> >>>>>> Nope, the PCIe subsystem doesn't care about any ioremap still active for
> >>>>>> a range when it is hotplugged.
> >>>>>>
> >>>>>>>> and that's bad news if we keep using it for current drivers. So we
> >>>>>>>> really have to point all these cpu ptes to some other place.
> >>>>>>> We can't just unmap it without syncing against any in kernel accesses
> >>>>>>> to those buffers
> >>>>>>> and since page faulting technique we use for user mapped buffers seems
> >>>>>>> to not be possible
> >>>>>>> for kernel mapped buffers I am not sure how to do it gracefully...
> >>>>>> We could try to replace the kmap with a dummy page under the hood, but
> >>>>>> that is extremely tricky.
> >>>>>>
> >>>>>> Especially since BOs which are just 1 page in size could point to the
> >>>>>> linear mapping directly.
> >>>>> I think it's just more work. Essentially
> >>>>> - convert as much as possible of the kernel mappings to vmap_local,
> >>>>> which Thomas Zimmermann is rolling out. That way a dma_resv_lock will
> >>>>> serve as a barrier, and ofc any new vmap needs to fail or hand out a
> >>>>> dummy mapping.
> >>>> Read those patches. I am not sure how this helps with protecting
> >>>> against accesses to released backing pages or IO physical ranges of BO
> >>>> which is already mapped during the unplug event ?
> >>> By eliminating such users, and replacing them with local maps which
> >>> are strictly bound in how long they can exist (and hence we can
> >>> serialize against them finishing in our hotunplug code).
> >> Not sure I see how serializing against BO map/unmap helps -  our problem as
> >> you described is that once
> >> device is extracted and then something else quickly takes it's place in the
> >> PCI topology
> >> and gets assigned same physical IO ranges, then our driver will start accessing this
> >> new device because our 'zombie' BOs are still pointing to those ranges.
> > Until your driver's remove callback is finished the ranges stay reserved.
>
>
> The ranges stay reserved until unmapped which happens in bo->destroy
> which for most internally allocated  buffers is during sw_fini when last drm_put
> is called.
>
>
> > If that's not the case, then hotunplug would be fundamentally impossible
> > ot handle correctly.
> >
> > Of course all the mmio actions will time out, so it might take some time
> > to get through it all.
>
>
> I found that PCI code provides pci_device_is_present function
> we can use to avoid timeouts - it reads device vendor and checks if all 1s is
> returned
> or not. We can call it from within register accessors before trying read/write

drm_dev_enter/exit is a _lot_ less overhead, and it makes much stronger
guarantees about hotunplug ordering. Please use that instead of
hand-rolling something which only mostly closes the hotunplug
races. pci config access is really slow.

> >> Another point regarding serializing - problem  is that some of those BOs are
> >> very long lived, take for example the HW command
> >> ring buffer Christian mentioned before -
> >> (amdgpu_ring_init->amdgpu_bo_create_kernel), it's life span
> >> is basically for the entire time the device exists, it's destroyed only in
> >> the SW fini stage (when last drm_dev
> >> reference is dropped) and so should I grab it's dma_resv_lock from
> >> amdgpu_pci_remove code and wait
> >> for it to be unmapped before proceeding with the PCI remove code ? This can
> >> take unbound time and that why I don't understand
> >> how serializing will help.
> > Uh you need to untangle that. After hw cleanup is done no one is allowed
> > to touch that ringbuffer bo anymore from the kernel.
>
>
> I would assume we are not allowed to touch it once we identified the device is
> gone in order to minimize the chance of accidental writes to some other device
> which might now
> occupy those IO ranges ?
>
>
> >   That's what
> > drm_dev_enter/exit guards are for. Like you say we cant wait for all sw
> > references to disappear.
>
>
> Yes, didn't make sense to me why would we use vmap_local for internally
> allocated buffers. I think we should also guard registers read/writes for the
> same reason as above.
>
>
> >
> > The vmap_local is for mappings done by other drivers, through the dma-buf
> > interface (where "other drivers" can include fbdev/fbcon, if you use the
> > generic helpers).
> > -Daniel
>
>
> Ok, so I assumed that with vmap_local you were trying to solve the problem of
> quick reinsertion
> of another device into same MMIO range that my driver still points too but
> actually are you trying to solve
> the issue of exported dma buffers outliving the device ? For this we have
> drm_device refcount in the GEM layer
> i think.

Those are two completely different lifetime problems. Don't mix them up :-)
One problem is the hardware disappearing, and for that we _have_ to
guarantee timeliness, or otherwise the pci subsystem gets pissed
(since, like you say, a new device might show up and need its mmio
bars assigned to io ranges). The other is the lifetime of the software
objects we use as interfaces, both from userspace and from other
kernel drivers. There we fundamentally can't enforce timely cleanup,
and have to resort to refcounting.

We need both.
-Daniel

> Andrey
>
>
> >
> >> Andrey
> >>
> >>
> >>> It doesn't
> >>> solve all your problems, but it's a tool to get there.
> >>> -Daniel
> >>>
> >>>> Andrey
> >>>>
> >>>>
> >>>>> - handle fbcon somehow. I think shutting it all down should work out.
> >>>>> - worst case keep the system backing storage around for shared dma-buf
> >>>>> until the other non-dynamic driver releases it. for vram we require
> >>>>> dynamic importers (and maybe it wasn't such a bright idea to allow
> >>>>> pinning of importer buffers, might need to revisit that).
> >>>>>
> >>>>> Cheers, Daniel
> >>>>>
> >>>>>> Christian.
> >>>>>>
> >>>>>>> Andrey
> >>>>>>>
> >>>>>>>
> >>>>>>>> -Daniel
> >>>>>>>>
> >>>>>>>>> Christian.
> >>>>>>>>>
> >>>>>>>>>> I loaded the driver with vm_update_mode=3
> >>>>>>>>>> meaning all VM updates done using CPU and hasn't seen any OOPs after
> >>>>>>>>>> removing the device. I guess i can test it more by allocating GTT and
> >>>>>>>>>> VRAM BOs
> >>>>>>>>>> and trying to read/write to them after device is removed.
> >>>>>>>>>>
> >>>>>>>>>> Andrey
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> Regards,
> >>>>>>>>>>> Christian.
> >>>>>>>>>>>
> >>>>>>>>>>>> Andrey
> >>>
Daniel Vetter Dec. 17, 2020, 8:48 p.m. UTC | #35
On Thu, Dec 17, 2020 at 9:38 PM Andrey Grodzovsky
<Andrey.Grodzovsky@amd.com> wrote:
>
>
> On 12/17/20 3:10 PM, Christian König wrote:
> > [SNIP]
> >>>> By eliminating such users, and replacing them with local maps which
> >>>>> are strictly bound in how long they can exist (and hence we can
> >>>>> serialize against them finishing in our hotunplug code).
> >>>> Not sure I see how serializing against BO map/unmap helps - our problem as
> >>>> you described is that once
> >>>> device is extracted and then something else quickly takes it's place in the
> >>>> PCI topology
> >>>> and gets assigned same physical IO ranges, then our driver will start
> >>>> accessing this
> >>>> new device because our 'zombie' BOs are still pointing to those ranges.
> >>> Until your driver's remove callback is finished the ranges stay reserved.
> >>
> >>
> >> The ranges stay reserved until unmapped which happens in bo->destroy
> >
> > I'm not sure of that. Why do you think that?
>
>
> Because of this sequence
> ttm_bo_release->destroy->amdgpu_bo_destroy->amdgpu_bo_kunmap->...->iounmap
> Is there another place I am missing ?

iounmap is just the mapping, it doesn't reserve anything in the resource tree.

And I don't think we should keep resources reserved past the pci
remove callback, because that would upset the pci subsystem trying to
assign resources to a newly hotplugged pci device.

Also, from a quick check, amdgpu does not reserve the pci bars it's
using. Somehow most drm drivers don't do that; not exactly sure why,
maybe auto-enumeration of resources just works too well and we don't
need the safety net of kernel/resource.c anymore.
-Daniel


> >
> >> which for most internally allocated buffers is during sw_fini when last drm_put
> >> is called.
> >>
> >>
> >>> If that's not the case, then hotunplug would be fundamentally impossible
> >>> ot handle correctly.
> >>>
> >>> Of course all the mmio actions will time out, so it might take some time
> >>> to get through it all.
> >>
> >>
> >> I found that PCI code provides pci_device_is_present function
> >> we can use to avoid timeouts - it reads device vendor and checks if all 1s is
> >> returned
> >> or not. We can call it from within register accessors before trying read/write
> >
> > That's way to much overhead! We need to keep that much lower or it will result
> > in quite a performance drop.
> >
> > I suggest to rather think about adding drm_dev_enter/exit guards.
>
>
> Sure, this one is just a bit upstream to the disconnect event. Eventually none
> of them is watertight.
>
> Andrey
>
>
> >
> > Christian.
> >
> >>
> >>>> Another point regarding serializing - problem  is that some of those BOs are
> >>>> very long lived, take for example the HW command
> >>>> ring buffer Christian mentioned before -
> >>>> (amdgpu_ring_init->amdgpu_bo_create_kernel), it's life span
> >>>> is basically for the entire time the device exists, it's destroyed only in
> >>>> the SW fini stage (when last drm_dev
> >>>> reference is dropped) and so should I grab it's dma_resv_lock from
> >>>> amdgpu_pci_remove code and wait
> >>>> for it to be unmapped before proceeding with the PCI remove code ? This can
> >>>> take unbound time and that why I don't understand
> >>>> how serializing will help.
> >>> Uh you need to untangle that. After hw cleanup is done no one is allowed
> >>> to touch that ringbuffer bo anymore from the kernel.
> >>
> >>
> >> I would assume we are not allowed to touch it once we identified the device is
> >> gone in order to minimize the chance of accidental writes to some other
> >> device which might now
> >> occupy those IO ranges ?
> >>
> >>
> >>>   That's what
> >>> drm_dev_enter/exit guards are for. Like you say we cant wait for all sw
> >>> references to disappear.
> >>
> >>
> >> Yes, didn't make sense to me why would we use vmap_local for internally
> >> allocated buffers. I think we should also guard registers read/writes for the
> >> same reason as above.
> >>
> >>
> >>>
> >>> The vmap_local is for mappings done by other drivers, through the dma-buf
> >>> interface (where "other drivers" can include fbdev/fbcon, if you use the
> >>> generic helpers).
> >>> -Daniel
> >>
> >>
> >> Ok, so I assumed that with vmap_local you were trying to solve the problem of
> >> quick reinsertion
> >> of another device into same MMIO range that my driver still points too but
> >> actually are you trying to solve
> >> the issue of exported dma buffers outliving the device ? For this we have
> >> drm_device refcount in the GEM layer
> >> i think.
> >>
> >> Andrey
> >>
> >>
> >>>
> >>>> Andrey
> >>>>
> >>>>
> >>>>> It doesn't
> >>>>> solve all your problems, but it's a tool to get there.
> >>>>> -Daniel
> >>>>>
> >>>>>> Andrey
> >>>>>>
> >>>>>>
> >>>>>>> - handle fbcon somehow. I think shutting it all down should work out.
> >>>>>>> - worst case keep the system backing storage around for shared dma-buf
> >>>>>>> until the other non-dynamic driver releases it. for vram we require
> >>>>>>> dynamic importers (and maybe it wasn't such a bright idea to allow
> >>>>>>> pinning of importer buffers, might need to revisit that).
> >>>>>>>
> >>>>>>> Cheers, Daniel
> >>>>>>>
> >>>>>>>> Christian.
> >>>>>>>>
> >>>>>>>>> Andrey
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> -Daniel
> >>>>>>>>>>
> >>>>>>>>>>> Christian.
> >>>>>>>>>>>
> >>>>>>>>>>>> I loaded the driver with vm_update_mode=3
> >>>>>>>>>>>> meaning all VM updates done using CPU and hasn't seen any OOPs after
> >>>>>>>>>>>> removing the device. I guess i can test it more by allocating GTT and
> >>>>>>>>>>>> VRAM BOs
> >>>>>>>>>>>> and trying to read/write to them after device is removed.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Andrey
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>> Christian.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Andrey
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>
> >
Andrey Grodzovsky Dec. 17, 2020, 9:06 p.m. UTC | #36
On 12/17/20 3:48 PM, Daniel Vetter wrote:
> On Thu, Dec 17, 2020 at 9:38 PM Andrey Grodzovsky
> <Andrey.Grodzovsky@amd.com> wrote:
>>
>> On 12/17/20 3:10 PM, Christian König wrote:
>>> [SNIP]
>>>>>> By eliminating such users, and replacing them with local maps which
>>>>>>> are strictly bound in how long they can exist (and hence we can
>>>>>>> serialize against them finishing in our hotunplug code).
>>>>>> Not sure I see how serializing against BO map/unmap helps - our problem as
>>>>>> you described is that once
>>>>>> device is extracted and then something else quickly takes it's place in the
>>>>>> PCI topology
>>>>>> and gets assigned same physical IO ranges, then our driver will start
>>>>>> accessing this
>>>>>> new device because our 'zombie' BOs are still pointing to those ranges.
>>>>> Until your driver's remove callback is finished the ranges stay reserved.
>>>>
>>>> The ranges stay reserved until unmapped which happens in bo->destroy
>>> I'm not sure of that. Why do you think that?
>>
>> Because of this sequence
>> ttm_bo_release->destroy->amdgpu_bo_destroy->amdgpu_bo_kunmap->...->iounmap
>> Is there another place I am missing ?
> iounmap is just the mapping, it doesn't reserve anything in the resource tree.
>
> And I don't think we should keep resources reserved past the pci
> remove callback, because that would upset the pci subsystem trying to
> assign resources to a newly hotplugged pci device.


I assumed we were talking about VA ranges still mapped in the page table. I
also assumed that part of what ioremap does is reserve the mapped physical
ranges. In fact, if we can explicitly reserve those ranges (as you mention
here), then together with postponing the freeing of system memory pages back
to the page pool until after the BO is unmapped from the kernel address
space, I believe this could solve the issue of quick HW reinsertion and make
all the drm_dev_enter/exit guarding obsolete.

Andrey


> Also from a quick check amdgpu does not reserve the pci bars it's
> using. Somehow most drm drivers don't do that, not exactly sure why,
> maybe auto-enumeration of resources just works too good and we don't
> need the safety net of kernel/resource.c anymore.
> -Daniel
>
>
>>>> which for most internally allocated buffers is during sw_fini when last drm_put
>>>> is called.
>>>>
>>>>
>>>>> If that's not the case, then hotunplug would be fundamentally impossible
>>>>> ot handle correctly.
>>>>>
>>>>> Of course all the mmio actions will time out, so it might take some time
>>>>> to get through it all.
>>>>
>>>> I found that PCI code provides pci_device_is_present function
>>>> we can use to avoid timeouts - it reads device vendor and checks if all 1s is
>>>> returned
>>>> or not. We can call it from within register accessors before trying read/write
>>> That's way to much overhead! We need to keep that much lower or it will result
>>> in quite a performance drop.
>>>
>>> I suggest to rather think about adding drm_dev_enter/exit guards.
>>
>> Sure, this one is just a bit upstream to the disconnect event. Eventually none
>> of them is watertight.
>>
>> Andrey
>>
>>
>>> Christian.
>>>
>>>>>> Another point regarding serializing - problem  is that some of those BOs are
>>>>>> very long lived, take for example the HW command
>>>>>> ring buffer Christian mentioned before -
>>>>>> (amdgpu_ring_init->amdgpu_bo_create_kernel), it's life span
>>>>>> is basically for the entire time the device exists, it's destroyed only in
>>>>>> the SW fini stage (when last drm_dev
>>>>>> reference is dropped) and so should I grab it's dma_resv_lock from
>>>>>> amdgpu_pci_remove code and wait
>>>>>> for it to be unmapped before proceeding with the PCI remove code ? This can
>>>>>> take unbound time and that why I don't understand
>>>>>> how serializing will help.
>>>>> Uh you need to untangle that. After hw cleanup is done no one is allowed
>>>>> to touch that ringbuffer bo anymore from the kernel.
>>>>
>>>> I would assume we are not allowed to touch it once we identified the device is
>>>> gone in order to minimize the chance of accidental writes to some other
>>>> device which might now
>>>> occupy those IO ranges ?
>>>>
>>>>
>>>>>    That's what
>>>>> drm_dev_enter/exit guards are for. Like you say we cant wait for all sw
>>>>> references to disappear.
>>>>
>>>> Yes, didn't make sense to me why would we use vmap_local for internally
>>>> allocated buffers. I think we should also guard registers read/writes for the
>>>> same reason as above.
>>>>
>>>>
>>>>> The vmap_local is for mappings done by other drivers, through the dma-buf
>>>>> interface (where "other drivers" can include fbdev/fbcon, if you use the
>>>>> generic helpers).
>>>>> -Daniel
>>>>
>>>> Ok, so I assumed that with vmap_local you were trying to solve the problem of
>>>> quick reinsertion
>>>> of another device into same MMIO range that my driver still points too but
>>>> actually are you trying to solve
>>>> the issue of exported dma buffers outliving the device ? For this we have
>>>> drm_device refcount in the GEM layer
>>>> i think.
>>>>
>>>> Andrey
>>>>
>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>> It doesn't
>>>>>>> solve all your problems, but it's a tool to get there.
>>>>>>> -Daniel
>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>>
>>>>>>>>> - handle fbcon somehow. I think shutting it all down should work out.
>>>>>>>>> - worst case keep the system backing storage around for shared dma-buf
>>>>>>>>> until the other non-dynamic driver releases it. for vram we require
>>>>>>>>> dynamic importers (and maybe it wasn't such a bright idea to allow
>>>>>>>>> pinning of importer buffers, might need to revisit that).
>>>>>>>>>
>>>>>>>>> Cheers, Daniel
>>>>>>>>>
>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>>> Andrey
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> -Daniel
>>>>>>>>>>>>
>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I loaded the driver with vm_update_mode=3
>>>>>>>>>>>>>> meaning all VM updates done using CPU and hasn't seen any OOPs after
>>>>>>>>>>>>>> removing the device. I guess i can test it more by allocating GTT and
>>>>>>>>>>>>>> VRAM BOs
>>>>>>>>>>>>>> and trying to read/write to them after device is removed.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>
>
Andrey Grodzovsky Dec. 17, 2020, 9:13 p.m. UTC | #37
On 12/17/20 3:42 PM, Daniel Vetter wrote:
> On Thu, Dec 17, 2020 at 8:19 PM Andrey Grodzovsky
> <Andrey.Grodzovsky@amd.com> wrote:
>>
>> On 12/17/20 7:01 AM, Daniel Vetter wrote:
>>> On Wed, Dec 16, 2020 at 07:20:02PM -0500, Andrey Grodzovsky wrote:
>>>> On 12/16/20 6:15 PM, Daniel Vetter wrote:
>>>>> On Wed, Dec 16, 2020 at 7:26 PM Andrey Grodzovsky
>>>>> <Andrey.Grodzovsky@amd.com> wrote:
>>>>>> On 12/16/20 12:12 PM, Daniel Vetter wrote:
>>>>>>> On Wed, Dec 16, 2020 at 5:18 PM Christian König
>>>>>>> <christian.koenig@amd.com> wrote:
>>>>>>>> Am 16.12.20 um 17:13 schrieb Andrey Grodzovsky:
>>>>>>>>> On 12/16/20 9:21 AM, Daniel Vetter wrote:
>>>>>>>>>> On Wed, Dec 16, 2020 at 9:04 AM Christian König
>>>>>>>>>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>>>>>>>>>> Am 15.12.20 um 21:18 schrieb Andrey Grodzovsky:
>>>>>>>>>>>> [SNIP]
>>>>>>>>>>>>>> While we can't control user application accesses to the mapped
>>>>>>>>>>>>>> buffers explicitly and hence we use page fault rerouting
>>>>>>>>>>>>>> I am thinking that in this  case we may be able to sprinkle
>>>>>>>>>>>>>> drm_dev_enter/exit in any such sensitive place were we might
>>>>>>>>>>>>>> CPU access a DMA buffer from the kernel ?
>>>>>>>>>>>>> Yes, I fear we are going to need that.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Things like CPU page table updates, ring buffer accesses and FW
>>>>>>>>>>>>>> memcpy ? Is there other places ?
>>>>>>>>>>>>> Puh, good question. I have no idea.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Another point is that at this point the driver shouldn't access any
>>>>>>>>>>>>>> such buffers as we are at the process finishing the device.
>>>>>>>>>>>>>> AFAIK there is no page fault mechanism for kernel mappings so I
>>>>>>>>>>>>>> don't think there is anything else to do ?
>>>>>>>>>>>>> Well there is a page fault handler for kernel mappings, but that one
>>>>>>>>>>>>> just prints the stack trace into the system log and calls BUG(); :)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Long story short we need to avoid any access to released pages after
>>>>>>>>>>>>> unplug. No matter if it's from the kernel or userspace.
>>>>>>>>>>>> I was just about to start guarding with drm_dev_enter/exit CPU
>>>>>>>>>>>> accesses from kernel to GTT ot VRAM buffers but then i looked more in
>>>>>>>>>>>> the code
>>>>>>>>>>>> and seems like ttm_tt_unpopulate just deletes DMA mappings (for the
>>>>>>>>>>>> sake of device to main memory access). Kernel page table is not
>>>>>>>>>>>> touched
>>>>>>>>>>>> until last bo refcount is dropped and the bo is released
>>>>>>>>>>>> (ttm_bo_release->destroy->amdgpu_bo_destroy->amdgpu_bo_kunmap). This
>>>>>>>>>>>> is both
>>>>>>>>>>>> for GTT BOs maped to kernel by kmap (or vmap) and for VRAM BOs mapped
>>>>>>>>>>>> by ioremap. So as I see it, nothing bad will happen after we
>>>>>>>>>>>> unpopulate a BO while we still try to use a kernel mapping for it,
>>>>>>>>>>>> system memory pages backing GTT BOs are still mapped and not freed and
>>>>>>>>>>>> for
>>>>>>>>>>>> VRAM BOs same is for the IO physical ranges mapped into the kernel
>>>>>>>>>>>> page table since iounmap wasn't called yet.
>>>>>>>>>>> The problem is the system pages would be freed and if the kernel driver
>>>>>>>>>>> still happily write to them we are pretty much busted because we write
>>>>>>>>>>> to freed up memory.
>>>>>>>>> OK, i see i missed ttm_tt_unpopulate->..->ttm_pool_free which will
>>>>>>>>> release
>>>>>>>>> the GTT BO pages. But then isn't there a problem in ttm_bo_release since
>>>>>>>>> ttm_bo_cleanup_memtype_use which also leads to pages release comes
>>>>>>>>> before bo->destroy which unmaps the pages from kernel page table ? Won't
>>>>>>>>> we have end up writing to freed memory in this time interval ? Don't we
>>>>>>>>> need to postpone pages freeing to after kernel page table unmapping ?
>>>>>>>> BOs are only destroyed when there is a guarantee that nobody is
>>>>>>>> accessing them any more.
>>>>>>>>
>>>>>>>> The problem here is that the pages as well as the VRAM can be
>>>>>>>> immediately reused after the hotplug event.
>>>>>>>>
>>>>>>>>>> Similar for vram, if this is actual hotunplug and then replug, there's
>>>>>>>>>> going to be a different device behind the same mmio bar range most
>>>>>>>>>> likely (the higher bridges all this have the same windows assigned),
>>>>>>>>> No idea how this actually works but if we haven't called iounmap yet
>>>>>>>>> doesn't it mean that those physical ranges that are still mapped into
>>>>>>>>> page
>>>>>>>>> table should be reserved and cannot be reused for another
>>>>>>>>> device ? As a guess, maybe another subrange from the higher bridge's
>>>>>>>>> total
>>>>>>>>> range will be allocated.
>>>>>>>> Nope, the PCIe subsystem doesn't care about any ioremap still active for
>>>>>>>> a range when it is hotplugged.
>>>>>>>>
>>>>>>>>>> and that's bad news if we keep using it for current drivers. So we
>>>>>>>>>> really have to point all these cpu ptes to some other place.
>>>>>>>>> We can't just unmap it without syncing against any in kernel accesses
>>>>>>>>> to those buffers
>>>>>>>>> and since page faulting technique we use for user mapped buffers seems
>>>>>>>>> to not be possible
>>>>>>>>> for kernel mapped buffers I am not sure how to do it gracefully...
>>>>>>>> We could try to replace the kmap with a dummy page under the hood, but
>>>>>>>> that is extremely tricky.
>>>>>>>>
>>>>>>>> Especially since BOs which are just 1 page in size could point to the
>>>>>>>> linear mapping directly.
>>>>>>> I think it's just more work. Essentially
>>>>>>> - convert as much as possible of the kernel mappings to vmap_local,
>>>>>>> which Thomas Zimmermann is rolling out. That way a dma_resv_lock will
>>>>>>> serve as a barrier, and ofc any new vmap needs to fail or hand out a
>>>>>>> dummy mapping.
>>>>>> Read those patches. I am not sure how this helps with protecting
>>>>>> against accesses to released backing pages or IO physical ranges of BO
>>>>>> which is already mapped during the unplug event ?
>>>>> By eliminating such users, and replacing them with local maps which
>>>>> are strictly bound in how long they can exist (and hence we can
>>>>> serialize against them finishing in our hotunplug code).
>>>> Not sure I see how serializing against BO map/unmap helps -  our problem as
>>>> you described is that once
>>>> device is extracted and then something else quickly takes its place in the
>>>> PCI topology
>>>> and gets assigned same physical IO ranges, then our driver will start accessing this
>>>> new device because our 'zombie' BOs are still pointing to those ranges.
>>> Until your driver's remove callback is finished the ranges stay reserved.
>>
>> The ranges stay reserved until unmapped which happens in bo->destroy
>> which for most internally allocated  buffers is during sw_fini when last drm_put
>> is called.
>>
>>
>>> If that's not the case, then hotunplug would be fundamentally impossible
>>> ot handle correctly.
>>>
>>> Of course all the mmio actions will time out, so it might take some time
>>> to get through it all.
>>
>> I found that PCI code provides pci_device_is_present function
>> we can use to avoid timeouts - it reads device vendor and checks if all 1s is
>> returned
>> or not. We can call it from within register accessors before trying read/write
> drm_dev_enter/exit is a _lot_ less overhead, plus makes a lot stronger
> guarantees for hotunplug ordering. Please use that one instead of
> hand-rolling something which only mostly works for closing hotunplug
> races. pciconfig access is really slow.
>
>>>> Another point regarding serializing - problem  is that some of those BOs are
>>>> very long lived, take for example the HW command
>>>> ring buffer Christian mentioned before -
>>>> (amdgpu_ring_init->amdgpu_bo_create_kernel), it's life span
>>>> is basically for the entire time the device exists, it's destroyed only in
>>>> the SW fini stage (when last drm_dev
>>>> reference is dropped) and so should I grab it's dma_resv_lock from
>>>> amdgpu_pci_remove code and wait
>>>> for it to be unmapped before proceeding with the PCI remove code ? This can
>>>> take unbound time and that why I don't understand
>>>> how serializing will help.
>>> Uh you need to untangle that. After hw cleanup is done no one is allowed
>>> to touch that ringbuffer bo anymore from the kernel.
>>
>> I would assume we are not allowed to touch it once we identified the device is
>> gone in order to minimize the chance of accidental writes to some other device
>> which might now
>> occupy those IO ranges ?
>>
>>
>>>    That's what
>>> drm_dev_enter/exit guards are for. Like you say we cant wait for all sw
>>> references to disappear.
>>
>> Yes, didn't make sense to me why would we use vmap_local for internally
>> allocated buffers. I think we should also guard registers read/writes for the
>> same reason as above.
>>
>>
>>> The vmap_local is for mappings done by other drivers, through the dma-buf
>>> interface (where "other drivers" can include fbdev/fbcon, if you use the
>>> generic helpers).
>>> -Daniel
>>
>> Ok, so I assumed that with vmap_local you were trying to solve the problem of
>> quick reinsertion
>> of another device into same MMIO range that my driver still points too but
>> actually are you trying to solve
>> the issue of exported dma buffers outliving the device ? For this we have
>> drm_device refcount in the GEM layer
>> i think.
> That's completely different lifetime problems. Don't mix them up :-)
> One problem is the hardware disappearing, and for that we _have_ to
> guarantee timeliness, or otherwise the pci subsystem gets pissed
> (since like you say, a new device might show up and need its mmio
> bars assigned to io ranges). The other is lifetime of the software
> objects we use as interfaces, both from userspace and from other
> kernel drivers. There we fundamentally can't enforce timely cleanup,
> and have to resort to refcounting.


So regarding the second issue, as I mentioned above, don't we already use
drm_dev_get/put for exported BOs? Earlier in this discussion you mentioned
that we are ok for dma-bufs, since we already have the refcounting at the
GEM layer, and the real life-cycle problem we have is the dma_fences, for
which there is no drm_dev refcounting. It seems to me then that vmap_local
is superfluous, because of the refcounting we already have for exported
dma-bufs, and for dma_fences it won't help.

Andrey


>
> We need both.
> -Daniel
>
>> Andrey
>>
>>
>>>> Andrey
>>>>
>>>>
>>>>> It doesn't
>>>>> solve all your problems, but it's a tool to get there.
>>>>> -Daniel
>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>> - handle fbcon somehow. I think shutting it all down should work out.
>>>>>>> - worst case keep the system backing storage around for shared dma-buf
>>>>>>> until the other non-dynamic driver releases it. for vram we require
>>>>>>> dynamic importers (and maybe it wasn't such a bright idea to allow
>>>>>>> pinning of importer buffers, might need to revisit that).
>>>>>>>
>>>>>>> Cheers, Daniel
>>>>>>>
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>> Andrey
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> -Daniel
>>>>>>>>>>
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>>> I loaded the driver with vm_update_mode=3
>>>>>>>>>>>> meaning all VM updates done using CPU and hasn't seen any OOPs after
>>>>>>>>>>>> removing the device. I guess i can test it more by allocating GTT and
>>>>>>>>>>>> VRAM BOs
>>>>>>>>>>>> and trying to read/write to them after device is removed.
>>>>>>>>>>>>
>>>>>>>>>>>> Andrey
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>>
>
>
Daniel Vetter Dec. 18, 2020, 2:30 p.m. UTC | #38
On Thu, Dec 17, 2020 at 04:06:38PM -0500, Andrey Grodzovsky wrote:
> 
> On 12/17/20 3:48 PM, Daniel Vetter wrote:
> > On Thu, Dec 17, 2020 at 9:38 PM Andrey Grodzovsky
> > <Andrey.Grodzovsky@amd.com> wrote:
> > > 
> > > On 12/17/20 3:10 PM, Christian König wrote:
> > > > [SNIP]
> > > > > > > By eliminating such users, and replacing them with local maps which
> > > > > > > > are strictly bound in how long they can exist (and hence we can
> > > > > > > > serialize against them finishing in our hotunplug code).
> > > > > > > Not sure I see how serializing against BO map/unmap helps - our problem as
> > > > > > > you described is that once
> > > > > > > device is extracted and then something else quickly takes its place in the
> > > > > > > PCI topology
> > > > > > > and gets assigned same physical IO ranges, then our driver will start
> > > > > > > accessing this
> > > > > > > new device because our 'zombie' BOs are still pointing to those ranges.
> > > > > > Until your driver's remove callback is finished the ranges stay reserved.
> > > > > 
> > > > > The ranges stay reserved until unmapped which happens in bo->destroy
> > > > I'm not sure of that. Why do you think that?
> > > 
> > > Because of this sequence
> > > ttm_bo_release->destroy->amdgpu_bo_destroy->amdgpu_bo_kunmap->...->iounmap
> > > Is there another place I am missing ?
> > iounmap is just the mapping, it doesn't reserve anything in the resource tree.
> > 
> > And I don't think we should keep resources reserved past the pci
> > remove callback, because that would upset the pci subsystem trying to
> > assign resources to a newly hotplugged pci device.
> 
> 
> I assumed we are talking about VA ranges still mapped in the page table. I
> just assumed
> that part of ioremap is also reservation of the mapped physical ranges. In
> fact, if we
> do can explicitly reserve those ranges (as you mention here) then together
> with postponing
> system memory pages freeing/releasing back to the page pool until after BO
> is unmapped
> from the kernel address space I believe this could solve the issue of quick
> HW reinsertion
> and make all the drm_dev_enter/exit guarding obsolete.

We can't reserve these ranges, that's what I tried to explain:
- kernel/resource.c isn't very consistently used
- the pci core will get pissed if there's suddenly a range in the middle
  of a bridge that it can't use
- nesting is allowed for resources, so this doesn't actually guarantee
  much

I just wanted to point out that ioremap doesn't do any reserving, so it is
not enough by far.

We really have to stop using any mmio ranges before the pci remove
callback is finished.
-Daniel

> 
> Andrey
> 
> 
> > Also from a quick check amdgpu does not reserve the pci bars it's
> > using. Somehow most drm drivers don't do that, not exactly sure why,
> > maybe auto-enumeration of resources just works too good and we don't
> > need the safety net of kernel/resource.c anymore.
> > -Daniel
> > 
> > 
> > > > > which for most internally allocated buffers is during sw_fini when last drm_put
> > > > > is called.
> > > > > 
> > > > > 
> > > > > > If that's not the case, then hotunplug would be fundamentally impossible
> > > > > > to handle correctly.
> > > > > > 
> > > > > > Of course all the mmio actions will time out, so it might take some time
> > > > > > to get through it all.
> > > > > 
> > > > > I found that PCI code provides pci_device_is_present function
> > > > > we can use to avoid timeouts - it reads device vendor and checks if all 1s is
> > > > > returned
> > > > > or not. We can call it from within register accessors before trying read/write
> > > > That's way to much overhead! We need to keep that much lower or it will result
> > > > in quite a performance drop.
> > > > 
> > > > I suggest to rather think about adding drm_dev_enter/exit guards.
> > > 
> > > Sure, this one is just a bit upstream to the disconnect event. Eventually none
> > > of them is watertight.
> > > 
> > > Andrey
> > > 
> > > 
> > > > Christian.
> > > > 
> > > > > > > Another point regarding serializing - problem  is that some of those BOs are
> > > > > > > very long lived, take for example the HW command
> > > > > > > ring buffer Christian mentioned before -
> > > > > > > (amdgpu_ring_init->amdgpu_bo_create_kernel), it's life span
> > > > > > > is basically for the entire time the device exists, it's destroyed only in
> > > > > > > the SW fini stage (when last drm_dev
> > > > > > > reference is dropped) and so should I grab it's dma_resv_lock from
> > > > > > > amdgpu_pci_remove code and wait
> > > > > > > for it to be unmapped before proceeding with the PCI remove code ? This can
> > > > > > > take unbound time and that why I don't understand
> > > > > > > how serializing will help.
> > > > > > Uh you need to untangle that. After hw cleanup is done no one is allowed
> > > > > > to touch that ringbuffer bo anymore from the kernel.
> > > > > 
> > > > > I would assume we are not allowed to touch it once we identified the device is
> > > > > gone in order to minimize the chance of accidental writes to some other
> > > > > device which might now
> > > > > occupy those IO ranges ?
> > > > > 
> > > > > 
> > > > > >    That's what
> > > > > > drm_dev_enter/exit guards are for. Like you say we cant wait for all sw
> > > > > > references to disappear.
> > > > > 
> > > > > Yes, didn't make sense to me why would we use vmap_local for internally
> > > > > allocated buffers. I think we should also guard registers read/writes for the
> > > > > same reason as above.
> > > > > 
> > > > > 
> > > > > > The vmap_local is for mappings done by other drivers, through the dma-buf
> > > > > > interface (where "other drivers" can include fbdev/fbcon, if you use the
> > > > > > generic helpers).
> > > > > > -Daniel
> > > > > 
> > > > > Ok, so I assumed that with vmap_local you were trying to solve the problem of
> > > > > quick reinsertion
> > > > > of another device into same MMIO range that my driver still points too but
> > > > > actually are you trying to solve
> > > > > the issue of exported dma buffers outliving the device ? For this we have
> > > > > drm_device refcount in the GEM layer
> > > > > i think.
> > > > > 
> > > > > Andrey
> > > > > 
> > > > > 
> > > > > > > Andrey
> > > > > > > 
> > > > > > > 
> > > > > > > > It doesn't
> > > > > > > > solve all your problems, but it's a tool to get there.
> > > > > > > > -Daniel
> > > > > > > > 
> > > > > > > > > Andrey
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > - handle fbcon somehow. I think shutting it all down should work out.
> > > > > > > > > > - worst case keep the system backing storage around for shared dma-buf
> > > > > > > > > > until the other non-dynamic driver releases it. for vram we require
> > > > > > > > > > dynamic importers (and maybe it wasn't such a bright idea to allow
> > > > > > > > > > pinning of importer buffers, might need to revisit that).
> > > > > > > > > > 
> > > > > > > > > > Cheers, Daniel
> > > > > > > > > > 
> > > > > > > > > > > Christian.
> > > > > > > > > > > 
> > > > > > > > > > > > Andrey
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > > -Daniel
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > Christian.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > I loaded the driver with vm_update_mode=3
> > > > > > > > > > > > > > > meaning all VM updates done using CPU and hasn't seen any OOPs after
> > > > > > > > > > > > > > > removing the device. I guess i can test it more by allocating GTT and
> > > > > > > > > > > > > > > VRAM BOs
> > > > > > > > > > > > > > > and trying to read/write to them after device is removed.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Andrey
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > > > Christian.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Andrey
> > > > > > > > > > > > > > > _______________________________________________
> > > > > > > > > > > > > > > amd-gfx mailing list
> > > > > > > > > > > > > > > amd-gfx@lists.freedesktop.org
> > > > > > > > > > > > > > > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > 
> >
Andrey Grodzovsky Jan. 4, 2021, 4:33 p.m. UTC | #39
Hey Daniel, back from vacation and going over our last long thread, I think
you didn't reply to my last question below (or at least I can't find it).

Andrey

On 12/17/20 4:13 PM, Andrey Grodzovsky wrote:
>>> Ok, so I assumed that with vmap_local you were trying to solve the problem of
>>> quick reinsertion
>>> of another device into same MMIO range that my driver still points too but
>>> actually are you trying to solve
>>> the issue of exported dma buffers outliving the device ? For this we have
>>> drm_device refcount in the GEM layer
>>> i think.
>> That's completely different lifetime problems. Don't mix them up :-)
>> One problem is the hardware disappearing, and for that we _have_ to
>> guarantee timeliness, or otherwise the pci subsystem gets pissed
>> (since like you say, a new device might show up and need its mmio
>> bars assigned to io ranges). The other is lifetime of the software
>> objects we use as interfaces, both from userspace and from other
>> kernel drivers. There we fundamentally can't enforce timely cleanup,
>> and have to resort to refcounting.
>
>
> So regarding the second issue, as I mentioned above, don't we already use
> drm_dev_get/put for exported BOs? Earlier in this discussion you mentioned
> that we are ok for dma-bufs, since we already have the refcounting at the
> GEM layer, and the real life-cycle problem we have is the dma_fences, for
> which there is no drm_dev refcounting. It seems to me then that vmap_local
> is superfluous, because of the refcounting we already have for exported
> dma-bufs, and for dma_fences it won't help.
>
> Andrey
diff mbox series

Patch

diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
index 1ccf1ef..29248a5 100644
--- a/drivers/gpu/drm/ttm/ttm_tt.c
+++ b/drivers/gpu/drm/ttm/ttm_tt.c
@@ -495,3 +495,4 @@  void ttm_tt_unpopulate(struct ttm_tt *ttm)
 	else
 		ttm_pool_unpopulate(ttm);
 }
+EXPORT_SYMBOL(ttm_tt_unpopulate);