Message ID: 0-v2-d4374a7bf0c9+c4-vfio_dev_counter_jgg@nvidia.com (mailing list archive)
State: New, archived
Series: [v2] vfio: Remove vfio_group dev_counter
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, August 17, 2022 3:13 AM
>
> + *
> + * A driver may only call this function if the vfio_device was created
> + * by vfio_register_emulated_iommu_dev().
> */
> int vfio_pin_pages(struct vfio_device *device, dma_addr_t iova,
> 		   int npage, int prot, struct page **pages)

Though I agree with the code change, I'm still unclear about this
comment.

First, this comment is not equivalent to the old code, which only
checks dev_counter (implying a physical device in a singleton
group can use this API too). Though I didn't get why the singleton
restriction was imposed in the first place...

Second, I also didn't get why such a pinning API is tied to emulated
iommu now. Though not required for any physical device today, what
would be the actual problem of allowing a variant driver to make
such a call?

Thanks
Kevin
On Mon, 22 Aug 2022 04:39:45 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, August 17, 2022 3:13 AM
> >
> > + *
> > + * A driver may only call this function if the vfio_device was created
> > + * by vfio_register_emulated_iommu_dev().
> > */
> > int vfio_pin_pages(struct vfio_device *device, dma_addr_t iova,
> > 		   int npage, int prot, struct page **pages)
>
> Though I agree with the code change, I'm still unclear about this
> comment.
>
> First this comment is not equivalent to the old code which only
> checks dev_counter (implying a physical device in a singleton
> group can use this API too). though I didn't get why the singleton
> restriction was imposed in the first place...

It's related to the dirty page scope. Use of the pinned page interface
is essentially a contract with the IOMMU back-end that only pinned pages
will be considered for the dirty page scope. However, type1 operates
on groups, therefore we needed to create a restriction that only
singleton groups could make use of page pinning such that the dirty
page scope could be attached to the group.

> Second I also didn't get why such a pinning API is tied to emulated
> iommu now. Though not required for any physical device today, what
> would be the actual problem of allowing a variant driver to make
> such call?

In fact I do recall such discussions. An IOMMU backed mdev (defunct)
or vfio-pci variant driver could gratuitously pin pages in order to
limit the dirty page scope. We don't have anything in-tree that relies
on this. It also seems we're heading more in the direction of device
level DMA dirty tracking as Yishai proposes in the series for mlx5.
These interfaces are far more efficient for this use case, but perhaps
you have another use case in mind where we couldn't use the dma_rw
interface?

I think the assumption is that devices that can perform DMA through an
IOMMU generally wouldn't need to twiddle guest DMA targets on a regular
basis otherwise, therefore limiting this to emulated IOMMU devices is
reasonable. Thanks,

Alex
Hi, Alex,

Thanks for the explanation!

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Tuesday, August 23, 2022 2:36 AM
>
> On Mon, 22 Aug 2022 04:39:45 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, August 17, 2022 3:13 AM
> > >
> > > + *
> > > + * A driver may only call this function if the vfio_device was created
> > > + * by vfio_register_emulated_iommu_dev().
> > > */
> > > int vfio_pin_pages(struct vfio_device *device, dma_addr_t iova,
> > > 		   int npage, int prot, struct page **pages)
> >
> > Though I agree with the code change, I'm still unclear about this
> > comment.
> >
> > First this comment is not equivalent to the old code which only
> > checks dev_counter (implying a physical device in a singleton
> > group can use this API too). though I didn't get why the singleton
> > restriction was imposed in the first place...
>
> It's related to the dirty page scope. Use of the pinned page interface
> is essentially a contract with the IOMMU back-end that only pinned pages
> will be considered for the dirty page scope. However, type1 operates
> on groups, therefore we needed to create a restriction that only
> singleton groups could make use of page pinning such that the dirty
> page scope could be attached to the group.

I get the former part but not the latter. Can you elaborate why a
multi-device group cannot be attached by the dirty page scope?
It's kind of sharing the scope by all devices in the group.

> > Second I also didn't get why such a pinning API is tied to emulated
> > iommu now. Though not required for any physical device today, what
> > would be the actual problem of allowing a variant driver to make
> > such call?
>
> In fact I do recall such discussions. An IOMMU backed mdev (defunct)
> or vfio-pci variant driver could gratuitously pin pages in order to
> limit the dirty page scope. We don't have anything in-tree that relies
> on this. It also seems we're heading more in the direction of device
> level DMA dirty tracking as Yishai proposes in the series for mlx5.
> These interfaces are far more efficient for this use case, but perhaps
> you have another use case in mind where we couldn't use the dma_rw
> interface?

One potential scenario is when I/O page fault is supported: VFIO can
enable on-demand paging in stage-2 mappings. In case a device cannot
tolerate faults in all paths, a variant driver could use this interface
to pin down structures which don't expect faults.

> I think the assumption is that devices that can perform DMA through an
> IOMMU generally wouldn't need to twiddle guest DMA targets on a regular
> basis otherwise, therefore limiting this to emulated IOMMU devices is
> reasonable. Thanks,

IMHO if functionally this function only works for the emulated case then
we should add code to detect and fail if it's called otherwise. Otherwise
it doesn't make much sense to add a comment to explicitly limit it to an
existing use case.

Thanks
Kevin
On Tue, Aug 23, 2022 at 01:31:11AM +0000, Tian, Kevin wrote:

> > In fact I do recall such discussions. An IOMMU backed mdev (defunct)
> > or vfio-pci variant driver could gratuitously pin pages in order to
> > limit the dirty page scope. We don't have anything in-tree that relies
> > on this. It also seems we're heading more in the direction of device
> > level DMA dirty tracking as Yishai proposes in the series for mlx5.
> > These interfaces are far more efficient for this use case, but perhaps
> > you have another use case in mind where we couldn't use the dma_rw
> > interface?
>
> One potential scenario is when I/O page fault is supported VFIO can
> enable on-demand paging in stage-2 mappings. In case a device cannot
> tolerate faults in all paths then a variant driver could use this interface
> to pin down structures which don't expect faults.

If this need arises, and I've had discussions about such things in the
past, it makes more sense to have a proper API to inhibit faulting of
a sub-range in what would otherwise be a faultable iommu_domain.

Inhibiting faulting might be the same underlying code as pinning, but
I would prefer we don't co-mingle these very different concepts at the
device driver level.

> IMHO if functionally this function only works for emulated case then we
> should add code to detect and fail if it's called otherwise.

Today it only works correctly for the emulated case because only the
emulated case will be guaranteed to have a singleton group.

It *might* work for other cases, but not generally. In the general
case a physical device driver may be faced with multi-device groups
and it shouldn't fail.

So, I would prefer to comment it like this, and if someone comes with a
driver that wants to use it in some other way they have to address
these problems so it works generally and correctly. I don't want to
write more code to protect against something that auditing tells us
doesn't happen today.

The whole thing should naturally become fixed fairly soon, as once we
have Yishai and Joao's changes there will be no use for the dirty
tracking code in type1 that is causing this problem.

Jason
On Tue, 23 Aug 2022 21:38:08 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> So, I would prefer to comment it like this and if someone comes with a
> driver that wants to use it in some other way they have to address
> these problems so it works generally and correctly. I don't want to
> write more code to protect against something that auditing tells us
> doesn't happen today.
>
> The whole thing should naturally become fixed fairly soon, as once we
> have Yishai and Joao's changes there will be no use for the dirty
> tracking code in type1 that is causing this problem.

I assume this is referring to the DMA logging interface and
implementation for mlx5[1], but I don't see anyone else getting on board
with that proposal (or reviewing). AIUI, most are waiting for system
IOMMU DMA logging, so the above seems a bit of a bold claim. Do we
expect system IOMMU logging flowing through this device feature or will
there be an interface at the shared IOMMU context level? Do we expect
mdev drivers supporting migration to track their dirty iovas locally
and implement this feature?

And touching on this question that wasn't addressed...

On Tue, 23 Aug 2022 01:31:11 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Tuesday, August 23, 2022 2:36 AM
> > It's related to the dirty page scope. Use of the pinned page interface
> > is essentially a contract with the IOMMU back-end that only pinned pages
> > will be considered for the dirty page scope. However, type1 operates
> > on groups, therefore we needed to create a restriction that only
> > singleton groups could make use of page pinning such that the dirty
> > page scope could be attached to the group.
>
> I get the former part but not the latter. Can you elaborate why
> multi-device group can not be attached by the dirty page scope?
> It's kind of sharing the scope by all devices in the group.

There's no fundamental reason we couldn't do it. At the time it wasn't
necessary and wasn't convenient since type1 operated exclusively on
groups. Now maybe we'd expand the device_list beyond just devices with
dma_unmap callbacks and link page pinning to the device rather than the
group to get better granularity, but that all seems to be work in the
opposite direction from where we want the IOMMU uAPI to go.

If type1 dirty logging goes away and moves to the drivers, we could
scrap the singleton requirement, but as Jason notes, expanding its use
to hack around regions that can't fault is a bit of an abuse of the
intended use case. Thanks,

Alex

[1] 20220815151109.180403-1-yishaih@nvidia.com
On Wed, Aug 24, 2022 at 04:02:01PM -0600, Alex Williamson wrote:

> On Tue, 23 Aug 2022 21:38:08 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> > So, I would prefer to comment it like this and if someone comes with a
> > driver that wants to use it in some other way they have to address
> > these problems so it works generally and correctly. I don't want to
> > write more code to protect against something that auditing tells us
> > doesn't happen today.
> >
> > The whole thing should naturally become fixed fairly soon, as once we
> > have Yishai and Joao's changes there will be no use for the dirty
> > tracking code in type1 that is causing this problem.
>
> I assume this is referring to the DMA logging interface and
> implementation for mlx5[1], but I don't see anyone else getting on board
> with that proposal (or reviewing).

We have a large group pushing in this direction together.

Yishai implemented the API in vfio for mlx5, showing how a device
centric tracker can work. In my research I found no other groups
besides NVIDIA planning to do device tracking, so it is not surprising
his series has been as well reviewed as it probably will be. The other
use cases inside NVIDIA are also happy with this design.

Joao implemented the similar API in iommufd for the system iommu, and
did this with hisilicon's support because they need it for their merged
hisilicon vfio driver.

https://github.com/jpemartins/linux/commit/7baa3d51417419690aa8b57f6cc58be7ab3a40a9

I understand Intel to use this with IDXD as well.

So, basically everyone who is working in the migratable VFIO device
space is on board with, and contributing to, this plan right now.

> IOMMU DMA logging, so the above seems a bit of a bold claim. Do we
> expect system IOMMU logging flowing through this device feature or will
> there be an interface at the shared IOMMU context level?

Joao's draft series shows the basic idea:

https://github.com/jpemartins/linux/commit/fada208468303a3a6df340ac20cda8fc1b869e7a#diff-4318a59849ffa0d4aa992221863be15cde4c01e1f82d2f6d06337cfe6dd316afR228

We are working this all through in pieces as they become ready.

> Do we expect mdev drivers supporting migration to track their dirty
> iovas locally and implement this feature?

Assuming we ever get a SW-only device that can do migration..

I would expect there to be a library to manage the dirty bitmap
datastructure and the instance of that datastructure to be linked to
the IOAS the device is attached to. Ie one IOAS, one datastructure.

The additional code in the mdev driver would be a handful of lines.

Whether we want it to report out through vfio or through an iommufd API
is an interesting question I don't have a solid answer to right now.
Either will work - iommufd has a nice hole to put a "sw page table"
object parallel to the "hw page table" object that the draft series put
the API on. The appeal is it makes the sharing of the datastructure
clearer to userspace.

The trouble with mlx5 is there isn't a nice iommufd spot to put a
device tracker. The closest is on the device id, but using the device
id in APIs intended for "page table" ids is wonky. And there is no
datastructure sharing here, obviously. Not such a nice fit.

Jason
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, August 24, 2022 8:38 AM
>
> On Tue, Aug 23, 2022 at 01:31:11AM +0000, Tian, Kevin wrote:
>
> > > In fact I do recall such discussions. An IOMMU backed mdev (defunct)
> > > or vfio-pci variant driver could gratuitously pin pages in order to
> > > limit the dirty page scope. We don't have anything in-tree that relies
> > > on this. It also seems we're heading more in the direction of device
> > > level DMA dirty tracking as Yishai proposes in the series for mlx5.
> > > These interfaces are far more efficient for this use case, but perhaps
> > > you have another use case in mind where we couldn't use the dma_rw
> > > interface?
> >
> > One potential scenario is when I/O page fault is supported VFIO can
> > enable on-demand paging in stage-2 mappings. In case a device cannot
> > tolerate faults in all paths then a variant driver could use this interface
> > to pin down structures which don't expect faults.
>
> If this need arises, and I've had discussions about such things in the
> past, it makes more sense to have a proper API to inhibit faulting of
> a sub-range in what would otherwise be a faultable iommu_domain.
>
> Inhibiting faulting might be the same underlying code as pinning, but
> I would prefer we don't co-mingle these very different concepts at the
> device driver level.
>
> > IMHO if functionally this function only works for emulated case then we
> > should add code to detect and fail if it's called otherwise.
>
> Today it only works correctly for the emulated case because only the
> emulated case will be guaranteed to have a singleton group.
>
> It *might* work for other cases, but not generally. In the general
> case a physical device driver may be faced with multi-device groups
> and it shouldn't fail.
>
> So, I would prefer to comment it like this and if someone comes with a
> driver that wants to use it in some other way they have to address
> these problems so it works generally and correctly. I don't want to
> write more code to protect against something that auditing tells us
> doesn't happen today.

With all the discussions in this thread I'm fine with adding a comment
like this (though it's not my preference, which isn't strong either):

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
On Tue, 16 Aug 2022 16:13:04 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> This counts the number of devices attached to a vfio_group, ie the number
> of items in the group->device_list.
>
> It is only read in vfio_pin_pages(), as some kind of protection against
> limitations in type1.
>
> However, with all the code cleanups in this area, now that
> vfio_pin_pages() accepts a vfio_device directly it is redundant. All
> drivers are already calling vfio_register_emulated_iommu_dev() which
> directly creates a group specifically for the device and thus it is
> guaranteed that there is a singleton group.
>
> Leave a note in the comment about this requirement and remove the logic.
>
> Reviewed-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/vfio/vfio_main.c | 9 +++------
>  1 file changed, 3 insertions(+), 6 deletions(-)

Applied to vfio next branch for v6.1. Thanks,

Alex
Hi, Jason,

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, August 25, 2022 8:44 AM
>
> On Wed, Aug 24, 2022 at 04:02:01PM -0600, Alex Williamson wrote:
>
> > Do we expect mdev drivers supporting migration to track their dirty
> > iovas locally and implement this feature?
>
> Assuming we ever get a SW-only device that can do migration..
>
> I would expect there to be a library to manage the dirty bitmap
> datastructure and the instance of that datastructure to be linked to
> the IOAS the device is attached to. Ie one IOAS, one datastructure.
>
> The additional code in the mdev driver would be a handful of lines.
>
> Whether we want it to report out through vfio or through an iommufd API
> is an interesting question I don't have a solid answer to right
> now. Either will work - iommufd has a nice hole to put a "sw page
> table" object parallel to the "hw page table" object that the draft
> series put the API on. The appeal is it makes the sharing of the
> datastructure clearer to userspace.
>
> The trouble with mlx5 is there isn't a nice iommufd spot to put a
> device tracker. The closest is on the device id, but using the device
> id in APIs intended for "page table" ids is wonky. And there is no
> datastructure sharing here, obviously. Not such a nice fit.

A bit more thinking on this.

Although mlx5 internal tracking is not generic, the uAPI proposed
in Yishai's series is pretty generic. I wonder whether it can be
extended as a formal interface in iommufd, with iommu dirty tracking
also being a dirty tracker implementation.

Currently the concept between Yishai's and Joao's series is actually
similar, both having a capability interface and a command to read/clear
the bitmap of a tracked range, except that Yishai's series allows
userspace to do range-based tracking while Joao's series currently has
the tracking enabled per the entire page table.

But in concept iommu dirty tracking can support range-based interfaces
too. For global dirty bit control (e.g. Intel and AMD) the iommu driver
can just turn it on globally upon any enabled range. For per-PTE dirty
bit control (e.g. ARM) it's actually a nice fit for range-based
tracking.

This pays the complexity of introducing a new object (dirty tracker)
in the iommufd object hierarchy, with the gain of providing a single
dirty tracking interface to userspace. Kind of like a page table can
have a list of dirty trackers and each tracker is used by a list of
devices.

If an iommu dirty tracker is available it is preferred to and used by
all devices (except mdevs) attached to the page table. Otherwise mlx5
etc. picks a vendor-specific dirty tracker if available. mdevs just
share a generic software dirty tracker, not affected by the presence
of the iommu dirty tracker.

Multiple trackers might be used for a page table at the same time.

Is the above all worth it?

Thanks
Kevin
On Tue, Sep 06, 2022 at 09:21:35AM +0000, Tian, Kevin wrote:

> Although mlx5 internal tracking is not generic, the uAPI proposed
> in Yishai's series is pretty generic. I wonder whether it can be extended
> as a formal interface in iommufd with iommu dirty tracking also being
> a dirty tracker implementation.

It probably could, but I don't think it is helpful.

> Currently the concept between Yishai's and Joao's series is actually
> similar, both having a capability interface and a command to read/clear
> the bitmap of a tracked range except that Yishai's series allows
> userspace to do range-based tracking while Joao's series currently
> have the tracking enabled per the entire page table.

This similarity is deliberate.

> But in concept iommu dirty tracking can support range-based interfaces
> too. For global dirty bit control (e.g. Intel and AMD) the iommu driver
> can just turn it on globally upon any enabled range. For per-PTE
> dirty bit control (e.g. ARM) it's actually a nice fit for range-based
> tracking.

We don't actually have a use case for range-based tracking. The reason
it exists for mlx5 is only because the device just cannot do global
tracking. I was hoping not to bring that complexity into the system
iommu side.

> This pays the complexity of introducing a new object (dirty tracker)
> in iommufd object hierarchy, with the gain of providing a single
> dirty tracking interface to userspace.

Well, it isn't single, it is still two interfaces that work in two
quite different ways. Userspace has to know exactly which of the two
ways it is working in and do everything accordingly.

We get to share some structs and ioctl #s, but I don't see a real
simplification here. If anything the false sharing creates a new kind
of complexity:

> Kind of like a page table can have a list of dirty trackers and each
> tracker is used by a list of devices.
>
> If iommu dirty tracker is available it is preferred to and used by all
> devices (except mdev) attached to the page table.
>
> Otherwise mlx5 etc. picks a vendor specific dirty tracker if available.
>
> mdev's just share a generic software dirty tracker, not affected by
> the presence of iommu dirty tracker.
>
> Multiple trackers might be used for a page table at the same time.
>
> Is above all worth it?

Because now we are trying to create one master interface, we need new
APIs to control what trackers exist, what trackers are turned on, and
so forth. Now we've added complexity.

You end up with iommufd proxying a VFIO object that is completely
unrelated to drivers/iommu - I don't like it just on that basis.

The tracker is a vfio object, created by a vfio driver, with an
interface deliberately similar to iommufd's interface to ease the
implementation in userspace. It is not related to drivers/iommu and it
doesn't benefit from anything in drivers/iommu beyond the shared code
to manage the shared uapi.

Jason
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 7cb56c382c97a2..94a76cb86a0388 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -74,7 +74,6 @@ struct vfio_group {
 	struct list_head vfio_next;
 	struct list_head container_next;
 	enum vfio_group_type type;
-	unsigned int dev_counter;
 	struct rw_semaphore group_rwsem;
 	struct kvm *kvm;
 	struct file *opened_file;
@@ -608,7 +607,6 @@ static int __vfio_register_dev(struct vfio_device *device,
 
 	mutex_lock(&group->device_lock);
 	list_add(&device->group_next, &group->device_list);
-	group->dev_counter++;
 	mutex_unlock(&group->device_lock);
 
 	return 0;
@@ -696,7 +694,6 @@ void vfio_unregister_group_dev(struct vfio_device *device)
 
 	mutex_lock(&group->device_lock);
 	list_del(&device->group_next);
-	group->dev_counter--;
 	mutex_unlock(&group->device_lock);
 
 	if (group->type == VFIO_NO_IOMMU || group->type == VFIO_EMULATED_IOMMU)
@@ -1946,6 +1943,9 @@ EXPORT_SYMBOL(vfio_set_irqs_validate_and_prepare);
  * @prot [in] : protection flags
  * @pages[out] : array of host pages
  * Return error or number of pages pinned.
+ *
+ * A driver may only call this function if the vfio_device was created
+ * by vfio_register_emulated_iommu_dev().
  */
 int vfio_pin_pages(struct vfio_device *device, dma_addr_t iova,
 		   int npage, int prot, struct page **pages)
@@ -1961,9 +1961,6 @@ int vfio_pin_pages(struct vfio_device *device, dma_addr_t iova,
 	if (npage > VFIO_PIN_PAGES_MAX_ENTRIES)
 		return -E2BIG;
 
-	if (group->dev_counter > 1)
-		return -EINVAL;
-
 	/* group->container cannot change while a vfio device is open */
 	container = group->container;
 	driver = container->iommu_driver;