Message ID: 0-v2-d4374a7bf0c9+c4-vfio_dev_counter_jgg@nvidia.com (mailing list archive)
State: New, archived
Series: [v2] vfio: Remove vfio_group dev_counter
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, August 17, 2022 3:13 AM
>
> + *
> + * A driver may only call this function if the vfio_device was created
> + * by vfio_register_emulated_iommu_dev().
> */
> int vfio_pin_pages(struct vfio_device *device, dma_addr_t iova,
> 		   int npage, int prot, struct page **pages)

Though I agree with the code change, I'm still unclear about this
comment.

First, this comment is not equivalent to the old code, which only
checks dev_counter (implying a physical device in a singleton
group can use this API too). Though I didn't get why the singleton
restriction was imposed in the first place...

Second, I also didn't get why such a pinning API is tied to emulated
iommu now. Though not required for any physical device today, what
would be the actual problem of allowing a variant driver to make
such a call?

Thanks
Kevin
On Mon, 22 Aug 2022 04:39:45 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, August 17, 2022 3:13 AM
> >
> > + *
> > + * A driver may only call this function if the vfio_device was created
> > + * by vfio_register_emulated_iommu_dev().
> > */
> > int vfio_pin_pages(struct vfio_device *device, dma_addr_t iova,
> > 		   int npage, int prot, struct page **pages)
>
> Though I agree with the code change, I'm still unclear about this
> comment.
>
> First this comment is not equivalent to the old code which only
> checks dev_counter (implying a physical device in a singleton
> group can use this API too). though I didn't get why the singleton
> restriction was imposed in the first place...

It's related to the dirty page scope. Use of the pinned page interface
is essentially a contract with the IOMMU back-end that only pinned pages
will be considered for the dirty page scope. However, type1 operates
on groups, therefore we needed to create a restriction that only
singleton groups could make use of page pinning such that the dirty
page scope could be attached to the group.

> Second I also didn't get why such a pinning API is tied to emulated
> iommu now. Though not required for any physical device today, what
> would be the actual problem of allowing a variant driver to make
> such call?

In fact I do recall such discussions. An IOMMU backed mdev (defunct)
or vfio-pci variant driver could gratuitously pin pages in order to
limit the dirty page scope. We don't have anything in-tree that relies
on this. It also seems we're heading more in the direction of device
level DMA dirty tracking as Yishai proposes in the series for mlx5.
These interfaces are far more efficient for this use case, but perhaps
you have another use case in mind where we couldn't use the dma_rw
interface?

I think the assumption is that devices that can perform DMA through an
IOMMU generally wouldn't need to twiddle guest DMA targets on a regular
basis otherwise, therefore limiting this to emulated IOMMU devices is
reasonable. Thanks,

Alex
Hi, Alex,

Thanks for the explanation!

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Tuesday, August 23, 2022 2:36 AM
>
> On Mon, 22 Aug 2022 04:39:45 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, August 17, 2022 3:13 AM
> > >
> > > + *
> > > + * A driver may only call this function if the vfio_device was created
> > > + * by vfio_register_emulated_iommu_dev().
> > > */
> > > int vfio_pin_pages(struct vfio_device *device, dma_addr_t iova,
> > > 		   int npage, int prot, struct page **pages)
> >
> > Though I agree with the code change, I'm still unclear about this
> > comment.
> >
> > First this comment is not equivalent to the old code which only
> > checks dev_counter (implying a physical device in a singleton
> > group can use this API too). though I didn't get why the singleton
> > restriction was imposed in the first place...
>
> It's related to the dirty page scope. Use of the pinned page interface
> is essentially a contract with the IOMMU back-end that only pinned pages
> will be considered for the dirty page scope. However, type1 operates
> on groups, therefore we needed to create a restriction that only
> singleton groups could make use of page pinning such that the dirty
> page scope could be attached to the group.

I get the former part but not the latter. Can you elaborate why a
multi-device group cannot be attached by the dirty page scope?
It's kind of sharing the scope by all devices in the group.

> > Second I also didn't get why such a pinning API is tied to emulated
> > iommu now. Though not required for any physical device today, what
> > would be the actual problem of allowing a variant driver to make
> > such call?
>
> In fact I do recall such discussions. An IOMMU backed mdev (defunct)
> or vfio-pci variant driver could gratuitously pin pages in order to
> limit the dirty page scope. We don't have anything in-tree that relies
> on this. It also seems we're heading more in the direction of device
> level DMA dirty tracking as Yishai proposes in the series for mlx5.
> These interfaces are far more efficient for this use case, but perhaps
> you have another use case in mind where we couldn't use the dma_rw
> interface?

One potential scenario is when I/O page fault is supported: VFIO can
enable on-demand paging in stage-2 mappings. In case a device cannot
tolerate faults in all paths, a variant driver could use this interface
to pin down structures which don't expect faults.

> I think the assumption is that devices that can perform DMA through an
> IOMMU generally wouldn't need to twiddle guest DMA targets on a regular
> basis otherwise, therefore limiting this to emulated IOMMU devices is
> reasonable. Thanks,

IMHO if functionally this function only works for the emulated case then
we should add code to detect and fail if it's called otherwise. Otherwise
it doesn't make much sense to add a comment to explicitly limit it to an
existing use case.

Thanks
Kevin
On Tue, Aug 23, 2022 at 01:31:11AM +0000, Tian, Kevin wrote:

> > In fact I do recall such discussions. An IOMMU backed mdev (defunct)
> > or vfio-pci variant driver could gratuitously pin pages in order to
> > limit the dirty page scope. We don't have anything in-tree that relies
> > on this. It also seems we're heading more in the direction of device
> > level DMA dirty tracking as Yishai proposes in the series for mlx5.
> > These interfaces are far more efficient for this use case, but perhaps
> > you have another use case in mind where we couldn't use the dma_rw
> > interface?
>
> One potential scenario is when I/O page fault is supported VFIO can
> enable on-demand paging in stage-2 mappings. In case a device cannot
> tolerate faults in all paths then a variant driver could use this interface
> to pin down structures which don't expect faults.

If this need arises, and I've had discussions about such things in the
past, it makes more sense to have a proper API to inhibit faulting of
a sub-range in what would otherwise be a faultable iommu_domain.

Inhibiting faulting might be the same underlying code as pinning, but
I would prefer we don't co-mingle these very different concepts at the
device driver level.

> IMHO if functionally this function only works for emulated case then we
> should add code to detect and fail if it's called otherwise.

Today it only works correctly for the emulated case because only the
emulated case will be guaranteed to have a singleton group.

It *might* work for other cases, but not generally. In the general
case a physical device driver may be faced with multi-device groups
and it shouldn't fail.

So, I would prefer to comment it like this, and if someone comes with a
driver that wants to use it in some other way they have to address
these problems so it works generally and correctly. I don't want to
write more code to protect against something that auditing tells us
doesn't happen today.

The whole thing should naturally become fixed fairly soon, as once we
have Yishai and Joao's changes there will be no use for the dirty
tracking code in type1 that is causing this problem.

Jason
On Tue, 23 Aug 2022 21:38:08 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> So, I would prefer to comment it like this and if someone comes with a
> driver that wants to use it in some other way they have to address
> these problems so it works generally and correctly. I don't want to
> write more code to protect against something that auditing tells us
> doesn't happen today.
>
> The whole thing should naturally become fixed fairly soon, as once we
> have Yishai and Joao's changes there will be no use for the dirty
> tracking code in type1 that is causing this problem.

I assume this is referring to the DMA logging interface and
implementation for mlx5[1], but I don't see anyone else getting on board
with that proposal (or reviewing). AIUI, most are waiting for system
IOMMU DMA logging, so the above seems a bit of a bold claim. Do we
expect system IOMMU logging flowing through this device feature or will
there be an interface at the shared IOMMU context level? Do we expect
mdev drivers supporting migration to track their dirty iovas locally
and implement this feature?

And touching on this question that wasn't addressed...

On Tue, 23 Aug 2022 01:31:11 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Tuesday, August 23, 2022 2:36 AM
> > It's related to the dirty page scope. Use of the pinned page interface
> > is essentially a contract with the IOMMU back-end that only pinned pages
> > will be considered for the dirty page scope. However, type1 operates
> > on groups, therefore we needed to create a restriction that only
> > singleton groups could make use of page pinning such that the dirty
> > page scope could be attached to the group.
>
> I get the former part but not the latter. Can you elaborate why
> multi-device group can not be attached by the dirty page scope?
> It's kind of sharing the scope by all devices in the group.

There's no fundamental reason we couldn't do it. At the time it wasn't
necessary and wasn't convenient since type1 operated exclusively on
groups. Now maybe we'd expand the device_list beyond just devices with
dma_unmap callbacks and link page pinning to the device rather than the
group to get better granularity, but that all seems to be work in the
opposite direction from where we want the IOMMU uAPI to go.

If type1 dirty logging goes away and moves to the drivers, we could
scrap the singleton requirement, but as Jason notes, expanding its use
to hack around regions that can't fault is a bit of an abuse of the
intended use case. Thanks,

Alex

[1] 20220815151109.180403-1-yishaih@nvidia.com
On Wed, Aug 24, 2022 at 04:02:01PM -0600, Alex Williamson wrote:

> On Tue, 23 Aug 2022 21:38:08 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> > So, I would prefer to comment it like this and if someone comes with a
> > driver that wants to use it in some other way they have to address
> > these problems so it works generally and correctly. I don't want to
> > write more code to protect against something that auditing tells us
> > doesn't happen today.
> >
> > The whole thing should naturally become fixed fairly soon, as once we
> > have Yishai and Joao's changes there will be no use for the dirty
> > tracking code in type1 that is causing this problem.
>
> I assume this is referring to the DMA logging interface and
> implementation for mlx5[1], but I don't see anyone else getting on board
> with that proposal (or reviewing).

We have a large group pushing in this direction together.

Yishai implemented the API in vfio for mlx5, showing how a device
centric tracker can work. In my research I found no other groups
besides NVIDIA planning to do device tracking, so it is not surprising
his series has been as well reviewed as it probably will be. The other
use cases inside NVIDIA are also happy with this design.

Joao implemented the similar API in iommufd for the system iommu, and
did this with hisilicon's support because they need it for their merged
hisilicon vfio driver.

https://github.com/jpemartins/linux/commit/7baa3d51417419690aa8b57f6cc58be7ab3a40a9

I understand Intel to use this with IDXD as well.

So, basically everyone who is working in the migratable VFIO device
space is on board with, and contributing to, this plan right now.

> IOMMU DMA logging, so the above seems a bit of a bold claim. Do we
> expect system IOMMU logging flowing through this device feature or will
> there be an interface at the shared IOMMU context level?

Joao's draft series shows the basic idea:

https://github.com/jpemartins/linux/commit/fada208468303a3a6df340ac20cda8fc1b869e7a#diff-4318a59849ffa0d4aa992221863be15cde4c01e1f82d2f6d06337cfe6dd316afR228

We are working this all through in pieces as they become ready.

> Do we expect mdev drivers supporting migration to track their dirty
> iovas locally and implement this feature?

Assuming we ever get a SW-only device that can do migration..

I would expect there to be a library to manage the dirty bitmap
datastructure and the instance of that datastructure to be linked to
the IOAS the device is attached to. Ie one IOAS, one datastructure.

The additional code in the mdev driver would be a handful of lines.

Whether we want it to report out through vfio or through an iommufd API
is an interesting question I don't have a solid answer to right now.
Either will work - iommufd has a nice hole to put a "sw page table"
object parallel to the "hw page table" object that the draft series put
the API on. The appeal is it makes the sharing of the datastructure
clearer to userspace.

The trouble with mlx5 is there isn't a nice iommufd spot to put a
device tracker. The closest is on the device id, but using the device
id in APIs intended for "page table" ids is wonky. And there is no
datastructure sharing here, obviously. Not such a nice fit.

Jason
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, August 24, 2022 8:38 AM
>
> On Tue, Aug 23, 2022 at 01:31:11AM +0000, Tian, Kevin wrote:
>
> > > In fact I do recall such discussions. An IOMMU backed mdev (defunct)
> > > or vfio-pci variant driver could gratuitously pin pages in order to
> > > limit the dirty page scope. We don't have anything in-tree that relies
> > > on this. It also seems we're heading more in the direction of device
> > > level DMA dirty tracking as Yishai proposes in the series for mlx5.
> > > These interfaces are far more efficient for this use case, but perhaps
> > > you have another use case in mind where we couldn't use the dma_rw
> > > interface?
> >
> > One potential scenario is when I/O page fault is supported VFIO can
> > enable on-demand paging in stage-2 mappings. In case a device cannot
> > tolerate faults in all paths then a variant driver could use this interface
> > to pin down structures which don't expect faults.
>
> If this need arises, and I've had discussions about such things in the
> past, it makes more sense to have a proper API to inhibit faulting of
> a sub-range in what would otherwise be a faultable iommu_domain.
>
> Inhibiting faulting might be the same underlying code as pinning, but
> I would prefer we don't co-mingle these very different concepts at the
> device driver level.
>
> > IMHO if functionally this function only works for emulated case then we
> > should add code to detect and fail if it's called otherwise.
>
> Today it only works correctly for the emulated case because only the
> emulated case will be guaranteed to have a singleton group.
>
> It *might* work for other cases, but not generally. In the general
> case a physical device driver may be faced with multi-device groups
> and it shouldn't fail.
>
> So, I would prefer to comment it like this and if someone comes with a
> driver that wants to use it in some other way they have to address
> these problems so it works generally and correctly. I don't want to
> write more code to protect against something that auditing tells us
> doesn't happen today.

With all the discussions in this thread I'm fine with adding a comment
like this (though it's not my preference, which isn't strong either):

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
On Tue, 16 Aug 2022 16:13:04 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> This counts the number of devices attached to a vfio_group, ie the number
> of items in the group->device_list.
>
> It is only read in vfio_pin_pages(), as some kind of protection against
> limitations in type1.
>
> However, with all the code cleanups in this area, now that
> vfio_pin_pages() accepts a vfio_device directly it is redundant. All
> drivers are already calling vfio_register_emulated_iommu_dev() which
> directly creates a group specifically for the device and thus it is
> guaranteed that there is a singleton group.
>
> Leave a note in the comment about this requirement and remove the logic.
>
> Reviewed-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/vfio/vfio_main.c | 9 +++------
>  1 file changed, 3 insertions(+), 6 deletions(-)

Applied to vfio next branch for v6.1. Thanks,

Alex
Hi, Jason,

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, August 25, 2022 8:44 AM
>
> On Wed, Aug 24, 2022 at 04:02:01PM -0600, Alex Williamson wrote:
>
> > Do we expect mdev drivers supporting migration to track their dirty
> > iovas locally and implement this feature?
>
> Assuming we ever get a SW-only device that can do migration..
>
> I would expect there to be a library to manage the dirty bitmap
> datastructure and the instance of that datastructure to be linked to
> the IOAS the device is attached to. Ie one IOAS, one datastructure.
>
> The additional code in the mdev driver would be a handful of lines.
>
> Whether we want it to report out through vfio or through an iommufd API
> is an interesting question I don't have a solid answer to right
> now. Either will work - iommufd has a nice hole to put a "sw page
> table" object parallel to the "hw page table" object that the draft
> series put the API on. The appeal is it makes the sharing of the
> datastructure clearer to userspace.
>
> The trouble with mlx5 is there isn't a nice iommufd spot to put a
> device tracker. The closest is on the device id, but using the device
> id in APIs intended for "page table" ids is wonky. And there is no
> datastructure sharing here, obviously. Not such a nice fit.

A bit more thinking on this.

Although mlx5 internal tracking is not generic, the uAPI proposed
in Yishai's series is pretty generic. I wonder whether it can be
extended as a formal interface in iommufd, with iommu dirty tracking
also being a dirty tracker implementation.

Currently the concept between Yishai's and Joao's series is actually
similar, both having a capability interface and a command to read/clear
the bitmap of a tracked range, except that Yishai's series allows
userspace to do range-based tracking while Joao's series currently has
the tracking enabled per the entire page table.

But in concept iommu dirty tracking can support range-based interfaces
too. For global dirty bit control (e.g. Intel and AMD) the iommu driver
can just turn it on globally upon any enabled range. For per-PTE dirty
bit control (e.g. ARM) it's actually a nice fit for range-based
tracking.

This pays the complexity of introducing a new object (dirty tracker)
in the iommufd object hierarchy, with the gain of providing a single
dirty tracking interface to userspace. Kind of like a page table can
have a list of dirty trackers and each tracker is used by a list of
devices.

If an iommu dirty tracker is available it is preferred to and used by
all devices (except mdevs) attached to the page table. Otherwise mlx5
etc. picks a vendor-specific dirty tracker if available. mdevs just
share a generic software dirty tracker, not affected by the presence
of the iommu dirty tracker.

Multiple trackers might be used for a page table at the same time.

Is the above all worth it?

Thanks
Kevin
On Tue, Sep 06, 2022 at 09:21:35AM +0000, Tian, Kevin wrote:

> Although mlx5 internal tracking is not generic, the uAPI proposed
> in Yishai's series is pretty generic. I wonder whether it can be extended
> as a formal interface in iommufd with iommu dirty tracking also being
> a dirty tracker implementation.

It probably could, but I don't think it is helpful.

> Currently the concept between Yishai's and Joao's series is actually
> similar, both having a capability interface and a command to read/clear
> the bitmap of a tracked range except that Yishai's series allows
> userspace to do range-based tracking while Joao's series currently
> have the tracking enabled per the entire page table.

This similarity is deliberate.

> But in concept iommu dirty tracking can support range-based interfaces
> too. For global dirty bit control (e.g. Intel and AMD) the iommu driver
> can just turn it on globally upon any enabled range. For per-PTE
> dirty bit control (e.g. ARM) it's actually a nice fit for range-based
> tracking.

We don't actually have a use case for range-based tracking. The reason
it exists for mlx5 is only because the device just cannot do global
tracking. I was hoping not to bring that complexity into the system
iommu side.

> This pays the complexity of introducing a new object (dirty tracker)
> in iommufd object hierarchy, with the gain of providing a single
> dirty tracking interface to userspace.

Well, it isn't single, it is still two interfaces that work in two
quite different ways. Userspace has to know exactly which of the two
ways it is working in and do everything accordingly.

We get to share some structs and ioctl #s, but I don't see a real
simplification here. If anything the false sharing creates a new kind
of complexity:

> Kind of like a page table can have a list of dirty trackers and each
> tracker is used by a list of devices.
>
> If iommu dirty tracker is available it is preferred to and used by all
> devices (except mdev) attached to the page table.
>
> Otherwise mlx5 etc. picks a vendor specific dirty tracker if available.
>
> mdev's just share a generic software dirty tracker, not affected by
> the presence of iommu dirty tracker.
>
> Multiple trackers might be used for a page table at the same time.
>
> Is above all worth it?

Because now we are trying to create one master interface, we need new
APIs to control what trackers exist, what trackers are turned on, and
so forth. Now we've added complexity.

You end up with iommufd proxying a VFIO object that is completely
unrelated to drivers/iommu - I don't like it just on that basis.

The tracker is a vfio object, created by a vfio driver, with an
interface deliberately similar to iommufd's interface to ease the
implementation in userspace. It is not related to drivers/iommu and it
doesn't benefit from anything in drivers/iommu beyond the shared code
to manage the shared uapi.

Jason
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 7cb56c382c97a2..94a76cb86a0388 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -74,7 +74,6 @@ struct vfio_group {
 	struct list_head vfio_next;
 	struct list_head container_next;
 	enum vfio_group_type type;
-	unsigned int dev_counter;
 	struct rw_semaphore group_rwsem;
 	struct kvm *kvm;
 	struct file *opened_file;
@@ -608,7 +607,6 @@ static int __vfio_register_dev(struct vfio_device *device,
 
 	mutex_lock(&group->device_lock);
 	list_add(&device->group_next, &group->device_list);
-	group->dev_counter++;
 	mutex_unlock(&group->device_lock);
 
 	return 0;
@@ -696,7 +694,6 @@ void vfio_unregister_group_dev(struct vfio_device *device)
 
 	mutex_lock(&group->device_lock);
 	list_del(&device->group_next);
-	group->dev_counter--;
 	mutex_unlock(&group->device_lock);
 
 	if (group->type == VFIO_NO_IOMMU || group->type == VFIO_EMULATED_IOMMU)
@@ -1946,6 +1943,9 @@ EXPORT_SYMBOL(vfio_set_irqs_validate_and_prepare);
  * @prot [in] : protection flags
  * @pages[out] : array of host pages
  * Return error or number of pages pinned.
+ *
+ * A driver may only call this function if the vfio_device was created
+ * by vfio_register_emulated_iommu_dev().
  */
 int vfio_pin_pages(struct vfio_device *device, dma_addr_t iova,
 		   int npage, int prot, struct page **pages)
@@ -1961,9 +1961,6 @@ int vfio_pin_pages(struct vfio_device *device, dma_addr_t iova,
 	if (npage > VFIO_PIN_PAGES_MAX_ENTRIES)
 		return -E2BIG;
 
-	if (group->dev_counter > 1)
-		return -EINVAL;
-
 	/* group->container cannot change while a vfio device is open */
 	container = group->container;
 	driver = container->iommu_driver;