
[RFC,05/10] vfio: Create a vfio_device from vma lookup

Message ID 161401268537.16443.2329805617992345365.stgit@gimli.home (mailing list archive)
State New, archived
Series vfio: Device memory DMA mapping improvements

Commit Message

Alex Williamson Feb. 22, 2021, 4:51 p.m. UTC
Introduce a vfio bus driver policy where using the exported
vfio_device_vma_open() as the vm_ops.open for a vma indicates
vm_private_data is the struct vfio_device pointer associated
to the vma.  This allows a direct vma to device lookup.

Operating on an active, open vma mapping the device, we should be
able to directly increment the vfio_device reference.

Implemented only for vfio-pci for now.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/pci/vfio_pci.c |    6 ++++--
 drivers/vfio/vfio.c         |   24 ++++++++++++++++++++++++
 include/linux/vfio.h        |    2 ++
 3 files changed, 30 insertions(+), 2 deletions(-)

Comments

Jason Gunthorpe Feb. 22, 2021, 5:29 p.m. UTC | #1
On Mon, Feb 22, 2021 at 09:51:25AM -0700, Alex Williamson wrote:

> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index da212425ab30..399c42b77fbb 100644
> +++ b/drivers/vfio/vfio.c
> @@ -572,6 +572,15 @@ void vfio_device_unmap_mapping_range(struct vfio_device *device,
>  }
>  EXPORT_SYMBOL_GPL(vfio_device_unmap_mapping_range);
>  
> +/*
> + * A VFIO bus driver using this open callback will provide a
> + * struct vfio_device pointer in the vm_private_data field.

The vfio_device pointer should be stored in the struct file

> +struct vfio_device *vfio_device_get_from_vma(struct vm_area_struct *vma)
> +{
> +	struct vfio_device *device;
> +
> +	if (vma->vm_ops->open != vfio_device_vma_open)
> +		return ERR_PTR(-ENODEV);
> +

Having looked at VFIO a lot more closely last week, this is even more
trivial - VFIO only creates mmaps of memory we want to invalidate, so
this is just very simple:

struct vfio_device *vfio_device_get_from_vma(struct vm_area_struct *vma)
{
       if (!vma->vm_file || vma->vm_file->f_op != &vfio_device_fops)
	   return ERR_PTR(-ENODEV);
       return vma->vm_file->private_data;
}

The only use of the special ops would be if there are multiple types
of mmap's going on, but for this narrow use case those would be safely
distinguished by the vm_pgoff instead

> +extern void vfio_device_vma_open(struct vm_area_struct *vma);
> +extern struct vfio_device *vfio_device_get_from_vma(struct vm_area_struct *vma);

No externs on function prototypes in new code please, we've been
slowly deleting them..

Jason
Alex Williamson Feb. 24, 2021, 9:55 p.m. UTC | #2
On Mon, 22 Feb 2021 13:29:13 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Feb 22, 2021 at 09:51:25AM -0700, Alex Williamson wrote:
> 
> > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > index da212425ab30..399c42b77fbb 100644
> > +++ b/drivers/vfio/vfio.c
> > @@ -572,6 +572,15 @@ void vfio_device_unmap_mapping_range(struct vfio_device *device,
> >  }
> >  EXPORT_SYMBOL_GPL(vfio_device_unmap_mapping_range);
> >  
> > +/*
> > + * A VFIO bus driver using this open callback will provide a
> > + * struct vfio_device pointer in the vm_private_data field.  
> 
> The vfio_device pointer should be stored in the struct file
> 
> > +struct vfio_device *vfio_device_get_from_vma(struct vm_area_struct *vma)
> > +{
> > +	struct vfio_device *device;
> > +
> > +	if (vma->vm_ops->open != vfio_device_vma_open)
> > +		return ERR_PTR(-ENODEV);
> > +  
> 
> Having looked at VFIO a lot more closely last week, this is even more
> trivial - VFIO only creates mmaps of memory we want to invalidate, so
> this is just very simple:
> 
> struct vfio_device *vfio_device_get_from_vma(struct vm_area_struct *vma)
> {
>        if (!vma->vm_file || vma->vm_file->f_op != &vfio_device_fops)
> 	   return ERR_PTR(-ENODEV);
>        return vma->vm_file->private_data;
> }

That's pretty clever.

> The only use of the special ops would be if there are multiple types
> of mmap's going on, but for this narrow use case those would be safely
> distinguished by the vm_pgoff instead

We potentially do have device specific regions which can support mmap,
for example the migration region.  We'll need to think about how we
could even know if portions of those regions map to a device.  We could
use the notifier to announce it and require the code supporting those
device specific regions manage it.

I'm not really clear what you're getting at with vm_pgoff though, could
you explain further?

> > +extern void vfio_device_vma_open(struct vm_area_struct *vma);
> > +extern struct vfio_device *vfio_device_get_from_vma(struct vm_area_struct *vma);  
> 
> No externs on function prototypes in new code please, we've been
> slowly deleting them..

For now it's consistent with what we have there already, I'd prefer a
separate cleanup of that before or after rather than introducing
inconsistency here.  Thanks,

Alex
Jason Gunthorpe Feb. 25, 2021, 12:06 a.m. UTC | #3
On Wed, Feb 24, 2021 at 02:55:06PM -0700, Alex Williamson wrote:

> > The only use of the special ops would be if there are multiple types
> > of mmap's going on, but for this narrow use case those would be safely
> > distinguished by the vm_pgoff instead
> 
> We potentially do have device specific regions which can support mmap,
> for example the migration region.  We'll need to think about how we
> could even know if portions of those regions map to a device.  We could
> use the notifier to announce it and require the code supporting those
> device specific regions manage it.

So, the above basically says any VFIO VMA is allowed for VFIO to map
to the IOMMU.

If there are places creating mmaps for VFIO that should not go to the
IOMMU then they need to return NULL from this function.

> I'm not really clear what you're getting at with vm_pgoff though, could
> you explain further?

Ah, so I have to take a side discussion to explain what I meant.

The vm_pgoff is a bit confused because we change it here in vfio_pci:

    vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;

But the address_space invalidation assumes it still has the region
based encoding:

+	vfio_device_unmap_mapping_range(vdev->device,
+			VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX),
+			VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_ROM_REGION_INDEX) -
+			VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX));

Those three indexes are in the vm_pgoff numberspace and so vm_pgoff
must always be set to the same thing - either the
VFIO_PCI_INDEX_TO_OFFSET() coding or the physical pfn. 

Since you say we need a limited invalidation this looks like a bug to
me - and it must always be the VFIO_PCI_INDEX_TO_OFFSET coding.

So, the PCI vma needs to get switched to use the
VFIO_PCI_INDEX_TO_OFFSET coding and then we can always extract the
region number from the vm_pgoff and thus access any additional data,
such as the base pfn or a flag saying this cannot be mapped to the
IOMMU. Do the reverse of VFIO_PCI_INDEX_TO_OFFSET and consult
information attached to that region ID.

All places creating vfio mmaps have to set the vm_pgoff to
VFIO_PCI_INDEX_TO_OFFSET().

But we have these violations that need fixing:

drivers/vfio/fsl-mc/vfio_fsl_mc.c:      vma->vm_pgoff = (region.addr >> PAGE_SHIFT) + pgoff;
drivers/vfio/platform/vfio_platform_common.c:   vma->vm_pgoff = (region.addr >> PAGE_SHIFT) + pgoff;

Couldn't see any purpose to this code, cargo cult copy? Just delete
it.

drivers/vfio/pci/vfio_pci.c:    vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;

Used to implement fault() but we could get the region number and
extract the pfn from the vfio_pci_device's data easy enough.

I manually checked that other parts of VFIO not under drivers/vfio are
doing it OK, looks fine.

Jason
Alex Williamson Feb. 25, 2021, 10:21 p.m. UTC | #4
On Wed, 24 Feb 2021 20:06:10 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Feb 24, 2021 at 02:55:06PM -0700, Alex Williamson wrote:
> 
> > > The only use of the special ops would be if there are multiple types
> > > of mmap's going on, but for this narrow use case those would be safely
> > > distinguished by the vm_pgoff instead  
> > 
> > We potentially do have device specific regions which can support mmap,
> > for example the migration region.  We'll need to think about how we
> > could even know if portions of those regions map to a device.  We could
> > use the notifier to announce it and require the code supporting those
> > device specific regions manage it.  
> 
> So, the above basically says any VFIO VMA is allowed for VFIO to map
> to the IOMMU.
> 
> If there are places creating mmaps for VFIO that should not go to the
> IOMMU then they need to return NULL from this function.
> 
> > I'm not really clear what you're getting at with vm_pgoff though, could
> > you explain further?  
> 
> Ah, so I have to take a side discussion to explain what I meant.
> 
> The vm_pgoff is a bit confused because we change it here in vfio_pci:
> 
>     vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
> 
> But the address_space invalidation assumes it still has the region
> based encoding:
> 
> +	vfio_device_unmap_mapping_range(vdev->device,
> +			VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX),
> +			VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_ROM_REGION_INDEX) -
> +			VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX));
> 
> Those three indexes are in the vm_pgoff numberspace and so vm_pgoff
> must always be set to the same thing - either the
> VFIO_PCI_INDEX_TO_OFFSET() coding or the physical pfn. 

Aha, I hadn't made that connection.

> Since you say we need a limited invalidation this looks like a bug to
> me - and it must always be the VFIO_PCI_INDEX_TO_OFFSET coding.

Yes, this must have only worked in testing because I mmap'd BAR0 which
is at index/offset zero, so the pfn range overlapped the user offset.
I'm glad you caught that...

> So, the PCI vma needs to get switched to use the
> VFIO_PCI_INDEX_TO_OFFSET coding and then we can always extract the
> region number from the vm_pgoff and thus access any additional data,
> such as the base pfn or a flag saying this cannot be mapped to the
> IOMMU. Do the reverse of VFIO_PCI_INDEX_TO_OFFSET and consult
> information attached to that region ID.
> 
> All places creating vfio mmaps have to set the vm_pgoff to
> VFIO_PCI_INDEX_TO_OFFSET().

This is where it gets tricky.  The vm_pgoff we get from
file_operations.mmap is already essentially describing an offset from
the base of a specific resource.  We could convert that from an absolute
offset to a pfn offset, but it's only the bus driver code (ex.
vfio-pci) that knows how to get the base, assuming there is a single
base per region (we can't assume enough bits per region to store
absolute pfn).  Also note that you're suggesting that all vfio mmaps
would need to standardize on the vfio-pci implementation of region
layouts.  Not that most drivers haven't copied vfio-pci, but we've
specifically avoided exposing it as a fixed uAPI such that we could have
the flexibility for a bus driver to implement regions offsets however
they need.

So I'm not really sure what this looks like.  Within vfio-pci we could
keep the index bits in place to allow unmap_mapping_range() to
selectively zap matching vm_pgoffs but expanding that to a vfio
standard such that the IOMMU backend can also extract a pfn looks very
limiting, or ugly.  Thanks,

Alex

> But we have these violations that need fixing:
> 
> drivers/vfio/fsl-mc/vfio_fsl_mc.c:      vma->vm_pgoff = (region.addr >> PAGE_SHIFT) + pgoff;
> drivers/vfio/platform/vfio_platform_common.c:   vma->vm_pgoff = (region.addr >> PAGE_SHIFT) + pgoff;
> 
> Couldn't see any purpose to this code, cargo cult copy? Just delete
> it.
> 
> drivers/vfio/pci/vfio_pci.c:    vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
> 
> Used to implement fault() but we could get the region number and
> extract the pfn from the vfio_pci_device's data easy enough.
> 
> I manually checked that other parts of VFIO not under drivers/vfio are
> doing it OK, looks fine.
> 
> Jason
>
Jason Gunthorpe Feb. 25, 2021, 11:49 p.m. UTC | #5
On Thu, Feb 25, 2021 at 03:21:13PM -0700, Alex Williamson wrote:

> This is where it gets tricky.  The vm_pgoff we get from
> file_operations.mmap is already essentially describing an offset from
> the base of a specific resource.  We could convert that from an absolute
> offset to a pfn offset, but it's only the bus driver code (ex.
> vfio-pci) that knows how to get the base, assuming there is a single
> base per region (we can't assume enough bits per region to store
> absolute pfn).  Also note that you're suggesting that all vfio mmaps
> would need to standardize on the vfio-pci implementation of region
> layouts.  Not that most drivers haven't copied vfio-pci, but we've
> specifically avoided exposing it as a fixed uAPI such that we could have
> the flexibility for a bus driver to implement regions offsets however
> they need.

Okay, well the bus driver owns the address space and the bus driver is
in control of the vm_pgoff. If it doesn't want to zap then it doesn't
need to do anything

vfio-pci can consistently use the index encoding and be fine
 
> So I'm not really sure what this looks like.  Within vfio-pci we could
> keep the index bits in place to allow unmap_mapping_range() to
> selectively zap matching vm_pgoffs but expanding that to a vfio
> standard such that the IOMMU backend can also extract a pfn looks very
> limiting, or ugly.  Thanks,

Let's add an op to convert a vma into a PFN range. The map code will
pass the vma to the op and get back a pfn (or failure).

pci will switch the vm_pgoff to an index, find the bar base and
compute the pfn.

It is starting to look more and more like dma buf though

Jason
Alex Williamson March 4, 2021, 9:37 p.m. UTC | #6
On Thu, 25 Feb 2021 19:49:49 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Feb 25, 2021 at 03:21:13PM -0700, Alex Williamson wrote:
> 
> > This is where it gets tricky.  The vm_pgoff we get from
> > file_operations.mmap is already essentially describing an offset from
> > the base of a specific resource.  We could convert that from an absolute
> > offset to a pfn offset, but it's only the bus driver code (ex.
> > vfio-pci) that knows how to get the base, assuming there is a single
> > base per region (we can't assume enough bits per region to store
> > absolute pfn).  Also note that you're suggesting that all vfio mmaps
> > would need to standardize on the vfio-pci implementation of region
> > layouts.  Not that most drivers haven't copied vfio-pci, but we've
> > specifically avoided exposing it as a fixed uAPI such that we could have
> > the flexibility for a bus driver to implement regions offsets however
> > they need.  
> 
> Okay, well the bus driver owns the address space and the bus driver is
> in control of the vm_pgoff. If it doesn't want to zap then it doesn't
> need to do anything
> 
> vfio-pci can consistently use the index encoding and be fine
>  
> > So I'm not really sure what this looks like.  Within vfio-pci we could
> > keep the index bits in place to allow unmap_mapping_range() to
> > selectively zap matching vm_pgoffs but expanding that to a vfio
> > standard such that the IOMMU backend can also extract a pfn looks very
> > limiting, or ugly.  Thanks,  
> 
> Let's add an op to convert a vma into a PFN range. The map code will
> pass the vma to the op and get back a pfn (or failure).
> 
> pci will switch the vm_pgoff to an index, find the bar base and
> compute the pfn.
> 
> It is starting to look more and more like dma buf though

How terrible would it be if vfio-core used a shared vm_private_data
structure and inserted itself into the vm_ops call chain to reference
count that structure?  We already wrap the device file descriptor via
vfio_device_fops.mmap, so we could:

	struct vfio_vma_private_data *priv;

	priv = kzalloc(...
	
	priv->device = device;
	vma->vm_private_data = priv;

	device->ops->mmap(device->device_data, vma);

	if (vma->vm_private_data == priv) {
		priv->vm_ops = vma->vm_ops;
		vma->vm_ops = &vfio_vm_ops;
	} else
		kfree(priv);

Where:

struct vfio_vma_private_data {
	struct vfio_device *device;
	unsigned long pfn_base;
	void *private_data; // maybe not needed
	const struct vm_operations_struct *vm_ops;
	struct kref kref;
	unsigned int release_notification:1;
};

Therefore unless a bus driver opts-out by replacing vm_private_data, we
can identify participating vmas by the vm_ops and have flags indicating
if the vma maps device memory such that vfio_get_device_from_vma()
should produce a device reference.  The vfio IOMMU backends would also
consume this, ie. if they get a valid vfio_device from the vma, use the
pfn_base field directly.  vfio_vm_ops would wrap the bus driver
callbacks and provide reference counting on open/close to release this
object.

I'm not thrilled with a vfio_device_ops callback plumbed through
vfio-core to do vma-to-pfn translation, so I thought this might be a
better alternative.  Thanks,

Alex
Jason Gunthorpe March 4, 2021, 11:16 p.m. UTC | #7
On Thu, Mar 04, 2021 at 02:37:57PM -0700, Alex Williamson wrote:

> Therefore unless a bus driver opts-out by replacing vm_private_data, we
> can identify participating vmas by the vm_ops and have flags indicating
> if the vma maps device memory such that vfio_get_device_from_vma()
> should produce a device reference.  The vfio IOMMU backends would also
> consume this, ie. if they get a valid vfio_device from the vma, use the
> pfn_base field directly.  vfio_vm_ops would wrap the bus driver
> callbacks and provide reference counting on open/close to release this
> object.

> I'm not thrilled with a vfio_device_ops callback plumbed through
> vfio-core to do vma-to-pfn translation, so I thought this might be a
> better alternative.  Thanks,

Maybe you could explain why, because I'm looking at this idea and
thinking it looks very complicated compared to a simple driver op
callback?

The implementation of such an op for vfio_pci is one line trivial, why
do we need allocated memory and an entire shim layer instead?

Shim layers are bad.

We still need a driver op of some kind because only the driver can
convert a pg_off into a PFN. Remember the big point here is to remove
the sketchy follow_pte()...

Jason
Alex Williamson March 5, 2021, 12:07 a.m. UTC | #8
On Thu, 4 Mar 2021 19:16:33 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Mar 04, 2021 at 02:37:57PM -0700, Alex Williamson wrote:
> 
> > Therefore unless a bus driver opts-out by replacing vm_private_data, we
> > can identify participating vmas by the vm_ops and have flags indicating
> > if the vma maps device memory such that vfio_get_device_from_vma()
> > should produce a device reference.  The vfio IOMMU backends would also
> > consume this, ie. if they get a valid vfio_device from the vma, use the
> > pfn_base field directly.  vfio_vm_ops would wrap the bus driver
> > callbacks and provide reference counting on open/close to release this
> > object.  
> 
> > I'm not thrilled with a vfio_device_ops callback plumbed through
> > vfio-core to do vma-to-pfn translation, so I thought this might be a
> > better alternative.  Thanks,  
> 
> Maybe you could explain why, because I'm looking at this idea and
> thinking it looks very complicated compared to a simple driver op
> callback?

vfio-core needs to export a vfio_vma_to_pfn() which I think assumes the
caller has already used vfio_device_get_from_vma(), but should still
validate the vma is one from a vfio device before calling this new
vfio_device_ops callback.  vfio-pci needs to validate the vm_pgoff
value falls within a BAR region, mask off the index and get the
pci_resource_start() for the BAR index.

Then we need a solution for how vfio_device_get_from_vma() determines
whether to grant a device reference for a given vma, where that vma may
map something other than device memory.  Are you imagining that we hand
out device references independently and vfio_vma_to_pfn() would return
an errno for vm_pgoff values that don't map device memory and the IOMMU
driver would release the reference?

It all seems rather ad-hoc.
 
> The implementation of such an op for vfio_pci is one line trivial, why
> do we need allocated memory and an entire shim layer instead?
> 
> Shim layers are bad.

The only thing here I'd consider a shim layer is overriding vm_ops,
which just seemed like a cleaner and simpler solution than exporting
open/close functions and validating the bus driver installs them, and
the error path should they not.

> We still need a driver op of some kind because only the driver can
> convert a pg_off into a PFN. Remember the big point here is to remove
> the sketchy follow_pte()...

The bus driver simply writes the base_pfn value in the example
structure I outlined in its .mmap callback.  I'm just looking for an
alternative place to store our former vm_pgoff in a way that doesn't
prevent using unmap_mapping_range().  The IOMMU backend, once it has a
vfio_device via vfio_device_get_from_vma() can know the format of
vm_private_data, cast it as a vfio_vma_private_data and directly use
base_pfn, accomplishing the big point.  They're all operating in the
agreed upon vm_private_data format.  Thanks,

Alex
Jason Gunthorpe March 5, 2021, 12:36 a.m. UTC | #9
On Thu, Mar 04, 2021 at 05:07:31PM -0700, Alex Williamson wrote:
> On Thu, 4 Mar 2021 19:16:33 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Thu, Mar 04, 2021 at 02:37:57PM -0700, Alex Williamson wrote:
> > 
> > > Therefore unless a bus driver opts-out by replacing vm_private_data, we
> > > can identify participating vmas by the vm_ops and have flags indicating
> > > if the vma maps device memory such that vfio_get_device_from_vma()
> > > should produce a device reference.  The vfio IOMMU backends would also
> > > consume this, ie. if they get a valid vfio_device from the vma, use the
> > > pfn_base field directly.  vfio_vm_ops would wrap the bus driver
> > > callbacks and provide reference counting on open/close to release this
> > > object.  
> > 
> > > I'm not thrilled with a vfio_device_ops callback plumbed through
> > > vfio-core to do vma-to-pfn translation, so I thought this might be a
> > > better alternative.  Thanks,  
> > 
> > Maybe you could explain why, because I'm looking at this idea and
> > thinking it looks very complicated compared to a simple driver op
> > callback?
> 
> vfio-core needs to export a vfio_vma_to_pfn() which I think assumes the
> caller has already used vfio_device_get_from_vma(), but should still
> validate the vma is one from a vfio device before calling this new
> vfio_device_ops callback.

Huh? Validate? Why?

Something like this in the IOMMU stuff:

   struct vfio_device *vfio = vfio_device_get_from_vma(vma)

   if (!vfio->ops->vma_to_pfn)
        return -EINVAL;
   vfio->ops->vma_to_pfn(vfio, vma, offset_from_vma_start)

Is fine, why do we need to over complicate?

I don't need to check that the vma belongs to the vfio because it is
the API contract that the caller will guarantee that.

This is the kernel, I can (and do!) check these sorts of things by
code inspection when working on stuff - I can look at every
implementation and every call site to prove these things.

IMHO doing an expensive check like that is a style of defensive
programming the kernel community frowns upon.

> vfio-pci needs to validate the vm_pgoff value falls within a BAR
> region, mask off the index and get the pci_resource_start() for the
> BAR index.

It needs to do the same thing fault() already does, which is currently
one line..

> Then we need a solution for how vfio_device_get_from_vma() determines
> whether to grant a device reference for a given vma, where that vma may
> map something other than device memory. Are you imagining that we hand
> out device references independently and vfio_vma_to_pfn() would return
> an errno for vm_pgoff values that don't map device memory and the IOMMU
> driver would release the reference?

That seems a reasonable place to start

> prevent using unmmap_mapping_range().  The IOMMU backend, once it has a
> vfio_device via vfio_device_get_from_vma() can know the format of
> vm_private_data, cast it as a vfio_vma_private_data and directly use
> base_pfn, accomplishing the big point.  They're all operating in the
> agreed upon vm_private_data format.  Thanks,

If we force all drivers into a mandatory (!) vm_private_data format
then every driver has to be audited and updated before the new pfn
code can be done. If any driver in the future makes a mistake here
(and omitting the unique vm_private_data magics is a very easy mistake
to make) then it will cause a kernel crash in an obscure scenario.

It is the "design the API to be hard to use wrong" philosophy.

Jason

Patch

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 115f10f7b096..f9529bac6c97 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1469,7 +1469,8 @@  void vfio_pci_memory_unlock_and_restore(struct vfio_pci_device *vdev, u16 cmd)
 static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
-	struct vfio_pci_device *vdev = vma->vm_private_data;
+	struct vfio_device *device = vma->vm_private_data;
+	struct vfio_pci_device *vdev = vfio_device_data(device);
 	vm_fault_t ret = VM_FAULT_SIGBUS;
 
 	down_read(&vdev->memory_lock);
@@ -1485,6 +1486,7 @@  static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 }
 
 static const struct vm_operations_struct vfio_pci_mmap_ops = {
+	.open = vfio_device_vma_open,
 	.fault = vfio_pci_mmap_fault,
 };
 
@@ -1542,7 +1544,7 @@  static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 		}
 	}
 
-	vma->vm_private_data = vdev;
+	vma->vm_private_data = vdev->device;
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
 
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index da212425ab30..399c42b77fbb 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -572,6 +572,15 @@  void vfio_device_unmap_mapping_range(struct vfio_device *device,
 }
 EXPORT_SYMBOL_GPL(vfio_device_unmap_mapping_range);
 
+/*
+ * A VFIO bus driver using this open callback will provide a
+ * struct vfio_device pointer in the vm_private_data field.
+ */
+void vfio_device_vma_open(struct vm_area_struct *vma)
+{
+}
+EXPORT_SYMBOL_GPL(vfio_device_vma_open);
+
 /**
  * Device objects - create, release, get, put, search
  */
@@ -927,6 +936,21 @@  struct vfio_device *vfio_device_get_from_dev(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(vfio_device_get_from_dev);
 
+struct vfio_device *vfio_device_get_from_vma(struct vm_area_struct *vma)
+{
+	struct vfio_device *device;
+
+	if (vma->vm_ops->open != vfio_device_vma_open)
+		return ERR_PTR(-ENODEV);
+
+	device = vma->vm_private_data;
+
+	vfio_device_get(device);
+
+	return device;
+}
+EXPORT_SYMBOL_GPL(vfio_device_get_from_vma);
+
 static struct vfio_device *vfio_device_get_from_name(struct vfio_group *group,
 						     char *buf)
 {
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index f435dfca15eb..188c2f3feed9 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -58,6 +58,8 @@  extern void vfio_device_put(struct vfio_device *device);
 extern void *vfio_device_data(struct vfio_device *device);
 extern void vfio_device_unmap_mapping_range(struct vfio_device *device,
 					    loff_t start, loff_t len);
+extern void vfio_device_vma_open(struct vm_area_struct *vma);
+extern struct vfio_device *vfio_device_get_from_vma(struct vm_area_struct *vma);
 
 /* events for the backend driver notify callback */
 enum vfio_iommu_notify_type {