[v3,10/10] vfio/qat: Add vfio_pci driver for Intel QAT VF devices

Message ID 20240221155008.960369-11-xin.zeng@intel.com (mailing list archive)
State Superseded
Delegated to: Herbert Xu
Series crypto: qat - enable QAT GEN4 SRIOV VF live migration for QAT GEN4

Commit Message

Xin Zeng Feb. 21, 2024, 3:50 p.m. UTC
Add vfio pci driver for Intel QAT VF devices.

This driver uses vfio_pci_core to register to the VFIO subsystem. It
acts as a vfio agent and interacts with the QAT PF driver to implement
VF live migration.

Co-developed-by: Yahui Cao <yahui.cao@intel.com>
Signed-off-by: Yahui Cao <yahui.cao@intel.com>
Signed-off-by: Xin Zeng <xin.zeng@intel.com>
Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
---
 MAINTAINERS                         |   8 +
 drivers/vfio/pci/Kconfig            |   2 +
 drivers/vfio/pci/Makefile           |   2 +
 drivers/vfio/pci/intel/qat/Kconfig  |  12 +
 drivers/vfio/pci/intel/qat/Makefile |   3 +
 drivers/vfio/pci/intel/qat/main.c   | 663 ++++++++++++++++++++++++++++
 6 files changed, 690 insertions(+)
 create mode 100644 drivers/vfio/pci/intel/qat/Kconfig
 create mode 100644 drivers/vfio/pci/intel/qat/Makefile
 create mode 100644 drivers/vfio/pci/intel/qat/main.c

Comments

Alex Williamson Feb. 26, 2024, 6:55 p.m. UTC | #1
On Wed, 21 Feb 2024 23:50:08 +0800
Xin Zeng <xin.zeng@intel.com> wrote:

> Add vfio pci driver for Intel QAT VF devices.
> 
> This driver uses vfio_pci_core to register to the VFIO subsystem. It
> acts as a vfio agent and interacts with the QAT PF driver to implement
> VF live migration.
> 
> Co-developed-by: Yahui Cao <yahui.cao@intel.com>
> Signed-off-by: Yahui Cao <yahui.cao@intel.com>
> Signed-off-by: Xin Zeng <xin.zeng@intel.com>
> Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
> ---
>  MAINTAINERS                         |   8 +
>  drivers/vfio/pci/Kconfig            |   2 +
>  drivers/vfio/pci/Makefile           |   2 +
>  drivers/vfio/pci/intel/qat/Kconfig  |  12 +
>  drivers/vfio/pci/intel/qat/Makefile |   3 +
>  drivers/vfio/pci/intel/qat/main.c   | 663 ++++++++++++++++++++++++++++
>  6 files changed, 690 insertions(+)
>  create mode 100644 drivers/vfio/pci/intel/qat/Kconfig
>  create mode 100644 drivers/vfio/pci/intel/qat/Makefile
>  create mode 100644 drivers/vfio/pci/intel/qat/main.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 5a4051996f1e..8961c7033b31 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -23099,6 +23099,14 @@ S:	Maintained
>  F:	Documentation/networking/device_drivers/ethernet/amd/pds_vfio_pci.rst
>  F:	drivers/vfio/pci/pds/
>  
> +VFIO QAT PCI DRIVER
> +M:	Xin Zeng <xin.zeng@intel.com>
> +M:	Giovanni Cabiddu <giovanni.cabiddu@intel.com>
> +L:	kvm@vger.kernel.org
> +L:	qat-linux@intel.com
> +S:	Supported
> +F:	drivers/vfio/pci/intel/qat/
> +

Alphabetical please.

>  VFIO PLATFORM DRIVER
>  M:	Eric Auger <eric.auger@redhat.com>
>  L:	kvm@vger.kernel.org
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 18c397df566d..329d25c53274 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -67,4 +67,6 @@ source "drivers/vfio/pci/pds/Kconfig"
>  
>  source "drivers/vfio/pci/virtio/Kconfig"
>  
> +source "drivers/vfio/pci/intel/qat/Kconfig"

This will be the first intel vfio-pci variant driver, I don't think we
need an intel sub-directory just yet.

Tangentially, I think an issue we're running into with
PCI_DRIVER_OVERRIDE_DEVICE_VFIO is that we require driver_override to
bind the device and therefore the id_table becomes little more than a
suggestion.  Our QE is already asking, for example, if they should use
mlx5-vfio-pci for all mlx5 compatible devices.

I wonder if all vfio-pci variant drivers that specify an id_table
shouldn't include in their probe function:

	if (!pci_match_id(pdev, id)) {
		pci_info(pdev, "Incompatible device, disallowing driver_override\n");
		return -ENODEV;
	}

(And yes, I see the irony that vfio introduced driver_override and
we've created variant drivers that require driver_override and now we
want to prevent driver_overrides)

Jason, are you seeing any of this as well and do you have a better
suggestion how we might address the issue?  Thanks,

Alex
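
A minimal sketch of the probe-time guard suggested above, applied to this driver. It assumes the in-tree pci_match_id() signature (id table first, device second) and reuses the qat_vf_vfio_pci_table defined later in the patch; it is an illustration only, not part of the posted patch:

static int qat_vf_vfio_pci_probe(struct pci_dev *pdev,
				 const struct pci_device_id *id)
{
	/*
	 * Refuse pure driver_override binds of devices that are not in this
	 * driver's id_table, e.g. a non-QAT VF forced onto qat_vfio_pci via
	 * sysfs.
	 */
	if (!pci_match_id(qat_vf_vfio_pci_table, pdev)) {
		pci_info(pdev, "Incompatible device, disallowing driver_override\n");
		return -ENODEV;
	}

	/* ... continue with vfio_alloc_device() and registration as in the patch ... */
	return 0;
}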
Jason Gunthorpe Feb. 26, 2024, 7:12 p.m. UTC | #2
On Mon, Feb 26, 2024 at 11:55:56AM -0700, Alex Williamson wrote:
> This will be the first intel vfio-pci variant driver, I don't think we
> need an intel sub-directory just yet.
> 
> Tangentially, I think an issue we're running into with
> PCI_DRIVER_OVERRIDE_DEVICE_VFIO is that we require driver_override to
> bind the device and therefore the id_table becomes little more than a
> suggestion.  Our QE is already asking, for example, if they should use
> mlx5-vfio-pci for all mlx5 compatible devices.

They don't have to, but it works fine, there is no reason not to.

I imagined that users would always bind the variant driver, it is why
the drivers all have "disabled" fallbacks to just be normal vfio-pci.

> I wonder if all vfio-pci variant drivers that specify an id_table
> shouldn't include in their probe function:
> 
> 	if (!pci_match_id(pdev, id)) {
> 		pci_info(pdev, "Incompatible device, disallowing driver_override\n");
> 		return -ENODEV;
> 	}

Certainly an interesting idea, doesn't that completely defeat driver
binding and new_id though?

You are worried about someone wrongly loading a mlx5 driver on, say,
an Intel device?

> (And yes, I see the irony that vfio introduced driver_override and
> we've created variant drivers that require driver_override and now we
> want to prevent driver_overrides)

Heh
 
> Jason, are you seeing any of this as well and do you have a better
> suggestion how we might address the issue?  Thanks,

I haven't heard of confusion here.. People who don't care just use
vfio-pci like the internet tells them, people who do care seem to be
very sophisticated right now..

Did the userspace tool Max sketched out to automatically parse the id
tables ever get completed? That seems like the best solution, just
automate it and remove the decision from the user.

Jason
Alex Williamson Feb. 26, 2024, 7:41 p.m. UTC | #3
On Mon, 26 Feb 2024 15:12:20 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Feb 26, 2024 at 11:55:56AM -0700, Alex Williamson wrote:
> > This will be the first intel vfio-pci variant driver, I don't think we
> > need an intel sub-directory just yet.
> > 
> > Tangentially, I think an issue we're running into with
> > PCI_DRIVER_OVERRIDE_DEVICE_VFIO is that we require driver_override to
> > bind the device and therefore the id_table becomes little more than a
> > suggestion.  Our QE is already asking, for example, if they should use
> > mlx5-vfio-pci for all mlx5 compatible devices.  
> 
> They don't have to, but it works fine, there is no reason not to.

But there's also no reason to.  None of the metadata exposed by the
driver suggests it should be a general purpose vfio-pci stand-in.

> I imagined that users would always bind the variant driver, it is why
> the drivers all have "disabled" fallbacks to just be normal vfio-pci.
> 
> > I wonder if all vfio-pci variant drivers that specify an id_table
> > shouldn't include in their probe function:
> > 
> > 	if (!pci_match_id(pdev, id)) {
> > 		pci_info(pdev, "Incompatible device, disallowing driver_override\n");
> > 		return -ENODEV;
> > 	}  
> 
> Certainly an interesting idea, doesn't that completely defeat driver
> binding and new_id though?

I guess we always send a compatible id, so it'd be more a matter of
exporting and testing id against pci_device_id_any, that would be the
footprint of just a driver_override match (or an extremely liberal
dynamic id).
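
For reference, the "dummy id" path being referred to: when a device is bound purely via driver_override and nothing in the driver's id_table (or its dynamic ids) matches, the PCI core hands probe() a wildcard entry. Paraphrased from drivers/pci/pci-driver.c, not verbatim, so details may differ by kernel version:

static const struct pci_device_id pci_device_id_any = {
	.vendor = PCI_ANY_ID,
	.device = PCI_ANY_ID,
	.subvendor = PCI_ANY_ID,
	.subdevice = PCI_ANY_ID,
};

/* In pci_match_device(): driver_override always matches, send a dummy id */
found_id = pci_match_id(drv->id_table, dev);
if (!found_id && dev->driver_override)
	found_id = &pci_device_id_any;

Testing the id passed to probe() against this wildcard is the "footprint of just a driver_override match" mentioned above.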

> You are worried about someone wrongly loading a mlx5 driver on, say,
> an Intel device?

That's sort of where we're headed if we consider it valid to bind a CX5
to mlx5-vfio-pci just because they have a host driver with a similar
name in common.  It's essentially a free for all.  I worry about test
matrices, user confusion, and being on guard for arbitrary devices at
every turn in variant drivers if our policy is that they should all work
equivalent to a basic vfio-pci-core implementation for anything.

> > (And yes, I see the irony that vfio introduced driver_override and
> > we've created variant drivers that require driver_override and now we
> > want to prevent driver_overrides)  
> 
> Heh
>  
> > Jason, are you seeing any of this as well and do you have a better
> > suggestion how we might address the issue?  Thanks,  
> 
> I haven't heard of confusion here.. People who don't care just use
> vfio-pci like the internet tells them, people who do care seem to be
> very sophisticated right now..
> 
> Did the userspace tool Max sketched out to automatically parse the id
> tables ever get completed? That seems like the best solution, just
> automate it and remove the decision from the user.

libvirt recently implemented support for managed="yes" with variant
drivers where it will find the best "vfio_pci" driver for a device
using an algorithm like Max suggested, but in practice it's not clear
how useful that will be considering that devices like CX7 require
configuration before binding to the variant driver.  libvirt has no
hooks to specify or perform configuration at that point.

The driverctl script also exists and could maybe consider the
"vfio-pci" driver name to be a fungible alias for the best matching
vfio_pci class driver, but I'm not sure if driverctl has a sufficient
following to make a difference.  Thanks,

Alex
Jason Gunthorpe Feb. 26, 2024, 7:49 p.m. UTC | #4
On Mon, Feb 26, 2024 at 12:41:07PM -0700, Alex Williamson wrote:
> On Mon, 26 Feb 2024 15:12:20 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Mon, Feb 26, 2024 at 11:55:56AM -0700, Alex Williamson wrote:
> > > This will be the first intel vfio-pci variant driver, I don't think we
> > > need an intel sub-directory just yet.
> > > 
> > > Tangentially, I think an issue we're running into with
> > > PCI_DRIVER_OVERRIDE_DEVICE_VFIO is that we require driver_override to
> > > bind the device and therefore the id_table becomes little more than a
> > > suggestion.  Our QE is already asking, for example, if they should use
> > > mlx5-vfio-pci for all mlx5 compatible devices.  
> > 
> > They don't have to, but it works fine, there is no reason not to.
> 
> But there's also no reason to.  None of the metadata exposed by the
> driver suggests it should be a general purpose vfio-pci stand-in.

I think the intent was to bind it to all the devices in its ID table
automatically for VFIO use and it should always be OK to do that. Not
doing so is a micro optimization.

Userspace binding it to other random things is a Bad Thing.

> > You are worried about someone wrongly loading a mlx5 driver on, say,
> > an Intel device?
> 
> That's sort of where we're headed if we consider it valid to bind a CX5
> to mlx5-vfio-pci just because they have a host driver with a similar
> name in common. 

I hope nobody is doing that, everyone should be using a tool that
checks the ID table. If we lack a usable tool for that, then that is the
problem.

> It's essentially a free for all.  I worry about test matrices, user
> confusion, and being on guard for arbitrary devices at every turn in
> variant drivers if our policy is that they should all work
> equivalent to a basic vfio-pci-core implementation for anything.

Today most of the drivers will just NOP themselves if they can't find
a compatible PF driver, the most likely bug in this path would be a
NULL ptr deref or something in an untested path, or just total failure
to bind.

We could insist that VF drivers are always able to find their PF or
binding fails, that would narrow things considerably.

> libvirt recently implemented support for managed="yes" with variant
> drivers where it will find the best "vfio_pci" driver for a device
> using an algorithm like Max suggested, but in practice it's not clear
> how useful that will be considering devices likes CX7 require
> configuration before binding to the variant driver.  libvirt has no
> hooks to specify or perform configuration at that point.

I don't think this is fully accurate (or at least not what was
intended), the VFIO device can be configured any time up until the VM
mlx5 driver reaches the device startup.

Is something preventing this? Did we accidentally cache the migratable
flag in vfio or something??

> The driverctl script also exists and could maybe consider the
> "vfio-pci" driver name to be a fungible alias for the best matching
> vfio_pci class driver, but I'm not sure if driverctl has a sufficient
> following to make a difference.

I was also thinking about this tool as an alternative instruction to
using libvirt.

Maybe this would encourage more people to use it if it implemented it?

Jason
Cabiddu, Giovanni Feb. 26, 2024, 8:25 p.m. UTC | #5
On Mon, Feb 26, 2024 at 11:55:56AM -0700, Alex Williamson wrote:
> On Wed, 21 Feb 2024 23:50:08 +0800
> Xin Zeng <xin.zeng@intel.com> wrote:
> > diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> > index 18c397df566d..329d25c53274 100644
> > --- a/drivers/vfio/pci/Kconfig
> > +++ b/drivers/vfio/pci/Kconfig
> > @@ -67,4 +67,6 @@ source "drivers/vfio/pci/pds/Kconfig"
> >  
> >  source "drivers/vfio/pci/virtio/Kconfig"
> >  
> > +source "drivers/vfio/pci/intel/qat/Kconfig"
> 
> This will be the first intel vfio-pci variant driver, I don't think we
> need an intel sub-directory just yet.
I made that suggestion since there is another vfio-pci variant driver
for an Intel device in development that will be sent soon. I wanted to
avoid a patch that moves paths.
Anyway, we can move this driver to a subdirectory when the new driver
is sent out, or keep both drivers in the main directory, drivers/vfio/pci.

Regards,
Alex Williamson Feb. 26, 2024, 10:24 p.m. UTC | #6
On Mon, 26 Feb 2024 15:49:52 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Feb 26, 2024 at 12:41:07PM -0700, Alex Williamson wrote:
> > On Mon, 26 Feb 2024 15:12:20 -0400
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Mon, Feb 26, 2024 at 11:55:56AM -0700, Alex Williamson wrote:  
> > > > This will be the first intel vfio-pci variant driver, I don't think we
> > > > need an intel sub-directory just yet.
> > > > 
> > > > Tangentially, I think an issue we're running into with
> > > > PCI_DRIVER_OVERRIDE_DEVICE_VFIO is that we require driver_override to
> > > > bind the device and therefore the id_table becomes little more than a
> > > > suggestion.  Our QE is already asking, for example, if they should use
> > > > mlx5-vfio-pci for all mlx5 compatible devices.    
> > > 
> > > They don't have to, but it works fine, there is no reason not to.  
> > 
> > But there's also no reason to.  None of the metadata exposed by the
> > driver suggests it should be a general purpose vfio-pci stand-in.  
> 
> I think the intent was to bind it to all the devices in its ID table
> automatically for VFIO use and it should always be OK to do that. Not
> doing so is a micro optimization.

Automatic in what sense?  We use PCI_DRIVER_OVERRIDE_DEVICE_VFIO to set
the id_table entry to override_only, so any binding requires the user
to specify a driver_override.  Nothing automatically performs that
driver_override except the recent support in libvirt for "managed"
devices.

Effectively override_only negates the id_table to little more than a
suggestion to userspace of how the driver should be used.
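
For reference, a sketch of what that marking amounts to. PCI_DRIVER_OVERRIDE_DEVICE_VFIO fills in an ordinary VID/DID entry plus the override_only flag, roughly as below (paraphrased from include/linux/pci.h; check the tree for the exact definition):

/* Paraphrased sketch of the macro in include/linux/pci.h */
#define PCI_DRIVER_OVERRIDE_DEVICE_VFIO(vend, dev) \
	.vendor = (vend), .device = (dev), \
	.subvendor = PCI_ANY_ID, .subdevice = PCI_ANY_ID, \
	.override_only = PCI_ID_F_VFIO_DRIVER_OVERRIDE

With override_only set, the entry only matches when driver_override is also set on the device, which is what reduces the id_table to a suggestion for userspace.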

> Userspace binding it to other random things is a Bad Thing.

Yes, and GregKH has opinions about who gets to keep the pieces of the
broken system if a user uses dynamic ids or overrides to bind an
unsupported driver to a device.

> > > You are worried about someone wrongly loading a mlx5 driver on, say,
> > > an Intel device?  
> > 
> > That's sort of where we're headed if we consider it valid to bind a CX5
> > to mlx5-vfio-pci just because they have a host driver with a similar
> > name in common.   
> 
> I hope nobody is doing that, everyone should be using a tool that
> checks that ID table.. If we lack a usable tool for that then it that
> is the problemm.

These are the sort of questions I'm currently fielding and yes, we
don't really have any tools to make this automatic outside of the
recent libvirt support.

> > It's essentially a free for all.  I worry about test matrices, user
> > confusion, and being on guard for arbitrary devices at every turn in
> > variant drivers if our policy is that they should all work
> > equivalent to a basic vfio-pci-core implementation for anything.  
> 
> Today most of the drivers will just NOP themeslves if they can't find
> a compatible PF driver, the most likely bug in this path would be a
> NULL ptr deref or something in an untested path, or just total failure
> to bind.
> 
> We could insist that VF drivers are always able to find their PF or
> binding fails, that would narrow things considerably.

I think that conflicts with the guidance we've been giving for things
like virtio-vfio-pci and nvgrace-gpu-vfio-pci where a device matching
the id_table should at least get vfio-pci-core level functionality
because there is no "try the next best driver" protocol defined if
probing fails.  OTOH, I think it's fair to fail probing of a device
that does not match the id_table.

Therefore to me, a CX7 VF with matching VID:DID should always bind to
mlx5-vfio-pci regardless of the PF support for migration because that's
what the id_table matches and the device has functionality with a
vfio-pci-core driver.  However I don't see that mlx5-vfio-pci has any
obligation to bind to a CX5 device.

> > libvirt recently implemented support for managed="yes" with variant
> > drivers where it will find the best "vfio_pci" driver for a device
> > using an algorithm like Max suggested, but in practice it's not clear
> > how useful that will be considering devices likes CX7 require
> > configuration before binding to the variant driver.  libvirt has no
> > hooks to specify or perform configuration at that point.  
> 
> I don't think this is fully accurate (or at least not what was
> intended), the VFIO device can be configured any time up until the VM
> mlx5 driver reaches the device startup.
> 
> Is something preventing this? Did we accidentally cache the migratable
> flag in vfio or something??

I don't think so, I think this was just the policy we had decided
relative to profiling VFs when they're created rather than providing a
means to do it through a common vfio variant driver interface[1].

Nothing necessarily prevents libvirt from configuring devices after
binding to a vfio-pci variant driver, but per the noted thread the
upstream policy is to have the device configured prior to giving it to
vfio and any configuration mechanisms are specific to the device and
variant driver, therefore what more is libvirt to do?

> > The driverctl script also exists and could maybe consider the
> > "vfio-pci" driver name to be a fungible alias for the best matching
> > vfio_pci class driver, but I'm not sure if driverctl has a sufficient
> > following to make a difference.  
> 
> I was also thinking about this tool as an alternative instruction to
> using libvirt.
> 
> Maybe this would ecourage more people to use it if it implemented it?

driverctl is a very powerful and very unchecked tool.  It loads the gun
and points it at the user's foot and taunts them to pull the trigger.
I suspect use cases beyond vfio are largely theoretical, but it allows
specifying an arbitrary driver for almost any device.

OTOH, libvirt is more targeted and while it will work automatically for
a "managed" device specified via domain xml, it's also available as a
utility through virsh, ex:

 # virsh nodedev-detach pci_0000_01_10_0

Which should trigger the code to bind to vfio-pci or the best variant
driver.

So while there's a place for driverctl and I wouldn't mind driver
matching being added, I have trouble seeing enterprise distros really
getting behind it to recommend as best practice without a lot more
safety checks.

Relative to variant driver probe() re-checking the id_table, I know we
generally don't try to set policy in the kernel and this sort of
behavior has been around for a long, long time with new_id, but the
barrier to entry has been substantially lowered with things like
driverctl, so I'd still entertain the idea.  Thanks,

Alex 

[1]https://lore.kernel.org/all/20231018163333.GZ3952@nvidia.com/
Xin Zeng Feb. 28, 2024, 1:57 p.m. UTC | #7
On Tuesday, February 27, 2024 2:56 AM, Alex Williamson <alex.williamson@redhat.com> wrote:
> On Wed, 21 Feb 2024 23:50:08 +0800
> Xin Zeng <xin.zeng@intel.com> wrote:
> 
> >  MAINTAINERS                         |   8 +
> >  drivers/vfio/pci/Kconfig            |   2 +
> >  drivers/vfio/pci/Makefile           |   2 +
> >  drivers/vfio/pci/intel/qat/Kconfig  |  12 +
> >  drivers/vfio/pci/intel/qat/Makefile |   3 +
> >  drivers/vfio/pci/intel/qat/main.c   | 663 ++++++++++++++++++++++++++++
> >  6 files changed, 690 insertions(+)
> >  create mode 100644 drivers/vfio/pci/intel/qat/Kconfig
> >  create mode 100644 drivers/vfio/pci/intel/qat/Makefile
> >  create mode 100644 drivers/vfio/pci/intel/qat/main.c
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 5a4051996f1e..8961c7033b31 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -23099,6 +23099,14 @@ S:	Maintained
> >  F:	Documentation/networking/device_drivers/ethernet/amd/pds_vfio_pci.rst
> >  F:	drivers/vfio/pci/pds/
> >
> > +VFIO QAT PCI DRIVER
> > +M:	Xin Zeng <xin.zeng@intel.com>
> > +M:	Giovanni Cabiddu <giovanni.cabiddu@intel.com>
> > +L:	kvm@vger.kernel.org
> > +L:	qat-linux@intel.com
> > +S:	Supported
> > +F:	drivers/vfio/pci/intel/qat/
> > +
> 
> Alphabetical please.

Sure, will update it in the next version.

> 
> >  VFIO PLATFORM DRIVER
> >  M:	Eric Auger <eric.auger@redhat.com>
> >  L:	kvm@vger.kernel.org
> > diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> > index 18c397df566d..329d25c53274 100644
> > --- a/drivers/vfio/pci/Kconfig
> > +++ b/drivers/vfio/pci/Kconfig
> > @@ -67,4 +67,6 @@ source "drivers/vfio/pci/pds/Kconfig"
> >
> >  source "drivers/vfio/pci/virtio/Kconfig"
> >
> > +source "drivers/vfio/pci/intel/qat/Kconfig"
> 
> This will be the first intel vfio-pci variant driver, I don't think we
> need an intel sub-directory just yet.
> 

Ok, will update it.

> Tangentially, I think an issue we're running into with
> PCI_DRIVER_OVERRIDE_DEVICE_VFIO is that we require driver_override to
> bind the device and therefore the id_table becomes little more than a
> suggestion.  Our QE is already asking, for example, if they should use
> mlx5-vfio-pci for all mlx5 compatible devices.
> 
> I wonder if all vfio-pci variant drivers that specify an id_table
> shouldn't include in their probe function:
> 
> 	if (!pci_match_id(pdev, id)) {
> 		pci_info(pdev, "Incompatible device, disallowing driver_override\n");
> 		return -ENODEV;
> 	}
> 

Ok, makes sense. Following the later discussion, I will include it in the
next version.

> (And yes, I see the irony that vfio introduced driver_override and
> we've created variant drivers that require driver_override and now we
> want to prevent driver_overrides)
> 
> Jason, are you seeing any of this as well and do you have a better
> suggestion how we might address the issue?  Thanks,
> 
> Alex
Alex Williamson Feb. 28, 2024, 7:07 p.m. UTC | #8
On Mon, 26 Feb 2024 15:24:58 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Mon, 26 Feb 2024 15:49:52 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Mon, Feb 26, 2024 at 12:41:07PM -0700, Alex Williamson wrote:  
> > > On Mon, 26 Feb 2024 15:12:20 -0400
> > > libvirt recently implemented support for managed="yes" with variant
> > > drivers where it will find the best "vfio_pci" driver for a device
> > > using an algorithm like Max suggested, but in practice it's not clear
> > > how useful that will be considering devices likes CX7 require
> > > configuration before binding to the variant driver.  libvirt has no
> > > hooks to specify or perform configuration at that point.    
> > 
> > I don't think this is fully accurate (or at least not what was
> > intended), the VFIO device can be configured any time up until the VM
> > mlx5 driver reaches the device startup.
> > 
> > Is something preventing this? Did we accidentally cache the migratable
> > flag in vfio or something??  
> 
> I don't think so, I think this was just the policy we had decided
> relative to profiling VFs when they're created rather than providing a
> means to do it though a common vfio variant driver interface[1].

Turns out that yes, migration support needs to be established at probe
time.  vfio_pci_core_register_device() expects migration_flags,
mig_ops, and log_ops to all be established by this point, which for
mlx5-vfio-pci occurs when the .init function calls
mlx5vf_cmd_set_migratable().

So the VF does indeed need to be "profiled" to enable migration prior
to binding to the mlx5-vfio-pci driver in order to report support.

That also makes me wonder what happens if migration support is disabled
via devlink after binding the VF to mlx5-vfio-pci.  Arguably this could
be considered user error, but what's the failure mode and support
implication?  Thanks,

Alex
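
For context, a sketch of the ordering that forces this: the variant driver's .init callback has to populate migration_flags and mig_ops before probe() calls vfio_pci_core_register_device() (as the QAT patch below does in qat_vf_pci_init_dev()), because registration validates them at that point. The check below is paraphrased from vfio_pci_core and is an assumption about the exact conditions, not verbatim:

/* In vfio_pci_core_register_device(), roughly: */
if (vdev->vdev.mig_ops) {
	if (!(vdev->vdev.mig_ops->migration_get_state &&
	      vdev->vdev.mig_ops->migration_set_state &&
	      vdev->vdev.mig_ops->migration_get_data_size) ||
	    !(vdev->vdev.migration_flags & VFIO_MIGRATION_STOP_COPY))
		return -EINVAL;
}

So by open()/ioctl time it is too late to decide whether the device is migratable; that has to be known when the variant driver binds.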
Yishai Hadas Feb. 29, 2024, 12:36 p.m. UTC | #9
On 28/02/2024 21:07, Alex Williamson wrote:
> On Mon, 26 Feb 2024 15:24:58 -0700
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
>> On Mon, 26 Feb 2024 15:49:52 -0400
>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>>> On Mon, Feb 26, 2024 at 12:41:07PM -0700, Alex Williamson wrote:
>>>> On Mon, 26 Feb 2024 15:12:20 -0400
>>>> libvirt recently implemented support for managed="yes" with variant
>>>> drivers where it will find the best "vfio_pci" driver for a device
>>>> using an algorithm like Max suggested, but in practice it's not clear
>>>> how useful that will be considering devices likes CX7 require
>>>> configuration before binding to the variant driver.  libvirt has no
>>>> hooks to specify or perform configuration at that point.
>>>
>>> I don't think this is fully accurate (or at least not what was
>>> intended), the VFIO device can be configured any time up until the VM
>>> mlx5 driver reaches the device startup.
>>>
>>> Is something preventing this? Did we accidentally cache the migratable
>>> flag in vfio or something??
>>
>> I don't think so, I think this was just the policy we had decided
>> relative to profiling VFs when they're created rather than providing a
>> means to do it though a common vfio variant driver interface[1].
> 
> Turns out that yes, migration support needs to be established at probe
> time.  vfio_pci_core_register_device() expects migration_flags,
> mig_ops, and log_ops to all be established by this point, which for
> mlx5-vfio-pci occurs when the .init function calls
> mlx5vf_cmd_set_migratable().
> 
> So the VF does indeed need to be "profiled" to enabled migration prior
> to binding to the mlx5-vfio-pci driver in order to report support.
> 

Right, the 'profiling' of the VF in the mlx5 case needs to be done prior
to its probing/binding.

This is achieved today by running 'devlink <xxx> migratable enable' after
creating the VF.

> That also makes me wonder what happens if migration support is disabled
> via devlink after binding the VF to mlx5-vfio-pci.  Arguably this could
> be considered user error,

Yes, this is a clear user error.

> but what's the failure mode and support
> implication?  Thanks,
> 

The user will simply get an error from the firmware; the kernel and
everything else around will stay safe.

Further details:
On the source side, once the VM is started the 'disable' itself will
fail, as that configuration can't be changed while the VF is already
running/active.

On the target, as it's in a pending mode, the 'disable' will succeed.
However, the migration will just fail later on in the firmware upon
running a migration-related command, as expected.

Yishai
Jason Gunthorpe Feb. 29, 2024, 2:10 p.m. UTC | #10
On Wed, Feb 28, 2024 at 12:07:06PM -0700, Alex Williamson wrote:
> On Mon, 26 Feb 2024 15:24:58 -0700
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > On Mon, 26 Feb 2024 15:49:52 -0400
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > 
> > > On Mon, Feb 26, 2024 at 12:41:07PM -0700, Alex Williamson wrote:  
> > > > On Mon, 26 Feb 2024 15:12:20 -0400
> > > > libvirt recently implemented support for managed="yes" with variant
> > > > drivers where it will find the best "vfio_pci" driver for a device
> > > > using an algorithm like Max suggested, but in practice it's not clear
> > > > how useful that will be considering devices likes CX7 require
> > > > configuration before binding to the variant driver.  libvirt has no
> > > > hooks to specify or perform configuration at that point.    
> > > 
> > > I don't think this is fully accurate (or at least not what was
> > > intended), the VFIO device can be configured any time up until the VM
> > > mlx5 driver reaches the device startup.
> > > 
> > > Is something preventing this? Did we accidentally cache the migratable
> > > flag in vfio or something??  
> > 
> > I don't think so, I think this was just the policy we had decided
> > relative to profiling VFs when they're created rather than providing a
> > means to do it though a common vfio variant driver interface[1].
>
> Turns out that yes, migration support needs to be established at probe
> time.  vfio_pci_core_register_device() expects migration_flags,
> mig_ops, and log_ops to all be established by this point, which for
> mlx5-vfio-pci occurs when the .init function calls
> mlx5vf_cmd_set_migratable().

This is unfortunate, we should look at trying to accommodate this,
IMHO. Yishai?

> That also makes me wonder what happens if migration support is disabled
> via devlink after binding the VF to mlx5-vfio-pci.  Arguably this could
> be considered user error, but what's the failure mode and support
> implication?  Thanks,

I think the FW will start failing the migration commands.

Jason
Jason Gunthorpe Feb. 29, 2024, 2:22 p.m. UTC | #11
On Mon, Feb 26, 2024 at 03:24:58PM -0700, Alex Williamson wrote:
> On Mon, 26 Feb 2024 15:49:52 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Mon, Feb 26, 2024 at 12:41:07PM -0700, Alex Williamson wrote:
> > > On Mon, 26 Feb 2024 15:12:20 -0400
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >   
> > > > On Mon, Feb 26, 2024 at 11:55:56AM -0700, Alex Williamson wrote:  
> > > > > This will be the first intel vfio-pci variant driver, I don't think we
> > > > > need an intel sub-directory just yet.
> > > > > 
> > > > > Tangentially, I think an issue we're running into with
> > > > > PCI_DRIVER_OVERRIDE_DEVICE_VFIO is that we require driver_override to
> > > > > bind the device and therefore the id_table becomes little more than a
> > > > > suggestion.  Our QE is already asking, for example, if they should use
> > > > > mlx5-vfio-pci for all mlx5 compatible devices.    
> > > > 
> > > > They don't have to, but it works fine, there is no reason not to.  
> > > 
> > > But there's also no reason to.  None of the metadata exposed by the
> > > driver suggests it should be a general purpose vfio-pci stand-in.  
> > 
> > I think the intent was to bind it to all the devices in its ID table
> > automatically for VFIO use and it should always be OK to do that. Not
> > doing so is a micro optimization.
> 
> Automatic in what sense?  We use PCI_DRIVER_OVERRIDE_DEVICE_VFIO to set
> the id_table entry to override_only, so any binding requires the user
> to specify a driver_override.  Nothing automatically performs that
> driver_override except the recent support in libvirt for "managed"
> devices.

I mean in the sense that the user would run some command and it would
find the right driver..
 
> Effectively override_only negates the id_table to little more than a
> suggestion to userspace of how the driver should be used.

We once talked about having a "flavor" sysfs in the driver core where
instead of using driver_override you'd switch flavours to vfio and
then it would use the driver binding logic to get the right driver. I
forget why we abandoned that direction..

> > checks that ID table.. If we lack a usable tool for that then it that
> > is the problemm.
> 
> These are the sort of questions I'm currently fielding and yes, we
> don't really have any tools to make this automatic outside of the
> recent libvirt support.

I see. Having people do this manually was not my vision, at least.

> > > libvirt recently implemented support for managed="yes" with variant
> > > drivers where it will find the best "vfio_pci" driver for a device
> > > using an algorithm like Max suggested, but in practice it's not clear
> > > how useful that will be considering devices likes CX7 require
> > > configuration before binding to the variant driver.  libvirt has no
> > > hooks to specify or perform configuration at that point.  
> > 
> > I don't think this is fully accurate (or at least not what was
> > intended), the VFIO device can be configured any time up until the VM
> > mlx5 driver reaches the device startup.
> > 
> > Is something preventing this? Did we accidentally cache the migratable
> > flag in vfio or something??
> 
> I don't think so, I think this was just the policy we had decided
> relative to profiling VFs when they're created rather than providing a
> means to do it though a common vfio variant driver interface[1].

I didn't quite mean it so literally though, I imagined we could bind
the driver in VFIO, profile the VF, then open the VFIO cdev.

> Nothing necessarily prevents libvirt from configuring devices after
> binding to a vfio-pci variant driver, but per the noted thread the
> upstream policy is to have the device configured prior to giving it to
> vfio and any configuration mechanisms are specific to the device and
> variant driver, therefore what more is libvirt to do?

I admit I don't have a clear sense of what the plan here is on the
userspace side. Somehow this all has to be threaded together and a
real environment needs a step to profile and provision these complex
VFIO devices.

For k8s I think we are doing this with the operator plugins. I don't
know of a plan for "raw" libvirt..

> OTOH, libvirt is more targeted and while it will work automatically for
> a "managed" device specified via domain xml, it's also available as a
> utility through virsh, ex:
> 
>  # virsh nodedev-detach pci_0000_01_10_0
> 
> Which should trigger the code to bind to vfio-pci or the best variant
> driver.

Maybe we just need to more clearly document this?

> Relative to variant driver probe() re-checking the id_table, I know we
> generally don't try to set policy in the kernel and this sort of
> behavior has been around for a long, long time with new_id, but the
> barrier to entry has been substantially lowered with things like
> driverctl that I'd still entertain the idea.  Thanks,

I'm not averse to it, but maybe we should try to push the userspace
forward some more before putting kernel protections in?

Jason
Yishai Hadas Feb. 29, 2024, 3:53 p.m. UTC | #12
On 29/02/2024 16:10, Jason Gunthorpe wrote:
> On Wed, Feb 28, 2024 at 12:07:06PM -0700, Alex Williamson wrote:
>> On Mon, 26 Feb 2024 15:24:58 -0700
>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>
>>> On Mon, 26 Feb 2024 15:49:52 -0400
>>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>>
>>>> On Mon, Feb 26, 2024 at 12:41:07PM -0700, Alex Williamson wrote:
>>>>> On Mon, 26 Feb 2024 15:12:20 -0400
>>>>> libvirt recently implemented support for managed="yes" with variant
>>>>> drivers where it will find the best "vfio_pci" driver for a device
>>>>> using an algorithm like Max suggested, but in practice it's not clear
>>>>> how useful that will be considering devices likes CX7 require
>>>>> configuration before binding to the variant driver.  libvirt has no
>>>>> hooks to specify or perform configuration at that point.
>>>>
>>>> I don't think this is fully accurate (or at least not what was
>>>> intended), the VFIO device can be configured any time up until the VM
>>>> mlx5 driver reaches the device startup.
>>>>
>>>> Is something preventing this? Did we accidentally cache the migratable
>>>> flag in vfio or something??
>>>
>>> I don't think so, I think this was just the policy we had decided
>>> relative to profiling VFs when they're created rather than providing a
>>> means to do it though a common vfio variant driver interface[1].
>>
>> Turns out that yes, migration support needs to be established at probe
>> time.  vfio_pci_core_register_device() expects migration_flags,
>> mig_ops, and log_ops to all be established by this point, which for
>> mlx5-vfio-pci occurs when the .init function calls
>> mlx5vf_cmd_set_migratable().
> 
> This is unfortunate, we should look at trying to accomodate this,
> IMHO. Yishai?

I'm not sure what the alternative is here.

Moving this to open() might be too late: if the guest driver is already
active by that time, this might prevent changing/setting some
caps/profiling.

In addition, as Alex wrote, today upon probe() each driver calls 
vfio_pci_core_register_device() and by that time the 'mig/log' ops are 
checked.

Making a change here, I believe, requires a good justification / use 
case and a working alternative and design.

Note:
The above is true for the other migration drivers around as well, which
upon probe should supply their 'mig/log' ops.

> 
>> That also makes me wonder what happens if migration support is disabled
>> via devlink after binding the VF to mlx5-vfio-pci.  Arguably this could
>> be considered user error, but what's the failure mode and support
>> implication?  Thanks,
> 
> I think tThe FW will start failing the migration commands.
> 

Right, please see my previous answer here, where I supplied some further details.

Yishai

Patch

diff --git a/MAINTAINERS b/MAINTAINERS
index 5a4051996f1e..8961c7033b31 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -23099,6 +23099,14 @@  S:	Maintained
 F:	Documentation/networking/device_drivers/ethernet/amd/pds_vfio_pci.rst
 F:	drivers/vfio/pci/pds/
 
+VFIO QAT PCI DRIVER
+M:	Xin Zeng <xin.zeng@intel.com>
+M:	Giovanni Cabiddu <giovanni.cabiddu@intel.com>
+L:	kvm@vger.kernel.org
+L:	qat-linux@intel.com
+S:	Supported
+F:	drivers/vfio/pci/intel/qat/
+
 VFIO PLATFORM DRIVER
 M:	Eric Auger <eric.auger@redhat.com>
 L:	kvm@vger.kernel.org
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 18c397df566d..329d25c53274 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -67,4 +67,6 @@  source "drivers/vfio/pci/pds/Kconfig"
 
 source "drivers/vfio/pci/virtio/Kconfig"
 
+source "drivers/vfio/pci/intel/qat/Kconfig"
+
 endmenu
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 046139a4eca5..a87b6b43ce1c 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -15,3 +15,5 @@  obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
 obj-$(CONFIG_PDS_VFIO_PCI) += pds/
 
 obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio/
+
+obj-$(CONFIG_QAT_VFIO_PCI) += intel/qat/
diff --git a/drivers/vfio/pci/intel/qat/Kconfig b/drivers/vfio/pci/intel/qat/Kconfig
new file mode 100644
index 000000000000..bf52cfa4b595
--- /dev/null
+++ b/drivers/vfio/pci/intel/qat/Kconfig
@@ -0,0 +1,12 @@ 
+# SPDX-License-Identifier: GPL-2.0-only
+config QAT_VFIO_PCI
+	tristate "VFIO support for QAT VF PCI devices"
+	select VFIO_PCI_CORE
+	depends on CRYPTO_DEV_QAT_4XXX
+	help
+	  This provides migration support for Intel(R) QAT Virtual Function
+	  using the VFIO framework.
+
+	  To compile this as a module, choose M here: the module
+	  will be called qat_vfio_pci. If you don't know what to do here,
+	  say N.
diff --git a/drivers/vfio/pci/intel/qat/Makefile b/drivers/vfio/pci/intel/qat/Makefile
new file mode 100644
index 000000000000..5fe5c4ec19d3
--- /dev/null
+++ b/drivers/vfio/pci/intel/qat/Makefile
@@ -0,0 +1,3 @@ 
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_QAT_VFIO_PCI) += qat_vfio_pci.o
+qat_vfio_pci-y := main.o
diff --git a/drivers/vfio/pci/intel/qat/main.c b/drivers/vfio/pci/intel/qat/main.c
new file mode 100644
index 000000000000..4937bdb650f9
--- /dev/null
+++ b/drivers/vfio/pci/intel/qat/main.c
@@ -0,0 +1,663 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2024 Intel Corporation */
+
+#include <linux/anon_inodes.h>
+#include <linux/container_of.h>
+#include <linux/device.h>
+#include <linux/file.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/pci.h>
+#include <linux/sizes.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vfio_pci_core.h>
+#include <linux/qat/qat_mig_dev.h>
+
+struct qat_vf_migration_file {
+	struct file *filp;
+	/* protects migration region context */
+	struct mutex lock;
+	bool disabled;
+	struct qat_vf_core_device *qat_vdev;
+	ssize_t filled_size;
+};
+
+struct qat_vf_core_device {
+	struct vfio_pci_core_device core_device;
+	struct qat_mig_dev *mdev;
+	/* protects migration state */
+	struct mutex state_mutex;
+	enum vfio_device_mig_state mig_state;
+	struct qat_vf_migration_file *resuming_migf;
+	struct qat_vf_migration_file *saving_migf;
+};
+
+static int qat_vf_pci_open_device(struct vfio_device *core_vdev)
+{
+	struct qat_vf_core_device *qat_vdev =
+		container_of(core_vdev, struct qat_vf_core_device,
+			     core_device.vdev);
+	struct vfio_pci_core_device *vdev = &qat_vdev->core_device;
+	int ret;
+
+	ret = vfio_pci_core_enable(vdev);
+	if (ret)
+		return ret;
+
+	ret = qat_vfmig_open(qat_vdev->mdev);
+	if (ret) {
+		vfio_pci_core_disable(vdev);
+		return ret;
+	}
+	qat_vdev->mig_state = VFIO_DEVICE_STATE_RUNNING;
+
+	vfio_pci_core_finish_enable(vdev);
+
+	return 0;
+}
+
+static void qat_vf_disable_fd(struct qat_vf_migration_file *migf)
+{
+	mutex_lock(&migf->lock);
+	migf->disabled = true;
+	migf->filp->f_pos = 0;
+	migf->filled_size = 0;
+	mutex_unlock(&migf->lock);
+}
+
+static void qat_vf_disable_fds(struct qat_vf_core_device *qat_vdev)
+{
+	if (qat_vdev->resuming_migf) {
+		qat_vf_disable_fd(qat_vdev->resuming_migf);
+		fput(qat_vdev->resuming_migf->filp);
+		qat_vdev->resuming_migf = NULL;
+	}
+
+	if (qat_vdev->saving_migf) {
+		qat_vf_disable_fd(qat_vdev->saving_migf);
+		fput(qat_vdev->saving_migf->filp);
+		qat_vdev->saving_migf = NULL;
+	}
+}
+
+static void qat_vf_pci_close_device(struct vfio_device *core_vdev)
+{
+	struct qat_vf_core_device *qat_vdev = container_of(core_vdev,
+			struct qat_vf_core_device, core_device.vdev);
+
+	qat_vfmig_close(qat_vdev->mdev);
+	qat_vf_disable_fds(qat_vdev);
+	vfio_pci_core_close_device(core_vdev);
+}
+
+static long qat_vf_precopy_ioctl(struct file *filp,
+				 unsigned int cmd, unsigned long arg)
+{
+	struct qat_vf_migration_file *migf = filp->private_data;
+	struct qat_vf_core_device *qat_vdev = migf->qat_vdev;
+	struct qat_mig_dev *mig_dev = qat_vdev->mdev;
+	struct vfio_precopy_info info;
+	loff_t *pos = &filp->f_pos;
+	unsigned long minsz;
+	int ret;
+
+	if (cmd != VFIO_MIG_GET_PRECOPY_INFO)
+		return -ENOTTY;
+
+	minsz = offsetofend(struct vfio_precopy_info, dirty_bytes);
+
+	if (copy_from_user(&info, (void __user *)arg, minsz))
+		return -EFAULT;
+	if (info.argsz < minsz)
+		return -EINVAL;
+
+	mutex_lock(&qat_vdev->state_mutex);
+	if (qat_vdev->mig_state != VFIO_DEVICE_STATE_PRE_COPY) {
+		mutex_unlock(&qat_vdev->state_mutex);
+		return -EINVAL;
+	}
+
+	mutex_lock(&migf->lock);
+	if (migf->disabled) {
+		ret = -ENODEV;
+		goto err;
+	}
+
+	if (*pos > mig_dev->setup_size) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	info.dirty_bytes = 0;
+	info.initial_bytes = mig_dev->setup_size - *pos;
+	mutex_unlock(&migf->lock);
+	mutex_unlock(&qat_vdev->state_mutex);
+
+	return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
+
+err:
+	mutex_unlock(&migf->lock);
+	mutex_unlock(&qat_vdev->state_mutex);
+	return ret;
+}
+
+static ssize_t qat_vf_save_read(struct file *filp, char __user *buf,
+				size_t len, loff_t *pos)
+{
+	struct qat_vf_migration_file *migf = filp->private_data;
+	struct qat_mig_dev *mig_dev = migf->qat_vdev->mdev;
+	ssize_t done = 0;
+	loff_t *offs;
+	int ret;
+
+	if (pos)
+		return -ESPIPE;
+	offs = &filp->f_pos;
+
+	mutex_lock(&migf->lock);
+	if (*offs > migf->filled_size || *offs < 0) {
+		done = -EINVAL;
+		goto out_unlock;
+	}
+
+	if (migf->disabled) {
+		done = -ENODEV;
+		goto out_unlock;
+	}
+
+	len = min_t(size_t, migf->filled_size - *offs, len);
+	if (len) {
+		ret = copy_to_user(buf, mig_dev->state + *offs, len);
+		if (ret) {
+			done = -EFAULT;
+			goto out_unlock;
+		}
+		*offs += len;
+		done = len;
+	}
+
+out_unlock:
+	mutex_unlock(&migf->lock);
+	return done;
+}
+
+static int qat_vf_release_file(struct inode *inode, struct file *filp)
+{
+	struct qat_vf_migration_file *migf = filp->private_data;
+
+	qat_vf_disable_fd(migf);
+	mutex_destroy(&migf->lock);
+	kfree(migf);
+
+	return 0;
+}
+
+static const struct file_operations qat_vf_save_fops = {
+	.owner = THIS_MODULE,
+	.read = qat_vf_save_read,
+	.unlocked_ioctl = qat_vf_precopy_ioctl,
+	.compat_ioctl = compat_ptr_ioctl,
+	.release = qat_vf_release_file,
+	.llseek = no_llseek,
+};
+
+static int qat_vf_save_state(struct qat_vf_core_device *qat_vdev,
+			     struct qat_vf_migration_file *migf)
+{
+	int ret;
+
+	ret = qat_vfmig_save_state(qat_vdev->mdev);
+	if (ret)
+		return ret;
+	migf->filled_size = qat_vdev->mdev->state_size;
+
+	return 0;
+}
+
+static int qat_vf_save_setup(struct qat_vf_core_device *qat_vdev,
+			     struct qat_vf_migration_file *migf)
+{
+	int ret;
+
+	ret = qat_vfmig_save_setup(qat_vdev->mdev);
+	if (ret)
+		return ret;
+	migf->filled_size = qat_vdev->mdev->setup_size;
+
+	return 0;
+}
+
+static struct qat_vf_migration_file *
+qat_vf_save_device_data(struct qat_vf_core_device *qat_vdev, bool pre_copy)
+{
+	struct qat_vf_migration_file *migf;
+	int ret;
+
+	migf = kzalloc(sizeof(*migf), GFP_KERNEL);
+	if (!migf)
+		return ERR_PTR(-ENOMEM);
+
+	migf->filp = anon_inode_getfile("qat_vf_mig", &qat_vf_save_fops,
+					migf, O_RDONLY);
+	ret = PTR_ERR_OR_ZERO(migf->filp);
+	if (ret) {
+		kfree(migf);
+		return ERR_PTR(ret);
+	}
+
+	stream_open(migf->filp->f_inode, migf->filp);
+	mutex_init(&migf->lock);
+
+	if (pre_copy)
+		ret = qat_vf_save_setup(qat_vdev, migf);
+	else
+		ret = qat_vf_save_state(qat_vdev, migf);
+	if (ret) {
+		fput(migf->filp);
+		return ERR_PTR(ret);
+	}
+
+	migf->qat_vdev = qat_vdev;
+
+	return migf;
+}
+
+static ssize_t qat_vf_resume_write(struct file *filp, const char __user *buf,
+				   size_t len, loff_t *pos)
+{
+	struct qat_vf_migration_file *migf = filp->private_data;
+	struct qat_mig_dev *mig_dev = migf->qat_vdev->mdev;
+	loff_t end, *offs;
+	ssize_t done = 0;
+	int ret;
+
+	if (pos)
+		return -ESPIPE;
+	offs = &filp->f_pos;
+
+	if (*offs < 0 ||
+	    check_add_overflow((loff_t)len, *offs, &end))
+		return -EOVERFLOW;
+
+	if (end > mig_dev->state_size)
+		return -ENOMEM;
+
+	mutex_lock(&migf->lock);
+	if (migf->disabled) {
+		done = -ENODEV;
+		goto out_unlock;
+	}
+
+	ret = copy_from_user(mig_dev->state + *offs, buf, len);
+	if (ret) {
+		done = -EFAULT;
+		goto out_unlock;
+	}
+	*offs += len;
+	migf->filled_size += len;
+
+	ret = qat_vfmig_load_setup(mig_dev, migf->filled_size);
+	if (ret && ret != -EAGAIN) {
+		done = ret;
+		goto out_unlock;
+	}
+	done = len;
+
+out_unlock:
+	mutex_unlock(&migf->lock);
+	return done;
+}
+
+static const struct file_operations qat_vf_resume_fops = {
+	.owner = THIS_MODULE,
+	.write = qat_vf_resume_write,
+	.release = qat_vf_release_file,
+	.llseek = no_llseek,
+};
+
+static struct qat_vf_migration_file *
+qat_vf_resume_device_data(struct qat_vf_core_device *qat_vdev)
+{
+	struct qat_vf_migration_file *migf;
+	int ret;
+
+	migf = kzalloc(sizeof(*migf), GFP_KERNEL);
+	if (!migf)
+		return ERR_PTR(-ENOMEM);
+
+	migf->filp = anon_inode_getfile("qat_vf_mig", &qat_vf_resume_fops, migf, O_WRONLY);
+	ret = PTR_ERR_OR_ZERO(migf->filp);
+	if (ret) {
+		kfree(migf);
+		return ERR_PTR(ret);
+	}
+
+	migf->qat_vdev = qat_vdev;
+	migf->filled_size = 0;
+	stream_open(migf->filp->f_inode, migf->filp);
+	mutex_init(&migf->lock);
+
+	return migf;
+}
+
+static int qat_vf_load_device_data(struct qat_vf_core_device *qat_vdev)
+{
+	return qat_vfmig_load_state(qat_vdev->mdev);
+}
+
+static struct file *qat_vf_pci_step_device_state(struct qat_vf_core_device *qat_vdev, u32 new)
+{
+	u32 cur = qat_vdev->mig_state;
+	int ret;
+
+	if ((cur == VFIO_DEVICE_STATE_RUNNING && new == VFIO_DEVICE_STATE_RUNNING_P2P) ||
+	    (cur == VFIO_DEVICE_STATE_PRE_COPY && new == VFIO_DEVICE_STATE_PRE_COPY_P2P)) {
+		ret = qat_vfmig_suspend(qat_vdev->mdev);
+		if (ret)
+			return ERR_PTR(ret);
+		return NULL;
+	}
+
+	if ((cur == VFIO_DEVICE_STATE_RUNNING_P2P && new == VFIO_DEVICE_STATE_STOP) ||
+	    (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_RUNNING_P2P))
+		return NULL;
+
+	if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_STOP_COPY) {
+		struct qat_vf_migration_file *migf;
+
+		migf = qat_vf_save_device_data(qat_vdev, false);
+		if (IS_ERR(migf))
+			return ERR_CAST(migf);
+		get_file(migf->filp);
+		qat_vdev->saving_migf = migf;
+		return migf->filp;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_PRE_COPY_P2P && new == VFIO_DEVICE_STATE_STOP_COPY) {
+		struct qat_vf_migration_file *migf = qat_vdev->saving_migf;
+
+		if (!migf)
+			return ERR_PTR(-EINVAL);
+		ret = qat_vf_save_state(qat_vdev, migf);
+		if (ret)
+			return ERR_PTR(ret);
+		return NULL;
+	}
+
+	if ((cur == VFIO_DEVICE_STATE_STOP_COPY && new == VFIO_DEVICE_STATE_STOP) ||
+	    (cur == VFIO_DEVICE_STATE_PRE_COPY && new == VFIO_DEVICE_STATE_RUNNING) ||
+	    (cur == VFIO_DEVICE_STATE_PRE_COPY_P2P && new == VFIO_DEVICE_STATE_RUNNING_P2P)) {
+		qat_vf_disable_fds(qat_vdev);
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_RESUMING) {
+		struct qat_vf_migration_file *migf;
+
+		migf = qat_vf_resume_device_data(qat_vdev);
+		if (IS_ERR(migf))
+			return ERR_CAST(migf);
+		get_file(migf->filp);
+		qat_vdev->resuming_migf = migf;
+		return migf->filp;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_RESUMING && new == VFIO_DEVICE_STATE_STOP) {
+		ret = qat_vf_load_device_data(qat_vdev);
+		if (ret)
+			return ERR_PTR(ret);
+
+		qat_vf_disable_fds(qat_vdev);
+		return NULL;
+	}
+
+	if ((cur == VFIO_DEVICE_STATE_RUNNING && new == VFIO_DEVICE_STATE_PRE_COPY) ||
+	    (cur == VFIO_DEVICE_STATE_RUNNING_P2P && new == VFIO_DEVICE_STATE_PRE_COPY_P2P)) {
+		struct qat_vf_migration_file *migf;
+
+		migf = qat_vf_save_device_data(qat_vdev, true);
+		if (IS_ERR(migf))
+			return ERR_CAST(migf);
+		get_file(migf->filp);
+		qat_vdev->saving_migf = migf;
+		return migf->filp;
+	}
+
+	if ((cur == VFIO_DEVICE_STATE_RUNNING_P2P && new == VFIO_DEVICE_STATE_RUNNING) ||
+	    (cur == VFIO_DEVICE_STATE_PRE_COPY_P2P && new == VFIO_DEVICE_STATE_PRE_COPY)) {
+		qat_vfmig_resume(qat_vdev->mdev);
+		return NULL;
+	}
+
+	/* vfio_mig_get_next_state() does not use arcs other than the above */
+	WARN_ON(true);
+	return ERR_PTR(-EINVAL);
+}
+
+static void qat_vf_reset_done(struct qat_vf_core_device *qat_vdev)
+{
+	qat_vdev->mig_state = VFIO_DEVICE_STATE_RUNNING;
+	qat_vfmig_reset(qat_vdev->mdev);
+	qat_vf_disable_fds(qat_vdev);
+}
+
+static struct file *qat_vf_pci_set_device_state(struct vfio_device *vdev,
+						enum vfio_device_mig_state new_state)
+{
+	struct qat_vf_core_device *qat_vdev = container_of(vdev,
+			struct qat_vf_core_device, core_device.vdev);
+	enum vfio_device_mig_state next_state;
+	struct file *res = NULL;
+	int ret;
+
+	mutex_lock(&qat_vdev->state_mutex);
+	while (new_state != qat_vdev->mig_state) {
+		ret = vfio_mig_get_next_state(vdev, qat_vdev->mig_state,
+					      new_state, &next_state);
+		if (ret) {
+			res = ERR_PTR(ret);
+			break;
+		}
+		res = qat_vf_pci_step_device_state(qat_vdev, next_state);
+		if (IS_ERR(res))
+			break;
+		qat_vdev->mig_state = next_state;
+		if (WARN_ON(res && new_state != qat_vdev->mig_state)) {
+			fput(res);
+			res = ERR_PTR(-EINVAL);
+			break;
+		}
+	}
+	mutex_unlock(&qat_vdev->state_mutex);
+
+	return res;
+}
+
+static int qat_vf_pci_get_device_state(struct vfio_device *vdev,
+				       enum vfio_device_mig_state *curr_state)
+{
+	struct qat_vf_core_device *qat_vdev = container_of(vdev,
+			struct qat_vf_core_device, core_device.vdev);
+
+	mutex_lock(&qat_vdev->state_mutex);
+	*curr_state = qat_vdev->mig_state;
+	mutex_unlock(&qat_vdev->state_mutex);
+
+	return 0;
+}
+
+static int qat_vf_pci_get_data_size(struct vfio_device *vdev,
+				    unsigned long *stop_copy_length)
+{
+	struct qat_vf_core_device *qat_vdev = container_of(vdev,
+			struct qat_vf_core_device, core_device.vdev);
+
+	mutex_lock(&qat_vdev->state_mutex);
+	*stop_copy_length = qat_vdev->mdev->state_size;
+	mutex_unlock(&qat_vdev->state_mutex);
+
+	return 0;
+}
+
+static const struct vfio_migration_ops qat_vf_pci_mig_ops = {
+	.migration_set_state = qat_vf_pci_set_device_state,
+	.migration_get_state = qat_vf_pci_get_device_state,
+	.migration_get_data_size = qat_vf_pci_get_data_size,
+};
+
+static void qat_vf_pci_release_dev(struct vfio_device *core_vdev)
+{
+	struct qat_vf_core_device *qat_vdev = container_of(core_vdev,
+			struct qat_vf_core_device, core_device.vdev);
+
+	qat_vfmig_cleanup(qat_vdev->mdev);
+	qat_vfmig_destroy(qat_vdev->mdev);
+	mutex_destroy(&qat_vdev->state_mutex);
+	vfio_pci_core_release_dev(core_vdev);
+}
+
+static int qat_vf_pci_init_dev(struct vfio_device *core_vdev)
+{
+	struct qat_vf_core_device *qat_vdev = container_of(core_vdev,
+			struct qat_vf_core_device, core_device.vdev);
+	struct qat_mig_dev *mdev;
+	struct pci_dev *parent;
+	int ret, vf_id;
+
+	core_vdev->migration_flags = VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P |
+				     VFIO_MIGRATION_PRE_COPY;
+	core_vdev->mig_ops = &qat_vf_pci_mig_ops;
+
+	ret = vfio_pci_core_init_dev(core_vdev);
+	if (ret)
+		return ret;
+
+	mutex_init(&qat_vdev->state_mutex);
+
+	parent = pci_physfn(qat_vdev->core_device.pdev);
+	vf_id = pci_iov_vf_id(qat_vdev->core_device.pdev);
+	if (vf_id < 0) {
+		ret = -ENODEV;
+		goto err_rel;
+	}
+
+	mdev = qat_vfmig_create(parent, vf_id);
+	if (IS_ERR(mdev)) {
+		ret = PTR_ERR(mdev);
+		goto err_rel;
+	}
+
+	ret = qat_vfmig_init(mdev);
+	if (ret)
+		goto err_destroy;
+
+	qat_vdev->mdev = mdev;
+
+	return 0;
+
+err_destroy:
+	qat_vfmig_destroy(mdev);
+err_rel:
+	vfio_pci_core_release_dev(core_vdev);
+	return ret;
+}
+
+static const struct vfio_device_ops qat_vf_pci_ops = {
+	.name = "qat-vf-vfio-pci",
+	.init = qat_vf_pci_init_dev,
+	.release = qat_vf_pci_release_dev,
+	.open_device = qat_vf_pci_open_device,
+	.close_device = qat_vf_pci_close_device,
+	.ioctl = vfio_pci_core_ioctl,
+	.read = vfio_pci_core_read,
+	.write = vfio_pci_core_write,
+	.mmap = vfio_pci_core_mmap,
+	.request = vfio_pci_core_request,
+	.match = vfio_pci_core_match,
+	.bind_iommufd = vfio_iommufd_physical_bind,
+	.unbind_iommufd = vfio_iommufd_physical_unbind,
+	.attach_ioas = vfio_iommufd_physical_attach_ioas,
+	.detach_ioas = vfio_iommufd_physical_detach_ioas,
+};
+
+static struct qat_vf_core_device *qat_vf_drvdata(struct pci_dev *pdev)
+{
+	struct vfio_pci_core_device *core_device = pci_get_drvdata(pdev);
+
+	return container_of(core_device, struct qat_vf_core_device, core_device);
+}
+
+static void qat_vf_pci_aer_reset_done(struct pci_dev *pdev)
+{
+	struct qat_vf_core_device *qat_vdev = qat_vf_drvdata(pdev);
+
+	if (!qat_vdev->mdev)
+		return;
+
+	mutex_lock(&qat_vdev->state_mutex);
+	qat_vf_reset_done(qat_vdev);
+	mutex_unlock(&qat_vdev->state_mutex);
+}
+
+static int
+qat_vf_vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+	struct device *dev = &pdev->dev;
+	struct qat_vf_core_device *qat_vdev;
+	int ret;
+
+	qat_vdev = vfio_alloc_device(qat_vf_core_device, core_device.vdev, dev, &qat_vf_pci_ops);
+	if (IS_ERR(qat_vdev))
+		return PTR_ERR(qat_vdev);
+
+	pci_set_drvdata(pdev, &qat_vdev->core_device);
+	ret = vfio_pci_core_register_device(&qat_vdev->core_device);
+	if (ret)
+		goto out_put_device;
+
+	return 0;
+
+out_put_device:
+	vfio_put_device(&qat_vdev->core_device.vdev);
+	return ret;
+}
+
+static void qat_vf_vfio_pci_remove(struct pci_dev *pdev)
+{
+	struct qat_vf_core_device *qat_vdev = qat_vf_drvdata(pdev);
+
+	vfio_pci_core_unregister_device(&qat_vdev->core_device);
+	vfio_put_device(&qat_vdev->core_device.vdev);
+}
+
+static const struct pci_device_id qat_vf_vfio_pci_table[] = {
+	/* Intel QAT GEN4 4xxx VF device */
+	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_INTEL, 0x4941) },
+	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_INTEL, 0x4943) },
+	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_INTEL, 0x4945) },
+	{}
+};
+MODULE_DEVICE_TABLE(pci, qat_vf_vfio_pci_table);
+
+static const struct pci_error_handlers qat_vf_err_handlers = {
+	.reset_done = qat_vf_pci_aer_reset_done,
+	.error_detected = vfio_pci_core_aer_err_detected,
+};
+
+static struct pci_driver qat_vf_vfio_pci_driver = {
+	.name = "qat_vfio_pci",
+	.id_table = qat_vf_vfio_pci_table,
+	.probe = qat_vf_vfio_pci_probe,
+	.remove = qat_vf_vfio_pci_remove,
+	.err_handler = &qat_vf_err_handlers,
+	.driver_managed_dma = true,
+};
+module_pci_driver(qat_vf_vfio_pci_driver);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Intel Corporation");
+MODULE_DESCRIPTION("QAT VFIO PCI - VFIO PCI driver with live migration support for Intel(R) QAT GEN4 device family");
+MODULE_IMPORT_NS(CRYPTO_QAT);